* [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging
@ 2018-10-07 23:19 NeilBrown
  2018-10-07 23:19 ` [lustre-devel] [PATCH 08/24] lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer() NeilBrown
                   ` (24 more replies)
  0 siblings, 25 replies; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

This is a port of the "Dynamic Discovery" series
(756abb9cf00b936b3..1c45d9051764e0637ba90b3)
to my mainline-linux-with-lustre tree.
It is all fairly straightforward, but I don't think I have the
hardware to test it properly.  And review never hurts.

This is all in my lustre-testing branch.

Thanks,
NeilBrown

---

Amir Shehata (2):
      lustre: lnet: add enhanced statistics
      lustre: lnet: show peer state

John L. Hammond (1):
      lustre: lnet: balance references in lnet_discover_peer_locked()

Olaf Weber (20):
      lustre: lnet: add lnet_interfaces_max tunable
      lustre: lnet: configure lnet_interfaces_max tunable from dlc
      lustre: lnet: add struct lnet_ping_buffer
      lustre: lnet: automatic sizing of router pinger buffers
      lustre: lnet: add Multi-Rail and Discovery ping feature bits
      lustre: lnet: add sanity checks on ping-related constants
      lustre: lnet: cleanup of lnet_peer_ni_addref/decref_locked()
      lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer()
      lustre: lnet: refactor lnet_del_peer_ni()
      lustre: lnet: refactor lnet_add_peer_ni()
      lustre: lnet: introduce LNET_PEER_MULTI_RAIL flag bit
      lustre: lnet: preferred NIs for non-Multi-Rail peers
      lustre: lnet: add LNET_PEER_CONFIGURED flag
      lustre: lnet: reference counts on lnet_peer/lnet_peer_net
      lustre: lnet: add msg_type to lnet_event
      lustre: lnet: add discovery thread
      lustre: lnet: add the Push target
      lustre: lnet: implement Peer Discovery
      lustre: lnet: add "lnetctl peer list"
      lustre: lnet: add "lnetctl ping" command

Sonia Sharma (1):
      lustre: lnet: add "lnetctl discover"


 .../staging/lustre/include/linux/lnet/lib-lnet.h   |  156 +
 .../staging/lustre/include/linux/lnet/lib-types.h  |  258 ++
 .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    8 
 .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |   10 
 .../lustre/include/uapi/linux/lnet/lnet-types.h    |   42 
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |    2 
 .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |    2 
 .../staging/lustre/lnet/klnds/socklnd/socklnd.c    |   22 
 .../staging/lustre/lnet/klnds/socklnd/socklnd.h    |    4 
 .../staging/lustre/lnet/klnds/socklnd/socklnd_cb.c |    2 
 .../lustre/lnet/klnds/socklnd/socklnd_modparams.c  |    2 
 .../lustre/lnet/klnds/socklnd/socklnd_proto.c      |    4 
 drivers/staging/lustre/lnet/lnet/api-ni.c          |  907 +++++-
 drivers/staging/lustre/lnet/lnet/config.c          |   10 
 drivers/staging/lustre/lnet/lnet/lib-move.c        |  242 +-
 drivers/staging/lustre/lnet/lnet/lib-msg.c         |   17 
 drivers/staging/lustre/lnet/lnet/net_fault.c       |    3 
 drivers/staging/lustre/lnet/lnet/peer.c            | 3002 +++++++++++++++++---
 drivers/staging/lustre/lnet/lnet/router.c          |  174 +
 19 files changed, 4056 insertions(+), 811 deletions(-)

--
Signature

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 01/24] lustre: lnet: add lnet_interfaces_max tunable
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (5 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 10/24] lustre: lnet: refactor lnet_add_peer_ni() NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:08   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 05/24] lustre: lnet: add Multi-Rail and Discovery ping feature bits NeilBrown
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add an lnet_interfaces_max tunable that describes the maximum
number of interfaces per node. This tunable is primarily useful for
sanity checks prior to allocating memory.

Allow lnet_interfaces_max to be set and read via the sysfs interface.

Add LNET_INTERFACES_MIN, value 16, as the minimum value.

Add LNET_INTERFACES_MAX_DEFAULT, value 200, as the default value. This
value was chosen to ensure that the size of an LNet ping message with
any associated LND overhead would fit in 4096 bytes.

(The LNET_INTERFACES_MAX name was not reused to allow for the early
detection of issues when merging code that uses it.)

Rename LNET_NUM_INTERFACES to LNET_INTERFACES_NUM.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Amir Shehata <amir.shehata@intel.com>
Reviewed-on: https://review.whamcloud.com/25770
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-types.h  |    2 +
 .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |    4 +--
 .../lustre/include/uapi/linux/lnet/lnet-types.h    |    7 ++++
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |    2 +
 .../staging/lustre/lnet/klnds/socklnd/socklnd.c    |   22 +++++++-------
 .../staging/lustre/lnet/klnds/socklnd/socklnd.h    |    4 +--
 .../staging/lustre/lnet/klnds/socklnd/socklnd_cb.c |    2 +
 .../lustre/lnet/klnds/socklnd/socklnd_proto.c      |    4 +--
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   32 +++++++++++++++++++-
 drivers/staging/lustre/lnet/lnet/config.c          |   10 +++---
 10 files changed, 62 insertions(+), 27 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 7219a7bacf6e..7b11c31f0029 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -371,7 +371,7 @@ struct lnet_ni {
 	 * equivalent interfaces to use
 	 * This is an array because socklnd bonding can still be configured
 	 */
-	char			 *ni_interfaces[LNET_NUM_INTERFACES];
+	char			 *ni_interfaces[LNET_INTERFACES_NUM];
 	/* original net namespace */
 	struct net		 *ni_net_ns;
 };
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
index 8f03aa3c5676..d88b30d2e76c 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
@@ -81,7 +81,7 @@ struct lnet_ioctl_config_lnd_tunables {
 };
 
 struct lnet_ioctl_net_config {
-	char ni_interfaces[LNET_NUM_INTERFACES][LNET_MAX_STR_LEN];
+	char ni_interfaces[LNET_INTERFACES_NUM][LNET_MAX_STR_LEN];
 	__u32 ni_status;
 	__u32 ni_cpts[LNET_MAX_SHOW_NUM_CPT];
 	char cfg_bulk[0];
@@ -184,7 +184,7 @@ struct lnet_ioctl_element_msg_stats {
 struct lnet_ioctl_config_ni {
 	struct libcfs_ioctl_hdr lic_cfg_hdr;
 	lnet_nid_t		lic_nid;
-	char			lic_ni_intf[LNET_NUM_INTERFACES][LNET_MAX_STR_LEN];
+	char			lic_ni_intf[LNET_INTERFACES_NUM][LNET_MAX_STR_LEN];
 	char			lic_legacy_ip2nets[LNET_MAX_STR_LEN];
 	__u32			lic_cpts[LNET_MAX_SHOW_NUM_CPT];
 	__u32			lic_ncpts;
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
index f8a873bab135..6ee60d07ff84 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
@@ -264,7 +264,12 @@ struct lnet_counters {
 #define LNET_NI_STATUS_DOWN    0xdeadface
 #define LNET_NI_STATUS_INVALID 0x00000000
 
-#define LNET_NUM_INTERFACES    16
+#define LNET_INTERFACES_NUM		16
+
+/* The minimum number of interfaces per node supported by LNet. */
+#define LNET_INTERFACES_MIN		16
+/* The default - arbitrary - value of the lnet_max_interfaces tunable. */
+#define LNET_INTERFACES_MAX_DEFAULT	200
 
 /**
  * Objects maintained by the LNet are accessed through handles. Handle types
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index c20766379323..bf969b3891a9 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -2915,7 +2915,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
 	if (ni->ni_interfaces[0]) {
 		/* Use the IPoIB interface specified in 'networks=' */
 
-		BUILD_BUG_ON(LNET_NUM_INTERFACES <= 1);
+		BUILD_BUG_ON(LNET_INTERFACES_NUM <= 1);
 		if (ni->ni_interfaces[1]) {
 			CERROR("Multiple interfaces not supported\n");
 			goto failed;
diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
index b2f0148d0087..ff8d73295fff 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
@@ -53,7 +53,7 @@ ksocknal_ip2iface(struct lnet_ni *ni, __u32 ip)
 	struct ksock_interface *iface;
 
 	for (i = 0; i < net->ksnn_ninterfaces; i++) {
-		LASSERT(i < LNET_NUM_INTERFACES);
+		LASSERT(i < LNET_INTERFACES_NUM);
 		iface = &net->ksnn_interfaces[i];
 
 		if (iface->ksni_ipaddr == ip)
@@ -221,7 +221,7 @@ ksocknal_unlink_peer_locked(struct ksock_peer *peer_ni)
 	struct ksock_interface *iface;
 
 	for (i = 0; i < peer_ni->ksnp_n_passive_ips; i++) {
-		LASSERT(i < LNET_NUM_INTERFACES);
+		LASSERT(i < LNET_INTERFACES_NUM);
 		ip = peer_ni->ksnp_passive_ips[i];
 
 		iface = ksocknal_ip2iface(peer_ni->ksnp_ni, ip);
@@ -689,7 +689,7 @@ ksocknal_local_ipvec(struct lnet_ni *ni, __u32 *ipaddrs)
 	read_lock(&ksocknal_data.ksnd_global_lock);
 
 	nip = net->ksnn_ninterfaces;
-	LASSERT(nip <= LNET_NUM_INTERFACES);
+	LASSERT(nip <= LNET_INTERFACES_NUM);
 
 	/*
 	 * Only offer interfaces for additional connections if I have
@@ -770,8 +770,8 @@ ksocknal_select_ips(struct ksock_peer *peer_ni, __u32 *peerips, int n_peerips)
 	 */
 	write_lock_bh(global_lock);
 
-	LASSERT(n_peerips <= LNET_NUM_INTERFACES);
-	LASSERT(net->ksnn_ninterfaces <= LNET_NUM_INTERFACES);
+	LASSERT(n_peerips <= LNET_INTERFACES_NUM);
+	LASSERT(net->ksnn_ninterfaces <= LNET_INTERFACES_NUM);
 
 	/*
 	 * Only match interfaces for additional connections
@@ -890,7 +890,7 @@ ksocknal_create_routes(struct ksock_peer *peer_ni, int port,
 		return;
 	}
 
-	LASSERT(npeer_ipaddrs <= LNET_NUM_INTERFACES);
+	LASSERT(npeer_ipaddrs <= LNET_INTERFACES_NUM);
 
 	for (i = 0; i < npeer_ipaddrs; i++) {
 		if (newroute) {
@@ -919,7 +919,7 @@ ksocknal_create_routes(struct ksock_peer *peer_ni, int port,
 		best_nroutes = 0;
 		best_netmatch = 0;
 
-		LASSERT(net->ksnn_ninterfaces <= LNET_NUM_INTERFACES);
+		LASSERT(net->ksnn_ninterfaces <= LNET_INTERFACES_NUM);
 
 		/* Select interface to connect from */
 		for (j = 0; j < net->ksnn_ninterfaces; j++) {
@@ -1060,7 +1060,7 @@ ksocknal_create_conn(struct lnet_ni *ni, struct ksock_route *route,
 	atomic_set(&conn->ksnc_tx_nob, 0);
 
 	hello = kvzalloc(offsetof(struct ksock_hello_msg,
-				  kshm_ips[LNET_NUM_INTERFACES]),
+				  kshm_ips[LNET_INTERFACES_NUM]),
 			 GFP_KERNEL);
 	if (!hello) {
 		rc = -ENOMEM;
@@ -1983,7 +1983,7 @@ ksocknal_add_interface(struct lnet_ni *ni, __u32 ipaddress, __u32 netmask)
 	if (iface) {
 		/* silently ignore dups */
 		rc = 0;
-	} else if (net->ksnn_ninterfaces == LNET_NUM_INTERFACES) {
+	} else if (net->ksnn_ninterfaces == LNET_INTERFACES_NUM) {
 		rc = -ENOSPC;
 	} else {
 		iface = &net->ksnn_interfaces[net->ksnn_ninterfaces++];
@@ -2624,7 +2624,7 @@ ksocknal_enumerate_interfaces(struct ksock_net *net, char *iname)
 			continue;
 		}
 
-		if (j == LNET_NUM_INTERFACES) {
+		if (j == LNET_INTERFACES_NUM) {
 			CWARN("Ignoring interface %s (too many interfaces)\n",
 			      name);
 			continue;
@@ -2812,7 +2812,7 @@ ksocknal_startup(struct lnet_ni *ni)
 
 		net->ksnn_ninterfaces = rc;
 	} else {
-		for (i = 0; i < LNET_NUM_INTERFACES; i++) {
+		for (i = 0; i < LNET_INTERFACES_NUM; i++) {
 			if (!ni->ni_interfaces[i])
 				break;
 			rc = ksocknal_enumerate_interfaces(net, ni->ni_interfaces[i]);
diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.h b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.h
index 82e3523f6463..297d1e5af1bd 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.h
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.h
@@ -173,7 +173,7 @@ struct ksock_net {
 	int		  ksnn_npeers;		/* # peers */
 	int		  ksnn_shutdown;	/* shutting down? */
 	int		  ksnn_ninterfaces;	/* IP interfaces */
-	struct ksock_interface ksnn_interfaces[LNET_NUM_INTERFACES];
+	struct ksock_interface ksnn_interfaces[LNET_INTERFACES_NUM];
 };
 
 /** connd timeout */
@@ -441,7 +441,7 @@ struct ksock_peer {
 	int                ksnp_n_passive_ips;  /* # of... */
 
 	/* preferred local interfaces */
-	u32		   ksnp_passive_ips[LNET_NUM_INTERFACES];
+	u32              ksnp_passive_ips[LNET_INTERFACES_NUM];
 };
 
 struct ksock_connreq {
diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c
index dc9a12910a8d..c401896bf649 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c
@@ -1579,7 +1579,7 @@ ksocknal_send_hello(struct lnet_ni *ni, struct ksock_conn *conn,
 	/* CAVEAT EMPTOR: this byte flips 'ipaddrs' */
 	struct ksock_net *net = (struct ksock_net *)ni->ni_data;
 
-	LASSERT(hello->kshm_nips <= LNET_NUM_INTERFACES);
+	LASSERT(hello->kshm_nips <= LNET_INTERFACES_NUM);
 
 	/* rely on caller to hold a ref on socket so it wouldn't disappear */
 	LASSERT(conn->ksnc_proto);
diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_proto.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_proto.c
index 10a2757895f3..54ec5d0a85c8 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_proto.c
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_proto.c
@@ -614,7 +614,7 @@ ksocknal_recv_hello_v1(struct ksock_conn *conn, struct ksock_hello_msg *hello,
 	hello->kshm_nips            = le32_to_cpu(hdr->payload_length) /
 						  sizeof(__u32);
 
-	if (hello->kshm_nips > LNET_NUM_INTERFACES) {
+	if (hello->kshm_nips > LNET_INTERFACES_NUM) {
 		CERROR("Bad nips %d from ip %pI4h\n",
 		       hello->kshm_nips, &conn->ksnc_ipaddr);
 		rc = -EPROTO;
@@ -684,7 +684,7 @@ ksocknal_recv_hello_v2(struct ksock_conn *conn, struct ksock_hello_msg *hello,
 		__swab32s(&hello->kshm_nips);
 	}
 
-	if (hello->kshm_nips > LNET_NUM_INTERFACES) {
+	if (hello->kshm_nips > LNET_INTERFACES_NUM) {
 		CERROR("Bad nips %d from ip %pI4h\n",
 		       hello->kshm_nips, &conn->ksnc_ipaddr);
 		return -EPROTO;
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index b37abdedccaa..6a692d5c4608 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -34,6 +34,7 @@
 #define DEBUG_SUBSYSTEM S_LNET
 #include <linux/log2.h>
 #include <linux/ktime.h>
+#include <linux/moduleparam.h>
 
 #include <linux/lnet/lib-lnet.h>
 #include <uapi/linux/lnet/lnet-dlc.h>
@@ -70,6 +71,13 @@ module_param(lnet_numa_range, uint, 0444);
 MODULE_PARM_DESC(lnet_numa_range,
 		 "NUMA range to consider during Multi-Rail selection");
 
+static int lnet_interfaces_max = LNET_INTERFACES_MAX_DEFAULT;
+static int intf_max_set(const char *val, const struct kernel_param *kp);
+module_param_call(lnet_interfaces_max, intf_max_set, param_get_int,
+		  &lnet_interfaces_max, 0644);
+MODULE_PARM_DESC(lnet_interfaces_max,
+		 "Maximum number of interfaces in a node.");
+
 /*
  * This sequence number keeps track of how many times DLC was used to
  * update the local NIs. It is incremented when a NI is added or
@@ -82,6 +90,28 @@ static atomic_t lnet_dlc_seq_no = ATOMIC_INIT(0);
 static int lnet_ping(struct lnet_process_id id, signed long timeout,
 		     struct lnet_process_id __user *ids, int n_ids);
 
+static int
+intf_max_set(const char *val, const struct kernel_param *kp)
+{
+	int value, rc;
+
+	rc = kstrtoint(val, 0, &value);
+	if (rc) {
+		CERROR("Invalid module parameter value for 'lnet_interfaces_max'\n");
+		return rc;
+	}
+
+	if (value < LNET_INTERFACES_MIN) {
+		CWARN("max interfaces provided are too small, setting to %d\n",
+		      LNET_INTERFACES_MIN);
+		value = LNET_INTERFACES_MIN;
+	}
+
+	*(int *)kp->arg = value;
+
+	return 0;
+}
+
 static char *
 lnet_get_routes(void)
 {
@@ -2924,7 +2954,7 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	infosz = offsetof(struct lnet_ping_info, pi_ni[n_ids]);
 
 	/* n_ids limit is arbitrary */
-	if (n_ids <= 0 || n_ids > 20 || id.nid == LNET_NID_ANY)
+	if (n_ids <= 0 || n_ids > lnet_interfaces_max || id.nid == LNET_NID_ANY)
 		return -EINVAL;
 
 	if (id.pid == LNET_PID_ANY)
diff --git a/drivers/staging/lustre/lnet/lnet/config.c b/drivers/staging/lustre/lnet/lnet/config.c
index 3ea56c81ec0e..087d9a8a6b6a 100644
--- a/drivers/staging/lustre/lnet/lnet/config.c
+++ b/drivers/staging/lustre/lnet/lnet/config.c
@@ -123,10 +123,10 @@ lnet_ni_unique_net(struct list_head *nilist, char *iface)
 /* check that the NI is unique to the interfaces with in the same NI.
  * This is only a consideration if use_tcp_bonding is set */
 static bool
-lnet_ni_unique_ni(char *iface_list[LNET_NUM_INTERFACES], char *iface)
+lnet_ni_unique_ni(char *iface_list[LNET_INTERFACES_NUM], char *iface)
 {
 	int i;
-	for (i = 0; i < LNET_NUM_INTERFACES; i++) {
+	for (i = 0; i < LNET_INTERFACES_NUM; i++) {
 		if (iface_list[i] &&
 		    strncmp(iface_list[i], iface, strlen(iface)) == 0)
 			return false;
@@ -304,7 +304,7 @@ lnet_ni_free(struct lnet_ni *ni)
 
 	kfree(ni->ni_cpts);
 
-	for (i = 0; i < LNET_NUM_INTERFACES && ni->ni_interfaces[i]; i++)
+	for (i = 0; i < LNET_INTERFACES_NUM && ni->ni_interfaces[i]; i++)
 		kfree(ni->ni_interfaces[i]);
 
 	/* release reference to net namespace */
@@ -397,11 +397,11 @@ lnet_ni_add_interface(struct lnet_ni *ni, char *iface)
 	 * can free the tokens at the end of the function.
 	 * The newly allocated ni_interfaces[] can be
 	 * freed when freeing the NI */
-	while (niface < LNET_NUM_INTERFACES &&
+	while (niface < LNET_INTERFACES_NUM &&
 	       ni->ni_interfaces[niface])
 		niface++;
 
-	if (niface >= LNET_NUM_INTERFACES) {
+	if (niface >= LNET_INTERFACES_NUM) {
 		LCONSOLE_ERROR_MSG(0x115, "Too many interfaces "
 				   "for net %s\n",
 				   libcfs_net2str(LNET_NIDNET(ni->ni_nid)));


* [lustre-devel] [PATCH 02/24] lustre: lnet: configure lnet_interfaces_max tunable from dlc
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
  2018-10-07 23:19 ` [lustre-devel] [PATCH 08/24] lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer() NeilBrown
  2018-10-07 23:19 ` [lustre-devel] [PATCH 09/24] lustre: lnet: refactor lnet_del_peer_ni() NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:10   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 06/24] lustre: lnet: add sanity checks on ping-related constants NeilBrown
                   ` (21 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add the ability to configure lnet_interfaces_max from DLC.
Combine the configuration and display of the NUMA range and the
maximum interface count under a single "global" YAML element when
configuring via YAML.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25771
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |    6 +++---
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   16 ++++++++--------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
index d88b30d2e76c..706892ca7efb 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
@@ -230,9 +230,9 @@ struct lnet_ioctl_peer_cfg {
 	void __user *prcfg_bulk;
 };
 
-struct lnet_ioctl_numa_range {
-	struct libcfs_ioctl_hdr nr_hdr;
-	__u32 nr_range;
+struct lnet_ioctl_set_value {
+	struct libcfs_ioctl_hdr sv_hdr;
+	__u32 sv_value;
 };
 
 struct lnet_ioctl_lnet_stats {
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 6a692d5c4608..8b6400da2836 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -2708,24 +2708,24 @@ LNetCtl(unsigned int cmd, void *arg)
 		return rc;
 
 	case IOC_LIBCFS_SET_NUMA_RANGE: {
-		struct lnet_ioctl_numa_range *numa;
+		struct lnet_ioctl_set_value *numa;
 
 		numa = arg;
-		if (numa->nr_hdr.ioc_len != sizeof(*numa))
+		if (numa->sv_hdr.ioc_len != sizeof(*numa))
 			return -EINVAL;
-		mutex_lock(&the_lnet.ln_api_mutex);
-		lnet_numa_range = numa->nr_range;
-		mutex_unlock(&the_lnet.ln_api_mutex);
+		lnet_net_lock(LNET_LOCK_EX);
+		lnet_numa_range = numa->sv_value;
+		lnet_net_unlock(LNET_LOCK_EX);
 		return 0;
 	}
 
 	case IOC_LIBCFS_GET_NUMA_RANGE: {
-		struct lnet_ioctl_numa_range *numa;
+		struct lnet_ioctl_set_value *numa;
 
 		numa = arg;
-		if (numa->nr_hdr.ioc_len != sizeof(*numa))
+		if (numa->sv_hdr.ioc_len != sizeof(*numa))
 			return -EINVAL;
-		numa->nr_range = lnet_numa_range;
+		numa->sv_value = lnet_numa_range;
 		return 0;
 	}
 


* [lustre-devel] [PATCH 03/24] lustre: lnet: add struct lnet_ping_buffer
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (7 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 05/24] lustre: lnet: add Multi-Rail and Discovery ping feature bits NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:29   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 07/24] lustre: lnet: cleanup of lnet_peer_ni_addref/decref_locked() NeilBrown
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

The Multi-Rail code will also use the ping target buffer as the
source of data to push to other nodes. This means that multiple
MDs will reference the same buffer, and care must be taken to
ensure that the buffer is not freed while any such reference
remains.

Encapsulate struct lnet_ping_info (aka lnet_ping_info_t) in a
struct lnet_ping_buffer. This adds a reference count and a record
of the number of NIDs for which the encapsulated lnet_ping_info
was sized.

When sizing the buffer, the constant LNET_PINGINFO_SIZE is replaced
with LNET_PING_INFO_SIZE(NNIS).

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25773
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |   22 +
 .../staging/lustre/include/linux/lnet/lib-types.h  |   40 ++
 drivers/staging/lustre/lnet/lnet/api-ni.c          |  345 +++++++++++---------
 drivers/staging/lustre/lnet/lnet/router.c          |   94 +++--
 4 files changed, 301 insertions(+), 200 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 16e64d83840d..2e2b5ed27116 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -634,7 +634,27 @@ int lnet_peer_buffer_credits(struct lnet_net *net);
 int lnet_router_checker_start(void);
 void lnet_router_checker_stop(void);
 void lnet_router_ni_update_locked(struct lnet_peer_ni *gw, __u32 net);
-void lnet_swap_pinginfo(struct lnet_ping_info *info);
+void lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf);
+
+int lnet_ping_info_validate(struct lnet_ping_info *pinfo);
+struct lnet_ping_buffer *lnet_ping_buffer_alloc(int nnis, gfp_t gfp);
+void lnet_ping_buffer_free(struct lnet_ping_buffer *pbuf);
+
+static inline void lnet_ping_buffer_addref(struct lnet_ping_buffer *pbuf)
+{
+	atomic_inc(&pbuf->pb_refcnt);
+}
+
+static inline void lnet_ping_buffer_decref(struct lnet_ping_buffer *pbuf)
+{
+	if (atomic_dec_and_test(&pbuf->pb_refcnt))
+		lnet_ping_buffer_free(pbuf);
+}
+
+static inline int lnet_ping_buffer_numref(struct lnet_ping_buffer *pbuf)
+{
+	return atomic_read(&pbuf->pb_refcnt);
+}
 
 int lnet_parse_ip2nets(char **networksp, char *ip2nets);
 int lnet_parse_routes(char *route_str, int *im_a_router);
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 7b11c31f0029..ab8c6d66cdbf 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -387,12 +387,32 @@ struct lnet_ni {
 #define LNET_PING_FEAT_NI_STATUS	BIT(1)	/* return NI status */
 #define LNET_PING_FEAT_RTE_DISABLED	BIT(2)	/* Routing enabled */
 
-#define LNET_PING_FEAT_MASK		(LNET_PING_FEAT_BASE | \
-					 LNET_PING_FEAT_NI_STATUS)
+#define LNET_PING_INFO_SIZE(NNIDS) \
+	offsetof(struct lnet_ping_info, pi_ni[NNIDS])
+#define LNET_PING_INFO_LONI(PINFO)	((PINFO)->pi_ni[0].ns_nid)
+#define LNET_PING_INFO_SEQNO(PINFO)	((PINFO)->pi_ni[0].ns_status)
+
+/*
+ * Descriptor of a ping info buffer: keep a separate indicator of the
+ * size and a reference count. The type is used both as a source and
+ * sink of data, so we need to keep some information outside of the
+ * area that may be overwritten by network data.
+ */
+struct lnet_ping_buffer {
+	int			pb_nnis;
+	atomic_t		pb_refcnt;
+	struct lnet_ping_info	pb_info;
+};
+
+#define LNET_PING_BUFFER_SIZE(NNIDS) \
+	offsetof(struct lnet_ping_buffer, pb_info.pi_ni[NNIDS])
+#define LNET_PING_BUFFER_LONI(PBUF)	((PBUF)->pb_info.pi_ni[0].ns_nid)
+#define LNET_PING_BUFFER_SEQNO(PBUF)	((PBUF)->pb_info.pi_ni[0].ns_status)
+
 
 /* router checker data, per router */
-#define LNET_MAX_RTR_NIS   16
-#define LNET_PINGINFO_SIZE offsetof(struct lnet_ping_info, pi_ni[LNET_MAX_RTR_NIS])
+#define LNET_MAX_RTR_NIS   LNET_INTERFACES_MIN
+#define LNET_RTR_PINGINFO_SIZE	LNET_PING_INFO_SIZE(LNET_MAX_RTR_NIS)
 struct lnet_rc_data {
 	/* chain on the_lnet.ln_zombie_rcd or ln_deathrow_rcd */
 	struct list_head	rcd_list;
@@ -401,7 +421,7 @@ struct lnet_rc_data {
 	/* reference to gateway */
 	struct lnet_peer_ni	*rcd_gateway;
 	/* ping buffer */
-	struct lnet_ping_info	*rcd_pinginfo;
+	struct lnet_ping_buffer	*rcd_pingbuffer;
 };
 
 struct lnet_peer_ni {
@@ -792,9 +812,17 @@ struct lnet {
 	/* percpt router buffer pools */
 	struct lnet_rtrbufpool		**ln_rtrpools;
 
+	/*
+	 * Ping target / Push source
+	 *
+	 * The ping target and push source share a single buffer. The
+	 * ln_ping_target is protected against concurrent updates by
+	 * ln_api_mutex.
+	 */
 	struct lnet_handle_md		  ln_ping_target_md;
 	struct lnet_handle_eq		  ln_ping_target_eq;
-	struct lnet_ping_info		 *ln_ping_info;
+	struct lnet_ping_buffer		 *ln_ping_target;
+	atomic_t			ln_ping_target_seqno;
 
 	/* router checker startup/shutdown state */
 	enum lnet_rc_state		  ln_rc_state;
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 8b6400da2836..ca28ad75fe2b 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -902,25 +902,44 @@ lnet_count_acceptor_nets(void)
 	return count;
 }
 
-static struct lnet_ping_info *
-lnet_ping_info_create(int num_ni)
+struct lnet_ping_buffer *
+lnet_ping_buffer_alloc(int nnis, gfp_t gfp)
 {
-	struct lnet_ping_info *ping_info;
-	unsigned int infosz;
+	struct lnet_ping_buffer *pbuf;
 
-	infosz = offsetof(struct lnet_ping_info, pi_ni[num_ni]);
-	ping_info = kvzalloc(infosz, GFP_KERNEL);
-	if (!ping_info) {
-		CERROR("Can't allocate ping info[%d]\n", num_ni);
+	pbuf = kmalloc(LNET_PING_BUFFER_SIZE(nnis), gfp);
+	if (pbuf) {
+		pbuf->pb_nnis = nnis;
+		atomic_set(&pbuf->pb_refcnt, 1);
+	}
+
+	return pbuf;
+}
+
+void
+lnet_ping_buffer_free(struct lnet_ping_buffer *pbuf)
+{
+	LASSERT(lnet_ping_buffer_numref(pbuf) == 0);
+	kfree(pbuf);
+}
+
+static struct lnet_ping_buffer *
+lnet_ping_target_create(int nnis)
+{
+	struct lnet_ping_buffer *pbuf;
+
+	pbuf = lnet_ping_buffer_alloc(nnis, GFP_KERNEL);
+	if (!pbuf) {
+		CERROR("Can't allocate ping source [%d]\n", nnis);
 		return NULL;
 	}
 
-	ping_info->pi_nnis = num_ni;
-	ping_info->pi_pid = the_lnet.ln_pid;
-	ping_info->pi_magic = LNET_PROTO_PING_MAGIC;
-	ping_info->pi_features = LNET_PING_FEAT_NI_STATUS;
+	pbuf->pb_info.pi_nnis = nnis;
+	pbuf->pb_info.pi_pid = the_lnet.ln_pid;
+	pbuf->pb_info.pi_magic = LNET_PROTO_PING_MAGIC;
+	pbuf->pb_info.pi_features = LNET_PING_FEAT_NI_STATUS;
 
-	return ping_info;
+	return pbuf;
 }
 
 static inline int
@@ -966,14 +985,25 @@ lnet_get_ni_count(void)
 	return count;
 }
 
-static inline void
-lnet_ping_info_free(struct lnet_ping_info *pinfo)
+int
+lnet_ping_info_validate(struct lnet_ping_info *pinfo)
 {
-	kvfree(pinfo);
+	if (!pinfo)
+		return -EINVAL;
+	if (pinfo->pi_magic != LNET_PROTO_PING_MAGIC)
+		return -EPROTO;
+	if (!(pinfo->pi_features & LNET_PING_FEAT_NI_STATUS))
+		return -EPROTO;
+	/* Loopback is guaranteed to be present */
+	if (pinfo->pi_nnis < 1 || pinfo->pi_nnis > lnet_interfaces_max)
+		return -ERANGE;
+	if (LNET_NETTYP(LNET_NIDNET(LNET_PING_INFO_LONI(pinfo))) != LOLND)
+		return -EPROTO;
+	return 0;
 }
 
 static void
-lnet_ping_info_destroy(void)
+lnet_ping_target_destroy(void)
 {
 	struct lnet_net *net;
 	struct lnet_ni *ni;
@@ -988,25 +1018,25 @@ lnet_ping_info_destroy(void)
 		}
 	}
 
-	lnet_ping_info_free(the_lnet.ln_ping_info);
-	the_lnet.ln_ping_info = NULL;
+	lnet_ping_buffer_decref(the_lnet.ln_ping_target);
+	the_lnet.ln_ping_target = NULL;
 
 	lnet_net_unlock(LNET_LOCK_EX);
 }
 
 static void
-lnet_ping_event_handler(struct lnet_event *event)
+lnet_ping_target_event_handler(struct lnet_event *event)
 {
-	struct lnet_ping_info *pinfo = event->md.user_ptr;
+	struct lnet_ping_buffer *pbuf = event->md.user_ptr;
 
 	if (event->unlinked)
-		pinfo->pi_features = LNET_PING_FEAT_INVAL;
+		lnet_ping_buffer_decref(pbuf);
 }
 
 static int
-lnet_ping_info_setup(struct lnet_ping_info **ppinfo,
-		     struct lnet_handle_md *md_handle,
-		     int ni_count, bool set_eq)
+lnet_ping_target_setup(struct lnet_ping_buffer **ppbuf,
+		       struct lnet_handle_md *ping_mdh,
+		       int ni_count, bool set_eq)
 {
 	struct lnet_process_id id = { .nid = LNET_NID_ANY,
 				      .pid = LNET_PID_ANY };
@@ -1015,94 +1045,98 @@ lnet_ping_info_setup(struct lnet_ping_info **ppinfo,
 	int rc, rc2;
 
 	if (set_eq) {
-		rc = LNetEQAlloc(0, lnet_ping_event_handler,
+		rc = LNetEQAlloc(0, lnet_ping_target_event_handler,
 				 &the_lnet.ln_ping_target_eq);
 		if (rc) {
-			CERROR("Can't allocate ping EQ: %d\n", rc);
+			CERROR("Can't allocate ping buffer EQ: %d\n", rc);
 			return rc;
 		}
 	}
 
-	*ppinfo = lnet_ping_info_create(ni_count);
-	if (!*ppinfo) {
+	*ppbuf = lnet_ping_target_create(ni_count);
+	if (!*ppbuf) {
 		rc = -ENOMEM;
-		goto failed_0;
+		goto fail_free_eq;
 	}
 
+	/* Ping target ME/MD */
 	rc = LNetMEAttach(LNET_RESERVED_PORTAL, id,
 			  LNET_PROTO_PING_MATCHBITS, 0,
 			  LNET_UNLINK, LNET_INS_AFTER,
 			  &me_handle);
 	if (rc) {
-		CERROR("Can't create ping ME: %d\n", rc);
-		goto failed_1;
+		CERROR("Can't create ping target ME: %d\n", rc);
+		goto fail_decref_ping_buffer;
 	}
 
 	/* initialize md content */
-	md.start = *ppinfo;
-	md.length = offsetof(struct lnet_ping_info,
-			     pi_ni[(*ppinfo)->pi_nnis]);
+	md.start = &(*ppbuf)->pb_info;
+	md.length = LNET_PING_INFO_SIZE((*ppbuf)->pb_nnis);
 	md.threshold = LNET_MD_THRESH_INF;
 	md.max_size = 0;
 	md.options = LNET_MD_OP_GET | LNET_MD_TRUNCATE |
 		     LNET_MD_MANAGE_REMOTE;
-	md.user_ptr  = NULL;
 	md.eq_handle = the_lnet.ln_ping_target_eq;
-	md.user_ptr = *ppinfo;
+	md.user_ptr = *ppbuf;
 
-	rc = LNetMDAttach(me_handle, md, LNET_RETAIN, md_handle);
+	rc = LNetMDAttach(me_handle, md, LNET_RETAIN, ping_mdh);
 	if (rc) {
-		CERROR("Can't attach ping MD: %d\n", rc);
-		goto failed_2;
+		CERROR("Can't attach ping target MD: %d\n", rc);
+		goto fail_unlink_ping_me;
 	}
+	lnet_ping_buffer_addref(*ppbuf);
 
 	return 0;
 
-failed_2:
+fail_unlink_ping_me:
 	rc2 = LNetMEUnlink(me_handle);
 	LASSERT(!rc2);
-failed_1:
-	lnet_ping_info_free(*ppinfo);
-	*ppinfo = NULL;
-failed_0:
-	if (set_eq)
-		LNetEQFree(the_lnet.ln_ping_target_eq);
+fail_decref_ping_buffer:
+	LASSERT(lnet_ping_buffer_numref(*ppbuf) == 1);
+	lnet_ping_buffer_decref(*ppbuf);
+	*ppbuf = NULL;
+fail_free_eq:
+	if (set_eq) {
+		rc2 = LNetEQFree(the_lnet.ln_ping_target_eq);
+		LASSERT(rc2 == 0);
+	}
 	return rc;
 }
 
 static void
-lnet_ping_md_unlink(struct lnet_ping_info *pinfo,
-		    struct lnet_handle_md *md_handle)
+lnet_ping_md_unlink(struct lnet_ping_buffer *pbuf,
+		    struct lnet_handle_md *ping_mdh)
 {
-	LNetMDUnlink(*md_handle);
-	LNetInvalidateMDHandle(md_handle);
+	LNetMDUnlink(*ping_mdh);
+	LNetInvalidateMDHandle(ping_mdh);
 
-	/* NB md could be busy; this just starts the unlink */
-	while (pinfo->pi_features != LNET_PING_FEAT_INVAL) {
-		CDEBUG(D_NET, "Still waiting for ping MD to unlink\n");
+	/* NB the MD could be busy; this just starts the unlink */
+	while (lnet_ping_buffer_numref(pbuf) > 1) {
+		CDEBUG(D_NET, "Still waiting for ping data MD to unlink\n");
 		schedule_timeout_idle(HZ);
 	}
 }
 
 static void
-lnet_ping_info_install_locked(struct lnet_ping_info *ping_info)
+lnet_ping_target_install_locked(struct lnet_ping_buffer *pbuf)
 {
 	struct lnet_ni_status *ns;
 	struct lnet_ni *ni;
 	struct lnet_net *net;
 	int i = 0;
+	int rc;
 
 	list_for_each_entry(net, &the_lnet.ln_nets, net_list) {
 		list_for_each_entry(ni, &net->net_ni_list, ni_netlist) {
-			LASSERT(i < ping_info->pi_nnis);
+			LASSERT(i < pbuf->pb_nnis);
 
-			ns = &ping_info->pi_ni[i];
+			ns = &pbuf->pb_info.pi_ni[i];
 
 			ns->ns_nid = ni->ni_nid;
 
 			lnet_ni_lock(ni);
 			ns->ns_status = ni->ni_status ?
-					ni->ni_status->ns_status :
+					 ni->ni_status->ns_status :
 						LNET_NI_STATUS_UP;
 			ni->ni_status = ns;
 			lnet_ni_unlock(ni);
@@ -1110,35 +1144,47 @@ lnet_ping_info_install_locked(struct lnet_ping_info *ping_info)
 			i++;
 		}
 	}
+	/*
+	 * We (ab)use the ns_status of the loopback interface to
+	 * transmit the sequence number. The first interface listed
+	 * must be the loopback interface.
+	 */
+	rc = lnet_ping_info_validate(&pbuf->pb_info);
+	if (rc) {
+		LCONSOLE_EMERG("Invalid ping target: %d\n", rc);
+		LBUG();
+	}
+	LNET_PING_BUFFER_SEQNO(pbuf) =
+		atomic_inc_return(&the_lnet.ln_ping_target_seqno);
 }
 
 static void
-lnet_ping_target_update(struct lnet_ping_info *pinfo,
-			struct lnet_handle_md md_handle)
+lnet_ping_target_update(struct lnet_ping_buffer *pbuf,
+			struct lnet_handle_md ping_mdh)
 {
-	struct lnet_ping_info *old_pinfo = NULL;
-	struct lnet_handle_md old_md;
+	struct lnet_ping_buffer *old_pbuf = NULL;
+	struct lnet_handle_md old_ping_md;
 
 	/* switch the NIs to point to the new ping info created */
 	lnet_net_lock(LNET_LOCK_EX);
 
 	if (!the_lnet.ln_routing)
-		pinfo->pi_features |= LNET_PING_FEAT_RTE_DISABLED;
-	lnet_ping_info_install_locked(pinfo);
+		pbuf->pb_info.pi_features |= LNET_PING_FEAT_RTE_DISABLED;
+	lnet_ping_target_install_locked(pbuf);
 
-	if (the_lnet.ln_ping_info) {
-		old_pinfo = the_lnet.ln_ping_info;
-		old_md = the_lnet.ln_ping_target_md;
+	if (the_lnet.ln_ping_target) {
+		old_pbuf = the_lnet.ln_ping_target;
+		old_ping_md = the_lnet.ln_ping_target_md;
 	}
-	the_lnet.ln_ping_target_md = md_handle;
-	the_lnet.ln_ping_info = pinfo;
+	the_lnet.ln_ping_target_md = ping_mdh;
+	the_lnet.ln_ping_target = pbuf;
 
 	lnet_net_unlock(LNET_LOCK_EX);
 
-	if (old_pinfo) {
-		/* unlink the old ping info */
-		lnet_ping_md_unlink(old_pinfo, &old_md);
-		lnet_ping_info_free(old_pinfo);
+	if (old_pbuf) {
+		/* unlink and free the old ping info */
+		lnet_ping_md_unlink(old_pbuf, &old_ping_md);
+		lnet_ping_buffer_decref(old_pbuf);
 	}
 }
 
@@ -1147,13 +1193,13 @@ lnet_ping_target_fini(void)
 {
 	int rc;
 
-	lnet_ping_md_unlink(the_lnet.ln_ping_info,
+	lnet_ping_md_unlink(the_lnet.ln_ping_target,
 			    &the_lnet.ln_ping_target_md);
 
 	rc = LNetEQFree(the_lnet.ln_ping_target_eq);
 	LASSERT(!rc);
 
-	lnet_ping_info_destroy();
+	lnet_ping_target_destroy();
 }
 
 static int
@@ -1745,8 +1791,8 @@ LNetNIInit(lnet_pid_t requested_pid)
 	int im_a_router = 0;
 	int rc;
 	int ni_count;
-	struct lnet_ping_info *pinfo;
-	struct lnet_handle_md md_handle;
+	struct lnet_ping_buffer *pbuf;
+	struct lnet_handle_md ping_mdh;
 	struct list_head net_head;
 	struct lnet_net *net;
 
@@ -1823,11 +1869,11 @@ LNetNIInit(lnet_pid_t requested_pid)
 	the_lnet.ln_refcount = 1;
 	/* Now I may use my own API functions... */
 
-	rc = lnet_ping_info_setup(&pinfo, &md_handle, ni_count, true);
+	rc = lnet_ping_target_setup(&pbuf, &ping_mdh, ni_count, true);
 	if (rc)
 		goto err_acceptor_stop;
 
-	lnet_ping_target_update(pinfo, md_handle);
+	lnet_ping_target_update(pbuf, ping_mdh);
 
 	rc = lnet_router_checker_start();
 	if (rc)
@@ -1936,7 +1982,10 @@ lnet_fill_ni_info(struct lnet_ni *ni, struct lnet_ioctl_config_ni *cfg_ni,
 	}
 
 	cfg_ni->lic_nid = ni->ni_nid;
-	cfg_ni->lic_status = ni->ni_status->ns_status;
+	if (LNET_NETTYP(LNET_NIDNET(ni->ni_nid)) == LOLND)
+		cfg_ni->lic_status = LNET_NI_STATUS_UP;
+	else
+		cfg_ni->lic_status = ni->ni_status->ns_status;
 	cfg_ni->lic_tcp_bonding = use_tcp_bonding;
 	cfg_ni->lic_dev_cpt = ni->ni_dev_cpt;
 
@@ -2021,7 +2070,10 @@ lnet_fill_ni_info_legacy(struct lnet_ni *ni,
 	config->cfg_config_u.cfg_net.net_peer_rtr_credits =
 		ni->ni_net->net_tunables.lct_peer_rtr_credits;
 
-	net_config->ni_status = ni->ni_status->ns_status;
+	if (LNET_NETTYP(LNET_NIDNET(ni->ni_nid)) == LOLND)
+		net_config->ni_status = LNET_NI_STATUS_UP;
+	else
+		net_config->ni_status = ni->ni_status->ns_status;
 
 	if (ni->ni_cpts) {
 		int num_cpts = min(ni->ni_ncpts, LNET_MAX_SHOW_NUM_CPT);
@@ -2172,8 +2224,8 @@ static int lnet_add_net_common(struct lnet_net *net,
 			       struct lnet_ioctl_config_lnd_tunables *tun)
 {
 	u32 net_id;
-	struct lnet_ping_info *pinfo;
-	struct lnet_handle_md md_handle;
+	struct lnet_ping_buffer *pbuf;
+	struct lnet_handle_md ping_mdh;
 	int rc;
 	struct lnet_remotenet *rnet;
 	int net_ni_count;
@@ -2195,7 +2247,7 @@ static int lnet_add_net_common(struct lnet_net *net,
 
 	/*
 	 * make sure you calculate the correct number of slots in the ping
-	 * info. Since the ping info is a flattened list of all the NIs,
+	 * buffer. Since the ping info is a flattened list of all the NIs,
	 * we should allocate enough slots to accommodate the number of NIs
 	 * which will be added.
 	 *
@@ -2204,9 +2256,9 @@ static int lnet_add_net_common(struct lnet_net *net,
 	 */
 	net_ni_count = lnet_get_net_ni_count_pre(net);
 
-	rc = lnet_ping_info_setup(&pinfo, &md_handle,
-				  net_ni_count + lnet_get_ni_count(),
-				  false);
+	rc = lnet_ping_target_setup(&pbuf, &ping_mdh,
+				    net_ni_count + lnet_get_ni_count(),
+				    false);
 	if (rc < 0) {
 		lnet_net_free(net);
 		return rc;
@@ -2257,13 +2309,13 @@ static int lnet_add_net_common(struct lnet_net *net,
 	lnet_peer_net_added(net);
 	lnet_net_unlock(LNET_LOCK_EX);
 
-	lnet_ping_target_update(pinfo, md_handle);
+	lnet_ping_target_update(pbuf, ping_mdh);
 
 	return 0;
 
 failed:
-	lnet_ping_md_unlink(pinfo, &md_handle);
-	lnet_ping_info_free(pinfo);
+	lnet_ping_md_unlink(pbuf, &ping_mdh);
+	lnet_ping_buffer_decref(pbuf);
 	return rc;
 }
 
@@ -2354,8 +2406,8 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
 	struct lnet_net *net;
 	struct lnet_ni *ni;
 	u32 net_id = LNET_NIDNET(conf->lic_nid);
-	struct lnet_ping_info *pinfo;
-	struct lnet_handle_md md_handle;
+	struct lnet_ping_buffer *pbuf;
+	struct lnet_handle_md  ping_mdh;
 	int rc;
 	int net_count;
 	u32 addr;
@@ -2373,7 +2425,7 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
 		CERROR("net %s not found\n",
 		       libcfs_net2str(net_id));
 		rc = -ENOENT;
-		goto net_unlock;
+		goto unlock_net;
 	}
 
 	addr = LNET_NIDADDR(conf->lic_nid);
@@ -2384,20 +2436,20 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
 		lnet_net_unlock(0);
 
 		/* create and link a new ping info, before removing the old one */
-		rc = lnet_ping_info_setup(&pinfo, &md_handle,
-					  lnet_get_ni_count() - net_count,
-					  false);
+		rc = lnet_ping_target_setup(&pbuf, &ping_mdh,
+					    lnet_get_ni_count() - net_count,
+					    false);
 		if (rc != 0)
-			goto out;
+			goto unlock_api_mutex;
 
 		lnet_shutdown_lndnet(net);
 
 		if (lnet_count_acceptor_nets() == 0)
 			lnet_acceptor_stop();
 
-		lnet_ping_target_update(pinfo, md_handle);
+		lnet_ping_target_update(pbuf, ping_mdh);
 
-		goto out;
+		goto unlock_api_mutex;
 	}
 
 	ni = lnet_nid2ni_locked(conf->lic_nid, 0);
@@ -2405,7 +2457,7 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
 		CERROR("nid %s not found\n",
 		       libcfs_nid2str(conf->lic_nid));
 		rc = -ENOENT;
-		goto net_unlock;
+		goto unlock_net;
 	}
 
 	net_count = lnet_get_net_ni_count_locked(net);
@@ -2413,27 +2465,27 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
 	lnet_net_unlock(0);
 
 	/* create and link a new ping info, before removing the old one */
-	rc = lnet_ping_info_setup(&pinfo, &md_handle,
-				  lnet_get_ni_count() - 1, false);
+	rc = lnet_ping_target_setup(&pbuf, &ping_mdh,
+				    lnet_get_ni_count() - 1, false);
 	if (rc != 0)
-		goto out;
+		goto unlock_api_mutex;
 
 	lnet_shutdown_lndni(ni);
 
 	if (lnet_count_acceptor_nets() == 0)
 		lnet_acceptor_stop();
 
-	lnet_ping_target_update(pinfo, md_handle);
+	lnet_ping_target_update(pbuf, ping_mdh);
 
 	/* check if the net is empty and remove it if it is */
 	if (net_count == 1)
 		lnet_shutdown_lndnet(net);
 
-	goto out;
+	goto unlock_api_mutex;
 
-net_unlock:
+unlock_net:
 	lnet_net_unlock(0);
-out:
+unlock_api_mutex:
 	mutex_unlock(&the_lnet.ln_api_mutex);
 
 	return rc;
@@ -2501,8 +2553,8 @@ int
 lnet_dyn_del_net(__u32 net_id)
 {
 	struct lnet_net *net;
-	struct lnet_ping_info *pinfo;
-	struct lnet_handle_md md_handle;
+	struct lnet_ping_buffer *pbuf;
+	struct lnet_handle_md ping_mdh;
 	int rc;
 	int net_ni_count;
 
@@ -2525,8 +2577,8 @@ lnet_dyn_del_net(__u32 net_id)
 	lnet_net_unlock(0);
 
 	/* create and link a new ping info, before removing the old one */
-	rc = lnet_ping_info_setup(&pinfo, &md_handle,
-				  lnet_get_ni_count() - net_ni_count, false);
+	rc = lnet_ping_target_setup(&pbuf, &ping_mdh,
+				    lnet_get_ni_count() - net_ni_count, false);
 	if (rc)
 		goto out;
 
@@ -2535,7 +2587,7 @@ lnet_dyn_del_net(__u32 net_id)
 	if (!lnet_count_acceptor_nets())
 		lnet_acceptor_stop();
 
-	lnet_ping_target_update(pinfo, md_handle);
+	lnet_ping_target_update(pbuf, ping_mdh);
 
 out:
 	mutex_unlock(&the_lnet.ln_api_mutex);
@@ -2943,16 +2995,13 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	int unlinked = 0;
 	int replied = 0;
 	const signed long a_long_time = 60*HZ;
-	int infosz;
-	struct lnet_ping_info *info;
+	struct lnet_ping_buffer *pbuf;
 	struct lnet_process_id tmpid;
 	int i;
 	int nob;
 	int rc;
 	int rc2;
 
-	infosz = offsetof(struct lnet_ping_info, pi_ni[n_ids]);
-
 	/* n_ids limit is arbitrary */
 	if (n_ids <= 0 || n_ids > lnet_interfaces_max || id.nid == LNET_NID_ANY)
 		return -EINVAL;
@@ -2960,20 +3009,20 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	if (id.pid == LNET_PID_ANY)
 		id.pid = LNET_PID_LUSTRE;
 
-	info = kzalloc(infosz, GFP_KERNEL);
-	if (!info)
+	pbuf = lnet_ping_buffer_alloc(n_ids, GFP_NOFS);
+	if (!pbuf)
 		return -ENOMEM;
 
 	/* NB 2 events max (including any unlink event) */
 	rc = LNetEQAlloc(2, LNET_EQ_HANDLER_NONE, &eqh);
 	if (rc) {
 		CERROR("Can't allocate EQ: %d\n", rc);
-		goto out_0;
+		goto fail_ping_buffer_decref;
 	}
 
 	/* initialize md content */
-	md.start     = info;
-	md.length    = infosz;
+	md.start     = &pbuf->pb_info;
+	md.length    = LNET_PING_INFO_SIZE(n_ids);
 	md.threshold = 2; /*GET/REPLY*/
 	md.max_size  = 0;
 	md.options   = LNET_MD_TRUNCATE;
@@ -2983,7 +3032,7 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	rc = LNetMDBind(md, LNET_UNLINK, &mdh);
 	if (rc) {
 		CERROR("Can't bind MD: %d\n", rc);
-		goto out_1;
+		goto fail_free_eq;
 	}
 
 	rc = LNetGet(LNET_NID_ANY, mdh, id,
@@ -3044,11 +3093,11 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 			CWARN("%s: Unexpected rc >= 0 but no reply!\n",
 			      libcfs_id2str(id));
 		rc = -EIO;
-		goto out_1;
+		goto fail_free_eq;
 	}
 
 	nob = rc;
-	LASSERT(nob >= 0 && nob <= infosz);
+	LASSERT(nob >= 0 && nob <= LNET_PING_INFO_SIZE(n_ids));
 
 	rc = -EPROTO;			   /* if I can't parse... */
 
@@ -3056,56 +3105,56 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 		/* can't check magic/version */
 		CERROR("%s: ping info too short %d\n",
 		       libcfs_id2str(id), nob);
-		goto out_1;
+		goto fail_free_eq;
 	}
 
-	if (info->pi_magic == __swab32(LNET_PROTO_PING_MAGIC)) {
-		lnet_swap_pinginfo(info);
-	} else if (info->pi_magic != LNET_PROTO_PING_MAGIC) {
+	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC)) {
+		lnet_swap_pinginfo(pbuf);
+	} else if (pbuf->pb_info.pi_magic != LNET_PROTO_PING_MAGIC) {
 		CERROR("%s: Unexpected magic %08x\n",
-		       libcfs_id2str(id), info->pi_magic);
-		goto out_1;
+		       libcfs_id2str(id), pbuf->pb_info.pi_magic);
+		goto fail_free_eq;
 	}
 
-	if (!(info->pi_features & LNET_PING_FEAT_NI_STATUS)) {
+	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_NI_STATUS)) {
 		CERROR("%s: ping w/o NI status: 0x%x\n",
-		       libcfs_id2str(id), info->pi_features);
-		goto out_1;
+		       libcfs_id2str(id), pbuf->pb_info.pi_features);
+		goto fail_free_eq;
 	}
 
-	if (nob < offsetof(struct lnet_ping_info, pi_ni[0])) {
+	if (nob < LNET_PING_INFO_SIZE(0)) {
 		CERROR("%s: Short reply %d(%d min)\n", libcfs_id2str(id),
-		       nob, (int)offsetof(struct lnet_ping_info, pi_ni[0]));
-		goto out_1;
+		       nob, (int)LNET_PING_INFO_SIZE(0));
+		goto fail_free_eq;
 	}
 
-	if (info->pi_nnis < n_ids)
-		n_ids = info->pi_nnis;
+	if (pbuf->pb_info.pi_nnis < n_ids)
+		n_ids = pbuf->pb_info.pi_nnis;
 
-	if (nob < offsetof(struct lnet_ping_info, pi_ni[n_ids])) {
+	if (nob < LNET_PING_INFO_SIZE(n_ids)) {
 		CERROR("%s: Short reply %d(%d expected)\n", libcfs_id2str(id),
-		       nob, (int)offsetof(struct lnet_ping_info, pi_ni[n_ids]));
-		goto out_1;
+		       nob, (int)LNET_PING_INFO_SIZE(n_ids));
+		goto fail_free_eq;
 	}
 
 	rc = -EFAULT;			   /* If I SEGV... */
 
 	memset(&tmpid, 0, sizeof(tmpid));
 	for (i = 0; i < n_ids; i++) {
-		tmpid.pid = info->pi_pid;
-		tmpid.nid = info->pi_ni[i].ns_nid;
+		tmpid.pid = pbuf->pb_info.pi_pid;
+		tmpid.nid = pbuf->pb_info.pi_ni[i].ns_nid;
 		if (copy_to_user(&ids[i], &tmpid, sizeof(tmpid)))
-			goto out_1;
+			goto fail_free_eq;
 	}
-	rc = info->pi_nnis;
+	rc = pbuf->pb_info.pi_nnis;
 
- out_1:
+ fail_free_eq:
 	rc2 = LNetEQFree(eqh);
 	if (rc2)
 		CERROR("rc2 %d\n", rc2);
 	LASSERT(!rc2);
 
- out_0:
-	kfree(info);
+ fail_ping_buffer_decref:
+	lnet_ping_buffer_decref(pbuf);
 	return rc;
 }
diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index b31a383fe974..e97957ce9252 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -618,17 +618,21 @@ lnet_get_route(int idx, __u32 *net, __u32 *hops,
 }
 
 void
-lnet_swap_pinginfo(struct lnet_ping_info *info)
+lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf)
 {
-	int i;
 	struct lnet_ni_status *stat;
+	int nnis;
+	int i;
 
-	__swab32s(&info->pi_magic);
-	__swab32s(&info->pi_features);
-	__swab32s(&info->pi_pid);
-	__swab32s(&info->pi_nnis);
-	for (i = 0; i < info->pi_nnis && i < LNET_MAX_RTR_NIS; i++) {
-		stat = &info->pi_ni[i];
+	__swab32s(&pbuf->pb_info.pi_magic);
+	__swab32s(&pbuf->pb_info.pi_features);
+	__swab32s(&pbuf->pb_info.pi_pid);
+	__swab32s(&pbuf->pb_info.pi_nnis);
+	nnis = pbuf->pb_info.pi_nnis;
+	if (nnis > pbuf->pb_nnis)
+		nnis = pbuf->pb_nnis;
+	for (i = 0; i < nnis; i++) {
+		stat = &pbuf->pb_info.pi_ni[i];
 		__swab64s(&stat->ns_nid);
 		__swab32s(&stat->ns_status);
 	}
@@ -641,11 +645,12 @@ lnet_swap_pinginfo(struct lnet_ping_info *info)
 static void
 lnet_parse_rc_info(struct lnet_rc_data *rcd)
 {
-	struct lnet_ping_info *info = rcd->rcd_pinginfo;
+	struct lnet_ping_buffer *pbuf = rcd->rcd_pingbuffer;
 	struct lnet_peer_ni *gw = rcd->rcd_gateway;
 	struct lnet_route *rte;
+	int			nnis;
 
-	if (!gw->lpni_alive)
+	if (!gw->lpni_alive || !pbuf)
 		return;
 
 	/*
@@ -654,51 +659,48 @@ lnet_parse_rc_info(struct lnet_rc_data *rcd)
 	 */
 	spin_lock(&gw->lpni_lock);
 
-	if (info->pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
-		lnet_swap_pinginfo(info);
+	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
+		lnet_swap_pinginfo(pbuf);
 
 	/* NB always racing with network! */
-	if (info->pi_magic != LNET_PROTO_PING_MAGIC) {
+	if (pbuf->pb_info.pi_magic != LNET_PROTO_PING_MAGIC) {
 		CDEBUG(D_NET, "%s: Unexpected magic %08x\n",
-		       libcfs_nid2str(gw->lpni_nid), info->pi_magic);
+		       libcfs_nid2str(gw->lpni_nid), pbuf->pb_info.pi_magic);
 		gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
-		spin_unlock(&gw->lpni_lock);
-		return;
+		goto out;
 	}
 
-	gw->lpni_ping_feats = info->pi_features;
-	if (!(gw->lpni_ping_feats & LNET_PING_FEAT_MASK)) {
-		CDEBUG(D_NET, "%s: Unexpected features 0x%x\n",
-		       libcfs_nid2str(gw->lpni_nid), gw->lpni_ping_feats);
-		spin_unlock(&gw->lpni_lock);
-		return; /* nothing I can understand */
-	}
+	gw->lpni_ping_feats = pbuf->pb_info.pi_features;
 
-	if (!(gw->lpni_ping_feats & LNET_PING_FEAT_NI_STATUS)) {
-		spin_unlock(&gw->lpni_lock);
-		return; /* can't carry NI status info */
-	}
+	/* Without NI status info there's nothing more to do. */
+	if (!(gw->lpni_ping_feats & LNET_PING_FEAT_NI_STATUS))
+		goto out;
+
+	/* Determine the number of NIs for which there is data. */
+	nnis = pbuf->pb_info.pi_nnis;
+	if (pbuf->pb_nnis < nnis)
+		nnis = pbuf->pb_nnis;
 
 	list_for_each_entry(rte, &gw->lpni_routes, lr_gwlist) {
 		int down = 0;
 		int up = 0;
 		int i;
 
+		/* If routing disabled then the route is down. */
 		if (gw->lpni_ping_feats & LNET_PING_FEAT_RTE_DISABLED) {
 			rte->lr_downis = 1;
 			continue;
 		}
 
-		for (i = 0; i < info->pi_nnis && i < LNET_MAX_RTR_NIS; i++) {
-			struct lnet_ni_status *stat = &info->pi_ni[i];
+		for (i = 0; i < nnis; i++) {
+			struct lnet_ni_status *stat = &pbuf->pb_info.pi_ni[i];
 			lnet_nid_t nid = stat->ns_nid;
 
 			if (nid == LNET_NID_ANY) {
 				CDEBUG(D_NET, "%s: unexpected LNET_NID_ANY\n",
 				       libcfs_nid2str(gw->lpni_nid));
 				gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
-				spin_unlock(&gw->lpni_lock);
-				return;
+				goto out;
 			}
 
 			if (LNET_NETTYP(LNET_NIDNET(nid)) == LOLND)
@@ -720,8 +722,7 @@ lnet_parse_rc_info(struct lnet_rc_data *rcd)
 			CDEBUG(D_NET, "%s: Unexpected status 0x%x\n",
 			       libcfs_nid2str(gw->lpni_nid), stat->ns_status);
 			gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
-			spin_unlock(&gw->lpni_lock);
-			return;
+			goto out;
 		}
 
 		if (up) { /* ignore downed NIs if NI for dest network is up */
@@ -737,7 +738,7 @@ lnet_parse_rc_info(struct lnet_rc_data *rcd)
 
 		rte->lr_downis = down;
 	}
-
+out:
 	spin_unlock(&gw->lpni_lock);
 }
 
@@ -903,7 +904,8 @@ lnet_destroy_rc_data(struct lnet_rc_data *rcd)
 		lnet_net_unlock(cpt);
 	}
 
-	kfree(rcd->rcd_pinginfo);
+	if (rcd->rcd_pingbuffer)
+		lnet_ping_buffer_decref(rcd->rcd_pingbuffer);
 
 	kfree(rcd);
 }
@@ -912,7 +914,7 @@ static struct lnet_rc_data *
 lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
 {
 	struct lnet_rc_data *rcd = NULL;
-	struct lnet_ping_info *pi;
+	struct lnet_ping_buffer *pbuf;
 	struct lnet_md md;
 	int rc;
 	int i;
@@ -926,19 +928,19 @@ lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
 	LNetInvalidateMDHandle(&rcd->rcd_mdh);
 	INIT_LIST_HEAD(&rcd->rcd_list);
 
-	pi = kzalloc(LNET_PINGINFO_SIZE, GFP_NOFS);
-	if (!pi)
+	pbuf = lnet_ping_buffer_alloc(LNET_MAX_RTR_NIS, GFP_NOFS);
+	if (!pbuf)
 		goto out;
 
 	for (i = 0; i < LNET_MAX_RTR_NIS; i++) {
-		pi->pi_ni[i].ns_nid = LNET_NID_ANY;
-		pi->pi_ni[i].ns_status = LNET_NI_STATUS_INVALID;
+		pbuf->pb_info.pi_ni[i].ns_nid = LNET_NID_ANY;
+		pbuf->pb_info.pi_ni[i].ns_status = LNET_NI_STATUS_INVALID;
 	}
-	rcd->rcd_pinginfo = pi;
+	rcd->rcd_pingbuffer = pbuf;
 
-	md.start = pi;
+	md.start = &pbuf->pb_info;
 	md.user_ptr = rcd;
-	md.length = LNET_PINGINFO_SIZE;
+	md.length = LNET_RTR_PINGINFO_SIZE;
 	md.threshold = LNET_MD_THRESH_INF;
 	md.options = LNET_MD_TRUNCATE;
 	md.eq_handle = the_lnet.ln_rc_eqh;
@@ -1714,7 +1716,8 @@ lnet_rtrpools_enable(void)
 	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_routing = 1;
 
-	the_lnet.ln_ping_info->pi_features &= ~LNET_PING_FEAT_RTE_DISABLED;
+	the_lnet.ln_ping_target->pb_info.pi_features &=
+		~LNET_PING_FEAT_RTE_DISABLED;
 	lnet_net_unlock(LNET_LOCK_EX);
 
 	return rc;
@@ -1728,7 +1731,8 @@ lnet_rtrpools_disable(void)
 
 	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_routing = 0;
-	the_lnet.ln_ping_info->pi_features |= LNET_PING_FEAT_RTE_DISABLED;
+	the_lnet.ln_ping_target->pb_info.pi_features |=
+		LNET_PING_FEAT_RTE_DISABLED;
 
 	tiny_router_buffers = 0;
 	small_router_buffers = 0;

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 04/24] lustre: lnet: automatic sizing of router pinger buffers
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (3 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 06/24] lustre: lnet: add sanity checks on ping-related constants NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:32   ` James Simmons
  2018-10-14 19:33   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 10/24] lustre: lnet: refactor lnet_add_peer_ni() NeilBrown
                   ` (19 subsequent siblings)
  24 siblings, 2 replies; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

The router pinger uses fixed-size buffers to receive the data
returned by a ping. When a router has more than 16 interfaces
(including loopback) this means the data for some interfaces
is dropped.

Detect this situation, and track the number of remote NIs in
struct lnet_rc_data.  lnet_create_rc_data_locked() becomes
lnet_update_rc_data_locked(), and is modified to replace an
existing ping buffer if one is present.  It is now also called
by lnet_ping_router_locked() when the existing ping buffer is
too small.
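Not part of the patch itself: the resize-on-demand pattern described
above can be sketched in a small user-space model.  The field names
mirror the patch (rcd_nnis, pb_nnis), but the structures and helpers
here are illustrative stand-ins, not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified model: rcd_nnis records the largest NI count a ping
 * reply has advertised, and the buffer is reallocated before the
 * next ping whenever it is smaller than that. */

struct ping_buffer {
	int pb_nnis;	/* slots allocated for NI status entries */
};

struct rc_data {
	struct ping_buffer *rcd_pingbuffer;
	int rcd_nnis;	/* desired size, learned from replies */
};

static struct ping_buffer *ping_buffer_alloc(int nnis)
{
	struct ping_buffer *pbuf = malloc(sizeof(*pbuf));

	if (pbuf)
		pbuf->pb_nnis = nnis;
	return pbuf;
}

/* On reply parsing: remember if the reply reported more NIs than
 * the buffer could hold (cf. lnet_parse_rc_info()). */
static void note_reply_nnis(struct rc_data *rcd, int reply_nnis)
{
	if (rcd->rcd_pingbuffer->pb_nnis < reply_nnis &&
	    rcd->rcd_nnis < reply_nnis)
		rcd->rcd_nnis = reply_nnis;
}

/* Before the next ping: replace a buffer that became too small
 * (cf. the check in lnet_ping_router_locked()). */
static void maybe_resize(struct rc_data *rcd)
{
	if (rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis) {
		free(rcd->rcd_pingbuffer);
		rcd->rcd_pingbuffer = ping_buffer_alloc(rcd->rcd_nnis);
	}
}
```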

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25774
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-types.h  |    4 -
 drivers/staging/lustre/lnet/lnet/router.c          |   90 +++++++++++++-------
 2 files changed, 60 insertions(+), 34 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index ab8c6d66cdbf..d1d17ededd06 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -411,8 +411,6 @@ struct lnet_ping_buffer {
 
 
 /* router checker data, per router */
-#define LNET_MAX_RTR_NIS   LNET_INTERFACES_MIN
-#define LNET_RTR_PINGINFO_SIZE	LNET_PING_INFO_SIZE(LNET_MAX_RTR_NIS)
 struct lnet_rc_data {
 	/* chain on the_lnet.ln_zombie_rcd or ln_deathrow_rcd */
 	struct list_head	rcd_list;
@@ -422,6 +420,8 @@ struct lnet_rc_data {
 	struct lnet_peer_ni	*rcd_gateway;
 	/* ping buffer */
 	struct lnet_ping_buffer	*rcd_pingbuffer;
+	/* desired size of buffer */
+	int			rcd_nnis;
 };
 
 struct lnet_peer_ni {
diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index e97957ce9252..86cce27e10d8 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -678,8 +678,11 @@ lnet_parse_rc_info(struct lnet_rc_data *rcd)
 
 	/* Determine the number of NIs for which there is data. */
 	nnis = pbuf->pb_info.pi_nnis;
-	if (pbuf->pb_nnis < nnis)
+	if (pbuf->pb_nnis < nnis) {
+		if (rcd->rcd_nnis < nnis)
+			rcd->rcd_nnis = nnis;
 		nnis = pbuf->pb_nnis;
+	}
 
 	list_for_each_entry(rte, &gw->lpni_routes, lr_gwlist) {
 		int down = 0;
@@ -911,28 +914,47 @@ lnet_destroy_rc_data(struct lnet_rc_data *rcd)
 }
 
 static struct lnet_rc_data *
-lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
+lnet_update_rc_data_locked(struct lnet_peer_ni *gateway)
 {
-	struct lnet_rc_data *rcd = NULL;
-	struct lnet_ping_buffer *pbuf;
+	struct lnet_handle_md mdh;
+	struct lnet_rc_data *rcd;
+	struct lnet_ping_buffer *pbuf = NULL;
 	struct lnet_md md;
+	int nnis = LNET_INTERFACES_MIN;
 	int rc;
 	int i;
 
+	rcd = gateway->lpni_rcd;
+	if (rcd) {
+		nnis = rcd->rcd_nnis;
+		mdh = rcd->rcd_mdh;
+		LNetInvalidateMDHandle(&rcd->rcd_mdh);
+		pbuf = rcd->rcd_pingbuffer;
+		rcd->rcd_pingbuffer = NULL;
+	} else {
+		LNetInvalidateMDHandle(&mdh);
+	}
+
 	lnet_net_unlock(gateway->lpni_cpt);
 
-	rcd = kzalloc(sizeof(*rcd), GFP_NOFS);
-	if (!rcd)
-		goto out;
+	if (rcd) {
+		LNetMDUnlink(mdh);
+		lnet_ping_buffer_decref(pbuf);
+	} else {
+		rcd = kzalloc(sizeof(*rcd), GFP_NOFS);
+		if (!rcd)
+			goto out;
 
-	LNetInvalidateMDHandle(&rcd->rcd_mdh);
-	INIT_LIST_HEAD(&rcd->rcd_list);
+		LNetInvalidateMDHandle(&rcd->rcd_mdh);
+		INIT_LIST_HEAD(&rcd->rcd_list);
+		rcd->rcd_nnis = nnis;
+	}
 
-	pbuf = lnet_ping_buffer_alloc(LNET_MAX_RTR_NIS, GFP_NOFS);
+	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
 	if (!pbuf)
 		goto out;
 
-	for (i = 0; i < LNET_MAX_RTR_NIS; i++) {
+	for (i = 0; i < nnis; i++) {
 		pbuf->pb_info.pi_ni[i].ns_nid = LNET_NID_ANY;
 		pbuf->pb_info.pi_ni[i].ns_status = LNET_NI_STATUS_INVALID;
 	}
@@ -940,7 +962,7 @@ lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
 
 	md.start = &pbuf->pb_info;
 	md.user_ptr = rcd;
-	md.length = LNET_RTR_PINGINFO_SIZE;
+	md.length = LNET_PING_INFO_SIZE(nnis);
 	md.threshold = LNET_MD_THRESH_INF;
 	md.options = LNET_MD_TRUNCATE;
 	md.eq_handle = the_lnet.ln_rc_eqh;
@@ -949,33 +971,37 @@ lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
 	rc = LNetMDBind(md, LNET_UNLINK, &rcd->rcd_mdh);
 	if (rc < 0) {
 		CERROR("Can't bind MD: %d\n", rc);
-		goto out;
+		goto out_ping_buffer_decref;
 	}
 	LASSERT(!rc);
 
 	lnet_net_lock(gateway->lpni_cpt);
-	/* router table changed or someone has created rcd for this gateway */
-	if (!lnet_isrouter(gateway) || gateway->lpni_rcd) {
-		lnet_net_unlock(gateway->lpni_cpt);
-		goto out;
+	/* Check if this is still a router. */
+	if (!lnet_isrouter(gateway))
+		goto out_unlock;
+	/* Check if someone else installed router data. */
+	if (gateway->lpni_rcd && gateway->lpni_rcd != rcd)
+		goto out_unlock;
+
+	/* Install and/or update the router data. */
+	if (!gateway->lpni_rcd) {
+		lnet_peer_ni_addref_locked(gateway);
+		rcd->rcd_gateway = gateway;
+		gateway->lpni_rcd = rcd;
 	}
-
-	lnet_peer_ni_addref_locked(gateway);
-	rcd->rcd_gateway = gateway;
-	gateway->lpni_rcd = rcd;
 	gateway->lpni_ping_notsent = 0;
 
 	return rcd;
 
- out:
-	if (rcd) {
-		if (!LNetMDHandleIsInvalid(rcd->rcd_mdh)) {
-			rc = LNetMDUnlink(rcd->rcd_mdh);
-			LASSERT(!rc);
-		}
+out_unlock:
+	lnet_net_unlock(gateway->lpni_cpt);
+	rc = LNetMDUnlink(mdh);
+	LASSERT(!rc);
+out_ping_buffer_decref:
+	lnet_ping_buffer_decref(pbuf);
+out:
+	if (rcd && rcd != gateway->lpni_rcd)
 		lnet_destroy_rc_data(rcd);
-	}
-
 	lnet_net_lock(gateway->lpni_cpt);
 	return gateway->lpni_rcd;
 }
@@ -1018,9 +1044,9 @@ lnet_ping_router_locked(struct lnet_peer_ni *rtr)
 		return;
 	}
 
-	rcd = rtr->lpni_rcd ?
-	      rtr->lpni_rcd : lnet_create_rc_data_locked(rtr);
-
+	rcd = rtr->lpni_rcd;
+	if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis)
+		rcd = lnet_update_rc_data_locked(rtr);
 	if (!rcd)
 		return;
 


* [lustre-devel] [PATCH 05/24] lustre: lnet: add Multi-Rail and Discovery ping feature bits
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (6 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 01/24] lustre: lnet: add lnet_interfaces_max tunable NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:34   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 03/24] lustre: lnet: add struct lnet_ping_buffer NeilBrown
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Claim ping feature bits for Multi-Rail and Discovery.

Assert in lnet_ping_target_update() that no unknown bits will
be sent over the wire.
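Not part of the patch: the mask-based check that the assertions
implement can be shown standalone.  The bit values are copied from
the patch; the helper function is an illustrative sketch of the two
LASSERT() conditions added to lnet_ping_target_update().

```c
#include <assert.h>

#define LNET_PING_FEAT_BASE		(1U << 0)	/* just a ping */
#define LNET_PING_FEAT_NI_STATUS	(1U << 1)	/* return NI status */
#define LNET_PING_FEAT_RTE_DISABLED	(1U << 2)	/* routing disabled */
#define LNET_PING_FEAT_MULTI_RAIL	(1U << 3)	/* Multi-Rail aware */
#define LNET_PING_FEAT_DISCOVERY	(1U << 4)	/* supports Discovery */

#define LNET_PING_FEAT_BITS	(LNET_PING_FEAT_BASE | \
				 LNET_PING_FEAT_NI_STATUS | \
				 LNET_PING_FEAT_RTE_DISABLED | \
				 LNET_PING_FEAT_MULTI_RAIL | \
				 LNET_PING_FEAT_DISCOVERY)

/* Valid when at least one known bit is set and no unknown bit is:
 * the same pair of conditions the patch asserts before the ping
 * buffer can go out on the wire. */
static int ping_features_valid(unsigned int features)
{
	return (features & LNET_PING_FEAT_BITS) &&
	       !(features & ~LNET_PING_FEAT_BITS);
}
```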

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25775
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-types.h  |   16 ++++++++++++++++
 drivers/staging/lustre/lnet/lnet/api-ni.c          |    5 +++++
 2 files changed, 21 insertions(+)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index d1d17ededd06..f4467a3bbfd1 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -386,6 +386,22 @@ struct lnet_ni {
 #define LNET_PING_FEAT_BASE		BIT(0)	/* just a ping */
 #define LNET_PING_FEAT_NI_STATUS	BIT(1)	/* return NI status */
 #define LNET_PING_FEAT_RTE_DISABLED	BIT(2)	/* Routing enabled */
+#define LNET_PING_FEAT_MULTI_RAIL	BIT(3)	/* Multi-Rail aware */
+#define LNET_PING_FEAT_DISCOVERY	BIT(4)	/* Supports Discovery */
+
+/*
+ * All ping feature bits fit to hit the wire.
+ * In lnet_assert_wire_constants() this is compared against its open-coded
+ * value, and in lnet_ping_target_update() it is used to verify that no
+ * unknown bits have been set.
+ * New feature bits can be added, just be aware that this does change the
+ * over-the-wire protocol.
+ */
+#define LNET_PING_FEAT_BITS		(LNET_PING_FEAT_BASE | \
+					 LNET_PING_FEAT_NI_STATUS | \
+					 LNET_PING_FEAT_RTE_DISABLED | \
+					 LNET_PING_FEAT_MULTI_RAIL | \
+					 LNET_PING_FEAT_DISCOVERY)
 
 #define LNET_PING_INFO_SIZE(NNIDS) \
 	offsetof(struct lnet_ping_info, pi_ni[NNIDS])
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index ca28ad75fe2b..68af723bc6a1 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -1170,6 +1170,11 @@ lnet_ping_target_update(struct lnet_ping_buffer *pbuf,
 
 	if (!the_lnet.ln_routing)
 		pbuf->pb_info.pi_features |= LNET_PING_FEAT_RTE_DISABLED;
+
+	/* Ensure only known feature bits have been set. */
+	LASSERT(pbuf->pb_info.pi_features & LNET_PING_FEAT_BITS);
+	LASSERT(!(pbuf->pb_info.pi_features & ~LNET_PING_FEAT_BITS));
+
 	lnet_ping_target_install_locked(pbuf);
 
 	if (the_lnet.ln_ping_target) {

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 06/24] lustre: lnet: add sanity checks on ping-related constants
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (2 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 02/24] lustre: lnet: configure lnet_interfaces_max tunable from dlc NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:36   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 04/24] lustre: lnet: automatic sizing of router pinger buffers NeilBrown
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add sanity checks for LNet ping related data structures and
constants to wirecheck.c, and update the generated code in
lnet_assert_wire_constants().

In order for the structures and macros to be visible to
wirecheck.c, which is a userspace program, they were moved
from the kernel-only lnet/lib-types.h to lnet/types.h.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25776
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-types.h  |   30 ----------------
 .../lustre/include/uapi/linux/lnet/lnet-types.h    |   30 ++++++++++++++++
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   38 ++++++++++++++++++++
 3 files changed, 68 insertions(+), 30 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index f4467a3bbfd1..f28fa5342914 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -378,36 +378,6 @@ struct lnet_ni {
 
 #define LNET_PROTO_PING_MATCHBITS	0x8000000000000000LL
 
-/*
- * NB: value of these features equal to LNET_PROTO_PING_VERSION_x
- * of old LNet, so there shouldn't be any compatibility issue
- */
-#define LNET_PING_FEAT_INVAL		(0)		/* no feature */
-#define LNET_PING_FEAT_BASE		BIT(0)	/* just a ping */
-#define LNET_PING_FEAT_NI_STATUS	BIT(1)	/* return NI status */
-#define LNET_PING_FEAT_RTE_DISABLED	BIT(2)	/* Routing enabled */
-#define LNET_PING_FEAT_MULTI_RAIL	BIT(3)	/* Multi-Rail aware */
-#define LNET_PING_FEAT_DISCOVERY	BIT(4)	/* Supports Discovery */
-
-/*
- * All ping feature bits fit to hit the wire.
- * In lnet_assert_wire_constants() this is compared against its open-coded
- * value, and in lnet_ping_target_update() it is used to verify that no
- * unknown bits have been set.
- * New feature bits can be added, just be aware that this does change the
- * over-the-wire protocol.
- */
-#define LNET_PING_FEAT_BITS		(LNET_PING_FEAT_BASE | \
-					 LNET_PING_FEAT_NI_STATUS | \
-					 LNET_PING_FEAT_RTE_DISABLED | \
-					 LNET_PING_FEAT_MULTI_RAIL | \
-					 LNET_PING_FEAT_DISCOVERY)
-
-#define LNET_PING_INFO_SIZE(NNIDS) \
-	offsetof(struct lnet_ping_info, pi_ni[NNIDS])
-#define LNET_PING_INFO_LONI(PINFO)	((PINFO)->pi_ni[0].ns_nid)
-#define LNET_PING_INFO_SEQNO(PINFO)	((PINFO)->pi_ni[0].ns_status)
-
 /*
  * Descriptor of a ping info buffer: keep a separate indicator of the
  * size and a reference count. The type is used both as a source and
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
index 6ee60d07ff84..e0e4fd259795 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
@@ -190,6 +190,31 @@ struct lnet_hdr {
 	} msg;
 } __packed;
 
+/*
+ * NB: value of these features equal to LNET_PROTO_PING_VERSION_x
+ * of old LNet, so there shouldn't be any compatibility issue
+ */
+#define LNET_PING_FEAT_INVAL		(0)		/* no feature */
+#define LNET_PING_FEAT_BASE		(1 << 0)	/* just a ping */
+#define LNET_PING_FEAT_NI_STATUS	(1 << 1)	/* return NI status */
+#define LNET_PING_FEAT_RTE_DISABLED	(1 << 2)	/* Routing enabled */
+#define LNET_PING_FEAT_MULTI_RAIL	(1 << 3)	/* Multi-Rail aware */
+#define LNET_PING_FEAT_DISCOVERY	(1 << 4)	/* Supports Discovery */
+
+/*
+ * All ping feature bits fit to hit the wire.
+ * In lnet_assert_wire_constants() this is compared against its open-coded
+ * value, and in lnet_ping_target_update() it is used to verify that no
+ * unknown bits have been set.
+ * New feature bits can be added, just be aware that this does change the
+ * over-the-wire protocol.
+ */
+#define LNET_PING_FEAT_BITS		(LNET_PING_FEAT_BASE | \
+					 LNET_PING_FEAT_NI_STATUS | \
+					 LNET_PING_FEAT_RTE_DISABLED | \
+					 LNET_PING_FEAT_MULTI_RAIL | \
+					 LNET_PING_FEAT_DISCOVERY)
+
 /*
  * A HELLO message contains a magic number and protocol version
  * code in the header's dest_nid, the peer's NID in the src_nid, and
@@ -246,6 +271,11 @@ struct lnet_ping_info {
 	struct lnet_ni_status	pi_ni[0];
 } __packed;
 
+#define LNET_PING_INFO_SIZE(NNIDS) \
+	offsetof(struct lnet_ping_info, pi_ni[NNIDS])
+#define LNET_PING_INFO_LONI(PINFO)	((PINFO)->pi_ni[0].ns_nid)
+#define LNET_PING_INFO_SEQNO(PINFO)	((PINFO)->pi_ni[0].ns_status)
+
 struct lnet_counters {
 	__u32	msgs_alloc;
 	__u32	msgs_max;
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 68af723bc6a1..d81501f4c282 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -313,6 +313,44 @@ static void lnet_assert_wire_constants(void)
 	BUILD_BUG_ON((int)sizeof(((struct lnet_hdr *)0)->msg.hello.incarnation) != 8);
 	BUILD_BUG_ON((int)offsetof(struct lnet_hdr, msg.hello.type) != 40);
 	BUILD_BUG_ON((int)sizeof(((struct lnet_hdr *)0)->msg.hello.type) != 4);
+
+	/* Checks for struct lnet_ni_status and related constants */
+	BUILD_BUG_ON(LNET_NI_STATUS_INVALID != 0x00000000);
+	BUILD_BUG_ON(LNET_NI_STATUS_UP != 0x15aac0de);
+	BUILD_BUG_ON(LNET_NI_STATUS_DOWN != 0xdeadface);
+
+	/* Checks for struct lnet_ni_status */
+	BUILD_BUG_ON((int)sizeof(struct lnet_ni_status) != 16);
+	BUILD_BUG_ON((int)offsetof(struct lnet_ni_status, ns_nid) != 0);
+	BUILD_BUG_ON((int)sizeof(((struct lnet_ni_status *)0)->ns_nid) != 8);
+	BUILD_BUG_ON((int)offsetof(struct lnet_ni_status, ns_status) != 8);
+	BUILD_BUG_ON((int)sizeof(((struct lnet_ni_status *)0)->ns_status) != 4);
+	BUILD_BUG_ON((int)offsetof(struct lnet_ni_status, ns_unused) != 12);
+	BUILD_BUG_ON((int)sizeof(((struct lnet_ni_status *)0)->ns_unused) != 4);
+
+	/* Checks for struct lnet_ping_info and related constants */
+	BUILD_BUG_ON(LNET_PROTO_PING_MAGIC != 0x70696E67);
+	BUILD_BUG_ON(LNET_PING_FEAT_INVAL != 0);
+	BUILD_BUG_ON(LNET_PING_FEAT_BASE != 1);
+	BUILD_BUG_ON(LNET_PING_FEAT_NI_STATUS != 2);
+	BUILD_BUG_ON(LNET_PING_FEAT_RTE_DISABLED != 4);
+	BUILD_BUG_ON(LNET_PING_FEAT_MULTI_RAIL != 8);
+	BUILD_BUG_ON(LNET_PING_FEAT_DISCOVERY != 16);
+	BUILD_BUG_ON(LNET_PING_FEAT_BITS != 31);
+
+	/* Checks for struct lnet_ping_info */
+	BUILD_BUG_ON((int)sizeof(struct lnet_ping_info) != 16);
+	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_magic) != 0);
+	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_magic) != 4);
+	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_features) != 4);
+	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_features)
+		     != 4);
+	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_pid) != 8);
+	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_pid) != 4);
+	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_nnis) != 12);
+	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_nnis) != 4);
+	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_ni) != 16);
+	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_ni) != 0);
 }
 
 static struct lnet_lnd *


* [lustre-devel] [PATCH 07/24] lustre: lnet: cleanup of lnet_peer_ni_addref/decref_locked()
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (8 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 03/24] lustre: lnet: add struct lnet_ping_buffer NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:38   ` James Simmons
  2018-10-14 19:39   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 11/24] lustre: lnet: introduce LNET_PEER_MULTI_RAIL flag bit NeilBrown
                   ` (14 subsequent siblings)
  24 siblings, 2 replies; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Address style issues in lnet_peer_ni_addref_locked() and
lnet_peer_ni_decref_locked(). In the latter routine, replace
a sequence of atomic_dec()/atomic_read() with atomic_dec_and_test().

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25777
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 2e2b5ed27116..f15f5c9c9a25 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -323,8 +323,7 @@ static inline void
 lnet_peer_ni_decref_locked(struct lnet_peer_ni *lp)
 {
 	LASSERT(atomic_read(&lp->lpni_refcount) > 0);
-	atomic_dec(&lp->lpni_refcount);
-	if (atomic_read(&lp->lpni_refcount) == 0)
+	if (atomic_dec_and_test(&lp->lpni_refcount))
 		lnet_destroy_peer_ni_locked(lp);
 }
 


* [lustre-devel] [PATCH 08/24] lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer()
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:55   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 09/24] lustre: lnet: refactor lnet_del_peer_ni() NeilBrown
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Rename lnet_add_peer_ni_to_peer() to lnet_add_peer_ni(), and
lnet_del_peer_ni_from_peer() to lnet_del_peer_ni().  This brings
the function names closer to the ioctls they implement:
IOCTL_LIBCFS_ADD_PEER_NI and IOCTL_LIBCFS_DEL_PEER_NI. These
names are also a more accurate description of their effect: adding
an lnet_peer_ni to, or deleting one from, LNet.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25778
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    4 ++--
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   10 +++++----
 drivers/staging/lustre/lnet/lnet/peer.c            |   22 +++++++++++++++-----
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index f15f5c9c9a25..69f45a76f1cc 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -682,8 +682,8 @@ struct lnet_peer_net *lnet_peer_get_net_locked(struct lnet_peer *peer,
 					       u32 net_id);
 bool lnet_peer_is_ni_pref_locked(struct lnet_peer_ni *lpni,
 				 struct lnet_ni *ni);
-int lnet_add_peer_ni_to_peer(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
-int lnet_del_peer_ni_from_peer(lnet_nid_t key_nid, lnet_nid_t nid);
+int lnet_add_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
+int lnet_del_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid);
 int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
 		       bool *mr,
 		       struct lnet_peer_ni_credit_info __user *peer_ni_info,
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index d81501f4c282..d64ae2939abc 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -2848,9 +2848,9 @@ LNetCtl(unsigned int cmd, void *arg)
 			return -EINVAL;
 
 		mutex_lock(&the_lnet.ln_api_mutex);
-		rc = lnet_add_peer_ni_to_peer(cfg->prcfg_prim_nid,
-					      cfg->prcfg_cfg_nid,
-					      cfg->prcfg_mr);
+		rc = lnet_add_peer_ni(cfg->prcfg_prim_nid,
+				      cfg->prcfg_cfg_nid,
+				      cfg->prcfg_mr);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
 	}
@@ -2862,8 +2862,8 @@ LNetCtl(unsigned int cmd, void *arg)
 			return -EINVAL;
 
 		mutex_lock(&the_lnet.ln_api_mutex);
-		rc = lnet_del_peer_ni_from_peer(cfg->prcfg_prim_nid,
-						cfg->prcfg_cfg_nid);
+		rc = lnet_del_peer_ni(cfg->prcfg_prim_nid,
+				      cfg->prcfg_cfg_nid);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
 	}
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index ebb84356302f..bbf07008dbb0 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -891,14 +891,16 @@ lnet_peer_ni_add_non_mr(lnet_nid_t nid)
 }
 
 /*
+ * Implementation of IOC_LIBCFS_ADD_PEER_NI.
+ *
  * This API handles the following combinations:
- *	Create a primary NI if only the prim_nid is provided
- *	Create or add an lpni to a primary NI. Primary NI must've already
- *	been created
- *	Create a non-MR peer.
+ *   Create a primary NI if only the prim_nid is provided
+ *   Create or add an lpni to a primary NI. Primary NI must've already
+ *   been created
+ *   Create a non-MR peer.
  */
 int
-lnet_add_peer_ni_to_peer(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
+lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
 {
 	/*
 	 * Caller trying to setup an MR like peer hierarchy but
@@ -929,8 +931,16 @@ lnet_add_peer_ni_to_peer(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
 	return 0;
 }
 
+/*
+ * Implementation of IOC_LIBCFS_DEL_PEER_NI.
+ *
+ * This API handles the following combinations:
+ *   Delete a NI from a peer if both prim_nid and nid are provided.
+ *   Delete a peer if only prim_nid is provided.
+ *   Delete a peer if its primary nid is provided.
+ */
 int
-lnet_del_peer_ni_from_peer(lnet_nid_t prim_nid, lnet_nid_t nid)
+lnet_del_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid)
 {
 	lnet_nid_t local_nid;
 	struct lnet_peer *peer;


* [lustre-devel] [PATCH 09/24] lustre: lnet: refactor lnet_del_peer_ni()
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
  2018-10-07 23:19 ` [lustre-devel] [PATCH 08/24] lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer() NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 19:58   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 02/24] lustre: lnet: configure lnet_interfaces_max tunable from dlc NeilBrown
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Refactor lnet_del_peer_ni(). In particular break out the code
that removes an lnet_peer_ni from an lnet_peer and put it into
a separate function, lnet_peer_del_nid().

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25779
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lnet/lnet/peer.c |   96 +++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 25 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index bbf07008dbb0..30a2486712e4 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -254,7 +254,7 @@ lnet_peer_ni_del_locked(struct lnet_peer_ni *lpni)
 	 *
 	 * The last reference may be lost in a place where the
 	 * lnet_net_lock locks only a single cpt, and that cpt may not
-	 * be lpni->lpni_cpt. So the zombie list of this peer_table
+	 * be lpni->lpni_cpt. So the zombie list of lnet_peer_table
 	 * has its own lock.
 	 */
 	spin_lock(&ptable->pt_zombie_lock);
@@ -340,6 +340,61 @@ lnet_peer_del_locked(struct lnet_peer *peer)
 	return rc2;
 }
 
+static int
+lnet_peer_del(struct lnet_peer *peer)
+{
+	lnet_net_lock(LNET_LOCK_EX);
+	lnet_peer_del_locked(peer);
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	return 0;
+}
+
+/*
+ * Delete a NID from a peer.
+ * Implements a few sanity checks.
+ * Call with ln_api_mutex held.
+ */
+static int
+lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid)
+{
+	struct lnet_peer *lp2;
+	struct lnet_peer_ni *lpni;
+
+	lpni = lnet_find_peer_ni_locked(nid);
+	if (!lpni) {
+		CERROR("Cannot remove unknown nid %s from peer %s\n",
+		       libcfs_nid2str(nid),
+		       libcfs_nid2str(lp->lp_primary_nid));
+		return -ENOENT;
+	}
+	lnet_peer_ni_decref_locked(lpni);
+	lp2 = lpni->lpni_peer_net->lpn_peer;
+	if (lp2 != lp) {
+		CERROR("Nid %s is attached to peer %s, not peer %s\n",
+		       libcfs_nid2str(nid),
+		       libcfs_nid2str(lp2->lp_primary_nid),
+		       libcfs_nid2str(lp->lp_primary_nid));
+		return -EINVAL;
+	}
+
+	/*
+	 * This function only allows deletion of the primary NID if it
+	 * is the only NID.
+	 */
+	if (nid == lp->lp_primary_nid && lnet_get_num_peer_nis(lp) != 1) {
+		CERROR("Cannot delete primary NID %s from multi-NID peer\n",
+		       libcfs_nid2str(nid));
+		return -EINVAL;
+	}
+
+	lnet_net_lock(LNET_LOCK_EX);
+	lnet_peer_ni_del_locked(lpni);
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	return 0;
+}
+
 static void
 lnet_peer_table_cleanup_locked(struct lnet_net *net,
 			       struct lnet_peer_table *ptable)
@@ -938,45 +993,36 @@ lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
  *   Delete a NI from a peer if both prim_nid and nid are provided.
  *   Delete a peer if only prim_nid is provided.
  *   Delete a peer if its primary nid is provided.
+ *
+ * The caller must hold ln_api_mutex. This prevents the peer from
+ * being modified/deleted by a different thread.
  */
 int
 lnet_del_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid)
 {
-	lnet_nid_t local_nid;
-	struct lnet_peer *peer;
+	struct lnet_peer *lp;
 	struct lnet_peer_ni *lpni;
-	int rc;
 
 	if (prim_nid == LNET_NID_ANY)
 		return -EINVAL;
 
-	local_nid = (nid != LNET_NID_ANY) ? nid : prim_nid;
-
-	lpni = lnet_find_peer_ni_locked(local_nid);
+	lpni = lnet_find_peer_ni_locked(prim_nid);
 	if (!lpni)
-		return -EINVAL;
+		return -ENOENT;
 	lnet_peer_ni_decref_locked(lpni);
+	lp = lpni->lpni_peer_net->lpn_peer;
 
-	peer = lpni->lpni_peer_net->lpn_peer;
-	LASSERT(peer);
-
-	if (peer->lp_primary_nid == lpni->lpni_nid) {
-		/*
-		 * deleting the primary ni is equivalent to deleting the
-		 * entire peer
-		 */
-		lnet_net_lock(LNET_LOCK_EX);
-		rc = lnet_peer_del_locked(peer);
-		lnet_net_unlock(LNET_LOCK_EX);
-
-		return rc;
+	if (prim_nid != lp->lp_primary_nid) {
+		CDEBUG(D_NET, "prim_nid %s is not primary for peer %s\n",
+		       libcfs_nid2str(prim_nid),
+		       libcfs_nid2str(lp->lp_primary_nid));
+		return -ENODEV;
 	}
 
-	lnet_net_lock(LNET_LOCK_EX);
-	rc = lnet_peer_ni_del_locked(lpni);
-	lnet_net_unlock(LNET_LOCK_EX);
+	if (nid == LNET_NID_ANY || nid == lp->lp_primary_nid)
+		return lnet_peer_del(lp);
 
-	return rc;
+	return lnet_peer_del_nid(lp, nid);
 }
 
 void


* [lustre-devel] [PATCH 10/24] lustre: lnet: refactor lnet_add_peer_ni()
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (4 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 04/24] lustre: lnet: automatic sizing of router pinger buffers NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 20:02   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 01/24] lustre: lnet: add lnet_interfaces_max tunable NeilBrown
                   ` (18 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Refactor lnet_add_peer_ni() and the functions called by it. In
particular, lnet_peer_add_nid() adds an lnet_peer_ni to an
existing lnet_peer, lnet_peer_add() adds a new lnet_peer.

lnet_find_or_create_peer_locked() is removed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25780
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    1 
 drivers/staging/lustre/lnet/lnet/lib-move.c        |   13 +
 drivers/staging/lustre/lnet/lnet/peer.c            |  230 +++++++-------------
 3 files changed, 92 insertions(+), 152 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 69f45a76f1cc..fc748ffa251d 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -668,7 +668,6 @@ u32 lnet_get_dlc_seq_locked(void);
 struct lnet_peer_ni *lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 						  struct lnet_peer_net *peer_net,
 						  struct lnet_peer_ni *prev);
-struct lnet_peer *lnet_find_or_create_peer_locked(lnet_nid_t dst_nid, int cpt);
 struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, int cpt);
 struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
 struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index e8c021622f91..59ae8d0649e5 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -1262,11 +1262,18 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 		return -ESHUTDOWN;
 	}
 
-	peer = lnet_find_or_create_peer_locked(dst_nid, cpt);
-	if (IS_ERR(peer)) {
+	/*
+	 * lnet_nid2peerni_locked() is the path that will find an
+	 * existing peer_ni, or create one and mark it as having been
+	 * created due to network traffic.
+	 */
+	lpni = lnet_nid2peerni_locked(dst_nid, cpt);
+	if (IS_ERR(lpni)) {
 		lnet_net_unlock(cpt);
-		return PTR_ERR(peer);
+		return PTR_ERR(lpni);
 	}
+	peer = lpni->lpni_peer_net->lpn_peer;
+	lnet_peer_ni_decref_locked(lpni);
 
 	/* If peer is not healthy then can not send anything to it */
 	if (!lnet_is_peer_healthy_locked(peer)) {
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 30a2486712e4..6b7ca5c361b8 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -541,25 +541,6 @@ lnet_find_peer_ni_locked(lnet_nid_t nid)
 	return lpni;
 }
 
-struct lnet_peer *
-lnet_find_or_create_peer_locked(lnet_nid_t dst_nid, int cpt)
-{
-	struct lnet_peer_ni *lpni;
-	struct lnet_peer *lp;
-
-	lpni = lnet_find_peer_ni_locked(dst_nid);
-	if (!lpni) {
-		lpni = lnet_nid2peerni_locked(dst_nid, cpt);
-		if (IS_ERR(lpni))
-			return ERR_CAST(lpni);
-	}
-
-	lp = lpni->lpni_peer_net->lpn_peer;
-	lnet_peer_ni_decref_locked(lpni);
-
-	return lp;
-}
-
 struct lnet_peer_ni *
 lnet_get_peer_ni_idx_locked(int idx, struct lnet_peer_net **lpn,
 			    struct lnet_peer **lp)
@@ -774,131 +755,95 @@ lnet_peer_setup_hierarchy(struct lnet_peer *lp, struct lnet_peer_ni
 	return -ENOMEM;
 }
 
+/*
+ * Create a new peer, with nid as its primary nid.
+ *
+ * It is not an error if the peer already exists, provided that the
+ * given nid is the primary NID.
+ *
+ * Call with the lnet_api_mutex held.
+ */
 static int
-lnet_add_prim_lpni(lnet_nid_t nid)
+lnet_peer_add(lnet_nid_t nid, bool mr)
 {
-	int rc;
-	struct lnet_peer *peer;
+	struct lnet_peer *lp;
 	struct lnet_peer_ni *lpni;
 
 	LASSERT(nid != LNET_NID_ANY);
 
 	/*
-	 * lookup the NID and its peer
-	 *  if the peer doesn't exist, create it.
-	 *  if this is a non-MR peer then change its state to MR and exit.
-	 *  if this is an MR peer and it's a primary NI: NO-OP.
-	 *  if this is an MR peer and it's not a primary NI. Operation not
-	 *     allowed.
-	 *
-	 * The adding and deleting of peer nis is being serialized through
-	 * the api_mutex. So we can look up peers with the mutex locked
-	 * safely. Only when we need to change the ptable, do we need to
-	 * exclusively lock the lnet_net_lock()
+	 * No need for the lnet_net_lock here, because the
+	 * lnet_api_mutex is held.
 	 */
 	lpni = lnet_find_peer_ni_locked(nid);
 	if (!lpni) {
-		rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
+		int rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
 		if (rc != 0)
 			return rc;
 		lpni = lnet_find_peer_ni_locked(nid);
+		LASSERT(lpni);
 	}
-
-	LASSERT(lpni);
-
+	lp = lpni->lpni_peer_net->lpn_peer;
 	lnet_peer_ni_decref_locked(lpni);
 
-	peer = lpni->lpni_peer_net->lpn_peer;
-
-	/*
-	 * If we found a lpni with the same nid as the NID we're trying to
-	 * create, then we're trying to create an already existing lpni
-	 * that belongs to a different peer
-	 */
-	if (peer->lp_primary_nid != nid)
+	/* A found peer must have this primary NID */
+	if (lp->lp_primary_nid != nid)
 		return -EEXIST;
 
 	/*
-	 * if we found an lpni that is not a multi-rail, which could occur
+	 * If we found an lpni that is not a multi-rail, which could occur
 	 * if lpni is already created as a non-mr lpni or we just created
 	 * it, then make sure you indicate that this lpni is a primary mr
 	 * capable peer.
 	 *
 	 * TODO: update flags if necessary
 	 */
-	if (!peer->lp_multi_rail && peer->lp_primary_nid == nid)
-		peer->lp_multi_rail = true;
+	if (mr && !lp->lp_multi_rail) {
+		lp->lp_multi_rail = true;
+	} else if (!mr && lp->lp_multi_rail) {
+		/* The mr state is sticky. */
+		CDEBUG(D_NET, "Cannot clear multi-flag from peer %s\n",
+		       libcfs_nid2str(nid));
+	}
 
-	return rc;
+	return 0;
 }
 
 static int
-lnet_add_peer_ni_to_prim_lpni(lnet_nid_t prim_nid, lnet_nid_t nid)
+lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
 {
-	struct lnet_peer *peer, *primary_peer;
-	struct lnet_peer_ni *lpni = NULL, *klpni = NULL;
-
-	LASSERT(prim_nid != LNET_NID_ANY && nid != LNET_NID_ANY);
+	struct lnet_peer_ni *lpni;
 
-	/*
-	 * key nid must be created by this point. If not then this
-	 * operation is not permitted
-	 */
-	klpni = lnet_find_peer_ni_locked(prim_nid);
-	if (!klpni)
-		return -ENOENT;
+	LASSERT(lp);
+	LASSERT(nid != LNET_NID_ANY);
 
-	lnet_peer_ni_decref_locked(klpni);
+	if (!mr && !lp->lp_multi_rail) {
+		CERROR("Cannot add nid %s to non-multi-rail peer %s\n",
+		       libcfs_nid2str(nid),
+		       libcfs_nid2str(lp->lp_primary_nid));
+		return -EPERM;
+	}
 
-	primary_peer = klpni->lpni_peer_net->lpn_peer;
+	if (!lp->lp_multi_rail)
+		lp->lp_multi_rail = true;
 
 	lpni = lnet_find_peer_ni_locked(nid);
-	if (lpni) {
-		lnet_peer_ni_decref_locked(lpni);
-
-		peer = lpni->lpni_peer_net->lpn_peer;
-		/*
-		 * lpni already exists in the system but it belongs to
-		 * a different peer. We can't re-added it
-		 */
-		if (peer->lp_primary_nid != prim_nid && peer->lp_multi_rail) {
-			CERROR("Cannot add NID %s owned by peer %s to peer %s\n",
-			       libcfs_nid2str(lpni->lpni_nid),
-			       libcfs_nid2str(peer->lp_primary_nid),
-			       libcfs_nid2str(prim_nid));
-			return -EEXIST;
-		} else if (peer->lp_primary_nid == prim_nid) {
-			/*
-			 * found a peer_ni that is already part of the
-			 * peer. This is a no-op operation.
-			 */
-			return 0;
-		}
-
-		/*
-		 * TODO: else if (peer->lp_primary_nid != prim_nid &&
-		 *		  !peer->lp_multi_rail)
-		 * peer is not an MR peer and it will be moved in the next
-		 * step to klpni, so update its flags accordingly.
-		 * lnet_move_peer_ni()
-		 */
-
-		/*
-		 * TODO: call lnet_update_peer() from here to update the
-		 * flags. This is the case when the lpni you're trying to
-		 * add is already part of the peer. This could've been
-		 * added by the DD previously, so go ahead and do any
-		 * updates to the state if necessary
-		 */
+	if (!lpni)
+		return lnet_peer_setup_hierarchy(lp, NULL, nid);
 
+	if (lpni->lpni_peer_net->lpn_peer != lp) {
+		struct lnet_peer *lp2 = lpni->lpni_peer_net->lpn_peer;
+		CERROR("Cannot add NID %s owned by peer %s to peer %s\n",
+		       libcfs_nid2str(lpni->lpni_nid),
+		       libcfs_nid2str(lp2->lp_primary_nid),
+		       libcfs_nid2str(lp->lp_primary_nid));
+		return -EEXIST;
 	}
 
-	/*
-	 * When we get here we either have found an existing lpni, which
-	 * we can switch to the new peer. Or we need to create one and
-	 * add it to the new peer
-	 */
-	return lnet_peer_setup_hierarchy(primary_peer, lpni, nid);
+	CDEBUG(D_NET, "NID %s is already owned by peer %s\n",
+	       libcfs_nid2str(lpni->lpni_nid),
+	       libcfs_nid2str(lp->lp_primary_nid));
+	return 0;
 }
 
 /*
@@ -929,61 +874,50 @@ lnet_peer_ni_traffic_add(lnet_nid_t nid)
 	return rc;
 }
 
-static int
-lnet_peer_ni_add_non_mr(lnet_nid_t nid)
-{
-	struct lnet_peer_ni *lpni;
-
-	lpni = lnet_find_peer_ni_locked(nid);
-	if (lpni) {
-		CERROR("Cannot add %s as non-mr when it already exists\n",
-		       libcfs_nid2str(nid));
-		lnet_peer_ni_decref_locked(lpni);
-		return -EEXIST;
-	}
-
-	return lnet_peer_setup_hierarchy(NULL, NULL, nid);
-}
-
 /*
  * Implementation of IOC_LIBCFS_ADD_PEER_NI.
  *
  * This API handles the following combinations:
- *   Create a primary NI if only the prim_nid is provided
- *   Create or add an lpni to a primary NI. Primary NI must've already
- *   been created
- *   Create a non-MR peer.
+ *   Create a peer with its primary NI if only the prim_nid is provided
+ *   Add a NID to a peer identified by the prim_nid. The peer identified
+ *   by the prim_nid must already exist.
+ *   The peer being created may be non-MR.
+ *
+ * The caller must hold ln_api_mutex. This prevents the peer from
+ * being created/modified/deleted by a different thread.
  */
 int
 lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
 {
+	struct lnet_peer *lp = NULL;
+	struct lnet_peer_ni *lpni;
+
+	/* The prim_nid must always be specified */
+	if (prim_nid == LNET_NID_ANY)
+		return -EINVAL;
+
 	/*
-	 * Caller trying to setup an MR like peer hierarchy but
-	 * specifying it to be non-MR. This is not allowed.
+	 * If nid isn't specified, we must create a new peer with
+	 * prim_nid as its primary nid.
 	 */
-	if (prim_nid != LNET_NID_ANY &&
-	    nid != LNET_NID_ANY && !mr)
-		return -EPERM;
-
-	/* Add the primary NID of a peer */
-	if (prim_nid != LNET_NID_ANY &&
-	    nid == LNET_NID_ANY && mr)
-		return lnet_add_prim_lpni(prim_nid);
+	if (nid == LNET_NID_ANY)
+		return lnet_peer_add(prim_nid, mr);
 
-	/* Add a NID to an existing peer */
-	if (prim_nid != LNET_NID_ANY &&
-	    nid != LNET_NID_ANY && mr)
-		return lnet_add_peer_ni_to_prim_lpni(prim_nid, nid);
+	/* Look up the prim_nid, which must exist. */
+	lpni = lnet_find_peer_ni_locked(prim_nid);
+	if (!lpni)
+		return -ENOENT;
+	lnet_peer_ni_decref_locked(lpni);
+	lp = lpni->lpni_peer_net->lpn_peer;
 
-	/* Add a non-MR peer NI */
-	if (((prim_nid != LNET_NID_ANY &&
-	      nid == LNET_NID_ANY) ||
-	     (prim_nid == LNET_NID_ANY &&
-	      nid != LNET_NID_ANY)) && !mr)
-		return lnet_peer_ni_add_non_mr(prim_nid != LNET_NID_ANY ?
-							 prim_nid : nid);
+	if (lp->lp_primary_nid != prim_nid) {
+		CDEBUG(D_NET, "prim_nid %s is not primary for peer %s\n",
+		       libcfs_nid2str(prim_nid),
+		       libcfs_nid2str(lp->lp_primary_nid));
+		return -ENODEV;
+	}
 
-	return 0;
+	return lnet_peer_add_nid(lp, nid, mr);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 11/24] lustre: lnet: introduce LNET_PEER_MULTI_RAIL flag bit
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (9 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 07/24] lustre: lnet: cleanup of lnet_peer_ni_addref/decref_locked() NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 20:11   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 16/24] lustre: lnet: add discovery thread NeilBrown
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add lp_state as a flag word to lnet_peer, and add lp_lock
to protect it. This lock needs to be taken whenever the
field is updated, because setting or clearing a bit is
a read-modify-write cycle.

The lp_multi_rail field is removed; its function is
replaced by the new LNET_PEER_MULTI_RAIL flag bit.

The helper lnet_peer_is_multi_rail() tests the bit.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25781
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    6 +++++
 .../staging/lustre/include/linux/lnet/lib-types.h  |   11 ++++++++--
 drivers/staging/lustre/lnet/lnet/lib-move.c        |    9 +++++---
 drivers/staging/lustre/lnet/lnet/peer.c            |   22 +++++++++++++-------
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index fc748ffa251d..75b47628c70e 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -757,4 +757,10 @@ lnet_peer_set_alive(struct lnet_peer_ni *lp)
 		lnet_notify_locked(lp, 0, 1, lp->lpni_last_alive);
 }
 
+static inline bool
+lnet_peer_is_multi_rail(struct lnet_peer *lp)
+{
+	return lp->lp_state & LNET_PEER_MULTI_RAIL;
+}
+
 #endif
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index f28fa5342914..602978a1c86e 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -467,6 +467,8 @@ struct lnet_peer_ni {
 	atomic_t		 lpni_refcount;
 	/* CPT this peer attached on */
 	int			 lpni_cpt;
+	/* state flags -- protected by lpni_lock */
+	unsigned int		lpni_state;
 	/* # refs from lnet_route::lr_gateway */
 	int			 lpni_rtr_refcount;
 	/* sequence number used to round robin over peer nis within a net */
@@ -497,10 +499,15 @@ struct lnet_peer {
 	/* primary NID of the peer */
 	lnet_nid_t		lp_primary_nid;
 
-	/* peer is Multi-Rail enabled peer */
-	bool			lp_multi_rail;
+	/* lock protecting peer state flags */
+	spinlock_t		lp_lock;
+
+	/* peer state flags */
+	unsigned int		lp_state;
 };
 
+#define LNET_PEER_MULTI_RAIL	BIT(0)
+
 struct lnet_peer_net {
 	/* chain on peer block */
 	struct list_head	lpn_on_peer_list;
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index 59ae8d0649e5..0d0ad30bb164 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -1281,7 +1281,8 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 		return -EHOSTUNREACH;
 	}
 
-	if (!peer->lp_multi_rail && lnet_get_num_peer_nis(peer) > 1) {
+	if (!lnet_peer_is_multi_rail(peer) &&
+	    lnet_get_num_peer_nis(peer) > 1) {
 		lnet_net_unlock(cpt);
 		CERROR("peer %s is declared to be non MR capable, yet configured with more than one NID\n",
 		       libcfs_nid2str(dst_nid));
@@ -1307,7 +1308,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 
 	if (msg->msg_type == LNET_MSG_REPLY ||
 	    msg->msg_type == LNET_MSG_ACK ||
-	    !peer->lp_multi_rail ||
+	    !lnet_peer_is_multi_rail(peer) ||
 	    best_ni) {
 		/*
 		 * for replies we want to respond on the same peer_ni we
@@ -1354,7 +1355,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 				 * then use the best_gw found to send
 				 * the message to
 				 */
-				if (!peer->lp_multi_rail)
+				if (!lnet_peer_is_multi_rail(peer))
 					best_lpni = best_gw;
 				else
 					best_lpni = NULL;
@@ -1375,7 +1376,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 	 * if the peer is not MR capable, then we should always send to it
 	 * using the first NI in the NET we determined.
 	 */
-	if (!peer->lp_multi_rail) {
+	if (!lnet_peer_is_multi_rail(peer)) {
 		if (!best_lpni) {
 			lnet_net_unlock(cpt);
 			CERROR("no route to %s\n",
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 6b7ca5c361b8..cc2b926b76e4 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -182,6 +182,7 @@ lnet_peer_alloc(lnet_nid_t nid)
 
 	INIT_LIST_HEAD(&lp->lp_on_lnet_peer_list);
 	INIT_LIST_HEAD(&lp->lp_peer_nets);
+	spin_lock_init(&lp->lp_lock);
 	lp->lp_primary_nid = nid;
 
 	/* TODO: update flags */
@@ -798,13 +799,15 @@ lnet_peer_add(lnet_nid_t nid, bool mr)
 	 *
 	 * TODO: update flags if necessary
 	 */
-	if (mr && !lp->lp_multi_rail) {
-		lp->lp_multi_rail = true;
-	} else if (!mr && lp->lp_multi_rail) {
+	spin_lock(&lp->lp_lock);
+	if (mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
+		lp->lp_state |= LNET_PEER_MULTI_RAIL;
+	} else if (!mr && (lp->lp_state & LNET_PEER_MULTI_RAIL)) {
 		/* The mr state is sticky. */
-		CDEBUG(D_NET, "Cannot clear multi-flag from peer %s\n",
+		CDEBUG(D_NET, "Cannot clear multi-rail flag from peer %s\n",
 		       libcfs_nid2str(nid));
 	}
+	spin_unlock(&lp->lp_lock);
 
 	return 0;
 }
@@ -817,15 +820,18 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
 	LASSERT(lp);
 	LASSERT(nid != LNET_NID_ANY);
 
-	if (!mr && !lp->lp_multi_rail) {
+	spin_lock(&lp->lp_lock);
+	if (!mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
+		spin_unlock(&lp->lp_lock);
 		CERROR("Cannot add nid %s to non-multi-rail peer %s\n",
 		       libcfs_nid2str(nid),
 		       libcfs_nid2str(lp->lp_primary_nid));
 		return -EPERM;
 	}
 
-	if (!lp->lp_multi_rail)
-		lp->lp_multi_rail = true;
+	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL))
+		lp->lp_state |= LNET_PEER_MULTI_RAIL;
+	spin_unlock(&lp->lp_lock);
 
 	lpni = lnet_find_peer_ni_locked(nid);
 	if (!lpni)
@@ -1183,7 +1189,7 @@ int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
 		return -ENOENT;
 
 	*primary_nid = lp->lp_primary_nid;
-	*mr = lp->lp_multi_rail;
+	*mr = lnet_peer_is_multi_rail(lp);
 	*nid = lpni->lpni_nid;
 	snprintf(ni_info.cr_aliveness, LNET_MAX_STR_LEN, "NA");
 	if (lnet_isrouter(lpni) ||


* [lustre-devel] [PATCH 12/24] lustre: lnet: preferred NIs for non-Multi-Rail peers
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

When a node sends a message to a peer NI, there may be
a preferred local NI that should be the source of the
message. This is in particular the case for non-Multi-
Rail (NMR) peers, as an NMR peer depends in some cases
on the source address of a message to correctly identify
its origin. (This as opposed to using a UUID provided by
a higher protocol layer.)

Implement this by keeping an array of preferred local
NIDs in the lnet_peer_ni structure. The case where only
a single NID needs to be stored is optimized so that this
can be done without needing to allocate any memory.

A flag in the lnet_peer_ni, LNET_PEER_NI_NON_MR_PREF,
indicates that the preferred NI was automatically added
for an NMR peer. Note that a peer which has not been
explicitly configured as Multi-Rail will be treated as
non-Multi-Rail until proven otherwise. These automatic
preferences will be cleared if the peer is changed to
Multi-Rail.

- lnet_peer_ni_set_non_mr_pref_nid()
  set NMR preferred NI for peer_ni
- lnet_peer_ni_clr_non_mr_pref_nid()
  clear NMR preferred NI for peer_ni
- lnet_peer_clr_non_mr_pref_nids()
  clear NMR preferred NIs for all peer_ni

- lnet_peer_add_pref_nid()
  add a preferred NID
- lnet_peer_del_pref_nid()
  delete a preferred NID

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25782
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    7 -
 .../staging/lustre/include/linux/lnet/lib-types.h  |   10 +
 drivers/staging/lustre/lnet/lnet/lib-move.c        |   49 +++-
 drivers/staging/lustre/lnet/lnet/peer.c            |  257 +++++++++++++++++++-
 4 files changed, 285 insertions(+), 38 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 75b47628c70e..2864bd8a403b 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -668,7 +668,8 @@ u32 lnet_get_dlc_seq_locked(void);
 struct lnet_peer_ni *lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 						  struct lnet_peer_net *peer_net,
 						  struct lnet_peer_ni *prev);
-struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, int cpt);
+struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref,
+					    int cpt);
 struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
 struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
 void lnet_peer_net_added(struct lnet_net *net);
@@ -679,8 +680,8 @@ int lnet_peer_tables_create(void);
 void lnet_debug_peer(lnet_nid_t nid);
 struct lnet_peer_net *lnet_peer_get_net_locked(struct lnet_peer *peer,
 					       u32 net_id);
-bool lnet_peer_is_ni_pref_locked(struct lnet_peer_ni *lpni,
-				 struct lnet_ni *ni);
+bool lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid);
+int lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid);
 int lnet_add_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
 int lnet_del_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid);
 int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 602978a1c86e..eff2aed5e5c1 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -481,14 +481,20 @@ struct lnet_peer_ni {
 	unsigned int		 lpni_ping_feats;
 	/* routers on this peer */
 	struct list_head	 lpni_routes;
-	/* array of preferred local nids */
-	lnet_nid_t		*lpni_pref_nids;
+	/* preferred local nids: if only one, use lpni_pref.nid */
+	union lpni_pref {
+		lnet_nid_t	nid;
+		lnet_nid_t	*nids;
+	} lpni_pref;
 	/* number of preferred NIDs in lnpi_pref_nids */
 	u32			lpni_pref_nnids;
 	/* router checker state */
 	struct lnet_rc_data	*lpni_rcd;
 };
 
+/* Preferred path added due to traffic on non-MR peer_ni */
+#define LNET_PEER_NI_NON_MR_PREF	BIT(0)
+
 struct lnet_peer {
 	/* chain on global peer list */
 	struct list_head	lp_on_lnet_peer_list;
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index 0d0ad30bb164..99d8b22356bb 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -1267,7 +1267,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 	 * existing peer_ni, or create one and mark it as having been
 	 * created due to network traffic.
 	 */
-	lpni = lnet_nid2peerni_locked(dst_nid, cpt);
+	lpni = lnet_nid2peerni_locked(dst_nid, LNET_NID_ANY, cpt);
 	if (IS_ERR(lpni)) {
 		lnet_net_unlock(cpt);
 		return PTR_ERR(lpni);
@@ -1281,14 +1281,6 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 		return -EHOSTUNREACH;
 	}
 
-	if (!lnet_peer_is_multi_rail(peer) &&
-	    lnet_get_num_peer_nis(peer) > 1) {
-		lnet_net_unlock(cpt);
-		CERROR("peer %s is declared to be non MR capable, yet configured with more than one NID\n",
-		       libcfs_nid2str(dst_nid));
-		return -EINVAL;
-	}
-
 	/*
 	 * STEP 1: first jab at determining best_ni
 	 * if src_nid is explicitly specified, then best_ni is already
@@ -1373,8 +1365,14 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 	}
 
 	/*
-	 * if the peer is not MR capable, then we should always send to it
-	 * using the first NI in the NET we determined.
+	 * We must use a consistent source address when sending to a
+	 * non-MR peer. However, a non-MR peer can have multiple NIDs
+	 * on multiple networks, and we may even need to talk to this
+	 * peer on multiple networks -- certain types of
+	 * load-balancing configuration do this.
+	 *
+	 * So we need to pick the NI the peer prefers for this
+	 * particular network.
 	 */
 	if (!lnet_peer_is_multi_rail(peer)) {
 		if (!best_lpni) {
@@ -1384,10 +1382,26 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 			return -EHOSTUNREACH;
 		}
 
-		/* best ni could be set because src_nid was provided */
+		/* best ni is already set if src_nid was provided */
+		if (!best_ni) {
+			/* Get the target peer_ni */
+			peer_net = lnet_peer_get_net_locked(
+				peer, LNET_NIDNET(best_lpni->lpni_nid));
+			list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
+					    lpni_on_peer_net_list) {
+				if (lpni->lpni_pref_nnids == 0)
+					continue;
+				LASSERT(lpni->lpni_pref_nnids == 1);
+				best_ni = lnet_nid2ni_locked(
+					lpni->lpni_pref.nid, cpt);
+				break;
+			}
+		}
+		/* if best_ni is still not set just pick one */
 		if (!best_ni) {
 			best_ni = lnet_net2ni_locked(
 				best_lpni->lpni_net->net_id, cpt);
+			/* If there is no best_ni we don't have a route */
 			if (!best_ni) {
 				lnet_net_unlock(cpt);
 				CERROR("no path to %s from net %s\n",
@@ -1395,7 +1409,13 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 				       libcfs_net2str(best_lpni->lpni_net->net_id));
 				return -EHOSTUNREACH;
 			}
+			lpni = list_entry(peer_net->lpn_peer_nis.next,
+					  struct lnet_peer_ni,
+					  lpni_on_peer_net_list);
 		}
+		/* Set preferred NI if necessary. */
+		if (lpni->lpni_pref_nnids == 0)
+			lnet_peer_ni_set_non_mr_pref_nid(lpni, best_ni->ni_nid);
 	}
 
 	/*
@@ -1593,7 +1613,8 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 		 */
 		if (!lnet_is_peer_ni_healthy_locked(lpni))
 			continue;
-		ni_is_pref = lnet_peer_is_ni_pref_locked(lpni, best_ni);
+		ni_is_pref = lnet_peer_is_pref_nid_locked(lpni,
+							  best_ni->ni_nid);
 
 		/* if this is a preferred peer use it */
 		if (!preferred && ni_is_pref) {
@@ -2380,7 +2401,7 @@ lnet_parse(struct lnet_ni *ni, struct lnet_hdr *hdr, lnet_nid_t from_nid,
 	}
 
 	lnet_net_lock(cpt);
-	lpni = lnet_nid2peerni_locked(from_nid, cpt);
+	lpni = lnet_nid2peerni_locked(from_nid, ni->ni_nid, cpt);
 	if (IS_ERR(lpni)) {
 		lnet_net_unlock(cpt);
 		CERROR("%s, src %s: Dropping %s (error %ld looking up sender)\n",
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index cc2b926b76e4..44a2bf641260 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -617,18 +617,233 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 	return lpni;
 }
 
+/*
+ * Test whether a NI is a preferred NI for this peer_ni, i.e. whether
+ * this is a preferred point-to-point path. Call with lnet_net_lock in
+ * shared mode.
+ */
 bool
-lnet_peer_is_ni_pref_locked(struct lnet_peer_ni *lpni, struct lnet_ni *ni)
+lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid)
 {
 	int i;
 
+	if (lpni->lpni_pref_nnids == 0)
+		return false;
+	if (lpni->lpni_pref_nnids == 1)
+		return lpni->lpni_pref.nid == nid;
 	for (i = 0; i < lpni->lpni_pref_nnids; i++) {
-		if (lpni->lpni_pref_nids[i] == ni->ni_nid)
+		if (lpni->lpni_pref.nids[i] == nid)
 			return true;
 	}
 	return false;
 }
 
+/*
+ * Set a single ni as preferred, provided no preferred ni is already
+ * defined. Only to be used for non-multi-rail peer_ni.
+ */
+int
+lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid)
+{
+	int rc = 0;
+
+	spin_lock(&lpni->lpni_lock);
+	if (nid == LNET_NID_ANY) {
+		rc = -EINVAL;
+	} else if (lpni->lpni_pref_nnids > 0) {
+		rc = -EPERM;
+	} else if (lpni->lpni_pref_nnids == 0) {
+		lpni->lpni_pref.nid = nid;
+		lpni->lpni_pref_nnids = 1;
+		lpni->lpni_state |= LNET_PEER_NI_NON_MR_PREF;
+	}
+	spin_unlock(&lpni->lpni_lock);
+
+	CDEBUG(D_NET, "peer %s nid %s: %d\n",
+	       libcfs_nid2str(lpni->lpni_nid), libcfs_nid2str(nid), rc);
+	return rc;
+}
+
+/*
+ * Clear the preferred NID from a non-multi-rail peer_ni, provided
+ * this preference was set by lnet_peer_ni_set_non_mr_pref_nid().
+ */
+int
+lnet_peer_ni_clr_non_mr_pref_nid(struct lnet_peer_ni *lpni)
+{
+	int rc = 0;
+
+	spin_lock(&lpni->lpni_lock);
+	if (lpni->lpni_state & LNET_PEER_NI_NON_MR_PREF) {
+		lpni->lpni_pref_nnids = 0;
+		lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
+	} else if (lpni->lpni_pref_nnids == 0) {
+		rc = -ENOENT;
+	} else {
+		rc = -EPERM;
+	}
+	spin_unlock(&lpni->lpni_lock);
+
+	CDEBUG(D_NET, "peer %s: %d\n",
+	       libcfs_nid2str(lpni->lpni_nid), rc);
+	return rc;
+}
+
+/*
+ * Clear the preferred NIDs from a non-multi-rail peer.
+ */
+void
+lnet_peer_clr_non_mr_pref_nids(struct lnet_peer *lp)
+{
+	struct lnet_peer_ni *lpni = NULL;
+
+	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL)
+		lnet_peer_ni_clr_non_mr_pref_nid(lpni);
+}
+
+int
+lnet_peer_add_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid)
+{
+	lnet_nid_t *nids = NULL;
+	lnet_nid_t *oldnids = NULL;
+	struct lnet_peer *lp = lpni->lpni_peer_net->lpn_peer;
+	int size;
+	int i;
+	int rc = 0;
+
+	if (nid == LNET_NID_ANY) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (lpni->lpni_pref_nnids == 1 && lpni->lpni_pref.nid == nid) {
+		rc = -EEXIST;
+		goto out;
+	}
+
+	/* A non-MR node may have only one preferred NI per peer_ni */
+	if (lpni->lpni_pref_nnids > 0) {
+		if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
+			rc = -EPERM;
+			goto out;
+		}
+	}
+
+	if (lpni->lpni_pref_nnids != 0) {
+		size = sizeof(*nids) * (lpni->lpni_pref_nnids + 1);
+		nids = kzalloc_cpt(size, GFP_KERNEL, lpni->lpni_cpt);
+		if (!nids) {
+			rc = -ENOMEM;
+			goto out;
+		}
+		for (i = 0; i < lpni->lpni_pref_nnids; i++) {
+			if (lpni->lpni_pref.nids[i] == nid) {
+				kfree(nids);
+				rc = -EEXIST;
+				goto out;
+			}
+			nids[i] = lpni->lpni_pref.nids[i];
+		}
+		nids[i] = nid;
+	}
+
+	lnet_net_lock(LNET_LOCK_EX);
+	spin_lock(&lpni->lpni_lock);
+	if (lpni->lpni_pref_nnids == 0) {
+		lpni->lpni_pref.nid = nid;
+	} else {
+		oldnids = lpni->lpni_pref.nids;
+		lpni->lpni_pref.nids = nids;
+	}
+	lpni->lpni_pref_nnids++;
+	lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
+	spin_unlock(&lpni->lpni_lock);
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	kfree(oldnids);
+out:
+	if (rc == -EEXIST && (lpni->lpni_state & LNET_PEER_NI_NON_MR_PREF)) {
+		spin_lock(&lpni->lpni_lock);
+		lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
+		spin_unlock(&lpni->lpni_lock);
+	}
+	CDEBUG(D_NET, "peer %s nid %s: %d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), libcfs_nid2str(nid), rc);
+	return rc;
+}
+
+int
+lnet_peer_del_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid)
+{
+	lnet_nid_t *nids = NULL;
+	lnet_nid_t *oldnids = NULL;
+	struct lnet_peer *lp = lpni->lpni_peer_net->lpn_peer;
+	int size;
+	int i, j;
+	int rc = 0;
+
+	if (lpni->lpni_pref_nnids == 0) {
+		rc = -ENOENT;
+		goto out;
+	}
+
+	if (lpni->lpni_pref_nnids == 1) {
+		if (lpni->lpni_pref.nid != nid) {
+			rc = -ENOENT;
+			goto out;
+		}
+	} else if (lpni->lpni_pref_nnids == 2) {
+		if (lpni->lpni_pref.nids[0] != nid &&
+		    lpni->lpni_pref.nids[1] != nid) {
+			rc = -ENOENT;
+			goto out;
+		}
+	} else {
+		size = sizeof(*nids) * (lpni->lpni_pref_nnids - 1);
+		nids = kzalloc_cpt(size, GFP_KERNEL, lpni->lpni_cpt);
+		if (!nids) {
+			rc = -ENOMEM;
+			goto out;
+		}
+		for (i = 0, j = 0; i < lpni->lpni_pref_nnids; i++) {
+			if (lpni->lpni_pref.nids[i] == nid)
+				continue;
+			nids[j++] = lpni->lpni_pref.nids[i];
+		}
+		/* Check if we actually removed a nid. */
+		if (j == lpni->lpni_pref_nnids) {
+			kfree(nids);
+			rc = -ENOENT;
+			goto out;
+		}
+	}
+
+	lnet_net_lock(LNET_LOCK_EX);
+	spin_lock(&lpni->lpni_lock);
+	if (lpni->lpni_pref_nnids == 1) {
+		lpni->lpni_pref.nid = LNET_NID_ANY;
+	} else if (lpni->lpni_pref_nnids == 2) {
+		oldnids = lpni->lpni_pref.nids;
+		if (oldnids[0] == nid)
+			lpni->lpni_pref.nid = oldnids[1];
+		else
+			lpni->lpni_pref.nid = oldnids[0];
+	} else {
+		oldnids = lpni->lpni_pref.nids;
+		lpni->lpni_pref.nids = nids;
+	}
+	lpni->lpni_pref_nnids--;
+	lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
+	spin_unlock(&lpni->lpni_lock);
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	kfree(oldnids);
+out:
+	CDEBUG(D_NET, "peer %s nid %s: %d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), libcfs_nid2str(nid), rc);
+	return rc;
+}
+
 lnet_nid_t
 lnet_peer_primary_nid_locked(lnet_nid_t nid)
 {
@@ -653,7 +868,7 @@ LNetPrimaryNID(lnet_nid_t nid)
 	int cpt;
 
 	cpt = lnet_net_lock_current();
-	lpni = lnet_nid2peerni_locked(nid, cpt);
+	lpni = lnet_nid2peerni_locked(nid, LNET_NID_ANY, cpt);
 	if (IS_ERR(lpni)) {
 		rc = PTR_ERR(lpni);
 		goto out_unlock;
@@ -802,6 +1017,7 @@ lnet_peer_add(lnet_nid_t nid, bool mr)
 	spin_lock(&lp->lp_lock);
 	if (mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
 		lp->lp_state |= LNET_PEER_MULTI_RAIL;
+		lnet_peer_clr_non_mr_pref_nids(lp);
 	} else if (!mr && (lp->lp_state & LNET_PEER_MULTI_RAIL)) {
 		/* The mr state is sticky. */
 		CDEBUG(D_NET, "Cannot clear multi-rail flag from peer %s\n",
@@ -829,8 +1045,10 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
 		return -EPERM;
 	}
 
-	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL))
+	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
 		lp->lp_state |= LNET_PEER_MULTI_RAIL;
+		lnet_peer_clr_non_mr_pref_nids(lp);
+	}
 	spin_unlock(&lp->lp_lock);
 
 	lpni = lnet_find_peer_ni_locked(nid);
@@ -856,28 +1074,27 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
  * lpni creation initiated due to traffic either sending or receiving.
  */
 static int
-lnet_peer_ni_traffic_add(lnet_nid_t nid)
+lnet_peer_ni_traffic_add(lnet_nid_t nid, lnet_nid_t pref)
 {
 	struct lnet_peer_ni *lpni;
-	int rc = 0;
+	int rc;
 
 	if (nid == LNET_NID_ANY)
 		return -EINVAL;
 
 	/* lnet_net_lock is not needed here because ln_api_lock is held */
 	lpni = lnet_find_peer_ni_locked(nid);
-	if (lpni) {
-		/*
-		 * TODO: lnet_update_primary_nid() but not all of it
-		 * only indicate if we're converting this to MR capable
-		 * Can happen due to DD
-		 */
-		lnet_peer_ni_decref_locked(lpni);
-	} else {
+	if (!lpni) {
 		rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
+		if (rc)
+			return rc;
+		lpni = lnet_find_peer_ni_locked(nid);
 	}
+	if (pref != LNET_NID_ANY)
+		lnet_peer_ni_set_non_mr_pref_nid(lpni, pref);
+	lnet_peer_ni_decref_locked(lpni);
 
-	return rc;
+	return 0;
 }
 
 /*
@@ -984,6 +1201,8 @@ lnet_destroy_peer_ni_locked(struct lnet_peer_ni *lpni)
 	ptable->pt_zombies--;
 	spin_unlock(&ptable->pt_zombie_lock);
 
+	if (lpni->lpni_pref_nnids > 1)
+		kfree(lpni->lpni_pref.nids);
 	kfree(lpni);
 }
 
@@ -1006,7 +1225,7 @@ lnet_nid2peerni_ex(lnet_nid_t nid, int cpt)
 
 	lnet_net_unlock(cpt);
 
-	rc = lnet_peer_ni_traffic_add(nid);
+	rc = lnet_peer_ni_traffic_add(nid, LNET_NID_ANY);
 	if (rc) {
 		lpni = ERR_PTR(rc);
 		goto out_net_relock;
@@ -1022,7 +1241,7 @@ lnet_nid2peerni_ex(lnet_nid_t nid, int cpt)
 }
 
 struct lnet_peer_ni *
-lnet_nid2peerni_locked(lnet_nid_t nid, int cpt)
+lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref, int cpt)
 {
 	struct lnet_peer_ni *lpni = NULL;
 	int rc;
@@ -1061,7 +1280,7 @@ lnet_nid2peerni_locked(lnet_nid_t nid, int cpt)
 		goto out_mutex_unlock;
 	}
 
-	rc = lnet_peer_ni_traffic_add(nid);
+	rc = lnet_peer_ni_traffic_add(nid, pref);
 	if (rc) {
 		lpni = ERR_PTR(rc);
 		goto out_mutex_unlock;
@@ -1087,7 +1306,7 @@ lnet_debug_peer(lnet_nid_t nid)
 	cpt = lnet_cpt_of_nid(nid, NULL);
 	lnet_net_lock(cpt);
 
-	lp = lnet_nid2peerni_locked(nid, cpt);
+	lp = lnet_nid2peerni_locked(nid, LNET_NID_ANY, cpt);
 	if (IS_ERR(lp)) {
 		lnet_net_unlock(cpt);
 		CDEBUG(D_WARNING, "No peer %s\n", libcfs_nid2str(nid));


* [lustre-devel] [PATCH 13/24] lustre: lnet: add LNET_PEER_CONFIGURED flag
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add the LNET_PEER_CONFIGURED flag, which indicates that a peer
has been configured by DLC. This is used to enforce that only
DLC can modify such a peer.

This includes some further refactoring of the code that creates
or modifies peers to ensure that the flag is properly passed
through, set, and cleared.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25783
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |   12 +
 .../staging/lustre/include/linux/lnet/lib-types.h  |    1 
 drivers/staging/lustre/lnet/lnet/peer.c            |  426 +++++++++++++-------
 3 files changed, 290 insertions(+), 149 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 2864bd8a403b..563417510722 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -764,4 +764,16 @@ lnet_peer_is_multi_rail(struct lnet_peer *lp)
 	return lp->lp_state & LNET_PEER_MULTI_RAIL;
 }
 
+static inline bool
+lnet_peer_ni_is_configured(struct lnet_peer_ni *lpni)
+{
+	return lpni->lpni_peer_net->lpn_peer->lp_state & LNET_PEER_CONFIGURED;
+}
+
+static inline bool
+lnet_peer_ni_is_primary(struct lnet_peer_ni *lpni)
+{
+	return lpni->lpni_nid == lpni->lpni_peer_net->lpn_peer->lp_primary_nid;
+}
+
 #endif
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index eff2aed5e5c1..d1721fd01d93 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -513,6 +513,7 @@ struct lnet_peer {
 };
 
 #define LNET_PEER_MULTI_RAIL	BIT(0)
+#define LNET_PEER_CONFIGURED	BIT(1)
 
 struct lnet_peer_net {
 	/* chain on peer block */
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 44a2bf641260..09c1b5516f6b 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -191,10 +191,10 @@ lnet_peer_alloc(lnet_nid_t nid)
 }
 
 static void
-lnet_try_destroy_peer_hierarchy_locked(struct lnet_peer_ni *lpni)
+lnet_peer_detach_peer_ni(struct lnet_peer_ni *lpni)
 {
-	struct lnet_peer_net *peer_net;
-	struct lnet_peer *peer;
+	struct lnet_peer_net *lpn;
+	struct lnet_peer *lp;
 
 	/* TODO: could the below situation happen? accessing an already
 	 * destroyed peer?
@@ -203,24 +203,28 @@ lnet_try_destroy_peer_hierarchy_locked(struct lnet_peer_ni *lpni)
 	    !lpni->lpni_peer_net->lpn_peer)
 		return;
 
-	peer_net = lpni->lpni_peer_net;
-	peer = lpni->lpni_peer_net->lpn_peer;
+	lpn = lpni->lpni_peer_net;
+	lp = lpni->lpni_peer_net->lpn_peer;
+
+	CDEBUG(D_NET, "peer %s NID %s\n",
+	       libcfs_nid2str(lp->lp_primary_nid),
+	       libcfs_nid2str(lpni->lpni_nid));
 
 	list_del_init(&lpni->lpni_on_peer_net_list);
 	lpni->lpni_peer_net = NULL;
 
-	/* if peer_net is empty, then remove it from the peer */
-	if (list_empty(&peer_net->lpn_peer_nis)) {
-		list_del_init(&peer_net->lpn_on_peer_list);
-		peer_net->lpn_peer = NULL;
-		kfree(peer_net);
+	/* if lpn is empty, then remove it from the peer */
+	if (list_empty(&lpn->lpn_peer_nis)) {
+		list_del_init(&lpn->lpn_on_peer_list);
+		lpn->lpn_peer = NULL;
+		kfree(lpn);
 
 		/* If the peer is empty then remove it from the
 		 * the_lnet.ln_peers.
 		 */
-		if (list_empty(&peer->lp_peer_nets)) {
-			list_del_init(&peer->lp_on_lnet_peer_list);
-			kfree(peer);
+		if (list_empty(&lp->lp_peer_nets)) {
+			list_del_init(&lp->lp_on_lnet_peer_list);
+			kfree(lp);
 		}
 	}
 }
@@ -263,10 +267,10 @@ lnet_peer_ni_del_locked(struct lnet_peer_ni *lpni)
 	ptable->pt_zombies++;
 	spin_unlock(&ptable->pt_zombie_lock);
 
-	/* no need to keep this peer on the hierarchy anymore */
-	lnet_try_destroy_peer_hierarchy_locked(lpni);
+	/* no need to keep this peer_ni on the hierarchy anymore */
+	lnet_peer_detach_peer_ni(lpni);
 
-	/* decrement reference on peer */
+	/* decrement reference on peer_ni */
 	lnet_peer_ni_decref_locked(lpni);
 
 	return 0;
@@ -329,6 +333,8 @@ lnet_peer_del_locked(struct lnet_peer *peer)
 	struct lnet_peer_ni *lpni = NULL, *lpni2;
 	int rc = 0, rc2 = 0;
 
+	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(peer->lp_primary_nid));
+
 	lpni = lnet_get_next_peer_ni_locked(peer, NULL, lpni);
 	while (lpni) {
 		lpni2 = lnet_get_next_peer_ni_locked(peer, NULL, lpni);
@@ -352,31 +358,36 @@ lnet_peer_del(struct lnet_peer *peer)
 }
 
 /*
- * Delete a NID from a peer.
- * Implements a few sanity checks.
- * Call with ln_api_mutex held.
+ * Delete a NID from a peer. Call with ln_api_mutex held.
+ *
+ * Error codes:
+ *  -EPERM:  Non-DLC deletion from DLC-configured peer.
+ *  -ENOENT: No lnet_peer_ni corresponding to the nid.
+ *  -ECHILD: The lnet_peer_ni isn't connected to the peer.
+ *  -EBUSY:  The lnet_peer_ni is the primary, and not the only peer_ni.
  */
 static int
-lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid)
+lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid, unsigned int flags)
 {
-	struct lnet_peer *lp2;
 	struct lnet_peer_ni *lpni;
+	lnet_nid_t primary_nid = lp->lp_primary_nid;
+	int rc = 0;
 
+	if (!(flags & LNET_PEER_CONFIGURED)) {
+		if (lp->lp_state & LNET_PEER_CONFIGURED) {
+			rc = -EPERM;
+			goto out;
+		}
+	}
 	lpni = lnet_find_peer_ni_locked(nid);
 	if (!lpni) {
-		CERROR("Cannot remove unknown nid %s from peer %s\n",
-		       libcfs_nid2str(nid),
-		       libcfs_nid2str(lp->lp_primary_nid));
-		return -ENOENT;
+		rc = -ENOENT;
+		goto out;
 	}
 	lnet_peer_ni_decref_locked(lpni);
-	lp2 = lpni->lpni_peer_net->lpn_peer;
-	if (lp2 != lp) {
-		CERROR("Nid %s is attached to peer %s, not peer %s\n",
-		       libcfs_nid2str(nid),
-		       libcfs_nid2str(lp2->lp_primary_nid),
-		       libcfs_nid2str(lp->lp_primary_nid));
-		return -EINVAL;
+	if (lp != lpni->lpni_peer_net->lpn_peer) {
+		rc = -ECHILD;
+		goto out;
 	}
 
 	/*
@@ -384,16 +395,19 @@ lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid)
 	 * is the only NID.
 	 */
 	if (nid == lp->lp_primary_nid && lnet_get_num_peer_nis(lp) != 1) {
-		CERROR("Cannot delete primary NID %s from multi-NID peer\n",
-		       libcfs_nid2str(nid));
-		return -EINVAL;
+		rc = -EBUSY;
+		goto out;
 	}
 
 	lnet_net_lock(LNET_LOCK_EX);
 	lnet_peer_ni_del_locked(lpni);
 	lnet_net_unlock(LNET_LOCK_EX);
 
-	return 0;
+out:
+	CDEBUG(D_NET, "peer %s NID %s flags %#x: %d\n",
+	       libcfs_nid2str(primary_nid), libcfs_nid2str(nid), flags, rc);
+
+	return rc;
 }
 
 static void
@@ -895,46 +909,27 @@ lnet_peer_get_net_locked(struct lnet_peer *peer, u32 net_id)
 	return NULL;
 }
 
+/*
+ * Always returns 0, but it the last function called from functions
+ * that do return an int, so returning 0 here allows the compiler to
+ * do a tail call.
+ */
 static int
-lnet_peer_setup_hierarchy(struct lnet_peer *lp, struct lnet_peer_ni *lpni,
-			  lnet_nid_t nid)
+lnet_peer_attach_peer_ni(struct lnet_peer *lp,
+			 struct lnet_peer_net *lpn,
+			 struct lnet_peer_ni *lpni,
+			 unsigned int flags)
 {
-	struct lnet_peer_net *lpn = NULL;
 	struct lnet_peer_table *ptable;
-	u32 net_id = LNET_NIDNET(nid);
-
-	/*
-	 * Create the peer_ni, peer_net, and peer if they don't exist
-	 * yet.
-	 */
-	if (lp) {
-		lpn = lnet_peer_get_net_locked(lp, net_id);
-	} else {
-		lp = lnet_peer_alloc(nid);
-		if (!lp)
-			goto out_enomem;
-	}
-
-	if (!lpn) {
-		lpn = lnet_peer_net_alloc(net_id);
-		if (!lpn)
-			goto out_maybe_free_lp;
-	}
-
-	if (!lpni) {
-		lpni = lnet_peer_ni_alloc(nid);
-		if (!lpni)
-			goto out_maybe_free_lpn;
-	}
 
 	/* Install the new peer_ni */
 	lnet_net_lock(LNET_LOCK_EX);
 	/* Add peer_ni to global peer table hash, if necessary. */
 	if (list_empty(&lpni->lpni_hashlist)) {
+		int hash = lnet_nid2peerhash(lpni->lpni_nid);
+
 		ptable = the_lnet.ln_peer_tables[lpni->lpni_cpt];
-		list_add_tail(&lpni->lpni_hashlist,
-			      &ptable->pt_hash[lnet_nid2peerhash(nid)]);
+		list_add_tail(&lpni->lpni_hashlist, &ptable->pt_hash[hash]);
 		ptable->pt_version++;
 		atomic_inc(&ptable->pt_number);
 		atomic_inc(&lpni->lpni_refcount);
@@ -942,7 +937,7 @@ lnet_peer_setup_hierarchy(struct lnet_peer *lp, struct lnet_peer_ni
 
 	/* Detach the peer_ni from an existing peer, if necessary. */
 	if (lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer != lp)
-		lnet_try_destroy_peer_hierarchy_locked(lpni);
+		lnet_peer_detach_peer_ni(lpni);
 
 	/* Add peer_ni to peer_net */
 	lpni->lpni_peer_net = lpn;
@@ -957,33 +952,42 @@ lnet_peer_setup_hierarchy(struct lnet_peer *lp, struct lnet_peer_ni
 	/* Add peer to global peer list */
 	if (list_empty(&lp->lp_on_lnet_peer_list))
 		list_add_tail(&lp->lp_on_lnet_peer_list, &the_lnet.ln_peers);
+
+	/* Update peer state */
+	spin_lock(&lp->lp_lock);
+	if (flags & LNET_PEER_CONFIGURED) {
+		if (!(lp->lp_state & LNET_PEER_CONFIGURED))
+			lp->lp_state |= LNET_PEER_CONFIGURED;
+	}
+	if (flags & LNET_PEER_MULTI_RAIL) {
+		if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
+			lp->lp_state |= LNET_PEER_MULTI_RAIL;
+			lnet_peer_clr_non_mr_pref_nids(lp);
+		}
+	}
+	spin_unlock(&lp->lp_lock);
+
 	lnet_net_unlock(LNET_LOCK_EX);
 
-	return 0;
+	CDEBUG(D_NET, "peer %s NID %s flags %#x\n",
+	       libcfs_nid2str(lp->lp_primary_nid),
+	       libcfs_nid2str(lpni->lpni_nid), flags);
 
-out_maybe_free_lpn:
-	if (list_empty(&lpn->lpn_on_peer_list))
-		kfree(lpn);
-out_maybe_free_lp:
-	if (list_empty(&lp->lp_on_lnet_peer_list))
-		kfree(lp);
-out_enomem:
-	return -ENOMEM;
+	return 0;
 }
 
 /*
  * Create a new peer, with nid as its primary nid.
  *
- * It is not an error if the peer already exists, provided that the
- * given nid is the primary NID.
- *
  * Call with the lnet_api_mutex held.
  */
 static int
-lnet_peer_add(lnet_nid_t nid, bool mr)
+lnet_peer_add(lnet_nid_t nid, unsigned int flags)
 {
 	struct lnet_peer *lp;
+	struct lnet_peer_net *lpn;
 	struct lnet_peer_ni *lpni;
+	int rc = 0;
 
 	LASSERT(nid != LNET_NID_ANY);
 
@@ -992,82 +996,153 @@ lnet_peer_add(lnet_nid_t nid, bool mr)
 	 * lnet_api_mutex is held.
 	 */
 	lpni = lnet_find_peer_ni_locked(nid);
-	if (!lpni) {
-		int rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
-		if (rc != 0)
-			return rc;
-		lpni = lnet_find_peer_ni_locked(nid);
-		LASSERT(lpni);
+	if (lpni) {
+		/* A peer with this NID already exists. */
+		lp = lpni->lpni_peer_net->lpn_peer;
+		lnet_peer_ni_decref_locked(lpni);
+		/*
+		 * This is an error if the peer was configured and the
+		 * primary NID differs or an attempt is made to change
+		 * the Multi-Rail flag. Otherwise the assumption is
+		 * that an existing peer is being modified.
+		 */
+		if (lp->lp_state & LNET_PEER_CONFIGURED) {
+			if (lp->lp_primary_nid != nid)
+				rc = -EEXIST;
+			else if ((lp->lp_state ^ flags) & LNET_PEER_MULTI_RAIL)
+				rc = -EPERM;
+			goto out;
+		}
+		/* Delete and recreate as a configured peer. */
+		lnet_peer_del(lp);
 	}
-	lp = lpni->lpni_peer_net->lpn_peer;
-	lnet_peer_ni_decref_locked(lpni);
 
-	/* A found peer must have this primary NID */
-	if (lp->lp_primary_nid != nid)
-		return -EEXIST;
+	/* Create peer, peer_net, and peer_ni. */
+	rc = -ENOMEM;
+	lp = lnet_peer_alloc(nid);
+	if (!lp)
+		goto out;
+	lpn = lnet_peer_net_alloc(LNET_NIDNET(nid));
+	if (!lpn)
+		goto out_free_lp;
+	lpni = lnet_peer_ni_alloc(nid);
+	if (!lpni)
+		goto out_free_lpn;
 
-	/*
-	 * If we found an lpni that is not a multi-rail, which could occur
-	 * if lpni is already created as a non-mr lpni or we just created
-	 * it, then make sure you indicate that this lpni is a primary mr
-	 * capable peer.
-	 *
-	 * TODO: update flags if necessary
-	 */
-	spin_lock(&lp->lp_lock);
-	if (mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
-		lp->lp_state |= LNET_PEER_MULTI_RAIL;
-		lnet_peer_clr_non_mr_pref_nids(lp);
-	} else if (!mr && (lp->lp_state & LNET_PEER_MULTI_RAIL)) {
-		/* The mr state is sticky. */
-		CDEBUG(D_NET, "Cannot clear multi-rail flag from peer %s\n",
-		       libcfs_nid2str(nid));
-	}
-	spin_unlock(&lp->lp_lock);
+	return lnet_peer_attach_peer_ni(lp, lpn, lpni, flags);
 
-	return 0;
+out_free_lpn:
+	kfree(lpn);
+out_free_lp:
+	kfree(lp);
+out:
+	CDEBUG(D_NET, "peer %s NID flags %#x: %d\n",
+	       libcfs_nid2str(nid), flags, rc);
+	return rc;
 }
 
+/*
+ * Add a NID to a peer. Call with ln_api_mutex held.
+ *
+ * Error codes:
+ *  -EPERM:    Non-DLC addition to a DLC-configured peer.
+ *  -EEXIST:   The NID was configured by DLC for a different peer.
+ *  -ENOMEM:   Out of memory.
+ *  -ENOTUNIQ: Adding a second peer NID on a single network on a
+ *             non-multi-rail peer.
+ */
 static int
-lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
+lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, unsigned int flags)
 {
+	struct lnet_peer_net *lpn;
 	struct lnet_peer_ni *lpni;
+	int rc = 0;
 
 	LASSERT(lp);
 	LASSERT(nid != LNET_NID_ANY);
 
-	spin_lock(&lp->lp_lock);
-	if (!mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
-		spin_unlock(&lp->lp_lock);
-		CERROR("Cannot add nid %s to non-multi-rail peer %s\n",
-		       libcfs_nid2str(nid),
-		       libcfs_nid2str(lp->lp_primary_nid));
-		return -EPERM;
+	/* A configured peer can only be updated through configuration. */
+	if (!(flags & LNET_PEER_CONFIGURED)) {
+		if (lp->lp_state & LNET_PEER_CONFIGURED) {
+			rc = -EPERM;
+			goto out;
+		}
 	}
 
-	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
-		lp->lp_state |= LNET_PEER_MULTI_RAIL;
-		lnet_peer_clr_non_mr_pref_nids(lp);
+	/*
+	 * The MULTI_RAIL flag can be set but not cleared, because
+	 * that would leave the peer struct in an invalid state.
+	 */
+	if (flags & LNET_PEER_MULTI_RAIL) {
+		spin_lock(&lp->lp_lock);
+		if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
+			lp->lp_state |= LNET_PEER_MULTI_RAIL;
+			lnet_peer_clr_non_mr_pref_nids(lp);
+		}
+		spin_unlock(&lp->lp_lock);
+	} else if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
+		rc = -EPERM;
+		goto out;
 	}
-	spin_unlock(&lp->lp_lock);
 
 	lpni = lnet_find_peer_ni_locked(nid);
-	if (!lpni)
-		return lnet_peer_setup_hierarchy(lp, NULL, nid);
+	if (lpni) {
+		/*
+		 * A peer_ni already exists. This is only a problem if
+		 * it is not connected to this peer and was configured
+		 * by DLC.
+		 */
+		lnet_peer_ni_decref_locked(lpni);
+		if (lpni->lpni_peer_net->lpn_peer == lp)
+			goto out;
+		if (lnet_peer_ni_is_configured(lpni)) {
+			rc = -EEXIST;
+			goto out;
+		}
+		/* If this is the primary NID, destroy the peer. */
+		if (lnet_peer_ni_is_primary(lpni)) {
+			lnet_peer_del(lpni->lpni_peer_net->lpn_peer);
+			lpni = lnet_peer_ni_alloc(nid);
+			if (!lpni) {
+				rc = -ENOMEM;
+				goto out;
+			}
+		}
+	} else {
+		lpni = lnet_peer_ni_alloc(nid);
+		if (!lpni) {
+			rc = -ENOMEM;
+			goto out;
+		}
+	}
 
-	if (lpni->lpni_peer_net->lpn_peer != lp) {
-		struct lnet_peer *lp2 = lpni->lpni_peer_net->lpn_peer;
-		CERROR("Cannot add NID %s owned by peer %s to peer %s\n",
-		       libcfs_nid2str(lpni->lpni_nid),
-		       libcfs_nid2str(lp2->lp_primary_nid),
-		       libcfs_nid2str(lp->lp_primary_nid));
-		return -EEXIST;
+	/*
+	 * Get the peer_net. Check that we're not adding a second
+	 * peer_ni on a peer_net of a non-multi-rail peer.
+	 */
+	lpn = lnet_peer_get_net_locked(lp, LNET_NIDNET(nid));
+	if (!lpn) {
+		lpn = lnet_peer_net_alloc(LNET_NIDNET(nid));
+		if (!lpn) {
+			rc = -ENOMEM;
+			goto out_free_lpni;
+		}
+	} else if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
+		rc = -ENOTUNIQ;
+		goto out_free_lpni;
 	}
 
-	CDEBUG(D_NET, "NID %s is already owned by peer %s\n",
-	       libcfs_nid2str(lpni->lpni_nid),
-	       libcfs_nid2str(lp->lp_primary_nid));
-	return 0;
+	return lnet_peer_attach_peer_ni(lp, lpn, lpni, flags);
+
+out_free_lpni:
+	/* If the peer_ni was allocated above its peer_net pointer is NULL */
+	if (!lpni->lpni_peer_net)
+		kfree(lpni);
+out:
+	CDEBUG(D_NET, "peer %s NID %s flags %#x: %d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), libcfs_nid2str(nid),
+	       flags, rc);
+	return rc;
 }
 
 /*
@@ -1076,25 +1151,53 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
 static int
 lnet_peer_ni_traffic_add(lnet_nid_t nid, lnet_nid_t pref)
 {
+	struct lnet_peer *lp;
+	struct lnet_peer_net *lpn;
 	struct lnet_peer_ni *lpni;
-	int rc;
+	unsigned int flags = 0;
+	int rc = 0;
 
-	if (nid == LNET_NID_ANY)
-		return -EINVAL;
+	if (nid == LNET_NID_ANY) {
+		rc = -EINVAL;
+		goto out;
+	}
 
 	/* lnet_net_lock is not needed here because ln_api_lock is held */
 	lpni = lnet_find_peer_ni_locked(nid);
-	if (!lpni) {
-		rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
-		if (rc)
-			return rc;
-		lpni = lnet_find_peer_ni_locked(nid);
+	if (lpni) {
+		/*
+		 * We must have raced with another thread. Since we
+		 * know next to nothing about a peer_ni created by
+		 * traffic, we just assume everything is ok and
+		 * return.
+		 */
+		lnet_peer_ni_decref_locked(lpni);
+		goto out;
 	}
+
+	/* Create peer, peer_net, and peer_ni. */
+	rc = -ENOMEM;
+	lp = lnet_peer_alloc(nid);
+	if (!lp)
+		goto out;
+	lpn = lnet_peer_net_alloc(LNET_NIDNET(nid));
+	if (!lpn)
+		goto out_free_lp;
+	lpni = lnet_peer_ni_alloc(nid);
+	if (!lpni)
+		goto out_free_lpn;
 	if (pref != LNET_NID_ANY)
 		lnet_peer_ni_set_non_mr_pref_nid(lpni, pref);
-	lnet_peer_ni_decref_locked(lpni);
 
-	return 0;
+	return lnet_peer_attach_peer_ni(lp, lpn, lpni, flags);
+
+out_free_lpn:
+	kfree(lpn);
+out_free_lp:
+	kfree(lp);
+out:
+	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(nid), rc);
+	return rc;
 }
 
 /*
@@ -1114,17 +1217,22 @@ lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
 {
 	struct lnet_peer *lp = NULL;
 	struct lnet_peer_ni *lpni;
+	unsigned int flags;
 
 	/* The prim_nid must always be specified */
 	if (prim_nid == LNET_NID_ANY)
 		return -EINVAL;
 
+	flags = LNET_PEER_CONFIGURED;
+	if (mr)
+		flags |= LNET_PEER_MULTI_RAIL;
+
 	/*
 	 * If nid isn't specified, we must create a new peer with
 	 * prim_nid as its primary nid.
 	 */
 	if (nid == LNET_NID_ANY)
-		return lnet_peer_add(prim_nid, mr);
+		return lnet_peer_add(prim_nid, flags);
 
 	/* Look up the prim_nid, which must exist. */
 	lpni = lnet_find_peer_ni_locked(prim_nid);
@@ -1133,6 +1241,14 @@ lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
 	lnet_peer_ni_decref_locked(lpni);
 	lp = lpni->lpni_peer_net->lpn_peer;
 
+	/* Peer must have been configured. */
+	if (!(lp->lp_state & LNET_PEER_CONFIGURED)) {
+		CDEBUG(D_NET, "peer %s was not configured\n",
+		       libcfs_nid2str(prim_nid));
+		return -ENOENT;
+	}
+
+	/* Primary NID must match */
 	if (lp->lp_primary_nid != prim_nid) {
 		CDEBUG(D_NET, "prim_nid %s is not primary for peer %s\n",
 		       libcfs_nid2str(prim_nid),
@@ -1140,7 +1256,14 @@ lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
 		return -ENODEV;
 	}
 
-	return lnet_peer_add_nid(lp, nid, mr);
+	/* Multi-Rail flag must match. */
+	if ((lp->lp_state ^ flags) & LNET_PEER_MULTI_RAIL) {
+		CDEBUG(D_NET, "multi-rail state mismatch for peer %s\n",
+		       libcfs_nid2str(prim_nid));
+		return -EPERM;
+	}
+
+	return lnet_peer_add_nid(lp, nid, flags);
 }
 
 /*
@@ -1159,6 +1282,7 @@ lnet_del_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid)
 {
 	struct lnet_peer *lp;
 	struct lnet_peer_ni *lpni;
+	unsigned int flags;
 
 	if (prim_nid == LNET_NID_ANY)
 		return -EINVAL;
@@ -1179,7 +1303,11 @@ lnet_del_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid)
 	if (nid == LNET_NID_ANY || nid == lp->lp_primary_nid)
 		return lnet_peer_del(lp);
 
-	return lnet_peer_del_nid(lp, nid);
+	flags = LNET_PEER_CONFIGURED;
+	if (lp->lp_state & LNET_PEER_MULTI_RAIL)
+		flags |= LNET_PEER_MULTI_RAIL;
+
+	return lnet_peer_del_nid(lp, nid, flags);
 }
 
 void


* [lustre-devel] [PATCH 14/24] lustre: lnet: reference counts on lnet_peer/lnet_peer_net
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 22:42   ` James Simmons
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Peer discovery will keep track of lnet_peer structures, so
there will be references to an lnet_peer independent of the
references implied by its lnet_peer_ni structures. Manage this
by adding explicit reference counts to lnet_peer_net and
lnet_peer.

Each lnet_peer_net has a hold on the lnet_peer it links to
with its lpn_peer pointer. This hold is only removed when that
pointer is assigned a new value or the lnet_peer_net is freed.
Just removing an lnet_peer_net from the lp_peer_nets list does
not release this hold, it just prevents new lookups of the
lnet_peer_net via the lnet_peer.

Each lnet_peer_ni has a hold on the lnet_peer_net it links to
with its lpni_peer_net pointer. This hold is only removed when
that pointer is assigned a new value or the lnet_peer_ni is
freed. Just removing an lnet_peer_ni from the lpn_peer_nis
list does not release this hold, it just prevents new lookups
of the lnet_peer_ni via the lnet_peer_net.

This ensures that given a lnet_peer_ni *lpni, we can rely on
lpni->lpni_peer_net->lpn_peer pointing to a valid lnet_peer.

Keep a count of the total number of lnet_peer_ni attached to
an lnet_peer in lp_nnis.

Split the global ln_peers list into per-lnet_peer_table lists.
The CPT of the peer table in which the lnet_peer is linked is
stored in lp_cpt.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25784
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |   49 +++--
 .../staging/lustre/include/linux/lnet/lib-types.h  |   50 ++++-
 drivers/staging/lustre/lnet/lnet/api-ni.c          |    1 
 drivers/staging/lustre/lnet/lnet/lib-move.c        |    8 -
 drivers/staging/lustre/lnet/lnet/peer.c            |  210 ++++++++++++++------
 5 files changed, 227 insertions(+), 91 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 563417510722..aad25eb0011b 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -310,6 +310,36 @@ lnet_handle2me(struct lnet_handle_me *handle)
 	return lh_entry(lh, struct lnet_me, me_lh);
 }
 
+static inline void
+lnet_peer_net_addref_locked(struct lnet_peer_net *lpn)
+{
+	atomic_inc(&lpn->lpn_refcount);
+}
+
+void lnet_destroy_peer_net_locked(struct lnet_peer_net *lpn);
+
+static inline void
+lnet_peer_net_decref_locked(struct lnet_peer_net *lpn)
+{
+	if (atomic_dec_and_test(&lpn->lpn_refcount))
+		lnet_destroy_peer_net_locked(lpn);
+}
+
+static inline void
+lnet_peer_addref_locked(struct lnet_peer *lp)
+{
+	atomic_inc(&lp->lp_refcount);
+}
+
+void lnet_destroy_peer_locked(struct lnet_peer *lp);
+
+static inline void
+lnet_peer_decref_locked(struct lnet_peer *lp)
+{
+	if (atomic_dec_and_test(&lp->lp_refcount))
+		lnet_destroy_peer_locked(lp);
+}
+
 static inline void
 lnet_peer_ni_addref_locked(struct lnet_peer_ni *lp)
 {
@@ -695,21 +725,6 @@ int lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
 			  __u32 *peer_rtr_credits, __u32 *peer_min_rtr_credtis,
 			  __u32 *peer_tx_qnob);
 
-static inline __u32
-lnet_get_num_peer_nis(struct lnet_peer *peer)
-{
-	struct lnet_peer_net *lpn;
-	struct lnet_peer_ni *lpni;
-	__u32 count = 0;
-
-	list_for_each_entry(lpn, &peer->lp_peer_nets, lpn_on_peer_list)
-		list_for_each_entry(lpni, &lpn->lpn_peer_nis,
-				    lpni_on_peer_net_list)
-			count++;
-
-	return count;
-}
-
 static inline bool
 lnet_is_peer_ni_healthy_locked(struct lnet_peer_ni *lpni)
 {
@@ -728,7 +743,7 @@ lnet_is_peer_net_healthy_locked(struct lnet_peer_net *peer_net)
 	struct lnet_peer_ni *lpni;
 
 	list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
-			    lpni_on_peer_net_list) {
+			    lpni_peer_nis) {
 		if (lnet_is_peer_ni_healthy_locked(lpni))
 			return true;
 	}
@@ -741,7 +756,7 @@ lnet_is_peer_healthy_locked(struct lnet_peer *peer)
 {
 	struct lnet_peer_net *peer_net;
 
-	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_on_peer_list) {
+	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
 		if (lnet_is_peer_net_healthy_locked(peer_net))
 			return true;
 	}
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index d1721fd01d93..260619e19bde 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -411,7 +411,8 @@ struct lnet_rc_data {
 };
 
 struct lnet_peer_ni {
-	struct list_head	lpni_on_peer_net_list;
+	/* chain on lpn_peer_nis */
+	struct list_head	lpni_peer_nis;
 	/* chain on remote peer list */
 	struct list_head	lpni_on_remote_peer_ni_list;
 	/* chain on peer hash */
@@ -496,8 +497,8 @@ struct lnet_peer_ni {
 #define LNET_PEER_NI_NON_MR_PREF	BIT(0)
 
 struct lnet_peer {
-	/* chain on global peer list */
-	struct list_head	lp_on_lnet_peer_list;
+	/* chain on pt_peer_list */
+	struct list_head	lp_peer_list;
 
 	/* list of peer nets */
 	struct list_head	lp_peer_nets;
@@ -505,6 +506,15 @@ struct lnet_peer {
 	/* primary NID of the peer */
 	lnet_nid_t		lp_primary_nid;
 
+	/* CPT of peer_table */
+	int			lp_cpt;
+
+	/* number of NIDs on this peer */
+	int			lp_nnis;
+
+	/* reference count */
+	atomic_t		lp_refcount;
+
 	/* lock protecting peer state flags */
 	spinlock_t		lp_lock;
 
@@ -516,8 +526,8 @@ struct lnet_peer {
 #define LNET_PEER_CONFIGURED	BIT(1)
 
 struct lnet_peer_net {
-	/* chain on peer block */
-	struct list_head	lpn_on_peer_list;
+	/* chain on lp_peer_nets */
+	struct list_head	lpn_peer_nets;
 
 	/* list of peer_nis on this network */
 	struct list_head	lpn_peer_nis;
@@ -527,21 +537,45 @@ struct lnet_peer_net {
 
 	/* Net ID */
 	__u32			lpn_net_id;
+
+	/* reference count */
+	atomic_t		lpn_refcount;
 };
 
 /* peer hash size */
 #define LNET_PEER_HASH_BITS	9
 #define LNET_PEER_HASH_SIZE	(1 << LNET_PEER_HASH_BITS)
 
-/* peer hash table */
+/*
+ * peer hash table - one per CPT
+ *
+ * protected by lnet_net_lock/EX for update
+ *    pt_version
+ *    pt_number
+ *    pt_hash[...]
+ *    pt_peer_list
+ *    pt_peers
+ *    pt_peer_nnids
+ * protected by pt_zombie_lock:
+ *    pt_zombie_list
+ *    pt_zombies
+ *
+ * pt_zombie lock nests inside lnet_net_lock
+ */
 struct lnet_peer_table {
 	/* /proc validity stamp */
 	int			 pt_version;
 	/* # peers extant */
 	atomic_t		 pt_number;
+	/* peers */
+	struct list_head	pt_peer_list;
+	/* # peers */
+	int			pt_peers;
+	/* # NIDS on listed peers */
+	int			pt_peer_nnids;
 	/* # zombies to go to deathrow (and not there yet) */
 	int			 pt_zombies;
-	/* zombie peers */
+	/* zombie peers_ni */
 	struct list_head	 pt_zombie_list;
 	/* protect list and count */
 	spinlock_t		 pt_zombie_lock;
@@ -785,8 +819,6 @@ struct lnet {
 	struct lnet_msg_container	**ln_msg_containers;
 	struct lnet_counters		**ln_counters;
 	struct lnet_peer_table		**ln_peer_tables;
-	/* list of configured or discovered peers */
-	struct list_head		ln_peers;
 	/* list of peer nis not on a local network */
 	struct list_head		ln_remote_peer_ni_list;
 	/* failure simulation */
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index d64ae2939abc..c48bcb8722a0 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -625,7 +625,6 @@ lnet_prepare(lnet_pid_t requested_pid)
 	the_lnet.ln_pid = requested_pid;
 
 	INIT_LIST_HEAD(&the_lnet.ln_test_peers);
-	INIT_LIST_HEAD(&the_lnet.ln_peers);
 	INIT_LIST_HEAD(&the_lnet.ln_remote_peer_ni_list);
 	INIT_LIST_HEAD(&the_lnet.ln_nets);
 	INIT_LIST_HEAD(&the_lnet.ln_routers);
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index 99d8b22356bb..4c1eef907dc7 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -1388,7 +1388,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 			peer_net = lnet_peer_get_net_locked(
 				peer, LNET_NIDNET(best_lpni->lpni_nid));
 			list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
-					    lpni_on_peer_net_list) {
+					    lpni_peer_nis) {
 				if (lpni->lpni_pref_nnids == 0)
 					continue;
 				LASSERT(lpni->lpni_pref_nnids == 1);
@@ -1411,7 +1411,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 			}
 			lpni = list_entry(peer_net->lpn_peer_nis.next,
 					  struct lnet_peer_ni,
-					  lpni_on_peer_net_list);
+					  lpni_peer_nis);
 		}
 		/* Set preferred NI if necessary. */
 		if (lpni->lpni_pref_nnids == 0)
@@ -1443,7 +1443,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 	 * then the best route is chosen. If all routes are equal then
 	 * they are used in round robin.
 	 */
-	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_on_peer_list) {
+	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
 		if (!lnet_is_peer_net_healthy_locked(peer_net))
 			continue;
 
@@ -1453,7 +1453,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 
 			lpni = list_entry(peer_net->lpn_peer_nis.next,
 					  struct lnet_peer_ni,
-					  lpni_on_peer_net_list);
+					  lpni_peer_nis);
 
 			net_gw = lnet_find_route_locked(NULL,
 							lpni->lpni_nid,
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 09c1b5516f6b..d7a0a2f3bdd9 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -118,7 +118,7 @@ lnet_peer_ni_alloc(lnet_nid_t nid)
 	INIT_LIST_HEAD(&lpni->lpni_rtrq);
 	INIT_LIST_HEAD(&lpni->lpni_routes);
 	INIT_LIST_HEAD(&lpni->lpni_hashlist);
-	INIT_LIST_HEAD(&lpni->lpni_on_peer_net_list);
+	INIT_LIST_HEAD(&lpni->lpni_peer_nis);
 	INIT_LIST_HEAD(&lpni->lpni_on_remote_peer_ni_list);
 
 	spin_lock_init(&lpni->lpni_lock);
@@ -150,7 +150,7 @@ lnet_peer_ni_alloc(lnet_nid_t nid)
 			      &the_lnet.ln_remote_peer_ni_list);
 	}
 
-	/* TODO: update flags */
+	CDEBUG(D_NET, "%p nid %s\n", lpni, libcfs_nid2str(lpni->lpni_nid));
 
 	return lpni;
 }
@@ -164,13 +164,32 @@ lnet_peer_net_alloc(u32 net_id)
 	if (!lpn)
 		return NULL;
 
-	INIT_LIST_HEAD(&lpn->lpn_on_peer_list);
+	INIT_LIST_HEAD(&lpn->lpn_peer_nets);
 	INIT_LIST_HEAD(&lpn->lpn_peer_nis);
 	lpn->lpn_net_id = net_id;
 
+	CDEBUG(D_NET, "%p net %s\n", lpn, libcfs_net2str(lpn->lpn_net_id));
+
 	return lpn;
 }
 
+void
+lnet_destroy_peer_net_locked(struct lnet_peer_net *lpn)
+{
+	struct lnet_peer *lp;
+
+	CDEBUG(D_NET, "%p net %s\n", lpn, libcfs_net2str(lpn->lpn_net_id));
+
+	LASSERT(atomic_read(&lpn->lpn_refcount) == 0);
+	LASSERT(list_empty(&lpn->lpn_peer_nis));
+	LASSERT(list_empty(&lpn->lpn_peer_nets));
+	lp = lpn->lpn_peer;
+	lpn->lpn_peer = NULL;
+	kfree(lpn);
+
+	lnet_peer_decref_locked(lp);
+}
+
 static struct lnet_peer *
 lnet_peer_alloc(lnet_nid_t nid)
 {
@@ -180,53 +199,73 @@ lnet_peer_alloc(lnet_nid_t nid)
 	if (!lp)
 		return NULL;
 
-	INIT_LIST_HEAD(&lp->lp_on_lnet_peer_list);
+	INIT_LIST_HEAD(&lp->lp_peer_list);
 	INIT_LIST_HEAD(&lp->lp_peer_nets);
 	spin_lock_init(&lp->lp_lock);
 	lp->lp_primary_nid = nid;
+	lp->lp_cpt = lnet_nid_cpt_hash(nid, LNET_CPT_NUMBER);
 
-	/* TODO: update flags */
+	CDEBUG(D_NET, "%p nid %s\n", lp, libcfs_nid2str(lp->lp_primary_nid));
 
 	return lp;
 }
 
+void
+lnet_destroy_peer_locked(struct lnet_peer *lp)
+{
+	CDEBUG(D_NET, "%p nid %s\n", lp, libcfs_nid2str(lp->lp_primary_nid));
+
+	LASSERT(atomic_read(&lp->lp_refcount) == 0);
+	LASSERT(list_empty(&lp->lp_peer_nets));
+	LASSERT(list_empty(&lp->lp_peer_list));
+
+	kfree(lp);
+}
+
+/*
+ * Detach a peer_ni from its peer_net. If this was the last peer_ni on
+ * that peer_net, detach the peer_net from the peer.
+ *
+ * Call with lnet_net_lock/EX held
+ */
 static void
-lnet_peer_detach_peer_ni(struct lnet_peer_ni *lpni)
+lnet_peer_detach_peer_ni_locked(struct lnet_peer_ni *lpni)
 {
+	struct lnet_peer_table *ptable;
 	struct lnet_peer_net *lpn;
 	struct lnet_peer *lp;
 
-	/* TODO: could the below situation happen? accessing an already
-	 * destroyed peer?
+	/*
+	 * Belts and suspenders: gracefully handle teardown of a
+	 * partially connected peer_ni.
 	 */
-	if (!lpni->lpni_peer_net ||
-	    !lpni->lpni_peer_net->lpn_peer)
-		return;
-
 	lpn = lpni->lpni_peer_net;
-	lp = lpni->lpni_peer_net->lpn_peer;
 
-	CDEBUG(D_NET, "peer %s NID %s\n",
-	       libcfs_nid2str(lp->lp_primary_nid),
-	       libcfs_nid2str(lpni->lpni_nid));
-
-	list_del_init(&lpni->lpni_on_peer_net_list);
-	lpni->lpni_peer_net = NULL;
+	list_del_init(&lpni->lpni_peer_nis);
+	/*
+	 * If there are no lpni's left, we detach lpn from
+	 * lp_peer_nets, so it cannot be found anymore.
+	 */
+	if (list_empty(&lpn->lpn_peer_nis))
+		list_del_init(&lpn->lpn_peer_nets);
 
-	/* if lpn is empty, then remove it from the peer */
-	if (list_empty(&lpn->lpn_peer_nis)) {
-		list_del_init(&lpn->lpn_on_peer_list);
-		lpn->lpn_peer = NULL;
-		kfree(lpn);
+	/* Update peer NID count. */
+	lp = lpn->lpn_peer;
+	ptable = the_lnet.ln_peer_tables[lp->lp_cpt];
+	lp->lp_nnis--;
+	ptable->pt_peer_nnids--;
 
-		/* If the peer is empty then remove it from the
-		 * the_lnet.ln_peers.
-		 */
-		if (list_empty(&lp->lp_peer_nets)) {
-			list_del_init(&lp->lp_on_lnet_peer_list);
-			kfree(lp);
-		}
+	/*
+	 * If there are no more peer nets, make the peer unfindable
+	 * via the peer_tables.
+	 */
+	if (list_empty(&lp->lp_peer_nets)) {
+		list_del_init(&lp->lp_peer_list);
+		ptable->pt_peers--;
 	}
+	CDEBUG(D_NET, "peer %s NID %s\n",
+	       libcfs_nid2str(lp->lp_primary_nid),
+	       libcfs_nid2str(lpni->lpni_nid));
 }
 
 /* called with lnet_net_lock LNET_LOCK_EX held */
@@ -268,9 +307,9 @@ lnet_peer_ni_del_locked(struct lnet_peer_ni *lpni)
 	spin_unlock(&ptable->pt_zombie_lock);
 
 	/* no need to keep this peer_ni on the hierarchy anymore */
-	lnet_peer_detach_peer_ni(lpni);
+	lnet_peer_detach_peer_ni_locked(lpni);
 
-	/* decrement reference on peer_ni */
+	/* remove hashlist reference on peer_ni */
 	lnet_peer_ni_decref_locked(lpni);
 
 	return 0;
@@ -319,6 +358,8 @@ lnet_peer_tables_create(void)
 		spin_lock_init(&ptable->pt_zombie_lock);
 		INIT_LIST_HEAD(&ptable->pt_zombie_list);
 
+		INIT_LIST_HEAD(&ptable->pt_peer_list);
+
 		for (j = 0; j < LNET_PEER_HASH_SIZE; j++)
 			INIT_LIST_HEAD(&hash[j]);
 		ptable->pt_hash = hash; /* sign of initialization */
@@ -394,7 +435,7 @@ lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid, unsigned int flags)
 	 * This function only allows deletion of the primary NID if it
 	 * is the only NID.
 	 */
-	if (nid == lp->lp_primary_nid && lnet_get_num_peer_nis(lp) != 1) {
+	if (nid == lp->lp_primary_nid && lp->lp_nnis != 1) {
 		rc = -EBUSY;
 		goto out;
 	}
@@ -560,15 +601,34 @@ struct lnet_peer_ni *
 lnet_get_peer_ni_idx_locked(int idx, struct lnet_peer_net **lpn,
 			    struct lnet_peer **lp)
 {
+	struct lnet_peer_table	*ptable;
 	struct lnet_peer_ni	*lpni;
+	int			lncpt;
+	int			cpt;
+
+	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
 
-	list_for_each_entry((*lp), &the_lnet.ln_peers, lp_on_lnet_peer_list) {
+	for (cpt = 0; cpt < lncpt; cpt++) {
+		ptable = the_lnet.ln_peer_tables[cpt];
+		if (ptable->pt_peer_nnids > idx)
+			break;
+		idx -= ptable->pt_peer_nnids;
+	}
+	if (cpt >= lncpt)
+		return NULL;
+
+	list_for_each_entry((*lp), &ptable->pt_peer_list, lp_peer_list) {
+		if ((*lp)->lp_nnis <= idx) {
+			idx -= (*lp)->lp_nnis;
+			continue;
+		}
 		list_for_each_entry((*lpn), &((*lp)->lp_peer_nets),
-				    lpn_on_peer_list) {
+				    lpn_peer_nets) {
 			list_for_each_entry(lpni, &((*lpn)->lpn_peer_nis),
-					    lpni_on_peer_net_list)
+					    lpni_peer_nis) {
 				if (idx-- == 0)
 					return lpni;
+			}
 		}
 	}
 
@@ -584,18 +644,21 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 	struct lnet_peer_net *net = peer_net;
 
 	if (!prev) {
-		if (!net)
+		if (!net) {
+			if (list_empty(&peer->lp_peer_nets))
+				return NULL;
+
 			net = list_entry(peer->lp_peer_nets.next,
 					 struct lnet_peer_net,
-					 lpn_on_peer_list);
+					 lpn_peer_nets);
+		}
 		lpni = list_entry(net->lpn_peer_nis.next, struct lnet_peer_ni,
-				  lpni_on_peer_net_list);
+				  lpni_peer_nis);
 
 		return lpni;
 	}
 
-	if (prev->lpni_on_peer_net_list.next ==
-	    &prev->lpni_peer_net->lpn_peer_nis) {
+	if (prev->lpni_peer_nis.next == &prev->lpni_peer_net->lpn_peer_nis) {
 		/*
 		 * if you reached the end of the peer ni list and the peer
 		 * net is specified then there are no more peer nis in that
@@ -608,25 +671,25 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 		 * we reached the end of this net ni list. move to the
 		 * next net
 		 */
-		if (prev->lpni_peer_net->lpn_on_peer_list.next ==
+		if (prev->lpni_peer_net->lpn_peer_nets.next ==
 		    &peer->lp_peer_nets)
 			/* no more nets and no more NIs. */
 			return NULL;
 
 		/* get the next net */
-		net = list_entry(prev->lpni_peer_net->lpn_on_peer_list.next,
+		net = list_entry(prev->lpni_peer_net->lpn_peer_nets.next,
 				 struct lnet_peer_net,
-				 lpn_on_peer_list);
+				 lpn_peer_nets);
 		/* get the ni on it */
 		lpni = list_entry(net->lpn_peer_nis.next, struct lnet_peer_ni,
-				  lpni_on_peer_net_list);
+				  lpni_peer_nis);
 
 		return lpni;
 	}
 
 	/* there are more nis left */
-	lpni = list_entry(prev->lpni_on_peer_net_list.next,
-			  struct lnet_peer_ni, lpni_on_peer_net_list);
+	lpni = list_entry(prev->lpni_peer_nis.next,
+			  struct lnet_peer_ni, lpni_peer_nis);
 
 	return lpni;
 }
@@ -902,7 +965,7 @@ struct lnet_peer_net *
 lnet_peer_get_net_locked(struct lnet_peer *peer, u32 net_id)
 {
 	struct lnet_peer_net *peer_net;
-	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_on_peer_list) {
+	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
 		if (peer_net->lpn_net_id == net_id)
 			return peer_net;
 	}
@@ -910,15 +973,20 @@ lnet_peer_get_net_locked(struct lnet_peer *peer, u32 net_id)
 }
 
 /*
- * Always returns 0, but it the last function called from functions
+ * Attach a peer_ni to a peer_net and peer. This function assumes
+ * peer_ni is not already attached to the peer_net/peer. The peer_ni
+ * may be attached to a different peer, in which case it will be
+ * properly detached first. The whole operation is done atomically.
+ *
+ * Always returns 0.  This is the last function called from functions
  * that do return an int, so returning 0 here allows the compiler to
  * do a tail call.
  */
 static int
 lnet_peer_attach_peer_ni(struct lnet_peer *lp,
-			 struct lnet_peer_net *lpn,
-			 struct lnet_peer_ni *lpni,
-			 unsigned int flags)
+				struct lnet_peer_net *lpn,
+				struct lnet_peer_ni *lpni,
+				unsigned int flags)
 {
 	struct lnet_peer_table *ptable;
 
@@ -932,26 +1000,38 @@ lnet_peer_attach_peer_ni(struct lnet_peer *lp,
 		list_add_tail(&lpni->lpni_hashlist, &ptable->pt_hash[hash]);
 		ptable->pt_version++;
 		atomic_inc(&ptable->pt_number);
+		/* This is the 1st refcount on lpni. */
 		atomic_inc(&lpni->lpni_refcount);
 	}
 
 	/* Detach the peer_ni from an existing peer, if necessary. */
-	if (lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer != lp)
-		lnet_peer_detach_peer_ni(lpni);
+	if (lpni->lpni_peer_net) {
+		LASSERT(lpni->lpni_peer_net != lpn);
+		LASSERT(lpni->lpni_peer_net->lpn_peer != lp);
+		lnet_peer_detach_peer_ni_locked(lpni);
+		lnet_peer_net_decref_locked(lpni->lpni_peer_net);
+		lpni->lpni_peer_net = NULL;
+	}
 
 	/* Add peer_ni to peer_net */
 	lpni->lpni_peer_net = lpn;
-	list_add_tail(&lpni->lpni_on_peer_net_list, &lpn->lpn_peer_nis);
+	list_add_tail(&lpni->lpni_peer_nis, &lpn->lpn_peer_nis);
+	lnet_peer_net_addref_locked(lpn);
 
 	/* Add peer_net to peer */
 	if (!lpn->lpn_peer) {
 		lpn->lpn_peer = lp;
-		list_add_tail(&lpn->lpn_on_peer_list, &lp->lp_peer_nets);
+		list_add_tail(&lpn->lpn_peer_nets, &lp->lp_peer_nets);
+		lnet_peer_addref_locked(lp);
+	}
+
+	/* Add peer to global peer list, if necessary */
+	ptable = the_lnet.ln_peer_tables[lp->lp_cpt];
+	if (list_empty(&lp->lp_peer_list)) {
+		list_add_tail(&lp->lp_peer_list, &ptable->pt_peer_list);
+		ptable->pt_peers++;
 	}
 
-	/* Add peer to global peer list */
-	if (list_empty(&lp->lp_on_lnet_peer_list))
-		list_add_tail(&lp->lp_on_lnet_peer_list, &the_lnet.ln_peers);
 
 	/* Update peer state */
 	spin_lock(&lp->lp_lock);
@@ -967,6 +1047,8 @@ lnet_peer_attach_peer_ni(struct lnet_peer *lp,
 	}
 	spin_unlock(&lp->lp_lock);
 
+	lp->lp_nnis++;
+	the_lnet.ln_peer_tables[lp->lp_cpt]->pt_peer_nnids++;
 	lnet_net_unlock(LNET_LOCK_EX);
 
 	CDEBUG(D_NET, "peer %s NID %s flags %#x\n",
@@ -1314,12 +1396,17 @@ void
 lnet_destroy_peer_ni_locked(struct lnet_peer_ni *lpni)
 {
 	struct lnet_peer_table *ptable;
+	struct lnet_peer_net *lpn;
+
+	CDEBUG(D_NET, "%p nid %s\n", lpni, libcfs_nid2str(lpni->lpni_nid));
 
 	LASSERT(atomic_read(&lpni->lpni_refcount) == 0);
 	LASSERT(lpni->lpni_rtr_refcount == 0);
 	LASSERT(list_empty(&lpni->lpni_txq));
 	LASSERT(lpni->lpni_txqnob == 0);
 
+	lpn = lpni->lpni_peer_net;
+	lpni->lpni_peer_net = NULL;
 	lpni->lpni_net = NULL;
 
 	/* remove the peer ni from the zombie list */
@@ -1332,6 +1419,8 @@ lnet_destroy_peer_ni_locked(struct lnet_peer_ni *lpni)
 	if (lpni->lpni_pref_nnids > 1)
 		kfree(lpni->lpni_pref.nids);
 	kfree(lpni);
+
+	lnet_peer_net_decref_locked(lpn);
 }
 
 struct lnet_peer_ni *
@@ -1518,6 +1607,7 @@ lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
 	return found ? 0 : -ENOENT;
 }
 
+/* ln_api_mutex is held, which keeps the peer list stable */
 int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
 		       bool *mr,
 		       struct lnet_peer_ni_credit_info __user *peer_ni_info,

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 15/24] lustre: lnet: add msg_type to lnet_event
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
  2018-10-07 23:19 ` [lustre-devel] [PATCH 16/24] lustre: lnet: add discovery thread NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 22:44   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 17/24] lustre: lnet: add the Push target NeilBrown
  24 siblings, 2 replies; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add a msg_type field to the lnet_event structure. This makes
it possible for an event handler to tell whether LNET_EVENT_SEND
corresponds to a GET or a PUT message.
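
As a sketch of the effect, the check becomes trivial for a handler once the field exists. The types below are simplified userspace stand-ins, not the kernel's real struct lnet_event or message definitions:

```c
#include <assert.h>

/* Simplified stand-ins for enum lnet_event_kind and the message types. */
enum ev_kind { EV_SEND, EV_ACK, EV_REPLY };
enum msg_kind { MSG_PUT = 1, MSG_GET = 2 };

struct mini_event {
	enum ev_kind type;	/* what happened (e.g. EV_SEND) */
	unsigned int msg_type;	/* new field: which message was sent */
};

/* Without msg_type, a handler seeing EV_SEND cannot tell whether the
 * outgoing message was a PUT or a GET; with it, the check is trivial. */
static int send_was_put(const struct mini_event *ev)
{
	return ev->type == EV_SEND && ev->msg_type == MSG_PUT;
}
```

In the patch itself the field is filled in at one spot, lnet_build_msg_event(), so every event carries the type of the message it describes.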

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25785
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../lustre/include/uapi/linux/lnet/lnet-types.h    |    5 +++++
 drivers/staging/lustre/lnet/lnet/lib-msg.c         |    1 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
index e0e4fd259795..1ecf18e4a278 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
@@ -650,6 +650,11 @@ struct lnet_event {
 	 * \see LNetPut
 	 */
 	__u64			hdr_data;
+	/**
+	 * The message type, to ensure a handler for LNET_EVENT_SEND can
+	 * distinguish between LNET_MSG_GET and LNET_MSG_PUT.
+	 */
+	__u32               msg_type;
 	/**
 	 * Indicates the completion status of the operation. It's 0 for
 	 * successful operations, otherwise it's an error code.
diff --git a/drivers/staging/lustre/lnet/lnet/lib-msg.c b/drivers/staging/lustre/lnet/lnet/lib-msg.c
index 1817e54a16a5..db13d01d366f 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-msg.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-msg.c
@@ -63,6 +63,7 @@ lnet_build_msg_event(struct lnet_msg *msg, enum lnet_event_kind ev_type)
 	LASSERT(!msg->msg_routing);
 
 	ev->type = ev_type;
+	ev->msg_type = msg->msg_type;
 
 	if (ev_type == LNET_EVENT_SEND) {
 		/* event for active message */


* [lustre-devel] [PATCH 16/24] lustre: lnet: add discovery thread
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
  2018-10-07 23:19 ` [lustre-devel] [PATCH 11/24] lustre: lnet: introduce LNET_PEER_MULTI_RAIL flag bit NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 22:51   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 15/24] lustre: lnet: add msg_type to lnet_event NeilBrown
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add the discovery thread, which will be used to handle peer
discovery. This change adds the thread and the infrastructure
that starts and stops it. The thread itself does trivial work.

Peer Discovery gets its own event queue (ln_dc_eqh), a queue
for peers that are to be discovered (ln_dc_request), a queue
for peers waiting for an event (ln_dc_working), a wait queue
head so the thread can sleep (ln_dc_waitq), and start/stop
state (ln_dc_state).

Peer discovery is started from lnet_select_pathway(), for
GET and PUT messages not sent to the LNET_RESERVED_PORTAL.
This criterion ensures that discovery is not triggered by the
messages that discovery itself uses, nor by an LNet ping.
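
The trigger criterion reduces to a simple predicate. The sketch below is a userspace mirror of the lnet_msg_discovery() helper added by this patch, with illustrative constants in place of the real LNet ones:

```c
#include <assert.h>
#include <stdbool.h>

#define RESERVED_PORTAL 0	/* stand-in for LNET_RESERVED_PORTAL */

enum msg_kind { MSG_ACK, MSG_PUT, MSG_GET, MSG_REPLY };

struct mini_msg {
	enum msg_kind type;
	unsigned int portal;	/* only meaningful for PUT and GET */
};

/* Only GET/PUT traffic aimed at a non-reserved portal may trigger
 * discovery; ACK and REPLY do not track a portal index in the message
 * structure and never trigger it. */
static bool msg_triggers_discovery(const struct mini_msg *m)
{
	if (m->type == MSG_PUT || m->type == MSG_GET)
		return m->portal != RESERVED_PORTAL;
	return false;
}
```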

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Amir Shehata <amir.shehata@intel.com>
Reviewed-on: https://review.whamcloud.com/25786
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    6 
 .../staging/lustre/include/linux/lnet/lib-types.h  |   71 ++++
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   31 ++
 drivers/staging/lustre/lnet/lnet/lib-move.c        |   45 ++-
 drivers/staging/lustre/lnet/lnet/peer.c            |  325 ++++++++++++++++++++
 5 files changed, 468 insertions(+), 10 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index aad25eb0011b..848d622911a4 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -438,6 +438,7 @@ bool lnet_is_ni_healthy_locked(struct lnet_ni *ni);
 struct lnet_net *lnet_get_net_locked(u32 net_id);
 
 extern unsigned int lnet_numa_range;
+extern unsigned int lnet_peer_discovery_disabled;
 extern int portal_rotor;
 
 int lnet_lib_init(void);
@@ -704,6 +705,9 @@ struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
 struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
 void lnet_peer_net_added(struct lnet_net *net);
 lnet_nid_t lnet_peer_primary_nid_locked(lnet_nid_t nid);
+int lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt);
+int lnet_peer_discovery_start(void);
+void lnet_peer_discovery_stop(void);
 void lnet_peer_tables_cleanup(struct lnet_net *net);
 void lnet_peer_uninit(void);
 int lnet_peer_tables_create(void);
@@ -791,4 +795,6 @@ lnet_peer_ni_is_primary(struct lnet_peer_ni *lpni)
 	return lpni->lpni_nid == lpni->lpni_peer_net->lpn_peer->lp_primary_nid;
 }
 
+bool lnet_peer_is_uptodate(struct lnet_peer *lp);
+
 #endif
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 260619e19bde..6394a3af50b7 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -520,10 +520,61 @@ struct lnet_peer {
 
 	/* peer state flags */
 	unsigned int		lp_state;
+
+	/* link on discovery-related lists */
+	struct list_head	lp_dc_list;
+
+	/* tasks waiting on discovery of this peer */
+	wait_queue_head_t	lp_dc_waitq;
 };
 
-#define LNET_PEER_MULTI_RAIL	BIT(0)
-#define LNET_PEER_CONFIGURED	BIT(1)
+/*
+ * The status flags in lp_state. Their semantics have been chosen so that
+ * lp_state can be zero-initialized.
+ *
+ * A peer is marked MULTI_RAIL in two cases: it was configured using DLC
+ * as multi-rail aware, or the LNET_PING_FEAT_MULTI_RAIL bit was set.
+ *
+ * A peer is marked NO_DISCOVERY if the LNET_PING_FEAT_DISCOVERY bit was
+ * NOT set when the peer was pinged by discovery.
+ */
+#define LNET_PEER_MULTI_RAIL	BIT(0)	/* Multi-rail aware */
+#define LNET_PEER_NO_DISCOVERY	BIT(1)	/* Peer disabled discovery */
+/*
+ * A peer is marked CONFIGURED if it was configured by DLC.
+ *
+ * In addition, a peer is marked DISCOVERED if it has fully passed
+ * through Peer Discovery.
+ *
+ * When Peer Discovery is disabled, the discovery thread will mark
+ * peers REDISCOVER to indicate that they should be re-examined if
+ * discovery is (re)enabled on the node.
+ *
+ * A peer that was created as the result of inbound traffic will not
+ * be marked at all.
+ */
+#define LNET_PEER_CONFIGURED	BIT(2)	/* Configured via DLC */
+#define LNET_PEER_DISCOVERED	BIT(3)	/* Peer was discovered */
+#define LNET_PEER_REDISCOVER	BIT(4)	/* Discovery was disabled */
+/*
+ * A peer is marked DISCOVERING when discovery is in progress.
+ * The other flags below correspond to stages of discovery.
+ */
+#define LNET_PEER_DISCOVERING	BIT(5)	/* Discovering */
+#define LNET_PEER_DATA_PRESENT	BIT(6)	/* Remote peer data present */
+#define LNET_PEER_NIDS_UPTODATE	BIT(7)	/* Remote peer info uptodate */
+#define LNET_PEER_PING_SENT	BIT(8)	/* Waiting for REPLY to Ping */
+#define LNET_PEER_PUSH_SENT	BIT(9)	/* Waiting for ACK of Push */
+#define LNET_PEER_PING_FAILED	BIT(10)	/* Ping send failure */
+#define LNET_PEER_PUSH_FAILED	BIT(11)	/* Push send failure */
+/*
+ * A ping can be forced as a way to fix up state, or as a manual
+ * intervention by an admin.
+ * A push can be forced in circumstances that would normally not
+ * allow for one to happen.
+ */
+#define LNET_PEER_FORCE_PING	BIT(12)	/* Forced Ping */
+#define LNET_PEER_FORCE_PUSH	BIT(13)	/* Forced Push */
 
 struct lnet_peer_net {
 	/* chain on lp_peer_nets */
@@ -775,6 +826,11 @@ struct lnet_msg_container {
 	void			**msc_finalizers;
 };
 
+/* Peer Discovery states */
+#define LNET_DC_STATE_SHUTDOWN		0	/* not started */
+#define LNET_DC_STATE_RUNNING		1	/* started up OK */
+#define LNET_DC_STATE_STOPPING		2	/* telling thread to stop */
+
 /* Router Checker states */
 enum lnet_rc_state {
 	LNET_RC_STATE_SHUTDOWN,	/* not started */
@@ -856,6 +912,17 @@ struct lnet {
 	struct lnet_ping_buffer		 *ln_ping_target;
 	atomic_t			ln_ping_target_seqno;
 
+	/* discovery event queue handle */
+	struct lnet_handle_eq		ln_dc_eqh;
+	/* discovery requests */
+	struct list_head		ln_dc_request;
+	/* discovery working list */
+	struct list_head		ln_dc_working;
+	/* discovery thread wait queue */
+	wait_queue_head_t		ln_dc_waitq;
+	/* discovery startup/shutdown state */
+	int				ln_dc_state;
+
 	/* router checker startup/shutdown state */
 	enum lnet_rc_state		  ln_rc_state;
 	/* router checker's event queue */
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index c48bcb8722a0..dccfd5bcc459 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -78,6 +78,13 @@ module_param_call(lnet_interfaces_max, intf_max_set, param_get_int,
 MODULE_PARM_DESC(lnet_interfaces_max,
 		"Maximum number of interfaces in a node.");
 
+unsigned int lnet_peer_discovery_disabled;
+static int discovery_set(const char *val, const struct kernel_param *kp);
+module_param_call(lnet_peer_discovery_disabled, discovery_set, param_get_int,
+		  &lnet_peer_discovery_disabled, 0644);
+MODULE_PARM_DESC(lnet_peer_discovery_disabled,
+		 "Set to 1 to disable peer discovery on this node.");
+
 /*
  * This sequence number keeps track of how many times DLC was used to
  * update the local NIs. It is incremented when a NI is added or
@@ -90,6 +97,23 @@ static atomic_t lnet_dlc_seq_no = ATOMIC_INIT(0);
 static int lnet_ping(struct lnet_process_id id, signed long timeout,
 		     struct lnet_process_id __user *ids, int n_ids);
 
+static int
+discovery_set(const char *val, const struct kernel_param *kp)
+{
+	int rc;
+	unsigned long value;
+
+	rc = kstrtoul(val, 0, &value);
+	if (rc) {
+		CERROR("Invalid module parameter value for 'lnet_peer_discovery_disabled'\n");
+		return rc;
+	}
+
+	*(unsigned int *)kp->arg = !!value;
+
+	return 0;
+}
+
 static int
 intf_max_set(const char *val, const struct kernel_param *kp)
 {
@@ -1921,6 +1945,10 @@ LNetNIInit(lnet_pid_t requested_pid)
 	if (rc)
 		goto err_stop_ping;
 
+	rc = lnet_peer_discovery_start();
+	if (rc != 0)
+		goto err_stop_router_checker;
+
 	lnet_fault_init();
 	lnet_router_debugfs_init();
 
@@ -1928,6 +1956,8 @@ LNetNIInit(lnet_pid_t requested_pid)
 
 	return 0;
 
+err_stop_router_checker:
+	lnet_router_checker_stop();
 err_stop_ping:
 	lnet_ping_target_fini();
 err_acceptor_stop:
@@ -1976,6 +2006,7 @@ LNetNIFini(void)
 
 		lnet_fault_fini();
 		lnet_router_debugfs_fini();
+		lnet_peer_discovery_stop();
 		lnet_router_checker_stop();
 		lnet_ping_target_fini();
 
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index 4c1eef907dc7..4773180cc7b3 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -1208,6 +1208,27 @@ lnet_get_best_ni(struct lnet_net *local_net, struct lnet_ni *cur_ni,
 	return best_ni;
 }
 
+/*
+ * Traffic to the LNET_RESERVED_PORTAL may not trigger peer discovery,
+ * because such traffic is required to perform discovery. We therefore
+ * exclude all GET and PUT on that portal. We also exclude all ACK and
+ * REPLY traffic, but that is because the portal is not tracked in the
+ * message structure for these message types. We could restrict this
+ * further by also checking for LNET_PROTO_PING_MATCHBITS.
+ */
+static bool
+lnet_msg_discovery(struct lnet_msg *msg)
+{
+	if (msg->msg_type == LNET_MSG_PUT) {
+		if (msg->msg_hdr.msg.put.ptl_index != LNET_RESERVED_PORTAL)
+			return true;
+	} else if (msg->msg_type == LNET_MSG_GET) {
+		if (msg->msg_hdr.msg.get.ptl_index != LNET_RESERVED_PORTAL)
+			return true;
+	}
+	return false;
+}
+
 static int
 lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 		    struct lnet_msg *msg, lnet_nid_t rtr_nid)
@@ -1220,7 +1241,6 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 	struct lnet_peer *peer;
 	struct lnet_peer_net *peer_net;
 	struct lnet_net *local_net;
-	__u32 seq;
 	int cpt, cpt2, rc;
 	bool routing;
 	bool routing2;
@@ -1255,13 +1275,6 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 	routing2 = false;
 	local_found = false;
 
-	seq = lnet_get_dlc_seq_locked();
-
-	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
-		lnet_net_unlock(cpt);
-		return -ESHUTDOWN;
-	}
-
 	/*
 	 * lnet_nid2peerni_locked() is the path that will find an
 	 * existing peer_ni, or create one and mark it as having been
@@ -1272,7 +1285,22 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 		lnet_net_unlock(cpt);
 		return PTR_ERR(lpni);
 	}
+	/*
+	 * Now that we have a peer_ni, check if we want to discover
+	 * the peer. Traffic to the LNET_RESERVED_PORTAL should not
+	 * trigger discovery.
+	 */
 	peer = lpni->lpni_peer_net->lpn_peer;
+	if (lnet_msg_discovery(msg) && !lnet_peer_is_uptodate(peer)) {
+		rc = lnet_discover_peer_locked(lpni, cpt);
+		if (rc) {
+			lnet_peer_ni_decref_locked(lpni);
+			lnet_net_unlock(cpt);
+			return rc;
+		}
+		/* The peer may have changed. */
+		peer = lpni->lpni_peer_net->lpn_peer;
+	}
 	lnet_peer_ni_decref_locked(lpni);
 
 	/* If peer is not healthy then can not send anything to it */
@@ -1701,6 +1729,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 	 */
 	cpt2 = lnet_cpt_of_nid_locked(best_lpni->lpni_nid, best_ni);
 	if (cpt != cpt2) {
+		__u32 seq = lnet_get_dlc_seq_locked();
 		lnet_net_unlock(cpt);
 		cpt = cpt2;
 		lnet_net_lock(cpt);
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index d7a0a2f3bdd9..038b58414ce0 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -201,6 +201,8 @@ lnet_peer_alloc(lnet_nid_t nid)
 
 	INIT_LIST_HEAD(&lp->lp_peer_list);
 	INIT_LIST_HEAD(&lp->lp_peer_nets);
+	INIT_LIST_HEAD(&lp->lp_dc_list);
+	init_waitqueue_head(&lp->lp_dc_waitq);
 	spin_lock_init(&lp->lp_lock);
 	lp->lp_primary_nid = nid;
 	lp->lp_cpt = lnet_nid_cpt_hash(nid, LNET_CPT_NUMBER);
@@ -1457,6 +1459,10 @@ lnet_nid2peerni_ex(lnet_nid_t nid, int cpt)
 	return lpni;
 }
 
+/*
+ * Get a peer_ni for the given nid, create it if necessary. Takes a
+ * hold on the peer_ni.
+ */
 struct lnet_peer_ni *
 lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref, int cpt)
 {
@@ -1510,9 +1516,326 @@ lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref, int cpt)
 	mutex_unlock(&the_lnet.ln_api_mutex);
 	lnet_net_lock(cpt);
 
+	/* Lock has been dropped, check again for shutdown. */
+	if (the_lnet.ln_state == LNET_STATE_SHUTDOWN) {
+		if (!IS_ERR(lpni))
+			lnet_peer_ni_decref_locked(lpni);
+		lpni = ERR_PTR(-ESHUTDOWN);
+	}
+
 	return lpni;
 }
 
+/*
+ * Peer Discovery
+ */
+
+/*
+ * Is a peer uptodate from the point of view of discovery?
+ *
+ * If it is currently being processed, obviously not.
+ * A forced Ping or Push is also handled by the discovery thread.
+ *
+ * Otherwise look at whether the peer needs rediscovering.
+ */
+bool
+lnet_peer_is_uptodate(struct lnet_peer *lp)
+{
+	bool rc;
+
+	spin_lock(&lp->lp_lock);
+	if (lp->lp_state & (LNET_PEER_DISCOVERING |
+			    LNET_PEER_FORCE_PING |
+			    LNET_PEER_FORCE_PUSH)) {
+		rc = false;
+	} else if (lp->lp_state & LNET_PEER_REDISCOVER) {
+		if (lnet_peer_discovery_disabled)
+			rc = true;
+		else
+			rc = false;
+	} else if (lp->lp_state & LNET_PEER_DISCOVERED) {
+		if (lp->lp_state & LNET_PEER_NIDS_UPTODATE)
+			rc = true;
+		else
+			rc = false;
+	} else {
+		rc = false;
+	}
+	spin_unlock(&lp->lp_lock);
+
+	return rc;
+}
+
+/*
+ * Queue a peer for the attention of the discovery thread.  Call with
+ * lnet_net_lock/EX held. Returns 0 if the peer was queued, and
+ * -EALREADY if the peer was already queued.
+ */
+static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
+{
+	int rc;
+
+	spin_lock(&lp->lp_lock);
+	if (!(lp->lp_state & LNET_PEER_DISCOVERING))
+		lp->lp_state |= LNET_PEER_DISCOVERING;
+	spin_unlock(&lp->lp_lock);
+	if (list_empty(&lp->lp_dc_list)) {
+		lnet_peer_addref_locked(lp);
+		list_add_tail(&lp->lp_dc_list, &the_lnet.ln_dc_request);
+		wake_up(&the_lnet.ln_dc_waitq);
+		rc = 0;
+	} else {
+		rc = -EALREADY;
+	}
+
+	return rc;
+}
+
+/*
+ * Discovery of a peer is complete. Wake all waiters on the peer.
+ * Call with lnet_net_lock/EX held.
+ */
+static void lnet_peer_discovery_complete(struct lnet_peer *lp)
+{
+	list_del_init(&lp->lp_dc_list);
+	wake_up_all(&lp->lp_dc_waitq);
+	lnet_peer_decref_locked(lp);
+}
+
+/*
+ * Peer discovery slow path. The ln_api_mutex is held on entry, and
+ * dropped/retaken within this function. An lnet_peer_ni is passed in
+ * because discovery could tear down an lnet_peer.
+ */
+int
+lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt)
+{
+	DEFINE_WAIT(wait);
+	struct lnet_peer *lp;
+	int rc = 0;
+
+again:
+	lnet_net_unlock(cpt);
+	lnet_net_lock(LNET_LOCK_EX);
+
+	/* We're willing to be interrupted. */
+	for (;;) {
+		lp = lpni->lpni_peer_net->lpn_peer;
+		prepare_to_wait(&lp->lp_dc_waitq, &wait, TASK_INTERRUPTIBLE);
+		if (signal_pending(current))
+			break;
+		if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
+			break;
+		if (lnet_peer_is_uptodate(lp))
+			break;
+		lnet_peer_queue_for_discovery(lp);
+		lnet_peer_addref_locked(lp);
+		lnet_net_unlock(LNET_LOCK_EX);
+		schedule();
+		finish_wait(&lp->lp_dc_waitq, &wait);
+		lnet_net_lock(LNET_LOCK_EX);
+		lnet_peer_decref_locked(lp);
+		/* Do not use lp beyond this point. */
+	}
+	finish_wait(&lp->lp_dc_waitq, &wait);
+
+	lnet_net_unlock(LNET_LOCK_EX);
+	lnet_net_lock(cpt);
+
+	if (signal_pending(current))
+		rc = -EINTR;
+	else if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
+		rc = -ESHUTDOWN;
+	else if (!lnet_peer_is_uptodate(lp))
+		goto again;
+
+	return rc;
+}
+
+/*
+ * Event handler for the discovery EQ.
+ *
+ * Called with lnet_res_lock(cpt) held. The cpt is the
+ * lnet_cpt_of_cookie() of the md handle cookie.
+ */
+static void lnet_discovery_event_handler(struct lnet_event *event)
+{
+	wake_up(&the_lnet.ln_dc_waitq);
+}
+
+/*
+ * Wait for work to be queued or some other change that must be
+ * attended to. Returns non-zero if the discovery thread should shut
+ * down.
+ */
+static int lnet_peer_discovery_wait_for_work(void)
+{
+	int cpt;
+	int rc = 0;
+
+	DEFINE_WAIT(wait);
+
+	cpt = lnet_net_lock_current();
+	for (;;) {
+		prepare_to_wait(&the_lnet.ln_dc_waitq, &wait,
+				TASK_IDLE);
+		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
+			break;
+		if (!list_empty(&the_lnet.ln_dc_request))
+			break;
+		lnet_net_unlock(cpt);
+		schedule();
+		finish_wait(&the_lnet.ln_dc_waitq, &wait);
+		cpt = lnet_net_lock_current();
+	}
+	finish_wait(&the_lnet.ln_dc_waitq, &wait);
+
+	if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
+		rc = -ESHUTDOWN;
+
+	lnet_net_unlock(cpt);
+
+	CDEBUG(D_NET, "woken: %d\n", rc);
+
+	return rc;
+}
+
+/* The discovery thread. */
+static int lnet_peer_discovery(void *arg)
+{
+	struct lnet_peer *lp;
+
+	CDEBUG(D_NET, "started\n");
+
+	for (;;) {
+		if (lnet_peer_discovery_wait_for_work())
+			break;
+
+		lnet_net_lock(LNET_LOCK_EX);
+		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
+			break;
+		while (!list_empty(&the_lnet.ln_dc_request)) {
+			lp = list_first_entry(&the_lnet.ln_dc_request,
+					      struct lnet_peer, lp_dc_list);
+			list_move(&lp->lp_dc_list, &the_lnet.ln_dc_working);
+			lnet_net_unlock(LNET_LOCK_EX);
+
+			/* Just tag and release for now. */
+			spin_lock(&lp->lp_lock);
+			if (lnet_peer_discovery_disabled) {
+				lp->lp_state |= LNET_PEER_REDISCOVER;
+				lp->lp_state &= ~(LNET_PEER_DISCOVERED |
+						  LNET_PEER_NIDS_UPTODATE |
+						  LNET_PEER_DISCOVERING);
+			} else {
+				lp->lp_state |= (LNET_PEER_DISCOVERED |
+						 LNET_PEER_NIDS_UPTODATE);
+				lp->lp_state &= ~(LNET_PEER_REDISCOVER |
+						  LNET_PEER_DISCOVERING);
+			}
+			spin_unlock(&lp->lp_lock);
+
+			lnet_net_lock(LNET_LOCK_EX);
+			if (!(lp->lp_state & LNET_PEER_DISCOVERING))
+				lnet_peer_discovery_complete(lp);
+			if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
+				break;
+		}
+		lnet_net_unlock(LNET_LOCK_EX);
+	}
+
+	CDEBUG(D_NET, "stopping\n");
+	/*
+	 * Clean up before telling lnet_peer_discovery_stop() that
+	 * we're done. Use wake_up() below to somewhat reduce the
+	 * size of the thundering herd if there are multiple threads
+	 * waiting on discovery of a single peer.
+	 */
+	LNetEQFree(the_lnet.ln_dc_eqh);
+	LNetInvalidateEQHandle(&the_lnet.ln_dc_eqh);
+
+	lnet_net_lock(LNET_LOCK_EX);
+	list_for_each_entry(lp, &the_lnet.ln_dc_request, lp_dc_list) {
+		spin_lock(&lp->lp_lock);
+		lp->lp_state |= LNET_PEER_REDISCOVER;
+		lp->lp_state &= ~(LNET_PEER_DISCOVERED |
+				  LNET_PEER_DISCOVERING |
+				  LNET_PEER_NIDS_UPTODATE);
+		spin_unlock(&lp->lp_lock);
+		lnet_peer_discovery_complete(lp);
+	}
+	list_for_each_entry(lp, &the_lnet.ln_dc_working, lp_dc_list) {
+		spin_lock(&lp->lp_lock);
+		lp->lp_state |= LNET_PEER_REDISCOVER;
+		lp->lp_state &= ~(LNET_PEER_DISCOVERED |
+				  LNET_PEER_DISCOVERING |
+				  LNET_PEER_NIDS_UPTODATE);
+		spin_unlock(&lp->lp_lock);
+		lnet_peer_discovery_complete(lp);
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	the_lnet.ln_dc_state = LNET_DC_STATE_SHUTDOWN;
+	wake_up(&the_lnet.ln_dc_waitq);
+
+	CDEBUG(D_NET, "stopped\n");
+
+	return 0;
+}
+
+/* ln_api_mutex is held on entry. */
+int lnet_peer_discovery_start(void)
+{
+	struct task_struct *task;
+	int rc;
+
+	if (the_lnet.ln_dc_state != LNET_DC_STATE_SHUTDOWN)
+		return -EALREADY;
+
+	INIT_LIST_HEAD(&the_lnet.ln_dc_request);
+	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
+	init_waitqueue_head(&the_lnet.ln_dc_waitq);
+
+	rc = LNetEQAlloc(0, lnet_discovery_event_handler, &the_lnet.ln_dc_eqh);
+	if (rc != 0) {
+		CERROR("Can't allocate discovery EQ: %d\n", rc);
+		return rc;
+	}
+
+	the_lnet.ln_dc_state = LNET_DC_STATE_RUNNING;
+	task = kthread_run(lnet_peer_discovery, NULL, "lnet_discovery");
+	if (IS_ERR(task)) {
+		rc = PTR_ERR(task);
+		CERROR("Can't start peer discovery thread: %d\n", rc);
+
+		LNetEQFree(the_lnet.ln_dc_eqh);
+		LNetInvalidateEQHandle(&the_lnet.ln_dc_eqh);
+
+		the_lnet.ln_dc_state = LNET_DC_STATE_SHUTDOWN;
+	}
+
+	return rc;
+}
+
+/* ln_api_mutex is held on entry. */
+void lnet_peer_discovery_stop(void)
+{
+	if (the_lnet.ln_dc_state == LNET_DC_STATE_SHUTDOWN)
+		return;
+
+	LASSERT(the_lnet.ln_dc_state == LNET_DC_STATE_RUNNING);
+	the_lnet.ln_dc_state = LNET_DC_STATE_STOPPING;
+	wake_up(&the_lnet.ln_dc_waitq);
+
+	wait_event(the_lnet.ln_dc_waitq,
+		   the_lnet.ln_dc_state == LNET_DC_STATE_SHUTDOWN);
+
+	LASSERT(list_empty(&the_lnet.ln_dc_request));
+	LASSERT(list_empty(&the_lnet.ln_dc_working));
+}
+
+/* Debugging */
+
 void
 lnet_debug_peer(lnet_nid_t nid)
 {
@@ -1544,6 +1867,8 @@ lnet_debug_peer(lnet_nid_t nid)
 	lnet_net_unlock(cpt);
 }
 
+/* Gathering information for userspace. */
+
 int
 lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
 		      char aliveness[LNET_MAX_STR_LEN],


* [lustre-devel] [PATCH 17/24] lustre: lnet: add the Push target
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
  2018-10-07 23:19 ` [lustre-devel] [PATCH 15/24] lustre: lnet: add msg_type to lnet_event NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 22:58   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 24/24] lustre: lnet: balance references in lnet_discover_peer_locked() NeilBrown
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Peer Discovery will send a Push message (same format as an
LNet Ping) to Multi-Rail capable peers to give the peer the
list of local interfaces.

Set up a target buffer for these pushes in the_lnet. The
size of this buffer defaults to LNET_MIN_INTERFACES, but it
is resized if required.
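
The grow-only sizing can be sketched as follows. This is a hypothetical miniature of the push-target bookkeeping; in the real code the buffer is an lnet_ping_buffer reallocated under the net lock, with the event handler cleaning up the old one:

```c
#include <assert.h>

struct mini_push_state {
	int buf_nnis;		/* capacity of the current buffer */
	int wanted_nnis;	/* desired size, raised as peers grow */
};

/* Mirrors the shape of lnet_push_target_resize_needed(). */
static int resize_needed(const struct mini_push_state *s)
{
	return s->buf_nnis < s->wanted_nnis;
}

/* Grow capacity to the desired size; returns 1 if a resize happened,
 * 0 if the buffer was already big enough. */
static int ensure_capacity(struct mini_push_state *s)
{
	if (!resize_needed(s))
		return 0;
	s->buf_nnis = s->wanted_nnis;	/* reallocation in the real code */
	return 1;
}
```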

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25788
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    8 +
 .../staging/lustre/include/linux/lnet/lib-types.h  |   25 +++
 drivers/staging/lustre/lnet/lnet/api-ni.c          |  150 ++++++++++++++++++++
 drivers/staging/lustre/lnet/lnet/peer.c            |    5 +
 4 files changed, 187 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 848d622911a4..5632e5aadf41 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -686,6 +686,14 @@ static inline int lnet_ping_buffer_numref(struct lnet_ping_buffer *pbuf)
 	return atomic_read(&pbuf->pb_refcnt);
 }
 
+static inline int lnet_push_target_resize_needed(void)
+{
+	return the_lnet.ln_push_target->pb_nnis < the_lnet.ln_push_target_nnis;
+}
+
+int lnet_push_target_resize(void);
+void lnet_peer_push_event(struct lnet_event *ev);
+
 int lnet_parse_ip2nets(char **networksp, char *ip2nets);
 int lnet_parse_routes(char *route_str, int *im_a_router);
 int lnet_parse_networks(struct list_head *nilist, char *networks,
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 6394a3af50b7..e00c13355d43 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -521,6 +521,18 @@ struct lnet_peer {
 	/* peer state flags */
 	unsigned int		lp_state;
 
+	/* buffer for data pushed by peer */
+	struct lnet_ping_buffer	*lp_data;
+
+	/* number of NIDs for sizing push data */
+	int			lp_data_nnis;
+
+	/* NI config sequence number of peer */
+	__u32			lp_peer_seqno;
+
+	/* Local NI config sequence number peer knows */
+	__u32			lp_node_seqno;
+
 	/* link on discovery-related lists */
 	struct list_head	lp_dc_list;
 
@@ -912,6 +924,19 @@ struct lnet {
 	struct lnet_ping_buffer		 *ln_ping_target;
 	atomic_t			ln_ping_target_seqno;
 
+	/*
+	 * Push Target
+	 *
+	 * ln_push_target_nnis contains the desired size of the push target.
+	 * The lnet_net_lock is used to handle update races. The old
+	 * buffer may linger a while after it has been unlinked, in
+	 * which case the event handler cleans up.
+	 */
+	struct lnet_handle_eq		ln_push_target_eq;
+	struct lnet_handle_md		ln_push_target_md;
+	struct lnet_ping_buffer		*ln_push_target;
+	int				ln_push_target_nnis;
+
 	/* discovery event queue handle */
 	struct lnet_handle_eq		ln_dc_eqh;
 	/* discovery requests */
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index dccfd5bcc459..e6bc54e9de71 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -1268,6 +1268,147 @@ lnet_ping_target_fini(void)
 	lnet_ping_target_destroy();
 }
 
+/* Resize the push target. */
+int lnet_push_target_resize(void)
+{
+	struct lnet_process_id id = { LNET_NID_ANY, LNET_PID_ANY };
+	struct lnet_md md = { NULL };
+	struct lnet_handle_me meh;
+	struct lnet_handle_md mdh;
+	struct lnet_handle_md old_mdh;
+	struct lnet_ping_buffer *pbuf;
+	struct lnet_ping_buffer *old_pbuf;
+	int nnis = the_lnet.ln_push_target_nnis;
+	int rc;
+
+	if (nnis <= 0) {
+		rc = -EINVAL;
+		goto fail_return;
+	}
+again:
+	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
+	if (!pbuf) {
+		rc = -ENOMEM;
+		goto fail_return;
+	}
+
+	rc = LNetMEAttach(LNET_RESERVED_PORTAL, id,
+			  LNET_PROTO_PING_MATCHBITS, 0,
+			  LNET_UNLINK, LNET_INS_AFTER,
+			  &meh);
+	if (rc) {
+		CERROR("Can't create push target ME: %d\n", rc);
+		goto fail_decref_pbuf;
+	}
+
+	/* initialize md content */
+	md.start     = &pbuf->pb_info;
+	md.length    = LNET_PING_INFO_SIZE(nnis);
+	md.threshold = LNET_MD_THRESH_INF;
+	md.max_size  = 0;
+	md.options   = LNET_MD_OP_PUT | LNET_MD_TRUNCATE |
+		       LNET_MD_MANAGE_REMOTE;
+	md.user_ptr  = pbuf;
+	md.eq_handle = the_lnet.ln_push_target_eq;
+
+	rc = LNetMDAttach(meh, md, LNET_RETAIN, &mdh);
+	if (rc) {
+		CERROR("Can't attach push MD: %d\n", rc);
+		goto fail_unlink_meh;
+	}
+	lnet_ping_buffer_addref(pbuf);
+
+	lnet_net_lock(LNET_LOCK_EX);
+	old_pbuf = the_lnet.ln_push_target;
+	old_mdh = the_lnet.ln_push_target_md;
+	the_lnet.ln_push_target = pbuf;
+	the_lnet.ln_push_target_md = mdh;
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	if (old_pbuf) {
+		LNetMDUnlink(old_mdh);
+		lnet_ping_buffer_decref(old_pbuf);
+	}
+
+	if (nnis < the_lnet.ln_push_target_nnis)
+		goto again;
+
+	CDEBUG(D_NET, "nnis %d success\n", nnis);
+
+	return 0;
+
+fail_unlink_meh:
+	LNetMEUnlink(meh);
+fail_decref_pbuf:
+	lnet_ping_buffer_decref(pbuf);
+fail_return:
+	CDEBUG(D_NET, "nnis %d error %d\n", nnis, rc);
+	return rc;
+}
+
+static void lnet_push_target_event_handler(struct lnet_event *ev)
+{
+	struct lnet_ping_buffer *pbuf = ev->md.user_ptr;
+
+	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
+		lnet_swap_pinginfo(pbuf);
+
+	if (ev->unlinked)
+		lnet_ping_buffer_decref(pbuf);
+}
+
+/* Initialize the push target. */
+static int lnet_push_target_init(void)
+{
+	int rc;
+
+	if (the_lnet.ln_push_target)
+		return -EALREADY;
+
+	rc = LNetEQAlloc(0, lnet_push_target_event_handler,
+			 &the_lnet.ln_push_target_eq);
+	if (rc) {
+		CERROR("Can't allocate push target EQ: %d\n", rc);
+		return rc;
+	}
+
+	/* Start at the required minimum, we'll enlarge if required. */
+	the_lnet.ln_push_target_nnis = LNET_INTERFACES_MIN;
+
+	rc = lnet_push_target_resize();
+
+	if (rc) {
+		LNetEQFree(the_lnet.ln_push_target_eq);
+		LNetInvalidateEQHandle(&the_lnet.ln_push_target_eq);
+	}
+
+	return rc;
+}
+
+/* Clean up the push target. */
+static void lnet_push_target_fini(void)
+{
+	if (!the_lnet.ln_push_target)
+		return;
+
+	/* Unlink and invalidate to prevent new references. */
+	LNetMDUnlink(the_lnet.ln_push_target_md);
+	LNetInvalidateMDHandle(&the_lnet.ln_push_target_md);
+
+	/* Wait for the unlink to complete. */
+	while (lnet_ping_buffer_numref(the_lnet.ln_push_target) > 1) {
+		CDEBUG(D_NET, "Still waiting for push target MD to unlink\n");
+		schedule_timeout_uninterruptible(HZ);
+	}
+
+	lnet_ping_buffer_decref(the_lnet.ln_push_target);
+	the_lnet.ln_push_target = NULL;
+	the_lnet.ln_push_target_nnis = 0;
+
+	LNetEQFree(the_lnet.ln_push_target_eq);
+	LNetInvalidateEQHandle(&the_lnet.ln_push_target_eq);
+}
+
 static int
 lnet_ni_tq_credits(struct lnet_ni *ni)
 {
@@ -1945,10 +2086,14 @@ LNetNIInit(lnet_pid_t requested_pid)
 	if (rc)
 		goto err_stop_ping;
 
-	rc = lnet_peer_discovery_start();
+	rc = lnet_push_target_init();
 	if (rc != 0)
 		goto err_stop_router_checker;
 
+	rc = lnet_peer_discovery_start();
+	if (rc != 0)
+		goto err_destroy_push_target;
+
 	lnet_fault_init();
 	lnet_router_debugfs_init();
 
@@ -1956,6 +2101,8 @@ LNetNIInit(lnet_pid_t requested_pid)
 
 	return 0;
 
+err_destroy_push_target:
+	lnet_push_target_fini();
 err_stop_router_checker:
 	lnet_router_checker_stop();
 err_stop_ping:
@@ -2007,6 +2154,7 @@ LNetNIFini(void)
 		lnet_fault_fini();
 		lnet_router_debugfs_fini();
 		lnet_peer_discovery_stop();
+		lnet_push_target_fini();
 		lnet_router_checker_stop();
 		lnet_ping_target_fini();
 
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 038b58414ce0..b78f99c354de 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -1681,6 +1681,8 @@ static int lnet_peer_discovery_wait_for_work(void)
 				TASK_IDLE);
 		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
 			break;
+		if (lnet_push_target_resize_needed())
+			break;
 		if (!list_empty(&the_lnet.ln_dc_request))
 			break;
 		lnet_net_unlock(cpt);
@@ -1711,6 +1713,9 @@ static int lnet_peer_discovery(void *arg)
 		if (lnet_peer_discovery_wait_for_work())
 			break;
 
+		if (lnet_push_target_resize_needed())
+			lnet_push_target_resize();
+
 		lnet_net_lock(LNET_LOCK_EX);
 		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
 			break;


* [lustre-devel] [PATCH 18/24] lustre: lnet: implement Peer Discovery
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Implement Peer Discovery.

A peer is queued for discovery by lnet_peer_queue_for_discovery().
This sets LNET_PEER_DISCOVERING to indicate that discovery is in
progress.

The discovery thread lnet_peer_discovery() checks the peer and
updates its state as appropriate.

If LNET_PEER_DATA_PRESENT is set, then a valid Push message or
Ping reply has been received. The peer is updated in accordance
with the data, and LNET_PEER_NIDS_UPTODATE is set.

If LNET_PEER_PING_FAILED is set, then an attempt to send a Ping
message failed, and peer state is updated accordingly. The discovery
thread can do some cleanup like unlinking an MD that cannot be done
from the message event handler.

If LNET_PEER_PUSH_FAILED is set, then an attempt to send a Push
message failed, and peer state is updated accordingly. The discovery
thread can do some cleanup like unlinking an MD that cannot be done
from the message event handler.

If LNET_PEER_PING_REQUIRED is set, we must Ping the peer in order to
correctly update our knowledge of it. This is set, for example, if
we receive a Push message for a peer, but cannot handle it because
the Push target was too small. In such a case we know that the
state of the peer is incorrect, but need to do extra work to obtain
the required information.

If discovery is not enabled, then the discovery process stops here
and the peer is marked with LNET_PEER_UNDISCOVERED. This tells the
discovery process that it doesn't need to revisit the peer while
discovery remains disabled.

If LNET_PEER_NIDS_UPTODATE is not set, then we have reason to think
the lnet_peer is not up to date, and will Ping it.

The peer needs a Push if it is multi-rail and the ping buffer
sequence number for this node is newer than the sequence number it
has acknowledged receiving by sending an Ack of a Push.

If none of the above is true, then discovery has completed its work
on the peer.

Discovery signals that it is done with a peer by clearing the
LNET_PEER_DISCOVERING flag, and setting LNET_PEER_DISCOVERED or
LNET_PEER_UNDISCOVERED as appropriate. It then dequeues the peer
and clears the LNET_PEER_QUEUED flag.

When the local node is discovered via the loopback network, the
peer structure that is created will have an lnet_peer_ni for the
local loopback interface. Subsequent traffic from this node to
itself will use the loopback net.
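The flag-driven dispatch described above can be summarized in a small standalone C model (an illustration only, not the kernel code; the flag names and the checking order mirror this description, but discovery_action() and its return strings are stand-ins, and needs_push models the Multi-Rail sequence-number comparison):

```c
/* Peer state flags, mirroring the names used in the description. */
#define LNET_PEER_DATA_PRESENT		(1 << 0)
#define LNET_PEER_PING_FAILED		(1 << 1)
#define LNET_PEER_PUSH_FAILED		(1 << 2)
#define LNET_PEER_PING_REQUIRED		(1 << 3)
#define LNET_PEER_NIDS_UPTODATE		(1 << 4)
#define LNET_PEER_MULTI_RAIL		(1 << 5)

/* Returns the action the discovery thread would take for a peer in
 * the given state, checked in the order laid out above. */
const char *discovery_action(unsigned int state, int discovery_enabled,
			     int needs_push)
{
	if (state & LNET_PEER_DATA_PRESENT)
		return "process ping/push data";
	if (state & LNET_PEER_PING_FAILED)
		return "clean up failed ping";
	if (state & LNET_PEER_PUSH_FAILED)
		return "clean up failed push";
	if (state & LNET_PEER_PING_REQUIRED)
		return "send ping";
	if (!discovery_enabled)
		return "mark undiscovered";
	if (!(state & LNET_PEER_NIDS_UPTODATE))
		return "send ping";
	if ((state & LNET_PEER_MULTI_RAIL) && needs_push)
		return "send push";
	return "mark discovered";
}
```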

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25789
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |   20 
 .../staging/lustre/include/linux/lnet/lib-types.h  |   39 +
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   59 +
 drivers/staging/lustre/lnet/lnet/lib-move.c        |   18 
 drivers/staging/lustre/lnet/lnet/peer.c            | 1499 +++++++++++++++++++-
 5 files changed, 1543 insertions(+), 92 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 5632e5aadf41..f82a699371f2 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -76,6 +76,9 @@ extern struct lnet the_lnet;	/* THE network */
 #define LNET_ACCEPTOR_MIN_RESERVED_PORT    512
 #define LNET_ACCEPTOR_MAX_RESERVED_PORT    1023
 
+/* Discovery timeout - same as default peer_timeout */
+#define DISCOVERY_TIMEOUT	180
+
 static inline int lnet_is_route_alive(struct lnet_route *route)
 {
 	/* gateway is down */
@@ -713,9 +716,10 @@ struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
 struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
 void lnet_peer_net_added(struct lnet_net *net);
 lnet_nid_t lnet_peer_primary_nid_locked(lnet_nid_t nid);
-int lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt);
+int lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block);
 int lnet_peer_discovery_start(void);
 void lnet_peer_discovery_stop(void);
+void lnet_push_update_to_peers(int force);
 void lnet_peer_tables_cleanup(struct lnet_net *net);
 void lnet_peer_uninit(void);
 int lnet_peer_tables_create(void);
@@ -805,4 +809,18 @@ lnet_peer_ni_is_primary(struct lnet_peer_ni *lpni)
 
 bool lnet_peer_is_uptodate(struct lnet_peer *lp);
 
+static inline bool
+lnet_peer_needs_push(struct lnet_peer *lp)
+{
+	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL))
+		return false;
+	if (lp->lp_state & LNET_PEER_FORCE_PUSH)
+		return true;
+	if (lp->lp_state & LNET_PEER_NO_DISCOVERY)
+		return false;
+	if (lp->lp_node_seqno < atomic_read(&the_lnet.ln_ping_target_seqno))
+		return true;
+	return false;
+}
+
 #endif
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index e00c13355d43..07baa86e61ab 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -67,6 +67,13 @@ struct lnet_msg {
 	lnet_nid_t		msg_from;
 	__u32			msg_type;
 
+	/*
+	 * hold parameters in case the message is withheld due
+	 * to discovery
+	 */
+	lnet_nid_t		msg_src_nid_param;
+	lnet_nid_t		msg_rtr_nid_param;
+
 	/* committed for sending */
 	unsigned int		msg_tx_committed:1;
 	/* CPT # this message committed for sending */
@@ -395,6 +402,8 @@ struct lnet_ping_buffer {
 #define LNET_PING_BUFFER_LONI(PBUF)	((PBUF)->pb_info.pi_ni[0].ns_nid)
 #define LNET_PING_BUFFER_SEQNO(PBUF)	((PBUF)->pb_info.pi_ni[0].ns_status)
 
+#define LNET_PING_INFO_TO_BUFFER(PINFO)	\
+	container_of((PINFO), struct lnet_ping_buffer, pb_info)
 
 /* router checker data, per router */
 struct lnet_rc_data {
@@ -503,6 +512,9 @@ struct lnet_peer {
 	/* list of peer nets */
 	struct list_head	lp_peer_nets;
 
+	/* list of messages pending discovery */
+	struct list_head	lp_dc_pendq;
+
 	/* primary NID of the peer */
 	lnet_nid_t		lp_primary_nid;
 
@@ -524,15 +536,36 @@ struct lnet_peer {
 	/* buffer for data pushed by peer */
 	struct lnet_ping_buffer	*lp_data;
 
+	/* MD handle for ping in progress */
+	struct lnet_handle_md	lp_ping_mdh;
+
+	/* MD handle for push in progress */
+	struct lnet_handle_md	lp_push_mdh;
+
 	/* number of NIDs for sizing push data */
 	int			lp_data_nnis;
 
 	/* NI config sequence number of peer */
 	__u32			lp_peer_seqno;
 
-	/* Local NI config sequence number peer knows */
+	/* Local NI config sequence number acked by peer */
 	__u32			lp_node_seqno;
 
+	/* Local NI config sequence number sent to peer */
+	__u32			lp_node_seqno_sent;
+
+	/* Ping error encountered during discovery. */
+	int			lp_ping_error;
+
+	/* Push error encountered during discovery. */
+	int			lp_push_error;
+
+	/* Error encountered during discovery. */
+	int			lp_dc_error;
+
+	/* time it was put on the ln_dc_working queue */
+	time64_t		lp_last_queued;
+
 	/* link on discovery-related lists */
 	struct list_head	lp_dc_list;
 
@@ -691,6 +724,8 @@ struct lnet_remotenet {
 #define LNET_CREDIT_OK		0
 /** lnet message is waiting for credit */
 #define LNET_CREDIT_WAIT	1
+/** lnet message is waiting for discovery */
+#define LNET_DC_WAIT		2
 
 struct lnet_rtrbufpool {
 	struct list_head	rbp_bufs;	/* my free buffer pool */
@@ -943,6 +978,8 @@ struct lnet {
 	struct list_head		ln_dc_request;
 	/* discovery working list */
 	struct list_head		ln_dc_working;
+	/* discovery expired list */
+	struct list_head		ln_dc_expired;
 	/* discovery thread wait queue */
 	wait_queue_head_t		ln_dc_waitq;
 	/* discovery startup/shutdown state */
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index e6bc54e9de71..955d1711eda4 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -41,7 +41,14 @@
 
 #define D_LNI D_CONSOLE
 
-struct lnet the_lnet;		/* THE state of the network */
+/*
+ * initialize ln_api_mutex statically, since it needs to be used in
+ * the discovery_set callback. That module parameter callback can be called
+ * before module init completes. The mutex needs to be ready for use then.
+ */
+struct lnet the_lnet = {
+	.ln_api_mutex = __MUTEX_INITIALIZER(the_lnet.ln_api_mutex),
+};		/* THE state of the network */
 EXPORT_SYMBOL(the_lnet);
 
 static char *ip2nets = "";
@@ -101,7 +108,9 @@ static int
 discovery_set(const char *val, const struct kernel_param *kp)
 {
 	int rc;
+	unsigned int *discovery = (unsigned int *)kp->arg;
 	unsigned long value;
+	struct lnet_ping_buffer *pbuf;
 
 	rc = kstrtoul(val, 0, &value);
 	if (rc) {
@@ -109,7 +118,38 @@ discovery_set(const char *val, const struct kernel_param *kp)
 		return rc;
 	}
 
-	*(unsigned int *)kp->arg = !!value;
+	value = !!value;
+
+	/*
+	 * The purpose of locking the api_mutex here is to ensure that
+	 * the correct value ends up stored properly.
+	 */
+	mutex_lock(&the_lnet.ln_api_mutex);
+
+	if (value == *discovery) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	*discovery = value;
+
+	if (the_lnet.ln_state == LNET_STATE_SHUTDOWN) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	/* tell peers that discovery setting has changed */
+	lnet_net_lock(LNET_LOCK_EX);
+	pbuf = the_lnet.ln_ping_target;
+	if (value)
+		pbuf->pb_info.pi_features &= ~LNET_PING_FEAT_DISCOVERY;
+	else
+		pbuf->pb_info.pi_features |= LNET_PING_FEAT_DISCOVERY;
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	lnet_push_update_to_peers(1);
+
+	mutex_unlock(&the_lnet.ln_api_mutex);
 
 	return 0;
 }
@@ -171,7 +211,6 @@ lnet_init_locks(void)
 	init_waitqueue_head(&the_lnet.ln_eq_waitq);
 	init_waitqueue_head(&the_lnet.ln_rc_waitq);
 	mutex_init(&the_lnet.ln_lnd_mutex);
-	mutex_init(&the_lnet.ln_api_mutex);
 }
 
 static int
@@ -654,6 +693,10 @@ lnet_prepare(lnet_pid_t requested_pid)
 	INIT_LIST_HEAD(&the_lnet.ln_routers);
 	INIT_LIST_HEAD(&the_lnet.ln_drop_rules);
 	INIT_LIST_HEAD(&the_lnet.ln_delay_rules);
+	INIT_LIST_HEAD(&the_lnet.ln_dc_request);
+	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
+	INIT_LIST_HEAD(&the_lnet.ln_dc_expired);
+	init_waitqueue_head(&the_lnet.ln_dc_waitq);
 
 	rc = lnet_create_remote_nets_table();
 	if (rc)
@@ -998,7 +1041,8 @@ lnet_ping_target_create(int nnis)
 	pbuf->pb_info.pi_nnis = nnis;
 	pbuf->pb_info.pi_pid = the_lnet.ln_pid;
 	pbuf->pb_info.pi_magic = LNET_PROTO_PING_MAGIC;
-	pbuf->pb_info.pi_features = LNET_PING_FEAT_NI_STATUS;
+	pbuf->pb_info.pi_features =
+		LNET_PING_FEAT_NI_STATUS | LNET_PING_FEAT_MULTI_RAIL;
 
 	return pbuf;
 }
@@ -1231,6 +1275,8 @@ lnet_ping_target_update(struct lnet_ping_buffer *pbuf,
 
 	if (!the_lnet.ln_routing)
 		pbuf->pb_info.pi_features |= LNET_PING_FEAT_RTE_DISABLED;
+	if (!lnet_peer_discovery_disabled)
+		pbuf->pb_info.pi_features |= LNET_PING_FEAT_DISCOVERY;
 
 	/* Ensure only known feature bits have been set. */
 	LASSERT(pbuf->pb_info.pi_features & LNET_PING_FEAT_BITS);
@@ -1252,6 +1298,8 @@ lnet_ping_target_update(struct lnet_ping_buffer *pbuf,
 		lnet_ping_md_unlink(old_pbuf, &old_ping_md);
 		lnet_ping_buffer_decref(old_pbuf);
 	}
+
+	lnet_push_update_to_peers(0);
 }
 
 static void
@@ -1353,6 +1401,7 @@ static void lnet_push_target_event_handler(struct lnet_event *ev)
 	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
 		lnet_swap_pinginfo(pbuf);
 
+	lnet_peer_push_event(ev);
 	if (ev->unlinked)
 		lnet_ping_buffer_decref(pbuf);
 }
@@ -1910,8 +1959,6 @@ int lnet_lib_init(void)
 
 	lnet_assert_wire_constants();
 
-	memset(&the_lnet, 0, sizeof(the_lnet));
-
 	/* refer to global cfs_cpt_tab for now */
 	the_lnet.ln_cpt_table	= cfs_cpt_tab;
 	the_lnet.ln_cpt_number	= cfs_cpt_number(cfs_cpt_tab);
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index 4773180cc7b3..2ff329bf91ba 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -444,6 +444,8 @@ lnet_prep_send(struct lnet_msg *msg, int type, struct lnet_process_id target,
 
 	memset(&msg->msg_hdr, 0, sizeof(msg->msg_hdr));
 	msg->msg_hdr.type	   = cpu_to_le32(type);
+	/* dest_nid will be overwritten by lnet_select_pathway() */
+	msg->msg_hdr.dest_nid       = cpu_to_le64(target.nid);
 	msg->msg_hdr.dest_pid       = cpu_to_le32(target.pid);
 	/* src_nid will be set later */
 	msg->msg_hdr.src_pid	= cpu_to_le32(the_lnet.ln_pid);
@@ -1292,7 +1294,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 	 */
 	peer = lpni->lpni_peer_net->lpn_peer;
 	if (lnet_msg_discovery(msg) && !lnet_peer_is_uptodate(peer)) {
-		rc = lnet_discover_peer_locked(lpni, cpt);
+		rc = lnet_discover_peer_locked(lpni, cpt, false);
 		if (rc) {
 			lnet_peer_ni_decref_locked(lpni);
 			lnet_net_unlock(cpt);
@@ -1300,6 +1302,18 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
 		}
 		/* The peer may have changed. */
 		peer = lpni->lpni_peer_net->lpn_peer;
+		/* queue message and return */
+		msg->msg_src_nid_param = src_nid;
+		msg->msg_rtr_nid_param = rtr_nid;
+		msg->msg_sending = 0;
+		list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
+		lnet_peer_ni_decref_locked(lpni);
+		lnet_net_unlock(cpt);
+
+		CDEBUG(D_NET, "%s pending discovery\n",
+		       libcfs_nid2str(peer->lp_primary_nid));
+
+		return LNET_DC_WAIT;
 	}
 	lnet_peer_ni_decref_locked(lpni);
 
@@ -1840,7 +1854,7 @@ lnet_send(lnet_nid_t src_nid, struct lnet_msg *msg, lnet_nid_t rtr_nid)
 	if (rc == LNET_CREDIT_OK)
 		lnet_ni_send(msg->msg_txni, msg);
 
-	/* rc == LNET_CREDIT_OK or LNET_CREDIT_WAIT */
+	/* rc == LNET_CREDIT_OK or LNET_CREDIT_WAIT or LNET_DC_WAIT */
 	return 0;
 }
 
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index b78f99c354de..1ef4a44e752e 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -38,6 +38,11 @@
 #include <linux/lnet/lib-lnet.h>
 #include <uapi/linux/lnet/lnet-dlc.h>
 
+/* Value indicating that discovery needs to re-check a peer immediately. */
+#define LNET_REDISCOVER_PEER	(1)
+
+static int lnet_peer_queue_for_discovery(struct lnet_peer *lp);
+
 static void
 lnet_peer_remove_from_remote_list(struct lnet_peer_ni *lpni)
 {
@@ -202,6 +207,7 @@ lnet_peer_alloc(lnet_nid_t nid)
 	INIT_LIST_HEAD(&lp->lp_peer_list);
 	INIT_LIST_HEAD(&lp->lp_peer_nets);
 	INIT_LIST_HEAD(&lp->lp_dc_list);
+	INIT_LIST_HEAD(&lp->lp_dc_pendq);
 	init_waitqueue_head(&lp->lp_dc_waitq);
 	spin_lock_init(&lp->lp_lock);
 	lp->lp_primary_nid = nid;
@@ -220,6 +226,10 @@ lnet_destroy_peer_locked(struct lnet_peer *lp)
 	LASSERT(atomic_read(&lp->lp_refcount) == 0);
 	LASSERT(list_empty(&lp->lp_peer_nets));
 	LASSERT(list_empty(&lp->lp_peer_list));
+	LASSERT(list_empty(&lp->lp_dc_list));
+
+	if (lp->lp_data)
+		lnet_ping_buffer_decref(lp->lp_data);
 
 	kfree(lp);
 }
@@ -260,10 +270,19 @@ lnet_peer_detach_peer_ni_locked(struct lnet_peer_ni *lpni)
 	/*
 	 * If there are no more peer nets, make the peer unfindable
 	 * via the peer_tables.
+	 *
+	 * Otherwise, if the peer is DISCOVERED, tell discovery to
+	 * take another look at it. This is a no-op if discovery for
+	 * this peer did the detaching.
 	 */
 	if (list_empty(&lp->lp_peer_nets)) {
 		list_del_init(&lp->lp_peer_list);
 		ptable->pt_peers--;
+	} else if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING) {
+		/* Discovery isn't running, nothing to do here. */
+	} else if (lp->lp_state & LNET_PEER_DISCOVERED) {
+		lnet_peer_queue_for_discovery(lp);
+		wake_up(&the_lnet.ln_dc_waitq);
 	}
 	CDEBUG(D_NET, "peer %s NID %s\n",
 	       libcfs_nid2str(lp->lp_primary_nid),
@@ -599,6 +618,25 @@ lnet_find_peer_ni_locked(lnet_nid_t nid)
 	return lpni;
 }
 
+struct lnet_peer *
+lnet_find_peer(lnet_nid_t nid)
+{
+	struct lnet_peer_ni *lpni;
+	struct lnet_peer *lp = NULL;
+	int cpt;
+
+	cpt = lnet_net_lock_current();
+	lpni = lnet_find_peer_ni_locked(nid);
+	if (lpni) {
+		lp = lpni->lpni_peer_net->lpn_peer;
+		lnet_peer_addref_locked(lp);
+		lnet_peer_ni_decref_locked(lpni);
+	}
+	lnet_net_unlock(cpt);
+
+	return lp;
+}
+
 struct lnet_peer_ni *
 lnet_get_peer_ni_idx_locked(int idx, struct lnet_peer_net **lpn,
 			    struct lnet_peer **lp)
@@ -696,6 +734,37 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 	return lpni;
 }
 
+/*
+ * Start pushes to peers that need to be updated for a configuration
+ * change on this node.
+ */
+void
+lnet_push_update_to_peers(int force)
+{
+	struct lnet_peer_table *ptable;
+	struct lnet_peer *lp;
+	int lncpt;
+	int cpt;
+
+	lnet_net_lock(LNET_LOCK_EX);
+	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
+	for (cpt = 0; cpt < lncpt; cpt++) {
+		ptable = the_lnet.ln_peer_tables[cpt];
+		list_for_each_entry(lp, &ptable->pt_peer_list, lp_peer_list) {
+			if (force) {
+				spin_lock(&lp->lp_lock);
+				if (lp->lp_state & LNET_PEER_MULTI_RAIL)
+					lp->lp_state |= LNET_PEER_FORCE_PUSH;
+				spin_unlock(&lp->lp_lock);
+			}
+			if (lnet_peer_needs_push(lp))
+				lnet_peer_queue_for_discovery(lp);
+		}
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+	wake_up(&the_lnet.ln_dc_waitq);
+}
+
 /*
  * Test whether a ni is a preferred ni for this peer_ni, e.g, whether
  * this is a preferred point-to-point path. Call with lnet_net_lock in
@@ -941,6 +1010,7 @@ lnet_peer_primary_nid_locked(lnet_nid_t nid)
 lnet_nid_t
 LNetPrimaryNID(lnet_nid_t nid)
 {
+	struct lnet_peer *lp;
 	struct lnet_peer_ni *lpni;
 	lnet_nid_t primary_nid = nid;
 	int rc = 0;
@@ -952,7 +1022,15 @@ LNetPrimaryNID(lnet_nid_t nid)
 		rc = PTR_ERR(lpni);
 		goto out_unlock;
 	}
-	primary_nid = lpni->lpni_peer_net->lpn_peer->lp_primary_nid;
+	lp = lpni->lpni_peer_net->lpn_peer;
+	while (!lnet_peer_is_uptodate(lp)) {
+		rc = lnet_discover_peer_locked(lpni, cpt, true);
+		if (rc)
+			goto out_decref;
+		lp = lpni->lpni_peer_net->lpn_peer;
+	}
+	primary_nid = lp->lp_primary_nid;
+out_decref:
 	lnet_peer_ni_decref_locked(lpni);
 out_unlock:
 	lnet_net_unlock(cpt);
@@ -1229,6 +1307,30 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, unsigned int flags)
 	return rc;
 }
 
+/*
+ * Update the primary NID of a peer, if possible.
+ *
+ * Call with the lnet_api_mutex held.
+ */
+static int
+lnet_peer_set_primary_nid(struct lnet_peer *lp, lnet_nid_t nid,
+			  unsigned int flags)
+{
+	lnet_nid_t old = lp->lp_primary_nid;
+	int rc = 0;
+
+	if (lp->lp_primary_nid == nid)
+		goto out;
+	rc = lnet_peer_add_nid(lp, nid, flags);
+	if (rc)
+		goto out;
+	lp->lp_primary_nid = nid;
+out:
+	CDEBUG(D_NET, "peer %s NID %s: %d\n",
+	       libcfs_nid2str(old), libcfs_nid2str(nid), rc);
+	return rc;
+}
+
 /*
  * lpni creation initiated due to traffic either sending or receiving.
  */
@@ -1548,11 +1650,15 @@ lnet_peer_is_uptodate(struct lnet_peer *lp)
 			    LNET_PEER_FORCE_PING |
 			    LNET_PEER_FORCE_PUSH)) {
 		rc = false;
+	} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
+		rc = true;
 	} else if (lp->lp_state & LNET_PEER_REDISCOVER) {
 		if (lnet_peer_discovery_disabled)
 			rc = true;
 		else
 			rc = false;
+	} else if (lnet_peer_needs_push(lp)) {
+		rc = false;
 	} else if (lp->lp_state & LNET_PEER_DISCOVERED) {
 		if (lp->lp_state & LNET_PEER_NIDS_UPTODATE)
 			rc = true;
@@ -1588,6 +1694,9 @@ static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
 		rc = -EALREADY;
 	}
 
+	CDEBUG(D_NET, "Queue peer %s: %d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), rc);
+
 	return rc;
 }
 
@@ -1597,9 +1706,252 @@ static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
  */
 static void lnet_peer_discovery_complete(struct lnet_peer *lp)
 {
+	struct lnet_msg *msg = NULL;
+	int rc = 0;
+	struct list_head pending_msgs;
+
+	INIT_LIST_HEAD(&pending_msgs);
+
+	CDEBUG(D_NET, "Discovery complete. Dequeue peer %s\n",
+	       libcfs_nid2str(lp->lp_primary_nid));
+
 	list_del_init(&lp->lp_dc_list);
+	list_splice_init(&lp->lp_dc_pendq, &pending_msgs);
 	wake_up_all(&lp->lp_dc_waitq);
+
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	/* iterate through all pending messages and send them again */
+	list_for_each_entry(msg, &pending_msgs, msg_list) {
+		if (lp->lp_dc_error) {
+			lnet_finalize(msg, lp->lp_dc_error);
+			continue;
+		}
+
+		CDEBUG(D_NET, "sending pending message %s to target %s\n",
+		       lnet_msgtyp2str(msg->msg_type),
+		       libcfs_id2str(msg->msg_target));
+		rc = lnet_send(msg->msg_src_nid_param, msg,
+			       msg->msg_rtr_nid_param);
+		if (rc < 0) {
+			CNETERR("Error sending %s to %s: %d\n",
+				lnet_msgtyp2str(msg->msg_type),
+				libcfs_id2str(msg->msg_target), rc);
+			lnet_finalize(msg, rc);
+		}
+	}
+	lnet_net_lock(LNET_LOCK_EX);
+	lnet_peer_decref_locked(lp);
+}
+
+/*
+ * Handle inbound push.
+ * Like any event handler, called with lnet_res_lock/CPT held.
+ */
+void lnet_peer_push_event(struct lnet_event *ev)
+{
+	struct lnet_ping_buffer *pbuf = ev->md.user_ptr;
+	struct lnet_peer *lp;
+
+	/* lnet_find_peer() adds a refcount */
+	lp = lnet_find_peer(ev->source.nid);
+	if (!lp) {
+		CERROR("Push Put from unknown %s (source %s)\n",
+		       libcfs_nid2str(ev->initiator.nid),
+		       libcfs_nid2str(ev->source.nid));
+		return;
+	}
+
+	/* Ensure peer state remains consistent while we modify it. */
+	spin_lock(&lp->lp_lock);
+
+	/*
+	 * If some kind of error happened, the contents of the message
+	 * cannot be used. Clear the NIDS_UPTODATE flag and set the
+	 * FORCE_PING flag to trigger a ping.
+	 */
+	if (ev->status) {
+		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
+		lp->lp_state |= LNET_PEER_FORCE_PING;
+		CDEBUG(D_NET, "Push Put error %d from %s (source %s)\n",
+		       ev->status,
+		       libcfs_nid2str(lp->lp_primary_nid),
+		       libcfs_nid2str(ev->source.nid));
+		goto out;
+	}
+
+	/*
+	 * A push with invalid or corrupted info. Clear the UPTODATE
+	 * flag to trigger a ping.
+	 */
+	if (lnet_ping_info_validate(&pbuf->pb_info)) {
+		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
+		lp->lp_state |= LNET_PEER_FORCE_PING;
+		CDEBUG(D_NET, "Corrupted Push from %s\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		goto out;
+	}
+
+	/*
+	 * Make sure we'll allocate the correct size ping buffer when
+	 * pinging the peer.
+	 */
+	if (lp->lp_data_nnis < pbuf->pb_info.pi_nnis)
+		lp->lp_data_nnis = pbuf->pb_info.pi_nnis;
+
+	/*
+	 * A non-Multi-Rail peer is not supposed to be capable of
+	 * sending a push.
+	 */
+	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL)) {
+		CERROR("Push from non-Multi-Rail peer %s dropped\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		goto out;
+	}
+
+	/*
+	 * Check the MULTIRAIL flag. Complain if the peer was DLC
+	 * configured without it.
+	 */
+	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
+		if (lp->lp_state & LNET_PEER_CONFIGURED) {
+			CERROR("Push says %s is Multi-Rail, DLC says not\n",
+			       libcfs_nid2str(lp->lp_primary_nid));
+		} else {
+			lp->lp_state |= LNET_PEER_MULTI_RAIL;
+			lnet_peer_clr_non_mr_pref_nids(lp);
+		}
+	}
+
+	/*
+	 * The peer may have discovery disabled at its end. Set
+	 * NO_DISCOVERY as appropriate.
+	 */
+	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_DISCOVERY)) {
+		CDEBUG(D_NET, "Peer %s has discovery disabled\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		lp->lp_state |= LNET_PEER_NO_DISCOVERY;
+	} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
+		CDEBUG(D_NET, "Peer %s has discovery enabled\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		lp->lp_state &= ~LNET_PEER_NO_DISCOVERY;
+	}
+
+	/*
+	 * Check for truncation of the Put message. Clear the
+	 * NIDS_UPTODATE flag and set FORCE_PING to trigger a ping,
+	 * and tell discovery to allocate a bigger buffer.
+	 */
+	if (pbuf->pb_nnis < pbuf->pb_info.pi_nnis) {
+		if (the_lnet.ln_push_target_nnis < pbuf->pb_info.pi_nnis)
+			the_lnet.ln_push_target_nnis = pbuf->pb_info.pi_nnis;
+		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
+		lp->lp_state |= LNET_PEER_FORCE_PING;
+		CDEBUG(D_NET, "Truncated Push from %s (%d nids)\n",
+		       libcfs_nid2str(lp->lp_primary_nid),
+		       pbuf->pb_info.pi_nnis);
+		goto out;
+	}
+
+	/*
+	 * Check whether the Put data is stale. Stale data can just be
+	 * dropped.
+	 */
+	if (pbuf->pb_info.pi_nnis > 1 &&
+	    lp->lp_primary_nid == pbuf->pb_info.pi_ni[1].ns_nid &&
+	    LNET_PING_BUFFER_SEQNO(pbuf) < lp->lp_peer_seqno) {
+		CDEBUG(D_NET, "Stale Push from %s: got %u have %u\n",
+		       libcfs_nid2str(lp->lp_primary_nid),
+		       LNET_PING_BUFFER_SEQNO(pbuf),
+		       lp->lp_peer_seqno);
+		goto out;
+	}
+
+	/*
+	 * Check whether the Put data is new, in which case we clear
+	 * the UPTODATE flag and prepare to process it.
+	 *
+	 * If the Put data is current, and the peer is UPTODATE then
+	 * we assume everything is all right and drop the data as
+	 * stale.
+	 */
+	if (LNET_PING_BUFFER_SEQNO(pbuf) > lp->lp_peer_seqno) {
+		lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
+		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
+	} else if (lp->lp_state & LNET_PEER_NIDS_UPTODATE) {
+		CDEBUG(D_NET, "Stale Push from %s: got %u have %u\n",
+		       libcfs_nid2str(lp->lp_primary_nid),
+		       LNET_PING_BUFFER_SEQNO(pbuf),
+		       lp->lp_peer_seqno);
+		goto out;
+	}
+
+	/*
+	 * If there is data present that hasn't been processed yet,
+	 * we'll replace it if the Put contained newer data and it
+	 * fits. We're racing with a Ping or earlier Push in this
+	 * case.
+	 */
+	if (lp->lp_state & LNET_PEER_DATA_PRESENT) {
+		if (LNET_PING_BUFFER_SEQNO(pbuf) >
+			LNET_PING_BUFFER_SEQNO(lp->lp_data) &&
+		    pbuf->pb_info.pi_nnis <= lp->lp_data->pb_nnis) {
+			memcpy(&lp->lp_data->pb_info, &pbuf->pb_info,
+			       LNET_PING_INFO_SIZE(pbuf->pb_info.pi_nnis));
+			CDEBUG(D_NET, "Ping/Push race from %s: %u vs %u\n",
+			       libcfs_nid2str(lp->lp_primary_nid),
+			       LNET_PING_BUFFER_SEQNO(pbuf),
+			       LNET_PING_BUFFER_SEQNO(lp->lp_data));
+		}
+		goto out;
+	}
+
+	/*
+	 * Allocate a buffer to copy the data. On a failure we drop
+	 * the Push and set FORCE_PING to force the discovery
+	 * thread to fix the problem by pinging the peer.
+	 */
+	lp->lp_data = lnet_ping_buffer_alloc(lp->lp_data_nnis, GFP_ATOMIC);
+	if (!lp->lp_data) {
+		lp->lp_state |= LNET_PEER_FORCE_PING;
+		CDEBUG(D_NET, "Cannot allocate Push buffer for %s %u\n",
+		       libcfs_nid2str(lp->lp_primary_nid),
+		       LNET_PING_BUFFER_SEQNO(pbuf));
+		goto out;
+	}
+
+	/* Success */
+	memcpy(&lp->lp_data->pb_info, &pbuf->pb_info,
+	       LNET_PING_INFO_SIZE(pbuf->pb_info.pi_nnis));
+	lp->lp_state |= LNET_PEER_DATA_PRESENT;
+	CDEBUG(D_NET, "Received Push %s %u\n",
+	       libcfs_nid2str(lp->lp_primary_nid),
+	       LNET_PING_BUFFER_SEQNO(pbuf));
+
+out:
+	/*
+	 * Queue the peer for discovery, and wake the discovery thread
+	 * if the peer was already queued, because its status changed.
+	 */
+	spin_unlock(&lp->lp_lock);
+	lnet_net_lock(LNET_LOCK_EX);
+	if (lnet_peer_queue_for_discovery(lp))
+		wake_up(&the_lnet.ln_dc_waitq);
+	/* Drop refcount from lookup */
 	lnet_peer_decref_locked(lp);
+	lnet_net_unlock(LNET_LOCK_EX);
+}
+
+/*
+ * Clear the discovery error state, unless we're already discovering
+ * this peer, in which case the error is current.
+ */
+static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
+{
+	spin_lock(&lp->lp_lock);
+	if (!(lp->lp_state & LNET_PEER_DISCOVERING))
+		lp->lp_dc_error = 0;
+	spin_unlock(&lp->lp_lock);
 }
 
 /*
@@ -1608,7 +1960,7 @@ static void lnet_peer_discovery_complete(struct lnet_peer *lp)
  * because discovery could tear down an lnet_peer.
  */
 int
-lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt)
+lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block)
 {
 	DEFINE_WAIT(wait);
 	struct lnet_peer *lp;
@@ -1617,25 +1969,40 @@ lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt)
 again:
 	lnet_net_unlock(cpt);
 	lnet_net_lock(LNET_LOCK_EX);
+	lp = lpni->lpni_peer_net->lpn_peer;
+	lnet_peer_clear_discovery_error(lp);
 
-	/* We're willing to be interrupted. */
+	/*
+	 * We're willing to be interrupted. The lpni can become a
+	 * zombie if we race with DLC, so we must check for that.
+	 */
 	for (;;) {
-		lp = lpni->lpni_peer_net->lpn_peer;
 		prepare_to_wait(&lp->lp_dc_waitq, &wait, TASK_INTERRUPTIBLE);
 		if (signal_pending(current))
 			break;
 		if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
 			break;
+		if (lp->lp_dc_error)
+			break;
 		if (lnet_peer_is_uptodate(lp))
 			break;
 		lnet_peer_queue_for_discovery(lp);
 		lnet_peer_addref_locked(lp);
+		/*
+		 * if caller requested a non-blocking operation then
+		 * return immediately. Once discovery is complete then the
+		 * peer ref will be decremented and any pending messages
+		 * that were stopped due to discovery will be transmitted.
+		 */
+		if (!block)
+			break;
 		lnet_net_unlock(LNET_LOCK_EX);
 		schedule();
 		finish_wait(&lp->lp_dc_waitq, &wait);
 		lnet_net_lock(LNET_LOCK_EX);
 		lnet_peer_decref_locked(lp);
-		/* Do not use lp beyond this point. */
+		/* Peer may have changed */
+		lp = lpni->lpni_peer_net->lpn_peer;
 	}
 	finish_wait(&lp->lp_dc_waitq, &wait);
 
@@ -1646,71 +2013,969 @@ lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt)
 		rc = -EINTR;
 	else if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
 		rc = -ESHUTDOWN;
+	else if (lp->lp_dc_error)
+		rc = lp->lp_dc_error;
+	else if (!block)
+		CDEBUG(D_NET, "non-blocking discovery\n");
 	else if (!lnet_peer_is_uptodate(lp))
 		goto again;
 
+	CDEBUG(D_NET, "peer %s NID %s: %d. %s\n",
+	       (lp ? libcfs_nid2str(lp->lp_primary_nid) : "(none)"),
+	       libcfs_nid2str(lpni->lpni_nid), rc,
+	       (!block) ? "pending discovery" : "discovery complete");
+
 	return rc;
 }
 
-/*
- * Event handler for the discovery EQ.
- *
- * Called with lnet_res_lock(cpt) held. The cpt is the
- * lnet_cpt_of_cookie() of the md handle cookie.
- */
-static void lnet_discovery_event_handler(struct lnet_event *event)
+/* Handle an incoming ack for a push. */
+static void
+lnet_discovery_event_ack(struct lnet_peer *lp, struct lnet_event *ev)
 {
-	wake_up(&the_lnet.ln_dc_waitq);
+	struct lnet_ping_buffer *pbuf;
+
+	pbuf = LNET_PING_INFO_TO_BUFFER(ev->md.start);
+	spin_lock(&lp->lp_lock);
+	lp->lp_state &= ~LNET_PEER_PUSH_SENT;
+	lp->lp_push_error = ev->status;
+	if (ev->status)
+		lp->lp_state |= LNET_PEER_PUSH_FAILED;
+	else
+		lp->lp_node_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
+	spin_unlock(&lp->lp_lock);
+
+	CDEBUG(D_NET, "peer %s ev->status %d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), ev->status);
 }
 
-/*
- * Wait for work to be queued or some other change that must be
- * attended to. Returns non-zero if the discovery thread should shut
- * down.
- */
-static int lnet_peer_discovery_wait_for_work(void)
+/* Handle a Reply message. This is the reply to a Ping message. */
+static void
+lnet_discovery_event_reply(struct lnet_peer *lp, struct lnet_event *ev)
 {
-	int cpt;
-	int rc = 0;
+	struct lnet_ping_buffer *pbuf;
+	int rc;
 
-	DEFINE_WAIT(wait);
+	spin_lock(&lp->lp_lock);
 
-	cpt = lnet_net_lock_current();
-	for (;;) {
-		prepare_to_wait(&the_lnet.ln_dc_waitq, &wait,
-				TASK_IDLE);
-		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
-			break;
-		if (lnet_push_target_resize_needed())
-			break;
-		if (!list_empty(&the_lnet.ln_dc_request))
-			break;
-		lnet_net_unlock(cpt);
-		schedule();
-		finish_wait(&the_lnet.ln_dc_waitq, &wait);
-		cpt = lnet_net_lock_current();
+	/*
+	 * If some kind of error happened, the contents of the message
+	 * cannot be used. Set PING_FAILED to trigger a retry.
+	 */
+	if (ev->status) {
+		lp->lp_state |= LNET_PEER_PING_FAILED;
+		lp->lp_ping_error = ev->status;
+		CDEBUG(D_NET, "Ping Reply error %d from %s (source %s)\n",
+		       ev->status,
+		       libcfs_nid2str(lp->lp_primary_nid),
+		       libcfs_nid2str(ev->source.nid));
+		goto out;
 	}
-	finish_wait(&the_lnet.ln_dc_waitq, &wait);
-
-	if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
-		rc = -ESHUTDOWN;
 
-	lnet_net_unlock(cpt);
+	pbuf = LNET_PING_INFO_TO_BUFFER(ev->md.start);
+	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
+		lnet_swap_pinginfo(pbuf);
 
-	CDEBUG(D_NET, "woken: %d\n", rc);
+	/*
+	 * A reply with invalid or corrupted info. Set PING_FAILED to
+	 * trigger a retry.
+	 */
+	rc = lnet_ping_info_validate(&pbuf->pb_info);
+	if (rc) {
+		lp->lp_state |= LNET_PEER_PING_FAILED;
+		lp->lp_ping_error = 0;
+		CDEBUG(D_NET, "Corrupted Ping Reply from %s: %d\n",
+		       libcfs_nid2str(lp->lp_primary_nid), rc);
+		goto out;
+	}
 
-	return rc;
-}
+	/*
+	 * Update the MULTI_RAIL flag based on the reply. If the peer
+	 * was configured with DLC then the setting should match what
+	 * DLC put in.
+	 */
+	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL) {
+		if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
+			/* Everything's fine */
+		} else if (lp->lp_state & LNET_PEER_CONFIGURED) {
+			CWARN("Reply says %s is Multi-Rail, DLC says not\n",
+			      libcfs_nid2str(lp->lp_primary_nid));
+		} else {
+			lp->lp_state |= LNET_PEER_MULTI_RAIL;
+			lnet_peer_clr_non_mr_pref_nids(lp);
+		}
+	} else if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
+		if (lp->lp_state & LNET_PEER_CONFIGURED) {
+			CWARN("DLC says %s is Multi-Rail, Reply says not\n",
+			      libcfs_nid2str(lp->lp_primary_nid));
+		} else {
+			CERROR("Multi-Rail state vanished from %s\n",
+			       libcfs_nid2str(lp->lp_primary_nid));
+			lp->lp_state &= ~LNET_PEER_MULTI_RAIL;
+		}
+	}
 
-/* The discovery thread. */
-static int lnet_peer_discovery(void *arg)
-{
-	struct lnet_peer *lp;
+	/*
+	 * Make sure we'll allocate the correct size ping buffer when
+	 * pinging the peer.
+	 */
+	if (lp->lp_data_nnis < pbuf->pb_info.pi_nnis)
+		lp->lp_data_nnis = pbuf->pb_info.pi_nnis;
 
-	CDEBUG(D_NET, "started\n");
+	/*
+	 * The peer may have discovery disabled at its end. Set
+	 * NO_DISCOVERY as appropriate.
+	 */
+	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_DISCOVERY)) {
+		CDEBUG(D_NET, "Peer %s has discovery disabled\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		lp->lp_state |= LNET_PEER_NO_DISCOVERY;
+	} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
+		CDEBUG(D_NET, "Peer %s has discovery enabled\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		lp->lp_state &= ~LNET_PEER_NO_DISCOVERY;
+	}
 
-	for (;;) {
-		if (lnet_peer_discovery_wait_for_work())
+	/*
+	 * Check for truncation of the Reply. Clear PING_SENT and set
+	 * PING_FAILED to trigger a retry.
+	 */
+	if (pbuf->pb_nnis < pbuf->pb_info.pi_nnis) {
+		if (the_lnet.ln_push_target_nnis < pbuf->pb_info.pi_nnis)
+			the_lnet.ln_push_target_nnis = pbuf->pb_info.pi_nnis;
+		lp->lp_state |= LNET_PEER_PING_FAILED;
+		lp->lp_ping_error = 0;
+		CDEBUG(D_NET, "Truncated Reply from %s (%d nids)\n",
+		       libcfs_nid2str(lp->lp_primary_nid),
+		       pbuf->pb_info.pi_nnis);
+		goto out;
+	}
+
+	/*
+	 * Check the sequence numbers in the reply. These are only
+	 * available if the reply came from a Multi-Rail peer.
+	 */
+	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL &&
+	    pbuf->pb_info.pi_nnis > 1 &&
+	    lp->lp_primary_nid == pbuf->pb_info.pi_ni[1].ns_nid) {
+		if (LNET_PING_BUFFER_SEQNO(pbuf) < lp->lp_peer_seqno) {
+			CDEBUG(D_NET, "Stale Reply from %s: got %u have %u\n",
+			       libcfs_nid2str(lp->lp_primary_nid),
+			       LNET_PING_BUFFER_SEQNO(pbuf),
+			       lp->lp_peer_seqno);
+			goto out;
+		}
+
+		if (LNET_PING_BUFFER_SEQNO(pbuf) > lp->lp_peer_seqno)
+			lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
+	}
+
+	/* We're happy with the state of the data in the buffer. */
+	CDEBUG(D_NET, "peer %s data present %u\n",
+	       libcfs_nid2str(lp->lp_primary_nid), lp->lp_peer_seqno);
+	if (lp->lp_state & LNET_PEER_DATA_PRESENT)
+		lnet_ping_buffer_decref(lp->lp_data);
+	else
+		lp->lp_state |= LNET_PEER_DATA_PRESENT;
+	lnet_ping_buffer_addref(pbuf);
+	lp->lp_data = pbuf;
+out:
+	lp->lp_state &= ~LNET_PEER_PING_SENT;
+	spin_unlock(&lp->lp_lock);
+}
+
+/*
+ * Send event handling. Only matters for error cases, where we clean
+ * up state on the peer and peer_ni that would otherwise be updated in
+ * the REPLY event handler for a successful Ping, and the ACK event
+ * handler for a successful Push.
+ */
+static int
+lnet_discovery_event_send(struct lnet_peer *lp, struct lnet_event *ev)
+{
+	int rc = 0;
+
+	if (!ev->status)
+		goto out;
+
+	spin_lock(&lp->lp_lock);
+	if (ev->msg_type == LNET_MSG_GET) {
+		lp->lp_state &= ~LNET_PEER_PING_SENT;
+		lp->lp_state |= LNET_PEER_PING_FAILED;
+		lp->lp_ping_error = ev->status;
+	} else { /* ev->msg_type == LNET_MSG_PUT */
+		lp->lp_state &= ~LNET_PEER_PUSH_SENT;
+		lp->lp_state |= LNET_PEER_PUSH_FAILED;
+		lp->lp_push_error = ev->status;
+	}
+	spin_unlock(&lp->lp_lock);
+	rc = LNET_REDISCOVER_PEER;
+out:
+	CDEBUG(D_NET, "%s Send to %s: %d\n",
+	       (ev->msg_type == LNET_MSG_GET ? "Ping" : "Push"),
+	       libcfs_nid2str(ev->target.nid), rc);
+	return rc;
+}
+
+/*
+ * Unlink event handling. This event is only seen if a call to
+ * LNetMDUnlink() caused the event to be unlinked. If this call was
+ * made after the event was set up in LNetGet() or LNetPut() then we
+ * assume the Ping or Push timed out.
+ */
+static void
+lnet_discovery_event_unlink(struct lnet_peer *lp, struct lnet_event *ev)
+{
+	spin_lock(&lp->lp_lock);
+	/* We've passed through LNetGet() */
+	if (lp->lp_state & LNET_PEER_PING_SENT) {
+		lp->lp_state &= ~LNET_PEER_PING_SENT;
+		lp->lp_state |= LNET_PEER_PING_FAILED;
+		lp->lp_ping_error = -ETIMEDOUT;
+		CDEBUG(D_NET, "Ping Unlink for message to peer %s\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+	}
+	/* We've passed through LNetPut() */
+	if (lp->lp_state & LNET_PEER_PUSH_SENT) {
+		lp->lp_state &= ~LNET_PEER_PUSH_SENT;
+		lp->lp_state |= LNET_PEER_PUSH_FAILED;
+		lp->lp_push_error = -ETIMEDOUT;
+		CDEBUG(D_NET, "Push Unlink for message to peer %s\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+	}
+	spin_unlock(&lp->lp_lock);
+}
+
+/*
+ * Event handler for the discovery EQ.
+ *
+ * Called with lnet_res_lock(cpt) held. The cpt is the
+ * lnet_cpt_of_cookie() of the md handle cookie.
+ */
+static void lnet_discovery_event_handler(struct lnet_event *event)
+{
+	struct lnet_peer *lp = event->md.user_ptr;
+	struct lnet_ping_buffer *pbuf;
+	int rc;
+
+	/* discovery needs to take another look */
+	rc = LNET_REDISCOVER_PEER;
+
+	CDEBUG(D_NET, "Received event: %d\n", event->type);
+
+	switch (event->type) {
+	case LNET_EVENT_ACK:
+		lnet_discovery_event_ack(lp, event);
+		break;
+	case LNET_EVENT_REPLY:
+		lnet_discovery_event_reply(lp, event);
+		break;
+	case LNET_EVENT_SEND:
+		/* Only send failure triggers a retry. */
+		rc = lnet_discovery_event_send(lp, event);
+		break;
+	case LNET_EVENT_UNLINK:
+		/* LNetMDUnlink() was called */
+		lnet_discovery_event_unlink(lp, event);
+		break;
+	default:
+		/* Invalid events. */
+		LBUG();
+	}
+	lnet_net_lock(LNET_LOCK_EX);
+	if (event->unlinked) {
+		pbuf = LNET_PING_INFO_TO_BUFFER(event->md.start);
+		lnet_ping_buffer_decref(pbuf);
+		lnet_peer_decref_locked(lp);
+	}
+	if (rc == LNET_REDISCOVER_PEER) {
+		list_move_tail(&lp->lp_dc_list, &the_lnet.ln_dc_request);
+		wake_up(&the_lnet.ln_dc_waitq);
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+}
+
+/*
+ * Build a peer from incoming data.
+ *
+ * The NIDs in the incoming data are supposed to be structured as follows:
+ *  - loopback
+ *  - primary NID
+ *  - other NIDs in same net
+ *  - NIDs in second net
+ *  - NIDs in third net
+ *  - ...
+ * This is due to the way the list of NIDs in the data is created.
+ *
+ * Note that this function will mark the peer uptodate unless an
+ * ENOMEM is encontered. All other errors are due to a conflict
+ * between the DLC configuration and what discovery sees. We treat DLC
+ * as binding, and therefore set the NIDS_UPTODATE flag to prevent the
+ * peer from becoming stuck in discovery.
+ */
+static int lnet_peer_merge_data(struct lnet_peer *lp,
+				struct lnet_ping_buffer *pbuf)
+{
+	struct lnet_peer_ni *lpni;
+	lnet_nid_t *curnis = NULL;
+	lnet_nid_t *addnis = NULL;
+	lnet_nid_t *delnis = NULL;
+	unsigned int flags;
+	int ncurnis;
+	int naddnis;
+	int ndelnis;
+	int nnis = 0;
+	int i;
+	int j;
+	int rc;
+
+	flags = LNET_PEER_DISCOVERED;
+	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL)
+		flags |= LNET_PEER_MULTI_RAIL;
+
+	nnis = max_t(int, lp->lp_nnis, pbuf->pb_info.pi_nnis);
+	curnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
+	addnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
+	delnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
+	if (!curnis || !addnis || !delnis) {
+		rc = -ENOMEM;
+		goto out;
+	}
+	ncurnis = 0;
+	naddnis = 0;
+	ndelnis = 0;
+
+	/* Construct the list of NIDs present in peer. */
+	lpni = NULL;
+	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL)
+		curnis[ncurnis++] = lpni->lpni_nid;
+
+	/*
+	 * Check for NIDs in pbuf not present in curnis[].
+	 * The loop starts at 1 to skip the loopback NID.
+	 */
+	for (i = 1; i < pbuf->pb_info.pi_nnis; i++) {
+		for (j = 0; j < ncurnis; j++)
+			if (pbuf->pb_info.pi_ni[i].ns_nid == curnis[j])
+				break;
+		if (j == ncurnis)
+			addnis[naddnis++] = pbuf->pb_info.pi_ni[i].ns_nid;
+	}
+	/*
+	 * Check for NIDs in curnis[] not present in pbuf.
+	 * The nested loop starts at 1 to skip the loopback NID.
+	 *
+	 * But never add the loopback NID to delnis[]: if it is
+	 * present in curnis[] then this peer is for this node.
+	 */
+	for (i = 0; i < ncurnis; i++) {
+		if (LNET_NETTYP(LNET_NIDNET(curnis[i])) == LOLND)
+			continue;
+		for (j = 1; j < pbuf->pb_info.pi_nnis; j++)
+			if (curnis[i] == pbuf->pb_info.pi_ni[j].ns_nid)
+				break;
+		if (j == pbuf->pb_info.pi_nnis)
+			delnis[ndelnis++] = curnis[i];
+	}
+
+	for (i = 0; i < naddnis; i++) {
+		rc = lnet_peer_add_nid(lp, addnis[i], flags);
+		if (rc) {
+			CERROR("Error adding NID %s to peer %s: %d\n",
+			       libcfs_nid2str(addnis[i]),
+			       libcfs_nid2str(lp->lp_primary_nid), rc);
+			if (rc == -ENOMEM)
+				goto out;
+		}
+	}
+	for (i = 0; i < ndelnis; i++) {
+		rc = lnet_peer_del_nid(lp, delnis[i], flags);
+		if (rc) {
+			CERROR("Error deleting NID %s from peer %s: %d\n",
+			       libcfs_nid2str(delnis[i]),
+			       libcfs_nid2str(lp->lp_primary_nid), rc);
+			if (rc == -ENOMEM)
+				goto out;
+		}
+	}
+	/*
+	 * Errors other than -ENOMEM are due to peers having been
+	 * configured with DLC. Ignore these because DLC overrides
+	 * Discovery.
+	 */
+	rc = 0;
+out:
+	kfree(curnis);
+	kfree(addnis);
+	kfree(delnis);
+	lnet_ping_buffer_decref(pbuf);
+	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
+
+	if (rc) {
+		spin_lock(&lp->lp_lock);
+		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
+		lp->lp_state |= LNET_PEER_FORCE_PING;
+		spin_unlock(&lp->lp_lock);
+	}
+	return rc;
+}
+
+/*
+ * The data in pbuf says lp is its primary peer, but the data was
+ * received by a different peer. Try to update lp with the data.
+ */
+static int
+lnet_peer_set_primary_data(struct lnet_peer *lp, struct lnet_ping_buffer *pbuf)
+{
+	struct lnet_handle_md mdh;
+
+	/* Queue lp for discovery, and force it on the request queue. */
+	lnet_net_lock(LNET_LOCK_EX);
+	if (lnet_peer_queue_for_discovery(lp))
+		list_move(&lp->lp_dc_list, &the_lnet.ln_dc_request);
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	LNetInvalidateMDHandle(&mdh);
+
+	/*
+	 * Decide whether we can move the peer to the DATA_PRESENT state.
+	 *
+	 * We replace stale data for a multi-rail peer, repair PING_FAILED
+	 * status, and preempt FORCE_PING.
+	 *
+	 * If after that we have DATA_PRESENT, we merge it into this peer.
+	 */
+	spin_lock(&lp->lp_lock);
+	if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
+		if (lp->lp_peer_seqno < LNET_PING_BUFFER_SEQNO(pbuf)) {
+			lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
+		} else if (lp->lp_state & LNET_PEER_DATA_PRESENT) {
+			lp->lp_state &= ~LNET_PEER_DATA_PRESENT;
+			lnet_ping_buffer_decref(pbuf);
+			pbuf = lp->lp_data;
+			lp->lp_data = NULL;
+		}
+	}
+	if (lp->lp_state & LNET_PEER_DATA_PRESENT) {
+		lnet_ping_buffer_decref(lp->lp_data);
+		lp->lp_data = NULL;
+		lp->lp_state &= ~LNET_PEER_DATA_PRESENT;
+	}
+	if (lp->lp_state & LNET_PEER_PING_FAILED) {
+		mdh = lp->lp_ping_mdh;
+		LNetInvalidateMDHandle(&lp->lp_ping_mdh);
+		lp->lp_state &= ~LNET_PEER_PING_FAILED;
+		lp->lp_ping_error = 0;
+	}
+	if (lp->lp_state & LNET_PEER_FORCE_PING)
+		lp->lp_state &= ~LNET_PEER_FORCE_PING;
+	lp->lp_state |= LNET_PEER_NIDS_UPTODATE;
+	spin_unlock(&lp->lp_lock);
+
+	if (!LNetMDHandleIsInvalid(mdh))
+		LNetMDUnlink(mdh);
+
+	if (pbuf)
+		return lnet_peer_merge_data(lp, pbuf);
+
+	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
+	return 0;
+}
+
+/*
+ * Update a peer using the data received.
+ */
+static int lnet_peer_data_present(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
+{
+	struct lnet_ping_buffer *pbuf;
+	struct lnet_peer_ni *lpni;
+	lnet_nid_t nid = LNET_NID_ANY;
+	unsigned int flags;
+	int rc = 0;
+
+	pbuf = lp->lp_data;
+	lp->lp_data = NULL;
+	lp->lp_state &= ~LNET_PEER_DATA_PRESENT;
+	lp->lp_state |= LNET_PEER_NIDS_UPTODATE;
+	spin_unlock(&lp->lp_lock);
+
+	/*
+	 * Modifications of peer structures are done while holding the
+	 * ln_api_mutex. A global lock is required because we may be
+	 * modifying multiple peer structures, and a mutex greatly
+	 * simplifies memory management.
+	 *
+	 * The actual changes to the data structures must also protect
+	 * against concurrent lookups, for which the lnet_net_lock in
+	 * LNET_LOCK_EX mode is used.
+	 */
+	mutex_lock(&the_lnet.ln_api_mutex);
+	if (the_lnet.ln_state == LNET_STATE_SHUTDOWN) {
+		rc = -ESHUTDOWN;
+		goto out;
+	}
+
+	/*
+	 * If this peer is not on the peer list then it is being torn
+	 * down, and our reference count may be all that is keeping it
+	 * alive. Don't do any work on it.
+	 */
+	if (list_empty(&lp->lp_peer_list))
+		goto out;
+
+	flags = LNET_PEER_DISCOVERED;
+	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL)
+		flags |= LNET_PEER_MULTI_RAIL;
+
+	/*
+	 * Check whether the primary NID in the message matches the
+	 * primary NID of the peer. If it does, update the peer, if
+	 * it does not, check whether there is already a peer with
+	 * that primary NID. If no such peer exists, try to update
+	 * the primary NID of the current peer (allowed if it was
+	 * created due to message traffic) and complete the update.
+	 * If the peer did exist, hand off the data to it.
+	 *
+	 * The peer for the loopback interface is a special case: this
+	 * is the peer for the local node, and we want to set its
+	 * primary NID to the correct value here.
+	 */
+	if (pbuf->pb_info.pi_nnis > 1)
+		nid = pbuf->pb_info.pi_ni[1].ns_nid;
+	if (LNET_NETTYP(LNET_NIDNET(lp->lp_primary_nid)) == LOLND) {
+		rc = lnet_peer_set_primary_nid(lp, nid, flags);
+		if (!rc)
+			rc = lnet_peer_merge_data(lp, pbuf);
+	} else if (lp->lp_primary_nid == nid) {
+		rc = lnet_peer_merge_data(lp, pbuf);
+	} else {
+		lpni = lnet_find_peer_ni_locked(nid);
+		if (!lpni) {
+			rc = lnet_peer_set_primary_nid(lp, nid, flags);
+			if (rc) {
+				CERROR("Primary NID error %s versus %s: %d\n",
+				       libcfs_nid2str(lp->lp_primary_nid),
+				       libcfs_nid2str(nid), rc);
+			} else {
+				rc = lnet_peer_merge_data(lp, pbuf);
+			}
+		} else {
+			rc = lnet_peer_set_primary_data(
+				lpni->lpni_peer_net->lpn_peer, pbuf);
+			lnet_peer_ni_decref_locked(lpni);
+		}
+	}
+out:
+	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
+	mutex_unlock(&the_lnet.ln_api_mutex);
+
+	spin_lock(&lp->lp_lock);
+	/* Tell discovery to re-check the peer immediately. */
+	if (!rc)
+		rc = LNET_REDISCOVER_PEER;
+	return rc;
+}
+
+/*
+ * A ping failed. Clear the PING_FAILED state and set the
+ * FORCE_PING state, to ensure a retry even if discovery is
+ * disabled. This avoids being left with incorrect state.
+ */
+static int lnet_peer_ping_failed(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
+{
+	struct lnet_handle_md mdh;
+	int rc;
+
+	mdh = lp->lp_ping_mdh;
+	LNetInvalidateMDHandle(&lp->lp_ping_mdh);
+	lp->lp_state &= ~LNET_PEER_PING_FAILED;
+	lp->lp_state |= LNET_PEER_FORCE_PING;
+	rc = lp->lp_ping_error;
+	lp->lp_ping_error = 0;
+	spin_unlock(&lp->lp_lock);
+
+	if (!LNetMDHandleIsInvalid(mdh))
+		LNetMDUnlink(mdh);
+
+	CDEBUG(D_NET, "peer %s:%d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), rc);
+
+	spin_lock(&lp->lp_lock);
+	return rc ? rc : LNET_REDISCOVER_PEER;
+}
+
+/*
+ * Select NID to send a Ping or Push to.
+ */
+static lnet_nid_t lnet_peer_select_nid(struct lnet_peer *lp)
+{
+	struct lnet_peer_ni *lpni;
+
+	/* Look for a direct-connected NID for this peer. */
+	lpni = NULL;
+	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
+		if (!lnet_is_peer_ni_healthy_locked(lpni))
+			continue;
+		if (!lnet_get_net_locked(lpni->lpni_peer_net->lpn_net_id))
+			continue;
+		break;
+	}
+	if (lpni)
+		return lpni->lpni_nid;
+
+	/* Look for a routed-connected NID for this peer. */
+	lpni = NULL;
+	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
+		if (!lnet_is_peer_ni_healthy_locked(lpni))
+			continue;
+		if (!lnet_find_rnet_locked(lpni->lpni_peer_net->lpn_net_id))
+			continue;
+		break;
+	}
+	if (lpni)
+		return lpni->lpni_nid;
+
+	return LNET_NID_ANY;
+}
+
+/* Active side of ping. */
+static int lnet_peer_send_ping(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
+{
+	struct lnet_md md = { NULL };
+	struct lnet_process_id id;
+	struct lnet_ping_buffer *pbuf;
+	int nnis;
+	int rc;
+	int cpt;
+
+	lp->lp_state |= LNET_PEER_PING_SENT;
+	lp->lp_state &= ~LNET_PEER_FORCE_PING;
+	spin_unlock(&lp->lp_lock);
+
+	nnis = max_t(int, lp->lp_data_nnis, LNET_INTERFACES_MIN);
+	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
+	if (!pbuf) {
+		rc = -ENOMEM;
+		goto fail_error;
+	}
+
+	/* initialize md content */
+	md.start     = &pbuf->pb_info;
+	md.length    = LNET_PING_INFO_SIZE(nnis);
+	md.threshold = 2; /* GET/REPLY */
+	md.max_size  = 0;
+	md.options   = LNET_MD_TRUNCATE;
+	md.user_ptr  = lp;
+	md.eq_handle = the_lnet.ln_dc_eqh;
+
+	rc = LNetMDBind(md, LNET_UNLINK, &lp->lp_ping_mdh);
+	if (rc != 0) {
+		lnet_ping_buffer_decref(pbuf);
+		CERROR("Can't bind MD: %d\n", rc);
+		goto fail_error;
+	}
+	cpt = lnet_net_lock_current();
+	/* Refcount for MD. */
+	lnet_peer_addref_locked(lp);
+	id.pid = LNET_PID_LUSTRE;
+	id.nid = lnet_peer_select_nid(lp);
+	lnet_net_unlock(cpt);
+
+	if (id.nid == LNET_NID_ANY) {
+		rc = -EHOSTUNREACH;
+		goto fail_unlink_md;
+	}
+
+	rc = LNetGet(LNET_NID_ANY, lp->lp_ping_mdh, id,
+		     LNET_RESERVED_PORTAL,
+		     LNET_PROTO_PING_MATCHBITS, 0);
+
+	if (rc)
+		goto fail_unlink_md;
+
+	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
+
+	spin_lock(&lp->lp_lock);
+	return 0;
+
+fail_unlink_md:
+	LNetMDUnlink(lp->lp_ping_mdh);
+	LNetInvalidateMDHandle(&lp->lp_ping_mdh);
+fail_error:
+	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
+	/*
+	 * The errors that get us here are considered hard errors and
+	 * cause Discovery to terminate. So we clear PING_SENT, but do
+	 * not set either PING_FAILED or FORCE_PING. In fact we need
+	 * to clear PING_FAILED, because the unlink event handler will
+	 * have set it if we called LNetMDUnlink() above.
+	 */
+	spin_lock(&lp->lp_lock);
+	lp->lp_state &= ~(LNET_PEER_PING_SENT | LNET_PEER_PING_FAILED);
+	return rc;
+}
+
+/*
+ * This function exists because you cannot call LNetMDUnlink() from an
+ * event handler.
+ */
+static int lnet_peer_push_failed(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
+{
+	struct lnet_handle_md mdh;
+	int rc;
+
+	mdh = lp->lp_push_mdh;
+	LNetInvalidateMDHandle(&lp->lp_push_mdh);
+	lp->lp_state &= ~LNET_PEER_PUSH_FAILED;
+	rc = lp->lp_push_error;
+	lp->lp_push_error = 0;
+	spin_unlock(&lp->lp_lock);
+
+	if (!LNetMDHandleIsInvalid(mdh))
+		LNetMDUnlink(mdh);
+
+	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
+	spin_lock(&lp->lp_lock);
+	return rc ? rc : LNET_REDISCOVER_PEER;
+}
+
+/* Active side of push. */
+static int lnet_peer_send_push(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
+{
+	struct lnet_ping_buffer *pbuf;
+	struct lnet_process_id id;
+	struct lnet_md md;
+	int cpt;
+	int rc;
+
+	/* Don't push to a non-multi-rail peer. */
+	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
+		lp->lp_state &= ~LNET_PEER_FORCE_PUSH;
+		return 0;
+	}
+
+	lp->lp_state |= LNET_PEER_PUSH_SENT;
+	lp->lp_state &= ~LNET_PEER_FORCE_PUSH;
+	spin_unlock(&lp->lp_lock);
+
+	cpt = lnet_net_lock_current();
+	pbuf = the_lnet.ln_ping_target;
+	lnet_ping_buffer_addref(pbuf);
+	lnet_net_unlock(cpt);
+
+	/* Push source MD */
+	md.start     = &pbuf->pb_info;
+	md.length    = LNET_PING_INFO_SIZE(pbuf->pb_nnis);
+	md.threshold = 2; /* Put/Ack */
+	md.max_size  = 0;
+	md.options   = 0;
+	md.eq_handle = the_lnet.ln_dc_eqh;
+	md.user_ptr  = lp;
+
+	rc = LNetMDBind(md, LNET_UNLINK, &lp->lp_push_mdh);
+	if (rc) {
+		lnet_ping_buffer_decref(pbuf);
+		CERROR("Can't bind push source MD: %d\n", rc);
+		goto fail_error;
+	}
+	cpt = lnet_net_lock_current();
+	/* Refcount for MD. */
+	lnet_peer_addref_locked(lp);
+	id.pid = LNET_PID_LUSTRE;
+	id.nid = lnet_peer_select_nid(lp);
+	lnet_net_unlock(cpt);
+
+	if (id.nid == LNET_NID_ANY) {
+		rc = -EHOSTUNREACH;
+		goto fail_unlink;
+	}
+
+	rc = LNetPut(LNET_NID_ANY, lp->lp_push_mdh,
+		     LNET_ACK_REQ, id, LNET_RESERVED_PORTAL,
+		     LNET_PROTO_PING_MATCHBITS, 0, 0);
+
+	if (rc)
+		goto fail_unlink;
+
+	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
+
+	spin_lock(&lp->lp_lock);
+	return 0;
+
+fail_unlink:
+	LNetMDUnlink(lp->lp_push_mdh);
+	LNetInvalidateMDHandle(&lp->lp_push_mdh);
+fail_error:
+	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
+	/*
+	 * The errors that get us here are considered hard errors and
+	 * cause Discovery to terminate. So we clear PUSH_SENT, but do
+	 * not set PUSH_FAILED. In fact we need to clear PUSH_FAILED,
+	 * because the unlink event handler will have set it if we
+	 * called LNetMDUnlink() above.
+	 */
+	spin_lock(&lp->lp_lock);
+	lp->lp_state &= ~(LNET_PEER_PUSH_SENT | LNET_PEER_PUSH_FAILED);
+	return rc;
+}
+
+/*
+ * An unrecoverable error was encountered during discovery.
+ * Set error status in peer and abort discovery.
+ */
+static void lnet_peer_discovery_error(struct lnet_peer *lp, int error)
+{
+	CDEBUG(D_NET, "Discovery error %s: %d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), error);
+
+	spin_lock(&lp->lp_lock);
+	lp->lp_dc_error = error;
+	lp->lp_state &= ~LNET_PEER_DISCOVERING;
+	lp->lp_state |= LNET_PEER_REDISCOVER;
+	spin_unlock(&lp->lp_lock);
+}
+
+/*
+ * Mark the peer as discovered.
+ */
+static int lnet_peer_discovered(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
+{
+	lp->lp_state |= LNET_PEER_DISCOVERED;
+	lp->lp_state &= ~(LNET_PEER_DISCOVERING |
+			  LNET_PEER_REDISCOVER);
+
+	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
+
+	return 0;
+}
+
+/*
+ * Mark the peer as to be rediscovered.
+ */
+static int lnet_peer_rediscover(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
+{
+	lp->lp_state |= LNET_PEER_REDISCOVER;
+	lp->lp_state &= ~LNET_PEER_DISCOVERING;
+
+	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
+
+	return 0;
+}
+
+/*
+ * Returns the first peer on the ln_dc_working queue if its timeout
+ * has expired. Takes the current time as an argument so as to not
+ * obsessively re-check the clock. The oldest discovery request will
+ * be at the head of the queue.
+ */
+static struct lnet_peer *lnet_peer_dc_timed_out(time64_t now)
+{
+	struct lnet_peer *lp;
+
+	if (list_empty(&the_lnet.ln_dc_working))
+		return NULL;
+	lp = list_first_entry(&the_lnet.ln_dc_working,
+			      struct lnet_peer, lp_dc_list);
+	if (now < lp->lp_last_queued + DISCOVERY_TIMEOUT)
+		return NULL;
+	return lp;
+}
+
+/*
+ * Discovering this peer is taking too long. Cancel any Ping or Push
+ * that discovery is waiting on by unlinking the relevant MDs. The
+ * lnet_discovery_event_handler() will proceed from here and complete
+ * the cleanup.
+ */
+static void lnet_peer_discovery_timeout(struct lnet_peer *lp)
+{
+	struct lnet_handle_md ping_mdh;
+	struct lnet_handle_md push_mdh;
+
+	LNetInvalidateMDHandle(&ping_mdh);
+	LNetInvalidateMDHandle(&push_mdh);
+
+	spin_lock(&lp->lp_lock);
+	if (lp->lp_state & LNET_PEER_PING_SENT) {
+		ping_mdh = lp->lp_ping_mdh;
+		LNetInvalidateMDHandle(&lp->lp_ping_mdh);
+	}
+	if (lp->lp_state & LNET_PEER_PUSH_SENT) {
+		push_mdh = lp->lp_push_mdh;
+		LNetInvalidateMDHandle(&lp->lp_push_mdh);
+	}
+	spin_unlock(&lp->lp_lock);
+
+	if (!LNetMDHandleIsInvalid(ping_mdh))
+		LNetMDUnlink(ping_mdh);
+	if (!LNetMDHandleIsInvalid(push_mdh))
+		LNetMDUnlink(push_mdh);
+}
+
+/*
+ * Wait for work to be queued or some other change that must be
+ * attended to. Returns non-zero if the discovery thread should shut
+ * down.
+ */
+static int lnet_peer_discovery_wait_for_work(void)
+{
+	int cpt;
+	int rc = 0;
+
+	DEFINE_WAIT(wait);
+
+	cpt = lnet_net_lock_current();
+	for (;;) {
+		prepare_to_wait(&the_lnet.ln_dc_waitq, &wait,
+				TASK_IDLE);
+		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
+			break;
+		if (lnet_push_target_resize_needed())
+			break;
+		if (!list_empty(&the_lnet.ln_dc_request))
+			break;
+		if (lnet_peer_dc_timed_out(ktime_get_real_seconds()))
+			break;
+		lnet_net_unlock(cpt);
+
+		/*
+		 * wakeup max every second to check if there are peers that
+		 * have been stuck on the working queue for greater than
+		 * the peer timeout.
+		 */
+		schedule_timeout(HZ);
+		finish_wait(&the_lnet.ln_dc_waitq, &wait);
+		cpt = lnet_net_lock_current();
+	}
+	finish_wait(&the_lnet.ln_dc_waitq, &wait);
+
+	if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
+		rc = -ESHUTDOWN;
+
+	lnet_net_unlock(cpt);
+
+	CDEBUG(D_NET, "woken: %d\n", rc);
+
+	return rc;
+}
+
+/* The discovery thread. */
+static int lnet_peer_discovery(void *arg)
+{
+	struct lnet_peer *lp;
+	time64_t now;
+	int rc;
+
+	CDEBUG(D_NET, "started\n");
+
+	for (;;) {
+		if (lnet_peer_discovery_wait_for_work())
 			break;
 
 		if (lnet_push_target_resize_needed())
@@ -1719,33 +2984,97 @@ static int lnet_peer_discovery(void *arg)
 		lnet_net_lock(LNET_LOCK_EX);
 		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
 			break;
+
+		/*
+		 * Process all incoming discovery work requests.  When
+		 * discovery must wait on a peer to change state, it
+		 * is added to the tail of the ln_dc_working queue. A
+		 * timestamp keeps track of when the peer was added,
+		 * so we can time out discovery requests that take too
+		 * long.
+		 */
 		while (!list_empty(&the_lnet.ln_dc_request)) {
 			lp = list_first_entry(&the_lnet.ln_dc_request,
 					      struct lnet_peer, lp_dc_list);
 			list_move(&lp->lp_dc_list, &the_lnet.ln_dc_working);
+			/*
+			 * Record the time the peer was put on the
+			 * dc_working queue, so that it cannot remain
+			 * there forever if the GET message (for ping)
+			 * never gets a REPLY or the PUT message (for
+			 * push) never gets an ACK.
+			 *
+			 * TODO: LNet Health will deal with this scenario
+			 * in a generic way.
+			 */
+			lp->lp_last_queued = ktime_get_real_seconds();
 			lnet_net_unlock(LNET_LOCK_EX);
 
-			/* Just tag and release for now. */
+			/*
+			 * Select an action depending on the state of
+			 * the peer and whether discovery is disabled.
+			 * The check whether discovery is disabled is
+			 * done after the code that handles processing
+			 * for arrived data, cleanup for failures, and
+			 * forcing a Ping or Push.
+			 */
 			spin_lock(&lp->lp_lock);
-			if (lnet_peer_discovery_disabled) {
-				lp->lp_state |= LNET_PEER_REDISCOVER;
-				lp->lp_state &= ~(LNET_PEER_DISCOVERED |
-						  LNET_PEER_NIDS_UPTODATE |
-						  LNET_PEER_DISCOVERING);
-			} else {
-				lp->lp_state |= (LNET_PEER_DISCOVERED |
-						 LNET_PEER_NIDS_UPTODATE);
-				lp->lp_state &= ~(LNET_PEER_REDISCOVER |
-						  LNET_PEER_DISCOVERING);
-			}
+			CDEBUG(D_NET, "peer %s state %#x\n",
+			       libcfs_nid2str(lp->lp_primary_nid),
+			       lp->lp_state);
+			if (lp->lp_state & LNET_PEER_DATA_PRESENT)
+				rc = lnet_peer_data_present(lp);
+			else if (lp->lp_state & LNET_PEER_PING_FAILED)
+				rc = lnet_peer_ping_failed(lp);
+			else if (lp->lp_state & LNET_PEER_PUSH_FAILED)
+				rc = lnet_peer_push_failed(lp);
+			else if (lp->lp_state & LNET_PEER_FORCE_PING)
+				rc = lnet_peer_send_ping(lp);
+			else if (lp->lp_state & LNET_PEER_FORCE_PUSH)
+				rc = lnet_peer_send_push(lp);
+			else if (lnet_peer_discovery_disabled)
+				rc = lnet_peer_rediscover(lp);
+			else if (!(lp->lp_state & LNET_PEER_NIDS_UPTODATE))
+				rc = lnet_peer_send_ping(lp);
+			else if (lnet_peer_needs_push(lp))
+				rc = lnet_peer_send_push(lp);
+			else
+				rc = lnet_peer_discovered(lp);
+			CDEBUG(D_NET, "peer %s state %#x rc %d\n",
+			       libcfs_nid2str(lp->lp_primary_nid),
+			       lp->lp_state, rc);
 			spin_unlock(&lp->lp_lock);
 
 			lnet_net_lock(LNET_LOCK_EX);
+			if (rc == LNET_REDISCOVER_PEER) {
+				list_move(&lp->lp_dc_list,
+					  &the_lnet.ln_dc_request);
+			} else if (rc) {
+				lnet_peer_discovery_error(lp, rc);
+			}
 			if (!(lp->lp_state & LNET_PEER_DISCOVERING))
 				lnet_peer_discovery_complete(lp);
 			if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
 				break;
 		}
+
+		/*
+		 * Now that the ln_dc_request queue has been emptied
+		 * check the ln_dc_working queue for peers that are
+		 * taking too long. Move all that are found to the
+		 * ln_dc_expired queue and time out any pending
+		 * Ping or Push. We have to drop the lnet_net_lock
+		 * in the loop because lnet_peer_discovery_timeout()
+		 * calls LNetMDUnlink().
+		 */
+		now = ktime_get_real_seconds();
+		while ((lp = lnet_peer_dc_timed_out(now)) != NULL) {
+			list_move(&lp->lp_dc_list, &the_lnet.ln_dc_expired);
+			lnet_net_unlock(LNET_LOCK_EX);
+			lnet_peer_discovery_timeout(lp);
+			lnet_net_lock(LNET_LOCK_EX);
+		}
+
 		lnet_net_unlock(LNET_LOCK_EX);
 	}
 
@@ -1759,23 +3088,28 @@ static int lnet_peer_discovery(void *arg)
 	LNetEQFree(the_lnet.ln_dc_eqh);
 	LNetInvalidateEQHandle(&the_lnet.ln_dc_eqh);
 
+	/* Queue cleanup 1: stop all pending pings and pushes. */
 	lnet_net_lock(LNET_LOCK_EX);
-	list_for_each_entry(lp, &the_lnet.ln_dc_request, lp_dc_list) {
-		spin_lock(&lp->lp_lock);
-		lp->lp_state |= LNET_PEER_REDISCOVER;
-		lp->lp_state &= ~(LNET_PEER_DISCOVERED |
-				  LNET_PEER_DISCOVERING |
-				  LNET_PEER_NIDS_UPTODATE);
-		spin_unlock(&lp->lp_lock);
-		lnet_peer_discovery_complete(lp);
+	while (!list_empty(&the_lnet.ln_dc_working)) {
+		lp = list_first_entry(&the_lnet.ln_dc_working,
+				      struct lnet_peer, lp_dc_list);
+		list_move(&lp->lp_dc_list, &the_lnet.ln_dc_expired);
+		lnet_net_unlock(LNET_LOCK_EX);
+		lnet_peer_discovery_timeout(lp);
+		lnet_net_lock(LNET_LOCK_EX);
 	}
-	list_for_each_entry(lp, &the_lnet.ln_dc_working, lp_dc_list) {
-		spin_lock(&lp->lp_lock);
-		lp->lp_state |= LNET_PEER_REDISCOVER;
-		lp->lp_state &= ~(LNET_PEER_DISCOVERED |
-				  LNET_PEER_DISCOVERING |
-				  LNET_PEER_NIDS_UPTODATE);
-		spin_unlock(&lp->lp_lock);
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	/* Queue cleanup 2: wait for the expired queue to clear. */
+	while (!list_empty(&the_lnet.ln_dc_expired))
+		schedule_timeout_uninterruptible(HZ);
+
+	/* Queue cleanup 3: clear the request queue. */
+	lnet_net_lock(LNET_LOCK_EX);
+	while (!list_empty(&the_lnet.ln_dc_request)) {
+		lp = list_first_entry(&the_lnet.ln_dc_request,
+				      struct lnet_peer, lp_dc_list);
+		lnet_peer_discovery_error(lp, -ESHUTDOWN);
 		lnet_peer_discovery_complete(lp);
 	}
 	lnet_net_unlock(LNET_LOCK_EX);
@@ -1797,10 +3131,6 @@ int lnet_peer_discovery_start(void)
 	if (the_lnet.ln_dc_state != LNET_DC_STATE_SHUTDOWN)
 		return -EALREADY;
 
-	INIT_LIST_HEAD(&the_lnet.ln_dc_request);
-	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
-	init_waitqueue_head(&the_lnet.ln_dc_waitq);
-
 	rc = LNetEQAlloc(0, lnet_discovery_event_handler, &the_lnet.ln_dc_eqh);
 	if (rc != 0) {
 		CERROR("Can't allocate discovery EQ: %d\n", rc);
@@ -1819,6 +3149,8 @@ int lnet_peer_discovery_start(void)
 		the_lnet.ln_dc_state = LNET_DC_STATE_SHUTDOWN;
 	}
 
+	CDEBUG(D_NET, "discovery start: %d\n", rc);
+
 	return rc;
 }
 
@@ -1837,6 +3169,9 @@ void lnet_peer_discovery_stop(void)
 
 	LASSERT(list_empty(&the_lnet.ln_dc_request));
 	LASSERT(list_empty(&the_lnet.ln_dc_working));
+	LASSERT(list_empty(&the_lnet.ln_dc_expired));
+
+	CDEBUG(D_NET, "discovery stopped\n");
 }
 
 /* Debugging */

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 19/24] lustre: lnet: add "lnetctl peer list"
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (14 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 24/24] lustre: lnet: balance references in lnet_discover_peer_locked() NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 23:38   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 13/24] lustre: lnet: add LNET_PEER_CONFIGURED flag NeilBrown
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add IOC_LIBCFS_GET_PEER_LIST to obtain a list of the primary
NIDs of all peers known to the system. The list is written
into a userspace buffer by the kernel. The typical usage is
to make a first call to determine the required buffer size,
then a second call to obtain the list.

Extend the "lnetctl peer" set of commands with a "list"
subcommand that uses this interface.

Modify the IOC_LIBCFS_GET_PEER_NI ioctl (which is new in the
Multi-Rail code) to use a NID to indicate the peer to look
up, and then pass out the data for all NIDs of that peer.

Re-implement "lnetctl peer show" to obtain the list of NIDs
using IOC_LIBCFS_GET_PEER_LIST followed by one or more
IOC_LIBCFS_GET_PEER_NI calls to get information for each
peer.

Make sure to copy the structure from kernel space to
user space even if the ioctl handler returns an error.
This is needed because when the buffer provided by user
space is too small to hold the data, we still want to
report the required size back in that structure. The
return code in this case is -E2BIG.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25790
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    9 -
 .../staging/lustre/include/linux/lnet/lib-types.h  |    3 
 .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    3 
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   30 ++-
 drivers/staging/lustre/lnet/lnet/peer.c            |  222 +++++++++++++-------
 5 files changed, 169 insertions(+), 98 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index f82a699371f2..58e3a9c4e39f 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -462,6 +462,8 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg);
 struct lnet_ni *lnet_get_next_ni_locked(struct lnet_net *mynet,
 					struct lnet_ni *prev);
 struct lnet_ni *lnet_get_ni_idx_locked(int idx);
+int lnet_get_peer_list(__u32 *countp, __u32 *sizep,
+		       struct lnet_process_id __user *ids);
 
 void lnet_router_debugfs_init(void);
 void lnet_router_debugfs_fini(void);
@@ -730,10 +732,9 @@ bool lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid);
 int lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid);
 int lnet_add_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
 int lnet_del_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid);
-int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
-		       bool *mr,
-		       struct lnet_peer_ni_credit_info __user *peer_ni_info,
-		       struct lnet_ioctl_element_stats __user *peer_ni_stats);
+int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nid,
+		       __u32 *nnis, bool *mr, __u32 *sizep,
+		       void __user *bulk);
 int lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
 			  char alivness[LNET_MAX_STR_LEN],
 			  __u32 *cpt_iter, __u32 *refcount,
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 07baa86e61ab..8543a67420d7 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -651,7 +651,6 @@ struct lnet_peer_net {
  *    pt_hash[...]
  *    pt_peer_list
  *    pt_peers
- *    pt_peer_nnids
  * protected by pt_zombie_lock:
  *    pt_zombie_list
  *    pt_zombies
@@ -667,8 +666,6 @@ struct lnet_peer_table {
 	struct list_head	pt_peer_list;
 	/* # peers */
 	int			pt_peers;
-	/* # NIDS on listed peers */
-	int			pt_peer_nnids;
 	/* # zombies to go to deathrow (and not there yet) */
 	int			 pt_zombies;
 	/* zombie peers_ni */
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
index 2a9beed23985..2607620e8ef8 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
@@ -144,6 +144,7 @@ struct libcfs_debug_ioctl_data {
 #define IOC_LIBCFS_GET_LOCAL_NI		_IOWR(IOC_LIBCFS_TYPE, 97, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_SET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 98, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_GET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 99, IOCTL_CONFIG_SIZE)
-#define IOC_LIBCFS_MAX_NR		99
+#define IOC_LIBCFS_GET_PEER_LIST	_IOWR(IOC_LIBCFS_TYPE, 100, IOCTL_CONFIG_SIZE)
+#define IOC_LIBCFS_MAX_NR		100
 
 #endif /* __LIBCFS_IOCTL_H__ */
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 955d1711eda4..f624abe7db80 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -3117,21 +3117,31 @@ LNetCtl(unsigned int cmd, void *arg)
 
 	case IOC_LIBCFS_GET_PEER_NI: {
 		struct lnet_ioctl_peer_cfg *cfg = arg;
-		struct lnet_peer_ni_credit_info __user *lpni_cri;
-		struct lnet_ioctl_element_stats __user *lpni_stats;
-		size_t usr_size = sizeof(*lpni_cri) + sizeof(*lpni_stats);
 
-		if ((cfg->prcfg_hdr.ioc_len != sizeof(*cfg)) ||
-		    (cfg->prcfg_size != usr_size))
+		if (cfg->prcfg_hdr.ioc_len < sizeof(*cfg))
 			return -EINVAL;
 
-		lpni_cri = cfg->prcfg_bulk;
-		lpni_stats = cfg->prcfg_bulk + sizeof(*lpni_cri);
+		mutex_lock(&the_lnet.ln_api_mutex);
+		rc = lnet_get_peer_info(&cfg->prcfg_prim_nid,
+					&cfg->prcfg_cfg_nid,
+					&cfg->prcfg_count,
+					&cfg->prcfg_mr,
+					&cfg->prcfg_size,
+					(void __user *)cfg->prcfg_bulk);
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return rc;
+	}
+
+	case IOC_LIBCFS_GET_PEER_LIST: {
+		struct lnet_ioctl_peer_cfg *cfg = arg;
+
+		if (cfg->prcfg_hdr.ioc_len < sizeof(*cfg))
+			return -EINVAL;
 
 		mutex_lock(&the_lnet.ln_api_mutex);
-		rc = lnet_get_peer_info(cfg->prcfg_count, &cfg->prcfg_prim_nid,
-					&cfg->prcfg_cfg_nid, &cfg->prcfg_mr,
-					lpni_cri, lpni_stats);
+		rc = lnet_get_peer_list(&cfg->prcfg_count, &cfg->prcfg_size,
+					(struct lnet_process_id __user *)
+					cfg->prcfg_bulk);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
 	}
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 1ef4a44e752e..8dff3b767577 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -263,9 +263,7 @@ lnet_peer_detach_peer_ni_locked(struct lnet_peer_ni *lpni)
 
 	/* Update peer NID count. */
 	lp = lpn->lpn_peer;
-	ptable = the_lnet.ln_peer_tables[lp->lp_cpt];
 	lp->lp_nnis--;
-	ptable->pt_peer_nnids--;
 
 	/*
 	 * If there are no more peer nets, make the peer unfindable
@@ -277,6 +275,7 @@ lnet_peer_detach_peer_ni_locked(struct lnet_peer_ni *lpni)
 	 */
 	if (list_empty(&lp->lp_peer_nets)) {
 		list_del_init(&lp->lp_peer_list);
+		ptable = the_lnet.ln_peer_tables[lp->lp_cpt];
 		ptable->pt_peers--;
 	} else if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING) {
 		/* Discovery isn't running, nothing to do here. */
@@ -637,44 +636,6 @@ lnet_find_peer(lnet_nid_t nid)
 	return lp;
 }
 
-struct lnet_peer_ni *
-lnet_get_peer_ni_idx_locked(int idx, struct lnet_peer_net **lpn,
-			    struct lnet_peer **lp)
-{
-	struct lnet_peer_table	*ptable;
-	struct lnet_peer_ni	*lpni;
-	int			lncpt;
-	int			cpt;
-
-	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
-
-	for (cpt = 0; cpt < lncpt; cpt++) {
-		ptable = the_lnet.ln_peer_tables[cpt];
-		if (ptable->pt_peer_nnids > idx)
-			break;
-		idx -= ptable->pt_peer_nnids;
-	}
-	if (cpt >= lncpt)
-		return NULL;
-
-	list_for_each_entry((*lp), &ptable->pt_peer_list, lp_peer_list) {
-		if ((*lp)->lp_nnis <= idx) {
-			idx -= (*lp)->lp_nnis;
-			continue;
-		}
-		list_for_each_entry((*lpn), &((*lp)->lp_peer_nets),
-				    lpn_peer_nets) {
-			list_for_each_entry(lpni, &((*lpn)->lpn_peer_nis),
-					    lpni_peer_nis) {
-				if (idx-- == 0)
-					return lpni;
-			}
-		}
-	}
-
-	return NULL;
-}
-
 struct lnet_peer_ni *
 lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 			     struct lnet_peer_net *peer_net,
@@ -734,6 +695,69 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 	return lpni;
 }
 
+/* Call with the ln_api_mutex held */
+int
+lnet_get_peer_list(__u32 *countp, __u32 *sizep,
+		   struct lnet_process_id __user *ids)
+{
+	struct lnet_process_id id;
+	struct lnet_peer_table *ptable;
+	struct lnet_peer *lp;
+	__u32 count = 0;
+	__u32 size = 0;
+	int lncpt;
+	int cpt;
+	__u32 i;
+	int rc;
+
+	rc = -ESHUTDOWN;
+	if (the_lnet.ln_state == LNET_STATE_SHUTDOWN)
+		goto done;
+
+	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
+
+	/*
+	 * Count the number of peers, and return -E2BIG if the buffer
+	 * is too small. We'll also return the required size.
+	 */
+	rc = -E2BIG;
+	for (cpt = 0; cpt < lncpt; cpt++) {
+		ptable = the_lnet.ln_peer_tables[cpt];
+		count += ptable->pt_peers;
+	}
+	size = count * sizeof(*ids);
+	if (size > *sizep)
+		goto done;
+
+	/*
+	 * Walk the peer lists and copy out the primary nids.
+	 * This is safe because the peer lists are only modified
+	 * while the ln_api_mutex is held. So we don't need to
+	 * hold the lnet_net_lock as well, and can therefore
+	 * directly call copy_to_user().
+	 */
+	rc = -EFAULT;
+	memset(&id, 0, sizeof(id));
+	id.pid = LNET_PID_LUSTRE;
+	i = 0;
+	for (cpt = 0; cpt < lncpt; cpt++) {
+		ptable = the_lnet.ln_peer_tables[cpt];
+		list_for_each_entry(lp, &ptable->pt_peer_list, lp_peer_list) {
+			if (i >= count)
+				goto done;
+			id.nid = lp->lp_primary_nid;
+			if (copy_to_user(&ids[i], &id, sizeof(id)))
+				goto done;
+			i++;
+		}
+	}
+	rc = 0;
+done:
+	*countp = count;
+	*sizep = size;
+	return rc;
+}
+
 /*
  * Start pushes to peers that need to be updated for a configuration
  * change on this node.
@@ -1128,7 +1152,6 @@ lnet_peer_attach_peer_ni(struct lnet_peer *lp,
 	spin_unlock(&lp->lp_lock);
 
 	lp->lp_nnis++;
-	the_lnet.ln_peer_tables[lp->lp_cpt]->pt_peer_nnids++;
 	lnet_net_unlock(LNET_LOCK_EX);
 
 	CDEBUG(D_NET, "peer %s NID %s flags %#x\n",
@@ -3273,55 +3296,94 @@ lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
 }
 
 /* ln_api_mutex is held, which keeps the peer list stable */
-int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
-		       bool *mr,
-		       struct lnet_peer_ni_credit_info __user *peer_ni_info,
-		       struct lnet_ioctl_element_stats __user *peer_ni_stats)
+int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
+		       __u32 *nnis, bool *mr, __u32 *sizep,
+		       void __user *bulk)
 {
-	struct lnet_ioctl_element_stats ni_stats;
-	struct lnet_peer_ni_credit_info ni_info;
-	struct lnet_peer_ni *lpni = NULL;
-	struct lnet_peer_net *lpn = NULL;
-	struct lnet_peer *lp = NULL;
+	struct lnet_ioctl_element_stats *lpni_stats;
+	struct lnet_peer_ni_credit_info *lpni_info;
+	struct lnet_peer_ni *lpni;
+	struct lnet_peer *lp;
+	lnet_nid_t nid;
+	__u32 size;
 	int rc;
 
-	lpni = lnet_get_peer_ni_idx_locked(idx, &lpn, &lp);
+	lp = lnet_find_peer(*primary_nid);
 
-	if (!lpni)
-		return -ENOENT;
+	if (!lp) {
+		rc = -ENOENT;
+		goto out;
+	}
+
+	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats);
+	size *= lp->lp_nnis;
+	if (size > *sizep) {
+		*sizep = size;
+		rc = -E2BIG;
+		goto out_lp_decref;
+	}
 
 	*primary_nid = lp->lp_primary_nid;
 	*mr = lnet_peer_is_multi_rail(lp);
-	*nid = lpni->lpni_nid;
-	snprintf(ni_info.cr_aliveness, LNET_MAX_STR_LEN, "NA");
-	if (lnet_isrouter(lpni) ||
-	    lnet_peer_aliveness_enabled(lpni))
-		snprintf(ni_info.cr_aliveness, LNET_MAX_STR_LEN,
-			 lpni->lpni_alive ? "up" : "down");
-
-	ni_info.cr_refcount = atomic_read(&lpni->lpni_refcount);
-	ni_info.cr_ni_peer_tx_credits = lpni->lpni_net ?
-		lpni->lpni_net->net_tunables.lct_peer_tx_credits : 0;
-	ni_info.cr_peer_tx_credits = lpni->lpni_txcredits;
-	ni_info.cr_peer_rtr_credits = lpni->lpni_rtrcredits;
-	ni_info.cr_peer_min_rtr_credits = lpni->lpni_minrtrcredits;
-	ni_info.cr_peer_min_tx_credits = lpni->lpni_mintxcredits;
-	ni_info.cr_peer_tx_qnob = lpni->lpni_txqnob;
-
-	ni_stats.iel_send_count = atomic_read(&lpni->lpni_stats.send_count);
-	ni_stats.iel_recv_count = atomic_read(&lpni->lpni_stats.recv_count);
-	ni_stats.iel_drop_count = atomic_read(&lpni->lpni_stats.drop_count);
-
-	/* If copy_to_user fails */
-	rc = -EFAULT;
-	if (copy_to_user(peer_ni_info, &ni_info, sizeof(ni_info)))
-		goto copy_failed;
+	*nidp = lp->lp_primary_nid;
+	*nnis = lp->lp_nnis;
+	*sizep = size;
 
-	if (copy_to_user(peer_ni_stats, &ni_stats, sizeof(ni_stats)))
-		goto copy_failed;
+	/* Allocate helper buffers. */
+	rc = -ENOMEM;
+	lpni_info = kzalloc(sizeof(*lpni_info), GFP_KERNEL);
+	if (!lpni_info)
+		goto out_lp_decref;
+	lpni_stats = kzalloc(sizeof(*lpni_stats), GFP_KERNEL);
+	if (!lpni_stats)
+		goto out_free_info;
 
+	lpni = NULL;
+	rc = -EFAULT;
+	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
+		nid = lpni->lpni_nid;
+		if (copy_to_user(bulk, &nid, sizeof(nid)))
+			goto out_free_stats;
+		bulk += sizeof(nid);
+
+		memset(lpni_info, 0, sizeof(*lpni_info));
+		snprintf(lpni_info->cr_aliveness, LNET_MAX_STR_LEN, "NA");
+		if (lnet_isrouter(lpni) ||
+		    lnet_peer_aliveness_enabled(lpni))
+			snprintf(lpni_info->cr_aliveness, LNET_MAX_STR_LEN,
+				 lpni->lpni_alive ? "up" : "down");
+
+		lpni_info->cr_refcount = atomic_read(&lpni->lpni_refcount);
+		lpni_info->cr_ni_peer_tx_credits = lpni->lpni_net ?
+			lpni->lpni_net->net_tunables.lct_peer_tx_credits : 0;
+		lpni_info->cr_peer_tx_credits = lpni->lpni_txcredits;
+		lpni_info->cr_peer_rtr_credits = lpni->lpni_rtrcredits;
+		lpni_info->cr_peer_min_rtr_credits = lpni->lpni_minrtrcredits;
+		lpni_info->cr_peer_min_tx_credits = lpni->lpni_mintxcredits;
+		lpni_info->cr_peer_tx_qnob = lpni->lpni_txqnob;
+		if (copy_to_user(bulk, lpni_info, sizeof(*lpni_info)))
+			goto out_free_stats;
+		bulk += sizeof(*lpni_info);
+
+		memset(lpni_stats, 0, sizeof(*lpni_stats));
+		lpni_stats->iel_send_count =
+			atomic_read(&lpni->lpni_stats.send_count);
+		lpni_stats->iel_recv_count =
+			atomic_read(&lpni->lpni_stats.recv_count);
+		lpni_stats->iel_drop_count =
+			atomic_read(&lpni->lpni_stats.drop_count);
+		if (copy_to_user(bulk, lpni_stats, sizeof(*lpni_stats)))
+			goto out_free_stats;
+		bulk += sizeof(*lpni_stats);
+	}
 	rc = 0;
 
-copy_failed:
+out_free_stats:
+	kfree(lpni_stats);
+out_free_info:
+	kfree(lpni_info);
+out_lp_decref:
+	lnet_peer_decref_locked(lp);
+out:
 	return rc;
 }


* [lustre-devel] [PATCH 20/24] lustre: lnet: add "lnetctl ping" command
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (18 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 12/24] lustre: lnet: preferred NIs for non-Multi-Rail peers NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 23:43   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 23/24] lustre: lnet: show peer state NeilBrown
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf@sgi.com>

Add jt_ping() in lnetctl.c and lustre_lnet_ping_nid() in
liblnetconfig.c. The output of "lnetctl ping" is similar
to that of "lnetctl peer show".

jt_ping() calls lustre_lnet_ping_nid() to implement
"lnetctl ping". Add infra_ping_nid() so it can later be
reused by similar ping-style lnetctl commands. "lnetctl
ping" uses a new ioctl, IOC_LIBCFS_PING_PEER, and can ping
multiple NIDs. A new struct (lnet_ioctl_ping_data in
lib-dlc.h) passes the ping data from kernel to user space.
Also change lnet_ping() and its input parameters in
drivers/staging/lustre/lnet/lnet/api-ni.c.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Sonia Sharma <sonia.sharma@intel.com>
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25791
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Tested-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    5 +-
 .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    2 -
 .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |    2 -
 .../lustre/lnet/klnds/socklnd/socklnd_modparams.c  |    2 -
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   55 +++++++++++++++-----
 drivers/staging/lustre/lnet/lnet/peer.c            |    2 -
 6 files changed, 47 insertions(+), 21 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 58e3a9c4e39f..adb4d0551ef5 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -76,8 +76,8 @@ extern struct lnet the_lnet;	/* THE network */
 #define LNET_ACCEPTOR_MIN_RESERVED_PORT    512
 #define LNET_ACCEPTOR_MAX_RESERVED_PORT    1023
 
-/* Discovery timeout - same as default peer_timeout */
-#define DISCOVERY_TIMEOUT	180
+/* default timeout */
+#define DEFAULT_PEER_TIMEOUT    180
 
 static inline int lnet_is_route_alive(struct lnet_route *route)
 {
@@ -716,6 +716,7 @@ struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref,
 					    int cpt);
 struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
 struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
+struct lnet_peer *lnet_find_peer(lnet_nid_t nid);
 void lnet_peer_net_added(struct lnet_net *net);
 lnet_nid_t lnet_peer_primary_nid_locked(lnet_nid_t nid);
 int lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block);
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
index 2607620e8ef8..3d89202bd396 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
@@ -102,7 +102,7 @@ struct libcfs_debug_ioctl_data {
 #define IOC_LIBCFS_CONFIGURE		   _IOWR('e', 59, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_TESTPROTOCOMPAT	   _IOWR('e', 60, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_PING			   _IOWR('e', 61, IOCTL_LIBCFS_TYPE)
-/*	IOC_LIBCFS_DEBUG_PEER		   _IOWR('e', 62, IOCTL_LIBCFS_TYPE) */
+#define IOC_LIBCFS_PING_PEER               _IOWR('e', 62, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_LNETST		   _IOWR('e', 63, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_LNET_FAULT		   _IOWR('e', 64, IOCTL_LIBCFS_TYPE)
 /* lnd ioctls */
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
index 0f2ad9110dc9..13b19f3eabf0 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
@@ -83,7 +83,7 @@ static int peer_buffer_credits;
 module_param(peer_buffer_credits, int, 0444);
 MODULE_PARM_DESC(peer_buffer_credits, "# per-peer router buffer credits");
 
-static int peer_timeout = 180;
+static int peer_timeout = DEFAULT_PEER_TIMEOUT;
 module_param(peer_timeout, int, 0444);
 MODULE_PARM_DESC(peer_timeout, "Seconds without aliveness news to declare peer dead (<=0 to disable)");
 
diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_modparams.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_modparams.c
index 5663a4ca94d4..da5910049fc1 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_modparams.c
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_modparams.c
@@ -35,7 +35,7 @@ static int peer_buffer_credits;
 module_param(peer_buffer_credits, int, 0444);
 MODULE_PARM_DESC(peer_buffer_credits, "# per-peer router buffer credits");
 
-static int peer_timeout = 180;
+static int peer_timeout = DEFAULT_PEER_TIMEOUT;
 module_param(peer_timeout, int, 0444);
 MODULE_PARM_DESC(peer_timeout, "Seconds without aliveness news to declare peer dead (<=0 to disable)");
 
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index f624abe7db80..37f47bd1511f 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -3181,24 +3181,50 @@ LNetCtl(unsigned int cmd, void *arg)
 		id.nid = data->ioc_nid;
 		id.pid = data->ioc_u32[0];
 
-		/* Don't block longer than 2 minutes */
-		if (data->ioc_u32[1] > 120 * MSEC_PER_SEC)
-			return -EINVAL;
-
-		/* If timestamp is negative then disable timeout */
-		if ((s32)data->ioc_u32[1] < 0)
-			timeout = MAX_SCHEDULE_TIMEOUT;
+		/* Use the 3 minute default if timeout is non-positive or too large */
+		if (((s32)data->ioc_u32[1] <= 0) ||
+		    data->ioc_u32[1] > (DEFAULT_PEER_TIMEOUT * MSEC_PER_SEC))
+			timeout = DEFAULT_PEER_TIMEOUT * HZ;
 		else
 			timeout = msecs_to_jiffies(data->ioc_u32[1]);
 
 		rc = lnet_ping(id, timeout, data->ioc_pbuf1,
 			       data->ioc_plen1 / sizeof(struct lnet_process_id));
+
 		if (rc < 0)
 			return rc;
+
 		data->ioc_count = rc;
 		return 0;
 	}
 
+	case IOC_LIBCFS_PING_PEER: {
+		struct lnet_ioctl_ping_data *ping = arg;
+		struct lnet_peer *lp;
+		signed long timeout;
+
+		/* Use the 3 minute default if timeout is non-positive or too large */
+		if (((s32)ping->op_param) <= 0 ||
+		    ping->op_param > (DEFAULT_PEER_TIMEOUT * MSEC_PER_SEC))
+			timeout = DEFAULT_PEER_TIMEOUT * HZ;
+		else
+			timeout = msecs_to_jiffies(ping->op_param);
+
+		rc = lnet_ping(ping->ping_id, timeout,
+			       ping->ping_buf,
+			       ping->ping_count);
+		if (rc < 0)
+			return rc;
+
+		lp = lnet_find_peer(ping->ping_id.nid);
+		if (lp) {
+			ping->ping_id.nid = lp->lp_primary_nid;
+			ping->mr_info = lnet_peer_is_multi_rail(lp);
+		}
+		ping->ping_count = rc;
+		return 0;
+	}
+
 	default:
 		ni = lnet_net2ni_addref(data->ioc_net);
 		if (!ni)
@@ -3301,7 +3327,7 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	/* initialize md content */
 	md.start     = &pbuf->pb_info;
 	md.length    = LNET_PING_INFO_SIZE(n_ids);
-	md.threshold = 2; /*GET/REPLY*/
+	md.threshold = 2; /* GET/REPLY */
 	md.max_size  = 0;
 	md.options   = LNET_MD_TRUNCATE;
 	md.user_ptr  = NULL;
@@ -3319,7 +3345,6 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 
 	if (rc) {
 		/* Don't CERROR; this could be deliberate! */
-
 		rc2 = LNetMDUnlink(mdh);
 		LASSERT(!rc2);
 
@@ -3363,7 +3388,6 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 			replied = 1;
 			rc = event.mlength;
 		}
-
 	} while (rc2 <= 0 || !event.unlinked);
 
 	if (!replied) {
@@ -3377,10 +3401,9 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	nob = rc;
 	LASSERT(nob >= 0 && nob <= LNET_PING_INFO_SIZE(n_ids));
 
-	rc = -EPROTO;			   /* if I can't parse... */
+	rc = -EPROTO;		/* if I can't parse... */
 
 	if (nob < 8) {
-		/* can't check magic/version */
 		CERROR("%s: ping info too short %d\n",
 		       libcfs_id2str(id), nob);
 		goto fail_free_eq;
@@ -3401,7 +3424,8 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	}
 
 	if (nob < LNET_PING_INFO_SIZE(0)) {
-		CERROR("%s: Short reply %d(%d min)\n", libcfs_id2str(id),
+		CERROR("%s: Short reply %d(%d min)\n",
+		       libcfs_id2str(id),
 		       nob, (int)LNET_PING_INFO_SIZE(0));
 		goto fail_free_eq;
 	}
@@ -3410,12 +3434,13 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 		n_ids = pbuf->pb_info.pi_nnis;
 
 	if (nob < LNET_PING_INFO_SIZE(n_ids)) {
-		CERROR("%s: Short reply %d(%d expected)\n", libcfs_id2str(id),
+		CERROR("%s: Short reply %d(%d expected)\n",
+		       libcfs_id2str(id),
 		       nob, (int)LNET_PING_INFO_SIZE(n_ids));
 		goto fail_free_eq;
 	}
 
-	rc = -EFAULT;			   /* If I SEGV... */
+	rc = -EFAULT;		/* if I segv in copy_to_user()... */
 
 	memset(&tmpid, 0, sizeof(tmpid));
 	for (i = 0; i < n_ids; i++) {
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 8dff3b767577..95f72ae39a89 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -2905,7 +2905,7 @@ static struct lnet_peer *lnet_peer_dc_timed_out(time64_t now)
 		return NULL;
 	lp = list_first_entry(&the_lnet.ln_dc_working,
 			      struct lnet_peer, lp_dc_list);
-	if (now < lp->lp_last_queued + DISCOVERY_TIMEOUT)
+	if (now < lp->lp_last_queued + DEFAULT_PEER_TIMEOUT)
 		return NULL;
 	return lp;
 }

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 21/24] lustre: lnet: add "lnetctl discover"
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (16 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 13/24] lustre: lnet: add LNET_PEER_CONFIGURED flag NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 23:45   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 12/24] lustre: lnet: preferred NIs for non-Multi-Rail peers NeilBrown
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Sonia Sharma <sonia.sharma@intel.com>

Add a "discover" subcommand to lnetctl.

jt_discover() in lnetctl.c calls lustre_lnet_discover_nid()
to implement "lnetctl discover". The output is similar to
that of the "lnetctl ping" command.

This patch also does some cleanup in liblnetconfig.c:
for the parameters under global settings, the common code
is pulled into the functions ioctl_set_value() and
ioctl_show_global_values().

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Sonia Sharma <sonia.sharma@intel.com>
Signed-off-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25793
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    2 
 drivers/staging/lustre/lnet/lnet/api-ni.c          |  100 ++++++++++++++++++++
 2 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
index 3d89202bd396..60bc9713923e 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
@@ -113,7 +113,7 @@ struct libcfs_debug_ioctl_data {
 #define IOC_LIBCFS_DEL_PEER		   _IOWR('e', 74, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_ADD_PEER		   _IOWR('e', 75, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_GET_PEER		   _IOWR('e', 76, IOCTL_LIBCFS_TYPE)
-/* ioctl 77 is free for use */
+#define IOC_LIBCFS_DISCOVER                _IOWR('e', 77, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_ADD_INTERFACE	   _IOWR('e', 78, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_DEL_INTERFACE	   _IOWR('e', 79, IOCTL_LIBCFS_TYPE)
 #define IOC_LIBCFS_GET_INTERFACE	   _IOWR('e', 80, IOCTL_LIBCFS_TYPE)
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 37f47bd1511f..0511c6acb9b1 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -104,6 +104,9 @@ static atomic_t lnet_dlc_seq_no = ATOMIC_INIT(0);
 static int lnet_ping(struct lnet_process_id id, signed long timeout,
 		     struct lnet_process_id __user *ids, int n_ids);
 
+static int lnet_discover(struct lnet_process_id id, __u32 force,
+			 struct lnet_process_id __user *ids, int n_ids);
+
 static int
 discovery_set(const char *val, const struct kernel_param *kp)
 {
@@ -3225,6 +3228,25 @@ LNetCtl(unsigned int cmd, void *arg)
 		return 0;
 	}
 
+	case IOC_LIBCFS_DISCOVER: {
+		struct lnet_ioctl_ping_data *discover = arg;
+		struct lnet_peer *lp;
+
+		rc = lnet_discover(discover->ping_id, discover->op_param,
+				   discover->ping_buf,
+				   discover->ping_count);
+		if (rc < 0)
+			return rc;
+		lp = lnet_find_peer(discover->ping_id.nid);
+		if (lp) {
+			discover->ping_id.nid = lp->lp_primary_nid;
+			discover->mr_info = lnet_peer_is_multi_rail(lp);
+		}
+
+		discover->ping_count = rc;
+		return 0;
+	}
+
 	default:
 		ni = lnet_net2ni_addref(data->ioc_net);
 		if (!ni)
@@ -3461,3 +3483,81 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	lnet_ping_buffer_decref(pbuf);
 	return rc;
 }
+
+static int
+lnet_discover(struct lnet_process_id id, __u32 force,
+	      struct lnet_process_id __user *ids,
+	      int n_ids)
+{
+	struct lnet_peer_ni *lpni;
+	struct lnet_peer_ni *p;
+	struct lnet_peer *lp;
+	struct lnet_process_id *buf;
+	int cpt;
+	int i;
+	int rc;
+	int max_intf = lnet_interfaces_max;
+
+	if (n_ids <= 0 ||
+	    id.nid == LNET_NID_ANY ||
+	    n_ids > max_intf)
+		return -EINVAL;
+
+	if (id.pid == LNET_PID_ANY)
+		id.pid = LNET_PID_LUSTRE;
+
+	buf = kcalloc(n_ids, sizeof(*buf), GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	cpt = lnet_net_lock_current();
+	lpni = lnet_nid2peerni_locked(id.nid, LNET_NID_ANY, cpt);
+	if (IS_ERR(lpni)) {
+		rc = PTR_ERR(lpni);
+		goto out;
+	}
+
+	/*
+	 * Clearing the NIDS_UPTODATE flag ensures the peer will
+	 * be discovered, provided discovery has not been disabled.
+	 */
+	lp = lpni->lpni_peer_net->lpn_peer;
+	spin_lock(&lp->lp_lock);
+	lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
+	/* If the force flag is set, force a PING and PUSH as well. */
+	if (force)
+		lp->lp_state |= LNET_PEER_FORCE_PING | LNET_PEER_FORCE_PUSH;
+	spin_unlock(&lp->lp_lock);
+	rc = lnet_discover_peer_locked(lpni, cpt, true);
+	if (rc)
+		goto out_decref;
+
+	/* Peer may have changed. */
+	lp = lpni->lpni_peer_net->lpn_peer;
+	if (lp->lp_nnis < n_ids)
+		n_ids = lp->lp_nnis;
+
+	i = 0;
+	p = NULL;
+	while ((p = lnet_get_next_peer_ni_locked(lp, NULL, p)) != NULL) {
+		buf[i].pid = id.pid;
+		buf[i].nid = p->lpni_nid;
+		if (++i >= n_ids)
+			break;
+	}
+
+	lnet_net_unlock(cpt);
+
+	rc = -EFAULT;
+	if (copy_to_user(ids, buf, n_ids * sizeof(*buf)))
+		goto out_relock;
+	rc = n_ids;
+out_relock:
+	lnet_net_lock(cpt);
+out_decref:
+	lnet_peer_ni_decref_locked(lpni);
+out:
+	lnet_net_unlock(cpt);
+	kfree(buf);
+	return rc;
+}


* [lustre-devel] [PATCH 22/24] lustre: lnet: add enhanced statistics
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (22 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 18/24] lustre: lnet: implement Peer Discovery NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 23:50   ` James Simmons
  2018-10-14 23:54 ` [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging James Simmons
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

Add statistics to track the different types of LNet
messages that are sent, received, and dropped.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/25795
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |   12 ++
 .../staging/lustre/include/linux/lnet/lib-types.h  |   20 +++
 .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    3 -
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   45 +++++++-
 drivers/staging/lustre/lnet/lnet/lib-move.c        |  116 +++++++++++++++++++-
 drivers/staging/lustre/lnet/lnet/lib-msg.c         |   16 ++-
 drivers/staging/lustre/lnet/lnet/net_fault.c       |    3 -
 drivers/staging/lustre/lnet/lnet/peer.c            |   26 +++-
 8 files changed, 217 insertions(+), 24 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index adb4d0551ef5..91980f60a50d 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -575,7 +575,7 @@ void lnet_set_reply_msg_len(struct lnet_ni *ni, struct lnet_msg *msg,
 void lnet_finalize(struct lnet_msg *msg, int rc);
 
 void lnet_drop_message(struct lnet_ni *ni, int cpt, void *private,
-		       unsigned int nob);
+		       unsigned int nob, __u32 msg_type);
 void lnet_drop_delayed_msg_list(struct list_head *head, char *reason);
 void lnet_recv_delayed_msg_list(struct list_head *head);
 
@@ -825,4 +825,14 @@ lnet_peer_needs_push(struct lnet_peer *lp)
 	return false;
 }
 
+void lnet_incr_stats(struct lnet_element_stats *stats,
+		     enum lnet_msg_type msg_type,
+		     enum lnet_stats_type stats_type);
+
+__u32 lnet_sum_stats(struct lnet_element_stats *stats,
+		     enum lnet_stats_type stats_type);
+
+void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
+			      struct lnet_element_stats *stats);
+
 #endif
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 8543a67420d7..19f7b11a1e44 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -279,10 +279,24 @@ enum lnet_ni_state {
 	LNET_NI_STATE_DELETING
 };
 
+enum lnet_stats_type {
+	LNET_STATS_TYPE_SEND = 0,
+	LNET_STATS_TYPE_RECV,
+	LNET_STATS_TYPE_DROP
+};
+
+struct lnet_comm_count {
+	atomic_t co_get_count;
+	atomic_t co_put_count;
+	atomic_t co_reply_count;
+	atomic_t co_ack_count;
+	atomic_t co_hello_count;
+};
+
 struct lnet_element_stats {
-	atomic_t	send_count;
-	atomic_t	recv_count;
-	atomic_t	drop_count;
+	struct lnet_comm_count el_send_stats;
+	struct lnet_comm_count el_recv_stats;
+	struct lnet_comm_count el_drop_stats;
 };
 
 struct lnet_net {
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
index 60bc9713923e..4590f65c333f 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
@@ -145,6 +145,7 @@ struct libcfs_debug_ioctl_data {
 #define IOC_LIBCFS_SET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 98, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_GET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 99, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_GET_PEER_LIST	_IOWR(IOC_LIBCFS_TYPE, 100, IOCTL_CONFIG_SIZE)
-#define IOC_LIBCFS_MAX_NR		100
+#define IOC_LIBCFS_GET_LOCAL_NI_MSG_STATS  _IOWR(IOC_LIBCFS_TYPE, 101, IOCTL_CONFIG_SIZE)
+#define IOC_LIBCFS_MAX_NR		101
 
 #endif /* __LIBCFS_IOCTL_H__ */
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 0511c6acb9b1..0852118bf803 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -2263,8 +2263,12 @@ lnet_fill_ni_info(struct lnet_ni *ni, struct lnet_ioctl_config_ni *cfg_ni,
 	memcpy(&tun->lt_cmn, &ni->ni_net->net_tunables, sizeof(tun->lt_cmn));
 
 	if (stats) {
-		stats->iel_send_count = atomic_read(&ni->ni_stats.send_count);
-		stats->iel_recv_count = atomic_read(&ni->ni_stats.recv_count);
+		stats->iel_send_count = lnet_sum_stats(&ni->ni_stats,
+						       LNET_STATS_TYPE_SEND);
+		stats->iel_recv_count = lnet_sum_stats(&ni->ni_stats,
+						       LNET_STATS_TYPE_RECV);
+		stats->iel_drop_count = lnet_sum_stats(&ni->ni_stats,
+						       LNET_STATS_TYPE_DROP);
 	}
 
 	/*
@@ -2491,6 +2495,29 @@ lnet_get_ni_config(struct lnet_ioctl_config_ni *cfg_ni,
 	return rc;
 }
 
+int lnet_get_ni_stats(struct lnet_ioctl_element_msg_stats *msg_stats)
+{
+	struct lnet_ni *ni;
+	int cpt;
+	int rc = -ENOENT;
+
+	if (!msg_stats)
+		return -EINVAL;
+
+	cpt = lnet_net_lock_current();
+
+	ni = lnet_get_ni_idx_locked(msg_stats->im_idx);
+
+	if (ni) {
+		lnet_usr_translate_stats(msg_stats, &ni->ni_stats);
+		rc = 0;
+	}
+
+	lnet_net_unlock(cpt);
+
+	return rc;
+}
+
 static int lnet_add_net_common(struct lnet_net *net,
 			       struct lnet_ioctl_config_lnd_tunables *tun)
 {
@@ -2956,6 +2983,7 @@ LNetCtl(unsigned int cmd, void *arg)
 		__u32 tun_size;
 
 		cfg_ni = arg;
+
 		/* get the tunables if they are available */
 		if (cfg_ni->lic_cfg_hdr.ioc_len <
 		    sizeof(*cfg_ni) + sizeof(*stats) + sizeof(*tun))
@@ -2975,6 +3003,19 @@ LNetCtl(unsigned int cmd, void *arg)
 		return rc;
 	}
 
+	case IOC_LIBCFS_GET_LOCAL_NI_MSG_STATS: {
+		struct lnet_ioctl_element_msg_stats *msg_stats = arg;
+
+		if (msg_stats->im_hdr.ioc_len != sizeof(*msg_stats))
+			return -EINVAL;
+
+		mutex_lock(&the_lnet.ln_api_mutex);
+		rc = lnet_get_ni_stats(msg_stats);
+		mutex_unlock(&the_lnet.ln_api_mutex);
+
+		return rc;
+	}
+
 	case IOC_LIBCFS_GET_NET: {
 		size_t total = sizeof(*config) +
 			       sizeof(struct lnet_ioctl_net_config);
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index 2ff329bf91ba..5694d85c713c 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -45,6 +45,104 @@ static int local_nid_dist_zero = 1;
 module_param(local_nid_dist_zero, int, 0444);
 MODULE_PARM_DESC(local_nid_dist_zero, "Reserved");
 
+static inline struct lnet_comm_count *
+get_stats_counts(struct lnet_element_stats *stats,
+		 enum lnet_stats_type stats_type)
+{
+	switch (stats_type) {
+	case LNET_STATS_TYPE_SEND:
+		return &stats->el_send_stats;
+	case LNET_STATS_TYPE_RECV:
+		return &stats->el_recv_stats;
+	case LNET_STATS_TYPE_DROP:
+		return &stats->el_drop_stats;
+	default:
+		CERROR("Unknown stats type\n");
+	}
+
+	return NULL;
+}
+
+void lnet_incr_stats(struct lnet_element_stats *stats,
+		     enum lnet_msg_type msg_type,
+		     enum lnet_stats_type stats_type)
+{
+	struct lnet_comm_count *counts = get_stats_counts(stats, stats_type);
+
+	if (!counts)
+		return;
+
+	switch (msg_type) {
+	case LNET_MSG_ACK:
+		atomic_inc(&counts->co_ack_count);
+		break;
+	case LNET_MSG_PUT:
+		atomic_inc(&counts->co_put_count);
+		break;
+	case LNET_MSG_GET:
+		atomic_inc(&counts->co_get_count);
+		break;
+	case LNET_MSG_REPLY:
+		atomic_inc(&counts->co_reply_count);
+		break;
+	case LNET_MSG_HELLO:
+		atomic_inc(&counts->co_hello_count);
+		break;
+	default:
+		CERROR("There is a BUG in the code. Unknown message type\n");
+		break;
+	}
+}
+
+__u32 lnet_sum_stats(struct lnet_element_stats *stats,
+		     enum lnet_stats_type stats_type)
+{
+	struct lnet_comm_count *counts = get_stats_counts(stats, stats_type);
+
+	if (!counts)
+		return 0;
+
+	return (atomic_read(&counts->co_ack_count) +
+		atomic_read(&counts->co_put_count) +
+		atomic_read(&counts->co_get_count) +
+		atomic_read(&counts->co_reply_count) +
+		atomic_read(&counts->co_hello_count));
+}
+
+static inline void assign_stats(struct lnet_ioctl_comm_count *msg_stats,
+				struct lnet_comm_count *counts)
+{
+	msg_stats->ico_get_count = atomic_read(&counts->co_get_count);
+	msg_stats->ico_put_count = atomic_read(&counts->co_put_count);
+	msg_stats->ico_reply_count = atomic_read(&counts->co_reply_count);
+	msg_stats->ico_ack_count = atomic_read(&counts->co_ack_count);
+	msg_stats->ico_hello_count = atomic_read(&counts->co_hello_count);
+}
+
+void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
+			      struct lnet_element_stats *stats)
+{
+	struct lnet_comm_count *counts;
+
+	LASSERT(msg_stats);
+	LASSERT(stats);
+
+	counts = get_stats_counts(stats, LNET_STATS_TYPE_SEND);
+	if (!counts)
+		return;
+	assign_stats(&msg_stats->im_send_stats, counts);
+
+	counts = get_stats_counts(stats, LNET_STATS_TYPE_RECV);
+	if (!counts)
+		return;
+	assign_stats(&msg_stats->im_recv_stats, counts);
+
+	counts = get_stats_counts(stats, LNET_STATS_TYPE_DROP);
+	if (!counts)
+		return;
+	assign_stats(&msg_stats->im_drop_stats, counts);
+}
+
 int
 lnet_fail_nid(lnet_nid_t nid, unsigned int threshold)
 {
@@ -632,9 +730,13 @@ lnet_post_send_locked(struct lnet_msg *msg, int do_send)
 		the_lnet.ln_counters[cpt]->drop_length += msg->msg_len;
 		lnet_net_unlock(cpt);
 		if (msg->msg_txpeer)
-			atomic_inc(&msg->msg_txpeer->lpni_stats.drop_count);
+			lnet_incr_stats(&msg->msg_txpeer->lpni_stats,
+					msg->msg_type,
+					LNET_STATS_TYPE_DROP);
 		if (msg->msg_txni)
-			atomic_inc(&msg->msg_txni->ni_stats.drop_count);
+			lnet_incr_stats(&msg->msg_txni->ni_stats,
+					msg->msg_type,
+					LNET_STATS_TYPE_DROP);
 
 		CNETERR("Dropping message for %s: peer not alive\n",
 			libcfs_id2str(msg->msg_target));
@@ -1859,9 +1961,11 @@ lnet_send(lnet_nid_t src_nid, struct lnet_msg *msg, lnet_nid_t rtr_nid)
 }
 
 void
-lnet_drop_message(struct lnet_ni *ni, int cpt, void *private, unsigned int nob)
+lnet_drop_message(struct lnet_ni *ni, int cpt, void *private, unsigned int nob,
+		  __u32 msg_type)
 {
 	lnet_net_lock(cpt);
+	lnet_incr_stats(&ni->ni_stats, msg_type, LNET_STATS_TYPE_DROP);
 	the_lnet.ln_counters[cpt]->drop_count++;
 	the_lnet.ln_counters[cpt]->drop_length += nob;
 	lnet_net_unlock(cpt);
@@ -2510,7 +2614,7 @@ lnet_parse(struct lnet_ni *ni, struct lnet_hdr *hdr, lnet_nid_t from_nid,
 	lnet_finalize(msg, rc);
 
  drop:
-	lnet_drop_message(ni, cpt, private, payload_length);
+	lnet_drop_message(ni, cpt, private, payload_length, type);
 	return 0;
 }
 EXPORT_SYMBOL(lnet_parse);
@@ -2546,7 +2650,8 @@ lnet_drop_delayed_msg_list(struct list_head *head, char *reason)
 		 * until that's done
 		 */
 		lnet_drop_message(msg->msg_rxni, msg->msg_rx_cpt,
-				  msg->msg_private, msg->msg_len);
+				  msg->msg_private, msg->msg_len,
+				  msg->msg_type);
 		/*
 		 * NB: message will not generate event because w/o attached MD,
 		 * but we still should give error code so lnet_msg_decommit()
@@ -2786,6 +2891,7 @@ lnet_create_reply_msg(struct lnet_ni *ni, struct lnet_msg *getmsg)
 	cpt = lnet_cpt_of_nid(peer_id.nid, ni);
 
 	lnet_net_lock(cpt);
+	lnet_incr_stats(&ni->ni_stats, LNET_MSG_GET, LNET_STATS_TYPE_DROP);
 	the_lnet.ln_counters[cpt]->drop_count++;
 	the_lnet.ln_counters[cpt]->drop_length += getmd->md_length;
 	lnet_net_unlock(cpt);
diff --git a/drivers/staging/lustre/lnet/lnet/lib-msg.c b/drivers/staging/lustre/lnet/lnet/lib-msg.c
index db13d01d366f..7f58cfe25bc2 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-msg.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-msg.c
@@ -219,9 +219,13 @@ lnet_msg_decommit_tx(struct lnet_msg *msg, int status)
 
 incr_stats:
 	if (msg->msg_txpeer)
-		atomic_inc(&msg->msg_txpeer->lpni_stats.send_count);
+		lnet_incr_stats(&msg->msg_txpeer->lpni_stats,
+				msg->msg_type,
+				LNET_STATS_TYPE_SEND);
 	if (msg->msg_txni)
-		atomic_inc(&msg->msg_txni->ni_stats.send_count);
+		lnet_incr_stats(&msg->msg_txni->ni_stats,
+				msg->msg_type,
+				LNET_STATS_TYPE_SEND);
  out:
 	lnet_return_tx_credits_locked(msg);
 	msg->msg_tx_committed = 0;
@@ -280,9 +284,13 @@ lnet_msg_decommit_rx(struct lnet_msg *msg, int status)
 
 incr_stats:
 	if (msg->msg_rxpeer)
-		atomic_inc(&msg->msg_rxpeer->lpni_stats.recv_count);
+		lnet_incr_stats(&msg->msg_rxpeer->lpni_stats,
+				msg->msg_type,
+				LNET_STATS_TYPE_RECV);
 	if (msg->msg_rxni)
-		atomic_inc(&msg->msg_rxni->ni_stats.recv_count);
+		lnet_incr_stats(&msg->msg_rxni->ni_stats,
+				msg->msg_type,
+				LNET_STATS_TYPE_RECV);
 	if (ev->type == LNET_EVENT_PUT || ev->type == LNET_EVENT_REPLY)
 		counters->recv_length += msg->msg_wanted;
 
diff --git a/drivers/staging/lustre/lnet/lnet/net_fault.c b/drivers/staging/lustre/lnet/lnet/net_fault.c
index 3841bac1aa0a..e2c746855da9 100644
--- a/drivers/staging/lustre/lnet/lnet/net_fault.c
+++ b/drivers/staging/lustre/lnet/lnet/net_fault.c
@@ -632,7 +632,8 @@ delayed_msg_process(struct list_head *msg_list, bool drop)
 			}
 		}
 
-		lnet_drop_message(ni, cpt, msg->msg_private, msg->msg_len);
+		lnet_drop_message(ni, cpt, msg->msg_private, msg->msg_len,
+				  msg->msg_type);
 		lnet_finalize(msg, rc);
 	}
 }
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 95f72ae39a89..03c1c34517e4 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -3301,6 +3301,7 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
 		       void __user *bulk)
 {
 	struct lnet_ioctl_element_stats *lpni_stats;
+	struct lnet_ioctl_element_msg_stats *lpni_msg_stats;
 	struct lnet_peer_ni_credit_info *lpni_info;
 	struct lnet_peer_ni *lpni;
 	struct lnet_peer *lp;
@@ -3315,7 +3316,8 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
 		goto out;
 	}
 
-	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats);
+	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats)
+		+ sizeof(*lpni_msg_stats);
 	size *= lp->lp_nnis;
 	if (size > *sizep) {
 		*sizep = size;
@@ -3337,13 +3339,17 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
 	lpni_stats = kzalloc(sizeof(*lpni_stats), GFP_KERNEL);
 	if (!lpni_stats)
 		goto out_free_info;
+	lpni_msg_stats = kzalloc(sizeof(*lpni_msg_stats), GFP_KERNEL);
+	if (!lpni_msg_stats)
+		goto out_free_stats;
+
 
 	lpni = NULL;
 	rc = -EFAULT;
 	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
 		nid = lpni->lpni_nid;
 		if (copy_to_user(bulk, &nid, sizeof(nid)))
-			goto out_free_stats;
+			goto out_free_msg_stats;
 		bulk += sizeof(nid);
 
 		memset(lpni_info, 0, sizeof(*lpni_info));
@@ -3362,22 +3368,28 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
 		lpni_info->cr_peer_min_tx_credits = lpni->lpni_mintxcredits;
 		lpni_info->cr_peer_tx_qnob = lpni->lpni_txqnob;
 		if (copy_to_user(bulk, lpni_info, sizeof(*lpni_info)))
-			goto out_free_stats;
+			goto out_free_msg_stats;
 		bulk += sizeof(*lpni_info);
 
 		memset(lpni_stats, 0, sizeof(*lpni_stats));
 		lpni_stats->iel_send_count =
-			atomic_read(&lpni->lpni_stats.send_count);
+			lnet_sum_stats(&lpni->lpni_stats, LNET_STATS_TYPE_SEND);
 		lpni_stats->iel_recv_count =
-			atomic_read(&lpni->lpni_stats.recv_count);
+			lnet_sum_stats(&lpni->lpni_stats, LNET_STATS_TYPE_RECV);
 		lpni_stats->iel_drop_count =
-			atomic_read(&lpni->lpni_stats.drop_count);
+			lnet_sum_stats(&lpni->lpni_stats, LNET_STATS_TYPE_DROP);
 		if (copy_to_user(bulk, lpni_stats, sizeof(*lpni_stats)))
-			goto out_free_stats;
+			goto out_free_msg_stats;
 		bulk += sizeof(*lpni_stats);
+		lnet_usr_translate_stats(lpni_msg_stats, &lpni->lpni_stats);
+		if (copy_to_user(bulk, lpni_msg_stats, sizeof(*lpni_msg_stats)))
+			goto out_free_msg_stats;
+		bulk += sizeof(*lpni_msg_stats);
 	}
 	rc = 0;
 
+out_free_msg_stats:
+	kfree(lpni_msg_stats);
 out_free_stats:
 	kfree(lpni_stats);
 out_free_info:


* [lustre-devel] [PATCH 23/24] lustre: lnet: show peer state
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (19 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 20/24] lustre: lnet: add "lnetctl ping" command NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 23:52   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 14/24] lustre: lnet: reference counts on lnet_peer/lnet_peer_net NeilBrown
                   ` (3 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

It is important to be able to see the peer state when
debugging. This patch exports the peer state from the
kernel to user space; the state is shown when the detail
level requested in the "peer show" command is >= 3.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
Signed-off-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: https://review.whamcloud.com/26130
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |    4 +---
 drivers/staging/lustre/lnet/lnet/api-ni.c          |    6 +-----
 drivers/staging/lustre/lnet/lnet/peer.c            |   21 ++++++++++----------
 3 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index 91980f60a50d..fcfd844e0162 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -733,9 +733,7 @@ bool lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid);
 int lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid);
 int lnet_add_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
 int lnet_del_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid);
-int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nid,
-		       __u32 *nnis, bool *mr, __u32 *sizep,
-		       void __user *bulk);
+int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk);
 int lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
 			  char alivness[LNET_MAX_STR_LEN],
 			  __u32 *cpt_iter, __u32 *refcount,
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 0852118bf803..e2c86b8279e5 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -3166,11 +3166,7 @@ LNetCtl(unsigned int cmd, void *arg)
 			return -EINVAL;
 
 		mutex_lock(&the_lnet.ln_api_mutex);
-		rc = lnet_get_peer_info(&cfg->prcfg_prim_nid,
-					&cfg->prcfg_cfg_nid,
-					&cfg->prcfg_count,
-					&cfg->prcfg_mr,
-					&cfg->prcfg_size,
+		rc = lnet_get_peer_info(cfg,
 					(void __user *)cfg->prcfg_bulk);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 03c1c34517e4..5f61fca09f44 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -3296,9 +3296,7 @@ lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
 }
 
 /* ln_api_mutex is held, which keeps the peer list stable */
-int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
-		       __u32 *nnis, bool *mr, __u32 *sizep,
-		       void __user *bulk)
+int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 {
 	struct lnet_ioctl_element_stats *lpni_stats;
 	struct lnet_ioctl_element_msg_stats *lpni_msg_stats;
@@ -3309,7 +3307,7 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
 	__u32 size;
 	int rc;
 
-	lp = lnet_find_peer(*primary_nid);
+	lp = lnet_find_peer(cfg->prcfg_prim_nid);
 
 	if (!lp) {
 		rc = -ENOENT;
@@ -3319,17 +3317,18 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
 	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats)
 		+ sizeof(*lpni_msg_stats);
 	size *= lp->lp_nnis;
-	if (size > *sizep) {
-		*sizep = size;
+	if (size > cfg->prcfg_size) {
+		cfg->prcfg_size = size;
 		rc = -E2BIG;
 		goto out_lp_decref;
 	}
 
-	*primary_nid = lp->lp_primary_nid;
-	*mr = lnet_peer_is_multi_rail(lp);
-	*nidp = lp->lp_primary_nid;
-	*nnis = lp->lp_nnis;
-	*sizep = size;
+	cfg->prcfg_prim_nid = lp->lp_primary_nid;
+	cfg->prcfg_mr = lnet_peer_is_multi_rail(lp);
+	cfg->prcfg_cfg_nid = lp->lp_primary_nid;
+	cfg->prcfg_count = lp->lp_nnis;
+	cfg->prcfg_size = size;
+	cfg->prcfg_state = lp->lp_state;
 
 	/* Allocate helper buffers. */
 	rc = -ENOMEM;


* [lustre-devel] [PATCH 24/24] lustre: lnet: balance references in lnet_discover_peer_locked()
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (13 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 17/24] lustre: lnet: add the Push target NeilBrown
@ 2018-10-07 23:19 ` NeilBrown
  2018-10-14 23:53   ` James Simmons
  2018-10-14 23:54   ` James Simmons
  2018-10-07 23:19 ` [lustre-devel] [PATCH 19/24] lustre: lnet: add "lnetctl peer list" NeilBrown
                   ` (9 subsequent siblings)
  24 siblings, 2 replies; 57+ messages in thread
From: NeilBrown @ 2018-10-07 23:19 UTC (permalink / raw)
  To: lustre-devel

From: John L. Hammond <john.hammond@intel.com>

In lnet_discover_peer_locked() avoid a leaked reference to the peer in
the non-blocking discovery case.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9913
Signed-off-by: John L. Hammond <john.hammond@intel.com>
Reviewed-on: https://review.whamcloud.com/28695
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Quentin Bouget <quentin.bouget@cea.fr>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lnet/lnet/peer.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
index 5f61fca09f44..db36b5cf31e1 100644
--- a/drivers/staging/lustre/lnet/lnet/peer.c
+++ b/drivers/staging/lustre/lnet/lnet/peer.c
@@ -2010,7 +2010,6 @@ lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block)
 		if (lnet_peer_is_uptodate(lp))
 			break;
 		lnet_peer_queue_for_discovery(lp);
-		lnet_peer_addref_locked(lp);
 		/*
 		 * if caller requested a non-blocking operation then
 		 * return immediately. Once discovery is complete then the
@@ -2019,6 +2018,8 @@ lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block)
 		 */
 		if (!block)
 			break;
+
+		lnet_peer_addref_locked(lp);
 		lnet_net_unlock(LNET_LOCK_EX);
 		schedule();
 		finish_wait(&lp->lp_dc_waitq, &wait);


* [lustre-devel] [PATCH 01/24] lustre: lnet: add lnet_interfaces_max tunable
  2018-10-07 23:19 ` [lustre-devel] [PATCH 01/24] lustre: lnet: add lnet_interfaces_max tunable NeilBrown
@ 2018-10-14 19:08   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:08 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Add an lnet_interfaces_max tunable value, that describes the maximum
> number of interfaces per node. This tunable is primarily useful for
> sanity checks prior to allocating memory.
> 
> Allow lnet_interfaces_max to be set and get from the sysfs interface.
> 
> Add LNET_INTERFACES_MIN, value 16, as the minimum value.
> 
> Add LNET_INTERFACES_MAX_DEFAULT, value 200, as the default value. This
> value was chosen to ensure that the size of an LNet ping message with
> any associated LND overhead would fit in 4096 bytes.
> 
> (The LNET_INTERFACES_MAX name was not reused to allow for the early
> detection of issues when merging code that uses it.)
> 
> Rename LNET_NUM_INTERFACES to LNET_INTERFACES_NUM

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> Reviewed-on: https://review.whamcloud.com/25770
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-types.h  |    2 +
>  .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |    4 +--
>  .../lustre/include/uapi/linux/lnet/lnet-types.h    |    7 ++++
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |    2 +
>  .../staging/lustre/lnet/klnds/socklnd/socklnd.c    |   22 +++++++-------
>  .../staging/lustre/lnet/klnds/socklnd/socklnd.h    |    4 +--
>  .../staging/lustre/lnet/klnds/socklnd/socklnd_cb.c |    2 +
>  .../lustre/lnet/klnds/socklnd/socklnd_proto.c      |    4 +--
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   32 +++++++++++++++++++-
>  drivers/staging/lustre/lnet/lnet/config.c          |   10 +++---
>  10 files changed, 62 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index 7219a7bacf6e..7b11c31f0029 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -371,7 +371,7 @@ struct lnet_ni {
>  	 * equivalent interfaces to use
>  	 * This is an array because socklnd bonding can still be configured
>  	 */
> -	char			 *ni_interfaces[LNET_NUM_INTERFACES];
> +	char			 *ni_interfaces[LNET_INTERFACES_NUM];
>  	/* original net namespace */
>  	struct net		 *ni_net_ns;
>  };
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
> index 8f03aa3c5676..d88b30d2e76c 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
> @@ -81,7 +81,7 @@ struct lnet_ioctl_config_lnd_tunables {
>  };
>  
>  struct lnet_ioctl_net_config {
> -	char ni_interfaces[LNET_NUM_INTERFACES][LNET_MAX_STR_LEN];
> +	char ni_interfaces[LNET_INTERFACES_NUM][LNET_MAX_STR_LEN];
>  	__u32 ni_status;
>  	__u32 ni_cpts[LNET_MAX_SHOW_NUM_CPT];
>  	char cfg_bulk[0];
> @@ -184,7 +184,7 @@ struct lnet_ioctl_element_msg_stats {
>  struct lnet_ioctl_config_ni {
>  	struct libcfs_ioctl_hdr lic_cfg_hdr;
>  	lnet_nid_t		lic_nid;
> -	char			lic_ni_intf[LNET_NUM_INTERFACES][LNET_MAX_STR_LEN];
> +	char			lic_ni_intf[LNET_INTERFACES_NUM][LNET_MAX_STR_LEN];
>  	char			lic_legacy_ip2nets[LNET_MAX_STR_LEN];
>  	__u32			lic_cpts[LNET_MAX_SHOW_NUM_CPT];
>  	__u32			lic_ncpts;
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> index f8a873bab135..6ee60d07ff84 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> @@ -264,7 +264,12 @@ struct lnet_counters {
>  #define LNET_NI_STATUS_DOWN    0xdeadface
>  #define LNET_NI_STATUS_INVALID 0x00000000
>  
> -#define LNET_NUM_INTERFACES    16
> +#define LNET_INTERFACES_NUM		16
> +
> +/* The minimum number of interfaces per node supported by LNet. */
> +#define LNET_INTERFACES_MIN		16
> +/* The default - arbitrary - value of the lnet_max_interfaces tunable. */
> +#define LNET_INTERFACES_MAX_DEFAULT	200
>  
>  /**
>   * Objects maintained by the LNet are accessed through handles. Handle types
> diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> index c20766379323..bf969b3891a9 100644
> --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> @@ -2915,7 +2915,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
>  	if (ni->ni_interfaces[0]) {
>  		/* Use the IPoIB interface specified in 'networks=' */
>  
> -		BUILD_BUG_ON(LNET_NUM_INTERFACES <= 1);
> +		BUILD_BUG_ON(LNET_INTERFACES_NUM <= 1);
>  		if (ni->ni_interfaces[1]) {
>  			CERROR("Multiple interfaces not supported\n");
>  			goto failed;
> diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
> index b2f0148d0087..ff8d73295fff 100644
> --- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
> +++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
> @@ -53,7 +53,7 @@ ksocknal_ip2iface(struct lnet_ni *ni, __u32 ip)
>  	struct ksock_interface *iface;
>  
>  	for (i = 0; i < net->ksnn_ninterfaces; i++) {
> -		LASSERT(i < LNET_NUM_INTERFACES);
> +		LASSERT(i < LNET_INTERFACES_NUM);
>  		iface = &net->ksnn_interfaces[i];
>  
>  		if (iface->ksni_ipaddr == ip)
> @@ -221,7 +221,7 @@ ksocknal_unlink_peer_locked(struct ksock_peer *peer_ni)
>  	struct ksock_interface *iface;
>  
>  	for (i = 0; i < peer_ni->ksnp_n_passive_ips; i++) {
> -		LASSERT(i < LNET_NUM_INTERFACES);
> +		LASSERT(i < LNET_INTERFACES_NUM);
>  		ip = peer_ni->ksnp_passive_ips[i];
>  
>  		iface = ksocknal_ip2iface(peer_ni->ksnp_ni, ip);
> @@ -689,7 +689,7 @@ ksocknal_local_ipvec(struct lnet_ni *ni, __u32 *ipaddrs)
>  	read_lock(&ksocknal_data.ksnd_global_lock);
>  
>  	nip = net->ksnn_ninterfaces;
> -	LASSERT(nip <= LNET_NUM_INTERFACES);
> +	LASSERT(nip <= LNET_INTERFACES_NUM);
>  
>  	/*
>  	 * Only offer interfaces for additional connections if I have
> @@ -770,8 +770,8 @@ ksocknal_select_ips(struct ksock_peer *peer_ni, __u32 *peerips, int n_peerips)
>  	 */
>  	write_lock_bh(global_lock);
>  
> -	LASSERT(n_peerips <= LNET_NUM_INTERFACES);
> -	LASSERT(net->ksnn_ninterfaces <= LNET_NUM_INTERFACES);
> +	LASSERT(n_peerips <= LNET_INTERFACES_NUM);
> +	LASSERT(net->ksnn_ninterfaces <= LNET_INTERFACES_NUM);
>  
>  	/*
>  	 * Only match interfaces for additional connections
> @@ -890,7 +890,7 @@ ksocknal_create_routes(struct ksock_peer *peer_ni, int port,
>  		return;
>  	}
>  
> -	LASSERT(npeer_ipaddrs <= LNET_NUM_INTERFACES);
> +	LASSERT(npeer_ipaddrs <= LNET_INTERFACES_NUM);
>  
>  	for (i = 0; i < npeer_ipaddrs; i++) {
>  		if (newroute) {
> @@ -919,7 +919,7 @@ ksocknal_create_routes(struct ksock_peer *peer_ni, int port,
>  		best_nroutes = 0;
>  		best_netmatch = 0;
>  
> -		LASSERT(net->ksnn_ninterfaces <= LNET_NUM_INTERFACES);
> +		LASSERT(net->ksnn_ninterfaces <= LNET_INTERFACES_NUM);
>  
>  		/* Select interface to connect from */
>  		for (j = 0; j < net->ksnn_ninterfaces; j++) {
> @@ -1060,7 +1060,7 @@ ksocknal_create_conn(struct lnet_ni *ni, struct ksock_route *route,
>  	atomic_set(&conn->ksnc_tx_nob, 0);
>  
>  	hello = kvzalloc(offsetof(struct ksock_hello_msg,
> -				  kshm_ips[LNET_NUM_INTERFACES]),
> +				  kshm_ips[LNET_INTERFACES_NUM]),
>  			 GFP_KERNEL);
>  	if (!hello) {
>  		rc = -ENOMEM;
> @@ -1983,7 +1983,7 @@ ksocknal_add_interface(struct lnet_ni *ni, __u32 ipaddress, __u32 netmask)
>  	if (iface) {
>  		/* silently ignore dups */
>  		rc = 0;
> -	} else if (net->ksnn_ninterfaces == LNET_NUM_INTERFACES) {
> +	} else if (net->ksnn_ninterfaces == LNET_INTERFACES_NUM) {
>  		rc = -ENOSPC;
>  	} else {
>  		iface = &net->ksnn_interfaces[net->ksnn_ninterfaces++];
> @@ -2624,7 +2624,7 @@ ksocknal_enumerate_interfaces(struct ksock_net *net, char *iname)
>  			continue;
>  		}
>  
> -		if (j == LNET_NUM_INTERFACES) {
> +		if (j == LNET_INTERFACES_NUM) {
>  			CWARN("Ignoring interface %s (too many interfaces)\n",
>  			      name);
>  			continue;
> @@ -2812,7 +2812,7 @@ ksocknal_startup(struct lnet_ni *ni)
>  
>  		net->ksnn_ninterfaces = rc;
>  	} else {
> -		for (i = 0; i < LNET_NUM_INTERFACES; i++) {
> +		for (i = 0; i < LNET_INTERFACES_NUM; i++) {
>  			if (!ni->ni_interfaces[i])
>  				break;
>  			rc = ksocknal_enumerate_interfaces(net, ni->ni_interfaces[i]);
> diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.h b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.h
> index 82e3523f6463..297d1e5af1bd 100644
> --- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.h
> +++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.h
> @@ -173,7 +173,7 @@ struct ksock_net {
>  	int		  ksnn_npeers;		/* # peers */
>  	int		  ksnn_shutdown;	/* shutting down? */
>  	int		  ksnn_ninterfaces;	/* IP interfaces */
> -	struct ksock_interface ksnn_interfaces[LNET_NUM_INTERFACES];
> +	struct ksock_interface ksnn_interfaces[LNET_INTERFACES_NUM];
>  };
>  
>  /** connd timeout */
> @@ -441,7 +441,7 @@ struct ksock_peer {
>  	int                ksnp_n_passive_ips;  /* # of... */
>  
>  	/* preferred local interfaces */
> -	u32		   ksnp_passive_ips[LNET_NUM_INTERFACES];
> +	u32              ksnp_passive_ips[LNET_INTERFACES_NUM];
>  };
>  
>  struct ksock_connreq {
> diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c
> index dc9a12910a8d..c401896bf649 100644
> --- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c
> +++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c
> @@ -1579,7 +1579,7 @@ ksocknal_send_hello(struct lnet_ni *ni, struct ksock_conn *conn,
>  	/* CAVEAT EMPTOR: this byte flips 'ipaddrs' */
>  	struct ksock_net *net = (struct ksock_net *)ni->ni_data;
>  
> -	LASSERT(hello->kshm_nips <= LNET_NUM_INTERFACES);
> +	LASSERT(hello->kshm_nips <= LNET_INTERFACES_NUM);
>  
>  	/* rely on caller to hold a ref on socket so it wouldn't disappear */
>  	LASSERT(conn->ksnc_proto);
> diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_proto.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_proto.c
> index 10a2757895f3..54ec5d0a85c8 100644
> --- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_proto.c
> +++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_proto.c
> @@ -614,7 +614,7 @@ ksocknal_recv_hello_v1(struct ksock_conn *conn, struct ksock_hello_msg *hello,
>  	hello->kshm_nips            = le32_to_cpu(hdr->payload_length) /
>  						  sizeof(__u32);
>  
> -	if (hello->kshm_nips > LNET_NUM_INTERFACES) {
> +	if (hello->kshm_nips > LNET_INTERFACES_NUM) {
>  		CERROR("Bad nips %d from ip %pI4h\n",
>  		       hello->kshm_nips, &conn->ksnc_ipaddr);
>  		rc = -EPROTO;
> @@ -684,7 +684,7 @@ ksocknal_recv_hello_v2(struct ksock_conn *conn, struct ksock_hello_msg *hello,
>  		__swab32s(&hello->kshm_nips);
>  	}
>  
> -	if (hello->kshm_nips > LNET_NUM_INTERFACES) {
> +	if (hello->kshm_nips > LNET_INTERFACES_NUM) {
>  		CERROR("Bad nips %d from ip %pI4h\n",
>  		       hello->kshm_nips, &conn->ksnc_ipaddr);
>  		return -EPROTO;
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index b37abdedccaa..6a692d5c4608 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -34,6 +34,7 @@
>  #define DEBUG_SUBSYSTEM S_LNET
>  #include <linux/log2.h>
>  #include <linux/ktime.h>
> +#include <linux/moduleparam.h>
>  
>  #include <linux/lnet/lib-lnet.h>
>  #include <uapi/linux/lnet/lnet-dlc.h>
> @@ -70,6 +71,13 @@ module_param(lnet_numa_range, uint, 0444);
>  MODULE_PARM_DESC(lnet_numa_range,
>  		 "NUMA range to consider during Multi-Rail selection");
>  
> +static int lnet_interfaces_max = LNET_INTERFACES_MAX_DEFAULT;
> +static int intf_max_set(const char *val, const struct kernel_param *kp);
> +module_param_call(lnet_interfaces_max, intf_max_set, param_get_int,
> +		  &lnet_interfaces_max, 0644);
> +MODULE_PARM_DESC(lnet_interfaces_max,
> +		 "Maximum number of interfaces in a node.");
> +
>  /*
>   * This sequence number keeps track of how many times DLC was used to
>   * update the local NIs. It is incremented when a NI is added or
> @@ -82,6 +90,28 @@ static atomic_t lnet_dlc_seq_no = ATOMIC_INIT(0);
>  static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  		     struct lnet_process_id __user *ids, int n_ids);
>  
> +static int
> +intf_max_set(const char *val, const struct kernel_param *kp)
> +{
> +	int value, rc;
> +
> +	rc = kstrtoint(val, 0, &value);
> +	if (rc) {
> +		CERROR("Invalid module parameter value for 'lnet_interfaces_max'\n");
> +		return rc;
> +	}
> +
> +	if (value < LNET_INTERFACES_MIN) {
> +		CWARN("max interfaces provided are too small, setting to %d\n",
> +		      LNET_INTERFACES_MIN);
> +		value = LNET_INTERFACES_MIN;
> +	}
> +
> +	*(int *)kp->arg = value;
> +
> +	return 0;
> +}
> +
>  static char *
>  lnet_get_routes(void)
>  {
> @@ -2924,7 +2954,7 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  	infosz = offsetof(struct lnet_ping_info, pi_ni[n_ids]);
>  
>  	/* n_ids limit is arbitrary */
> -	if (n_ids <= 0 || n_ids > 20 || id.nid == LNET_NID_ANY)
> +	if (n_ids <= 0 || n_ids > lnet_interfaces_max || id.nid == LNET_NID_ANY)
>  		return -EINVAL;
>  
>  	if (id.pid == LNET_PID_ANY)
> diff --git a/drivers/staging/lustre/lnet/lnet/config.c b/drivers/staging/lustre/lnet/lnet/config.c
> index 3ea56c81ec0e..087d9a8a6b6a 100644
> --- a/drivers/staging/lustre/lnet/lnet/config.c
> +++ b/drivers/staging/lustre/lnet/lnet/config.c
> @@ -123,10 +123,10 @@ lnet_ni_unique_net(struct list_head *nilist, char *iface)
>  /* check that the NI is unique to the interfaces with in the same NI.
>   * This is only a consideration if use_tcp_bonding is set */
>  static bool
> -lnet_ni_unique_ni(char *iface_list[LNET_NUM_INTERFACES], char *iface)
> +lnet_ni_unique_ni(char *iface_list[LNET_INTERFACES_NUM], char *iface)
>  {
>  	int i;
> -	for (i = 0; i < LNET_NUM_INTERFACES; i++) {
> +	for (i = 0; i < LNET_INTERFACES_NUM; i++) {
>  		if (iface_list[i] &&
>  		    strncmp(iface_list[i], iface, strlen(iface)) == 0)
>  			return false;
> @@ -304,7 +304,7 @@ lnet_ni_free(struct lnet_ni *ni)
>  
>  	kfree(ni->ni_cpts);
>  
> -	for (i = 0; i < LNET_NUM_INTERFACES && ni->ni_interfaces[i]; i++)
> +	for (i = 0; i < LNET_INTERFACES_NUM && ni->ni_interfaces[i]; i++)
>  		kfree(ni->ni_interfaces[i]);
>  
>  	/* release reference to net namespace */
> @@ -397,11 +397,11 @@ lnet_ni_add_interface(struct lnet_ni *ni, char *iface)
>  	 * can free the tokens at the end of the function.
>  	 * The newly allocated ni_interfaces[] can be
>  	 * freed when freeing the NI */
> -	while (niface < LNET_NUM_INTERFACES &&
> +	while (niface < LNET_INTERFACES_NUM &&
>  	       ni->ni_interfaces[niface])
>  		niface++;
>  
> -	if (niface >= LNET_NUM_INTERFACES) {
> +	if (niface >= LNET_INTERFACES_NUM) {
>  		LCONSOLE_ERROR_MSG(0x115, "Too many interfaces "
>  				   "for net %s\n",
>  				   libcfs_net2str(LNET_NIDNET(ni->ni_nid)));
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 02/24] lustre: lnet: configure lnet_interfaces_max tunable from dlc
  2018-10-07 23:19 ` [lustre-devel] [PATCH 02/24] lustre: lnet: configure lnet_interfaces_max tunable from dlc NeilBrown
@ 2018-10-14 19:10   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:10 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Add the ability to configure lnet_interfaces_max from DLC.
> The configure and show of the numa range and max interfaces are
> combined under a single "global" YAML element when configuring
> via YAML.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25771
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |    6 +++---
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   16 ++++++++--------
>  2 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
> index d88b30d2e76c..706892ca7efb 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-dlc.h
> @@ -230,9 +230,9 @@ struct lnet_ioctl_peer_cfg {
>  	void __user *prcfg_bulk;
>  };
>  
> -struct lnet_ioctl_numa_range {
> -	struct libcfs_ioctl_hdr nr_hdr;
> -	__u32 nr_range;
> +struct lnet_ioctl_set_value {
> +	struct libcfs_ioctl_hdr sv_hdr;
> +	__u32 sv_value;
>  };
>  
>  struct lnet_ioctl_lnet_stats {
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index 6a692d5c4608..8b6400da2836 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -2708,24 +2708,24 @@ LNetCtl(unsigned int cmd, void *arg)
>  		return rc;
>  
>  	case IOC_LIBCFS_SET_NUMA_RANGE: {
> -		struct lnet_ioctl_numa_range *numa;
> +		struct lnet_ioctl_set_value *numa;
>  
>  		numa = arg;
> -		if (numa->nr_hdr.ioc_len != sizeof(*numa))
> +		if (numa->sv_hdr.ioc_len != sizeof(*numa))
>  			return -EINVAL;
> -		mutex_lock(&the_lnet.ln_api_mutex);
> -		lnet_numa_range = numa->nr_range;
> -		mutex_unlock(&the_lnet.ln_api_mutex);
> +		lnet_net_lock(LNET_LOCK_EX);
> +		lnet_numa_range = numa->sv_value;
> +		lnet_net_unlock(LNET_LOCK_EX);
>  		return 0;
>  	}
>  
>  	case IOC_LIBCFS_GET_NUMA_RANGE: {
> -		struct lnet_ioctl_numa_range *numa;
> +		struct lnet_ioctl_set_value *numa;
>  
>  		numa = arg;
> -		if (numa->nr_hdr.ioc_len != sizeof(*numa))
> +		if (numa->sv_hdr.ioc_len != sizeof(*numa))
>  			return -EINVAL;
> -		numa->nr_range = lnet_numa_range;
> +		numa->sv_value = lnet_numa_range;
>  		return 0;
>  	}
>  
> 
> 
> 

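The rename from lnet_ioctl_numa_range to the generic lnet_ioctl_set_value keeps the same ioctl handling pattern: validate the header length against the expected struct size before touching the payload. A simplified userspace sketch (struct names mirror lnet-dlc.h, but the libcfs header and copy_from_user handling are omitted):

```c
#include <assert.h>
#include <errno.h>

/* Sketch of the generalized set-value ioctl pattern: a mis-sized
 * request is rejected with -EINVAL before the payload is used. */
struct ioctl_hdr { unsigned int ioc_len; };

struct ioctl_set_value {
	struct ioctl_hdr sv_hdr;
	unsigned int sv_value;
};

static int handle_set_value(void *arg, unsigned int *target)
{
	struct ioctl_set_value *sv = arg;

	if (sv->sv_hdr.ioc_len != sizeof(*sv))
		return -EINVAL;   /* size mismatch: old/new ABI skew */
	*target = sv->sv_value;
	return 0;
}
```

Because the header carries the caller's idea of the struct size, this check is what catches a userspace tool built against the old lnet_ioctl_numa_range layout.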
^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 03/24] lustre: lnet: add struct lnet_ping_buffer
  2018-10-07 23:19 ` [lustre-devel] [PATCH 03/24] lustre: lnet: add struct lnet_ping_buffer NeilBrown
@ 2018-10-14 19:29   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:29 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> The Multi-Rail code will use the ping target buffer also as the
> source of data to push to other nodes. This means that there
> will be multiple MDs referencing the same buffer, and care must
> be taken to ensure that the buffer is not freed while any such
> reference remains.
> 
> Encapsulate the struct lnet_ping_info (aka lnet_ping_info_t) in
> a struct lnet_ping_buffer. This adds a reference count and the
> number of NIDs for which the encapsulated lnet_ping_info has
> been sized.
> 
> For sizing the buffer the constant LNET_PINGINFO_SIZE is replaced
> with LNET_PING_INFO_SIZE(NNIS).

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25773
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   22 +
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   40 ++
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |  345 +++++++++++---------
>  drivers/staging/lustre/lnet/lnet/router.c          |   94 +++--
>  4 files changed, 301 insertions(+), 200 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 16e64d83840d..2e2b5ed27116 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -634,7 +634,27 @@ int lnet_peer_buffer_credits(struct lnet_net *net);
>  int lnet_router_checker_start(void);
>  void lnet_router_checker_stop(void);
>  void lnet_router_ni_update_locked(struct lnet_peer_ni *gw, __u32 net);
> -void lnet_swap_pinginfo(struct lnet_ping_info *info);
> +void lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf);
> +
> +int lnet_ping_info_validate(struct lnet_ping_info *pinfo);
> +struct lnet_ping_buffer *lnet_ping_buffer_alloc(int nnis, gfp_t gfp);
> +void lnet_ping_buffer_free(struct lnet_ping_buffer *pbuf);
> +
> +static inline void lnet_ping_buffer_addref(struct lnet_ping_buffer *pbuf)
> +{
> +	atomic_inc(&pbuf->pb_refcnt);
> +}
> +
> +static inline void lnet_ping_buffer_decref(struct lnet_ping_buffer *pbuf)
> +{
> +	if (atomic_dec_and_test(&pbuf->pb_refcnt))
> +		lnet_ping_buffer_free(pbuf);
> +}
> +
> +static inline int lnet_ping_buffer_numref(struct lnet_ping_buffer *pbuf)
> +{
> +	return atomic_read(&pbuf->pb_refcnt);
> +}
>  
>  int lnet_parse_ip2nets(char **networksp, char *ip2nets);
>  int lnet_parse_routes(char *route_str, int *im_a_router);
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index 7b11c31f0029..ab8c6d66cdbf 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -387,12 +387,32 @@ struct lnet_ni {
>  #define LNET_PING_FEAT_NI_STATUS	BIT(1)	/* return NI status */
>  #define LNET_PING_FEAT_RTE_DISABLED	BIT(2)	/* Routing enabled */
>  
> -#define LNET_PING_FEAT_MASK		(LNET_PING_FEAT_BASE | \
> -					 LNET_PING_FEAT_NI_STATUS)
> +#define LNET_PING_INFO_SIZE(NNIDS) \
> +	offsetof(struct lnet_ping_info, pi_ni[NNIDS])
> +#define LNET_PING_INFO_LONI(PINFO)	((PINFO)->pi_ni[0].ns_nid)
> +#define LNET_PING_INFO_SEQNO(PINFO)	((PINFO)->pi_ni[0].ns_status)
> +
> +/*
> + * Descriptor of a ping info buffer: keep a separate indicator of the
> + * size and a reference count. The type is used both as a source and
> + * sink of data, so we need to keep some information outside of the
> + * area that may be overwritten by network data.
> + */
> +struct lnet_ping_buffer {
> +	int			pb_nnis;
> +	atomic_t		pb_refcnt;
> +	struct lnet_ping_info	pb_info;
> +};
> +
> +#define LNET_PING_BUFFER_SIZE(NNIDS) \
> +	offsetof(struct lnet_ping_buffer, pb_info.pi_ni[NNIDS])
> +#define LNET_PING_BUFFER_LONI(PBUF)	((PBUF)->pb_info.pi_ni[0].ns_nid)
> +#define LNET_PING_BUFFER_SEQNO(PBUF)	((PBUF)->pb_info.pi_ni[0].ns_status)
> +
>  
>  /* router checker data, per router */
> -#define LNET_MAX_RTR_NIS   16
> -#define LNET_PINGINFO_SIZE offsetof(struct lnet_ping_info, pi_ni[LNET_MAX_RTR_NIS])
> +#define LNET_MAX_RTR_NIS   LNET_INTERFACES_MIN
> +#define LNET_RTR_PINGINFO_SIZE	LNET_PING_INFO_SIZE(LNET_MAX_RTR_NIS)
>  struct lnet_rc_data {
>  	/* chain on the_lnet.ln_zombie_rcd or ln_deathrow_rcd */
>  	struct list_head	rcd_list;
> @@ -401,7 +421,7 @@ struct lnet_rc_data {
>  	/* reference to gateway */
>  	struct lnet_peer_ni	*rcd_gateway;
>  	/* ping buffer */
> -	struct lnet_ping_info	*rcd_pinginfo;
> +	struct lnet_ping_buffer	*rcd_pingbuffer;
>  };
>  
>  struct lnet_peer_ni {
> @@ -792,9 +812,17 @@ struct lnet {
>  	/* percpt router buffer pools */
>  	struct lnet_rtrbufpool		**ln_rtrpools;
>  
> +	/*
> +	 * Ping target / Push source
> +	 *
> +	 * The ping target and push source share a single buffer. The
> +	 * ln_ping_target is protected against concurrent updates by
> +	 * ln_api_mutex.
> +	 */
>  	struct lnet_handle_md		  ln_ping_target_md;
>  	struct lnet_handle_eq		  ln_ping_target_eq;
> -	struct lnet_ping_info		 *ln_ping_info;
> +	struct lnet_ping_buffer		 *ln_ping_target;
> +	atomic_t			ln_ping_target_seqno;
>  
>  	/* router checker startup/shutdown state */
>  	enum lnet_rc_state		  ln_rc_state;
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index 8b6400da2836..ca28ad75fe2b 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -902,25 +902,44 @@ lnet_count_acceptor_nets(void)
>  	return count;
>  }
>  
> -static struct lnet_ping_info *
> -lnet_ping_info_create(int num_ni)
> +struct lnet_ping_buffer *
> +lnet_ping_buffer_alloc(int nnis, gfp_t gfp)
>  {
> -	struct lnet_ping_info *ping_info;
> -	unsigned int infosz;
> +	struct lnet_ping_buffer *pbuf;
>  
> -	infosz = offsetof(struct lnet_ping_info, pi_ni[num_ni]);
> -	ping_info = kvzalloc(infosz, GFP_KERNEL);
> -	if (!ping_info) {
> -		CERROR("Can't allocate ping info[%d]\n", num_ni);
> +	pbuf = kmalloc(LNET_PING_BUFFER_SIZE(nnis), gfp);
> +	if (pbuf) {
> +		pbuf->pb_nnis = nnis;
> +		atomic_set(&pbuf->pb_refcnt, 1);
> +	}
> +
> +	return pbuf;
> +}
> +
> +void
> +lnet_ping_buffer_free(struct lnet_ping_buffer *pbuf)
> +{
> +	LASSERT(lnet_ping_buffer_numref(pbuf) == 0);
> +	kfree(pbuf);
> +}
> +
> +static struct lnet_ping_buffer *
> +lnet_ping_target_create(int nnis)
> +{
> +	struct lnet_ping_buffer *pbuf;
> +
> +	pbuf = lnet_ping_buffer_alloc(nnis, GFP_KERNEL);
> +	if (!pbuf) {
> +		CERROR("Can't allocate ping source [%d]\n", nnis);
>  		return NULL;
>  	}
>  
> -	ping_info->pi_nnis = num_ni;
> -	ping_info->pi_pid = the_lnet.ln_pid;
> -	ping_info->pi_magic = LNET_PROTO_PING_MAGIC;
> -	ping_info->pi_features = LNET_PING_FEAT_NI_STATUS;
> +	pbuf->pb_info.pi_nnis = nnis;
> +	pbuf->pb_info.pi_pid = the_lnet.ln_pid;
> +	pbuf->pb_info.pi_magic = LNET_PROTO_PING_MAGIC;
> +	pbuf->pb_info.pi_features = LNET_PING_FEAT_NI_STATUS;
>  
> -	return ping_info;
> +	return pbuf;
>  }
>  
>  static inline int
> @@ -966,14 +985,25 @@ lnet_get_ni_count(void)
>  	return count;
>  }
>  
> -static inline void
> -lnet_ping_info_free(struct lnet_ping_info *pinfo)
> +int
> +lnet_ping_info_validate(struct lnet_ping_info *pinfo)
>  {
> -	kvfree(pinfo);
> +	if (!pinfo)
> +		return -EINVAL;
> +	if (pinfo->pi_magic != LNET_PROTO_PING_MAGIC)
> +		return -EPROTO;
> +	if (!(pinfo->pi_features & LNET_PING_FEAT_NI_STATUS))
> +		return -EPROTO;
> +	/* Loopback is guaranteed to be present */
> +	if (pinfo->pi_nnis < 1 || pinfo->pi_nnis > lnet_interfaces_max)
> +		return -ERANGE;
> +	if (LNET_NETTYP(LNET_NIDNET(LNET_PING_INFO_LONI(pinfo))) != LOLND)
> +		return -EPROTO;
> +	return 0;
>  }
>  
>  static void
> -lnet_ping_info_destroy(void)
> +lnet_ping_target_destroy(void)
>  {
>  	struct lnet_net *net;
>  	struct lnet_ni *ni;
> @@ -988,25 +1018,25 @@ lnet_ping_info_destroy(void)
>  		}
>  	}
>  
> -	lnet_ping_info_free(the_lnet.ln_ping_info);
> -	the_lnet.ln_ping_info = NULL;
> +	lnet_ping_buffer_decref(the_lnet.ln_ping_target);
> +	the_lnet.ln_ping_target = NULL;
>  
>  	lnet_net_unlock(LNET_LOCK_EX);
>  }
>  
>  static void
> -lnet_ping_event_handler(struct lnet_event *event)
> +lnet_ping_target_event_handler(struct lnet_event *event)
>  {
> -	struct lnet_ping_info *pinfo = event->md.user_ptr;
> +	struct lnet_ping_buffer *pbuf = event->md.user_ptr;
>  
>  	if (event->unlinked)
> -		pinfo->pi_features = LNET_PING_FEAT_INVAL;
> +		lnet_ping_buffer_decref(pbuf);
>  }
>  
>  static int
> -lnet_ping_info_setup(struct lnet_ping_info **ppinfo,
> -		     struct lnet_handle_md *md_handle,
> -		     int ni_count, bool set_eq)
> +lnet_ping_target_setup(struct lnet_ping_buffer **ppbuf,
> +		       struct lnet_handle_md *ping_mdh,
> +		       int ni_count, bool set_eq)
>  {
>  	struct lnet_process_id id = { .nid = LNET_NID_ANY,
>  				      .pid = LNET_PID_ANY };
> @@ -1015,94 +1045,98 @@ lnet_ping_info_setup(struct lnet_ping_info **ppinfo,
>  	int rc, rc2;
>  
>  	if (set_eq) {
> -		rc = LNetEQAlloc(0, lnet_ping_event_handler,
> +		rc = LNetEQAlloc(0, lnet_ping_target_event_handler,
>  				 &the_lnet.ln_ping_target_eq);
>  		if (rc) {
> -			CERROR("Can't allocate ping EQ: %d\n", rc);
> +			CERROR("Can't allocate ping buffer EQ: %d\n", rc);
>  			return rc;
>  		}
>  	}
>  
> -	*ppinfo = lnet_ping_info_create(ni_count);
> -	if (!*ppinfo) {
> +	*ppbuf = lnet_ping_target_create(ni_count);
> +	if (!*ppbuf) {
>  		rc = -ENOMEM;
> -		goto failed_0;
> +		goto fail_free_eq;
>  	}
>  
> +	/* Ping target ME/MD */
>  	rc = LNetMEAttach(LNET_RESERVED_PORTAL, id,
>  			  LNET_PROTO_PING_MATCHBITS, 0,
>  			  LNET_UNLINK, LNET_INS_AFTER,
>  			  &me_handle);
>  	if (rc) {
> -		CERROR("Can't create ping ME: %d\n", rc);
> -		goto failed_1;
> +		CERROR("Can't create ping target ME: %d\n", rc);
> +		goto fail_decref_ping_buffer;
>  	}
>  
>  	/* initialize md content */
> -	md.start = *ppinfo;
> -	md.length = offsetof(struct lnet_ping_info,
> -			     pi_ni[(*ppinfo)->pi_nnis]);
> +	md.start = &(*ppbuf)->pb_info;
> +	md.length = LNET_PING_INFO_SIZE((*ppbuf)->pb_nnis);
>  	md.threshold = LNET_MD_THRESH_INF;
>  	md.max_size = 0;
>  	md.options = LNET_MD_OP_GET | LNET_MD_TRUNCATE |
>  		     LNET_MD_MANAGE_REMOTE;
> -	md.user_ptr  = NULL;
>  	md.eq_handle = the_lnet.ln_ping_target_eq;
> -	md.user_ptr = *ppinfo;
> +	md.user_ptr = *ppbuf;
>  
> -	rc = LNetMDAttach(me_handle, md, LNET_RETAIN, md_handle);
> +	rc = LNetMDAttach(me_handle, md, LNET_RETAIN, ping_mdh);
>  	if (rc) {
> -		CERROR("Can't attach ping MD: %d\n", rc);
> -		goto failed_2;
> +		CERROR("Can't attach ping target MD: %d\n", rc);
> +		goto fail_unlink_ping_me;
>  	}
> +	lnet_ping_buffer_addref(*ppbuf);
>  
>  	return 0;
>  
> -failed_2:
> +fail_unlink_ping_me:
>  	rc2 = LNetMEUnlink(me_handle);
>  	LASSERT(!rc2);
> -failed_1:
> -	lnet_ping_info_free(*ppinfo);
> -	*ppinfo = NULL;
> -failed_0:
> -	if (set_eq)
> -		LNetEQFree(the_lnet.ln_ping_target_eq);
> +fail_decref_ping_buffer:
> +	LASSERT(lnet_ping_buffer_numref(*ppbuf) == 1);
> +	lnet_ping_buffer_decref(*ppbuf);
> +	*ppbuf = NULL;
> +fail_free_eq:
> +	if (set_eq) {
> +		rc2 = LNetEQFree(the_lnet.ln_ping_target_eq);
> +		LASSERT(rc2 == 0);
> +	}
>  	return rc;
>  }
>  
>  static void
> -lnet_ping_md_unlink(struct lnet_ping_info *pinfo,
> -		    struct lnet_handle_md *md_handle)
> +lnet_ping_md_unlink(struct lnet_ping_buffer *pbuf,
> +		    struct lnet_handle_md *ping_mdh)
>  {
> -	LNetMDUnlink(*md_handle);
> -	LNetInvalidateMDHandle(md_handle);
> +	LNetMDUnlink(*ping_mdh);
> +	LNetInvalidateMDHandle(ping_mdh);
>  
> -	/* NB md could be busy; this just starts the unlink */
> -	while (pinfo->pi_features != LNET_PING_FEAT_INVAL) {
> -		CDEBUG(D_NET, "Still waiting for ping MD to unlink\n");
> +	/* NB the MD could be busy; this just starts the unlink */
> +	while (lnet_ping_buffer_numref(pbuf) > 1) {
> +		CDEBUG(D_NET, "Still waiting for ping data MD to unlink\n");
>  		schedule_timeout_idle(HZ);
>  	}
>  }
>  
>  static void
> -lnet_ping_info_install_locked(struct lnet_ping_info *ping_info)
> +lnet_ping_target_install_locked(struct lnet_ping_buffer *pbuf)
>  {
>  	struct lnet_ni_status *ns;
>  	struct lnet_ni *ni;
>  	struct lnet_net *net;
>  	int i = 0;
> +	int rc;
>  
>  	list_for_each_entry(net, &the_lnet.ln_nets, net_list) {
>  		list_for_each_entry(ni, &net->net_ni_list, ni_netlist) {
> -			LASSERT(i < ping_info->pi_nnis);
> +			LASSERT(i < pbuf->pb_nnis);
>  
> -			ns = &ping_info->pi_ni[i];
> +			ns = &pbuf->pb_info.pi_ni[i];
>  
>  			ns->ns_nid = ni->ni_nid;
>  
>  			lnet_ni_lock(ni);
>  			ns->ns_status = ni->ni_status ?
> -					ni->ni_status->ns_status :
> +					 ni->ni_status->ns_status :
>  						LNET_NI_STATUS_UP;
>  			ni->ni_status = ns;
>  			lnet_ni_unlock(ni);
> @@ -1110,35 +1144,47 @@ lnet_ping_info_install_locked(struct lnet_ping_info *ping_info)
>  			i++;
>  		}
>  	}
> +	/*
> +	 * We (ab)use the ns_status of the loopback interface to
> +	 * transmit the sequence number. The first interface listed
> +	 * must be the loopback interface.
> +	 */
> +	rc = lnet_ping_info_validate(&pbuf->pb_info);
> +	if (rc) {
> +		LCONSOLE_EMERG("Invalid ping target: %d\n", rc);
> +		LBUG();
> +	}
> +	LNET_PING_BUFFER_SEQNO(pbuf) =
> +		atomic_inc_return(&the_lnet.ln_ping_target_seqno);
>  }
>  
>  static void
> -lnet_ping_target_update(struct lnet_ping_info *pinfo,
> -			struct lnet_handle_md md_handle)
> +lnet_ping_target_update(struct lnet_ping_buffer *pbuf,
> +			struct lnet_handle_md ping_mdh)
>  {
> -	struct lnet_ping_info *old_pinfo = NULL;
> -	struct lnet_handle_md old_md;
> +	struct lnet_ping_buffer *old_pbuf = NULL;
> +	struct lnet_handle_md old_ping_md;
>  
>  	/* switch the NIs to point to the new ping info created */
>  	lnet_net_lock(LNET_LOCK_EX);
>  
>  	if (!the_lnet.ln_routing)
> -		pinfo->pi_features |= LNET_PING_FEAT_RTE_DISABLED;
> -	lnet_ping_info_install_locked(pinfo);
> +		pbuf->pb_info.pi_features |= LNET_PING_FEAT_RTE_DISABLED;
> +	lnet_ping_target_install_locked(pbuf);
>  
> -	if (the_lnet.ln_ping_info) {
> -		old_pinfo = the_lnet.ln_ping_info;
> -		old_md = the_lnet.ln_ping_target_md;
> +	if (the_lnet.ln_ping_target) {
> +		old_pbuf = the_lnet.ln_ping_target;
> +		old_ping_md = the_lnet.ln_ping_target_md;
>  	}
> -	the_lnet.ln_ping_target_md = md_handle;
> -	the_lnet.ln_ping_info = pinfo;
> +	the_lnet.ln_ping_target_md = ping_mdh;
> +	the_lnet.ln_ping_target = pbuf;
>  
>  	lnet_net_unlock(LNET_LOCK_EX);
>  
> -	if (old_pinfo) {
> -		/* unlink the old ping info */
> -		lnet_ping_md_unlink(old_pinfo, &old_md);
> -		lnet_ping_info_free(old_pinfo);
> +	if (old_pbuf) {
> +		/* unlink and free the old ping info */
> +		lnet_ping_md_unlink(old_pbuf, &old_ping_md);
> +		lnet_ping_buffer_decref(old_pbuf);
>  	}
>  }
>  
> @@ -1147,13 +1193,13 @@ lnet_ping_target_fini(void)
>  {
>  	int rc;
>  
> -	lnet_ping_md_unlink(the_lnet.ln_ping_info,
> +	lnet_ping_md_unlink(the_lnet.ln_ping_target,
>  			    &the_lnet.ln_ping_target_md);
>  
>  	rc = LNetEQFree(the_lnet.ln_ping_target_eq);
>  	LASSERT(!rc);
>  
> -	lnet_ping_info_destroy();
> +	lnet_ping_target_destroy();
>  }
>  
>  static int
> @@ -1745,8 +1791,8 @@ LNetNIInit(lnet_pid_t requested_pid)
>  	int im_a_router = 0;
>  	int rc;
>  	int ni_count;
> -	struct lnet_ping_info *pinfo;
> -	struct lnet_handle_md md_handle;
> +	struct lnet_ping_buffer *pbuf;
> +	struct lnet_handle_md ping_mdh;
>  	struct list_head net_head;
>  	struct lnet_net *net;
>  
> @@ -1823,11 +1869,11 @@ LNetNIInit(lnet_pid_t requested_pid)
>  	the_lnet.ln_refcount = 1;
>  	/* Now I may use my own API functions... */
>  
> -	rc = lnet_ping_info_setup(&pinfo, &md_handle, ni_count, true);
> +	rc = lnet_ping_target_setup(&pbuf, &ping_mdh, ni_count, true);
>  	if (rc)
>  		goto err_acceptor_stop;
>  
> -	lnet_ping_target_update(pinfo, md_handle);
> +	lnet_ping_target_update(pbuf, ping_mdh);
>  
>  	rc = lnet_router_checker_start();
>  	if (rc)
> @@ -1936,7 +1982,10 @@ lnet_fill_ni_info(struct lnet_ni *ni, struct lnet_ioctl_config_ni *cfg_ni,
>  	}
>  
>  	cfg_ni->lic_nid = ni->ni_nid;
> -	cfg_ni->lic_status = ni->ni_status->ns_status;
> +	if (LNET_NETTYP(LNET_NIDNET(ni->ni_nid)) == LOLND)
> +		cfg_ni->lic_status = LNET_NI_STATUS_UP;
> +	else
> +		cfg_ni->lic_status = ni->ni_status->ns_status;
>  	cfg_ni->lic_tcp_bonding = use_tcp_bonding;
>  	cfg_ni->lic_dev_cpt = ni->ni_dev_cpt;
>  
> @@ -2021,7 +2070,10 @@ lnet_fill_ni_info_legacy(struct lnet_ni *ni,
>  	config->cfg_config_u.cfg_net.net_peer_rtr_credits =
>  		ni->ni_net->net_tunables.lct_peer_rtr_credits;
>  
> -	net_config->ni_status = ni->ni_status->ns_status;
> +	if (LNET_NETTYP(LNET_NIDNET(ni->ni_nid)) == LOLND)
> +		net_config->ni_status = LNET_NI_STATUS_UP;
> +	else
> +		net_config->ni_status = ni->ni_status->ns_status;
>  
>  	if (ni->ni_cpts) {
>  		int num_cpts = min(ni->ni_ncpts, LNET_MAX_SHOW_NUM_CPT);
> @@ -2172,8 +2224,8 @@ static int lnet_add_net_common(struct lnet_net *net,
>  			       struct lnet_ioctl_config_lnd_tunables *tun)
>  {
>  	u32 net_id;
> -	struct lnet_ping_info *pinfo;
> -	struct lnet_handle_md md_handle;
> +	struct lnet_ping_buffer *pbuf;
> +	struct lnet_handle_md ping_mdh;
>  	int rc;
>  	struct lnet_remotenet *rnet;
>  	int net_ni_count;
> @@ -2195,7 +2247,7 @@ static int lnet_add_net_common(struct lnet_net *net,
>  
>  	/*
>  	 * make sure you calculate the correct number of slots in the ping
> -	 * info. Since the ping info is a flattened list of all the NIs,
> +	 * buffer. Since the ping info is a flattened list of all the NIs,
>  	 * we should allocate enough slots to accommodate the number of NIs
>  	 * which will be added.
>  	 *
> @@ -2204,9 +2256,9 @@ static int lnet_add_net_common(struct lnet_net *net,
>  	 */
>  	net_ni_count = lnet_get_net_ni_count_pre(net);
>  
> -	rc = lnet_ping_info_setup(&pinfo, &md_handle,
> -				  net_ni_count + lnet_get_ni_count(),
> -				  false);
> +	rc = lnet_ping_target_setup(&pbuf, &ping_mdh,
> +				    net_ni_count + lnet_get_ni_count(),
> +				    false);
>  	if (rc < 0) {
>  		lnet_net_free(net);
>  		return rc;
> @@ -2257,13 +2309,13 @@ static int lnet_add_net_common(struct lnet_net *net,
>  	lnet_peer_net_added(net);
>  	lnet_net_unlock(LNET_LOCK_EX);
>  
> -	lnet_ping_target_update(pinfo, md_handle);
> +	lnet_ping_target_update(pbuf, ping_mdh);
>  
>  	return 0;
>  
>  failed:
> -	lnet_ping_md_unlink(pinfo, &md_handle);
> -	lnet_ping_info_free(pinfo);
> +	lnet_ping_md_unlink(pbuf, &ping_mdh);
> +	lnet_ping_buffer_decref(pbuf);
>  	return rc;
>  }
>  
> @@ -2354,8 +2406,8 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
>  	struct lnet_net *net;
>  	struct lnet_ni *ni;
>  	u32 net_id = LNET_NIDNET(conf->lic_nid);
> -	struct lnet_ping_info *pinfo;
> -	struct lnet_handle_md md_handle;
> +	struct lnet_ping_buffer *pbuf;
> +	struct lnet_handle_md  ping_mdh;
>  	int rc;
>  	int net_count;
>  	u32 addr;
> @@ -2373,7 +2425,7 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
>  		CERROR("net %s not found\n",
>  		       libcfs_net2str(net_id));
>  		rc = -ENOENT;
> -		goto net_unlock;
> +		goto unlock_net;
>  	}
>  
>  	addr = LNET_NIDADDR(conf->lic_nid);
> @@ -2384,20 +2436,20 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
>  		lnet_net_unlock(0);
>  
>  		/* create and link a new ping info, before removing the old one */
> -		rc = lnet_ping_info_setup(&pinfo, &md_handle,
> -					  lnet_get_ni_count() - net_count,
> -					  false);
> +		rc = lnet_ping_target_setup(&pbuf, &ping_mdh,
> +					    lnet_get_ni_count() - net_count,
> +					    false);
>  		if (rc != 0)
> -			goto out;
> +			goto unlock_api_mutex;
>  
>  		lnet_shutdown_lndnet(net);
>  
>  		if (lnet_count_acceptor_nets() == 0)
>  			lnet_acceptor_stop();
>  
> -		lnet_ping_target_update(pinfo, md_handle);
> +		lnet_ping_target_update(pbuf, ping_mdh);
>  
> -		goto out;
> +		goto unlock_api_mutex;
>  	}
>  
>  	ni = lnet_nid2ni_locked(conf->lic_nid, 0);
> @@ -2405,7 +2457,7 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
>  		CERROR("nid %s not found\n",
>  		       libcfs_nid2str(conf->lic_nid));
>  		rc = -ENOENT;
> -		goto net_unlock;
> +		goto unlock_net;
>  	}
>  
>  	net_count = lnet_get_net_ni_count_locked(net);
> @@ -2413,27 +2465,27 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
>  	lnet_net_unlock(0);
>  
>  	/* create and link a new ping info, before removing the old one */
> -	rc = lnet_ping_info_setup(&pinfo, &md_handle,
> -				  lnet_get_ni_count() - 1, false);
> +	rc = lnet_ping_target_setup(&pbuf, &ping_mdh,
> +				    lnet_get_ni_count() - 1, false);
>  	if (rc != 0)
> -		goto out;
> +		goto unlock_api_mutex;
>  
>  	lnet_shutdown_lndni(ni);
>  
>  	if (lnet_count_acceptor_nets() == 0)
>  		lnet_acceptor_stop();
>  
> -	lnet_ping_target_update(pinfo, md_handle);
> +	lnet_ping_target_update(pbuf, ping_mdh);
>  
>  	/* check if the net is empty and remove it if it is */
>  	if (net_count == 1)
>  		lnet_shutdown_lndnet(net);
>  
> -	goto out;
> +	goto unlock_api_mutex;
>  
> -net_unlock:
> +unlock_net:
>  	lnet_net_unlock(0);
> -out:
> +unlock_api_mutex:
>  	mutex_unlock(&the_lnet.ln_api_mutex);
>  
>  	return rc;
> @@ -2501,8 +2553,8 @@ int
>  lnet_dyn_del_net(__u32 net_id)
>  {
>  	struct lnet_net *net;
> -	struct lnet_ping_info *pinfo;
> -	struct lnet_handle_md md_handle;
> +	struct lnet_ping_buffer *pbuf;
> +	struct lnet_handle_md ping_mdh;
>  	int rc;
>  	int net_ni_count;
>  
> @@ -2525,8 +2577,8 @@ lnet_dyn_del_net(__u32 net_id)
>  	lnet_net_unlock(0);
>  
>  	/* create and link a new ping info, before removing the old one */
> -	rc = lnet_ping_info_setup(&pinfo, &md_handle,
> -				  lnet_get_ni_count() - net_ni_count, false);
> +	rc = lnet_ping_target_setup(&pbuf, &ping_mdh,
> +				    lnet_get_ni_count() - net_ni_count, false);
>  	if (rc)
>  		goto out;
>  
> @@ -2535,7 +2587,7 @@ lnet_dyn_del_net(__u32 net_id)
>  	if (!lnet_count_acceptor_nets())
>  		lnet_acceptor_stop();
>  
> -	lnet_ping_target_update(pinfo, md_handle);
> +	lnet_ping_target_update(pbuf, ping_mdh);
>  
>  out:
>  	mutex_unlock(&the_lnet.ln_api_mutex);
> @@ -2943,16 +2995,13 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  	int unlinked = 0;
>  	int replied = 0;
>  	const signed long a_long_time = 60*HZ;
> -	int infosz;
> -	struct lnet_ping_info *info;
> +	struct lnet_ping_buffer *pbuf;
>  	struct lnet_process_id tmpid;
>  	int i;
>  	int nob;
>  	int rc;
>  	int rc2;
>  
> -	infosz = offsetof(struct lnet_ping_info, pi_ni[n_ids]);
> -
>  	/* n_ids limit is arbitrary */
>  	if (n_ids <= 0 || n_ids > lnet_interfaces_max || id.nid == LNET_NID_ANY)
>  		return -EINVAL;
> @@ -2960,20 +3009,20 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  	if (id.pid == LNET_PID_ANY)
>  		id.pid = LNET_PID_LUSTRE;
>  
> -	info = kzalloc(infosz, GFP_KERNEL);
> -	if (!info)
> +	pbuf = lnet_ping_buffer_alloc(n_ids, GFP_NOFS);
> +	if (!pbuf)
>  		return -ENOMEM;
>  
>  	/* NB 2 events max (including any unlink event) */
>  	rc = LNetEQAlloc(2, LNET_EQ_HANDLER_NONE, &eqh);
>  	if (rc) {
>  		CERROR("Can't allocate EQ: %d\n", rc);
> -		goto out_0;
> +		goto fail_ping_buffer_decref;
>  	}
>  
>  	/* initialize md content */
> -	md.start     = info;
> -	md.length    = infosz;
> +	md.start     = &pbuf->pb_info;
> +	md.length    = LNET_PING_INFO_SIZE(n_ids);
>  	md.threshold = 2; /*GET/REPLY*/
>  	md.max_size  = 0;
>  	md.options   = LNET_MD_TRUNCATE;
> @@ -2983,7 +3032,7 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  	rc = LNetMDBind(md, LNET_UNLINK, &mdh);
>  	if (rc) {
>  		CERROR("Can't bind MD: %d\n", rc);
> -		goto out_1;
> +		goto fail_free_eq;
>  	}
>  
>  	rc = LNetGet(LNET_NID_ANY, mdh, id,
> @@ -3044,11 +3093,11 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  			CWARN("%s: Unexpected rc >= 0 but no reply!\n",
>  			      libcfs_id2str(id));
>  		rc = -EIO;
> -		goto out_1;
> +		goto fail_free_eq;
>  	}
>  
>  	nob = rc;
> -	LASSERT(nob >= 0 && nob <= infosz);
> +	LASSERT(nob >= 0 && nob <= LNET_PING_INFO_SIZE(n_ids));
>  
>  	rc = -EPROTO;			   /* if I can't parse... */
>  
> @@ -3056,56 +3105,56 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  		/* can't check magic/version */
>  		CERROR("%s: ping info too short %d\n",
>  		       libcfs_id2str(id), nob);
> -		goto out_1;
> +		goto fail_free_eq;
>  	}
>  
> -	if (info->pi_magic == __swab32(LNET_PROTO_PING_MAGIC)) {
> -		lnet_swap_pinginfo(info);
> -	} else if (info->pi_magic != LNET_PROTO_PING_MAGIC) {
> +	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC)) {
> +		lnet_swap_pinginfo(pbuf);
> +	} else if (pbuf->pb_info.pi_magic != LNET_PROTO_PING_MAGIC) {
>  		CERROR("%s: Unexpected magic %08x\n",
> -		       libcfs_id2str(id), info->pi_magic);
> -		goto out_1;
> +		       libcfs_id2str(id), pbuf->pb_info.pi_magic);
> +		goto fail_free_eq;
>  	}
>  
> -	if (!(info->pi_features & LNET_PING_FEAT_NI_STATUS)) {
> +	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_NI_STATUS)) {
>  		CERROR("%s: ping w/o NI status: 0x%x\n",
> -		       libcfs_id2str(id), info->pi_features);
> -		goto out_1;
> +		       libcfs_id2str(id), pbuf->pb_info.pi_features);
> +		goto fail_free_eq;
>  	}
>  
> -	if (nob < offsetof(struct lnet_ping_info, pi_ni[0])) {
> +	if (nob < LNET_PING_INFO_SIZE(0)) {
>  		CERROR("%s: Short reply %d(%d min)\n", libcfs_id2str(id),
> -		       nob, (int)offsetof(struct lnet_ping_info, pi_ni[0]));
> -		goto out_1;
> +		       nob, (int)LNET_PING_INFO_SIZE(0));
> +		goto fail_free_eq;
>  	}
>  
> -	if (info->pi_nnis < n_ids)
> -		n_ids = info->pi_nnis;
> +	if (pbuf->pb_info.pi_nnis < n_ids)
> +		n_ids = pbuf->pb_info.pi_nnis;
>  
> -	if (nob < offsetof(struct lnet_ping_info, pi_ni[n_ids])) {
> +	if (nob < LNET_PING_INFO_SIZE(n_ids)) {
>  		CERROR("%s: Short reply %d(%d expected)\n", libcfs_id2str(id),
> -		       nob, (int)offsetof(struct lnet_ping_info, pi_ni[n_ids]));
> -		goto out_1;
> +		       nob, (int)LNET_PING_INFO_SIZE(n_ids));
> +		goto fail_free_eq;
>  	}
>  
>  	rc = -EFAULT;			   /* If I SEGV... */
>  
>  	memset(&tmpid, 0, sizeof(tmpid));
>  	for (i = 0; i < n_ids; i++) {
> -		tmpid.pid = info->pi_pid;
> -		tmpid.nid = info->pi_ni[i].ns_nid;
> +		tmpid.pid = pbuf->pb_info.pi_pid;
> +		tmpid.nid = pbuf->pb_info.pi_ni[i].ns_nid;
>  		if (copy_to_user(&ids[i], &tmpid, sizeof(tmpid)))
> -			goto out_1;
> +			goto fail_free_eq;
>  	}
> -	rc = info->pi_nnis;
> +	rc = pbuf->pb_info.pi_nnis;
>  
> - out_1:
> + fail_free_eq:
>  	rc2 = LNetEQFree(eqh);
>  	if (rc2)
>  		CERROR("rc2 %d\n", rc2);
>  	LASSERT(!rc2);
>  
> - out_0:
> -	kfree(info);
> + fail_ping_buffer_decref:
> +	lnet_ping_buffer_decref(pbuf);
>  	return rc;
>  }
> diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
> index b31a383fe974..e97957ce9252 100644
> --- a/drivers/staging/lustre/lnet/lnet/router.c
> +++ b/drivers/staging/lustre/lnet/lnet/router.c
> @@ -618,17 +618,21 @@ lnet_get_route(int idx, __u32 *net, __u32 *hops,
>  }
>  
>  void
> -lnet_swap_pinginfo(struct lnet_ping_info *info)
> +lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf)
>  {
> -	int i;
>  	struct lnet_ni_status *stat;
> +	int nnis;
> +	int i;
>  
> -	__swab32s(&info->pi_magic);
> -	__swab32s(&info->pi_features);
> -	__swab32s(&info->pi_pid);
> -	__swab32s(&info->pi_nnis);
> -	for (i = 0; i < info->pi_nnis && i < LNET_MAX_RTR_NIS; i++) {
> -		stat = &info->pi_ni[i];
> +	__swab32s(&pbuf->pb_info.pi_magic);
> +	__swab32s(&pbuf->pb_info.pi_features);
> +	__swab32s(&pbuf->pb_info.pi_pid);
> +	__swab32s(&pbuf->pb_info.pi_nnis);
> +	nnis = pbuf->pb_info.pi_nnis;
> +	if (nnis > pbuf->pb_nnis)
> +		nnis = pbuf->pb_nnis;
> +	for (i = 0; i < nnis; i++) {
> +		stat = &pbuf->pb_info.pi_ni[i];
>  		__swab64s(&stat->ns_nid);
>  		__swab32s(&stat->ns_status);
>  	}
> @@ -641,11 +645,12 @@ lnet_swap_pinginfo(struct lnet_ping_info *info)
>  static void
>  lnet_parse_rc_info(struct lnet_rc_data *rcd)
>  {
> -	struct lnet_ping_info *info = rcd->rcd_pinginfo;
> +	struct lnet_ping_buffer *pbuf = rcd->rcd_pingbuffer;
>  	struct lnet_peer_ni *gw = rcd->rcd_gateway;
>  	struct lnet_route *rte;
> +	int			nnis;
>  
> -	if (!gw->lpni_alive)
> +	if (!gw->lpni_alive || !pbuf)
>  		return;
>  
>  	/*
> @@ -654,51 +659,48 @@ lnet_parse_rc_info(struct lnet_rc_data *rcd)
>  	 */
>  	spin_lock(&gw->lpni_lock);
>  
> -	if (info->pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
> -		lnet_swap_pinginfo(info);
> +	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
> +		lnet_swap_pinginfo(pbuf);
>  
>  	/* NB always racing with network! */
> -	if (info->pi_magic != LNET_PROTO_PING_MAGIC) {
> +	if (pbuf->pb_info.pi_magic != LNET_PROTO_PING_MAGIC) {
>  		CDEBUG(D_NET, "%s: Unexpected magic %08x\n",
> -		       libcfs_nid2str(gw->lpni_nid), info->pi_magic);
> +		       libcfs_nid2str(gw->lpni_nid), pbuf->pb_info.pi_magic);
>  		gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
> -		spin_unlock(&gw->lpni_lock);
> -		return;
> +		goto out;
>  	}
>  
> -	gw->lpni_ping_feats = info->pi_features;
> -	if (!(gw->lpni_ping_feats & LNET_PING_FEAT_MASK)) {
> -		CDEBUG(D_NET, "%s: Unexpected features 0x%x\n",
> -		       libcfs_nid2str(gw->lpni_nid), gw->lpni_ping_feats);
> -		spin_unlock(&gw->lpni_lock);
> -		return; /* nothing I can understand */
> -	}
> +	gw->lpni_ping_feats = pbuf->pb_info.pi_features;
>  
> -	if (!(gw->lpni_ping_feats & LNET_PING_FEAT_NI_STATUS)) {
> -		spin_unlock(&gw->lpni_lock);
> -		return; /* can't carry NI status info */
> -	}
> +	/* Without NI status info there's nothing more to do. */
> +	if (!(gw->lpni_ping_feats & LNET_PING_FEAT_NI_STATUS))
> +		goto out;
> +
> +	/* Determine the number of NIs for which there is data. */
> +	nnis = pbuf->pb_info.pi_nnis;
> +	if (pbuf->pb_nnis < nnis)
> +		nnis = pbuf->pb_nnis;
>  
>  	list_for_each_entry(rte, &gw->lpni_routes, lr_gwlist) {
>  		int down = 0;
>  		int up = 0;
>  		int i;
>  
> +		/* If routing disabled then the route is down. */
>  		if (gw->lpni_ping_feats & LNET_PING_FEAT_RTE_DISABLED) {
>  			rte->lr_downis = 1;
>  			continue;
>  		}
>  
> -		for (i = 0; i < info->pi_nnis && i < LNET_MAX_RTR_NIS; i++) {
> -			struct lnet_ni_status *stat = &info->pi_ni[i];
> +		for (i = 0; i < nnis; i++) {
> +			struct lnet_ni_status *stat = &pbuf->pb_info.pi_ni[i];
>  			lnet_nid_t nid = stat->ns_nid;
>  
>  			if (nid == LNET_NID_ANY) {
>  				CDEBUG(D_NET, "%s: unexpected LNET_NID_ANY\n",
>  				       libcfs_nid2str(gw->lpni_nid));
>  				gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
> -				spin_unlock(&gw->lpni_lock);
> -				return;
> +				goto out;
>  			}
>  
>  			if (LNET_NETTYP(LNET_NIDNET(nid)) == LOLND)
> @@ -720,8 +722,7 @@ lnet_parse_rc_info(struct lnet_rc_data *rcd)
>  			CDEBUG(D_NET, "%s: Unexpected status 0x%x\n",
>  			       libcfs_nid2str(gw->lpni_nid), stat->ns_status);
>  			gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
> -			spin_unlock(&gw->lpni_lock);
> -			return;
> +			goto out;
>  		}
>  
>  		if (up) { /* ignore downed NIs if NI for dest network is up */
> @@ -737,7 +738,7 @@ lnet_parse_rc_info(struct lnet_rc_data *rcd)
>  
>  		rte->lr_downis = down;
>  	}
> -
> +out:
>  	spin_unlock(&gw->lpni_lock);
>  }
>  
> @@ -903,7 +904,8 @@ lnet_destroy_rc_data(struct lnet_rc_data *rcd)
>  		lnet_net_unlock(cpt);
>  	}
>  
> -	kfree(rcd->rcd_pinginfo);
> +	if (rcd->rcd_pingbuffer)
> +		lnet_ping_buffer_decref(rcd->rcd_pingbuffer);
>  
>  	kfree(rcd);
>  }
> @@ -912,7 +914,7 @@ static struct lnet_rc_data *
>  lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
>  {
>  	struct lnet_rc_data *rcd = NULL;
> -	struct lnet_ping_info *pi;
> +	struct lnet_ping_buffer *pbuf;
>  	struct lnet_md md;
>  	int rc;
>  	int i;
> @@ -926,19 +928,19 @@ lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
>  	LNetInvalidateMDHandle(&rcd->rcd_mdh);
>  	INIT_LIST_HEAD(&rcd->rcd_list);
>  
> -	pi = kzalloc(LNET_PINGINFO_SIZE, GFP_NOFS);
> -	if (!pi)
> +	pbuf = lnet_ping_buffer_alloc(LNET_MAX_RTR_NIS, GFP_NOFS);
> +	if (!pbuf)
>  		goto out;
>  
>  	for (i = 0; i < LNET_MAX_RTR_NIS; i++) {
> -		pi->pi_ni[i].ns_nid = LNET_NID_ANY;
> -		pi->pi_ni[i].ns_status = LNET_NI_STATUS_INVALID;
> +		pbuf->pb_info.pi_ni[i].ns_nid = LNET_NID_ANY;
> +		pbuf->pb_info.pi_ni[i].ns_status = LNET_NI_STATUS_INVALID;
>  	}
> -	rcd->rcd_pinginfo = pi;
> +	rcd->rcd_pingbuffer = pbuf;
>  
> -	md.start = pi;
> +	md.start = &pbuf->pb_info;
>  	md.user_ptr = rcd;
> -	md.length = LNET_PINGINFO_SIZE;
> +	md.length = LNET_RTR_PINGINFO_SIZE;
>  	md.threshold = LNET_MD_THRESH_INF;
>  	md.options = LNET_MD_TRUNCATE;
>  	md.eq_handle = the_lnet.ln_rc_eqh;
> @@ -1714,7 +1716,8 @@ lnet_rtrpools_enable(void)
>  	lnet_net_lock(LNET_LOCK_EX);
>  	the_lnet.ln_routing = 1;
>  
> -	the_lnet.ln_ping_info->pi_features &= ~LNET_PING_FEAT_RTE_DISABLED;
> +	the_lnet.ln_ping_target->pb_info.pi_features &=
> +		~LNET_PING_FEAT_RTE_DISABLED;
>  	lnet_net_unlock(LNET_LOCK_EX);
>  
>  	return rc;
> @@ -1728,7 +1731,8 @@ lnet_rtrpools_disable(void)
>  
>  	lnet_net_lock(LNET_LOCK_EX);
>  	the_lnet.ln_routing = 0;
> -	the_lnet.ln_ping_info->pi_features |= LNET_PING_FEAT_RTE_DISABLED;
> +	the_lnet.ln_ping_target->pb_info.pi_features |=
> +		LNET_PING_FEAT_RTE_DISABLED;
>  
>  	tiny_router_buffers = 0;
>  	small_router_buffers = 0;
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread
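The patch above replaces the bare lnet_ping_info with a refcounted lnet_ping_buffer whose pi_ni[] tail is sized on demand via LNET_PING_INFO_SIZE(). A minimal userspace sketch of that layout follows; the struct fields are simplified, the refcounting is non-atomic, and the helpers merely mirror the names in the diff rather than reproducing the kernel implementation:

```c
/* Sketch of the lnet_ping_buffer pattern: a refcounted wrapper around a
 * ping-info struct whose pi_ni[] array is a flexible array member, so
 * LNET_PING_INFO_SIZE(nnis) sizes the whole payload.  Layouts simplified. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct lnet_ni_status {
	uint64_t ns_nid;
	uint32_t ns_status;
};

struct lnet_ping_info {
	uint32_t pi_magic;
	uint32_t pi_features;
	uint32_t pi_pid;
	uint32_t pi_nnis;
	struct lnet_ni_status pi_ni[];	/* flexible array member */
};

#define LNET_PING_INFO_SIZE(nnis) \
	(sizeof(struct lnet_ping_info) + (nnis) * sizeof(struct lnet_ni_status))

struct lnet_ping_buffer {
	int pb_nnis;			/* capacity of pb_info.pi_ni[] */
	int pb_refcnt;			/* kernel uses an atomic refcount */
	struct lnet_ping_info pb_info;	/* must be last: flexible tail */
};

static struct lnet_ping_buffer *lnet_ping_buffer_alloc(int nnis)
{
	struct lnet_ping_buffer *pbuf;

	/* One allocation covers the header plus nnis status slots. */
	pbuf = calloc(1, offsetof(struct lnet_ping_buffer, pb_info) +
			 LNET_PING_INFO_SIZE(nnis));
	if (pbuf) {
		pbuf->pb_nnis = nnis;
		pbuf->pb_refcnt = 1;
	}
	return pbuf;
}

static void lnet_ping_buffer_decref(struct lnet_ping_buffer *pbuf)
{
	if (--pbuf->pb_refcnt == 0)
		free(pbuf);
}
```

LNET_PING_INFO_SIZE(nnis) is the byte count of a ping payload carrying nnis interface entries, which is what md.length is set to in the diff when binding the MD.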

* [lustre-devel] [PATCH 04/24] lustre: lnet: automatic sizing of router pinger buffers
  2018-10-07 23:19 ` [lustre-devel] [PATCH 04/24] lustre: lnet: automatic sizing of router pinger buffers NeilBrown
@ 2018-10-14 19:32   ` James Simmons
  2018-10-14 19:33   ` James Simmons
  1 sibling, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:32 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> The router pinger uses fixed-size buffers to receive the data
> returned by a ping. When a router has more than 16 interfaces
> (including loopback) this means the data for some interfaces
> is dropped.
> 
> Detect this situation, and track the number of remote NIs in
> the lnet_rc_data_t structure.  lnet_create_rc_data_locked()
> becomes lnet_update_rc_data_locked(), and is modified to replace
> an existing ping buffer if one is present. It is now also
> called by lnet_ping_router_locked() when the existing ping
> buffer is too small.

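The clamp-and-remember policy described above reduces to a few lines: parse only as many NI entries as the buffer can hold, and record the larger advertised count so the next update allocates a big enough buffer. This is an illustrative sketch, not driver code; the field names mirror the patch:

```c
#include <assert.h>

/* pi_nnis is what the router advertised, pb_nnis is the capacity of the
 * current ping buffer, rcd_nnis is the size to allocate next time. */
struct rc_sizes {
	int pi_nnis;
	int pb_nnis;
	int rcd_nnis;
};

/* Returns how many NI status entries may safely be parsed now. */
static int clamp_and_remember(struct rc_sizes *s)
{
	int nnis = s->pi_nnis;

	if (s->pb_nnis < nnis) {
		if (s->rcd_nnis < nnis)
			s->rcd_nnis = nnis;	/* grow on the next update */
		nnis = s->pb_nnis;		/* parse only what we have */
	}
	return nnis;
}
```

lnet_ping_router_locked() then compares rcd_nnis against the current buffer's pb_nnis and calls lnet_update_rc_data_locked() when a larger buffer is needed.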
Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25774
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-types.h  |    4 -
>  drivers/staging/lustre/lnet/lnet/router.c          |   90 +++++++++++++-------
>  2 files changed, 60 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index ab8c6d66cdbf..d1d17ededd06 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -411,8 +411,6 @@ struct lnet_ping_buffer {
>  
>  
>  /* router checker data, per router */
> -#define LNET_MAX_RTR_NIS   LNET_INTERFACES_MIN
> -#define LNET_RTR_PINGINFO_SIZE	LNET_PING_INFO_SIZE(LNET_MAX_RTR_NIS)
>  struct lnet_rc_data {
>  	/* chain on the_lnet.ln_zombie_rcd or ln_deathrow_rcd */
>  	struct list_head	rcd_list;
> @@ -422,6 +420,8 @@ struct lnet_rc_data {
>  	struct lnet_peer_ni	*rcd_gateway;
>  	/* ping buffer */
>  	struct lnet_ping_buffer	*rcd_pingbuffer;
> +	/* desired size of buffer */
> +	int			rcd_nnis;
>  };
>  
>  struct lnet_peer_ni {
> diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
> index e97957ce9252..86cce27e10d8 100644
> --- a/drivers/staging/lustre/lnet/lnet/router.c
> +++ b/drivers/staging/lustre/lnet/lnet/router.c
> @@ -678,8 +678,11 @@ lnet_parse_rc_info(struct lnet_rc_data *rcd)
>  
>  	/* Determine the number of NIs for which there is data. */
>  	nnis = pbuf->pb_info.pi_nnis;
> -	if (pbuf->pb_nnis < nnis)
> +	if (pbuf->pb_nnis < nnis) {
> +		if (rcd->rcd_nnis < nnis)
> +			rcd->rcd_nnis = nnis;
>  		nnis = pbuf->pb_nnis;
> +	}
>  
>  	list_for_each_entry(rte, &gw->lpni_routes, lr_gwlist) {
>  		int down = 0;
> @@ -911,28 +914,47 @@ lnet_destroy_rc_data(struct lnet_rc_data *rcd)
>  }
>  
>  static struct lnet_rc_data *
> -lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
> +lnet_update_rc_data_locked(struct lnet_peer_ni *gateway)
>  {
> -	struct lnet_rc_data *rcd = NULL;
> -	struct lnet_ping_buffer *pbuf;
> +	struct lnet_handle_md mdh;
> +	struct lnet_rc_data *rcd;
> +	struct lnet_ping_buffer *pbuf = NULL;
>  	struct lnet_md md;
> +	int nnis = LNET_INTERFACES_MIN;
>  	int rc;
>  	int i;
>  
> +	rcd = gateway->lpni_rcd;
> +	if (rcd) {
> +		nnis = rcd->rcd_nnis;
> +		mdh = rcd->rcd_mdh;
> +		LNetInvalidateMDHandle(&rcd->rcd_mdh);
> +		pbuf = rcd->rcd_pingbuffer;
> +		rcd->rcd_pingbuffer = NULL;
> +	} else {
> +		LNetInvalidateMDHandle(&mdh);
> +	}
> +
>  	lnet_net_unlock(gateway->lpni_cpt);
>  
> -	rcd = kzalloc(sizeof(*rcd), GFP_NOFS);
> -	if (!rcd)
> -		goto out;
> +	if (rcd) {
> +		LNetMDUnlink(mdh);
> +		lnet_ping_buffer_decref(pbuf);
> +	} else {
> +		rcd = kzalloc(sizeof(*rcd), GFP_NOFS);
> +		if (!rcd)
> +			goto out;
>  
> -	LNetInvalidateMDHandle(&rcd->rcd_mdh);
> -	INIT_LIST_HEAD(&rcd->rcd_list);
> +		LNetInvalidateMDHandle(&rcd->rcd_mdh);
> +		INIT_LIST_HEAD(&rcd->rcd_list);
> +		rcd->rcd_nnis = nnis;
> +	}
>  
> -	pbuf = lnet_ping_buffer_alloc(LNET_MAX_RTR_NIS, GFP_NOFS);
> +	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
>  	if (!pbuf)
>  		goto out;
>  
> -	for (i = 0; i < LNET_MAX_RTR_NIS; i++) {
> +	for (i = 0; i < nnis; i++) {
>  		pbuf->pb_info.pi_ni[i].ns_nid = LNET_NID_ANY;
>  		pbuf->pb_info.pi_ni[i].ns_status = LNET_NI_STATUS_INVALID;
>  	}
> @@ -940,7 +962,7 @@ lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
>  
>  	md.start = &pbuf->pb_info;
>  	md.user_ptr = rcd;
> -	md.length = LNET_RTR_PINGINFO_SIZE;
> +	md.length = LNET_PING_INFO_SIZE(nnis);
>  	md.threshold = LNET_MD_THRESH_INF;
>  	md.options = LNET_MD_TRUNCATE;
>  	md.eq_handle = the_lnet.ln_rc_eqh;
> @@ -949,33 +971,37 @@ lnet_create_rc_data_locked(struct lnet_peer_ni *gateway)
>  	rc = LNetMDBind(md, LNET_UNLINK, &rcd->rcd_mdh);
>  	if (rc < 0) {
>  		CERROR("Can't bind MD: %d\n", rc);
> -		goto out;
> +		goto out_ping_buffer_decref;
>  	}
>  	LASSERT(!rc);
>  
>  	lnet_net_lock(gateway->lpni_cpt);
> -	/* router table changed or someone has created rcd for this gateway */
> -	if (!lnet_isrouter(gateway) || gateway->lpni_rcd) {
> -		lnet_net_unlock(gateway->lpni_cpt);
> -		goto out;
> +	/* Check if this is still a router. */
> +	if (!lnet_isrouter(gateway))
> +		goto out_unlock;
> +	/* Check if someone else installed router data. */
> +	if (gateway->lpni_rcd && gateway->lpni_rcd != rcd)
> +		goto out_unlock;
> +
> +	/* Install and/or update the router data. */
> +	if (!gateway->lpni_rcd) {
> +		lnet_peer_ni_addref_locked(gateway);
> +		rcd->rcd_gateway = gateway;
> +		gateway->lpni_rcd = rcd;
>  	}
> -
> -	lnet_peer_ni_addref_locked(gateway);
> -	rcd->rcd_gateway = gateway;
> -	gateway->lpni_rcd = rcd;
>  	gateway->lpni_ping_notsent = 0;
>  
>  	return rcd;
>  
> - out:
> -	if (rcd) {
> -		if (!LNetMDHandleIsInvalid(rcd->rcd_mdh)) {
> -			rc = LNetMDUnlink(rcd->rcd_mdh);
> -			LASSERT(!rc);
> -		}
> +out_unlock:
> +	lnet_net_unlock(gateway->lpni_cpt);
> +	rc = LNetMDUnlink(mdh);
> +	LASSERT(!rc);
> +out_ping_buffer_decref:
> +	lnet_ping_buffer_decref(pbuf);
> +out:
> +	if (rcd && rcd != gateway->lpni_rcd)
>  		lnet_destroy_rc_data(rcd);
> -	}
> -
>  	lnet_net_lock(gateway->lpni_cpt);
>  	return gateway->lpni_rcd;
>  }
> @@ -1018,9 +1044,9 @@ lnet_ping_router_locked(struct lnet_peer_ni *rtr)
>  		return;
>  	}
>  
> -	rcd = rtr->lpni_rcd ?
> -	      rtr->lpni_rcd : lnet_create_rc_data_locked(rtr);
> -
> +	rcd = rtr->lpni_rcd;
> +	if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis)
> +		rcd = lnet_update_rc_data_locked(rtr);
>  	if (!rcd)
>  		return;
>  
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 04/24] lustre: lnet: automatic sizing of router pinger buffers
  2018-10-07 23:19 ` [lustre-devel] [PATCH 04/24] lustre: lnet: automatic sizing of router pinger buffers NeilBrown
  2018-10-14 19:32   ` James Simmons
@ 2018-10-14 19:33   ` James Simmons
  1 sibling, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:33 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> The router pinger uses fixed-size buffers to receive the data
> returned by a ping. When a router has more than 16 interfaces
> (including loopback) this means the data for some interfaces
> is dropped.
> 
> Detect this situation, and track the number of remote NIs in
> the lnet_rc_data_t structure.  lnet_create_rc_data_locked()
> becomes lnet_update_rc_data_locked(), and is modified to replace
> an existing ping buffer if one is present. It is now also
> called by lnet_ping_router_locked() when the existing ping
> buffer is too small.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 05/24] lustre: lnet: add Multi-Rail and Discovery ping feature bits
  2018-10-07 23:19 ` [lustre-devel] [PATCH 05/24] lustre: lnet: add Multi-Rail and Discovery ping feature bits NeilBrown
@ 2018-10-14 19:34   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:34 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Claim ping feature bits for Multi-Rail and Discovery.
> 
> Assert in lnet_ping_target_update() that no unknown bits will
> be sent over the wire.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25775
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   16 ++++++++++++++++
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |    5 +++++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index d1d17ededd06..f4467a3bbfd1 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -386,6 +386,22 @@ struct lnet_ni {
>  #define LNET_PING_FEAT_BASE		BIT(0)	/* just a ping */
>  #define LNET_PING_FEAT_NI_STATUS	BIT(1)	/* return NI status */
>  #define LNET_PING_FEAT_RTE_DISABLED	BIT(2)	/* Routing enabled */
> +#define LNET_PING_FEAT_MULTI_RAIL	BIT(3)	/* Multi-Rail aware */
> +#define LNET_PING_FEAT_DISCOVERY	BIT(4)	/* Supports Discovery */
> +
> +/*
> + * All ping feature bits fit to hit the wire.
> + * In lnet_assert_wire_constants() this is compared against its open-coded
> + * value, and in lnet_ping_target_update() it is used to verify that no
> + * unknown bits have been set.
> + * New feature bits can be added, just be aware that this does change the
> + * over-the-wire protocol.
> + */
> +#define LNET_PING_FEAT_BITS		(LNET_PING_FEAT_BASE | \
> +					 LNET_PING_FEAT_NI_STATUS | \
> +					 LNET_PING_FEAT_RTE_DISABLED | \
> +					 LNET_PING_FEAT_MULTI_RAIL | \
> +					 LNET_PING_FEAT_DISCOVERY)
>  
>  #define LNET_PING_INFO_SIZE(NNIDS) \
>  	offsetof(struct lnet_ping_info, pi_ni[NNIDS])
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index ca28ad75fe2b..68af723bc6a1 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -1170,6 +1170,11 @@ lnet_ping_target_update(struct lnet_ping_buffer *pbuf,
>  
>  	if (!the_lnet.ln_routing)
>  		pbuf->pb_info.pi_features |= LNET_PING_FEAT_RTE_DISABLED;
> +
> +	/* Ensure only known feature bits have been set. */
> +	LASSERT(pbuf->pb_info.pi_features & LNET_PING_FEAT_BITS);
> +	LASSERT(!(pbuf->pb_info.pi_features & ~LNET_PING_FEAT_BITS));
> +
>  	lnet_ping_target_install_locked(pbuf);
>  
>  	if (the_lnet.ln_ping_target) {
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 06/24] lustre: lnet: add sanity checks on ping-related constants
  2018-10-07 23:19 ` [lustre-devel] [PATCH 06/24] lustre: lnet: add sanity checks on ping-related constants NeilBrown
@ 2018-10-14 19:36   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:36 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Add sanity checks for LNet ping related data structures and
> constants to wirecheck.c, and update the generated code in
> lnet_assert_wire_constants().
> 
> In order for the structures and macros to be visible to
> wirecheck.c, which is a userspace program, they were moved
> from kernel-only lnet/lib-types.h to lnet/types.h.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25776
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   30 ----------------
>  .../lustre/include/uapi/linux/lnet/lnet-types.h    |   30 ++++++++++++++++
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   38 ++++++++++++++++++++
>  3 files changed, 68 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index f4467a3bbfd1..f28fa5342914 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -378,36 +378,6 @@ struct lnet_ni {
>  
>  #define LNET_PROTO_PING_MATCHBITS	0x8000000000000000LL
>  
> -/*
> - * NB: value of these features equal to LNET_PROTO_PING_VERSION_x
> - * of old LNet, so there shouldn't be any compatibility issue
> - */
> -#define LNET_PING_FEAT_INVAL		(0)		/* no feature */
> -#define LNET_PING_FEAT_BASE		BIT(0)	/* just a ping */
> -#define LNET_PING_FEAT_NI_STATUS	BIT(1)	/* return NI status */
> -#define LNET_PING_FEAT_RTE_DISABLED	BIT(2)	/* Routing enabled */
> -#define LNET_PING_FEAT_MULTI_RAIL	BIT(3)	/* Multi-Rail aware */
> -#define LNET_PING_FEAT_DISCOVERY	BIT(4)	/* Supports Discovery */
> -
> -/*
> - * All ping feature bits fit to hit the wire.
> - * In lnet_assert_wire_constants() this is compared against its open-coded
> - * value, and in lnet_ping_target_update() it is used to verify that no
> - * unknown bits have been set.
> - * New feature bits can be added, just be aware that this does change the
> - * over-the-wire protocol.
> - */
> -#define LNET_PING_FEAT_BITS		(LNET_PING_FEAT_BASE | \
> -					 LNET_PING_FEAT_NI_STATUS | \
> -					 LNET_PING_FEAT_RTE_DISABLED | \
> -					 LNET_PING_FEAT_MULTI_RAIL | \
> -					 LNET_PING_FEAT_DISCOVERY)
> -
> -#define LNET_PING_INFO_SIZE(NNIDS) \
> -	offsetof(struct lnet_ping_info, pi_ni[NNIDS])
> -#define LNET_PING_INFO_LONI(PINFO)	((PINFO)->pi_ni[0].ns_nid)
> -#define LNET_PING_INFO_SEQNO(PINFO)	((PINFO)->pi_ni[0].ns_status)
> -
>  /*
>   * Descriptor of a ping info buffer: keep a separate indicator of the
>   * size and a reference count. The type is used both as a source and
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> index 6ee60d07ff84..e0e4fd259795 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> @@ -190,6 +190,31 @@ struct lnet_hdr {
>  	} msg;
>  } __packed;
>  
> +/*
> + * NB: value of these features equal to LNET_PROTO_PING_VERSION_x
> + * of old LNet, so there shouldn't be any compatibility issue
> + */
> +#define LNET_PING_FEAT_INVAL		(0)		/* no feature */
> +#define LNET_PING_FEAT_BASE		(1 << 0)	/* just a ping */
> +#define LNET_PING_FEAT_NI_STATUS	(1 << 1)	/* return NI status */
> +#define LNET_PING_FEAT_RTE_DISABLED	(1 << 2)	/* Routing enabled */
> +#define LNET_PING_FEAT_MULTI_RAIL	(1 << 3)	/* Multi-Rail aware */
> +#define LNET_PING_FEAT_DISCOVERY	(1 << 4)	/* Supports Discovery */
> +
> +/*
> + * All ping feature bits fit to hit the wire.
> + * In lnet_assert_wire_constants() this is compared against its open-coded
> + * value, and in lnet_ping_target_update() it is used to verify that no
> + * unknown bits have been set.
> + * New feature bits can be added, just be aware that this does change the
> + * over-the-wire protocol.
> + */
> +#define LNET_PING_FEAT_BITS		(LNET_PING_FEAT_BASE | \
> +					 LNET_PING_FEAT_NI_STATUS | \
> +					 LNET_PING_FEAT_RTE_DISABLED | \
> +					 LNET_PING_FEAT_MULTI_RAIL | \
> +					 LNET_PING_FEAT_DISCOVERY)
> +
>  /*
>   * A HELLO message contains a magic number and protocol version
>   * code in the header's dest_nid, the peer's NID in the src_nid, and
> @@ -246,6 +271,11 @@ struct lnet_ping_info {
>  	struct lnet_ni_status	pi_ni[0];
>  } __packed;
>  
> +#define LNET_PING_INFO_SIZE(NNIDS) \
> +	offsetof(struct lnet_ping_info, pi_ni[NNIDS])
> +#define LNET_PING_INFO_LONI(PINFO)	((PINFO)->pi_ni[0].ns_nid)
> +#define LNET_PING_INFO_SEQNO(PINFO)	((PINFO)->pi_ni[0].ns_status)
> +
>  struct lnet_counters {
>  	__u32	msgs_alloc;
>  	__u32	msgs_max;
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index 68af723bc6a1..d81501f4c282 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -313,6 +313,44 @@ static void lnet_assert_wire_constants(void)
>  	BUILD_BUG_ON((int)sizeof(((struct lnet_hdr *)0)->msg.hello.incarnation) != 8);
>  	BUILD_BUG_ON((int)offsetof(struct lnet_hdr, msg.hello.type) != 40);
>  	BUILD_BUG_ON((int)sizeof(((struct lnet_hdr *)0)->msg.hello.type) != 4);
> +
> +	/* Checks for struct lnet_ni_status and related constants */
> +	BUILD_BUG_ON(LNET_NI_STATUS_INVALID != 0x00000000);
> +	BUILD_BUG_ON(LNET_NI_STATUS_UP != 0x15aac0de);
> +	BUILD_BUG_ON(LNET_NI_STATUS_DOWN != 0xdeadface);
> +
> +	/* Checks for struct lnet_ni_status */
> +	BUILD_BUG_ON((int)sizeof(struct lnet_ni_status) != 16);
> +	BUILD_BUG_ON((int)offsetof(struct lnet_ni_status, ns_nid) != 0);
> +	BUILD_BUG_ON((int)sizeof(((struct lnet_ni_status *)0)->ns_nid) != 8);
> +	BUILD_BUG_ON((int)offsetof(struct lnet_ni_status, ns_status) != 8);
> +	BUILD_BUG_ON((int)sizeof(((struct lnet_ni_status *)0)->ns_status) != 4);
> +	BUILD_BUG_ON((int)offsetof(struct lnet_ni_status, ns_unused) != 12);
> +	BUILD_BUG_ON((int)sizeof(((struct lnet_ni_status *)0)->ns_unused) != 4);
> +
> +	/* Checks for struct lnet_ping_info and related constants */
> +	BUILD_BUG_ON(LNET_PROTO_PING_MAGIC != 0x70696E67);
> +	BUILD_BUG_ON(LNET_PING_FEAT_INVAL != 0);
> +	BUILD_BUG_ON(LNET_PING_FEAT_BASE != 1);
> +	BUILD_BUG_ON(LNET_PING_FEAT_NI_STATUS != 2);
> +	BUILD_BUG_ON(LNET_PING_FEAT_RTE_DISABLED != 4);
> +	BUILD_BUG_ON(LNET_PING_FEAT_MULTI_RAIL != 8);
> +	BUILD_BUG_ON(LNET_PING_FEAT_DISCOVERY != 16);
> +	BUILD_BUG_ON(LNET_PING_FEAT_BITS != 31);
> +
> +	/* Checks for struct lnet_ping_info */
> +	BUILD_BUG_ON((int)sizeof(struct lnet_ping_info) != 16);
> +	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_magic) != 0);
> +	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_magic) != 4);
> +	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_features) != 4);
> +	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_features)
> +		     != 4);
> +	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_pid) != 8);
> +	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_pid) != 4);
> +	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_nnis) != 12);
> +	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_nnis) != 4);
> +	BUILD_BUG_ON((int)offsetof(struct lnet_ping_info, pi_ni) != 16);
> +	BUILD_BUG_ON((int)sizeof(((struct lnet_ping_info *)0)->pi_ni) != 0);
>  }
>  
>  static struct lnet_lnd *
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 07/24] lustre: lnet: cleanup of lnet_peer_ni_addref/decref_locked()
  2018-10-07 23:19 ` [lustre-devel] [PATCH 07/24] lustre: lnet: cleanup of lnet_peer_ni_addref/decref_locked() NeilBrown
@ 2018-10-14 19:38   ` James Simmons
  2018-10-14 19:39   ` James Simmons
  1 sibling, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:38 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Address style issues in lnet_peer_ni_addref_locked() and
> lnet_peer_ni_decref_locked(). In the latter routine, replace
> a sequence of atomic_dec()/atomic_read() with atomic_dec_and_test().

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25777
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 2e2b5ed27116..f15f5c9c9a25 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -323,8 +323,7 @@ static inline void
>  lnet_peer_ni_decref_locked(struct lnet_peer_ni *lp)
>  {
>  	LASSERT(atomic_read(&lp->lpni_refcount) > 0);
> -	atomic_dec(&lp->lpni_refcount);
> -	if (atomic_read(&lp->lpni_refcount) == 0)
> +	if (atomic_dec_and_test(&lp->lpni_refcount))
>  		lnet_destroy_peer_ni_locked(lp);
>  }
>  
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread


* [lustre-devel] [PATCH 08/24] lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer()
  2018-10-07 23:19 ` [lustre-devel] [PATCH 08/24] lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer() NeilBrown
@ 2018-10-14 19:55   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:55 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Rename lnet_add_peer_ni_to_peer() to lnet_add_peer_ni(), and
> lnet_del_peer_ni_from_peer() to lnet_del_peer_ni().  This brings
> the function names closer to the ioctls they implement:
> IOCTL_LIBCFS_ADD_PEER_NI and IOCTL_LIBCFS_DEL_PEER_NI. These
> names are also a more accurate description of their effect: adding
> an lnet_peer_ni to LNet or deleting one from it.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25778
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    4 ++--
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   10 +++++----
>  drivers/staging/lustre/lnet/lnet/peer.c            |   22 +++++++++++++++-----
>  3 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index f15f5c9c9a25..69f45a76f1cc 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -682,8 +682,8 @@ struct lnet_peer_net *lnet_peer_get_net_locked(struct lnet_peer *peer,
>  					       u32 net_id);
>  bool lnet_peer_is_ni_pref_locked(struct lnet_peer_ni *lpni,
>  				 struct lnet_ni *ni);
> -int lnet_add_peer_ni_to_peer(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
> -int lnet_del_peer_ni_from_peer(lnet_nid_t key_nid, lnet_nid_t nid);
> +int lnet_add_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
> +int lnet_del_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid);
>  int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
>  		       bool *mr,
>  		       struct lnet_peer_ni_credit_info __user *peer_ni_info,
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index d81501f4c282..d64ae2939abc 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -2848,9 +2848,9 @@ LNetCtl(unsigned int cmd, void *arg)
>  			return -EINVAL;
>  
>  		mutex_lock(&the_lnet.ln_api_mutex);
> -		rc = lnet_add_peer_ni_to_peer(cfg->prcfg_prim_nid,
> -					      cfg->prcfg_cfg_nid,
> -					      cfg->prcfg_mr);
> +		rc = lnet_add_peer_ni(cfg->prcfg_prim_nid,
> +				      cfg->prcfg_cfg_nid,
> +				      cfg->prcfg_mr);
>  		mutex_unlock(&the_lnet.ln_api_mutex);
>  		return rc;
>  	}
> @@ -2862,8 +2862,8 @@ LNetCtl(unsigned int cmd, void *arg)
>  			return -EINVAL;
>  
>  		mutex_lock(&the_lnet.ln_api_mutex);
> -		rc = lnet_del_peer_ni_from_peer(cfg->prcfg_prim_nid,
> -						cfg->prcfg_cfg_nid);
> +		rc = lnet_del_peer_ni(cfg->prcfg_prim_nid,
> +				      cfg->prcfg_cfg_nid);
>  		mutex_unlock(&the_lnet.ln_api_mutex);
>  		return rc;
>  	}
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index ebb84356302f..bbf07008dbb0 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -891,14 +891,16 @@ lnet_peer_ni_add_non_mr(lnet_nid_t nid)
>  }
>  
>  /*
> + * Implementation of IOC_LIBCFS_ADD_PEER_NI.
> + *
>   * This API handles the following combinations:
> - *	Create a primary NI if only the prim_nid is provided
> - *	Create or add an lpni to a primary NI. Primary NI must've already
> - *	been created
> - *	Create a non-MR peer.
> + *   Create a primary NI if only the prim_nid is provided
> + *   Create or add an lpni to a primary NI. Primary NI must've already
> + *   been created
> + *   Create a non-MR peer.
>   */
>  int
> -lnet_add_peer_ni_to_peer(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
> +lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
>  {
>  	/*
>  	 * Caller trying to setup an MR like peer hierarchy but
> @@ -929,8 +931,16 @@ lnet_add_peer_ni_to_peer(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
>  	return 0;
>  }
>  
> +/*
> + * Implementation of IOC_LIBCFS_DEL_PEER_NI.
> + *
> + * This API handles the following combinations:
> + *   Delete a NI from a peer if both prim_nid and nid are provided.
> + *   Delete a peer if only prim_nid is provided.
> + *   Delete a peer if its primary nid is provided.
> + */
>  int
> -lnet_del_peer_ni_from_peer(lnet_nid_t prim_nid, lnet_nid_t nid)
> +lnet_del_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid)
>  {
>  	lnet_nid_t local_nid;
>  	struct lnet_peer *peer;
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 09/24] lustre: lnet: refactor lnet_del_peer_ni()
  2018-10-07 23:19 ` [lustre-devel] [PATCH 09/24] lustre: lnet: refactor lnet_del_peer_ni() NeilBrown
@ 2018-10-14 19:58   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 19:58 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Refactor lnet_del_peer_ni(). In particular break out the code
> that removes an lnet_peer_ni from an lnet_peer and put it into
> a separate function, lnet_peer_del_nid().

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25779
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/staging/lustre/lnet/lnet/peer.c |   96 +++++++++++++++++++++++--------
>  1 file changed, 71 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index bbf07008dbb0..30a2486712e4 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -254,7 +254,7 @@ lnet_peer_ni_del_locked(struct lnet_peer_ni *lpni)
>  	 *
>  	 * The last reference may be lost in a place where the
>  	 * lnet_net_lock locks only a single cpt, and that cpt may not
> -	 * be lpni->lpni_cpt. So the zombie list of this peer_table
> +	 * be lpni->lpni_cpt. So the zombie list of lnet_peer_table
>  	 * has its own lock.
>  	 */
>  	spin_lock(&ptable->pt_zombie_lock);
> @@ -340,6 +340,61 @@ lnet_peer_del_locked(struct lnet_peer *peer)
>  	return rc2;
>  }
>  
> +static int
> +lnet_peer_del(struct lnet_peer *peer)
> +{
> +	lnet_net_lock(LNET_LOCK_EX);
> +	lnet_peer_del_locked(peer);
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	return 0;
> +}
> +
> +/*
> + * Delete a NID from a peer.
> + * Implements a few sanity checks.
> + * Call with ln_api_mutex held.
> + */
> +static int
> +lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid)
> +{
> +	struct lnet_peer *lp2;
> +	struct lnet_peer_ni *lpni;
> +
> +	lpni = lnet_find_peer_ni_locked(nid);
> +	if (!lpni) {
> +		CERROR("Cannot remove unknown nid %s from peer %s\n",
> +		       libcfs_nid2str(nid),
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		return -ENOENT;
> +	}
> +	lnet_peer_ni_decref_locked(lpni);
> +	lp2 = lpni->lpni_peer_net->lpn_peer;
> +	if (lp2 != lp) {
> +		CERROR("Nid %s is attached to peer %s, not peer %s\n",
> +		       libcfs_nid2str(nid),
> +		       libcfs_nid2str(lp2->lp_primary_nid),
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * This function only allows deletion of the primary NID if it
> +	 * is the only NID.
> +	 */
> +	if (nid == lp->lp_primary_nid && lnet_get_num_peer_nis(lp) != 1) {
> +		CERROR("Cannot delete primary NID %s from multi-NID peer\n",
> +		       libcfs_nid2str(nid));
> +		return -EINVAL;
> +	}
> +
> +	lnet_net_lock(LNET_LOCK_EX);
> +	lnet_peer_ni_del_locked(lpni);
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	return 0;
> +}
> +
>  static void
>  lnet_peer_table_cleanup_locked(struct lnet_net *net,
>  			       struct lnet_peer_table *ptable)
> @@ -938,45 +993,36 @@ lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
>   *   Delete a NI from a peer if both prim_nid and nid are provided.
>   *   Delete a peer if only prim_nid is provided.
>   *   Delete a peer if its primary nid is provided.
> + *
> + * The caller must hold ln_api_mutex. This prevents the peer from
> + * being modified/deleted by a different thread.
>   */
>  int
>  lnet_del_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid)
>  {
> -	lnet_nid_t local_nid;
> -	struct lnet_peer *peer;
> +	struct lnet_peer *lp;
>  	struct lnet_peer_ni *lpni;
> -	int rc;
>  
>  	if (prim_nid == LNET_NID_ANY)
>  		return -EINVAL;
>  
> -	local_nid = (nid != LNET_NID_ANY) ? nid : prim_nid;
> -
> -	lpni = lnet_find_peer_ni_locked(local_nid);
> +	lpni = lnet_find_peer_ni_locked(prim_nid);
>  	if (!lpni)
> -		return -EINVAL;
> +		return -ENOENT;
>  	lnet_peer_ni_decref_locked(lpni);
> +	lp = lpni->lpni_peer_net->lpn_peer;
>  
> -	peer = lpni->lpni_peer_net->lpn_peer;
> -	LASSERT(peer);
> -
> -	if (peer->lp_primary_nid == lpni->lpni_nid) {
> -		/*
> -		 * deleting the primary ni is equivalent to deleting the
> -		 * entire peer
> -		 */
> -		lnet_net_lock(LNET_LOCK_EX);
> -		rc = lnet_peer_del_locked(peer);
> -		lnet_net_unlock(LNET_LOCK_EX);
> -
> -		return rc;
> +	if (prim_nid != lp->lp_primary_nid) {
> +		CDEBUG(D_NET, "prim_nid %s is not primary for peer %s\n",
> +		       libcfs_nid2str(prim_nid),
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		return -ENODEV;
>  	}
>  
> -	lnet_net_lock(LNET_LOCK_EX);
> -	rc = lnet_peer_ni_del_locked(lpni);
> -	lnet_net_unlock(LNET_LOCK_EX);
> +	if (nid == LNET_NID_ANY || nid == lp->lp_primary_nid)
> +		return lnet_peer_del(lp);
>  
> -	return rc;
> +	return lnet_peer_del_nid(lp, nid);
>  }
>  
>  void
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 10/24] lustre: lnet: refactor lnet_add_peer_ni()
  2018-10-07 23:19 ` [lustre-devel] [PATCH 10/24] lustre: lnet: refactor lnet_add_peer_ni() NeilBrown
@ 2018-10-14 20:02   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 20:02 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Refactor lnet_add_peer_ni() and the functions called by it. In
> particular, lnet_peer_add_nid() adds an lnet_peer_ni to an
> existing lnet_peer, lnet_peer_add() adds a new lnet_peer.
> 
> lnet_find_or_create_peer_locked() is removed.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25780
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    1 
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |   13 +
>  drivers/staging/lustre/lnet/lnet/peer.c            |  230 +++++++-------------
>  3 files changed, 92 insertions(+), 152 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 69f45a76f1cc..fc748ffa251d 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -668,7 +668,6 @@ u32 lnet_get_dlc_seq_locked(void);
>  struct lnet_peer_ni *lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
>  						  struct lnet_peer_net *peer_net,
>  						  struct lnet_peer_ni *prev);
> -struct lnet_peer *lnet_find_or_create_peer_locked(lnet_nid_t dst_nid, int cpt);
>  struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, int cpt);
>  struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
>  struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
> index e8c021622f91..59ae8d0649e5 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-move.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
> @@ -1262,11 +1262,18 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  		return -ESHUTDOWN;
>  	}
>  
> -	peer = lnet_find_or_create_peer_locked(dst_nid, cpt);
> -	if (IS_ERR(peer)) {
> +	/*
> +	 * lnet_nid2peerni_locked() is the path that will find an
> +	 * existing peer_ni, or create one and mark it as having been
> +	 * created due to network traffic.
> +	 */
> +	lpni = lnet_nid2peerni_locked(dst_nid, cpt);
> +	if (IS_ERR(lpni)) {
>  		lnet_net_unlock(cpt);
> -		return PTR_ERR(peer);
> +		return PTR_ERR(lpni);
>  	}
> +	peer = lpni->lpni_peer_net->lpn_peer;
> +	lnet_peer_ni_decref_locked(lpni);
>  
>  	/* If peer is not healthy then can not send anything to it */
>  	if (!lnet_is_peer_healthy_locked(peer)) {
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 30a2486712e4..6b7ca5c361b8 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -541,25 +541,6 @@ lnet_find_peer_ni_locked(lnet_nid_t nid)
>  	return lpni;
>  }
>  
> -struct lnet_peer *
> -lnet_find_or_create_peer_locked(lnet_nid_t dst_nid, int cpt)
> -{
> -	struct lnet_peer_ni *lpni;
> -	struct lnet_peer *lp;
> -
> -	lpni = lnet_find_peer_ni_locked(dst_nid);
> -	if (!lpni) {
> -		lpni = lnet_nid2peerni_locked(dst_nid, cpt);
> -		if (IS_ERR(lpni))
> -			return ERR_CAST(lpni);
> -	}
> -
> -	lp = lpni->lpni_peer_net->lpn_peer;
> -	lnet_peer_ni_decref_locked(lpni);
> -
> -	return lp;
> -}
> -
>  struct lnet_peer_ni *
>  lnet_get_peer_ni_idx_locked(int idx, struct lnet_peer_net **lpn,
>  			    struct lnet_peer **lp)
> @@ -774,131 +755,95 @@ lnet_peer_setup_hierarchy(struct lnet_peer *lp, struct lnet_peer_ni
>  	return -ENOMEM;
>  }
>  
> +/*
> + * Create a new peer, with nid as its primary nid.
> + *
> + * It is not an error if the peer already exists, provided that the
> + * given nid is the primary NID.
> + *
> + * Call with the lnet_api_mutex held.
> + */
>  static int
> -lnet_add_prim_lpni(lnet_nid_t nid)
> +lnet_peer_add(lnet_nid_t nid, bool mr)
>  {
> -	int rc;
> -	struct lnet_peer *peer;
> +	struct lnet_peer *lp;
>  	struct lnet_peer_ni *lpni;
>  
>  	LASSERT(nid != LNET_NID_ANY);
>  
>  	/*
> -	 * lookup the NID and its peer
> -	 *  if the peer doesn't exist, create it.
> -	 *  if this is a non-MR peer then change its state to MR and exit.
> -	 *  if this is an MR peer and it's a primary NI: NO-OP.
> -	 *  if this is an MR peer and it's not a primary NI. Operation not
> -	 *     allowed.
> -	 *
> -	 * The adding and deleting of peer nis is being serialized through
> -	 * the api_mutex. So we can look up peers with the mutex locked
> -	 * safely. Only when we need to change the ptable, do we need to
> -	 * exclusively lock the lnet_net_lock()
> +	 * No need for the lnet_net_lock here, because the
> +	 * lnet_api_mutex is held.
>  	 */
>  	lpni = lnet_find_peer_ni_locked(nid);
>  	if (!lpni) {
> -		rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
> +		int rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
>  		if (rc != 0)
>  			return rc;
>  		lpni = lnet_find_peer_ni_locked(nid);
> +		LASSERT(lpni);
>  	}
> -
> -	LASSERT(lpni);
> -
> +	lp = lpni->lpni_peer_net->lpn_peer;
>  	lnet_peer_ni_decref_locked(lpni);
>  
> -	peer = lpni->lpni_peer_net->lpn_peer;
> -
> -	/*
> -	 * If we found a lpni with the same nid as the NID we're trying to
> -	 * create, then we're trying to create an already existing lpni
> -	 * that belongs to a different peer
> -	 */
> -	if (peer->lp_primary_nid != nid)
> +	/* A found peer must have this primary NID */
> +	if (lp->lp_primary_nid != nid)
>  		return -EEXIST;
>  
>  	/*
> -	 * if we found an lpni that is not a multi-rail, which could occur
> +	 * If we found an lpni that is not a multi-rail, which could occur
>  	 * if lpni is already created as a non-mr lpni or we just created
>  	 * it, then make sure you indicate that this lpni is a primary mr
>  	 * capable peer.
>  	 *
>  	 * TODO: update flags if necessary
>  	 */
> -	if (!peer->lp_multi_rail && peer->lp_primary_nid == nid)
> -		peer->lp_multi_rail = true;
> +	if (mr && !lp->lp_multi_rail) {
> +		lp->lp_multi_rail = true;
> +	} else if (!mr && lp->lp_multi_rail) {
> +		/* The mr state is sticky. */
> +		CDEBUG(D_NET, "Cannot clear multi-flag from peer %s\n",
> +		       libcfs_nid2str(nid));
> +	}
>  
> -	return rc;
> +	return 0;
>  }
>  
>  static int
> -lnet_add_peer_ni_to_prim_lpni(lnet_nid_t prim_nid, lnet_nid_t nid)
> +lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
>  {
> -	struct lnet_peer *peer, *primary_peer;
> -	struct lnet_peer_ni *lpni = NULL, *klpni = NULL;
> -
> -	LASSERT(prim_nid != LNET_NID_ANY && nid != LNET_NID_ANY);
> +	struct lnet_peer_ni *lpni;
>  
> -	/*
> -	 * key nid must be created by this point. If not then this
> -	 * operation is not permitted
> -	 */
> -	klpni = lnet_find_peer_ni_locked(prim_nid);
> -	if (!klpni)
> -		return -ENOENT;
> +	LASSERT(lp);
> +	LASSERT(nid != LNET_NID_ANY);
>  
> -	lnet_peer_ni_decref_locked(klpni);
> +	if (!mr && !lp->lp_multi_rail) {
> +		CERROR("Cannot add nid %s to non-multi-rail peer %s\n",
> +		       libcfs_nid2str(nid),
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		return -EPERM;
> +	}
>  
> -	primary_peer = klpni->lpni_peer_net->lpn_peer;
> +	if (!lp->lp_multi_rail)
> +		lp->lp_multi_rail = true;
>  
>  	lpni = lnet_find_peer_ni_locked(nid);
> -	if (lpni) {
> -		lnet_peer_ni_decref_locked(lpni);
> -
> -		peer = lpni->lpni_peer_net->lpn_peer;
> -		/*
> -		 * lpni already exists in the system but it belongs to
> -		 * a different peer. We can't re-added it
> -		 */
> -		if (peer->lp_primary_nid != prim_nid && peer->lp_multi_rail) {
> -			CERROR("Cannot add NID %s owned by peer %s to peer %s\n",
> -			       libcfs_nid2str(lpni->lpni_nid),
> -			       libcfs_nid2str(peer->lp_primary_nid),
> -			       libcfs_nid2str(prim_nid));
> -			return -EEXIST;
> -		} else if (peer->lp_primary_nid == prim_nid) {
> -			/*
> -			 * found a peer_ni that is already part of the
> -			 * peer. This is a no-op operation.
> -			 */
> -			return 0;
> -		}
> -
> -		/*
> -		 * TODO: else if (peer->lp_primary_nid != prim_nid &&
> -		 *		  !peer->lp_multi_rail)
> -		 * peer is not an MR peer and it will be moved in the next
> -		 * step to klpni, so update its flags accordingly.
> -		 * lnet_move_peer_ni()
> -		 */
> -
> -		/*
> -		 * TODO: call lnet_update_peer() from here to update the
> -		 * flags. This is the case when the lpni you're trying to
> -		 * add is already part of the peer. This could've been
> -		 * added by the DD previously, so go ahead and do any
> -		 * updates to the state if necessary
> -		 */
> +	if (!lpni)
> +		return lnet_peer_setup_hierarchy(lp, NULL, nid);
>  
> +	if (lpni->lpni_peer_net->lpn_peer != lp) {
> +		struct lnet_peer *lp2 = lpni->lpni_peer_net->lpn_peer;
> +		CERROR("Cannot add NID %s owned by peer %s to peer %s\n",
> +		       libcfs_nid2str(lpni->lpni_nid),
> +		       libcfs_nid2str(lp2->lp_primary_nid),
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		return -EEXIST;
>  	}
>  
> -	/*
> -	 * When we get here we either have found an existing lpni, which
> -	 * we can switch to the new peer. Or we need to create one and
> -	 * add it to the new peer
> -	 */
> -	return lnet_peer_setup_hierarchy(primary_peer, lpni, nid);
> +	CDEBUG(D_NET, "NID %s is already owned by peer %s\n",
> +	       libcfs_nid2str(lpni->lpni_nid),
> +	       libcfs_nid2str(lp->lp_primary_nid));
> +	return 0;
>  }
>  
>  /*
> @@ -929,61 +874,50 @@ lnet_peer_ni_traffic_add(lnet_nid_t nid)
>  	return rc;
>  }
>  
> -static int
> -lnet_peer_ni_add_non_mr(lnet_nid_t nid)
> -{
> -	struct lnet_peer_ni *lpni;
> -
> -	lpni = lnet_find_peer_ni_locked(nid);
> -	if (lpni) {
> -		CERROR("Cannot add %s as non-mr when it already exists\n",
> -		       libcfs_nid2str(nid));
> -		lnet_peer_ni_decref_locked(lpni);
> -		return -EEXIST;
> -	}
> -
> -	return lnet_peer_setup_hierarchy(NULL, NULL, nid);
> -}
> -
>  /*
>   * Implementation of IOC_LIBCFS_ADD_PEER_NI.
>   *
>   * This API handles the following combinations:
> - *   Create a primary NI if only the prim_nid is provided
> - *   Create or add an lpni to a primary NI. Primary NI must've already
> - *   been created
> - *   Create a non-MR peer.
> + *   Create a peer with its primary NI if only the prim_nid is provided
> + *   Add a NID to a peer identified by the prim_nid. The peer identified
> + *   by the prim_nid must already exist.
> + *   The peer being created may be non-MR.
> + *
> + * The caller must hold ln_api_mutex. This prevents the peer from
> + * being created/modified/deleted by a different thread.
>   */
>  int
>  lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
>  {
> +	struct lnet_peer *lp = NULL;
> +	struct lnet_peer_ni *lpni;
> +
> +	/* The prim_nid must always be specified */
> +	if (prim_nid == LNET_NID_ANY)
> +		return -EINVAL;
> +
>  	/*
> -	 * Caller trying to setup an MR like peer hierarchy but
> -	 * specifying it to be non-MR. This is not allowed.
> +	 * If nid isn't specified, we must create a new peer with
> +	 * prim_nid as its primary nid.
>  	 */
> -	if (prim_nid != LNET_NID_ANY &&
> -	    nid != LNET_NID_ANY && !mr)
> -		return -EPERM;
> -
> -	/* Add the primary NID of a peer */
> -	if (prim_nid != LNET_NID_ANY &&
> -	    nid == LNET_NID_ANY && mr)
> -		return lnet_add_prim_lpni(prim_nid);
> +	if (nid == LNET_NID_ANY)
> +		return lnet_peer_add(prim_nid, mr);
>  
> -	/* Add a NID to an existing peer */
> -	if (prim_nid != LNET_NID_ANY &&
> -	    nid != LNET_NID_ANY && mr)
> -		return lnet_add_peer_ni_to_prim_lpni(prim_nid, nid);
> +	/* Look up the prim_nid, which must exist. */
> +	lpni = lnet_find_peer_ni_locked(prim_nid);
> +	if (!lpni)
> +		return -ENOENT;
> +	lnet_peer_ni_decref_locked(lpni);
> +	lp = lpni->lpni_peer_net->lpn_peer;
>  
> -	/* Add a non-MR peer NI */
> -	if (((prim_nid != LNET_NID_ANY &&
> -	      nid == LNET_NID_ANY) ||
> -	     (prim_nid == LNET_NID_ANY &&
> -	      nid != LNET_NID_ANY)) && !mr)
> -		return lnet_peer_ni_add_non_mr(prim_nid != LNET_NID_ANY ?
> -							 prim_nid : nid);
> +	if (lp->lp_primary_nid != prim_nid) {
> +		CDEBUG(D_NET, "prim_nid %s is not primary for peer %s\n",
> +		       libcfs_nid2str(prim_nid),
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		return -ENODEV;
> +	}
>  
> -	return 0;
> +	return lnet_peer_add_nid(lp, nid, mr);
>  }
>  
>  /*
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 11/24] lustre: lnet: introduce LNET_PEER_MULTI_RAIL flag bit
  2018-10-07 23:19 ` [lustre-devel] [PATCH 11/24] lustre: lnet: introduce LNET_PEER_MULTI_RAIL flag bit NeilBrown
@ 2018-10-14 20:11   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 20:11 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Add lp_state as a flag word to lnet_peer, and add lp_lock
> to protect it. This lock needs to be taken whenever the
> field is updated, because setting or clearing a bit is
> a read-modify-write cycle.
> 
> The lp_multi_rail is removed, its function is replaced by
> the new LNET_PEER_MULTI_RAIL flag bit.
> 
> The helper lnet_peer_is_multi_rail() tests the bit.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25781
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    6 +++++
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   11 ++++++++--
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |    9 +++++---
>  drivers/staging/lustre/lnet/lnet/peer.c            |   22 +++++++++++++-------
>  4 files changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index fc748ffa251d..75b47628c70e 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -757,4 +757,10 @@ lnet_peer_set_alive(struct lnet_peer_ni *lp)
>  		lnet_notify_locked(lp, 0, 1, lp->lpni_last_alive);
>  }
>  
> +static inline bool
> +lnet_peer_is_multi_rail(struct lnet_peer *lp)
> +{
> +	return lp->lp_state & LNET_PEER_MULTI_RAIL;
> +}
> +
>  #endif
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index f28fa5342914..602978a1c86e 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -467,6 +467,8 @@ struct lnet_peer_ni {
>  	atomic_t		 lpni_refcount;
>  	/* CPT this peer attached on */
>  	int			 lpni_cpt;
> +	/* state flags -- protected by lpni_lock */
> +	unsigned int		lpni_state;
>  	/* # refs from lnet_route::lr_gateway */
>  	int			 lpni_rtr_refcount;
>  	/* sequence number used to round robin over peer nis within a net */
> @@ -497,10 +499,15 @@ struct lnet_peer {
>  	/* primary NID of the peer */
>  	lnet_nid_t		lp_primary_nid;
>  
> -	/* peer is Multi-Rail enabled peer */
> -	bool			lp_multi_rail;
> +	/* lock protecting peer state flags */
> +	spinlock_t		lp_lock;
> +
> +	/* peer state flags */
> +	unsigned int		lp_state;
>  };
>  
> +#define LNET_PEER_MULTI_RAIL	BIT(0)
> +
>  struct lnet_peer_net {
>  	/* chain on peer block */
>  	struct list_head	lpn_on_peer_list;
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
> index 59ae8d0649e5..0d0ad30bb164 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-move.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
> @@ -1281,7 +1281,8 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  		return -EHOSTUNREACH;
>  	}
>  
> -	if (!peer->lp_multi_rail && lnet_get_num_peer_nis(peer) > 1) {
> +	if (!lnet_peer_is_multi_rail(peer) &&
> +	    lnet_get_num_peer_nis(peer) > 1) {
>  		lnet_net_unlock(cpt);
>  		CERROR("peer %s is declared to be non MR capable, yet configured with more than one NID\n",
>  		       libcfs_nid2str(dst_nid));
> @@ -1307,7 +1308,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  
>  	if (msg->msg_type == LNET_MSG_REPLY ||
>  	    msg->msg_type == LNET_MSG_ACK ||
> -	    !peer->lp_multi_rail ||
> +	    !lnet_peer_is_multi_rail(peer) ||
>  	    best_ni) {
>  		/*
>  		 * for replies we want to respond on the same peer_ni we
> @@ -1354,7 +1355,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  				 * then use the best_gw found to send
>  				 * the message to
>  				 */
> -				if (!peer->lp_multi_rail)
> +				if (!lnet_peer_is_multi_rail(peer))
>  					best_lpni = best_gw;
>  				else
>  					best_lpni = NULL;
> @@ -1375,7 +1376,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  	 * if the peer is not MR capable, then we should always send to it
>  	 * using the first NI in the NET we determined.
>  	 */
> -	if (!peer->lp_multi_rail) {
> +	if (!lnet_peer_is_multi_rail(peer)) {
>  		if (!best_lpni) {
>  			lnet_net_unlock(cpt);
>  			CERROR("no route to %s\n",
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 6b7ca5c361b8..cc2b926b76e4 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -182,6 +182,7 @@ lnet_peer_alloc(lnet_nid_t nid)
>  
>  	INIT_LIST_HEAD(&lp->lp_on_lnet_peer_list);
>  	INIT_LIST_HEAD(&lp->lp_peer_nets);
> +	spin_lock_init(&lp->lp_lock);
>  	lp->lp_primary_nid = nid;
>  
>  	/* TODO: update flags */
> @@ -798,13 +799,15 @@ lnet_peer_add(lnet_nid_t nid, bool mr)
>  	 *
>  	 * TODO: update flags if necessary
>  	 */
> -	if (mr && !lp->lp_multi_rail) {
> -		lp->lp_multi_rail = true;
> -	} else if (!mr && lp->lp_multi_rail) {
> +	spin_lock(&lp->lp_lock);
> +	if (mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> +		lp->lp_state |= LNET_PEER_MULTI_RAIL;
> +	} else if (!mr && (lp->lp_state & LNET_PEER_MULTI_RAIL)) {
>  		/* The mr state is sticky. */
> -		CDEBUG(D_NET, "Cannot clear multi-flag from peer %s\n",
> +		CDEBUG(D_NET, "Cannot clear multi-rail flag from peer %s\n",
>  		       libcfs_nid2str(nid));
>  	}
> +	spin_unlock(&lp->lp_lock);
>  
>  	return 0;
>  }
> @@ -817,15 +820,18 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
>  	LASSERT(lp);
>  	LASSERT(nid != LNET_NID_ANY);
>  
> -	if (!mr && !lp->lp_multi_rail) {
> +	spin_lock(&lp->lp_lock);
> +	if (!mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> +		spin_unlock(&lp->lp_lock);
>  		CERROR("Cannot add nid %s to non-multi-rail peer %s\n",
>  		       libcfs_nid2str(nid),
>  		       libcfs_nid2str(lp->lp_primary_nid));
>  		return -EPERM;
>  	}
>  
> -	if (!lp->lp_multi_rail)
> -		lp->lp_multi_rail = true;
> +	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL))
> +		lp->lp_state |= LNET_PEER_MULTI_RAIL;
> +	spin_unlock(&lp->lp_lock);
>  
>  	lpni = lnet_find_peer_ni_locked(nid);
>  	if (!lpni)
> @@ -1183,7 +1189,7 @@ int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
>  		return -ENOENT;
>  
>  	*primary_nid = lp->lp_primary_nid;
> -	*mr = lp->lp_multi_rail;
> +	*mr = lnet_peer_is_multi_rail(lp);
>  	*nid = lpni->lpni_nid;
>  	snprintf(ni_info.cr_aliveness, LNET_MAX_STR_LEN, "NA");
>  	if (lnet_isrouter(lpni) ||
> 
> 
> 


* [lustre-devel] [PATCH 12/24] lustre: lnet: preferred NIs for non-Multi-Rail peers
  2018-10-07 23:19 ` [lustre-devel] [PATCH 12/24] lustre: lnet: preferred NIs for non-Multi-Rail peers NeilBrown
@ 2018-10-14 20:20   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 20:20 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> When a node sends a message to a peer NI, there may be
> a preferred local NI that should be the source of the
> message. This is in particular the case for
> non-Multi-Rail (NMR) peers, as an NMR peer may depend
> on the source address of a message to correctly identify
> its origin (as opposed to using a UUID provided by a
> higher protocol layer).
> 
> Implement this by keeping an array of preferred local
> NIDs in the lnet_peer_ni structure. The case where only
> a single NID needs to be stored is optimized so that this
> can be done without needing to allocate any memory.
> 
> A flag in the lnet_peer_ni, LNET_PEER_NI_NON_MR_PREF,
> indicates that the preferred NI was automatically added
> for an NMR peer. Note that a peer which has not been
> explicitly configured as Multi-Rail will be treated as
> non-Multi-Rail until proven otherwise. These automatic
> preferences will be cleared if the peer is changed to
> Multi-Rail.
> 
> - lnet_peer_ni_set_non_mr_pref_nid()
>   set NMR preferred NI for peer_ni
> - lnet_peer_ni_clr_non_mr_pref_nid()
>   clear NMR preferred NI for peer_ni
> - lnet_peer_clr_non_mr_pref_nids()
>   clear NMR preferred NIs for all peer_ni
> 
> - lnet_peer_add_pref_nid()
>   add a preferred NID
> - lnet_peer_del_pref_nid()
>   delete a preferred NID

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25782
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    7 -
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   10 +
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |   49 +++-
>  drivers/staging/lustre/lnet/lnet/peer.c            |  257 +++++++++++++++++++-
>  4 files changed, 285 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 75b47628c70e..2864bd8a403b 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -668,7 +668,8 @@ u32 lnet_get_dlc_seq_locked(void);
>  struct lnet_peer_ni *lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
>  						  struct lnet_peer_net *peer_net,
>  						  struct lnet_peer_ni *prev);
> -struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, int cpt);
> +struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref,
> +					    int cpt);
>  struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
>  struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
>  void lnet_peer_net_added(struct lnet_net *net);
> @@ -679,8 +680,8 @@ int lnet_peer_tables_create(void);
>  void lnet_debug_peer(lnet_nid_t nid);
>  struct lnet_peer_net *lnet_peer_get_net_locked(struct lnet_peer *peer,
>  					       u32 net_id);
> -bool lnet_peer_is_ni_pref_locked(struct lnet_peer_ni *lpni,
> -				 struct lnet_ni *ni);
> +bool lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid);
> +int lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid);
>  int lnet_add_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
>  int lnet_del_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid);
>  int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index 602978a1c86e..eff2aed5e5c1 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -481,14 +481,20 @@ struct lnet_peer_ni {
>  	unsigned int		 lpni_ping_feats;
>  	/* routers on this peer */
>  	struct list_head	 lpni_routes;
> -	/* array of preferred local nids */
> -	lnet_nid_t		*lpni_pref_nids;
> +	/* preferred local nids: if only one, use lpni_pref.nid */
> +	union lpni_pref {
> +		lnet_nid_t	nid;
> +		lnet_nid_t	*nids;
> +	} lpni_pref;
>  	/* number of preferred NIDs in lnpi_pref_nids */
>  	u32			lpni_pref_nnids;
>  	/* router checker state */
>  	struct lnet_rc_data	*lpni_rcd;
>  };
>  
> +/* Preferred path added due to traffic on non-MR peer_ni */
> +#define LNET_PEER_NI_NON_MR_PREF	BIT(0)
> +
>  struct lnet_peer {
>  	/* chain on global peer list */
>  	struct list_head	lp_on_lnet_peer_list;
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
> index 0d0ad30bb164..99d8b22356bb 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-move.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
> @@ -1267,7 +1267,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  	 * existing peer_ni, or create one and mark it as having been
>  	 * created due to network traffic.
>  	 */
> -	lpni = lnet_nid2peerni_locked(dst_nid, cpt);
> +	lpni = lnet_nid2peerni_locked(dst_nid, LNET_NID_ANY, cpt);
>  	if (IS_ERR(lpni)) {
>  		lnet_net_unlock(cpt);
>  		return PTR_ERR(lpni);
> @@ -1281,14 +1281,6 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  		return -EHOSTUNREACH;
>  	}
>  
> -	if (!lnet_peer_is_multi_rail(peer) &&
> -	    lnet_get_num_peer_nis(peer) > 1) {
> -		lnet_net_unlock(cpt);
> -		CERROR("peer %s is declared to be non MR capable, yet configured with more than one NID\n",
> -		       libcfs_nid2str(dst_nid));
> -		return -EINVAL;
> -	}
> -
>  	/*
>  	 * STEP 1: first jab at determining best_ni
>  	 * if src_nid is explicitly specified, then best_ni is already
> @@ -1373,8 +1365,14 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  	}
>  
>  	/*
> -	 * if the peer is not MR capable, then we should always send to it
> -	 * using the first NI in the NET we determined.
> +	 * We must use a consistent source address when sending to a
> +	 * non-MR peer. However, a non-MR peer can have multiple NIDs
> +	 * on multiple networks, and we may even need to talk to this
> +	 * peer on multiple networks -- certain types of
> +	 * load-balancing configuration do this.
> +	 *
> +	 * So we need to pick the NI the peer prefers for this
> +	 * particular network.
>  	 */
>  	if (!lnet_peer_is_multi_rail(peer)) {
>  		if (!best_lpni) {
> @@ -1384,10 +1382,26 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  			return -EHOSTUNREACH;
>  		}
>  
> -		/* best ni could be set because src_nid was provided */
> +		/* best ni is already set if src_nid was provided */
> +		if (!best_ni) {
> +			/* Get the target peer_ni */
> +			peer_net = lnet_peer_get_net_locked(
> +				peer, LNET_NIDNET(best_lpni->lpni_nid));
> +			list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
> +					    lpni_on_peer_net_list) {
> +				if (lpni->lpni_pref_nnids == 0)
> +					continue;
> +				LASSERT(lpni->lpni_pref_nnids == 1);
> +				best_ni = lnet_nid2ni_locked(
> +					lpni->lpni_pref.nid, cpt);
> +				break;
> +			}
> +		}
> +		/* if best_ni is still not set just pick one */
>  		if (!best_ni) {
>  			best_ni = lnet_net2ni_locked(
>  				best_lpni->lpni_net->net_id, cpt);
> +			/* If there is no best_ni we don't have a route */
>  			if (!best_ni) {
>  				lnet_net_unlock(cpt);
>  				CERROR("no path to %s from net %s\n",
> @@ -1395,7 +1409,13 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  				       libcfs_net2str(best_lpni->lpni_net->net_id));
>  				return -EHOSTUNREACH;
>  			}
> +			lpni = list_entry(peer_net->lpn_peer_nis.next,
> +					  struct lnet_peer_ni,
> +					  lpni_on_peer_net_list);
>  		}
> +		/* Set preferred NI if necessary. */
> +		if (lpni->lpni_pref_nnids == 0)
> +			lnet_peer_ni_set_non_mr_pref_nid(lpni, best_ni->ni_nid);
>  	}
>  
>  	/*
> @@ -1593,7 +1613,8 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  		 */
>  		if (!lnet_is_peer_ni_healthy_locked(lpni))
>  			continue;
> -		ni_is_pref = lnet_peer_is_ni_pref_locked(lpni, best_ni);
> +		ni_is_pref = lnet_peer_is_pref_nid_locked(lpni,
> +							  best_ni->ni_nid);
>  
>  		/* if this is a preferred peer use it */
>  		if (!preferred && ni_is_pref) {
> @@ -2380,7 +2401,7 @@ lnet_parse(struct lnet_ni *ni, struct lnet_hdr *hdr, lnet_nid_t from_nid,
>  	}
>  
>  	lnet_net_lock(cpt);
> -	lpni = lnet_nid2peerni_locked(from_nid, cpt);
> +	lpni = lnet_nid2peerni_locked(from_nid, ni->ni_nid, cpt);
>  	if (IS_ERR(lpni)) {
>  		lnet_net_unlock(cpt);
>  		CERROR("%s, src %s: Dropping %s (error %ld looking up sender)\n",
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index cc2b926b76e4..44a2bf641260 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -617,18 +617,233 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
>  	return lpni;
>  }
>  
> +/*
> + * Test whether a nid is a preferred nid for this peer_ni, i.e. whether
> + * this is a preferred point-to-point path. Call with lnet_net_lock in
> + * shared mode.
> + */
>  bool
> -lnet_peer_is_ni_pref_locked(struct lnet_peer_ni *lpni, struct lnet_ni *ni)
> +lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid)
>  {
>  	int i;
>  
> +	if (lpni->lpni_pref_nnids == 0)
> +		return false;
> +	if (lpni->lpni_pref_nnids == 1)
> +		return lpni->lpni_pref.nid == nid;
>  	for (i = 0; i < lpni->lpni_pref_nnids; i++) {
> -		if (lpni->lpni_pref_nids[i] == ni->ni_nid)
> +		if (lpni->lpni_pref.nids[i] == nid)
>  			return true;
>  	}
>  	return false;
>  }
>  
> +/*
> + * Set a single ni as preferred, provided no preferred ni is already
> + * defined. Only to be used for non-multi-rail peer_ni.
> + */
> +int
> +lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid)
> +{
> +	int rc = 0;
> +
> +	spin_lock(&lpni->lpni_lock);
> +	if (nid == LNET_NID_ANY) {
> +		rc = -EINVAL;
> +	} else if (lpni->lpni_pref_nnids > 0) {
> +		rc = -EPERM;
> +	} else if (lpni->lpni_pref_nnids == 0) {
> +		lpni->lpni_pref.nid = nid;
> +		lpni->lpni_pref_nnids = 1;
> +		lpni->lpni_state |= LNET_PEER_NI_NON_MR_PREF;
> +	}
> +	spin_unlock(&lpni->lpni_lock);
> +
> +	CDEBUG(D_NET, "peer %s nid %s: %d\n",
> +	       libcfs_nid2str(lpni->lpni_nid), libcfs_nid2str(nid), rc);
> +	return rc;
> +}
> +
> +/*
> + * Clear the preferred NID from a non-multi-rail peer_ni, provided
> + * this preference was set by lnet_peer_ni_set_non_mr_pref_nid().
> + */
> +int
> +lnet_peer_ni_clr_non_mr_pref_nid(struct lnet_peer_ni *lpni)
> +{
> +	int rc = 0;
> +
> +	spin_lock(&lpni->lpni_lock);
> +	if (lpni->lpni_state & LNET_PEER_NI_NON_MR_PREF) {
> +		lpni->lpni_pref_nnids = 0;
> +		lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
> +	} else if (lpni->lpni_pref_nnids == 0) {
> +		rc = -ENOENT;
> +	} else {
> +		rc = -EPERM;
> +	}
> +	spin_unlock(&lpni->lpni_lock);
> +
> +	CDEBUG(D_NET, "peer %s: %d\n",
> +	       libcfs_nid2str(lpni->lpni_nid), rc);
> +	return rc;
> +}
> +
> +/*
> + * Clear the preferred NIDs from a non-multi-rail peer.
> + */
> +void
> +lnet_peer_clr_non_mr_pref_nids(struct lnet_peer *lp)
> +{
> +	struct lnet_peer_ni *lpni = NULL;
> +
> +	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL)
> +		lnet_peer_ni_clr_non_mr_pref_nid(lpni);
> +}
> +
> +int
> +lnet_peer_add_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid)
> +{
> +	lnet_nid_t *nids = NULL;
> +	lnet_nid_t *oldnids = NULL;
> +	struct lnet_peer *lp = lpni->lpni_peer_net->lpn_peer;
> +	int size;
> +	int i;
> +	int rc = 0;
> +
> +	if (nid == LNET_NID_ANY) {
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (lpni->lpni_pref_nnids == 1 && lpni->lpni_pref.nid == nid) {
> +		rc = -EEXIST;
> +		goto out;
> +	}
> +
> +	/* A non-MR node may have only one preferred NI per peer_ni */
> +	if (lpni->lpni_pref_nnids > 0) {
> +		if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> +			rc = -EPERM;
> +			goto out;
> +		}
> +	}
> +
> +	if (lpni->lpni_pref_nnids != 0) {
> +		size = sizeof(*nids) * (lpni->lpni_pref_nnids + 1);
> +		nids = kzalloc_cpt(size, GFP_KERNEL, lpni->lpni_cpt);
> +		if (!nids) {
> +			rc = -ENOMEM;
> +			goto out;
> +		}
> +		for (i = 0; i < lpni->lpni_pref_nnids; i++) {
> +			if (lpni->lpni_pref.nids[i] == nid) {
> +				kfree(nids);
> +				rc = -EEXIST;
> +				goto out;
> +			}
> +			nids[i] = lpni->lpni_pref.nids[i];
> +		}
> +		nids[i] = nid;
> +	}
> +
> +	lnet_net_lock(LNET_LOCK_EX);
> +	spin_lock(&lpni->lpni_lock);
> +	if (lpni->lpni_pref_nnids == 0) {
> +		lpni->lpni_pref.nid = nid;
> +	} else {
> +		oldnids = lpni->lpni_pref.nids;
> +		lpni->lpni_pref.nids = nids;
> +	}
> +	lpni->lpni_pref_nnids++;
> +	lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
> +	spin_unlock(&lpni->lpni_lock);
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	kfree(oldnids);
> +out:
> +	if (rc == -EEXIST && (lpni->lpni_state & LNET_PEER_NI_NON_MR_PREF)) {
> +		spin_lock(&lpni->lpni_lock);
> +		lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
> +		spin_unlock(&lpni->lpni_lock);
> +	}
> +	CDEBUG(D_NET, "peer %s nid %s: %d\n",
> +	       libcfs_nid2str(lp->lp_primary_nid), libcfs_nid2str(nid), rc);
> +	return rc;
> +}
> +
> +int
> +lnet_peer_del_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid)
> +{
> +	lnet_nid_t *nids = NULL;
> +	lnet_nid_t *oldnids = NULL;
> +	struct lnet_peer *lp = lpni->lpni_peer_net->lpn_peer;
> +	int size;
> +	int i, j;
> +	int rc = 0;
> +
> +	if (lpni->lpni_pref_nnids == 0) {
> +		rc = -ENOENT;
> +		goto out;
> +	}
> +
> +	if (lpni->lpni_pref_nnids == 1) {
> +		if (lpni->lpni_pref.nid != nid) {
> +			rc = -ENOENT;
> +			goto out;
> +		}
> +	} else if (lpni->lpni_pref_nnids == 2) {
> +		if (lpni->lpni_pref.nids[0] != nid &&
> +		    lpni->lpni_pref.nids[1] != nid) {
> +			rc = -ENOENT;
> +			goto out;
> +		}
> +	} else {
> +		size = sizeof(*nids) * (lpni->lpni_pref_nnids - 1);
> +		nids = kzalloc_cpt(size, GFP_KERNEL, lpni->lpni_cpt);
> +		if (!nids) {
> +			rc = -ENOMEM;
> +			goto out;
> +		}
> +		for (i = 0, j = 0; i < lpni->lpni_pref_nnids; i++) {
> +			if (lpni->lpni_pref.nids[i] == nid)
> +				continue;
> +			nids[j++] = lpni->lpni_pref.nids[i];
> +		}
> +		/* Check if we actually removed a nid. */
> +		if (j == lpni->lpni_pref_nnids) {
> +			kfree(nids);
> +			rc = -ENOENT;
> +			goto out;
> +		}
> +	}
> +
> +	lnet_net_lock(LNET_LOCK_EX);
> +	spin_lock(&lpni->lpni_lock);
> +	if (lpni->lpni_pref_nnids == 1) {
> +		lpni->lpni_pref.nid = LNET_NID_ANY;
> +	} else if (lpni->lpni_pref_nnids == 2) {
> +		oldnids = lpni->lpni_pref.nids;
> +		if (oldnids[0] == nid)
> +			lpni->lpni_pref.nid = oldnids[1];
> +		else
> +			lpni->lpni_pref.nid = oldnids[0];
> +	} else {
> +		oldnids = lpni->lpni_pref.nids;
> +		lpni->lpni_pref.nids = nids;
> +	}
> +	lpni->lpni_pref_nnids--;
> +	lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
> +	spin_unlock(&lpni->lpni_lock);
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	kfree(oldnids);
> +out:
> +	CDEBUG(D_NET, "peer %s nid %s: %d\n",
> +	       libcfs_nid2str(lp->lp_primary_nid), libcfs_nid2str(nid), rc);
> +	return rc;
> +}
> +
>  lnet_nid_t
>  lnet_peer_primary_nid_locked(lnet_nid_t nid)
>  {
> @@ -653,7 +868,7 @@ LNetPrimaryNID(lnet_nid_t nid)
>  	int cpt;
>  
>  	cpt = lnet_net_lock_current();
> -	lpni = lnet_nid2peerni_locked(nid, cpt);
> +	lpni = lnet_nid2peerni_locked(nid, LNET_NID_ANY, cpt);
>  	if (IS_ERR(lpni)) {
>  		rc = PTR_ERR(lpni);
>  		goto out_unlock;
> @@ -802,6 +1017,7 @@ lnet_peer_add(lnet_nid_t nid, bool mr)
>  	spin_lock(&lp->lp_lock);
>  	if (mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
>  		lp->lp_state |= LNET_PEER_MULTI_RAIL;
> +		lnet_peer_clr_non_mr_pref_nids(lp);
>  	} else if (!mr && (lp->lp_state & LNET_PEER_MULTI_RAIL)) {
>  		/* The mr state is sticky. */
>  		CDEBUG(D_NET, "Cannot clear multi-rail flag from peer %s\n",
> @@ -829,8 +1045,10 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
>  		return -EPERM;
>  	}
>  
> -	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL))
> +	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
>  		lp->lp_state |= LNET_PEER_MULTI_RAIL;
> +		lnet_peer_clr_non_mr_pref_nids(lp);
> +	}
>  	spin_unlock(&lp->lp_lock);
>  
>  	lpni = lnet_find_peer_ni_locked(nid);
> @@ -856,28 +1074,27 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
>   * lpni creation initiated due to traffic either sending or receiving.
>   */
>  static int
> -lnet_peer_ni_traffic_add(lnet_nid_t nid)
> +lnet_peer_ni_traffic_add(lnet_nid_t nid, lnet_nid_t pref)
>  {
>  	struct lnet_peer_ni *lpni;
> -	int rc = 0;
> +	int rc;
>  
>  	if (nid == LNET_NID_ANY)
>  		return -EINVAL;
>  
>  	/* lnet_net_lock is not needed here because ln_api_lock is held */
>  	lpni = lnet_find_peer_ni_locked(nid);
> -	if (lpni) {
> -		/*
> -		 * TODO: lnet_update_primary_nid() but not all of it
> -		 * only indicate if we're converting this to MR capable
> -		 * Can happen due to DD
> -		 */
> -		lnet_peer_ni_decref_locked(lpni);
> -	} else {
> +	if (!lpni) {
>  		rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
> +		if (rc)
> +			return rc;
> +		lpni = lnet_find_peer_ni_locked(nid);
>  	}
> +	if (pref != LNET_NID_ANY)
> +		lnet_peer_ni_set_non_mr_pref_nid(lpni, pref);
> +	lnet_peer_ni_decref_locked(lpni);
>  
> -	return rc;
> +	return 0;
>  }
>  
>  /*
> @@ -984,6 +1201,8 @@ lnet_destroy_peer_ni_locked(struct lnet_peer_ni *lpni)
>  	ptable->pt_zombies--;
>  	spin_unlock(&ptable->pt_zombie_lock);
>  
> +	if (lpni->lpni_pref_nnids > 1)
> +		kfree(lpni->lpni_pref.nids);
>  	kfree(lpni);
>  }
>  
> @@ -1006,7 +1225,7 @@ lnet_nid2peerni_ex(lnet_nid_t nid, int cpt)
>  
>  	lnet_net_unlock(cpt);
>  
> -	rc = lnet_peer_ni_traffic_add(nid);
> +	rc = lnet_peer_ni_traffic_add(nid, LNET_NID_ANY);
>  	if (rc) {
>  		lpni = ERR_PTR(rc);
>  		goto out_net_relock;
> @@ -1022,7 +1241,7 @@ lnet_nid2peerni_ex(lnet_nid_t nid, int cpt)
>  }
>  
>  struct lnet_peer_ni *
> -lnet_nid2peerni_locked(lnet_nid_t nid, int cpt)
> +lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref, int cpt)
>  {
>  	struct lnet_peer_ni *lpni = NULL;
>  	int rc;
> @@ -1061,7 +1280,7 @@ lnet_nid2peerni_locked(lnet_nid_t nid, int cpt)
>  		goto out_mutex_unlock;
>  	}
>  
> -	rc = lnet_peer_ni_traffic_add(nid);
> +	rc = lnet_peer_ni_traffic_add(nid, pref);
>  	if (rc) {
>  		lpni = ERR_PTR(rc);
>  		goto out_mutex_unlock;
> @@ -1087,7 +1306,7 @@ lnet_debug_peer(lnet_nid_t nid)
>  	cpt = lnet_cpt_of_nid(nid, NULL);
>  	lnet_net_lock(cpt);
>  
> -	lp = lnet_nid2peerni_locked(nid, cpt);
> +	lp = lnet_nid2peerni_locked(nid, LNET_NID_ANY, cpt);
>  	if (IS_ERR(lp)) {
>  		lnet_net_unlock(cpt);
>  		CDEBUG(D_WARNING, "No peer %s\n", libcfs_nid2str(nid));
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 13/24] lustre: lnet: add LNET_PEER_CONFIGURED flag
  2018-10-07 23:19 ` [lustre-devel] [PATCH 13/24] lustre: lnet: add LNET_PEER_CONFIGURED flag NeilBrown
@ 2018-10-14 20:32   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 20:32 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Add the LNET_PEER_CONFIGURED flag, which indicates that a peer
> has been configured by DLC. This is used to enforce that only
> DLC can modify such a peer.
> 
> This includes some further refactoring of the code that creates
> or modifies peers to ensure that the flag is properly passed
> through, set, and cleared.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25783
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   12 +
>  .../staging/lustre/include/linux/lnet/lib-types.h  |    1 
>  drivers/staging/lustre/lnet/lnet/peer.c            |  426 +++++++++++++-------
>  3 files changed, 290 insertions(+), 149 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 2864bd8a403b..563417510722 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -764,4 +764,16 @@ lnet_peer_is_multi_rail(struct lnet_peer *lp)
>  	return lp->lp_state & LNET_PEER_MULTI_RAIL;
>  }
>  
> +static inline bool
> +lnet_peer_ni_is_configured(struct lnet_peer_ni *lpni)
> +{
> +	return lpni->lpni_peer_net->lpn_peer->lp_state & LNET_PEER_CONFIGURED;
> +}
> +
> +static inline bool
> +lnet_peer_ni_is_primary(struct lnet_peer_ni *lpni)
> +{
> +	return lpni->lpni_nid == lpni->lpni_peer_net->lpn_peer->lp_primary_nid;
> +}
> +
>  #endif
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index eff2aed5e5c1..d1721fd01d93 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -513,6 +513,7 @@ struct lnet_peer {
>  };
>  
>  #define LNET_PEER_MULTI_RAIL	BIT(0)
> +#define LNET_PEER_CONFIGURED	BIT(1)
>  
>  struct lnet_peer_net {
>  	/* chain on peer block */
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 44a2bf641260..09c1b5516f6b 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -191,10 +191,10 @@ lnet_peer_alloc(lnet_nid_t nid)
>  }
>  
>  static void
> -lnet_try_destroy_peer_hierarchy_locked(struct lnet_peer_ni *lpni)
> +lnet_peer_detach_peer_ni(struct lnet_peer_ni *lpni)
>  {
> -	struct lnet_peer_net *peer_net;
> -	struct lnet_peer *peer;
> +	struct lnet_peer_net *lpn;
> +	struct lnet_peer *lp;
>  
>  	/* TODO: could the below situation happen? accessing an already
>  	 * destroyed peer?
> @@ -203,24 +203,28 @@ lnet_try_destroy_peer_hierarchy_locked(struct lnet_peer_ni *lpni)
>  	    !lpni->lpni_peer_net->lpn_peer)
>  		return;
>  
> -	peer_net = lpni->lpni_peer_net;
> -	peer = lpni->lpni_peer_net->lpn_peer;
> +	lpn = lpni->lpni_peer_net;
> +	lp = lpni->lpni_peer_net->lpn_peer;
> +
> +	CDEBUG(D_NET, "peer %s NID %s\n",
> +	       libcfs_nid2str(lp->lp_primary_nid),
> +	       libcfs_nid2str(lpni->lpni_nid));
>  
>  	list_del_init(&lpni->lpni_on_peer_net_list);
>  	lpni->lpni_peer_net = NULL;
>  
> -	/* if peer_net is empty, then remove it from the peer */
> -	if (list_empty(&peer_net->lpn_peer_nis)) {
> -		list_del_init(&peer_net->lpn_on_peer_list);
> -		peer_net->lpn_peer = NULL;
> -		kfree(peer_net);
> +	/* if lpn is empty, then remove it from the peer */
> +	if (list_empty(&lpn->lpn_peer_nis)) {
> +		list_del_init(&lpn->lpn_on_peer_list);
> +		lpn->lpn_peer = NULL;
> +		kfree(lpn);
>  
>  		/* If the peer is empty then remove it from the
>  		 * the_lnet.ln_peers.
>  		 */
> -		if (list_empty(&peer->lp_peer_nets)) {
> -			list_del_init(&peer->lp_on_lnet_peer_list);
> -			kfree(peer);
> +		if (list_empty(&lp->lp_peer_nets)) {
> +			list_del_init(&lp->lp_on_lnet_peer_list);
> +			kfree(lp);
>  		}
>  	}
>  }
> @@ -263,10 +267,10 @@ lnet_peer_ni_del_locked(struct lnet_peer_ni *lpni)
>  	ptable->pt_zombies++;
>  	spin_unlock(&ptable->pt_zombie_lock);
>  
> -	/* no need to keep this peer on the hierarchy anymore */
> -	lnet_try_destroy_peer_hierarchy_locked(lpni);
> +	/* no need to keep this peer_ni on the hierarchy anymore */
> +	lnet_peer_detach_peer_ni(lpni);
>  
> -	/* decrement reference on peer */
> +	/* decrement reference on peer_ni */
>  	lnet_peer_ni_decref_locked(lpni);
>  
>  	return 0;
> @@ -329,6 +333,8 @@ lnet_peer_del_locked(struct lnet_peer *peer)
>  	struct lnet_peer_ni *lpni = NULL, *lpni2;
>  	int rc = 0, rc2 = 0;
>  
> +	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(peer->lp_primary_nid));
> +
>  	lpni = lnet_get_next_peer_ni_locked(peer, NULL, lpni);
>  	while (lpni) {
>  		lpni2 = lnet_get_next_peer_ni_locked(peer, NULL, lpni);
> @@ -352,31 +358,36 @@ lnet_peer_del(struct lnet_peer *peer)
>  }
>  
>  /*
> - * Delete a NID from a peer.
> - * Implements a few sanity checks.
> - * Call with ln_api_mutex held.
> + * Delete a NID from a peer. Call with ln_api_mutex held.
> + *
> + * Error codes:
> + *  -EPERM:  Non-DLC deletion from DLC-configured peer.
> + *  -ENOENT: No lnet_peer_ni corresponding to the nid.
> + *  -ECHILD: The lnet_peer_ni isn't connected to the peer.
> + *  -EBUSY:  The lnet_peer_ni is the primary, and not the only peer_ni.
>   */
>  static int
> -lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid)
> +lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid, unsigned int flags)
>  {
> -	struct lnet_peer *lp2;
>  	struct lnet_peer_ni *lpni;
> +	lnet_nid_t primary_nid = lp->lp_primary_nid;
> +	int rc = 0;
>  
> +	if (!(flags & LNET_PEER_CONFIGURED)) {
> +		if (lp->lp_state & LNET_PEER_CONFIGURED) {
> +			rc = -EPERM;
> +			goto out;
> +		}
> +	}
>  	lpni = lnet_find_peer_ni_locked(nid);
>  	if (!lpni) {
> -		CERROR("Cannot remove unknown nid %s from peer %s\n",
> -		       libcfs_nid2str(nid),
> -		       libcfs_nid2str(lp->lp_primary_nid));
> -		return -ENOENT;
> +		rc = -ENOENT;
> +		goto out;
>  	}
>  	lnet_peer_ni_decref_locked(lpni);
> -	lp2 = lpni->lpni_peer_net->lpn_peer;
> -	if (lp2 != lp) {
> -		CERROR("Nid %s is attached to peer %s, not peer %s\n",
> -		       libcfs_nid2str(nid),
> -		       libcfs_nid2str(lp2->lp_primary_nid),
> -		       libcfs_nid2str(lp->lp_primary_nid));
> -		return -EINVAL;
> +	if (lp != lpni->lpni_peer_net->lpn_peer) {
> +		rc = -ECHILD;
> +		goto out;
>  	}
>  
>  	/*
> @@ -384,16 +395,19 @@ lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid)
>  	 * is the only NID.
>  	 */
>  	if (nid == lp->lp_primary_nid && lnet_get_num_peer_nis(lp) != 1) {
> -		CERROR("Cannot delete primary NID %s from multi-NID peer\n",
> -		       libcfs_nid2str(nid));
> -		return -EINVAL;
> +		rc = -EBUSY;
> +		goto out;
>  	}
>  
>  	lnet_net_lock(LNET_LOCK_EX);
>  	lnet_peer_ni_del_locked(lpni);
>  	lnet_net_unlock(LNET_LOCK_EX);
>  
> -	return 0;
> +out:
> +	CDEBUG(D_NET, "peer %s NID %s flags %#x: %d\n",
> +	       libcfs_nid2str(primary_nid), libcfs_nid2str(nid), flags, rc);
> +
> +	return rc;
>  }
>  
>  static void
> @@ -895,46 +909,27 @@ lnet_peer_get_net_locked(struct lnet_peer *peer, u32 net_id)
>  	return NULL;
>  }
>  
> +/*
> + * Always returns 0, but it is the last function called from functions
> + * that do return an int, so returning 0 here allows the compiler to
> + * do a tail call.
> + */
>  static int
> -lnet_peer_setup_hierarchy(struct lnet_peer *lp, struct lnet_peer_ni
> -	 *lpni,
> -			  lnet_nid_t nid)
> +lnet_peer_attach_peer_ni(struct lnet_peer *lp,
> +			 struct lnet_peer_net *lpn,
> +			 struct lnet_peer_ni *lpni,
> +			 unsigned int flags)
>  {
> -	struct lnet_peer_net *lpn = NULL;
>  	struct lnet_peer_table *ptable;
> -	u32 net_id = LNET_NIDNET(nid);
> -
> -	/*
> -	 * Create the peer_ni, peer_net, and peer if they don't exist
> -	 * yet.
> -	 */
> -	if (lp) {
> -		lpn = lnet_peer_get_net_locked(lp, net_id);
> -	} else {
> -		lp = lnet_peer_alloc(nid);
> -		if (!lp)
> -			goto out_enomem;
> -	}
> -
> -	if (!lpn) {
> -		lpn = lnet_peer_net_alloc(net_id);
> -		if (!lpn)
> -			goto out_maybe_free_lp;
> -	}
> -
> -	if (!lpni) {
> -		lpni = lnet_peer_ni_alloc(nid);
> -		if (!lpni)
> -			goto out_maybe_free_lpn;
> -	}
>  
>  	/* Install the new peer_ni */
>  	lnet_net_lock(LNET_LOCK_EX);
>  	/* Add peer_ni to global peer table hash, if necessary. */
>  	if (list_empty(&lpni->lpni_hashlist)) {
> +		int hash = lnet_nid2peerhash(lpni->lpni_nid);
> +
>  		ptable = the_lnet.ln_peer_tables[lpni->lpni_cpt];
> -		list_add_tail(&lpni->lpni_hashlist,
> -			      &ptable->pt_hash[lnet_nid2peerhash(nid)]);
> +		list_add_tail(&lpni->lpni_hashlist, &ptable->pt_hash[hash]);
>  		ptable->pt_version++;
>  		atomic_inc(&ptable->pt_number);
>  		atomic_inc(&lpni->lpni_refcount);
> @@ -942,7 +937,7 @@ lnet_peer_setup_hierarchy(struct lnet_peer *lp, struct lnet_peer_ni
>  
>  	/* Detach the peer_ni from an existing peer, if necessary. */
>  	if (lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer != lp)
> -		lnet_try_destroy_peer_hierarchy_locked(lpni);
> +		lnet_peer_detach_peer_ni(lpni);
>  
>  	/* Add peer_ni to peer_net */
>  	lpni->lpni_peer_net = lpn;
> @@ -957,33 +952,42 @@ lnet_peer_setup_hierarchy(struct lnet_peer *lp, struct lnet_peer_ni
>  	/* Add peer to global peer list */
>  	if (list_empty(&lp->lp_on_lnet_peer_list))
>  		list_add_tail(&lp->lp_on_lnet_peer_list, &the_lnet.ln_peers);
> +
> +	/* Update peer state */
> +	spin_lock(&lp->lp_lock);
> +	if (flags & LNET_PEER_CONFIGURED) {
> +		if (!(lp->lp_state & LNET_PEER_CONFIGURED))
> +			lp->lp_state |= LNET_PEER_CONFIGURED;
> +	}
> +	if (flags & LNET_PEER_MULTI_RAIL) {
> +		if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> +			lp->lp_state |= LNET_PEER_MULTI_RAIL;
> +			lnet_peer_clr_non_mr_pref_nids(lp);
> +		}
> +	}
> +	spin_unlock(&lp->lp_lock);
> +
>  	lnet_net_unlock(LNET_LOCK_EX);
>  
> -	return 0;
> +	CDEBUG(D_NET, "peer %s NID %s flags %#x\n",
> +	       libcfs_nid2str(lp->lp_primary_nid),
> +	       libcfs_nid2str(lpni->lpni_nid), flags);
>  
> -out_maybe_free_lpn:
> -	if (list_empty(&lpn->lpn_on_peer_list))
> -		kfree(lpn);
> -out_maybe_free_lp:
> -	if (list_empty(&lp->lp_on_lnet_peer_list))
> -		kfree(lp);
> -out_enomem:
> -	return -ENOMEM;
> +	return 0;
>  }
>  
>  /*
>   * Create a new peer, with nid as its primary nid.
>   *
> - * It is not an error if the peer already exists, provided that the
> - * given nid is the primary NID.
> - *
>   * Call with the lnet_api_mutex held.
>   */
>  static int
> -lnet_peer_add(lnet_nid_t nid, bool mr)
> +lnet_peer_add(lnet_nid_t nid, unsigned int flags)
>  {
>  	struct lnet_peer *lp;
> +	struct lnet_peer_net *lpn;
>  	struct lnet_peer_ni *lpni;
> +	int rc = 0;
>  
>  	LASSERT(nid != LNET_NID_ANY);
>  
> @@ -992,82 +996,153 @@ lnet_peer_add(lnet_nid_t nid, bool mr)
>  	 * lnet_api_mutex is held.
>  	 */
>  	lpni = lnet_find_peer_ni_locked(nid);
> -	if (!lpni) {
> -		int rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
> -		if (rc != 0)
> -			return rc;
> -		lpni = lnet_find_peer_ni_locked(nid);
> -		LASSERT(lpni);
> +	if (lpni) {
> +		/* A peer with this NID already exists. */
> +		lp = lpni->lpni_peer_net->lpn_peer;
> +		lnet_peer_ni_decref_locked(lpni);
> +		/*
> +		 * This is an error if the peer was configured and the
> +		 * primary NID differs or an attempt is made to change
> +		 * the Multi-Rail flag. Otherwise the assumption is
> +		 * that an existing peer is being modified.
> +		 */
> +		if (lp->lp_state & LNET_PEER_CONFIGURED) {
> +			if (lp->lp_primary_nid != nid)
> +				rc = -EEXIST;
> +			else if ((lp->lp_state ^ flags) & LNET_PEER_MULTI_RAIL)
> +				rc = -EPERM;
> +			goto out;
> +		}
> +		/* Delete and recreate as a configured peer. */
> +		lnet_peer_del(lp);
>  	}
> -	lp = lpni->lpni_peer_net->lpn_peer;
> -	lnet_peer_ni_decref_locked(lpni);
>  
> -	/* A found peer must have this primary NID */
> -	if (lp->lp_primary_nid != nid)
> -		return -EEXIST;
> +	/* Create peer, peer_net, and peer_ni. */
> +	rc = -ENOMEM;
> +	lp = lnet_peer_alloc(nid);
> +	if (!lp)
> +		goto out;
> +	lpn = lnet_peer_net_alloc(LNET_NIDNET(nid));
> +	if (!lpn)
> +		goto out_free_lp;
> +	lpni = lnet_peer_ni_alloc(nid);
> +	if (!lpni)
> +		goto out_free_lpn;
>  
> -	/*
> -	 * If we found an lpni that is not a multi-rail, which could occur
> -	 * if lpni is already created as a non-mr lpni or we just created
> -	 * it, then make sure you indicate that this lpni is a primary mr
> -	 * capable peer.
> -	 *
> -	 * TODO: update flags if necessary
> -	 */
> -	spin_lock(&lp->lp_lock);
> -	if (mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> -		lp->lp_state |= LNET_PEER_MULTI_RAIL;
> -		lnet_peer_clr_non_mr_pref_nids(lp);
> -	} else if (!mr && (lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> -		/* The mr state is sticky. */
> -		CDEBUG(D_NET, "Cannot clear multi-rail flag from peer %s\n",
> -		       libcfs_nid2str(nid));
> -	}
> -	spin_unlock(&lp->lp_lock);
> +	return lnet_peer_attach_peer_ni(lp, lpn, lpni, flags);
>  
> -	return 0;
> +out_free_lpn:
> +	kfree(lpn);
> +out_free_lp:
> +	kfree(lp);
> +out:
> +	CDEBUG(D_NET, "peer %s NID flags %#x: %d\n",
> +	       libcfs_nid2str(nid), flags, rc);
> +	return rc;
>  }
>  
> +/*
> + * Add a NID to a peer. Call with ln_api_mutex held.
> + *
> + * Error codes:
> + *  -EPERM:    Non-DLC addition to a DLC-configured peer.
> + *  -EEXIST:   The NID was configured by DLC for a different peer.
> + *  -ENOMEM:   Out of memory.
> + *  -ENOTUNIQ: Adding a second peer NID on a single network on a
> + *             non-multi-rail peer.
> + */
>  static int
> -lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
> +lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, unsigned int flags)
>  {
> +	struct lnet_peer_net *lpn;
>  	struct lnet_peer_ni *lpni;
> +	int rc = 0;
>  
>  	LASSERT(lp);
>  	LASSERT(nid != LNET_NID_ANY);
>  
> -	spin_lock(&lp->lp_lock);
> -	if (!mr && !(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> -		spin_unlock(&lp->lp_lock);
> -		CERROR("Cannot add nid %s to non-multi-rail peer %s\n",
> -		       libcfs_nid2str(nid),
> -		       libcfs_nid2str(lp->lp_primary_nid));
> -		return -EPERM;
> +	/* A configured peer can only be updated through configuration. */
> +	if (!(flags & LNET_PEER_CONFIGURED)) {
> +		if (lp->lp_state & LNET_PEER_CONFIGURED) {
> +			rc = -EPERM;
> +			goto out;
> +		}
>  	}
>  
> -	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> -		lp->lp_state |= LNET_PEER_MULTI_RAIL;
> -		lnet_peer_clr_non_mr_pref_nids(lp);
> +	/*
> +	 * The MULTI_RAIL flag can be set but not cleared, because
> +	 * that would leave the peer struct in an invalid state.
> +	 */
> +	if (flags & LNET_PEER_MULTI_RAIL) {
> +		spin_lock(&lp->lp_lock);
> +		if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> +			lp->lp_state |= LNET_PEER_MULTI_RAIL;
> +			lnet_peer_clr_non_mr_pref_nids(lp);
> +		}
> +		spin_unlock(&lp->lp_lock);
> +	} else if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
> +		rc = -EPERM;
> +		goto out;
>  	}
> -	spin_unlock(&lp->lp_lock);
>  
>  	lpni = lnet_find_peer_ni_locked(nid);
> -	if (!lpni)
> -		return lnet_peer_setup_hierarchy(lp, NULL, nid);
> +	if (lpni) {
> +		/*
> +		 * A peer_ni already exists. This is only a problem if
> +		 * it is not connected to this peer and was configured
> +		 * by DLC.
> +		 */
> +		lnet_peer_ni_decref_locked(lpni);
> +		if (lpni->lpni_peer_net->lpn_peer == lp)
> +			goto out;
> +		if (lnet_peer_ni_is_configured(lpni)) {
> +			rc = -EEXIST;
> +			goto out;
> +		}
> +		/* If this is the primary NID, destroy the peer. */
> +		if (lnet_peer_ni_is_primary(lpni)) {
> +			lnet_peer_del(lpni->lpni_peer_net->lpn_peer);
> +			lpni = lnet_peer_ni_alloc(nid);
> +			if (!lpni) {
> +				rc = -ENOMEM;
> +				goto out;
> +			}
> +		}
> +	} else {
> +		lpni = lnet_peer_ni_alloc(nid);
> +		if (!lpni) {
> +			rc = -ENOMEM;
> +			goto out;
> +		}
> +	}
>  
> -	if (lpni->lpni_peer_net->lpn_peer != lp) {
> -		struct lnet_peer *lp2 = lpni->lpni_peer_net->lpn_peer;
> -		CERROR("Cannot add NID %s owned by peer %s to peer %s\n",
> -		       libcfs_nid2str(lpni->lpni_nid),
> -		       libcfs_nid2str(lp2->lp_primary_nid),
> -		       libcfs_nid2str(lp->lp_primary_nid));
> -		return -EEXIST;
> +	/*
> +	 * Get the peer_net. Check that we're not adding a second
> +	 * peer_ni on a peer_net of a non-multi-rail peer.
> +	 */
> +	lpn = lnet_peer_get_net_locked(lp, LNET_NIDNET(nid));
> +	if (!lpn) {
> +		lpn = lnet_peer_net_alloc(LNET_NIDNET(nid));
> +		if (!lpn) {
> +			rc = -ENOMEM;
> +			goto out_free_lpni;
> +		}
> +	} else if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> +		rc = -ENOTUNIQ;
> +		goto out_free_lpni;
>  	}
>  
> -	CDEBUG(D_NET, "NID %s is already owned by peer %s\n",
> -	       libcfs_nid2str(lpni->lpni_nid),
> -	       libcfs_nid2str(lp->lp_primary_nid));
> -	return 0;
> +	return lnet_peer_attach_peer_ni(lp, lpn, lpni, flags);
> +
> +out_free_lpni:
> +	/* If the peer_ni was allocated above, its peer_net pointer is NULL */
> +	if (!lpni->lpni_peer_net)
> +		kfree(lpni);
> +out:
> +	CDEBUG(D_NET, "peer %s NID %s flags %#x: %d\n",
> +	       libcfs_nid2str(lp->lp_primary_nid), libcfs_nid2str(nid),
> +	       flags, rc);
> +	return rc;
>  }
>  
>  /*
> @@ -1076,25 +1151,53 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, bool mr)
>  static int
>  lnet_peer_ni_traffic_add(lnet_nid_t nid, lnet_nid_t pref)
>  {
> +	struct lnet_peer *lp;
> +	struct lnet_peer_net *lpn;
>  	struct lnet_peer_ni *lpni;
> -	int rc;
> +	unsigned int flags = 0;
> +	int rc = 0;
>  
> -	if (nid == LNET_NID_ANY)
> -		return -EINVAL;
> +	if (nid == LNET_NID_ANY) {
> +		rc = -EINVAL;
> +		goto out;
> +	}
>  
>  	/* lnet_net_lock is not needed here because ln_api_lock is held */
>  	lpni = lnet_find_peer_ni_locked(nid);
> -	if (!lpni) {
> -		rc = lnet_peer_setup_hierarchy(NULL, NULL, nid);
> -		if (rc)
> -			return rc;
> -		lpni = lnet_find_peer_ni_locked(nid);
> +	if (lpni) {
> +		/*
> +		 * We must have raced with another thread. Since we
> +		 * know next to nothing about a peer_ni created by
> +		 * traffic, we just assume everything is ok and
> +		 * return.
> +		 */
> +		lnet_peer_ni_decref_locked(lpni);
> +		goto out;
>  	}
> +
> +	/* Create peer, peer_net, and peer_ni. */
> +	rc = -ENOMEM;
> +	lp = lnet_peer_alloc(nid);
> +	if (!lp)
> +		goto out;
> +	lpn = lnet_peer_net_alloc(LNET_NIDNET(nid));
> +	if (!lpn)
> +		goto out_free_lp;
> +	lpni = lnet_peer_ni_alloc(nid);
> +	if (!lpni)
> +		goto out_free_lpn;
>  	if (pref != LNET_NID_ANY)
>  		lnet_peer_ni_set_non_mr_pref_nid(lpni, pref);
> -	lnet_peer_ni_decref_locked(lpni);
>  
> -	return 0;
> +	return lnet_peer_attach_peer_ni(lp, lpn, lpni, flags);
> +
> +out_free_lpn:
> +	kfree(lpn);
> +out_free_lp:
> +	kfree(lp);
> +out:
> +	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(nid), rc);
> +	return rc;
>  }
>  
>  /*
> @@ -1114,17 +1217,22 @@ lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
>  {
>  	struct lnet_peer *lp = NULL;
>  	struct lnet_peer_ni *lpni;
> +	unsigned int flags;
>  
>  	/* The prim_nid must always be specified */
>  	if (prim_nid == LNET_NID_ANY)
>  		return -EINVAL;
>  
> +	flags = LNET_PEER_CONFIGURED;
> +	if (mr)
> +		flags |= LNET_PEER_MULTI_RAIL;
> +
>  	/*
>  	 * If nid isn't specified, we must create a new peer with
>  	 * prim_nid as its primary nid.
>  	 */
>  	if (nid == LNET_NID_ANY)
> -		return lnet_peer_add(prim_nid, mr);
> +		return lnet_peer_add(prim_nid, flags);
>  
>  	/* Look up the prim_nid, which must exist. */
>  	lpni = lnet_find_peer_ni_locked(prim_nid);
> @@ -1133,6 +1241,14 @@ lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
>  	lnet_peer_ni_decref_locked(lpni);
>  	lp = lpni->lpni_peer_net->lpn_peer;
>  
> +	/* Peer must have been configured. */
> +	if (!(lp->lp_state & LNET_PEER_CONFIGURED)) {
> +		CDEBUG(D_NET, "peer %s was not configured\n",
> +		       libcfs_nid2str(prim_nid));
> +		return -ENOENT;
> +	}
> +
> +	/* Primary NID must match */
>  	if (lp->lp_primary_nid != prim_nid) {
>  		CDEBUG(D_NET, "prim_nid %s is not primary for peer %s\n",
>  		       libcfs_nid2str(prim_nid),
> @@ -1140,7 +1256,14 @@ lnet_add_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid, bool mr)
>  		return -ENODEV;
>  	}
>  
> -	return lnet_peer_add_nid(lp, nid, mr);
> +	/* Multi-Rail flag must match. */
> +	if ((lp->lp_state ^ flags) & LNET_PEER_MULTI_RAIL) {
> +		CDEBUG(D_NET, "multi-rail state mismatch for peer %s\n",
> +		       libcfs_nid2str(prim_nid));
> +		return -EPERM;
> +	}
> +
> +	return lnet_peer_add_nid(lp, nid, flags);
>  }
>  
>  /*
> @@ -1159,6 +1282,7 @@ lnet_del_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid)
>  {
>  	struct lnet_peer *lp;
>  	struct lnet_peer_ni *lpni;
> +	unsigned int flags;
>  
>  	if (prim_nid == LNET_NID_ANY)
>  		return -EINVAL;
> @@ -1179,7 +1303,11 @@ lnet_del_peer_ni(lnet_nid_t prim_nid, lnet_nid_t nid)
>  	if (nid == LNET_NID_ANY || nid == lp->lp_primary_nid)
>  		return lnet_peer_del(lp);
>  
> -	return lnet_peer_del_nid(lp, nid);
> +	flags = LNET_PEER_CONFIGURED;
> +	if (lp->lp_state & LNET_PEER_MULTI_RAIL)
> +		flags |= LNET_PEER_MULTI_RAIL;
> +
> +	return lnet_peer_del_nid(lp, nid, flags);
>  }
>  
>  void
> 
> 
> 


* [lustre-devel] [PATCH 14/24] lustre: lnet: reference counts on lnet_peer/lnet_peer_net
  2018-10-07 23:19 ` [lustre-devel] [PATCH 14/24] lustre: lnet: reference counts on lnet_peer/lnet_peer_net NeilBrown
@ 2018-10-14 22:42   ` James Simmons
  2018-10-17  5:16     ` NeilBrown
  0 siblings, 1 reply; 57+ messages in thread
From: James Simmons @ 2018-10-14 22:42 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Peer discovery will be keeping track of lnet_peer structures,
> so there will be references to an lnet_peer independent of
> the references implied by lnet_peer_ni structures. Manage
> this by adding explicit reference counts to lnet_peer_net and
> lnet_peer.
> 
> Each lnet_peer_net has a hold on the lnet_peer it links to
> with its lpn_peer pointer. This hold is only removed when that
> pointer is assigned a new value or the lnet_peer_net is freed.
> Just removing an lnet_peer_net from the lp_peer_nets list does
> not release this hold, it just prevents new lookups of the
> lnet_peer_net via the lnet_peer.
> 
> Each lnet_peer_ni has a hold on the lnet_peer_net it links to
> with its lpni_peer_net pointer. This hold is only removed when
> that pointer is assigned a new value or the lnet_peer_ni is
> freed. Just removing an lnet_peer_ni from the lpn_peer_nis
> list does not release this hold; it just prevents new lookups
> of the lnet_peer_ni via the lnet_peer_net.
> 
> This ensures that given a lnet_peer_ni *lpni, we can rely on
> lpni->lpni_peer_net->lpn_peer pointing to a valid lnet_peer.
> 
> Keep a count of the total number of lnet_peer_ni attached to
> an lnet_peer in lp_nnis.
> 
> Split the global ln_peers list into per-lnet_peer_table lists.
> The CPT of the peer table in which the lnet_peer is linked is
> stored in lp_cpt.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25784
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   49 +++--
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   50 ++++-
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |    1 
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |    8 -
>  drivers/staging/lustre/lnet/lnet/peer.c            |  210 ++++++++++++++------
>  5 files changed, 227 insertions(+), 91 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 563417510722..aad25eb0011b 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -310,6 +310,36 @@ lnet_handle2me(struct lnet_handle_me *handle)
>  	return lh_entry(lh, struct lnet_me, me_lh);
>  }
>  
> +static inline void
> +lnet_peer_net_addref_locked(struct lnet_peer_net *lpn)
> +{
> +	atomic_inc(&lpn->lpn_refcount);
> +}
> +
> +void lnet_destroy_peer_net_locked(struct lnet_peer_net *lpn);
> +
> +static inline void
> +lnet_peer_net_decref_locked(struct lnet_peer_net *lpn)
> +{
> +	if (atomic_dec_and_test(&lpn->lpn_refcount))
> +		lnet_destroy_peer_net_locked(lpn);
> +}
> +
> +static inline void
> +lnet_peer_addref_locked(struct lnet_peer *lp)
> +{
> +	atomic_inc(&lp->lp_refcount);
> +}
> +
> +void lnet_destroy_peer_locked(struct lnet_peer *lp);
> +
> +static inline void
> +lnet_peer_decref_locked(struct lnet_peer *lp)
> +{
> +	if (atomic_dec_and_test(&lp->lp_refcount))
> +		lnet_destroy_peer_locked(lp);
> +}
> +
>  static inline void
>  lnet_peer_ni_addref_locked(struct lnet_peer_ni *lp)
>  {
> @@ -695,21 +725,6 @@ int lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
>  			  __u32 *peer_rtr_credits, __u32 *peer_min_rtr_credtis,
>  			  __u32 *peer_tx_qnob);
>  
> -static inline __u32
> -lnet_get_num_peer_nis(struct lnet_peer *peer)
> -{
> -	struct lnet_peer_net *lpn;
> -	struct lnet_peer_ni *lpni;
> -	__u32 count = 0;
> -
> -	list_for_each_entry(lpn, &peer->lp_peer_nets, lpn_on_peer_list)
> -		list_for_each_entry(lpni, &lpn->lpn_peer_nis,
> -				    lpni_on_peer_net_list)
> -			count++;
> -
> -	return count;
> -}
> -
>  static inline bool
>  lnet_is_peer_ni_healthy_locked(struct lnet_peer_ni *lpni)
>  {
> @@ -728,7 +743,7 @@ lnet_is_peer_net_healthy_locked(struct lnet_peer_net *peer_net)
>  	struct lnet_peer_ni *lpni;
>  
>  	list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
> -			    lpni_on_peer_net_list) {
> +			    lpni_peer_nis) {
>  		if (lnet_is_peer_ni_healthy_locked(lpni))
>  			return true;
>  	}
> @@ -741,7 +756,7 @@ lnet_is_peer_healthy_locked(struct lnet_peer *peer)
>  {
>  	struct lnet_peer_net *peer_net;
>  
> -	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_on_peer_list) {
> +	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
>  		if (lnet_is_peer_net_healthy_locked(peer_net))
>  			return true;
>  	}
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index d1721fd01d93..260619e19bde 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -411,7 +411,8 @@ struct lnet_rc_data {
>  };
>  
>  struct lnet_peer_ni {
> -	struct list_head	lpni_on_peer_net_list;
> +	/* chain on lpn_peer_nis */
> +	struct list_head	lpni_peer_nis;
>  	/* chain on remote peer list */
>  	struct list_head	lpni_on_remote_peer_ni_list;
>  	/* chain on peer hash */
> @@ -496,8 +497,8 @@ struct lnet_peer_ni {
>  #define LNET_PEER_NI_NON_MR_PREF	BIT(0)
>  
>  struct lnet_peer {
> -	/* chain on global peer list */
> -	struct list_head	lp_on_lnet_peer_list;
> +	/* chain on pt_peer_list */
> +	struct list_head	lp_peer_list;
>  
>  	/* list of peer nets */
>  	struct list_head	lp_peer_nets;
> @@ -505,6 +506,15 @@ struct lnet_peer {
>  	/* primary NID of the peer */
>  	lnet_nid_t		lp_primary_nid;
>  
> +	/* CPT of peer_table */
> +	int			lp_cpt;
> +
> +	/* number of NIDs on this peer */
> +	int			lp_nnis;
> +
> +	/* reference count */
> +	atomic_t		lp_refcount;
> +
>  	/* lock protecting peer state flags */
>  	spinlock_t		lp_lock;
>  
> @@ -516,8 +526,8 @@ struct lnet_peer {
>  #define LNET_PEER_CONFIGURED	BIT(1)
>  
>  struct lnet_peer_net {
> -	/* chain on peer block */
> -	struct list_head	lpn_on_peer_list;
> +	/* chain on lp_peer_nets */
> +	struct list_head	lpn_peer_nets;
>  
>  	/* list of peer_nis on this network */
>  	struct list_head	lpn_peer_nis;
> @@ -527,21 +537,45 @@ struct lnet_peer_net {
>  
>  	/* Net ID */
>  	__u32			lpn_net_id;
> +
> +	/* reference count */
> +	atomic_t		lpn_refcount;
>  };
>  
>  /* peer hash size */
>  #define LNET_PEER_HASH_BITS	9
>  #define LNET_PEER_HASH_SIZE	(1 << LNET_PEER_HASH_BITS)
>  
> -/* peer hash table */
> +/*
> + * peer hash table - one per CPT
> + *
> + * protected by lnet_net_lock/EX for update
> + *    pt_version
> + *    pt_number
> + *    pt_hash[...]
> + *    pt_peer_list
> + *    pt_peers
> + *    pt_peer_nnids
> + * protected by pt_zombie_lock:
> + *    pt_zombie_list
> + *    pt_zombies
> + *
> + * pt_zombie lock nests inside lnet_net_lock
> + */
>  struct lnet_peer_table {
>  	/* /proc validity stamp */
>  	int			 pt_version;
>  	/* # peers extant */
>  	atomic_t		 pt_number;
> +	/* peers */
> +	struct list_head	pt_peer_list;
> +	/* # peers */
> +	int			pt_peers;
> +	/* # NIDS on listed peers */
> +	int			pt_peer_nnids;
>  	/* # zombies to go to deathrow (and not there yet) */
>  	int			 pt_zombies;
> -	/* zombie peers */
> +	/* zombie peers_ni */
>  	struct list_head	 pt_zombie_list;
>  	/* protect list and count */
>  	spinlock_t		 pt_zombie_lock;
> @@ -785,8 +819,6 @@ struct lnet {
>  	struct lnet_msg_container	**ln_msg_containers;
>  	struct lnet_counters		**ln_counters;
>  	struct lnet_peer_table		**ln_peer_tables;
> -	/* list of configured or discovered peers */
> -	struct list_head		ln_peers;
>  	/* list of peer nis not on a local network */
>  	struct list_head		ln_remote_peer_ni_list;
>  	/* failure simulation */
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index d64ae2939abc..c48bcb8722a0 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -625,7 +625,6 @@ lnet_prepare(lnet_pid_t requested_pid)
>  	the_lnet.ln_pid = requested_pid;
>  
>  	INIT_LIST_HEAD(&the_lnet.ln_test_peers);
> -	INIT_LIST_HEAD(&the_lnet.ln_peers);
>  	INIT_LIST_HEAD(&the_lnet.ln_remote_peer_ni_list);
>  	INIT_LIST_HEAD(&the_lnet.ln_nets);
>  	INIT_LIST_HEAD(&the_lnet.ln_routers);
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
> index 99d8b22356bb..4c1eef907dc7 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-move.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
> @@ -1388,7 +1388,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  			peer_net = lnet_peer_get_net_locked(
>  				peer, LNET_NIDNET(best_lpni->lpni_nid));
>  			list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
> -					    lpni_on_peer_net_list) {
> +					    lpni_peer_nis) {
>  				if (lpni->lpni_pref_nnids == 0)
>  					continue;
>  				LASSERT(lpni->lpni_pref_nnids == 1);
> @@ -1411,7 +1411,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  			}
>  			lpni = list_entry(peer_net->lpn_peer_nis.next,
>  					  struct lnet_peer_ni,
> -					  lpni_on_peer_net_list);
> +					  lpni_peer_nis);
>  		}
>  		/* Set preferred NI if necessary. */
>  		if (lpni->lpni_pref_nnids == 0)
> @@ -1443,7 +1443,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  	 * then the best route is chosen. If all routes are equal then
>  	 * they are used in round robin.
>  	 */
> -	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_on_peer_list) {
> +	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
>  		if (!lnet_is_peer_net_healthy_locked(peer_net))
>  			continue;
>  
> @@ -1453,7 +1453,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  
>  			lpni = list_entry(peer_net->lpn_peer_nis.next,
>  					  struct lnet_peer_ni,
> -					  lpni_on_peer_net_list);
> +					  lpni_peer_nis);
>  
>  			net_gw = lnet_find_route_locked(NULL,
>  							lpni->lpni_nid,
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 09c1b5516f6b..d7a0a2f3bdd9 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c

INIT_LIST_HEAD(&ptable->pt_peer_list); seems to be missing from
lnet_peer_tables_create(). The call is present in the patch merged into
lustre-testing. Other than that it looks okay.
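
[Editorial sketch, not part of the original mail: a list head must be
initialized before list_empty()/list_add_tail() can be used on it, which
is why the missing INIT_LIST_HEAD() matters. The model below is a
simplified userspace stand-in, not the kernel's <linux/list.h>, and
lnet_peer_table is reduced to just the new field.]

```c
#include <assert.h>

/* Simplified model of the kernel's circular doubly-linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Stand-in for the per-CPT table; only the newly added list is modeled. */
struct lnet_peer_table {
	struct list_head pt_peer_list;
};

/* Returns 1 if an initialized pt_peer_list behaves as expected. */
static int demo(void)
{
	struct lnet_peer_table ptable;
	struct list_head peer;

	/* Without this call the head holds garbage pointers, and
	 * list_empty()/list_add_tail() would dereference them. */
	INIT_LIST_HEAD(&ptable.pt_peer_list);
	if (!list_empty(&ptable.pt_peer_list))
		return 0;
	list_add_tail(&peer, &ptable.pt_peer_list);
	return !list_empty(&ptable.pt_peer_list);
}
```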

> @@ -118,7 +118,7 @@ lnet_peer_ni_alloc(lnet_nid_t nid)
>  	INIT_LIST_HEAD(&lpni->lpni_rtrq);
>  	INIT_LIST_HEAD(&lpni->lpni_routes);
>  	INIT_LIST_HEAD(&lpni->lpni_hashlist);
> -	INIT_LIST_HEAD(&lpni->lpni_on_peer_net_list);
> +	INIT_LIST_HEAD(&lpni->lpni_peer_nis);
>  	INIT_LIST_HEAD(&lpni->lpni_on_remote_peer_ni_list);
>  
>  	spin_lock_init(&lpni->lpni_lock);
> @@ -150,7 +150,7 @@ lnet_peer_ni_alloc(lnet_nid_t nid)
>  			      &the_lnet.ln_remote_peer_ni_list);
>  	}
>  
> -	/* TODO: update flags */
> +	CDEBUG(D_NET, "%p nid %s\n", lpni, libcfs_nid2str(lpni->lpni_nid));
>  
>  	return lpni;
>  }
> @@ -164,13 +164,32 @@ lnet_peer_net_alloc(u32 net_id)
>  	if (!lpn)
>  		return NULL;
>  
> -	INIT_LIST_HEAD(&lpn->lpn_on_peer_list);
> +	INIT_LIST_HEAD(&lpn->lpn_peer_nets);
>  	INIT_LIST_HEAD(&lpn->lpn_peer_nis);
>  	lpn->lpn_net_id = net_id;
>  
> +	CDEBUG(D_NET, "%p net %s\n", lpn, libcfs_net2str(lpn->lpn_net_id));
> +
>  	return lpn;
>  }
>  
> +void
> +lnet_destroy_peer_net_locked(struct lnet_peer_net *lpn)
> +{
> +	struct lnet_peer *lp;
> +
> +	CDEBUG(D_NET, "%p net %s\n", lpn, libcfs_net2str(lpn->lpn_net_id));
> +
> +	LASSERT(atomic_read(&lpn->lpn_refcount) == 0);
> +	LASSERT(list_empty(&lpn->lpn_peer_nis));
> +	LASSERT(list_empty(&lpn->lpn_peer_nets));
> +	lp = lpn->lpn_peer;
> +	lpn->lpn_peer = NULL;
> +	kfree(lpn);
> +
> +	lnet_peer_decref_locked(lp);
> +}
> +
>  static struct lnet_peer *
>  lnet_peer_alloc(lnet_nid_t nid)
>  {
> @@ -180,53 +199,73 @@ lnet_peer_alloc(lnet_nid_t nid)
>  	if (!lp)
>  		return NULL;
>  
> -	INIT_LIST_HEAD(&lp->lp_on_lnet_peer_list);
> +	INIT_LIST_HEAD(&lp->lp_peer_list);
>  	INIT_LIST_HEAD(&lp->lp_peer_nets);
>  	spin_lock_init(&lp->lp_lock);
>  	lp->lp_primary_nid = nid;
> +	lp->lp_cpt = lnet_nid_cpt_hash(nid, LNET_CPT_NUMBER);
>  
> -	/* TODO: update flags */
> +	CDEBUG(D_NET, "%p nid %s\n", lp, libcfs_nid2str(lp->lp_primary_nid));
>  
>  	return lp;
>  }
>  
> +void
> +lnet_destroy_peer_locked(struct lnet_peer *lp)
> +{
> +	CDEBUG(D_NET, "%p nid %s\n", lp, libcfs_nid2str(lp->lp_primary_nid));
> +
> +	LASSERT(atomic_read(&lp->lp_refcount) == 0);
> +	LASSERT(list_empty(&lp->lp_peer_nets));
> +	LASSERT(list_empty(&lp->lp_peer_list));
> +
> +	kfree(lp);
> +}
> +
> +/*
> + * Detach a peer_ni from its peer_net. If this was the last peer_ni on
> + * that peer_net, detach the peer_net from the peer.
> + *
> + * Call with lnet_net_lock/EX held
> + */
>  static void
> -lnet_peer_detach_peer_ni(struct lnet_peer_ni *lpni)
> +lnet_peer_detach_peer_ni_locked(struct lnet_peer_ni *lpni)
>  {
> +	struct lnet_peer_table *ptable;
>  	struct lnet_peer_net *lpn;
>  	struct lnet_peer *lp;
>  
> -	/* TODO: could the below situation happen? accessing an already
> -	 * destroyed peer?
> +	/*
> +	 * Belts and suspenders: gracefully handle teardown of a
> +	 * partially connected peer_ni.
>  	 */
> -	if (!lpni->lpni_peer_net ||
> -	    !lpni->lpni_peer_net->lpn_peer)
> -		return;
> -
>  	lpn = lpni->lpni_peer_net;
> -	lp = lpni->lpni_peer_net->lpn_peer;
>  
> -	CDEBUG(D_NET, "peer %s NID %s\n",
> -	       libcfs_nid2str(lp->lp_primary_nid),
> -	       libcfs_nid2str(lpni->lpni_nid));
> -
> -	list_del_init(&lpni->lpni_on_peer_net_list);
> -	lpni->lpni_peer_net = NULL;
> +	list_del_init(&lpni->lpni_peer_nis);
> +	/*
> +	 * If there are no lpni's left, we detach lpn from
> +	 * lp_peer_nets, so it cannot be found anymore.
> +	 */
> +	if (list_empty(&lpn->lpn_peer_nis))
> +		list_del_init(&lpn->lpn_peer_nets);
>  
> -	/* if lpn is empty, then remove it from the peer */
> -	if (list_empty(&lpn->lpn_peer_nis)) {
> -		list_del_init(&lpn->lpn_on_peer_list);
> -		lpn->lpn_peer = NULL;
> -		kfree(lpn);
> +	/* Update peer NID count. */
> +	lp = lpn->lpn_peer;
> +	ptable = the_lnet.ln_peer_tables[lp->lp_cpt];
> +	lp->lp_nnis--;
> +	ptable->pt_peer_nnids--;
>  
> -		/* If the peer is empty then remove it from the
> -		 * the_lnet.ln_peers.
> -		 */
> -		if (list_empty(&lp->lp_peer_nets)) {
> -			list_del_init(&lp->lp_on_lnet_peer_list);
> -			kfree(lp);
> -		}
> +	/*
> +	 * If there are no more peer nets, make the peer unfindable
> +	 * via the peer_tables.
> +	 */
> +	if (list_empty(&lp->lp_peer_nets)) {
> +		list_del_init(&lp->lp_peer_list);
> +		ptable->pt_peers--;
>  	}
> +	CDEBUG(D_NET, "peer %s NID %s\n",
> +	       libcfs_nid2str(lp->lp_primary_nid),
> +	       libcfs_nid2str(lpni->lpni_nid));
>  }
>  
>  /* called with lnet_net_lock LNET_LOCK_EX held */
> @@ -268,9 +307,9 @@ lnet_peer_ni_del_locked(struct lnet_peer_ni *lpni)
>  	spin_unlock(&ptable->pt_zombie_lock);
>  
>  	/* no need to keep this peer_ni on the hierarchy anymore */
> -	lnet_peer_detach_peer_ni(lpni);
> +	lnet_peer_detach_peer_ni_locked(lpni);
>  
> -	/* decrement reference on peer_ni */
> +	/* remove hashlist reference on peer_ni */
>  	lnet_peer_ni_decref_locked(lpni);
>  
>  	return 0;
> @@ -319,6 +358,8 @@ lnet_peer_tables_create(void)
>  		spin_lock_init(&ptable->pt_zombie_lock);
>  		INIT_LIST_HEAD(&ptable->pt_zombie_list);
>  
> +		INIT_LIST_HEAD(&ptable->pt_peer_list);
> +
>  		for (j = 0; j < LNET_PEER_HASH_SIZE; j++)
>  			INIT_LIST_HEAD(&hash[j]);
>  		ptable->pt_hash = hash; /* sign of initialization */
> @@ -394,7 +435,7 @@ lnet_peer_del_nid(struct lnet_peer *lp, lnet_nid_t nid, unsigned int flags)
>  	 * This function only allows deletion of the primary NID if it
>  	 * is the only NID.
>  	 */
> -	if (nid == lp->lp_primary_nid && lnet_get_num_peer_nis(lp) != 1) {
> +	if (nid == lp->lp_primary_nid && lp->lp_nnis != 1) {
>  		rc = -EBUSY;
>  		goto out;
>  	}
> @@ -560,15 +601,34 @@ struct lnet_peer_ni *
>  lnet_get_peer_ni_idx_locked(int idx, struct lnet_peer_net **lpn,
>  			    struct lnet_peer **lp)
>  {
> +	struct lnet_peer_table	*ptable;
>  	struct lnet_peer_ni	*lpni;
> +	int			lncpt;
> +	int			cpt;
> +
> +	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
>  
> -	list_for_each_entry((*lp), &the_lnet.ln_peers, lp_on_lnet_peer_list) {
> +	for (cpt = 0; cpt < lncpt; cpt++) {
> +		ptable = the_lnet.ln_peer_tables[cpt];
> +		if (ptable->pt_peer_nnids > idx)
> +			break;
> +		idx -= ptable->pt_peer_nnids;
> +	}
> +	if (cpt >= lncpt)
> +		return NULL;
> +
> +	list_for_each_entry((*lp), &ptable->pt_peer_list, lp_peer_list) {
> +		if ((*lp)->lp_nnis <= idx) {
> +			idx -= (*lp)->lp_nnis;
> +			continue;
> +		}
>  		list_for_each_entry((*lpn), &((*lp)->lp_peer_nets),
> -				    lpn_on_peer_list) {
> +				    lpn_peer_nets) {
>  			list_for_each_entry(lpni, &((*lpn)->lpn_peer_nis),
> -					    lpni_on_peer_net_list)
> +					    lpni_peer_nis) {
>  				if (idx-- == 0)
>  					return lpni;
> +			}
>  		}
>  	}
>  
> @@ -584,18 +644,21 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
>  	struct lnet_peer_net *net = peer_net;
>  
>  	if (!prev) {
> -		if (!net)
> +		if (!net) {
> +			if (list_empty(&peer->lp_peer_nets))
> +				return NULL;
> +
>  			net = list_entry(peer->lp_peer_nets.next,
>  					 struct lnet_peer_net,
> -					 lpn_on_peer_list);
> +					 lpn_peer_nets);
> +		}
>  		lpni = list_entry(net->lpn_peer_nis.next, struct lnet_peer_ni,
> -				  lpni_on_peer_net_list);
> +				  lpni_peer_nis);
>  
>  		return lpni;
>  	}
>  
> -	if (prev->lpni_on_peer_net_list.next ==
> -	    &prev->lpni_peer_net->lpn_peer_nis) {
> +	if (prev->lpni_peer_nis.next == &prev->lpni_peer_net->lpn_peer_nis) {
>  		/*
>  		 * if you reached the end of the peer ni list and the peer
>  		 * net is specified then there are no more peer nis in that
> @@ -608,25 +671,25 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
>  		 * we reached the end of this net ni list. move to the
>  		 * next net
>  		 */
> -		if (prev->lpni_peer_net->lpn_on_peer_list.next ==
> +		if (prev->lpni_peer_net->lpn_peer_nets.next ==
>  		    &peer->lp_peer_nets)
>  			/* no more nets and no more NIs. */
>  			return NULL;
>  
>  		/* get the next net */
> -		net = list_entry(prev->lpni_peer_net->lpn_on_peer_list.next,
> +		net = list_entry(prev->lpni_peer_net->lpn_peer_nets.next,
>  				 struct lnet_peer_net,
> -				 lpn_on_peer_list);
> +				 lpn_peer_nets);
>  		/* get the ni on it */
>  		lpni = list_entry(net->lpn_peer_nis.next, struct lnet_peer_ni,
> -				  lpni_on_peer_net_list);
> +				  lpni_peer_nis);
>  
>  		return lpni;
>  	}
>  
>  	/* there are more nis left */
> -	lpni = list_entry(prev->lpni_on_peer_net_list.next,
> -			  struct lnet_peer_ni, lpni_on_peer_net_list);
> +	lpni = list_entry(prev->lpni_peer_nis.next,
> +			  struct lnet_peer_ni, lpni_peer_nis);
>  
>  	return lpni;
>  }
> @@ -902,7 +965,7 @@ struct lnet_peer_net *
>  lnet_peer_get_net_locked(struct lnet_peer *peer, u32 net_id)
>  {
>  	struct lnet_peer_net *peer_net;
> -	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_on_peer_list) {
> +	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
>  		if (peer_net->lpn_net_id == net_id)
>  			return peer_net;
>  	}
> @@ -910,15 +973,20 @@ lnet_peer_get_net_locked(struct lnet_peer *peer, u32 net_id)
>  }
>  
>  /*
> - * Always returns 0, but it the last function called from functions
> + * Attach a peer_ni to a peer_net and peer. This function assumes
> + * peer_ni is not already attached to the peer_net/peer. The peer_ni
> + * may be attached to a different peer, in which case it will be
> + * properly detached first. The whole operation is done atomically.
> + *
> + * Always returns 0.  This is the last function called from functions
>   * that do return an int, so returning 0 here allows the compiler to
>   * do a tail call.
>   */
>  static int
>  lnet_peer_attach_peer_ni(struct lnet_peer *lp,
> -			 struct lnet_peer_net *lpn,
> -			 struct lnet_peer_ni *lpni,
> -			 unsigned int flags)
> +				struct lnet_peer_net *lpn,
> +				struct lnet_peer_ni *lpni,
> +				unsigned int flags)
>  {
>  	struct lnet_peer_table *ptable;
>  
> @@ -932,26 +1000,38 @@ lnet_peer_attach_peer_ni(struct lnet_peer *lp,
>  		list_add_tail(&lpni->lpni_hashlist, &ptable->pt_hash[hash]);
>  		ptable->pt_version++;
>  		atomic_inc(&ptable->pt_number);
> +		/* This is the 1st refcount on lpni. */
>  		atomic_inc(&lpni->lpni_refcount);
>  	}
>  
>  	/* Detach the peer_ni from an existing peer, if necessary. */
> -	if (lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer != lp)
> -		lnet_peer_detach_peer_ni(lpni);
> +	if (lpni->lpni_peer_net) {
> +		LASSERT(lpni->lpni_peer_net != lpn);
> +		LASSERT(lpni->lpni_peer_net->lpn_peer != lp);
> +		lnet_peer_detach_peer_ni_locked(lpni);
> +		lnet_peer_net_decref_locked(lpni->lpni_peer_net);
> +		lpni->lpni_peer_net = NULL;
> +	}
>  
>  	/* Add peer_ni to peer_net */
>  	lpni->lpni_peer_net = lpn;
> -	list_add_tail(&lpni->lpni_on_peer_net_list, &lpn->lpn_peer_nis);
> +	list_add_tail(&lpni->lpni_peer_nis, &lpn->lpn_peer_nis);
> +	lnet_peer_net_addref_locked(lpn);
>  
>  	/* Add peer_net to peer */
>  	if (!lpn->lpn_peer) {
>  		lpn->lpn_peer = lp;
> -		list_add_tail(&lpn->lpn_on_peer_list, &lp->lp_peer_nets);
> +		list_add_tail(&lpn->lpn_peer_nets, &lp->lp_peer_nets);
> +		lnet_peer_addref_locked(lp);
> +	}
> +
> +	/* Add peer to global peer list, if necessary */
> +	ptable = the_lnet.ln_peer_tables[lp->lp_cpt];
> +	if (list_empty(&lp->lp_peer_list)) {
> +		list_add_tail(&lp->lp_peer_list, &ptable->pt_peer_list);
> +		ptable->pt_peers++;
>  	}
>  
> -	/* Add peer to global peer list */
> -	if (list_empty(&lp->lp_on_lnet_peer_list))
> -		list_add_tail(&lp->lp_on_lnet_peer_list, &the_lnet.ln_peers);
>  
>  	/* Update peer state */
>  	spin_lock(&lp->lp_lock);
> @@ -967,6 +1047,8 @@ lnet_peer_attach_peer_ni(struct lnet_peer *lp,
>  	}
>  	spin_unlock(&lp->lp_lock);
>  
> +	lp->lp_nnis++;
> +	the_lnet.ln_peer_tables[lp->lp_cpt]->pt_peer_nnids++;
>  	lnet_net_unlock(LNET_LOCK_EX);
>  
>  	CDEBUG(D_NET, "peer %s NID %s flags %#x\n",
> @@ -1314,12 +1396,17 @@ void
>  lnet_destroy_peer_ni_locked(struct lnet_peer_ni *lpni)
>  {
>  	struct lnet_peer_table *ptable;
> +	struct lnet_peer_net *lpn;
> +
> +	CDEBUG(D_NET, "%p nid %s\n", lpni, libcfs_nid2str(lpni->lpni_nid));
>  
>  	LASSERT(atomic_read(&lpni->lpni_refcount) == 0);
>  	LASSERT(lpni->lpni_rtr_refcount == 0);
>  	LASSERT(list_empty(&lpni->lpni_txq));
>  	LASSERT(lpni->lpni_txqnob == 0);
>  
> +	lpn = lpni->lpni_peer_net;
> +	lpni->lpni_peer_net = NULL;
>  	lpni->lpni_net = NULL;
>  
>  	/* remove the peer ni from the zombie list */
> @@ -1332,6 +1419,8 @@ lnet_destroy_peer_ni_locked(struct lnet_peer_ni *lpni)
>  	if (lpni->lpni_pref_nnids > 1)
>  		kfree(lpni->lpni_pref.nids);
>  	kfree(lpni);
> +
> +	lnet_peer_net_decref_locked(lpn);
>  }
>  
>  struct lnet_peer_ni *
> @@ -1518,6 +1607,7 @@ lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
>  	return found ? 0 : -ENOENT;
>  }
>  
> +/* ln_api_mutex is held, which keeps the peer list stable */
>  int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
>  		       bool *mr,
>  		       struct lnet_peer_ni_credit_info __user *peer_ni_info,
> 
> 
> 
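[Editorial sketch, not part of the original mail: the reworked
lnet_get_peer_ni_idx_locked() above resolves a flat index by skipping
whole per-CPT tables by pt_peer_nnids, then whole peers by lp_nnis,
before stepping NI by NI. The userspace model below mirrors that
skip-by-count walk; names and data layout are simplified stand-ins, not
the kernel structures.]

```c
#include <assert.h>

/* Each peer contributes 'nnis' NIs; each table aggregates peers.
 * Simplified stand-ins for lnet_peer / lnet_peer_table. */
struct peer  { int nnis; };
struct table { int nnids; const struct peer *peers; int npeers; };

/*
 * Resolve flat index 'idx' to (table, peer, ni-within-peer).
 * Mirrors the skip-by-count logic: whole tables first, then whole
 * peers, then the remaining offset is the NI index.
 * Returns 0 on success, -1 if idx is out of range.
 */
static int resolve(const struct table *t, int nt, int idx,
		   int *tbl_out, int *peer_out, int *ni_out)
{
	int i, p;

	for (i = 0; i < nt; i++) {
		if (t[i].nnids > idx)
			break;
		idx -= t[i].nnids;
	}
	if (i >= nt)
		return -1;

	for (p = 0; p < t[i].npeers; p++) {
		if (t[i].peers[p].nnis <= idx) {
			idx -= t[i].peers[p].nnis;
			continue;
		}
		*tbl_out = i;
		*peer_out = p;
		*ni_out = idx;
		return 0;
	}
	return -1;
}
```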

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 15/24] lustre: lnet: add msg_type to lnet_event
  2018-10-07 23:19 ` [lustre-devel] [PATCH 15/24] lustre: lnet: add msg_type to lnet_event NeilBrown
@ 2018-10-14 22:44   ` James Simmons
  2018-10-14 22:44   ` James Simmons
  1 sibling, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 22:44 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Add a msg_type field to the lnet_event structure. This makes
> it possible for an event handler to tell whether LNET_EVENT_SEND
> corresponds to a GET or a PUT message.

Reviewed-by: James Simmons <jsimmons@infradead.org>
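
[Editorial sketch, not part of the original mail: with the new field, an
event handler can branch on the originating message type. The enum
values and names below are illustrative stand-ins, not the kernel
definitions.]

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-ins for the LNet enums; values are made up here. */
enum lnet_event_kind { LNET_EVENT_SEND = 1, LNET_EVENT_REPLY = 2 };
enum lnet_msg_kind   { LNET_MSG_PUT = 1, LNET_MSG_GET = 2 };

struct lnet_event {
	enum lnet_event_kind type;
	unsigned int msg_type;	/* the field this patch adds */
};

/* Before the patch, LNET_EVENT_SEND alone could not tell a handler
 * whether the completed send was for a GET or a PUT. */
static const char *classify(const struct lnet_event *ev)
{
	if (ev->type != LNET_EVENT_SEND)
		return "other";
	return ev->msg_type == LNET_MSG_GET ? "get-sent" : "put-sent";
}
```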
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25785
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../lustre/include/uapi/linux/lnet/lnet-types.h    |    5 +++++
>  drivers/staging/lustre/lnet/lnet/lib-msg.c         |    1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> index e0e4fd259795..1ecf18e4a278 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
> @@ -650,6 +650,11 @@ struct lnet_event {
>  	 * \see LNetPut
>  	 */
>  	__u64			hdr_data;
> +	/**
> +	 * The message type, to ensure a handler for LNET_EVENT_SEND can
> +	 * distinguish between LNET_MSG_GET and LNET_MSG_PUT.
> +	 */
> +	__u32               msg_type;
>  	/**
>  	 * Indicates the completion status of the operation. It's 0 for
>  	 * successful operations, otherwise it's an error code.
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-msg.c b/drivers/staging/lustre/lnet/lnet/lib-msg.c
> index 1817e54a16a5..db13d01d366f 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-msg.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-msg.c
> @@ -63,6 +63,7 @@ lnet_build_msg_event(struct lnet_msg *msg, enum lnet_event_kind ev_type)
>  	LASSERT(!msg->msg_routing);
>  
>  	ev->type = ev_type;
> +	ev->msg_type = msg->msg_type;
>  
>  	if (ev_type == LNET_EVENT_SEND) {
>  		/* event for active message */
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 16/24] lustre: lnet: add discovery thread
  2018-10-07 23:19 ` [lustre-devel] [PATCH 16/24] lustre: lnet: add discovery thread NeilBrown
@ 2018-10-14 22:51   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 22:51 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Add the discovery thread, which will be used to handle peer
> discovery. This change adds the thread and the infrastructure
> that starts and stops it. The thread itself does trivial work.
> 
> Peer Discovery gets its own event queue (ln_dc_eqh), a queue
> for peers that are to be discovered (ln_dc_request), a queue
> for peers waiting for an event (ln_dc_working), a wait queue
> head so the thread can sleep (ln_dc_waitq), and start/stop
> state (ln_dc_state).
> 
> Peer discovery is started from lnet_select_pathway(), for
> GET and PUT messages not sent to the LNET_RESERVED_PORTAL.
> This criterion means that discovery will not be triggered by
> the messages used in discovery, and neither will an LNet ping
> trigger it.

Reviewed-by: James Simmons <jsimmons@infradead.org>
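
[Editorial sketch, not part of the original mail: a minimal userspace
model of the start/stop lifecycle the patch describes, using the three
ln_dc_state values it adds. The thread body, event queue, and wait-queue
handshaking are elided; everything beyond the state constants is an
assumption for illustration.]

```c
#include <assert.h>

/* The three discovery-thread states from the patch. */
#define LNET_DC_STATE_SHUTDOWN	0	/* not started */
#define LNET_DC_STATE_RUNNING	1	/* started up OK */
#define LNET_DC_STATE_STOPPING	2	/* telling thread to stop */

/* Hypothetical wrapper; the real code keeps this state in 'the_lnet'. */
struct dc { int ln_dc_state; };

static int dc_start(struct dc *d)
{
	if (d->ln_dc_state != LNET_DC_STATE_SHUTDOWN)
		return -1;		/* refuse a double start */
	/* real code: allocate ln_dc_eqh, spawn the discovery kthread */
	d->ln_dc_state = LNET_DC_STATE_RUNNING;
	return 0;
}

static void dc_stop(struct dc *d)
{
	if (d->ln_dc_state != LNET_DC_STATE_RUNNING)
		return;			/* stop is a no-op otherwise */
	d->ln_dc_state = LNET_DC_STATE_STOPPING;
	/* real code: wake ln_dc_waitq, wait for the thread to exit */
	d->ln_dc_state = LNET_DC_STATE_SHUTDOWN;
}
```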
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> Reviewed-on: https://review.whamcloud.com/25786
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    6 
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   71 ++++
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   31 ++
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |   45 ++-
>  drivers/staging/lustre/lnet/lnet/peer.c            |  325 ++++++++++++++++++++
>  5 files changed, 468 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index aad25eb0011b..848d622911a4 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -438,6 +438,7 @@ bool lnet_is_ni_healthy_locked(struct lnet_ni *ni);
>  struct lnet_net *lnet_get_net_locked(u32 net_id);
>  
>  extern unsigned int lnet_numa_range;
> +extern unsigned int lnet_peer_discovery_disabled;
>  extern int portal_rotor;
>  
>  int lnet_lib_init(void);
> @@ -704,6 +705,9 @@ struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
>  struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
>  void lnet_peer_net_added(struct lnet_net *net);
>  lnet_nid_t lnet_peer_primary_nid_locked(lnet_nid_t nid);
> +int lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt);
> +int lnet_peer_discovery_start(void);
> +void lnet_peer_discovery_stop(void);
>  void lnet_peer_tables_cleanup(struct lnet_net *net);
>  void lnet_peer_uninit(void);
>  int lnet_peer_tables_create(void);
> @@ -791,4 +795,6 @@ lnet_peer_ni_is_primary(struct lnet_peer_ni *lpni)
>  	return lpni->lpni_nid == lpni->lpni_peer_net->lpn_peer->lp_primary_nid;
>  }
>  
> +bool lnet_peer_is_uptodate(struct lnet_peer *lp);
> +
>  #endif
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index 260619e19bde..6394a3af50b7 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -520,10 +520,61 @@ struct lnet_peer {
>  
>  	/* peer state flags */
>  	unsigned int		lp_state;
> +
> +	/* link on discovery-related lists */
> +	struct list_head	lp_dc_list;
> +
> +	/* tasks waiting on discovery of this peer */
> +	wait_queue_head_t	lp_dc_waitq;
>  };
>  
> -#define LNET_PEER_MULTI_RAIL	BIT(0)
> -#define LNET_PEER_CONFIGURED	BIT(1)
> +/*
> + * The status flags in lp_state. Their semantics have chosen so that
> + * lp_state can be zero-initialized.
> + *
> + * A peer is marked MULTI_RAIL in two cases: it was configured using DLC
> + * as multi-rail aware, or the LNET_PING_FEAT_MULTI_RAIL bit was set.
> + *
> + * A peer is marked NO_DISCOVERY if the LNET_PING_FEAT_DISCOVERY bit was
> + * NOT set when the peer was pinged by discovery.
> + */
> +#define LNET_PEER_MULTI_RAIL	BIT(0)	/* Multi-rail aware */
> +#define LNET_PEER_NO_DISCOVERY	BIT(1)	/* Peer disabled discovery */
> +/*
> + * A peer is marked CONFIGURED if it was configured by DLC.
> + *
> + * In addition, a peer is marked DISCOVERED if it has fully passed
> + * through Peer Discovery.
> + *
> + * When Peer Discovery is disabled, the discovery thread will mark
> + * peers REDISCOVER to indicate that they should be re-examined if
> + * discovery is (re)enabled on the node.
> + *
> + * A peer that was created as the result of inbound traffic will not
> + * be marked at all.
> + */
> +#define LNET_PEER_CONFIGURED	BIT(2)	/* Configured via DLC */
> +#define LNET_PEER_DISCOVERED	BIT(3)	/* Peer was discovered */
> +#define LNET_PEER_REDISCOVER	BIT(4)	/* Discovery was disabled */
> +/*
> + * A peer is marked DISCOVERING when discovery is in progress.
> + * The other flags below correspond to stages of discovery.
> + */
> +#define LNET_PEER_DISCOVERING	BIT(5)	/* Discovering */
> +#define LNET_PEER_DATA_PRESENT	BIT(6)	/* Remote peer data present */
> +#define LNET_PEER_NIDS_UPTODATE	BIT(7)	/* Remote peer info uptodate */
> +#define LNET_PEER_PING_SENT	BIT(8)	/* Waiting for REPLY to Ping */
> +#define LNET_PEER_PUSH_SENT	BIT(9)	/* Waiting for ACK of Push */
> +#define LNET_PEER_PING_FAILED	BIT(10)	/* Ping send failure */
> +#define LNET_PEER_PUSH_FAILED	BIT(11)	/* Push send failure */
> +/*
> + * A ping can be forced as a way to fix up state, or as a manual
> + * intervention by an admin.
> + * A push can be forced in circumstances that would normally not
> + * allow for one to happen.
> + */
> +#define LNET_PEER_FORCE_PING	BIT(12)	/* Forced Ping */
> +#define LNET_PEER_FORCE_PUSH	BIT(13)	/* Forced Push */
>  
>  struct lnet_peer_net {
>  	/* chain on lp_peer_nets */
> @@ -775,6 +826,11 @@ struct lnet_msg_container {
>  	void			**msc_finalizers;
>  };
>  
> +/* Peer Discovery states */
> +#define LNET_DC_STATE_SHUTDOWN		0	/* not started */
> +#define LNET_DC_STATE_RUNNING		1	/* started up OK */
> +#define LNET_DC_STATE_STOPPING		2	/* telling thread to stop */
> +
>  /* Router Checker states */
>  enum lnet_rc_state {
>  	LNET_RC_STATE_SHUTDOWN,	/* not started */
> @@ -856,6 +912,17 @@ struct lnet {
>  	struct lnet_ping_buffer		 *ln_ping_target;
>  	atomic_t			ln_ping_target_seqno;
>  
> +	/* discovery event queue handle */
> +	struct lnet_handle_eq		ln_dc_eqh;
> +	/* discovery requests */
> +	struct list_head		ln_dc_request;
> +	/* discovery working list */
> +	struct list_head		ln_dc_working;
> +	/* discovery thread wait queue */
> +	wait_queue_head_t		ln_dc_waitq;
> +	/* discovery startup/shutdown state */
> +	int				ln_dc_state;
> +
>  	/* router checker startup/shutdown state */
>  	enum lnet_rc_state		  ln_rc_state;
>  	/* router checker's event queue */
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index c48bcb8722a0..dccfd5bcc459 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -78,6 +78,13 @@ module_param_call(lnet_interfaces_max, intf_max_set, param_get_int,
>  MODULE_PARM_DESC(lnet_interfaces_max,
>  		"Maximum number of interfaces in a node.");
>  
> +unsigned int lnet_peer_discovery_disabled;
> +static int discovery_set(const char *val, const struct kernel_param *kp);
> +module_param_call(lnet_peer_discovery_disabled, discovery_set, param_get_int,
> +		  &lnet_peer_discovery_disabled, 0644);
> +MODULE_PARM_DESC(lnet_peer_discovery_disabled,
> +		 "Set to 1 to disable peer discovery on this node.");
> +
>  /*
>   * This sequence number keeps track of how many times DLC was used to
>   * update the local NIs. It is incremented when a NI is added or
> @@ -90,6 +97,23 @@ static atomic_t lnet_dlc_seq_no = ATOMIC_INIT(0);
>  static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  		     struct lnet_process_id __user *ids, int n_ids);
>  
> +static int
> +discovery_set(const char *val, const struct kernel_param *kp)
> +{
> +	int rc;
> +	unsigned long value;
> +
> +	rc = kstrtoul(val, 0, &value);
> +	if (rc) {
> +		CERROR("Invalid module parameter value for 'lnet_peer_discovery_disabled'\n");
> +		return rc;
> +	}
> +
> +	*(unsigned int *)kp->arg = !!value;
> +
> +	return 0;
> +}
> +
>  static int
>  intf_max_set(const char *val, const struct kernel_param *kp)
>  {
> @@ -1921,6 +1945,10 @@ LNetNIInit(lnet_pid_t requested_pid)
>  	if (rc)
>  		goto err_stop_ping;
>  
> +	rc = lnet_peer_discovery_start();
> +	if (rc != 0)
> +		goto err_stop_router_checker;
> +
>  	lnet_fault_init();
>  	lnet_router_debugfs_init();
>  
> @@ -1928,6 +1956,8 @@ LNetNIInit(lnet_pid_t requested_pid)
>  
>  	return 0;
>  
> +err_stop_router_checker:
> +	lnet_router_checker_stop();
>  err_stop_ping:
>  	lnet_ping_target_fini();
>  err_acceptor_stop:
> @@ -1976,6 +2006,7 @@ LNetNIFini(void)
>  
>  		lnet_fault_fini();
>  		lnet_router_debugfs_fini();
> +		lnet_peer_discovery_stop();
>  		lnet_router_checker_stop();
>  		lnet_ping_target_fini();
>  
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
> index 4c1eef907dc7..4773180cc7b3 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-move.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
> @@ -1208,6 +1208,27 @@ lnet_get_best_ni(struct lnet_net *local_net, struct lnet_ni *cur_ni,
>  	return best_ni;
>  }
>  
> +/*
> + * Traffic to the LNET_RESERVED_PORTAL may not trigger peer discovery,
> + * because such traffic is required to perform discovery. We therefore
> + * exclude all GET and PUT on that portal. We also exclude all ACK and
> + * REPLY traffic, but that is because the portal is not tracked in the
> + * message structure for these message types. We could restrict this
> + * further by also checking for LNET_PROTO_PING_MATCHBITS.
> + */
> +static bool
> +lnet_msg_discovery(struct lnet_msg *msg)
> +{
> +	if (msg->msg_type == LNET_MSG_PUT) {
> +		if (msg->msg_hdr.msg.put.ptl_index != LNET_RESERVED_PORTAL)
> +			return true;
> +	} else if (msg->msg_type == LNET_MSG_GET) {
> +		if (msg->msg_hdr.msg.get.ptl_index != LNET_RESERVED_PORTAL)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  static int
>  lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  		    struct lnet_msg *msg, lnet_nid_t rtr_nid)
> @@ -1220,7 +1241,6 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  	struct lnet_peer *peer;
>  	struct lnet_peer_net *peer_net;
>  	struct lnet_net *local_net;
> -	__u32 seq;
>  	int cpt, cpt2, rc;
>  	bool routing;
>  	bool routing2;
> @@ -1255,13 +1275,6 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  	routing2 = false;
>  	local_found = false;
>  
> -	seq = lnet_get_dlc_seq_locked();
> -
> -	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
> -		lnet_net_unlock(cpt);
> -		return -ESHUTDOWN;
> -	}
> -
>  	/*
>  	 * lnet_nid2peerni_locked() is the path that will find an
>  	 * existing peer_ni, or create one and mark it as having been
> @@ -1272,7 +1285,22 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  		lnet_net_unlock(cpt);
>  		return PTR_ERR(lpni);
>  	}
> +	/*
> +	 * Now that we have a peer_ni, check if we want to discover
> +	 * the peer. Traffic to the LNET_RESERVED_PORTAL should not
> +	 * trigger discovery.
> +	 */
>  	peer = lpni->lpni_peer_net->lpn_peer;
> +	if (lnet_msg_discovery(msg) && !lnet_peer_is_uptodate(peer)) {
> +		rc = lnet_discover_peer_locked(lpni, cpt);
> +		if (rc) {
> +			lnet_peer_ni_decref_locked(lpni);
> +			lnet_net_unlock(cpt);
> +			return rc;
> +		}
> +		/* The peer may have changed. */
> +		peer = lpni->lpni_peer_net->lpn_peer;
> +	}
>  	lnet_peer_ni_decref_locked(lpni);
>  
>  	/* If peer is not healthy then can not send anything to it */
> @@ -1701,6 +1729,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  	 */
>  	cpt2 = lnet_cpt_of_nid_locked(best_lpni->lpni_nid, best_ni);
>  	if (cpt != cpt2) {
> +		__u32 seq = lnet_get_dlc_seq_locked();
>  		lnet_net_unlock(cpt);
>  		cpt = cpt2;
>  		lnet_net_lock(cpt);
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index d7a0a2f3bdd9..038b58414ce0 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -201,6 +201,8 @@ lnet_peer_alloc(lnet_nid_t nid)
>  
>  	INIT_LIST_HEAD(&lp->lp_peer_list);
>  	INIT_LIST_HEAD(&lp->lp_peer_nets);
> +	INIT_LIST_HEAD(&lp->lp_dc_list);
> +	init_waitqueue_head(&lp->lp_dc_waitq);
>  	spin_lock_init(&lp->lp_lock);
>  	lp->lp_primary_nid = nid;
>  	lp->lp_cpt = lnet_nid_cpt_hash(nid, LNET_CPT_NUMBER);
> @@ -1457,6 +1459,10 @@ lnet_nid2peerni_ex(lnet_nid_t nid, int cpt)
>  	return lpni;
>  }
>  
> +/*
> + * Get a peer_ni for the given nid, create it if necessary. Takes a
> + * hold on the peer_ni.
> + */
>  struct lnet_peer_ni *
>  lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref, int cpt)
>  {
> @@ -1510,9 +1516,326 @@ lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref, int cpt)
>  	mutex_unlock(&the_lnet.ln_api_mutex);
>  	lnet_net_lock(cpt);
>  
> +	/* Lock has been dropped, check again for shutdown. */
> +	if (the_lnet.ln_state == LNET_STATE_SHUTDOWN) {
> +		if (!IS_ERR(lpni))
> +			lnet_peer_ni_decref_locked(lpni);
> +		lpni = ERR_PTR(-ESHUTDOWN);
> +	}
> +
>  	return lpni;
>  }
>  
> +/*
> + * Peer Discovery
> + */
> +
> +/*
> + * Is a peer uptodate from the point of view of discovery?
> + *
> + * If it is currently being processed, obviously not.
> + * A forced Ping or Push is also handled by the discovery thread.
> + *
> + * Otherwise look at whether the peer needs rediscovering.
> + */
> +bool
> +lnet_peer_is_uptodate(struct lnet_peer *lp)
> +{
> +	bool rc;
> +
> +	spin_lock(&lp->lp_lock);
> +	if (lp->lp_state & (LNET_PEER_DISCOVERING |
> +			    LNET_PEER_FORCE_PING |
> +			    LNET_PEER_FORCE_PUSH)) {
> +		rc = false;
> +	} else if (lp->lp_state & LNET_PEER_REDISCOVER) {
> +		if (lnet_peer_discovery_disabled)
> +			rc = true;
> +		else
> +			rc = false;
> +	} else if (lp->lp_state & LNET_PEER_DISCOVERED) {
> +		if (lp->lp_state & LNET_PEER_NIDS_UPTODATE)
> +			rc = true;
> +		else
> +			rc = false;
> +	} else {
> +		rc = false;
> +	}
> +	spin_unlock(&lp->lp_lock);
> +
> +	return rc;
> +}
> +
> +/*
> + * Queue a peer for the attention of the discovery thread.  Call with
> + * lnet_net_lock/EX held. Returns 0 if the peer was queued, and
> + * -EALREADY if the peer was already queued.
> + */
> +static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
> +{
> +	int rc;
> +
> +	spin_lock(&lp->lp_lock);
> +	if (!(lp->lp_state & LNET_PEER_DISCOVERING))
> +		lp->lp_state |= LNET_PEER_DISCOVERING;
> +	spin_unlock(&lp->lp_lock);
> +	if (list_empty(&lp->lp_dc_list)) {
> +		lnet_peer_addref_locked(lp);
> +		list_add_tail(&lp->lp_dc_list, &the_lnet.ln_dc_request);
> +		wake_up(&the_lnet.ln_dc_waitq);
> +		rc = 0;
> +	} else {
> +		rc = -EALREADY;
> +	}
> +
> +	return rc;
> +}
> +
> +/*
> + * Discovery of a peer is complete. Wake all waiters on the peer.
> + * Call with lnet_net_lock/EX held.
> + */
> +static void lnet_peer_discovery_complete(struct lnet_peer *lp)
> +{
> +	list_del_init(&lp->lp_dc_list);
> +	wake_up_all(&lp->lp_dc_waitq);
> +	lnet_peer_decref_locked(lp);
> +}
> +
> +/*
> + * Peer discovery slow path. The ln_api_mutex is held on entry, and
> + * dropped/retaken within this function. An lnet_peer_ni is passed in
> + * because discovery could tear down an lnet_peer.
> + */
> +int
> +lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt)
> +{
> +	DEFINE_WAIT(wait);
> +	struct lnet_peer *lp;
> +	int rc = 0;
> +
> +again:
> +	lnet_net_unlock(cpt);
> +	lnet_net_lock(LNET_LOCK_EX);
> +
> +	/* We're willing to be interrupted. */
> +	for (;;) {
> +		lp = lpni->lpni_peer_net->lpn_peer;
> +		prepare_to_wait(&lp->lp_dc_waitq, &wait, TASK_INTERRUPTIBLE);
> +		if (signal_pending(current))
> +			break;
> +		if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
> +			break;
> +		if (lnet_peer_is_uptodate(lp))
> +			break;
> +		lnet_peer_queue_for_discovery(lp);
> +		lnet_peer_addref_locked(lp);
> +		lnet_net_unlock(LNET_LOCK_EX);
> +		schedule();
> +		finish_wait(&lp->lp_dc_waitq, &wait);
> +		lnet_net_lock(LNET_LOCK_EX);
> +		lnet_peer_decref_locked(lp);
> +		/* Do not use lp beyond this point. */
> +	}
> +	finish_wait(&lp->lp_dc_waitq, &wait);
> +
> +	lnet_net_unlock(LNET_LOCK_EX);
> +	lnet_net_lock(cpt);
> +
> +	if (signal_pending(current))
> +		rc = -EINTR;
> +	else if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
> +		rc = -ESHUTDOWN;
> +	else if (!lnet_peer_is_uptodate(lp))
> +		goto again;
> +
> +	return rc;
> +}
> +
> +/*
> + * Event handler for the discovery EQ.
> + *
> + * Called with lnet_res_lock(cpt) held. The cpt is the
> + * lnet_cpt_of_cookie() of the md handle cookie.
> + */
> +static void lnet_discovery_event_handler(struct lnet_event *event)
> +{
> +	wake_up(&the_lnet.ln_dc_waitq);
> +}
> +
> +/*
> + * Wait for work to be queued or some other change that must be
> + * attended to. Returns non-zero if the discovery thread should shut
> + * down.
> + */
> +static int lnet_peer_discovery_wait_for_work(void)
> +{
> +	int cpt;
> +	int rc = 0;
> +
> +	DEFINE_WAIT(wait);
> +
> +	cpt = lnet_net_lock_current();
> +	for (;;) {
> +		prepare_to_wait(&the_lnet.ln_dc_waitq, &wait,
> +				TASK_IDLE);
> +		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
> +			break;
> +		if (!list_empty(&the_lnet.ln_dc_request))
> +			break;
> +		lnet_net_unlock(cpt);
> +		schedule();
> +		finish_wait(&the_lnet.ln_dc_waitq, &wait);
> +		cpt = lnet_net_lock_current();
> +	}
> +	finish_wait(&the_lnet.ln_dc_waitq, &wait);
> +
> +	if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
> +		rc = -ESHUTDOWN;
> +
> +	lnet_net_unlock(cpt);
> +
> +	CDEBUG(D_NET, "woken: %d\n", rc);
> +
> +	return rc;
> +}
> +
> +/* The discovery thread. */
> +static int lnet_peer_discovery(void *arg)
> +{
> +	struct lnet_peer *lp;
> +
> +	CDEBUG(D_NET, "started\n");
> +
> +	for (;;) {
> +		if (lnet_peer_discovery_wait_for_work())
> +			break;
> +
> +		lnet_net_lock(LNET_LOCK_EX);
> +		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
> +			break;
> +		while (!list_empty(&the_lnet.ln_dc_request)) {
> +			lp = list_first_entry(&the_lnet.ln_dc_request,
> +					      struct lnet_peer, lp_dc_list);
> +			list_move(&lp->lp_dc_list, &the_lnet.ln_dc_working);
> +			lnet_net_unlock(LNET_LOCK_EX);
> +
> +			/* Just tag and release for now. */
> +			spin_lock(&lp->lp_lock);
> +			if (lnet_peer_discovery_disabled) {
> +				lp->lp_state |= LNET_PEER_REDISCOVER;
> +				lp->lp_state &= ~(LNET_PEER_DISCOVERED |
> +						  LNET_PEER_NIDS_UPTODATE |
> +						  LNET_PEER_DISCOVERING);
> +			} else {
> +				lp->lp_state |= (LNET_PEER_DISCOVERED |
> +						 LNET_PEER_NIDS_UPTODATE);
> +				lp->lp_state &= ~(LNET_PEER_REDISCOVER |
> +						  LNET_PEER_DISCOVERING);
> +			}
> +			spin_unlock(&lp->lp_lock);
> +
> +			lnet_net_lock(LNET_LOCK_EX);
> +			if (!(lp->lp_state & LNET_PEER_DISCOVERING))
> +				lnet_peer_discovery_complete(lp);
> +			if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
> +				break;
> +		}
> +		lnet_net_unlock(LNET_LOCK_EX);
> +	}
> +
> +	CDEBUG(D_NET, "stopping\n");
> +	/*
> +	 * Clean up before telling lnet_peer_discovery_stop() that
> +	 * we're done. Use wake_up() below to somewhat reduce the
> +	 * size of the thundering herd if there are multiple threads
> +	 * waiting on discovery of a single peer.
> +	 */
> +	LNetEQFree(the_lnet.ln_dc_eqh);
> +	LNetInvalidateEQHandle(&the_lnet.ln_dc_eqh);
> +
> +	lnet_net_lock(LNET_LOCK_EX);
> +	list_for_each_entry(lp, &the_lnet.ln_dc_request, lp_dc_list) {
> +		spin_lock(&lp->lp_lock);
> +		lp->lp_state |= LNET_PEER_REDISCOVER;
> +		lp->lp_state &= ~(LNET_PEER_DISCOVERED |
> +				  LNET_PEER_DISCOVERING |
> +				  LNET_PEER_NIDS_UPTODATE);
> +		spin_unlock(&lp->lp_lock);
> +		lnet_peer_discovery_complete(lp);
> +	}
> +	list_for_each_entry(lp, &the_lnet.ln_dc_working, lp_dc_list) {
> +		spin_lock(&lp->lp_lock);
> +		lp->lp_state |= LNET_PEER_REDISCOVER;
> +		lp->lp_state &= ~(LNET_PEER_DISCOVERED |
> +				  LNET_PEER_DISCOVERING |
> +				  LNET_PEER_NIDS_UPTODATE);
> +		spin_unlock(&lp->lp_lock);
> +		lnet_peer_discovery_complete(lp);
> +	}
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	the_lnet.ln_dc_state = LNET_DC_STATE_SHUTDOWN;
> +	wake_up(&the_lnet.ln_dc_waitq);
> +
> +	CDEBUG(D_NET, "stopped\n");
> +
> +	return 0;
> +}
> +
> +/* ln_api_mutex is held on entry. */
> +int lnet_peer_discovery_start(void)
> +{
> +	struct task_struct *task;
> +	int rc;
> +
> +	if (the_lnet.ln_dc_state != LNET_DC_STATE_SHUTDOWN)
> +		return -EALREADY;
> +
> +	INIT_LIST_HEAD(&the_lnet.ln_dc_request);
> +	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
> +	init_waitqueue_head(&the_lnet.ln_dc_waitq);
> +
> +	rc = LNetEQAlloc(0, lnet_discovery_event_handler, &the_lnet.ln_dc_eqh);
> +	if (rc != 0) {
> +		CERROR("Can't allocate discovery EQ: %d\n", rc);
> +		return rc;
> +	}
> +
> +	the_lnet.ln_dc_state = LNET_DC_STATE_RUNNING;
> +	task = kthread_run(lnet_peer_discovery, NULL, "lnet_discovery");
> +	if (IS_ERR(task)) {
> +		rc = PTR_ERR(task);
> +		CERROR("Can't start peer discovery thread: %d\n", rc);
> +
> +		LNetEQFree(the_lnet.ln_dc_eqh);
> +		LNetInvalidateEQHandle(&the_lnet.ln_dc_eqh);
> +
> +		the_lnet.ln_dc_state = LNET_DC_STATE_SHUTDOWN;
> +	}
> +
> +	return rc;
> +}
> +
> +/* ln_api_mutex is held on entry. */
> +void lnet_peer_discovery_stop(void)
> +{
> +	if (the_lnet.ln_dc_state == LNET_DC_STATE_SHUTDOWN)
> +		return;
> +
> +	LASSERT(the_lnet.ln_dc_state == LNET_DC_STATE_RUNNING);
> +	the_lnet.ln_dc_state = LNET_DC_STATE_STOPPING;
> +	wake_up(&the_lnet.ln_dc_waitq);
> +
> +	wait_event(the_lnet.ln_dc_waitq,
> +		   the_lnet.ln_dc_state == LNET_DC_STATE_SHUTDOWN);
> +
> +	LASSERT(list_empty(&the_lnet.ln_dc_request));
> +	LASSERT(list_empty(&the_lnet.ln_dc_working));
> +}
> +
> +/* Debugging */
> +
>  void
>  lnet_debug_peer(lnet_nid_t nid)
>  {
> @@ -1544,6 +1867,8 @@ lnet_debug_peer(lnet_nid_t nid)
>  	lnet_net_unlock(cpt);
>  }
>  
> +/* Gathering information for userspace. */
> +
>  int
>  lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
>  		      char aliveness[LNET_MAX_STR_LEN],
> 
> 
> 
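
The decision tree in lnet_peer_is_uptodate() above reduces to a small amount of flag logic. A minimal userspace model of that check is sketched below; the flag values and the helper name are invented for illustration and do not match the kernel's definitions:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the lnet_peer state bits; values are
 * illustrative only. */
#define LNET_PEER_DISCOVERING	(1 << 0)
#define LNET_PEER_DISCOVERED	(1 << 1)
#define LNET_PEER_REDISCOVER	(1 << 2)
#define LNET_PEER_NIDS_UPTODATE	(1 << 3)
#define LNET_PEER_FORCE_PING	(1 << 4)
#define LNET_PEER_FORCE_PUSH	(1 << 5)

/* Mirrors the decision tree in lnet_peer_is_uptodate(): a peer is up to
 * date only if no discovery work is pending and its NID list is current.
 * A REDISCOVER-marked peer counts as "up to date" only when discovery is
 * disabled, because then nobody is going to refresh it anyway. */
bool peer_is_uptodate(unsigned int state, bool discovery_disabled)
{
	if (state & (LNET_PEER_DISCOVERING |
		     LNET_PEER_FORCE_PING |
		     LNET_PEER_FORCE_PUSH))
		return false;		/* discovery thread still owns it */
	if (state & LNET_PEER_REDISCOVER)
		return discovery_disabled;
	if (state & LNET_PEER_DISCOVERED)
		return (state & LNET_PEER_NIDS_UPTODATE) != 0;
	return false;			/* never discovered */
}
```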

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 17/24] lustre: lnet: add the Push target
  2018-10-07 23:19 ` [lustre-devel] [PATCH 17/24] lustre: lnet: add the Push target NeilBrown
@ 2018-10-14 22:58   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 22:58 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Peer Discovery will send a Push message (same format as an
> LNet Ping) to Multi-Rail capable peers to give the peer the
> list of local interfaces.
> 
> Set up a target buffer for these pushes in the_lnet. The
> size of this buffer defaults to LNET_INTERFACES_MIN, but it
> is resized if required.
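
The resize logic in the patch uses a grow-only retry loop: allocate for the currently desired size, publish the new buffer, then re-check the desired size (which may have been raised concurrently) and retry if it grew. A userspace sketch of just that pattern follows; the struct and names are hypothetical, and the kernel code additionally swaps MD/ME handles under lnet_net_lock:

```c
#include <stdlib.h>

/* Illustrative model of the lnet_push_target_resize() retry pattern. */
struct push_target {
	int	 nnis;		/* capacity of the published buffer, in NIDs */
	int	*nids;		/* the buffer itself */
	int	 want_nnis;	/* desired capacity; may grow concurrently */
};

int push_target_resize(struct push_target *pt)
{
	int nnis, *buf;

again:
	nnis = pt->want_nnis;
	if (nnis <= 0)
		return -1;	/* -EINVAL in the kernel code */
	buf = calloc(nnis, sizeof(*buf));
	if (!buf)
		return -1;	/* -ENOMEM */
	free(pt->nids);		/* kernel: unlink old MD, drop its ref */
	pt->nids = buf;
	pt->nnis = nnis;
	/* Desired size may have been raised meanwhile; grow again if so. */
	if (nnis < pt->want_nnis)
		goto again;
	return 0;
}
```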

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25788
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    8 +
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   25 +++
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |  150 ++++++++++++++++++++
>  drivers/staging/lustre/lnet/lnet/peer.c            |    5 +
>  4 files changed, 187 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 848d622911a4..5632e5aadf41 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -686,6 +686,14 @@ static inline int lnet_ping_buffer_numref(struct lnet_ping_buffer *pbuf)
>  	return atomic_read(&pbuf->pb_refcnt);
>  }
>  
> +static inline int lnet_push_target_resize_needed(void)
> +{
> +	return the_lnet.ln_push_target->pb_nnis < the_lnet.ln_push_target_nnis;
> +}
> +
> +int lnet_push_target_resize(void);
> +void lnet_peer_push_event(struct lnet_event *ev);
> +
>  int lnet_parse_ip2nets(char **networksp, char *ip2nets);
>  int lnet_parse_routes(char *route_str, int *im_a_router);
>  int lnet_parse_networks(struct list_head *nilist, char *networks,
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index 6394a3af50b7..e00c13355d43 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -521,6 +521,18 @@ struct lnet_peer {
>  	/* peer state flags */
>  	unsigned int		lp_state;
>  
> +	/* buffer for data pushed by peer */
> +	struct lnet_ping_buffer	*lp_data;
> +
> +	/* number of NIDs for sizing push data */
> +	int			lp_data_nnis;
> +
> +	/* NI config sequence number of peer */
> +	__u32			lp_peer_seqno;
> +
> +	/* Local NI config sequence number peer knows */
> +	__u32			lp_node_seqno;
> +
>  	/* link on discovery-related lists */
>  	struct list_head	lp_dc_list;
>  
> @@ -912,6 +924,19 @@ struct lnet {
>  	struct lnet_ping_buffer		 *ln_ping_target;
>  	atomic_t			ln_ping_target_seqno;
>  
> +	/*
> +	 * Push Target
> +	 *
> +	 * ln_push_nnis contains the desired size of the push target.
> +	 * The lnet_net_lock is used to handle update races. The old
> +	 * buffer may linger a while after it has been unlinked, in
> +	 * which case the event handler cleans up.
> +	 */
> +	struct lnet_handle_eq		ln_push_target_eq;
> +	struct lnet_handle_md		ln_push_target_md;
> +	struct lnet_ping_buffer		*ln_push_target;
> +	int				ln_push_target_nnis;
> +
>  	/* discovery event queue handle */
>  	struct lnet_handle_eq		ln_dc_eqh;
>  	/* discovery requests */
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index dccfd5bcc459..e6bc54e9de71 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -1268,6 +1268,147 @@ lnet_ping_target_fini(void)
>  	lnet_ping_target_destroy();
>  }
>  
> +/* Resize the push target. */
> +int lnet_push_target_resize(void)
> +{
> +	struct lnet_process_id id = { LNET_NID_ANY, LNET_PID_ANY };
> +	struct lnet_md md = { NULL };
> +	struct lnet_handle_me meh;
> +	struct lnet_handle_md mdh;
> +	struct lnet_handle_md old_mdh;
> +	struct lnet_ping_buffer *pbuf;
> +	struct lnet_ping_buffer *old_pbuf;
> +	int nnis = the_lnet.ln_push_target_nnis;
> +	int rc;
> +
> +	if (nnis <= 0) {
> +		rc = -EINVAL;
> +		goto fail_return;
> +	}
> +again:
> +	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
> +	if (!pbuf) {
> +		rc = -ENOMEM;
> +		goto fail_return;
> +	}
> +
> +	rc = LNetMEAttach(LNET_RESERVED_PORTAL, id,
> +			  LNET_PROTO_PING_MATCHBITS, 0,
> +			  LNET_UNLINK, LNET_INS_AFTER,
> +			  &meh);
> +	if (rc) {
> +		CERROR("Can't create push target ME: %d\n", rc);
> +		goto fail_decref_pbuf;
> +	}
> +
> +	/* initialize md content */
> +	md.start     = &pbuf->pb_info;
> +	md.length    = LNET_PING_INFO_SIZE(nnis);
> +	md.threshold = LNET_MD_THRESH_INF;
> +	md.max_size  = 0;
> +	md.options   = LNET_MD_OP_PUT | LNET_MD_TRUNCATE |
> +		       LNET_MD_MANAGE_REMOTE;
> +	md.user_ptr  = pbuf;
> +	md.eq_handle = the_lnet.ln_push_target_eq;
> +
> +	rc = LNetMDAttach(meh, md, LNET_RETAIN, &mdh);
> +	if (rc) {
> +		CERROR("Can't attach push MD: %d\n", rc);
> +		goto fail_unlink_meh;
> +	}
> +	lnet_ping_buffer_addref(pbuf);
> +
> +	lnet_net_lock(LNET_LOCK_EX);
> +	old_pbuf = the_lnet.ln_push_target;
> +	old_mdh = the_lnet.ln_push_target_md;
> +	the_lnet.ln_push_target = pbuf;
> +	the_lnet.ln_push_target_md = mdh;
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	if (old_pbuf) {
> +		LNetMDUnlink(old_mdh);
> +		lnet_ping_buffer_decref(old_pbuf);
> +	}
> +
> +	if (nnis < the_lnet.ln_push_target_nnis)
> +		goto again;
> +
> +	CDEBUG(D_NET, "nnis %d success\n", nnis);
> +
> +	return 0;
> +
> +fail_unlink_meh:
> +	LNetMEUnlink(meh);
> +fail_decref_pbuf:
> +	lnet_ping_buffer_decref(pbuf);
> +fail_return:
> +	CDEBUG(D_NET, "nnis %d error %d\n", nnis, rc);
> +	return rc;
> +}
> +
> +static void lnet_push_target_event_handler(struct lnet_event *ev)
> +{
> +	struct lnet_ping_buffer *pbuf = ev->md.user_ptr;
> +
> +	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
> +		lnet_swap_pinginfo(pbuf);
> +
> +	if (ev->unlinked)
> +		lnet_ping_buffer_decref(pbuf);
> +}
> +
> +/* Initialize the push target. */
> +static int lnet_push_target_init(void)
> +{
> +	int rc;
> +
> +	if (the_lnet.ln_push_target)
> +		return -EALREADY;
> +
> +	rc = LNetEQAlloc(0, lnet_push_target_event_handler,
> +			 &the_lnet.ln_push_target_eq);
> +	if (rc) {
> +		CERROR("Can't allocate push target EQ: %d\n", rc);
> +		return rc;
> +	}
> +
> +	/* Start at the required minimum; we'll enlarge as needed. */
> +	the_lnet.ln_push_target_nnis = LNET_INTERFACES_MIN;
> +
> +	rc = lnet_push_target_resize();
> +
> +	if (rc) {
> +		LNetEQFree(the_lnet.ln_push_target_eq);
> +		LNetInvalidateEQHandle(&the_lnet.ln_push_target_eq);
> +	}
> +
> +	return rc;
> +}
> +
> +/* Clean up the push target. */
> +static void lnet_push_target_fini(void)
> +{
> +	if (!the_lnet.ln_push_target)
> +		return;
> +
> +	/* Unlink and invalidate to prevent new references. */
> +	LNetMDUnlink(the_lnet.ln_push_target_md);
> +	LNetInvalidateMDHandle(&the_lnet.ln_push_target_md);
> +
> +	/* Wait for the unlink to complete. */
> +	while (lnet_ping_buffer_numref(the_lnet.ln_push_target) > 1) {
> +		CDEBUG(D_NET, "Still waiting for ping data MD to unlink\n");
> +		schedule_timeout_uninterruptible(HZ);
> +	}
> +
> +	lnet_ping_buffer_decref(the_lnet.ln_push_target);
> +	the_lnet.ln_push_target = NULL;
> +	the_lnet.ln_push_target_nnis = 0;
> +
> +	LNetEQFree(the_lnet.ln_push_target_eq);
> +	LNetInvalidateEQHandle(&the_lnet.ln_push_target_eq);
> +}
> +
>  static int
>  lnet_ni_tq_credits(struct lnet_ni *ni)
>  {
> @@ -1945,10 +2086,14 @@ LNetNIInit(lnet_pid_t requested_pid)
>  	if (rc)
>  		goto err_stop_ping;
>  
> -	rc = lnet_peer_discovery_start();
> +	rc = lnet_push_target_init();
>  	if (rc != 0)
>  		goto err_stop_router_checker;
>  
> +	rc = lnet_peer_discovery_start();
> +	if (rc != 0)
> +		goto err_destroy_push_target;
> +
>  	lnet_fault_init();
>  	lnet_router_debugfs_init();
>  
> @@ -1956,6 +2101,8 @@ LNetNIInit(lnet_pid_t requested_pid)
>  
>  	return 0;
>  
> +err_destroy_push_target:
> +	lnet_push_target_fini();
>  err_stop_router_checker:
>  	lnet_router_checker_stop();
>  err_stop_ping:
> @@ -2007,6 +2154,7 @@ LNetNIFini(void)
>  		lnet_fault_fini();
>  		lnet_router_debugfs_fini();
>  		lnet_peer_discovery_stop();
> +		lnet_push_target_fini();
>  		lnet_router_checker_stop();
>  		lnet_ping_target_fini();
>  
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 038b58414ce0..b78f99c354de 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -1681,6 +1681,8 @@ static int lnet_peer_discovery_wait_for_work(void)
>  				TASK_IDLE);
>  		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
>  			break;
> +		if (lnet_push_target_resize_needed())
> +			break;
>  		if (!list_empty(&the_lnet.ln_dc_request))
>  			break;
>  		lnet_net_unlock(cpt);
> @@ -1711,6 +1713,9 @@ static int lnet_peer_discovery(void *arg)
>  		if (lnet_peer_discovery_wait_for_work())
>  			break;
>  
> +		if (lnet_push_target_resize_needed())
> +			lnet_push_target_resize();
> +
>  		lnet_net_lock(LNET_LOCK_EX);
>  		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
>  			break;
> 
> 
> 
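
The hunk above adds a third wake-up condition to the prepare_to_wait()/schedule() loop in lnet_peer_discovery_wait_for_work(). As a rough userspace analogue, the same "sleep until a request is queued, a resize is needed, or shutdown is signalled" loop can be written with a pthreads condition variable; everything here (struct and names) is invented for illustration:

```c
#include <pthread.h>
#include <stdbool.h>

struct dc_state {
	pthread_mutex_t	lock;
	pthread_cond_t	waitq;
	bool		stopping;	/* LNET_DC_STATE_STOPPING analogue */
	bool		resize_needed;	/* lnet_push_target_resize_needed() */
	int		nr_requests;	/* !list_empty(&ln_dc_request) */
};

/* Returns true if the caller should shut down, false if there is work. */
bool dc_wait_for_work(struct dc_state *dc)
{
	bool stop;

	pthread_mutex_lock(&dc->lock);
	while (!dc->stopping && !dc->resize_needed && dc->nr_requests == 0)
		pthread_cond_wait(&dc->waitq, &dc->lock);
	stop = dc->stopping;
	pthread_mutex_unlock(&dc->lock);
	return stop;
}
```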

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 18/24] lustre: lnet: implement Peer Discovery
  2018-10-07 23:19 ` [lustre-devel] [PATCH 18/24] lustre: lnet: implement Peer Discovery NeilBrown
@ 2018-10-14 23:33   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 23:33 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Implement Peer Discovery.
> 
> A peer is queued for discovery by lnet_peer_queue_for_discovery().
> This sets LNET_PEER_DISCOVERING to indicate that discovery is in
> progress.
> 
> The discovery thread lnet_peer_discovery() checks the peer and
> updates its state as appropriate.
> 
> If LNET_PEER_DATA_PRESENT is set, then a valid Push message or
> Ping reply has been received. The peer is updated in accordance
> with the data, and LNET_PEER_NIDS_UPTODATE is set.
> 
> If LNET_PEER_PING_FAILED is set, then an attempt to send a Ping
> message failed, and peer state is updated accordingly. The discovery
> thread can do some cleanup like unlinking an MD that cannot be done
> from the message event handler.
> 
> If LNET_PEER_PUSH_FAILED is set, then an attempt to send a Push
> message failed, and peer state is updated accordingly. The discovery
> thread can do some cleanup like unlinking an MD that cannot be done
> from the message event handler.
> 
> If LNET_PEER_PING_REQUIRED is set, we must Ping the peer in order to
> correctly update our knowledge of it. This is set, for example, if
> we receive a Push message for a peer, but cannot handle it because
> the Push target was too small. In such a case we know that the
> state of the peer is incorrect, but need to do extra work to obtain
> the required information.
> 
> If discovery is not enabled, then the discovery process stops here
> and the peer is marked with LNET_PEER_UNDISCOVERED. This tells the
> discovery process that it doesn't need to revisit the peer while
> discovery remains disabled.
> 
> If LNET_PEER_NIDS_UPTODATE is not set, then we have reason to think
> the lnet_peer is not up to date, and will Ping it.
> 
> The peer needs a Push if it is multi-rail and the ping buffer
> sequence number for this node is newer than the sequence number it
> has acknowledged receiving by sending an Ack of a Push.
> 
> If none of the above is true, then discovery has completed its work
> on the peer.
> 
> Discovery signals that it is done with a peer by clearing the
> LNET_PEER_DISCOVERING flag, and setting LNET_PEER_DISCOVERED or
> LNET_PEER_UNDISCOVERED as appropriate. It then dequeues the peer
> and clears the LNET_PEER_QUEUED flag.
> 
> When the local node is discovered via the loopback network, the
> peer structure that is created will have an lnet_peer_ni for the
> local loopback interface. Subsequent traffic from this node to
> itself will use the loopback net.
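
The completion step described above (clear LNET_PEER_DISCOVERING, set DISCOVERED or UNDISCOVERED, dequeue by clearing LNET_PEER_QUEUED) can be modelled as a pure flag transformation. The flag values below are invented for illustration and do not match the kernel's definitions:

```c
/* Hypothetical stand-ins for the relevant lnet_peer state bits. */
#define PEER_QUEUED		(1 << 0)
#define PEER_DISCOVERING	(1 << 1)
#define PEER_DISCOVERED		(1 << 2)
#define PEER_UNDISCOVERED	(1 << 3)

/* Sketch of how discovery signals completion: drop the in-progress and
 * queued bits, then mark the peer according to whether discovery ran. */
unsigned int discovery_finalize(unsigned int state, int discovery_enabled)
{
	state &= ~(PEER_DISCOVERING | PEER_QUEUED);	/* done, dequeued */
	if (discovery_enabled) {
		state |= PEER_DISCOVERED;
		state &= ~PEER_UNDISCOVERED;
	} else {
		state |= PEER_UNDISCOVERED;
		state &= ~PEER_DISCOVERED;
	}
	return state;
}
```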

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25789
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   20 
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   39 +
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   59 +
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |   18 
>  drivers/staging/lustre/lnet/lnet/peer.c            | 1499 +++++++++++++++++++-
>  5 files changed, 1543 insertions(+), 92 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 5632e5aadf41..f82a699371f2 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -76,6 +76,9 @@ extern struct lnet the_lnet;	/* THE network */
>  #define LNET_ACCEPTOR_MIN_RESERVED_PORT    512
>  #define LNET_ACCEPTOR_MAX_RESERVED_PORT    1023
>  
> +/* Discovery timeout - same as default peer_timeout */
> +#define DISCOVERY_TIMEOUT	180
> +
>  static inline int lnet_is_route_alive(struct lnet_route *route)
>  {
>  	/* gateway is down */
> @@ -713,9 +716,10 @@ struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
>  struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
>  void lnet_peer_net_added(struct lnet_net *net);
>  lnet_nid_t lnet_peer_primary_nid_locked(lnet_nid_t nid);
> -int lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt);
> +int lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block);
>  int lnet_peer_discovery_start(void);
>  void lnet_peer_discovery_stop(void);
> +void lnet_push_update_to_peers(int force);
>  void lnet_peer_tables_cleanup(struct lnet_net *net);
>  void lnet_peer_uninit(void);
>  int lnet_peer_tables_create(void);
> @@ -805,4 +809,18 @@ lnet_peer_ni_is_primary(struct lnet_peer_ni *lpni)
>  
>  bool lnet_peer_is_uptodate(struct lnet_peer *lp);
>  
> +static inline bool
> +lnet_peer_needs_push(struct lnet_peer *lp)
> +{
> +	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL))
> +		return false;
> +	if (lp->lp_state & LNET_PEER_FORCE_PUSH)
> +		return true;
> +	if (lp->lp_state & LNET_PEER_NO_DISCOVERY)
> +		return false;
> +	if (lp->lp_node_seqno < atomic_read(&the_lnet.ln_ping_target_seqno))
> +		return true;
> +	return false;
> +}
> +
>  #endif
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index e00c13355d43..07baa86e61ab 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -67,6 +67,13 @@ struct lnet_msg {
>  	lnet_nid_t		msg_from;
>  	__u32			msg_type;
>  
> +	/*
> +	 * hold parameters in case message is withheld due
> +	 * to discovery
> +	 */
> +	lnet_nid_t		msg_src_nid_param;
> +	lnet_nid_t		msg_rtr_nid_param;
> +
>  	/* committed for sending */
>  	unsigned int		msg_tx_committed:1;
>  	/* CPT # this message committed for sending */
> @@ -395,6 +402,8 @@ struct lnet_ping_buffer {
>  #define LNET_PING_BUFFER_LONI(PBUF)	((PBUF)->pb_info.pi_ni[0].ns_nid)
>  #define LNET_PING_BUFFER_SEQNO(PBUF)	((PBUF)->pb_info.pi_ni[0].ns_status)
>  
> +#define LNET_PING_INFO_TO_BUFFER(PINFO)	\
> +	container_of((PINFO), struct lnet_ping_buffer, pb_info)
>  
>  /* router checker data, per router */
>  struct lnet_rc_data {
> @@ -503,6 +512,9 @@ struct lnet_peer {
>  	/* list of peer nets */
>  	struct list_head	lp_peer_nets;
>  
> +	/* list of messages pending discovery */
> +	struct list_head	lp_dc_pendq;
> +
>  	/* primary NID of the peer */
>  	lnet_nid_t		lp_primary_nid;
>  
> @@ -524,15 +536,36 @@ struct lnet_peer {
>  	/* buffer for data pushed by peer */
>  	struct lnet_ping_buffer	*lp_data;
>  
> +	/* MD handle for ping in progress */
> +	struct lnet_handle_md	lp_ping_mdh;
> +
> +	/* MD handle for push in progress */
> +	struct lnet_handle_md	lp_push_mdh;
> +
>  	/* number of NIDs for sizing push data */
>  	int			lp_data_nnis;
>  
>  	/* NI config sequence number of peer */
>  	__u32			lp_peer_seqno;
>  
> -	/* Local NI config sequence number peer knows */
> +	/* Local NI config sequence number acked by peer */
>  	__u32			lp_node_seqno;
>  
> +	/* Local NI config sequence number sent to peer */
> +	__u32			lp_node_seqno_sent;
> +
> +	/* Ping error encountered during discovery. */
> +	int			lp_ping_error;
> +
> +	/* Push error encountered during discovery. */
> +	int			lp_push_error;
> +
> +	/* Error encountered during discovery. */
> +	int			lp_dc_error;
> +
> +	/* time it was put on the ln_dc_working queue */
> +	time64_t		lp_last_queued;
> +
>  	/* link on discovery-related lists */
>  	struct list_head	lp_dc_list;
>  
> @@ -691,6 +724,8 @@ struct lnet_remotenet {
>  #define LNET_CREDIT_OK		0
>  /** lnet message is waiting for credit */
>  #define LNET_CREDIT_WAIT	1
> +/** lnet message is waiting for discovery */
> +#define LNET_DC_WAIT		2
>  
>  struct lnet_rtrbufpool {
>  	struct list_head	rbp_bufs;	/* my free buffer pool */
> @@ -943,6 +978,8 @@ struct lnet {
>  	struct list_head		ln_dc_request;
>  	/* discovery working list */
>  	struct list_head		ln_dc_working;
> +	/* discovery expired list */
> +	struct list_head		ln_dc_expired;
>  	/* discovery thread wait queue */
>  	wait_queue_head_t		ln_dc_waitq;
>  	/* discovery startup/shutdown state */
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index e6bc54e9de71..955d1711eda4 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -41,7 +41,14 @@
>  
>  #define D_LNI D_CONSOLE
>  
> -struct lnet the_lnet;		/* THE state of the network */
> +/*
> + * initialize ln_api_mutex statically, since it needs to be used in
> + * discovery_set callback. That module parameter callback can be called
> + * before module init completes. The mutex needs to be ready for use then.
> + */
> +struct lnet the_lnet = {
> +	.ln_api_mutex = __MUTEX_INITIALIZER(the_lnet.ln_api_mutex),
> +};		/* THE state of the network */
>  EXPORT_SYMBOL(the_lnet);
>  
>  static char *ip2nets = "";
> @@ -101,7 +108,9 @@ static int
>  discovery_set(const char *val, const struct kernel_param *kp)
>  {
>  	int rc;
> +	unsigned int *discovery = (unsigned int *)kp->arg;
>  	unsigned long value;
> +	struct lnet_ping_buffer *pbuf;
>  
>  	rc = kstrtoul(val, 0, &value);
>  	if (rc) {
> @@ -109,7 +118,38 @@ discovery_set(const char *val, const struct kernel_param *kp)
>  		return rc;
>  	}
>  
> -	*(unsigned int *)kp->arg = !!value;
> +	value = !!value;
> +
> +	/*
> +	 * The purpose of locking the api_mutex here is to ensure that
> +	 * the correct value ends up stored properly.
> +	 */
> +	mutex_lock(&the_lnet.ln_api_mutex);
> +
> +	if (value == *discovery) {
> +		mutex_unlock(&the_lnet.ln_api_mutex);
> +		return 0;
> +	}
> +
> +	*discovery = value;
> +
> +	if (the_lnet.ln_state == LNET_STATE_SHUTDOWN) {
> +		mutex_unlock(&the_lnet.ln_api_mutex);
> +		return 0;
> +	}
> +
> +	/* tell peers that discovery setting has changed */
> +	lnet_net_lock(LNET_LOCK_EX);
> +	pbuf = the_lnet.ln_ping_target;
> +	if (value)
> +		pbuf->pb_info.pi_features &= ~LNET_PING_FEAT_DISCOVERY;
> +	else
> +		pbuf->pb_info.pi_features |= LNET_PING_FEAT_DISCOVERY;
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	lnet_push_update_to_peers(1);
> +
> +	mutex_unlock(&the_lnet.ln_api_mutex);
>  
>  	return 0;
>  }
> @@ -171,7 +211,6 @@ lnet_init_locks(void)
>  	init_waitqueue_head(&the_lnet.ln_eq_waitq);
>  	init_waitqueue_head(&the_lnet.ln_rc_waitq);
>  	mutex_init(&the_lnet.ln_lnd_mutex);
> -	mutex_init(&the_lnet.ln_api_mutex);
>  }
>  
>  static int
> @@ -654,6 +693,10 @@ lnet_prepare(lnet_pid_t requested_pid)
>  	INIT_LIST_HEAD(&the_lnet.ln_routers);
>  	INIT_LIST_HEAD(&the_lnet.ln_drop_rules);
>  	INIT_LIST_HEAD(&the_lnet.ln_delay_rules);
> +	INIT_LIST_HEAD(&the_lnet.ln_dc_request);
> +	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
> +	INIT_LIST_HEAD(&the_lnet.ln_dc_expired);
> +	init_waitqueue_head(&the_lnet.ln_dc_waitq);
>  
>  	rc = lnet_create_remote_nets_table();
>  	if (rc)
> @@ -998,7 +1041,8 @@ lnet_ping_target_create(int nnis)
>  	pbuf->pb_info.pi_nnis = nnis;
>  	pbuf->pb_info.pi_pid = the_lnet.ln_pid;
>  	pbuf->pb_info.pi_magic = LNET_PROTO_PING_MAGIC;
> -	pbuf->pb_info.pi_features = LNET_PING_FEAT_NI_STATUS;
> +	pbuf->pb_info.pi_features =
> +		LNET_PING_FEAT_NI_STATUS | LNET_PING_FEAT_MULTI_RAIL;
>  
>  	return pbuf;
>  }
> @@ -1231,6 +1275,8 @@ lnet_ping_target_update(struct lnet_ping_buffer *pbuf,
>  
>  	if (!the_lnet.ln_routing)
>  		pbuf->pb_info.pi_features |= LNET_PING_FEAT_RTE_DISABLED;
> +	if (!lnet_peer_discovery_disabled)
> +		pbuf->pb_info.pi_features |= LNET_PING_FEAT_DISCOVERY;
>  
>  	/* Ensure only known feature bits have been set. */
>  	LASSERT(pbuf->pb_info.pi_features & LNET_PING_FEAT_BITS);
> @@ -1252,6 +1298,8 @@ lnet_ping_target_update(struct lnet_ping_buffer *pbuf,
>  		lnet_ping_md_unlink(old_pbuf, &old_ping_md);
>  		lnet_ping_buffer_decref(old_pbuf);
>  	}
> +
> +	lnet_push_update_to_peers(0);
>  }
>  
>  static void
> @@ -1353,6 +1401,7 @@ static void lnet_push_target_event_handler(struct lnet_event *ev)
>  	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
>  		lnet_swap_pinginfo(pbuf);
>  
> +	lnet_peer_push_event(ev);
>  	if (ev->unlinked)
>  		lnet_ping_buffer_decref(pbuf);
>  }
> @@ -1910,8 +1959,6 @@ int lnet_lib_init(void)
>  
>  	lnet_assert_wire_constants();
>  
> -	memset(&the_lnet, 0, sizeof(the_lnet));
> -
>  	/* refer to global cfs_cpt_tab for now */
>  	the_lnet.ln_cpt_table	= cfs_cpt_tab;
>  	the_lnet.ln_cpt_number	= cfs_cpt_number(cfs_cpt_tab);
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
> index 4773180cc7b3..2ff329bf91ba 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-move.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
> @@ -444,6 +444,8 @@ lnet_prep_send(struct lnet_msg *msg, int type, struct lnet_process_id target,
>  
>  	memset(&msg->msg_hdr, 0, sizeof(msg->msg_hdr));
>  	msg->msg_hdr.type	   = cpu_to_le32(type);
> +	/* dest_nid will be overwritten by lnet_select_pathway() */
> +	msg->msg_hdr.dest_nid       = cpu_to_le64(target.nid);
>  	msg->msg_hdr.dest_pid       = cpu_to_le32(target.pid);
>  	/* src_nid will be set later */
>  	msg->msg_hdr.src_pid	= cpu_to_le32(the_lnet.ln_pid);
> @@ -1292,7 +1294,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  	 */
>  	peer = lpni->lpni_peer_net->lpn_peer;
>  	if (lnet_msg_discovery(msg) && !lnet_peer_is_uptodate(peer)) {
> -		rc = lnet_discover_peer_locked(lpni, cpt);
> +		rc = lnet_discover_peer_locked(lpni, cpt, false);
>  		if (rc) {
>  			lnet_peer_ni_decref_locked(lpni);
>  			lnet_net_unlock(cpt);
> @@ -1300,6 +1302,18 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>  		}
>  		/* The peer may have changed. */
>  		peer = lpni->lpni_peer_net->lpn_peer;
> +		/* queue message and return */
> +		msg->msg_src_nid_param = src_nid;
> +		msg->msg_rtr_nid_param = rtr_nid;
> +		msg->msg_sending = 0;
> +		list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
> +		lnet_peer_ni_decref_locked(lpni);
> +		lnet_net_unlock(cpt);
> +
> +		CDEBUG(D_NET, "%s pending discovery\n",
> +		       libcfs_nid2str(peer->lp_primary_nid));
> +
> +		return LNET_DC_WAIT;
>  	}
>  	lnet_peer_ni_decref_locked(lpni);
>  
> @@ -1840,7 +1854,7 @@ lnet_send(lnet_nid_t src_nid, struct lnet_msg *msg, lnet_nid_t rtr_nid)
>  	if (rc == LNET_CREDIT_OK)
>  		lnet_ni_send(msg->msg_txni, msg);
>  
> -	/* rc == LNET_CREDIT_OK or LNET_CREDIT_WAIT */
> +	/* rc == LNET_CREDIT_OK or LNET_CREDIT_WAIT or LNET_DC_WAIT */
>  	return 0;
>  }
>  
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index b78f99c354de..1ef4a44e752e 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -38,6 +38,11 @@
>  #include <linux/lnet/lib-lnet.h>
>  #include <uapi/linux/lnet/lnet-dlc.h>
>  
> +/* Value indicating that discovery needs to re-check a peer immediately. */
> +#define LNET_REDISCOVER_PEER	(1)
> +
> +static int lnet_peer_queue_for_discovery(struct lnet_peer *lp);
> +
>  static void
>  lnet_peer_remove_from_remote_list(struct lnet_peer_ni *lpni)
>  {
> @@ -202,6 +207,7 @@ lnet_peer_alloc(lnet_nid_t nid)
>  	INIT_LIST_HEAD(&lp->lp_peer_list);
>  	INIT_LIST_HEAD(&lp->lp_peer_nets);
>  	INIT_LIST_HEAD(&lp->lp_dc_list);
> +	INIT_LIST_HEAD(&lp->lp_dc_pendq);
>  	init_waitqueue_head(&lp->lp_dc_waitq);
>  	spin_lock_init(&lp->lp_lock);
>  	lp->lp_primary_nid = nid;
> @@ -220,6 +226,10 @@ lnet_destroy_peer_locked(struct lnet_peer *lp)
>  	LASSERT(atomic_read(&lp->lp_refcount) == 0);
>  	LASSERT(list_empty(&lp->lp_peer_nets));
>  	LASSERT(list_empty(&lp->lp_peer_list));
> +	LASSERT(list_empty(&lp->lp_dc_list));
> +
> +	if (lp->lp_data)
> +		lnet_ping_buffer_decref(lp->lp_data);
>  
>  	kfree(lp);
>  }
> @@ -260,10 +270,19 @@ lnet_peer_detach_peer_ni_locked(struct lnet_peer_ni *lpni)
>  	/*
>  	 * If there are no more peer nets, make the peer unfindable
>  	 * via the peer_tables.
> +	 *
> +	 * Otherwise, if the peer is DISCOVERED, tell discovery to
> +	 * take another look at it. This is a no-op if discovery for
> +	 * this peer did the detaching.
>  	 */
>  	if (list_empty(&lp->lp_peer_nets)) {
>  		list_del_init(&lp->lp_peer_list);
>  		ptable->pt_peers--;
> +	} else if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING) {
> +		/* Discovery isn't running, nothing to do here. */
> +	} else if (lp->lp_state & LNET_PEER_DISCOVERED) {
> +		lnet_peer_queue_for_discovery(lp);
> +		wake_up(&the_lnet.ln_dc_waitq);
>  	}
>  	CDEBUG(D_NET, "peer %s NID %s\n",
>  	       libcfs_nid2str(lp->lp_primary_nid),
> @@ -599,6 +618,25 @@ lnet_find_peer_ni_locked(lnet_nid_t nid)
>  	return lpni;
>  }
>  
> +struct lnet_peer *
> +lnet_find_peer(lnet_nid_t nid)
> +{
> +	struct lnet_peer_ni *lpni;
> +	struct lnet_peer *lp = NULL;
> +	int cpt;
> +
> +	cpt = lnet_net_lock_current();
> +	lpni = lnet_find_peer_ni_locked(nid);
> +	if (lpni) {
> +		lp = lpni->lpni_peer_net->lpn_peer;
> +		lnet_peer_addref_locked(lp);
> +		lnet_peer_ni_decref_locked(lpni);
> +	}
> +	lnet_net_unlock(cpt);
> +
> +	return lp;
> +}
> +
>  struct lnet_peer_ni *
>  lnet_get_peer_ni_idx_locked(int idx, struct lnet_peer_net **lpn,
>  			    struct lnet_peer **lp)
> @@ -696,6 +734,37 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
>  	return lpni;
>  }
>  
> +/*
> + * Start pushes to peers that need to be updated for a configuration
> + * change on this node.
> + */
> +void
> +lnet_push_update_to_peers(int force)
> +{
> +	struct lnet_peer_table *ptable;
> +	struct lnet_peer *lp;
> +	int lncpt;
> +	int cpt;
> +
> +	lnet_net_lock(LNET_LOCK_EX);
> +	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
> +	for (cpt = 0; cpt < lncpt; cpt++) {
> +		ptable = the_lnet.ln_peer_tables[cpt];
> +		list_for_each_entry(lp, &ptable->pt_peer_list, lp_peer_list) {
> +			if (force) {
> +				spin_lock(&lp->lp_lock);
> +				if (lp->lp_state & LNET_PEER_MULTI_RAIL)
> +					lp->lp_state |= LNET_PEER_FORCE_PUSH;
> +				spin_unlock(&lp->lp_lock);
> +			}
> +			if (lnet_peer_needs_push(lp))
> +				lnet_peer_queue_for_discovery(lp);
> +		}
> +	}
> +	lnet_net_unlock(LNET_LOCK_EX);
> +	wake_up(&the_lnet.ln_dc_waitq);
> +}
> +
>  /*
>   * Test whether a ni is a preferred ni for this peer_ni, e.g, whether
>   * this is a preferred point-to-point path. Call with lnet_net_lock in
> @@ -941,6 +1010,7 @@ lnet_peer_primary_nid_locked(lnet_nid_t nid)
>  lnet_nid_t
>  LNetPrimaryNID(lnet_nid_t nid)
>  {
> +	struct lnet_peer *lp;
>  	struct lnet_peer_ni *lpni;
>  	lnet_nid_t primary_nid = nid;
>  	int rc = 0;
> @@ -952,7 +1022,15 @@ LNetPrimaryNID(lnet_nid_t nid)
>  		rc = PTR_ERR(lpni);
>  		goto out_unlock;
>  	}
> -	primary_nid = lpni->lpni_peer_net->lpn_peer->lp_primary_nid;
> +	lp = lpni->lpni_peer_net->lpn_peer;
> +	while (!lnet_peer_is_uptodate(lp)) {
> +		rc = lnet_discover_peer_locked(lpni, cpt, true);
> +		if (rc)
> +			goto out_decref;
> +		lp = lpni->lpni_peer_net->lpn_peer;
> +	}
> +	primary_nid = lp->lp_primary_nid;
> +out_decref:
>  	lnet_peer_ni_decref_locked(lpni);
>  out_unlock:
>  	lnet_net_unlock(cpt);
> @@ -1229,6 +1307,30 @@ lnet_peer_add_nid(struct lnet_peer *lp, lnet_nid_t nid, unsigned int flags)
>  	return rc;
>  }
>  
> +/*
> + * Update the primary NID of a peer, if possible.
> + *
> + * Call with the lnet_api_mutex held.
> + */
> +static int
> +lnet_peer_set_primary_nid(struct lnet_peer *lp, lnet_nid_t nid,
> +			  unsigned int flags)
> +{
> +	lnet_nid_t old = lp->lp_primary_nid;
> +	int rc = 0;
> +
> +	if (lp->lp_primary_nid == nid)
> +		goto out;
> +	rc = lnet_peer_add_nid(lp, nid, flags);
> +	if (rc)
> +		goto out;
> +	lp->lp_primary_nid = nid;
> +out:
> +	CDEBUG(D_NET, "peer %s NID %s: %d\n",
> +	       libcfs_nid2str(old), libcfs_nid2str(nid), rc);
> +	return rc;
> +}
> +
>  /*
>   * lpni creation initiated due to traffic either sending or receiving.
>   */
> @@ -1548,11 +1650,15 @@ lnet_peer_is_uptodate(struct lnet_peer *lp)
>  			    LNET_PEER_FORCE_PING |
>  			    LNET_PEER_FORCE_PUSH)) {
>  		rc = false;
> +	} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
> +		rc = true;
>  	} else if (lp->lp_state & LNET_PEER_REDISCOVER) {
>  		if (lnet_peer_discovery_disabled)
>  			rc = true;
>  		else
>  			rc = false;
> +	} else if (lnet_peer_needs_push(lp)) {
> +		rc = false;
>  	} else if (lp->lp_state & LNET_PEER_DISCOVERED) {
>  		if (lp->lp_state & LNET_PEER_NIDS_UPTODATE)
>  			rc = true;
> @@ -1588,6 +1694,9 @@ static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
>  		rc = -EALREADY;
>  	}
>  
> +	CDEBUG(D_NET, "Queue peer %s: %d\n",
> +	       libcfs_nid2str(lp->lp_primary_nid), rc);
> +
>  	return rc;
>  }
>  
> @@ -1597,9 +1706,252 @@ static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
>   */
>  static void lnet_peer_discovery_complete(struct lnet_peer *lp)
>  {
> +	struct lnet_msg *msg = NULL;
> +	int rc = 0;
> +	struct list_head pending_msgs;
> +
> +	INIT_LIST_HEAD(&pending_msgs);
> +
> +	CDEBUG(D_NET, "Discovery complete. Dequeue peer %s\n",
> +	       libcfs_nid2str(lp->lp_primary_nid));
> +
>  	list_del_init(&lp->lp_dc_list);
> +	list_splice_init(&lp->lp_dc_pendq, &pending_msgs);
>  	wake_up_all(&lp->lp_dc_waitq);
> +
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	/* iterate through all pending messages and send them again */
> +	list_for_each_entry(msg, &pending_msgs, msg_list) {
> +		if (lp->lp_dc_error) {
> +			lnet_finalize(msg, lp->lp_dc_error);
> +			continue;
> +		}
> +
> +		CDEBUG(D_NET, "sending pending message %s to target %s\n",
> +		       lnet_msgtyp2str(msg->msg_type),
> +		       libcfs_id2str(msg->msg_target));
> +		rc = lnet_send(msg->msg_src_nid_param, msg,
> +			       msg->msg_rtr_nid_param);
> +		if (rc < 0) {
> +			CNETERR("Error sending %s to %s: %d\n",
> +				lnet_msgtyp2str(msg->msg_type),
> +				libcfs_id2str(msg->msg_target), rc);
> +			lnet_finalize(msg, rc);
> +		}
> +	}
> +	lnet_net_lock(LNET_LOCK_EX);
> +	lnet_peer_decref_locked(lp);
> +}
> +
> +/*
> + * Handle inbound push.
> + * Like any event handler, called with lnet_res_lock/CPT held.
> + */
> +void lnet_peer_push_event(struct lnet_event *ev)
> +{
> +	struct lnet_ping_buffer *pbuf = ev->md.user_ptr;
> +	struct lnet_peer *lp;
> +
> +	/* lnet_find_peer() adds a refcount */
> +	lp = lnet_find_peer(ev->source.nid);
> +	if (!lp) {
> +		CERROR("Push Put from unknown %s (source %s)\n",
> +		       libcfs_nid2str(ev->initiator.nid),
> +		       libcfs_nid2str(ev->source.nid));
> +		return;
> +	}
> +
> +	/* Ensure peer state remains consistent while we modify it. */
> +	spin_lock(&lp->lp_lock);
> +
> +	/*
> +	 * If some kind of error happened the contents of the message
> +	 * cannot be used. Clear the NIDS_UPTODATE and set the
> +	 * FORCE_PING flag to trigger a ping.
> +	 */
> +	if (ev->status) {
> +		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
> +		lp->lp_state |= LNET_PEER_FORCE_PING;
> +		CDEBUG(D_NET, "Push Put error %d from %s (source %s)\n",
> +		       ev->status,
> +		       libcfs_nid2str(lp->lp_primary_nid),
> +		       libcfs_nid2str(ev->source.nid));
> +		goto out;
> +	}
> +
> +	/*
> +	 * A push with invalid or corrupted info. Clear the UPTODATE
> +	 * flag to trigger a ping.
> +	 */
> +	if (lnet_ping_info_validate(&pbuf->pb_info)) {
> +		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
> +		lp->lp_state |= LNET_PEER_FORCE_PING;
> +		CDEBUG(D_NET, "Corrupted Push from %s\n",
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		goto out;
> +	}
> +
> +	/*
> +	 * Make sure we'll allocate the correct size ping buffer when
> +	 * pinging the peer.
> +	 */
> +	if (lp->lp_data_nnis < pbuf->pb_info.pi_nnis)
> +		lp->lp_data_nnis = pbuf->pb_info.pi_nnis;
> +
> +	/*
> +	 * A non-Multi-Rail peer is not supposed to be capable of
> +	 * sending a push.
> +	 */
> +	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL)) {
> +		CERROR("Push from non-Multi-Rail peer %s dropped\n",
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		goto out;
> +	}
> +
> +	/*
> +	 * Check the MULTIRAIL flag. Complain if the peer was DLC
> +	 * configured without it.
> +	 */
> +	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> +		if (lp->lp_state & LNET_PEER_CONFIGURED) {
> +			CERROR("Push says %s is Multi-Rail, DLC says not\n",
> +			       libcfs_nid2str(lp->lp_primary_nid));
> +		} else {
> +			lp->lp_state |= LNET_PEER_MULTI_RAIL;
> +			lnet_peer_clr_non_mr_pref_nids(lp);
> +		}
> +	}
> +
> +	/*
> +	 * The peer may have discovery disabled at its end. Set
> +	 * NO_DISCOVERY as appropriate.
> +	 */
> +	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_DISCOVERY)) {
> +		CDEBUG(D_NET, "Peer %s has discovery disabled\n",
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		lp->lp_state |= LNET_PEER_NO_DISCOVERY;
> +	} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
> +		CDEBUG(D_NET, "Peer %s has discovery enabled\n",
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		lp->lp_state &= ~LNET_PEER_NO_DISCOVERY;
> +	}
> +
> +	/*
> +	 * Check for truncation of the Put message. Clear the
> +	 * NIDS_UPTODATE flag and set FORCE_PING to trigger a ping,
> +	 * and tell discovery to allocate a bigger buffer.
> +	 */
> +	if (pbuf->pb_nnis < pbuf->pb_info.pi_nnis) {
> +		if (the_lnet.ln_push_target_nnis < pbuf->pb_info.pi_nnis)
> +			the_lnet.ln_push_target_nnis = pbuf->pb_info.pi_nnis;
> +		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
> +		lp->lp_state |= LNET_PEER_FORCE_PING;
> +		CDEBUG(D_NET, "Truncated Push from %s (%d nids)\n",
> +		       libcfs_nid2str(lp->lp_primary_nid),
> +		       pbuf->pb_info.pi_nnis);
> +		goto out;
> +	}
> +
> +	/*
> +	 * Check whether the Put data is stale. Stale data can just be
> +	 * dropped.
> +	 */
> +	if (pbuf->pb_info.pi_nnis > 1 &&
> +	    lp->lp_primary_nid == pbuf->pb_info.pi_ni[1].ns_nid &&
> +	    LNET_PING_BUFFER_SEQNO(pbuf) < lp->lp_peer_seqno) {
> +		CDEBUG(D_NET, "Stale Push from %s: got %u have %u\n",
> +		       libcfs_nid2str(lp->lp_primary_nid),
> +		       LNET_PING_BUFFER_SEQNO(pbuf),
> +		       lp->lp_peer_seqno);
> +		goto out;
> +	}
> +
> +	/*
> +	 * Check whether the Put data is new, in which case we clear
> +	 * the UPTODATE flag and prepare to process it.
> +	 *
> +	 * If the Put data is current, and the peer is UPTODATE then
> +	 * we assume everything is all right and drop the data as
> +	 * stale.
> +	 */
> +	if (LNET_PING_BUFFER_SEQNO(pbuf) > lp->lp_peer_seqno) {
> +		lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
> +		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
> +	} else if (lp->lp_state & LNET_PEER_NIDS_UPTODATE) {
> +		CDEBUG(D_NET, "Stale Push from %s: got %u have %u\n",
> +		       libcfs_nid2str(lp->lp_primary_nid),
> +		       LNET_PING_BUFFER_SEQNO(pbuf),
> +		       lp->lp_peer_seqno);
> +		goto out;
> +	}
> +
> +	/*
> +	 * If there is data present that hasn't been processed yet,
> +	 * we'll replace it if the Put contained newer data and it
> +	 * fits. We're racing with a Ping or earlier Push in this
> +	 * case.
> +	 */
> +	if (lp->lp_state & LNET_PEER_DATA_PRESENT) {
> +		if (LNET_PING_BUFFER_SEQNO(pbuf) >
> +			LNET_PING_BUFFER_SEQNO(lp->lp_data) &&
> +		    pbuf->pb_info.pi_nnis <= lp->lp_data->pb_nnis) {
> +			memcpy(&lp->lp_data->pb_info, &pbuf->pb_info,
> +			       LNET_PING_INFO_SIZE(pbuf->pb_info.pi_nnis));
> +			CDEBUG(D_NET, "Ping/Push race from %s: %u vs %u\n",
> +			       libcfs_nid2str(lp->lp_primary_nid),
> +			       LNET_PING_BUFFER_SEQNO(pbuf),
> +			       LNET_PING_BUFFER_SEQNO(lp->lp_data));
> +		}
> +		goto out;
> +	}
> +
> +	/*
> +	 * Allocate a buffer to copy the data. On a failure we drop
> +	 * the Push and set FORCE_PING to force the discovery
> +	 * thread to fix the problem by pinging the peer.
> +	 */
> +	lp->lp_data = lnet_ping_buffer_alloc(lp->lp_data_nnis, GFP_ATOMIC);
> +	if (!lp->lp_data) {
> +		lp->lp_state |= LNET_PEER_FORCE_PING;
> +		CDEBUG(D_NET, "Cannot allocate Push buffer for %s %u\n",
> +		       libcfs_nid2str(lp->lp_primary_nid),
> +		       LNET_PING_BUFFER_SEQNO(pbuf));
> +		goto out;
> +	}
> +
> +	/* Success */
> +	memcpy(&lp->lp_data->pb_info, &pbuf->pb_info,
> +	       LNET_PING_INFO_SIZE(pbuf->pb_info.pi_nnis));
> +	lp->lp_state |= LNET_PEER_DATA_PRESENT;
> +	CDEBUG(D_NET, "Received Push %s %u\n",
> +	       libcfs_nid2str(lp->lp_primary_nid),
> +	       LNET_PING_BUFFER_SEQNO(pbuf));
> +
> +out:
> +	/*
> +	 * Queue the peer for discovery, and wake the discovery thread
> +	 * if the peer was already queued, because its status changed.
> +	 */
> +	spin_unlock(&lp->lp_lock);
> +	lnet_net_lock(LNET_LOCK_EX);
> +	if (lnet_peer_queue_for_discovery(lp))
> +		wake_up(&the_lnet.ln_dc_waitq);
> +	/* Drop refcount from lookup */
>  	lnet_peer_decref_locked(lp);
> +	lnet_net_unlock(LNET_LOCK_EX);
> +}
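The staleness checks in lnet_peer_push_event() boil down to a small
sequence-number rule: a push older than what we have seen is dropped, a newer
one advances the tracked sequence number and invalidates the cached NID list,
and an equal one is only useful if the list is not up to date. A user-space
sketch of that rule, with illustrative field names rather than the real
lnet_peer members:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the sequence-number state kept per peer. */
struct peer_seq {
	uint32_t peer_seqno;	/* highest push seqno seen so far */
	int nids_uptodate;	/* cached NID list still current? */
};

/* Returns 1 when an incoming push with this seqno should be
 * processed, 0 when it must be dropped as stale. */
static int accept_push(struct peer_seq *p, uint32_t seqno)
{
	if (seqno < p->peer_seqno)
		return 0;		/* older than what we have */
	if (seqno > p->peer_seqno) {
		p->peer_seqno = seqno;	/* newer: invalidate cache */
		p->nids_uptodate = 0;
		return 1;
	}
	/* equal: only useful if our NID list is not up to date */
	return !p->nids_uptodate;
}
```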
> +
> +/*
> + * Clear the discovery error state, unless we're already discovering
> + * this peer, in which case the error is current.
> + */
> +static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
> +{
> +	spin_lock(&lp->lp_lock);
> +	if (!(lp->lp_state & LNET_PEER_DISCOVERING))
> +		lp->lp_dc_error = 0;
> +	spin_unlock(&lp->lp_lock);
>  }
>  
>  /*
> @@ -1608,7 +1960,7 @@ static void lnet_peer_discovery_complete(struct lnet_peer *lp)
>   * because discovery could tear down an lnet_peer.
>   */
>  int
> -lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt)
> +lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block)
>  {
>  	DEFINE_WAIT(wait);
>  	struct lnet_peer *lp;
> @@ -1617,25 +1969,40 @@ lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt)
>  again:
>  	lnet_net_unlock(cpt);
>  	lnet_net_lock(LNET_LOCK_EX);
> +	lp = lpni->lpni_peer_net->lpn_peer;
> +	lnet_peer_clear_discovery_error(lp);
>  
> -	/* We're willing to be interrupted. */
> +	/*
> +	 * We're willing to be interrupted. The lpni can become a
> +	 * zombie if we race with DLC, so we must check for that.
> +	 */
>  	for (;;) {
> -		lp = lpni->lpni_peer_net->lpn_peer;
>  		prepare_to_wait(&lp->lp_dc_waitq, &wait, TASK_INTERRUPTIBLE);
>  		if (signal_pending(current))
>  			break;
>  		if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
>  			break;
> +		if (lp->lp_dc_error)
> +			break;
>  		if (lnet_peer_is_uptodate(lp))
>  			break;
>  		lnet_peer_queue_for_discovery(lp);
>  		lnet_peer_addref_locked(lp);
> +		/*
> +		 * if caller requested a non-blocking operation then
> +		 * return immediately. Once discovery is complete then the
> +		 * peer ref will be decremented and any pending messages
> +		 * that were stopped due to discovery will be transmitted.
> +		 */
> +		if (!block)
> +			break;
>  		lnet_net_unlock(LNET_LOCK_EX);
>  		schedule();
>  		finish_wait(&lp->lp_dc_waitq, &wait);
>  		lnet_net_lock(LNET_LOCK_EX);
>  		lnet_peer_decref_locked(lp);
> -		/* Do not use lp beyond this point. */
> +		/* Peer may have changed */
> +		lp = lpni->lpni_peer_net->lpn_peer;
>  	}
>  	finish_wait(&lp->lp_dc_waitq, &wait);
>  
> @@ -1646,71 +2013,969 @@ lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt)
>  		rc = -EINTR;
>  	else if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
>  		rc = -ESHUTDOWN;
> +	else if (lp->lp_dc_error)
> +		rc = lp->lp_dc_error;
> +	else if (!block)
> +		CDEBUG(D_NET, "non-blocking discovery\n");
>  	else if (!lnet_peer_is_uptodate(lp))
>  		goto again;
>  
> +	CDEBUG(D_NET, "peer %s NID %s: %d. %s\n",
> +	       (lp ? libcfs_nid2str(lp->lp_primary_nid) : "(none)"),
> +	       libcfs_nid2str(lpni->lpni_nid), rc,
> +	       (!block) ? "pending discovery" : "discovery complete");
> +
>  	return rc;
>  }
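With the "block" parameter added, the return-value selection at the end of
lnet_discover_peer_locked() follows a fixed priority: a pending signal and
shutdown win, then any error recorded by the discovery thread, then a
non-blocking caller returns success immediately, while a blocking caller
loops until the peer is up to date. A simplified user-space sketch of that
decision (names and the -EAGAIN stand-in for the "goto again" loop are
illustrative, not the real code):

```c
#include <assert.h>
#include <errno.h>

/* Illustrative snapshot of the conditions checked after the wait loop. */
struct disc_state {
	int signalled;		/* signal_pending() */
	int shutting_down;	/* ln_dc_state != RUNNING */
	int dc_error;		/* negative errno recorded by discovery */
	int uptodate;		/* lnet_peer_is_uptodate() */
};

static int discover_result(const struct disc_state *s, int block)
{
	if (s->signalled)
		return -EINTR;
	if (s->shutting_down)
		return -ESHUTDOWN;
	if (s->dc_error)
		return s->dc_error;
	if (!block || s->uptodate)
		return 0;
	return -EAGAIN;	/* stand-in for "goto again" in the real code */
}
```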
>  
> -/*
> - * Event handler for the discovery EQ.
> - *
> - * Called with lnet_res_lock(cpt) held. The cpt is the
> - * lnet_cpt_of_cookie() of the md handle cookie.
> - */
> -static void lnet_discovery_event_handler(struct lnet_event *event)
> +/* Handle an incoming ack for a push. */
> +static void
> +lnet_discovery_event_ack(struct lnet_peer *lp, struct lnet_event *ev)
>  {
> -	wake_up(&the_lnet.ln_dc_waitq);
> +	struct lnet_ping_buffer *pbuf;
> +
> +	pbuf = LNET_PING_INFO_TO_BUFFER(ev->md.start);
> +	spin_lock(&lp->lp_lock);
> +	lp->lp_state &= ~LNET_PEER_PUSH_SENT;
> +	lp->lp_push_error = ev->status;
> +	if (ev->status)
> +		lp->lp_state |= LNET_PEER_PUSH_FAILED;
> +	else
> +		lp->lp_node_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
> +	spin_unlock(&lp->lp_lock);
> +
> +	CDEBUG(D_NET, "peer %s ev->status %d\n",
> +	       libcfs_nid2str(lp->lp_primary_nid), ev->status);
>  }
>  
> -/*
> - * Wait for work to be queued or some other change that must be
> - * attended to. Returns non-zero if the discovery thread should shut
> - * down.
> - */
> -static int lnet_peer_discovery_wait_for_work(void)
> +/* Handle a Reply message. This is the reply to a Ping message. */
> +static void
> +lnet_discovery_event_reply(struct lnet_peer *lp, struct lnet_event *ev)
>  {
> -	int cpt;
> -	int rc = 0;
> +	struct lnet_ping_buffer *pbuf;
> +	int rc;
>  
> -	DEFINE_WAIT(wait);
> +	spin_lock(&lp->lp_lock);
>  
> -	cpt = lnet_net_lock_current();
> -	for (;;) {
> -		prepare_to_wait(&the_lnet.ln_dc_waitq, &wait,
> -				TASK_IDLE);
> -		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
> -			break;
> -		if (lnet_push_target_resize_needed())
> -			break;
> -		if (!list_empty(&the_lnet.ln_dc_request))
> -			break;
> -		lnet_net_unlock(cpt);
> -		schedule();
> -		finish_wait(&the_lnet.ln_dc_waitq, &wait);
> -		cpt = lnet_net_lock_current();
> +	/*
> +	 * If some kind of error happened the contents of the message
> +	 * cannot be used. Set PING_FAILED to trigger a retry.
> +	 */
> +	if (ev->status) {
> +		lp->lp_state |= LNET_PEER_PING_FAILED;
> +		lp->lp_ping_error = ev->status;
> +		CDEBUG(D_NET, "Ping Reply error %d from %s (source %s)\n",
> +		       ev->status,
> +		       libcfs_nid2str(lp->lp_primary_nid),
> +		       libcfs_nid2str(ev->source.nid));
> +		goto out;
>  	}
> -	finish_wait(&the_lnet.ln_dc_waitq, &wait);
> -
> -	if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
> -		rc = -ESHUTDOWN;
>  
> -	lnet_net_unlock(cpt);
> +	pbuf = LNET_PING_INFO_TO_BUFFER(ev->md.start);
> +	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
> +		lnet_swap_pinginfo(pbuf);
>  
> -	CDEBUG(D_NET, "woken: %d\n", rc);
> +	/*
> +	 * A reply with invalid or corrupted info. Set PING_FAILED to
> +	 * trigger a retry.
> +	 */
> +	rc = lnet_ping_info_validate(&pbuf->pb_info);
> +	if (rc) {
> +		lp->lp_state |= LNET_PEER_PING_FAILED;
> +		lp->lp_ping_error = 0;
> +		CDEBUG(D_NET, "Corrupted Ping Reply from %s: %d\n",
> +		       libcfs_nid2str(lp->lp_primary_nid), rc);
> +		goto out;
> +	}
>  
> -	return rc;
> -}
> +	/*
> +	 * Update the MULTI_RAIL flag based on the reply. If the peer
> +	 * was configured with DLC then the setting should match what
> +	 * DLC put in.
> +	 */
> +	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL) {
> +		if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
> +			/* Everything's fine */
> +		} else if (lp->lp_state & LNET_PEER_CONFIGURED) {
> +			CWARN("Reply says %s is Multi-Rail, DLC says not\n",
> +			      libcfs_nid2str(lp->lp_primary_nid));
> +		} else {
> +			lp->lp_state |= LNET_PEER_MULTI_RAIL;
> +			lnet_peer_clr_non_mr_pref_nids(lp);
> +		}
> +	} else if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
> +		if (lp->lp_state & LNET_PEER_CONFIGURED) {
> +			CWARN("DLC says %s is Multi-Rail, Reply says not\n",
> +			      libcfs_nid2str(lp->lp_primary_nid));
> +		} else {
> +			CERROR("Multi-Rail state vanished from %s\n",
> +			       libcfs_nid2str(lp->lp_primary_nid));
> +			lp->lp_state &= ~LNET_PEER_MULTI_RAIL;
> +		}
> +	}
>  
> -/* The discovery thread. */
> -static int lnet_peer_discovery(void *arg)
> -{
> -	struct lnet_peer *lp;
> +	/*
> +	 * Make sure we'll allocate the correct size ping buffer when
> +	 * pinging the peer.
> +	 */
> +	if (lp->lp_data_nnis < pbuf->pb_info.pi_nnis)
> +		lp->lp_data_nnis = pbuf->pb_info.pi_nnis;
>  
> -	CDEBUG(D_NET, "started\n");
> +	/*
> +	 * The peer may have discovery disabled at its end. Set
> +	 * NO_DISCOVERY as appropriate.
> +	 */
> +	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_DISCOVERY)) {
> +		CDEBUG(D_NET, "Peer %s has discovery disabled\n",
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		lp->lp_state |= LNET_PEER_NO_DISCOVERY;
> +	} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
> +		CDEBUG(D_NET, "Peer %s has discovery enabled\n",
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +		lp->lp_state &= ~LNET_PEER_NO_DISCOVERY;
> +	}
>  
> -	for (;;) {
> -		if (lnet_peer_discovery_wait_for_work())
> +	/*
> +	 * Check for truncation of the Reply. Clear PING_SENT and set
> +	 * PING_FAILED to trigger a retry.
> +	 */
> +	if (pbuf->pb_nnis < pbuf->pb_info.pi_nnis) {
> +		if (the_lnet.ln_push_target_nnis < pbuf->pb_info.pi_nnis)
> +			the_lnet.ln_push_target_nnis = pbuf->pb_info.pi_nnis;
> +		lp->lp_state |= LNET_PEER_PING_FAILED;
> +		lp->lp_ping_error = 0;
> +		CDEBUG(D_NET, "Truncated Reply from %s (%d nids)\n",
> +		       libcfs_nid2str(lp->lp_primary_nid),
> +		       pbuf->pb_info.pi_nnis);
> +		goto out;
> +	}
> +
> +	/*
> +	 * Check the sequence numbers in the reply. These are only
> +	 * available if the reply came from a Multi-Rail peer.
> +	 */
> +	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL &&
> +	    pbuf->pb_info.pi_nnis > 1 &&
> +	    lp->lp_primary_nid == pbuf->pb_info.pi_ni[1].ns_nid) {
> +		if (LNET_PING_BUFFER_SEQNO(pbuf) < lp->lp_peer_seqno) {
> +			CDEBUG(D_NET, "Stale Reply from %s: got %u have %u\n",
> +			       libcfs_nid2str(lp->lp_primary_nid),
> +			       LNET_PING_BUFFER_SEQNO(pbuf),
> +			       lp->lp_peer_seqno);
> +			goto out;
> +		}
> +
> +		if (LNET_PING_BUFFER_SEQNO(pbuf) > lp->lp_peer_seqno)
> +			lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
> +	}
> +
> +	/* We're happy with the state of the data in the buffer. */
> +	CDEBUG(D_NET, "peer %s data present %u\n",
> +	       libcfs_nid2str(lp->lp_primary_nid), lp->lp_peer_seqno);
> +	if (lp->lp_state & LNET_PEER_DATA_PRESENT)
> +		lnet_ping_buffer_decref(lp->lp_data);
> +	else
> +		lp->lp_state |= LNET_PEER_DATA_PRESENT;
> +	lnet_ping_buffer_addref(pbuf);
> +	lp->lp_data = pbuf;
> +out:
> +	lp->lp_state &= ~LNET_PEER_PING_SENT;
> +	spin_unlock(&lp->lp_lock);
> +}
> +
> +/*
> + * Send event handling. Only matters for error cases, where we clean
> + * up state on the peer and peer_ni that would otherwise be updated in
> + * the REPLY event handler for a successful Ping, and the ACK event
> + * handler for a successful Push.
> + */
> +static int
> +lnet_discovery_event_send(struct lnet_peer *lp, struct lnet_event *ev)
> +{
> +	int rc = 0;
> +
> +	if (!ev->status)
> +		goto out;
> +
> +	spin_lock(&lp->lp_lock);
> +	if (ev->msg_type == LNET_MSG_GET) {
> +		lp->lp_state &= ~LNET_PEER_PING_SENT;
> +		lp->lp_state |= LNET_PEER_PING_FAILED;
> +		lp->lp_ping_error = ev->status;
> +	} else { /* ev->msg_type == LNET_MSG_PUT */
> +		lp->lp_state &= ~LNET_PEER_PUSH_SENT;
> +		lp->lp_state |= LNET_PEER_PUSH_FAILED;
> +		lp->lp_push_error = ev->status;
> +	}
> +	spin_unlock(&lp->lp_lock);
> +	rc = LNET_REDISCOVER_PEER;
> +out:
> +	CDEBUG(D_NET, "%s Send to %s: %d\n",
> +	       (ev->msg_type == LNET_MSG_GET ? "Ping" : "Push"),
> +	       libcfs_nid2str(ev->target.nid), rc);
> +	return rc;
> +}
> +
> +/*
> + * Unlink event handling. This event is only seen if a call to
> + * LNetMDUnlink() caused the event to be unlinked. If this call was
> + * made after the event was set up in LNetGet() or LNetPut() then we
> + * assume the Ping or Push timed out.
> + */
> +static void
> +lnet_discovery_event_unlink(struct lnet_peer *lp, struct lnet_event *ev)
> +{
> +	spin_lock(&lp->lp_lock);
> +	/* We've passed through LNetGet() */
> +	if (lp->lp_state & LNET_PEER_PING_SENT) {
> +		lp->lp_state &= ~LNET_PEER_PING_SENT;
> +		lp->lp_state |= LNET_PEER_PING_FAILED;
> +		lp->lp_ping_error = -ETIMEDOUT;
> +		CDEBUG(D_NET, "Ping Unlink for message to peer %s\n",
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +	}
> +	/* We've passed through LNetPut() */
> +	if (lp->lp_state & LNET_PEER_PUSH_SENT) {
> +		lp->lp_state &= ~LNET_PEER_PUSH_SENT;
> +		lp->lp_state |= LNET_PEER_PUSH_FAILED;
> +		lp->lp_push_error = -ETIMEDOUT;
> +		CDEBUG(D_NET, "Push Unlink for message to peer %s\n",
> +		       libcfs_nid2str(lp->lp_primary_nid));
> +	}
> +	spin_unlock(&lp->lp_lock);
> +}
> +
> +/*
> + * Event handler for the discovery EQ.
> + *
> + * Called with lnet_res_lock(cpt) held. The cpt is the
> + * lnet_cpt_of_cookie() of the md handle cookie.
> + */
> +static void lnet_discovery_event_handler(struct lnet_event *event)
> +{
> +	struct lnet_peer *lp = event->md.user_ptr;
> +	struct lnet_ping_buffer *pbuf;
> +	int rc;
> +
> +	/* discovery needs to take another look */
> +	rc = LNET_REDISCOVER_PEER;
> +
> +	CDEBUG(D_NET, "Received event: %d\n", event->type);
> +
> +	switch (event->type) {
> +	case LNET_EVENT_ACK:
> +		lnet_discovery_event_ack(lp, event);
> +		break;
> +	case LNET_EVENT_REPLY:
> +		lnet_discovery_event_reply(lp, event);
> +		break;
> +	case LNET_EVENT_SEND:
> +		/* Only send failure triggers a retry. */
> +		rc = lnet_discovery_event_send(lp, event);
> +		break;
> +	case LNET_EVENT_UNLINK:
> +		/* LNetMDUnlink() was called */
> +		lnet_discovery_event_unlink(lp, event);
> +		break;
> +	default:
> +		/* Invalid events. */
> +		LBUG();
> +	}
> +	lnet_net_lock(LNET_LOCK_EX);
> +	if (event->unlinked) {
> +		pbuf = LNET_PING_INFO_TO_BUFFER(event->md.start);
> +		lnet_ping_buffer_decref(pbuf);
> +		lnet_peer_decref_locked(lp);
> +	}
> +	if (rc == LNET_REDISCOVER_PEER) {
> +		list_move_tail(&lp->lp_dc_list, &the_lnet.ln_dc_request);
> +		wake_up(&the_lnet.ln_dc_waitq);
> +	}
> +	lnet_net_unlock(LNET_LOCK_EX);
> +}
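The requeue rule in the event handler above is asymmetric: rc starts as
LNET_REDISCOVER_PEER, so ACK, REPLY and UNLINK always send the peer back to
the discovery request queue, while only a failed SEND does. A tiny user-space
sketch of that dispatch (the enum values and constant here are illustrative,
not the real LNET_EVENT_* codes):

```c
#include <assert.h>

/* Illustrative event codes; the real ones are the LNET_EVENT_* enums. */
enum ev_type { EV_ACK, EV_REPLY, EV_SEND, EV_UNLINK };

#define REDISCOVER_PEER 1

/* Mirrors the handler above: ACK, REPLY and UNLINK always ask
 * discovery to take another look; SEND requeues only on error. */
static int event_requeue(enum ev_type type, int send_status)
{
	switch (type) {
	case EV_SEND:
		return send_status ? REDISCOVER_PEER : 0;
	default:
		return REDISCOVER_PEER;
	}
}
```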
> +
> +/*
> + * Build a peer from incoming data.
> + *
> + * The NIDs in the incoming data are supposed to be structured as follows:
> + *  - loopback
> + *  - primary NID
> + *  - other NIDs in same net
> + *  - NIDs in second net
> + *  - NIDs in third net
> + *  - ...
> + * This is due to the way the list of NIDs in the data is created.
> + *
> + * Note that this function will mark the peer uptodate unless an
> + * ENOMEM is encountered. All other errors are due to a conflict
> + * between the DLC configuration and what discovery sees. We treat DLC
> + * as binding, and therefore set the NIDS_UPTODATE flag to prevent the
> + * peer from becoming stuck in discovery.
> + */
> +static int lnet_peer_merge_data(struct lnet_peer *lp,
> +				struct lnet_ping_buffer *pbuf)
> +{
> +	struct lnet_peer_ni *lpni;
> +	lnet_nid_t *curnis = NULL;
> +	lnet_nid_t *addnis = NULL;
> +	lnet_nid_t *delnis = NULL;
> +	unsigned int flags;
> +	int ncurnis;
> +	int naddnis;
> +	int ndelnis;
> +	int nnis = 0;
> +	int i;
> +	int j;
> +	int rc;
> +
> +	flags = LNET_PEER_DISCOVERED;
> +	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL)
> +		flags |= LNET_PEER_MULTI_RAIL;
> +
> +	nnis = max_t(int, lp->lp_nnis, pbuf->pb_info.pi_nnis);
> +	curnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
> +	addnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
> +	delnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
> +	if (!curnis || !addnis || !delnis) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +	ncurnis = 0;
> +	naddnis = 0;
> +	ndelnis = 0;
> +
> +	/* Construct the list of NIDs present in peer. */
> +	lpni = NULL;
> +	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL)
> +		curnis[ncurnis++] = lpni->lpni_nid;
> +
> +	/*
> +	 * Check for NIDs in pbuf not present in curnis[].
> +	 * The loop starts at 1 to skip the loopback NID.
> +	 */
> +	for (i = 1; i < pbuf->pb_info.pi_nnis; i++) {
> +		for (j = 0; j < ncurnis; j++)
> +			if (pbuf->pb_info.pi_ni[i].ns_nid == curnis[j])
> +				break;
> +		if (j == ncurnis)
> +			addnis[naddnis++] = pbuf->pb_info.pi_ni[i].ns_nid;
> +	}
> +	/*
> +	 * Check for NIDs in curnis[] not present in pbuf.
> +	 * The nested loop starts at 1 to skip the loopback NID.
> +	 *
> +	 * But never add the loopback NID to delnis[]: if it is
> +	 * present in curnis[] then this peer is for this node.
> +	 */
> +	for (i = 0; i < ncurnis; i++) {
> +		if (LNET_NETTYP(LNET_NIDNET(curnis[i])) == LOLND)
> +			continue;
> +		for (j = 1; j < pbuf->pb_info.pi_nnis; j++)
> +			if (curnis[i] == pbuf->pb_info.pi_ni[j].ns_nid)
> +				break;
> +		if (j == pbuf->pb_info.pi_nnis)
> +			delnis[ndelnis++] = curnis[i];
> +	}
> +
> +	for (i = 0; i < naddnis; i++) {
> +		rc = lnet_peer_add_nid(lp, addnis[i], flags);
> +		if (rc) {
> +			CERROR("Error adding NID %s to peer %s: %d\n",
> +			       libcfs_nid2str(addnis[i]),
> +			       libcfs_nid2str(lp->lp_primary_nid), rc);
> +			if (rc == -ENOMEM)
> +				goto out;
> +		}
> +	}
> +	for (i = 0; i < ndelnis; i++) {
> +		rc = lnet_peer_del_nid(lp, delnis[i], flags);
> +		if (rc) {
> +			CERROR("Error deleting NID %s from peer %s: %d\n",
> +			       libcfs_nid2str(delnis[i]),
> +			       libcfs_nid2str(lp->lp_primary_nid), rc);
> +			if (rc == -ENOMEM)
> +				goto out;
> +		}
> +	}
> +	/*
> +	 * Errors other than -ENOMEM are due to peers having been
> +	 * configured with DLC. Ignore these because DLC overrides
> +	 * Discovery.
> +	 */
> +	rc = 0;
> +out:
> +	kfree(curnis);
> +	kfree(addnis);
> +	kfree(delnis);
> +	lnet_ping_buffer_decref(pbuf);
> +	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
> +
> +	if (rc) {
> +		spin_lock(&lp->lp_lock);
> +		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
> +		lp->lp_state |= LNET_PEER_FORCE_PING;
> +		spin_unlock(&lp->lp_lock);
> +	}
> +	return rc;
> +}
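The merge above boils down to a two-way set difference between the NIDs the
peer already holds and the NIDs reported in the ping buffer. A standalone
userspace sketch of that core (the `nid_t` typedef and `nid_diff` helper are
illustrative stand-ins, not the LNet API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for lnet_nid_t. */
typedef uint64_t nid_t;

/*
 * Compute which NIDs to add (in 'seen' but not 'cur') and which to
 * delete (in 'cur' but not 'seen'), mirroring the two nested scans in
 * lnet_peer_merge_data().  O(n*m), fine for small NID counts.
 * Returns the number of additions; *ndel receives the deletion count.
 */
static size_t nid_diff(const nid_t *cur, size_t ncur,
		       const nid_t *seen, size_t nseen,
		       nid_t *add, nid_t *del, size_t *ndel)
{
	size_t nadd = 0, i, j;

	for (i = 0; i < nseen; i++) {
		for (j = 0; j < ncur; j++)
			if (seen[i] == cur[j])
				break;
		if (j == ncur)
			add[nadd++] = seen[i];
	}
	*ndel = 0;
	for (i = 0; i < ncur; i++) {
		for (j = 0; j < nseen; j++)
			if (cur[i] == seen[j])
				break;
		if (j == nseen)
			del[(*ndel)++] = cur[i];
	}
	return nadd;
}
```

The kernel code adds two wrinkles on top of this: the pbuf scan starts at
index 1 to skip the loopback NID, and a loopback NID in curnis[] is never
scheduled for deletion.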
> +
> +/*
> + * The data in pbuf says lp is its primary peer, but the data was
> + * received by a different peer. Try to update lp with the data.
> + */
> +static int
> +lnet_peer_set_primary_data(struct lnet_peer *lp, struct lnet_ping_buffer *pbuf)
> +{
> +	struct lnet_handle_md mdh;
> +
> +	/* Queue lp for discovery, and force it on the request queue. */
> +	lnet_net_lock(LNET_LOCK_EX);
> +	if (lnet_peer_queue_for_discovery(lp))
> +		list_move(&lp->lp_dc_list, &the_lnet.ln_dc_request);
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	LNetInvalidateMDHandle(&mdh);
> +
> +	/*
> +	 * Decide whether we can move the peer to the DATA_PRESENT state.
> +	 *
> +	 * We replace stale data for a multi-rail peer, repair PING_FAILED
> +	 * status, and preempt FORCE_PING.
> +	 *
> +	 * If after that we have DATA_PRESENT, we merge it into this peer.
> +	 */
> +	spin_lock(&lp->lp_lock);
> +	if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
> +		if (lp->lp_peer_seqno < LNET_PING_BUFFER_SEQNO(pbuf)) {
> +			lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
> +		} else if (lp->lp_state & LNET_PEER_DATA_PRESENT) {
> +			lp->lp_state &= ~LNET_PEER_DATA_PRESENT;
> +			lnet_ping_buffer_decref(pbuf);
> +			pbuf = lp->lp_data;
> +			lp->lp_data = NULL;
> +		}
> +	}
> +	if (lp->lp_state & LNET_PEER_DATA_PRESENT) {
> +		lnet_ping_buffer_decref(lp->lp_data);
> +		lp->lp_data = NULL;
> +		lp->lp_state &= ~LNET_PEER_DATA_PRESENT;
> +	}
> +	if (lp->lp_state & LNET_PEER_PING_FAILED) {
> +		mdh = lp->lp_ping_mdh;
> +		LNetInvalidateMDHandle(&lp->lp_ping_mdh);
> +		lp->lp_state &= ~LNET_PEER_PING_FAILED;
> +		lp->lp_ping_error = 0;
> +	}
> +	if (lp->lp_state & LNET_PEER_FORCE_PING)
> +		lp->lp_state &= ~LNET_PEER_FORCE_PING;
> +	lp->lp_state |= LNET_PEER_NIDS_UPTODATE;
> +	spin_unlock(&lp->lp_lock);
> +
> +	if (!LNetMDHandleIsInvalid(mdh))
> +		LNetMDUnlink(mdh);
> +
> +	if (pbuf)
> +		return lnet_peer_merge_data(lp, pbuf);
> +
> +	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
> +	return 0;
> +}
> +
> +/*
> + * Update a peer using the data received.
> + */
> +static int lnet_peer_data_present(struct lnet_peer *lp)
> +__must_hold(&lp->lp_lock)
> +{
> +	struct lnet_ping_buffer *pbuf;
> +	struct lnet_peer_ni *lpni;
> +	lnet_nid_t nid = LNET_NID_ANY;
> +	unsigned int flags;
> +	int rc = 0;
> +
> +	pbuf = lp->lp_data;
> +	lp->lp_data = NULL;
> +	lp->lp_state &= ~LNET_PEER_DATA_PRESENT;
> +	lp->lp_state |= LNET_PEER_NIDS_UPTODATE;
> +	spin_unlock(&lp->lp_lock);
> +
> +	/*
> +	 * Modifications of peer structures are done while holding the
> +	 * ln_api_mutex. A global lock is required because we may be
> +	 * modifying multiple peer structures, and a mutex greatly
> +	 * simplifies memory management.
> +	 *
> +	 * The actual changes to the data structures must also protect
> +	 * against concurrent lookups, for which the lnet_net_lock in
> +	 * LNET_LOCK_EX mode is used.
> +	 */
> +	mutex_lock(&the_lnet.ln_api_mutex);
> +	if (the_lnet.ln_state == LNET_STATE_SHUTDOWN) {
> +		rc = -ESHUTDOWN;
> +		goto out;
> +	}
> +
> +	/*
> +	 * If this peer is not on the peer list then it is being torn
> +	 * down, and our reference count may be all that is keeping it
> +	 * alive. Don't do any work on it.
> +	 */
> +	if (list_empty(&lp->lp_peer_list))
> +		goto out;
> +
> +	flags = LNET_PEER_DISCOVERED;
> +	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL)
> +		flags |= LNET_PEER_MULTI_RAIL;
> +
> +	/*
> +	 * Check whether the primary NID in the message matches the
> + * primary NID of the peer. If it does, update the peer; if
> + * it does not, check whether there is already a peer with
> +	 * that primary NID. If no such peer exists, try to update
> +	 * the primary NID of the current peer (allowed if it was
> +	 * created due to message traffic) and complete the update.
> +	 * If the peer did exist, hand off the data to it.
> +	 *
> +	 * The peer for the loopback interface is a special case: this
> +	 * is the peer for the local node, and we want to set its
> +	 * primary NID to the correct value here.
> +	 */
> +	if (pbuf->pb_info.pi_nnis > 1)
> +		nid = pbuf->pb_info.pi_ni[1].ns_nid;
> +	if (LNET_NETTYP(LNET_NIDNET(lp->lp_primary_nid)) == LOLND) {
> +		rc = lnet_peer_set_primary_nid(lp, nid, flags);
> +		if (!rc)
> +			rc = lnet_peer_merge_data(lp, pbuf);
> +	} else if (lp->lp_primary_nid == nid) {
> +		rc = lnet_peer_merge_data(lp, pbuf);
> +	} else {
> +		lpni = lnet_find_peer_ni_locked(nid);
> +		if (!lpni) {
> +			rc = lnet_peer_set_primary_nid(lp, nid, flags);
> +			if (rc) {
> +				CERROR("Primary NID error %s versus %s: %d\n",
> +				       libcfs_nid2str(lp->lp_primary_nid),
> +				       libcfs_nid2str(nid), rc);
> +			} else {
> +				rc = lnet_peer_merge_data(lp, pbuf);
> +			}
> +		} else {
> +			rc = lnet_peer_set_primary_data(
> +				lpni->lpni_peer_net->lpn_peer, pbuf);
> +			lnet_peer_ni_decref_locked(lpni);
> +		}
> +	}
> +out:
> +	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
> +	mutex_unlock(&the_lnet.ln_api_mutex);
> +
> +	spin_lock(&lp->lp_lock);
> +	/* Tell discovery to re-check the peer immediately. */
> +	if (!rc)
> +		rc = LNET_REDISCOVER_PEER;
> +	return rc;
> +}
> +
> +/*
> + * A ping failed. Clear the PING_FAILED state and set the
> + * FORCE_PING state, to ensure a retry even if discovery is
> + * disabled. This avoids being left with incorrect state.
> + */
> +static int lnet_peer_ping_failed(struct lnet_peer *lp)
> +__must_hold(&lp->lp_lock)
> +{
> +	struct lnet_handle_md mdh;
> +	int rc;
> +
> +	mdh = lp->lp_ping_mdh;
> +	LNetInvalidateMDHandle(&lp->lp_ping_mdh);
> +	lp->lp_state &= ~LNET_PEER_PING_FAILED;
> +	lp->lp_state |= LNET_PEER_FORCE_PING;
> +	rc = lp->lp_ping_error;
> +	lp->lp_ping_error = 0;
> +	spin_unlock(&lp->lp_lock);
> +
> +	if (!LNetMDHandleIsInvalid(mdh))
> +		LNetMDUnlink(mdh);
> +
> +	CDEBUG(D_NET, "peer %s:%d\n",
> +	       libcfs_nid2str(lp->lp_primary_nid), rc);
> +
> +	spin_lock(&lp->lp_lock);
> +	return rc ? rc : LNET_REDISCOVER_PEER;
> +}
> +
> +/*
> + * Select NID to send a Ping or Push to.
> + */
> +static lnet_nid_t lnet_peer_select_nid(struct lnet_peer *lp)
> +{
> +	struct lnet_peer_ni *lpni;
> +
> +	/* Look for a direct-connected NID for this peer. */
> +	lpni = NULL;
> +	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
> +		if (!lnet_is_peer_ni_healthy_locked(lpni))
> +			continue;
> +		if (!lnet_get_net_locked(lpni->lpni_peer_net->lpn_net_id))
> +			continue;
> +		break;
> +	}
> +	if (lpni)
> +		return lpni->lpni_nid;
> +
> +	/* Look for a routed-connected NID for this peer. */
> +	lpni = NULL;
> +	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
> +		if (!lnet_is_peer_ni_healthy_locked(lpni))
> +			continue;
> +		if (!lnet_find_rnet_locked(lpni->lpni_peer_net->lpn_net_id))
> +			continue;
> +		break;
> +	}
> +	if (lpni)
> +		return lpni->lpni_nid;
> +
> +	return LNET_NID_ANY;
> +}
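The selection policy above is a two-pass preference: any healthy
direct-connected NID beats any healthy routed NID, and LNET_NID_ANY signals
failure. A userspace sketch of just the policy (the `struct ni` fields and
`select_nid` name are assumed for illustration, not LNet symbols):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t nid_t;
#define NID_ANY 0

/* Hypothetical per-NI view: health plus how the NI can be reached. */
struct ni {
	nid_t nid;
	bool healthy;
	bool direct;	/* reachable on a local net */
	bool routed;	/* reachable via a router */
};

/*
 * Two-pass preference as in lnet_peer_select_nid(): a healthy
 * direct-connected NID wins; otherwise fall back to a healthy routed
 * NID; otherwise give up with NID_ANY.
 */
static nid_t select_nid(const struct ni *nis, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (nis[i].healthy && nis[i].direct)
			return nis[i].nid;
	for (i = 0; i < n; i++)
		if (nis[i].healthy && nis[i].routed)
			return nis[i].nid;
	return NID_ANY;
}
```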
> +
> +/* Active side of ping. */
> +static int lnet_peer_send_ping(struct lnet_peer *lp)
> +__must_hold(&lp->lp_lock)
> +{
> +	struct lnet_md md = { NULL };
> +	struct lnet_process_id id;
> +	struct lnet_ping_buffer *pbuf;
> +	int nnis;
> +	int rc;
> +	int cpt;
> +
> +	lp->lp_state |= LNET_PEER_PING_SENT;
> +	lp->lp_state &= ~LNET_PEER_FORCE_PING;
> +	spin_unlock(&lp->lp_lock);
> +
> +	nnis = max_t(int, lp->lp_data_nnis, LNET_INTERFACES_MIN);
> +	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
> +	if (!pbuf) {
> +		rc = -ENOMEM;
> +		goto fail_error;
> +	}
> +
> +	/* initialize md content */
> +	md.start     = &pbuf->pb_info;
> +	md.length    = LNET_PING_INFO_SIZE(nnis);
> +	md.threshold = 2; /* GET/REPLY */
> +	md.max_size  = 0;
> +	md.options   = LNET_MD_TRUNCATE;
> +	md.user_ptr  = lp;
> +	md.eq_handle = the_lnet.ln_dc_eqh;
> +
> +	rc = LNetMDBind(md, LNET_UNLINK, &lp->lp_ping_mdh);
> +	if (rc != 0) {
> +		lnet_ping_buffer_decref(pbuf);
> +		CERROR("Can't bind MD: %d\n", rc);
> +		goto fail_error;
> +	}
> +	cpt = lnet_net_lock_current();
> +	/* Refcount for MD. */
> +	lnet_peer_addref_locked(lp);
> +	id.pid = LNET_PID_LUSTRE;
> +	id.nid = lnet_peer_select_nid(lp);
> +	lnet_net_unlock(cpt);
> +
> +	if (id.nid == LNET_NID_ANY) {
> +		rc = -EHOSTUNREACH;
> +		goto fail_unlink_md;
> +	}
> +
> +	rc = LNetGet(LNET_NID_ANY, lp->lp_ping_mdh, id,
> +		     LNET_RESERVED_PORTAL,
> +		     LNET_PROTO_PING_MATCHBITS, 0);
> +
> +	if (rc)
> +		goto fail_unlink_md;
> +
> +	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
> +
> +	spin_lock(&lp->lp_lock);
> +	return 0;
> +
> +fail_unlink_md:
> +	LNetMDUnlink(lp->lp_ping_mdh);
> +	LNetInvalidateMDHandle(&lp->lp_ping_mdh);
> +fail_error:
> +	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
> +	/*
> +	 * The errors that get us here are considered hard errors and
> +	 * cause Discovery to terminate. So we clear PING_SENT, but do
> +	 * not set either PING_FAILED or FORCE_PING. In fact we need
> +	 * to clear PING_FAILED, because the unlink event handler will
> +	 * have set it if we called LNetMDUnlink() above.
> +	 */
> +	spin_lock(&lp->lp_lock);
> +	lp->lp_state &= ~(LNET_PEER_PING_SENT | LNET_PEER_PING_FAILED);
> +	return rc;
> +}
> +
> +/*
> + * This function exists because you cannot call LNetMDUnlink() from an
> + * event handler.
> + */
> +static int lnet_peer_push_failed(struct lnet_peer *lp)
> +__must_hold(&lp->lp_lock)
> +{
> +	struct lnet_handle_md mdh;
> +	int rc;
> +
> +	mdh = lp->lp_push_mdh;
> +	LNetInvalidateMDHandle(&lp->lp_push_mdh);
> +	lp->lp_state &= ~LNET_PEER_PUSH_FAILED;
> +	rc = lp->lp_push_error;
> +	lp->lp_push_error = 0;
> +	spin_unlock(&lp->lp_lock);
> +
> +	if (!LNetMDHandleIsInvalid(mdh))
> +		LNetMDUnlink(mdh);
> +
> +	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
> +	spin_lock(&lp->lp_lock);
> +	return rc ? rc : LNET_REDISCOVER_PEER;
> +}
> +
> +/* Active side of push. */
> +static int lnet_peer_send_push(struct lnet_peer *lp)
> +__must_hold(&lp->lp_lock)
> +{
> +	struct lnet_ping_buffer *pbuf;
> +	struct lnet_process_id id;
> +	struct lnet_md md;
> +	int cpt;
> +	int rc;
> +
> +	/* Don't push to a non-multi-rail peer. */
> +	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
> +		lp->lp_state &= ~LNET_PEER_FORCE_PUSH;
> +		return 0;
> +	}
> +
> +	lp->lp_state |= LNET_PEER_PUSH_SENT;
> +	lp->lp_state &= ~LNET_PEER_FORCE_PUSH;
> +	spin_unlock(&lp->lp_lock);
> +
> +	cpt = lnet_net_lock_current();
> +	pbuf = the_lnet.ln_ping_target;
> +	lnet_ping_buffer_addref(pbuf);
> +	lnet_net_unlock(cpt);
> +
> +	/* Push source MD */
> +	md.start     = &pbuf->pb_info;
> +	md.length    = LNET_PING_INFO_SIZE(pbuf->pb_nnis);
> +	md.threshold = 2; /* Put/Ack */
> +	md.max_size  = 0;
> +	md.options   = 0;
> +	md.eq_handle = the_lnet.ln_dc_eqh;
> +	md.user_ptr  = lp;
> +
> +	rc = LNetMDBind(md, LNET_UNLINK, &lp->lp_push_mdh);
> +	if (rc) {
> +		lnet_ping_buffer_decref(pbuf);
> +		CERROR("Can't bind push source MD: %d\n", rc);
> +		goto fail_error;
> +	}
> +	cpt = lnet_net_lock_current();
> +	/* Refcount for MD. */
> +	lnet_peer_addref_locked(lp);
> +	id.pid = LNET_PID_LUSTRE;
> +	id.nid = lnet_peer_select_nid(lp);
> +	lnet_net_unlock(cpt);
> +
> +	if (id.nid == LNET_NID_ANY) {
> +		rc = -EHOSTUNREACH;
> +		goto fail_unlink;
> +	}
> +
> +	rc = LNetPut(LNET_NID_ANY, lp->lp_push_mdh,
> +		     LNET_ACK_REQ, id, LNET_RESERVED_PORTAL,
> +		     LNET_PROTO_PING_MATCHBITS, 0, 0);
> +
> +	if (rc)
> +		goto fail_unlink;
> +
> +	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
> +
> +	spin_lock(&lp->lp_lock);
> +	return 0;
> +
> +fail_unlink:
> +	LNetMDUnlink(lp->lp_push_mdh);
> +	LNetInvalidateMDHandle(&lp->lp_push_mdh);
> +fail_error:
> +	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
> +	/*
> +	 * The errors that get us here are considered hard errors and
> +	 * cause Discovery to terminate. So we clear PUSH_SENT, but do
> +	 * not set PUSH_FAILED. In fact we need to clear PUSH_FAILED,
> +	 * because the unlink event handler will have set it if we
> +	 * called LNetMDUnlink() above.
> +	 */
> +	spin_lock(&lp->lp_lock);
> +	lp->lp_state &= ~(LNET_PEER_PUSH_SENT | LNET_PEER_PUSH_FAILED);
> +	return rc;
> +}
> +
> +/*
> + * An unrecoverable error was encountered during discovery.
> + * Set error status in peer and abort discovery.
> + */
> +static void lnet_peer_discovery_error(struct lnet_peer *lp, int error)
> +{
> +	CDEBUG(D_NET, "Discovery error %s: %d\n",
> +	       libcfs_nid2str(lp->lp_primary_nid), error);
> +
> +	spin_lock(&lp->lp_lock);
> +	lp->lp_dc_error = error;
> +	lp->lp_state &= ~LNET_PEER_DISCOVERING;
> +	lp->lp_state |= LNET_PEER_REDISCOVER;
> +	spin_unlock(&lp->lp_lock);
> +}
> +
> +/*
> + * Mark the peer as discovered.
> + */
> +static int lnet_peer_discovered(struct lnet_peer *lp)
> +__must_hold(&lp->lp_lock)
> +{
> +	lp->lp_state |= LNET_PEER_DISCOVERED;
> +	lp->lp_state &= ~(LNET_PEER_DISCOVERING |
> +			  LNET_PEER_REDISCOVER);
> +
> +	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
> +
> +	return 0;
> +}
> +
> +/*
> + * Mark the peer as to be rediscovered.
> + */
> +static int lnet_peer_rediscover(struct lnet_peer *lp)
> +__must_hold(&lp->lp_lock)
> +{
> +	lp->lp_state |= LNET_PEER_REDISCOVER;
> +	lp->lp_state &= ~LNET_PEER_DISCOVERING;
> +
> +	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
> +
> +	return 0;
> +}
> +
> +/*
> + * Returns the first peer on the ln_dc_working queue if its timeout
> + * has expired. Takes the current time as an argument so as to not
> + * obsessively re-check the clock. The oldest discovery request will
> + * be at the head of the queue.
> + */
> +static struct lnet_peer *lnet_peer_dc_timed_out(time64_t now)
> +{
> +	struct lnet_peer *lp;
> +
> +	if (list_empty(&the_lnet.ln_dc_working))
> +		return NULL;
> +	lp = list_first_entry(&the_lnet.ln_dc_working,
> +			      struct lnet_peer, lp_dc_list);
> +	if (now < lp->lp_last_queued + DISCOVERY_TIMEOUT)
> +		return NULL;
> +	return lp;
> +}
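Because peers are appended to ln_dc_working in arrival order, the head of the
queue is always the oldest entry, so a single comparison decides whether
anything has timed out. A minimal sketch of that invariant (the `DC_TIMEOUT`
value and `dc_queue` type are assumed for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define DC_TIMEOUT 300	/* assumed discovery timeout, in seconds */

/* Minimal FIFO of queue timestamps; index 0 is the oldest entry. */
struct dc_queue {
	long long stamps[16];
	int count;
};

/*
 * Mirror of lnet_peer_dc_timed_out(): since entries are appended in
 * arrival order, checking the head is sufficient.  An entry has timed
 * out once 'now' reaches its queue time plus the timeout.
 */
static bool dc_head_timed_out(const struct dc_queue *q, long long now)
{
	if (q->count == 0)
		return false;
	return now >= q->stamps[0] + DC_TIMEOUT;
}
```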
> +
> +/*
> + * Discovering this peer is taking too long. Cancel any Ping or Push
> + * that discovery is waiting on by unlinking the relevant MDs. The
> + * lnet_discovery_event_handler() will proceed from here and complete
> + * the cleanup.
> + */
> +static void lnet_peer_discovery_timeout(struct lnet_peer *lp)
> +{
> +	struct lnet_handle_md ping_mdh;
> +	struct lnet_handle_md push_mdh;
> +
> +	LNetInvalidateMDHandle(&ping_mdh);
> +	LNetInvalidateMDHandle(&push_mdh);
> +
> +	spin_lock(&lp->lp_lock);
> +	if (lp->lp_state & LNET_PEER_PING_SENT) {
> +		ping_mdh = lp->lp_ping_mdh;
> +		LNetInvalidateMDHandle(&lp->lp_ping_mdh);
> +	}
> +	if (lp->lp_state & LNET_PEER_PUSH_SENT) {
> +		push_mdh = lp->lp_push_mdh;
> +		LNetInvalidateMDHandle(&lp->lp_push_mdh);
> +	}
> +	spin_unlock(&lp->lp_lock);
> +
> +	if (!LNetMDHandleIsInvalid(ping_mdh))
> +		LNetMDUnlink(ping_mdh);
> +	if (!LNetMDHandleIsInvalid(push_mdh))
> +		LNetMDUnlink(push_mdh);
> +}
> +
> +/*
> + * Wait for work to be queued or some other change that must be
> + * attended to. Returns non-zero if the discovery thread should shut
> + * down.
> + */
> +static int lnet_peer_discovery_wait_for_work(void)
> +{
> +	int cpt;
> +	int rc = 0;
> +
> +	DEFINE_WAIT(wait);
> +
> +	cpt = lnet_net_lock_current();
> +	for (;;) {
> +		prepare_to_wait(&the_lnet.ln_dc_waitq, &wait,
> +				TASK_IDLE);
> +		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
> +			break;
> +		if (lnet_push_target_resize_needed())
> +			break;
> +		if (!list_empty(&the_lnet.ln_dc_request))
> +			break;
> +		if (lnet_peer_dc_timed_out(ktime_get_real_seconds()))
> +			break;
> +		lnet_net_unlock(cpt);
> +
> +		/*
> +		 * Wake up at least once per second to check whether
> +		 * any peers have been stuck on the working queue for
> +		 * longer than the discovery timeout.
> +		 */
> +		schedule_timeout(HZ);
> +		finish_wait(&the_lnet.ln_dc_waitq, &wait);
> +		cpt = lnet_net_lock_current();
> +	}
> +	finish_wait(&the_lnet.ln_dc_waitq, &wait);
> +
> +	if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
> +		rc = -ESHUTDOWN;
> +
> +	lnet_net_unlock(cpt);
> +
> +	CDEBUG(D_NET, "woken: %d\n", rc);
> +
> +	return rc;
> +}
> +
> +/* The discovery thread. */
> +static int lnet_peer_discovery(void *arg)
> +{
> +	struct lnet_peer *lp;
> +	time64_t now;
> +	int rc;
> +
> +	CDEBUG(D_NET, "started\n");
> +
> +	for (;;) {
> +		if (lnet_peer_discovery_wait_for_work())
>  			break;
>  
>  		if (lnet_push_target_resize_needed())
> @@ -1719,33 +2984,97 @@ static int lnet_peer_discovery(void *arg)
>  		lnet_net_lock(LNET_LOCK_EX);
>  		if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
>  			break;
> +
> +		/*
> +		 * Process all incoming discovery work requests.  When
> +		 * discovery must wait on a peer to change state, it
> +		 * is added to the tail of the ln_dc_working queue. A
> +		 * timestamp keeps track of when the peer was added,
> +		 * so we can time out discovery requests that take too
> +		 * long.
> +		 */
>  		while (!list_empty(&the_lnet.ln_dc_request)) {
>  			lp = list_first_entry(&the_lnet.ln_dc_request,
>  					      struct lnet_peer, lp_dc_list);
>  			list_move(&lp->lp_dc_list, &the_lnet.ln_dc_working);
> +			/*
> +			 * Record the time the peer was put on the
> +			 * dc_working queue, so that it does not remain
> +			 * on the queue forever if the GET message (for
> +			 * ping) never gets a REPLY or the PUT message
> +			 * (for push) never gets an ACK.
> +			 *
> +			 * TODO: LNet Health will deal with this scenario
> +			 * in a generic way.
> +			 */
> +			lp->lp_last_queued = ktime_get_real_seconds();
>  			lnet_net_unlock(LNET_LOCK_EX);
>  
> -			/* Just tag and release for now. */
> +			/*
> +			 * Select an action depending on the state of
> +			 * the peer and whether discovery is disabled.
> +			 * The check whether discovery is disabled is
> +			 * done after the code that handles processing
> +			 * for arrived data, cleanup for failures, and
> +			 * forcing a Ping or Push.
> +			 */
>  			spin_lock(&lp->lp_lock);
> -			if (lnet_peer_discovery_disabled) {
> -				lp->lp_state |= LNET_PEER_REDISCOVER;
> -				lp->lp_state &= ~(LNET_PEER_DISCOVERED |
> -						  LNET_PEER_NIDS_UPTODATE |
> -						  LNET_PEER_DISCOVERING);
> -			} else {
> -				lp->lp_state |= (LNET_PEER_DISCOVERED |
> -						 LNET_PEER_NIDS_UPTODATE);
> -				lp->lp_state &= ~(LNET_PEER_REDISCOVER |
> -						  LNET_PEER_DISCOVERING);
> -			}
> +			CDEBUG(D_NET, "peer %s state %#x\n",
> +			       libcfs_nid2str(lp->lp_primary_nid),
> +			       lp->lp_state);
> +			if (lp->lp_state & LNET_PEER_DATA_PRESENT)
> +				rc = lnet_peer_data_present(lp);
> +			else if (lp->lp_state & LNET_PEER_PING_FAILED)
> +				rc = lnet_peer_ping_failed(lp);
> +			else if (lp->lp_state & LNET_PEER_PUSH_FAILED)
> +				rc = lnet_peer_push_failed(lp);
> +			else if (lp->lp_state & LNET_PEER_FORCE_PING)
> +				rc = lnet_peer_send_ping(lp);
> +			else if (lp->lp_state & LNET_PEER_FORCE_PUSH)
> +				rc = lnet_peer_send_push(lp);
> +			else if (lnet_peer_discovery_disabled)
> +				rc = lnet_peer_rediscover(lp);
> +			else if (!(lp->lp_state & LNET_PEER_NIDS_UPTODATE))
> +				rc = lnet_peer_send_ping(lp);
> +			else if (lnet_peer_needs_push(lp))
> +				rc = lnet_peer_send_push(lp);
> +			else
> +				rc = lnet_peer_discovered(lp);
> +			CDEBUG(D_NET, "peer %s state %#x rc %d\n",
> +			       libcfs_nid2str(lp->lp_primary_nid),
> +			       lp->lp_state, rc);
>  			spin_unlock(&lp->lp_lock);
>  
>  			lnet_net_lock(LNET_LOCK_EX);
> +			if (rc == LNET_REDISCOVER_PEER) {
> +				list_move(&lp->lp_dc_list,
> +					  &the_lnet.ln_dc_request);
> +			} else if (rc) {
> +				lnet_peer_discovery_error(lp, rc);
> +			}
>  			if (!(lp->lp_state & LNET_PEER_DISCOVERING))
>  				lnet_peer_discovery_complete(lp);
>  			if (the_lnet.ln_dc_state == LNET_DC_STATE_STOPPING)
>  				break;
>  		}
> +
> +		/*
> +		 * Now that the ln_dc_request queue has been emptied
> +		 * check the ln_dc_working queue for peers that are
> +		 * taking too long. Move all that are found to the
> +		 * ln_dc_expired queue and time out any pending
> +		 * Ping or Push. We have to drop the lnet_net_lock
> +		 * in the loop because lnet_peer_discovery_timeout()
> +		 * calls LNetMDUnlink().
> +		 */
> +		now = ktime_get_real_seconds();
> +		while ((lp = lnet_peer_dc_timed_out(now)) != NULL) {
> +			list_move(&lp->lp_dc_list, &the_lnet.ln_dc_expired);
> +			lnet_net_unlock(LNET_LOCK_EX);
> +			lnet_peer_discovery_timeout(lp);
> +			lnet_net_lock(LNET_LOCK_EX);
> +		}
> +
>  		lnet_net_unlock(LNET_LOCK_EX);
>  	}
>  
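The if/else chain in the request loop is effectively a priority-ordered state
dispatch: processing arrived data and cleaning up failures outrank forced
pings and pushes, which in turn outrank the discovery-disabled check. A
sketch of that ordering (the flag values, names, and `dc_select` helper are
illustrative, not the real LNET_PEER_* bits):

```c
#include <assert.h>

/* Assumed flag bits, mirroring the LNET_PEER_* state bits. */
#define PEER_DATA_PRESENT	0x01
#define PEER_PING_FAILED	0x02
#define PEER_PUSH_FAILED	0x04
#define PEER_FORCE_PING		0x08
#define PEER_FORCE_PUSH		0x10
#define PEER_NIDS_UPTODATE	0x20

enum dc_action {
	DC_HANDLE_DATA, DC_PING_FAILED, DC_PUSH_FAILED,
	DC_SEND_PING, DC_SEND_PUSH, DC_REDISCOVER, DC_DISCOVERED,
};

/*
 * Priority-ordered dispatch as in lnet_peer_discovery(): arrived data
 * and failure cleanup come first, forced actions next, and only then
 * is the discovery-disabled check applied.
 */
static enum dc_action dc_select(unsigned int state, int disabled,
				int needs_push)
{
	if (state & PEER_DATA_PRESENT)
		return DC_HANDLE_DATA;
	if (state & PEER_PING_FAILED)
		return DC_PING_FAILED;
	if (state & PEER_PUSH_FAILED)
		return DC_PUSH_FAILED;
	if (state & PEER_FORCE_PING)
		return DC_SEND_PING;
	if (state & PEER_FORCE_PUSH)
		return DC_SEND_PUSH;
	if (disabled)
		return DC_REDISCOVER;
	if (!(state & PEER_NIDS_UPTODATE))
		return DC_SEND_PING;
	if (needs_push)
		return DC_SEND_PUSH;
	return DC_DISCOVERED;
}
```

Note that a forced ping or push is handled even when discovery is disabled,
which is what keeps a peer from getting stuck in a failed state.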
> @@ -1759,23 +3088,28 @@ static int lnet_peer_discovery(void *arg)
>  	LNetEQFree(the_lnet.ln_dc_eqh);
>  	LNetInvalidateEQHandle(&the_lnet.ln_dc_eqh);
>  
> +	/* Queue cleanup 1: stop all pending pings and pushes. */
>  	lnet_net_lock(LNET_LOCK_EX);
> -	list_for_each_entry(lp, &the_lnet.ln_dc_request, lp_dc_list) {
> -		spin_lock(&lp->lp_lock);
> -		lp->lp_state |= LNET_PEER_REDISCOVER;
> -		lp->lp_state &= ~(LNET_PEER_DISCOVERED |
> -				  LNET_PEER_DISCOVERING |
> -				  LNET_PEER_NIDS_UPTODATE);
> -		spin_unlock(&lp->lp_lock);
> -		lnet_peer_discovery_complete(lp);
> +	while (!list_empty(&the_lnet.ln_dc_working)) {
> +		lp = list_first_entry(&the_lnet.ln_dc_working,
> +				      struct lnet_peer, lp_dc_list);
> +		list_move(&lp->lp_dc_list, &the_lnet.ln_dc_expired);
> +		lnet_net_unlock(LNET_LOCK_EX);
> +		lnet_peer_discovery_timeout(lp);
> +		lnet_net_lock(LNET_LOCK_EX);
>  	}
> -	list_for_each_entry(lp, &the_lnet.ln_dc_working, lp_dc_list) {
> -		spin_lock(&lp->lp_lock);
> -		lp->lp_state |= LNET_PEER_REDISCOVER;
> -		lp->lp_state &= ~(LNET_PEER_DISCOVERED |
> -				  LNET_PEER_DISCOVERING |
> -				  LNET_PEER_NIDS_UPTODATE);
> -		spin_unlock(&lp->lp_lock);
> +	lnet_net_unlock(LNET_LOCK_EX);
> +
> +	/* Queue cleanup 2: wait for the expired queue to clear. */
> +	while (!list_empty(&the_lnet.ln_dc_expired))
> +		schedule_timeout_uninterruptible(HZ);
> +
> +	/* Queue cleanup 3: clear the request queue. */
> +	lnet_net_lock(LNET_LOCK_EX);
> +	while (!list_empty(&the_lnet.ln_dc_request)) {
> +		lp = list_first_entry(&the_lnet.ln_dc_request,
> +				      struct lnet_peer, lp_dc_list);
> +		lnet_peer_discovery_error(lp, -ESHUTDOWN);
>  		lnet_peer_discovery_complete(lp);
>  	}
>  	lnet_net_unlock(LNET_LOCK_EX);
> @@ -1797,10 +3131,6 @@ int lnet_peer_discovery_start(void)
>  	if (the_lnet.ln_dc_state != LNET_DC_STATE_SHUTDOWN)
>  		return -EALREADY;
>  
> -	INIT_LIST_HEAD(&the_lnet.ln_dc_request);
> -	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
> -	init_waitqueue_head(&the_lnet.ln_dc_waitq);
> -
>  	rc = LNetEQAlloc(0, lnet_discovery_event_handler, &the_lnet.ln_dc_eqh);
>  	if (rc != 0) {
>  		CERROR("Can't allocate discovery EQ: %d\n", rc);
> @@ -1819,6 +3149,8 @@ int lnet_peer_discovery_start(void)
>  		the_lnet.ln_dc_state = LNET_DC_STATE_SHUTDOWN;
>  	}
>  
> +	CDEBUG(D_NET, "discovery start: %d\n", rc);
> +
>  	return rc;
>  }
>  
> @@ -1837,6 +3169,9 @@ void lnet_peer_discovery_stop(void)
>  
>  	LASSERT(list_empty(&the_lnet.ln_dc_request));
>  	LASSERT(list_empty(&the_lnet.ln_dc_working));
> +	LASSERT(list_empty(&the_lnet.ln_dc_expired));
> +
> +	CDEBUG(D_NET, "discovery stopped\n");
>  }
>  
>  /* Debugging */

* [lustre-devel] [PATCH 19/24] lustre: lnet: add "lnetctl peer list"
  2018-10-07 23:19 ` [lustre-devel] [PATCH 19/24] lustre: lnet: add "lnetctl peer list" NeilBrown
@ 2018-10-14 23:38   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 23:38 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Add IOC_LIBCFS_GET_PEER_LIST to obtain a list of the primary
> NIDs of all peers known to the system. The list is written
> into a userspace buffer by the kernel. The typical usage is
> to make a first call to determine the required buffer size,
> then a second call to obtain the list.
> 
> Extend the "lnetctl peer" set of commands with a "list"
> subcommand that uses this interface.
> 
> Modify the IOC_LIBCFS_GET_PEER_NI ioctl (which is new in the
> Multi-Rail code) to use a NID to indicate the peer to look
> up, and then pass out the data for all NIDs of that peer.
> 
> Re-implement "lnetctl peer show" to obtain the list of NIDs
> using IOC_LIBCFS_GET_PEER_LIST followed by one or more
> IOC_LIBCFS_GET_PEER_NI calls to get information for each
> peer.
> 
> Make sure to copy the structure from kernel space to
> user space even if the ioctl handler returns an error.
> This is needed because if the buffer passed in by the
> user space is not big enough to copy the data, we want
> to pass the requested size to user space in the structure
> passed in. The return code in this case is -E2BIG.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25790
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    9 -
>  .../staging/lustre/include/linux/lnet/lib-types.h  |    3 
>  .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    3 
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   30 ++-
>  drivers/staging/lustre/lnet/lnet/peer.c            |  222 +++++++++++++-------
>  5 files changed, 169 insertions(+), 98 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index f82a699371f2..58e3a9c4e39f 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -462,6 +462,8 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg);
>  struct lnet_ni *lnet_get_next_ni_locked(struct lnet_net *mynet,
>  					struct lnet_ni *prev);
>  struct lnet_ni *lnet_get_ni_idx_locked(int idx);
> +int lnet_get_peer_list(__u32 *countp, __u32 *sizep,
> +		       struct lnet_process_id __user *ids);
>  
>  void lnet_router_debugfs_init(void);
>  void lnet_router_debugfs_fini(void);
> @@ -730,10 +732,9 @@ bool lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid);
>  int lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid);
>  int lnet_add_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
>  int lnet_del_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid);
> -int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
> -		       bool *mr,
> -		       struct lnet_peer_ni_credit_info __user *peer_ni_info,
> -		       struct lnet_ioctl_element_stats __user *peer_ni_stats);
> +int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nid,
> +		       __u32 *nnis, bool *mr, __u32 *sizep,
> +		       void __user *bulk);
>  int lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
>  			  char alivness[LNET_MAX_STR_LEN],
>  			  __u32 *cpt_iter, __u32 *refcount,
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index 07baa86e61ab..8543a67420d7 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -651,7 +651,6 @@ struct lnet_peer_net {
>   *    pt_hash[...]
>   *    pt_peer_list
>   *    pt_peers
> - *    pt_peer_nnids
>   * protected by pt_zombie_lock:
>   *    pt_zombie_list
>   *    pt_zombies
> @@ -667,8 +666,6 @@ struct lnet_peer_table {
>  	struct list_head	pt_peer_list;
>  	/* # peers */
>  	int			pt_peers;
> -	/* # NIDS on listed peers */
> -	int			pt_peer_nnids;
>  	/* # zombies to go to deathrow (and not there yet) */
>  	int			 pt_zombies;
>  	/* zombie peers_ni */
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> index 2a9beed23985..2607620e8ef8 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> @@ -144,6 +144,7 @@ struct libcfs_debug_ioctl_data {
>  #define IOC_LIBCFS_GET_LOCAL_NI		_IOWR(IOC_LIBCFS_TYPE, 97, IOCTL_CONFIG_SIZE)
>  #define IOC_LIBCFS_SET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 98, IOCTL_CONFIG_SIZE)
>  #define IOC_LIBCFS_GET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 99, IOCTL_CONFIG_SIZE)
> -#define IOC_LIBCFS_MAX_NR		99
> +#define IOC_LIBCFS_GET_PEER_LIST	_IOWR(IOC_LIBCFS_TYPE, 100, IOCTL_CONFIG_SIZE)
> +#define IOC_LIBCFS_MAX_NR		100
>  
>  #endif /* __LIBCFS_IOCTL_H__ */
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index 955d1711eda4..f624abe7db80 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -3117,21 +3117,31 @@ LNetCtl(unsigned int cmd, void *arg)
>  
>  	case IOC_LIBCFS_GET_PEER_NI: {
>  		struct lnet_ioctl_peer_cfg *cfg = arg;
> -		struct lnet_peer_ni_credit_info __user *lpni_cri;
> -		struct lnet_ioctl_element_stats __user *lpni_stats;
> -		size_t usr_size = sizeof(*lpni_cri) + sizeof(*lpni_stats);
>  
> -		if ((cfg->prcfg_hdr.ioc_len != sizeof(*cfg)) ||
> -		    (cfg->prcfg_size != usr_size))
> +		if (cfg->prcfg_hdr.ioc_len < sizeof(*cfg))
>  			return -EINVAL;
>  
> -		lpni_cri = cfg->prcfg_bulk;
> -		lpni_stats = cfg->prcfg_bulk + sizeof(*lpni_cri);
> +		mutex_lock(&the_lnet.ln_api_mutex);
> +		rc = lnet_get_peer_info(&cfg->prcfg_prim_nid,
> +					&cfg->prcfg_cfg_nid,
> +					&cfg->prcfg_count,
> +					&cfg->prcfg_mr,
> +					&cfg->prcfg_size,
> +					(void __user *)cfg->prcfg_bulk);
> +		mutex_unlock(&the_lnet.ln_api_mutex);
> +		return rc;
> +	}
> +
> +	case IOC_LIBCFS_GET_PEER_LIST: {
> +		struct lnet_ioctl_peer_cfg *cfg = arg;
> +
> +		if (cfg->prcfg_hdr.ioc_len < sizeof(*cfg))
> +			return -EINVAL;
>  
>  		mutex_lock(&the_lnet.ln_api_mutex);
> -		rc = lnet_get_peer_info(cfg->prcfg_count, &cfg->prcfg_prim_nid,
> -					&cfg->prcfg_cfg_nid, &cfg->prcfg_mr,
> -					lpni_cri, lpni_stats);
> +		rc = lnet_get_peer_list(&cfg->prcfg_count, &cfg->prcfg_size,
> +					(struct lnet_process_id __user *)
> +					cfg->prcfg_bulk);
>  		mutex_unlock(&the_lnet.ln_api_mutex);
>  		return rc;
>  	}
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 1ef4a44e752e..8dff3b767577 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -263,9 +263,7 @@ lnet_peer_detach_peer_ni_locked(struct lnet_peer_ni *lpni)
>  
>  	/* Update peer NID count. */
>  	lp = lpn->lpn_peer;
> -	ptable = the_lnet.ln_peer_tables[lp->lp_cpt];
>  	lp->lp_nnis--;
> -	ptable->pt_peer_nnids--;
>  
>  	/*
>  	 * If there are no more peer nets, make the peer unfindable
> @@ -277,6 +275,7 @@ lnet_peer_detach_peer_ni_locked(struct lnet_peer_ni *lpni)
>  	 */
>  	if (list_empty(&lp->lp_peer_nets)) {
>  		list_del_init(&lp->lp_peer_list);
> +		ptable = the_lnet.ln_peer_tables[lp->lp_cpt];
>  		ptable->pt_peers--;
>  	} else if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING) {
>  		/* Discovery isn't running, nothing to do here. */
> @@ -637,44 +636,6 @@ lnet_find_peer(lnet_nid_t nid)
>  	return lp;
>  }
>  
> -struct lnet_peer_ni *
> -lnet_get_peer_ni_idx_locked(int idx, struct lnet_peer_net **lpn,
> -			    struct lnet_peer **lp)
> -{
> -	struct lnet_peer_table	*ptable;
> -	struct lnet_peer_ni	*lpni;
> -	int			lncpt;
> -	int			cpt;
> -
> -	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
> -
> -	for (cpt = 0; cpt < lncpt; cpt++) {
> -		ptable = the_lnet.ln_peer_tables[cpt];
> -		if (ptable->pt_peer_nnids > idx)
> -			break;
> -		idx -= ptable->pt_peer_nnids;
> -	}
> -	if (cpt >= lncpt)
> -		return NULL;
> -
> -	list_for_each_entry((*lp), &ptable->pt_peer_list, lp_peer_list) {
> -		if ((*lp)->lp_nnis <= idx) {
> -			idx -= (*lp)->lp_nnis;
> -			continue;
> -		}
> -		list_for_each_entry((*lpn), &((*lp)->lp_peer_nets),
> -				    lpn_peer_nets) {
> -			list_for_each_entry(lpni, &((*lpn)->lpn_peer_nis),
> -					    lpni_peer_nis) {
> -				if (idx-- == 0)
> -					return lpni;
> -			}
> -		}
> -	}
> -
> -	return NULL;
> -}
> -
>  struct lnet_peer_ni *
>  lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
>  			     struct lnet_peer_net *peer_net,
> @@ -734,6 +695,69 @@ lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
>  	return lpni;
>  }
>  
> +/* Call with the ln_api_mutex held */
> +int
> +lnet_get_peer_list(__u32 *countp, __u32 *sizep,
> +		   struct lnet_process_id __user *ids)
> +{
> +	struct lnet_process_id id;
> +	struct lnet_peer_table *ptable;
> +	struct lnet_peer *lp;
> +	__u32 count = 0;
> +	__u32 size = 0;
> +	int lncpt;
> +	int cpt;
> +	__u32 i;
> +	int rc;
> +
> +	rc = -ESHUTDOWN;
> +	if (the_lnet.ln_state == LNET_STATE_SHUTDOWN)
> +		goto done;
> +
> +	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
> +
> +	/*
> +	 * Count the number of peers, and return E2BIG if the buffer
> +	 * is too small. We'll also return the desired size.
> +	 */
> +	rc = -E2BIG;
> +	for (cpt = 0; cpt < lncpt; cpt++) {
> +		ptable = the_lnet.ln_peer_tables[cpt];
> +		count += ptable->pt_peers;
> +	}
> +	size = count * sizeof(*ids);
> +	if (size > *sizep)
> +		goto done;
> +
> +	/*
> +	 * Walk the peer lists and copy out the primary nids.
> +	 * This is safe because the peer lists are only modified
> +	 * while the ln_api_mutex is held. So we don't need to
> +	 * hold the lnet_net_lock as well, and can therefore
> +	 * directly call copy_to_user().
> +	 */
> +	rc = -EFAULT;
> +	memset(&id, 0, sizeof(id));
> +	id.pid = LNET_PID_LUSTRE;
> +	i = 0;
> +	for (cpt = 0; cpt < lncpt; cpt++) {
> +		ptable = the_lnet.ln_peer_tables[cpt];
> +		list_for_each_entry(lp, &ptable->pt_peer_list, lp_peer_list) {
> +			if (i >= count)
> +				goto done;
> +			id.nid = lp->lp_primary_nid;
> +			if (copy_to_user(&ids[i], &id, sizeof(id)))
> +				goto done;
> +			i++;
> +		}
> +	}
> +	rc = 0;
> +done:
> +	*countp = count;
> +	*sizep = size;
> +	return rc;
> +}
> +
>  /*
>   * Start pushes to peers that need to be updated for a configuration
>   * change on this node.
> @@ -1128,7 +1152,6 @@ lnet_peer_attach_peer_ni(struct lnet_peer *lp,
>  	spin_unlock(&lp->lp_lock);
>  
>  	lp->lp_nnis++;
> -	the_lnet.ln_peer_tables[lp->lp_cpt]->pt_peer_nnids++;
>  	lnet_net_unlock(LNET_LOCK_EX);
>  
>  	CDEBUG(D_NET, "peer %s NID %s flags %#x\n",
> @@ -3273,55 +3296,94 @@ lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
>  }
>  
>  /* ln_api_mutex is held, which keeps the peer list stable */
> -int lnet_get_peer_info(__u32 idx, lnet_nid_t *primary_nid, lnet_nid_t *nid,
> -		       bool *mr,
> -		       struct lnet_peer_ni_credit_info __user *peer_ni_info,
> -		       struct lnet_ioctl_element_stats __user *peer_ni_stats)
> +int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
> +		       __u32 *nnis, bool *mr, __u32 *sizep,
> +		       void __user *bulk)
>  {
> -	struct lnet_ioctl_element_stats ni_stats;
> -	struct lnet_peer_ni_credit_info ni_info;
> -	struct lnet_peer_ni *lpni = NULL;
> -	struct lnet_peer_net *lpn = NULL;
> -	struct lnet_peer *lp = NULL;
> +	struct lnet_ioctl_element_stats *lpni_stats;
> +	struct lnet_peer_ni_credit_info *lpni_info;
> +	struct lnet_peer_ni *lpni;
> +	struct lnet_peer *lp;
> +	lnet_nid_t nid;
> +	__u32 size;
>  	int rc;
>  
> -	lpni = lnet_get_peer_ni_idx_locked(idx, &lpn, &lp);
> +	lp = lnet_find_peer(*primary_nid);
>  
> -	if (!lpni)
> -		return -ENOENT;
> +	if (!lp) {
> +		rc = -ENOENT;
> +		goto out;
> +	}
> +
> +	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats);
> +	size *= lp->lp_nnis;
> +	if (size > *sizep) {
> +		*sizep = size;
> +		rc = -E2BIG;
> +		goto out_lp_decref;
> +	}
>  
>  	*primary_nid = lp->lp_primary_nid;
>  	*mr = lnet_peer_is_multi_rail(lp);
> -	*nid = lpni->lpni_nid;
> -	snprintf(ni_info.cr_aliveness, LNET_MAX_STR_LEN, "NA");
> -	if (lnet_isrouter(lpni) ||
> -	    lnet_peer_aliveness_enabled(lpni))
> -		snprintf(ni_info.cr_aliveness, LNET_MAX_STR_LEN,
> -			 lpni->lpni_alive ? "up" : "down");
> -
> -	ni_info.cr_refcount = atomic_read(&lpni->lpni_refcount);
> -	ni_info.cr_ni_peer_tx_credits = lpni->lpni_net ?
> -		lpni->lpni_net->net_tunables.lct_peer_tx_credits : 0;
> -	ni_info.cr_peer_tx_credits = lpni->lpni_txcredits;
> -	ni_info.cr_peer_rtr_credits = lpni->lpni_rtrcredits;
> -	ni_info.cr_peer_min_rtr_credits = lpni->lpni_minrtrcredits;
> -	ni_info.cr_peer_min_tx_credits = lpni->lpni_mintxcredits;
> -	ni_info.cr_peer_tx_qnob = lpni->lpni_txqnob;
> -
> -	ni_stats.iel_send_count = atomic_read(&lpni->lpni_stats.send_count);
> -	ni_stats.iel_recv_count = atomic_read(&lpni->lpni_stats.recv_count);
> -	ni_stats.iel_drop_count = atomic_read(&lpni->lpni_stats.drop_count);
> -
> -	/* If copy_to_user fails */
> -	rc = -EFAULT;
> -	if (copy_to_user(peer_ni_info, &ni_info, sizeof(ni_info)))
> -		goto copy_failed;
> +	*nidp = lp->lp_primary_nid;
> +	*nnis = lp->lp_nnis;
> +	*sizep = size;
>  
> -	if (copy_to_user(peer_ni_stats, &ni_stats, sizeof(ni_stats)))
> -		goto copy_failed;
> +	/* Allocate helper buffers. */
> +	rc = -ENOMEM;
> +	lpni_info = kzalloc(sizeof(*lpni_info), GFP_KERNEL);
> +	if (!lpni_info)
> +		goto out_lp_decref;
> +	lpni_stats = kzalloc(sizeof(*lpni_stats), GFP_KERNEL);
> +	if (!lpni_stats)
> +		goto out_free_info;
>  
> +	lpni = NULL;
> +	rc = -EFAULT;
> +	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
> +		nid = lpni->lpni_nid;
> +		if (copy_to_user(bulk, &nid, sizeof(nid)))
> +			goto out_free_stats;
> +		bulk += sizeof(nid);
> +
> +		memset(lpni_info, 0, sizeof(*lpni_info));
> +		snprintf(lpni_info->cr_aliveness, LNET_MAX_STR_LEN, "NA");
> +		if (lnet_isrouter(lpni) ||
> +		    lnet_peer_aliveness_enabled(lpni))
> +			snprintf(lpni_info->cr_aliveness, LNET_MAX_STR_LEN,
> +				 lpni->lpni_alive ? "up" : "down");
> +
> +		lpni_info->cr_refcount = atomic_read(&lpni->lpni_refcount);
> +		lpni_info->cr_ni_peer_tx_credits = lpni->lpni_net ?
> +			lpni->lpni_net->net_tunables.lct_peer_tx_credits : 0;
> +		lpni_info->cr_peer_tx_credits = lpni->lpni_txcredits;
> +		lpni_info->cr_peer_rtr_credits = lpni->lpni_rtrcredits;
> +		lpni_info->cr_peer_min_rtr_credits = lpni->lpni_minrtrcredits;
> +		lpni_info->cr_peer_min_tx_credits = lpni->lpni_mintxcredits;
> +		lpni_info->cr_peer_tx_qnob = lpni->lpni_txqnob;
> +		if (copy_to_user(bulk, lpni_info, sizeof(*lpni_info)))
> +			goto out_free_stats;
> +		bulk += sizeof(*lpni_info);
> +
> +		memset(lpni_stats, 0, sizeof(*lpni_stats));
> +		lpni_stats->iel_send_count =
> +			atomic_read(&lpni->lpni_stats.send_count);
> +		lpni_stats->iel_recv_count =
> +			atomic_read(&lpni->lpni_stats.recv_count);
> +		lpni_stats->iel_drop_count =
> +			atomic_read(&lpni->lpni_stats.drop_count);
> +		if (copy_to_user(bulk, lpni_stats, sizeof(*lpni_stats)))
> +			goto out_free_stats;
> +		bulk += sizeof(*lpni_stats);
> +	}
>  	rc = 0;
>  
> -copy_failed:
> +out_free_stats:
> +	kfree(lpni_stats);
> +out_free_info:
> +	kfree(lpni_info);
> +out_lp_decref:
> +	lnet_peer_decref_locked(lp);
> +out:
>  	return rc;
>  }
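The bulk buffer that lnet_get_peer_info() fills above is a flat sequence of per-NI records: a nid, then the credit info, then the stats, copied back to back with no padding between the copy_to_user() calls. A minimal user-space sketch of parsing that layout (the record types here are simplified stand-ins, not the real uapi structs):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative, simplified record types; the real structs live in the
 * lustre uapi headers and are larger. */
typedef uint64_t nid_t;
struct cred_info { uint32_t refcount; };
struct msg_stats { uint32_t send, recv, drop; };

/* One bulk record as packed by the loop in lnet_get_peer_info():
 * nid, then credit info, then stats, back to back. */
#define REC_SIZE (sizeof(nid_t) + sizeof(struct cred_info) + \
		  sizeof(struct msg_stats))

/* Pull the nid out of record i in a bulk buffer holding nnis records. */
static nid_t bulk_nid(const char *bulk, uint32_t i)
{
	nid_t nid;

	memcpy(&nid, bulk + i * REC_SIZE, sizeof(nid));
	return nid;
}
```

Because the records are packed by successive copies rather than laid out as a C struct, a parser has to step by the summed field sizes, as above, instead of using sizeof on a combined struct.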
> 
> 
> 
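Both lnet_get_peer_list() and lnet_get_peer_info() in this patch use the same size negotiation: if the supplied buffer is too small, the required size is written back and -E2BIG is returned, so userspace probes once and retries with the reported size. A sketch of that caller-side pattern, with a mock standing in for the real ioctl (all names here are invented for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel side: fills up to *sizep bytes
 * of peer IDs, or reports the required size via *sizep and returns
 * -E2BIG, mirroring lnet_get_peer_list(). */
static int mock_get_peer_list(uint32_t *countp, uint32_t *sizep,
			      uint64_t *ids)
{
	static const uint64_t peers[] = { 10, 11, 12 };
	uint32_t need = sizeof(peers);

	*countp = 3;
	if (need > *sizep) {
		*sizep = need;	/* tell the caller how much to allocate */
		return -E2BIG;
	}
	memcpy(ids, peers, need);
	*sizep = need;
	return 0;
}

/* Caller-side retry loop: probe with a zero-length buffer, then retry
 * with exactly the size the first call reported. */
static int fetch_peers(uint64_t **out, uint32_t *countp)
{
	uint32_t size = 0;
	uint64_t *buf = NULL;
	int rc = mock_get_peer_list(countp, &size, buf);

	if (rc == -E2BIG) {
		buf = malloc(size);
		if (!buf)
			return -ENOMEM;
		rc = mock_get_peer_list(countp, &size, buf);
	}
	if (rc) {
		free(buf);
		return rc;
	}
	*out = buf;
	return 0;
}
```

The two-call shape is why the kernel side always writes *countp and *sizep back even on the error path: the first, failing call is what sizes the second one.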

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 20/24] lustre: lnet: add "lnetctl ping" command
  2018-10-07 23:19 ` [lustre-devel] [PATCH 20/24] lustre: lnet: add "lnetctl ping" command NeilBrown
@ 2018-10-14 23:43   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 23:43 UTC (permalink / raw)
  To: lustre-devel


> From: Olaf Weber <olaf@sgi.com>
> 
> Adds function jt_ping() in lnetctl.c and
> lustre_lnet_ping_nid() in liblnetconfig.c.
> The output of "lnetctl ping" is similar to
> "lnetctl peer show".
> 
> Function jt_ping() in lnetctl.c calls lustre_lnet_ping_nid()
> to implement "lnetctl ping". Adds a function infra_ping_nid()
> that can later be reused by similar ping-style lnetctl commands.
> Uses a new ioctl, IOC_LIBCFS_PING_PEER, for "lnetctl ping";
> with "lnetctl ping" multiple nids can be pinged. Uses a new
> struct (lnet_ioctl_ping_data in lib-dlc.h) to pass the ping data
> from kernel to user space. Also changes the lnet_ping()
> function and its input parameters in drivers/staging/lustre/lnet/lnet/api-ni.c.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Sonia Sharma <sonia.sharma@intel.com>
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25791
> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> Tested-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    5 +-
>  .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    2 -
>  .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |    2 -
>  .../lustre/lnet/klnds/socklnd/socklnd_modparams.c  |    2 -
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   55 +++++++++++++++-----
>  drivers/staging/lustre/lnet/lnet/peer.c            |    2 -
>  6 files changed, 47 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 58e3a9c4e39f..adb4d0551ef5 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -76,8 +76,8 @@ extern struct lnet the_lnet;	/* THE network */
>  #define LNET_ACCEPTOR_MIN_RESERVED_PORT    512
>  #define LNET_ACCEPTOR_MAX_RESERVED_PORT    1023
>  
> -/* Discovery timeout - same as default peer_timeout */
> -#define DISCOVERY_TIMEOUT	180
> +/* default timeout */
> +#define DEFAULT_PEER_TIMEOUT    180
>  
>  static inline int lnet_is_route_alive(struct lnet_route *route)
>  {
> @@ -716,6 +716,7 @@ struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref,
>  					    int cpt);
>  struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
>  struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
> +struct lnet_peer *lnet_find_peer(lnet_nid_t nid);
>  void lnet_peer_net_added(struct lnet_net *net);
>  lnet_nid_t lnet_peer_primary_nid_locked(lnet_nid_t nid);
>  int lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block);
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> index 2607620e8ef8..3d89202bd396 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> @@ -102,7 +102,7 @@ struct libcfs_debug_ioctl_data {
>  #define IOC_LIBCFS_CONFIGURE		   _IOWR('e', 59, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_TESTPROTOCOMPAT	   _IOWR('e', 60, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_PING			   _IOWR('e', 61, IOCTL_LIBCFS_TYPE)
> -/*	IOC_LIBCFS_DEBUG_PEER		   _IOWR('e', 62, IOCTL_LIBCFS_TYPE) */
> +#define IOC_LIBCFS_PING_PEER               _IOWR('e', 62, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_LNETST		   _IOWR('e', 63, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_LNET_FAULT		   _IOWR('e', 64, IOCTL_LIBCFS_TYPE)
>  /* lnd ioctls */
> diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
> index 0f2ad9110dc9..13b19f3eabf0 100644
> --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
> +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
> @@ -83,7 +83,7 @@ static int peer_buffer_credits;
>  module_param(peer_buffer_credits, int, 0444);
>  MODULE_PARM_DESC(peer_buffer_credits, "# per-peer router buffer credits");
>  
> -static int peer_timeout = 180;
> +static int peer_timeout = DEFAULT_PEER_TIMEOUT;
>  module_param(peer_timeout, int, 0444);
>  MODULE_PARM_DESC(peer_timeout, "Seconds without aliveness news to declare peer dead (<=0 to disable)");
>  
> diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_modparams.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_modparams.c
> index 5663a4ca94d4..da5910049fc1 100644
> --- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_modparams.c
> +++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_modparams.c
> @@ -35,7 +35,7 @@ static int peer_buffer_credits;
>  module_param(peer_buffer_credits, int, 0444);
>  MODULE_PARM_DESC(peer_buffer_credits, "# per-peer router buffer credits");
>  
> -static int peer_timeout = 180;
> +static int peer_timeout = DEFAULT_PEER_TIMEOUT;
>  module_param(peer_timeout, int, 0444);
>  MODULE_PARM_DESC(peer_timeout, "Seconds without aliveness news to declare peer dead (<=0 to disable)");
>  
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index f624abe7db80..37f47bd1511f 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -3181,24 +3181,50 @@ LNetCtl(unsigned int cmd, void *arg)
>  		id.nid = data->ioc_nid;
>  		id.pid = data->ioc_u32[0];
>  
> -		/* Don't block longer than 2 minutes */
> -		if (data->ioc_u32[1] > 120 * MSEC_PER_SEC)
> -			return -EINVAL;
> -
> -		/* If timestamp is negative then disable timeout */
> -		if ((s32)data->ioc_u32[1] < 0)
> -			timeout = MAX_SCHEDULE_TIMEOUT;
> +		/* If timeout is non-positive or too large, use the 3-minute default */
> +		if (((s32)data->ioc_u32[1] <= 0) ||
> +		    data->ioc_u32[1] > (DEFAULT_PEER_TIMEOUT * MSEC_PER_SEC))
> +			timeout = DEFAULT_PEER_TIMEOUT * HZ;
>  		else
>  			timeout = msecs_to_jiffies(data->ioc_u32[1]);
>  
>  		rc = lnet_ping(id, timeout, data->ioc_pbuf1,
>  			       data->ioc_plen1 / sizeof(struct lnet_process_id));
> +
>  		if (rc < 0)
>  			return rc;
> +
>  		data->ioc_count = rc;
>  		return 0;
>  	}
>  
> +	case IOC_LIBCFS_PING_PEER: {
> +		struct lnet_ioctl_ping_data *ping = arg;
> +		struct lnet_peer *lp;
> +		signed long timeout;
> +
> +		/* If timeout is non-positive or too large, use the 3-minute default */
> +		if (((s32)ping->op_param) <= 0 ||
> +		    ping->op_param > (DEFAULT_PEER_TIMEOUT * MSEC_PER_SEC))
> +			timeout = DEFAULT_PEER_TIMEOUT * HZ;
> +		else
> +			timeout = msecs_to_jiffies(ping->op_param);
> +
> +		rc = lnet_ping(ping->ping_id, timeout,
> +			       ping->ping_buf,
> +			       ping->ping_count);
> +		if (rc < 0)
> +			return rc;
> +
> +		lp = lnet_find_peer(ping->ping_id.nid);
> +		if (lp) {
> +			ping->ping_id.nid = lp->lp_primary_nid;
> +			ping->mr_info = lnet_peer_is_multi_rail(lp);
> +		}
> +		ping->ping_count = rc;
> +		return 0;
> +	}
> +
>  	default:
>  		ni = lnet_net2ni_addref(data->ioc_net);
>  		if (!ni)
> @@ -3301,7 +3327,7 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  	/* initialize md content */
>  	md.start     = &pbuf->pb_info;
>  	md.length    = LNET_PING_INFO_SIZE(n_ids);
> -	md.threshold = 2; /*GET/REPLY*/
> +	md.threshold = 2; /* GET/REPLY */
>  	md.max_size  = 0;
>  	md.options   = LNET_MD_TRUNCATE;
>  	md.user_ptr  = NULL;
> @@ -3319,7 +3345,6 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  
>  	if (rc) {
>  		/* Don't CERROR; this could be deliberate! */
> -
>  		rc2 = LNetMDUnlink(mdh);
>  		LASSERT(!rc2);
>  
> @@ -3363,7 +3388,6 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  			replied = 1;
>  			rc = event.mlength;
>  		}
> -
>  	} while (rc2 <= 0 || !event.unlinked);
>  
>  	if (!replied) {
> @@ -3377,10 +3401,9 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  	nob = rc;
>  	LASSERT(nob >= 0 && nob <= LNET_PING_INFO_SIZE(n_ids));
>  
> -	rc = -EPROTO;			   /* if I can't parse... */
> +	rc = -EPROTO;		/* if I can't parse... */
>  
>  	if (nob < 8) {
> -		/* can't check magic/version */
>  		CERROR("%s: ping info too short %d\n",
>  		       libcfs_id2str(id), nob);
>  		goto fail_free_eq;
> @@ -3401,7 +3424,8 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  	}
>  
>  	if (nob < LNET_PING_INFO_SIZE(0)) {
> -		CERROR("%s: Short reply %d(%d min)\n", libcfs_id2str(id),
> +		CERROR("%s: Short reply %d(%d min)\n",
> +		       libcfs_id2str(id),
>  		       nob, (int)LNET_PING_INFO_SIZE(0));
>  		goto fail_free_eq;
>  	}
> @@ -3410,12 +3434,13 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  		n_ids = pbuf->pb_info.pi_nnis;
>  
>  	if (nob < LNET_PING_INFO_SIZE(n_ids)) {
> -		CERROR("%s: Short reply %d(%d expected)\n", libcfs_id2str(id),
> +		CERROR("%s: Short reply %d(%d expected)\n",
> +		       libcfs_id2str(id),
>  		       nob, (int)LNET_PING_INFO_SIZE(n_ids));
>  		goto fail_free_eq;
>  	}
>  
> -	rc = -EFAULT;			   /* If I SEGV... */
> +	rc = -EFAULT;		/* if I segv in copy_to_user()... */
>  
>  	memset(&tmpid, 0, sizeof(tmpid));
>  	for (i = 0; i < n_ids; i++) {
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 8dff3b767577..95f72ae39a89 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -2905,7 +2905,7 @@ static struct lnet_peer *lnet_peer_dc_timed_out(time64_t now)
>  		return NULL;
>  	lp = list_first_entry(&the_lnet.ln_dc_working,
>  			      struct lnet_peer, lp_dc_list);
> -	if (now < lp->lp_last_queued + DISCOVERY_TIMEOUT)
> +	if (now < lp->lp_last_queued + DEFAULT_PEER_TIMEOUT)
>  		return NULL;
>  	return lp;
>  }
> 
> 
> 
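The timeout handling shared by the IOC_LIBCFS_PING and IOC_LIBCFS_PING_PEER cases above can be checked in isolation: non-positive or over-large millisecond values fall back to the 3-minute default. A user-space sketch of that clamping (HZ and msecs_to_jiffies() are mocked here so that 1 jiffy == 1 ms; the real kernel conversion differs):

```c
#include <assert.h>
#include <stdint.h>

#define MSEC_PER_SEC		1000L
#define DEFAULT_PEER_TIMEOUT	180	/* seconds, as in lib-lnet.h */

/* Illustrative stand-ins: the kernel converts via msecs_to_jiffies()
 * and HZ; pinning 1 jiffy == 1 ms keeps the logic easy to verify. */
#define HZ			MSEC_PER_SEC
static long msecs_to_jiffies(uint32_t ms) { return (long)ms; }

/* Mirror of the clamping in the IOC_LIBCFS_PING/PING_PEER handlers:
 * a non-positive value (interpreted as signed) or one above
 * DEFAULT_PEER_TIMEOUT seconds yields the 3-minute default. */
static long ping_timeout_jiffies(uint32_t ms)
{
	if ((int32_t)ms <= 0 || ms > DEFAULT_PEER_TIMEOUT * MSEC_PER_SEC)
		return DEFAULT_PEER_TIMEOUT * HZ;
	return msecs_to_jiffies(ms);
}
```

Note that the signed cast is what makes "negative" userspace values (passed through an unsigned field) hit the default branch rather than being treated as huge timeouts.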


* [lustre-devel] [PATCH 21/24] lustre: lnet: add "lnetctl discover"
  2018-10-07 23:19 ` [lustre-devel] [PATCH 21/24] lustre: lnet: add "lnetctl discover" NeilBrown
@ 2018-10-14 23:45   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 23:45 UTC (permalink / raw)
  To: lustre-devel


> From: Sonia Sharma <sonia.sharma@intel.com>
> 
> Add a "discover" subcommand to lnetctl
> 
> jt_discover() in lnetctl.c calls lustre_lnet_discover_nid()
> to implement "lnetctl discover". The output is similar to
> that of the "lnetctl ping" command.
> This patch also does some cleanup in liblnetconfig.c:
> for the parameters under global settings, the common
> code is pulled into the functions ioctl_set_value() and
> ioctl_show_global_values().

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Sonia Sharma <sonia.sharma@intel.com>
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25793
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    2 
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |  100 ++++++++++++++++++++
>  2 files changed, 101 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> index 3d89202bd396..60bc9713923e 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> @@ -113,7 +113,7 @@ struct libcfs_debug_ioctl_data {
>  #define IOC_LIBCFS_DEL_PEER		   _IOWR('e', 74, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_ADD_PEER		   _IOWR('e', 75, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_GET_PEER		   _IOWR('e', 76, IOCTL_LIBCFS_TYPE)
> -/* ioctl 77 is free for use */
> +#define IOC_LIBCFS_DISCOVER                _IOWR('e', 77, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_ADD_INTERFACE	   _IOWR('e', 78, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_DEL_INTERFACE	   _IOWR('e', 79, IOCTL_LIBCFS_TYPE)
>  #define IOC_LIBCFS_GET_INTERFACE	   _IOWR('e', 80, IOCTL_LIBCFS_TYPE)
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index 37f47bd1511f..0511c6acb9b1 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -104,6 +104,9 @@ static atomic_t lnet_dlc_seq_no = ATOMIC_INIT(0);
>  static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  		     struct lnet_process_id __user *ids, int n_ids);
>  
> +static int lnet_discover(struct lnet_process_id id, __u32 force,
> +			 struct lnet_process_id __user *ids, int n_ids);
> +
>  static int
>  discovery_set(const char *val, const struct kernel_param *kp)
>  {
> @@ -3225,6 +3228,25 @@ LNetCtl(unsigned int cmd, void *arg)
>  		return 0;
>  	}
>  
> +	case IOC_LIBCFS_DISCOVER: {
> +		struct lnet_ioctl_ping_data *discover = arg;
> +		struct lnet_peer *lp;
> +
> +		rc = lnet_discover(discover->ping_id, discover->op_param,
> +				   discover->ping_buf,
> +				   discover->ping_count);
> +		if (rc < 0)
> +			return rc;
> +		lp = lnet_find_peer(discover->ping_id.nid);
> +		if (lp) {
> +			discover->ping_id.nid = lp->lp_primary_nid;
> +			discover->mr_info = lnet_peer_is_multi_rail(lp);
> +		}
> +
> +		discover->ping_count = rc;
> +		return 0;
> +	}
> +
>  	default:
>  		ni = lnet_net2ni_addref(data->ioc_net);
>  		if (!ni)
> @@ -3461,3 +3483,81 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
>  	lnet_ping_buffer_decref(pbuf);
>  	return rc;
>  }
> +
> +static int
> +lnet_discover(struct lnet_process_id id, __u32 force,
> +	      struct lnet_process_id __user *ids,
> +	      int n_ids)
> +{
> +	struct lnet_peer_ni *lpni;
> +	struct lnet_peer_ni *p;
> +	struct lnet_peer *lp;
> +	struct lnet_process_id *buf;
> +	int cpt;
> +	int i;
> +	int rc;
> +	int max_intf = lnet_interfaces_max;
> +
> +	if (n_ids <= 0 ||
> +	    id.nid == LNET_NID_ANY ||
> +	    n_ids > max_intf)
> +		return -EINVAL;
> +
> +	if (id.pid == LNET_PID_ANY)
> +		id.pid = LNET_PID_LUSTRE;
> +
> +	buf = kcalloc(n_ids, sizeof(*buf), GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	cpt = lnet_net_lock_current();
> +	lpni = lnet_nid2peerni_locked(id.nid, LNET_NID_ANY, cpt);
> +	if (IS_ERR(lpni)) {
> +		rc = PTR_ERR(lpni);
> +		goto out;
> +	}
> +
> +	/*
> +	 * Clearing the NIDS_UPTODATE flag ensures the peer will
> +	 * be discovered, provided discovery has not been disabled.
> +	 */
> +	lp = lpni->lpni_peer_net->lpn_peer;
> +	spin_lock(&lp->lp_lock);
> +	lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
> +	/* If the force flag is set, force a PING and PUSH as well. */
> +	if (force)
> +		lp->lp_state |= LNET_PEER_FORCE_PING | LNET_PEER_FORCE_PUSH;
> +	spin_unlock(&lp->lp_lock);
> +	rc = lnet_discover_peer_locked(lpni, cpt, true);
> +	if (rc)
> +		goto out_decref;
> +
> +	/* Peer may have changed. */
> +	lp = lpni->lpni_peer_net->lpn_peer;
> +	if (lp->lp_nnis < n_ids)
> +		n_ids = lp->lp_nnis;
> +
> +	i = 0;
> +	p = NULL;
> +	while ((p = lnet_get_next_peer_ni_locked(lp, NULL, p)) != NULL) {
> +		buf[i].pid = id.pid;
> +		buf[i].nid = p->lpni_nid;
> +		if (++i >= n_ids)
> +			break;
> +	}
> +
> +	lnet_net_unlock(cpt);
> +
> +	rc = -EFAULT;
> +	if (copy_to_user(ids, buf, n_ids * sizeof(*buf)))
> +		goto out_relock;
> +	rc = n_ids;
> +out_relock:
> +	lnet_net_lock(cpt);
> +out_decref:
> +	lnet_peer_ni_decref_locked(lpni);
> +out:
> +	lnet_net_unlock(cpt);
> +	kfree(buf);
> +	return rc;
> +}
> 
> 
> 
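The force path in lnet_discover() above is a pure state-bit update under lp_lock: the NIDS_UPTODATE bit is always cleared so discovery will run, and force additionally requests a fresh PING and PUSH. A sketch of that update with illustrative flag values (the real LNET_PEER_* bits are defined in lib-types.h and may differ):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag values for illustration only. */
#define PEER_NIDS_UPTODATE	(1u << 0)
#define PEER_FORCE_PING		(1u << 1)
#define PEER_FORCE_PUSH		(1u << 2)

/* Mirror of the state update at the top of lnet_discover(): always
 * mark the peer's NID list stale; on force, also request a PING and
 * a PUSH. In the kernel this runs under the peer's lp_lock. */
static uint32_t discover_prep_state(uint32_t state, int force)
{
	state &= ~PEER_NIDS_UPTODATE;
	if (force)
		state |= PEER_FORCE_PING | PEER_FORCE_PUSH;
	return state;
}
```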


* [lustre-devel] [PATCH 22/24] lustre: lnet: add enhanced statistics
  2018-10-07 23:19 ` [lustre-devel] [PATCH 22/24] lustre: lnet: add enhanced statistics NeilBrown
@ 2018-10-14 23:50   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 23:50 UTC (permalink / raw)
  To: lustre-devel


> From: Amir Shehata <amir.shehata@intel.com>
> 
> Added statistics to track the different types of
> LNet messages that are sent, received, or dropped.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/25795
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   12 ++
>  .../staging/lustre/include/linux/lnet/lib-types.h  |   20 +++
>  .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    3 -
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   45 +++++++-
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |  116 +++++++++++++++++++-
>  drivers/staging/lustre/lnet/lnet/lib-msg.c         |   16 ++-
>  drivers/staging/lustre/lnet/lnet/net_fault.c       |    3 -
>  drivers/staging/lustre/lnet/lnet/peer.c            |   26 +++-
>  8 files changed, 217 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index adb4d0551ef5..91980f60a50d 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -575,7 +575,7 @@ void lnet_set_reply_msg_len(struct lnet_ni *ni, struct lnet_msg *msg,
>  void lnet_finalize(struct lnet_msg *msg, int rc);
>  
>  void lnet_drop_message(struct lnet_ni *ni, int cpt, void *private,
> -		       unsigned int nob);
> +		       unsigned int nob, __u32 msg_type);
>  void lnet_drop_delayed_msg_list(struct list_head *head, char *reason);
>  void lnet_recv_delayed_msg_list(struct list_head *head);
>  
> @@ -825,4 +825,14 @@ lnet_peer_needs_push(struct lnet_peer *lp)
>  	return false;
>  }
>  
> +void lnet_incr_stats(struct lnet_element_stats *stats,
> +		     enum lnet_msg_type msg_type,
> +		     enum lnet_stats_type stats_type);
> +
> +__u32 lnet_sum_stats(struct lnet_element_stats *stats,
> +		     enum lnet_stats_type stats_type);
> +
> +void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
> +			      struct lnet_element_stats *stats);
> +
>  #endif
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> index 8543a67420d7..19f7b11a1e44 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
> @@ -279,10 +279,24 @@ enum lnet_ni_state {
>  	LNET_NI_STATE_DELETING
>  };
>  
> +enum lnet_stats_type {
> +	LNET_STATS_TYPE_SEND = 0,
> +	LNET_STATS_TYPE_RECV,
> +	LNET_STATS_TYPE_DROP
> +};
> +
> +struct lnet_comm_count {
> +	atomic_t co_get_count;
> +	atomic_t co_put_count;
> +	atomic_t co_reply_count;
> +	atomic_t co_ack_count;
> +	atomic_t co_hello_count;
> +};
> +
>  struct lnet_element_stats {
> -	atomic_t	send_count;
> -	atomic_t	recv_count;
> -	atomic_t	drop_count;
> +	struct lnet_comm_count el_send_stats;
> +	struct lnet_comm_count el_recv_stats;
> +	struct lnet_comm_count el_drop_stats;
>  };
>  
>  struct lnet_net {
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> index 60bc9713923e..4590f65c333f 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_ioctl.h
> @@ -145,6 +145,7 @@ struct libcfs_debug_ioctl_data {
>  #define IOC_LIBCFS_SET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 98, IOCTL_CONFIG_SIZE)
>  #define IOC_LIBCFS_GET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 99, IOCTL_CONFIG_SIZE)
>  #define IOC_LIBCFS_GET_PEER_LIST	_IOWR(IOC_LIBCFS_TYPE, 100, IOCTL_CONFIG_SIZE)
> -#define IOC_LIBCFS_MAX_NR		100
> +#define IOC_LIBCFS_GET_LOCAL_NI_MSG_STATS  _IOWR(IOC_LIBCFS_TYPE, 101, IOCTL_CONFIG_SIZE)
> +#define IOC_LIBCFS_MAX_NR		101
>  
>  #endif /* __LIBCFS_IOCTL_H__ */
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index 0511c6acb9b1..0852118bf803 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -2263,8 +2263,12 @@ lnet_fill_ni_info(struct lnet_ni *ni, struct lnet_ioctl_config_ni *cfg_ni,
>  	memcpy(&tun->lt_cmn, &ni->ni_net->net_tunables, sizeof(tun->lt_cmn));
>  
>  	if (stats) {
> -		stats->iel_send_count = atomic_read(&ni->ni_stats.send_count);
> -		stats->iel_recv_count = atomic_read(&ni->ni_stats.recv_count);
> +		stats->iel_send_count = lnet_sum_stats(&ni->ni_stats,
> +						       LNET_STATS_TYPE_SEND);
> +		stats->iel_recv_count = lnet_sum_stats(&ni->ni_stats,
> +						       LNET_STATS_TYPE_RECV);
> +		stats->iel_drop_count = lnet_sum_stats(&ni->ni_stats,
> +						       LNET_STATS_TYPE_DROP);
>  	}
>  
>  	/*
> @@ -2491,6 +2495,29 @@ lnet_get_ni_config(struct lnet_ioctl_config_ni *cfg_ni,
>  	return rc;
>  }
>  
> +int lnet_get_ni_stats(struct lnet_ioctl_element_msg_stats *msg_stats)
> +{
> +	struct lnet_ni *ni;
> +	int cpt;
> +	int rc = -ENOENT;
> +
> +	if (!msg_stats)
> +		return -EINVAL;
> +
> +	cpt = lnet_net_lock_current();
> +
> +	ni = lnet_get_ni_idx_locked(msg_stats->im_idx);
> +
> +	if (ni) {
> +		lnet_usr_translate_stats(msg_stats, &ni->ni_stats);
> +		rc = 0;
> +	}
> +
> +	lnet_net_unlock(cpt);
> +
> +	return rc;
> +}
> +
>  static int lnet_add_net_common(struct lnet_net *net,
>  			       struct lnet_ioctl_config_lnd_tunables *tun)
>  {
> @@ -2956,6 +2983,7 @@ LNetCtl(unsigned int cmd, void *arg)
>  		__u32 tun_size;
>  
>  		cfg_ni = arg;
> +
>  		/* get the tunables if they are available */
>  		if (cfg_ni->lic_cfg_hdr.ioc_len <
>  		    sizeof(*cfg_ni) + sizeof(*stats) + sizeof(*tun))
> @@ -2975,6 +3003,19 @@ LNetCtl(unsigned int cmd, void *arg)
>  		return rc;
>  	}
>  
> +	case IOC_LIBCFS_GET_LOCAL_NI_MSG_STATS: {
> +		struct lnet_ioctl_element_msg_stats *msg_stats = arg;
> +
> +		if (msg_stats->im_hdr.ioc_len != sizeof(*msg_stats))
> +			return -EINVAL;
> +
> +		mutex_lock(&the_lnet.ln_api_mutex);
> +		rc = lnet_get_ni_stats(msg_stats);
> +		mutex_unlock(&the_lnet.ln_api_mutex);
> +
> +		return rc;
> +	}
> +
>  	case IOC_LIBCFS_GET_NET: {
>  		size_t total = sizeof(*config) +
>  			       sizeof(struct lnet_ioctl_net_config);
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
> index 2ff329bf91ba..5694d85c713c 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-move.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
> @@ -45,6 +45,104 @@ static int local_nid_dist_zero = 1;
>  module_param(local_nid_dist_zero, int, 0444);
>  MODULE_PARM_DESC(local_nid_dist_zero, "Reserved");
>  
> +static inline struct lnet_comm_count *
> +get_stats_counts(struct lnet_element_stats *stats,
> +		 enum lnet_stats_type stats_type)
> +{
> +	switch (stats_type) {
> +	case LNET_STATS_TYPE_SEND:
> +		return &stats->el_send_stats;
> +	case LNET_STATS_TYPE_RECV:
> +		return &stats->el_recv_stats;
> +	case LNET_STATS_TYPE_DROP:
> +		return &stats->el_drop_stats;
> +	default:
> +		CERROR("Unknown stats type\n");
> +	}
> +
> +	return NULL;
> +}
> +
> +void lnet_incr_stats(struct lnet_element_stats *stats,
> +		     enum lnet_msg_type msg_type,
> +		     enum lnet_stats_type stats_type)
> +{
> +	struct lnet_comm_count *counts = get_stats_counts(stats, stats_type);
> +
> +	if (!counts)
> +		return;
> +
> +	switch (msg_type) {
> +	case LNET_MSG_ACK:
> +		atomic_inc(&counts->co_ack_count);
> +		break;
> +	case LNET_MSG_PUT:
> +		atomic_inc(&counts->co_put_count);
> +		break;
> +	case LNET_MSG_GET:
> +		atomic_inc(&counts->co_get_count);
> +		break;
> +	case LNET_MSG_REPLY:
> +		atomic_inc(&counts->co_reply_count);
> +		break;
> +	case LNET_MSG_HELLO:
> +		atomic_inc(&counts->co_hello_count);
> +		break;
> +	default:
> +		CERROR("There is a BUG in the code. Unknown message type\n");
> +		break;
> +	}
> +}
> +
> +__u32 lnet_sum_stats(struct lnet_element_stats *stats,
> +		     enum lnet_stats_type stats_type)
> +{
> +	struct lnet_comm_count *counts = get_stats_counts(stats, stats_type);
> +
> +	if (!counts)
> +		return 0;
> +
> +	return (atomic_read(&counts->co_ack_count) +
> +		atomic_read(&counts->co_put_count) +
> +		atomic_read(&counts->co_get_count) +
> +		atomic_read(&counts->co_reply_count) +
> +		atomic_read(&counts->co_hello_count));
> +}
> +
> +static inline void assign_stats(struct lnet_ioctl_comm_count *msg_stats,
> +				struct lnet_comm_count *counts)
> +{
> +	msg_stats->ico_get_count = atomic_read(&counts->co_get_count);
> +	msg_stats->ico_put_count = atomic_read(&counts->co_put_count);
> +	msg_stats->ico_reply_count = atomic_read(&counts->co_reply_count);
> +	msg_stats->ico_ack_count = atomic_read(&counts->co_ack_count);
> +	msg_stats->ico_hello_count = atomic_read(&counts->co_hello_count);
> +}
> +
> +void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
> +			      struct lnet_element_stats *stats)
> +{
> +	struct lnet_comm_count *counts;
> +
> +	LASSERT(msg_stats);
> +	LASSERT(stats);
> +
> +	counts = get_stats_counts(stats, LNET_STATS_TYPE_SEND);
> +	if (!counts)
> +		return;
> +	assign_stats(&msg_stats->im_send_stats, counts);
> +
> +	counts = get_stats_counts(stats, LNET_STATS_TYPE_RECV);
> +	if (!counts)
> +		return;
> +	assign_stats(&msg_stats->im_recv_stats, counts);
> +
> +	counts = get_stats_counts(stats, LNET_STATS_TYPE_DROP);
> +	if (!counts)
> +		return;
> +	assign_stats(&msg_stats->im_drop_stats, counts);
> +}
> +
>  int
>  lnet_fail_nid(lnet_nid_t nid, unsigned int threshold)
>  {
> @@ -632,9 +730,13 @@ lnet_post_send_locked(struct lnet_msg *msg, int do_send)
>  		the_lnet.ln_counters[cpt]->drop_length += msg->msg_len;
>  		lnet_net_unlock(cpt);
>  		if (msg->msg_txpeer)
> -			atomic_inc(&msg->msg_txpeer->lpni_stats.drop_count);
> +			lnet_incr_stats(&msg->msg_txpeer->lpni_stats,
> +					msg->msg_type,
> +					LNET_STATS_TYPE_DROP);
>  		if (msg->msg_txni)
> -			atomic_inc(&msg->msg_txni->ni_stats.drop_count);
> +			lnet_incr_stats(&msg->msg_txni->ni_stats,
> +					msg->msg_type,
> +					LNET_STATS_TYPE_DROP);
>  
>  		CNETERR("Dropping message for %s: peer not alive\n",
>  			libcfs_id2str(msg->msg_target));
> @@ -1859,9 +1961,11 @@ lnet_send(lnet_nid_t src_nid, struct lnet_msg *msg, lnet_nid_t rtr_nid)
>  }
>  
>  void
> -lnet_drop_message(struct lnet_ni *ni, int cpt, void *private, unsigned int nob)
> +lnet_drop_message(struct lnet_ni *ni, int cpt, void *private, unsigned int nob,
> +		  __u32 msg_type)
>  {
>  	lnet_net_lock(cpt);
> +	lnet_incr_stats(&ni->ni_stats, msg_type, LNET_STATS_TYPE_DROP);
>  	the_lnet.ln_counters[cpt]->drop_count++;
>  	the_lnet.ln_counters[cpt]->drop_length += nob;
>  	lnet_net_unlock(cpt);
> @@ -2510,7 +2614,7 @@ lnet_parse(struct lnet_ni *ni, struct lnet_hdr *hdr, lnet_nid_t from_nid,
>  	lnet_finalize(msg, rc);
>  
>   drop:
> -	lnet_drop_message(ni, cpt, private, payload_length);
> +	lnet_drop_message(ni, cpt, private, payload_length, type);
>  	return 0;
>  }
>  EXPORT_SYMBOL(lnet_parse);
> @@ -2546,7 +2650,8 @@ lnet_drop_delayed_msg_list(struct list_head *head, char *reason)
>  		 * until that's done
>  		 */
>  		lnet_drop_message(msg->msg_rxni, msg->msg_rx_cpt,
> -				  msg->msg_private, msg->msg_len);
> +				  msg->msg_private, msg->msg_len,
> +				  msg->msg_type);
>  		/*
>  		 * NB: message will not generate event because w/o attached MD,
>  		 * but we still should give error code so lnet_msg_decommit()
> @@ -2786,6 +2891,7 @@ lnet_create_reply_msg(struct lnet_ni *ni, struct lnet_msg *getmsg)
>  	cpt = lnet_cpt_of_nid(peer_id.nid, ni);
>  
>  	lnet_net_lock(cpt);
> +	lnet_incr_stats(&ni->ni_stats, LNET_MSG_GET, LNET_STATS_TYPE_DROP);
>  	the_lnet.ln_counters[cpt]->drop_count++;
>  	the_lnet.ln_counters[cpt]->drop_length += getmd->md_length;
>  	lnet_net_unlock(cpt);
> diff --git a/drivers/staging/lustre/lnet/lnet/lib-msg.c b/drivers/staging/lustre/lnet/lnet/lib-msg.c
> index db13d01d366f..7f58cfe25bc2 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-msg.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-msg.c
> @@ -219,9 +219,13 @@ lnet_msg_decommit_tx(struct lnet_msg *msg, int status)
>  
>  incr_stats:
>  	if (msg->msg_txpeer)
> -		atomic_inc(&msg->msg_txpeer->lpni_stats.send_count);
> +		lnet_incr_stats(&msg->msg_txpeer->lpni_stats,
> +				msg->msg_type,
> +				LNET_STATS_TYPE_SEND);
>  	if (msg->msg_txni)
> -		atomic_inc(&msg->msg_txni->ni_stats.send_count);
> +		lnet_incr_stats(&msg->msg_txni->ni_stats,
> +				msg->msg_type,
> +				LNET_STATS_TYPE_SEND);
>   out:
>  	lnet_return_tx_credits_locked(msg);
>  	msg->msg_tx_committed = 0;
> @@ -280,9 +284,13 @@ lnet_msg_decommit_rx(struct lnet_msg *msg, int status)
>  
>  incr_stats:
>  	if (msg->msg_rxpeer)
> -		atomic_inc(&msg->msg_rxpeer->lpni_stats.recv_count);
> +		lnet_incr_stats(&msg->msg_rxpeer->lpni_stats,
> +				msg->msg_type,
> +				LNET_STATS_TYPE_RECV);
>  	if (msg->msg_rxni)
> -		atomic_inc(&msg->msg_rxni->ni_stats.recv_count);
> +		lnet_incr_stats(&msg->msg_rxni->ni_stats,
> +				msg->msg_type,
> +				LNET_STATS_TYPE_RECV);
>  	if (ev->type == LNET_EVENT_PUT || ev->type == LNET_EVENT_REPLY)
>  		counters->recv_length += msg->msg_wanted;
>  
> diff --git a/drivers/staging/lustre/lnet/lnet/net_fault.c b/drivers/staging/lustre/lnet/lnet/net_fault.c
> index 3841bac1aa0a..e2c746855da9 100644
> --- a/drivers/staging/lustre/lnet/lnet/net_fault.c
> +++ b/drivers/staging/lustre/lnet/lnet/net_fault.c
> @@ -632,7 +632,8 @@ delayed_msg_process(struct list_head *msg_list, bool drop)
>  			}
>  		}
>  
> -		lnet_drop_message(ni, cpt, msg->msg_private, msg->msg_len);
> +		lnet_drop_message(ni, cpt, msg->msg_private, msg->msg_len,
> +				  msg->msg_type);
>  		lnet_finalize(msg, rc);
>  	}
>  }
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 95f72ae39a89..03c1c34517e4 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -3301,6 +3301,7 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
>  		       void __user *bulk)
>  {
>  	struct lnet_ioctl_element_stats *lpni_stats;
> +	struct lnet_ioctl_element_msg_stats *lpni_msg_stats;
>  	struct lnet_peer_ni_credit_info *lpni_info;
>  	struct lnet_peer_ni *lpni;
>  	struct lnet_peer *lp;
> @@ -3315,7 +3316,8 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
>  		goto out;
>  	}
>  
> -	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats);
> +	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats)
> +		+ sizeof(*lpni_msg_stats);
>  	size *= lp->lp_nnis;
>  	if (size > *sizep) {
>  		*sizep = size;
> @@ -3337,13 +3339,17 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
>  	lpni_stats = kzalloc(sizeof(*lpni_stats), GFP_KERNEL);
>  	if (!lpni_stats)
>  		goto out_free_info;
> +	lpni_msg_stats = kzalloc(sizeof(*lpni_msg_stats), GFP_KERNEL);
> +	if (!lpni_msg_stats)
> +		goto out_free_stats;
> +
>  
>  	lpni = NULL;
>  	rc = -EFAULT;
>  	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
>  		nid = lpni->lpni_nid;
>  		if (copy_to_user(bulk, &nid, sizeof(nid)))
> -			goto out_free_stats;
> +			goto out_free_msg_stats;
>  		bulk += sizeof(nid);
>  
>  		memset(lpni_info, 0, sizeof(*lpni_info));
> @@ -3362,22 +3368,28 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
>  		lpni_info->cr_peer_min_tx_credits = lpni->lpni_mintxcredits;
>  		lpni_info->cr_peer_tx_qnob = lpni->lpni_txqnob;
>  		if (copy_to_user(bulk, lpni_info, sizeof(*lpni_info)))
> -			goto out_free_stats;
> +			goto out_free_msg_stats;
>  		bulk += sizeof(*lpni_info);
>  
>  		memset(lpni_stats, 0, sizeof(*lpni_stats));
>  		lpni_stats->iel_send_count =
> -			atomic_read(&lpni->lpni_stats.send_count);
> +			lnet_sum_stats(&lpni->lpni_stats, LNET_STATS_TYPE_SEND);
>  		lpni_stats->iel_recv_count =
> -			atomic_read(&lpni->lpni_stats.recv_count);
> +			lnet_sum_stats(&lpni->lpni_stats, LNET_STATS_TYPE_RECV);
>  		lpni_stats->iel_drop_count =
> -			atomic_read(&lpni->lpni_stats.drop_count);
> +			lnet_sum_stats(&lpni->lpni_stats, LNET_STATS_TYPE_DROP);
>  		if (copy_to_user(bulk, lpni_stats, sizeof(*lpni_stats)))
> -			goto out_free_stats;
> +			goto out_free_msg_stats;
>  		bulk += sizeof(*lpni_stats);
> +		lnet_usr_translate_stats(lpni_msg_stats, &lpni->lpni_stats);
> +		if (copy_to_user(bulk, lpni_msg_stats, sizeof(*lpni_msg_stats)))
> +			goto out_free_msg_stats;
> +		bulk += sizeof(*lpni_msg_stats);
>  	}
>  	rc = 0;
>  
> +out_free_msg_stats:
> +	kfree(lpni_msg_stats);
>  out_free_stats:
>  	kfree(lpni_stats);
>  out_free_info:
> 
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 23/24] lustre: lnet: show peer state
  2018-10-07 23:19 ` [lustre-devel] [PATCH 23/24] lustre: lnet: show peer state NeilBrown
@ 2018-10-14 23:52   ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 23:52 UTC (permalink / raw)
  To: lustre-devel


> From: Amir Shehata <amir.shehata@intel.com>
> 
> It is important to show the peer state when debugging.
> This patch exports the peer state from the kernel to
> user space; it is shown when the detail level requested
> in the peer show command is >= 3.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Reviewed-on: https://review.whamcloud.com/26130
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |    4 +---
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |    6 +-----
>  drivers/staging/lustre/lnet/lnet/peer.c            |   21 ++++++++++----------
>  3 files changed, 12 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> index 91980f60a50d..fcfd844e0162 100644
> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> @@ -733,9 +733,7 @@ bool lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid);
>  int lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni, lnet_nid_t nid);
>  int lnet_add_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid, bool mr);
>  int lnet_del_peer_ni(lnet_nid_t key_nid, lnet_nid_t nid);
> -int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nid,
> -		       __u32 *nnis, bool *mr, __u32 *sizep,
> -		       void __user *bulk);
> +int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk);
>  int lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
>  			  char alivness[LNET_MAX_STR_LEN],
>  			  __u32 *cpt_iter, __u32 *refcount,
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index 0852118bf803..e2c86b8279e5 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -3166,11 +3166,7 @@ LNetCtl(unsigned int cmd, void *arg)
>  			return -EINVAL;
>  
>  		mutex_lock(&the_lnet.ln_api_mutex);
> -		rc = lnet_get_peer_info(&cfg->prcfg_prim_nid,
> -					&cfg->prcfg_cfg_nid,
> -					&cfg->prcfg_count,
> -					&cfg->prcfg_mr,
> -					&cfg->prcfg_size,
> +		rc = lnet_get_peer_info(cfg,
>  					(void __user *)cfg->prcfg_bulk);
>  		mutex_unlock(&the_lnet.ln_api_mutex);
>  		return rc;
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 03c1c34517e4..5f61fca09f44 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -3296,9 +3296,7 @@ lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
>  }
>  
>  /* ln_api_mutex is held, which keeps the peer list stable */
> -int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
> -		       __u32 *nnis, bool *mr, __u32 *sizep,
> -		       void __user *bulk)
> +int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
>  {
>  	struct lnet_ioctl_element_stats *lpni_stats;
>  	struct lnet_ioctl_element_msg_stats *lpni_msg_stats;
> @@ -3309,7 +3307,7 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
>  	__u32 size;
>  	int rc;
>  
> -	lp = lnet_find_peer(*primary_nid);
> +	lp = lnet_find_peer(cfg->prcfg_prim_nid);
>  
>  	if (!lp) {
>  		rc = -ENOENT;
> @@ -3319,17 +3317,18 @@ int lnet_get_peer_info(lnet_nid_t *primary_nid, lnet_nid_t *nidp,
>  	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats)
>  		+ sizeof(*lpni_msg_stats);
>  	size *= lp->lp_nnis;
> -	if (size > *sizep) {
> -		*sizep = size;
> +	if (size > cfg->prcfg_size) {
> +		cfg->prcfg_size = size;
>  		rc = -E2BIG;
>  		goto out_lp_decref;
>  	}
>  
> -	*primary_nid = lp->lp_primary_nid;
> -	*mr = lnet_peer_is_multi_rail(lp);
> -	*nidp = lp->lp_primary_nid;
> -	*nnis = lp->lp_nnis;
> -	*sizep = size;
> +	cfg->prcfg_prim_nid = lp->lp_primary_nid;
> +	cfg->prcfg_mr = lnet_peer_is_multi_rail(lp);
> +	cfg->prcfg_cfg_nid = lp->lp_primary_nid;
> +	cfg->prcfg_count = lp->lp_nnis;
> +	cfg->prcfg_size = size;
> +	cfg->prcfg_state = lp->lp_state;
>  
>  	/* Allocate helper buffers. */
>  	rc = -ENOMEM;
> 
> 
> 


* [lustre-devel] [PATCH 24/24] lustre: lnet: balance references in lnet_discover_peer_locked()
  2018-10-07 23:19 ` [lustre-devel] [PATCH 24/24] lustre: lnet: balance references in lnet_discover_peer_locked() NeilBrown
@ 2018-10-14 23:53   ` James Simmons
  2018-10-14 23:54   ` James Simmons
  1 sibling, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-14 23:53 UTC (permalink / raw)
  To: lustre-devel


> From: John L. Hammond <john.hammond@intel.com>
> 
> In lnet_discover_peer_locked() avoid a leaked reference to the peer in
> the non-blocking discovery case.

Reviewed-by: James Simmons <jsimmons@infradead.org>
 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9913
> Signed-off-by: John L. Hammond <john.hammond@intel.com>
> Reviewed-on: https://review.whamcloud.com/28695
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Quentin Bouget <quentin.bouget@cea.fr>
> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/staging/lustre/lnet/lnet/peer.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
> index 5f61fca09f44..db36b5cf31e1 100644
> --- a/drivers/staging/lustre/lnet/lnet/peer.c
> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> @@ -2010,7 +2010,6 @@ lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block)
>  		if (lnet_peer_is_uptodate(lp))
>  			break;
>  		lnet_peer_queue_for_discovery(lp);
> -		lnet_peer_addref_locked(lp);
>  		/*
>  		 * if caller requested a non-blocking operation then
>  		 * return immediately. Once discovery is complete then the
> @@ -2019,6 +2018,8 @@ lnet_discover_peer_locked(struct lnet_peer_ni *lpni, int cpt, bool block)
>  		 */
>  		if (!block)
>  			break;
> +
> +		lnet_peer_addref_locked(lp);
>  		lnet_net_unlock(LNET_LOCK_EX);
>  		schedule();
>  		finish_wait(&lp->lp_dc_waitq, &wait);
> 
> 
> 


* [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging
  2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
                   ` (23 preceding siblings ...)
  2018-10-07 23:19 ` [lustre-devel] [PATCH 22/24] lustre: lnet: add enhanced statistics NeilBrown
@ 2018-10-14 23:54 ` James Simmons
  2018-10-17  5:20   ` NeilBrown
  24 siblings, 1 reply; 57+ messages in thread
From: James Simmons @ 2018-10-14 23:54 UTC (permalink / raw)
  To: lustre-devel


> This is a port of the "Dynamic Discovery" series
> (756abb9cf00b936b3..1c45d9051764e0637ba90b3)
> to my mainline-linux-with-lustre tree.
> It is all fairly straightforward, but I don't think I have the
> hardware to test it properly.  And review never hurts.
> 
> This is all in my lustre-testing branch.

Only one patch was incorrect, but the version in lustre-testing is fine.
Testing has shown no problems.
 
> Thanks,
> NeilBrown
> 
> ---
> 
> Amir Shehata (2):
>       lustre: lnet: add enhanced statistics
>       lustre: lnet: show peer state
> 
> John L. Hammond (1):
>       lustre: lnet: balance references in lnet_discover_peer_locked()
> 
> Olaf Weber (20):
>       lustre: lnet: add lnet_interfaces_max tunable
>       lustre: lnet: configure lnet_interfaces_max tunable from dlc
>       lustre: lnet: add struct lnet_ping_buffer
>       lustre: lnet: automatic sizing of router pinger buffers
>       lustre: lnet: add Multi-Rail and Discovery ping feature bits
>       lustre: lnet: add sanity checks on ping-related constants
>       lustre: lnet: cleanup of lnet_peer_ni_addref/decref_locked()
>       lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer()
>       lustre: lnet: refactor lnet_del_peer_ni()
>       lustre: lnet: refactor lnet_add_peer_ni()
>       lustre: lnet: introduce LNET_PEER_MULTI_RAIL flag bit
>       lustre: lnet: preferred NIs for non-Multi-Rail peers
>       lustre: lnet: add LNET_PEER_CONFIGURED flag
>       lustre: lnet: reference counts on lnet_peer/lnet_peer_net
>       lustre: lnet: add msg_type to lnet_event
>       lustre: lnet: add discovery thread
>       lustre: lnet: add the Push target
>       lustre: lnet: implement Peer Discovery
>       lustre: lnet: add "lnetctl peer list"
>       lustre: lnet: add "lnetctl ping" command
> 
> Sonia Sharma (1):
>       lustre: lnet: add "lnetctl discover"
> 
> 
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |  156 +
>  .../staging/lustre/include/linux/lnet/lib-types.h  |  258 ++
>  .../lustre/include/uapi/linux/lnet/libcfs_ioctl.h  |    8 
>  .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |   10 
>  .../lustre/include/uapi/linux/lnet/lnet-types.h    |   42 
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |    2 
>  .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |    2 
>  .../staging/lustre/lnet/klnds/socklnd/socklnd.c    |   22 
>  .../staging/lustre/lnet/klnds/socklnd/socklnd.h    |    4 
>  .../staging/lustre/lnet/klnds/socklnd/socklnd_cb.c |    2 
>  .../lustre/lnet/klnds/socklnd/socklnd_modparams.c  |    2 
>  .../lustre/lnet/klnds/socklnd/socklnd_proto.c      |    4 
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |  907 +++++-
>  drivers/staging/lustre/lnet/lnet/config.c          |   10 
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |  242 +-
>  drivers/staging/lustre/lnet/lnet/lib-msg.c         |   17 
>  drivers/staging/lustre/lnet/lnet/net_fault.c       |    3 
>  drivers/staging/lustre/lnet/lnet/peer.c            | 3002 +++++++++++++++++---
>  drivers/staging/lustre/lnet/lnet/router.c          |  174 +
>  19 files changed, 4056 insertions(+), 811 deletions(-)
> 
> --
> Signature
> 
> 


* [lustre-devel] [PATCH 14/24] lustre: lnet: reference counts on lnet_peer/lnet_peer_net
  2018-10-14 22:42   ` James Simmons
@ 2018-10-17  5:16     ` NeilBrown
  2018-10-20 16:47       ` James Simmons
  0 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2018-10-17  5:16 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

>> From: Olaf Weber <olaf@sgi.com>
>> 
>> Peer discovery will be keeping track of lnet_peer structures,
>> so there will be references to an lnet_peer independent of
>> the references implied by lnet_peer_ni structures. Manage
>> this by adding explicit reference counts to lnet_peer_net and
>> lnet_peer.
>> 
>> Each lnet_peer_net has a hold on the lnet_peer it links to
>> with its lpn_peer pointer. This hold is only removed when that
>> pointer is assigned a new value or the lnet_peer_net is freed.
>> Just removing an lnet_peer_net from the lp_peer_nets list does
>> not release this hold, it just prevents new lookups of the
>> lnet_peer_net via the lnet_peer.
>> 
>> Each lnet_peer_ni has a hold on the lnet_peer_net it links to
>> with its lpni_peer_net pointer. This hold is only removed when
>> that pointer is assigned a new value or the lnet_peer_ni is
>> freed. Just removing an lnet_peer_ni from the lpn_peer_nis
>> list does not release this hold, it just prevents new lookups
>> of the lnet_peer_ni via the lnet_peer_net.
>> 
>> This ensures that given a lnet_peer_ni *lpni, we can rely on
>> lpni->lpni_peer_net->lpn_peer pointing to a valid lnet_peer.
>> 
>> Keep a count of the total number of lnet_peer_ni attached to
>> an lnet_peer in lp_nnis.
>> 
>> Split the global ln_peers list into per-lnet_peer_table lists.
>> The CPT of the peer table in which the lnet_peer is linked is
>> stored in lp_cpt.
>> 
>> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
>> Signed-off-by: Olaf Weber <olaf@sgi.com>
>> Reviewed-on: https://review.whamcloud.com/25784
>> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
>> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
>> Tested-by: Amir Shehata <amir.shehata@intel.com>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   49 +++--
>>  .../staging/lustre/include/linux/lnet/lib-types.h  |   50 ++++-
>>  drivers/staging/lustre/lnet/lnet/api-ni.c          |    1 
>>  drivers/staging/lustre/lnet/lnet/lib-move.c        |    8 -
>>  drivers/staging/lustre/lnet/lnet/peer.c            |  210 ++++++++++++++------
>>  5 files changed, 227 insertions(+), 91 deletions(-)
>> 
>> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
>> index 563417510722..aad25eb0011b 100644
>> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
>> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
>> @@ -310,6 +310,36 @@ lnet_handle2me(struct lnet_handle_me *handle)
>>  	return lh_entry(lh, struct lnet_me, me_lh);
>>  }
>>  
>> +static inline void
>> +lnet_peer_net_addref_locked(struct lnet_peer_net *lpn)
>> +{
>> +	atomic_inc(&lpn->lpn_refcount);
>> +}
>> +
>> +void lnet_destroy_peer_net_locked(struct lnet_peer_net *lpn);
>> +
>> +static inline void
>> +lnet_peer_net_decref_locked(struct lnet_peer_net *lpn)
>> +{
>> +	if (atomic_dec_and_test(&lpn->lpn_refcount))
>> +		lnet_destroy_peer_net_locked(lpn);
>> +}
>> +
>> +static inline void
>> +lnet_peer_addref_locked(struct lnet_peer *lp)
>> +{
>> +	atomic_inc(&lp->lp_refcount);
>> +}
>> +
>> +void lnet_destroy_peer_locked(struct lnet_peer *lp);
>> +
>> +static inline void
>> +lnet_peer_decref_locked(struct lnet_peer *lp)
>> +{
>> +	if (atomic_dec_and_test(&lp->lp_refcount))
>> +		lnet_destroy_peer_locked(lp);
>> +}
>> +
>>  static inline void
>>  lnet_peer_ni_addref_locked(struct lnet_peer_ni *lp)
>>  {
>> @@ -695,21 +725,6 @@ int lnet_get_peer_ni_info(__u32 peer_index, __u64 *nid,
>>  			  __u32 *peer_rtr_credits, __u32 *peer_min_rtr_credtis,
>>  			  __u32 *peer_tx_qnob);
>>  
>> -static inline __u32
>> -lnet_get_num_peer_nis(struct lnet_peer *peer)
>> -{
>> -	struct lnet_peer_net *lpn;
>> -	struct lnet_peer_ni *lpni;
>> -	__u32 count = 0;
>> -
>> -	list_for_each_entry(lpn, &peer->lp_peer_nets, lpn_on_peer_list)
>> -		list_for_each_entry(lpni, &lpn->lpn_peer_nis,
>> -				    lpni_on_peer_net_list)
>> -			count++;
>> -
>> -	return count;
>> -}
>> -
>>  static inline bool
>>  lnet_is_peer_ni_healthy_locked(struct lnet_peer_ni *lpni)
>>  {
>> @@ -728,7 +743,7 @@ lnet_is_peer_net_healthy_locked(struct lnet_peer_net *peer_net)
>>  	struct lnet_peer_ni *lpni;
>>  
>>  	list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
>> -			    lpni_on_peer_net_list) {
>> +			    lpni_peer_nis) {
>>  		if (lnet_is_peer_ni_healthy_locked(lpni))
>>  			return true;
>>  	}
>> @@ -741,7 +756,7 @@ lnet_is_peer_healthy_locked(struct lnet_peer *peer)
>>  {
>>  	struct lnet_peer_net *peer_net;
>>  
>> -	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_on_peer_list) {
>> +	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
>>  		if (lnet_is_peer_net_healthy_locked(peer_net))
>>  			return true;
>>  	}
>> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
>> index d1721fd01d93..260619e19bde 100644
>> --- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
>> +++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
>> @@ -411,7 +411,8 @@ struct lnet_rc_data {
>>  };
>>  
>>  struct lnet_peer_ni {
>> -	struct list_head	lpni_on_peer_net_list;
>> +	/* chain on lpn_peer_nis */
>> +	struct list_head	lpni_peer_nis;
>>  	/* chain on remote peer list */
>>  	struct list_head	lpni_on_remote_peer_ni_list;
>>  	/* chain on peer hash */
>> @@ -496,8 +497,8 @@ struct lnet_peer_ni {
>>  #define LNET_PEER_NI_NON_MR_PREF	BIT(0)
>>  
>>  struct lnet_peer {
>> -	/* chain on global peer list */
>> -	struct list_head	lp_on_lnet_peer_list;
>> +	/* chain on pt_peer_list */
>> +	struct list_head	lp_peer_list;
>>  
>>  	/* list of peer nets */
>>  	struct list_head	lp_peer_nets;
>> @@ -505,6 +506,15 @@ struct lnet_peer {
>>  	/* primary NID of the peer */
>>  	lnet_nid_t		lp_primary_nid;
>>  
>> +	/* CPT of peer_table */
>> +	int			lp_cpt;
>> +
>> +	/* number of NIDs on this peer */
>> +	int			lp_nnis;
>> +
>> +	/* reference count */
>> +	atomic_t		lp_refcount;
>> +
>>  	/* lock protecting peer state flags */
>>  	spinlock_t		lp_lock;
>>  
>> @@ -516,8 +526,8 @@ struct lnet_peer {
>>  #define LNET_PEER_CONFIGURED	BIT(1)
>>  
>>  struct lnet_peer_net {
>> -	/* chain on peer block */
>> -	struct list_head	lpn_on_peer_list;
>> +	/* chain on lp_peer_nets */
>> +	struct list_head	lpn_peer_nets;
>>  
>>  	/* list of peer_nis on this network */
>>  	struct list_head	lpn_peer_nis;
>> @@ -527,21 +537,45 @@ struct lnet_peer_net {
>>  
>>  	/* Net ID */
>>  	__u32			lpn_net_id;
>> +
>> +	/* reference count */
>> +	atomic_t		lpn_refcount;
>>  };
>>  
>>  /* peer hash size */
>>  #define LNET_PEER_HASH_BITS	9
>>  #define LNET_PEER_HASH_SIZE	(1 << LNET_PEER_HASH_BITS)
>>  
>> -/* peer hash table */
>> +/*
>> + * peer hash table - one per CPT
>> + *
>> + * protected by lnet_net_lock/EX for update
>> + *    pt_version
>> + *    pt_number
>> + *    pt_hash[...]
>> + *    pt_peer_list
>> + *    pt_peers
>> + *    pt_peer_nnids
>> + * protected by pt_zombie_lock:
>> + *    pt_zombie_list
>> + *    pt_zombies
>> + *
>> + * pt_zombie lock nests inside lnet_net_lock
>> + */
>>  struct lnet_peer_table {
>>  	/* /proc validity stamp */
>>  	int			 pt_version;
>>  	/* # peers extant */
>>  	atomic_t		 pt_number;
>> +	/* peers */
>> +	struct list_head	pt_peer_list;
>> +	/* # peers */
>> +	int			pt_peers;
>> +	/* # NIDS on listed peers */
>> +	int			pt_peer_nnids;
>>  	/* # zombies to go to deathrow (and not there yet) */
>>  	int			 pt_zombies;
>> -	/* zombie peers */
>> +	/* zombie peers_ni */
>>  	struct list_head	 pt_zombie_list;
>>  	/* protect list and count */
>>  	spinlock_t		 pt_zombie_lock;
>> @@ -785,8 +819,6 @@ struct lnet {
>>  	struct lnet_msg_container	**ln_msg_containers;
>>  	struct lnet_counters		**ln_counters;
>>  	struct lnet_peer_table		**ln_peer_tables;
>> -	/* list of configured or discovered peers */
>> -	struct list_head		ln_peers;
>>  	/* list of peer nis not on a local network */
>>  	struct list_head		ln_remote_peer_ni_list;
>>  	/* failure simulation */
>> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
>> index d64ae2939abc..c48bcb8722a0 100644
>> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
>> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
>> @@ -625,7 +625,6 @@ lnet_prepare(lnet_pid_t requested_pid)
>>  	the_lnet.ln_pid = requested_pid;
>>  
>>  	INIT_LIST_HEAD(&the_lnet.ln_test_peers);
>> -	INIT_LIST_HEAD(&the_lnet.ln_peers);
>>  	INIT_LIST_HEAD(&the_lnet.ln_remote_peer_ni_list);
>>  	INIT_LIST_HEAD(&the_lnet.ln_nets);
>>  	INIT_LIST_HEAD(&the_lnet.ln_routers);
>> diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
>> index 99d8b22356bb..4c1eef907dc7 100644
>> --- a/drivers/staging/lustre/lnet/lnet/lib-move.c
>> +++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
>> @@ -1388,7 +1388,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>>  			peer_net = lnet_peer_get_net_locked(
>>  				peer, LNET_NIDNET(best_lpni->lpni_nid));
>>  			list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
>> -					    lpni_on_peer_net_list) {
>> +					    lpni_peer_nis) {
>>  				if (lpni->lpni_pref_nnids == 0)
>>  					continue;
>>  				LASSERT(lpni->lpni_pref_nnids == 1);
>> @@ -1411,7 +1411,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>>  			}
>>  			lpni = list_entry(peer_net->lpn_peer_nis.next,
>>  					  struct lnet_peer_ni,
>> -					  lpni_on_peer_net_list);
>> +					  lpni_peer_nis);
>>  		}
>>  		/* Set preferred NI if necessary. */
>>  		if (lpni->lpni_pref_nnids == 0)
>> @@ -1443,7 +1443,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>>  	 * then the best route is chosen. If all routes are equal then
>>  	 * they are used in round robin.
>>  	 */
>> -	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_on_peer_list) {
>> +	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
>>  		if (!lnet_is_peer_net_healthy_locked(peer_net))
>>  			continue;
>>  
>> @@ -1453,7 +1453,7 @@ lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
>>  
>>  			lpni = list_entry(peer_net->lpn_peer_nis.next,
>>  					  struct lnet_peer_ni,
>> -					  lpni_on_peer_net_list);
>> +					  lpni_peer_nis);
>>  
>>  			net_gw = lnet_find_route_locked(NULL,
>>  							lpni->lpni_nid,
>> diff --git a/drivers/staging/lustre/lnet/lnet/peer.c b/drivers/staging/lustre/lnet/lnet/peer.c
>> index 09c1b5516f6b..d7a0a2f3bdd9 100644
>> --- a/drivers/staging/lustre/lnet/lnet/peer.c
>> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
>
> INIT_LIST_HEAD(&ptable->pt_peer_list); seems to be missing from
> lnet_peer_tables_create(). This is in the patch merged into 
> lustre-testing. Other than that it looks okay.

No, it is there. It is just that lnet_peer_tables_create() has moved,
so it isn't in the same place in the patch.

..snip..

>> @@ -319,6 +358,8 @@ lnet_peer_tables_create(void)
>>  		spin_lock_init(&ptable->pt_zombie_lock);
>>  		INIT_LIST_HEAD(&ptable->pt_zombie_list);
>>  
>> +		INIT_LIST_HEAD(&ptable->pt_peer_list);
>> +
>>  		for (j = 0; j < LNET_PEER_HASH_SIZE; j++)
>>  			INIT_LIST_HEAD(&hash[j]);
>>  		ptable->pt_hash = hash; /* sign of initialization */

here it is.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging
  2018-10-14 23:54 ` [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging James Simmons
@ 2018-10-17  5:20   ` NeilBrown
  0 siblings, 0 replies; 57+ messages in thread
From: NeilBrown @ 2018-10-17  5:20 UTC (permalink / raw)
  To: lustre-devel

On Mon, Oct 15 2018, James Simmons wrote:

>> This is a port of the "Dynamic Discovery" series
>> (756abb9cf00b936b3..1c45d9051764e0637ba90b3)
>> to my mainline-linux-with-lustre tree.
>> It is all fairly straightforward, but I don't think I have the
>> hardware to test it properly.  And review never hurts.
>> 
>> This is all in my lustre-testing branch.
>
> Only one patch was incorrect, but the version in lustre-testing is fine.
> Testing has shown no problems.

Excellent - thanks for the review!

NeilBrown


* [lustre-devel] [PATCH 14/24] lustre: lnet: reference counts on lnet_peer/lnet_peer_net
  2018-10-17  5:16     ` NeilBrown
@ 2018-10-20 16:47       ` James Simmons
  0 siblings, 0 replies; 57+ messages in thread
From: James Simmons @ 2018-10-20 16:47 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> >> From: Olaf Weber <olaf@sgi.com>
> >> 
> >> Peer discovery will be keeping track of lnet_peer structures,
> >> so there will be references to an lnet_peer independent of
> >> the references implied by lnet_peer_ni structures. Manage
> >> this by adding explicit reference counts to lnet_peer_net and
> >> lnet_peer.
> >> 
> >> Each lnet_peer_net has a hold on the lnet_peer it links to
> >> with its lpn_peer pointer. This hold is only removed when that
> >> pointer is assigned a new value or the lnet_peer_net is freed.
> >> Just removing an lnet_peer_net from the lp_peer_nets list does
> >> not release this hold, it just prevents new lookups of the
> >> lnet_peer_net via the lnet_peer.
> >> 
> >> Each lnet_peer_ni has a hold on the lnet_peer_net it links to
> >> with its lpni_peer_net pointer. This hold is only removed when
> >> that pointer is assigned a new value or the lnet_peer_ni is
> >> freed. Just removing an lnet_peer_ni from the lpn_peer_nis
> >> list does not release this hold, it just prevents new lookups
> >> of the lnet_peer_ni via the lnet_peer_net.
> >> 
> >> This ensures that given a lnet_peer_ni *lpni, we can rely on
> >> lpni->lpni_peer_net->lpn_peer pointing to a valid lnet_peer.
> >> 
> >> Keep a count of the total number of lnet_peer_ni attached to
> >> an lnet_peer in lp_nnis.
> >> 
> >> Split the global ln_peers list into per-lnet_peer_table lists.
> >> The CPT of the peer table in which the lnet_peer is linked is
> >> stored in lp_cpt.
> >> 
> >> WC-bug-id: https://jira.whamcloud.com/browse/LU-9480
> >> Signed-off-by: Olaf Weber <olaf@sgi.com>
> >> Reviewed-on: https://review.whamcloud.com/25784
> >> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> >> Reviewed-by: Amir Shehata <amir.shehata@intel.com>
> >> Tested-by: Amir Shehata <amir.shehata@intel.com>
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> ---
> >>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   49 +++--
> >>  .../staging/lustre/include/linux/lnet/lib-types.h  |   50 ++++-
> >>  drivers/staging/lustre/lnet/lnet/api-ni.c          |    1 
> >>  drivers/staging/lustre/lnet/lnet/lib-move.c        |    8 -
> >>  drivers/staging/lustre/lnet/lnet/peer.c            |  210 ++++++++++++++------
> >>  5 files changed, 227 insertions(+), 91 deletions(-)
> >> 
> >> diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> >> index 563417510722..aad25eb0011b 100644
> >> --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> >> +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
> >> @@ -310,6 +310,36 @@ lnet_handle2me(struct lnet_handle_me *handle)

.....

> >> +++ b/drivers/staging/lustre/lnet/lnet/peer.c
> >
> > INIT_LIST_HEAD(&ptable->pt_peer_list); seems to be missing from
> > lnet_peer_tables_create(). This is in the patch merged into 
> > lustre-testing. Other than that it looks okay.
> 
> No, it is there. It is just that lnet_peer_tables_create() has moved,
> so it isn't in the same place in the patch.
> 
> ..snip..
> 
> >> @@ -319,6 +358,8 @@ lnet_peer_tables_create(void)
> >>  		spin_lock_init(&ptable->pt_zombie_lock);
> >>  		INIT_LIST_HEAD(&ptable->pt_zombie_list);
> >>  
> >> +		INIT_LIST_HEAD(&ptable->pt_peer_list);
> >> +
> >>  		for (j = 0; j < LNET_PEER_HASH_SIZE; j++)
> >>  			INIT_LIST_HEAD(&hash[j]);
> >>  		ptable->pt_hash = hash; /* sign of initialization */
> 
> here it is.

Missed it. Thanks.

Reviewed-by: James Simmons <jsimmons@infradead.org>


end of thread, other threads:[~2018-10-20 16:47 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-07 23:19 [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging NeilBrown
2018-10-07 23:19 ` [lustre-devel] [PATCH 08/24] lustre: lnet: rename lnet_add/del_peer_ni_to/from_peer() NeilBrown
2018-10-14 19:55   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 09/24] lustre: lnet: refactor lnet_del_peer_ni() NeilBrown
2018-10-14 19:58   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 02/24] lustre: lnet: configure lnet_interfaces_max tunable from dlc NeilBrown
2018-10-14 19:10   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 06/24] lustre: lnet: add sanity checks on ping-related constants NeilBrown
2018-10-14 19:36   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 04/24] lustre: lnet: automatic sizing of router pinger buffers NeilBrown
2018-10-14 19:32   ` James Simmons
2018-10-14 19:33   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 10/24] lustre: lnet: refactor lnet_add_peer_ni() NeilBrown
2018-10-14 20:02   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 01/24] lustre: lnet: add lnet_interfaces_max tunable NeilBrown
2018-10-14 19:08   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 05/24] lustre: lnet: add Multi-Rail and Discovery ping feature bits NeilBrown
2018-10-14 19:34   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 03/24] lustre: lnet: add struct lnet_ping_buffer NeilBrown
2018-10-14 19:29   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 07/24] lustre: lnet: cleanup of lnet_peer_ni_addref/decref_locked() NeilBrown
2018-10-14 19:38   ` James Simmons
2018-10-14 19:39   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 11/24] lustre: lnet: introduce LNET_PEER_MULTI_RAIL flag bit NeilBrown
2018-10-14 20:11   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 16/24] lustre: lnet: add discovery thread NeilBrown
2018-10-14 22:51   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 15/24] lustre: lnet: add msg_type to lnet_event NeilBrown
2018-10-14 22:44   ` James Simmons
2018-10-14 22:44   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 17/24] lustre: lnet: add the Push target NeilBrown
2018-10-14 22:58   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 24/24] lustre: lnet: balance references in lnet_discover_peer_locked() NeilBrown
2018-10-14 23:53   ` James Simmons
2018-10-14 23:54   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 19/24] lustre: lnet: add "lnetctl peer list" NeilBrown
2018-10-14 23:38   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 13/24] lustre: lnet: add LNET_PEER_CONFIGURED flag NeilBrown
2018-10-14 20:32   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 21/24] lustre: lnet: add "lnetctl discover" NeilBrown
2018-10-14 23:45   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 12/24] lustre: lnet: preferred NIs for non-Multi-Rail peers NeilBrown
2018-10-14 20:20   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 20/24] lustre: lnet: add "lnetctl ping" command NeilBrown
2018-10-14 23:43   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 23/24] lustre: lnet: show peer state NeilBrown
2018-10-14 23:52   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 14/24] lustre: lnet: reference counts on lnet_peer/lnet_peer_net NeilBrown
2018-10-14 22:42   ` James Simmons
2018-10-17  5:16     ` NeilBrown
2018-10-20 16:47       ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 18/24] lustre: lnet: implement Peer Discovery NeilBrown
2018-10-14 23:33   ` James Simmons
2018-10-07 23:19 ` [lustre-devel] [PATCH 22/24] lustre: lnet: add enhanced statistics NeilBrown
2018-10-14 23:50   ` James Simmons
2018-10-14 23:54 ` [lustre-devel] [PATCH 00/24] Port Dynamic Discovery to drivers/staging James Simmons
2018-10-17  5:20   ` NeilBrown
