From: James Simmons <jsimmons@infradead.org>
To: Andreas Dilger <adilger@whamcloud.com>,
	Oleg Drokin <green@whamcloud.com>, NeilBrown <neilb@suse.de>
Cc: Chris Horn <chris.horn@hpe.com>,
	Lustre Development List <lustre-devel@lists.lustre.org>
Subject: [lustre-devel] [PATCH 04/42] lnet: Drop LNet message if deadline exceeded
Date: Mon, 23 Jan 2023 18:00:17 -0500
Message-ID: <1674514855-15399-5-git-send-email-jsimmons@infradead.org>
In-Reply-To: <1674514855-15399-1-git-send-email-jsimmons@infradead.org>

From: Chris Horn <chris.horn@hpe.com>

The LNet message deadline is set when a message is committed for
sending. A message can be queued while waiting for send credit(s)
after it has been committed. Thus, it is possible for a message
deadline to be exceeded while on the queue. We should check for this
when posting messages to the LND layer.

HPE-bug-id: LUS-11333
WC-bug-id: https://jira.whamcloud.com/browse/LU-16303
Lustre-commit: 52db11cdceef0851b ("LU-16303 lnet: Drop LNet message if deadline exceeded")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49078
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
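Note for reviewers (not part of the commit): the ordering of the checks
in the patched lnet_check_message_drop() can be sketched in user space
as below. This is only an illustration under stated assumptions:
struct demo_msg, demo_check_message_drop() and the integer nanosecond
timestamps are hypothetical stand-ins, not LNet code, and the single
peer_alive flag collapses the real aliveness checks into one boolean.
The deadline test is strict ("after"), matching ktime_after().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct lnet_msg; times are in nanoseconds. */
struct demo_msg {
	int64_t deadline_ns;	/* set when the message is committed */
	bool peer_alive;	/* collapses the peer aliveness checks */
};

/*
 * Same return convention as the patched helper:
 *  -ETIMEDOUT if the message deadline has been exceeded
 *  -EHOSTUNREACH if the peer is considered down
 *  0 if the message should not be dropped
 */
static int demo_check_message_drop(const struct demo_msg *msg,
				   int64_t now_ns)
{
	assert(msg != NULL);

	/* The deadline is checked first; "now == deadline" is not late. */
	if (now_ns > msg->deadline_ns)
		return -ETIMEDOUT;

	if (!msg->peer_alive)
		return -EHOSTUNREACH;

	return 0;
}
```

A caller can branch on the return value to pick the log message and
health status, as lnet_post_send_locked() does in the diff below.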
 net/lnet/lnet/lib-move.c | 57 +++++++++++++++++++++++++++-------------
 net/lnet/lnet/lib-msg.c  |  2 +-
 2 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 225accaf5d08..f602492ee75f 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -572,41 +572,52 @@ lnet_ni_eager_recv(struct lnet_ni *ni, struct lnet_msg *msg)
 	return rc;
 }
 
-/* returns true if this message should be dropped */
-static bool
+/* Returns:
+ *  -ETIMEDOUT if the message deadline has been exceeded
+ *  -EHOSTUNREACH if the peer is down
+ *  0 if this message should not be dropped
+ */
+static int
 lnet_check_message_drop(struct lnet_ni *ni, struct lnet_peer_ni *lpni,
 			struct lnet_msg *msg)
 {
+	/* Drop message if we've exceeded the message deadline */
+	if (ktime_after(ktime_get(), msg->msg_deadline))
+		return -ETIMEDOUT;
+
 	if (msg->msg_target.pid & LNET_PID_USERFLAG)
-		return false;
+		return 0;
 
 	if (!lnet_peer_aliveness_enabled(lpni))
-		return false;
+		return 0;
 
 	/* If we're resending a message, let's attempt to send it even if
 	 * the peer is down to fulfill our resend quota on the message
 	 */
 	if (msg->msg_retry_count > 0)
-		return false;
+		return 0;
 
-	/* try and send recovery messages irregardless */
+	/* try and send recovery messages regardless */
 	if (msg->msg_recovery)
-		return false;
+		return 0;
 
 	/* always send any responses */
 	if (lnet_msg_is_response(msg))
-		return false;
+		return 0;
 
 	/* always send non-routed messages */
 	if (!msg->msg_routing)
-		return false;
+		return 0;
 
 	/* assume peer_ni is alive as long as we're within the configured
 	 * peer timeout
 	 */
-	return ktime_get_seconds() >=
-		(lpni->lpni_last_alive +
-		 lpni->lpni_net->net_tunables.lct_peer_timeout);
+	if (ktime_get_seconds() >=
+	    (lpni->lpni_last_alive +
+	     lpni->lpni_net->net_tunables.lct_peer_timeout))
+		return -EHOSTUNREACH;
+
+	return 0;
 }
 
 /**
@@ -628,6 +639,7 @@ lnet_post_send_locked(struct lnet_msg *msg, int do_send)
 	struct lnet_ni *ni = msg->msg_txni;
 	int cpt = msg->msg_tx_cpt;
 	struct lnet_tx_queue *tq = ni->ni_tx_queues[cpt];
+	int rc;
 
 	/* non-lnet_send() callers have checked before */
 	LASSERT(!do_send || msg->msg_tx_delayed);
@@ -639,7 +651,8 @@ lnet_post_send_locked(struct lnet_msg *msg, int do_send)
 		LASSERT(!nid_same(&lp->lpni_nid, &the_lnet.ln_loni->ni_nid));
 
 	/* NB 'lp' is always the next hop */
-	if (lnet_check_message_drop(ni, lp, msg)) {
+	rc = lnet_check_message_drop(ni, lp, msg);
+	if (rc) {
 		the_lnet.ln_counters[cpt]->lct_common.lcc_drop_count++;
 		the_lnet.ln_counters[cpt]->lct_common.lcc_drop_length +=
 			msg->msg_len;
@@ -653,14 +666,22 @@ lnet_post_send_locked(struct lnet_msg *msg, int do_send)
 					msg->msg_type,
 					LNET_STATS_TYPE_DROP);
 
-		CNETERR("Dropping message for %s: peer not alive\n",
-			libcfs_idstr(&msg->msg_target));
-		msg->msg_health_status = LNET_MSG_STATUS_REMOTE_DROPPED;
+		if (rc == -EHOSTUNREACH) {
+			CNETERR("Dropping message for %s: peer not alive\n",
+				libcfs_idstr(&msg->msg_target));
+			msg->msg_health_status = LNET_MSG_STATUS_REMOTE_DROPPED;
+		} else {
+			CNETERR("Dropping message for %s: exceeded message deadline\n",
+				libcfs_idstr(&msg->msg_target));
+			msg->msg_health_status =
+				LNET_MSG_STATUS_NETWORK_TIMEOUT;
+		}
+
 		if (do_send)
-			lnet_finalize(msg, -EHOSTUNREACH);
+			lnet_finalize(msg, rc);
 
 		lnet_net_lock(cpt);
-		return -EHOSTUNREACH;
+		return rc;
 	}
 
 	if (msg->msg_md &&
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 898d8670aedf..82d117dc6b61 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -779,7 +779,7 @@ lnet_health_check(struct lnet_msg *msg)
 		lo = true;
 
 	if (hstatus != LNET_MSG_STATUS_OK &&
-	    ktime_compare(ktime_get(), msg->msg_deadline) >= 0)
+	    ktime_after(ktime_get(), msg->msg_deadline))
 		return -1;
 
 	/* always prefer txni/txpeer if they message is committed for both
-- 
2.27.0
