[Patch 0/2] iscsit/isert deadlock prevention under heavy I/O

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
@ 2022-03-11 17:57 David Jeffery
  2022-03-11 17:57 ` [PATCH 1/2] isert: support for unsolicited NOPIN with no response David Jeffery
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: David Jeffery @ 2022-03-11 17:57 UTC (permalink / raw)
  To: linux-rdma
  Cc: target-devel, Sagi Grimberg, Max Gurtovoy, Laurence Oberman,
	David Jeffery

With fast infiniband networks and rdma through isert, the isert version of
an iSCSI target can get itself into a deadlock condition from when
max_cmd_sn updates are pushed to the client versus when commands are fully
released after rdma completes.

iscsit preallocates a limited number of iscsi_cmd structs used for any
commands from the initiator. While the iscsi window would normally be
expected to limit the number used by normal SCSI commands, isert can exceed
this limit with commands waiting finally completion. max_cmd_sn gets
incremented and pushed to the client on sending the target's final
response, but the iscsi_cmd won't be freed for reuse until after all rdma
is acknowledged as complete.

This allows more new commands to come in even as older commands are not yet
released. With enough commands on the initiator wanting to be sent, this can
result in all iscsi_cmd structs being allocated and used for SCSI commands.

And once all are allocated, isert can deadlock when another new command is
received. Its receive processing waits for an iscsi_cmd to become available.
But this also stalls processing of the completions which would result in
releasing an iscsi_cmd, resulting in a deadlock.

This small patch series prevents this issue by altering when and how
max_cmd_sn changes are reported to the initiator for isert. It gets delayed
until iscsi_cmd release instead of when sending a final response.

To prevent failure or large delays for informing the initiator of changes to
max_cmd_sn, NOPIN is used as a method to inform the initiator should the
difference between internal max_cmd_sn and what has been passed to the
initiator grow too large.

David Jeffery (2):
  isert: support for unsolicited NOPIN with no response.
  iscsit: increment max_cmd_sn for isert on command release

 drivers/infiniband/ulp/isert/ib_isert.c    | 11 ++++++-
 drivers/target/iscsi/iscsi_target.c        | 18 +++++------
 drivers/target/iscsi/iscsi_target_device.c | 35 +++++++++++++++++++++-
 drivers/target/iscsi/iscsi_target_login.c  |  1 +
 drivers/target/iscsi/iscsi_target_util.c   |  5 +++-
 drivers/target/iscsi/iscsi_target_util.h   |  1 +
 include/target/iscsi/iscsi_target_core.h   |  8 +++++
 include/target/iscsi/iscsi_transport.h     |  1 +
 8 files changed, 68 insertions(+), 12 deletions(-)

-- 
2.31.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] isert: support for unsolicited NOPIN with no response.
  2022-03-11 17:57 [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O David Jeffery
@ 2022-03-11 17:57 ` David Jeffery
  2022-03-11 17:57 ` [PATCH 2/2] iscsit: increment max_cmd_sn for isert on command release David Jeffery
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: David Jeffery @ 2022-03-11 17:57 UTC (permalink / raw)
  To: linux-rdma
  Cc: target-devel, Sagi Grimberg, Max Gurtovoy, Laurence Oberman,
	David Jeffery

The isert module has only supported sending a NOPIN as a response to a
NOPOUT. In order to be able to use an unsolicited NOPIN to inform the
initiator of max_cmd_sn changes, isert needs to be able to send a NOPIN
as needed.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
---
 drivers/infiniband/ulp/isert/ib_isert.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 636d590765f9..b8390ad762eb 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1469,6 +1469,7 @@ isert_put_cmd(struct isert_cmd *isert_cmd, bool comp_err)
 		break;
 	case ISCSI_OP_REJECT:
 	case ISCSI_OP_NOOP_OUT:
+	case ISCSI_OP_NOOP_IN:
 	case ISCSI_OP_TEXT:
 		hdr = (struct iscsi_text_rsp *)&isert_cmd->tx_desc.iscsi_header;
 		/* If the continue bit is on, keep the command alive */
@@ -1858,7 +1859,14 @@ isert_put_nopin(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
 
 	isert_dbg("conn %p Posting NOPIN Response\n", isert_conn);
 
-	return isert_post_response(isert_conn, isert_cmd);
+	if (nopout_response)
+		return isert_post_response(isert_conn, isert_cmd);
+
+	/* pointer init since didn't go through isert_allocate_cmd */
+	isert_cmd->conn = isert_conn;
+	isert_cmd->iscsi_cmd = cmd;
+
+	return ib_post_send(isert_conn->qp, send_wr, NULL);
 }
 
 static int
@@ -2159,6 +2167,7 @@ isert_immediate_queue(struct iscsi_conn *conn, struct iscsi_cmd *cmd, int state)
 		spin_unlock_bh(&conn->cmd_lock);
 		isert_put_cmd(isert_cmd, true);
 		break;
+	case ISTATE_SEND_NOPIN_NO_RESPONSE:
 	case ISTATE_SEND_NOPIN_WANT_RESPONSE:
 		ret = isert_put_nopin(cmd, conn, false);
 		break;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] iscsit: increment max_cmd_sn for isert on command release
  2022-03-11 17:57 [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O David Jeffery
  2022-03-11 17:57 ` [PATCH 1/2] isert: support for unsolicited NOPIN with no response David Jeffery
@ 2022-03-11 17:57 ` David Jeffery
  2022-03-11 19:08 ` [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O Laurence Oberman
  2022-03-13  9:59 ` Max Gurtovoy
  3 siblings, 0 replies; 13+ messages in thread
From: David Jeffery @ 2022-03-11 17:57 UTC (permalink / raw)
  To: linux-rdma
  Cc: target-devel, Sagi Grimberg, Max Gurtovoy, Laurence Oberman,
	David Jeffery

iscsit with isert currently can suffer a rare deadlock due to how rdma
delays the release of an iscsi_cmd. Because max_cmd_sn is increased and
sent to the initiator before the last rdma completes, iscsit can end up in
a state where all iscsi_cmd structs are active even though the number is
more than double the iscsi window.

Once out of iscsi_cmd structs, isert then deadlocks trying to receive new
commands. It waits for an iscsi_cmd to become available, but the wait also
blocks processing for receiving completion events which would release an
iscsi_cmd waiting on rdma to finish. So neither can advance.

This patch avoids the deadlock state by delaying the increase to max_cmd_sn
until an iscsi_cmd is released. In this way, the number of iscsi_cmd
structs in use for SCSI commands will be limited to the iscsi window size.

An unsolicited NOPIN is then used to inform the initiator of changes to
max_cmd_sn should the difference between the internal value and the value
the initiator has been informed of grow too large (currently set to half
the window).

Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
---
 drivers/target/iscsi/iscsi_target.c        | 18 +++++------
 drivers/target/iscsi/iscsi_target_device.c | 35 +++++++++++++++++++++-
 drivers/target/iscsi/iscsi_target_login.c  |  1 +
 drivers/target/iscsi/iscsi_target_util.c   |  5 +++-
 drivers/target/iscsi/iscsi_target_util.h   |  1 +
 include/target/iscsi/iscsi_target_core.h   |  8 +++++
 include/target/iscsi/iscsi_transport.h     |  1 +
 7 files changed, 58 insertions(+), 11 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
index 2c54c5d8412d..f67e909c5546 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -2757,7 +2757,7 @@ static int iscsit_send_conn_drop_async_message(
 	cmd->stat_sn		= conn->stat_sn++;
 	hdr->statsn		= cpu_to_be32(cmd->stat_sn);
 	hdr->exp_cmdsn		= cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn		= cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 	hdr->async_event	= ISCSI_ASYNC_MSG_DROPPING_CONNECTION;
 	hdr->param1		= cpu_to_be16(cmd->logout_cid);
 	hdr->param2		= cpu_to_be16(conn->sess->sess_ops->DefaultTime2Wait);
@@ -2815,7 +2815,7 @@ iscsit_build_datain_pdu(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
 		hdr->statsn		= cpu_to_be32(0xFFFFFFFF);
 
 	hdr->exp_cmdsn		= cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn		= cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 	hdr->datasn		= cpu_to_be32(datain->data_sn);
 	hdr->offset		= cpu_to_be32(datain->offset);
 
@@ -2970,7 +2970,7 @@ iscsit_build_logout_rsp(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
 
 	iscsit_increment_maxcmdsn(cmd, conn->sess);
 	hdr->exp_cmdsn		= cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn		= cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 
 	pr_debug("Built Logout Response ITT: 0x%08x StatSN:"
 		" 0x%08x Response: 0x%02x CID: %hu on CID: %hu\n",
@@ -3013,7 +3013,7 @@ iscsit_build_nopin_rsp(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
 		iscsit_increment_maxcmdsn(cmd, conn->sess);
 
 	hdr->exp_cmdsn		= cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn		= cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 
 	pr_debug("Built NOPIN %s Response ITT: 0x%08x, TTT: 0x%08x,"
 		" StatSN: 0x%08x, Length %u\n", (nopout_response) ?
@@ -3094,7 +3094,7 @@ static int iscsit_send_r2t(
 	hdr->ttt		= cpu_to_be32(r2t->targ_xfer_tag);
 	hdr->statsn		= cpu_to_be32(conn->stat_sn);
 	hdr->exp_cmdsn		= cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn		= cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 	hdr->r2tsn		= cpu_to_be32(r2t->r2t_sn);
 	hdr->data_offset	= cpu_to_be32(r2t->offset);
 	hdr->data_length	= cpu_to_be32(r2t->xfer_len);
@@ -3234,7 +3234,7 @@ void iscsit_build_rsp_pdu(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
 
 	iscsit_increment_maxcmdsn(cmd, conn->sess);
 	hdr->exp_cmdsn		= cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn		= cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 
 	pr_debug("Built SCSI Response, ITT: 0x%08x, StatSN: 0x%08x,"
 		" Response: 0x%02x, SAM Status: 0x%02x, CID: %hu\n",
@@ -3314,7 +3314,7 @@ iscsit_build_task_mgt_rsp(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
 
 	iscsit_increment_maxcmdsn(cmd, conn->sess);
 	hdr->exp_cmdsn		= cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn		= cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 
 	pr_debug("Built Task Management Response ITT: 0x%08x,"
 		" StatSN: 0x%08x, Response: 0x%02x, CID: %hu\n",
@@ -3522,7 +3522,7 @@ iscsit_build_text_rsp(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
 	 */
 	cmd->maxcmdsn_inc = 0;
 	hdr->exp_cmdsn = cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn = cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 
 	pr_debug("Built Text Response: ITT: 0x%08x, TTT: 0x%08x, StatSN: 0x%08x,"
 		" Length: %u, CID: %hu F: %d C: %d\n", cmd->init_task_tag,
@@ -3563,7 +3563,7 @@ iscsit_build_reject(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
 	cmd->stat_sn		= conn->stat_sn++;
 	hdr->statsn		= cpu_to_be32(cmd->stat_sn);
 	hdr->exp_cmdsn		= cpu_to_be32(conn->sess->exp_cmd_sn);
-	hdr->max_cmdsn		= cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
+	iscsit_set_max_cmdsn(hdr, conn);
 
 }
 EXPORT_SYMBOL(iscsit_build_reject);
diff --git a/drivers/target/iscsi/iscsi_target_device.c b/drivers/target/iscsi/iscsi_target_device.c
index 8bf36ec86e3f..09b23c133dca 100644
--- a/drivers/target/iscsi/iscsi_target_device.c
+++ b/drivers/target/iscsi/iscsi_target_device.c
@@ -13,10 +13,14 @@
 #include <target/target_core_fabric.h>
 
 #include <target/iscsi/iscsi_target_core.h>
+#include <target/iscsi/iscsi_transport.h>
 #include "iscsi_target_device.h"
 #include "iscsi_target_tpg.h"
 #include "iscsi_target_util.h"
 
+#define iscsit_needs_delayed_maxcmdsn_increment(conn) \
+	(conn->conn_transport->transport_type == ISCSI_INFINIBAND)
+
 void iscsit_determine_maxcmdsn(struct iscsi_session *sess)
 {
 	struct se_node_acl *se_nacl;
@@ -42,7 +46,7 @@ void iscsit_determine_maxcmdsn(struct iscsi_session *sess)
 	atomic_add(se_nacl->queue_depth - 1, &sess->max_cmd_sn);
 }
 
-void iscsit_increment_maxcmdsn(struct iscsi_cmd *cmd, struct iscsi_session *sess)
+void __iscsit_increment_maxcmdsn(struct iscsi_cmd *cmd, struct iscsi_session *sess)
 {
 	u32 max_cmd_sn;
 
@@ -54,4 +58,33 @@ void iscsit_increment_maxcmdsn(struct iscsi_cmd *cmd, struct iscsi_session *sess
 	max_cmd_sn = atomic_inc_return(&sess->max_cmd_sn);
 	pr_debug("Updated MaxCmdSN to 0x%08x\n", max_cmd_sn);
 }
+
+void iscsit_increment_maxcmdsn(struct iscsi_cmd *cmd, struct iscsi_session *sess)
+{
+	if (!iscsit_needs_delayed_maxcmdsn_increment(cmd->conn))
+		__iscsit_increment_maxcmdsn(cmd, sess);
+}
 EXPORT_SYMBOL(iscsit_increment_maxcmdsn);
+
+
+
+void iscsit_increment_maxcmdsn_on_release(struct iscsi_cmd *cmd, struct iscsi_session *sess)
+{
+	if (iscsit_needs_delayed_maxcmdsn_increment(cmd->conn)) {
+		__iscsit_increment_maxcmdsn(cmd, sess);
+		if ((u32)atomic_read(&sess->max_cmd_sn) -
+		     READ_ONCE(sess->last_max_cmd_sn)
+		     > sess->cmdsn_window / 2) {
+			/*
+			 * Prevent nopin races if one may be needed by using
+			 * a lock and rechecking after grabbing the lock
+			 */
+			spin_lock_bh(&cmd->conn->nopin_timer_lock);
+			if ((u32)atomic_read(&sess->max_cmd_sn) -
+			    READ_ONCE(sess->last_max_cmd_sn)
+			    > sess->cmdsn_window / 2)
+				iscsit_add_nopin(cmd->conn, 0);
+			spin_unlock_bh(&cmd->conn->nopin_timer_lock);
+		}
+	}
+}
diff --git a/drivers/target/iscsi/iscsi_target_login.c b/drivers/target/iscsi/iscsi_target_login.c
index 1a9c50401bdb..d97c3792f140 100644
--- a/drivers/target/iscsi/iscsi_target_login.c
+++ b/drivers/target/iscsi/iscsi_target_login.c
@@ -307,6 +307,7 @@ static int iscsi_login_zero_tsih_s1(
 	 * Initiator Node's ACL once the login has been successfully completed.
 	 */
 	atomic_set(&sess->max_cmd_sn, be32_to_cpu(pdu->cmdsn));
+	sess->last_max_cmd_sn = be32_to_cpu(pdu->cmdsn);
 
 	sess->sess_ops = kzalloc(sizeof(struct iscsi_sess_ops), GFP_KERNEL);
 	if (!sess->sess_ops) {
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 6dd5810e2af1..18685b23e1d0 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -708,6 +708,8 @@ void iscsit_release_cmd(struct iscsi_cmd *cmd)
 
 	BUG_ON(!sess || !sess->se_sess);
 
+	iscsit_increment_maxcmdsn_on_release(cmd, sess);
+
 	kfree(cmd->buf_ptr);
 	kfree(cmd->pdu_list);
 	kfree(cmd->seq_list);
@@ -867,7 +869,7 @@ void iscsit_inc_conn_usage_count(struct iscsi_conn *conn)
 	spin_unlock_bh(&conn->conn_usage_lock);
 }
 
-static int iscsit_add_nopin(struct iscsi_conn *conn, int want_response)
+int iscsit_add_nopin(struct iscsi_conn *conn, int want_response)
 {
 	u8 state;
 	struct iscsi_cmd *cmd;
@@ -877,6 +879,7 @@ static int iscsit_add_nopin(struct iscsi_conn *conn, int want_response)
 		return -1;
 
 	cmd->iscsi_opcode = ISCSI_OP_NOOP_IN;
+	cmd->immediate_cmd = 1;
 	state = (want_response) ? ISTATE_SEND_NOPIN_WANT_RESPONSE :
 				ISTATE_SEND_NOPIN_NO_RESPONSE;
 	cmd->init_task_tag = RESERVED_ITT;
diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h
index 8ee1c133a9b7..c4474943f310 100644
--- a/drivers/target/iscsi/iscsi_target_util.h
+++ b/drivers/target/iscsi/iscsi_target_util.h
@@ -68,5 +68,6 @@ extern int tx_data(struct iscsi_conn *, struct kvec *, int, int);
 extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8);
 extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *);
 extern void iscsit_fill_cxn_timeout_err_stats(struct iscsi_session *);
+extern int iscsit_add_nopin(struct iscsi_conn *, int);
 
 #endif /*** ISCSI_TARGET_UTIL_H ***/
diff --git a/include/target/iscsi/iscsi_target_core.h b/include/target/iscsi/iscsi_target_core.h
index 1eccb2ac7d02..2983d3798432 100644
--- a/include/target/iscsi/iscsi_target_core.h
+++ b/include/target/iscsi/iscsi_target_core.h
@@ -643,6 +643,7 @@ struct iscsi_session {
 	u32			exp_cmd_sn;
 	/* session wide counter: maximum allowed command sequence number */
 	atomic_t		max_cmd_sn;
+	u32			last_max_cmd_sn;
 	struct list_head	sess_ooo_cmdsn_list;
 
 	/* LIO specific session ID */
@@ -923,4 +924,11 @@ static inline void iscsit_thread_check_cpumask(
 	 */
 	set_cpus_allowed_ptr(p, conn->conn_cpumask);
 }
+
+#define iscsit_set_max_cmdsn(hdr, conn) \
+{ \
+	u32 max_cmdsn = (u32) atomic_read(&conn->sess->max_cmd_sn); \
+	hdr->max_cmdsn = cpu_to_be32(max_cmdsn); \
+	conn->sess->last_max_cmd_sn = max_cmdsn; \
+}
 #endif /* ISCSI_TARGET_CORE_H */
diff --git a/include/target/iscsi/iscsi_transport.h b/include/target/iscsi/iscsi_transport.h
index b8feba7ffebc..878733ca584c 100644
--- a/include/target/iscsi/iscsi_transport.h
+++ b/include/target/iscsi/iscsi_transport.h
@@ -106,6 +106,7 @@ extern int iscsit_response_queue(struct iscsi_conn *, struct iscsi_cmd *, int);
  * From iscsi_target_device.c
  */
 extern void iscsit_increment_maxcmdsn(struct iscsi_cmd *, struct iscsi_session *);
+extern void iscsit_increment_maxcmdsn_on_release(struct iscsi_cmd *, struct iscsi_session *);
 /*
  * From iscsi_target_erl0.c
  */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-11 17:57 [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O David Jeffery
  2022-03-11 17:57 ` [PATCH 1/2] isert: support for unsolicited NOPIN with no response David Jeffery
  2022-03-11 17:57 ` [PATCH 2/2] iscsit: increment max_cmd_sn for isert on command release David Jeffery
@ 2022-03-11 19:08 ` Laurence Oberman
  2022-03-13  9:59 ` Max Gurtovoy
  3 siblings, 0 replies; 13+ messages in thread
From: Laurence Oberman @ 2022-03-11 19:08 UTC (permalink / raw)
  To: David Jeffery, linux-rdma; +Cc: target-devel, Sagi Grimberg, Max Gurtovoy

On Fri, 2022-03-11 at 12:57 -0500, David Jeffery wrote:
> With fast infiniband networks and rdma through isert, the isert
> version of
> an iSCSI target can get itself into a deadlock condition from when
> max_cmd_sn updates are pushed to the client versus when commands are
> fully
> released after rdma completes.
> 
> iscsit preallocates a limited number of iscsi_cmd structs used for
> any
> commands from the initiator. While the iscsi window would normally be
> expected to limit the number used by normal SCSI commands, isert can
> exceed
> this limit with commands waiting finally completion. max_cmd_sn gets
> incremented and pushed to the client on sending the target's final
> response, but the iscsi_cmd won't be freed for reuse until after all
> rdma
> is acknowledged as complete.
> 
> This allows more new commands to come in even as older commands are
> not yet
> released. With enough commands on the initiator wanting to be sent,
> this can
> result in all iscsi_cmd structs being allocated and used for SCSI
> commands.
> 
> And once all are allocated, isert can deadlock when another new
> command is
> received. Its receive processing waits for an iscsi_cmd to become
> available.
> But this also stalls processing of the completions which would result
> in
> releasing an iscsi_cmd, resulting in a deadlock.
> 
> This small patch series prevents this issue by altering when and how
> max_cmd_sn changes are reported to the initiator for isert. It gets
> delayed
> until iscsi_cmd release instead of when sending a final response.
> 
> To prevent failure or large delays for informing the initiator of
> changes to
> max_cmd_sn, NOPIN is used as a method to inform the initiator should
> the
> difference between internal max_cmd_sn and what has been passed to
> the
> initiator grow too large.
> 
> David Jeffery (2):
>   isert: support for unsolicited NOPIN with no response.
>   iscsit: increment max_cmd_sn for isert on command release
> 
>  drivers/infiniband/ulp/isert/ib_isert.c    | 11 ++++++-
>  drivers/target/iscsi/iscsi_target.c        | 18 +++++------
>  drivers/target/iscsi/iscsi_target_device.c | 35
> +++++++++++++++++++++-
>  drivers/target/iscsi/iscsi_target_login.c  |  1 +
>  drivers/target/iscsi/iscsi_target_util.c   |  5 +++-
>  drivers/target/iscsi/iscsi_target_util.h   |  1 +
>  include/target/iscsi/iscsi_target_core.h   |  8 +++++
>  include/target/iscsi/iscsi_transport.h     |  1 +
>  8 files changed, 68 insertions(+), 12 deletions(-)
> 

This patch has had exhaustive testing in our lab and finally at a
customer. with 40GB FDR we could not reproduce this issue, when we
moved to 100G EDR it showed up. Its been literally over tested for many
days on two separate installations.

The patch corrected all the stalls and problems seen. Thanks David for
sending this.

Regards
Laurence Oberman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-11 17:57 [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O David Jeffery
                   ` (2 preceding siblings ...)
  2022-03-11 19:08 ` [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O Laurence Oberman
@ 2022-03-13  9:59 ` Max Gurtovoy
  2022-03-14 13:57   ` David Jeffery
  3 siblings, 1 reply; 13+ messages in thread
From: Max Gurtovoy @ 2022-03-13  9:59 UTC (permalink / raw)
  To: David Jeffery, linux-rdma; +Cc: target-devel, Sagi Grimberg, Laurence Oberman

Hi David,

thanks for the report.

On 3/11/2022 7:57 PM, David Jeffery wrote:
> With fast infiniband networks and rdma through isert, the isert version of
> an iSCSI target can get itself into a deadlock condition from when
> max_cmd_sn updates are pushed to the client versus when commands are fully
> released after rdma completes.
>
> iscsit preallocates a limited number of iscsi_cmd structs used for any
> commands from the initiator. While the iscsi window would normally be
> expected to limit the number used by normal SCSI commands, isert can exceed
> this limit with commands waiting finally completion. max_cmd_sn gets
> incremented and pushed to the client on sending the target's final
> response, but the iscsi_cmd won't be freed for reuse until after all rdma
> is acknowledged as complete.

Please check how we fixed that in NVMf in Sagi's commit:

nvmet-rdma: fix possible bogus dereference under heavy load (commit: 
8407879c4e0d77)

Maybe this can be done in isert and will solve this problem in a simpler 
way.

is it necessary to change max_cmd_sn ?


>
> This allows more new commands to come in even as older commands are not yet
> released. With enough commands on the initiator wanting to be sent, this can
> result in all iscsi_cmd structs being allocated and used for SCSI commands.
>
> And once all are allocated, isert can deadlock when another new command is
> received. Its receive processing waits for an iscsi_cmd to become available.
> But this also stalls processing of the completions which would result in
> releasing an iscsi_cmd, resulting in a deadlock.
>
> This small patch series prevents this issue by altering when and how
> max_cmd_sn changes are reported to the initiator for isert. It gets delayed
> until iscsi_cmd release instead of when sending a final response.
>
> To prevent failure or large delays for informing the initiator of changes to
> max_cmd_sn, NOPIN is used as a method to inform the initiator should the
> difference between internal max_cmd_sn and what has been passed to the
> initiator grow too large.
>
> David Jeffery (2):
>    isert: support for unsolicited NOPIN with no response.
>    iscsit: increment max_cmd_sn for isert on command release
>
>   drivers/infiniband/ulp/isert/ib_isert.c    | 11 ++++++-
>   drivers/target/iscsi/iscsi_target.c        | 18 +++++------
>   drivers/target/iscsi/iscsi_target_device.c | 35 +++++++++++++++++++++-
>   drivers/target/iscsi/iscsi_target_login.c  |  1 +
>   drivers/target/iscsi/iscsi_target_util.c   |  5 +++-
>   drivers/target/iscsi/iscsi_target_util.h   |  1 +
>   include/target/iscsi/iscsi_target_core.h   |  8 +++++
>   include/target/iscsi/iscsi_transport.h     |  1 +
>   8 files changed, 68 insertions(+), 12 deletions(-)
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-13  9:59 ` Max Gurtovoy
@ 2022-03-14 13:57   ` David Jeffery
  2022-03-14 14:52     ` Max Gurtovoy
  0 siblings, 1 reply; 13+ messages in thread
From: David Jeffery @ 2022-03-14 13:57 UTC (permalink / raw)
  To: Max Gurtovoy; +Cc: linux-rdma, target-devel, Sagi Grimberg, Laurence Oberman

On Sun, Mar 13, 2022 at 5:59 AM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
> Hi David,
>
> thanks for the report.
>
> Please check how we fixed that in NVMf in Sagi's commit:
>
> nvmet-rdma: fix possible bogus dereference under heavy load (commit:
> 8407879c4e0d77)
>
> Maybe this can be done in isert and will solve this problem in a simpler
> way.
>
> is it necessary to change max_cmd_sn ?
>
>

Hello,

Sure, there are alternative methods which could fix this immediate
issue. e.g. We could make the command structs for scsi commands get
allocated from a mempool. Is there a particular reason you don't want
to do anything to modify max_cmd_sn behavior?

I didn't do something like this as it seems to me to go against the
intent of the design. It makes the iscsi window mostly meaningless in
some conditions and complicates any allocation path since it now must
gracefully and sanely handle an iscsi_cmd/isert_cmd not existing. I
assume special commands like task-management, logouts, and pings would
need a separate allocation source to keep from being dropped under
memory load.

David Jeffery

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-14 13:57   ` David Jeffery
@ 2022-03-14 14:52     ` Max Gurtovoy
  2022-03-14 15:55       ` David Jeffery
  0 siblings, 1 reply; 13+ messages in thread
From: Max Gurtovoy @ 2022-03-14 14:52 UTC (permalink / raw)
  To: David Jeffery; +Cc: linux-rdma, target-devel, Sagi Grimberg, Laurence Oberman


On 3/14/2022 3:57 PM, David Jeffery wrote:
> On Sun, Mar 13, 2022 at 5:59 AM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>> Hi David,
>>
>> thanks for the report.
>>
>> Please check how we fixed that in NVMf in Sagi's commit:
>>
>> nvmet-rdma: fix possible bogus dereference under heavy load (commit:
>> 8407879c4e0d77)
>>
>> Maybe this can be done in isert and will solve this problem in a simpler
>> way.
>>
>> is it necessary to change max_cmd_sn ?
>>
>>
> Hello,
>
> Sure, there are alternative methods which could fix this immediate
> issue. e.g. We could make the command structs for scsi commands get
> allocated from a mempool. Is there a particular reason you don't want
> to do anything to modify max_cmd_sn behavior?

according to the description the command was parsed successful and sent 
to the initiator.

Why do we need to change the window ? it's just a race of putting the 
context back to the pool.

And this race is rare.


>
> I didn't do something like this as it seems to me to go against the
> intent of the design. It makes the iscsi window mostly meaningless in
> some conditions and complicates any allocation path since it now must
> gracefully and sanely handle an iscsi_cmd/isert_cmd not existing. I
> assume special commands like task-management, logouts, and pings would
> need a separate allocation source to keep from being dropped under
> memory load.

it won't be dropped. It would be allocated dynamically and freed 
(instead of putting it back to the pool).


> David Jeffery
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-14 14:52     ` Max Gurtovoy
@ 2022-03-14 15:55       ` David Jeffery
  2022-03-14 17:40         ` Laurence Oberman
  0 siblings, 1 reply; 13+ messages in thread
From: David Jeffery @ 2022-03-14 15:55 UTC (permalink / raw)
  To: Max Gurtovoy; +Cc: linux-rdma, target-devel, Sagi Grimberg, Laurence Oberman

On Mon, Mar 14, 2022 at 10:52 AM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>
> On 3/14/2022 3:57 PM, David Jeffery wrote:
> > On Sun, Mar 13, 2022 at 5:59 AM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >> Hi David,
> >>
> >> thanks for the report.
> >>
> >> Please check how we fixed that in NVMf in Sagi's commit:
> >>
> >> nvmet-rdma: fix possible bogus dereference under heavy load (commit:
> >> 8407879c4e0d77)
> >>
> >> Maybe this can be done in isert and will solve this problem in a simpler
> >> way.
> >>
> >> is it necessary to change max_cmd_sn ?
> >>
> >>
> > Hello,
> >
> > Sure, there are alternative methods which could fix this immediate
> > issue. e.g. We could make the command structs for scsi commands get
> > allocated from a mempool. Is there a particular reason you don't want
> > to do anything to modify max_cmd_sn behavior?
>
> according to the description the command was parsed successful and sent
> to the initiator.
>

Yes.

> Why do we need to change the window ? it's just a race of putting the
> context back to the pool.
>
> And this race is rare.
>

Sure, it's going to be rare. Systems using isert targets with
infiniband are going to be naturally rare. It's part of why I left the
max_cmd_sn behavior untouched for non-isert iscsit since they seem to
be fine as is. But it's easily and regularly triggered by some systems
which use isert, so worth fixing.

> >
> > I didn't do something like this as it seems to me to go against the
> > intent of the design. It makes the iscsi window mostly meaningless in
> > some conditions and complicates any allocation path since it now must
> > gracefully and sanely handle an iscsi_cmd/isert_cmd not existing. I
> > assume special commands like task-management, logouts, and pings would
> > need a separate allocation source to keep from being dropped under
> > memory load.
>
> it won't be dropped. It would be allocated dynamically and freed
> (instead of putting it back to the pool).
>

If it waits indefinitely for an allocation it ends up with a variation
of the original problem under memory pressure. If it waits for
allocation on isert receive, then receive stalls under memory pressure
and won't process the completions which would have released the other
iscsi_cmd structs just needing final acknowledgement.

David Jeffery


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-14 15:55       ` David Jeffery
@ 2022-03-14 17:40         ` Laurence Oberman
  2022-03-16 10:38           ` Max Gurtovoy
  0 siblings, 1 reply; 13+ messages in thread
From: Laurence Oberman @ 2022-03-14 17:40 UTC (permalink / raw)
  To: David Jeffery, Max Gurtovoy; +Cc: linux-rdma, target-devel, Sagi Grimberg

On Mon, 2022-03-14 at 11:55 -0400, David Jeffery wrote:
> On Mon, Mar 14, 2022 at 10:52 AM Max Gurtovoy <mgurtovoy@nvidia.com>
> wrote:
> > 
> > 
> > On 3/14/2022 3:57 PM, David Jeffery wrote:
> > > On Sun, Mar 13, 2022 at 5:59 AM Max Gurtovoy <
> > > mgurtovoy@nvidia.com> wrote:
> > > > Hi David,
> > > > 
> > > > thanks for the report.
> > > > 
> > > > Please check how we fixed that in NVMf in Sagi's commit:
> > > > 
> > > > nvmet-rdma: fix possible bogus dereference under heavy load
> > > > (commit:
> > > > 8407879c4e0d77)
> > > > 
> > > > Maybe this can be done in isert and will solve this problem in
> > > > a simpler
> > > > way.
> > > > 
> > > > is it necessary to change max_cmd_sn ?
> > > > 
> > > > 
> > > 
> > > Hello,
> > > 
> > > Sure, there are alternative methods which could fix this
> > > immediate
> > > issue. e.g. We could make the command structs for scsi commands
> > > get
> > > allocated from a mempool. Is there a particular reason you don't
> > > want
> > > to do anything to modify max_cmd_sn behavior?
> > 
> > according to the description the command was parsed successful and
> > sent
> > to the initiator.
> > 
> 
> Yes.
> 
> > Why do we need to change the window ? it's just a race of putting
> > the
> > context back to the pool.
> > 
> > And this race is rare.
> > 
> 
> Sure, it's going to be rare. Systems using isert targets with
> infiniband are going to be naturally rare. It's part of why I left
> the
> max_cmd_sn behavior untouched for non-isert iscsit since they seem to
> be fine as is. But it's easily and regularly triggered by some
> systems
> which use isert, so worth fixing.
> 
> > > 
> > > I didn't do something like this as it seems to me to go against
> > > the
> > > intent of the design. It makes the iscsi window mostly
> > > meaningless in
> > > some conditions and complicates any allocation path since it now
> > > must
> > > gracefully and sanely handle an iscsi_cmd/isert_cmd not existing.
> > > I
> > > assume special commands like task-management, logouts, and pings
> > > would
> > > need a separate allocation source to keep from being dropped
> > > under
> > > memory load.
> > 
> > it won't be dropped. It would be allocated dynamically and freed
> > (instead of putting it back to the pool).
> > 
> 
> If it waits indefinitely for an allocation it ends up with a
> variation
> of the original problem under memory pressure. If it waits for
> allocation on isert receive, then receive stalls under memory
> pressure
> and won't process the completions which would have released the other
> iscsi_cmd structs just needing final acknowledgement.
> 
> David Jeffery
> 

Folks this is a pending issue stopping a customer from making progress.
They run Oracle and very high workloads on EDR 100 so David fixed this
fosusing on the needs of the isert target changes etc. 

Are you able to give us technical reasons why David's patch is not
suitable and why we he would have to start from scratch.

We literally spent weeks on this and built another special lab for
fully testing EDR 100.
This issue was pending in a BZ for some time and Mellnox had eyes on it
then but this latest suggestion was never put forward in that BZ to us.

Sincerely
Laurence 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-14 17:40         ` Laurence Oberman
@ 2022-03-16 10:38           ` Max Gurtovoy
  2022-03-16 13:07             ` Laurence Oberman
  0 siblings, 1 reply; 13+ messages in thread
From: Max Gurtovoy @ 2022-03-16 10:38 UTC (permalink / raw)
  To: Laurence Oberman, David Jeffery; +Cc: linux-rdma, target-devel, Sagi Grimberg


On 3/14/2022 7:40 PM, Laurence Oberman wrote:
> On Mon, 2022-03-14 at 11:55 -0400, David Jeffery wrote:
>> On Mon, Mar 14, 2022 at 10:52 AM Max Gurtovoy <mgurtovoy@nvidia.com>
>> wrote:
>>>
>>> On 3/14/2022 3:57 PM, David Jeffery wrote:
>>>> On Sun, Mar 13, 2022 at 5:59 AM Max Gurtovoy <
>>>> mgurtovoy@nvidia.com> wrote:
>>>>> Hi David,
>>>>>
>>>>> thanks for the report.
>>>>>
>>>>> Please check how we fixed that in NVMf in Sagi's commit:
>>>>>
>>>>> nvmet-rdma: fix possible bogus dereference under heavy load
>>>>> (commit:
>>>>> 8407879c4e0d77)
>>>>>
>>>>> Maybe this can be done in isert and will solve this problem in
>>>>> a simpler
>>>>> way.
>>>>>
>>>>> is it necessary to change max_cmd_sn ?
>>>>>
>>>>>
>>>> Hello,
>>>>
>>>> Sure, there are alternative methods which could fix this
>>>> immediate
>>>> issue. e.g. We could make the command structs for scsi commands
>>>> get
>>>> allocated from a mempool. Is there a particular reason you don't
>>>> want
>>>> to do anything to modify max_cmd_sn behavior?
>>> according to the description the command was parsed successful and
>>> sent
>>> to the initiator.
>>>
>> Yes.
>>
>>> Why do we need to change the window ? it's just a race of putting
>>> the
>>> context back to the pool.
>>>
>>> And this race is rare.
>>>
>> Sure, it's going to be rare. Systems using isert targets with
>> infiniband are going to be naturally rare. It's part of why I left
>> the
>> max_cmd_sn behavior untouched for non-isert iscsit since they seem to
>> be fine as is. But it's easily and regularly triggered by some
>> systems
>> which use isert, so worth fixing.
>>
>>>> I didn't do something like this as it seems to me to go against
>>>> the
>>>> intent of the design. It makes the iscsi window mostly
>>>> meaningless in
>>>> some conditions and complicates any allocation path since it now
>>>> must
>>>> gracefully and sanely handle an iscsi_cmd/isert_cmd not existing.
>>>> I
>>>> assume special commands like task-management, logouts, and pings
>>>> would
>>>> need a separate allocation source to keep from being dropped
>>>> under
>>>> memory load.
>>> it won't be dropped. It would be allocated dynamically and freed
>>> (instead of putting it back to the pool).
>>>
>> If it waits indefinitely for an allocation it ends up with a
>> variation
>> of the original problem under memory pressure. If it waits for
>> allocation on isert receive, then receive stalls under memory
>> pressure
>> and won't process the completions which would have released the other
>> iscsi_cmd structs just needing final acknowledgement.

If your system is under such memory pressure can you can't allocate few 
bytes for isert response, the silent drop

of the command is your smallest problem. You need to keep the system 
from crashing. And we do that in my suggestion.

>>
>> David Jeffery
>>
> Folks this is a pending issue stopping a customer from making progress.
> They run Oracle and very high workloads on EDR 100 so David fixed this
> fosusing on the needs of the isert target changes etc.
>
> Are you able to give us technical reasons why David's patch is not
> suitable and why we he would have to start from scratch.

You shouldn't start from scratch. You did all the investigation and the 
debugging already.

Coding a solution is the small part after you understand the root cause.

>
> We literally spent weeks on this and built another special lab for
> fully testing EDR 100.
> This issue was pending in a BZ for some time and Mellnox had eyes on it
> then but this latest suggestion was never put forward in that BZ to us.

Mellanox maintainers saw this issue few days before you sent it 
upstream. I suggested sending it upstream and have a discussion here 
since it has nothing to do with Mellanox adapters and Mellanox SW stack 
MLNX_OFED.

Our job as maintainers and reviewers in the community is to see the big 
picture and suggest solutions that not always same as posted in the 
mailing list.

>
> Sincerely
> Laurence
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-16 10:38           ` Max Gurtovoy
@ 2022-03-16 13:07             ` Laurence Oberman
  2022-03-16 14:39               ` Sagi Grimberg
  0 siblings, 1 reply; 13+ messages in thread
From: Laurence Oberman @ 2022-03-16 13:07 UTC (permalink / raw)
  To: Max Gurtovoy, David Jeffery; +Cc: linux-rdma, target-devel, Sagi Grimberg

On Wed, 2022-03-16 at 12:38 +0200, Max Gurtovoy wrote:
> On 3/14/2022 7:40 PM, Laurence Oberman wrote:
> > On Mon, 2022-03-14 at 11:55 -0400, David Jeffery wrote:
> > > On Mon, Mar 14, 2022 at 10:52 AM Max Gurtovoy <
> > > mgurtovoy@nvidia.com>
> > > wrote:
> > > > 
> > > > On 3/14/2022 3:57 PM, David Jeffery wrote:
> > > > > On Sun, Mar 13, 2022 at 5:59 AM Max Gurtovoy <
> > > > > mgurtovoy@nvidia.com> wrote:
> > > > > > Hi David,
> > > > > > 
> > > > > > thanks for the report.
> > > > > > 
> > > > > > Please check how we fixed that in NVMf in Sagi's commit:
> > > > > > 
> > > > > > nvmet-rdma: fix possible bogus dereference under heavy load
> > > > > > (commit:
> > > > > > 8407879c4e0d77)
> > > > > > 
> > > > > > Maybe this can be done in isert and will solve this problem
> > > > > > in
> > > > > > a simpler
> > > > > > way.
> > > > > > 
> > > > > > is it necessary to change max_cmd_sn ?
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > Sure, there are alternative methods which could fix this
> > > > > immediate
> > > > > issue. e.g. We could make the command structs for scsi
> > > > > commands
> > > > > get
> > > > > allocated from a mempool. Is there a particular reason you
> > > > > don't
> > > > > want
> > > > > to do anything to modify max_cmd_sn behavior?
> > > > 
> > > > according to the description the command was parsed successful
> > > > and
> > > > sent
> > > > to the initiator.
> > > > 
> > > 
> > > Yes.
> > > 
> > > > Why do we need to change the window ? it's just a race of
> > > > putting
> > > > the
> > > > context back to the pool.
> > > > 
> > > > And this race is rare.
> > > > 
> > > 
> > > Sure, it's going to be rare. Systems using isert targets with
> > > infiniband are going to be naturally rare. It's part of why I
> > > left
> > > the
> > > max_cmd_sn behavior untouched for non-isert iscsit since they
> > > seem to
> > > be fine as is. But it's easily and regularly triggered by some
> > > systems
> > > which use isert, so worth fixing.
> > > 
> > > > > I didn't do something like this as it seems to me to go
> > > > > against
> > > > > the
> > > > > intent of the design. It makes the iscsi window mostly
> > > > > meaningless in
> > > > > some conditions and complicates any allocation path since it
> > > > > now
> > > > > must
> > > > > gracefully and sanely handle an iscsi_cmd/isert_cmd not
> > > > > existing.
> > > > > I
> > > > > assume special commands like task-management, logouts, and
> > > > > pings
> > > > > would
> > > > > need a separate allocation source to keep from being dropped
> > > > > under
> > > > > memory load.
> > > > 
> > > > it won't be dropped. It would be allocated dynamically and
> > > > freed
> > > > (instead of putting it back to the pool).
> > > > 
> > > 
> > > If it waits indefinitely for an allocation it ends up with a
> > > variation
> > > of the original problem under memory pressure. If it waits for
> > > allocation on isert receive, then receive stalls under memory
> > > pressure
> > > and won't process the completions which would have released the
> > > other
> > > iscsi_cmd structs just needing final acknowledgement.
> 
> If your system is under such memory pressure can you can't allocate
> few 
> bytes for isert response, the silent drop
> 
> of the command is your smallest problem. You need to keep the system 
> from crashing. And we do that in my suggestion.
> 
> > > 
> > > David Jeffery
> > > 
> > 
> > Folks this is a pending issue stopping a customer from making
> > progress.
> > They run Oracle and very high workloads on EDR 100 so David fixed
> > this
> > fosusing on the needs of the isert target changes etc.
> > 
> > Are you able to give us technical reasons why David's patch is not
> > suitable and why we he would have to start from scratch.
> 
> You shouldn't start from scratch. You did all the investigation and
> the 
> debugging already.
> 
> Coding a solution is the small part after you understand the root
> cause.
> 
> > 
> > We literally spent weeks on this and built another special lab for
> > fully testing EDR 100.
> > This issue was pending in a BZ for some time and Mellnox had eyes
> > on it
> > then but this latest suggestion was never put forward in that BZ to
> > us.
> 
> Mellanox maintainers saw this issue few days before you sent it 
> upstream. I suggested sending it upstream and have a discussion here 
> since it has nothing to do with Mellanox adapters and Mellanox SW
> stack 
> MLNX_OFED.
> 
> Our job as maintainers and reviewers in the community is to see the
> big 
> picture and suggest solutions that not always same as posted in the 
> mailing list.
> 
> > 
> > Sincerely
> > Laurence
> > 
> 
> 

Hi Max 

The issue was reported with the OFED stack at the customer, so its why
we opened the BZ to get the Mallnox partners engineers engaged.
We had them then see if it also existed with the inbox stack which it
did. 
Sergey worked a little bit on the issue but did not have the same
suggestion you provivided and asked David for help.

We will be happy to take the fix you propose doing it your way. May I
that the engineewrs to work on this the most and understand the code
best propose a fix your way.

I will take on the responsibility to do all the kernel building and
testing.

Regards
Laurence



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-16 13:07             ` Laurence Oberman
@ 2022-03-16 14:39               ` Sagi Grimberg
  2022-03-16 15:26                 ` Laurence Oberman
  0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2022-03-16 14:39 UTC (permalink / raw)
  To: Laurence Oberman, Max Gurtovoy, David Jeffery; +Cc: linux-rdma, target-devel


>>>>>>> Hi David,
>>>>>>>
>>>>>>> thanks for the report.
>>>>>>>
>>>>>>> Please check how we fixed that in NVMf in Sagi's commit:
>>>>>>>
>>>>>>> nvmet-rdma: fix possible bogus dereference under heavy load
>>>>>>> (commit:
>>>>>>> 8407879c4e0d77)
>>>>>>>
>>>>>>> Maybe this can be done in isert and will solve this problem
>>>>>>> in
>>>>>>> a simpler
>>>>>>> way.
>>>>>>>
>>>>>>> is it necessary to change max_cmd_sn ?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Sure, there are alternative methods which could fix this
>>>>>> immediate
>>>>>> issue. e.g. We could make the command structs for scsi
>>>>>> commands
>>>>>> get
>>>>>> allocated from a mempool. Is there a particular reason you
>>>>>> don't
>>>>>> want
>>>>>> to do anything to modify max_cmd_sn behavior?
>>>>>
>>>>> according to the description the command was parsed successful
>>>>> and
>>>>> sent
>>>>> to the initiator.
>>>>>
>>>>
>>>> Yes.
>>>>
>>>>> Why do we need to change the window ? it's just a race of
>>>>> putting
>>>>> the
>>>>> context back to the pool.
>>>>>
>>>>> And this race is rare.
>>>>>
>>>>
>>>> Sure, it's going to be rare. Systems using isert targets with
>>>> infiniband are going to be naturally rare. It's part of why I
>>>> left
>>>> the
>>>> max_cmd_sn behavior untouched for non-isert iscsit since they
>>>> seem to
>>>> be fine as is. But it's easily and regularly triggered by some
>>>> systems
>>>> which use isert, so worth fixing.
>>>>
>>>>>> I didn't do something like this as it seems to me to go
>>>>>> against
>>>>>> the
>>>>>> intent of the design. It makes the iscsi window mostly
>>>>>> meaningless in
>>>>>> some conditions and complicates any allocation path since it
>>>>>> now
>>>>>> must
>>>>>> gracefully and sanely handle an iscsi_cmd/isert_cmd not
>>>>>> existing.
>>>>>> I
>>>>>> assume special commands like task-management, logouts, and
>>>>>> pings
>>>>>> would
>>>>>> need a separate allocation source to keep from being dropped
>>>>>> under
>>>>>> memory load.
>>>>>
>>>>> it won't be dropped. It would be allocated dynamically and
>>>>> freed
>>>>> (instead of putting it back to the pool).
>>>>>
>>>>
>>>> If it waits indefinitely for an allocation it ends up with a
>>>> variation
>>>> of the original problem under memory pressure. If it waits for
>>>> allocation on isert receive, then receive stalls under memory
>>>> pressure
>>>> and won't process the completions which would have released the
>>>> other
>>>> iscsi_cmd structs just needing final acknowledgement.
>>
>> If your system is under such memory pressure can you can't allocate
>> few
>> bytes for isert response, the silent drop
>>
>> of the command is your smallest problem. You need to keep the system
>> from crashing. And we do that in my suggestion.
>>
>>>>
>>>> David Jeffery
>>>>
>>>
>>> Folks this is a pending issue stopping a customer from making
>>> progress.
>>> They run Oracle and very high workloads on EDR 100 so David fixed
>>> this
>>> fosusing on the needs of the isert target changes etc.
>>>
>>> Are you able to give us technical reasons why David's patch is not
>>> suitable and why we he would have to start from scratch.
>>
>> You shouldn't start from scratch. You did all the investigation and
>> the
>> debugging already.
>>
>> Coding a solution is the small part after you understand the root
>> cause.
>>
>>>
>>> We literally spent weeks on this and built another special lab for
>>> fully testing EDR 100.
>>> This issue was pending in a BZ for some time and Mellnox had eyes
>>> on it
>>> then but this latest suggestion was never put forward in that BZ to
>>> us.
>>
>> Mellanox maintainers saw this issue few days before you sent it
>> upstream. I suggested sending it upstream and have a discussion here
>> since it has nothing to do with Mellanox adapters and Mellanox SW
>> stack
>> MLNX_OFED.
>>
>> Our job as maintainers and reviewers in the community is to see the
>> big
>> picture and suggest solutions that not always same as posted in the
>> mailing list.
>>
>>>
>>> Sincerely
>>> Laurence
>>>
>>
>>
> 
> Hi Max

Hey,

> The issue was reported with the OFED stack at the customer, so its why
> we opened the BZ to get the Mallnox partners engineers engaged.
> We had them then see if it also existed with the inbox stack which it
> did.
> Sergey worked a little bit on the issue but did not have the same
> suggestion you provivided and asked David for help.

I think you can move the corporate discussions offline.

> We will be happy to take the fix you propose doing it your way. May I
> that the engineewrs to work on this the most and understand the code
> best propose a fix your way.

I tend to agree with Max, I looked into the patch and I can't say that
we know for a fact that incrementing the cmdsn after releasing the
iscsi cmd will not introduce anything else (although looks fine at
a high-level).

Is there any measure-able performance implication?

Max, doing dynamic allocation is also a valid fix.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O
  2022-03-16 14:39               ` Sagi Grimberg
@ 2022-03-16 15:26                 ` Laurence Oberman
  0 siblings, 0 replies; 13+ messages in thread
From: Laurence Oberman @ 2022-03-16 15:26 UTC (permalink / raw)
  To: Sagi Grimberg, Max Gurtovoy, David Jeffery; +Cc: linux-rdma, target-devel

On Wed, 2022-03-16 at 16:39 +0200, Sagi Grimberg wrote:
> Is there any measure-able performance implication?

From our testing and customers it prevented the deadlock, but did not
seem to incur any additional latency. I was reaching similar IOPS/sec
amd GB/sec prior to deadlock.
The benefit seems to be just no more stalls and hung tasks

Thanks for the replies

Kind Regards
Laurence


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-03-16 15:26 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-11 17:57 [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O David Jeffery
2022-03-11 17:57 ` [PATCH 1/2] isert: support for unsolicited NOPIN with no response David Jeffery
2022-03-11 17:57 ` [PATCH 2/2] iscsit: increment max_cmd_sn for isert on command release David Jeffery
2022-03-11 19:08 ` [Patch 0/2] iscsit/isert deadlock prevention under heavy I/O Laurence Oberman
2022-03-13  9:59 ` Max Gurtovoy
2022-03-14 13:57   ` David Jeffery
2022-03-14 14:52     ` Max Gurtovoy
2022-03-14 15:55       ` David Jeffery
2022-03-14 17:40         ` Laurence Oberman
2022-03-16 10:38           ` Max Gurtovoy
2022-03-16 13:07             ` Laurence Oberman
2022-03-16 14:39               ` Sagi Grimberg
2022-03-16 15:26                 ` Laurence Oberman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.