All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] FC class: misc fixes
@ 2010-09-23  5:17 michaelc
  2010-09-23  5:17 ` [RFC PATCH 1/7] fc class: fix rport re-add dev_loss handling race michaelc
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: michaelc @ 2010-09-23  5:17 UTC (permalink / raw)
  To: linux-scsi

The following patches fix some FC issues I have found
while doing some testing and porting the code.

The first issue is that there seem to be some races
in the fc_remote_port_add and dev_loss_tmo handling.
Those are fixed with the first two patches:

[RFC PATCH 1/7] fc class: fix rport re-add dev_loss handling race
[RFC PATCH 2/7] fc class: remove fc_flush_work in fc_remote_port_add

The second issue is that there seems to be a race
with fc_block_scsi_eh and the terminate_rport_io
callback. This is fixed with:

[RFC PATCH 3/7] scsi error: rename FAST_IO_FAIL to TRANSPORT_FAILED
[RFC PATCH 4/7] fc class: don't return from fc_block_scsi_eh until IO has been cleaned up

And then I added support for fc_block_scsi_eh to
the remaining fc drivers that implement the terminate_port_io
callback:
[RFC PATCH 5/7] libfc: hook scsi eh into fc_block_scsi_eh
[RFC PATCH 6/7] fnic: hook scsi eh into fc_block_scsi_eh
[RFC PATCH 7/7] qla2xxx: hook scsi eh into fc_block_scsi_eh

The patches were made over scsi-misc.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/7] fc class: fix rport re-add dev_loss handling race
  2010-09-23  5:17 [RFC] FC class: misc fixes michaelc
@ 2010-09-23  5:17 ` michaelc
  2010-09-23  5:17 ` [RFC PATCH 2/7] fc class: remove fc_flush_work in fc_remote_port_add michaelc
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: michaelc @ 2010-09-23  5:17 UTC (permalink / raw)
  To: linux-scsi; +Cc: Mike Christie

From: Mike Christie <michaelc@cs.wisc.edu>

fc_remote_port_add calls fc_flush_work before matching and
readding a rport, but there is nothing that stops the class
from queueing a deletion right after the fc_flush_work call and
before we grab the host_lock. To fix this this adds a cancel_work_sync
call for the stgt_delete_work when we have the rport partially
resetup.

There is also a problem where fc_timeout_deleted_rport can
begin the rport timeout handling and move the rport to the
rport_bindings list. Then when fc_timeout_deleted_rport drops
the host_lock to call fc_terminate_rport_io or dev_loss_tmo_callbk,
fc_remote_port_add will grab the lock, see the rport is on the
rport_bindings list and then add the rport back while
fc_timeout_deleted_rport is still running fc_terminate_rport_io
or dev_loss_tmo_callbk. This patch also fixes this by adding
cancel_delayed_work_sync calls for the dev_loss_work work in
this code path.

Note: this patch replaces the cancel_delayed_work/fc_flush_devloss
calls with cancel_delayed_work_sync. I think this should
be more efficient, because we do not have to wait for all
work to complete (just have to wait for the work and anything
queued before it).

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/scsi_transport_fc.c |   66 ++++++++++++++++++--------------------
 1 files changed, 31 insertions(+), 35 deletions(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 78486d5..d43f69a 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -2300,25 +2300,6 @@ fc_queue_devloss_work(struct Scsi_Host *shost, struct delayed_work *work,
 }
 
 /**
- * fc_flush_devloss - Flush a fc_host's devloss workqueue.
- * @shost:	Pointer to Scsi_Host bound to fc_host.
- */
-static void
-fc_flush_devloss(struct Scsi_Host *shost)
-{
-	if (!fc_host_devloss_work_q(shost)) {
-		printk(KERN_ERR
-			"ERROR: FC host '%s' attempted to flush work, "
-			"when no workqueue created.\n", shost->hostt->name);
-		dump_stack();
-		return;
-	}
-
-	flush_workqueue(fc_host_devloss_work_q(shost));
-}
-
-
-/**
  * fc_remove_host - called to terminate any fc_transport-related elements for a scsi host.
  * @shost:	Which &Scsi_Host
  *
@@ -2417,6 +2398,19 @@ fc_starget_delete(struct work_struct *work)
 	scsi_remove_target(&rport->dev);
 }
 
+/**
+ * fc_cancel_remote_port_delete - cancel rport deletion related work
+ * @rport: rport to sync up
+ *
+ * This does not cancel rport_delete_work, because if that is queued
+ * the rport will have been destroyed when the sync completes.
+ */
+static void fc_cancel_remote_port_delete(struct fc_rport *rport)
+{
+	cancel_delayed_work_sync(&rport->fail_io_work);
+	cancel_delayed_work_sync(&rport->dev_loss_work);
+	cancel_work_sync(&rport->stgt_delete_work);
+}
 
 /**
  * fc_rport_final_delete - finish rport termination and delete it.
@@ -2450,10 +2444,9 @@ fc_rport_final_delete(struct work_struct *work)
 	spin_lock_irqsave(shost->host_lock, flags);
 	if (rport->flags & FC_RPORT_DEVLOSS_PENDING) {
 		spin_unlock_irqrestore(shost->host_lock, flags);
-		if (!cancel_delayed_work(&rport->fail_io_work))
-			fc_flush_devloss(shost);
-		if (!cancel_delayed_work(&rport->dev_loss_work))
-			fc_flush_devloss(shost);
+
+		fc_cancel_remote_port_delete(rport);
+
 		spin_lock_irqsave(shost->host_lock, flags);
 		rport->flags &= ~FC_RPORT_DEVLOSS_PENDING;
 	}
@@ -2718,10 +2711,7 @@ fc_remote_port_add(struct Scsi_Host *shost, int channel,
 				 * If they flush, the port_state will
 				 * be checked and will NOOP the function.
 				 */
-				if (!cancel_delayed_work(&rport->fail_io_work))
-					fc_flush_devloss(shost);
-				if (!cancel_delayed_work(&rport->dev_loss_work))
-					fc_flush_devloss(shost);
+				fc_cancel_remote_port_delete(rport);
 
 				spin_lock_irqsave(shost->host_lock, flags);
 
@@ -2785,13 +2775,25 @@ fc_remote_port_add(struct Scsi_Host *shost, int channel,
 		}
 
 		if (match) {
+			rport->port_state = FC_PORTSTATE_ONLINE;
+			spin_unlock_irqrestore(shost->host_lock, flags);
+			/*
+			 * A stgt delete could be queued after the
+			 * flush call, or a fc_timeout_deleted_rport
+			 * could be in a fc_terminate_rport_io or
+			 * dev_loss_tmo_callbk call so we must
+			 * cancel all work for the rport before
+			 * proceeding.
+			 */
+			fc_cancel_remote_port_delete(rport);
+
+			spin_lock_irqsave(shost->host_lock, flags);
 			memcpy(&rport->node_name, &ids->node_name,
 				sizeof(rport->node_name));
 			memcpy(&rport->port_name, &ids->port_name,
 				sizeof(rport->port_name));
 			rport->port_id = ids->port_id;
 			rport->roles = ids->roles;
-			rport->port_state = FC_PORTSTATE_ONLINE;
 			rport->flags &= ~FC_RPORT_FAST_FAIL_TIMEDOUT;
 
 			if (fci->f->dd_fcrport_size)
@@ -2990,19 +2992,13 @@ fc_remote_port_rolechg(struct fc_rport  *rport, u32 roles)
 		 * machine state change will validate the
 		 * transaction.
 		 */
-		if (!cancel_delayed_work(&rport->fail_io_work))
-			fc_flush_devloss(shost);
-		if (!cancel_delayed_work(&rport->dev_loss_work))
-			fc_flush_devloss(shost);
+		fc_cancel_remote_port_delete(rport);
 
 		spin_lock_irqsave(shost->host_lock, flags);
 		rport->flags &= ~(FC_RPORT_FAST_FAIL_TIMEDOUT |
 				  FC_RPORT_DEVLOSS_PENDING);
 		spin_unlock_irqrestore(shost->host_lock, flags);
 
-		/* ensure any stgt delete functions are done */
-		fc_flush_work(shost);
-
 		/* initiate a scan of the target */
 		spin_lock_irqsave(shost->host_lock, flags);
 		rport->flags |= FC_RPORT_SCAN_PENDING;
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 2/7] fc class: remove fc_flush_work in fc_remote_port_add
  2010-09-23  5:17 [RFC] FC class: misc fixes michaelc
  2010-09-23  5:17 ` [RFC PATCH 1/7] fc class: fix rport re-add dev_loss handling race michaelc
@ 2010-09-23  5:17 ` michaelc
  2010-09-23  5:17 ` [RFC PATCH 3/7] scsi error: rename FAST_IO_FAIL to TRANSPORT_FAILED michaelc
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: michaelc @ 2010-09-23  5:17 UTC (permalink / raw)
  To: linux-scsi; +Cc: Mike Christie

From: Mike Christie <michaelc@cs.wisc.edu>

With the last patch, if a remote port is matched in
fc_remote_port_add we will call cancel_work_sync on the
rport's stgt_delete_work before completing the rport addition.
So unless we wanted to wait for other rport's stgt_delete_work
to complete before adding a rport then the fc_flush_work is
not needed, and this patch removes it.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/scsi_transport_fc.c |   21 ---------------------
 1 files changed, 0 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index d43f69a..3982a2b 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -2257,24 +2257,6 @@ fc_queue_work(struct Scsi_Host *shost, struct work_struct *work)
 }
 
 /**
- * fc_flush_work - Flush a fc_host's workqueue.
- * @shost:	Pointer to Scsi_Host bound to fc_host.
- */
-static void
-fc_flush_work(struct Scsi_Host *shost)
-{
-	if (!fc_host_work_q(shost)) {
-		printk(KERN_ERR
-			"ERROR: FC host '%s' attempted to flush work, "
-			"when no workqueue created.\n", shost->hostt->name);
-		dump_stack();
-		return;
-	}
-
-	flush_workqueue(fc_host_work_q(shost));
-}
-
-/**
  * fc_queue_devloss_work - Schedule work for the fc_host devloss workqueue.
  * @shost:	Pointer to Scsi_Host bound to fc_host.
  * @work:	Work to queue for execution.
@@ -2637,9 +2619,6 @@ fc_remote_port_add(struct Scsi_Host *shost, int channel,
 	unsigned long flags;
 	int match = 0;
 
-	/* ensure any stgt delete functions are done */
-	fc_flush_work(shost);
-
 	/*
 	 * Search the list of "active" rports, for an rport that has been
 	 * deleted, but we've held off the real delete while the target
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 3/7] scsi error: rename FAST_IO_FAIL to TRANSPORT_FAILED
  2010-09-23  5:17 [RFC] FC class: misc fixes michaelc
  2010-09-23  5:17 ` [RFC PATCH 1/7] fc class: fix rport re-add dev_loss handling race michaelc
  2010-09-23  5:17 ` [RFC PATCH 2/7] fc class: remove fc_flush_work in fc_remote_port_add michaelc
@ 2010-09-23  5:17 ` michaelc
  2010-09-23  5:17 ` [RFC PATCH 4/7] fc class: don't return from fc_block_scsi_eh until IO has been cleaned up michaelc
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: michaelc @ 2010-09-23  5:17 UTC (permalink / raw)
  To: linux-scsi; +Cc: Mike Christie

From: Mike Christie <michaelc@cs.wisc.edu>

This renames FAST_IO_FAIL to TRANSPORT_FAILED to indicate
that in the next patches transport classes can return the
error because their fast io fail timer expired, or because
the transport decided the port/target/session/connection
is dead and is going to be cleaning up the mess.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/scsi_error.c           |   20 ++++++++++----------
 drivers/scsi/scsi_transport_fc.c    |    4 ++--
 drivers/scsi/scsi_transport_iscsi.c |    2 +-
 include/scsi/scsi.h                 |    2 +-
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 1de30eb..7e80547 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -990,10 +990,10 @@ static int scsi_eh_abort_cmds(struct list_head *work_q,
 						  "0x%p\n", current->comm,
 						  scmd));
 		rtn = scsi_try_to_abort_cmd(scmd);
-		if (rtn == SUCCESS || rtn == FAST_IO_FAIL) {
+		if (rtn == SUCCESS || rtn == TRANSPORT_FAILED) {
 			scmd->eh_eflags &= ~SCSI_EH_CANCEL_CMD;
 			if (!scsi_device_online(scmd->device) ||
-			    rtn == FAST_IO_FAIL ||
+			    rtn == TRANSPORT_FAILED ||
 			    !scsi_eh_tur(scmd)) {
 				scsi_eh_finish_cmd(scmd, done_q);
 			}
@@ -1120,9 +1120,9 @@ static int scsi_eh_bus_device_reset(struct Scsi_Host *shost,
 						  " 0x%p\n", current->comm,
 						  sdev));
 		rtn = scsi_try_bus_device_reset(bdr_scmd);
-		if (rtn == SUCCESS || rtn == FAST_IO_FAIL) {
+		if (rtn == SUCCESS || rtn == TRANSPORT_FAILED) {
 			if (!scsi_device_online(sdev) ||
-			    rtn == FAST_IO_FAIL ||
+			    rtn == TRANSPORT_FAILED ||
 			    !scsi_eh_tur(bdr_scmd)) {
 				list_for_each_entry_safe(scmd, next,
 							 work_q, eh_entry) {
@@ -1185,11 +1185,11 @@ static int scsi_eh_target_reset(struct Scsi_Host *shost,
 						  "to target %d\n",
 						  current->comm, id));
 		rtn = scsi_try_target_reset(tgtr_scmd);
-		if (rtn == SUCCESS || rtn == FAST_IO_FAIL) {
+		if (rtn == SUCCESS || rtn == TRANSPORT_FAILED) {
 			list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
 				if (id == scmd_id(scmd))
 					if (!scsi_device_online(scmd->device) ||
-					    rtn == FAST_IO_FAIL ||
+					    rtn == TRANSPORT_FAILED ||
 					    !scsi_eh_tur(tgtr_scmd))
 						scsi_eh_finish_cmd(scmd,
 								   done_q);
@@ -1245,11 +1245,11 @@ static int scsi_eh_bus_reset(struct Scsi_Host *shost,
 						  " %d\n", current->comm,
 						  channel));
 		rtn = scsi_try_bus_reset(chan_scmd);
-		if (rtn == SUCCESS || rtn == FAST_IO_FAIL) {
+		if (rtn == SUCCESS || rtn == TRANSPORT_FAILED) {
 			list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
 				if (channel == scmd_channel(scmd))
 					if (!scsi_device_online(scmd->device) ||
-					    rtn == FAST_IO_FAIL ||
+					    rtn == TRANSPORT_FAILED ||
 					    !scsi_eh_tur(scmd))
 						scsi_eh_finish_cmd(scmd,
 								   done_q);
@@ -1283,10 +1283,10 @@ static int scsi_eh_host_reset(struct list_head *work_q,
 						  , current->comm));
 
 		rtn = scsi_try_host_reset(scmd);
-		if (rtn == SUCCESS || rtn == FAST_IO_FAIL) {
+		if (rtn == SUCCESS || rtn == TRANSPORT_FAILED) {
 			list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
 				if (!scsi_device_online(scmd->device) ||
-				    rtn == FAST_IO_FAIL ||
+				    rtn == TRANSPORT_FAILED ||
 				    (!scsi_eh_try_stu(scmd) && !scsi_eh_tur(scmd)) ||
 				    !scsi_eh_tur(scmd))
 					scsi_eh_finish_cmd(scmd, done_q);
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 3982a2b..a15e815 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3181,7 +3181,7 @@ fc_scsi_scan_rport(struct work_struct *work)
  * rports which would lead to offlined SCSI devices.
  *
  * Returns: 0 if the fc_rport left the state FC_PORTSTATE_BLOCKED.
- *	    FAST_IO_FAIL if the fast_io_fail_tmo fired, this should be
+ *	    TRANSPORT_FAILED if the fast_io_fail_tmo fired, this should be
  *	    passed back to scsi_eh.
  */
 int fc_block_scsi_eh(struct scsi_cmnd *cmnd)
@@ -3200,7 +3200,7 @@ int fc_block_scsi_eh(struct scsi_cmnd *cmnd)
 	spin_unlock_irqrestore(shost->host_lock, flags);
 
 	if (rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT)
-		return FAST_IO_FAIL;
+		return TRANSPORT_FAILED;
 
 	return 0;
 }
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 332387a..ddb87be 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -554,7 +554,7 @@ int iscsi_block_scsi_eh(struct scsi_cmnd *cmd)
 	spin_lock_irqsave(&session->lock, flags);
 	while (session->state != ISCSI_SESSION_LOGGED_IN) {
 		if (session->state == ISCSI_SESSION_FREE) {
-			ret = FAST_IO_FAIL;
+			ret = TRANSPORT_FAILED;
 			break;
 		}
 		spin_unlock_irqrestore(&session->lock, flags);
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 8fcb6e0..0683db8 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -425,7 +425,7 @@ static inline int scsi_is_wlun(unsigned int lun)
 #define ADD_TO_MLQUEUE  0x2006
 #define TIMEOUT_ERROR   0x2007
 #define SCSI_RETURN_NOT_HANDLED   0x2008
-#define FAST_IO_FAIL	0x2009
+#define TRANSPORT_FAILED	0x2009
 
 /*
  * Midlevel queue return values.
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 4/7] fc class: don't return from fc_block_scsi_eh until IO has been cleaned up
  2010-09-23  5:17 [RFC] FC class: misc fixes michaelc
                   ` (2 preceding siblings ...)
  2010-09-23  5:17 ` [RFC PATCH 3/7] scsi error: rename FAST_IO_FAIL to TRANSPORT_FAILED michaelc
@ 2010-09-23  5:17 ` michaelc
  2010-09-23  5:47   ` Mike Christie
  2010-09-23  5:17 ` [RFC PATCH 5/7] libfc: hook scsi eh into fc_block_scsi_eh michaelc
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 11+ messages in thread
From: michaelc @ 2010-09-23  5:17 UTC (permalink / raw)
  To: linux-scsi; +Cc: Mike Christie

From: Mike Christie <michaelc@cs.wisc.edu>

If a lld does:

        ret = fc_block_scsi_eh(cmnd);
        if (ret)
                return ret;

in the eh callbacks, then it could cause the following race:

1 the LLD will call fc_block_scsi_eh from the scsi eh thread.
2 From the FC class thread, the fast io fail tmo will fire and set
FC_RPORT_FAST_FAIL_TIMEDOUT, then begin to call terminate_rport_io.
3 The scsi eh thread and the LLD will then break from the
fc_block_scsi_eh block and will return FAST_IO_FAIL.
4 The scsi eh will then assume it owns the command and will start to
process it. It will call scsi_eh_flush_done_q which might fail it or
retry it.
5 But then in the FC class thread, the LLD terminate_rport_io callback
could be processing the IO and possibly accessing a scsi_cmnd struct
that the scsi eh thread has now started to retry or failed and
reallocated to a new request in #4.

This patch has fc_block_scsi_eh wait until the terminate_rport_io
callback has completed before returning. This allows LLDs to not
have to worry about the race.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/scsi_transport_fc.c |   54 ++++++++++++++++++++++++++++---------
 include/scsi/scsi_transport_fc.h |    1 +
 2 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index a15e815..93a8edc 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -1575,6 +1575,7 @@ store_fc_private_host_tgtid_bind_type(struct device *dev,
 				&fc_host_rport_bindings(shost), peers);
 			list_del(&rport->peers);
 			rport->port_state = FC_PORTSTATE_DELETED;
+			rport->flags |= FC_RPORT_TERMINATING_RPORT;
 			fc_queue_work(shost, &rport->rport_delete_work);
 		}
 		spin_unlock_irqrestore(shost->host_lock, flags);
@@ -2316,6 +2317,7 @@ fc_remove_host(struct Scsi_Host *shost)
 			&fc_host->rports, peers) {
 		list_del(&rport->peers);
 		rport->port_state = FC_PORTSTATE_DELETED;
+		rport->flags |= FC_RPORT_TERMINATING_RPORT;
 		fc_queue_work(shost, &rport->rport_delete_work);
 	}
 
@@ -2323,6 +2325,7 @@ fc_remove_host(struct Scsi_Host *shost)
 			&fc_host->rport_bindings, peers) {
 		list_del(&rport->peers);
 		rport->port_state = FC_PORTSTATE_DELETED;
+		rport->flags |= FC_RPORT_TERMINATING_RPORT;
 		fc_queue_work(shost, &rport->rport_delete_work);
 	}
 
@@ -2351,11 +2354,20 @@ static void fc_terminate_rport_io(struct fc_rport *rport)
 {
 	struct Scsi_Host *shost = rport_to_shost(rport);
 	struct fc_internal *i = to_fc_internal(shost->transportt);
+	unsigned long flags;
+
+	spin_lock_irqsave(shost->host_lock, flags);
+	rport->flags |= FC_RPORT_TERMINATING_RPORT;
+	spin_unlock_irqrestore(shost->host_lock, flags);
 
 	/* Involve the LLDD if possible to terminate all io on the rport. */
 	if (i->f->terminate_rport_io)
 		i->f->terminate_rport_io(rport);
 
+	spin_lock_irqsave(shost->host_lock, flags);
+	rport->flags &= ~FC_RPORT_TERMINATING_RPORT;
+	spin_unlock_irqrestore(shost->host_lock, flags);
+
 	/*
 	 * must unblock to flush queued IO. The caller will have set
 	 * the port_state or flags, so that fc_remote_port_chkready will
@@ -2696,7 +2708,8 @@ fc_remote_port_add(struct Scsi_Host *shost, int channel,
 
 				rport->flags &= ~(FC_RPORT_FAST_FAIL_TIMEDOUT |
 						  FC_RPORT_DEVLOSS_PENDING |
-						  FC_RPORT_DEVLOSS_CALLBK_DONE);
+						  FC_RPORT_DEVLOSS_CALLBK_DONE |
+						  FC_RPORT_TERMINATING_RPORT);
 
 				/* if target, initiate a scan */
 				if (rport->scsi_target_id != -1) {
@@ -2773,8 +2786,8 @@ fc_remote_port_add(struct Scsi_Host *shost, int channel,
 				sizeof(rport->port_name));
 			rport->port_id = ids->port_id;
 			rport->roles = ids->roles;
-			rport->flags &= ~FC_RPORT_FAST_FAIL_TIMEDOUT;
-
+			rport->flags &= ~(FC_RPORT_FAST_FAIL_TIMEDOUT |
+					  FC_RPORT_TERMINATING_RPORT);
 			if (fci->f->dd_fcrport_size)
 				memset(rport->dd_data, 0,
 						fci->f->dd_fcrport_size);
@@ -2975,7 +2988,8 @@ fc_remote_port_rolechg(struct fc_rport  *rport, u32 roles)
 
 		spin_lock_irqsave(shost->host_lock, flags);
 		rport->flags &= ~(FC_RPORT_FAST_FAIL_TIMEDOUT |
-				  FC_RPORT_DEVLOSS_PENDING);
+				  FC_RPORT_DEVLOSS_PENDING |
+				  FC_RPORT_TERMINATING_RPORT);
 		spin_unlock_irqrestore(shost->host_lock, flags);
 
 		/* initiate a scan of the target */
@@ -3041,6 +3055,7 @@ fc_timeout_deleted_rport(struct work_struct *work)
 	    (rport->scsi_target_id == -1)) {
 		list_del(&rport->peers);
 		rport->port_state = FC_PORTSTATE_DELETED;
+		rport->flags |= FC_RPORT_TERMINATING_RPORT;
 		dev_printk(KERN_ERR, &rport->dev,
 			"blocked FC remote port time out: removing"
 			" rport%s\n",
@@ -3070,6 +3085,12 @@ fc_timeout_deleted_rport(struct work_struct *work)
 	rport->roles = FC_PORT_ROLE_UNKNOWN;
 	rport->port_state = FC_PORTSTATE_NOTPRESENT;
 	rport->flags &= ~FC_RPORT_FAST_FAIL_TIMEDOUT;
+	/*
+	 * We changed the port_state and are going to drop the lock, so
+	 * we set this now because we want fc_block_scsi_eh to stay blocked
+	 * until terminate_rport_io has completed.
+	 */
+	rport->flags |= FC_RPORT_TERMINATING_RPORT;
 
 	/*
 	 * Pre-emptively kill I/O rather than waiting for the work queue
@@ -3137,12 +3158,17 @@ fc_timeout_fail_rport_io(struct work_struct *work)
 {
 	struct fc_rport *rport =
 		container_of(work, struct fc_rport, fail_io_work.work);
+	struct Scsi_Host *shost = rport_to_shost(rport);
+	unsigned long flags;
 
 	if (rport->port_state != FC_PORTSTATE_BLOCKED)
 		return;
 
-	rport->flags |= FC_RPORT_FAST_FAIL_TIMEDOUT;
 	fc_terminate_rport_io(rport);
+
+	spin_lock_irqsave(shost->host_lock, flags);
+	rport->flags |= FC_RPORT_FAST_FAIL_TIMEDOUT;
+	spin_unlock_irqrestore(shost->host_lock, flags);
 }
 
 /**
@@ -3176,9 +3202,10 @@ fc_scsi_scan_rport(struct work_struct *work)
  *
  * This routine can be called from a FC LLD scsi_eh callback. It
  * blocks the scsi_eh thread until the fc_rport leaves the
- * FC_PORTSTATE_BLOCKED, or the fast_io_fail_tmo fires. This is
- * necessary to avoid the scsi_eh failing recovery actions for blocked
- * rports which would lead to offlined SCSI devices.
+ * FC_PORTSTATE_BLOCKED, or the fast_io_fail_tmo fires and the IO
+ * on the rport has been terminated with the terminate_port_io callback.
+ * This is necessary to avoid the scsi_eh failing recovery actions for
+ * blocked rports which would lead to offlined SCSI devices.
  *
  * Returns: 0 if the fc_rport left the state FC_PORTSTATE_BLOCKED.
  *	    TRANSPORT_FAILED if the fast_io_fail_tmo fired, this should be
@@ -3191,18 +3218,19 @@ int fc_block_scsi_eh(struct scsi_cmnd *cmnd)
 	unsigned long flags;
 
 	spin_lock_irqsave(shost->host_lock, flags);
-	while (rport->port_state == FC_PORTSTATE_BLOCKED &&
-	       !(rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT)) {
+	while ((rport->port_state == FC_PORTSTATE_BLOCKED &&
+	       !(rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT)) ||
+	       rport->flags & FC_RPORT_TERMINATING_RPORT) {
 		spin_unlock_irqrestore(shost->host_lock, flags);
 		msleep(1000);
 		spin_lock_irqsave(shost->host_lock, flags);
 	}
 	spin_unlock_irqrestore(shost->host_lock, flags);
 
-	if (rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT)
-		return TRANSPORT_FAILED;
+	if (rport->port_state == FC_PORTSTATE_ONLINE)
+		return 0;
 
-	return 0;
+	return TRANSPORT_FAILED;
 }
 EXPORT_SYMBOL(fc_block_scsi_eh);
 
diff --git a/include/scsi/scsi_transport_fc.h b/include/scsi/scsi_transport_fc.h
index 9f98fca..f392570 100644
--- a/include/scsi/scsi_transport_fc.h
+++ b/include/scsi/scsi_transport_fc.h
@@ -359,6 +359,7 @@ struct fc_rport {	/* aka fc_starget_attrs */
 #define FC_RPORT_SCAN_PENDING		0x02
 #define FC_RPORT_FAST_FAIL_TIMEDOUT	0x04
 #define FC_RPORT_DEVLOSS_CALLBK_DONE	0x08
+#define FC_RPORT_TERMINATING_RPORT	0x10
 
 #define	dev_to_rport(d)				\
 	container_of(d, struct fc_rport, dev)
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 5/7] libfc: hook scsi eh into fc_block_scsi_eh
  2010-09-23  5:17 [RFC] FC class: misc fixes michaelc
                   ` (3 preceding siblings ...)
  2010-09-23  5:17 ` [RFC PATCH 4/7] fc class: don't return from fc_block_scsi_eh until IO has been cleaned up michaelc
@ 2010-09-23  5:17 ` michaelc
  2010-09-23  5:17 ` [RFC PATCH 6/7] fnic: " michaelc
  2010-09-23  5:17 ` [RFC PATCH 7/7] qla2xxx: " michaelc
  6 siblings, 0 replies; 11+ messages in thread
From: michaelc @ 2010-09-23  5:17 UTC (permalink / raw)
  To: linux-scsi; +Cc: Mike Christie

From: Mike Christie <michaelc@cs.wisc.edu>

This patch just has libfc scsi eh callbacks call
fc_block_scsi_eh like other FC drivers

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/libfc/fc_fcp.c |   25 +++++++++++++++++++------
 1 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index c797f6b..61e7045 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1997,8 +1997,13 @@ int fc_eh_abort(struct scsi_cmnd *sc_cmd)
 {
 	struct fc_fcp_pkt *fsp;
 	struct fc_lport *lport;
-	int rc = FAILED;
 	unsigned long flags;
+	int rc;
+
+	rc = fc_block_scsi_eh(sc_cmd);
+	if (rc)
+		return rc;
+	rc = FAILED;
 
 	lport = shost_priv(sc_cmd->device->host);
 	if (lport->state != LPORT_ST_READY)
@@ -2044,12 +2049,15 @@ int fc_eh_device_reset(struct scsi_cmnd *sc_cmd)
 	struct fc_lport *lport;
 	struct fc_fcp_pkt *fsp;
 	struct fc_rport *rport = starget_to_rport(scsi_target(sc_cmd->device));
-	int rc = FAILED;
-	int rval;
+	int rc;
 
-	rval = fc_remote_port_chkready(rport);
-	if (rval)
-		goto out;
+	rc = fc_block_scsi_eh(sc_cmd);
+	if (rc)
+		return rc;
+	rc = FAILED;
+
+	if (fc_remote_port_chkready(rport))
+		return rc;
 
 	lport = shost_priv(sc_cmd->device->host);
 
@@ -2093,9 +2101,14 @@ int fc_eh_host_reset(struct scsi_cmnd *sc_cmd)
 	struct Scsi_Host *shost = sc_cmd->device->host;
 	struct fc_lport *lport = shost_priv(shost);
 	unsigned long wait_tmo;
+	int rc;
 
 	FC_SCSI_DBG(lport, "Resetting host\n");
 
+	rc = fc_block_scsi_eh(sc_cmd);
+	if (rc)
+		return rc;
+
 	lport->tt.lport_reset(lport);
 	wait_tmo = jiffies + FC_HOST_RESET_TIMEOUT;
 	while (!fc_fcp_lport_queue_ready(lport) && time_before(jiffies,
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 6/7] fnic: hook scsi eh into fc_block_scsi_eh
  2010-09-23  5:17 [RFC] FC class: misc fixes michaelc
                   ` (4 preceding siblings ...)
  2010-09-23  5:17 ` [RFC PATCH 5/7] libfc: hook scsi eh into fc_block_scsi_eh michaelc
@ 2010-09-23  5:17 ` michaelc
  2010-09-23  5:37   ` Mike Christie
  2010-09-23  5:17 ` [RFC PATCH 7/7] qla2xxx: " michaelc
  6 siblings, 1 reply; 11+ messages in thread
From: michaelc @ 2010-09-23  5:17 UTC (permalink / raw)
  To: linux-scsi; +Cc: Mike Christie

From: Mike Christie <michaelc@cs.wisc.edu>

fnic is calling fc_block_scsi_eh, but it is not passing
up TRANSPORT_FAILED if returned.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/fnic/fnic_scsi.c |   18 ++++++++++++++----
 1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/fnic/fnic_scsi.c b/drivers/scsi/fnic/fnic_scsi.c
index 198cbab..4cedb37 100644
--- a/drivers/scsi/fnic/fnic_scsi.c
+++ b/drivers/scsi/fnic/fnic_scsi.c
@@ -1234,13 +1234,16 @@ int fnic_abort_cmd(struct scsi_cmnd *sc)
 	struct fc_rport *rport;
 	spinlock_t *io_lock;
 	unsigned long flags;
-	int ret = SUCCESS;
+	int ret;
 	u32 task_req;
 	struct scsi_lun fc_lun;
 	DECLARE_COMPLETION_ONSTACK(tm_done);
 
 	/* Wait for rport to unblock */
-	fc_block_scsi_eh(sc);
+	ret = fc_block_scsi_eh(sc);
+	if (ret)
+		return ret;
+	ret = SUCCESS;
 
 	/* Get local-port, check ready and link up */
 	lp = shost_priv(sc->device->host);
@@ -1512,13 +1515,16 @@ int fnic_device_reset(struct scsi_cmnd *sc)
 	struct fnic_io_req *io_req;
 	struct fc_rport *rport;
 	int status;
-	int ret = FAILED;
+	int ret;
 	spinlock_t *io_lock;
 	unsigned long flags;
 	DECLARE_COMPLETION_ONSTACK(tm_done);
 
 	/* Wait for rport to unblock */
-	fc_block_scsi_eh(sc);
+	ret = fc_block_scsi_eh(sc);
+	if (ret)
+		return ret;
+	ret = FAILED;
 
 	/* Get local-port, check ready and link up */
 	lp = shost_priv(sc->device->host);
@@ -1699,6 +1705,10 @@ int fnic_host_reset(struct scsi_cmnd *sc)
 	struct Scsi_Host *shost = sc->device->host;
 	struct fc_lport *lp = shost_priv(shost);
 
+	/* Wait for rport to unblock */
+	ret = fc_block_scsi_eh(sc);
+	if (ret)
+		return ret;
 	/*
 	 * If fnic_reset is successful, wait for fabric login to complete
 	 * scsi-ml tries to send a TUR to every device if host reset is
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 7/7] qla2xxx: hook scsi eh into fc_block_scsi_eh
  2010-09-23  5:17 [RFC] FC class: misc fixes michaelc
                   ` (5 preceding siblings ...)
  2010-09-23  5:17 ` [RFC PATCH 6/7] fnic: " michaelc
@ 2010-09-23  5:17 ` michaelc
  6 siblings, 0 replies; 11+ messages in thread
From: michaelc @ 2010-09-23  5:17 UTC (permalink / raw)
  To: linux-scsi; +Cc: Mike Christie

From: Mike Christie <michaelc@cs.wisc.edu>

qla2xxx is calling fc_block_scsi_eh, but it is not passing
up TRANSPORT_FAILED if returned.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/qla2xxx/qla_os.c |   20 +++++++++++++++-----
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index bdd53f0..7523041 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -834,7 +834,9 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
 	srb_t *spt;
 	int got_ref = 0;
 
-	fc_block_scsi_eh(cmd);
+	ret = fc_block_scsi_eh(cmd);
+	if (ret)
+		return ret;
 
 	if (!CMD_SP(cmd))
 		return SUCCESS;
@@ -964,9 +966,11 @@ __qla2xxx_eh_generic_reset(char *name, enum nexus_wait_type type,
 {
 	scsi_qla_host_t *vha = shost_priv(cmd->device->host);
 	fc_port_t *fcport = (struct fc_port *) cmd->device->hostdata;
-	int err;
+	int err, ret;
 
-	fc_block_scsi_eh(cmd);
+	ret = fc_block_scsi_eh(cmd);
+	if (ret)
+		return ret;
 
 	if (!fcport)
 		return FAILED;
@@ -1045,7 +1049,10 @@ qla2xxx_eh_bus_reset(struct scsi_cmnd *cmd)
 	unsigned int id, lun;
 	unsigned long serial;
 
-	fc_block_scsi_eh(cmd);
+	ret = fc_block_scsi_eh(cmd);
+	if (ret)
+		return ret;
+	ret = FAILED;
 
 	id = cmd->device->id;
 	lun = cmd->device->lun;
@@ -1107,7 +1114,10 @@ qla2xxx_eh_host_reset(struct scsi_cmnd *cmd)
 	unsigned long serial;
 	scsi_qla_host_t *base_vha = pci_get_drvdata(ha->pdev);
 
-	fc_block_scsi_eh(cmd);
+	ret = fc_block_scsi_eh(cmd);
+	if (ret)
+		return ret;
+	ret = FAILED;
 
 	id = cmd->device->id;
 	lun = cmd->device->lun;
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 6/7] fnic: hook scsi eh into fc_block_scsi_eh
  2010-09-23  5:17 ` [RFC PATCH 6/7] fnic: " michaelc
@ 2010-09-23  5:37   ` Mike Christie
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Christie @ 2010-09-23  5:37 UTC (permalink / raw)
  To: michaelc; +Cc: linux-scsi, Abhijeet Joglekar

On 09/23/2010 12:17 AM, michaelc@cs.wisc.edu wrote:
> From: Mike Christie<michaelc@cs.wisc.edu>
>
> fnic is calling fc_block_scsi_eh, but it is not passing
> up TRANSPORT_FAILED if returned.
>
> Signed-off-by: Mike Christie<michaelc@cs.wisc.edu>

I did not mean to send this patch for fnic.

It does not always work, because because fnic's terminate_rport_io 
callback can fail and then relies on the scsi eh functions to clean up 
the command. However, the fc_block_scsi_eh function relies on the 
terminate_rport_io callback cleaning up the commands.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 4/7] fc class: don't return from fc_block_scsi_eh until IO has been cleaned up
  2010-09-23  5:17 ` [RFC PATCH 4/7] fc class: don't return from fc_block_scsi_eh until IO has been cleaned up michaelc
@ 2010-09-23  5:47   ` Mike Christie
  2010-09-23  7:18     ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Christie @ 2010-09-23  5:47 UTC (permalink / raw)
  To: michaelc; +Cc: linux-scsi

On 09/23/2010 12:17 AM, michaelc@cs.wisc.edu wrote:
> From: Mike Christie<michaelc@cs.wisc.edu>
>
> If a lld does:
>
>          ret = fc_block_scsi_eh(cmnd);
>          if (ret)
>                  return ret;
>
> in the eh callbacks, then it could cause the following race:
>
> 1 the LLD will call fc_block_scsi_eh from the scsi eh thread.
> 2 From the FC class thread, the fast io fail tmo will fire and set
> FC_RPORT_FAST_FAIL_TIMEDOUT, then begin to call terminate_rport_io.
> 3 The scsi eh thread and the LLD will then break from the
> fc_block_scsi_eh block and will return FAST_IO_FAIL.
> 4 The scsi eh will then assume it owns the command and will start to
> process it. It will call scsi_eh_flush_done_q which might fail it or
> retry it.
> 5 But then in the FC class thread, the LLD terminate_rport_io callback
> could be processing the IO and possibly accessing a scsi_cmnd struct
> that the scsi eh thread has now started to retry or failed and
> reallocated to a new request in #4.
>
> This patch has fc_block_scsi_eh wait until the terminate_rport_io
> callback has completed before returning. This allows LLDs to not
> have to worry about the race.
>

I think this is not going to work. It looks like for drivers like lpfc 
and even qla2xxx in the ISP_ABORT case, because even after 
terminate_rport_io has completed the driver can still touch the scsi_cmnd.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 4/7] fc class: don't return from fc_block_scsi_eh until IO has been cleaned up
  2010-09-23  5:47   ` Mike Christie
@ 2010-09-23  7:18     ` Hannes Reinecke
  0 siblings, 0 replies; 11+ messages in thread
From: Hannes Reinecke @ 2010-09-23  7:18 UTC (permalink / raw)
  To: Mike Christie; +Cc: linux-scsi

Mike Christie wrote:
> On 09/23/2010 12:17 AM, michaelc@cs.wisc.edu wrote:
>> From: Mike Christie<michaelc@cs.wisc.edu>
>>
>> If a lld does:
>>
>>          ret = fc_block_scsi_eh(cmnd);
>>          if (ret)
>>                  return ret;
>>
>> in the eh callbacks, then it could cause the following race:
>>
>> 1 the LLD will call fc_block_scsi_eh from the scsi eh thread.
>> 2 From the FC class thread, the fast io fail tmo will fire and set
>> FC_RPORT_FAST_FAIL_TIMEDOUT, then begin to call terminate_rport_io.
>> 3 The scsi eh thread and the LLD will then break from the
>> fc_block_scsi_eh block and will return FAST_IO_FAIL.
>> 4 The scsi eh will then assume it owns the command and will start to
>> process it. It will call scsi_eh_flush_done_q which might fail it or
>> retry it.
>> 5 But then in the FC class thread, the LLD terminate_rport_io callback
>> could be processing the IO and possibly accessing a scsi_cmnd struct
>> that the scsi eh thread has now started to retry or failed and
>> reallocated to a new request in #4.
>>
>> This patch has fc_block_scsi_eh wait until the terminate_rport_io
>> callback has completed before returning. This allows LLDs to not
>> have to worry about the race.
>>
> 
> I think this is not going to work. It looks like for drivers like lpfc
> and even qla2xxx in the ISP_ABORT case, because even after
> terminate_rport_io has completed the driver can still touch the scsi_cmnd.

It's even worse than that.
fc_block_scsi_eh() returns '0' (!) on success, which causes havoc in
some drivers as they expect 'ret' to be a proper SCSI result value.
And we're not handling 'FAST_IO_FAIL' in all places. And not all FC
drivers evaluate the return value of fc_block_scsi_eh().

I'll have a patchset for this; will be sending it shortly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2010-09-23  7:18 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-23  5:17 [RFC] FC class: misc fixes michaelc
2010-09-23  5:17 ` [RFC PATCH 1/7] fc class: fix rport re-add dev_loss handling race michaelc
2010-09-23  5:17 ` [RFC PATCH 2/7] fc class: remove fc_flush_work in fc_remote_port_add michaelc
2010-09-23  5:17 ` [RFC PATCH 3/7] scsi error: rename FAST_IO_FAIL to TRANSPORT_FAILED michaelc
2010-09-23  5:17 ` [RFC PATCH 4/7] fc class: don't return from fc_block_scsi_eh until IO has been cleaned up michaelc
2010-09-23  5:47   ` Mike Christie
2010-09-23  7:18     ` Hannes Reinecke
2010-09-23  5:17 ` [RFC PATCH 5/7] libfc: hook scsi eh into fc_block_scsi_eh michaelc
2010-09-23  5:17 ` [RFC PATCH 6/7] fnic: " michaelc
2010-09-23  5:37   ` Mike Christie
2010-09-23  5:17 ` [RFC PATCH 7/7] qla2xxx: " michaelc

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.