All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] qla2xxx: Fix extraneous ref on sp's after adapter break
@ 2017-05-25 19:26 Bill Kuzeja
  2017-05-31 21:08 ` Madhani, Himanshu
  2017-06-01  2:49 ` Martin K. Petersen
  0 siblings, 2 replies; 3+ messages in thread
From: Bill Kuzeja @ 2017-05-25 19:26 UTC (permalink / raw)
  To: linux-scsi; +Cc: qla2xxx-upstream

Hung task timeouts can result if a qlogic board breaks unexpectedly while
running I/O. These tasks become hung because command srb reference counts
are not going to zero, hence the affected srbs and commands do not get 
freed. This fix accounts for this extra reference in the srbs in the case 
of a board failure. 

Fixes: a465537ad1a4 ("qla2xxx: Disable the adapter and skip error recovery in case of register disconnect")
Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
---

I encountered this hang during by adapter break testing (on Stratus
hardware, we need to survive adapter breaks). I noticed that we wait
indefinitely for several commands which all have outstanding
references to them, but which have all followed this code path:

qla2x00_abort_all_cmds
   qla2xxx_eh_abort
       Exit early because qla2x00_isp_reg_stat is non-zero.

Note that before calling qla2xxx_eh_abort from this path, we do an extra
sp_get(sp), which takes out another reference on the current sp. If we
do not early exit immediately from qla2xxx_eh_abort, we'll get rid of this
extra reference through the abort - which is the normal case.

However, if we exit immediately, this extra sp_get is never dereferenced.
Each one of the commands flowing through this code will be stuck forever,
resulting in hung tasks.

This early exit in qla2xxx_eh_abort was introduced by this commit:

commit a465537ad1a4 ("qla2xxx: Disable the adapter and skip error recovery in case of register disconnect.")

The solution for this is somewhat tricky because qla2xxx_eh_abort can be
invoked from the scsi error handler as well as qla2x00_abort_all_cmds.
I originally thought of having qla2xxx_eh_abort remove the extra reference
before early exiting, but not knowing the context of its invocation,
this seemed dangerous.

Alternatively we could also just remove the early exit case from
qla2xxx_eh_abort, but the aforementioned commit was put in for a reason, so 
that doesn't seem correct either.

The right thing to do seems to be putting the fix in qla2x00_abort_all_cmds,
checking the conditions we where we have exited early from qla2xxx_eh_abort
before removing the extra reference.

---
 drivers/scsi/qla2xxx/qla_os.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 1c79579..1a93365 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1632,7 +1632,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
 void
 qla2x00_abort_all_cmds(scsi_qla_host_t *vha, int res)
 {
-	int que, cnt;
+	int que, cnt, status;
 	unsigned long flags;
 	srb_t *sp;
 	struct qla_hw_data *ha = vha->hw;
@@ -1662,8 +1662,12 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
 					 */
 					sp_get(sp);
 					spin_unlock_irqrestore(&ha->hardware_lock, flags);
-					qla2xxx_eh_abort(GET_CMD_SP(sp));
+					status = qla2xxx_eh_abort(GET_CMD_SP(sp));
 					spin_lock_irqsave(&ha->hardware_lock, flags);
+					/* Get rid of extra reference if immediate exit
+					 * from ql2xxx_eh_abort */
+					if (status == FAILED && (qla2x00_isp_reg_stat(ha)))
+						atomic_dec(&sp->ref_count);
 				}
 				req->outstanding_cmds[cnt] = NULL;
 				sp->done(sp, res);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] qla2xxx: Fix extraneous ref on sp's after adapter break
  2017-05-25 19:26 [PATCH] qla2xxx: Fix extraneous ref on sp's after adapter break Bill Kuzeja
@ 2017-05-31 21:08 ` Madhani, Himanshu
  2017-06-01  2:49 ` Martin K. Petersen
  1 sibling, 0 replies; 3+ messages in thread
From: Madhani, Himanshu @ 2017-05-31 21:08 UTC (permalink / raw)
  To: Bill Kuzeja; +Cc: linux-scsi, Dept-Eng QLA2xxx Upstream


> On May 25, 2017, at 12:26 PM, Bill Kuzeja <william.kuzeja@stratus.com> wrote:
> 
> Hung task timeouts can result if a qlogic board breaks unexpectedly while
> running I/O. These tasks become hung because command srb reference counts
> are not going to zero, hence the affected srbs and commands do not get 
> freed. This fix accounts for this extra reference in the srbs in the case 
> of a board failure. 
> 
> Fixes: a465537ad1a4 ("qla2xxx: Disable the adapter and skip error recovery in case of register disconnect")
> Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
> ---
> 
> I encountered this hang during by adapter break testing (on Stratus
> hardware, we need to survive adapter breaks). I noticed that we wait
> indefinitely for several commands which all have outstanding
> references to them, but which have all followed this code path:
> 
> qla2x00_abort_all_cmds
>   qla2xxx_eh_abort
>       Exit early because qla2x00_isp_reg_stat is non-zero.
> 
> Note that before calling qla2xxx_eh_abort from this path, we do an extra
> sp_get(sp), which takes out another reference on the current sp. If we
> do not early exit immediately from qla2xxx_eh_abort, we'll get rid of this
> extra reference through the abort - which is the normal case.
> 
> However, if we exit immediately, this extra sp_get is never dereferenced.
> Each one of the commands flowing through this code will be stuck forever,
> resulting in hung tasks.
> 
> This early exit in qla2xxx_eh_abort was introduced by this commit:
> 
> commit a465537ad1a4 ("qla2xxx: Disable the adapter and skip error recovery in case of register disconnect.")
> 
> The solution for this is somewhat tricky because qla2xxx_eh_abort can be
> invoked from the scsi error handler as well as qla2x00_abort_all_cmds.
> I originally thought of having qla2xxx_eh_abort remove the extra reference
> before early exiting, but not knowing the context of its invocation,
> this seemed dangerous.
> 
> Alternatively we could also just remove the early exit case from
> qla2xxx_eh_abort, but the aforementioned commit was put in for a reason, so 
> that doesn't seem correct either.
> 
> The right thing to do seems to be putting the fix in qla2x00_abort_all_cmds,
> checking the conditions we where we have exited early from qla2xxx_eh_abort
> before removing the extra reference.
> 
> ---
> drivers/scsi/qla2xxx/qla_os.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 1c79579..1a93365 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -1632,7 +1632,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
> void
> qla2x00_abort_all_cmds(scsi_qla_host_t *vha, int res)
> {
> -	int que, cnt;
> +	int que, cnt, status;
> 	unsigned long flags;
> 	srb_t *sp;
> 	struct qla_hw_data *ha = vha->hw;
> @@ -1662,8 +1662,12 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
> 					 */
> 					sp_get(sp);
> 					spin_unlock_irqrestore(&ha->hardware_lock, flags);
> -					qla2xxx_eh_abort(GET_CMD_SP(sp));
> +					status = qla2xxx_eh_abort(GET_CMD_SP(sp));
> 					spin_lock_irqsave(&ha->hardware_lock, flags);
> +					/* Get rid of extra reference if immediate exit
> +					 * from ql2xxx_eh_abort */
> +					if (status == FAILED && (qla2x00_isp_reg_stat(ha)))
> +						atomic_dec(&sp->ref_count);
> 				}
> 				req->outstanding_cmds[cnt] = NULL;
> 				sp->done(sp, res);
> -- 
> 1.8.3.1
> 

Looks Good. 

Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>

Thanks,
- Himanshu

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] qla2xxx: Fix extraneous ref on sp's after adapter break
  2017-05-25 19:26 [PATCH] qla2xxx: Fix extraneous ref on sp's after adapter break Bill Kuzeja
  2017-05-31 21:08 ` Madhani, Himanshu
@ 2017-06-01  2:49 ` Martin K. Petersen
  1 sibling, 0 replies; 3+ messages in thread
From: Martin K. Petersen @ 2017-06-01  2:49 UTC (permalink / raw)
  To: Bill Kuzeja; +Cc: linux-scsi, qla2xxx-upstream


Bill,

> Hung task timeouts can result if a qlogic board breaks unexpectedly
> while running I/O. These tasks become hung because command srb
> reference counts are not going to zero, hence the affected srbs and
> commands do not get freed. This fix accounts for this extra reference
> in the srbs in the case of a board failure.

Applied to 4.12/scsi-fixes, thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2017-06-01  2:50 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-25 19:26 [PATCH] qla2xxx: Fix extraneous ref on sp's after adapter break Bill Kuzeja
2017-05-31 21:08 ` Madhani, Himanshu
2017-06-01  2:49 ` Martin K. Petersen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.