Re: [PATCHv3 6/6] scsi: make asynchronous aborts mandatory

From: Benjamin Block <bblock@linux.vnet.ibm.com>
To: Hannes Reinecke <hare@suse.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
	Christoph Hellwig <hch@lst.de>,
	James Bottomley <james.bottomley@hansenpartnership.com>,
	Bart van Assche <bart.vanassche@sandisk.com>,
	linux-scsi@vger.kernel.org
Subject: Re: [PATCHv3 6/6] scsi: make asynchronous aborts mandatory
Date: Tue, 14 Mar 2017 18:33:22 +0100	[thread overview]
Message-ID: <20170314173322.GA19037@bblock-ThinkPad-W530> (raw)
In-Reply-To: <1488359720-130871-7-git-send-email-hare@suse.de>

Hello Hannes,

On Wed, Mar 01, 2017 at 10:15:20AM +0100, Hannes Reinecke wrote:
> There hasn't been any reports for HBAs where asynchronous abort
> would not work, so we should make it mandatory and remove
> the fallback.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
> ---
>  Documentation/scsi/scsi_eh.txt | 18 ++++------
>  drivers/scsi/scsi_error.c      | 81 ++++--------------------------------------
>  drivers/scsi/scsi_lib.c        |  2 +-
>  drivers/scsi/scsi_priv.h       |  3 +-
>  include/scsi/scsi_host.h       |  5 ---
>  5 files changed, 15 insertions(+), 94 deletions(-)
> 
> diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt
> index 4edb9c1c..11e447b 100644
> --- a/Documentation/scsi/scsi_eh.txt
> +++ b/Documentation/scsi/scsi_eh.txt
> @@ -70,7 +70,7 @@ with the command.
>  	scmd is requeued to blk queue.
> 
>   - otherwise
> -	scsi_eh_scmd_add(scmd, 0) is invoked for the command.  See
> +	scsi_eh_scmd_add(scmd) is invoked for the command.  See
>  	[1-3] for details of this function.
> 
> 
> @@ -103,9 +103,7 @@ function
>          eh_timed_out() callback did not handle the command.
>  	Step #2 is taken.
> 
> - 2. If the host supports asynchronous completion (as indicated by the
> -    no_async_abort setting in the host template) scsi_abort_command()
> -    is invoked to schedule an asynchrous abort.
> + 2. scsi_abort_command() is invoked to schedule an asynchrous abort.
>      Asynchronous abort are not invoked for commands which the
>      SCSI_EH_ABORT_SCHEDULED flag is set (this indicates that the command
>      already had been aborted once, and this is a retry which failed),
> @@ -127,16 +125,13 @@ function
> 
>   scmds enter EH via scsi_eh_scmd_add(), which does the following.
> 
> - 1. Turns on scmd->eh_eflags as requested.  It's 0 for error
> -    completions and SCSI_EH_CANCEL_CMD for timeouts.
> + 1. Links scmd->eh_entry to shost->eh_cmd_q
> 
> - 2. Links scmd->eh_entry to shost->eh_cmd_q
> + 2. Sets SHOST_RECOVERY bit in shost->shost_state
> 
> - 3. Sets SHOST_RECOVERY bit in shost->shost_state
> + 3. Increments shost->host_failed
> 
> - 4. Increments shost->host_failed
> -
> - 5. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed
> + 4. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed
> 
>   As can be seen above, once any scmd is added to shost->eh_cmd_q,
>  SHOST_RECOVERY shost_state bit is turned on.  This prevents any new
> @@ -252,7 +247,6 @@ scmd->allowed.
> 
>   1. Error completion / time out
>      ACTION: scsi_eh_scmd_add() is invoked for scmd
> -	- set scmd->eh_eflags
>  	- add scmd to shost->eh_cmd_q
>  	- set SHOST_RECOVERY
>  	- shost->host_failed++
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 802a081..7b70ee9 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -163,7 +163,7 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
>  		}
>  	}
> 
> -	scsi_eh_scmd_add(scmd, 0);
> +	scsi_eh_scmd_add(scmd);
>  }
> 
>  /**
> @@ -217,9 +217,8 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
>  /**
>   * scsi_eh_scmd_add - add scsi cmd to error handling.
>   * @scmd:	scmd to run eh on.
> - * @eh_flag:	optional SCSI_EH flag.
>   */
> -void scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
> +void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
>  {
>  	struct Scsi_Host *shost = scmd->device->host;
>  	unsigned long flags;
> @@ -235,9 +234,6 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
>  	if (shost->eh_deadline != -1 && !shost->last_reset)
>  		shost->last_reset = jiffies;
> 
> -	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
> -		eh_flag &= ~SCSI_EH_CANCEL_CMD;
> -	scmd->eh_eflags |= eh_flag;
>  	scsi_eh_reset(scmd);
>  	list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
>  	shost->host_failed++;
> @@ -271,10 +267,9 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>  		rtn = host->hostt->eh_timed_out(scmd);
> 
>  	if (rtn == BLK_EH_NOT_HANDLED) {
> -		if (host->hostt->no_async_abort ||
> -		    scsi_abort_command(scmd) != SUCCESS) {
> +		if (scsi_abort_command(scmd) != SUCCESS) {
>  			set_host_byte(scmd, DID_TIME_OUT);
> -			scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD);
> +			scsi_eh_scmd_add(scmd);
>  		}
>  	}
> 
> @@ -327,7 +322,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
>  		list_for_each_entry(scmd, work_q, eh_entry) {
>  			if (scmd->device == sdev) {
>  				++total_failures;
> -				if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD)
> +				if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
>  					++cmd_cancel;
>  				else
>  					++cmd_failed;
> @@ -1153,8 +1148,7 @@ int scsi_eh_get_sense(struct list_head *work_q,
>  	 * should not get sense.
>  	 */
>  	list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
> -		if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
> -		    (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) ||
> +		if ((scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) ||
>  		    SCSI_SENSE_VALID(scmd))
>  			continue;
> 
> @@ -1294,61 +1288,6 @@ static int scsi_eh_test_devices(struct list_head *cmd_list,
>  	return list_empty(work_q);
>  }
> 
> -
> -/**
> - * scsi_eh_abort_cmds - abort pending commands.
> - * @work_q:	&list_head for pending commands.
> - * @done_q:	&list_head for processed commands.
> - *
> - * Decription:
> - *    Try and see whether or not it makes sense to try and abort the
> - *    running command.  This only works out to be the case if we have one
> - *    command that has timed out.  If the command simply failed, it makes
> - *    no sense to try and abort the command, since as far as the shost
> - *    adapter is concerned, it isn't running.
> - */
> -static int scsi_eh_abort_cmds(struct list_head *work_q,
> -			      struct list_head *done_q)
> -{
> -	struct scsi_cmnd *scmd, *next;
> -	LIST_HEAD(check_list);
> -	int rtn;
> -	struct Scsi_Host *shost;
> -
> -	list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
> -		if (!(scmd->eh_eflags & SCSI_EH_CANCEL_CMD))
> -			continue;
> -		shost = scmd->device->host;
> -		if (scsi_host_eh_past_deadline(shost)) {
> -			list_splice_init(&check_list, work_q);
> -			SCSI_LOG_ERROR_RECOVERY(3,
> -				scmd_printk(KERN_INFO, scmd,
> -					    "%s: skip aborting cmd, past eh deadline\n",
> -					    current->comm));
> -			return list_empty(work_q);
> -		}
> -		SCSI_LOG_ERROR_RECOVERY(3,
> -			scmd_printk(KERN_INFO, scmd,
> -				     "%s: aborting cmd\n", current->comm));
> -		rtn = scsi_try_to_abort_cmd(shost->hostt, scmd);
> -		if (rtn == FAILED) {
> -			SCSI_LOG_ERROR_RECOVERY(3,
> -				scmd_printk(KERN_INFO, scmd,
> -					    "%s: aborting cmd failed\n",
> -					     current->comm));
> -			list_splice_init(&check_list, work_q);
> -			return list_empty(work_q);
> -		}
> -		scmd->eh_eflags &= ~SCSI_EH_CANCEL_CMD;
> -		if (rtn == FAST_IO_FAIL)
> -			scsi_eh_finish_cmd(scmd, done_q);
> -		else
> -			list_move_tail(&scmd->eh_entry, &check_list);
> -	}
> -
> -	return scsi_eh_test_devices(&check_list, work_q, done_q, 0);
> -}
> -
>  /**
>   * scsi_eh_try_stu - Send START_UNIT to device.
>   * @scmd:	&scsi_cmnd to send START_UNIT
> @@ -1691,11 +1630,6 @@ static void scsi_eh_offline_sdevs(struct list_head *work_q,
>  		sdev_printk(KERN_INFO, scmd->device, "Device offlined - "
>  			    "not ready after error recovery\n");
>  		scsi_device_set_state(scmd->device, SDEV_OFFLINE);
> -		if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD) {
> -			/*
> -			 * FIXME: Handle lost cmds.
> -			 */
> -		}
>  		scsi_eh_finish_cmd(scmd, done_q);
>  	}
>  	return;
> @@ -2139,8 +2073,7 @@ static void scsi_unjam_host(struct Scsi_Host *shost)
>  	SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
> 
>  	if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
> -		if (!scsi_eh_abort_cmds(&eh_work_q, &eh_done_q))
> -			scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
> +		scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
> 
>  	spin_lock_irqsave(shost->host_lock, flags);
>  	if (shost->eh_deadline != -1)
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 0735a46..98b2df8 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1593,7 +1593,7 @@ static void scsi_softirq_done(struct request *rq)
>  			scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
>  			break;
>  		default:
> -			scsi_eh_scmd_add(cmd, 0);
> +			scsi_eh_scmd_add(cmd);
>  			break;
>  	}
>  }
> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> index 5be6cbf6..e20ab10 100644
> --- a/drivers/scsi/scsi_priv.h
> +++ b/drivers/scsi/scsi_priv.h
> @@ -18,7 +18,6 @@
>  /*
>   * Scsi Error Handler Flags
>   */
> -#define SCSI_EH_CANCEL_CMD	0x0001	/* Cancel this cmd */
>  #define SCSI_EH_ABORT_SCHEDULED	0x0002	/* Abort has been scheduled */
> 
>  #define SCSI_SENSE_VALID(scmd) \
> @@ -72,7 +71,7 @@ extern int scsi_dev_info_list_add_keyed(int compatible, char *vendor,
>  extern int scsi_error_handler(void *host);
>  extern int scsi_decide_disposition(struct scsi_cmnd *cmd);
>  extern void scsi_eh_wakeup(struct Scsi_Host *shost);
> -extern void scsi_eh_scmd_add(struct scsi_cmnd *, int);
> +extern void scsi_eh_scmd_add(struct scsi_cmnd *);
>  void scsi_eh_ready_devs(struct Scsi_Host *shost,
>  			struct list_head *work_q,
>  			struct list_head *done_q);
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index 3cd8c3b..afb0481 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -452,11 +452,6 @@ struct scsi_host_template {
>  	unsigned no_write_same:1;
> 
>  	/*
> -	 * True if asynchronous aborts are not supported
> -	 */
> -	unsigned no_async_abort:1;
> -
> -	/*
>  	 * Countdown for host blocking with no commands outstanding.
>  	 */
>  	unsigned int max_host_blocked;
> -- 
> 1.8.5.6
> 

Hmm so, I guess we compromise in terms of how granular we want to
recover? 

When say an abort for command A on LUN 1 behind Port α fails for some
reason, then we also skip all possible aborts for command B on LUN 2
behind Port α and command C on LUN 1 behind Port β? (The host might
already be in recovery by the time command B and C fail)

I mean, in the end, all traffic holds on that host anyway, as long as we
get into EH - which is true as soon as one abort fails. Might as well
skip the wait time till we get there.

I would like it, if we could document that fact a bit more direct.
Otherwise I think that's good. The code also looks good.

Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>

                                                    Beste Grüße / Best regards,
                                                      - Benjamin Block
-- 
Linux on z Systems Development         /         IBM Systems & Technology Group
		  IBM Deutschland Research & Development GmbH 
Vorsitz. AufsR.: Martina Koederitz     /        Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294