From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hannes Reinecke Subject: Re: [PATCH v4 03/43] hpsa: rework controller command submission Date: Fri, 17 Apr 2015 14:57:57 +0200 Message-ID: <55310355.8090701@suse.de> References: <20150416134224.30238.66082.stgit@brunhilda> <20150416134658.30238.76254.stgit@brunhilda> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Received: from cantor2.suse.de ([195.135.220.15]:58740 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752515AbbDQM57 (ORCPT ); Fri, 17 Apr 2015 08:57:59 -0400 In-Reply-To: <20150416134658.30238.76254.stgit@brunhilda> Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Don Brace , scott.teel@pmcs.com, Kevin.Barnett@pmcs.com, james.bottomley@parallels.com, hch@infradead.org, Justin.Lindley@pmcs.combrace@pmcs.com Cc: linux-scsi@vger.kernel.org Hi Don, some comments inline. On 04/16/2015 03:46 PM, Don Brace wrote: > From: Webb Scales >=20 > Allow driver initiated commands to have a timeout. It does not > yet try to do anything with timeouts on such commands. >=20 > We are sending a reset in order to get rid of a command we want to ab= ort. > If we make it return on the same reply queue as the command we want t= o abort, > the completion of the aborted command will not race with the completi= on of > the reset command. >=20 > Rename hpsa_scsi_do_simple_cmd_core() to hpsa_scsi_do_simple_cmd(), s= ince > this function is the interface for issuing commands to the controller= and > not the "core" of that implementation. Add a parameter to it which a= llows > the caller to specify the reply queue to be used. Modify existing ca= llers > to specify the default reply queue. >=20 > Rename __hpsa_scsi_do_simple_cmd_core() to hpsa_scsi_do_simple_cmd_co= re(), > since this routine is the "core" implementation of the "do simple com= mand" > function and there is no longer any other function with a similar nam= e. > Modify the existing callers of this routine (other than > hpsa_scsi_do_simple_cmd()) to instead call hpsa_scsi_do_simple_cmd(),= since > it will now accept the reply_queue paramenter, and it provides a cont= roller > lock-up check. (Also, tweak two related message strings to make them > distinct from each other.) >=20 > Submitting a command to a locked up controller always results in a ti= meout, > so check for controller lock-up before submitting. >=20 > This is to enable fixing a race between command completions and > abort completions on different reply queues in a subsequent patch. > We want to be able to specify which reply queue an abort completion > should occur on so that it cannot race the completion of the command > it is trying to abort. >=20 > The following race was possible in theory: >=20 > 1. Abort command is sent to hardware. > 2. Command to be aborted simultaneously completes on another > reply queue. > 3. Hardware receives abort command, decides command has already > completed and indicates this to the driver via another different > reply queue. > 4. driver processes abort completion finds that the hardware does n= ot know > about the command, concludes that therefore the command cannot c= omplete, > returns SUCCESS indicating to the mid-layer that the scsi_cmnd m= ay be > re-used. > 5. Command from step 2 is processed and completed back to scsi mid > layer (after we already promised that would never happen.) >=20 > Fix by forcing aborts to complete on the same reply queue as the comm= and > they are aborting. >=20 > Piggybacking device rescanning functionality onto the lockup > detection thread is not a good idea because if the controller > locks up during device rescanning, then the thread could get > stuck, then the lockup isn't detected. Use separate work > queues for device rescanning and lockup detection. >=20 > Detect controller lockup in abort handler. >=20 > After a lockup is detected, return DO_NO_CONNECT which results in imm= ediate > termination of commands rather than DID_ERR which results in retries. >=20 > Modify detect_controller_lockup() to return the result, to remove the= need for a separate check. >=20 > Reviewed-by: Scott Teel > Reviewed-by: Kevin Barnett > Signed-off-by: Webb Scales > Signed-off-by: Don Brace > --- > drivers/scsi/hpsa.c | 329 ++++++++++++++++++++++++++++++++++++-= ---------- > drivers/scsi/hpsa_cmd.h | 5 + > 2 files changed, 257 insertions(+), 77 deletions(-) >=20 [ .. ] > @@ -4375,13 +4469,46 @@ static int hpsa_eh_device_reset_handler(struc= t scsi_cmnd *scsicmd) > "device lookup failed.\n"); > return FAILED; > } > - dev_warn(&h->pdev->dev, "resetting device %d:%d:%d:%d\n", > - h->scsi_host->host_no, dev->bus, dev->target, dev->lun); > + > + /* if controller locked up, we can guarantee command won't complete= */ > + if (lockup_detected(h)) { > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%d RESET FAILED, lockup detected\n", > + h->scsi_host->host_no, dev->bus, dev->target, > + dev->lun); > + return FAILED; > + } > + > + /* this reset request might be the result of a lockup; check */ > + if (detect_controller_lockup(h)) { > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%d RESET FAILED, new lockup detected\n", > + h->scsi_host->host_no, dev->bus, dev->target, > + dev->lun); > + return FAILED; > + } > + > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%d: %s %.8s %.16s resetting RAID-%s SSDSmartPathCap= %c En%c Exp=3D%d\n", > + h->scsi_host->host_no, dev->bus, dev->target, dev->lun, > + scsi_device_type(dev->devtype), > + dev->vendor, > + dev->model, > + dev->raid_level > RAID_UNKNOWN ? > + "RAID-?" : raid_label[dev->raid_level], > + dev->offload_config ? '+' : '-', > + dev->offload_enabled ? '+' : '-', > + dev->expose_state); > + Didn't you just reworked the message logging in the previous patch? Why can't you use it here? > /* send a reset to the SCSI LUN which the command was sent to */ > - rc =3D hpsa_send_reset(h, dev->scsi3addr, HPSA_RESET_TYPE_LUN); > + rc =3D hpsa_send_reset(h, dev->scsi3addr, HPSA_RESET_TYPE_LUN, > + DEFAULT_REPLY_QUEUE); > if (rc =3D=3D 0 && wait_for_device_to_become_ready(h, dev->scsi3add= r) =3D=3D 0) > return SUCCESS; > =20 > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%d reset failed\n", > + h->scsi_host->host_no, dev->bus, dev->target, dev->lun); > return FAILED; > } > =20 > @@ -4426,7 +4553,7 @@ static void hpsa_get_tag(struct ctlr_info *h, > } > =20 > static int hpsa_send_abort(struct ctlr_info *h, unsigned char *scsi3= addr, > - struct CommandList *abort, int swizzle) > + struct CommandList *abort, int swizzle, int reply_queue) > { > int rc =3D IO_OK; > struct CommandList *c; > @@ -4444,9 +4571,9 @@ static int hpsa_send_abort(struct ctlr_info *h,= unsigned char *scsi3addr, > 0, 0, scsi3addr, TYPE_MSG); > if (swizzle) > swizzle_abort_tag(&c->Request.CDB[4]); > - hpsa_scsi_do_simple_cmd_core(h, c); > + (void) hpsa_scsi_do_simple_cmd(h, c, reply_queue, NO_TIMEOUT); > hpsa_get_tag(h, abort, &taglower, &tagupper); > - dev_dbg(&h->pdev->dev, "%s: Tag:0x%08x:%08x: do_simple_cmd_core com= pleted.\n", > + dev_dbg(&h->pdev->dev, "%s: Tag:0x%08x:%08x: do_simple_cmd(abort) c= ompleted.\n", > __func__, tagupper, taglower); > /* no unmap needed here because no data xfer. */ > =20 > @@ -4478,7 +4605,7 @@ static int hpsa_send_abort(struct ctlr_info *h,= unsigned char *scsi3addr, > */ > =20 > static int hpsa_send_reset_as_abort_ioaccel2(struct ctlr_info *h, > - unsigned char *scsi3addr, struct CommandList *abort) > + unsigned char *scsi3addr, struct CommandList *abort, int reply_queu= e) > { > int rc =3D IO_OK; > struct scsi_cmnd *scmd; /* scsi command within request being aborte= d */ > @@ -4521,7 +4648,7 @@ static int hpsa_send_reset_as_abort_ioaccel2(st= ruct ctlr_info *h, > "Reset as abort: Resetting physical device at scsi3addr 0x%02x%02= x%02x%02x%02x%02x%02x%02x\n", > psa[0], psa[1], psa[2], psa[3], > psa[4], psa[5], psa[6], psa[7]); > - rc =3D hpsa_send_reset(h, psa, HPSA_RESET_TYPE_TARGET); > + rc =3D hpsa_send_reset(h, psa, HPSA_RESET_TYPE_TARGET, reply_queue)= ; > if (rc !=3D 0) { > dev_warn(&h->pdev->dev, > "Reset as abort: Failed on physical device at scsi3addr 0x%02x%02= x%02x%02x%02x%02x%02x%02x\n", > @@ -4555,7 +4682,7 @@ static int hpsa_send_reset_as_abort_ioaccel2(st= ruct ctlr_info *h, > * make this true someday become false. > */ > static int hpsa_send_abort_both_ways(struct ctlr_info *h, > - unsigned char *scsi3addr, struct CommandList *abort) > + unsigned char *scsi3addr, struct CommandList *abort, int reply_queu= e) > { > /* ioccelerator mode 2 commands should be aborted via the > * accelerated path, since RAID path is unaware of these commands, > @@ -4563,10 +4690,20 @@ static int hpsa_send_abort_both_ways(struct c= tlr_info *h, > * Change abort to physical device reset. > */ > if (abort->cmd_type =3D=3D CMD_IOACCEL2) > - return hpsa_send_reset_as_abort_ioaccel2(h, scsi3addr, abort); > + return hpsa_send_reset_as_abort_ioaccel2(h, scsi3addr, > + abort, reply_queue); > + > + return hpsa_send_abort(h, scsi3addr, abort, 0, reply_queue) && > + hpsa_send_abort(h, scsi3addr, abort, 1, reply_queue); > +} > =20 > - return hpsa_send_abort(h, scsi3addr, abort, 0) && > - hpsa_send_abort(h, scsi3addr, abort, 1); > +/* Find out which reply queue a command was meant to return on */ > +static int hpsa_extract_reply_queue(struct ctlr_info *h, > + struct CommandList *c) > +{ > + if (c->cmd_type =3D=3D CMD_IOACCEL2) > + return h->ioaccel2_cmd_pool[c->cmdindex].reply_queue; > + return c->Header.ReplyQueue; > } > =20 > /* Send an abort for the specified command. > @@ -4584,7 +4721,7 @@ static int hpsa_eh_abort_handler(struct scsi_cm= nd *sc) > char msg[256]; /* For debug messaging. */ > int ml =3D 0; > __le32 tagupper, taglower; > - int refcount; > + int refcount, reply_queue; > =20 > /* Find the controller of the command to be aborted */ > h =3D sdev_to_hba(sc->device); > @@ -4592,8 +4729,23 @@ static int hpsa_eh_abort_handler(struct scsi_c= mnd *sc) > "ABORT REQUEST FAILED, Controller lookup failed.\n")) > return FAILED; > =20 > - if (lockup_detected(h)) > + /* If controller locked up, we can guarantee command won't complete= */ > + if (lockup_detected(h)) { > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%llu scmd %p ABORT FAILED, lockup detected\n", > + h->scsi_host->host_no, sc->device->channel, > + sc->device->id, sc->device->lun, sc); > return FAILED; > + } > + > + /* This is a good time to check if controller lockup has occurred *= / > + if (detect_controller_lockup(h)) { > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%llu scmd %p ABORT FAILED, new lockup detected\n"= , > + h->scsi_host->host_no, sc->device->channel, > + sc->device->id, sc->device->lun, sc); > + return FAILED; > + } > =20 Same here. Cheers, Hannes --=20 Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N=C3=BCrnberg GF: F. Imend=C3=B6rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG N=C3=BCrnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html