Mid-layer handling of NOT_READY conditions...

* Mid-layer handling of NOT_READY conditions...
@ 2005-01-28 23:24 Andrew Vasquez
  2005-01-29  5:46 ` Andrew Vasquez
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Vasquez @ 2005-01-28 23:24 UTC (permalink / raw)
  To: linux-scsi; +Cc: andrew.vasquez

[PREFACE: Please forgive the rather long absence on linux-scsi, I've
been occupied by several non-related projects]

All,

While stripping out the remnants of internal queuing from the qla2xxx
driver and adding-in support for various fc_host/fc_remote constructs,
I've ran into a rather peculiar problem with respect to the way the SCSI
mid-layer handles NOT_READY conditions (notably ASC 0x04 and ASCQ 0x01).

I was doing simple short-duration cable-pulls when I noticed I/O errors
would occur at unexpected times as the storage returned to the topology.
The simplest case goes like this:

      * Issue I/O to device A
      * Device A falls off the topology 
      * Driver (qla2xxx) blocks additional requests to device A via
        fc_remote_port_block() 
      * Short time later (couple of seconds) device A returns to
        topology
      * Driver logs-into device and unblocks requests via
        fc_remote_port_unblock().
      * I/O resumes

The storage still unable to process the commands returns
check-conditions (please excuse the crude printk()s):

        *** check 1148/1/5 [1:0] sdev_st=2 status=2 [6/29/0].
        *** check 1149/1/5 [1:0] sdev_st=2 status=2 [2/4/1].
        scsi_decide_disposition: sc 0 RETRY incremented 2/5
        *** check 1150/2/5 [1:0] sdev_st=2 status=2 [2/4/1].
        scsi_decide_disposition: sc 0 RETRY incremented 3/5
        *** check 1151/3/5 [1:0] sdev_st=2 status=2 [2/4/1].
        scsi_decide_disposition: sc 0 RETRY incremented 4/5
        *** check 1152/4/5 [1:0] sdev_st=2 status=2 [2/4/1].

while scsi_decide_disposition() agrees to retry the commands since
cmd->retries < cmd->allowed.  But when NOT_READYs persists beyond
cmd->allowed, scsi_decide_disposition() returns SUCCESS:

        scsi_decide_disposition: sc 0 2 SUCCESS 6/5 [2/4/1]

and the command then begins additional processing via:

        scsi_finish_command()
          sd_rw_itr()
            scsi_io_completion()

at which point, the following check is made:

        ...
        /*
         * If the device is in the process of becoming ready,
         * retry.
         */
        if (sshdr.asc == 0x04 && sshdr.ascq == 0x01) {
                scsi_requeue_command(q, cmd);
                return;
        }

and the command is requeued to the request-q via blk_insert_request()
and started again with:

        q->request_fn()
          scsi_request_fn()
            scsi_dispatch_cmd()

There seems to be two problem with this approach:

     1. As the storage continues to return NOT_READY,
        scsi_decide_disposition() blindly increments cmd->retries and
        checks against cmd->allowed, returning SUCCESS (since at this
        point cmd->retries is always greater than cmd->allowed) -- I've
        seen this condition loop several hundred times while the
        NOT_READY condition clears.
     2. as a result of the (cmd->retries > cmd->allowed) state of the
        command, if a LLDD returns any status (other than DID_OK) which
        could initiate a retry, the command is immediately failed.  As
        an example, the qla2xxx driver returns DID_BUS_BUSY in case of
        any 'transport' related problems during the exchange (dropped
        frames, FCP protocal failures, etc.).

When the qla2xxx driver managed command queuing internally, a NOT_READY
status would cause the lun-queue to be frozen for some period time while
the storage settled-down.

Would this be an approach to consider?  Or should we tackle the problem
by addressing the quirky (cmd->retries > cmd->allowed) state?

Thanks,
Andrew Vasquez

^ permalink raw reply	[flat|nested] 15+ messages in thread