* [PATCH V3 0/4] scsi: ufs: Improve UFS error handling
@ 2013-07-09  9:16 Sujit Reddy Thumma
  2013-07-09  9:16 ` [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation Sujit Reddy Thumma
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-09  9:16 UTC (permalink / raw)
  To: Vinayak Holikatti, Santosh Y
  Cc: James E.J. Bottomley, linux-scsi, Sujit Reddy Thumma, linux-arm-msm

The first patch fixes several issues with the current task management
handling in the UFSHCD driver. The others improve error handling in
various scenarios.

These patches depend on:
[PATCH V3 1/2] scsi: ufs: Add support for sending NOP OUT UPIU
[PATCH V3 2/2] scsi: ufs: Set fDeviceInit flag to initiate device initialization
[PATCH V3 1/2] scsi: ufs: Add support for host assisted background operations
[PATCH V3 2/2] scsi: ufs: Add runtime PM support for UFS host controller driver

Changes from v2:
	- [PATCH V3 1/4]: Make the task management command task tag unique
	  across SCSI/NOP/QUERY request tags.
	- [PATCH V3 3/4]: While handling device/host reset, wait for
	  pending fatal handler to return if running.
Changes from v1:
	- [PATCH V2 1/4]: Fix a race condition because of overloading
	  outstanding_tasks variable to lock the slots. A new variable
	  tm_slots_in_use will track which slots are in use by the driver.
	- [PATCH V2 2/4]: Commit text update to clarify the hardware race
	  with more details.
	- [PATCH V2 3/4]: Minor cleanup and rebase
	- [PATCH V2 4/4]: Fix a bug - sleeping in atomic context

Sujit Reddy Thumma (4):
  scsi: ufs: Fix broken task management command implementation
  scsi: ufs: Fix hardware race conditions while aborting a command
  scsi: ufs: Fix device and host reset methods
  scsi: ufs: Improve UFS fatal error handling

 drivers/scsi/ufs/ufshcd.c | 1035 +++++++++++++++++++++++++++++++++++++--------
 drivers/scsi/ufs/ufshcd.h |   12 +-
 drivers/scsi/ufs/ufshci.h |   19 +-
 3 files changed, 873 insertions(+), 193 deletions(-)

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation.


* [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation
  2013-07-09  9:16 [PATCH V3 0/4] scsi: ufs: Improve UFS error handling Sujit Reddy Thumma
@ 2013-07-09  9:16 ` Sujit Reddy Thumma
  2013-07-09 10:42   ` merez
  2013-07-19 13:56   ` Seungwon Jeon
  2013-07-09  9:16 ` [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command Sujit Reddy Thumma
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-09  9:16 UTC (permalink / raw)
  To: Vinayak Holikatti, Santosh Y
  Cc: James E.J. Bottomley, linux-scsi, Sujit Reddy Thumma, linux-arm-msm

Currently, sending a Task Management (TM) command to the card might
be broken in some scenarios, as listed below:

Problem: If there are more than 8 TM commands, the implementation
         returns an error to the caller.
Fix:     Wait for one of the slots to be emptied and send the command.

Problem: Sometimes it is necessary for the caller to know the TM service
         response code to determine the task status.
Fix:     Propagate the service response to the caller.

Problem: If the TM command times out, no proper error recovery is
         implemented.
Fix:     Clear the command in the controller door-bell register, so that
         further commands for the same slot don't fail.

Problem: While preparing the TM command descriptor, the task tag used
         should be unique across SCSI/NOP/QUERY/TM commands and not the
         task tag of the command which the TM command is trying to manage.
Fix:     Use a unique task tag instead of the task tag of the SCSI command.

Problem: Since the TM command involves H/W communication, abruptly ending
         the request on a kill interrupt signal might cause H/W malfunction.
Fix:     Wait for the hardware completion interrupt with
         TASK_UNINTERRUPTIBLE set.
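
For readers who want to try the slot locking and tag numbering outside
the kernel, here is a minimal user-space sketch. It is not the driver
code: C11 atomics stand in for test_and_set_bit_lock() and
clear_bit_unlock(), the slot counts NUTRS and NUTMRS are assumed values,
and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NUTRS  32	/* assumed number of transfer request slots */
#define NUTMRS 8	/* assumed number of task management slots */

/* One bit per TM slot; a set bit means the slot is owned by a caller. */
atomic_ulong tm_slots_in_use;

/* Claim the lowest free slot. Returns false when every slot is taken. */
bool get_tm_free_slot(int *free_slot)
{
	if (!free_slot)
		return false;

	for (;;) {
		unsigned long old = atomic_load(&tm_slots_in_use);
		unsigned long bit;
		int tag;

		/* user-space stand-in for find_first_zero_bit() */
		for (tag = 0; tag < NUTMRS; tag++)
			if (!(old & (1UL << tag)))
				break;
		if (tag >= NUTMRS)
			return false;

		bit = 1UL << tag;
		/* atomic test-and-set: only one caller can win this bit */
		if (!(atomic_fetch_or(&tm_slots_in_use, bit) & bit)) {
			*free_slot = tag;
			return true;
		}
		/* another caller took the bit first, so scan again */
	}
}

/* Release a slot previously returned by get_tm_free_slot(). */
void put_tm_slot(int slot)
{
	assert(slot >= 0 && slot < NUTMRS);
	atomic_fetch_and(&tm_slots_in_use, ~(1UL << slot));
}

/* TM tags start after the transfer request tags, so they never collide. */
int tm_task_tag(int slot)
{
	return NUTRS + slot;
}
```

In the patch itself a caller that finds no free slot sleeps in
wait_event() until another caller releases one; the sketch only reports
failure and leaves the waiting policy to the caller.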

Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
---
 drivers/scsi/ufs/ufshcd.c |  177 ++++++++++++++++++++++++++++++---------------
 drivers/scsi/ufs/ufshcd.h |    8 ++-
 2 files changed, 126 insertions(+), 59 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index af7d01d..a176421 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -53,6 +53,9 @@
 /* Query request timeout */
 #define QUERY_REQ_TIMEOUT 30 /* msec */
 
+/* Task management command timeout */
+#define TM_CMD_TIMEOUT	100 /* msecs */
+
 /* Expose the flag value from utp_upiu_query.value */
 #define MASK_QUERY_UPIU_FLAG_LOC 0xFF
 
@@ -190,13 +193,35 @@ ufshcd_get_tmr_ocs(struct utp_task_req_desc *task_req_descp)
 /**
  * ufshcd_get_tm_free_slot - get a free slot for task management request
  * @hba: per adapter instance
+ * @free_slot: pointer to variable with available slot value
  *
- * Returns maximum number of task management request slots in case of
- * task management queue full or returns the free slot number
+ * Get a free tag and lock it until ufshcd_put_tm_slot() is called.
+ * Returns false if no free slot is available, else returns true with the
+ * tag value in @free_slot.
  */
-static inline int ufshcd_get_tm_free_slot(struct ufs_hba *hba)
+static bool ufshcd_get_tm_free_slot(struct ufs_hba *hba, int *free_slot)
+{
+	int tag;
+	bool ret = false;
+
+	if (!free_slot)
+		goto out;
+
+	do {
+		tag = find_first_zero_bit(&hba->tm_slots_in_use, hba->nutmrs);
+		if (tag >= hba->nutmrs)
+			goto out;
+	} while (test_and_set_bit_lock(tag, &hba->tm_slots_in_use));
+
+	*free_slot = tag;
+	ret = true;
+out:
+	return ret;
+}
+
+static inline void ufshcd_put_tm_slot(struct ufs_hba *hba, int slot)
 {
-	return find_first_zero_bit(&hba->outstanding_tasks, hba->nutmrs);
+	clear_bit_unlock(slot, &hba->tm_slots_in_use);
 }
 
 /**
@@ -1778,10 +1803,11 @@ static void ufshcd_slave_destroy(struct scsi_device *sdev)
  * ufshcd_task_req_compl - handle task management request completion
  * @hba: per adapter instance
  * @index: index of the completed request
+ * @resp: task management service response
  *
- * Returns SUCCESS/FAILED
+ * Returns non-zero value on error, zero on success
  */
-static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
+static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index, u8 *resp)
 {
 	struct utp_task_req_desc *task_req_descp;
 	struct utp_upiu_task_rsp *task_rsp_upiup;
@@ -1802,19 +1828,15 @@ static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
 				task_req_descp[index].task_rsp_upiu;
 		task_result = be32_to_cpu(task_rsp_upiup->header.dword_1);
 		task_result = ((task_result & MASK_TASK_RESPONSE) >> 8);
-
-		if (task_result != UPIU_TASK_MANAGEMENT_FUNC_COMPL &&
-		    task_result != UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED)
-			task_result = FAILED;
-		else
-			task_result = SUCCESS;
+		if (resp)
+			*resp = (u8)task_result;
 	} else {
-		task_result = FAILED;
-		dev_err(hba->dev,
-			"trc: Invalid ocs = %x\n", ocs_value);
+		dev_err(hba->dev, "%s: failed, ocs = 0x%x\n",
+				__func__, ocs_value);
 	}
 	spin_unlock_irqrestore(hba->host->host_lock, flags);
-	return task_result;
+
+	return ocs_value;
 }
 
 /**
@@ -2298,7 +2320,7 @@ static void ufshcd_tmc_handler(struct ufs_hba *hba)
 
 	tm_doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
 	hba->tm_condition = tm_doorbell ^ hba->outstanding_tasks;
-	wake_up_interruptible(&hba->ufshcd_tm_wait_queue);
+	wake_up(&hba->tm_wq);
 }
 
 /**
@@ -2348,38 +2370,61 @@ static irqreturn_t ufshcd_intr(int irq, void *__hba)
 	return retval;
 }
 
+static int ufshcd_clear_tm_cmd(struct ufs_hba *hba, int tag)
+{
+	int err = 0;
+	u32 reg;
+	u32 mask = 1 << tag;
+	unsigned long flags;
+
+	if (!test_bit(tag, &hba->outstanding_tasks))
+		goto out;
+
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	ufshcd_writel(hba, ~(1 << tag), REG_UTP_TASK_REQ_LIST_CLEAR);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+
+	/* poll for max. 1 sec to clear door bell register by h/w */
+	reg = ufshcd_wait_for_register(hba,
+			REG_UTP_TASK_REQ_DOOR_BELL,
+			mask, 0, 1000, 1000);
+	if ((reg & mask) == mask)
+		err = -ETIMEDOUT;
+out:
+	return err;
+}
+
 /**
  * ufshcd_issue_tm_cmd - issues task management commands to controller
  * @hba: per adapter instance
- * @lrbp: pointer to local reference block
+ * @lun_id: LUN ID to which TM command is sent
+ * @task_id: task ID to which the TM command is applicable
+ * @tm_function: task management function opcode
+ * @tm_response: task management service response return value
  *
- * Returns SUCCESS/FAILED
+ * Returns non-zero value on error, zero on success.
  */
-static int
-ufshcd_issue_tm_cmd(struct ufs_hba *hba,
-		    struct ufshcd_lrb *lrbp,
-		    u8 tm_function)
+static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
+		u8 tm_function, u8 *tm_response)
 {
 	struct utp_task_req_desc *task_req_descp;
 	struct utp_upiu_task_req *task_req_upiup;
 	struct Scsi_Host *host;
 	unsigned long flags;
-	int free_slot = 0;
+	int free_slot;
 	int err;
+	int task_tag;
 
 	host = hba->host;
 
-	spin_lock_irqsave(host->host_lock, flags);
-
-	/* If task management queue is full */
-	free_slot = ufshcd_get_tm_free_slot(hba);
-	if (free_slot >= hba->nutmrs) {
-		spin_unlock_irqrestore(host->host_lock, flags);
-		dev_err(hba->dev, "Task management queue full\n");
-		err = FAILED;
-		goto out;
-	}
+	/*
+	 * Get free slot, sleep if slots are unavailable.
+	 * Even though we use wait_event() which sleeps indefinitely,
+	 * the maximum wait time is bounded by %TM_CMD_TIMEOUT.
+	 */
+	wait_event(hba->tm_tag_wq, ufshcd_get_tm_free_slot(hba, &free_slot));
 
+	spin_lock_irqsave(host->host_lock, flags);
 	task_req_descp = hba->utmrdl_base_addr;
 	task_req_descp += free_slot;
 
@@ -2391,18 +2436,15 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
 	/* Configure task request UPIU */
 	task_req_upiup =
 		(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
+	task_tag = hba->nutrs + free_slot;
 	task_req_upiup->header.dword_0 =
 		UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
-					      lrbp->lun, lrbp->task_tag);
+				lun_id, task_tag);
 	task_req_upiup->header.dword_1 =
 		UPIU_HEADER_DWORD(0, tm_function, 0, 0);
 
-	task_req_upiup->input_param1 = lrbp->lun;
-	task_req_upiup->input_param1 =
-		cpu_to_be32(task_req_upiup->input_param1);
-	task_req_upiup->input_param2 = lrbp->task_tag;
-	task_req_upiup->input_param2 =
-		cpu_to_be32(task_req_upiup->input_param2);
+	task_req_upiup->input_param1 = cpu_to_be32(lun_id);
+	task_req_upiup->input_param2 = cpu_to_be32(task_id);
 
 	/* send command to the controller */
 	__set_bit(free_slot, &hba->outstanding_tasks);
@@ -2411,20 +2453,24 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
 	spin_unlock_irqrestore(host->host_lock, flags);
 
 	/* wait until the task management command is completed */
-	err =
-	wait_event_interruptible_timeout(hba->ufshcd_tm_wait_queue,
-					 (test_bit(free_slot,
-					 &hba->tm_condition) != 0),
-					 60 * HZ);
+	err = wait_event_timeout(hba->tm_wq,
+			test_bit(free_slot, &hba->tm_condition),
+			msecs_to_jiffies(TM_CMD_TIMEOUT));
 	if (!err) {
-		dev_err(hba->dev,
-			"Task management command timed-out\n");
-		err = FAILED;
-		goto out;
+		dev_err(hba->dev, "%s: task management cmd 0x%.2x timed-out\n",
+				__func__, tm_function);
+		if (ufshcd_clear_tm_cmd(hba, free_slot))
+			dev_WARN(hba->dev, "%s: unable to clear tm cmd (slot %d) after timeout\n",
+					__func__, free_slot);
+		err = -ETIMEDOUT;
+	} else {
+		err = ufshcd_task_req_compl(hba, free_slot, tm_response);
 	}
+
 	clear_bit(free_slot, &hba->tm_condition);
-	err = ufshcd_task_req_compl(hba, free_slot);
-out:
+	ufshcd_put_tm_slot(hba, free_slot);
+	wake_up(&hba->tm_tag_wq);
+
 	return err;
 }
 
@@ -2441,14 +2487,22 @@ static int ufshcd_device_reset(struct scsi_cmnd *cmd)
 	unsigned int tag;
 	u32 pos;
 	int err;
+	u8 resp;
+	struct ufshcd_lrb *lrbp;
 
 	host = cmd->device->host;
 	hba = shost_priv(host);
 	tag = cmd->request->tag;
 
-	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_LOGICAL_RESET);
-	if (err == FAILED)
+	lrbp = &hba->lrb[tag];
+	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
+			UFS_LOGICAL_RESET, &resp);
+	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
+		err = FAILED;
 		goto out;
+	} else {
+		err = SUCCESS;
+	}
 
 	for (pos = 0; pos < hba->nutrs; pos++) {
 		if (test_bit(pos, &hba->outstanding_reqs) &&
@@ -2505,6 +2559,8 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
 	unsigned long flags;
 	unsigned int tag;
 	int err;
+	u8 resp;
+	struct ufshcd_lrb *lrbp;
 
 	host = cmd->device->host;
 	hba = shost_priv(host);
@@ -2520,9 +2576,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
 	}
 	spin_unlock_irqrestore(host->host_lock, flags);
 
-	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_ABORT_TASK);
-	if (err == FAILED)
+	lrbp = &hba->lrb[tag];
+	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
+			UFS_ABORT_TASK, &resp);
+	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
+		err = FAILED;
 		goto out;
+	} else {
+		err = SUCCESS;
+	}
 
 	scsi_dma_unmap(cmd);
 
@@ -2744,7 +2806,8 @@ int ufshcd_init(struct device *dev, struct ufs_hba **hba_handle,
 	host->max_cmd_len = MAX_CDB_SIZE;
 
 	/* Initailize wait queue for task management */
-	init_waitqueue_head(&hba->ufshcd_tm_wait_queue);
+	init_waitqueue_head(&hba->tm_wq);
+	init_waitqueue_head(&hba->tm_tag_wq);
 
 	/* Initialize work queues */
 	INIT_WORK(&hba->feh_workq, ufshcd_fatal_err_handler);
diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
index 6c9bd35..5d4542c 100644
--- a/drivers/scsi/ufs/ufshcd.h
+++ b/drivers/scsi/ufs/ufshcd.h
@@ -174,8 +174,10 @@ struct ufs_dev_cmd {
  * @irq: Irq number of the controller
  * @active_uic_cmd: handle of active UIC command
  * @uic_cmd_mutex: mutex for uic command
- * @ufshcd_tm_wait_queue: wait queue for task management
+ * @tm_wq: wait queue for task management
+ * @tm_tag_wq: wait queue for free task management slots
  * @tm_condition: condition variable for task management
+ * @tm_slots_in_use: bit map of task management request slots in use
  * @ufshcd_state: UFSHCD states
  * @intr_mask: Interrupt Mask Bits
  * @ee_ctrl_mask: Exception event control mask
@@ -216,8 +218,10 @@ struct ufs_hba {
 	struct uic_command *active_uic_cmd;
 	struct mutex uic_cmd_mutex;
 
-	wait_queue_head_t ufshcd_tm_wait_queue;
+	wait_queue_head_t tm_wq;
+	wait_queue_head_t tm_tag_wq;
 	unsigned long tm_condition;
+	unsigned long tm_slots_in_use;
 
 	u32 ufshcd_state;
 	u32 intr_mask;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation.


* [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command
  2013-07-09  9:16 [PATCH V3 0/4] scsi: ufs: Improve UFS error handling Sujit Reddy Thumma
  2013-07-09  9:16 ` [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation Sujit Reddy Thumma
@ 2013-07-09  9:16 ` Sujit Reddy Thumma
  2013-07-09 10:42   ` merez
  2013-07-19 13:56   ` Seungwon Jeon
  2013-07-09  9:16 ` [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods Sujit Reddy Thumma
  2013-07-09  9:16 ` [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling Sujit Reddy Thumma
  3 siblings, 2 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-09  9:16 UTC (permalink / raw)
  To: Vinayak Holikatti, Santosh Y
  Cc: James E.J. Bottomley, linux-scsi, Sujit Reddy Thumma, linux-arm-msm

There is a possible race condition in the hardware when an abort
command is issued to terminate an ongoing SCSI command, as described
below:

- A bit in the door-bell register is set in the controller for a
  new SCSI command.
- In some rare situations, before the controller gets a chance to issue
  the command to the device, the software issues an abort command.
- If the device receives the abort command first, it returns success
  because the command itself is not present.
- If the controller now commits the command to the device, it will be
  processed.
- Software thinks that the command is aborted and proceeds while the
  device is still processing it.
- The software, controller and device may go out of sync because of
  this race condition.

To avoid this, query task presence in the device before sending the abort
task command so that after the abort operation, the command is guaranteed
to be non-existent in both the controller and the device.
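
The query-then-abort sequence can be modelled in a few lines of
user-space C. This is only a sketch under stated assumptions: struct sim
and its fields are hypothetical, the response values mirror what I
believe the UFS headers define, and the "device" is a counter that
commits the command after a fixed number of queries.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed service response codes for the sketch. */
enum { TM_FUNC_COMPL = 0x00, TM_FUNC_SUCCEEDED = 0x08 };

/* Hypothetical model of one command tag. */
struct sim {
	bool in_doorbell;	/* door-bell bit is set in the controller */
	bool in_device;		/* device has received the command */
	int commit_after;	/* queries left until controller commits it */
};

/* QUERY TASK: SUCCEEDED means the device currently holds the command. */
int query_task(struct sim *s)
{
	if (s->in_doorbell && !s->in_device && s->commit_after > 0 &&
	    --s->commit_after == 0)
		s->in_device = true;

	return s->in_device ? TM_FUNC_SUCCEEDED : TM_FUNC_COMPL;
}

/* Returns 0 once the command is gone from both controller and device. */
int abort_cmd(struct sim *s, int max_polls)
{
	int poll;

	for (poll = max_polls; poll; poll--) {
		if (query_task(s) == TM_FUNC_SUCCEEDED)
			break;		/* pending in the device: safe to abort */
		if (!s->in_doorbell)
			return 0;	/* command completed already */
		/* door-bell set but device has no command: in transition */
	}
	if (!poll)
		return -EBUSY;

	/* stands in for UFS_ABORT_TASK followed by clearing the door-bell */
	s->in_device = false;
	s->in_doorbell = false;
	return 0;
}
```

Aborting only after the query reports the command as pending is what
closes the window in which the device acknowledges an abort for a
command it has not yet seen.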

Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
---
 drivers/scsi/ufs/ufshcd.c |   70 +++++++++++++++++++++++++++++++++++---------
 1 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index a176421..51ce096 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -2550,6 +2550,12 @@ static int ufshcd_host_reset(struct scsi_cmnd *cmd)
  * ufshcd_abort - abort a specific command
  * @cmd: SCSI command pointer
  *
+ * Abort the pending command in device by sending UFS_ABORT_TASK task management
+ * command, and in host controller by clearing the door-bell register. There can
+ * be a race between the controller sending the command to the device and the
+ * abort being issued. To avoid that, first issue UFS_QUERY_TASK to check if the
+ * command is really issued and then try to abort it.
+ *
  * Returns SUCCESS/FAILED
  */
 static int ufshcd_abort(struct scsi_cmnd *cmd)
@@ -2558,7 +2564,8 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
 	struct ufs_hba *hba;
 	unsigned long flags;
 	unsigned int tag;
-	int err;
+	int err = 0;
+	int poll_cnt;
 	u8 resp;
 	struct ufshcd_lrb *lrbp;
 
@@ -2566,33 +2573,59 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
 	hba = shost_priv(host);
 	tag = cmd->request->tag;
 
-	spin_lock_irqsave(host->host_lock, flags);
+	/* If command is already aborted/completed, return SUCCESS */
+	if (!(test_bit(tag, &hba->outstanding_reqs)))
+		goto out;
 
-	/* check if command is still pending */
-	if (!(test_bit(tag, &hba->outstanding_reqs))) {
-		err = FAILED;
-		spin_unlock_irqrestore(host->host_lock, flags);
+	lrbp = &hba->lrb[tag];
+	for (poll_cnt = 100; poll_cnt; poll_cnt--) {
+		err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
+				UFS_QUERY_TASK, &resp);
+		if (!err && resp == UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED) {
+			/* cmd pending in the device */
+			break;
+		} else if (!err && resp == UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
+			u32 reg;
+
+			/*
+			 * cmd not pending in the device, check if it is
+			 * in transition.
+			 */
+			reg = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
+			if (reg & (1 << tag)) {
+				/* sleep for max. 2ms to stabilize */
+				usleep_range(1000, 2000);
+				continue;
+			}
+			/* command completed already */
+			goto out;
+		} else {
+			if (!err)
+				err = resp; /* service response error */
+			goto out;
+		}
+	}
+
+	if (!poll_cnt) {
+		err = -EBUSY;
 		goto out;
 	}
-	spin_unlock_irqrestore(host->host_lock, flags);
 
-	lrbp = &hba->lrb[tag];
 	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
 			UFS_ABORT_TASK, &resp);
 	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
-		err = FAILED;
+		if (!err)
+			err = resp; /* service response error */
 		goto out;
-	} else {
-		err = SUCCESS;
 	}
 
+	err = ufshcd_clear_cmd(hba, tag);
+	if (err)
+		goto out;
+
 	scsi_dma_unmap(cmd);
 
 	spin_lock_irqsave(host->host_lock, flags);
-
-	/* clear the respective UTRLCLR register bit */
-	ufshcd_utrl_clear(hba, tag);
-
 	__clear_bit(tag, &hba->outstanding_reqs);
 	hba->lrb[tag].cmd = NULL;
 	spin_unlock_irqrestore(host->host_lock, flags);
@@ -2600,6 +2633,13 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
 	clear_bit_unlock(tag, &hba->lrb_in_use);
 	wake_up(&hba->dev_cmd.tag_wq);
 out:
+	if (!err) {
+		err = SUCCESS;
+	} else {
+		dev_err(hba->dev, "%s: failed with err %d\n", __func__, err);
+		err = FAILED;
+	}
+
 	return err;
 }
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation.



* [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods
  2013-07-09  9:16 [PATCH V3 0/4] scsi: ufs: Improve UFS error handling Sujit Reddy Thumma
  2013-07-09  9:16 ` [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation Sujit Reddy Thumma
  2013-07-09  9:16 ` [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command Sujit Reddy Thumma
@ 2013-07-09  9:16 ` Sujit Reddy Thumma
  2013-07-09 10:43   ` merez
  2013-07-19 13:57   ` Seungwon Jeon
  2013-07-09  9:16 ` [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling Sujit Reddy Thumma
  3 siblings, 2 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-09  9:16 UTC (permalink / raw)
  To: Vinayak Holikatti, Santosh Y
  Cc: James E.J. Bottomley, linux-scsi, Sujit Reddy Thumma, linux-arm-msm

As of now, SCSI-initiated error handling is broken because the reset
APIs don't try to bring the device back to an initialized state, ready
for further transfers.

In case of timeouts, the SCSI error handler takes care of handling aborts
and resets. Improve the error handling in such scenarios by resetting the
device and host and re-initializing them in a proper manner.
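
The diff in this patch escalates from a device reset to a host reset
when the device reset fails. A stand-alone sketch of that flag handling
follows; struct hba_sim, its fields and the short flag names are
hypothetical stand-ins, and the two reset helpers only record that they
ran instead of touching hardware.

```c
#include <errno.h>
#include <stdbool.h>

/* Pending-reset flags, shaped like the eh_flags bits in the diff. */
enum {
	EH_HOST_RESET_PENDING   = 1 << 0,
	EH_DEVICE_RESET_PENDING = 1 << 1,
};

/* Hypothetical model of the adapter state needed for the sketch. */
struct hba_sim {
	unsigned int eh_flags;
	bool device_reset_ok;	/* simulated outcome of a device reset */
	bool host_reset_ok;	/* simulated outcome of a host reset */
	int device_resets;	/* how many device resets were attempted */
	int host_resets;	/* how many host resets were attempted */
};

int device_reset_and_restore(struct hba_sim *h)
{
	h->device_resets++;
	if (!h->device_reset_ok)
		return -EIO;
	h->eh_flags &= ~EH_DEVICE_RESET_PENDING;
	return 0;
}

int host_reset_and_restore(struct hba_sim *h)
{
	h->host_resets++;
	if (!h->host_reset_ok)
		return -EIO;
	h->eh_flags &= ~EH_HOST_RESET_PENDING;
	return 0;
}

/* Try the cheaper device reset first, fall back to a host reset. */
int reset_and_restore(struct hba_sim *h)
{
	int err = 0;

	if ((h->eh_flags & EH_DEVICE_RESET_PENDING) &&
	    !(h->eh_flags & EH_HOST_RESET_PENDING)) {
		err = device_reset_and_restore(h);
		if (err) {
			/* escalate: replace the device reset by a host reset */
			h->eh_flags &= ~EH_DEVICE_RESET_PENDING;
			h->eh_flags |= EH_HOST_RESET_PENDING;
		}
	}

	if (h->eh_flags & EH_HOST_RESET_PENDING)
		err = host_reset_and_restore(h);

	return err;
}
```

When both resets fail, the host-reset flag stays set, so a later attempt
goes straight to the host reset.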

Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
---
 drivers/scsi/ufs/ufshcd.c |  467 +++++++++++++++++++++++++++++++++++++++------
 drivers/scsi/ufs/ufshcd.h |    2 +
 2 files changed, 411 insertions(+), 58 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 51ce096..b4c9910 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -69,9 +69,15 @@ enum {
 
 /* UFSHCD states */
 enum {
-	UFSHCD_STATE_OPERATIONAL,
 	UFSHCD_STATE_RESET,
 	UFSHCD_STATE_ERROR,
+	UFSHCD_STATE_OPERATIONAL,
+};
+
+/* UFSHCD error handling flags */
+enum {
+	UFSHCD_EH_HOST_RESET_PENDING = (1 << 0),
+	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
 };
 
 /* Interrupt configuration options */
@@ -87,6 +93,22 @@ enum {
 	INT_AGGR_CONFIG,
 };
 
+#define ufshcd_set_device_reset_pending(h) \
+	(h->eh_flags |= UFSHCD_EH_DEVICE_RESET_PENDING)
+#define ufshcd_set_host_reset_pending(h) \
+	(h->eh_flags |= UFSHCD_EH_HOST_RESET_PENDING)
+#define ufshcd_device_reset_pending(h) \
+	(h->eh_flags & UFSHCD_EH_DEVICE_RESET_PENDING)
+#define ufshcd_host_reset_pending(h) \
+	(h->eh_flags & UFSHCD_EH_HOST_RESET_PENDING)
+#define ufshcd_clear_device_reset_pending(h) \
+	(h->eh_flags &= ~UFSHCD_EH_DEVICE_RESET_PENDING)
+#define ufshcd_clear_host_reset_pending(h) \
+	(h->eh_flags &= ~UFSHCD_EH_HOST_RESET_PENDING)
+
+static void ufshcd_tmc_handler(struct ufs_hba *hba);
+static void ufshcd_async_scan(void *data, async_cookie_t cookie);
+
 /*
  * ufshcd_wait_for_register - wait for register value to change
  * @hba - per-adapter interface
@@ -851,9 +873,22 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
 
 	tag = cmd->request->tag;
 
-	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL) {
+	switch (hba->ufshcd_state) {
+	case UFSHCD_STATE_OPERATIONAL:
+		break;
+	case UFSHCD_STATE_RESET:
 		err = SCSI_MLQUEUE_HOST_BUSY;
 		goto out;
+	case UFSHCD_STATE_ERROR:
+		set_host_byte(cmd, DID_ERROR);
+		cmd->scsi_done(cmd);
+		goto out;
+	default:
+		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
+				__func__, hba->ufshcd_state);
+		set_host_byte(cmd, DID_BAD_TARGET);
+		cmd->scsi_done(cmd);
+		goto out;
 	}
 
 	/* acquire the tag to make sure device cmds don't use it */
@@ -1573,8 +1608,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
 	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
 		scsi_unblock_requests(hba->host);
 
-	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
-
 out:
 	return err;
 }
@@ -2273,6 +2306,106 @@ out:
 }
 
 /**
+ * ufshcd_utrl_is_rsr_enabled - check if run-stop register is enabled
+ * @hba: per-adapter instance
+ */
+static bool ufshcd_utrl_is_rsr_enabled(struct ufs_hba *hba)
+{
+	return ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_LIST_RUN_STOP) & 0x1;
+}
+
+/**
+ * ufshcd_utmrl_is_rsr_enabled - check if run-stop register is enabled
+ * @hba: per-adapter instance
+ */
+static bool ufshcd_utmrl_is_rsr_enabled(struct ufs_hba *hba)
+{
+	return ufshcd_readl(hba, REG_UTP_TASK_REQ_LIST_RUN_STOP) & 0x1;
+}
+
+/**
+ * ufshcd_complete_pending_tasks - complete outstanding tasks
+ * @hba: per adapter instance
+ *
+ * Abort in-progress task management commands and wakeup
+ * waiting threads.
+ *
+ * Returns non-zero error value when failed to clear all the commands.
+ */
+static int ufshcd_complete_pending_tasks(struct ufs_hba *hba)
+{
+	u32 reg;
+	int err = 0;
+	unsigned long flags;
+
+	if (!hba->outstanding_tasks)
+		goto out;
+
+	/* Clear UTMRL only when run-stop is enabled */
+	if (ufshcd_utmrl_is_rsr_enabled(hba))
+		ufshcd_writel(hba, ~hba->outstanding_tasks,
+				REG_UTP_TASK_REQ_LIST_CLEAR);
+
+	/* poll for max. 1 sec to clear door bell register by h/w */
+	reg = ufshcd_wait_for_register(hba,
+			REG_UTP_TASK_REQ_DOOR_BELL,
+			hba->outstanding_tasks, 0, 1000, 1000);
+	if (reg & hba->outstanding_tasks)
+		err = -ETIMEDOUT;
+
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	/* complete commands that were cleared out */
+	ufshcd_tmc_handler(hba);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+out:
+	if (err)
+		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
+				__func__, reg);
+	return err;
+}
+
+/**
+ * ufshcd_complete_pending_reqs - complete outstanding requests
+ * @hba: per adapter instance
+ *
+ * Abort in-progress transfer request commands and return them to SCSI.
+ *
+ * Returns non-zero error value when failed to clear all the commands.
+ */
+static int ufshcd_complete_pending_reqs(struct ufs_hba *hba)
+{
+	u32 reg;
+	int err = 0;
+	unsigned long flags;
+
+	/* check if we completed all of them */
+	if (!hba->outstanding_reqs)
+		goto out;
+
+	/* Clear UTRL only when run-stop is enabled */
+	if (ufshcd_utrl_is_rsr_enabled(hba))
+		ufshcd_writel(hba, ~hba->outstanding_reqs,
+				REG_UTP_TRANSFER_REQ_LIST_CLEAR);
+
+	/* poll for max. 1 sec to clear door bell register by h/w */
+	reg = ufshcd_wait_for_register(hba,
+			REG_UTP_TRANSFER_REQ_DOOR_BELL,
+			hba->outstanding_reqs, 0, 1000, 1000);
+	if (reg & hba->outstanding_reqs)
+		err = -ETIMEDOUT;
+
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	/* complete commands that were cleared out */
+	ufshcd_transfer_req_compl(hba);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+out:
+	if (err)
+		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
+				__func__, reg);
+	return err;
+}
+
+/**
  * ufshcd_fatal_err_handler - handle fatal errors
  * @hba: per adapter instance
  */
@@ -2306,8 +2439,12 @@ static void ufshcd_err_handler(struct ufs_hba *hba)
 	}
 	return;
 fatal_eh:
-	hba->ufshcd_state = UFSHCD_STATE_ERROR;
-	schedule_work(&hba->feh_workq);
+	/* handle fatal errors only when link is functional */
+	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
+		/* block commands at driver layer until error is handled */
+		hba->ufshcd_state = UFSHCD_STATE_ERROR;
+		schedule_work(&hba->feh_workq);
+	}
 }
 
 /**
@@ -2475,75 +2612,155 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
 }
 
 /**
- * ufshcd_device_reset - reset device and abort all the pending commands
- * @cmd: SCSI command pointer
+ * ufshcd_dme_end_point_reset - Notify device Unipro to perform reset
+ * @hba: per adapter instance
  *
- * Returns SUCCESS/FAILED
+ * UIC_CMD_DME_END_PT_RST resets the UFS device completely, the UFS flags,
+ * attributes and descriptors are reset to default state. Callers are
+ * expected to initialize the whole device again after this.
+ *
+ * Returns zero on success, non-zero on failure
  */
-static int ufshcd_device_reset(struct scsi_cmnd *cmd)
+static int ufshcd_dme_end_point_reset(struct ufs_hba *hba)
 {
-	struct Scsi_Host *host;
-	struct ufs_hba *hba;
-	unsigned int tag;
-	u32 pos;
-	int err;
-	u8 resp;
-	struct ufshcd_lrb *lrbp;
+	struct uic_command uic_cmd = {0};
+	int ret;
 
-	host = cmd->device->host;
-	hba = shost_priv(host);
-	tag = cmd->request->tag;
+	uic_cmd.command = UIC_CMD_DME_END_PT_RST;
 
-	lrbp = &hba->lrb[tag];
-	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
-			UFS_LOGICAL_RESET, &resp);
-	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
-		err = FAILED;
+	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
+	if (ret)
+		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
+
+	return ret;
+}
+
+/**
+ * ufshcd_dme_reset - Local UniPro reset
+ * @hba: per adapter instance
+ *
+ * Returns zero on success, non-zero on failure
+ */
+static int ufshcd_dme_reset(struct ufs_hba *hba)
+{
+	struct uic_command uic_cmd = {0};
+	int ret;
+
+	uic_cmd.command = UIC_CMD_DME_RESET;
+
+	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
+	if (ret)
+		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
+
+	return ret;
+
+}
+
+/**
+ * ufshcd_dme_enable - Local UniPro DME Enable
+ * @hba: per adapter instance
+ *
+ * Returns zero on success, non-zero on failure
+ */
+static int ufshcd_dme_enable(struct ufs_hba *hba)
+{
+	struct uic_command uic_cmd = {0};
+	int ret;
+	uic_cmd.command = UIC_CMD_DME_ENABLE;
+
+	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
+	if (ret)
+		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
+
+	return ret;
+
+}
+
+/**
+ * ufshcd_device_reset_and_restore - reset and restore device
+ * @hba: per-adapter instance
+ *
+ * Note that the device reset issues DME_END_POINT_RESET which
+ * may reset entire device and restore device attributes to
+ * default state.
+ *
+ * Returns zero on success, non-zero on failure
+ */
+static int ufshcd_device_reset_and_restore(struct ufs_hba *hba)
+{
+	int err = 0;
+	u32 reg;
+
+	err = ufshcd_dme_end_point_reset(hba);
+	if (err)
+		goto out;
+
+	/* restore communication with the device */
+	err = ufshcd_dme_reset(hba);
+	if (err)
 		goto out;
-	} else {
-		err = SUCCESS;
-	}
 
-	for (pos = 0; pos < hba->nutrs; pos++) {
-		if (test_bit(pos, &hba->outstanding_reqs) &&
-		    (hba->lrb[tag].lun == hba->lrb[pos].lun)) {
+	err = ufshcd_dme_enable(hba);
+	if (err)
+		goto out;
 
-			/* clear the respective UTRLCLR register bit */
-			ufshcd_utrl_clear(hba, pos);
+	err = ufshcd_dme_link_startup(hba);
+	if (err)
+		goto out;
 
-			clear_bit(pos, &hba->outstanding_reqs);
+	/* check if link is up and device is detected */
+	reg = ufshcd_readl(hba, REG_CONTROLLER_STATUS);
+	if (!ufshcd_is_device_present(reg)) {
+		dev_err(hba->dev, "Device not present\n");
+		err = -ENXIO;
+		goto out;
+	}
 
-			if (hba->lrb[pos].cmd) {
-				scsi_dma_unmap(hba->lrb[pos].cmd);
-				hba->lrb[pos].cmd->result =
-					DID_ABORT << 16;
-				hba->lrb[pos].cmd->scsi_done(cmd);
-				hba->lrb[pos].cmd = NULL;
-				clear_bit_unlock(pos, &hba->lrb_in_use);
-				wake_up(&hba->dev_cmd.tag_wq);
-			}
-		}
-	} /* end of for */
+	ufshcd_clear_device_reset_pending(hba);
 out:
+	dev_dbg(hba->dev, "%s: done err = %d\n", __func__, err);
 	return err;
 }
 
 /**
- * ufshcd_host_reset - Main reset function registered with scsi layer
- * @cmd: SCSI command pointer
+ * ufshcd_host_reset_and_restore - reset and restore host controller
+ * @hba: per-adapter instance
  *
- * Returns SUCCESS/FAILED
+ * Note that host controller reset may issue DME_RESET to
+ * local and remote (device) Uni-Pro stack and the attributes
+ * are reset to default state.
+ *
+ * Returns zero on success, non-zero on failure
  */
-static int ufshcd_host_reset(struct scsi_cmnd *cmd)
+static int ufshcd_host_reset_and_restore(struct ufs_hba *hba)
 {
-	struct ufs_hba *hba;
+	int err;
+	async_cookie_t cookie;
+	unsigned long flags;
 
-	hba = shost_priv(cmd->device->host);
+	/* Reset the host controller */
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	ufshcd_hba_stop(hba);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
 
-	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
-		return SUCCESS;
+	err = ufshcd_hba_enable(hba);
+	if (err)
+		goto out;
 
-	return ufshcd_do_reset(hba);
+	/* Establish the link again and restore the device */
+	cookie = async_schedule(ufshcd_async_scan, hba);
+	/* wait for async scan to be completed */
+	async_synchronize_cookie(++cookie);
+	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL)
+		err = -EIO;
+out:
+	if (err)
+		dev_err(hba->dev, "%s: Host init failed %d\n", __func__, err);
+	else
+		ufshcd_clear_host_reset_pending(hba);
+
+	dev_dbg(hba->dev, "%s: done err = %d\n", __func__, err);
+	return err;
 }
 
 /**
@@ -2644,6 +2861,134 @@ out:
 }
 
 /**
+ * ufshcd_reset_and_restore - resets device or host or both
+ * @hba: per-adapter instance
+ *
+ * Reset and recover device, host and re-establish link. This
+ * is helpful to recover the communication in fatal error conditions.
+ *
+ * Returns zero on success, non-zero on failure
+ */
+static int ufshcd_reset_and_restore(struct ufs_hba *hba)
+{
+	int err = 0;
+
+	if (ufshcd_device_reset_pending(hba) &&
+			!ufshcd_host_reset_pending(hba)) {
+		err = ufshcd_device_reset_and_restore(hba);
+		if (err) {
+			ufshcd_clear_device_reset_pending(hba);
+			ufshcd_set_host_reset_pending(hba);
+		}
+	}
+
+	if (ufshcd_host_reset_pending(hba))
+		err = ufshcd_host_reset_and_restore(hba);
+
+	/*
+	 * Due to reset the door-bell might be cleared, clear
+	 * outstanding requests in s/w here.
+	 */
+	ufshcd_complete_pending_reqs(hba);
+	ufshcd_complete_pending_tasks(hba);
+
+	return err;
+}
+
+/**
+ * ufshcd_eh_device_reset_handler - device reset handler registered to
+ *                                    scsi layer.
+ * @cmd: SCSI command pointer
+ *
+ * Returns SUCCESS/FAILED
+ */
+static int ufshcd_eh_device_reset_handler(struct scsi_cmnd *cmd)
+{
+	struct ufs_hba *hba;
+	int err;
+	unsigned long flags;
+
+	hba = shost_priv(cmd->device->host);
+
+	/*
+	 * Check if there is any race with fatal error handling.
+	 * If so, wait for it to complete. Even though fatal error
+	 * handling does reset and restore in some cases, don't assume
+	 * anything out of it. We are just avoiding race here.
+	 */
+	do {
+		spin_lock_irqsave(hba->host->host_lock, flags);
+		if (!(work_pending(&hba->feh_workq) ||
+				hba->ufshcd_state == UFSHCD_STATE_RESET))
+			break;
+		spin_unlock_irqrestore(hba->host->host_lock, flags);
+		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
+		flush_work_sync(&hba->feh_workq);
+	} while (1);
+
+	hba->ufshcd_state = UFSHCD_STATE_RESET;
+	ufshcd_set_device_reset_pending(hba);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+
+	err = ufshcd_reset_and_restore(hba);
+
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	if (!err) {
+		err = SUCCESS;
+		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
+	} else {
+		err = FAILED;
+		hba->ufshcd_state = UFSHCD_STATE_ERROR;
+	}
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+
+	return err;
+}
+
+/**
+ * ufshcd_eh_host_reset_handler - host reset handler registered to scsi layer
+ * @cmd: SCSI command pointer
+ *
+ * Returns SUCCESS/FAILED
+ */
+static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
+{
+	struct ufs_hba *hba;
+	int err;
+	unsigned long flags;
+
+	hba = shost_priv(cmd->device->host);
+
+	do {
+		spin_lock_irqsave(hba->host->host_lock, flags);
+		if (!(work_pending(&hba->feh_workq) ||
+				hba->ufshcd_state == UFSHCD_STATE_RESET))
+			break;
+		spin_unlock_irqrestore(hba->host->host_lock, flags);
+		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
+		flush_work_sync(&hba->feh_workq);
+	} while (1);
+
+	hba->ufshcd_state = UFSHCD_STATE_RESET;
+	ufshcd_set_host_reset_pending(hba);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+
+	err = ufshcd_reset_and_restore(hba);
+
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	if (!err) {
+		err = SUCCESS;
+		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
+	} else {
+		err = FAILED;
+		hba->ufshcd_state = UFSHCD_STATE_ERROR;
+	}
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+
+	return err;
+}
+
+/**
  * ufshcd_async_scan - asynchronous execution for link startup
  * @data: data pointer to pass to this function
  * @cookie: cookie data
@@ -2667,8 +3012,14 @@ static void ufshcd_async_scan(void *data, async_cookie_t cookie)
 
 	hba->auto_bkops_enabled = false;
 	ufshcd_enable_auto_bkops(hba);
-	scsi_scan_host(hba->host);
-	pm_runtime_put_sync(hba->dev);
+	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
+
+	/* If we are in error handling context no need to scan the host */
+	if (!(ufshcd_device_reset_pending(hba) ||
+			ufshcd_host_reset_pending(hba))) {
+		scsi_scan_host(hba->host);
+		pm_runtime_put_sync(hba->dev);
+	}
 out:
 	return;
 }
@@ -2681,8 +3032,8 @@ static struct scsi_host_template ufshcd_driver_template = {
 	.slave_alloc		= ufshcd_slave_alloc,
 	.slave_destroy		= ufshcd_slave_destroy,
 	.eh_abort_handler	= ufshcd_abort,
-	.eh_device_reset_handler = ufshcd_device_reset,
-	.eh_host_reset_handler	= ufshcd_host_reset,
+	.eh_device_reset_handler = ufshcd_eh_device_reset_handler,
+	.eh_host_reset_handler   = ufshcd_eh_host_reset_handler,
 	.this_id		= -1,
 	.sg_tablesize		= SG_ALL,
 	.cmd_per_lun		= UFSHCD_CMD_PER_LUN,
diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
index 5d4542c..7fcedd0 100644
--- a/drivers/scsi/ufs/ufshcd.h
+++ b/drivers/scsi/ufs/ufshcd.h
@@ -179,6 +179,7 @@ struct ufs_dev_cmd {
  * @tm_condition: condition variable for task management
  * @tm_slots_in_use: bit map of task management request slots in use
  * @ufshcd_state: UFSHCD states
+ * @eh_flags: Error handling flags
  * @intr_mask: Interrupt Mask Bits
  * @ee_ctrl_mask: Exception event control mask
  * @feh_workq: Work queue for fatal controller error handling
@@ -224,6 +225,7 @@ struct ufs_hba {
 	unsigned long tm_slots_in_use;
 
 	u32 ufshcd_state;
+	u32 eh_flags;
 	u32 intr_mask;
 	u16 ee_ctrl_mask;
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation.

* [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling
  2013-07-09  9:16 [PATCH V3 0/4] scsi: ufs: Improve UFS error handling Sujit Reddy Thumma
                   ` (2 preceding siblings ...)
  2013-07-09  9:16 ` [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods Sujit Reddy Thumma
@ 2013-07-09  9:16 ` Sujit Reddy Thumma
  2013-07-09 10:43   ` merez
  2013-07-19 13:58   ` Seungwon Jeon
  3 siblings, 2 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-09  9:16 UTC (permalink / raw)
  To: Vinayak Holikatti, Santosh Y
  Cc: James E.J. Bottomley, linux-scsi, Sujit Reddy Thumma, linux-arm-msm

Error handling in the UFS driver is broken: on fatal errors it resets
the host controller without re-initializing it. Correct the fatal error
handling sequence according to the UFS Host Controller Interface (HCI)
v1.1 specification.

o Once a fatal error condition is detected, the host controller may hang
  until a reset is applied, so retrying the command without a reset does
  not help. The reset is therefore applied in driver context from a
  separate work item, and the SCSI mid-layer is not informed until the
  reset has been applied.

o Requests that completed without error are reported to the SCSI layer
  as successful. Pending commands that have not started yet, or that did
  not cause the error, are re-queued into the SCSI mid-layer queue. For
  the command that caused the error, the host controller or device is
  reset and DID_ERROR is returned so that the command is retried after
  the reset.

o The SCSI layer is informed about the Unit Attention condition expected
  from the device for the first command after a reset, so that it can
  take the necessary steps to re-establish communication with the device.

Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
---
 drivers/scsi/ufs/ufshcd.c |  349 +++++++++++++++++++++++++++++++++++---------
 drivers/scsi/ufs/ufshcd.h |    2 +
 drivers/scsi/ufs/ufshci.h |   19 ++-
 3 files changed, 295 insertions(+), 75 deletions(-)
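
The split between completed, pending and failed requests described in
the commit message can be modelled in plain C. This is only a user-space
sketch under stated assumptions, not driver code: struct model_hba,
enum slot_fate, autopsy_transfer_reqs() and classify_slot() are
hypothetical names, and the doorbell register read is replaced by a
plain struct field. The OCS values are the ones listed in the ufshci.h
hunk of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* OCS values taken from the ufshci.h hunk of this patch. */
enum {
	OCS_SUCCESS			= 0x0,
	OCS_FATAL_ERROR			= 0x7,
	OCS_INVALID_COMMAND_STATUS	= 0xF,
};

/* Hypothetical outcome of one transfer request slot. */
enum slot_fate {
	SLOT_IDLE,	/* nothing was issued in this slot */
	SLOT_DONE_OK,	/* completed by the controller without error */
	SLOT_REQUEUE,	/* not the culprit, hand back for re-queue */
	SLOT_FAILED,	/* completed with an error OCS, needs reset */
};

/* Hypothetical stand-in for the per-adapter state used by the driver. */
struct model_hba {
	uint32_t outstanding_reqs;	/* slots issued by software */
	uint32_t doorbell;		/* slots still owned by the controller */
	uint8_t ocs[32];		/* overall command status per slot */
	int nutrs;			/* number of transfer request slots */
};

/*
 * Return a bit mask of slots that the controller completed with an
 * error OCS. A slot counts as completed when it is outstanding in
 * software but its doorbell bit is no longer set.
 */
uint32_t autopsy_transfer_reqs(const struct model_hba *hba)
{
	uint32_t completed = hba->doorbell ^ hba->outstanding_reqs;
	uint32_t err_xfer = 0;
	int index;

	assert(hba->nutrs >= 0 && hba->nutrs <= 32);

	for (index = 0; index < hba->nutrs; index++) {
		uint8_t ocs = hba->ocs[index];

		if (!(completed & (UINT32_C(1) << index)))
			continue;
		if (ocs == OCS_SUCCESS || ocs == OCS_INVALID_COMMAND_STATUS)
			continue;
		err_xfer |= UINT32_C(1) << index;
	}
	return err_xfer;
}

/* Decide what happens to a single slot, given the error mask. */
enum slot_fate classify_slot(const struct model_hba *hba,
			     uint32_t err_xfer, int index)
{
	uint32_t bit = UINT32_C(1) << index;

	if (!(hba->outstanding_reqs & bit))
		return SLOT_IDLE;
	if (err_xfer & bit)
		return SLOT_FAILED;
	if (hba->doorbell & bit)
		return SLOT_REQUEUE;	/* still pending in the controller */
	if (hba->ocs[index] == OCS_SUCCESS)
		return SLOT_DONE_OK;
	return SLOT_REQUEUE;		/* OCS never updated by the controller */
}
```

In the patch itself the failed slots are masked out of
hba->outstanding_reqs before ufshcd_transfer_req_compl() runs and are
restored afterwards, so only the remaining slots are completed or
re-queued before the reset is applied.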

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index b4c9910..2a3874f 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -80,6 +80,14 @@ enum {
 	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
 };
 
+/* UFSHCD UIC layer error flags */
+enum {
+	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0), /* Data link layer error */
+	UFSHCD_UIC_NL_ERROR = (1 << 1), /* Network layer error */
+	UFSHCD_UIC_TL_ERROR = (1 << 2), /* Transport Layer error */
+	UFSHCD_UIC_DME_ERROR = (1 << 3), /* DME error */
+};
+
 /* Interrupt configuration options */
 enum {
 	UFSHCD_INT_DISABLE,
@@ -108,6 +116,7 @@ enum {
 
 static void ufshcd_tmc_handler(struct ufs_hba *hba);
 static void ufshcd_async_scan(void *data, async_cookie_t cookie);
+static int ufshcd_reset_and_restore(struct ufs_hba *hba);
 
 /*
  * ufshcd_wait_for_register - wait for register value to change
@@ -1605,9 +1614,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
 		goto out;
 	}
 
-	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
-		scsi_unblock_requests(hba->host);
-
 out:
 	return err;
 }
@@ -1733,66 +1739,6 @@ static int ufshcd_validate_dev_connection(struct ufs_hba *hba)
 }
 
 /**
- * ufshcd_do_reset - reset the host controller
- * @hba: per adapter instance
- *
- * Returns SUCCESS/FAILED
- */
-static int ufshcd_do_reset(struct ufs_hba *hba)
-{
-	struct ufshcd_lrb *lrbp;
-	unsigned long flags;
-	int tag;
-
-	/* block commands from midlayer */
-	scsi_block_requests(hba->host);
-
-	spin_lock_irqsave(hba->host->host_lock, flags);
-	hba->ufshcd_state = UFSHCD_STATE_RESET;
-
-	/* send controller to reset state */
-	ufshcd_hba_stop(hba);
-	spin_unlock_irqrestore(hba->host->host_lock, flags);
-
-	/* abort outstanding commands */
-	for (tag = 0; tag < hba->nutrs; tag++) {
-		if (test_bit(tag, &hba->outstanding_reqs)) {
-			lrbp = &hba->lrb[tag];
-			if (lrbp->cmd) {
-				scsi_dma_unmap(lrbp->cmd);
-				lrbp->cmd->result = DID_RESET << 16;
-				lrbp->cmd->scsi_done(lrbp->cmd);
-				lrbp->cmd = NULL;
-				clear_bit_unlock(tag, &hba->lrb_in_use);
-			}
-		}
-	}
-
-	/* complete device management command */
-	if (hba->dev_cmd.complete)
-		complete(hba->dev_cmd.complete);
-
-	/* clear outstanding request/task bit maps */
-	hba->outstanding_reqs = 0;
-	hba->outstanding_tasks = 0;
-
-	/* Host controller enable */
-	if (ufshcd_hba_enable(hba)) {
-		dev_err(hba->dev,
-			"Reset: Controller initialization failed\n");
-		return FAILED;
-	}
-
-	if (ufshcd_link_startup(hba)) {
-		dev_err(hba->dev,
-			"Reset: Link start-up failed\n");
-		return FAILED;
-	}
-
-	return SUCCESS;
-}
-
-/**
  * ufshcd_slave_alloc - handle initial SCSI device configurations
  * @sdev: pointer to SCSI device
  *
@@ -1809,6 +1755,9 @@ static int ufshcd_slave_alloc(struct scsi_device *sdev)
 	sdev->use_10_for_ms = 1;
 	scsi_set_tag_type(sdev, MSG_SIMPLE_TAG);
 
+	/* allow SCSI layer to restart the device in case of errors */
+	sdev->allow_restart = 1;
+
 	/*
 	 * Inform SCSI Midlayer that the LUN queue depth is same as the
 	 * controller queue depth. If a LUN queue depth is less than the
@@ -2013,6 +1962,9 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
 	case OCS_ABORTED:
 		result |= DID_ABORT << 16;
 		break;
+	case OCS_INVALID_COMMAND_STATUS:
+		result |= DID_REQUEUE << 16;
+		break;
 	case OCS_INVALID_CMD_TABLE_ATTR:
 	case OCS_INVALID_PRDT_ATTR:
 	case OCS_MISMATCH_DATA_BUF_SIZE:
@@ -2405,42 +2357,295 @@ out:
 	return err;
 }
 
+static void ufshcd_decide_eh_xfer_req(struct ufs_hba *hba, u32 ocs)
+{
+	switch (ocs) {
+	case OCS_SUCCESS:
+	case OCS_INVALID_COMMAND_STATUS:
+		break;
+	case OCS_MISMATCH_DATA_BUF_SIZE:
+	case OCS_MISMATCH_RESP_UPIU_SIZE:
+	case OCS_PEER_COMM_FAILURE:
+	case OCS_FATAL_ERROR:
+	case OCS_ABORTED:
+	case OCS_INVALID_CMD_TABLE_ATTR:
+	case OCS_INVALID_PRDT_ATTR:
+		ufshcd_set_host_reset_pending(hba);
+		break;
+	default:
+		dev_err(hba->dev, "%s: unknown OCS 0x%x\n",
+				__func__, ocs);
+		BUG();
+	}
+}
+
+static void ufshcd_decide_eh_task_req(struct ufs_hba *hba, u32 ocs)
+{
+	switch (ocs) {
+	case OCS_TMR_SUCCESS:
+	case OCS_TMR_INVALID_COMMAND_STATUS:
+		break;
+	case OCS_TMR_MISMATCH_REQ_SIZE:
+	case OCS_TMR_MISMATCH_RESP_SIZE:
+	case OCS_TMR_PEER_COMM_FAILURE:
+	case OCS_TMR_INVALID_ATTR:
+	case OCS_TMR_ABORTED:
+	case OCS_TMR_FATAL_ERROR:
+		ufshcd_set_host_reset_pending(hba);
+		break;
+	default:
+		dev_err(hba->dev, "%s: unknown TMR OCS 0x%x\n",
+				__func__, ocs);
+		BUG();
+	}
+}
+
 /**
- * ufshcd_fatal_err_handler - handle fatal errors
+ * ufshcd_error_autopsy_transfer_req() - reads OCS field of failed command and
+ *                          decide error handling
  * @hba: per adapter instance
+ * @err_xfer: bit mask for transfer request errors
+ *
+ * Iterate over completed transfer requests and
+ * set error handling flags.
+ */
+static void
+ufshcd_error_autopsy_transfer_req(struct ufs_hba *hba, u32 *err_xfer)
+{
+	unsigned long completed;
+	u32 doorbell;
+	int index;
+	int ocs;
+
+	if (!err_xfer)
+		goto out;
+
+	doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
+	completed = doorbell ^ (u32)hba->outstanding_reqs;
+
+	for (index = 0; index < hba->nutrs; index++) {
+		if (test_bit(index, &completed)) {
+			ocs = ufshcd_get_tr_ocs(&hba->lrb[index]);
+			if ((ocs == OCS_SUCCESS) ||
+					(ocs == OCS_INVALID_COMMAND_STATUS))
+				continue;
+
+			*err_xfer |= (1 << index);
+			ufshcd_decide_eh_xfer_req(hba, ocs);
+		}
+	}
+out:
+	return;
+}
+
+/**
+ * ufshcd_error_autopsy_task_req() - reads OCS field of failed command and
+ *                          decide error handling
+ * @hba: per adapter instance
+ * @err_tm: bit mask for task management errors
+ *
+ * Iterate over completed task management requests and
+ * set error handling flags.
+ */
+static void
+ufshcd_error_autopsy_task_req(struct ufs_hba *hba, u32 *err_tm)
+{
+	unsigned long completed;
+	u32 doorbell;
+	int index;
+	int ocs;
+
+	if (!err_tm)
+		goto out;
+
+	doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
+	completed = doorbell ^ (u32)hba->outstanding_tasks;
+
+	for (index = 0; index < hba->nutmrs; index++) {
+		if (test_bit(index, &completed)) {
+			struct utp_task_req_desc *tm_descp;
+
+			tm_descp = hba->utmrdl_base_addr;
+			ocs = ufshcd_get_tmr_ocs(&tm_descp[index]);
+			if ((ocs == OCS_TMR_SUCCESS) ||
+					(ocs == OCS_TMR_INVALID_COMMAND_STATUS))
+				continue;
+
+			*err_tm |= (1 << index);
+			ufshcd_decide_eh_task_req(hba, ocs);
+		}
+	}
+
+out:
+	return;
+}
+
+/**
+ * ufshcd_fatal_err_handler - handle fatal errors
+ * @work: pointer to work structure
  */
 static void ufshcd_fatal_err_handler(struct work_struct *work)
 {
 	struct ufs_hba *hba;
+	unsigned long flags;
+	u32 err_xfer = 0;
+	u32 err_tm = 0;
+	int err;
+
 	hba = container_of(work, struct ufs_hba, feh_workq);
 
 	pm_runtime_get_sync(hba->dev);
-	/* check if reset is already in progress */
-	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
-		ufshcd_do_reset(hba);
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
+		/* complete processed requests and exit */
+		ufshcd_transfer_req_compl(hba);
+		ufshcd_tmc_handler(hba);
+		spin_unlock_irqrestore(hba->host->host_lock, flags);
+		pm_runtime_put_sync(hba->dev);
+		return;
+	}
+
+	hba->ufshcd_state = UFSHCD_STATE_RESET;
+	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
+	ufshcd_error_autopsy_task_req(hba, &err_tm);
+
+	/*
+	 * Complete successful and pending transfer requests.
+	 * DID_REQUEUE is returned for pending requests as they have
+	 * nothing to do with error'ed request and SCSI layer should
+	 * not treat them as errors and decrement retry count.
+	 */
+	hba->outstanding_reqs &= ~err_xfer;
+	ufshcd_transfer_req_compl(hba);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+	ufshcd_complete_pending_reqs(hba);
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	hba->outstanding_reqs |= err_xfer;
+
+	/* Complete successful and pending task requests */
+	hba->outstanding_tasks &= ~err_tm;
+	ufshcd_tmc_handler(hba);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+	ufshcd_complete_pending_tasks(hba);
+	spin_lock_irqsave(hba->host->host_lock, flags);
+
+	hba->outstanding_tasks |= err_tm;
+
+	/*
+	 * Controller may generate multiple fatal errors, handle
+	 * errors based on severity.
+	 * 1) DEVICE_FATAL_ERROR
+	 * 2) SYSTEM_BUS/CONTROLLER_FATAL_ERROR
+	 * 3) UIC_ERROR
+	 */
+	if (hba->errors & DEVICE_FATAL_ERROR) {
+		/*
+		 * Some HBAs may not clear UTRLDBR/UTMRLDBR or update
+		 * OCS field on device fatal error.
+		 */
+		ufshcd_set_host_reset_pending(hba);
+	} else if (hba->errors & (SYSTEM_BUS_FATAL_ERROR |
+			CONTROLLER_FATAL_ERROR)) {
+		/* eh flags should be set in err autopsy based on OCS values */
+		if (!hba->eh_flags)
+			WARN(1, "%s: fatal error without error handling\n",
+				dev_name(hba->dev));
+	} else if (hba->errors & UIC_ERROR) {
+		if (hba->uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR) {
+			/* fatal error - reset controller */
+			ufshcd_set_host_reset_pending(hba);
+		} else if (hba->uic_error & (UFSHCD_UIC_NL_ERROR |
+					UFSHCD_UIC_TL_ERROR |
+					UFSHCD_UIC_DME_ERROR)) {
+			/* non-fatal, report error to SCSI layer */
+			if (!hba->eh_flags) {
+				spin_unlock_irqrestore(
+						hba->host->host_lock, flags);
+				ufshcd_complete_pending_reqs(hba);
+				ufshcd_complete_pending_tasks(hba);
+				spin_lock_irqsave(hba->host->host_lock, flags);
+			}
+		}
+	}
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+
+	if (hba->eh_flags) {
+		err = ufshcd_reset_and_restore(hba);
+		if (err) {
+			ufshcd_clear_host_reset_pending(hba);
+			ufshcd_clear_device_reset_pending(hba);
+			dev_err(hba->dev, "%s: reset and restore failed\n",
+					__func__);
+			hba->ufshcd_state = UFSHCD_STATE_ERROR;
+		}
+		/*
+		 * Inform the SCSI mid-layer that a reset was performed so
+		 * that it can handle the resulting Unit Attention properly.
+		 */
+		scsi_report_bus_reset(hba->host, 0);
+		hba->errors = 0;
+		hba->uic_error = 0;
+	}
+	scsi_unblock_requests(hba->host);
 	pm_runtime_put_sync(hba->dev);
 }
 
 /**
- * ufshcd_err_handler - Check for fatal errors
- * @work: pointer to a work queue structure
+ * ufshcd_update_uic_error - check and set fatal UIC error flags.
+ * @hba: per-adapter instance
  */
-static void ufshcd_err_handler(struct ufs_hba *hba)
+static void ufshcd_update_uic_error(struct ufs_hba *hba)
 {
 	u32 reg;
 
+	/* PA_INIT_ERROR is fatal and needs UIC reset */
+	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
+	if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
+		hba->uic_error |= UFSHCD_UIC_DL_PA_INIT_ERROR;
+
+	/* UIC NL/TL/DME errors need software retry */
+	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_NETWORK_LAYER);
+	if (reg)
+		hba->uic_error |= UFSHCD_UIC_NL_ERROR;
+
+	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_TRANSPORT_LAYER);
+	if (reg)
+		hba->uic_error |= UFSHCD_UIC_TL_ERROR;
+
+	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DME);
+	if (reg)
+		hba->uic_error |= UFSHCD_UIC_DME_ERROR;
+
+	dev_dbg(hba->dev, "%s: UIC error flags = 0x%08x\n",
+			__func__, hba->uic_error);
+}
+
+/**
+ * ufshcd_err_handler - Check for fatal errors
+ * @hba: per-adapter instance
+ */
+static void ufshcd_err_handler(struct ufs_hba *hba)
+{
 	if (hba->errors & INT_FATAL_ERRORS)
 		goto fatal_eh;
 
 	if (hba->errors & UIC_ERROR) {
-		reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
-		if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
+		hba->uic_error = 0;
+		ufshcd_update_uic_error(hba);
+		if (hba->uic_error)
 			goto fatal_eh;
 	}
+	/*
+	 * Other errors are either non-fatal or completed by the
+	 * controller by updating OCS fields with success/failure.
+	 */
 	return;
+
 fatal_eh:
 	/* handle fatal errors only when link is functional */
 	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
+		/* block commands from midlayer */
+		scsi_block_requests(hba->host);
 		/* block commands at driver layer until error is handled */
 		hba->ufshcd_state = UFSHCD_STATE_ERROR;
 		schedule_work(&hba->feh_workq);
diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
index 7fcedd0..4ee4d1a 100644
--- a/drivers/scsi/ufs/ufshcd.h
+++ b/drivers/scsi/ufs/ufshcd.h
@@ -185,6 +185,7 @@ struct ufs_dev_cmd {
  * @feh_workq: Work queue for fatal controller error handling
  * @eeh_work: Worker to handle exception events
  * @errors: HBA errors
+ * @uic_error: UFS interconnect layer error status
  * @dev_cmd: ufs device management command information
  * @auto_bkops_enabled: to track whether bkops is enabled in device
  */
@@ -235,6 +236,7 @@ struct ufs_hba {
 
 	/* HBA Errors */
 	u32 errors;
+	u32 uic_error;
 
 	/* Device management request data */
 	struct ufs_dev_cmd dev_cmd;
diff --git a/drivers/scsi/ufs/ufshci.h b/drivers/scsi/ufs/ufshci.h
index f1e1b74..36f68ef 100644
--- a/drivers/scsi/ufs/ufshci.h
+++ b/drivers/scsi/ufs/ufshci.h
@@ -264,7 +264,7 @@ enum {
 	UTP_DEVICE_TO_HOST	= 0x04000000,
 };
 
-/* Overall command status values */
+/* Overall command status values for transfer request */
 enum {
 	OCS_SUCCESS			= 0x0,
 	OCS_INVALID_CMD_TABLE_ATTR	= 0x1,
@@ -274,8 +274,21 @@ enum {
 	OCS_PEER_COMM_FAILURE		= 0x5,
 	OCS_ABORTED			= 0x6,
 	OCS_FATAL_ERROR			= 0x7,
-	OCS_INVALID_COMMAND_STATUS	= 0x0F,
-	MASK_OCS			= 0x0F,
+	OCS_INVALID_COMMAND_STATUS	= 0xF,
+	MASK_OCS			= 0xFF,
+};
+
+/* Overall command status values for task management request */
+enum {
+	OCS_TMR_SUCCESS			= 0x0,
+	OCS_TMR_INVALID_ATTR		= 0x1,
+	OCS_TMR_MISMATCH_REQ_SIZE	= 0x2,
+	OCS_TMR_MISMATCH_RESP_SIZE	= 0x3,
+	OCS_TMR_PEER_COMM_FAILURE	= 0x4,
+	OCS_TMR_ABORTED			= 0x5,
+	OCS_TMR_FATAL_ERROR		= 0x6,
+	OCS_TMR_INVALID_COMMAND_STATUS	= 0xF,
+	MASK_OCS_TMR			= 0xFF,
 };
 
 /**
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation.

* Re: [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation
  2013-07-09  9:16 ` [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation Sujit Reddy Thumma
@ 2013-07-09 10:42   ` merez
  2013-07-19 13:56   ` Seungwon Jeon
  1 sibling, 0 replies; 27+ messages in thread
From: merez @ 2013-07-09 10:42 UTC (permalink / raw)
  Cc: Vinayak Holikatti, Santosh Y, James E.J. Bottomley, linux-scsi,
	Sujit Reddy Thumma, linux-arm-msm

Tested-by: Maya Erez <merez@codeaurora.org>

> Currently, sending Task Management (TM) command to the card might
> be broken in some scenarios as listed below:
>
> Problem: If there are more than 8 TM commands the implementation
>          returns error to the caller.
> Fix:     Wait for one of the slots to be emptied and send the command.
>
> Problem: Sometimes it is necessary for the caller to know the TM service
>          response code to determine the task status.
> Fix:     Propagate the service response to the caller.
>
> Problem: If the TM command times out no proper error recovery is
>          implemented.
> Fix:     Clear the command in the controller door-bell register, so that
>          further commands for the same slot don't fail.
>
> Problem: While preparing the TM command descriptor, the task tag used
>          should be unique across SCSI/NOP/QUERY/TM commands and not the
> 	 task tag of the command which the TM command is trying to manage.
> Fix:     Use a unique task tag instead of task tag of SCSI command.
>
> Problem: Since the TM command involves H/W communication, abruptly ending
>          the request on kill interrupt signal might cause h/w malfunction.
> Fix:     Wait for hardware completion interrupt with TASK_UNINTERRUPTIBLE
>          set.
>
> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> ---
>  drivers/scsi/ufs/ufshcd.c |  177 ++++++++++++++++++++++++++++++---------------
>  drivers/scsi/ufs/ufshcd.h |    8 ++-
>  2 files changed, 126 insertions(+), 59 deletions(-)
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index af7d01d..a176421 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -53,6 +53,9 @@
>  /* Query request timeout */
>  #define QUERY_REQ_TIMEOUT 30 /* msec */
>
> +/* Task management command timeout */
> +#define TM_CMD_TIMEOUT	100 /* msecs */
> +
>  /* Expose the flag value from utp_upiu_query.value */
>  #define MASK_QUERY_UPIU_FLAG_LOC 0xFF
>
> @@ -190,13 +193,35 @@ ufshcd_get_tmr_ocs(struct utp_task_req_desc *task_req_descp)
>  /**
>   * ufshcd_get_tm_free_slot - get a free slot for task management request
>   * @hba: per adapter instance
> + * @free_slot: pointer to variable with available slot value
>   *
> - * Returns maximum number of task management request slots in case of
> - * task management queue full or returns the free slot number
> + * Get a free tag and lock it until ufshcd_put_tm_slot() is called.
> + * Returns 0 if free slot is not available, else return 1 with tag value
> + * in @free_slot.
>   */
> -static inline int ufshcd_get_tm_free_slot(struct ufs_hba *hba)
> +static bool ufshcd_get_tm_free_slot(struct ufs_hba *hba, int *free_slot)
> +{
> +	int tag;
> +	bool ret = false;
> +
> +	if (!free_slot)
> +		goto out;
> +
> +	do {
> +		tag = find_first_zero_bit(&hba->tm_slots_in_use, hba->nutmrs);
> +		if (tag >= hba->nutmrs)
> +			goto out;
> +	} while (test_and_set_bit_lock(tag, &hba->tm_slots_in_use));
> +
> +	*free_slot = tag;
> +	ret = true;
> +out:
> +	return ret;
> +}
> +
> +static inline void ufshcd_put_tm_slot(struct ufs_hba *hba, int slot)
>  {
> -	return find_first_zero_bit(&hba->outstanding_tasks, hba->nutmrs);
> +	clear_bit_unlock(slot, &hba->tm_slots_in_use);
>  }
>
>  /**
> @@ -1778,10 +1803,11 @@ static void ufshcd_slave_destroy(struct scsi_device *sdev)
>   * ufshcd_task_req_compl - handle task management request completion
>   * @hba: per adapter instance
>   * @index: index of the completed request
> + * @resp: task management service response
>   *
> - * Returns SUCCESS/FAILED
> + * Returns non-zero value on error, zero on success
>   */
> -static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
> +static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index, u8 *resp)
>  {
>  	struct utp_task_req_desc *task_req_descp;
>  	struct utp_upiu_task_rsp *task_rsp_upiup;
> @@ -1802,19 +1828,15 @@ static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
>  				task_req_descp[index].task_rsp_upiu;
>  		task_result = be32_to_cpu(task_rsp_upiup->header.dword_1);
>  		task_result = ((task_result & MASK_TASK_RESPONSE) >> 8);
> -
> -		if (task_result != UPIU_TASK_MANAGEMENT_FUNC_COMPL &&
> -		    task_result != UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED)
> -			task_result = FAILED;
> -		else
> -			task_result = SUCCESS;
> +		if (resp)
> +			*resp = (u8)task_result;
>  	} else {
> -		task_result = FAILED;
> -		dev_err(hba->dev,
> -			"trc: Invalid ocs = %x\n", ocs_value);
> +		dev_err(hba->dev, "%s: failed, ocs = 0x%x\n",
> +				__func__, ocs_value);
>  	}
>  	spin_unlock_irqrestore(hba->host->host_lock, flags);
> -	return task_result;
> +
> +	return ocs_value;
>  }
>
>  /**
> @@ -2298,7 +2320,7 @@ static void ufshcd_tmc_handler(struct ufs_hba *hba)
>
>  	tm_doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
>  	hba->tm_condition = tm_doorbell ^ hba->outstanding_tasks;
> -	wake_up_interruptible(&hba->ufshcd_tm_wait_queue);
> +	wake_up(&hba->tm_wq);
>  }
>
>  /**
> @@ -2348,38 +2370,61 @@ static irqreturn_t ufshcd_intr(int irq, void *__hba)
>  	return retval;
>  }
>
> +static int ufshcd_clear_tm_cmd(struct ufs_hba *hba, int tag)
> +{
> +	int err = 0;
> +	u32 reg;
> +	u32 mask = 1 << tag;
> +	unsigned long flags;
> +
> +	if (!test_bit(tag, &hba->outstanding_reqs))
> +		goto out;
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	ufshcd_writel(hba, ~(1 << tag), REG_UTP_TASK_REQ_LIST_CLEAR);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	/* poll for max. 1 sec to clear door bell register by h/w */
> +	reg = ufshcd_wait_for_register(hba,
> +			REG_UTP_TASK_REQ_DOOR_BELL,
> +			mask, 0, 1000, 1000);
> +	if ((reg & mask) == mask)
> +		err = -ETIMEDOUT;
> +out:
> +	return err;
> +}
> +
>  /**
>   * ufshcd_issue_tm_cmd - issues task management commands to controller
>   * @hba: per adapter instance
> - * @lrbp: pointer to local reference block
> + * @lun_id: LUN ID to which TM command is sent
> + * @task_id: task ID to which the TM command is applicable
> + * @tm_function: task management function opcode
> + * @tm_response: task management service response return value
>   *
> - * Returns SUCCESS/FAILED
> + * Returns non-zero value on error, zero on success.
>   */
> -static int
> -ufshcd_issue_tm_cmd(struct ufs_hba *hba,
> -		    struct ufshcd_lrb *lrbp,
> -		    u8 tm_function)
> +static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
> +		u8 tm_function, u8 *tm_response)
>  {
>  	struct utp_task_req_desc *task_req_descp;
>  	struct utp_upiu_task_req *task_req_upiup;
>  	struct Scsi_Host *host;
>  	unsigned long flags;
> -	int free_slot = 0;
> +	int free_slot;
>  	int err;
> +	int task_tag;
>
>  	host = hba->host;
>
> -	spin_lock_irqsave(host->host_lock, flags);
> -
> -	/* If task management queue is full */
> -	free_slot = ufshcd_get_tm_free_slot(hba);
> -	if (free_slot >= hba->nutmrs) {
> -		spin_unlock_irqrestore(host->host_lock, flags);
> -		dev_err(hba->dev, "Task management queue full\n");
> -		err = FAILED;
> -		goto out;
> -	}
> +	/*
> +	 * Get free slot, sleep if slots are unavailable.
> +	 * Even though we use wait_event() which sleeps indefinitely,
> +	 * the maximum wait time is bounded by %TM_CMD_TIMEOUT.
> +	 */
> +	wait_event(hba->tm_tag_wq, ufshcd_get_tm_free_slot(hba, &free_slot));
>
> +	spin_lock_irqsave(host->host_lock, flags);
>  	task_req_descp = hba->utmrdl_base_addr;
>  	task_req_descp += free_slot;
>
> @@ -2391,18 +2436,15 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>  	/* Configure task request UPIU */
>  	task_req_upiup =
>  		(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
> +	task_tag = hba->nutrs + free_slot;
>  	task_req_upiup->header.dword_0 =
>  		UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
> -					      lrbp->lun, lrbp->task_tag);
> +				lun_id, task_tag);
>  	task_req_upiup->header.dword_1 =
>  		UPIU_HEADER_DWORD(0, tm_function, 0, 0);
>
> -	task_req_upiup->input_param1 = lrbp->lun;
> -	task_req_upiup->input_param1 =
> -		cpu_to_be32(task_req_upiup->input_param1);
> -	task_req_upiup->input_param2 = lrbp->task_tag;
> -	task_req_upiup->input_param2 =
> -		cpu_to_be32(task_req_upiup->input_param2);
> +	task_req_upiup->input_param1 = cpu_to_be32(lun_id);
> +	task_req_upiup->input_param2 = cpu_to_be32(task_id);
>
>  	/* send command to the controller */
>  	__set_bit(free_slot, &hba->outstanding_tasks);
> @@ -2411,20 +2453,24 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>  	spin_unlock_irqrestore(host->host_lock, flags);
>
>  	/* wait until the task management command is completed */
> -	err =
> -	wait_event_interruptible_timeout(hba->ufshcd_tm_wait_queue,
> -					 (test_bit(free_slot,
> -					 &hba->tm_condition) != 0),
> -					 60 * HZ);
> +	err = wait_event_timeout(hba->tm_wq,
> +			test_bit(free_slot, &hba->tm_condition),
> +			msecs_to_jiffies(TM_CMD_TIMEOUT));
>  	if (!err) {
> -		dev_err(hba->dev,
> -			"Task management command timed-out\n");
> -		err = FAILED;
> -		goto out;
> +		dev_err(hba->dev, "%s: task management cmd 0x%.2x timed-out\n",
> +				__func__, tm_function);
> +		if (ufshcd_clear_tm_cmd(hba, free_slot))
> +			dev_WARN(hba->dev, "%s: unable to clear tm cmd (slot %d) after timeout\n",
> +					__func__, free_slot);
> +		err = -ETIMEDOUT;
> +	} else {
> +		err = ufshcd_task_req_compl(hba, free_slot, tm_response);
>  	}
> +
>  	clear_bit(free_slot, &hba->tm_condition);
> -	err = ufshcd_task_req_compl(hba, free_slot);
> -out:
> +	ufshcd_put_tm_slot(hba, free_slot);
> +	wake_up(&hba->tm_tag_wq);
> +
>  	return err;
>  }
>
> @@ -2441,14 +2487,22 @@ static int ufshcd_device_reset(struct scsi_cmnd *cmd)
>  	unsigned int tag;
>  	u32 pos;
>  	int err;
> +	u8 resp;
> +	struct ufshcd_lrb *lrbp;
>
>  	host = cmd->device->host;
>  	hba = shost_priv(host);
>  	tag = cmd->request->tag;
>
> -	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_LOGICAL_RESET);
> -	if (err == FAILED)
> +	lrbp = &hba->lrb[tag];
> +	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> +			UFS_LOGICAL_RESET, &resp);
> +	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> +		err = FAILED;
>  		goto out;
> +	} else {
> +		err = SUCCESS;
> +	}
>
>  	for (pos = 0; pos < hba->nutrs; pos++) {
>  		if (test_bit(pos, &hba->outstanding_reqs) &&
> @@ -2505,6 +2559,8 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	unsigned long flags;
>  	unsigned int tag;
>  	int err;
> +	u8 resp;
> +	struct ufshcd_lrb *lrbp;
>
>  	host = cmd->device->host;
>  	hba = shost_priv(host);
> @@ -2520,9 +2576,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	}
>  	spin_unlock_irqrestore(host->host_lock, flags);
>
> -	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_ABORT_TASK);
> -	if (err == FAILED)
> +	lrbp = &hba->lrb[tag];
> +	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> +			UFS_ABORT_TASK, &resp);
> +	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> +		err = FAILED;
>  		goto out;
> +	} else {
> +		err = SUCCESS;
> +	}
>
>  	scsi_dma_unmap(cmd);
>
> @@ -2744,7 +2806,8 @@ int ufshcd_init(struct device *dev, struct ufs_hba **hba_handle,
>  	host->max_cmd_len = MAX_CDB_SIZE;
>
>  	/* Initailize wait queue for task management */
> -	init_waitqueue_head(&hba->ufshcd_tm_wait_queue);
> +	init_waitqueue_head(&hba->tm_wq);
> +	init_waitqueue_head(&hba->tm_tag_wq);
>
>  	/* Initialize work queues */
>  	INIT_WORK(&hba->feh_workq, ufshcd_fatal_err_handler);
> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
> index 6c9bd35..5d4542c 100644
> --- a/drivers/scsi/ufs/ufshcd.h
> +++ b/drivers/scsi/ufs/ufshcd.h
> @@ -174,8 +174,10 @@ struct ufs_dev_cmd {
>   * @irq: Irq number of the controller
>   * @active_uic_cmd: handle of active UIC command
>   * @uic_cmd_mutex: mutex for uic command
> - * @ufshcd_tm_wait_queue: wait queue for task management
> + * @tm_wq: wait queue for task management
> + * @tm_tag_wq: wait queue for free task management slots
>   * @tm_condition: condition variable for task management
> + * @tm_slots_in_use: bit map of task management request slots in use
>   * @ufshcd_state: UFSHCD states
>   * @intr_mask: Interrupt Mask Bits
>   * @ee_ctrl_mask: Exception event control mask
> @@ -216,8 +218,10 @@ struct ufs_hba {
>  	struct uic_command *active_uic_cmd;
>  	struct mutex uic_cmd_mutex;
>
> -	wait_queue_head_t ufshcd_tm_wait_queue;
> +	wait_queue_head_t tm_wq;
> +	wait_queue_head_t tm_tag_wq;
>  	unsigned long tm_condition;
> +	unsigned long tm_slots_in_use;
>
>  	u32 ufshcd_state;
>  	u32 intr_mask;
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
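
For readers following the patch quoted above: it derives the task
management tag as hba->nutrs + free_slot and tracks reserved slots in
tm_slots_in_use. The sketch below is a single-threaded userspace model
of that idea, not the driver code. The struct and the model_* helpers
are hypothetical names, and the real driver needs atomic bit operations
and a wait queue, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the driver's structures. */
struct tm_model {
	unsigned int nutrs;            /* number of transfer request slots */
	unsigned int nutmrs;           /* number of task management slots */
	unsigned long tm_slots_in_use; /* one bit per reserved TM slot */
};

/* Reserve the lowest free TM slot; returns false when all are busy. */
static bool model_get_tm_slot(struct tm_model *m, unsigned int *slot)
{
	unsigned int i;

	for (i = 0; i < m->nutmrs; i++) {
		if (!(m->tm_slots_in_use & (1UL << i))) {
			m->tm_slots_in_use |= 1UL << i;
			*slot = i;
			return true;
		}
	}
	return false;
}

/* Release a previously reserved TM slot. */
static void model_put_tm_slot(struct tm_model *m, unsigned int slot)
{
	assert(slot < m->nutmrs);
	m->tm_slots_in_use &= ~(1UL << slot);
}

/*
 * TM tags start right after the transfer request tags, so a TM tag
 * can never equal a tag in the range 0..nutrs-1.
 */
static unsigned int model_tm_task_tag(const struct tm_model *m,
				      unsigned int slot)
{
	return m->nutrs + slot;
}
```

With nutrs = 32, the second reserved slot (slot 1) maps to tag 33,
outside the 0..31 range used by the transfer request slots.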


-- 
Maya Erez
QUALCOMM ISRAEL, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command
  2013-07-09  9:16 ` [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command Sujit Reddy Thumma
@ 2013-07-09 10:42   ` merez
  2013-07-19 13:56   ` Seungwon Jeon
  1 sibling, 0 replies; 27+ messages in thread
From: merez @ 2013-07-09 10:42 UTC (permalink / raw)
  Cc: Vinayak Holikatti, Santosh Y, James E.J. Bottomley, linux-scsi,
	Sujit Reddy Thumma, linux-arm-msm

Tested-by: Maya Erez <merez@codeaurora.org>

> There is a possible race condition in the hardware when the abort
> command is issued to terminate an ongoing SCSI command, as described
> below:
>
> - A bit in the door-bell register is set in the controller for a
>   new SCSI command.
> - In some rare situations, before the controller gets a chance to issue
>   the command to the device, the software issues an abort command.
> - If the device receives the abort command first, it returns success
>   because the command itself is not present.
> - If the controller then commits the command to the device, it will be
>   processed.
> - Software thinks that the command is aborted and proceeds while the
>   device is still processing it.
> - The software, controller and device may go out of sync because of
>   this race condition.
>
> To avoid this, query task presence in the device before sending the abort
> task command so that after the abort operation, the command is guaranteed
> to be non-existent in both the controller and the device.
>
> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> ---
>  drivers/scsi/ufs/ufshcd.c |   70 +++++++++++++++++++++++++++++++++++---------
>  1 files changed, 55 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index a176421..51ce096 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -2550,6 +2550,12 @@ static int ufshcd_host_reset(struct scsi_cmnd *cmd)
>   * ufshcd_abort - abort a specific command
>   * @cmd: SCSI command pointer
>   *
> + * Abort the pending command in device by sending UFS_ABORT_TASK task management
> + * command, and in host controller by clearing the door-bell register. There can
> + * be race between controller sending the command to the device while abort is
> + * issued. To avoid that, first issue UFS_QUERY_TASK to check if the command is
> + * really issued and then try to abort it.
> + *
>   * Returns SUCCESS/FAILED
>   */
>  static int ufshcd_abort(struct scsi_cmnd *cmd)
> @@ -2558,7 +2564,8 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	struct ufs_hba *hba;
>  	unsigned long flags;
>  	unsigned int tag;
> -	int err;
> +	int err = 0;
> +	int poll_cnt;
>  	u8 resp;
>  	struct ufshcd_lrb *lrbp;
>
> @@ -2566,33 +2573,59 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	hba = shost_priv(host);
>  	tag = cmd->request->tag;
>
> -	spin_lock_irqsave(host->host_lock, flags);
> +	/* If command is already aborted/completed, return SUCCESS */
> +	if (!(test_bit(tag, &hba->outstanding_reqs)))
> +		goto out;
>
> -	/* check if command is still pending */
> -	if (!(test_bit(tag, &hba->outstanding_reqs))) {
> -		err = FAILED;
> -		spin_unlock_irqrestore(host->host_lock, flags);
> +	lrbp = &hba->lrb[tag];
> +	for (poll_cnt = 100; poll_cnt; poll_cnt--) {
> +		err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> +				UFS_QUERY_TASK, &resp);
> +		if (!err && resp == UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED) {
> +			/* cmd pending in the device */
> +			break;
> +		} else if (!err && resp == UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> +			u32 reg;
> +
> +			/*
> +			 * cmd not pending in the device, check if it is
> +			 * in transition.
> +			 */
> +			reg = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
> +			if (reg & (1 << tag)) {
> +				/* sleep for max. 2ms to stabilize */
> +				usleep_range(1000, 2000);
> +				continue;
> +			}
> +			/* command completed already */
> +			goto out;
> +		} else {
> +			if (!err)
> +				err = resp; /* service response error */
> +			goto out;
> +		}
> +	}
> +
> +	if (!poll_cnt) {
> +		err = -EBUSY;
>  		goto out;
>  	}
> -	spin_unlock_irqrestore(host->host_lock, flags);
>
> -	lrbp = &hba->lrb[tag];
>  	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>  			UFS_ABORT_TASK, &resp);
>  	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> -		err = FAILED;
> +		if (!err)
> +			err = resp; /* service response error */
>  		goto out;
> -	} else {
> -		err = SUCCESS;
>  	}
>
> +	err = ufshcd_clear_cmd(hba, tag);
> +	if (err)
> +		goto out;
> +
>  	scsi_dma_unmap(cmd);
>
>  	spin_lock_irqsave(host->host_lock, flags);
> -
> -	/* clear the respective UTRLCLR register bit */
> -	ufshcd_utrl_clear(hba, tag);
> -
>  	__clear_bit(tag, &hba->outstanding_reqs);
>  	hba->lrb[tag].cmd = NULL;
>  	spin_unlock_irqrestore(host->host_lock, flags);
> @@ -2600,6 +2633,13 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	clear_bit_unlock(tag, &hba->lrb_in_use);
>  	wake_up(&hba->dev_cmd.tag_wq);
>  out:
> +	if (!err) {
> +		err = SUCCESS;
> +	} else {
> +		dev_err(hba->dev, "%s: failed with err %d\n", __func__, err);
> +		err = FAILED;
> +	}
> +
>  	return err;
>  }
>
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation.
>
>
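
The query-before-abort loop in the patch quoted above can be read as a
small decision procedure: abort only once the device reports the task,
treat "not in device and door-bell clear" as already completed, and give
up if the command stays in transit for too long. The sketch below is a
userspace model under those assumptions. struct fake_hw and the fake_*
helpers are hypothetical stand-ins for UFS_QUERY_TASK and the door-bell
register read, and the 1-2 ms sleep between polls is omitted.

```c
#include <assert.h>

enum query_result { TASK_IN_DEVICE, TASK_NOT_IN_DEVICE, QUERY_FAILED };
enum abort_decision { SEND_ABORT, ALREADY_COMPLETED, GIVE_UP };

/* Hypothetical simulated hardware, not a driver structure. */
struct fake_hw {
	int in_flight_polls; /* polls during which the cmd is only in the door-bell */
	int reaches_device;  /* afterwards: 1 = device holds it, 0 = it completed */
};

/* Stand-in for issuing UFS_QUERY_TASK and decoding the response. */
static enum query_result fake_query_task(const struct fake_hw *hw)
{
	if (hw->in_flight_polls > 0)
		return TASK_NOT_IN_DEVICE;
	return hw->reaches_device ? TASK_IN_DEVICE : TASK_NOT_IN_DEVICE;
}

/* Stand-in for testing the command's bit in the door-bell register. */
static int fake_doorbell_set(struct fake_hw *hw)
{
	if (hw->in_flight_polls > 0) {
		hw->in_flight_polls--;
		return 1;
	}
	return hw->reaches_device;
}

/* Decide what the abort handler should do next. */
static enum abort_decision decide_abort(struct fake_hw *hw, int max_polls)
{
	int i;

	assert(max_polls > 0);
	for (i = 0; i < max_polls; i++) {
		switch (fake_query_task(hw)) {
		case TASK_IN_DEVICE:
			return SEND_ABORT;
		case TASK_NOT_IN_DEVICE:
			if (!fake_doorbell_set(hw))
				return ALREADY_COMPLETED;
			break; /* still in transit, poll again */
		default:
			return GIVE_UP; /* query itself failed */
		}
	}
	return GIVE_UP; /* never stabilised within max_polls */
}
```

In the patch, both GIVE_UP cases end up as FAILED to the SCSI layer,
while ALREADY_COMPLETED returns SUCCESS without sending the abort.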


-- 
Maya Erez
QUALCOMM ISRAEL, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation



* Re: [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods
  2013-07-09  9:16 ` [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods Sujit Reddy Thumma
@ 2013-07-09 10:43   ` merez
  2013-07-19 13:57   ` Seungwon Jeon
  1 sibling, 0 replies; 27+ messages in thread
From: merez @ 2013-07-09 10:43 UTC (permalink / raw)
  Cc: Vinayak Holikatti, Santosh Y, James E.J. Bottomley, linux-scsi,
	Sujit Reddy Thumma, linux-arm-msm

Tested with error injection.

Tested-by: Maya Erez <merez@codeaurora.org>

> As of now, SCSI-initiated error handling is broken because the reset
> APIs don't try to bring the device back to an initialized state, ready
> for further transfers.
>
> In case of timeouts, the SCSI error handler takes care of handling aborts
> and resets. Improve the error handling in such scenarios by resetting the
> device and host and re-initializing them properly.
>
> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> ---
>  drivers/scsi/ufs/ufshcd.c |  467 +++++++++++++++++++++++++++++++++++++++------
>  drivers/scsi/ufs/ufshcd.h |    2 +
>  2 files changed, 411 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index 51ce096..b4c9910 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -69,9 +69,15 @@ enum {
>
>  /* UFSHCD states */
>  enum {
> -	UFSHCD_STATE_OPERATIONAL,
>  	UFSHCD_STATE_RESET,
>  	UFSHCD_STATE_ERROR,
> +	UFSHCD_STATE_OPERATIONAL,
> +};
> +
> +/* UFSHCD error handling flags */
> +enum {
> +	UFSHCD_EH_HOST_RESET_PENDING = (1 << 0),
> +	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>  };
>
>  /* Interrupt configuration options */
> @@ -87,6 +93,22 @@ enum {
>  	INT_AGGR_CONFIG,
>  };
>
> +#define ufshcd_set_device_reset_pending(h) \
> +	(h->eh_flags |= UFSHCD_EH_DEVICE_RESET_PENDING)
> +#define ufshcd_set_host_reset_pending(h) \
> +	(h->eh_flags |= UFSHCD_EH_HOST_RESET_PENDING)
> +#define ufshcd_device_reset_pending(h) \
> +	(h->eh_flags & UFSHCD_EH_DEVICE_RESET_PENDING)
> +#define ufshcd_host_reset_pending(h) \
> +	(h->eh_flags & UFSHCD_EH_HOST_RESET_PENDING)
> +#define ufshcd_clear_device_reset_pending(h) \
> +	(h->eh_flags &= ~UFSHCD_EH_DEVICE_RESET_PENDING)
> +#define ufshcd_clear_host_reset_pending(h) \
> +	(h->eh_flags &= ~UFSHCD_EH_HOST_RESET_PENDING)
> +
> +static void ufshcd_tmc_handler(struct ufs_hba *hba);
> +static void ufshcd_async_scan(void *data, async_cookie_t cookie);
> +
>  /*
>   * ufshcd_wait_for_register - wait for register value to change
>   * @hba - per-adapter interface
> @@ -851,9 +873,22 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>
>  	tag = cmd->request->tag;
>
> -	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL) {
> +	switch (hba->ufshcd_state) {
> +	case UFSHCD_STATE_OPERATIONAL:
> +		break;
> +	case UFSHCD_STATE_RESET:
>  		err = SCSI_MLQUEUE_HOST_BUSY;
>  		goto out;
> +	case UFSHCD_STATE_ERROR:
> +		set_host_byte(cmd, DID_ERROR);
> +		cmd->scsi_done(cmd);
> +		goto out;
> +	default:
> +		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
> +				__func__, hba->ufshcd_state);
> +		set_host_byte(cmd, DID_BAD_TARGET);
> +		cmd->scsi_done(cmd);
> +		goto out;
>  	}
>
>  	/* acquire the tag to make sure device cmds don't use it */
> @@ -1573,8 +1608,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>  	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>  		scsi_unblock_requests(hba->host);
>
> -	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> -
>  out:
>  	return err;
>  }
> @@ -2273,6 +2306,106 @@ out:
>  }
>
>  /**
> + * ufshcd_utrl_is_rsr_enabled - check if run-stop register is enabled
> + * @hba: per-adapter instance
> + */
> +static bool ufshcd_utrl_is_rsr_enabled(struct ufs_hba *hba)
> +{
> +	return ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_LIST_RUN_STOP) & 0x1;
> +}
> +
> +/**
> + * ufshcd_utmrl_is_rsr_enabled - check if run-stop register is enabled
> + * @hba: per-adapter instance
> + */
> +static bool ufshcd_utmrl_is_rsr_enabled(struct ufs_hba *hba)
> +{
> +	return ufshcd_readl(hba, REG_UTP_TASK_REQ_LIST_RUN_STOP) & 0x1;
> +}
> +
> +/**
> + * ufshcd_complete_pending_tasks - complete outstanding tasks
> + * @hba: per adapter instance
> + *
> + * Abort in-progress task management commands and wakeup
> + * waiting threads.
> + *
> + * Returns non-zero error value when failed to clear all the commands.
> + */
> +static int ufshcd_complete_pending_tasks(struct ufs_hba *hba)
> +{
> +	u32 reg;
> +	int err = 0;
> +	unsigned long flags;
> +
> +	if (!hba->outstanding_tasks)
> +		goto out;
> +
> +	/* Clear UTMRL only when run-stop is enabled */
> +	if (ufshcd_utmrl_is_rsr_enabled(hba))
> +		ufshcd_writel(hba, ~hba->outstanding_tasks,
> +				REG_UTP_TASK_REQ_LIST_CLEAR);
> +
> +	/* poll for max. 1 sec to clear door bell register by h/w */
> +	reg = ufshcd_wait_for_register(hba,
> +			REG_UTP_TASK_REQ_DOOR_BELL,
> +			hba->outstanding_tasks, 0, 1000, 1000);
> +	if (reg & hba->outstanding_tasks)
> +		err = -ETIMEDOUT;
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	/* complete commands that were cleared out */
> +	ufshcd_tmc_handler(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +out:
> +	if (err)
> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
> +				__func__, reg);
> +	return err;
> +}
> +
> +/**
> + * ufshcd_complete_pending_reqs - complete outstanding requests
> + * @hba: per adapter instance
> + *
> + * Abort in-progress transfer request commands and return them to SCSI.
> + *
> + * Returns non-zero error value when failed to clear all the commands.
> + */
> +static int ufshcd_complete_pending_reqs(struct ufs_hba *hba)
> +{
> +	u32 reg;
> +	int err = 0;
> +	unsigned long flags;
> +
> +	/* check if we completed all of them */
> +	if (!hba->outstanding_reqs)
> +		goto out;
> +
> +	/* Clear UTRL only when run-stop is enabled */
> +	if (ufshcd_utrl_is_rsr_enabled(hba))
> +		ufshcd_writel(hba, ~hba->outstanding_reqs,
> +				REG_UTP_TRANSFER_REQ_LIST_CLEAR);
> +
> +	/* poll for max. 1 sec to clear door bell register by h/w */
> +	reg = ufshcd_wait_for_register(hba,
> +			REG_UTP_TRANSFER_REQ_DOOR_BELL,
> +			hba->outstanding_reqs, 0, 1000, 1000);
> +	if (reg & hba->outstanding_reqs)
> +		err = -ETIMEDOUT;
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	/* complete commands that were cleared out */
> +	ufshcd_transfer_req_compl(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +out:
> +	if (err)
> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
> +				__func__, reg);
> +	return err;
> +}
> +
> +/**
>   * ufshcd_fatal_err_handler - handle fatal errors
>   * @hba: per adapter instance
>   */
> @@ -2306,8 +2439,12 @@ static void ufshcd_err_handler(struct ufs_hba *hba)
>  	}
>  	return;
>  fatal_eh:
> -	hba->ufshcd_state = UFSHCD_STATE_ERROR;
> -	schedule_work(&hba->feh_workq);
> +	/* handle fatal errors only when link is functional */
> +	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
> +		/* block commands at driver layer until error is handled */
> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
> +		schedule_work(&hba->feh_workq);
> +	}
>  }
>
>  /**
> @@ -2475,75 +2612,155 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
>  }
>
>  /**
> - * ufshcd_device_reset - reset device and abort all the pending commands
> - * @cmd: SCSI command pointer
> + * ufshcd_dme_end_point_reset - Notify device Unipro to perform reset
> + * @hba: per adapter instance
>   *
> - * Returns SUCCESS/FAILED
> + * UIC_CMD_DME_END_PT_RST resets the UFS device completely, the UFS flags,
> + * attributes and descriptors are reset to default state. Callers are
> + * expected to initialize the whole device again after this.
> + *
> + * Returns zero on success, non-zero on failure
>   */
> -static int ufshcd_device_reset(struct scsi_cmnd *cmd)
> +static int ufshcd_dme_end_point_reset(struct ufs_hba *hba)
>  {
> -	struct Scsi_Host *host;
> -	struct ufs_hba *hba;
> -	unsigned int tag;
> -	u32 pos;
> -	int err;
> -	u8 resp;
> -	struct ufshcd_lrb *lrbp;
> +	struct uic_command uic_cmd = {0};
> +	int ret;
>
> -	host = cmd->device->host;
> -	hba = shost_priv(host);
> -	tag = cmd->request->tag;
> +	uic_cmd.command = UIC_CMD_DME_END_PT_RST;
>
> -	lrbp = &hba->lrb[tag];
> -	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> -			UFS_LOGICAL_RESET, &resp);
> -	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> -		err = FAILED;
> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> +	if (ret)
> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> +
> +	return ret;
> +}
> +
> +/**
> + * ufshcd_dme_reset - Local UniPro reset
> + * @hba: per adapter instance
> + *
> + * Returns zero on success, non-zero on failure
> + */
> +static int ufshcd_dme_reset(struct ufs_hba *hba)
> +{
> +	struct uic_command uic_cmd = {0};
> +	int ret;
> +
> +	uic_cmd.command = UIC_CMD_DME_RESET;
> +
> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> +	if (ret)
> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> +
> +	return ret;
> +
> +}
> +
> +/**
> + * ufshcd_dme_enable - Local UniPro DME Enable
> + * @hba: per adapter instance
> + *
> + * Returns zero on success, non-zero on failure
> + */
> +static int ufshcd_dme_enable(struct ufs_hba *hba)
> +{
> +	struct uic_command uic_cmd = {0};
> +	int ret;
> +	uic_cmd.command = UIC_CMD_DME_ENABLE;
> +
> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> +	if (ret)
> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> +
> +	return ret;
> +
> +}
> +
> +/**
> + * ufshcd_device_reset_and_restore - reset and restore device
> + * @hba: per-adapter instance
> + *
> + * Note that the device reset issues DME_END_POINT_RESET which
> + * may reset entire device and restore device attributes to
> + * default state.
> + *
> + * Returns zero on success, non-zero on failure
> + */
> +static int ufshcd_device_reset_and_restore(struct ufs_hba *hba)
> +{
> +	int err = 0;
> +	u32 reg;
> +
> +	err = ufshcd_dme_end_point_reset(hba);
> +	if (err)
> +		goto out;
> +
> +	/* restore communication with the device */
> +	err = ufshcd_dme_reset(hba);
> +	if (err)
>  		goto out;
> -	} else {
> -		err = SUCCESS;
> -	}
>
> -	for (pos = 0; pos < hba->nutrs; pos++) {
> -		if (test_bit(pos, &hba->outstanding_reqs) &&
> -		    (hba->lrb[tag].lun == hba->lrb[pos].lun)) {
> +	err = ufshcd_dme_enable(hba);
> +	if (err)
> +		goto out;
>
> -			/* clear the respective UTRLCLR register bit */
> -			ufshcd_utrl_clear(hba, pos);
> +	err = ufshcd_dme_link_startup(hba);
> +	if (err)
> +		goto out;
>
> -			clear_bit(pos, &hba->outstanding_reqs);
> +	/* check if link is up and device is detected */
> +	reg = ufshcd_readl(hba, REG_CONTROLLER_STATUS);
> +	if (!ufshcd_is_device_present(reg)) {
> +		dev_err(hba->dev, "Device not present\n");
> +		err = -ENXIO;
> +		goto out;
> +	}
>
> -			if (hba->lrb[pos].cmd) {
> -				scsi_dma_unmap(hba->lrb[pos].cmd);
> -				hba->lrb[pos].cmd->result =
> -					DID_ABORT << 16;
> -				hba->lrb[pos].cmd->scsi_done(cmd);
> -				hba->lrb[pos].cmd = NULL;
> -				clear_bit_unlock(pos, &hba->lrb_in_use);
> -				wake_up(&hba->dev_cmd.tag_wq);
> -			}
> -		}
> -	} /* end of for */
> +	ufshcd_clear_device_reset_pending(hba);
>  out:
> +	dev_dbg(hba->dev, "%s: done err = %d\n", __func__, err);
>  	return err;
>  }
>
>  /**
> - * ufshcd_host_reset - Main reset function registered with scsi layer
> - * @cmd: SCSI command pointer
> + * ufshcd_host_reset_and_restore - reset and restore host controller
> + * @hba: per-adapter instance
>   *
> - * Returns SUCCESS/FAILED
> + * Note that host controller reset may issue DME_RESET to
> + * local and remote (device) Uni-Pro stack and the attributes
> + * are reset to default state.
> + *
> + * Returns zero on success, non-zero on failure
>   */
> -static int ufshcd_host_reset(struct scsi_cmnd *cmd)
> +static int ufshcd_host_reset_and_restore(struct ufs_hba *hba)
>  {
> -	struct ufs_hba *hba;
> +	int err;
> +	async_cookie_t cookie;
> +	unsigned long flags;
>
> -	hba = shost_priv(cmd->device->host);
> +	/* Reset the host controller */
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	ufshcd_hba_stop(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>
> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
> -		return SUCCESS;
> +	err = ufshcd_hba_enable(hba);
> +	if (err)
> +		goto out;
>
> -	return ufshcd_do_reset(hba);
> +	/* Establish the link again and restore the device */
> +	cookie = async_schedule(ufshcd_async_scan, hba);
> +	/* wait for async scan to be completed */
> +	async_synchronize_cookie(++cookie);
> +	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL)
> +		err = -EIO;
> +out:
> +	if (err)
> +		dev_err(hba->dev, "%s: Host init failed %d\n", __func__, err);
> +	else
> +		ufshcd_clear_host_reset_pending(hba);
> +
> +	dev_dbg(hba->dev, "%s: done err = %d\n", __func__, err);
> +	return err;
>  }
>
>  /**
> @@ -2644,6 +2861,134 @@ out:
>  }
>
>  /**
> + * ufshcd_reset_and_restore - resets device or host or both
> + * @hba: per-adapter instance
> + *
> + * Reset and recover device, host and re-establish link. This
> + * is helpful to recover the communication in fatal error conditions.
> + *
> + * Returns zero on success, non-zero on failure
> + */
> +static int ufshcd_reset_and_restore(struct ufs_hba *hba)
> +{
> +	int err = 0;
> +
> +	if (ufshcd_device_reset_pending(hba) &&
> +			!ufshcd_host_reset_pending(hba)) {
> +		err = ufshcd_device_reset_and_restore(hba);
> +		if (err) {
> +			ufshcd_clear_device_reset_pending(hba);
> +			ufshcd_set_host_reset_pending(hba);
> +		}
> +	}
> +
> +	if (ufshcd_host_reset_pending(hba))
> +		err = ufshcd_host_reset_and_restore(hba);
> +
> +	/*
> +	 * Due to reset the door-bell might be cleared, clear
> +	 * outstanding requests in s/w here.
> +	 */
> +	ufshcd_complete_pending_reqs(hba);
> +	ufshcd_complete_pending_tasks(hba);
> +
> +	return err;
> +}
> +
> +/**
> + * ufshcd_eh_device_reset_handler - device reset handler registered to
> + *                                    scsi layer.
> + * @cmd - SCSI command pointer
> + *
> + * Returns SUCCESS/FAILED
> + */
> +static int ufshcd_eh_device_reset_handler(struct scsi_cmnd *cmd)
> +{
> +	struct ufs_hba *hba;
> +	int err;
> +	unsigned long flags;
> +
> +	hba = shost_priv(cmd->device->host);
> +
> +	/*
> +	 * Check if there is any race with fatal error handling.
> +	 * If so, wait for it to complete. Even though fatal error
> +	 * handling does reset and restore in some cases, don't assume
> +	 * anything out of it. We are just avoiding race here.
> +	 */
> +	do {
> +		spin_lock_irqsave(hba->host->host_lock, flags);
> +		if (!(work_pending(&hba->feh_workq) ||
> +				hba->ufshcd_state == UFSHCD_STATE_RESET))
> +			break;
> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
> +		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
> +		flush_work_sync(&hba->feh_workq);
> +	} while (1);
> +
> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
> +	ufshcd_set_device_reset_pending(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	err = ufshcd_reset_and_restore(hba);
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	if (!err) {
> +		err = SUCCESS;
> +		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> +	} else {
> +		err = FAILED;
> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
> +	}
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	return err;
> +}
> +
> +/**
> + * ufshcd_eh_host_reset_handler - host reset handler registered to scsi layer
> + * @cmd - SCSI command pointer
> + *
> + * Returns SUCCESS/FAILED
> + */
> +static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
> +{
> +	struct ufs_hba *hba;
> +	int err;
> +	unsigned long flags;
> +
> +	hba = shost_priv(cmd->device->host);
> +
> +	do {
> +		spin_lock_irqsave(hba->host->host_lock, flags);
> +		if (!(work_pending(&hba->feh_workq) ||
> +				hba->ufshcd_state == UFSHCD_STATE_RESET))
> +			break;
> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
> +		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
> +		flush_work_sync(&hba->feh_workq);
> +	} while (1);
> +
> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
> +	ufshcd_set_host_reset_pending(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	err = ufshcd_reset_and_restore(hba);
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	if (!err) {
> +		err = SUCCESS;
> +		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> +	} else {
> +		err = FAILED;
> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
> +	}
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	return err;
> +}
> +
> +/**
>   * ufshcd_async_scan - asynchronous execution for link startup
>   * @data: data pointer to pass to this function
>   * @cookie: cookie data
> @@ -2667,8 +3012,14 @@ static void ufshcd_async_scan(void *data, async_cookie_t cookie)
>
>  	hba->auto_bkops_enabled = false;
>  	ufshcd_enable_auto_bkops(hba);
> -	scsi_scan_host(hba->host);
> -	pm_runtime_put_sync(hba->dev);
> +	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> +
> +	/* If we are in error handling context no need to scan the host */
> +	if (!(ufshcd_device_reset_pending(hba) ||
> +			ufshcd_host_reset_pending(hba))) {
> +		scsi_scan_host(hba->host);
> +		pm_runtime_put_sync(hba->dev);
> +	}
>  out:
>  	return;
>  }
> @@ -2681,8 +3032,8 @@ static struct scsi_host_template ufshcd_driver_template = {
>  	.slave_alloc		= ufshcd_slave_alloc,
>  	.slave_destroy		= ufshcd_slave_destroy,
>  	.eh_abort_handler	= ufshcd_abort,
> -	.eh_device_reset_handler = ufshcd_device_reset,
> -	.eh_host_reset_handler	= ufshcd_host_reset,
> +	.eh_device_reset_handler = ufshcd_eh_device_reset_handler,
> +	.eh_host_reset_handler   = ufshcd_eh_host_reset_handler,
>  	.this_id		= -1,
>  	.sg_tablesize		= SG_ALL,
>  	.cmd_per_lun		= UFSHCD_CMD_PER_LUN,
> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
> index 5d4542c..7fcedd0 100644
> --- a/drivers/scsi/ufs/ufshcd.h
> +++ b/drivers/scsi/ufs/ufshcd.h
> @@ -179,6 +179,7 @@ struct ufs_dev_cmd {
>   * @tm_condition: condition variable for task management
>   * @tm_slots_in_use: bit map of task management request slots in use
>   * @ufshcd_state: UFSHCD states
> + * @eh_flags: Error handling flags
>   * @intr_mask: Interrupt Mask Bits
>   * @ee_ctrl_mask: Exception event control mask
>   * @feh_workq: Work queue for fatal controller error handling
> @@ -224,6 +225,7 @@ struct ufs_hba {
>  	unsigned long tm_slots_in_use;
>
>  	u32 ufshcd_state;
> +	u32 eh_flags;
>  	u32 intr_mask;
>  	u16 ee_ctrl_mask;
>
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation.
>
>
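
The escalation order in ufshcd_reset_and_restore() quoted above (try a
device reset first, fall back to a host reset when it fails) depends
only on the two eh_flags bits. The sketch below models just that flag
logic in userspace. struct reset_model and the model_* functions are
hypothetical, and the reset outcomes are injected as plain integers
instead of coming from hardware.

```c
#include <assert.h>

enum {
	EH_HOST_RESET_PENDING   = 1 << 0,
	EH_DEVICE_RESET_PENDING = 1 << 1,
};

/* Hypothetical model state; reset results are injected by the caller. */
struct reset_model {
	unsigned int eh_flags;
	int device_reset_result; /* 0 on success, negative on failure */
	int host_reset_result;   /* 0 on success, negative on failure */
	int device_resets;       /* how many device resets were attempted */
	int host_resets;         /* how many host resets were attempted */
};

/* A successful device reset clears its own pending flag. */
static int model_device_reset(struct reset_model *m)
{
	m->device_resets++;
	if (!m->device_reset_result)
		m->eh_flags &= ~EH_DEVICE_RESET_PENDING;
	return m->device_reset_result;
}

/* A successful host reset clears its own pending flag. */
static int model_host_reset(struct reset_model *m)
{
	m->host_resets++;
	if (!m->host_reset_result)
		m->eh_flags &= ~EH_HOST_RESET_PENDING;
	return m->host_reset_result;
}

/* Device reset first; escalate to a host reset if that fails. */
static int model_reset_and_restore(struct reset_model *m)
{
	int err = 0;

	if ((m->eh_flags & EH_DEVICE_RESET_PENDING) &&
	    !(m->eh_flags & EH_HOST_RESET_PENDING)) {
		err = model_device_reset(m);
		if (err) {
			m->eh_flags &= ~EH_DEVICE_RESET_PENDING;
			m->eh_flags |= EH_HOST_RESET_PENDING;
		}
	}

	if (m->eh_flags & EH_HOST_RESET_PENDING)
		err = model_host_reset(m);

	return err;
}
```

A failed device reset followed by a successful host reset returns 0
overall, which the handlers in the patch then map to SUCCESS.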


-- 
Maya Erez
QUALCOMM ISRAEL, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling
  2013-07-09  9:16 ` [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling Sujit Reddy Thumma
@ 2013-07-09 10:43   ` merez
  2013-07-19 13:58   ` Seungwon Jeon
  1 sibling, 0 replies; 27+ messages in thread
From: merez @ 2013-07-09 10:43 UTC (permalink / raw)
  Cc: Vinayak Holikatti, Santosh Y, James E.J. Bottomley, linux-scsi,
	Sujit Reddy Thumma, linux-arm-msm

Tested with error injection.

Tested-by: Maya Erez <merez@codeaurora.org>

> Error handling in the UFS driver is broken: it resets the host controller
> for fatal errors without re-initialization. Correct the fatal error
> handling sequence according to the UFS Host Controller Interface (HCI)
> v1.1 specification.
>
> o Upon determining a fatal error condition, the host controller may hang
>   forever until a reset is applied, so just retrying the command doesn't
>   work without a reset. So, the reset is applied in the driver context
>   in a separate work item and the SCSI mid-layer isn't informed until
>   the reset is applied.
>
> o Processed requests which are completed without error are reported to
>   the SCSI layer as successful, and any pending commands that are not
>   started yet or are not the cause of the error are re-queued into the
>   SCSI mid-layer queue. For the command that caused the error, the host
>   controller or device is reset and DID_ERROR is returned for command
>   retry after applying the reset.
>
> o SCSI is informed about the expected Unit-Attention exception from the
>   device for the immediate command after a reset so that the SCSI layer
>   takes the necessary steps to establish communication with the device.
>
> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> ---
>  drivers/scsi/ufs/ufshcd.c |  349 +++++++++++++++++++++++++++++++++++---------
>  drivers/scsi/ufs/ufshcd.h |    2 +
>  drivers/scsi/ufs/ufshci.h |   19 ++-
>  3 files changed, 295 insertions(+), 75 deletions(-)
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index b4c9910..2a3874f 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -80,6 +80,14 @@ enum {
>  	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>  };
>
> +/* UFSHCD UIC layer error flags */
> +enum {
> +	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0), /* Data link layer error */
> +	UFSHCD_UIC_NL_ERROR = (1 << 1), /* Network layer error */
> +	UFSHCD_UIC_TL_ERROR = (1 << 2), /* Transport Layer error */
> +	UFSHCD_UIC_DME_ERROR = (1 << 3), /* DME error */
> +};
> +
>  /* Interrupt configuration options */
>  enum {
>  	UFSHCD_INT_DISABLE,
> @@ -108,6 +116,7 @@ enum {
>
>  static void ufshcd_tmc_handler(struct ufs_hba *hba);
>  static void ufshcd_async_scan(void *data, async_cookie_t cookie);
> +static int ufshcd_reset_and_restore(struct ufs_hba *hba);
>
>  /*
>   * ufshcd_wait_for_register - wait for register value to change
> @@ -1605,9 +1614,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>  		goto out;
>  	}
>
> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
> -		scsi_unblock_requests(hba->host);
> -
>  out:
>  	return err;
>  }
> @@ -1733,66 +1739,6 @@ static int ufshcd_validate_dev_connection(struct ufs_hba *hba)
>  }
>
>  /**
> - * ufshcd_do_reset - reset the host controller
> - * @hba: per adapter instance
> - *
> - * Returns SUCCESS/FAILED
> - */
> -static int ufshcd_do_reset(struct ufs_hba *hba)
> -{
> -	struct ufshcd_lrb *lrbp;
> -	unsigned long flags;
> -	int tag;
> -
> -	/* block commands from midlayer */
> -	scsi_block_requests(hba->host);
> -
> -	spin_lock_irqsave(hba->host->host_lock, flags);
> -	hba->ufshcd_state = UFSHCD_STATE_RESET;
> -
> -	/* send controller to reset state */
> -	ufshcd_hba_stop(hba);
> -	spin_unlock_irqrestore(hba->host->host_lock, flags);
> -
> -	/* abort outstanding commands */
> -	for (tag = 0; tag < hba->nutrs; tag++) {
> -		if (test_bit(tag, &hba->outstanding_reqs)) {
> -			lrbp = &hba->lrb[tag];
> -			if (lrbp->cmd) {
> -				scsi_dma_unmap(lrbp->cmd);
> -				lrbp->cmd->result = DID_RESET << 16;
> -				lrbp->cmd->scsi_done(lrbp->cmd);
> -				lrbp->cmd = NULL;
> -				clear_bit_unlock(tag, &hba->lrb_in_use);
> -			}
> -		}
> -	}
> -
> -	/* complete device management command */
> -	if (hba->dev_cmd.complete)
> -		complete(hba->dev_cmd.complete);
> -
> -	/* clear outstanding request/task bit maps */
> -	hba->outstanding_reqs = 0;
> -	hba->outstanding_tasks = 0;
> -
> -	/* Host controller enable */
> -	if (ufshcd_hba_enable(hba)) {
> -		dev_err(hba->dev,
> -			"Reset: Controller initialization failed\n");
> -		return FAILED;
> -	}
> -
> -	if (ufshcd_link_startup(hba)) {
> -		dev_err(hba->dev,
> -			"Reset: Link start-up failed\n");
> -		return FAILED;
> -	}
> -
> -	return SUCCESS;
> -}
> -
> -/**
>   * ufshcd_slave_alloc - handle initial SCSI device configurations
>   * @sdev: pointer to SCSI device
>   *
> @@ -1809,6 +1755,9 @@ static int ufshcd_slave_alloc(struct scsi_device *sdev)
>  	sdev->use_10_for_ms = 1;
>  	scsi_set_tag_type(sdev, MSG_SIMPLE_TAG);
>
> +	/* allow SCSI layer to restart the device in case of errors */
> +	sdev->allow_restart = 1;
> +
>  	/*
>  	 * Inform SCSI Midlayer that the LUN queue depth is same as the
>  	 * controller queue depth. If a LUN queue depth is less than the
> @@ -2013,6 +1962,9 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
>  	case OCS_ABORTED:
>  		result |= DID_ABORT << 16;
>  		break;
> +	case OCS_INVALID_COMMAND_STATUS:
> +		result |= DID_REQUEUE << 16;
> +		break;
>  	case OCS_INVALID_CMD_TABLE_ATTR:
>  	case OCS_INVALID_PRDT_ATTR:
>  	case OCS_MISMATCH_DATA_BUF_SIZE:
> @@ -2405,42 +2357,295 @@ out:
>  	return err;
>  }
>
> +static void ufshcd_decide_eh_xfer_req(struct ufs_hba *hba, u32 ocs)
> +{
> +	switch (ocs) {
> +	case OCS_SUCCESS:
> +	case OCS_INVALID_COMMAND_STATUS:
> +		break;
> +	case OCS_MISMATCH_DATA_BUF_SIZE:
> +	case OCS_MISMATCH_RESP_UPIU_SIZE:
> +	case OCS_PEER_COMM_FAILURE:
> +	case OCS_FATAL_ERROR:
> +	case OCS_ABORTED:
> +	case OCS_INVALID_CMD_TABLE_ATTR:
> +	case OCS_INVALID_PRDT_ATTR:
> +		ufshcd_set_host_reset_pending(hba);
> +		break;
> +	default:
> +		dev_err(hba->dev, "%s: unknown OCS 0x%x\n",
> +				__func__, ocs);
> +		BUG();
> +	}
> +}
> +
> +static void ufshcd_decide_eh_task_req(struct ufs_hba *hba, u32 ocs)
> +{
> +	switch (ocs) {
> +	case OCS_TMR_SUCCESS:
> +	case OCS_TMR_INVALID_COMMAND_STATUS:
> +		break;
> +	case OCS_TMR_MISMATCH_REQ_SIZE:
> +	case OCS_TMR_MISMATCH_RESP_SIZE:
> +	case OCS_TMR_PEER_COMM_FAILURE:
> +	case OCS_TMR_INVALID_ATTR:
> +	case OCS_TMR_ABORTED:
> +	case OCS_TMR_FATAL_ERROR:
> +		ufshcd_set_host_reset_pending(hba);
> +		break;
> +	default:
> +		dev_err(hba->dev, "%s: unknown TMR OCS 0x%x\n",
> +				__func__, ocs);
> +		BUG();
> +	}
> +}
> +
>  /**
> - * ufshcd_fatal_err_handler - handle fatal errors
> + * ufshcd_error_autopsy_transfer_req() - reads OCS field of failed command and
> + *                          decide error handling
>   * @hba: per adapter instance
> + * @err_xfer: bit mask for transfer request errors
> + *
> + * Iterate over completed transfer requests and
> + * set error handling flags.
> + */
> +static void
> +ufshcd_error_autopsy_transfer_req(struct ufs_hba *hba, u32 *err_xfer)
> +{
> +	unsigned long completed;
> +	u32 doorbell;
> +	int index;
> +	int ocs;
> +
> +	if (!err_xfer)
> +		goto out;
> +
> +	doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
> +	completed = doorbell ^ (u32)hba->outstanding_reqs;
> +
> +	for (index = 0; index < hba->nutrs; index++) {
> +		if (test_bit(index, &completed)) {
> +			ocs = ufshcd_get_tr_ocs(&hba->lrb[index]);
> +			if ((ocs == OCS_SUCCESS) ||
> +					(ocs == OCS_INVALID_COMMAND_STATUS))
> +				continue;
> +
> +			*err_xfer |= (1 << index);
> +			ufshcd_decide_eh_xfer_req(hba, ocs);
> +		}
> +	}
> +out:
> +	return;
> +}
> +
> +/**
> + * ufshcd_error_autopsy_task_req() - reads OCS field of failed command and
> + *                          decide error handling
> + * @hba: per adapter instance
> + * @err_tm: bit mask for task management errors
> + *
> + * Iterate over completed task management requests and
> + * set error handling flags.
> + */
> +static void
> +ufshcd_error_autopsy_task_req(struct ufs_hba *hba, u32 *err_tm)
> +{
> +	unsigned long completed;
> +	u32 doorbell;
> +	int index;
> +	int ocs;
> +
> +	if (!err_tm)
> +		goto out;
> +
> +	doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
> +	completed = doorbell ^ (u32)hba->outstanding_tasks;
> +
> +	for (index = 0; index < hba->nutmrs; index++) {
> +		if (test_bit(index, &completed)) {
> +			struct utp_task_req_desc *tm_descp;
> +
> +			tm_descp = hba->utmrdl_base_addr;
> +			ocs = ufshcd_get_tmr_ocs(&tm_descp[index]);
> +			if ((ocs == OCS_TMR_SUCCESS) ||
> +					(ocs == OCS_TMR_INVALID_COMMAND_STATUS))
> +				continue;
> +
> +			*err_tm |= (1 << index);
> +			ufshcd_decide_eh_task_req(hba, ocs);
> +		}
> +	}
> +
> +out:
> +	return;
> +}
> +
> +/**
> + * ufshcd_fatal_err_handler - handle fatal errors
> + * @work: pointer to work structure
>   */
>  static void ufshcd_fatal_err_handler(struct work_struct *work)
>  {
>  	struct ufs_hba *hba;
> +	unsigned long flags;
> +	u32 err_xfer = 0;
> +	u32 err_tm = 0;
> +	int err;
> +
>  	hba = container_of(work, struct ufs_hba, feh_workq);
>
>  	pm_runtime_get_sync(hba->dev);
> -	/* check if reset is already in progress */
> -	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
> -		ufshcd_do_reset(hba);
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
> +		/* complete processed requests and exit */
> +		ufshcd_transfer_req_compl(hba);
> +		ufshcd_tmc_handler(hba);
> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
> +		pm_runtime_put_sync(hba->dev);
> +		return;
> +	}
> +
> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
> +	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
> +	ufshcd_error_autopsy_task_req(hba, &err_tm);
> +
> +	/*
> +	 * Complete successful and pending transfer requests.
> +	 * DID_REQUEUE is returned for pending requests as they have
> +	 * nothing to do with error'ed request and SCSI layer should
> +	 * not treat them as errors and decrement retry count.
> +	 */
> +	hba->outstanding_reqs &= ~err_xfer;
> +	ufshcd_transfer_req_compl(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +	ufshcd_complete_pending_reqs(hba);
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	hba->outstanding_reqs |= err_xfer;
> +
> +	/* Complete successful and pending task requests */
> +	hba->outstanding_tasks &= ~err_tm;
> +	ufshcd_tmc_handler(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +	ufshcd_complete_pending_tasks(hba);
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +
> +	hba->outstanding_tasks |= err_tm;
> +
> +	/*
> +	 * Controller may generate multiple fatal errors, handle
> +	 * errors based on severity.
> +	 * 1) DEVICE_FATAL_ERROR
> +	 * 2) SYSTEM_BUS/CONTROLLER_FATAL_ERROR
> +	 * 3) UIC_ERROR
> +	 */
> +	if (hba->errors & DEVICE_FATAL_ERROR) {
> +		/*
> +		 * Some HBAs may not clear UTRLDBR/UTMRLDBR or update
> +		 * OCS field on device fatal error.
> +		 */
> +		ufshcd_set_host_reset_pending(hba);
> +	} else if (hba->errors & (SYSTEM_BUS_FATAL_ERROR |
> +			CONTROLLER_FATAL_ERROR)) {
> +		/* eh flags should be set in err autopsy based on OCS values */
> +		if (!hba->eh_flags)
> +			WARN(1, "%s: fatal error without error handling\n",
> +				dev_name(hba->dev));
> +	} else if (hba->errors & UIC_ERROR) {
> +		if (hba->uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR) {
> +			/* fatal error - reset controller */
> +			ufshcd_set_host_reset_pending(hba);
> +		} else if (hba->uic_error & (UFSHCD_UIC_NL_ERROR |
> +					UFSHCD_UIC_TL_ERROR |
> +					UFSHCD_UIC_DME_ERROR)) {
> +			/* non-fatal, report error to SCSI layer */
> +			if (!hba->eh_flags) {
> +				spin_unlock_irqrestore(
> +						hba->host->host_lock, flags);
> +				ufshcd_complete_pending_reqs(hba);
> +				ufshcd_complete_pending_tasks(hba);
> +				spin_lock_irqsave(hba->host->host_lock, flags);
> +			}
> +		}
> +	}
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	if (hba->eh_flags) {
> +		err = ufshcd_reset_and_restore(hba);
> +		if (err) {
> +			ufshcd_clear_host_reset_pending(hba);
> +			ufshcd_clear_device_reset_pending(hba);
> +			dev_err(hba->dev, "%s: reset and restore failed\n",
> +					__func__);
> +			hba->ufshcd_state = UFSHCD_STATE_ERROR;
> +		}
> +		/*
> +		 * Inform scsi mid-layer that we did reset and allow to handle
> +		 * Unit Attention properly.
> +		 */
> +		scsi_report_bus_reset(hba->host, 0);
> +		hba->errors = 0;
> +		hba->uic_error = 0;
> +	}
> +	scsi_unblock_requests(hba->host);
>  	pm_runtime_put_sync(hba->dev);
>  }
>
>  /**
> - * ufshcd_err_handler - Check for fatal errors
> - * @work: pointer to a work queue structure
> + * ufshcd_update_uic_error - check and set fatal UIC error flags.
> + * @hba: per-adapter instance
>   */
> -static void ufshcd_err_handler(struct ufs_hba *hba)
> +static void ufshcd_update_uic_error(struct ufs_hba *hba)
>  {
>  	u32 reg;
>
> +	/* PA_INIT_ERROR is fatal and needs UIC reset */
> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
> +	if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
> +		hba->uic_error |= UFSHCD_UIC_DL_PA_INIT_ERROR;
> +
> +	/* UIC NL/TL/DME errors need software retry */
> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_NETWORK_LAYER);
> +	if (reg)
> +		hba->uic_error |= UFSHCD_UIC_NL_ERROR;
> +
> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_TRANSPORT_LAYER);
> +	if (reg)
> +		hba->uic_error |= UFSHCD_UIC_TL_ERROR;
> +
> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DME);
> +	if (reg)
> +		hba->uic_error |= UFSHCD_UIC_DME_ERROR;
> +
> +	dev_dbg(hba->dev, "%s: UIC error flags = 0x%08x\n",
> +			__func__, hba->uic_error);
> +}
> +
> +/**
> + * ufshcd_err_handler - Check for fatal errors
> + * @hba: per-adapter instance
> + */
> +static void ufshcd_err_handler(struct ufs_hba *hba)
> +{
>  	if (hba->errors & INT_FATAL_ERRORS)
>  		goto fatal_eh;
>
>  	if (hba->errors & UIC_ERROR) {
> -		reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
> -		if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
> +		hba->uic_error = 0;
> +		ufshcd_update_uic_error(hba);
> +		if (hba->uic_error)
>  			goto fatal_eh;
>  	}
> +	/*
> +	 * Other errors are either non-fatal or completed by the
> +	 * controller by updating OCS fields with success/failure.
> +	 */
>  	return;
> +
>  fatal_eh:
>  	/* handle fatal errors only when link is functional */
>  	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
> +		/* block commands from midlayer */
> +		scsi_block_requests(hba->host);
>  		/* block commands at driver layer until error is handled */
>  		hba->ufshcd_state = UFSHCD_STATE_ERROR;
>  		schedule_work(&hba->feh_workq);
> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
> index 7fcedd0..4ee4d1a 100644
> --- a/drivers/scsi/ufs/ufshcd.h
> +++ b/drivers/scsi/ufs/ufshcd.h
> @@ -185,6 +185,7 @@ struct ufs_dev_cmd {
>   * @feh_workq: Work queue for fatal controller error handling
>   * @eeh_work: Worker to handle exception events
>   * @errors: HBA errors
> + * @uic_error: UFS interconnect layer error status
>   * @dev_cmd: ufs device management command information
>   * @auto_bkops_enabled: to track whether bkops is enabled in device
>   */
> @@ -235,6 +236,7 @@ struct ufs_hba {
>
>  	/* HBA Errors */
>  	u32 errors;
> +	u32 uic_error;
>
>  	/* Device management request data */
>  	struct ufs_dev_cmd dev_cmd;
> diff --git a/drivers/scsi/ufs/ufshci.h b/drivers/scsi/ufs/ufshci.h
> index f1e1b74..36f68ef 100644
> --- a/drivers/scsi/ufs/ufshci.h
> +++ b/drivers/scsi/ufs/ufshci.h
> @@ -264,7 +264,7 @@ enum {
>  	UTP_DEVICE_TO_HOST	= 0x04000000,
>  };
>
> -/* Overall command status values */
> +/* Overall command status values for transfer request */
>  enum {
>  	OCS_SUCCESS			= 0x0,
>  	OCS_INVALID_CMD_TABLE_ATTR	= 0x1,
> @@ -274,8 +274,21 @@ enum {
>  	OCS_PEER_COMM_FAILURE		= 0x5,
>  	OCS_ABORTED			= 0x6,
>  	OCS_FATAL_ERROR			= 0x7,
> -	OCS_INVALID_COMMAND_STATUS	= 0x0F,
> -	MASK_OCS			= 0x0F,
> +	OCS_INVALID_COMMAND_STATUS	= 0xF,
> +	MASK_OCS			= 0xFF,
> +};
> +
> +/* Overall command status values for task management request */
> +enum {
> +	OCS_TMR_SUCCESS			= 0x0,
> +	OCS_TMR_INVALID_ATTR		= 0x1,
> +	OCS_TMR_MISMATCH_REQ_SIZE	= 0x2,
> +	OCS_TMR_MISMATCH_RESP_SIZE	= 0x3,
> +	OCS_TMR_PEER_COMM_FAILURE	= 0x4,
> +	OCS_TMR_ABORTED			= 0x5,
> +	OCS_TMR_FATAL_ERROR		= 0x6,
> +	OCS_TMR_INVALID_COMMAND_STATUS	= 0xF,
> +	MASK_OCS_TMR			= 0xFF,
>  };
>
>  /**
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
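
The error-autopsy helpers in the quoted patch derive the set of completed
slots as `doorbell ^ outstanding`. A minimal userspace sketch of that bit
arithmetic follows; completed_mask() is a hypothetical name used only for
illustration, and the sketch assumes every bit still set in the doorbell is
also set in the driver's outstanding bitmap.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): a slot counts as completed
 * when the driver still tracks it as outstanding but the controller has
 * already cleared its doorbell bit. Assumes doorbell is a subset of
 * outstanding, so XOR leaves exactly the cleared bits.
 */
static uint32_t completed_mask(uint32_t doorbell, uint32_t outstanding)
{
	return doorbell ^ outstanding;
}
```

For example, with slots 0, 1 and 2 outstanding (0x7) and slots 0 and 2 still
set in the doorbell (0x5), only slot 1 is reported as completed.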


-- 
Maya Erez
QUALCOMM ISRAEL, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation
  2013-07-09  9:16 ` [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation Sujit Reddy Thumma
  2013-07-09 10:42   ` merez
@ 2013-07-19 13:56   ` Seungwon Jeon
  2013-07-19 18:26     ` Sujit Reddy Thumma
  1 sibling, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-19 13:56 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma', 'Vinayak Holikatti',
	'Santosh Y'
  Cc: 'James E.J. Bottomley', linux-scsi, linux-arm-msm

On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
> Currently, sending Task Management (TM) command to the card might
> be broken in some scenarios as listed below:
> 
> Problem: If there are more than 8 TM commands the implementation
>          returns error to the caller.
> Fix:     Wait for one of the slots to be emptied and send the command.
> 
> Problem: Sometimes it is necessary for the caller to know the TM service
>          response code to determine the task status.
> Fix:     Propagate the service response to the caller.
> 
> Problem: If the TM command times out no proper error recovery is
>          implemented.
> Fix:     Clear the command in the controller door-bell register, so that
>          further commands for the same slot don't fail.
> 
> Problem: While preparing the TM command descriptor, the task tag used
>          should be unique across SCSI/NOP/QUERY/TM commands and not the
> 	 task tag of the command which the TM command is trying to manage.
> Fix:     Use a unique task tag instead of task tag of SCSI command.
> 
> Problem: Since the TM command involves H/W communication, abruptly ending
>          the request on kill interrupt signal might cause h/w malfunction.
> Fix:     Wait for hardware completion interrupt with TASK_UNINTERRUPTIBLE
>          set.
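
The slot-locking scheme described above (take a free slot, hold it until the
command is done, then release it) can be sketched in userspace C as below.
This is only an illustration under stated assumptions: it uses a C11
compare-and-swap loop instead of the kernel's find_first_zero_bit() and
test_and_set_bit_lock(), and NUM_TM_SLOTS, tm_slot_get() and tm_slot_put()
are hypothetical names, not the driver's.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_TM_SLOTS 8 /* assumption: eight task management slots */

/* Claim the lowest free slot; returns false when all slots are in use. */
static bool tm_slot_get(atomic_ulong *in_use, int *slot)
{
	unsigned long cur = atomic_load(in_use);

	for (;;) {
		int tag = -1;

		/* find the first zero bit in the snapshot */
		for (int i = 0; i < NUM_TM_SLOTS; i++) {
			if (!(cur & (1UL << i))) {
				tag = i;
				break;
			}
		}
		if (tag < 0)
			return false;

		/* on failure, cur is reloaded with the current value; retry */
		if (atomic_compare_exchange_weak(in_use, &cur,
						 cur | (1UL << tag))) {
			*slot = tag;
			return true;
		}
	}
}

/* Release a slot previously claimed with tm_slot_get(). */
static void tm_slot_put(atomic_ulong *in_use, int slot)
{
	atomic_fetch_and(in_use, ~(1UL << slot));
}
```

A caller that gets `false` would sleep until another caller releases a slot,
which is the role the tm_tag_wq wait queue plays in the patch.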
> 
> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> ---
>  drivers/scsi/ufs/ufshcd.c |  177 ++++++++++++++++++++++++++++++---------------
>  drivers/scsi/ufs/ufshcd.h |    8 ++-
>  2 files changed, 126 insertions(+), 59 deletions(-)
> 
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index af7d01d..a176421 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -53,6 +53,9 @@
>  /* Query request timeout */
>  #define QUERY_REQ_TIMEOUT 30 /* msec */
> 
> +/* Task management command timeout */
> +#define TM_CMD_TIMEOUT	100 /* msecs */
> +
>  /* Expose the flag value from utp_upiu_query.value */
>  #define MASK_QUERY_UPIU_FLAG_LOC 0xFF
> 
> @@ -190,13 +193,35 @@ ufshcd_get_tmr_ocs(struct utp_task_req_desc *task_req_descp)
>  /**
>   * ufshcd_get_tm_free_slot - get a free slot for task management request
>   * @hba: per adapter instance
> + * @free_slot: pointer to variable with available slot value
>   *
> - * Returns maximum number of task management request slots in case of
> - * task management queue full or returns the free slot number
> + * Get a free tag and lock it until ufshcd_put_tm_slot() is called.
> + * Returns 0 if free slot is not available, else return 1 with tag value
> + * in @free_slot.
>   */
> -static inline int ufshcd_get_tm_free_slot(struct ufs_hba *hba)
> +static bool ufshcd_get_tm_free_slot(struct ufs_hba *hba, int *free_slot)
> +{
> +	int tag;
> +	bool ret = false;
> +
> +	if (!free_slot)
> +		goto out;
> +
> +	do {
> +		tag = find_first_zero_bit(&hba->tm_slots_in_use, hba->nutmrs);
> +		if (tag >= hba->nutmrs)
> +			goto out;
> +	} while (test_and_set_bit_lock(tag, &hba->tm_slots_in_use));
> +
> +	*free_slot = tag;
> +	ret = true;
> +out:
> +	return ret;
> +}
> +
> +static inline void ufshcd_put_tm_slot(struct ufs_hba *hba, int slot)
>  {
> -	return find_first_zero_bit(&hba->outstanding_tasks, hba->nutmrs);
> +	clear_bit_unlock(slot, &hba->tm_slots_in_use);
>  }
> 
>  /**
> @@ -1778,10 +1803,11 @@ static void ufshcd_slave_destroy(struct scsi_device *sdev)
>   * ufshcd_task_req_compl - handle task management request completion
>   * @hba: per adapter instance
>   * @index: index of the completed request
> + * @resp: task management service response
>   *
> - * Returns SUCCESS/FAILED
> + * Returns non-zero value on error, zero on success
>   */
> -static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
> +static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index, u8 *resp)
>  {
>  	struct utp_task_req_desc *task_req_descp;
>  	struct utp_upiu_task_rsp *task_rsp_upiup;
> @@ -1802,19 +1828,15 @@ static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
>  				task_req_descp[index].task_rsp_upiu;
>  		task_result = be32_to_cpu(task_rsp_upiup->header.dword_1);
>  		task_result = ((task_result & MASK_TASK_RESPONSE) >> 8);
> -
> -		if (task_result != UPIU_TASK_MANAGEMENT_FUNC_COMPL &&
> -		    task_result != UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED)
> -			task_result = FAILED;
> -		else
> -			task_result = SUCCESS;
> +		if (resp)
> +			*resp = (u8)task_result;
>  	} else {
> -		task_result = FAILED;
> -		dev_err(hba->dev,
> -			"trc: Invalid ocs = %x\n", ocs_value);
> +		dev_err(hba->dev, "%s: failed, ocs = 0x%x\n",
> +				__func__, ocs_value);
>  	}
>  	spin_unlock_irqrestore(hba->host->host_lock, flags);
> -	return task_result;
> +
> +	return ocs_value;
>  }
> 
>  /**
> @@ -2298,7 +2320,7 @@ static void ufshcd_tmc_handler(struct ufs_hba *hba)
> 
>  	tm_doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
>  	hba->tm_condition = tm_doorbell ^ hba->outstanding_tasks;
> -	wake_up_interruptible(&hba->ufshcd_tm_wait_queue);
> +	wake_up(&hba->tm_wq);
>  }
> 
>  /**
> @@ -2348,38 +2370,61 @@ static irqreturn_t ufshcd_intr(int irq, void *__hba)
>  	return retval;
>  }
> 
> +static int ufshcd_clear_tm_cmd(struct ufs_hba *hba, int tag)
> +{
> +	int err = 0;
> +	u32 reg;
> +	u32 mask = 1 << tag;
> +	unsigned long flags;
> +
> +	if (!test_bit(tag, &hba->outstanding_reqs))
> +		goto out;
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	ufshcd_writel(hba, ~(1 << tag), REG_UTP_TASK_REQ_LIST_CLEAR);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	/* poll for max. 1 sec to clear door bell register by h/w */
> +	reg = ufshcd_wait_for_register(hba,
> +			REG_UTP_TASK_REQ_DOOR_BELL,
> +			mask, 0, 1000, 1000);
> +	if ((reg & mask) == mask)
> +		err = -ETIMEDOUT;
> +out:
> +	return err;
> +}
> +
>  /**
>   * ufshcd_issue_tm_cmd - issues task management commands to controller
>   * @hba: per adapter instance
> - * @lrbp: pointer to local reference block
> + * @lun_id: LUN ID to which TM command is sent
> + * @task_id: task ID to which the TM command is applicable
> + * @tm_function: task management function opcode
> + * @tm_response: task management service response return value
>   *
> - * Returns SUCCESS/FAILED
> + * Returns non-zero value on error, zero on success.
>   */
> -static int
> -ufshcd_issue_tm_cmd(struct ufs_hba *hba,
> -		    struct ufshcd_lrb *lrbp,
> -		    u8 tm_function)
> +static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
> +		u8 tm_function, u8 *tm_response)
>  {
>  	struct utp_task_req_desc *task_req_descp;
>  	struct utp_upiu_task_req *task_req_upiup;
>  	struct Scsi_Host *host;
>  	unsigned long flags;
> -	int free_slot = 0;
> +	int free_slot;
>  	int err;
> +	int task_tag;
> 
>  	host = hba->host;
> 
> -	spin_lock_irqsave(host->host_lock, flags);
> -
> -	/* If task management queue is full */
> -	free_slot = ufshcd_get_tm_free_slot(hba);
> -	if (free_slot >= hba->nutmrs) {
> -		spin_unlock_irqrestore(host->host_lock, flags);
> -		dev_err(hba->dev, "Task management queue full\n");
> -		err = FAILED;
> -		goto out;
> -	}
> +	/*
> +	 * Get free slot, sleep if slots are unavailable.
> +	 * Even though we use wait_event() which sleeps indefinitely,
> +	 * the maximum wait time is bounded by %TM_CMD_TIMEOUT.
> +	 */
> +	wait_event(hba->tm_tag_wq, ufshcd_get_tm_free_slot(hba, &free_slot));
> 
> +	spin_lock_irqsave(host->host_lock, flags);
>  	task_req_descp = hba->utmrdl_base_addr;
>  	task_req_descp += free_slot;
> 
> @@ -2391,18 +2436,15 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>  	/* Configure task request UPIU */
>  	task_req_upiup =
>  		(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
> +	task_tag = hba->nutrs + free_slot;
Possibly, did you intend 'hba->nutmrs' rather than 'hba->nutrs'?
I think hba->nutmrs is safer if we can't be sure that NUTRS is larger than NUTMRS.
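
To make the question concrete, here is a small sketch of the tag layout under
discussion. It assumes that transfer requests (SCSI/NOP/QUERY) use tags in
the range [0, nutrs); that is an assumption for illustration, and the helper
names are hypothetical. Whether the concern applies depends on it.

```c
#include <stdbool.h>

/* Hypothetical helper: tag for the TM request in `slot`, offset by `base`. */
static int tm_task_tag(int base, int slot)
{
	return base + slot;
}

/* True if `tag` falls inside the assumed transfer tag range [0, nutrs). */
static bool tag_in_transfer_range(int nutrs, int tag)
{
	return tag >= 0 && tag < nutrs;
}
```

If the assumption holds, then with nutrs = 32 and nutmrs = 8 an offset of
nutrs yields TM tags 32..39, outside the transfer range, while an offset of
nutmrs yields 8..15, inside it.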

>  	task_req_upiup->header.dword_0 =
>  		UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
> -					      lrbp->lun, lrbp->task_tag);
> +				lun_id, task_tag);
>  	task_req_upiup->header.dword_1 =
>  		UPIU_HEADER_DWORD(0, tm_function, 0, 0);
> 
> -	task_req_upiup->input_param1 = lrbp->lun;
> -	task_req_upiup->input_param1 =
> -		cpu_to_be32(task_req_upiup->input_param1);
> -	task_req_upiup->input_param2 = lrbp->task_tag;
> -	task_req_upiup->input_param2 =
> -		cpu_to_be32(task_req_upiup->input_param2);
> +	task_req_upiup->input_param1 = cpu_to_be32(lun_id);
> +	task_req_upiup->input_param2 = cpu_to_be32(task_id);
> 
>  	/* send command to the controller */
>  	__set_bit(free_slot, &hba->outstanding_tasks);
> @@ -2411,20 +2453,24 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>  	spin_unlock_irqrestore(host->host_lock, flags);
> 
>  	/* wait until the task management command is completed */
> -	err =
> -	wait_event_interruptible_timeout(hba->ufshcd_tm_wait_queue,
> -					 (test_bit(free_slot,
> -					 &hba->tm_condition) != 0),
> -					 60 * HZ);
> +	err = wait_event_timeout(hba->tm_wq,
> +			test_bit(free_slot, &hba->tm_condition),
> +			msecs_to_jiffies(TM_CMD_TIMEOUT));
>  	if (!err) {
> -		dev_err(hba->dev,
> -			"Task management command timed-out\n");
> -		err = FAILED;
> -		goto out;
> +		dev_err(hba->dev, "%s: task management cmd 0x%.2x timed-out\n",
> +				__func__, tm_function);
> +		if (ufshcd_clear_tm_cmd(hba, free_slot))
> +			dev_WARN(hba->dev, "%s: unable to clear tm cmd (slot %d) after timeout\n",
> +					__func__, free_slot);
> +		err = -ETIMEDOUT;
> +	} else {
> +		err = ufshcd_task_req_compl(hba, free_slot, tm_response);
>  	}
> +
>  	clear_bit(free_slot, &hba->tm_condition);
> -	err = ufshcd_task_req_compl(hba, free_slot);
> -out:
> +	ufshcd_put_tm_slot(hba, free_slot);
> +	wake_up(&hba->tm_tag_wq);
> +
>  	return err;
>  }
> 
> @@ -2441,14 +2487,22 @@ static int ufshcd_device_reset(struct scsi_cmnd *cmd)
>  	unsigned int tag;
>  	u32 pos;
>  	int err;
> +	u8 resp;
> +	struct ufshcd_lrb *lrbp;
> 
>  	host = cmd->device->host;
>  	hba = shost_priv(host);
>  	tag = cmd->request->tag;
> 
> -	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_LOGICAL_RESET);
> -	if (err == FAILED)
> +	lrbp = &hba->lrb[tag];
> +	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
The 2nd and 3rd arguments can be replaced by lrbp,
which would reduce the number of arguments.

Thanks,
Seungwon Jeon

> +			UFS_LOGICAL_RESET, &resp);
> +	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> +		err = FAILED;
>  		goto out;
> +	} else {
> +		err = SUCCESS;
> +	}
> 
>  	for (pos = 0; pos < hba->nutrs; pos++) {
>  		if (test_bit(pos, &hba->outstanding_reqs) &&
> @@ -2505,6 +2559,8 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	unsigned long flags;
>  	unsigned int tag;
>  	int err;
> +	u8 resp;
> +	struct ufshcd_lrb *lrbp;
> 
>  	host = cmd->device->host;
>  	hba = shost_priv(host);
> @@ -2520,9 +2576,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	}
>  	spin_unlock_irqrestore(host->host_lock, flags);
> 
> -	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_ABORT_TASK);
> -	if (err == FAILED)
> +	lrbp = &hba->lrb[tag];
> +	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> +			UFS_ABORT_TASK, &resp);
> +	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> +		err = FAILED;
>  		goto out;
> +	} else {
> +		err = SUCCESS;
> +	}
> 
>  	scsi_dma_unmap(cmd);
> 
> @@ -2744,7 +2806,8 @@ int ufshcd_init(struct device *dev, struct ufs_hba **hba_handle,
>  	host->max_cmd_len = MAX_CDB_SIZE;
> 
>  	/* Initailize wait queue for task management */
> -	init_waitqueue_head(&hba->ufshcd_tm_wait_queue);
> +	init_waitqueue_head(&hba->tm_wq);
> +	init_waitqueue_head(&hba->tm_tag_wq);
> 
>  	/* Initialize work queues */
>  	INIT_WORK(&hba->feh_workq, ufshcd_fatal_err_handler);
> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
> index 6c9bd35..5d4542c 100644
> --- a/drivers/scsi/ufs/ufshcd.h
> +++ b/drivers/scsi/ufs/ufshcd.h
> @@ -174,8 +174,10 @@ struct ufs_dev_cmd {
>   * @irq: Irq number of the controller
>   * @active_uic_cmd: handle of active UIC command
>   * @uic_cmd_mutex: mutex for uic command
> - * @ufshcd_tm_wait_queue: wait queue for task management
> + * @tm_wq: wait queue for task management
> + * @tm_tag_wq: wait queue for free task management slots
>   * @tm_condition: condition variable for task management
> + * @tm_slots_in_use: bit map of task management request slots in use
>   * @ufshcd_state: UFSHCD states
>   * @intr_mask: Interrupt Mask Bits
>   * @ee_ctrl_mask: Exception event control mask
> @@ -216,8 +218,10 @@ struct ufs_hba {
>  	struct uic_command *active_uic_cmd;
>  	struct mutex uic_cmd_mutex;
> 
> -	wait_queue_head_t ufshcd_tm_wait_queue;
> +	wait_queue_head_t tm_wq;
> +	wait_queue_head_t tm_tag_wq;
>  	unsigned long tm_condition;
> +	unsigned long tm_slots_in_use;
> 
>  	u32 ufshcd_state;
>  	u32 intr_mask;
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command
  2013-07-09  9:16 ` [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command Sujit Reddy Thumma
  2013-07-09 10:42   ` merez
@ 2013-07-19 13:56   ` Seungwon Jeon
  2013-07-19 18:26     ` Sujit Reddy Thumma
  1 sibling, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-19 13:56 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma', 'Vinayak Holikatti',
	'Santosh Y'
  Cc: 'James E.J. Bottomley', linux-scsi, linux-arm-msm

On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
> There is a possible race condition in the hardware when the abort
> command is issued to terminate the ongoing SCSI command as described
> below:
> 
> - A bit in the door-bell register is set in the controller for a
>   new SCSI command.
> - In some rare situations, before controller get a chance to issue
>   the command to the device, the software issued an abort command.
That's interesting.
I wonder when we could hit this situation.
Is it possible only if the SCSI mid-layer sends an abort command as soon as the transfer command is issued?
AFAIK, an abort command follows only after a command has timed out.
Doesn't that mean the command has already been issued and got no response?
If you actually ran into this problem, could you share the details?

> - If the device recieves abort command first then it returns success
receives

>   because the command itself is not present.
> - Now if the controller commits the command to device it will be
>   processed.
> - Software thinks that command is aborted and proceed while still
>   the device is processing it.
> - The software, controller and device may go out of sync because of
>   this race condition.
> 
> To avoid this, query task presence in the device before sending abort
> task command so that after the abort operation, the command is guaranteed
> to be non-existent in both controller and the device.
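
The query-before-abort decision described above can be sketched as a small
pure function. This is a simplified illustration, not the driver code: the
enum and function names are hypothetical, and error paths (a failed TM
command or an unexpected service response) are left out.

```c
#include <stdbool.h>

/* Outcome of the query-task TM command for the command being aborted. */
enum query_result {
	TASK_PENDING_IN_DEVICE,	/* device reports the task is present */
	TASK_NOT_IN_DEVICE,	/* device does not know the task */
};

enum abort_step {
	STEP_SEND_ABORT,	/* task is in the device: abort it there */
	STEP_RETRY_QUERY,	/* possibly in transit: wait and query again */
	STEP_ALREADY_DONE,	/* doorbell cleared: command has completed */
};

/* Decide the next step from the query result and the doorbell bit. */
static enum abort_step next_abort_step(enum query_result q, bool doorbell_set)
{
	if (q == TASK_PENDING_IN_DEVICE)
		return STEP_SEND_ABORT;
	if (doorbell_set)
		return STEP_RETRY_QUERY;
	return STEP_ALREADY_DONE;
}
```

The retry case is the race window: the doorbell bit is still set, yet the
device has not seen the command, so an abort sent at that point could report
success for a command that is then delivered anyway.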
> 
> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> ---
>  drivers/scsi/ufs/ufshcd.c |   70 +++++++++++++++++++++++++++++++++++---------
>  1 files changed, 55 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index a176421..51ce096 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -2550,6 +2550,12 @@ static int ufshcd_host_reset(struct scsi_cmnd *cmd)
>   * ufshcd_abort - abort a specific command
>   * @cmd: SCSI command pointer
>   *
> + * Abort the pending command in device by sending UFS_ABORT_TASK task management
> + * command, and in host controller by clearing the door-bell register. There can
> + * be race between controller sending the command to the device while abort is
> + * issued. To avoid that, first issue UFS_QUERY_TASK to check if the command is
> + * really issued and then try to abort it.
> + *
>   * Returns SUCCESS/FAILED
>   */
>  static int ufshcd_abort(struct scsi_cmnd *cmd)
> @@ -2558,7 +2564,8 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	struct ufs_hba *hba;
>  	unsigned long flags;
>  	unsigned int tag;
> -	int err;
> +	int err = 0;
> +	int poll_cnt;
>  	u8 resp;
>  	struct ufshcd_lrb *lrbp;
> 
> @@ -2566,33 +2573,59 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	hba = shost_priv(host);
>  	tag = cmd->request->tag;
> 
> -	spin_lock_irqsave(host->host_lock, flags);
> +	/* If command is already aborted/completed, return SUCCESS */
> +	if (!(test_bit(tag, &hba->outstanding_reqs)))
> +		goto out;
> 
> -	/* check if command is still pending */
> -	if (!(test_bit(tag, &hba->outstanding_reqs))) {
> -		err = FAILED;
> -		spin_unlock_irqrestore(host->host_lock, flags);
> +	lrbp = &hba->lrb[tag];
> +	for (poll_cnt = 100; poll_cnt; poll_cnt--) {
> +		err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> +				UFS_QUERY_TASK, &resp);
> +		if (!err && resp == UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED) {
> +			/* cmd pending in the device */
> +			break;
> +		} else if (!err && resp == UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> +			u32 reg;
> +
> +			/*
> +			 * cmd not pending in the device, check if it is
> +			 * in transition.
> +			 */
> +			reg = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
> +			if (reg & (1 << tag)) {
> +				/* sleep for max. 2ms to stabilize */
> +				usleep_range(1000, 2000);
> +				continue;
> +			}
> +			/* command completed already */
> +			goto out;
> +		} else {
> +			if (!err)
> +				err = resp; /* service response error */
> +			goto out;
> +		}
> +	}
> +
> +	if (!poll_cnt) {
> +		err = -EBUSY;
>  		goto out;
>  	}
> -	spin_unlock_irqrestore(host->host_lock, flags);
> 
> -	lrbp = &hba->lrb[tag];
>  	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>  			UFS_ABORT_TASK, &resp);
>  	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> -		err = FAILED;
> +		if (!err)
> +			err = resp; /* service response error */
>  		goto out;
> -	} else {
> -		err = SUCCESS;
>  	}
> 
> +	err = ufshcd_clear_cmd(hba, tag);
> +	if (err)
> +		goto out;
> +
>  	scsi_dma_unmap(cmd);
> 
>  	spin_lock_irqsave(host->host_lock, flags);
> -
> -	/* clear the respective UTRLCLR register bit */
> -	ufshcd_utrl_clear(hba, tag);
> -
>  	__clear_bit(tag, &hba->outstanding_reqs);
>  	hba->lrb[tag].cmd = NULL;
>  	spin_unlock_irqrestore(host->host_lock, flags);
> @@ -2600,6 +2633,13 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  	clear_bit_unlock(tag, &hba->lrb_in_use);
>  	wake_up(&hba->dev_cmd.tag_wq);
>  out:
> +	if (!err) {
> +		err = SUCCESS;
> +	} else {
> +		dev_err(hba->dev, "%s: failed with err %d\n", __func__, err);
> +		err = FAILED;
> +	}
> +
>  	return err;
>  }
> 
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation.
> 



* RE: [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods
  2013-07-09  9:16 ` [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods Sujit Reddy Thumma
  2013-07-09 10:43   ` merez
@ 2013-07-19 13:57   ` Seungwon Jeon
  2013-07-19 18:26     ` Sujit Reddy Thumma
  1 sibling, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-19 13:57 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma', 'Vinayak Holikatti',
	'Santosh Y'
  Cc: 'James E.J. Bottomley', linux-scsi, linux-arm-msm

On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
> As of now SCSI initiated error handling is broken because,
> the reset APIs don't try to bring back the device initialized and
> ready for further transfers.
> 
> In case of timeouts, the scsi error handler takes care of handling aborts
> and resets. Improve the error handling in such scenario by resetting the
> device and host and re-initializing them in proper manner.
> 
> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> ---
>  drivers/scsi/ufs/ufshcd.c |  467 +++++++++++++++++++++++++++++++++++++++------
>  drivers/scsi/ufs/ufshcd.h |    2 +
>  2 files changed, 411 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index 51ce096..b4c9910 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -69,9 +69,15 @@ enum {
> 
>  /* UFSHCD states */
>  enum {
> -	UFSHCD_STATE_OPERATIONAL,
>  	UFSHCD_STATE_RESET,
>  	UFSHCD_STATE_ERROR,
> +	UFSHCD_STATE_OPERATIONAL,
> +};
> +
> +/* UFSHCD error handling flags */
> +enum {
> +	UFSHCD_EH_HOST_RESET_PENDING = (1 << 0),
> +	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>  };
> 
>  /* Interrupt configuration options */
> @@ -87,6 +93,22 @@ enum {
>  	INT_AGGR_CONFIG,
>  };
> 
> +#define ufshcd_set_device_reset_pending(h) \
> +	(h->eh_flags |= UFSHCD_EH_DEVICE_RESET_PENDING)
> +#define ufshcd_set_host_reset_pending(h) \
> +	(h->eh_flags |= UFSHCD_EH_HOST_RESET_PENDING)
> +#define ufshcd_device_reset_pending(h) \
> +	(h->eh_flags & UFSHCD_EH_DEVICE_RESET_PENDING)
> +#define ufshcd_host_reset_pending(h) \
> +	(h->eh_flags & UFSHCD_EH_HOST_RESET_PENDING)
> +#define ufshcd_clear_device_reset_pending(h) \
> +	(h->eh_flags &= ~UFSHCD_EH_DEVICE_RESET_PENDING)
> +#define ufshcd_clear_host_reset_pending(h) \
> +	(h->eh_flags &= ~UFSHCD_EH_HOST_RESET_PENDING)
> +
> +static void ufshcd_tmc_handler(struct ufs_hba *hba);
> +static void ufshcd_async_scan(void *data, async_cookie_t cookie);
> +
>  /*
>   * ufshcd_wait_for_register - wait for register value to change
>   * @hba - per-adapter interface
> @@ -851,9 +873,22 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
> 
>  	tag = cmd->request->tag;
> 
> -	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL) {
> +	switch (hba->ufshcd_state) {
Is a lock not needed when reading ufshcd_state here?
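To make the question concrete, here is a minimal userspace sketch of a locked state read, assuming every writer of the state also holds the same lock. The names `fake_hba`, `read_state_locked()` and `may_queue()` are hypothetical, and a pthread mutex stands in for `host->host_lock`; this is not driver code.

```c
#include <assert.h>
#include <pthread.h>

/* Same order as the state enum introduced by this patch. */
enum { STATE_RESET, STATE_ERROR, STATE_OPERATIONAL };

/* Hypothetical stand-in for struct ufs_hba; the mutex models host_lock. */
struct fake_hba {
	pthread_mutex_t lock;
	int state;
};

/* Take a snapshot of the state under the lock, then act on the copy. */
static int read_state_locked(struct fake_hba *hba)
{
	int state;

	assert(hba != NULL);
	pthread_mutex_lock(&hba->lock);
	state = hba->state;
	pthread_mutex_unlock(&hba->lock);
	return state;
}

/* Returns 1 if a new command may be issued, 0 otherwise. */
static int may_queue(struct fake_hba *hba)
{
	return read_state_locked(hba) == STATE_OPERATIONAL;
}
```

The snapshot only says what the state was at the time of the read. Whether that is enough for the queuecommand path depends on how the reset paths serialize against it.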

> +	case UFSHCD_STATE_OPERATIONAL:
> +		break;
> +	case UFSHCD_STATE_RESET:
>  		err = SCSI_MLQUEUE_HOST_BUSY;
>  		goto out;
> +	case UFSHCD_STATE_ERROR:
> +		set_host_byte(cmd, DID_ERROR);
> +		cmd->scsi_done(cmd);
> +		goto out;
> +	default:
> +		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
> +				__func__, hba->ufshcd_state);
> +		set_host_byte(cmd, DID_BAD_TARGET);
> +		cmd->scsi_done(cmd);
> +		goto out;
>  	}
> 
>  	/* acquire the tag to make sure device cmds don't use it */
> @@ -1573,8 +1608,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>  	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>  		scsi_unblock_requests(hba->host);
> 
> -	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> -
>  out:
>  	return err;
>  }
> @@ -2273,6 +2306,106 @@ out:
>  }
> 
>  /**
> + * ufshcd_utrl_is_rsr_enabled - check if run-stop register is enabled
> + * @hba: per-adapter instance
> + */
> +static bool ufshcd_utrl_is_rsr_enabled(struct ufs_hba *hba)
> +{
> +	return ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_LIST_RUN_STOP) & 0x1;
> +}
> +
> +/**
> + * ufshcd_utmrl_is_rsr_enabled - check if run-stop register is enabled
> + * @hba: per-adapter instance
> + */
> +static bool ufshcd_utmrl_is_rsr_enabled(struct ufs_hba *hba)
> +{
> +	return ufshcd_readl(hba, REG_UTP_TASK_REQ_LIST_RUN_STOP) & 0x1;
> +}
> +
> +/**
> + * ufshcd_complete_pending_tasks - complete outstanding tasks
> + * @hba: per adapter instance
> + *
> + * Abort in-progress task management commands and wakeup
> + * waiting threads.
> + *
> + * Returns non-zero error value when failed to clear all the commands.
> + */
> +static int ufshcd_complete_pending_tasks(struct ufs_hba *hba)
> +{
> +	u32 reg;
> +	int err = 0;
> +	unsigned long flags;
> +
> +	if (!hba->outstanding_tasks)
> +		goto out;
> +
> +	/* Clear UTMRL only when run-stop is enabled */
> +	if (ufshcd_utmrl_is_rsr_enabled(hba))
> +		ufshcd_writel(hba, ~hba->outstanding_tasks,
> +				REG_UTP_TASK_REQ_LIST_CLEAR);
> +
> +	/* poll for max. 1 sec to clear door bell register by h/w */
> +	reg = ufshcd_wait_for_register(hba,
> +			REG_UTP_TASK_REQ_DOOR_BELL,
> +			hba->outstanding_tasks, 0, 1000, 1000);
> +	if (reg & hba->outstanding_tasks)
> +		err = -ETIMEDOUT;
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	/* complete commands that were cleared out */
> +	ufshcd_tmc_handler(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +out:
> +	if (err)
> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
> +				__func__, reg);
> +	return err;
> +}
> +
> +/**
> + * ufshcd_complete_pending_reqs - complete outstanding requests
> + * @hba: per adapter instance
> + *
> + * Abort in-progress transfer request commands and return them to SCSI.
> + *
> + * Returns non-zero error value when failed to clear all the commands.
> + */
> +static int ufshcd_complete_pending_reqs(struct ufs_hba *hba)
> +{
> +	u32 reg;
> +	int err = 0;
> +	unsigned long flags;
> +
> +	/* check if we completed all of them */
> +	if (!hba->outstanding_reqs)
> +		goto out;
> +
> +	/* Clear UTRL only when run-stop is enabled */
> +	if (ufshcd_utrl_is_rsr_enabled(hba))
> +		ufshcd_writel(hba, ~hba->outstanding_reqs,
> +				REG_UTP_TRANSFER_REQ_LIST_CLEAR);
> +
> +	/* poll for max. 1 sec to clear door bell register by h/w */
> +	reg = ufshcd_wait_for_register(hba,
> +			REG_UTP_TRANSFER_REQ_DOOR_BELL,
> +			hba->outstanding_reqs, 0, 1000, 1000);
> +	if (reg & hba->outstanding_reqs)
> +		err = -ETIMEDOUT;
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	/* complete commands that were cleared out */
> +	ufshcd_transfer_req_compl(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +out:
> +	if (err)
> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
> +				__func__, reg);
> +	return err;
> +}
> +
> +/**
>   * ufshcd_fatal_err_handler - handle fatal errors
>   * @hba: per adapter instance
>   */
> @@ -2306,8 +2439,12 @@ static void ufshcd_err_handler(struct ufs_hba *hba)
>  	}
>  	return;
>  fatal_eh:
> -	hba->ufshcd_state = UFSHCD_STATE_ERROR;
> -	schedule_work(&hba->feh_workq);
> +	/* handle fatal errors only when link is functional */
> +	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
> +		/* block commands at driver layer until error is handled */
> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
Is locking omitted for ufshcd_state here?

> +		schedule_work(&hba->feh_workq);
> +	}
>  }
> 
>  /**
> @@ -2475,75 +2612,155 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
>  }
> 
>  /**
> - * ufshcd_device_reset - reset device and abort all the pending commands
> - * @cmd: SCSI command pointer
> + * ufshcd_dme_end_point_reset - Notify device Unipro to perform reset
> + * @hba: per adapter instance
>   *
> - * Returns SUCCESS/FAILED
> + * UIC_CMD_DME_END_PT_RST resets the UFS device completely, the UFS flags,
> + * attributes and descriptors are reset to default state. Callers are
> + * expected to initialize the whole device again after this.
> + *
> + * Returns zero on success, non-zero on failure
>   */
> -static int ufshcd_device_reset(struct scsi_cmnd *cmd)
> +static int ufshcd_dme_end_point_reset(struct ufs_hba *hba)
>  {
> -	struct Scsi_Host *host;
> -	struct ufs_hba *hba;
> -	unsigned int tag;
> -	u32 pos;
> -	int err;
> -	u8 resp;
> -	struct ufshcd_lrb *lrbp;
> +	struct uic_command uic_cmd = {0};
> +	int ret;
> 
> -	host = cmd->device->host;
> -	hba = shost_priv(host);
> -	tag = cmd->request->tag;
> +	uic_cmd.command = UIC_CMD_DME_END_PT_RST;
> 
> -	lrbp = &hba->lrb[tag];
> -	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> -			UFS_LOGICAL_RESET, &resp);
> -	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> -		err = FAILED;
> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> +	if (ret)
> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> +
> +	return ret;
> +}
> +
> +/**
> + * ufshcd_dme_reset - Local UniPro reset
> + * @hba: per adapter instance
> + *
> + * Returns zero on success, non-zero on failure
> + */
> +static int ufshcd_dme_reset(struct ufs_hba *hba)
> +{
> +	struct uic_command uic_cmd = {0};
> +	int ret;
> +
> +	uic_cmd.command = UIC_CMD_DME_RESET;
> +
> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> +	if (ret)
> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> +
> +	return ret;
> +
> +}
> +
> +/**
> + * ufshcd_dme_enable - Local UniPro DME Enable
> + * @hba: per adapter instance
> + *
> + * Returns zero on success, non-zero on failure
> + */
> +static int ufshcd_dme_enable(struct ufs_hba *hba)
> +{
> +	struct uic_command uic_cmd = {0};
> +	int ret;
> +	uic_cmd.command = UIC_CMD_DME_ENABLE;
> +
> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> +	if (ret)
> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> +
> +	return ret;
> +
> +}
> +
> +/**
> + * ufshcd_device_reset_and_restore - reset and restore device
> + * @hba: per-adapter instance
> + *
> + * Note that the device reset issues DME_END_POINT_RESET which
> + * may reset entire device and restore device attributes to
> + * default state.
> + *
> + * Returns zero on success, non-zero on failure
> + */
> +static int ufshcd_device_reset_and_restore(struct ufs_hba *hba)
> +{
> +	int err = 0;
> +	u32 reg;
> +
> +	err = ufshcd_dme_end_point_reset(hba);
> +	if (err)
> +		goto out;
> +
> +	/* restore communication with the device */
> +	err = ufshcd_dme_reset(hba);
> +	if (err)
>  		goto out;
> -	} else {
> -		err = SUCCESS;
> -	}
> 
> -	for (pos = 0; pos < hba->nutrs; pos++) {
> -		if (test_bit(pos, &hba->outstanding_reqs) &&
> -		    (hba->lrb[tag].lun == hba->lrb[pos].lun)) {
> +	err = ufshcd_dme_enable(hba);
> +	if (err)
> +		goto out;
> 
> -			/* clear the respective UTRLCLR register bit */
> -			ufshcd_utrl_clear(hba, pos);
> +	err = ufshcd_dme_link_startup(hba);
Is UFS_LOGICAL_RESET no longer used?
ufshcd_device_reset_and_restore is supposed to reset the device.
Both ufshcd_dme_reset and ufshcd_dme_enable act on the local side, not on the remote one.
Should we do those host-side steps, including link startup, here?

> +	if (err)
> +		goto out;
> 
> -			clear_bit(pos, &hba->outstanding_reqs);
> +	/* check if link is up and device is detected */
> +	reg = ufshcd_readl(hba, REG_CONTROLLER_STATUS);
> +	if (!ufshcd_is_device_present(reg)) {
> +		dev_err(hba->dev, "Device not present\n");
> +		err = -ENXIO;
> +		goto out;
> +	}
> 
> -			if (hba->lrb[pos].cmd) {
> -				scsi_dma_unmap(hba->lrb[pos].cmd);
> -				hba->lrb[pos].cmd->result =
> -					DID_ABORT << 16;
> -				hba->lrb[pos].cmd->scsi_done(cmd);
> -				hba->lrb[pos].cmd = NULL;
> -				clear_bit_unlock(pos, &hba->lrb_in_use);
> -				wake_up(&hba->dev_cmd.tag_wq);
> -			}
> -		}
> -	} /* end of for */
> +	ufshcd_clear_device_reset_pending(hba);
>  out:
> +	dev_dbg(hba->dev, "%s: done err = %d\n", __func__, err);
>  	return err;
>  }
> 
>  /**
> - * ufshcd_host_reset - Main reset function registered with scsi layer
> - * @cmd: SCSI command pointer
> + * ufshcd_host_reset_and_restore - reset and restore host controller
> + * @hba: per-adapter instance
>   *
> - * Returns SUCCESS/FAILED
> + * Note that host controller reset may issue DME_RESET to
> + * local and remote (device) Uni-Pro stack and the attributes
> + * are reset to default state.
> + *
> + * Returns zero on success, non-zero on failure
>   */
> -static int ufshcd_host_reset(struct scsi_cmnd *cmd)
> +static int ufshcd_host_reset_and_restore(struct ufs_hba *hba)
>  {
> -	struct ufs_hba *hba;
> +	int err;
> +	async_cookie_t cookie;
> +	unsigned long flags;
> 
> -	hba = shost_priv(cmd->device->host);
> +	/* Reset the host controller */
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	ufshcd_hba_stop(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> 
> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
> -		return SUCCESS;
> +	err = ufshcd_hba_enable(hba);
> +	if (err)
> +		goto out;
> 
> -	return ufshcd_do_reset(hba);
> +	/* Establish the link again and restore the device */
> +	cookie = async_schedule(ufshcd_async_scan, hba);
> +	/* wait for async scan to be completed */
> +	async_synchronize_cookie(++cookie);
> +	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL)
> +		err = -EIO;
> +out:
> +	if (err)
> +		dev_err(hba->dev, "%s: Host init failed %d\n", __func__, err);
> +	else
> +		ufshcd_clear_host_reset_pending(hba);
> +
> +	dev_dbg(hba->dev, "%s: done err = %d\n", __func__, err);
> +	return err;
>  }
> 
>  /**
> @@ -2644,6 +2861,134 @@ out:
>  }
> 
>  /**
> + * ufshcd_reset_and_restore - resets device or host or both
> + * @hba: per-adapter instance
> + *
> + * Reset and recover device, host and re-establish link. This
> + * is helpful to recover the communication in fatal error conditions.
> + *
> + * Returns zero on success, non-zero on failure
> + */
> +static int ufshcd_reset_and_restore(struct ufs_hba *hba)
> +{
> +	int err = 0;
> +
> +	if (ufshcd_device_reset_pending(hba) &&
> +			!ufshcd_host_reset_pending(hba)) {
> +		err = ufshcd_device_reset_and_restore(hba);
> +		if (err) {
> +			ufshcd_clear_device_reset_pending(hba);
> +			ufshcd_set_host_reset_pending(hba);
> +		}
> +	}
> +
> +	if (ufshcd_host_reset_pending(hba))
> +		err = ufshcd_host_reset_and_restore(hba);
> +
> +	/*
> +	 * Due to reset the door-bell might be cleared, clear
> +	 * outstanding requests in s/w here.
> +	 */
> +	ufshcd_complete_pending_reqs(hba);
After the above, pending requests will be completed by ufshcd_transfer_req_compl.
The 'cmd->result' reported to the SCSI mid-layer should be a failure,
but I think that may not be guaranteed.

> +	ufshcd_complete_pending_tasks(hba);
> +
> +	return err;
> +}
> +
> +/**
> + * ufshcd_eh_device_reset_handler - device reset handler registered to
> + *                                    scsi layer.
> + * @cmd - SCSI command pointer
> + *
> + * Returns SUCCESS/FAILED
> + */
> +static int ufshcd_eh_device_reset_handler(struct scsi_cmnd *cmd)
> +{
> +	struct ufs_hba *hba;
> +	int err;
> +	unsigned long flags;
> +
> +	hba = shost_priv(cmd->device->host);
> +
> +	/*
> +	 * Check if there is any race with fatal error handling.
> +	 * If so, wait for it to complete. Even though fatal error
> +	 * handling does reset and restore in some cases, don't assume
> +	 * anything out of it. We are just avoiding race here.
> +	 */
> +	do {
> +		spin_lock_irqsave(hba->host->host_lock, flags);
> +		if (!(work_pending(&hba->feh_workq) ||
> +				hba->ufshcd_state == UFSHCD_STATE_RESET))
> +			break;
> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
> +		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
> +		flush_work_sync(&hba->feh_workq);
> +	} while (1);
> +
> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
> +	ufshcd_set_device_reset_pending(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	err = ufshcd_reset_and_restore(hba);
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	if (!err) {
> +		err = SUCCESS;
> +		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> +	} else {
> +		err = FAILED;
> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
> +	}
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	return err;
> +}
> +
> +/**
> + * ufshcd_eh_host_reset_handler - host reset handler registered to scsi layer
> + * @cmd - SCSI command pointer
> + *
> + * Returns SUCCESS/FAILED
> + */
> +static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
> +{
> +	struct ufs_hba *hba;
> +	int err;
> +	unsigned long flags;
> +
> +	hba = shost_priv(cmd->device->host);
> +
> +	do {
> +		spin_lock_irqsave(hba->host->host_lock, flags);
> +		if (!(work_pending(&hba->feh_workq) ||
> +				hba->ufshcd_state == UFSHCD_STATE_RESET))
> +			break;
> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
> +		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
> +		flush_work_sync(&hba->feh_workq);
> +	} while (1);
> +
> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
> +	ufshcd_set_host_reset_pending(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	err = ufshcd_reset_and_restore(hba);
> +
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	if (!err) {
> +		err = SUCCESS;
> +		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> +	} else {
> +		err = FAILED;
> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
> +	}
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	return err;
> +}
Both 'ufshcd_eh_device_reset_handler' and 'ufshcd_eh_host_reset_handler' share a
common routine. If possible, it would be better to gather it into one function.

> +
> +/**
>   * ufshcd_async_scan - asynchronous execution for link startup
>   * @data: data pointer to pass to this function
>   * @cookie: cookie data
> @@ -2667,8 +3012,14 @@ static void ufshcd_async_scan(void *data, async_cookie_t cookie)
> 
>  	hba->auto_bkops_enabled = false;
>  	ufshcd_enable_auto_bkops(hba);
> -	scsi_scan_host(hba->host);
> -	pm_runtime_put_sync(hba->dev);
> +	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
Is a lock not needed here?

Thanks,
Seungwon Jeon

> +
> +	/* If we are in error handling context no need to scan the host */
> +	if (!(ufshcd_device_reset_pending(hba) ||
> +			ufshcd_host_reset_pending(hba))) {
> +		scsi_scan_host(hba->host);
> +		pm_runtime_put_sync(hba->dev);
> +	}
>  out:
>  	return;
>  }
> @@ -2681,8 +3032,8 @@ static struct scsi_host_template ufshcd_driver_template = {
>  	.slave_alloc		= ufshcd_slave_alloc,
>  	.slave_destroy		= ufshcd_slave_destroy,
>  	.eh_abort_handler	= ufshcd_abort,
> -	.eh_device_reset_handler = ufshcd_device_reset,
> -	.eh_host_reset_handler	= ufshcd_host_reset,
> +	.eh_device_reset_handler = ufshcd_eh_device_reset_handler,
> +	.eh_host_reset_handler   = ufshcd_eh_host_reset_handler,
>  	.this_id		= -1,
>  	.sg_tablesize		= SG_ALL,
>  	.cmd_per_lun		= UFSHCD_CMD_PER_LUN,
> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
> index 5d4542c..7fcedd0 100644
> --- a/drivers/scsi/ufs/ufshcd.h
> +++ b/drivers/scsi/ufs/ufshcd.h
> @@ -179,6 +179,7 @@ struct ufs_dev_cmd {
>   * @tm_condition: condition variable for task management
>   * @tm_slots_in_use: bit map of task management request slots in use
>   * @ufshcd_state: UFSHCD states
> + * @eh_flags: Error handling flags
>   * @intr_mask: Interrupt Mask Bits
>   * @ee_ctrl_mask: Exception event control mask
>   * @feh_workq: Work queue for fatal controller error handling
> @@ -224,6 +225,7 @@ struct ufs_hba {
>  	unsigned long tm_slots_in_use;
> 
>  	u32 ufshcd_state;
> +	u32 eh_flags;
>  	u32 intr_mask;
>  	u16 ee_ctrl_mask;
> 
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation.
> 



* RE: [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling
  2013-07-09  9:16 ` [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling Sujit Reddy Thumma
  2013-07-09 10:43   ` merez
@ 2013-07-19 13:58   ` Seungwon Jeon
  2013-07-19 18:26     ` Sujit Reddy Thumma
  1 sibling, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-19 13:58 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma', 'Vinayak Holikatti',
	'Santosh Y'
  Cc: 'James E.J. Bottomley', linux-scsi, linux-arm-msm

On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
> Error handling in UFS driver is broken and resets the host controller
> for fatal errors without re-initialization. Correct the fatal error
> handling sequence according to UFS Host Controller Interface (HCI)
> v1.1 specification.
> 
> o Upon determining fatal error condition the host controller may hang
>   forever until a reset is applied, so just retrying the command doesn't
>   work without a reset. So, the reset is applied in the driver context
>   in a separate work and SCSI mid-layer isn't informed until reset is
>   applied.
> 
> o Processed requests which are completed without error are reported to
>   SCSI layer as successful and any pending commands that are not started
>   yet or are not cause of the error are re-queued into scsi midlayer queue.
>   For the command that caused error, host controller or device is reset
>   and DID_ERROR is returned for command retry after applying reset.
> 
> o SCSI is informed about the expected Unit-Attentioni exception from the
'Attentioni' is a typo; it should be 'Attention'.

>   device for the immediate command after a reset so that the SCSI layer
>   take necessary steps to establish communication with the device.
> 
> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> ---
>  drivers/scsi/ufs/ufshcd.c |  349 +++++++++++++++++++++++++++++++++++---------
>  drivers/scsi/ufs/ufshcd.h |    2 +
>  drivers/scsi/ufs/ufshci.h |   19 ++-
>  3 files changed, 295 insertions(+), 75 deletions(-)
> 
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index b4c9910..2a3874f 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -80,6 +80,14 @@ enum {
>  	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>  };
> 
> +/* UFSHCD UIC layer error flags */
> +enum {
> +	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0), /* Data link layer error */
> +	UFSHCD_UIC_NL_ERROR = (1 << 1), /* Network layer error */
> +	UFSHCD_UIC_TL_ERROR = (1 << 2), /* Transport Layer error */
> +	UFSHCD_UIC_DME_ERROR = (1 << 3), /* DME error */
> +};
> +
>  /* Interrupt configuration options */
>  enum {
>  	UFSHCD_INT_DISABLE,
> @@ -108,6 +116,7 @@ enum {
> 
>  static void ufshcd_tmc_handler(struct ufs_hba *hba);
>  static void ufshcd_async_scan(void *data, async_cookie_t cookie);
> +static int ufshcd_reset_and_restore(struct ufs_hba *hba);
> 
>  /*
>   * ufshcd_wait_for_register - wait for register value to change
> @@ -1605,9 +1614,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>  		goto out;
>  	}
> 
> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
> -		scsi_unblock_requests(hba->host);
> -
>  out:
>  	return err;
>  }
> @@ -1733,66 +1739,6 @@ static int ufshcd_validate_dev_connection(struct ufs_hba *hba)
>  }
> 
>  /**
> - * ufshcd_do_reset - reset the host controller
> - * @hba: per adapter instance
> - *
> - * Returns SUCCESS/FAILED
> - */
> -static int ufshcd_do_reset(struct ufs_hba *hba)
> -{
> -	struct ufshcd_lrb *lrbp;
> -	unsigned long flags;
> -	int tag;
> -
> -	/* block commands from midlayer */
> -	scsi_block_requests(hba->host);
> -
> -	spin_lock_irqsave(hba->host->host_lock, flags);
> -	hba->ufshcd_state = UFSHCD_STATE_RESET;
> -
> -	/* send controller to reset state */
> -	ufshcd_hba_stop(hba);
> -	spin_unlock_irqrestore(hba->host->host_lock, flags);
> -
> -	/* abort outstanding commands */
> -	for (tag = 0; tag < hba->nutrs; tag++) {
> -		if (test_bit(tag, &hba->outstanding_reqs)) {
> -			lrbp = &hba->lrb[tag];
> -			if (lrbp->cmd) {
> -				scsi_dma_unmap(lrbp->cmd);
> -				lrbp->cmd->result = DID_RESET << 16;
> -				lrbp->cmd->scsi_done(lrbp->cmd);
> -				lrbp->cmd = NULL;
> -				clear_bit_unlock(tag, &hba->lrb_in_use);
> -			}
> -		}
> -	}
> -
> -	/* complete device management command */
> -	if (hba->dev_cmd.complete)
> -		complete(hba->dev_cmd.complete);
> -
> -	/* clear outstanding request/task bit maps */
> -	hba->outstanding_reqs = 0;
> -	hba->outstanding_tasks = 0;
> -
> -	/* Host controller enable */
> -	if (ufshcd_hba_enable(hba)) {
> -		dev_err(hba->dev,
> -			"Reset: Controller initialization failed\n");
> -		return FAILED;
> -	}
> -
> -	if (ufshcd_link_startup(hba)) {
> -		dev_err(hba->dev,
> -			"Reset: Link start-up failed\n");
> -		return FAILED;
> -	}
> -
> -	return SUCCESS;
> -}
> -
> -/**
>   * ufshcd_slave_alloc - handle initial SCSI device configurations
>   * @sdev: pointer to SCSI device
>   *
> @@ -1809,6 +1755,9 @@ static int ufshcd_slave_alloc(struct scsi_device *sdev)
>  	sdev->use_10_for_ms = 1;
>  	scsi_set_tag_type(sdev, MSG_SIMPLE_TAG);
> 
> +	/* allow SCSI layer to restart the device in case of errors */
> +	sdev->allow_restart = 1;
> +
>  	/*
>  	 * Inform SCSI Midlayer that the LUN queue depth is same as the
>  	 * controller queue depth. If a LUN queue depth is less than the
> @@ -2013,6 +1962,9 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
>  	case OCS_ABORTED:
>  		result |= DID_ABORT << 16;
>  		break;
> +	case OCS_INVALID_COMMAND_STATUS:
> +		result |= DID_REQUEUE << 16;
> +		break;
>  	case OCS_INVALID_CMD_TABLE_ATTR:
>  	case OCS_INVALID_PRDT_ATTR:
>  	case OCS_MISMATCH_DATA_BUF_SIZE:
> @@ -2405,42 +2357,295 @@ out:
>  	return err;
>  }
> 
> +static void ufshcd_decide_eh_xfer_req(struct ufs_hba *hba, u32 ocs)
> +{
> +	switch (ocs) {
> +	case OCS_SUCCESS:
> +	case OCS_INVALID_COMMAND_STATUS:
> +		break;
> +	case OCS_MISMATCH_DATA_BUF_SIZE:
> +	case OCS_MISMATCH_RESP_UPIU_SIZE:
> +	case OCS_PEER_COMM_FAILURE:
> +	case OCS_FATAL_ERROR:
> +	case OCS_ABORTED:
> +	case OCS_INVALID_CMD_TABLE_ATTR:
> +	case OCS_INVALID_PRDT_ATTR:
> +		ufshcd_set_host_reset_pending(hba);
Should the host be reset on an OCS error, including in ufshcd_decide_eh_task_req below?
OCS is just the overall command status.

> +		break;
> +	default:
> +		dev_err(hba->dev, "%s: unknown OCS 0x%x\n",
> +				__func__, ocs);
> +		BUG();
> +	}
> +}
> +
> +static void ufshcd_decide_eh_task_req(struct ufs_hba *hba, u32 ocs)
> +{
> +	switch (ocs) {
> +	case OCS_TMR_SUCCESS:
> +	case OCS_TMR_INVALID_COMMAND_STATUS:
> +		break;
> +	case OCS_TMR_MISMATCH_REQ_SIZE:
> +	case OCS_TMR_MISMATCH_RESP_SIZE:
> +	case OCS_TMR_PEER_COMM_FAILURE:
> +	case OCS_TMR_INVALID_ATTR:
> +	case OCS_TMR_ABORTED:
> +	case OCS_TMR_FATAL_ERROR:
> +		ufshcd_set_host_reset_pending(hba);
> +		break;
> +	default:
> +		dev_err(hba->dev, "%s: unknown TMR OCS 0x%x\n",
> +				__func__, ocs);
> +		BUG();
> +	}
> +}
> +
>  /**
> - * ufshcd_fatal_err_handler - handle fatal errors
> + * ufshcd_error_autopsy_transfer_req() - reads OCS field of failed command and
> + *                          decide error handling
>   * @hba: per adapter instance
> + * @err_xfer: bit mask for transfer request errors
> + *
> + * Iterate over completed transfer requests and
> + * set error handling flags.
> + */
> +static void
> +ufshcd_error_autopsy_transfer_req(struct ufs_hba *hba, u32 *err_xfer)
> +{
> +	unsigned long completed;
> +	u32 doorbell;
> +	int index;
> +	int ocs;
> +
> +	if (!err_xfer)
> +		goto out;
> +
> +	doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
> +	completed = doorbell ^ (u32)hba->outstanding_reqs;
> +
> +	for (index = 0; index < hba->nutrs; index++) {
> +		if (test_bit(index, &completed)) {
> +			ocs = ufshcd_get_tr_ocs(&hba->lrb[index]);
> +			if ((ocs == OCS_SUCCESS) ||
> +					(ocs == OCS_INVALID_COMMAND_STATUS))
> +				continue;
> +
> +			*err_xfer |= (1 << index);
> +			ufshcd_decide_eh_xfer_req(hba, ocs);
> +		}
> +	}
> +out:
> +	return;
> +}
> +
> +/**
> + * ufshcd_error_autopsy_task_req() - reads OCS field of failed command and
> + *                          decide error handling
> + * @hba: per adapter instance
> + * @err_tm: bit mask for task management errors
> + *
> + * Iterate over completed task management requests and
> + * set error handling flags.
> + */
> +static void
> +ufshcd_error_autopsy_task_req(struct ufs_hba *hba, u32 *err_tm)
> +{
> +	unsigned long completed;
> +	u32 doorbell;
> +	int index;
> +	int ocs;
> +
> +	if (!err_tm)
> +		goto out;
> +
> +	doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
> +	completed = doorbell ^ (u32)hba->outstanding_tasks;
> +
> +	for (index = 0; index < hba->nutmrs; index++) {
> +		if (test_bit(index, &completed)) {
> +			struct utp_task_req_desc *tm_descp;
> +
> +			tm_descp = hba->utmrdl_base_addr;
> +			ocs = ufshcd_get_tmr_ocs(&tm_descp[index]);
> +			if ((ocs == OCS_TMR_SUCCESS) ||
> +					(ocs == OCS_TMR_INVALID_COMMAND_STATUS))
> +				continue;
> +
> +			*err_tm |= (1 << index);
> +			ufshcd_decide_eh_task_req(hba, ocs);
> +		}
> +	}
> +
> +out:
> +	return;
> +}
> +
> +/**
> + * ufshcd_fatal_err_handler - handle fatal errors
> + * @work: pointer to work structure
>   */
>  static void ufshcd_fatal_err_handler(struct work_struct *work)
>  {
>  	struct ufs_hba *hba;
> +	unsigned long flags;
> +	u32 err_xfer = 0;
> +	u32 err_tm = 0;
> +	int err;
> +
>  	hba = container_of(work, struct ufs_hba, feh_workq);
> 
>  	pm_runtime_get_sync(hba->dev);
> -	/* check if reset is already in progress */
> -	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
> -		ufshcd_do_reset(hba);
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
> +		/* complete processed requests and exit */
> +		ufshcd_transfer_req_compl(hba);
> +		ufshcd_tmc_handler(hba);
> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
> +		pm_runtime_put_sync(hba->dev);
> +		return;
The host driver returns here with 'scsi_block_requests' already done.
Is 'scsi_unblock_requests' called anywhere?

> +	}
> +
> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
> +	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
> +	ufshcd_error_autopsy_task_req(hba, &err_tm);
> +
> +	/*
> +	 * Complete successful and pending transfer requests.
> +	 * DID_REQUEUE is returned for pending requests as they have
> +	 * nothing to do with error'ed request and SCSI layer should
> +	 * not treat them as errors and decrement retry count.
> +	 */
> +	hba->outstanding_reqs &= ~err_xfer;
> +	ufshcd_transfer_req_compl(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +	ufshcd_complete_pending_reqs(hba);
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +	hba->outstanding_reqs |= err_xfer;
Hmm... the error handling seems quite complicated.
To simplify it, how about the following?

1. If requests (transfer or task management) are completed, finish them with success/failure.
2. If there are pending requests, abort them.
3. If the error is fatal, reset.
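The three steps can be sketched in userspace as a pure classification over the doorbell and the software bitmap. The `eh_plan` struct and `plan_error_handling()` are hypothetical names. The quoted patch computes the completed set as `doorbell ^ outstanding`, which gives the same result as `outstanding & ~doorbell` as long as every doorbell bit is also set in the software bitmap.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of classifying the request slots. */
struct eh_plan {
	uint32_t complete;	/* step 1: done in hardware, report their status */
	uint32_t abort;		/* step 2: still owned by hardware, abort them */
	int reset;		/* step 3: non-zero if a reset is required */
};

static struct eh_plan plan_error_handling(uint32_t outstanding,
					  uint32_t doorbell, int fatal)
{
	struct eh_plan plan;

	/* Issued by software but no longer set in the doorbell: completed. */
	plan.complete = outstanding & ~doorbell;
	/* Issued and still set in the doorbell: pending in hardware. */
	plan.abort = outstanding & doorbell;
	/* Reset only when the error was classified as fatal. */
	plan.reset = fatal ? 1 : 0;
	return plan;
}
```

The same function applies to transfer requests and task management requests by passing the matching doorbell register value and bitmap.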

> +
> +	/* Complete successful and pending task requests */
> +	hba->outstanding_tasks &= ~err_tm;
> +	ufshcd_tmc_handler(hba);
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +	ufshcd_complete_pending_tasks(hba);
> +	spin_lock_irqsave(hba->host->host_lock, flags);
> +
> +	hba->outstanding_tasks |= err_tm;
> +
> +	/*
> +	 * Controller may generate multiple fatal errors, handle
> +	 * errors based on severity.
> +	 * 1) DEVICE_FATAL_ERROR
> +	 * 2) SYSTEM_BUS/CONTROLLER_FATAL_ERROR
> +	 * 3) UIC_ERROR
> +	 */
> +	if (hba->errors & DEVICE_FATAL_ERROR) {
> +		/*
> +		 * Some HBAs may not clear UTRLDBR/UTMRLDBR or update
> +		 * OCS field on device fatal error.
> +		 */
> +		ufshcd_set_host_reset_pending(hba);
For DEVICE_FATAL_ERROR, isn't ufshcd_device_reset_pending the right one?

> +	} else if (hba->errors & (SYSTEM_BUS_FATAL_ERROR |
> +			CONTROLLER_FATAL_ERROR)) {
> +		/* eh flags should be set in err autopsy based on OCS values */
> +		if (!hba->eh_flags)
> +			WARN(1, "%s: fatal error without error handling\n",
> +				dev_name(hba->dev));
> +	} else if (hba->errors & UIC_ERROR) {
> +		if (hba->uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR) {
> +			/* fatal error - reset controller */
> +			ufshcd_set_host_reset_pending(hba);
> +		} else if (hba->uic_error & (UFSHCD_UIC_NL_ERROR |
> +					UFSHCD_UIC_TL_ERROR |
> +					UFSHCD_UIC_DME_ERROR)) {
> +			/* non-fatal, report error to SCSI layer */
> +			if (!hba->eh_flags) {
> +				spin_unlock_irqrestore(
> +						hba->host->host_lock, flags);
> +				ufshcd_complete_pending_reqs(hba);
> +				ufshcd_complete_pending_tasks(hba);
> +				spin_lock_irqsave(hba->host->host_lock, flags);
> +			}
> +		}
> +	}
> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> +
> +	if (hba->eh_flags) {
> +		err = ufshcd_reset_and_restore(hba);
> +		if (err) {
> +			ufshcd_clear_host_reset_pending(hba);
> +			ufshcd_clear_device_reset_pending(hba);
> +			dev_err(hba->dev, "%s: reset and restore failed\n",
> +					__func__);
> +			hba->ufshcd_state = UFSHCD_STATE_ERROR;
> +		}
> +		/*
> +		 * Inform scsi mid-layer that we did reset and allow to handle
> +		 * Unit Attention properly.
> +		 */
> +		scsi_report_bus_reset(hba->host, 0);
> +		hba->errors = 0;
> +		hba->uic_error = 0;
> +	}
> +	scsi_unblock_requests(hba->host);
>  	pm_runtime_put_sync(hba->dev);
>  }
> 
>  /**
> - * ufshcd_err_handler - Check for fatal errors
> - * @work: pointer to a work queue structure
> + * ufshcd_update_uic_error - check and set fatal UIC error flags.
> + * @hba: per-adapter instance
>   */
> -static void ufshcd_err_handler(struct ufs_hba *hba)
> +static void ufshcd_update_uic_error(struct ufs_hba *hba)
>  {
>  	u32 reg;
> 
> +	/* PA_INIT_ERROR is fatal and needs UIC reset */
> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
> +	if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
> +		hba->uic_error |= UFSHCD_UIC_DL_PA_INIT_ERROR;
> +
> +	/* UIC NL/TL/DME errors needs software retry */
> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_NETWORK_LAYER);
> +	if (reg)
> +		hba->uic_error |= UFSHCD_UIC_NL_ERROR;
> +
> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_TRANSPORT_LAYER);
> +	if (reg)
> +		hba->uic_error |= UFSHCD_UIC_TL_ERROR;
> +
> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DME);
> +	if (reg)
> +		hba->uic_error |= UFSHCD_UIC_DME_ERROR;
REG_UIC_ERROR_CODE_PHY_ADAPTER_LAYER is not handled.

> +
> +	dev_dbg(hba->dev, "%s: UIC error flags = 0x%08x\n",
> +			__func__, hba->uic_error);
> +}
> +
> +/**
> + * ufshcd_err_handler - Check for fatal errors
> + * @hba: per-adapter instance
> + */
> +static void ufshcd_err_handler(struct ufs_hba *hba)
> +{
>  	if (hba->errors & INT_FATAL_ERRORS)
>  		goto fatal_eh;
> 
>  	if (hba->errors & UIC_ERROR) {
> -		reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
> -		if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
> +		hba->uic_error = 0;
> +		ufshcd_update_uic_error(hba);
> +		if (hba->uic_error)
Apart from UFSHCD_UIC_DL_PA_INIT_ERROR, these errors are not fatal. Should they go to fatal_eh?
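
A minimal sketch of the check I have in mind, assuming only PA_INIT is
treated as fatal (the flag values and names here are placeholders, not
the UFSHCD_UIC_* definitions from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag bits; the patch defines its own UFSHCD_UIC_* values. */
enum {
	SKETCH_UIC_DL_PA_INIT_ERROR = 1 << 0,
	SKETCH_UIC_NL_ERROR = 1 << 1,
	SKETCH_UIC_TL_ERROR = 1 << 2,
	SKETCH_UIC_DME_ERROR = 1 << 3,
};

/* Only a PA_INIT data link layer error takes the fatal path. */
bool sketch_uic_error_is_fatal(uint32_t uic_error)
{
	return (uic_error & SKETCH_UIC_DL_PA_INIT_ERROR) != 0;
}
```

With such a check, NL/TL/DME errors alone would return without scheduling
the fatal error handler.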

Thanks,
Seungwon Jeon

>  			goto fatal_eh;
>  	}
> +	/*
> +	 * Other errors are either non-fatal or completed by the
> +	 * controller by updating OCS fields with success/failure.
> +	 */
>  	return;
> +
>  fatal_eh:
>  	/* handle fatal errors only when link is functional */
>  	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
> +		/* block commands from midlayer */
> +		scsi_block_requests(hba->host);
>  		/* block commands at driver layer until error is handled */
>  		hba->ufshcd_state = UFSHCD_STATE_ERROR;
>  		schedule_work(&hba->feh_workq);
> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
> index 7fcedd0..4ee4d1a 100644
> --- a/drivers/scsi/ufs/ufshcd.h
> +++ b/drivers/scsi/ufs/ufshcd.h
> @@ -185,6 +185,7 @@ struct ufs_dev_cmd {
>   * @feh_workq: Work queue for fatal controller error handling
>   * @eeh_work: Worker to handle exception events
>   * @errors: HBA errors
> + * @uic_error: UFS interconnect layer error status
>   * @dev_cmd: ufs device management command information
>   * @auto_bkops_enabled: to track whether bkops is enabled in device
>   */
> @@ -235,6 +236,7 @@ struct ufs_hba {
> 
>  	/* HBA Errors */
>  	u32 errors;
> +	u32 uic_error;
> 
>  	/* Device management request data */
>  	struct ufs_dev_cmd dev_cmd;
> diff --git a/drivers/scsi/ufs/ufshci.h b/drivers/scsi/ufs/ufshci.h
> index f1e1b74..36f68ef 100644
> --- a/drivers/scsi/ufs/ufshci.h
> +++ b/drivers/scsi/ufs/ufshci.h
> @@ -264,7 +264,7 @@ enum {
>  	UTP_DEVICE_TO_HOST	= 0x04000000,
>  };
> 
> -/* Overall command status values */
> +/* Overall command status values for transfer request */
>  enum {
>  	OCS_SUCCESS			= 0x0,
>  	OCS_INVALID_CMD_TABLE_ATTR	= 0x1,
> @@ -274,8 +274,21 @@ enum {
>  	OCS_PEER_COMM_FAILURE		= 0x5,
>  	OCS_ABORTED			= 0x6,
>  	OCS_FATAL_ERROR			= 0x7,
> -	OCS_INVALID_COMMAND_STATUS	= 0x0F,
> -	MASK_OCS			= 0x0F,
> +	OCS_INVALID_COMMAND_STATUS	= 0xF,
> +	MASK_OCS			= 0xFF,
> +};
> +
> +/* Overall command status values for task management request */
> +enum {
> +	OCS_TMR_SUCCESS			= 0x0,
> +	OCS_TMR_INVALID_ATTR		= 0x1,
> +	OCS_TMR_MISMATCH_REQ_SIZE	= 0x2,
> +	OCS_TMR_MISMATCH_RESP_SIZE	= 0x3,
> +	OCS_TMR_PEER_COMM_FAILURE	= 0x4,
> +	OCS_TMR_ABORTED			= 0x5,
> +	OCS_TMR_FATAL_ERROR		= 0x6,
> +	OCS_TMR_INVALID_COMMAND_STATUS	= 0xF,
> +	MASK_OCS_TMR			= 0xFF,
>  };
> 
>  /**
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation
  2013-07-19 13:56   ` Seungwon Jeon
@ 2013-07-19 18:26     ` Sujit Reddy Thumma
  2013-07-23  8:24       ` Seungwon Jeon
  0 siblings, 1 reply; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-19 18:26 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/19/2013 7:26 PM, Seungwon Jeon wrote:
> On Tue, July 09, 2013 Sujit Reddy Thumma wrote:
>> Currently, sending Task Management (TM) command to the card might
>> be broken in some scenarios as listed below:
>>
>> Problem: If there are more than 8 TM commands the implementation
>>           returns error to the caller.
>> Fix:     Wait for one of the slots to be emptied and send the command.
>>
>> Problem: Sometimes it is necessary for the caller to know the TM service
>>           response code to determine the task status.
>> Fix:     Propogate the service response to the caller.
>>
>> Problem: If the TM command times out no proper error recovery is
>>           implemented.
>> Fix:     Clear the command in the controller door-bell register, so that
>>           further commands for the same slot don't fail.
>>
>> Problem: While preparing the TM command descriptor, the task tag used
>>           should be unique across SCSI/NOP/QUERY/TM commands and not the
>> 	 task tag of the command which the TM command is trying to manage.
>> Fix:     Use a unique task tag instead of task tag of SCSI command.
>>
>> Problem: Since the TM command involves H/W communication, abruptly ending
>>           the request on kill interrupt signal might cause h/w malfunction.
>> Fix:     Wait for hardware completion interrupt with TASK_UNINTERRUPTIBLE
>>           set.
>>
>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>> ---
>>   drivers/scsi/ufs/ufshcd.c |  177 ++++++++++++++++++++++++++++++---------------
>>   drivers/scsi/ufs/ufshcd.h |    8 ++-
>>   2 files changed, 126 insertions(+), 59 deletions(-)
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> index af7d01d..a176421 100644
>> --- a/drivers/scsi/ufs/ufshcd.c
>> +++ b/drivers/scsi/ufs/ufshcd.c
>> @@ -53,6 +53,9 @@
>>   /* Query request timeout */
>>   #define QUERY_REQ_TIMEOUT 30 /* msec */
>>
>> +/* Task management command timeout */
>> +#define TM_CMD_TIMEOUT	100 /* msecs */
>> +
>>   /* Expose the flag value from utp_upiu_query.value */
>>   #define MASK_QUERY_UPIU_FLAG_LOC 0xFF
>>
>> @@ -190,13 +193,35 @@ ufshcd_get_tmr_ocs(struct utp_task_req_desc *task_req_descp)
>>   /**
>>    * ufshcd_get_tm_free_slot - get a free slot for task management request
>>    * @hba: per adapter instance
>> + * @free_slot: pointer to variable with available slot value
>>    *
>> - * Returns maximum number of task management request slots in case of
>> - * task management queue full or returns the free slot number
>> + * Get a free tag and lock it until ufshcd_put_tm_slot() is called.
>> + * Returns 0 if free slot is not available, else return 1 with tag value
>> + * in @free_slot.
>>    */
>> -static inline int ufshcd_get_tm_free_slot(struct ufs_hba *hba)
>> +static bool ufshcd_get_tm_free_slot(struct ufs_hba *hba, int *free_slot)
>> +{
>> +	int tag;
>> +	bool ret = false;
>> +
>> +	if (!free_slot)
>> +		goto out;
>> +
>> +	do {
>> +		tag = find_first_zero_bit(&hba->tm_slots_in_use, hba->nutmrs);
>> +		if (tag >= hba->nutmrs)
>> +			goto out;
>> +	} while (test_and_set_bit_lock(tag, &hba->tm_slots_in_use));
>> +
>> +	*free_slot = tag;
>> +	ret = true;
>> +out:
>> +	return ret;
>> +}
>> +
>> +static inline void ufshcd_put_tm_slot(struct ufs_hba *hba, int slot)
>>   {
>> -	return find_first_zero_bit(&hba->outstanding_tasks, hba->nutmrs);
>> +	clear_bit_unlock(slot, &hba->tm_slots_in_use);
>>   }
>>
>>   /**
>> @@ -1778,10 +1803,11 @@ static void ufshcd_slave_destroy(struct scsi_device *sdev)
>>    * ufshcd_task_req_compl - handle task management request completion
>>    * @hba: per adapter instance
>>    * @index: index of the completed request
>> + * @resp: task management service response
>>    *
>> - * Returns SUCCESS/FAILED
>> + * Returns non-zero value on error, zero on success
>>    */
>> -static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
>> +static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index, u8 *resp)
>>   {
>>   	struct utp_task_req_desc *task_req_descp;
>>   	struct utp_upiu_task_rsp *task_rsp_upiup;
>> @@ -1802,19 +1828,15 @@ static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
>>   				task_req_descp[index].task_rsp_upiu;
>>   		task_result = be32_to_cpu(task_rsp_upiup->header.dword_1);
>>   		task_result = ((task_result & MASK_TASK_RESPONSE) >> 8);
>> -
>> -		if (task_result != UPIU_TASK_MANAGEMENT_FUNC_COMPL &&
>> -		    task_result != UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED)
>> -			task_result = FAILED;
>> -		else
>> -			task_result = SUCCESS;
>> +		if (resp)
>> +			*resp = (u8)task_result;
>>   	} else {
>> -		task_result = FAILED;
>> -		dev_err(hba->dev,
>> -			"trc: Invalid ocs = %x\n", ocs_value);
>> +		dev_err(hba->dev, "%s: failed, ocs = 0x%x\n",
>> +				__func__, ocs_value);
>>   	}
>>   	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> -	return task_result;
>> +
>> +	return ocs_value;
>>   }
>>
>>   /**
>> @@ -2298,7 +2320,7 @@ static void ufshcd_tmc_handler(struct ufs_hba *hba)
>>
>>   	tm_doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
>>   	hba->tm_condition = tm_doorbell ^ hba->outstanding_tasks;
>> -	wake_up_interruptible(&hba->ufshcd_tm_wait_queue);
>> +	wake_up(&hba->tm_wq);
>>   }
>>
>>   /**
>> @@ -2348,38 +2370,61 @@ static irqreturn_t ufshcd_intr(int irq, void *__hba)
>>   	return retval;
>>   }
>>
>> +static int ufshcd_clear_tm_cmd(struct ufs_hba *hba, int tag)
>> +{
>> +	int err = 0;
>> +	u32 reg;
>> +	u32 mask = 1 << tag;
>> +	unsigned long flags;
>> +
>> +	if (!test_bit(tag, &hba->outstanding_reqs))
>> +		goto out;
>> +
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +	ufshcd_writel(hba, ~(1 << tag), REG_UTP_TASK_REQ_LIST_CLEAR);
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +
>> +	/* poll for max. 1 sec to clear door bell register by h/w */
>> +	reg = ufshcd_wait_for_register(hba,
>> +			REG_UTP_TASK_REQ_DOOR_BELL,
>> +			mask, 0, 1000, 1000);
>> +	if ((reg & mask) == mask)
>> +		err = -ETIMEDOUT;
>> +out:
>> +	return err;
>> +}
>> +
>>   /**
>>    * ufshcd_issue_tm_cmd - issues task management commands to controller
>>    * @hba: per adapter instance
>> - * @lrbp: pointer to local reference block
>> + * @lun_id: LUN ID to which TM command is sent
>> + * @task_id: task ID to which the TM command is applicable
>> + * @tm_function: task management function opcode
>> + * @tm_response: task management service response return value
>>    *
>> - * Returns SUCCESS/FAILED
>> + * Returns non-zero value on error, zero on success.
>>    */
>> -static int
>> -ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>> -		    struct ufshcd_lrb *lrbp,
>> -		    u8 tm_function)
>> +static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
>> +		u8 tm_function, u8 *tm_response)
>>   {
>>   	struct utp_task_req_desc *task_req_descp;
>>   	struct utp_upiu_task_req *task_req_upiup;
>>   	struct Scsi_Host *host;
>>   	unsigned long flags;
>> -	int free_slot = 0;
>> +	int free_slot;
>>   	int err;
>> +	int task_tag;
>>
>>   	host = hba->host;
>>
>> -	spin_lock_irqsave(host->host_lock, flags);
>> -
>> -	/* If task management queue is full */
>> -	free_slot = ufshcd_get_tm_free_slot(hba);
>> -	if (free_slot >= hba->nutmrs) {
>> -		spin_unlock_irqrestore(host->host_lock, flags);
>> -		dev_err(hba->dev, "Task management queue full\n");
>> -		err = FAILED;
>> -		goto out;
>> -	}
>> +	/*
>> +	 * Get free slot, sleep if slots are unavailable.
>> +	 * Even though we use wait_event() which sleeps indefinitely,
>> +	 * the maximum wait time is bounded by %TM_CMD_TIMEOUT.
>> +	 */
>> +	wait_event(hba->tm_tag_wq, ufshcd_get_tm_free_slot(hba, &free_slot));
>>
>> +	spin_lock_irqsave(host->host_lock, flags);
>>   	task_req_descp = hba->utmrdl_base_addr;
>>   	task_req_descp += free_slot;
>>
>> @@ -2391,18 +2436,15 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>>   	/* Configure task request UPIU */
>>   	task_req_upiup =
>>   		(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
>> +	task_tag = hba->nutrs + free_slot;
> Possible, did you intend 'hba->nutmrs', not 'hba->nutrs'?
> I think it's safer with hba->nutmrs if we can't sure that NUTRS is larger than NUTMRS.

It should be hba->nutrs and not hba->nutmrs.

The ranges are:
0 <= free_slot < hba->nutmrs
0 <= transfer_req_task_id < hba->nutrs
hba->nutrs <= tm_req_task_id < hba->nutmrs + hba->nutrs

Whatever the values of NUTRS/NUTMRS, the above gives a unique
task_id.
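
The ranges above can be checked with a small stand-alone sketch (the
helper names are hypothetical; only the "nutrs + free_slot" mapping comes
from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper mirroring "task_tag = hba->nutrs + free_slot". */
int tm_task_tag(int nutrs, int nutmrs, int free_slot)
{
	assert(free_slot >= 0 && free_slot < nutmrs);
	return nutrs + free_slot;
}

/*
 * True if every TM tag lies in [nutrs, nutrs + nutmrs), i.e. none of
 * them collides with a transfer request tag in [0, nutrs).
 */
bool tm_tags_are_unique(int nutrs, int nutmrs)
{
	int slot;

	for (slot = 0; slot < nutmrs; slot++) {
		int tag = tm_task_tag(nutrs, nutmrs, slot);

		if (tag < nutrs || tag >= nutrs + nutmrs)
			return false;
	}
	return true;
}
```

This holds even when NUTMRS is larger than NUTRS, since the offset is
always the number of transfer request slots.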


> 
>>   	task_req_upiup->header.dword_0 =
>>   		UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
>> -					      lrbp->lun, lrbp->task_tag);
>> +				lun_id, task_tag);
>>   	task_req_upiup->header.dword_1 =
>>   		UPIU_HEADER_DWORD(0, tm_function, 0, 0);
>>
>> -	task_req_upiup->input_param1 = lrbp->lun;
>> -	task_req_upiup->input_param1 =
>> -		cpu_to_be32(task_req_upiup->input_param1);
>> -	task_req_upiup->input_param2 = lrbp->task_tag;
>> -	task_req_upiup->input_param2 =
>> -		cpu_to_be32(task_req_upiup->input_param2);
>> +	task_req_upiup->input_param1 = cpu_to_be32(lun_id);
>> +	task_req_upiup->input_param2 = cpu_to_be32(task_id);
>>
>>   	/* send command to the controller */
>>   	__set_bit(free_slot, &hba->outstanding_tasks);
>> @@ -2411,20 +2453,24 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>>   	spin_unlock_irqrestore(host->host_lock, flags);
>>
>>   	/* wait until the task management command is completed */
>> -	err =
>> -	wait_event_interruptible_timeout(hba->ufshcd_tm_wait_queue,
>> -					 (test_bit(free_slot,
>> -					 &hba->tm_condition) != 0),
>> -					 60 * HZ);
>> +	err = wait_event_timeout(hba->tm_wq,
>> +			test_bit(free_slot, &hba->tm_condition),
>> +			msecs_to_jiffies(TM_CMD_TIMEOUT));
>>   	if (!err) {
>> -		dev_err(hba->dev,
>> -			"Task management command timed-out\n");
>> -		err = FAILED;
>> -		goto out;
>> +		dev_err(hba->dev, "%s: task management cmd 0x%.2x timed-out\n",
>> +				__func__, tm_function);
>> +		if (ufshcd_clear_tm_cmd(hba, free_slot))
>> +			dev_WARN(hba->dev, "%s: unable clear tm cmd (slot %d) after timeout\n",
>> +					__func__, free_slot);
>> +		err = -ETIMEDOUT;
>> +	} else {
>> +		err = ufshcd_task_req_compl(hba, free_slot, tm_response);
>>   	}
>> +
>>   	clear_bit(free_slot, &hba->tm_condition);
>> -	err = ufshcd_task_req_compl(hba, free_slot);
>> -out:
>> +	ufshcd_put_tm_slot(hba, free_slot);
>> +	wake_up(&hba->tm_tag_wq);
>> +
>>   	return err;
>>   }
>>
>> @@ -2441,14 +2487,22 @@ static int ufshcd_device_reset(struct scsi_cmnd *cmd)
>>   	unsigned int tag;
>>   	u32 pos;
>>   	int err;
>> +	u8 resp;
>> +	struct ufshcd_lrb *lrbp;
>>
>>   	host = cmd->device->host;
>>   	hba = shost_priv(host);
>>   	tag = cmd->request->tag;
>>
>> -	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_LOGICAL_RESET);
>> -	if (err == FAILED)
>> +	lrbp = &hba->lrb[tag];
>> +	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> Argument 2nd, 3rd can be replaced by lrbp.
> Then, we can reduce the number of argument.
> 

The TM issue command doesn't need to know about lrbp; it just needs the
LUN ID and task ID. This helps when we are not dealing with lrbps
and just want to issue some other TM command.
I believe an extra argument is not so costly on systems which
demand high performance UFS devices.

> Thanks,
> Seungwon Jeon
> 
>> +			UFS_LOGICAL_RESET, &resp);
>> +	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
>> +		err = FAILED;
>>   		goto out;
>> +	} else {
>> +		err = SUCCESS;
>> +	}
>>
>>   	for (pos = 0; pos < hba->nutrs; pos++) {
>>   		if (test_bit(pos, &hba->outstanding_reqs) &&
>> @@ -2505,6 +2559,8 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>>   	unsigned long flags;
>>   	unsigned int tag;
>>   	int err;
>> +	u8 resp;
>> +	struct ufshcd_lrb *lrbp;
>>
>>   	host = cmd->device->host;
>>   	hba = shost_priv(host);
>> @@ -2520,9 +2576,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>>   	}
>>   	spin_unlock_irqrestore(host->host_lock, flags);
>>
>> -	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_ABORT_TASK);
>> -	if (err == FAILED)
>> +	lrbp = &hba->lrb[tag];
>> +	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>> +			UFS_ABORT_TASK, &resp);
>> +	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
>> +		err = FAILED;
>>   		goto out;
>> +	} else {
>> +		err = SUCCESS;
>> +	}
>>
>>   	scsi_dma_unmap(cmd);
>>
>> @@ -2744,7 +2806,8 @@ int ufshcd_init(struct device *dev, struct ufs_hba **hba_handle,
>>   	host->max_cmd_len = MAX_CDB_SIZE;
>>
>>   	/* Initailize wait queue for task management */
>> -	init_waitqueue_head(&hba->ufshcd_tm_wait_queue);
>> +	init_waitqueue_head(&hba->tm_wq);
>> +	init_waitqueue_head(&hba->tm_tag_wq);
>>
>>   	/* Initialize work queues */
>>   	INIT_WORK(&hba->feh_workq, ufshcd_fatal_err_handler);
>> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
>> index 6c9bd35..5d4542c 100644
>> --- a/drivers/scsi/ufs/ufshcd.h
>> +++ b/drivers/scsi/ufs/ufshcd.h
>> @@ -174,8 +174,10 @@ struct ufs_dev_cmd {
>>    * @irq: Irq number of the controller
>>    * @active_uic_cmd: handle of active UIC command
>>    * @uic_cmd_mutex: mutex for uic command
>> - * @ufshcd_tm_wait_queue: wait queue for task management
>> + * @tm_wq: wait queue for task management
>> + * @tm_tag_wq: wait queue for free task management slots
>>    * @tm_condition: condition variable for task management
>> + * @tm_slots_in_use: bit map of task management request slots in use
>>    * @ufshcd_state: UFSHCD states
>>    * @intr_mask: Interrupt Mask Bits
>>    * @ee_ctrl_mask: Exception event control mask
>> @@ -216,8 +218,10 @@ struct ufs_hba {
>>   	struct uic_command *active_uic_cmd;
>>   	struct mutex uic_cmd_mutex;
>>
>> -	wait_queue_head_t ufshcd_tm_wait_queue;
>> +	wait_queue_head_t tm_wq;
>> +	wait_queue_head_t tm_tag_wq;
>>   	unsigned long tm_condition;
>> +	unsigned long tm_slots_in_use;
>>
>>   	u32 ufshcd_state;
>>   	u32 intr_mask;
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
>> of Code Aurora Forum, hosted by The Linux Foundation.

-- 
Regards,
Sujit


* Re: [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command
  2013-07-19 13:56   ` Seungwon Jeon
@ 2013-07-19 18:26     ` Sujit Reddy Thumma
  0 siblings, 0 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-19 18:26 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/19/2013 7:26 PM, Seungwon Jeon wrote:
> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
>> There is a possible race condition in the hardware when the abort
>> command is issued to terminate the ongoing SCSI command as described
>> below:
>>
>> - A bit in the door-bell register is set in the controller for a
>>    new SCSI command.
>> - In some rare situations, before controller get a chance to issue
>>    the command to the device, the software issued an abort command.
> It's interesting.
> I wonder when we can meet this situation.
> Is it possible if SCSI mid layer should send abort command as soon as the transfer command is issued?
> AFAIK abort command is followed if one command has timed out.
> That means command have been already issued and no response?
> If you had some problem, could you share?

You are right. This is a very rare case and probably doesn't happen at all
unless the SCSI error handling is changed. We found it just by
static analysis. Probably, at some point I may push a patch that tries
to reduce the read latencies while aborting a large write transfer when
I come across a UFS device that has command per LU as 1 :-)

I guess this is a good-to-have change. The path chosen is according to
the SCSI SAM-5 architecture specification, so I don't expect any issues
here.
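
For reference, the decision taken after each UFS_QUERY_TASK poll can be
sketched as a pure function (the enum names and values are illustrative
placeholders, not the real UPIU response encodings):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of one UFS_QUERY_TASK poll in the abort path. */
enum abort_step {
	STEP_SEND_ABORT,	/* task is in the device: send UFS_ABORT_TASK */
	STEP_RETRY_QUERY,	/* not in the device yet, doorbell still set */
	STEP_ALREADY_DONE,	/* not in the device and doorbell cleared */
	STEP_FAIL,		/* TM command failed or unexpected response */
};

/* Placeholder service responses. */
enum query_resp {
	RESP_FUNC_COMPL,	/* device does not hold the task */
	RESP_FUNC_SUCCEEDED,	/* device holds the task */
	RESP_OTHER,		/* any other service response */
};

enum abort_step abort_next_step(int err, enum query_resp resp,
				bool doorbell_set)
{
	if (err)
		return STEP_FAIL;

	switch (resp) {
	case RESP_FUNC_SUCCEEDED:
		return STEP_SEND_ABORT;
	case RESP_FUNC_COMPL:
		/* Command may still be in transit from host to device. */
		return doorbell_set ? STEP_RETRY_QUERY : STEP_ALREADY_DONE;
	default:
		return STEP_FAIL;
	}
}
```

The retry case is the window described in the commit text: the doorbell
bit is set but the device has not seen the command yet.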

> 
>> - If the device recieves abort command first then it returns success
> receives
> 
>>    because the command itself is not present.
>> - Now if the controller commits the command to device it will be
>>    processed.
>> - Software thinks that command is aborted and proceed while still
>>    the device is processing it.
>> - The software, controller and device may go out of sync because of
>>    this race condition.
>>
>> To avoid this, query task presence in the device before sending abort
>> task command so that after the abort operation, the command is guaranteed
>> to be non-existent in both controller and the device.
>>
>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>> ---
>>   drivers/scsi/ufs/ufshcd.c |   70 +++++++++++++++++++++++++++++++++++---------
>>   1 files changed, 55 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> index a176421..51ce096 100644
>> --- a/drivers/scsi/ufs/ufshcd.c
>> +++ b/drivers/scsi/ufs/ufshcd.c
>> @@ -2550,6 +2550,12 @@ static int ufshcd_host_reset(struct scsi_cmnd *cmd)
>>    * ufshcd_abort - abort a specific command
>>    * @cmd: SCSI command pointer
>>    *
>> + * Abort the pending command in device by sending UFS_ABORT_TASK task management
>> + * command, and in host controller by clearing the door-bell register. There can
>> + * be race between controller sending the command to the device while abort is
>> + * issued. To avoid that, first issue UFS_QUERY_TASK to check if the command is
>> + * really issued and then try to abort it.
>> + *
>>    * Returns SUCCESS/FAILED
>>    */
>>   static int ufshcd_abort(struct scsi_cmnd *cmd)
>> @@ -2558,7 +2564,8 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>>   	struct ufs_hba *hba;
>>   	unsigned long flags;
>>   	unsigned int tag;
>> -	int err;
>> +	int err = 0;
>> +	int poll_cnt;
>>   	u8 resp;
>>   	struct ufshcd_lrb *lrbp;
>>
>> @@ -2566,33 +2573,59 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>>   	hba = shost_priv(host);
>>   	tag = cmd->request->tag;
>>
>> -	spin_lock_irqsave(host->host_lock, flags);
>> +	/* If command is already aborted/completed, return SUCCESS */
>> +	if (!(test_bit(tag, &hba->outstanding_reqs)))
>> +		goto out;
>>
>> -	/* check if command is still pending */
>> -	if (!(test_bit(tag, &hba->outstanding_reqs))) {
>> -		err = FAILED;
>> -		spin_unlock_irqrestore(host->host_lock, flags);
>> +	lrbp = &hba->lrb[tag];
>> +	for (poll_cnt = 100; poll_cnt; poll_cnt--) {
>> +		err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>> +				UFS_QUERY_TASK, &resp);
>> +		if (!err && resp == UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED) {
>> +			/* cmd pending in the device */
>> +			break;
>> +		} else if (!err && resp == UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
>> +			u32 reg;
>> +
>> +			/*
>> +			 * cmd not pending in the device, check if it is
>> +			 * in transition.
>> +			 */
>> +			reg = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
>> +			if (reg & (1 << tag)) {
>> +				/* sleep for max. 2ms to stabilize */
>> +				usleep_range(1000, 2000);
>> +				continue;
>> +			}
>> +			/* command completed already */
>> +			goto out;
>> +		} else {
>> +			if (!err)
>> +				err = resp; /* service response error */
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	if (!poll_cnt) {
>> +		err = -EBUSY;
>>   		goto out;
>>   	}
>> -	spin_unlock_irqrestore(host->host_lock, flags);
>>
>> -	lrbp = &hba->lrb[tag];
>>   	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>>   			UFS_ABORT_TASK, &resp);
>>   	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
>> -		err = FAILED;
>> +		if (!err)
>> +			err = resp; /* service response error */
>>   		goto out;
>> -	} else {
>> -		err = SUCCESS;
>>   	}
>>
>> +	err = ufshcd_clear_cmd(hba, tag);
>> +	if (err)
>> +		goto out;
>> +
>>   	scsi_dma_unmap(cmd);
>>
>>   	spin_lock_irqsave(host->host_lock, flags);
>> -
>> -	/* clear the respective UTRLCLR register bit */
>> -	ufshcd_utrl_clear(hba, tag);
>> -
>>   	__clear_bit(tag, &hba->outstanding_reqs);
>>   	hba->lrb[tag].cmd = NULL;
>>   	spin_unlock_irqrestore(host->host_lock, flags);
>> @@ -2600,6 +2633,13 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>>   	clear_bit_unlock(tag, &hba->lrb_in_use);
>>   	wake_up(&hba->dev_cmd.tag_wq);
>>   out:
>> +	if (!err) {
>> +		err = SUCCESS;
>> +	} else {
>> +		dev_err(hba->dev, "%s: failed with err %d\n", __func__, err);
>> +		err = FAILED;
>> +	}
>> +
>>   	return err;
>>   }
>>
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
>> of Code Aurora Forum, hosted by The Linux Foundation.
>>

-- 
Regards,
Sujit


* Re: [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods
  2013-07-19 13:57   ` Seungwon Jeon
@ 2013-07-19 18:26     ` Sujit Reddy Thumma
  2013-07-23  8:27       ` Seungwon Jeon
  0 siblings, 1 reply; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-19 18:26 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/19/2013 7:27 PM, Seungwon Jeon wrote:
> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
>> As of now SCSI initiated error handling is broken because,
>> the reset APIs don't try to bring back the device initialized and
>> ready for further transfers.
>>
>> In case of timeouts, the scsi error handler takes care of handling aborts
>> and resets. Improve the error handling in such scenario by resetting the
>> device and host and re-initializing them in proper manner.
>>
>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>> ---
>>   drivers/scsi/ufs/ufshcd.c |  467 +++++++++++++++++++++++++++++++++++++++------
>>   drivers/scsi/ufs/ufshcd.h |    2 +
>>   2 files changed, 411 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> index 51ce096..b4c9910 100644
>> --- a/drivers/scsi/ufs/ufshcd.c
>> +++ b/drivers/scsi/ufs/ufshcd.c
>> @@ -69,9 +69,15 @@ enum {
>>
>>   /* UFSHCD states */
>>   enum {
>> -	UFSHCD_STATE_OPERATIONAL,
>>   	UFSHCD_STATE_RESET,
>>   	UFSHCD_STATE_ERROR,
>> +	UFSHCD_STATE_OPERATIONAL,
>> +};
>> +
>> +/* UFSHCD error handling flags */
>> +enum {
>> +	UFSHCD_EH_HOST_RESET_PENDING = (1 << 0),
>> +	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>>   };
>>
>>   /* Interrupt configuration options */
>> @@ -87,6 +93,22 @@ enum {
>>   	INT_AGGR_CONFIG,
>>   };
>>
>> +#define ufshcd_set_device_reset_pending(h) \
>> +	(h->eh_flags |= UFSHCD_EH_DEVICE_RESET_PENDING)
>> +#define ufshcd_set_host_reset_pending(h) \
>> +	(h->eh_flags |= UFSHCD_EH_HOST_RESET_PENDING)
>> +#define ufshcd_device_reset_pending(h) \
>> +	(h->eh_flags & UFSHCD_EH_DEVICE_RESET_PENDING)
>> +#define ufshcd_host_reset_pending(h) \
>> +	(h->eh_flags & UFSHCD_EH_HOST_RESET_PENDING)
>> +#define ufshcd_clear_device_reset_pending(h) \
>> +	(h->eh_flags &= ~UFSHCD_EH_DEVICE_RESET_PENDING)
>> +#define ufshcd_clear_host_reset_pending(h) \
>> +	(h->eh_flags &= ~UFSHCD_EH_HOST_RESET_PENDING)
>> +
>> +static void ufshcd_tmc_handler(struct ufs_hba *hba);
>> +static void ufshcd_async_scan(void *data, async_cookie_t cookie);
>> +
>>   /*
>>    * ufshcd_wait_for_register - wait for register value to change
>>    * @hba - per-adapter interface
>> @@ -851,9 +873,22 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>>
>>   	tag = cmd->request->tag;
>>
>> -	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL) {
>> +	switch (hba->ufshcd_state) {
> Lock is no needed for ufshcd_state?
> 
>> +	case UFSHCD_STATE_OPERATIONAL:
>> +		break;
>> +	case UFSHCD_STATE_RESET:
>>   		err = SCSI_MLQUEUE_HOST_BUSY;
>>   		goto out;
>> +	case UFSHCD_STATE_ERROR:
>> +		set_host_byte(cmd, DID_ERROR);
>> +		cmd->scsi_done(cmd);
>> +		goto out;
>> +	default:
>> +		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
>> +				__func__, hba->ufshcd_state);
>> +		set_host_byte(cmd, DID_BAD_TARGET);
>> +		cmd->scsi_done(cmd);
>> +		goto out;
>>   	}
>>
>>   	/* acquire the tag to make sure device cmds don't use it */
>> @@ -1573,8 +1608,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>>   	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>>   		scsi_unblock_requests(hba->host);
>>
>> -	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
>> -
>>   out:
>>   	return err;
>>   }
>> @@ -2273,6 +2306,106 @@ out:
>>   }
>>
>>   /**
>> + * ufshcd_utrl_is_rsr_enabled - check if run-stop register is enabled
>> + * @hba: per-adapter instance
>> + */
>> +static bool ufshcd_utrl_is_rsr_enabled(struct ufs_hba *hba)
>> +{
>> +	return ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_LIST_RUN_STOP) & 0x1;
>> +}
>> +
>> +/**
>> + * ufshcd_utmrl_is_rsr_enabled - check if run-stop register is enabled
>> + * @hba: per-adapter instance
>> + */
>> +static bool ufshcd_utmrl_is_rsr_enabled(struct ufs_hba *hba)
>> +{
>> +	return ufshcd_readl(hba, REG_UTP_TASK_REQ_LIST_RUN_STOP) & 0x1;
>> +}
>> +
>> +/**
>> + * ufshcd_complete_pending_tasks - complete outstanding tasks
>> + * @hba: per adapter instance
>> + *
>> + * Abort in-progress task management commands and wakeup
>> + * waiting threads.
>> + *
>> + * Returns non-zero error value when failed to clear all the commands.
>> + */
>> +static int ufshcd_complete_pending_tasks(struct ufs_hba *hba)
>> +{
>> +	u32 reg;
>> +	int err = 0;
>> +	unsigned long flags;
>> +
>> +	if (!hba->outstanding_tasks)
>> +		goto out;
>> +
>> +	/* Clear UTMRL only when run-stop is enabled */
>> +	if (ufshcd_utmrl_is_rsr_enabled(hba))
>> +		ufshcd_writel(hba, ~hba->outstanding_tasks,
>> +				REG_UTP_TASK_REQ_LIST_CLEAR);
>> +
>> +	/* poll for max. 1 sec to clear door bell register by h/w */
>> +	reg = ufshcd_wait_for_register(hba,
>> +			REG_UTP_TASK_REQ_DOOR_BELL,
>> +			hba->outstanding_tasks, 0, 1000, 1000);
>> +	if (reg & hba->outstanding_tasks)
>> +		err = -ETIMEDOUT;
>> +
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +	/* complete commands that were cleared out */
>> +	ufshcd_tmc_handler(hba);
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +out:
>> +	if (err)
>> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
>> +				__func__, reg);
>> +	return err;
>> +}
>> +
>> +/**
>> + * ufshcd_complete_pending_reqs - complete outstanding requests
>> + * @hba: per adapter instance
>> + *
>> + * Abort in-progress transfer request commands and return them to SCSI.
>> + *
>> + * Returns non-zero error value when failed to clear all the commands.
>> + */
>> +static int ufshcd_complete_pending_reqs(struct ufs_hba *hba)
>> +{
>> +	u32 reg;
>> +	int err = 0;
>> +	unsigned long flags;
>> +
>> +	/* check if we completed all of them */
>> +	if (!hba->outstanding_reqs)
>> +		goto out;
>> +
>> +	/* Clear UTRL only when run-stop is enabled */
>> +	if (ufshcd_utrl_is_rsr_enabled(hba))
>> +		ufshcd_writel(hba, ~hba->outstanding_reqs,
>> +				REG_UTP_TRANSFER_REQ_LIST_CLEAR);
>> +
>> +	/* poll for max. 1 sec to clear door bell register by h/w */
>> +	reg = ufshcd_wait_for_register(hba,
>> +			REG_UTP_TRANSFER_REQ_DOOR_BELL,
>> +			hba->outstanding_reqs, 0, 1000, 1000);
>> +	if (reg & hba->outstanding_reqs)
>> +		err = -ETIMEDOUT;
>> +
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +	/* complete commands that were cleared out */
>> +	ufshcd_transfer_req_compl(hba);
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +out:
>> +	if (err)
>> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
>> +				__func__, reg);
>> +	return err;
>> +}
>> +
>> +/**
>>    * ufshcd_fatal_err_handler - handle fatal errors
>>    * @hba: per adapter instance
>>    */
>> @@ -2306,8 +2439,12 @@ static void ufshcd_err_handler(struct ufs_hba *hba)
>>   	}
>>   	return;
>>   fatal_eh:
>> -	hba->ufshcd_state = UFSHCD_STATE_ERROR;
>> -	schedule_work(&hba->feh_workq);
>> +	/* handle fatal errors only when link is functional */
>> +	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
>> +		/* block commands at driver layer until error is handled */
>> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
> Locking omitted for ufshcd_state?
This is called in interrupt context with the spin lock already held.

> 
>> +		schedule_work(&hba->feh_workq);
>> +	}
>>   }
>>
>>   /**
>> @@ -2475,75 +2612,155 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
>>   }
>>
>>   /**
>> - * ufshcd_device_reset - reset device and abort all the pending commands
>> - * @cmd: SCSI command pointer
>> + * ufshcd_dme_end_point_reset - Notify device Unipro to perform reset
>> + * @hba: per adapter instance
>>    *
>> - * Returns SUCCESS/FAILED
>> + * UIC_CMD_DME_END_PT_RST resets the UFS device completely, the UFS flags,
>> + * attributes and descriptors are reset to default state. Callers are
>> + * expected to initialize the whole device again after this.
>> + *
>> + * Returns zero on success, non-zero on failure
>>    */
>> -static int ufshcd_device_reset(struct scsi_cmnd *cmd)
>> +static int ufshcd_dme_end_point_reset(struct ufs_hba *hba)
>>   {
>> -	struct Scsi_Host *host;
>> -	struct ufs_hba *hba;
>> -	unsigned int tag;
>> -	u32 pos;
>> -	int err;
>> -	u8 resp;
>> -	struct ufshcd_lrb *lrbp;
>> +	struct uic_command uic_cmd = {0};
>> +	int ret;
>>
>> -	host = cmd->device->host;
>> -	hba = shost_priv(host);
>> -	tag = cmd->request->tag;
>> +	uic_cmd.command = UIC_CMD_DME_END_PT_RST;
>>
>> -	lrbp = &hba->lrb[tag];
>> -	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>> -			UFS_LOGICAL_RESET, &resp);
>> -	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
>> -		err = FAILED;
>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>> +	if (ret)
>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>> +
>> +	return ret;
>> +}
>> +
>> +/**
>> + * ufshcd_dme_reset - Local UniPro reset
>> + * @hba: per adapter instance
>> + *
>> + * Returns zero on success, non-zero on failure
>> + */
>> +static int ufshcd_dme_reset(struct ufs_hba *hba)
>> +{
>> +	struct uic_command uic_cmd = {0};
>> +	int ret;
>> +
>> +	uic_cmd.command = UIC_CMD_DME_RESET;
>> +
>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>> +	if (ret)
>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>> +
>> +	return ret;
>> +
>> +}
>> +
>> +/**
>> + * ufshcd_dme_enable - Local UniPro DME Enable
>> + * @hba: per adapter instance
>> + *
>> + * Returns zero on success, non-zero on failure
>> + */
>> +static int ufshcd_dme_enable(struct ufs_hba *hba)
>> +{
>> +	struct uic_command uic_cmd = {0};
>> +	int ret;
>> +	uic_cmd.command = UIC_CMD_DME_ENABLE;
>> +
>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>> +	if (ret)
>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>> +
>> +	return ret;
>> +
>> +}
>> +
>> +/**
>> + * ufshcd_device_reset_and_restore - reset and restore device
>> + * @hba: per-adapter instance
>> + *
>> + * Note that the device reset issues DME_END_POINT_RESET which
>> + * may reset entire device and restore device attributes to
>> + * default state.
>> + *
>> + * Returns zero on success, non-zero on failure
>> + */
>> +static int ufshcd_device_reset_and_restore(struct ufs_hba *hba)
>> +{
>> +	int err = 0;
>> +	u32 reg;
>> +
>> +	err = ufshcd_dme_end_point_reset(hba);
>> +	if (err)
>> +		goto out;
>> +
>> +	/* restore communication with the device */
>> +	err = ufshcd_dme_reset(hba);
>> +	if (err)
>>   		goto out;
>> -	} else {
>> -		err = SUCCESS;
>> -	}
>>
>> -	for (pos = 0; pos < hba->nutrs; pos++) {
>> -		if (test_bit(pos, &hba->outstanding_reqs) &&
>> -		    (hba->lrb[tag].lun == hba->lrb[pos].lun)) {
>> +	err = ufshcd_dme_enable(hba);
>> +	if (err)
>> +		goto out;
>>
>> -			/* clear the respective UTRLCLR register bit */
>> -			ufshcd_utrl_clear(hba, pos);
>> +	err = ufshcd_dme_link_startup(hba);
> Is UFS_LOGICAL_RESET no longer used?

Yes, I don't see any use for it as of now, given that we are using
dme_end_point_reset (refer to figure 7.4 of the UFS 1.1 spec). Also,
the error handling section of the UFS spec doesn't mention anything
about LOGICAL_RESET. If you know of a valid use case where we need a
LUN reset, please let me know and I will bring it back.

> ufshcd_device_reset_and_restore has the role of resetting the device.
> Both ufshcd_dme_reset and ufshcd_dme_enable act on the local side, not the remote one.
> Should we do those for the host, including link-startup, here?

Yes, it is needed. After DME_ENDPOINT_RESET the remote link goes into
the link-down state. To initialize the link, the host needs to send
DME_LINKSTARTUP, but according to the UniPro spec, link startup can
only be issued when the local UniPro is in the link-down state. So we
first need to take the local UniPro from link-up to disabled to
link-down using the DME_RESET and DME_ENABLE commands, and then issue
DME_LINKSTARTUP to re-initialize the link.
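
To illustrate the ordering, here is a minimal userspace sketch of the
local UniPro state transitions described above. It is only a model:
the enum names and helpers are hypothetical, it does not talk to any
hardware, and rejecting DME_ENABLE outside the disabled state is a
modelling assumption of mine, not something taken from the driver.

```c
#include <stdbool.h>

/* Hypothetical model of the local UniPro states named in the reply. */
enum unipro_state { UNIPRO_DISABLED, UNIPRO_LINK_DOWN, UNIPRO_LINK_UP };

/* Hypothetical stand-ins for the DME commands; not the driver's API. */
enum dme_cmd { DME_RESET, DME_ENABLE, DME_LINKSTARTUP };

/*
 * Apply one DME command to the modelled local state.
 * Returns true if the command is accepted in the current state.
 */
static bool unipro_apply(enum unipro_state *st, enum dme_cmd cmd)
{
	switch (cmd) {
	case DME_RESET:
		/* Modelled as always accepted; the stack ends up disabled. */
		*st = UNIPRO_DISABLED;
		return true;
	case DME_ENABLE:
		/* Assumption: only meaningful from the disabled state. */
		if (*st != UNIPRO_DISABLED)
			return false;
		*st = UNIPRO_LINK_DOWN;
		return true;
	case DME_LINKSTARTUP:
		/* Link startup is only valid from the link-down state. */
		if (*st != UNIPRO_LINK_DOWN)
			return false;
		*st = UNIPRO_LINK_UP;
		return true;
	}
	return false;
}

/* Re-initialize the link after the remote end has been reset. */
static bool unipro_relink(enum unipro_state *st)
{
	return unipro_apply(st, DME_RESET) &&
	       unipro_apply(st, DME_ENABLE) &&
	       unipro_apply(st, DME_LINKSTARTUP);
}
```

In this model, issuing DME_LINKSTARTUP directly from link-up is
rejected, which is why the reset and enable steps come first.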

> 
>> +	if (err)
>> +		goto out;
>>
>> -			clear_bit(pos, &hba->outstanding_reqs);
>> +	/* check if link is up and device is detected */
>> +	reg = ufshcd_readl(hba, REG_CONTROLLER_STATUS);
>> +	if (!ufshcd_is_device_present(reg)) {
>> +		dev_err(hba->dev, "Device not present\n");
>> +		err = -ENXIO;
>> +		goto out;
>> +	}
>>
>> -			if (hba->lrb[pos].cmd) {
>> -				scsi_dma_unmap(hba->lrb[pos].cmd);
>> -				hba->lrb[pos].cmd->result =
>> -					DID_ABORT << 16;
>> -				hba->lrb[pos].cmd->scsi_done(cmd);
>> -				hba->lrb[pos].cmd = NULL;
>> -				clear_bit_unlock(pos, &hba->lrb_in_use);
>> -				wake_up(&hba->dev_cmd.tag_wq);
>> -			}
>> -		}
>> -	} /* end of for */
>> +	ufshcd_clear_device_reset_pending(hba);
>>   out:
>> +	dev_dbg(hba->dev, "%s: done err = %d\n", __func__, err);
>>   	return err;
>>   }
>>
>>   /**
>> - * ufshcd_host_reset - Main reset function registered with scsi layer
>> - * @cmd: SCSI command pointer
>> + * ufshcd_host_reset_and_restore - reset and restore host controller
>> + * @hba: per-adapter instance
>>    *
>> - * Returns SUCCESS/FAILED
>> + * Note that host controller reset may issue DME_RESET to
>> + * local and remote (device) Uni-Pro stack and the attributes
>> + * are reset to default state.
>> + *
>> + * Returns zero on success, non-zero on failure
>>    */
>> -static int ufshcd_host_reset(struct scsi_cmnd *cmd)
>> +static int ufshcd_host_reset_and_restore(struct ufs_hba *hba)
>>   {
>> -	struct ufs_hba *hba;
>> +	int err;
>> +	async_cookie_t cookie;
>> +	unsigned long flags;
>>
>> -	hba = shost_priv(cmd->device->host);
>> +	/* Reset the host controller */
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +	ufshcd_hba_stop(hba);
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>
>> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>> -		return SUCCESS;
>> +	err = ufshcd_hba_enable(hba);
>> +	if (err)
>> +		goto out;
>>
>> -	return ufshcd_do_reset(hba);
>> +	/* Establish the link again and restore the device */
>> +	cookie = async_schedule(ufshcd_async_scan, hba);
>> +	/* wait for async scan to be completed */
>> +	async_synchronize_cookie(++cookie);
>> +	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL)
>> +		err = -EIO;
>> +out:
>> +	if (err)
>> +		dev_err(hba->dev, "%s: Host init failed %d\n", __func__, err);
>> +	else
>> +		ufshcd_clear_host_reset_pending(hba);
>> +
>> +	dev_dbg(hba->dev, "%s: done err = %d\n", __func__, err);
>> +	return err;
>>   }
>>
>>   /**
>> @@ -2644,6 +2861,134 @@ out:
>>   }
>>
>>   /**
>> + * ufshcd_reset_and_restore - resets device or host or both
>> + * @hba: per-adapter instance
>> + *
>> + * Reset and recover device, host and re-establish link. This
>> + * is helpful to recover the communication in fatal error conditions.
>> + *
>> + * Returns zero on success, non-zero on failure
>> + */
>> +static int ufshcd_reset_and_restore(struct ufs_hba *hba)
>> +{
>> +	int err = 0;
>> +
>> +	if (ufshcd_device_reset_pending(hba) &&
>> +			!ufshcd_host_reset_pending(hba)) {
>> +		err = ufshcd_device_reset_and_restore(hba);
>> +		if (err) {
>> +			ufshcd_clear_device_reset_pending(hba);
>> +			ufshcd_set_host_reset_pending(hba);
>> +		}
>> +	}
>> +
>> +	if (ufshcd_host_reset_pending(hba))
>> +		err = ufshcd_host_reset_and_restore(hba);
>> +
>> +	/*
>> +	 * Due to reset the door-bell might be cleared, clear
>> +	 * outstanding requests in s/w here.
>> +	 */
>> +	ufshcd_complete_pending_reqs(hba);
> After above, pending requests will be completed by ufshcd_transfer_req_compl.
> 'cmd->result' which is reported to scsi mid-layer should be a failure.
> I think it may not be guaranteed.

ufshcd_transfer_req_compl() checks the OCS value and sets the failure
accordingly. If the command timed out and we then did a reset, the OCS
value would be 0xF. If the command hit a fatal error and we did a
reset, the h/w would already have updated the OCS value to the
relevant fatal error cause.
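
A minimal sketch of that reasoning, under stated assumptions: the
numeric OCS and DID_* values below are what I believe ufshci.h and
scsi.h define, and ocs_to_result() is a hypothetical, much simplified
stand-in for the driver's status translation, not the actual code.

```c
#include <stdint.h>

/* OCS values as I understand them from ufshci.h; treat as assumptions. */
enum {
	OCS_SUCCESS                = 0x0,
	OCS_ABORTED                = 0x6,
	OCS_FATAL_ERROR            = 0x7,
	OCS_INVALID_COMMAND_STATUS = 0xF,
};

/* SCSI host byte codes as I understand them from scsi.h; assumptions. */
enum { DID_OK = 0x00, DID_ABORT = 0x05, DID_ERROR = 0x07 };

/*
 * Hypothetical helper: translate an OCS value into a SCSI result word
 * with the host byte in bits 16..23. Anything other than success ends
 * up with a non-OK host byte.
 */
static uint32_t ocs_to_result(unsigned int ocs)
{
	switch (ocs) {
	case OCS_SUCCESS:
		return (uint32_t)DID_OK << 16;
	case OCS_ABORTED:
		return (uint32_t)DID_ABORT << 16;
	default:
		/* Includes 0xF, i.e. a slot the h/w never completed. */
		return (uint32_t)DID_ERROR << 16;
	}
}
```

So a request that was still pending at reset time (OCS 0xF) is not
reported as successful. Patch 4/4 of this series maps that value to
DID_REQUEUE instead of a plain error.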

> 
>> +	ufshcd_complete_pending_tasks(hba);
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * ufshcd_eh_device_reset_handler - device reset handler registered to
>> + *                                    scsi layer.
>> + * @cmd - SCSI command pointer
>> + *
>> + * Returns SUCCESS/FAILED
>> + */
>> +static int ufshcd_eh_device_reset_handler(struct scsi_cmnd *cmd)
>> +{
>> +	struct ufs_hba *hba;
>> +	int err;
>> +	unsigned long flags;
>> +
>> +	hba = shost_priv(cmd->device->host);
>> +
>> +	/*
>> +	 * Check if there is any race with fatal error handling.
>> +	 * If so, wait for it to complete. Even though fatal error
>> +	 * handling does reset and restore in some cases, don't assume
>> +	 * anything out of it. We are just avoiding race here.
>> +	 */
>> +	do {
>> +		spin_lock_irqsave(hba->host->host_lock, flags);
>> +		if (!(work_pending(&hba->feh_workq) ||
>> +				hba->ufshcd_state == UFSHCD_STATE_RESET))
>> +			break;
>> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
>> +		flush_work_sync(&hba->feh_workq);
>> +	} while (1);
>> +
>> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
>> +	ufshcd_set_device_reset_pending(hba);
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +
>> +	err = ufshcd_reset_and_restore(hba);
>> +
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +	if (!err) {
>> +		err = SUCCESS;
>> +		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
>> +	} else {
>> +		err = FAILED;
>> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
>> +	}
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * ufshcd_eh_host_reset_handler - host reset handler registered to scsi layer
>> + * @cmd - SCSI command pointer
>> + *
>> + * Returns SUCCESS/FAILED
>> + */
>> +static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
>> +{
>> +	struct ufs_hba *hba;
>> +	int err;
>> +	unsigned long flags;
>> +
>> +	hba = shost_priv(cmd->device->host);
>> +
>> +	do {
>> +		spin_lock_irqsave(hba->host->host_lock, flags);
>> +		if (!(work_pending(&hba->feh_workq) ||
>> +				hba->ufshcd_state == UFSHCD_STATE_RESET))
>> +			break;
>> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
>> +		flush_work_sync(&hba->feh_workq);
>> +	} while (1);
>> +
>> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
>> +	ufshcd_set_host_reset_pending(hba);
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +
>> +	err = ufshcd_reset_and_restore(hba);
>> +
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +	if (!err) {
>> +		err = SUCCESS;
>> +		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
>> +	} else {
>> +		err = FAILED;
>> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
>> +	}
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +
>> +	return err;
>> +}
> Both 'ufshcd_eh_device_reset_handler' and 'ufshcd_eh_host_reset_handler' have
> common routine. If possible, it would be better to gather in one function.

okay.
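
One possible shape for the merged routine, as a rough userspace
sketch only. All names here (hba_model, eh_reset_common, the stubs)
are hypothetical, and the wait for the fatal error work and the
host_lock handling of the real handlers are left out.

```c
/* Hypothetical, simplified stand-ins for the driver's state and flags. */
enum hba_state { STATE_OPERATIONAL, STATE_RESET, STATE_ERROR };
enum { EH_HOST_RESET_PENDING = 1 << 0, EH_DEVICE_RESET_PENDING = 1 << 1 };
enum { EH_SUCCESS, EH_FAILED };	/* stand-ins for SCSI SUCCESS/FAILED */

struct hba_model {
	enum hba_state state;
	unsigned int eh_flags;
	/* Stand-in for ufshcd_reset_and_restore(); returns 0 on success. */
	int (*reset_and_restore)(struct hba_model *hba);
};

/* Common body shared by both reset handlers; only the flag differs. */
static int eh_reset_common(struct hba_model *hba, unsigned int pending_flag)
{
	int err;

	hba->state = STATE_RESET;
	hba->eh_flags |= pending_flag;

	err = hba->reset_and_restore(hba);

	hba->state = err ? STATE_ERROR : STATE_OPERATIONAL;
	return err ? EH_FAILED : EH_SUCCESS;
}

static int eh_device_reset(struct hba_model *hba)
{
	return eh_reset_common(hba, EH_DEVICE_RESET_PENDING);
}

static int eh_host_reset(struct hba_model *hba)
{
	return eh_reset_common(hba, EH_HOST_RESET_PENDING);
}

/* Stubs used only to exercise the sketch. */
static int stub_reset_ok(struct hba_model *hba)
{
	hba->eh_flags = 0;
	return 0;
}

static int stub_reset_fail(struct hba_model *hba)
{
	(void)hba;
	return -1;
}
```

The two entry points registered with the SCSI layer then become thin
wrappers that differ only in the pending flag they pass.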

> 
>> +
>> +/**
>>    * ufshcd_async_scan - asynchronous execution for link startup
>>    * @data: data pointer to pass to this function
>>    * @cookie: cookie data
>> @@ -2667,8 +3012,14 @@ static void ufshcd_async_scan(void *data, async_cookie_t cookie)
>>
>>   	hba->auto_bkops_enabled = false;
>>   	ufshcd_enable_auto_bkops(hba);
>> -	scsi_scan_host(hba->host);
>> -	pm_runtime_put_sync(hba->dev);
>> +	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> Is no lock needed here?

This runs synchronously with respect to all other operations, so I
don't see any race that would require a lock here.

> 
> Thanks,
> Seungwon Jeon
> 
>> +
>> +	/* If we are in error handling context no need to scan the host */
>> +	if (!(ufshcd_device_reset_pending(hba) ||
>> +			ufshcd_host_reset_pending(hba))) {
>> +		scsi_scan_host(hba->host);
>> +		pm_runtime_put_sync(hba->dev);
>> +	}
>>   out:
>>   	return;
>>   }
>> @@ -2681,8 +3032,8 @@ static struct scsi_host_template ufshcd_driver_template = {
>>   	.slave_alloc		= ufshcd_slave_alloc,
>>   	.slave_destroy		= ufshcd_slave_destroy,
>>   	.eh_abort_handler	= ufshcd_abort,
>> -	.eh_device_reset_handler = ufshcd_device_reset,
>> -	.eh_host_reset_handler	= ufshcd_host_reset,
>> +	.eh_device_reset_handler = ufshcd_eh_device_reset_handler,
>> +	.eh_host_reset_handler   = ufshcd_eh_host_reset_handler,
>>   	.this_id		= -1,
>>   	.sg_tablesize		= SG_ALL,
>>   	.cmd_per_lun		= UFSHCD_CMD_PER_LUN,
>> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
>> index 5d4542c..7fcedd0 100644
>> --- a/drivers/scsi/ufs/ufshcd.h
>> +++ b/drivers/scsi/ufs/ufshcd.h
>> @@ -179,6 +179,7 @@ struct ufs_dev_cmd {
>>    * @tm_condition: condition variable for task management
>>    * @tm_slots_in_use: bit map of task management request slots in use
>>    * @ufshcd_state: UFSHCD states
>> + * @eh_flags: Error handling flags
>>    * @intr_mask: Interrupt Mask Bits
>>    * @ee_ctrl_mask: Exception event control mask
>>    * @feh_workq: Work queue for fatal controller error handling
>> @@ -224,6 +225,7 @@ struct ufs_hba {
>>   	unsigned long tm_slots_in_use;
>>
>>   	u32 ufshcd_state;
>> +	u32 eh_flags;
>>   	u32 intr_mask;
>>   	u16 ee_ctrl_mask;
>>
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
>> of Code Aurora Forum, hosted by The Linux Foundation.
>>
> 

-- 
Regards,
Sujit


* Re: [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling
  2013-07-19 13:58   ` Seungwon Jeon
@ 2013-07-19 18:26     ` Sujit Reddy Thumma
  2013-07-23  8:34       ` Seungwon Jeon
  0 siblings, 1 reply; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-19 18:26 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/19/2013 7:28 PM, Seungwon Jeon wrote:
> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
>> Error handling in UFS driver is broken and resets the host controller
>> for fatal errors without re-initialization. Correct the fatal error
>> handling sequence according to UFS Host Controller Interface (HCI)
>> v1.1 specification.
>>
>> o Upon determining fatal error condition the host controller may hang
>>    forever until a reset is applied, so just retrying the command doesn't
>>    work without a reset. So, the reset is applied in the driver context
>>    in a separate work and SCSI mid-layer isn't informed until reset is
>>    applied.
>>
>> o Processed requests which are completed without error are reported to
>>    SCSI layer as successful and any pending commands that are not started
>>    yet or are not cause of the error are re-queued into scsi midlayer queue.
>>    For the command that caused error, host controller or device is reset
>>    and DID_ERROR is returned for command retry after applying reset.
>>
>> o SCSI is informed about the expected Unit-Attentioni exception from the
> Attention'i',  typo.
Okay.

> 
>>    device for the immediate command after a reset so that the SCSI layer
>>    take necessary steps to establish communication with the device.
>>
>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>> ---
>>   drivers/scsi/ufs/ufshcd.c |  349 +++++++++++++++++++++++++++++++++++---------
>>   drivers/scsi/ufs/ufshcd.h |    2 +
>>   drivers/scsi/ufs/ufshci.h |   19 ++-
>>   3 files changed, 295 insertions(+), 75 deletions(-)
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> index b4c9910..2a3874f 100644
>> --- a/drivers/scsi/ufs/ufshcd.c
>> +++ b/drivers/scsi/ufs/ufshcd.c
>> @@ -80,6 +80,14 @@ enum {
>>   	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>>   };
>>
>> +/* UFSHCD UIC layer error flags */
>> +enum {
>> +	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0), /* Data link layer error */
>> +	UFSHCD_UIC_NL_ERROR = (1 << 1), /* Network layer error */
>> +	UFSHCD_UIC_TL_ERROR = (1 << 2), /* Transport Layer error */
>> +	UFSHCD_UIC_DME_ERROR = (1 << 3), /* DME error */
>> +};
>> +
>>   /* Interrupt configuration options */
>>   enum {
>>   	UFSHCD_INT_DISABLE,
>> @@ -108,6 +116,7 @@ enum {
>>
>>   static void ufshcd_tmc_handler(struct ufs_hba *hba);
>>   static void ufshcd_async_scan(void *data, async_cookie_t cookie);
>> +static int ufshcd_reset_and_restore(struct ufs_hba *hba);
>>
>>   /*
>>    * ufshcd_wait_for_register - wait for register value to change
>> @@ -1605,9 +1614,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>>   		goto out;
>>   	}
>>
>> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>> -		scsi_unblock_requests(hba->host);
>> -
>>   out:
>>   	return err;
>>   }
>> @@ -1733,66 +1739,6 @@ static int ufshcd_validate_dev_connection(struct ufs_hba *hba)
>>   }
>>
>>   /**
>> - * ufshcd_do_reset - reset the host controller
>> - * @hba: per adapter instance
>> - *
>> - * Returns SUCCESS/FAILED
>> - */
>> -static int ufshcd_do_reset(struct ufs_hba *hba)
>> -{
>> -	struct ufshcd_lrb *lrbp;
>> -	unsigned long flags;
>> -	int tag;
>> -
>> -	/* block commands from midlayer */
>> -	scsi_block_requests(hba->host);
>> -
>> -	spin_lock_irqsave(hba->host->host_lock, flags);
>> -	hba->ufshcd_state = UFSHCD_STATE_RESET;
>> -
>> -	/* send controller to reset state */
>> -	ufshcd_hba_stop(hba);
>> -	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> -
>> -	/* abort outstanding commands */
>> -	for (tag = 0; tag < hba->nutrs; tag++) {
>> -		if (test_bit(tag, &hba->outstanding_reqs)) {
>> -			lrbp = &hba->lrb[tag];
>> -			if (lrbp->cmd) {
>> -				scsi_dma_unmap(lrbp->cmd);
>> -				lrbp->cmd->result = DID_RESET << 16;
>> -				lrbp->cmd->scsi_done(lrbp->cmd);
>> -				lrbp->cmd = NULL;
>> -				clear_bit_unlock(tag, &hba->lrb_in_use);
>> -			}
>> -		}
>> -	}
>> -
>> -	/* complete device management command */
>> -	if (hba->dev_cmd.complete)
>> -		complete(hba->dev_cmd.complete);
>> -
>> -	/* clear outstanding request/task bit maps */
>> -	hba->outstanding_reqs = 0;
>> -	hba->outstanding_tasks = 0;
>> -
>> -	/* Host controller enable */
>> -	if (ufshcd_hba_enable(hba)) {
>> -		dev_err(hba->dev,
>> -			"Reset: Controller initialization failed\n");
>> -		return FAILED;
>> -	}
>> -
>> -	if (ufshcd_link_startup(hba)) {
>> -		dev_err(hba->dev,
>> -			"Reset: Link start-up failed\n");
>> -		return FAILED;
>> -	}
>> -
>> -	return SUCCESS;
>> -}
>> -
>> -/**
>>    * ufshcd_slave_alloc - handle initial SCSI device configurations
>>    * @sdev: pointer to SCSI device
>>    *
>> @@ -1809,6 +1755,9 @@ static int ufshcd_slave_alloc(struct scsi_device *sdev)
>>   	sdev->use_10_for_ms = 1;
>>   	scsi_set_tag_type(sdev, MSG_SIMPLE_TAG);
>>
>> +	/* allow SCSI layer to restart the device in case of errors */
>> +	sdev->allow_restart = 1;
>> +
>>   	/*
>>   	 * Inform SCSI Midlayer that the LUN queue depth is same as the
>>   	 * controller queue depth. If a LUN queue depth is less than the
>> @@ -2013,6 +1962,9 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
>>   	case OCS_ABORTED:
>>   		result |= DID_ABORT << 16;
>>   		break;
>> +	case OCS_INVALID_COMMAND_STATUS:
>> +		result |= DID_REQUEUE << 16;
>> +		break;
>>   	case OCS_INVALID_CMD_TABLE_ATTR:
>>   	case OCS_INVALID_PRDT_ATTR:
>>   	case OCS_MISMATCH_DATA_BUF_SIZE:
>> @@ -2405,42 +2357,295 @@ out:
>>   	return err;
>>   }
>>
>> +static void ufshcd_decide_eh_xfer_req(struct ufs_hba *hba, u32 ocs)
>> +{
>> +	switch (ocs) {
>> +	case OCS_SUCCESS:
>> +	case OCS_INVALID_COMMAND_STATUS:
>> +		break;
>> +	case OCS_MISMATCH_DATA_BUF_SIZE:
>> +	case OCS_MISMATCH_RESP_UPIU_SIZE:
>> +	case OCS_PEER_COMM_FAILURE:
>> +	case OCS_FATAL_ERROR:
>> +	case OCS_ABORTED:
>> +	case OCS_INVALID_CMD_TABLE_ATTR:
>> +	case OCS_INVALID_PRDT_ATTR:
>> +		ufshcd_set_host_reset_pending(hba);
> Should host be reset on ocs error, including below ufshcd_decide_eh_task_req?
> It's just overall command status.

Yes, the error handling section of the UFS 1.1 spec says so.

> 
>> +		break;
>> +	default:
>> +		dev_err(hba->dev, "%s: unknown OCS 0x%x\n",
>> +				__func__, ocs);
>> +		BUG();
>> +	}
>> +}
>> +
>> +static void ufshcd_decide_eh_task_req(struct ufs_hba *hba, u32 ocs)
>> +{
>> +	switch (ocs) {
>> +	case OCS_TMR_SUCCESS:
>> +	case OCS_TMR_INVALID_COMMAND_STATUS:
>> +		break;
>> +	case OCS_TMR_MISMATCH_REQ_SIZE:
>> +	case OCS_TMR_MISMATCH_RESP_SIZE:
>> +	case OCS_TMR_PEER_COMM_FAILURE:
>> +	case OCS_TMR_INVALID_ATTR:
>> +	case OCS_TMR_ABORTED:
>> +	case OCS_TMR_FATAL_ERROR:
>> +		ufshcd_set_host_reset_pending(hba);
>> +		break;
>> +	default:
>> +		dev_err(hba->dev, "%s: uknown TMR OCS 0x%x\n",
>> +				__func__, ocs);
>> +		BUG();
>> +	}
>> +}
>> +
>>   /**
>> - * ufshcd_fatal_err_handler - handle fatal errors
>> + * ufshcd_error_autopsy_transfer_req() - reads OCS field of failed command and
>> + *                          decide error handling
>>    * @hba: per adapter instance
>> + * @err_xfer: bit mask for transfer request errors
>> + *
>> + * Iterate over completed transfer requests and
>> + * set error handling flags.
>> + */
>> +static void
>> +ufshcd_error_autopsy_transfer_req(struct ufs_hba *hba, u32 *err_xfer)
>> +{
>> +	unsigned long completed;
>> +	u32 doorbell;
>> +	int index;
>> +	int ocs;
>> +
>> +	if (!err_xfer)
>> +		goto out;
>> +
>> +	doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
>> +	completed = doorbell ^ (u32)hba->outstanding_reqs;
>> +
>> +	for (index = 0; index < hba->nutrs; index++) {
>> +		if (test_bit(index, &completed)) {
>> +			ocs = ufshcd_get_tr_ocs(&hba->lrb[index]);
>> +			if ((ocs == OCS_SUCCESS) ||
>> +					(ocs == OCS_INVALID_COMMAND_STATUS))
>> +				continue;
>> +
>> +			*err_xfer |= (1 << index);
>> +			ufshcd_decide_eh_xfer_req(hba, ocs);
>> +		}
>> +	}
>> +out:
>> +	return;
>> +}
>> +
>> +/**
>> + * ufshcd_error_autopsy_task_req() - reads OCS field of failed command and
>> + *                          decide error handling
>> + * @hba: per adapter instance
>> + * @err_tm: bit mask for task management errors
>> + *
>> + * Iterate over completed task management requests and
>> + * set error handling flags.
>> + */
>> +static void
>> +ufshcd_error_autopsy_task_req(struct ufs_hba *hba, u32 *err_tm)
>> +{
>> +	unsigned long completed;
>> +	u32 doorbell;
>> +	int index;
>> +	int ocs;
>> +
>> +	if (!err_tm)
>> +		goto out;
>> +
>> +	doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
>> +	completed = doorbell ^ (u32)hba->outstanding_tasks;
>> +
>> +	for (index = 0; index < hba->nutmrs; index++) {
>> +		if (test_bit(index, &completed)) {
>> +			struct utp_task_req_desc *tm_descp;
>> +
>> +			tm_descp = hba->utmrdl_base_addr;
>> +			ocs = ufshcd_get_tmr_ocs(&tm_descp[index]);
>> +			if ((ocs == OCS_TMR_SUCCESS) ||
>> +					(ocs == OCS_TMR_INVALID_COMMAND_STATUS))
>> +				continue;
>> +
>> +			*err_tm |= (1 << index);
>> +			ufshcd_decide_eh_task_req(hba, ocs);
>> +		}
>> +	}
>> +
>> +out:
>> +	return;
>> +}
>> +
>> +/**
>> + * ufshcd_fatal_err_handler - handle fatal errors
>> + * @work: pointer to work structure
>>    */
>>   static void ufshcd_fatal_err_handler(struct work_struct *work)
>>   {
>>   	struct ufs_hba *hba;
>> +	unsigned long flags;
>> +	u32 err_xfer = 0;
>> +	u32 err_tm = 0;
>> +	int err;
>> +
>>   	hba = container_of(work, struct ufs_hba, feh_workq);
>>
>>   	pm_runtime_get_sync(hba->dev);
>> -	/* check if reset is already in progress */
>> -	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
>> -		ufshcd_do_reset(hba);
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
>> +		/* complete processed requests and exit */
>> +		ufshcd_transfer_req_compl(hba);
>> +		ufshcd_tmc_handler(hba);
>> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +		pm_runtime_put_sync(hba->dev);
>> +		return;
> Host driver is here with finishing 'scsi_block_requests'.
> 'scsi_unblock_requests' can be called somewhere?

No, but it is possible for a SCSI command timeout, which triggers a
device/host reset, to race with the fatal error handler.

> 
>> +	}
>> +
>> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
>> +	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
>> +	ufshcd_error_autopsy_task_req(hba, &err_tm);
>> +
>> +	/*
>> +	 * Complete successful and pending transfer requests.
>> +	 * DID_REQUEUE is returned for pending requests as they have
>> +	 * nothing to do with error'ed request and SCSI layer should
>> +	 * not treat them as errors and decrement retry count.
>> +	 */
>> +	hba->outstanding_reqs &= ~err_xfer;
>> +	ufshcd_transfer_req_compl(hba);
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +	ufshcd_complete_pending_reqs(hba);
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +	hba->outstanding_reqs |= err_xfer;
> Hmm... error handling seems so complicated.
> To simplify it, how about below?
> 
> 1. If requests(transfer or task management) are completed, finish them with success/failure.
This is what we are trying to do above.

> 2. If there are pending requests, abort them.
No. If a fatal error has occurred, the host controller may be frozen,
and we cannot be sure that it can accept task management commands and
execute them.

> 3. If fatal error, reset.
> 
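
For reference, the way completed and failed slots are separated can be
sketched in plain C as follows. This is a userspace illustration with
hypothetical helper names; the status array stands in for the OCS
field of each descriptor, and the two constants are assumed values.

```c
#include <stdint.h>

#define NSLOTS      32
#define OCS_OK      0x0	/* assumed value of OCS_SUCCESS */
#define OCS_INVALID 0xF	/* assumed value of OCS_INVALID_COMMAND_STATUS */

/*
 * A slot that is outstanding in s/w but whose doorbell bit has been
 * cleared by the controller is treated as completed.
 */
static uint32_t completed_slots(uint32_t doorbell, uint32_t outstanding)
{
	return doorbell ^ outstanding;
}

/*
 * Of the completed slots, return those whose status is neither
 * success nor "never touched by the h/w".
 */
static uint32_t failed_slots(uint32_t doorbell, uint32_t outstanding,
			     const uint8_t ocs[NSLOTS])
{
	uint32_t completed = completed_slots(doorbell, outstanding);
	uint32_t err = 0;
	int i;

	for (i = 0; i < NSLOTS; i++) {
		if (!(completed & (UINT32_C(1) << i)))
			continue;
		if (ocs[i] == OCS_OK || ocs[i] == OCS_INVALID)
			continue;
		err |= UINT32_C(1) << i;
	}
	return err;
}
```

The failed mask is what gets cleared from the outstanding bitmap
before the normal completion path runs and is OR-ed back in afterwards,
so the failed requests are only finished once the reset is done.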


>> +
>> +	/* Complete successful and pending task requests */
>> +	hba->outstanding_tasks &= ~err_tm;
>> +	ufshcd_tmc_handler(hba);
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +	ufshcd_complete_pending_tasks(hba);
>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>> +
>> +	hba->outstanding_tasks |= err_tm;
>> +
>> +	/*
>> +	 * Controller may generate multiple fatal errors, handle
>> +	 * errors based on severity.
>> +	 * 1) DEVICE_FATAL_ERROR
>> +	 * 2) SYSTEM_BUS/CONTROLLER_FATAL_ERROR
>> +	 * 3) UIC_ERROR
>> +	 */
>> +	if (hba->errors & DEVICE_FATAL_ERROR) {
>> +		/*
>> +		 * Some HBAs may not clear UTRLDBR/UTMRLDBR or update
>> +		 * OCS field on device fatal error.
>> +		 */
>> +		ufshcd_set_host_reset_pending(hba);
> In DEVICE_FATAL_ERROR, ufshcd_device_reset_pending is right?

It looks that way, but the spec says to reset the host as well (8.3.6).

> 
>> +	} else if (hba->errors & (SYSTEM_BUS_FATAL_ERROR |
>> +			CONTROLLER_FATAL_ERROR)) {
>> +		/* eh flags should be set in err autopsy based on OCS values */
>> +		if (!hba->eh_flags)
>> +			WARN(1, "%s: fatal error without error handling\n",
>> +				dev_name(hba->dev));
>> +	} else if (hba->errors & UIC_ERROR) {
>> +		if (hba->uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR) {
>> +			/* fatal error - reset controller */
>> +			ufshcd_set_host_reset_pending(hba);
>> +		} else if (hba->uic_error & (UFSHCD_UIC_NL_ERROR |
>> +					UFSHCD_UIC_TL_ERROR |
>> +					UFSHCD_UIC_DME_ERROR)) {
>> +			/* non-fatal, report error to SCSI layer */
>> +			if (!hba->eh_flags) {
>> +				spin_unlock_irqrestore(
>> +						hba->host->host_lock, flags);
>> +				ufshcd_complete_pending_reqs(hba);
>> +				ufshcd_complete_pending_tasks(hba);
>> +				spin_lock_irqsave(hba->host->host_lock, flags);
>> +			}
>> +		}
>> +	}
>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>> +
>> +	if (hba->eh_flags) {
>> +		err = ufshcd_reset_and_restore(hba);
>> +		if (err) {
>> +			ufshcd_clear_host_reset_pending(hba);
>> +			ufshcd_clear_device_reset_pending(hba);
>> +			dev_err(hba->dev, "%s: reset and restore failed\n",
>> +					__func__);
>> +			hba->ufshcd_state = UFSHCD_STATE_ERROR;
>> +		}
>> +		/*
>> +		 * Inform scsi mid-layer that we did reset and allow to handle
>> +		 * Unit Attention properly.
>> +		 */
>> +		scsi_report_bus_reset(hba->host, 0);
>> +		hba->errors = 0;
>> +		hba->uic_error = 0;
>> +	}
>> +	scsi_unblock_requests(hba->host);
>>   	pm_runtime_put_sync(hba->dev);
>>   }
>>
>>   /**
>> - * ufshcd_err_handler - Check for fatal errors
>> - * @work: pointer to a work queue structure
>> + * ufshcd_update_uic_error - check and set fatal UIC error flags.
>> + * @hba: per-adapter instance
>>    */
>> -static void ufshcd_err_handler(struct ufs_hba *hba)
>> +static void ufshcd_update_uic_error(struct ufs_hba *hba)
>>   {
>>   	u32 reg;
>>
>> +	/* PA_INIT_ERROR is fatal and needs UIC reset */
>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
>> +	if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
>> +		hba->uic_error |= UFSHCD_UIC_DL_PA_INIT_ERROR;
>> +
>> +	/* UIC NL/TL/DME errors needs software retry */
>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_NETWORK_LAYER);
>> +	if (reg)
>> +		hba->uic_error |= UFSHCD_UIC_NL_ERROR;
>> +
>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_TRANSPORT_LAYER);
>> +	if (reg)
>> +		hba->uic_error |= UFSHCD_UIC_TL_ERROR;
>> +
>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DME);
>> +	if (reg)
>> +		hba->uic_error |= UFSHCD_UIC_DME_ERROR;
> REG_UIC_ERROR_CODE_PHY_ADAPTER_LAYER is not handled.

The UFS spec. says this is a non-fatal error: the UIC recovers
by itself, so no software intervention is needed.
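
For readers following along, the classification described here can be
modelled in a few lines of standalone C. This is only a sketch under
stated assumptions: the struct and function names are made up for
illustration, the register contents are passed in as plain integers
instead of being read with ufshcd_readl(), and the PA_INIT bit position
is a placeholder rather than a value taken from ufshci.h. Only the four
flag values come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in the patch under review. */
enum {
	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0),
	UFSHCD_UIC_NL_ERROR = (1 << 1),
	UFSHCD_UIC_TL_ERROR = (1 << 2),
	UFSHCD_UIC_DME_ERROR = (1 << 3),
};

/* Assumed bit position, for illustration only. */
#define SKETCH_DL_PA_INIT_BIT (1u << 13)

/* Hypothetical snapshot of the five UIC error code registers. */
struct sketch_uic_err_regs {
	uint32_t phy_adapter;
	uint32_t data_link;
	uint32_t network;
	uint32_t transport;
	uint32_t dme;
};

/* Fold the register snapshot into the driver-level error flags. */
static uint32_t sketch_uic_error_flags(const struct sketch_uic_err_regs *r)
{
	uint32_t flags = 0;

	assert(r);
	/* r->phy_adapter is deliberately ignored: the UIC recovers by itself */
	if (r->data_link & SKETCH_DL_PA_INIT_BIT)
		flags |= UFSHCD_UIC_DL_PA_INIT_ERROR;
	if (r->network)
		flags |= UFSHCD_UIC_NL_ERROR;
	if (r->transport)
		flags |= UFSHCD_UIC_TL_ERROR;
	if (r->dme)
		flags |= UFSHCD_UIC_DME_ERROR;
	return flags;
}
```

With this model, a snapshot in which only phy_adapter is non-zero yields
no flags at all, which matches the reasoning above for not reading
REG_UIC_ERROR_CODE_PHY_ADAPTER_LAYER.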

> 
>> +
>> +	dev_dbg(hba->dev, "%s: UIC error flags = 0x%08x\n",
>> +			__func__, hba->uic_error);
>> +}
>> +
>> +/**
>> + * ufshcd_err_handler - Check for fatal errors
>> + * @hba: per-adapter instance
>> + */
>> +static void ufshcd_err_handler(struct ufs_hba *hba)
>> +{
>>   	if (hba->errors & INT_FATAL_ERRORS)
>>   		goto fatal_eh;
>>
>>   	if (hba->errors & UIC_ERROR) {
>> -		reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
>> -		if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
>> +		hba->uic_error = 0;
>> +		ufshcd_update_uic_error(hba);
>> +		if (hba->uic_error)
> Except UFSHCD_UIC_DL_PA_INIT_ERROR, it's not fatal. Should it go to fatal_eh?

Please see the UIC error handling in ufshcd_fatal_err_handler(). The
other errors still need software intervention, so I routed them through
fatal_eh to complete the pending requests and report them to SCSI.
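
The decision taken in the UIC branch of ufshcd_fatal_err_handler(),
quoted earlier in this message, can be summarised by the following
standalone sketch. The action enum and the function name are
hypothetical and exist only for illustration; the flag values are the
ones from the patch, and eh_flags stands for the pending-reset flags of
the driver.

```c
#include <assert.h>

/* Flag values as defined in the patch under review. */
enum {
	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0),
	UFSHCD_UIC_NL_ERROR = (1 << 1),
	UFSHCD_UIC_TL_ERROR = (1 << 2),
	UFSHCD_UIC_DME_ERROR = (1 << 3),
};

/* Hypothetical summary of what the work handler ends up doing. */
enum sketch_action {
	SKETCH_NONE,			/* nothing to recover */
	SKETCH_COMPLETE_PENDING,	/* clear and complete requests and tasks */
	SKETCH_RESET,			/* run reset and restore */
};

static enum sketch_action sketch_uic_action(unsigned int uic_error,
					    unsigned int eh_flags)
{
	/* PA_INIT is fatal: it marks a host reset as pending */
	if (uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR)
		return SKETCH_RESET;

	/* a reset requested elsewhere takes precedence over soft recovery */
	if (eh_flags)
		return SKETCH_RESET;

	/* NL/TL/DME: non-fatal, but software must complete the requests */
	if (uic_error & (UFSHCD_UIC_NL_ERROR | UFSHCD_UIC_TL_ERROR |
			 UFSHCD_UIC_DME_ERROR))
		return SKETCH_COMPLETE_PENDING;

	return SKETCH_NONE;
}
```

In this model only PA_INIT forces a reset on its own; the NL/TL/DME
errors share the fatal_eh entry point but end in the lighter
complete-and-report path unless a reset is already pending.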

> 
> Thanks,
> Seungwon Jeon
> 
>>   			goto fatal_eh;
>>   	}
>> +	/*
>> +	 * Other errors are either non-fatal or completed by the
>> +	 * controller by updating OCS fields with success/failure.
>> +	 */
>>   	return;
>> +
>>   fatal_eh:
>>   	/* handle fatal errors only when link is functional */
>>   	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
>> +		/* block commands from midlayer */
>> +		scsi_block_requests(hba->host);
>>   		/* block commands at driver layer until error is handled */
>>   		hba->ufshcd_state = UFSHCD_STATE_ERROR;
>>   		schedule_work(&hba->feh_workq);
>> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
>> index 7fcedd0..4ee4d1a 100644
>> --- a/drivers/scsi/ufs/ufshcd.h
>> +++ b/drivers/scsi/ufs/ufshcd.h
>> @@ -185,6 +185,7 @@ struct ufs_dev_cmd {
>>    * @feh_workq: Work queue for fatal controller error handling
>>    * @eeh_work: Worker to handle exception events
>>    * @errors: HBA errors
>> + * @uic_error: UFS interconnect layer error status
>>    * @dev_cmd: ufs device management command information
>>    * @auto_bkops_enabled: to track whether bkops is enabled in device
>>    */
>> @@ -235,6 +236,7 @@ struct ufs_hba {
>>
>>   	/* HBA Errors */
>>   	u32 errors;
>> +	u32 uic_error;
>>
>>   	/* Device management request data */
>>   	struct ufs_dev_cmd dev_cmd;
>> diff --git a/drivers/scsi/ufs/ufshci.h b/drivers/scsi/ufs/ufshci.h
>> index f1e1b74..36f68ef 100644
>> --- a/drivers/scsi/ufs/ufshci.h
>> +++ b/drivers/scsi/ufs/ufshci.h
>> @@ -264,7 +264,7 @@ enum {
>>   	UTP_DEVICE_TO_HOST	= 0x04000000,
>>   };
>>
>> -/* Overall command status values */
>> +/* Overall command status values for transfer request */
>>   enum {
>>   	OCS_SUCCESS			= 0x0,
>>   	OCS_INVALID_CMD_TABLE_ATTR	= 0x1,
>> @@ -274,8 +274,21 @@ enum {
>>   	OCS_PEER_COMM_FAILURE		= 0x5,
>>   	OCS_ABORTED			= 0x6,
>>   	OCS_FATAL_ERROR			= 0x7,
>> -	OCS_INVALID_COMMAND_STATUS	= 0x0F,
>> -	MASK_OCS			= 0x0F,
>> +	OCS_INVALID_COMMAND_STATUS	= 0xF,
>> +	MASK_OCS			= 0xFF,
>> +};
>> +
>> +/* Overall command status values for task management request */
>> +enum {
>> +	OCS_TMR_SUCCESS			= 0x0,
>> +	OCS_TMR_INVALID_ATTR		= 0x1,
>> +	OCS_TMR_MISMATCH_REQ_SIZE	= 0x2,
>> +	OCS_TMR_MISMATCH_RESP_SIZE	= 0x3,
>> +	OCS_TMR_PEER_COMM_FAILURE	= 0x4,
>> +	OCS_TMR_ABORTED			= 0x5,
>> +	OCS_TMR_FATAL_ERROR		= 0x6,
>> +	OCS_TMR_INVALID_COMMAND_STATUS	= 0xF,
>> +	MASK_OCS_TMR			= 0xFF,
>>   };
>>
>>   /**
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
>> of Code Aurora Forum, hosted by The Linux Foundation.
>>


-- 
Regards,
Sujit


* RE: [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation
  2013-07-19 18:26     ` Sujit Reddy Thumma
@ 2013-07-23  8:24       ` Seungwon Jeon
  2013-07-23 15:40         ` Sujit Reddy Thumma
  0 siblings, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-23  8:24 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma'
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
> On 7/19/2013 7:26 PM, Seungwon Jeon wrote:
> > On Tue, July 09, 2013 Sujit Reddy Thumma wrote:
> >> Currently, sending Task Management (TM) command to the card might
> >> be broken in some scenarios as listed below:
> >>
> >> Problem: If there are more than 8 TM commands the implementation
> >>           returns error to the caller.
> >> Fix:     Wait for one of the slots to be emptied and send the command.
> >>
> >> Problem: Sometimes it is necessary for the caller to know the TM service
> >>           response code to determine the task status.
> >> Fix:     Propagate the service response to the caller.
> >>
> >> Problem: If the TM command times out no proper error recovery is
> >>           implemented.
> >> Fix:     Clear the command in the controller door-bell register, so that
> >>           further commands for the same slot don't fail.
> >>
> >> Problem: While preparing the TM command descriptor, the task tag used
> >>           should be unique across SCSI/NOP/QUERY/TM commands and not the
> >> 	 task tag of the command which the TM command is trying to manage.
> >> Fix:     Use a unique task tag instead of task tag of SCSI command.
> >>
> >> Problem: Since the TM command involves H/W communication, abruptly ending
> >>           the request on kill interrupt signal might cause h/w malfunction.
> >> Fix:     Wait for hardware completion interrupt with TASK_UNINTERRUPTIBLE
> >>           set.
> >>
> >> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> >> ---
> >>   drivers/scsi/ufs/ufshcd.c |  177 ++++++++++++++++++++++++++++++---------------
> >>   drivers/scsi/ufs/ufshcd.h |    8 ++-
> >>   2 files changed, 126 insertions(+), 59 deletions(-)
> >>
> >> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> >> index af7d01d..a176421 100644
> >> --- a/drivers/scsi/ufs/ufshcd.c
> >> +++ b/drivers/scsi/ufs/ufshcd.c
> >> @@ -53,6 +53,9 @@
> >>   /* Query request timeout */
> >>   #define QUERY_REQ_TIMEOUT 30 /* msec */
> >>
> >> +/* Task management command timeout */
> >> +#define TM_CMD_TIMEOUT	100 /* msecs */
> >> +
> >>   /* Expose the flag value from utp_upiu_query.value */
> >>   #define MASK_QUERY_UPIU_FLAG_LOC 0xFF
> >>
> >> @@ -190,13 +193,35 @@ ufshcd_get_tmr_ocs(struct utp_task_req_desc *task_req_descp)
> >>   /**
> >>    * ufshcd_get_tm_free_slot - get a free slot for task management request
> >>    * @hba: per adapter instance
> >> + * @free_slot: pointer to variable with available slot value
> >>    *
> >> - * Returns maximum number of task management request slots in case of
> >> - * task management queue full or returns the free slot number
> >> + * Get a free tag and lock it until ufshcd_put_tm_slot() is called.
> >> + * Returns 0 if free slot is not available, else return 1 with tag value
> >> + * in @free_slot.
> >>    */
> >> -static inline int ufshcd_get_tm_free_slot(struct ufs_hba *hba)
> >> +static bool ufshcd_get_tm_free_slot(struct ufs_hba *hba, int *free_slot)
> >> +{
> >> +	int tag;
> >> +	bool ret = false;
> >> +
> >> +	if (!free_slot)
> >> +		goto out;
> >> +
> >> +	do {
> >> +		tag = find_first_zero_bit(&hba->tm_slots_in_use, hba->nutmrs);
> >> +		if (tag >= hba->nutmrs)
> >> +			goto out;
> >> +	} while (test_and_set_bit_lock(tag, &hba->tm_slots_in_use));
> >> +
> >> +	*free_slot = tag;
> >> +	ret = true;
> >> +out:
> >> +	return ret;
> >> +}
> >> +
> >> +static inline void ufshcd_put_tm_slot(struct ufs_hba *hba, int slot)
> >>   {
> >> -	return find_first_zero_bit(&hba->outstanding_tasks, hba->nutmrs);
> >> +	clear_bit_unlock(slot, &hba->tm_slots_in_use);
> >>   }
> >>
> >>   /**
> >> @@ -1778,10 +1803,11 @@ static void ufshcd_slave_destroy(struct scsi_device *sdev)
> >>    * ufshcd_task_req_compl - handle task management request completion
> >>    * @hba: per adapter instance
> >>    * @index: index of the completed request
> >> + * @resp: task management service response
> >>    *
> >> - * Returns SUCCESS/FAILED
> >> + * Returns non-zero value on error, zero on success
> >>    */
> >> -static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
> >> +static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index, u8 *resp)
> >>   {
> >>   	struct utp_task_req_desc *task_req_descp;
> >>   	struct utp_upiu_task_rsp *task_rsp_upiup;
> >> @@ -1802,19 +1828,15 @@ static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
> >>   				task_req_descp[index].task_rsp_upiu;
> >>   		task_result = be32_to_cpu(task_rsp_upiup->header.dword_1);
> >>   		task_result = ((task_result & MASK_TASK_RESPONSE) >> 8);
> >> -
> >> -		if (task_result != UPIU_TASK_MANAGEMENT_FUNC_COMPL &&
> >> -		    task_result != UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED)
> >> -			task_result = FAILED;
> >> -		else
> >> -			task_result = SUCCESS;
> >> +		if (resp)
> >> +			*resp = (u8)task_result;
> >>   	} else {
> >> -		task_result = FAILED;
> >> -		dev_err(hba->dev,
> >> -			"trc: Invalid ocs = %x\n", ocs_value);
> >> +		dev_err(hba->dev, "%s: failed, ocs = 0x%x\n",
> >> +				__func__, ocs_value);
> >>   	}
> >>   	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> -	return task_result;
> >> +
> >> +	return ocs_value;
> >>   }
> >>
> >>   /**
> >> @@ -2298,7 +2320,7 @@ static void ufshcd_tmc_handler(struct ufs_hba *hba)
> >>
> >>   	tm_doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
> >>   	hba->tm_condition = tm_doorbell ^ hba->outstanding_tasks;
> >> -	wake_up_interruptible(&hba->ufshcd_tm_wait_queue);
> >> +	wake_up(&hba->tm_wq);
> >>   }
> >>
> >>   /**
> >> @@ -2348,38 +2370,61 @@ static irqreturn_t ufshcd_intr(int irq, void *__hba)
> >>   	return retval;
> >>   }
> >>
> >> +static int ufshcd_clear_tm_cmd(struct ufs_hba *hba, int tag)
> >> +{
> >> +	int err = 0;
> >> +	u32 reg;
> >> +	u32 mask = 1 << tag;
> >> +	unsigned long flags;
> >> +
> >> +	if (!test_bit(tag, &hba->outstanding_reqs))
> >> +		goto out;
> >> +
> >> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >> +	ufshcd_writel(hba, ~(1 << tag), REG_UTP_TASK_REQ_LIST_CLEAR);
> >> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> +
> >> +	/* poll for max. 1 sec to clear door bell register by h/w */
> >> +	reg = ufshcd_wait_for_register(hba,
> >> +			REG_UTP_TASK_REQ_DOOR_BELL,
> >> +			mask, 0, 1000, 1000);
> >> +	if ((reg & mask) == mask)
> >> +		err = -ETIMEDOUT;
> >> +out:
> >> +	return err;
> >> +}
> >> +
> >>   /**
> >>    * ufshcd_issue_tm_cmd - issues task management commands to controller
> >>    * @hba: per adapter instance
> >> - * @lrbp: pointer to local reference block
> >> + * @lun_id: LUN ID to which TM command is sent
> >> + * @task_id: task ID to which the TM command is applicable
> >> + * @tm_function: task management function opcode
> >> + * @tm_response: task management service response return value
> >>    *
> >> - * Returns SUCCESS/FAILED
> >> + * Returns non-zero value on error, zero on success.
> >>    */
> >> -static int
> >> -ufshcd_issue_tm_cmd(struct ufs_hba *hba,
> >> -		    struct ufshcd_lrb *lrbp,
> >> -		    u8 tm_function)
> >> +static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
> >> +		u8 tm_function, u8 *tm_response)
> >>   {
> >>   	struct utp_task_req_desc *task_req_descp;
> >>   	struct utp_upiu_task_req *task_req_upiup;
> >>   	struct Scsi_Host *host;
> >>   	unsigned long flags;
> >> -	int free_slot = 0;
> >> +	int free_slot;
> >>   	int err;
> >> +	int task_tag;
> >>
> >>   	host = hba->host;
> >>
> >> -	spin_lock_irqsave(host->host_lock, flags);
> >> -
> >> -	/* If task management queue is full */
> >> -	free_slot = ufshcd_get_tm_free_slot(hba);
> >> -	if (free_slot >= hba->nutmrs) {
> >> -		spin_unlock_irqrestore(host->host_lock, flags);
> >> -		dev_err(hba->dev, "Task management queue full\n");
> >> -		err = FAILED;
> >> -		goto out;
> >> -	}
> >> +	/*
> >> +	 * Get free slot, sleep if slots are unavailable.
> >> +	 * Even though we use wait_event() which sleeps indefinitely,
> >> +	 * the maximum wait time is bounded by %TM_CMD_TIMEOUT.
> >> +	 */
> >> +	wait_event(hba->tm_tag_wq, ufshcd_get_tm_free_slot(hba, &free_slot));
> >>
> >> +	spin_lock_irqsave(host->host_lock, flags);
> >>   	task_req_descp = hba->utmrdl_base_addr;
> >>   	task_req_descp += free_slot;
> >>
> >> @@ -2391,18 +2436,15 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
> >>   	/* Configure task request UPIU */
> >>   	task_req_upiup =
> >>   		(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
> >> +	task_tag = hba->nutrs + free_slot;
> > Possibly, did you intend 'hba->nutmrs', not 'hba->nutrs'?
> > I think it's safer with hba->nutmrs if we can't be sure that NUTRS is larger than NUTMRS.
> 
> It should be hba->nutrs and not hba->nutmrs.
> 
> The equation is -
> 0 <= free_slot < hba->nutmrs
> 0 <= transfer_req_task_id < hba->nutrs
> hba->nutrs <= tm_req_task_id < hba->nutmrs + hba->nutrs
> 
> Whatever the values of NUTRS/NUTMRS, the above gives a unique
> task_id.
Yes.
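
The inequalities quoted above are easy to check with a small standalone
program. The helper below is hypothetical (it is not part of the
driver) and simply tests whether any task management tag, computed as
base + slot, falls into the transfer request tag range [0, nutrs).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when every task management tag (base + slot) lies
 * outside the transfer request tag range [0, nutrs).
 */
static bool sketch_tm_tags_unique(int base, int nutrs, int nutmrs)
{
	int slot;

	assert(nutrs > 0 && nutmrs > 0);
	for (slot = 0; slot < nutmrs; slot++) {
		int tm_tag = base + slot;

		if (tm_tag >= 0 && tm_tag < nutrs)
			return false;	/* collides with a transfer tag */
	}
	return true;
}
```

Calling it with base equal to nutrs, as the patch does, reports unique
tags for any positive NUTRS/NUTMRS pair, for example (32, 32, 8) and
(4, 4, 8). With base equal to nutmrs the check fails as soon as NUTRS
is larger than NUTMRS, for example (8, 32, 8), where tag 8 is also a
valid transfer request tag.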

> 
> 
> >
> >>   	task_req_upiup->header.dword_0 =
> >>   		UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
> >> -					      lrbp->lun, lrbp->task_tag);
> >> +				lun_id, task_tag);
> >>   	task_req_upiup->header.dword_1 =
> >>   		UPIU_HEADER_DWORD(0, tm_function, 0, 0);
> >>
> >> -	task_req_upiup->input_param1 = lrbp->lun;
> >> -	task_req_upiup->input_param1 =
> >> -		cpu_to_be32(task_req_upiup->input_param1);
> >> -	task_req_upiup->input_param2 = lrbp->task_tag;
> >> -	task_req_upiup->input_param2 =
> >> -		cpu_to_be32(task_req_upiup->input_param2);
> >> +	task_req_upiup->input_param1 = cpu_to_be32(lun_id);
> >> +	task_req_upiup->input_param2 = cpu_to_be32(task_id);
> >>
> >>   	/* send command to the controller */
> >>   	__set_bit(free_slot, &hba->outstanding_tasks);
> >> @@ -2411,20 +2453,24 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
> >>   	spin_unlock_irqrestore(host->host_lock, flags);
> >>
> >>   	/* wait until the task management command is completed */
> >> -	err =
> >> -	wait_event_interruptible_timeout(hba->ufshcd_tm_wait_queue,
> >> -					 (test_bit(free_slot,
> >> -					 &hba->tm_condition) != 0),
> >> -					 60 * HZ);
> >> +	err = wait_event_timeout(hba->tm_wq,
> >> +			test_bit(free_slot, &hba->tm_condition),
> >> +			msecs_to_jiffies(TM_CMD_TIMEOUT));
> >>   	if (!err) {
> >> -		dev_err(hba->dev,
> >> -			"Task management command timed-out\n");
> >> -		err = FAILED;
> >> -		goto out;
> >> +		dev_err(hba->dev, "%s: task management cmd 0x%.2x timed-out\n",
> >> +				__func__, tm_function);
> >> +		if (ufshcd_clear_tm_cmd(hba, free_slot))
> >> +			dev_WARN(hba->dev, "%s: unable clear tm cmd (slot %d) after timeout\n",
> >> +					__func__, free_slot);
> >> +		err = -ETIMEDOUT;
> >> +	} else {
> >> +		err = ufshcd_task_req_compl(hba, free_slot, tm_response);
> >>   	}
> >> +
> >>   	clear_bit(free_slot, &hba->tm_condition);
> >> -	err = ufshcd_task_req_compl(hba, free_slot);
> >> -out:
> >> +	ufshcd_put_tm_slot(hba, free_slot);
> >> +	wake_up(&hba->tm_tag_wq);
> >> +
> >>   	return err;
> >>   }
> >>
> >> @@ -2441,14 +2487,22 @@ static int ufshcd_device_reset(struct scsi_cmnd *cmd)
> >>   	unsigned int tag;
> >>   	u32 pos;
> >>   	int err;
> >> +	u8 resp;
> >> +	struct ufshcd_lrb *lrbp;
> >>
> >>   	host = cmd->device->host;
> >>   	hba = shost_priv(host);
> >>   	tag = cmd->request->tag;
> >>
> >> -	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_LOGICAL_RESET);
> >> -	if (err == FAILED)
> >> +	lrbp = &hba->lrb[tag];
> >> +	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> > Argument 2nd, 3rd can be replaced by lrbp.
> > Then, we can reduce the number of argument.
> >
> 
> The TM issue command doesn't need to know about lrbp; it just needs
> the LUN ID and task ID. This helps when we are not dealing with lrbp's
> and just want to issue some other TM command.
> I believe an extra argument is not so costly on the systems which
> demand high performance UFS devices.
Yes, you're right, it only needs the LUN ID and task ID.
It might be trivial, but 'lrbp' still has to be referenced to get them.

Thanks,
Seungwon Jeon



* RE: [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods
  2013-07-19 18:26     ` Sujit Reddy Thumma
@ 2013-07-23  8:27       ` Seungwon Jeon
  2013-07-23 15:40         ` Sujit Reddy Thumma
  0 siblings, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-23  8:27 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma'
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
> On 7/19/2013 7:27 PM, Seungwon Jeon wrote:
> > On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
> >> As of now SCSI initiated error handling is broken because,
> >> the reset APIs don't try to bring back the device initialized and
> >> ready for further transfers.
> >>
> >> In case of timeouts, the scsi error handler takes care of handling aborts
> >> and resets. Improve the error handling in such scenario by resetting the
> >> device and host and re-initializing them in proper manner.
> >>
> >> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> >> ---
> >>   drivers/scsi/ufs/ufshcd.c |  467 +++++++++++++++++++++++++++++++++++++++------
> >>   drivers/scsi/ufs/ufshcd.h |    2 +
> >>   2 files changed, 411 insertions(+), 58 deletions(-)
> >>
> >> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> >> index 51ce096..b4c9910 100644
> >> --- a/drivers/scsi/ufs/ufshcd.c
> >> +++ b/drivers/scsi/ufs/ufshcd.c
> >> @@ -69,9 +69,15 @@ enum {
> >>
> >>   /* UFSHCD states */
> >>   enum {
> >> -	UFSHCD_STATE_OPERATIONAL,
> >>   	UFSHCD_STATE_RESET,
> >>   	UFSHCD_STATE_ERROR,
> >> +	UFSHCD_STATE_OPERATIONAL,
> >> +};
> >> +
> >> +/* UFSHCD error handling flags */
> >> +enum {
> >> +	UFSHCD_EH_HOST_RESET_PENDING = (1 << 0),
> >> +	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
> >>   };
> >>
> >>   /* Interrupt configuration options */
> >> @@ -87,6 +93,22 @@ enum {
> >>   	INT_AGGR_CONFIG,
> >>   };
> >>
> >> +#define ufshcd_set_device_reset_pending(h) \
> >> +	(h->eh_flags |= UFSHCD_EH_DEVICE_RESET_PENDING)
> >> +#define ufshcd_set_host_reset_pending(h) \
> >> +	(h->eh_flags |= UFSHCD_EH_HOST_RESET_PENDING)
> >> +#define ufshcd_device_reset_pending(h) \
> >> +	(h->eh_flags & UFSHCD_EH_DEVICE_RESET_PENDING)
> >> +#define ufshcd_host_reset_pending(h) \
> >> +	(h->eh_flags & UFSHCD_EH_HOST_RESET_PENDING)
> >> +#define ufshcd_clear_device_reset_pending(h) \
> >> +	(h->eh_flags &= ~UFSHCD_EH_DEVICE_RESET_PENDING)
> >> +#define ufshcd_clear_host_reset_pending(h) \
> >> +	(h->eh_flags &= ~UFSHCD_EH_HOST_RESET_PENDING)
> >> +
> >> +static void ufshcd_tmc_handler(struct ufs_hba *hba);
> >> +static void ufshcd_async_scan(void *data, async_cookie_t cookie);
> >> +
> >>   /*
> >>    * ufshcd_wait_for_register - wait for register value to change
> >>    * @hba - per-adapter interface
> >> @@ -851,9 +873,22 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
> >>
> >>   	tag = cmd->request->tag;
> >>
> >> -	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL) {
> >> +	switch (hba->ufshcd_state) {
> > Is no lock needed for ufshcd_state?
Could you please check this?

> >
> >> +	case UFSHCD_STATE_OPERATIONAL:
> >> +		break;
> >> +	case UFSHCD_STATE_RESET:
> >>   		err = SCSI_MLQUEUE_HOST_BUSY;
> >>   		goto out;
> >> +	case UFSHCD_STATE_ERROR:
> >> +		set_host_byte(cmd, DID_ERROR);
> >> +		cmd->scsi_done(cmd);
> >> +		goto out;
> >> +	default:
> >> +		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
> >> +				__func__, hba->ufshcd_state);
> >> +		set_host_byte(cmd, DID_BAD_TARGET);
> >> +		cmd->scsi_done(cmd);
> >> +		goto out;
> >>   	}
> >>
> >>   	/* acquire the tag to make sure device cmds don't use it */
> >> @@ -1573,8 +1608,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
> >>   	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
> >>   		scsi_unblock_requests(hba->host);
> >>
> >> -	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> >> -
> >>   out:
> >>   	return err;
> >>   }
> >> @@ -2273,6 +2306,106 @@ out:
> >>   }
> >>
> >>   /**
> >> + * ufshcd_utrl_is_rsr_enabled - check if run-stop register is enabled
> >> + * @hba: per-adapter instance
> >> + */
> >> +static bool ufshcd_utrl_is_rsr_enabled(struct ufs_hba *hba)
> >> +{
> >> +	return ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_LIST_RUN_STOP) & 0x1;
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_utmrl_is_rsr_enabled - check if run-stop register is enabled
> >> + * @hba: per-adapter instance
> >> + */
> >> +static bool ufshcd_utmrl_is_rsr_enabled(struct ufs_hba *hba)
> >> +{
> >> +	return ufshcd_readl(hba, REG_UTP_TASK_REQ_LIST_RUN_STOP) & 0x1;
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_complete_pending_tasks - complete outstanding tasks
> >> + * @hba: per adapter instance
> >> + *
> >> + * Abort in-progress task management commands and wakeup
> >> + * waiting threads.
> >> + *
> >> + * Returns non-zero error value when failed to clear all the commands.
> >> + */
> >> +static int ufshcd_complete_pending_tasks(struct ufs_hba *hba)
> >> +{
> >> +	u32 reg;
> >> +	int err = 0;
> >> +	unsigned long flags;
> >> +
> >> +	if (!hba->outstanding_tasks)
> >> +		goto out;
> >> +
> >> +	/* Clear UTMRL only when run-stop is enabled */
> >> +	if (ufshcd_utmrl_is_rsr_enabled(hba))
> >> +		ufshcd_writel(hba, ~hba->outstanding_tasks,
> >> +				REG_UTP_TASK_REQ_LIST_CLEAR);
> >> +
> >> +	/* poll for max. 1 sec to clear door bell register by h/w */
> >> +	reg = ufshcd_wait_for_register(hba,
> >> +			REG_UTP_TASK_REQ_DOOR_BELL,
> >> +			hba->outstanding_tasks, 0, 1000, 1000);
> >> +	if (reg & hba->outstanding_tasks)
> >> +		err = -ETIMEDOUT;
> >> +
> >> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >> +	/* complete commands that were cleared out */
> >> +	ufshcd_tmc_handler(hba);
> >> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> +out:
> >> +	if (err)
> >> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
> >> +				__func__, reg);
> >> +	return err;
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_complete_pending_reqs - complete outstanding requests
> >> + * @hba: per adapter instance
> >> + *
> >> + * Abort in-progress transfer request commands and return them to SCSI.
> >> + *
> >> + * Returns non-zero error value when failed to clear all the commands.
> >> + */
> >> +static int ufshcd_complete_pending_reqs(struct ufs_hba *hba)
> >> +{
> >> +	u32 reg;
> >> +	int err = 0;
> >> +	unsigned long flags;
> >> +
> >> +	/* check if we completed all of them */
> >> +	if (!hba->outstanding_reqs)
> >> +		goto out;
> >> +
> >> +	/* Clear UTRL only when run-stop is enabled */
> >> +	if (ufshcd_utrl_is_rsr_enabled(hba))
> >> +		ufshcd_writel(hba, ~hba->outstanding_reqs,
> >> +				REG_UTP_TRANSFER_REQ_LIST_CLEAR);
> >> +
> >> +	/* poll for max. 1 sec to clear door bell register by h/w */
> >> +	reg = ufshcd_wait_for_register(hba,
> >> +			REG_UTP_TRANSFER_REQ_DOOR_BELL,
> >> +			hba->outstanding_reqs, 0, 1000, 1000);
> >> +	if (reg & hba->outstanding_reqs)
> >> +		err = -ETIMEDOUT;
> >> +
> >> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >> +	/* complete commands that were cleared out */
> >> +	ufshcd_transfer_req_compl(hba);
> >> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> +out:
> >> +	if (err)
> >> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
> >> +				__func__, reg);
> >> +	return err;
> >> +}
> >> +
> >> +/**
> >>    * ufshcd_fatal_err_handler - handle fatal errors
> >>    * @hba: per adapter instance
> >>    */
> >> @@ -2306,8 +2439,12 @@ static void ufshcd_err_handler(struct ufs_hba *hba)
> >>   	}
> >>   	return;
> >>   fatal_eh:
> >> -	hba->ufshcd_state = UFSHCD_STATE_ERROR;
> >> -	schedule_work(&hba->feh_workq);
> >> +	/* handle fatal errors only when link is functional */
> >> +	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
> >> +		/* block commands at driver layer until error is handled */
> >> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
> > Locking omitted for ufshcd_state?
> This is called in interrupt context with spin_lock held.
Right, I missed it.

> 
> >
> >> +		schedule_work(&hba->feh_workq);
> >> +	}
> >>   }
> >>
> >>   /**
> >> @@ -2475,75 +2612,155 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
> >>   }
> >>
> >>   /**
> >> - * ufshcd_device_reset - reset device and abort all the pending commands
> >> - * @cmd: SCSI command pointer
> >> + * ufshcd_dme_end_point_reset - Notify device Unipro to perform reset
> >> + * @hba: per adapter instance
> >>    *
> >> - * Returns SUCCESS/FAILED
> >> + * UIC_CMD_DME_END_PT_RST resets the UFS device completely, the UFS flags,
> >> + * attributes and descriptors are reset to default state. Callers are
> >> + * expected to initialize the whole device again after this.
> >> + *
> >> + * Returns zero on success, non-zero on failure
> >>    */
> >> -static int ufshcd_device_reset(struct scsi_cmnd *cmd)
> >> +static int ufshcd_dme_end_point_reset(struct ufs_hba *hba)
> >>   {
> >> -	struct Scsi_Host *host;
> >> -	struct ufs_hba *hba;
> >> -	unsigned int tag;
> >> -	u32 pos;
> >> -	int err;
> >> -	u8 resp;
> >> -	struct ufshcd_lrb *lrbp;
> >> +	struct uic_command uic_cmd = {0};
> >> +	int ret;
> >>
> >> -	host = cmd->device->host;
> >> -	hba = shost_priv(host);
> >> -	tag = cmd->request->tag;
> >> +	uic_cmd.command = UIC_CMD_DME_END_PT_RST;
> >>
> >> -	lrbp = &hba->lrb[tag];
> >> -	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> >> -			UFS_LOGICAL_RESET, &resp);
> >> -	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> >> -		err = FAILED;
> >> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> >> +	if (ret)
> >> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_dme_reset - Local UniPro reset
> >> + * @hba: per adapter instance
> >> + *
> >> + * Returns zero on success, non-zero on failure
> >> + */
> >> +static int ufshcd_dme_reset(struct ufs_hba *hba)
> >> +{
> >> +	struct uic_command uic_cmd = {0};
> >> +	int ret;
> >> +
> >> +	uic_cmd.command = UIC_CMD_DME_RESET;
> >> +
> >> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> >> +	if (ret)
> >> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> >> +
> >> +	return ret;
> >> +
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_dme_enable - Local UniPro DME Enable
> >> + * @hba: per adapter instance
> >> + *
> >> + * Returns zero on success, non-zero on failure
> >> + */
> >> +static int ufshcd_dme_enable(struct ufs_hba *hba)
> >> +{
> >> +	struct uic_command uic_cmd = {0};
> >> +	int ret;
> >> +	uic_cmd.command = UIC_CMD_DME_ENABLE;
> >> +
> >> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> >> +	if (ret)
> >> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> >> +
> >> +	return ret;
> >> +
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_device_reset_and_restore - reset and restore device
> >> + * @hba: per-adapter instance
> >> + *
> >> + * Note that the device reset issues DME_END_POINT_RESET which
> >> + * may reset entire device and restore device attributes to
> >> + * default state.
> >> + *
> >> + * Returns zero on success, non-zero on failure
> >> + */
> >> +static int ufshcd_device_reset_and_restore(struct ufs_hba *hba)
> >> +{
> >> +	int err = 0;
> >> +	u32 reg;
> >> +
> >> +	err = ufshcd_dme_end_point_reset(hba);
> >> +	if (err)
> >> +		goto out;
> >> +
> >> +	/* restore communication with the device */
> >> +	err = ufshcd_dme_reset(hba);
> >> +	if (err)
> >>   		goto out;
> >> -	} else {
> >> -		err = SUCCESS;
> >> -	}
> >>
> >> -	for (pos = 0; pos < hba->nutrs; pos++) {
> >> -		if (test_bit(pos, &hba->outstanding_reqs) &&
> >> -		    (hba->lrb[tag].lun == hba->lrb[pos].lun)) {
> >> +	err = ufshcd_dme_enable(hba);
> >> +	if (err)
> >> +		goto out;
> >>
> >> -			/* clear the respective UTRLCLR register bit */
> >> -			ufshcd_utrl_clear(hba, pos);
> >> +	err = ufshcd_dme_link_startup(hba);
> > UFS_LOGICAL_RESET is no longer used?
> 
> Yes, I don't see any use for this as of now (given that we are using
> dme_end_point_reset, refer to figure. 7.4 of UFS 1.1 spec). Also, the
> UFS spec. error handling section doesn't mention anything about
> LOGICAL_RESET. If you know a valid use case where we need to have LUN
> reset, please let me know I will bring it back.
Looking at the SCSI mid-layer and other hosts' implementations,
eh_device_reset_handler (= ufshcd_eh_device_reset_handler) may
have the role of a LOGICAL_RESET for a specific LUN.
I found that ENDPOINT_RESET is recommended with IS.DFES in the spec.

Let me add some more comments.
Both 'ufshcd_eh_device_reset_handler' and 'ufshcd_host_reset_and_restore' do almost the same thing.
At a glance, their roles are confusing and mixed up.
'ufshcd_reset_and_restore', which is the actual reset implementation, is eventually called; once the device reset fails,
a host reset is tried.
Actually, that escalation is already handled for each level of error recovery in the SCSI mid-layer. Please check 'drivers/scsi/scsi_error.c'
[scsi_eh_ready_devs, scsi_abort_eh_cmnd].
At this stage, each reset function could be clearly separated.
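
To illustrate the point about escalation, here is a simplified
standalone model of a recovery ladder in which each level is tried in
turn until one succeeds. It is only a sketch and not the code from
scsi_error.c: the enum, the handler type and the stub handlers are
hypothetical, and the real mid-layer does considerably more work per
level.

```c
#include <assert.h>
#include <stddef.h>

/* Recovery levels, from least to most disruptive. */
enum sketch_eh_level {
	SKETCH_EH_ABORT,
	SKETCH_EH_DEVICE_RESET,
	SKETCH_EH_TARGET_RESET,
	SKETCH_EH_BUS_RESET,
	SKETCH_EH_HOST_RESET,
	SKETCH_EH_LEVELS,
};

/* Hypothetical handler type: returns 0 when recovery succeeded. */
typedef int (*sketch_eh_fn)(void *ctx);

/*
 * Try each registered handler in order and return the level that
 * recovered the command, or -1 if every level failed.
 */
static int sketch_eh_escalate(const sketch_eh_fn handlers[SKETCH_EH_LEVELS],
			      void *ctx)
{
	int level;

	for (level = 0; level < SKETCH_EH_LEVELS; level++) {
		if (!handlers[level])
			continue;	/* level not implemented by the host */
		if (handlers[level](ctx) == 0)
			return level;
	}
	return -1;
}

/* Stub handlers for demonstration: they only count invocations. */
static int sketch_eh_fail(void *ctx)
{
	++*(int *)ctx;
	return -1;
}

static int sketch_eh_ok(void *ctx)
{
	++*(int *)ctx;
	return 0;
}
```

Because the caller already walks the ladder, a device reset handler in
this model only has to report failure; it does not need to fall back to
a host reset itself, which is the separation of roles suggested above.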

> 
> > ufshcd_device_reset_and_restore has the role of device reset.
> > Both ufshcd_dme_reset and ufshcd_dme_enable are valid for the local side, not for the remote.
> > Should we do those for the host, including link-startup, here?
> 
> Yes, it is needed. After DME_ENDPOINT_RESET the remote link goes into link down state.
I would like to read the related description, but I couldn't find it. Could you point me to it?

> To initialize the link, the host needs to send
> DME_LINKSTARTUP, but according to Uni-Pro spec. the link-startup can
> only be sent when the local uni-pro is in link-down state. So first
If what you mentioned above is right, the uni-pro is already in the link-down state after DME_ENDPOINT_RESET.
In that case, DME_RESET isn't needed.

> we need to get the local unipro from link-up to disabled to link-down
> using the DME_RESET and DME_ENABLE commands and then issue
> DME_LINKSTARTUP to re-initialize the link.
'ufshcd_hba_enable' can be used instead of both if these are really needed.
This will do dme_reset and dme_enable.
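
The sequence under discussion can be written down as a tiny state
machine. The sketch below models only the local UniPro states as they
are described in this thread (link-up, disabled, link-down); it is not
derived from the UniPro specification text, and all names are
hypothetical.

```c
#include <assert.h>

enum sketch_link_state {
	SKETCH_LINK_DISABLED,
	SKETCH_LINK_DOWN,
	SKETCH_LINK_UP,
};

enum sketch_dme_cmd {
	SKETCH_DME_RESET,
	SKETCH_DME_ENABLE,
	SKETCH_DME_LINKSTARTUP,
};

/*
 * Apply one DME command to the local state.
 * Returns 0 on success, -1 if the command is not valid in this state.
 */
static int sketch_dme_apply(enum sketch_link_state *st,
			    enum sketch_dme_cmd cmd)
{
	assert(st);
	switch (cmd) {
	case SKETCH_DME_RESET:
		*st = SKETCH_LINK_DISABLED;	/* accepted from any state */
		return 0;
	case SKETCH_DME_ENABLE:
		if (*st != SKETCH_LINK_DISABLED)
			return -1;
		*st = SKETCH_LINK_DOWN;
		return 0;
	case SKETCH_DME_LINKSTARTUP:
		if (*st != SKETCH_LINK_DOWN)
			return -1;	/* only allowed in link-down */
		*st = SKETCH_LINK_UP;
		return 0;
	}
	return -1;
}
```

Starting from link-up, a direct link-startup is rejected in this model
and the reset, enable, link-startup order succeeds. Starting from
link-down, link-startup succeeds immediately, which is the case raised
above: whether DME_RESET and DME_ENABLE are needed depends on which
state the local side is really in after DME_ENDPOINT_RESET.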

Thanks,
Seungwon Jeon


* RE: [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling
  2013-07-19 18:26     ` Sujit Reddy Thumma
@ 2013-07-23  8:34       ` Seungwon Jeon
  2013-07-23 15:41         ` Sujit Reddy Thumma
  0 siblings, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-23  8:34 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma'
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
> On 7/19/2013 7:28 PM, Seungwon Jeon wrote:
> > On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
> >> Error handling in UFS driver is broken and resets the host controller
> >> for fatal errors without re-initialization. Correct the fatal error
> >> handling sequence according to UFS Host Controller Interface (HCI)
> >> v1.1 specification.
> >>
> >> o Upon determining fatal error condition the host controller may hang
> >>    forever until a reset is applied, so just retrying the command doesn't
> >>    work without a reset. So, the reset is applied in the driver context
> >>    in a separate work and SCSI mid-layer isn't informed until reset is
> >>    applied.
> >>
> >> o Processed requests which are completed without error are reported to
> >>    SCSI layer as successful and any pending commands that are not started
> >>    yet or are not cause of the error are re-queued into scsi midlayer queue.
> >>    For the command that caused error, host controller or device is reset
> >>    and DID_ERROR is returned for command retry after applying reset.
> >>
> >> o SCSI is informed about the expected Unit-Attentioni exception from the
> > Attention'i',  typo.
> Okay.
> 
> >
> >>    device for the immediate command after a reset so that the SCSI layer
> >>    take necessary steps to establish communication with the device.
> >>
> >> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> >> ---
> >>   drivers/scsi/ufs/ufshcd.c |  349 +++++++++++++++++++++++++++++++++++---------
> >>   drivers/scsi/ufs/ufshcd.h |    2 +
> >>   drivers/scsi/ufs/ufshci.h |   19 ++-
> >>   3 files changed, 295 insertions(+), 75 deletions(-)
> >>
> >> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> >> index b4c9910..2a3874f 100644
> >> --- a/drivers/scsi/ufs/ufshcd.c
> >> +++ b/drivers/scsi/ufs/ufshcd.c
> >> @@ -80,6 +80,14 @@ enum {
> >>   	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
> >>   };
> >>
> >> +/* UFSHCD UIC layer error flags */
> >> +enum {
> >> +	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0), /* Data link layer error */
> >> +	UFSHCD_UIC_NL_ERROR = (1 << 1), /* Network layer error */
> >> +	UFSHCD_UIC_TL_ERROR = (1 << 2), /* Transport Layer error */
> >> +	UFSHCD_UIC_DME_ERROR = (1 << 3), /* DME error */
> >> +};
> >> +
> >>   /* Interrupt configuration options */
> >>   enum {
> >>   	UFSHCD_INT_DISABLE,
> >> @@ -108,6 +116,7 @@ enum {
> >>
> >>   static void ufshcd_tmc_handler(struct ufs_hba *hba);
> >>   static void ufshcd_async_scan(void *data, async_cookie_t cookie);
> >> +static int ufshcd_reset_and_restore(struct ufs_hba *hba);
> >>
> >>   /*
> >>    * ufshcd_wait_for_register - wait for register value to change
> >> @@ -1605,9 +1614,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
> >>   		goto out;
> >>   	}
> >>
> >> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
> >> -		scsi_unblock_requests(hba->host);
> >> -
> >>   out:
> >>   	return err;
> >>   }
> >> @@ -1733,66 +1739,6 @@ static int ufshcd_validate_dev_connection(struct ufs_hba *hba)
> >>   }
> >>
> >>   /**
> >> - * ufshcd_do_reset - reset the host controller
> >> - * @hba: per adapter instance
> >> - *
> >> - * Returns SUCCESS/FAILED
> >> - */
> >> -static int ufshcd_do_reset(struct ufs_hba *hba)
> >> -{
> >> -	struct ufshcd_lrb *lrbp;
> >> -	unsigned long flags;
> >> -	int tag;
> >> -
> >> -	/* block commands from midlayer */
> >> -	scsi_block_requests(hba->host);
> >> -
> >> -	spin_lock_irqsave(hba->host->host_lock, flags);
> >> -	hba->ufshcd_state = UFSHCD_STATE_RESET;
> >> -
> >> -	/* send controller to reset state */
> >> -	ufshcd_hba_stop(hba);
> >> -	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> -
> >> -	/* abort outstanding commands */
> >> -	for (tag = 0; tag < hba->nutrs; tag++) {
> >> -		if (test_bit(tag, &hba->outstanding_reqs)) {
> >> -			lrbp = &hba->lrb[tag];
> >> -			if (lrbp->cmd) {
> >> -				scsi_dma_unmap(lrbp->cmd);
> >> -				lrbp->cmd->result = DID_RESET << 16;
> >> -				lrbp->cmd->scsi_done(lrbp->cmd);
> >> -				lrbp->cmd = NULL;
> >> -				clear_bit_unlock(tag, &hba->lrb_in_use);
> >> -			}
> >> -		}
> >> -	}
> >> -
> >> -	/* complete device management command */
> >> -	if (hba->dev_cmd.complete)
> >> -		complete(hba->dev_cmd.complete);
> >> -
> >> -	/* clear outstanding request/task bit maps */
> >> -	hba->outstanding_reqs = 0;
> >> -	hba->outstanding_tasks = 0;
> >> -
> >> -	/* Host controller enable */
> >> -	if (ufshcd_hba_enable(hba)) {
> >> -		dev_err(hba->dev,
> >> -			"Reset: Controller initialization failed\n");
> >> -		return FAILED;
> >> -	}
> >> -
> >> -	if (ufshcd_link_startup(hba)) {
> >> -		dev_err(hba->dev,
> >> -			"Reset: Link start-up failed\n");
> >> -		return FAILED;
> >> -	}
> >> -
> >> -	return SUCCESS;
> >> -}
> >> -
> >> -/**
> >>    * ufshcd_slave_alloc - handle initial SCSI device configurations
> >>    * @sdev: pointer to SCSI device
> >>    *
> >> @@ -1809,6 +1755,9 @@ static int ufshcd_slave_alloc(struct scsi_device *sdev)
> >>   	sdev->use_10_for_ms = 1;
> >>   	scsi_set_tag_type(sdev, MSG_SIMPLE_TAG);
> >>
> >> +	/* allow SCSI layer to restart the device in case of errors */
> >> +	sdev->allow_restart = 1;
> >> +
> >>   	/*
> >>   	 * Inform SCSI Midlayer that the LUN queue depth is same as the
> >>   	 * controller queue depth. If a LUN queue depth is less than the
> >> @@ -2013,6 +1962,9 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
> >>   	case OCS_ABORTED:
> >>   		result |= DID_ABORT << 16;
> >>   		break;
> >> +	case OCS_INVALID_COMMAND_STATUS:
> >> +		result |= DID_REQUEUE << 16;
> >> +		break;
> >>   	case OCS_INVALID_CMD_TABLE_ATTR:
> >>   	case OCS_INVALID_PRDT_ATTR:
> >>   	case OCS_MISMATCH_DATA_BUF_SIZE:
> >> @@ -2405,42 +2357,295 @@ out:
> >>   	return err;
> >>   }
> >>
> >> +static void ufshcd_decide_eh_xfer_req(struct ufs_hba *hba, u32 ocs)
> >> +{
> >> +	switch (ocs) {
> >> +	case OCS_SUCCESS:
> >> +	case OCS_INVALID_COMMAND_STATUS:
> >> +		break;
> >> +	case OCS_MISMATCH_DATA_BUF_SIZE:
> >> +	case OCS_MISMATCH_RESP_UPIU_SIZE:
> >> +	case OCS_PEER_COMM_FAILURE:
> >> +	case OCS_FATAL_ERROR:
> >> +	case OCS_ABORTED:
> >> +	case OCS_INVALID_CMD_TABLE_ATTR:
> >> +	case OCS_INVALID_PRDT_ATTR:
> >> +		ufshcd_set_host_reset_pending(hba);
> > Should host be reset on ocs error, including below ufshcd_decide_eh_task_req?
> > It's just overall command status.
> 
> Yes, the error handling section in the UFS 1.1 spec. mentions so.
If a host reset is required, it should be limited to fatal situations.
Deciding based on the OCS field does not seem proper; I find no mention of that in the spec.
If I have this wrong, please clarify.

> 
> >
> >> +		break;
> >> +	default:
> >> +		dev_err(hba->dev, "%s: unknown OCS 0x%x\n",
> >> +				__func__, ocs);
> >> +		BUG();
> >> +	}
> >> +}
> >> +
> >> +static void ufshcd_decide_eh_task_req(struct ufs_hba *hba, u32 ocs)
> >> +{
> >> +	switch (ocs) {
> >> +	case OCS_TMR_SUCCESS:
> >> +	case OCS_TMR_INVALID_COMMAND_STATUS:
> >> +		break;
> >> +	case OCS_TMR_MISMATCH_REQ_SIZE:
> >> +	case OCS_TMR_MISMATCH_RESP_SIZE:
> >> +	case OCS_TMR_PEER_COMM_FAILURE:
> >> +	case OCS_TMR_INVALID_ATTR:
> >> +	case OCS_TMR_ABORTED:
> >> +	case OCS_TMR_FATAL_ERROR:
> >> +		ufshcd_set_host_reset_pending(hba);
> >> +		break;
> >> +	default:
> >> +		dev_err(hba->dev, "%s: uknown TMR OCS 0x%x\n",
> >> +				__func__, ocs);
> >> +		BUG();
> >> +	}
> >> +}
> >> +
> >>   /**
> >> - * ufshcd_fatal_err_handler - handle fatal errors
> >> + * ufshcd_error_autopsy_transfer_req() - reads OCS field of failed command and
> >> + *                          decide error handling
> >>    * @hba: per adapter instance
> >> + * @err_xfer: bit mask for transfer request errors
> >> + *
> >> + * Iterate over completed transfer requests and
> >> + * set error handling flags.
> >> + */
> >> +static void
> >> +ufshcd_error_autopsy_transfer_req(struct ufs_hba *hba, u32 *err_xfer)
> >> +{
> >> +	unsigned long completed;
> >> +	u32 doorbell;
> >> +	int index;
> >> +	int ocs;
> >> +
> >> +	if (!err_xfer)
> >> +		goto out;
> >> +
> >> +	doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
> >> +	completed = doorbell ^ (u32)hba->outstanding_reqs;
> >> +
> >> +	for (index = 0; index < hba->nutrs; index++) {
> >> +		if (test_bit(index, &completed)) {
> >> +			ocs = ufshcd_get_tr_ocs(&hba->lrb[index]);
> >> +			if ((ocs == OCS_SUCCESS) ||
> >> +					(ocs == OCS_INVALID_COMMAND_STATUS))
> >> +				continue;
> >> +
> >> +			*err_xfer |= (1 << index);
> >> +			ufshcd_decide_eh_xfer_req(hba, ocs);
> >> +		}
> >> +	}
> >> +out:
> >> +	return;
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_error_autopsy_task_req() - reads OCS field of failed command and
> >> + *                          decide error handling
> >> + * @hba: per adapter instance
> >> + * @err_tm: bit mask for task management errors
> >> + *
> >> + * Iterate over completed task management requests and
> >> + * set error handling flags.
> >> + */
> >> +static void
> >> +ufshcd_error_autopsy_task_req(struct ufs_hba *hba, u32 *err_tm)
> >> +{
> >> +	unsigned long completed;
> >> +	u32 doorbell;
> >> +	int index;
> >> +	int ocs;
> >> +
> >> +	if (!err_tm)
> >> +		goto out;
> >> +
> >> +	doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
> >> +	completed = doorbell ^ (u32)hba->outstanding_tasks;
> >> +
> >> +	for (index = 0; index < hba->nutmrs; index++) {
> >> +		if (test_bit(index, &completed)) {
> >> +			struct utp_task_req_desc *tm_descp;
> >> +
> >> +			tm_descp = hba->utmrdl_base_addr;
> >> +			ocs = ufshcd_get_tmr_ocs(&tm_descp[index]);
> >> +			if ((ocs == OCS_TMR_SUCCESS) ||
> >> +					(ocs == OCS_TMR_INVALID_COMMAND_STATUS))
> >> +				continue;
> >> +
> >> +			*err_tm |= (1 << index);
> >> +			ufshcd_decide_eh_task_req(hba, ocs);
> >> +		}
> >> +	}
> >> +
> >> +out:
> >> +	return;
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_fatal_err_handler - handle fatal errors
> >> + * @work: pointer to work structure
> >>    */
> >>   static void ufshcd_fatal_err_handler(struct work_struct *work)
> >>   {
> >>   	struct ufs_hba *hba;
> >> +	unsigned long flags;
> >> +	u32 err_xfer = 0;
> >> +	u32 err_tm = 0;
> >> +	int err;
> >> +
> >>   	hba = container_of(work, struct ufs_hba, feh_workq);
> >>
> >>   	pm_runtime_get_sync(hba->dev);
> >> -	/* check if reset is already in progress */
> >> -	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
> >> -		ufshcd_do_reset(hba);
> >> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >> +	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
> >> +		/* complete processed requests and exit */
> >> +		ufshcd_transfer_req_compl(hba);
> >> +		ufshcd_tmc_handler(hba);
> >> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> +		pm_runtime_put_sync(hba->dev);
> >> +		return;
> > Host driver is here with finishing 'scsi_block_requests'.
> > 'scsi_unblock_requests' can be called somewhere?
> 
> No, but it can be possible that SCSI command timeout which triggers
> device/host reset and fatal error handler race each other.
Sorry, I didn't get your meaning exactly.
I saw that scsi_block_requests is done before ufshcd_fatal_err_handler is scheduled.
If a device or host reset was requested from the scsi mid-layer just before ufshcd_fatal_err_handler runs,
ufshcd_fatal_err_handler will exit through this if statement. Then nothing calls scsi_unblock_requests,
even though the device/host reset completes successfully.
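
The concern can be reproduced with a small stand-alone model. This is a sketch only: `toy_hba`, the state names and the helper functions are hypothetical stand-ins, and the single `requests_blocked` flag is an assumed simplification of what scsi_block_requests/scsi_unblock_requests do to the host.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver states used in the patch. */
enum toy_state { TOY_STATE_RESET, TOY_STATE_ERROR, TOY_STATE_OPERATIONAL };

struct toy_hba {
	enum toy_state state;
	bool requests_blocked;	/* models the effect of scsi_block_requests() */
};

/* Interrupt path: block requests, then schedule the fatal error work. */
static void toy_schedule_fatal_eh(struct toy_hba *hba)
{
	assert(hba);
	hba->requests_blocked = true;
}

/*
 * Fatal error work, shaped like the posted handler: if a reset is already
 * in progress it returns early, and that path never unblocks requests.
 */
static void toy_fatal_eh_as_posted(struct toy_hba *hba)
{
	assert(hba);

	if (hba->state == TOY_STATE_RESET)
		return;			/* early exit, still blocked */

	hba->state = TOY_STATE_RESET;
	/* reset and restore elided in this model */
	hba->state = TOY_STATE_OPERATIONAL;
	hba->requests_blocked = false;	/* models scsi_unblock_requests() */
}
```

Running the handler while the state is already the reset state leaves `requests_blocked` set, which is the missing unblock being asked about.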
> 
> >
> >> +	}
> >> +
> >> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
> >> +	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
> >> +	ufshcd_error_autopsy_task_req(hba, &err_tm);
> >> +
> >> +	/*
> >> +	 * Complete successful and pending transfer requests.
> >> +	 * DID_REQUEUE is returned for pending requests as they have
> >> +	 * nothing to do with error'ed request and SCSI layer should
> >> +	 * not treat them as errors and decrement retry count.
> >> +	 */
> >> +	hba->outstanding_reqs &= ~err_xfer;
> >> +	ufshcd_transfer_req_compl(hba);
> >> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> +	ufshcd_complete_pending_reqs(hba);
> >> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >> +	hba->outstanding_reqs |= err_xfer;
> > Hmm... error handling seems so complicated.
> > To simplify it, how about below?
> >
> > 1. If requests(transfer or task management) are completed, finish them with success/failure.
> This is what we are trying to do above.
> 
> > 2. If there are pending requests, abort them.
> No, if a fatal error is occurred it is possible that host controller is
> freez'ed we are not sure if it can take task management commands and
> execute them.
I meant aborting the requests by clearing the corresponding bits in UTRLCLR/UTMRLCLR.
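
As a sketch of what such a clear amounts to, assuming the convention used elsewhere in this series (`ufshcd_writel(hba, ~(1 << tag), REG_UTP_TASK_REQ_LIST_CLEAR)`), where a 0 bit selects the slot to clear and 1 bits leave other slots alone. The helper names are hypothetical, and the doorbell update is a model of the expected hardware effect, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Value to write to a list-clear register to clear one slot.
 * Assumption: a 0 bit selects the slot, 1 bits have no effect.
 */
static uint32_t list_clear_value(unsigned int slot)
{
	assert(slot < 32);
	return ~(UINT32_C(1) << slot);
}

/* Model of the doorbell register after the controller honours the clear. */
static uint32_t doorbell_after_clear(uint32_t doorbell, uint32_t clear_value)
{
	return doorbell & clear_value;
}
```

Only the selected slot drops out of the doorbell in this model; the other pending slots are left untouched, which is why no task management command is needed for this kind of abort.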

> 
> > 3. If fatal error, reset.
> >
> 
> 
> >> +
> >> +	/* Complete successful and pending task requests */
> >> +	hba->outstanding_tasks &= ~err_tm;
> >> +	ufshcd_tmc_handler(hba);
> >> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> +	ufshcd_complete_pending_tasks(hba);
> >> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >> +
> >> +	hba->outstanding_tasks |= err_tm;
> >> +
> >> +	/*
> >> +	 * Controller may generate multiple fatal errors, handle
> >> +	 * errors based on severity.
> >> +	 * 1) DEVICE_FATAL_ERROR
> >> +	 * 2) SYSTEM_BUS/CONTROLLER_FATAL_ERROR
> >> +	 * 3) UIC_ERROR
> >> +	 */
> >> +	if (hba->errors & DEVICE_FATAL_ERROR) {
> >> +		/*
> >> +		 * Some HBAs may not clear UTRLDBR/UTMRLDBR or update
> >> +		 * OCS field on device fatal error.
> >> +		 */
> >> +		ufshcd_set_host_reset_pending(hba);
> > In DEVICE_FATAL_ERROR, ufshcd_device_reset_pending is right?
> 
> It looks so, but the spec. mentions to reset the host as well (8.3.6).
Are you pointing to the text below?
[8.3.6. Device Errors are fatal errors. ...the host software shall reset the device too.]

> 
> >
> >> +	} else if (hba->errors & (SYSTEM_BUS_FATAL_ERROR |
> >> +			CONTROLLER_FATAL_ERROR)) {
> >> +		/* eh flags should be set in err autopsy based on OCS values */
> >> +		if (!hba->eh_flags)
> >> +			WARN(1, "%s: fatal error without error handling\n",
> >> +				dev_name(hba->dev));
> >> +	} else if (hba->errors & UIC_ERROR) {
> >> +		if (hba->uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR) {
> >> +			/* fatal error - reset controller */
> >> +			ufshcd_set_host_reset_pending(hba);
> >> +		} else if (hba->uic_error & (UFSHCD_UIC_NL_ERROR |
> >> +					UFSHCD_UIC_TL_ERROR |
> >> +					UFSHCD_UIC_DME_ERROR)) {
> >> +			/* non-fatal, report error to SCSI layer */
> >> +			if (!hba->eh_flags) {
> >> +				spin_unlock_irqrestore(
> >> +						hba->host->host_lock, flags);
> >> +				ufshcd_complete_pending_reqs(hba);
> >> +				ufshcd_complete_pending_tasks(hba);
> >> +				spin_lock_irqsave(hba->host->host_lock, flags);
> >> +			}
> >> +		}
> >> +	}
> >> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> +
> >> +	if (hba->eh_flags) {
> >> +		err = ufshcd_reset_and_restore(hba);
> >> +		if (err) {
> >> +			ufshcd_clear_host_reset_pending(hba);
> >> +			ufshcd_clear_device_reset_pending(hba);
> >> +			dev_err(hba->dev, "%s: reset and restore failed\n",
> >> +					__func__);
> >> +			hba->ufshcd_state = UFSHCD_STATE_ERROR;
> >> +		}
> >> +		/*
> >> +		 * Inform scsi mid-layer that we did reset and allow to handle
> >> +		 * Unit Attention properly.
> >> +		 */
> >> +		scsi_report_bus_reset(hba->host, 0);
> >> +		hba->errors = 0;
> >> +		hba->uic_error = 0;
> >> +	}
> >> +	scsi_unblock_requests(hba->host);
> >>   	pm_runtime_put_sync(hba->dev);
> >>   }
> >>
> >>   /**
> >> - * ufshcd_err_handler - Check for fatal errors
> >> - * @work: pointer to a work queue structure
> >> + * ufshcd_update_uic_error - check and set fatal UIC error flags.
> >> + * @hba: per-adapter instance
> >>    */
> >> -static void ufshcd_err_handler(struct ufs_hba *hba)
> >> +static void ufshcd_update_uic_error(struct ufs_hba *hba)
> >>   {
> >>   	u32 reg;
> >>
> >> +	/* PA_INIT_ERROR is fatal and needs UIC reset */
> >> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
> >> +	if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
> >> +		hba->uic_error |= UFSHCD_UIC_DL_PA_INIT_ERROR;
> >> +
> >> +	/* UIC NL/TL/DME errors needs software retry */
> >> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_NETWORK_LAYER);
> >> +	if (reg)
> >> +		hba->uic_error |= UFSHCD_UIC_NL_ERROR;
> >> +
> >> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_TRANSPORT_LAYER);
> >> +	if (reg)
> >> +		hba->uic_error |= UFSHCD_UIC_TL_ERROR;
> >> +
> >> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DME);
> >> +	if (reg)
> >> +		hba->uic_error |= UFSHCD_UIC_DME_ERROR;
> > REG_UIC_ERROR_CODE_PHY_ADAPTER_LAYER is not handled.
> 
> UFS spec. mentions that it is non-fatal error and UIC recovers
> by itself and doesn't need software intervention.
Ok.

> 
> >
> >> +
> >> +	dev_dbg(hba->dev, "%s: UIC error flags = 0x%08x\n",
> >> +			__func__, hba->uic_error);
> >> +}
> >> +
> >> +/**
> >> + * ufshcd_err_handler - Check for fatal errors
> >> + * @hba: per-adapter instance
> >> + */
> >> +static void ufshcd_err_handler(struct ufs_hba *hba)
> >> +{
> >>   	if (hba->errors & INT_FATAL_ERRORS)
> >>   		goto fatal_eh;
> >>
> >>   	if (hba->errors & UIC_ERROR) {
> >> -		reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
> >> -		if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
> >> +		hba->uic_error = 0;
> >> +		ufshcd_update_uic_error(hba);
> >> +		if (hba->uic_error)
> > Except UFSHCD_UIC_DL_PA_INIT_ERROR, it's not fatal. Should it go to fatal_eh?
> 
> Please see the UIC error handling in ufshcd_fatal_err_handler(), others
> need software intervention so I combined it with fatal_eh to complete
> the requests and report to SCSI.
Gathering all error handling (fatal and non-fatal) into the original handler is confusing.
In that case, it would be better to rename ufshcd_fatal_err_handler.

Thanks,
Seungwon Jeon

* Re: [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation
  2013-07-23  8:24       ` Seungwon Jeon
@ 2013-07-23 15:40         ` Sujit Reddy Thumma
  0 siblings, 0 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-23 15:40 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/23/2013 1:54 PM, Seungwon Jeon wrote:
> On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
>> On 7/19/2013 7:26 PM, Seungwon Jeon wrote:
>>> On Tue, July 09, 2013 Sujit Reddy Thumma wrote:
>>>> Currently, sending Task Management (TM) command to the card might
>>>> be broken in some scenarios as listed below:
>>>>
>>>> Problem: If there are more than 8 TM commands the implementation
>>>>            returns error to the caller.
>>>> Fix:     Wait for one of the slots to be emptied and send the command.
>>>>
>>>> Problem: Sometimes it is necessary for the caller to know the TM service
>>>>            response code to determine the task status.
>>>> Fix:     Propogate the service response to the caller.
>>>>
>>>> Problem: If the TM command times out no proper error recovery is
>>>>            implemented.
>>>> Fix:     Clear the command in the controller door-bell register, so that
>>>>            further commands for the same slot don't fail.
>>>>
>>>> Problem: While preparing the TM command descriptor, the task tag used
>>>>            should be unique across SCSI/NOP/QUERY/TM commands and not the
>>>> 	 task tag of the command which the TM command is trying to manage.
>>>> Fix:     Use a unique task tag instead of task tag of SCSI command.
>>>>
>>>> Problem: Since the TM command involves H/W communication, abruptly ending
>>>>            the request on kill interrupt signal might cause h/w malfunction.
>>>> Fix:     Wait for hardware completion interrupt with TASK_UNINTERRUPTIBLE
>>>>            set.
>>>>
>>>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>>>> ---
>>>>    drivers/scsi/ufs/ufshcd.c |  177 ++++++++++++++++++++++++++++++---------------
>>>>    drivers/scsi/ufs/ufshcd.h |    8 ++-
>>>>    2 files changed, 126 insertions(+), 59 deletions(-)
>>>>
>>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>>>> index af7d01d..a176421 100644
>>>> --- a/drivers/scsi/ufs/ufshcd.c
>>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>>> @@ -53,6 +53,9 @@
>>>>    /* Query request timeout */
>>>>    #define QUERY_REQ_TIMEOUT 30 /* msec */
>>>>
>>>> +/* Task management command timeout */
>>>> +#define TM_CMD_TIMEOUT	100 /* msecs */
>>>> +
>>>>    /* Expose the flag value from utp_upiu_query.value */
>>>>    #define MASK_QUERY_UPIU_FLAG_LOC 0xFF
>>>>
>>>> @@ -190,13 +193,35 @@ ufshcd_get_tmr_ocs(struct utp_task_req_desc *task_req_descp)
>>>>    /**
>>>>     * ufshcd_get_tm_free_slot - get a free slot for task management request
>>>>     * @hba: per adapter instance
>>>> + * @free_slot: pointer to variable with available slot value
>>>>     *
>>>> - * Returns maximum number of task management request slots in case of
>>>> - * task management queue full or returns the free slot number
>>>> + * Get a free tag and lock it until ufshcd_put_tm_slot() is called.
>>>> + * Returns 0 if free slot is not available, else return 1 with tag value
>>>> + * in @free_slot.
>>>>     */
>>>> -static inline int ufshcd_get_tm_free_slot(struct ufs_hba *hba)
>>>> +static bool ufshcd_get_tm_free_slot(struct ufs_hba *hba, int *free_slot)
>>>> +{
>>>> +	int tag;
>>>> +	bool ret = false;
>>>> +
>>>> +	if (!free_slot)
>>>> +		goto out;
>>>> +
>>>> +	do {
>>>> +		tag = find_first_zero_bit(&hba->tm_slots_in_use, hba->nutmrs);
>>>> +		if (tag >= hba->nutmrs)
>>>> +			goto out;
>>>> +	} while (test_and_set_bit_lock(tag, &hba->tm_slots_in_use));
>>>> +
>>>> +	*free_slot = tag;
>>>> +	ret = true;
>>>> +out:
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static inline void ufshcd_put_tm_slot(struct ufs_hba *hba, int slot)
>>>>    {
>>>> -	return find_first_zero_bit(&hba->outstanding_tasks, hba->nutmrs);
>>>> +	clear_bit_unlock(slot, &hba->tm_slots_in_use);
>>>>    }
>>>>
>>>>    /**
>>>> @@ -1778,10 +1803,11 @@ static void ufshcd_slave_destroy(struct scsi_device *sdev)
>>>>     * ufshcd_task_req_compl - handle task management request completion
>>>>     * @hba: per adapter instance
>>>>     * @index: index of the completed request
>>>> + * @resp: task management service response
>>>>     *
>>>> - * Returns SUCCESS/FAILED
>>>> + * Returns non-zero value on error, zero on success
>>>>     */
>>>> -static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
>>>> +static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index, u8 *resp)
>>>>    {
>>>>    	struct utp_task_req_desc *task_req_descp;
>>>>    	struct utp_upiu_task_rsp *task_rsp_upiup;
>>>> @@ -1802,19 +1828,15 @@ static int ufshcd_task_req_compl(struct ufs_hba *hba, u32 index)
>>>>    				task_req_descp[index].task_rsp_upiu;
>>>>    		task_result = be32_to_cpu(task_rsp_upiup->header.dword_1);
>>>>    		task_result = ((task_result & MASK_TASK_RESPONSE) >> 8);
>>>> -
>>>> -		if (task_result != UPIU_TASK_MANAGEMENT_FUNC_COMPL &&
>>>> -		    task_result != UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED)
>>>> -			task_result = FAILED;
>>>> -		else
>>>> -			task_result = SUCCESS;
>>>> +		if (resp)
>>>> +			*resp = (u8)task_result;
>>>>    	} else {
>>>> -		task_result = FAILED;
>>>> -		dev_err(hba->dev,
>>>> -			"trc: Invalid ocs = %x\n", ocs_value);
>>>> +		dev_err(hba->dev, "%s: failed, ocs = 0x%x\n",
>>>> +				__func__, ocs_value);
>>>>    	}
>>>>    	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> -	return task_result;
>>>> +
>>>> +	return ocs_value;
>>>>    }
>>>>
>>>>    /**
>>>> @@ -2298,7 +2320,7 @@ static void ufshcd_tmc_handler(struct ufs_hba *hba)
>>>>
>>>>    	tm_doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
>>>>    	hba->tm_condition = tm_doorbell ^ hba->outstanding_tasks;
>>>> -	wake_up_interruptible(&hba->ufshcd_tm_wait_queue);
>>>> +	wake_up(&hba->tm_wq);
>>>>    }
>>>>
>>>>    /**
>>>> @@ -2348,38 +2370,61 @@ static irqreturn_t ufshcd_intr(int irq, void *__hba)
>>>>    	return retval;
>>>>    }
>>>>
>>>> +static int ufshcd_clear_tm_cmd(struct ufs_hba *hba, int tag)
>>>> +{
>>>> +	int err = 0;
>>>> +	u32 reg;
>>>> +	u32 mask = 1 << tag;
>>>> +	unsigned long flags;
>>>> +
>>>> +	if (!test_bit(tag, &hba->outstanding_reqs))
>>>> +		goto out;
>>>> +
>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>> +	ufshcd_writel(hba, ~(1 << tag), REG_UTP_TASK_REQ_LIST_CLEAR);
>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> +
>>>> +	/* poll for max. 1 sec to clear door bell register by h/w */
>>>> +	reg = ufshcd_wait_for_register(hba,
>>>> +			REG_UTP_TASK_REQ_DOOR_BELL,
>>>> +			mask, 0, 1000, 1000);
>>>> +	if ((reg & mask) == mask)
>>>> +		err = -ETIMEDOUT;
>>>> +out:
>>>> +	return err;
>>>> +}
>>>> +
>>>>    /**
>>>>     * ufshcd_issue_tm_cmd - issues task management commands to controller
>>>>     * @hba: per adapter instance
>>>> - * @lrbp: pointer to local reference block
>>>> + * @lun_id: LUN ID to which TM command is sent
>>>> + * @task_id: task ID to which the TM command is applicable
>>>> + * @tm_function: task management function opcode
>>>> + * @tm_response: task management service response return value
>>>>     *
>>>> - * Returns SUCCESS/FAILED
>>>> + * Returns non-zero value on error, zero on success.
>>>>     */
>>>> -static int
>>>> -ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>>>> -		    struct ufshcd_lrb *lrbp,
>>>> -		    u8 tm_function)
>>>> +static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
>>>> +		u8 tm_function, u8 *tm_response)
>>>>    {
>>>>    	struct utp_task_req_desc *task_req_descp;
>>>>    	struct utp_upiu_task_req *task_req_upiup;
>>>>    	struct Scsi_Host *host;
>>>>    	unsigned long flags;
>>>> -	int free_slot = 0;
>>>> +	int free_slot;
>>>>    	int err;
>>>> +	int task_tag;
>>>>
>>>>    	host = hba->host;
>>>>
>>>> -	spin_lock_irqsave(host->host_lock, flags);
>>>> -
>>>> -	/* If task management queue is full */
>>>> -	free_slot = ufshcd_get_tm_free_slot(hba);
>>>> -	if (free_slot >= hba->nutmrs) {
>>>> -		spin_unlock_irqrestore(host->host_lock, flags);
>>>> -		dev_err(hba->dev, "Task management queue full\n");
>>>> -		err = FAILED;
>>>> -		goto out;
>>>> -	}
>>>> +	/*
>>>> +	 * Get free slot, sleep if slots are unavailable.
>>>> +	 * Even though we use wait_event() which sleeps indefinitely,
>>>> +	 * the maximum wait time is bounded by %TM_CMD_TIMEOUT.
>>>> +	 */
>>>> +	wait_event(hba->tm_tag_wq, ufshcd_get_tm_free_slot(hba, &free_slot));
>>>>
>>>> +	spin_lock_irqsave(host->host_lock, flags);
>>>>    	task_req_descp = hba->utmrdl_base_addr;
>>>>    	task_req_descp += free_slot;
>>>>
>>>> @@ -2391,18 +2436,15 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>>>>    	/* Configure task request UPIU */
>>>>    	task_req_upiup =
>>>>    		(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
>>>> +	task_tag = hba->nutrs + free_slot;
>>> Possible, did you intend 'hba->nutmrs', not 'hba->nutrs'?
>>> I think it's safer with hba->nutmrs if we can't sure that NUTRS is larger than NUTMRS.
>>
>> It should be hba->nutrs and not hba->nutmrs.
>>
>> The equation is -
>> 0 <= free_slot < hba->nutmrs
>> 0 <= transfer_req_task_id < hba->nutrs
>> hba->nutrs <= tm_req_task_id < hba->nutmrs + hba->nutrs
>>
>> Whatever be the values of NUTRS/NUTMRS the above gives a unique
>> task_id.
> Yes.
> 
>>
>>
>>>
>>>>    	task_req_upiup->header.dword_0 =
>>>>    		UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
>>>> -					      lrbp->lun, lrbp->task_tag);
>>>> +				lun_id, task_tag);
>>>>    	task_req_upiup->header.dword_1 =
>>>>    		UPIU_HEADER_DWORD(0, tm_function, 0, 0);
>>>>
>>>> -	task_req_upiup->input_param1 = lrbp->lun;
>>>> -	task_req_upiup->input_param1 =
>>>> -		cpu_to_be32(task_req_upiup->input_param1);
>>>> -	task_req_upiup->input_param2 = lrbp->task_tag;
>>>> -	task_req_upiup->input_param2 =
>>>> -		cpu_to_be32(task_req_upiup->input_param2);
>>>> +	task_req_upiup->input_param1 = cpu_to_be32(lun_id);
>>>> +	task_req_upiup->input_param2 = cpu_to_be32(task_id);
>>>>
>>>>    	/* send command to the controller */
>>>>    	__set_bit(free_slot, &hba->outstanding_tasks);
>>>> @@ -2411,20 +2453,24 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>>>>    	spin_unlock_irqrestore(host->host_lock, flags);
>>>>
>>>>    	/* wait until the task management command is completed */
>>>> -	err =
>>>> -	wait_event_interruptible_timeout(hba->ufshcd_tm_wait_queue,
>>>> -					 (test_bit(free_slot,
>>>> -					 &hba->tm_condition) != 0),
>>>> -					 60 * HZ);
>>>> +	err = wait_event_timeout(hba->tm_wq,
>>>> +			test_bit(free_slot, &hba->tm_condition),
>>>> +			msecs_to_jiffies(TM_CMD_TIMEOUT));
>>>>    	if (!err) {
>>>> -		dev_err(hba->dev,
>>>> -			"Task management command timed-out\n");
>>>> -		err = FAILED;
>>>> -		goto out;
>>>> +		dev_err(hba->dev, "%s: task management cmd 0x%.2x timed-out\n",
>>>> +				__func__, tm_function);
>>>> +		if (ufshcd_clear_tm_cmd(hba, free_slot))
>>>> +			dev_WARN(hba->dev, "%s: unable clear tm cmd (slot %d) after timeout\n",
>>>> +					__func__, free_slot);
>>>> +		err = -ETIMEDOUT;
>>>> +	} else {
>>>> +		err = ufshcd_task_req_compl(hba, free_slot, tm_response);
>>>>    	}
>>>> +
>>>>    	clear_bit(free_slot, &hba->tm_condition);
>>>> -	err = ufshcd_task_req_compl(hba, free_slot);
>>>> -out:
>>>> +	ufshcd_put_tm_slot(hba, free_slot);
>>>> +	wake_up(&hba->tm_tag_wq);
>>>> +
>>>>    	return err;
>>>>    }
>>>>
>>>> @@ -2441,14 +2487,22 @@ static int ufshcd_device_reset(struct scsi_cmnd *cmd)
>>>>    	unsigned int tag;
>>>>    	u32 pos;
>>>>    	int err;
>>>> +	u8 resp;
>>>> +	struct ufshcd_lrb *lrbp;
>>>>
>>>>    	host = cmd->device->host;
>>>>    	hba = shost_priv(host);
>>>>    	tag = cmd->request->tag;
>>>>
>>>> -	err = ufshcd_issue_tm_cmd(hba, &hba->lrb[tag], UFS_LOGICAL_RESET);
>>>> -	if (err == FAILED)
>>>> +	lrbp = &hba->lrb[tag];
>>>> +	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>>> Argument 2nd, 3rd can be replaced by lrbp.
>>> Then, we can reduce the number of argument.
>>>
>>
>> TM issue command doesn't need to know about lrbp, It just need
>> LUN ID and task ID. This helps when we are not dealing with lrbp's
>> and just want to issue some other TM command.
>> I believe an extra argument is not so costly on the systems which
>> demand high performance UFS devices.
> Yes, you're right. only need LUN ID and task ID.
> It might be trivial. But 'lrbp' should be referred for getting these.
> 

How the caller obtains lun_id/task_id is of no concern to
ufshcd_issue_tm_cmd().

I prefer this way, even though lrbp is needed anyway to determine the IDs.
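
On the task tag ranges quoted earlier in this message, a minimal check of the claim that `hba->nutrs + free_slot` never collides with a transfer request tag, whatever the values of NUTRS and NUTMRS. The function names are hypothetical and the sketch only models the arithmetic, not the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Task tag used for a TM request in slot free_slot, as in the patch. */
static int tm_task_tag(int nutrs, int free_slot)
{
	return nutrs + free_slot;
}

/*
 * Check every TM slot: its tag must lie in [nutrs, nutrs + nutmrs),
 * which is disjoint from the transfer request tags [0, nutrs).
 */
static bool tm_tags_are_unique(int nutrs, int nutmrs)
{
	assert(nutrs >= 0 && nutmrs >= 0);

	for (int slot = 0; slot < nutmrs; slot++) {
		int tag = tm_task_tag(nutrs, slot);

		if (tag < nutrs || tag >= nutrs + nutmrs)
			return false;
	}
	return true;
}
```

The check passes both when NUTRS is larger than NUTMRS and when it is smaller, since the offset is the size of the transfer request range itself.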

--
Regards,
Sujit

* Re: [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods
  2013-07-23  8:27       ` Seungwon Jeon
@ 2013-07-23 15:40         ` Sujit Reddy Thumma
  2013-07-24 13:39           ` Seungwon Jeon
  0 siblings, 1 reply; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-23 15:40 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/23/2013 1:57 PM, Seungwon Jeon wrote:
> On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
>> On 7/19/2013 7:27 PM, Seungwon Jeon wrote:
>>> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
>>>> As of now SCSI initiated error handling is broken because,
>>>> the reset APIs don't try to bring back the device initialized and
>>>> ready for further transfers.
>>>>
>>>> In case of timeouts, the scsi error handler takes care of handling aborts
>>>> and resets. Improve the error handling in such scenario by resetting the
>>>> device and host and re-initializing them in proper manner.
>>>>
>>>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>>>> ---
>>>>    drivers/scsi/ufs/ufshcd.c |  467 +++++++++++++++++++++++++++++++++++++++------
>>>>    drivers/scsi/ufs/ufshcd.h |    2 +
>>>>    2 files changed, 411 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>>>> index 51ce096..b4c9910 100644
>>>> --- a/drivers/scsi/ufs/ufshcd.c
>>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>>> @@ -69,9 +69,15 @@ enum {
>>>>
>>>>    /* UFSHCD states */
>>>>    enum {
>>>> -	UFSHCD_STATE_OPERATIONAL,
>>>>    	UFSHCD_STATE_RESET,
>>>>    	UFSHCD_STATE_ERROR,
>>>> +	UFSHCD_STATE_OPERATIONAL,
>>>> +};
>>>> +
>>>> +/* UFSHCD error handling flags */
>>>> +enum {
>>>> +	UFSHCD_EH_HOST_RESET_PENDING = (1 << 0),
>>>> +	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>>>>    };
>>>>
>>>>    /* Interrupt configuration options */
>>>> @@ -87,6 +93,22 @@ enum {
>>>>    	INT_AGGR_CONFIG,
>>>>    };
>>>>
>>>> +#define ufshcd_set_device_reset_pending(h) \
>>>> +	(h->eh_flags |= UFSHCD_EH_DEVICE_RESET_PENDING)
>>>> +#define ufshcd_set_host_reset_pending(h) \
>>>> +	(h->eh_flags |= UFSHCD_EH_HOST_RESET_PENDING)
>>>> +#define ufshcd_device_reset_pending(h) \
>>>> +	(h->eh_flags & UFSHCD_EH_DEVICE_RESET_PENDING)
>>>> +#define ufshcd_host_reset_pending(h) \
>>>> +	(h->eh_flags & UFSHCD_EH_HOST_RESET_PENDING)
>>>> +#define ufshcd_clear_device_reset_pending(h) \
>>>> +	(h->eh_flags &= ~UFSHCD_EH_DEVICE_RESET_PENDING)
>>>> +#define ufshcd_clear_host_reset_pending(h) \
>>>> +	(h->eh_flags &= ~UFSHCD_EH_HOST_RESET_PENDING)
>>>> +
>>>> +static void ufshcd_tmc_handler(struct ufs_hba *hba);
>>>> +static void ufshcd_async_scan(void *data, async_cookie_t cookie);
>>>> +
>>>>    /*
>>>>     * ufshcd_wait_for_register - wait for register value to change
>>>>     * @hba - per-adapter interface
>>>> @@ -851,9 +873,22 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>>>>
>>>>    	tag = cmd->request->tag;
>>>>
>>>> -	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL) {
>>>> +	switch (hba->ufshcd_state) {
>>> Lock is no needed for ufshcd_state?
> Please check?

Yes, it is needed. Thanks for catching this.

> 
>>>
>>>> +	case UFSHCD_STATE_OPERATIONAL:
>>>> +		break;
>>>> +	case UFSHCD_STATE_RESET:
>>>>    		err = SCSI_MLQUEUE_HOST_BUSY;
>>>>    		goto out;
>>>> +	case UFSHCD_STATE_ERROR:
>>>> +		set_host_byte(cmd, DID_ERROR);
>>>> +		cmd->scsi_done(cmd);
>>>> +		goto out;
>>>> +	default:
>>>> +		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
>>>> +				__func__, hba->ufshcd_state);
>>>> +		set_host_byte(cmd, DID_BAD_TARGET);
>>>> +		cmd->scsi_done(cmd);
>>>> +		goto out;
>>>>    	}
>>>>
>>>>    	/* acquire the tag to make sure device cmds don't use it */
>>>> @@ -1573,8 +1608,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>>>>    	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>>>>    		scsi_unblock_requests(hba->host);
>>>>
>>>> -	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
>>>> -
>>>>    out:
>>>>    	return err;
>>>>    }
>>>> @@ -2273,6 +2306,106 @@ out:
>>>>    }
>>>>
>>>>    /**
>>>> + * ufshcd_utrl_is_rsr_enabled - check if run-stop register is enabled
>>>> + * @hba: per-adapter instance
>>>> + */
>>>> +static bool ufshcd_utrl_is_rsr_enabled(struct ufs_hba *hba)
>>>> +{
>>>> +	return ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_LIST_RUN_STOP) & 0x1;
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_utmrl_is_rsr_enabled - check if run-stop register is enabled
>>>> + * @hba: per-adapter instance
>>>> + */
>>>> +static bool ufshcd_utmrl_is_rsr_enabled(struct ufs_hba *hba)
>>>> +{
>>>> +	return ufshcd_readl(hba, REG_UTP_TASK_REQ_LIST_RUN_STOP) & 0x1;
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_complete_pending_tasks - complete outstanding tasks
>>>> + * @hba: per adapter instance
>>>> + *
>>>> + * Abort in-progress task management commands and wakeup
>>>> + * waiting threads.
>>>> + *
>>>> + * Returns non-zero error value when failed to clear all the commands.
>>>> + */
>>>> +static int ufshcd_complete_pending_tasks(struct ufs_hba *hba)
>>>> +{
>>>> +	u32 reg;
>>>> +	int err = 0;
>>>> +	unsigned long flags;
>>>> +
>>>> +	if (!hba->outstanding_tasks)
>>>> +		goto out;
>>>> +
>>>> +	/* Clear UTMRL only when run-stop is enabled */
>>>> +	if (ufshcd_utmrl_is_rsr_enabled(hba))
>>>> +		ufshcd_writel(hba, ~hba->outstanding_tasks,
>>>> +				REG_UTP_TASK_REQ_LIST_CLEAR);
>>>> +
>>>> +	/* poll for max. 1 sec to clear door bell register by h/w */
>>>> +	reg = ufshcd_wait_for_register(hba,
>>>> +			REG_UTP_TASK_REQ_DOOR_BELL,
>>>> +			hba->outstanding_tasks, 0, 1000, 1000);
>>>> +	if (reg & hba->outstanding_tasks)
>>>> +		err = -ETIMEDOUT;
>>>> +
>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>> +	/* complete commands that were cleared out */
>>>> +	ufshcd_tmc_handler(hba);
>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> +out:
>>>> +	if (err)
>>>> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
>>>> +				__func__, reg);
>>>> +	return err;
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_complete_pending_reqs - complete outstanding requests
>>>> + * @hba: per adapter instance
>>>> + *
>>>> + * Abort in-progress transfer request commands and return them to SCSI.
>>>> + *
>>>> + * Returns non-zero error value when failed to clear all the commands.
>>>> + */
>>>> +static int ufshcd_complete_pending_reqs(struct ufs_hba *hba)
>>>> +{
>>>> +	u32 reg;
>>>> +	int err = 0;
>>>> +	unsigned long flags;
>>>> +
>>>> +	/* check if we completed all of them */
>>>> +	if (!hba->outstanding_reqs)
>>>> +		goto out;
>>>> +
>>>> +	/* Clear UTRL only when run-stop is enabled */
>>>> +	if (ufshcd_utrl_is_rsr_enabled(hba))
>>>> +		ufshcd_writel(hba, ~hba->outstanding_reqs,
>>>> +				REG_UTP_TRANSFER_REQ_LIST_CLEAR);
>>>> +
>>>> +	/* poll for max. 1 sec to clear door bell register by h/w */
>>>> +	reg = ufshcd_wait_for_register(hba,
>>>> +			REG_UTP_TRANSFER_REQ_DOOR_BELL,
>>>> +			hba->outstanding_reqs, 0, 1000, 1000);
>>>> +	if (reg & hba->outstanding_reqs)
>>>> +		err = -ETIMEDOUT;
>>>> +
>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>> +	/* complete commands that were cleared out */
>>>> +	ufshcd_transfer_req_compl(hba);
>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> +out:
>>>> +	if (err)
>>>> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
>>>> +				__func__, reg);
>>>> +	return err;
>>>> +}
>>>> +
>>>> +/**
>>>>     * ufshcd_fatal_err_handler - handle fatal errors
>>>>     * @hba: per adapter instance
>>>>     */
>>>> @@ -2306,8 +2439,12 @@ static void ufshcd_err_handler(struct ufs_hba *hba)
>>>>    	}
>>>>    	return;
>>>>    fatal_eh:
>>>> -	hba->ufshcd_state = UFSHCD_STATE_ERROR;
>>>> -	schedule_work(&hba->feh_workq);
>>>> +	/* handle fatal errors only when link is functional */
>>>> +	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
>>>> +		/* block commands at driver layer until error is handled */
>>>> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
>>> Locking omitted for ufshcd_state?
>> This is called in interrupt context with spin_lock held.
> Right, I missed it.
> 
>>
>>>
>>>> +		schedule_work(&hba->feh_workq);
>>>> +	}
>>>>    }
>>>>
>>>>    /**
>>>> @@ -2475,75 +2612,155 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int
>> task_id,
>>>>    }
>>>>
>>>>    /**
>>>> - * ufshcd_device_reset - reset device and abort all the pending commands
>>>> - * @cmd: SCSI command pointer
>>>> + * ufshcd_dme_end_point_reset - Notify device Unipro to perform reset
>>>> + * @hba: per adapter instance
>>>>     *
>>>> - * Returns SUCCESS/FAILED
>>>> + * UIC_CMD_DME_END_PT_RST resets the UFS device completely, the UFS flags,
>>>> + * attributes and descriptors are reset to default state. Callers are
>>>> + * expected to initialize the whole device again after this.
>>>> + *
>>>> + * Returns zero on success, non-zero on failure
>>>>     */
>>>> -static int ufshcd_device_reset(struct scsi_cmnd *cmd)
>>>> +static int ufshcd_dme_end_point_reset(struct ufs_hba *hba)
>>>>    {
>>>> -	struct Scsi_Host *host;
>>>> -	struct ufs_hba *hba;
>>>> -	unsigned int tag;
>>>> -	u32 pos;
>>>> -	int err;
>>>> -	u8 resp;
>>>> -	struct ufshcd_lrb *lrbp;
>>>> +	struct uic_command uic_cmd = {0};
>>>> +	int ret;
>>>>
>>>> -	host = cmd->device->host;
>>>> -	hba = shost_priv(host);
>>>> -	tag = cmd->request->tag;
>>>> +	uic_cmd.command = UIC_CMD_DME_END_PT_RST;
>>>>
>>>> -	lrbp = &hba->lrb[tag];
>>>> -	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>>>> -			UFS_LOGICAL_RESET, &resp);
>>>> -	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
>>>> -		err = FAILED;
>>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>>>> +	if (ret)
>>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_dme_reset - Local UniPro reset
>>>> + * @hba: per adapter instance
>>>> + *
>>>> + * Returns zero on success, non-zero on failure
>>>> + */
>>>> +static int ufshcd_dme_reset(struct ufs_hba *hba)
>>>> +{
>>>> +	struct uic_command uic_cmd = {0};
>>>> +	int ret;
>>>> +
>>>> +	uic_cmd.command = UIC_CMD_DME_RESET;
>>>> +
>>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>>>> +	if (ret)
>>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>>>> +
>>>> +	return ret;
>>>> +
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_dme_enable - Local UniPro DME Enable
>>>> + * @hba: per adapter instance
>>>> + *
>>>> + * Returns zero on success, non-zero on failure
>>>> + */
>>>> +static int ufshcd_dme_enable(struct ufs_hba *hba)
>>>> +{
>>>> +	struct uic_command uic_cmd = {0};
>>>> +	int ret;
>>>> +	uic_cmd.command = UIC_CMD_DME_ENABLE;
>>>> +
>>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>>>> +	if (ret)
>>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>>>> +
>>>> +	return ret;
>>>> +
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_device_reset_and_restore - reset and restore device
>>>> + * @hba: per-adapter instance
>>>> + *
>>>> + * Note that the device reset issues DME_END_POINT_RESET which
>>>> + * may reset entire device and restore device attributes to
>>>> + * default state.
>>>> + *
>>>> + * Returns zero on success, non-zero on failure
>>>> + */
>>>> +static int ufshcd_device_reset_and_restore(struct ufs_hba *hba)
>>>> +{
>>>> +	int err = 0;
>>>> +	u32 reg;
>>>> +
>>>> +	err = ufshcd_dme_end_point_reset(hba);
>>>> +	if (err)
>>>> +		goto out;
>>>> +
>>>> +	/* restore communication with the device */
>>>> +	err = ufshcd_dme_reset(hba);
>>>> +	if (err)
>>>>    		goto out;
>>>> -	} else {
>>>> -		err = SUCCESS;
>>>> -	}
>>>>
>>>> -	for (pos = 0; pos < hba->nutrs; pos++) {
>>>> -		if (test_bit(pos, &hba->outstanding_reqs) &&
>>>> -		    (hba->lrb[tag].lun == hba->lrb[pos].lun)) {
>>>> +	err = ufshcd_dme_enable(hba);
>>>> +	if (err)
>>>> +		goto out;
>>>>
>>>> -			/* clear the respective UTRLCLR register bit */
>>>> -			ufshcd_utrl_clear(hba, pos);
>>>> +	err = ufshcd_dme_link_startup(hba);
>>> UFS_LOGICAL_RESET is no more used?
>>
>> Yes, I don't see any use for this as of now (given that we are using
>> dme_end_point_reset, refer to figure. 7.4 of UFS 1.1 spec). Also, the
>> UFS spec. error handling section doesn't mention anything about
>> LOGICAL_RESET. If you know a valid use case where we need to have LUN
>> reset, please let me know I will bring it back.
> Referring to the SCSI mid-layer and other hosts' implementations,
> eh_device_reset_handler(= ufshcd_eh_device_reset_handler) may
> have the role of a LOGICAL_RESET for a specific LUN.

I am still not convinced that we need LOGICAL_RESET. Do we really need
it for UFS just because other SCSI host drivers have it?

> I found that ENDPOINT_RESET is recommended with IS.DFES in spec.

In this case, a command hang (SCSI timeout) is treated as a Device
Fatal Error. If a LUN fails, the response is still transferred, but
with a Unit Attention condition and sense data. However, if the command
itself hangs, something is seriously wrong with the device or the
communication, so we first try to reset the device and then the host.
Unlike most other SCSI HBAs, UFS is a point-to-point (host <--> device)
link; if something goes wrong and causes a hang, it is most likely a
serious error that a logical unit reset would not fix.


> 
> Let me add some comments additionally.
> Both 'ufshcd_eh_device_reset_handler' and 'ufshcd_host_reset_and_restore' do almost the same things.
> At a glance, their roles are confusing and mixed.
> 'ufshcd_reset_and_restore' is eventually called, which is the actual part of the reset functionality; once the device reset fails,
> the host reset is tried.
> Actually, that is already handled for each level of error recovery in the SCSI mid-layer. Please check 'drivers/scsi/scsi_error.c'.
> [scsi_eh_ready_devs, scsi_abort_eh_cmnd]
> In this stage, each reset functionality could be separated obviously.

Yes, in that case we are optimistically doing the host reset twice,
in the hope that the link recovers before the SCSI layer gives up and
marks the device as OFFLINE. If you think this shouldn't be the case and
have a valid reason for not doing so, I will return an appropriate error
when the device reset fails.
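
The fallback discussed above (try the device reset, and only if that
fails try the host reset) can be sketched in plain user-space C. This is
an illustration only, not the driver code: reset_fn,
reset_and_restore_sketch() and the two stubs are hypothetical names, and
the sketch assumes each reset step returns zero on success.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reset callback: returns 0 on success, non-zero on failure. */
typedef int (*reset_fn)(void *ctx);

/*
 * Try the device reset first and fall back to the host reset only when
 * the device reset fails. Returns 0 if either step succeeded.
 */
static int reset_and_restore_sketch(reset_fn device_reset,
				    reset_fn host_reset, void *ctx)
{
	int err;

	assert(device_reset != NULL && host_reset != NULL);

	err = device_reset(ctx);
	if (!err)
		return 0;	/* device reset was enough */

	/* device reset failed: escalate to the host reset */
	return host_reset(ctx);
}

/* Stubs for demonstration only. */
static int stub_reset_ok(void *ctx)
{
	(void)ctx;
	return 0;
}

static int stub_reset_fail(void *ctx)
{
	(void)ctx;
	return -1;
}
```

With the stubs, reset_and_restore_sketch(stub_reset_fail, stub_reset_ok,
NULL) returns 0, and it returns non-zero only when both steps fail.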

> 
>>
>>> ufshcd_device_reset_and_restore have a role of device reset.
>>> Both ufshcd_dme_reset and ufshcd_dme_enable are valid for local one, not for remote.
>>> Should we do those for host including link-startup here?
>>
>> Yes, it is needed. After DME_ENDPOINT_RESET the remote link goes into link down state.
> I would like to see the related description; I couldn't find it. Could you point me to it?

Please refer to "Table 121 DME_SAP restrictions" of the MIPI UniPro spec.
The spec doesn't state this explicitly, but here is the logic
derived from it:
1) DME_LINKSTARTUP can be sent only when the link is in the down state;
in all other states DME_LINKSTARTUP is ignored.
2) So sending DME_ENDPOINT_RESET must put the remote link into the down
state, so that it can receive the link startup and
establish communication.

> 
>> To initialize the link, the host needs to send
>> DME_LINKSTARTUP, but according to Uni-Pro spec. the link-startup can
>> only be sent when the local uni-pro is in link-down state. So first
> If what you mentioned above is right, the UniPro state is already link-down after DME_ENDPOINT_RESET.
> Then, DME_RESET isn't needed.

You are getting confused here -

- State1: Before sending DME_ENDPOINT_RESET
	Local UniPro (host) - Link-Up
	Remote UniPro (device) - Link-Up

- State2: After sending DME_ENDPOINT_RESET
	Local UniPro (host) - Link-Up
	Remote UniPro (device) - Link-Down

- State3: After sending DME_RESET+DME_ENABLE
	Local UniPro (host) - Link-Down
	Remote UniPro (device) - Link-Down

- State4: After sending DME_LINKSTARTUP
	Local UniPro (host) - Link-Up
	Remote UniPro (device) - Link-Up

The local unipro ignores the DME_LINKSTARTUP if we send it before
DME_RESET.
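
The four states above can be modelled as a toy state machine in
user-space C. This is only a sketch of the reasoning, not UniPro or
driver code: all sk_* names are hypothetical, and the requirement that
the remote side is also down before link startup succeeds is a
simplifying assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

enum sk_link_state { SK_LINK_DOWN, SK_LINK_UP };

struct sk_unipro_link {
	enum sk_link_state local;	/* host side */
	enum sk_link_state remote;	/* device side */
};

/* DME_ENDPOINT_RESET: only the remote (device) side drops to link-down. */
static void sk_dme_endpoint_reset(struct sk_unipro_link *l)
{
	assert(l);
	l->remote = SK_LINK_DOWN;
}

/* DME_RESET followed by DME_ENABLE: the local side ends in link-down. */
static void sk_dme_reset_enable(struct sk_unipro_link *l)
{
	assert(l);
	l->local = SK_LINK_DOWN;
}

/*
 * DME_LINKSTARTUP: ignored unless the local side is link-down. The model
 * also requires the remote side to be down (assumption of this sketch).
 * Returns true when the link was brought up.
 */
static bool sk_dme_linkstartup(struct sk_unipro_link *l)
{
	assert(l);
	if (l->local != SK_LINK_DOWN || l->remote != SK_LINK_DOWN)
		return false;	/* command ignored */
	l->local = SK_LINK_UP;
	l->remote = SK_LINK_UP;
	return true;
}
```

Calling sk_dme_linkstartup() right after sk_dme_endpoint_reset() returns
false in this model because the local side is still up; it succeeds only
once sk_dme_reset_enable() has been called in between.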

> 
>> we need to get the local unipro from link-up to disabled to link-down
>> using the DME_RESET and DME_ENABLE commands and then issue
>> DME_LINKSTARTUP to re-initialize the link.
> 'ufshcd_hba_enable' can be used instead of both if these are really needed.
> This will do dme_reset and dme_enable.
> 

The only reason is that in some implementations the HCE reset
also resets the UTP layer in addition to the UniPro layer. A device
reset does not need a UTP layer reset, so explicit DME_RESET and
DME_ENABLE are used. For implementations that don't reset the UTP
layer, the advantage is that instead of wasting CPU cycles polling for
HCE=1 we depend on UIC interrupts.


-- 
Regards,
Sujit

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling
  2013-07-23  8:34       ` Seungwon Jeon
@ 2013-07-23 15:41         ` Sujit Reddy Thumma
  2013-07-24 13:39           ` Seungwon Jeon
  0 siblings, 1 reply; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-23 15:41 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/23/2013 2:04 PM, Seungwon Jeon wrote:
> On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
>> On 7/19/2013 7:28 PM, Seungwon Jeon wrote:
>>> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
>>>> Error handling in UFS driver is broken and resets the host controller
>>>> for fatal errors without re-initialization. Correct the fatal error
>>>> handling sequence according to UFS Host Controller Interface (HCI)
>>>> v1.1 specification.
>>>>
>>>> o Upon determining fatal error condition the host controller may hang
>>>>     forever until a reset is applied, so just retrying the command doesn't
>>>>     work without a reset. So, the reset is applied in the driver context
>>>>     in a separate work and SCSI mid-layer isn't informed until reset is
>>>>     applied.
>>>>
>>>> o Processed requests which are completed without error are reported to
>>>>     SCSI layer as successful and any pending commands that are not started
>>>>     yet or are not cause of the error are re-queued into scsi midlayer queue.
>>>>     For the command that caused error, host controller or device is reset
>>>>     and DID_ERROR is returned for command retry after applying reset.
>>>>
>>>> o SCSI is informed about the expected Unit-Attentioni exception from the
>>> Attention'i',  typo.
>> Okay.
>>
>>>
>>>>     device for the immediate command after a reset so that the SCSI layer
>>>>     take necessary steps to establish communication with the device.
>>>>
>>>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>>>> ---
>>>>    drivers/scsi/ufs/ufshcd.c |  349 +++++++++++++++++++++++++++++++++++---------
>>>>    drivers/scsi/ufs/ufshcd.h |    2 +
>>>>    drivers/scsi/ufs/ufshci.h |   19 ++-
>>>>    3 files changed, 295 insertions(+), 75 deletions(-)
>>>>
>>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>>>> index b4c9910..2a3874f 100644
>>>> --- a/drivers/scsi/ufs/ufshcd.c
>>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>>> @@ -80,6 +80,14 @@ enum {
>>>>    	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>>>>    };
>>>>
>>>> +/* UFSHCD UIC layer error flags */
>>>> +enum {
>>>> +	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0), /* Data link layer error */
>>>> +	UFSHCD_UIC_NL_ERROR = (1 << 1), /* Network layer error */
>>>> +	UFSHCD_UIC_TL_ERROR = (1 << 2), /* Transport Layer error */
>>>> +	UFSHCD_UIC_DME_ERROR = (1 << 3), /* DME error */
>>>> +};
>>>> +
>>>>    /* Interrupt configuration options */
>>>>    enum {
>>>>    	UFSHCD_INT_DISABLE,
>>>> @@ -108,6 +116,7 @@ enum {
>>>>
>>>>    static void ufshcd_tmc_handler(struct ufs_hba *hba);
>>>>    static void ufshcd_async_scan(void *data, async_cookie_t cookie);
>>>> +static int ufshcd_reset_and_restore(struct ufs_hba *hba);
>>>>
>>>>    /*
>>>>     * ufshcd_wait_for_register - wait for register value to change
>>>> @@ -1605,9 +1614,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>>>>    		goto out;
>>>>    	}
>>>>
>>>> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>>>> -		scsi_unblock_requests(hba->host);
>>>> -
>>>>    out:
>>>>    	return err;
>>>>    }
>>>> @@ -1733,66 +1739,6 @@ static int ufshcd_validate_dev_connection(struct ufs_hba *hba)
>>>>    }
>>>>
>>>>    /**
>>>> - * ufshcd_do_reset - reset the host controller
>>>> - * @hba: per adapter instance
>>>> - *
>>>> - * Returns SUCCESS/FAILED
>>>> - */
>>>> -static int ufshcd_do_reset(struct ufs_hba *hba)
>>>> -{
>>>> -	struct ufshcd_lrb *lrbp;
>>>> -	unsigned long flags;
>>>> -	int tag;
>>>> -
>>>> -	/* block commands from midlayer */
>>>> -	scsi_block_requests(hba->host);
>>>> -
>>>> -	spin_lock_irqsave(hba->host->host_lock, flags);
>>>> -	hba->ufshcd_state = UFSHCD_STATE_RESET;
>>>> -
>>>> -	/* send controller to reset state */
>>>> -	ufshcd_hba_stop(hba);
>>>> -	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> -
>>>> -	/* abort outstanding commands */
>>>> -	for (tag = 0; tag < hba->nutrs; tag++) {
>>>> -		if (test_bit(tag, &hba->outstanding_reqs)) {
>>>> -			lrbp = &hba->lrb[tag];
>>>> -			if (lrbp->cmd) {
>>>> -				scsi_dma_unmap(lrbp->cmd);
>>>> -				lrbp->cmd->result = DID_RESET << 16;
>>>> -				lrbp->cmd->scsi_done(lrbp->cmd);
>>>> -				lrbp->cmd = NULL;
>>>> -				clear_bit_unlock(tag, &hba->lrb_in_use);
>>>> -			}
>>>> -		}
>>>> -	}
>>>> -
>>>> -	/* complete device management command */
>>>> -	if (hba->dev_cmd.complete)
>>>> -		complete(hba->dev_cmd.complete);
>>>> -
>>>> -	/* clear outstanding request/task bit maps */
>>>> -	hba->outstanding_reqs = 0;
>>>> -	hba->outstanding_tasks = 0;
>>>> -
>>>> -	/* Host controller enable */
>>>> -	if (ufshcd_hba_enable(hba)) {
>>>> -		dev_err(hba->dev,
>>>> -			"Reset: Controller initialization failed\n");
>>>> -		return FAILED;
>>>> -	}
>>>> -
>>>> -	if (ufshcd_link_startup(hba)) {
>>>> -		dev_err(hba->dev,
>>>> -			"Reset: Link start-up failed\n");
>>>> -		return FAILED;
>>>> -	}
>>>> -
>>>> -	return SUCCESS;
>>>> -}
>>>> -
>>>> -/**
>>>>     * ufshcd_slave_alloc - handle initial SCSI device configurations
>>>>     * @sdev: pointer to SCSI device
>>>>     *
>>>> @@ -1809,6 +1755,9 @@ static int ufshcd_slave_alloc(struct scsi_device *sdev)
>>>>    	sdev->use_10_for_ms = 1;
>>>>    	scsi_set_tag_type(sdev, MSG_SIMPLE_TAG);
>>>>
>>>> +	/* allow SCSI layer to restart the device in case of errors */
>>>> +	sdev->allow_restart = 1;
>>>> +
>>>>    	/*
>>>>    	 * Inform SCSI Midlayer that the LUN queue depth is same as the
>>>>    	 * controller queue depth. If a LUN queue depth is less than the
>>>> @@ -2013,6 +1962,9 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
>>>>    	case OCS_ABORTED:
>>>>    		result |= DID_ABORT << 16;
>>>>    		break;
>>>> +	case OCS_INVALID_COMMAND_STATUS:
>>>> +		result |= DID_REQUEUE << 16;
>>>> +		break;
>>>>    	case OCS_INVALID_CMD_TABLE_ATTR:
>>>>    	case OCS_INVALID_PRDT_ATTR:
>>>>    	case OCS_MISMATCH_DATA_BUF_SIZE:
>>>> @@ -2405,42 +2357,295 @@ out:
>>>>    	return err;
>>>>    }
>>>>
>>>> +static void ufshcd_decide_eh_xfer_req(struct ufs_hba *hba, u32 ocs)
>>>> +{
>>>> +	switch (ocs) {
>>>> +	case OCS_SUCCESS:
>>>> +	case OCS_INVALID_COMMAND_STATUS:
>>>> +		break;
>>>> +	case OCS_MISMATCH_DATA_BUF_SIZE:
>>>> +	case OCS_MISMATCH_RESP_UPIU_SIZE:
>>>> +	case OCS_PEER_COMM_FAILURE:
>>>> +	case OCS_FATAL_ERROR:
>>>> +	case OCS_ABORTED:
>>>> +	case OCS_INVALID_CMD_TABLE_ATTR:
>>>> +	case OCS_INVALID_PRDT_ATTR:
>>>> +		ufshcd_set_host_reset_pending(hba);
>>> Should host be reset on ocs error, including below ufshcd_decide_eh_task_req?
>>> It's just overall command status.
>>
>> Yes, the error handling section in the UFS 1.1 spec. mentions so.
> If a host reset is required, it should be allowed only in a fatal situation.
> Deciding based on the OCS field doesn't seem proper. There is no mention of that in the spec.
> If my information is wrong, please clarify.
> 

You can refer to section 8.3 of the HCI spec.
On fatal errors the controller h/w has to update the OCS field of
the command that caused the error and then raise a fatal error interrupt.
The s/w reads the OCS values, determines which commands are in error,
and then carries out the reset.
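
As a rough user-space sketch of that step (not the driver code): given
the doorbell register, the outstanding bitmap and the per-slot OCS
values, build a mask of the slots that completed with an error. The
function name and the SK_OCS_* constants are hypothetical, their numeric
values are assumptions of this sketch, and it assumes the doorbell only
has bits set for slots that are also outstanding.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative OCS codes; the numeric values are assumptions. */
enum {
	SK_OCS_SUCCESS = 0x0,
	SK_OCS_FATAL_ERROR = 0x7,
	SK_OCS_INVALID_COMMAND_STATUS = 0xF,
};

/*
 * Return a bitmask of slots that completed with an error OCS.
 * A slot is treated as completed when it is outstanding in software
 * but its doorbell bit has already been cleared by the hardware.
 */
static uint32_t error_autopsy_sketch(uint32_t doorbell, uint32_t outstanding,
				     const uint8_t *ocs, unsigned int nslots)
{
	uint32_t completed = doorbell ^ outstanding;
	uint32_t err_mask = 0;
	unsigned int i;

	assert(ocs && nslots <= 32);

	for (i = 0; i < nslots; i++) {
		if (!(completed & (UINT32_C(1) << i)))
			continue;	/* still pending or never issued */
		if (ocs[i] == SK_OCS_SUCCESS ||
		    ocs[i] == SK_OCS_INVALID_COMMAND_STATUS)
			continue;	/* not a cause of the error */
		err_mask |= UINT32_C(1) << i;
	}
	return err_mask;
}
```

Any slot in the returned mask is a command "in error" in the sense
above; whether that leads to a host reset is decided separately.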

>>
>>>
>>>> +		break;
>>>> +	default:
>>>> +		dev_err(hba->dev, "%s: unknown OCS 0x%x\n",
>>>> +				__func__, ocs);
>>>> +		BUG();
>>>> +	}
>>>> +}
>>>> +
>>>> +static void ufshcd_decide_eh_task_req(struct ufs_hba *hba, u32 ocs)
>>>> +{
>>>> +	switch (ocs) {
>>>> +	case OCS_TMR_SUCCESS:
>>>> +	case OCS_TMR_INVALID_COMMAND_STATUS:
>>>> +		break;
>>>> +	case OCS_TMR_MISMATCH_REQ_SIZE:
>>>> +	case OCS_TMR_MISMATCH_RESP_SIZE:
>>>> +	case OCS_TMR_PEER_COMM_FAILURE:
>>>> +	case OCS_TMR_INVALID_ATTR:
>>>> +	case OCS_TMR_ABORTED:
>>>> +	case OCS_TMR_FATAL_ERROR:
>>>> +		ufshcd_set_host_reset_pending(hba);
>>>> +		break;
>>>> +	default:
>>>> +		dev_err(hba->dev, "%s: uknown TMR OCS 0x%x\n",
>>>> +				__func__, ocs);
>>>> +		BUG();
>>>> +	}
>>>> +}
>>>> +
>>>>    /**
>>>> - * ufshcd_fatal_err_handler - handle fatal errors
>>>> + * ufshcd_error_autopsy_transfer_req() - reads OCS field of failed command and
>>>> + *                          decide error handling
>>>>     * @hba: per adapter instance
>>>> + * @err_xfer: bit mask for transfer request errors
>>>> + *
>>>> + * Iterate over completed transfer requests and
>>>> + * set error handling flags.
>>>> + */
>>>> +static void
>>>> +ufshcd_error_autopsy_transfer_req(struct ufs_hba *hba, u32 *err_xfer)
>>>> +{
>>>> +	unsigned long completed;
>>>> +	u32 doorbell;
>>>> +	int index;
>>>> +	int ocs;
>>>> +
>>>> +	if (!err_xfer)
>>>> +		goto out;
>>>> +
>>>> +	doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
>>>> +	completed = doorbell ^ (u32)hba->outstanding_reqs;
>>>> +
>>>> +	for (index = 0; index < hba->nutrs; index++) {
>>>> +		if (test_bit(index, &completed)) {
>>>> +			ocs = ufshcd_get_tr_ocs(&hba->lrb[index]);
>>>> +			if ((ocs == OCS_SUCCESS) ||
>>>> +					(ocs == OCS_INVALID_COMMAND_STATUS))
>>>> +				continue;
>>>> +
>>>> +			*err_xfer |= (1 << index);
>>>> +			ufshcd_decide_eh_xfer_req(hba, ocs);
>>>> +		}
>>>> +	}
>>>> +out:
>>>> +	return;
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_error_autopsy_task_req() - reads OCS field of failed command and
>>>> + *                          decide error handling
>>>> + * @hba: per adapter instance
>>>> + * @err_tm: bit mask for task management errors
>>>> + *
>>>> + * Iterate over completed task management requests and
>>>> + * set error handling flags.
>>>> + */
>>>> +static void
>>>> +ufshcd_error_autopsy_task_req(struct ufs_hba *hba, u32 *err_tm)
>>>> +{
>>>> +	unsigned long completed;
>>>> +	u32 doorbell;
>>>> +	int index;
>>>> +	int ocs;
>>>> +
>>>> +	if (!err_tm)
>>>> +		goto out;
>>>> +
>>>> +	doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
>>>> +	completed = doorbell ^ (u32)hba->outstanding_tasks;
>>>> +
>>>> +	for (index = 0; index < hba->nutmrs; index++) {
>>>> +		if (test_bit(index, &completed)) {
>>>> +			struct utp_task_req_desc *tm_descp;
>>>> +
>>>> +			tm_descp = hba->utmrdl_base_addr;
>>>> +			ocs = ufshcd_get_tmr_ocs(&tm_descp[index]);
>>>> +			if ((ocs == OCS_TMR_SUCCESS) ||
>>>> +					(ocs == OCS_TMR_INVALID_COMMAND_STATUS))
>>>> +				continue;
>>>> +
>>>> +			*err_tm |= (1 << index);
>>>> +			ufshcd_decide_eh_task_req(hba, ocs);
>>>> +		}
>>>> +	}
>>>> +
>>>> +out:
>>>> +	return;
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_fatal_err_handler - handle fatal errors
>>>> + * @work: pointer to work structure
>>>>     */
>>>>    static void ufshcd_fatal_err_handler(struct work_struct *work)
>>>>    {
>>>>    	struct ufs_hba *hba;
>>>> +	unsigned long flags;
>>>> +	u32 err_xfer = 0;
>>>> +	u32 err_tm = 0;
>>>> +	int err;
>>>> +
>>>>    	hba = container_of(work, struct ufs_hba, feh_workq);
>>>>
>>>>    	pm_runtime_get_sync(hba->dev);
>>>> -	/* check if reset is already in progress */
>>>> -	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
>>>> -		ufshcd_do_reset(hba);
>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>> +	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
>>>> +		/* complete processed requests and exit */
>>>> +		ufshcd_transfer_req_compl(hba);
>>>> +		ufshcd_tmc_handler(hba);
>>>> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> +		pm_runtime_put_sync(hba->dev);
>>>> +		return;
>>> Host driver is here with finishing 'scsi_block_requests'.
>>> 'scsi_unblock_requests' can be called somewhere?
>>
>> No, but it can be possible that SCSI command timeout which triggers
>> device/host reset and fatal error handler race each other.
> Sorry, I didn't get your meaning exactly.
> I saw that scsi_block_requests is done before ufshcd_fatal_err_handler is scheduled.
> If device or host was requested from scsi mid-layer just before ufshcd_fatal_err_handler,
> ufshcd_fatal_err_handler will be out through if statement. Then, there is nowhere to call scsi_unblock_requests
> though device/host reset is done successfully.

You are right, this path should call scsi_unblock_requests() before
returning. There is also no need to complete the processed requests
here, as we might be in the middle of something else while the RESET is
in progress.


>>
>>>
>>>> +	}
>>>> +
>>>> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
>>>> +	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
>>>> +	ufshcd_error_autopsy_task_req(hba, &err_tm);
>>>> +
>>>> +	/*
>>>> +	 * Complete successful and pending transfer requests.
>>>> +	 * DID_REQUEUE is returned for pending requests as they have
>>>> +	 * nothing to do with error'ed request and SCSI layer should
>>>> +	 * not treat them as errors and decrement retry count.
>>>> +	 */
>>>> +	hba->outstanding_reqs &= ~err_xfer;
>>>> +	ufshcd_transfer_req_compl(hba);
>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> +	ufshcd_complete_pending_reqs(hba);
>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>> +	hba->outstanding_reqs |= err_xfer;
>>> Hmm... error handling seems so complicated.
>>> To simplify it, how about below?
>>>
>>> 1. If requests(transfer or task management) are completed, finish them with success/failure.
>> This is what we are trying to do above.
>>
>>> 2. If there are pending requests, abort them.
>> No, if a fatal error is occurred it is possible that host controller is
>> freez'ed we are not sure if it can take task management commands and
>> execute them.
> I meant aborting the requests by clearing the corresponding UTRLCLR/UTMRLCLR bits.
> 

I am doing the same in this patch -
1) Return the successful commands to SCSI.
2) Clear the pending commands (those that are not the cause of the
error) by writing to the UTRLCLR/UTMRLCLR registers. These complete
with scsi_host_result = DID_REQUEUE.
3) Reset and return the commands that "caused error" to SCSI with
DID_ERROR.

Am I doing anything beyond what you have suggested?
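
The three cases above can be expressed as a per-slot classification. The
following is a user-space sketch only: triage_slot() and the SK_* result
names are hypothetical stand-ins for the SCSI host byte values, and
err_mask is assumed to hold the slots that completed with an error OCS.

```c
#include <assert.h>
#include <stdint.h>

enum sk_result {
	SK_NOT_ISSUED,	/* slot was never handed to the controller */
	SK_DID_OK,	/* completed without error */
	SK_DID_REQUEUE,	/* pending, not the cause: cleared and requeued */
	SK_DID_ERROR,	/* caused the error: returned after the reset */
};

/* Classify one request slot according to the three steps above. */
static enum sk_result triage_slot(uint32_t outstanding, uint32_t doorbell,
				  uint32_t err_mask, unsigned int slot)
{
	uint32_t bit;

	assert(slot < 32);
	bit = UINT32_C(1) << slot;

	if (!(outstanding & bit))
		return SK_NOT_ISSUED;
	if (err_mask & bit)
		return SK_DID_ERROR;
	if (doorbell & bit)
		return SK_DID_REQUEUE;	/* hardware had not finished it */
	return SK_DID_OK;
}
```

For example, with outstanding = 0xF, doorbell = 0x8 and err_mask = 0x2,
slot 0 is returned as successful, slot 1 is returned with an error after
the reset, and slot 3 is requeued.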

>>
>>> 3. If fatal error, reset.
>>>
>>
>>
>>>> +
>>>> +	/* Complete successful and pending task requests */
>>>> +	hba->outstanding_tasks &= ~err_tm;
>>>> +	ufshcd_tmc_handler(hba);
>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> +	ufshcd_complete_pending_tasks(hba);
>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>> +
>>>> +	hba->outstanding_tasks |= err_tm;
>>>> +
>>>> +	/*
>>>> +	 * Controller may generate multiple fatal errors, handle
>>>> +	 * errors based on severity.
>>>> +	 * 1) DEVICE_FATAL_ERROR
>>>> +	 * 2) SYSTEM_BUS/CONTROLLER_FATAL_ERROR
>>>> +	 * 3) UIC_ERROR
>>>> +	 */
>>>> +	if (hba->errors & DEVICE_FATAL_ERROR) {
>>>> +		/*
>>>> +		 * Some HBAs may not clear UTRLDBR/UTMRLDBR or update
>>>> +		 * OCS field on device fatal error.
>>>> +		 */
>>>> +		ufshcd_set_host_reset_pending(hba);
>>> In DEVICE_FATAL_ERROR, ufshcd_device_reset_pending is right?
>>
>> It looks so, but the spec. mentions to reset the host as well (8.3.6).
> Are you pointing to the text below?
> [8.3.6. Device Errors are fatal errors. ...the host software shall reset the device too.]
> 

I meant "8.3.6: When this condition occurs, host software shall follow
the same procedure for UIC error handling as described in 8.2.2,". The
spec has an error here: UIC error handling is in 8.3.2, not 8.2.2.
So, going by 8.3.2, HCE needs to be toggled.



>>
>>>
>>>> +	} else if (hba->errors & (SYSTEM_BUS_FATAL_ERROR |
>>>> +			CONTROLLER_FATAL_ERROR)) {
>>>> +		/* eh flags should be set in err autopsy based on OCS values */
>>>> +		if (!hba->eh_flags)
>>>> +			WARN(1, "%s: fatal error without error handling\n",
>>>> +				dev_name(hba->dev));
>>>> +	} else if (hba->errors & UIC_ERROR) {
>>>> +		if (hba->uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR) {
>>>> +			/* fatal error - reset controller */
>>>> +			ufshcd_set_host_reset_pending(hba);
>>>> +		} else if (hba->uic_error & (UFSHCD_UIC_NL_ERROR |
>>>> +					UFSHCD_UIC_TL_ERROR |
>>>> +					UFSHCD_UIC_DME_ERROR)) {
>>>> +			/* non-fatal, report error to SCSI layer */
>>>> +			if (!hba->eh_flags) {
>>>> +				spin_unlock_irqrestore(
>>>> +						hba->host->host_lock, flags);
>>>> +				ufshcd_complete_pending_reqs(hba);
>>>> +				ufshcd_complete_pending_tasks(hba);
>>>> +				spin_lock_irqsave(hba->host->host_lock, flags);
>>>> +			}
>>>> +		}
>>>> +	}
>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>> +
>>>> +	if (hba->eh_flags) {
>>>> +		err = ufshcd_reset_and_restore(hba);
>>>> +		if (err) {
>>>> +			ufshcd_clear_host_reset_pending(hba);
>>>> +			ufshcd_clear_device_reset_pending(hba);
>>>> +			dev_err(hba->dev, "%s: reset and restore failed\n",
>>>> +					__func__);
>>>> +			hba->ufshcd_state = UFSHCD_STATE_ERROR;
>>>> +		}
>>>> +		/*
>>>> +		 * Inform scsi mid-layer that we did reset and allow to handle
>>>> +		 * Unit Attention properly.
>>>> +		 */
>>>> +		scsi_report_bus_reset(hba->host, 0);
>>>> +		hba->errors = 0;
>>>> +		hba->uic_error = 0;
>>>> +	}
>>>> +	scsi_unblock_requests(hba->host);
>>>>    	pm_runtime_put_sync(hba->dev);
>>>>    }
>>>>
>>>>    /**
>>>> - * ufshcd_err_handler - Check for fatal errors
>>>> - * @work: pointer to a work queue structure
>>>> + * ufshcd_update_uic_error - check and set fatal UIC error flags.
>>>> + * @hba: per-adapter instance
>>>>     */
>>>> -static void ufshcd_err_handler(struct ufs_hba *hba)
>>>> +static void ufshcd_update_uic_error(struct ufs_hba *hba)
>>>>    {
>>>>    	u32 reg;
>>>>
>>>> +	/* PA_INIT_ERROR is fatal and needs UIC reset */
>>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
>>>> +	if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
>>>> +		hba->uic_error |= UFSHCD_UIC_DL_PA_INIT_ERROR;
>>>> +
>>>> +	/* UIC NL/TL/DME errors needs software retry */
>>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_NETWORK_LAYER);
>>>> +	if (reg)
>>>> +		hba->uic_error |= UFSHCD_UIC_NL_ERROR;
>>>> +
>>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_TRANSPORT_LAYER);
>>>> +	if (reg)
>>>> +		hba->uic_error |= UFSHCD_UIC_TL_ERROR;
>>>> +
>>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DME);
>>>> +	if (reg)
>>>> +		hba->uic_error |= UFSHCD_UIC_DME_ERROR;
>>> REG_UIC_ERROR_CODE_PHY_ADAPTER_LAYER is not handled.
>>
>> The UFS spec. says it is a non-fatal error: the UIC recovers
>> by itself and doesn't need software intervention.
> Ok.
> 
>>
>>>
>>>> +
>>>> +	dev_dbg(hba->dev, "%s: UIC error flags = 0x%08x\n",
>>>> +			__func__, hba->uic_error);
>>>> +}
>>>> +
>>>> +/**
>>>> + * ufshcd_err_handler - Check for fatal errors
>>>> + * @hba: per-adapter instance
>>>> + */
>>>> +static void ufshcd_err_handler(struct ufs_hba *hba)
>>>> +{
>>>>    	if (hba->errors & INT_FATAL_ERRORS)
>>>>    		goto fatal_eh;
>>>>
>>>>    	if (hba->errors & UIC_ERROR) {
>>>> -		reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
>>>> -		if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
>>>> +		hba->uic_error = 0;
>>>> +		ufshcd_update_uic_error(hba);
>>>> +		if (hba->uic_error)
>>> Except UFSHCD_UIC_DL_PA_INIT_ERROR, it's not fatal. Should it go to fatal_eh?
>>
>> Please see the UIC error handling in ufshcd_fatal_err_handler(), others
>> need software intervention so I combined it with fatal_eh to complete
>> the requests and report to SCSI.
> Gathering all error handling (fatal and non-fatal) into the original handler makes it confusing.
> In that case, it would be better to rename ufshcd_fatal_err_handler.
> 

Yeah, ufshcd_err_handler would be the apt name, but it is already taken.
Would the following renames be fine?
ufshcd_err_handler -> ufshcd_check_errors
ufshcd_fatal_err_handler -> ufshcd_err_handler


-- 
Regards,
Sujit


* RE: [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods
  2013-07-23 15:40         ` Sujit Reddy Thumma
@ 2013-07-24 13:39           ` Seungwon Jeon
  2013-07-29  9:45             ` Sujit Reddy Thumma
  0 siblings, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-24 13:39 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma'
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On Wed, July 24, 2013, Sujit Reddy Thumma wrote:
> On 7/23/2013 1:57 PM, Seungwon Jeon wrote:
> > On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
> >> On 7/19/2013 7:27 PM, Seungwon Jeon wrote:
> >>> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
> >>>> As of now SCSI initiated error handling is broken because,
> >>>> the reset APIs don't try to bring back the device initialized and
> >>>> ready for further transfers.
> >>>>
> >>>> In case of timeouts, the scsi error handler takes care of handling aborts
> >>>> and resets. Improve the error handling in such scenario by resetting the
> >>>> device and host and re-initializing them in proper manner.
> >>>>
> >>>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> >>>> ---
> >>>>    drivers/scsi/ufs/ufshcd.c |  467 +++++++++++++++++++++++++++++++++++++++------
> >>>>    drivers/scsi/ufs/ufshcd.h |    2 +
> >>>>    2 files changed, 411 insertions(+), 58 deletions(-)
> >>>>
> >>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> >>>> index 51ce096..b4c9910 100644
> >>>> --- a/drivers/scsi/ufs/ufshcd.c
> >>>> +++ b/drivers/scsi/ufs/ufshcd.c
> >>>> @@ -69,9 +69,15 @@ enum {
> >>>>
> >>>>    /* UFSHCD states */
> >>>>    enum {
> >>>> -	UFSHCD_STATE_OPERATIONAL,
> >>>>    	UFSHCD_STATE_RESET,
> >>>>    	UFSHCD_STATE_ERROR,
> >>>> +	UFSHCD_STATE_OPERATIONAL,
> >>>> +};
> >>>> +
> >>>> +/* UFSHCD error handling flags */
> >>>> +enum {
> >>>> +	UFSHCD_EH_HOST_RESET_PENDING = (1 << 0),
> >>>> +	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
> >>>>    };
> >>>>
> >>>>    /* Interrupt configuration options */
> >>>> @@ -87,6 +93,22 @@ enum {
> >>>>    	INT_AGGR_CONFIG,
> >>>>    };
> >>>>
> >>>> +#define ufshcd_set_device_reset_pending(h) \
> >>>> +	(h->eh_flags |= UFSHCD_EH_DEVICE_RESET_PENDING)
> >>>> +#define ufshcd_set_host_reset_pending(h) \
> >>>> +	(h->eh_flags |= UFSHCD_EH_HOST_RESET_PENDING)
> >>>> +#define ufshcd_device_reset_pending(h) \
> >>>> +	(h->eh_flags & UFSHCD_EH_DEVICE_RESET_PENDING)
> >>>> +#define ufshcd_host_reset_pending(h) \
> >>>> +	(h->eh_flags & UFSHCD_EH_HOST_RESET_PENDING)
> >>>> +#define ufshcd_clear_device_reset_pending(h) \
> >>>> +	(h->eh_flags &= ~UFSHCD_EH_DEVICE_RESET_PENDING)
> >>>> +#define ufshcd_clear_host_reset_pending(h) \
> >>>> +	(h->eh_flags &= ~UFSHCD_EH_HOST_RESET_PENDING)
> >>>> +
> >>>> +static void ufshcd_tmc_handler(struct ufs_hba *hba);
> >>>> +static void ufshcd_async_scan(void *data, async_cookie_t cookie);
> >>>> +
> >>>>    /*
> >>>>     * ufshcd_wait_for_register - wait for register value to change
> >>>>     * @hba - per-adapter interface
> >>>> @@ -851,9 +873,22 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
> >>>>
> >>>>    	tag = cmd->request->tag;
> >>>>
> >>>> -	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL) {
> >>>> +	switch (hba->ufshcd_state) {
> >>> Isn't a lock needed for ufshcd_state?
> > Please check?
> 
> Yes, it is needed. Thanks for catching this.
> 
> >
> >>>
> >>>> +	case UFSHCD_STATE_OPERATIONAL:
> >>>> +		break;
> >>>> +	case UFSHCD_STATE_RESET:
> >>>>    		err = SCSI_MLQUEUE_HOST_BUSY;
> >>>>    		goto out;
> >>>> +	case UFSHCD_STATE_ERROR:
> >>>> +		set_host_byte(cmd, DID_ERROR);
> >>>> +		cmd->scsi_done(cmd);
> >>>> +		goto out;
> >>>> +	default:
> >>>> +		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
> >>>> +				__func__, hba->ufshcd_state);
> >>>> +		set_host_byte(cmd, DID_BAD_TARGET);
> >>>> +		cmd->scsi_done(cmd);
> >>>> +		goto out;
> >>>>    	}
> >>>>
> >>>>    	/* acquire the tag to make sure device cmds don't use it */
> >>>> @@ -1573,8 +1608,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
> >>>>    	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
> >>>>    		scsi_unblock_requests(hba->host);
> >>>>
> >>>> -	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
> >>>> -
> >>>>    out:
> >>>>    	return err;
> >>>>    }
> >>>> @@ -2273,6 +2306,106 @@ out:
> >>>>    }
> >>>>
> >>>>    /**
> >>>> + * ufshcd_utrl_is_rsr_enabled - check if run-stop register is enabled
> >>>> + * @hba: per-adapter instance
> >>>> + */
> >>>> +static bool ufshcd_utrl_is_rsr_enabled(struct ufs_hba *hba)
> >>>> +{
> >>>> +	return ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_LIST_RUN_STOP) & 0x1;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_utmrl_is_rsr_enabled - check if run-stop register is enabled
> >>>> + * @hba: per-adapter instance
> >>>> + */
> >>>> +static bool ufshcd_utmrl_is_rsr_enabled(struct ufs_hba *hba)
> >>>> +{
> >>>> +	return ufshcd_readl(hba, REG_UTP_TASK_REQ_LIST_RUN_STOP) & 0x1;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_complete_pending_tasks - complete outstanding tasks
> >>>> + * @hba: per adapter instance
> >>>> + *
> >>>> + * Abort in-progress task management commands and wakeup
> >>>> + * waiting threads.
> >>>> + *
> >>>> + * Returns non-zero error value when failed to clear all the commands.
> >>>> + */
> >>>> +static int ufshcd_complete_pending_tasks(struct ufs_hba *hba)
> >>>> +{
> >>>> +	u32 reg;
> >>>> +	int err = 0;
> >>>> +	unsigned long flags;
> >>>> +
> >>>> +	if (!hba->outstanding_tasks)
> >>>> +		goto out;
> >>>> +
> >>>> +	/* Clear UTMRL only when run-stop is enabled */
> >>>> +	if (ufshcd_utmrl_is_rsr_enabled(hba))
> >>>> +		ufshcd_writel(hba, ~hba->outstanding_tasks,
> >>>> +				REG_UTP_TASK_REQ_LIST_CLEAR);
> >>>> +
> >>>> +	/* poll for max. 1 sec to clear door bell register by h/w */
> >>>> +	reg = ufshcd_wait_for_register(hba,
> >>>> +			REG_UTP_TASK_REQ_DOOR_BELL,
> >>>> +			hba->outstanding_tasks, 0, 1000, 1000);
> >>>> +	if (reg & hba->outstanding_tasks)
> >>>> +		err = -ETIMEDOUT;
> >>>> +
> >>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >>>> +	/* complete commands that were cleared out */
> >>>> +	ufshcd_tmc_handler(hba);
> >>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >>>> +out:
> >>>> +	if (err)
> >>>> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
> >>>> +				__func__, reg);
> >>>> +	return err;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_complete_pending_reqs - complete outstanding requests
> >>>> + * @hba: per adapter instance
> >>>> + *
> >>>> + * Abort in-progress transfer request commands and return them to SCSI.
> >>>> + *
> >>>> + * Returns non-zero error value when failed to clear all the commands.
> >>>> + */
> >>>> +static int ufshcd_complete_pending_reqs(struct ufs_hba *hba)
> >>>> +{
> >>>> +	u32 reg;
> >>>> +	int err = 0;
> >>>> +	unsigned long flags;
> >>>> +
> >>>> +	/* check if we completed all of them */
> >>>> +	if (!hba->outstanding_reqs)
> >>>> +		goto out;
> >>>> +
> >>>> +	/* Clear UTRL only when run-stop is enabled */
> >>>> +	if (ufshcd_utrl_is_rsr_enabled(hba))
> >>>> +		ufshcd_writel(hba, ~hba->outstanding_reqs,
> >>>> +				REG_UTP_TRANSFER_REQ_LIST_CLEAR);
> >>>> +
> >>>> +	/* poll for max. 1 sec to clear door bell register by h/w */
> >>>> +	reg = ufshcd_wait_for_register(hba,
> >>>> +			REG_UTP_TRANSFER_REQ_DOOR_BELL,
> >>>> +			hba->outstanding_reqs, 0, 1000, 1000);
> >>>> +	if (reg & hba->outstanding_reqs)
> >>>> +		err = -ETIMEDOUT;
> >>>> +
> >>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >>>> +	/* complete commands that were cleared out */
> >>>> +	ufshcd_transfer_req_compl(hba);
> >>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >>>> +out:
> >>>> +	if (err)
> >>>> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
> >>>> +				__func__, reg);
> >>>> +	return err;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>>     * ufshcd_fatal_err_handler - handle fatal errors
> >>>>     * @hba: per adapter instance
> >>>>     */
> >>>> @@ -2306,8 +2439,12 @@ static void ufshcd_err_handler(struct ufs_hba *hba)
> >>>>    	}
> >>>>    	return;
> >>>>    fatal_eh:
> >>>> -	hba->ufshcd_state = UFSHCD_STATE_ERROR;
> >>>> -	schedule_work(&hba->feh_workq);
> >>>> +	/* handle fatal errors only when link is functional */
> >>>> +	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
> >>>> +		/* block commands at driver layer until error is handled */
> >>>> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
> >>> Locking omitted for ufshcd_state?
> >> This is called in interrupt context with spin_lock held.
> > Right, I missed it.
> >
> >>
> >>>
> >>>> +		schedule_work(&hba->feh_workq);
> >>>> +	}
> >>>>    }
> >>>>
> >>>>    /**
> >>>> @@ -2475,75 +2612,155 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
> >>>>    }
> >>>>
> >>>>    /**
> >>>> - * ufshcd_device_reset - reset device and abort all the pending commands
> >>>> - * @cmd: SCSI command pointer
> >>>> + * ufshcd_dme_end_point_reset - Notify device Unipro to perform reset
> >>>> + * @hba: per adapter instance
> >>>>     *
> >>>> - * Returns SUCCESS/FAILED
> >>>> + * UIC_CMD_DME_END_PT_RST resets the UFS device completely, the UFS flags,
> >>>> + * attributes and descriptors are reset to default state. Callers are
> >>>> + * expected to initialize the whole device again after this.
> >>>> + *
> >>>> + * Returns zero on success, non-zero on failure
> >>>>     */
> >>>> -static int ufshcd_device_reset(struct scsi_cmnd *cmd)
> >>>> +static int ufshcd_dme_end_point_reset(struct ufs_hba *hba)
> >>>>    {
> >>>> -	struct Scsi_Host *host;
> >>>> -	struct ufs_hba *hba;
> >>>> -	unsigned int tag;
> >>>> -	u32 pos;
> >>>> -	int err;
> >>>> -	u8 resp;
> >>>> -	struct ufshcd_lrb *lrbp;
> >>>> +	struct uic_command uic_cmd = {0};
> >>>> +	int ret;
> >>>>
> >>>> -	host = cmd->device->host;
> >>>> -	hba = shost_priv(host);
> >>>> -	tag = cmd->request->tag;
> >>>> +	uic_cmd.command = UIC_CMD_DME_END_PT_RST;
> >>>>
> >>>> -	lrbp = &hba->lrb[tag];
> >>>> -	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
> >>>> -			UFS_LOGICAL_RESET, &resp);
> >>>> -	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
> >>>> -		err = FAILED;
> >>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> >>>> +	if (ret)
> >>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_dme_reset - Local UniPro reset
> >>>> + * @hba: per adapter instance
> >>>> + *
> >>>> + * Returns zero on success, non-zero on failure
> >>>> + */
> >>>> +static int ufshcd_dme_reset(struct ufs_hba *hba)
> >>>> +{
> >>>> +	struct uic_command uic_cmd = {0};
> >>>> +	int ret;
> >>>> +
> >>>> +	uic_cmd.command = UIC_CMD_DME_RESET;
> >>>> +
> >>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> >>>> +	if (ret)
> >>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> >>>> +
> >>>> +	return ret;
> >>>> +
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_dme_enable - Local UniPro DME Enable
> >>>> + * @hba: per adapter instance
> >>>> + *
> >>>> + * Returns zero on success, non-zero on failure
> >>>> + */
> >>>> +static int ufshcd_dme_enable(struct ufs_hba *hba)
> >>>> +{
> >>>> +	struct uic_command uic_cmd = {0};
> >>>> +	int ret;
> >>>> +	uic_cmd.command = UIC_CMD_DME_ENABLE;
> >>>> +
> >>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
> >>>> +	if (ret)
> >>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
> >>>> +
> >>>> +	return ret;
> >>>> +
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_device_reset_and_restore - reset and restore device
> >>>> + * @hba: per-adapter instance
> >>>> + *
> >>>> + * Note that the device reset issues DME_END_POINT_RESET which
> >>>> + * may reset entire device and restore device attributes to
> >>>> + * default state.
> >>>> + *
> >>>> + * Returns zero on success, non-zero on failure
> >>>> + */
> >>>> +static int ufshcd_device_reset_and_restore(struct ufs_hba *hba)
> >>>> +{
> >>>> +	int err = 0;
> >>>> +	u32 reg;
> >>>> +
> >>>> +	err = ufshcd_dme_end_point_reset(hba);
> >>>> +	if (err)
> >>>> +		goto out;
> >>>> +
> >>>> +	/* restore communication with the device */
> >>>> +	err = ufshcd_dme_reset(hba);
> >>>> +	if (err)
> >>>>    		goto out;
> >>>> -	} else {
> >>>> -		err = SUCCESS;
> >>>> -	}
> >>>>
> >>>> -	for (pos = 0; pos < hba->nutrs; pos++) {
> >>>> -		if (test_bit(pos, &hba->outstanding_reqs) &&
> >>>> -		    (hba->lrb[tag].lun == hba->lrb[pos].lun)) {
> >>>> +	err = ufshcd_dme_enable(hba);
> >>>> +	if (err)
> >>>> +		goto out;
> >>>>
> >>>> -			/* clear the respective UTRLCLR register bit */
> >>>> -			ufshcd_utrl_clear(hba, pos);
> >>>> +	err = ufshcd_dme_link_startup(hba);
> >>> UFS_LOGICAL_RESET is no more used?
> >>
> >> Yes, I don't see any use for this as of now (given that we are using
> >> dme_end_point_reset, refer to figure. 7.4 of UFS 1.1 spec). Also, the
> >> UFS spec. error handling section doesn't mention anything about
> >> LOGICAL_RESET. If you know a valid use case where we need to have LUN
> >> reset, please let me know I will bring it back.
> > Looking at the SCSI mid-layer and other hosts' implementations,
> > eh_device_reset_handler (= ufshcd_eh_device_reset_handler) may
> > have the role of LOGICAL_RESET for a specific LUN.
> 
> I am still not convinced why we need LOGICAL_RESET. Just because other
> SCSI host drivers have it do we really need it for UFS?
> 
> > I found that ENDPOINT_RESET is recommended with IS.DFES in spec.
> 
> Here in this case, a command hang (scsi timeout) is considered as Device
> Fatal Error. If there are some LUN failures the response would still be
> transferred but with Unit-Attention condition with sense data. However,
> if the command itself hangs, there is something seriously wrong with the
> device or the communication. So we first try to reset the device and
> then the host. Unlike most other SCSI HBAs, UFS is a point-to-point
> (host <--> device) link; if something goes wrong and causes a hang,
> it is most likely a serious error and a logical unit reset wouldn't
> help much.
As long as UFS follows the SAM-5 model, LOGICAL_RESET should be considered.
As I see it, LOGICAL_RESET would be handled in 'eh_device_reset_handler',
and the actual device reset looks closer to 'eh_target_reset_handler'.

> 
> 
> >
> > Let me add some more comments.
> > Both 'ufshcd_eh_device_reset_handler' and 'ufshcd_host_reset_and_restore' do almost the same thing.
> > At a glance, their roles are mixed up and confusing.
> > 'ufshcd_reset_and_restore', which is the actual reset functionality, is eventually called; once
> > the device reset fails, the host reset is tried.
> > Actually, that is already handled at each level of error recovery in the SCSI mid-layer. Please check
> > 'drivers/scsi/scsi_error.c'.
> > [scsi_eh_ready_devs, scsi_abort_eh_cmnd]
> > At this stage, each reset function could be clearly separated.
> 
> Yes, in that case we are optimistically doing the host reset twice,
> in the hope that it recovers before the SCSI layer chokes and marks the
> device as OFFLINE. If you think that this shouldn't be the case and
> have a valid reason for not doing so, I will return an appropriate error
> when the device reset fails.
The two are much the same, actually.
To simplify the host driver, it would be better to leave this to the SCSI mid-layer.
After all, the callback functions are driven by the upper layer.
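
For illustration only, the escalation being discussed can be modelled in
user space roughly as follows. This is a minimal sketch, not kernel code:
struct mock_hba and all mock_* names are hypothetical, the bus-reset level
of the real mid-layer is left out, and each level simply reports whether it
would have recovered the device. The point is that each handler does one
job and returns, and the caller decides whether to escalate.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes modelled on the mid-layer's SUCCESS/FAILED convention. */
enum mock_eh_result { MOCK_EH_SUCCESS, MOCK_EH_FAILED };

/* Hypothetical stand-in for the state a real host driver would keep. */
struct mock_hba {
	int lun_reset_works;    /* would a logical unit reset recover? */
	int target_reset_works; /* would a device (target) reset recover? */
	int host_reset_works;   /* would a host controller reset recover? */
};

typedef enum mock_eh_result (*mock_eh_handler_t)(struct mock_hba *hba);

/* Level 0: plays the role of eh_device_reset_handler (LUN reset). */
static enum mock_eh_result mock_eh_device_reset(struct mock_hba *hba)
{
	return hba->lun_reset_works ? MOCK_EH_SUCCESS : MOCK_EH_FAILED;
}

/* Level 1: plays the role of eh_target_reset_handler (device reset). */
static enum mock_eh_result mock_eh_target_reset(struct mock_hba *hba)
{
	return hba->target_reset_works ? MOCK_EH_SUCCESS : MOCK_EH_FAILED;
}

/* Level 2: plays the role of eh_host_reset_handler. */
static enum mock_eh_result mock_eh_host_reset(struct mock_hba *hba)
{
	return hba->host_reset_works ? MOCK_EH_SUCCESS : MOCK_EH_FAILED;
}

/*
 * Simplified model of the caller: try each level once, in order, and stop
 * at the first one that succeeds. Returns the index of the level that
 * recovered, or -1 if none did (the device would then be taken offline).
 */
static int mock_eh_escalate(struct mock_hba *hba)
{
	static const mock_eh_handler_t levels[] = {
		mock_eh_device_reset,
		mock_eh_target_reset,
		mock_eh_host_reset,
	};
	size_t i;

	assert(hba != NULL);
	for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++) {
		if (levels[i](hba) == MOCK_EH_SUCCESS)
			return (int)i;
	}
	return -1;
}
```

With this split, a failed device reset just returns failure, and the host
reset is attempted once by the caller rather than a second time inside the
driver.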

> 
> >
> >>
> >>> ufshcd_device_reset_and_restore have a role of device reset.
> >>> Both ufshcd_dme_reset and ufshcd_dme_enable are valid for local one, not for remote.
> >>> Should we do those for host including link-startup here?
> >>
> >> Yes, it is needed. After DME_ENDPOINT_RESET the remote link goes into link down state.
> > I would like to read the related description, but I couldn't find it. Could you point me to it?
> 
> Please refer to "Table 121 DME_SAP restrictions" of MIPI Uni-Pro spec.
> The spec. doesn't mention about this explicitly but here is the logic
> that is derived from the spec.
> 1) The DME_LINKSTARTUP can be sent only when the link is in down state,
> in all other states DME_LINKSTARTUP is ignored.
> 2) So if we are sending DME_ENDPOINT_RESET then that must ensure that
> remote link is in down state, and hence it can receive linkstartup and
> establish the communication.
> 
> >
> >> To initialize the link, the host needs to send
> >> DME_LINKSTARTUP, but according to Uni-Pro spec. the link-startup can
> >> only be sent when the local uni-pro is in link-down state. So first
> > If what you mentioned above is right, the uni-pro is already in link-down state after DME_ENDPOINT_RESET.
> > Then, DME_RESET isn't needed.
> 
> You are getting confused here -
Yeah, I mixed it up with the link-down of the remote side that you mentioned.
The spec says nothing specific about the link state after receiving DME_ENDPOINT_RESET.
I only found the point where link startup is initiated:
link startup is triggered by the remote device after DME_ENDPOINT_RESET.
In that case, the host will detect the ULSS (UIC Link Startup Status) interrupt.
After that, the host shall start the link startup procedure with DME_RESET.
Of course, your approach could be acceptable.

> 
> - State1: before sending DME_ENDPOINT_RESET
> 	Local Unipro (host) - Link-UP
> 	Remote Unipro (device) - Link-Up
> 
> - State2: after sending DME_ENDPOINT_RESET
> 	Local Unipro (host) - Link-UP
> 	Remote Unipro (device) - Link-Down
> 
> - State3: After sending DME_RESET+DME_ENABLE
> 	Local Unipro (host) - Link-Down
> 	Remote Unipro (device) - Link-Down
> 
> - State4: After sending DME_LINKSTARTUP
> 	Local Unipro(host) - Link-up
> 	Remote Unipro (device) - Link-up
> 
> The local unipro ignores the DME_LINKSTARTUP if we send it before
> DME_RESET.
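
The four states quoted above can be walked through with a small user-space
model. This is only a sketch of the sequence as described in this thread,
not of the Uni-Pro spec itself: all mock_* names are hypothetical,
DME_RESET and DME_ENABLE are folded into one step, and the intermediate
"disabled" state is not modelled. It assumes, as the quoted text does, that
a link startup is ignored unless both sides are in link-down state.

```c
#include <assert.h>
#include <stddef.h>

enum mock_link_state { MOCK_LINK_DOWN, MOCK_LINK_UP };

enum mock_dme_cmd {
	MOCK_DME_ENDPOINT_RESET, /* sent to the remote (device) side */
	MOCK_DME_RESET_ENABLE,   /* local DME_RESET followed by DME_ENABLE */
	MOCK_DME_LINKSTARTUP,    /* local link startup request */
};

struct mock_link {
	enum mock_link_state local;  /* host Uni-Pro */
	enum mock_link_state remote; /* device Uni-Pro */
};

/* Apply one command. Returns 1 if it took effect, 0 if it was ignored. */
static int mock_dme_send(struct mock_link *l, enum mock_dme_cmd cmd)
{
	assert(l != NULL);

	switch (cmd) {
	case MOCK_DME_ENDPOINT_RESET:
		/* State1 -> State2: only the remote side goes down */
		l->remote = MOCK_LINK_DOWN;
		return 1;
	case MOCK_DME_RESET_ENABLE:
		/* State2 -> State3: the local side goes down as well */
		l->local = MOCK_LINK_DOWN;
		return 1;
	case MOCK_DME_LINKSTARTUP:
		/* ignored unless both ends are in link-down state */
		if (l->local != MOCK_LINK_DOWN || l->remote != MOCK_LINK_DOWN)
			return 0;
		/* State3 -> State4: link re-established on both ends */
		l->local = MOCK_LINK_UP;
		l->remote = MOCK_LINK_UP;
		return 1;
	}
	return 0;
}
```

In this model a MOCK_DME_LINKSTARTUP issued right after
MOCK_DME_ENDPOINT_RESET returns 0, which is the reason given above for
sending DME_RESET and DME_ENABLE first.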
> 
> >
> >> we need to get the local unipro from link-up to disabled to link-down
> >> using the DME_RESET and DME_ENABLE commands and then issue
> >> DME_LINKSTARTUP to re-initialize the link.
> > 'ufshcd_hba_enable' can be used instead of both if these are really needed.
> > This will do dme_reset and dme_enable.
> >
> 
> The only reason for this is that in some implementations the HCE reset
> also resets the UTP layer in addition to the Uni-Pro layer. There is no need
> for a UTP layer reset on device reset, so explicit DME_RESET and
> DME_ENABLE are used. For implementations which don't reset the UTP
> layer, the advantage is that instead of wasting CPU cycles polling for
> HCE=1 we depend on UIC interrupts.
HCE reset can involve additional unipro configuration, depending on the
host controller implementation.
Considering that the unipro stack is reset by DME_RESET,
issuing DME_RESET on its own might be inappropriate during link startup.

Thanks,
Seungwon Jeon



* RE: [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling
  2013-07-23 15:41         ` Sujit Reddy Thumma
@ 2013-07-24 13:39           ` Seungwon Jeon
  2013-07-29  9:45             ` Sujit Reddy Thumma
  0 siblings, 1 reply; 27+ messages in thread
From: Seungwon Jeon @ 2013-07-24 13:39 UTC (permalink / raw)
  To: 'Sujit Reddy Thumma'
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On Wed, July 24, 2013, Sujit Reddy Thumma wrote:
> On 7/23/2013 2:04 PM, Seungwon Jeon wrote:
> > On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
> >> On 7/19/2013 7:28 PM, Seungwon Jeon wrote:
> >>> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
> >>>> Error handling in UFS driver is broken and resets the host controller
> >>>> for fatal errors without re-initialization. Correct the fatal error
> >>>> handling sequence according to UFS Host Controller Interface (HCI)
> >>>> v1.1 specification.
> >>>>
> >>>> o Upon determining fatal error condition the host controller may hang
> >>>>     forever until a reset is applied, so just retrying the command doesn't
> >>>>     work without a reset. So, the reset is applied in the driver context
> >>>>     in a separate work and SCSI mid-layer isn't informed until reset is
> >>>>     applied.
> >>>>
> >>>> o Processed requests which are completed without error are reported to
> >>>>     SCSI layer as successful and any pending commands that are not started
> >>>>     yet or are not cause of the error are re-queued into scsi midlayer queue.
> >>>>     For the command that caused error, host controller or device is reset
> >>>>     and DID_ERROR is returned for command retry after applying reset.
> >>>>
> >>>> o SCSI is informed about the expected Unit-Attentioni exception from the
> >>> Attention'i',  typo.
> >> Okay.
> >>
> >>>
> >>>>     device for the immediate command after a reset so that the SCSI layer
> >>>>     take necessary steps to establish communication with the device.
> >>>>
> >>>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
> >>>> ---
> >>>>    drivers/scsi/ufs/ufshcd.c |  349 +++++++++++++++++++++++++++++++++++---------
> >>>>    drivers/scsi/ufs/ufshcd.h |    2 +
> >>>>    drivers/scsi/ufs/ufshci.h |   19 ++-
> >>>>    3 files changed, 295 insertions(+), 75 deletions(-)
> >>>>
> >>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> >>>> index b4c9910..2a3874f 100644
> >>>> --- a/drivers/scsi/ufs/ufshcd.c
> >>>> +++ b/drivers/scsi/ufs/ufshcd.c
> >>>> @@ -80,6 +80,14 @@ enum {
> >>>>    	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
> >>>>    };
> >>>>
> >>>> +/* UFSHCD UIC layer error flags */
> >>>> +enum {
> >>>> +	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0), /* Data link layer error */
> >>>> +	UFSHCD_UIC_NL_ERROR = (1 << 1), /* Network layer error */
> >>>> +	UFSHCD_UIC_TL_ERROR = (1 << 2), /* Transport Layer error */
> >>>> +	UFSHCD_UIC_DME_ERROR = (1 << 3), /* DME error */
> >>>> +};
> >>>> +
> >>>>    /* Interrupt configuration options */
> >>>>    enum {
> >>>>    	UFSHCD_INT_DISABLE,
> >>>> @@ -108,6 +116,7 @@ enum {
> >>>>
> >>>>    static void ufshcd_tmc_handler(struct ufs_hba *hba);
> >>>>    static void ufshcd_async_scan(void *data, async_cookie_t cookie);
> >>>> +static int ufshcd_reset_and_restore(struct ufs_hba *hba);
> >>>>
> >>>>    /*
> >>>>     * ufshcd_wait_for_register - wait for register value to change
> >>>> @@ -1605,9 +1614,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
> >>>>    		goto out;
> >>>>    	}
> >>>>
> >>>> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
> >>>> -		scsi_unblock_requests(hba->host);
> >>>> -
> >>>>    out:
> >>>>    	return err;
> >>>>    }
> >>>> @@ -1733,66 +1739,6 @@ static int ufshcd_validate_dev_connection(struct ufs_hba *hba)
> >>>>    }
> >>>>
> >>>>    /**
> >>>> - * ufshcd_do_reset - reset the host controller
> >>>> - * @hba: per adapter instance
> >>>> - *
> >>>> - * Returns SUCCESS/FAILED
> >>>> - */
> >>>> -static int ufshcd_do_reset(struct ufs_hba *hba)
> >>>> -{
> >>>> -	struct ufshcd_lrb *lrbp;
> >>>> -	unsigned long flags;
> >>>> -	int tag;
> >>>> -
> >>>> -	/* block commands from midlayer */
> >>>> -	scsi_block_requests(hba->host);
> >>>> -
> >>>> -	spin_lock_irqsave(hba->host->host_lock, flags);
> >>>> -	hba->ufshcd_state = UFSHCD_STATE_RESET;
> >>>> -
> >>>> -	/* send controller to reset state */
> >>>> -	ufshcd_hba_stop(hba);
> >>>> -	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >>>> -
> >>>> -	/* abort outstanding commands */
> >>>> -	for (tag = 0; tag < hba->nutrs; tag++) {
> >>>> -		if (test_bit(tag, &hba->outstanding_reqs)) {
> >>>> -			lrbp = &hba->lrb[tag];
> >>>> -			if (lrbp->cmd) {
> >>>> -				scsi_dma_unmap(lrbp->cmd);
> >>>> -				lrbp->cmd->result = DID_RESET << 16;
> >>>> -				lrbp->cmd->scsi_done(lrbp->cmd);
> >>>> -				lrbp->cmd = NULL;
> >>>> -				clear_bit_unlock(tag, &hba->lrb_in_use);
> >>>> -			}
> >>>> -		}
> >>>> -	}
> >>>> -
> >>>> -	/* complete device management command */
> >>>> -	if (hba->dev_cmd.complete)
> >>>> -		complete(hba->dev_cmd.complete);
> >>>> -
> >>>> -	/* clear outstanding request/task bit maps */
> >>>> -	hba->outstanding_reqs = 0;
> >>>> -	hba->outstanding_tasks = 0;
> >>>> -
> >>>> -	/* Host controller enable */
> >>>> -	if (ufshcd_hba_enable(hba)) {
> >>>> -		dev_err(hba->dev,
> >>>> -			"Reset: Controller initialization failed\n");
> >>>> -		return FAILED;
> >>>> -	}
> >>>> -
> >>>> -	if (ufshcd_link_startup(hba)) {
> >>>> -		dev_err(hba->dev,
> >>>> -			"Reset: Link start-up failed\n");
> >>>> -		return FAILED;
> >>>> -	}
> >>>> -
> >>>> -	return SUCCESS;
> >>>> -}
> >>>> -
> >>>> -/**
> >>>>     * ufshcd_slave_alloc - handle initial SCSI device configurations
> >>>>     * @sdev: pointer to SCSI device
> >>>>     *
> >>>> @@ -1809,6 +1755,9 @@ static int ufshcd_slave_alloc(struct scsi_device *sdev)
> >>>>    	sdev->use_10_for_ms = 1;
> >>>>    	scsi_set_tag_type(sdev, MSG_SIMPLE_TAG);
> >>>>
> >>>> +	/* allow SCSI layer to restart the device in case of errors */
> >>>> +	sdev->allow_restart = 1;
> >>>> +
> >>>>    	/*
> >>>>    	 * Inform SCSI Midlayer that the LUN queue depth is same as the
> >>>>    	 * controller queue depth. If a LUN queue depth is less than the
> >>>> @@ -2013,6 +1962,9 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
> >>>>    	case OCS_ABORTED:
> >>>>    		result |= DID_ABORT << 16;
> >>>>    		break;
> >>>> +	case OCS_INVALID_COMMAND_STATUS:
> >>>> +		result |= DID_REQUEUE << 16;
> >>>> +		break;
> >>>>    	case OCS_INVALID_CMD_TABLE_ATTR:
> >>>>    	case OCS_INVALID_PRDT_ATTR:
> >>>>    	case OCS_MISMATCH_DATA_BUF_SIZE:
> >>>> @@ -2405,42 +2357,295 @@ out:
> >>>>    	return err;
> >>>>    }
> >>>>
> >>>> +static void ufshcd_decide_eh_xfer_req(struct ufs_hba *hba, u32 ocs)
> >>>> +{
> >>>> +	switch (ocs) {
> >>>> +	case OCS_SUCCESS:
> >>>> +	case OCS_INVALID_COMMAND_STATUS:
> >>>> +		break;
> >>>> +	case OCS_MISMATCH_DATA_BUF_SIZE:
> >>>> +	case OCS_MISMATCH_RESP_UPIU_SIZE:
> >>>> +	case OCS_PEER_COMM_FAILURE:
> >>>> +	case OCS_FATAL_ERROR:
> >>>> +	case OCS_ABORTED:
> >>>> +	case OCS_INVALID_CMD_TABLE_ATTR:
> >>>> +	case OCS_INVALID_PRDT_ATTR:
> >>>> +		ufshcd_set_host_reset_pending(hba);
> >>> Should host be reset on ocs error, including below ufshcd_decide_eh_task_req?
> >>> It's just overall command status.
> >>
> >> Yes, the error handling section in the UFS 1.1 spec. mentions so.
> > If a host reset is required, it should be allowed only in a fatal situation.
> > Deciding based on the OCS field doesn't seem proper. There is no mention of that in the spec.
> > If my information is wrong, please clarify.
> >
> 
> You can refer to section 8.3 of HCI spec.
> On fatal errors the controller h/w will have to update the OCS field of
> the command that caused the error and then raise a fatal error interrupt.
> The s/w reads the OCS value, determines which commands are in error
> and then carries out the reset.
I don't think so.
The OCS field can be updated regardless of whether a fatal error occurred.
As mentioned previously, your implementation gathers all errors into 'ufshcd_fatal_err_handler'.
This means that non-fatal errors are handled there as well, and a host reset gets scheduled for any error OCS value.
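
To make the difference concrete, here is a small user-space sketch of the
two policies. It is only an illustration under stated assumptions: the
MOCK_* bit values and enum are placeholders and do not match the UFSHCI
register layout, and mock_reset_by_interrupt() is one possible reading of
the suggestion above, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder interrupt status bits (hypothetical values). */
#define MOCK_INT_FATAL_ERROR	(1u << 0)
#define MOCK_INT_UIC_ERROR	(1u << 1)

/* Reduced, hypothetical set of overall command status values. */
enum mock_ocs {
	MOCK_OCS_SUCCESS,
	MOCK_OCS_INVALID_COMMAND_STATUS,
	MOCK_OCS_ABORTED,
	MOCK_OCS_FATAL_ERROR,
};

/*
 * Policy A, as the patch is read in this review: any OCS other than
 * "success" or "not started" requests a host reset.
 */
static int mock_reset_by_ocs(enum mock_ocs ocs)
{
	assert((int)ocs >= 0 && (int)ocs <= MOCK_OCS_FATAL_ERROR);
	return ocs != MOCK_OCS_SUCCESS &&
	       ocs != MOCK_OCS_INVALID_COMMAND_STATUS;
}

/*
 * Policy B: request a host reset only when the interrupt status reports
 * a fatal error; the OCS is then used only to find the failed commands.
 */
static int mock_reset_by_interrupt(uint32_t intr_status)
{
	return (intr_status & MOCK_INT_FATAL_ERROR) != 0;
}
```

For a command that completes with MOCK_OCS_ABORTED while only
MOCK_INT_UIC_ERROR is raised, policy A asks for a host reset and policy B
does not.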

> 
> >>
> >>>
> >>>> +		break;
> >>>> +	default:
> >>>> +		dev_err(hba->dev, "%s: unknown OCS 0x%x\n",
> >>>> +				__func__, ocs);
> >>>> +		BUG();
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void ufshcd_decide_eh_task_req(struct ufs_hba *hba, u32 ocs)
> >>>> +{
> >>>> +	switch (ocs) {
> >>>> +	case OCS_TMR_SUCCESS:
> >>>> +	case OCS_TMR_INVALID_COMMAND_STATUS:
> >>>> +		break;
> >>>> +	case OCS_TMR_MISMATCH_REQ_SIZE:
> >>>> +	case OCS_TMR_MISMATCH_RESP_SIZE:
> >>>> +	case OCS_TMR_PEER_COMM_FAILURE:
> >>>> +	case OCS_TMR_INVALID_ATTR:
> >>>> +	case OCS_TMR_ABORTED:
> >>>> +	case OCS_TMR_FATAL_ERROR:
> >>>> +		ufshcd_set_host_reset_pending(hba);
> >>>> +		break;
> >>>> +	default:
> >>>> +		dev_err(hba->dev, "%s: uknown TMR OCS 0x%x\n",
> >>>> +				__func__, ocs);
> >>>> +		BUG();
> >>>> +	}
> >>>> +}
> >>>> +
> >>>>    /**
> >>>> - * ufshcd_fatal_err_handler - handle fatal errors
> >>>> + * ufshcd_error_autopsy_transfer_req() - reads OCS field of failed command and
> >>>> + *                          decide error handling
> >>>>     * @hba: per adapter instance
> >>>> + * @err_xfer: bit mask for transfer request errors
> >>>> + *
> >>>> + * Iterate over completed transfer requests and
> >>>> + * set error handling flags.
> >>>> + */
> >>>> +static void
> >>>> +ufshcd_error_autopsy_transfer_req(struct ufs_hba *hba, u32 *err_xfer)
> >>>> +{
> >>>> +	unsigned long completed;
> >>>> +	u32 doorbell;
> >>>> +	int index;
> >>>> +	int ocs;
> >>>> +
> >>>> +	if (!err_xfer)
> >>>> +		goto out;
> >>>> +
> >>>> +	doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
> >>>> +	completed = doorbell ^ (u32)hba->outstanding_reqs;
> >>>> +
> >>>> +	for (index = 0; index < hba->nutrs; index++) {
> >>>> +		if (test_bit(index, &completed)) {
> >>>> +			ocs = ufshcd_get_tr_ocs(&hba->lrb[index]);
> >>>> +			if ((ocs == OCS_SUCCESS) ||
> >>>> +					(ocs == OCS_INVALID_COMMAND_STATUS))
> >>>> +				continue;
> >>>> +
> >>>> +			*err_xfer |= (1 << index);
> >>>> +			ufshcd_decide_eh_xfer_req(hba, ocs);
> >>>> +		}
> >>>> +	}
> >>>> +out:
> >>>> +	return;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_error_autopsy_task_req() - reads OCS field of failed command and
> >>>> + *                          decide error handling
> >>>> + * @hba: per adapter instance
> >>>> + * @err_tm: bit mask for task management errors
> >>>> + *
> >>>> + * Iterate over completed task management requests and
> >>>> + * set error handling flags.
> >>>> + */
> >>>> +static void
> >>>> +ufshcd_error_autopsy_task_req(struct ufs_hba *hba, u32 *err_tm)
> >>>> +{
> >>>> +	unsigned long completed;
> >>>> +	u32 doorbell;
> >>>> +	int index;
> >>>> +	int ocs;
> >>>> +
> >>>> +	if (!err_tm)
> >>>> +		goto out;
> >>>> +
> >>>> +	doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
> >>>> +	completed = doorbell ^ (u32)hba->outstanding_tasks;
> >>>> +
> >>>> +	for (index = 0; index < hba->nutmrs; index++) {
> >>>> +		if (test_bit(index, &completed)) {
> >>>> +			struct utp_task_req_desc *tm_descp;
> >>>> +
> >>>> +			tm_descp = hba->utmrdl_base_addr;
> >>>> +			ocs = ufshcd_get_tmr_ocs(&tm_descp[index]);
> >>>> +			if ((ocs == OCS_TMR_SUCCESS) ||
> >>>> +					(ocs == OCS_TMR_INVALID_COMMAND_STATUS))
> >>>> +				continue;
> >>>> +
> >>>> +			*err_tm |= (1 << index);
> >>>> +			ufshcd_decide_eh_task_req(hba, ocs);
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +out:
> >>>> +	return;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_fatal_err_handler - handle fatal errors
> >>>> + * @work: pointer to work structure
> >>>>     */
> >>>>    static void ufshcd_fatal_err_handler(struct work_struct *work)
> >>>>    {
> >>>>    	struct ufs_hba *hba;
> >>>> +	unsigned long flags;
> >>>> +	u32 err_xfer = 0;
> >>>> +	u32 err_tm = 0;
> >>>> +	int err;
> >>>> +
> >>>>    	hba = container_of(work, struct ufs_hba, feh_workq);
> >>>>
> >>>>    	pm_runtime_get_sync(hba->dev);
> >>>> -	/* check if reset is already in progress */
> >>>> -	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
> >>>> -		ufshcd_do_reset(hba);
> >>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >>>> +	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
> >>>> +		/* complete processed requests and exit */
> >>>> +		ufshcd_transfer_req_compl(hba);
> >>>> +		ufshcd_tmc_handler(hba);
> >>>> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
> >>>> +		pm_runtime_put_sync(hba->dev);
> >>>> +		return;
> >>> Host driver is here with finishing 'scsi_block_requests'.
> >>> 'scsi_unblock_requests' can be called somewhere?
> >>
> >> No, but it can be possible that SCSI command timeout which triggers
> >> device/host reset and fatal error handler race each other.
> > Sorry, I didn't get your meaning exactly.
> > I saw that scsi_block_requests is done before ufshcd_fatal_err_handler is scheduled.
> > If device or host was requested from scsi mid-layer just before ufshcd_fatal_err_handler,
> > ufshcd_fatal_err_handler will be out through if statement. Then, there is nowhere to call
> > scsi_unblock_requests though device/host reset is done successfully.
> 
> You are right, this should return with scsi_unblock_requests()
> called and there is no need to complete the processed requests as we
> might be in middle of something else while the RESET is in progress.
> 
> 
> >>
> >>>
> >>>> +	}
> >>>> +
> >>>> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
> >>>> +	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
> >>>> +	ufshcd_error_autopsy_task_req(hba, &err_tm);
> >>>> +
> >>>> +	/*
> >>>> +	 * Complete successful and pending transfer requests.
> >>>> +	 * DID_REQUEUE is returned for pending requests as they have
> >>>> +	 * nothing to do with error'ed request and SCSI layer should
> >>>> +	 * not treat them as errors and decrement retry count.
> >>>> +	 */
> >>>> +	hba->outstanding_reqs &= ~err_xfer;
> >>>> +	ufshcd_transfer_req_compl(hba);
> >>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >>>> +	ufshcd_complete_pending_reqs(hba);
> >>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >>>> +	hba->outstanding_reqs |= err_xfer;
> >>> Hmm... error handling seems so complicated.
> >>> To simplify it, how about below?
> >>>
> >>> 1. If requests(transfer or task management) are completed, finish them with success/failure.
> >> This is what we are trying to do above.
> >>
> >>> 2. If there are pending requests, abort them.
> >> No, if a fatal error is occurred it is possible that host controller is
> >> freez'ed we are not sure if it can take task management commands and
> >> execute them.
> > I meant aborting the request by clearing the corresponding UTRLCLR/UTMRLCLR.
> >
> 
> I am doing the same in this patch -
> 1) Return to SCSI the successful commands.
> 2) Clear the pending commands (those that did not cause the error) by
> writing into the UTRLCLR/UTMRLCLR registers, so scsi_host_result = DID_REQUEUE.
> 3) Reset and return the commands that "caused error" to SCSI with
> DID_ERROR.
> 
> Am I doing anything extra than what you have suggested?
If some are cleared, let me review more.
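
As a rough illustration of the three steps quoted above, the following is a minimal, standalone sketch and not the patch's code. The struct and function names are hypothetical. It assumes that a doorbell bit stays set while the controller still owns a slot, and it computes the completed set as outstanding & ~doorbell, which equals the patch's doorbell ^ outstanding_reqs whenever the doorbell bits are a subset of the outstanding bits.

```c
#include <stdint.h>

/* Hypothetical container for the three groups discussed in the thread. */
struct sketch_split {
	uint32_t completed_ok; /* completed without error: report success */
	uint32_t pending;      /* not completed: clear the slot, DID_REQUEUE */
	uint32_t errored;      /* completed with error: DID_ERROR after reset */
};

/*
 * outstanding: slots the driver has issued (like hba->outstanding_reqs)
 * doorbell:    slots the controller still owns (doorbell register value)
 * err_mask:    slots whose OCS indicated an error (like err_xfer)
 */
static struct sketch_split
sketch_split_requests(uint32_t outstanding, uint32_t doorbell,
		      uint32_t err_mask)
{
	struct sketch_split s;
	uint32_t completed = outstanding & ~doorbell;

	s.errored = completed & err_mask;
	s.completed_ok = completed & ~err_mask;
	s.pending = outstanding & doorbell;
	return s;
}
```

With outstanding = 0xF, doorbell = 0x3 and err_mask = 0x4, slots 0 and 1 are pending, slot 2 is errored and slot 3 completed successfully.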

> 
> >>
> >>> 3. If fatal error, reset.
> >>>
> >>
> >>
> >>>> +
> >>>> +	/* Complete successful and pending task requests */
> >>>> +	hba->outstanding_tasks &= ~err_tm;
> >>>> +	ufshcd_tmc_handler(hba);
> >>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >>>> +	ufshcd_complete_pending_tasks(hba);
> >>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
> >>>> +
> >>>> +	hba->outstanding_tasks |= err_tm;
> >>>> +
> >>>> +	/*
> >>>> +	 * Controller may generate multiple fatal errors, handle
> >>>> +	 * errors based on severity.
> >>>> +	 * 1) DEVICE_FATAL_ERROR
> >>>> +	 * 2) SYSTEM_BUS/CONTROLLER_FATAL_ERROR
> >>>> +	 * 3) UIC_ERROR
> >>>> +	 */
> >>>> +	if (hba->errors & DEVICE_FATAL_ERROR) {
> >>>> +		/*
> >>>> +		 * Some HBAs may not clear UTRLDBR/UTMRLDBR or update
> >>>> +		 * OCS field on device fatal error.
> >>>> +		 */
> >>>> +		ufshcd_set_host_reset_pending(hba);
> >>> In DEVICE_FATAL_ERROR, ufshcd_device_reset_pending is right?
> >>
> >> It looks so, but the spec. mentions to reset the host as well (8.3.6).
> > Do you pointing below?
> > [8.3.6. Device Errors are fatal errors. ...the host software shall reset the device too.]
> >
> 
> I meant "8.3.6: When this condition occurs, host software shall follow
> the same procedure for UIC error handling as described in 8.2.2,". There
> is an error in the spec. it was not 8.2.2 but 8.3.2 for UIC error
> handling. So going by 8.3.2 HCE needs to be toggled.
I feel that 8.3.2 of the spec. makes it difficult to distinguish a 'device fatal error' from a fatal UIC error.
The spec. needs to clarify this.

Anyway, I found a description related to the host reset.
5.3.1 Device Fatal Error Status (DFES):
...
If the error occurs, host SW should reset the host controller.
It's explicit. If the spec. is right, we should reset the host.

> 
> 
> 
> >>
> >>>
> >>>> +	} else if (hba->errors & (SYSTEM_BUS_FATAL_ERROR |
> >>>> +			CONTROLLER_FATAL_ERROR)) {
> >>>> +		/* eh flags should be set in err autopsy based on OCS values */
> >>>> +		if (!hba->eh_flags)
> >>>> +			WARN(1, "%s: fatal error without error handling\n",
> >>>> +				dev_name(hba->dev));
> >>>> +	} else if (hba->errors & UIC_ERROR) {
> >>>> +		if (hba->uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR) {
> >>>> +			/* fatal error - reset controller */
> >>>> +			ufshcd_set_host_reset_pending(hba);
> >>>> +		} else if (hba->uic_error & (UFSHCD_UIC_NL_ERROR |
> >>>> +					UFSHCD_UIC_TL_ERROR |
> >>>> +					UFSHCD_UIC_DME_ERROR)) {
> >>>> +			/* non-fatal, report error to SCSI layer */
> >>>> +			if (!hba->eh_flags) {
> >>>> +				spin_unlock_irqrestore(
> >>>> +						hba->host->host_lock, flags);
> >>>> +				ufshcd_complete_pending_reqs(hba);
> >>>> +				ufshcd_complete_pending_tasks(hba);
> >>>> +				spin_lock_irqsave(hba->host->host_lock, flags);
> >>>> +			}
> >>>> +		}
> >>>> +	}
> >>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
> >>>> +
> >>>> +	if (hba->eh_flags) {
> >>>> +		err = ufshcd_reset_and_restore(hba);
> >>>> +		if (err) {
> >>>> +			ufshcd_clear_host_reset_pending(hba);
> >>>> +			ufshcd_clear_device_reset_pending(hba);
> >>>> +			dev_err(hba->dev, "%s: reset and restore failed\n",
> >>>> +					__func__);
> >>>> +			hba->ufshcd_state = UFSHCD_STATE_ERROR;
> >>>> +		}
> >>>> +		/*
> >>>> +		 * Inform scsi mid-layer that we did reset and allow to handle
> >>>> +		 * Unit Attention properly.
> >>>> +		 */
> >>>> +		scsi_report_bus_reset(hba->host, 0);
> >>>> +		hba->errors = 0;
> >>>> +		hba->uic_error = 0;
> >>>> +	}
> >>>> +	scsi_unblock_requests(hba->host);
> >>>>    	pm_runtime_put_sync(hba->dev);
> >>>>    }
> >>>>
> >>>>    /**
> >>>> - * ufshcd_err_handler - Check for fatal errors
> >>>> - * @work: pointer to a work queue structure
> >>>> + * ufshcd_update_uic_error - check and set fatal UIC error flags.
> >>>> + * @hba: per-adapter instance
> >>>>     */
> >>>> -static void ufshcd_err_handler(struct ufs_hba *hba)
> >>>> +static void ufshcd_update_uic_error(struct ufs_hba *hba)
> >>>>    {
> >>>>    	u32 reg;
> >>>>
> >>>> +	/* PA_INIT_ERROR is fatal and needs UIC reset */
> >>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
> >>>> +	if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
> >>>> +		hba->uic_error |= UFSHCD_UIC_DL_PA_INIT_ERROR;
> >>>> +
> >>>> +	/* UIC NL/TL/DME errors needs software retry */
> >>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_NETWORK_LAYER);
> >>>> +	if (reg)
> >>>> +		hba->uic_error |= UFSHCD_UIC_NL_ERROR;
> >>>> +
> >>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_TRANSPORT_LAYER);
> >>>> +	if (reg)
> >>>> +		hba->uic_error |= UFSHCD_UIC_TL_ERROR;
> >>>> +
> >>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DME);
> >>>> +	if (reg)
> >>>> +		hba->uic_error |= UFSHCD_UIC_DME_ERROR;
> >>> REG_UIC_ERROR_CODE_PHY_ADAPTER_LAYER is not handled.
> >>
> >> UFS spec. mentions that it is non-fatal error and UIC recovers
> >> by itself and doesn't need software intervention.
> > Ok.
> >
> >>
> >>>
> >>>> +
> >>>> +	dev_dbg(hba->dev, "%s: UIC error flags = 0x%08x\n",
> >>>> +			__func__, hba->uic_error);
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * ufshcd_err_handler - Check for fatal errors
> >>>> + * @hba: per-adapter instance
> >>>> + */
> >>>> +static void ufshcd_err_handler(struct ufs_hba *hba)
> >>>> +{
> >>>>    	if (hba->errors & INT_FATAL_ERRORS)
> >>>>    		goto fatal_eh;
> >>>>
> >>>>    	if (hba->errors & UIC_ERROR) {
> >>>> -		reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
> >>>> -		if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
> >>>> +		hba->uic_error = 0;
> >>>> +		ufshcd_update_uic_error(hba);
> >>>> +		if (hba->uic_error)
> >>> Except UFSHCD_UIC_DL_PA_INIT_ERROR, it's not fatal. Should it go to fatal_eh?
> >>
> >> Please see the UIC error handling in ufshcd_fatal_err_handler(), others
> >> need software intervention so I combined it with fatal_eh to complete
> >> the requests and report to SCSI.
> > As gathering all error(fatal, non-fatal)handling into origin one, it makes confused.
> > Then, I would be better to rename ufshcd_fatal_err_handler.
> >
> 
> Yeah, ufshcd_err_handler is apt but it is already consumed. Probably,
> ufshcd_err_handler -> ufshcd_check_errors
> ufshcd_fatal_err_handler -> ufshcd_err_handler
> rename would be fine?
I like it.

Thanks,
Seungwon Jeon



* Re: [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling
  2013-07-24 13:39           ` Seungwon Jeon
@ 2013-07-29  9:45             ` Sujit Reddy Thumma
  0 siblings, 0 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-29  9:45 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/24/2013 7:09 PM, Seungwon Jeon wrote:
> On Wed, July 24, 2013, Sujit Reddy Thumma wrote:
>> On 7/23/2013 2:04 PM, Seungwon Jeon wrote:
>>> On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
>>>> On 7/19/2013 7:28 PM, Seungwon Jeon wrote:
>>>>> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
>>>>>> Error handling in UFS driver is broken and resets the host controller
>>>>>> for fatal errors without re-initialization. Correct the fatal error
>>>>>> handling sequence according to UFS Host Controller Interface (HCI)
>>>>>> v1.1 specification.
>>>>>>
>>>>>> o Upon determining fatal error condition the host controller may hang
>>>>>>      forever until a reset is applied, so just retrying the command doesn't
>>>>>>      work without a reset. So, the reset is applied in the driver context
>>>>>>      in a separate work and SCSI mid-layer isn't informed until reset is
>>>>>>      applied.
>>>>>>
>>>>>> o Processed requests which are completed without error are reported to
>>>>>>      SCSI layer as successful and any pending commands that are not started
>>>>>>      yet or are not cause of the error are re-queued into scsi midlayer queue.
>>>>>>      For the command that caused error, host controller or device is reset
>>>>>>      and DID_ERROR is returned for command retry after applying reset.
>>>>>>
>>>>>> o SCSI is informed about the expected Unit-Attentioni exception from the
>>>>> Attention'i',  typo.
>>>> Okay.
>>>>
>>>>>
>>>>>>      device for the immediate command after a reset so that the SCSI layer
>>>>>>      take necessary steps to establish communication with the device.
>>>>>>
>>>>>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>>>>>> ---
>>>>>>     drivers/scsi/ufs/ufshcd.c |  349 +++++++++++++++++++++++++++++++++++---------
>>>>>>     drivers/scsi/ufs/ufshcd.h |    2 +
>>>>>>     drivers/scsi/ufs/ufshci.h |   19 ++-
>>>>>>     3 files changed, 295 insertions(+), 75 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>>>>>> index b4c9910..2a3874f 100644
>>>>>> --- a/drivers/scsi/ufs/ufshcd.c
>>>>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>>>>> @@ -80,6 +80,14 @@ enum {
>>>>>>     	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>>>>>>     };
>>>>>>
>>>>>> +/* UFSHCD UIC layer error flags */
>>>>>> +enum {
>>>>>> +	UFSHCD_UIC_DL_PA_INIT_ERROR = (1 << 0), /* Data link layer error */
>>>>>> +	UFSHCD_UIC_NL_ERROR = (1 << 1), /* Network layer error */
>>>>>> +	UFSHCD_UIC_TL_ERROR = (1 << 2), /* Transport Layer error */
>>>>>> +	UFSHCD_UIC_DME_ERROR = (1 << 3), /* DME error */
>>>>>> +};
>>>>>> +
>>>>>>     /* Interrupt configuration options */
>>>>>>     enum {
>>>>>>     	UFSHCD_INT_DISABLE,
>>>>>> @@ -108,6 +116,7 @@ enum {
>>>>>>
>>>>>>     static void ufshcd_tmc_handler(struct ufs_hba *hba);
>>>>>>     static void ufshcd_async_scan(void *data, async_cookie_t cookie);
>>>>>> +static int ufshcd_reset_and_restore(struct ufs_hba *hba);
>>>>>>
>>>>>>     /*
>>>>>>      * ufshcd_wait_for_register - wait for register value to change
>>>>>> @@ -1605,9 +1614,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>>>>>>     		goto out;
>>>>>>     	}
>>>>>>
>>>>>> -	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>>>>>> -		scsi_unblock_requests(hba->host);
>>>>>> -
>>>>>>     out:
>>>>>>     	return err;
>>>>>>     }
>>>>>> @@ -1733,66 +1739,6 @@ static int ufshcd_validate_dev_connection(struct ufs_hba *hba)
>>>>>>     }
>>>>>>
>>>>>>     /**
>>>>>> - * ufshcd_do_reset - reset the host controller
>>>>>> - * @hba: per adapter instance
>>>>>> - *
>>>>>> - * Returns SUCCESS/FAILED
>>>>>> - */
>>>>>> -static int ufshcd_do_reset(struct ufs_hba *hba)
>>>>>> -{
>>>>>> -	struct ufshcd_lrb *lrbp;
>>>>>> -	unsigned long flags;
>>>>>> -	int tag;
>>>>>> -
>>>>>> -	/* block commands from midlayer */
>>>>>> -	scsi_block_requests(hba->host);
>>>>>> -
>>>>>> -	spin_lock_irqsave(hba->host->host_lock, flags);
>>>>>> -	hba->ufshcd_state = UFSHCD_STATE_RESET;
>>>>>> -
>>>>>> -	/* send controller to reset state */
>>>>>> -	ufshcd_hba_stop(hba);
>>>>>> -	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>>>> -
>>>>>> -	/* abort outstanding commands */
>>>>>> -	for (tag = 0; tag < hba->nutrs; tag++) {
>>>>>> -		if (test_bit(tag, &hba->outstanding_reqs)) {
>>>>>> -			lrbp = &hba->lrb[tag];
>>>>>> -			if (lrbp->cmd) {
>>>>>> -				scsi_dma_unmap(lrbp->cmd);
>>>>>> -				lrbp->cmd->result = DID_RESET << 16;
>>>>>> -				lrbp->cmd->scsi_done(lrbp->cmd);
>>>>>> -				lrbp->cmd = NULL;
>>>>>> -				clear_bit_unlock(tag, &hba->lrb_in_use);
>>>>>> -			}
>>>>>> -		}
>>>>>> -	}
>>>>>> -
>>>>>> -	/* complete device management command */
>>>>>> -	if (hba->dev_cmd.complete)
>>>>>> -		complete(hba->dev_cmd.complete);
>>>>>> -
>>>>>> -	/* clear outstanding request/task bit maps */
>>>>>> -	hba->outstanding_reqs = 0;
>>>>>> -	hba->outstanding_tasks = 0;
>>>>>> -
>>>>>> -	/* Host controller enable */
>>>>>> -	if (ufshcd_hba_enable(hba)) {
>>>>>> -		dev_err(hba->dev,
>>>>>> -			"Reset: Controller initialization failed\n");
>>>>>> -		return FAILED;
>>>>>> -	}
>>>>>> -
>>>>>> -	if (ufshcd_link_startup(hba)) {
>>>>>> -		dev_err(hba->dev,
>>>>>> -			"Reset: Link start-up failed\n");
>>>>>> -		return FAILED;
>>>>>> -	}
>>>>>> -
>>>>>> -	return SUCCESS;
>>>>>> -}
>>>>>> -
>>>>>> -/**
>>>>>>      * ufshcd_slave_alloc - handle initial SCSI device configurations
>>>>>>      * @sdev: pointer to SCSI device
>>>>>>      *
>>>>>> @@ -1809,6 +1755,9 @@ static int ufshcd_slave_alloc(struct scsi_device *sdev)
>>>>>>     	sdev->use_10_for_ms = 1;
>>>>>>     	scsi_set_tag_type(sdev, MSG_SIMPLE_TAG);
>>>>>>
>>>>>> +	/* allow SCSI layer to restart the device in case of errors */
>>>>>> +	sdev->allow_restart = 1;
>>>>>> +
>>>>>>     	/*
>>>>>>     	 * Inform SCSI Midlayer that the LUN queue depth is same as the
>>>>>>     	 * controller queue depth. If a LUN queue depth is less than the
>>>>>> @@ -2013,6 +1962,9 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
>>>>>>     	case OCS_ABORTED:
>>>>>>     		result |= DID_ABORT << 16;
>>>>>>     		break;
>>>>>> +	case OCS_INVALID_COMMAND_STATUS:
>>>>>> +		result |= DID_REQUEUE << 16;
>>>>>> +		break;
>>>>>>     	case OCS_INVALID_CMD_TABLE_ATTR:
>>>>>>     	case OCS_INVALID_PRDT_ATTR:
>>>>>>     	case OCS_MISMATCH_DATA_BUF_SIZE:
>>>>>> @@ -2405,42 +2357,295 @@ out:
>>>>>>     	return err;
>>>>>>     }
>>>>>>
>>>>>> +static void ufshcd_decide_eh_xfer_req(struct ufs_hba *hba, u32 ocs)
>>>>>> +{
>>>>>> +	switch (ocs) {
>>>>>> +	case OCS_SUCCESS:
>>>>>> +	case OCS_INVALID_COMMAND_STATUS:
>>>>>> +		break;
>>>>>> +	case OCS_MISMATCH_DATA_BUF_SIZE:
>>>>>> +	case OCS_MISMATCH_RESP_UPIU_SIZE:
>>>>>> +	case OCS_PEER_COMM_FAILURE:
>>>>>> +	case OCS_FATAL_ERROR:
>>>>>> +	case OCS_ABORTED:
>>>>>> +	case OCS_INVALID_CMD_TABLE_ATTR:
>>>>>> +	case OCS_INVALID_PRDT_ATTR:
>>>>>> +		ufshcd_set_host_reset_pending(hba);
>>>>> Should host be reset on ocs error, including below ufshcd_decide_eh_task_req?
>>>>> It's just overall command status.
>>>>
>>>> Yes, the error handling section in the UFS 1.1 spec. mentions so.
>>> If host's reset is required, it should be allowed in fatal situation.
>>> Deciding with OCS field seems not proper. There is no mentions for that in spec.
>>> If I have a wrong information, please let it clear.
>>>
>>
>> You can refer to section 8.3 of HCI spec.
>> On fatal errors the controller h/w will have to update the OCS field of
>> the command that caused error and then raise an fatal error interrupt.
>> The s/w reads the OCS value and determine commands that are in error
>> and then carry out reset.
> I don't think so.
> OCS field can be updated regardless of fatal error.
> As mentioned previously, your implementations are gathering all errors into 'ufshcd_fatal_err_handler'.
> It means that non-fatal error is also handled and if any OCS value, host reset will be reserved.
> 

Okay. Will remove the OCS dependency.

>>
>>>>
>>>>>
>>>>>> +		break;
>>>>>> +	default:
>>>>>> +		dev_err(hba->dev, "%s: unknown OCS 0x%x\n",
>>>>>> +				__func__, ocs);
>>>>>> +		BUG();
>>>>>> +	}
>>>>>> +}
>>>>>> +
>>>>>> +static void ufshcd_decide_eh_task_req(struct ufs_hba *hba, u32 ocs)
>>>>>> +{
>>>>>> +	switch (ocs) {
>>>>>> +	case OCS_TMR_SUCCESS:
>>>>>> +	case OCS_TMR_INVALID_COMMAND_STATUS:
>>>>>> +		break;
>>>>>> +	case OCS_TMR_MISMATCH_REQ_SIZE:
>>>>>> +	case OCS_TMR_MISMATCH_RESP_SIZE:
>>>>>> +	case OCS_TMR_PEER_COMM_FAILURE:
>>>>>> +	case OCS_TMR_INVALID_ATTR:
>>>>>> +	case OCS_TMR_ABORTED:
>>>>>> +	case OCS_TMR_FATAL_ERROR:
>>>>>> +		ufshcd_set_host_reset_pending(hba);
>>>>>> +		break;
>>>>>> +	default:
>>>>>> +		dev_err(hba->dev, "%s: unknown TMR OCS 0x%x\n",
>>>>>> +				__func__, ocs);
>>>>>> +		BUG();
>>>>>> +	}
>>>>>> +}
>>>>>> +
>>>>>>     /**
>>>>>> - * ufshcd_fatal_err_handler - handle fatal errors
>>>>>> + * ufshcd_error_autopsy_transfer_req() - reads OCS field of failed command and
>>>>>> + *                          decide error handling
>>>>>>      * @hba: per adapter instance
>>>>>> + * @err_xfer: bit mask for transfer request errors
>>>>>> + *
>>>>>> + * Iterate over completed transfer requests and
>>>>>> + * set error handling flags.
>>>>>> + */
>>>>>> +static void
>>>>>> +ufshcd_error_autopsy_transfer_req(struct ufs_hba *hba, u32 *err_xfer)
>>>>>> +{
>>>>>> +	unsigned long completed;
>>>>>> +	u32 doorbell;
>>>>>> +	int index;
>>>>>> +	int ocs;
>>>>>> +
>>>>>> +	if (!err_xfer)
>>>>>> +		goto out;
>>>>>> +
>>>>>> +	doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
>>>>>> +	completed = doorbell ^ (u32)hba->outstanding_reqs;
>>>>>> +
>>>>>> +	for (index = 0; index < hba->nutrs; index++) {
>>>>>> +		if (test_bit(index, &completed)) {
>>>>>> +			ocs = ufshcd_get_tr_ocs(&hba->lrb[index]);
>>>>>> +			if ((ocs == OCS_SUCCESS) ||
>>>>>> +					(ocs == OCS_INVALID_COMMAND_STATUS))
>>>>>> +				continue;
>>>>>> +
>>>>>> +			*err_xfer |= (1 << index);
>>>>>> +			ufshcd_decide_eh_xfer_req(hba, ocs);
>>>>>> +		}
>>>>>> +	}
>>>>>> +out:
>>>>>> +	return;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_error_autopsy_task_req() - reads OCS field of failed command and
>>>>>> + *                          decide error handling
>>>>>> + * @hba: per adapter instance
>>>>>> + * @err_tm: bit mask for task management errors
>>>>>> + *
>>>>>> + * Iterate over completed task management requests and
>>>>>> + * set error handling flags.
>>>>>> + */
>>>>>> +static void
>>>>>> +ufshcd_error_autopsy_task_req(struct ufs_hba *hba, u32 *err_tm)
>>>>>> +{
>>>>>> +	unsigned long completed;
>>>>>> +	u32 doorbell;
>>>>>> +	int index;
>>>>>> +	int ocs;
>>>>>> +
>>>>>> +	if (!err_tm)
>>>>>> +		goto out;
>>>>>> +
>>>>>> +	doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
>>>>>> +	completed = doorbell ^ (u32)hba->outstanding_tasks;
>>>>>> +
>>>>>> +	for (index = 0; index < hba->nutmrs; index++) {
>>>>>> +		if (test_bit(index, &completed)) {
>>>>>> +			struct utp_task_req_desc *tm_descp;
>>>>>> +
>>>>>> +			tm_descp = hba->utmrdl_base_addr;
>>>>>> +			ocs = ufshcd_get_tmr_ocs(&tm_descp[index]);
>>>>>> +			if ((ocs == OCS_TMR_SUCCESS) ||
>>>>>> +					(ocs == OCS_TMR_INVALID_COMMAND_STATUS))
>>>>>> +				continue;
>>>>>> +
>>>>>> +			*err_tm |= (1 << index);
>>>>>> +			ufshcd_decide_eh_task_req(hba, ocs);
>>>>>> +		}
>>>>>> +	}
>>>>>> +
>>>>>> +out:
>>>>>> +	return;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_fatal_err_handler - handle fatal errors
>>>>>> + * @work: pointer to work structure
>>>>>>      */
>>>>>>     static void ufshcd_fatal_err_handler(struct work_struct *work)
>>>>>>     {
>>>>>>     	struct ufs_hba *hba;
>>>>>> +	unsigned long flags;
>>>>>> +	u32 err_xfer = 0;
>>>>>> +	u32 err_tm = 0;
>>>>>> +	int err;
>>>>>> +
>>>>>>     	hba = container_of(work, struct ufs_hba, feh_workq);
>>>>>>
>>>>>>     	pm_runtime_get_sync(hba->dev);
>>>>>> -	/* check if reset is already in progress */
>>>>>> -	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
>>>>>> -		ufshcd_do_reset(hba);
>>>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>>>> +	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
>>>>>> +		/* complete processed requests and exit */
>>>>>> +		ufshcd_transfer_req_compl(hba);
>>>>>> +		ufshcd_tmc_handler(hba);
>>>>>> +		spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>>>> +		pm_runtime_put_sync(hba->dev);
>>>>>> +		return;
>>>>> Host driver is here with finishing 'scsi_block_requests'.
>>>>> 'scsi_unblock_requests' can be called somewhere?
>>>>
>>>> No, but it can be possible that SCSI command timeout which triggers
>>>> device/host reset and fatal error handler race each other.
>>> Sorry, I didn't get your meaning exactly.
>>> I saw that scsi_block_requests is done before ufshcd_fatal_err_handler is scheduled.
>>> If device or host was requested from scsi mid-layer just before ufshcd_fatal_err_handler,
>>> ufshcd_fatal_err_handler will be out through if statement. Then, there is nowhere to call
>>> scsi_unblock_requests though device/host reset is done successfully.
>>
>> You are right, this should return with scsi_unblock_requests()
>> called and there is no need to complete the processed requests as we
>> might be in middle of something else while the RESET is in progress.
>>
>>
>>>>
>>>>>
>>>>>> +	}
>>>>>> +
>>>>>> +	hba->ufshcd_state = UFSHCD_STATE_RESET;
>>>>>> +	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
>>>>>> +	ufshcd_error_autopsy_task_req(hba, &err_tm);
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Complete successful and pending transfer requests.
>>>>>> +	 * DID_REQUEUE is returned for pending requests as they have
>>>>>> +	 * nothing to do with error'ed request and SCSI layer should
>>>>>> +	 * not treat them as errors and decrement retry count.
>>>>>> +	 */
>>>>>> +	hba->outstanding_reqs &= ~err_xfer;
>>>>>> +	ufshcd_transfer_req_compl(hba);
>>>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>>>> +	ufshcd_complete_pending_reqs(hba);
>>>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>>>> +	hba->outstanding_reqs |= err_xfer;
>>>>> Hmm... error handling seems so complicated.
>>>>> To simplify it, how about below?
>>>>>
>>>>> 1. If requests(transfer or task management) are completed, finish them with success/failure.
>>>> This is what we are trying to do above.
>>>>
>>>>> 2. If there are pending requests, abort them.
>>>> No, if a fatal error is occurred it is possible that host controller is
>>>> freez'ed we are not sure if it can take task management commands and
>>>> execute them.
>>> I meant aborting the request by clearing the corresponding UTRLCLR/UTMRLCLR.
>>>
>>
>> I am doing the same in this patch -
>> 1) Return to SCSI the successful commands.
>> 2) Clear the pending commands (those that did not cause the error) by
>> writing into the UTRLCLR/UTMRLCLR registers, so scsi_host_result = DID_REQUEUE.
>> 3) Reset and return the commands that "caused error" to SCSI with
>> DID_ERROR.
>>
>> Am I doing anything extra than what you have suggested?
> If some are cleared, let me review more.
> 
>>
>>>>
>>>>> 3. If fatal error, reset.
>>>>>
>>>>
>>>>
>>>>>> +
>>>>>> +	/* Complete successful and pending task requests */
>>>>>> +	hba->outstanding_tasks &= ~err_tm;
>>>>>> +	ufshcd_tmc_handler(hba);
>>>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>>>> +	ufshcd_complete_pending_tasks(hba);
>>>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>>>> +
>>>>>> +	hba->outstanding_tasks |= err_tm;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Controller may generate multiple fatal errors, handle
>>>>>> +	 * errors based on severity.
>>>>>> +	 * 1) DEVICE_FATAL_ERROR
>>>>>> +	 * 2) SYSTEM_BUS/CONTROLLER_FATAL_ERROR
>>>>>> +	 * 3) UIC_ERROR
>>>>>> +	 */
>>>>>> +	if (hba->errors & DEVICE_FATAL_ERROR) {
>>>>>> +		/*
>>>>>> +		 * Some HBAs may not clear UTRLDBR/UTMRLDBR or update
>>>>>> +		 * OCS field on device fatal error.
>>>>>> +		 */
>>>>>> +		ufshcd_set_host_reset_pending(hba);
>>>>> In DEVICE_FATAL_ERROR, ufshcd_device_reset_pending is right?
>>>>
>>>> It looks so, but the spec. mentions to reset the host as well (8.3.6).
>>> Do you pointing below?
>>> [8.3.6. Device Errors are fatal errors. ...the host software shall reset the device too.]
>>>
>>
>> I meant "8.3.6: When this condition occurs, host software shall follow
>> the same procedure for UIC error handling as described in 8.2.2,". There
>> is an error in the spec. it was not 8.2.2 but 8.3.2 for UIC error
>> handling. So going by 8.3.2 HCE needs to be toggled.
> I feel that 8.3.2 of the spec. makes it difficult to distinguish a 'device fatal error' from a fatal UIC error.
> The spec. needs to clarify this.

Yes, we can post this to JEDEC and ask them to clarify the error
handling section with more details.

> 
> Anyway, I found a description related to the host reset.
> 5.3.1 Device Fatal Error Status (DFES):
> ...
> If the error occurs, host SW should reset the host controller.
> It's explicit. If the spec. is right, we should reset the host.

For now, I guess it is safe to have the following implementation until
the spec. clarifies this -

1) For any non-fatal errors (that still require s/w attention), clear
the door-bell register and retry the command.

2) For any fatal errors (UIC fatal, device fatal, host fatal, system
bus fatal), reset the controller and device and re-establish the link.

Given that these fatal errors are very rare, the delay involved in
resetting and re-establishing the link should not be much of a concern
unless the device is really bad and keeps failing even after reset.
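
A minimal sketch of the two-way policy above, only to make the intent concrete. All names and bit values below are hypothetical and chosen for illustration; they are not the definitions from ufshci.h, and the real handler would act on hba->errors and hba->uic_error rather than on a single mask.

```c
#include <stdint.h>

/* Hypothetical error bits, for illustration only. */
enum sketch_err {
	SK_UIC_NON_FATAL    = 1u << 0, /* e.g. NL/TL/DME errors */
	SK_UIC_FATAL        = 1u << 1, /* e.g. PA_INIT error */
	SK_DEVICE_FATAL     = 1u << 2,
	SK_HOST_FATAL       = 1u << 3,
	SK_SYSTEM_BUS_FATAL = 1u << 4,
};

#define SK_FATAL_MASK (SK_UIC_FATAL | SK_DEVICE_FATAL | \
		       SK_HOST_FATAL | SK_SYSTEM_BUS_FATAL)

enum sketch_action {
	SK_ACTION_NONE,            /* nothing to do */
	SK_ACTION_CLEAR_AND_RETRY, /* 1) clear door-bell, retry command */
	SK_ACTION_FULL_RESET,      /* 2) reset host + device, restore link */
};

/* Any fatal bit wins over non-fatal bits, so the reset is never skipped. */
static enum sketch_action sketch_decide_action(uint32_t errors)
{
	if (errors & SK_FATAL_MASK)
		return SK_ACTION_FULL_RESET;
	if (errors & SK_UIC_NON_FATAL)
		return SK_ACTION_CLEAR_AND_RETRY;
	return SK_ACTION_NONE;
}
```

The check order matters here: when fatal and non-fatal errors are reported together, the fatal path is taken.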

> 
>>
>>
>>
>>>>
>>>>>
>>>>>> +	} else if (hba->errors & (SYSTEM_BUS_FATAL_ERROR |
>>>>>> +			CONTROLLER_FATAL_ERROR)) {
>>>>>> +		/* eh flags should be set in err autopsy based on OCS values */
>>>>>> +		if (!hba->eh_flags)
>>>>>> +			WARN(1, "%s: fatal error without error handling\n",
>>>>>> +				dev_name(hba->dev));
>>>>>> +	} else if (hba->errors & UIC_ERROR) {
>>>>>> +		if (hba->uic_error & UFSHCD_UIC_DL_PA_INIT_ERROR) {
>>>>>> +			/* fatal error - reset controller */
>>>>>> +			ufshcd_set_host_reset_pending(hba);
>>>>>> +		} else if (hba->uic_error & (UFSHCD_UIC_NL_ERROR |
>>>>>> +					UFSHCD_UIC_TL_ERROR |
>>>>>> +					UFSHCD_UIC_DME_ERROR)) {
>>>>>> +			/* non-fatal, report error to SCSI layer */
>>>>>> +			if (!hba->eh_flags) {
>>>>>> +				spin_unlock_irqrestore(
>>>>>> +						hba->host->host_lock, flags);
>>>>>> +				ufshcd_complete_pending_reqs(hba);
>>>>>> +				ufshcd_complete_pending_tasks(hba);
>>>>>> +				spin_lock_irqsave(hba->host->host_lock, flags);
>>>>>> +			}
>>>>>> +		}
>>>>>> +	}
>>>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>>>> +
>>>>>> +	if (hba->eh_flags) {
>>>>>> +		err = ufshcd_reset_and_restore(hba);
>>>>>> +		if (err) {
>>>>>> +			ufshcd_clear_host_reset_pending(hba);
>>>>>> +			ufshcd_clear_device_reset_pending(hba);
>>>>>> +			dev_err(hba->dev, "%s: reset and restore failed\n",
>>>>>> +					__func__);
>>>>>> +			hba->ufshcd_state = UFSHCD_STATE_ERROR;
>>>>>> +		}
>>>>>> +		/*
>>>>>> +		 * Inform scsi mid-layer that we did reset and allow to handle
>>>>>> +		 * Unit Attention properly.
>>>>>> +		 */
>>>>>> +		scsi_report_bus_reset(hba->host, 0);
>>>>>> +		hba->errors = 0;
>>>>>> +		hba->uic_error = 0;
>>>>>> +	}
>>>>>> +	scsi_unblock_requests(hba->host);
>>>>>>     	pm_runtime_put_sync(hba->dev);
>>>>>>     }
>>>>>>
>>>>>>     /**
>>>>>> - * ufshcd_err_handler - Check for fatal errors
>>>>>> - * @work: pointer to a work queue structure
>>>>>> + * ufshcd_update_uic_error - check and set fatal UIC error flags.
>>>>>> + * @hba: per-adapter instance
>>>>>>      */
>>>>>> -static void ufshcd_err_handler(struct ufs_hba *hba)
>>>>>> +static void ufshcd_update_uic_error(struct ufs_hba *hba)
>>>>>>     {
>>>>>>     	u32 reg;
>>>>>>
>>>>>> +	/* PA_INIT_ERROR is fatal and needs UIC reset */
>>>>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
>>>>>> +	if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
>>>>>> +		hba->uic_error |= UFSHCD_UIC_DL_PA_INIT_ERROR;
>>>>>> +
>>>>>> +	/* UIC NL/TL/DME errors needs software retry */
>>>>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_NETWORK_LAYER);
>>>>>> +	if (reg)
>>>>>> +		hba->uic_error |= UFSHCD_UIC_NL_ERROR;
>>>>>> +
>>>>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_TRANSPORT_LAYER);
>>>>>> +	if (reg)
>>>>>> +		hba->uic_error |= UFSHCD_UIC_TL_ERROR;
>>>>>> +
>>>>>> +	reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DME);
>>>>>> +	if (reg)
>>>>>> +		hba->uic_error |= UFSHCD_UIC_DME_ERROR;
>>>>> REG_UIC_ERROR_CODE_PHY_ADAPTER_LAYER is not handled.
>>>>
>>>> UFS spec. mentions that it is non-fatal error and UIC recovers
>>>> by itself and doesn't need software intervention.
>>> Ok.
>>>
>>>>
>>>>>
>>>>>> +
>>>>>> +	dev_dbg(hba->dev, "%s: UIC error flags = 0x%08x\n",
>>>>>> +			__func__, hba->uic_error);
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_err_handler - Check for fatal errors
>>>>>> + * @hba: per-adapter instance
>>>>>> + */
>>>>>> +static void ufshcd_err_handler(struct ufs_hba *hba)
>>>>>> +{
>>>>>>     	if (hba->errors & INT_FATAL_ERRORS)
>>>>>>     		goto fatal_eh;
>>>>>>
>>>>>>     	if (hba->errors & UIC_ERROR) {
>>>>>> -		reg = ufshcd_readl(hba, REG_UIC_ERROR_CODE_DATA_LINK_LAYER);
>>>>>> -		if (reg & UIC_DATA_LINK_LAYER_ERROR_PA_INIT)
>>>>>> +		hba->uic_error = 0;
>>>>>> +		ufshcd_update_uic_error(hba);
>>>>>> +		if (hba->uic_error)
>>>>> Except UFSHCD_UIC_DL_PA_INIT_ERROR, it's not fatal. Should it go to fatal_eh?
>>>>
>>>> Please see the UIC error handling in ufshcd_fatal_err_handler(), others
>>>> need software intervention so I combined it with fatal_eh to complete
>>>> the requests and report to SCSI.
>>> Gathering all error handling (fatal and non-fatal) into the original one is confusing.
>>> In that case, it would be better to rename ufshcd_fatal_err_handler.
>>>
>>
>> Yeah, ufshcd_err_handler is apt but it is already consumed. Probably,
>> ufshcd_err_handler -> ufshcd_check_errors
>> ufshcd_fatal_err_handler -> ufshcd_err_handler
>> rename would be fine?
> I like it.
> 
> Thanks,
> Seungwon Jeon
> 

Thanks a lot for reviewing the patches :)
I will update the patchset shortly.
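
For reference, the control flow after the rename agreed above could look
roughly like the userspace model below. It is only a sketch: err_model,
the M_* bit values and the uic_flags parameter (standing in for whatever
ufshcd_update_uic_error() gathers from the UIC error registers) are
hypothetical names, not driver code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real ones live in the driver headers. */
#define M_INT_FATAL_ERRORS 0x1u
#define M_UIC_ERROR        0x2u

/* Hypothetical stand-in for the few hba fields involved. */
struct err_model {
	uint32_t errors;      /* latched interrupt status */
	uint32_t uic_error;   /* accumulated UIC error flags */
	bool operational;     /* state == OPERATIONAL */
	bool eh_scheduled;    /* error handler work queued */
};

/* Models the interrupt-side check (proposed name: ufshcd_check_errors). */
static void model_check_errors(struct err_model *hba, uint32_t uic_flags)
{
	bool need_eh = false;

	if (hba->errors & M_INT_FATAL_ERRORS) {
		need_eh = true;
	} else if (hba->errors & M_UIC_ERROR) {
		/* any UIC flag that needs software action goes to the handler */
		hba->uic_error = uic_flags;
		need_eh = hba->uic_error != 0;
	}

	/* queue the handler only while the link is functional */
	if (need_eh && hba->operational) {
		hba->operational = false;
		hba->eh_scheduled = true;
	}
}
```

The point of the sketch is that the check only classifies and schedules,
while the actual recovery stays in the (renamed) error handler.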

-- 
Regards,
Sujit


* Re: [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods
  2013-07-24 13:39           ` Seungwon Jeon
@ 2013-07-29  9:45             ` Sujit Reddy Thumma
  0 siblings, 0 replies; 27+ messages in thread
From: Sujit Reddy Thumma @ 2013-07-29  9:45 UTC (permalink / raw)
  To: Seungwon Jeon
  Cc: 'Vinayak Holikatti', 'Santosh Y',
	'James E.J. Bottomley',
	linux-scsi, linux-arm-msm

On 7/24/2013 7:09 PM, Seungwon Jeon wrote:
> On Wed, July 24, 2013, Sujit Reddy Thumma wrote:
>> On 7/23/2013 1:57 PM, Seungwon Jeon wrote:
>>> On Sat, July 20, 2013, Sujit Reddy Thumma wrote:
>>>> On 7/19/2013 7:27 PM, Seungwon Jeon wrote:
>>>>> On Tue, July 09, 2013, Sujit Reddy Thumma wrote:
>>>>>> As of now SCSI initiated error handling is broken because,
>>>>>> the reset APIs don't try to bring back the device initialized and
>>>>>> ready for further transfers.
>>>>>>
>>>>>> In case of timeouts, the scsi error handler takes care of handling aborts
>>>>>> and resets. Improve the error handling in such scenario by resetting the
>>>>>> device and host and re-initializing them in proper manner.
>>>>>>
>>>>>> Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
>>>>>> ---
>>>>>>     drivers/scsi/ufs/ufshcd.c |  467 +++++++++++++++++++++++++++++++++++++++------
>>>>>>     drivers/scsi/ufs/ufshcd.h |    2 +
>>>>>>     2 files changed, 411 insertions(+), 58 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>>>>>> index 51ce096..b4c9910 100644
>>>>>> --- a/drivers/scsi/ufs/ufshcd.c
>>>>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>>>>> @@ -69,9 +69,15 @@ enum {
>>>>>>
>>>>>>     /* UFSHCD states */
>>>>>>     enum {
>>>>>> -	UFSHCD_STATE_OPERATIONAL,
>>>>>>     	UFSHCD_STATE_RESET,
>>>>>>     	UFSHCD_STATE_ERROR,
>>>>>> +	UFSHCD_STATE_OPERATIONAL,
>>>>>> +};
>>>>>> +
>>>>>> +/* UFSHCD error handling flags */
>>>>>> +enum {
>>>>>> +	UFSHCD_EH_HOST_RESET_PENDING = (1 << 0),
>>>>>> +	UFSHCD_EH_DEVICE_RESET_PENDING = (1 << 1),
>>>>>>     };
>>>>>>
>>>>>>     /* Interrupt configuration options */
>>>>>> @@ -87,6 +93,22 @@ enum {
>>>>>>     	INT_AGGR_CONFIG,
>>>>>>     };
>>>>>>
>>>>>> +#define ufshcd_set_device_reset_pending(h) \
>>>>>> +	(h->eh_flags |= UFSHCD_EH_DEVICE_RESET_PENDING)
>>>>>> +#define ufshcd_set_host_reset_pending(h) \
>>>>>> +	(h->eh_flags |= UFSHCD_EH_HOST_RESET_PENDING)
>>>>>> +#define ufshcd_device_reset_pending(h) \
>>>>>> +	(h->eh_flags & UFSHCD_EH_DEVICE_RESET_PENDING)
>>>>>> +#define ufshcd_host_reset_pending(h) \
>>>>>> +	(h->eh_flags & UFSHCD_EH_HOST_RESET_PENDING)
>>>>>> +#define ufshcd_clear_device_reset_pending(h) \
>>>>>> +	(h->eh_flags &= ~UFSHCD_EH_DEVICE_RESET_PENDING)
>>>>>> +#define ufshcd_clear_host_reset_pending(h) \
>>>>>> +	(h->eh_flags &= ~UFSHCD_EH_HOST_RESET_PENDING)
>>>>>> +
>>>>>> +static void ufshcd_tmc_handler(struct ufs_hba *hba);
>>>>>> +static void ufshcd_async_scan(void *data, async_cookie_t cookie);
>>>>>> +
>>>>>>     /*
>>>>>>      * ufshcd_wait_for_register - wait for register value to change
>>>>>>      * @hba - per-adapter interface
>>>>>> @@ -851,9 +873,22 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>>>>>>
>>>>>>     	tag = cmd->request->tag;
>>>>>>
>>>>>> -	if (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL) {
>>>>>> +	switch (hba->ufshcd_state) {
>>>>> Is no lock needed for ufshcd_state?
>>> Please check?
>>
>> Yes, it is needed. Thanks for catching this.
>>
>>>
>>>>>
>>>>>> +	case UFSHCD_STATE_OPERATIONAL:
>>>>>> +		break;
>>>>>> +	case UFSHCD_STATE_RESET:
>>>>>>     		err = SCSI_MLQUEUE_HOST_BUSY;
>>>>>>     		goto out;
>>>>>> +	case UFSHCD_STATE_ERROR:
>>>>>> +		set_host_byte(cmd, DID_ERROR);
>>>>>> +		cmd->scsi_done(cmd);
>>>>>> +		goto out;
>>>>>> +	default:
>>>>>> +		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
>>>>>> +				__func__, hba->ufshcd_state);
>>>>>> +		set_host_byte(cmd, DID_BAD_TARGET);
>>>>>> +		cmd->scsi_done(cmd);
>>>>>> +		goto out;
>>>>>>     	}
>>>>>>
>>>>>>     	/* acquire the tag to make sure device cmds don't use it */
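
On the ufshcd_state locking point raised in the hunk above, the idea is to
make the state check and the resulting decision under the same lock that
the interrupt path holds when it changes the state. A minimal userspace
sketch, assuming a pthread mutex in place of host_lock; hba_model,
hba_state and queue_verdict are hypothetical names:

```c
#include <pthread.h>

enum hba_state { STATE_RESET, STATE_ERROR, STATE_OPERATIONAL };
enum queue_verdict { QUEUE_OK, QUEUE_BUSY, QUEUE_FAIL };

struct hba_model {
	pthread_mutex_t lock;   /* stands in for hba->host->host_lock */
	enum hba_state state;   /* stands in for hba->ufshcd_state */
};

/* Decide what to do with a new command while holding the lock. */
static enum queue_verdict model_queuecommand(struct hba_model *hba)
{
	enum queue_verdict verdict;

	pthread_mutex_lock(&hba->lock);
	switch (hba->state) {
	case STATE_OPERATIONAL:
		verdict = QUEUE_OK;
		break;
	case STATE_RESET:
		verdict = QUEUE_BUSY;   /* ask the mid-layer to retry later */
		break;
	case STATE_ERROR:
	default:
		verdict = QUEUE_FAIL;   /* complete the command with an error */
		break;
	}
	pthread_mutex_unlock(&hba->lock);

	return verdict;
}
```

In the driver the lock would be taken with spin_lock_irqsave(), since the
state is also written from interrupt context.
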
>>>>>> @@ -1573,8 +1608,6 @@ static int ufshcd_make_hba_operational(struct ufs_hba *hba)
>>>>>>     	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
>>>>>>     		scsi_unblock_requests(hba->host);
>>>>>>
>>>>>> -	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
>>>>>> -
>>>>>>     out:
>>>>>>     	return err;
>>>>>>     }
>>>>>> @@ -2273,6 +2306,106 @@ out:
>>>>>>     }
>>>>>>
>>>>>>     /**
>>>>>> + * ufshcd_utrl_is_rsr_enabled - check if run-stop register is enabled
>>>>>> + * @hba: per-adapter instance
>>>>>> + */
>>>>>> +static bool ufshcd_utrl_is_rsr_enabled(struct ufs_hba *hba)
>>>>>> +{
>>>>>> +	return ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_LIST_RUN_STOP) & 0x1;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_utmrl_is_rsr_enabled - check if run-stop register is enabled
>>>>>> + * @hba: per-adapter instance
>>>>>> + */
>>>>>> +static bool ufshcd_utmrl_is_rsr_enabled(struct ufs_hba *hba)
>>>>>> +{
>>>>>> +	return ufshcd_readl(hba, REG_UTP_TASK_REQ_LIST_RUN_STOP) & 0x1;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_complete_pending_tasks - complete outstanding tasks
>>>>>> + * @hba: per adapter instance
>>>>>> + *
>>>>>> + * Abort in-progress task management commands and wakeup
>>>>>> + * waiting threads.
>>>>>> + *
>>>>>> + * Returns non-zero error value when failed to clear all the commands.
>>>>>> + */
>>>>>> +static int ufshcd_complete_pending_tasks(struct ufs_hba *hba)
>>>>>> +{
>>>>>> +	u32 reg;
>>>>>> +	int err = 0;
>>>>>> +	unsigned long flags;
>>>>>> +
>>>>>> +	if (!hba->outstanding_tasks)
>>>>>> +		goto out;
>>>>>> +
>>>>>> +	/* Clear UTMRL only when run-stop is enabled */
>>>>>> +	if (ufshcd_utmrl_is_rsr_enabled(hba))
>>>>>> +		ufshcd_writel(hba, ~hba->outstanding_tasks,
>>>>>> +				REG_UTP_TASK_REQ_LIST_CLEAR);
>>>>>> +
>>>>>> +	/* poll for max. 1 sec to clear door bell register by h/w */
>>>>>> +	reg = ufshcd_wait_for_register(hba,
>>>>>> +			REG_UTP_TASK_REQ_DOOR_BELL,
>>>>>> +			hba->outstanding_tasks, 0, 1000, 1000);
>>>>>> +	if (reg & hba->outstanding_tasks)
>>>>>> +		err = -ETIMEDOUT;
>>>>>> +
>>>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>>>> +	/* complete commands that were cleared out */
>>>>>> +	ufshcd_tmc_handler(hba);
>>>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>>>> +out:
>>>>>> +	if (err)
>>>>>> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
>>>>>> +				__func__, reg);
>>>>>> +	return err;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_complete_pending_reqs - complete outstanding requests
>>>>>> + * @hba: per adapter instance
>>>>>> + *
>>>>>> + * Abort in-progress transfer request commands and return them to SCSI.
>>>>>> + *
>>>>>> + * Returns non-zero error value when failed to clear all the commands.
>>>>>> + */
>>>>>> +static int ufshcd_complete_pending_reqs(struct ufs_hba *hba)
>>>>>> +{
>>>>>> +	u32 reg;
>>>>>> +	int err = 0;
>>>>>> +	unsigned long flags;
>>>>>> +
>>>>>> +	/* check if we completed all of them */
>>>>>> +	if (!hba->outstanding_reqs)
>>>>>> +		goto out;
>>>>>> +
>>>>>> +	/* Clear UTRL only when run-stop is enabled */
>>>>>> +	if (ufshcd_utrl_is_rsr_enabled(hba))
>>>>>> +		ufshcd_writel(hba, ~hba->outstanding_reqs,
>>>>>> +				REG_UTP_TRANSFER_REQ_LIST_CLEAR);
>>>>>> +
>>>>>> +	/* poll for max. 1 sec to clear door bell register by h/w */
>>>>>> +	reg = ufshcd_wait_for_register(hba,
>>>>>> +			REG_UTP_TRANSFER_REQ_DOOR_BELL,
>>>>>> +			hba->outstanding_reqs, 0, 1000, 1000);
>>>>>> +	if (reg & hba->outstanding_reqs)
>>>>>> +		err = -ETIMEDOUT;
>>>>>> +
>>>>>> +	spin_lock_irqsave(hba->host->host_lock, flags);
>>>>>> +	/* complete commands that were cleared out */
>>>>>> +	ufshcd_transfer_req_compl(hba);
>>>>>> +	spin_unlock_irqrestore(hba->host->host_lock, flags);
>>>>>> +out:
>>>>>> +	if (err)
>>>>>> +		dev_err(hba->dev, "%s: failed, still pending = 0x%.8x\n",
>>>>>> +				__func__, reg);
>>>>>> +	return err;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>>      * ufshcd_fatal_err_handler - handle fatal errors
>>>>>>      * @hba: per adapter instance
>>>>>>      */
>>>>>> @@ -2306,8 +2439,12 @@ static void ufshcd_err_handler(struct ufs_hba *hba)
>>>>>>     	}
>>>>>>     	return;
>>>>>>     fatal_eh:
>>>>>> -	hba->ufshcd_state = UFSHCD_STATE_ERROR;
>>>>>> -	schedule_work(&hba->feh_workq);
>>>>>> +	/* handle fatal errors only when link is functional */
>>>>>> +	if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
>>>>>> +		/* block commands at driver layer until error is handled */
>>>>>> +		hba->ufshcd_state = UFSHCD_STATE_ERROR;
>>>>> Locking omitted for ufshcd_state?
>>>> This is called in interrupt context with spin_lock held.
>>> Right, I missed it.
>>>
>>>>
>>>>>
>>>>>> +		schedule_work(&hba->feh_workq);
>>>>>> +	}
>>>>>>     }
>>>>>>
>>>>>>     /**
>>>>>> @@ -2475,75 +2612,155 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
>>>>>>     }
>>>>>>
>>>>>>     /**
>>>>>> - * ufshcd_device_reset - reset device and abort all the pending commands
>>>>>> - * @cmd: SCSI command pointer
>>>>>> + * ufshcd_dme_end_point_reset - Notify device Unipro to perform reset
>>>>>> + * @hba: per adapter instance
>>>>>>      *
>>>>>> - * Returns SUCCESS/FAILED
>>>>>> + * UIC_CMD_DME_END_PT_RST resets the UFS device completely, the UFS flags,
>>>>>> + * attributes and descriptors are reset to default state. Callers are
>>>>>> + * expected to initialize the whole device again after this.
>>>>>> + *
>>>>>> + * Returns zero on success, non-zero on failure
>>>>>>      */
>>>>>> -static int ufshcd_device_reset(struct scsi_cmnd *cmd)
>>>>>> +static int ufshcd_dme_end_point_reset(struct ufs_hba *hba)
>>>>>>     {
>>>>>> -	struct Scsi_Host *host;
>>>>>> -	struct ufs_hba *hba;
>>>>>> -	unsigned int tag;
>>>>>> -	u32 pos;
>>>>>> -	int err;
>>>>>> -	u8 resp;
>>>>>> -	struct ufshcd_lrb *lrbp;
>>>>>> +	struct uic_command uic_cmd = {0};
>>>>>> +	int ret;
>>>>>>
>>>>>> -	host = cmd->device->host;
>>>>>> -	hba = shost_priv(host);
>>>>>> -	tag = cmd->request->tag;
>>>>>> +	uic_cmd.command = UIC_CMD_DME_END_PT_RST;
>>>>>>
>>>>>> -	lrbp = &hba->lrb[tag];
>>>>>> -	err = ufshcd_issue_tm_cmd(hba, lrbp->lun, lrbp->task_tag,
>>>>>> -			UFS_LOGICAL_RESET, &resp);
>>>>>> -	if (err || resp != UPIU_TASK_MANAGEMENT_FUNC_COMPL) {
>>>>>> -		err = FAILED;
>>>>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>>>>>> +	if (ret)
>>>>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>>>>>> +
>>>>>> +	return ret;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_dme_reset - Local UniPro reset
>>>>>> + * @hba: per adapter instance
>>>>>> + *
>>>>>> + * Returns zero on success, non-zero on failure
>>>>>> + */
>>>>>> +static int ufshcd_dme_reset(struct ufs_hba *hba)
>>>>>> +{
>>>>>> +	struct uic_command uic_cmd = {0};
>>>>>> +	int ret;
>>>>>> +
>>>>>> +	uic_cmd.command = UIC_CMD_DME_RESET;
>>>>>> +
>>>>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>>>>>> +	if (ret)
>>>>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>>>>>> +
>>>>>> +	return ret;
>>>>>> +
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_dme_enable - Local UniPro DME Enable
>>>>>> + * @hba: per adapter instance
>>>>>> + *
>>>>>> + * Returns zero on success, non-zero on failure
>>>>>> + */
>>>>>> +static int ufshcd_dme_enable(struct ufs_hba *hba)
>>>>>> +{
>>>>>> +	struct uic_command uic_cmd = {0};
>>>>>> +	int ret;
>>>>>> +	uic_cmd.command = UIC_CMD_DME_ENABLE;
>>>>>> +
>>>>>> +	ret = ufshcd_send_uic_cmd(hba, &uic_cmd);
>>>>>> +	if (ret)
>>>>>> +		dev_err(hba->dev, "%s: error code %d\n", __func__, ret);
>>>>>> +
>>>>>> +	return ret;
>>>>>> +
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * ufshcd_device_reset_and_restore - reset and restore device
>>>>>> + * @hba: per-adapter instance
>>>>>> + *
>>>>>> + * Note that the device reset issues DME_END_POINT_RESET which
>>>>>> + * may reset entire device and restore device attributes to
>>>>>> + * default state.
>>>>>> + *
>>>>>> + * Returns zero on success, non-zero on failure
>>>>>> + */
>>>>>> +static int ufshcd_device_reset_and_restore(struct ufs_hba *hba)
>>>>>> +{
>>>>>> +	int err = 0;
>>>>>> +	u32 reg;
>>>>>> +
>>>>>> +	err = ufshcd_dme_end_point_reset(hba);
>>>>>> +	if (err)
>>>>>> +		goto out;
>>>>>> +
>>>>>> +	/* restore communication with the device */
>>>>>> +	err = ufshcd_dme_reset(hba);
>>>>>> +	if (err)
>>>>>>     		goto out;
>>>>>> -	} else {
>>>>>> -		err = SUCCESS;
>>>>>> -	}
>>>>>>
>>>>>> -	for (pos = 0; pos < hba->nutrs; pos++) {
>>>>>> -		if (test_bit(pos, &hba->outstanding_reqs) &&
>>>>>> -		    (hba->lrb[tag].lun == hba->lrb[pos].lun)) {
>>>>>> +	err = ufshcd_dme_enable(hba);
>>>>>> +	if (err)
>>>>>> +		goto out;
>>>>>>
>>>>>> -			/* clear the respective UTRLCLR register bit */
>>>>>> -			ufshcd_utrl_clear(hba, pos);
>>>>>> +	err = ufshcd_dme_link_startup(hba);
>>>>> UFS_LOGICAL_RESET is no more used?
>>>>
>>>> Yes, I don't see any use for this as of now (given that we are using
>>>> dme_end_point_reset, refer to figure. 7.4 of UFS 1.1 spec). Also, the
>>>> UFS spec. error handling section doesn't mention anything about
>>>> LOGICAL_RESET. If you know a valid use case where we need to have LUN
>>>> reset, please let me know I will bring it back.
>>> Judging by the SCSI mid-layer and other hosts' implementations,
>>> eh_device_reset_handler (= ufshcd_eh_device_reset_handler) may
>>> take the role of LOGICAL_RESET for a specific LUN.
>>
>> I am still not convinced why we need LOGICAL_RESET. Just because other
>> SCSI host drivers have it do we really need it for UFS?
>>
>>> I found that ENDPOINT_RESET is recommended with IS.DFES in spec.
>>
>> Here in this case, a command hang (scsi timeout) is considered as Device
>> Fatal Error. If there are some LUN failures the response would still be
>> transferred but with Unit-Attention condition with sense data. However,
>> if the command itself hangs, there is something seriously wrong with the
>> device or the communication. So we first try to reset the device and
>> then the host. Unlike most of other SCSI HBAs, UFS is point-to-point
>> (host <--> device) link and if something goes wrong and caused a hang,
>> mostly would be a serious error and logical unit reset wouldn't help
>> much.
> As far as UFS follows the SAM-5 model, LOGICAL_RESET should be considered.
> LOGICAL_RESET would be handled in 'eh_device_reset_handler' as I see it.
> And it looks like actual device reset is close to 'eh_target_reset_handler'.

Okay. I will retain LOGICAL_RESET then. It looks like for UFS there
is no point in doing a device reset alone without also doing a host
reset, so I will retain the flow with some fixups.
I haven't gone into the details, but target_reset_handler is used by
SCSI target modules, so I am not sure it is appropriate for UFS.

> 
>>
>>
>>>
>>> Let me add some comments additionally.
>>> Both 'ufshcd_eh_device_reset_handler' and 'ufshcd_host_reset_and_restore' do almost the same things.
>>> At a glance, their roles are confusing and mixed.
>>> 'ufshcd_reset_and_restore' is eventually called, which is the actual reset functionality; once
>>> device reset fails, host reset is tried.
>>> Actually, that is already handled for each level of error recovery in the SCSI mid-layer. Please check
>>> 'drivers/scsi/scsi_error.c' [scsi_eh_ready_devs, scsi_abort_eh_cmnd].
>>> At this stage, each reset function could be clearly separated.
>>
>> Yes, in that case we are optimistically doing the host reset twice,
>> in the hope that it recovers before the SCSI layer chokes and marks
>> the device as OFFLINE. If you think that this shouldn't be the case and
>> have a valid reason for not doing so, I will return an appropriate
>> error when the device reset fails.
> The two are much the same actually.
> To simplify implementation in host driver, leaving it to scsi mid-layer would be better.
> Eventually, the controlling of callback function is from upper layer.

Okay.
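
To spell out what leaving the escalation to the mid-layer means: each
callback performs one level of reset and reports success or failure, and
the caller moves on to the next level only on failure. A small userspace
sketch of that shape follows; eh_step, eh_escalate, the EH_* codes and the
two stub callbacks are hypothetical and not taken from scsi_error.c.

```c
#include <stddef.h>

/* Hypothetical result codes for one recovery step. */
#define EH_SUCCESS 0
#define EH_FAILED  1

typedef int (*eh_step_fn)(void *ctx);

struct eh_step {
	const char *name;
	eh_step_fn fn;   /* NULL when the host driver has no such callback */
};

/*
 * Try each step in order and stop at the first one that succeeds.
 * Returns the index of the successful step, or -1 if every step failed.
 */
static int eh_escalate(const struct eh_step *steps, size_t n, void *ctx)
{
	for (size_t i = 0; i < n; i++) {
		if (steps[i].fn && steps[i].fn(ctx) == EH_SUCCESS)
			return (int)i;
	}
	return -1;
}

/* Illustrative stubs: count invocations through ctx, then fail or succeed. */
static int stub_fail(void *ctx)
{
	++*(int *)ctx;
	return EH_FAILED;
}

static int stub_ok(void *ctx)
{
	++*(int *)ctx;
	return EH_SUCCESS;
}
```

With a table of { device reset, target reset, host reset }, a failing
device reset simply returns its error and the host reset is attempted by
the caller, so the driver does not need to chain the two itself.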

> 
>>
>>>
>>>>
>>>>> ufshcd_device_reset_and_restore have a role of device reset.
>>>>> Both ufshcd_dme_reset and ufshcd_dme_enable are valid for local one, not for remote.
>>>>> Should we do those for host including link-startup here?
>>>>
>>>> Yes, it is needed. After DME_ENDPOINT_RESET the remote link goes into link down state.
>>> I want to know more related description. I didn't find it. Could you point that?
>>
>> Please refer to "Table 121 DME_SAP restrictions" of MIPI Uni-Pro spec.
>> The spec. doesn't mention about this explicitly but here is the logic
>> that is derived from the spec.
>> 1) The DME_LINKSTARTUP can be sent only when the link is in down state,
>> in all other states DME_LINKSTARTUP is ignored.
>> 2) So if we are sending DME_ENDPOINT_RESET then that must ensure that
>> remote link is in down state, and hence it can receive linkstartup and
>> establish the communication.
>>
>>>
>>>> To initialize the link, the host needs to send
>>>> DME_LINKSTARTUP, but according to Uni-Pro spec. the link-startup can
>>>> only be sent when the local uni-pro is in link-down state. So first
>>> If it's right you mentioned above, uni-pro state is already in link-down after DME_ENDPOINT_RESET.
>>> Then, DME_RESET isn't needed.
>>
>> You are getting confused here -
> Yeah, I'm mixed up with link-down of remote side you mentioned.
> There is no special saying about link state after receiving DME_ENDPOINT_RESET in spec.
> I just found the point that link startup is initiated.
> It means that link startup is triggered from remote device after DME_ENDPOINT_RESET.
> In that case, host will detect ULSS(UIC Link Startup Status) interrupt.
> After that, host shall start link startup procedure with DME_RESET.

Yes, either way it works.

> Of course, your approach could be acceptable.
> 
>>
>> - State1: before sending DME_ENDPOINT_RESET
>> 	Local Unipro (host) - Link-UP
>> 	Remote Unipro (device) - Link-Up
>>
>> - State2: after sending DME_ENDPOINT_RESET
>> 	Local Unipro (host) - Link-UP
>> 	Remote Unipro (device) - Link-Down
>>
>> - State3: After sending DME_RESET+DME_ENABLE
>> 	Local Unipro (host) - Link-Down
>> 	Remote Unipro (device) - Link-Down
>>
>> - State4: After sending DME_LINKSTARTUP
>> 	Local Unipro(host) - Link-up
>> 	Remote Unipro (device) - Link-up
>>
>> The local unipro ignores the DME_LINKSTARTUP if we send it before
>> DME_RESET.
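
The four states listed above can be written down as a tiny model, which
makes the ordering constraint easy to check. This is a sketch of the
reasoning in this thread only, not of the UniPro state machine; link_model
and the model_* helpers are hypothetical names, and the effect of
DME_ENDPOINT_RESET on the remote side is the assumption discussed above.

```c
#include <stdbool.h>

/* Hypothetical per-side link state. */
enum link_state { LINK_DOWN, LINK_UP };

struct link_model {
	enum link_state local;    /* host UniPro */
	enum link_state remote;   /* device UniPro */
};

/* DME_ENDPOINT_RESET: assumed to leave the remote side in link-down. */
static void model_end_point_reset(struct link_model *l)
{
	l->remote = LINK_DOWN;
}

/* DME_RESET followed by DME_ENABLE: local side ends up in link-down. */
static void model_reset_and_enable(struct link_model *l)
{
	l->local = LINK_DOWN;
}

/*
 * DME_LINKSTARTUP: honoured only when the local side is link-down.
 * Returns false when the request is ignored.
 */
static bool model_link_startup(struct link_model *l)
{
	if (l->local != LINK_DOWN)
		return false;
	l->local = LINK_UP;
	l->remote = LINK_UP;
	return true;
}
```

Starting from both sides up, calling model_link_startup() right after
model_end_point_reset() is ignored; it only takes effect once
model_reset_and_enable() has run, which matches State2 to State4.
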
>>
>>>
>>>> we need to get the local unipro from link-up to disabled to link-down
>>>> using the DME_RESET and DME_ENABLE commands and then issue
>>>> DME_LINKSTARTUP to re-initialize the link.
>>> 'ufshcd_hba_enable' can be used instead of both if these are really needed.
>>> This will do dme_reset and dme_enable.
>>>
>>
>> The only reason for this is that in some implementations the HCE reset
>> also resets UTP layer in addition to Uni-Pro layer. There is no need
>> of UTP layer reset for device reset. So explicit DME_RESET and
>> DME_ENABLE is used. For those implementations which don't do UTP layer
>> reset then the advantage is instead of wasting CPU cycles in polling for
>> HCE=1 we depend on UIC interrupts.
> HCE reset can involve the additional unipro configurations, depending
> host controller implementation.
> As considering that unipro stack is reset with DME_RESET,
> usage of individual DME_RESET might be inappropriate during link startup.

Hmm.. yes, HCI spec. mentions this so I can't disagree.


-- 
Regards,
Sujit

Thread overview: 27+ messages
2013-07-09  9:16 [PATCH V3 0/4] scsi: ufs: Improve UFS error handling Sujit Reddy Thumma
2013-07-09  9:16 ` [PATCH V3 1/4] scsi: ufs: Fix broken task management command implementation Sujit Reddy Thumma
2013-07-09 10:42   ` merez
2013-07-19 13:56   ` Seungwon Jeon
2013-07-19 18:26     ` Sujit Reddy Thumma
2013-07-23  8:24       ` Seungwon Jeon
2013-07-23 15:40         ` Sujit Reddy Thumma
2013-07-09  9:16 ` [PATCH V3 2/4] scsi: ufs: Fix hardware race conditions while aborting a command Sujit Reddy Thumma
2013-07-09 10:42   ` merez
2013-07-19 13:56   ` Seungwon Jeon
2013-07-19 18:26     ` Sujit Reddy Thumma
2013-07-09  9:16 ` [PATCH V3 3/4] scsi: ufs: Fix device and host reset methods Sujit Reddy Thumma
2013-07-09 10:43   ` merez
2013-07-19 13:57   ` Seungwon Jeon
2013-07-19 18:26     ` Sujit Reddy Thumma
2013-07-23  8:27       ` Seungwon Jeon
2013-07-23 15:40         ` Sujit Reddy Thumma
2013-07-24 13:39           ` Seungwon Jeon
2013-07-29  9:45             ` Sujit Reddy Thumma
2013-07-09  9:16 ` [PATCH V3 4/4] scsi: ufs: Improve UFS fatal error handling Sujit Reddy Thumma
2013-07-09 10:43   ` merez
2013-07-19 13:58   ` Seungwon Jeon
2013-07-19 18:26     ` Sujit Reddy Thumma
2013-07-23  8:34       ` Seungwon Jeon
2013-07-23 15:41         ` Sujit Reddy Thumma
2013-07-24 13:39           ` Seungwon Jeon
2013-07-29  9:45             ` Sujit Reddy Thumma
