From: Rajat Jain <rajatja@google.com>
To: "James E.J. Bottomley" <JBottomley@odin.com>,
linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: rajatxjain@gmail.com, Rajat Jain <rajatja@google.com>
Subject: [PATCH] scsi: Avoid potential infinite eh_timeout_handler() loop
Date: Thu, 4 Jun 2015 11:40:21 -0700 [thread overview]
Message-ID: <1433443222-8260-1-git-send-email-rajatja@google.com> (raw)
Each cmd timeout should result in scmd->retries++. Currently it happens
just only before a command is requeued back. However, if the LLD
eh_timed_out() handler asks to reset timer back again, then also it should
be incremented because effectively LLD will be given a full time period
(SD_TIMEOUT = 30 secs!) to attempt to complete the command.
Why this is a problem:
=> Currently the SCSI low level transport drivers can provide
eh_timeout_handler() calls (for e.g. iscsi provides this) to deal
with command timeouts.
=> The eh_timeout_handler() can return BLK_EH_RESET_TIMER that causes the
SCSI / block layer to reset the timer, thus giving more time to the
LLD.
=> Currently a LLD can potentially loop infinitely on a command if it
always keeps on returning BLK_EH_RESET_TIMER.
* => Other than choking its own devices, if the command that is stuck is a
command issued during sd_probe_async() (e.g. a partition table scan),
then it impacts all the disks because no other disks can be removed
from the system until sd_probe_async() returns. (sd_remove waits on
async_synchronize_full_domain(...))
=> This problem actually resulted in the situation mentioned above,
whereby no disks in the system (on other scsi hosts) could be removed,
because of a stuck scsi command to read the partition tables of an
unrelated problematic disk during probe. The threads were stuck at:
schedule+0x312/0x7a0
async_synchronize_cookie_domain+0xb8/0x115
? __wake_up_bit+0x40/0x40
async_synchronize_full_domain+0x15/0x17
sd_remove+0x5f/0x135
__device_release_driver+0x8a/0xe0
device_release_driver+0x23/0x30
bus_remove_device+0x10f/0x123
device_del+0x132/0x18e
__scsi_remove_device+0x56/0xb6
scsi_remove_device+0x26/0x33
scsi_remove_target+0x12d/0x1a0
...
What this patch does:
=> Ensure that any quests to reset the timer are accounted for, so that
there is a finite upper bound on the time that a command is tried.
Once allowed number of retries is reached, we proceed to standard
error handling procedure (abort etc.) by scheduling the command
for EH.
Signed-off-by: Rajat Jain <rajatja@google.com>
---
drivers/scsi/scsi_error.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c95a4e9..9671ec5 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -283,6 +283,17 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
else if (host->hostt->eh_timed_out)
rtn = host->hostt->eh_timed_out(scmd);
+ /*
+ * If a scmd times out because LLD failed to complete it, make sure that
+ * LLD can ask for more time only finite number of times. Also each such
+ * request must account towards the time the LLD has been spent on that
+ * cmd. Thus each timeout attempt by an LLD to complete a scmd must be
+ * treated as a retry since it involves waiting for another whole period
+ * of time before it times out again.
+ */
+ if (rtn == BLK_EH_RESET_TIMER && (++scmd->retries > scmd->allowed))
+ rtn = BLK_EH_NOT_HANDLED;
+
if (rtn == BLK_EH_NOT_HANDLED) {
if (!host->hostt->no_async_abort &&
scsi_abort_command(scmd) == SUCCESS)
--
2.2.0.rc0.207.ga3a616c
next reply other threads:[~2015-06-04 18:40 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-06-04 18:40 Rajat Jain [this message]
2015-06-04 20:27 ` [PATCH] scsi: Avoid potential infinite eh_timeout_handler() loop James Bottomley
2015-06-04 21:49 ` Rajat Jain
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1433443222-8260-1-git-send-email-rajatja@google.com \
--to=rajatja@google.com \
--cc=JBottomley@odin.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
--cc=rajatxjain@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.