linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* drivers/mmc/card/block.c infinite loop in mmc_blk_err_check waiting on R1_READY_FOR_DATA
@ 2012-10-31 19:36 Trey Ramsay
  2012-10-31 20:47 ` Chris Ball
  0 siblings, 1 reply; 10+ messages in thread
From: Trey Ramsay @ 2012-10-31 19:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: cjb

In the 3.7-rc3 kernel, there is an infinite loop in the 
mmc_blk_err_check() function in drivers/mmc/card/block.c that can be 
caused bad hardware. This loop has moved around a little, but appears to 
have been around in the kernel since v2.6.12.  The code will loop 
forever on write if the card isn't ready for data or if it's in program 
mode.  I did some searching and saw that it was reported to 
linux-mmc@vger.kernel.org 
http://permalink.gmane.org/gmane.linux.kernel.mmc/2021 back in May but 
didn't see a response.  Should there be a maximum retry count or a 
timeout to prevent an infinite loop?


1030         /*
1031          * Everything else is either success, or a data error of some
1032          * kind.  If it was a write, we may have transitioned to
1033          * program mode, which we have to wait for it to complete.
1034          */
1035         if (!mmc_host_is_spi(card->host) && rq_data_dir(req) != READ) {
1036                 u32 status;
1037                 do {
1038                         int err = get_card_status(card, &status, 5);
1039                         if (err) {
1040 
                                 pr_err("%s: error %d requesting status\n",
1041                                        req->rq_disk->disk_name, err);
1042                                 return MMC_BLK_CMD_ERR;
1043                         }
1044                         /*
1045                          * Some cards mishandle the status bits,
1046                          * so make sure to check both the busy
1047                          * indication and the card state.
1048                          */
1049                 } while (!(status & R1_READY_FOR_DATA) ||
1050                          (R1_CURRENT_STATE(status) == R1_STATE_PRG));
1051         }


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: drivers/mmc/card/block.c infinite loop in mmc_blk_err_check waiting on R1_READY_FOR_DATA
  2012-10-31 19:36 drivers/mmc/card/block.c infinite loop in mmc_blk_err_check waiting on R1_READY_FOR_DATA Trey Ramsay
@ 2012-10-31 20:47 ` Chris Ball
       [not found]   ` <50A2A08D.5000601@linux.vnet.ibm.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Ball @ 2012-10-31 20:47 UTC (permalink / raw)
  To: Trey Ramsay; +Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic

Hi,

On Wed, Oct 31 2012, Trey Ramsay wrote:
> In the 3.7-rc3 kernel, there is an infinite loop in the
> mmc_blk_err_check() function in drivers/mmc/card/block.c that can be
> caused bad hardware. This loop has moved around a little, but appears
> to have been around in the kernel since v2.6.12.  The code will loop
> forever on write if the card isn't ready for data or if it's in
> program mode.  I did some searching and saw that it was reported to
> linux-mmc@vger.kernel.org
> http://permalink.gmane.org/gmane.linux.kernel.mmc/2021 back in May but
> didn't see a response.  Should there be a maximum retry count or a
> timeout to prevent an infinite loop?

Yes, a (long) timeout is a good idea, and there are three different
places where we use this type of loop waiting for R1_STATE_PRG to drop.
I'll add this to my TODO list, but feel free to fix it if you want to.

Thanks,

- Chris.
-- 
Chris Ball   <cjb@laptop.org>   <http://printf.net/>
One Laptop Per Child

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: drivers/mmc/card/block.c infinite loop in mmc_blk_err_check waiting on R1_READY_FOR_DATA
       [not found]   ` <50A2A08D.5000601@linux.vnet.ibm.com>
@ 2012-11-13 20:48     ` Chris Ball
  2012-11-16 15:31       ` [PATCH 1/1] mmc: Bad device can cause mmc driver to hang Trey Ramsay
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Ball @ 2012-11-13 20:48 UTC (permalink / raw)
  To: Trey Ramsay; +Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic

Hi Trey,

On Tue, Nov 13 2012, Trey Ramsay wrote:
> Thanks Chris. Any idea how long the timeout should be? It could
> cause problems if we timeout to early. This what I have so far with a
> 10 minute timeout. The code is based off of v3.7-rc3

Ten minutes sounds excessive, which is actually fine for our purpose; I
don't think we need to make the value tunable.  Thanks!  Please can you:

* resend in plain text instead of HTML, with no line-wrapping corrupting
  the patch

* include a commit message and "Signed-off-by" (see SubmittingPatches
  in the Documentation/ directory)

* change the pr_err()s to include the MMC hostname, and give an English
  description of the error instead of the line number -- most users
  don't have their kernel source available to find line numbers in.
  Something like "Card stuck in programming state!" sounds good.

After that, I'll merge the patch for testing.

Thanks,

- Chris.
-- 
Chris Ball   <cjb@laptop.org>   <http://printf.net/>
One Laptop Per Child

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/1] mmc: Bad device can cause mmc driver to hang
  2012-11-13 20:48     ` Chris Ball
@ 2012-11-16 15:31       ` Trey Ramsay
  2012-11-16 15:37         ` Chris Ball
  2012-11-17 14:34         ` Chris Ball
  0 siblings, 2 replies; 10+ messages in thread
From: Trey Ramsay @ 2012-11-16 15:31 UTC (permalink / raw)
  To: Chris Ball
  Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic, Trey Ramsay

There are infinite loops in the mmc code that can be caused by bad hardware.
The code will loop forever if the device never comes back from program mode,
R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.

A long timeout will be added to prevent the code from looping forever.
The timeout will occur if the device never comes back from program
state or the device never becomes ready for data.

Signed-off-by: Trey Ramsay <tramsay@linux.vnet.ibm.com>
---
 drivers/mmc/card/block.c   |   15 +++++++++++++++
 drivers/mmc/core/core.c    |   18 +++++++++++++++++-
 drivers/mmc/core/mmc_ops.c |   11 +++++++++++
 3 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 172a768..8e60508 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -57,6 +57,7 @@ MODULE_ALIAS("mmc:block");
 #define INAND_CMD38_ARG_SECERASE 0x80
 #define INAND_CMD38_ARG_SECTRIM1 0x81
 #define INAND_CMD38_ARG_SECTRIM2 0x88
+#define MMC_BLK_TIMEOUT_MS  (10 * 60 * 1000)        /* 10 minute timeout */
 
 static DEFINE_MUTEX(block_mutex);
 
@@ -1034,6 +1035,9 @@ static int mmc_blk_err_check(struct mmc_card *card,
 	 */
 	if (!mmc_host_is_spi(card->host) && rq_data_dir(req) != READ) {
 		u32 status;
+		unsigned long timeout;
+
+		timeout = jiffies + msecs_to_jiffies(MMC_BLK_TIMEOUT_MS);
 		do {
 			int err = get_card_status(card, &status, 5);
 			if (err) {
@@ -1041,6 +1045,17 @@ static int mmc_blk_err_check(struct mmc_card *card,
 				       req->rq_disk->disk_name, err);
 				return MMC_BLK_CMD_ERR;
 			}
+
+			/* Timeout if the device never becomes ready for data
+			 * and never leaves the program state.
+			 */
+			if (time_after(jiffies, timeout)) {
+				pr_err("%s: Card stuck in programming state!"\
+					" %s %s\n", mmc_hostname(card->host),
+					req->rq_disk->disk_name, __func__);
+
+				return MMC_BLK_CMD_ERR;
+			}
 			/*
 			 * Some cards mishandle the status bits,
 			 * so make sure to check both the busy
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 06c42cf..bccfd18 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -42,6 +42,9 @@
 #include "sd_ops.h"
 #include "sdio_ops.h"
 
+/* If the device is not responding */
+#define MMC_CORE_TIMEOUT_MS	(10 * 60 * 1000) /* 10 minute timeout */
+
 /*
  * Background operations can take a long time, depending on the housekeeping
  * operations the card has to perform.
@@ -1631,6 +1634,7 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from,
 {
 	struct mmc_command cmd = {0};
 	unsigned int qty = 0;
+	unsigned long timeout;
 	int err;
 
 	/*
@@ -1708,6 +1712,7 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from,
 	if (mmc_host_is_spi(card->host))
 		goto out;
 
+	timeout = jiffies + msecs_to_jiffies(MMC_CORE_TIMEOUT_MS);
 	do {
 		memset(&cmd, 0, sizeof(struct mmc_command));
 		cmd.opcode = MMC_SEND_STATUS;
@@ -1721,8 +1726,19 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from,
 			err = -EIO;
 			goto out;
 		}
+
+		/* Timeout if the device never becomes ready for data and
+		 * never leaves the program state.
+		 */
+		if (time_after(jiffies, timeout)) {
+			pr_err("%s: Card stuck in programming state! %s\n",
+				mmc_hostname(card->host), __func__);
+			err =  -EIO;
+			goto out;
+		}
+
 	} while (!(cmd.resp[0] & R1_READY_FOR_DATA) ||
-		 R1_CURRENT_STATE(cmd.resp[0]) == R1_STATE_PRG);
+		 (R1_CURRENT_STATE(cmd.resp[0]) == R1_STATE_PRG));
 out:
 	return err;
 }
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index a0e1720..6d8f701 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -21,6 +21,8 @@
 #include "core.h"
 #include "mmc_ops.h"
 
+#define MMC_OPS_TIMEOUT_MS	(10 * 60 * 1000) /* 10 minute timeout */
+
 static int _mmc_select_card(struct mmc_host *host, struct mmc_card *card)
 {
 	int err;
@@ -409,6 +411,7 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
 {
 	int err;
 	struct mmc_command cmd = {0};
+	unsigned long timeout;
 	u32 status;
 
 	BUG_ON(!card);
@@ -437,6 +440,7 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
 		return 0;
 
 	/* Must check status to be sure of no errors */
+	timeout = jiffies + msecs_to_jiffies(MMC_OPS_TIMEOUT_MS);
 	do {
 		err = mmc_send_status(card, &status);
 		if (err)
@@ -445,6 +449,13 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
 			break;
 		if (mmc_host_is_spi(card->host))
 			break;
+
+		/* Timeout if the device never leaves the program state. */
+		if (time_after(jiffies, timeout)) {
+			pr_err("%s: Card stuck in programming state! %s\n",
+				mmc_hostname(card->host), __func__);
+			return -ETIMEDOUT;
+		}
 	} while (R1_CURRENT_STATE(status) == R1_STATE_PRG);
 
 	if (mmc_host_is_spi(card->host)) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/1] mmc: Bad device can cause mmc driver to hang
  2012-11-16 15:31       ` [PATCH 1/1] mmc: Bad device can cause mmc driver to hang Trey Ramsay
@ 2012-11-16 15:37         ` Chris Ball
  2012-11-16 23:52           ` Trey Ramsay
  2012-11-17  0:40           ` Trey Ramsay
  2012-11-17 14:34         ` Chris Ball
  1 sibling, 2 replies; 10+ messages in thread
From: Chris Ball @ 2012-11-16 15:37 UTC (permalink / raw)
  To: Trey Ramsay; +Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic

Hi Trey,

On Fri, Nov 16 2012, Trey Ramsay wrote:
> There are infinite loops in the mmc code that can be caused by bad hardware.
> The code will loop forever if the device never comes back from program mode,
> R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
>
> A long timeout will be added to prevent the code from looping forever.
> The timeout will occur if the device never comes back from program
> state or the device never becomes ready for data.
>
> Signed-off-by: Trey Ramsay <tramsay@linux.vnet.ibm.com>

Thanks, looks good!

Have you thought about what's going to happen after this path is hit?
Are we just going to start a new request that begins a new ten-minute
hang, or do we notice the bad card state somewhere and refuse to start
new I/O?

- Chris.
-- 
Chris Ball   <cjb@laptop.org>   <http://printf.net/>
One Laptop Per Child

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/1] mmc: Bad device can cause mmc driver to hang
  2012-11-16 15:37         ` Chris Ball
@ 2012-11-16 23:52           ` Trey Ramsay
  2012-11-17  0:37             ` Chris Ball
  2012-11-17  0:40           ` Trey Ramsay
  1 sibling, 1 reply; 10+ messages in thread
From: Trey Ramsay @ 2012-11-16 23:52 UTC (permalink / raw)
  To: Chris Ball; +Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic

On 11/16/2012 09:37 AM, Chris Ball wrote:
> Hi Trey,
> 
> On Fri, Nov 16 2012, Trey Ramsay wrote:
>> There are infinite loops in the mmc code that can be caused by bad hardware.
>> The code will loop forever if the device never comes back from program mode,
>> R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
>>
>> A long timeout will be added to prevent the code from looping forever.
>> The timeout will occur if the device never comes back from program
>> state or the device never becomes ready for data.
>>
>> Signed-off-by: Trey Ramsay <tramsay@linux.vnet.ibm.com>
> 
> Thanks, looks good!
> 
> Have you thought about what's going to happen after this path is hit?
> Are we just going to start a new request that begins a new ten-minute
> hang, or do we notice the bad card state somewhere and refuse to start
> new I/O?
> 
> - Chris.
> 

You're welcome, and thanks for the help!

Good question.  In regards to the original problem were it was hung in
mmc_blk_err_check, the new code path will timeout after 10 minutes, log
an error, issue a hardware reset and abort the request. Is the hardware
reset enough or will that even work when the device isn't coming out of
program state? Should we try to refuse all new I/O?


Below are just some notes I took about the code path for
mmc_blk_err_check and mmc_do_erase.
=============================
mmc_blk_err_check
                        int err = get_card_status(card, &status, 5);
                        if (err) {
                                pr_err("%s: error %d requesting status\n",
                                       req->rq_disk->disk_name, err);
                                return MMC_BLK_CMD_ERR;
                        }
+
+                       /* Timeout if the device never becomes ready for
data
+                        * and never leaves the program state.
+                        */
+                       if (time_after(jiffies, timeout)) {
+                               pr_err("%s: Card stuck in programming
state!"\
+                                       " %s %s\n",
mmc_hostname(card->host),
+                                       req->rq_disk->disk_name, __func__);
+
+                               return MMC_BLK_CMD_ERR;
+                       }

Stack
-----------------
mmc_blk_err_check
mmc_start_req
mmc_blk_issue_rw_rq
mmc_blk_issue_rq
mmc_queue_thread

We return MMC_BLK_CMD_ERR the same way as we return MMC_BLK_CMD_ERR if
the get_card_status failed. The code which returns to mmc_start_req
which sets the error status and returns to mmc_blk_issue_rw_rq with a
status of MMC_BLK_CMD_ERR where it calls mmc_blk_reset which calls
mmc_hw_reset. The code then return so mmc_queue_thread which does not
check the return code.

=============================
mmc_do_erase
+
+               /* Timeout if the device never becomes ready for data and
+                * never leaves the program state.
+                */
+               if (time_after(jiffies, timeout)) {
+                       pr_err("%s: Card stuck in programming state! %s\n",
+                               mmc_hostname(card->host), __func__);
+                       err =  -EIO;
+                       goto out;
+               }
+

        err = mmc_erase(card, from, nr, arg);
out:
        if (err == -EIO && !mmc_blk_reset(md, card->host, type))
                goto retry;


The mmc_do_erase function is called by mmc_erase which is called from 3
locations in the block.c code.

    1) mmc_blk_issue_discard_rq--->mmc_erase--->mmc_do_erase    line 848
    2) mmc_blk_issue_secdiscard_rq--->mmc_erase--->mmc_do_erase line 884
    3) mmc_blk_issue_secdiscard_rq--->mmc_erase--->mmc_do_erase line 904

    This applies to 1, 2 and 3 above.
    Since we return -EIO, mmc_blk_issue_discard_rq or
    mmc_blk_issue_secdiscard will do a mmc_blk_reset which will call
    mmc_hw_reset. If hardware reset is successfull, it will retry the
    command.  If the hardware reset is unsuccessfull, it will call
    blk_end_request and will return 0 if there is
    an err and will return to the mmc_queue_thread function which does
    not check the return code.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/1] mmc: Bad device can cause mmc driver to hang
  2012-11-16 23:52           ` Trey Ramsay
@ 2012-11-17  0:37             ` Chris Ball
  2012-11-17  5:16               ` Trey Ramsay
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Ball @ 2012-11-17  0:37 UTC (permalink / raw)
  To: Trey Ramsay; +Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic

Hi Trey, thanks for the analysis,

On Fri, Nov 16 2012, Trey Ramsay wrote:
> Good question.  In regards to the original problem were it was hung in
> mmc_blk_err_check, the new code path will timeout after 10 minutes, log
> an error, issue a hardware reset and abort the request. Is the hardware
> reset enough or will that even work when the device isn't coming out of
> program state? Should we try to refuse all new I/O?

mmc_hw_reset() only works for eMMC devices with a hooked up reset GPIO
-- not SD cards -- and at the moment there's only one system (Intel
Medfield) that supplies a GPIO, so that's not a general solution.

Maybe we should just merge your patch for now; we'll definitely get at
least a pr_err() explaining what's going on, which is an improvement.
Next time someone hits this (if anyone has an SD card that exhibits
this problem, it'd be very valuable for testing) we can look at going
farther, such as immediately setting host->flags |= SDHCI_DEVICE_DEAD.
What do you think?

- Chris.
-- 
Chris Ball   <cjb@laptop.org>   <http://printf.net/>
One Laptop Per Child

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/1] mmc: Bad device can cause mmc driver to hang
  2012-11-16 15:37         ` Chris Ball
  2012-11-16 23:52           ` Trey Ramsay
@ 2012-11-17  0:40           ` Trey Ramsay
  1 sibling, 0 replies; 10+ messages in thread
From: Trey Ramsay @ 2012-11-17  0:40 UTC (permalink / raw)
  To: Chris Ball; +Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic

On 11/16/2012 09:37 AM, Chris Ball wrote:
> Hi Trey,
> 
> On Fri, Nov 16 2012, Trey Ramsay wrote:
>> There are infinite loops in the mmc code that can be caused by bad hardware.
>> The code will loop forever if the device never comes back from program mode,
>> R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
>>
>> A long timeout will be added to prevent the code from looping forever.
>> The timeout will occur if the device never comes back from program
>> state or the device never becomes ready for data.
>>
>> Signed-off-by: Trey Ramsay <tramsay@linux.vnet.ibm.com>
> 
> Thanks, looks good!
> 
> Have you thought about what's going to happen after this path is hit?
> Are we just going to start a new request that begins a new ten-minute
> hang, or do we notice the bad card state somewhere and refuse to start
> new I/O?
> 
> - Chris.
> 

You're welcome, and thanks for the help!

Good question.  In regards to the original problem were it was hung in
mmc_blk_err_check, the new code path will timeout after 10 minutes, log
an error, issue a hardware reset and abort the request. Is the hardware
reset enough or will that even work when the device isn't coming out of
program state? Should we try to refuse all new I/O?


Below are just some notes I took about the code path for
mmc_blk_err_check and mmc_do_erase.
=============================
mmc_blk_err_check
                        int err = get_card_status(card, &status, 5);
                        if (err) {
                                pr_err("%s: error %d requesting status\n",
                                       req->rq_disk->disk_name, err);
                                return MMC_BLK_CMD_ERR;
                        }
+
+                       /* Timeout if the device never becomes ready for
data
+                        * and never leaves the program state.
+                        */
+                       if (time_after(jiffies, timeout)) {
+                               pr_err("%s: Card stuck in programming
state!"\
+                                       " %s %s\n",
mmc_hostname(card->host),
+                                       req->rq_disk->disk_name, __func__);
+
+                               return MMC_BLK_CMD_ERR;
+                       }

Stack
-----------------
mmc_blk_err_check
mmc_start_req
mmc_blk_issue_rw_rq
mmc_blk_issue_rq
mmc_queue_thread

We return MMC_BLK_CMD_ERR the same way as we return MMC_BLK_CMD_ERR if
the get_card_status failed. The code which returns to mmc_start_req
which sets the error status and returns to mmc_blk_issue_rw_rq with a
status of MMC_BLK_CMD_ERR where it calls mmc_blk_reset which calls
mmc_hw_reset. The code then return so mmc_queue_thread which does not
check the return code.

=============================
mmc_do_erase
+
+               /* Timeout if the device never becomes ready for data and
+                * never leaves the program state.
+                */
+               if (time_after(jiffies, timeout)) {
+                       pr_err("%s: Card stuck in programming state! %s\n",
+                               mmc_hostname(card->host), __func__);
+                       err =  -EIO;
+                       goto out;
+               }
+

        err = mmc_erase(card, from, nr, arg);
out:
        if (err == -EIO && !mmc_blk_reset(md, card->host, type))
                goto retry;


The mmc_do_erase function is called by mmc_erase which is called from 3
locations in the block.c code.

    1) mmc_blk_issue_discard_rq--->mmc_erase--->mmc_do_erase    line 848
    2) mmc_blk_issue_secdiscard_rq--->mmc_erase--->mmc_do_erase line 884
    3) mmc_blk_issue_secdiscard_rq--->mmc_erase--->mmc_do_erase line 904

    This applies to 1, 2 and 3 above.
    Since we return -EIO, mmc_blk_issue_discard_rq or
    mmc_blk_issue_secdiscard will do a mmc_blk_reset which will call
    mmc_hw_reset. If hardware reset is successfull, it will retry the
    command.  If the hardware reset is unsuccessfull, it will call
    blk_end_request and will return 0 if there is
    an err and will return to the mmc_queue_thread function which does
    not check the return code.





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/1] mmc: Bad device can cause mmc driver to hang
  2012-11-17  0:37             ` Chris Ball
@ 2012-11-17  5:16               ` Trey Ramsay
  0 siblings, 0 replies; 10+ messages in thread
From: Trey Ramsay @ 2012-11-17  5:16 UTC (permalink / raw)
  To: Chris Ball; +Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic

On 11/16/2012 06:37 PM, Chris Ball wrote:
> Hi Trey, thanks for the analysis,
> 
> On Fri, Nov 16 2012, Trey Ramsay wrote:
>> Good question.  In regards to the original problem were it was hung in
>> mmc_blk_err_check, the new code path will timeout after 10 minutes, log
>> an error, issue a hardware reset and abort the request. Is the hardware
>> reset enough or will that even work when the device isn't coming out of
>> program state? Should we try to refuse all new I/O?
> 
> mmc_hw_reset() only works for eMMC devices with a hooked up reset GPIO
> -- not SD cards -- and at the moment there's only one system (Intel
> Medfield) that supplies a GPIO, so that's not a general solution.
> 
> Maybe we should just merge your patch for now; we'll definitely get at
> least a pr_err() explaining what's going on, which is an improvement.
> Next time someone hits this (if anyone has an SD card that exhibits
> this problem, it'd be very valuable for testing) we can look at going
> farther, such as immediately setting host->flags |= SDHCI_DEVICE_DEAD.
> What do you think?
> 
> - Chris.
> 

Hi Chris,
Sounds good.  Thanks for the explanation. Setting host->flags |=
SDHCI_DEVICE_DEAD is a great idea.  I'll check with my team to see if we
have any hardware that exhibits this problem.  If we do, I can do some
testing on the code you suggested.

Thanks,
Trey


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/1] mmc: Bad device can cause mmc driver to hang
  2012-11-16 15:31       ` [PATCH 1/1] mmc: Bad device can cause mmc driver to hang Trey Ramsay
  2012-11-16 15:37         ` Chris Ball
@ 2012-11-17 14:34         ` Chris Ball
  1 sibling, 0 replies; 10+ messages in thread
From: Chris Ball @ 2012-11-17 14:34 UTC (permalink / raw)
  To: Trey Ramsay; +Cc: linux-kernel, linux-mmc, Rich Rattanni, Radovan Lekanovic

Hi,

On Fri, Nov 16 2012, Trey Ramsay wrote:
> There are infinite loops in the mmc code that can be caused by bad hardware.
> The code will loop forever if the device never comes back from program mode,
> R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
>
> A long timeout will be added to prevent the code from looping forever.
> The timeout will occur if the device never comes back from program
> state or the device never becomes ready for data.
>
> Signed-off-by: Trey Ramsay <tramsay@linux.vnet.ibm.com>

Thanks for doing this; pushed to mmc-next for 3.8.

- Chris.
-- 
Chris Ball   <cjb@laptop.org>   <http://printf.net/>
One Laptop Per Child

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-11-17 14:34 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-31 19:36 drivers/mmc/card/block.c infinite loop in mmc_blk_err_check waiting on R1_READY_FOR_DATA Trey Ramsay
2012-10-31 20:47 ` Chris Ball
     [not found]   ` <50A2A08D.5000601@linux.vnet.ibm.com>
2012-11-13 20:48     ` Chris Ball
2012-11-16 15:31       ` [PATCH 1/1] mmc: Bad device can cause mmc driver to hang Trey Ramsay
2012-11-16 15:37         ` Chris Ball
2012-11-16 23:52           ` Trey Ramsay
2012-11-17  0:37             ` Chris Ball
2012-11-17  5:16               ` Trey Ramsay
2012-11-17  0:40           ` Trey Ramsay
2012-11-17 14:34         ` Chris Ball

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).