[PATCH] mmc: dw_mmc: Wait for data transfer after response errors

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
@ 2016-03-17 12:12 Enric Balletbo Serra
  2016-03-21 22:38 ` Doug Anderson
  0 siblings, 1 reply; 19+ messages in thread
From: Enric Balletbo Serra @ 2016-03-17 12:12 UTC (permalink / raw)
  To: linux-mmc, linux-kernel
  Cc: Doug Anderson, Alim Akhtar, Jaehoon Chung, Seungwon Jeon,
	Ulf Hansson, Alim Akhtar, Sonny Rao, Andrew Bresticker,
	Heiko Stuebner, Addy Ke, Alexandru Stan,
	Javier Martinez Canillas, Chris Zhong, Caesar Wang

Dear all,

Seems the following thread[1] didn't go anywhere. I'd like to continue
the discussion and share some tests that I did regarding the issue
that the patch is trying to fix.

First I reproduced the issue on my rockchip board and I tested the
patch intensively, I can confirm that the patch made by Doug fixes the
issue.But, as reported by Alim, seems that this patch has the side
effect that breaks mmc on peach-pi board [2], specially on
suspend/resume, I ran lots of tests on peach-pi and, although is a bit
random, I can also confirm the breakage.

Looks like that on peach-pi, when the patch is applied the controller
moves into a data transfer and the interrupt does not come, then the
task blocks. The reason why I think the dw_mmc-rockchip driver works
is because it has the DW_MCI_QUIRK_BROKEN_DTO quirk [3].

So I did lots of tests on peach-pi with dto quirk, suspend/resume
started to work again. But I guess this is not the proper solution or
it is? Thoughts?

[1] https://lkml.org/lkml/2015/5/18/495
[2] https://lava.collabora.co.uk/scheduler/job/169384/log_file#L_195_5
[3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/host/dw_mmc-rockchip.c?id=57e104864bc4874a36796fd222d8d084dbf90b9b

Cheers,
Enric

>
> Alim,
>
> On Tue, May 26, 2015 at 11:02 AM, Alim Akhtar <alim.akhtar@gmail.com> wrote:
>> Hi Doug,
>> On peach-pi, I got a hung task once in 4 cold boot as [1].
>
> OK, I'll have to get my peach-pi or peach-pit up and running again.  I
> ran out of desk space and I haven't gotten it set back up.  :(
>
> I've been testing on an rk3288-based device.  Past experience has
> taught me that the rk3288 dw_mmc works differently than the exynos
> one, so perhaps this is a difference.
>
> Could you possibly patch in something like
> <https://chromium-review.googlesource.com/#/c/244347/> and provide the
> console for the failure?  I'll put it on my list to try this myself,
> too
>
>
>> I was checking on v4.1-rc5, git hash as below:
>>
>> 862e58a mmc: dw_mmc: Wait for data transfer after response errors
>> ba155e2 Linux 4.1-rc5
>> 5b13966
>>
>> Not sure if I missed any dependent patch??
>
> I'm currently testing out of tree, but my dw_mmc is very close to mainline.
>
>
>> Have not checked the dw TRM for this change, will do that as soon as I
>> get access to it.
>
> OK, sounds good.  I have some old version of the DesignWare TRM, so
> possibly something is different in the newer one...
>
>
> -Doug
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-17 12:12 [PATCH] mmc: dw_mmc: Wait for data transfer after response errors Enric Balletbo Serra
@ 2016-03-21 22:38 ` Doug Anderson
  2016-03-24 11:26   ` Enric Balletbo Serra
  0 siblings, 1 reply; 19+ messages in thread
From: Doug Anderson @ 2016-03-21 22:38 UTC (permalink / raw)
  To: Enric Balletbo Serra
  Cc: linux-mmc, linux-kernel, Alim Akhtar, Jaehoon Chung,
	Seungwon Jeon, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Javier Martinez Canillas, Chris Zhong, Caesar Wang

Enric,

On Thu, Mar 17, 2016 at 5:12 AM, Enric Balletbo Serra
<eballetbo@gmail.com> wrote:
> Dear all,
>
> Seems the following thread[1] didn't go anywhere. I'd like to continue
> the discussion and share some tests that I did regarding the issue
> that the patch is trying to fix.
>
> First I reproduced the issue on my rockchip board and I tested the
> patch intensively, I can confirm that the patch made by Doug fixes the
> issue.But, as reported by Alim, seems that this patch has the side
> effect that breaks mmc on peach-pi board [2], specially on
> suspend/resume, I ran lots of tests on peach-pi and, although is a bit
> random, I can also confirm the breakage.
>
> Looks like that on peach-pi, when the patch is applied the controller
> moves into a data transfer and the interrupt does not come, then the
> task blocks. The reason why I think the dw_mmc-rockchip driver works
> is because it has the DW_MCI_QUIRK_BROKEN_DTO quirk [3].
>
> So I did lots of tests on peach-pi with dto quirk, suspend/resume
> started to work again. But I guess this is not the proper solution or
> it is? Thoughts?
>
> [1] https://lkml.org/lkml/2015/5/18/495
> [2] https://lava.collabora.co.uk/scheduler/job/169384/log_file#L_195_5
> [3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/host/dw_mmc-rockchip.c?id=57e104864bc4874a36796fd222d8d084dbf90b9b

Ah, that would make some sense why things work OK on Rockchip.  Adding
DW_MCI_QUIRK_BROKEN_DTO to peach probably doesn't make sense, then.
Hrm...

Since my original debugging of the issue was over a year ago, I think
I've almost totally lost context of any debugging I did on the issue,
so I'm not sure I'm going to be too much help in giving any details
other than what I put in the original commit message.  From the
original message it appears that I thought we could solve this other
ways but just that my patch was easier than the alternative of
handling every error case.  Maybe we just need to go back to the
drawing board and handle the error directly?

Also: my original commit message says "response error or response CRC
error".  Do you happen to know which of these two we're hitting on
rk3288?  If we limit the workaround to just one of these two cases
does peach pi still break?

Also: I'd be curious, with the same SD card can you reproduce any
failures on peach pi?  ...or does peach-pi work fine in this case?

Hmm, also I think my last suggestion was to see how things looked with
<https://chromium-review.googlesource.com/#/c/244347/> picked to get
extra debug info...

-Doug

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-21 22:38 ` Doug Anderson
@ 2016-03-24 11:26   ` Enric Balletbo Serra
  2016-03-24 15:16     ` Doug Anderson
  2016-03-24 15:30     ` Russell King - ARM Linux
  0 siblings, 2 replies; 19+ messages in thread
From: Enric Balletbo Serra @ 2016-03-24 11:26 UTC (permalink / raw)
  To: Doug Anderson
  Cc: linux-mmc, linux-kernel, Alim Akhtar, Jaehoon Chung, Ulf Hansson,
	Alim Akhtar, Sonny Rao, Andrew Bresticker, Heiko Stuebner,
	Addy Ke, Alexandru Stan, Chris Zhong, Caesar Wang,
	Javier Martinez Canillas, Russell King

I fixed Javier Martinez email and removed tgih.jun@samsung.com (delivery fail)
Also cc'ing Russell King as I think might help (see my comment below)


2016-03-21 23:38 GMT+01:00 Doug Anderson <dianders@chromium.org>:
> Enric,
>
> On Thu, Mar 17, 2016 at 5:12 AM, Enric Balletbo Serra
> <eballetbo@gmail.com> wrote:
>> Dear all,
>>
>> Seems the following thread[1] didn't go anywhere. I'd like to continue
>> the discussion and share some tests that I did regarding the issue
>> that the patch is trying to fix.
>>
>> First I reproduced the issue on my rockchip board and I tested the
>> patch intensively, I can confirm that the patch made by Doug fixes the
>> issue.But, as reported by Alim, seems that this patch has the side
>> effect that breaks mmc on peach-pi board [2], specially on
>> suspend/resume, I ran lots of tests on peach-pi and, although is a bit
>> random, I can also confirm the breakage.
>>
>> Looks like that on peach-pi, when the patch is applied the controller
>> moves into a data transfer and the interrupt does not come, then the
>> task blocks. The reason why I think the dw_mmc-rockchip driver works
>> is because it has the DW_MCI_QUIRK_BROKEN_DTO quirk [3].
>>
>> So I did lots of tests on peach-pi with dto quirk, suspend/resume
>> started to work again. But I guess this is not the proper solution or
>> it is? Thoughts?
>>
>> [1] https://lkml.org/lkml/2015/5/18/495
>> [2] https://lava.collabora.co.uk/scheduler/job/169384/log_file#L_195_5
>> [3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/host/dw_mmc-rockchip.c?id=57e104864bc4874a36796fd222d8d084dbf90b9b
>
> Ah, that would make some sense why things work OK on Rockchip.  Adding
> DW_MCI_QUIRK_BROKEN_DTO to peach probably doesn't make sense, then.
> Hrm...
>
> Since my original debugging of the issue was over a year ago, I think
> I've almost totally lost context of any debugging I did on the issue,
> so I'm not sure I'm going to be too much help in giving any details
> other than what I put in the original commit message.  From the
> original message it appears that I thought we could solve this other
> ways but just that my patch was easier than the alternative of
> handling every error case.  Maybe we just need to go back to the
> drawing board and handle the error directly?
>

I just saw that Russell introduced a patch [1] that will land on 4.6.
I think that patch solves the same issue that we're trying to fix, but
for sdhci controller.

The problem that we have on peach-pi, with our patch applied, is that
when we get a response CRC error on a command and we move to start
sending data, the transfer doesn't receives a timeout interrupt (I
don't know why). As I told, on rockchip works due the DTO quirk.
exynos is not using this quirk. Also, please correct me if I'm wrong,
looks like the sdhci controller has a timer to signal the command
timed out.

ooi, anyone knows what was the test case that caused the necessity of
the DTO quirk?

> Also: my original commit message says "response error or response CRC
> error".  Do you happen to know which of these two we're hitting on
> rk3288?  If we limit the workaround to just one of these two cases
> does peach pi still break?
>

Yes, the peach pi still break. The one that is hitting is the response
CRC error, so limit the workaround doesn't help.


> Also: I'd be curious, with the same SD card can you reproduce any
> failures on peach pi?  ...or does peach-pi work fine in this case?
>

I can't test this now because I don't have physical access to the
peach-pi. But yeah, this is something to test.

> Hmm, also I think my last suggestion was to see how things looked with
> <https://chromium-review.googlesource.com/#/c/244347/> picked to get
> extra debug info...
>
>
> -Doug

[1] https://git.linaro.org/people/ulf.hansson/mmc.git/commit/71fcbda0fcddd0896c4982a484f6c8aa802d28b1

Enric

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-24 11:26   ` Enric Balletbo Serra
@ 2016-03-24 15:16     ` Doug Anderson
  2016-03-24 15:30     ` Russell King - ARM Linux
  1 sibling, 0 replies; 19+ messages in thread
From: Doug Anderson @ 2016-03-24 15:16 UTC (permalink / raw)
  To: Enric Balletbo Serra, Jaehoon Chung
  Cc: linux-mmc, linux-kernel, Alim Akhtar, Ulf Hansson, Alim Akhtar,
	Sonny Rao, Andrew Bresticker, Heiko Stuebner, Addy Ke,
	Alexandru Stan, Chris Zhong, Caesar Wang,
	Javier Martinez Canillas, Russell King

Hi,

On Thu, Mar 24, 2016 at 4:26 AM, Enric Balletbo Serra
<eballetbo@gmail.com> wrote:
>> Ah, that would make some sense why things work OK on Rockchip.  Adding
>> DW_MCI_QUIRK_BROKEN_DTO to peach probably doesn't make sense, then.
>> Hrm...
>>
>> Since my original debugging of the issue was over a year ago, I think
>> I've almost totally lost context of any debugging I did on the issue,
>> so I'm not sure I'm going to be too much help in giving any details
>> other than what I put in the original commit message.  From the
>> original message it appears that I thought we could solve this other
>> ways but just that my patch was easier than the alternative of
>> handling every error case.  Maybe we just need to go back to the
>> drawing board and handle the error directly?
>>
>
> I just saw that Russell introduced a patch [1] that will land on 4.6.
> I think that patch solves the same issue that we're trying to fix, but
> for sdhci controller.

Yes, the description sounds very similar for sure.


> The problem that we have on peach-pi, with our patch applied, is that
> when we get a response CRC error on a command and we move to start
> sending data, the transfer doesn't receives a timeout interrupt (I
> don't know why). As I told, on rockchip works due the DTO quirk.
> exynos is not using this quirk. Also, please correct me if I'm wrong,
> looks like the sdhci controller has a timer to signal the command
> timed out.

I haven't spent any amount of time looking at SDHCI driver.

If we have no other better solution, enabling the DTO timer for this
specific case on all dw_mmc might be a good idea?


> ooi, anyone knows what was the test case that caused the necessity of
> the DTO quirk?

If I remember correctly, the DTO quirk is necessary to get basic
tuning working on almost every UHS card out there on rk3288.  Take it
out and you'll see lots of problems.  If you have it in there and add
a printout when the DTO quirk fires, you'll see your printout a decent
amount during tuning.  Should be easy to test that...

Those same cards work fine on exynos devices without the DTO quirk.


-Doug

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-24 11:26   ` Enric Balletbo Serra
  2016-03-24 15:16     ` Doug Anderson
@ 2016-03-24 15:30     ` Russell King - ARM Linux
  2016-03-24 16:06       ` Doug Anderson
  1 sibling, 1 reply; 19+ messages in thread
From: Russell King - ARM Linux @ 2016-03-24 15:30 UTC (permalink / raw)
  To: Enric Balletbo Serra
  Cc: Doug Anderson, linux-mmc, linux-kernel, Alim Akhtar,
	Jaehoon Chung, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Chris Zhong, Caesar Wang, Javier Martinez Canillas

On Thu, Mar 24, 2016 at 12:26:43PM +0100, Enric Balletbo Serra wrote:
> I just saw that Russell introduced a patch [1] that will land on 4.6.
> I think that patch solves the same issue that we're trying to fix, but
> for sdhci controller.

It doesn't sound like the same issue to me, though it was a long while
back when I was looking at sdhci, so I may be misremembering.

> The problem that we have on peach-pi, with our patch applied, is that
> when we get a response CRC error on a command and we move to start
> sending data, the transfer doesn't receives a timeout interrupt (I
> don't know why). As I told, on rockchip works due the DTO quirk.
> exynos is not using this quirk. Also, please correct me if I'm wrong,
> looks like the sdhci controller has a timer to signal the command
> timed out.

>From what I remember, the problem I was seeing is that SDHCI sends a
command (iirc, a tuning command), and receives a response CRC error.
The card, however, knows nothing about the CRC error, so it moves into
the transfer state.

Meanwhile, SDHCI stopped processing the command, resetting the SDHCI
controller and reporting the error to the upper layers.

Then, a new command gets queued, issued to the card, and this fails
because the card is still in transfer state.  This totally screws up
the SDHCI UHS tuning.

This is not the only SDHCI UHS tuning bug: others exist which do not
yet have patches, where we can get spurious false positives/false
negatives for various tuning steps which totally confuse the code.

>From what you say above, your issue is that you get a response CRC
error, but the dw MMC masks the data side, which sounds like a
different solution is needed.  The MMC block driver error handling
is fairly robust, but there is no core error handling (because the
error handling is not obvious.)  So any command not eminating from
the MMC block driver that invokes a transfer from the card which
fails won't have a stop command sent for it.

Maybe that's a weakness of the core MMC code: when I originally
designed that part of the MMC code, my thoughts were to leave error
handling to the higher levels (such as MMC block) because its
dependent on those higher levels.  Eg, the various status bits
which report errors, whether a stop command needs to be issued,
etc.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-24 15:30     ` Russell King - ARM Linux
@ 2016-03-24 16:06       ` Doug Anderson
  2016-03-24 16:22         ` Russell King - ARM Linux
  0 siblings, 1 reply; 19+ messages in thread
From: Doug Anderson @ 2016-03-24 16:06 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Enric Balletbo Serra, linux-mmc, linux-kernel, Alim Akhtar,
	Jaehoon Chung, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Chris Zhong, Caesar Wang, Javier Martinez Canillas

Russell,

On Thu, Mar 24, 2016 at 8:30 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Thu, Mar 24, 2016 at 12:26:43PM +0100, Enric Balletbo Serra wrote:
>> I just saw that Russell introduced a patch [1] that will land on 4.6.
>> I think that patch solves the same issue that we're trying to fix, but
>> for sdhci controller.
>
> It doesn't sound like the same issue to me, though it was a long while
> back when I was looking at sdhci, so I may be misremembering.
>
>> The problem that we have on peach-pi, with our patch applied, is that
>> when we get a response CRC error on a command and we move to start
>> sending data, the transfer doesn't receives a timeout interrupt (I
>> don't know why). As I told, on rockchip works due the DTO quirk.
>> exynos is not using this quirk. Also, please correct me if I'm wrong,
>> looks like the sdhci controller has a timer to signal the command
>> timed out.
>
> From what I remember, the problem I was seeing is that SDHCI sends a
> command (iirc, a tuning command), and receives a response CRC error.
> The card, however, knows nothing about the CRC error, so it moves into
> the transfer state.
>
> Meanwhile, SDHCI stopped processing the command, resetting the SDHCI
> controller and reporting the error to the upper layers.
>
> Then, a new command gets queued, issued to the card, and this fails
> because the card is still in transfer state.  This totally screws up
> the SDHCI UHS tuning.
>
> This is not the only SDHCI UHS tuning bug: others exist which do not
> yet have patches, where we can get spurious false positives/false
> negatives for various tuning steps which totally confuse the code.
>
> From what you say above, your issue is that you get a response CRC
> error, but the dw MMC masks the data side, which sounds like a
> different solution is needed.

What I was seeing that when the controller saw the CRC error it tried
to abort with a "stop" command.  You can see the
"send_stop_abort(host, data)" in dw_mmc.c.  Then I saw:

> Sending the stop command after the "response CRC error" would
> then throw the system into a confused state causing all future
> tuning phases to report failure.

Presumably this is similar to what you saw: the host saw the CRC error
but the card knew nothing about it.  Sending the stop command during
this time confused the card.  Presumably the card was in transfer
state during this time?


-Doug

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-24 16:06       ` Doug Anderson
@ 2016-03-24 16:22         ` Russell King - ARM Linux
  2016-03-30 17:16           ` Enric Balletbo Serra
  2016-03-31 18:12           ` Doug Anderson
  0 siblings, 2 replies; 19+ messages in thread
From: Russell King - ARM Linux @ 2016-03-24 16:22 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Enric Balletbo Serra, linux-mmc, linux-kernel, Alim Akhtar,
	Jaehoon Chung, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Chris Zhong, Caesar Wang, Javier Martinez Canillas

On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
> Russell,
...
> Presumably this is similar to what you saw: the host saw the CRC error
> but the card knew nothing about it.  Sending the stop command during
> this time confused the card.  Presumably the card was in transfer
> state during this time?

If the card was in transfer state for a command which expects a stop
command, and that stop command was issued after the card entered
the transfer state, then I'd expect the card to handle it... though
there's always the firmware bug issue.

If the card hadn't entered transfer state at the time the stop command
was issued.. I think that's more likely to hit card firmware issues.

With the tuning commands, there's another case you can hit though:
the data transfer may have completed before you get around to sending
the stop command.

That's why, for sdhci, I came to the conclusion that waiting for the
data transfer to complete or timeout was the best solution for SDHCI.

Maybe, if sending a STOP command does cause card firmware issues, then:

1) it provides evidence that trying to send a stop command on response
   CRC error is the wrong thing to do (it was talked about making SDHCI
   do this.)

2) it suggests that the solution I came up with for SDHCI is the better
   solution, rather than trying to immediately recover the situation by
   sending a STOP command.

Maybe dw-mmc can do something similar, but with the lack of data transfer
timeout, maybe it's possible to do something with a kernel timer instead,
and check what the hardware is doing after a response CRC error?

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-24 16:22         ` Russell King - ARM Linux
@ 2016-03-30 17:16           ` Enric Balletbo Serra
  2016-03-30 17:26             ` Russell King - ARM Linux
  2016-03-31  2:03             ` Jaehoon Chung
  2016-03-31 18:12           ` Doug Anderson
  1 sibling, 2 replies; 19+ messages in thread
From: Enric Balletbo Serra @ 2016-03-30 17:16 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Doug Anderson, linux-mmc, linux-kernel, Alim Akhtar,
	Jaehoon Chung, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Chris Zhong, Caesar Wang, Javier Martinez Canillas

2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <linux@arm.linux.org.uk>:
> On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
>> Russell,
> ...
>> Presumably this is similar to what you saw: the host saw the CRC error
>> but the card knew nothing about it.  Sending the stop command during
>> this time confused the card.  Presumably the card was in transfer
>> state during this time?
>
> If the card was in transfer state for a command which expects a stop
> command, and that stop command was issued after the card entered
> the transfer state, then I'd expect the card to handle it... though
> there's always the firmware bug issue.
>
> If the card hadn't entered transfer state at the time the stop command
> was issued.. I think that's more likely to hit card firmware issues.
>
> With the tuning commands, there's another case you can hit though:
> the data transfer may have completed before you get around to sending
> the stop command.
>
> That's why, for sdhci, I came to the conclusion that waiting for the
> data transfer to complete or timeout was the best solution for SDHCI.
>

In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
behaviour. What does this is use a kernel timer to signal when DTO
interrupt does NOT come. Note that if I disable this quirk I can also
saw the problem on rockchip.

> Maybe, if sending a STOP command does cause card firmware issues, then:
>
> 1) it provides evidence that trying to send a stop command on response
>    CRC error is the wrong thing to do (it was talked about making SDHCI
>    do this.)
>

Seems the same here, so guess is the wrong thing to do.

> 2) it suggests that the solution I came up with for SDHCI is the better
>    solution, rather than trying to immediately recover the situation by
>    sending a STOP command.
>

I'm wondering if just enable this quirk on exynos too is the proper
solution. Unfortunately I don't have enough documentation to check
differences between those controllers.
Also will really help have access to some hardware that uses
dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
Anyone with the hardware who can do some tests?


> Maybe dw-mmc can do something similar, but with the lack of data transfer
> timeout, maybe it's possible to do something with a kernel timer instead,
> and check what the hardware is doing after a response CRC error?
>
> --
> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-30 17:16           ` Enric Balletbo Serra
@ 2016-03-30 17:26             ` Russell King - ARM Linux
       [not found]               ` <CAFqH_51sMLbURO4n7OTEuC8S-6w0Q4aRv47nEoCAmK8-MJ+Jbw@mail.gmail.com>
  2016-03-31  1:56               ` Shawn Lin
  2016-03-31  2:03             ` Jaehoon Chung
  1 sibling, 2 replies; 19+ messages in thread
From: Russell King - ARM Linux @ 2016-03-30 17:26 UTC (permalink / raw)
  To: Enric Balletbo Serra
  Cc: Doug Anderson, linux-mmc, linux-kernel, Alim Akhtar,
	Jaehoon Chung, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Chris Zhong, Caesar Wang, Javier Martinez Canillas

On Wed, Mar 30, 2016 at 07:16:18PM +0200, Enric Balletbo Serra wrote:
> 2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <linux@arm.linux.org.uk>:
> > On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
> >> Russell,
> > ...
> >> Presumably this is similar to what you saw: the host saw the CRC error
> >> but the card knew nothing about it.  Sending the stop command during
> >> this time confused the card.  Presumably the card was in transfer
> >> state during this time?
> >
> > If the card was in transfer state for a command which expects a stop
> > command, and that stop command was issued after the card entered
> > the transfer state, then I'd expect the card to handle it... though
> > there's always the firmware bug issue.
> >
> > If the card hadn't entered transfer state at the time the stop command
> > was issued.. I think that's more likely to hit card firmware issues.
> >
> > With the tuning commands, there's another case you can hit though:
> > the data transfer may have completed before you get around to sending
> > the stop command.
> >
> > That's why, for sdhci, I came to the conclusion that waiting for the
> > data transfer to complete or timeout was the best solution for SDHCI.
> >
> 
> In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
> it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
> behaviour. What does this is use a kernel timer to signal when DTO
> interrupt does NOT come. Note that if I disable this quirk I can also
> saw the problem on rockchip.
> 
> > Maybe, if sending a STOP command does cause card firmware issues, then:
> >
> > 1) it provides evidence that trying to send a stop command on response
> >    CRC error is the wrong thing to do (it was talked about making SDHCI
> >    do this.)
> >
> 
> Seems the same here, so guess is the wrong thing to do.
> 
> > 2) it suggests that the solution I came up with for SDHCI is the better
> >    solution, rather than trying to immediately recover the situation by
> >    sending a STOP command.
> >
> 
> I'm wondering if just enable this quirk on exynos too is the proper
> solution. Unfortunately I don't have enough documentation to check
> differences between those controllers.
> Also will really help have access to some hardware that uses
> dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
> Anyone with the hardware who can do some tests?

I'd really suggest that the dw-mmc folk place a moritorium on quirk
flags, and instead deal with situations like this without resorting
to this kind of thing.

sdhci is a good example why the quirk flag approach is totally wrong,
and shows that it leads to an unmaintainable mess.  If dw-mmc people
don't want the driver to decend into the same state that sdhci is,
then things like this should not be quirks.  sdhci already has a
long-term moritorium on quirk flags until the resulting mess has been
cleaned up.

The danger that quirk flags cause is also highlighted in your mail:
it's very likely that this _isn't_ a host controller issue at all,
but a MMC protocol issue or a card issue - and the behaviour required
here is not specific to any particular host controller.  The problem
with having a quirk flag for it is that you end up with some hosts
enabling it, and other hosts having it disabled only because they
haven't yet tripped over the issue.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
       [not found]               ` <CAFqH_51sMLbURO4n7OTEuC8S-6w0Q4aRv47nEoCAmK8-MJ+Jbw@mail.gmail.com>
@ 2016-03-30 20:39                 ` Russell King - ARM Linux
  0 siblings, 0 replies; 19+ messages in thread
From: Russell King - ARM Linux @ 2016-03-30 20:39 UTC (permalink / raw)
  To: Enric Balletbo Serra
  Cc: Doug Anderson, Heiko Stübner, Sonny Rao, Caesar Wang,
	Alim Akhtar, Ulf Hansson, Alexandru Stan, Alim Akhtar,
	Javier Martinez Canillas, Addy Ke, Andrew Bresticker, linux-mmc,
	Jaehoon Chung, linux-kernel, Chris Zhong

On Wed, Mar 30, 2016 at 10:06:36PM +0200, Enric Balletbo Serra wrote:
> El dia 30/03/2016 19:26, "Russell King - ARM Linux" <linux@arm.linux.org.uk>
> va escriure:
> > I'd really suggest that the dw-mmc folk place a moritorium on quirk
> > flags, and instead deal with situations like this without resorting
> > to this kind of thing.
> >
> > sdhci is a good example why the quirk flag approach is totally wrong,
> > and shows that it leads to an unmaintainable mess.  If dw-mmc people
> > don't want the driver to decend into the same state that sdhci is,
> > then things like this should not be quirks.  sdhci already has a
> > long-term moritorium on quirk flags until the resulting mess has been
> > cleaned up.
> >
> 
> You mean that was a mess in the past and now is cleaned up?

SDHCI is far from being "cleaned up" - the conclusion that I've put
forward, and Ulf appears to agree with is that the core SDHCI driver
needs to become a library.

We then need individual drivers, which make selective use of the
library functions, and the "quirks" implemented as either fixes to
the core code or implemented by replacement functions.

The problem with SDHCI is that it's not far off having every other
line of code being conditional on a quirk flag - I've submitted to
very large series of patches so far doing a series of code transforms
to reduce this, but the job is nowhere near complete.  Given the
number of drivers, it's something that needs to be done with care
and over a period of time.

This is why I'm warning about this as soon as I've heard another
driver going down the quirks route - hopefully you can avoid this
mistake early enough that it doesn't become a several year project
to sort out.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-30 17:26             ` Russell King - ARM Linux
       [not found]               ` <CAFqH_51sMLbURO4n7OTEuC8S-6w0Q4aRv47nEoCAmK8-MJ+Jbw@mail.gmail.com>
@ 2016-03-31  1:56               ` Shawn Lin
  1 sibling, 0 replies; 19+ messages in thread
From: Shawn Lin @ 2016-03-31  1:56 UTC (permalink / raw)
  To: Russell King - ARM Linux, Enric Balletbo Serra
  Cc: shawn.lin, Doug Anderson, linux-mmc, linux-kernel, Alim Akhtar,
	Jaehoon Chung, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Chris Zhong, Caesar Wang, Javier Martinez Canillas

在 2016/3/31 1:26, Russell King - ARM Linux 写道:
> On Wed, Mar 30, 2016 at 07:16:18PM +0200, Enric Balletbo Serra wrote:
>> 2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <linux@arm.linux.org.uk>:
>>> On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
>>>> Russell,
>>> ...
>>>> Presumably this is similar to what you saw: the host saw the CRC error
>>>> but the card knew nothing about it.  Sending the stop command during
>>>> this time confused the card.  Presumably the card was in transfer
>>>> state during this time?
>>>
>>> If the card was in transfer state for a command which expects a stop
>>> command, and that stop command was issued after the card entered
>>> the transfer state, then I'd expect the card to handle it... though
>>> there's always the firmware bug issue.
>>>
>>> If the card hadn't entered transfer state at the time the stop command
>>> was issued.. I think that's more likely to hit card firmware issues.
>>>
>>> With the tuning commands, there's another case you can hit though:
>>> the data transfer may have completed before you get around to sending
>>> the stop command.
>>>
>>> That's why, for sdhci, I came to the conclusion that waiting for the
>>> data transfer to complete or timeout was the best solution for SDHCI.
>>>
>>
>> In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
>> it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
>> behaviour. What does this is use a kernel timer to signal when DTO
>> interrupt does NOT come. Note that if I disable this quirk I can also
>> saw the problem on rockchip.
>>
>>> Maybe, if sending a STOP command does cause card firmware issues, then:
>>>
>>> 1) it provides evidence that trying to send a stop command on response
>>>     CRC error is the wrong thing to do (it was talked about making SDHCI
>>>     do this.)
>>>
>>
>> Seems the same here, so guess is the wrong thing to do.
>>
>>> 2) it suggests that the solution I came up with for SDHCI is the better
>>>     solution, rather than trying to immediately recover the situation by
>>>     sending a STOP command.
>>>
>>
>> I'm wondering if just enable this quirk on exynos too is the proper
>> solution. Unfortunately I don't have enough documentation to check
>> differences between those controllers.
>> Also will really help have access to some hardware that uses
>> dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
>> Anyone with the hardware who can do some tests?
>
> I'd really suggest that the dw-mmc folk place a moritorium on quirk
> flags, and instead deal with situations like this without resorting
> to this kind of thing.
>

Some quirks and some callbacks have been cleaned in Jaehoon's repo,and
still some are going to removed. Finally we do plan to turn dw_mmc core
into a pure library..

> sdhci is a good example why the quirk flag approach is totally wrong,
> and shows that it leads to an unmaintainable mess.  If dw-mmc people
> don't want the driver to decend into the same state that sdhci is,
> then things like this should not be quirks.  sdhci already has a
> long-term moritorium on quirk flags until the resulting mess has been
> cleaned up.
>
> The danger that quirk flags cause is also highlighted in your mail:
> it's very likely that this _isn't_ a host controller issue at all,

Two issues found by dw_mmc-rockchip part,
(1) need reset idma when switching between fifo-transfer and
idma-transfer. When biu:ciu > 1:6, idma internal fsm take a risk of
a race condition to maintain its fifo lookup pointer. It can be very
easy reproduce by seting biu:ciu > 1:6.. Common bug for dw_mmc! But 
unfortunately these details was missing in the commit msg.

(2) Missing DTO/DRTO; I missed the thread for this topic, so I need to
reproduce it by setting a simple C model code. I can't say more
currently until we can find a way to easily reproduce it. But I guess
it's NOT a host issue....since I slightly glance at the TMOUT reg at 
dw_mmc databook and find a software timer requirement:

31:8 data_timeout 0xffffff
Value for card Data Read Timeout; same value also used for Data
Starvation by Host timeout. The timeout counter is started only after 
thecard clock is stopped. Value is in number of card output clocks – 
cclk_out of selected card.

Note: The software timer should be used if the timeout value is in the 
order of 100 ms. In this case, read data timeout interrupt needs to be 
disabled.

> but a MMC protocol issue or a card issue - and the behaviour required
> here is not specific to any particular host controller.  The problem
> with having a quirk flag for it is that you end up with some hosts
> enabling it, and other hosts having it disabled only because they
> haven't yet tripped over the issue.
>


-- 
Best Regards
Shawn Lin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-30 17:16           ` Enric Balletbo Serra
  2016-03-30 17:26             ` Russell King - ARM Linux
@ 2016-03-31  2:03             ` Jaehoon Chung
  2016-03-31  6:39               ` Enric Balletbo Serra
  1 sibling, 1 reply; 19+ messages in thread
From: Jaehoon Chung @ 2016-03-31  2:03 UTC (permalink / raw)
  To: Enric Balletbo Serra, Russell King - ARM Linux
  Cc: Doug Anderson, linux-mmc, linux-kernel, Alim Akhtar, Ulf Hansson,
	Alim Akhtar, Sonny Rao, Andrew Bresticker, Heiko Stuebner,
	Addy Ke, Alexandru Stan, Chris Zhong, Caesar Wang,
	Javier Martinez Canillas

On 03/31/2016 02:16 AM, Enric Balletbo Serra wrote:
> 2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <linux@arm.linux.org.uk>:
>> On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
>>> Russell,
>> ...
>>> Presumably this is similar to what you saw: the host saw the CRC error
>>> but the card knew nothing about it.  Sending the stop command during
>>> this time confused the card.  Presumably the card was in transfer
>>> state during this time?
>>
>> If the card was in transfer state for a command which expects a stop
>> command, and that stop command was issued after the card entered
>> the transfer state, then I'd expect the card to handle it... though
>> there's always the firmware bug issue.
>>
>> If the card hadn't entered transfer state at the time the stop command
>> was issued.. I think that's more likely to hit card firmware issues.
>>
>> With the tuning commands, there's another case you can hit though:
>> the data transfer may have completed before you get around to sending
>> the stop command.
>>
>> That's why, for sdhci, I came to the conclusion that waiting for the
>> data transfer to complete or timeout was the best solution for SDHCI.
>>
> 
> In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
> it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
> behaviour. What does this is use a kernel timer to signal when DTO
> interrupt does NOT come. Note that if I disable this quirk I can also
> saw the problem on rockchip.

Did you see the problem with exynos? Could you share which exynos chip you use?
Then i can check this with all exynos.

> 
>> Maybe, if sending a STOP command does cause card firmware issues, then:
>>
>> 1) it provides evidence that trying to send a stop command on response
>>    CRC error is the wrong thing to do (it was talked about making SDHCI
>>    do this.)
>>
> 
> Seems the same here, so guess is the wrong thing to do.
> 
>> 2) it suggests that the solution I came up with for SDHCI is the better
>>    solution, rather than trying to immediately recover the situation by
>>    sending a STOP command.
>>
> 
> I'm wondering if just enable this quirk on exynos too is the proper
> solution. Unfortunately I don't have enough documentation to check
> differences between those controllers.
> Also will really help have access to some hardware that uses
> dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
> Anyone with the hardware who can do some tests?

I want to remove all quirks for dwmmc controller. (in progressing with Shawn.)

> 
> 
>> Maybe dw-mmc can do something similar, but with the lack of data transfer
>> timeout, maybe it's possible to do something with a kernel timer instead,
>> and check what the hardware is doing after a response CRC error?
>>
>> --
>> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
>> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
>> according to speedtest.net.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-31  2:03             ` Jaehoon Chung
@ 2016-03-31  6:39               ` Enric Balletbo Serra
  0 siblings, 0 replies; 19+ messages in thread
From: Enric Balletbo Serra @ 2016-03-31  6:39 UTC (permalink / raw)
  To: Jaehoon Chung
  Cc: Russell King - ARM Linux, Doug Anderson, linux-mmc, linux-kernel,
	Alim Akhtar, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Chris Zhong, Caesar Wang, Javier Martinez Canillas

2016-03-31 4:03 GMT+02:00 Jaehoon Chung <jh80.chung@samsung.com>:
> On 03/31/2016 02:16 AM, Enric Balletbo Serra wrote:
>> 2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <linux@arm.linux.org.uk>:
>>> On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
>>>> Russell,
>>> ...
>>>> Presumably this is similar to what you saw: the host saw the CRC error
>>>> but the card knew nothing about it.  Sending the stop command during
>>>> this time confused the card.  Presumably the card was in transfer
>>>> state during this time?
>>>
>>> If the card was in transfer state for a command which expects a stop
>>> command, and that stop command was issued after the card entered
>>> the transfer state, then I'd expect the card to handle it... though
>>> there's always the firmware bug issue.
>>>
>>> If the card hadn't entered transfer state at the time the stop command
>>> was issued.. I think that's more likely to hit card firmware issues.
>>>
>>> With the tuning commands, there's another case you can hit though:
>>> the data transfer may have completed before you get around to sending
>>> the stop command.
>>>
>>> That's why, for sdhci, I came to the conclusion that waiting for the
>>> data transfer to complete or timeout was the best solution for SDHCI.
>>>
>>
>> In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
>> it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
>> behaviour. What does this is use a kernel timer to signal when DTO
>> interrupt does NOT come. Note that if I disable this quirk I can also
>> saw the problem on rockchip.
>
> Did you see the problem with exynos? Could you share which exynos chip you use?
> Then i can check this with all exynos.
>

Applying this patch I saw the problem with peach-pi, exynos 58000.

>>
>>> Maybe, if sending a STOP command does cause card firmware issues, then:
>>>
>>> 1) it provides evidence that trying to send a stop command on response
>>>    CRC error is the wrong thing to do (it was talked about making SDHCI
>>>    do this.)
>>>
>>
>> Seems the same here, so guess is the wrong thing to do.
>>
>>> 2) it suggests that the solution I came up with for SDHCI is the better
>>>    solution, rather than trying to immediately recover the situation by
>>>    sending a STOP command.
>>>
>>
>> I'm wondering if just enable this quirk on exynos too is the proper
>> solution. Unfortunately I don't have enough documentation to check
>> differences between those controllers.
>> Also will really help have access to some hardware that uses
>> dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
>> Anyone with the hardware who can do some tests?
>
> I want to remove all quirks for dwmmc controller. (in progressing with Shawn.)
>

Nice, do you have a work-in-progress repository?  Maybe I can help
testing the patches.

>>
>>
>>> Maybe dw-mmc can do something similar, but with the lack of data transfer
>>> timeout, maybe it's possible to do something with a kernel timer instead,
>>> and check what the hardware is doing after a response CRC error?
>>>
>>> --
>>> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
>>> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
>>> according to speedtest.net.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2016-03-24 16:22         ` Russell King - ARM Linux
  2016-03-30 17:16           ` Enric Balletbo Serra
@ 2016-03-31 18:12           ` Doug Anderson
  1 sibling, 0 replies; 19+ messages in thread
From: Doug Anderson @ 2016-03-31 18:12 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Enric Balletbo Serra, linux-mmc, linux-kernel, Alim Akhtar,
	Jaehoon Chung, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Chris Zhong, Caesar Wang, Javier Martinez Canillas

Hi,

On Thu, Mar 24, 2016 at 9:22 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> That's why, for sdhci, I came to the conclusion that waiting for the
> data transfer to complete or timeout was the best solution for SDHCI.
>
> Maybe, if sending a STOP command does cause card firmware issues, then:
>
> 1) it provides evidence that trying to send a stop command on response
>    CRC error is the wrong thing to do (it was talked about making SDHCI
>    do this.)
>
> 2) it suggests that the solution I came up with for SDHCI is the better
>    solution, rather than trying to immediately recover the situation by
>    sending a STOP command.
>
> Maybe dw-mmc can do something similar, but with the lack of data transfer
> timeout, maybe it's possible to do something with a kernel timer instead,
> and check what the hardware is doing after a response CRC error?

Since the problem only reproduced with a single model of a single
brand of card, it is probably a card firmware issue as you say.
Presumably if we put this card in an exynos that had tuning enabled
we'd seem similar issues.  I haven't had a chance to test this.

dw_mmc does have a data transfer timeout, but for some reason it
doesn't seem to fire reliably on Rockchip.  Sounds like Shawn may be
digging here?  In the past I've found that the same code running on
rk3288 and exynos would fire the timeout on exynos but not on rk3288.
That's the reason why rk3288 has a special quirk to enable a software
timer.

In the case of CRC error perhaps the controller sin't sending a data
timeout because it already told us about the CRC error (so need to
report further?), but I guess the quirk on the Rockhip platform makes
it work.

Sounds like the right fix is to enable a timer after the CRC error
(similar to the DTO quirk) and not send a STOP.  Even if it's not a
firmware problem, keeping dw_mmc behaving more like SDHCI is a good
idea because presumably SD cards out there will not test against all
controllers, so we'll have the most compatibility if all controllers
behave the same (even if the spec technically allows them to behave
otherwise).

-Doug

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2015-05-27  1:53     ` Jaehoon Chung
@ 2015-05-27 16:52       ` Doug Anderson
  0 siblings, 0 replies; 19+ messages in thread
From: Doug Anderson @ 2015-05-27 16:52 UTC (permalink / raw)
  To: Jaehoon Chung
  Cc: Alim Akhtar, Seungwon Jeon, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Javier Martinez Canillas, Chris Zhong, Caesar Wang, linux-mmc,
	linux-kernel

Hi,

On Tue, May 26, 2015 at 6:53 PM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>>> Have not checked the dw TRM for this change, will do that as soon as I
>>> get access to it.
>>
>> OK, sounds good.  I have some old version of the DesignWare TRM, so
>> possibly something is different in the newer one...
>
> Which version do you have? I also need to check the TRM version.

It is ancient.  It says 2.41a July 2011.  If anyone has the ability to
give me a newer version, I wouldn't object.

-Doug

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2015-05-26 20:44   ` Doug Anderson
@ 2015-05-27  1:53     ` Jaehoon Chung
  2015-05-27 16:52       ` Doug Anderson
  0 siblings, 1 reply; 19+ messages in thread
From: Jaehoon Chung @ 2015-05-27  1:53 UTC (permalink / raw)
  To: Doug Anderson, Alim Akhtar
  Cc: Seungwon Jeon, Ulf Hansson, Alim Akhtar, Sonny Rao,
	Andrew Bresticker, Heiko Stuebner, Addy Ke, Alexandru Stan,
	Javier Martinez Canillas, Chris Zhong, Caesar Wang, linux-mmc,
	linux-kernel

Hi, All.

Thanks for your effort and sorry for late!

On 05/27/2015 05:44 AM, Doug Anderson wrote:
> Alim,
> 
> On Tue, May 26, 2015 at 11:02 AM, Alim Akhtar <alim.akhtar@gmail.com> wrote:
>> Hi Doug,
>> On peach-pi, I got a hung task once in 4 cold boot as [1].
> 
> OK, I'll have to get my peach-pi or peach-pit up and running again.  I
> ran out of desk space and I haven't gotten it set back up.  :(
> 
> I've been testing on an rk3288-based device.  Past experience has
> taught me that the rk3288 dw_mmc works differently than the exynos
> one, so perhaps this is a difference.
> 
> Could you possibly patch in something like
> <https://chromium-review.googlesource.com/#/c/244347/> and provide the
> console for the failure?  I'll put it on my list to try this myself,
> too

I don't test with this..but if needs, i will test.
It needs to fix the Alim's problem.

> 
> 
>> I was checking on v4.1-rc5, git hash as below:
>>
>> 862e58a mmc: dw_mmc: Wait for data transfer after response errors
>> ba155e2 Linux 4.1-rc5
>> 5b13966
>>
>> Not sure if I missed any dependent patch??
> 
> I'm currently testing out of tree, but my dw_mmc is very close to mainline.
> 
> 
>> Have not checked the dw TRM for this change, will do that as soon as I
>> get access to it.
> 
> OK, sounds good.  I have some old version of the DesignWare TRM, so
> possibly something is different in the newer one...

Which version do you have? I also need to check the TRM version.

Best Regards,
Jaehoon Chung

> 
> 
> -Doug
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2015-05-26 18:02 ` Alim Akhtar
@ 2015-05-26 20:44   ` Doug Anderson
  2015-05-27  1:53     ` Jaehoon Chung
  0 siblings, 1 reply; 19+ messages in thread
From: Doug Anderson @ 2015-05-26 20:44 UTC (permalink / raw)
  To: Alim Akhtar
  Cc: Jaehoon Chung, Seungwon Jeon, Ulf Hansson, Alim Akhtar,
	Sonny Rao, Andrew Bresticker, Heiko Stuebner, Addy Ke,
	Alexandru Stan, Javier Martinez Canillas, Chris Zhong,
	Caesar Wang, linux-mmc, linux-kernel

Alim,

On Tue, May 26, 2015 at 11:02 AM, Alim Akhtar <alim.akhtar@gmail.com> wrote:
> Hi Doug,
> On peach-pi, I got a hung task once in 4 cold boot as [1].

OK, I'll have to get my peach-pi or peach-pit up and running again.  I
ran out of desk space and I haven't gotten it set back up.  :(

I've been testing on an rk3288-based device.  Past experience has
taught me that the rk3288 dw_mmc works differently than the exynos
one, so perhaps this is a difference.

Could you possibly patch in something like
<https://chromium-review.googlesource.com/#/c/244347/> and provide the
console for the failure?  I'll put it on my list to try this myself,
too

> I was checking on v4.1-rc5, git hash as below:
>
> 862e58a mmc: dw_mmc: Wait for data transfer after response errors
> ba155e2 Linux 4.1-rc5
> 5b13966
>
> Not sure if I missed any dependent patch??

I'm currently testing out of tree, but my dw_mmc is very close to mainline.

> Have not checked the dw TRM for this change, will do that as soon as I
> get access to it.

OK, sounds good.  I have some old version of the DesignWare TRM, so
possibly something is different in the newer one...

-Doug

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
  2015-05-18 15:53 Doug Anderson
@ 2015-05-26 18:02 ` Alim Akhtar
  2015-05-26 20:44   ` Doug Anderson
  0 siblings, 1 reply; 19+ messages in thread
From: Alim Akhtar @ 2015-05-26 18:02 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Jaehoon Chung, Seungwon Jeon, Ulf Hansson, Alim Akhtar,
	Sonny Rao, Andrew Bresticker, Heiko Stuebner, Addy Ke,
	Alexandru Stan, Javier Martinez Canillas, Chris Zhong,
	Caesar Wang, linux-mmc, linux-kernel

Hi Doug,
On peach-pi, I got a hung task once in 4 cold boot as [1].
And every time got a hung task [2] on suspend/resume, triggered
exactly from this change. I have a debug print at $SUBJECT change.
[1]:
----------------
on boot:
[  240.197190] INFO: task kworker/u16:1:50 blocked for more than 120 seconds.
[  240.202602]       Not tainted 4.1.0-rc5-00001-g862e58a-dirty #6
[  240.208505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  240.216307] kworker/u16:1   D c0505528     0    50      2 0x00000000
[  240.222650] Workqueue: kmmcd mmc_rescan
[  240.226448] [<c0505528>] (__schedule) from [<c0505924>] (schedule+0x34/0x98)
[  240.233496] [<c0505924>] (schedule) from [<c0509274>]
(schedule_timeout+0x110/0x164)
[  240.241207] [<c0509274>] (schedule_timeout) from [<c0506258>]
(wait_for_common+0xb8/0x14c)
[  240.249454] [<c0506258>] (wait_for_common) from [<c03b2b5c>]
(mmc_wait_for_req+0xb0/0x13c)
[  240.257691] [<c03b2b5c>] (mmc_wait_for_req) from [<c03b8684>]
(mmc_send_tuning+0x140/0x1b0)
[  240.266026] [<c03b8684>] (mmc_send_tuning) from [<c03cd45c>]
(dw_mci_exynos_execute_tuning+0x70/0x174)
[  240.275299] [<c03cd45c>] (dw_mci_exynos_execute_tuning) from
[<c03c97f0>] (dw_mci_execute_tuning+0x2c/0x38)
[  240.285021] [<c03c97f0>] (dw_mci_execute_tuning) from [<c03b3e90>]
(mmc_execute_tuning+0x30/0x60)
[  240.293864] [<c03b3e90>] (mmc_execute_tuning) from [<c03b6fe4>]
(mmc_init_card+0x8e0/0x1648)
[  240.302283] [<c03b6fe4>] (mmc_init_card) from [<c03b801c>]
(mmc_attach_mmc+0x90/0x15c)
[  240.310172] [<c03b801c>] (mmc_attach_mmc) from [<c03b4cf4>]
(mmc_rescan+0x29c/0x2e4)
[  240.317903] [<c03b4cf4>] (mmc_rescan) from [<c00342a0>]
(process_one_work+0x120/0x324)
[  240.325788] [<c00342a0>] (process_one_work) from [<c0034500>]
(worker_thread+0x28/0x490)
[  240.333861] [<c0034500>] (worker_thread) from [<c0039324>]
(kthread+0xd8/0xf4)
[  240.341058] [<c0039324>] (kthread) from [<c000f5a0>]
(ret_from_fork+0x14/0x34)
-----------


[2]: On s2r cycle:
========
[   71.107181] mmc_host mmc0: Bus speed (slot 0) = 50000000Hz (slot
req 400000Hz, actual 396825HZ div = 63)
[   71.147363] mmc_host mmc0: Bus speed (slot 0) = 200000000Hz (slot
req 200000000Hz, actual 200000000HZ div = 0)


[   71.155946] dwmmc_exynos 12200000.mmc: [ALIM] :
dw_mci_tasklet_func: 1593: cmd->data and err


[  240.197177] INFO: task sh:1645 blocked for more than 120 seconds.
[  240.201802]       Not tainted 4.1.0-rc5-00001-g862e58a-dirty #6
[  240.207713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  240.215518] sh              D c0505528     0  1645      1 0x00000000
[  240.221857] [<c0505528>] (__schedule) from [<c0505924>] (schedule+0x34/0x98)
[  240.228887] [<c0505924>] (schedule) from [<c0509274>]
(schedule_timeout+0x110/0x164)
[  240.236606] [<c0509274>] (schedule_timeout) from [<c0506258>]
(wait_for_common+0xb8/0x14c)
[  240.244851] [<c0506258>] (wait_for_common) from [<c03b2b5c>]
(mmc_wait_for_req+0xb0/0x13c)
[  240.253090] [<c03b2b5c>] (mmc_wait_for_req) from [<c03b8684>]
(mmc_send_tuning+0x140/0x1b0)
[  240.261417] [<c03b8684>] (mmc_send_tuning) from [<c03cd45c>]
(dw_mci_exynos_execute_tuning+0x70/0x174)
[  240.270699] [<c03cd45c>] (dw_mci_exynos_execute_tuning) from
[<c03c97f0>] (dw_mci_execute_tuning+0x2c/0x38)
[  240.280415] [<c03c97f0>] (dw_mci_execute_tuning) from [<c03b3e90>]
(mmc_execute_tuning+0x30/0x60)
[  240.289263] [<c03b3e90>] (mmc_execute_tuning) from [<c03b6fe4>]
(mmc_init_card+0x8e0/0x1648)
[  240.297675] [<c03b6fe4>] (mmc_init_card) from [<c03b7d98>]
(_mmc_resume+0x4c/0x78)
[  240.305222] [<c03b7d98>] (_mmc_resume) from [<c03b7e8c>]
(mmc_resume+0x1c/0x58)
[  240.312508] [<c03b7e8c>] (mmc_resume) from [<c03b5128>]
(mmc_bus_resume+0x1c/0x50)
[  240.320060] [<c03b5128>] (mmc_bus_resume) from [<c02c0008>]
(dpm_run_callback.isra.8+0x1c/0x48)
[  240.328736] [<c02c0008>] (dpm_run_callback.isra.8) from
[<c02c00e0>] (device_resume+0xac/0x180)
[  240.337436] [<c02c00e0>] (device_resume) from [<c02c1358>]
(dpm_resume+0xe8/0x210)
[  240.344957] [<c02c1358>] (dpm_resume) from [<c02c161c>]
(dpm_resume_end+0xc/0x18)
[  240.352419] [<c02c161c>] (dpm_resume_end) from [<c00577f8>]
(suspend_devices_and_enter+0x10c/0x3e4)
[  240.361437] [<c00577f8>] (suspend_devices_and_enter) from
[<c0057c38>] (pm_suspend+0x168/0x208)
[  240.370110] [<c0057c38>] (pm_suspend) from [<c00568e0>]
(state_store+0x6c/0xbc)
[  240.377400] [<c00568e0>] (state_store) from [<c01f6ac8>]
(kobj_attr_store+0x14/0x20)
[  240.385123] [<c01f6ac8>] (kobj_attr_store) from [<c012215c>]
(sysfs_kf_write+0x44/0x48)
[  240.393103] [<c012215c>] (sysfs_kf_write) from [<c01217d0>]
(kernfs_fop_write+0xb8/0x194)
[  240.401257] [<c01217d0>] (kernfs_fop_write) from [<c00c91a4>]
(__vfs_write+0x2c/0xd8)
[  240.409064] [<c00c91a4>] (__vfs_write) from [<c00ca05c>]
(vfs_write+0x90/0x14c)
[  240.416350] [<c00ca05c>] (vfs_write) from [<c00ca2e4>] (SyS_write+0x40/0x8c)
[  240.423378] [<c00ca2e4>] (SyS_write) from [<c000f500>]
(ret_fast_syscall+0x0/0x34)
====

I was checking on v4.1-rc5, git hash as below:

862e58a mmc: dw_mmc: Wait for data transfer after response errors
ba155e2 Linux 4.1-rc5
5b13966

Not sure if I missed any dependent patch??
Have not checked the dw TRM for this change, will do that as soon as I
get access to it.

Regards,
Alim


On Mon, May 18, 2015 at 9:23 PM, Doug Anderson <dianders@chromium.org> wrote:
> According to the DesignWare state machine description, after we get a
> "response error" or "response CRC error" we move into data transfer
> mode.  That means that we don't necessarily need to special case
> trying to deal with the failure right away.  We can wait until we are
> notified that the data transfer is complete (with or without errors)
> and then we can deal with the failure.
>
> It may sound strange to defer dealing with a command that we know will
> fail anyway, but this appears to fix a bug.  During tuning (CMD19) on
> a specific card on an rk3288-based system, we found that we could get
> a "response CRC error".  Sending the stop command after the "response
> CRC error" would then throw the system into a confused state causing
> all future tuning phases to report failure.
>
> When in the confused state, the controller would show these (hex codes
> are interrupt status register):
>  CMD ERR: 0x00000046 (cmd=19)
>  CMD ERR: 0x0000004e (cmd=12)
>  DATA ERR: 0x00000208
>  DATA ERR: 0x0000020c
>  CMD ERR: 0x00000104 (cmd=19)
>  CMD ERR: 0x00000104 (cmd=12)
>  DATA ERR: 0x00000208
>  DATA ERR: 0x0000020c
>  ...
>  ...
>
> It is inherently difficult to deal with the complexity of trying to
> correctly send a stop command while a data transfer is taking place
> since you need to deal with different corner cases caused by the fact
> that the data transfer could complete (with errors or without errors)
> during various places in sending the stop command (dw_mci_stop_dma,
> send_stop_abort, etc)
>
> Instead of adding a bunch of extra complexity to deal with this, it
> seems much simpler to just use the more straightforward (and less
> error-prone) path of letting the data transfer finish.  There
> shouldn't be any huge benefit to sending the stop command slightly
> earlier, anyway.
>
> Signed-off-by: Doug Anderson <dianders@chromium.org>
> ---
>  drivers/mmc/host/dw_mmc.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 5f5adaf..c081ce2 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -1574,6 +1574,26 @@ static void dw_mci_tasklet_func(unsigned long priv)
>                         }
>
>                         if (cmd->data && err) {
> +                               /*
> +                                * Controller will move into a data transfer
> +                                * state after a response error or response CRC
> +                                * error.  Let's let that finish before trying
> +                                * to send a stop, so we'll go to
> +                                * STATE_SENDING_DATA.
> +                                *
> +                                * Although letting the data transfer take place
> +                                * will waste a bit of time (we already know
> +                                * the command was bad), it can't cause any
> +                                * errors since it's possible it would have
> +                                * taken place anyway if this tasklet got
> +                                * delayed.  Allowing the transfer to take place
> +                                * avoids races and keeps things simple.
> +                                */
> +                               if (err != -ETIMEDOUT) {
> +                                       state = STATE_SENDING_DATA;
> +                                       continue;
> +                               }
> +
>                                 dw_mci_stop_dma(host);
>                                 send_stop_abort(host, data);
>                                 state = STATE_SENDING_STOP;
> --
> 2.2.0.rc0.207.ga3a616c
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards,
Alim

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] mmc: dw_mmc: Wait for data transfer after response errors
@ 2015-05-18 15:53 Doug Anderson
  2015-05-26 18:02 ` Alim Akhtar
  0 siblings, 1 reply; 19+ messages in thread
From: Doug Anderson @ 2015-05-18 15:53 UTC (permalink / raw)
  To: Jaehoon Chung, Seungwon Jeon, Ulf Hansson
  Cc: Alim Akhtar, Sonny Rao, Andrew Bresticker, Heiko Stuebner,
	Addy Ke, Alexandru Stan, javier.martinez, Chris Zhong,
	Caesar Wang, Doug Anderson, linux-mmc, linux-kernel

According to the DesignWare state machine description, after we get a
"response error" or "response CRC error" we move into data transfer
mode.  That means that we don't necessarily need to special case
trying to deal with the failure right away.  We can wait until we are
notified that the data transfer is complete (with or without errors)
and then we can deal with the failure.

It may sound strange to defer dealing with a command that we know will
fail anyway, but this appears to fix a bug.  During tuning (CMD19) on
a specific card on an rk3288-based system, we found that we could get
a "response CRC error".  Sending the stop command after the "response
CRC error" would then throw the system into a confused state causing
all future tuning phases to report failure.

When in the confused state, the controller would show these (hex codes
are interrupt status register):
 CMD ERR: 0x00000046 (cmd=19)
 CMD ERR: 0x0000004e (cmd=12)
 DATA ERR: 0x00000208
 DATA ERR: 0x0000020c
 CMD ERR: 0x00000104 (cmd=19)
 CMD ERR: 0x00000104 (cmd=12)
 DATA ERR: 0x00000208
 DATA ERR: 0x0000020c
 ...
 ...

It is inherently difficult to deal with the complexity of trying to
correctly send a stop command while a data transfer is taking place
since you need to deal with different corner cases caused by the fact
that the data transfer could complete (with errors or without errors)
during various places in sending the stop command (dw_mci_stop_dma,
send_stop_abort, etc)

Instead of adding a bunch of extra complexity to deal with this, it
seems much simpler to just use the more straightforward (and less
error-prone) path of letting the data transfer finish.  There
shouldn't be any huge benefit to sending the stop command slightly
earlier, anyway.

Signed-off-by: Doug Anderson <dianders@chromium.org>
---
 drivers/mmc/host/dw_mmc.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 5f5adaf..c081ce2 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -1574,6 +1574,26 @@ static void dw_mci_tasklet_func(unsigned long priv)
 			}

 			if (cmd->data && err) {
+				/*
+				 * Controller will move into a data transfer
+				 * state after a response error or response CRC
+				 * error.  Let's let that finish before trying
+				 * to send a stop, so we'll go to
+				 * STATE_SENDING_DATA.
+				 *
+				 * Although letting the data transfer take place
+				 * will waste a bit of time (we already know
+				 * the command was bad), it can't cause any
+				 * errors since it's possible it would have
+				 * taken place anyway if this tasklet got
+				 * delayed.  Allowing the transfer to take place
+				 * avoids races and keeps things simple.
+				 */
+				if (err != -ETIMEDOUT) {
+					state = STATE_SENDING_DATA;
+					continue;
+				}
+
 				dw_mci_stop_dma(host);
 				send_stop_abort(host, data);
 				state = STATE_SENDING_STOP;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-03-31 18:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-17 12:12 [PATCH] mmc: dw_mmc: Wait for data transfer after response errors Enric Balletbo Serra
2016-03-21 22:38 ` Doug Anderson
2016-03-24 11:26   ` Enric Balletbo Serra
2016-03-24 15:16     ` Doug Anderson
2016-03-24 15:30     ` Russell King - ARM Linux
2016-03-24 16:06       ` Doug Anderson
2016-03-24 16:22         ` Russell King - ARM Linux
2016-03-30 17:16           ` Enric Balletbo Serra
2016-03-30 17:26             ` Russell King - ARM Linux
     [not found]               ` <CAFqH_51sMLbURO4n7OTEuC8S-6w0Q4aRv47nEoCAmK8-MJ+Jbw@mail.gmail.com>
2016-03-30 20:39                 ` Russell King - ARM Linux
2016-03-31  1:56               ` Shawn Lin
2016-03-31  2:03             ` Jaehoon Chung
2016-03-31  6:39               ` Enric Balletbo Serra
2016-03-31 18:12           ` Doug Anderson
  -- strict thread matches above, loose matches on Subject: below --
2015-05-18 15:53 Doug Anderson
2015-05-26 18:02 ` Alim Akhtar
2015-05-26 20:44   ` Doug Anderson
2015-05-27  1:53     ` Jaehoon Chung
2015-05-27 16:52       ` Doug Anderson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.