All of lore.kernel.org
 help / color / mirror / Atom feed
* LKFT: arm x15: mmc1: cache flush error -110
@ 2020-02-13 15:42 Naresh Kamboju
  2020-02-14  9:09 ` Arnd Bergmann
  2020-02-19 16:23 ` Ulf Hansson
  0 siblings, 2 replies; 45+ messages in thread
From: Naresh Kamboju @ 2020-02-13 15:42 UTC (permalink / raw)
  To: Jens Axboe, Alexei Starovoitov, Ulf Hansson
  Cc: linux-block, lkft-triage, open list, linux-mmc, Arnd Bergmann,
	John Stultz

arm beagleboard x15 device failed to boot Linux mainline and
linux-next kernel due
to below error.
This error occurred across all x15 device for these kernel version.

This regression started happening on x15 from this commit onwards (27th Jan)
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  git commit: aae1464f46a2403565f75717438118691d31ccf1
  git describe: v5.5-489-gaae1464f46a2


Test output log,
[   37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
[   37.611850] mmc1: cache flush error -110
[   37.615883] blk_update_request: I/O error, dev mmcblk1, sector
4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[   37.627387] Aborting journal on device mmcblk1p9-8.
[   37.635448] systemd[1]: Installed transient /etc/machine-id file.
[   37.659283] systemd[1]: Couldn't move remaining userspace
processes, ignoring: Input/output error
[   37.744027] EXT4-fs error (device mmcblk1p9):
ext4_journal_check_start:61: Detected aborted journal
[   37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
[   37.917486] systemd-gpt-auto-generator[108]: Failed to dissect:
Input/output error
[   37.927825] systemd[104]:
/lib/systemd/system-generators/systemd-gpt-auto-generator failed with
exit status 1.
<>
[   68.856307] mmc1: Card stuck being busy! mmc_poll_for_busy
[   68.861838] mmc1: cache flush error -110
[   68.865812] blk_update_request: I/O error, dev mmcblk1, sector 0 op
0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
<>
[   98.906243] mmc1: Card stuck being busy! mmc_poll_for_busy
[   98.911774] mmc1: cache flush error -110
[   98.915747] blk_update_request: I/O error, dev mmcblk1, sector 0 op
0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
<>
Dependency failed for Serial Getty on ttyS2.
[  128.946258] mmc1: Card stuck being busy! mmc_poll_for_busy
[  128.951786] mmc1: cache flush error -110
[  128.955756] blk_update_request: I/O error, dev mmcblk1, sector 0 op
0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[FAILED] Failed to start File System Check on Root Device.
See 'systemctl status systemd-fsck-root.service' for details.
[  OK  ] Started Apply Kernel Variables.
[  OK  ] Reached target Login Prompts.
         Starting Remount Root and Kernel File Systems...
[  OK  ] Reached target Timers.
[  OK  ] Closed Syslog Socket.
[  OK  ] Started Emergency Shell.
[  129.227328] EXT4-fs error (device mmcblk1p9): ext4_remount:5354:
Abort forced by user
[  OK  ] Reached target Emergency Mode.
[  OK  ] Reached target Sockets.
[FAILED] Failed to start Remount Root and Kernel File Systems.
<>
You are in emergency mode. After logging in, type \"journalctl -xb\" to view
system logs, \"systemctl reboot\" to reboot, \"systemctl default\" or \"exit\"
to boot into default mode.
Press Enter for maintenance
auto-login-action timed out after 874 seconds

ref:
https://lkft.validation.linaro.org/scheduler/job/1137693#L4034
https://lkft.validation.linaro.org/scheduler/job/1158106#L4048
https://lkft.validation.linaro.org/scheduler/job/1137690#L3985
https://lkft.validation.linaro.org/scheduler/job/1137691#L4012
https://lkft.validation.linaro.org/scheduler/job/1137696#L4043
https://lkft.validation.linaro.org/scheduler/job/1137699#L4153

-- 
Linaro LKFT
https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-13 15:42 LKFT: arm x15: mmc1: cache flush error -110 Naresh Kamboju
@ 2020-02-14  9:09 ` Arnd Bergmann
  2020-02-14 12:09   ` Mark Brown
  2020-02-19 16:23 ` Ulf Hansson
  1 sibling, 1 reply; 45+ messages in thread
From: Arnd Bergmann @ 2020-02-14  9:09 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Jens Axboe, Alexei Starovoitov, Ulf Hansson, linux-block,
	lkft-triage, open list, linux-mmc, John Stultz, Mark Brown

On Thu, Feb 13, 2020 at 4:43 PM Naresh Kamboju
<naresh.kamboju@linaro.org> wrote:
>
> arm beagleboard x15 device failed to boot Linux mainline and
> linux-next kernel due
> to below error.
> This error occurred across all x15 device for these kernel version.
>
> This regression started happening on x15 from this commit onwards (27th Jan)
>   git branch: master
>   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>   git commit: aae1464f46a2403565f75717438118691d31ccf1
>   git describe: v5.5-489-gaae1464f46a2

Is it only the merge that introduced the issue, or is the branch that got
merged already broken?

If it's easy for you to reproduce, please run the same test on commit
e4e4c2ff78ed from Mark's regulator tree to narrow it down further.

Added Mark to Cc as well, in case it is indeed one of those.

      Arnd

8<---
> Test output log,
> [   37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
> [   37.611850] mmc1: cache flush error -110
> [   37.615883] blk_update_request: I/O error, dev mmcblk1, sector
> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
> [   37.627387] Aborting journal on device mmcblk1p9-8.
> [   37.635448] systemd[1]: Installed transient /etc/machine-id file.
> [   37.659283] systemd[1]: Couldn't move remaining userspace
> processes, ignoring: Input/output error
> [   37.744027] EXT4-fs error (device mmcblk1p9):
> ext4_journal_check_start:61: Detected aborted journal
> [   37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
> [   37.917486] systemd-gpt-auto-generator[108]: Failed to dissect:
> Input/output error
> [   37.927825] systemd[104]:
> /lib/systemd/system-generators/systemd-gpt-auto-generator failed with
> exit status 1.
> <>
> [   68.856307] mmc1: Card stuck being busy! mmc_poll_for_busy
> [   68.861838] mmc1: cache flush error -110
> [   68.865812] blk_update_request: I/O error, dev mmcblk1, sector 0 op
> 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
> <>
> [   98.906243] mmc1: Card stuck being busy! mmc_poll_for_busy
> [   98.911774] mmc1: cache flush error -110
> [   98.915747] blk_update_request: I/O error, dev mmcblk1, sector 0 op
> 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
> <>
> Dependency failed for Serial Getty on ttyS2.
> [  128.946258] mmc1: Card stuck being busy! mmc_poll_for_busy
> [  128.951786] mmc1: cache flush error -110
> [  128.955756] blk_update_request: I/O error, dev mmcblk1, sector 0 op
> 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
> [FAILED] Failed to start File System Check on Root Device.
> See 'systemctl status systemd-fsck-root.service' for details.
> [  OK  ] Started Apply Kernel Variables.
> [  OK  ] Reached target Login Prompts.
>          Starting Remount Root and Kernel File Systems...
> [  OK  ] Reached target Timers.
> [  OK  ] Closed Syslog Socket.
> [  OK  ] Started Emergency Shell.
> [  129.227328] EXT4-fs error (device mmcblk1p9): ext4_remount:5354:
> Abort forced by user
> [  OK  ] Reached target Emergency Mode.
> [  OK  ] Reached target Sockets.
> [FAILED] Failed to start Remount Root and Kernel File Systems.
> <>
> You are in emergency mode. After logging in, type \"journalctl -xb\" to view
> system logs, \"systemctl reboot\" to reboot, \"systemctl default\" or \"exit\"
> to boot into default mode.
> Press Enter for maintenance
> auto-login-action timed out after 874 seconds
>
> ref:
> https://lkft.validation.linaro.org/scheduler/job/1137693#L4034
> https://lkft.validation.linaro.org/scheduler/job/1158106#L4048
> https://lkft.validation.linaro.org/scheduler/job/1137690#L3985
> https://lkft.validation.linaro.org/scheduler/job/1137691#L4012
> https://lkft.validation.linaro.org/scheduler/job/1137696#L4043
> https://lkft.validation.linaro.org/scheduler/job/1137699#L4153

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-14  9:09 ` Arnd Bergmann
@ 2020-02-14 12:09   ` Mark Brown
  0 siblings, 0 replies; 45+ messages in thread
From: Mark Brown @ 2020-02-14 12:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Naresh Kamboju, Jens Axboe, Alexei Starovoitov, Ulf Hansson,
	linux-block, lkft-triage, open list, linux-mmc, John Stultz

[-- Attachment #1: Type: text/plain, Size: 1636 bytes --]

On Fri, Feb 14, 2020 at 10:09:34AM +0100, Arnd Bergmann wrote:
> On Thu, Feb 13, 2020 at 4:43 PM Naresh Kamboju

> > arm beagleboard x15 device failed to boot Linux mainline and
> > linux-next kernel due
> > to below error.
> > This error occurred across all x15 device for these kernel version.

> > This regression started happening on x15 from this commit onwards (27th Jan)
> >   git branch: master
> >   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> >   git commit: aae1464f46a2403565f75717438118691d31ccf1
> >   git describe: v5.5-489-gaae1464f46a2

That's Linus' commit:

	Merge tag 'regulator-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Please include human readable descriptions of things like commits and
issues being discussed in e-mail in your mails, this makes them much
easier for humans to read especially when they have no internet access.
I do frequently catch up on my mail on flights or while otherwise
travelling so this is even more pressing for me than just being about
making things a bit easier to read.

> Is it only the merge that introduced the issue, or is the branch that got
> merged already broken?

Or is this just the first commit where the problem happened to be
noticed?  Commits done just before that on the same day include the
dmaengine and MMC merges which seem potentially relevant:

> > [   37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
> > [   37.611850] mmc1: cache flush error -110
> > [   37.615883] blk_update_request: I/O error, dev mmcblk1, sector
> > 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-13 15:42 LKFT: arm x15: mmc1: cache flush error -110 Naresh Kamboju
  2020-02-14  9:09 ` Arnd Bergmann
@ 2020-02-19 16:23 ` Ulf Hansson
  2020-02-20 17:54   ` Naresh Kamboju
  1 sibling, 1 reply; 45+ messages in thread
From: Ulf Hansson @ 2020-02-19 16:23 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz

On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> arm beagleboard x15 device failed to boot Linux mainline and
> linux-next kernel due
> to below error.
> This error occurred across all x15 device for these kernel version.
>
> This regression started happening on x15 from this commit onwards (27th Jan)
>   git branch: master
>   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>   git commit: aae1464f46a2403565f75717438118691d31ccf1
>   git describe: v5.5-489-gaae1464f46a2
>
>
> Test output log,
> [   37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
> [   37.611850] mmc1: cache flush error -110
> [   37.615883] blk_update_request: I/O error, dev mmcblk1, sector
> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
> [   37.627387] Aborting journal on device mmcblk1p9-8.
> [   37.635448] systemd[1]: Installed transient /etc/machine-id file.
> [   37.659283] systemd[1]: Couldn't move remaining userspace
> processes, ignoring: Input/output error
> [   37.744027] EXT4-fs error (device mmcblk1p9):
> ext4_journal_check_start:61: Detected aborted journal
> [   37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
> [   37.917486] systemd-gpt-auto-generator[108]: Failed to dissect:
> Input/output error
> [   37.927825] systemd[104]:
> /lib/systemd/system-generators/systemd-gpt-auto-generator failed with
> exit status 1.
> <>

Try to restore the value for the cache flush timeout, by updating the
define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".

The offending commit could perhaps be this one.

commit 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9
Author: Ulf Hansson <ulf.hansson@linaro.org>
Date:   Wed Jan 22 15:27:45 2020 +0100
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC

[...]

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-19 16:23 ` Ulf Hansson
@ 2020-02-20 17:54   ` Naresh Kamboju
  2020-02-21  9:48     ` Ulf Hansson
  0 siblings, 1 reply; 45+ messages in thread
From: Naresh Kamboju @ 2020-02-20 17:54 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz

On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >
>
> Try to restore the value for the cache flush timeout, by updating the
> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".

I have increased the timeout to 10 minutes but it did not help.
Same error found.
[  608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[  608.684964] mmc1: cache flush error -110
[  608.689005] blk_update_request: I/O error, dev mmcblk1, sector
4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0

OTOH, What best i could do for my own experiment to revert all three patches and
now the reported error gone and device mount successfully [1].

List of patches reverted,
  mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
  mmc: block: Use generic_cmd6_time when modifying
    INAND_CMD38_ARG_EXT_CSD
  mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()

[1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346

- Naresh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-20 17:54   ` Naresh Kamboju
@ 2020-02-21  9:48     ` Ulf Hansson
  2020-02-21 19:44       ` Bitan Biswas
  0 siblings, 1 reply; 45+ messages in thread
From: Ulf Hansson @ 2020-02-21  9:48 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz

On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> > >
> >
> > Try to restore the value for the cache flush timeout, by updating the
> > define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
>
> I have increased the timeout to 10 minutes but it did not help.
> Same error found.
> [  608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
> [  608.684964] mmc1: cache flush error -110
> [  608.689005] blk_update_request: I/O error, dev mmcblk1, sector
> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
>
> OTOH, What best i could do for my own experiment to revert all three patches and
> now the reported error gone and device mount successfully [1].
>
> List of patches reverted,
>   mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
>   mmc: block: Use generic_cmd6_time when modifying
>     INAND_CMD38_ARG_EXT_CSD
>   mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>
> [1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
>
> - Naresh

Thanks for testing!

This sounds a bit weird, I must say. Also, while looking into the
logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but
maybe I didn't read the logs carefully enough.

 In any case, I am looking into creating a debug patch so we can
narrow down the problem a bit further.

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-21  9:48     ` Ulf Hansson
@ 2020-02-21 19:44       ` Bitan Biswas
  2020-02-24 11:16         ` Ulf Hansson
  0 siblings, 1 reply; 45+ messages in thread
From: Bitan Biswas @ 2020-02-21 19:44 UTC (permalink / raw)
  To: Ulf Hansson, Naresh Kamboju
  Cc: Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Thierry Reding,
	Jon Hunter

On 2/21/20 1:48 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>>
>> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>>>
>>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>>>>
>>>
>>> Try to restore the value for the cache flush timeout, by updating the
>>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
>>
>> I have increased the timeout to 10 minutes but it did not help.
>> Same error found.
>> [  608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
>> [  608.684964] mmc1: cache flush error -110
>> [  608.689005] blk_update_request: I/O error, dev mmcblk1, sector
>> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
>>
>> OTOH, What best i could do for my own experiment to revert all three patches and
>> now the reported error gone and device mount successfully [1].
>>
>> List of patches reverted,
>>    mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
>>    mmc: block: Use generic_cmd6_time when modifying
>>      INAND_CMD38_ARG_EXT_CSD
>>    mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>>
>> [1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
>>
>> - Naresh
> 
> Thanks for testing!
> 
> This sounds a bit weird, I must say. Also, while looking into the
> logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but
> maybe I didn't read the logs carefully enough.
> 
>   In any case, I am looking into creating a debug patch so we can
> narrow down the problem a bit further.
> 
> Kind regards
> Uffe
> 

Hi Ulf,

  I see that Jetson-TX2 / Jetson-Xavier suspend test is aborted and 
below error is seen due to the commit 
24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9  ("mmc: core: Specify timeouts 
for BKOPS and CACHE_FLUSH for eMMC"):

##
[  268.976197] Freezing user space processes ... (elapsed 0.001 seconds) 
done.
[  268.984414] OOM killer disabled.
[  268.987635] Freezing remaining freezable tasks ... (elapsed 0.000 
seconds) done.
[  269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
[  269.224255] PM: Device mmc1:0001 failed to suspend: error -110
[  269.230080] PM: Some devices failed to suspend, or early wake event 
detected
##

  I find that from the commit the changes in mmc_flush_cache below is 
the cause.

##
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
                         (card->ext_csd.cache_size > 0) &&
                         (card->ext_csd.cache_ctrl & 1)) {
                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                               EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1,
+                                MMC_CACHE_FLUSH_TIMEOUT_MS);

##

  Do you have suggestion to try for the suspend errors ?

-regards,
  Bitan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-21 19:44       ` Bitan Biswas
@ 2020-02-24 11:16         ` Ulf Hansson
  2020-02-24 12:59           ` Adrian Hunter
  2020-02-25 10:04           ` Jon Hunter
  0 siblings, 2 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-02-24 11:16 UTC (permalink / raw)
  To: Bitan Biswas, Adrian Hunter
  Cc: Naresh Kamboju, Jens Axboe, Alexei Starovoitov, linux-block,
	lkft-triage, open list, linux-mmc, Arnd Bergmann, John Stultz,
	Thierry Reding, Jon Hunter

+ Adrian

On Fri, 21 Feb 2020 at 20:44, Bitan Biswas <bbiswas@nvidia.com> wrote:
>
> On 2/21/20 1:48 AM, Ulf Hansson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >>
> >> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >>>
> >>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >>>>
> >>>
> >>> Try to restore the value for the cache flush timeout, by updating the
> >>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
> >>
> >> I have increased the timeout to 10 minutes but it did not help.
> >> Same error found.
> >> [  608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
> >> [  608.684964] mmc1: cache flush error -110
> >> [  608.689005] blk_update_request: I/O error, dev mmcblk1, sector
> >> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
> >>
> >> OTOH, What best i could do for my own experiment to revert all three patches and
> >> now the reported error gone and device mount successfully [1].
> >>
> >> List of patches reverted,
> >>    mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
> >>    mmc: block: Use generic_cmd6_time when modifying
> >>      INAND_CMD38_ARG_EXT_CSD
> >>    mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
> >>
> >> [1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
> >>
> >> - Naresh
> >
> > Thanks for testing!
> >
> > This sounds a bit weird, I must say. Also, while looking into the
> > logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but
> > maybe I didn't read the logs carefully enough.
> >
> >   In any case, I am looking into creating a debug patch so we can
> > narrow down the problem a bit further.
> >
> > Kind regards
> > Uffe
> >
>
> Hi Ulf,
>
>   I see that Jetson-TX2 / Jetson-Xavier suspend test is aborted and
> below error is seen due to the commit
> 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9  ("mmc: core: Specify timeouts
> for BKOPS and CACHE_FLUSH for eMMC"):
>
> ##
> [  268.976197] Freezing user space processes ... (elapsed 0.001 seconds)
> done.
> [  268.984414] OOM killer disabled.
> [  268.987635] Freezing remaining freezable tasks ... (elapsed 0.000
> seconds) done.
> [  269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
> [  269.224255] PM: Device mmc1:0001 failed to suspend: error -110
> [  269.230080] PM: Some devices failed to suspend, or early wake event
> detected

Is there also a print in the log about "Card stuck being busy!"?

In any case, it seems like the timeout error (-110) is happening way
too soon. The cache flush timeout is now 30s, but the timeout seems to
fire only a few hundred ms (at most) after the cache flush command has
been sent.

> ##
>
>   I find that from the commit the changes in mmc_flush_cache below is
> the cause.
>
> ##
> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
>                          (card->ext_csd.cache_size > 0) &&
>                          (card->ext_csd.cache_ctrl & 1)) {
>                  err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> -                               EXT_CSD_FLUSH_CACHE, 1, 0);
> +                                EXT_CSD_FLUSH_CACHE, 1,
> +                                MMC_CACHE_FLUSH_TIMEOUT_MS);
>
> ##
>
>   Do you have suggestion to try for the suspend errors ?

Just as a quick sanity test, please try the below patch, which
restores the old cache flush timeout to 10min.

However, as I indicated above, this seems to be a problem that needs
to be fixed at in the host driver side. For the sdhci driver, there is
a bit of a tricky logic around how to deal with timeouts in
sdhci_send_command(). My best guess is that's where we should look
more closely (and I am doing that).

From: Ulf Hansson <ulf.hansson@linaro.org>
Date: Mon, 24 Feb 2020 11:43:33 +0100
Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
 drivers/mmc/core/mmc_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..713e7dd6d028 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -21,7 +21,7 @@

 #define MMC_OPS_TIMEOUT_MS             (10 * 60 * 1000) /* 10min*/
 #define MMC_BKOPS_TIMEOUT_MS           (120 * 1000) /* 120s */
-#define MMC_CACHE_FLUSH_TIMEOUT_MS     (30 * 1000) /* 30s */
+#define MMC_CACHE_FLUSH_TIMEOUT_MS     (10 * 60 * 1000) /* 10min */

 static const u8 tuning_blk_pattern_4bit[] = {
        0xff, 0x0f, 0xff, 0x00, 0xff, 0xcc, 0xc3, 0xcc,
-- 

Kind regards
Uffe

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-24 11:16         ` Ulf Hansson
@ 2020-02-24 12:59           ` Adrian Hunter
  2020-02-25 10:04           ` Jon Hunter
  1 sibling, 0 replies; 45+ messages in thread
From: Adrian Hunter @ 2020-02-24 12:59 UTC (permalink / raw)
  To: Ulf Hansson, Bitan Biswas
  Cc: Naresh Kamboju, Jens Axboe, Alexei Starovoitov, linux-block,
	lkft-triage, open list, linux-mmc, Arnd Bergmann, John Stultz,
	Thierry Reding, Jon Hunter

On 24/02/20 1:16 pm, Ulf Hansson wrote:
> + Adrian
> 
> On Fri, 21 Feb 2020 at 20:44, Bitan Biswas <bbiswas@nvidia.com> wrote:
>>
>> On 2/21/20 1:48 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>>>>
>>>> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>>>>>
>>>>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>>>>>>
>>>>>
>>>>> Try to restore the value for the cache flush timeout, by updating the
>>>>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
>>>>
>>>> I have increased the timeout to 10 minutes but it did not help.
>>>> Same error found.
>>>> [  608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
>>>> [  608.684964] mmc1: cache flush error -110
>>>> [  608.689005] blk_update_request: I/O error, dev mmcblk1, sector
>>>> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
>>>>
>>>> OTOH, What best i could do for my own experiment to revert all three patches and
>>>> now the reported error gone and device mount successfully [1].
>>>>
>>>> List of patches reverted,
>>>>    mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
>>>>    mmc: block: Use generic_cmd6_time when modifying
>>>>      INAND_CMD38_ARG_EXT_CSD
>>>>    mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>>>>
>>>> [1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
>>>>
>>>> - Naresh
>>>
>>> Thanks for testing!
>>>
>>> This sounds a bit weird, I must say. Also, while looking into the
>>> logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but
>>> maybe I didn't read the logs carefully enough.
>>>
>>>   In any case, I am looking into creating a debug patch so we can
>>> narrow down the problem a bit further.
>>>
>>> Kind regards
>>> Uffe
>>>
>>
>> Hi Ulf,
>>
>>   I see that Jetson-TX2 / Jetson-Xavier suspend test is aborted and
>> below error is seen due to the commit
>> 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9  ("mmc: core: Specify timeouts
>> for BKOPS and CACHE_FLUSH for eMMC"):
>>
>> ##
>> [  268.976197] Freezing user space processes ... (elapsed 0.001 seconds)
>> done.
>> [  268.984414] OOM killer disabled.
>> [  268.987635] Freezing remaining freezable tasks ... (elapsed 0.000
>> seconds) done.
>> [  269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
>> [  269.224255] PM: Device mmc1:0001 failed to suspend: error -110
>> [  269.230080] PM: Some devices failed to suspend, or early wake event
>> detected
> 
> Is there also a print in the log about "Card stuck being busy!"?
> 
> In any case, it seems like the timeout error (-110) is happening way
> too soon. The cache flush timeout is now 30s, but the timeout seems to
> fire only a few hundred ms (at most) after the cache flush command has
> been sent.
> 
>> ##
>>
>>   I find that from the commit the changes in mmc_flush_cache below is
>> the cause.
>>
>> ##
>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
>>                          (card->ext_csd.cache_size > 0) &&
>>                          (card->ext_csd.cache_ctrl & 1)) {
>>                  err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
>> -                               EXT_CSD_FLUSH_CACHE, 1, 0);
>> +                                EXT_CSD_FLUSH_CACHE, 1,
>> +                                MMC_CACHE_FLUSH_TIMEOUT_MS);
>>
>> ##
>>
>>   Do you have suggestion to try for the suspend errors ?
> 
> Just as a quick sanity test, please try the below patch, which
> restores the old cache flush timeout to 10min.
> 
> However, as I indicated above, this seems to be a problem that needs
> to be fixed at in the host driver side. For the sdhci driver, there is
> a bit of a tricky logic around how to deal with timeouts in
> sdhci_send_command(). My best guess is that's where we should look
> more closely (and I am doing that).

Could be, but hard to narrow down without more debug.  I would enable
dynamic debug and mmc messages e.g.

kernel config

	CONFIG_DYNAMIC_DEBUG=y

kernel command line option

	dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"

But that might lead to a very large number of messages.


> 
> From: Ulf Hansson <ulf.hansson@linaro.org>
> Date: Mon, 24 Feb 2020 11:43:33 +0100
> Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing
> 
> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> ---
>  drivers/mmc/core/mmc_ops.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index da425ee2d9bf..713e7dd6d028 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -21,7 +21,7 @@
> 
>  #define MMC_OPS_TIMEOUT_MS             (10 * 60 * 1000) /* 10min*/
>  #define MMC_BKOPS_TIMEOUT_MS           (120 * 1000) /* 120s */
> -#define MMC_CACHE_FLUSH_TIMEOUT_MS     (30 * 1000) /* 30s */
> +#define MMC_CACHE_FLUSH_TIMEOUT_MS     (10 * 60 * 1000) /* 10min */
> 
>  static const u8 tuning_blk_pattern_4bit[] = {
>         0xff, 0x0f, 0xff, 0x00, 0xff, 0xcc, 0xc3, 0xcc,
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-24 11:16         ` Ulf Hansson
  2020-02-24 12:59           ` Adrian Hunter
@ 2020-02-25 10:04           ` Jon Hunter
  2020-02-25 11:35             ` Ulf Hansson
  2020-02-25 11:41             ` Jon Hunter
  1 sibling, 2 replies; 45+ messages in thread
From: Jon Hunter @ 2020-02-25 10:04 UTC (permalink / raw)
  To: Ulf Hansson, Bitan Biswas, Adrian Hunter
  Cc: Naresh Kamboju, Jens Axboe, Alexei Starovoitov, linux-block,
	lkft-triage, open list, linux-mmc, Arnd Bergmann, John Stultz,
	Thierry Reding


On 24/02/2020 11:16, Ulf Hansson wrote:
> + Adrian
> 
> On Fri, 21 Feb 2020 at 20:44, Bitan Biswas <bbiswas@nvidia.com> wrote:
>>
>> On 2/21/20 1:48 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>>>>
>>>> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>>>>>
>>>>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>>>>>>
>>>>>
>>>>> Try to restore the value for the cache flush timeout, by updating the
>>>>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
>>>>
>>>> I have increased the timeout to 10 minutes but it did not help.
>>>> Same error found.
>>>> [  608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
>>>> [  608.684964] mmc1: cache flush error -110
>>>> [  608.689005] blk_update_request: I/O error, dev mmcblk1, sector
>>>> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
>>>>
>>>> OTOH, What best i could do for my own experiment to revert all three patches and
>>>> now the reported error gone and device mount successfully [1].
>>>>
>>>> List of patches reverted,
>>>>    mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
>>>>    mmc: block: Use generic_cmd6_time when modifying
>>>>      INAND_CMD38_ARG_EXT_CSD
>>>>    mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()


Reverting all the above also fixes the problem for me.

>>   I find that from the commit the changes in mmc_flush_cache below is
>> the cause.
>>
>> ##
>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
>>                          (card->ext_csd.cache_size > 0) &&
>>                          (card->ext_csd.cache_ctrl & 1)) {
>>                  err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
>> -                               EXT_CSD_FLUSH_CACHE, 1, 0);
>> +                                EXT_CSD_FLUSH_CACHE, 1,
>> +                                MMC_CACHE_FLUSH_TIMEOUT_MS);


I no longer see the issue on reverting the above hunk as Bitan suggested
but now I see the following (which is expected) ...

 WARNING KERN mmc1: unspecified timeout for CMD6 - use generic

> Just as a quick sanity test, please try the below patch, which
> restores the old cache flush timeout to 10min.
> 
> However, as I indicated above, this seems to be a problem that needs
> to be fixed at in the host driver side. For the sdhci driver, there is
> a bit of a tricky logic around how to deal with timeouts in
> sdhci_send_command(). My best guess is that's where we should look
> more closely (and I am doing that).
> 
> From: Ulf Hansson <ulf.hansson@linaro.org>
> Date: Mon, 24 Feb 2020 11:43:33 +0100
> Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing
> 
> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> ---
>  drivers/mmc/core/mmc_ops.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index da425ee2d9bf..713e7dd6d028 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -21,7 +21,7 @@
> 
>  #define MMC_OPS_TIMEOUT_MS             (10 * 60 * 1000) /* 10min*/
>  #define MMC_BKOPS_TIMEOUT_MS           (120 * 1000) /* 120s */
> -#define MMC_CACHE_FLUSH_TIMEOUT_MS     (30 * 1000) /* 30s */
> +#define MMC_CACHE_FLUSH_TIMEOUT_MS     (10 * 60 * 1000) /* 10min */

This does not fix the problem for me.

Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-25 10:04           ` Jon Hunter
@ 2020-02-25 11:35             ` Ulf Hansson
  2020-02-25 11:41             ` Jon Hunter
  1 sibling, 0 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-02-25 11:35 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Bitan Biswas, Adrian Hunter, Naresh Kamboju, Jens Axboe,
	Alexei Starovoitov, linux-block, lkft-triage, open list,
	linux-mmc, Arnd Bergmann, John Stultz, Thierry Reding

On Tue, 25 Feb 2020 at 11:04, Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 24/02/2020 11:16, Ulf Hansson wrote:
> > + Adrian
> >
> > On Fri, 21 Feb 2020 at 20:44, Bitan Biswas <bbiswas@nvidia.com> wrote:
> >>
> >> On 2/21/20 1:48 AM, Ulf Hansson wrote:
> >>> External email: Use caution opening links or attachments
> >>>
> >>>
> >>> On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >>>>
> >>>> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >>>>>
> >>>>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >>>>>>
> >>>>>
> >>>>> Try to restore the value for the cache flush timeout, by updating the
> >>>>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
> >>>>
> >>>> I have increased the timeout to 10 minutes but it did not help.
> >>>> Same error found.
> >>>> [  608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
> >>>> [  608.684964] mmc1: cache flush error -110
> >>>> [  608.689005] blk_update_request: I/O error, dev mmcblk1, sector
> >>>> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
> >>>>
> >>>> OTOH, What best i could do for my own experiment to revert all three patches and
> >>>> now the reported error gone and device mount successfully [1].
> >>>>
> >>>> List of patches reverted,
> >>>>    mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
> >>>>    mmc: block: Use generic_cmd6_time when modifying
> >>>>      INAND_CMD38_ARG_EXT_CSD
> >>>>    mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>
>
> Reverting all the above also fixes the problem for me.
>
> >>   I find that from the commit the changes in mmc_flush_cache below is
> >> the cause.
> >>
> >> ##
> >> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
> >>                          (card->ext_csd.cache_size > 0) &&
> >>                          (card->ext_csd.cache_ctrl & 1)) {
> >>                  err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> >> -                               EXT_CSD_FLUSH_CACHE, 1, 0);
> >> +                                EXT_CSD_FLUSH_CACHE, 1,
> >> +                                MMC_CACHE_FLUSH_TIMEOUT_MS);
>
>
> I no longer see the issue on reverting the above hunk as Bitan suggested
> but now I see the following (which is expected) ...
>
>  WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
>
> > Just as a quick sanity test, please try the below patch, which
> > restores the old cache flush timeout to 10min.
> >
> > However, as I indicated above, this seems to be a problem that needs
> > to be fixed at in the host driver side. For the sdhci driver, there is
> > a bit of a tricky logic around how to deal with timeouts in
> > sdhci_send_command(). My best guess is that's where we should look
> > more closely (and I am doing that).
> >
> > From: Ulf Hansson <ulf.hansson@linaro.org>
> > Date: Mon, 24 Feb 2020 11:43:33 +0100
> > Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing
> >
> > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > ---
> >  drivers/mmc/core/mmc_ops.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> > index da425ee2d9bf..713e7dd6d028 100644
> > --- a/drivers/mmc/core/mmc_ops.c
> > +++ b/drivers/mmc/core/mmc_ops.c
> > @@ -21,7 +21,7 @@
> >
> >  #define MMC_OPS_TIMEOUT_MS             (10 * 60 * 1000) /* 10min*/
> >  #define MMC_BKOPS_TIMEOUT_MS           (120 * 1000) /* 120s */
> > -#define MMC_CACHE_FLUSH_TIMEOUT_MS     (30 * 1000) /* 30s */
> > +#define MMC_CACHE_FLUSH_TIMEOUT_MS     (10 * 60 * 1000) /* 10min */
>
> This does not fix the problem for me.

Thanks for testing and the report - the results are not surprising.

I am getting back with an update asap, just want to complete my tests
over here before I say more.

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-25 10:04           ` Jon Hunter
  2020-02-25 11:35             ` Ulf Hansson
@ 2020-02-25 11:41             ` Jon Hunter
  2020-02-25 14:26               ` Ulf Hansson
  1 sibling, 1 reply; 45+ messages in thread
From: Jon Hunter @ 2020-02-25 11:41 UTC (permalink / raw)
  To: Ulf Hansson, Bitan Biswas, Adrian Hunter
  Cc: Naresh Kamboju, Jens Axboe, Alexei Starovoitov, linux-block,
	lkft-triage, open list, linux-mmc, Arnd Bergmann, John Stultz,
	Thierry Reding


On 25/02/2020 10:04, Jon Hunter wrote:

...

>>>   I find that from the commit the changes in mmc_flush_cache below is
>>> the cause.
>>>
>>> ##
>>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
>>>                          (card->ext_csd.cache_size > 0) &&
>>>                          (card->ext_csd.cache_ctrl & 1)) {
>>>                  err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
>>> -                               EXT_CSD_FLUSH_CACHE, 1, 0);
>>> +                                EXT_CSD_FLUSH_CACHE, 1,
>>> +                                MMC_CACHE_FLUSH_TIMEOUT_MS);
> 
> 
> I no longer see the issue on reverting the above hunk as Bitan suggested
> but now I see the following (which is expected) ...
> 
>  WARNING KERN mmc1: unspecified timeout for CMD6 - use generic

For Tegra, the default timeout used when no timeout is specified for CMD6
is 100mS. So hard-coding the following also appears to workaround the
problem on Tegra ...

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 868653bc1555..5155e0240fca 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -992,7 +992,7 @@ int mmc_flush_cache(struct mmc_card *card)
                        (card->ext_csd.cache_size > 0) &&
                        (card->ext_csd.cache_ctrl & 1)) {
                err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
-                                EXT_CSD_FLUSH_CACHE, 1, 0);
+                                EXT_CSD_FLUSH_CACHE, 1, 100);
                if (err)
                        pr_err("%s: cache flush error %d\n",
                                        mmc_hostname(card->host), err);

So the problem appears to be causing by the timeout being too long rather
than not long enough. 

Looking more at the code, I think now that we are hitting the condition
...

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 868653bc1555..feae82b1ff35 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -579,8 +579,10 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
         * the host to avoid HW busy detection, by converting to a R1 response
         * instead of a R1B.
         */
-       if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
+       if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
+               pr_warn("%s: timeout (%d) > max busy timeout (%d)", mmc_hostname(host), timeout_ms, host->max_busy_timeout);
                use_r1b_resp = false;
+       }


With the above I see ...

 WARNING KERN mmc1: timeout (1600) > max busy timeout (672)

So with the longer timeout we are not using/requesting the response.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-25 11:41             ` Jon Hunter
@ 2020-02-25 14:26               ` Ulf Hansson
  2020-02-25 16:24                 ` Jon Hunter
  0 siblings, 1 reply; 45+ messages in thread
From: Ulf Hansson @ 2020-02-25 14:26 UTC (permalink / raw)
  To: Jon Hunter, Faiz Abbas
  Cc: Bitan Biswas, Adrian Hunter, Naresh Kamboju, Jens Axboe,
	Alexei Starovoitov, linux-block, lkft-triage, open list,
	linux-mmc, Arnd Bergmann, John Stultz, Thierry Reding

+ Faiz Abbas

On Tue, 25 Feb 2020 at 12:41, Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 25/02/2020 10:04, Jon Hunter wrote:
>
> ...
>
> >>>   I find that from the commit the changes in mmc_flush_cache below is
> >>> the cause.
> >>>
> >>> ##
> >>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
> >>>                          (card->ext_csd.cache_size > 0) &&
> >>>                          (card->ext_csd.cache_ctrl & 1)) {
> >>>                  err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> >>> -                               EXT_CSD_FLUSH_CACHE, 1, 0);
> >>> +                                EXT_CSD_FLUSH_CACHE, 1,
> >>> +                                MMC_CACHE_FLUSH_TIMEOUT_MS);
> >
> >
> > I no longer see the issue on reverting the above hunk as Bitan suggested
> > but now I see the following (which is expected) ...
> >
> >  WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
>
> For Tegra, the default timeout used when no timeout is specified for CMD6
> is 100mS. So hard-coding the following also appears to workaround the
> problem on Tegra ...

Interesting.

>
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index 868653bc1555..5155e0240fca 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -992,7 +992,7 @@ int mmc_flush_cache(struct mmc_card *card)
>                         (card->ext_csd.cache_size > 0) &&
>                         (card->ext_csd.cache_ctrl & 1)) {
>                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> -                                EXT_CSD_FLUSH_CACHE, 1, 0);
> +                                EXT_CSD_FLUSH_CACHE, 1, 100);
>                 if (err)
>                         pr_err("%s: cache flush error %d\n",
>                                         mmc_hostname(card->host), err);
>
> So the problem appears to be causing by the timeout being too long rather
> than not long enough.
>
> Looking more at the code, I think now that we are hitting the condition
> ...
>
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index 868653bc1555..feae82b1ff35 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -579,8 +579,10 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
>          * the host to avoid HW busy detection, by converting to a R1 response
>          * instead of a R1B.
>          */
> -       if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
> +       if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
> +               pr_warn("%s: timeout (%d) > max busy timeout (%d)", mmc_hostname(host), timeout_ms, host->max_busy_timeout);
>                 use_r1b_resp = false;
> +       }
>
>
> With the above I see ...
>
>  WARNING KERN mmc1: timeout (1600) > max busy timeout (672)
>
> So with the longer timeout we are not using/requesting the response.

You are most likely correct.

However, from the core point of view, the response is still requested,
only that we don't want the driver to wait for the card to stop
signaling busy. Instead we want to deal with that via "polling" from
the core.

This is a rather worrying behaviour, as it seems like the host driver
doesn't really follow this expectations from the core point of view.
And mmc_flush_cache() is not the only case, as we have erase, bkops,
sanitize, etc. Are all these working or not really well tested?

Earlier, before my three patches, if the provided timeout_ms parameter
to __mmc_switch() was zero, which was the case for
mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
ignored validating host->max_busy_timeout, which was wrong. In any
case, this also meant that an R1B response was always used for
mmc_flush_cache(), as you also indicated above. Perhaps this is the
critical part where things can go wrong.

BTW, have you tried erase commands for sdhci tegra driver? If those
are working fine, do you have any special treatments for these?

I have looped in Faiz, as sdhci-omap seems to suffer from very similar
problems. One thing I noted for sdhci-omap, is that MMC_ERASE commands
is treated in a special manner in sdhci_omap_set_timeout(). This
indicates that there is something fishy going on.

Faiz, can you please comment on this?

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-25 14:26               ` Ulf Hansson
@ 2020-02-25 16:24                 ` Jon Hunter
  2020-02-26 15:21                   ` Ulf Hansson
  0 siblings, 1 reply; 45+ messages in thread
From: Jon Hunter @ 2020-02-25 16:24 UTC (permalink / raw)
  To: Ulf Hansson, Faiz Abbas, Sowjanya Komatineni
  Cc: Bitan Biswas, Adrian Hunter, Naresh Kamboju, Jens Axboe,
	Alexei Starovoitov, linux-block, lkft-triage, open list,
	linux-mmc, Arnd Bergmann, John Stultz, Thierry Reding


On 25/02/2020 14:26, Ulf Hansson wrote:

...

> However, from the core point of view, the response is still requested,
> only that we don't want the driver to wait for the card to stop
> signaling busy. Instead we want to deal with that via "polling" from
> the core.
> 
> This is a rather worrying behaviour, as it seems like the host driver
> doesn't really follow this expectations from the core point of view.
> And mmc_flush_cache() is not the only case, as we have erase, bkops,
> sanitize, etc. Are all these working or not really well tested?

I don't believe that they are well tested. We have a simple test to
mount an eMMC partition, create a file, check the contents, remove the
file and unmount. The timeouts always occur during unmounting.

> Earlier, before my three patches, if the provided timeout_ms parameter
> to __mmc_switch() was zero, which was the case for
> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
> ignored validating host->max_busy_timeout, which was wrong. In any
> case, this also meant that an R1B response was always used for
> mmc_flush_cache(), as you also indicated above. Perhaps this is the
> critical part where things can go wrong.
> 
> BTW, have you tried erase commands for sdhci tegra driver? If those
> are working fine, do you have any special treatments for these?

That I am not sure, but I will check.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-25 16:24                 ` Jon Hunter
@ 2020-02-26 15:21                   ` Ulf Hansson
  2020-02-26 17:04                     ` Jon Hunter
  2020-03-02 13:12                     ` Faiz Abbas
  0 siblings, 2 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-02-26 15:21 UTC (permalink / raw)
  To: Jon Hunter, Faiz Abbas, Bitan Biswas
  Cc: Sowjanya Komatineni, Adrian Hunter, Naresh Kamboju, Jens Axboe,
	Alexei Starovoitov, linux-block, lkft-triage, open list,
	linux-mmc, Arnd Bergmann, John Stultz, Thierry Reding,
	Anders Roxell, Kishon

+ Anders, Kishon

On Tue, 25 Feb 2020 at 17:24, Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 25/02/2020 14:26, Ulf Hansson wrote:
>
> ...
>
> > However, from the core point of view, the response is still requested,
> > only that we don't want the driver to wait for the card to stop
> > signaling busy. Instead we want to deal with that via "polling" from
> > the core.
> >
> > This is a rather worrying behaviour, as it seems like the host driver
> > doesn't really follow this expectations from the core point of view.
> > And mmc_flush_cache() is not the only case, as we have erase, bkops,
> > sanitize, etc. Are all these working or not really well tested?
>
> I don't believe that they are well tested. We have a simple test to
> mount an eMMC partition, create a file, check the contents, remove the
> file and unmount. The timeouts always occur during unmounting.
>
> > Earlier, before my three patches, if the provided timeout_ms parameter
> > to __mmc_switch() was zero, which was the case for
> > mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
> > ignored validating host->max_busy_timeout, which was wrong. In any
> > case, this also meant that an R1B response was always used for
> > mmc_flush_cache(), as you also indicated above. Perhaps this is the
> > critical part where things can go wrong.
> >
> > BTW, have you tried erase commands for sdhci tegra driver? If those
> > are working fine, do you have any special treatments for these?
>
> That I am not sure, but I will check.

Great, thanks. Looking forward to your report.

So, from my side, me and Anders Roxell, have been collaborating on
testing the behaviour on a TI Beagleboard x15 (remotely with limited
debug options), which is using the sdhci-omap variant. I am trying to
get hold of an Nvidia jetson-TX2, but not found one yet. These are the
conclusions from the observed behaviour on the Beagleboard for the
CMD6 cache flush command.

First, the reported host->max_busy_timeout is 2581 (ms) for the
sdhci-omap driver in this configuration.

1. As we all know by now, the cache flush command (CMD6) fails with
-110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
from the command.

2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
timeout_ms parameter is less than max_busy_timeout (2000 <  2581).
Then everything works fine.

3. Updating the code to again use 30s as the
MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
set, even when the timeout_ms becomes greater than max_busy_timeout.
This also works fine.

Clearly this indicates a problem that I think needs to be addressed in
the sdhci driver. However, of course I can revert the three discussed
patches to fix the problem, but that would only hide the issues and I
am sure we would then get back to this issue, sooner or later.

To fix the problem in the sdhci driver, I would appreciate if someone
from TI and Nvidia can step in to help, as I don't have the HW on my
desk.

Comments or other ideas of how to move forward?

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-26 15:21                   ` Ulf Hansson
@ 2020-02-26 17:04                     ` Jon Hunter
  2020-03-02 13:12                     ` Faiz Abbas
  1 sibling, 0 replies; 45+ messages in thread
From: Jon Hunter @ 2020-02-26 17:04 UTC (permalink / raw)
  To: Ulf Hansson, Faiz Abbas, Bitan Biswas
  Cc: Sowjanya Komatineni, Adrian Hunter, Naresh Kamboju, Jens Axboe,
	Alexei Starovoitov, linux-block, lkft-triage, open list,
	linux-mmc, Arnd Bergmann, John Stultz, Thierry Reding,
	Anders Roxell, Kishon


On 26/02/2020 15:21, Ulf Hansson wrote:
> + Anders, Kishon
> 
> On Tue, 25 Feb 2020 at 17:24, Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>>
>> On 25/02/2020 14:26, Ulf Hansson wrote:
>>
>> ...
>>
>>> However, from the core point of view, the response is still requested,
>>> only that we don't want the driver to wait for the card to stop
>>> signaling busy. Instead we want to deal with that via "polling" from
>>> the core.
>>>
>>> This is a rather worrying behaviour, as it seems like the host driver
>>> doesn't really follow this expectations from the core point of view.
>>> And mmc_flush_cache() is not the only case, as we have erase, bkops,
>>> sanitize, etc. Are all these working or not really well tested?
>>
>> I don't believe that they are well tested. We have a simple test to
>> mount an eMMC partition, create a file, check the contents, remove the
>> file and unmount. The timeouts always occur during unmounting.
>>
>>> Earlier, before my three patches, if the provided timeout_ms parameter
>>> to __mmc_switch() was zero, which was the case for
>>> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
>>> ignored validating host->max_busy_timeout, which was wrong. In any
>>> case, this also meant that an R1B response was always used for
>>> mmc_flush_cache(), as you also indicated above. Perhaps this is the
>>> critical part where things can go wrong.
>>>
>>> BTW, have you tried erase commands for sdhci tegra driver? If those
>>> are working fine, do you have any special treatments for these?
>>
>> That I am not sure, but I will check.
> 
> Great, thanks. Looking forward to your report.

So no I don't believe that we have explicitly tried erase commands and
no there is nothing special that we are doing for erase.

> So, from my side, me and Anders Roxell, have been collaborating on
> testing the behaviour on a TI Beagleboard x15 (remotely with limited
> debug options), which is using the sdhci-omap variant. I am trying to
> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> conclusions from the observed behaviour on the Beagleboard for the
> CMD6 cache flush command.
> 
> First, the reported host->max_busy_timeout is 2581 (ms) for the
> sdhci-omap driver in this configuration.
> 
> 1. As we all know by now, the cache flush command (CMD6) fails with
> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> from the command.
> 
> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> timeout_ms parameter is less than max_busy_timeout (2000 <  2581).
> Then everything works fine.
> 
> 3. Updating the code to again use 30s as the
> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> set, even when the timeout_ms becomes greater than max_busy_timeout.
> This also works fine.
> 
> Clearly this indicates a problem that I think needs to be addressed in
> the sdhci driver. However, of course I can revert the three discussed
> patches to fix the problem, but that would only hide the issues and I
> am sure we would then get back to this issue, sooner or later.
> 
> To fix the problem in the sdhci driver, I would appreciate if someone
> from TI and Nvidia can step in to help, as I don't have the HW on my
> desk.

OK, let me check to see who can help from our side.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-02-26 15:21                   ` Ulf Hansson
  2020-02-26 17:04                     ` Jon Hunter
@ 2020-03-02 13:12                     ` Faiz Abbas
  2020-03-02 16:50                       ` Ulf Hansson
  1 sibling, 1 reply; 45+ messages in thread
From: Faiz Abbas @ 2020-03-02 13:12 UTC (permalink / raw)
  To: Ulf Hansson, Jon Hunter, Bitan Biswas
  Cc: Sowjanya Komatineni, Adrian Hunter, Naresh Kamboju, Jens Axboe,
	Alexei Starovoitov, linux-block, lkft-triage, open list,
	linux-mmc, Arnd Bergmann, John Stultz, Thierry Reding,
	Anders Roxell, Kishon

Uffe,

On 26/02/20 8:51 pm, Ulf Hansson wrote:
> + Anders, Kishon
> 
> On Tue, 25 Feb 2020 at 17:24, Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>>
>> On 25/02/2020 14:26, Ulf Hansson wrote:
>>
>> ...
>>
>>> However, from the core point of view, the response is still requested,
>>> only that we don't want the driver to wait for the card to stop
>>> signaling busy. Instead we want to deal with that via "polling" from
>>> the core.
>>>
>>> This is a rather worrying behaviour, as it seems like the host driver
>>> doesn't really follow this expectations from the core point of view.
>>> And mmc_flush_cache() is not the only case, as we have erase, bkops,
>>> sanitize, etc. Are all these working or not really well tested?
>>
>> I don't believe that they are well tested. We have a simple test to
>> mount an eMMC partition, create a file, check the contents, remove the
>> file and unmount. The timeouts always occur during unmounting.
>>
>>> Earlier, before my three patches, if the provided timeout_ms parameter
>>> to __mmc_switch() was zero, which was the case for
>>> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
>>> ignored validating host->max_busy_timeout, which was wrong. In any
>>> case, this also meant that an R1B response was always used for
>>> mmc_flush_cache(), as you also indicated above. Perhaps this is the
>>> critical part where things can go wrong.
>>>
>>> BTW, have you tried erase commands for sdhci tegra driver? If those
>>> are working fine, do you have any special treatments for these?
>>
>> That I am not sure, but I will check.
> 
> Great, thanks. Looking forward to your report.
> 
> So, from my side, me and Anders Roxell, have been collaborating on
> testing the behaviour on a TI Beagleboard x15 (remotely with limited
> debug options), which is using the sdhci-omap variant. I am trying to
> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> conclusions from the observed behaviour on the Beagleboard for the
> CMD6 cache flush command.
> 
> First, the reported host->max_busy_timeout is 2581 (ms) for the
> sdhci-omap driver in this configuration.
> 
> 1. As we all know by now, the cache flush command (CMD6) fails with
> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> from the command.
> 
> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> timeout_ms parameter is less than max_busy_timeout (2000 <  2581).
> Then everything works fine.
> 
> 3. Updating the code to again use 30s as the
> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> set, even when the timeout_ms becomes greater than max_busy_timeout.
> This also works fine.
> 
> Clearly this indicates a problem that I think needs to be addressed in
> the sdhci driver. However, of course I can revert the three discussed
> patches to fix the problem, but that would only hide the issues and I
> am sure we would then get back to this issue, sooner or later.
> 
> To fix the problem in the sdhci driver, I would appreciate if someone
> from TI and Nvidia can step in to help, as I don't have the HW on my
> desk.
> 
> Comments or other ideas of how to move forward?
> 

Sorry I missed this earlier.

I don't have an X15 with me here but I'm trying to set one up in our
remote farm. In the meantime, I tried to reproduce this issue on two
platforms (dra72-evm and am57xx-evm) and wasn't able to see the issue
because those eMMC's don't even have a cache. I will keep you updated
when I do get a board with a eMMC that has a cache.

Is there a way to reproduce this CMD6 issue with another operation?

Thanks,
Faiz

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-02 13:12                     ` Faiz Abbas
@ 2020-03-02 16:50                       ` Ulf Hansson
  2020-03-03 21:35                         ` Ulf Hansson
       [not found]                         ` <5e9b5646-bd48-e55b-54ee-1c2c41fc9218@nvidia.com>
  0 siblings, 2 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-03-02 16:50 UTC (permalink / raw)
  To: Faiz Abbas
  Cc: Jon Hunter, Bitan Biswas, Sowjanya Komatineni, Adrian Hunter,
	Naresh Kamboju, Jens Axboe, Alexei Starovoitov, linux-block,
	lkft-triage, open list, linux-mmc, Arnd Bergmann, John Stultz,
	Thierry Reding, Anders Roxell, Kishon

On Mon, 2 Mar 2020 at 14:11, Faiz Abbas <faiz_abbas@ti.com> wrote:
>
> Uffe,
>
> On 26/02/20 8:51 pm, Ulf Hansson wrote:
> > + Anders, Kishon
> >
> > On Tue, 25 Feb 2020 at 17:24, Jon Hunter <jonathanh@nvidia.com> wrote:
> >>
> >>
> >> On 25/02/2020 14:26, Ulf Hansson wrote:
> >>
> >> ...
> >>
> >>> However, from the core point of view, the response is still requested,
> >>> only that we don't want the driver to wait for the card to stop
> >>> signaling busy. Instead we want to deal with that via "polling" from
> >>> the core.
> >>>
> >>> This is a rather worrying behaviour, as it seems like the host driver
> >>> doesn't really follow this expectations from the core point of view.
> >>> And mmc_flush_cache() is not the only case, as we have erase, bkops,
> >>> sanitize, etc. Are all these working or not really well tested?
> >>
> >> I don't believe that they are well tested. We have a simple test to
> >> mount an eMMC partition, create a file, check the contents, remove the
> >> file and unmount. The timeouts always occur during unmounting.
> >>
> >>> Earlier, before my three patches, if the provided timeout_ms parameter
> >>> to __mmc_switch() was zero, which was the case for
> >>> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
> >>> ignored validating host->max_busy_timeout, which was wrong. In any
> >>> case, this also meant that an R1B response was always used for
> >>> mmc_flush_cache(), as you also indicated above. Perhaps this is the
> >>> critical part where things can go wrong.
> >>>
> >>> BTW, have you tried erase commands for sdhci tegra driver? If those
> >>> are working fine, do you have any special treatments for these?
> >>
> >> That I am not sure, but I will check.
> >
> > Great, thanks. Looking forward to your report.
> >
> > So, from my side, me and Anders Roxell, have been collaborating on
> > testing the behaviour on a TI Beagleboard x15 (remotely with limited
> > debug options), which is using the sdhci-omap variant. I am trying to
> > get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> > conclusions from the observed behaviour on the Beagleboard for the
> > CMD6 cache flush command.
> >
> > First, the reported host->max_busy_timeout is 2581 (ms) for the
> > sdhci-omap driver in this configuration.
> >
> > 1. As we all know by now, the cache flush command (CMD6) fails with
> > -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> > 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> > from the command.
> >
> > 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> > the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> > timeout_ms parameter is less than max_busy_timeout (2000 <  2581).
> > Then everything works fine.
> >
> > 3. Updating the code to again use 30s as the
> > MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> > set, even when the timeout_ms becomes greater than max_busy_timeout.
> > This also works fine.
> >
> > Clearly this indicates a problem that I think needs to be addressed in
> > the sdhci driver. However, of course I can revert the three discussed
> > patches to fix the problem, but that would only hide the issues and I
> > am sure we would then get back to this issue, sooner or later.
> >
> > To fix the problem in the sdhci driver, I would appreciate if someone
> > from TI and Nvidia can step in to help, as I don't have the HW on my
> > desk.
> >
> > Comments or other ideas of how to move forward?
> >
>
> Sorry I missed this earlier.
>
> I don't have an X15 with me here but I'm trying to set one up in our
> remote farm. In the meantime, I tried to reproduce this issue on two
> platforms (dra72-evm and am57xx-evm) and wasn't able to see the issue
> because those eMMC's don't even have a cache. I will keep you updated
> when I do get a board with a eMMC that has a cache.
>
> Is there a way to reproduce this CMD6 issue with another operation?

Yes, most definitely.

Let me cook a debug patch for you that should trigger the problem for
another CMD6 operation. I will post something later this evening or in
the mornings (Swedish timezone).

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-02 16:50                       ` Ulf Hansson
@ 2020-03-03 21:35                         ` Ulf Hansson
       [not found]                         ` <5e9b5646-bd48-e55b-54ee-1c2c41fc9218@nvidia.com>
  1 sibling, 0 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-03-03 21:35 UTC (permalink / raw)
  To: Faiz Abbas
  Cc: Jon Hunter, Bitan Biswas, Sowjanya Komatineni, Adrian Hunter,
	Naresh Kamboju, Jens Axboe, Alexei Starovoitov, linux-block,
	lkft-triage, open list, linux-mmc, Arnd Bergmann, John Stultz,
	Thierry Reding, Anders Roxell, Kishon

On Mon, 2 Mar 2020 at 17:50, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Mon, 2 Mar 2020 at 14:11, Faiz Abbas <faiz_abbas@ti.com> wrote:
> >
> > Uffe,
> >
> > On 26/02/20 8:51 pm, Ulf Hansson wrote:
> > > + Anders, Kishon
> > >
> > > On Tue, 25 Feb 2020 at 17:24, Jon Hunter <jonathanh@nvidia.com> wrote:
> > >>
> > >>
> > >> On 25/02/2020 14:26, Ulf Hansson wrote:
> > >>
> > >> ...
> > >>
> > >>> However, from the core point of view, the response is still requested,
> > >>> only that we don't want the driver to wait for the card to stop
> > >>> signaling busy. Instead we want to deal with that via "polling" from
> > >>> the core.
> > >>>
> > >>> This is a rather worrying behaviour, as it seems like the host driver
> > >>> doesn't really follow this expectations from the core point of view.
> > >>> And mmc_flush_cache() is not the only case, as we have erase, bkops,
> > >>> sanitize, etc. Are all these working or not really well tested?
> > >>
> > >> I don't believe that they are well tested. We have a simple test to
> > >> mount an eMMC partition, create a file, check the contents, remove the
> > >> file and unmount. The timeouts always occur during unmounting.
> > >>
> > >>> Earlier, before my three patches, if the provided timeout_ms parameter
> > >>> to __mmc_switch() was zero, which was the case for
> > >>> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
> > >>> ignored validating host->max_busy_timeout, which was wrong. In any
> > >>> case, this also meant that an R1B response was always used for
> > >>> mmc_flush_cache(), as you also indicated above. Perhaps this is the
> > >>> critical part where things can go wrong.
> > >>>
> > >>> BTW, have you tried erase commands for sdhci tegra driver? If those
> > >>> are working fine, do you have any special treatments for these?
> > >>
> > >> That I am not sure, but I will check.
> > >
> > > Great, thanks. Looking forward to your report.
> > >
> > > So, from my side, me and Anders Roxell, have been collaborating on
> > > testing the behaviour on a TI Beagleboard x15 (remotely with limited
> > > debug options), which is using the sdhci-omap variant. I am trying to
> > > get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> > > conclusions from the observed behaviour on the Beagleboard for the
> > > CMD6 cache flush command.
> > >
> > > First, the reported host->max_busy_timeout is 2581 (ms) for the
> > > sdhci-omap driver in this configuration.
> > >
> > > 1. As we all know by now, the cache flush command (CMD6) fails with
> > > -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> > > 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> > > from the command.
> > >
> > > 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> > > the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> > > timeout_ms parameter is less than max_busy_timeout (2000 <  2581).
> > > Then everything works fine.
> > >
> > > 3. Updating the code to again use 30s as the
> > > MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> > > set, even when the timeout_ms becomes greater than max_busy_timeout.
> > > This also works fine.
> > >
> > > Clearly this indicates a problem that I think needs to be addressed in
> > > the sdhci driver. However, of course I can revert the three discussed
> > > patches to fix the problem, but that would only hide the issues and I
> > > am sure we would then get back to this issue, sooner or later.
> > >
> > > To fix the problem in the sdhci driver, I would appreciate if someone
> > > from TI and Nvidia can step in to help, as I don't have the HW on my
> > > desk.
> > >
> > > Comments or other ideas of how to move forward?
> > >
> >
> > Sorry I missed this earlier.
> >
> > I don't have an X15 with me here but I'm trying to set one up in our
> > remote farm. In the meantime, I tried to reproduce this issue on two
> > platforms (dra72-evm and am57xx-evm) and wasn't able to see the issue
> > because those eMMC's don't even have a cache. I will keep you updated
> > when I do get a board with a eMMC that has a cache.
> >
> > Is there a way to reproduce this CMD6 issue with another operation?
>
> Yes, most definitely.
>
> Let me cook a debug patch for you that should trigger the problem for
> another CMD6 operation. I will post something later this evening or in
> the mornings (Swedish timezone).

A bit later than promised, I am clearly an optimist. In any case
here's the patch I had in mind to trigger the problem for other CMD6
operations. Please give at shot and see what happens.

-------

From: Ulf Hansson <ulf.hansson@linaro.org>
Date: Tue, 3 Mar 2020 22:11:05 +0100
Subject: [PATCH] mmc: core: DEBUG: Force a long timeout for all CMD6

This is to test sdhci-omap, for example, to see what happens when using a
longer timeout. My guess is that it triggers __mmc_switch() to disable the
MMC_RSP_BUSY flag for the command. If so, it likely to make the host driver
to fail, in some way or the other.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
 drivers/mmc/core/mmc_ops.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..f0d2563961f6 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -532,6 +532,9 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8
index, u8 value,

        mmc_retune_hold(host);

+       /* Force a long timeout to likely make use_r1b_resp to become false. */
+       timeout_ms = MMC_CACHE_FLUSH_TIMEOUT_MS;
+
        if (!timeout_ms) {
                pr_warn("%s: unspecified timeout for CMD6 - use generic\n",
                        mmc_hostname(host));
@@ -544,8 +547,11 @@ int __mmc_switch(struct mmc_card *card, u8 set,
u8 index, u8 value,
         * the host to avoid HW busy detection, by converting to a R1 response
         * instead of a R1B.
         */
-       if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
+       if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
+               pr_warn("%s:Disable MMC_RSP_BUSY. timeout_ms(%u) >
max_busy_timeout(%u)\n",
+                       mmc_hostname(host), timeout_ms, host->max_busy_timeout);
                use_r1b_resp = false;
+       }

        cmd.opcode = MMC_SWITCH;
        cmd.arg = (MMC_SWITCH_MODE_WRITE_BYTE << 24) |
-- 

Kind regards
Uffe

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
       [not found]                         ` <5e9b5646-bd48-e55b-54ee-1c2c41fc9218@nvidia.com>
@ 2020-03-04 10:18                           ` Ulf Hansson
  2020-03-04 10:32                             ` Ulf Hansson
  2020-03-04 16:56                             ` Sowjanya Komatineni
  0 siblings, 2 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-03-04 10:18 UTC (permalink / raw)
  To: Sowjanya Komatineni
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon

[...]

>
> So, from my side, me and Anders Roxell, have been collaborating on
> testing the behaviour on a TI Beagleboard x15 (remotely with limited
> debug options), which is using the sdhci-omap variant. I am trying to
> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> conclusions from the observed behaviour on the Beagleboard for the
> CMD6 cache flush command.
>
> First, the reported host->max_busy_timeout is 2581 (ms) for the
> sdhci-omap driver in this configuration.
>
> 1. As we all know by now, the cache flush command (CMD6) fails with
> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> from the command.
>
> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> timeout_ms parameter is less than max_busy_timeout (2000 <  2581).
> Then everything works fine.
>
> 3. Updating the code to again use 30s as the
> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> set, even when the timeout_ms becomes greater than max_busy_timeout.
> This also works fine.
>
> Clearly this indicates a problem that I think needs to be addressed in
> the sdhci driver. However, of course I can revert the three discussed
> patches to fix the problem, but that would only hide the issues and I
> am sure we would then get back to this issue, sooner or later.
>
> To fix the problem in the sdhci driver, I would appreciate if someone
> from TI and Nvidia can step in to help, as I don't have the HW on my
> desk.
>
> Comments or other ideas of how to move forward?

[...]

> Hi Ulf,
>
> I could repro during suspend on Jetson TX1/TX2 as when it does mmc flush cache.

Okay, great.

>
>
> Timeout I see is for switch status CMD13 after sending CMD6 as device side CMD6 is still inflight while host sends CMD13 as we are using R1 response type with timeout_ms changes to 30s.
>
>
>
> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it uses R1B response type and host will wait for busy state followed by response from device for CMD6 and then data lines go High.
>
>
>
> Now with timeout_ms changed to 30s, we use R1 response and SW waits for busy by checking for DAT0 line to go High.

If I understand correctly, because of the timeout now set to 30s,
MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
sdhci-tegra as well?

In other words, mmc_poll_for_busy() is being called, which in your
case means the ->card_busy() host ops (set to sdhci_card_busy() in
your case) will be invoked to wait for the card to stop signal busy on
DAT0.

This indicates to me, that the ->card_busy() ops returns zero to
inform that the card is *not* busy, even if the card actually signals
busy? Is that correct?

>
>
>
> With R1B type, host design after sending command at end of completion after end bit waits for 2 cycles for data line to go low (busy state from device) and waits for response cycles after which data lines will go back high and then we issue switch status CMD13.
>
>
>
> With R1 type, host after sending command and at end of completion after end bit, DATA lines will go high immediately as its R1 type and switch status CMD13 gets issued but by this time it looks like CMD6 on device side is still in flight for sending status and data.

So, yes, using R1 instead of R1B triggers a different behaviour, but
according to the eMMC spec it's perfectly allowed to issue a CMD13
even if the card signals busy on DAT0. The CMD13 is not using the DATA
lines, so this should work.

If I understand correctly, your driver (and controller?) has issues
with coping with this scenario. Is it something that can be fixed?

>
>
> 30s timeout is the wait time for data0 line to go high and mmc_busy_status will return success right away with R1 response type and SW sends switch status CMD13 but during that time on device side looks like still processing CMD6 as we are not waiting for enough time when we use R1 response type.

Right, as stated above, isn't sdhci_card_busy() working for your case?
Can we fix it?

>
>
>
>
> Actually we always use R1B with CMD6 as per spec.

I fully agree that R1B is preferable, but it's not against the spec to
send CMD13 to poll for busy.

Moreover, we need to cope with the scenario when the host has
specified a maximum timeout that isn't sufficiently long enough for
the requested operation. Do you have another proposal for how to
manage this, but disabling MMC_RSP_BUSY?

Let's assume you driver would get a R1B for the CMD6 (we force it),
then what timeout would the driver be using if we would set
cmd.busy_timeout to 30ms?

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-04 10:18                           ` Ulf Hansson
@ 2020-03-04 10:32                             ` Ulf Hansson
  2020-03-04 16:56                             ` Sowjanya Komatineni
  1 sibling, 0 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-03-04 10:32 UTC (permalink / raw)
  To: Sowjanya Komatineni
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon

[...]

 >
> > Actually we always use R1B with CMD6 as per spec.
>
> I fully agree that R1B is preferable, but it's not against the spec to
> send CMD13 to poll for busy.
>
> Moreover, we need to cope with the scenario when the host has
> specified a maximum timeout that isn't sufficiently long enough for
> the requested operation. Do you have another proposal for how to
> manage this, but disabling MMC_RSP_BUSY?
>
> Let's assume you driver would get a R1B for the CMD6 (we force it),
> then what timeout would the driver be using if we would set
> cmd.busy_timeout to 30ms?

/s/30ms/30s

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-04 10:18                           ` Ulf Hansson
  2020-03-04 10:32                             ` Ulf Hansson
@ 2020-03-04 16:56                             ` Sowjanya Komatineni
  2020-03-04 17:21                               ` Sowjanya Komatineni
  1 sibling, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-04 16:56 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/4/20 2:18 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> [...]
>
>> So, from my side, me and Anders Roxell, have been collaborating on
>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>> debug options), which is using the sdhci-omap variant. I am trying to
>> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
>> conclusions from the observed behaviour on the Beagleboard for the
>> CMD6 cache flush command.
>>
>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>> sdhci-omap driver in this configuration.
>>
>> 1. As we all know by now, the cache flush command (CMD6) fails with
>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>> from the command.
>>
>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>> timeout_ms parameter is less than max_busy_timeout (2000 <  2581).
>> Then everything works fine.
>>
>> 3. Updating the code to again use 30s as the
>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>> This also works fine.
>>
>> Clearly this indicates a problem that I think needs to be addressed in
>> the sdhci driver. However, of course I can revert the three discussed
>> patches to fix the problem, but that would only hide the issues and I
>> am sure we would then get back to this issue, sooner or later.
>>
>> To fix the problem in the sdhci driver, I would appreciate if someone
>> from TI and Nvidia can step in to help, as I don't have the HW on my
>> desk.
>>
>> Comments or other ideas of how to move forward?
> [...]
>
>> Hi Ulf,
>>
>> I could repro during suspend on Jetson TX1/TX2 as when it does mmc flush cache.
> Okay, great.
>
>>
>> Timeout I see is for switch status CMD13 after sending CMD6 as device side CMD6 is still inflight while host sends CMD13 as we are using R1 response type with timeout_ms changes to 30s.
>>
>>
>>
>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it uses R1B response type and host will wait for busy state followed by response from device for CMD6 and then data lines go High.
>>
>>
>>
>> Now with timeout_ms changed to 30s, we use R1 response and SW waits for busy by checking for DAT0 line to go High.
> If I understand correctly, because of the timeout now set to 30s,
> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
> sdhci-tegra as well?
Yes
>
> In other words, mmc_poll_for_busy() is being called, which in your
> case means the ->card_busy() host ops (set to sdhci_card_busy() in
> your case) will be invoked to wait for the card to stop signal busy on
> DAT0.
>
> This indicates to me, that the ->card_busy() ops returns zero to
> inform that the card is *not* busy, even if the card actually signals
> busy? Is that correct?
Yes
>
>>
>>
>> With R1B type, host design after sending command at end of completion after end bit waits for 2 cycles for data line to go low (busy state from device) and waits for response cycles after which data lines will go back high and then we issue switch status CMD13.
>>
>>
>>
>> With R1 type, host after sending command and at end of completion after end bit, DATA lines will go high immediately as its R1 type and switch status CMD13 gets issued but by this time it looks like CMD6 on device side is still in flight for sending status and data.
> So, yes, using R1 instead of R1B triggers a different behaviour, but
> according to the eMMC spec it's perfectly allowed to issue a CMD13
> even if the card signals busy on DAT0. The CMD13 is not using the DATA
> lines, so this should work.
>
> If I understand correctly, your driver (and controller?) has issues
> with coping with this scenario. Is it something that can be fixed?
>
>>
>> 30s timeout is the wait time for data0 line to go high and mmc_busy_status will return success right away with R1 response type and SW sends switch status CMD13 but during that time on device side looks like still processing CMD6 as we are not waiting for enough time when we use R1 response type.
> Right, as stated above, isn't sdhci_card_busy() working for your case?
> Can we fix it?

sdhci_card_busy() returned 0 indicating its not busy.

Based on our host design, When CMD6 is issued with R1 type, we program 
it as NO_RESPONSE and with this command complete interrupt happens right 
at end bit of command and there will be no transfer complete interrupt.

When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT and 
with this command complete is end bit of device resp and transfer 
complete interrupt will be when DAT0 LOW -> HIGH.

Regardless of R1/R1B, device side CMD6 will always have busy state on D0 
and response on CMD lines.

There will be 2 clock cycles period after sending CMD6 for device to 
send busy state on data0.

In case of R1 type, after sending command DAT will stay high and looks 
like we are polling for busy early before busy state has started and 
sending CMD13 while device is busy and sending response on CMD line is 
causing timeout.

Probably with this specific case of CMD6 with R1 type, to wait for card 
busy we should poll for DAT0 to go Low first and then to go High??

>
>>
>>
>>
>> Actually we always use R1B with CMD6 as per spec.
> I fully agree that R1B is preferable, but it's not against the spec to
> send CMD13 to poll for busy.
>
> Moreover, we need to cope with the scenario when the host has
> specified a maximum timeout that isn't sufficiently long enough for
> the requested operation. Do you have another proposal for how to
> manage this, but disabling MMC_RSP_BUSY?
>
> Let's assume you driver would get a R1B for the CMD6 (we force it),
> then what timeout would the driver be using if we would set
> cmd.busy_timeout to 30ms?
>
> Kind regards
> Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-04 16:56                             ` Sowjanya Komatineni
@ 2020-03-04 17:21                               ` Sowjanya Komatineni
  2020-03-04 17:26                                 ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-04 17:21 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>
> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> [...]
>>
>>> So, from my side, me and Anders Roxell, have been collaborating on
>>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>>> debug options), which is using the sdhci-omap variant. I am trying to
>>> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
>>> conclusions from the observed behaviour on the Beagleboard for the
>>> CMD6 cache flush command.
>>>
>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>> sdhci-omap driver in this configuration.
>>>
>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>> from the command.
>>>
>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>> Then everything works fine.
>>>
>>> 3. Updating the code to again use 30s as the
>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
>>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>>> This also works fine.
>>>
>>> Clearly this indicates a problem that I think needs to be addressed in
>>> the sdhci driver. However, of course I can revert the three discussed
>>> patches to fix the problem, but that would only hide the issues and I
>>> am sure we would then get back to this issue, sooner or later.
>>>
>>> To fix the problem in the sdhci driver, I would appreciate if someone
>>> from TI and Nvidia can step in to help, as I don't have the HW on my
>>> desk.
>>>
>>> Comments or other ideas of how to move forward?
>> [...]
>>
>>> Hi Ulf,
>>>
>>> I could repro during suspend on Jetson TX1/TX2 as when it does mmc 
>>> flush cache.
>> Okay, great.
>>
>>>
>>> Timeout I see is for switch status CMD13 after sending CMD6 as 
>>> device side CMD6 is still inflight while host sends CMD13 as we are 
>>> using R1 response type with timeout_ms changes to 30s.
>>>
>>>
>>>
>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it 
>>> uses R1B response type and host will wait for busy state followed by 
>>> response from device for CMD6 and then data lines go High.
>>>
>>>
>>>
>>> Now with timeout_ms changed to 30s, we use R1 response and SW waits 
>>> for busy by checking for DAT0 line to go High.
>> If I understand correctly, because of the timeout now set to 30s,
>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>> sdhci-tegra as well?
> Yes
>>
>> In other words, mmc_poll_for_busy() is being called, which in your
>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>> your case) will be invoked to wait for the card to stop signal busy on
>> DAT0.
>>
>> This indicates to me, that the ->card_busy() ops returns zero to
>> inform that the card is *not* busy, even if the card actually signals
>> busy? Is that correct?
> Yes
>>
>>>
>>>
>>> With R1B type, host design after sending command at end of 
>>> completion after end bit waits for 2 cycles for data line to go low 
>>> (busy state from device) and waits for response cycles after which 
>>> data lines will go back high and then we issue switch status CMD13.
>>>
>>>
>>>
>>> With R1 type, host after sending command and at end of completion 
>>> after end bit, DATA lines will go high immediately as its R1 type 
>>> and switch status CMD13 gets issued but by this time it looks like 
>>> CMD6 on device side is still in flight for sending status and data.
>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>> even if the card signals busy on DAT0. The CMD13 is not using the DATA
>> lines, so this should work.
>>
>> If I understand correctly, your driver (and controller?) has issues
>> with coping with this scenario. Is it something that can be fixed?
>>
>>>
>>> 30s timeout is the wait time for data0 line to go high and 
>>> mmc_busy_status will return success right away with R1 response type 
>>> and SW sends switch status CMD13 but during that time on device side 
>>> looks like still processing CMD6 as we are not waiting for enough 
>>> time when we use R1 response type.
>> Right, as stated above, isn't sdhci_card_busy() working for your case?
>> Can we fix it?
>
> sdhci_card_busy() returned 0 indicating its not busy.
>
> Based on our host design, When CMD6 is issued with R1 type, we program 
> it as NO_RESPONSE and with this command complete interrupt happens 
> right at end bit of command and there will be no transfer complete 
> interrupt.
*[Correction] Based on our host design, When CMD6 is issued with R1 type 
as we program it as NO_RESPONSE and with this command complete interrupt 
happens right at end bit of command and there will be no transfer 
complete interrupt.
>
> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT and 
> with this command complete is end bit of device resp and transfer 
> complete interrupt will be when DAT0 LOW -> HIGH.
>
> Regardless of R1/R1B, device side CMD6 will always have busy state on 
> D0 and response on CMD lines.
>
> There will be 2 clock cycles period after sending CMD6 for device to 
> send busy state on data0.
>
> In case of R1 type, after sending command DAT will stay high and looks 
> like we are polling for busy early before busy state has started and 
> sending CMD13 while device is busy and sending response on CMD line is 
> causing timeout.
>
> Probably with this specific case of CMD6 with R1 type, to wait for 
> card busy we should poll for DAT0 to go Low first and then to go High??
>
>>
>>>
>>>
>>>
>>> Actually we always use R1B with CMD6 as per spec.
>> I fully agree that R1B is preferable, but it's not against the spec to
>> send CMD13 to poll for busy.
>>
>> Moreover, we need to cope with the scenario when the host has
>> specified a maximum timeout that isn't sufficiently long enough for
>> the requested operation. Do you have another proposal for how to
>> manage this, but disabling MMC_RSP_BUSY?
>>
>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>> then what timeout would the driver be using if we would set
>> cmd.busy_timeout to 30ms?
>>
>> Kind regards
>> Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-04 17:21                               ` Sowjanya Komatineni
@ 2020-03-04 17:26                                 ` Sowjanya Komatineni
  2020-03-04 17:51                                   ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-04 17:26 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>
> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> [...]
>>>
>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>>>> debug options), which is using the sdhci-omap variant. I am trying to
>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>> CMD6 cache flush command.
>>>>
>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>> sdhci-omap driver in this configuration.
>>>>
>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>> from the command.
>>>>
>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>> Then everything works fine.
>>>>
>>>> 3. Updating the code to again use 30s as the
>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
>>>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>>>> This also works fine.
>>>>
>>>> Clearly this indicates a problem that I think needs to be addressed in
>>>> the sdhci driver. However, of course I can revert the three discussed
>>>> patches to fix the problem, but that would only hide the issues and I
>>>> am sure we would then get back to this issue, sooner or later.
>>>>
>>>> To fix the problem in the sdhci driver, I would appreciate if someone
>>>> from TI and Nvidia can step in to help, as I don't have the HW on my
>>>> desk.
>>>>
>>>> Comments or other ideas of how to move forward?
>>> [...]
>>>
>>>> Hi Ulf,
>>>>
>>>> I could repro during suspend on Jetson TX1/TX2 as when it does mmc 
>>>> flush cache.
>>> Okay, great.
>>>
>>>>
>>>> Timeout I see is for switch status CMD13 after sending CMD6 as 
>>>> device side CMD6 is still inflight while host sends CMD13 as we are 
>>>> using R1 response type with timeout_ms changes to 30s.
>>>>
>>>>
>>>>
>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it 
>>>> uses R1B response type and host will wait for busy state followed 
>>>> by response from device for CMD6 and then data lines go High.
>>>>
>>>>
>>>>
>>>> Now with timeout_ms changed to 30s, we use R1 response and SW waits 
>>>> for busy by checking for DAT0 line to go High.
>>> If I understand correctly, because of the timeout now set to 30s,
>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>> sdhci-tegra as well?
>> Yes
>>>
>>> In other words, mmc_poll_for_busy() is being called, which in your
>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>> your case) will be invoked to wait for the card to stop signal busy on
>>> DAT0.
>>>
>>> This indicates to me, that the ->card_busy() ops returns zero to
>>> inform that the card is *not* busy, even if the card actually signals
>>> busy? Is that correct?
>> Yes
>>>
>>>>
>>>>
>>>> With R1B type, host design after sending command at end of 
>>>> completion after end bit waits for 2 cycles for data line to go low 
>>>> (busy state from device) and waits for response cycles after which 
>>>> data lines will go back high and then we issue switch status CMD13.
>>>>
>>>>
>>>>
>>>> With R1 type, host after sending command and at end of completion 
>>>> after end bit, DATA lines will go high immediately as its R1 type 
>>>> and switch status CMD13 gets issued but by this time it looks like 
>>>> CMD6 on device side is still in flight for sending status and data.
>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>> even if the card signals busy on DAT0. The CMD13 is not using the DATA
>>> lines, so this should work.
>>>
>>> If I understand correctly, your driver (and controller?) has issues
>>> with coping with this scenario. Is it something that can be fixed?
>>>
>>>>
>>>> 30s timeout is the wait time for data0 line to go high and 
>>>> mmc_busy_status will return success right away with R1 response 
>>>> type and SW sends switch status CMD13 but during that time on 
>>>> device side looks like still processing CMD6 as we are not waiting 
>>>> for enough time when we use R1 response type.
>>> Right, as stated above, isn't sdhci_card_busy() working for your case?
>>> Can we fix it?
>>
>> sdhci_card_busy() returned 0 indicating its not busy.
>>
>> Based on our host design, When CMD6 is issued with R1 type, we 
>> program it as NO_RESPONSE and with this command complete interrupt 
>> happens right at end bit of command and there will be no transfer 
>> complete interrupt.
> *[Correction] Based on our host design, When CMD6 is issued with R1 
> type as we program it as NO_RESPONSE and with this command complete 
> interrupt happens right at end bit of command and there will be no 
> transfer complete interrupt.

Sorry to correct wordings, I meant sdhci driver programs response type 
as NO_RESPONSE for CMD6.

When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our host 
design  command complete interrupt happens right at end bit of command 
and there will be no transfer complete interrupt.


>>
>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT 
>> and with this command complete is end bit of device resp and transfer 
>> complete interrupt will be when DAT0 LOW -> HIGH.
>>
>> Regardless of R1/R1B, device side CMD6 will always have busy state on 
>> D0 and response on CMD lines.
>>
>> There will be 2 clock cycles period after sending CMD6 for device to 
>> send busy state on data0.
>>
>> In case of R1 type, after sending command DAT will stay high and 
>> looks like we are polling for busy early before busy state has 
>> started and sending CMD13 while device is busy and sending response 
>> on CMD line is causing timeout.
>>
>> Probably with this specific case of CMD6 with R1 type, to wait for 
>> card busy we should poll for DAT0 to go Low first and then to go High??
>>
>>>
>>>>
>>>>
>>>>
>>>> Actually we always use R1B with CMD6 as per spec.
>>> I fully agree that R1B is preferable, but it's not against the spec to
>>> send CMD13 to poll for busy.
>>>
>>> Moreover, we need to cope with the scenario when the host has
>>> specified a maximum timeout that isn't sufficiently long enough for
>>> the requested operation. Do you have another proposal for how to
>>> manage this, but disabling MMC_RSP_BUSY?
>>>
>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>> then what timeout would the driver be using if we would set
>>> cmd.busy_timeout to 30ms?
>>>
>>> Kind regards
>>> Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-04 17:26                                 ` Sowjanya Komatineni
@ 2020-03-04 17:51                                   ` Sowjanya Komatineni
  2020-03-04 22:35                                     ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-04 17:51 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>
> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> [...]
>>>>
>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>>>>> debug options), which is using the sdhci-omap variant. I am trying to
>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These are 
>>>>> the
>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>> CMD6 cache flush command.
>>>>>
>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>> sdhci-omap driver in this configuration.
>>>>>
>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 
>>>>> 30 *
>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>> from the command.
>>>>>
>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>> Then everything works fine.
>>>>>
>>>>> 3. Updating the code to again use 30s as the
>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY 
>>>>> to be
>>>>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>>>>> This also works fine.
>>>>>
>>>>> Clearly this indicates a problem that I think needs to be 
>>>>> addressed in
>>>>> the sdhci driver. However, of course I can revert the three discussed
>>>>> patches to fix the problem, but that would only hide the issues and I
>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>
>>>>> To fix the problem in the sdhci driver, I would appreciate if someone
>>>>> from TI and Nvidia can step in to help, as I don't have the HW on my
>>>>> desk.
>>>>>
>>>>> Comments or other ideas of how to move forward?
>>>> [...]
>>>>
>>>>> Hi Ulf,
>>>>>
>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does mmc 
>>>>> flush cache.
>>>> Okay, great.
>>>>
>>>>>
>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as 
>>>>> device side CMD6 is still inflight while host sends CMD13 as we 
>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>
>>>>>
>>>>>
>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it 
>>>>> uses R1B response type and host will wait for busy state followed 
>>>>> by response from device for CMD6 and then data lines go High.
>>>>>
>>>>>
>>>>>
>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW 
>>>>> waits for busy by checking for DAT0 line to go High.
>>>> If I understand correctly, because of the timeout now set to 30s,
>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>> sdhci-tegra as well?
>>> Yes
>>>>
>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>> your case) will be invoked to wait for the card to stop signal busy on
>>>> DAT0.
>>>>
>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>> inform that the card is *not* busy, even if the card actually signals
>>>> busy? Is that correct?
>>> Yes
>>>>
>>>>>
>>>>>
>>>>> With R1B type, host design after sending command at end of 
>>>>> completion after end bit waits for 2 cycles for data line to go 
>>>>> low (busy state from device) and waits for response cycles after 
>>>>> which data lines will go back high and then we issue switch status 
>>>>> CMD13.
>>>>>
>>>>>
>>>>>
>>>>> With R1 type, host after sending command and at end of completion 
>>>>> after end bit, DATA lines will go high immediately as its R1 type 
>>>>> and switch status CMD13 gets issued but by this time it looks like 
>>>>> CMD6 on device side is still in flight for sending status and data.
>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>> even if the card signals busy on DAT0. The CMD13 is not using the DATA
>>>> lines, so this should work.
>>>>
>>>> If I understand correctly, your driver (and controller?) has issues
>>>> with coping with this scenario. Is it something that can be fixed?
>>>>
>>>>>
>>>>> 30s timeout is the wait time for data0 line to go high and 
>>>>> mmc_busy_status will return success right away with R1 response 
>>>>> type and SW sends switch status CMD13 but during that time on 
>>>>> device side looks like still processing CMD6 as we are not waiting 
>>>>> for enough time when we use R1 response type.
>>>> Right, as stated above, isn't sdhci_card_busy() working for your case?
>>>> Can we fix it?
>>>
>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>
>>> Based on our host design, When CMD6 is issued with R1 type, we 
>>> program it as NO_RESPONSE and with this command complete interrupt 
>>> happens right at end bit of command and there will be no transfer 
>>> complete interrupt.
>> *[Correction] Based on our host design, When CMD6 is issued with R1 
>> type as we program it as NO_RESPONSE and with this command complete 
>> interrupt happens right at end bit of command and there will be no 
>> transfer complete interrupt.
>
> Sorry to correct wordings, I meant sdhci driver programs response type 
> as NO_RESPONSE for CMD6.
>
> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our host 
> design  command complete interrupt happens right at end bit of command 
> and there will be no transfer complete interrupt.
>
>
Sorry for confusion. Please ignore above on response. it is using SHORT 
response for R1. So SW poll should be working.

Will get back on checking on host design side internally.

>>>
>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT 
>>> and with this command complete is end bit of device resp and 
>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>
>>> Regardless of R1/R1B, device side CMD6 will always have busy state 
>>> on D0 and response on CMD lines.
>>>
>>> There will be 2 clock cycles period after sending CMD6 for device to 
>>> send busy state on data0.
>>>
>>> In case of R1 type, after sending command DAT will stay high and 
>>> looks like we are polling for busy early before busy state has 
>>> started and sending CMD13 while device is busy and sending response 
>>> on CMD line is causing timeout.
>>>
>>> Probably with this specific case of CMD6 with R1 type, to wait for 
>>> card busy we should poll for DAT0 to go Low first and then to go High??
>>>
>>>>

>>>>>
>>>>>
>>>>>
>>>>> Actually we always use R1B with CMD6 as per spec.
>>>> I fully agree that R1B is preferable, but it's not against the spec to
>>>> send CMD13 to poll for busy.
>>>>
>>>> Moreover, we need to cope with the scenario when the host has
>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>> the requested operation. Do you have another proposal for how to
>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>
>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>> then what timeout would the driver be using if we would set
>>>> cmd.busy_timeout to 30ms?
>>>>
>>>> Kind regards
>>>> Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-04 17:51                                   ` Sowjanya Komatineni
@ 2020-03-04 22:35                                     ` Sowjanya Komatineni
  2020-03-05  0:20                                       ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-04 22:35 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
>
> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>>
>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>>>>>> debug options), which is using the sdhci-omap variant. I am 
>>>>>> trying to
>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These 
>>>>>> are the
>>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>>> CMD6 cache flush command.
>>>>>>
>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>>> sdhci-omap driver in this configuration.
>>>>>>
>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 
>>>>>> 30 *
>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>>> from the command.
>>>>>>
>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>>> Then everything works fine.
>>>>>>
>>>>>> 3. Updating the code to again use 30s as the
>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY 
>>>>>> to be
>>>>>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>>>>>> This also works fine.
>>>>>>
>>>>>> Clearly this indicates a problem that I think needs to be 
>>>>>> addressed in
>>>>>> the sdhci driver. However, of course I can revert the three 
>>>>>> discussed
>>>>>> patches to fix the problem, but that would only hide the issues 
>>>>>> and I
>>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>>
>>>>>> To fix the problem in the sdhci driver, I would appreciate if 
>>>>>> someone
>>>>>> from TI and Nvidia can step in to help, as I don't have the HW on my
>>>>>> desk.
>>>>>>
>>>>>> Comments or other ideas of how to move forward?
>>>>> [...]
>>>>>
>>>>>> Hi Ulf,
>>>>>>
>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does 
>>>>>> mmc flush cache.
>>>>> Okay, great.
>>>>>
>>>>>>
>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as 
>>>>>> device side CMD6 is still inflight while host sends CMD13 as we 
>>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it 
>>>>>> uses R1B response type and host will wait for busy state followed 
>>>>>> by response from device for CMD6 and then data lines go High.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW 
>>>>>> waits for busy by checking for DAT0 line to go High.
>>>>> If I understand correctly, because of the timeout now set to 30s,
>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>>> sdhci-tegra as well?
>>>> Yes
>>>>>
>>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>>> your case) will be invoked to wait for the card to stop signal 
>>>>> busy on
>>>>> DAT0.
>>>>>
>>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>>> inform that the card is *not* busy, even if the card actually signals
>>>>> busy? Is that correct?
>>>> Yes
>>>>>
>>>>>>
>>>>>>
>>>>>> With R1B type, host design after sending command at end of 
>>>>>> completion after end bit waits for 2 cycles for data line to go 
>>>>>> low (busy state from device) and waits for response cycles after 
>>>>>> which data lines will go back high and then we issue switch 
>>>>>> status CMD13.
>>>>>>
>>>>>>
>>>>>>
>>>>>> With R1 type, host after sending command and at end of completion 
>>>>>> after end bit, DATA lines will go high immediately as its R1 type 
>>>>>> and switch status CMD13 gets issued but by this time it looks 
>>>>>> like CMD6 on device side is still in flight for sending status 
>>>>>> and data.
>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>>> even if the card signals busy on DAT0. The CMD13 is not using the 
>>>>> DATA
>>>>> lines, so this should work.
>>>>>
>>>>> If I understand correctly, your driver (and controller?) has issues
>>>>> with coping with this scenario. Is it something that can be fixed?
>>>>>
>>>>>>
>>>>>> 30s timeout is the wait time for data0 line to go high and 
>>>>>> mmc_busy_status will return success right away with R1 response 
>>>>>> type and SW sends switch status CMD13 but during that time on 
>>>>>> device side looks like still processing CMD6 as we are not 
>>>>>> waiting for enough time when we use R1 response type.
>>>>> Right, as stated above, isn't sdhci_card_busy() working for your 
>>>>> case?
>>>>> Can we fix it?
>>>>
>>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>>
>>>> Based on our host design, When CMD6 is issued with R1 type, we 
>>>> program it as NO_RESPONSE and with this command complete interrupt 
>>>> happens right at end bit of command and there will be no transfer 
>>>> complete interrupt.
>>> *[Correction] Based on our host design, When CMD6 is issued with R1 
>>> type as we program it as NO_RESPONSE and with this command complete 
>>> interrupt happens right at end bit of command and there will be no 
>>> transfer complete interrupt.
>>
>> Sorry to correct wordings, I meant sdhci driver programs response 
>> type as NO_RESPONSE for CMD6.
>>
>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our 
>> host design  command complete interrupt happens right at end bit of 
>> command and there will be no transfer complete interrupt.
>>
>>
> Sorry for confusion. Please ignore above on response. it is using 
> SHORT response for R1. So SW poll should be working.
>
> Will get back on checking on host design side internally.
>
Hi Ulf,

Verified internally regarding the busy state over DATA0 Our host design 
has known minor bug where command complete interrupt is asserted after 
waiting for busy cycles from device.So because of this polling for 
card_busy() returns 0 (DAT0 line High) immediately as waiting for busy 
is taken care during command complete interrupt in host design. This 
behavior is same for R1 and R1B.


>>>>
>>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT 
>>>> and with this command complete is end bit of device resp and 
>>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>>
>>>> Regardless of R1/R1B, device side CMD6 will always have busy state 
>>>> on D0 and response on CMD lines.
>>>>
>>>> There will be 2 clock cycles period after sending CMD6 for device 
>>>> to send busy state on data0.
>>>>
>>>> In case of R1 type, after sending command DAT will stay high and 
>>>> looks like we are polling for busy early before busy state has 
>>>> started and sending CMD13 while device is busy and sending response 
>>>> on CMD line is causing timeout.
>>>>
>>>> Probably with this specific case of CMD6 with R1 type, to wait for 
>>>> card busy we should poll for DAT0 to go Low first and then to go 
>>>> High??
>>>>
>>>>>
>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>> I fully agree that R1B is preferable, but it's not against the 
>>>>> spec to
>>>>> send CMD13 to poll for busy.
>>>>>
>>>>> Moreover, we need to cope with the scenario when the host has
>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>> the requested operation. Do you have another proposal for how to
>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>
>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>> then what timeout would the driver be using if we would set
>>>>> cmd.busy_timeout to 30ms?
>>>>>

>>>>> Kind regards
>>>>> Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-04 22:35                                     ` Sowjanya Komatineni
@ 2020-03-05  0:20                                       ` Sowjanya Komatineni
  2020-03-05  3:06                                         ` Sowjanya Komatineni
  2020-03-05 13:05                                         ` Ulf Hansson
  0 siblings, 2 replies; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-05  0:20 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
>
> On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>>>
>>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>>>
>>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>>>> External email: Use caution opening links or attachments
>>>>>>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with 
>>>>>>> limited
>>>>>>> debug options), which is using the sdhci-omap variant. I am 
>>>>>>> trying to
>>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These 
>>>>>>> are the
>>>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>>>> CMD6 cache flush command.
>>>>>>>
>>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>>>> sdhci-omap driver in this configuration.
>>>>>>>
>>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set 
>>>>>>> to 30 *
>>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>>>> from the command.
>>>>>>>
>>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>>>> Then everything works fine.
>>>>>>>
>>>>>>> 3. Updating the code to again use 30s as the
>>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY 
>>>>>>> to be
>>>>>>> set, even when the timeout_ms becomes greater than 
>>>>>>> max_busy_timeout.
>>>>>>> This also works fine.
>>>>>>>
>>>>>>> Clearly this indicates a problem that I think needs to be 
>>>>>>> addressed in
>>>>>>> the sdhci driver. However, of course I can revert the three 
>>>>>>> discussed
>>>>>>> patches to fix the problem, but that would only hide the issues 
>>>>>>> and I
>>>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>>>
>>>>>>> To fix the problem in the sdhci driver, I would appreciate if 
>>>>>>> someone
>>>>>>> from TI and Nvidia can step in to help, as I don't have the HW 
>>>>>>> on my
>>>>>>> desk.
>>>>>>>
>>>>>>> Comments or other ideas of how to move forward?
>>>>>> [...]
>>>>>>
>>>>>>> Hi Ulf,
>>>>>>>
>>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does 
>>>>>>> mmc flush cache.
>>>>>> Okay, great.
>>>>>>
>>>>>>>
>>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as 
>>>>>>> device side CMD6 is still inflight while host sends CMD13 as we 
>>>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with 
>>>>>>> it uses R1B response type and host will wait for busy state 
>>>>>>> followed by response from device for CMD6 and then data lines go 
>>>>>>> High.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW 
>>>>>>> waits for busy by checking for DAT0 line to go High.
>>>>>> If I understand correctly, because of the timeout now set to 30s,
>>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>>>> sdhci-tegra as well?
>>>>> Yes
>>>>>>
>>>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>>>> your case) will be invoked to wait for the card to stop signal 
>>>>>> busy on
>>>>>> DAT0.
>>>>>>
>>>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>>>> inform that the card is *not* busy, even if the card actually 
>>>>>> signals
>>>>>> busy? Is that correct?
>>>>> Yes
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With R1B type, host design after sending command at end of 
>>>>>>> completion after end bit waits for 2 cycles for data line to go 
>>>>>>> low (busy state from device) and waits for response cycles after 
>>>>>>> which data lines will go back high and then we issue switch 
>>>>>>> status CMD13.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With R1 type, host after sending command and at end of 
>>>>>>> completion after end bit, DATA lines will go high immediately as 
>>>>>>> its R1 type and switch status CMD13 gets issued but by this time 
>>>>>>> it looks like CMD6 on device side is still in flight for sending 
>>>>>>> status and data.
>>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>>>> even if the card signals busy on DAT0. The CMD13 is not using the 
>>>>>> DATA
>>>>>> lines, so this should work.
>>>>>>
>>>>>> If I understand correctly, your driver (and controller?) has issues
>>>>>> with coping with this scenario. Is it something that can be fixed?
>>>>>>
>>>>>>>
>>>>>>> 30s timeout is the wait time for data0 line to go high and 
>>>>>>> mmc_busy_status will return success right away with R1 response 
>>>>>>> type and SW sends switch status CMD13 but during that time on 
>>>>>>> device side looks like still processing CMD6 as we are not 
>>>>>>> waiting for enough time when we use R1 response type.
>>>>>> Right, as stated above, isn't sdhci_card_busy() working for your 
>>>>>> case?
>>>>>> Can we fix it?
>>>>>
>>>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>>>
>>>>> Based on our host design, When CMD6 is issued with R1 type, we 
>>>>> program it as NO_RESPONSE and with this command complete interrupt 
>>>>> happens right at end bit of command and there will be no transfer 
>>>>> complete interrupt.
>>>> *[Correction] Based on our host design, When CMD6 is issued with R1 
>>>> type as we program it as NO_RESPONSE and with this command complete 
>>>> interrupt happens right at end bit of command and there will be no 
>>>> transfer complete interrupt.
>>>
>>> Sorry to correct wordings, I meant sdhci driver programs response 
>>> type as NO_RESPONSE for CMD6.
>>>
>>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our 
>>> host design  command complete interrupt happens right at end bit of 
>>> command and there will be no transfer complete interrupt.
>>>
>>>
>> Sorry for confusion. Please ignore above on response. it is using 
>> SHORT response for R1. So SW poll should be working.
>>
>> Will get back on checking on host design side internally.
>>
> Hi Ulf,
>
> Verified internally regarding the busy state over DATA0 Our host 
> design has known minor bug where command complete interrupt is 
> asserted after waiting for busy cycles from device.So because of this 
> polling for card_busy() returns 0 (DAT0 line High) immediately as 
> waiting for busy is taken care during command complete interrupt in 
> host design. This behavior is same for R1 and R1B.
>
>
>>>>>
>>>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT 
>>>>> and with this command complete is end bit of device resp and 
>>>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>>>
>>>>> Regardless of R1/R1B, device side CMD6 will always have busy state 
>>>>> on D0 and response on CMD lines.
>>>>>
>>>>> There will be 2 clock cycles period after sending CMD6 for device 
>>>>> to send busy state on data0.
>>>>>
>>>>> In case of R1 type, after sending command DAT will stay high and 
>>>>> looks like we are polling for busy early before busy state has 
>>>>> started and sending CMD13 while device is busy and sending 
>>>>> response on CMD line is causing timeout.
>>>>>
>>>>> Probably with this specific case of CMD6 with R1 type, to wait for 
>>>>> card busy we should poll for DAT0 to go Low first and then to go 
>>>>> High??
>>>>>
>>>>>>
>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>> I fully agree that R1B is preferable, but it's not against the 
>>>>>> spec to
>>>>>> send CMD13 to poll for busy.
>>>>>>
>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>> the requested operation. Do you have another proposal for how to
>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>
>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>> then what timeout would the driver be using if we would set
>>>>>> cmd.busy_timeout to 30ms?
>>>>>>
>
Sorry didn't understood clearly. Are you asking with 30s timeout, whats 
the data timeout counter used?

Because of above mentioned issue on our host where CMD interrupt happens 
after busy state, poll for busy returns right away as not busy.

So issuing CMD13 after CMD6-R1 followed by busy poll should be working. 
But weird that with small delay of 1ms or debug print before CMD13 it 
doesn't timeout and works all the time.


>>>>>> Kind regards
>>>>>> Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-05  0:20                                       ` Sowjanya Komatineni
@ 2020-03-05  3:06                                         ` Sowjanya Komatineni
  2020-03-05 13:05                                         ` Ulf Hansson
  1 sibling, 0 replies; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-05  3:06 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/4/20 4:20 PM, Sowjanya Komatineni wrote:
>
> On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>>>>
>>>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>>>>
>>>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>>>>
>>>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>>>>> External email: Use caution opening links or attachments
>>>>>>>
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with 
>>>>>>>> limited
>>>>>>>> debug options), which is using the sdhci-omap variant. I am 
>>>>>>>> trying to
>>>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These 
>>>>>>>> are the
>>>>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>>>>> CMD6 cache flush command.
>>>>>>>>
>>>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>>>>> sdhci-omap driver in this configuration.
>>>>>>>>
>>>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails 
>>>>>>>> with
>>>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set 
>>>>>>>> to 30 *
>>>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>>>>> from the command.
>>>>>>>>
>>>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means 
>>>>>>>> that
>>>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>>>>> Then everything works fine.
>>>>>>>>
>>>>>>>> 3. Updating the code to again use 30s as the
>>>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the 
>>>>>>>> MMC_RSP_BUSY to be
>>>>>>>> set, even when the timeout_ms becomes greater than 
>>>>>>>> max_busy_timeout.
>>>>>>>> This also works fine.
>>>>>>>>
>>>>>>>> Clearly this indicates a problem that I think needs to be 
>>>>>>>> addressed in
>>>>>>>> the sdhci driver. However, of course I can revert the three 
>>>>>>>> discussed
>>>>>>>> patches to fix the problem, but that would only hide the issues 
>>>>>>>> and I
>>>>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>>>>
>>>>>>>> To fix the problem in the sdhci driver, I would appreciate if 
>>>>>>>> someone
>>>>>>>> from TI and Nvidia can step in to help, as I don't have the HW 
>>>>>>>> on my
>>>>>>>> desk.
>>>>>>>>
>>>>>>>> Comments or other ideas of how to move forward?
>>>>>>> [...]
>>>>>>>
>>>>>>>> Hi Ulf,
>>>>>>>>
>>>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does 
>>>>>>>> mmc flush cache.
>>>>>>> Okay, great.
>>>>>>>
>>>>>>>>
>>>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as 
>>>>>>>> device side CMD6 is still inflight while host sends CMD13 as we 
>>>>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with 
>>>>>>>> it uses R1B response type and host will wait for busy state 
>>>>>>>> followed by response from device for CMD6 and then data lines 
>>>>>>>> go High.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW 
>>>>>>>> waits for busy by checking for DAT0 line to go High.
>>>>>>> If I understand correctly, because of the timeout now set to 30s,
>>>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>>>>> sdhci-tegra as well?
>>>>>> Yes
>>>>>>>
>>>>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>>>>> your case) will be invoked to wait for the card to stop signal 
>>>>>>> busy on
>>>>>>> DAT0.
>>>>>>>
>>>>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>>>>> inform that the card is *not* busy, even if the card actually 
>>>>>>> signals
>>>>>>> busy? Is that correct?
>>>>>> Yes
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> With R1B type, host design after sending command at end of 
>>>>>>>> completion after end bit waits for 2 cycles for data line to go 
>>>>>>>> low (busy state from device) and waits for response cycles 
>>>>>>>> after which data lines will go back high and then we issue 
>>>>>>>> switch status CMD13.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> With R1 type, host after sending command and at end of 
>>>>>>>> completion after end bit, DATA lines will go high immediately 
>>>>>>>> as its R1 type and switch status CMD13 gets issued but by this 
>>>>>>>> time it looks like CMD6 on device side is still in flight for 
>>>>>>>> sending status and data.
>>>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, 
>>>>>>> but
>>>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>>>>> even if the card signals busy on DAT0. The CMD13 is not using 
>>>>>>> the DATA
>>>>>>> lines, so this should work.
>>>>>>>
>>>>>>> If I understand correctly, your driver (and controller?) has issues
>>>>>>> with coping with this scenario. Is it something that can be fixed?
>>>>>>>
>>>>>>>>
>>>>>>>> 30s timeout is the wait time for data0 line to go high and 
>>>>>>>> mmc_busy_status will return success right away with R1 response 
>>>>>>>> type and SW sends switch status CMD13 but during that time on 
>>>>>>>> device side looks like still processing CMD6 as we are not 
>>>>>>>> waiting for enough time when we use R1 response type.
>>>>>>> Right, as stated above, isn't sdhci_card_busy() working for your 
>>>>>>> case?
>>>>>>> Can we fix it?
>>>>>>
>>>>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>>>>
>>>>>> Based on our host design, When CMD6 is issued with R1 type, we 
>>>>>> program it as NO_RESPONSE and with this command complete 
>>>>>> interrupt happens right at end bit of command and there will be 
>>>>>> no transfer complete interrupt.
>>>>> *[Correction] Based on our host design, When CMD6 is issued with 
>>>>> R1 type as we program it as NO_RESPONSE and with this command 
>>>>> complete interrupt happens right at end bit of command and there 
>>>>> will be no transfer complete interrupt.
>>>>
>>>> Sorry to correct wordings, I meant sdhci driver programs response 
>>>> type as NO_RESPONSE for CMD6.
>>>>
>>>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our 
>>>> host design  command complete interrupt happens right at end bit of 
>>>> command and there will be no transfer complete interrupt.
>>>>
>>>>
>>> Sorry for confusion. Please ignore above on response. it is using 
>>> SHORT response for R1. So SW poll should be working.
>>>
>>> Will get back on checking on host design side internally.
>>>
>> Hi Ulf,
>>
>> Verified internally regarding the busy state over DATA0 Our host 
>> design has known minor bug where command complete interrupt is 
>> asserted after waiting for busy cycles from device.So because of this 
>> polling for card_busy() returns 0 (DAT0 line High) immediately as 
>> waiting for busy is taken care during command complete interrupt in 
>> host design. This behavior is same for R1 and R1B.
>>
>>
>>>>>>
>>>>>> When CMD6 is issued with R1B type, we program is as R1B 
>>>>>> RESP_SHORT and with this command complete is end bit of device 
>>>>>> resp and transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>>>>
>>>>>> Regardless of R1/R1B, device side CMD6 will always have busy 
>>>>>> state on D0 and response on CMD lines.
>>>>>>
>>>>>> There will be 2 clock cycles period after sending CMD6 for device 
>>>>>> to send busy state on data0.
>>>>>>
>>>>>> In case of R1 type, after sending command DAT will stay high and 
>>>>>> looks like we are polling for busy early before busy state has 
>>>>>> started and sending CMD13 while device is busy and sending 
>>>>>> response on CMD line is causing timeout.
>>>>>>
>>>>>> Probably with this specific case of CMD6 with R1 type, to wait 
>>>>>> for card busy we should poll for DAT0 to go Low first and then to 
>>>>>> go High??
>>>>>>
>>>>>>>
>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>> I fully agree that R1B is preferable, but it's not against the 
>>>>>>> spec to
>>>>>>> send CMD13 to poll for busy.
>>>>>>>
>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>
>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>> then what timeout would the driver be using if we would set
>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>
>>
> Sorry didn't understood clearly. Are you asking with 30s timeout, 
> whats the data timeout counter used?
>
> Because of above mentioned issue on our host where CMD interrupt 
> happens after busy state, poll for busy returns right away as not busy.
>
> So issuing CMD13 after CMD6-R1 followed by busy poll should be 
> working. But weird that with small delay of 1ms or debug print before 
> CMD13 it doesn't timeout and works all the time.
>
>
With R1B for CMD6, busy detection timeout on our host is 11s (data 
timeout count = 0xE)
>>>>>>> Kind regards
>>>>>>> Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-05  0:20                                       ` Sowjanya Komatineni
  2020-03-05  3:06                                         ` Sowjanya Komatineni
@ 2020-03-05 13:05                                         ` Ulf Hansson
  2020-03-06  2:44                                           ` Sowjanya Komatineni
  1 sibling, 1 reply; 45+ messages in thread
From: Ulf Hansson @ 2020-03-05 13:05 UTC (permalink / raw)
  To: Sowjanya Komatineni
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon

On Thu, 5 Mar 2020 at 01:20, Sowjanya Komatineni <skomatineni@nvidia.com> wrote:
>
>
> On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
> >
> > On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
> >>
> >> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
> >>>
> >>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
> >>>>
> >>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
> >>>>>
> >>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
> >>>>>> External email: Use caution opening links or attachments
> >>>>>>
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
> >>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with
> >>>>>>> limited
> >>>>>>> debug options), which is using the sdhci-omap variant. I am
> >>>>>>> trying to
> >>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These
> >>>>>>> are the
> >>>>>>> conclusions from the observed behaviour on the Beagleboard for the
> >>>>>>> CMD6 cache flush command.
> >>>>>>>
> >>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
> >>>>>>> sdhci-omap driver in this configuration.
> >>>>>>>
> >>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
> >>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set
> >>>>>>> to 30 *
> >>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> >>>>>>> from the command.
> >>>>>>>
> >>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> >>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> >>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
> >>>>>>> Then everything works fine.
> >>>>>>>
> >>>>>>> 3. Updating the code to again use 30s as the
> >>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY
> >>>>>>> to be
> >>>>>>> set, even when the timeout_ms becomes greater than
> >>>>>>> max_busy_timeout.
> >>>>>>> This also works fine.
> >>>>>>>
> >>>>>>> Clearly this indicates a problem that I think needs to be
> >>>>>>> addressed in
> >>>>>>> the sdhci driver. However, of course I can revert the three
> >>>>>>> discussed
> >>>>>>> patches to fix the problem, but that would only hide the issues
> >>>>>>> and I
> >>>>>>> am sure we would then get back to this issue, sooner or later.
> >>>>>>>
> >>>>>>> To fix the problem in the sdhci driver, I would appreciate if
> >>>>>>> someone
> >>>>>>> from TI and Nvidia can step in to help, as I don't have the HW
> >>>>>>> on my
> >>>>>>> desk.
> >>>>>>>
> >>>>>>> Comments or other ideas of how to move forward?
> >>>>>> [...]
> >>>>>>
> >>>>>>> Hi Ulf,
> >>>>>>>
> >>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does
> >>>>>>> mmc flush cache.
> >>>>>> Okay, great.
> >>>>>>
> >>>>>>>
> >>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
> >>>>>>> device side CMD6 is still inflight while host sends CMD13 as we
> >>>>>>> are using R1 response type with timeout_ms changes to 30s.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with
> >>>>>>> it uses R1B response type and host will wait for busy state
> >>>>>>> followed by response from device for CMD6 and then data lines go
> >>>>>>> High.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW
> >>>>>>> waits for busy by checking for DAT0 line to go High.
> >>>>>> If I understand correctly, because of the timeout now set to 30s,
> >>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
> >>>>>> sdhci-tegra as well?
> >>>>> Yes
> >>>>>>
> >>>>>> In other words, mmc_poll_for_busy() is being called, which in your
> >>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
> >>>>>> your case) will be invoked to wait for the card to stop signal
> >>>>>> busy on
> >>>>>> DAT0.
> >>>>>>
> >>>>>> This indicates to me, that the ->card_busy() ops returns zero to
> >>>>>> inform that the card is *not* busy, even if the card actually
> >>>>>> signals
> >>>>>> busy? Is that correct?
> >>>>> Yes
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> With R1B type, host design after sending command at end of
> >>>>>>> completion after end bit waits for 2 cycles for data line to go
> >>>>>>> low (busy state from device) and waits for response cycles after
> >>>>>>> which data lines will go back high and then we issue switch
> >>>>>>> status CMD13.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> With R1 type, host after sending command and at end of
> >>>>>>> completion after end bit, DATA lines will go high immediately as
> >>>>>>> its R1 type and switch status CMD13 gets issued but by this time
> >>>>>>> it looks like CMD6 on device side is still in flight for sending
> >>>>>>> status and data.
> >>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
> >>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
> >>>>>> even if the card signals busy on DAT0. The CMD13 is not using the
> >>>>>> DATA
> >>>>>> lines, so this should work.
> >>>>>>
> >>>>>> If I understand correctly, your driver (and controller?) has issues
> >>>>>> with coping with this scenario. Is it something that can be fixed?
> >>>>>>
> >>>>>>>
> >>>>>>> 30s timeout is the wait time for data0 line to go high and
> >>>>>>> mmc_busy_status will return success right away with R1 response
> >>>>>>> type and SW sends switch status CMD13 but during that time on
> >>>>>>> device side looks like still processing CMD6 as we are not
> >>>>>>> waiting for enough time when we use R1 response type.
> >>>>>> Right, as stated above, isn't sdhci_card_busy() working for your
> >>>>>> case?
> >>>>>> Can we fix it?
> >>>>>
> >>>>> sdhci_card_busy() returned 0 indicating its not busy.
> >>>>>
> >>>>> Based on our host design, When CMD6 is issued with R1 type, we
> >>>>> program it as NO_RESPONSE and with this command complete interrupt
> >>>>> happens right at end bit of command and there will be no transfer
> >>>>> complete interrupt.
> >>>> *[Correction] Based on our host design, When CMD6 is issued with R1
> >>>> type as we program it as NO_RESPONSE and with this command complete
> >>>> interrupt happens right at end bit of command and there will be no
> >>>> transfer complete interrupt.
> >>>
> >>> Sorry to correct wordings, I meant sdhci driver programs response
> >>> type as NO_RESPONSE for CMD6.
> >>>
> >>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our
> >>> host design  command complete interrupt happens right at end bit of
> >>> command and there will be no transfer complete interrupt.
> >>>
> >>>
> >> Sorry for confusion. Please ignore above on response. it is using
> >> SHORT response for R1. So SW poll should be working.
> >>
> >> Will get back on checking on host design side internally.
> >>
> > Hi Ulf,
> >
> > Verified internally regarding the busy state over DATA0 Our host
> > design has known minor bug where command complete interrupt is
> > asserted after waiting for busy cycles from device.So because of this
> > polling for card_busy() returns 0 (DAT0 line High) immediately as
> > waiting for busy is taken care during command complete interrupt in
> > host design. This behavior is same for R1 and R1B.
> >
> >
> >>>>>
> >>>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT
> >>>>> and with this command complete is end bit of device resp and
> >>>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
> >>>>>
> >>>>> Regardless of R1/R1B, device side CMD6 will always have busy state
> >>>>> on D0 and response on CMD lines.
> >>>>>
> >>>>> There will be 2 clock cycles period after sending CMD6 for device
> >>>>> to send busy state on data0.
> >>>>>
> >>>>> In case of R1 type, after sending command DAT will stay high and
> >>>>> looks like we are polling for busy early before busy state has
> >>>>> started and sending CMD13 while device is busy and sending
> >>>>> response on CMD line is causing timeout.
> >>>>>
> >>>>> Probably with this specific case of CMD6 with R1 type, to wait for
> >>>>> card busy we should poll for DAT0 to go Low first and then to go
> >>>>> High??
> >>>>>
> >>>>>>
> >>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Actually we always use R1B with CMD6 as per spec.
> >>>>>> I fully agree that R1B is preferable, but it's not against the
> >>>>>> spec to
> >>>>>> send CMD13 to poll for busy.
> >>>>>>
> >>>>>> Moreover, we need to cope with the scenario when the host has
> >>>>>> specified a maximum timeout that isn't sufficiently long enough for
> >>>>>> the requested operation. Do you have another proposal for how to
> >>>>>> manage this, but disabling MMC_RSP_BUSY?
> >>>>>>
> >>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
> >>>>>> then what timeout would the driver be using if we would set
> >>>>>> cmd.busy_timeout to 30ms?
> >>>>>>
> >
> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
> the data timeout counter used?

Yes. It seems like it will pick the maximum, which is 11s?

>
> Because of above mentioned issue on our host where CMD interrupt happens
> after busy state, poll for busy returns right away as not busy.

I see.

>
> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
> But weird that with small delay of 1ms or debug print before CMD13 it
> doesn't timeout and works all the time.

I have digested the information you provided in these emails. Let me
summarize it, to see if I have understood correctly.

1.
Your controller can't distinguish between R1 and R1B because of a
limitation in the HW. So, in both cases you need to wait for the card
to stop signal busy, before the controller can give an IRQ to notify
that the R1 response has been received. Correct?

In this context, I am wondering if sdhci_send_command(), really
conforms to these requirements. For example, depending on if the CMD6
has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
SDHCI_CMD_RESP_SHORT_BUSY.

Does this work as expected for your case?

2.
Assuming my interpretation of the above is somewhat correct. Then you
always need to set a busy timeout for R1/R1B responses in the
controller. The maximum timeout seems to be 11s long. Obviously, this
isn't enough for all cases, such as cache flushing and erase, for
example. So, what can we do to support a longer timeouts than 11s?
Would it be possible to disable the HW timeout, if the requested
timeout is longer than 11s and use a SW timeout instead?

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-05 13:05                                         ` Ulf Hansson
@ 2020-03-06  2:44                                           ` Sowjanya Komatineni
  2020-03-06 11:14                                             ` Ulf Hansson
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-06  2:44 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/5/20 5:05 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, 5 Mar 2020 at 01:20, Sowjanya Komatineni <skomatineni@nvidia.com> wrote:
>>
>> On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
>>> On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
>>>> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>>>>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>>>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>>>>>> External email: Use caution opening links or attachments
>>>>>>>>
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with
>>>>>>>>> limited
>>>>>>>>> debug options), which is using the sdhci-omap variant. I am
>>>>>>>>> trying to
>>>>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These
>>>>>>>>> are the
>>>>>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>>>>>> CMD6 cache flush command.
>>>>>>>>>
>>>>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>>>>>> sdhci-omap driver in this configuration.
>>>>>>>>>
>>>>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set
>>>>>>>>> to 30 *
>>>>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>>>>>> from the command.
>>>>>>>>>
>>>>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>>>>>> Then everything works fine.
>>>>>>>>>
>>>>>>>>> 3. Updating the code to again use 30s as the
>>>>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY
>>>>>>>>> to be
>>>>>>>>> set, even when the timeout_ms becomes greater than
>>>>>>>>> max_busy_timeout.
>>>>>>>>> This also works fine.
>>>>>>>>>
>>>>>>>>> Clearly this indicates a problem that I think needs to be
>>>>>>>>> addressed in
>>>>>>>>> the sdhci driver. However, of course I can revert the three
>>>>>>>>> discussed
>>>>>>>>> patches to fix the problem, but that would only hide the issues
>>>>>>>>> and I
>>>>>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>>>>>
>>>>>>>>> To fix the problem in the sdhci driver, I would appreciate if
>>>>>>>>> someone
>>>>>>>>> from TI and Nvidia can step in to help, as I don't have the HW
>>>>>>>>> on my
>>>>>>>>> desk.
>>>>>>>>>
>>>>>>>>> Comments or other ideas of how to move forward?
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> Hi Ulf,
>>>>>>>>>
>>>>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does
>>>>>>>>> mmc flush cache.
>>>>>>>> Okay, great.
>>>>>>>>
>>>>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
>>>>>>>>> device side CMD6 is still inflight while host sends CMD13 as we
>>>>>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with
>>>>>>>>> it uses R1B response type and host will wait for busy state
>>>>>>>>> followed by response from device for CMD6 and then data lines go
>>>>>>>>> High.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW
>>>>>>>>> waits for busy by checking for DAT0 line to go High.
>>>>>>>> If I understand correctly, because of the timeout now set to 30s,
>>>>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>>>>>> sdhci-tegra as well?
>>>>>>> Yes
>>>>>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>>>>>> your case) will be invoked to wait for the card to stop signal
>>>>>>>> busy on
>>>>>>>> DAT0.
>>>>>>>>
>>>>>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>>>>>> inform that the card is *not* busy, even if the card actually
>>>>>>>> signals
>>>>>>>> busy? Is that correct?
>>>>>>> Yes
>>>>>>>>>
>>>>>>>>> With R1B type, host design after sending command at end of
>>>>>>>>> completion after end bit waits for 2 cycles for data line to go
>>>>>>>>> low (busy state from device) and waits for response cycles after
>>>>>>>>> which data lines will go back high and then we issue switch
>>>>>>>>> status CMD13.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With R1 type, host after sending command and at end of
>>>>>>>>> completion after end bit, DATA lines will go high immediately as
>>>>>>>>> its R1 type and switch status CMD13 gets issued but by this time
>>>>>>>>> it looks like CMD6 on device side is still in flight for sending
>>>>>>>>> status and data.
>>>>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>>>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>>>>>> even if the card signals busy on DAT0. The CMD13 is not using the
>>>>>>>> DATA
>>>>>>>> lines, so this should work.
>>>>>>>>
>>>>>>>> If I understand correctly, your driver (and controller?) has issues
>>>>>>>> with coping with this scenario. Is it something that can be fixed?
>>>>>>>>
>>>>>>>>> 30s timeout is the wait time for data0 line to go high and
>>>>>>>>> mmc_busy_status will return success right away with R1 response
>>>>>>>>> type and SW sends switch status CMD13 but during that time on
>>>>>>>>> device side looks like still processing CMD6 as we are not
>>>>>>>>> waiting for enough time when we use R1 response type.
>>>>>>>> Right, as stated above, isn't sdhci_card_busy() working for your
>>>>>>>> case?
>>>>>>>> Can we fix it?
>>>>>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>>>>>
>>>>>>> Based on our host design, When CMD6 is issued with R1 type, we
>>>>>>> program it as NO_RESPONSE and with this command complete interrupt
>>>>>>> happens right at end bit of command and there will be no transfer
>>>>>>> complete interrupt.
>>>>>> *[Correction] Based on our host design, When CMD6 is issued with R1
>>>>>> type as we program it as NO_RESPONSE and with this command complete
>>>>>> interrupt happens right at end bit of command and there will be no
>>>>>> transfer complete interrupt.
>>>>> Sorry to correct wordings, I meant sdhci driver programs response
>>>>> type as NO_RESPONSE for CMD6.
>>>>>
>>>>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our
>>>>> host design  command complete interrupt happens right at end bit of
>>>>> command and there will be no transfer complete interrupt.
>>>>>
>>>>>
>>>> Sorry for confusion. Please ignore above on response. it is using
>>>> SHORT response for R1. So SW poll should be working.
>>>>
>>>> Will get back on checking on host design side internally.
>>>>
>>> Hi Ulf,
>>>
>>> Verified internally regarding the busy state over DATA0 Our host
>>> design has known minor bug where command complete interrupt is
>>> asserted after waiting for busy cycles from device.So because of this
>>> polling for card_busy() returns 0 (DAT0 line High) immediately as
>>> waiting for busy is taken care during command complete interrupt in
>>> host design. This behavior is same for R1 and R1B.
>>>
>>>
>>>>>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT
>>>>>>> and with this command complete is end bit of device resp and
>>>>>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>>>>>
>>>>>>> Regardless of R1/R1B, device side CMD6 will always have busy state
>>>>>>> on D0 and response on CMD lines.
>>>>>>>
>>>>>>> There will be 2 clock cycles period after sending CMD6 for device
>>>>>>> to send busy state on data0.
>>>>>>>
>>>>>>> In case of R1 type, after sending command DAT will stay high and
>>>>>>> looks like we are polling for busy early before busy state has
>>>>>>> started and sending CMD13 while device is busy and sending
>>>>>>> response on CMD line is causing timeout.
>>>>>>>
>>>>>>> Probably with this specific case of CMD6 with R1 type, to wait for
>>>>>>> card busy we should poll for DAT0 to go Low first and then to go
>>>>>>> High??
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>>> spec to
>>>>>>>> send CMD13 to poll for busy.
>>>>>>>>
>>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>>
>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>>> then what timeout would the driver be using if we would set
>>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>>
>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
>> the data timeout counter used?
> Yes. It seems like it will pick the maximum, which is 11s?
yes
>
>> Because of above mentioned issue on our host where CMD interrupt happens
>> after busy state, poll for busy returns right away as not busy.
> I see.
>
>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
>> But weird that with small delay of 1ms or debug print before CMD13 it
>> doesn't timeout and works all the time.
> I have digested the information you provided in these emails. Let me
> summarize it, to see if I have understood correctly.
>
> 1.
> Your controller can't distinguish between R1 and R1B because of a
> limitation in the HW. So, in both cases you need to wait for the card
> to stop signal busy, before the controller can give an IRQ to notify
> that the R1 response has been received. Correct?
>
> In this context, I am wondering if sdhci_send_command(), really
> conforms to these requirements. For example, depending on if the CMD6
> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
> SDHCI_CMD_RESP_SHORT_BUSY.
>
> Does this work as expected for your case?
Design team re-verified internally and bug where HW waits for busy state 
before IRQ is only for R1B and R1 is spec compliant.

So, with R1, CMD complete is generated after response received.

With R1B, CMD complete and xfer complete both are generated after 
response received + device busy (max timeout of 11s)
DATA timeout interrupt will be asserted incase if HW busy detection fails.

With R1B we may see DATA Timeout if operation takes more than max busy 
timeout of 11s.

> 2.
> Assuming my interpretation of the above is somewhat correct. Then you
> always need to set a busy timeout for R1/R1B responses in the
> controller. The maximum timeout seems to be 11s long. Obviously, this
> isn't enough for all cases, such as cache flushing and erase, for
> example. So, what can we do to support a longer timeouts than 11s?
> Would it be possible to disable the HW timeout, if the requested
> timeout is longer than 11s and use a SW timeout instead?
>
> Kind regards
> Uffe

For erase long operations we have register bit to enable for infinite 
busy wait mode where host controller would be monitoring until card is busy.

But so far for emmc devices we used on our platforms, we haven't seen 
cache flush taking more than 11s.

Will get back on possibility of disabling HW timeout and using SW timeout..

Thanks

Sowjanya


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-06  2:44                                           ` Sowjanya Komatineni
@ 2020-03-06 11:14                                             ` Ulf Hansson
  2020-03-09 14:07                                               ` Faiz Abbas
  2020-03-09 17:35                                               ` Sowjanya Komatineni
  0 siblings, 2 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-03-06 11:14 UTC (permalink / raw)
  To: Sowjanya Komatineni
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon

[...]

> >>>>>>>>>
> >>>>>>>>> Actually we always use R1B with CMD6 as per spec.
> >>>>>>>> I fully agree that R1B is preferable, but it's not against the
> >>>>>>>> spec to
> >>>>>>>> send CMD13 to poll for busy.
> >>>>>>>>
> >>>>>>>> Moreover, we need to cope with the scenario when the host has
> >>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
> >>>>>>>> the requested operation. Do you have another proposal for how to
> >>>>>>>> manage this, but disabling MMC_RSP_BUSY?
> >>>>>>>>
> >>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
> >>>>>>>> then what timeout would the driver be using if we would set
> >>>>>>>> cmd.busy_timeout to 30ms?
> >>>>>>>>
> >> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
> >> the data timeout counter used?
> > Yes. It seems like it will pick the maximum, which is 11s?
> yes

Okay, thanks!

> >
> >> Because of above mentioned issue on our host where CMD interrupt happens
> >> after busy state, poll for busy returns right away as not busy.
> > I see.
> >
> >> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
> >> But weird that with small delay of 1ms or debug print before CMD13 it
> >> doesn't timeout and works all the time.
> > I have digested the information you provided in these emails. Let me
> > summarize it, to see if I have understood correctly.
> >
> > 1.
> > Your controller can't distinguish between R1 and R1B because of a
> > limitation in the HW. So, in both cases you need to wait for the card
> > to stop signal busy, before the controller can give an IRQ to notify
> > that the R1 response has been received. Correct?
> >
> > In this context, I am wondering if sdhci_send_command(), really
> > conforms to these requirements. For example, depending on if the CMD6
> > has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
> > SDHCI_CMD_RESP_SHORT_BUSY.
> >
> > Does this work as expected for your case?
> Design team re-verified internally and bug where HW waits for busy state
> before IRQ is only for R1B and R1 is spec compliant.
>
> So, with R1, CMD complete is generated after response received.

Okay.

So, the issue we see for CMD6 with R1, is a software problem that we
should be able to fix.

>
> With R1B, CMD complete and xfer complete both are generated after
> response received + device busy (max timeout of 11s)
> DATA timeout interrupt will be asserted incase if HW busy detection fails.
>
> With R1B we may see DATA Timeout if operation takes more than max busy
> timeout of 11s.

Okay, I see.

>
> > 2.
> > Assuming my interpretation of the above is somewhat correct. Then you
> > always need to set a busy timeout for R1/R1B responses in the
> > controller. The maximum timeout seems to be 11s long. Obviously, this
> > isn't enough for all cases, such as cache flushing and erase, for
> > example. So, what can we do to support a longer timeouts than 11s?
> > Would it be possible to disable the HW timeout, if the requested
> > timeout is longer than 11s and use a SW timeout instead?
> >
> > Kind regards
> > Uffe
>
> For erase long operations we have register bit to enable for infinite
> busy wait mode where host controller would be monitoring until card is busy.

Alright, that sounds great!

>
> But so far for emmc devices we used on our platforms, we haven't seen
> cache flush taking more than 11s.

I understand that 11s is probably fine to use, for most cases.

However, it's not spec compliant, as for some operations there are
simply no timeout specified. BKOPS, cache flush, sanitize are cases
like this - and then 11s is definitely not sufficient.

>
> Will get back on possibility of disabling HW timeout and using SW timeout..

Thanks!

I would like to get the regression fixed asap, but I also would like
to avoid reverting patches, unless really necessary. May I propose the
following two options.

1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
an R1 response doesn't work - and then fix that behaviour.

2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
the core to always use R1B for CMD6 (and erase). This also means that
when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
disable the HW busy timeout and just wait "forever".

If you decide for 2, you can add the software timeout support on top,
but make that can be considered as a next step of an improvement,
rather than needed as fix. Note that, I believe there are some support
for software timeout already in the sdhci core, maybe you need to
tweak it a bit for your case, I don't know.

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-06 11:14                                             ` Ulf Hansson
@ 2020-03-09 14:07                                               ` Faiz Abbas
  2020-03-09 15:57                                                 ` Ulf Hansson
  2020-03-09 17:35                                               ` Sowjanya Komatineni
  1 sibling, 1 reply; 45+ messages in thread
From: Faiz Abbas @ 2020-03-09 14:07 UTC (permalink / raw)
  To: Ulf Hansson, Sowjanya Komatineni
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Thierry Reding,
	Anders Roxell, Kishon

Uffe,

On 06/03/20 4:44 pm, Ulf Hansson wrote:
> [...]
> 
>>>>>>>>>>>
>>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>>>>> spec to
>>>>>>>>>> send CMD13 to poll for busy.
>>>>>>>>>>
>>>>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>>>>
>>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>>>>> then what timeout would the driver be using if we would set
>>>>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>>>>
>>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
>>>> the data timeout counter used?
>>> Yes. It seems like it will pick the maximum, which is 11s?
>> yes
> 
> Okay, thanks!
> 
>>>
>>>> Because of above mentioned issue on our host where CMD interrupt happens
>>>> after busy state, poll for busy returns right away as not busy.
>>> I see.
>>>
>>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
>>>> But weird that with small delay of 1ms or debug print before CMD13 it
>>>> doesn't timeout and works all the time.
>>> I have digested the information you provided in these emails. Let me
>>> summarize it, to see if I have understood correctly.
>>>
>>> 1.
>>> Your controller can't distinguish between R1 and R1B because of a
>>> limitation in the HW. So, in both cases you need to wait for the card
>>> to stop signal busy, before the controller can give an IRQ to notify
>>> that the R1 response has been received. Correct?
>>>
>>> In this context, I am wondering if sdhci_send_command(), really
>>> conforms to these requirements. For example, depending on if the CMD6
>>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
>>> SDHCI_CMD_RESP_SHORT_BUSY.
>>>
>>> Does this work as expected for your case?
>> Design team re-verified internally and bug where HW waits for busy state
>> before IRQ is only for R1B and R1 is spec compliant.
>>
>> So, with R1, CMD complete is generated after response received.
> 
> Okay.
> 
> So, the issue we see for CMD6 with R1, is a software problem that we
> should be able to fix.
> 
>>
>> With R1B, CMD complete and xfer complete both are generated after
>> response received + device busy (max timeout of 11s)
>> DATA timeout interrupt will be asserted incase if HW busy detection fails.
>>
>> With R1B we may see DATA Timeout if operation takes more than max busy
>> timeout of 11s.
> 
> Okay, I see.
> 
>>
>>> 2.
>>> Assuming my interpretation of the above is somewhat correct. Then you
>>> always need to set a busy timeout for R1/R1B responses in the
>>> controller. The maximum timeout seems to be 11s long. Obviously, this
>>> isn't enough for all cases, such as cache flushing and erase, for
>>> example. So, what can we do to support a longer timeouts than 11s?
>>> Would it be possible to disable the HW timeout, if the requested
>>> timeout is longer than 11s and use a SW timeout instead?
>>>
>>> Kind regards
>>> Uffe
>>
>> For erase long operations we have register bit to enable for infinite
>> busy wait mode where host controller would be monitoring until card is busy.
> 
> Alright, that sounds great!
> 
>>
>> But so far for emmc devices we used on our platforms, we haven't seen
>> cache flush taking more than 11s.
> 
> I understand that 11s is probably fine to use, for most cases.
> 
> However, it's not spec compliant, as for some operations there are
> simply no timeout specified. BKOPS, cache flush, sanitize are cases
> like this - and then 11s is definitely not sufficient.
> 
>>
>> Will get back on possibility of disabling HW timeout and using SW timeout..
> 
> Thanks!
> 
> I would like to get the regression fixed asap, but I also would like
> to avoid reverting patches, unless really necessary. May I propose the
> following two options.
> 
> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> an R1 response doesn't work - and then fix that behaviour.
> 
> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> the core to always use R1B for CMD6 (and erase). This also means that
> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> disable the HW busy timeout and just wait "forever".
> 
> If you decide for 2, you can add the software timeout support on top,
> but make that can be considered as a next step of an improvement,
> rather than needed as fix. Note that, I believe there are some support
> for software timeout already in the sdhci core, maybe you need to
> tweak it a bit for your case, I don't know.
> 

I was able to reproduce the issue on sdhci-omap with your patch applied.

In our case, the dat0 line never comes back up when trying to send a
CMD6 with R1. It needs an R1b to work properly. I also set
mmc->max_busy_timeout to 0 and that made it work as well.

I wanna try it out with other R1b commands like erase to see if it works
properly if the delay is actually very long.

Thanks,
Faiz

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-09 14:07                                               ` Faiz Abbas
@ 2020-03-09 15:57                                                 ` Ulf Hansson
  0 siblings, 0 replies; 45+ messages in thread
From: Ulf Hansson @ 2020-03-09 15:57 UTC (permalink / raw)
  To: Faiz Abbas
  Cc: Sowjanya Komatineni, Jon Hunter, Bitan Biswas, Adrian Hunter,
	Naresh Kamboju, Jens Axboe, Alexei Starovoitov, linux-block,
	lkft-triage, open list, linux-mmc, Arnd Bergmann, John Stultz,
	Thierry Reding, Anders Roxell, Kishon

On Mon, 9 Mar 2020 at 15:06, Faiz Abbas <faiz_abbas@ti.com> wrote:
>
> Uffe,
>
> On 06/03/20 4:44 pm, Ulf Hansson wrote:
> > [...]
> >
> >>>>>>>>>>>
> >>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
> >>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
> >>>>>>>>>> spec to
> >>>>>>>>>> send CMD13 to poll for busy.
> >>>>>>>>>>
> >>>>>>>>>> Moreover, we need to cope with the scenario when the host has
> >>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
> >>>>>>>>>> the requested operation. Do you have another proposal for how to
> >>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
> >>>>>>>>>>
> >>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
> >>>>>>>>>> then what timeout would the driver be using if we would set
> >>>>>>>>>> cmd.busy_timeout to 30ms?
> >>>>>>>>>>
> >>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
> >>>> the data timeout counter used?
> >>> Yes. It seems like it will pick the maximum, which is 11s?
> >> yes
> >
> > Okay, thanks!
> >
> >>>
> >>>> Because of above mentioned issue on our host where CMD interrupt happens
> >>>> after busy state, poll for busy returns right away as not busy.
> >>> I see.
> >>>
> >>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
> >>>> But weird that with small delay of 1ms or debug print before CMD13 it
> >>>> doesn't timeout and works all the time.
> >>> I have digested the information you provided in these emails. Let me
> >>> summarize it, to see if I have understood correctly.
> >>>
> >>> 1.
> >>> Your controller can't distinguish between R1 and R1B because of a
> >>> limitation in the HW. So, in both cases you need to wait for the card
> >>> to stop signal busy, before the controller can give an IRQ to notify
> >>> that the R1 response has been received. Correct?
> >>>
> >>> In this context, I am wondering if sdhci_send_command(), really
> >>> conforms to these requirements. For example, depending on if the CMD6
> >>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
> >>> SDHCI_CMD_RESP_SHORT_BUSY.
> >>>
> >>> Does this work as expected for your case?
> >> Design team re-verified internally and bug where HW waits for busy state
> >> before IRQ is only for R1B and R1 is spec compliant.
> >>
> >> So, with R1, CMD complete is generated after response received.
> >
> > Okay.
> >
> > So, the issue we see for CMD6 with R1, is a software problem that we
> > should be able to fix.
> >
> >>
> >> With R1B, CMD complete and xfer complete both are generated after
> >> response received + device busy (max timeout of 11s)
> >> DATA timeout interrupt will be asserted incase if HW busy detection fails.
> >>
> >> With R1B we may see DATA Timeout if operation takes more than max busy
> >> timeout of 11s.
> >
> > Okay, I see.
> >
> >>
> >>> 2.
> >>> Assuming my interpretation of the above is somewhat correct. Then you
> >>> always need to set a busy timeout for R1/R1B responses in the
> >>> controller. The maximum timeout seems to be 11s long. Obviously, this
> >>> isn't enough for all cases, such as cache flushing and erase, for
> >>> example. So, what can we do to support a longer timeouts than 11s?
> >>> Would it be possible to disable the HW timeout, if the requested
> >>> timeout is longer than 11s and use a SW timeout instead?
> >>>
> >>> Kind regards
> >>> Uffe
> >>
> >> For erase long operations we have register bit to enable for infinite
> >> busy wait mode where host controller would be monitoring until card is busy.
> >
> > Alright, that sounds great!
> >
> >>
> >> But so far for emmc devices we used on our platforms, we haven't seen
> >> cache flush taking more than 11s.
> >
> > I understand that 11s is probably fine to use, for most cases.
> >
> > However, it's not spec compliant, as for some operations there are
> > simply no timeout specified. BKOPS, cache flush, sanitize are cases
> > like this - and then 11s is definitely not sufficient.
> >
> >>
> >> Will get back on possibility of disabling HW timeout and using SW timeout..
> >
> > Thanks!
> >
> > I would like to get the regression fixed asap, but I also would like
> > to avoid reverting patches, unless really necessary. May I propose the
> > following two options.
> >
> > 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> > an R1 response doesn't work - and then fix that behaviour.
> >
> > 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> > the core to always use R1B for CMD6 (and erase). This also means that
> > when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> > disable the HW busy timeout and just wait "forever".
> >
> > If you decide for 2, you can add the software timeout support on top,
> > but make that can be considered as a next step of an improvement,
> > rather than needed as fix. Note that, I believe there are some support
> > for software timeout already in the sdhci core, maybe you need to
> > tweak it a bit for your case, I don't know.
> >
>
> I was able to reproduce the issue on sdhci-omap with your patch applied.

Great, thanks for confirming!

>
> In our case, the dat0 line never comes back up when trying to send a
> CMD6 with R1. It needs an R1b to work properly. I also set

Do you know if that is because of how the driver manages R1B/R1 or
because of a limitation in the HW?

> mmc->max_busy_timeout to 0 and that made it work as well.

Okay, good.

>
> I wanna try it out with other R1b commands like erase to see if it works
> properly if the delay is actually very long.

Right, it probably works for most cases. In any case, this was broken
even before my three patches that screwed things up.

BTW, what HW busy time out will you be using for sdhci omap if
cmd->busy_timeout exceeds it? Is it 2581ms for all sdhci-omap
instances or does it differ depending on platform?

Anyway, in this phase, it sounds like we should move forward with a
patch setting max_busy_timeout to zero for sdhci omap, as that
restores things to the earlier behavior. Would you mind posting a
patch, or do you want me to manage it?

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-06 11:14                                             ` Ulf Hansson
  2020-03-09 14:07                                               ` Faiz Abbas
@ 2020-03-09 17:35                                               ` Sowjanya Komatineni
  2020-03-10  9:46                                                 ` Ulf Hansson
  1 sibling, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-09 17:35 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/6/20 3:14 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> [...]
>
>>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>>>>> spec to
>>>>>>>>>> send CMD13 to poll for busy.
>>>>>>>>>>
>>>>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>>>>
>>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>>>>> then what timeout would the driver be using if we would set
>>>>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>>>>
>>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
>>>> the data timeout counter used?
>>> Yes. It seems like it will pick the maximum, which is 11s?
>> yes
> Okay, thanks!
>
>>>> Because of above mentioned issue on our host where CMD interrupt happens
>>>> after busy state, poll for busy returns right away as not busy.
>>> I see.
>>>
>>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
>>>> But weird that with small delay of 1ms or debug print before CMD13 it
>>>> doesn't timeout and works all the time.
>>> I have digested the information you provided in these emails. Let me
>>> summarize it, to see if I have understood correctly.
>>>
>>> 1.
>>> Your controller can't distinguish between R1 and R1B because of a
>>> limitation in the HW. So, in both cases you need to wait for the card
>>> to stop signal busy, before the controller can give an IRQ to notify
>>> that the R1 response has been received. Correct?
>>>
>>> In this context, I am wondering if sdhci_send_command(), really
>>> conforms to these requirements. For example, depending on if the CMD6
>>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
>>> SDHCI_CMD_RESP_SHORT_BUSY.
>>>
>>> Does this work as expected for your case?
>> Design team re-verified internally and bug where HW waits for busy state
>> before IRQ is only for R1B and R1 is spec compliant.
>>
>> So, with R1, CMD complete is generated after response received.
> Okay.
>
> So, the issue we see for CMD6 with R1, is a software problem that we
> should be able to fix.
>
>> With R1B, CMD complete and xfer complete both are generated after
>> response received + device busy (max timeout of 11s)
>> DATA timeout interrupt will be asserted incase if HW busy detection fails.
>>
>> With R1B we may see DATA Timeout if operation takes more than max busy
>> timeout of 11s.
> Okay, I see.
>
>>> 2.
>>> Assuming my interpretation of the above is somewhat correct. Then you
>>> always need to set a busy timeout for R1/R1B responses in the
>>> controller. The maximum timeout seems to be 11s long. Obviously, this
>>> isn't enough for all cases, such as cache flushing and erase, for
>>> example. So, what can we do to support a longer timeouts than 11s?
>>> Would it be possible to disable the HW timeout, if the requested
>>> timeout is longer than 11s and use a SW timeout instead?
>>>
>>> Kind regards
>>> Uffe
>> For erase long operations we have register bit to enable for infinite
>> busy wait mode where host controller would be monitoring until card is busy.
> Alright, that sounds great!
>
>> But so far for emmc devices we used on our platforms, we haven't seen
>> cache flush taking more than 11s.
> I understand that 11s is probably fine to use, for most cases.
>
> However, it's not spec compliant, as for some operations there are
> simply no timeout specified. BKOPS, cache flush, sanitize are cases
> like this - and then 11s is definitely not sufficient.
>
>> Will get back on possibility of disabling HW timeout and using SW timeout..
> Thanks!
>
> I would like to get the regression fixed asap, but I also would like
> to avoid reverting patches, unless really necessary. May I propose the
> following two options.
>
> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> an R1 response doesn't work - and then fix that behaviour.
>
> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> the core to always use R1B for CMD6 (and erase). This also means that
> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> disable the HW busy timeout and just wait "forever".
>
> If you decide for 2, you can add the software timeout support on top,
> but make that can be considered as a next step of an improvement,
> rather than needed as fix. Note that, I believe there are some support
> for software timeout already in the sdhci core, maybe you need to
> tweak it a bit for your case, I don't know.
>
> Kind regards
> Uffe

Hi Uffe

Will go with 2nd option and will send patches out when ready.

BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for 
data timeout based on host clock when using finite mode (HW busy 
detection based on DATA TIMEOUT count value when cmd operation timeout 
is < 11s for tegra host).

So, looks like we cant set host max_busy_timeout to 0 for Tegra host to 
force R1B during SWITCH and SLEEP_AWAKE.

So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY 
which can be used for hosts supporting long or infinite HW busy wait 
detection and will update mmc and mmc_ops drivers to not allow convert 
R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.

Thanks

Sowjanya


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-09 17:35                                               ` Sowjanya Komatineni
@ 2020-03-10  9:46                                                 ` Ulf Hansson
  2020-03-10 16:59                                                   ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Ulf Hansson @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Sowjanya Komatineni
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon

On Mon, 9 Mar 2020 at 18:33, Sowjanya Komatineni <skomatineni@nvidia.com> wrote:
>
>
> On 3/6/20 3:14 AM, Ulf Hansson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > [...]
> >
> >>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
> >>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
> >>>>>>>>>> spec to
> >>>>>>>>>> send CMD13 to poll for busy.
> >>>>>>>>>>
> >>>>>>>>>> Moreover, we need to cope with the scenario when the host has
> >>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
> >>>>>>>>>> the requested operation. Do you have another proposal for how to
> >>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
> >>>>>>>>>>
> >>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
> >>>>>>>>>> then what timeout would the driver be using if we would set
> >>>>>>>>>> cmd.busy_timeout to 30ms?
> >>>>>>>>>>
> >>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
> >>>> the data timeout counter used?
> >>> Yes. It seems like it will pick the maximum, which is 11s?
> >> yes
> > Okay, thanks!
> >
> >>>> Because of above mentioned issue on our host where CMD interrupt happens
> >>>> after busy state, poll for busy returns right away as not busy.
> >>> I see.
> >>>
> >>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
> >>>> But weird that with small delay of 1ms or debug print before CMD13 it
> >>>> doesn't timeout and works all the time.
> >>> I have digested the information you provided in these emails. Let me
> >>> summarize it, to see if I have understood correctly.
> >>>
> >>> 1.
> >>> Your controller can't distinguish between R1 and R1B because of a
> >>> limitation in the HW. So, in both cases you need to wait for the card
> >>> to stop signal busy, before the controller can give an IRQ to notify
> >>> that the R1 response has been received. Correct?
> >>>
> >>> In this context, I am wondering if sdhci_send_command(), really
> >>> conforms to these requirements. For example, depending on if the CMD6
> >>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
> >>> SDHCI_CMD_RESP_SHORT_BUSY.
> >>>
> >>> Does this work as expected for your case?
> >> Design team re-verified internally and bug where HW waits for busy state
> >> before IRQ is only for R1B and R1 is spec compliant.
> >>
> >> So, with R1, CMD complete is generated after response received.
> > Okay.
> >
> > So, the issue we see for CMD6 with R1, is a software problem that we
> > should be able to fix.
> >
> >> With R1B, CMD complete and xfer complete both are generated after
> >> response received + device busy (max timeout of 11s)
> >> DATA timeout interrupt will be asserted incase if HW busy detection fails.
> >>
> >> With R1B we may see DATA Timeout if operation takes more than max busy
> >> timeout of 11s.
> > Okay, I see.
> >
> >>> 2.
> >>> Assuming my interpretation of the above is somewhat correct. Then you
> >>> always need to set a busy timeout for R1/R1B responses in the
> >>> controller. The maximum timeout seems to be 11s long. Obviously, this
> >>> isn't enough for all cases, such as cache flushing and erase, for
> >>> example. So, what can we do to support a longer timeouts than 11s?
> >>> Would it be possible to disable the HW timeout, if the requested
> >>> timeout is longer than 11s and use a SW timeout instead?
> >>>
> >>> Kind regards
> >>> Uffe
> >> For erase long operations we have register bit to enable for infinite
> >> busy wait mode where host controller would be monitoring until card is busy.
> > Alright, that sounds great!
> >
> >> But so far for emmc devices we used on our platforms, we haven't seen
> >> cache flush taking more than 11s.
> > I understand that 11s is probably fine to use, for most cases.
> >
> > However, it's not spec compliant, as for some operations there are
> > simply no timeout specified. BKOPS, cache flush, sanitize are cases
> > like this - and then 11s is definitely not sufficient.
> >
> >> Will get back on possibility of disabling HW timeout and using SW timeout..
> > Thanks!
> >
> > I would like to get the regression fixed asap, but I also would like
> > to avoid reverting patches, unless really necessary. May I propose the
> > following two options.
> >
> > 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> > an R1 response doesn't work - and then fix that behaviour.
> >
> > 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> > the core to always use R1B for CMD6 (and erase). This also means that
> > when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> > disable the HW busy timeout and just wait "forever".
> >
> > If you decide for 2, you can add the software timeout support on top,
> > but make that can be considered as a next step of an improvement,
> > rather than needed as fix. Note that, I believe there are some support
> > for software timeout already in the sdhci core, maybe you need to
> > tweak it a bit for your case, I don't know.
> >
> > Kind regards
> > Uffe
>
> Hi Uffe
>
> Will go with 2nd option and will send patches out when ready.

Okay, good.

>
> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
> data timeout based on host clock when using finite mode (HW busy
> detection based on DATA TIMEOUT count value when cmd operation timeout
> is < 11s for tegra host).
>
> So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
> force R1B during SWITCH and SLEEP_AWAKE.
>
> So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
> which can be used for hosts supporting long or infinite HW busy wait
> detection and will update mmc and mmc_ops drivers to not allow convert
> R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.

That seems reasonable, it becomes probably both easier and clearer by
adding a new host cap.

In any case, let me help out and cook a patch for this for the core
part (I leave the sdhci change to you). It may be a bit tricky,
especially since I have currently queued a bunch of new changes for
v5.7, that enables more users of mmc_poll_for_busy() in the core.
Maybe I need to temporarily drop them, so we can fix these problems
first. I will check.

Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
seems to be describing the common problem we have for sdhci
omap/tegra.

Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
sdhci- tegra, so while at it, perhaps you can cook a patch for that as
well.

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-10  9:46                                                 ` Ulf Hansson
@ 2020-03-10 16:59                                                   ` Sowjanya Komatineni
  2020-03-10 17:09                                                     ` Ulf Hansson
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-10 16:59 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/10/20 2:46 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, 9 Mar 2020 at 18:33, Sowjanya Komatineni <skomatineni@nvidia.com> wrote:
>>
>> On 3/6/20 3:14 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> [...]
>>>
>>>>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>>>>>>> spec to
>>>>>>>>>>>> send CMD13 to poll for busy.
>>>>>>>>>>>>
>>>>>>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>>>>>>
>>>>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>>>>>>> then what timeout would the driver be using if we would set
>>>>>>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>>>>>>
>>>>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
>>>>>> the data timeout counter used?
>>>>> Yes. It seems like it will pick the maximum, which is 11s?
>>>> yes
>>> Okay, thanks!
>>>
>>>>>> Because of above mentioned issue on our host where CMD interrupt happens
>>>>>> after busy state, poll for busy returns right away as not busy.
>>>>> I see.
>>>>>
>>>>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
>>>>>> But weird that with small delay of 1ms or debug print before CMD13 it
>>>>>> doesn't timeout and works all the time.
>>>>> I have digested the information you provided in these emails. Let me
>>>>> summarize it, to see if I have understood correctly.
>>>>>
>>>>> 1.
>>>>> Your controller can't distinguish between R1 and R1B because of a
>>>>> limitation in the HW. So, in both cases you need to wait for the card
>>>>> to stop signal busy, before the controller can give an IRQ to notify
>>>>> that the R1 response has been received. Correct?
>>>>>
>>>>> In this context, I am wondering if sdhci_send_command(), really
>>>>> conforms to these requirements. For example, depending on if the CMD6
>>>>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
>>>>> SDHCI_CMD_RESP_SHORT_BUSY.
>>>>>
>>>>> Does this work as expected for your case?
>>>> Design team re-verified internally and bug where HW waits for busy state
>>>> before IRQ is only for R1B and R1 is spec compliant.
>>>>
>>>> So, with R1, CMD complete is generated after response received.
>>> Okay.
>>>
>>> So, the issue we see for CMD6 with R1, is a software problem that we
>>> should be able to fix.
>>>
>>>> With R1B, CMD complete and xfer complete both are generated after
>>>> response received + device busy (max timeout of 11s)
>>>> DATA timeout interrupt will be asserted incase if HW busy detection fails.
>>>>
>>>> With R1B we may see DATA Timeout if operation takes more than max busy
>>>> timeout of 11s.
>>> Okay, I see.
>>>
>>>>> 2.
>>>>> Assuming my interpretation of the above is somewhat correct. Then you
>>>>> always need to set a busy timeout for R1/R1B responses in the
>>>>> controller. The maximum timeout seems to be 11s long. Obviously, this
>>>>> isn't enough for all cases, such as cache flushing and erase, for
>>>>> example. So, what can we do to support a longer timeouts than 11s?
>>>>> Would it be possible to disable the HW timeout, if the requested
>>>>> timeout is longer than 11s and use a SW timeout instead?
>>>>>
>>>>> Kind regards
>>>>> Uffe
>>>> For erase long operations we have register bit to enable for infinite
>>>> busy wait mode where host controller would be monitoring until card is busy.
>>> Alright, that sounds great!
>>>
>>>> But so far for emmc devices we used on our platforms, we haven't seen
>>>> cache flush taking more than 11s.
>>> I understand that 11s is probably fine to use, for most cases.
>>>
>>> However, it's not spec compliant, as for some operations there are
>>> simply no timeout specified. BKOPS, cache flush, sanitize are cases
>>> like this - and then 11s is definitely not sufficient.
>>>
>>>> Will get back on possibility of disabling HW timeout and using SW timeout..
>>> Thanks!
>>>
>>> I would like to get the regression fixed asap, but I also would like
>>> to avoid reverting patches, unless really necessary. May I propose the
>>> following two options.
>>>
>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
>>> an R1 response doesn't work - and then fix that behaviour.
>>>
>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
>>> the core to always use R1B for CMD6 (and erase). This also means that
>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
>>> disable the HW busy timeout and just wait "forever".
>>>
>>> If you decide for 2, you can add the software timeout support on top,
>>> but make that can be considered as a next step of an improvement,
>>> rather than needed as fix. Note that, I believe there are some support
>>> for software timeout already in the sdhci core, maybe you need to
>>> tweak it a bit for your case, I don't know.
>>>
>>> Kind regards
>>> Uffe
>> Hi Uffe
>>
>> Will go with 2nd option and will send patches out when ready.
> Okay, good.
>
>> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
>> data timeout based on host clock when using finite mode (HW busy
>> detection based on DATA TIMEOUT count value when cmd operation timeout
>> is < 11s for tegra host).
>>
>> So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
>> force R1B during SWITCH and SLEEP_AWAKE.
>>
>> So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
>> which can be used for hosts supporting long or infinite HW busy wait
>> detection and will update mmc and mmc_ops drivers to not allow convert
>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.
> That seems reasonable, it becomes probably both easier and clearer by
> adding a new host cap.
>
> In any case, let me help out and cook a patch for this for the core
> part (I leave the sdhci change to you). It may be a bit tricky,
> especially since I have currently queued a bunch of new changes for
> v5.7, that enables more users of mmc_poll_for_busy() in the core.
> Maybe I need to temporarily drop them, so we can fix these problems
> first. I will check.
>
> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
> seems to be describing the common problem we have for sdhci
> omap/tegra.
>
> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
> sdhci- tegra, so while at it, perhaps you can cook a patch for that as
> well.
>
> Kind regards
> Uffe

OK, I sent v1 yesterday. Please ignore them then.

Will send out patches only for HW busy wait modes program based on cmd 
timeout and WAIT_WHILE_BUSY enabled.

Thanks

Sowjanya


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-10 16:59                                                   ` Sowjanya Komatineni
@ 2020-03-10 17:09                                                     ` Ulf Hansson
  2020-03-10 17:27                                                       ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Ulf Hansson @ 2020-03-10 17:09 UTC (permalink / raw)
  To: Sowjanya Komatineni
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon

[...]

> >>>
> >>> I would like to get the regression fixed asap, but I also would like
> >>> to avoid reverting patches, unless really necessary. May I propose the
> >>> following two options.
> >>>
> >>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> >>> an R1 response doesn't work - and then fix that behaviour.
> >>>
> >>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> >>> the core to always use R1B for CMD6 (and erase). This also means that
> >>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> >>> disable the HW busy timeout and just wait "forever".
> >>>
> >>> If you decide for 2, you can add the software timeout support on top,
> >>> but make that can be considered as a next step of an improvement,
> >>> rather than needed as fix. Note that, I believe there are some support
> >>> for software timeout already in the sdhci core, maybe you need to
> >>> tweak it a bit for your case, I don't know.
> >>>
> >>> Kind regards
> >>> Uffe
> >> Hi Uffe
> >>
> >> Will go with 2nd option and will send patches out when ready.
> > Okay, good.
> >
> >> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
> >> data timeout based on host clock when using finite mode (HW busy
> >> detection based on DATA TIMEOUT count value when cmd operation timeout
> >> is < 11s for tegra host).
> >>
> >> So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
> >> force R1B during SWITCH and SLEEP_AWAKE.
> >>
> >> So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
> >> which can be used for hosts supporting long or infinite HW busy wait
> >> detection and will update mmc and mmc_ops drivers to not allow convert
> >> R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.
> > That seems reasonable, it becomes probably both easier and clearer by
> > adding a new host cap.
> >
> > In any case, let me help out and cook a patch for this for the core
> > part (I leave the sdhci change to you). It may be a bit tricky,
> > especially since I have currently queued a bunch of new changes for
> > v5.7, that enables more users of mmc_poll_for_busy() in the core.
> > Maybe I need to temporarily drop them, so we can fix these problems
> > first. I will check.
> >
> > Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
> > seems to be describing the common problem we have for sdhci
> > omap/tegra.
> >
> > Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
> > sdhci- tegra, so while at it, perhaps you can cook a patch for that as
> > well.
> >
> > Kind regards
> > Uffe
>
> OK, I sent v1 yesterday. Please ignore them then.

Oh, I haven't seen them. In any case, I am ignoring them.

>
> Will send out patches only for HW busy wait modes program based on cmd
> timeout and WAIT_WHILE_BUSY enabled.

Great, thanks!

Please help test the series I just posted as well, if you have the
time ofcourse.

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-10 17:09                                                     ` Ulf Hansson
@ 2020-03-10 17:27                                                       ` Sowjanya Komatineni
  2020-03-10 21:59                                                         ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-10 17:27 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/10/20 10:09 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> [...]
>
>>>>> I would like to get the regression fixed asap, but I also would like
>>>>> to avoid reverting patches, unless really necessary. May I propose the
>>>>> following two options.
>>>>>
>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
>>>>> an R1 response doesn't work - and then fix that behaviour.
>>>>>
>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
>>>>> the core to always use R1B for CMD6 (and erase). This also means that
>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
>>>>> disable the HW busy timeout and just wait "forever".
>>>>>
>>>>> If you decide for 2, you can add the software timeout support on top,
>>>>> but make that can be considered as a next step of an improvement,
>>>>> rather than needed as fix. Note that, I believe there are some support
>>>>> for software timeout already in the sdhci core, maybe you need to
>>>>> tweak it a bit for your case, I don't know.
>>>>>
>>>>> Kind regards
>>>>> Uffe
>>>> Hi Uffe
>>>>
>>>> Will go with 2nd option and will send patches out when ready.
>>> Okay, good.
>>>
>>>> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
>>>> data timeout based on host clock when using finite mode (HW busy
>>>> detection based on DATA TIMEOUT count value when cmd operation timeout
>>>> is < 11s for tegra host).
>>>>
>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
>>>> force R1B during SWITCH and SLEEP_AWAKE.
>>>>
>>>> So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
>>>> which can be used for hosts supporting long or infinite HW busy wait
>>>> detection and will update mmc and mmc_ops drivers to not allow convert
>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.
>>> That seems reasonable, it becomes probably both easier and clearer by
>>> adding a new host cap.
>>>
>>> In any case, let me help out and cook a patch for this for the core
>>> part (I leave the sdhci change to you). It may be a bit tricky,
>>> especially since I have currently queued a bunch of new changes for
>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
>>> Maybe I need to temporarily drop them, so we can fix these problems
>>> first. I will check.
>>>
>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
>>> seems to be describing the common problem we have for sdhci
>>> omap/tegra.
>>>
>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
>>> sdhci- tegra, so while at it, perhaps you can cook a patch for that as
>>> well.
>>>
>>> Kind regards
>>> Uffe
>> OK, I sent v1 yesterday. Please ignore them then.
> Oh, I haven't seen them. In any case, I am ignoring them.
>
>> Will send out patches only for HW busy wait modes program based on cmd
>> timeout and WAIT_WHILE_BUSY enabled.
> Great, thanks!
>
> Please help test the series I just posted as well, if you have the
> time ofcourse.
>
> Kind regards
> Uffe

Sure,

Thanks

Sowjanya


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-10 17:27                                                       ` Sowjanya Komatineni
@ 2020-03-10 21:59                                                         ` Sowjanya Komatineni
  2020-03-10 23:10                                                           ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-10 21:59 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
>
> On 3/10/20 10:09 AM, Ulf Hansson wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> [...]
>>
>>>>>> I would like to get the regression fixed asap, but I also would like
>>>>>> to avoid reverting patches, unless really necessary. May I 
>>>>>> propose the
>>>>>> following two options.
>>>>>>
>>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
>>>>>> an R1 response doesn't work - and then fix that behaviour.
>>>>>>
>>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which 
>>>>>> makes
>>>>>> the core to always use R1B for CMD6 (and erase). This also means 
>>>>>> that
>>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
>>>>>> disable the HW busy timeout and just wait "forever".
>>>>>>
>>>>>> If you decide for 2, you can add the software timeout support on 
>>>>>> top,
>>>>>> but make that can be considered as a next step of an improvement,
>>>>>> rather than needed as fix. Note that, I believe there are some 
>>>>>> support
>>>>>> for software timeout already in the sdhci core, maybe you need to
>>>>>> tweak it a bit for your case, I don't know.
>>>>>>
>>>>>> Kind regards
>>>>>> Uffe
>>>>> Hi Uffe
>>>>>
>>>>> Will go with 2nd option and will send patches out when ready.
>>>> Okay, good.
>>>>
>>>>> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
>>>>> data timeout based on host clock when using finite mode (HW busy
>>>>> detection based on DATA TIMEOUT count value when cmd operation 
>>>>> timeout
>>>>> is < 11s for tegra host).
>>>>>
>>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra 
>>>>> host to
>>>>> force R1B during SWITCH and SLEEP_AWAKE.
>>>>>
>>>>> So, was thinking to introduce host capability 
>>>>> MMC_CAP2_LONG_WAIT_HW_BUSY
>>>>> which can be used for hosts supporting long or infinite HW busy wait
>>>>> detection and will update mmc and mmc_ops drivers to not allow 
>>>>> convert
>>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and 
>>>>> SWITCH.
>>>> That seems reasonable, it becomes probably both easier and clearer by
>>>> adding a new host cap.
>>>>
>>>> In any case, let me help out and cook a patch for this for the core
>>>> part (I leave the sdhci change to you). It may be a bit tricky,
>>>> especially since I have currently queued a bunch of new changes for
>>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
>>>> Maybe I need to temporarily drop them, so we can fix these problems
>>>> first. I will check.
>>>>
>>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
>>>> seems to be describing the common problem we have for sdhci
>>>> omap/tegra.
>>>>
>>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
>>>> sdhci- tegra, so while at it, perhaps you can cook a patch for that as
>>>> well.
>>>>
>>>> Kind regards
>>>> Uffe
>>> OK, I sent v1 yesterday. Please ignore them then.
>> Oh, I haven't seen them. In any case, I am ignoring them.
>>
>>> Will send out patches only for HW busy wait modes program based on cmd
>>> timeout and WAIT_WHILE_BUSY enabled.
>> Great, thanks!
>>
>> Please help test the series I just posted as well, if you have the
>> time ofcourse.
>>
>> Kind regards
>> Uffe
>
> Sure,
>
> Thanks
>
> Sowjanya


mmc_sleep() also needs update to force R1B when host sets capability 
MMC_CAP_NEED_RSP_BUSY



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-10 21:59                                                         ` Sowjanya Komatineni
@ 2020-03-10 23:10                                                           ` Sowjanya Komatineni
  2020-03-11  0:22                                                             ` Sowjanya Komatineni
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-10 23:10 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/10/20 2:59 PM, Sowjanya Komatineni wrote:
>
> On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
>>
>> On 3/10/20 10:09 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> [...]
>>>
>>>>>>> I would like to get the regression fixed asap, but I also would 
>>>>>>> like
>>>>>>> to avoid reverting patches, unless really necessary. May I 
>>>>>>> propose the
>>>>>>> following two options.
>>>>>>>
>>>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 
>>>>>>> with
>>>>>>> an R1 response doesn't work - and then fix that behaviour.
>>>>>>>
>>>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which 
>>>>>>> makes
>>>>>>> the core to always use R1B for CMD6 (and erase). This also means 
>>>>>>> that
>>>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra 
>>>>>>> must
>>>>>>> disable the HW busy timeout and just wait "forever".
>>>>>>>
>>>>>>> If you decide for 2, you can add the software timeout support on 
>>>>>>> top,
>>>>>>> but make that can be considered as a next step of an improvement,
>>>>>>> rather than needed as fix. Note that, I believe there are some 
>>>>>>> support
>>>>>>> for software timeout already in the sdhci core, maybe you need to
>>>>>>> tweak it a bit for your case, I don't know.
>>>>>>>
>>>>>>> Kind regards
>>>>>>> Uffe
>>>>>> Hi Uffe
>>>>>>
>>>>>> Will go with 2nd option and will send patches out when ready.
>>>>> Okay, good.
>>>>>
>>>>>> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK 
>>>>>> for
>>>>>> data timeout based on host clock when using finite mode (HW busy
>>>>>> detection based on DATA TIMEOUT count value when cmd operation 
>>>>>> timeout
>>>>>> is < 11s for tegra host).
>>>>>>
>>>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra 
>>>>>> host to
>>>>>> force R1B during SWITCH and SLEEP_AWAKE.
>>>>>>
>>>>>> So, was thinking to introduce host capability 
>>>>>> MMC_CAP2_LONG_WAIT_HW_BUSY
>>>>>> which can be used for hosts supporting long or infinite HW busy wait
>>>>>> detection and will update mmc and mmc_ops drivers to not allow 
>>>>>> convert
>>>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and 
>>>>>> SWITCH.
>>>>> That seems reasonable, it becomes probably both easier and clearer by
>>>>> adding a new host cap.
>>>>>
>>>>> In any case, let me help out and cook a patch for this for the core
>>>>> part (I leave the sdhci change to you). It may be a bit tricky,
>>>>> especially since I have currently queued a bunch of new changes for
>>>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
>>>>> Maybe I need to temporarily drop them, so we can fix these problems
>>>>> first. I will check.
>>>>>
>>>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
>>>>> seems to be describing the common problem we have for sdhci
>>>>> omap/tegra.
>>>>>
>>>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
>>>>> sdhci- tegra, so while at it, perhaps you can cook a patch for 
>>>>> that as
>>>>> well.
>>>>>
>>>>> Kind regards
>>>>> Uffe
>>>> OK, I sent v1 yesterday. Please ignore them then.
>>> Oh, I haven't seen them. In any case, I am ignoring them.
>>>
>>>> Will send out patches only for HW busy wait modes program based on cmd
>>>> timeout and WAIT_WHILE_BUSY enabled.
>>> Great, thanks!
>>>
>>> Please help test the series I just posted as well, if you have the
>>> time ofcourse.
>>>
>>> Kind regards
>>> Uffe
>>
>> Sure,
>>
>> Thanks
>>
>> Sowjanya
>
>
> mmc_sleep() also needs update to force R1B when host sets capability 
> MMC_CAP_NEED_RSP_BUSY
>
>
Tested patches and they work good.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-10 23:10                                                           ` Sowjanya Komatineni
@ 2020-03-11  0:22                                                             ` Sowjanya Komatineni
  2020-03-11  8:34                                                               ` Ulf Hansson
  0 siblings, 1 reply; 45+ messages in thread
From: Sowjanya Komatineni @ 2020-03-11  0:22 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon


On 3/10/20 4:10 PM, Sowjanya Komatineni wrote:
>
> On 3/10/20 2:59 PM, Sowjanya Komatineni wrote:
>>
>> On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/10/20 10:09 AM, Ulf Hansson wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> [...]
>>>>
>>>>>>>> I would like to get the regression fixed asap, but I also would 
>>>>>>>> like
>>>>>>>> to avoid reverting patches, unless really necessary. May I 
>>>>>>>> propose the
>>>>>>>> following two options.
>>>>>>>>
>>>>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 
>>>>>>>> with
>>>>>>>> an R1 response doesn't work - and then fix that behaviour.
>>>>>>>>
>>>>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which 
>>>>>>>> makes
>>>>>>>> the core to always use R1B for CMD6 (and erase). This also 
>>>>>>>> means that
>>>>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra 
>>>>>>>> must
>>>>>>>> disable the HW busy timeout and just wait "forever".
>>>>>>>>
>>>>>>>> If you decide for 2, you can add the software timeout support 
>>>>>>>> on top,
>>>>>>>> but make that can be considered as a next step of an improvement,
>>>>>>>> rather than needed as fix. Note that, I believe there are some 
>>>>>>>> support
>>>>>>>> for software timeout already in the sdhci core, maybe you need to
>>>>>>>> tweak it a bit for your case, I don't know.
>>>>>>>>
>>>>>>>> Kind regards
>>>>>>>> Uffe
>>>>>>> Hi Uffe
>>>>>>>
>>>>>>> Will go with 2nd option and will send patches out when ready.
>>>>>> Okay, good.
>>>>>>
>>>>>>> BTW, Tegra host also supports 
>>>>>>> SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
>>>>>>> data timeout based on host clock when using finite mode (HW busy
>>>>>>> detection based on DATA TIMEOUT count value when cmd operation 
>>>>>>> timeout
>>>>>>> is < 11s for tegra host).
>>>>>>>
>>>>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra 
>>>>>>> host to
>>>>>>> force R1B during SWITCH and SLEEP_AWAKE.
>>>>>>>
>>>>>>> So, was thinking to introduce host capability 
>>>>>>> MMC_CAP2_LONG_WAIT_HW_BUSY
>>>>>>> which can be used for hosts supporting long or infinite HW busy 
>>>>>>> wait
>>>>>>> detection and will update mmc and mmc_ops drivers to not allow 
>>>>>>> convert
>>>>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and 
>>>>>>> SWITCH.
>>>>>> That seems reasonable, it becomes probably both easier and 
>>>>>> clearer by
>>>>>> adding a new host cap.
>>>>>>
>>>>>> In any case, let me help out and cook a patch for this for the core
>>>>>> part (I leave the sdhci change to you). It may be a bit tricky,
>>>>>> especially since I have currently queued a bunch of new changes for
>>>>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
>>>>>> Maybe I need to temporarily drop them, so we can fix these problems
>>>>>> first. I will check.
>>>>>>
>>>>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as 
>>>>>> that
>>>>>> seems to be describing the common problem we have for sdhci
>>>>>> omap/tegra.
>>>>>>
>>>>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
>>>>>> sdhci- tegra, so while at it, perhaps you can cook a patch for 
>>>>>> that as
>>>>>> well.
>>>>>>
>>>>>> Kind regards
>>>>>> Uffe
>>>>> OK, I sent v1 yesterday. Please ignore them then.
>>>> Oh, I haven't seen them. In any case, I am ignoring them.
>>>>
>>>>> Will send out patches only for HW busy wait modes program based on 
>>>>> cmd
>>>>> timeout and WAIT_WHILE_BUSY enabled.
>>>> Great, thanks!
>>>>
>>>> Please help test the series I just posted as well, if you have the
>>>> time ofcourse.
>>>>
>>>> Kind regards
>>>> Uffe
>>>
>>> Sure,
>>>
>>> Thanks
>>>
>>> Sowjanya
>>
>>
>> mmc_sleep() also needs update to force R1B when host sets capability 
>> MMC_CAP_NEED_RSP_BUSY
>>
>>
> Tested patches and they work good.
>
Sent sdhci-tegra v2 patches

- includes busy wait mode programming based on cmd busy_timeout

- enables MMC_CAP_WAIT_WHILE_BUSY

Thanks

Sowjanya


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-11  0:22                                                             ` Sowjanya Komatineni
@ 2020-03-11  8:34                                                               ` Ulf Hansson
  2020-03-19 19:12                                                                 ` Naresh Kamboju
  0 siblings, 1 reply; 45+ messages in thread
From: Ulf Hansson @ 2020-03-11  8:34 UTC (permalink / raw)
  To: Sowjanya Komatineni
  Cc: Jon Hunter, Bitan Biswas, Adrian Hunter, Naresh Kamboju,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon

On Wed, 11 Mar 2020 at 01:19, Sowjanya Komatineni
<skomatineni@nvidia.com> wrote:
>
>
> On 3/10/20 4:10 PM, Sowjanya Komatineni wrote:
> >
> > On 3/10/20 2:59 PM, Sowjanya Komatineni wrote:
> >>
> >> On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
> >>>
> >>> On 3/10/20 10:09 AM, Ulf Hansson wrote:
> >>>> External email: Use caution opening links or attachments
> >>>>
> >>>>
> >>>> [...]
> >>>>
> >>>>>>>> I would like to get the regression fixed asap, but I also would
> >>>>>>>> like
> >>>>>>>> to avoid reverting patches, unless really necessary. May I
> >>>>>>>> propose the
> >>>>>>>> following two options.
> >>>>>>>>
> >>>>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6
> >>>>>>>> with
> >>>>>>>> an R1 response doesn't work - and then fix that behaviour.
> >>>>>>>>
> >>>>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which
> >>>>>>>> makes
> >>>>>>>> the core to always use R1B for CMD6 (and erase). This also
> >>>>>>>> means that
> >>>>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra
> >>>>>>>> must
> >>>>>>>> disable the HW busy timeout and just wait "forever".
> >>>>>>>>
> >>>>>>>> If you decide for 2, you can add the software timeout support
> >>>>>>>> on top,
> >>>>>>>> but make that can be considered as a next step of an improvement,
> >>>>>>>> rather than needed as fix. Note that, I believe there are some
> >>>>>>>> support
> >>>>>>>> for software timeout already in the sdhci core, maybe you need to
> >>>>>>>> tweak it a bit for your case, I don't know.
> >>>>>>>>
> >>>>>>>> Kind regards
> >>>>>>>> Uffe
> >>>>>>> Hi Uffe
> >>>>>>>
> >>>>>>> Will go with 2nd option and will send patches out when ready.
> >>>>>> Okay, good.
> >>>>>>
> >>>>>>> BTW, Tegra host also supports
> >>>>>>> SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
> >>>>>>> data timeout based on host clock when using finite mode (HW busy
> >>>>>>> detection based on DATA TIMEOUT count value when cmd operation
> >>>>>>> timeout
> >>>>>>> is < 11s for tegra host).
> >>>>>>>
> >>>>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra
> >>>>>>> host to
> >>>>>>> force R1B during SWITCH and SLEEP_AWAKE.
> >>>>>>>
> >>>>>>> So, was thinking to introduce host capability
> >>>>>>> MMC_CAP2_LONG_WAIT_HW_BUSY
> >>>>>>> which can be used for hosts supporting long or infinite HW busy
> >>>>>>> wait
> >>>>>>> detection and will update mmc and mmc_ops drivers to not allow
> >>>>>>> convert
> >>>>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and
> >>>>>>> SWITCH.
> >>>>>> That seems reasonable, it becomes probably both easier and
> >>>>>> clearer by
> >>>>>> adding a new host cap.
> >>>>>>
> >>>>>> In any case, let me help out and cook a patch for this for the core
> >>>>>> part (I leave the sdhci change to you). It may be a bit tricky,
> >>>>>> especially since I have currently queued a bunch of new changes for
> >>>>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
> >>>>>> Maybe I need to temporarily drop them, so we can fix these problems
> >>>>>> first. I will check.
> >>>>>>
> >>>>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as
> >>>>>> that
> >>>>>> seems to be describing the common problem we have for sdhci
> >>>>>> omap/tegra.
> >>>>>>
> >>>>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
> >>>>>> sdhci- tegra, so while at it, perhaps you can cook a patch for
> >>>>>> that as
> >>>>>> well.
> >>>>>>
> >>>>>> Kind regards
> >>>>>> Uffe
> >>>>> OK, I sent v1 yesterday. Please ignore them then.
> >>>> Oh, I haven't seen them. In any case, I am ignoring them.
> >>>>
> >>>>> Will send out patches only for HW busy wait modes program based on
> >>>>> cmd
> >>>>> timeout and WAIT_WHILE_BUSY enabled.
> >>>> Great, thanks!
> >>>>
> >>>> Please help test the series I just posted as well, if you have the
> >>>> time ofcourse.
> >>>>
> >>>> Kind regards
> >>>> Uffe
> >>>
> >>> Sure,
> >>>
> >>> Thanks
> >>>
> >>> Sowjanya
> >>
> >>
> >> mmc_sleep() also needs update to force R1B when host sets capability
> >> MMC_CAP_NEED_RSP_BUSY

Yes, I am on it! Thanks!

> >>
> >>
> > Tested patches and they work good.

Great, I am adding your tested-by tag then.

> >
> Sent sdhci-tegra v2 patches
>
> - includes busy wait mode programming based on cmd busy_timeout
>
> - enables MMC_CAP_WAIT_WHILE_BUSY

Sounds great, but I don't see the patches on the mailing list, nor did
they reach the mmc patchtracker.

Seems like you probably need to check your email settings when sending patches.

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-11  8:34                                                               ` Ulf Hansson
@ 2020-03-19 19:12                                                                 ` Naresh Kamboju
  2020-03-20  9:20                                                                   ` Ulf Hansson
  0 siblings, 1 reply; 45+ messages in thread
From: Naresh Kamboju @ 2020-03-19 19:12 UTC (permalink / raw)
  To: Ulf Hansson, Greg Kroah-Hartman, Sasha Levin
  Cc: Sowjanya Komatineni, Jon Hunter, Bitan Biswas, Adrian Hunter,
	Jens Axboe, Alexei Starovoitov, linux-block, lkft-triage,
	open list, linux-mmc, Arnd Bergmann, John Stultz, Faiz Abbas,
	Thierry Reding, Anders Roxell, Kishon

FYI,

The arm device x15 running stable rc 4.19.112-rc1, 5.4.27-rc1 and 5.5.11-rc2
kernel pops up the following messages on console log,
Is this a problem ?

[ 0.000000] Linux version 4.19.112-rc1 (oe-user@oe-host) (gcc version
7.3.0 (GCC)) #1 SMP Thu Mar 19 12:55:45 UTC 2020

[   15.737765] mmc1: unspecified timeout for CMD6 - use generic
[   16.754248] mmc1: unspecified timeout for CMD6 - use generic
[   16.842071] mmc1: unspecified timeout for CMD6 - use generic
...
[   20.580541] mmc1: unspecified timeout for CMD6 - use generic
[   20.588216] mmc1: unspecified timeout for CMD6 - use generic
[   20.604011] mmc1: unspecified timeout for CMD6 - use generic

ref:
https://lkft.validation.linaro.org/scheduler/job/1298207#L4037
https://lkft.validation.linaro.org/scheduler/job/1298945#L4132
https://lkft.validation.linaro.org/scheduler/job/1299973#L4232

- Naresh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-19 19:12                                                                 ` Naresh Kamboju
@ 2020-03-20  9:20                                                                   ` Ulf Hansson
  2020-03-20  9:49                                                                     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 45+ messages in thread
From: Ulf Hansson @ 2020-03-20  9:20 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Greg Kroah-Hartman, Sasha Levin, Sowjanya Komatineni, Jon Hunter,
	Bitan Biswas, Adrian Hunter, Jens Axboe, Alexei Starovoitov,
	linux-block, lkft-triage, open list, linux-mmc, Arnd Bergmann,
	John Stultz, Faiz Abbas, Thierry Reding, Anders Roxell, Kishon

On Thu, 19 Mar 2020 at 20:12, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> FYI,
>
> The arm device x15 running stable rc 4.19.112-rc1, 5.4.27-rc1 and 5.5.11-rc2
> kernel pops up the following messages on console log,
> Is this a problem ?
>
> [ 0.000000] Linux version 4.19.112-rc1 (oe-user@oe-host) (gcc version
> 7.3.0 (GCC)) #1 SMP Thu Mar 19 12:55:45 UTC 2020
>
> [   15.737765] mmc1: unspecified timeout for CMD6 - use generic
> [   16.754248] mmc1: unspecified timeout for CMD6 - use generic
> [   16.842071] mmc1: unspecified timeout for CMD6 - use generic
> ...
> [   20.580541] mmc1: unspecified timeout for CMD6 - use generic
> [   20.588216] mmc1: unspecified timeout for CMD6 - use generic
> [   20.604011] mmc1: unspecified timeout for CMD6 - use generic
>
> ref:
> https://lkft.validation.linaro.org/scheduler/job/1298207#L4037
> https://lkft.validation.linaro.org/scheduler/job/1298945#L4132
> https://lkft.validation.linaro.org/scheduler/job/1299973#L4232

The commit below is the problem, it shouldn't be applied for stable.

commit 533a6cfe08f96a7b5c65e06d20916d552c11b256
Author: Ulf Hansson <ulf.hansson@linaro.org>
Date:   Wed Jan 22 15:27:47 2020 +0100
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()

Let me sort it out with Greg/Sasha. I will keep you in the loop,
thanks for reporting!

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: LKFT: arm x15: mmc1: cache flush error -110
  2020-03-20  9:20                                                                   ` Ulf Hansson
@ 2020-03-20  9:49                                                                     ` Greg Kroah-Hartman
  0 siblings, 0 replies; 45+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-20  9:49 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Naresh Kamboju, Sasha Levin, Sowjanya Komatineni, Jon Hunter,
	Bitan Biswas, Adrian Hunter, Jens Axboe, Alexei Starovoitov,
	linux-block, lkft-triage, open list, linux-mmc, Arnd Bergmann,
	John Stultz, Faiz Abbas, Thierry Reding, Anders Roxell, Kishon

On Fri, Mar 20, 2020 at 10:20:04AM +0100, Ulf Hansson wrote:
> On Thu, 19 Mar 2020 at 20:12, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >
> > FYI,
> >
> > The arm device x15 running stable rc 4.19.112-rc1, 5.4.27-rc1 and 5.5.11-rc2
> > kernel pops up the following messages on console log,
> > Is this a problem ?
> >
> > [ 0.000000] Linux version 4.19.112-rc1 (oe-user@oe-host) (gcc version
> > 7.3.0 (GCC)) #1 SMP Thu Mar 19 12:55:45 UTC 2020
> >
> > [   15.737765] mmc1: unspecified timeout for CMD6 - use generic
> > [   16.754248] mmc1: unspecified timeout for CMD6 - use generic
> > [   16.842071] mmc1: unspecified timeout for CMD6 - use generic
> > ...
> > [   20.580541] mmc1: unspecified timeout for CMD6 - use generic
> > [   20.588216] mmc1: unspecified timeout for CMD6 - use generic
> > [   20.604011] mmc1: unspecified timeout for CMD6 - use generic
> >
> > ref:
> > https://lkft.validation.linaro.org/scheduler/job/1298207#L4037
> > https://lkft.validation.linaro.org/scheduler/job/1298945#L4132
> > https://lkft.validation.linaro.org/scheduler/job/1299973#L4232
> 
> The commit below is the problem, it shouldn't be applied for stable.
> 
> commit 533a6cfe08f96a7b5c65e06d20916d552c11b256
> Author: Ulf Hansson <ulf.hansson@linaro.org>
> Date:   Wed Jan 22 15:27:47 2020 +0100
> mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
> 
> Let me sort it out with Greg/Sasha. I will keep you in the loop,
> thanks for reporting!

Now dropped from the 4.19, 5.4, and 5.5 stable queues, sorry for the
noise.

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2020-03-20  9:50 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-13 15:42 LKFT: arm x15: mmc1: cache flush error -110 Naresh Kamboju
2020-02-14  9:09 ` Arnd Bergmann
2020-02-14 12:09   ` Mark Brown
2020-02-19 16:23 ` Ulf Hansson
2020-02-20 17:54   ` Naresh Kamboju
2020-02-21  9:48     ` Ulf Hansson
2020-02-21 19:44       ` Bitan Biswas
2020-02-24 11:16         ` Ulf Hansson
2020-02-24 12:59           ` Adrian Hunter
2020-02-25 10:04           ` Jon Hunter
2020-02-25 11:35             ` Ulf Hansson
2020-02-25 11:41             ` Jon Hunter
2020-02-25 14:26               ` Ulf Hansson
2020-02-25 16:24                 ` Jon Hunter
2020-02-26 15:21                   ` Ulf Hansson
2020-02-26 17:04                     ` Jon Hunter
2020-03-02 13:12                     ` Faiz Abbas
2020-03-02 16:50                       ` Ulf Hansson
2020-03-03 21:35                         ` Ulf Hansson
     [not found]                         ` <5e9b5646-bd48-e55b-54ee-1c2c41fc9218@nvidia.com>
2020-03-04 10:18                           ` Ulf Hansson
2020-03-04 10:32                             ` Ulf Hansson
2020-03-04 16:56                             ` Sowjanya Komatineni
2020-03-04 17:21                               ` Sowjanya Komatineni
2020-03-04 17:26                                 ` Sowjanya Komatineni
2020-03-04 17:51                                   ` Sowjanya Komatineni
2020-03-04 22:35                                     ` Sowjanya Komatineni
2020-03-05  0:20                                       ` Sowjanya Komatineni
2020-03-05  3:06                                         ` Sowjanya Komatineni
2020-03-05 13:05                                         ` Ulf Hansson
2020-03-06  2:44                                           ` Sowjanya Komatineni
2020-03-06 11:14                                             ` Ulf Hansson
2020-03-09 14:07                                               ` Faiz Abbas
2020-03-09 15:57                                                 ` Ulf Hansson
2020-03-09 17:35                                               ` Sowjanya Komatineni
2020-03-10  9:46                                                 ` Ulf Hansson
2020-03-10 16:59                                                   ` Sowjanya Komatineni
2020-03-10 17:09                                                     ` Ulf Hansson
2020-03-10 17:27                                                       ` Sowjanya Komatineni
2020-03-10 21:59                                                         ` Sowjanya Komatineni
2020-03-10 23:10                                                           ` Sowjanya Komatineni
2020-03-11  0:22                                                             ` Sowjanya Komatineni
2020-03-11  8:34                                                               ` Ulf Hansson
2020-03-19 19:12                                                                 ` Naresh Kamboju
2020-03-20  9:20                                                                   ` Ulf Hansson
2020-03-20  9:49                                                                     ` Greg Kroah-Hartman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.