From: Ulf Hansson <ulf.hansson@linaro.org>
To: Jon Hunter <jonathanh@nvidia.com>, Faiz Abbas <faiz_abbas@ti.com>
Cc: Bitan Biswas <bbiswas@nvidia.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Naresh Kamboju <naresh.kamboju@linaro.org>,
	Jens Axboe <axboe@kernel.dk>, Alexei Starovoitov <ast@kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	lkft-triage@lists.linaro.org,
	open list <linux-kernel@vger.kernel.org>,
	"linux-mmc@vger.kernel.org" <linux-mmc@vger.kernel.org>,
	Arnd Bergmann <arnd@arndb.de>,
	John Stultz <john.stultz@linaro.org>,
	Thierry Reding <treding@nvidia.com>
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110
Date: Tue, 25 Feb 2020 15:26:16 +0100
Message-ID: <CAPDyKFokE6x0mn+v5B9=so-SyrdTn0JBU8Mrp3Zdu6kSaCie2g@mail.gmail.com>
In-Reply-To: <f960aa98-5508-36fd-166d-7f41c7d85154@nvidia.com>

+ Faiz Abbas

On Tue, 25 Feb 2020 at 12:41, Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 25/02/2020 10:04, Jon Hunter wrote:
>
> ...
>
> >>>   I find that, from that commit, the changes in mmc_flush_cache() below
> >>> are the cause.
> >>>
> >>> ##
> >>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
> >>>                          (card->ext_csd.cache_size > 0) &&
> >>>                          (card->ext_csd.cache_ctrl & 1)) {
> >>>                  err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> >>> -                               EXT_CSD_FLUSH_CACHE, 1, 0);
> >>> +                                EXT_CSD_FLUSH_CACHE, 1,
> >>> +                                MMC_CACHE_FLUSH_TIMEOUT_MS);
> >
> >
> > I no longer see the issue after reverting the above hunk, as Bitan
> > suggested, but now I see the following (which is expected) ...
> >
> >  WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
>
> For Tegra, the default timeout used when no timeout is specified for CMD6
> is 100 ms. So hard-coding the following also appears to work around the
> problem on Tegra ...

Interesting.

>
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index 868653bc1555..5155e0240fca 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -992,7 +992,7 @@ int mmc_flush_cache(struct mmc_card *card)
>                         (card->ext_csd.cache_size > 0) &&
>                         (card->ext_csd.cache_ctrl & 1)) {
>                 err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> -                                EXT_CSD_FLUSH_CACHE, 1, 0);
> +                                EXT_CSD_FLUSH_CACHE, 1, 100);
>                 if (err)
>                         pr_err("%s: cache flush error %d\n",
>                                         mmc_hostname(card->host), err);
>
> So the problem appears to be caused by the timeout being too long rather
> than not long enough.
>
> Looking more at the code, I now think that we are hitting the condition
> ...
>
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index 868653bc1555..feae82b1ff35 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -579,8 +579,10 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
>          * the host to avoid HW busy detection, by converting to a R1 response
>          * instead of a R1B.
>          */
> -       if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
> +       if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
> +               pr_warn("%s: timeout (%d) > max busy timeout (%d)", mmc_hostname(host), timeout_ms, host->max_busy_timeout);
>                 use_r1b_resp = false;
> +       }
>
>
> With the above I see ...
>
>  WARNING KERN mmc1: timeout (1600) > max busy timeout (672)
>
> So with the longer timeout we are not using/requesting the R1B response.

You are most likely correct.

However, from the core's point of view, the response is still requested;
it's just that we don't want the host driver to wait for the card to
stop signaling busy. Instead, we want to deal with that via "polling"
from the core.
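
To make that concrete, here is a simplified, userspace-style sketch of
what "polling" from the core means once the R1B busy wait is dropped.
This is not the actual kernel code; card_is_busy() and msleep() are
stand-ins for sending CMD13 (SEND_STATUS) and delaying:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for sending CMD13 (SEND_STATUS) and checking the busy bit. */
static bool card_is_busy(void)
{
	static int polls;
	return ++polls < 3;	/* pretend the card is busy for two polls */
}

/* Stand-in for a real delay. */
static void msleep(unsigned int ms) { (void)ms; }

/*
 * With an R1 response instead of R1B, the host driver never waits for
 * busy; the core does, by repeatedly asking the card for its status
 * until it stops signaling busy or the timeout expires.
 */
static int poll_for_busy(unsigned int timeout_ms)
{
	unsigned int waited = 0;

	while (card_is_busy()) {
		if (waited >= timeout_ms)
			return -110;	/* -ETIMEDOUT, as in the report */
		msleep(10);
		waited += 10;
	}
	return 0;
}

int main(void)
{
	printf("poll_for_busy: %d\n", poll_for_busy(1600));
	return 0;
}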

This is rather worrying behaviour, as it seems like the host driver
doesn't really follow the core's expectations here. And
mmc_flush_cache() is not the only case; we also have erase, bkops,
sanitize, etc. Are all of these working, or are they just not well
tested?

Earlier, before my three patches, if the timeout_ms parameter provided
to __mmc_switch() was zero - which was the case for mmc_flush_cache() -
__mmc_switch() simply skipped validating host->max_busy_timeout, which
was wrong. In any case, this also meant that an R1B response was always
used for mmc_flush_cache(), as you also indicated above. Perhaps this
is the critical part where things can go wrong.
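
In other words, roughly (a minimal before/after sketch, not the exact
kernel code; the 1600 ms and 672 ms values are just the ones from the
log above):

#include <stdbool.h>

struct host {
	unsigned int max_busy_timeout;	/* ms, 0 = no HW limit */
};

/*
 * Before: a zero timeout_ms skipped the comparison entirely, so an R1B
 * response was always used for the cache flush.
 */
static bool use_r1b_resp_old(const struct host *host, unsigned int timeout_ms)
{
	if (timeout_ms && host->max_busy_timeout &&
	    timeout_ms > host->max_busy_timeout)
		return false;
	return true;	/* mmc_flush_cache() passed 0 -> always R1B */
}

/*
 * After: the flush passes a real timeout, so the comparison can trigger
 * and the command is converted to R1 + polling from the core.
 */
static bool use_r1b_resp_new(const struct host *host, unsigned int timeout_ms)
{
	if (host->max_busy_timeout && timeout_ms > host->max_busy_timeout)
		return false;	/* e.g. 1600 > 672 on Tegra */
	return true;
}

int main(void)
{
	const struct host tegra = { .max_busy_timeout = 672 };

	/* Old behaviour: always R1B; new behaviour: R1 + polling. */
	return (use_r1b_resp_old(&tegra, 0) &&
		!use_r1b_resp_new(&tegra, 1600)) ? 0 : 1;
}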

BTW, have you tried erase commands with the sdhci-tegra driver? If those
are working fine, do you have any special treatment for them?

I have looped in Faiz, as sdhci-omap seems to suffer from very similar
problems. One thing I noted for sdhci-omap is that MMC_ERASE commands
are treated in a special manner in sdhci_omap_set_timeout(). This
indicates that there is something fishy going on.

Faiz, can you please comment on this?

Kind regards
Uffe

