sdhci(imx6): misbehaves while installing debian jessie, "Got data interrupt 0x00100000"

linux-arm-kernel.lists.infradead.org archive mirror

* sdhci(imx6): misbehaves while installing debian jessie, "Got data interrupt 0x00100000"
@ 2015-05-19 22:23 Russell King - ARM Linux
  2015-05-26 14:49 ` Ulf Hansson
  0 siblings, 1 reply; 5+ messages in thread
From: Russell King - ARM Linux @ 2015-05-19 22:23 UTC (permalink / raw)
  To: linux-arm-kernel

iMX6 with a Samsung EVO UHS-1 16GB card.

There's actually two problems here.

1. SDHCI chooses to impose a 10 second timeout on any data operation.
   This magic value of 10 seconds is ridiculous.  Consider that SD
   cards are typically slower than ATA, and ATA has a timeout of more
   than one minute for a stuck command...  And yes, I've had this fire
   a good 10 seconds before I then got...

2. "Got data interrupt 0x00100000 even though no data operation was in progress."

   That's SDHCI_INT_DATA_TIMEOUT.

   Unfortunately, I have no other information, as the registers are
   dumped at pr_debug() level, which means that they're all compiled
   out in normal kernel builds.  In any case, I have no way to copy
   information off the installing system; Debian does not start up
   an ssh daemon during the install, so remote login is not possible.
   Nor can I photograph the rather reflective TV screen.
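
For anyone decoding the status word in that message by hand, here is a
minimal user-space sketch.  The three bit values are the data error
bits as I believe drivers/mmc/host/sdhci.h defines them (check your own
tree); the helper name sdhci_data_irq_name() is made up purely for
illustration and does not exist in the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Data error bits of the SDHCI interrupt status register.
 * Assumed to match drivers/mmc/host/sdhci.h; verify against your tree.
 */
#define SDHCI_INT_DATA_TIMEOUT	0x00100000u
#define SDHCI_INT_DATA_CRC	0x00200000u
#define SDHCI_INT_DATA_END_BIT	0x00400000u

/*
 * Hypothetical helper: return the name of the first data error bit
 * set in intmask, or NULL if none of the three bits is set.
 */
static const char *sdhci_data_irq_name(uint32_t intmask)
{
	if (intmask & SDHCI_INT_DATA_TIMEOUT)
		return "SDHCI_INT_DATA_TIMEOUT";
	if (intmask & SDHCI_INT_DATA_CRC)
		return "SDHCI_INT_DATA_CRC";
	if (intmask & SDHCI_INT_DATA_END_BIT)
		return "SDHCI_INT_DATA_END_BIT";
	return NULL;
}
```

Passing 0x00100000 selects the first branch, since only bit 20 is set
in that value.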

The side-effect of this is that the entire MMC IO subsystem locks up
and I'm left with lots of processes stuck in IO-wait state, with the
hungtask detector spewing onto the console.

This happens while installing Debian Jessie, which is an "expensive"
operation: not only in time (it takes 45 minutes or so to reproduce,
requiring the stupid Debian installer to be babysat during that time)
but also because it's having to download the entire distro for each
attempt (which eats into my monthly bandwidth allowance, so I can
_only_ do this after 8pm local time.)

I've tried twice tonight (which is about the limit that I can do in an
evening), the second time after having upped the stupid 10 second limit
to 60 seconds.  However (2) still occurs.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* sdhci(imx6): misbehaves while installing debian jessie, "Got data interrupt 0x00100000"
  2015-05-19 22:23 sdhci(imx6): misbehaves while installing debian jessie, "Got data interrupt 0x00100000" Russell King - ARM Linux
@ 2015-05-26 14:49 ` Ulf Hansson
  2015-05-26 15:01   ` Russell King - ARM Linux
  0 siblings, 1 reply; 5+ messages in thread
From: Ulf Hansson @ 2015-05-26 14:49 UTC (permalink / raw)
  To: linux-arm-kernel

On 20 May 2015 at 00:23, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> iMX6 with a Samsung EVO UHS-1 16GB card.
>
> There's actually two problems here.
>
> 1. SDHCI chooses to impose a 10 second timeout on any data operation.
>    This magic value of 10 seconds is ridiculous.  Consider that SD
>    cards are typically slower than ATA, and ATA has a timeout of more
>    than one minute for a stuck command...  And yes, I've had this fire
>    a good 10 seconds before I then got...
>
> 2. "Got data interrupt 0x00100000 even though no data operation was in progress."
>
>    That's SDHCI_INT_DATA_TIMEOUT.
>
>    Unfortunately, I have no other information, as the registers are
>    dumped at pr_debug() level, which means that they're all compiled
>    out in normal kernel builds.  In any case, I have no way to copy
>    information off the installing system; Debian does not start up
>    an ssh daemon during the install, so remote login is not possible.
>    Nor can I photograph the rather reflective TV screen.
>
> The side-effect of this is that the entire MMC IO subsystem locks up
> and I'm left with lots of processes stuck in IO-wait state, with the
> hungtask detector spewing onto the console.

Sorry, I can't say much about the host driver and HW as such. I
don't have any iMX6 boards at hand.

Though, the side-effect you are describing isn't very nice. Even if it
doesn't solve your problem, perhaps we should discuss converting
from wait_for_completion() to wait_for_completion_timeout(), when the
mmc core waits for the host driver to return the result for the
request.

I guess the tricky part is to find a decent value for the "timeout".
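
To make the difference concrete, here is a rough user-space analogue
built on pthreads.  It is not the kernel completion API: every name
ending in _sketch is invented for this example, and it returns a plain
bool, whereas the kernel's wait_for_completion_timeout() returns the
remaining jiffies (0 on timeout).  The point is only that the waiter
gets control back when the host never signals.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* User-space stand-in for struct completion (illustrative only). */
struct completion_sketch {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion_sketch(struct completion_sketch *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Called by the "host driver" side when the request has finished. */
static void complete_sketch(struct completion_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Returns true if completed, false if timeout_ms elapsed first. */
static bool wait_for_completion_timeout_sketch(struct completion_sketch *c,
					       unsigned int timeout_ms)
{
	struct timespec deadline;
	bool done;

	/* Build an absolute deadline, as pthread_cond_timedwait() expects. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	while (!c->done) {
		/* Non-zero means ETIMEDOUT or an error: stop waiting. */
		if (pthread_cond_timedwait(&c->cond, &c->lock, &deadline) != 0)
			break;
	}
	done = c->done;
	pthread_mutex_unlock(&c->lock);
	return done;
}
```

With a plain wait, a host that never completes the request blocks the
caller forever; with the timed variant the caller sees "false" and can
decide what to do next.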

Kind regards
Uffe


* sdhci(imx6): misbehaves while installing debian jessie, "Got data interrupt 0x00100000"
  2015-05-26 14:49 ` Ulf Hansson
@ 2015-05-26 15:01   ` Russell King - ARM Linux
  2015-05-27  9:44     ` Ulf Hansson
  0 siblings, 1 reply; 5+ messages in thread
From: Russell King - ARM Linux @ 2015-05-26 15:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, May 26, 2015 at 04:49:34PM +0200, Ulf Hansson wrote:
> On 20 May 2015 at 00:23, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > iMX6 with a Samsung EVO UHS-1 16GB card.
> >
> > There's actually two problems here.
> >
> > 1. SDHCI chooses to impose a 10 second timeout on any data operation.
> >    This magic value of 10 seconds is ridiculous.  Consider that SD
> >    cards are typically slower than ATA, and ATA has a timeout of more
> >    than one minute for a stuck command...  And yes, I've had this fire
> >    a good 10 seconds before I then got...
> >
> > 2. "Got data interrupt 0x00100000 even though no data operation was in progress."
> >
> >    That's SDHCI_INT_DATA_TIMEOUT.
> >
> >    Unfortunately, I have no other information, as the registers are
> >    dumped at pr_debug() level, which means that they're all compiled
> >    out in normal kernel builds.  In any case, I have no way to copy
> >    information off the installing system; Debian does not start up
> >    an ssh daemon during the install, so remote login is not possible.
> >    Nor can I photograph the rather reflective TV screen.
> >
> > The side-effect of this is that the entire MMC IO subsystem locks up
> > and I'm left with lots of processes stuck in IO-wait state, with the
> > hungtask detector spewing onto the console.
> 
> Sorry, I can't say much about the host driver and HW as such. I
> don't have any iMX6 boards at hand.
> 
> Though, the side-effect you are describing isn't very nice. Even if it
> doesn't solve your problem, perhaps we should discuss converting
> from wait_for_completion() to wait_for_completion_timeout(), when the
> mmc core waits for the host driver to return the result for the
> request.
> 
> I guess the tricky part is to find a decent value for the "timeout".

There's two issues which would need solving for that:

1. The only sane timeout is one which will never trigger under normal
   operating circumstances, nor under abnormal load conditions.

2. When the timeout occurs, the core would need some way to reset the
   host driver back to a sane state before retrying the command.

However, the better question to ask is what's causing this.  It seemed
to lock up at the same point in the installation - after the base system
had been installed, but while it was installing additional stuff (for
xfce.)  From what I remember, it was exactly the same package that the
MMC host failed at.

I suppose it's entirely possible that the Debian install is running
some package scripts which end up poking about in memory, which are
screwing up the MMC host - nothing would surprise me... the Debian
Jessie install is screwed in other ways (if you select Gnome instead,
the installer aborts because it discovers that the distro packages
are missing some dependencies, though I had put that down to the UK
mirror possibly being out of date...)



* sdhci(imx6): misbehaves while installing debian jessie, "Got data interrupt 0x00100000"
  2015-05-26 15:01   ` Russell King - ARM Linux
@ 2015-05-27  9:44     ` Ulf Hansson
  2015-05-27  9:59       ` Russell King - ARM Linux
  0 siblings, 1 reply; 5+ messages in thread
From: Ulf Hansson @ 2015-05-27  9:44 UTC (permalink / raw)
  To: linux-arm-kernel

[...]

>> Though, the side-effect you are describing isn't very nice. Even if it
>> doesn't solve your problem, perhaps we should discuss converting
>> from wait_for_completion() to wait_for_completion_timeout(), when the
>> mmc core waits for the host driver to return the result for the
>> request.
>>
>> I guess the tricky part is to find a decent value for the "timeout".
>
> There's two issues which would need solving for that:
>
> 1. The only sane timeout is one which will never trigger under normal
>    operating circumstances, nor under abnormal load conditions.

Perhaps we could have the timeout calculated per request?

By using the current bus-speed and bus-width we can calculate the
available bandwidth. Obviously we need to also account for overhead
both at host and card side. Probably that overhead is taken from "best
guesses", not sure.

Finally considering the amount of data for the request, we can
calculate a value for the timeout.
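
Something along these lines, as a rough sketch only.  The function name
mmc_req_timeout_ms(), the overhead_ms term and the margin multiplier
are all made up for illustration; the overhead and margin are exactly
the "best guesses" mentioned above, and the bandwidth figure assumes
one bit per data line per clock cycle (no DDR).

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate a per-request timeout in milliseconds.
 *
 * bytes       - amount of data in the request
 * clock_hz    - current bus clock
 * bus_width   - number of data lines (1, 4 or 8)
 * overhead_ms - guessed fixed overhead for host and card
 * margin      - safety multiplier applied to the whole estimate
 *
 * Returns 0 if no estimate can be made (clock or width is zero).
 */
static uint64_t mmc_req_timeout_ms(uint64_t bytes, uint32_t clock_hz,
				   uint32_t bus_width, uint32_t overhead_ms,
				   uint32_t margin)
{
	uint64_t bits = bytes * 8;
	uint64_t bits_per_sec = (uint64_t)clock_hz * bus_width;
	uint64_t xfer_ms;

	if (bits_per_sec == 0)
		return 0;

	/* Raw transfer time, rounded up to a whole millisecond. */
	xfer_ms = (bits * 1000 + bits_per_sec - 1) / bits_per_sec;

	return (xfer_ms + overhead_ms) * margin;
}
```

For a 512 KiB request at 50 MHz on a 4-bit bus, with 100 ms of guessed
overhead and a margin of 10, the raw transfer rounds up to 21 ms and
the resulting timeout is 1210 ms.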

>
> 2. When the timeout occurs, the core would need some way to reset the
>    host driver back to a sane state before retrying the command.

Yep. I guess adding a host_ops->abort() callback or similar would be needed.

>
> However, the better question to ask is what's causing this.  It seemed
> to lock up at the same point in the installation - after the base system
> had been installed, but while it was installing additional stuff (for
> xfce.)  From what I remember, it was exactly the same package that the
> MMC host failed at.
>
> I suppose it's entirely possible that the Debian install is running
> some package scripts which end up poking about in memory, which are
> screwing up the MMC host - nothing would surprise me... the Debian
> Jessie install is screwed in other ways (if you select Gnome instead,
> the installer aborts because it discovers that the distro packages
> are missing some dependencies, though I had put that down to the UK
> mirror possibly being out of date...)
>

Kind regards
Uffe


* sdhci(imx6): misbehaves while installing debian jessie, "Got data interrupt 0x00100000"
  2015-05-27  9:44     ` Ulf Hansson
@ 2015-05-27  9:59       ` Russell King - ARM Linux
  0 siblings, 0 replies; 5+ messages in thread
From: Russell King - ARM Linux @ 2015-05-27  9:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, May 27, 2015 at 11:44:59AM +0200, Ulf Hansson wrote:
> [...]
> 
> >> Though, the side-effect you are describing isn't very nice. Even if it
> >> doesn't solve your problem, perhaps we should discuss converting
> >> from wait_for_completion() to wait_for_completion_timeout(), when the
> >> mmc core waits for the host driver to return the result for the
> >> request.
> >>
> >> I guess the tricky part is to find a decent value for the "timeout".
> >
> > There's two issues which would need solving for that:
> >
> > 1. The only sane timeout is one which will never trigger under normal
> >    operating circumstances, nor under abnormal load conditions.
> 
> Perhaps we could have the timeout calculated per request?
> 
> By using the current bus-speed and bus-width we can calculate the
> available bandwidth. Obviously we need to also account for overhead
> both at host and card side. Probably that overhead is taken from "best
> guesses", not sure.
> 
> Finally considering the amount of data for the request, we can
> calculate a value for the timeout.

The MMC specs already have this: cards specify exactly that in the
data.  However, that doesn't stop a card being buggy and supplying
incorrect values (hey, it works for Windows, ship it!)

The host is responsible for that part already (many hosts need to
have that programmed.)  So it doesn't make sense to use this.

What we're after here is something rather different: failure of the
host itself.  That should be a much longer timeout, one which we can
be sure will never trigger unless something has definitely gone wrong.
That's why I'd suggest something along the lines of ATA, around 60 to
120 seconds.  If the host has been dead for that long, it's definitely
dead.

(I know ATA's timeout very well, each time I put my very old Thinkpad
into standby mode, its APM talks to the drive, which then upsets Linux,
triggering that timeout... because it has disabled BM-DMA in one of the
control registers without Linux knowing.)


