All of lore.kernel.org
 help / color / mirror / Atom feed
* mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
@ 2020-09-01  7:37 Baumgartner, Claus (GE Healthcare)
  2020-09-01 11:46 ` Sebastian Reichel
  0 siblings, 1 reply; 7+ messages in thread
From: Baumgartner, Claus (GE Healthcare) @ 2020-09-01  7:37 UTC (permalink / raw)
  To: linux-mmc; +Cc: haibo.chen

Hi,

We have a board with an i.MX535 using a Samsung eMMC as persistent storage connected to eSDHCv3. Every now and then we produce a build that causes emmc timeouts: 

Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware cmd interrupt.
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Sys addr:  0xe3f12000 | Version:  0x00001201
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 | Host ctl: 0x00000031
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x0000011f
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e | Int stat: 0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b | Sig enab: 0x107f000b
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00001201
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x08100810
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xef041208
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============================================

Timeouts do not occur with every build. After some debugging I have found that timeouts seem to depend on code alignment of the esdhc_readl_le function. I have bisected the behavior by using the System.map and moving/padding the code with NOP instructions (mov r0,r0).

My test case has 5 processes continuously creating a file, writing random long data, reading data and deleting the file. It seems that when the esdhc_writel_le is aligned on a certain address then the timeout will occur about 5 times/12h using my test case. If I add one more NOP, the timeout will not occur at all. If I continue adding some more NOPs the timeouts come back. Seems that it doesn't matter where in the code I add NOPs as long as the address is below the address of esdhc_writel_le. 

We also run the same software on  a dual core i.MX6 without any timeout issues.

I have reproduced this with kernel version 4.19.94 and 5.8.3 and we have compiled with both gcc8 and gcc9.
I'm still searching for the root cause and I would appreciate any thoughts about where to go next. 

Thanks,

-Claus-




 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
  2020-09-01  7:37 mmc0: Timeout waiting for hardware cmd interrupt on i.MX535 Baumgartner, Claus (GE Healthcare)
@ 2020-09-01 11:46 ` Sebastian Reichel
  2020-09-02 11:24   ` Bough Chen
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Reichel @ 2020-09-01 11:46 UTC (permalink / raw)
  To: NXP Linux Team
  Cc: linux-mmc, Haibo Chen, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Baumgartner,
	Claus (GE Healthcare)

[-- Attachment #1: Type: text/plain, Size: 3776 bytes --]

Hi,

[add i.MX architecture maintainers to Cc]

On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE Healthcare) wrote:
> We have a board with an i.MX535 using a Samsung eMMC as persistent
> storage connected to eSDHCv3. Every now and then we produce a
> build that causes emmc timeouts: 
> 
> Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware cmd interrupt.
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Sys addr:  0xe3f12000 | Version:  0x00001201
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 | Host ctl: 0x00000031
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000000
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x0000011f
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e | Int stat: 0x00000000
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b | Sig enab: 0x107f000b
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00001201
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x08100810
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xef041208
> Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============================================

Some extra information: The timeout always has cmd = 0x00000d1a
(MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0] translating
to this IIUIC:

Bit 8 = Ready for data
Bit 11 = CURRENT_STATE is TRAN
Bit 22 = Illegal command

> Timeouts do not occur with every build. After some debugging I
> have found that timeouts seem to depend on code alignment of the
> esdhc_readl_le function. I have bisected the behavior by using the
> System.map and moving/padding the code with NOP instructions (mov
> r0,r0).
> 
> My test case has 5 processes continuously creating a file, writing
> random long data, reading data and deleting the file. It seems
> that when the esdhc_writel_le is aligned on a certain address then
> the timeout will occur about 5 times/12h using my test case. If I
> add one more NOP, the timeout will not occur at all. If I continue
> adding some more NOPs the timeouts come back. Seems that it
> doesn't matter where in the code I add NOPs as long as the address
> is below the address of esdhc_writel_le. 
> 
> We also run the same software on a dual core i.MX6 without any
> timeout issues.

And the same kernel binary is also used on an i.MX6 single core
(albeit with different SW) withot triggering the problem so far.

> I have reproduced this with kernel version 4.19.94 and 5.8.3 and
> we have compiled with both gcc8 and gcc9. I'm still searching for
> the root cause and I would appreciate any thoughts about where to
> go next. 
> 
> Thanks,
> 
> -Claus-

To me it looks like it might involve an unknown hardware errata for
i.MX53, but there has been one similar report before (unfortunately
without the full register dump) involving virtualization:

https://patchwork.kernel.org/patch/10705823/

Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.

-- Sebastian

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
  2020-09-01 11:46 ` Sebastian Reichel
@ 2020-09-02 11:24   ` Bough Chen
  2020-09-02 13:49     ` Sebastian Reichel
  0 siblings, 1 reply; 7+ messages in thread
From: Bough Chen @ 2020-09-02 11:24 UTC (permalink / raw)
  To: Sebastian Reichel, dl-linux-imx
  Cc: linux-mmc, Shawn Guo, Sascha Hauer, Pengutronix Kernel Team,
	Fabio Estevam, Baumgartner, Claus (GE Healthcare)


> -----Original Message-----
> From: Sebastian Reichel [mailto:sebastian.reichel@collabora.com]
> Sent: 2020年9月1日 19:47
> To: dl-linux-imx <linux-imx@nxp.com>
> Cc: linux-mmc@vger.kernel.org; Bough Chen <haibo.chen@nxp.com>; Shawn
> Guo <shawnguo@kernel.org>; Sascha Hauer <s.hauer@pengutronix.de>;
> Pengutronix Kernel Team <kernel@pengutronix.de>; Fabio Estevam
> <festevam@gmail.com>; Baumgartner, Claus (GE Healthcare)
> <claus.baumgartner@med.ge.com>
> Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
> 
> Hi,
> 
> [add i.MX architecture maintainers to Cc]
> 
> On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE
> Healthcare) wrote:
> > We have a board with an i.MX535 using a Samsung eMMC as persistent
> > storage connected to eSDHCv3. Every now and then we produce a build
> > that causes emmc timeouts:
> >
> > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware cmd
> interrupt.
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI REGISTER
> > DUMP =========== Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Sys addr:
> > 0xe3f12000 | Version:  0x00001201 Aug 28 07:32:12 csmon kernel: mmc0:
> > sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001 Aug 28 07:32:12 csmon
> kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 | Host
> ctl: 0x00000031
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 | Blk
> gap:  0x00000000
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 |
> Clock:    0x0000011f
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e | Int
> stat: 0x00000000
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b | Sig
> > enab: 0x107f000b Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd stat:
> 0x00000000 | Slot int: 0x00001201
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 |
> Caps_1:   0x08100810
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a |
> Max curr: 0x00000000
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 |
> Resp[1]:  0x00000000
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 |
> Resp[3]:  0x00000000
> > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000 Aug
> > 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 | ADMA
> > Ptr: 0xef041208 Aug 28 07:32:12 csmon kernel: mmc0: sdhci:
> > ============================================
> 
> Some extra information: The timeout always has cmd = 0x00000d1a
> (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0] translating to
> this IIUIC:
> 
> Bit 8 = Ready for data
> Bit 11 = CURRENT_STATE is TRAN
> Bit 22 = Illegal command

According to the code logic, since this cmd13 get hardware cmd timeout, which means this cmd13 do not get any response. Here the Resp[0] should be the previous command's response.
So this means the previous command is an illegal command, cause the emmc internal firmware stuck, and can't response to the next cmd13.
I think we need to firstly identify the specific place in emmc driver which trigger the log dump. 


Best Regards
Haibo Chen

> 
> > Timeouts do not occur with every build. After some debugging I have
> > found that timeouts seem to depend on code alignment of the
> > esdhc_readl_le function. I have bisected the behavior by using the
> > System.map and moving/padding the code with NOP instructions (mov
> > r0,r0).
> >
> > My test case has 5 processes continuously creating a file, writing
> > random long data, reading data and deleting the file. It seems that
> > when the esdhc_writel_le is aligned on a certain address then the
> > timeout will occur about 5 times/12h using my test case. If I add one
> > more NOP, the timeout will not occur at all. If I continue adding some
> > more NOPs the timeouts come back. Seems that it doesn't matter where
> > in the code I add NOPs as long as the address is below the address of
> > esdhc_writel_le.
> >
> > We also run the same software on a dual core i.MX6 without any timeout
> > issues.
> 
> And the same kernel binary is also used on an i.MX6 single core (albeit with
> different SW) withot triggering the problem so far.
> 
> > I have reproduced this with kernel version 4.19.94 and 5.8.3 and we
> > have compiled with both gcc8 and gcc9. I'm still searching for the
> > root cause and I would appreciate any thoughts about where to go next.
> >
> > Thanks,
> >
> > -Claus-
> 
> To me it looks like it might involve an unknown hardware errata for i.MX53, but
> there has been one similar report before (unfortunately without the full
> register dump) involving virtualization:
> 
> https://patchwork.kernel.org/patch/10705823/
> 
> Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.
> 
> -- Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
  2020-09-02 11:24   ` Bough Chen
@ 2020-09-02 13:49     ` Sebastian Reichel
  2020-09-03  2:10       ` Bough Chen
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Reichel @ 2020-09-02 13:49 UTC (permalink / raw)
  To: Bough Chen
  Cc: dl-linux-imx, linux-mmc, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Baumgartner,
	Claus (GE Healthcare)

[-- Attachment #1: Type: text/plain, Size: 5900 bytes --]

Hi,

On Wed, Sep 02, 2020 at 11:24:52AM +0000, Bough Chen wrote:
> > -----Original Message-----
> > From: Sebastian Reichel [mailto:sebastian.reichel@collabora.com]
> > Sent: 2020年9月1日 19:47
> > To: dl-linux-imx <linux-imx@nxp.com>
> > Cc: linux-mmc@vger.kernel.org; Bough Chen <haibo.chen@nxp.com>; Shawn
> > Guo <shawnguo@kernel.org>; Sascha Hauer <s.hauer@pengutronix.de>;
> > Pengutronix Kernel Team <kernel@pengutronix.de>; Fabio Estevam
> > <festevam@gmail.com>; Baumgartner, Claus (GE Healthcare)
> > <claus.baumgartner@med.ge.com>
> > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
> > 
> > Hi,
> > 
> > [add i.MX architecture maintainers to Cc]
> > 
> > On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE
> > Healthcare) wrote:
> > > We have a board with an i.MX535 using a Samsung eMMC as persistent
> > > storage connected to eSDHCv3. Every now and then we produce a build
> > > that causes emmc timeouts:
> > >
> > > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware cmd
> > interrupt.
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI REGISTER
> > > DUMP =========== Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Sys addr:
> > > 0xe3f12000 | Version:  0x00001201 Aug 28 07:32:12 csmon kernel: mmc0:
> > > sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001 Aug 28 07:32:12 csmon
> > kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 | Host
> > ctl: 0x00000031
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 | Blk
> > gap:  0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 |
> > Clock:    0x0000011f
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e | Int
> > stat: 0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b | Sig
> > > enab: 0x107f000b Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd stat:
> > 0x00000000 | Slot int: 0x00001201
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 |
> > Caps_1:   0x08100810
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a |
> > Max curr: 0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 |
> > Resp[1]:  0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 |
> > Resp[3]:  0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000 Aug
> > > 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 | ADMA
> > > Ptr: 0xef041208 Aug 28 07:32:12 csmon kernel: mmc0: sdhci:
> > > ============================================
> > 
> > Some extra information: The timeout always has cmd = 0x00000d1a
> > (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0] translating to
> > this IIUIC:
> > 
> > Bit 8 = Ready for data
> > Bit 11 = CURRENT_STATE is TRAN
> > Bit 22 = Illegal command
> 
> According to the code logic, since this cmd13 get hardware cmd
> timeout, which means this cmd13 do not get any response. Here
> the Resp[0] should be the previous command's response.
>
> So this means the previous command is an illegal command, cause
> the emmc internal firmware stuck, and can't response to the next
> cmd13.
>
> I think we need to firstly identify the specific place in
> emmc driver which trigger the log dump.

My understanding is, that a missing response from the eMMC should trigger
the Command Timeout Error Status IRQ in eSDHC after 64 SDCLK cycles
(see section 30.7.10 [ESDHCV3x_IRQSTAT] in the i.MX53 reference manual).
64 SDCLK cycles means, that this should recover quickly and would not be
a problem for most usecases.

But what we are seeing is the software 10 seconds timeout. My understanding
is, that this should not be triggered if the SDHCI controller works as expected
(e.g. by generating a IRQ for the timeout). This timeout is much more
problematic, since all eMMC accessing processes block for those 10 seconds.

-- Sebastian

> Best Regards
> Haibo Chen
> 
> > 
> > > Timeouts do not occur with every build. After some debugging I have
> > > found that timeouts seem to depend on code alignment of the
> > > esdhc_readl_le function. I have bisected the behavior by using the
> > > System.map and moving/padding the code with NOP instructions (mov
> > > r0,r0).
> > >
> > > My test case has 5 processes continuously creating a file, writing
> > > random long data, reading data and deleting the file. It seems that
> > > when the esdhc_writel_le is aligned on a certain address then the
> > > timeout will occur about 5 times/12h using my test case. If I add one
> > > more NOP, the timeout will not occur at all. If I continue adding some
> > > more NOPs the timeouts come back. Seems that it doesn't matter where
> > > in the code I add NOPs as long as the address is below the address of
> > > esdhc_writel_le.
> > >
> > > We also run the same software on a dual core i.MX6 without any timeout
> > > issues.
> > 
> > And the same kernel binary is also used on an i.MX6 single core (albeit with
> > different SW) withot triggering the problem so far.
> > 
> > > I have reproduced this with kernel version 4.19.94 and 5.8.3 and we
> > > have compiled with both gcc8 and gcc9. I'm still searching for the
> > > root cause and I would appreciate any thoughts about where to go next.
> > >
> > > Thanks,
> > >
> > > -Claus-
> > 
> > To me it looks like it might involve an unknown hardware errata for i.MX53, but
> > there has been one similar report before (unfortunately without the full
> > register dump) involving virtualization:
> > 
> > https://patchwork.kernel.org/patch/10705823/
> > 
> > Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.
> > 
> > -- Sebastian

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
  2020-09-02 13:49     ` Sebastian Reichel
@ 2020-09-03  2:10       ` Bough Chen
  2020-09-03 13:29         ` Sebastian Reichel
  0 siblings, 1 reply; 7+ messages in thread
From: Bough Chen @ 2020-09-03  2:10 UTC (permalink / raw)
  To: Sebastian Reichel
  Cc: dl-linux-imx, linux-mmc, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Baumgartner,
	Claus (GE Healthcare)


> -----Original Message-----
> From: Sebastian Reichel [mailto:sebastian.reichel@collabora.com]
> Sent: 2020年9月2日 21:49
> To: Bough Chen <haibo.chen@nxp.com>
> Cc: dl-linux-imx <linux-imx@nxp.com>; linux-mmc@vger.kernel.org; Shawn Guo
> <shawnguo@kernel.org>; Sascha Hauer <s.hauer@pengutronix.de>;
> Pengutronix Kernel Team <kernel@pengutronix.de>; Fabio Estevam
> <festevam@gmail.com>; Baumgartner, Claus (GE Healthcare)
> <claus.baumgartner@med.ge.com>
> Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
> 
> Hi,
> 
> On Wed, Sep 02, 2020 at 11:24:52AM +0000, Bough Chen wrote:
> > > -----Original Message-----
> > > From: Sebastian Reichel [mailto:sebastian.reichel@collabora.com]
> > > Sent: 2020年9月1日 19:47
> > > To: dl-linux-imx <linux-imx@nxp.com>
> > > Cc: linux-mmc@vger.kernel.org; Bough Chen <haibo.chen@nxp.com>;
> > > Shawn Guo <shawnguo@kernel.org>; Sascha Hauer
> > > <s.hauer@pengutronix.de>; Pengutronix Kernel Team
> > > <kernel@pengutronix.de>; Fabio Estevam <festevam@gmail.com>;
> > > Baumgartner, Claus (GE Healthcare) <claus.baumgartner@med.ge.com>
> > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> > > i.MX535
> > >
> > > Hi,
> > >
> > > [add i.MX architecture maintainers to Cc]
> > >
> > > On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE
> > > Healthcare) wrote:
> > > > We have a board with an i.MX535 using a Samsung eMMC as persistent
> > > > storage connected to eSDHCv3. Every now and then we produce a
> > > > build that causes emmc timeouts:
> > > >
> > > > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware
> > > > cmd
> > > interrupt.
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI
> > > > REGISTER DUMP =========== Aug 28 07:32:12 csmon kernel: mmc0:
> sdhci: Sys addr:
> > > > 0xe3f12000 | Version:  0x00001201 Aug 28 07:32:12 csmon kernel:
> mmc0:
> > > > sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001 Aug 28
> > > > 07:32:12 csmon
> > > kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 |
> Host
> > > ctl: 0x00000031
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 |
> Blk
> > > gap:  0x00000000
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 |
> > > Clock:    0x0000011f
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e |
> Int
> > > stat: 0x00000000
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b |
> > > > Sig
> > > > enab: 0x107f000b Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd
> stat:
> > > 0x00000000 | Slot int: 0x00001201
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 |
> > > Caps_1:   0x08100810
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a |
> > > Max curr: 0x00000000
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 |
> > > Resp[1]:  0x00000000
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 |
> > > Resp[3]:  0x00000000
> > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000
> > > > Aug
> > > > 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 |
> > > > ADMA
> > > > Ptr: 0xef041208 Aug 28 07:32:12 csmon kernel: mmc0: sdhci:
> > > > ============================================
> > >
> > > Some extra information: The timeout always has cmd = 0x00000d1a
> > > (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0] translating
> > > to this IIUIC:
> > >
> > > Bit 8 = Ready for data
> > > Bit 11 = CURRENT_STATE is TRAN
> > > Bit 22 = Illegal command
> >
> > According to the code logic, since this cmd13 get hardware cmd
> > timeout, which means this cmd13 do not get any response. Here the
> > Resp[0] should be the previous command's response.
> >
> > So this means the previous command is an illegal command, cause the
> > emmc internal firmware stuck, and can't response to the next cmd13.
> >
> > I think we need to firstly identify the specific place in emmc driver
> > which trigger the log dump.
> 
> My understanding is, that a missing response from the eMMC should trigger
> the Command Timeout Error Status IRQ in eSDHC after 64 SDCLK cycles (see
> section 30.7.10 [ESDHCV3x_IRQSTAT] in the i.MX53 reference manual).
> 64 SDCLK cycles means, that this should recover quickly and would not be a
> problem for most usecases.
> 
> But what we are seeing is the software 10 seconds timeout. My understanding
> is, that this should not be triggered if the SDHCI controller works as expected
> (e.g. by generating a IRQ for the timeout). This timeout is much more
> problematic, since all eMMC accessing processes block for those 10 seconds.
> 

Agree, only one possibility, the cmd13 do not send out successfully. 
The count of 64 SDCLK cycle only trigged by the end of the sending command.
If the command still not send out completely, then should trigger the 10s sw
timeout. Let me double confirm with our IC team. 
I still suggest that we need first to find which cmd13 in our mmc driver meet this issue.

Best Regards
Haibo Chen

> -- Sebastian
> 
> > Best Regards
> > Haibo Chen
> >
> > >
> > > > Timeouts do not occur with every build. After some debugging I
> > > > have found that timeouts seem to depend on code alignment of the
> > > > esdhc_readl_le function. I have bisected the behavior by using the
> > > > System.map and moving/padding the code with NOP instructions (mov
> > > > r0,r0).
> > > >
> > > > My test case has 5 processes continuously creating a file, writing
> > > > random long data, reading data and deleting the file. It seems
> > > > that when the esdhc_writel_le is aligned on a certain address then
> > > > the timeout will occur about 5 times/12h using my test case. If I
> > > > add one more NOP, the timeout will not occur at all. If I continue
> > > > adding some more NOPs the timeouts come back. Seems that it
> > > > doesn't matter where in the code I add NOPs as long as the address
> > > > is below the address of esdhc_writel_le.
> > > >
> > > > We also run the same software on a dual core i.MX6 without any
> > > > timeout issues.
> > >
> > > And the same kernel binary is also used on an i.MX6 single core
> > > (albeit with different SW) withot triggering the problem so far.
> > >
> > > > I have reproduced this with kernel version 4.19.94 and 5.8.3 and
> > > > we have compiled with both gcc8 and gcc9. I'm still searching for
> > > > the root cause and I would appreciate any thoughts about where to go
> next.
> > > >
> > > > Thanks,
> > > >
> > > > -Claus-
> > >
> > > To me it looks like it might involve an unknown hardware errata for
> > > i.MX53, but there has been one similar report before (unfortunately
> > > without the full register dump) involving virtualization:
> > >
> > > https://patchwork.kernel.org/patch/10705823/
> > >
> > > Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.
> > >
> > > -- Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
  2020-09-03  2:10       ` Bough Chen
@ 2020-09-03 13:29         ` Sebastian Reichel
  2020-09-07 10:50           ` EXT: " Baumgartner, Claus (GE Healthcare)
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Reichel @ 2020-09-03 13:29 UTC (permalink / raw)
  To: Bough Chen
  Cc: dl-linux-imx, linux-mmc, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Baumgartner,
	Claus (GE Healthcare)

[-- Attachment #1: Type: text/plain, Size: 8042 bytes --]

Hi,

On Thu, Sep 03, 2020 at 02:10:43AM +0000, Bough Chen wrote:
> > -----Original Message-----
> > From: Sebastian Reichel [mailto:sebastian.reichel@collabora.com]
> > Sent: 2020年9月2日 21:49
> > To: Bough Chen <haibo.chen@nxp.com>
> > Cc: dl-linux-imx <linux-imx@nxp.com>; linux-mmc@vger.kernel.org; Shawn Guo
> > <shawnguo@kernel.org>; Sascha Hauer <s.hauer@pengutronix.de>;
> > Pengutronix Kernel Team <kernel@pengutronix.de>; Fabio Estevam
> > <festevam@gmail.com>; Baumgartner, Claus (GE Healthcare)
> > <claus.baumgartner@med.ge.com>
> > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
> > 
> > On Wed, Sep 02, 2020 at 11:24:52AM +0000, Bough Chen wrote:
> > > > -----Original Message-----
> > > > From: Sebastian Reichel [mailto:sebastian.reichel@collabora.com]
> > > > Sent: 2020年9月1日 19:47
> > > > To: dl-linux-imx <linux-imx@nxp.com>
> > > > Cc: linux-mmc@vger.kernel.org; Bough Chen <haibo.chen@nxp.com>;
> > > > Shawn Guo <shawnguo@kernel.org>; Sascha Hauer
> > > > <s.hauer@pengutronix.de>; Pengutronix Kernel Team
> > > > <kernel@pengutronix.de>; Fabio Estevam <festevam@gmail.com>;
> > > > Baumgartner, Claus (GE Healthcare) <claus.baumgartner@med.ge.com>
> > > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> > > > i.MX535
> > > >
> > > > Hi,
> > > >
> > > > [add i.MX architecture maintainers to Cc]
> > > >
> > > > On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE
> > > > Healthcare) wrote:
> > > > > We have a board with an i.MX535 using a Samsung eMMC as persistent
> > > > > storage connected to eSDHCv3. Every now and then we produce a
> > > > > build that causes emmc timeouts:
> > > > >
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware
> > > > > cmd
> > > > interrupt.
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI
> > > > > REGISTER DUMP =========== Aug 28 07:32:12 csmon kernel: mmc0:
> > sdhci: Sys addr:
> > > > > 0xe3f12000 | Version:  0x00001201 Aug 28 07:32:12 csmon kernel:
> > mmc0:
> > > > > sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001 Aug 28
> > > > > 07:32:12 csmon
> > > > kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 |
> > Host
> > > > ctl: 0x00000031
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 |
> > Blk
> > > > gap:  0x00000000
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 |
> > > > Clock:    0x0000011f
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e |
> > Int
> > > > stat: 0x00000000
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b |
> > > > > Sig
> > > > > enab: 0x107f000b Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd
> > stat:
> > > > 0x00000000 | Slot int: 0x00001201
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 |
> > > > Caps_1:   0x08100810
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a |
> > > > Max curr: 0x00000000
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 |
> > > > Resp[1]:  0x00000000
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 |
> > > > Resp[3]:  0x00000000
> > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000
> > > > > Aug
> > > > > 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 |
> > > > > ADMA
> > > > > Ptr: 0xef041208 Aug 28 07:32:12 csmon kernel: mmc0: sdhci:
> > > > > ============================================
> > > >
> > > > Some extra information: The timeout always has cmd = 0x00000d1a
> > > > (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0] translating
> > > > to this IIUIC:
> > > >
> > > > Bit 8 = Ready for data
> > > > Bit 11 = CURRENT_STATE is TRAN
> > > > Bit 22 = Illegal command
> > >
> > > According to the code logic, since this cmd13 get hardware cmd
> > > timeout, which means this cmd13 do not get any response. Here the
> > > Resp[0] should be the previous command's response.
> > >
> > > So this means the previous command is an illegal command, cause the
> > > emmc internal firmware stuck, and can't response to the next cmd13.
> > >
> > > I think we need to firstly identify the specific place in emmc driver
> > > which trigger the log dump.
> > 
> > My understanding is, that a missing response from the eMMC should trigger
> > the Command Timeout Error Status IRQ in eSDHC after 64 SDCLK cycles (see
> > section 30.7.10 [ESDHCV3x_IRQSTAT] in the i.MX53 reference manual).
> > 64 SDCLK cycles means, that this should recover quickly and would not be a
> > problem for most usecases.
> > 
> > But what we are seeing is the software 10 seconds timeout. My understanding
> > is, that this should not be triggered if the SDHCI controller works as expected
> > (e.g. by generating a IRQ for the timeout). This timeout is much more
> > problematic, since all eMMC accessing processes block for those 10 seconds.
> > 
> 
> Agree, only one possibility, the cmd13 do not send out successfully.

I think there are two possibilities:

1. The command is not send out, so no IRQs are received.
2. The IRQ gets lost or is not generated

The esdhc_writel_le() has a workaround to avoid missing the card
irq. If that does not fully fix the issue, I would expect the SW
fallback to also cover that case.

> The count of 64 SDCLK cycle only trigged by the end of the sending
> command. If the command still not send out completely, then
> should trigger the 10s sw timeout. Let me double confirm with our
> IC team.

Ack.

> I still suggest that we need first to find which cmd13 in our mmc
> driver meet this issue.

We will try to figure that out and report back. Needs a bit of time,
since the error only appears after some hours on an affected kernel
and adding the necessary code potentially hides the problem due to
the alignment changes requiring another run with padding nops.

-- Sebastian

> > > > > Timeouts do not occur with every build. After some debugging I
> > > > > have found that timeouts seem to depend on code alignment of the
> > > > > esdhc_readl_le function. I have bisected the behavior by using the
> > > > > System.map and moving/padding the code with NOP instructions (mov
> > > > > r0,r0).
> > > > >
> > > > > My test case has 5 processes continuously creating a file, writing
> > > > > random long data, reading data and deleting the file. It seems
> > > > > that when the esdhc_writel_le is aligned on a certain address then
> > > > > the timeout will occur about 5 times/12h using my test case. If I
> > > > > add one more NOP, the timeout will not occur at all. If I continue
> > > > > adding some more NOPs the timeouts come back. Seems that it
> > > > > doesn't matter where in the code I add NOPs as long as the address
> > > > > is below the address of esdhc_writel_le.
> > > > >
> > > > > We also run the same software on a dual core i.MX6 without any
> > > > > timeout issues.
> > > >
> > > > And the same kernel binary is also used on an i.MX6 single core
> > > > (albeit with different SW) withot triggering the problem so far.
> > > >
> > > > > I have reproduced this with kernel version 4.19.94 and 5.8.3 and
> > > > > we have compiled with both gcc8 and gcc9. I'm still searching for
> > > > > the root cause and I would appreciate any thoughts about where to go
> > next.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > -Claus-
> > > >
> > > > To me it looks like it might involve an unknown hardware errata for
> > > > i.MX53, but there has been one similar report before (unfortunately
> > > > without the full register dump) involving virtualization:
> > > >
> > > > https://patchwork.kernel.org/patch/10705823/
> > > >
> > > > Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.
> > > >
> > > > -- Sebastian

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: EXT: Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
  2020-09-03 13:29         ` Sebastian Reichel
@ 2020-09-07 10:50           ` Baumgartner, Claus (GE Healthcare)
  0 siblings, 0 replies; 7+ messages in thread
From: Baumgartner, Claus (GE Healthcare) @ 2020-09-07 10:50 UTC (permalink / raw)
  To: Sebastian Reichel, Bough Chen
  Cc: dl-linux-imx, linux-mmc, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam

Hi,

> -----Original Message-----
> From: Sebastian Reichel <sebastian.reichel@collabora.com>
> Sent: 3. syyskuutata 2020 16:29
> To: Bough Chen <haibo.chen@nxp.com>
> Cc: dl-linux-imx <linux-imx@nxp.com>; linux-mmc@vger.kernel.org; Shawn
> Guo <shawnguo@kernel.org>; Sascha Hauer <s.hauer@pengutronix.de>;
> Pengutronix Kernel Team <kernel@pengutronix.de>; Fabio Estevam
> <festevam@gmail.com>; Baumgartner, Claus (GE Healthcare)
> <claus.baumgartner@med.ge.com>
> Subject: EXT: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> i.MX535
> 
> Hi,
> 
> On Thu, Sep 03, 2020 at 02:10:43AM +0000, Bough Chen wrote:
> > > -----Original Message-----
> > > From: Sebastian Reichel [mailto:sebastian.reichel@collabora.com]
> > > Sent: 2020年9月2日 21:49
> > > To: Bough Chen <haibo.chen@nxp.com>
> > > Cc: dl-linux-imx <linux-imx@nxp.com>; linux-mmc@vger.kernel.org;
> > > Shawn Guo <shawnguo@kernel.org>; Sascha Hauer
> > > <s.hauer@pengutronix.de>; Pengutronix Kernel Team
> > > <kernel@pengutronix.de>; Fabio Estevam <festevam@gmail.com>;
> > > Baumgartner, Claus (GE Healthcare) <claus.baumgartner@med.ge.com>
> > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> > > i.MX535
> > >
> > > On Wed, Sep 02, 2020 at 11:24:52AM +0000, Bough Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Sebastian Reichel [mailto:sebastian.reichel@collabora.com]
> > > > > Sent: 2020年9月1日 19:47
> > > > > To: dl-linux-imx <linux-imx@nxp.com>
> > > > > Cc: linux-mmc@vger.kernel.org; Bough Chen <haibo.chen@nxp.com>;
> > > > > Shawn Guo <shawnguo@kernel.org>; Sascha Hauer
> > > > > <s.hauer@pengutronix.de>; Pengutronix Kernel Team
> > > > > <kernel@pengutronix.de>; Fabio Estevam <festevam@gmail.com>;
> > > > > Baumgartner, Claus (GE Healthcare)
> > > > > <claus.baumgartner@med.ge.com>
> > > > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> > > > > i.MX535
> > > > >
> > > > > Hi,
> > > > >
> > > > > [add i.MX architecture maintainers to Cc]
> > > > >
> > > > > On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE
> > > > > Healthcare) wrote:
> > > > > > We have a board with an i.MX535 using a Samsung eMMC as
> > > > > > persistent storage connected to eSDHCv3. Every now and then we
> > > > > > produce a build that causes emmc timeouts:
> > > > > >
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for
> > > > > > hardware cmd
> > > > > interrupt.
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI
> > > > > > REGISTER DUMP =========== Aug 28 07:32:12 csmon kernel: mmc0:
> > > sdhci: Sys addr:
> > > > > > 0xe3f12000 | Version:  0x00001201 Aug 28 07:32:12 csmon kernel:
> > > mmc0:
> > > > > > sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001 Aug 28
> > > > > > 07:32:12 csmon
> > > > > kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode:
> > > > > 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 |
> > > Host
> > > > > ctl: 0x00000031
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 |
> > > Blk
> > > > > gap:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000
> |
> > > > > Clock:    0x0000011f
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e |
> > > Int
> > > > > stat: 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:
> > > > > > 0x107f000b | Sig
> > > > > > enab: 0x107f000b Aug 28 07:32:12 csmon kernel: mmc0: sdhci:
> > > > > > ACmd
> > > stat:
> > > > > 0x00000000 | Slot int: 0x00001201
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 |
> > > > > Caps_1:   0x08100810
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a |
> > > > > Max curr: 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 |
> > > > > Resp[1]:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 |
> > > > > Resp[3]:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2:
> > > > > > 0x00000000 Aug
> > > > > > 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 |
> > > > > > ADMA
> > > > > > Ptr: 0xef041208 Aug 28 07:32:12 csmon kernel: mmc0: sdhci:
> > > > > > ============================================
> > > > >
> > > > > Some extra information: The timeout always has cmd = 0x00000d1a
> > > > > (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0]
> > > > > translating to this IIUIC:
> > > > >
> > > > > Bit 8 = Ready for data
> > > > > Bit 11 = CURRENT_STATE is TRAN
> > > > > Bit 22 = Illegal command
> > > >
> > > > According to the code logic, since this cmd13 get hardware cmd
> > > > timeout, which means this cmd13 do not get any response. Here the
> > > > Resp[0] should be the previous command's response.
> > > >
> > > > So this means the previous command is an illegal command, cause
> > > > the emmc internal firmware stuck, and can't response to the next cmd13.
> > > >
> > > > I think we need to firstly identify the specific place in emmc
> > > > driver which trigger the log dump.
> > >
> > > My understanding is, that a missing response from the eMMC should
> > > trigger the Command Timeout Error Status IRQ in eSDHC after 64 SDCLK
> > > cycles (see section 30.7.10 [ESDHCV3x_IRQSTAT] in the i.MX53 reference
> manual).
> > > 64 SDCLK cycles means, that this should recover quickly and would
> > > not be a problem for most usecases.
> > >
> > > But what we are seeing is the software 10 seconds timeout. My
> > > understanding is, that this should not be triggered if the SDHCI
> > > controller works as expected (e.g. by generating a IRQ for the
> > > timeout). This timeout is much more problematic, since all eMMC accessing
> processes block for those 10 seconds.
> > >
> >
> > Agree, only one possibility, the cmd13 do not send out successfully.
> 
> I think there are two possibilities:
> 
> 1. The command is not send out, so no IRQs are received.
> 2. The IRQ gets lost or is not generated
> 
> The esdhc_writel_le() has a workaround to avoid missing the card irq. If that
> does not fully fix the issue, I would expect the SW fallback to also cover that
> case.
> 
> > The count of 64 SDCLK cycle only trigged by the end of the sending
> > command. If the command still not send out completely, then should
> > trigger the 10s sw timeout. Let me double confirm with our IC team.
> 
> Ack.
> 
> > I still suggest that we need first to find which cmd13 in our mmc
> > driver meet this issue.
> 
> We will try to figure that out and report back. Needs a bit of time, since the
> error only appears after some hours on an affected kernel and adding the
> necessary code potentially hides the problem due to the alignment changes
> requiring another run with padding nops.
> 
> -- Sebastian

I changed the software so that it generates a back trace when the timeout occurs
and I had to pad with a few NOP instructions again to get the timeouts to occur.

I run the software on two units and got all together 22 timeouts during 2 days.  All
back traces were same as below:  

[ 2178.405338] mmc0: Timeout waiting for hardware cmd interrupt.
[ 2178.405354] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[ 2178.405367] mmc0: sdhci: Sys addr:  0xe5f1a000 | Version:  0x00001201
[ 2178.405374] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001
[ 2178.405380] mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
[ 2178.405389] mmc0: sdhci: Present:   0x01f80008 | Host ctl: 0x00000031
[ 2178.405395] mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000000
[ 2178.405401] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x0000011f
[ 2178.405407] mmc0: sdhci: Timeout:   0x0000008e | Int stat: 0x00000000
[ 2178.405413] mmc0: sdhci: Int enab:  0x107f000b | Sig enab: 0x107f000b
[ 2178.405420] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00001201
[ 2178.405426] mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x08100810
[ 2178.405432] mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
[ 2178.405438] mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
[ 2178.405444] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[ 2178.405449] mmc0: sdhci: Host ctl2: 0x00000000
[ 2178.405456] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xef041208
[ 2178.405460] mmc0: sdhci: ============================================
[ 2178.405681] CPU: 0 PID: 140 Comm: kworker/0:4H Not tainted 5.8.3 #1
[ 2178.405689] Hardware name: Freescale i.MX53 (Device Tree Support)
[ 2178.405715] Workqueue: mmc_complete mmc_blk_mq_complete_work
[ 2178.405723] Backtrace: 
[ 2178.405745] [<8010e0bc>] (dump_backtrace) from [<8010e480>] (show_stack+0x20/0x24)
[ 2178.405755]  r7:00000005 r6:200b0013 r5:00000000 r4:80f46ff4
[ 2178.405775] [<8010e460>] (show_stack) from [<8046c024>] (dump_stack+0xa4/0xb8)
[ 2178.405795] [<8046bf80>] (dump_stack) from [<80765890>] (mmc_wait_for_cmd+0x94/0xd0)
[ 2178.405804]  r7:00000005 r6:80f04f48 r5:00000000 r4:ed7abe14
[ 2178.405817] [<807657fc>] (mmc_wait_for_cmd) from [<8076c338>] (__mmc_send_status+0x78/0xa8)
[ 2178.405826]  r8:00000000 r7:00000005 r6:ed7abe68 r5:ee137800 r4:80f04f48
[ 2178.405844] [<8076c2c0>] (__mmc_send_status) from [<80775d7c>] (card_busy_detect+0x5c/0xf8)
[ 2178.405853]  r7:ee137800 r6:0002ddba r5:80f03d00 r4:0002d9d2
[ 2178.405864] [<80775d20>] (card_busy_detect) from [<80778d68>] (mmc_blk_mq_complete_prev_req.part.4+0x154/0x254)
[ 2178.405874]  r10:ee137800 r9:ee3034a8 r8:00000000 r7:ee21eec4 r6:80f04f48 r5:ee303400
[ 2178.405879]  r4:ee21ee08
[ 2178.405889] [<80778c14>] (mmc_blk_mq_complete_prev_req.part.4) from [<80779704>] (mmc_blk_mq_complete_work+0x30/0x34)
[ 2178.405899]  r10:80faaa50 r9:00000000 r8:00000000 r7:eefe4800 r6:eefdcb00 r5:ed5a4700
[ 2178.405905]  r4:ee21eed8
[ 2178.405917] [<807796d4>] (mmc_blk_mq_complete_work) from [<8014221c>] (process_one_work+0x238/0x590)
[ 2178.405926] [<80141fe4>] (process_one_work) from [<80142a20>] (worker_thread+0x60/0x5b8)
[ 2178.405936]  r10:eefdcb00 r9:80f03d00 r8:eefdcb18 r7:00000008 r6:eefdcb00 r5:ed5a4714
[ 2178.405941]  r4:ed5a4700
[ 2178.405956] [<801429c0>] (worker_thread) from [<8014957c>] (kthread+0x170/0x174)
[ 2178.405966]  r10:ed92de74 r9:ed5a4700 r8:801429c0 r7:ed7aa000 r6:00000000 r5:edbd8ac0
[ 2178.405971]  r4:eda67180
[ 2178.405982] [<8014940c>] (kthread) from [<80100128>] (ret_from_fork+0x14/0x2c)
[ 2178.405989] Exception stack(0xed7abfb0 to 0xed7abff8)
[ 2178.405997] bfa0:                                     00000000 00000000 00000000 00000000
[ 2178.406007] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 2178.406015] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 2178.406027]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:8014940c
[ 2178.406032]  r4:edbd8ac0

-Claus-

> 
> > > > > > Timeouts do not occur with every build. After some debugging I
> > > > > > have found that timeouts seem to depend on code alignment of
> > > > > > the esdhc_readl_le function. I have bisected the behavior by
> > > > > > using the System.map and moving/padding the code with NOP
> > > > > > instructions (mov r0,r0).
> > > > > >
> > > > > > My test case has 5 processes continuously creating a file,
> > > > > > writing random long data, reading data and deleting the file.
> > > > > > It seems that when the esdhc_writel_le is aligned on a certain
> > > > > > address then the timeout will occur about 5 times/12h using my
> > > > > > test case. If I add one more NOP, the timeout will not occur
> > > > > > at all. If I continue adding some more NOPs the timeouts come
> > > > > > back. Seems that it doesn't matter where in the code I add
> > > > > > NOPs as long as the address is below the address of esdhc_writel_le.
> > > > > >
> > > > > > We also run the same software on a dual core i.MX6 without any
> > > > > > timeout issues.
> > > > >
> > > > > And the same kernel binary is also used on an i.MX6 single core
> > > > > (albeit with different SW) withot triggering the problem so far.
> > > > >
> > > > > > I have reproduced this with kernel version 4.19.94 and 5.8.3
> > > > > > and we have compiled with both gcc8 and gcc9. I'm still
> > > > > > searching for the root cause and I would appreciate any
> > > > > > thoughts about where to go
> > > next.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > -Claus-
> > > > >
> > > > > To me it looks like it might involve an unknown hardware errata
> > > > > for i.MX53, but there has been one similar report before
> > > > > (unfortunately without the full register dump) involving virtualization:
> > > > >
> > > > > https://patchwork.kernel.org/patch/10705823/
> > > > >
> > > > > Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.
> > > > >
> > > > > -- Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-09-07 12:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-01  7:37 mmc0: Timeout waiting for hardware cmd interrupt on i.MX535 Baumgartner, Claus (GE Healthcare)
2020-09-01 11:46 ` Sebastian Reichel
2020-09-02 11:24   ` Bough Chen
2020-09-02 13:49     ` Sebastian Reichel
2020-09-03  2:10       ` Bough Chen
2020-09-03 13:29         ` Sebastian Reichel
2020-09-07 10:50           ` EXT: " Baumgartner, Claus (GE Healthcare)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.