linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [REGRESSION] sdhci no longer detects SD cards on LX2160A
@ 2019-09-16 17:15 Russell King - ARM Linux admin
  2019-09-16 22:57 ` Russell King - ARM Linux admin
  2019-09-17  8:06 ` Marc Gonzalez
  0 siblings, 2 replies; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-16 17:15 UTC (permalink / raw)
  To: Adrian Hunter, linux-mmc, linux-arm-kernel

Hi,

It seems that somewhere between v5.2 and v5.3, sdhci fails to detect
SD cards on the NXP LX2160A, but continues to work with eMMC.
This uses the sdhci-esdhc driver.

v5.2:

sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
sdhci-pltfm: SDHCI platform and OF driver helper
mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit
mmc1: SDHCI controller on 2150000.esdhc [2150000.esdhc] using ADMA 64-bit
mmc0: new high speed SDHC card at address aaaa
mmcblk0: mmc0:aaaa SU04G 3.69 GiB
mmc1: new HS400 MMC card at address 0001
mmcblk1: mmc1:0001 DF4064 58.2 GiB
mmcblk1boot0: mmc1:0001 DF4064 partition 1 4.00 MiB
mmcblk1boot1: mmc1:0001 DF4064 partition 2 4.00 MiB
mmcblk1rpmb: mmc1:0001 DF4064 partition 3 4.00 MiB, chardev (247:0)
 mmcblk1: p1

v5.3:

sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
sdhci-pltfm: SDHCI platform and OF driver helper
mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit
mmc1: SDHCI controller on 2150000.esdhc [2150000.esdhc] using ADMA 64-bit
mmc0: ADMA error
mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
mmc0: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000013
mmc0: sdhci: Present:   0x01f50008 | Host ctl: 0x00000038
mmc0: sdhci: Power:     0x00000003 | Blk gap:  0x00000000
mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x000040d8
mmc0: sdhci: Timeout:   0x00000003 | Int stat: 0x00000001
mmc0: sdhci: Int enab:  0x037f108f | Sig enab: 0x037f108b
mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202
mmc0: sdhci: Caps:      0x35fa0000 | Caps_1:   0x0000af00
mmc0: sdhci: Cmd:       0x0000333a | Max curr: 0x00000000
mmc0: sdhci: Resp[0]:   0x00000920 | Resp[1]:  0x001d8a33
mmc0: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x3f400e00
mmc0: sdhci: Host ctl2: 0x00000000
mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236d43820c
mmc0: sdhci: ============================================
mmc0: error -5 whilst initialising SD card
mmc0: ADMA error
mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
mmc0: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000013
mmc0: sdhci: Present:   0x01f50008 | Host ctl: 0x00000038
mmc0: sdhci: Power:     0x00000003 | Blk gap:  0x00000000
mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00008098
mmc0: sdhci: Timeout:   0x00000002 | Int stat: 0x00000001
mmc0: sdhci: Int enab:  0x037f108f | Sig enab: 0x037f108b
mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202
mmc0: sdhci: Caps:      0x35fa0000 | Caps_1:   0x0000af00
mmc0: sdhci: Cmd:       0x0000333a | Max curr: 0x00000000
mmc0: sdhci: Resp[0]:   0x00000920 | Resp[1]:  0x001d8a33
mmc0: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x3f400e00
mmc0: sdhci: Host ctl2: 0x00000000
mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236d43820c
mmc0: sdhci: ============================================
mmc0: error -5 whilst initialising SD card
mmc1: new HS400 MMC card at address 0001
mmcblk1: mmc1:0001 DF4064 58.2 GiB
mmcblk1boot0: mmc1:0001 DF4064 partition 1 4.00 MiB
mmcblk1boot1: mmc1:0001 DF4064 partition 2 4.00 MiB
mmcblk1rpmb: mmc1:0001 DF4064 partition 3 4.00 MiB, chardev (247:0)
 mmcblk1: p1
mmc0: ADMA error
mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
mmc0: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000013
mmc0: sdhci: Present:   0x01f50008 | Host ctl: 0x00000038
mmc0: sdhci: Power:     0x00000003 | Blk gap:  0x00000000
mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x000080d8
mmc0: sdhci: Timeout:   0x00000002 | Int stat: 0x00000001
mmc0: sdhci: Int enab:  0x037f108f | Sig enab: 0x037f108b
mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202
mmc0: sdhci: Caps:      0x35fa0000 | Caps_1:   0x0000af00
mmc0: sdhci: Cmd:       0x0000333a | Max curr: 0x00000000
mmc0: sdhci: Resp[0]:   0x00000920 | Resp[1]:  0x001d8a33
mmc0: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x3f400e00
mmc0: sdhci: Host ctl2: 0x00000000
mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236d43820c
mmc0: sdhci: ============================================
mmc0: error -5 whilst initialising SD card
mmc0: ADMA error
mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
mmc0: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000013
mmc0: sdhci: Present:   0x01f50008 | Host ctl: 0x00000038
mmc0: sdhci: Power:     0x00000003 | Blk gap:  0x00000000
mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x000080f8
mmc0: sdhci: Timeout:   0x00000002 | Int stat: 0x00000001
mmc0: sdhci: Int enab:  0x037f108f | Sig enab: 0x037f108b
mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202
mmc0: sdhci: Caps:      0x35fa0000 | Caps_1:   0x0000af00
mmc0: sdhci: Cmd:       0x0000333a | Max curr: 0x00000000
mmc0: sdhci: Resp[0]:   0x00000920 | Resp[1]:  0x001d8a33
mmc0: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x3f400e00
mmc0: sdhci: Host ctl2: 0x00000000
mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236d43820c
mmc0: sdhci: ============================================
mmc0: error -5 whilst initialising SD card

The platform has an iommu, which is in pass-through mode, via
arm_smmu.disable_bypass=0.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-16 17:15 [REGRESSION] sdhci no longer detects SD cards on LX2160A Russell King - ARM Linux admin
@ 2019-09-16 22:57 ` Russell King - ARM Linux admin
  2019-09-17  8:06 ` Marc Gonzalez
  1 sibling, 0 replies; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-16 22:57 UTC (permalink / raw)
  To: Adrian Hunter, linux-mmc, linux-arm-kernel

The problem below seems to be sporadic with v5.3, which makes finding
its cause quite difficult - and makes bisecting useless.  I'd describe
it as "likely" to affect either mmc0 or mmc1 on v5.3.  I haven't seen
it at all with v5.2.

I'll see whether I can pin anything down tomorrow.

On Mon, Sep 16, 2019 at 06:15:10PM +0100, Russell King - ARM Linux admin wrote:
> Hi,
> 
> It seems that somewhere between v5.2 and v5.3, sdhci fails to detect
> SD cards on the NXP LX2160A, but continues to work with eMMC.
> This uses the sdhci-esdhc driver.
> 
> v5.2:
> 
> sdhci: Secure Digital Host Controller Interface driver
> sdhci: Copyright(c) Pierre Ossman
> sdhci-pltfm: SDHCI platform and OF driver helper
> mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit
> mmc1: SDHCI controller on 2150000.esdhc [2150000.esdhc] using ADMA 64-bit
> mmc0: new high speed SDHC card at address aaaa
> mmcblk0: mmc0:aaaa SU04G 3.69 GiB
> mmc1: new HS400 MMC card at address 0001
> mmcblk1: mmc1:0001 DF4064 58.2 GiB
> mmcblk1boot0: mmc1:0001 DF4064 partition 1 4.00 MiB
> mmcblk1boot1: mmc1:0001 DF4064 partition 2 4.00 MiB
> mmcblk1rpmb: mmc1:0001 DF4064 partition 3 4.00 MiB, chardev (247:0)
>  mmcblk1: p1
> 
> v5.3:
> 
> sdhci: Secure Digital Host Controller Interface driver
> sdhci: Copyright(c) Pierre Ossman
> sdhci-pltfm: SDHCI platform and OF driver helper
> mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit
> mmc1: SDHCI controller on 2150000.esdhc [2150000.esdhc] using ADMA 64-bit
> mmc0: ADMA error
> mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> mmc0: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
> mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000013
> mmc0: sdhci: Present:   0x01f50008 | Host ctl: 0x00000038
> mmc0: sdhci: Power:     0x00000003 | Blk gap:  0x00000000
> mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x000040d8
> mmc0: sdhci: Timeout:   0x00000003 | Int stat: 0x00000001
> mmc0: sdhci: Int enab:  0x037f108f | Sig enab: 0x037f108b
> mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202
> mmc0: sdhci: Caps:      0x35fa0000 | Caps_1:   0x0000af00
> mmc0: sdhci: Cmd:       0x0000333a | Max curr: 0x00000000
> mmc0: sdhci: Resp[0]:   0x00000920 | Resp[1]:  0x001d8a33
> mmc0: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x3f400e00
> mmc0: sdhci: Host ctl2: 0x00000000
> mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236d43820c
> mmc0: sdhci: ============================================
> mmc0: error -5 whilst initialising SD card
> mmc0: ADMA error
> mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> mmc0: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
> mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000013
> mmc0: sdhci: Present:   0x01f50008 | Host ctl: 0x00000038
> mmc0: sdhci: Power:     0x00000003 | Blk gap:  0x00000000
> mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00008098
> mmc0: sdhci: Timeout:   0x00000002 | Int stat: 0x00000001
> mmc0: sdhci: Int enab:  0x037f108f | Sig enab: 0x037f108b
> mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202
> mmc0: sdhci: Caps:      0x35fa0000 | Caps_1:   0x0000af00
> mmc0: sdhci: Cmd:       0x0000333a | Max curr: 0x00000000
> mmc0: sdhci: Resp[0]:   0x00000920 | Resp[1]:  0x001d8a33
> mmc0: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x3f400e00
> mmc0: sdhci: Host ctl2: 0x00000000
> mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236d43820c
> mmc0: sdhci: ============================================
> mmc0: error -5 whilst initialising SD card
> mmc1: new HS400 MMC card at address 0001
> mmcblk1: mmc1:0001 DF4064 58.2 GiB
> mmcblk1boot0: mmc1:0001 DF4064 partition 1 4.00 MiB
> mmcblk1boot1: mmc1:0001 DF4064 partition 2 4.00 MiB
> mmcblk1rpmb: mmc1:0001 DF4064 partition 3 4.00 MiB, chardev (247:0)
>  mmcblk1: p1
> mmc0: ADMA error
> mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> mmc0: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
> mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000013
> mmc0: sdhci: Present:   0x01f50008 | Host ctl: 0x00000038
> mmc0: sdhci: Power:     0x00000003 | Blk gap:  0x00000000
> mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x000080d8
> mmc0: sdhci: Timeout:   0x00000002 | Int stat: 0x00000001
> mmc0: sdhci: Int enab:  0x037f108f | Sig enab: 0x037f108b
> mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202
> mmc0: sdhci: Caps:      0x35fa0000 | Caps_1:   0x0000af00
> mmc0: sdhci: Cmd:       0x0000333a | Max curr: 0x00000000
> mmc0: sdhci: Resp[0]:   0x00000920 | Resp[1]:  0x001d8a33
> mmc0: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x3f400e00
> mmc0: sdhci: Host ctl2: 0x00000000
> mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236d43820c
> mmc0: sdhci: ============================================
> mmc0: error -5 whilst initialising SD card
> mmc0: ADMA error
> mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> mmc0: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
> mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000013
> mmc0: sdhci: Present:   0x01f50008 | Host ctl: 0x00000038
> mmc0: sdhci: Power:     0x00000003 | Blk gap:  0x00000000
> mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x000080f8
> mmc0: sdhci: Timeout:   0x00000002 | Int stat: 0x00000001
> mmc0: sdhci: Int enab:  0x037f108f | Sig enab: 0x037f108b
> mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202
> mmc0: sdhci: Caps:      0x35fa0000 | Caps_1:   0x0000af00
> mmc0: sdhci: Cmd:       0x0000333a | Max curr: 0x00000000
> mmc0: sdhci: Resp[0]:   0x00000920 | Resp[1]:  0x001d8a33
> mmc0: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x3f400e00
> mmc0: sdhci: Host ctl2: 0x00000000
> mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236d43820c
> mmc0: sdhci: ============================================
> mmc0: error -5 whilst initialising SD card
> 
> The platform has an iommu, which is in pass-through mode, via
> arm_smmu.disable_bypass=0.
> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-16 17:15 [REGRESSION] sdhci no longer detects SD cards on LX2160A Russell King - ARM Linux admin
  2019-09-16 22:57 ` Russell King - ARM Linux admin
@ 2019-09-17  8:06 ` Marc Gonzalez
  2019-09-17  8:19   ` Russell King - ARM Linux admin
  1 sibling, 1 reply; 31+ messages in thread
From: Marc Gonzalez @ 2019-09-17  8:06 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Adrian Hunter; +Cc: linux-mmc, Linux ARM

On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:

> The platform has an iommu, which is in pass-through mode, via
> arm_smmu.disable_bypass=0.

Could be 954a03be033c7cef80ddc232e7cbdb17df735663
"iommu/arm-smmu: Break insecure users by disabling bypass by default"

Although it had already landed in v5.2

Regards.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17  8:06 ` Marc Gonzalez
@ 2019-09-17  8:19   ` Russell King - ARM Linux admin
  2019-09-17 10:42     ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17  8:19 UTC (permalink / raw)
  To: Marc Gonzalez; +Cc: linux-mmc, Adrian Hunter, Linux ARM

On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> 
> > The platform has an iommu, which is in pass-through mode, via
> > arm_smmu.disable_bypass=0.
> 
> Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> 
> Although it had already landed in v5.2

It is not - and the two lines that you quoted above are sufficient
to negate that as a cause.  (Please read the help for the option that
the commit referrs to.)

In fact, with bypass disabled, the SoC fails due to other masters.
That's already been discussed privately between myself and Will
Deacon.

arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
the default setting in the Kconfig.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17  8:19   ` Russell King - ARM Linux admin
@ 2019-09-17 10:42     ` Russell King - ARM Linux admin
  2019-09-17 11:16       ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 10:42 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: linux-mmc, Linux ARM

On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> > 
> > > The platform has an iommu, which is in pass-through mode, via
> > > arm_smmu.disable_bypass=0.
> > 
> > Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> > "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> > 
> > Although it had already landed in v5.2
> 
> It is not - and the two lines that you quoted above are sufficient
> to negate that as a cause.  (Please read the help for the option that
> the commit referrs to.)
> 
> In fact, with bypass disabled, the SoC fails due to other masters.
> That's already been discussed privately between myself and Will
> Deacon.
> 
> arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
> the default setting in the Kconfig.

Adding some further debugging, and fixing the existing ADMA debugging
shows:

mmc0: ADMA error: 0x02000000

So this is an ADMA error without the transfer having completed.

mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001

The block size is 8, with one block.

mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c

The ADMA error is a descriptor error at address 0x000000236df1d20c.
The descriptor table contains (including the following entry):

mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00

The descriptor table contains one descriptor of 8 bytes, is marked
as the last (END bit set) and is at DMA address 0x236df1d200.  The
following descriptor is empty, with VALID=0.

One may be tempted to blame it on the following descriptor, but having
had another example on eMMC while userspace was booting (rootfs on
eMMC):

mmc1: ADMA error: 0x02000000
mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21

... which is interesting for several reasons:
- The ADMA error register indicates a length mismatch error.  The
  transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
  Summing the ADMA lengths up to the last descriptor (length=0 is
  0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
  bytes than the requested transfer.

- The ADMA error register indicates ST_CADR, which is described as
  "This state is never set because do not generate ADMA error in this
  state."

- The error descriptor is again after the descriptor with END=1, but
  this time has VALID=1.

This _feels_ like a coherency issue, where the SDHCI engine is not
correctly seeing the descriptor table, but then I would have expected
userspace (which is basically debian stable) to fail to boot every
time given that its rootfs is on eMMC.

The other weird thing is if I wind the core MMC code back via:

$ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R

and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
completely stable, but way better than plain v5.3.  I don't see
much in that diff which would be responsible for this - although it
does seem that hch's DMA changes do make the problem more likely.
(going from 1 in 3 boots with a problem to being not able to boot.)

Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
bypass mode on the IOMMU (but then I saw global smmu errors right
from when the IOMMU had bypass disabled before MMC was probed - the
reason being is the SoC is not currently setup to have the MMU
bypass mode disabled.)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 10:42     ` Russell King - ARM Linux admin
@ 2019-09-17 11:16       ` Russell King - ARM Linux admin
  2019-09-17 11:42         ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 11:16 UTC (permalink / raw)
  To: Adrian Hunter, Will Deacon; +Cc: linux-mmc, Linux ARM

On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
> > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> > > 
> > > > The platform has an iommu, which is in pass-through mode, via
> > > > arm_smmu.disable_bypass=0.
> > > 
> > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> > > "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> > > 
> > > Although it had already landed in v5.2
> > 
> > It is not - and the two lines that you quoted above are sufficient
> > to negate that as a cause.  (Please read the help for the option that
> > the commit referrs to.)
> > 
> > In fact, with bypass disabled, the SoC fails due to other masters.
> > That's already been discussed privately between myself and Will
> > Deacon.
> > 
> > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
> > the default setting in the Kconfig.
> 
> Adding some further debugging, and fixing the existing ADMA debugging
> shows:
> 
> mmc0: ADMA error: 0x02000000
> 
> So this is an ADMA error without the transfer having completed.
> 
> mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> 
> The block size is 8, with one block.
> 
> mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
> 
> The ADMA error is a descriptor error at address 0x000000236df1d20c.
> The descriptor table contains (including the following entry):
> 
> mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
> mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
> 
> The descriptor table contains one descriptor of 8 bytes, is marked
> as the last (END bit set) and is at DMA address 0x236df1d200.  The
> following descriptor is empty, with VALID=0.
> 
> One may be tempted to blame it on the following descriptor, but having
> had another example on eMMC while userspace was booting (rootfs on
> eMMC):
> 
> mmc1: ADMA error: 0x02000000
> mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
> mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
> mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
> mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
> mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
> mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
> mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
> mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
> mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
> mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
> mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
> mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
> 
> ... which is interesting for several reasons:
> - The ADMA error register indicates a length mismatch error.  The
>   transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
>   Summing the ADMA lengths up to the last descriptor (length=0 is
>   0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
>   bytes than the requested transfer.
> 
> - The ADMA error register indicates ST_CADR, which is described as
>   "This state is never set because do not generate ADMA error in this
>   state."
> 
> - The error descriptor is again after the descriptor with END=1, but
>   this time has VALID=1.
> 
> This _feels_ like a coherency issue, where the SDHCI engine is not
> correctly seeing the descriptor table, but then I would have expected
> userspace (which is basically debian stable) to fail to boot every
> time given that its rootfs is on eMMC.
> 
> The other weird thing is if I wind the core MMC code back via:
> 
> $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
> 
> and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
> completely stable, but way better than plain v5.3.  I don't see
> much in that diff which would be responsible for this - although it
> does seem that hch's DMA changes do make the problem more likely.
> (going from 1 in 3 boots with a problem to being not able to boot.)
> 
> Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
> bypass mode on the IOMMU (but then I saw global smmu errors right
> from when the IOMMU had bypass disabled before MMC was probed - the
> reason being is the SoC is not currently setup to have the MMU
> bypass mode disabled.)

This looks like an ARM64 coherency issue.

I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
which had no effect.  I then tried adding:

+       __dma_flush_area(host->adma_table, desc - host->adma_table);
+       dma_wmb();

and so far I haven't had any further ADMA errors.  Adding Will Deacon
to the thread.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 11:16       ` Russell King - ARM Linux admin
@ 2019-09-17 11:42         ` Russell King - ARM Linux admin
  2019-09-17 12:33           ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 11:42 UTC (permalink / raw)
  To: Adrian Hunter, Will Deacon; +Cc: linux-mmc, Linux ARM

On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
> > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
> > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> > > > 
> > > > > The platform has an iommu, which is in pass-through mode, via
> > > > > arm_smmu.disable_bypass=0.
> > > > 
> > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> > > > 
> > > > Although it had already landed in v5.2
> > > 
> > > It is not - and the two lines that you quoted above are sufficient
> > > to negate that as a cause.  (Please read the help for the option that
> > > the commit referrs to.)
> > > 
> > > In fact, with bypass disabled, the SoC fails due to other masters.
> > > That's already been discussed privately between myself and Will
> > > Deacon.
> > > 
> > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
> > > the default setting in the Kconfig.
> > 
> > Adding some further debugging, and fixing the existing ADMA debugging
> > shows:
> > 
> > mmc0: ADMA error: 0x02000000
> > 
> > So this is an ADMA error without the transfer having completed.
> > 
> > mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> > 
> > The block size is 8, with one block.
> > 
> > mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
> > 
> > The ADMA error is a descriptor error at address 0x000000236df1d20c.
> > The descriptor table contains (including the following entry):
> > 
> > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
> > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
> > 
> > The descriptor table contains one descriptor of 8 bytes, is marked
> > as the last (END bit set) and is at DMA address 0x236df1d200.  The
> > following descriptor is empty, with VALID=0.
> > 
> > One may be tempted to blame it on the following descriptor, but having
> > had another example on eMMC while userspace was booting (rootfs on
> > eMMC):
> > 
> > mmc1: ADMA error: 0x02000000
> > mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
> > mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
> > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
> > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
> > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
> > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
> > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
> > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
> > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
> > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
> > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
> > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
> > 
> > ... which is interesting for several reasons:
> > - The ADMA error register indicates a length mismatch error.  The
> >   transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
> >   Summing the ADMA lengths up to the last descriptor (length=0 is
> >   0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
> >   bytes than the requested transfer.
> > 
> > - The ADMA error register indicates ST_CADR, which is described as
> >   "This state is never set because do not generate ADMA error in this
> >   state."
> > 
> > - The error descriptor is again after the descriptor with END=1, but
> >   this time has VALID=1.
> > 
> > This _feels_ like a coherency issue, where the SDHCI engine is not
> > correctly seeing the descriptor table, but then I would have expected
> > userspace (which is basically debian stable) to fail to boot every
> > time given that its rootfs is on eMMC.
> > 
> > The other weird thing is if I wind the core MMC code back via:
> > 
> > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
> > 
> > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
> > completely stable, but way better than plain v5.3.  I don't see
> > much in that diff which would be responsible for this - although it
> > does seem that hch's DMA changes do make the problem more likely.
> > (going from 1 in 3 boots with a problem to being not able to boot.)
> > 
> > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
> > bypass mode on the IOMMU (but then I saw global smmu errors right
> > from when the IOMMU had bypass disabled before MMC was probed - the
> > reason being is the SoC is not currently setup to have the MMU
> > bypass mode disabled.)
> 
> This looks like an ARM64 coherency issue.
> 
> I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
> which had no effect.  I then tried adding:
> 
> +       __dma_flush_area(host->adma_table, desc - host->adma_table);
> +       dma_wmb();
> 
> and so far I haven't had any further ADMA errors.  Adding Will Deacon
> to the thread.

These are the changes to sdhci that I'm currently running.  I think
some of the debugging related changes are probably worth adding to
the driver, particularly printing the intmask on ADMA error (which
is not printed by the register dump, as the value is lost) and printing
the DMA addresses of the descriptor table entries which can be tied
up with the DMA address error register.  Also, maybe printing the
DMA descriptor table with the register dump, rather than having to
resort to enabling debug would be a good idea?

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index a5dc5aae973e..884dcaa9cad5 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
 		/* Add a terminating entry - nop, end, valid */
 		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
 	}
+	__dma_flush_area(host->adma_table, desc - host->adma_table);
+	dma_wmb();
 }
 
 static void sdhci_adma_table_post(struct sdhci_host *host,
@@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
 static void sdhci_adma_show_error(struct sdhci_host *host)
 {
 	void *desc = host->adma_table;
+	dma_addr_t dma = host->adma_addr;
+	bool end = false;
 
 	sdhci_dumpregs(host);
 
@@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
 		struct sdhci_adma2_64_desc *dma_desc = desc;
 
 		if (host->flags & SDHCI_USE_64_BIT_DMA)
-			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
-			    desc, le32_to_cpu(dma_desc->addr_hi),
+			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
+			    (unsigned long long)dma,
+			    le32_to_cpu(dma_desc->addr_hi),
 			    le32_to_cpu(dma_desc->addr_lo),
 			    le16_to_cpu(dma_desc->len),
 			    le16_to_cpu(dma_desc->cmd));
 		else
-			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
-			    desc, le32_to_cpu(dma_desc->addr_lo),
+			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
+			    (unsigned long long)dma,
+			    le32_to_cpu(dma_desc->addr_lo),
 			    le16_to_cpu(dma_desc->len),
 			    le16_to_cpu(dma_desc->cmd));
 
+		if (end) break;
+
 		desc += host->desc_sz;
+		dma += host->desc_sz;
 
 		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
-			break;
+			end = true;
 	}
 }
 
@@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
 			!= MMC_BUS_TEST_R)
 		host->data->error = -EILSEQ;
 	else if (intmask & SDHCI_INT_ADMA_ERROR) {
-		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
+		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
 		sdhci_adma_show_error(host);
 		host->data->error = -EIO;
 		if (host->ops->adma_workaround)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 11:42         ` Russell King - ARM Linux admin
@ 2019-09-17 12:33           ` Russell King - ARM Linux admin
  2019-09-17 13:03             ` Robin Murphy
  2019-09-17 13:07             ` Russell King - ARM Linux admin
  0 siblings, 2 replies; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 12:33 UTC (permalink / raw)
  To: Adrian Hunter, Will Deacon; +Cc: linux-mmc, Linux ARM

On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
> > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
> > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
> > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> > > > > 
> > > > > > The platform has an iommu, which is in pass-through mode, via
> > > > > > arm_smmu.disable_bypass=0.
> > > > > 
> > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> > > > > 
> > > > > Although it had already landed in v5.2
> > > > 
> > > > It is not - and the two lines that you quoted above are sufficient
> > > > to negate that as a cause.  (Please read the help for the option that
> > > > the commit referrs to.)
> > > > 
> > > > In fact, with bypass disabled, the SoC fails due to other masters.
> > > > That's already been discussed privately between myself and Will
> > > > Deacon.
> > > > 
> > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
> > > > the default setting in the Kconfig.
> > > 
> > > Adding some further debugging, and fixing the existing ADMA debugging
> > > shows:
> > > 
> > > mmc0: ADMA error: 0x02000000
> > > 
> > > So this is an ADMA error without the transfer having completed.
> > > 
> > > mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> > > 
> > > The block size is 8, with one block.
> > > 
> > > mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
> > > 
> > > The ADMA error is a descriptor error at address 0x000000236df1d20c.
> > > The descriptor table contains (including the following entry):
> > > 
> > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
> > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
> > > 
> > > The descriptor table contains one descriptor of 8 bytes, is marked
> > > as the last (END bit set) and is at DMA address 0x236df1d200.  The
> > > following descriptor is empty, with VALID=0.
> > > 
> > > One may be tempted to blame it on the following descriptor, but having
> > > had another example on eMMC while userspace was booting (rootfs on
> > > eMMC):
> > > 
> > > mmc1: ADMA error: 0x02000000
> > > mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
> > > mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
> > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
> > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
> > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
> > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
> > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
> > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
> > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
> > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
> > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
> > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
> > > 
> > > ... which is interesting for several reasons:
> > > - The ADMA error register indicates a length mismatch error.  The
> > >   transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
> > >   Summing the ADMA lengths up to the last descriptor (length=0 is
> > >   0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
> > >   bytes than the requested transfer.
> > > 
> > > - The ADMA error register indicates ST_CADR, which is described as
> > >   "This state is never set because do not generate ADMA error in this
> > >   state."
> > > 
> > > - The error descriptor is again after the descriptor with END=1, but
> > >   this time has VALID=1.
> > > 
> > > This _feels_ like a coherency issue, where the SDHCI engine is not
> > > correctly seeing the descriptor table, but then I would have expected
> > > userspace (which is basically debian stable) to fail to boot every
> > > time given that its rootfs is on eMMC.
> > > 
> > > The other weird thing is if I wind the core MMC code back via:
> > > 
> > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
> > > 
> > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
> > > completely stable, but way better than plain v5.3.  I don't see
> > > much in that diff which would be responsible for this - although it
> > > does seem that hch's DMA changes do make the problem more likely.
> > > (going from 1 in 3 boots with a problem to being not able to boot.)
> > > 
> > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
> > > bypass mode on the IOMMU (but then I saw global smmu errors right
> > > from when the IOMMU had bypass disabled before MMC was probed - the
> > > reason being is the SoC is not currently setup to have the MMU
> > > bypass mode disabled.)
> > 
> > This looks like an ARM64 coherency issue.
> > 
> > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
> > which had no effect.  I then tried adding:
> > 
> > +       __dma_flush_area(host->adma_table, desc - host->adma_table);
> > +       dma_wmb();
> > 
> > and so far I haven't had any further ADMA errors.  Adding Will Deacon
> > to the thread.
> 
> These are the changes to sdhci that I'm currently running.  I think
> some of the debugging related changes are probably worth adding to
> the driver, particularly printing the intmask on ADMA error (which
> is not printed by the register dump, as the value is lost) and printing
> the DMA addresses of the descriptor table entries which can be tied
> up with the DMA address error register.  Also, maybe printing the
> DMA descriptor table with the register dump, rather than having to
> resort to enabling debug would be a good idea?
> 
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index a5dc5aae973e..884dcaa9cad5 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
>  		/* Add a terminating entry - nop, end, valid */
>  		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
>  	}
> +	__dma_flush_area(host->adma_table, desc - host->adma_table);
> +	dma_wmb();
>  }
>  
>  static void sdhci_adma_table_post(struct sdhci_host *host,
> @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
>  static void sdhci_adma_show_error(struct sdhci_host *host)
>  {
>  	void *desc = host->adma_table;
> +	dma_addr_t dma = host->adma_addr;
> +	bool end = false;
>  
>  	sdhci_dumpregs(host);
>  
> @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
>  		struct sdhci_adma2_64_desc *dma_desc = desc;
>  
>  		if (host->flags & SDHCI_USE_64_BIT_DMA)
> -			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> -			    desc, le32_to_cpu(dma_desc->addr_hi),
> +			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> +			    (unsigned long long)dma,
> +			    le32_to_cpu(dma_desc->addr_hi),
>  			    le32_to_cpu(dma_desc->addr_lo),
>  			    le16_to_cpu(dma_desc->len),
>  			    le16_to_cpu(dma_desc->cmd));
>  		else
> -			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> -			    desc, le32_to_cpu(dma_desc->addr_lo),
> +			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> +			    (unsigned long long)dma,
> +			    le32_to_cpu(dma_desc->addr_lo),
>  			    le16_to_cpu(dma_desc->len),
>  			    le16_to_cpu(dma_desc->cmd));
>  
> +		if (end) break;
> +
>  		desc += host->desc_sz;
> +		dma += host->desc_sz;
>  
>  		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
> -			break;
> +			end = true;
>  	}
>  }
>  
> @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>  			!= MMC_BUS_TEST_R)
>  		host->data->error = -EILSEQ;
>  	else if (intmask & SDHCI_INT_ADMA_ERROR) {
> -		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
> +		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
>  		sdhci_adma_show_error(host);
>  		host->data->error = -EIO;
>  		if (host->ops->adma_workaround)

Further debug shows:

coherent=0 - sdhci device is not cache coherent
swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
[ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
pmd=000000237fffb003, pte=00e800236d62270f

The mapping for the ADMA table seems to be using MAIR index 3, which is
MT_MEMORY_NC, so should be non-cacheable.

vmallocinfo:
0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
user

So this memory has been remapped.  Could there be an alias that has
cache lines still in the cache for the physical address, and could we
be hitting those cache lines while accessing through a non-cacheable
mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
definitely _feels_ like it has unpredictable attributes!)

Also, given that this memory is mapped NC, then surely
__dma_flush_area() should have no effect?  However, it _does_ have the
effect of reliably solving the problem, which to me implies that there
_are_ cache lines in this NC mapping.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 12:33           ` Russell King - ARM Linux admin
@ 2019-09-17 13:03             ` Robin Murphy
  2019-09-17 13:28               ` Russell King - ARM Linux admin
  2019-09-17 13:07             ` Russell King - ARM Linux admin
  1 sibling, 1 reply; 31+ messages in thread
From: Robin Murphy @ 2019-09-17 13:03 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Adrian Hunter, Will Deacon
  Cc: linux-mmc, Linux ARM

On 17/09/2019 13:33, Russell King - ARM Linux admin wrote:
[...]
> Further debug shows:
> 
> coherent=0 - sdhci device is not cache coherent
> swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
> [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
> pmd=000000237fffb003, pte=00e800236d62270f
> 
> The mapping for the ADMA table seems to be using MAIR index 3, which is
> MT_MEMORY_NC, so should be non-cacheable.
> 
> vmallocinfo:
> 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
> user
> 
> So this memory has been remapped.  Could there be an alias that has
> cache lines still in the cache for the physical address, and could we
> be hitting those cache lines while accessing through a non-cacheable
> mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
> definitely _feels_ like it has unpredictable attributes!)
> 
> Also, given that this memory is mapped NC, then surely
> __dma_flush_area() should have no effect?  However, it _does_ have the
> effect of reliably solving the problem, which to me implies that there
> _are_ cache lines in this NC mapping.

The non-cacheable mapping of the descriptor table will still have its 
cacheable linear map alias, so it's quite likely that the invalidate 
aspect of __dma_flush_area(), rather than the clean, is what's making 
the difference - if using __dma_clean_area() instead doesn't help, it 
would more or less confirm that.

One possibility in that case is that you might actually have the rare 
backwards coherency problem - if the device *is* actually snooping the 
cache, then it could hit lines which were speculatively prefetched via 
the cacheable alias before the descriptors were fully written, rather 
than the up-to-date data which went straight to RAM via the NC mapping. 
I'd try declaring the device as "dma-coherent" to see if that's actually 
true (and it should become pretty obvious if it isn't).

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 12:33           ` Russell King - ARM Linux admin
  2019-09-17 13:03             ` Robin Murphy
@ 2019-09-17 13:07             ` Russell King - ARM Linux admin
  2019-09-17 13:24               ` Fabio Estevam
  2019-09-17 13:38               ` Robin Murphy
  1 sibling, 2 replies; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 13:07 UTC (permalink / raw)
  To: Adrian Hunter, Will Deacon, Nicolin Chen, dann frazier,
	Christoph Hellwig
  Cc: linux-mmc, Linux ARM

On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote:
> > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
> > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
> > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
> > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> > > > > > 
> > > > > > > The platform has an iommu, which is in pass-through mode, via
> > > > > > > arm_smmu.disable_bypass=0.
> > > > > > 
> > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> > > > > > 
> > > > > > Although it had already landed in v5.2
> > > > > 
> > > > > It is not - and the two lines that you quoted above are sufficient
> > > > > to negate that as a cause.  (Please read the help for the option that
> > > > > the commit referrs to.)
> > > > > 
> > > > > In fact, with bypass disabled, the SoC fails due to other masters.
> > > > > That's already been discussed privately between myself and Will
> > > > > Deacon.
> > > > > 
> > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
> > > > > the default setting in the Kconfig.
> > > > 
> > > > Adding some further debugging, and fixing the existing ADMA debugging
> > > > shows:
> > > > 
> > > > mmc0: ADMA error: 0x02000000
> > > > 
> > > > So this is an ADMA error without the transfer having completed.
> > > > 
> > > > mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> > > > 
> > > > The block size is 8, with one block.
> > > > 
> > > > mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
> > > > 
> > > > The ADMA error is a descriptor error at address 0x000000236df1d20c.
> > > > The descriptor table contains (including the following entry):
> > > > 
> > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
> > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
> > > > 
> > > > The descriptor table contains one descriptor of 8 bytes, is marked
> > > > as the last (END bit set) and is at DMA address 0x236df1d200.  The
> > > > following descriptor is empty, with VALID=0.
> > > > 
> > > > One may be tempted to blame it on the following descriptor, but having
> > > > had another example on eMMC while userspace was booting (rootfs on
> > > > eMMC):
> > > > 
> > > > mmc1: ADMA error: 0x02000000
> > > > mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
> > > > mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
> > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
> > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
> > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
> > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
> > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
> > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
> > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
> > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
> > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
> > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
> > > > 
> > > > ... which is interesting for several reasons:
> > > > - The ADMA error register indicates a length mismatch error.  The
> > > >   transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
> > > >   Summing the ADMA lengths up to the last descriptor (length=0 is
> > > >   0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
> > > >   bytes than the requested transfer.
> > > > 
> > > > - The ADMA error register indicates ST_CADR, which is described as
> > > >   "This state is never set because do not generate ADMA error in this
> > > >   state."
> > > > 
> > > > - The error descriptor is again after the descriptor with END=1, but
> > > >   this time has VALID=1.
> > > > 
> > > > This _feels_ like a coherency issue, where the SDHCI engine is not
> > > > correctly seeing the descriptor table, but then I would have expected
> > > > userspace (which is basically debian stable) to fail to boot every
> > > > time given that its rootfs is on eMMC.
> > > > 
> > > > The other weird thing is if I wind the core MMC code back via:
> > > > 
> > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
> > > > 
> > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
> > > > completely stable, but way better than plain v5.3.  I don't see
> > > > much in that diff which would be responsible for this - although it
> > > > does seem that hch's DMA changes do make the problem more likely.
> > > > (going from 1 in 3 boots with a problem to being not able to boot.)
> > > > 
> > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
> > > > bypass mode on the IOMMU (but then I saw global smmu errors right
> > > > from when the IOMMU had bypass disabled before MMC was probed - the
> > > > reason being is the SoC is not currently setup to have the MMU
> > > > bypass mode disabled.)
> > > 
> > > This looks like an ARM64 coherency issue.
> > > 
> > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
> > > which had no effect.  I then tried adding:
> > > 
> > > +       __dma_flush_area(host->adma_table, desc - host->adma_table);
> > > +       dma_wmb();
> > > 
> > > and so far I haven't had any further ADMA errors.  Adding Will Deacon
> > > to the thread.
> > 
> > These are the changes to sdhci that I'm currently running.  I think
> > some of the debugging related changes are probably worth adding to
> > the driver, particularly printing the intmask on ADMA error (which
> > is not printed by the register dump, as the value is lost) and printing
> > the DMA addresses of the descriptor table entries which can be tied
> > up with the DMA address error register.  Also, maybe printing the
> > DMA descriptor table with the register dump, rather than having to
> > resort to enabling debug would be a good idea?
> > 
> > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> > index a5dc5aae973e..884dcaa9cad5 100644
> > --- a/drivers/mmc/host/sdhci.c
> > +++ b/drivers/mmc/host/sdhci.c
> > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
> >  		/* Add a terminating entry - nop, end, valid */
> >  		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
> >  	}
> > +	__dma_flush_area(host->adma_table, desc - host->adma_table);
> > +	dma_wmb();
> >  }
> >  
> >  static void sdhci_adma_table_post(struct sdhci_host *host,
> > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
> >  static void sdhci_adma_show_error(struct sdhci_host *host)
> >  {
> >  	void *desc = host->adma_table;
> > +	dma_addr_t dma = host->adma_addr;
> > +	bool end = false;
> >  
> >  	sdhci_dumpregs(host);
> >  
> > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
> >  		struct sdhci_adma2_64_desc *dma_desc = desc;
> >  
> >  		if (host->flags & SDHCI_USE_64_BIT_DMA)
> > -			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > -			    desc, le32_to_cpu(dma_desc->addr_hi),
> > +			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > +			    (unsigned long long)dma,
> > +			    le32_to_cpu(dma_desc->addr_hi),
> >  			    le32_to_cpu(dma_desc->addr_lo),
> >  			    le16_to_cpu(dma_desc->len),
> >  			    le16_to_cpu(dma_desc->cmd));
> >  		else
> > -			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > -			    desc, le32_to_cpu(dma_desc->addr_lo),
> > +			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > +			    (unsigned long long)dma,
> > +			    le32_to_cpu(dma_desc->addr_lo),
> >  			    le16_to_cpu(dma_desc->len),
> >  			    le16_to_cpu(dma_desc->cmd));
> >  
> > +		if (end) break;
> > +
> >  		desc += host->desc_sz;
> > +		dma += host->desc_sz;
> >  
> >  		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
> > -			break;
> > +			end = true;
> >  	}
> >  }
> >  
> > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
> >  			!= MMC_BUS_TEST_R)
> >  		host->data->error = -EILSEQ;
> >  	else if (intmask & SDHCI_INT_ADMA_ERROR) {
> > -		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
> > +		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
> >  		sdhci_adma_show_error(host);
> >  		host->data->error = -EIO;
> >  		if (host->ops->adma_workaround)
> 
> Further debug shows:
> 
> coherent=0 - sdhci device is not cache coherent
> swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
> [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
> pmd=000000237fffb003, pte=00e800236d62270f
> 
> The mapping for the ADMA table seems to be using MAIR index 3, which is
> MT_MEMORY_NC, so should be non-cacheable.
> 
> vmallocinfo:
> 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
> user
> 
> So this memory has been remapped.  Could there be an alias that has
> cache lines still in the cache for the physical address, and could we
> be hitting those cache lines while accessing through a non-cacheable
> mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
> definitely _feels_ like it has unpredictable attributes!)
> 
> Also, given that this memory is mapped NC, then surely
> __dma_flush_area() should have no effect?  However, it _does_ have the
> effect of reliably solving the problem, which to me implies that there
> _are_ cache lines in this NC mapping.

Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
alloc_pages for single pages") which has been implicated in the same
problem here:

https://www.spinics.net/lists/arm-kernel/msg750623.html

Although reverting the commit is not clean, this also fixes the issue
for me.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:07             ` Russell King - ARM Linux admin
@ 2019-09-17 13:24               ` Fabio Estevam
  2019-09-17 13:33                 ` Russell King - ARM Linux admin
  2019-09-17 13:38               ` Robin Murphy
  1 sibling, 1 reply; 31+ messages in thread
From: Fabio Estevam @ 2019-09-17 13:24 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

Hi Russell,

On Tue, Sep 17, 2019 at 10:10 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:

> Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
> alloc_pages for single pages") which has been implicated in the same
> problem here:
>
> https://www.spinics.net/lists/arm-kernel/msg750623.html

It seems that the final fix was:
https://lore.kernel.org/patchwork/patch/1121600/

Does it work for you if we follow the same idea?

--- a/drivers/mmc/host/sdhci-of-esdhc.c
+++ b/drivers/mmc/host/sdhci-of-esdhc.c
@@ -1105,6 +1105,7 @@ static int sdhci_esdhc_probe(struct platform_device *pdev)

        esdhc_init(pdev, host);

+       sdhci_enable_v4_mode(host);
        sdhci_get_of_property(pdev);

        pltfm_host = sdhci_priv(host);

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:03             ` Robin Murphy
@ 2019-09-17 13:28               ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 13:28 UTC (permalink / raw)
  To: Robin Murphy; +Cc: linux-mmc, Will Deacon, Adrian Hunter, Linux ARM

On Tue, Sep 17, 2019 at 02:03:04PM +0100, Robin Murphy wrote:
> On 17/09/2019 13:33, Russell King - ARM Linux admin wrote:
> [...]
> > Further debug shows:
> > 
> > coherent=0 - sdhci device is not cache coherent
> > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
> > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
> > pmd=000000237fffb003, pte=00e800236d62270f
> > 
> > The mapping for the ADMA table seems to be using MAIR index 3, which is
> > MT_MEMORY_NC, so should be non-cacheable.
> > 
> > vmallocinfo:
> > 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
> > user
> > 
> > So this memory has been remapped.  Could there be an alias that has
> > cache lines still in the cache for the physical address, and could we
> > be hitting those cache lines while accessing through a non-cacheable
> > mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
> > definitely _feels_ like it has unpredictable attributes!)
> > 
> > Also, given that this memory is mapped NC, then surely
> > __dma_flush_area() should have no effect?  However, it _does_ have the
> > effect of reliably solving the problem, which to me implies that there
> > _are_ cache lines in this NC mapping.
> 
> The non-cacheable mapping of the descriptor table will still have its
> cacheable linear map alias, so it's quite likely that the invalidate aspect
> of __dma_flush_area(), rather than the clean, is what's making the
> difference - if using __dma_clean_area() instead doesn't help, it would more
> or less confirm that.
> 
> One possibility in that case is that you might actually have the rare
> backwards coherency problem - if the device *is* actually snooping the
> cache, then it could hit lines which were speculatively prefetched via the
> cacheable alias before the descriptors were fully written, rather than the
> up-to-date data which went straight to RAM via the NC mapping. I'd try
> declaring the device as "dma-coherent" to see if that's actually true (and
> it should become pretty obvious if it isn't).

As just mentioned in my previous reply, there's a commit to the
dma-contiguous which changes where the CMA memory comes from.

[ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
	pmd=000000237fffb003, pte=00e800236d62270f

vs

[ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
	pmd=000000237fffb003, pte=00e80000f9c9a70f

Former is with the patch applied, latter is with it reverted.

This makes me question whether the cache handling for a page that is
remapped is being performed.  If there's cache lines present for a
page that is being remapped as non-cacheable, what prevents those
cache lines from being dirty and possibly being written-back at some
point after the non-cacheable mapping as been started to be used?

And yes, it looks like adding "dma-coherent" to the SDHCI controller
with the SD card in resolves the issue, so your hypothesis may be
true.  On the other hand, I haven't added "dma-coherent" to the eMMC
side, and that's also working fine over several reboots without the
offending commit reverted _nor_ with my __dma_flush_area() hack in
place.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:24               ` Fabio Estevam
@ 2019-09-17 13:33                 ` Russell King - ARM Linux admin
  2019-09-17 13:43                   ` Fabio Estevam
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 13:33 UTC (permalink / raw)
  To: Fabio Estevam
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

On Tue, Sep 17, 2019 at 10:24:57AM -0300, Fabio Estevam wrote:
> Hi Russell,
> 
> On Tue, Sep 17, 2019 at 10:10 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> 
> > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
> > alloc_pages for single pages") which has been implicated in the same
> > problem here:
> >
> > https://www.spinics.net/lists/arm-kernel/msg750623.html
> 
> It seems that the final fix was:
> https://lore.kernel.org/patchwork/patch/1121600/
> 
> Does it work for you if we follow the same idea?
> 
> --- a/drivers/mmc/host/sdhci-of-esdhc.c
> +++ b/drivers/mmc/host/sdhci-of-esdhc.c
> @@ -1105,6 +1105,7 @@ static int sdhci_esdhc_probe(struct platform_device *pdev)
> 
>         esdhc_init(pdev, host);
> 
> +       sdhci_enable_v4_mode(host);
>         sdhci_get_of_property(pdev);
> 
>         pltfm_host = sdhci_priv(host);

That attempts to set bit 12 of the host control register 2 (0x3e).
The LX2160A documentation states that bit 28 of 0x3c (they're 32-bit
wide registers there) is "reserved".

So, you're asking for a documented reserved bit to be set...

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:07             ` Russell King - ARM Linux admin
  2019-09-17 13:24               ` Fabio Estevam
@ 2019-09-17 13:38               ` Robin Murphy
  2019-09-17 13:49                 ` Russell King - ARM Linux admin
  2019-09-17 13:50                 ` Will Deacon
  1 sibling, 2 replies; 31+ messages in thread
From: Robin Murphy @ 2019-09-17 13:38 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Adrian Hunter, Will Deacon,
	Nicolin Chen, dann frazier, Christoph Hellwig
  Cc: linux-mmc, Linux ARM

On 17/09/2019 14:07, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote:
>> On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote:
>>> On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
>>>> On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
>>>>> On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
>>>>>> On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
>>>>>>> On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
>>>>>>>
>>>>>>>> The platform has an iommu, which is in pass-through mode, via
>>>>>>>> arm_smmu.disable_bypass=0.
>>>>>>>
>>>>>>> Could be 954a03be033c7cef80ddc232e7cbdb17df735663
>>>>>>> "iommu/arm-smmu: Break insecure users by disabling bypass by default"
>>>>>>>
>>>>>>> Although it had already landed in v5.2
>>>>>>
>>>>>> It is not - and the two lines that you quoted above are sufficient
>>>>>> to negate that as a cause.  (Please read the help for the option that
>>>>>> the commit referrs to.)
>>>>>>
>>>>>> In fact, with bypass disabled, the SoC fails due to other masters.
>>>>>> That's already been discussed privately between myself and Will
>>>>>> Deacon.
>>>>>>
>>>>>> arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
>>>>>> the default setting in the Kconfig.
>>>>>
>>>>> Adding some further debugging, and fixing the existing ADMA debugging
>>>>> shows:
>>>>>
>>>>> mmc0: ADMA error: 0x02000000
>>>>>
>>>>> So this is an ADMA error without the transfer having completed.
>>>>>
>>>>> mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
>>>>>
>>>>> The block size is 8, with one block.
>>>>>
>>>>> mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
>>>>>
>>>>> The ADMA error is a descriptor error at address 0x000000236df1d20c.
>>>>> The descriptor table contains (including the following entry):
>>>>>
>>>>> mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
>>>>> mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
>>>>>
>>>>> The descriptor table contains one descriptor of 8 bytes, is marked
>>>>> as the last (END bit set) and is at DMA address 0x236df1d200.  The
>>>>> following descriptor is empty, with VALID=0.
>>>>>
>>>>> One may be tempted to blame it on the following descriptor, but having
>>>>> had another example on eMMC while userspace was booting (rootfs on
>>>>> eMMC):
>>>>>
>>>>> mmc1: ADMA error: 0x02000000
>>>>> mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
>>>>> mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
>>>>> mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
>>>>> mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
>>>>> mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
>>>>> mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
>>>>> mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
>>>>> mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
>>>>> mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
>>>>> mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
>>>>> mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
>>>>> mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
>>>>>
>>>>> ... which is interesting for several reasons:
>>>>> - The ADMA error register indicates a length mismatch error.  The
>>>>>    transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
>>>>>    Summing the ADMA lengths up to the last descriptor (length=0 is
>>>>>    0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
>>>>>    bytes than the requested transfer.
>>>>>
>>>>> - The ADMA error register indicates ST_CADR, which is described as
>>>>>    "This state is never set because do not generate ADMA error in this
>>>>>    state."
>>>>>
>>>>> - The error descriptor is again after the descriptor with END=1, but
>>>>>    this time has VALID=1.
>>>>>
>>>>> This _feels_ like a coherency issue, where the SDHCI engine is not
>>>>> correctly seeing the descriptor table, but then I would have expected
>>>>> userspace (which is basically debian stable) to fail to boot every
>>>>> time given that its rootfs is on eMMC.
>>>>>
>>>>> The other weird thing is if I wind the core MMC code back via:
>>>>>
>>>>> $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
>>>>>
>>>>> and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
>>>>> completely stable, but way better than plain v5.3.  I don't see
>>>>> much in that diff which would be responsible for this - although it
>>>>> does seem that hch's DMA changes do make the problem more likely.
>>>>> (going from 1 in 3 boots with a problem to being not able to boot.)
>>>>>
>>>>> Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
>>>>> bypass mode on the IOMMU (but then I saw global smmu errors right
>>>>> from when the IOMMU had bypass disabled before MMC was probed - the
>>>>> reason being is the SoC is not currently setup to have the MMU
>>>>> bypass mode disabled.)
>>>>
>>>> This looks like an ARM64 coherency issue.
>>>>
>>>> I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
>>>> which had no effect.  I then tried adding:
>>>>
>>>> +       __dma_flush_area(host->adma_table, desc - host->adma_table);
>>>> +       dma_wmb();
>>>>
>>>> and so far I haven't had any further ADMA errors.  Adding Will Deacon
>>>> to the thread.
>>>
>>> These are the changes to sdhci that I'm currently running.  I think
>>> some of the debugging related changes are probably worth adding to
>>> the driver, particularly printing the intmask on ADMA error (which
>>> is not printed by the register dump, as the value is lost) and printing
>>> the DMA addresses of the descriptor table entries which can be tied
>>> up with the DMA address error register.  Also, maybe printing the
>>> DMA descriptor table with the register dump, rather than having to
>>> resort to enabling debug would be a good idea?
>>>
>>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>>> index a5dc5aae973e..884dcaa9cad5 100644
>>> --- a/drivers/mmc/host/sdhci.c
>>> +++ b/drivers/mmc/host/sdhci.c
>>> @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
>>>   		/* Add a terminating entry - nop, end, valid */
>>>   		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
>>>   	}
>>> +	__dma_flush_area(host->adma_table, desc - host->adma_table);
>>> +	dma_wmb();
>>>   }
>>>   
>>>   static void sdhci_adma_table_post(struct sdhci_host *host,
>>> @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
>>>   static void sdhci_adma_show_error(struct sdhci_host *host)
>>>   {
>>>   	void *desc = host->adma_table;
>>> +	dma_addr_t dma = host->adma_addr;
>>> +	bool end = false;
>>>   
>>>   	sdhci_dumpregs(host);
>>>   
>>> @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
>>>   		struct sdhci_adma2_64_desc *dma_desc = desc;
>>>   
>>>   		if (host->flags & SDHCI_USE_64_BIT_DMA)
>>> -			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>> -			    desc, le32_to_cpu(dma_desc->addr_hi),
>>> +			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>> +			    (unsigned long long)dma,
>>> +			    le32_to_cpu(dma_desc->addr_hi),
>>>   			    le32_to_cpu(dma_desc->addr_lo),
>>>   			    le16_to_cpu(dma_desc->len),
>>>   			    le16_to_cpu(dma_desc->cmd));
>>>   		else
>>> -			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>> -			    desc, le32_to_cpu(dma_desc->addr_lo),
>>> +			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>> +			    (unsigned long long)dma,
>>> +			    le32_to_cpu(dma_desc->addr_lo),
>>>   			    le16_to_cpu(dma_desc->len),
>>>   			    le16_to_cpu(dma_desc->cmd));
>>>   
>>> +		if (end) break;
>>> +
>>>   		desc += host->desc_sz;
>>> +		dma += host->desc_sz;
>>>   
>>>   		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
>>> -			break;
>>> +			end = true;
>>>   	}
>>>   }
>>>   
>>> @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>>>   			!= MMC_BUS_TEST_R)
>>>   		host->data->error = -EILSEQ;
>>>   	else if (intmask & SDHCI_INT_ADMA_ERROR) {
>>> -		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
>>> +		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
>>>   		sdhci_adma_show_error(host);
>>>   		host->data->error = -EIO;
>>>   		if (host->ops->adma_workaround)
>>
>> Further debug shows:
>>
>> coherent=0 - sdhci device is not cache coherent
>> swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
>> [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
>> pmd=000000237fffb003, pte=00e800236d62270f
>>
>> The mapping for the ADMA table seems to be using MAIR index 3, which is
>> MT_MEMORY_NC, so should be non-cacheable.
>>
>> vmallocinfo:
>> 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
>> user
>>
>> So this memory has been remapped.  Could there be an alias that has
>> cache lines still in the cache for the physical address, and could we
>> be hitting those cache lines while accessing through a non-cacheable
>> mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
>> definitely _feels_ like it has unpredictable attributes!)
>>
>> Also, given that this memory is mapped NC, then surely
>> __dma_flush_area() should have no effect?  However, it _does_ have the
>> effect of reliably solving the problem, which to me implies that there
>> _are_ cache lines in this NC mapping.
> 
> Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
> alloc_pages for single pages") which has been implicated in the same
> problem here:
> 
> https://www.spinics.net/lists/arm-kernel/msg750623.html
> 
> Although reverting the commit is not clean, this also fixes the issue
> for me.

Note that that one turned out to be something totally different, namely 
that the single-page allocations, in difference to CMA, came from 
physical addresses that the controller needed additional configuration 
to be able to access[1] - no amount of cache maintenance would affect that.

However, the other difference between getting a single page directly 
from the page allocator vs. the CMA area is that accesses to the linear 
mapping of the CMA area are probably pretty rare, whereas for the 
single-page case it's much more likely that kernel tasks using adjacent 
pages could lead to prefetching of the descriptor page's cacheable 
alias. That could certainly explain how reverting that commit manages to 
hide an apparent coherency issue.

Robin.

[1] https://lore.kernel.org/patchwork/patch/1121600/

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:33                 ` Russell King - ARM Linux admin
@ 2019-09-17 13:43                   ` Fabio Estevam
  2019-09-17 13:51                     ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Fabio Estevam @ 2019-09-17 13:43 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

On Tue, Sep 17, 2019 at 10:33 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:

> That attempts to set bit 12 of the host control register 2 (0x3e).
> The LX2160A documentation states that bit 28 of 0x3c (they're 32-bit
> wide registers there) is "reserved".
>
> So, you're asking for a documented reserved bit to be set...

Correct, v4 is not supported here indeed.

From the LX2160 doc:
"Conforms to the SD Host Controller Standard Specification version 3.0"

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:38               ` Robin Murphy
@ 2019-09-17 13:49                 ` Russell King - ARM Linux admin
  2019-09-17 14:03                   ` Robin Murphy
  2019-09-17 13:50                 ` Will Deacon
  1 sibling, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 13:49 UTC (permalink / raw)
  To: Robin Murphy
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote:
> On 17/09/2019 14:07, Russell King - ARM Linux admin wrote:
> > On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote:
> > > On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote:
> > > > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
> > > > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
> > > > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
> > > > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> > > > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> > > > > > > > 
> > > > > > > > > The platform has an iommu, which is in pass-through mode, via
> > > > > > > > > arm_smmu.disable_bypass=0.
> > > > > > > > 
> > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> > > > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> > > > > > > > 
> > > > > > > > Although it had already landed in v5.2
> > > > > > > 
> > > > > > > It is not - and the two lines that you quoted above are sufficient
> > > > > > > to negate that as a cause.  (Please read the help for the option that
> > > > > > > the commit referrs to.)
> > > > > > > 
> > > > > > > In fact, with bypass disabled, the SoC fails due to other masters.
> > > > > > > That's already been discussed privately between myself and Will
> > > > > > > Deacon.
> > > > > > > 
> > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
> > > > > > > the default setting in the Kconfig.
> > > > > > 
> > > > > > Adding some further debugging, and fixing the existing ADMA debugging
> > > > > > shows:
> > > > > > 
> > > > > > mmc0: ADMA error: 0x02000000
> > > > > > 
> > > > > > So this is an ADMA error without the transfer having completed.
> > > > > > 
> > > > > > mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> > > > > > 
> > > > > > The block size is 8, with one block.
> > > > > > 
> > > > > > mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
> > > > > > 
> > > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c.
> > > > > > The descriptor table contains (including the following entry):
> > > > > > 
> > > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
> > > > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
> > > > > > 
> > > > > > The descriptor table contains one descriptor of 8 bytes, is marked
> > > > > > as the last (END bit set) and is at DMA address 0x236df1d200.  The
> > > > > > following descriptor is empty, with VALID=0.
> > > > > > 
> > > > > > One may be tempted to blame it on the following descriptor, but having
> > > > > > had another example on eMMC while userspace was booting (rootfs on
> > > > > > eMMC):
> > > > > > 
> > > > > > mmc1: ADMA error: 0x02000000
> > > > > > mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
> > > > > > mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
> > > > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
> > > > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
> > > > > > 
> > > > > > ... which is interesting for several reasons:
> > > > > > - The ADMA error register indicates a length mismatch error.  The
> > > > > >    transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
> > > > > >    Summing the ADMA lengths up to the last descriptor (length=0 is
> > > > > >    0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
> > > > > >    bytes than the requested transfer.
> > > > > > 
> > > > > > - The ADMA error register indicates ST_CADR, which is described as
> > > > > >    "This state is never set because do not generate ADMA error in this
> > > > > >    state."
> > > > > > 
> > > > > > - The error descriptor is again after the descriptor with END=1, but
> > > > > >    this time has VALID=1.
> > > > > > 
> > > > > > This _feels_ like a coherency issue, where the SDHCI engine is not
> > > > > > correctly seeing the descriptor table, but then I would have expected
> > > > > > userspace (which is basically debian stable) to fail to boot every
> > > > > > time given that its rootfs is on eMMC.
> > > > > > 
> > > > > > The other weird thing is if I wind the core MMC code back via:
> > > > > > 
> > > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
> > > > > > 
> > > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
> > > > > > completely stable, but way better than plain v5.3.  I don't see
> > > > > > much in that diff which would be responsible for this - although it
> > > > > > does seem that hch's DMA changes do make the problem more likely.
> > > > > > (going from 1 in 3 boots with a problem to being not able to boot.)
> > > > > > 
> > > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
> > > > > > bypass mode on the IOMMU (but then I saw global smmu errors right
> > > > > > from when the IOMMU had bypass disabled before MMC was probed - the
> > > > > > reason being is the SoC is not currently setup to have the MMU
> > > > > > bypass mode disabled.)
> > > > > 
> > > > > This looks like an ARM64 coherency issue.
> > > > > 
> > > > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
> > > > > which had no effect.  I then tried adding:
> > > > > 
> > > > > +       __dma_flush_area(host->adma_table, desc - host->adma_table);
> > > > > +       dma_wmb();
> > > > > 
> > > > > and so far I haven't had any further ADMA errors.  Adding Will Deacon
> > > > > to the thread.
> > > > 
> > > > These are the changes to sdhci that I'm currently running.  I think
> > > > some of the debugging related changes are probably worth adding to
> > > > the driver, particularly printing the intmask on ADMA error (which
> > > > is not printed by the register dump, as the value is lost) and printing
> > > > the DMA addresses of the descriptor table entries which can be tied
> > > > up with the DMA address error register.  Also, maybe printing the
> > > > DMA descriptor table with the register dump, rather than having to
> > > > resort to enabling debug would be a good idea?
> > > > 
> > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> > > > index a5dc5aae973e..884dcaa9cad5 100644
> > > > --- a/drivers/mmc/host/sdhci.c
> > > > +++ b/drivers/mmc/host/sdhci.c
> > > > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
> > > >   		/* Add a terminating entry - nop, end, valid */
> > > >   		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
> > > >   	}
> > > > +	__dma_flush_area(host->adma_table, desc - host->adma_table);
> > > > +	dma_wmb();
> > > >   }
> > > >   static void sdhci_adma_table_post(struct sdhci_host *host,
> > > > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
> > > >   static void sdhci_adma_show_error(struct sdhci_host *host)
> > > >   {
> > > >   	void *desc = host->adma_table;
> > > > +	dma_addr_t dma = host->adma_addr;
> > > > +	bool end = false;
> > > >   	sdhci_dumpregs(host);
> > > > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
> > > >   		struct sdhci_adma2_64_desc *dma_desc = desc;
> > > >   		if (host->flags & SDHCI_USE_64_BIT_DMA)
> > > > -			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > -			    desc, le32_to_cpu(dma_desc->addr_hi),
> > > > +			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > +			    (unsigned long long)dma,
> > > > +			    le32_to_cpu(dma_desc->addr_hi),
> > > >   			    le32_to_cpu(dma_desc->addr_lo),
> > > >   			    le16_to_cpu(dma_desc->len),
> > > >   			    le16_to_cpu(dma_desc->cmd));
> > > >   		else
> > > > -			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > -			    desc, le32_to_cpu(dma_desc->addr_lo),
> > > > +			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > +			    (unsigned long long)dma,
> > > > +			    le32_to_cpu(dma_desc->addr_lo),
> > > >   			    le16_to_cpu(dma_desc->len),
> > > >   			    le16_to_cpu(dma_desc->cmd));
> > > > +		if (end) break;
> > > > +
> > > >   		desc += host->desc_sz;
> > > > +		dma += host->desc_sz;
> > > >   		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
> > > > -			break;
> > > > +			end = true;
> > > >   	}
> > > >   }
> > > > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
> > > >   			!= MMC_BUS_TEST_R)
> > > >   		host->data->error = -EILSEQ;
> > > >   	else if (intmask & SDHCI_INT_ADMA_ERROR) {
> > > > -		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
> > > > +		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
> > > >   		sdhci_adma_show_error(host);
> > > >   		host->data->error = -EIO;
> > > >   		if (host->ops->adma_workaround)
> > > 
> > > Further debug shows:
> > > 
> > > coherent=0 - sdhci device is not cache coherent
> > > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
> > > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
> > > pmd=000000237fffb003, pte=00e800236d62270f
> > > 
> > > The mapping for the ADMA table seems to be using MAIR index 3, which is
> > > MT_MEMORY_NC, so should be non-cacheable.
> > > 
> > > vmallocinfo:
> > > 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
> > > user
> > > 
> > > So this memory has been remapped.  Could there be an alias that has
> > > cache lines still in the cache for the physical address, and could we
> > > be hitting those cache lines while accessing through a non-cacheable
> > > mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
> > > definitely _feels_ like it has unpredictable attributes!)
> > > 
> > > Also, given that this memory is mapped NC, then surely
> > > __dma_flush_area() should have no effect?  However, it _does_ have the
> > > effect of reliably solving the problem, which to me implies that there
> > > _are_ cache lines in this NC mapping.
> > 
> > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
> > alloc_pages for single pages") which has been implicated in the same
> > problem here:
> > 
> > https://www.spinics.net/lists/arm-kernel/msg750623.html
> > 
> > Although reverting the commit is not clean, this also fixes the issue
> > for me.
> 
> Note that that one turned out to be something totally different, namely that
> the single-page allocations, in difference to CMA, came from physical
> addresses that the controller needed additional configuration to be able to
> access[1] - no amount of cache maintenance would affect that.

As already replied, v4 mode is not documented as being available on
the LX2160A - the bit in the control register is marked as "reserved".
This is as expected as it is documented that it is using a v3.00 of
the SDHCI standard, rather than v4.00.

So, sorry, enabling "v4 mode" isn't a workaround in this scenario.

Given that v4 mode is not mandatory, this shouldn't be a work-around.

Given that it _does_ work some of the time with the table >4GB, then
this is not an addressing limitation.

> However, the other difference between getting a single page directly from
> the page allocator vs. the CMA area is that accesses to the linear mapping
> of the CMA area are probably pretty rare, whereas for the single-page case
> it's much more likely that kernel tasks using adjacent pages could lead to
> prefetching of the descriptor page's cacheable alias. That could certainly
> explain how reverting that commit manages to hide an apparent coherency
> issue.

Right, so how do we fix this?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:38               ` Robin Murphy
  2019-09-17 13:49                 ` Russell King - ARM Linux admin
@ 2019-09-17 13:50                 ` Will Deacon
  2019-09-17 13:55                   ` Robin Murphy
  1 sibling, 1 reply; 31+ messages in thread
From: Will Deacon @ 2019-09-17 13:50 UTC (permalink / raw)
  To: Robin Murphy
  Cc: dann frazier, Will Deacon, Russell King - ARM Linux admin,
	Adrian Hunter, Nicolin Chen, linux-mmc, Christoph Hellwig,
	Linux ARM

On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote:
> On 17/09/2019 14:07, Russell King - ARM Linux admin wrote:
> > On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote:
> > > On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote:
> > > > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
> > > > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
> > > > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
> > > > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> > > > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> > > > > > > > 
> > > > > > > > > The platform has an iommu, which is in pass-through mode, via
> > > > > > > > > arm_smmu.disable_bypass=0.
> > > > > > > > 
> > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> > > > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> > > > > > > > 
> > > > > > > > Although it had already landed in v5.2
> > > > > > > 
> > > > > > > It is not - and the two lines that you quoted above are sufficient
> > > > > > > to negate that as a cause.  (Please read the help for the option that
> > > > > > > the commit referrs to.)
> > > > > > > 
> > > > > > > In fact, with bypass disabled, the SoC fails due to other masters.
> > > > > > > That's already been discussed privately between myself and Will
> > > > > > > Deacon.
> > > > > > > 
> > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
> > > > > > > the default setting in the Kconfig.
> > > > > > 
> > > > > > Adding some further debugging, and fixing the existing ADMA debugging
> > > > > > shows:
> > > > > > 
> > > > > > mmc0: ADMA error: 0x02000000
> > > > > > 
> > > > > > So this is an ADMA error without the transfer having completed.
> > > > > > 
> > > > > > mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> > > > > > 
> > > > > > The block size is 8, with one block.
> > > > > > 
> > > > > > mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
> > > > > > 
> > > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c.
> > > > > > The descriptor table contains (including the following entry):
> > > > > > 
> > > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
> > > > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
> > > > > > 
> > > > > > The descriptor table contains one descriptor of 8 bytes, is marked
> > > > > > as the last (END bit set) and is at DMA address 0x236df1d200.  The
> > > > > > following descriptor is empty, with VALID=0.
> > > > > > 
> > > > > > One may be tempted to blame it on the following descriptor, but having
> > > > > > had another example on eMMC while userspace was booting (rootfs on
> > > > > > eMMC):
> > > > > > 
> > > > > > mmc1: ADMA error: 0x02000000
> > > > > > mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
> > > > > > mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
> > > > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
> > > > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
> > > > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
> > > > > > 
> > > > > > ... which is interesting for several reasons:
> > > > > > - The ADMA error register indicates a length mismatch error.  The
> > > > > >    transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
> > > > > >    Summing the ADMA lengths up to the last descriptor (length=0 is
> > > > > >    0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
> > > > > >    bytes than the requested transfer.
> > > > > > 
> > > > > > - The ADMA error register indicates ST_CADR, which is described as
> > > > > >    "This state is never set because do not generate ADMA error in this
> > > > > >    state."
> > > > > > 
> > > > > > - The error descriptor is again after the descriptor with END=1, but
> > > > > >    this time has VALID=1.
> > > > > > 
> > > > > > This _feels_ like a coherency issue, where the SDHCI engine is not
> > > > > > correctly seeing the descriptor table, but then I would have expected
> > > > > > userspace (which is basically debian stable) to fail to boot every
> > > > > > time given that its rootfs is on eMMC.
> > > > > > 
> > > > > > The other weird thing is if I wind the core MMC code back via:
> > > > > > 
> > > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
> > > > > > 
> > > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
> > > > > > completely stable, but way better than plain v5.3.  I don't see
> > > > > > much in that diff which would be responsible for this - although it
> > > > > > does seem that hch's DMA changes do make the problem more likely.
> > > > > > (going from 1 in 3 boots with a problem to being not able to boot.)
> > > > > > 
> > > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
> > > > > > bypass mode on the IOMMU (but then I saw global smmu errors right
> > > > > > from when the IOMMU had bypass disabled before MMC was probed - the
> > > > > > reason being is the SoC is not currently setup to have the MMU
> > > > > > bypass mode disabled.)
> > > > > 
> > > > > This looks like an ARM64 coherency issue.
> > > > > 
> > > > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
> > > > > which had no effect.  I then tried adding:
> > > > > 
> > > > > +       __dma_flush_area(host->adma_table, desc - host->adma_table);
> > > > > +       dma_wmb();
> > > > > 
> > > > > and so far I haven't had any further ADMA errors.  Adding Will Deacon
> > > > > to the thread.
> > > > 
> > > > These are the changes to sdhci that I'm currently running.  I think
> > > > some of the debugging related changes are probably worth adding to
> > > > the driver, particularly printing the intmask on ADMA error (which
> > > > is not printed by the register dump, as the value is lost) and printing
> > > > the DMA addresses of the descriptor table entries which can be tied
> > > > up with the DMA address error register.  Also, maybe printing the
> > > > DMA descriptor table with the register dump, rather than having to
> > > > resort to enabling debug would be a good idea?
> > > > 
> > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> > > > index a5dc5aae973e..884dcaa9cad5 100644
> > > > --- a/drivers/mmc/host/sdhci.c
> > > > +++ b/drivers/mmc/host/sdhci.c
> > > > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
> > > >   		/* Add a terminating entry - nop, end, valid */
> > > >   		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
> > > >   	}
> > > > +	__dma_flush_area(host->adma_table, desc - host->adma_table);
> > > > +	dma_wmb();
> > > >   }
> > > >   static void sdhci_adma_table_post(struct sdhci_host *host,
> > > > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
> > > >   static void sdhci_adma_show_error(struct sdhci_host *host)
> > > >   {
> > > >   	void *desc = host->adma_table;
> > > > +	dma_addr_t dma = host->adma_addr;
> > > > +	bool end = false;
> > > >   	sdhci_dumpregs(host);
> > > > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
> > > >   		struct sdhci_adma2_64_desc *dma_desc = desc;
> > > >   		if (host->flags & SDHCI_USE_64_BIT_DMA)
> > > > -			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > -			    desc, le32_to_cpu(dma_desc->addr_hi),
> > > > +			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > +			    (unsigned long long)dma,
> > > > +			    le32_to_cpu(dma_desc->addr_hi),
> > > >   			    le32_to_cpu(dma_desc->addr_lo),
> > > >   			    le16_to_cpu(dma_desc->len),
> > > >   			    le16_to_cpu(dma_desc->cmd));
> > > >   		else
> > > > -			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > -			    desc, le32_to_cpu(dma_desc->addr_lo),
> > > > +			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > +			    (unsigned long long)dma,
> > > > +			    le32_to_cpu(dma_desc->addr_lo),
> > > >   			    le16_to_cpu(dma_desc->len),
> > > >   			    le16_to_cpu(dma_desc->cmd));
> > > > +		if (end) break;
> > > > +
> > > >   		desc += host->desc_sz;
> > > > +		dma += host->desc_sz;
> > > >   		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
> > > > -			break;
> > > > +			end = true;
> > > >   	}
> > > >   }
> > > > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
> > > >   			!= MMC_BUS_TEST_R)
> > > >   		host->data->error = -EILSEQ;
> > > >   	else if (intmask & SDHCI_INT_ADMA_ERROR) {
> > > > -		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
> > > > +		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
> > > >   		sdhci_adma_show_error(host);
> > > >   		host->data->error = -EIO;
> > > >   		if (host->ops->adma_workaround)
> > > 
> > > Further debug shows:
> > > 
> > > coherent=0 - sdhci device is not cache coherent
> > > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
> > > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
> > > pmd=000000237fffb003, pte=00e800236d62270f
> > > 
> > > The mapping for the ADMA table seems to be using MAIR index 3, which is
> > > MT_MEMORY_NC, so should be non-cacheable.
> > > 
> > > vmallocinfo:
> > > 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
> > > user
> > > 
> > > So this memory has been remapped.  Could there be an alias that has
> > > cache lines still in the cache for the physical address, and could we
> > > be hitting those cache lines while accessing through a non-cacheable
> > > mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
> > > definitely _feels_ like it has unpredictable attributes!)
> > > 
> > > Also, given that this memory is mapped NC, then surely
> > > __dma_flush_area() should have no effect?  However, it _does_ have the
> > > effect of reliably solving the problem, which to me implies that there
> > > _are_ cache lines in this NC mapping.
> > 
> > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
> > alloc_pages for single pages") which has been implicated in the same
> > problem here:
> > 
> > https://www.spinics.net/lists/arm-kernel/msg750623.html
> > 
> > Although reverting the commit is not clean, this also fixes the issue
> > for me.
> 
> Note that that one turned out to be something totally different, namely that
> the single-page allocations, in difference to CMA, came from physical
> addresses that the controller needed additional configuration to be able to
> access[1] - no amount of cache maintenance would affect that.

To be honest, the conclusion in that other thread wasn't exactly satisfying.
The reporter says "Probably, my device is not 64-bit capable." and the fix
changes the buffer allocation enough that I wouldn't rule out the same
coherency issue as being the root cause. I don't think we ever tried adding
cache flushing to see if it helped.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:43                   ` Fabio Estevam
@ 2019-09-17 13:51                     ` Russell King - ARM Linux admin
  2019-09-17 13:56                       ` Fabio Estevam
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 13:51 UTC (permalink / raw)
  To: Fabio Estevam
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

On Tue, Sep 17, 2019 at 10:43:35AM -0300, Fabio Estevam wrote:
> On Tue, Sep 17, 2019 at 10:33 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> 
> > That attempts to set bit 12 of the host control register 2 (0x3e).
> > The LX2160A documentation states that bit 28 of 0x3c (they're 32-bit
> > wide registers there) is "reserved".
> >
> > So, you're asking for a documented reserved bit to be set...
> 
> Correct, v4 is not supported here indeed.
> 
> From the LX2160 doc:
> "Conforms to the SD Host Controller Standard Specification version 3.0"

The pressing question seems to be this:

Are the eSDHC on the LX2160A DMA coherent or are they not?

Any chances of finding out internally what the true answer to that,
rather than me poking about trying stuff experimentally?  Having a
definitive answer for a potentially data-corrupting change would
be really good...

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:50                 ` Will Deacon
@ 2019-09-17 13:55                   ` Robin Murphy
  2019-09-17 14:12                     ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Robin Murphy @ 2019-09-17 13:55 UTC (permalink / raw)
  To: Will Deacon
  Cc: dann frazier, Will Deacon, Russell King - ARM Linux admin,
	Adrian Hunter, Nicolin Chen, linux-mmc, Christoph Hellwig,
	Linux ARM

On 17/09/2019 14:50, Will Deacon wrote:
> On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote:
>> On 17/09/2019 14:07, Russell King - ARM Linux admin wrote:
>>> On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote:
>>>> On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote:
>>>>> On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
>>>>>> On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
>>>>>>> On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
>>>>>>>> On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
>>>>>>>>> On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
>>>>>>>>>
>>>>>>>>>> The platform has an iommu, which is in pass-through mode, via
>>>>>>>>>> arm_smmu.disable_bypass=0.
>>>>>>>>>
>>>>>>>>> Could be 954a03be033c7cef80ddc232e7cbdb17df735663
>>>>>>>>> "iommu/arm-smmu: Break insecure users by disabling bypass by default"
>>>>>>>>>
>>>>>>>>> Although it had already landed in v5.2
>>>>>>>>
>>>>>>>> It is not - and the two lines that you quoted above are sufficient
>>>>>>>> to negate that as a cause.  (Please read the help for the option that
>>>>>>>> the commit referrs to.)
>>>>>>>>
>>>>>>>> In fact, with bypass disabled, the SoC fails due to other masters.
>>>>>>>> That's already been discussed privately between myself and Will
>>>>>>>> Deacon.
>>>>>>>>
>>>>>>>> arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
>>>>>>>> the default setting in the Kconfig.
>>>>>>>
>>>>>>> Adding some further debugging, and fixing the existing ADMA debugging
>>>>>>> shows:
>>>>>>>
>>>>>>> mmc0: ADMA error: 0x02000000
>>>>>>>
>>>>>>> So this is an ADMA error without the transfer having completed.
>>>>>>>
>>>>>>> mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
>>>>>>>
>>>>>>> The block size is 8, with one block.
>>>>>>>
>>>>>>> mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
>>>>>>>
>>>>>>> The ADMA error is a descriptor error at address 0x000000236df1d20c.
>>>>>>> The descriptor table contains (including the following entry):
>>>>>>>
>>>>>>> mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
>>>>>>> mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
>>>>>>>
>>>>>>> The descriptor table contains one descriptor of 8 bytes, is marked
>>>>>>> as the last (END bit set) and is at DMA address 0x236df1d200.  The
>>>>>>> following descriptor is empty, with VALID=0.
>>>>>>>
>>>>>>> One may be tempted to blame it on the following descriptor, but having
>>>>>>> had another example on eMMC while userspace was booting (rootfs on
>>>>>>> eMMC):
>>>>>>>
>>>>>>> mmc1: ADMA error: 0x02000000
>>>>>>> mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
>>>>>>> mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
>>>>>>> mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
>>>>>>> mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
>>>>>>>
>>>>>>> ... which is interesting for several reasons:
>>>>>>> - The ADMA error register indicates a length mismatch error.  The
>>>>>>>     transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
>>>>>>>     Summing the ADMA lengths up to the last descriptor (length=0 is
>>>>>>>     0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
>>>>>>>     bytes than the requested transfer.
>>>>>>>
>>>>>>> - The ADMA error register indicates ST_CADR, which is described as
>>>>>>>     "This state is never set because do not generate ADMA error in this
>>>>>>>     state."
>>>>>>>
>>>>>>> - The error descriptor is again after the descriptor with END=1, but
>>>>>>>     this time has VALID=1.
>>>>>>>
>>>>>>> This _feels_ like a coherency issue, where the SDHCI engine is not
>>>>>>> correctly seeing the descriptor table, but then I would have expected
>>>>>>> userspace (which is basically debian stable) to fail to boot every
>>>>>>> time given that its rootfs is on eMMC.
>>>>>>>
>>>>>>> The other weird thing is if I wind the core MMC code back via:
>>>>>>>
>>>>>>> $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
>>>>>>>
>>>>>>> and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
>>>>>>> completely stable, but way better than plain v5.3.  I don't see
>>>>>>> much in that diff which would be responsible for this - although it
>>>>>>> does seem that hch's DMA changes do make the problem more likely.
>>>>>>> (going from 1 in 3 boots with a problem to being not able to boot.)
>>>>>>>
>>>>>>> Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
>>>>>>> bypass mode on the IOMMU (but then I saw global smmu errors right
>>>>>>> from when the IOMMU had bypass disabled before MMC was probed - the
>>>>>>> reason being is the SoC is not currently setup to have the MMU
>>>>>>> bypass mode disabled.)
>>>>>>
>>>>>> This looks like an ARM64 coherency issue.
>>>>>>
>>>>>> I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
>>>>>> which had no effect.  I then tried adding:
>>>>>>
>>>>>> +       __dma_flush_area(host->adma_table, desc - host->adma_table);
>>>>>> +       dma_wmb();
>>>>>>
>>>>>> and so far I haven't had any further ADMA errors.  Adding Will Deacon
>>>>>> to the thread.
>>>>>
>>>>> These are the changes to sdhci that I'm currently running.  I think
>>>>> some of the debugging related changes are probably worth adding to
>>>>> the driver, particularly printing the intmask on ADMA error (which
>>>>> is not printed by the register dump, as the value is lost) and printing
>>>>> the DMA addresses of the descriptor table entries which can be tied
>>>>> up with the DMA address error register.  Also, maybe printing the
>>>>> DMA descriptor table with the register dump, rather than having to
>>>>> resort to enabling debug would be a good idea?
>>>>>
>>>>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>>>>> index a5dc5aae973e..884dcaa9cad5 100644
>>>>> --- a/drivers/mmc/host/sdhci.c
>>>>> +++ b/drivers/mmc/host/sdhci.c
>>>>> @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
>>>>>    		/* Add a terminating entry - nop, end, valid */
>>>>>    		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
>>>>>    	}
>>>>> +	__dma_flush_area(host->adma_table, desc - host->adma_table);
>>>>> +	dma_wmb();
>>>>>    }
>>>>>    static void sdhci_adma_table_post(struct sdhci_host *host,
>>>>> @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
>>>>>    static void sdhci_adma_show_error(struct sdhci_host *host)
>>>>>    {
>>>>>    	void *desc = host->adma_table;
>>>>> +	dma_addr_t dma = host->adma_addr;
>>>>> +	bool end = false;
>>>>>    	sdhci_dumpregs(host);
>>>>> @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
>>>>>    		struct sdhci_adma2_64_desc *dma_desc = desc;
>>>>>    		if (host->flags & SDHCI_USE_64_BIT_DMA)
>>>>> -			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>>>> -			    desc, le32_to_cpu(dma_desc->addr_hi),
>>>>> +			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>>>> +			    (unsigned long long)dma,
>>>>> +			    le32_to_cpu(dma_desc->addr_hi),
>>>>>    			    le32_to_cpu(dma_desc->addr_lo),
>>>>>    			    le16_to_cpu(dma_desc->len),
>>>>>    			    le16_to_cpu(dma_desc->cmd));
>>>>>    		else
>>>>> -			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>>>> -			    desc, le32_to_cpu(dma_desc->addr_lo),
>>>>> +			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>>>> +			    (unsigned long long)dma,
>>>>> +			    le32_to_cpu(dma_desc->addr_lo),
>>>>>    			    le16_to_cpu(dma_desc->len),
>>>>>    			    le16_to_cpu(dma_desc->cmd));
>>>>> +		if (end) break;
>>>>> +
>>>>>    		desc += host->desc_sz;
>>>>> +		dma += host->desc_sz;
>>>>>    		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
>>>>> -			break;
>>>>> +			end = true;
>>>>>    	}
>>>>>    }
>>>>> @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>>>>>    			!= MMC_BUS_TEST_R)
>>>>>    		host->data->error = -EILSEQ;
>>>>>    	else if (intmask & SDHCI_INT_ADMA_ERROR) {
>>>>> -		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
>>>>> +		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
>>>>>    		sdhci_adma_show_error(host);
>>>>>    		host->data->error = -EIO;
>>>>>    		if (host->ops->adma_workaround)
>>>>
>>>> Further debug shows:
>>>>
>>>> coherent=0 - sdhci device is not cache coherent
>>>> swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
>>>> [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
>>>> pmd=000000237fffb003, pte=00e800236d62270f
>>>>
>>>> The mapping for the ADMA table seems to be using MAIR index 3, which is
>>>> MT_MEMORY_NC, so should be non-cacheable.
>>>>
>>>> vmallocinfo:
>>>> 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
>>>> user
>>>>
>>>> So this memory has been remapped.  Could there be an alias that has
>>>> cache lines still in the cache for the physical address, and could we
>>>> be hitting those cache lines while accessing through a non-cacheable
>>>> mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
>>>> definitely _feels_ like it has unpredictable attributes!)
>>>>
>>>> Also, given that this memory is mapped NC, then surely
>>>> __dma_flush_area() should have no effect?  However, it _does_ have the
>>>> effect of reliably solving the problem, which to me implies that there
>>>> _are_ cache lines in this NC mapping.
>>>
>>> Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
>>> alloc_pages for single pages") which has been implicated in the same
>>> problem here:
>>>
>>> https://www.spinics.net/lists/arm-kernel/msg750623.html
>>>
>>> Although reverting the commit is not clean, this also fixes the issue
>>> for me.
>>
>> Note that that one turned out to be something totally different, namely that
>> the single-page allocations, in difference to CMA, came from physical
>> addresses that the controller needed additional configuration to be able to
>> access[1] - no amount of cache maintenance would affect that.
> 
> To be honest, the conclusion in that other thread wasn't exactly satisfying.
> The reporter says "Probably, my device is not 64-bit capable." and the fix
> changes the buffer allocation enough that I wouldn't rule out the same
> coherency issue as being the root cause. I don't think we ever tried adding
> cache flushing to see if it helped.

Huh? The conclusion of that thread seemed pretty clear to me:

https://lore.kernel.org/linux-arm-kernel/CAK7LNASs2qkpGY_BkL--hvmKm3FJ9sEK4+v5VVYc1_CrowAB4w@mail.gmail.com/

which is where I linked that patch from :/

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:51                     ` Russell King - ARM Linux admin
@ 2019-09-17 13:56                       ` Fabio Estevam
       [not found]                         ` <CADRPPNQ-WTY0QC7_bX=N0QeueKve=k0SaMvbjOrByyvzFojz2g@mail.gmail.com>
  0 siblings, 1 reply; 31+ messages in thread
From: Fabio Estevam @ 2019-09-17 13:56 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Li Yang
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

[Adding Li Yang]

On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:

> The pressing question seems to be this:
>
> Are the eSDHC on the LX2160A DMA coherent or are they not?
>
> Any chances of finding out internally what the true answer to that,
> rather than me poking about trying stuff experimentally?  Having a
> definitive answer for a potentially data-corrupting change would
> be really good...

Li Yang,

Could you please help to confirm Russell's question?

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:49                 ` Russell King - ARM Linux admin
@ 2019-09-17 14:03                   ` Robin Murphy
  2019-09-19  9:16                     ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Robin Murphy @ 2019-09-17 14:03 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

On 17/09/2019 14:49, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote:
>> On 17/09/2019 14:07, Russell King - ARM Linux admin wrote:
>>> On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote:
>>>> On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote:
>>>>> On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
>>>>>> On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
>>>>>>> On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
>>>>>>>> On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
>>>>>>>>> On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
>>>>>>>>>
>>>>>>>>>> The platform has an iommu, which is in pass-through mode, via
>>>>>>>>>> arm_smmu.disable_bypass=0.
>>>>>>>>>
>>>>>>>>> Could be 954a03be033c7cef80ddc232e7cbdb17df735663
>>>>>>>>> "iommu/arm-smmu: Break insecure users by disabling bypass by default"
>>>>>>>>>
>>>>>>>>> Although it had already landed in v5.2
>>>>>>>>
>>>>>>>> It is not - and the two lines that you quoted above are sufficient
>>>>>>>> to negate that as a cause.  (Please read the help for the option that
>>>>>>>> the commit referrs to.)
>>>>>>>>
>>>>>>>> In fact, with bypass disabled, the SoC fails due to other masters.
>>>>>>>> That's already been discussed privately between myself and Will
>>>>>>>> Deacon.
>>>>>>>>
>>>>>>>> arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
>>>>>>>> the default setting in the Kconfig.
>>>>>>>
>>>>>>> Adding some further debugging, and fixing the existing ADMA debugging
>>>>>>> shows:
>>>>>>>
>>>>>>> mmc0: ADMA error: 0x02000000
>>>>>>>
>>>>>>> So this is an ADMA error without the transfer having completed.
>>>>>>>
>>>>>>> mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
>>>>>>>
>>>>>>> The block size is 8, with one block.
>>>>>>>
>>>>>>> mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
>>>>>>>
>>>>>>> The ADMA error is a descriptor error at address 0x000000236df1d20c.
>>>>>>> The descriptor table contains (including the following entry):
>>>>>>>
>>>>>>> mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
>>>>>>> mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
>>>>>>>
>>>>>>> The descriptor table contains one descriptor of 8 bytes, is marked
>>>>>>> as the last (END bit set) and is at DMA address 0x236df1d200.  The
>>>>>>> following descriptor is empty, with VALID=0.
>>>>>>>
>>>>>>> One may be tempted to blame it on the following descriptor, but having
>>>>>>> had another example on eMMC while userspace was booting (rootfs on
>>>>>>> eMMC):
>>>>>>>
>>>>>>> mmc1: ADMA error: 0x02000000
>>>>>>> mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
>>>>>>> mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
>>>>>>> mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
>>>>>>> mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
>>>>>>> mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
>>>>>>>
>>>>>>> ... which is interesting for several reasons:
>>>>>>> - The ADMA error register indicates a length mismatch error.  The
>>>>>>>     transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
>>>>>>>     Summing the ADMA lengths up to the last descriptor (length=0 is
>>>>>>>     0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
>>>>>>>     bytes than the requested transfer.
>>>>>>>
>>>>>>> - The ADMA error register indicates ST_CADR, which is described as
>>>>>>>     "This state is never set because do not generate ADMA error in this
>>>>>>>     state."
>>>>>>>
>>>>>>> - The error descriptor is again after the descriptor with END=1, but
>>>>>>>     this time has VALID=1.
>>>>>>>
>>>>>>> This _feels_ like a coherency issue, where the SDHCI engine is not
>>>>>>> correctly seeing the descriptor table, but then I would have expected
>>>>>>> userspace (which is basically debian stable) to fail to boot every
>>>>>>> time given that its rootfs is on eMMC.
>>>>>>>
>>>>>>> The other weird thing is if I wind the core MMC code back via:
>>>>>>>
>>>>>>> $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
>>>>>>>
>>>>>>> and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
>>>>>>> completely stable, but way better than plain v5.3.  I don't see
>>>>>>> much in that diff which would be responsible for this - although it
>>>>>>> does seem that hch's DMA changes do make the problem more likely.
>>>>>>> (going from 1 in 3 boots with a problem to being not able to boot.)
>>>>>>>
>>>>>>> Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
>>>>>>> bypass mode on the IOMMU (but then I saw global smmu errors right
>>>>>>> from when the IOMMU had bypass disabled before MMC was probed - the
>>>>>>> reason being is the SoC is not currently setup to have the MMU
>>>>>>> bypass mode disabled.)
>>>>>>
>>>>>> This looks like an ARM64 coherency issue.
>>>>>>
>>>>>> I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
>>>>>> which had no effect.  I then tried adding:
>>>>>>
>>>>>> +       __dma_flush_area(host->adma_table, desc - host->adma_table);
>>>>>> +       dma_wmb();
>>>>>>
>>>>>> and so far I haven't had any further ADMA errors.  Adding Will Deacon
>>>>>> to the thread.
>>>>>
>>>>> These are the changes to sdhci that I'm currently running.  I think
>>>>> some of the debugging related changes are probably worth adding to
>>>>> the driver, particularly printing the intmask on ADMA error (which
>>>>> is not printed by the register dump, as the value is lost) and printing
>>>>> the DMA addresses of the descriptor table entries which can be tied
>>>>> up with the DMA address error register.  Also, maybe printing the
>>>>> DMA descriptor table with the register dump, rather than having to
>>>>> resort to enabling debug would be a good idea?
>>>>>
>>>>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>>>>> index a5dc5aae973e..884dcaa9cad5 100644
>>>>> --- a/drivers/mmc/host/sdhci.c
>>>>> +++ b/drivers/mmc/host/sdhci.c
>>>>> @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
>>>>>    		/* Add a terminating entry - nop, end, valid */
>>>>>    		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
>>>>>    	}
>>>>> +	__dma_flush_area(host->adma_table, desc - host->adma_table);
>>>>> +	dma_wmb();
>>>>>    }
>>>>>    static void sdhci_adma_table_post(struct sdhci_host *host,
>>>>> @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
>>>>>    static void sdhci_adma_show_error(struct sdhci_host *host)
>>>>>    {
>>>>>    	void *desc = host->adma_table;
>>>>> +	dma_addr_t dma = host->adma_addr;
>>>>> +	bool end = false;
>>>>>    	sdhci_dumpregs(host);
>>>>> @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
>>>>>    		struct sdhci_adma2_64_desc *dma_desc = desc;
>>>>>    		if (host->flags & SDHCI_USE_64_BIT_DMA)
>>>>> -			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>>>> -			    desc, le32_to_cpu(dma_desc->addr_hi),
>>>>> +			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>>>> +			    (unsigned long long)dma,
>>>>> +			    le32_to_cpu(dma_desc->addr_hi),
>>>>>    			    le32_to_cpu(dma_desc->addr_lo),
>>>>>    			    le16_to_cpu(dma_desc->len),
>>>>>    			    le16_to_cpu(dma_desc->cmd));
>>>>>    		else
>>>>> -			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>>>> -			    desc, le32_to_cpu(dma_desc->addr_lo),
>>>>> +			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
>>>>> +			    (unsigned long long)dma,
>>>>> +			    le32_to_cpu(dma_desc->addr_lo),
>>>>>    			    le16_to_cpu(dma_desc->len),
>>>>>    			    le16_to_cpu(dma_desc->cmd));
>>>>> +		if (end) break;
>>>>> +
>>>>>    		desc += host->desc_sz;
>>>>> +		dma += host->desc_sz;
>>>>>    		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
>>>>> -			break;
>>>>> +			end = true;
>>>>>    	}
>>>>>    }
>>>>> @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>>>>>    			!= MMC_BUS_TEST_R)
>>>>>    		host->data->error = -EILSEQ;
>>>>>    	else if (intmask & SDHCI_INT_ADMA_ERROR) {
>>>>> -		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
>>>>> +		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
>>>>>    		sdhci_adma_show_error(host);
>>>>>    		host->data->error = -EIO;
>>>>>    		if (host->ops->adma_workaround)
>>>>
>>>> Further debug shows:
>>>>
>>>> coherent=0 - sdhci device is not cache coherent
>>>> swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
>>>> [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
>>>> pmd=000000237fffb003, pte=00e800236d62270f
>>>>
>>>> The mapping for the ADMA table seems to be using MAIR index 3, which is
>>>> MT_MEMORY_NC, so should be non-cacheable.
>>>>
>>>> vmallocinfo:
>>>> 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
>>>> user
>>>>
>>>> So this memory has been remapped.  Could there be an alias that has
>>>> cache lines still in the cache for the physical address, and could we
>>>> be hitting those cache lines while accessing through a non-cacheable
>>>> mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
>>>> definitely _feels_ like it has unpredictable attributes!)
>>>>
>>>> Also, given that this memory is mapped NC, then surely
>>>> __dma_flush_area() should have no effect?  However, it _does_ have the
>>>> effect of reliably solving the problem, which to me implies that there
>>>> _are_ cache lines in this NC mapping.
>>>
>>> Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
>>> alloc_pages for single pages") which has been implicated in the same
>>> problem here:
>>>
>>> https://www.spinics.net/lists/arm-kernel/msg750623.html
>>>
>>> Although reverting the commit is not clean, this also fixes the issue
>>> for me.
>>
>> Note that that one turned out to be something totally different, namely that
>> the single-page allocations, in difference to CMA, came from physical
>> addresses that the controller needed additional configuration to be able to
>> access[1] - no amount of cache maintenance would affect that.
> 
> As already replied, v4 mode is not documented as being available on
> the LX2160A - the bit in the control register is marked as "reserved".
> This is as expected as it is documented that it is using a v3.00 of
> the SDHCI standard, rather than v4.00.
> 
> So, sorry, enabling "v4 mode" isn't a workaround in this scenario.
> 
> Given that v4 mode is not mandatory, this shouldn't be a work-around.
> 
> Given that it _does_ work some of the time with the table >4GB, then
> this is not an addressing limitation.

Yes, that's what "something totally different" usually means.

>> However, the other difference between getting a single page directly from
>> the page allocator vs. the CMA area is that accesses to the linear mapping
>> of the CMA area are probably pretty rare, whereas for the single-page case
>> it's much more likely that kernel tasks using adjacent pages could lead to
>> prefetching of the descriptor page's cacheable alias. That could certainly
>> explain how reverting that commit manages to hide an apparent coherency
>> issue.
> 
> Right, so how do we fix this?

By describing the hardware correctly in the DT.

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 13:55                   ` Robin Murphy
@ 2019-09-17 14:12                     ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 14:12 UTC (permalink / raw)
  To: Robin Murphy
  Cc: dann frazier, Will Deacon, Adrian Hunter, Nicolin Chen,
	linux-mmc, Will Deacon, Christoph Hellwig, Linux ARM

On Tue, Sep 17, 2019 at 02:55:51PM +0100, Robin Murphy wrote:
> On 17/09/2019 14:50, Will Deacon wrote:
> > On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote:
> > > On 17/09/2019 14:07, Russell King - ARM Linux admin wrote:
> > > > On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote:
> > > > > On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote:
> > > > > > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote:
> > > > > > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote:
> > > > > > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote:
> > > > > > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote:
> > > > > > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote:
> > > > > > > > > > 
> > > > > > > > > > > The platform has an iommu, which is in pass-through mode, via
> > > > > > > > > > > arm_smmu.disable_bypass=0.
> > > > > > > > > > 
> > > > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663
> > > > > > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default"
> > > > > > > > > > 
> > > > > > > > > > Although it had already landed in v5.2
> > > > > > > > > 
> > > > > > > > > It is not - and the two lines that you quoted above are sufficient
> > > > > > > > > to negate that as a cause.  (Please read the help for the option that
> > > > > > > > > the commit referrs to.)
> > > > > > > > > 
> > > > > > > > > In fact, with bypass disabled, the SoC fails due to other masters.
> > > > > > > > > That's already been discussed privately between myself and Will
> > > > > > > > > Deacon.
> > > > > > > > > 
> > > > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of
> > > > > > > > > the default setting in the Kconfig.
> > > > > > > > 
> > > > > > > > Adding some further debugging, and fixing the existing ADMA debugging
> > > > > > > > shows:
> > > > > > > > 
> > > > > > > > mmc0: ADMA error: 0x02000000
> > > > > > > > 
> > > > > > > > So this is an ADMA error without the transfer having completed.
> > > > > > > > 
> > > > > > > > mmc0: sdhci: Blk size:  0x00000008 | Blk cnt:  0x00000001
> > > > > > > > 
> > > > > > > > The block size is 8, with one block.
> > > > > > > > 
> > > > > > > > mmc0: sdhci: ADMA Err:  0x00000009 | ADMA Ptr: 0x000000236df1d20c
> > > > > > > > 
> > > > > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c.
> > > > > > > > The descriptor table contains (including the following entry):
> > > > > > > > 
> > > > > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23
> > > > > > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00
> > > > > > > > 
> > > > > > > > The descriptor table contains one descriptor of 8 bytes, is marked
> > > > > > > > as the last (END bit set) and is at DMA address 0x236df1d200.  The
> > > > > > > > following descriptor is empty, with VALID=0.
> > > > > > > > 
> > > > > > > > One may be tempted to blame it on the following descriptor, but having
> > > > > > > > had another example on eMMC while userspace was booting (rootfs on
> > > > > > > > eMMC):
> > > > > > > > 
> > > > > > > > mmc1: ADMA error: 0x02000000
> > > > > > > > mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000099
> > > > > > > > mmc1: sdhci: ADMA Err:  0x00000006 | ADMA Ptr: 0x000000236dbfa26c
> > > > > > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21
> > > > > > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21
> > > > > > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21
> > > > > > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21
> > > > > > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21
> > > > > > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21
> > > > > > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21
> > > > > > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21
> > > > > > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23
> > > > > > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21
> > > > > > > > 
> > > > > > > > ... which is interesting for several reasons:
> > > > > > > > - The ADMA error register indicates a length mismatch error.  The
> > > > > > > >     transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes.
> > > > > > > >     Summing the ADMA lengths up to the last descriptor (length=0 is
> > > > > > > >     0x10000 bytes) gives 0x20000 bytes.  So the DMA table contains more
> > > > > > > >     bytes than the requested transfer.
> > > > > > > > 
> > > > > > > > - The ADMA error register indicates ST_CADR, which is described as
> > > > > > > >     "This state is never set because do not generate ADMA error in this
> > > > > > > >     state."
> > > > > > > > 
> > > > > > > > - The error descriptor is again after the descriptor with END=1, but
> > > > > > > >     this time has VALID=1.
> > > > > > > > 
> > > > > > > > This _feels_ like a coherency issue, where the SDHCI engine is not
> > > > > > > > correctly seeing the descriptor table, but then I would have expected
> > > > > > > > userspace (which is basically debian stable) to fail to boot every
> > > > > > > > time given that its rootfs is on eMMC.
> > > > > > > > 
> > > > > > > > The other weird thing is if I wind the core MMC code back via:
> > > > > > > > 
> > > > > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R
> > > > > > > > 
> > > > > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not
> > > > > > > > completely stable, but way better than plain v5.3.  I don't see
> > > > > > > > much in that diff which would be responsible for this - although it
> > > > > > > > does seem that hch's DMA changes do make the problem more likely.
> > > > > > > > (going from 1 in 3 boots with a problem to being not able to boot.)
> > > > > > > > 
> > > > > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled
> > > > > > > > bypass mode on the IOMMU (but then I saw global smmu errors right
> > > > > > > > from when the IOMMU had bypass disabled before MMC was probed - the
> > > > > > > > reason being is the SoC is not currently setup to have the MMU
> > > > > > > > bypass mode disabled.)
> > > > > > > 
> > > > > > > This looks like an ARM64 coherency issue.
> > > > > > > 
> > > > > > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(),
> > > > > > > which had no effect.  I then tried adding:
> > > > > > > 
> > > > > > > +       __dma_flush_area(host->adma_table, desc - host->adma_table);
> > > > > > > +       dma_wmb();
> > > > > > > 
> > > > > > > and so far I haven't had any further ADMA errors.  Adding Will Deacon
> > > > > > > to the thread.
> > > > > > 
> > > > > > These are the changes to sdhci that I'm currently running.  I think
> > > > > > some of the debugging related changes are probably worth adding to
> > > > > > the driver, particularly printing the intmask on ADMA error (which
> > > > > > is not printed by the register dump, as the value is lost) and printing
> > > > > > the DMA addresses of the descriptor table entries which can be tied
> > > > > > up with the DMA address error register.  Also, maybe printing the
> > > > > > DMA descriptor table with the register dump, rather than having to
> > > > > > resort to enabling debug would be a good idea?
> > > > > > 
> > > > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> > > > > > index a5dc5aae973e..884dcaa9cad5 100644
> > > > > > --- a/drivers/mmc/host/sdhci.c
> > > > > > +++ b/drivers/mmc/host/sdhci.c
> > > > > > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host,
> > > > > >    		/* Add a terminating entry - nop, end, valid */
> > > > > >    		__sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID);
> > > > > >    	}
> > > > > > +	__dma_flush_area(host->adma_table, desc - host->adma_table);
> > > > > > +	dma_wmb();
> > > > > >    }
> > > > > >    static void sdhci_adma_table_post(struct sdhci_host *host,
> > > > > > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p)
> > > > > >    static void sdhci_adma_show_error(struct sdhci_host *host)
> > > > > >    {
> > > > > >    	void *desc = host->adma_table;
> > > > > > +	dma_addr_t dma = host->adma_addr;
> > > > > > +	bool end = false;
> > > > > >    	sdhci_dumpregs(host);
> > > > > > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
> > > > > >    		struct sdhci_adma2_64_desc *dma_desc = desc;
> > > > > >    		if (host->flags & SDHCI_USE_64_BIT_DMA)
> > > > > > -			DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > > > -			    desc, le32_to_cpu(dma_desc->addr_hi),
> > > > > > +			SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > > > +			    (unsigned long long)dma,
> > > > > > +			    le32_to_cpu(dma_desc->addr_hi),
> > > > > >    			    le32_to_cpu(dma_desc->addr_lo),
> > > > > >    			    le16_to_cpu(dma_desc->len),
> > > > > >    			    le16_to_cpu(dma_desc->cmd));
> > > > > >    		else
> > > > > > -			DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > > > -			    desc, le32_to_cpu(dma_desc->addr_lo),
> > > > > > +			SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n",
> > > > > > +			    (unsigned long long)dma,
> > > > > > +			    le32_to_cpu(dma_desc->addr_lo),
> > > > > >    			    le16_to_cpu(dma_desc->len),
> > > > > >    			    le16_to_cpu(dma_desc->cmd));
> > > > > > +		if (end) break;
> > > > > > +
> > > > > >    		desc += host->desc_sz;
> > > > > > +		dma += host->desc_sz;
> > > > > >    		if (dma_desc->cmd & cpu_to_le16(ADMA2_END))
> > > > > > -			break;
> > > > > > +			end = true;
> > > > > >    	}
> > > > > >    }
> > > > > > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
> > > > > >    			!= MMC_BUS_TEST_R)
> > > > > >    		host->data->error = -EILSEQ;
> > > > > >    	else if (intmask & SDHCI_INT_ADMA_ERROR) {
> > > > > > -		pr_err("%s: ADMA error\n", mmc_hostname(host->mmc));
> > > > > > +		pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask);
> > > > > >    		sdhci_adma_show_error(host);
> > > > > >    		host->data->error = -EIO;
> > > > > >    		if (host->ops->adma_workaround)
> > > > > 
> > > > > Further debug shows:
> > > > > 
> > > > > coherent=0 - sdhci device is not cache coherent
> > > > > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000
> > > > > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003,
> > > > > pmd=000000237fffb003, pte=00e800236d62270f
> > > > > 
> > > > > The mapping for the ADMA table seems to be using MAIR index 3, which is
> > > > > MT_MEMORY_NC, so should be non-cacheable.
> > > > > 
> > > > > vmallocinfo:
> > > > > 0xffffff8010fd5000-0xffffff8010fd7000    8192 dma_direct_alloc+0x4c/0x54
> > > > > user
> > > > > 
> > > > > So this memory has been remapped.  Could there be an alias that has
> > > > > cache lines still in the cache for the physical address, and could we
> > > > > be hitting those cache lines while accessing through a non-cacheable
> > > > > mapping?  (On 32-bit ARM, this is "unpredictable" and this problem
> > > > > definitely _feels_ like it has unpredictable attributes!)
> > > > > 
> > > > > Also, given that this memory is mapped NC, then surely
> > > > > __dma_flush_area() should have no effect?  However, it _does_ have the
> > > > > effect of reliably solving the problem, which to me implies that there
> > > > > _are_ cache lines in this NC mapping.
> > > > 
> > > > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback
> > > > alloc_pages for single pages") which has been implicated in the same
> > > > problem here:
> > > > 
> > > > https://www.spinics.net/lists/arm-kernel/msg750623.html
> > > > 
> > > > Although reverting the commit is not clean, this also fixes the issue
> > > > for me.
> > > 
> > > Note that that one turned out to be something totally different, namely that
> > > the single-page allocations, in difference to CMA, came from physical
> > > addresses that the controller needed additional configuration to be able to
> > > access[1] - no amount of cache maintenance would affect that.
> > 
> > To be honest, the conclusion in that other thread wasn't exactly satisfying.
> > The reporter says "Probably, my device is not 64-bit capable." and the fix
> > changes the buffer allocation enough that I wouldn't rule out the same
> > coherency issue as being the root cause. I don't think we ever tried adding
> > cache flushing to see if it helped.
> 
> Huh? The conclusion of that thread seemed pretty clear to me:
> 
> https://lore.kernel.org/linux-arm-kernel/CAK7LNASs2qkpGY_BkL--hvmKm3FJ9sEK4+v5VVYc1_CrowAB4w@mail.gmail.com/
> 
> which is where I linked that patch from :/

... and it is looking like the conclusion of that was incorrect given
that I'm seeing the same issue on hardware that isn't capable of v4
operation but is capable of addressing the full range of system memory.

I can confirm that the registers that set the ADMA table base are 64-bit
wide in the LX2160A reference manual.

I can also confirm that the eSDHC driver is using 96-bit descriptor
format, which is what is required to support 64-bit DMA addresses, and
the eSDHC hardware supports this format.

So, this does _not_ appear to be some addressing limitation of the
device.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [REGRESSION] sdhci no longer detects SD cards on LX2160A
       [not found]                         ` <CADRPPNQ-WTY0QC7_bX=N0QeueKve=k0SaMvbjOrByyvzFojz2g@mail.gmail.com>
@ 2019-09-19  4:13                           ` Y.b. Lu
  2019-09-19  7:04                             ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Y.b. Lu @ 2019-09-19  4:13 UTC (permalink / raw)
  To: Leo Li, Fabio Estevam, Russell King - ARM Linux admin
  Cc: dann frazier, Will Deacon, Adrian Hunter, Nicolin Chen,
	linux-mmc, Christoph Hellwig, Linux ARM

Sorry. My email was rejected by mailing lists. Let me re-send.

Hi Russell,

I’m not sure what board you were using for LX2160A.
We had an known issue for eSDHC controller and all NXP Layerscape RDB boards.
eSDHC couldn’t provide power-cycle to SD card, and even worse, board reset couldn’t provide power-cycle to SD card either.
But for UHS-I SD card, it’s required to have a power-cycle to reset card if it goes into UHS-I mode. Otherwise, we don’t know what will happen when kernel initializes SD card after a reboot/reset.

I could reproduce that issue with below steps on latest mainline kernel.
1. Power off board, and power on board.
2. Start up kernel, the SD card works fine in UHS-I mode.
3. Reboot/reset board. (This couldn’t provide power-cycle to SD card)
4. Start up kernel, the SD card gets that ADMA error issue.

So could you have a try to power off/power on the board, and then start up kernel. Don’t use reboot, or board reset button.
Or you can remove SD card and start up kernel, and insert SD card when kernel has been started up.
Thanks a lot.

Best regards,
Yangbo Lu


From: Li Yang <leoyang.li@nxp.com> 
Sent: Wednesday, September 18, 2019 1:48 AM
To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann frazier <dann.frazier@canonical.com>; linux-mmc <linux-mmc@vger.kernel.org>
Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A



On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam <mailto:festevam@gmail.com> wrote:
[Adding Li Yang]

On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin
<mailto:linux@armlinux.org.uk> wrote:

> The pressing question seems to be this:
>
> Are the eSDHC on the LX2160A DMA coherent or are they not?
>
> Any chances of finding out internally what the true answer to that,
> rather than me poking about trying stuff experimentally?  Having a
> definitive answer for a potentially data-corrupting change would
> be really good...

Li Yang,

Could you please help to confirm Russell's question?
Adding Yangbo who is working on SDHC.

Regards,
Leo
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-19  4:13                           ` Y.b. Lu
@ 2019-09-19  7:04                             ` Russell King - ARM Linux admin
  2019-09-19  8:15                               ` Y.b. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-19  7:04 UTC (permalink / raw)
  To: Y.b. Lu
  Cc: dann frazier, Will Deacon, Adrian Hunter, Leo Li, Nicolin Chen,
	linux-mmc, Fabio Estevam, Christoph Hellwig, Linux ARM

Hi,

This is not the issue, since the problem has been observed with eMMC
too, and is sporadic in nature.

Please could you answer the question posed: are the eSDHC controllers
DMA coherent or are they not coherent?

Thanks.

On Thu, Sep 19, 2019 at 04:13:20AM +0000, Y.b. Lu wrote:
> Sorry. My email was rejected by mailing lists. Let me re-send.
> 
> Hi Russell,
> 
> I’m not sure what board you were using for LX2160A.
> We had an known issue for eSDHC controller and all NXP Layerscape RDB boards.
> eSDHC couldn’t provide power-cycle to SD card, and even worse, board reset couldn’t provide power-cycle to SD card either.
> But for UHS-I SD card, it’s required to have a power-cycle to reset card if it goes into UHS-I mode. Otherwise, we don’t know what will happen when kernel initializes SD card after a reboot/reset.
> 
> I could reproduce that issue with below steps on latest mainline kernel.
> 1. Power off board, and power on board.
> 2. Start up kernel, the SD card works fine in UHS-I mode.
> 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card)
> 4. Start up kernel, the SD card gets that ADMA error issue.
> 
> So could you have a try to power off/power on the board, and then start up kernel. Don’t use reboot, or board reset button.
> Or you can remove SD card and start up kernel, and insert SD card when kernel has been started up.
> Thanks a lot.
> 
> Best regards,
> Yangbo Lu
> 
> 
> From: Li Yang <leoyang.li@nxp.com> 
> Sent: Wednesday, September 18, 2019 1:48 AM
> To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann frazier <dann.frazier@canonical.com>; linux-mmc <linux-mmc@vger.kernel.org>
> Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
> 
> 
> 
> On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam <mailto:festevam@gmail.com> wrote:
> [Adding Li Yang]
> 
> On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin
> <mailto:linux@armlinux.org.uk> wrote:
> 
> > The pressing question seems to be this:
> >
> > Are the eSDHC on the LX2160A DMA coherent or are they not?
> >
> > Any chances of finding out internally what the true answer to that,
> > rather than me poking about trying stuff experimentally?  Having a
> > definitive answer for a potentially data-corrupting change would
> > be really good...
> 
> Li Yang,
> 
> Could you please help to confirm Russell's question?
> Adding Yangbo who is working on SDHC.
> 
> Regards,
> Leo

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-19  7:04                             ` Russell King - ARM Linux admin
@ 2019-09-19  8:15                               ` Y.b. Lu
  2019-09-19  8:38                                 ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Y.b. Lu @ 2019-09-19  8:15 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: dann frazier, Will Deacon, Adrian Hunter, Leo Li, Nicolin Chen,
	linux-mmc, Fabio Estevam, Christoph Hellwig, Linux ARM

Hi Russell,

The ESDHC_DMA_SNOOP bit is always set in eSDHC driver for DMA.

1b - DMA transactions are snooped by the CPU data cache.
0b - DMA transactions are not snooped by the CPU data cache.

Thanks a lot.

Best regards,
Yangbo Lu

> -----Original Message-----
> From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
> Sent: Thursday, September 19, 2019 3:05 PM
> To: Y.b. Lu <yangbo.lu@nxp.com>
> Cc: Leo Li <leoyang.li@nxp.com>; Fabio Estevam <festevam@gmail.com>;
> Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>;
> Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen
> <nicoleotsuka@gmail.com>; Will Deacon <will.deacon@arm.com>; dann
> frazier <dann.frazier@canonical.com>; linux-mmc
> <linux-mmc@vger.kernel.org>
> Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
> 
> Hi,
> 
> This is not the issue, since the problem has been observed with eMMC too,
> and is sporadic in nature.
> 
> Please could you answer the question posed: are the eSDHC controllers DMA
> coherent or are they not coherent?
> 
> Thanks.
> 
> On Thu, Sep 19, 2019 at 04:13:20AM +0000, Y.b. Lu wrote:
> > Sorry. My email was rejected by mailing lists. Let me re-send.
> >
> > Hi Russell,
> >
> > I’m not sure what board you were using for LX2160A.
> > We had an known issue for eSDHC controller and all NXP Layerscape RDB
> boards.
> > eSDHC couldn’t provide power-cycle to SD card, and even worse, board
> reset couldn’t provide power-cycle to SD card either.
> > But for UHS-I SD card, it’s required to have a power-cycle to reset card if it
> goes into UHS-I mode. Otherwise, we don’t know what will happen when
> kernel initializes SD card after a reboot/reset.
> >
> > I could reproduce that issue with below steps on latest mainline kernel.
> > 1. Power off board, and power on board.
> > 2. Start up kernel, the SD card works fine in UHS-I mode.
> > 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card)
> > 4. Start up kernel, the SD card gets that ADMA error issue.
> >
> > So could you have a try to power off/power on the board, and then start up
> kernel. Don’t use reboot, or board reset button.
> > Or you can remove SD card and start up kernel, and insert SD card when
> kernel has been started up.
> > Thanks a lot.
> >
> > Best regards,
> > Yangbo Lu
> >
> >
> > From: Li Yang <leoyang.li@nxp.com>
> > Sent: Wednesday, September 18, 2019 1:48 AM
> > To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com>
> > Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig
> > <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>;
> > Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin
> > <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann
> > frazier <dann.frazier@canonical.com>; linux-mmc
> > <linux-mmc@vger.kernel.org>
> > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
> >
> >
> >
> > On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam
> <mailto:festevam@gmail.com> wrote:
> > [Adding Li Yang]
> >
> > On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin
> > <mailto:linux@armlinux.org.uk> wrote:
> >
> > > The pressing question seems to be this:
> > >
> > > Are the eSDHC on the LX2160A DMA coherent or are they not?
> > >
> > > Any chances of finding out internally what the true answer to that,
> > > rather than me poking about trying stuff experimentally?  Having a
> > > definitive answer for a potentially data-corrupting change would be
> > > really good...
> >
> > Li Yang,
> >
> > Could you please help to confirm Russell's question?
> > Adding Yangbo who is working on SDHC.
> >
> > Regards,
> > Leo
> 
> --
> RMK's Patch system:
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ar
> mlinux.org.uk%2Fdeveloper%2Fpatches%2F&amp;data=02%7C01%7Cyangbo.l
> u%40nxp.com%7C7eca2b9652104c95a52008d73ccfa99a%7C686ea1d3bc2b4
> c6fa92cd99c5c301635%7C0%7C0%7C637044734911465102&amp;sdata=QB
> SEzA9L2HC99gm65P965E3o%2FhNM18u2SouOZxTEs6s%3D&amp;reserved=0
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps
> up According to speedtest.net: 11.9Mbps down 500kbps up
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-19  8:15                               ` Y.b. Lu
@ 2019-09-19  8:38                                 ` Russell King - ARM Linux admin
  2019-09-19  9:22                                   ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-19  8:38 UTC (permalink / raw)
  To: Y.b. Lu
  Cc: dann frazier, Will Deacon, Adrian Hunter, Leo Li, Nicolin Chen,
	linux-mmc, Fabio Estevam, Christoph Hellwig, Linux ARM

Hi,

Thanks for the reply.

I see that this bit is marked "reserved" in the LX2160A reference
manual.

This brings up some further questions.

The DT property "dma-coherent" is used to tell the OS whether the
device is DMA coherent or not.  If this property is missing, but the
device is set as DMA coherent, and the OS uses "normal, non-cacheable"
memory for the ADMA table, then errors can occur if there are stale
cache lines corresponding to the memory used.  The eSDHC controller
will see the stale cache lines, but the CPU will not.

Adding "dma-coherent" to the DT declarations alone does not seem to
be the right solution - if we have an OS that does not set the
ESDHC_DMA_SNOOP bit, then we have a similar issue.

Shouldn't ESDHC_DMA_SNOOP be set depending on whether the device is
DMA coherent or not?

Note that the device is _not_ marked as "dma-coherent" in either
mainline nor in the LSDK-19.06-V4.19 branch of
https://source.codeaurora.org/external/qoriq/qoriq-components/linux
to avoid ADMA descriptor fetch errors, which leads to this error that
has now been observed with v5.3 kernels - caused precisely as I
describe above.

Thanks.

On Thu, Sep 19, 2019 at 08:15:00AM +0000, Y.b. Lu wrote:
> Hi Russell,
> 
> The ESDHC_DMA_SNOOP bit is always set in eSDHC driver for DMA.
> 
> 1b - DMA transactions are snooped by the CPU data cache.
> 0b - DMA transactions are not snooped by the CPU data cache.
> 
> Thanks a lot.
> 
> Best regards,
> Yangbo Lu
> 
> > -----Original Message-----
> > From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
> > Sent: Thursday, September 19, 2019 3:05 PM
> > To: Y.b. Lu <yangbo.lu@nxp.com>
> > Cc: Leo Li <leoyang.li@nxp.com>; Fabio Estevam <festevam@gmail.com>;
> > Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>;
> > Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen
> > <nicoleotsuka@gmail.com>; Will Deacon <will.deacon@arm.com>; dann
> > frazier <dann.frazier@canonical.com>; linux-mmc
> > <linux-mmc@vger.kernel.org>
> > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
> > 
> > Hi,
> > 
> > This is not the issue, since the problem has been observed with eMMC too,
> > and is sporadic in nature.
> > 
> > Please could you answer the question posed: are the eSDHC controllers DMA
> > coherent or are they not coherent?
> > 
> > Thanks.
> > 
> > On Thu, Sep 19, 2019 at 04:13:20AM +0000, Y.b. Lu wrote:
> > > Sorry. My email was rejected by mailing lists. Let me re-send.
> > >
> > > Hi Russell,
> > >
> > > I’m not sure what board you were using for LX2160A.
> > > We had an known issue for eSDHC controller and all NXP Layerscape RDB
> > boards.
> > > eSDHC couldn’t provide power-cycle to SD card, and even worse, board
> > reset couldn’t provide power-cycle to SD card either.
> > > But for UHS-I SD card, it’s required to have a power-cycle to reset card if it
> > goes into UHS-I mode. Otherwise, we don’t know what will happen when
> > kernel initializes SD card after a reboot/reset.
> > >
> > > I could reproduce that issue with below steps on latest mainline kernel.
> > > 1. Power off board, and power on board.
> > > 2. Start up kernel, the SD card works fine in UHS-I mode.
> > > 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card)
> > > 4. Start up kernel, the SD card gets that ADMA error issue.
> > >
> > > So could you have a try to power off/power on the board, and then start up
> > kernel. Don’t use reboot, or board reset button.
> > > Or you can remove SD card and start up kernel, and insert SD card when
> > kernel has been started up.
> > > Thanks a lot.
> > >
> > > Best regards,
> > > Yangbo Lu
> > >
> > >
> > > From: Li Yang <leoyang.li@nxp.com>
> > > Sent: Wednesday, September 18, 2019 1:48 AM
> > > To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com>
> > > Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig
> > > <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>;
> > > Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin
> > > <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann
> > > frazier <dann.frazier@canonical.com>; linux-mmc
> > > <linux-mmc@vger.kernel.org>
> > > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
> > >
> > >
> > >
> > > On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam
> > <mailto:festevam@gmail.com> wrote:
> > > [Adding Li Yang]
> > >
> > > On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin
> > > <mailto:linux@armlinux.org.uk> wrote:
> > >
> > > > The pressing question seems to be this:
> > > >
> > > > Are the eSDHC on the LX2160A DMA coherent or are they not?
> > > >
> > > > Any chances of finding out internally what the true answer to that,
> > > > rather than me poking about trying stuff experimentally?  Having a
> > > > definitive answer for a potentially data-corrupting change would be
> > > > really good...
> > >
> > > Li Yang,
> > >
> > > Could you please help to confirm Russell's question?
> > > Adding Yangbo who is working on SDHC.
> > >
> > > Regards,
> > > Leo
> > 
> > --
> > RMK's Patch system:
> > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ar
> > mlinux.org.uk%2Fdeveloper%2Fpatches%2F&amp;data=02%7C01%7Cyangbo.l
> > u%40nxp.com%7C7eca2b9652104c95a52008d73ccfa99a%7C686ea1d3bc2b4
> > c6fa92cd99c5c301635%7C0%7C0%7C637044734911465102&amp;sdata=QB
> > SEzA9L2HC99gm65P965E3o%2FhNM18u2SouOZxTEs6s%3D&amp;reserved=0
> > FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps
> > up According to speedtest.net: 11.9Mbps down 500kbps up

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-17 14:03                   ` Robin Murphy
@ 2019-09-19  9:16                     ` Russell King - ARM Linux admin
  2019-09-19 14:02                       ` Robin Murphy
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-19  9:16 UTC (permalink / raw)
  To: Robin Murphy, Y.b. Lu
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

On Tue, Sep 17, 2019 at 03:03:29PM +0100, Robin Murphy wrote:
> On 17/09/2019 14:49, Russell King - ARM Linux admin wrote:
> > As already replied, v4 mode is not documented as being available on
> > the LX2160A - the bit in the control register is marked as "reserved".
> > This is as expected as it is documented that it is using a v3.00 of
> > the SDHCI standard, rather than v4.00.
> > 
> > So, sorry, enabling "v4 mode" isn't a workaround in this scenario.
> > 
> > Given that v4 mode is not mandatory, this shouldn't be a work-around.
> > 
> > Given that it _does_ work some of the time with the table >4GB, then
> > this is not an addressing limitation.
> 
> Yes, that's what "something totally different" usually means.
> 
> > > However, the other difference between getting a single page directly from
> > > the page allocator vs. the CMA area is that accesses to the linear mapping
> > > of the CMA area are probably pretty rare, whereas for the single-page case
> > > it's much more likely that kernel tasks using adjacent pages could lead to
> > > prefetching of the descriptor page's cacheable alias. That could certainly
> > > explain how reverting that commit manages to hide an apparent coherency
> > > issue.
> > 
> > Right, so how do we fix this?
> 
> By describing the hardware correctly in the DT.

It would appear that it _is_ correctly described given the default
hardware configuration, but the driver sets a bit in a control
register that enables cache snooping.

Adding "dma-coherent" to the DT description does not seem to be the
correct solution, as we are reliant on the DT description and driver
implementation both agreeing, which is fragile.

From what I can see, there isn't a way for a driver to say "I've made
this device is coherent now" and I suspect making the driver set the
DMA snoop bit depending on whether "dma-coherent" is present in DT or
not will cause data-corrupting regressions for other people.

So, we're back to where we started - what is the right solution to
this problem?

The only thing I can think is that the driver needs to do something
like:

	WARN_ON(!dev_is_dma_coherent(dev));

in esdhc_of_enable_dma() as a first step, and ensuring that the snoop
bit matches the state of dev_is_dma_coherent(dev)?  Is it permitted to
use dev_is_dma_coherent() in drivers - it doesn't seem to be part of
the normal DMA API?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-19  8:38                                 ` Russell King - ARM Linux admin
@ 2019-09-19  9:22                                   ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-19  9:22 UTC (permalink / raw)
  To: Y.b. Lu
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Leo Li,
	Nicolin Chen, Fabio Estevam, Christoph Hellwig, Linux ARM

On Thu, Sep 19, 2019 at 09:38:04AM +0100, Russell King - ARM Linux admin wrote:
> Hi,
> 
> Thanks for the reply.
> 
> I see that this bit is marked "reserved" in the LX2160A reference
> manual.

Sorry, I was lookin at bit 5, it's actually bit 6.  Rest of my mail is
still relevent.

> 
> This brings up some further questions.
> 
> The DT property "dma-coherent" is used to tell the OS whether the
> device is DMA coherent or not.  If this property is missing, but the
> device is set as DMA coherent, and the OS uses "normal, non-cacheable"
> memory for the ADMA table, then errors can occur if there are stale
> cache lines corresponding to the memory used.  The eSDHC controller
> will see the stale cache lines, but the CPU will not.
> 
> Adding "dma-coherent" to the DT declarations alone does not seem to
> be the right solution - if we have an OS that does not set the
> ESDHC_DMA_SNOOP bit, then we have a similar issue.
> 
> Shouldn't ESDHC_DMA_SNOOP be set depending on whether the device is
> DMA coherent or not?
> 
> Note that the device is _not_ marked as "dma-coherent" in either
> mainline nor in the LSDK-19.06-V4.19 branch of
> https://source.codeaurora.org/external/qoriq/qoriq-components/linux
> to avoid ADMA descriptor fetch errors, which leads to this error that
> has now been observed with v5.3 kernels - caused precisely as I
> describe above.
> 
> Thanks.
> 
> On Thu, Sep 19, 2019 at 08:15:00AM +0000, Y.b. Lu wrote:
> > Hi Russell,
> > 
> > The ESDHC_DMA_SNOOP bit is always set in eSDHC driver for DMA.
> > 
> > 1b - DMA transactions are snooped by the CPU data cache.
> > 0b - DMA transactions are not snooped by the CPU data cache.
> > 
> > Thanks a lot.
> > 
> > Best regards,
> > Yangbo Lu
> > 
> > > -----Original Message-----
> > > From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
> > > Sent: Thursday, September 19, 2019 3:05 PM
> > > To: Y.b. Lu <yangbo.lu@nxp.com>
> > > Cc: Leo Li <leoyang.li@nxp.com>; Fabio Estevam <festevam@gmail.com>;
> > > Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>;
> > > Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen
> > > <nicoleotsuka@gmail.com>; Will Deacon <will.deacon@arm.com>; dann
> > > frazier <dann.frazier@canonical.com>; linux-mmc
> > > <linux-mmc@vger.kernel.org>
> > > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
> > > 
> > > Hi,
> > > 
> > > This is not the issue, since the problem has been observed with eMMC too,
> > > and is sporadic in nature.
> > > 
> > > Please could you answer the question posed: are the eSDHC controllers DMA
> > > coherent or are they not coherent?
> > > 
> > > Thanks.
> > > 
> > > On Thu, Sep 19, 2019 at 04:13:20AM +0000, Y.b. Lu wrote:
> > > > Sorry. My email was rejected by mailing lists. Let me re-send.
> > > >
> > > > Hi Russell,
> > > >
> > > > I’m not sure what board you were using for LX2160A.
> > > > We had an known issue for eSDHC controller and all NXP Layerscape RDB
> > > boards.
> > > > eSDHC couldn’t provide power-cycle to SD card, and even worse, board
> > > reset couldn’t provide power-cycle to SD card either.
> > > > But for UHS-I SD card, it’s required to have a power-cycle to reset card if it
> > > goes into UHS-I mode. Otherwise, we don’t know what will happen when
> > > kernel initializes SD card after a reboot/reset.
> > > >
> > > > I could reproduce that issue with below steps on latest mainline kernel.
> > > > 1. Power off board, and power on board.
> > > > 2. Start up kernel, the SD card works fine in UHS-I mode.
> > > > 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card)
> > > > 4. Start up kernel, the SD card gets that ADMA error issue.
> > > >
> > > > So could you have a try to power off/power on the board, and then start up
> > > kernel. Don’t use reboot, or board reset button.
> > > > Or you can remove SD card and start up kernel, and insert SD card when
> > > kernel has been started up.
> > > > Thanks a lot.
> > > >
> > > > Best regards,
> > > > Yangbo Lu
> > > >
> > > >
> > > > From: Li Yang <leoyang.li@nxp.com>
> > > > Sent: Wednesday, September 18, 2019 1:48 AM
> > > > To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com>
> > > > Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig
> > > > <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>;
> > > > Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin
> > > > <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann
> > > > frazier <dann.frazier@canonical.com>; linux-mmc
> > > > <linux-mmc@vger.kernel.org>
> > > > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
> > > >
> > > >
> > > >
> > > > On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam
> > > <mailto:festevam@gmail.com> wrote:
> > > > [Adding Li Yang]
> > > >
> > > > On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin
> > > > <mailto:linux@armlinux.org.uk> wrote:
> > > >
> > > > > The pressing question seems to be this:
> > > > >
> > > > > Are the eSDHC on the LX2160A DMA coherent or are they not?
> > > > >
> > > > > Any chances of finding out internally what the true answer to that,
> > > > > rather than me poking about trying stuff experimentally?  Having a
> > > > > definitive answer for a potentially data-corrupting change would be
> > > > > really good...
> > > >
> > > > Li Yang,
> > > >
> > > > Could you please help to confirm Russell's question?
> > > > Adding Yangbo who is working on SDHC.
> > > >
> > > > Regards,
> > > > Leo
> > > 
> > > --
> > > RMK's Patch system:
> > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ar
> > > mlinux.org.uk%2Fdeveloper%2Fpatches%2F&amp;data=02%7C01%7Cyangbo.l
> > > u%40nxp.com%7C7eca2b9652104c95a52008d73ccfa99a%7C686ea1d3bc2b4
> > > c6fa92cd99c5c301635%7C0%7C0%7C637044734911465102&amp;sdata=QB
> > > SEzA9L2HC99gm65P965E3o%2FhNM18u2SouOZxTEs6s%3D&amp;reserved=0
> > > FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps
> > > up According to speedtest.net: 11.9Mbps down 500kbps up
> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-19  9:16                     ` Russell King - ARM Linux admin
@ 2019-09-19 14:02                       ` Robin Murphy
  2019-09-19 17:23                         ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Robin Murphy @ 2019-09-19 14:02 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Y.b. Lu
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Christoph Hellwig, Linux ARM

On 19/09/2019 10:16, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 03:03:29PM +0100, Robin Murphy wrote:
>> On 17/09/2019 14:49, Russell King - ARM Linux admin wrote:
>>> As already replied, v4 mode is not documented as being available on
>>> the LX2160A - the bit in the control register is marked as "reserved".
>>> This is as expected as it is documented that it is using a v3.00 of
>>> the SDHCI standard, rather than v4.00.
>>>
>>> So, sorry, enabling "v4 mode" isn't a workaround in this scenario.
>>>
>>> Given that v4 mode is not mandatory, this shouldn't be a work-around.
>>>
>>> Given that it _does_ work some of the time with the table >4GB, then
>>> this is not an addressing limitation.
>>
>> Yes, that's what "something totally different" usually means.
>>
>>>> However, the other difference between getting a single page directly from
>>>> the page allocator vs. the CMA area is that accesses to the linear mapping
>>>> of the CMA area are probably pretty rare, whereas for the single-page case
>>>> it's much more likely that kernel tasks using adjacent pages could lead to
>>>> prefetching of the descriptor page's cacheable alias. That could certainly
>>>> explain how reverting that commit manages to hide an apparent coherency
>>>> issue.
>>>
>>> Right, so how do we fix this?
>>
>> By describing the hardware correctly in the DT.
> 
> It would appear that it _is_ correctly described given the default
> hardware configuration, but the driver sets a bit in a control
> register that enables cache snooping.

Oh, fun. FWIW, the more general form of that statement would be "by 
ensuring that the device behaviour and the DT description are 
consistent", it's just rare to have both degrees of freedom.

Even in these cases, though, it tends to be ultimately necessary to 
defer to what the DT says, because there can be situations where the IP 
believes itself capable of enabling snoops, but the integration failed 
to wire things up correctly for them to actually work. I know we have to 
deal with that in arm-smmu, for one example.

> Adding "dma-coherent" to the DT description does not seem to be the
> correct solution, as we are reliant on the DT description and driver
> implementation both agreeing, which is fragile.
> 
>  From what I can see, there isn't a way for a driver to say "I've made
> this device is coherent now" and I suspect making the driver set the
> DMA snoop bit depending on whether "dma-coherent" is present in DT or
> not will cause data-corrupting regressions for other people.
> 
> So, we're back to where we started - what is the right solution to
> this problem?
> 
> The only thing I can think is that the driver needs to do something
> like:
> 
> 	WARN_ON(!dev_is_dma_coherent(dev));
> 
> in esdhc_of_enable_dma() as a first step, and ensuring that the snoop
> bit matches the state of dev_is_dma_coherent(dev)?  Is it permitted to
> use dev_is_dma_coherent() in drivers - it doesn't seem to be part of
> the normal DMA API?

The safest option would be to query the firmware property layer via 
device_get_dma_attr() - or potentially short-cut to of_dma_is_coherent() 
for a pure DT driver. Even disregarding API purity, I don't think the 
DMA API internals are really generic enough yet to reliably poke at 
(although FWIW, *certain* cases like dma_direct_ops would now actually 
work as expected if one did the unspeakable and flipped 
dev->dma_coherent from a driver, but that would definitely not win any 
friends).

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-19 14:02                       ` Robin Murphy
@ 2019-09-19 17:23                         ` Russell King - ARM Linux admin
  2019-09-20  9:55                           ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-19 17:23 UTC (permalink / raw)
  To: Robin Murphy
  Cc: dann frazier, Will Deacon, Adrian Hunter, Nicolin Chen, Y.b. Lu,
	linux-mmc, Christoph Hellwig, Linux ARM

On Thu, Sep 19, 2019 at 03:02:39PM +0100, Robin Murphy wrote:
> On 19/09/2019 10:16, Russell King - ARM Linux admin wrote:
> > On Tue, Sep 17, 2019 at 03:03:29PM +0100, Robin Murphy wrote:
> > > On 17/09/2019 14:49, Russell King - ARM Linux admin wrote:
> > > > As already replied, v4 mode is not documented as being available on
> > > > the LX2160A - the bit in the control register is marked as "reserved".
> > > > This is as expected as it is documented that it is using a v3.00 of
> > > > the SDHCI standard, rather than v4.00.
> > > > 
> > > > So, sorry, enabling "v4 mode" isn't a workaround in this scenario.
> > > > 
> > > > Given that v4 mode is not mandatory, this shouldn't be a work-around.
> > > > 
> > > > Given that it _does_ work some of the time with the table >4GB, then
> > > > this is not an addressing limitation.
> > > 
> > > Yes, that's what "something totally different" usually means.
> > > 
> > > > > However, the other difference between getting a single page directly from
> > > > > the page allocator vs. the CMA area is that accesses to the linear mapping
> > > > > of the CMA area are probably pretty rare, whereas for the single-page case
> > > > > it's much more likely that kernel tasks using adjacent pages could lead to
> > > > > prefetching of the descriptor page's cacheable alias. That could certainly
> > > > > explain how reverting that commit manages to hide an apparent coherency
> > > > > issue.
> > > > 
> > > > Right, so how do we fix this?
> > > 
> > > By describing the hardware correctly in the DT.
> > 
> > It would appear that it _is_ correctly described given the default
> > hardware configuration, but the driver sets a bit in a control
> > register that enables cache snooping.
> 
> Oh, fun. FWIW, the more general form of that statement would be "by ensuring
> that the device behaviour and the DT description are consistent", it's just
> rare to have both degrees of freedom.
> 
> Even in these cases, though, it tends to be ultimately necessary to defer to
> what the DT says, because there can be situations where the IP believes
> itself capable of enabling snoops, but the integration failed to wire things
> up correctly for them to actually work. I know we have to deal with that in
> arm-smmu, for one example.
> 
> > Adding "dma-coherent" to the DT description does not seem to be the
> > correct solution, as we are reliant on the DT description and driver
> > implementation both agreeing, which is fragile.
> > 
> >  From what I can see, there isn't a way for a driver to say "I've made
> > this device is coherent now" and I suspect making the driver set the
> > DMA snoop bit depending on whether "dma-coherent" is present in DT or
> > not will cause data-corrupting regressions for other people.
> > 
> > So, we're back to where we started - what is the right solution to
> > this problem?
> > 
> > The only thing I can think is that the driver needs to do something
> > like:
> > 
> > 	WARN_ON(!dev_is_dma_coherent(dev));
> > 
> > in esdhc_of_enable_dma() as a first step, and ensuring that the snoop
> > bit matches the state of dev_is_dma_coherent(dev)?  Is it permitted to
> > use dev_is_dma_coherent() in drivers - it doesn't seem to be part of
> > the normal DMA API?
> 
> The safest option would be to query the firmware property layer via
> device_get_dma_attr() - or potentially short-cut to of_dma_is_coherent() for
> a pure DT driver. Even disregarding API purity, I don't think the DMA API
> internals are really generic enough yet to reliably poke at (although FWIW,
> *certain* cases like dma_direct_ops would now actually work as expected if
> one did the unspeakable and flipped dev->dma_coherent from a driver, but
> that would definitely not win any friends).

So, I prepared a few options, and option 2 was:

 drivers/mmc/host/sdhci-of-esdhc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/sdhci-of-esdhc.c b/drivers/mmc/host/sdhci-of-esdhc.c
index 4dd43b1adf2c..8076a1322499 100644
--- a/drivers/mmc/host/sdhci-of-esdhc.c
+++ b/drivers/mmc/host/sdhci-of-esdhc.c
@@ -19,6 +19,7 @@
 #include <linux/clk.h>
 #include <linux/ktime.h>
 #include <linux/dma-mapping.h>
+#include <linux/dma-noncoherent.h>
 #include <linux/mmc/host.h>
 #include <linux/mmc/mmc.h>
 #include "sdhci-pltfm.h"
@@ -495,7 +496,12 @@ static int esdhc_of_enable_dma(struct sdhci_host *host)
 		dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40));
 
 	value = sdhci_readl(host, ESDHC_DMA_SYSCTL);
-	value |= ESDHC_DMA_SNOOP;
+
+	if (dev_is_dma_coherent(dev))
+		value |= ESDHC_DMA_SNOOP;
+	else
+		value &= ~ESDHC_DMA_SNOOP;
+
 	sdhci_writel(host, value, ESDHC_DMA_SYSCTL);
 	return 0;
 }

The dev_is_dma_coherent() could be changed to something like
device_get_dma_attr() if that's the correct thing to base this
off of.  However, if it returns DEV_DMA_NOT_SUPPORTED, then what?
Assume non-coherent or assume coherent?  What will the DMA API
layer assume?

It seems to me that we want the DMA API layer and the driver to
both agree whether the device is to be coherent or not, and for
the sake of data integrity, we do not want any possibility for
them to deviate in that decision making process.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A
  2019-09-19 17:23                         ` Russell King - ARM Linux admin
@ 2019-09-20  9:55                           ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 31+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-20  9:55 UTC (permalink / raw)
  To: Robin Murphy
  Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon,
	Nicolin Chen, Y.b. Lu, Christoph Hellwig, Linux ARM

On Thu, Sep 19, 2019 at 06:23:49PM +0100, Russell King - ARM Linux admin wrote:
> On Thu, Sep 19, 2019 at 03:02:39PM +0100, Robin Murphy wrote:
> > On 19/09/2019 10:16, Russell King - ARM Linux admin wrote:
> > > On Tue, Sep 17, 2019 at 03:03:29PM +0100, Robin Murphy wrote:
> > > > On 17/09/2019 14:49, Russell King - ARM Linux admin wrote:
> > > > > As already replied, v4 mode is not documented as being available on
> > > > > the LX2160A - the bit in the control register is marked as "reserved".
> > > > > This is as expected as it is documented that it is using a v3.00 of
> > > > > the SDHCI standard, rather than v4.00.
> > > > > 
> > > > > So, sorry, enabling "v4 mode" isn't a workaround in this scenario.
> > > > > 
> > > > > Given that v4 mode is not mandatory, this shouldn't be a work-around.
> > > > > 
> > > > > Given that it _does_ work some of the time with the table >4GB, then
> > > > > this is not an addressing limitation.
> > > > 
> > > > Yes, that's what "something totally different" usually means.
> > > > 
> > > > > > However, the other difference between getting a single page directly from
> > > > > > the page allocator vs. the CMA area is that accesses to the linear mapping
> > > > > > of the CMA area are probably pretty rare, whereas for the single-page case
> > > > > > it's much more likely that kernel tasks using adjacent pages could lead to
> > > > > > prefetching of the descriptor page's cacheable alias. That could certainly
> > > > > > explain how reverting that commit manages to hide an apparent coherency
> > > > > > issue.
> > > > > 
> > > > > Right, so how do we fix this?
> > > > 
> > > > By describing the hardware correctly in the DT.
> > > 
> > > It would appear that it _is_ correctly described given the default
> > > hardware configuration, but the driver sets a bit in a control
> > > register that enables cache snooping.
> > 
> > Oh, fun. FWIW, the more general form of that statement would be "by ensuring
> > that the device behaviour and the DT description are consistent", it's just
> > rare to have both degrees of freedom.
> > 
> > Even in these cases, though, it tends to be ultimately necessary to defer to
> > what the DT says, because there can be situations where the IP believes
> > itself capable of enabling snoops, but the integration failed to wire things
> > up correctly for them to actually work. I know we have to deal with that in
> > arm-smmu, for one example.
> > 
> > > Adding "dma-coherent" to the DT description does not seem to be the
> > > correct solution, as we are reliant on the DT description and driver
> > > implementation both agreeing, which is fragile.
> > > 
> > >  From what I can see, there isn't a way for a driver to say "I've made
> > > this device is coherent now" and I suspect making the driver set the
> > > DMA snoop bit depending on whether "dma-coherent" is present in DT or
> > > not will cause data-corrupting regressions for other people.
> > > 
> > > So, we're back to where we started - what is the right solution to
> > > this problem?
> > > 
> > > The only thing I can think is that the driver needs to do something
> > > like:
> > > 
> > > 	WARN_ON(!dev_is_dma_coherent(dev));
> > > 
> > > in esdhc_of_enable_dma() as a first step, and ensuring that the snoop
> > > bit matches the state of dev_is_dma_coherent(dev)?  Is it permitted to
> > > use dev_is_dma_coherent() in drivers - it doesn't seem to be part of
> > > the normal DMA API?
> > 
> > The safest option would be to query the firmware property layer via
> > device_get_dma_attr() - or potentially short-cut to of_dma_is_coherent() for
> > a pure DT driver. Even disregarding API purity, I don't think the DMA API
> > internals are really generic enough yet to reliably poke at (although FWIW,
> > *certain* cases like dma_direct_ops would now actually work as expected if
> > one did the unspeakable and flipped dev->dma_coherent from a driver, but
> > that would definitely not win any friends).
> 
> So, I prepared a few options, and option 2 was:
> 
>  drivers/mmc/host/sdhci-of-esdhc.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/mmc/host/sdhci-of-esdhc.c b/drivers/mmc/host/sdhci-of-esdhc.c
> index 4dd43b1adf2c..8076a1322499 100644
> --- a/drivers/mmc/host/sdhci-of-esdhc.c
> +++ b/drivers/mmc/host/sdhci-of-esdhc.c
> @@ -19,6 +19,7 @@
>  #include <linux/clk.h>
>  #include <linux/ktime.h>
>  #include <linux/dma-mapping.h>
> +#include <linux/dma-noncoherent.h>
>  #include <linux/mmc/host.h>
>  #include <linux/mmc/mmc.h>
>  #include "sdhci-pltfm.h"
> @@ -495,7 +496,12 @@ static int esdhc_of_enable_dma(struct sdhci_host *host)
>  		dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40));
>  
>  	value = sdhci_readl(host, ESDHC_DMA_SYSCTL);
> -	value |= ESDHC_DMA_SNOOP;
> +
> +	if (dev_is_dma_coherent(dev))
> +		value |= ESDHC_DMA_SNOOP;
> +	else
> +		value &= ~ESDHC_DMA_SNOOP;
> +
>  	sdhci_writel(host, value, ESDHC_DMA_SYSCTL);
>  	return 0;
>  }
> 
> The dev_is_dma_coherent() could be changed to something like
> device_get_dma_attr() if that's the correct thing to base this
> off of.  However, if it returns DEV_DMA_NOT_SUPPORTED, then what?
> Assume non-coherent or assume coherent?  What will the DMA API
> layer assume?
> 
> It seems to me that we want the DMA API layer and the driver to
> both agree whether the device is to be coherent or not, and for
> the sake of data integrity, we do not want any possibility for
> them to deviate in that decision making process.

I think using of_dma_is_coherent() is the safest, as if the driver
needs to be updated to ACPI, the problem will need to be readdressed.
The conditions on which dev->dma_coherent is set by the ACPI code
differs from the conditions that determine the return value of
acpi_get_dma_attr().

So, how about this:

 drivers/mmc/host/sdhci-of-esdhc.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/sdhci-of-esdhc.c b/drivers/mmc/host/sdhci-of-esdhc.c
index 4dd43b1adf2c..74de5e8c45c8 100644
--- a/drivers/mmc/host/sdhci-of-esdhc.c
+++ b/drivers/mmc/host/sdhci-of-esdhc.c
@@ -495,7 +495,12 @@ static int esdhc_of_enable_dma(struct sdhci_host *host)
 		dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40));
 
 	value = sdhci_readl(host, ESDHC_DMA_SYSCTL);
-	value |= ESDHC_DMA_SNOOP;
+
+	if (of_dma_is_coherent(dev->of_node))
+		value |= ESDHC_DMA_SNOOP;
+	else
+		value &= ~ESDHC_DMA_SNOOP;
+
 	sdhci_writel(host, value, ESDHC_DMA_SYSCTL);
 	return 0;
 }
-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-09-20  9:58 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-16 17:15 [REGRESSION] sdhci no longer detects SD cards on LX2160A Russell King - ARM Linux admin
2019-09-16 22:57 ` Russell King - ARM Linux admin
2019-09-17  8:06 ` Marc Gonzalez
2019-09-17  8:19   ` Russell King - ARM Linux admin
2019-09-17 10:42     ` Russell King - ARM Linux admin
2019-09-17 11:16       ` Russell King - ARM Linux admin
2019-09-17 11:42         ` Russell King - ARM Linux admin
2019-09-17 12:33           ` Russell King - ARM Linux admin
2019-09-17 13:03             ` Robin Murphy
2019-09-17 13:28               ` Russell King - ARM Linux admin
2019-09-17 13:07             ` Russell King - ARM Linux admin
2019-09-17 13:24               ` Fabio Estevam
2019-09-17 13:33                 ` Russell King - ARM Linux admin
2019-09-17 13:43                   ` Fabio Estevam
2019-09-17 13:51                     ` Russell King - ARM Linux admin
2019-09-17 13:56                       ` Fabio Estevam
     [not found]                         ` <CADRPPNQ-WTY0QC7_bX=N0QeueKve=k0SaMvbjOrByyvzFojz2g@mail.gmail.com>
2019-09-19  4:13                           ` Y.b. Lu
2019-09-19  7:04                             ` Russell King - ARM Linux admin
2019-09-19  8:15                               ` Y.b. Lu
2019-09-19  8:38                                 ` Russell King - ARM Linux admin
2019-09-19  9:22                                   ` Russell King - ARM Linux admin
2019-09-17 13:38               ` Robin Murphy
2019-09-17 13:49                 ` Russell King - ARM Linux admin
2019-09-17 14:03                   ` Robin Murphy
2019-09-19  9:16                     ` Russell King - ARM Linux admin
2019-09-19 14:02                       ` Robin Murphy
2019-09-19 17:23                         ` Russell King - ARM Linux admin
2019-09-20  9:55                           ` Russell King - ARM Linux admin
2019-09-17 13:50                 ` Will Deacon
2019-09-17 13:55                   ` Robin Murphy
2019-09-17 14:12                     ` Russell King - ARM Linux admin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).