All of lore.kernel.org
 help / color / mirror / Atom feed
* Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
@ 2023-01-18 21:24 sebastian.kloska
  2023-01-19  9:15 ` Adrian Hunter
  2023-01-19 14:38 ` Linux kernel regression tracking (#adding)
  0 siblings, 2 replies; 12+ messages in thread
From: sebastian.kloska @ 2023-01-18 21:24 UTC (permalink / raw)
  To: adrian.hunter, riteshh, asutoshd; +Cc: linux-mmc

Hi,

Hardware: Acer Swift-1 SF114-34-P6U1
SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
eMMC: Kingston A29128

* A couple of seconds after boot access to the eMMC is completely
   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)

* Tested with stable branch linux-5.14.21 vs. linux-5.15.1
   Latest highest version including the bug is a fedora 6.1.6

* Spiked the code with a lot of pr_debug messages. As a side effect the 
driver
   seems to be much more stable, so to me it appears like a timing issue 
where
   the driver fails to wait for a specific state change which seems to 
appear
   less often when  slowed down by logging

* Would love to have a decent reference documentation to the SD 
Controller
   especially a detailed description of the behavior of the register map 
in
  mmc/host/cqhci.h All intel documents I could find seem to describe
  similar but not quite identical register maps

Regards

Sebastian

* dmesg

[  347.583082] mmc0: cqhci: timeout for tag 3, qcnt 4
[  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
[  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  0x00000510
[  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  0x00000000
[  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000016
[  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 0x00000000
[  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 0x00000001
[  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      0x00000000
[  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 0x00000008
[  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00010008
[  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000800
[  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
[  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 0x00000900
[  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
[  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
[  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
[  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
[  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
[  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
[  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
[  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 0x02ff4000
[  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
[  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
[  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
[  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
[  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
0x000000011eef1218
[  347.583224] mmc0: sdhci: ============================================
[  347.583229] mmc0: running CQE recovery
[  347.690108] mmc0: Reset 0x2 never completed.
[  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
[  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
[  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
[  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
[  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
[  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
[  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
[  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
[  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
[  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
[  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
[  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
[  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
0x000000011eef1218
[  347.690157] mmc0: sdhci: ============================================
[  347.790190] mmc0: Reset 0x4 never completed.
[  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
[  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
[  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
[  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
[  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
[  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
[  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
[  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
[  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
[  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
[  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
[  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
[  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
0x000000011eef1218
[  347.790234] mmc0: sdhci: ============================================
[  347.803056] mmc0: Controller never released inhibit bit(s).
[  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
[  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
[  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
[  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
[  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
[  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
[  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
[  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
[  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
[  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
[  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
[  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
[  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
0x000000011eef1218
[  347.803119] mmc0: sdhci: ============================================
  --

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-18 21:24 Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag # sebastian.kloska
@ 2023-01-19  9:15 ` Adrian Hunter
  2023-01-19  9:46   ` sebastian.kloska
  2023-01-19 14:38 ` Linux kernel regression tracking (#adding)
  1 sibling, 1 reply; 12+ messages in thread
From: Adrian Hunter @ 2023-01-19  9:15 UTC (permalink / raw)
  To: sebastian.kloska; +Cc: linux-mmc, riteshh, asutoshd

On 18/01/23 23:24, sebastian.kloska@snafu.de wrote:
> Hi,
> 
> Hardware: Acer Swift-1 SF114-34-P6U1
> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
> eMMC: Kingston A29128
> 
> * A couple of seconds after boot access to the eMMC is completely
>   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)
> 
> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>   Latest highest version including the bug is a fedora 6.1.6

There is very little difference in drivers/mmc between 5.14.21
and 5.15.1. Are you sure the same issue is not present in
5.14.21?

You could try disabling runtime PM to see if that helps:

	echo on > /sys/class/mmc_host/mmc0/device/power/control

Also it should be possible to disable cqhci by adding to the kernel
command line:

	sdhci.debug_quirks=0x40a0000

> 
> * Spiked the code with a lot of pr_debug messages. As a side effect the driver
>   seems to be much more stable, so to me it appears like a timing issue where
>   the driver fails to wait for a specific state change which seems to appear
>   less often when  slowed down by logging

Ideally it would be good to see a full log with dynamic
debugging enabled for the latest kernel.

	To enable mmc debug via kernel command line:

		dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"

	Kernel must be configured:

		CONFIG_DYNAMIC_DEBUG=y

> 
> * Would love to have a decent reference documentation to the SD Controller
>   especially a detailed description of the behavior of the register map in
>  mmc/host/cqhci.h All intel documents I could find seem to describe
>  similar but not quite identical register maps

The JEDEC eMMC specification describes the CQHCI registers.
It looks like you need to register to get access to them
at www.jedec.org.  AFAICT older specs, which would be just
fine, are free to download after registering.

You can get SD Host Controller Simplified Specification
from www.sdcard.org.

> 
> Regards
> 
> Sebastian
> 
> * dmesg
> 
> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt 

The timeout is pretty long, so it seems like the controller
has gotten stuck.  It won't reset the command or data
circuits either, which is impossible if it is behaving
correctly.

> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  0x00000510
> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  0x00000000
> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000016
> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 0x00000000
> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 0x00000001
> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      0x00000000
> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 0x00000008
> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00010008
> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000800
> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 0x00000900
> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 0x02ff4000
> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
> [  347.583224] mmc0: sdhci: ============================================
> [  347.583229] mmc0: running CQE recovery
> [  347.690108] mmc0: Reset 0x2 never completed.
> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
> [  347.690157] mmc0: sdhci: ============================================
> [  347.790190] mmc0: Reset 0x4 never completed.
> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
> [  347.790234] mmc0: sdhci: ============================================
> [  347.803056] mmc0: Controller never released inhibit bit(s).
> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
> [  347.803119] mmc0: sdhci: ============================================
>  --
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-19  9:15 ` Adrian Hunter
@ 2023-01-19  9:46   ` sebastian.kloska
  2023-01-19 10:28     ` Adrian Hunter
  0 siblings, 1 reply; 12+ messages in thread
From: sebastian.kloska @ 2023-01-19  9:46 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: linux-mmc, riteshh, asutoshd

Am 19.01.2023 10:15, schrieb Adrian Hunter:
> On 18/01/23 23:24, sebastian.kloska@snafu.de wrote:
>> Hi,
>> 
>> Hardware: Acer Swift-1 SF114-34-P6U1
>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>> eMMC: Kingston A29128
>> 
>> * A couple of seconds after boot access to the eMMC is completely
>>   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)
>> 
>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>   Latest highest version including the bug is a fedora 6.1.6
> 
> There is very little difference in drivers/mmc between 5.14.21
> and 5.15.1. Are you sure the same issue is not present in
> 5.14.21?

Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after 
~10min
The "slowed down by pr_debug" version sometimes lives for >2h (without 
stress
via badblocks etc)

> 
> You could try disabling runtime PM to see if that helps:
> 
> 	echo on > /sys/class/mmc_host/mmc0/device/power/control
> 
Will try
> Also it should be possible to disable cqhci by adding to the kernel
> command line:
> 
> 	sdhci.debug_quirks=0x40a0000
> 
Will try
>> 
>> * Spiked the code with a lot of pr_debug messages. As a side effect 
>> the driver
>>   seems to be much more stable, so to me it appears like a timing 
>> issue where
>>   the driver fails to wait for a specific state change which seems to 
>> appear
>>   less often when  slowed down by logging
> 
> Ideally it would be good to see a full log with dynamic
> debugging enabled for the latest kernel.

Latest torvalds/linux or stable/linux ?

> 
> 	To enable mmc debug via kernel command line:
> 
> 		dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"
> 
> 	Kernel must be configured:
> 
> 		CONFIG_DYNAMIC_DEBUG=y
> 
>> 
>> * Would love to have a decent reference documentation to the SD 
>> Controller
>>   especially a detailed description of the behavior of the register 
>> map in
>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>  similar but not quite identical register maps
> 
> The JEDEC eMMC specification describes the CQHCI registers.
> It looks like you need to register to get access to them
> at www.jedec.org.  AFAICT older specs, which would be just
> fine, are free to download after registering.
> 
THX
> You can get SD Host Controller Simplified Specification
> from www.sdcard.org.
> 
THX
>> 
>> Regards
>> 
>> Sebastian
>> 
>> * dmesg
>> 
>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
> 
> The timeout is pretty long, so it seems like the controller
> has gotten stuck.  It won't reset the command or data
> circuits either, which is impossible if it is behaving
> correctly.
> 
>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP 
>> ===========
>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  
>> 0x00000510
>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  
>> 0x00000000
>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 
>> 0x00000016
>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 
>> 0x00000000
>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 
>> 0x00000001
>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      
>> 0x00000000
>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 
>> 0x00000008
>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     
>> 0x00010008
>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 
>> 0x00000800
>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    
>> 0x00000000
>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 
>> 0x00000900
>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>> ===========
>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>> 0x00001002
>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>> 0x00000070
>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>> 0x00000023
>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>> 0x0000003c
>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>> 0x00000080
>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>> 0x00000007
>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>> 0x00000000
>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 
>> 0x02ff4000
>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>> 0x00000000
>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>> 0x80000807
>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>> 0x00000000
>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>> 0x00000000
>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>> 0x00000000
>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>> 0x000000011eef1218
>> [  347.583224] mmc0: sdhci: 
>> ============================================
>> [  347.583229] mmc0: running CQE recovery
>> [  347.690108] mmc0: Reset 0x2 never completed.
>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>> ===========
>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>> 0x00001002
>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>> 0x00000070
>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>> 0x00000023
>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>> 0x0000003c
>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>> 0x00000080
>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>> 0x00000007
>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>> 0x00000000
>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>> 0x00ff0003
>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>> 0x00000000
>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>> 0x80000807
>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>> 0x00000000
>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>> 0x00000000
>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>> 0x00000000
>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>> 0x000000011eef1218
>> [  347.690157] mmc0: sdhci: 
>> ============================================
>> [  347.790190] mmc0: Reset 0x4 never completed.
>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>> ===========
>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>> 0x00001002
>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>> 0x00000070
>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>> 0x00000023
>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>> 0x0000003c
>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>> 0x00000080
>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>> 0x00000007
>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>> 0x00000000
>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>> 0x00ff0003
>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>> 0x00000000
>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>> 0x80000807
>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>> 0x00000000
>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>> 0x00000000
>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>> 0x00000000
>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>> 0x000000011eef1218
>> [  347.790234] mmc0: sdhci: 
>> ============================================
>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>> ===========
>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>> 0x00001002
>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>> 0x00000070
>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>> 0x00000023
>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>> 0x0000003c
>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>> 0x00000080
>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>> 0x00000007
>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>> 0x00000000
>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>> 0x00ff0003
>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>> 0x00000000
>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>> 0x80000807
>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>> 0x00000000
>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>> 0x00000000
>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>> 0x00000000
>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>> 0x000000011eef1218
>> [  347.803119] mmc0: sdhci: 
>> ============================================
>>  --
>> 

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-19  9:46   ` sebastian.kloska
@ 2023-01-19 10:28     ` Adrian Hunter
       [not found]       ` <08bd329f23b8fac7a2249034804ac3c7@snafu.de>
  0 siblings, 1 reply; 12+ messages in thread
From: Adrian Hunter @ 2023-01-19 10:28 UTC (permalink / raw)
  To: sebastian.kloska; +Cc: linux-mmc

On 19/01/23 11:46, sebastian.kloska@snafu.de wrote:
> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>> On 18/01/23 23:24, sebastian.kloska@snafu.de wrote:
>>> Hi,
>>>
>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>> eMMC: Kingston A29128
>>>
>>> * A couple of seconds after boot access to the eMMC is completely
>>>   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)
>>>
>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>   Latest highest version including the bug is a fedora 6.1.6
>>
>> There is very little difference in drivers/mmc between 5.14.21
>> and 5.15.1. Are you sure the same issue is not present in
>> 5.14.21?
> 
> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after ~10min
> The "slowed down by pr_debug" version sometimes lives for >2h (without stress
> via badblocks etc)
> 
>>
>> You could try disabling runtime PM to see if that helps:
>>
>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>
> Will try
>> Also it should be possible to disable cqhci by adding to the kernel
>> command line:
>>
>>     sdhci.debug_quirks=0x40a0000
>>
> Will try
>>>
>>> * Spiked the code with a lot of pr_debug messages. As a side effect the driver
>>>   seems to be much more stable, so to me it appears like a timing issue where
>>>   the driver fails to wait for a specific state change which seems to appear
>>>   less often when  slowed down by logging
>>
>> Ideally it would be good to see a full log with dynamic
>> debugging enabled for the latest kernel.
> 
> Latest torvalds/linux or stable/linux ?

Stable would be fine.

> 
>>
>>     To enable mmc debug via kernel command line:
>>
>>         dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"
>>
>>     Kernel must be configured:
>>
>>         CONFIG_DYNAMIC_DEBUG=y
>>
>>>
>>> * Would love to have a decent reference documentation to the SD Controller
>>>   especially a detailed description of the behavior of the register map in
>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>  similar but not quite identical register maps
>>
>> The JEDEC eMMC specification describes the CQHCI registers.
>> It looks like you need to register to get access to them
>> at www.jedec.org.  AFAICT older specs, which would be just
>> fine, are free to download after registering.
>>
> THX
>> You can get SD Host Controller Simplified Specification
>> from www.sdcard.org.
>>
> THX
>>>
>>> Regards
>>>
>>> Sebastian
>>>
>>> * dmesg
>>>
>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>
>> The timeout is pretty long, so it seems like the controller
>> has gotten stuck.  It won't reset the command or data
>> circuits either, which is impossible if it is behaving
>> correctly.
>>
>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  0x00000510
>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  0x00000000
>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000016
>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 0x00000000
>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 0x00000001
>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      0x00000000
>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 0x00000008
>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00010008
>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000800
>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 0x00000900
>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 0x02ff4000
>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>> [  347.583224] mmc0: sdhci: ============================================
>>> [  347.583229] mmc0: running CQE recovery
>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>> [  347.690157] mmc0: sdhci: ============================================
>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>> [  347.790234] mmc0: sdhci: ============================================
>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>> [  347.803119] mmc0: sdhci: ============================================
>>>  --
>>>
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-18 21:24 Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag # sebastian.kloska
  2023-01-19  9:15 ` Adrian Hunter
@ 2023-01-19 14:38 ` Linux kernel regression tracking (#adding)
  1 sibling, 0 replies; 12+ messages in thread
From: Linux kernel regression tracking (#adding) @ 2023-01-19 14:38 UTC (permalink / raw)
  To: sebastian.kloska, adrian.hunter, riteshh, asutoshd; +Cc: linux-mmc

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 18.01.23 22:24, sebastian.kloska@snafu.de wrote:
> 
> Hardware: Acer Swift-1 SF114-34-P6U1
> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
> eMMC: Kingston A29128
> 
> * A couple of seconds after boot access to the eMMC is completely
>   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)
> 
> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>   Latest highest version including the bug is a fedora 6.1.6
> 
> * Spiked the code with a lot of pr_debug messages. As a side effect the
> driver
>   seems to be much more stable, so to me it appears like a timing issue
> where
>   the driver fails to wait for a specific state change which seems to
> appear
>   less often when  slowed down by logging
> 
> * Would love to have a decent reference documentation to the SD Controller
>   especially a detailed description of the behavior of the register map in
>  mmc/host/cqhci.h All intel documents I could find seem to describe
>  similar but not quite identical register maps
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced v5.14..v5.15
#regzbot title mmc: eMMC is completely blocked a few seconds after boot
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
       [not found]         ` <f31a5b1cc0ae7182199b89730891770d@snafu.de>
@ 2023-01-20 12:08           ` Adrian Hunter
  2023-01-21 13:37             ` sebastian.kloska
  0 siblings, 1 reply; 12+ messages in thread
From: Adrian Hunter @ 2023-01-20 12:08 UTC (permalink / raw)
  To: sebastian.kloska; +Cc: linux-mmc

On 20/01/23 13:14, sebastian.kloska@snafu.de wrote:
> So no mail with attachment on linux-mmc@vger.kernel.org
> but I linked the dmesgs on https://bugzilla.kernel.org/show_bug.cgi?id=216255 <https://bugzilla.kernel.org/show_bug.cgi?id=216255>

I did receive the logs but it is good to attach them to the bug anyway.

>  
> Am 19.01.2023 11:28, schrieb Adrian Hunter:
>> On 19/01/23 11:46, sebastian.kloska@snafu.de <mailto:sebastian.kloska@snafu.de> wrote:
>>> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>>>> On 18/01/23 23:24, sebastian.kloska@snafu.de <mailto:sebastian.kloska@snafu.de> wrote:
>>>>> Hi,
>>>>>
>>>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>>>> eMMC: Kingston A29128
>>>>>
>>>>> * A couple of seconds after boot access to the eMMC is completely
>>>>>   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)
>>>>>
>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>>>   Latest highest version including the bug is a fedora 6.1.6
>>>>
>>>> There is very little difference in drivers/mmc between 5.14.21
>>>> and 5.15.1. Are you sure the same issue is not present in
>>>> 5.14.21?
>>>
>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after ~10min
>>> The "slowed down by pr_debug" version sometimes lives for >2h (without stress
>>> via badblocks etc)
>>>
> So all of the following is now done on the latest stable, Which currently
> is 6.2.0-rc4+

That is a release candidate (rc).  The latest stable kernel
is 6.1.7.

>>>>
>>>> You could try disabling runtime PM to see if that helps:
>>>>
>>>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>>>
> Didn't help
> 
>>> Will try
>>>> Also it should be possible to disable cqhci by adding to the kernel
>>>> command line:
>>>>
>>>>     sdhci.debug_quirks=0x40a0000
>>>>
> 
> Weird behavior: Seems to make it worse. System often freezes
> during early boot-up.
> 
> HOWEVER: If I slow down the driver via dynamic debugging it seems to be stable
> Failed to crash it over a couple of hours. You can say it fixes the issue if you
> are willing to accept that 80% of one CPU core is busy with logging ;-)

That is interesting.  Logging can have more side effects than you might expect.
For example the logs might be written back to storage by syslog, which causes
more logging, and more writing etc etc.  That in turn is bumping up the CPU
load.

Increasing the CPU load will decrease the chance of the CPU entering a low-power
state (C-state).

And there was an issue with C-state transitions affecting Intel SDHCI controllers,
so we need to check that possibility.  You could try kernel command line option:

	intel_idle.max_cstate=1

You can see the C-states and whether they are disabled:

	grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable

"disable" is writable by root, so you can disable them that way also.

In the meantime I will take a look at the logs.

> 
>>> Will try
>>>>>
>>>>> * Spiked the code with a lot of pr_debug messages. As a side effect the driver
>>>>>   seems to be much more stable, so to me it appears like a timing issue where
>>>>>   the driver fails to wait for a specific state change which seems to appear
>>>>>   less often when  slowed down by logging
>>>>
>>>> Ideally it would be good to see a full log with dynamic
>>>> debugging enabled for the latest kernel.
>>>
>>> Latest torvalds/linux or stable/linux ?
>>
>> Stable would be fine.
>>
>>>
>>>>
>>>>     To enable mmc debug via kernel command line:
>>>>
>>>>         dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"
>>>>
> 
> This for whatever reason did not work for me on the kernel cmdline
> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control
> so the attached dmesgs are missing early logs.
> 
>>>>     Kernel must be configured:
>>>>
>>>>         CONFIG_DYNAMIC_DEBUG=y
>>>>
>>>>>
>>>>> * Would love to have a decent reference documentation to the SD Controller
>>>>>   especially a detailed description of the behavior of the register map in
>>>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>>>  similar but not quite identical register maps
>>>>
>>>> The JEDEC eMMC specification describes the CQHCI registers.
>>>> It looks like you need to register to get access to them
>>>> at www.jedec.org.  <http://www.jedec.org. > AFAICT older specs, which would be just
>>>> fine, are free to download after registering.
>>>>
> 
> Uhh -- but the price for registration is hefty. THX anyway
> 
>>> THX
>>>> You can get SD Host Controller Simplified Specification
>>>> from www.sdcard.org <http://www.sdcard.org>.
>>>>
> 
> I attach to dmesg logs
> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000
> * The other without dynamic debugging (as said I failed to crash the system) and sdhci.debug_quirks=0x40a0000
> 
> Hope I filtered properly for the relevant stuff
> The second one contains what seems to me a ton of kernel Oops but
> that might be due to the preliminary status of 6.2.0-rc4
> 
> THX
> 
>>> THX
>>>>>
>>>>> Regards
>>>>>
>>>>> Sebastian
>>>>>
>>>>> * dmesg
>>>>>
>>>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>>>
>>>> The timeout is pretty long, so it seems like the controller
>>>> has gotten stuck.  It won't reset the command or data
>>>> circuits either, which is impossible if it is behaving
>>>> correctly.
>>>>
>>>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
>>>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  0x00000510
>>>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  0x00000000
>>>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000016
>>>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 0x00000000
>>>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 0x00000001
>>>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      0x00000000
>>>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 0x00000008
>>>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00010008
>>>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000800
>>>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
>>>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 0x00000900
>>>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 0x02ff4000
>>>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>> [  347.583224] mmc0: sdhci: ============================================
>>>>> [  347.583229] mmc0: running CQE recovery
>>>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>> [  347.690157] mmc0: sdhci: ============================================
>>>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>> [  347.790234] mmc0: sdhci: ============================================
>>>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>> [  347.803119] mmc0: sdhci: ============================================
>>>>>  --
>>>>>
> -- 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-20 12:08           ` Adrian Hunter
@ 2023-01-21 13:37             ` sebastian.kloska
  2023-01-23 10:46               ` Adrian Hunter
  0 siblings, 1 reply; 12+ messages in thread
From: sebastian.kloska @ 2023-01-21 13:37 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: linux-mmc

Am 20.01.2023 13:08, schrieb Adrian Hunter:
> On 20/01/23 13:14, sebastian.kloska@snafu.de wrote:
>> So no mail with attachment on linux-mmc@vger.kernel.org
>> but I linked the dmesgs on 
>> https://bugzilla.kernel.org/show_bug.cgi?id=216255 
>> <https://bugzilla.kernel.org/show_bug.cgi?id=216255>
> 
> I did receive the logs but it is good to attach them to the bug anyway.
> 
>>  
>> Am 19.01.2023 11:28, schrieb Adrian Hunter:
>>> On 19/01/23 11:46, sebastian.kloska@snafu.de 
>>> <mailto:sebastian.kloska@snafu.de> wrote:
>>>> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>>>>> On 18/01/23 23:24, sebastian.kloska@snafu.de 
>>>>> <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>>>>> eMMC: Kingston A29128
>>>>>> 
>>>>>> * A couple of seconds after boot access to the eMMC is completely
>>>>>>   blocked. Can be triggered by extensive I/O (e.g, with badblocks 
>>>>>> -w)
>>>>>> 
>>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>>>>   Latest highest version including the bug is a fedora 6.1.6
>>>>> 
>>>>> There is very little difference in drivers/mmc between 5.14.21
>>>>> and 5.15.1. Are you sure the same issue is not present in
>>>>> 5.14.21?
>>>> 
>>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after 
>>>> ~10min
>>>> The "slowed down by pr_debug" version sometimes lives for >2h 
>>>> (without stress
>>>> via badblocks etc)
>>>> 
>> So all of the following is now done on the latest stable, Which 
>> currently
>> is 6.2.0-rc4+
> 
> That is a release candidate (rc).  The latest stable kernel
> is 6.1.7.
> 

I know -just thought I should pull master/HEAD from the latest stable

>>>>> 
>>>>> You could try disabling runtime PM to see if that helps:
>>>>> 
>>>>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>>>> 
>> Didn't help
>> 
>>>> Will try
>>>>> Also it should be possible to disable cqhci by adding to the kernel
>>>>> command line:
>>>>> 
>>>>>     sdhci.debug_quirks=0x40a0000
>>>>> 
>> 
>> Weird behavior: Seems to make it worse. System often freezes
>> during early boot-up.
>> 
>> HOWEVER: If I slow down the driver via dynamic debugging it seems to 
>> be stable
>> Failed to crash it over a couple of hours. You can say it fixes the 
>> issue if you
>> are willing to accept that 80% of one CPU core is busy with logging ;-)
> 
> That is interesting.  Logging can have more side effects than you might 
> expect.
> For example the logs might be written back to storage by syslog, which 
> causes
> more logging, and more writing etc etc.  That in turn is bumping up the 
> CPU
> load.
> 
> Increasing the CPU load will decrease the chance of the CPU entering a 
> low-power
> state (C-state).
> 
> And there was an issue with C-state transitions affecting Intel SDHCI
> controllers,
> so we need to check that possibility.  You could try kernel command 
> line option:
> 
> 	intel_idle.max_cstate=1
> 

This -- Sir -- works just wonderfully !!

Don't see any changes in 
/sys/devices/system/cpu/cpu*/cpuidle/state*/disable,but
may be that's expected. Kernel runs just fine. works for stable/6.1.7 
and fedora/6.1.6
So for me personally this workaround is fine. Guess it will hurt my
power consumption/battery live though but that's acceptable for me.

Don't hesitate to contact if I should test patches for you. The source
is checked out and ccache is filled.

> You can see the C-states and whether they are disabled:
> 
> 	grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
> 
> "disable" is writable by root, so you can disable them that way also.
> 
> In the meantime I will take a look at the logs.
> 
>> 
>>>> Will try
>>>>>> 
>>>>>> * Spiked the code with a lot of pr_debug messages. As a side 
>>>>>> effect the driver
>>>>>>   seems to be much more stable, so to me it appears like a timing 
>>>>>> issue where
>>>>>>   the driver fails to wait for a specific state change which seems 
>>>>>> to appear
>>>>>>   less often when  slowed down by logging
>>>>> 
>>>>> Ideally it would be good to see a full log with dynamic
>>>>> debugging enabled for the latest kernel.
>>>> 
>>>> Latest torvalds/linux or stable/linux ?
>>> 
>>> Stable would be fine.
>>> 
>>>> 
>>>>> 
>>>>>     To enable mmc debug via kernel command line:
>>>>> 
>>>>>         dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* 
>>>>> +p"
>>>>> 
>> 
>> This for whatever reason did not work for me on the kernel cmdline
>> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control
>> so the attached dmesgs are missing early logs.
>> 
>>>>>     Kernel must be configured:
>>>>> 
>>>>>         CONFIG_DYNAMIC_DEBUG=y
>>>>> 
>>>>>> 
>>>>>> * Would love to have a decent reference documentation to the SD 
>>>>>> Controller
>>>>>>   especially a detailed description of the behavior of the 
>>>>>> register map in
>>>>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>>>>  similar but not quite identical register maps
>>>>> 
>>>>> The JEDEC eMMC specification describes the CQHCI registers.
>>>>> It looks like you need to register to get access to them
>>>>> at www.jedec.org.  
>>>>> <http://www.jedec.org. > AFAICT older specs, which would be just
>>>>> fine, are free to download after registering.
>>>>> 
>> 
>> Uhh -- but the price for registration is hefty. THX anyway
>> 
>>>> THX
>>>>> You can get SD Host Controller Simplified Specification
>>>>> from www.sdcard.org <http://www.sdcard.org>.
>>>>> 
>> 
>> I attach to dmesg logs
>> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000
>> * The other without dynamic debugging (as said I failed to crash the 
>> system) and sdhci.debug_quirks=0x40a0000
>> 
>> Hope I filtered properly for the relevant stuff
>> The second one contains what seems to me a ton of kernel Oops but
>> that might be due to the preliminary status of 6.2.0-rc4
>> 
>> THX
>> 
>>>> THX
>>>>>> 
>>>>>> Regards
>>>>>> 
>>>>>> Sebastian
>>>>>> 
>>>>>> * dmesg
>>>>>> 
>>>>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>>>> 
>>>>> The timeout is pretty long, so it seems like the controller
>>>>> has gotten stuck.  It won't reset the command or data
>>>>> circuits either, which is impossible if it is behaving
>>>>> correctly.
>>>>> 
>>>>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP 
>>>>>> ===========
>>>>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  
>>>>>> 0x00000510
>>>>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  
>>>>>> 0x00000000
>>>>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 
>>>>>> 0x00000016
>>>>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 
>>>>>> 0x00000000
>>>>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 
>>>>>> 0x00000001
>>>>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      
>>>>>> 0x00000000
>>>>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 
>>>>>> 0x00000008
>>>>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     
>>>>>> 0x00010008
>>>>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 
>>>>>> 0x00000800
>>>>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    
>>>>>> 0x00000000
>>>>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 
>>>>>> 0x00000900
>>>>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>> ===========
>>>>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>> 0x00001002
>>>>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>> 0x00000070
>>>>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>> 0x00000023
>>>>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>> 0x0000003c
>>>>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>> 0x00000080
>>>>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>> 0x00000007
>>>>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>> 0x00000000
>>>>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 
>>>>>> 0x02ff4000
>>>>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>> 0x00000000
>>>>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>> 0x80000807
>>>>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>> 0x00000000
>>>>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>> 0x00000000
>>>>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>> 0x00000000
>>>>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>> 0x000000011eef1218
>>>>>> [  347.583224] mmc0: sdhci: 
>>>>>> ============================================
>>>>>> [  347.583229] mmc0: running CQE recovery
>>>>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>>>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>> ===========
>>>>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>> 0x00001002
>>>>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>> 0x00000070
>>>>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>> 0x00000023
>>>>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>> 0x0000003c
>>>>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>> 0x00000080
>>>>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>> 0x00000007
>>>>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>> 0x00000000
>>>>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>> 0x00ff0003
>>>>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>> 0x00000000
>>>>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>> 0x80000807
>>>>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>> 0x00000000
>>>>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>> 0x00000000
>>>>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>> 0x00000000
>>>>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>> 0x000000011eef1218
>>>>>> [  347.690157] mmc0: sdhci: 
>>>>>> ============================================
>>>>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>>>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>> ===========
>>>>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>> 0x00001002
>>>>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>> 0x00000070
>>>>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>> 0x00000023
>>>>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>> 0x0000003c
>>>>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>> 0x00000080
>>>>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>> 0x00000007
>>>>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>> 0x00000000
>>>>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>> 0x00ff0003
>>>>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>> 0x00000000
>>>>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>> 0x80000807
>>>>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>> 0x00000000
>>>>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>> 0x00000000
>>>>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>> 0x00000000
>>>>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>> 0x000000011eef1218
>>>>>> [  347.790234] mmc0: sdhci: 
>>>>>> ============================================
>>>>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>>>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>> ===========
>>>>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>> 0x00001002
>>>>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>> 0x00000070
>>>>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>> 0x00000023
>>>>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>> 0x0000003c
>>>>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>> 0x00000080
>>>>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>> 0x00000007
>>>>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>> 0x00000000
>>>>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>> 0x00ff0003
>>>>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>> 0x00000000
>>>>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>> 0x80000807
>>>>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>> 0x00000000
>>>>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>> 0x00000000
>>>>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>> 0x00000000
>>>>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>> 0x000000011eef1218
>>>>>> [  347.803119] mmc0: sdhci: 
>>>>>> ============================================
>>>>>>  --
>>>>>> 
>> --
>> 

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-21 13:37             ` sebastian.kloska
@ 2023-01-23 10:46               ` Adrian Hunter
  2023-01-23 12:38                 ` sebastian.kloska
  2023-01-26 17:25                 ` sebastian.kloska
  0 siblings, 2 replies; 12+ messages in thread
From: Adrian Hunter @ 2023-01-23 10:46 UTC (permalink / raw)
  To: sebastian.kloska; +Cc: linux-mmc

On 21/01/23 15:37, sebastian.kloska@snafu.de wrote:
> Am 20.01.2023 13:08, schrieb Adrian Hunter:
>> On 20/01/23 13:14, sebastian.kloska@snafu.de wrote:
>>> So no mail with attachment on linux-mmc@vger.kernel.org
>>> but I linked the dmesgs on https://bugzilla.kernel.org/show_bug.cgi?id=216255 <https://bugzilla.kernel.org/show_bug.cgi?id=216255>
>>
>> I did receive the logs but it is good to attach them to the bug anyway.
>>
>>>  
>>> Am 19.01.2023 11:28, schrieb Adrian Hunter:
>>>> On 19/01/23 11:46, sebastian.kloska@snafu.de <mailto:sebastian.kloska@snafu.de> wrote:
>>>>> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>>>>>> On 18/01/23 23:24, sebastian.kloska@snafu.de <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>>>>>> eMMC: Kingston A29128
>>>>>>>
>>>>>>> * A couple of seconds after boot access to the eMMC is completely
>>>>>>>   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)
>>>>>>>
>>>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>>>>>   Latest highest version including the bug is a fedora 6.1.6
>>>>>>
>>>>>> There is very little difference in drivers/mmc between 5.14.21
>>>>>> and 5.15.1. Are you sure the same issue is not present in
>>>>>> 5.14.21?
>>>>>
>>>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after ~10min
>>>>> The "slowed down by pr_debug" version sometimes lives for >2h (without stress
>>>>> via badblocks etc)
>>>>>
>>> So all of the following is now done on the latest stable, Which currently
>>> is 6.2.0-rc4+
>>
>> That is a release candidate (rc).  The latest stable kernel
>> is 6.1.7.
>>
> 
> I know -just thought I should pull master/HEAD from the latest stable
> 
>>>>>>
>>>>>> You could try disabling runtime PM to see if that helps:
>>>>>>
>>>>>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>>>>>
>>> Didn't help
>>>
>>>>> Will try
>>>>>> Also it should be possible to disable cqhci by adding to the kernel
>>>>>> command line:
>>>>>>
>>>>>>     sdhci.debug_quirks=0x40a0000
>>>>>>
>>>
>>> Weird behavior: Seems to make it worse. System often freezes
>>> during early boot-up.
>>>
>>> HOWEVER: If I slow down the driver via dynamic debugging it seems to be stable
>>> Failed to crash it over a couple of hours. You can say it fixes the issue if you
>>> are willing to accept that 80% of one CPU core is busy with logging ;-)
>>
>> That is interesting.  Logging can have more side effects than you might expect.
>> For example the logs might be written back to storage by syslog, which causes
>> more logging, and more writing etc etc.  That in turn is bumping up the CPU
>> load.
>>
>> Increasing the CPU load will decrease the chance of the CPU entering a low-power
>> state (C-state).
>>
>> And there was an issue with C-state transitions affecting Intel SDHCI
>> controllers,
>> so we need to check that possibility.  You could try kernel command line option:
>>
>>     intel_idle.max_cstate=1
>>
> 
> This -- Sir -- works just wonderfully !!
> 
> Don't see any changes in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable,but
> may be that's expected. Kernel runs just fine. works for stable/6.1.7 and fedora/6.1.6
> So for me personally this workaround is fine. Guess it will hurt my
> power consumption/battery live though but that's acceptable for me.
> 
> Don't hesitate to contact if I should test patches for you. The source
> is checked out and ccache is filled.

It is probably possible to make a kernel patch to prevent
low-power C-States only when eMMC is in use.  I will check
on that.

In the meantime, could you send an acpidump e.g.

	sudo apt-get install acpica-tools
	sudo acpidump -o /tmp/acpidump.txt

Also do you know if you have an up to date BIOS?

> 
>> You can see the C-states and whether they are disabled:
>>
>>     grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>>
>> "disable" is writable by root, so you can disable them that way also.
>>
>> In the meantime I will take a look at the logs.
>>
>>>
>>>>> Will try
>>>>>>>
>>>>>>> * Spiked the code with a lot of pr_debug messages. As a side effect the driver
>>>>>>>   seems to be much more stable, so to me it appears like a timing issue where
>>>>>>>   the driver fails to wait for a specific state change which seems to appear
>>>>>>>   less often when  slowed down by logging
>>>>>>
>>>>>> Ideally it would be good to see a full log with dynamic
>>>>>> debugging enabled for the latest kernel.
>>>>>
>>>>> Latest torvalds/linux or stable/linux ?
>>>>
>>>> Stable would be fine.
>>>>
>>>>>
>>>>>>
>>>>>>     To enable mmc debug via kernel command line:
>>>>>>
>>>>>>         dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"
>>>>>>
>>>
>>> This for whatever reason did not work for me on the kernel cmdline
>>> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control
>>> so the attached dmesgs are missing early logs.
>>>
>>>>>>     Kernel must be configured:
>>>>>>
>>>>>>         CONFIG_DYNAMIC_DEBUG=y
>>>>>>
>>>>>>>
>>>>>>> * Would love to have a decent reference documentation to the SD Controller
>>>>>>>   especially a detailed description of the behavior of the register map in
>>>>>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>>>>>  similar but not quite identical register maps
>>>>>>
>>>>>> The JEDEC eMMC specification describes the CQHCI registers.
>>>>>> It looks like you need to register to get access to them
>>>>>> at www.jedec.org.  <http://www.jedec.org. > AFAICT older specs, which would be just
>>>>>> fine, are free to download after registering.
>>>>>>
>>>
>>> Uhh -- but the price for registration is hefty. THX anyway
>>>
>>>>> THX
>>>>>> You can get SD Host Controller Simplified Specification
>>>>>> from www.sdcard.org <http://www.sdcard.org>.
>>>>>>
>>>
>>> I attach to dmesg logs
>>> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000
>>> * The other without dynamic debugging (as said I failed to crash the system) and sdhci.debug_quirks=0x40a0000
>>>
>>> Hope I filtered properly for the relevant stuff
>>> The second one contains what seems to me a ton of kernel Oops but
>>> that might be due to the preliminary status of 6.2.0-rc4
>>>
>>> THX
>>>
>>>>> THX
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Sebastian
>>>>>>>
>>>>>>> * dmesg
>>>>>>>
>>>>>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>>>>>
>>>>>> The timeout is pretty long, so it seems like the controller
>>>>>> has gotten stuck.  It won't reset the command or data
>>>>>> circuits either, which is impossible if it is behaving
>>>>>> correctly.
>>>>>>
>>>>>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
>>>>>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  0x00000510
>>>>>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  0x00000000
>>>>>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000016
>>>>>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 0x00000000
>>>>>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 0x00000001
>>>>>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      0x00000000
>>>>>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 0x00000008
>>>>>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00010008
>>>>>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000800
>>>>>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
>>>>>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 0x00000900
>>>>>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 0x02ff4000
>>>>>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>> [  347.583224] mmc0: sdhci: ============================================
>>>>>>> [  347.583229] mmc0: running CQE recovery
>>>>>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>>>>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>> [  347.690157] mmc0: sdhci: ============================================
>>>>>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>>>>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>> [  347.790234] mmc0: sdhci: ============================================
>>>>>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>>>>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>> [  347.803119] mmc0: sdhci: ============================================
>>>>>>>  --
>>>>>>>
>>> -- 
>>>
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-23 10:46               ` Adrian Hunter
@ 2023-01-23 12:38                 ` sebastian.kloska
  2023-01-26 17:25                 ` sebastian.kloska
  1 sibling, 0 replies; 12+ messages in thread
From: sebastian.kloska @ 2023-01-23 12:38 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: linux-mmc

Am 23.01.2023 11:46, schrieb Adrian Hunter:
> On 21/01/23 15:37, sebastian.kloska@snafu.de wrote:
>> Am 20.01.2023 13:08, schrieb Adrian Hunter:
>>> On 20/01/23 13:14, sebastian.kloska@snafu.de wrote:
>>>> So no mail with attachment on linux-mmc@vger.kernel.org
>>>> but I linked the dmesgs on 
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=216255 
>>>> <https://bugzilla.kernel.org/show_bug.cgi?id=216255>
>>> 
>>> I did receive the logs but it is good to attach them to the bug 
>>> anyway.
>>> 
>>>>  
>>>> Am 19.01.2023 11:28, schrieb Adrian Hunter:
>>>>> On 19/01/23 11:46, sebastian.kloska@snafu.de 
>>>>> <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>>>>>>> On 18/01/23 23:24, sebastian.kloska@snafu.de 
>>>>>>> <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>>>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>>>>>>> eMMC: Kingston A29128
>>>>>>>> 
>>>>>>>> * A couple of seconds after boot access to the eMMC is completely
>>>>>>>>   blocked. Can be triggered by extensive I/O (e.g, with 
>>>>>>>> badblocks -w)
>>>>>>>> 
>>>>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>>>>>>   Latest highest version including the bug is a fedora 6.1.6
>>>>>>> 
>>>>>>> There is very little difference in drivers/mmc between 5.14.21
>>>>>>> and 5.15.1. Are you sure the same issue is not present in
>>>>>>> 5.14.21?
>>>>>> 
>>>>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after 
>>>>>> ~10min
>>>>>> The "slowed down by pr_debug" version sometimes lives for >2h 
>>>>>> (without stress
>>>>>> via badblocks etc)
>>>>>> 
>>>> So all of the following is now done on the latest stable, Which 
>>>> currently
>>>> is 6.2.0-rc4+
>>> 
>>> That is a release candidate (rc).  The latest stable kernel
>>> is 6.1.7.
>>> 
>> 
>> I know -just thought I should pull master/HEAD from the latest stable
>> 
>>>>>>> 
>>>>>>> You could try disabling runtime PM to see if that helps:
>>>>>>> 
>>>>>>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>>>>>> 
>>>> Didn't help
>>>> 
>>>>>> Will try
>>>>>>> Also it should be possible to disable cqhci by adding to the kernel
>>>>>>> command line:
>>>>>>> 
>>>>>>>     sdhci.debug_quirks=0x40a0000
>>>>>>> 
>>>> 
>>>> Weird behavior: Seems to make it worse. System often freezes
>>>> during early boot-up.
>>>> 
>>>> HOWEVER: If I slow down the driver via dynamic debugging it seems to 
>>>> be stable
>>>> Failed to crash it over a couple of hours. You can say it fixes the 
>>>> issue if you
>>>> are willing to accept that 80% of one CPU core is busy with logging ;-)
>>> 
>>> That is interesting.  Logging can have more side effects than you 
>>> might expect.
>>> For example the logs might be written back to storage by syslog, 
>>> which causes
>>> more logging, and more writing etc etc.  That in turn is bumping up 
>>> the CPU
>>> load.
>>> 
>>> Increasing the CPU load will decrease the chance of the CPU entering 
>>> a low-power
>>> state (C-state).
>>> 
>>> And there was an issue with C-state transitions affecting Intel SDHCI
>>> controllers,
>>> so we need to check that possibility.  You could try kernel command 
>>> line option:
>>> 
>>>     intel_idle.max_cstate=1
>>> 
>> 
>> This -- Sir -- works just wonderfully !!
>> 
>> Don't see any changes in 
>> /sys/devices/system/cpu/cpu*/cpuidle/state*/disable,but
>> may be that's expected. Kernel runs just fine. works for stable/6.1.7 
>> and fedora/6.1.6
>> So for me personally this workaround is fine. Guess it will hurt my
>> power consumption/battery live though but that's acceptable for me.
>> 
>> Don't hesitate to contact if I should test patches for you. The source
>> is checked out and ccache is filled.
> 
> It is probably possible to make a kernel patch to prevent
> low-power C-States only when eMMC is in use.  I will check
> on that.
> 
> In the meantime, could you send an acpidump e.g.
> 
> 	sudo apt-get install acpica-tools
> 	sudo acpidump -o /tmp/acpidump.txt
> 

acpidump @ https://bugzilla.kernel.org/attachment.cgi?id=303637

> Also do you know if you have an up to date BIOS?

BIOS is 3 releases behind although the laptop is only 6month old
(haven't looked that up to be honest)

v1.08 2021/06/11
up-to-date == 1.14  2022/10/21

I could give an update via FreeDOS a try

> 
>> 
>>> You can see the C-states and whether they are disabled:
>>> 
>>>     grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>>> 
>>> "disable" is writable by root, so you can disable them that way also.
>>> 
>>> In the meantime I will take a look at the logs.
>>> 
>>>> 
>>>>>> Will try
>>>>>>>> 
>>>>>>>> * Spiked the code with a lot of pr_debug messages. As a side 
>>>>>>>> effect the driver
>>>>>>>>   seems to be much more stable, so to me it appears like a 
>>>>>>>> timing issue where
>>>>>>>>   the driver fails to wait for a specific state change which 
>>>>>>>> seems to appear
>>>>>>>>   less often when  slowed down by logging
>>>>>>> 
>>>>>>> Ideally it would be good to see a full log with dynamic
>>>>>>> debugging enabled for the latest kernel.
>>>>>> 
>>>>>> Latest torvalds/linux or stable/linux ?
>>>>> 
>>>>> Stable would be fine.
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>     To enable mmc debug via kernel command line:
>>>>>>> 
>>>>>>>         dyndbg="file drivers/mmc/core/* +p;file 
>>>>>>> drivers/mmc/host/* +p"
>>>>>>> 
>>>> 
>>>> This for whatever reason did not work for me on the kernel cmdline
>>>> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control
>>>> so the attached dmesgs are missing early logs.
>>>> 
>>>>>>>     Kernel must be configured:
>>>>>>> 
>>>>>>>         CONFIG_DYNAMIC_DEBUG=y
>>>>>>> 
>>>>>>>> 
>>>>>>>> * Would love to have a decent reference documentation to the SD 
>>>>>>>> Controller
>>>>>>>>   especially a detailed description of the behavior of the 
>>>>>>>> register map in
>>>>>>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>>>>>>  similar but not quite identical register maps
>>>>>>> 
>>>>>>> The JEDEC eMMC specification describes the CQHCI registers.
>>>>>>> It looks like you need to register to get access to them
>>>>>>> at www.jedec.org.  
>>>>>>> <http://www.jedec.org. > AFAICT older specs, which would be just
>>>>>>> fine, are free to download after registering.
>>>>>>> 
>>>> 
>>>> Uhh -- but the price for registration is hefty. THX anyway
>>>> 
>>>>>> THX
>>>>>>> You can get SD Host Controller Simplified Specification
>>>>>>> from www.sdcard.org <http://www.sdcard.org>.
>>>>>>> 
>>>> 
>>>> I attach to dmesg logs
>>>> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000
>>>> * The other without dynamic debugging (as said I failed to crash the 
>>>> system) and sdhci.debug_quirks=0x40a0000
>>>> 
>>>> Hope I filtered properly for the relevant stuff
>>>> The second one contains what seems to me a ton of kernel Oops but
>>>> that might be due to the preliminary status of 6.2.0-rc4
>>>> 
>>>> THX
>>>> 
>>>>>> THX
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> 
>>>>>>>> Sebastian
>>>>>>>> 
>>>>>>>> * dmesg
>>>>>>>> 
>>>>>>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>>>>>> 
>>>>>>> The timeout is pretty long, so it seems like the controller
>>>>>>> has gotten stuck.  It won't reset the command or data
>>>>>>> circuits either, which is impossible if it is behaving
>>>>>>> correctly.
>>>>>>> 
>>>>>>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  
>>>>>>>> 0x00000510
>>>>>>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 
>>>>>>>> 0x00000016
>>>>>>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 
>>>>>>>> 0x00000001
>>>>>>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      
>>>>>>>> 0x00000000
>>>>>>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 
>>>>>>>> 0x00000008
>>>>>>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     
>>>>>>>> 0x00010008
>>>>>>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 
>>>>>>>> 0x00000800
>>>>>>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    
>>>>>>>> 0x00000000
>>>>>>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 
>>>>>>>> 0x00000900
>>>>>>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>> 0x00001002
>>>>>>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>> 0x00000070
>>>>>>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>> 0x00000023
>>>>>>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>> 0x0000003c
>>>>>>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>> 0x00000080
>>>>>>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>> 0x00000007
>>>>>>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 
>>>>>>>> 0x02ff4000
>>>>>>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>> 0x80000807
>>>>>>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>> 0x000000011eef1218
>>>>>>>> [  347.583224] mmc0: sdhci: 
>>>>>>>> ============================================
>>>>>>>> [  347.583229] mmc0: running CQE recovery
>>>>>>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>>>>>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>> 0x00001002
>>>>>>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>> 0x00000070
>>>>>>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>> 0x00000023
>>>>>>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>> 0x0000003c
>>>>>>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>> 0x00000080
>>>>>>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>> 0x00000007
>>>>>>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>> 0x00ff0003
>>>>>>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>> 0x80000807
>>>>>>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>> 0x000000011eef1218
>>>>>>>> [  347.690157] mmc0: sdhci: 
>>>>>>>> ============================================
>>>>>>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>>>>>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>> 0x00001002
>>>>>>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>> 0x00000070
>>>>>>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>> 0x00000023
>>>>>>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>> 0x0000003c
>>>>>>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>> 0x00000080
>>>>>>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>> 0x00000007
>>>>>>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>> 0x00ff0003
>>>>>>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>> 0x80000807
>>>>>>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>> 0x000000011eef1218
>>>>>>>> [  347.790234] mmc0: sdhci: 
>>>>>>>> ============================================
>>>>>>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>>>>>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>> 0x00001002
>>>>>>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>> 0x00000070
>>>>>>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>> 0x00000023
>>>>>>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>> 0x0000003c
>>>>>>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>> 0x00000080
>>>>>>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>> 0x00000007
>>>>>>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>> 0x00ff0003
>>>>>>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>> 0x80000807
>>>>>>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>> 0x000000011eef1218
>>>>>>>> [  347.803119] mmc0: sdhci: 
>>>>>>>> ============================================
>>>>>>>>  --
>>>>>>>> 
>>>> --
>>>> 
>> 

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-23 10:46               ` Adrian Hunter
  2023-01-23 12:38                 ` sebastian.kloska
@ 2023-01-26 17:25                 ` sebastian.kloska
  2023-01-26 19:25                   ` Adrian Hunter
  1 sibling, 1 reply; 12+ messages in thread
From: sebastian.kloska @ 2023-01-26 17:25 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: linux-mmc

Am 23.01.2023 11:46, schrieb Adrian Hunter:
> On 21/01/23 15:37, sebastian.kloska@snafu.de wrote:
>> Am 20.01.2023 13:08, schrieb Adrian Hunter:
>>> On 20/01/23 13:14, sebastian.kloska@snafu.de wrote:
>>>> So no mail with attachment on linux-mmc@vger.kernel.org
>>>> but I linked the dmesgs on 
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=216255 
>>>> <https://bugzilla.kernel.org/show_bug.cgi?id=216255>
>>> 
>>> I did receive the logs but it is good to attach them to the bug 
>>> anyway.
>>> 
>>>>  
>>>> Am 19.01.2023 11:28, schrieb Adrian Hunter:
>>>>> On 19/01/23 11:46, sebastian.kloska@snafu.de 
>>>>> <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>>>>>>> On 18/01/23 23:24, sebastian.kloska@snafu.de 
>>>>>>> <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>>>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>>>>>>> eMMC: Kingston A29128
>>>>>>>> 
>>>>>>>> * A couple of seconds after boot access to the eMMC is completely
>>>>>>>>   blocked. Can be triggered by extensive I/O (e.g, with 
>>>>>>>> badblocks -w)
>>>>>>>> 
>>>>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>>>>>>   Latest highest version including the bug is a fedora 6.1.6
>>>>>>> 
>>>>>>> There is very little difference in drivers/mmc between 5.14.21
>>>>>>> and 5.15.1. Are you sure the same issue is not present in
>>>>>>> 5.14.21?
>>>>>> 
>>>>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after 
>>>>>> ~10min
>>>>>> The "slowed down by pr_debug" version sometimes lives for >2h 
>>>>>> (without stress
>>>>>> via badblocks etc)
>>>>>> 
>>>> So all of the following is now done on the latest stable, Which 
>>>> currently
>>>> is 6.2.0-rc4+
>>> 
>>> That is a release candidate (rc).  The latest stable kernel
>>> is 6.1.7.
>>> 
>> 
>> I know -just thought I should pull master/HEAD from the latest stable
>> 
>>>>>>> 
>>>>>>> You could try disabling runtime PM to see if that helps:
>>>>>>> 
>>>>>>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>>>>>> 
>>>> Didn't help
>>>> 
>>>>>> Will try
>>>>>>> Also it should be possible to disable cqhci by adding to the kernel
>>>>>>> command line:
>>>>>>> 
>>>>>>>     sdhci.debug_quirks=0x40a0000
>>>>>>> 
>>>> 
>>>> Weird behavior: Seems to make it worse. System often freezes
>>>> during early boot-up.
>>>> 
>>>> HOWEVER: If I slow down the driver via dynamic debugging it seems to 
>>>> be stable
>>>> Failed to crash it over a couple of hours. You can say it fixes the 
>>>> issue if you
>>>> are willing to accept that 80% of one CPU core is busy with logging ;-)
>>> 
>>> That is interesting.  Logging can have more side effects than you 
>>> might expect.
>>> For example the logs might be written back to storage by syslog, 
>>> which causes
>>> more logging, and more writing etc etc.  That in turn is bumping up 
>>> the CPU
>>> load.
>>> 
>>> Increasing the CPU load will decrease the chance of the CPU entering 
>>> a low-power
>>> state (C-state).
>>> 
>>> And there was an issue with C-state transitions affecting Intel SDHCI
>>> controllers,
>>> so we need to check that possibility.  You could try kernel command 
>>> line option:
>>> 
>>>     intel_idle.max_cstate=1
>>> 
>> 
>> This -- Sir -- works just wonderfully !!
>> 
>> Don't see any changes in 
>> /sys/devices/system/cpu/cpu*/cpuidle/state*/disable,but
>> may be that's expected. Kernel runs just fine. works for stable/6.1.7 
>> and fedora/6.1.6
>> So for me personally this workaround is fine. Guess it will hurt my
>> power consumption/battery live though but that's acceptable for me.
>> 
>> Don't hesitate to contact if I should test patches for you. The source
>> is checked out and ccache is filled.
> 
> It is probably possible to make a kernel patch to prevent
> low-power C-States only when eMMC is in use.  I will check
> on that.
> 
> In the meantime, could you send an acpidump e.g.
> 
> 	sudo apt-get install acpica-tools
> 	sudo acpidump -o /tmp/acpidump.txt
> 
> Also do you know if you have an up to date BIOS?

So this is pretty embarrassing. Finally found the means to update my
BIOS and turned out to fix the problem.
Should have been the first on my list to solve the issue.
THX for the effort
> 
>> 
>>> You can see the C-states and whether they are disabled:
>>> 
>>>     grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>>> 
>>> "disable" is writable by root, so you can disable them that way also.
>>> 
>>> In the meantime I will take a look at the logs.
>>> 
>>>> 
>>>>>> Will try
>>>>>>>> 
>>>>>>>> * Spiked the code with a lot of pr_debug messages. As a side 
>>>>>>>> effect the driver
>>>>>>>>   seems to be much more stable, so to me it appears like a 
>>>>>>>> timing issue where
>>>>>>>>   the driver fails to wait for a specific state change which 
>>>>>>>> seems to appear
>>>>>>>>   less often when  slowed down by logging
>>>>>>> 
>>>>>>> Ideally it would be good to see a full log with dynamic
>>>>>>> debugging enabled for the latest kernel.
>>>>>> 
>>>>>> Latest torvalds/linux or stable/linux ?
>>>>> 
>>>>> Stable would be fine.
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>     To enable mmc debug via kernel command line:
>>>>>>> 
>>>>>>>         dyndbg="file drivers/mmc/core/* +p;file 
>>>>>>> drivers/mmc/host/* +p"
>>>>>>> 
>>>> 
>>>> This for whatever reason did not work for me on the kernel cmdline
>>>> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control
>>>> so the attached dmesgs are missing early logs.
>>>> 
>>>>>>>     Kernel must be configured:
>>>>>>> 
>>>>>>>         CONFIG_DYNAMIC_DEBUG=y
>>>>>>> 
>>>>>>>> 
>>>>>>>> * Would love to have a decent reference documentation to the SD 
>>>>>>>> Controller
>>>>>>>>   especially a detailed description of the behavior of the 
>>>>>>>> register map in
>>>>>>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>>>>>>  similar but not quite identical register maps
>>>>>>> 
>>>>>>> The JEDEC eMMC specification describes the CQHCI registers.
>>>>>>> It looks like you need to register to get access to them
>>>>>>> at www.jedec.org.  
>>>>>>> <http://www.jedec.org. > AFAICT older specs, which would be just
>>>>>>> fine, are free to download after registering.
>>>>>>> 
>>>> 
>>>> Uhh -- but the price for registration is hefty. THX anyway
>>>> 
>>>>>> THX
>>>>>>> You can get SD Host Controller Simplified Specification
>>>>>>> from www.sdcard.org <http://www.sdcard.org>.
>>>>>>> 
>>>> 
>>>> I attach to dmesg logs
>>>> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000
>>>> * The other without dynamic debugging (as said I failed to crash the 
>>>> system) and sdhci.debug_quirks=0x40a0000
>>>> 
>>>> Hope I filtered properly for the relevant stuff
>>>> The second one contains what seems to me a ton of kernel Oops but
>>>> that might be due to the preliminary status of 6.2.0-rc4
>>>> 
>>>> THX
>>>> 
>>>>>> THX
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> 
>>>>>>>> Sebastian
>>>>>>>> 
>>>>>>>> * dmesg
>>>>>>>> 
>>>>>>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>>>>>> 
>>>>>>> The timeout is pretty long, so it seems like the controller
>>>>>>> has gotten stuck.  It won't reset the command or data
>>>>>>> circuits either, which is impossible if it is behaving
>>>>>>> correctly.
>>>>>>> 
>>>>>>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  
>>>>>>>> 0x00000510
>>>>>>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 
>>>>>>>> 0x00000016
>>>>>>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 
>>>>>>>> 0x00000001
>>>>>>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      
>>>>>>>> 0x00000000
>>>>>>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 
>>>>>>>> 0x00000008
>>>>>>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     
>>>>>>>> 0x00010008
>>>>>>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 
>>>>>>>> 0x00000800
>>>>>>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    
>>>>>>>> 0x00000000
>>>>>>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 
>>>>>>>> 0x00000900
>>>>>>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>> 0x00001002
>>>>>>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>> 0x00000070
>>>>>>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>> 0x00000023
>>>>>>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>> 0x0000003c
>>>>>>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>> 0x00000080
>>>>>>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>> 0x00000007
>>>>>>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 
>>>>>>>> 0x02ff4000
>>>>>>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>> 0x80000807
>>>>>>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>> 0x000000011eef1218
>>>>>>>> [  347.583224] mmc0: sdhci: 
>>>>>>>> ============================================
>>>>>>>> [  347.583229] mmc0: running CQE recovery
>>>>>>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>>>>>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>> 0x00001002
>>>>>>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>> 0x00000070
>>>>>>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>> 0x00000023
>>>>>>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>> 0x0000003c
>>>>>>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>> 0x00000080
>>>>>>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>> 0x00000007
>>>>>>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>> 0x00ff0003
>>>>>>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>> 0x80000807
>>>>>>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>> 0x000000011eef1218
>>>>>>>> [  347.690157] mmc0: sdhci: 
>>>>>>>> ============================================
>>>>>>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>>>>>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>> 0x00001002
>>>>>>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>> 0x00000070
>>>>>>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>> 0x00000023
>>>>>>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>> 0x0000003c
>>>>>>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>> 0x00000080
>>>>>>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>> 0x00000007
>>>>>>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>> 0x00ff0003
>>>>>>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>> 0x80000807
>>>>>>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>> 0x000000011eef1218
>>>>>>>> [  347.790234] mmc0: sdhci: 
>>>>>>>> ============================================
>>>>>>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>>>>>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>> ===========
>>>>>>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>> 0x00001002
>>>>>>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>> 0x00000070
>>>>>>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>> 0x00000023
>>>>>>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>> 0x0000003c
>>>>>>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>> 0x00000080
>>>>>>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>> 0x00000007
>>>>>>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>> 0x00ff0003
>>>>>>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>> 0x80000807
>>>>>>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>> 0x00000000
>>>>>>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>> 0x00000000
>>>>>>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>> 0x000000011eef1218
>>>>>>>> [  347.803119] mmc0: sdhci: 
>>>>>>>> ============================================
>>>>>>>>  --
>>>>>>>> 
>>>> --
>>>> 
>> 

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-26 17:25                 ` sebastian.kloska
@ 2023-01-26 19:25                   ` Adrian Hunter
  2023-01-26 19:44                     ` sebastian.kloska
  0 siblings, 1 reply; 12+ messages in thread
From: Adrian Hunter @ 2023-01-26 19:25 UTC (permalink / raw)
  To: sebastian.kloska; +Cc: linux-mmc

On 26/01/23 19:25, sebastian.kloska@snafu.de wrote:
> Am 23.01.2023 11:46, schrieb Adrian Hunter:
>> On 21/01/23 15:37, sebastian.kloska@snafu.de wrote:
>>> Am 20.01.2023 13:08, schrieb Adrian Hunter:
>>>> On 20/01/23 13:14, sebastian.kloska@snafu.de wrote:
>>>>> So no mail with attachment on linux-mmc@vger.kernel.org
>>>>> but I linked the dmesgs on https://bugzilla.kernel.org/show_bug.cgi?id=216255 <https://bugzilla.kernel.org/show_bug.cgi?id=216255>
>>>>
>>>> I did receive the logs but it is good to attach them to the bug anyway.
>>>>
>>>>>  
>>>>> Am 19.01.2023 11:28, schrieb Adrian Hunter:
>>>>>> On 19/01/23 11:46, sebastian.kloska@snafu.de <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>>> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>>>>>>>> On 18/01/23 23:24, sebastian.kloska@snafu.de <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>>>>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>>>>>>>> eMMC: Kingston A29128
>>>>>>>>>
>>>>>>>>> * A couple of seconds after boot access to the eMMC is completely
>>>>>>>>>   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)
>>>>>>>>>
>>>>>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>>>>>>>   Latest highest version including the bug is a fedora 6.1.6
>>>>>>>>
>>>>>>>> There is very little difference in drivers/mmc between 5.14.21
>>>>>>>> and 5.15.1. Are you sure the same issue is not present in
>>>>>>>> 5.14.21?
>>>>>>>
>>>>>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after ~10min
>>>>>>> The "slowed down by pr_debug" version sometimes lives for >2h (without stress
>>>>>>> via badblocks etc)
>>>>>>>
>>>>> So all of the following is now done on the latest stable, Which currently
>>>>> is 6.2.0-rc4+
>>>>
>>>> That is a release candidate (rc).  The latest stable kernel
>>>> is 6.1.7.
>>>>
>>>
>>> I know -just thought I should pull master/HEAD from the latest stable
>>>
>>>>>>>>
>>>>>>>> You could try disabling runtime PM to see if that helps:
>>>>>>>>
>>>>>>>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>>>>>>>
>>>>> Didn't help
>>>>>
>>>>>>> Will try
>>>>>>>> Also it should be possible to disable cqhci by adding to the kernel
>>>>>>>> command line:
>>>>>>>>
>>>>>>>>     sdhci.debug_quirks=0x40a0000
>>>>>>>>
>>>>>
>>>>> Weird behavior: Seems to make it worse. System often freezes
>>>>> during early boot-up.
>>>>>
>>>>> HOWEVER: If I slow down the driver via dynamic debugging it seems to be stable
>>>>> Failed to crash it over a couple of hours. You can say it fixes the issue if you
>>>>> are willing to accept that 80% of one CPU core is busy with logging ;-)
>>>>
>>>> That is interesting.  Logging can have more side effects than you might expect.
>>>> For example the logs might be written back to storage by syslog, which causes
>>>> more logging, and more writing etc etc.  That in turn is bumping up the CPU
>>>> load.
>>>>
>>>> Increasing the CPU load will decrease the chance of the CPU entering a low-power
>>>> state (C-state).
>>>>
>>>> And there was an issue with C-state transitions affecting Intel SDHCI
>>>> controllers,
>>>> so we need to check that possibility.  You could try kernel command line option:
>>>>
>>>>     intel_idle.max_cstate=1
>>>>
>>>
>>> This -- Sir -- works just wonderfully !!
>>>
>>> Don't see any changes in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable,but
>>> may be that's expected. Kernel runs just fine. works for stable/6.1.7 and fedora/6.1.6
>>> So for me personally this workaround is fine. Guess it will hurt my
>>> power consumption/battery live though but that's acceptable for me.
>>>
>>> Don't hesitate to contact if I should test patches for you. The source
>>> is checked out and ccache is filled.
>>
>> It is probably possible to make a kernel patch to prevent
>> low-power C-States only when eMMC is in use.  I will check
>> on that.
>>
>> In the meantime, could you send an acpidump e.g.
>>
>>     sudo apt-get install acpica-tools
>>     sudo acpidump -o /tmp/acpidump.txt
>>
>> Also do you know if you have an up to date BIOS?
> 
> So this is pretty embarrassing. Finally found the means to update my
> BIOS and turned out to fix the problem.
> Should have been the first on my list to solve the issue.
> THX for the effort

That's cool, no worries.  I would also be interested to get
an acpidump from the updated machine.

>>
>>>
>>>> You can see the C-states and whether they are disabled:
>>>>
>>>>     grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>>>>
>>>> "disable" is writable by root, so you can disable them that way also.
>>>>
>>>> In the meantime I will take a look at the logs.
>>>>
>>>>>
>>>>>>> Will try
>>>>>>>>>
>>>>>>>>> * Spiked the code with a lot of pr_debug messages. As a side effect the driver
>>>>>>>>>   seems to be much more stable, so to me it appears like a timing issue where
>>>>>>>>>   the driver fails to wait for a specific state change which seems to appear
>>>>>>>>>   less often when  slowed down by logging
>>>>>>>>
>>>>>>>> Ideally it would be good to see a full log with dynamic
>>>>>>>> debugging enabled for the latest kernel.
>>>>>>>
>>>>>>> Latest torvalds/linux or stable/linux ?
>>>>>>
>>>>>> Stable would be fine.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>     To enable mmc debug via kernel command line:
>>>>>>>>
>>>>>>>>         dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"
>>>>>>>>
>>>>>
>>>>> This for whatever reason did not work for me on the kernel cmdline
>>>>> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control
>>>>> so the attached dmesgs are missing early logs.
>>>>>
>>>>>>>>     Kernel must be configured:
>>>>>>>>
>>>>>>>>         CONFIG_DYNAMIC_DEBUG=y
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Would love to have a decent reference documentation to the SD Controller
>>>>>>>>>   especially a detailed description of the behavior of the register map in
>>>>>>>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>>>>>>>  similar but not quite identical register maps
>>>>>>>>
>>>>>>>> The JEDEC eMMC specification describes the CQHCI registers.
>>>>>>>> It looks like you need to register to get access to them
>>>>>>>> at www.jedec.org.  <http://www.jedec.org. > AFAICT older specs, which would be just
>>>>>>>> fine, are free to download after registering.
>>>>>>>>
>>>>>
>>>>> Uhh -- but the price for registration is hefty. THX anyway
>>>>>
>>>>>>> THX
>>>>>>>> You can get SD Host Controller Simplified Specification
>>>>>>>> from www.sdcard.org <http://www.sdcard.org>.
>>>>>>>>
>>>>>
>>>>> I attach to dmesg logs
>>>>> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000
>>>>> * The other without dynamic debugging (as said I failed to crash the system) and sdhci.debug_quirks=0x40a0000
>>>>>
>>>>> Hope I filtered properly for the relevant stuff
>>>>> The second one contains what seems to me a ton of kernel Oops but
>>>>> that might be due to the preliminary status of 6.2.0-rc4
>>>>>
>>>>> THX
>>>>>
>>>>>>> THX
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> * dmesg
>>>>>>>>>
>>>>>>>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>>>>>>>
>>>>>>>> The timeout is pretty long, so it seems like the controller
>>>>>>>> has gotten stuck.  It won't reset the command or data
>>>>>>>> circuits either, which is impossible if it is behaving
>>>>>>>> correctly.
>>>>>>>>
>>>>>>>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
>>>>>>>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  0x00000510
>>>>>>>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  0x00000000
>>>>>>>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000016
>>>>>>>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 0x00000000
>>>>>>>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 0x00000001
>>>>>>>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      0x00000000
>>>>>>>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 0x00000008
>>>>>>>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00010008
>>>>>>>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000800
>>>>>>>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
>>>>>>>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 0x00000900
>>>>>>>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 0x02ff4000
>>>>>>>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>>>> [  347.583224] mmc0: sdhci: ============================================
>>>>>>>>> [  347.583229] mmc0: running CQE recovery
>>>>>>>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>>>>>>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>>>> [  347.690157] mmc0: sdhci: ============================================
>>>>>>>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>>>>>>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>>>> [  347.790234] mmc0: sdhci: ============================================
>>>>>>>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>>>>>>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>>>> [  347.803119] mmc0: sdhci: ============================================
>>>>>>>>>  --
>>>>>>>>>
>>>>> -- 
>>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #
  2023-01-26 19:25                   ` Adrian Hunter
@ 2023-01-26 19:44                     ` sebastian.kloska
  0 siblings, 0 replies; 12+ messages in thread
From: sebastian.kloska @ 2023-01-26 19:44 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: linux-mmc

Am 26.01.2023 20:25, schrieb Adrian Hunter:
> On 26/01/23 19:25, sebastian.kloska@snafu.de wrote:
>> Am 23.01.2023 11:46, schrieb Adrian Hunter:
>>> On 21/01/23 15:37, sebastian.kloska@snafu.de wrote:
>>>> Am 20.01.2023 13:08, schrieb Adrian Hunter:
>>>>> On 20/01/23 13:14, sebastian.kloska@snafu.de wrote:
>>>>>> So no mail with attachment on linux-mmc@vger.kernel.org
>>>>>> but I linked the dmesgs on 
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=216255 
>>>>>> <https://bugzilla.kernel.org/show_bug.cgi?id=216255>
>>>>> 
>>>>> I did receive the logs but it is good to attach them to the bug 
>>>>> anyway.
>>>>> 
>>>>>>  
>>>>>> Am 19.01.2023 11:28, schrieb Adrian Hunter:
>>>>>>> On 19/01/23 11:46, sebastian.kloska@snafu.de 
>>>>>>> <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>>>> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>>>>>>>>> On 18/01/23 23:24, sebastian.kloska@snafu.de 
>>>>>>>>> <mailto:sebastian.kloska@snafu.de> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>>>>>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>>>>>>>>> eMMC: Kingston A29128
>>>>>>>>>> 
>>>>>>>>>> * A couple of seconds after boot access to the eMMC is completely
>>>>>>>>>>   blocked. Can be triggered by extensive I/O (e.g, with 
>>>>>>>>>> badblocks -w)
>>>>>>>>>> 
>>>>>>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>>>>>>>>   Latest highest version including the bug is a fedora 6.1.6
>>>>>>>>> 
>>>>>>>>> There is very little difference in drivers/mmc between 5.14.21
>>>>>>>>> and 5.15.1. Are you sure the same issue is not present in
>>>>>>>>> 5.14.21?
>>>>>>>> 
>>>>>>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me 
>>>>>>>> after ~10min
>>>>>>>> The "slowed down by pr_debug" version sometimes lives for >2h 
>>>>>>>> (without stress
>>>>>>>> via badblocks etc)
>>>>>>>> 
>>>>>> So all of the following is now done on the latest stable, Which 
>>>>>> currently
>>>>>> is 6.2.0-rc4+
>>>>> 
>>>>> That is a release candidate (rc).  The latest stable kernel
>>>>> is 6.1.7.
>>>>> 
>>>> 
>>>> I know -just thought I should pull master/HEAD from the latest 
>>>> stable
>>>> 
>>>>>>>>> 
>>>>>>>>> You could try disabling runtime PM to see if that helps:
>>>>>>>>> 
>>>>>>>>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>>>>>>>> 
>>>>>> Didn't help
>>>>>> 
>>>>>>>> Will try
>>>>>>>>> Also it should be possible to disable cqhci by adding to the kernel
>>>>>>>>> command line:
>>>>>>>>> 
>>>>>>>>>     sdhci.debug_quirks=0x40a0000
>>>>>>>>> 
>>>>>> 
>>>>>> Weird behavior: Seems to make it worse. System often freezes
>>>>>> during early boot-up.
>>>>>> 
>>>>>> HOWEVER: If I slow down the driver via dynamic debugging it seems 
>>>>>> to be stable
>>>>>> Failed to crash it over a couple of hours. You can say it fixes 
>>>>>> the issue if you
>>>>>> are willing to accept that 80% of one CPU core is busy with logging ;-)
>>>>> 
>>>>> That is interesting.  Logging can have more side effects than you 
>>>>> might expect.
>>>>> For example the logs might be written back to storage by syslog, 
>>>>> which causes
>>>>> more logging, and more writing etc etc.  That in turn is bumping up 
>>>>> the CPU
>>>>> load.
>>>>> 
>>>>> Increasing the CPU load will decrease the chance of the CPU 
>>>>> entering a low-power
>>>>> state (C-state).
>>>>> 
>>>>> And there was an issue with C-state transitions affecting Intel 
>>>>> SDHCI
>>>>> controllers,
>>>>> so we need to check that possibility.  You could try kernel command 
>>>>> line option:
>>>>> 
>>>>>     intel_idle.max_cstate=1
>>>>> 
>>>> 
>>>> This -- Sir -- works just wonderfully !!
>>>> 
>>>> Don't see any changes in 
>>>> /sys/devices/system/cpu/cpu*/cpuidle/state*/disable,but
>>>> may be that's expected. Kernel runs just fine. works for 
>>>> stable/6.1.7 and fedora/6.1.6
>>>> So for me personally this workaround is fine. Guess it will hurt my
>>>> power consumption/battery live though but that's acceptable for me.
>>>> 
>>>> Don't hesitate to contact if I should test patches for you. The 
>>>> source
>>>> is checked out and ccache is filled.
>>> 
>>> It is probably possible to make a kernel patch to prevent
>>> low-power C-States only when eMMC is in use.  I will check
>>> on that.
>>> 
>>> In the meantime, could you send an acpidump e.g.
>>> 
>>>     sudo apt-get install acpica-tools
>>>     sudo acpidump -o /tmp/acpidump.txt
>>> 
>>> Also do you know if you have an up to date BIOS?
>> 
>> So this is pretty embarrassing. Finally found the means to update my
>> BIOS and turned out to fix the problem.
>> Should have been the first on my list to solve the issue.
>> THX for the effort
> 
> That's cool, no worries.  I would also be interested to get
> an acpidump from the updated machine.
> 
Here you are https://bugzilla.kernel.org/attachment.cgi?id=303654
>>> 
>>>> 
>>>>> You can see the C-states and whether they are disabled:
>>>>> 
>>>>>     grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>>>>> 
>>>>> "disable" is writable by root, so you can disable them that way 
>>>>> also.
>>>>> 
>>>>> In the meantime I will take a look at the logs.
>>>>> 
>>>>>> 
>>>>>>>> Will try
>>>>>>>>>> 
>>>>>>>>>> * Spiked the code with a lot of pr_debug messages. As a side 
>>>>>>>>>> effect the driver
>>>>>>>>>>   seems to be much more stable, so to me it appears like a 
>>>>>>>>>> timing issue where
>>>>>>>>>>   the driver fails to wait for a specific state change which 
>>>>>>>>>> seems to appear
>>>>>>>>>>   less often when  slowed down by logging
>>>>>>>>> 
>>>>>>>>> Ideally it would be good to see a full log with dynamic
>>>>>>>>> debugging enabled for the latest kernel.
>>>>>>>> 
>>>>>>>> Latest torvalds/linux or stable/linux ?
>>>>>>> 
>>>>>>> Stable would be fine.
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>     To enable mmc debug via kernel command line:
>>>>>>>>> 
>>>>>>>>>         dyndbg="file drivers/mmc/core/* +p;file 
>>>>>>>>> drivers/mmc/host/* +p"
>>>>>>>>> 
>>>>>> 
>>>>>> This for whatever reason did not work for me on the kernel cmdline
>>>>>> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control
>>>>>> so the attached dmesgs are missing early logs.
>>>>>> 
>>>>>>>>>     Kernel must be configured:
>>>>>>>>> 
>>>>>>>>>         CONFIG_DYNAMIC_DEBUG=y
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> * Would love to have a decent reference documentation to the 
>>>>>>>>>> SD Controller
>>>>>>>>>>   especially a detailed description of the behavior of the 
>>>>>>>>>> register map in
>>>>>>>>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>>>>>>>>  similar but not quite identical register maps
>>>>>>>>> 
>>>>>>>>> The JEDEC eMMC specification describes the CQHCI registers.
>>>>>>>>> It looks like you need to register to get access to them
>>>>>>>>> at www.jedec.org.  
>>>>>>>>> <http://www.jedec.org. > AFAICT older specs, which would be just
>>>>>>>>> fine, are free to download after registering.
>>>>>>>>> 
>>>>>> 
>>>>>> Uhh -- but the price for registration is hefty. THX anyway
>>>>>> 
>>>>>>>> THX
>>>>>>>>> You can get SD Host Controller Simplified Specification
>>>>>>>>> from www.sdcard.org <http://www.sdcard.org>.
>>>>>>>>> 
>>>>>> 
>>>>>> I attach to dmesg logs
>>>>>> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000
>>>>>> * The other without dynamic debugging (as said I failed to crash 
>>>>>> the system) and sdhci.debug_quirks=0x40a0000
>>>>>> 
>>>>>> Hope I filtered properly for the relevant stuff
>>>>>> The second one contains what seems to me a ton of kernel Oops but
>>>>>> that might be due to the preliminary status of 6.2.0-rc4
>>>>>> 
>>>>>> THX
>>>>>> 
>>>>>>>> THX
>>>>>>>>>> 
>>>>>>>>>> Regards
>>>>>>>>>> 
>>>>>>>>>> Sebastian
>>>>>>>>>> 
>>>>>>>>>> * dmesg
>>>>>>>>>> 
>>>>>>>>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>>>>>>>> 
>>>>>>>>> The timeout is pretty long, so it seems like the controller
>>>>>>>>> has gotten stuck.  It won't reset the command or data
>>>>>>>>> circuits either, which is impossible if it is behaving
>>>>>>>>> correctly.
>>>>>>>>> 
>>>>>>>>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP 
>>>>>>>>>> ===========
>>>>>>>>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  
>>>>>>>>>> 0x00000510
>>>>>>>>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 
>>>>>>>>>> 0x00000016
>>>>>>>>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 
>>>>>>>>>> 0x00000001
>>>>>>>>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 
>>>>>>>>>> 0x00000008
>>>>>>>>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     
>>>>>>>>>> 0x00010008
>>>>>>>>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 
>>>>>>>>>> 0x00000800
>>>>>>>>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 
>>>>>>>>>> 0x00000900
>>>>>>>>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>>>> ===========
>>>>>>>>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>>>> 0x00001002
>>>>>>>>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>>>> 0x00000070
>>>>>>>>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>>>> 0x00000023
>>>>>>>>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>>>> 0x0000003c
>>>>>>>>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>>>> 0x00000080
>>>>>>>>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>>>> 0x00000007
>>>>>>>>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 
>>>>>>>>>> 0x02ff4000
>>>>>>>>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>>>> 0x80000807
>>>>>>>>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>>>> 0x000000011eef1218
>>>>>>>>>> [  347.583224] mmc0: sdhci: 
>>>>>>>>>> ============================================
>>>>>>>>>> [  347.583229] mmc0: running CQE recovery
>>>>>>>>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>>>>>>>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>>>> ===========
>>>>>>>>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>>>> 0x00001002
>>>>>>>>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>>>> 0x00000070
>>>>>>>>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>>>> 0x00000023
>>>>>>>>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>>>> 0x0000003c
>>>>>>>>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>>>> 0x00000080
>>>>>>>>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>>>> 0x00000007
>>>>>>>>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>>>> 0x00ff0003
>>>>>>>>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>>>> 0x80000807
>>>>>>>>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>>>> 0x000000011eef1218
>>>>>>>>>> [  347.690157] mmc0: sdhci: 
>>>>>>>>>> ============================================
>>>>>>>>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>>>>>>>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>>>> ===========
>>>>>>>>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>>>> 0x00001002
>>>>>>>>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>>>> 0x00000070
>>>>>>>>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>>>> 0x00000023
>>>>>>>>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>>>> 0x0000003c
>>>>>>>>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>>>> 0x00000080
>>>>>>>>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>>>> 0x00000007
>>>>>>>>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>>>> 0x00ff0003
>>>>>>>>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>>>> 0x80000807
>>>>>>>>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>>>> 0x000000011eef1218
>>>>>>>>>> [  347.790234] mmc0: sdhci: 
>>>>>>>>>> ============================================
>>>>>>>>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>>>>>>>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP 
>>>>>>>>>> ===========
>>>>>>>>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  
>>>>>>>>>> 0x00001002
>>>>>>>>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  
>>>>>>>>>> 0x00000070
>>>>>>>>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 
>>>>>>>>>> 0x00000023
>>>>>>>>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 
>>>>>>>>>> 0x0000003c
>>>>>>>>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  
>>>>>>>>>> 0x00000080
>>>>>>>>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    
>>>>>>>>>> 0x00000007
>>>>>>>>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 
>>>>>>>>>> 0x00ff0003
>>>>>>>>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   
>>>>>>>>>> 0x80000807
>>>>>>>>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  
>>>>>>>>>> 0x00000000
>>>>>>>>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 
>>>>>>>>>> 0x000000011eef1218
>>>>>>>>>> [  347.803119] mmc0: sdhci: 
>>>>>>>>>> ============================================
>>>>>>>>>>  --
>>>>>>>>>> 
>>>>>> --
>>>>>> 
>>>> 
>> 

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-01-26 19:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-18 21:24 Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag # sebastian.kloska
2023-01-19  9:15 ` Adrian Hunter
2023-01-19  9:46   ` sebastian.kloska
2023-01-19 10:28     ` Adrian Hunter
     [not found]       ` <08bd329f23b8fac7a2249034804ac3c7@snafu.de>
     [not found]         ` <f31a5b1cc0ae7182199b89730891770d@snafu.de>
2023-01-20 12:08           ` Adrian Hunter
2023-01-21 13:37             ` sebastian.kloska
2023-01-23 10:46               ` Adrian Hunter
2023-01-23 12:38                 ` sebastian.kloska
2023-01-26 17:25                 ` sebastian.kloska
2023-01-26 19:25                   ` Adrian Hunter
2023-01-26 19:44                     ` sebastian.kloska
2023-01-19 14:38 ` Linux kernel regression tracking (#adding)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.