* the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel
@ 2021-08-18  4:06 Hui Wang
  2021-08-18  5:33 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 8+ messages in thread
From: Hui Wang @ 2021-08-18  4:06 UTC (permalink / raw)
  To: marex; +Cc: Greg Kroah-Hartman, Stable

Hi Marex,

We backported this patch to the Ubuntu 4.15.0-generic kernel and found that
it makes the rsi driver crash on system resume on the Dell 300x IoT
platform (reproducible 100% of the time). The log is below. Once this log
appears, the rsi wifi no longer works, and we need to run
'rmmod rsi_sdio; modprobe rsi_sdio' to make it work again.

Do you know whether something else is missing apart from this patch, or is
this patch not suitable for the 4.15 kernel at all?

Thanks,

Hui.


[  118.494238] Freezing user space processes ... (elapsed 0.001 seconds) done.
[  118.495866] OOM killer disabled.
[  118.495868] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  118.497772] Suspending console(s) (use no_console_suspend to debug)
[  118.499120] rsi_91x: ===> Interface DOWN <===
[  129.013207] mmc1: Controller never released inhibit bit(s).
[  129.013216] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[  129.013226] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
[  129.013233] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
[  129.013240] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 0x0000ffff
[  129.013247] mmc1: sdhci: Present:   0xffffffff | Host ctl: 0x000000ff
[  129.013254] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
[  129.013261] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
[  129.013268] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 0xffffffff
[  129.013276] mmc1: sdhci: Int enab:  0xffffffff | Sig enab: 0xffffffff
[  129.013283] mmc1: sdhci: ACmd stat: 0x0000ffff | Slot int: 0x0000ffff
[  129.013290] mmc1: sdhci: Caps:      0xffffffff | Caps_1: 0xffffffff
[  129.013297] mmc1: sdhci: Cmd:       0x0000ffff | Max curr: 0xffffffff
[  129.013304] mmc1: sdhci: Resp[0]:   0xffffffff | Resp[1]: 0xffffffff
[  129.013311] mmc1: sdhci: Resp[2]:   0xffffffff | Resp[3]: 0xffffffff
[  129.013316] mmc1: sdhci: Host ctl2: 0x0000ffff
[  129.013323] mmc1: sdhci: ADMA Err:  0xffffffff | ADMA Ptr: 0xffffffff
[  129.013327] mmc1: sdhci: ============================================
[  129.113415] mmc1: Reset 0x2 never completed.
[  129.113417] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[  129.113421] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
[  129.113424] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
[  129.113428] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 0x0000ffff
[  129.113431] mmc1: sdhci: Present:   0xffffffff | Host ctl: 0x000000ff
[  129.113435] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
[  129.113439] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
[  129.113442] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 0xffffffff
[  129.113446] mmc1: sdhci: Int enab:  0xffffffff | Sig enab: 0xffffffff
[  129.113449] mmc1: sdhci: ACmd stat: 0x0000ffff | Slot int: 0x0000ffff
[  129.113453] mmc1: sdhci: Caps:      0xffffffff | Caps_1: 0xffffffff
[  129.113457] mmc1: sdhci: Cmd:       0x0000ffff | Max curr: 0xffffffff
[  129.113460] mmc1: sdhci: Resp[0]:   0xffffffff | Resp[1]: 0xffffffff
[  129.113464] mmc1: sdhci: Resp[2]:   0xffffffff | Resp[3]: 0xffffffff
[  129.113466] mmc1: sdhci: Host ctl2: 0x0000ffff
[  129.113470] mmc1: sdhci: ADMA Err:  0xffffffff | ADMA Ptr: 0xffffffff
[  129.113472] mmc1: sdhci: ============================================
[  129.213489] mmc1: Reset 0x4 never completed.
[  129.213490] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[  129.213494] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
[  129.213498] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
[  129.213501] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 0x0000ffff
[  129.213505] mmc1: sdhci: Present:   0xffffffff | Host ctl: 0x000000ff
[  129.213508] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
[  129.213512] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
[  129.213515] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 0xffffffff
[  129.213519] mmc1: sdhci: Int enab:  0xffffffff | Sig enab: 0xffffffff
[  129.213523] mmc1: sdhci: ACmd stat: 0x0000ffff | Slot int: 0x0000ffff
[  129.213526] mmc1: sdhci: Caps:      0xffffffff | Caps_1: 0xffffffff
[  129.213530] mmc1: sdhci: Cmd:       0x0000ffff | Max curr: 0xffffffff
[  129.213534] mmc1: sdhci: Resp[0]:   0xffffffff | Resp[1]: 0xffffffff
[  129.213537] mmc1: sdhci: Resp[2]:   0xffffffff | Resp[3]: 0xffffffff
[  129.213540] mmc1: sdhci: Host ctl2: 0x0000ffff
[  129.213543] mmc1: sdhci: ADMA Err:  0xffffffff | ADMA Ptr: 0xffffffff
[  129.213545] mmc1: sdhci: ============================================
[  129.213882] rsi_91x: rsi_sdio_enable_interrupts: Failed to read int enable register
[  129.240392] rsi_91x: ===> Interface UP <===
[  129.240443] rsi_91x: rsi_disable_ps: Cannot accept disable PS in PS_NONE state
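
A minimal sketch for triaging a dump like the one above: it parses the
"mmc1: sdhci:" register lines and reports whether every register reads back
as all-ones for its width (0xff, 0xffff or 0xffffffff). Treating an
all-ones dump as a sign that the controller was not reachable when it was
read is a general rule of thumb and an assumption here, not something
established in this thread. The function names are hypothetical.

```python
import re

# Matches one "Name: 0x..." pair, e.g. "Sys addr:  0xffffffff".
REG_RE = re.compile(r"([A-Za-z][A-Za-z0-9_\[\] \-]*?):\s+(0x[0-9a-fA-F]+)")

# Values an 8-, 16- or 32-bit register shows when every bit reads as 1.
ALL_ONES = {0xFF, 0xFFFF, 0xFFFFFFFF}


def parse_sdhci_dump(lines):
    """Return {register name: value} for every sdhci register line."""
    regs = {}
    for line in lines:
        if "sdhci:" not in line:
            continue
        payload = line.split("sdhci:", 1)[1]
        # Each dump line holds up to two registers separated by "|".
        for field in payload.split("|"):
            match = REG_RE.search(field)
            if match:
                regs[match.group(1).strip()] = int(match.group(2), 16)
    return regs


def looks_unreachable(regs):
    """True if the dump is non-empty and every register is all-ones."""
    return bool(regs) and all(value in ALL_ONES for value in regs.values())


if __name__ == "__main__":
    sample = [
        "[  129.013216] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========",
        "[  129.013226] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff",
        "[  129.013254] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff",
    ]
    print(looks_unreachable(parse_sdhci_dump(sample)))
```

The same functions can be fed the output of dmesg line by line. A single
8-bit register can legitimately hold 0xff, which is why the check only
flags a dump in which every register is all-ones.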



* Re: the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel
  2021-08-18  4:06 the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel Hui Wang
@ 2021-08-18  5:33 ` Greg Kroah-Hartman
  2021-08-18  9:04   ` Marek Vasut
  0 siblings, 1 reply; 8+ messages in thread
From: Greg Kroah-Hartman @ 2021-08-18  5:33 UTC (permalink / raw)
  To: Hui Wang; +Cc: marex, Stable

On Wed, Aug 18, 2021 at 12:06:15PM +0800, Hui Wang wrote:
> Hi Marex,
> 
> We backported this patch to ubuntu 4.15.0-generic kernel, and found this
> patch introduced the rsi driver crashing when running system resume on the
> Dell 300x IoT platform (100% rate). Below is the log, After seeing this log,
> the rsi wifi can't work anymore, need to run 'rmmod rsi_sdio;modprobe
> rsi_sdio" to make it work again.
> 
> So do you know what is missing apart from this patch or this patch is not
> suitable for 4.15 kernel at all?

Does 4.19.191 work for this system?  Why not just use that or newer
instead?

thanks,

greg k-h


* Re: the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel
  2021-08-18  5:33 ` Greg Kroah-Hartman
@ 2021-08-18  9:04   ` Marek Vasut
  2021-08-19  2:57     ` Hui Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Marek Vasut @ 2021-08-18  9:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Hui Wang; +Cc: Stable, Martin Fuzzey, Guido Günther

On 8/18/21 7:33 AM, Greg Kroah-Hartman wrote:
> On Wed, Aug 18, 2021 at 12:06:15PM +0800, Hui Wang wrote:
>> Hi Marex,
>>
>> We backported this patch to ubuntu 4.15.0-generic kernel, and found this
>> patch introduced the rsi driver crashing when running system resume on the
>> Dell 300x IoT platform (100% rate). Below is the log, After seeing this log,
>> the rsi wifi can't work anymore, need to run 'rmmod rsi_sdio;modprobe
>> rsi_sdio" to make it work again.
>>
>> So do you know what is missing apart from this patch or this patch is not
>> suitable for 4.15 kernel at all?
> 
> Does 4.19.191 work for this system?  Why not just use that or newer
> instead?

I haven't seen this on linux-stable 5.4.y or 5.10.y, if that information 
is of any use.

But I have to admit, I am tempted to mark the whole driver as BROKEN and 
submit that for stable backports.

Because that is what it is, it is buggy, broken, and the hardware lacks 
any documentation. I spent an insane amount of time talking to RedPine 
Signals / SiLabs trying to get help with basic things like association 
problems against various APs, no result there. I tried getting hardware 
docs from them so I can fix the driver myself, no result either. So far 
I tried to pick various fixes from their downstream driver and submit 
them, but that is massively time consuming and the changes there are not 
separated or documented, it is just one large chunk of code.

As far as I can tell, they also have no interest in fixing the driver or 
helping others with fixing it, so maybe we should just mark it as broken 
... :-(


* Re: the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel
  2021-08-18  9:04   ` Marek Vasut
@ 2021-08-19  2:57     ` Hui Wang
  2021-08-19  5:31       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 8+ messages in thread
From: Hui Wang @ 2021-08-19  2:57 UTC (permalink / raw)
  To: Marek Vasut, Greg Kroah-Hartman; +Cc: Stable, Martin Fuzzey, Guido Günther


On 8/18/21 5:04 PM, Marek Vasut wrote:
> On 8/18/21 7:33 AM, Greg Kroah-Hartman wrote:
>> On Wed, Aug 18, 2021 at 12:06:15PM +0800, Hui Wang wrote:
>>> Hi Marex,
>>>
>>> We backported this patch to ubuntu 4.15.0-generic kernel, and found 
>>> this
>>> patch introduced the rsi driver crashing when running system resume 
>>> on the
>>> Dell 300x IoT platform (100% rate). Below is the log, After seeing 
>>> this log,
>>> the rsi wifi can't work anymore, need to run 'rmmod rsi_sdio;modprobe
>>> rsi_sdio" to make it work again.
>>>
>>> So do you know what is missing apart from this patch or this patch 
>>> is not
>>> suitable for 4.15 kernel at all?
>>
>> Does 4.19.191 work for this system?  Why not just use that or newer
>> instead?
>
> I haven't seen this on linux-stable 5.4.y or 5.10.y, if that 
> information is of any use.
>
> But I have to admit, I am tempted to mark the whole driver as BROKEN 
> and submit that for stable backports.
>
> Because that is what it is, it is buggy, broken, and the hardware 
> lacks any documentation. I spent an insane amount of time talking to 
> RedPine Signals / SiLabs trying to get help with basic things like 
> association problems against various APs, no result there. I tried 
> getting hardware docs from them so I can fix the driver myself, no 
> result either. So far I tried to pick various fixes from their 
> downstream driver and submit them, but that is massively time 
> consuming and the changes there are not separated or documented, it is 
> just one large chunk of code.
>
> As far as I can tell, they also have no interest in fixing the driver 
> or helping others with fixing it, so maybe we should just mark it as 
> broken ... :-(

Hi Marek,

Got it, thanks for sharing it.

Hi Greg,

I just tested 4.19.191 and got the same result: the wifi crashes after
resume.

admin@HW6VB02:~$ uname -a
Linux HW6VB02 4.19.191 #1 SMP Thu Aug 19 10:19:32 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

[   59.682908] sdhci-acpi INT33BB:00: pre_suspend failed for non-removable host: -38
[   59.682917] Freezing user space processes ... (elapsed 0.003 seconds) done.
[   59.686063] OOM killer disabled.
[   59.686065] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   59.687385] Suspending console(s) (use no_console_suspend to debug)
[   59.687931] rsi_91x: ===> Interface DOWN <===
[   70.068983] mmc1: Controller never released inhibit bit(s).
[   70.068992] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[   70.069002] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
[   70.069009] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
[   70.069016] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 0x0000ffff
[   70.069023] mmc1: sdhci: Present:   0xffffffff | Host ctl: 0x000000ff
[   70.069030] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
[   70.069036] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
[   70.069043] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 0xffffffff


So should we revert this commit from 4.19.y?

Thanks,

Hui.



* Re: the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel
  2021-08-19  2:57     ` Hui Wang
@ 2021-08-19  5:31       ` Greg Kroah-Hartman
  2021-08-19  7:49         ` Marek Vasut
  0 siblings, 1 reply; 8+ messages in thread
From: Greg Kroah-Hartman @ 2021-08-19  5:31 UTC (permalink / raw)
  To: Hui Wang; +Cc: Marek Vasut, Stable, Martin Fuzzey, Guido Günther

On Thu, Aug 19, 2021 at 10:57:03AM +0800, Hui Wang wrote:
> 
> On 8/18/21 5:04 PM, Marek Vasut wrote:
> > On 8/18/21 7:33 AM, Greg Kroah-Hartman wrote:
> > > On Wed, Aug 18, 2021 at 12:06:15PM +0800, Hui Wang wrote:
> > > > Hi Marex,
> > > > 
> > > > We backported this patch to ubuntu 4.15.0-generic kernel, and
> > > > found this
> > > > patch introduced the rsi driver crashing when running system
> > > > resume on the
> > > > Dell 300x IoT platform (100% rate). Below is the log, After
> > > > seeing this log,
> > > > the rsi wifi can't work anymore, need to run 'rmmod rsi_sdio;modprobe
> > > > rsi_sdio" to make it work again.
> > > > 
> > > > So do you know what is missing apart from this patch or this
> > > > patch is not
> > > > suitable for 4.15 kernel at all?
> > > 
> > > Does 4.19.191 work for this system?  Why not just use that or newer
> > > instead?
> > 
> > I haven't seen this on linux-stable 5.4.y or 5.10.y, if that information
> > is of any use.
> > 
> > But I have to admit, I am tempted to mark the whole driver as BROKEN and
> > submit that for stable backports.
> > 
> > Because that is what it is, it is buggy, broken, and the hardware lacks
> > any documentation. I spent an insane amount of time talking to RedPine
> > Signals / SiLabs trying to get help with basic things like association
> > problems against various APs, no result there. I tried getting hardware
> > docs from them so I can fix the driver myself, no result either. So far
> > I tried to pick various fixes from their downstream driver and submit
> > them, but that is massively time consuming and the changes there are not
> > separated or documented, it is just one large chunk of code.
> > 
> > As far as I can tell, they also have no interest in fixing the driver or
> > helping others with fixing it, so maybe we should just mark it as broken
> > ... :-(
> 
> Hi Marek,
> 
> Got it, thanks for sharing it.
> 
> Hi Greg,
> 
> I just tested the 4.19.191, got the same result, the wifi will crash after
> resume under 4.19.191:
> 
> admin@HW6VB02:~$ uname -a
> Linux HW6VB02 4.19.191 #1 SMP Thu Aug 19 10:19:32 CST 2021 x86_64 x86_64
> x86_64 GNU/Linux
> 
> [   59.682908] sdhci-acpi INT33BB:00: pre_suspend failed for non-removable
> host: -38
> [   59.682917] Freezing user space processes ... (elapsed 0.003 seconds)
> done.
> [   59.686063] OOM killer disabled.
> [   59.686065] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [   59.687385] Suspending console(s) (use no_console_suspend to debug)
> [   59.687931] rsi_91x: ===> Interface DOWN <===
> [   70.068983] mmc1: Controller never released inhibit bit(s).
> [   70.068992] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
> [   70.069002] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
> [   70.069009] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
> [   70.069016] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 0x0000ffff
> [   70.069023] mmc1: sdhci: Present:   0xffffffff | Host ctl: 0x000000ff
> [   70.069030] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
> [   70.069036] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
> [   70.069043] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 0xffffffff
> 
> 
> So let us revert this commit from 4.19.y?

If you revert it, does it work properly?  What about in Linus's tree?

thanks,

greg k-h


* Re: the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel
  2021-08-19  5:31       ` Greg Kroah-Hartman
@ 2021-08-19  7:49         ` Marek Vasut
  2021-08-19  8:52           ` Hui Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Marek Vasut @ 2021-08-19  7:49 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Hui Wang; +Cc: Stable, Martin Fuzzey, Guido Günther

On 8/19/21 7:31 AM, Greg Kroah-Hartman wrote:
> On Thu, Aug 19, 2021 at 10:57:03AM +0800, Hui Wang wrote:
>>
>> On 8/18/21 5:04 PM, Marek Vasut wrote:
>>> On 8/18/21 7:33 AM, Greg Kroah-Hartman wrote:
>>>> On Wed, Aug 18, 2021 at 12:06:15PM +0800, Hui Wang wrote:
>>>>> Hi Marex,
>>>>>
>>>>> We backported this patch to ubuntu 4.15.0-generic kernel, and
>>>>> found this
>>>>> patch introduced the rsi driver crashing when running system
>>>>> resume on the
>>>>> Dell 300x IoT platform (100% rate). Below is the log, After
>>>>> seeing this log,
>>>>> the rsi wifi can't work anymore, need to run 'rmmod rsi_sdio;modprobe
>>>>> rsi_sdio" to make it work again.
>>>>>
>>>>> So do you know what is missing apart from this patch or this
>>>>> patch is not
>>>>> suitable for 4.15 kernel at all?
>>>>
>>>> Does 4.19.191 work for this system?  Why not just use that or newer
>>>> instead?
>>>
>>> I haven't seen this on linux-stable 5.4.y or 5.10.y, if that information
>>> is of any use.
>>>
>>> But I have to admit, I am tempted to mark the whole driver as BROKEN and
>>> submit that for stable backports.
>>>
>>> Because that is what it is, it is buggy, broken, and the hardware lacks
>>> any documentation. I spent an insane amount of time talking to RedPine
>>> Signals / SiLabs trying to get help with basic things like association
>>> problems against various APs, no result there. I tried getting hardware
>>> docs from them so I can fix the driver myself, no result either. So far
>>> I tried to pick various fixes from their downstream driver and submit
>>> them, but that is massively time consuming and the changes there are not
>>> separated or documented, it is just one large chunk of code.
>>>
>>> As far as I can tell, they also have no interest in fixing the driver or
>>> helping others with fixing it, so maybe we should just mark it as broken
>>> ... :-(
>>
>> Hi Marek,
>>
>> Got it, thanks for sharing it.
>>
>> Hi Greg,
>>
>> I just tested the 4.19.191, got the same result, the wifi will crash after
>> resume under 4.19.191:
>>
>> admin@HW6VB02:~$ uname -a
>> Linux HW6VB02 4.19.191 #1 SMP Thu Aug 19 10:19:32 CST 2021 x86_64 x86_64
>> x86_64 GNU/Linux
>>
>> [   59.682908] sdhci-acpi INT33BB:00: pre_suspend failed for non-removable
>> host: -38
>> [   59.682917] Freezing user space processes ... (elapsed 0.003 seconds)
>> done.
>> [   59.686063] OOM killer disabled.
>> [   59.686065] Freezing remaining freezable tasks ... (elapsed 0.001
>> seconds) done.
>> [   59.687385] Suspending console(s) (use no_console_suspend to debug)
>> [   59.687931] rsi_91x: ===> Interface DOWN <===
>> [   70.068983] mmc1: Controller never released inhibit bit(s).
>> [   70.068992] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
>> [   70.069002] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
>> [   70.069009] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
>> [   70.069016] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 0x0000ffff
>> [   70.069023] mmc1: sdhci: Present:   0xffffffff | Host ctl: 0x000000ff
>> [   70.069030] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
>> [   70.069036] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
>> [   70.069043] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 0xffffffff
>>
>>
>> So let us revert this commit from 4.19.y?
> 
> If you revert it, does it work properly?  What about in Linus's tree?

I suspect in that case, sdio_claim_host() will spin indefinitely and 
never finish, see the c434e5e48dc4e ("rsi: Use resume_noirq for SDIO") 
commit message.

Note that I did my tests on ARM MMCI (stm32mp1 variant).

This "[   70.068983] mmc1: Controller never released inhibit bit(s)" 
looks suspicious in the log above.

Also, newer versions of the RSI downstream driver [1], as of 390542d 
("Updated Readme.txt file"), simply comment out 
rsi_sdio_enable_interrupts() in rsi_resume() in rsi/rsi_91x_sdio.c, which 
suggests RSI ran into the same problem but "fixed" it differently. I 
think the approach RSI took is wrong and just hid the issue.

[1] git://github.com/SiliconLabs/RS911X-nLink-OSD


* Re: the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel
  2021-08-19  7:49         ` Marek Vasut
@ 2021-08-19  8:52           ` Hui Wang
  2021-08-19 10:57             ` Marek Vasut
  0 siblings, 1 reply; 8+ messages in thread
From: Hui Wang @ 2021-08-19  8:52 UTC (permalink / raw)
  To: Marek Vasut, Greg Kroah-Hartman; +Cc: Stable, Martin Fuzzey, Guido Günther


On 8/19/21 3:49 PM, Marek Vasut wrote:
> On 8/19/21 7:31 AM, Greg Kroah-Hartman wrote:
>> On Thu, Aug 19, 2021 at 10:57:03AM +0800, Hui Wang wrote:
>>>
>>> On 8/18/21 5:04 PM, Marek Vasut wrote:
>>>> On 8/18/21 7:33 AM, Greg Kroah-Hartman wrote:
>>>>> On Wed, Aug 18, 2021 at 12:06:15PM +0800, Hui Wang wrote:
>>>>>> Hi Marex,
>>>>>>
>>>>>> We backported this patch to ubuntu 4.15.0-generic kernel, and
>>>>>> found this
>>>>>> patch introduced the rsi driver crashing when running system
>>>>>> resume on the
>>>>>> Dell 300x IoT platform (100% rate). Below is the log, After
>>>>>> seeing this log,
>>>>>> the rsi wifi can't work anymore, need to run 'rmmod 
>>>>>> rsi_sdio;modprobe
>>>>>> rsi_sdio" to make it work again.
>>>>>>
>>>>>> So do you know what is missing apart from this patch or this
>>>>>> patch is not
>>>>>> suitable for 4.15 kernel at all?
>>>>>
>>>>> Does 4.19.191 work for this system?  Why not just use that or newer
>>>>> instead?
>>>>
>>>> I haven't seen this on linux-stable 5.4.y or 5.10.y, if that 
>>>> information
>>>> is of any use.
>>>>
>>>> But I have to admit, I am tempted to mark the whole driver as 
>>>> BROKEN and
>>>> submit that for stable backports.
>>>>
>>>> Because that is what it is, it is buggy, broken, and the hardware 
>>>> lacks
>>>> any documentation. I spent an insane amount of time talking to RedPine
>>>> Signals / SiLabs trying to get help with basic things like association
>>>> problems against various APs, no result there. I tried getting 
>>>> hardware
>>>> docs from them so I can fix the driver myself, no result either. So 
>>>> far
>>>> I tried to pick various fixes from their downstream driver and submit
>>>> them, but that is massively time consuming and the changes there 
>>>> are not
>>>> separated or documented, it is just one large chunk of code.
>>>>
>>>> As far as I can tell, they also have no interest in fixing the 
>>>> driver or
>>>> helping others with fixing it, so maybe we should just mark it as 
>>>> broken
>>>> ... :-(
>>>
>>> Hi Marek,
>>>
>>> Got it, thanks for sharing it.
>>>
>>> Hi Greg,
>>>
>>> I just tested the 4.19.191, got the same result, the wifi will crash 
>>> after
>>> resume under 4.19.191:
>>>
>>> admin@HW6VB02:~$ uname -a
>>> Linux HW6VB02 4.19.191 #1 SMP Thu Aug 19 10:19:32 CST 2021 x86_64 
>>> x86_64
>>> x86_64 GNU/Linux
>>>
>>> [   59.682908] sdhci-acpi INT33BB:00: pre_suspend failed for 
>>> non-removable
>>> host: -38
>>> [   59.682917] Freezing user space processes ... (elapsed 0.003 
>>> seconds)
>>> done.
>>> [   59.686063] OOM killer disabled.
>>> [   59.686065] Freezing remaining freezable tasks ... (elapsed 0.001
>>> seconds) done.
>>> [   59.687385] Suspending console(s) (use no_console_suspend to debug)
>>> [   59.687931] rsi_91x: ===> Interface DOWN <===
>>> [   70.068983] mmc1: Controller never released inhibit bit(s).
>>> [   70.068992] mmc1: sdhci: ============ SDHCI REGISTER DUMP 
>>> ===========
>>> [   70.069002] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
>>> [   70.069009] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
>>> [   70.069016] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 
>>> 0x0000ffff
>>> [   70.069023] mmc1: sdhci: Present:   0xffffffff | Host ctl: 
>>> 0x000000ff
>>> [   70.069030] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
>>> [   70.069036] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
>>> [   70.069043] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 
>>> 0xffffffff
>>>
>>>
>>> So let us revert this commit from 4.19.y?
>>
>> If you revert it, does it work properly?  What about in Linus's tree?

I reverted the commit in 4.19.191, and then the wifi worked both before 
and after system resume. I also tested the mainline kernel, linux-5.13: 
before suspend the wifi worked, but after suspend the whole system could 
not wake up, and I couldn't recover it because I have no physical access 
to the machine. I did all the tests remotely via ssh, so there is no test 
result for Linus's tree.

>
> I suspect in that case, sdio_claim_host() will spin indefinitely and 
> never finish, see the c434e5e48dc4e ("rsi: Use resume_noirq for SDIO") 
> commit message.
At least, we have never seen this issue on the 4.15 kernel: without 
commit c434e5e48dc4e ("rsi: Use resume_noirq for SDIO"), the wifi and 
bluetooth work well before and after suspend.
>
> Note that I did my tests on ARM MMCI (stm32mp1 variant).
The platform I am testing is an x86 one, and the SDHCI controller driver 
is sdhci_acpi.c.
>
> This "[   70.068983] mmc1: Controller never released inhibit bit(s)" 
> looks suspicious in the log above.
>
> Also, newer versions of the RSI downstream driver [1] as of 390542d 
> ("Updated Readme.txt file") simply comment out 
> rsi_sdio_enable_interrupts() in rsi/rsi_91x_sdio.c rsi_resume(), which 
> looks like RSI ran into the same problem, but "fixed" it differently. 
> I think that approach RSI took is wrong and it just hid the issue.
>
> [1] git://github.com/SiliconLabs/RS911X-nLink-OSD


* Re: the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel
  2021-08-19  8:52           ` Hui Wang
@ 2021-08-19 10:57             ` Marek Vasut
  0 siblings, 0 replies; 8+ messages in thread
From: Marek Vasut @ 2021-08-19 10:57 UTC (permalink / raw)
  To: Hui Wang, Greg Kroah-Hartman; +Cc: Stable, Martin Fuzzey, Guido Günther

On 8/19/21 10:52 AM, Hui Wang wrote:
> 
> On 8/19/21 3:49 PM, Marek Vasut wrote:
>> On 8/19/21 7:31 AM, Greg Kroah-Hartman wrote:
>>> On Thu, Aug 19, 2021 at 10:57:03AM +0800, Hui Wang wrote:
>>>>
>>>> On 8/18/21 5:04 PM, Marek Vasut wrote:
>>>>> On 8/18/21 7:33 AM, Greg Kroah-Hartman wrote:
>>>>>> On Wed, Aug 18, 2021 at 12:06:15PM +0800, Hui Wang wrote:
>>>>>>> Hi Marex,
>>>>>>>
>>>>>>> We backported this patch to ubuntu 4.15.0-generic kernel, and
>>>>>>> found this
>>>>>>> patch introduced the rsi driver crashing when running system
>>>>>>> resume on the
>>>>>>> Dell 300x IoT platform (100% rate). Below is the log, After
>>>>>>> seeing this log,
>>>>>>> the rsi wifi can't work anymore, need to run 'rmmod 
>>>>>>> rsi_sdio;modprobe
>>>>>>> rsi_sdio" to make it work again.
>>>>>>>
>>>>>>> So do you know what is missing apart from this patch or this
>>>>>>> patch is not
>>>>>>> suitable for 4.15 kernel at all?
>>>>>>
>>>>>> Does 4.19.191 work for this system?  Why not just use that or newer
>>>>>> instead?
>>>>>
>>>>> I haven't seen this on linux-stable 5.4.y or 5.10.y, if that 
>>>>> information
>>>>> is of any use.
>>>>>
>>>>> But I have to admit, I am tempted to mark the whole driver as 
>>>>> BROKEN and
>>>>> submit that for stable backports.
>>>>>
>>>>> Because that is what it is, it is buggy, broken, and the hardware 
>>>>> lacks
>>>>> any documentation. I spent an insane amount of time talking to RedPine
>>>>> Signals / SiLabs trying to get help with basic things like association
>>>>> problems against various APs, no result there. I tried getting 
>>>>> hardware
>>>>> docs from them so I can fix the driver myself, no result either. So 
>>>>> far
>>>>> I tried to pick various fixes from their downstream driver and submit
>>>>> them, but that is massively time consuming and the changes there 
>>>>> are not
>>>>> separated or documented, it is just one large chunk of code.
>>>>>
>>>>> As far as I can tell, they also have no interest in fixing the 
>>>>> driver or
>>>>> helping others with fixing it, so maybe we should just mark it as 
>>>>> broken
>>>>> ... :-(
>>>>
>>>> Hi Marek,
>>>>
>>>> Got it, thanks for sharing it.
>>>>
>>>> Hi Greg,
>>>>
>>>> I just tested the 4.19.191, got the same result, the wifi will crash 
>>>> after
>>>> resume under 4.19.191:
>>>>
>>>> admin@HW6VB02:~$ uname -a
>>>> Linux HW6VB02 4.19.191 #1 SMP Thu Aug 19 10:19:32 CST 2021 x86_64 
>>>> x86_64
>>>> x86_64 GNU/Linux
>>>>
>>>> [   59.682908] sdhci-acpi INT33BB:00: pre_suspend failed for 
>>>> non-removable
>>>> host: -38
>>>> [   59.682917] Freezing user space processes ... (elapsed 0.003 
>>>> seconds)
>>>> done.
>>>> [   59.686063] OOM killer disabled.
>>>> [   59.686065] Freezing remaining freezable tasks ... (elapsed 0.001
>>>> seconds) done.
>>>> [   59.687385] Suspending console(s) (use no_console_suspend to debug)
>>>> [   59.687931] rsi_91x: ===> Interface DOWN <===
>>>> [   70.068983] mmc1: Controller never released inhibit bit(s).
>>>> [   70.068992] mmc1: sdhci: ============ SDHCI REGISTER DUMP 
>>>> ===========
>>>> [   70.069002] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
>>>> [   70.069009] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
>>>> [   70.069016] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 
>>>> 0x0000ffff
>>>> [   70.069023] mmc1: sdhci: Present:   0xffffffff | Host ctl: 
>>>> 0x000000ff
>>>> [   70.069030] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
>>>> [   70.069036] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
>>>> [   70.069043] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 
>>>> 0xffffffff
>>>>
>>>>
>>>> So let us revert this commit from 4.19.y?
>>>
>>> If you revert it, does it work properly?  What about in Linus's tree?
> 
> I reverted the commit in the 4.19.191, then the wifi could work both 
> before and after the system resume. I tested the mainline kernel 
> linux-5.13, before suspend, the wifi could work, after suspend, the 
> whole system can't wakeup, and I couldn't recover the system since I 
> can't access the machine physically. I did all test via ssh remotely. So 
> there is no testing result for Linus' tree.

I suspect you just hit the issue this patch was trying to fix then.

If you have console access, use no_console_suspend to see the backtrace 
on wake up.
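
Before re-running the suspend test, it can help to confirm that the
parameter actually reached the running kernel. A minimal sketch, assuming
the usual Linux /proc/cmdline interface; the helper names are hypothetical
and the whitespace split does not handle quoted parameter values that
contain spaces:

```python
def has_kernel_param(cmdline, name):
    """Return True if `name` appears as a parameter in a kernel command line.

    Matches both bare flags ("no_console_suspend") and "name=value" forms.
    """
    for token in cmdline.split():
        key = token.split("=", 1)[0]
        if key == name:
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    if has_kernel_param(read_cmdline(), "no_console_suspend"):
        print("no_console_suspend is set")
    else:
        print("no_console_suspend is NOT set; add it to the boot command line")
```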

>> I suspect in that case, sdio_claim_host() will spin indefinitely and 
>> never finish, see the c434e5e48dc4e ("rsi: Use resume_noirq for SDIO") 
>> commit message.
> At least, we never seen this issue in the kernel 4.15, without the 
> commit of c434e5e48dc4e ("rsi: Use resume_noirq for SDIO"), the wifi and 
> bluetooth works well before and after suspend.

I suspect you might've just been lucky with that, because it seems RSI 
did hit it too (see below). This could also be something which triggers 
only on specific controller drivers (?).

>>
>> Note that I did my tests on ARM MMCI (stm32mp1 variant).
> The platform I am testing is a X86 one, and the sdhci controller driver 
> is sdhci_acpi.c.

Do you have an RSI module which can be plugged into an SD card slot 
there, or is that RSI module soldered onto some devkit/board?

Mine is the latter, soldered onto a SoM, so I have a hard time testing on 
other SDIO controllers.

>> This "[   70.068983] mmc1: Controller never released inhibit bit(s)" 
>> looks suspicious in the log above.
>>
>> Also, newer versions of the RSI downstream driver [1] as of 390542d 
>> ("Updated Readme.txt file") simply comment out 
>> rsi_sdio_enable_interrupts() in rsi/rsi_91x_sdio.c rsi_resume(), which 
>> looks like RSI ran into the same problem, but "fixed" it differently. 
>> I think that approach RSI took is wrong and it just hid the issue.
>>
>> [1] git://github.com/SiliconLabs/RS911X-nLink-OSD

The bottom line is, I would really prefer to figure out what the problem 
you see on Linux 5.13.y is, fix that, and backport the fix, so that 
suspend/resume works correctly for everyone, rather than revert a patch 
without really understanding the underlying problem.

Sadly, the RSI driver is buggy.
