netdev.vger.kernel.org archive mirror
* [Bug] mt7921e driver in 5.16 causes kernel panic
@ 2022-01-11 23:17 Khalid Aziz
  2022-01-11 23:31 ` Ben Greear
  0 siblings, 1 reply; 6+ messages in thread
From: Khalid Aziz @ 2022-01-11 23:17 UTC (permalink / raw)
  To: nbd, lorenzo.bianconi83, ryder.lee, shayne.chen, sean.wang, kvalo
  Cc: davem, kuba, matthias.bgg, linux-kernel, netdev

I am seeing an intermittent bug in the mt7921e driver. When the driver module is loaded
and initialized, almost every other time it appears to write to some wild memory
location. As a result, the driver fails to initialize with the message
"Timeout for driver own", and at the same time I start to see "Bad page state" messages
for random processes. Here is the relevant part of dmesg:

[  OK  ] Found device SAMSUNG MZVLB1T0HBLR-000L7 6.
[  OK  ] Found device SAMSUNG MZVLB1T0HBLR-000L7 SYSTEM.
[  OK  ] Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
         Starting Cryptography Setup for nvme8n1p6_crypt...
[  5.687489] mt7921e 0000:03:00.0: ASIC revision: 79610010
         Starting File System Check on /dev/disk/by-uuid/CCSA-8086...
Please enter passphrase for disk SAMSUNG MZVLB1T0HBLR-000L7 (nvme8n1p6_crypt) on /home
[  7.798962] mt7921e 0000:03:00.0: Timeout for driver own
[  8.874863] mt7921e 0000:03:00.0: Timeout for driver own
[  8.876266] BUG: Bad page state in process systemd-udevd  pfn:123848
[  8.877953] BUG: Bad page state in process napi/phy8-8194  pfn:10a4a8
[  9.958899] mt7921e 0000:03:00.0: Timeout for driver own
[  9.961595] BUG: Bad page state in process systemd-udevd  pfn:1037e8
[ 11.843129] mt7921e 0000:03:00.0: Timeout for driver own
[ 11.845823] BUG: Bad page state in process systemd-udevd  pfn:104380
[ 12.126922] mt7921e 0000:03:00.0: Timeout for driver own
[ 12.128788] BUG: Bad page state in process systemd-udevd  pfn:10a050
[ 13.287898] mt7921e 0000:03:00.0: Timeout for driver own
[ 14.287827] mt7921e 0000:03:00.0: Timeout for driver own
[ 14.288968] BUG: Bad page state in process systemd-udevd  pfn:109f51
[ 14.298599] BUG: Bad page state in process systemd-udevd  pfn:105f60
[ 14.292162] BUG: Bad page state in process systemd-udevd  pfn:10ac07
[ 15.372501] mt7921e 0000:03:00.0: Timeout for driver own
[ 16.454773] mt7921e 0000:03:00.0: Timeout for driver own
[ 16.456238] BUG: Bad page state in process systemd-udevd  pfn:1a0c00
[ 16.515869] mt7921e 0000:03:00.0: hardware init failed

These "Bad page state" messages continue until the kernel finally panics with a page
fault in a seemingly random place:

[  17.544222] BUG: Bad page state in process apparmor_parser  pfn:1116f8
[  OK  ] Finished Create Volatile Files and Directories
          Starting Network Name Resolution...
          Starting Network Time Synchronization...
          Starting Update UTMP about System Boot/Shutdown...
[  17.677144] BUG: unable to handle page fault for address: 0000396eb08090ec
[  17.680395] #PF: supervisor read access in kernel mode
[  17.681086] #PF: error_code(0x0000) - not-present page
[  17.681086] PGD 0 P4D 0
[  17.681006] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  17.681006] CPU: 8 PID: 63 Comm: ksoftirqd/8 Tainted: G  B  W        5.16.0 #3
[  17.681606] Hardware name: LENOVO 20XF004WUS/20XF004WUS, BIOS R1NET44W (1.14) 11/08/2021

The rest of the kernel stack trace is in the form of a picture, which I can send if it
helps. The kernel is compiled from git tag "v5.16". Details of the MediaTek controller:

$ lspci -v -s 03:00.0
03:00.0 Network controller: MEDIATEK Corp. Device 7961
	Subsystem: Lenovo Device e0bc
	Physical Slot: 0
	Flags: bus master, fast devsel, latency 0, IRQ 85, IOMMU group 11
	Memory at 870200000 (64-bit, prefetchable) [size=1M]
	Memory at 870300000 (64-bit, prefetchable) [size=16K]
	Memory at 870304000 (64-bit, prefetchable) [size=4K]
	Capabilities: [80] Express Endpoint, MSI 00
	Capabilities: [e0] MSI: Enable+ Count=1/32 Maskable+ 64bit+
	Capabilities: [f8] Power Management version 3
	Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
	Capabilities: [108] Latency Tolerance Reporting
	Capabilities: [110] L1 PM Substates
	Capabilities: [200] Advanced Error Reporting
	Kernel driver in use: mt7921e
	Kernel modules: mt7921e

This is an intermittent problem, and I did not see it with the 5.16-rc6 kernel.
Please let me know if you need more information.

Thanks,
Khalid


* Re: [Bug] mt7921e driver in 5.16 causes kernel panic
  2022-01-11 23:17 [Bug] mt7921e driver in 5.16 causes kernel panic Khalid Aziz
@ 2022-01-11 23:31 ` Ben Greear
  2022-01-11 23:58   ` Khalid Aziz
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Greear @ 2022-01-11 23:31 UTC (permalink / raw)
  To: Khalid Aziz, nbd, lorenzo.bianconi83, ryder.lee, shayne.chen,
	sean.wang, kvalo
  Cc: davem, kuba, matthias.bgg, linux-kernel, netdev

On 1/11/22 3:17 PM, Khalid Aziz wrote:
> I am seeing an intermittent bug in mt7921e driver. When the driver module is loaded
> and is being initialized, almost every other time it seems to write to some wild
> memory location. This results in driver failing to initialize with message
> "Timeout for driver own" and at the same time I start to see "Bad page state" messages
> for random processes. Here is the relevant part of dmesg:

Please see if this helps?

From: Ben Greear <greearb@candelatech.com>

If the NIC fails to start, it is possible that
reset_work has already been scheduled.  Ensure the
work item is canceled so we do not get a use-after-free
crash if cleanup is called before the work item
is executed.

This fixes a crash on my x86_64 apu2 when the mt7921k
radio fails to work.  The radio still fails, but the OS
does not crash.

Signed-off-by: Ben Greear <greearb@candelatech.com>
---
 drivers/net/wireless/mediatek/mt76/mt7921/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/main.c b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
index 6073bedaa1c08..9b33002dcba4a 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7921/main.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
@@ -272,6 +272,7 @@ static void mt7921_stop(struct ieee80211_hw *hw)

 	cancel_delayed_work_sync(&dev->pm.ps_work);
 	cancel_work_sync(&dev->pm.wake_work);
+	cancel_work_sync(&dev->reset_work);
 	mt76_connac_free_pending_tx_skbs(&dev->pm, NULL);

 	mt7921_mutex_acquire(dev);
-- 
2.26.3
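The commit message above describes a teardown-ordering hazard: a reset work item queued while the NIC was failing to start can still run after cleanup has freed the structure that contains it. As a rough illustration only, here is a minimal single-threaded userspace sketch of that idea. All names in it (`struct work`, `queue_work_model()`, `cancel_work_model()`, `run_queued_work()`, `struct dev`, `dev_stop()`) are hypothetical; they are neither the kernel workqueue API nor mt76 code. In particular, the real `cancel_work_sync()` also waits for a handler that is already running on another CPU, which this model does not attempt to show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Single-threaded model of a deferred work item. This is NOT the kernel
 * workqueue API: queue_work_model(), cancel_work_model() and
 * run_queued_work() are hypothetical stand-ins for queue_work(),
 * cancel_work_sync() and the worker thread.
 */
struct work {
	bool pending;
	void (*fn)(struct work *w);
};

#define MODEL_CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define MAX_QUEUED 8
static struct work *queue[MAX_QUEUED];
static size_t queued;

/* Drop the entry at index i, keeping FIFO order. */
static void remove_at(size_t i)
{
	for (; i + 1 < queued; i++)
		queue[i] = queue[i + 1];
	queued--;
}

static void queue_work_model(struct work *w)
{
	if (w->pending)		/* already queued: nothing to do */
		return;
	assert(queued < MAX_QUEUED);
	w->pending = true;
	queue[queued++] = w;
}

/* Remove w from the queue if it has not run yet. */
static bool cancel_work_model(struct work *w)
{
	for (size_t i = 0; i < queued; i++) {
		if (queue[i] == w) {
			remove_at(i);
			w->pending = false;
			return true;
		}
	}
	return false;
}

/* Run every queued item in FIFO order; returns how many ran. */
static size_t run_queued_work(void)
{
	size_t ran = 0;

	while (queued > 0) {
		struct work *w = queue[0];

		remove_at(0);
		w->pending = false;
		w->fn(w);	/* use-after-free if w's owner was already freed */
		ran++;
	}
	return ran;
}

/* Hypothetical device; reset_work plays the role of dev->reset_work. */
struct dev {
	struct work reset_work;
	int resets_run;
};

static void reset_fn(struct work *w)
{
	struct dev *d = MODEL_CONTAINER_OF(w, struct dev, reset_work);

	d->resets_run++;
}

static struct dev *dev_alloc(void)
{
	struct dev *d = calloc(1, sizeof(*d));

	if (d)
		d->reset_work.fn = reset_fn;
	return d;
}

/* Teardown: cancel pending work before freeing the memory it points into. */
static void dev_stop(struct dev *d)
{
	cancel_work_model(&d->reset_work);
	free(d);
}
```

If the `cancel_work_model()` call were removed from `dev_stop()`, a later `run_queued_work()` would call `reset_fn()` on freed memory, which is the kind of use-after-free the patch is meant to prevent.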


Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [Bug] mt7921e driver in 5.16 causes kernel panic
  2022-01-11 23:31 ` Ben Greear
@ 2022-01-11 23:58   ` Khalid Aziz
  2022-01-12  0:05     ` Ben Greear
  0 siblings, 1 reply; 6+ messages in thread
From: Khalid Aziz @ 2022-01-11 23:58 UTC (permalink / raw)
  To: Ben Greear, nbd, lorenzo.bianconi83, ryder.lee, shayne.chen,
	sean.wang, kvalo
  Cc: davem, kuba, matthias.bgg, linux-kernel, netdev

On 1/11/22 16:31, Ben Greear wrote:
> On 1/11/22 3:17 PM, Khalid Aziz wrote:
>> I am seeing an intermittent bug in mt7921e driver. When the driver
>> module is loaded and is being initialized, almost every other time it
>> seems to write to some wild memory location. This results in driver
>> failing to initialize with message "Timeout for driver own" and at
>> the same time I start to see "Bad page state" messages for random
>> processes. Here is the relevant part of dmesg:
> 
> Please see if this helps?
> 
> From: Ben Greear <greearb@candelatech.com>
> 
> If the nic fails to start, it is possible that the
> reset_work has already been scheduled.  Ensure the
> work item is canceled so we do not have use-after-free
> crash in case cleanup is called before the work item
> is executed.
> 
> This fixes crash on my x86_64 apu2 when mt7921k radio
> fails to work.  Radio still fails, but OS does not
> crash.
> 
> Signed-off-by: Ben Greear <greearb@candelatech.com>
> ---
>   drivers/net/wireless/mediatek/mt76/mt7921/main.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/main.c 
> b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
> index 6073bedaa1c08..9b33002dcba4a 100644
> --- a/drivers/net/wireless/mediatek/mt76/mt7921/main.c
> +++ b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
> @@ -272,6 +272,7 @@ static void mt7921_stop(struct ieee80211_hw *hw)
> 
>       cancel_delayed_work_sync(&dev->pm.ps_work);
>       cancel_work_sync(&dev->pm.wake_work);
> +    cancel_work_sync(&dev->reset_work);
>       mt76_connac_free_pending_tx_skbs(&dev->pm, NULL);
> 
>       mt7921_mutex_acquire(dev);

Hi Ben,

Unfortunately that did not help. I still saw the same messages and a
kernel panic. I do not see this bug if I power the laptop down completely
before booting it up, so mt7921_stop() does seem like a reasonable place
to fix it.

Thanks,
Khalid

* Re: [Bug] mt7921e driver in 5.16 causes kernel panic
  2022-01-11 23:58   ` Khalid Aziz
@ 2022-01-12  0:05     ` Ben Greear
  0 siblings, 0 replies; 6+ messages in thread
From: Ben Greear @ 2022-01-12  0:05 UTC (permalink / raw)
  To: khalid, nbd, lorenzo.bianconi83, ryder.lee, shayne.chen,
	sean.wang, kvalo
  Cc: davem, kuba, matthias.bgg, linux-kernel, netdev

On 1/11/22 3:58 PM, Khalid Aziz wrote:
> On 1/11/22 16:31, Ben Greear wrote:
>> On 1/11/22 3:17 PM, Khalid Aziz wrote:
>>> I am seeing an intermittent bug in mt7921e driver. When the driver module is loaded
>>> and is being initialized, almost every other time it seems to write to some wild
>>> memory location. This results in driver failing to initialize with message
>>> "Timeout for driver own" and at the same time I start to see "Bad page state" messages
>>> for random processes. Here is the relevant part of dmesg:
>>
>> Please see if this helps?
>>
>> From: Ben Greear <greearb@candelatech.com>
>>
>> If the nic fails to start, it is possible that the
>> reset_work has already been scheduled.  Ensure the
>> work item is canceled so we do not have use-after-free
>> crash in case cleanup is called before the work item
>> is executed.
>>
>> This fixes crash on my x86_64 apu2 when mt7921k radio
>> fails to work.  Radio still fails, but OS does not
>> crash.
>>
>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>> ---
>>   drivers/net/wireless/mediatek/mt76/mt7921/main.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/main.c b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>> index 6073bedaa1c08..9b33002dcba4a 100644
>> --- a/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>> +++ b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>> @@ -272,6 +272,7 @@ static void mt7921_stop(struct ieee80211_hw *hw)
>>
>>       cancel_delayed_work_sync(&dev->pm.ps_work);
>>       cancel_work_sync(&dev->pm.wake_work);
>> +    cancel_work_sync(&dev->reset_work);
>>       mt76_connac_free_pending_tx_skbs(&dev->pm, NULL);
>>
>>       mt7921_mutex_acquire(dev);
> 
> Hi Ben,
> 
> Unfortunately that did not help. I still saw the same messages and a kernel panic. I do not see this bug if I power down the laptop before booting it up, so 
> mt7921_stop() would make sense as the reasonable place to fix it.

I think there are bugs around soft power cycles in these radios.  (Today someone also reported
the same type of problem to me on a 7915 radio, though my own 7915 radios work fine in that regard.)
The patch above fixes a crash I saw on a system with a 7921k radio when the radio fails to boot
properly for one reason or another.  I guess there must be more bugs in the radio bring-up logic,
and you are hitting something different from what I hit.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [Bug] mt7921e driver in 5.16 causes kernel panic
  2022-01-12  0:49 ` sean.wang
@ 2022-01-12  2:16   ` Khalid Aziz
  0 siblings, 0 replies; 6+ messages in thread
From: Khalid Aziz @ 2022-01-12  2:16 UTC (permalink / raw)
  To: sean.wang
  Cc: greearb, nbd, lorenzo.bianconi83, Ryder.Lee, Shayne.Chen, kvalo,
	davem, kuba, matthias.bgg, linux-kernel, netdev, linux-wireless,
	linux-mediatek

On 1/11/22 17:49, sean.wang@mediatek.com wrote:
> From: Sean Wang <sean.wang@mediatek.com>
> 
>> On 1/11/22 16:31, Ben Greear wrote:
>>> On 1/11/22 3:17 PM, Khalid Aziz wrote:
>>>> I am seeing an intermittent bug in mt7921e driver. When the driver
>>>> module is loaded and is being initialized, almost every other time it
>>>> seems to write to some wild memory location. This results in driver
>>>> failing to initialize with message "Timeout for driver own" and at
>>>> the same time I start to see "Bad page state" messages for random
>>>> processes. Here is the relevant part of dmesg:
>>>
>>> Please see if this helps?
>>>
>>> From: Ben Greear <greearb@candelatech.com>
>>>
>>> If the nic fails to start, it is possible that the reset_work has
>>> already been scheduled.  Ensure the work item is canceled so we do not
>>> have use-after-free crash in case cleanup is called before the work
>>> item is executed.
>>>
>>> This fixes crash on my x86_64 apu2 when mt7921k radio fails to work.
>>> Radio still fails, but OS does not crash.
>>>
>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>> ---
>>>    drivers/net/wireless/mediatek/mt76/mt7921/main.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>>> b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>>> index 6073bedaa1c08..9b33002dcba4a 100644
>>> --- a/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>>> +++ b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>>> @@ -272,6 +272,7 @@ static void mt7921_stop(struct ieee80211_hw *hw)
>>>
>>>        cancel_delayed_work_sync(&dev->pm.ps_work);
>>>        cancel_work_sync(&dev->pm.wake_work);
>>> +    cancel_work_sync(&dev->reset_work);
>>>        mt76_connac_free_pending_tx_skbs(&dev->pm, NULL);
>>>
>>>        mt7921_mutex_acquire(dev);
>>
>> Hi Ben,
>>
>> Unfortunately that did not help. I still saw the same messages and a kernel panic. I do not see this bug if I power down the laptop before booting it up, so mt7921_stop() would make sense as the reasonable place to fix it.
> 
> Hi, Khalid
> 
> Could you try the patch below? It should be helpful to your issue
> 
> https://patchwork.kernel.org/project/linux-wireless/patch/70e27cbc652cbdb78277b9c691a3a5ba02653afb.1641540175.git.objelf@gmail.com/

Hi Sean,

That worked! After applying your patch, I did 5 back-to-back reboots
without powering down my laptop. There were no error messages, the kernel
came up every time, and Wi-Fi worked.

Thanks,
Khalid


* Re: [Bug] mt7921e driver in 5.16 causes kernel panic
       [not found] <c3a66426-e6a0-4fd2-dc09-85181c96b755@gonehiking.org--annotate>
@ 2022-01-12  0:49 ` sean.wang
  2022-01-12  2:16   ` Khalid Aziz
  0 siblings, 1 reply; 6+ messages in thread
From: sean.wang @ 2022-01-12  0:49 UTC (permalink / raw)
  To: khalid
  Cc: greearb, nbd, lorenzo.bianconi83, Ryder.Lee, Shayne.Chen,
	Sean.Wang, kvalo, davem, kuba, matthias.bgg, linux-kernel,
	netdev, linux-wireless, linux-mediatek, Sean Wang

From: Sean Wang <sean.wang@mediatek.com>

>On 1/11/22 16:31, Ben Greear wrote:
>> On 1/11/22 3:17 PM, Khalid Aziz wrote:
>>> I am seeing an intermittent bug in mt7921e driver. When the driver
>>> module is loaded and is being initialized, almost every other time it
>>> seems to write to some wild memory location. This results in driver
>>> failing to initialize with message "Timeout for driver own" and at
>>> the same time I start to see "Bad page state" messages for random
>>> processes. Here is the relevant part of dmesg:
>>
>> Please see if this helps?
>>
>> From: Ben Greear <greearb@candelatech.com>
>>
>> If the nic fails to start, it is possible that the reset_work has
>> already been scheduled.  Ensure the work item is canceled so we do not
>> have use-after-free crash in case cleanup is called before the work
>> item is executed.
>>
>> This fixes crash on my x86_64 apu2 when mt7921k radio fails to work.
>> Radio still fails, but OS does not crash.
>>
>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>> ---
>>   drivers/net/wireless/mediatek/mt76/mt7921/main.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>> b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>> index 6073bedaa1c08..9b33002dcba4a 100644
>> --- a/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>> +++ b/drivers/net/wireless/mediatek/mt76/mt7921/main.c
>> @@ -272,6 +272,7 @@ static void mt7921_stop(struct ieee80211_hw *hw)
>>
>>       cancel_delayed_work_sync(&dev->pm.ps_work);
>>       cancel_work_sync(&dev->pm.wake_work);
>> +    cancel_work_sync(&dev->reset_work);
>>       mt76_connac_free_pending_tx_skbs(&dev->pm, NULL);
>>
>>       mt7921_mutex_acquire(dev);
>
>Hi Ben,
>
>Unfortunately that did not help. I still saw the same messages and a kernel panic. I do not see this bug if I power down the laptop before booting it up, so mt7921_stop() would make sense as the reasonable place to fix it.

Hi Khalid,

Could you try the patch below? It should help with your issue.

https://patchwork.kernel.org/project/linux-wireless/patch/70e27cbc652cbdb78277b9c691a3a5ba02653afb.1641540175.git.objelf@gmail.com/

>
>Thanks,
>Khalid
>
>

Thread overview: 6+ messages
2022-01-11 23:17 [Bug] mt7921e driver in 5.16 causes kernel panic Khalid Aziz
2022-01-11 23:31 ` Ben Greear
2022-01-11 23:58   ` Khalid Aziz
2022-01-12  0:05     ` Ben Greear
     [not found] <c3a66426-e6a0-4fd2-dc09-85181c96b755@gonehiking.org--annotate>
2022-01-12  0:49 ` sean.wang
2022-01-12  2:16   ` Khalid Aziz
