All of lore.kernel.org
 help / color / mirror / Atom feed
* [ath9k-devel] kernel panic with ath9k
@ 2012-02-04  8:26 Olivier Keshavjee
  2012-02-04 15:48 ` Pavel Roskin
  0 siblings, 1 reply; 12+ messages in thread
From: Olivier Keshavjee @ 2012-02-04  8:26 UTC (permalink / raw)
  To: ath9k-devel

Hi,

Had a kernel panic yesterday, few days after installing my new computer.
Took a picture, but forgot to copy the log on next reboot (if this is
possible).
Wasn't doing anything special at the moment of the panic.

Sorry I'm not sure this is the right place to post that,
and not sure of the infos to give  you, so here are some, plus the picture
in attachement:

$uname -a
> Linux iceworld 3.2.2-1-ARCH #1 SMP PREEMPT Thu Jan 26 08:40:20 CET 2012
> x86_64 Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz GenuineIntel GNU/Linux
>
> $ lspci  | grep Network
> 09:01.0 Network controller: Atheros Communications Inc. AR9287 Wireless
> Network Adapter (rev 01)
>
> $ lsmod | grep ath9k
> ath9k                  87343  0
> ath9k_common            2096  1 ath9k
> ath9k_hw              347282  2 ath9k_common,ath9k
> ath                    14802  3 ath9k_hw,ath9k_common,ath9k
> mac80211              228367  1 ath9k
> cfg80211              172260  3 mac80211,ath,ath9k
>


Thanks,

Olivier
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.ath9k.org/pipermail/ath9k-devel/attachments/20120204/06d3cd89/attachment-0001.htm 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2012-02-03 Kernel panic.jpg
Type: image/jpeg
Size: 426983 bytes
Desc: not available
Url : http://lists.ath9k.org/pipermail/ath9k-devel/attachments/20120204/06d3cd89/attachment-0001.jpg 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-04  8:26 [ath9k-devel] kernel panic with ath9k Olivier Keshavjee
@ 2012-02-04 15:48 ` Pavel Roskin
  2012-02-07 22:00   ` Pavel Roskin
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Roskin @ 2012-02-04 15:48 UTC (permalink / raw)
  To: ath9k-devel

Hello, Olivier!

Quoting Olivier Keshavjee <olivier.keshavjee@gmail.com>:

> Had a kernel panic yesterday, few days after installing my new computer.
> Took a picture, but forgot to copy the log on next reboot (if this is
> possible).
> Wasn't doing anything special at the moment of the panic.

As I expected, the picture shows a panic in ath9k rate control  
algorithm.  There is some serious problem there.  Many people reported  
a kernel warning in Fedora:

https://bugzilla.redhat.com/show_bug.cgi?id=768639

The problem is hard to reproduce on purpose.  The debug information  
posted there indicates some memory corruption.  One of the users  
observed a rate value of 179 returned by ath_rc_get_nextvalid_txrate().

I tried debugging of ath_rc_get_highest_rix() an observed such values  
of rate_table->info[rate].user_ratekbps (for reasonable values of rate):

0xb000072
0x8000043
0x2000013
0x8000043

I see a pattern here.

I also had a panic in ath_rc_rate_set_rtscts() on the line that says
cix = rate_table->info[rix].ctrl_rate;

-- 
Regards,
Pavel Roskin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-04 15:48 ` Pavel Roskin
@ 2012-02-07 22:00   ` Pavel Roskin
  2012-02-08 19:10     ` Adrian Chadd
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Roskin @ 2012-02-07 22:00 UTC (permalink / raw)
  To: ath9k-devel

On Sat, 04 Feb 2012 10:48:48 -0500
Pavel Roskin <proski@gnu.org> wrote:

> Hello, Olivier!
> 
> Quoting Olivier Keshavjee <olivier.keshavjee@gmail.com>:
> 
> > Had a kernel panic yesterday, few days after installing my new
> > computer. Took a picture, but forgot to copy the log on next reboot
> > (if this is possible).
> > Wasn't doing anything special at the moment of the panic.
> 
> As I expected, the picture shows a panic in ath9k rate control  
> algorithm.  There is some serious problem there.  Many people
> reported a kernel warning in Fedora:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=768639

I've posted a fix there, please test.

-- 
Regards,
Pavel Roskin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-07 22:00   ` Pavel Roskin
@ 2012-02-08 19:10     ` Adrian Chadd
  2012-02-08 19:27       ` Felix Fietkau
  0 siblings, 1 reply; 12+ messages in thread
From: Adrian Chadd @ 2012-02-08 19:10 UTC (permalink / raw)
  To: ath9k-devel

Has anyone figured out why -1 is appearing?


Adrian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-08 19:10     ` Adrian Chadd
@ 2012-02-08 19:27       ` Felix Fietkau
  2012-02-08 22:54         ` Pavel Roskin
  0 siblings, 1 reply; 12+ messages in thread
From: Felix Fietkau @ 2012-02-08 19:27 UTC (permalink / raw)
  To: ath9k-devel

On 2012-02-08 8:10 PM, Adrian Chadd wrote:
> Has anyone figured out why -1 is appearing?
Maybe it's fixed by the mac80211 patch I submitted today. mac80211 was
calling .tx_status on the rate control module before .rate_init.
Just a wild guess...

- Felix

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-08 19:27       ` Felix Fietkau
@ 2012-02-08 22:54         ` Pavel Roskin
  2012-02-10  6:05           ` Mohammed Shafi
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Roskin @ 2012-02-08 22:54 UTC (permalink / raw)
  To: ath9k-devel

On Wed, 08 Feb 2012 20:27:52 +0100
Felix Fietkau <nbd@openwrt.org> wrote:

> On 2012-02-08 8:10 PM, Adrian Chadd wrote:
> > Has anyone figured out why -1 is appearing?
> Maybe it's fixed by the mac80211 patch I submitted today. mac80211 was
> calling .tx_status on the rate control module before .rate_init.
> Just a wild guess...

I can reproduce the problem reliably with just one router (some stock
Linksys router with standard firmware) and one laptop (Asus Eee PC
1005PE with AR9285).  802.11n is not used.  I just leave the laptop
turned on overnight, and it catches the warning at some point.

The bug might be "stimulated" by closing the lid and opening it again
while transmitting data (e.g. watching a video).  But it doesn't happen
most of the time.  And if it does happen, it takes about a couple of
minutes after the connection is reestablished.

This night, I caught two events less than a millisecond apart:

[18311.688446] wlan0: moving STA 00:21:29:a0:c7:b2 to state 1
[18311.688453] wlan0: moving STA 00:21:29:a0:c7:b2 to state 2
[18311.711759] wlan0: moving STA 00:21:29:a0:c7:b2 to state 3
[19634.009730] ath_rc_get_rateindex: rate->idx = -1
[19634.009762] ath_rc_get_rateindex: rate->idx = -1
[23124.212050] wlan0: moving STA 00:21:29:a0:c7:b2 to state 2
[23124.212061] wlan0: moving STA 00:21:29:a0:c7:b2 to state 1
[23124.212068] wlan0: moving STA 00:21:29:a0:c7:b2 to state 0

That's all with compat-wireless-3.3-rc1-2, which is used by Fedora.

Indeed, the memory corruption happens in ath_tx_status(),
the .tx_status function for ath9k rate control, so it's plausible.

I tried applying your patch to compat-wireless-3.3-rc1-2, but it
depends on the patch that introduced WLAN_STA_RATE_CONTROL.  And that
patch doesn't apply cleanly because WLAN_STA_INSERTED is not in
compat-wireless-3.3-rc1-2.

The current bleeding edge compat-wireless is for 2012/02/06 and it
needs both the patch that introduces WLAN_STA_RATE_CONTROL and the one
that uses it.  At least they can be applied and everything compiles.

I'm going to leave the patched bleeding edge compat-wireless overnight
(with the sanity check I added to the code).  I'll report the results.

I believe the patch needs to be backported to Linux 3.3 and older
versions.  The Fedora bug 768639 has 74 comments, mostly messages about
duplicate bugs.

-- 
Regards,
Pavel Roskin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-08 22:54         ` Pavel Roskin
@ 2012-02-10  6:05           ` Mohammed Shafi
  2012-02-11  4:06             ` Pavel Roskin
  0 siblings, 1 reply; 12+ messages in thread
From: Mohammed Shafi @ 2012-02-10  6:05 UTC (permalink / raw)
  To: ath9k-devel

Hi Pavel,

> I can reproduce the problem reliably with just one router (some stock
> Linksys router with standard firmware) and one laptop (Asus Eee PC
> 1005PE with AR9285). ?802.11n is not used. ?I just leave the laptop
> turned on overnight, and it catches the warning at some point.

i could try this,  please let me know if the station is only
associated or is there some traffic going on.

>
> The bug might be "stimulated" by closing the lid and opening it again
> while transmitting data (e.g. watching a video). ?But it doesn't happen
> most of the time. ?And if it does happen, it takes about a couple of
> minutes after the connection is reestablished.

i could try this one, please let me know if you had found a way to
recreate it easily, then we can add more printks to figure out whats
happening. i also tried the legacy , reduced tx power, noisy
environement

>
> This night, I caught two events less than a millisecond apart:
>
> [18311.688446] wlan0: moving STA 00:21:29:a0:c7:b2 to state 1
> [18311.688453] wlan0: moving STA 00:21:29:a0:c7:b2 to state 2
> [18311.711759] wlan0: moving STA 00:21:29:a0:c7:b2 to state 3
> [19634.009730] ath_rc_get_rateindex: rate->idx = -1
> [19634.009762] ath_rc_get_rateindex: rate->idx = -1
> [23124.212050] wlan0: moving STA 00:21:29:a0:c7:b2 to state 2
> [23124.212061] wlan0: moving STA 00:21:29:a0:c7:b2 to state 1
> [23124.212068] wlan0: moving STA 00:21:29:a0:c7:b2 to state 0
>
> That's all with compat-wireless-3.3-rc1-2, which is used by Fedora.

need to figure it out -1 is happening, i don't know yet. i had put few printks
but no -1.  also i purposefully put -1 in final_rate (many times it
seemed to survive and once the suspend/resume failed, i think i have
not tested sufficiently)

>
> Indeed, the memory corruption happens in ath_tx_status(),
> the .tx_status function for ath9k rate control, so it's plausible.
>

> I believe the patch needs to be backported to Linux 3.3 and older
> versions. ?The Fedora bug 768639 has 74 comments, mostly messages about
> duplicate bugs.

thank you so much for your efforts!  initially suspected the
rate_table pointer corruption but could not figure out the rix is
causing it.

>
> --
> Regards,
> Pavel Roskin
> _______________________________________________
> ath9k-devel mailing list
> ath9k-devel at lists.ath9k.org
> https://lists.ath9k.org/mailman/listinfo/ath9k-devel



-- 
shafi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-10  6:05           ` Mohammed Shafi
@ 2012-02-11  4:06             ` Pavel Roskin
  2012-02-13  7:36               ` Mohammed Shafi
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Roskin @ 2012-02-11  4:06 UTC (permalink / raw)
  To: ath9k-devel

Quoting Mohammed Shafi <shafi.wireless@gmail.com>:

> Hi Pavel,
>
>> I can reproduce the problem reliably with just one router (some stock
>> Linksys router with standard firmware)

WRT54G2 V1

>> and one laptop (Asus Eee PC
>> 1005PE with AR9285). ?802.11n is not used. ?I just leave the laptop
>> turned on overnight, and it catches the warning at some point.
>
> i could try this,  please let me know if the station is only
> associated or is there some traffic going on.

The station is associated.  No traffic is needed.  I just leave the  
laptop connected overnight.  It's configured to use a screensaver but  
not to suspend.  Last time, it took about 4 hours before the first "-1  
event" followed by another one less than a millisecond later.

The AP uses WPA1-PSK with TKIP and CCMP.

>> The bug might be "stimulated" by closing the lid and opening it again
>> while transmitting data (e.g. watching a video). ?But it doesn't happen
>> most of the time. ?And if it does happen, it takes about a couple of
>> minutes after the connection is reestablished.
>
> i could try this one, please let me know if you had found a way to
> recreate it easily, then we can add more printks to figure out whats
> happening. i also tried the legacy , reduced tx power, noisy
> environement

Last time, I was particularly unlucky with the "quick tricks", so I'm  
leaving it overnight again.  This time, the goal is to find out what's  
happening in ath_tx_status(), especially the value of final_ts_idx.

If you want me to run any specific code, please let me know.

> need to figure it out -1 is happening, i don't know yet. i had put   
> few printks
> but no -1.  also i purposefully put -1 in final_rate (many times it
> seemed to survive and once the suspend/resume failed, i think i have
> not tested sufficiently)

That's not very interesting, ath_debug_stat_rc just collects data for  
debugfs.  If given a bad index, it would corrupt memory.

> thank you so much for your efforts!  initially suspected the
> rate_table pointer corruption but could not figure out the rix is
> causing it.

It just occurred to me that ath_tx_status() doesn't follow the  
precautions used by ath_rc_tx_status().  The later checks idx and  
skips rates with negative idx.  The former doesn't check it in the  
loop, so final_ts_idx could point to a rate with negative idx.

Maybe it's not such a big deal that there are rates with negative idx  
as long as the rate control code checks for it.

But it would be great to know when it happens, so that the code could  
be tested for that condition by developers rather than by unlucky users.

-- 
Regards,
Pavel Roskin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-11  4:06             ` Pavel Roskin
@ 2012-02-13  7:36               ` Mohammed Shafi
  2012-02-15 14:23                 ` Pavel Roskin
  0 siblings, 1 reply; 12+ messages in thread
From: Mohammed Shafi @ 2012-02-13  7:36 UTC (permalink / raw)
  To: ath9k-devel

>
> It just occurred to me that ath_tx_status() doesn't follow the precautions
> used by ath_rc_tx_status(). ?The later checks idx and skips rates with
> negative idx. ?The former doesn't check it in the loop, so final_ts_idx
> could point to a rate with negative idx.

i had always assumed that rate idx is invalid -1 then rate count is 0
and the other way around. yes if rate idx -1 and rate count is still
not 0 final_idx can be pointing to  -1 rate idx. thanks for fixing it.

>
> Maybe it's not such a big deal that there are rates with negative idx as
> long as the rate control code checks for it.
>
> But it would be great to know when it happens, so that the code could be
> tested for that condition by developers rather than by unlucky users.
>
> --
> Regards,
> Pavel Roskin



-- 
shafi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-13  7:36               ` Mohammed Shafi
@ 2012-02-15 14:23                 ` Pavel Roskin
  2012-02-15 15:48                   ` Mohammed Shafi
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Roskin @ 2012-02-15 14:23 UTC (permalink / raw)
  To: ath9k-devel

Quoting Mohammed Shafi <shafi.wireless@gmail.com>:

>>
>> It just occurred to me that ath_tx_status() doesn't follow the precautions
>> used by ath_rc_tx_status(). ?The later checks idx and skips rates with
>> negative idx. ?The former doesn't check it in the loop, so final_ts_idx
>> could point to a rate with negative idx.
>
> i had always assumed that rate idx is invalid -1 then rate count is 0
> and the other way around. yes if rate idx -1 and rate count is still
> not 0 final_idx can be pointing to  -1 rate idx. thanks for fixing it.

I was able to catch the rates when it happens:

rate 0, idx 0, count 10, flags 0x0
rate 1, idx -1, count 1, flags 0x0
rate 2, idx -1, count 1, flags 0x0
rate 3, idx -1, count 1, flags 0x0

I also captured fc and tx_info->flags:

fc = 0x1148
IEEE80211_FTYPE_DATA
IEEE80211_STYPE_NULLFUNC
IEEE80211_FCTL_TODS
IEEE80211_FCTL_PM

tx_info->flags = 0x6001001a
IEEE80211_TX_CTL_ASSIGN_SEQ
IEEE80211_TX_CTL_CLEAR_PS_FILT
IEEE80211_TX_CTL_FIRST_FRAGMENT
IEEE80211_TX_INTFL_DONT_ENCRYPT
IEEE80211_TX_CTL_USE_MINRATE
IEEE80211_TX_CTL_DONTFRAG

10 is likely max_rate_tries, set in ath9k/init.c and used in  
rate_control_send_low().

The most important thing is that we are not processing random numbers.  
  I don't know how we end up with idx -1 and count 1, but see e.g.  
ieee80211_tx_h_rate_ctrl():

                 /*
                  * make sure there's no valid rate following
                  * an invalid one, just in case drivers don't
                  * take the API seriously to stop at -1.
                  */
                 if (inval) {
                         info->control.rates[i].idx = -1;
                         continue;
                 }

The code doesn't attempt to set count to 0.

At this point, I'm confident that my fix addresses the problem and  
there is no underlying problem that needs to be fixed.

-- 
Regards,
Pavel Roskin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-15 14:23                 ` Pavel Roskin
@ 2012-02-15 15:48                   ` Mohammed Shafi
  2012-02-15 18:06                     ` Pavel Roskin
  0 siblings, 1 reply; 12+ messages in thread
From: Mohammed Shafi @ 2012-02-15 15:48 UTC (permalink / raw)
  To: ath9k-devel

Hi Pavel,

once thanks a lot for analyzing it more and the amazing information
you had provided.

>
> I was able to catch the rates when it happens:
>
> rate 0, idx 0, count 10, flags 0x0
> rate 1, idx -1, count 1, flags 0x0
> rate 2, idx -1, count 1, flags 0x0
> rate 3, idx -1, count 1, flags 0x0
>
> I also captured fc and tx_info->flags:
>
> fc = 0x1148
> IEEE80211_FTYPE_DATA
> IEEE80211_STYPE_NULLFUNC
> IEEE80211_FCTL_TODS
> IEEE80211_FCTL_PM
>
> tx_info->flags = 0x6001001a
> IEEE80211_TX_CTL_ASSIGN_SEQ
> IEEE80211_TX_CTL_CLEAR_PS_FILT
> IEEE80211_TX_CTL_FIRST_FRAGMENT
> IEEE80211_TX_INTFL_DONT_ENCRYPT
> IEEE80211_TX_CTL_USE_MINRATE
> IEEE80211_TX_CTL_DONTFRAG
>
> 10 is likely max_rate_tries, set in ath9k/init.c and used in
> rate_control_send_low().
>
> The most important thing is that we are not processing random numbers. ?I
> don't know how we end up with idx -1 and count 1, but see e.g.
> ieee80211_tx_h_rate_ctrl():
>
> ? ? ? ? ? ? ? ?/*
> ? ? ? ? ? ? ? ? * make sure there's no valid rate following
> ? ? ? ? ? ? ? ? * an invalid one, just in case drivers don't
> ? ? ? ? ? ? ? ? * take the API seriously to stop at -1.
> ? ? ? ? ? ? ? ? */
> ? ? ? ? ? ? ? ?if (inval) {
> ? ? ? ? ? ? ? ? ? ? ? ?info->control.rates[i].idx = -1;
> ? ? ? ? ? ? ? ? ? ? ? ?continue;
> ? ? ? ? ? ? ? ?}
>
> The code doesn't attempt to set count to 0.
>
> At this point, I'm confident that my fix addresses the problem and there is
> no underlying problem that needs to be fixed.


in mac80211 for connection monitoring we are send NULLFUNC frame in lower rates
(please check for IEEE80211_TX_CTL_USE_MINRATE) and
IEEE80211_STA_CONNECTION_POLL , IEEE80211_STA_BEACON_POLL


if we refer rate_control_get_rate in rate.c of mac80211

for (i = 0; i < IEEE80211_TX_MAX_RATES; i++) {
		info->control.rates[i].idx = -1;
		info->control.rates[i].flags = 0;
		info->control.rates[i].count = 1;
	}

why count = 1 not 0 ? if count has to be 0 then we should fix in mac80211 also.
though the doc in mac80211 says idx -1 means invalid rate, did not say
if count =0 invalid
but anyway the rate is tried once ?

*now  the drivers assumption of count=0 for invalid is not correct i think
ministrel checks for both index and count in 'minstrel_ht_txstat_valid'
while ath9k does not before your patch

*then we call ref->ops->get_rate

*ath_get_rate is called which will just calls rate_control_send_low

* ath_tx_status will be called as we expect ACk status for NULL func.

then panic may occur. this is my theory :)

any thoughts about changing it in mac80211

-- 
shafi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ath9k-devel] kernel panic with ath9k
  2012-02-15 15:48                   ` Mohammed Shafi
@ 2012-02-15 18:06                     ` Pavel Roskin
  0 siblings, 0 replies; 12+ messages in thread
From: Pavel Roskin @ 2012-02-15 18:06 UTC (permalink / raw)
  To: ath9k-devel

On Wed, 15 Feb 2012 21:18:36 +0530
Mohammed Shafi <shafi.wireless@gmail.com> wrote:

> if we refer rate_control_get_rate in rate.c of mac80211
> 
> for (i = 0; i < IEEE80211_TX_MAX_RATES; i++) {
> 		info->control.rates[i].idx = -1;
> 		info->control.rates[i].flags = 0;
> 		info->control.rates[i].count = 1;
> 	}

Thanks for finding this code!  That's another sign that my patch is not
hiding an underlying problem, such as memory corruption.

> why count = 1 not 0 ? if count has to be 0 then we should fix in
> mac80211 also. though the doc in mac80211 says idx -1 means invalid
> rate, did not say if count =0 invalid
> but anyway the rate is tried once ?
> 
> *now  the drivers assumption of count=0 for invalid is not correct i
> think ministrel checks for both index and count in
> 'minstrel_ht_txstat_valid' while ath9k does not before your patch

We should remove the checks for count=0 eventually.  But it's not a
change for stable kernels.

-- 
Regards,
Pavel Roskin

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-02-15 18:06 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-04  8:26 [ath9k-devel] kernel panic with ath9k Olivier Keshavjee
2012-02-04 15:48 ` Pavel Roskin
2012-02-07 22:00   ` Pavel Roskin
2012-02-08 19:10     ` Adrian Chadd
2012-02-08 19:27       ` Felix Fietkau
2012-02-08 22:54         ` Pavel Roskin
2012-02-10  6:05           ` Mohammed Shafi
2012-02-11  4:06             ` Pavel Roskin
2012-02-13  7:36               ` Mohammed Shafi
2012-02-15 14:23                 ` Pavel Roskin
2012-02-15 15:48                   ` Mohammed Shafi
2012-02-15 18:06                     ` Pavel Roskin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.