* [ath9k-devel] Possible over driving AR9106, how to detect?
@ 2011-09-07 14:59 Daniel Smith
  2011-09-07 15:47 ` Adrian Chadd
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Smith @ 2011-09-07 14:59 UTC
  To: ath9k-devel

Greetings,

I am running into a situation where I believe the RF front-end (AR9106)
is being over driven. The configuration is a high-gain antenna with an
inline LNA attached to a SparkLAN WMIA-199NI. The interface is put into
monitor mode with the fcsfail flag set. After the interface is brought
up, we stop receiving frames after random periods of time. In the past
this typically coincided with the DMA RX stop storm. After all the fixes
were put in place we continued to see the issue. When the lock-up occurs
there is no sign of activity on the card (e.g. the frame counts,
interrupt counts, etc. in debugfs do not increase). The lock-up can be
cleared by simply changing the channel or by cycling the interface down
and back up. One factor that seems to trigger it is running in the lab,
where the AP and station are in close proximity, resulting in a signal
strength of 70 dB.
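
The symptom described above (frame and interrupt counters in debugfs
that stop increasing) can be turned into a simple stall check. What
follows is a minimal sketch, assuming some user-space monitor samples
the counters periodically; the structures and function names are
hypothetical and not part of ath9k.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of two counters read from debugfs. */
struct rx_sample {
    uint64_t rx_frames;   /* total frames received so far */
    uint64_t interrupts;  /* total interrupts serviced so far */
};

/* Tracks how many consecutive samples showed no progress. */
struct stall_detector {
    struct rx_sample last;
    unsigned idle_samples;
    unsigned threshold;   /* idle samples before a stall is declared */
};

static void stall_init(struct stall_detector *d, unsigned threshold)
{
    assert(threshold > 0);
    d->last.rx_frames = 0;
    d->last.interrupts = 0;
    d->idle_samples = 0;
    d->threshold = threshold;
}

/* Feed one sample; returns true once neither counter has moved for
 * `threshold` consecutive samples. */
static bool stall_update(struct stall_detector *d, struct rx_sample now)
{
    bool progressed = now.rx_frames != d->last.rx_frames ||
                      now.interrupts != d->last.interrupts;

    d->idle_samples = progressed ? 0 : d->idle_samples + 1;
    d->last = now;
    return d->idle_samples >= d->threshold;
}
```

A check like this only says that the card went quiet; it cannot tell an
over driven front-end apart from an RX descriptor problem.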

I was wondering if there might be an interrupt that I could mask in or 
some other means to detect if the radio is in fact being over driven.

Thanks in advance!
V/r,
Daniel Smith

* [ath9k-devel] Possible over driving AR9106, how to detect?
  2011-09-07 14:59 [ath9k-devel] Possible over driving AR9106, how to detect? Daniel Smith
@ 2011-09-07 15:47 ` Adrian Chadd
  2011-09-07 18:04   ` Daniel Smith
  0 siblings, 1 reply; 7+ messages in thread
From: Adrian Chadd @ 2011-09-07 15:47 UTC
  To: ath9k-devel

On 7 September 2011 22:59, Daniel Smith <viscous.liquid@gmail.com> wrote:
> Greetings,
>
> I am running into a situation where I believe the RF front-end (AR9106)
> is being over driven. The configuration is a high-gain antenna with an
> inline LNA attached to a SparkLAN WMIA-199NI. The interface is put into
> monitor mode with the fcsfail flag set. After the interface is brought
> up, we stop receiving frames after random periods of time. In the past
> this typically coincided with the DMA RX stop storm. After all the fixes
> were put in place we continued to see the issue. When the lock-up occurs
> there is no sign of activity on the card (e.g. the frame counts,
> interrupt counts, etc. in debugfs do not increase). The lock-up can be
> cleared by simply changing the channel or by cycling the interface down
> and back up. One factor that seems to trigger it is running in the lab,
> where the AP and station are in close proximity, resulting in a signal
> strength of 70 dB.
>
> I was wondering if there might be an interrupt that I could mask in or
> some other means to detect if the radio is in fact being over driven.

If the DMA RX stop storm occurred then it meant the NIC thought it hit
the end of the RX descriptor list (whether it did or not) and it just
kept signalling it couldn't write packets anywhere.

I remember seeing this in FreeBSD, so I added some code to the RX
tasklet to forcibly reset the PCU receive and re-link all the RX
descriptors. It causes packet loss when it occurs (and it only occurs
when I'm thrashing the NIC with too much UDP traffic) but I bet it
could also occur if I enabled PHY errors (eg when doing radar
detection) in a very busy+noisy environment.

ath9k handles the RX descriptors a bit differently but when I tried
the same method in FreeBSD, it still ended up occasionally hitting
RXEOL, firing off RXORN interrupts and then getting very pissed off at
me. I'll do some further digging soon and I'll post an update to the
list when I figure it out.

If you're up for a bit of coding, here's what I did:

* when RXEOL interrupt is received, set sc->sc_kickpcu=1; disable RXEOL/RXORN;
* in ath_rx_tasklet() (in ath9k, it's not called that in FreeBSD) run
the normal descriptor list check, then once that's done, if
sc->sc_kickpcu == 1:
    * set it to 0
    * call pcu stop;
    * re-initialise all of the descriptors
    * call pcu start;
    * re-enable interrupts, with RXEOL|RXORN re-enabled.

This reliably fixes all the crazy stuff I saw when I didn't do the
above but it does give (to me, unacceptable) packet loss under very
high UDP RX load.
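
The steps above can be sketched as follows. This is a stand-alone
model, not driver code: the `mock_softc` structure, the interrupt bit
values and every function name are hypothetical stand-ins for the real
interrupt handler, RX tasklet and PCU calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interrupt bits; the real values live in the HAL headers. */
#define INT_RX    0x1u
#define INT_RXEOL 0x2u
#define INT_RXORN 0x4u

struct mock_softc {
    uint32_t imask;        /* currently enabled interrupts */
    bool     kickpcu;      /* set by the ISR, consumed by the tasklet */
    bool     pcu_running;  /* models PCU receive being enabled */
    unsigned relinks;      /* how many times the RX list was rebuilt */
};

/* Interrupt handler: on RXEOL, flag the tasklet and mask RXEOL/RXORN. */
static void mock_intr(struct mock_softc *sc, uint32_t status)
{
    if (status & INT_RXEOL) {
        sc->kickpcu = true;
        sc->imask &= ~(INT_RXEOL | INT_RXORN);
    }
}

/* RX tasklet: after the normal descriptor walk, restart receive if asked. */
static void mock_rx_tasklet(struct mock_softc *sc)
{
    /* ... normal descriptor list processing would happen here ... */
    if (!sc->kickpcu)
        return;

    sc->kickpcu = false;
    sc->pcu_running = false;              /* PCU stop */
    sc->relinks++;                        /* re-initialise descriptors */
    sc->pcu_running = true;               /* PCU start */
    sc->imask |= INT_RXEOL | INT_RXORN;   /* re-enable both interrupts */
    assert(sc->pcu_running && !sc->kickpcu);
}
```

The model leaves out the part that matters under load: frames that
arrive between the PCU stop and the PCU start are lost, which is the
packet loss mentioned above.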


Adrian

* [ath9k-devel] Possible over driving AR9106, how to detect?
  2011-09-07 15:47 ` Adrian Chadd
@ 2011-09-07 18:04   ` Daniel Smith
  2011-09-08  0:37     ` Adrian Chadd
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Smith @ 2011-09-07 18:04 UTC
  To: ath9k-devel

Thanks for the quick response Adrian.

On Wed, Sep 7, 2011 at 11:47 AM, Adrian Chadd <adrian@freebsd.org> wrote:
> If the DMA RX stop storm occurred then it meant the NIC thought it hit
> the end of the RX descriptor list (whether it did or not) and it just
> kept signalling it couldn't write packets anywhere.

I am not certain it is in fact the DMA RX stop storm. The lock-up
often coincided with the storm back when the storm used to be much more
pervasive. Now we still see it even when there is no storm occurring,
e.g. the interrupt debug counters are not increasing and there are none
of the "unable to stop RX DMA" kernel log errors.

> I remember seeing this in FreeBSD, so I added some code to the RX
> tasklet to forcibly reset the PCU receive and re-link all the RX
> descriptors. It causes packet loss when it occurs (and it only occurs
> when I'm thrashing the NIC with too much UDP traffic) but I bet it
> could also occur if I enabled PHY errors (eg when doing radar
> detection) in a very busy+noisy environment.

I have seen you mention this in other postings. As stated above, I am
pretty certain it is occurring even when there is no DMA storm, but
what is intriguing is that you seem to be seeing the same trigger:
high volumes of traffic coming into the interface.

> ath9k handles the RX descriptors a bit differently but when I tried
> the same method in FreeBSD, it still ended up occasionally hitting
> RXEOL, firing off RXORN interrupts and then getting very pissed off at
> me. I'll do some further digging soon and I'll post an update to the
> list when I figure it out.

I will go back and see if in fact I can see an RXEOL being fired when
the lock-up occurs for us.

> If you're up for a bit of coding, here's what I did:

Always up for a challenge (^_^)

> * when RXEOL interrupt is received, set sc->sc_kickpcu=1; disable RXEOL/RXORN;
> * in ath_rx_tasklet() (in ath9k, it's not called that in FreeBSD) run
> the normal descriptor list check, then once that's done, if
> sc->sc_kickpcu == 1:
>     * set it to 0
>     * call pcu stop;
>     * re-initialise all of the descriptors
>     * call pcu start;
>     * re-enable interrupts, with RXEOL|RXORN re-enabled.

This may be as simple (with some additional success checks) as a:

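if (sc->sc_kickpcu == 1) {
    ath_stoprecv(sc);               /* stop PCU receive and RX DMA */
    ath_rx_cleanup(sc);             /* release the current RX buffers */
    ath_rx_init(sc, ATH_RXBUF);     /* rebuild the RX descriptor list */
    ath_startrecv(sc);              /* restart receive */

    sc->sc_kickpcu = 0;             /* restart handled, clear the flag */
}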

right after unlocking the spinlock. It has to come after the unlock
because three of those calls try to take the rxbuflock themselves.
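
The ordering constraint can be shown with a toy model. This is only a
sketch: the non-recursive `mock_lock` and all function names are
hypothetical, and the single `mock_restart_rx()` stands in for the
stop/cleanup/init/start sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: taking it twice trips the assert, which is
 * where a real spinlock would deadlock. */
struct mock_lock { bool held; };

static void lock_take(struct mock_lock *l)    { assert(!l->held); l->held = true; }
static void lock_release(struct mock_lock *l) { assert(l->held);  l->held = false; }

struct mock_softc {
    struct mock_lock rxbuflock;
    bool     kickpcu;
    unsigned restarts;
};

/* Stand-in for the restart sequence; it takes rxbuflock itself. */
static void mock_restart_rx(struct mock_softc *sc)
{
    lock_take(&sc->rxbuflock);
    sc->restarts++;
    lock_release(&sc->rxbuflock);
}

static void mock_rx_tasklet(struct mock_softc *sc)
{
    bool kick;

    lock_take(&sc->rxbuflock);
    /* ... walk the descriptor list under the lock ... */
    kick = sc->kickpcu;        /* sample the flag while still locked */
    sc->kickpcu = false;
    lock_release(&sc->rxbuflock);

    if (kick)
        mock_restart_rx(sc);   /* safe: the lock is no longer held */
}
```

Moving the `mock_restart_rx()` call above `lock_release()` makes the
assert in `lock_take()` fire, which is the double acquisition the
ordering avoids.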


> This reliably fixes all the crazy stuff I saw when I didn't do the
> above but it does give (to me, unacceptable) packet loss under very
> high UDP RX load.

Do you have this fix in the FreeBSD mainline? If so, would a similar
fix be beneficial to mainline ath9k?

v/r,
Daniel

* [ath9k-devel] Possible over driving AR9106, how to detect?
  2011-09-07 18:04   ` Daniel Smith
@ 2011-09-08  0:37     ` Adrian Chadd
  2011-09-08 11:42       ` Daniel Smith
  0 siblings, 1 reply; 7+ messages in thread
From: Adrian Chadd @ 2011-09-08  0:37 UTC
  To: ath9k-devel

On 8 September 2011 02:04, Daniel Smith <viscous.liquid@gmail.com> wrote:
> Thanks for the quick response Adrian.

No worries.

> I am not certain it is in fact the DMA RX stop storm. The lock-up
> often coincided with the storm back when the storm used to be much more
> pervasive. Now we still see it even when there is no storm occurring,
> e.g. the interrupt debug counters are not increasing and there are none
> of the "unable to stop RX DMA" kernel log errors.

Right. The reason you won't see "unable to stop RX DMA" is that it
hasn't locked up in that way.

> This may be as simple (with some additional success checks) as a:
>
> if (sc->sc_kickpcu == 1) {
>     ath_stoprecv(sc);
>     ath_rx_cleanup(sc);
>     ath_rx_init(sc, ATH_RXBUF);
>     ath_startrecv(sc);
>
>     sc->sc_kickpcu = 0;
> }

That looks like my solution. But I have extra checks to see whether
startrecv() properly worked.
Otherwise I do an ath_reset() call.

Also, you should make sure you reset the interrupt mask.
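
Those extra checks can be sketched like this. It is a stand-alone model
under stated assumptions: `mock_hw`, `mock_startrecv()` and
`mock_full_reset()` are hypothetical stand-ins, and the assumption that
a full reset always recovers receive is made only to keep the example
short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct mock_hw {
    bool     startrecv_ok;  /* test knob: does the RX restart succeed? */
    uint32_t imask;         /* currently programmed interrupt mask */
    unsigned full_resets;   /* number of fallback chip resets */
};

/* Stand-in for restarting receive; reports success or failure. */
static bool mock_startrecv(struct mock_hw *hw)
{
    return hw->startrecv_ok;
}

/* Stand-in for the heavier full reset used as a fallback. */
static void mock_full_reset(struct mock_hw *hw)
{
    hw->full_resets++;
    hw->startrecv_ok = true;  /* assumed: a full reset recovers receive */
}

/* Restart receive; fall back to a full reset if that fails, and in
 * either case put the saved interrupt mask back. Returns true if the
 * light-weight restart was enough. */
static bool restart_rx(struct mock_hw *hw, uint32_t saved_imask)
{
    bool ok;

    assert(hw);
    ok = mock_startrecv(hw);
    if (!ok)
        mock_full_reset(hw);
    hw->imask = saved_imask;
    return ok;
}
```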

> right after unlocking the spinlock. It has to come after the unlock
> because three of those calls try to take the rxbuflock themselves.
>
>
>> This reliably fixes all the crazy stuff I saw when I didn't do the
>> above but it does give (to me, unacceptable) packet loss under very
>> high UDP RX load.
>
> Do you have this fix in the FreeBSD mainline? If so, would a similar
> fix be beneficial to mainline ath9k?

I'm still trying to figure out what the "real" problem is (and then fix it.)


Adrian

* [ath9k-devel] Possible over driving AR9106, how to detect?
  2011-09-08  0:37     ` Adrian Chadd
@ 2011-09-08 11:42       ` Daniel Smith
  2011-09-08 12:01         ` Adrian Chadd
  2011-09-08 12:09         ` Alex Hacker
  0 siblings, 2 replies; 7+ messages in thread
From: Daniel Smith @ 2011-09-08 11:42 UTC
  To: ath9k-devel

On Wed, Sep 7, 2011 at 8:37 PM, Adrian Chadd <adrian@freebsd.org> wrote:
> Right. The reason you won't see "unable to stop RX DMA" is that it
> hasn't locked up in that way.

Ahh, so you think this could possibly be a precursor to the DMA storm?

> That looks like my solution. But I have extra checks to see whether
> startrecv() properly worked.
> Otherwise I do an ath_reset() call.

Don't doubt it, it flowed naturally from your description. I will
probably put checks on the stoprecv and the rx_init.

> I'm still trying to figure out what the "real" problem is (and then fix it.)

We can reliably recreate it when we have high-gain reception (70+ dB)
combined with a high incoming frame rate. I am wondering (and this is
the reason for the post) whether the RF front-end is being over driven.
If so, it is not a bug that can be fixed in software but a limitation
of the hardware that needs to be acknowledged and dealt with
appropriately.

* [ath9k-devel] Possible over driving AR9106, how to detect?
  2011-09-08 11:42       ` Daniel Smith
@ 2011-09-08 12:01         ` Adrian Chadd
  2011-09-08 12:09         ` Alex Hacker
  1 sibling, 0 replies; 7+ messages in thread
From: Adrian Chadd @ 2011-09-08 12:01 UTC
  To: ath9k-devel

On 8 September 2011 19:42, Daniel Smith <viscous.liquid@gmail.com> wrote:
> On Wed, Sep 7, 2011 at 8:37 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>> Right. The reason you won't see "unable to stop RX DMA" is that it
>> hasn't locked up in that way.
>
> Ahh, so you think this could possibly be a precursor to the DMA storm?

Well, that's basically what is happening:

* The hardware hits, or thinks it has hit, the end of the RX descriptor list;
* It registers an RXEOL once it hits it;
* The PCU then stops receiving RX frames, as there's nowhere now to put them;
* and you then get an RXORN interrupt for each RX FIFO overrun
(because there's nowhere for the PCU to DMA them into.)
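
The sequence above can be modelled in a few lines. This is a deliberate
simplification with hypothetical names: a single counter stands in for
the RX descriptor list, and RXEOL is raised at the moment the last free
descriptor is consumed.

```c
#include <assert.h>

struct mock_rx {
    unsigned free_desc;  /* descriptors the hardware may still fill */
    unsigned delivered;  /* frames DMA'd into a descriptor */
    unsigned rxeol;      /* RXEOL events raised */
    unsigned rxorn;      /* RXORN events raised */
};

/* Model one frame arriving at the PCU. */
static void mock_frame_arrives(struct mock_rx *rx)
{
    assert(rx);
    if (rx->free_desc == 0) {
        rx->rxorn++;     /* nowhere to DMA the frame: FIFO overrun */
        return;
    }
    rx->delivered++;
    if (--rx->free_desc == 0)
        rx->rxeol++;     /* just consumed the last descriptor */
}
```

With three free descriptors and five incoming frames the model
delivers three frames, raises RXEOL once and RXORN twice.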

The patch recently pushed into ath9k just quietens the interrupts when
they occur, hoping that when the RX process is next kicked, the RX
descriptors would be re-setup and rechained correctly and the PCU will
continue along its merry way.

The trouble is - I found (in FreeBSD), this isn't necessarily the
case. If I revert my fix and change it to just stop/start the PCU
after walking the RX descriptor list and re-chaining things, the list
would eventually get itself twisted in a way where only a handful of
buffers on the RX list would be RX'ed into before the hardware thought
it hit EOL. So it was hitting EOL after 3, 4, .. 10 descriptors;
rather than 512.

That's why I think there's something else we've missed. I don't want
to have to shut the PCU down and then re-init the list every time this
happens. I'd like to see what is left in the RX descriptor that's
causing issues. It may be something as simple as a missing write
memory barrier.
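
One way to look for a twisted list is to count how many descriptors are
actually reachable from the head before the chain terminates. The
following is a minimal sketch with a hypothetical `mock_desc` type; it
walks a software copy of the links and does not touch real descriptor
memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software view of an RX descriptor. */
struct mock_desc {
    struct mock_desc *link;  /* next descriptor, NULL marks end of list */
    int done;                /* non-zero once the hardware filled it */
};

/* Count the descriptors reachable from `head` before the list ends.
 * `max` bounds the walk so that a chain looping back on itself cannot
 * spin forever; max + 1 is returned in that case. */
static size_t chain_length(const struct mock_desc *head, size_t max)
{
    size_t n = 0;

    assert(max > 0);
    while (head != NULL) {
        if (++n > max)
            return n;        /* longer than expected: probably a loop */
        head = head->link;
    }
    return n;
}
```

A result far below the expected 512 would match the symptom of hitting
EOL after only a handful of descriptors.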



Adrian

* [ath9k-devel] Possible over driving AR9106, how to detect?
  2011-09-08 11:42       ` Daniel Smith
  2011-09-08 12:01         ` Adrian Chadd
@ 2011-09-08 12:09         ` Alex Hacker
  1 sibling, 0 replies; 7+ messages in thread
From: Alex Hacker @ 2011-09-08 12:09 UTC
  To: ath9k-devel

On Thu, Sep 08, 2011 at 07:42:19AM -0400, Daniel Smith wrote:
> We can reliably recreate it when we have high-gain reception (70+ dB)
> combined with a high incoming frame rate. I am wondering (and this is
> the reason for the post) whether the RF front-end is being over driven.
> If so, it is not a bug that can be fixed in software but a limitation
> of the hardware that needs to be acknowledged and dealt with
> appropriately.

Not sure about the AR9106, but for AR9160 or AR92xx chips an RSSI of 70 dB
(which should be around -25 dBm at the antenna input) is not sufficient to
overdrive the radio. Signal levels that high or higher are very common in
our lab tests, without any issues.
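
The conversion behind those figures is a single addition, assuming the
reported RSSI is in dB relative to the noise floor. The -95 dBm noise
floor used here is an assumption: it is the value implied by 70 dB
mapping to about -25 dBm, not a measured one, and the names are
hypothetical.

```c
#include <assert.h>

/* Assumed nominal noise floor in dBm, implied by the 70 dB -> -25 dBm
 * figures; real hardware calibrates this per channel. */
#define ASSUMED_NOISE_FLOOR_DBM (-95)

/* Convert an RSSI in dB above the noise floor to absolute dBm. */
static int rssi_to_dbm(int rssi_db, int noise_floor_dbm)
{
    assert(rssi_db >= 0);
    return noise_floor_dbm + rssi_db;
}
```

For example, `rssi_to_dbm(70, ASSUMED_NOISE_FLOOR_DBM)` gives -25.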

Regards,
Alex.
