All of lore.kernel.org
 help / color / mirror / Atom feed
* Permanent disconnect when AP changes frequency
@ 2022-03-04 15:44 Michael Johnson
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Johnson @ 2022-03-04 15:44 UTC (permalink / raw)
  To: iwd

[-- Attachment #1: Type: text/plain, Size: 3221 bytes --]

Hi all,

I just hit a problem today where my iwd clients were getting
permanently disconnected from a network after the AP changed it's
frequency.
The first time the client tries to FT roam to the changed AP it gets a
result: 1 (authentication failed?). After that when it tries to
reconnect to another BSS in the same network but this time gets a
result: 2 (association failure?) and critically blacklists the BSS. It
tries to connect to all available BSS and gets the same error until
there are none left. This leaves the client permanently disconnected.
I'm guessing that changing the frequency wiped the auth from the AP
which explains why the initial FT roam fails but I'm not sure what
would be causing the subsequent failures.

Looking at `src/station.c:station_retry_with_status` would it make
sense to clean whatever auth data we have for FT if the status is an
AUTH failure?

```
Mar 04 13:35:14 p3-1147 iwd[443]:
src/station.c:station_transition_start() 5, target *:*:*:*:07:b1
Mar 04 13:35:14 p3-1147 iwd[443]: src/netdev.c:netdev_cqm_rssi_update()
Mar 04 13:35:14 p3-1147 iwd[443]:
src/wiphy.c:wiphy_radio_work_insert() Inserting work item 6
Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_enter_state()
Old State: connected, new state: roaming
Mar 04 13:35:14 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_done()
Work item 5 done
Mar 04 13:35:14 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_next()
Starting work item 6
Mar 04 13:35:14 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_done()
Work item 6 done
Mar 04 13:35:14 p3-1147 iwd[443]:
src/station.c:station_fast_transition_cb() 5, result: 1
Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_roam_failed() 5
Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_disassociated() 5
Mar 04 13:35:14 p3-1147 iwd[443]:
src/resolve.c:resolve_systemd_revert() ifindex: 5
Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_enter_state()
Old State: roaming, new state: disconnected
Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_enter_state()
Old State: disconnected, new state: autoconnect_quick
...
Mar 04 13:36:16 p3-1147 iwd[443]:
src/station.c:__station_connect_network() connecting to BSS
*:*:*:*:08:71
Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_enter_state()
Old State: autoconnect_full, new state: connecting (auto)
Mar 04 13:36:16 p3-1147 iwd[443]: src/scan.c:scan_periodic_stop()
Stopping periodic scan for wdev 2
Mar 04 13:36:16 p3-1147 iwd[443]: src/scan.c:scan_cancel() Trying to
cancel scan id 13 for wdev 2
Mar 04 13:36:16 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_done()
Work item 13 done
Mar 04 13:36:16 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_next()
Starting work item 14
Mar 04 13:36:16 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_done()
Work item 14 done
Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
5, result: 2
Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_enter_state()
Old State: connecting (auto), new state: disconnected
...
Mar 04 13:36:31 p3-1147 iwd[443]:
src/station.c:station_autoconnect_next() autoconnect: No suitable
BSSes found
```

Kind Regards,
Michael Johnson

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Permanent disconnect when AP changes frequency
@ 2022-03-07 17:43 Michael Johnson
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Johnson @ 2022-03-07 17:43 UTC (permalink / raw)
  To: iwd

[-- Attachment #1: Type: text/plain, Size: 6156 bytes --]

Hi James,

I've tested the patch and it does seem to fix the initial disconnect, thank you!
I'm still hoping to get the extra kernel message to debug the
subsequent failures but that might take some time.

Regards,
Michael

On Mon, 7 Mar 2022 at 11:14, Michael Johnson <mjohnson459(a)gmail.com> wrote:
>
> Hi James,
>
> Thanks for the quick response. I'll test out the patch today and I'll
> also see if I can rebuild the kernel to get more information on what
> is happening.
>
> Regards,
> Michael
>
> On Fri, 4 Mar 2022 at 21:28, James Prestwood <prestwoj(a)gmail.com> wrote:
> >
> > Hi Michael,
> >
> > On Fri, 2022-03-04 at 17:18 +0000, Michael Johnson wrote:
> > > Hi James,
> > >
> > > On Fri, 4 Mar 2022 at 16:59, James Prestwood <prestwoj(a)gmail.com>
> > > wrote:
> > >
> > > > Was this the AP doing this on its own? Or did you manually change the
> > > > frequency. In theory a channel switch should come as an event and you
> > > > would see this in the logs:
> > > >
> > > > netdev_channeL_switch_event() Channel switch event, frequenct: <freq>
> > >
> > > I manually changed it but I don't see any "Channel switch event" in
> > > the logs. The client wasn't connected to the AP at the time of the
> > > change and it only failed when it tried to roam normally.
> > >
> > > > This shouldn't be permanent, but the more times the connection fails
> > > > the longer the timer. In your case though the FT failure will not
> > > > cause
> > > > a blacklist, only the association failure. The default is 60 seconds
> > > > though, so that is quite a while.
> > >
> > > It looks like it came off the blacklist once but still failed and was
> > > put back on it.
> > >
> > > Mar 04 13:35:15 p3-1147 iwd[443]:
> > > src/station.c:__station_connect_network() connecting to BSS *:08:71
> > > Mar 04 13:35:15 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > > 5, result: 2
> > > Mar 04 13:35:15 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > > Nothing left on autoconnect list
> > > Mar 04 13:35:23 p3-1147 iwd[443]:
> > > src/station.c:__station_connect_network() connecting to BSS *:07:b1
> > > Mar 04 13:35:23 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > > 5, result: 2
> > > Mar 04 13:35:23 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > > Nothing left on autoconnect list
> > > Mar 04 13:36:16 p3-1147 iwd[443]:
> > > src/station.c:__station_connect_network() connecting to BSS *:08:71
> > > Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > > 5, result: 2
> > > Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > > Nothing left on autoconnect list
> > > Mar 04 13:36:24 p3-1147 iwd[443]:
> > > src/station.c:__station_connect_network() connecting to BSS *:07:b1
> > > Mar 04 13:36:24 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > > 5, result: 2
> > > Mar 04 13:36:24 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > > Nothing left on autoconnect list
> > >
> > > > > I'm guessing that changing the frequency wiped the auth from the AP
> > > > > which explains why the initial FT roam fails but I'm not sure what
> > > > > would be causing the subsequent failures.
> > > >
> > > > Could be. Looks like the failure after the failed roam never even got
> > > > to the AP. So the "association failure" is really just the kernel
> > > > failing (unless you're on brcmfmac?). Maybe something can be seen in
> > > > the kernel logs. I'm thinking the kernel got confused maybe with the
> > > > channel switch and probably needs to re-scan. Does a manual attempt
> > > > to
> > > > connect afterwards work?
> > >
> > > The kernel didn't print anything to the logs but I can try and capture
> > > extra details with iwmon. I couldn't test a manual connect as it's a
> > > headless machine but it was fine after a reboot.
> > > I can try to trigger it again with a hardwire so I can check the
> > > manual connection.
> > >
> > > > > Looking at `src/station.c:station_retry_with_status` would it make
> > > > > sense to clean whatever auth data we have for FT if the status is
> > > > > an
> > > > > AUTH failure?
> > > >
> > > > The retry logic really only matters for new connections. Once IWD
> > > > goes
> > > > into a disconnected state (like after you're roam failure) any
> > > > previous
> > > > connection data is cleared out.
> > >
> > > Does that mean whatever is going wrong is at the kernel level? It
> > > seems like some state is cached somewhere if all subsequent
> > > connections fail.
> >
> > We've identified one issue [1] which should help here and actually fix
> > the initial roam, but even with this you should have been able to
> > connect again after IWD re-scans so we think something else may be
> > going on.
> >
> > Is it possible for you to rebuild the kernel and disable a NL80211
> > flag? We cant see some messages because the kernel is clearing out a
> > buffer which it does on certain message types. We can use a big hammer
> > here and just disable this on all messages:
> >
> > diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
> > index 578bff9c378b..17df9be1d5a3 100644
> > --- a/net/wireless/nl80211.c
> > +++ b/net/wireless/nl80211.c
> > @@ -15177,7 +15177,7 @@ static int nl80211_set_fils_aad(struct sk_buff
> > *skb,
> >  /* If a netdev is associated, it must be UP, P2P must be started */
> >  #define NL80211_FLAG_NEED_WDEV_UP      (NL80211_FLAG_NEED_WDEV |\
> >                                          NL80211_FLAG_CHECK_NETDEV_UP)
> > -#define NL80211_FLAG_CLEAR_SKB         0x20
> > +#define NL80211_FLAG_CLEAR_SKB         0x00
> >  #define NL80211_FLAG_NO_WIPHY_MTX      0x40
> >
> >  static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff
> > *skb,
> >
> > And then run iwmon with this disabled. Hopefully we can get a bit more
> > from this.
> >
> > Thanks,
> > James
> >
> > [1]
> > https://lists.01.org/hyperkitty/list/iwd(a)lists.01.org/thread/ED5XTWW2HITGBDJJBDFOXMNZDIPHSSGE/
> >
> > >
> > > Regards,
> > > Michael
> >
> >

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Permanent disconnect when AP changes frequency
@ 2022-03-07 11:14 Michael Johnson
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Johnson @ 2022-03-07 11:14 UTC (permalink / raw)
  To: iwd

[-- Attachment #1: Type: text/plain, Size: 5572 bytes --]

Hi James,

Thanks for the quick response. I'll test out the patch today and I'll
also see if I can rebuild the kernel to get more information on what
is happening.

Regards,
Michael

On Fri, 4 Mar 2022 at 21:28, James Prestwood <prestwoj(a)gmail.com> wrote:
>
> Hi Michael,
>
> On Fri, 2022-03-04 at 17:18 +0000, Michael Johnson wrote:
> > Hi James,
> >
> > On Fri, 4 Mar 2022 at 16:59, James Prestwood <prestwoj(a)gmail.com>
> > wrote:
> >
> > > Was this the AP doing this on its own? Or did you manually change the
> > > frequency. In theory a channel switch should come as an event and you
> > > would see this in the logs:
> > >
> > > netdev_channeL_switch_event() Channel switch event, frequenct: <freq>
> >
> > I manually changed it but I don't see any "Channel switch event" in
> > the logs. The client wasn't connected to the AP at the time of the
> > change and it only failed when it tried to roam normally.
> >
> > > This shouldn't be permanent, but the more times the connection fails
> > > the longer the timer. In your case though the FT failure will not
> > > cause
> > > a blacklist, only the association failure. The default is 60 seconds
> > > though, so that is quite a while.
> >
> > It looks like it came off the blacklist once but still failed and was
> > put back on it.
> >
> > Mar 04 13:35:15 p3-1147 iwd[443]:
> > src/station.c:__station_connect_network() connecting to BSS *:08:71
> > Mar 04 13:35:15 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > 5, result: 2
> > Mar 04 13:35:15 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > Nothing left on autoconnect list
> > Mar 04 13:35:23 p3-1147 iwd[443]:
> > src/station.c:__station_connect_network() connecting to BSS *:07:b1
> > Mar 04 13:35:23 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > 5, result: 2
> > Mar 04 13:35:23 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > Nothing left on autoconnect list
> > Mar 04 13:36:16 p3-1147 iwd[443]:
> > src/station.c:__station_connect_network() connecting to BSS *:08:71
> > Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > 5, result: 2
> > Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > Nothing left on autoconnect list
> > Mar 04 13:36:24 p3-1147 iwd[443]:
> > src/station.c:__station_connect_network() connecting to BSS *:07:b1
> > Mar 04 13:36:24 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > 5, result: 2
> > Mar 04 13:36:24 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> > Nothing left on autoconnect list
> >
> > > > I'm guessing that changing the frequency wiped the auth from the AP
> > > > which explains why the initial FT roam fails but I'm not sure what
> > > > would be causing the subsequent failures.
> > >
> > > Could be. Looks like the failure after the failed roam never even got
> > > to the AP. So the "association failure" is really just the kernel
> > > failing (unless you're on brcmfmac?). Maybe something can be seen in
> > > the kernel logs. I'm thinking the kernel got confused maybe with the
> > > channel switch and probably needs to re-scan. Does a manual attempt
> > > to
> > > connect afterwards work?
> >
> > The kernel didn't print anything to the logs but I can try and capture
> > extra details with iwmon. I couldn't test a manual connect as it's a
> > headless machine but it was fine after a reboot.
> > I can try to trigger it again with a hardwire so I can check the
> > manual connection.
> >
> > > > Looking at `src/station.c:station_retry_with_status` would it make
> > > > sense to clean whatever auth data we have for FT if the status is
> > > > an
> > > > AUTH failure?
> > >
> > > The retry logic really only matters for new connections. Once IWD
> > > goes
> > > into a disconnected state (like after you're roam failure) any
> > > previous
> > > connection data is cleared out.
> >
> > Does that mean whatever is going wrong is at the kernel level? It
> > seems like some state is cached somewhere if all subsequent
> > connections fail.
>
> We've identified one issue [1] which should help here and actually fix
> the initial roam, but even with this you should have been able to
> connect again after IWD re-scans so we think something else may be
> going on.
>
> Is it possible for you to rebuild the kernel and disable a NL80211
> flag? We cant see some messages because the kernel is clearing out a
> buffer which it does on certain message types. We can use a big hammer
> here and just disable this on all messages:
>
> diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
> index 578bff9c378b..17df9be1d5a3 100644
> --- a/net/wireless/nl80211.c
> +++ b/net/wireless/nl80211.c
> @@ -15177,7 +15177,7 @@ static int nl80211_set_fils_aad(struct sk_buff
> *skb,
>  /* If a netdev is associated, it must be UP, P2P must be started */
>  #define NL80211_FLAG_NEED_WDEV_UP      (NL80211_FLAG_NEED_WDEV |\
>                                          NL80211_FLAG_CHECK_NETDEV_UP)
> -#define NL80211_FLAG_CLEAR_SKB         0x20
> +#define NL80211_FLAG_CLEAR_SKB         0x00
>  #define NL80211_FLAG_NO_WIPHY_MTX      0x40
>
>  static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff
> *skb,
>
> And then run iwmon with this disabled. Hopefully we can get a bit more
> from this.
>
> Thanks,
> James
>
> [1]
> https://lists.01.org/hyperkitty/list/iwd(a)lists.01.org/thread/ED5XTWW2HITGBDJJBDFOXMNZDIPHSSGE/
>
> >
> > Regards,
> > Michael
>
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Permanent disconnect when AP changes frequency
@ 2022-03-04 21:28 James Prestwood
  0 siblings, 0 replies; 7+ messages in thread
From: James Prestwood @ 2022-03-04 21:28 UTC (permalink / raw)
  To: iwd

[-- Attachment #1: Type: text/plain, Size: 5077 bytes --]

Hi Michael,

On Fri, 2022-03-04 at 17:18 +0000, Michael Johnson wrote:
> Hi James,
> 
> On Fri, 4 Mar 2022 at 16:59, James Prestwood <prestwoj(a)gmail.com>
> wrote:
> 
> > Was this the AP doing this on its own? Or did you manually change the
> > frequency. In theory a channel switch should come as an event and you
> > would see this in the logs:
> > 
> > netdev_channeL_switch_event() Channel switch event, frequenct: <freq>
> 
> I manually changed it but I don't see any "Channel switch event" in
> the logs. The client wasn't connected to the AP at the time of the
> change and it only failed when it tried to roam normally.
> 
> > This shouldn't be permanent, but the more times the connection fails
> > the longer the timer. In your case though the FT failure will not
> > cause
> > a blacklist, only the association failure. The default is 60 seconds
> > though, so that is quite a while.
> 
> It looks like it came off the blacklist once but still failed and was
> put back on it.
> 
> Mar 04 13:35:15 p3-1147 iwd[443]:
> src/station.c:__station_connect_network() connecting to BSS *:08:71
> Mar 04 13:35:15 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> 5, result: 2
> Mar 04 13:35:15 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> Nothing left on autoconnect list
> Mar 04 13:35:23 p3-1147 iwd[443]:
> src/station.c:__station_connect_network() connecting to BSS *:07:b1
> Mar 04 13:35:23 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> 5, result: 2
> Mar 04 13:35:23 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> Nothing left on autoconnect list
> Mar 04 13:36:16 p3-1147 iwd[443]:
> src/station.c:__station_connect_network() connecting to BSS *:08:71
> Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> 5, result: 2
> Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> Nothing left on autoconnect list
> Mar 04 13:36:24 p3-1147 iwd[443]:
> src/station.c:__station_connect_network() connecting to BSS *:07:b1
> Mar 04 13:36:24 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> 5, result: 2
> Mar 04 13:36:24 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> Nothing left on autoconnect list
> 
> > > I'm guessing that changing the frequency wiped the auth from the AP
> > > which explains why the initial FT roam fails but I'm not sure what
> > > would be causing the subsequent failures.
> > 
> > Could be. Looks like the failure after the failed roam never even got
> > to the AP. So the "association failure" is really just the kernel
> > failing (unless you're on brcmfmac?). Maybe something can be seen in
> > the kernel logs. I'm thinking the kernel got confused maybe with the
> > channel switch and probably needs to re-scan. Does a manual attempt
> > to
> > connect afterwards work?
> 
> The kernel didn't print anything to the logs but I can try and capture
> extra details with iwmon. I couldn't test a manual connect as it's a
> headless machine but it was fine after a reboot.
> I can try to trigger it again with a hardwire so I can check the
> manual connection.
> 
> > > Looking at `src/station.c:station_retry_with_status` would it make
> > > sense to clean whatever auth data we have for FT if the status is
> > > an
> > > AUTH failure?
> > 
> > The retry logic really only matters for new connections. Once IWD
> > goes
> > into a disconnected state (like after you're roam failure) any
> > previous
> > connection data is cleared out.
> 
> Does that mean whatever is going wrong is at the kernel level? It
> seems like some state is cached somewhere if all subsequent
> connections fail.

We've identified one issue [1] which should help here and actually fix
the initial roam, but even with this you should have been able to
connect again after IWD re-scans so we think something else may be
going on.

Is it possible for you to rebuild the kernel and disable a NL80211
flag? We cant see some messages because the kernel is clearing out a
buffer which it does on certain message types. We can use a big hammer
here and just disable this on all messages:

diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 578bff9c378b..17df9be1d5a3 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -15177,7 +15177,7 @@ static int nl80211_set_fils_aad(struct sk_buff
*skb,
 /* If a netdev is associated, it must be UP, P2P must be started */
 #define NL80211_FLAG_NEED_WDEV_UP      (NL80211_FLAG_NEED_WDEV |\
                                         NL80211_FLAG_CHECK_NETDEV_UP)
-#define NL80211_FLAG_CLEAR_SKB         0x20
+#define NL80211_FLAG_CLEAR_SKB         0x00
 #define NL80211_FLAG_NO_WIPHY_MTX      0x40
 
 static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff
*skb,

And then run iwmon with this disabled. Hopefully we can get a bit more
from this.

Thanks,
James

[1]
https://lists.01.org/hyperkitty/list/iwd(a)lists.01.org/thread/ED5XTWW2HITGBDJJBDFOXMNZDIPHSSGE/

> 
> Regards,
> Michael


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: Permanent disconnect when AP changes frequency
@ 2022-03-04 17:18 Michael Johnson
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Johnson @ 2022-03-04 17:18 UTC (permalink / raw)
  To: iwd

[-- Attachment #1: Type: text/plain, Size: 3447 bytes --]

Hi James,

On Fri, 4 Mar 2022 at 16:59, James Prestwood <prestwoj(a)gmail.com> wrote:

> Was this the AP doing this on its own? Or did you manually change the
> frequency. In theory a channel switch should come as an event and you
> would see this in the logs:
>
> netdev_channeL_switch_event() Channel switch event, frequenct: <freq>

I manually changed it but I don't see any "Channel switch event" in
the logs. The client wasn't connected to the AP at the time of the
change and it only failed when it tried to roam normally.

> This shouldn't be permanent, but the more times the connection fails
> the longer the timer. In your case though the FT failure will not cause
> a blacklist, only the association failure. The default is 60 seconds
> though, so that is quite a while.

It looks like it came off the blacklist once but still failed and was
put back on it.

Mar 04 13:35:15 p3-1147 iwd[443]:
src/station.c:__station_connect_network() connecting to BSS *:08:71
Mar 04 13:35:15 p3-1147 iwd[443]: src/station.c:station_connect_cb()
5, result: 2
Mar 04 13:35:15 p3-1147 iwd[443]: src/station.c:station_connect_cb()
Nothing left on autoconnect list
Mar 04 13:35:23 p3-1147 iwd[443]:
src/station.c:__station_connect_network() connecting to BSS *:07:b1
Mar 04 13:35:23 p3-1147 iwd[443]: src/station.c:station_connect_cb()
5, result: 2
Mar 04 13:35:23 p3-1147 iwd[443]: src/station.c:station_connect_cb()
Nothing left on autoconnect list
Mar 04 13:36:16 p3-1147 iwd[443]:
src/station.c:__station_connect_network() connecting to BSS *:08:71
Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
5, result: 2
Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
Nothing left on autoconnect list
Mar 04 13:36:24 p3-1147 iwd[443]:
src/station.c:__station_connect_network() connecting to BSS *:07:b1
Mar 04 13:36:24 p3-1147 iwd[443]: src/station.c:station_connect_cb()
5, result: 2
Mar 04 13:36:24 p3-1147 iwd[443]: src/station.c:station_connect_cb()
Nothing left on autoconnect list

> > I'm guessing that changing the frequency wiped the auth from the AP
> > which explains why the initial FT roam fails but I'm not sure what
> > would be causing the subsequent failures.
>
> Could be. Looks like the failure after the failed roam never even got
> to the AP. So the "association failure" is really just the kernel
> failing (unless you're on brcmfmac?). Maybe something can be seen in
> the kernel logs. I'm thinking the kernel got confused maybe with the
> channel switch and probably needs to re-scan. Does a manual attempt to
> connect afterwards work?

The kernel didn't print anything to the logs but I can try and capture
extra details with iwmon. I couldn't test a manual connect as it's a
headless machine but it was fine after a reboot.
I can try to trigger it again with a hardwire so I can check the
manual connection.

> > Looking at `src/station.c:station_retry_with_status` would it make
> > sense to clean whatever auth data we have for FT if the status is an
> > AUTH failure?
>
> The retry logic really only matters for new connections. Once IWD goes
> into a disconnected state (like after you're roam failure) any previous
> connection data is cleared out.

Does that mean whatever is going wrong is at the kernel level? It
seems like some state is cached somewhere if all subsequent
connections fail.

Regards,
Michael

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Permanent disconnect when AP changes frequency
@ 2022-03-04 17:00 Denis Kenzior
  0 siblings, 0 replies; 7+ messages in thread
From: Denis Kenzior @ 2022-03-04 17:00 UTC (permalink / raw)
  To: iwd

[-- Attachment #1: Type: text/plain, Size: 1560 bytes --]

Hi Michael,

On 3/4/22 09:44, Michael Johnson wrote:
> Hi all,
> 
> I just hit a problem today where my iwd clients were getting
> permanently disconnected from a network after the AP changed it's
> frequency.

Can you elaborate?  Normally change in frequency is mostly transparent to user 
space.  We should get an event from the driver/kernel telling us that this 
happened, but overall it should not cause us to roam?

> The first time the client tries to FT roam to the changed AP it gets a

Why is the roam being triggered?  This seems to be missing from your log.

> result: 1 (authentication failed?). After that when it tries to
> reconnect to another BSS in the same network but this time gets a
> result: 2 (association failure?) and critically blacklists the BSS. It
> tries to connect to all available BSS and gets the same error until

Do you have an iwmon trace?

> there are none left. This leaves the client permanently disconnected.
> I'm guessing that changing the frequency wiped the auth from the AP
> which explains why the initial FT roam fails but I'm not sure what
> would be causing the subsequent failures.
> 
> Looking at `src/station.c:station_retry_with_status` would it make
> sense to clean whatever auth data we have for FT if the status is an
> AUTH failure?

It is being wiped out anyway since we're no longer associated when the FT fails. 
  Subsequent connections are being done the 'slow' way.  We'd need to see iwmon 
traces to understand why these associations fail.

Regards,
-Denis

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Permanent disconnect when AP changes frequency
@ 2022-03-04 16:59 James Prestwood
  0 siblings, 0 replies; 7+ messages in thread
From: James Prestwood @ 2022-03-04 16:59 UTC (permalink / raw)
  To: iwd

[-- Attachment #1: Type: text/plain, Size: 4671 bytes --]

Hi Michael,

On Fri, 2022-03-04 at 15:44 +0000, Michael Johnson wrote:
> Hi all,
> 
> I just hit a problem today where my iwd clients were getting
> permanently disconnected from a network after the AP changed it's
> frequency.

Was this the AP doing this on its own? Or did you manually change the
frequency. In theory a channel switch should come as an event and you
would see this in the logs:

netdev_channeL_switch_event() Channel switch event, frequenct: <freq>

> The first time the client tries to FT roam to the changed AP it gets
> a
> result: 1 (authentication failed?). After that when it tries to
> reconnect to another BSS in the same network but this time gets a
> result: 2 (association failure?) and critically blacklists the BSS.
> It
> tries to connect to all available BSS and gets the same error until
> there are none left. This leaves the client permanently disconnected.

This shouldn't be permanent, but the more times the connection fails
the longer the timer. In your case though the FT failure will not cause
a blacklist, only the association failure. The default is 60 seconds
though, so that is quite a while.

> I'm guessing that changing the frequency wiped the auth from the AP
> which explains why the initial FT roam fails but I'm not sure what
> would be causing the subsequent failures.

Could be. Looks like the failure after the failed roam never even got
to the AP. So the "association failure" is really just the kernel
failing (unless you're on brcmfmac?). Maybe something can be seen in
the kernel logs. I'm thinking the kernel got confused maybe with the
channel switch and probably needs to re-scan. Does a manual attempt to
connect afterwards work?

> 
> Looking at `src/station.c:station_retry_with_status` would it make
> sense to clean whatever auth data we have for FT if the status is an
> AUTH failure?

The retry logic really only matters for new connections. Once IWD goes
into a disconnected state (like after you're roam failure) any previous
connection data is cleared out.

> 
> ```
> Mar 04 13:35:14 p3-1147 iwd[443]:
> src/station.c:station_transition_start() 5, target *:*:*:*:07:b1
> Mar 04 13:35:14 p3-1147 iwd[443]:
> src/netdev.c:netdev_cqm_rssi_update()
> Mar 04 13:35:14 p3-1147 iwd[443]:
> src/wiphy.c:wiphy_radio_work_insert() Inserting work item 6
> Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_enter_state()
> Old State: connected, new state: roaming
> Mar 04 13:35:14 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_done()
> Work item 5 done
> Mar 04 13:35:14 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_next()
> Starting work item 6
> Mar 04 13:35:14 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_done()
> Work item 6 done
> Mar 04 13:35:14 p3-1147 iwd[443]:
> src/station.c:station_fast_transition_cb() 5, result: 1
> Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_roam_failed()
> 5
> Mar 04 13:35:14 p3-1147 iwd[443]:
> src/station.c:station_disassociated() 5
> Mar 04 13:35:14 p3-1147 iwd[443]:
> src/resolve.c:resolve_systemd_revert() ifindex: 5
> Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_enter_state()
> Old State: roaming, new state: disconnected
> Mar 04 13:35:14 p3-1147 iwd[443]: src/station.c:station_enter_state()
> Old State: disconnected, new state: autoconnect_quick
> ...
> Mar 04 13:36:16 p3-1147 iwd[443]:
> src/station.c:__station_connect_network() connecting to BSS
> *:*:*:*:08:71
> Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_enter_state()
> Old State: autoconnect_full, new state: connecting (auto)
> Mar 04 13:36:16 p3-1147 iwd[443]: src/scan.c:scan_periodic_stop()
> Stopping periodic scan for wdev 2
> Mar 04 13:36:16 p3-1147 iwd[443]: src/scan.c:scan_cancel() Trying to
> cancel scan id 13 for wdev 2
> Mar 04 13:36:16 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_done()
> Work item 13 done
> Mar 04 13:36:16 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_next()
> Starting work item 14
> Mar 04 13:36:16 p3-1147 iwd[443]: src/wiphy.c:wiphy_radio_work_done()
> Work item 14 done
> Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_connect_cb()
> 5, result: 2
> Mar 04 13:36:16 p3-1147 iwd[443]: src/station.c:station_enter_state()
> Old State: connecting (auto), new state: disconnected
> ...
> Mar 04 13:36:31 p3-1147 iwd[443]:
> src/station.c:station_autoconnect_next() autoconnect: No suitable
> BSSes found
> ```
> 
> Kind Regards,
> Michael Johnson
> _______________________________________________
> iwd mailing list -- iwd(a)lists.01.org
> To unsubscribe send an email to iwd-leave(a)lists.01.org


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-03-07 17:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-04 15:44 Permanent disconnect when AP changes frequency Michael Johnson
2022-03-04 16:59 James Prestwood
2022-03-04 17:00 Denis Kenzior
2022-03-04 17:18 Michael Johnson
2022-03-04 21:28 James Prestwood
2022-03-07 11:14 Michael Johnson
2022-03-07 17:43 Michael Johnson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.