netdev.vger.kernel.org archive mirror
* DSA breaks clients' roaming between switch port and host interfaces
@ 2020-04-05 12:23 DENG Qingfang
From: DENG Qingfang @ 2020-04-05 12:23 UTC
  To: netdev
  Cc: Vivien Didelot, Andrew Lunn, Florian Fainelli, Russell King,
	Chuanhong Guo, René van Dorst, John Crispin, Hauke Mehrtens,
	Stijn Segers, riddlariddla

Hello,
I found a bug in DSA that breaks Wi-Fi client roaming.

I set up two Wi-Fi routers as APs. Both run kernel 5.4.30 and use DSA.

        +-------------------------+                            +-----------------------------+
        |                         |                            |                             |
        |                         |                            |                             |
        |       AP1               |                            |       AP2                   |
        |                     LAN2+--------------------------->|LAN1                         |
        |       10.0.0.1/24       |                            |       10.0.0.2/24           |
        |                         |                            |                             |
        |       MV88E6XXX DSA     |                            |       MT7530 DSA            |
        |                         |                            |                             |
        |                         |                            |                             |
        |                         |                            |                             |
        +-------------------------+                            +-----------------------------+
                     ^                                              ^
                     |                                              |
                     |                      Roams                   |
                     |                     -------------------------+
                     |
                     +------------    +-------------------+
                                      |     Wi-Fi         |
                                      |     Client        |
                                      |                   |
                                      |     10.0.0.3/24   |
                                      |                   |
                                      |                   |
                                      +-------------------+

When the client roams from AP1 to AP2, it cannot ping AP1 anymore for
a few minutes, and vice versa.

Using "bridge fdb", I tracked down the cause of the problem.
When the client is connected to AP1, bridge fdb on AP2 shows:

<client's mac> dev lan1 master br-lan
<client's mac> dev lan1 vlan 1 self

This means AP2 reaches the client via lan1, which is correct.

After the client roams to AP2, the problem appears:

<client's mac>  dev wlan0 master br-lan
<client's mac>  dev lan1 vlan 1 self

From the iproute2 man page: "self" means the address is associated with
the port drivers fdb. Usually hardware.

The lan1 "self" entry is still there, which means the kernel has updated
the forwarding table of br-lan but did not delete the stale entry in the
switch hardware.
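The mismatch above can be checked mechanically. Below is a minimal Python
sketch, assuming "bridge fdb show" prints one entry per line in the form
quoted above, with software bridge entries tagged "master" and switch
hardware entries tagged "self". find_stale_hw_entries and the example MAC
aa:bb:cc:dd:ee:ff are hypothetical; real output can carry extra flags that
this sketch simply ignores.

```python
def find_stale_hw_entries(fdb_lines):
    """Map each MAC whose software and hardware fdb entries name different
    ports to a (software_port, hardware_port) tuple."""
    sw_port = {}  # MAC -> port from "... master br-lan" (software bridge) lines
    hw_port = {}  # MAC -> port from "... self" (switch driver) lines
    for line in fdb_lines:
        tokens = line.split()
        if "dev" not in tokens[:-1]:
            continue  # no "dev <port>" pair on this line
        mac = tokens[0].lower()
        port = tokens[tokens.index("dev") + 1]
        if "master" in tokens:
            sw_port[mac] = port
        elif "self" in tokens:
            hw_port[mac] = port
    return {
        mac: (sw_port[mac], hw_port[mac])
        for mac in sw_port.keys() & hw_port.keys()
        if sw_port[mac] != hw_port[mac]
    }


# Example MAC address standing in for <client's mac>.
before_roam = [
    "aa:bb:cc:dd:ee:ff dev lan1 master br-lan",
    "aa:bb:cc:dd:ee:ff dev lan1 vlan 1 self",
]
after_roam = [
    "aa:bb:cc:dd:ee:ff dev wlan0 master br-lan",
    "aa:bb:cc:dd:ee:ff dev lan1 vlan 1 self",
]
print(find_stale_hw_entries(before_roam))  # {}
print(find_stale_hw_entries(after_roam))   # {'aa:bb:cc:dd:ee:ff': ('wlan0', 'lan1')}
```

Against the post-roaming output, the sketch reports the client's MAC with
the software port (wlan0) and the stale hardware port (lan1).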

What happens when the client now tries to talk to AP1, for example by
pinging 10.0.0.1? I debugged it with tcpdump:

1. The client sends ARP request: who-has 10.0.0.1?
2. The software bridge on AP2 receives the ARP request on wlan0, updates
its fdb, and sends the request to the switch through the CPU port.
3. The switch receives the client's ARP request from the CPU port and
floods it out of the LAN1 port. Although the source MAC address of the
request is the client's, _auto learning of the CPU port is disabled in
DSA_, so the switch does not update its MAC table.
4. AP1 receives the ARP request, then responds: 10.0.0.1 is-at <AP1's MAC>.
5. AP2's switch receives the response on LAN1 and looks up the client's
MAC in its MAC table. The stale entry points to LAN1, so the egress port
is the same as the ingress port, and the switch discards the ARP
response to avoid a loop.

If I manually delete the leftover fdb entry from the hardware with
"bridge fdb del <client's MAC> dev lan1 vlan 1", the client can reach
AP1 immediately.
The same happens in the reverse direction: AP1's mv88e6xxx switch shows
the same bug, so I think the problem is in the generic DSA code rather
than in a particular switch driver.
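Steps 3 to 5 and the manual workaround can be reproduced in a toy model.
The Python sketch below assumes a single VLAN and a switch that learns
source addresses on user ports but not on the CPU port, unless the
hypothetical cpu_port_learning flag is set. SwitchModel, roam_scenario and
the placeholder addresses are illustrative names, not DSA or driver code.

```python
from dataclasses import dataclass, field

CPU = "cpu"
BROADCAST = "ff:ff:ff:ff:ff:ff"


@dataclass
class SwitchModel:
    """Toy model of one switch's MAC table; not taken from any real driver."""
    cpu_port_learning: bool = False
    mac_table: dict = field(default_factory=dict)  # MAC -> port

    def receive(self, src, dst, ingress):
        """Return the forwarding decision for one frame."""
        # Learn the source MAC, except on the CPU port when learning is off there.
        if ingress != CPU or self.cpu_port_learning:
            self.mac_table[src] = ingress
        egress = self.mac_table.get(dst)
        if egress is None:
            return "flood"  # unknown destination (assumed to reach the CPU port too)
        if egress == ingress:
            return "drop"  # never send a frame back out of its ingress port
        return egress


def roam_scenario(cpu_port_learning=False, delete_stale_entry=False):
    client, ap1 = "client-mac", "ap1-mac"  # placeholder addresses
    sw = SwitchModel(cpu_port_learning)
    sw.mac_table[client] = "lan1"  # learned while the client was on AP1
    if delete_stale_entry:
        del sw.mac_table[client]  # like "bridge fdb del ... dev lan1 vlan 1"
    # Step 3: the client's ARP request reaches the switch from the CPU port.
    sw.receive(src=client, dst=BROADCAST, ingress=CPU)
    # Steps 4-5: AP1's reply enters on LAN1, addressed to the client.
    return sw.receive(src=ap1, dst=client, ingress="lan1")


print(roam_scenario())                         # drop
print(roam_scenario(delete_stale_entry=True))  # flood
print(roam_scenario(cpu_port_learning=True))   # cpu
```

In this model the stale lan1 entry alone is enough to drop the reply.
Removing it, or letting the CPU port learn the client's address from the
ARP request in step 3, avoids the drop.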

Does anyone know how to fix it?

Thanks.
Qingfang

* Re: DSA breaks clients' roaming between switch port and host interfaces
@ 2020-04-05 15:09 Andrew Lunn
From: Andrew Lunn @ 2020-04-05 15:09 UTC
  To: DENG Qingfang
  Cc: netdev, Vivien Didelot, Florian Fainelli, Russell King,
	Chuanhong Guo, René van Dorst, John Crispin, Hauke Mehrtens,
	Stijn Segers, riddlariddla

On Sun, Apr 05, 2020 at 08:23:36PM +0800, DENG Qingfang wrote:
> Hello,
> I found a bug of DSA that breaks WiFi clients roaming.
> 
> [...]
> 
> Does anyone know how to fix it?

Hi Qingfang

I've had similar reports from somebody else.

Did you try playing with auto learning for the CPU port?

    Andrew

* Re: DSA breaks clients' roaming between switch port and host interfaces
@ 2020-04-05 18:17 DENG Qingfang
From: DENG Qingfang @ 2020-04-05 18:17 UTC
  To: Andrew Lunn
  Cc: netdev, Vivien Didelot, Florian Fainelli, Russell King,
	Chuanhong Guo, René van Dorst, John Crispin, Hauke Mehrtens,
	Stijn Segers, riddlariddla

I just tried it. It works, but I think there are side effects, right?
