All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ido Schimmel <idosch@idosch.org>
To: Vladimir Oltean <olteanv@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>,
	Florian Fainelli <f.fainelli@gmail.com>,
	Vivien Didelot <vivien.didelot@gmail.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jiri Pirko <jiri@resnulli.us>, Jakub Kicinski <kuba@kernel.org>,
	Ivan Vecera <ivecera@redhat.com>, netdev <netdev@vger.kernel.org>,
	Horatiu Vultur <horatiu.vultur@microchip.com>,
	"Allan W. Nielsen" <allan.nielsen@microchip.com>,
	Nikolay Aleksandrov <nikolay@cumulusnetworks.com>,
	Roopa Prabhu <roopa@cumulusnetworks.com>
Subject: Re: [PATCH RFC net-next 10/13] net: bridge: add port flags for host flooding
Date: Mon, 27 Jul 2020 20:15:01 +0300	[thread overview]
Message-ID: <20200727171501.GB1910935@shredder> (raw)
In-Reply-To: <20200723223551.d23ol2oriv474bn4@skbuf>

On Fri, Jul 24, 2020 at 01:35:51AM +0300, Vladimir Oltean wrote:
> On Mon, May 25, 2020 at 11:11:11PM +0300, Ido Schimmel wrote:
> > On Sun, May 24, 2020 at 07:13:46PM +0300, Vladimir Oltean wrote:
> > > Hi Ido,
> > > 
> > > On Sun, 24 May 2020 at 17:26, Ido Schimmel <idosch@idosch.org> wrote:
> > > >
> > > > On Fri, May 22, 2020 at 12:10:33AM +0300, Vladimir Oltean wrote:
> > > > > From: Vladimir Oltean <vladimir.oltean@nxp.com>
> > > > >
> > > > > In cases where the bridge is offloaded by a switchdev, there are
> > > > > situations where we can optimize RX filtering towards the host. To be
> > > > > precise, the host only needs to do termination, which it can do by
> > > > > responding at the MAC addresses of the slave ports and of the bridge
> > > > > interface itself. But most notably, it doesn't need to do forwarding,
> > > > > so there is no need to see packets with unknown destination address.
> > > > >
> > > > > But there are, however, cases when a switchdev does need to flood to the
> > > > > CPU. Such an example is when the switchdev is bridged with a foreign
> > > > > interface, and since there is no offloaded datapath, packets need to
> > > > > pass through the CPU. Currently this is the only identified case, but it
> > > > > can be extended at any time.
> > > > >
> > > > > So far, switchdev implementers made driver-level assumptions, such as:
> > > > > this chip is never integrated in SoCs where it can be bridged with a
> > > > > foreign interface, so I'll just disable host flooding and save some CPU
> > > > > cycles. Or: I can never know what else can be bridged with this
> > > > > switchdev port, so I must leave host flooding enabled in any case.
> > > > >
> > > > > Let the bridge drive the host flooding decision, and pass it to
> > > > > switchdev via the same mechanism as the external flooding flags.
> > > > >
> > > > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> > > > > ---
> > > > >  include/linux/if_bridge.h |  3 +++
> > > > >  net/bridge/br_if.c        | 40 +++++++++++++++++++++++++++++++++++++++
> > > > >  net/bridge/br_switchdev.c |  4 +++-
> > > > >  3 files changed, 46 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
> > > > > index b3a8d3054af0..6891a432862d 100644
> > > > > --- a/include/linux/if_bridge.h
> > > > > +++ b/include/linux/if_bridge.h
> > > > > @@ -49,6 +49,9 @@ struct br_ip_list {
> > > > >  #define BR_ISOLATED          BIT(16)
> > > > >  #define BR_MRP_AWARE         BIT(17)
> > > > >  #define BR_MRP_LOST_CONT     BIT(18)
> > > > > +#define BR_HOST_FLOOD                BIT(19)
> > > > > +#define BR_HOST_MCAST_FLOOD  BIT(20)
> > > > > +#define BR_HOST_BCAST_FLOOD  BIT(21)
> > > > >
> > > > >  #define BR_DEFAULT_AGEING_TIME       (300 * HZ)
> > > > >
> > > > > diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> > > > > index a0e9a7937412..aae59d1e619b 100644
> > > > > --- a/net/bridge/br_if.c
> > > > > +++ b/net/bridge/br_if.c
> > > > > @@ -166,6 +166,45 @@ void br_manage_promisc(struct net_bridge *br)
> > > > >       }
> > > > >  }
> > > > >
> > > > > +static int br_manage_host_flood(struct net_bridge *br)
> > > > > +{
> > > > > +     const unsigned long mask = BR_HOST_FLOOD | BR_HOST_MCAST_FLOOD |
> > > > > +                                BR_HOST_BCAST_FLOOD;
> > > > > +     struct net_bridge_port *p, *q;
> > > > > +
> > > > > +     list_for_each_entry(p, &br->port_list, list) {
> > > > > +             unsigned long flags = p->flags;
> > > > > +             bool sw_bridging = false;
> > > > > +             int err;
> > > > > +
> > > > > +             list_for_each_entry(q, &br->port_list, list) {
> > > > > +                     if (p == q)
> > > > > +                             continue;
> > > > > +
> > > > > +                     if (!netdev_port_same_parent_id(p->dev, q->dev)) {
> > > > > +                             sw_bridging = true;
> > > >
> > > > It's not that simple. There are cases where not all bridge slaves have
> > > > the same parent ID and still there is no reason to flood traffic to the
> > > > CPU. VXLAN, for example.
> > > >
> > > > You could argue that the VXLAN device needs to have the same parent ID
> > > > as the physical netdevs member in the bridge, but it will break your
> > > > data path. For example, lets assume your hardware decided to flood a
> > > > packet in L2. The packet will egress all the local ports, but will also
> > > > perform VXLAN encapsulation. The packet continues with the IP of the
> > > > remote VTEP(s) to the underlay router and then encounters a neighbour
> > > > miss exception, which sends it to the CPU for resolution.
> > > >
> > > > Since this exception was encountered in the router the driver would mark
> > > > the packet with 'offload_fwd_mark', as it already performed L2
> > > > forwarding. If the VXLAN device has the same parent ID as the physical
> > > > netdevs, then the Linux bridge will never let it egress, nothing will
> > > > trigger neighbour resolution and the packet will be discarded.
> > > >
> > > 
> > > I wasn't going to argue that.
> > > Ok, so with a bridged VXLAN only certain multicast DMACs corresponding
> > > to multicast IPs should be flooded to the CPU.
> > > Actually Allan's example was a bit simpler, he said that host flooding
> > > can be made a per-VLAN flag. I'm glad that you raised this. So maybe
> > > we should try to define some mechanism by which virtual interfaces can
> > > specify to the bridge that they don't need to see all traffic? Do you
> > > have any ideas?
> > 
> > Maybe, when a port joins a bridge, query member ports if they can
> > forward traffic to it in hardware and based on the answer determine the
> > flooding towards the CPU?
> > 
> 
> Hi Ido, Allan,
> 
> I understand less and less of this. What I don't really understand is,
> if you have a switchdev bridged with a vtep like this:
> 
>  +-------------------------+
>  |           br0           |
>  +-------------------------+
>      |                |
>      |           +--------+
>      |           | vxlan0 |
>      |           +--------+
>      |                |
>  +--------+      +--------+
>  |  swp0  |      |  eth0  |
>  +--------+      +--------+
> 
> why would the swp0 interface care about the remote_ip at all. To the
> traffic seen by swp0, the VXLAN segment doesn't exist. Encapsulation and
> decapsulation all happen outside of the switchdev interface. All that
> switchdev sees is that, from the CPU side, it's talking to a bunch of
> MAC addresses.

I don't understand "Encapsulation and decapsulation all happen outside
of the switchdev interface". What does it mean? Encapsulation and
decapsulation happen in hardware... Frame is received by swp0, forwarded
to hardware VTEP, encapsulated and routed towards its destination. The
CPU does not see the packet. Same with decapsulation.

You patch instructs drivers to flood traffic to the CPU if netdevs with
different parent ID are member in it. I explained that this breaks with
VXLAN.

swp0 and vxlan0 do not have the same parent ID yet this does not mean
packets should be flooded to the CPU. I also explained why they should
not have the same parent ID:

1. Packet is received by swp0
2. Forwarded in hardware to hardware VTEP
3. VTEP performs encapsulation
4. VXLAN packet is routed in hardware
5. VXLAN packet encounters a neighbour miss in router
6. Original packet is trapped to CPU from swp0 because of an unresolved
neighbour
7. Driver marks packet with 'offload_fwd_mark' because it was already L2
forwarded in hardware
8. Packet reaches software bridge
9. bridge does not forward it to swpX netdevs because they share the
same parent ID as swp0 and packet is marked
10. Packet is forwarded to vxlan0
11. Packet is routed in and neighbour is resolved

Since the neighbour is now resolved the next packet will be completely
forwarded in hardware.

If vxlan0 and swp0 had the same parent ID, then step 10 would never
happen and the neighbour would never be resolved.

> 
> The same comment also applies for 8021q, in fact. I did try this
> experiment, to bridge a switchdev with a VLAN sub-interface of another
> port. I don't know why, I used to have the misconception that the desire
> in doing that would be to somehow only extract one VLAN ID from the
> switchdev, and the rest could be kept outside of the CPU's flooding
> domain. But that isn't the case at all. When bridging, I'm bridging the
> _entire_ traffic of swp0 with, say, eth0.100. And, as in the case of
> vxlan, encap/decap all happens outside of switchdev. So, contrary to my
> initial expectation, if I'm receiving on swp0 a packet tagged with VLAN
> 100, it would end up exiting the bridge, on eth0, with 2 VLAN tags with
> ID 100.
> 
> Simply put, I think my change is fine the way it is. Either that, or I
> just don't understand your comment about querying bridge members whether
> they can forward in hardware. How are you dealing with this today?
> 
> Thanks,
> -Vladimir

  reply	other threads:[~2020-07-27 17:15 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-21 21:10 [PATCH RFC net-next 00/13] RX filtering for DSA switches Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 01/13] net: core: dev_addr_lists: add VID to device address Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 02/13] net: 8021q: vlan_dev: add vid tag to addresses of uc and mc lists Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 03/13] net: 8021q: vlan_dev: add vid tag for vlan device own address Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 04/13] ethernet: eth: add default vid len for all ethernet kind devices Vladimir Oltean
2020-05-22  0:34   ` kbuild test robot
2020-05-22  9:59   ` kbuild test robot
2020-05-21 21:10 ` [PATCH RFC net-next 05/13] net: bridge: multicast: propagate br_mc_disabled_update() return Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 06/13] net: core: dev_addr_lists: export some raw __hw_addr helpers Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 07/13] net: dsa: don't use switchdev_notifier_fdb_info in dsa_switchdev_event_work Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 08/13] net: dsa: add ability to program unicast and multicast filters for CPU port Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 09/13] net: dsa: mroute: don't panic the kernel if called without the prepare phase Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 10/13] net: bridge: add port flags for host flooding Vladimir Oltean
2020-05-22 12:38   ` Nikolay Aleksandrov
2020-05-22 13:13     ` Vladimir Oltean
2020-05-22 18:45       ` Allan W. Nielsen
2020-07-20 11:08         ` Vladimir Oltean
2020-05-24 14:26   ` Ido Schimmel
2020-05-24 16:13     ` Vladimir Oltean
2020-05-25 20:11       ` Ido Schimmel
2020-05-25 20:32         ` Vladimir Oltean
2020-07-23 22:35         ` Vladimir Oltean
2020-07-27 17:15           ` Ido Schimmel [this message]
2020-05-21 21:10 ` [PATCH RFC net-next 11/13] net: dsa: deal with new flooding port attributes from bridge Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 12/13] net: dsa: treat switchdev notifications for multicast router connected to port Vladimir Oltean
2020-05-21 21:10 ` [PATCH RFC net-next 13/13] net: dsa: wire up multicast IGMP snooping attribute notification Vladimir Oltean
2020-05-22 18:42 ` [PATCH RFC net-next 00/13] RX filtering for DSA switches Allan W. Nielsen
2020-05-24 14:06 ` Ido Schimmel
2020-05-24 16:24   ` Vladimir Oltean
2020-05-25 19:48     ` Ido Schimmel
2020-05-25 20:23       ` Vladimir Oltean
2020-05-26 14:01         ` Ido Schimmel
2020-05-27 11:36           ` Vladimir Oltean
2020-05-28 14:37             ` Ido Schimmel
2020-07-20 10:00               ` Vladimir Oltean
2020-07-27 16:56                 ` Ido Schimmel
2020-10-27 11:52                   ` Vladimir Oltean
2020-10-28 14:43                     ` Ido Schimmel
2020-10-28 18:46                       ` Vladimir Oltean
2020-11-01 11:27                         ` Ido Schimmel
2020-11-01 12:06                           ` Vladimir Oltean
2020-11-01 14:42                             ` Ido Schimmel
2020-11-01 15:04                               ` Vladimir Oltean
2020-11-01 15:39                                 ` Ido Schimmel
2020-11-01 16:13                                   ` Vladimir Oltean
2020-11-11  4:12                                     ` Florian Fainelli
2020-05-24 16:13 ` Florian Fainelli
2020-05-24 16:34   ` Vladimir Oltean

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200727171501.GB1910935@shredder \
    --to=idosch@idosch.org \
    --cc=allan.nielsen@microchip.com \
    --cc=andrew@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=f.fainelli@gmail.com \
    --cc=horatiu.vultur@microchip.com \
    --cc=ivecera@redhat.com \
    --cc=jiri@resnulli.us \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=nikolay@cumulusnetworks.com \
    --cc=olteanv@gmail.com \
    --cc=roopa@cumulusnetworks.com \
    --cc=vivien.didelot@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.