Re: [PATCH v3 net-next 2/4] net: dsa: Link aggregation support

From: Vladimir Oltean <olteanv@gmail.com>
To: Tobias Waldekranz <tobias@waldekranz.com>
Cc: davem@davemloft.net, kuba@kernel.org, andrew@lunn.ch,
	vivien.didelot@gmail.com, f.fainelli@gmail.com,
	j.vosburgh@gmail.com, vfalico@gmail.com, andy@greyhouse.net,
	netdev@vger.kernel.org
Subject: Re: [PATCH v3 net-next 2/4] net: dsa: Link aggregation support
Date: Tue, 8 Dec 2020 13:23:50 +0200
Message-ID: <20201208112350.kuvlaxqto37igczk@skbuf>
In-Reply-To: <20201202091356.24075-3-tobias@waldekranz.com>

Hi Tobias,

On Wed, Dec 02, 2020 at 10:13:54AM +0100, Tobias Waldekranz wrote:
> Monitor the following events and notify the driver when:
>
> - A DSA port joins/leaves a LAG.
> - A LAG, made up of DSA ports, joins/leaves a bridge.
> - A DSA port in a LAG is enabled/disabled (enabled meaning
>   "distributing" in 802.3ad LACP terms).
>
> Each LAG interface to which a DSA port is attached is represented by a
> `struct dsa_lag` which is globally reachable from the switch tree and
> from each associated port.
>
> When a LAG joins a bridge, the DSA subsystem will treat that as each
> individual port joining the bridge. The driver may look at the port's
> LAG pointer to see if it is associated with any LAG, if that is
> required. This is analogue to how switchdev events are replicated out
> to all lower devices when reaching e.g. a LAG.
>
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---
>
> +struct dsa_lag {
> +	struct net_device *dev;
> +	int id;
> +
> +	struct list_head ports;
> +
> +	/* For multichip systems, we must ensure that each hash bucket
> +	 * is only enabled on a single egress port throughout the
> +	 * whole tree, lest we send duplicates. Therefore we must
> +	 * maintain a global list of active tx ports, so that each
> +	 * switch can figure out which buckets to enable on which
> +	 * ports.
> +	 */
> +	struct list_head tx_ports;
> +	int num_tx;
> +
> +	refcount_t refcount;
> +};

Sorry it took so long. I wanted to understand:
(a) Where the challenges are for drivers to uniformly support software
    bridging when they already have code for bridge offloading. I found
    the following issues:
    - We have taggers that unconditionally set skb->offload_fwd_mark = 1,
      which kind of prevents software bridging. I'm not sure what the
      fix for these should be.
    - Source address learning is a big problem, but this time not in the
      sense that it traditionally has been. Specifically, because address
      learning is enabled, the hardware FDB will direct known destinations
      to the autonomous fast path. But surprise, the autonomous fast
      path is blocked, because as far as the switch is concerned, the
      ports are standalone and not offloading the bridge. We have drivers
      that don't disable address learning when they operate in standalone
      mode, which is something they definitely should do.
    There is nothing actionable for you in this patch set to resolve this.
    I just wanted to get an idea.
(b) Whether struct dsa_lag really brings us any significant benefit. I
    found that it doesn't. It's a lot of code added to the DSA core that
    does not really belong in the middle layer. I need to go back and
    quote your motivation in the RFC:

| All LAG configuration is cached in `struct dsa_lag`s. I realize that
| the standard M.O. of DSA is to read back information from hardware
| when required. With LAGs this becomes very tricky though. For example,
| the change of a link state on one switch will require re-balancing of
| LAG hash buckets on another one, which in turn depends on the total
| number of active links in the LAG. Do you agree that this is
| motivated?

    After reimplementing bonding offload in ocelot, I have found
    struct dsa_lag to not provide any benefit. All the information a
    driver needs is already provided through the
    struct net_device *lag_dev argument given to lag_join and lag_leave,
    and through the struct netdev_lag_lower_state_info *info given to
    lag_change. I will send an RFC to you and the list shortly to prove
    that this information is absolutely sufficient for the driver to do
    decent internal bookkeeping, and that DSA should not really care
    beyond that.
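
To make that concrete, here is a rough user-space sketch of the kind of
driver-internal bookkeeping I have in mind. It is not the ocelot code.
Every name prefixed with sketch_ is hypothetical, struct net_device is
reduced to an opaque stand-in, and the only inputs are the ones the
callbacks already receive: the lag_dev pointer on join/leave, and the
tx_enabled flag from struct netdev_lag_lower_state_info on change. It
covers a single switch only, so it says nothing about the cross-chip case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NUM_PORTS 4

/* Opaque stand-in for the kernel's struct net_device. */
struct net_device {
	const char *name;
};

/* Hypothetical per-port driver state, not taken from any real driver. */
struct sketch_port {
	const struct net_device *lag_dev;	/* NULL when not in a LAG */
	bool tx_enabled;			/* last value seen in lag_change */
};

struct sketch_priv {
	struct sketch_port ports[SKETCH_NUM_PORTS];
};

/* Modeled on lag_join: remember which LAG this port now belongs to. */
static void sketch_lag_join(struct sketch_priv *priv, int port,
			    const struct net_device *lag_dev)
{
	assert(port >= 0 && port < SKETCH_NUM_PORTS);
	priv->ports[port].lag_dev = lag_dev;
	priv->ports[port].tx_enabled = false;
}

/* Modeled on lag_leave: forget the LAG association. */
static void sketch_lag_leave(struct sketch_priv *priv, int port)
{
	assert(port >= 0 && port < SKETCH_NUM_PORTS);
	priv->ports[port].lag_dev = NULL;
	priv->ports[port].tx_enabled = false;
}

/* Modeled on lag_change: record whether the port is distributing. */
static void sketch_lag_change(struct sketch_priv *priv, int port,
			      bool tx_enabled)
{
	assert(port >= 0 && port < SKETCH_NUM_PORTS);
	if (priv->ports[port].lag_dev)
		priv->ports[port].tx_enabled = tx_enabled;
}

/* Number of active tx ports in a LAG, derived from the driver's own state. */
static int sketch_lag_num_tx(const struct sketch_priv *priv,
			     const struct net_device *lag_dev)
{
	int port, num_tx = 0;

	for (port = 0; port < SKETCH_NUM_PORTS; port++)
		if (priv->ports[port].lag_dev == lag_dev &&
		    priv->ports[port].tx_enabled)
			num_tx++;

	return num_tx;
}
```

The count that struct dsa_lag caches in num_tx is recomputed here on
demand by walking the ports, which is cheap for the port counts we are
talking about.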

    There are two points to be made:
    - Recently we have seen people with non-DSA (pure switchdev) hardware
      being compelled to write DSA drivers, because they noticed that a
      large part of the middle layer had already been written, and it
      presents an API with a lot of syntactic sugar. Maybe there is a
      larger issue here in that the switchdev offloading APIs are fairly
      bulky and repetitive, but that does not mean that we should be
      encouraging the attitude "come to DSA, we have cookies".
      https://lwn.net/ml/linux-kernel/20201125232459.378-1-lukma@denx.de/
    - Remember that the only reason why the DSA framework and the
      syntactic sugar exist is that we are presenting the hardware a
      unified view for the ports which have a struct net_device registered,
      and the ports which don't (DSA links and CPU ports). The argument
      really needs to be broken down into two:
      - For cross-chip DSA links, I can see why it was convenient for
        you to have the dsa_lag_by_dev(ds->dst, lag_dev) helper. But
        just as we currently have a struct net_device *bridge_dev in
        struct dsa_port, we could have a struct net_device *bond,
        without the extra fat of struct dsa_lag. Reference counting,
        active ports, etc. would become simpler (actually nonexistent
        as far as the DSA layer is concerned). Two ports are in the same
        bond if they have the same struct net_device *bond, just as they
        are bridged if they have the same struct net_device *bridge_dev.
      - For CPU ports, this raises an important question, which is
        whether LAG on switches with multiple CPU ports is ever going to
        be a thing. And if it is, how it is even going to be configured
        from the user's perspective. Because on a multi-CPU port system,
        you should actually see it as two bonding interfaces back to back.
        First, there's the bonding interface that spans the DSA masters.
        That needs no hardware offloading. Then there's the bonding
        interface that is the mirror image of that, and spans the CPU
        ports. I think this is a bit up in the air now, because with
        or without your struct dsa_lag, we still have no bonding device
        associated with the CPU ports, so things like the balancing
        policy are not really defined.
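
As a rough illustration of the cross-chip point above, the pointer-based
alternative could look like the user-space sketch below. The struct is a
trimmed, hypothetical model of struct dsa_port: only bridge_dev reflects
a field mentioned here as existing today, while the bond field and both
helper names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Opaque stand-in for the kernel's struct net_device. */
struct net_device {
	const char *name;
};

/* Trimmed, hypothetical model of struct dsa_port. */
struct sketch_dsa_port {
	struct net_device *bridge_dev;	/* existing idea: bridge membership */
	struct net_device *bond;	/* proposed: LAG membership, NULL if none */
};

/* Two ports are bridged if they point at the same bridge device. */
static bool sketch_ports_bridged(const struct sketch_dsa_port *a,
				 const struct sketch_dsa_port *b)
{
	assert(a && b);
	return a->bridge_dev && a->bridge_dev == b->bridge_dev;
}

/* Likewise, two ports are in the same bond if they share the bond device. */
static bool sketch_ports_bonded(const struct sketch_dsa_port *a,
				const struct sketch_dsa_port *b)
{
	assert(a && b);
	return a->bond && a->bond == b->bond;
}
```

With this shape there is nothing to allocate or reference count in the
middle layer: joining a LAG sets the pointer, leaving clears it.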

I would like you to reiterate some of the reasons why you prefer having
struct dsa_lag.