From: Vladimir Oltean <olteanv@gmail.com>
To: Tobias Waldekranz <tobias@waldekranz.com>
Cc: davem@davemloft.net, kuba@kernel.org, andrew@lunn.ch,
vivien.didelot@gmail.com, f.fainelli@gmail.com,
j.vosburgh@gmail.com, vfalico@gmail.com, andy@greyhouse.net,
netdev@vger.kernel.org
Subject: Re: [PATCH v3 net-next 2/4] net: dsa: Link aggregation support
Date: Tue, 8 Dec 2020 18:37:51 +0200 [thread overview]
Message-ID: <20201208163751.4c73gkdmy4byv3rp@skbuf> (raw)
In-Reply-To: <87mtyo5n40.fsf@waldekranz.com>
On Tue, Dec 08, 2020 at 04:33:19PM +0100, Tobias Waldekranz wrote:
> On Tue, Dec 08, 2020 at 13:23, Vladimir Oltean <olteanv@gmail.com> wrote:
> > Sorry it took so long. I wanted to understand:
> > (a) where are the challenged for drivers to uniformly support software
> > bridging when they already have code for bridge offloading. I found
> > the following issues:
> > - We have taggers that unconditionally set skb->offload_fwd_mark = 1,
> > which kind of prevents software bridging. I'm not sure what the
> > fix for these should be.
>
> At least on mv88e6xxx you would not be able to determine this simply
> from looking at the tag. Both in standalone mode and bridged mode, you
> would receive FORWARDs with the same source. You could look at
> dp->bridge_dev to figure it out though.
Yes, but that raises the question whether it should be DSA that fixes it
up globally for everyone, like in dsa_switch_rcv:
if (!dp->bridge_dev)
skb->offload_fwd_mark = 0;
with a nice comment above that everyone can refer to,
or make each and every tagger do this. I'm leaning towards the latter.
> > - Source address is a big problem, but this time not in the sense
> > that it traditionally has been. Specifically, due to address
> > learning being enabled, the hardware FDB will set destinations to
> > take the autonomous fast path. But surprise, the autonomous fast
> > path is blocked, because as far as the switch is concerned, the
> > ports are standalone and not offloading the bridge. We have drivers
> > that don't disable address learning when they operate in standalone
> > mode, which is something they definitely should do.
>
> Some hardware can function with it on (e.g. mv88e6xxx can associate an
> FDB per port), but there is no reason to do it, so yes it should be
> disabled.
So how does mv88e6xxx handle the address learning issue currently?
> > There is nothing actionable for you in this patch set to resolve this.
> > I just wanted to get an idea.
> > (b) Whether struct dsa_lag really brings us any significant benefit. I
> > found that it doesn't. It's a lot of code added to the DSA core, that
> > should not really belong in the middle layer. I need to go back and
> > quote your motivation in the RFC:
> >
> > | All LAG configuration is cached in `struct dsa_lag`s. I realize that
> > | the standard M.O. of DSA is to read back information from hardware
> > | when required. With LAGs this becomes very tricky though. For example,
> > | the change of a link state on one switch will require re-balancing of
> > | LAG hash buckets on another one, which in turn depends on the total
> > | number of active links in the LAG. Do you agree that this is
> > | motivated?
> >
> > After reimplementing bonding offload in ocelot, I have found
> > struct dsa_lag to not provide any benefit. All the information a
> > driver needs is already provided through the
> > struct net_device *lag_dev argument given to lag_join and lag_leave,
> > and through the struct netdev_lag_lower_state_info *info given to
> > lag_change. I will send an RFC to you and the list shortly to prove
> > that this information is absolutely sufficient for the driver to do
> > decent internal bookkeeping, and that DSA should not really care
> > beyond that.
>
> Do you have a multi-chip setup? If not then I understand that `struct
> dsa_lag` does not give you anything extra. In a multi-chip scenario
> things become harder. Example:
>
> .-----. .-----.
> | sw0 +---+ sw1 |
> '-+-+-'3 3'--+--'
> 1 2 1
>
> Let's say that sw0p1, sw0p2 and sw1p1 are in a LAG. This system can hash
> flows into 8 buckets. So with all ports active you would need an
> allocation like this:
>
> sw0p1: 0,1,2
> sw0p2: 3,4,5
> sw1p1: 6,7
>
> For some reason, the system determines that sw0p2 is now inactive and
> the LAG should be rebalanced over the two remaining active links:
>
> sw0p1: 0,1,2,3
> sw0p2: -
> sw1p1: 4,5,6,7
>
> In order for sw0 and sw1 to agree on the assignment they need access to
> a shared view of the LAG at the tree level, both about the set of active
> ports and their ordering. This is `struct dsa_lag`s main raison d'être.
Yup, you could do that just like I did with ocelot, aka keep in
struct dsa_port:
struct net_device *bond;
bool lag_tx_active;
This is enough to replace your usage of:
list_for_each_entry(dp, &lag->ports, lag_list) {
...
}
with:
list_for_each_entry(dp, &dst->ports, list) {
if (dp->bond != lag_dev)
continue;
...
}
and:
list_for_each_entry(dp, &lag->tx_ports, lag_tx_list) {
...
}
with:
list_for_each_entry(dp, &dst->ports, list) {
if (dp->bond != lag_dev || !dp->lag_tx_active)
continue;
...
}
The amount of iteration that you would do would be about the same. Just
this:
struct dsa_lag *lag = dsa_lag_by_dev(ds->dst, lag_dev);
would need to be replaced with something more precise, depending on what
you need the struct dsa_lag pointer for. You use it in crosschip_lag_join
and in crosschip_lag_leave to call mv88e6xxx_lag_sync_map, where you
again iterate over the ports in the DSA switch tree. But if you passed
just the struct net_device *lag_dev directly, you could keep the same
iteration and do away with the reference-counted struct dsa_lag.
> The same goes for when a port joins/leaves a LAG. For example, if sw1p1
> was to leave the LAG, we want to make sure that we do not needlessly
> flood LAG traffic over the backplane (sw0p3<->sw1p3). If you want to
> solve this at the ds level without `struct dsa_lag`, you need a refcount
> per backplane port in order to figure out if the leaving port was the
> last one behind that backplane port.
Humm, why?
Nothing would change. Just as you start with a map of 0 in
mv88e6xxx_lag_sync_map, and use dsa_towards_port for every dp that you
find in the switch tree, I am saying keep that iteration, but don't keep
those extra lists for the bonded ports and the active bonded ports. Just
use as a search key the LAG net device itself, and keep an extra bool
per dp. I think it's really simpler this way, with a lot less overhead
in terms of data structures, lists and whatnot.
> > - Remember that the only reason why the DSA framework and the
> > syntactic sugar exists is that we are presenting the hardware a
> > unified view for the ports which have a struct net_device registered,
> > and the ports which don't (DSA links and CPU ports). The argument
> > really needs to be broken down into two:
> > - For cross-chip DSA links, I can see why it was convenient for
> > you to have the dsa_lag_by_dev(ds->dst, lag_dev) helper. But
> > just as we currently have a struct net_device *bridge_dev in
> > struct dsa_port, so we could have a struct net_device *bond,
> > without the extra fat of struct dsa_lag, and reference counting,
> > active ports, etc etc, would become simpler (actually inexistent
> > as far as the DSA layer is concerned). Two ports are in the same
> > bond if they have the same struct net_device *bond, just as they
> > are bridged if they have the same struct net_device *bridge_dev.
> > - For CPU ports, this raises an important question, which is
> > whether LAG on switches with multiple CPU ports is ever going to
> > be a thing. And if it is, how it is even going to be configured
> > from the user's perspective. Because on a multi-CPU port system,
> > you should actually see it as two bonding interfaces back to back.
> > First, there's the bonding interface that spans the DSA masters.
> > That needs no hardware offloading. Then there's the bonding
> > interface that is the mirror image of that, and spans the CPU
> > ports. I think this is a bit up in the air now. Because with
>
> Aside. On our devices we use the term cpu0, cpu1 etc. to refer to a
> switch port that is connected to a CPU. The CPU side of those
> connections are chan0, chan1 ("channel"). I am not saying we have to
> adopt those, but some unambiguous terms would be great in these
> conversations.
You have me confused. chan0, chan1 are DSA master interfaces? Can we
keep calling them DSA master interfaces, or do you find that confusing?
> > your struct dsa_lag or without, we still have no bonding device
> > associated with it, so things like the balancing policy are not
> > really defined.
> >
>
> I have a different take on that. I do not think you need to create a
> user-visible LAG at all in that case. You just setup the hardware to
> treat the two CPU ports as a static LAG based on the information from
> the DT. Then you attach the same rx handler to both. On tx you hash and
> loadbalance on flows that allow it (FORWARDs on mv88e6xxx) and use the
> primary CPU port for control traffic (FROM_CPU).
>
> The CPU port is completely defined by the DT today, so I do not see why
> we could not add balancing policy to that if it is ever required.
Hashing policy for a bonding interface, defined in DT? Yummm.
> > I would like you to reiterate some of the reasons why you prefer having
> > struct dsa_lag.
>
> I hope I did that already. But I will add that if there was a dst->priv
> for the drivers to use as they see fit, I guess the bookkeeping could be
> moved into the mv88e6xxx driver instead if you feel it pollutes the DSA
> layer. Maybe you can not assume that all chips in a tree use a
> compatible driver though?
I don't think the DSA switch tree is private to anyone.
> Are there any other divers that support multi-chip that might want to
> use the same thing?
Nope.
next prev parent reply other threads:[~2020-12-08 16:39 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-02 9:13 [PATCH v3 net-next 0/4] net: dsa: Link aggregation support Tobias Waldekranz
2020-12-02 9:13 ` [PATCH v3 net-next 1/4] net: bonding: Notify ports about their initial state Tobias Waldekranz
2020-12-02 19:09 ` Jay Vosburgh
2020-12-02 21:52 ` Tobias Waldekranz
2020-12-03 0:39 ` Jay Vosburgh
2020-12-03 8:16 ` Tobias Waldekranz
2020-12-02 9:13 ` [PATCH v3 net-next 2/4] net: dsa: Link aggregation support Tobias Waldekranz
2020-12-02 10:07 ` Vladimir Oltean
2020-12-02 10:51 ` Tobias Waldekranz
2020-12-02 18:58 ` Jakub Kicinski
2020-12-02 21:29 ` Tobias Waldekranz
2020-12-02 21:32 ` Vladimir Oltean
2020-12-03 16:24 ` Vladimir Oltean
2020-12-03 20:53 ` Tobias Waldekranz
2020-12-03 21:09 ` Andrew Lunn
2020-12-03 21:35 ` Tobias Waldekranz
2020-12-04 0:35 ` Vladimir Oltean
2020-12-03 21:57 ` Vladimir Oltean
2020-12-03 23:12 ` Tobias Waldekranz
2020-12-04 0:56 ` Vladimir Oltean
2020-12-07 21:49 ` Tobias Waldekranz
2020-12-04 1:33 ` Andrew Lunn
2020-12-04 4:18 ` Florian Fainelli
2020-12-07 21:56 ` Tobias Waldekranz
2020-12-03 20:48 ` Vladimir Oltean
2020-12-04 2:20 ` Andrew Lunn
2020-12-07 21:19 ` Tobias Waldekranz
2020-12-07 23:26 ` Andrew Lunn
2020-12-09 8:57 ` Tobias Waldekranz
2020-12-09 14:27 ` Andrew Lunn
2020-12-09 15:21 ` Tobias Waldekranz
2020-12-09 23:03 ` Andrew Lunn
2020-12-04 4:04 ` Florian Fainelli
2020-12-08 11:23 ` Vladimir Oltean
2020-12-08 15:33 ` Tobias Waldekranz
2020-12-08 16:37 ` Vladimir Oltean [this message]
2020-12-09 8:37 ` Tobias Waldekranz
2020-12-09 10:53 ` Vladimir Oltean
2020-12-09 14:11 ` Tobias Waldekranz
2020-12-09 16:04 ` Vladimir Oltean
2020-12-09 22:01 ` Tobias Waldekranz
2020-12-09 22:21 ` Vladimir Oltean
2020-12-10 10:18 ` Tobias Waldekranz
2020-12-09 22:59 ` Andrew Lunn
2020-12-10 1:05 ` Vladimir Oltean
2020-12-09 14:23 ` Andrew Lunn
2020-12-09 23:17 ` Vladimir Oltean
2020-12-08 17:26 ` Andrew Lunn
2020-12-11 20:50 ` Tobias Waldekranz
2020-12-12 14:26 ` Vladimir Oltean
2020-12-13 21:18 ` Tobias Waldekranz
2020-12-14 0:12 ` Vladimir Oltean
2020-12-14 11:42 ` Ido Schimmel
2020-12-16 15:15 ` Tobias Waldekranz
2020-12-16 18:48 ` Ido Schimmel
2020-12-14 9:41 ` Tobias Waldekranz
2020-12-02 9:13 ` [PATCH v3 net-next 3/4] net: dsa: mv88e6xxx: " Tobias Waldekranz
2020-12-02 9:13 ` [PATCH v3 net-next 4/4] net: dsa: tag_dsa: Support reception of packets from LAG devices Tobias Waldekranz
2020-12-04 3:58 ` Florian Fainelli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201208163751.4c73gkdmy4byv3rp@skbuf \
--to=olteanv@gmail.com \
--cc=andrew@lunn.ch \
--cc=andy@greyhouse.net \
--cc=davem@davemloft.net \
--cc=f.fainelli@gmail.com \
--cc=j.vosburgh@gmail.com \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=tobias@waldekranz.com \
--cc=vfalico@gmail.com \
--cc=vivien.didelot@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).