From: Ansuel Smith <ansuelsmth@gmail.com>
To: Vladimir Oltean <olteanv@gmail.com>
Cc: "Marek Behun" <marek.behun@nic.cz>,
	netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Andrew Lunn" <andrew@lunn.ch>,
	"Vivien Didelot" <vivien.didelot@gmail.com>,
	"Florian Fainelli" <f.fainelli@gmail.com>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andriin@fb.com>,
	"Eric Dumazet" <edumazet@google.com>,
	"Wei Wang" <weiwan@google.com>,
	"Cong Wang" <cong.wang@bytedance.com>,
	"Taehee Yoo" <ap420073@gmail.com>,
	"Björn Töpel" <bjorn@kernel.org>,
	"zhang kai" <zhangkaiheb@126.com>,
	"Weilong Chen" <chenweilong@huawei.com>,
	"Roopa Prabhu" <roopa@cumulusnetworks.com>,
	"Di Zhu" <zhudi21@huawei.com>,
	"Francis Laniel" <laniel_francis@privacyrequired.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
Date: Mon, 12 Apr 2021 07:04:35 +0200	[thread overview]
Message-ID: <YHPU452sWbJ5Ciss@Ansuel-xps.localdomain> (raw)
In-Reply-To: <20210411185017.3xf7kxzzq2vefpwu@skbuf>

On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> > On Sat, 10 Apr 2021 15:34:46 +0200
> > Ansuel Smith <ansuelsmth@gmail.com> wrote:
> > 
> > > Hi,
> > > this is a respin of the Marek series, in the hope that this time we
> > > can finally make some progress with DSA support for multiple CPU ports.
> > > 
> > > This implementation is similar to the Marek series but with some tweaks.
> > > This adds support for multiple CPU ports but leaves the driver the
> > > decision of what logic to use when assigning a CPU port to the
> > > various ports. The driver can also express no preference, in which
> > > case the CPU port is assigned in a round-robin way.
> > 
> > In the last couple of months I have been giving some thought to this
> > problem, and came up with one important thing: if there are multiple
> > upstream ports, it would make a lot of sense to dynamically reallocate
> > them to each user port, based on which user port is actually used, and
> > at what speed.
> > 
> > For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> > ports support at most 1 Gbps. Round-robin would assign:
> >   CPU port 0 - Port 0
> >   CPU port 1 - Port 1
> >   CPU port 0 - Port 2
> >   CPU port 1 - Port 3
> >   CPU port 0 - Port 4
> > 
> > Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> > with 1, 3 and 4 free:
> >   CPU port 0 - Port 0 (plugged)
> >   CPU port 1 - Port 1 (free)
> >   CPU port 0 - Port 2 (plugged)
> >   CPU port 1 - Port 3 (free)
> >   CPU port 0 - Port 4 (free)
> > 
> > We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> > CPU, and the second CPU port is not used at all.
> > 
> > A mechanism for automatic reassignment of CPU ports would be ideal here.
> > 
> > What do you guys think?
> 
> The reason why I don't think this is such a great idea is because the
> CPU port assignment is a major reconfiguration step which should at the
> very least be done while the network is down, to avoid races with the
> data path (something which this series does not appear to handle).
> And if you allow the static user-port-to-CPU-port assignment to change
> every time a link goes up/down, I don't think you really want to force
> the network down through the entire switch basically.
> 
> So I'd be tempted to say 'tough luck' if all your ports are not up, and
> the ones that are are assigned statically to the same CPU port. It's a
> compromise between flexibility and simplicity, and I would go for
> simplicity here. That's the most you can achieve with static assignment,
> just put the CPU ports in a LAG if you want better dynamic load balancing
> (for details read on below).
> 
> But this brings us to another topic, which I've been discussing with
> Florian. I am also interested in the multi CPU ports topic for the
> NXP LS1028A SoC, which uses the felix driver for its embedded switch.
> I need to explain some of the complexities there, in order to lay out
> what are the aspects which should ideally be supported.
> 
> The Ocelot switch family (which Felix is a part of) doesn't actually
> support more than one "NPI" port as it's called (when the CPU port
> module's queues are linked to an Ethernet port, which is what DSA calls
> the "CPU port"). So you'd be tempted to say that a DSA setup with
> multiple CPU ports is not realizable for this SoC.
> 
> But in fact, there are 2 Ethernet ports connecting the embedded switch
> and the CPU, one port is at 2.5Gbps and the other is at 1Gbps. We can
> dynamically choose which one is the NPI port through device tree
> (arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi), and at the moment, we
> choose the 2.5Gbps port as DSA CPU port, and we disable the 1Gbps
> internal port. If we wanted to, we could enable the second internal port
> as an internally-facing user port, but that's a bit awkward due to
> multi-homing. Nonetheless, this is all that's achievable using the NPI
> port functionality.
> 
> However, due to some unrelated issues, the Felix switch has ended up
> supporting two tagging protocols in fact. So there is now an option
> through which the user can run this command:
> 
>   echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
> 
> (where eno2 is the DSA master)
> and the switch will disable the NPI port and set up some VLAN
> pushing/popping rules through which DSA gets everything it needs to
> offer absolutely the same services towards the upper network stack
> layer, but without enabling the hardware functionality for a CPU port
> (as far as the switch hardware is aware, it is unmanaged).
> 
> This opens up some possibilities because we no longer have the
> limitation that there can be only 1 NPI port in the system. As you'd
> have it, I believe that any DSA switch with a fully programmable "port
> forwarding matrix" (aka a bitmap which answers the question "can port i
> send packets to port j?") is capable to some degree of supporting
> multiple DSA CPU ports, in the statically-assigned fashion that this
> patch series attempts to achieve. Namely, you just configure the port
> forwarding matrix to allow user port i to flood traffic to one CPU port
> but not to the other, and you disable communication between the CPU
> ports.
> 
> But there is something which is even more interesting about Felix with
> the ocelot-8021q tagger. Since Marek posted his RFC and until Ansuel
> posted the follow-up, things have happened, and now both Felix and the
> Marvell driver support LAG offload via the bonding and/or team drivers.
> At least for Felix, when using the ocelot-8021q tagger, it should be
> possible to put the two CPU ports in a hardware LAG, and the two DSA
> masters in a software LAG, and let the bond/team upper of the DSA
> masters be the CPU port.
> 
> I would like us to keep the door open for both alternatives, and to have
> a way to switch between static user-to-CPU port assignment, and LAG.
> I think that if there are multiple 'ethernet = ' phandles present in the
> device tree, DSA should populate a list of valid DSA masters, and then
> call into the driver to allow it to select which master it prefers for
> each user port. This is similar to what Ansuel added with 'port_get_preferred_cpu',
> except that I chose "DSA master" and not "CPU port" for a specific reason.
> For LAG, the DSA master would be bond0.
> 
> In fact, in my case, this CPU port election procedure should also be
> repeated when the tagging protocol changes, because Felix will be forced
> to choose the same DSA master for all user ports at probe time, because
> it boots up with the standard NPI-based "ocelot" tagger. So it can only
> make use of the second CPU port when the tagging protocol changes.
> 
> Changing the DSA tagging protocol has to be done with the network down
> (ip link set eno2 down && ip link set eno3 down). If we were to bring
> eno2 and eno3 back up now, DSA or the driver would choose one of the DSA
> masters for every port, round robin or what not. But we don't bring
> the masters up yet, we create a bonding interface and we enslave eno2
> and eno3 to it. DSA should detect this and add bond0 to its list of
> candidates for a DSA master. Only now we bring up the masters, and the
> port_get_preferred_cpu() function (or however it might end up being
> named) should give the driver the option to select bond0 as a valid DSA
> master.
> 
> Using bond0 as a DSA master would need some changes to DSA, because
> currently it won't work. Namely, the RX data path needs the netdev->dsa_ptr
> populated on the DSA master, whose type is a struct dsa_port *cpu_dp.
> So, logically, we need a *cpu_dp corresponding to bond0.
> 
> One idea to solve this is to reuse something we already have: the
> current struct dsa_switch_tree :: lags array of net devices. These hold
> pointers to bonding interfaces now, but we could turn them into an array
> of struct dsa_port "logical ports". The idea is that through this
> change, every LAG offloaded by DSA will have a corresponding "logical
> dp" which isn't present in the dst->ports list. Since every struct
> dsa_port already has a lag_dev pointer, transforming the "dst->lags"
> array from an array of LAG net devices into an array of logical DSA
> ports will cover the existing use case as well: a logical port will
> always have the dp->lag_dev pointer populated with the bonding/team
> interface it offloads.
> 
> So if we make this change, we can populate bond0->dsa_ptr with this
> "logical dp". This way we need to make no other changes to the RX data
> path, and from the PoV of the data path itself, it isn't even a 'multi
> CPU port' setup.
> 
> As for TX, that should already be fine: we call dev_queue_xmit() towards
> bond0 because that's our master, and bond0 deals with xmit hashing
> towards eno2 or eno3 on a per-packet basis, and that's what we want.
> 
> To destroy this setup, the user just needs to run 'ip link del bond0'
> (with the link down I think) and DSA should call 'port_get_preferred_cpu'
> again, but without bond0 in the list this time. The setup should revert
> to its state with static assignment of user to CPU ports.
> 
> [ of course, I haven't tried any of this, I am just imagining things ]
> 
> I deliberately gave this very convoluted example because I would like
> the progress that we make to be in this general direction. If I
> understand your use cases, I believe they should be covered.

To sum up all these comments, we all really want the same things:
- A switch driver that can communicate some kind of preference to the
  CPU port assignment logic.
- A way for userspace to change the assigned CPU port.
I really think that to make some progress on this, we should try to
add support for a very basic implementation first. I think the series
I posted achieves this: to actually enable multi-CPU support, a switch
driver must declare the get_preferred_port function, so every driver
has to be changed to opt in, otherwise the old single-CPU
implementation is used (and this is not a bad thing, since some tweaks
are almost always needed, and 99% of the time the second CPU port
would otherwise cause problems).

