* Offloading DSA taggers to hardware
@ 2019-11-13 12:40 Vladimir Oltean
  2019-11-13 16:53 ` Andrew Lunn
  2019-11-13 19:40 ` Florian Fainelli
  0 siblings, 2 replies; 7+ messages in thread
From: Vladimir Oltean @ 2019-11-13 12:40 UTC (permalink / raw)
  To: netdev

DSA is all about pairing any tagging-capable (or at least VLAN-capable)
switch to any NIC, where the software stack creates N "virtual" net
devices, each representing a switch port, whose I/O is multiplexed over
the DSA master based on the metadata present in the frame. It all looks
like an hourglass:

  switch           switch           switch           switch           switch
net_device       net_device       net_device       net_device       net_device
     |                |                |                |                |
     |                |                |                |                |
     |                |                |                |                |
     +----------------+----------------+----------------+----------------+
                                       |
                                       |
                                  DSA master
                                  net_device
                                       |
                                       |
                                  DSA master
                                      NIC
                                       |
                                    switch
                                   CPU port
                                       |
                                       |
     +----------------+----------------+----------------+----------------+
     |                |                |                |                |
     |                |                |                |                |
     |                |                |                |                |
  switch           switch           switch           switch           switch
   port             port             port             port             port


But the process by which the stack:
- Parses the frame on receive, decodes the DSA tag and redirects the frame from
  the DSA master net_device to a switch net_device based on the source port,
  then removes the DSA tag from the frame and recalculates checksums as
  appropriate
- Adds the DSA tag on xmit, then redirects the frame from the "virtual" switch
  net_device to the real DSA master net_device

can be optimized, if the DSA master NIC supports this. Let's say there is a
fictional NIC that has a programmable hardware parser and the ability to
perform frame manipulation (inserting and extracting a tag). Such a NIC could
be programmed to do a better job of adding/removing the DSA tag, as well as
masquerading skb->dev based on the parser metadata. In addition, there would
be a net benefit for QoS, which, as a consequence of the DSA model, cannot
really be end-to-end: a frame classified to a high-priority traffic class by
the switch may be treated as best-effort by the DSA master, because the
master doesn't actually parse the DSA tag (the traffic class, in this case).
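
For reference, the software hooks that implement this process today look
roughly like the following, modeled on the include/net/dsa.h of this era
(trimmed and simplified here, so treat it as a sketch rather than the
authoritative definition):

struct dsa_device_ops {
        /* Insert the tag; returns the skb to be queued on the master */
        struct sk_buff *(*xmit)(struct sk_buff *skb, struct net_device *dev);
        /* Decode the source port, point skb->dev at the corresponding
         * switch net_device and strip the tag */
        struct sk_buff *(*rcv)(struct sk_buff *skb, struct net_device *dev,
                               struct packet_type *pt);
        /* Let the software flow dissector skip over the tag */
        int (*flow_dissect)(const struct sk_buff *skb, __be16 *proto,
                            int *offset);
        unsigned int overhead;  /* tag length, for MTU accounting */
        const char *name;
        enum dsa_tag_protocol proto;
};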

I think the DSA hotpath would still need to be involved, but instead of
calling the tagger's xmit/rcv, it would call a newly introduced ndo that
offloads this operation.
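
As a sketch of what I mean - none of this exists today, and the ndo name
and signature below are invented purely for illustration (the helpers
are the ones from net/dsa/dsa_priv.h):

/* Hypothetical addition to struct net_device_ops: ask the master NIC
 * to insert the tag for @port in hardware (e.g. via its TX
 * descriptors). Returns the skb to queue, tagged or not. */
struct sk_buff *(*ndo_dsa_tag_xmit)(struct sk_buff *skb,
                                    struct net_device *master, int port);

static netdev_tx_t dsa_slave_xmit_sketch(struct sk_buff *skb,
                                         struct net_device *dev)
{
        struct dsa_slave_priv *p = netdev_priv(dev);
        struct net_device *master = dsa_slave_to_master(dev);
        struct sk_buff *nskb;

        if (master->netdev_ops->ndo_dsa_tag_xmit)       /* hypothetical */
                nskb = master->netdev_ops->ndo_dsa_tag_xmit(skb, master,
                                        dsa_slave_to_port(dev)->index);
        else
                nskb = p->xmit(skb, dev);       /* software tagger, as today */

        if (!nskb)
                return NETDEV_TX_OK;

        /* Hand the frame to the master, as the hotpath does today */
        nskb->dev = master;
        dev_queue_xmit(nskb);

        return NETDEV_TX_OK;
}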

Is there any hardware out there that can do this? Is it desirable to see
something like this in DSA?

Regards,
-Vladimir


* Re: Offloading DSA taggers to hardware
  2019-11-13 12:40 Offloading DSA taggers to hardware Vladimir Oltean
@ 2019-11-13 16:53 ` Andrew Lunn
  2019-11-13 17:47   ` Vladimir Oltean
  2019-11-13 19:29   ` Florian Fainelli
  2019-11-13 19:40 ` Florian Fainelli
  1 sibling, 2 replies; 7+ messages in thread
From: Andrew Lunn @ 2019-11-13 16:53 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev

Hi Vladimir

I've not seen any hardware that can do this. There is an
Atheros/Qualcomm integrated SoC/switch where the 'header' is actually
just a field in the transmit/receive descriptor. There is an
out-of-tree driver for it, and the tag driver is very minimal. But
clearly this only works for integrated systems.

The other 'smart' feature I've seen in NICs with respect to DSA is
the ability to do hardware checksums. The Freescale FEC, for example,
cannot figure out where the IP header is because of the DSA header,
and so cannot calculate IP/TCP/UDP checksums. Marvell, and I expect
some other vendors of both MAC and switch devices, know about these
headers and can do checksumming.

I'm not even sure there are any NICs which can do GSO or LRO when
there is a DSA header involved.

In the direction CPU to switch, I think many of the QoS issues are
higher up the stack. By the time the tagger is involved, all the queue
discipline work has been done, and it really is time to send the
frame. In the 'post-bufferbloat world', the NIC's hardware queue
should be small, so QoS is not so relevant once you reach the TX
queue. The real QoS issue, I guess, is that the slave interfaces have
no idea they are sharing resources at the lowest level. So high
priority frames from slave 1 are not differentiated from best-effort
frames from slave 2. If we were serious about improving QoS, we would
need a meta-scheduler across the slaves, feeding the master interface
in a QoS-aware way.

In the other direction, how much is the NIC really looking at QoS
information on the receive path? Are you thinking of RPS? I'm not sure
any of the NICs commonly used today with DSA are actually multi-queue
and do RPS.
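
On the software side, at least, the flow dissector can already be
taught to skip the tag: a tagger can provide the flow_dissect hook so
that flow hashing (and thus RPS) finds the real headers. Roughly what
net/dsa/tag_dsa.c does, simplified here as a sketch:

static int dsa_tag_flow_dissect(const struct sk_buff *skb, __be16 *proto,
                                int *offset)
{
        /* eth_type_trans() has already pulled 14 bytes, so skb->data
         * points 2 bytes into the 4-byte tag; the real EtherType is
         * the next 16-bit word after the tag. */
        *offset = 4;
        *proto = ((__be16 *)skb->data)[1];
        return 0;
}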

Another aspect here might be that what Linux is doing with DSA is
probably well past the silicon vendors' expected use cases. None of
the 'vendor crap' drivers I've seen for these SOHO-class switches have
the level of integration we have in Linux. We are pushing the limits
of the host/switch interface much more than vendors do, and so silicon
vendors are not so aware of the limits in these areas? But DSA is
proving successful, vendors are taking more notice of it, and maybe
with time the host/switch interface will improve. NICs might start
supporting GSO/LRO when there is a DSA header involved? Multi-queue
NICs might become more popular in this class of hardware, and RPS
might learn how to handle DSA headers. But my guess would be that it
will be a Marvell NIC paired with a Marvell switch, a Broadcom NIC
paired with a Broadcom switch, etc. I doubt there will be cross-vendor
support.

	Andrew


* Re: Offloading DSA taggers to hardware
  2019-11-13 16:53 ` Andrew Lunn
@ 2019-11-13 17:47   ` Vladimir Oltean
  2019-11-13 19:29   ` Florian Fainelli
  1 sibling, 0 replies; 7+ messages in thread
From: Vladimir Oltean @ 2019-11-13 17:47 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev

Hi Andrew,

On Wed, 13 Nov 2019 at 18:53, Andrew Lunn <andrew@lunn.ch> wrote:
>
> Hi Vladimir
>
> I've not seen any hardware that can do this. There is an
> Atheros/Qualcomm integrated SoC/switch where the 'header' is actually
> just a field in the transmit/receive descriptor. There is an
> out-of-tree driver for it, and the tag driver is very minimal. But
> clearly this only works for integrated systems.
>

What is this Atheros SoC?
It is funny that the topic reminded you of it. Your line of reasoning
probably was: "Atheros pushed this idea so far that they omitted the
DSA frame tag altogether for their own CPU port/DSA master". Which
means that even if they tried to use this "offloaded DSA tagger"
abstraction, it would slightly violate the main idea of an offload,
which is that it's optional. What do you think?

> The other 'smart' feature I've seen in NICs with respect to DSA is
> the ability to do hardware checksums. The Freescale FEC, for example,
> cannot figure out where the IP header is because of the DSA header,
> and so cannot calculate IP/TCP/UDP checksums. Marvell, and I expect
> some other vendors of both MAC and switch devices, know about these
> headers and can do checksumming.
>

Of course there are many more benefits that derive from more complete
frame parsing as well; for some reason my mind just stopped at QoS
when I wrote this email.

> I'm not even sure there are any NICs which can do GSO or LRO when
> there is a DSA header involved.
>
> In the direction CPU to switch, I think many of the QoS issues are
> higher up the stack. By the time the tagger is involved, all the queue
> discipline work has been done, and it really is time to send the
> frame. In the 'post-bufferbloat world', the NIC's hardware queue
> should be small, so QoS is not so relevant once you reach the TX
> queue. The real QoS issue, I guess, is that the slave interfaces have
> no idea they are sharing resources at the lowest level. So high
> priority frames from slave 1 are not differentiated from best-effort
> frames from slave 2. If we were serious about improving QoS, we would
> need a meta-scheduler across the slaves, feeding the master interface
> in a QoS-aware way.
>

Qdiscs on the DSA master are a discussion worth having, but that
wasn't the main thing I wanted to bring up here.

> In the other direction, how much is the NIC really looking at QoS
> information on the receive path? Are you thinking of RPS? I'm not sure
> any of the NICs commonly used today with DSA are actually multi-queue
> and do RPS.
>

Actually, both DSA master drivers I've been using so far (gianfar,
enetc) register a number of RX queues equal to the number of cores. It
is possible to add ethtool --config-nfc rules to steer certain
priority traffic to its own CPU, but the keys need to be masked
according to where the QoS field in the DSA frame tag overlaps with
what the DSA master thinks it's looking at, i.e. the DMAC, SMAC,
EtherType, etc. It's not pretty.
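
To make that concrete, the kind of rule I mean would look roughly like
the below, expressed as a struct ethtool_rx_flow_spec. The values are
made up for illustration: I am assuming a tag whose first 16 bits land
where the master's parser expects the EtherType, with the 3 priority
bits at the top, and the convention that set mask bits select the bits
to compare (drivers differ, so treat this as a sketch):

#include <string.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/ethtool.h>

/* Steer frames whose DSA tag marks priority 7 to RX ring 1 */
static void build_prio_steering_rule(struct ethtool_rx_flow_spec *fs)
{
        memset(fs, 0, sizeof(*fs));
        fs->flow_type = ETHER_FLOW;
        /* What the NIC believes is the EtherType is really tag bytes */
        fs->h_u.ether_spec.h_proto = htons(7 << 13);
        fs->m_u.ether_spec.h_proto = htons(7 << 13);
        fs->ring_cookie = 1;            /* deliver to RX queue 1 */
        fs->location = RX_CLS_LOC_ANY;
}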

> Another aspect here might be that what Linux is doing with DSA is
> probably well past the silicon vendors' expected use cases. None of
> the 'vendor crap' drivers I've seen for these SOHO-class switches have
> the level of integration we have in Linux. We are pushing the limits
> of the host/switch interface much more than vendors do, and so silicon
> vendors are not so aware of the limits in these areas? But DSA is
> proving successful, vendors are taking more notice of it, and maybe
> with time the host/switch interface will improve. NICs might start
> supporting GSO/LRO when there is a DSA header involved? Multi-queue
> NICs might become more popular in this class of hardware, and RPS
> might learn how to handle DSA headers. But my guess would be that it
> will be a Marvell NIC paired with a Marvell switch, a Broadcom NIC
> paired with a Broadcom switch, etc. I doubt there will be cross-vendor
> support.

...Atheros with Atheros... :)

Yes, that's kind of the angle I'm coming from: basically trying to
understand what a correct abstraction from Linux's perspective would
look like, and what is considered too much "tribalism". The DSA model
is attractive even for an integrated system, because there is more
modularity in the design, but there are some clear optimizations that
can be made when the master+switch recipe is tightly controlled.

Thanks,
-Vladimir


* Re: Offloading DSA taggers to hardware
  2019-11-13 16:53 ` Andrew Lunn
  2019-11-13 17:47   ` Vladimir Oltean
@ 2019-11-13 19:29   ` Florian Fainelli
  1 sibling, 0 replies; 7+ messages in thread
From: Florian Fainelli @ 2019-11-13 19:29 UTC (permalink / raw)
  To: Andrew Lunn, Vladimir Oltean; +Cc: netdev

On 11/13/19 8:53 AM, Andrew Lunn wrote:
> Hi Vladimir
> 
> I've not seen any hardware that can do this.

Such hardware exists and there was a prior attempt at supporting it:

http://linux-kernel.2935.n7.nabble.com/PATCH-net-next-0-3-net-Switch-tag-HW-extraction-insertion-td1162606.html

> There is an Atheros/Qualcomm integrated SoC/switch where the 'header'
> is actually just a field in the transmit/receive descriptor. There is
> an out-of-tree driver for it, and the tag driver is very minimal. But
> clearly this only works for integrated systems.

It can work between discrete components in principle, it is just
unlikely because of the flexibility of DSA to mix and match MACs and
switches and have different vendors on either end. Of course, even
within the same vendor, the right hand rarely talks to the left hand,
so it would have to be the work of someone who knows both ends.

> 
> The other 'smart' feature I've seen in NICs with respect to DSA is
> the ability to do hardware checksums. The Freescale FEC, for example,
> cannot figure out where the IP header is because of the DSA header,
> and so cannot calculate IP/TCP/UDP checksums. Marvell, and I expect
> some other vendors of both MAC and switch devices, know about these
> headers and can do checksumming.

This is probably to be blamed on the fact that most Ethernet switch
tagging protocols did not assign themselves an EtherType, otherwise
NICs could just do that checksumming. In fact, this even trips up
controllers from the same vendor:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a40061ea2e39494104602b3048751341bda374a1

> 
> I'm not even sure there are any NICs which can do GSO or LRO when
> there is a DSA header involved.

Similarly to VLAN devices, wouldn't this be done at the DSA virtual
device level instead?

> 
> In the direction CPU to switch, I think many of the QoS issues are
> higher up the stack. By the time the tagger is involved, all the queue
> discipline work has been done, and it really is time to send the
> frame. In the 'post-bufferbloat world', the NIC's hardware queue
> should be small, so QoS is not so relevant once you reach the TX
> queue. The real QoS issue, I guess, is that the slave interfaces have
> no idea they are sharing resources at the lowest level. So high
> priority frames from slave 1 are not differentiated from best-effort
> frames from slave 2. If we were serious about improving QoS, we would
> need a meta-scheduler across the slaves, feeding the master interface
> in a QoS-aware way.
>
> In the other direction, how much is the NIC really looking at QoS
> information on the receive path? Are you thinking of RPS? I'm not sure
> any of the NICs commonly used today with DSA are actually multi-queue
> and do RPS.

The same hardware presented above can also deliver frames to the
desired switch output queue:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d156576362c07e954dc36e07b0d7b0733a010f7d

> 
> Another aspect here might be that what Linux is doing with DSA is
> probably well past the silicon vendors' expected use cases. None of
> the 'vendor crap' drivers I've seen for these SOHO-class switches have
> the level of integration we have in Linux. We are pushing the limits
> of the host/switch interface much more than vendors do, and so silicon
> vendors are not so aware of the limits in these areas?

Maybe, but vendors support many basic things we still don't, like
controlling broadcast storm suppression (a commonly requested feature),
offloading QoS properly, etc. What we have achieved so far is IMHO a
solid framework, but there are still many, many features unsupported.

-- 
Florian


* Re: Offloading DSA taggers to hardware
  2019-11-13 12:40 Offloading DSA taggers to hardware Vladimir Oltean
  2019-11-13 16:53 ` Andrew Lunn
@ 2019-11-13 19:40 ` Florian Fainelli
  2019-11-14 16:40   ` Vladimir Oltean
  1 sibling, 1 reply; 7+ messages in thread
From: Florian Fainelli @ 2019-11-13 19:40 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, Andrew Lunn, Vivien Didelot

On 11/13/19 4:40 AM, Vladimir Oltean wrote:
> [snip]
>
> But the process by which the stack:
> - Parses the frame on receive, decodes the DSA tag and redirects the frame from
>   the DSA master net_device to a switch net_device based on the source port,
>   then removes the DSA tag from the frame and recalculates checksums as
>   appropriate
> - Adds the DSA tag on xmit, then redirects the frame from the "virtual" switch
>   net_device to the real DSA master net_device
> 
> can be optimized, if the DSA master NIC supports this. Let's say there is a
> fictional NIC that has a programmable hardware parser and the ability to
> perform frame manipulation (inserting and extracting a tag). Such a NIC could
> be programmed to do a better job of adding/removing the DSA tag, as well as
> masquerading skb->dev based on the parser metadata. In addition, there would
> be a net benefit for QoS, which, as a consequence of the DSA model, cannot
> really be end-to-end: a frame classified to a high-priority traffic class by
> the switch may be treated as best-effort by the DSA master, because the
> master doesn't actually parse the DSA tag (the traffic class, in this case).

The QoS part can be guaranteed for an integrated design, not so much
if you have discrete/separate NIC and switch vendors and there is no
agreed-upon mechanism to "not lose information" between the two.

> 
> I think the DSA hotpath would still need to be involved, but instead of
> calling the tagger's xmit/rcv, it would call a newly introduced ndo that
> offloads this operation.
> 
> Is there any hardware out there that can do this? Is it desirable to see
> something like this in DSA?

BCM7445 and BCM7278 (and other DSL and cable modem chips, just not
supported upstream) use drivers/net/dsa/bcm_sf2.c along with
drivers/net/ethernet/broadcom/bcmsysport.c. It is possible to offload
the creation and extraction of the Broadcom tag:

http://linux-kernel.2935.n7.nabble.com/PATCH-net-next-0-3-net-Switch-tag-HW-extraction-insertion-td1162606.html

(This was reverted shortly after because napi_gro_receive() occupies
the full 48-byte skb->cb[] space on 64-bit hosts; I now have a better
view of how to solve this though, see below.)

In my experience though, since the data is already hot in the cache
in either direction, a memmove() is not that costly; it was not
possible to see sizable throughput improvements at 1Gbps or 2Gbps
speeds, because the CPU is more than capable of managing the tag
extraction in software, and that is the most compatible way of doing
it.
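
The memmove() in question is the one in the software receive path,
which looks roughly like this (condensed from the net/dsa/tag_brcm.c
pattern, with error handling and the non-tag details trimmed, so read
it as a sketch):

static struct sk_buff *brcm_tag_rcv_sketch(struct sk_buff *skb,
                                           struct net_device *dev)
{
        int source_port;
        u8 *brcm_tag;

        if (unlikely(!pskb_may_pull(skb, BRCM_TAG_LEN)))
                return NULL;

        /* The 4-byte tag sits between the MAC addresses and the
         * EtherType; eth_type_trans() already pulled 14 bytes, so the
         * tag starts 2 bytes before skb->data. */
        brcm_tag = skb->data - 2;
        source_port = brcm_tag[3] & BRCM_EG_PID_MASK;

        skb->dev = dsa_master_find_slave(dev, 0, source_port);
        if (!skb->dev)
                return NULL;

        /* Strip the tag: pull it, then slide the MAC addresses up so
         * they are adjacent to the EtherType again - this is the
         * cache-hot memmove() */
        skb_pull_rcsum(skb, BRCM_TAG_LEN);
        memmove(skb->data - ETH_HLEN,
                skb->data - ETH_HLEN - BRCM_TAG_LEN, 2 * ETH_ALEN);

        return skb;
}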

To give you some more details, the SYSTEMPORT MAC will prepend an
8-byte Receive Status Block: word 0 contains status/length/error and
word 1 can contain the full 4-byte Broadcom tag as extracted. Then
there is a (configurable) 2-byte gap to align the IP header, and then
the Ethernet header can be found. This is quite similar to the
NET_DSA_TAG_BRCM_PREPEND case, except for this 2-byte gap, which is
why I am wondering whether I will have to introduce an additional
tagging protocol, NET_DSA_TAG_BRCM_PREPEND_WITH_2B, or whatever side
band information I can provide in the skb to permit the removal of
these two extraneous bytes.
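
To sketch that RX layout (struct and field names are invented here to
follow the description, they are not the bcmsysport.c ones):

struct rsb_sketch {
        u32 status_len; /* word 0: status/length/error */
        u32 brcm_tag;   /* word 1: the extracted 4-byte Broadcom tag */
};

/* Buffer layout as seen by the driver:
 *
 *   [8-byte RSB][2-byte gap][Ethernet header][IP header ...]
 *    offset 0    offset 8    offset 10        offset 24
 *
 * The 2-byte gap makes the IP header land 4-byte aligned at offset 24.
 */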

On transmit, we also have an 8-byte Transmit Status Block which can
be constructed to contain information for the HW to insert a 4-byte
Broadcom tag, along with a VLAN tag, and with the same length/checksum
insertion information. The TX path would then be equivalent to not
doing any tagging, so similarly, it may be desirable to have a
separate NET_DSA_TAG_BRCM_PREPEND value that indicates that nothing
needs to be done except queue the frame for transmission on the
master netdev.
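
And the TX counterpart, with the same caveat about invented names:

struct tsb_sketch {
        u32 tag_insert; /* word 0: Broadcom tag / VLAN tag to insert */
        u32 len_csum;   /* word 1: length and checksum insertion info */
};

/* The driver prepends this 8-byte block and the hardware builds the
 * tag, leaving the tagger nothing to do but queue the skb on the
 * master netdev. */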

Now, from a practical angle, offloading DSA tagging only makes sense
if you happen to have a lot of host-initiated/received traffic, which
would be the case for a streaming device (BCM7445/BCM7278) with its
ports either completely separate (the DSA default) or bridged. Does
that apply in your case?
-- 
Florian


* Re: Offloading DSA taggers to hardware
  2019-11-13 19:40 ` Florian Fainelli
@ 2019-11-14 16:40   ` Vladimir Oltean
  2019-11-22 17:47     ` Florian Fainelli
  0 siblings, 1 reply; 7+ messages in thread
From: Vladimir Oltean @ 2019-11-14 16:40 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, Andrew Lunn, Vivien Didelot

Hi Florian,

On Wed, 13 Nov 2019 at 21:40, Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> [snip]
>
> Now, from a practical angle, offloading DSA tagging only makes sense
> if you happen to have a lot of host-initiated/received traffic, which
> would be the case for a streaming device (BCM7445/BCM7278) with its
> ports either completely separate (the DSA default) or bridged. Does
> that apply in your case?

Not at all, I would say. In fact, I was trying to understand what the
chances are of interpreting information from the master's frame
descriptor as the de facto DSA tag in mainline Linux. Your story with
the Starfighter 2 chips seems to indicate that it isn't such a good
idea.

Thanks,
-Vladimir


* Re: Offloading DSA taggers to hardware
  2019-11-14 16:40   ` Vladimir Oltean
@ 2019-11-22 17:47     ` Florian Fainelli
  0 siblings, 0 replies; 7+ messages in thread
From: Florian Fainelli @ 2019-11-22 17:47 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev, Andrew Lunn, Vivien Didelot

On 11/14/19 8:40 AM, Vladimir Oltean wrote:
> Hi Florian,
> 
> [snip]
> 
> Not at all, I would say. In fact, I was trying to understand what the
> chances are of interpreting information from the master's frame
> descriptor as the de facto DSA tag in mainline Linux. Your story with
> the Starfighter 2 chips seems to indicate that it isn't such a good
> idea.

I would not say that it is a bad idea, but it may be challenging to
find a driver-agnostic way, on both the DSA master and the tagger
side, to provide the switch tag in a way that minimizes the amount of
data manipulation within the packet, while preserving possible stack
optimizations such as GRO. Technically, we should probably be doing
GRO at the DSA slave layer though; I am fuzzy on the details here,
TBH.

AFAIR, there may have been some efforts by Florian Westphal to allow
nesting of skb->cb[] usage; maybe we could use that.
-- 
Florian


end of thread

Thread overview: 7+ messages
2019-11-13 12:40 Offloading DSA taggers to hardware Vladimir Oltean
2019-11-13 16:53 ` Andrew Lunn
2019-11-13 17:47   ` Vladimir Oltean
2019-11-13 19:29   ` Florian Fainelli
2019-11-13 19:40 ` Florian Fainelli
2019-11-14 16:40   ` Vladimir Oltean
2019-11-22 17:47     ` Florian Fainelli
