From mboxrd@z Thu Jan  1 00:00:00 1970
From: Florian Fainelli <f.fainelli@gmail.com>
Subject: Re: [PATCH 4/4] cpsw: add switchdev support
Date: Tue, 5 Jun 2018 14:37:41 -0700
Message-ID: <aa02ecd9-98f5-a3b5-2d02-27286eaf9d3b@gmail.com>
References: <1527144984-31236-1-git-send-email-ilias.apalodimas@linaro.org>
 <1527144984-31236-5-git-send-email-ilias.apalodimas@linaro.org>
 <20180524131229.GC24557@lunn.ch> <20180524133234.GA15703@apalos>
 <20180524163904.GH5128@lunn.ch> <20180525045647.GA1758@apalos>
 <aa95cdbd-1e89-0b8e-352e-4de80bf0f714@gmail.com>
 <20180602103432.GA948@apalos> <e5832a2c-da63-d91d-e93a-51b7e84959a0@ti.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: Andrew Lunn <andrew@lunn.ch>, netdev@vger.kernel.org,
        ivan.khoronzhuk@linaro.org, nsekhar@ti.com, jiri@resnulli.us,
        ivecera@redhat.com, francois.ozog@linaro.org, yogeshs@ti.com,
        spatton@ti.com
To: Grygorii Strashko <grygorii.strashko@ti.com>,
        Ilias Apalodimas <ilias.apalodimas@linaro.org>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-qt0-f179.google.com ([209.85.216.179]:34010 "EHLO
        mail-qt0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751600AbeFEVhu (ORCPT
        <rfc822;netdev@vger.kernel.org>); Tue, 5 Jun 2018 17:37:50 -0400
Received: by mail-qt0-f179.google.com with SMTP id d3-v6so4207575qto.1
        for <netdev@vger.kernel.org>; Tue, 05 Jun 2018 14:37:50 -0700 (PDT)
In-Reply-To: <e5832a2c-da63-d91d-e93a-51b7e84959a0@ti.com>
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 06/05/2018 02:03 PM, Grygorii Strashko wrote:
> 
> 
> On 06/02/2018 05:34 AM, Ilias Apalodimas wrote:
>> Hi Florian,
>>
>> Thanks for taking time to look into this
>>
>> On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:
>>>
>>>
>>> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:
>>>> On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:
>>>>> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:
>>>>>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:
>>>>>>> Device tree is supposed to describe the hardware. Using that hardware
>>>>>>> in different ways is not something you should describe in DT.
>>>>>>>
>>>>>> The new switchdev mode is applied with a .config option in the kernel. What you
>>>>>> see is pre-existing code, so i am not sure if i should change it in this
>>>>>> patchset.
>>>>>
>>>>> If you break the code up into a library and two drivers, it becomes a
>>>>> moot point.
>>>> Agree
>>>>
>>>>>
>>>>> But what i don't like here is that the device tree says to do dual
>>>>> mac. But you ignore that and do sometime else. I would prefer that if
>>>>> DT says dual mac, and switchdev is compiled in, the probe fails with
>>>>> EINVAL. Rather than ignore something, make it clear it is invalid.
>>>> The switch has 3 modes of operation as is.
>>>> 1. switch mode, to enable that you don't need to add anything on
>>>> the DTS and linux registers a single netdev interface.
>>>> 2. dual mac mode, this is when you need to add dual_emac; on the DTS.
>>>> 3. switchdev mode which is controlled by a .config option, since as you
>>>> pointed out DTS was not made for controlling config options.
>>>>
>>>> I agree that this is far from beautiful. If the driver remains as in though,
>>>> i'd prefer either keeping what's there or making "switchdev" a DTS option,
>>>> following the pre-existing erroneous usage rather than making the device
>>>> unusable.  If we end up returning some error and refuse to initialize, users
>>>> that remote upgrade their equipment, without taking a good look at changelog,
>>>> will loose access to their devices with no means of remotely fixing that.
>>>
>>> It seems to me that the mistake here is seeing multiple modes of
>>> operations for the cpsw. There are not actually many, there is one
>>> usage, and then there is what you can and cannot offload.
> 
>> CPSW has in fact 2 modes of operation, different FIFO usage/lookup entry(it's
>> called ALE in the current driver) by-pass(which is used in dual emac for
>> example) and other features. Again Grygorii is better suited to answer the
>> exact differences.
> 
> dual_mac is HW enabled mode of operation, so having DT option is pretty
> reasonable as for me. 
> 1) when enabled it configures internal FIFOs in special way so both
> external Ports became equal in the direction toward to Host port 0.
> 
> TRM "The intention of this mode is to allow packets from both ethernet ports
> to be written into the FIFO without one port starving the other port."
> 
> 2) ALE, out of the box, does not support this mode and, as result, two
> default vlan have to be created to direct traffic P1->P0 (vlan1) and
> P2-P0 (vlan2), but any packets P0->P1 and P0->P2 have to be directed
> (and will bypass ALE). This way traffic separated on cpsw egress towards to P0, 
> 
>>> The basic
>>> premise with switchdev and DSA (which uses switchdev) is that each
>>> user-facing port of your switch needs to work as if it were a normal
>>> Ethernet NIC, that is what you call dual-MAC I believe. Then, when you
>>> create a bridge and you enslave those ports into the bridge, you need to
>>> have forwarding done in hardware between these two ports when the
>>> SMAC/DMAC are not for the host/CPU/management interface and you must
>>> simultaneously still have the host have the ability to send/receive
>>> traffic through the bridge device.
> 
> TRM
> "When operating in dual mac mode the intention is to transfer packets between ports 0 and 1 and ports 0
> and 2, but not between ports 1 and 2. Each CPGMAC_SL appears as a single MAC with no bridging
> between MAC’s. Each CPGMAC_SL has at least one unique (not the same) mac address."
> 
> So, we cant't talk about forwarding in dual_mac mode. Interfaces expected to be 
> completely independent without any packet leaking between interfaces.
> 
> !! BUT !! CPSW ALE (switch core) still expected to be used, but for packet filtering
> - only registered vlans
> - only registered mcast/bcast
> - ingress mcast/bcast rate limiting (it's actually more like coalescing -
>  limits number of mcast/bcast packets per sec. 
> 
> And all offloading ALE (val/mdb) entries should always contain two ports in masks:
> p1&p0 or p2&p0. Never ever all three ports. 
> 
>> Yes dual emac does that. But dual emac configures the port facing VLAN to the
>> CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is
>> configured on port1 + CPU port and VLAN 2 is confired on port 2 + CPU port
>> That's exactly what the current RFC does as well, with the addition of
>> registering a sw0p0 (i'll explain why later on this mail)
>> A little more detail on the issue we are having. On my description
>> sw0p0 -> CPU port, sw0p1 -> port 1 sw0p2 -> port 2. sw0p1/sw0p2 are the ports
>> that have PHYs attached.
>>
>> When we start in the new switchdev mode all interfaces are added to VLAN 0
>> so CPU port + port1 + port2 are all in the same VLAN group. In that case sw0p1
>> and sw0p2 are working as you describe. So those 2 interfaces can send/receive
>> traffic normally which matches the switchdev case.
>>
>> When we add them on a bridge(let's say br0), VLAN1(or any default bridge VLAN)
>> is now configured on sw0p1 and sw0p2 but *not* on the CPU port.
>>  From this point on the whole fuunctionality just collapses. The switch will
>> work and offload traffic between sw0p1/sw0p2 but the bridge won't be able to
>> get an ip address (since VLAN1 is not a member of the CPU port and the packet
>> gets dropped).
>> IGMPv2/V3 messages will never reach the br_multicast.c code to trigger
>> switchdev and configure the MDBs on the ports.  i am prety sure there are other
>> fail scenarios which i haven't discovered already, but those 2 are the most
>> basic ones.  If we add VLAN1 on the CPU port, everything works as intended of
>> course.
>>
>> That's the reason we registered sw0p0 as the CPU port. It can't do any "real"
>> traffic, but you can configure the CPU port independantly and not be forced to
>> do an OR on every VLAN add/delete grouping the CPU port with your port command.
>> The TL;DR version of this is that the switch is working exactly as switchdev is
>> expecting offloading traffic to the hardware when possible as long as the CPU
>> port is member of the proper VLANs
>>
>> Petr's patch solves this for us (9c86ce2c1ae337fc10568a12aea812ed03de8319).
>> We can now do "bridge vlan add dev br0 vid 100 pvid untagged self" and decide
>> when to add the CPU port or not.
>>
>> There are still a couple of cases that are not covered though, if we don't
>> register the CPU port. We cant decide when to forward multicast
>> traffic on the CPU port if a join hasn't been sent from br0.
>> So let's say you got 2 hosts doing multicast and for whatever reason the host
>> wants to see that traffic.
>> With the CPU port present you can do a
>> "bridge mdb add dev br0 port sw0p0 grp 239.1.1.1 permanent" which will offload
>> the traffic to the CPU port and thus the host. If this goes away we are forced
>> to send a join.
>>   
>>> It seems to me like this is entirely doable given that the dual MAC use
>>> case is supported already.
>>>
>>> switchdev is just a stateless framework to get notified from the
>>> networking stack about what you can possibly offload in hardware, so
>>> having a DTS option gate that is unfortunately wrong because it is
>>> really implementing a SW policy in DTS which is not what it is meant for.
>> The DTS option for configuration pre-existed, i don't like that either switchdev
>> mode is activated by a .config option not DTS(it just overrides whatever config
>> you have on the DTS). Far from pretty though fair point, i am open to
>> suggestions.
> 
> Again this is option describing HW mode which not expected to be changed on the fly.
> I'm honestly, propose descope dual_mac and concentrate on switch mode (switchdev) - 
> right now I don't see how dual_mac can be supported with switchdev as per above.
> The same way as I do not see how we can re-use switchdev with 50% of devices which 
> have "only one" user-facing external port (P1 or P2).

This is still not appropriate for Device Tree, because this is
completely orthogonal from one another. Also, I think you tend to
conflate what the switch can do, with in which mode does it start by
default, which are, again, two different things.
-- 
Florian