From: Florian Fainelli <f.fainelli@gmail.com>
To: Alexander Lobakin <alobakin@dlink.ru>
Cc: Andrew Lunn <andrew@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Muciri Gatimu <muciri@openmesh.com>,
Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>,
John Crispin <john@phrozen.org>,
Vivien Didelot <vivien.didelot@gmail.com>,
Stanislav Fomichev <sdf@google.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Song Liu <songliubraving@fb.com>,
Alexei Starovoitov <ast@kernel.org>,
Matteo Croce <mcroce@redhat.com>,
Jakub Sitnicki <jakub@cloudflare.com>,
Eric Dumazet <edumazet@google.com>,
Paul Blakey <paulb@mellanox.com>,
Yoshiki Komachi <komachi.yoshiki@gmail.com>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH net] net: dsa: fix flow dissection on Tx path
Date: Fri, 6 Dec 2019 10:05:20 -0800
Message-ID: <30eb10a3-ad94-5b9b-5cea-f4b50c2f0e42@gmail.com>
In-Reply-To: <5e108291220110b10fdf0c88f8894fdb@dlink.ru>
On 12/5/19 11:37 PM, Alexander Lobakin wrote:
> Florian Fainelli wrote 06.12.2019 06:32:
>> On 12/5/2019 6:58 AM, Alexander Lobakin wrote:
>>> Andrew Lunn wrote 05.12.2019 17:01:
>>>>> Hi,
>>>>>
>>>>> > What I'm missing here is an explanation of why the flow dissector
>>>>> > is called here if the protocol is already set? It suggests there
>>>>> > is a case when the protocol is not correctly set, and we do need
>>>>> > to look into the frame?
>>>>>
>>>>> If we have a device with multiple Tx queues, but XPS is not
>>>>> configured or the system is uniprocessor, then the networking core
>>>>> selects a Tx queue depending on the flow, to utilize as many Tx
>>>>> queues as possible without breaking frame ordering.
>>>>> This selection happens in net/core/dev.c:skb_tx_hash() as:
>>>>>
>>>>> reciprocal_scale(skb_get_hash(skb), qcount)
>>>>>
>>>>> where 'qcount' is the total number of Tx queues on the network device.
>>>>>
>>>>> If the skb has not been hashed prior to this line, then
>>>>> skb_get_hash() will call the flow dissector to generate a new hash.
>>>>> That's why flow dissection can occur on the Tx path.
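>>>>>
>>>>> As a minimal sketch of that selection (simplified; the real
>>>>> skb_tx_hash() in net/core/dev.c also handles XPS maps and Rx-queue
>>>>> recording, which are omitted here):
>>>>>
>>>>>     /* Pick a Tx queue purely from the flow hash. skb_get_hash()
>>>>>      * falls back to the flow dissector when the skb carries no
>>>>>      * hash yet -- that is the Tx-path dissection in question.
>>>>>      */
>>>>>     static u16 pick_tx_queue(struct sk_buff *skb, u16 qcount)
>>>>>     {
>>>>>         return (u16)reciprocal_scale(skb_get_hash(skb), qcount);
>>>>>     }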
>>>>
>>>> Hi Alexander
>>>>
>>>> So it looks like you are now skipping this hash. Which, in your
>>>> testing, gives better results, because the protocol is already set
>>>> correctly. But are there cases when the protocol is not set
>>>> correctly? Do we really need to look into the frame?
>>>
>>> Actually no, I'm not skipping the entire hashing, I'm only skipping
>>> the tag_ops->flow_dissect() call (a helper that only alters the
>>> network offset and replaces the fake ETH_P_XDSA with the actual
>>> protocol) on the Tx path, because there it only breaks the flow
>>> dissection logic. All skbs are still processed and hashed by the
>>> generic code that follows that call.
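>>>
>>> Roughly, the patch just gates that call on the Rx-only ETH_P_XDSA
>>> protocol; here is an abridged sketch of the condition in
>>> __skb_flow_dissect() (net/core/flow_dissector.c), not the verbatim
>>> diff:
>>>
>>>     /* The DSA master sets skb->protocol to ETH_P_XDSA on Rx only,
>>>      * so this extra check skips the tagger on the Tx path.
>>>      */
>>>     if (unlikely(skb->dev && netdev_uses_dsa(skb->dev) &&
>>>                  proto == htons(ETH_P_XDSA))) {
>>>         const struct dsa_device_ops *ops;
>>>         int offset = 0;
>>>
>>>         ops = skb->dev->dsa_ptr->tag_ops;
>>>         if (ops->flow_dissect &&
>>>             !ops->flow_dissect(skb, &proto, &offset)) {
>>>             hlen -= offset;
>>>             nhoff += offset;
>>>         }
>>>     }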
>>>
>>>> How about when an outer header has just been removed? The frame was
>>>> received on a GRE tunnel, the GRE header has just been removed, and
>>>> now the frame is on its way out. Is the protocol still GRE, and
>>>> should we look into the frame to determine if it is IPv4, ARP, etc.?
>>>>
>>>> Your patch looks to improve things for the cases you have tested,
>>>> but I'm wondering if there are other use cases where we really do
>>>> need to look into the frame? In which case, your fix is doing the
>>>> wrong thing. Should we be extending the tagger to handle the TX case
>>>> as well as the RX case?
>>>
>>> We really have two options: don't call tag_ops->flow_dissect() on Tx
>>> (this patch), or extend the tagger callbacks to handle the Tx path
>>> too. I used both of these for several months each and couldn't find
>>> cases where the first was worse than the second.
>>> I mean, there _might_ be such cases in theory, and if they appear we
>>> should extend our taggers. But for now I don't see the necessity to
>>> do this, as the generic flow dissection logic works as expected after
>>> this patch and is completely broken without it.
>>> And remember that we have the reverse logic on Tx: all skbs are first
>>> queued on the slave netdevice and only then on the master/CPU port.
>>>
>>> It would be nice to see what other people think about this anyway.
>>
>> Your patch seems appropriate to me, and quite frankly I am not sure
>> why flow dissection on RX is done at the DSA master device level,
>> where we have not parsed the DSA tag yet, instead of at the DSA slave
>> network device level. It seems to me that if the DSA master has N RX
>> queues, we should be creating the DSA slave devices with the same
>> number of RX queues and perform RPS there against a standard Ethernet
>> frame (sans DSA tag).
>>
>> For TX the story is a little different because we can have multiqueue
>> DSA slave network devices in order to steer traffic towards particular
>> switch queues and we could do XPS there that way.
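>>
>> Purely as a hypothetical sketch of that direction (this is not what
>> dsa_slave_create() does today, and the names are illustrative):
>>
>>     /* Allocate the slave with queue counts mirroring its DSA master,
>>      * so RPS/XPS can be configured against the untagged frame at the
>>      * slave level. 'master' is the DSA master net_device.
>>      */
>>     slave_dev = alloc_netdev_mqs(sizeof(struct dsa_slave_priv),
>>                                  name, NET_NAME_UNKNOWN, ether_setup,
>>                                  master->real_num_tx_queues,
>>                                  master->real_num_rx_queues);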
>>
>> What do you think?
>
> Hi Florian,
>
> First of all, thank you for the "Reviewed-by"!
>
> I agree with you that all the network stack processing should be
> performed on standard frames without CPU tags, and on the
> corresponding slave netdevices. So I think we really should consider
> extending the DSA core to create slaves with at least as many Rx
> queues as the master device has. With this done, we could remove the
> .flow_dissect() callback from DSA taggers entirely and simplify the
> traffic flow.
Indeed.
>
> Also, if we get back to Tx processing, the number of Tx queues on
> slaves should, in the ideal case, equal the number of queues on the
> switch itself. Maybe we should then apply this rule to Rx queues too,
> i.e. create slaves with the number of Rx queues that the switch has?
Yes, I would offer for RX queues the same configuration knob we have
today for TX queues.
>
> (For example, I'm currently working with switches that have 8 Rx
> queues and 8 Tx queues, but their Ethernet controllers / CPU ports
> have only 4/4.)
Yes, that is unfortunately not uncommon. We have a similar set-up with
BCM7278, which has 16 TX queues on its DSA master, and we have 4 switch
ports with 8 TX queues per port. What I did is basically clamp the
number of DSA slave device TX queues to get a 1:1 mapping, and that
seems acceptable to the users :)
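
In rough pseudo-code, the clamping is simply the following (field names
are illustrative, not actual DSA core code):

    /* Give each slave min(master Tx queues, per-port switch queues),
     * so the skb queue mapping stays 1:1 with hardware queues. A
     * single Rx queue is kept here as a placeholder.
     */
    num_txq = min(master->real_num_tx_queues, ds->num_tx_queues);
    slave_dev = alloc_etherdev_mqs(sizeof(struct dsa_slave_priv),
                                   num_txq, 1);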
--
Florian