From: Florian Fainelli <f.fainelli@gmail.com>
To: Alexander Lobakin <alobakin@dlink.ru>
Cc: Andrew Lunn <andrew@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Muciri Gatimu <muciri@openmesh.com>,
Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>,
John Crispin <john@phrozen.org>,
Vivien Didelot <vivien.didelot@gmail.com>,
Stanislav Fomichev <sdf@google.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Song Liu <songliubraving@fb.com>,
Alexei Starovoitov <ast@kernel.org>,
Matteo Croce <mcroce@redhat.com>,
Jakub Sitnicki <jakub@cloudflare.com>,
Eric Dumazet <edumazet@google.com>,
Paul Blakey <paulb@mellanox.com>,
Yoshiki Komachi <komachi.yoshiki@gmail.com>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH net] net: dsa: fix flow dissection on Tx path
Date: Fri, 6 Dec 2019 10:05:20 -0800
Message-ID: <30eb10a3-ad94-5b9b-5cea-f4b50c2f0e42@gmail.com>
In-Reply-To: <5e108291220110b10fdf0c88f8894fdb@dlink.ru>
On 12/5/19 11:37 PM, Alexander Lobakin wrote:
> Florian Fainelli wrote 06.12.2019 06:32:
>> On 12/5/2019 6:58 AM, Alexander Lobakin wrote:
>>> Andrew Lunn wrote 05.12.2019 17:01:
>>>>> Hi,
>>>>>
>>>>> > What I'm missing here is an explanation of why the flow dissector
>>>>> > is called here if the protocol is already set? It suggests there
>>>>> > is a case when the protocol is not correctly set, and we do need
>>>>> > to look into the frame?
>>>>>
>>>>> If we have a device with multiple Tx queues, but XPS is not
>>>>> configured or the system is uniprocessor, then the networking core
>>>>> selects a Tx queue depending on the flow, to utilize as many Tx
>>>>> queues as possible without breaking frame ordering.
>>>>> This selection happens in net/core/dev.c:skb_tx_hash() as:
>>>>>
>>>>> reciprocal_scale(skb_get_hash(skb), qcount)
>>>>>
>>>>> where 'qcount' is the total number of Tx queues on the network device.
>>>>>
>>>>> If the skb has not been hashed prior to this line, then
>>>>> skb_get_hash() will call the flow dissector to generate a new hash.
>>>>> That's why flow dissection can occur on the Tx path.
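>>>>>
>>>>> As a minimal sketch of that selection (simplified; the real
>>>>> skb_tx_hash() in net/core/dev.c also handles XPS maps and Rx-queue
>>>>> recording, which are omitted here):
>>>>>
>>>>>     /* Pick a Tx queue purely from the flow hash. skb_get_hash()
>>>>>      * falls back to the flow dissector when the skb carries no
>>>>>      * hash yet -- that is the Tx-path dissection in question.
>>>>>      */
>>>>>     static u16 pick_tx_queue(struct sk_buff *skb, u16 qcount)
>>>>>     {
>>>>>         return (u16)reciprocal_scale(skb_get_hash(skb), qcount);
>>>>>     }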
>>>>
>>>> Hi Alexander
>>>>
>>>> So it looks like you are now skipping this hash. Which, in your
>>>> testing, gives better results, because the protocol is already set
>>>> correctly. But are there cases when the protocol is not set
>>>> correctly? Do we really need to look into the frame?
>>>
>>> Actually no, I'm not skipping the entire hashing, I'm only skipping
>>> the tag_ops->flow_dissect() call (a helper that only alters the
>>> network offset and replaces the fake ETH_P_XDSA with the actual
>>> protocol) on the Tx path, because there it only breaks the flow
>>> dissection logic. All skbs are still processed and hashed by the
>>> generic code that follows that call.
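>>>
>>> Roughly, the patch just gates that call on the Rx-only ETH_P_XDSA
>>> protocol; here is an abridged sketch of the condition in
>>> __skb_flow_dissect() (net/core/flow_dissector.c), not the verbatim
>>> diff:
>>>
>>>     /* The DSA master sets skb->protocol to ETH_P_XDSA on Rx only,
>>>      * so this extra check skips the tagger on the Tx path.
>>>      */
>>>     if (unlikely(skb->dev && netdev_uses_dsa(skb->dev) &&
>>>                  proto == htons(ETH_P_XDSA))) {
>>>         const struct dsa_device_ops *ops;
>>>         int offset = 0;
>>>
>>>         ops = skb->dev->dsa_ptr->tag_ops;
>>>         if (ops->flow_dissect &&
>>>             !ops->flow_dissect(skb, &proto, &offset)) {
>>>             hlen -= offset;
>>>             nhoff += offset;
>>>         }
>>>     }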
>>>
>>>> How about when an outer header has just been removed? The frame was
>>>> received on a GRE tunnel, the GRE header has just been removed, and
>>>> now the frame is on its way out. Is the protocol still GRE, and
>>>> should we look into the frame to determine if it is IPv4, ARP, etc.?
>>>>
>>>> Your patch looks to improve things for the cases you have tested,
>>>> but I'm wondering if there are other use cases where we really do
>>>> need to look into the frame? In which case, your fix is doing the
>>>> wrong thing. Should we be extending the tagger to handle the TX case
>>>> as well as the RX case?
>>>
>>> We really have two options: don't call tag_ops->flow_dissect() on Tx
>>> (this patch), or extend the tagger callbacks to handle the Tx path
>>> too. I used both of these for several months each and couldn't find
>>> cases where the first was worse than the second.
>>> I mean, there _might_ be such cases in theory, and if they appear we
>>> should extend our taggers. But for now I don't see the necessity to
>>> do this, as the generic flow dissection logic works as expected after
>>> this patch and is completely broken without it.
>>> And remember that we have the reverse logic on Tx: all skbs are first
>>> queued on the slave netdevice and only then on the master/CPU port.
>>>
>>> It would be nice to see what other people think about this anyway.
>>
>> Your patch seems appropriate to me, and quite frankly I am not sure
>> why flow dissection on RX is done at the DSA master device level,
>> where we have not parsed the DSA tag yet, instead of at the DSA slave
>> network device level. It seems to me that if the DSA master has N RX
>> queues, we should be creating the DSA slave devices with the same
>> number of RX queues and perform RPS there against a standard Ethernet
>> frame (sans DSA tag).
>>
>> For TX the story is a little different because we can have multiqueue
>> DSA slave network devices in order to steer traffic towards particular
>> switch queues and we could do XPS there that way.
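>>
>> Purely as a hypothetical sketch of that direction (this is not what
>> dsa_slave_create() does today, and the names are illustrative):
>>
>>     /* Allocate the slave with queue counts mirroring its DSA master,
>>      * so RPS/XPS can be configured against the untagged frame at the
>>      * slave level. 'master' is the DSA master net_device.
>>      */
>>     slave_dev = alloc_netdev_mqs(sizeof(struct dsa_slave_priv),
>>                                  name, NET_NAME_UNKNOWN, ether_setup,
>>                                  master->real_num_tx_queues,
>>                                  master->real_num_rx_queues);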
>>
>> What do you think?
>
> Hi Florian,
>
> First of all, thank you for the "Reviewed-by"!
>
> I agree with you that all the network stack processing should be
> performed on standard frames without CPU tags, and on the
> corresponding slave netdevices. So I think we really should consider
> extending the DSA core to create slaves with at least as many Rx
> queues as the master device has. With this done, we could remove the
> .flow_dissect() callback from DSA taggers entirely and simplify the
> traffic flow.
Indeed.
>
> Also, if we get back to Tx processing, the number of Tx queues on
> slaves should, in the ideal case, equal the number of queues on the
> switch itself. Maybe we should then apply this rule to Rx queues too,
> i.e. create slaves with the number of Rx queues that the switch has?
Yes, I would offer for RX queues the same configuration knob we have
today for TX queues.
>
> (For example, I'm currently working with switches that have 8 Rx
> queues and 8 Tx queues, but their Ethernet controllers / CPU ports
> have only 4/4.)
Yes, that is unfortunately not uncommon. We have a similar set-up with
BCM7278, which has 16 TX queues on its DSA master, and we have 4 switch
ports with 8 TX queues per port. What I did is basically clamp the
number of DSA slave device TX queues to get a 1:1 mapping, and that
seems acceptable to the users :)
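
In rough pseudo-code, the clamping is simply the following (field names
are illustrative, not actual DSA core code):

    /* Give each slave min(master Tx queues, per-port switch queues),
     * so the skb queue mapping stays 1:1 with hardware queues. A
     * single Rx queue is kept here as a placeholder.
     */
    num_txq = min(master->real_num_tx_queues, ds->num_tx_queues);
    slave_dev = alloc_etherdev_mqs(sizeof(struct dsa_slave_priv),
                                   num_txq, 1);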
--
Florian