* multi-queue over IFF_NO_QUEUE "virtual" devices
From: Florian Fainelli @ 2017-08-07 22:26 UTC
  To: netdev, jiri, jhs, xiyou.wangcong; +Cc: davem, andrew, vivien.didelot

Hi,

Most DSA-supported Broadcom switches have multiple queues per port
(usually 8), and each of these queues can be configured with different
pause, drop, and hysteresis thresholds (among other settings) in order
to make use of the switch's internal buffering scheme and have some
queues achieve a kind of lossless behavior (e.g. LAN-to-LAN traffic on
Q7 has a higher priority than LAN-to-WAN traffic on Q0).

This is obviously very workload specific, so I want as much
programmability as possible.

This brings me to a few questions:

1) If the DSA slave network devices, currently flagged with
IFF_NO_QUEUE, become multi-queue aware (on TX) such that an application
can control exactly which switch egress queue is used on a per-flow
basis, would that be a problem (this is the dynamic selection of the TX
queue)?
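
For illustration, here is a minimal userspace sketch of the kind of
per-flow control I have in mind. It assumes skb->priority ends up
selecting the TX queue on the slave device (e.g. through an
mqprio-style priority-to-queue map); that mapping is an assumption of
the sketch, not something DSA does today:

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int prio = 7;	/* aim this flow at switch egress queue 7 */

	if (fd < 0)
		return 1;
	/* SO_PRIORITY sets skb->priority for this socket's traffic;
	 * with a priority-to-queue mapping on the multi-queue slave
	 * device this would select the egress queue per flow. */
	if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY,
		       &prio, sizeof(prio)) < 0)
		perror("setsockopt(SO_PRIORITY)");
	return 0;
}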

2) The conduit (CPU port) network interface has a congestion control
scheme which requires each of its TX queues (32 or 16) to be statically
mapped to the underlying switch port queues, because the congestion HW
needs to inspect the switch's queue depths in order to accept/reject a
packet at the CPU's TX ring level. Do we have a good way with tc to map
a virtual/stacked device's queue(s) on top of its physical/underlying
device's queues (this is the static queue mapping necessary for
congestion control to work)?

Let me know if you think this is the right approach or not.

Thanks!
-- 
Florian

* Re: multi-queue over IFF_NO_QUEUE "virtual" devices
From: Florian Fainelli @ 2017-08-30  3:49 UTC
  To: netdev, jiri, jhs, xiyou.wangcong, andrew; +Cc: davem, vivien.didelot

On 08/07/17 15:26, Florian Fainelli wrote:
> Hi,
> 
> Most DSA-supported Broadcom switches have multiple queues per port
> (usually 8), and each of these queues can be configured with different
> pause, drop, and hysteresis thresholds (among other settings) in order
> to make use of the switch's internal buffering scheme and have some
> queues achieve a kind of lossless behavior (e.g. LAN-to-LAN traffic on
> Q7 has a higher priority than LAN-to-WAN traffic on Q0).
> 
> This is obviously very workload specific, so I want as much
> programmability as possible.
> 
> This brings me to a few questions:
> 
> 1) If the DSA slave network devices, currently flagged with
> IFF_NO_QUEUE, become multi-queue aware (on TX) such that an
> application can control exactly which switch egress queue is used on
> a per-flow basis, would that be a problem (this is the dynamic
> selection of the TX queue)?

So I have this part figured out: with a bunch of changes, the network
devices created by DSA are now multi-queue aware, and the Broadcom tag
layer is capable of extracting the queue index, passing it in the tag
where expected, and having the switch forward to the appropriate switch
port and queue within that port. It also sets the queue mapping in the
SKB for later consumption by the master network device driver
(bcmsysport.c), because of 2).
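
A condensed sketch of what that xmit path does; the tag length, field
positions and the *_sketch names below are illustrative only, the real
layout lives in net/dsa/tag_brcm.c:

#include <linux/etherdevice.h>
#include <linux/skbuff.h>

#define SKETCH_TAG_LEN	4	/* hypothetical 4-byte Broadcom tag */
#define SKETCH_TC_SHIFT	2	/* hypothetical TC field position */

static struct sk_buff *brcm_tag_xmit_sketch(struct sk_buff *skb,
					    unsigned int port)
{
	u16 queue = skb_get_queue_mapping(skb);
	u8 *tag;

	if (skb_cow_head(skb, SKETCH_TAG_LEN) < 0)
		return NULL;

	/* insert the tag between the MAC addresses and the EtherType */
	skb_push(skb, SKETCH_TAG_LEN);
	memmove(skb->data, skb->data + SKETCH_TAG_LEN, 2 * ETH_ALEN);
	tag = skb->data + 2 * ETH_ALEN;

	memset(tag, 0, SKETCH_TAG_LEN);
	tag[1] = (queue & 0x7) << SKETCH_TC_SHIFT;	/* egress queue */
	tag[3] = port & 0xf;				/* destination port */

	/* skb->queue_mapping is left set so that the master driver
	 * (bcmsysport.c) can pick the matching TX ring, see 2) below */
	return skb;
}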

> 
> 2) The conduit (CPU port) network interface has a congestion control
> scheme which requires each of its TX queues (32 or 16) to be
> statically mapped to the underlying switch port queues, because the
> congestion HW needs to inspect the switch's queue depths in order to
> accept/reject a packet at the CPU's TX ring level. Do we have a good
> way with tc to map a virtual/stacked device's queue(s) on top of its
> physical/underlying device's queues (this is the static queue mapping
> necessary for congestion control to work)?

That part I have not figured out yet. With some static mapping I can
obtain the results that I want, and I was even considering doing
something like this:

- register a network device notifier with bcmsysport.c (the master
network device) for this setup
- expose a helper function allowing me to obtain a given DSA network
device's port index
- whenever DSA creates network devices, reconfigure the ring and queue
mapping of the TX queues managed by bcmsysport.c based on the DSA
network device port index that has just been registered, and do a 1:1
mapping of the 8 queues (see the sketch after the table below)

You would end up with something like:

gphy (port 0) queues 0-7 mapped to systemport queues 0-7
rgmii_1 (port 1) queues 0-7 mapped to systemport queues 8-15
rgmii_2 (port 2) queues 0-7 mapped to systemport queues 16-23
moca (port 7) queues 0-7 mapped to systemport queues 24-31
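
A sketch of that notifier approach; the *_sketch helpers are
hypothetical stand-ins for the helper from the second bullet and for
bcmsysport's internal ring bookkeeping, not existing kernel API:

#include <linux/netdevice.h>
#include <linux/notifier.h>

/* hypothetical helpers, not existing kernel API */
static bool is_dsa_slave_sketch(struct net_device *dev);
static unsigned int dsa_slave_port_index_sketch(struct net_device *dev);
static void assign_tx_ring_sketch(unsigned int ring, unsigned int port,
				  unsigned int queue);

static int bcm_sysport_netdev_event_sketch(struct notifier_block *nb,
					   unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	unsigned int port, q;

	if (event != NETDEV_REGISTER || !is_dsa_slave_sketch(dev))
		return NOTIFY_DONE;

	port = dsa_slave_port_index_sketch(dev);

	/* 1:1 mapping: switch port N, queue Q -> systemport ring
	 * N * 8 + Q, matching the table above */
	for (q = 0; q < 8; q++)
		assign_tx_ring_sketch(port * 8 + q, port, q);

	return NOTIFY_OK;
}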

This should work because bcmsysport's TX queues are not under direct
user control; they are used via DSA-created network devices, which
indicate the queue they want to use. When the DSA interfaces are
brought down, their respective systemport queues become unused. This
also works because the number of physical ports of the switch times
the number of queues matches the number of TX queues on systemport (as
if someone designed it with that exact purpose in mind ;)).

The only problem with that approach, of course, is that it embeds a
policy within the systemport driver.

Ideally I would like to configure this via tc by setting up a mapping
between the queues of one network device and the queues of another
network device. Is that possible, Jamal, Cong, Jiri, do you know?
-- 
Florian

* Re: multi-queue over IFF_NO_QUEUE "virtual" devices
From: Cong Wang @ 2017-08-30 23:37 UTC
  To: Florian Fainelli
  Cc: Linux Kernel Network Developers, Jiri Pirko, Jamal Hadi Salim,
	andrew, David Miller, Vivien Didelot

On Tue, Aug 29, 2017 at 8:49 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> [...]
>
> Ideally I would like to configure this via tc by setting up a mapping
> between the queues of one network device and the queues of another
> network device. Is that possible, Jamal, Cong, Jiri, do you know?

I am not sure I understand the mapping you are talking about here.

The TC layer rarely deals with hardware queues directly (except
probably mq), so this question probably doesn't belong to TC.

OTOH, TC can modify skb->hash, so you can redirect packets to a
specific queue, but this doesn't sound like what you are looking for.
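
(For reference, a simplified version of skb_tx_hash() from
net/core/dev.c, which is roughly how skb->hash becomes a TX queue when
no queue mapping is set; subqueue offsets are omitted:)

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static u16 pick_tx_queue_sketch(struct net_device *dev,
				struct sk_buff *skb)
{
	/* hash the flow, then scale into [0, real_num_tx_queues) */
	return (u16)reciprocal_scale(skb_get_hash(skb),
				     dev->real_num_tx_queues);
}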

Maybe Jiri has more thoughts here, since he works on TC offloading.

* Re: multi-queue over IFF_NO_QUEUE "virtual" devices
From: Florian Fainelli @ 2017-08-31  0:30 UTC
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jiri Pirko, Jamal Hadi Salim,
	andrew, David Miller, Vivien Didelot

On 08/30/2017 04:37 PM, Cong Wang wrote:
> On Tue, Aug 29, 2017 at 8:49 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>> [...]
>>
>> Ideally I would like to configure this via tc by setting up a mapping
>> between the queues of one network device and the queues of another
>> network device. Is that possible, Jamal, Cong, Jiri, do you know?
>
> I am not sure I understand the mapping you are talking about here.
>
> The TC layer rarely deals with hardware queues directly (except
> probably mq), so this question probably doesn't belong to TC.
>
> OTOH, TC can modify skb->hash, so you can redirect packets to a
> specific queue, but this doesn't sound like what you are looking for.

I am actually building on TC being able to influence the value of
skb->queue_mapping, but that only applies to the stacked devices, not
to the underlying conduit device that does the actual transmission.
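
The relevant branch of the skbedit action, paraphrased from
net/sched/act_skbedit.c (variable names simplified), is essentially:

#include <linux/skbuff.h>
#include <linux/tc_act/tc_skbedit.h>

static void skbedit_requeue_sketch(struct sk_buff *skb, u32 flags,
				   u16 queue_mapping)
{
	/* bounds-checked overwrite of the queue selected on the
	 * stacked device */
	if (flags & SKBEDIT_F_QUEUE_MAPPING &&
	    skb->dev->real_num_tx_queues > queue_mapping)
		skb_set_queue_mapping(skb, queue_mapping);
}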

> 
> Maybe Jiri has more thoughts here, since he works on TC offloading.
> 

Patches with explanations and context (hopefully clearer) here:

http://patchwork.ozlabs.org/project/netdev/list/?series=728

Thanks!
-- 
Florian
