* Kernel interface to configure queue-group parameters
@ 2023-02-07 0:15 Nambiar, Amritha
2023-02-07 16:28 ` Alexander H Duyck
0 siblings, 1 reply; 10+ messages in thread
From: Nambiar, Amritha @ 2023-02-07 0:15 UTC (permalink / raw)
To: netdev
Cc: davem, kuba, edumazet, pabeni, Saeed Mahameed, alexander.duyck,
Samudrala, Sridhar
Hello,
We are looking for feedback on the kernel interface to configure
queue-group level parameters.
Queues are primary residents in the kernel and there are multiple
interfaces to configure queue-level parameters. For example, tx_maxrate
for a transmit queue can be controlled via the sysfs interface. Ethtool
is another option to change the RX/TX ring parameters of the specified
network device (example, rx-buf-len, tx-push etc.).
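To make the existing queue-level interfaces concrete, here is roughly what they look like today (the device and queue names below are placeholders; rx-buf-len and tx-push need a recent ethtool and driver support):

```shell
# Cap a single Tx queue's rate via sysfs; the value is in Mbps,
# 0 means unlimited.
echo 1000 > /sys/class/net/eth0/queues/tx-0/tx_maxrate

# Change ring parameters for the whole device via ethtool.
ethtool -G eth0 rx-buf-len 2048
ethtool -G eth0 tx-push on
```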
A queue_group is a set of queues grouped together into a single object.
For example, tx_queue_group-0 is a transmit queue_group with index 0 and
may contain transmit queues, say, 0-31; similarly, rx_queue_group-0 is a
receive queue_group with index 0 and may contain receive queues 0-31,
while tx/rx_queue_group-1 may consist of TX and RX queues, say, 32-127
respectively. Currently, upstream drivers for both ice and mlx5 support
creating TX and RX queue groups via the tc-mqprio and ethtool interfaces.
At this point, the kernel does not have an abstraction for queue_group.
A close equivalent in the kernel is a 'traffic class' which consists of
a set of transmit queues. Today, traffic classes are created using TC's
mqprio scheduler. Only a limited set of parameters can be configured on
each traffic class via mqprio, example priority per traffic class, min
and max bandwidth rates per traffic class etc. Mqprio also supports
offload of these parameters to the hardware. The parameters set for the
traffic class (tx queue_group) are applicable to all transmit queues
belonging to the queue_group. However, introducing additional parameters
for queue_groups and configuring them via mqprio makes the interface
less user-friendly (as the command line gets cumbersome due to the
number of qdisc parameters). Although mqprio is the interface to create
transmit queue_groups, and is also the interface to configure and
offload certain transmit queue_group parameters, due to these
limitations we are wondering if it is worth considering other interface
options for configuring queue_group parameters.
Likewise, receive queue_groups can be created using the ethtool
interface as RSS contexts. Next step would be to configure
per-rx_queue_group parameters. Based on the discussion in
https://lore.kernel.org/netdev/20221114091559.7e24c7de@kernel.org/,
it looks like ethtool may not be the right interface to configure
rx_queue_group parameters (that are unrelated to flow<->queue
assignment), example NAPI configurations on the queue_group.
The key gaps in the kernel to support queue-group parameters are:
1. 'queue_group' abstraction in the kernel for both TX and RX distinctly
2. Offload hooks for TX/RX queue_group parameters depending on the
chosen interface.
Following are the options we have investigated:
1. tc-mqprio:
Pros:
- Already supports creating queue_groups, offload of certain parameters
Cons:
- Introducing new parameters makes the interface less user-friendly.
TC qdisc parameters are specified at qdisc creation; the larger the
number of traffic classes and their respective parameters, the lower the
usability.
2. Ethtool:
Pros:
- Already creates RX queue_groups as RSS contexts
Cons:
- May not be the right interface for non-RSS related parameters
Example for configuring number of napi pollers for a queue group:
ethtool -X <iface> context <context_num> num_pollers <n>
3. sysfs:
Pros:
- Ideal to configure parameters such as NAPI/IRQ for Rx queue_group.
- Makes it possible to support some existing per-netdev napi
parameters like 'threaded' and 'napi_defer_hard_irqs' etc. to be
per-queue-group parameters.
Cons:
- Requires introducing new queue_group structures for TX and RX
queue groups and references for it, kset references for queue_group in
struct net_device
- Additional ndo ops in net_device_ops for each parameter for
hardware offload.
Examples :
/sys/class/net/<iface>/queue_groups/rxqg-<0-n>/num_pollers
/sys/class/net/<iface>/queue_groups/txqg-<0-n>/min_rate
4. Devlink:
Pros:
- New parameters can be added without any changes to the kernel or
userspace.
Cons:
- Queue/Queue_group is a function-wide entity, Devlink is for
device-wide stuff. Devlink being device centric is not suitable for
queue parameters such as rates, NAPI etc.
Thanks,
Amritha
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Kernel interface to configure queue-group parameters
2023-02-07 0:15 Kernel interface to configure queue-group parameters Nambiar, Amritha
@ 2023-02-07 16:28 ` Alexander H Duyck
2023-02-09 0:36 ` Jakub Kicinski
2023-02-16 10:35 ` Nambiar, Amritha
0 siblings, 2 replies; 10+ messages in thread
From: Alexander H Duyck @ 2023-02-07 16:28 UTC (permalink / raw)
To: Nambiar, Amritha, netdev
Cc: davem, kuba, edumazet, pabeni, Saeed Mahameed, Samudrala, Sridhar
On Mon, 2023-02-06 at 16:15 -0800, Nambiar, Amritha wrote:
> Hello,
>
> We are looking for feedback on the kernel interface to configure
> queue-group level parameters.
>
> Queues are primary residents in the kernel and there are multiple
> interfaces to configure queue-level parameters. For example, tx_maxrate
> for a transmit queue can be controlled via the sysfs interface. Ethtool
> is another option to change the RX/TX ring parameters of the specified
> network device (example, rx-buf-len, tx-push etc.).
>
> Queue_groups are a set of queues grouped together into a single object.
> For example, tx_queue_group-0 is a transmit queue_group with index 0 and
> can have transmit queues say 0-31, similarly rx_queue_group-0 is a
> receive queue_group with index 0 and can have receive queues 0-31,
> tx/rx_queue_group_1 may consist of TX and RX queues say 32-127
> respectively. Currently, upstream drivers for both ice and mlx5 support
> creating TX and RX queue groups via the tc-mqprio and ethtool interfaces.
>
> At this point, the kernel does not have an abstraction for queue_group.
> A close equivalent in the kernel is a 'traffic class' which consists of
> a set of transmit queues. Today, traffic classes are created using TC's
> mqprio scheduler. Only a limited set of parameters can be configured on
> each traffic class via mqprio, example priority per traffic class, min
> and max bandwidth rates per traffic class etc. Mqprio also supports
> offload of these parameters to the hardware. The parameters set for the
> traffic class (tx queue_group) are applicable to all transmit queues
> belonging to the queue_group. However, introducing additional parameters
> for queue_groups and configuring them via mqprio makes the interface
> less user-friendly (as the command line gets cumbersome due to the
> number of qdisc parameters). Although, mqprio is the interface to create
> transmit queue_groups, and is also the interface to configure and
> offload certain transmit queue_group parameters, due to these
> limitations we are wondering if it is worth considering other interface
> options for configuring queue_group parameters.
>
I think much of this depends on exactly what functionality we are
talking about. The problem is the Intel use case conflates interrupts
w/ queues w/ the applications themselves since what it is trying to do
is a poor imitation of RDMA being implemented using something akin to
VMDq last I knew.
So for example one of the things you are asking about below is
establishing a minimum rate for outgoing Tx packets. In my mind we
would probably want to use something like mqprio to set that up since
it is Tx rate limiting and if we were to configure it to happen in
software it would need to be handled in the Qdisc layer.
As far as the NAPI pollers attribute that seems like something that
needs further clarification. Are you limiting the number of busy poll
instances that can run on a single queue group? Is there a reason for
doing it per queue group instead of this being something that could be
enabled on a specific set of queues within the group?
> Likewise, receive queue_groups can be created using the ethtool
> interface as RSS contexts. Next step would be to configure
> per-rx_queue_group parameters. Based on the discussion in
> https://lore.kernel.org/netdev/20221114091559.7e24c7de@kernel.org/,
> it looks like ethtool may not be the right interface to configure
> rx_queue_group parameters (that are unrelated to flow<->queue
> assignment), example NAPI configurations on the queue_group.
>
> The key gaps in the kernel to support queue-group parameters are:
> 1. 'queue_group' abstraction in the kernel for both TX and RX distinctly
> 2. Offload hooks for TX/RX queue_group parameters depending on the
> chosen interface.
>
> Following are the options we have investigated:
>
> 1. tc-mqprio:
> Pros:
> - Already supports creating queue_groups, offload of certain parameters
>
> Cons:
> - Introducing new parameters makes the interface less user-friendly.
> TC qdisc parameters are specified at qdisc creation; the larger the
> number of traffic classes and their respective parameters, the lower the
> usability.
Yes and no. The TC layer is mostly meant for handling the Tx side of
things. For something like the rate limiting it might make sense since
there is already logic there to do it in mqprio. But if you are trying
to pull in NAPI or RSS attributes then I agree it would hurt usability.
> 2. Ethtool:
> Pros:
> - Already creates RX queue_groups as RSS contexts
>
> Cons:
> - May not be the right interface for non-RSS related parameters
>
> Example for configuring number of napi pollers for a queue group:
> ethtool -X <iface> context <context_num> num_pollers <n>
One thing that might make sense would be to look at adding a possible
alias for context that could work with something like DCB or the queue
groups use case. I believe that for DCB there is a similar issue where
the various priorities could have separate RSS contexts so it might
make sense to look at applying a similar logic. Also there has been
talk about trying to do the round robin on SYN type logic. That
might make sense to expose as a hfunc type since it would be overriding
RSS for TCP flows.
The num_pollers can be problematic though as we don't really have
anything like that in ethtool currently. Probably the closest thing I
can think of is interrupt moderation. It depends on if it has to be a
per-queue-group attribute or if it could be a per-queue attribute.
Specifically I am referring to the -Q option that is currently applied
to the coalescing functions in ethtool.
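For reference, the -Q/--per-queue mechanism mentioned here takes a hex bitmap of queue indices, roughly as below (eth0 and the mask are placeholders; driver support is required):

```shell
# Show coalescing settings for queues 0-3 only (queue_mask is a hex
# bitmap of queue indices).
ethtool --per-queue eth0 queue_mask 0x0f --show-coalesce

# Apply interrupt moderation to the same set of queues.
ethtool --per-queue eth0 queue_mask 0x0f --coalesce rx-usecs 50 tx-usecs 50
```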
> 3. sysfs:
> Pros:
> - Ideal to configure parameters such as NAPI/IRQ for Rx queue_group.
> - Makes it possible to support some existing per-netdev napi
> parameters like 'threaded' and 'napi_defer_hard_irqs' etc. to be
> per-queue-group parameters.
>
> Cons:
> - Requires introducing new queue_group structures for TX and RX
> queue groups and references for it, kset references for queue_group in
> struct net_device
> - Additional ndo ops in net_device_ops for each parameter for
> hardware offload.
>
> Examples :
> /sys/class/net/<iface>/queue_groups/rxqg-<0-n>/num_pollers
> /sys/class/net/<iface>/queue_groups/txqg-<0-n>/min_rate
So min_rate is something already handled in mqprio since it is so DCB
like. You are essentially guaranteeing bandwidth aren't you? Couldn't
you just define a bw_rlimit shaper for mqprio and then use the existing
bw_rlimit values to define the min_rate?
As far as adding the queue_groups interface one ugly bit would be that
we would probably need to have links between the queues and these
groups which would start to turn the sysfs into a tangled mess.
The biggest issue I see is that there isn't any sort of sysfs interface
exposed for NAPI which is what you would essentially need to justify
something like this since that is what you are modifying.
> 4. Devlink:
> Pros:
> - New parameters can be added without any changes to the kernel or
> userspace.
>
> Cons:
> - Queue/Queue_group is a function-wide entity, Devlink is for
> device-wide stuff. Devlink being device centric is not suitable for
> queue parameters such as rates, NAPI etc.
Yeah, I wouldn't expect something like this to be a good fit.
* Re: Kernel interface to configure queue-group parameters
2023-02-07 16:28 ` Alexander H Duyck
@ 2023-02-09 0:36 ` Jakub Kicinski
2023-02-16 10:34 ` Nambiar, Amritha
2023-02-16 10:35 ` Nambiar, Amritha
1 sibling, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2023-02-09 0:36 UTC (permalink / raw)
To: Alexander H Duyck
Cc: Nambiar, Amritha, netdev, davem, edumazet, pabeni,
Saeed Mahameed, Samudrala, Sridhar
On Tue, 07 Feb 2023 08:28:56 -0800 Alexander H Duyck wrote:
> I think much of this depends on exactly what functionality we are
> talking about.
Right, maybe we need to take a page out of the container's book and
concede that the best we can do is provide targeted APIs for slices of
the problem. Which someone in user space would have to combine.
> > 4. Devlink:
> > Pros:
> > - New parameters can be added without any changes to the kernel or
> > userspace.
> >
> > Cons:
> > - Queue/Queue_group is a function-wide entity, Devlink is for
> > device-wide stuff. Devlink being device centric is not suitable for
> > queue parameters such as rates, NAPI etc.
>
> Yeah, I wouldn't expect something like this to be a good fit.
Devlink has the hierarchical rate API for example.
Maybe we should (re)consider adding top level nodes for RSS contexts
there?
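For context, the devlink hierarchical rate API mentioned above groups rate leaves under user-created nodes, roughly as below (the device address, group name, and values are placeholders; exact syntax depends on the devlink and driver versions):

```shell
# Create a rate group node with shared and max rates.
devlink port function rate add pci/0000:03:00.0/group1 \
    tx_share 1gbit tx_max 5gbit

# Attach a port function's rate leaf to that group.
devlink port function rate set pci/0000:03:00.0/1 parent group1
```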
* Re: Kernel interface to configure queue-group parameters
2023-02-09 0:36 ` Jakub Kicinski
@ 2023-02-16 10:34 ` Nambiar, Amritha
0 siblings, 0 replies; 10+ messages in thread
From: Nambiar, Amritha @ 2023-02-16 10:34 UTC (permalink / raw)
To: Jakub Kicinski, Alexander H Duyck
Cc: netdev, davem, edumazet, pabeni, Saeed Mahameed, Samudrala, Sridhar
On 2/8/2023 4:36 PM, Jakub Kicinski wrote:
> On Tue, 07 Feb 2023 08:28:56 -0800 Alexander H Duyck wrote:
>> I think much of this depends on exactly what functionality we are
>> talking about.
>
> Right, maybe we need to take a page out of the container's book and
> concede that the best we can do is provide targeted APIs for slices of
> the problem. Which someone in user space would have to combine.
>
Agree, a common interface for various parameters for the queue-group
does not seem like a practical approach and the interface to use is
largely driven by the functionality itself.
>>> 4. Devlink:
>>> Pros:
>>> - New parameters can be added without any changes to the kernel or
>>> userspace.
>>>
>>> Cons:
>>> - Queue/Queue_group is a function-wide entity, Devlink is for
>>> device-wide stuff. Devlink being device centric is not suitable for
>>> queue parameters such as rates, NAPI etc.
>>
>> Yeah, I wouldn't expect something like this to be a good fit.
>
> Devlink has the hierarchical rate API for example.
> Maybe we should (re)consider adding top level nodes for RSS contexts
> there?
* Re: Kernel interface to configure queue-group parameters
2023-02-07 16:28 ` Alexander H Duyck
2023-02-09 0:36 ` Jakub Kicinski
@ 2023-02-16 10:35 ` Nambiar, Amritha
2023-02-16 17:32 ` Jakub Kicinski
2023-02-19 17:39 ` Alexander H Duyck
1 sibling, 2 replies; 10+ messages in thread
From: Nambiar, Amritha @ 2023-02-16 10:35 UTC (permalink / raw)
To: Alexander H Duyck, netdev
Cc: davem, kuba, edumazet, pabeni, Saeed Mahameed, Samudrala, Sridhar
On 2/7/2023 8:28 AM, Alexander H Duyck wrote:
> On Mon, 2023-02-06 at 16:15 -0800, Nambiar, Amritha wrote:
>> Hello,
>>
>> We are looking for feedback on the kernel interface to configure
>> queue-group level parameters.
>>
>> Queues are primary residents in the kernel and there are multiple
>> interfaces to configure queue-level parameters. For example, tx_maxrate
>> for a transmit queue can be controlled via the sysfs interface. Ethtool
>> is another option to change the RX/TX ring parameters of the specified
>> network device (example, rx-buf-len, tx-push etc.).
>>
>> Queue_groups are a set of queues grouped together into a single object.
>> For example, tx_queue_group-0 is a transmit queue_group with index 0 and
>> can have transmit queues say 0-31, similarly rx_queue_group-0 is a
>> receive queue_group with index 0 and can have receive queues 0-31,
>> tx/rx_queue_group_1 may consist of TX and RX queues say 32-127
>> respectively. Currently, upstream drivers for both ice and mlx5 support
>> creating TX and RX queue groups via the tc-mqprio and ethtool interfaces.
>>
>> At this point, the kernel does not have an abstraction for queue_group.
>> A close equivalent in the kernel is a 'traffic class' which consists of
>> a set of transmit queues. Today, traffic classes are created using TC's
>> mqprio scheduler. Only a limited set of parameters can be configured on
>> each traffic class via mqprio, example priority per traffic class, min
>> and max bandwidth rates per traffic class etc. Mqprio also supports
>> offload of these parameters to the hardware. The parameters set for the
>> traffic class (tx queue_group) are applicable to all transmit queues
>> belonging to the queue_group. However, introducing additional parameters
>> for queue_groups and configuring them via mqprio makes the interface
>> less user-friendly (as the command line gets cumbersome due to the
>> number of qdisc parameters). Although, mqprio is the interface to create
>> transmit queue_groups, and is also the interface to configure and
>> offload certain transmit queue_group parameters, due to these
>> limitations we are wondering if it is worth considering other interface
>> options for configuring queue_group parameters.
>>
>
> I think much of this depends on exactly what functionality we are
> talking about. The problem is the Intel use case conflates interrupts
> w/ queues w/ the applications themselves since what it is trying to do
> is a poor imitation of RDMA being implemented using something akin to
> VMDq last I knew.
>
> So for example one of the things you are asking about below is
> establishing a minimum rate for outgoing Tx packets. In my mind we
> would probably want to use something like mqprio to set that up since
> it is Tx rate limiting and if we were to configure it to happen in
> software it would need to be handled in the Qdisc layer.
>
Configuring min and max rates for outgoing TX packets is already
supported in the ice driver using mqprio. The issue is that dynamically
changing the rates per traffic class/queue_group via mqprio is not
straightforward, the "tc qdisc change" command will need all the rates
for traffic classes again, even for the tcs where rates are not being
changed.
For example, here's the sample command to configure min and max rates on
4 TX queue groups:
# tc qdisc add dev $iface root mqprio \
num_tc 4 \
map 0 1 2 3 \
queues 2@0 4@2 8@6 16@14 \
hw 1 mode channel \
shaper bw_rlimit \
min_rate 1Gbit 2Gbit 2Gbit 1Gbit \
max_rate 4Gbit 5Gbit 5Gbit 10Gbit
Now, changing TX min_rate for TC1 to 20 Gbit:
# tc qdisc change dev $iface root mqprio \
shaper bw_rlimit min_rate 1Gbit 20Gbit 2Gbit 1Gbit
Although this is not a major concern, I was looking for the simplicity
that something like sysfs provides with tx_maxrate for a queue, so that
when there is a large number of TCs, only the ones being changed need
to be touched (if we were to have sysfs rates per queue_group).
> As far as the NAPI pollers attribute that seems like something that
> needs further clarification. Are you limiting the number of busy poll
> instances that can run on a single queue group? Is there a reason for
> doing it per queue group instead of this being something that could be
> enabled on a specific set of queues within the group?
>
Yes, we are trying to limit the number of napi instances for the queues
within a queue-group. Some options we could use:
1. A 'num_pollers' attribute at the queue_group level - The initial idea
was to configure the number of poller threads that would be handling the
queues within the queue_group, as an example, a num_poller value of 4 on
a queue_group consisting of 4 queues would imply that there is a poller
per queue. This could also be changed to something like a single poller
for all 4 queues within the group.
2. A poller bitmap for each queue (both TX and RX) - The main concern
with the queue-level maps is that it would still be nice to have a
higher level queue-group isolation, so that a poller is not shared among
queues belonging to distinct queue-groups. Also, a queue-group level
config would consolidate the mapping of queues and vectors in the driver
in batches, instead of the driver having to update the queue<->vector
mapping in response to each queue's poller configuration.
But we could do away with having these at queue-group level, and instead
use a different method as the third option below:
3. A queues bitmap per napi instance - So the default arrangement today
is 1:1 mapping between queues and interrupt vectors and hence 1:1
queue<->poller association. If the user could configure one interrupt
vector to serve different queues, these queues can be serviced by the
poller/napi instance for the vector.
One way to do this is to have a bitmap of queues for each IRQ allocated
for the device (similar to smp_affinity CPUs bitmap for the given IRQ).
So, /sys/class/net/<iface>/device/msi_irqs/ lists all the IRQs
associated with the network interface. If the IRQ can take an additional
attribute like queues_affinity for the IRQs on the network device (use
/sys/class/net/<iface>/device/msi_irqs/N/ since queues_affinity would be
specific to the network subsystem), this would enable multiple queues
<-> single vector association configurable by the user. The driver would
validate that a queue is not mapped to multiple interrupts. This way an
interrupt can be shared among different queues as configured by the user.
Another approach is to expose the napi-ids via sysfs and support
per-napi attributes.
/sys/class/net/<iface>/napis/napi<0-N>
Each numbered sub-directory N contains attributes of that napi. A
'napi_queues' attribute would be a bitmap of queues associated with the
napi-N enabling many queues <-> single napi association. Example,
/sys/class/net/<iface>/napis/napi-N/napi_queues
We also plan to introduce an additional napi attribute for each napi
instance called 'poller_timeout' indicating the timeout in jiffies.
Exposing the napi-ids would also enable moving some existing napi
attributes such as 'threaded' and 'napi_defer_hard_irqs' etc. (which are
currently per netdev) to be per napi instance.
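To make the proposal concrete, the hypothetical per-napi sysfs layout sketched above might be exercised as below (every path and attribute name here is part of the proposal, not an existing kernel interface):

```shell
# Hypothetical: associate queues 0 and 1 (bitmap 0x3) with napi-0.
echo 3 > /sys/class/net/$IFACE/napis/napi-0/napi_queues

# Hypothetical: set the poller timeout (in jiffies) for napi-0.
echo 200 > /sys/class/net/$IFACE/napis/napi-0/poller_timeout
```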
>> Likewise, receive queue_groups can be created using the ethtool
>> interface as RSS contexts. Next step would be to configure
>> per-rx_queue_group parameters. Based on the discussion in
>> https://lore.kernel.org/netdev/20221114091559.7e24c7de@kernel.org/,
>> it looks like ethtool may not be the right interface to configure
>> rx_queue_group parameters (that are unrelated to flow<->queue
>> assignment), example NAPI configurations on the queue_group.
>>
>> The key gaps in the kernel to support queue-group parameters are:
>> 1. 'queue_group' abstraction in the kernel for both TX and RX distinctly
>> 2. Offload hooks for TX/RX queue_group parameters depending on the
>> chosen interface.
>>
>> Following are the options we have investigated:
>>
>> 1. tc-mqprio:
>> Pros:
>> - Already supports creating queue_groups, offload of certain parameters
>>
>> Cons:
>> - Introducing new parameters makes the interface less user-friendly.
>> TC qdisc parameters are specified at qdisc creation; the larger the
>> number of traffic classes and their respective parameters, the lower the
>> usability.
>
> Yes and no. The TC layer is mostly meant for handling the Tx side of
> things. For something like the rate limiting it might make sense since
> there is already logic there to do it in mqprio. But if you are trying
> to pull in NAPI or RSS attributes then I agree it would hurt usability.
>
The TX queue-group parameters supported via mqprio are limited to
priority, min and max rates. I think extending mqprio for a larger set
of TX parameters beyond just rates (say max burst) would bloat up the
command line. But yes, I agree, the TC layer is not the place for NAPI
attributes on TX queues.
>> 2. Ethtool:
>> Pros:
>> - Already creates RX queue_groups as RSS contexts
>>
>> Cons:
>> - May not be the right interface for non-RSS related parameters
>>
>> Example for configuring number of napi pollers for a queue group:
>> ethtool -X <iface> context <context_num> num_pollers <n>
>
> One thing that might make sense would be to look at adding a possible
> alias for context that could work with something like DCB or the queue
> groups use case. I believe that for DCB there is a similar issue where
> the various priorities could have separate RSS contexts so it might
> make sense to look at applying a similar logic. Also there has been
> talk about trying to do the round robin on SYN type logic. That
> might make sense to expose as a hfunc type since it would be overriding
> RSS for TCP flows.
>
For the round robin flow steering of TCP flows (on SYN by overriding RSS
hash), the plan was to add a new 'inline_fd' parameter to ethtool rss
context. Will look into your suggestion for using hfunc type.
> The num_pollers can be problematic though as we don't really have
> anything like that in ethtool currently. Probably the closest thing I
> can think of is interrupt moderation. It depends on if it has to be a
> per-queue-group attribute or if it could be a per-queue attribute.
> Specifically I am referring to the -Q option that is currently applied
> to the coalescing functions in ethtool.
>
>> 3. sysfs:
>> Pros:
>> - Ideal to configure parameters such as NAPI/IRQ for Rx queue_group.
>> - Makes it possible to support some existing per-netdev napi
>> parameters like 'threaded' and 'napi_defer_hard_irqs' etc. to be
>> per-queue-group parameters.
>>
>> Cons:
>> - Requires introducing new queue_group structures for TX and RX
>> queue groups and references for it, kset references for queue_group in
>> struct net_device
>> - Additional ndo ops in net_device_ops for each parameter for
>> hardware offload.
>>
>> Examples :
>> /sys/class/net/<iface>/queue_groups/rxqg-<0-n>/num_pollers
>> /sys/class/net/<iface>/queue_groups/txqg-<0-n>/min_rate
>
> So min_rate is something already handled in mqprio since it is so DCB
> like. You are essentially guaranteeing bandwidth aren't you? Couldn't
> you just define a bw_rlimit shaper for mqprio and then use the existing
> bw_rlimit values to define the min_rate?
>
The ice driver already supports min_rate per queue_group using mqprio. I
was suggesting this in case we happen to have a TX queue_group object,
since dynamically changing rates via mqprio was not that convenient, as I
mentioned above.
> As far as adding the queue_groups interface one ugly bit would be that
> we would probably need to have links between the queues and these
> groups which would start to turn the sysfs into a tangled mess.
>
Agree, maintaining the links between queues and groups is not trivial.
> The biggest issue I see is that there isn't any sort of sysfs interface
> exposed for NAPI which is what you would essentially need to justify
> something like this since that is what you are modifying.
>
Right. Something like /sys/class/net/<iface>/napis/napi<0-N>
Maybe initially there would be as many napis as queues due to the 1:1
association, but as the queue bitmap is tuned per napi, only those
napis that have queue[s] associated with them would be exposed.
>> 4. Devlink:
>> Pros:
>> - New parameters can be added without any changes to the kernel or
>> userspace.
>>
>> Cons:
>> - Queue/Queue_group is a function-wide entity, Devlink is for
>> device-wide stuff. Devlink being device centric is not suitable for
>> queue parameters such as rates, NAPI etc.
>
> Yeah, I wouldn't expect something like this to be a good fit.
* Re: Kernel interface to configure queue-group parameters
2023-02-16 10:35 ` Nambiar, Amritha
@ 2023-02-16 17:32 ` Jakub Kicinski
2023-02-24 9:14 ` Nambiar, Amritha
2023-02-19 17:39 ` Alexander H Duyck
1 sibling, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2023-02-16 17:32 UTC (permalink / raw)
To: Nambiar, Amritha
Cc: Alexander H Duyck, netdev, davem, edumazet, pabeni,
Saeed Mahameed, Samudrala, Sridhar
On Thu, 16 Feb 2023 02:35:35 -0800 Nambiar, Amritha wrote:
> > The biggest issue I see is that there isn't any sort of sysfs interface
> > exposed for NAPI which is what you would essentially need to justify
> > something like this since that is what you are modifying.
>
> Right. Something like /sys/class/net/<iface>/napis/napi<0-N>
> Maybe, initially there would be as many napis as queues due to 1:1
> association, but as the queues bitmap is tuned for the napi, only those
> napis that have queue[s] associated with it would be exposed.
Forget about using sysfs, please. We've been talking about making
"queues first class citizen", mapping to pollers is part of that
problem space. And it's complex enough to be better suited for netlink.
* Re: Kernel interface to configure queue-group parameters
2023-02-16 10:35 ` Nambiar, Amritha
2023-02-16 17:32 ` Jakub Kicinski
@ 2023-02-19 17:39 ` Alexander H Duyck
2023-02-24 9:17 ` Nambiar, Amritha
1 sibling, 1 reply; 10+ messages in thread
From: Alexander H Duyck @ 2023-02-19 17:39 UTC (permalink / raw)
To: Nambiar, Amritha, netdev
Cc: davem, kuba, edumazet, pabeni, Saeed Mahameed, Samudrala, Sridhar
On Thu, 2023-02-16 at 02:35 -0800, Nambiar, Amritha wrote:
> On 2/7/2023 8:28 AM, Alexander H Duyck wrote:
> > On Mon, 2023-02-06 at 16:15 -0800, Nambiar, Amritha wrote:
> > > Hello,
> > >
> > > We are looking for feedback on the kernel interface to configure
> > > queue-group level parameters.
> > >
> > > Queues are primary residents in the kernel and there are multiple
> > > interfaces to configure queue-level parameters. For example, tx_maxrate
> > > for a transmit queue can be controlled via the sysfs interface. Ethtool
> > > is another option to change the RX/TX ring parameters of the specified
> > > network device (example, rx-buf-len, tx-push etc.).
> > >
> > > Queue_groups are a set of queues grouped together into a single object.
> > > For example, tx_queue_group-0 is a transmit queue_group with index 0 and
> > > can have transmit queues say 0-31, similarly rx_queue_group-0 is a
> > > receive queue_group with index 0 and can have receive queues 0-31,
> > > tx/rx_queue_group_1 may consist of TX and RX queues say 32-127
> > > respectively. Currently, upstream drivers for both ice and mlx5 support
> > > creating TX and RX queue groups via the tc-mqprio and ethtool interfaces.
> > >
> > > At this point, the kernel does not have an abstraction for queue_group.
> > > A close equivalent in the kernel is a 'traffic class' which consists of
> > > a set of transmit queues. Today, traffic classes are created using TC's
> > > mqprio scheduler. Only a limited set of parameters can be configured on
> > > each traffic class via mqprio, example priority per traffic class, min
> > > and max bandwidth rates per traffic class etc. Mqprio also supports
> > > offload of these parameters to the hardware. The parameters set for the
> > > traffic class (tx queue_group) are applicable to all transmit queues
> > > belonging to the queue_group. However, introducing additional parameters
> > > for queue_groups and configuring them via mqprio makes the interface
> > > less user-friendly (as the command line gets cumbersome due to the
> > > number of qdisc parameters). Although mqprio is the interface to create
> > > transmit queue_groups, and also the interface to configure and offload
> > > certain transmit queue_group parameters, these limitations make us
> > > wonder whether it is worth considering other interface options for
> > > configuring queue_group parameters.
> > >
> >
> > I think much of this depends on exactly what functionality we are
> > talking about. The problem is the Intel use case conflates interrupts
> > w/ queues w/ the applications themselves since what it is trying to do
> > is a poor imitation of RDMA being implemented using something akin to
> > VMDq last I knew.
> >
> > So for example one of the things you are asking about below is
> > establishing a minimum rate for outgoing Tx packets. In my mind we
> > would probably want to use something like mqprio to set that up since
> > it is Tx rate limiting and if we were to configure it to happen in
> > software it would need to be handled in the Qdisc layer.
> >
>
> Configuring min and max rates for outgoing TX packets is already
> supported in the ice driver using mqprio. The issue is that dynamically
> changing the rates per traffic class/queue_group via mqprio is not
> straightforward: the "tc qdisc change" command needs all the rates for
> the traffic classes again, even for the tcs whose rates are not being
> changed.
> For example, here's the sample command to configure min and max rates on
> 4 TX queue groups:
>
> # tc qdisc add dev $iface root mqprio \
> num_tc 4 \
> map 0 1 2 3 \
> queues 2@0 4@2 8@6 16@14 \
> hw 1 mode channel \
> shaper bw_rlimit \
> min_rate 1Gbit 2Gbit 2Gbit 1Gbit \
> max_rate 4Gbit 5Gbit 5Gbit 10Gbit
>
> Now, changing TX min_rate for TC1 to 20 Gbit:
>
> # tc qdisc change dev $iface root mqprio \
> shaper bw_rlimit min_rate 1Gbit 20Gbit 2Gbit 1Gbit
>
> Although this is not a major concern, I was looking for the simplicity
> that something like sysfs provides with tx_maxrate for a queue, so that
> when there is a large number of tcs, only the ones being changed need
> to be dealt with (if we were to have sysfs rates per queue_group).
So it sounds like there is an interface already, you may just not like
having to work with it due to the userspace tooling. Perhaps the
solution would be to look at fixing things up so that the tooling would
allow you to make changes to individual values. I haven't looked into
the interface much but is there any way to retrieve the current
settings from the Qdisc? If so you might be able to just update tc so
that it would allow incremental updates and fill in the gaps with the
config it already has.
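The incremental-update idea suggested here can be pictured with a short sketch: if tc could read back the current bw_rlimit rates, the tooling could fill in the unchanged values itself and regenerate the full command mqprio requires. The helper below is purely hypothetical (it is not existing tc code; the function name and calling convention are invented for illustration):

```python
# Hypothetical read-modify-write tooling for mqprio bw_rlimit rates:
# take the current per-TC rates, apply only the requested changes, and
# re-emit the complete rate list that "tc qdisc change" insists on.

def build_change_cmd(iface, current_min_rates, updates):
    """current_min_rates: list like ["1Gbit", "2Gbit", "2Gbit", "1Gbit"].
    updates: dict mapping TC index -> new rate string."""
    rates = list(current_min_rates)
    for tc, rate in updates.items():
        rates[tc] = rate  # only the requested TCs change
    return (f"tc qdisc change dev {iface} root mqprio "
            f"shaper bw_rlimit min_rate {' '.join(rates)}")

# Changing only TC1 to 20Gbit still regenerates the full rate list:
cmd = build_change_cmd("eth0",
                       ["1Gbit", "2Gbit", "2Gbit", "1Gbit"],
                       {1: "20Gbit"})
print(cmd)
```

The user would then only name the TC being changed; the tool supplies the rest from the retrieved config.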
> > As far as the NAPI pollers attribute that seems like something that
> > needs further clarification. Are you limiting the number of busy poll
> > instances that can run on a single queue group? Is there a reason for
> > doing it per queue group instead of this being something that could be
> > enabled on a specific set of queues within the group?
> >
>
> Yes, we are trying to limit the number of napi instances for the queues
> within a queue-group. Some options we could use:
>
> 1. A 'num_poller' attribute on a queue_group level - The initial idea
> was to configure the number of poller threads that would be handling the
> queues within the queue_group, as an example, a num_poller value of 4 on
> a queue_group consisting of 4 queues would imply that there is a poller
> per queue. This could also be changed to something like a single poller
> for all 4 queues within the group.
>
> 2. A poller bitmap for each queue (both TX and RX) - The main concern
> with the queue-level maps is that it would still be nice to have a
> higher level queue-group isolation, so that a poller is not shared among
> queues belonging to distinct queue-groups. Also, a queue-group level
> config would consolidate the mapping of queues and vectors in the driver
> in batches, instead of the driver having to update the queue<->vector
> mapping in response to each queue's poller configuration.
>
> But we could do away with having these at queue-group level, and instead
> use a different method as the third option below:
> 3. A queues bitmap per napi instance - So the default arrangement today
> is 1:1 mapping between queues and interrupt vectors and hence 1:1
> queue<->poller association. If the user could configure one interrupt
> vector to serve different queues, these queues can be serviced by the
> poller/napi instance for the vector.
> One way to do this is to have a bitmap of queues for each IRQ allocated
> for the device (similar to smp_affinity CPUs bitmap for the given IRQ).
> So, /sys/class/net/<iface>/device/msi_irqs/ lists all the IRQs
> associated with the network interface. If the IRQ can take an additional
> attribute like queues_affinity for the IRQs on the network device (use
> /sys/class/net/<iface>/device/msi_irqs/N/ since queues_affinity would be
> specific to the network subsystem), this would enable multiple queues
> <-> single vector association configurable by the user. The driver would
> validate that a queue is not mapped to multiple interrupts. This way an
> interrupt can be shared among different queues as configured by the user.
> Another approach is to expose the napi-ids via sysfs and support
> per-napi attributes.
> /sys/class/net/<iface>/napis/napi<0-N>
> Each numbered sub-directory N contains attributes of that napi. A
> 'napi_queues' attribute would be a bitmap of queues associated with the
> napi-N enabling many queues <-> single napi association. Example,
> /sys/class/net/<iface>/napis/napi-N/napi_queues
>
> We also plan to introduce an additional napi attribute for each napi
> instance called 'poller_timeout' indicating the timeout in jiffies.
> Exposing the napi-ids would also enable moving some existing napi
> attributes such as 'threaded' and 'napi_defer_hard_irqs' etc. (which are
> currently per netdev) to be per napi instance.
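The driver-side validation described in option 3 above (many queues may share one interrupt/napi, but each queue must belong to exactly one) can be modeled with a small sketch. This is a toy model only, not driver code; the function name and data shapes are invented:

```python
# Toy model of the per-napi queue bitmap from option 3: check that no
# queue ends up mapped to more than one napi/interrupt, while allowing
# many queues to share a single napi.

def validate_napi_queue_maps(napi_queues):
    """napi_queues: dict of napi_id -> set of queue ids.
    Returns the resulting queue -> napi association, or raises if a
    queue appears under two napis."""
    seen = {}
    for napi_id, queues in napi_queues.items():
        for q in queues:
            if q in seen:
                raise ValueError(
                    f"queue {q} mapped to both napi {seen[q]} and {napi_id}")
            seen[q] = napi_id
    return seen

# Queues 1, 2, 5 share napi 1477; queues 0, 3, 4 share napi 1478:
mapping = validate_napi_queue_maps({1477: {1, 2, 5}, 1478: {0, 3, 4}})
```

A real implementation would do this check when the queues bitmap is written, before reprogramming the queue<->vector mapping.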
This one will require more thought and discussion as the NAPI instances
themselves have been something that was largely hidden and not exposed
to userspace up until now.
However with that said I am pretty certain sysfs isn't the way to go.
> > > Likewise, receive queue_groups can be created using the ethtool
> > > interface as RSS contexts. Next step would be to configure
> > > per-rx_queue_group parameters. Based on the discussion in
> > > https://lore.kernel.org/netdev/20221114091559.7e24c7de@kernel.org/,
> > > it looks like ethtool may not be the right interface to configure
> > > rx_queue_group parameters (that are unrelated to flow<->queue
> > > assignment), example NAPI configurations on the queue_group.
> > >
> > > The key gaps in the kernel to support queue-group parameters are:
> > > 1. 'queue_group' abstraction in the kernel for both TX and RX distinctly
> > > 2. Offload hooks for TX/RX queue_group parameters depending on the
> > > chosen interface.
> > >
> > > Following are the options we have investigated:
> > >
> > > 1. tc-mqprio:
> > > Pros:
> > > - Already supports creating queue_groups, offload of certain parameters
> > >
> > > Cons:
> > > - Introducing new parameters makes the interface less user-friendly.
> > > TC qdisc parameters are specified at the qdisc creation, larger the
> > > number of traffic classes and their respective parameters, lesser the
> > > usability.
> >
> > Yes and no. The TC layer is mostly meant for handling the Tx side of
> > things. For something like the rate limiting it might make sense since
> > there is already logic there to do it in mqprio. But if you are trying
> > to pull in NAPI or RSS attributes then I agree it would hurt usability.
> >
>
> The TX queue-group parameters supported via mqprio are limited to
> priority, min and max rates. I think extending mqprio for a larger set
> of TX parameters beyond just rates (say max burst) would bloat up the
> command line. But yes, I agree, the TC layer is not the place for NAPI
> attributes on TX queues.
The problem is you are reinventing the wheel. It sounds like this
mostly does what you are looking for. If you are going to look at
extending it then you should do so. Otherwise maybe you need to look at
putting together a new Qdisc instead of creating an entirely new
infrastructure since Qdisc is how we would deal with implementing
something like this in software. We shouldn't be bypassing that; we
should be implementing an equivalent in software of what we want to do
in hardware.
> > > 2. Ethtool:
> > > Pros:
> > > - Already creates RX queue_groups as RSS contexts
> > >
> > > Cons:
> > > - May not be the right interface for non-RSS related parameters
> > >
> > > Example for configuring number of napi pollers for a queue group:
> > > ethtool -X <iface> context <context_num> num_pollers <n>
> >
> > One thing that might make sense would be to look at adding a possible
> > alias for context that could work with something like DCB or the queue
> > groups use case. I believe that for DCB there is a similar issue where
> > the various priorities could have separate RSS contexts so it might
> > make sense to look at applying a similar logic. Also there has been
> > talk about trying to do the round-robin-on-SYN type logic. That
> > might make sense to expose as a hfunc type since it would be overriding
> > RSS for TCP flows.
> >
>
> For the round robin flow steering of TCP flows (on SYN by overriding RSS
> hash), the plan was to add a new 'inline_fd' parameter to ethtool rss
> context. Will look into your suggestion for using hfunc type.
It sounds like we are generally thinking in the same area so that is a
good start there.
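To make the "round robin on SYN" steering discussed above concrete, here is a toy model: new TCP flows (identified at SYN) are assigned the next RX queue in the group in round-robin order instead of by RSS hash, and established flows keep their queue. This is purely illustrative; the real mechanism would be a hardware/driver RSS override, and every name here is invented:

```python
# Illustrative model of round-robin-on-SYN flow steering: SYN packets
# pick the next queue in the group; later packets of the same flow are
# steered to the queue already recorded for that flow.
import itertools

def make_syn_steerer(queue_ids):
    cycle = itertools.cycle(queue_ids)
    table = {}  # flow tuple -> assigned queue

    def steer(flow, is_syn):
        if is_syn or flow not in table:
            table[flow] = next(cycle)  # new flow: round-robin assignment
        return table[flow]

    return steer

steer = make_syn_steerer([4, 5, 6])
q = steer(("10.0.0.1", 1111), True)   # first flow lands on queue 4
```

Exposing this as an hfunc type, as suggested, would signal cleanly that the normal RSS hash is being overridden for TCP.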
> >
> > > 3. sysfs:
> > > Pros:
> > > - Ideal to configure parameters such as NAPI/IRQ for Rx queue_group.
> > > - Makes it possible to support some existing per-netdev napi
> > > parameters like 'threaded' and 'napi_defer_hard_irqs' etc. to be
> > > per-queue-group parameters.
> > >
> > > Cons:
> > > - Requires introducing new queue_group structures for TX and RX
> > > queue groups and references for it, kset references for queue_group in
> > > struct net_device
> > > - Additional ndo ops in net_device_ops for each parameter for
> > > hardware offload.
> > >
> > > Examples :
> > > /sys/class/net/<iface>/queue_groups/rxqg-<0-n>/num_pollers
> > > /sys/class/net/<iface>/queue_groups/txqg-<0-n>/min_rate
> >
> > So min_rate is something already handled in mqprio since it is so DCB
> > like. You are essentially guaranteeing bandwidth aren't you? Couldn't
> > you just define a bw_rlimit shaper for mqprio and then use the existing
> > bw_rlimit values to define the min_rate?
> >
>
> The ice driver already supports min_rate per queue_group using mqprio. I
> was suggesting this in case we happen to have a TX queue_group object,
> since dynamically changing rates via mqprio was not handy enough as I
> mentioned above.
Yeah, but based on the description you are rewriting the kernel side
because you don't like dealing with the userspace tools. Again maybe
the solution here would be to look at cleaning up the userspace
interface to add support for reading/retrieving the existing values and
then updating instead of requiring a complete update every time.
What we want to avoid is creating new overhead in the kernel where we
now have yet another way to control Tx rates as each redundant
interface added is that much more overhead that has to be dealt with
throughout the Tx path. If we already have a way to do this with mqprio
let's just support offloading that into hardware rather than adding yet
another Tx rate control.
> > As far as adding the queue_groups interface one ugly bit would be that
> > we would probably need to have links between the queues and these
> > groups which would start to turn the sysfs into a tangled mess.
> >
> Agree, maintaining the links between queues and groups is not trivial.
>
> > The biggest issue I see is that there isn't any sort of sysfs interface
> > exposed for NAPI which is what you would essentially need to justify
> > something like this since that is what you are modifying.
> >
>
> Right. Something like /sys/class/net/<iface>/napis/napi<0-N>
> Maybe, initially there would be as many napis as queues due to 1:1
> association, but as the queues bitmap is tuned for the napi, only those
> napis that have queue[s] associated with it would be exposed.
As Jakub already pointed out adding more sysfs is generally frowned
upon.
* Re: Kernel interface to configure queue-group parameters
2023-02-16 17:32 ` Jakub Kicinski
@ 2023-02-24 9:14 ` Nambiar, Amritha
2023-02-24 19:22 ` Jakub Kicinski
0 siblings, 1 reply; 10+ messages in thread
From: Nambiar, Amritha @ 2023-02-24 9:14 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexander H Duyck, netdev, davem, edumazet, pabeni,
Saeed Mahameed, Samudrala, Sridhar
On 2/16/2023 9:32 AM, Jakub Kicinski wrote:
> On Thu, 16 Feb 2023 02:35:35 -0800 Nambiar, Amritha wrote:
>>> The biggest issue I see is that there isn't any sort of sysfs interface
>>> exposed for NAPI which is what you would essentially need to justify
>>> something like this since that is what you are modifying.
>>
>> Right. Something like /sys/class/net/<iface>/napis/napi<0-N>
>> Maybe, initially there would be as many napis as queues due to 1:1
>> association, but as the queues bitmap is tuned for the napi, only those
>> napis that have queue[s] associated with it would be exposed.
>
> Forget about using sysfs, please. We've been talking about making
> "queues first class citizen", mapping to pollers is part of that
> problem space. And it's complex enough to be better suited for netlink.
Okay. Can ethtool netlink be an option for this? For example,
ethtool --show-napis
Lists all the napi instances and associated queue[s] list for each napi
for the specified network device.
ethtool --set-napi
Configure the attributes (say, queue[s] list) for each napi
napi <napi_id>
The napi instance to configure
queues <q_id1, q_id2, ...>
The queue[s] that are to be serviced by the napi instance.
Example:
ethtool --set-napi eth0 napi 1477 queues 1,2,5
The 'set-napi' command for the napi<->queue[s] association would have
the following effect:
1. If multiple napis are impacted by an update, remove the queue[s]
from the existing napi instance[s] they are associated with.
2. Driver updates queue[s]<->vector mapping and associates with new napi
instance.
3. Report the impacted napis and their new queue[s] lists back to the stack.
The 'show-napi' command should now list all the napis and the updated
queue[s] list.
This could also be extended for other napi attributes beyond queue[s] list.
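The proposed 'set-napi' semantics (detach from the old napi, attach to the new one, report every impacted napi) can be sketched as a small model. Note the interface itself is only a proposal in this thread; nothing below is existing ethtool or kernel behavior:

```python
# Toy model of the proposed set-napi operation: move queues to a napi,
# detaching them from whichever napi currently owns them, and return
# the set of napi ids whose queue lists changed.

def set_napi(napi_queues, napi_id, queues):
    """napi_queues: dict of napi_id -> set of queue ids (mutated in place)."""
    impacted = {napi_id}
    wanted = set(queues)
    for other_id, qs in napi_queues.items():
        if other_id != napi_id and qs & wanted:
            qs -= wanted           # step 1: remove from the previous napi
            impacted.add(other_id)
    napi_queues.setdefault(napi_id, set()).update(wanted)  # step 2: attach
    return impacted                # step 3: report impacted napis

# "ethtool --set-napi eth0 napi 1477 queues 1,2,5" would roughly do:
state = {1477: {0, 3}, 1478: {1, 2, 5}}
impacted = set_napi(state, 1477, [1, 2, 5])
# napi 1478 loses queues 1, 2, 5; napi 1477 now serves 0, 1, 2, 3, 5
```

A subsequent 'show-napi' would then list the updated queue sets for both impacted napis.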
* Re: Kernel interface to configure queue-group parameters
2023-02-19 17:39 ` Alexander H Duyck
@ 2023-02-24 9:17 ` Nambiar, Amritha
0 siblings, 0 replies; 10+ messages in thread
From: Nambiar, Amritha @ 2023-02-24 9:17 UTC (permalink / raw)
To: Alexander H Duyck, netdev
Cc: davem, kuba, edumazet, pabeni, Saeed Mahameed, Samudrala, Sridhar
On 2/19/2023 9:39 AM, Alexander H Duyck wrote:
> On Thu, 2023-02-16 at 02:35 -0800, Nambiar, Amritha wrote:
>> On 2/7/2023 8:28 AM, Alexander H Duyck wrote:
>>> On Mon, 2023-02-06 at 16:15 -0800, Nambiar, Amritha wrote:
>>>> Hello,
>>>>
>>>> We are looking for feedback on the kernel interface to configure
>>>> queue-group level parameters.
>>>>
>>>> Queues are primary residents in the kernel and there are multiple
>>>> interfaces to configure queue-level parameters. For example, tx_maxrate
>>>> for a transmit queue can be controlled via the sysfs interface. Ethtool
>>>> is another option to change the RX/TX ring parameters of the specified
>>>> network device (example, rx-buf-len, tx-push etc.).
>>>>
>>>> Queue_groups are a set of queues grouped together into a single object.
>>>> For example, tx_queue_group-0 is a transmit queue_group with index 0 and
>>>> can have transmit queues say 0-31, similarly rx_queue_group-0 is a
>>>> receive queue_group with index 0 and can have receive queues 0-31,
>>>> tx/rx_queue_group_1 may consist of TX and RX queues say 32-127
>>>> respectively. Currently, upstream drivers for both ice and mlx5 support
>>>> creating TX and RX queue groups via the tc-mqprio and ethtool interfaces.
>>>>
>>>> At this point, the kernel does not have an abstraction for queue_group.
>>>> A close equivalent in the kernel is a 'traffic class' which consists of
>>>> a set of transmit queues. Today, traffic classes are created using TC's
>>>> mqprio scheduler. Only a limited set of parameters can be configured on
>>>> each traffic class via mqprio, example priority per traffic class, min
>>>> and max bandwidth rates per traffic class etc. Mqprio also supports
>>>> offload of these parameters to the hardware. The parameters set for the
>>>> traffic class (tx queue_group) are applicable to all transmit queues
>>>> belonging to the queue_group. However, introducing additional parameters
>>>> for queue_groups and configuring them via mqprio makes the interface
>>>> less user-friendly (as the command line gets cumbersome due to the
>>>> number of qdisc parameters). Although mqprio is the interface to create
>>>> transmit queue_groups, and also the interface to configure and offload
>>>> certain transmit queue_group parameters, these limitations make us
>>>> wonder whether it is worth considering other interface options for
>>>> configuring queue_group parameters.
>>>>
>>>
>>> I think much of this depends on exactly what functionality we are
>>> talking about. The problem is the Intel use case conflates interrupts
>>> w/ queues w/ the applications themselves since what it is trying to do
>>> is a poor imitation of RDMA being implemented using something akin to
>>> VMDq last I knew.
>>>
>>> So for example one of the things you are asking about below is
>>> establishing a minimum rate for outgoing Tx packets. In my mind we
>>> would probably want to use something like mqprio to set that up since
>>> it is Tx rate limiting and if we were to configure it to happen in
>>> software it would need to be handled in the Qdisc layer.
>>>
>>
>> Configuring min and max rates for outgoing TX packets is already
>> supported in the ice driver using mqprio. The issue is that dynamically
>> changing the rates per traffic class/queue_group via mqprio is not
>> straightforward: the "tc qdisc change" command needs all the rates for
>> the traffic classes again, even for the tcs whose rates are not being
>> changed.
>> For example, here's the sample command to configure min and max rates on
>> 4 TX queue groups:
>>
>> # tc qdisc add dev $iface root mqprio \
>> num_tc 4 \
>> map 0 1 2 3 \
>> queues 2@0 4@2 8@6 16@14 \
>> hw 1 mode channel \
>> shaper bw_rlimit \
>> min_rate 1Gbit 2Gbit 2Gbit 1Gbit \
>> max_rate 4Gbit 5Gbit 5Gbit 10Gbit
>>
>> Now, changing TX min_rate for TC1 to 20 Gbit:
>>
>> # tc qdisc change dev $iface root mqprio \
>> shaper bw_rlimit min_rate 1Gbit 20Gbit 2Gbit 1Gbit
>>
>> Although this is not a major concern, I was looking for the simplicity
>> that something like sysfs provides with tx_maxrate for a queue, so that
>> when there is a large number of tcs, only the ones being changed need
>> to be dealt with (if we were to have sysfs rates per queue_group).
>
> So it sounds like there is an interface already, you may just not like
> having to work with it due to the userspace tooling. Perhaps the
> solution would be to look at fixing things up so that the tooling would
> allow you to make changes to individual values. I haven't looked into
> the interface much but is there any way to retrieve the current
> settings from the Qdisc? If so you might be able to just update tc so
> that it would allow incremental updates and fill in the gaps with the
> config it already has.
>
>>> As far as the NAPI pollers attribute that seems like something that
>>> needs further clarification. Are you limiting the number of busy poll
>>> instances that can run on a single queue group? Is there a reason for
>>> doing it per queue group instead of this being something that could be
>>> enabled on a specific set of queues within the group?
>>>
>>
>> Yes, we are trying to limit the number of napi instances for the queues
>> within a queue-group. Some options we could use:
>>
>> 1. A 'num_poller' attribute on a queue_group level - The initial idea
>> was to configure the number of poller threads that would be handling the
>> queues within the queue_group, as an example, a num_poller value of 4 on
>> a queue_group consisting of 4 queues would imply that there is a poller
>> per queue. This could also be changed to something like a single poller
>> for all 4 queues within the group.
>>
>> 2. A poller bitmap for each queue (both TX and RX) - The main concern
>> with the queue-level maps is that it would still be nice to have a
>> higher level queue-group isolation, so that a poller is not shared among
>> queues belonging to distinct queue-groups. Also, a queue-group level
>> config would consolidate the mapping of queues and vectors in the driver
>> in batches, instead of the driver having to update the queue<->vector
>> mapping in response to each queue's poller configuration.
>>
>> But we could do away with having these at queue-group level, and instead
>> use a different method as the third option below:
>> 3. A queues bitmap per napi instance - So the default arrangement today
>> is 1:1 mapping between queues and interrupt vectors and hence 1:1
>> queue<->poller association. If the user could configure one interrupt
>> vector to serve different queues, these queues can be serviced by the
>> poller/napi instance for the vector.
>> One way to do this is to have a bitmap of queues for each IRQ allocated
>> for the device (similar to smp_affinity CPUs bitmap for the given IRQ).
>> So, /sys/class/net/<iface>/device/msi_irqs/ lists all the IRQs
>> associated with the network interface. If the IRQ can take an additional
>> attribute like queues_affinity for the IRQs on the network device (use
>> /sys/class/net/<iface>/device/msi_irqs/N/ since queues_affinity would be
>> specific to the network subsystem), this would enable multiple queues
>> <-> single vector association configurable by the user. The driver would
>> validate that a queue is not mapped to multiple interrupts. This way an
>> interrupt can be shared among different queues as configured by the user.
>> Another approach is to expose the napi-ids via sysfs and support
>> per-napi attributes.
>> /sys/class/net/<iface>/napis/napi<0-N>
>> Each numbered sub-directory N contains attributes of that napi. A
>> 'napi_queues' attribute would be a bitmap of queues associated with the
>> napi-N enabling many queues <-> single napi association. Example,
>> /sys/class/net/<iface>/napis/napi-N/napi_queues
>>
>> We also plan to introduce an additional napi attribute for each napi
>> instance called 'poller_timeout' indicating the timeout in jiffies.
>> Exposing the napi-ids would also enable moving some existing napi
>> attributes such as 'threaded' and 'napi_defer_hard_irqs' etc. (which are
>> currently per netdev) to be per napi instance.
>
> This one will require more thought and discussion as the NAPI instances
> themselves have been something that was largely hidden and not exposed
> to userspace up until now.
>
> However with that said I am pretty certain sysfs isn't the way to go.
>
>>>> Likewise, receive queue_groups can be created using the ethtool
>>>> interface as RSS contexts. Next step would be to configure
>>>> per-rx_queue_group parameters. Based on the discussion in
>>>> https://lore.kernel.org/netdev/20221114091559.7e24c7de@kernel.org/,
>>>> it looks like ethtool may not be the right interface to configure
>>>> rx_queue_group parameters (that are unrelated to flow<->queue
>>>> assignment), example NAPI configurations on the queue_group.
>>>>
>>>> The key gaps in the kernel to support queue-group parameters are:
>>>> 1. 'queue_group' abstraction in the kernel for both TX and RX distinctly
>>>> 2. Offload hooks for TX/RX queue_group parameters depending on the
>>>> chosen interface.
>>>>
>>>> Following are the options we have investigated:
>>>>
>>>> 1. tc-mqprio:
>>>> Pros:
>>>> - Already supports creating queue_groups, offload of certain parameters
>>>>
>>>> Cons:
>>>> - Introducing new parameters makes the interface less user-friendly.
>>>> TC qdisc parameters are specified at the qdisc creation, larger the
>>>> number of traffic classes and their respective parameters, lesser the
>>>> usability.
>>>
>>> Yes and no. The TC layer is mostly meant for handling the Tx side of
>>> things. For something like the rate limiting it might make sense since
>>> there is already logic there to do it in mqprio. But if you are trying
>>> to pull in NAPI or RSS attributes then I agree it would hurt usability.
>>>
>>
>> The TX queue-group parameters supported via mqprio are limited to
>> priority, min and max rates. I think extending mqprio for a larger set
>> of TX parameters beyond just rates (say max burst) would bloat up the
>> command line. But yes, I agree, the TC layer is not the place for NAPI
>> attributes on TX queues.
>
> The problem is you are reinventing the wheel. It sounds like this
> mostly does what you are looking for. If you are going to look at
> extending it then you should do so. Otherwise maybe you need to look at
> putting together a new Qdisc instead of creating an entirely new
> infrastructure since Qdisc is how we would deal with implementing
> something like this in software. We shouldn't be bypassing that; we
> should be implementing an equivalent in software of what we want to do
> in hardware.
>
>>>> 2. Ethtool:
>>>> Pros:
>>>> - Already creates RX queue_groups as RSS contexts
>>>>
>>>> Cons:
>>>> - May not be the right interface for non-RSS related parameters
>>>>
>>>> Example for configuring number of napi pollers for a queue group:
>>>> ethtool -X <iface> context <context_num> num_pollers <n>
>>>
>>> One thing that might make sense would be to look at adding a possible
>>> alias for context that could work with something like DCB or the queue
>>> groups use case. I believe that for DCB there is a similar issue where
>>> the various priorities could have separate RSS contexts so it might
>>> make sense to look at applying a similar logic. Also there has been
>>> talk about trying to do the round-robin-on-SYN type logic. That
>>> might make sense to expose as a hfunc type since it would be overriding
>>> RSS for TCP flows.
>>>
>>
>> For the round robin flow steering of TCP flows (on SYN by overriding RSS
>> hash), the plan was to add a new 'inline_fd' parameter to ethtool rss
>> context. Will look into your suggestion for using hfunc type.
>
> It sounds like we are generally thinking in the same area so that is a
> good start there.
>
>>>
>>>> 3. sysfs:
>>>> Pros:
>>>> - Ideal to configure parameters such as NAPI/IRQ for Rx queue_group.
>>>> - Makes it possible to support some existing per-netdev napi
>>>> parameters like 'threaded' and 'napi_defer_hard_irqs' etc. to be
>>>> per-queue-group parameters.
>>>>
>>>> Cons:
>>>> - Requires introducing new queue_group structures for TX and RX
>>>> queue groups and references for it, kset references for queue_group in
>>>> struct net_device
>>>> - Additional ndo ops in net_device_ops for each parameter for
>>>> hardware offload.
>>>>
>>>> Examples :
>>>> /sys/class/net/<iface>/queue_groups/rxqg-<0-n>/num_pollers
>>>> /sys/class/net/<iface>/queue_groups/txqg-<0-n>/min_rate
>>>
>>> So min_rate is something already handled in mqprio since it is so DCB
>>> like. You are essentially guaranteeing bandwidth aren't you? Couldn't
>>> you just define a bw_rlimit shaper for mqprio and then use the existing
>>> bw_rlimit values to define the min_rate?
>>>
>>
>> The ice driver already supports min_rate per queue_group using mqprio. I
>> was suggesting this in case we happen to have a TX queue_group object,
>> since dynamically changing rates via mqprio was not handy enough as I
>> mentioned above.
>
> Yeah, but based on the description you are rewriting the kernel side
> because you don't like dealing with the userspace tools. Again maybe
> the solution here would be to look at cleaning up the userspace
> interface to add support for reading/retrieving the existing values and
> then updating instead of requiring a complete update every time.
>
> What we want to avoid is creating new overhead in the kernel where we
> now have yet another way to control Tx rates as each redundant
> interface added is that much more overhead that has to be dealt with
> throughout the Tx path. If we already have a way to do this with mqprio
> let's just support offloading that into hardware rather than adding yet
> another Tx rate control.
>
>>> As far as adding the queue_groups interface one ugly bit would be that
>>> we would probably need to have links between the queues and these
>>> groups which would start to turn the sysfs into a tangled mess.
>>>
>> Agree, maintaining the links between queues and groups is not trivial.
>>
>>> The biggest issue I see is that there isn't any sort of sysfs interface
>>> exposed for NAPI which is what you would essentially need to justify
>>> something like this since that is what you are modifying.
>>>
>>
>> Right. Something like /sys/class/net/<iface>/napis/napi<0-N>
>> Maybe, initially there would be as many napis as queues due to 1:1
>> association, but as the queues bitmap is tuned for the napi, only those
>> napis that have queue[s] associated with it would be exposed.
>
> As Jakub already pointed out adding more sysfs is generally frowned
> upon.
Agreed. We do not wish to add another interface to control rates; that
overhead can be avoided. We can continue to use mqprio and dynamically
change the rates using the "tc qdisc change" command.
For exposing napis, can ethtool netlink be an option as I detailed in
the reply to Jakub?
* Re: Kernel interface to configure queue-group parameters
2023-02-24 9:14 ` Nambiar, Amritha
@ 2023-02-24 19:22 ` Jakub Kicinski
0 siblings, 0 replies; 10+ messages in thread
From: Jakub Kicinski @ 2023-02-24 19:22 UTC (permalink / raw)
To: Nambiar, Amritha
Cc: Alexander H Duyck, netdev, davem, edumazet, pabeni,
Saeed Mahameed, Samudrala, Sridhar
On Fri, 24 Feb 2023 01:14:15 -0800 Nambiar, Amritha wrote:
> On 2/16/2023 9:32 AM, Jakub Kicinski wrote:
> > On Thu, 16 Feb 2023 02:35:35 -0800 Nambiar, Amritha wrote:
> >> Right. Something like /sys/class/net/<iface>/napis/napi<0-N>
> >> Maybe, initially there would be as many napis as queues due to 1:1
> >> association, but as the queues bitmap is tuned for the napi, only those
> >> napis that have queue[s] associated with it would be exposed.
> >
> > Forget about using sysfs, please. We've been talking about making
> > "queues first class citizen", mapping to pollers is part of that
> > problem space. And it's complex enough to be better suited for netlink.
>
> Okay. Can ethtool netlink be an option for this? For example,
>
> ethtool --show-napis
> Lists all the napi instances and associated queue[s] list for each napi
> for the specified network device.
>
> ethtool --set-napi
> Configure the attributes (say, queue[s] list) for each napi
>
> napi <napi_id>
> The napi instance to configure
>
> queues <q_id1, q_id2, ...>
> The queue[s] that are to be serviced by the napi instance.
The netdev-genl family is a better target.
But the work is doing the refactoring within the kernel to abstract
all this stuff away from the drivers, so that the kernel has
a stronger model of queues. If we just expose the calls to the drivers
directly we'll end up with a lot of code duplication and not-so-subtle
differences between vendors :(
end of thread, other threads:[~2023-02-24 19:22 UTC | newest]
Thread overview: 10+ messages
2023-02-07 0:15 Kernel interface to configure queue-group parameters Nambiar, Amritha
2023-02-07 16:28 ` Alexander H Duyck
2023-02-09 0:36 ` Jakub Kicinski
2023-02-16 10:34 ` Nambiar, Amritha
2023-02-16 10:35 ` Nambiar, Amritha
2023-02-16 17:32 ` Jakub Kicinski
2023-02-24 9:14 ` Nambiar, Amritha
2023-02-24 19:22 ` Jakub Kicinski
2023-02-19 17:39 ` Alexander H Duyck
2023-02-24 9:17 ` Nambiar, Amritha