netdev.vger.kernel.org archive mirror
* [RFC] Hierarchical QoS Hardware Offload (HTB)
@ 2020-01-30 16:20 Yossi Kuperman
  2020-01-31  1:47 ` Dave Taht
  2020-02-01 16:48 ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 6+ messages in thread
From: Yossi Kuperman @ 2020-01-30 16:20 UTC (permalink / raw)
  To: netdev
  Cc: Jamal Hadi Salim, Jiri Pirko, Rony Efraim, Maxim Mikityanskiy,
	John Fastabend, Eran Ben Elisha

Following is an outline briefly describing our plans towards offloading HTB functionality.

HTB qdisc allows you to use one physical link to simulate several slower links. This is done by configuring a hierarchical QoS tree; each tree node corresponds to a class. Filters are used to classify flows into different classes. HTB is quite flexible and versatile, but it comes at a cost: HTB does not scale and consumes considerable CPU and memory. Our aim is to offload HTB functionality to hardware and provide the user with the flexibility and the conventional tools offered by the TC subsystem, while scaling to thousands of traffic classes and maintaining wire-speed performance.

Mellanox hardware can support hierarchical rate-limiting; rate-limiting is done per hardware queue. In our proposed solution, flow classification takes place in software. By moving the classification to the clsact egress hook, which is thread-safe and does not require locking, we avoid the contention induced by the single qdisc lock. Furthermore, clsact filters are performed before the net-device’s TX queue is selected, giving the driver a chance to translate the class to the appropriate hardware queue. Please note that the user will need to configure the filters slightly differently: apply them to clsact rather than to HTB itself, and set the priority to the desired class-id.

For example, the following two filters are equivalent:
	1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
	2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10

Note: to support the above filter, no code changes to the upstream kernel or to the iproute2 package are required.
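
For completeness, here is a minimal sketch of the clsact variant (assuming eth0; the clsact qdisc has to exist before the egress filter can be attached, and ip_proto is added here because flower needs an L4 protocol to match dst_port):
	# one-time: attach the classification hook to the device
	tc qdisc add dev eth0 clsact
	# steer TCP port-80 traffic to leaf class 1:10 by setting skb->priority
	tc filter add dev eth0 egress protocol ip flower ip_proto tcp dst_port 80 \
		action skbedit priority 1:10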

Furthermore, the most concerning aspect of the current HTB implementation is its lack of support for multi-queue. All of the net-device’s TX queues point to the same HTB instance, resulting in high spin-lock contention. This contention might negate the overall performance gains expected from introducing the offload in the first place. We should modify HTB to present itself the way the mq qdisc does: by default, mq allocates a simple fifo qdisc per TX queue exposed by the lower-layer device. This applies only when hardware offload is configured; otherwise, HTB behaves as usual. There is no HTB code along the data path; the only overhead compared to regular traffic is the classification taking place at clsact. Please note that this design implies full offload---no fallback to software; it is not trivial to partially offload the hierarchical tree, considering borrowing between siblings, anyway.


To summarize: for each HTB leaf class the driver will allocate a dedicated hardware queue and match it with a corresponding net-device TX queue (increasing real_num_tx_queues). A unique fifo qdisc will be attached to each such TX queue. Classification will still take place in software, but at the clsact egress hook rather than inside HTB. This way we can scale to thousands of classes while maintaining wire-speed performance and reducing CPU overhead.
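
To make the intended flow concrete, a rough configuration sketch (the "offload" flag below is hypothetical---how the offload is actually requested on the root qdisc is still open):
	# request the offload when creating the HTB root (flag name is an assumption)
	tc qdisc replace dev eth0 root handle 1: htb offload
	# each leaf class maps to a dedicated hardware queue / net-device TX queue
	tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit ceil 1gbit
	tc class add dev eth0 parent 1: classid 1:20 htb rate 200mbit ceil 1gbit
	# classification happens at the clsact egress hook, as in the example above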

Any feedback will be much appreciated.

Cheers,
Kuperman



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] Hierarchical QoS Hardware Offload (HTB)
  2020-01-30 16:20 [RFC] Hierarchical QoS Hardware Offload (HTB) Yossi Kuperman
@ 2020-01-31  1:47 ` Dave Taht
  2020-01-31 21:42   ` Dave Taht
  2020-02-14 11:33   ` Yossi Kuperman
  2020-02-01 16:48 ` Toke Høiland-Jørgensen
  1 sibling, 2 replies; 6+ messages in thread
From: Dave Taht @ 2020-01-31  1:47 UTC (permalink / raw)
  To: Yossi Kuperman
  Cc: netdev, Jamal Hadi Salim, Jiri Pirko, Rony Efraim,
	Maxim Mikityanskiy, John Fastabend, Eran Ben Elisha

On Thu, Jan 30, 2020 at 8:21 AM Yossi Kuperman <yossiku@mellanox.com> wrote:
>
> Following is an outline briefly describing our plans towards offloading HTB functionality.
>
> HTB qdisc allows you to use one physical link to simulate several slower links. This is done by configuring a hierarchical QoS tree; each tree node corresponds to a class. Filters are used to classify flows into different classes. HTB is quite flexible and versatile, but it comes at a cost: HTB does not scale and consumes considerable CPU and memory. Our aim is to offload HTB functionality to hardware and provide the user with the flexibility and the conventional tools offered by the TC subsystem, while scaling to thousands of traffic classes and maintaining wire-speed performance.
>
> Mellanox hardware can support hierarchical rate-limiting; rate-limiting is done per hardware queue. In our proposed solution, flow classification takes place in software. By moving the classification to the clsact egress hook, which is thread-safe and does not require locking, we avoid the contention induced by the single qdisc lock. Furthermore, clsact filters are performed before the net-device’s TX queue is selected, giving the driver a chance to translate the class to the appropriate hardware queue. Please note that the user will need to configure the filters slightly differently: apply them to clsact rather than to HTB itself, and set the priority to the desired class-id.
>
> For example, the following two filters are equivalent:
>         1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
>         2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
>
> Note: to support the above filter, no code changes to the upstream kernel or to the iproute2 package are required.
>
> Furthermore, the most concerning aspect of the current HTB implementation is its lack of support for multi-queue. All of the net-device’s TX queues point to the same HTB instance, resulting in high spin-lock contention. This contention might negate the overall performance gains expected from introducing the offload in the first place. We should modify HTB to present itself the way the mq qdisc does: by default, mq allocates a simple fifo qdisc per TX queue exposed by the lower-layer device. This applies only when hardware offload is configured; otherwise, HTB behaves as usual. There is no HTB code along the data path; the only overhead compared to regular traffic is the classification taking place at clsact. Please note that this design implies full offload---no fallback to software; it is not trivial to partially offload the hierarchical tree, considering borrowing between siblings, anyway.
>
>
> To summarize: for each HTB leaf class the driver will allocate a dedicated hardware queue and match it with a corresponding net-device TX queue (increasing real_num_tx_queues). A unique fifo qdisc will be attached to each such TX queue. Classification will still take place in software, but at the clsact egress hook rather than inside HTB. This way we can scale to thousands of classes while maintaining wire-speed performance and reducing CPU overhead.
>
> Any feedback will be much appreciated.

It was of course my hope that fifos would be universally replaced with
rfc8290 or rfc8033 by now. So moving a software htb +
net.core.default_qdisc = "anything other than pfifo_fast" to a
hardware offload with fifos... will be "interesting". Will there be
features to at least limit the size of the offloaded fifo by packets
(or, preferably, bytes)?




>
> Cheers,
> Kuperman
>
>


-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] Hierarchical QoS Hardware Offload (HTB)
  2020-01-31  1:47 ` Dave Taht
@ 2020-01-31 21:42   ` Dave Taht
  2020-02-14 11:33   ` Yossi Kuperman
  1 sibling, 0 replies; 6+ messages in thread
From: Dave Taht @ 2020-01-31 21:42 UTC (permalink / raw)
  To: Yossi Kuperman
  Cc: netdev, Jamal Hadi Salim, Jiri Pirko, Rony Efraim,
	Maxim Mikityanskiy, John Fastabend, Eran Ben Elisha

On Thu, Jan 30, 2020 at 5:47 PM Dave Taht <dave.taht@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 8:21 AM Yossi Kuperman <yossiku@mellanox.com> wrote:
> >
> > Following is an outline briefly describing our plans towards offloading HTB functionality.
> >
> > HTB qdisc allows you to use one physical link to simulate several slower links. This is done by configuring a hierarchical QoS tree; each tree node corresponds to a class. Filters are used to classify flows into different classes. HTB is quite flexible and versatile, but it comes at a cost: HTB does not scale and consumes considerable CPU and memory. Our aim is to offload HTB functionality to hardware and provide the user with the flexibility and the conventional tools offered by the TC subsystem, while scaling to thousands of traffic classes and maintaining wire-speed performance.
> >
> > Mellanox hardware can support hierarchical rate-limiting; rate-limiting is done per hardware queue. In our proposed solution, flow classification takes place in software. By moving the classification to the clsact egress hook, which is thread-safe and does not require locking, we avoid the contention induced by the single qdisc lock. Furthermore, clsact filters are performed before the net-device’s TX queue is selected, giving the driver a chance to translate the class to the appropriate hardware queue. Please note that the user will need to configure the filters slightly differently: apply them to clsact rather than to HTB itself, and set the priority to the desired class-id.
> >
> > For example, the following two filters are equivalent:
> >         1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
> >         2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
> >
> > Note: to support the above filter, no code changes to the upstream kernel or to the iproute2 package are required.
> >
> > Furthermore, the most concerning aspect of the current HTB implementation is its lack of support for multi-queue. All of the net-device’s TX queues point to the same HTB instance, resulting in high spin-lock contention. This contention might negate the overall performance gains expected from introducing the offload in the first place. We should modify HTB to present itself the way the mq qdisc does: by default, mq allocates a simple fifo qdisc per TX queue exposed by the lower-layer device. This applies only when hardware offload is configured; otherwise, HTB behaves as usual. There is no HTB code along the data path; the only overhead compared to regular traffic is the classification taking place at clsact. Please note that this design implies full offload---no fallback to software; it is not trivial to partially offload the hierarchical tree, considering borrowing between siblings, anyway.
> >
> >
> > To summarize: for each HTB leaf class the driver will allocate a dedicated hardware queue and match it with a corresponding net-device TX queue (increasing real_num_tx_queues). A unique fifo qdisc will be attached to each such TX queue. Classification will still take place in software, but at the clsact egress hook rather than inside HTB. This way we can scale to thousands of classes while maintaining wire-speed performance and reducing CPU overhead.
> >
> > Any feedback will be much appreciated.
>
> It was of course my hope that fifos would be universally replaced with
> rfc8290 or rfc8033 by now. So moving a software htb +
> net.core.default_qdisc = "anything other than pfifo_fast" to a
> hardware offload with fifos... will be "interesting". Will there be
> features to at least limit the size of the offloaded fifo by packets
> (or, preferably, bytes)?

Another hope, in the long run, was that something like this might
prove feasible for more hw offloads.

https://tools.ietf.org/html/draft-morton-tsvwg-cheap-nasty-queueing-01
>
>
>
>
> >
> > Cheers,
> > Kuperman
> >
> >
>
>
> --
> Make Music, Not War
>
> Dave Täht
> CTO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-831-435-0729



-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] Hierarchical QoS Hardware Offload (HTB)
  2020-01-30 16:20 [RFC] Hierarchical QoS Hardware Offload (HTB) Yossi Kuperman
  2020-01-31  1:47 ` Dave Taht
@ 2020-02-01 16:48 ` Toke Høiland-Jørgensen
       [not found]   ` <bbafbd41-2a3b-3abd-e57c-18175a7c9e3f@mellanox.com>
  1 sibling, 1 reply; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-01 16:48 UTC (permalink / raw)
  To: Yossi Kuperman, netdev
  Cc: Jamal Hadi Salim, Jiri Pirko, Rony Efraim, Maxim Mikityanskiy,
	John Fastabend, Eran Ben Elisha

Yossi Kuperman <yossiku@mellanox.com> writes:

> Following is an outline briefly describing our plans towards offloading HTB functionality.
>
> HTB qdisc allows you to use one physical link to simulate several
> slower links. This is done by configuring a hierarchical QoS tree;
> each tree node corresponds to a class. Filters are used to classify
> flows into different classes. HTB is quite flexible and versatile,
> but it comes at a cost: HTB does not scale and consumes considerable
> CPU and memory. Our aim is to offload HTB functionality to hardware
> and provide the user with the flexibility and the conventional tools
> offered by the TC subsystem, while scaling to thousands of traffic
> classes and maintaining wire-speed performance.
>
> Mellanox hardware can support hierarchical rate-limiting;
> rate-limiting is done per hardware queue. In our proposed solution,
> flow classification takes place in software. By moving the
> classification to the clsact egress hook, which is thread-safe and
> does not require locking, we avoid the contention induced by the
> single qdisc lock. Furthermore, clsact filters are performed before
> the net-device’s TX queue is selected, giving the driver a chance to
> translate the class to the appropriate hardware queue. Please note
> that the user will need to configure the filters slightly
> differently: apply them to clsact rather than to HTB itself, and set
> the priority to the desired class-id.
>
> For example, the following two filters are equivalent:
> 	1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
> 	2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
>
> Note: to support the above filter, no code changes to the upstream kernel or to the iproute2 package are required.
>
> Furthermore, the most concerning aspect of the current HTB
> implementation is its lack of support for multi-queue. All of the
> net-device’s TX queues point to the same HTB instance, resulting in
> high spin-lock contention. This contention might negate the overall
> performance gains expected from introducing the offload in the first
> place. We should modify HTB to present itself the way the mq qdisc
> does: by default, mq allocates a simple fifo qdisc per TX queue
> exposed by the lower-layer device. This applies only when hardware
> offload is configured; otherwise, HTB behaves as usual. There is no
> HTB code along the data path; the only overhead compared to regular
> traffic is the classification taking place at clsact. Please note
> that this design implies full offload---no fallback to software; it
> is not trivial to partially offload the hierarchical tree,
> considering borrowing between siblings, anyway.
>
>
> To summarize: for each HTB leaf class the driver will allocate a
> dedicated hardware queue and match it with a corresponding
> net-device TX queue (increasing real_num_tx_queues). A unique fifo
> qdisc will be attached to each such TX queue. Classification will
> still take place in software, but at the clsact egress hook rather
> than inside HTB. This way we can scale to thousands of classes while
> maintaining wire-speed performance and reducing CPU overhead.
>
> Any feedback will be much appreciated.

Other than echoing Dave's concern around baking FIFO semantics into
hardware, maybe also consider whether implementing the required
functionality using EDT-based semantics instead might be better? I.e.,
something like this:
https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF
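
Very roughly, and purely as a sketch (edt_shaper.o is a hypothetical BPF
object that sets skb->tstamp per flow-aggregate; sch_fq then paces the
packets to those timestamps):
	tc qdisc replace dev eth0 root handle 1: mq
	tc qdisc replace dev eth0 parent 1:1 fq    # and so on, one fq per TX queue
	tc qdisc add dev eth0 clsact
	tc filter add dev eth0 egress bpf direct-action obj edt_shaper.o sec tc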

-Toke


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] Hierarchical QoS Hardware Offload (HTB)
       [not found]   ` <bbafbd41-2a3b-3abd-e57c-18175a7c9e3f@mellanox.com>
@ 2020-02-14 11:14     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-14 11:14 UTC (permalink / raw)
  To: Yossi Kuperman, netdev
  Cc: Jamal Hadi Salim, Jiri Pirko, Rony Efraim, Maxim Mikityanskiy,
	John Fastabend, Eran Ben Elisha, Eric Dumazet,
	Stanislav Fomichev, Willem de Bruijn

Yossi Kuperman <yossiku@mellanox.com> writes:

> On 01/02/2020 18:48, Toke Høiland-Jørgensen wrote:
>> Yossi Kuperman <yossiku@mellanox.com> writes:
>>
>>> Following is an outline briefly describing our plans towards offloading HTB functionality.
>>>
>>> HTB qdisc allows you to use one physical link to simulate several
>>> slower links. This is done by configuring a hierarchical QoS tree;
>>> each tree node corresponds to a class. Filters are used to classify
>>> flows into different classes. HTB is quite flexible and versatile,
>>> but it comes at a cost: HTB does not scale and consumes considerable
>>> CPU and memory. Our aim is to offload HTB functionality to hardware
>>> and provide the user with the flexibility and the conventional tools
>>> offered by the TC subsystem, while scaling to thousands of traffic
>>> classes and maintaining wire-speed performance.
>>>
>>> Mellanox hardware can support hierarchical rate-limiting;
>>> rate-limiting is done per hardware queue. In our proposed solution,
>>> flow classification takes place in software. By moving the
>>> classification to the clsact egress hook, which is thread-safe and
>>> does not require locking, we avoid the contention induced by the
>>> single qdisc lock. Furthermore, clsact filters are performed before
>>> the net-device’s TX queue is selected, giving the driver a chance to
>>> translate the class to the appropriate hardware queue. Please note
>>> that the user will need to configure the filters slightly
>>> differently: apply them to clsact rather than to HTB itself, and set
>>> the priority to the desired class-id.
>>>
>>> For example, the following two filters are equivalent:
>>> 	1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
>>> 	2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
>>>
>>> Note: to support the above filter, no code changes to the upstream kernel or to the iproute2 package are required.
>>>
>>> Furthermore, the most concerning aspect of the current HTB
>>> implementation is its lack of support for multi-queue. All of the
>>> net-device’s TX queues point to the same HTB instance, resulting in
>>> high spin-lock contention. This contention might negate the overall
>>> performance gains expected from introducing the offload in the first
>>> place. We should modify HTB to present itself the way the mq qdisc
>>> does: by default, mq allocates a simple fifo qdisc per TX queue
>>> exposed by the lower-layer device. This applies only when hardware
>>> offload is configured; otherwise, HTB behaves as usual. There is no
>>> HTB code along the data path; the only overhead compared to regular
>>> traffic is the classification taking place at clsact. Please note
>>> that this design implies full offload---no fallback to software; it
>>> is not trivial to partially offload the hierarchical tree,
>>> considering borrowing between siblings, anyway.
>>>
>>>
>>> To summarize: for each HTB leaf class the driver will allocate a
>>> dedicated hardware queue and match it with a corresponding
>>> net-device TX queue (increasing real_num_tx_queues). A unique fifo
>>> qdisc will be attached to each such TX queue. Classification will
>>> still take place in software, but at the clsact egress hook rather
>>> than inside HTB. This way we can scale to thousands of classes while
>>> maintaining wire-speed performance and reducing CPU overhead.
>>>
>>> Any feedback will be much appreciated.
>> Other than echoing Dave's concern around baking FIFO semantics into
>> hardware, maybe also consider whether implementing the required
>> functionality using EDT-based semantics instead might be better? I.e.,
>> something like this:
>> https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF
>
> Sorry for the long delay.
>
> The above talk description is quite concise; I can only speculate how
> this EDT+BPF+FQ scheme might work. Anyway, we have a requirement from
> a customer to provide HTB-like semantics, including borrowing and
> rate-limiting of flow aggregates. We have a working PoC, which closely
> resembles the aforementioned description, including hardware
> offload.
>
> Is it possible to construct hierarchical QoS (for borrowing purposes)
> using EDT+BPF+FQ?

I believe so. I haven't worked out the details of how to make it
equivalent yet, but I'm hoping that is what that talk will do.
Adding some of the speakers to Cc; maybe they can comment? :)

> Is it applicable only for TCP?

No, EDT is applicable to all packets.

-Toke


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] Hierarchical QoS Hardware Offload (HTB)
  2020-01-31  1:47 ` Dave Taht
  2020-01-31 21:42   ` Dave Taht
@ 2020-02-14 11:33   ` Yossi Kuperman
  1 sibling, 0 replies; 6+ messages in thread
From: Yossi Kuperman @ 2020-02-14 11:33 UTC (permalink / raw)
  To: Dave Taht
  Cc: netdev, Jamal Hadi Salim, Jiri Pirko, Rony Efraim,
	Maxim Mikityanskiy, John Fastabend, Eran Ben Elisha


On 31/01/2020 3:47, Dave Taht wrote:
> On Thu, Jan 30, 2020 at 8:21 AM Yossi Kuperman <yossiku@mellanox.com> wrote:
>> Following is an outline briefly describing our plans towards offloading HTB functionality.
>>
>> HTB qdisc allows you to use one physical link to simulate several slower links. This is done by configuring a hierarchical QoS tree; each tree node corresponds to a class. Filters are used to classify flows into different classes. HTB is quite flexible and versatile, but it comes at a cost: HTB does not scale and consumes considerable CPU and memory. Our aim is to offload HTB functionality to hardware and provide the user with the flexibility and the conventional tools offered by the TC subsystem, while scaling to thousands of traffic classes and maintaining wire-speed performance.
>>
>> Mellanox hardware can support hierarchical rate-limiting; rate-limiting is done per hardware queue. In our proposed solution, flow classification takes place in software. By moving the classification to the clsact egress hook, which is thread-safe and does not require locking, we avoid the contention induced by the single qdisc lock. Furthermore, clsact filters are performed before the net-device’s TX queue is selected, giving the driver a chance to translate the class to the appropriate hardware queue. Please note that the user will need to configure the filters slightly differently: apply them to clsact rather than to HTB itself, and set the priority to the desired class-id.
>>
>> For example, the following two filters are equivalent:
>>         1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
>>         2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
>>
>> Note: to support the above filter, no code changes to the upstream kernel or to the iproute2 package are required.
>>
>> Furthermore, the most concerning aspect of the current HTB implementation is its lack of support for multi-queue. All of the net-device’s TX queues point to the same HTB instance, resulting in high spin-lock contention. This contention might negate the overall performance gains expected from introducing the offload in the first place. We should modify HTB to present itself the way the mq qdisc does: by default, mq allocates a simple fifo qdisc per TX queue exposed by the lower-layer device. This applies only when hardware offload is configured; otherwise, HTB behaves as usual. There is no HTB code along the data path; the only overhead compared to regular traffic is the classification taking place at clsact. Please note that this design implies full offload---no fallback to software; it is not trivial to partially offload the hierarchical tree, considering borrowing between siblings, anyway.
>>
>>
>> To summarize: for each HTB leaf class the driver will allocate a dedicated hardware queue and match it with a corresponding net-device TX queue (increasing real_num_tx_queues). A unique fifo qdisc will be attached to each such TX queue. Classification will still take place in software, but at the clsact egress hook rather than inside HTB. This way we can scale to thousands of classes while maintaining wire-speed performance and reducing CPU overhead.
>>
>> Any feedback will be much appreciated.
> It was of course my hope that fifos would be universally replaced with
> rfc8290 or rfc8033 by now. So moving a software htb +
> net.core.default_qdisc = "anything other than pfifo_fast" to a
> hardware offload with fifos... will be "interesting". Will there be
> features to at least limit the size of the offloaded fifo by packets
> (or, preferably, bytes)?
>
Yes, as far as I know we can limit the size of the hardware queue by the number of packets.
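
Something along these lines should then work from userspace as well (a sketch, assuming the offloaded HTB exposes its per-TX-queue fifos the same way mq exposes its children):
	# cap a per-class fifo by packet count...
	tc qdisc replace dev eth0 parent 1:1 pfifo limit 128
	# ...or by bytes (whether a byte limit can be pushed down to the hardware is open)
	tc qdisc replace dev eth0 parent 1:2 bfifo limit 256k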

>
>
>> Cheers,
>> Kuperman
>>
>>
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-02-14 11:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-30 16:20 [RFC] Hierarchical QoS Hardware Offload (HTB) Yossi Kuperman
2020-01-31  1:47 ` Dave Taht
2020-01-31 21:42   ` Dave Taht
2020-02-14 11:33   ` Yossi Kuperman
2020-02-01 16:48 ` Toke Høiland-Jørgensen
     [not found]   ` <bbafbd41-2a3b-3abd-e57c-18175a7c9e3f@mellanox.com>
2020-02-14 11:14     ` Toke Høiland-Jørgensen
