Re: [RFC] Hierarchical QoS Hardware Offload (HTB) - Toke Høiland-Jørgensen

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Yossi Kuperman <yossiku@mellanox.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>,
	Jiri Pirko <jiri@mellanox.com>, Rony Efraim <ronye@mellanox.com>,
	Maxim Mikityanskiy <maximmi@mellanox.com>,
	John Fastabend <john.fastabend@gmail.com>,
	Eran Ben Elisha <eranbe@mellanox.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Subject: Re: [RFC] Hierarchical QoS Hardware Offload (HTB)
Date: Fri, 14 Feb 2020 12:14:42 +0100
Message-ID: <877e0pe7z1.fsf@toke.dk>
In-Reply-To: <bbafbd41-2a3b-3abd-e57c-18175a7c9e3f@mellanox.com>

Yossi Kuperman <yossiku@mellanox.com> writes:

> On 01/02/2020 18:48, Toke Høiland-Jørgensen wrote:
>> Yossi Kuperman <yossiku@mellanox.com> writes:
>>
>>> Following is an outline briefly describing our plans towards offloading HTB functionality.
>>>
>>> HTB qdisc allows you to use one physical link to simulate several
>>> slower links. This is done by configuring a hierarchical QoS tree;
>>> each tree node corresponds to a class. Filters are used to classify
>>> flows to different classes. HTB is quite flexible and versatile, but
>>> it comes with a cost. HTB does not scale and consumes considerable CPU
>>> and memory. Our aim is to offload HTB functionality to hardware and
>>> provide the user with the flexibility and the conventional tools
>>> offered by TC subsystem, while scaling to thousands of traffic classes
>>> and maintaining wire-speed performance. 
>>>
>>> Mellanox hardware can support hierarchical rate-limiting;
>>> rate-limiting is done per hardware queue. In our proposed solution,
>>> flow classification takes place in software. By moving the
>>> classification to clsact egress hook, which is thread-safe and does
>>> not require locking, we avoid the contention induced by the single
>>> qdisc lock. Furthermore, clsact filters are performed before the
>>> net-device’s TX queue is selected, allowing the driver a chance to
>>> translate the class to the appropriate hardware queue. Please note
>>> that the user will need to configure the filters slightly differently:
>>> apply them to the clsact rather than to the HTB itself, and set the
>>> priority to the desired class-id.
>>>
>>> For example, the following two filters are equivalent:
>>> 	1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
>>> 	2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
>>>
>>> Note: to support the above filter, no code changes to the upstream kernel or to the iproute2 package are required.
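
For readers unfamiliar with the class-id notation in the two filters
above: as far as I know, tc reads both halves of "1:10" as hexadecimal
and packs them into one 32-bit value, major in the upper 16 bits and
minor in the lower 16, and that is the value skbedit stores in
skb->priority. A minimal Python sketch of that packing follows; the
helper names parse_classid and format_classid are made up for
illustration and are not part of tc or iproute2.

```python
def parse_classid(text: str) -> int:
    """Pack a "major:minor" class-id (both fields hexadecimal) into 32 bits.

    Hypothetical helper for illustration only; it mirrors the
    major << 16 | minor layout, not any real iproute2 function.
    """
    if ":" not in text:
        raise ValueError(f"expected major:minor, got {text!r}")
    major_s, _, minor_s = text.partition(":")
    # An empty field (as in "1:") is treated as zero.
    major = int(major_s, 16) if major_s else 0
    minor = int(minor_s, 16) if minor_s else 0
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError(f"major and minor must each fit in 16 bits: {text!r}")
    return (major << 16) | minor


def format_classid(handle: int) -> str:
    """Inverse of parse_classid: render a 32-bit value as major:minor."""
    return f"{handle >> 16:x}:{handle & 0xFFFF:x}"


if __name__ == "__main__":
    value = parse_classid("1:10")
    print(hex(value), format_classid(value))
```

Under that assumption, "priority 1:10" in the second filter and
"classid 1:10" in the first name the same 32-bit value, which is why
the two forms can select the same class.
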
>>>
>>> Furthermore, the most concerning aspect of the current HTB
>>> implementation is its lack of support for multi-queue. All
>>> net-device’s TX queues point to the same HTB instance, resulting in
>>> high spin-lock contention. This contention might negate the overall
>>> performance gains expected by introducing the offload in the first
>>> place. We should modify HTB to present itself as mq qdisc does. By
>>> default, mq qdisc allocates a simple fifo qdisc per TX queue exposed
>>> by the lower layer device. This applies only when hardware offload is
>>> configured; otherwise, HTB behaves as usual. There is no HTB code
>>> along the data-path; the only overhead compared to regular traffic is
>>> the classification taking place at clsact. Please note that this
>>> design implies full offload, with no fallback to software; partially
>>> offloading the hierarchical tree is not trivial anyway, considering
>>> borrowing between siblings.
>>>
>>>
>>> To summarize: for each HTB leaf-class the driver will allocate a
>>> special queue and match it with a corresponding net-device TX queue
>>> (increase real_num_tx_queues). A unique fifo qdisc will be attached to
>>> any such TX queue. Classification will still take place in software,
>>> but rather at the clsact egress hook. This way we can scale to
>>> thousands of classes while maintaining wire-speed performance and
>>> reducing CPU overhead.
>>>
>>> Any feedback will be much appreciated.
>> Other than echoing Dave's concern around baking FIFO semantics into
>> hardware, maybe also consider whether implementing the required
>> functionality using EDT-based semantics instead might be better? I.e.,
>> something like this:
>> https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF
>
> Sorry for the long delay.
>
> The above talk description is quite concise; I can only speculate how
> this EDT+BPF+FQ might work. Anyway, we have a requirement from a
> customer to provide HTB-like semantics including borrowing and
> rate-limiting flow-aggregates. We have a working PoC, which closely
> resembles the aforementioned description, including hardware
> offload.  
>
> Is it possible to construct hierarchical QoS (for borrowing purposes)
> using EDT+BPF+FQ?

I believe so. I haven't worked out the details of how to make it
equivalent, though; I'm hoping that is what that talk will do.
Adding some of the speakers to Cc, maybe they can comment? :)

> Is it applicable only for TCP?

No, EDT is applicable to all packets.
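
To illustrate why: a rough sketch of the usual EDT idea, assuming a
single flow aggregate with a fixed byte rate. Each packet gets an
earliest departure time, and the next slot is pushed out by the time
the packet takes to send at the configured rate. Only the packet length
and a clock are involved, so nothing here depends on the transport
protocol. The class and method names are hypothetical, this is not
taken from the talk, and it does not attempt borrowing.

```python
NSEC_PER_SEC = 1_000_000_000


class EdtShaper:
    """Toy earliest-departure-time pacer for one flow aggregate.

    Hypothetical illustration: a real setup would set the timestamp
    on the packet and let a time-aware qdisc release it.
    """

    def __init__(self, rate_bytes_per_sec: int) -> None:
        if rate_bytes_per_sec <= 0:
            raise ValueError("rate must be positive")
        self.rate = rate_bytes_per_sec
        self.next_departure_ns = 0

    def stamp(self, now_ns: int, pkt_len: int) -> int:
        """Return the departure time for a packet of pkt_len bytes."""
        # Never schedule in the past, and never before the slot that
        # earlier packets of this aggregate have already consumed.
        departure = max(now_ns, self.next_departure_ns)
        # Reserve the time this packet occupies at the configured rate.
        self.next_departure_ns = departure + pkt_len * NSEC_PER_SEC // self.rate
        return departure


if __name__ == "__main__":
    shaper = EdtShaper(rate_bytes_per_sec=1_000_000)
    for now_ns, length in [(0, 1000), (0, 1000), (5_000_000, 500)]:
        print(now_ns, length, shaper.stamp(now_ns, length))
```
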

-Toke