Re: [RFC] Hierarchical QoS Hardware Offload (HTB)

From: Yossi Kuperman <yossiku@mellanox.com>
To: Dave Taht <dave.taht@gmail.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Jiri Pirko <jiri@mellanox.com>, Rony Efraim <ronye@mellanox.com>,
	Maxim Mikityanskiy <maximmi@mellanox.com>,
	John Fastabend <john.fastabend@gmail.com>,
	Eran Ben Elisha <eranbe@mellanox.com>
Subject: Re: [RFC] Hierarchical QoS Hardware Offload (HTB)
Date: Fri, 14 Feb 2020 13:33:56 +0200
Message-ID: <ad3c29b8-2b45-1b1a-7cee-336c0699e9ac@mellanox.com>
In-Reply-To: <CAA93jw6tgQF4XMKN5etJqkO4xvxSFDCn41en7LSJ55gVJeGybQ@mail.gmail.com>


On 31/01/2020 3:47, Dave Taht wrote:
> On Thu, Jan 30, 2020 at 8:21 AM Yossi Kuperman <yossiku@mellanox.com> wrote:
>> Following is an outline briefly describing our plans towards offloading HTB functionality.
>>
>> The HTB qdisc allows you to use one physical link to simulate several slower links. This is done by configuring a hierarchical QoS tree; each tree node corresponds to a class. Filters are used to classify flows into different classes. HTB is quite flexible and versatile, but this comes at a cost: HTB does not scale and consumes considerable CPU and memory. Our aim is to offload HTB functionality to hardware and provide the user with the flexibility and the conventional tools offered by the TC subsystem, while scaling to thousands of traffic classes and maintaining wire-speed performance.
>>
>> Mellanox hardware can support hierarchical rate-limiting; rate-limiting is done per hardware queue. In our proposed solution, flow classification takes place in software. By moving the classification to the clsact egress hook, which is thread-safe and does not require locking, we avoid the contention induced by the single qdisc lock. Furthermore, clsact filters are performed before the net-device’s TX queue is selected, giving the driver a chance to translate the class to the appropriate hardware queue. Please note that the user will need to configure the filters slightly differently: apply them to clsact rather than to HTB itself, and set the priority to the desired class-id.
>>
>> For example, the following two filters are equivalent:
>>         1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
>>         2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
>>
>> Note: to support the above filter, no code changes to the upstream kernel or to the iproute2 package are required.
>>
>> Furthermore, the most concerning aspect of the current HTB implementation is its lack of support for multi-queue. All of the net-device’s TX queues point to the same HTB instance, resulting in high spin-lock contention. This contention might negate the overall performance gains expected from introducing the offload in the first place. We should modify HTB to present itself as the mq qdisc does. By default, the mq qdisc allocates a simple fifo qdisc per TX queue exposed by the lower-layer device. This applies only when hardware offload is configured; otherwise, HTB behaves as usual. There is no HTB code along the data-path; the only overhead compared to regular traffic is the classification taking place at clsact. Please note that this design implies full offload, with no fallback to software; partially offloading the hierarchical tree is not trivial anyway, considering borrowing between siblings.
>>
>>
>> To summarize: for each HTB leaf-class the driver will allocate a special queue and match it with a corresponding net-device TX queue (increasing real_num_tx_queues). A unique fifo qdisc will be attached to each such TX queue. Classification will still take place in software, but at the clsact egress hook rather than in HTB itself. This way we can scale to thousands of classes while maintaining wire-speed performance and reducing CPU overhead.
>>
>> Any feedback will be much appreciated.
> It was of course my hope that fifos would be universally replaced with
> rfc8290 or rfc8033 by now. So moving a software htb +
> net.core.default_qdisc = "of anything other than pfifo_fast" to a
> hardware offload with fifos... will be "interesting". Will there be
> features to at least limit the size of the offloaded fifo by packets
> (or preferably, bytes?).
>
Yes, as far as I know we can limit the size of the hardware queue by number of packets.
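
For reference, on a software fifo the two units you mention map to pfifo (limit in packets) and bfifo (limit in bytes). A minimal sketch, assuming the eth0 device and class 1:10 from the filter example above, with a software HTB already installed at handle 1:. The limit values are arbitrary placeholders, and whether the offloaded hardware queue would be sized through these same qdiscs is an assumption, not something decided in this RFC:

```shell
# Assumes an existing HTB root qdisc (handle 1:) with leaf class 1:10 on eth0.

# Cap the leaf fifo by packet count (placeholder value).
tc qdisc replace dev eth0 parent 1:10 pfifo limit 1000

# Alternative: cap the same leaf by bytes instead (placeholder value).
# This replaces the pfifo attached above.
tc qdisc replace dev eth0 parent 1:10 bfifo limit 1500000

# Inspect the result.
tc -s qdisc show dev eth0
```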

>
>
>> Cheers,
>> Kuperman
>>
>>
>