From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Yossi Kuperman <yossiku@mellanox.com>,
"netdev\@vger.kernel.org" <netdev@vger.kernel.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>,
Jiri Pirko <jiri@mellanox.com>, Rony Efraim <ronye@mellanox.com>,
Maxim Mikityanskiy <maximmi@mellanox.com>,
John Fastabend <john.fastabend@gmail.com>,
Eran Ben Elisha <eranbe@mellanox.com>
Subject: Re: [RFC] Hierarchical QoS Hardware Offload (HTB)
Date: Sat, 01 Feb 2020 17:48:26 +0100
Message-ID: <87y2tmckyt.fsf@toke.dk>
In-Reply-To: <FC053E80-74C9-4884-92F1-4DBEB5F0C81A@mellanox.com>
Yossi Kuperman <yossiku@mellanox.com> writes:
> The following is a brief outline of our plans for offloading HTB functionality.
>
> HTB qdisc allows you to use one physical link to simulate several
> slower links. This is done by configuring a hierarchical QoS tree;
> each tree node corresponds to a class. Filters are used to classify
> flows to different classes. HTB is quite flexible and versatile, but
> it comes at a cost: HTB does not scale well and consumes considerable
> CPU and memory. Our aim is to offload HTB functionality to hardware
> and provide the user with the flexibility and the conventional tools
> offered by the TC subsystem, while scaling to thousands of traffic
> classes and maintaining wire-speed performance.
>
> Mellanox hardware can support hierarchical rate-limiting;
> rate-limiting is done per hardware queue. In our proposed solution,
> flow classification takes place in software. By moving the
> classification to the clsact egress hook, which is thread-safe and
> does not require locking, we avoid the contention induced by the
> single qdisc lock. Furthermore, clsact filters run before the
> net-device’s TX queue is selected, giving the driver a chance to
> translate the class to the appropriate hardware queue. Please note
> that users will need to configure the filters slightly differently:
> attach them to clsact rather than to HTB itself, and set the
> priority to the desired class-id.
>
> For example, the following two filters are equivalent:
> 1. tc filter add dev eth0 parent 1:0 protocol ip flower ip_proto tcp dst_port 80 classid 1:10
> 2. tc filter add dev eth0 egress protocol ip flower ip_proto tcp dst_port 80 action skbedit priority 1:10
>
> Note: to support the above filter, no code changes are required to either the upstream kernel or the iproute2 package.
>
> Furthermore, the most concerning aspect of the current HTB
> implementation is its lack of multi-queue support. All of the
> net-device’s TX queues point to the same HTB instance, resulting in
> high spin-lock contention. This contention may negate the
> performance gains expected from introducing the offload in the first
> place. We should modify HTB to present itself the way the mq qdisc
> does: by default, mq allocates a simple fifo qdisc per TX queue
> exposed by the lower-layer device. This would apply only when
> hardware offload is configured; otherwise, HTB behaves as usual.
> There is no HTB code along the data path; the only overhead compared
> to regular traffic is the classification taking place at clsact.
> Please note that this design implies full offload, with no fallback
> to software; partially offloading the hierarchical tree would not be
> trivial anyway, given borrowing between siblings.
>
>
> To summarize: for each HTB leaf class, the driver will allocate a
> special queue and match it with a corresponding net-device TX queue
> (increasing real_num_tx_queues). A unique fifo qdisc will be attached
> to each such TX queue. Classification will still take place in
> software, but at the clsact egress hook instead. This way we can
> scale to thousands of classes while maintaining wire-speed
> performance and reducing CPU overhead.
>
> Any feedback will be much appreciated.

Other than echoing Dave's concern about baking FIFO semantics into
hardware, maybe also consider whether implementing the required
functionality using EDT (Earliest Departure Time) semantics instead
might be better? I.e., something like this:
https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF
-Toke
Thread overview: 6+ messages
2020-01-30 16:20 [RFC] Hierarchical QoS Hardware Offload (HTB) Yossi Kuperman
2020-01-31 1:47 ` Dave Taht
2020-01-31 21:42 ` Dave Taht
2020-02-14 11:33 ` Yossi Kuperman
2020-02-01 16:48 ` Toke Høiland-Jørgensen [this message]
[not found] ` <bbafbd41-2a3b-3abd-e57c-18175a7c9e3f@mellanox.com>
2020-02-14 11:14 ` Toke Høiland-Jørgensen