* [PATCH net-next 0/3] Add Common Applications Kept Enhanced (cake) qdisc
@ 2017-12-03 22:06 Dave Taht
  2017-12-03 22:06 ` [PATCH net-next 1/3] pkt_sched.h: add support for sch_cake API Dave Taht
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Dave Taht @ 2017-12-03 22:06 UTC (permalink / raw)
  To: netdev
  Cc: Dave Taht, Toke Høiland-Jørgensen, Sebastian Moeller,
	Ryan Mounce, Jonathan Morton, Kevin Darbyshire-Bryant,
	Nils Andreas Svee, Dean Scarff, Loganaden Velvindron

sch_cake is intended to squeeze the most bandwidth and the lowest latency
out of even the slowest ISP links and routers, while presenting an API
simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided):

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort
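
For completeness, the elided ifb/tc-mirred redirection is the conventional
one; a sketch (interface names assumed, not taken from this series) might
look like:

```shell
# Redirect ingress traffic on eth0 through ifb0 so a root qdisc
# (such as cake above) can shape the download direction.
modprobe ifb numifbs=1
ip link set dev ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all matchall \
        action mirred egress redirect dev ifb0
```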

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8-way set-associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings on entering and exiting traffic.
* Support for interacting well with DOCSIS 3.0 shaper framing.
* Support for DSL framing types and shapers.
* (New) Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency variation.
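
As an illustration of the set-associative idea in the list above, here is a
standalone sketch (invented names, not the code from this series): the flow
hash selects a home bucket, and up to 8 ways are probed for a matching or
free tag before a collision is declared.

```c
#include <stdint.h>
#include <assert.h>

#define QUEUES 1024
#define WAYS   8

/* Tag 0 doubles as "free" in this sketch, which a real
 * implementation would avoid. */
static uint32_t tags[QUEUES];

/* Return the queue chosen for flow_hash; *collision is set when all
 * WAYS slots in the set are held by other flows. */
static int assign_queue(uint32_t flow_hash, int *collision)
{
	uint32_t base = flow_hash % QUEUES;
	uint32_t i;

	*collision = 0;
	for (i = 0; i < WAYS; i++) {
		uint32_t idx = (base + i) % QUEUES;

		if (tags[idx] == flow_hash || tags[idx] == 0) {
			tags[idx] = flow_hash;	/* hit, or claim a free way */
			return (int)idx;
		}
	}
	*collision = 1;		/* set full: fall back to the home bucket */
	return (int)base;
}
```

With a plain hash, two flows mapping to the same bucket would share a queue;
here the second flow simply lands in the next way of the set.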

Some features are still considered experimental, notably the
ingress_autorate bandwidth estimator and Cobalt itself.
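
At Cobalt's Codel core is the control law t + interval/sqrt(count), computed
in the patch with a Q0.32 fixed-point reciprocal square root refined by a
Newton-Raphson step per count change. A standalone sketch of that arithmetic
(hypothetical helper names) is:

```c
#include <stdint.h>
#include <assert.h>

/* One Newton-Raphson refinement of invsqrt ~= 1/sqrt(count) in Q0.32:
 *   invsqrt' = (invsqrt / 2) * (3 - count * invsqrt^2)
 */
static uint32_t newton_step(uint32_t invsqrt, uint32_t count)
{
	uint32_t invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32;
	uint64_t val = (3ULL << 32) - ((uint64_t)count * invsqrt2);

	val >>= 2;	/* avoid overflow in the following multiply */
	val = (val * invsqrt) >> (32 - 2 + 1);
	return (uint32_t)val;
}

/* t + interval/sqrt(count), with 1/sqrt(count) as a Q0.32 scale.
 * Assumes interval * invsqrt fits in 64 bits (true for ns-scale
 * intervals). */
static uint64_t control_law(uint64_t t, uint64_t interval, uint32_t invsqrt)
{
	return t + ((interval * invsqrt) >> 32);
}
```

As in the patch, stepping count up gradually keeps the iteration convergent;
jumping directly to a large count from invsqrt ~= 1.0 would not converge,
which is why the patch also keeps a small cache of accurate values.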

Various versions have been baking as an out-of-tree build for kernel
versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing was performed by Pete Heist, Georgios Amanakis, and the many other
members of the cake@lists.bufferbloat.net mailing list.

Dave Taht (3):
  pkt_sched.h: add support for sch_cake API
  Add Common Applications Kept Enhanced (cake) qdisc
  Add support for building the new cake qdisc

 include/net/cobalt.h           |  152 +++
 include/uapi/linux/pkt_sched.h |   58 +
 net/sched/Kconfig              |   11 +
 net/sched/Makefile             |    1 +
 net/sched/sch_cake.c           | 2561 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 2783 insertions(+)
 create mode 100644 include/net/cobalt.h
 create mode 100644 net/sched/sch_cake.c

-- 
2.7.4


* [PATCH net-next 1/3] pkt_sched.h: add support for sch_cake API
  2017-12-03 22:06 [PATCH net-next 0/3] Add Common Applications Kept Enhanced (cake) qdisc Dave Taht
@ 2017-12-03 22:06 ` Dave Taht
  2017-12-03 22:06 ` [PATCH net-next 2/3] Add Common Applications Kept Enhanced (cake) qdisc Dave Taht
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Dave Taht @ 2017-12-03 22:06 UTC (permalink / raw)
  To: netdev
  Cc: Dave Taht, Toke Høiland-Jørgensen, Sebastian Moeller,
	Ryan Mounce, Jonathan Morton, Kevin Darbyshire-Bryant,
	Nils Andreas Svee, Dean Scarff, Loganaden Velvindron

Signed-off-by: Dave Taht <dave.taht@gmail.com>
---
 include/uapi/linux/pkt_sched.h | 58 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index af3cc2f..ed7c111 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -935,4 +935,62 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+/* CAKE */
+enum {
+	TCA_CAKE_UNSPEC,
+	TCA_CAKE_BASE_RATE,
+	TCA_CAKE_DIFFSERV_MODE,
+	TCA_CAKE_ATM,
+	TCA_CAKE_FLOW_MODE,
+	TCA_CAKE_OVERHEAD,
+	TCA_CAKE_RTT,
+	TCA_CAKE_TARGET,
+	TCA_CAKE_AUTORATE,
+	TCA_CAKE_MEMORY,
+	TCA_CAKE_NAT,
+	TCA_CAKE_ETHERNET,
+	TCA_CAKE_WASH,
+	TCA_CAKE_MPU,
+	TCA_CAKE_INGRESS,
+	TCA_CAKE_ACK_FILTER,
+	__TCA_CAKE_MAX
+};
+#define TCA_CAKE_MAX	(__TCA_CAKE_MAX - 1)
+
+struct tc_cake_traffic_stats {
+	__u32 packets;
+	__u32 link_ms;
+	__u64 bytes;
+};
+
+#define TC_CAKE_MAX_TINS (8)
+struct tc_cake_xstats {
+	__u16 version;  /* == 5, increments when struct extended */
+	__u8  max_tins; /* == TC_CAKE_MAX_TINS */
+	__u8  tin_cnt;  /* <= TC_CAKE_MAX_TINS */
+
+	__u32 threshold_rate[TC_CAKE_MAX_TINS];
+	__u32 target_us[TC_CAKE_MAX_TINS];
+	struct tc_cake_traffic_stats sent[TC_CAKE_MAX_TINS];
+	struct tc_cake_traffic_stats dropped[TC_CAKE_MAX_TINS];
+	struct tc_cake_traffic_stats ecn_marked[TC_CAKE_MAX_TINS];
+	struct tc_cake_traffic_stats backlog[TC_CAKE_MAX_TINS];
+	__u32 interval_us[TC_CAKE_MAX_TINS];
+	__u32 way_indirect_hits[TC_CAKE_MAX_TINS];
+	__u32 way_misses[TC_CAKE_MAX_TINS];
+	__u32 way_collisions[TC_CAKE_MAX_TINS];
+	__u32 peak_delay_us[TC_CAKE_MAX_TINS]; /* ~= bulk flow delay */
+	__u32 avge_delay_us[TC_CAKE_MAX_TINS];
+	__u32 base_delay_us[TC_CAKE_MAX_TINS]; /* ~= sparse flows delay */
+	__u16 sparse_flows[TC_CAKE_MAX_TINS];
+	__u16 bulk_flows[TC_CAKE_MAX_TINS];
+	__u16 unresponse_flows[TC_CAKE_MAX_TINS]; /* v4 - was u32 last_len */
+	__u16 spare[TC_CAKE_MAX_TINS]; /* v4 - split last_len */
+	__u32 max_skblen[TC_CAKE_MAX_TINS];
+	__u32 capacity_estimate;  /* version 2 */
+	__u32 memory_limit;       /* version 3 */
+	__u32 memory_used;	  /* version 3 */
+	struct tc_cake_traffic_stats ack_drops[TC_CAKE_MAX_TINS]; /* v5 */
+};
+
 #endif
-- 
2.7.4


* [PATCH net-next 2/3] Add Common Applications Kept Enhanced (cake) qdisc
  2017-12-03 22:06 [PATCH net-next 0/3] Add Common Applications Kept Enhanced (cake) qdisc Dave Taht
  2017-12-03 22:06 ` [PATCH net-next 1/3] pkt_sched.h: add support for sch_cake API Dave Taht
@ 2017-12-03 22:06 ` Dave Taht
  2017-12-03 22:06 ` [PATCH net-next 3/3] Add support for building the new cake qdisc Dave Taht
  2018-04-24 11:44 ` [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Toke Høiland-Jørgensen
  3 siblings, 0 replies; 20+ messages in thread
From: Dave Taht @ 2017-12-03 22:06 UTC (permalink / raw)
  To: netdev
  Cc: Dave Taht, Toke Høiland-Jørgensen, Sebastian Moeller,
	Ryan Mounce, Jonathan Morton, Kevin Darbyshire-Bryant,
	Nils Andreas Svee, Dean Scarff, Loganaden Velvindron

sch_cake is intended to squeeze the most bandwidth and the lowest latency
out of even the slowest ISP links and routers, while presenting an API
simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided):

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8-way set-associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings on entering and exiting traffic.
* Support for interacting well with DOCSIS 3.0 shaper framing.
* Extensive support for DSL framing types.
* (New) Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency variation.
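
The deficit-mode shaper named above advances a virtual clock by each
packet's serialization delay, per the formula the patch uses,
time_next = time_this + ((len * rate_ns) >> rate_shft). A sketch with
invented names:

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical helper, not the patch's code: each packet pushes the
 * earliest next-transmission time forward by its own serialization
 * delay, so no token-bucket depth parameter is needed. */
static uint64_t shaper_next_time(uint64_t now, uint64_t time_next,
				 uint32_t len, uint32_t rate_ns,
				 uint16_t rate_shft)
{
	/* start from whichever is later: the clock or the schedule */
	uint64_t t = time_next > now ? time_next : now;

	return t + (((uint64_t)len * rate_ns) >> rate_shft);
}
```

Bursts are limited to what is needed to cover scheduling latency: after an
idle gap the schedule restarts from "now" rather than accumulating credit.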

Some features are still considered experimental, notably the
ingress_autorate bandwidth estimator and Cobalt itself.

Various versions have been baking as an out-of-tree build for kernel
versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing was performed by Pete Heist, Georgios Amanakis, and the many other
members of the cake@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth0
qdisc cake 1: dev eth0 root refcnt 2 bandwidth 20Mbit diffserv3
 triple-isolate nat ack-filter rtt 100.0ms noatm overhead 18 via-ethernet
 total_overhead 18 hard_header_len 14 mpu 64 
 Sent 119336682 bytes 390608 pkt (dropped 126463, overlimits 882857 requeues 0) 
 backlog 23620b 29p requeues 0 
 memory used: 158598b of 4Mb
 capacity estimate: 20Mbit
                 Bulk   Best Effort      Voice
  thresh      1250Kbit      20Mbit       5Mbit
  target        14.5ms       5.0ms       5.0ms
  interval     109.5ms     100.0ms     100.0ms
  pk_delay       6.8ms      12.8ms       9.8ms
  av_delay       2.5ms       2.6ms       2.6ms
  sp_delay       538us       256us       350us
  pkts           80534      339655       96911
  bytes       12165915    90314487    31167998
  way_inds           0           0           0
  way_miss           6          79           7
  way_cols           0           0           0
  drops           1161        2303         480
  marks              0           0           0
  ack_drop       45420       69090        8009
  sp_flows           2           5           2
  bk_flows           4          16           3
  un_flows           0           0           0
  max_len         3028        3028        3028

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Tested-by: Pete Heist <peteheist@gmail.com>
Tested-by: Georgios Amanakis <gamanakis@gmail.com> 

---
 include/net/cobalt.h |  152 +++
 net/sched/sch_cake.c | 2561 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 2713 insertions(+)
 create mode 100644 include/net/cobalt.h
 create mode 100644 net/sched/sch_cake.c

diff --git a/include/net/cobalt.h b/include/net/cobalt.h
new file mode 100644
index 0000000..b2559df
--- /dev/null
+++ b/include/net/cobalt.h
@@ -0,0 +1,152 @@
+#ifndef __NET_SCHED_COBALT_H
+#define __NET_SCHED_COBALT_H
+
+/* COBALT - Codel-BLUE Alternate AQM algorithm.
+ *
+ *  Copyright (C) 2011-2012 Kathleen Nichols <nichols@pollere.com>
+ *  Copyright (C) 2011-2012 Van Jacobson <van@pollere.net>
+ *  Copyright (C) 2012 Eric Dumazet <edumazet@google.com>
+ *  Copyright (C) 2016-2017 Michael D. Täht <dave.taht@gmail.com>
+ *  Copyright (c) 2015-2017 Jonathan Morton <chromatix99@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions, and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ *    derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ */
+
+/* COBALT operates the Codel and BLUE algorithms in parallel, in order to
+ * obtain the best features of each.  Codel is excellent on flows which
+ * respond to congestion signals in a TCP-like way.  BLUE is more effective on
+ * unresponsive flows.
+ */
+
+#include <linux/version.h>
+#include <linux/types.h>
+#include <linux/ktime.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <linux/reciprocal_div.h>
+
+typedef u64 cobalt_time_t;
+typedef s64 cobalt_tdiff_t;
+
+#define MS2TIME(a) (a * (u64) NSEC_PER_MSEC)
+#define US2TIME(a) (a * (u64) NSEC_PER_USEC)
+
+struct cobalt_skb_cb {
+	cobalt_time_t enqueue_time;
+};
+
+static inline cobalt_time_t cobalt_get_time(void)
+{
+	return ktime_get_ns();
+}
+
+static inline u32 cobalt_time_to_us(cobalt_time_t val)
+{
+	do_div(val, NSEC_PER_USEC);
+	return (u32)val;
+}
+
+static inline struct cobalt_skb_cb *get_cobalt_cb(const struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct cobalt_skb_cb));
+	return (struct cobalt_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static inline cobalt_time_t cobalt_get_enqueue_time(const struct sk_buff *skb)
+{
+	return get_cobalt_cb(skb)->enqueue_time;
+}
+
+static inline void cobalt_set_enqueue_time(struct sk_buff *skb,
+					   cobalt_time_t now)
+{
+	get_cobalt_cb(skb)->enqueue_time = now;
+}
+
+/**
+ * struct cobalt_params - contains codel and blue parameters
+ * @interval:	codel initial drop rate
+ * @target:     maximum persistent sojourn time & blue update rate
+ * @p_inc:      increment of blue drop probability (0.32 fxp)
+ * @p_dec:      decrement of blue drop probability (0.32 fxp)
+ */
+struct cobalt_params {
+	cobalt_time_t	interval;
+	cobalt_time_t	target;
+	u32		p_inc;
+	u32		p_dec;
+};
+
+/* struct cobalt_vars - contains codel and blue variables
+ * @count:	  codel dropping frequency
+ * @rec_inv_sqrt: reciprocal value of sqrt(count) >> 1
+ * @drop_next:    time to drop next packet, or when we dropped last
+ * @blue_timer:	  Blue time to next drop
+ * @p_drop:       BLUE drop probability (0.32 fxp)
+ * @dropping:     set if in dropping state
+ * @ecn_marked:   set if marked
+ */
+struct cobalt_vars {
+	u32		count;
+	u32		rec_inv_sqrt;
+	cobalt_time_t	drop_next;
+	cobalt_time_t	blue_timer;
+	u32     p_drop;
+	bool	dropping;
+	bool    ecn_marked;
+};
+
+/* Initialise visible and internal data. */
+static void cobalt_vars_init(struct cobalt_vars *vars);
+
+static struct cobalt_skb_cb *get_cobalt_cb(const struct sk_buff *skb);
+static cobalt_time_t cobalt_get_enqueue_time(const struct sk_buff *skb);
+
+/* Call this when a packet had to be dropped due to queue overflow. */
+static bool cobalt_queue_full(struct cobalt_vars *vars,
+			      struct cobalt_params *p,
+			      cobalt_time_t now);
+
+/* Call this when the queue was serviced but turned out to be empty. */
+static bool cobalt_queue_empty(struct cobalt_vars *vars,
+			       struct cobalt_params *p,
+			       cobalt_time_t now);
+
+/* Call this with a freshly dequeued packet for possible congestion marking.
+ * Returns true as an instruction to drop the packet, false for delivery.
+ */
+static bool cobalt_should_drop(struct cobalt_vars *vars,
+			       struct cobalt_params *p,
+			       cobalt_time_t now,
+			       struct sk_buff *skb);
+
+#endif
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
new file mode 100644
index 0000000..79964d7
--- /dev/null
+++ b/net/sched/sch_cake.c
@@ -0,0 +1,2561 @@
+/* COMMON Applications Kept Enhanced (CAKE) discipline - version 5
+ *
+ * Copyright (C) 2014-2017 Jonathan Morton <chromatix99@gmail.com>
+ * Copyright (C) 2015-2017 Toke Høiland-Jørgensen <toke@toke.dk>
+ * Copyright (C) 2014-2017 Dave Täht <dave.taht@gmail.com>
+ * Copyright (C) 2015-2017 Sebastian Moeller <moeller0@gmx.de>
+ * (C) 2015-2017 Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
+ * Copyright (C) 2017 Ryan Mounce <ryan@mounce.com.au>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *	notice, this list of conditions, and the following disclaimer,
+ *	without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *	notice, this list of conditions and the following disclaimer in the
+ *	documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ *	derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/in.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/jhash.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/reciprocal_div.h>
+#include <linux/version.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <linux/if_vlan.h>
+#include <net/tcp.h>
+#include <net/flow_dissector.h>
+#include <net/cobalt.h>
+
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/nf_conntrack.h>
+#endif
+
+/* The CAKE Principles:
+ *		   (or, how to have your cake and eat it too)
+ *
+ * This is a combination of several shaping, AQM and FQ techniques into one
+ * easy-to-use package:
+ *
+ * - An overall bandwidth shaper, to move the bottleneck away from dumb CPE
+ *   equipment and bloated MACs.  This operates in deficit mode (as in sch_fq),
+ *   eliminating the need for any sort of burst parameter (eg. token bucket
+ *   depth).  Burst support is limited to that necessary to overcome scheduling
+ *   latency.
+ *
+ * - A Diffserv-aware priority queue, giving more priority to certain classes,
+ *   up to a specified fraction of bandwidth.  Above that bandwidth threshold,
+ *   the priority is reduced to avoid starving other tins.
+ *
+ * - Each priority tin has a separate Flow Queue system, to isolate traffic
+ *   flows from each other.  This prevents a burst on one flow from increasing
+ *   the delay to another.  Flows are distributed to queues using a
+ *   set-associative hash function.
+ *
+ * - Each queue is actively managed by Cobalt, which is a combination of the
+ *   Codel and Blue AQM algorithms.  This serves flows fairly, and signals
+ *   congestion early via ECN (if available) and/or packet drops, to keep
+ *   latency low.  The codel parameters are auto-tuned based on the bandwidth
+ *   setting, as is necessary at low bandwidths.
+ *
+ * The configuration parameters are kept deliberately simple for ease of use.
+ * Everything has sane defaults.  Complete generality of configuration is *not*
+ * a goal.
+ *
+ * The priority queue operates according to a weighted DRR scheme, combined with
+ * a bandwidth tracker which reuses the shaper logic to detect which side of the
+ * bandwidth sharing threshold the tin is operating.  This determines whether a
+ * priority-based weight (high) or a bandwidth-based weight (low) is used for
+ * that tin in the current pass.
+ *
+ * This qdisc was inspired by Eric Dumazet's fq_codel code, which he kindly
+ * granted us permission to leverage.
+ */
+
+#define CAKE_SET_WAYS (8)
+#define CAKE_MAX_TINS (8)
+#define CAKE_QUEUES (1024)
+
+#ifndef CAKE_VERSION
+#define CAKE_VERSION "unknown"
+#endif
+static char *cake_version __attribute__((used)) = "Cake version: "
+		CAKE_VERSION;
+
+enum {
+	CAKE_SET_NONE = 0,
+	CAKE_SET_SPARSE,
+	CAKE_SET_SPARSE_WAIT, /* counted in SPARSE, actually in BULK */
+	CAKE_SET_BULK,
+	CAKE_SET_DECAYING
+};
+
+struct cake_flow {
+	/* this stuff is all needed per-flow at dequeue time */
+	struct sk_buff	  *head;
+	struct sk_buff	  *tail;
+	struct sk_buff	  *ackcheck;
+	struct list_head  flowchain;
+	s32		  deficit;
+	struct cobalt_vars cvars;
+	u16		  srchost; /* index into cake_host table */
+	u16		  dsthost;
+	u8		  set;
+}; /* please try to keep this structure <= 64 bytes */
+
+struct cake_host {
+	u32 srchost_tag;
+	u32 dsthost_tag;
+	u16 srchost_refcnt;
+	u16 dsthost_refcnt;
+};
+
+struct cake_heap_entry {
+	u16 t:3, b:10;
+};
+
+struct cake_tin_data {
+	struct cake_flow flows[CAKE_QUEUES];
+	u32	backlogs[CAKE_QUEUES];
+	u32	tags[CAKE_QUEUES]; /* for set association */
+	u16	overflow_idx[CAKE_QUEUES];
+	struct cake_host hosts[CAKE_QUEUES]; /* for triple isolation */
+	u16	flow_quantum;
+
+	struct cobalt_params cparams;
+	u32	drop_overlimit;
+	u16	bulk_flow_count;
+	u16	sparse_flow_count;
+	u16	decaying_flow_count;
+	u16	unresponsive_flow_count;
+
+	u32	max_skblen;
+
+	struct list_head new_flows;
+	struct list_head old_flows;
+	struct list_head decaying_flows;
+
+	/* time_next = time_this + ((len * rate_ns) >> rate_shft) */
+	u64	tin_time_next_packet;
+	u32	tin_rate_ns;
+	u32	tin_rate_bps;
+	u16	tin_rate_shft;
+
+	u16	tin_quantum_prio;
+	u16	tin_quantum_band;
+	s32	tin_deficit;
+	u32	tin_backlog;
+	u32	tin_dropped;
+	u32	tin_ecn_mark;
+
+	u32	packets;
+	u64	bytes;
+
+	u32	ack_drops;
+
+	/* moving averages */
+	cobalt_time_t avge_delay;
+	cobalt_time_t peak_delay;
+	cobalt_time_t base_delay;
+
+	/* hash function stats */
+	u32	way_directs;
+	u32	way_hits;
+	u32	way_misses;
+	u32	way_collisions;
+}; /* number of tins is small, so size of this struct doesn't matter much */
+
+struct cake_sched_data {
+	struct cake_tin_data *tins;
+
+	struct cake_heap_entry overflow_heap[CAKE_QUEUES * CAKE_MAX_TINS];
+	u16		overflow_timeout;
+
+	u16		tin_cnt;
+	u8		tin_mode;
+	u8		flow_mode;
+
+	/* time_next = time_this + ((len * rate_ns) >> rate_shft) */
+	u16		rate_shft;
+	u64		time_next_packet;
+	u64		failsafe_next_packet;
+	u32		rate_ns;
+	u32		rate_bps;
+	u16		rate_flags;
+	s16		rate_overhead;
+	u16		rate_mpu;
+	u32		interval;
+	u32		target;
+
+	/* resource tracking */
+	u32		buffer_used;
+	u32		buffer_max_used;
+	u32		buffer_limit;
+	u32		buffer_config_limit;
+
+	/* indices for dequeue */
+	u16		cur_tin;
+	u16		cur_flow;
+
+	struct qdisc_watchdog watchdog;
+	const u8	*tin_index;
+	const u8	*tin_order;
+
+	/* bandwidth capacity estimate */
+	u64		last_packet_time;
+	u64		avg_packet_interval;
+	u64		avg_window_begin;
+	u32		avg_window_bytes;
+	u32		avg_peak_bandwidth;
+	u64		last_reconfig_time;
+};
+
+enum {
+	CAKE_MODE_BESTEFFORT = 1,
+	CAKE_MODE_PRECEDENCE,
+	CAKE_MODE_DIFFSERV8,
+	CAKE_MODE_DIFFSERV4,
+	CAKE_MODE_LLT,
+	CAKE_MODE_DIFFSERV3,
+	CAKE_MODE_MAX
+};
+
+enum {
+	CAKE_FLAG_ATM = 0x0001,
+	CAKE_FLAG_PTM = 0x0002,
+	CAKE_FLAG_AUTORATE_INGRESS = 0x0010,
+	CAKE_FLAG_INGRESS = 0x0040,
+	CAKE_FLAG_WASH = 0x0100,
+	CAKE_FLAG_ACK_FILTER = 0x0200,
+	CAKE_FLAG_ACK_AGGRESSIVE = 0x0400
+};
+
+enum {
+	CAKE_FLOW_NONE = 0,
+	CAKE_FLOW_SRC_IP,
+	CAKE_FLOW_DST_IP,
+	CAKE_FLOW_HOSTS,    /* = CAKE_FLOW_SRC_IP | CAKE_FLOW_DST_IP */
+	CAKE_FLOW_FLOWS,
+	CAKE_FLOW_DUAL_SRC, /* = CAKE_FLOW_SRC_IP | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_DUAL_DST, /* = CAKE_FLOW_DST_IP | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_TRIPLE,   /* = CAKE_FLOW_HOSTS  | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_MAX,
+	CAKE_FLOW_NAT_FLAG = 64
+};
+
+static u16 quantum_div[CAKE_QUEUES + 1] = {0};
+
+/* Diffserv lookup tables */
+
+static const u8 precedence[] = {0, 0, 0, 0, 0, 0, 0, 0,
+				1, 1, 1, 1, 1, 1, 1, 1,
+				2, 2, 2, 2, 2, 2, 2, 2,
+				3, 3, 3, 3, 3, 3, 3, 3,
+				4, 4, 4, 4, 4, 4, 4, 4,
+				5, 5, 5, 5, 5, 5, 5, 5,
+				6, 6, 6, 6, 6, 6, 6, 6,
+				7, 7, 7, 7, 7, 7, 7, 7,
+				};
+
+static const u8 diffserv_llt[] = {1, 0, 0, 1, 2, 2, 1, 1,
+				3, 1, 1, 1, 1, 1, 1, 1,
+				1, 1, 1, 1, 1, 1, 1, 1,
+				1, 1, 1, 1, 1, 1, 1, 1,
+				1, 1, 1, 1, 1, 1, 1, 1,
+				1, 1, 1, 1, 2, 1, 2, 1,
+				4, 1, 1, 1, 1, 1, 1, 1,
+				4, 1, 1, 1, 1, 1, 1, 1,
+				};
+
+static const u8 diffserv8[] = {2, 5, 1, 2, 4, 2, 2, 2,
+			       0, 2, 1, 2, 1, 2, 1, 2,
+			       5, 2, 4, 2, 4, 2, 4, 2,
+				3, 2, 3, 2, 3, 2, 3, 2,
+				6, 2, 3, 2, 3, 2, 3, 2,
+				6, 2, 2, 2, 6, 2, 6, 2,
+				7, 2, 2, 2, 2, 2, 2, 2,
+				7, 2, 2, 2, 2, 2, 2, 2,
+				};
+
+static const u8 diffserv4[] = {0, 2, 0, 0, 2, 0, 0, 0,
+			       1, 0, 0, 0, 0, 0, 0, 0,
+				2, 0, 2, 0, 2, 0, 2, 0,
+				2, 0, 2, 0, 2, 0, 2, 0,
+				3, 0, 2, 0, 2, 0, 2, 0,
+				3, 0, 0, 0, 3, 0, 3, 0,
+				3, 0, 0, 0, 0, 0, 0, 0,
+				3, 0, 0, 0, 0, 0, 0, 0,
+				};
+
+static const u8 diffserv3[] = {0, 0, 0, 0, 2, 0, 0, 0,
+			       1, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 2, 0, 2, 0,
+				2, 0, 0, 0, 0, 0, 0, 0,
+				2, 0, 0, 0, 0, 0, 0, 0,
+				};
+
+static const u8 besteffort[] = {0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				};
+
+/* tin priority order for stats dumping */
+
+static const u8 normal_order[] = {0, 1, 2, 3, 4, 5, 6, 7};
+static const u8 bulk_order[] = {1, 0, 2, 3};
+
+#define REC_INV_SQRT_CACHE (16)
+static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
+
+/* http://en.wikipedia.org/wiki/Methods_of_computing_square_roots
+ * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
+ *
+ * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
+ */
+
+static void cobalt_newton_step(struct cobalt_vars *vars)
+{
+	u32 invsqrt = vars->rec_inv_sqrt;
+	u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
+	u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
+
+	val >>= 2; /* avoid overflow in following multiply */
+	val = (val * invsqrt) >> (32 - 2 + 1);
+
+	vars->rec_inv_sqrt = val;
+}
+
+static void cobalt_invsqrt(struct cobalt_vars *vars)
+{
+	if (vars->count < REC_INV_SQRT_CACHE)
+		vars->rec_inv_sqrt = cobalt_rec_inv_sqrt_cache[vars->count];
+	else
+		cobalt_newton_step(vars);
+}
+
+/* There is a big difference in timing between the accurate values placed in
+ * the cache and the approximations given by a single Newton step for small
+ * count values, particularly when stepping from count 1 to 2 or vice versa.
+ * Above 16, a single Newton step gives sufficient accuracy in either
+ * direction, given the precision stored.
+ *
+ * The magnitude of the error when stepping up to count 2 is such as to give
+ * the value that *should* have been produced at count 4.
+ */
+
+static void cobalt_cache_init(void)
+{
+	struct cobalt_vars v;
+
+	memset(&v, 0, sizeof(v));
+	v.rec_inv_sqrt = ~0U;
+	cobalt_rec_inv_sqrt_cache[0] = v.rec_inv_sqrt;
+
+	for (v.count = 1; v.count < REC_INV_SQRT_CACHE; v.count++) {
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+
+		cobalt_rec_inv_sqrt_cache[v.count] = v.rec_inv_sqrt;
+	}
+}
+
+static void cobalt_vars_init(struct cobalt_vars *vars)
+{
+	memset(vars, 0, sizeof(*vars));
+
+	if (!cobalt_rec_inv_sqrt_cache[0]) {
+		cobalt_cache_init();
+		cobalt_rec_inv_sqrt_cache[0] = ~0;
+	}
+}
+
+/* CoDel control_law is t + interval/sqrt(count)
+ * We maintain in rec_inv_sqrt the reciprocal value of sqrt(count) to avoid
+ * both sqrt() and divide operation.
+ */
+static cobalt_time_t cobalt_control(cobalt_time_t t,
+				    cobalt_time_t interval,
+				    u32 rec_inv_sqrt)
+{
+	return t + reciprocal_scale(interval, rec_inv_sqrt);
+}
+
+/* Call this when a packet had to be dropped due to queue overflow.  Returns
+ * true if the BLUE state was quiescent before but active after this call.
+ */
+static bool cobalt_queue_full(struct cobalt_vars *vars,
+			      struct cobalt_params *p,
+			      cobalt_time_t now)
+{
+	bool up = false;
+
+	if ((now - vars->blue_timer) > p->target) {
+		up = !vars->p_drop;
+		vars->p_drop += p->p_inc;
+		if (vars->p_drop < p->p_inc)
+			vars->p_drop = ~0;
+		vars->blue_timer = now;
+	}
+	vars->dropping = true;
+	vars->drop_next = now;
+	if (!vars->count)
+		vars->count = 1;
+
+	return up;
+}
+
+/* Call this when the queue was serviced but turned out to be empty.  Returns
+ * true if the BLUE state was active before but quiescent after this call.
+ */
+static bool cobalt_queue_empty(struct cobalt_vars *vars,
+			       struct cobalt_params *p,
+			       cobalt_time_t now)
+{
+	bool down = false;
+
+	if (vars->p_drop && (now - vars->blue_timer) > p->target) {
+		if (vars->p_drop < p->p_dec)
+			vars->p_drop = 0;
+		else
+			vars->p_drop -= p->p_dec;
+		vars->blue_timer = now;
+		down = !vars->p_drop;
+	}
+	vars->dropping = false;
+
+	if (vars->count && (now - vars->drop_next) >= 0) {
+		vars->count--;
+		cobalt_invsqrt(vars);
+		vars->drop_next = cobalt_control(vars->drop_next,
+						 p->interval,
+						 vars->rec_inv_sqrt);
+	}
+
+	return down;
+}
+
+/* Call this with a freshly dequeued packet for possible congestion marking.
+ * Returns true as an instruction to drop the packet, false for delivery.
+ */
+static bool cobalt_should_drop(struct cobalt_vars *vars,
+			       struct cobalt_params *p,
+			       cobalt_time_t now,
+			       struct sk_buff *skb)
+{
+	bool drop = false;
+
+	/* Simplified Codel implementation */
+	cobalt_tdiff_t sojourn  = now - cobalt_get_enqueue_time(skb);
+
+	/* The 'schedule' variable records, in its sign, whether 'now' is
+	 * before or after 'drop_next'.  This allows 'drop_next' to be updated
+	 * before the next scheduling decision is actually branched, without
+	 * destroying that information.  Similarly, the first 'schedule' value
+	 * calculated is preserved in the boolean 'next_due'.
+	 *
+	 * As for 'drop_next', we take advantage of the fact that 'interval'
+	 * is both the delay between first exceeding 'target' and the first
+	 * signalling event, *and* the scaling factor for the signalling
+	 * frequency.  It's therefore very natural to use a single mechanism
+	 * for both purposes, and it eliminates a significant amount of the
+	 * reference Codel's spaghetti code.  To help with this, both the '0'
+	 * and '1' entries in the invsqrt cache are 0xFFFFFFFF, as close as
+	 * possible to 1.0 in fixed-point.
+	 */
+
+	cobalt_tdiff_t schedule = now - vars->drop_next;
+	bool over_target = sojourn > p->target;
+	bool next_due    = vars->count && schedule >= 0;
+
+	vars->ecn_marked = false;
+
+	if (over_target) {
+		if (!vars->dropping) {
+			vars->dropping = true;
+			vars->drop_next = cobalt_control(now,
+							 p->interval,
+							 vars->rec_inv_sqrt);
+		}
+		if (!vars->count)
+			vars->count = 1;
+	} else if (vars->dropping) {
+		vars->dropping = false;
+	}
+
+	if (next_due && vars->dropping) {
+		/* Use ECN mark if possible, otherwise drop */
+		drop = !(vars->ecn_marked = INET_ECN_set_ce(skb));
+
+		vars->count++;
+		if (!vars->count)
+			vars->count--;
+		cobalt_invsqrt(vars);
+		vars->drop_next = cobalt_control(vars->drop_next,
+						 p->interval,
+						 vars->rec_inv_sqrt);
+		schedule = now - vars->drop_next;
+	} else {
+		while (next_due) {
+			vars->count--;
+			cobalt_invsqrt(vars);
+			vars->drop_next = cobalt_control(vars->drop_next,
+							 p->interval,
+							 vars->rec_inv_sqrt);
+			schedule = now - vars->drop_next;
+			next_due = vars->count && schedule >= 0;
+		}
+	}
+
+	/* Simple BLUE implementation.  Lack of ECN is deliberate. */
+	if (vars->p_drop)
+		drop |= (prandom_u32() < vars->p_drop);
+
+	/* Overload the drop_next field as an activity timeout */
+	if (!vars->count)
+		vars->drop_next = now + p->interval;
+	else if (schedule > 0 && !drop)
+		vars->drop_next = now;
+
+	return drop;
+}
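cobalt_should_drop() paces its signalling with the Codel control law: each mark or drop is scheduled interval/sqrt(count) after the previous one, computed in fixed point through the cached reciprocal square root mentioned in the comment above. A userspace sketch of that fixed-point law, assuming (as the invsqrt-cache comment implies) a Q0.32 representation where 0xFFFFFFFF is as close as possible to 1.0:

```c
#include <stdint.h>

/* Illustrative stand-in for cobalt_control(): advance the schedule by
 * interval scaled by rec_inv_sqrt, i.e. interval / sqrt(count). */
static uint64_t control_law(uint64_t t, uint64_t interval,
			    uint32_t rec_inv_sqrt)
{
	return t + ((interval * (uint64_t)rec_inv_sqrt) >> 32);
}
```

With rec_inv_sqrt = 0x80000000 (0.5, i.e. count = 4), a 100 ms interval yields a 50 ms spacing; with ~1.0 the spacing is a whisker under the full interval due to fixed-point truncation.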
+
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+
+static inline void cake_update_flowkeys(struct flow_keys *keys,
+					const struct sk_buff *skb)
+{
+	enum ip_conntrack_info ctinfo;
+	bool rev = false;
+
+	struct nf_conn *ct;
+	const struct nf_conntrack_tuple *tuple;
+
+	if (tc_skb_protocol(skb) != htons(ETH_P_IP))
+		return;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (ct) {
+		tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
+	} else {
+		const struct nf_conntrack_tuple_hash *hash;
+		struct nf_conntrack_tuple srctuple;
+
+		if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
+				       NFPROTO_IPV4, dev_net(skb->dev),
+				       &srctuple))
+			return;
+
+		hash = nf_conntrack_find_get(dev_net(skb->dev),
+					     &nf_ct_zone_dflt,
+					     &srctuple);
+		if (!hash)
+			return;
+
+		rev = true;
+		ct = nf_ct_tuplehash_to_ctrack(hash);
+		tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
+	}
+
+	keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
+	keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
+
+	if (keys->ports.ports) {
+		keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
+		keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
+	}
+
+	if (rev)
+		nf_ct_put(ct);
+}
+#else
+static inline void cake_update_flowkeys(struct flow_keys *keys,
+					const struct sk_buff *skb)
+{
+	/* There is nothing we can do here without CONNTRACK */
+}
+#endif
+
+/* Cake's flow modes use multi-bit values, so a plain bitwise AND against a
+ * dual-mode flag would also match triple-isolate mode.  The helpers below
+ * therefore test for an exact match of the relevant bits.
+ */
+
+static inline bool cake_dsrc(int flow_mode)
+{
+	return (flow_mode & CAKE_FLOW_DUAL_SRC) == CAKE_FLOW_DUAL_SRC;
+}
+
+static inline bool cake_ddst(int flow_mode)
+{
+	return (flow_mode & CAKE_FLOW_DUAL_DST) == CAKE_FLOW_DUAL_DST;
+}
+
+static inline u32
+cake_hash(struct cake_tin_data *q, const struct sk_buff *skb, int flow_mode)
+{
+	struct flow_keys keys, host_keys;
+	u32 flow_hash = 0, srchost_hash, dsthost_hash;
+	u16 reduced_hash, srchost_idx, dsthost_idx;
+
+	if (unlikely(flow_mode == CAKE_FLOW_NONE))
+		return 0;
+
+	skb_flow_dissect_flow_keys(skb, &keys,
+				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+
+	if (flow_mode & CAKE_FLOW_NAT_FLAG)
+		cake_update_flowkeys(&keys, skb);
+
+	/* flow_hash_from_keys() sorts the addresses by value, so we have
+	 * to preserve their order in a separate data structure to treat
+	 * src and dst host addresses as independently selectable.
+	 */
+	host_keys = keys;
+	host_keys.ports.ports     = 0;
+	host_keys.basic.ip_proto  = 0;
+	host_keys.keyid.keyid     = 0;
+	host_keys.tags.flow_label = 0;
+
+	switch (host_keys.control.addr_type) {
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+		host_keys.addrs.v4addrs.src = 0;
+		dsthost_hash = flow_hash_from_keys(&host_keys);
+		host_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
+		host_keys.addrs.v4addrs.dst = 0;
+		srchost_hash = flow_hash_from_keys(&host_keys);
+		break;
+
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+		memset(&host_keys.addrs.v6addrs.src, 0,
+		       sizeof(host_keys.addrs.v6addrs.src));
+		dsthost_hash = flow_hash_from_keys(&host_keys);
+		host_keys.addrs.v6addrs.src = keys.addrs.v6addrs.src;
+		memset(&host_keys.addrs.v6addrs.dst, 0,
+		       sizeof(host_keys.addrs.v6addrs.dst));
+		srchost_hash = flow_hash_from_keys(&host_keys);
+		break;
+
+	default:
+		dsthost_hash = 0;
+		srchost_hash = 0;
+	}
+
+	/* This *must* be after the above switch, since as a
+	 * side-effect it sorts the src and dst addresses.
+	 */
+	if (flow_mode & CAKE_FLOW_FLOWS)
+		flow_hash = flow_hash_from_keys(&keys);
+
+	if (!(flow_mode & CAKE_FLOW_FLOWS)) {
+		if (flow_mode & CAKE_FLOW_SRC_IP)
+			flow_hash ^= srchost_hash;
+
+		if (flow_mode & CAKE_FLOW_DST_IP)
+			flow_hash ^= dsthost_hash;
+	}
+
+	reduced_hash = flow_hash % CAKE_QUEUES;
+
+	/* set-associative hashing */
+	/* fast path if no hash collision (direct lookup succeeds) */
+	if (likely(q->tags[reduced_hash] == flow_hash &&
+		   q->flows[reduced_hash].set)) {
+		q->way_directs++;
+	} else {
+		u32 inner_hash = reduced_hash % CAKE_SET_WAYS;
+		u32 outer_hash = reduced_hash - inner_hash;
+		u32 i, k;
+		bool allocate_src = false;
+		bool allocate_dst = false;
+
+		/* check if any active queue in the set is reserved for
+		 * this flow.
+		 */
+		for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+		     i++, k = (k + 1) % CAKE_SET_WAYS) {
+			if (q->tags[outer_hash + k] == flow_hash) {
+				if (i)
+					q->way_hits++;
+
+				if (!q->flows[outer_hash + k].set) {
+					/* need to increment host refcnts */
+					allocate_src = cake_dsrc(flow_mode);
+					allocate_dst = cake_ddst(flow_mode);
+				}
+
+				goto found;
+			}
+		}
+
+		/* no queue is reserved for this flow, look for an
+		 * empty one.
+		 */
+		for (i = 0; i < CAKE_SET_WAYS;
+			 i++, k = (k + 1) % CAKE_SET_WAYS) {
+			if (!q->flows[outer_hash + k].set) {
+				q->way_misses++;
+				allocate_src = cake_dsrc(flow_mode);
+				allocate_dst = cake_ddst(flow_mode);
+				goto found;
+			}
+		}
+
+		/* With no empty queues, default to the original
+		 * queue, accept the collision, update the host tags.
+		 */
+		q->way_collisions++;
+		q->hosts[q->flows[reduced_hash].srchost].srchost_refcnt--;
+		q->hosts[q->flows[reduced_hash].dsthost].dsthost_refcnt--;
+		allocate_src = cake_dsrc(flow_mode);
+		allocate_dst = cake_ddst(flow_mode);
+found:
+		/* reserve queue for future packets in same flow */
+		reduced_hash = outer_hash + k;
+		q->tags[reduced_hash] = flow_hash;
+
+		if (allocate_src) {
+			srchost_idx = srchost_hash % CAKE_QUEUES;
+			inner_hash = srchost_idx % CAKE_SET_WAYS;
+			outer_hash = srchost_idx - inner_hash;
+			for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+				i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (q->hosts[outer_hash + k].srchost_tag ==
+				    srchost_hash)
+					goto found_src;
+			}
+			for (i = 0; i < CAKE_SET_WAYS;
+				i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (!q->hosts[outer_hash + k].srchost_refcnt)
+					break;
+			}
+			q->hosts[outer_hash + k].srchost_tag = srchost_hash;
+found_src:
+			srchost_idx = outer_hash + k;
+			q->hosts[srchost_idx].srchost_refcnt++;
+			q->flows[reduced_hash].srchost = srchost_idx;
+		}
+
+		if (allocate_dst) {
+			dsthost_idx = dsthost_hash % CAKE_QUEUES;
+			inner_hash = dsthost_idx % CAKE_SET_WAYS;
+			outer_hash = dsthost_idx - inner_hash;
+			for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+			     i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (q->hosts[outer_hash + k].dsthost_tag ==
+				    dsthost_hash)
+					goto found_dst;
+			}
+			for (i = 0; i < CAKE_SET_WAYS;
+			     i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (!q->hosts[outer_hash + k].dsthost_refcnt)
+					break;
+			}
+			q->hosts[outer_hash + k].dsthost_tag = dsthost_hash;
+found_dst:
+			dsthost_idx = outer_hash + k;
+			q->hosts[dsthost_idx].dsthost_refcnt++;
+			q->flows[reduced_hash].dsthost = dsthost_idx;
+		}
+	}
+
+	return reduced_hash;
+}
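The indexing scheme in cake_hash() divides the queue array into sets of CAKE_SET_WAYS consecutive buckets: the hash reduced modulo CAKE_QUEUES picks a bucket, and the flow may then occupy any of the 8 buckets of that bucket's set, which is what keeps collisions rare. A minimal sketch of the index arithmetic, using the constants as they appear to be defined in the patch (CAKE_QUEUES = 1024, CAKE_SET_WAYS = 8):

```c
#include <stdint.h>

#define CAKE_QUEUES	1024
#define CAKE_SET_WAYS	8

/* First bucket of the set containing reduced_hash; the search in
 * cake_hash() probes set_base(h) .. set_base(h) + CAKE_SET_WAYS - 1,
 * starting at reduced_hash and wrapping within the set. */
static uint32_t set_base(uint32_t reduced_hash)
{
	return reduced_hash - (reduced_hash % CAKE_SET_WAYS);
}
```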
+
+/* helper functions : might be changed when/if skb use a standard list_head */
+/* remove one skb from head of slot queue */
+
+static inline struct sk_buff *dequeue_head(struct cake_flow *flow)
+{
+	struct sk_buff *skb = flow->head;
+
+	if (skb) {
+		flow->head = skb->next;
+		skb->next = NULL;
+
+		if (skb == flow->ackcheck)
+			flow->ackcheck = NULL;
+	}
+
+	return skb;
+}
+
+/* add skb to flow queue (tail add) */
+
+static inline void
+flow_queue_add(struct cake_flow *flow, struct sk_buff *skb)
+{
+	if (!flow->head)
+		flow->head = skb;
+	else
+		flow->tail->next = skb;
+	flow->tail = skb;
+	skb->next = NULL;
+}
+
+static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
+				       struct cake_flow *flow)
+{
+	int seglen;
+	struct sk_buff *skb = flow->tail, *skb_check, *skb_check_prev;
+	struct iphdr *iph, *iph_check;
+	struct ipv6hdr *ipv6h, *ipv6h_check;
+	struct tcphdr *tcph, *tcph_check;
+	bool otherconn_ack_seen = false;
+	struct sk_buff *otherconn_checked_to = NULL;
+	bool thisconn_redundant_seen = false, thisconn_seen_last = false;
+	struct sk_buff *thisconn_checked_to = NULL, *thisconn_ack = NULL;
+	bool aggressive = q->rate_flags & CAKE_FLAG_ACK_AGGRESSIVE;
+
+	/* no other possible ACKs to filter */
+	if (flow->head == skb)
+		return NULL;
+
+	iph = skb->encapsulation ? inner_ip_hdr(skb) : ip_hdr(skb);
+	ipv6h = skb->encapsulation ? inner_ipv6_hdr(skb) : ipv6_hdr(skb);
+
+	/* check that the innermost network header is v4/v6, and contains TCP */
+	if (iph->version == 4) {
+		if (iph->protocol != IPPROTO_TCP)
+			return NULL;
+		seglen = ntohs(iph->tot_len) - (4 * iph->ihl);
+		tcph = (struct tcphdr *)((void *)iph + (4 * iph->ihl));
+	} else if (ipv6h->version == 6) {
+		if (ipv6h->nexthdr != IPPROTO_TCP)
+			return NULL;
+		seglen = ntohs(ipv6h->payload_len);
+		tcph = (struct tcphdr *)((void *)ipv6h + sizeof(struct ipv6hdr));
+	} else {
+		return NULL;
+	}
+
+	/* the 'triggering' packet need only have the ACK flag set.
+	 * also check that SYN is not set, as there won't be any previous ACKs.
+	 */
+	if ((tcp_flag_word(tcph) &
+		(TCP_FLAG_ACK | TCP_FLAG_SYN)) != TCP_FLAG_ACK)
+		return NULL;
+
+	/* the 'triggering' ACK is at the end of the queue,
+	 * we have already returned if it is the only packet in the flow.
+	 * stop before last packet in queue, don't compare trigger ACK to itself
+	 * start where we finished last time if recorded in ->ackcheck
+	 * otherwise start from the head of the flow queue.
+	 */
+	skb_check_prev = flow->ackcheck;
+	skb_check = flow->ackcheck ?: flow->head;
+
+	while (skb_check->next) {
+		bool pure_ack, thisconn;
+
+		/* don't increment if at head of flow queue (_prev == NULL) */
+		if (skb_check_prev) {
+			skb_check_prev = skb_check;
+			skb_check = skb_check->next;
+			if (!skb_check->next)
+				break;
+		} else {
+			skb_check_prev = ERR_PTR(-1);
+		}
+
+		iph_check = skb_check->encapsulation ?
+			inner_ip_hdr(skb_check) : ip_hdr(skb_check);
+		ipv6h_check = skb_check->encapsulation ?
+			inner_ipv6_hdr(skb_check) : ipv6_hdr(skb_check);
+
+		if (iph_check->version == 4) {
+			if (iph_check->protocol != IPPROTO_TCP)
+				continue;
+			seglen = ntohs(iph_check->tot_len) - (4 * iph_check->ihl);
+			tcph_check = (struct tcphdr *)((void *)iph_check
+				+ (4 * iph_check->ihl));
+			if (iph->version == 4 &&
+			    iph_check->saddr == iph->saddr &&
+			    iph_check->daddr == iph->daddr) {
+				thisconn = true;
+			} else {
+				thisconn = false;
+			}
+		} else if (ipv6h_check->version == 6) {
+			if (ipv6h_check->nexthdr != IPPROTO_TCP)
+				continue;
+			seglen = ntohs(ipv6h_check->payload_len);
+			tcph_check = (struct tcphdr *)((void *)ipv6h_check +
+				     sizeof(struct ipv6hdr));
+			if (ipv6h->version == 6 &&
+			    ipv6_addr_equal(&ipv6h_check->saddr, &ipv6h->saddr) &&
+			    ipv6_addr_equal(&ipv6h_check->daddr, &ipv6h->daddr)) {
+				thisconn = true;
+			} else {
+				thisconn = false;
+			}
+		} else {
+			continue;
+		}
+
+		/* stricter criteria apply to ACKs that we may filter
+		 * 3 reserved flags must be unset to avoid future breakage
+		 * ECE/CWR/NS can be safely ignored
+		 * ACK must be set
+		 * All other flags URG/PSH/RST/SYN/FIN must be unset
+		 * 0x0FFF0000 = all TCP flags (confirm ACK=1, others zero)
+		 * 0x01C00000 = NS/CWR/ECE (safe to ignore)
+		 * 0x0E3F0000 = 0x0FFF0000 & ~0x01C00000
+		 * must be 'pure' ACK, contain zero bytes of segment data
+		 * options are ignored
+		 */
+		if ((tcp_flag_word(tcph_check) &
+		     (TCP_FLAG_ACK | TCP_FLAG_SYN)) != TCP_FLAG_ACK) {
+			continue;
+		} else if (((tcp_flag_word(tcph_check) &
+				cpu_to_be32(0x0E3F0000)) != TCP_FLAG_ACK) ||
+			   ((seglen - 4 * tcph_check->doff) != 0)) {
+			pure_ack = false;
+		} else {
+			pure_ack = true;
+		}
+
+		/* if we find an ACK belonging to a different connection
+		 * continue checking for other ACKs this round however
+		 * restart checking from the other connection next time.
+		 */
+		if (thisconn &&	(tcph_check->source != tcph->source ||
+				 tcph_check->dest != tcph->dest)) {
+			thisconn = false;
+		}
+
+		/* a candidate must not acknowledge more data than the
+		 * triggering ACK does; skip candidates with a newer ack_seq.
+		 */
+		if (thisconn &&
+		    (ntohl(tcph_check->ack_seq) > ntohl(tcph->ack_seq)))
+			continue;
+
+		/* DupACKs with an equal sequence number shouldn't be filtered,
+		 * but we can filter if the triggering packet is a SACK
+		 */
+		if (thisconn &&
+		    (ntohl(tcph_check->ack_seq) == ntohl(tcph->ack_seq))) {
+			/* inspired by tcp_parse_options in tcp_input.c */
+			bool sack = false;
+			int length = (tcph->doff * 4) - sizeof(struct tcphdr);
+			const u8 *ptr = (const u8 *)(tcph + 1);
+
+			while (length > 0) {
+				int opcode = *ptr++;
+				int opsize;
+
+				if (opcode == TCPOPT_EOL)
+					break;
+				if (opcode == TCPOPT_NOP) {
+					length--;
+					continue;
+				}
+				opsize = *ptr++;
+				if (opsize < 2 || opsize > length)
+					break;
+				if (opcode == TCPOPT_SACK) {
+					sack = true;
+					break;
+				}
+				ptr += opsize - 2;
+				length -= opsize;
+			}
+			if (!sack)
+				continue;
+		}
+
+		/* somewhat complicated control flow for 'conservative'
+		 * ACK filtering that aims to be more polite to slow-start and
+		 * in the presence of packet loss.
+		 * does not filter if there is one 'redundant' ACK in the queue.
+		 * 'data' ACKs won't be filtered but do count as redundant ACKs.
+		 */
+		if (thisconn) {
+			thisconn_seen_last = true;
+			/* if aggressive and this is a data ack we can skip
+			 * checking it next time.
+			 */
+			thisconn_checked_to = (aggressive && !pure_ack) ?
+				skb_check : skb_check_prev;
+			/* the first pure ack for this connection.
+			 * record where it is, but only break if aggressive
+			 * or already seen data ack from the same connection
+			 */
+			if (pure_ack && !thisconn_ack) {
+				thisconn_ack = skb_check_prev;
+				if (aggressive || thisconn_redundant_seen)
+					break;
+			/* data ack or subsequent pure ack */
+			} else {
+				thisconn_redundant_seen = true;
+				/* this is the second ack for this connection
+				 * break to filter the first pure ack
+				 */
+				if (thisconn_ack)
+					break;
+			}
+		/* track packets from non-matching tcp connections that will
+		 * need evaluation on the next run.
+		 * if there are packets from both the matching connection and
+		 * others that require checking next run, track which was updated
+		 * last and return the older of the two to ensure full coverage.
+		 * if a non-matching pure ack has been seen, cannot skip any
+		 * further on the next run so don't update.
+		 */
+		} else if (!otherconn_ack_seen) {
+			thisconn_seen_last = false;
+			if (pure_ack) {
+				otherconn_ack_seen = true;
+				/* if aggressive we don't care about old data,
+				 * start from the pure ack.
+				 * otherwise if there is a previous data ack,
+				 * start checking from it next time.
+				 */
+				if (aggressive || !otherconn_checked_to)
+					otherconn_checked_to = skb_check_prev;
+			} else {
+				otherconn_checked_to = aggressive ?
+					skb_check : skb_check_prev;
+			}
+		}
+	}
+
+	/* skb_check is reused at this point
+	 * it is the pure ACK to be filtered (if any)
+	 */
+	skb_check = NULL;
+
+	/* Next time, resume checking from whichever unfiltered-but-relevant
+	 * packet (this connection's or another's) is nearer the head of the
+	 * queue; if none was seen, start after the last packet evaluated.
+	 */
+	if (thisconn_checked_to && otherconn_checked_to)
+		flow->ackcheck = thisconn_seen_last ?
+			otherconn_checked_to : thisconn_checked_to;
+	else if (thisconn_checked_to)
+		flow->ackcheck = thisconn_checked_to;
+	else if (otherconn_checked_to)
+		flow->ackcheck = otherconn_checked_to;
+	else
+		flow->ackcheck = skb_check_prev;
+
+	/* if filtering, remove the pure ACK from the flow queue */
+	if (thisconn_ack && (aggressive || thisconn_redundant_seen)) {
+		if (PTR_ERR(thisconn_ack) == -1) {
+			skb_check = flow->head;
+			flow->head = flow->head->next;
+		} else {
+			skb_check = thisconn_ack->next;
+			thisconn_ack->next = thisconn_ack->next->next;
+		}
+	}
+
+	/* we just filtered that ack, fix up the list */
+	if (flow->ackcheck == skb_check)
+		flow->ackcheck = thisconn_ack;
+	/* check the entire flow queue next time */
+	if (PTR_ERR(flow->ackcheck) == -1)
+		flow->ackcheck = NULL;
+
+	return skb_check;
+}
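The mask arithmetic described in the "stricter criteria" comment inside cake_ack_filter() can be verified in isolation: 0x0FFF0000 covers all twelve flag bits of the TCP flag word, 0x01C00000 is the ignorable NS/CWR/ECE subset, and a "pure" ACK must leave only the ACK bit set under the remaining mask. A standalone sketch using host-order constants (the kernel code operates on the big-endian flag word via cpu_to_be32):

```c
#include <stdint.h>

#define TCP_ALL_FLAGS	0x0FFF0000u	/* reserved + NS + CWR..FIN */
#define TCP_NS_CWR_ECE	0x01C00000u	/* safe to ignore when filtering */
#define TCP_ACK_BIT	0x00100000u

/* A pure ACK has only the ACK bit set once NS/CWR/ECE are masked out. */
static int is_pure_ack_word(uint32_t flag_word)
{
	return (flag_word & (TCP_ALL_FLAGS & ~TCP_NS_CWR_ECE)) == TCP_ACK_BIT;
}
```

An ACK carrying ECE (0x00500000) still qualifies as pure, while ACK|SYN (0x00120000) does not.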
+
+static inline u32 cake_overhead(struct cake_sched_data *q, u32 in)
+{
+	u32 out = in + q->rate_overhead;
+
+	if (q->rate_mpu && out < q->rate_mpu)
+		out = q->rate_mpu;
+
+	if (q->rate_flags & CAKE_FLAG_ATM) {
+		out += 47;
+		out /= 48;
+		out *= 53;
+	} else if (q->rate_flags & CAKE_FLAG_PTM) {
+		/* Add one byte per 64 bytes or part thereof.
+		 * This is conservative and easier to calculate than the
+		 * precise value.
+		 */
+		out += (out / 64) + !!(out % 64);
+	}
+
+	return out;
+}
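cake_overhead() compensates for link-layer framing: ATM links carry payload in 48-byte cells occupying 53 bytes on the wire, so sizes round up to whole cells, while PTM's 64/65 encoding is approximated conservatively as one extra byte per started 64-byte block. The same arithmetic as a standalone sketch:

```c
#include <stdint.h>

/* Round up to whole 48-byte ATM cells, 53 wire bytes each. */
static uint32_t atm_wire_len(uint32_t len)
{
	return ((len + 47) / 48) * 53;
}

/* One extra byte per 64 bytes or part thereof (conservative 64/65). */
static uint32_t ptm_wire_len(uint32_t len)
{
	return len + len / 64 + !!(len % 64);
}
```

A 1500-byte frame needs 32 cells on ATM (1696 wire bytes), a 13 % inflation the shaper would otherwise under-account.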
+
+static inline cobalt_time_t cake_ewma(cobalt_time_t avg, cobalt_time_t sample,
+				      u32 shift)
+{
+	avg -= avg >> shift;
+	avg += sample >> shift;
+	return avg;
+}
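cake_ewma() is the standard shift-based exponentially weighted moving average: each update moves avg toward sample by a factor of 2^-shift, so shift 2 weights a new sample at 1/4 and shift 8 at 1/256 (the fast-attack / slow-decay pair used by the bandwidth estimator below). A standalone copy with worked values:

```c
#include <stdint.h>

/* avg += (sample - avg) / 2^shift, in pure integer arithmetic. */
static uint64_t ewma(uint64_t avg, uint64_t sample, unsigned int shift)
{
	avg -= avg >> shift;
	avg += sample >> shift;
	return avg;
}
```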
+
+static inline void cake_heap_swap(struct cake_sched_data *q, u16 i, u16 j)
+{
+	struct cake_heap_entry ii = q->overflow_heap[i];
+	struct cake_heap_entry jj = q->overflow_heap[j];
+
+	q->overflow_heap[i] = jj;
+	q->overflow_heap[j] = ii;
+
+	q->tins[ii.t].overflow_idx[ii.b] = j;
+	q->tins[jj.t].overflow_idx[jj.b] = i;
+}
+
+static inline u32 cake_heap_get_backlog(const struct cake_sched_data *q, u16 i)
+{
+	struct cake_heap_entry ii = q->overflow_heap[i];
+
+	return q->tins[ii.t].backlogs[ii.b];
+}
+
+static void cake_heapify(struct cake_sched_data *q, u16 i)
+{
+	static const u32 a = CAKE_MAX_TINS * CAKE_QUEUES;
+	u32 m = i;
+	u32 mb = cake_heap_get_backlog(q, m);
+
+	while (m < a) {
+		u32 l = m + m + 1;
+		u32 r = l + 1;
+
+		if (l < a) {
+			u32 lb = cake_heap_get_backlog(q, l);
+
+			if (lb > mb) {
+				m  = l;
+				mb = lb;
+			}
+		}
+
+		if (r < a) {
+			u32 rb = cake_heap_get_backlog(q, r);
+
+			if (rb > mb) {
+				m  = r;
+				mb = rb;
+			}
+		}
+
+		if (m != i) {
+			cake_heap_swap(q, i, m);
+			i = m;
+		} else {
+			break;
+		}
+	}
+}
+
+static void cake_heapify_up(struct cake_sched_data *q, u16 i)
+{
+	while (i > 0 && i < CAKE_MAX_TINS * CAKE_QUEUES) {
+		u16 p = (i - 1) >> 1;
+		u32 ib = cake_heap_get_backlog(q, i);
+		u32 pb = cake_heap_get_backlog(q, p);
+
+		if (ib > pb) {
+			cake_heap_swap(q, i, p);
+			i = p;
+		} else {
+			break;
+		}
+	}
+}
+
+static int cake_advance_shaper(struct cake_sched_data *q,
+			       struct cake_tin_data *b,
+			       u32 len, u64 now, bool drop)
+{
+	s64 tdiff1, tdiff2, tdiff3, tdiff4;
+	/* charge packet bandwidth to this tin
+	 * and to the global shaper.
+	 */
+	if (q->rate_ns) {
+		len = cake_overhead(q, len);
+		tdiff1 = b->tin_time_next_packet - now;
+		tdiff2 = (len * (u64)b->tin_rate_ns) >> b->tin_rate_shft;
+		tdiff3 = (len * (u64)q->rate_ns) >> q->rate_shft;
+		tdiff4 = tdiff3 + (tdiff3 >> 1);
+
+		if (tdiff1 < 0)
+			b->tin_time_next_packet += tdiff2;
+		else if (tdiff1 < tdiff2)
+			b->tin_time_next_packet = now + tdiff2;
+
+		q->time_next_packet += tdiff3;
+		if (!drop)
+			q->failsafe_next_packet += tdiff4;
+	}
+	return len;
+}
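The shaper arithmetic in cake_advance_shaper() appears to store nanoseconds-per-byte scaled up by 2^rate_shft for precision, so a packet's serialization time is (len * rate_ns) >> rate_shft; the failsafe deadline is then advanced by 1.5x that. A sketch of the core computation, with illustrative values not taken from the patch's rate-setup code:

```c
#include <stdint.h>

/* Serialization time of len bytes given a fixed-point ns-per-byte rate. */
static uint64_t tx_time_ns(uint32_t len, uint64_t rate_ns,
			   unsigned int rate_shft)
{
	return ((uint64_t)len * rate_ns) >> rate_shft;
}
```

At 20 Mbit/s (400 ns/byte, so rate_ns = 400 << 4 with rate_shft = 4), a 1500-byte packet occupies the link for 600 us, and the failsafe would advance by 900 us.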
+
+static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	u32 idx = 0, tin = 0, len;
+	struct cake_tin_data *b;
+	struct cake_flow *flow;
+	struct cake_heap_entry qq;
+	u64 now = cobalt_get_time();
+
+	if (!q->overflow_timeout) {
+		int i;
+		/* Build fresh max-heap */
+		for (i = CAKE_MAX_TINS * CAKE_QUEUES / 2; i >= 0; i--)
+			cake_heapify(q, i);
+	}
+	q->overflow_timeout = 65535;
+
+	/* select longest queue for pruning */
+	qq  = q->overflow_heap[0];
+	tin = qq.t;
+	idx = qq.b;
+
+	b = &q->tins[tin];
+	flow = &b->flows[idx];
+	skb = dequeue_head(flow);
+	if (unlikely(!skb)) {
+		/* heap has gone wrong, rebuild it next time */
+		q->overflow_timeout = 0;
+		return idx + (tin << 16);
+	}
+
+	if (cobalt_queue_full(&flow->cvars, &b->cparams, now))
+		b->unresponsive_flow_count++;
+
+	len = qdisc_pkt_len(skb);
+	q->buffer_used      -= skb->truesize;
+	b->backlogs[idx]    -= len;
+	b->tin_backlog      -= len;
+	sch->qstats.backlog -= len;
+	qdisc_tree_reduce_backlog(sch, 1, len);
+
+	b->tin_dropped++;
+	sch->qstats.drops++;
+
+	if (q->rate_flags & CAKE_FLAG_INGRESS)
+		cake_advance_shaper(q, b, len, now, true);
+
+	__qdisc_drop(skb, to_free);
+	sch->q.qlen--;
+	cake_heapify(q, 0);
+
+	return idx + (tin << 16);
+}
+
+static inline void cake_wash_diffserv(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	case htons(ETH_P_IPV6):
+		ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	default:
+		break;
+	}
+}
+
+static inline u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
+{
+	u8 dscp;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		dscp = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_IPV6):
+		dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_ARP):
+		return 0x38;  /* CS7 - Net Control */
+
+	default:
+		/* If there is no Diffserv field, treat as best-effort */
+		return 0;
+	}
+}
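The two functions above both rely on the Diffserv layout of the IPv4 TOS / IPv6 traffic-class byte: the DSCP occupies the top six bits and the bottom two are ECN, which washing must preserve (hence clearing through INET_ECN_MASK rather than the whole byte). A standalone sketch of that bit split:

```c
#include <stdint.h>

/* DSCP is the top six bits of the TOS/traffic-class byte. */
static uint8_t dscp_from_tos(uint8_t tos)
{
	return tos >> 2;
}

/* Washing zeroes the DSCP but must keep the two ECN bits intact. */
static uint8_t wash_tos(uint8_t tos)
{
	return tos & 0x03;
}
```

For example, TOS 0xB8 is DSCP 46 (EF), and washing TOS 0xBA leaves only the ECN value 2 (ECT(0)).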
+
+static void cake_reconfigure(struct Qdisc *sch);
+
+static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			struct sk_buff **to_free)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 idx, tin;
+	struct cake_tin_data *b;
+	struct cake_flow *flow;
+	/* signed len to handle corner case filtered ACK larger than trigger */
+	int len = qdisc_pkt_len(skb);
+	u64 now = cobalt_get_time();
+	struct sk_buff *ack = NULL;
+
+	/* extract the Diffserv Precedence field, if it exists */
+	/* and clear DSCP bits if washing */
+	if (q->tin_mode != CAKE_MODE_BESTEFFORT) {
+		tin = q->tin_index[cake_handle_diffserv(skb,
+				q->rate_flags & CAKE_FLAG_WASH)];
+		if (unlikely(tin >= q->tin_cnt))
+			tin = 0;
+	} else {
+		tin = 0;
+		if (q->rate_flags & CAKE_FLAG_WASH)
+			cake_wash_diffserv(skb);
+	}
+
+	b = &q->tins[tin];
+
+	/* choose flow to insert into */
+	idx = cake_hash(b, skb, q->flow_mode);
+	flow = &b->flows[idx];
+
+	/* ensure shaper state isn't stale */
+	if (!b->tin_backlog) {
+		if (b->tin_time_next_packet < now)
+			b->tin_time_next_packet = now;
+
+		if (!sch->q.qlen) {
+			if (q->time_next_packet < now) {
+				q->failsafe_next_packet = now;
+				q->time_next_packet = now;
+			} else if (q->time_next_packet > now &&
+				   q->failsafe_next_packet > now) {
+				u64 next = min(q->time_next_packet,
+					       q->failsafe_next_packet);
+				sch->qstats.overlimits++;
+				qdisc_watchdog_schedule_ns(&q->watchdog, next);
+			}
+		}
+	}
+
+	if (unlikely(len > b->max_skblen))
+		b->max_skblen = len;
+
+	/* Split GSO aggregates if they're likely to impair flow isolation
+	 * or if we need to know individual packet sizes for framing overhead.
+	 */
+
+	if (skb_is_gso(skb)) {
+		struct sk_buff *segs, *nskb;
+		netdev_features_t features = netif_skb_features(skb);
+		/* signed slen to handle corner case
+		 * suppressed ACK larger than trigger
+		 */
+		int slen = 0;
+
+		segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+		if (IS_ERR_OR_NULL(segs))
+			return qdisc_drop(skb, sch, to_free);
+
+		while (segs) {
+			nskb = segs->next;
+			segs->next = NULL;
+			qdisc_skb_cb(segs)->pkt_len = segs->len;
+			cobalt_set_enqueue_time(segs, now);
+			flow_queue_add(flow, segs);
+
+			if (q->rate_flags & CAKE_FLAG_ACK_FILTER)
+				ack = cake_ack_filter(q, flow);
+
+			if (ack) {
+				b->ack_drops++;
+				sch->qstats.drops++;
+				b->bytes += ack->len;
+				slen += segs->len - ack->len;
+				q->buffer_used += segs->truesize -
+					ack->truesize;
+				if (q->rate_flags & CAKE_FLAG_INGRESS)
+					cake_advance_shaper(q, b,
+							    qdisc_pkt_len(ack),
+							    now, true);
+
+				qdisc_tree_reduce_backlog(sch, 1,
+							  qdisc_pkt_len(ack));
+				consume_skb(ack);
+			} else {
+				sch->q.qlen++;
+				slen += segs->len;
+				q->buffer_used += segs->truesize;
+			}
+			b->packets++;
+			segs = nskb;
+		}
+		/* stats */
+		b->bytes	    += slen;
+		b->backlogs[idx]    += slen;
+		b->tin_backlog      += slen;
+		sch->qstats.backlog += slen;
+		q->avg_window_bytes += slen;
+
+		qdisc_tree_reduce_backlog(sch, 1, len);
+		consume_skb(skb);
+	} else {
+		/* not splitting */
+		cobalt_set_enqueue_time(skb, now);
+		flow_queue_add(flow, skb);
+
+		if (q->rate_flags & CAKE_FLAG_ACK_FILTER)
+			ack = cake_ack_filter(q, flow);
+
+		if (ack) {
+			b->ack_drops++;
+			sch->qstats.drops++;
+			b->bytes += qdisc_pkt_len(ack);
+			len -= qdisc_pkt_len(ack);
+			q->buffer_used += skb->truesize - ack->truesize;
+			if (q->rate_flags & CAKE_FLAG_INGRESS)
+				cake_advance_shaper(q, b, qdisc_pkt_len(ack),
+						    now, true);
+
+			qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
+			consume_skb(ack);
+		} else {
+			sch->q.qlen++;
+			q->buffer_used      += skb->truesize;
+		}
+		/* stats */
+		b->packets++;
+		b->bytes	    += len;
+		b->backlogs[idx]    += len;
+		b->tin_backlog      += len;
+		sch->qstats.backlog += len;
+		q->avg_window_bytes += len;
+	}
+
+	if (q->overflow_timeout)
+		cake_heapify_up(q, b->overflow_idx[idx]);
+
+	/* incoming bandwidth capacity estimate */
+	if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS) {
+		u64 packet_interval = now - q->last_packet_time;
+
+		if (packet_interval > NSEC_PER_SEC)
+			packet_interval = NSEC_PER_SEC;
+
+		/* filter out short-term bursts, eg. wifi aggregation */
+		q->avg_packet_interval = cake_ewma(q->avg_packet_interval,
+						   packet_interval,
+			packet_interval > q->avg_packet_interval ? 2 : 8);
+
+		q->last_packet_time = now;
+
+		if (packet_interval > q->avg_packet_interval) {
+			u64 window_interval = now - q->avg_window_begin;
+			u64 bps = q->avg_window_bytes * (u64)NSEC_PER_SEC;
+
+			do_div(bps, window_interval);
+			q->avg_peak_bandwidth =
+				cake_ewma(q->avg_peak_bandwidth, bps,
+					  bps > q->avg_peak_bandwidth ? 2 : 8);
+			q->avg_window_bytes = 0;
+			q->avg_window_begin = now;
+
+			if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS &&
+			    now - q->last_reconfig_time >
+				(NSEC_PER_SEC / 4)) {
+				q->rate_bps = (q->avg_peak_bandwidth * 15) >> 4;
+				cake_reconfigure(sch);
+			}
+		}
+	} else {
+		q->avg_window_bytes = 0;
+		q->last_packet_time = now;
+	}
+
+	/* flowchain */
+	if (!flow->set || flow->set == CAKE_SET_DECAYING) {
+		struct cake_host *srchost = &b->hosts[flow->srchost];
+		struct cake_host *dsthost = &b->hosts[flow->dsthost];
+		u16 host_load = 1;
+
+		if (!flow->set) {
+			list_add_tail(&flow->flowchain, &b->new_flows);
+		} else {
+			b->decaying_flow_count--;
+			list_move_tail(&flow->flowchain, &b->new_flows);
+		}
+		flow->set = CAKE_SET_SPARSE;
+		b->sparse_flow_count++;
+
+		if (cake_dsrc(q->flow_mode))
+			host_load = max(host_load, srchost->srchost_refcnt);
+
+		if (cake_ddst(q->flow_mode))
+			host_load = max(host_load, dsthost->dsthost_refcnt);
+
+		flow->deficit = (b->flow_quantum * quantum_div[host_load]) >> 16;
+	} else if (flow->set == CAKE_SET_SPARSE_WAIT) {
+		/* this flow was empty, accounted as a sparse flow, but actually
+		 * in the bulk rotation.
+		 */
+		flow->set = CAKE_SET_BULK;
+		b->sparse_flow_count--;
+		b->bulk_flow_count++;
+	}
+
+	if (q->buffer_used > q->buffer_max_used)
+		q->buffer_max_used = q->buffer_used;
+
+	if (q->buffer_used > q->buffer_limit) {
+		u32 dropped = 0;
+
+		while (q->buffer_used > q->buffer_limit) {
+			dropped++;
+			cake_drop(sch, to_free);
+		}
+		b->drop_overlimit += dropped;
+	}
+	return NET_XMIT_SUCCESS;
+}
+
+static struct sk_buff *cake_dequeue_one(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct cake_tin_data *b = &q->tins[q->cur_tin];
+	struct cake_flow *flow = &b->flows[q->cur_flow];
+	struct sk_buff *skb = NULL;
+	u32 len;
+
+	/* WARN_ON(flow != container_of(vars, struct cake_flow, cvars)); */
+
+	if (flow->head) {
+		skb = dequeue_head(flow);
+		len = qdisc_pkt_len(skb);
+		b->backlogs[q->cur_flow] -= len;
+		b->tin_backlog		 -= len;
+		sch->qstats.backlog      -= len;
+		q->buffer_used		 -= skb->truesize;
+		sch->q.qlen--;
+
+		if (q->overflow_timeout)
+			cake_heapify(q, b->overflow_idx[q->cur_flow]);
+	}
+	return skb;
+}
+
+/* Discard leftover packets from a tin no longer in use. */
+static void cake_clear_tin(struct Qdisc *sch, u16 tin)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	q->cur_tin = tin;
+	for (q->cur_flow = 0; q->cur_flow < CAKE_QUEUES; q->cur_flow++)
+		while ((skb = cake_dequeue_one(sch)))
+			kfree_skb(skb);
+}
+
+static struct sk_buff *cake_dequeue(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	struct cake_tin_data *b = &q->tins[q->cur_tin];
+	struct cake_flow *flow;
+	struct cake_host *srchost, *dsthost;
+	struct list_head *head;
+	u32 len;
+	u16 host_load;
+	cobalt_time_t now = cobalt_get_time();
+	cobalt_time_t delay;
+	bool first_flow = true;
+
+begin:
+	if (!sch->q.qlen)
+		return NULL;
+
+	/* global hard shaper */
+	if (q->time_next_packet > now && q->failsafe_next_packet > now) {
+		u64 next = min(q->time_next_packet, q->failsafe_next_packet);
+		sch->qstats.overlimits++;
+		qdisc_watchdog_schedule_ns(&q->watchdog, next);
+		return NULL;
+	}
+
+	/* Choose a class to work on. */
+	if (!q->rate_ns) {
+		/* In unlimited mode, can't rely on shaper timings, just balance
+		 * with DRR
+		 */
+		while (b->tin_deficit < 0 ||
+		       !(b->sparse_flow_count + b->bulk_flow_count)) {
+			if (b->tin_deficit <= 0)
+				b->tin_deficit += b->tin_quantum_band;
+
+			q->cur_tin++;
+			b++;
+			if (q->cur_tin >= q->tin_cnt) {
+				q->cur_tin = 0;
+				b = q->tins;
+			}
+		}
+	} else {
+		/* In shaped mode, choose:
+		 * - Highest-priority tin with queue and meeting schedule, or
+		 * - The earliest-scheduled tin with queue.
+		 */
+		int tin, best_tin = 0;
+		s64 best_time = 0xFFFFFFFFFFFFUL;
+
+		for (tin = 0; tin < q->tin_cnt; tin++) {
+			b = q->tins + tin;
+			if ((b->sparse_flow_count + b->bulk_flow_count) > 0) {
+				s64 tdiff = b->tin_time_next_packet - now;
+
+				if (tdiff <= 0 || tdiff <= best_time) {
+					best_time = tdiff;
+					best_tin = tin;
+				}
+			}
+		}
+
+		q->cur_tin = best_tin;
+		b = q->tins + best_tin;
+	}
+
+retry:
+	/* service this class */
+	head = &b->decaying_flows;
+	if (!first_flow || list_empty(head)) {
+		head = &b->new_flows;
+		if (list_empty(head)) {
+			head = &b->old_flows;
+			if (unlikely(list_empty(head))) {
+				head = &b->decaying_flows;
+				if (unlikely(list_empty(head)))
+					goto begin;
+			}
+		}
+	}
+	flow = list_first_entry(head, struct cake_flow, flowchain);
+	q->cur_flow = flow - b->flows;
+	first_flow = false;
+
+	/* triple isolation (modified DRR++) */
+	srchost = &b->hosts[flow->srchost];
+	dsthost = &b->hosts[flow->dsthost];
+	host_load = 1;
+
+	if (cake_dsrc(q->flow_mode))
+		host_load = max(host_load, srchost->srchost_refcnt);
+
+	if (cake_ddst(q->flow_mode))
+		host_load = max(host_load, dsthost->dsthost_refcnt);
+
+	WARN_ON(host_load > CAKE_QUEUES);
+
+	/* flow isolation (DRR++) */
+	if (flow->deficit <= 0) {
+		flow->deficit += (b->flow_quantum * quantum_div[host_load] +
+				  (prandom_u32() >> 16)) >> 16;
+		list_move_tail(&flow->flowchain, &b->old_flows);
+
+		/* Keep all flows with deficits out of the sparse and decaying
+		 * rotations.  No non-empty flow can go into the decaying
+		 * rotation, so they can't get deficits
+		 */
+		if (flow->set == CAKE_SET_SPARSE) {
+			if (flow->head) {
+				b->sparse_flow_count--;
+				b->bulk_flow_count++;
+				flow->set = CAKE_SET_BULK;
+			} else {
+				/* we've moved it to the bulk rotation for
+				 * correct deficit accounting but we still want
+				 * to count it as a sparse flow, not a bulk one.
+				 */
+				flow->set = CAKE_SET_SPARSE_WAIT;
+			}
+		}
+		goto retry;
+	}
+
+	/* Retrieve a packet via the AQM */
+	while (1) {
+		skb = cake_dequeue_one(sch);
+		if (!skb) {
+			/* this queue was actually empty */
+			if (cobalt_queue_empty(&flow->cvars, &b->cparams, now))
+				b->unresponsive_flow_count--;
+
+			if (flow->cvars.p_drop || flow->cvars.count ||
+			    (now - flow->cvars.drop_next) < 0) {
+				/* keep in the flowchain until the state has
+				 * decayed to rest
+				 */
+				list_move_tail(&flow->flowchain,
+					       &b->decaying_flows);
+				if (flow->set == CAKE_SET_BULK) {
+					b->bulk_flow_count--;
+					b->decaying_flow_count++;
+				} else if (flow->set == CAKE_SET_SPARSE ||
+					   flow->set == CAKE_SET_SPARSE_WAIT) {
+					b->sparse_flow_count--;
+					b->decaying_flow_count++;
+				}
+				flow->set = CAKE_SET_DECAYING;
+			} else {
+				/* remove empty queue from the flowchain */
+				list_del_init(&flow->flowchain);
+				if (flow->set == CAKE_SET_SPARSE ||
+				    flow->set == CAKE_SET_SPARSE_WAIT)
+					b->sparse_flow_count--;
+				else if (flow->set == CAKE_SET_BULK)
+					b->bulk_flow_count--;
+				else
+					b->decaying_flow_count--;
+
+				flow->set = CAKE_SET_NONE;
+				srchost->srchost_refcnt--;
+				dsthost->dsthost_refcnt--;
+			}
+			goto begin;
+		}
+
+		/* Last packet in queue may be marked, shouldn't be dropped */
+		if (!cobalt_should_drop(&flow->cvars, &b->cparams, now, skb) ||
+		    !flow->head)
+			break;
+
+		/* drop this packet, get another one */
+		if (q->rate_flags & CAKE_FLAG_INGRESS) {
+			len = cake_advance_shaper(q, b,
+						  qdisc_pkt_len(skb),
+						  now, true);
+			flow->deficit -= len;
+			b->tin_deficit -= len;
+		}
+		b->tin_dropped++;
+		qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
+		qdisc_qstats_drop(sch);
+		kfree_skb(skb);
+		if (q->rate_flags & CAKE_FLAG_INGRESS)
+			goto retry;
+	}
+
+	b->tin_ecn_mark += !!flow->cvars.ecn_marked;
+	qdisc_bstats_update(sch, skb);
+
+	len = cake_overhead(q, qdisc_pkt_len(skb));
+	flow->deficit -= len;
+	b->tin_deficit -= len;
+
+	/* collect delay stats */
+	delay = now - cobalt_get_enqueue_time(skb);
+	b->avge_delay = cake_ewma(b->avge_delay, delay, 8);
+	b->peak_delay = cake_ewma(b->peak_delay, delay,
+				  delay > b->peak_delay ? 2 : 8);
+	b->base_delay = cake_ewma(b->base_delay, delay,
+				  delay < b->base_delay ? 2 : 8);
+
+	cake_advance_shaper(q, b, len, now, false);
+	if (q->time_next_packet > now && sch->q.qlen) {
+		u64 next = min(q->time_next_packet, q->failsafe_next_packet);
+		qdisc_watchdog_schedule_ns(&q->watchdog, next);
+	} else if (!sch->q.qlen) {
+		int i;
+
+		for (i = 0; i < q->tin_cnt; i++) {
+			if (q->tins[i].decaying_flow_count) {
+				u64 next = now + q->tins[i].cparams.target;
+				qdisc_watchdog_schedule_ns(&q->watchdog, next);
+				break;
+			}
+		}
+	}
+
+	if (q->overflow_timeout)
+		q->overflow_timeout--;
+
+	return skb;
+}
+
+static void cake_reset(struct Qdisc *sch)
+{
+	u32 c;
+
+	for (c = 0; c < CAKE_MAX_TINS; c++)
+		cake_clear_tin(sch, c);
+}
+
+static const struct nla_policy cake_policy[TCA_CAKE_MAX + 1] = {
+	[TCA_CAKE_BASE_RATE]     = { .type = NLA_U32 },
+	[TCA_CAKE_DIFFSERV_MODE] = { .type = NLA_U32 },
+	[TCA_CAKE_ATM]		 = { .type = NLA_U32 },
+	[TCA_CAKE_FLOW_MODE]     = { .type = NLA_U32 },
+	[TCA_CAKE_OVERHEAD]      = { .type = NLA_S32 },
+	[TCA_CAKE_RTT]		 = { .type = NLA_U32 },
+	[TCA_CAKE_TARGET]	 = { .type = NLA_U32 },
+	[TCA_CAKE_AUTORATE]      = { .type = NLA_U32 },
+	[TCA_CAKE_MEMORY]	 = { .type = NLA_U32 },
+	[TCA_CAKE_NAT]		 = { .type = NLA_U32 },
+	[TCA_CAKE_ETHERNET]      = { .type = NLA_U32 },
+	[TCA_CAKE_WASH]		 = { .type = NLA_U32 },
+	[TCA_CAKE_MPU]		 = { .type = NLA_U32 },
+	[TCA_CAKE_INGRESS]	 = { .type = NLA_U32 },
+	[TCA_CAKE_ACK_FILTER]	 = { .type = NLA_U32 },
+};
+
+static void cake_set_rate(struct cake_tin_data *b, u64 rate, u32 mtu,
+			  cobalt_time_t ns_target, cobalt_time_t rtt_est_ns)
+{
+	/* convert byte-rate into time-per-byte
+	 * so it will always unwedge in reasonable time.
+	 */
+	static const u64 MIN_RATE = 64;
+	u64 rate_ns = 0;
+	u8  rate_shft = 0;
+	cobalt_time_t byte_target_ns;
+	u32 byte_target = mtu + (mtu >> 1);
+
+	b->flow_quantum = 1514;
+	if (rate) {
+		b->flow_quantum = max(min(rate >> 12, 1514ULL), 300ULL);
+		rate_shft = 32;
+		rate_ns = ((u64)NSEC_PER_SEC) << rate_shft;
+		do_div(rate_ns, max(MIN_RATE, rate));
+		while (!!(rate_ns >> 32)) {
+			rate_ns >>= 1;
+			rate_shft--;
+		}
+	} /* else unlimited, ie. zero delay */
+
+	b->tin_rate_bps  = rate;
+	b->tin_rate_ns   = rate_ns;
+	b->tin_rate_shft = rate_shft;
+
+	byte_target_ns = (byte_target * rate_ns) >> rate_shft;
+
+	b->cparams.target = max(byte_target_ns, ns_target);
+	b->cparams.interval = max(rtt_est_ns +
+				     b->cparams.target - ns_target,
+				     b->cparams.target * 2);
+	b->cparams.p_inc = 1 << 24; /* 1/256 */
+	b->cparams.p_dec = 1 << 20; /* 1/4096 */
+}
+
+static int cake_config_besteffort(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct cake_tin_data *b = &q->tins[0];
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+
+	q->tin_cnt = 1;
+
+	q->tin_index = besteffort;
+	q->tin_order = normal_order;
+
+	cake_set_rate(b, rate, mtu, US2TIME(q->target), US2TIME(q->interval));
+	b->tin_quantum_band = 65535;
+	b->tin_quantum_prio = 65535;
+
+	return 0;
+}
+
+static int cake_config_precedence(struct Qdisc *sch)
+{
+	/* convert high-level (user visible) parameters into internal format */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+	q->tin_index = precedence;
+	q->tin_order = normal_order;
+
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, US2TIME(q->target),
+			      US2TIME(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+/*	List of known Diffserv codepoints:
+ *
+ *	Least Effort (CS1)
+ *	Best Effort (CS0)
+ *	Max Reliability & LLT "Lo" (TOS1)
+ *	Max Throughput (TOS2)
+ *	Min Delay (TOS4)
+ *	LLT "La" (TOS5)
+ *	Assured Forwarding 1 (AF1x) - x3
+ *	Assured Forwarding 2 (AF2x) - x3
+ *	Assured Forwarding 3 (AF3x) - x3
+ *	Assured Forwarding 4 (AF4x) - x3
+ *	Precedence Class 2 (CS2)
+ *	Precedence Class 3 (CS3)
+ *	Precedence Class 4 (CS4)
+ *	Precedence Class 5 (CS5)
+ *	Precedence Class 6 (CS6)
+ *	Precedence Class 7 (CS7)
+ *	Voice Admit (VA)
+ *	Expedited Forwarding (EF)
+ *
+ *	Total 25 codepoints.
+ */
+
+/*	List of traffic classes in RFC 4594:
+ *		(roughly descending order of contended priority)
+ *		(roughly ascending order of uncontended throughput)
+ *
+ *	Network Control (CS6,CS7)      - routing traffic
+ *	Telephony (EF,VA)         - aka. VoIP streams
+ *	Signalling (CS5)               - VoIP setup
+ *	Multimedia Conferencing (AF4x) - aka. video calls
+ *	Realtime Interactive (CS4)     - eg. games
+ *	Multimedia Streaming (AF3x)    - eg. YouTube, NetFlix, Twitch
+ *	Broadcast Video (CS3)
+ *	Low Latency Data (AF2x,TOS4)      - eg. database
+ *	Ops, Admin, Management (CS2,TOS1) - eg. ssh
+ *	Standard Service (CS0 & unrecognised codepoints)
+ *	High Throughput Data (AF1x,TOS2)  - eg. web traffic
+ *	Low Priority Data (CS1)           - eg. BitTorrent
+ *
+ *	Total 12 traffic classes.
+ */
+
+static int cake_config_diffserv8(struct Qdisc *sch)
+{
+/*	Pruned list of traffic classes for typical applications:
+ *
+ *		Network Control          (CS6, CS7)
+ *		Minimum Latency          (EF, VA, CS5, CS4)
+ *		Interactive Shell        (CS2, TOS1)
+ *		Low Latency Transactions (AF2x, TOS4)
+ *		Video Streaming          (AF4x, AF3x, CS3)
+ *		Bog Standard             (CS0 etc.)
+ *		High Throughput          (AF1x, TOS2)
+ *		Background Traffic       (CS1)
+ *
+ *		Total 8 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv8;
+	q->tin_order = normal_order;
+
+	/* class characteristics */
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, US2TIME(q->target),
+			      US2TIME(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+static int cake_config_diffserv4(struct Qdisc *sch)
+{
+/*  Further pruned list of traffic classes for four-class system:
+ *
+ *	    Latency Sensitive  (CS7, CS6, EF, VA, CS5, CS4)
+ *	    Streaming Media    (AF4x, AF3x, CS3, AF2x, TOS4, CS2, TOS1)
+ *	    Best Effort        (CS0, AF1x, TOS2, and those not specified)
+ *	    Background Traffic (CS1)
+ *
+ *		Total 4 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum = 1024;
+
+	q->tin_cnt = 4;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv4;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 1, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[3], rate >> 2, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 2;
+	q->tins[3].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 1;
+	q->tins[3].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static int cake_config_diffserv3(struct Qdisc *sch)
+{
+/*  Simplified Diffserv structure with 3 tins.
+ *		Low Priority		(CS1)
+ *		Best Effort
+ *		Latency Sensitive	(TOS4, VA, EF, CS6, CS7)
+ */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum = 1024;
+
+	q->tin_cnt = 3;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv3;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 2, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static int cake_config_diffserv_llt(struct Qdisc *sch)
+{
+/*  Diffserv structure specialised for Latency-Loss-Tradeoff spec.
+ *		Loss Sensitive		(TOS1, TOS2)
+ *		Best Effort
+ *		Latency Sensitive	(TOS4, TOS5, VA, EF)
+ *		Low Priority		(CS1)
+ *		Network Control		(CS6, CS7)
+ */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+
+	q->tin_cnt = 5;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv_llt;
+	q->tin_order = normal_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[5], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	cake_set_rate(&q->tins[0], rate / 3, mtu,
+		      US2TIME(q->target * 4), US2TIME(q->interval * 4));
+	cake_set_rate(&q->tins[1], rate / 3, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate / 3, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[3], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[4], rate >> 4, mtu,
+		      US2TIME(q->target * 4), US2TIME(q->interval * 4));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = 2048;
+	q->tins[1].tin_quantum_prio = 2048;
+	q->tins[2].tin_quantum_prio = 2048;
+	q->tins[3].tin_quantum_prio = 16384;
+	q->tins[4].tin_quantum_prio = 32768;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = 2048;
+	q->tins[1].tin_quantum_band = 2048;
+	q->tins[2].tin_quantum_band = 2048;
+	q->tins[3].tin_quantum_band = 256;
+	q->tins[4].tin_quantum_band = 16;
+
+	return 5;
+}
+
+static void cake_reconfigure(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	int c, ft;
+
+	switch (q->tin_mode) {
+	case CAKE_MODE_BESTEFFORT:
+		ft = cake_config_besteffort(sch);
+		break;
+
+	case CAKE_MODE_PRECEDENCE:
+		ft = cake_config_precedence(sch);
+		break;
+
+	case CAKE_MODE_DIFFSERV8:
+		ft = cake_config_diffserv8(sch);
+		break;
+
+	case CAKE_MODE_DIFFSERV4:
+		ft = cake_config_diffserv4(sch);
+		break;
+
+	case CAKE_MODE_LLT:
+		ft = cake_config_diffserv_llt(sch);
+		break;
+
+	case CAKE_MODE_DIFFSERV3:
+	default:
+		ft = cake_config_diffserv3(sch);
+		break;
+	}
+
+	for (c = q->tin_cnt; c < CAKE_MAX_TINS; c++)
+		cake_clear_tin(sch, c);
+
+	q->rate_ns   = q->tins[ft].tin_rate_ns;
+	q->rate_shft = q->tins[ft].tin_rate_shft;
+
+	if (q->buffer_config_limit) {
+		q->buffer_limit = q->buffer_config_limit;
+	} else if (q->rate_bps) {
+		u64 t = (u64)q->rate_bps * q->interval;
+
+		do_div(t, USEC_PER_SEC / 4);
+		q->buffer_limit = max_t(u32, t, 4U << 20);
+	} else {
+		q->buffer_limit = ~0;
+	}
+
+	sch->flags &= ~TCQ_F_CAN_BYPASS;
+
+	q->buffer_limit = min(q->buffer_limit,
+			      max(sch->limit * psched_mtu(qdisc_dev(sch)),
+				  q->buffer_config_limit));
+}
+
+static int cake_change(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_CAKE_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_CAKE_MAX, opt, cake_policy, NULL);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_CAKE_BASE_RATE])
+		q->rate_bps = nla_get_u32(tb[TCA_CAKE_BASE_RATE]);
+
+	if (tb[TCA_CAKE_DIFFSERV_MODE])
+		q->tin_mode = nla_get_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+
+	if (tb[TCA_CAKE_ATM]) {
+		q->rate_flags &= ~(CAKE_FLAG_ATM | CAKE_FLAG_PTM);
+		q->rate_flags |= nla_get_u32(tb[TCA_CAKE_ATM]) &
+			(CAKE_FLAG_ATM | CAKE_FLAG_PTM);
+	}
+
+	if (tb[TCA_CAKE_WASH]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_WASH]))
+			q->rate_flags |= CAKE_FLAG_WASH;
+		else
+			q->rate_flags &= ~CAKE_FLAG_WASH;
+	}
+
+	if (tb[TCA_CAKE_FLOW_MODE])
+		q->flow_mode = nla_get_u32(tb[TCA_CAKE_FLOW_MODE]);
+
+	if (tb[TCA_CAKE_NAT]) {
+		q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
+		q->flow_mode |= CAKE_FLOW_NAT_FLAG *
+			!!nla_get_u32(tb[TCA_CAKE_NAT]);
+	}
+
+	if (tb[TCA_CAKE_OVERHEAD]) {
+		if (tb[TCA_CAKE_ETHERNET])
+			q->rate_overhead = -(nla_get_s32(tb[TCA_CAKE_ETHERNET]));
+		else
+			q->rate_overhead = -(qdisc_dev(sch)->hard_header_len);
+
+		q->rate_overhead += nla_get_s32(tb[TCA_CAKE_OVERHEAD]);
+	}
+
+	if (tb[TCA_CAKE_MPU])
+		q->rate_mpu = nla_get_u32(tb[TCA_CAKE_MPU]);
+
+	if (tb[TCA_CAKE_RTT]) {
+		q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
+
+		if (!q->interval)
+			q->interval = 1;
+	}
+
+	if (tb[TCA_CAKE_TARGET]) {
+		q->target = nla_get_u32(tb[TCA_CAKE_TARGET]);
+
+		if (!q->target)
+			q->target = 1;
+	}
+
+	if (tb[TCA_CAKE_AUTORATE]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
+			q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
+	}
+
+	if (tb[TCA_CAKE_INGRESS]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
+			q->rate_flags |= CAKE_FLAG_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_INGRESS;
+	}
+
+	if (tb[TCA_CAKE_ACK_FILTER]) {
+		q->rate_flags &= ~(CAKE_FLAG_ACK_FILTER |
+				   CAKE_FLAG_ACK_AGGRESSIVE);
+		q->rate_flags |= nla_get_u32(tb[TCA_CAKE_ACK_FILTER]) &
+					     (CAKE_FLAG_ACK_FILTER |
+					      CAKE_FLAG_ACK_AGGRESSIVE);
+	}
+
+	if (tb[TCA_CAKE_MEMORY])
+		q->buffer_config_limit = nla_get_s32(tb[TCA_CAKE_MEMORY]);
+
+	if (q->tins) {
+		sch_tree_lock(sch);
+		cake_reconfigure(sch);
+		sch_tree_unlock(sch);
+	}
+
+	return 0;
+}
+
+static void cake_free(void *addr)
+{
+	if (addr)
+		kvfree(addr);
+}
+
+static void cake_destroy(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+
+	if (q->tins)
+		cake_free(q->tins);
+}
+
+static int cake_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	int i, j;
+
+	sch->limit = 10240;
+	q->tin_mode = CAKE_MODE_DIFFSERV3;
+	q->flow_mode  = CAKE_FLOW_TRIPLE;
+
+	q->rate_bps = 0; /* unlimited by default */
+
+	q->interval = 100000; /* 100ms default */
+	q->target   =   5000; /* 5ms: codel RFC argues
+			       * for 5 to 10% of interval
+			       */
+
+	q->cur_tin = 0;
+	q->cur_flow  = 0;
+
+	if (opt) {
+		int err = cake_change(sch, opt);
+
+		if (err)
+			return err;
+	}
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+
+	quantum_div[0] = ~0;
+	for (i = 1; i <= CAKE_QUEUES; i++)
+		quantum_div[i] = 65535 / i;
+
+	q->tins = kvzalloc(CAKE_MAX_TINS * sizeof(struct cake_tin_data),
+			   GFP_KERNEL);
+	if (!q->tins)
+		goto nomem;
+
+	for (i = 0; i < CAKE_MAX_TINS; i++) {
+		struct cake_tin_data *b = q->tins + i;
+
+		INIT_LIST_HEAD(&b->new_flows);
+		INIT_LIST_HEAD(&b->old_flows);
+		INIT_LIST_HEAD(&b->decaying_flows);
+		b->sparse_flow_count = 0;
+		b->bulk_flow_count = 0;
+		b->decaying_flow_count = 0;
+
+		for (j = 0; j < CAKE_QUEUES; j++) {
+			struct cake_flow *flow = b->flows + j;
+			u32 k = j * CAKE_MAX_TINS + i;
+
+			INIT_LIST_HEAD(&flow->flowchain);
+			cobalt_vars_init(&flow->cvars);
+
+			q->overflow_heap[k].t = i;
+			q->overflow_heap[k].b = j;
+			b->overflow_idx[j] = k;
+		}
+	}
+
+	cake_reconfigure(sch);
+	q->avg_peak_bandwidth = q->rate_bps;
+	return 0;
+
+nomem:
+	cake_destroy(sch);
+	return -ENOMEM;
+}
+
+static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts;
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_BASE_RATE, q->rate_bps))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_DIFFSERV_MODE, q->tin_mode))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_ATM, (q->rate_flags &
+			(CAKE_FLAG_ATM | CAKE_FLAG_PTM))))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_FLOW_MODE, q->flow_mode))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_WASH,
+			!!(q->rate_flags & CAKE_FLAG_WASH)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_OVERHEAD, q->rate_overhead +
+			qdisc_dev(sch)->hard_header_len))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_MPU, q->rate_mpu))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_ETHERNET,
+			qdisc_dev(sch)->hard_header_len))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_RTT, q->interval))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_TARGET, q->target))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
+			!!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_INGRESS,
+			!!(q->rate_flags & CAKE_FLAG_INGRESS)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER,
+			(q->rate_flags &
+			(CAKE_FLAG_ACK_FILTER | CAKE_FLAG_ACK_AGGRESSIVE))))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	return -1;
+}
+
+static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct tc_cake_xstats *st = kvzalloc(sizeof(*st), GFP_KERNEL);
+	int i;
+
+	if (!st)
+		return -ENOMEM;
+
+	st->version = 5;
+	st->max_tins = TC_CAKE_MAX_TINS;
+	st->tin_cnt = q->tin_cnt;
+
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[q->tin_order[i]];
+
+		st->threshold_rate[i] = b->tin_rate_bps;
+		st->target_us[i]      = cobalt_time_to_us(b->cparams.target);
+		st->interval_us[i]    = cobalt_time_to_us(b->cparams.interval);
+
+		/* TODO FIXME: add missing aspects of these composite stats */
+		st->sent[i].packets       = b->packets;
+		st->sent[i].bytes	  = b->bytes;
+		st->dropped[i].packets    = b->tin_dropped;
+		st->ecn_marked[i].packets = b->tin_ecn_mark;
+		st->backlog[i].bytes      = b->tin_backlog;
+		st->ack_drops[i].packets  = b->ack_drops;
+
+		st->peak_delay_us[i] = cobalt_time_to_us(b->peak_delay);
+		st->avge_delay_us[i] = cobalt_time_to_us(b->avge_delay);
+		st->base_delay_us[i] = cobalt_time_to_us(b->base_delay);
+
+		st->way_indirect_hits[i] = b->way_hits;
+		st->way_misses[i]	 = b->way_misses;
+		st->way_collisions[i]    = b->way_collisions;
+
+		st->sparse_flows[i]      = b->sparse_flow_count +
+					   b->decaying_flow_count;
+		st->bulk_flows[i]	 = b->bulk_flow_count;
+		st->unresponse_flows[i]  = b->unresponsive_flow_count;
+		st->spare[i]		 = 0;
+		st->max_skblen[i]	 = b->max_skblen;
+	}
+	st->capacity_estimate = q->avg_peak_bandwidth;
+	st->memory_limit      = q->buffer_limit;
+	st->memory_used       = q->buffer_max_used;
+
+	i = gnet_stats_copy_app(d, st, sizeof(*st));
+	cake_free(st);
+	return i;
+}
+
+static struct Qdisc_ops cake_qdisc_ops __read_mostly = {
+	.id		=	"cake",
+	.priv_size	=	sizeof(struct cake_sched_data),
+	.enqueue	=	cake_enqueue,
+	.dequeue	=	cake_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	cake_init,
+	.reset		=	cake_reset,
+	.destroy	=	cake_destroy,
+	.change		=	cake_change,
+	.dump		=	cake_dump,
+	.dump_stats	=	cake_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init cake_module_init(void)
+{
+	return register_qdisc(&cake_qdisc_ops);
+}
+
+static void __exit cake_module_exit(void)
+{
+	unregister_qdisc(&cake_qdisc_ops);
+}
+
+module_init(cake_module_init)
+module_exit(cake_module_exit)
+MODULE_AUTHOR("Jonathan Morton");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("The Cake shaper. Version: " CAKE_VERSION);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 3/3] Add support for building the new cake qdisc
  2017-12-03 22:06 [PATCH net-next 0/3] Add Common Applications Kept Enhanced (cake) qdisc Dave Taht
  2017-12-03 22:06 ` [PATCH net-next 1/3] pkt_sched.h: add support for sch_cake API Dave Taht
  2017-12-03 22:06 ` [PATCH net-next 2/3] Add Common Applications Kept Enhanced (cake) qdisc Dave Taht
@ 2017-12-03 22:06 ` Dave Taht
  2017-12-05  4:41   ` kbuild test robot
  2018-04-24 11:44 ` [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Toke Høiland-Jørgensen
  3 siblings, 1 reply; 20+ messages in thread
From: Dave Taht @ 2017-12-03 22:06 UTC (permalink / raw)
  To: netdev
  Cc: Dave Taht, Toke Høiland-Jørgensen, Sebastian Moeller,
	Ryan Mounce, Jonathan Morton, Kevin Darbyshire-Bryant,
	Nils Andreas Svee, Dean Scarff, Loganaden Velvindron

Hook up sch_cake to the build system.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
---
 net/sched/Kconfig  | 11 +++++++++++
 net/sched/Makefile |  1 +
 2 files changed, 12 insertions(+)

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index c03d86a..3ea22e5 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -284,6 +284,17 @@ config NET_SCH_FQ_CODEL
 
 	  If unsure, say N.
 
+config NET_SCH_CAKE
+	tristate "Common Applications Kept Enhanced (CAKE)"
+	help
+	  Say Y here if you want to use the CAKE
+	  packet scheduling algorithm.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_cake.
+
+	  If unsure, say N.
+
 config NET_SCH_FQ
 	tristate "Fair Queue"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5b63544..b8dd962 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL)	+= sch_fq_codel.o
+obj-$(CONFIG_NET_SCH_CAKE)	+= sch_cake.o
 obj-$(CONFIG_NET_SCH_FQ)	+= sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)	+= sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
-- 
2.7.4

* Re: [PATCH net-next 3/3] Add support for building the new cake qdisc
  2017-12-03 22:06 ` [PATCH net-next 3/3] Add support for building the new cake qdisc Dave Taht
@ 2017-12-05  4:41   ` kbuild test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2017-12-05  4:41 UTC (permalink / raw)
  To: Dave Taht
  Cc: kbuild-all, netdev, Dave Taht, Toke Høiland-Jørgensen,
	Sebastian Moeller, Ryan Mounce, Jonathan Morton,
	Kevin Darbyshire-Bryant, Nils Andreas Svee, Dean Scarff,
	Loganaden Velvindron

[-- Attachment #1: Type: text/plain, Size: 4868 bytes --]

Hi Dave,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Dave-Taht/Add-Common-Applications-Kept-Enhanced-cake-qdisc/20171205-053924
config: parisc-allmodconfig (attached as .config)
compiler: hppa-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=parisc 

All warnings (new ones prefixed by >>):

   net/sched/sch_cake.c: In function 'cake_dump_stats':
>> net/sched/sch_cake.c:2530:1: warning: the frame size of 1424 bytes is larger than 1280 bytes [-Wframe-larger-than=]
    }
    ^

vim +2530 net/sched/sch_cake.c

8c28e37b Dave Taht 2017-12-03  2479  
8c28e37b Dave Taht 2017-12-03  2480  static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
8c28e37b Dave Taht 2017-12-03  2481  {
8c28e37b Dave Taht 2017-12-03  2482  	struct cake_sched_data *q = qdisc_priv(sch);
8c28e37b Dave Taht 2017-12-03  2483  	struct tc_cake_xstats *st = kvzalloc(sizeof(*st), GFP_KERNEL);
8c28e37b Dave Taht 2017-12-03  2484  	int i;
8c28e37b Dave Taht 2017-12-03  2485  
8c28e37b Dave Taht 2017-12-03  2486  	if (!st)
8c28e37b Dave Taht 2017-12-03  2487  		return -ENOMEM;
8c28e37b Dave Taht 2017-12-03  2488  
8c28e37b Dave Taht 2017-12-03  2489  	st->version = 5;
8c28e37b Dave Taht 2017-12-03  2490  	st->max_tins = TC_CAKE_MAX_TINS;
8c28e37b Dave Taht 2017-12-03  2491  	st->tin_cnt = q->tin_cnt;
8c28e37b Dave Taht 2017-12-03  2492  
8c28e37b Dave Taht 2017-12-03  2493  	for (i = 0; i < q->tin_cnt; i++) {
8c28e37b Dave Taht 2017-12-03  2494  		struct cake_tin_data *b = &q->tins[q->tin_order[i]];
8c28e37b Dave Taht 2017-12-03  2495  
8c28e37b Dave Taht 2017-12-03  2496  		st->threshold_rate[i] = b->tin_rate_bps;
8c28e37b Dave Taht 2017-12-03  2497  		st->target_us[i]      = cobalt_time_to_us(b->cparams.target);
8c28e37b Dave Taht 2017-12-03  2498  		st->interval_us[i]    = cobalt_time_to_us(b->cparams.interval);
8c28e37b Dave Taht 2017-12-03  2499  
8c28e37b Dave Taht 2017-12-03  2500  		/* TODO FIXME: add missing aspects of these composite stats */
8c28e37b Dave Taht 2017-12-03  2501  		st->sent[i].packets       = b->packets;
8c28e37b Dave Taht 2017-12-03  2502  		st->sent[i].bytes	  = b->bytes;
8c28e37b Dave Taht 2017-12-03  2503  		st->dropped[i].packets    = b->tin_dropped;
8c28e37b Dave Taht 2017-12-03  2504  		st->ecn_marked[i].packets = b->tin_ecn_mark;
8c28e37b Dave Taht 2017-12-03  2505  		st->backlog[i].bytes      = b->tin_backlog;
8c28e37b Dave Taht 2017-12-03  2506  		st->ack_drops[i].packets  = b->ack_drops;
8c28e37b Dave Taht 2017-12-03  2507  
8c28e37b Dave Taht 2017-12-03  2508  		st->peak_delay_us[i] = cobalt_time_to_us(b->peak_delay);
8c28e37b Dave Taht 2017-12-03  2509  		st->avge_delay_us[i] = cobalt_time_to_us(b->avge_delay);
8c28e37b Dave Taht 2017-12-03  2510  		st->base_delay_us[i] = cobalt_time_to_us(b->base_delay);
8c28e37b Dave Taht 2017-12-03  2511  
8c28e37b Dave Taht 2017-12-03  2512  		st->way_indirect_hits[i] = b->way_hits;
8c28e37b Dave Taht 2017-12-03  2513  		st->way_misses[i]	 = b->way_misses;
8c28e37b Dave Taht 2017-12-03  2514  		st->way_collisions[i]    = b->way_collisions;
8c28e37b Dave Taht 2017-12-03  2515  
8c28e37b Dave Taht 2017-12-03  2516  		st->sparse_flows[i]      = b->sparse_flow_count +
8c28e37b Dave Taht 2017-12-03  2517  					   b->decaying_flow_count;
8c28e37b Dave Taht 2017-12-03  2518  		st->bulk_flows[i]	 = b->bulk_flow_count;
8c28e37b Dave Taht 2017-12-03  2519  		st->unresponse_flows[i]  = b->unresponsive_flow_count;
8c28e37b Dave Taht 2017-12-03  2520  		st->spare[i]		 = 0;
8c28e37b Dave Taht 2017-12-03  2521  		st->max_skblen[i]	 = b->max_skblen;
8c28e37b Dave Taht 2017-12-03  2522  	}
8c28e37b Dave Taht 2017-12-03  2523  	st->capacity_estimate = q->avg_peak_bandwidth;
8c28e37b Dave Taht 2017-12-03  2524  	st->memory_limit      = q->buffer_limit;
8c28e37b Dave Taht 2017-12-03  2525  	st->memory_used       = q->buffer_max_used;
8c28e37b Dave Taht 2017-12-03  2526  
8c28e37b Dave Taht 2017-12-03  2527  	i = gnet_stats_copy_app(d, st, sizeof(*st));
8c28e37b Dave Taht 2017-12-03  2528  	cake_free(st);
8c28e37b Dave Taht 2017-12-03  2529  	return i;
8c28e37b Dave Taht 2017-12-03 @2530  }
8c28e37b Dave Taht 2017-12-03  2531  

:::::: The code at line 2530 was first introduced by commit
:::::: 8c28e37b67a2c9259e413f7dc39d6c84c08e8c75 Add Common Applications Kept Enhanced (cake) qdisc

:::::: TO: Dave Taht <dave.taht@gmail.com>
:::::: CC: 0day robot <fengguang.wu@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 52277 bytes --]

* [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
  2017-12-03 22:06 [PATCH net-next 0/3] Add Common Applications Kept Enhanced (cake) qdisc Dave Taht
                   ` (2 preceding siblings ...)
  2017-12-03 22:06 ` [PATCH net-next 3/3] Add support for building the new cake qdisc Dave Taht
@ 2018-04-24 11:44 ` Toke Høiland-Jørgensen
  2018-04-24 11:44   ` [PATCH iproute2-next v2] Add support for cake qdisc Toke Høiland-Jørgensen
                     ` (3 more replies)
  3 siblings, 4 replies; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2018-04-24 11:44 UTC (permalink / raw)
  To: netdev; +Cc: cake, Toke Høiland-Jørgensen, Dave Taht

sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided):

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
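(For illustration only, not part of this patch: one common way to do the
elided ifb/tc-mirred redirection — device names here are assumptions — is

ip link add name ifb0 type ifb
ip link set dev ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: matchall \
   action mirred egress redirect dev ifb0

after which the cake qdisc above can be attached to ifb0.)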

CAKE is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8-way set-associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency
  variation.

A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617

Various versions have been baking as an out-of-tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

CAKE's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth2
qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms raw overhead 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 0b of 5000000b
 capacity estimate: 100Mbit
 min/max network layer size:        65535 /       0
 min/max overhead-adjusted size:    65535 /       0
 average network hdr offset:            0

                   Bulk  Best Effort        Voice
  thresh       6250Kbit      100Mbit       25Mbit
  target          5.0ms        5.0ms        5.0ms
  interval      100.0ms      100.0ms      100.0ms
  pk_delay          0us          0us          0us
  av_delay          0us          0us          0us
  sp_delay          0us          0us          0us
  pkts                0            0            0
  bytes               0            0            0
  way_inds            0            0            0
  way_miss            0            0            0
  way_cols            0            0            0
  drops               0            0            0
  marks               0            0            0
  ack_drop            0            0            0
  sp_flows            0            0            0
  bk_flows            0            0            0
  un_flows            0            0            0
  max_len             0            0            0
  quantum           300         1514          762

Tested-by: Pete Heist <peteheist@gmail.com>
Tested-by: Georgios Amanakis <gamanakis@gmail.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
Changes since v1:
  - Fix kbuild test bot complaint
  - Clean up the netlink ABI
  - Fix checkpatch complaints
  - A few tweaks to the behaviour of cake based on testing carried out
    while writing the paper.

include/uapi/linux/pkt_sched.h |  107 ++
 net/sched/Kconfig              |   11 +
 net/sched/Makefile             |    1 +
 net/sched/sch_cake.c           | 2620 ++++++++++++++++++++++++++++++++
 4 files changed, 2739 insertions(+)
 create mode 100644 net/sched/sch_cake.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..919f7c698fed 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,111 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+/* CAKE */
+enum {
+	TCA_CAKE_UNSPEC,
+	TCA_CAKE_BASE_RATE,
+	TCA_CAKE_DIFFSERV_MODE,
+	TCA_CAKE_ATM,
+	TCA_CAKE_FLOW_MODE,
+	TCA_CAKE_OVERHEAD,
+	TCA_CAKE_RTT,
+	TCA_CAKE_TARGET,
+	TCA_CAKE_AUTORATE,
+	TCA_CAKE_MEMORY,
+	TCA_CAKE_NAT,
+	TCA_CAKE_RAW, /* was _ETHERNET */
+	TCA_CAKE_WASH,
+	TCA_CAKE_MPU,
+	TCA_CAKE_INGRESS,
+	TCA_CAKE_ACK_FILTER,
+	__TCA_CAKE_MAX
+};
+#define TCA_CAKE_MAX	(__TCA_CAKE_MAX - 1)
+
+enum {
+	CAKE_FLOW_NONE = 0,
+	CAKE_FLOW_SRC_IP,
+	CAKE_FLOW_DST_IP,
+	CAKE_FLOW_HOSTS,    /* = CAKE_FLOW_SRC_IP | CAKE_FLOW_DST_IP */
+	CAKE_FLOW_FLOWS,
+	CAKE_FLOW_DUAL_SRC, /* = CAKE_FLOW_SRC_IP | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_DUAL_DST, /* = CAKE_FLOW_DST_IP | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_TRIPLE,   /* = CAKE_FLOW_HOSTS  | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_MAX,
+};
+
+enum {
+	CAKE_DIFFSERV_DIFFSERV3 = 0,
+	CAKE_DIFFSERV_DIFFSERV4,
+	CAKE_DIFFSERV_DIFFSERV8,
+	CAKE_DIFFSERV_BESTEFFORT,
+	CAKE_DIFFSERV_PRECEDENCE,
+	CAKE_DIFFSERV_MAX
+};
+
+enum {
+	CAKE_ACK_NONE = 0,
+	CAKE_ACK_FILTER,
+	CAKE_ACK_AGGRESSIVE,
+	CAKE_ACK_MAX
+};
+
+enum {
+	CAKE_ATM_NONE = 0,
+	CAKE_ATM_ATM,
+	CAKE_ATM_PTM,
+	CAKE_ATM_MAX
+};
+
+struct tc_cake_traffic_stats {
+	__u32 packets;
+	__u32 link_ms;
+	__u64 bytes;
+};
+
+#define TC_CAKE_MAX_TINS (8)
+struct tc_cake_tin_stats {
+	__u32 threshold_rate;
+	__u32 target_us;
+	struct tc_cake_traffic_stats sent;
+	struct tc_cake_traffic_stats dropped;
+	struct tc_cake_traffic_stats ecn_marked;
+	struct tc_cake_traffic_stats backlog;
+	__u32 interval_us;
+	__u32 way_indirect_hits;
+	__u32 way_misses;
+	__u32 way_collisions;
+	__u32 peak_delay_us; /* ~= bulk flow delay */
+	__u32 avge_delay_us;
+	__u32 base_delay_us; /* ~= sparse flows delay */
+	__u16 sparse_flows;
+	__u16 bulk_flows;
+	__u16 unresponse_flows;
+	__u16 spare;
+	__u32 max_skblen;
+	struct tc_cake_traffic_stats ack_drops;
+	__u16 flow_quantum;
+};
+
+struct tc_cake_xstats {
+	__u16 version;
+	__u16 tin_stats_size; /* == sizeof(struct tc_cake_tin_stats) */
+	__u32 capacity_estimate;
+	__u32 memory_limit;
+	__u32 memory_used;
+	__u8  tin_cnt;
+	__u8  avg_trnoff;
+	__u16 max_netlen;
+	__u16 max_adjlen;
+	__u16 min_netlen;
+	__u16 min_adjlen;
+
+	__u16 spare1;
+	__u32 spare2;
+
+	struct tc_cake_tin_stats tin_stats[0]; /* keep last */
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..6e7d614b5757 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -284,6 +284,17 @@ config NET_SCH_FQ_CODEL
 
 	  If unsure, say N.
 
+config NET_SCH_CAKE
+	tristate "Common Applications Kept Enhanced (CAKE)"
+	help
+	  Say Y here if you want to use the Common Applications Kept Enhanced
+	  (CAKE) queue management algorithm.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_cake.
+
+	  If unsure, say N.
+
 config NET_SCH_FQ
 	tristate "Fair Queue"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..435054cee32c 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL)	+= sch_fq_codel.o
+obj-$(CONFIG_NET_SCH_CAKE)	+= sch_cake.o
 obj-$(CONFIG_NET_SCH_FQ)	+= sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)	+= sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
new file mode 100644
index 000000000000..de3a435d5590
--- /dev/null
+++ b/net/sched/sch_cake.c
@@ -0,0 +1,2620 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/* COMMON Applications Kept Enhanced (CAKE) discipline
+ *
+ * Copyright (C) 2014-2018 Jonathan Morton <chromatix99@gmail.com>
+ * Copyright (C) 2015-2018 Toke Høiland-Jørgensen <toke@toke.dk>
+ * Copyright (C) 2014-2018 Dave Täht <dave.taht@gmail.com>
+ * Copyright (C) 2015-2018 Sebastian Moeller <moeller0@gmx.de>
+ * (C) 2015-2018 Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
+ * Copyright (C) 2017 Ryan Mounce <ryan@mounce.com.au>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *	notice, this list of conditions, and the following disclaimer,
+ *	without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *	notice, this list of conditions and the following disclaimer in the
+ *	documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ *	derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/in.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/jhash.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/reciprocal_div.h>
+#include <net/netlink.h>
+#include <linux/version.h>
+#include <linux/if_vlan.h>
+#include <net/pkt_sched.h>
+#include <net/tcp.h>
+#include <net/flow_dissector.h>
+
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/nf_conntrack.h>
+#endif
+
+/* The CAKE Principles:
+ *		   (or, how to have your cake and eat it too)
+ *
+ * This is a combination of several shaping, AQM and FQ techniques into one
+ * easy-to-use package:
+ *
+ * - An overall bandwidth shaper, to move the bottleneck away from dumb CPE
+ *   equipment and bloated MACs.  This operates in deficit mode (as in sch_fq),
+ *   eliminating the need for any sort of burst parameter (eg. token bucket
+ *   depth).  Burst support is limited to that necessary to overcome scheduling
+ *   latency.
+ *
+ * - A Diffserv-aware priority queue, giving more priority to certain classes,
+ *   up to a specified fraction of bandwidth.  Above that bandwidth threshold,
+ *   the priority is reduced to avoid starving other tins.
+ *
+ * - Each priority tin has a separate Flow Queue system, to isolate traffic
+ *   flows from each other.  This prevents a burst on one flow from increasing
+ *   the delay to another.  Flows are distributed to queues using a
+ *   set-associative hash function.
+ *
+ * - Each queue is actively managed by Cobalt, which is a combination of the
+ *   Codel and Blue AQM algorithms.  This serves flows fairly, and signals
+ *   congestion early via ECN (if available) and/or packet drops, to keep
+ *   latency low.  The codel parameters are auto-tuned based on the bandwidth
+ *   setting, as is necessary at low bandwidths.
+ *
+ * The configuration parameters are kept deliberately simple for ease of use.
+ * Everything has sane defaults.  Complete generality of configuration is *not*
+ * a goal.
+ *
+ * The priority queue operates according to a weighted DRR scheme, combined with
+ * a bandwidth tracker which reuses the shaper logic to detect which side of the
+ * bandwidth sharing threshold the tin is operating on.  This determines whether
+ * priority-based weight (high) or a bandwidth-based weight (low) is used for
+ * that tin in the current pass.
+ *
+ * This qdisc was inspired by Eric Dumazet's fq_codel code, which he kindly
+ * granted us permission to leverage.
+ */
+
+#define CAKE_SET_WAYS (8)
+#define CAKE_MAX_TINS (8)
+#define CAKE_QUEUES (1024)
+#define US2TIME(a) (a * (u64)NSEC_PER_USEC)
+
+typedef u64 cobalt_time_t;
+typedef s64 cobalt_tdiff_t;
+
+/**
+ * struct cobalt_params - contains codel and blue parameters
+ * @interval:	codel initial drop rate
+ * @target:     maximum persistent sojourn time & blue update rate
+ * @mtu_time:   serialisation delay of maximum-size packet
+ * @p_inc:      increment of blue drop probability (0.32 fxp)
+ * @p_dec:      decrement of blue drop probability (0.32 fxp)
+ */
+struct cobalt_params {
+	cobalt_time_t	interval;
+	cobalt_time_t	target;
+	cobalt_time_t	mtu_time;
+	u32		p_inc;
+	u32		p_dec;
+};
+
+/* struct cobalt_vars - contains codel and blue variables
+ * @count:	  codel dropping frequency
+ * @rec_inv_sqrt: reciprocal value of sqrt(count) >> 1
+ * @drop_next:    time to drop next packet, or when we dropped last
+ * @blue_timer:	  Blue time to next drop
+ * @p_drop:       BLUE drop probability (0.32 fxp)
+ * @dropping:     set if in dropping state
+ * @ecn_marked:   set if marked
+ */
+struct cobalt_vars {
+	u32		count;
+	u32		rec_inv_sqrt;
+	cobalt_time_t	drop_next;
+	cobalt_time_t	blue_timer;
+	u32     p_drop;
+	bool	dropping;
+	bool    ecn_marked;
+};
+
+enum {
+	CAKE_SET_NONE = 0,
+	CAKE_SET_SPARSE,
+	CAKE_SET_SPARSE_WAIT, /* counted in SPARSE, actually in BULK */
+	CAKE_SET_BULK,
+	CAKE_SET_DECAYING
+};
+
+struct cake_flow {
+	/* this stuff is all needed per-flow at dequeue time */
+	struct sk_buff	  *head;
+	struct sk_buff	  *tail;
+	struct sk_buff	  *ackcheck;
+	struct list_head  flowchain;
+	s32		  deficit;
+	struct cobalt_vars cvars;
+	u16		  srchost; /* index into cake_host table */
+	u16		  dsthost;
+	u8		  set;
+}; /* please try to keep this structure <= 64 bytes */
+
+struct cake_host {
+	u32 srchost_tag;
+	u32 dsthost_tag;
+	u16 srchost_refcnt;
+	u16 dsthost_refcnt;
+};
+
+struct cake_heap_entry {
+	u16 t:3, b:10;
+};
+
+struct cake_tin_data {
+	struct cake_flow flows[CAKE_QUEUES];
+	u32	backlogs[CAKE_QUEUES];
+	u32	tags[CAKE_QUEUES]; /* for set association */
+	u16	overflow_idx[CAKE_QUEUES];
+	struct cake_host hosts[CAKE_QUEUES]; /* for triple isolation */
+	u32	perturb;
+	u16	flow_quantum;
+
+	struct cobalt_params cparams;
+	u32	drop_overlimit;
+	u16	bulk_flow_count;
+	u16	sparse_flow_count;
+	u16	decaying_flow_count;
+	u16	unresponsive_flow_count;
+
+	u32	max_skblen;
+
+	struct list_head new_flows;
+	struct list_head old_flows;
+	struct list_head decaying_flows;
+
+	/* time_next = time_this + ((len * rate_ns) >> rate_shft) */
+	u64	tin_time_next_packet;
+	u32	tin_rate_ns;
+	u32	tin_rate_bps;
+	u16	tin_rate_shft;
+
+	u16	tin_quantum_prio;
+	u16	tin_quantum_band;
+	s32	tin_deficit;
+	u32	tin_backlog;
+	u32	tin_dropped;
+	u32	tin_ecn_mark;
+
+	u32	packets;
+	u64	bytes;
+
+	u32	ack_drops;
+
+	/* moving averages */
+	cobalt_time_t avge_delay;
+	cobalt_time_t peak_delay;
+	cobalt_time_t base_delay;
+
+	/* hash function stats */
+	u32	way_directs;
+	u32	way_hits;
+	u32	way_misses;
+	u32	way_collisions;
+}; /* number of tins is small, so size of this struct doesn't matter much */
+
+struct cake_sched_data {
+	struct cake_tin_data *tins;
+
+	struct cake_heap_entry overflow_heap[CAKE_QUEUES * CAKE_MAX_TINS];
+	u16		overflow_timeout;
+
+	u16		tin_cnt;
+	u8		tin_mode;
+#define	CAKE_FLOW_NAT_FLAG 64
+	u8		flow_mode;
+	u8		ack_filter;
+	u8		atm_mode;
+
+	/* time_next = time_this + ((len * rate_ns) >> rate_shft) */
+	u16		rate_shft;
+	u64		time_next_packet;
+	u64		failsafe_next_packet;
+	u32		rate_ns;
+	u32		rate_bps;
+	u16		rate_flags;
+	s16		rate_overhead;
+	u16		rate_mpu;
+	u32		interval;
+	u32		target;
+
+	/* resource tracking */
+	u32		buffer_used;
+	u32		buffer_max_used;
+	u32		buffer_limit;
+	u32		buffer_config_limit;
+
+	/* indices for dequeue */
+	u16		cur_tin;
+	u16		cur_flow;
+
+	struct qdisc_watchdog watchdog;
+	const u8	*tin_index;
+	const u8	*tin_order;
+
+	/* bandwidth capacity estimate */
+	u64		last_packet_time;
+	u64		avg_packet_interval;
+	u64		avg_window_begin;
+	u32		avg_window_bytes;
+	u32		avg_peak_bandwidth;
+	u64		last_reconfig_time;
+
+	/* packet length stats */
+	u32 avg_trnoff;
+	u16 max_netlen;
+	u16 max_adjlen;
+	u16 min_netlen;
+	u16 min_adjlen;
+};
+
+enum {
+	CAKE_FLAG_OVERHEAD	   = BIT(0),
+	CAKE_FLAG_AUTORATE_INGRESS = BIT(1),
+	CAKE_FLAG_INGRESS	   = BIT(2),
+	CAKE_FLAG_WASH		   = BIT(3)
+};
+
+/* COBALT operates the Codel and BLUE algorithms in parallel, in order to
+ * obtain the best features of each.  Codel is excellent on flows which
+ * respond to congestion signals in a TCP-like way.  BLUE is more effective on
+ * unresponsive flows.
+ */
+
+struct cobalt_skb_cb {
+	cobalt_time_t enqueue_time;
+	u32           adjusted_len;
+};
+
+static inline cobalt_time_t cobalt_get_time(void)
+{
+	return ktime_get_ns();
+}
+
+static inline u32 cobalt_time_to_us(cobalt_time_t val)
+{
+	do_div(val, NSEC_PER_USEC);
+	return (u32)val;
+}
+
+static inline struct cobalt_skb_cb *get_cobalt_cb(const struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct cobalt_skb_cb));
+	return (struct cobalt_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static inline cobalt_time_t cobalt_get_enqueue_time(const struct sk_buff *skb)
+{
+	return get_cobalt_cb(skb)->enqueue_time;
+}
+
+static inline void cobalt_set_enqueue_time(struct sk_buff *skb,
+					   cobalt_time_t now)
+{
+	get_cobalt_cb(skb)->enqueue_time = now;
+}
+
+static u16 quantum_div[CAKE_QUEUES + 1] = {0};
+
+/* Diffserv lookup tables */
+
+static const u8 precedence[] = {0, 0, 0, 0, 0, 0, 0, 0,
+				1, 1, 1, 1, 1, 1, 1, 1,
+				2, 2, 2, 2, 2, 2, 2, 2,
+				3, 3, 3, 3, 3, 3, 3, 3,
+				4, 4, 4, 4, 4, 4, 4, 4,
+				5, 5, 5, 5, 5, 5, 5, 5,
+				6, 6, 6, 6, 6, 6, 6, 6,
+				7, 7, 7, 7, 7, 7, 7, 7,
+				};
+
+static const u8 diffserv8[] = {2, 5, 1, 2, 4, 2, 2, 2,
+			       0, 2, 1, 2, 1, 2, 1, 2,
+			       5, 2, 4, 2, 4, 2, 4, 2,
+				3, 2, 3, 2, 3, 2, 3, 2,
+				6, 2, 3, 2, 3, 2, 3, 2,
+				6, 2, 2, 2, 6, 2, 6, 2,
+				7, 2, 2, 2, 2, 2, 2, 2,
+				7, 2, 2, 2, 2, 2, 2, 2,
+				};
+
+static const u8 diffserv4[] = {0, 2, 0, 0, 2, 0, 0, 0,
+			       1, 0, 0, 0, 0, 0, 0, 0,
+				2, 0, 2, 0, 2, 0, 2, 0,
+				2, 0, 2, 0, 2, 0, 2, 0,
+				3, 0, 2, 0, 2, 0, 2, 0,
+				3, 0, 0, 0, 3, 0, 3, 0,
+				3, 0, 0, 0, 0, 0, 0, 0,
+				3, 0, 0, 0, 0, 0, 0, 0,
+				};
+
+static const u8 diffserv3[] = {0, 0, 0, 0, 2, 0, 0, 0,
+			       1, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 2, 0, 2, 0,
+				2, 0, 0, 0, 0, 0, 0, 0,
+				2, 0, 0, 0, 0, 0, 0, 0,
+				};
+
+static const u8 besteffort[] = {0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				0, 0, 0, 0, 0, 0, 0, 0,
+				};
+
+/* tin priority order for stats dumping */
+
+static const u8 normal_order[] = {0, 1, 2, 3, 4, 5, 6, 7};
+static const u8 bulk_order[] = {1, 0, 2, 3};
+
+#define REC_INV_SQRT_CACHE (16)
+static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
+
+/* http://en.wikipedia.org/wiki/Methods_of_computing_square_roots
+ * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
+ *
+ * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
+ */
+
+static void cobalt_newton_step(struct cobalt_vars *vars)
+{
+	u32 invsqrt = vars->rec_inv_sqrt;
+	u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
+	u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
+
+	val >>= 2; /* avoid overflow in following multiply */
+	val = (val * invsqrt) >> (32 - 2 + 1);
+
+	vars->rec_inv_sqrt = val;
+}
+
+static void cobalt_invsqrt(struct cobalt_vars *vars)
+{
+	if (vars->count < REC_INV_SQRT_CACHE)
+		vars->rec_inv_sqrt = cobalt_rec_inv_sqrt_cache[vars->count];
+	else
+		cobalt_newton_step(vars);
+}
+
+/* There is a big difference in timing between the accurate values placed in
+ * the cache and the approximations given by a single Newton step for small
+ * count values, particularly when stepping from count 1 to 2 or vice versa.
+ * Above 16, a single Newton step gives sufficient accuracy in either
+ * direction, given the precision stored.
+ *
+ * The magnitude of the error when stepping up to count 2 is such as to give
+ * the value that *should* have been produced at count 4.
+ */
+
+static void cobalt_cache_init(void)
+{
+	struct cobalt_vars v;
+
+	memset(&v, 0, sizeof(v));
+	v.rec_inv_sqrt = ~0U;
+	cobalt_rec_inv_sqrt_cache[0] = v.rec_inv_sqrt;
+
+	for (v.count = 1; v.count < REC_INV_SQRT_CACHE; v.count++) {
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+
+		cobalt_rec_inv_sqrt_cache[v.count] = v.rec_inv_sqrt;
+	}
+}
+
+static void cobalt_vars_init(struct cobalt_vars *vars)
+{
+	memset(vars, 0, sizeof(*vars));
+
+	if (!cobalt_rec_inv_sqrt_cache[0]) {
+		cobalt_cache_init();
+		cobalt_rec_inv_sqrt_cache[0] = ~0;
+	}
+}
+
+/* CoDel control_law is t + interval/sqrt(count)
+ * We maintain in rec_inv_sqrt the reciprocal value of sqrt(count) to avoid
+ * both sqrt() and divide operation.
+ */
+static cobalt_time_t cobalt_control(cobalt_time_t t,
+				    cobalt_time_t interval,
+				    u32 rec_inv_sqrt)
+{
+	return t + reciprocal_scale(interval, rec_inv_sqrt);
+}
+
+/* Call this when a packet had to be dropped due to queue overflow.  Returns
+ * true if the BLUE state was quiescent before but active after this call.
+ */
+static bool cobalt_queue_full(struct cobalt_vars *vars,
+			      struct cobalt_params *p,
+			      cobalt_time_t now)
+{
+	bool up = false;
+
+	if ((now - vars->blue_timer) > p->target) {
+		up = !vars->p_drop;
+		vars->p_drop += p->p_inc;
+		if (vars->p_drop < p->p_inc)
+			vars->p_drop = ~0;
+		vars->blue_timer = now;
+	}
+	vars->dropping = true;
+	vars->drop_next = now;
+	if (!vars->count)
+		vars->count = 1;
+
+	return up;
+}
+
+/* Call this when the queue was serviced but turned out to be empty.  Returns
+ * true if the BLUE state was active before but quiescent after this call.
+ */
+static bool cobalt_queue_empty(struct cobalt_vars *vars,
+			       struct cobalt_params *p,
+			       cobalt_time_t now)
+{
+	bool down = false;
+
+	if (vars->p_drop && (now - vars->blue_timer) > p->target) {
+		if (vars->p_drop < p->p_dec)
+			vars->p_drop = 0;
+		else
+			vars->p_drop -= p->p_dec;
+		vars->blue_timer = now;
+		down = !vars->p_drop;
+	}
+	vars->dropping = false;
+
+	if (vars->count && (now - vars->drop_next) >= 0) {
+		vars->count--;
+		cobalt_invsqrt(vars);
+		vars->drop_next = cobalt_control(vars->drop_next,
+						 p->interval,
+						 vars->rec_inv_sqrt);
+	}
+
+	return down;
+}
+
+/* Call this with a freshly dequeued packet for possible congestion marking.
+ * Returns true as an instruction to drop the packet, false for delivery.
+ */
+static bool cobalt_should_drop(struct cobalt_vars *vars,
+			       struct cobalt_params *p,
+			       cobalt_time_t now,
+			       struct sk_buff *skb,
+			       u32 bulk_flows)
+{
+	bool drop = false;
+
+	/* Simplified Codel implementation */
+	cobalt_tdiff_t sojourn  = now - cobalt_get_enqueue_time(skb);
+
+/* The 'schedule' variable records, in its sign, whether 'now' is before or
+ * after 'drop_next'.  This allows 'drop_next' to be updated before the next
+ * scheduling decision is actually branched, without destroying that
+ * information.  Similarly, the first 'schedule' value calculated is preserved
+ * in the boolean 'next_due'.
+ *
+ * As for 'drop_next', we take advantage of the fact that 'interval' is both
+ * the delay between first exceeding 'target' and the first signalling event,
+ * *and* the scaling factor for the signalling frequency.  It's therefore very
+ * natural to use a single mechanism for both purposes, and eliminates a
+ * significant amount of reference Codel's spaghetti code.  To help with this,
+ * both the '0' and '1' entries in the invsqrt cache are 0xFFFFFFFF, as close
+ * as possible to 1.0 in fixed-point.
+ */
+
+	cobalt_tdiff_t schedule = now - vars->drop_next;
+
+	bool over_target = sojourn > p->target &&
+			   sojourn > p->mtu_time * bulk_flows * 2 &&
+			   sojourn > p->mtu_time * 4;
+	bool next_due    = vars->count && schedule >= 0;
+
+	vars->ecn_marked = false;
+
+	if (over_target) {
+		if (!vars->dropping) {
+			vars->dropping = true;
+			vars->drop_next = cobalt_control(now,
+							 p->interval,
+							 vars->rec_inv_sqrt);
+		}
+		if (!vars->count)
+			vars->count = 1;
+	} else if (vars->dropping) {
+		vars->dropping = false;
+	}
+
+	if (next_due && vars->dropping) {
+		/* Use ECN mark if possible, otherwise drop */
+		drop = !(vars->ecn_marked = INET_ECN_set_ce(skb));
+
+		vars->count++;
+		if (!vars->count)
+			vars->count--;
+		cobalt_invsqrt(vars);
+		vars->drop_next = cobalt_control(vars->drop_next,
+						 p->interval,
+						 vars->rec_inv_sqrt);
+		schedule = now - vars->drop_next;
+	} else {
+		while (next_due) {
+			vars->count--;
+			cobalt_invsqrt(vars);
+			vars->drop_next = cobalt_control(vars->drop_next,
+							 p->interval,
+							 vars->rec_inv_sqrt);
+			schedule = now - vars->drop_next;
+			next_due = vars->count && schedule >= 0;
+		}
+	}
+
+	/* Simple BLUE implementation.  Lack of ECN is deliberate. */
+	if (vars->p_drop)
+		drop |= (prandom_u32() < vars->p_drop);
+
+	/* Overload the drop_next field as an activity timeout */
+	if (!vars->count)
+		vars->drop_next = now + p->interval;
+	else if (schedule > 0 && !drop)
+		vars->drop_next = now;
+
+	return drop;
+}
+
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+
+static inline void cake_update_flowkeys(struct flow_keys *keys,
+					const struct sk_buff *skb)
+{
+	enum ip_conntrack_info ctinfo;
+	bool rev = false;
+
+	struct nf_conn *ct;
+	const struct nf_conntrack_tuple *tuple;
+
+	if (tc_skb_protocol(skb) != htons(ETH_P_IP))
+		return;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (ct) {
+		tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
+	} else {
+		const struct nf_conntrack_tuple_hash *hash;
+		struct nf_conntrack_tuple srctuple;
+
+		if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
+				       NFPROTO_IPV4, dev_net(skb->dev),
+				       &srctuple))
+			return;
+
+		hash = nf_conntrack_find_get(dev_net(skb->dev),
+					     &nf_ct_zone_dflt,
+					     &srctuple);
+		if (!hash)
+			return;
+
+		rev = true;
+		ct = nf_ct_tuplehash_to_ctrack(hash);
+		tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
+	}
+
+	keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
+	keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
+
+	if (keys->ports.ports) {
+		keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
+		keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
+	}
+	if (rev)
+		nf_ct_put(ct);
+}
+#else
+static inline void cake_update_flowkeys(struct flow_keys *keys,
+					const struct sk_buff *skb)
+{
+	/* There is nothing we can do here without CONNTRACK */
+}
+#endif
+
+/* Cake has several subtle multiple-bit settings. In these cases a
+ * test for the flag bits would match triple-isolate mode as well.
+ */
+
+static inline bool cake_dsrc(int flow_mode)
+{
+	return (flow_mode & CAKE_FLOW_DUAL_SRC) == CAKE_FLOW_DUAL_SRC;
+}
+
+static inline bool cake_ddst(int flow_mode)
+{
+	return (flow_mode & CAKE_FLOW_DUAL_DST) == CAKE_FLOW_DUAL_DST;
+}
+
+static inline u32
+cake_hash(struct cake_tin_data *q, const struct sk_buff *skb, int flow_mode)
+{
+	struct flow_keys keys, host_keys;
+	u32 flow_hash = 0, srchost_hash, dsthost_hash;
+	u16 reduced_hash, srchost_idx, dsthost_idx;
+
+	if (unlikely(flow_mode == CAKE_FLOW_NONE))
+		return 0;
+
+	skb_flow_dissect_flow_keys(skb, &keys,
+				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+
+	if (flow_mode & CAKE_FLOW_NAT_FLAG)
+		cake_update_flowkeys(&keys, skb);
+
+	/* flow_hash_from_keys() sorts the addresses by value, so we have
+	 * to preserve their order in a separate data structure to treat
+	 * src and dst host addresses as independently selectable.
+	 */
+	host_keys = keys;
+	host_keys.ports.ports     = 0;
+	host_keys.basic.ip_proto  = 0;
+	host_keys.keyid.keyid     = 0;
+	host_keys.tags.flow_label = 0;
+
+	switch (host_keys.control.addr_type) {
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+		host_keys.addrs.v4addrs.src = 0;
+		dsthost_hash = flow_hash_from_keys(&host_keys);
+		host_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
+		host_keys.addrs.v4addrs.dst = 0;
+		srchost_hash = flow_hash_from_keys(&host_keys);
+		break;
+
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+		memset(&host_keys.addrs.v6addrs.src, 0,
+		       sizeof(host_keys.addrs.v6addrs.src));
+		dsthost_hash = flow_hash_from_keys(&host_keys);
+		host_keys.addrs.v6addrs.src = keys.addrs.v6addrs.src;
+		memset(&host_keys.addrs.v6addrs.dst, 0,
+		       sizeof(host_keys.addrs.v6addrs.dst));
+		srchost_hash = flow_hash_from_keys(&host_keys);
+		break;
+
+	default:
+		dsthost_hash = 0;
+		srchost_hash = 0;
+	}
+
+	/* This *must* be after the above switch, since as a
+	 * side-effect it sorts the src and dst addresses.
+	 */
+	if (flow_mode & CAKE_FLOW_FLOWS)
+		flow_hash = flow_hash_from_keys(&keys);
+
+	if (!(flow_mode & CAKE_FLOW_FLOWS)) {
+		if (flow_mode & CAKE_FLOW_SRC_IP)
+			flow_hash ^= srchost_hash;
+
+		if (flow_mode & CAKE_FLOW_DST_IP)
+			flow_hash ^= dsthost_hash;
+	}
+
+	reduced_hash = flow_hash % CAKE_QUEUES;
+
+	/* set-associative hashing */
+	/* fast path if no hash collision (direct lookup succeeds) */
+	if (likely(q->tags[reduced_hash] == flow_hash &&
+		   q->flows[reduced_hash].set)) {
+		q->way_directs++;
+	} else {
+		u32 inner_hash = reduced_hash % CAKE_SET_WAYS;
+		u32 outer_hash = reduced_hash - inner_hash;
+		u32 i, k;
+		bool allocate_src = false;
+		bool allocate_dst = false;
+
+		/* check if any active queue in the set is reserved for
+		 * this flow.
+		 */
+		for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+		     i++, k = (k + 1) % CAKE_SET_WAYS) {
+			if (q->tags[outer_hash + k] == flow_hash) {
+				if (i)
+					q->way_hits++;
+
+				if (!q->flows[outer_hash + k].set) {
+					/* need to increment host refcnts */
+					allocate_src = cake_dsrc(flow_mode);
+					allocate_dst = cake_ddst(flow_mode);
+				}
+
+				goto found;
+			}
+		}
+
+		/* no queue is reserved for this flow, look for an
+		 * empty one.
+		 */
+		for (i = 0; i < CAKE_SET_WAYS;
+			 i++, k = (k + 1) % CAKE_SET_WAYS) {
+			if (!q->flows[outer_hash + k].set) {
+				q->way_misses++;
+				allocate_src = cake_dsrc(flow_mode);
+				allocate_dst = cake_ddst(flow_mode);
+				goto found;
+			}
+		}
+
+		/* With no empty queues, default to the original
+		 * queue, accept the collision, update the host tags.
+		 */
+		q->way_collisions++;
+		q->hosts[q->flows[reduced_hash].srchost].srchost_refcnt--;
+		q->hosts[q->flows[reduced_hash].dsthost].dsthost_refcnt--;
+		allocate_src = cake_dsrc(flow_mode);
+		allocate_dst = cake_ddst(flow_mode);
+found:
+		/* reserve queue for future packets in same flow */
+		reduced_hash = outer_hash + k;
+		q->tags[reduced_hash] = flow_hash;
+
+		if (allocate_src) {
+			srchost_idx = srchost_hash % CAKE_QUEUES;
+			inner_hash = srchost_idx % CAKE_SET_WAYS;
+			outer_hash = srchost_idx - inner_hash;
+			for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+				i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (q->hosts[outer_hash + k].srchost_tag ==
+				    srchost_hash)
+					goto found_src;
+			}
+			for (i = 0; i < CAKE_SET_WAYS;
+				i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (!q->hosts[outer_hash + k].srchost_refcnt)
+					break;
+			}
+			q->hosts[outer_hash + k].srchost_tag = srchost_hash;
+found_src:
+			srchost_idx = outer_hash + k;
+			q->hosts[srchost_idx].srchost_refcnt++;
+			q->flows[reduced_hash].srchost = srchost_idx;
+		}
+
+		if (allocate_dst) {
+			dsthost_idx = dsthost_hash % CAKE_QUEUES;
+			inner_hash = dsthost_idx % CAKE_SET_WAYS;
+			outer_hash = dsthost_idx - inner_hash;
+			for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+			     i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (q->hosts[outer_hash + k].dsthost_tag ==
+				    dsthost_hash)
+					goto found_dst;
+			}
+			for (i = 0; i < CAKE_SET_WAYS;
+			     i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (!q->hosts[outer_hash + k].dsthost_refcnt)
+					break;
+			}
+			q->hosts[outer_hash + k].dsthost_tag = dsthost_hash;
+found_dst:
+			dsthost_idx = outer_hash + k;
+			q->hosts[dsthost_idx].dsthost_refcnt++;
+			q->flows[reduced_hash].dsthost = dsthost_idx;
+		}
+	}
+
+	return reduced_hash;
+}
+
+/* helper functions: might be changed when/if skbs use a standard list_head */
+/* remove one skb from head of slot queue */
+
+static inline struct sk_buff *dequeue_head(struct cake_flow *flow)
+{
+	struct sk_buff *skb = flow->head;
+
+	if (skb) {
+		flow->head = skb->next;
+		skb->next = NULL;
+
+		if (skb == flow->ackcheck)
+			flow->ackcheck = NULL;
+	}
+
+	return skb;
+}
+
+/* add skb to flow queue (tail add) */
+
+static inline void
+flow_queue_add(struct cake_flow *flow, struct sk_buff *skb)
+{
+	if (!flow->head)
+		flow->head = skb;
+	else
+		flow->tail->next = skb;
+	flow->tail = skb;
+	skb->next = NULL;
+}
+
+static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
+				       struct cake_flow *flow)
+{
+	int seglen;
+	struct sk_buff *skb = flow->tail, *skb_check, *skb_check_prev;
+	struct iphdr *iph, *iph_check;
+	struct ipv6hdr *ipv6h, *ipv6h_check;
+	struct tcphdr *tcph, *tcph_check;
+	bool otherconn_ack_seen = false;
+	struct sk_buff *otherconn_checked_to = NULL;
+	bool thisconn_redundant_seen = false, thisconn_seen_last = false;
+	struct sk_buff *thisconn_checked_to = NULL, *thisconn_ack = NULL;
+	bool aggressive = q->ack_filter == CAKE_ACK_AGGRESSIVE;
+
+	/* no other possible ACKs to filter */
+	if (flow->head == skb)
+		return NULL;
+
+	iph = skb->encapsulation ? inner_ip_hdr(skb) : ip_hdr(skb);
+	ipv6h = skb->encapsulation ? inner_ipv6_hdr(skb) : ipv6_hdr(skb);
+
+	/* check that the innermost network header is v4/v6, and contains TCP */
+	if (iph->version == 4) {
+		if (iph->protocol != IPPROTO_TCP)
+			return NULL;
+		seglen = ntohs(iph->tot_len) - (4 * iph->ihl);
+		tcph = (struct tcphdr *)((void *)iph + (4 * iph->ihl));
+	} else if (ipv6h->version == 6) {
+		if (ipv6h->nexthdr != IPPROTO_TCP)
+			return NULL;
+		seglen = ntohs(ipv6h->payload_len);
+		tcph = (struct tcphdr *)((void *)ipv6h +
+					 sizeof(struct ipv6hdr));
+	} else {
+		return NULL;
+	}
+
+	/* the 'triggering' packet need only have the ACK flag set.
+	 * also check that SYN is not set, as there won't be any previous ACKs.
+	 */
+	if ((tcp_flag_word(tcph) &
+		(TCP_FLAG_ACK | TCP_FLAG_SYN)) != TCP_FLAG_ACK)
+		return NULL;
+
+	/* the 'triggering' ACK is at the end of the queue;
+	 * we have already returned if it is the only packet in the flow.
+	 * stop before the last packet in the queue, so the trigger ACK is
+	 * never compared to itself.
+	 * start where we finished last time if recorded in ->ackcheck,
+	 * otherwise start from the head of the flow queue.
+	 */
+	skb_check_prev = flow->ackcheck;
+	skb_check = flow->ackcheck ?: flow->head;
+
+	while (skb_check->next) {
+		bool pure_ack, thisconn;
+
+		/* don't increment if at head of flow queue (_prev == NULL) */
+		if (skb_check_prev) {
+			skb_check_prev = skb_check;
+			skb_check = skb_check->next;
+			if (!skb_check->next)
+				break;
+		} else {
+			skb_check_prev = ERR_PTR(-1);
+		}
+
+		iph_check = skb_check->encapsulation ?
+			inner_ip_hdr(skb_check) : ip_hdr(skb_check);
+		ipv6h_check = skb_check->encapsulation ?
+			inner_ipv6_hdr(skb_check) : ipv6_hdr(skb_check);
+
+		if (iph_check->version == 4) {
+			if (iph_check->protocol != IPPROTO_TCP)
+				continue;
+			seglen = (ntohs(iph_check->tot_len) -
+				  (4 * iph_check->ihl));
+			tcph_check = (struct tcphdr *)((void *)iph_check
+				+ (4 * iph_check->ihl));
+			thisconn = (iph->version == 4 &&
+				    iph_check->saddr == iph->saddr &&
+				    iph_check->daddr == iph->daddr);
+		} else if (ipv6h_check->version == 6) {
+			if (ipv6h_check->nexthdr != IPPROTO_TCP)
+				continue;
+			seglen = ntohs(ipv6h_check->payload_len);
+			tcph_check = (struct tcphdr *)((void *)ipv6h_check +
+				     sizeof(struct ipv6hdr));
+			thisconn = (ipv6h->version == 6 &&
+				    ipv6_addr_equal(&ipv6h_check->saddr,
+						    &ipv6h->saddr) &&
+				    ipv6_addr_equal(&ipv6h_check->daddr,
+						    &ipv6h->daddr));
+		} else {
+			continue;
+		}
+
+		/* stricter criteria apply to ACKs that we may filter:
+		 * the ACK flag must be set, and URG/PSH/RST/SYN/FIN unset;
+		 * ECE/CWR/NS can be safely ignored, but the three reserved
+		 * flag bits must also be unset to avoid future breakage.
+		 * 0x0FFF0000 = all TCP flag bits (check ACK=1, others zero)
+		 * 0x01C00000 = NS/CWR/ECE (safe to ignore)
+		 * 0x0E3F0000 = 0x0FFF0000 & ~0x01C00000
+		 * the ACK must also be 'pure', carrying zero bytes of
+		 * segment data; TCP options are ignored.
+		 */
+		if ((tcp_flag_word(tcph_check) &
+			(TCP_FLAG_ACK | TCP_FLAG_SYN)) != TCP_FLAG_ACK) {
+			continue;
+		} else if (((tcp_flag_word(tcph_check) &
+				cpu_to_be32(0x0E3F0000)) != TCP_FLAG_ACK) ||
+			   ((seglen - 4 * tcph_check->doff) != 0)) {
+			pure_ack = false;
+		} else {
+			pure_ack = true;
+		}
+
+		/* if we find an ACK belonging to a different connection,
+		 * continue checking for other ACKs this round, but restart
+		 * checking from the other connection next time.
+		 */
+		if (thisconn && (tcph_check->source != tcph->source ||
+				 tcph_check->dest != tcph->dest))
+			thisconn = false;
+
+		/* an ACK can only be filtered if the trigger acknowledges
+		 * at least as much data, so skip queued ACKs with a newer
+		 * ack sequence than the trigger's.
+		 */
+		if (thisconn &&
+		    (ntohl(tcph_check->ack_seq) > ntohl(tcph->ack_seq)))
+			continue;
+
+		/* DupACKs with an equal sequence number shouldn't be filtered,
+		 * but we can filter if the triggering packet is a SACK
+		 */
+		if (thisconn &&
+		    (ntohl(tcph_check->ack_seq) == ntohl(tcph->ack_seq))) {
+			/* inspired by tcp_parse_options in tcp_input.c */
+			bool sack = false;
+			int length = (tcph->doff * 4) - sizeof(struct tcphdr);
+			const u8 *ptr = (const u8 *)(tcph + 1);
+
+			while (length > 0) {
+				int opcode = *ptr++;
+				int opsize;
+
+				if (opcode == TCPOPT_EOL)
+					break;
+				if (opcode == TCPOPT_NOP) {
+					length--;
+					continue;
+				}
+				opsize = *ptr++;
+				if (opsize < 2 || opsize > length)
+					break;
+				if (opcode == TCPOPT_SACK) {
+					sack = true;
+					break;
+				}
+				ptr += opsize - 2;
+				length -= opsize;
+			}
+			if (!sack)
+				continue;
+		}
+
+		/* somewhat complicated control flow for 'conservative'
+		 * ACK filtering that aims to be more polite during slow-start
+		 * and in the presence of packet loss.
+		 * does not filter unless at least one 'redundant' ACK is also
+		 * in the queue. 'data' ACKs won't be filtered, but do count
+		 * as redundant ACKs.
+		 */
+		if (thisconn) {
+			thisconn_seen_last = true;
+			/* if aggressive and this is a data ack we can skip
+			 * checking it next time.
+			 */
+			thisconn_checked_to = (aggressive && !pure_ack) ?
+				skb_check : skb_check_prev;
+			/* the first pure ack for this connection.
+			 * record where it is, but only break if aggressive
+			 * or we have already seen a data ack from the same
+			 * connection
+			 */
+			if (pure_ack && !thisconn_ack) {
+				thisconn_ack = skb_check_prev;
+				if (aggressive || thisconn_redundant_seen)
+					break;
+			/* data ack or subsequent pure ack */
+			} else {
+				thisconn_redundant_seen = true;
+				/* this is the second ack for this connection
+				 * break to filter the first pure ack
+				 */
+				if (thisconn_ack)
+					break;
+			}
+		/* track packets from non-matching tcp connections that will
+		 * need evaluation on the next run.
+		 * if there are packets from both the matching connection and
+		 * others that require checking next run, track whichever was
+		 * updated last and return the older of the two to ensure full
+		 * coverage.
+		 * if a non-matching pure ack has been seen, we cannot skip any
+		 * further on the next run, so don't update.
+		 */
+		} else if (!otherconn_ack_seen) {
+			thisconn_seen_last = false;
+			if (pure_ack) {
+				otherconn_ack_seen = true;
+				/* if aggressive we don't care about old data,
+				 * start from the pure ack.
+				 * otherwise if there is a previous data ack,
+				 * start checking from it next time.
+				 */
+				if (aggressive || !otherconn_checked_to)
+					otherconn_checked_to = skb_check_prev;
+			} else {
+				otherconn_checked_to = aggressive ?
+					skb_check : skb_check_prev;
+			}
+		}
+	}
+
+	/* skb_check is reused at this point
+	 * it is the pure ACK to be filtered (if any)
+	 */
+	skb_check = NULL;
+
+	/* next time, start checking from the older (nearest the head) of the
+	 * unfiltered-but-important tcp packets seen for this connection and
+	 * for other connections.
+	 * if none were seen, start after the last packet evaluated in the loop.
+	 */
+	if (thisconn_checked_to && otherconn_checked_to)
+		flow->ackcheck = thisconn_seen_last ?
+			otherconn_checked_to : thisconn_checked_to;
+	else if (thisconn_checked_to)
+		flow->ackcheck = thisconn_checked_to;
+	else if (otherconn_checked_to)
+		flow->ackcheck = otherconn_checked_to;
+	else
+		flow->ackcheck = skb_check_prev;
+
+	/* if filtering, unlink the pure ACK from the flow queue */
+	if (thisconn_ack && (aggressive || thisconn_redundant_seen)) {
+		if (PTR_ERR(thisconn_ack) == -1) {
+			skb_check = flow->head;
+			flow->head = flow->head->next;
+		} else {
+			skb_check = thisconn_ack->next;
+			thisconn_ack->next = thisconn_ack->next->next;
+		}
+	}
+
+	/* we just filtered that ack, fix up the list */
+	if (flow->ackcheck == skb_check)
+		flow->ackcheck = thisconn_ack;
+	/* check the entire flow queue next time */
+	if (PTR_ERR(flow->ackcheck) == -1)
+		flow->ackcheck = NULL;
+
+	return skb_check;
+}
+
+static inline cobalt_time_t cake_ewma(cobalt_time_t avg, cobalt_time_t sample,
+				      u32 shift)
+{
+	avg -= avg >> shift;
+	avg += sample >> shift;
+	return avg;
+}
+
+static inline u32 cake_overhead(struct cake_sched_data *q, struct sk_buff *skb)
+{
+	u32 len = qdisc_pkt_len(skb);
+	u32 off = skb_network_offset(skb);
+
+	q->avg_trnoff = cake_ewma(q->avg_trnoff, off << 16, 8);
+
+	if (q->rate_flags & CAKE_FLAG_OVERHEAD)
+		len -= off;
+
+	if (q->max_netlen < len)
+		q->max_netlen = len;
+	if (q->min_netlen > len)
+		q->min_netlen = len;
+
+	len += q->rate_overhead;
+
+	if (len < q->rate_mpu)
+		len = q->rate_mpu;
+
+	if (q->atm_mode == CAKE_ATM_ATM) {
+		len += 47;
+		len /= 48;
+		len *= 53;
+	} else if (q->atm_mode == CAKE_ATM_PTM) {
+		/* Add one byte per 64 bytes or part thereof.
+		 * This is conservative and easier to calculate than the
+		 * precise value.
+		 */
+		len += (len + 63) / 64;
+	}
+
+	if (q->max_adjlen < len)
+		q->max_adjlen = len;
+	if (q->min_adjlen > len)
+		q->min_adjlen = len;
+
+	get_cobalt_cb(skb)->adjusted_len = len;
+	return len;
+}
+
+static inline void cake_heap_swap(struct cake_sched_data *q, u16 i, u16 j)
+{
+	struct cake_heap_entry ii = q->overflow_heap[i];
+	struct cake_heap_entry jj = q->overflow_heap[j];
+
+	q->overflow_heap[i] = jj;
+	q->overflow_heap[j] = ii;
+
+	q->tins[ii.t].overflow_idx[ii.b] = j;
+	q->tins[jj.t].overflow_idx[jj.b] = i;
+}
+
+static inline u32 cake_heap_get_backlog(const struct cake_sched_data *q, u16 i)
+{
+	struct cake_heap_entry ii = q->overflow_heap[i];
+
+	return q->tins[ii.t].backlogs[ii.b];
+}
+
+static void cake_heapify(struct cake_sched_data *q, u16 i)
+{
+	static const u32 a = CAKE_MAX_TINS * CAKE_QUEUES;
+	u32 m = i;
+	u32 mb = cake_heap_get_backlog(q, m);
+
+	while (m < a) {
+		u32 l = m + m + 1;
+		u32 r = l + 1;
+
+		if (l < a) {
+			u32 lb = cake_heap_get_backlog(q, l);
+
+			if (lb > mb) {
+				m  = l;
+				mb = lb;
+			}
+		}
+
+		if (r < a) {
+			u32 rb = cake_heap_get_backlog(q, r);
+
+			if (rb > mb) {
+				m  = r;
+				mb = rb;
+			}
+		}
+
+		if (m != i) {
+			cake_heap_swap(q, i, m);
+			i = m;
+		} else {
+			break;
+		}
+	}
+}
+
+static void cake_heapify_up(struct cake_sched_data *q, u16 i)
+{
+	while (i > 0 && i < CAKE_MAX_TINS * CAKE_QUEUES) {
+		u16 p = (i - 1) >> 1;
+		u32 ib = cake_heap_get_backlog(q, i);
+		u32 pb = cake_heap_get_backlog(q, p);
+
+		if (ib > pb) {
+			cake_heap_swap(q, i, p);
+			i = p;
+		} else {
+			break;
+		}
+	}
+}
+
+static int cake_advance_shaper(struct cake_sched_data *q,
+			       struct cake_tin_data *b,
+			       struct sk_buff *skb,
+			       u64 now, bool drop)
+{
+	u32 len = get_cobalt_cb(skb)->adjusted_len;
+
+	/* charge packet bandwidth to this tin
+	 * and to the global shaper.
+	 */
+	if (q->rate_ns) {
+		s64 tdiff1 = b->tin_time_next_packet - now;
+		s64 tdiff2 = (len * (u64)b->tin_rate_ns) >> b->tin_rate_shft;
+		s64 tdiff3 = (len * (u64)q->rate_ns) >> q->rate_shft;
+		s64 tdiff4 = tdiff3 + (tdiff3 >> 1);
+
+		if (tdiff1 < 0)
+			b->tin_time_next_packet += tdiff2;
+		else if (tdiff1 < tdiff2)
+			b->tin_time_next_packet = now + tdiff2;
+
+		q->time_next_packet += tdiff3;
+		if (!drop)
+			q->failsafe_next_packet += tdiff4;
+	}
+	return len;
+}
+
+static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	u32 idx = 0, tin = 0, len;
+	struct cake_tin_data *b;
+	struct cake_flow *flow;
+	struct cake_heap_entry qq;
+	u64 now = cobalt_get_time();
+
+	if (!q->overflow_timeout) {
+		int i;
+		/* Build fresh max-heap */
+		for (i = CAKE_MAX_TINS * CAKE_QUEUES / 2; i >= 0; i--)
+			cake_heapify(q, i);
+	}
+	q->overflow_timeout = 65535;
+
+	/* select longest queue for pruning */
+	qq  = q->overflow_heap[0];
+	tin = qq.t;
+	idx = qq.b;
+
+	b = &q->tins[tin];
+	flow = &b->flows[idx];
+	skb = dequeue_head(flow);
+	if (unlikely(!skb)) {
+		/* heap has gone wrong, rebuild it next time */
+		q->overflow_timeout = 0;
+		return idx + (tin << 16);
+	}
+
+	if (cobalt_queue_full(&flow->cvars, &b->cparams, now))
+		b->unresponsive_flow_count++;
+
+	len = qdisc_pkt_len(skb);
+	q->buffer_used      -= skb->truesize;
+	b->backlogs[idx]    -= len;
+	b->tin_backlog      -= len;
+	sch->qstats.backlog -= len;
+	qdisc_tree_reduce_backlog(sch, 1, len);
+
+	b->tin_dropped++;
+	sch->qstats.drops++;
+
+	if (q->rate_flags & CAKE_FLAG_INGRESS)
+		cake_advance_shaper(q, b, skb, now, true);
+
+	__qdisc_drop(skb, to_free);
+	sch->q.qlen--;
+
+	cake_heapify(q, 0);
+
+	return idx + (tin << 16);
+}
+
+static inline void cake_wash_diffserv(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	case htons(ETH_P_IPV6):
+		ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	default:
+		break;
+	}
+}
+
+static inline u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
+{
+	u8 dscp;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		dscp = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_IPV6):
+		dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_ARP):
+		return 0x38;  /* CS7 - Net Control */
+
+	default:
+		/* If there is no Diffserv field, treat as best-effort */
+		return 0;
+	}
+}
+
+static void cake_reconfigure(struct Qdisc *sch);
+
+static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			struct sk_buff **to_free)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 idx, tin;
+	struct cake_tin_data *b;
+	struct cake_flow *flow;
+	/* signed len to handle corner case filtered ACK larger than trigger */
+	int len = qdisc_pkt_len(skb);
+	u64 now = cobalt_get_time();
+	struct sk_buff *ack = NULL;
+
+	/* extract the Diffserv Precedence field, if it exists */
+	/* and clear DSCP bits if washing */
+	if (q->tin_mode != CAKE_DIFFSERV_BESTEFFORT) {
+		tin = q->tin_index[cake_handle_diffserv(skb,
+				q->rate_flags & CAKE_FLAG_WASH)];
+		if (unlikely(tin >= q->tin_cnt))
+			tin = 0;
+	} else {
+		tin = 0;
+		if (q->rate_flags & CAKE_FLAG_WASH)
+			cake_wash_diffserv(skb);
+	}
+
+	b = &q->tins[tin];
+
+	/* choose flow to insert into */
+	idx = cake_hash(b, skb, q->flow_mode);
+	flow = &b->flows[idx];
+
+	/* ensure shaper state isn't stale */
+	if (!b->tin_backlog) {
+		if (b->tin_time_next_packet < now)
+			b->tin_time_next_packet = now;
+
+		if (!sch->q.qlen) {
+			if (q->time_next_packet < now) {
+				q->failsafe_next_packet = now;
+				q->time_next_packet = now;
+			} else if (q->time_next_packet > now &&
+				   q->failsafe_next_packet > now) {
+				u64 next = min(q->time_next_packet,
+					       q->failsafe_next_packet);
+				sch->qstats.overlimits++;
+				qdisc_watchdog_schedule_ns(&q->watchdog, next);
+			}
+		}
+	}
+
+	if (unlikely(len > b->max_skblen))
+		b->max_skblen = len;
+
+	/* Split GSO aggregates if they're likely to impair flow isolation
+	 * or if we need to know individual packet sizes for framing overhead.
+	 */
+
+	if (skb_is_gso(skb)) {
+		struct sk_buff *segs, *nskb;
+		netdev_features_t features = netif_skb_features(skb);
+		/* signed slen to handle corner case
+		 * suppressed ACK larger than trigger
+		 */
+		int slen = 0;
+
+		segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+		if (IS_ERR_OR_NULL(segs))
+			return qdisc_drop(skb, sch, to_free);
+
+		while (segs) {
+			nskb = segs->next;
+			segs->next = NULL;
+			qdisc_skb_cb(segs)->pkt_len = segs->len;
+			cobalt_set_enqueue_time(segs, now);
+			get_cobalt_cb(segs)->adjusted_len = cake_overhead(q,
+									  segs);
+			flow_queue_add(flow, segs);
+
+			if (q->ack_filter)
+				ack = cake_ack_filter(q, flow);
+
+			if (ack) {
+				b->ack_drops++;
+				sch->qstats.drops++;
+				b->bytes += ack->len;
+				slen += segs->len - ack->len;
+				q->buffer_used += segs->truesize -
+					ack->truesize;
+				if (q->rate_flags & CAKE_FLAG_INGRESS)
+					cake_advance_shaper(q, b, ack,
+							    now, true);
+
+				qdisc_tree_reduce_backlog(sch, 1,
+							  qdisc_pkt_len(ack));
+				consume_skb(ack);
+			} else {
+				sch->q.qlen++;
+				slen += segs->len;
+				q->buffer_used += segs->truesize;
+			}
+			b->packets++;
+			segs = nskb;
+		}
+		/* stats */
+		b->bytes	    += slen;
+		b->backlogs[idx]    += slen;
+		b->tin_backlog      += slen;
+		sch->qstats.backlog += slen;
+		q->avg_window_bytes += slen;
+
+		qdisc_tree_reduce_backlog(sch, 1, len);
+		consume_skb(skb);
+	} else {
+		/* not splitting */
+		cobalt_set_enqueue_time(skb, now);
+		get_cobalt_cb(skb)->adjusted_len = cake_overhead(q, skb);
+		flow_queue_add(flow, skb);
+
+		if (q->ack_filter)
+			ack = cake_ack_filter(q, flow);
+
+		if (ack) {
+			b->ack_drops++;
+			sch->qstats.drops++;
+			b->bytes += qdisc_pkt_len(ack);
+			len -= qdisc_pkt_len(ack);
+			q->buffer_used += skb->truesize - ack->truesize;
+			if (q->rate_flags & CAKE_FLAG_INGRESS)
+				cake_advance_shaper(q, b, ack, now, true);
+
+			qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
+			consume_skb(ack);
+		} else {
+			sch->q.qlen++;
+			q->buffer_used      += skb->truesize;
+		}
+		/* stats */
+		b->packets++;
+		b->bytes	    += len;
+		b->backlogs[idx]    += len;
+		b->tin_backlog      += len;
+		sch->qstats.backlog += len;
+		q->avg_window_bytes += len;
+	}
+
+	if (q->overflow_timeout)
+		cake_heapify_up(q, b->overflow_idx[idx]);
+
+	/* incoming bandwidth capacity estimate */
+	if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS) {
+		u64 packet_interval = now - q->last_packet_time;
+
+		if (packet_interval > NSEC_PER_SEC)
+			packet_interval = NSEC_PER_SEC;
+
+		/* filter out short-term bursts, e.g. wifi aggregation */
+		q->avg_packet_interval = cake_ewma(q->avg_packet_interval,
+						   packet_interval,
+			packet_interval > q->avg_packet_interval ? 2 : 8);
+
+		q->last_packet_time = now;
+
+		if (packet_interval > q->avg_packet_interval) {
+			u64 window_interval = now - q->avg_window_begin;
+			u64 b = q->avg_window_bytes * (u64)NSEC_PER_SEC;
+
+			do_div(b, window_interval);
+			q->avg_peak_bandwidth =
+				cake_ewma(q->avg_peak_bandwidth, b,
+					  b > q->avg_peak_bandwidth ? 2 : 8);
+			q->avg_window_bytes = 0;
+			q->avg_window_begin = now;
+
+			if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS &&
+			    now - q->last_reconfig_time >
+				(NSEC_PER_SEC / 4)) {
+				q->rate_bps = (q->avg_peak_bandwidth * 15) >> 4;
+				cake_reconfigure(sch);
+			}
+		}
+	} else {
+		q->avg_window_bytes = 0;
+		q->last_packet_time = now;
+	}
+
+	/* flowchain */
+	if (!flow->set || flow->set == CAKE_SET_DECAYING) {
+		struct cake_host *srchost = &b->hosts[flow->srchost];
+		struct cake_host *dsthost = &b->hosts[flow->dsthost];
+		u16 host_load = 1;
+
+		if (!flow->set) {
+			list_add_tail(&flow->flowchain, &b->new_flows);
+		} else {
+			b->decaying_flow_count--;
+			list_move_tail(&flow->flowchain, &b->new_flows);
+		}
+		flow->set = CAKE_SET_SPARSE;
+		b->sparse_flow_count++;
+
+		if (cake_dsrc(q->flow_mode))
+			host_load = max(host_load, srchost->srchost_refcnt);
+
+		if (cake_ddst(q->flow_mode))
+			host_load = max(host_load, dsthost->dsthost_refcnt);
+
+		flow->deficit = (b->flow_quantum *
+				 quantum_div[host_load]) >> 16;
+	} else if (flow->set == CAKE_SET_SPARSE_WAIT) {
+		/* this flow was empty, accounted as a sparse flow, but actually
+		 * in the bulk rotation.
+		 */
+		flow->set = CAKE_SET_BULK;
+		b->sparse_flow_count--;
+		b->bulk_flow_count++;
+	}
+
+	if (q->buffer_used > q->buffer_max_used)
+		q->buffer_max_used = q->buffer_used;
+
+	if (q->buffer_used > q->buffer_limit) {
+		u32 dropped = 0;
+
+		while (q->buffer_used > q->buffer_limit) {
+			dropped++;
+			cake_drop(sch, to_free);
+		}
+		b->drop_overlimit += dropped;
+	}
+	return NET_XMIT_SUCCESS;
+}
+
+static struct sk_buff *cake_dequeue_one(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct cake_tin_data *b = &q->tins[q->cur_tin];
+	struct cake_flow *flow = &b->flows[q->cur_flow];
+	struct sk_buff *skb = NULL;
+	u32 len;
+
+	/* WARN_ON(flow != container_of(vars, struct cake_flow, cvars)); */
+
+	if (flow->head) {
+		skb = dequeue_head(flow);
+		len = qdisc_pkt_len(skb);
+		b->backlogs[q->cur_flow] -= len;
+		b->tin_backlog		 -= len;
+		sch->qstats.backlog      -= len;
+		q->buffer_used		 -= skb->truesize;
+		sch->q.qlen--;
+
+		if (q->overflow_timeout)
+			cake_heapify(q, b->overflow_idx[q->cur_flow]);
+	}
+	return skb;
+}
+
+/* Discard leftover packets from a tin no longer in use. */
+static void cake_clear_tin(struct Qdisc *sch, u16 tin)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	q->cur_tin = tin;
+	for (q->cur_flow = 0; q->cur_flow < CAKE_QUEUES; q->cur_flow++)
+		while (!!(skb = cake_dequeue_one(sch)))
+			kfree_skb(skb);
+}
+
+static struct sk_buff *cake_dequeue(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	struct cake_tin_data *b = &q->tins[q->cur_tin];
+	struct cake_flow *flow;
+	struct cake_host *srchost, *dsthost;
+	struct list_head *head;
+	u32 len;
+	u16 host_load;
+	cobalt_time_t now = ktime_get_ns();
+	cobalt_time_t delay;
+	bool first_flow = true;
+
+begin:
+	if (!sch->q.qlen)
+		return NULL;
+
+	/* global hard shaper */
+	if (q->time_next_packet > now && q->failsafe_next_packet > now) {
+		u64 next = min(q->time_next_packet, q->failsafe_next_packet);
+
+		sch->qstats.overlimits++;
+		qdisc_watchdog_schedule_ns(&q->watchdog, next);
+		return NULL;
+	}
+
+	/* Choose a class to work on. */
+	if (!q->rate_ns) {
+		/* In unlimited mode, can't rely on shaper timings, just balance
+		 * with DRR
+		 */
+		while (b->tin_deficit < 0 ||
+		       !(b->sparse_flow_count + b->bulk_flow_count)) {
+			if (b->tin_deficit <= 0)
+				b->tin_deficit += b->tin_quantum_band;
+
+			q->cur_tin++;
+			b++;
+			if (q->cur_tin >= q->tin_cnt) {
+				q->cur_tin = 0;
+				b = q->tins;
+			}
+		}
+	} else {
+		/* In shaped mode, choose:
+		 * - Highest-priority tin with queue and meeting schedule, or
+		 * - The earliest-scheduled tin with queue.
+		 */
+		int tin, best_tin = 0;
+		s64 best_time = 0xFFFFFFFFFFFFUL;
+
+		for (tin = 0; tin < q->tin_cnt; tin++) {
+			b = q->tins + tin;
+			if ((b->sparse_flow_count + b->bulk_flow_count) > 0) {
+				s64 tdiff = b->tin_time_next_packet - now;
+
+				if (tdiff <= 0 || tdiff <= best_time) {
+					best_time = tdiff;
+					best_tin = tin;
+				}
+			}
+		}
+
+		q->cur_tin = best_tin;
+		b = q->tins + best_tin;
+	}
+
+retry:
+	/* service this class */
+	head = &b->decaying_flows;
+	if (!first_flow || list_empty(head)) {
+		head = &b->new_flows;
+		if (list_empty(head)) {
+			head = &b->old_flows;
+			if (unlikely(list_empty(head))) {
+				head = &b->decaying_flows;
+				if (unlikely(list_empty(head)))
+					goto begin;
+			}
+		}
+	}
+	flow = list_first_entry(head, struct cake_flow, flowchain);
+	q->cur_flow = flow - b->flows;
+	first_flow = false;
+
+	/* triple isolation (modified DRR++) */
+	srchost = &b->hosts[flow->srchost];
+	dsthost = &b->hosts[flow->dsthost];
+	host_load = 1;
+
+	if (cake_dsrc(q->flow_mode))
+		host_load = max(host_load, srchost->srchost_refcnt);
+
+	if (cake_ddst(q->flow_mode))
+		host_load = max(host_load, dsthost->dsthost_refcnt);
+
+	WARN_ON(host_load > CAKE_QUEUES);
+
+	/* flow isolation (DRR++) */
+	if (flow->deficit <= 0) {
+		/* The shifted prandom_u32() is a way to apply dithering to
+		 * avoid accumulating roundoff errors
+		 */
+		flow->deficit += (b->flow_quantum * quantum_div[host_load] +
+				  (prandom_u32() >> 16)) >> 16;
+		list_move_tail(&flow->flowchain, &b->old_flows);
+
+		/* Keep all flows with deficits out of the sparse and decaying
+		 * rotations.  No non-empty flow can go into the decaying
+		 * rotation, so they can't get deficits
+		 */
+		if (flow->set == CAKE_SET_SPARSE) {
+			if (flow->head) {
+				b->sparse_flow_count--;
+				b->bulk_flow_count++;
+				flow->set = CAKE_SET_BULK;
+			} else {
+				/* we've moved it to the bulk rotation for
+				 * correct deficit accounting but we still want
+				 * to count it as a sparse flow, not a bulk one.
+				 */
+				flow->set = CAKE_SET_SPARSE_WAIT;
+			}
+		}
+		goto retry;
+	}
+
+	/* Retrieve a packet via the AQM */
+	while (1) {
+		skb = cake_dequeue_one(sch);
+		if (!skb) {
+			/* this queue was actually empty */
+			if (cobalt_queue_empty(&flow->cvars, &b->cparams, now))
+				b->unresponsive_flow_count--;
+
+			if (flow->cvars.p_drop || flow->cvars.count ||
+			    now < flow->cvars.drop_next) {
+				/* keep in the flowchain until the state has
+				 * decayed to rest
+				 */
+				list_move_tail(&flow->flowchain,
+					       &b->decaying_flows);
+				if (flow->set == CAKE_SET_BULK) {
+					b->bulk_flow_count--;
+					b->decaying_flow_count++;
+				} else if (flow->set == CAKE_SET_SPARSE ||
+					   flow->set == CAKE_SET_SPARSE_WAIT) {
+					b->sparse_flow_count--;
+					b->decaying_flow_count++;
+				}
+				flow->set = CAKE_SET_DECAYING;
+			} else {
+				/* remove empty queue from the flowchain */
+				list_del_init(&flow->flowchain);
+				if (flow->set == CAKE_SET_SPARSE ||
+				    flow->set == CAKE_SET_SPARSE_WAIT)
+					b->sparse_flow_count--;
+				else if (flow->set == CAKE_SET_BULK)
+					b->bulk_flow_count--;
+				else
+					b->decaying_flow_count--;
+
+				flow->set = CAKE_SET_NONE;
+				srchost->srchost_refcnt--;
+				dsthost->dsthost_refcnt--;
+			}
+			goto begin;
+		}
+
+		/* Last packet in queue may be marked, shouldn't be dropped */
+		if (!cobalt_should_drop(&flow->cvars, &b->cparams, now, skb,
+					(b->bulk_flow_count *
+					 !!(q->rate_flags & CAKE_FLAG_INGRESS))) ||
+		    !flow->head)
+			break;
+
+		/* drop this packet, get another one */
+		if (q->rate_flags & CAKE_FLAG_INGRESS) {
+			len = cake_advance_shaper(q, b, skb,
+						  now, true);
+			flow->deficit -= len;
+			b->tin_deficit -= len;
+		}
+		b->tin_dropped++;
+		qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
+		qdisc_qstats_drop(sch);
+		kfree_skb(skb);
+		if (q->rate_flags & CAKE_FLAG_INGRESS)
+			goto retry;
+	}
+
+	b->tin_ecn_mark += !!flow->cvars.ecn_marked;
+	qdisc_bstats_update(sch, skb);
+
+	/* collect delay stats */
+	delay = now - cobalt_get_enqueue_time(skb);
+	b->avge_delay = cake_ewma(b->avge_delay, delay, 8);
+	b->peak_delay = cake_ewma(b->peak_delay, delay,
+				  delay > b->peak_delay ? 2 : 8);
+	b->base_delay = cake_ewma(b->base_delay, delay,
+				  delay < b->base_delay ? 2 : 8);
+
+	len = cake_advance_shaper(q, b, skb, now, false);
+	flow->deficit -= len;
+	b->tin_deficit -= len;
+
+	if (q->time_next_packet > now && sch->q.qlen) {
+		u64 next = min(q->time_next_packet, q->failsafe_next_packet);
+
+		qdisc_watchdog_schedule_ns(&q->watchdog, next);
+	} else if (!sch->q.qlen) {
+		int i;
+
+		for (i = 0; i < q->tin_cnt; i++) {
+			if (q->tins[i].decaying_flow_count) {
+				u64 next = now + q->tins[i].cparams.target;
+
+				qdisc_watchdog_schedule_ns(&q->watchdog, next);
+				break;
+			}
+		}
+	}
+
+	if (q->overflow_timeout)
+		q->overflow_timeout--;
+
+	return skb;
+}
+
+static void cake_reset(struct Qdisc *sch)
+{
+	u32 c;
+
+	for (c = 0; c < CAKE_MAX_TINS; c++)
+		cake_clear_tin(sch, c);
+}
+
+static const struct nla_policy cake_policy[TCA_CAKE_MAX + 1] = {
+	[TCA_CAKE_BASE_RATE]     = { .type = NLA_U32 },
+	[TCA_CAKE_DIFFSERV_MODE] = { .type = NLA_U32 },
+	[TCA_CAKE_ATM]		 = { .type = NLA_U32 },
+	[TCA_CAKE_FLOW_MODE]     = { .type = NLA_U32 },
+	[TCA_CAKE_OVERHEAD]      = { .type = NLA_S32 },
+	[TCA_CAKE_RTT]		 = { .type = NLA_U32 },
+	[TCA_CAKE_TARGET]	 = { .type = NLA_U32 },
+	[TCA_CAKE_AUTORATE]      = { .type = NLA_U32 },
+	[TCA_CAKE_MEMORY]	 = { .type = NLA_U32 },
+	[TCA_CAKE_NAT]		 = { .type = NLA_U32 },
+	[TCA_CAKE_RAW]       = { .type = NLA_U32 },
+	[TCA_CAKE_WASH]		 = { .type = NLA_U32 },
+	[TCA_CAKE_MPU]		 = { .type = NLA_U32 },
+	[TCA_CAKE_INGRESS]	 = { .type = NLA_U32 },
+	[TCA_CAKE_ACK_FILTER]	 = { .type = NLA_U32 },
+};
+
+static void cake_set_rate(struct cake_tin_data *b, u64 rate, u32 mtu,
+			  cobalt_time_t ns_target, cobalt_time_t rtt_est_ns)
+{
+	/* convert byte-rate into time-per-byte
+	 * so it will always unwedge in reasonable time.
+	 */
+	static const u64 MIN_RATE = 64;
+	u64 rate_ns = 0;
+	u8  rate_shft = 0;
+	cobalt_time_t byte_target_ns;
+	u32 byte_target = mtu;
+
+	b->flow_quantum = 1514;
+	if (rate) {
+		b->flow_quantum = max(min(rate >> 12, 1514ULL), 300ULL);
+		rate_shft = 32;
+		rate_ns = ((u64)NSEC_PER_SEC) << rate_shft;
+		do_div(rate_ns, max(MIN_RATE, rate));
+		while (!!(rate_ns >> 32)) {
+			rate_ns >>= 1;
+			rate_shft--;
+		}
+	} /* else unlimited, ie. zero delay */
+
+	b->tin_rate_bps  = rate;
+	b->tin_rate_ns   = rate_ns;
+	b->tin_rate_shft = rate_shft;
+
+	byte_target_ns = (byte_target * rate_ns) >> rate_shft;
+
+	b->cparams.target = max((byte_target_ns * 3) / 2, ns_target);
+	b->cparams.interval = max(rtt_est_ns +
+				     b->cparams.target - ns_target,
+				     b->cparams.target * 2);
+	b->cparams.mtu_time = byte_target_ns;
+	b->cparams.p_inc = 1 << 24; /* 1/256 */
+	b->cparams.p_dec = 1 << 20; /* 1/4096 */
+}
+
+static int cake_config_besteffort(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct cake_tin_data *b = &q->tins[0];
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+
+	q->tin_cnt = 1;
+
+	q->tin_index = besteffort;
+	q->tin_order = normal_order;
+
+	cake_set_rate(b, rate, mtu, US2TIME(q->target), US2TIME(q->interval));
+	b->tin_quantum_band = 65535;
+	b->tin_quantum_prio = 65535;
+
+	return 0;
+}
+
+static int cake_config_precedence(struct Qdisc *sch)
+{
+	/* convert high-level (user visible) parameters into internal format */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+	q->tin_index = precedence;
+	q->tin_order = normal_order;
+
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, US2TIME(q->target),
+			      US2TIME(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+/*	List of known Diffserv codepoints:
+ *
+ *	Least Effort (CS1)
+ *	Best Effort (CS0)
+ *	Max Reliability & LLT "Lo" (TOS1)
+ *	Max Throughput (TOS2)
+ *	Min Delay (TOS4)
+ *	LLT "La" (TOS5)
+ *	Assured Forwarding 1 (AF1x) - x3
+ *	Assured Forwarding 2 (AF2x) - x3
+ *	Assured Forwarding 3 (AF3x) - x3
+ *	Assured Forwarding 4 (AF4x) - x3
+ *	Precedence Class 2 (CS2)
+ *	Precedence Class 3 (CS3)
+ *	Precedence Class 4 (CS4)
+ *	Precedence Class 5 (CS5)
+ *	Precedence Class 6 (CS6)
+ *	Precedence Class 7 (CS7)
+ *	Voice Admit (VA)
+ *	Expedited Forwarding (EF)
+ *
+ *	Total 25 codepoints.
+ */
+
+/*	List of traffic classes in RFC 4594:
+ *		(roughly descending order of contended priority)
+ *		(roughly ascending order of uncontended throughput)
+ *
+ *	Network Control (CS6,CS7)      - routing traffic
+ *	Telephony (EF,VA)         - aka. VoIP streams
+ *	Signalling (CS5)               - VoIP setup
+ *	Multimedia Conferencing (AF4x) - aka. video calls
+ *	Realtime Interactive (CS4)     - eg. games
+ *	Multimedia Streaming (AF3x)    - eg. YouTube, NetFlix, Twitch
+ *	Broadcast Video (CS3)
+ *	Low Latency Data (AF2x,TOS4)      - eg. database
+ *	Ops, Admin, Management (CS2,TOS1) - eg. ssh
+ *	Standard Service (CS0 & unrecognised codepoints)
+ *	High Throughput Data (AF1x,TOS2)  - eg. web traffic
+ *	Low Priority Data (CS1)           - eg. BitTorrent
+ *
+ *	Total 12 traffic classes.
+ */
+
+static int cake_config_diffserv8(struct Qdisc *sch)
+{
+/*	Pruned list of traffic classes for typical applications:
+ *
+ *		Network Control          (CS6, CS7)
+ *		Minimum Latency          (EF, VA, CS5, CS4)
+ *		Interactive Shell        (CS2, TOS1)
+ *		Low Latency Transactions (AF2x, TOS4)
+ *		Video Streaming          (AF4x, AF3x, CS3)
+ *		Bog Standard             (CS0 etc.)
+ *		High Throughput          (AF1x, TOS2)
+ *		Background Traffic       (CS1)
+ *
+ *		Total 8 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv8;
+	q->tin_order = normal_order;
+
+	/* class characteristics */
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, US2TIME(q->target),
+			      US2TIME(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+static int cake_config_diffserv4(struct Qdisc *sch)
+{
+/*  Further pruned list of traffic classes for four-class system:
+ *
+ *	    Latency Sensitive  (CS7, CS6, EF, VA, CS5, CS4)
+ *	    Streaming Media    (AF4x, AF3x, CS3, AF2x, TOS4, CS2, TOS1)
+ *	    Best Effort        (CS0, AF1x, TOS2, and those not specified)
+ *	    Background Traffic (CS1)
+ *
+ *		Total 4 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum = 1024;
+
+	q->tin_cnt = 4;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv4;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 1, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[3], rate >> 2, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 2;
+	q->tins[3].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 1;
+	q->tins[3].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static int cake_config_diffserv3(struct Qdisc *sch)
+{
+/*  Simplified Diffserv structure with 3 tins.
+ *		Low Priority		(CS1)
+ *		Best Effort
+ *		Latency Sensitive	(TOS4, VA, EF, CS6, CS7)
+ */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum = 1024;
+
+	q->tin_cnt = 3;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv3;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 2, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static void cake_reconfigure(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	int c, ft;
+
+	switch (q->tin_mode) {
+	case CAKE_DIFFSERV_BESTEFFORT:
+		ft = cake_config_besteffort(sch);
+		break;
+
+	case CAKE_DIFFSERV_PRECEDENCE:
+		ft = cake_config_precedence(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV8:
+		ft = cake_config_diffserv8(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV4:
+		ft = cake_config_diffserv4(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV3:
+	default:
+		ft = cake_config_diffserv3(sch);
+		break;
+	}
+
+	for (c = q->tin_cnt; c < CAKE_MAX_TINS; c++) {
+		cake_clear_tin(sch, c);
+		q->tins[c].cparams.mtu_time = q->tins[ft].cparams.mtu_time;
+	}
+
+	q->rate_ns   = q->tins[ft].tin_rate_ns;
+	q->rate_shft = q->tins[ft].tin_rate_shft;
+
+	if (q->buffer_config_limit) {
+		q->buffer_limit = q->buffer_config_limit;
+	} else if (q->rate_bps) {
+		u64 t = (u64)q->rate_bps * q->interval;
+
+		do_div(t, USEC_PER_SEC / 4);
+		q->buffer_limit = max_t(u32, t, 4U << 20);
+	} else {
+		q->buffer_limit = ~0;
+	}
+
+	sch->flags &= ~TCQ_F_CAN_BYPASS;
+
+	q->buffer_limit = min(q->buffer_limit,
+			      max(sch->limit * psched_mtu(qdisc_dev(sch)),
+				  q->buffer_config_limit));
+}
+
+static int cake_change(struct Qdisc *sch, struct nlattr *opt,
+		       struct netlink_ext_ack *extack)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_CAKE_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_CAKE_MAX, opt, cake_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_CAKE_BASE_RATE])
+		q->rate_bps = nla_get_u32(tb[TCA_CAKE_BASE_RATE]);
+
+	if (tb[TCA_CAKE_DIFFSERV_MODE])
+		q->tin_mode = nla_get_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+
+	if (tb[TCA_CAKE_ATM])
+		q->atm_mode = nla_get_u32(tb[TCA_CAKE_ATM]);
+
+	if (tb[TCA_CAKE_WASH]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_WASH]))
+			q->rate_flags |= CAKE_FLAG_WASH;
+		else
+			q->rate_flags &= ~CAKE_FLAG_WASH;
+	}
+
+	if (tb[TCA_CAKE_FLOW_MODE])
+		q->flow_mode = nla_get_u32(tb[TCA_CAKE_FLOW_MODE]);
+
+	if (tb[TCA_CAKE_NAT]) {
+		q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
+		q->flow_mode |= CAKE_FLOW_NAT_FLAG *
+			!!nla_get_u32(tb[TCA_CAKE_NAT]);
+	}
+
+	if (tb[TCA_CAKE_OVERHEAD]) {
+		q->rate_overhead = nla_get_s32(tb[TCA_CAKE_OVERHEAD]);
+		q->rate_flags |= CAKE_FLAG_OVERHEAD;
+
+		q->max_netlen = q->max_adjlen = 0;
+		q->min_netlen = q->min_adjlen = ~0;
+	}
+
+	if (tb[TCA_CAKE_RAW]) {
+		q->rate_flags &= ~CAKE_FLAG_OVERHEAD;
+
+		q->max_netlen = q->max_adjlen = 0;
+		q->min_netlen = q->min_adjlen = ~0;
+	}
+
+	if (tb[TCA_CAKE_MPU])
+		q->rate_mpu = nla_get_u32(tb[TCA_CAKE_MPU]);
+
+	if (tb[TCA_CAKE_RTT]) {
+		q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
+
+		if (!q->interval)
+			q->interval = 1;
+	}
+
+	if (tb[TCA_CAKE_TARGET]) {
+		q->target = nla_get_u32(tb[TCA_CAKE_TARGET]);
+
+		if (!q->target)
+			q->target = 1;
+	}
+
+	if (tb[TCA_CAKE_AUTORATE]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
+			q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
+	}
+
+	if (tb[TCA_CAKE_INGRESS]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
+			q->rate_flags |= CAKE_FLAG_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_INGRESS;
+	}
+
+	if (tb[TCA_CAKE_ACK_FILTER])
+		q->ack_filter = nla_get_u32(tb[TCA_CAKE_ACK_FILTER]);
+
+	if (tb[TCA_CAKE_MEMORY])
+		q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
+
+	if (q->tins) {
+		sch_tree_lock(sch);
+		cake_reconfigure(sch);
+		sch_tree_unlock(sch);
+	}
+
+	return 0;
+}
+
+static void *cake_zalloc(size_t sz)
+{
+	void *ptr = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN);
+
+	if (!ptr)
+		ptr = vzalloc(sz);
+	return ptr;
+}
+
+static void cake_free(void *addr)
+{
+	if (addr)
+		kvfree(addr);
+}
+
+static void cake_destroy(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+
+	if (q->tins)
+		cake_free(q->tins);
+}
+
+static int cake_init(struct Qdisc *sch, struct nlattr *opt,
+		     struct netlink_ext_ack *extack)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	int i, j;
+
+	sch->limit = 10240;
+	q->tin_mode = CAKE_DIFFSERV_DIFFSERV3;
+	q->flow_mode  = CAKE_FLOW_TRIPLE;
+
+	q->rate_bps = 0; /* unlimited by default */
+
+	q->interval = 100000; /* 100ms default */
+	q->target   =   5000; /* 5ms: codel RFC argues
+			       * for 5 to 10% of interval
+			       */
+
+	q->cur_tin = 0;
+	q->cur_flow  = 0;
+
+	if (opt) {
+		int err = cake_change(sch, opt, extack);
+
+		if (err)
+			return err;
+	}
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+
+	quantum_div[0] = ~0;
+	for (i = 1; i <= CAKE_QUEUES; i++)
+		quantum_div[i] = 65535 / i;
+
+	q->tins = cake_zalloc(CAKE_MAX_TINS * sizeof(struct cake_tin_data));
+	if (!q->tins)
+		goto nomem;
+
+	for (i = 0; i < CAKE_MAX_TINS; i++) {
+		struct cake_tin_data *b = q->tins + i;
+
+		b->perturb = prandom_u32();
+		INIT_LIST_HEAD(&b->new_flows);
+		INIT_LIST_HEAD(&b->old_flows);
+		INIT_LIST_HEAD(&b->decaying_flows);
+		b->sparse_flow_count = 0;
+		b->bulk_flow_count = 0;
+		b->decaying_flow_count = 0;
+
+		for (j = 0; j < CAKE_QUEUES; j++) {
+			struct cake_flow *flow = b->flows + j;
+			u32 k = j * CAKE_MAX_TINS + i;
+
+			INIT_LIST_HEAD(&flow->flowchain);
+			cobalt_vars_init(&flow->cvars);
+
+			q->overflow_heap[k].t = i;
+			q->overflow_heap[k].b = j;
+			b->overflow_idx[j] = k;
+		}
+	}
+
+	cake_reconfigure(sch);
+	q->avg_peak_bandwidth = q->rate_bps;
+	q->min_netlen = q->min_adjlen = ~0;
+	return 0;
+
+nomem:
+	cake_destroy(sch);
+	return -ENOMEM;
+}
+
+static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts;
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_BASE_RATE, q->rate_bps))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_DIFFSERV_MODE, q->tin_mode))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_ATM, q->atm_mode))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_FLOW_MODE,
+			q->flow_mode & ~CAKE_FLOW_NAT_FLAG))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_NAT,
+			!!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_WASH,
+			!!(q->rate_flags & CAKE_FLAG_WASH)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_OVERHEAD, q->rate_overhead))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_MPU, q->rate_mpu))
+		goto nla_put_failure;
+
+	if (!(q->rate_flags & CAKE_FLAG_OVERHEAD))
+		if (nla_put_u32(skb, TCA_CAKE_RAW, 0))
+			goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_RTT, q->interval))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_TARGET, q->target))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
+			!!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_INGRESS,
+			!!(q->rate_flags & CAKE_FLAG_INGRESS)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	return -1;
+}
+
+static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct tc_cake_xstats *st;
+	size_t size = (sizeof(*st) +
+		       sizeof(struct tc_cake_tin_stats) * q->tin_cnt);
+	int i;
+
+	st = cake_zalloc(size);
+
+	if (!st)
+		return -1;
+
+	st->version = 0x102; /* old userspace code discards versions > 0xFF */
+	st->tin_stats_size = sizeof(struct tc_cake_tin_stats);
+	st->tin_cnt = q->tin_cnt;
+
+	st->avg_trnoff = (q->avg_trnoff + 0x8000) >> 16;
+	st->max_netlen = q->max_netlen;
+	st->max_adjlen = q->max_adjlen;
+	st->min_netlen = q->min_netlen;
+	st->min_adjlen = q->min_adjlen;
+
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[q->tin_order[i]];
+		struct tc_cake_tin_stats *tstat = &st->tin_stats[i];
+
+		tstat->threshold_rate = b->tin_rate_bps;
+		tstat->target_us      = cobalt_time_to_us(b->cparams.target);
+		tstat->interval_us    = cobalt_time_to_us(b->cparams.interval);
+
+		/* TODO FIXME: add missing aspects of these composite stats */
+		tstat->sent.packets       = b->packets;
+		tstat->sent.bytes	  = b->bytes;
+		tstat->dropped.packets    = b->tin_dropped;
+		tstat->ecn_marked.packets = b->tin_ecn_mark;
+		tstat->backlog.bytes      = b->tin_backlog;
+		tstat->ack_drops.packets  = b->ack_drops;
+
+		tstat->peak_delay_us = cobalt_time_to_us(b->peak_delay);
+		tstat->avge_delay_us = cobalt_time_to_us(b->avge_delay);
+		tstat->base_delay_us = cobalt_time_to_us(b->base_delay);
+
+		tstat->way_indirect_hits = b->way_hits;
+		tstat->way_misses	 = b->way_misses;
+		tstat->way_collisions    = b->way_collisions;
+
+		tstat->sparse_flows      = b->sparse_flow_count +
+					   b->decaying_flow_count;
+		tstat->bulk_flows	 = b->bulk_flow_count;
+		tstat->unresponse_flows  = b->unresponsive_flow_count;
+		tstat->spare		 = 0;
+		tstat->max_skblen	 = b->max_skblen;
+
+		tstat->flow_quantum	 = b->flow_quantum;
+	}
+	st->capacity_estimate = q->avg_peak_bandwidth;
+	st->memory_limit      = q->buffer_limit;
+	st->memory_used       = q->buffer_max_used;
+
+	i = gnet_stats_copy_app(d, st, size);
+	cake_free(st);
+	return i;
+}
+
+static struct Qdisc_ops cake_qdisc_ops __read_mostly = {
+	.id		=	"cake",
+	.priv_size	=	sizeof(struct cake_sched_data),
+	.enqueue	=	cake_enqueue,
+	.dequeue	=	cake_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	cake_init,
+	.reset		=	cake_reset,
+	.destroy	=	cake_destroy,
+	.change		=	cake_change,
+	.dump		=	cake_dump,
+	.dump_stats	=	cake_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init cake_module_init(void)
+{
+	return register_qdisc(&cake_qdisc_ops);
+}
+
+static void __exit cake_module_exit(void)
+{
+	unregister_qdisc(&cake_qdisc_ops);
+}
+
+module_init(cake_module_init)
+module_exit(cake_module_exit)
+MODULE_AUTHOR("Jonathan Morton");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("The CAKE shaper.");
-- 
2.17.0


* [PATCH iproute2-next v2] Add support for cake qdisc
  2018-04-24 11:44 ` [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Toke Høiland-Jørgensen
@ 2018-04-24 11:44   ` Toke Høiland-Jørgensen
  2018-04-24 12:30     ` [PATCH iproute2-next v3] " Toke Høiland-Jørgensen
  2018-04-24 15:11   ` [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Stephen Hemminger
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2018-04-24 11:44 UTC (permalink / raw)
  To: netdev; +Cc: cake, Toke Høiland-Jørgensen, Dave Taht

sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash

CAKE is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper, which can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency
  variation.
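The set-associative hashing mentioned above can be sketched as follows. This is an illustrative user-space model, not the kernel implementation: within the set a flow hash maps to, any of the 8 ways may hold the flow, so flows that collide on the set index can still receive distinct queues (the real code's eviction policy differs).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative parameters; sch_cake uses 1024 queues in 8-way sets. */
#define QUEUES 1024
#define WAYS   8
#define SETS   (QUEUES / WAYS)

static uint32_t tags[QUEUES];   /* flow tag stored per queue, 0 == empty */

/* Look up (or allocate) a queue index for a flow hash. */
static int flow_lookup(uint32_t hash)
{
	uint32_t set = (hash % SETS) * WAYS;
	uint32_t tag = hash | 1;        /* never store 0, which marks empty */
	int i;

	for (i = 0; i < WAYS; i++)      /* existing entry for this flow? */
		if (tags[set + i] == tag)
			return set + i;
	for (i = 0; i < WAYS; i++)      /* otherwise, take a free way */
		if (tags[set + i] == 0) {
			tags[set + i] = tag;
			return set + i;
		}
	tags[set] = tag;                /* genuine collision: evict a way */
	return set;
}
```

With plain modulo hashing, two flows whose hashes share a set index would share a queue; here they only collide once all 8 ways of the set are occupied.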

A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617

Various versions have been baking as an out-of-tree build for kernel
versions going back to 3.10, as the embedded router world runs a few years
behind mainline Linux. A stable version has been generally available on
lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
Changes since v1:
  - Updated netlink ABI
  - Remove diffserv-llt mode
  - Various tweaks and clean-ups of stats output
  
 man/man8/tc-cake.8 | 632 +++++++++++++++++++++++++++++++++++
 man/man8/tc.8      |   1 +
 tc/Makefile        |   1 +
 tc/q_cake.c        | 797 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1431 insertions(+)
 create mode 100644 man/man8/tc-cake.8
 create mode 100644 tc/q_cake.c

diff --git a/man/man8/tc-cake.8 b/man/man8/tc-cake.8
new file mode 100644
index 00000000..30e41bc9
--- /dev/null
+++ b/man/man8/tc-cake.8
@@ -0,0 +1,632 @@
+.TH CAKE 8 "24 April 2018" "iproute2" "Linux"
+.SH NAME
+CAKE \- Common Applications Kept Enhanced (CAKE)
+.SH SYNOPSIS
+.B tc qdisc ... cake
+.br
+[
+.BR bandwidth
+RATE |
+.BR unlimited*
+|
+.BR autorate_ingress
+]
+.br
+[
+.BR rtt
+TIME |
+.BR datacentre
+|
+.BR lan
+|
+.BR metro
+|
+.BR regional
+|
+.BR internet*
+|
+.BR oceanic
+|
+.BR satellite
+|
+.BR interplanetary
+]
+.br
+[
+.BR besteffort
+|
+.BR diffserv8
+|
+.BR diffserv4
+|
+.BR diffserv3*
+]
+.br
+[
+.BR flowblind
+|
+.BR srchost
+|
+.BR dsthost
+|
+.BR hosts
+|
+.BR flows
+|
+.BR dual-srchost
+|
+.BR dual-dsthost
+|
+.BR triple-isolate*
+]
+.br
+[
+.BR nat
+|
+.BR nonat*
+]
+.br
+[
+.BR wash
+|
+.BR nowash*
+]
+.br
+[
+.BR ack-filter
+|
+.BR ack-filter-aggressive
+|
+.BR no-ack-filter*
+]
+.br
+[
+.BR memlimit
+LIMIT ]
+.br
+[
+.BR ptm
+|
+.BR atm
+|
+.BR noatm*
+]
+.br
+[
+.BR overhead
+N |
+.BR conservative
+|
+.BR raw*
+]
+.br
+[
+.BR mpu
+N ]
+.br
+[
+.BR ingress
+|
+.BR egress*
+]
+.br
+(* marks defaults)
+
+
+.SH DESCRIPTION
+CAKE (Common Applications Kept Enhanced) is a shaping-capable queue discipline
+which uses both AQM and FQ.  It combines COBALT, which is an AQM algorithm
+combining Codel and BLUE, a shaper which operates in deficit mode, and a variant
+of DRR++ for flow isolation.  8-way set-associative hashing is used to virtually
+eliminate hash collisions.  Priority queuing is available through a simplified
+diffserv implementation.  Overhead compensation for various encapsulation
+schemes is tightly integrated.
+
+All settings are optional; the default settings are chosen to be sensible in
+most common deployments.  Most people will only need to set the
+.B bandwidth
+parameter to get useful results, but reading the
+.B Overhead Compensation
+and
+.B Round Trip Time
+sections is strongly encouraged.
+
+.SH SHAPER PARAMETERS
+CAKE uses a deficit-mode shaper, which does not exhibit the initial burst
+typical of token-bucket shapers.  It will automatically burst precisely as much
+as required to maintain the configured throughput.  As such, it is very
+straightforward to configure.
+.PP
+.B unlimited
+(default)
+.br
+	No limit on the bandwidth.
+.PP
+.B bandwidth
+RATE
+.br
+	Set the shaper bandwidth.  See
+.BR tc(8)
+or examples below for details of the RATE value.
+.PP
+.B autorate_ingress
+.br
+	Automatic capacity estimation based on traffic arriving at this qdisc.
+This is most likely to be useful with cellular links, which tend to change
+quality randomly.  A
+.B bandwidth
+parameter can be used in conjunction to specify an initial estimate.  The shaper
+will periodically be set to a bandwidth slightly below the estimated rate.  This
+estimator cannot estimate the bandwidth of links downstream of itself.
+
+.SH OVERHEAD COMPENSATION PARAMETERS
+The size of each packet on the wire may differ from that seen by Linux.  The
+following parameters allow CAKE to compensate for this difference by internally
+considering each packet to be bigger than Linux reports.  To assist users who
+are not expert network engineers, keywords have been provided to represent a
+number of common link technologies.
+
+.SS	Manual Overhead Specification
+.B overhead
+BYTES
+.br
+	Adds BYTES to the size of each packet.  BYTES may be negative; values
+between -64 and 256 (inclusive) are accepted.
+.PP
+.B mpu
+BYTES
+.br
+	Rounds each packet (including overhead) up to a minimum length of
+BYTES.  BYTES may not be negative; values between 0 and 256 (inclusive)
+are accepted.
+.PP
+.B atm
+.br
+	Compensates for ATM cell framing, which is normally found on ADSL links.
+This is performed after the
+.B overhead
+parameter above.  ATM uses fixed 53-byte cells, each of which can carry 48
+bytes of payload.
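To illustrate the cell-framing arithmetic (a sketch of the principle, not the kernel code): the packet, after any `overhead` bytes are added, is split into 48-byte cells, and each cell occupies 53 bytes on the wire.

```c
#include <assert.h>
#include <stdint.h>

#define ATM_CELL_PAYLOAD 48
#define ATM_CELL_SIZE    53

/* On-wire size of a packet under ATM framing: ceil(len / 48) cells,
 * each transmitted as a full 53-byte cell. */
static uint32_t atm_wire_len(uint32_t len)
{
	uint32_t cells = (len + ATM_CELL_PAYLOAD - 1) / ATM_CELL_PAYLOAD;

	return cells * ATM_CELL_SIZE;
}
```

For example, a 1500-byte IP packet with 32 bytes of PPPoE/ATM overhead needs 32 cells, i.e. 1696 bytes of link time, which is why ignoring ATM framing makes an ADSL shaper overshoot.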
+.PP
+.B ptm
+.br
+	Compensates for PTM encoding, which is normally found on VDSL2 links and
+uses a 64b/65b encoding scheme. It is even more efficient to simply
+derate the specified shaper bandwidth by a factor of 64/65 or 0.984. See
+ITU G.992.3 Annex N and IEEE 802.3 Section 61.3 for details.
+.PP
+.B noatm
+.br
+	Disables ATM and PTM compensation.
+
+.SS	Failsafe Overhead Keywords
+These two keywords are provided for quick-and-dirty setup.  Use them if you
+can't be bothered to read the rest of this section.
+.PP
+.B raw
+(default)
+.br
+	Turns off all overhead compensation in CAKE.  The packet size reported
+by Linux will be used directly.
+.PP
+	Other overhead keywords may be added after "raw".  The effect of this is
+to make the overhead compensation operate relative to the reported packet size,
+not the underlying IP packet size.
+.PP
+.B conservative
+.br
+	Compensates for more overhead than is likely to occur on any
+widely-deployed link technology.
+.br
+	Equivalent to
+.B overhead 48 atm.
+
+.SS ADSL Overhead Keywords
+Most ADSL modems have a way to check which framing scheme is in use.  Often this
+is also specified in the settings document provided by the ISP.  The keywords in
+this section are intended to correspond with these sources of information.  All
+of them implicitly set the
+.B atm
+flag.
+.PP
+.B pppoa-vcmux
+.br
+	Equivalent to
+.B overhead 10 atm
+.PP
+.B pppoa-llc
+.br
+	Equivalent to
+.B overhead 14 atm
+.PP
+.B pppoe-vcmux
+.br
+	Equivalent to
+.B overhead 32 atm
+.PP
+.B pppoe-llcsnap
+.br
+	Equivalent to
+.B overhead 40 atm
+.PP
+.B bridged-vcmux
+.br
+	Equivalent to
+.B overhead 24 atm
+.PP
+.B bridged-llcsnap
+.br
+	Equivalent to
+.B overhead 32 atm
+.PP
+.B ipoa-vcmux
+.br
+	Equivalent to
+.B overhead 8 atm
+.PP
+.B ipoa-llcsnap
+.br
+	Equivalent to
+.B overhead 16 atm
+.PP
+See also the Ethernet Correction Factors section below.
+
+.SS VDSL2 Overhead Keywords
+ATM was dropped from VDSL2 in favour of PTM, which is a much more
+straightforward framing scheme.  Some ISPs retained PPPoE for compatibility with
+their existing back-end systems.
+.PP
+.B pppoe-ptm
+.br
+	Equivalent to
+.B overhead 30 ptm
+
+.br
+	PPPoE: 2B PPP + 6B PPPoE +
+.br
+	ETHERNET: 6B dest MAC + 6B src MAC + 2B ethertype + 4B Frame Check Sequence +
+.br
+	PTM: 1B Start of Frame (S) + 1B End of Frame (Ck) + 2B TC-CRC (PTM-FCS)
+.br
+.PP
+.B bridged-ptm
+.br
+	Equivalent to
+.B overhead 22 ptm
+.br
+	ETHERNET: 6B dest MAC + 6B src MAC + 2B ethertype + 4B Frame Check Sequence +
+.br
+	PTM: 1B Start of Frame (S) + 1B End of Frame (Ck) + 2B TC-CRC (PTM-FCS)
+.br
+.PP
+See also the Ethernet Correction Factors section below.
+
+.SS DOCSIS Cable Overhead Keyword
+DOCSIS is the universal standard for providing Internet service over cable-TV
+infrastructure.
+
+In this case, the actual on-wire overhead is less important than the packet size
+the head-end equipment uses for shaping and metering.  This is specified to be
+an Ethernet frame including the CRC (aka FCS).
+.PP
+.B docsis
+.br
+	Equivalent to
+.B overhead 18 mpu 64 noatm
+
+.SS Ethernet Overhead Keywords
+.PP
+.B ethernet
+.br
+	Accounts for Ethernet's preamble, inter-frame gap, and Frame Check
+Sequence.  Use this keyword when the bottleneck being shaped for is an
+actual Ethernet cable.
+.br
+	Equivalent to
+.B overhead 38 mpu 84 noatm
+.PP
+.B ether-vlan
+.br
+	Adds 4 bytes to the overhead compensation, accounting for an IEEE 802.1Q
+VLAN header appended to the Ethernet frame header.  NB: Some ISPs use one or
+even two of these within PPPoE; this keyword may be repeated as necessary to
+express this.
+
+.SH ROUND TRIP TIME PARAMETERS
+Active Queue Management (AQM) consists of embedding congestion signals in the
+packet flow, which receivers use to instruct senders to slow down when the queue
+is persistently occupied.  CAKE uses ECN signalling when available, and packet
+drops otherwise, according to a combination of the Codel and BLUE AQM algorithms
+called COBALT.
+
+Very short latencies require a very rapid AQM response to adequately control
+latency.  However, such a rapid response tends to impair throughput when the
+actual RTT is relatively long.  CAKE allows specifying the RTT it assumes for
+tuning various parameters.  Actual RTTs within an order of magnitude of this
+will generally work well for both throughput and latency management.
+
+At the 'lan' setting and below, the time constants are similar in magnitude to
+the jitter in the Linux kernel itself, so congestion might be signalled
+prematurely. The flows will then become sparse and total throughput reduced,
+leaving little or no back-pressure for the fairness logic to work against. Use
+the "metro" setting for local LANs unless you have a custom kernel.
+.PP
+.B rtt
+TIME
+.br
+	Manually specify an RTT.
+.PP
+.B datacentre
+.br
+	For extremely high-performance 10GigE+ networks only.  Equivalent to
+.B rtt 100us.
+.PP
+.B lan
+.br
+	For pure Ethernet (not Wi-Fi) networks, at home or in the office.  Don't
+use this when shaping for an Internet access link.  Equivalent to
+.B rtt 1ms.
+.PP
+.B metro
+.br
+	For traffic mostly within a single city.  Equivalent to
+.B rtt 10ms.
+.PP
+.B regional
+.br
+	For traffic mostly within a European-sized country.  Equivalent to
+.B rtt 30ms.
+.PP
+.B internet
+(default)
+.br
+	This is suitable for most Internet traffic.  Equivalent to
+.B rtt 100ms.
+.PP
+.B oceanic
+.br
+	For Internet traffic with generally above-average latency, such as that
+suffered by Australasian residents.  Equivalent to
+.B rtt 300ms.
+.PP
+.B satellite
+.br
+	For traffic via geostationary satellites.  Equivalent to
+.B rtt 1000ms.
+.PP
+.B interplanetary
+.br
+	So named because Jupiter is about 1 light-hour from Earth.  Use this to
+(almost) completely disable AQM actions.  Equivalent to
+.B rtt 3600s.
+
+.SH FLOW ISOLATION PARAMETERS
+With flow isolation enabled, CAKE places packets from different flows into
+different queues, each of which carries its own AQM state.  Packets from each
+queue are then delivered fairly, according to a DRR++ algorithm which minimises
+latency for "sparse" flows.  CAKE uses a set-associative hashing algorithm to
+minimise flow collisions.
+
+These keywords specify whether fairness based on source address, destination
+address, individual flows, or any combination of those is desired.
+.PP
+.B flowblind
+.br
+	Disables flow isolation; all traffic passes through a single queue for
+each tin.
+.PP
+.B srchost
+.br
+	Flows are defined only by source address.  Could be useful on the egress
+path of an ISP backhaul.
+.PP
+.B dsthost
+.br
+	Flows are defined only by destination address.  Could be useful on the
+ingress path of an ISP backhaul.
+.PP
+.B hosts
+.br
+	Flows are defined by source-destination host pairs.  This is host
+isolation, rather than flow isolation.
+.PP
+.B flows
+.br
+	Flows are defined by the entire 5-tuple of source address, destination
+address, transport protocol, source port and destination port.  This is the type
+of flow isolation performed by SFQ and fq_codel.
+.PP
+.B dual-srchost
+.br
+	Flows are defined by the 5-tuple, and fairness is applied first over
+source addresses, then over individual flows.  Good for use on egress traffic
+from a LAN to the internet, where it'll prevent any one LAN host from
+monopolising the uplink, regardless of the number of flows they use.
+.PP
+.B dual-dsthost
+.br
+	Flows are defined by the 5-tuple, and fairness is applied first over
+destination addresses, then over individual flows.  Good for use on ingress
+traffic to a LAN from the internet, where it will prevent any one LAN host from
+monopolising the downlink, regardless of the number of flows they use.
+.PP
+.B triple-isolate
+(default)
+.br
+	Flows are defined by the 5-tuple, and fairness is applied over source
+*and* destination addresses intelligently (i.e. not merely by host pairs), and
+also over individual flows.  Use this if you're not certain whether to use
+dual-srchost or dual-dsthost; it'll do both jobs at once, preventing any one
+host on *either* side of the link from monopolising it with a large number of
+flows.
+.PP
+.B nat
+.br
+	Instructs Cake to perform a NAT lookup before applying flow-isolation
+rules, so that the true addresses and port numbers of the packet are used to
+improve fairness between hosts "inside" the NAT.  This has no practical effect
+in "flowblind" or "flows" modes, or if NAT is performed on a different host.
+.PP
+.B nonat
+(default)
+.br
+	Cake will not perform a NAT lookup.  Flow isolation will be performed
+using the addresses and port numbers directly visible to the interface Cake is
+attached to.
+
+.SH PRIORITY QUEUE PARAMETERS
+CAKE can divide traffic into "tins" based on the Diffserv field.  Each tin has
+its own independent set of flow-isolation queues, and is serviced by a WRR
+algorithm.  To avoid perverse Diffserv marking incentives, each tin carries a
+high "priority sharing" weight while the bandwidth it uses is below a
+threshold, and a lower "bandwidth sharing" weight once it rises above it.
+Bandwidth is compared against the threshold using the same algorithm as the
+deficit-mode shaper.
+
+Detailed customisation of tin parameters is not provided.  The following presets
+perform all necessary tuning, relative to the current shaper bandwidth and RTT
+settings.
+.PP
+.B besteffort
+.br
+	Disables priority queuing by placing all traffic in one tin.
+.PP
+.B precedence
+.br
+	Enables legacy interpretation of TOS "Precedence" field.  Use of this
+preset on the modern Internet is firmly discouraged.
+.PP
+.B diffserv4
+.br
+	Provides a general-purpose Diffserv implementation with four tins:
+.br
+		Bulk (CS1), 6.25% threshold, generally low priority.
+.br
+		Best Effort (general), 100% threshold.
+.br
+		Video (AF4x, AF3x, CS3, AF2x, CS2, TOS4, TOS1), 50% threshold.
+.br
+		Voice (CS7, CS6, EF, VA, CS5, CS4), 25% threshold.
+.PP
+.B diffserv3
+(default)
+.br
+	Provides a simple, general-purpose Diffserv implementation with three tins:
+.br
+		Bulk (CS1), 6.25% threshold, generally low priority.
+.br
+		Best Effort (general), 100% threshold.
+.br
+		Voice (CS7, CS6, EF, VA, TOS4), 25% threshold, reduced Codel interval.
+
+.SH OTHER PARAMETERS
+.B memlimit
+LIMIT
+.br
+	Limit the memory consumed by Cake to LIMIT bytes. Note that this does
+not translate directly to queue size (so do not size this based on bandwidth
+delay product considerations, but rather on worst case acceptable memory
+consumption), as there is some overhead in the data structures containing the
+packets, especially for small packets.
+
+	By default, the limit is calculated based on the bandwidth and RTT
+settings.
+
+.PP
+.B wash
+.br
+	Traffic entering your diffserv domain is frequently mis-marked in
+transit from the perspective of your network, and traffic exiting yours may be
+mis-marked from the perspective of the transiting provider.
+
+Apply the wash option to clear all Diffserv markings (but not ECN bits) after
+priority queuing has taken place.
+
+If you are shaping inbound, and cannot trust the diffserv markings (as is the
+case for Comcast Cable, among others), it is best to use a single queue
+"besteffort" mode with wash.
+
+.SH EXAMPLES
+# tc qdisc delete root dev eth0
+.br
+# tc qdisc add root dev eth0 cake bandwidth 100Mbit ethernet
+.br
+# tc -s qdisc show dev eth0
+.br
+qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms noatm overhead 38 mpu 84 
+ Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
+ backlog 0b 0p requeues 0
+ memory used: 0b of 5000000b
+ capacity estimate: 100Mbit
+ min/max network layer size:        65535 /       0
+ min/max overhead-adjusted size:    65535 /       0
+ average network hdr offset:            0
+
+                   Bulk  Best Effort        Voice
+  thresh       6250Kbit      100Mbit       25Mbit
+  target          5.0ms        5.0ms        5.0ms
+  interval      100.0ms      100.0ms      100.0ms
+  pk_delay          0us          0us          0us
+  av_delay          0us          0us          0us
+  sp_delay          0us          0us          0us
+  pkts                0            0            0
+  bytes               0            0            0
+  way_inds            0            0            0
+  way_miss            0            0            0
+  way_cols            0            0            0
+  drops               0            0            0
+  marks               0            0            0
+  ack_drop            0            0            0
+  sp_flows            0            0            0
+  bk_flows            0            0            0
+  un_flows            0            0            0
+  max_len             0            0            0
+  quantum           300         1514          762
+
+After some use:
+.br
+# tc -s qdisc show dev eth0
+
+qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms noatm overhead 38 mpu 84 
+ Sent 44709231 bytes 31931 pkt (dropped 45, overlimits 93782 requeues 0) 
+ backlog 33308b 22p requeues 0
+ memory used: 292352b of 5000000b
+ capacity estimate: 100Mbit
+ min/max network layer size:           28 /    1500
+ min/max overhead-adjusted size:       84 /    1538
+ average network hdr offset:           14
+
+                   Bulk  Best Effort        Voice
+  thresh       6250Kbit      100Mbit       25Mbit
+  target          5.0ms        5.0ms        5.0ms
+  interval      100.0ms      100.0ms      100.0ms
+  pk_delay        8.7ms        6.9ms        5.0ms
+  av_delay        4.9ms        5.3ms        3.8ms
+  sp_delay        727us        1.4ms        511us
+  pkts             2590        21271         8137
+  bytes         3081804     30302659     11426206
+  way_inds            0           46            0
+  way_miss            3           17            4
+  way_cols            0            0            0
+  drops              20           15           10
+  marks               0            0            0
+  ack_drop            0            0            0
+  sp_flows            2            4            1
+  bk_flows            1            2            1
+  un_flows            0            0            0
+  max_len          1514         1514         1514
+  quantum           300         1514          762
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-codel (8),
+.BR tc-fq_codel (8),
+.BR tc-red (8)
+
+.SH AUTHORS
+Cake's principal author is Jonathan Morton, with contributions from
+Tony Ambardar, Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen,
+Sebastian Moeller, Ryan Mounce, Dean Scarff, Nils Andreas Svee, and Dave Täht.
+
+This manual page was written by Loganaden Velvindron. Please report corrections
+to the Linux Networking mailing list <netdev@vger.kernel.org>.
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 840880fb..716dfec5 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -795,6 +795,7 @@ was written by Alexey N. Kuznetsov and added in Linux 2.2.
 .BR tc-basic (8),
 .BR tc-bfifo (8),
 .BR tc-bpf (8),
+.BR tc-cake (8),
 .BR tc-cbq (8),
 .BR tc-cgroup (8),
 .BR tc-choke (8),
diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..d9a43568 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -66,6 +66,7 @@ TCMODULES += q_codel.o
 TCMODULES += q_fq_codel.o
 TCMODULES += q_fq.o
 TCMODULES += q_pie.o
+TCMODULES += q_cake.o
 TCMODULES += q_hhf.o
 TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
diff --git a/tc/q_cake.c b/tc/q_cake.c
new file mode 100644
index 00000000..9f02b35c
--- /dev/null
+++ b/tc/q_cake.c
@@ -0,0 +1,797 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Common Applications Kept Enhanced  --  CAKE
+ *
+ *  Copyright (C) 2014-2018 Jonathan Morton <chromatix99@gmail.com>
+ *  Copyright (C) 2017-2018 Toke Høiland-Jørgensen <toke@toke.dk>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions, and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ *    derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ *
+ */
+
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+"Usage: ... cake [ bandwidth RATE | unlimited* | autorate_ingress ]\n"
+"                [ rtt TIME | datacentre | lan | metro | regional |\n"
+"                  internet* | oceanic | satellite | interplanetary ]\n"
+"                [ besteffort | diffserv8 | diffserv4 | diffserv3* ]\n"
+"                [ flowblind | srchost | dsthost | hosts | flows |\n"
+"                  dual-srchost | dual-dsthost | triple-isolate* ]\n"
+"                [ nat | nonat* ]\n"
+"                [ wash | nowash* ]\n"
+"                [ ack-filter | ack-filter-aggressive | no-ack-filter* ]\n"
+"                [ memlimit LIMIT ]\n"
+"                [ ptm | atm | noatm* ] [ overhead N | conservative | raw* ]\n"
+"                [ mpu N ] [ ingress | egress* ]\n"
+"                (* marks defaults)\n");
+}
+
+static int cake_parse_opt(struct qdisc_util *qu, int argc, char **argv,
+			  struct nlmsghdr *n, const char *dev)
+{
+	int unlimited = 0;
+	unsigned bandwidth = 0;
+	unsigned interval = 0;
+	unsigned target = 0;
+	unsigned diffserv = 0;
+	unsigned memlimit = 0;
+	unsigned testflags = 0;
+	int  overhead = 0;
+	bool overhead_set = false;
+	bool overhead_override = false;
+	int mpu = 0;
+	int flowmode = -1;
+	int nat = -1;
+	int atm = -1;
+	int autorate = -1;
+	int wash = -1;
+	int ingress = -1;
+	int ack_filter = -1;
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "bandwidth") == 0) {
+			NEXT_ARG();
+			if (get_rate(&bandwidth, *argv)) {
+				fprintf(stderr, "Illegal \"bandwidth\"\n");
+				return -1;
+			}
+			unlimited = 0;
+			autorate = 0;
+		} else if (strcmp(*argv, "unlimited") == 0) {
+			bandwidth = 0;
+			unlimited = 1;
+			autorate = 0;
+		} else if (strcmp(*argv, "autorate_ingress") == 0) {
+			autorate = 1;
+
+		} else if (strcmp(*argv, "rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&interval, *argv)) {
+				fprintf(stderr, "Illegal \"rtt\"\n");
+				return -1;
+			}
+			target = interval / 20;
+			if (!target)
+				target = 1;
+		} else if (strcmp(*argv, "datacentre") == 0) {
+			interval = 100;
+			target   =   5;
+		} else if (strcmp(*argv, "lan") == 0) {
+			interval = 1000;
+			target   =   50;
+		} else if (strcmp(*argv, "metro") == 0) {
+			interval = 10000;
+			target   =   500;
+		} else if (strcmp(*argv, "regional") == 0) {
+			interval = 30000;
+			target   =  1500;
+		} else if (strcmp(*argv, "internet") == 0) {
+			interval = 100000;
+			target   =   5000;
+		} else if (strcmp(*argv, "oceanic") == 0) {
+			interval = 300000;
+			target   =  15000;
+		} else if (strcmp(*argv, "satellite") == 0) {
+			interval = 1000000;
+			target   =   50000;
+		} else if (strcmp(*argv, "interplanetary") == 0) {
+			interval = 1000000000;
+			target   =   50000000;
+
+		} else if (strcmp(*argv, "besteffort") == 0) {
+			diffserv = CAKE_DIFFSERV_BESTEFFORT;
+		} else if (strcmp(*argv, "precedence") == 0) {
+			diffserv = CAKE_DIFFSERV_PRECEDENCE;
+		} else if (strcmp(*argv, "diffserv8") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV8;
+		} else if (strcmp(*argv, "diffserv4") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV4;
+		} else if (strcmp(*argv, "diffserv") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV4;
+		} else if (strcmp(*argv, "diffserv3") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV3;
+
+		} else if (strcmp(*argv, "nowash") == 0) {
+			wash = 0;
+		} else if (strcmp(*argv, "wash") == 0) {
+			wash = 1;
+
+		} else if (strcmp(*argv, "flowblind") == 0) {
+			flowmode = CAKE_FLOW_NONE;
+		} else if (strcmp(*argv, "srchost") == 0) {
+			flowmode = CAKE_FLOW_SRC_IP;
+		} else if (strcmp(*argv, "dsthost") == 0) {
+			flowmode = CAKE_FLOW_DST_IP;
+		} else if (strcmp(*argv, "hosts") == 0) {
+			flowmode = CAKE_FLOW_HOSTS;
+		} else if (strcmp(*argv, "flows") == 0) {
+			flowmode = CAKE_FLOW_FLOWS;
+		} else if (strcmp(*argv, "dual-srchost") == 0) {
+			flowmode = CAKE_FLOW_DUAL_SRC;
+		} else if (strcmp(*argv, "dual-dsthost") == 0) {
+			flowmode = CAKE_FLOW_DUAL_DST;
+		} else if (strcmp(*argv, "triple-isolate") == 0) {
+			flowmode = CAKE_FLOW_TRIPLE;
+
+		} else if (strcmp(*argv, "nat") == 0) {
+			nat = 1;
+		} else if (strcmp(*argv, "nonat") == 0) {
+			nat = 0;
+
+		} else if (strcmp(*argv, "ptm") == 0) {
+			atm = CAKE_ATM_PTM;
+		} else if (strcmp(*argv, "atm") == 0) {
+			atm = CAKE_ATM_ATM;
+		} else if (strcmp(*argv, "noatm") == 0) {
+			atm = CAKE_ATM_NONE;
+
+		} else if (strcmp(*argv, "raw") == 0) {
+			atm = CAKE_ATM_NONE;
+			overhead = 0;
+			overhead_set = true;
+			overhead_override = true;
+		} else if (strcmp(*argv, "conservative") == 0) {
+			/*
+			 * Deliberately over-estimate overhead:
+			 * one whole ATM cell plus ATM framing.
+			 * A safe choice if the actual overhead is unknown.
+			 */
+			atm = CAKE_ATM_ATM;
+			overhead = 48;
+			overhead_set = true;
+
+		/* Various ADSL framing schemes, all over ATM cells */
+		} else if (strcmp(*argv, "ipoa-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 8;
+			overhead_set = true;
+		} else if (strcmp(*argv, "ipoa-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 16;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 24;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 32;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoa-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 10;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoa-llc") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 14;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoe-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 32;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoe-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 40;
+			overhead_set = true;
+
+		/* Typical VDSL2 framing schemes, both over PTM */
+		/* PTM has 64b/65b coding which absorbs some bandwidth */
+		} else if (strcmp(*argv, "pppoe-ptm") == 0) {
+			/* 2B PPP + 6B PPPoE + 6B dest MAC + 6B src MAC
+			 * + 2B ethertype + 4B Frame Check Sequence
+			 * + 1B Start of Frame (S) + 1B End of Frame (Ck)
+			 * + 2B TC-CRC (PTM-FCS) = 30B
+			 */
+			atm = CAKE_ATM_PTM;
+			overhead += 30;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-ptm") == 0) {
+			/* 6B dest MAC + 6B src MAC + 2B ethertype
+			 * + 4B Frame Check Sequence
+			 * + 1B Start of Frame (S) + 1B End of Frame (Ck)
+			 * + 2B TC-CRC (PTM-FCS) = 22B
+			 */
+			atm = CAKE_ATM_PTM;
+			overhead += 22;
+			overhead_set = true;
+
+		} else if (strcmp(*argv, "via-ethernet") == 0) {
+			/*
+			 * We used to use this flag to manually compensate for
+			 * Linux including the Ethernet header on Ethernet-type
+			 * interfaces, but not on IP-type interfaces.
+			 *
+			 * It is no longer needed, because Cake now adjusts for
+			 * that automatically; the keyword is accepted and
+			 * ignored.
+			 *
+			 * It would be deleted entirely, but it appears in the
+			 * stats output when the automatic compensation is
+			 * active.
+			 */
+
+		} else if (strcmp(*argv, "ethernet") == 0) {
+			/* Ethernet preamble, interframe gap & FCS;
+			 * add ether-vlan as well if a VLAN tag is used */
+			overhead += 38;
+			overhead_set = true;
+			mpu = 84;
+
+		/* Additional Ethernet-related overhead used by some ISPs */
+		} else if (strcmp(*argv, "ether-vlan") == 0) {
+			/* 802.1q VLAN tag - may be repeated */
+			overhead += 4;
+			overhead_set = true;
+
+		/*
+		 * DOCSIS cable shapers account for Ethernet frame with FCS,
+		 * but not interframe gap or preamble.
+		 */
+		} else if (strcmp(*argv, "docsis") == 0) {
+			atm = CAKE_ATM_NONE;
+			overhead += 18;
+			overhead_set = true;
+			mpu = 64;
+
+		} else if (strcmp(*argv, "overhead") == 0) {
+			char* p = NULL;
+			NEXT_ARG();
+			overhead = strtol(*argv, &p, 10);
+			if (!p || *p || !*argv || overhead < -64 || overhead > 256) {
+				fprintf(stderr, "Illegal \"overhead\", valid range is -64 to 256\n");
+				return -1;
+			}
+			overhead_set = true;
+
+		} else if (strcmp(*argv, "mpu") == 0) {
+			char* p = NULL;
+			NEXT_ARG();
+			mpu = strtol(*argv, &p, 10);
+			if (!p || *p || !*argv || mpu < 0 || mpu > 256) {
+				fprintf(stderr, "Illegal \"mpu\", valid range is 0 to 256\n");
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "testflags") == 0) {
+			char* p = NULL;
+			NEXT_ARG();
+			testflags = strtol(*argv, &p, 10);
+			if (!p || *p || !*argv) {
+				fprintf(stderr, "Illegal \"testflags\"\n");
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "ingress") == 0) {
+			ingress = 1;
+		} else if (strcmp(*argv, "egress") == 0) {
+			ingress = 0;
+
+		} else if (strcmp(*argv, "no-ack-filter") == 0) {
+			ack_filter = CAKE_ACK_NONE;
+		} else if (strcmp(*argv, "ack-filter") == 0) {
+			ack_filter = CAKE_ACK_FILTER;
+		} else if (strcmp(*argv, "ack-filter-aggressive") == 0) {
+			ack_filter = CAKE_ACK_AGGRESSIVE;
+
+		} else if (strcmp(*argv, "memlimit") == 0) {
+			NEXT_ARG();
+			if(get_size(&memlimit, *argv)) {
+				fprintf(stderr, "Illegal value for \"memlimit\": \"%s\"\n", *argv);
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--; argv++;
+	}
+
+	tail = NLMSG_TAIL(n);
+	addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
+	if (bandwidth || unlimited)
+		addattr_l(n, 1024, TCA_CAKE_BASE_RATE, &bandwidth, sizeof(bandwidth));
+	if (diffserv)
+		addattr_l(n, 1024, TCA_CAKE_DIFFSERV_MODE, &diffserv, sizeof(diffserv));
+	if (atm != -1)
+		addattr_l(n, 1024, TCA_CAKE_ATM, &atm, sizeof(atm));
+	if (flowmode != -1)
+		addattr_l(n, 1024, TCA_CAKE_FLOW_MODE, &flowmode, sizeof(flowmode));
+	if (overhead_set)
+		addattr_l(n, 1024, TCA_CAKE_OVERHEAD, &overhead, sizeof(overhead));
+	if (overhead_override) {
+		unsigned zero = 0;
+		addattr_l(n, 1024, TCA_CAKE_RAW, &zero, sizeof(zero));
+	}
+	if (mpu > 0)
+		addattr_l(n, 1024, TCA_CAKE_MPU, &mpu, sizeof(mpu));
+	if (interval)
+		addattr_l(n, 1024, TCA_CAKE_RTT, &interval, sizeof(interval));
+	if (target)
+		addattr_l(n, 1024, TCA_CAKE_TARGET, &target, sizeof(target));
+	if (autorate != -1)
+		addattr_l(n, 1024, TCA_CAKE_AUTORATE, &autorate, sizeof(autorate));
+	if (memlimit)
+		addattr_l(n, 1024, TCA_CAKE_MEMORY, &memlimit, sizeof(memlimit));
+	if (nat != -1)
+		addattr_l(n, 1024, TCA_CAKE_NAT, &nat, sizeof(nat));
+	if (wash != -1)
+		addattr_l(n, 1024, TCA_CAKE_WASH, &wash, sizeof(wash));
+	if (ingress != -1)
+		addattr_l(n, 1024, TCA_CAKE_INGRESS, &ingress, sizeof(ingress));
+	if (ack_filter != -1)
+		addattr_l(n, 1024, TCA_CAKE_ACK_FILTER, &ack_filter, sizeof(ack_filter));
+	if (testflags)
+		addattr_l(n, 1024, TCA_CAKE_TEST_FLAGS, &testflags, sizeof(testflags));
+
+	tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
+	return 0;
+}
+
+
+static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_CAKE_MAX + 1];
+	unsigned bandwidth = 0;
+	unsigned diffserv = 0;
+	unsigned flowmode = 0;
+	unsigned interval = 0;
+	unsigned memlimit = 0;
+	unsigned testflags = 0;
+	int overhead = 0;
+	int raw = 0;
+	int mpu = 0;
+	int atm = 0;
+	int nat = 0;
+	int autorate = 0;
+	int wash = 0;
+	int ingress = 0;
+	int ack_filter = 0;
+	SPRINT_BUF(b1);
+	SPRINT_BUF(b2);
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_CAKE_MAX, opt);
+
+	if (tb[TCA_CAKE_BASE_RATE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_BASE_RATE]) >= sizeof(__u32)) {
+		bandwidth = rta_getattr_u32(tb[TCA_CAKE_BASE_RATE]);
+		if(bandwidth) {
+			print_uint(PRINT_JSON, "bandwidth", NULL, bandwidth);
+			print_string(PRINT_FP, NULL, "bandwidth %s ", sprint_rate(bandwidth, b1));
+		} else
+			print_string(PRINT_ANY, "bandwidth", "bandwidth %s ", "unlimited");
+	}
+	if (tb[TCA_CAKE_AUTORATE] &&
+		RTA_PAYLOAD(tb[TCA_CAKE_AUTORATE]) >= sizeof(__u32)) {
+		autorate = rta_getattr_u32(tb[TCA_CAKE_AUTORATE]);
+		if(autorate == 1)
+			print_string(PRINT_ANY, "autorate", "autorate_%s ", "ingress");
+		else if(autorate)
+			print_string(PRINT_ANY, "autorate", "(?autorate?) ", "unknown");
+	}
+	if (tb[TCA_CAKE_DIFFSERV_MODE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_DIFFSERV_MODE]) >= sizeof(__u32)) {
+		diffserv = rta_getattr_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+		switch(diffserv) {
+		case CAKE_DIFFSERV_DIFFSERV3:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv3");
+			break;
+		case CAKE_DIFFSERV_DIFFSERV4:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv4");
+			break;
+		case CAKE_DIFFSERV_DIFFSERV8:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv8");
+			break;
+		case CAKE_DIFFSERV_BESTEFFORT:
+			print_string(PRINT_ANY, "diffserv", "%s ", "besteffort");
+			break;
+		case CAKE_DIFFSERV_PRECEDENCE:
+			print_string(PRINT_ANY, "diffserv", "%s ", "precedence");
+			break;
+		default:
+			print_string(PRINT_ANY, "diffserv", "(?diffserv?) ", "unknown");
+			break;
+		};
+	}
+	if (tb[TCA_CAKE_FLOW_MODE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_FLOW_MODE]) >= sizeof(__u32)) {
+		flowmode = rta_getattr_u32(tb[TCA_CAKE_FLOW_MODE]);
+		switch(flowmode) {
+		case CAKE_FLOW_NONE:
+			print_string(PRINT_ANY, "flowmode", "%s ", "flowblind");
+			break;
+		case CAKE_FLOW_SRC_IP:
+			print_string(PRINT_ANY, "flowmode", "%s ", "srchost");
+			break;
+		case CAKE_FLOW_DST_IP:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dsthost");
+			break;
+		case CAKE_FLOW_HOSTS:
+			print_string(PRINT_ANY, "flowmode", "%s ", "hosts");
+			break;
+		case CAKE_FLOW_FLOWS:
+			print_string(PRINT_ANY, "flowmode", "%s ", "flows");
+			break;
+		case CAKE_FLOW_DUAL_SRC:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dual-srchost");
+			break;
+		case CAKE_FLOW_DUAL_DST:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dual-dsthost");
+			break;
+		case CAKE_FLOW_TRIPLE:
+			print_string(PRINT_ANY, "flowmode", "%s ", "triple-isolate");
+			break;
+		default:
+			print_string(PRINT_ANY, "flowmode", "(?flowmode?) ", "unknown");
+			break;
+		};
+
+	}
+
+	if (tb[TCA_CAKE_NAT] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_NAT]) >= sizeof(__u32)) {
+		nat = rta_getattr_u32(tb[TCA_CAKE_NAT]);
+	}
+
+	if(nat)
+		print_string(PRINT_FP, NULL, "nat ", NULL);
+	print_bool(PRINT_JSON, "nat", NULL, nat);
+
+	if (tb[TCA_CAKE_WASH] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_WASH]) >= sizeof(__u32)) {
+		wash = rta_getattr_u32(tb[TCA_CAKE_WASH]);
+	}
+	if (tb[TCA_CAKE_ATM] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_ATM]) >= sizeof(__u32)) {
+		atm = rta_getattr_u32(tb[TCA_CAKE_ATM]);
+	}
+	if (tb[TCA_CAKE_OVERHEAD] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_OVERHEAD]) >= sizeof(__s32)) {
+		overhead = *(__s32 *) RTA_DATA(tb[TCA_CAKE_OVERHEAD]);
+	}
+	if (tb[TCA_CAKE_MPU] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_MPU]) >= sizeof(__u32)) {
+		mpu = rta_getattr_u32(tb[TCA_CAKE_MPU]);
+	}
+	if (tb[TCA_CAKE_INGRESS] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_INGRESS]) >= sizeof(__u32)) {
+		ingress = rta_getattr_u32(tb[TCA_CAKE_INGRESS]);
+	}
+	if (tb[TCA_CAKE_ACK_FILTER] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_ACK_FILTER]) >= sizeof(__u32)) {
+		ack_filter = rta_getattr_u32(tb[TCA_CAKE_ACK_FILTER]);
+	}
+	if (tb[TCA_CAKE_RAW]) {
+		raw = 1;
+	}
+	if (tb[TCA_CAKE_RTT] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_RTT]) >= sizeof(__u32)) {
+		interval = rta_getattr_u32(tb[TCA_CAKE_RTT]);
+	}
+	if (tb[TCA_CAKE_TEST_FLAGS] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_TEST_FLAGS]) >= sizeof(__u32)) {
+		testflags = rta_getattr_u32(tb[TCA_CAKE_TEST_FLAGS]);
+	}
+	print_uint(PRINT_ANY, "testflags", "testflags %u ", testflags);
+
+	if (wash)
+		print_string(PRINT_FP, NULL, "wash ", NULL);
+	print_bool(PRINT_JSON, "wash", NULL, wash);
+
+	if (ingress)
+		print_string(PRINT_FP, NULL, "ingress ", NULL);
+	print_bool(PRINT_JSON, "ingress", NULL, ingress);
+
+	if (ack_filter == CAKE_ACK_AGGRESSIVE)
+		print_string(PRINT_ANY, "ack-filter", "ack-filter-%s ", "aggressive");
+	else if (ack_filter == CAKE_ACK_FILTER)
+		print_string(PRINT_ANY, "ack-filter", "ack-filter ", "enabled");
+	else
+		print_string(PRINT_JSON, "ack-filter", NULL, "disabled");
+
+	if (interval)
+		print_string(PRINT_FP, NULL, "rtt %s ", sprint_time(interval, b2));
+	print_uint(PRINT_JSON, "rtt", NULL, interval);
+
+	if (raw)
+		print_string(PRINT_FP, NULL, "raw ", NULL);
+	print_bool(PRINT_JSON, "raw", NULL, raw);
+
+	if (atm == CAKE_ATM_ATM)
+		print_string(PRINT_ANY, "atm", "%s ", "atm");
+	else if (atm == CAKE_ATM_PTM)
+		print_string(PRINT_ANY, "atm", "%s ", "ptm");
+	else if (!raw)
+		print_string(PRINT_ANY, "atm", "%s ", "noatm");
+
+	print_int(PRINT_ANY, "overhead", "overhead %d ", overhead);
+
+	if (mpu)
+		print_uint(PRINT_ANY, "mpu", "mpu %u ", mpu);
+
+	if (memlimit) {
+		print_uint(PRINT_JSON, "memlimit", NULL, memlimit);
+		print_string(PRINT_FP, NULL, "memlimit %s", sprint_size(memlimit, b1));
+	}
+
+	return 0;
+}
+
+#define FOR_EACH_TIN(xstats, tst, i)				\
+	for (tst = xstats->tin_stats, i = 0;			\
+	     i < xstats->tin_cnt;				\
+	     i++, tst = ((void *) xstats->tin_stats) + xstats->tin_stats_size * i)
+
+static void cake_print_json_tin(struct tc_cake_tin_stats *tst, uint version)
+{
+	open_json_object(NULL);
+	print_uint(PRINT_JSON, "threshold_rate", NULL, tst->threshold_rate);
+	print_uint(PRINT_JSON, "target", NULL, tst->target_us);
+	print_uint(PRINT_JSON, "interval", NULL, tst->interval_us);
+	print_uint(PRINT_JSON, "peak_delay", NULL, tst->peak_delay_us);
+	print_uint(PRINT_JSON, "average_delay", NULL, tst->avge_delay_us);
+	print_uint(PRINT_JSON, "base_delay", NULL, tst->base_delay_us);
+	print_uint(PRINT_JSON, "sent_packets", NULL, tst->sent.packets);
+	print_uint(PRINT_JSON, "sent_bytes", NULL, tst->sent.bytes);
+	print_uint(PRINT_JSON, "way_indirect_hits", NULL, tst->way_indirect_hits);
+	print_uint(PRINT_JSON, "way_misses", NULL, tst->way_misses);
+	print_uint(PRINT_JSON, "way_collisions", NULL, tst->way_collisions);
+	print_uint(PRINT_JSON, "drops", NULL, tst->dropped.packets);
+	print_uint(PRINT_JSON, "ecn_mark", NULL, tst->ecn_marked.packets);
+	print_uint(PRINT_JSON, "ack_drops", NULL, tst->ack_drops.packets);
+	print_uint(PRINT_JSON, "sparse_flows", NULL, tst->sparse_flows);
+	print_uint(PRINT_JSON, "bulk_flows", NULL, tst->bulk_flows);
+	print_uint(PRINT_JSON, "unresponsive_flows", NULL, tst->unresponse_flows);
+	print_uint(PRINT_JSON, "max_pkt_len", NULL, tst->max_skblen);
+	if (version >= 0x102)
+		print_uint(PRINT_JSON, "flow_quantum", NULL, tst->flow_quantum);
+	close_json_object();
+}
+
+static int cake_print_xstats(struct qdisc_util *qu, FILE *f,
+			     struct rtattr *xstats)
+{
+	struct tc_cake_xstats     *stnc;
+	struct tc_cake_tin_stats  *tst;
+	SPRINT_BUF(b1);
+	int i;
+
+	if (xstats == NULL)
+		return 0;
+
+	if (RTA_PAYLOAD(xstats) < sizeof(*stnc))
+		return -1;
+
+	stnc = RTA_DATA(xstats);
+
+	if (stnc->version < 0x101 ||
+	    RTA_PAYLOAD(xstats) < (sizeof(struct tc_cake_xstats) +
+				    stnc->tin_stats_size * stnc->tin_cnt))
+		return -1;
+
+	print_uint(PRINT_JSON, "memory_used", NULL, stnc->memory_used);
+	print_uint(PRINT_JSON, "memory_limit", NULL, stnc->memory_limit);
+	print_uint(PRINT_JSON, "capacity_estimate", NULL, stnc->capacity_estimate);
+
+	print_string(PRINT_FP, NULL, " memory used: %s",
+		sprint_size(stnc->memory_used, b1));
+	print_string(PRINT_FP, NULL, " of %s\n",
+		sprint_size(stnc->memory_limit, b1));
+	print_string(PRINT_FP, NULL, " capacity estimate: %s\n",
+		sprint_rate(stnc->capacity_estimate, b1));
+
+	print_uint(PRINT_ANY, "min_network_size", " min/max network layer size: %12" PRIu64,
+		stnc->min_netlen);
+	print_uint(PRINT_ANY, "max_network_size", " /%8" PRIu64 "\n", stnc->max_netlen);
+	print_uint(PRINT_ANY, "min_adj_size", " min/max overhead-adjusted size: %8" PRIu64,
+		stnc->min_adjlen);
+	print_uint(PRINT_ANY, "max_adj_size", " /%8" PRIu64 "\n", stnc->max_adjlen);
+	print_uint(PRINT_ANY, "avg_hdr_offset", " average network hdr offset: %12" PRIu64 "\n\n",
+		stnc->avg_trnoff);
+
+	if (is_json_context()) {
+		open_json_array(PRINT_JSON, "tins");
+		FOR_EACH_TIN(stnc, tst, i)
+			cake_print_json_tin(tst, stnc->version);
+		close_json_array(PRINT_JSON, NULL);
+		return 0;
+	}
+
+
+	switch(stnc->tin_cnt) {
+	case 3:
+		fprintf(f, "                   Bulk  Best Effort        Voice\n");
+		break;
+
+	case 4:
+		fprintf(f, "                   Bulk  Best Effort        Video        Voice\n");
+		break;
+
+	case 5:
+		fprintf(f, "               Low Loss  Best Effort    Low Delay         Bulk  Net Control\n");
+		break;
+
+	default:
+		fprintf(f, "          ");
+		for(i=0; i < stnc->tin_cnt; i++)
+			fprintf(f, "       Tin %u", i);
+		fprintf(f, "\n");
+	};
+
+	fprintf(f, "  thresh  ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_rate(tst->threshold_rate, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  target  ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->target_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  interval");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->interval_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  pk_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->peak_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  av_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->avge_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  sp_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->base_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  pkts    ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->sent.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  bytes   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12llu", tst->sent.bytes);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_inds");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_indirect_hits);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_miss");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_misses);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_cols");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_collisions);
+	fprintf(f, "\n");
+
+	fprintf(f, "  drops   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->dropped.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  marks   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->ecn_marked.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  ack_drop");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->ack_drops.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  sp_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->sparse_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  bk_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->bulk_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  un_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->unresponse_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  max_len ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->max_skblen);
+	fprintf(f, "\n");
+
+	if (stnc->version >= 0x102) {
+		fprintf(f, "  quantum ");
+		FOR_EACH_TIN(stnc, tst, i)
+			fprintf(f, " %12u", tst->flow_quantum);
+		fprintf(f, "\n");
+	}
+
+	return 0;
+}
+
+struct qdisc_util cake_qdisc_util = {
+	.id		= "cake",
+	.parse_qopt	= cake_parse_opt,
+	.print_qopt	= cake_print_opt,
+	.print_xstats	= cake_print_xstats,
+};
-- 
2.17.0


* [PATCH iproute2-next v3] Add support for cake qdisc
  2018-04-24 11:44   ` [PATCH iproute2-next v2] Add support for cake qdisc Toke Høiland-Jørgensen
@ 2018-04-24 12:30     ` Toke Høiland-Jørgensen
  2018-04-24 14:44       ` [Cake] " Stephen Hemminger
  2018-04-24 14:45       ` Stephen Hemminger
  0 siblings, 2 replies; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2018-04-24 12:30 UTC (permalink / raw)
  To: netdev; +Cc: cake, Toke Høiland-Jørgensen, Dave Taht

sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8-way set-associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency variation.

Various versions have been baking as an out-of-tree build for kernel
versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing was contributed by Pete Heist, Georgios Amanakis, and the many other
members of the cake@lists.bufferbloat.net mailing list.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
Changes since v2:
  - Accidentally sent a version of the tc patch from my testing branch
    which includes a test flag that shouldn't be in there. Sorry for the
    mistake.
  
 man/man8/tc-cake.8 | 632 +++++++++++++++++++++++++++++++++++++++++++
 man/man8/tc.8      |   1 +
 tc/Makefile        |   1 +
 tc/q_cake.c        | 778 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1412 insertions(+)
 create mode 100644 man/man8/tc-cake.8
 create mode 100644 tc/q_cake.c

diff --git a/man/man8/tc-cake.8 b/man/man8/tc-cake.8
new file mode 100644
index 00000000..30e41bc9
--- /dev/null
+++ b/man/man8/tc-cake.8
@@ -0,0 +1,632 @@
+.TH CAKE 8 "23 November 2017" "iproute2" "Linux"
+.SH NAME
+CAKE \- Common Applications Kept Enhanced (CAKE)
+.SH SYNOPSIS
+.B tc qdisc ... cake
+.br
+[
+.BR bandwidth
+RATE |
+.BR unlimited*
+|
+.BR autorate_ingress
+]
+.br
+[
+.BR rtt
+TIME |
+.BR datacentre
+|
+.BR lan
+|
+.BR metro
+|
+.BR regional
+|
+.BR internet*
+|
+.BR oceanic
+|
+.BR satellite
+|
+.BR interplanetary
+]
+.br
+[
+.BR besteffort
+|
+.BR diffserv8
+|
+.BR diffserv4
+|
+.BR diffserv3*
+]
+.br
+[
+.BR flowblind
+|
+.BR srchost
+|
+.BR dsthost
+|
+.BR hosts
+|
+.BR flows
+|
+.BR dual-srchost
+|
+.BR dual-dsthost
+|
+.BR triple-isolate*
+]
+.br
+[
+.BR nat
+|
+.BR nonat*
+]
+.br
+[
+.BR wash
+|
+.BR nowash*
+]
+.br
+[
+.BR ack-filter
+|
+.BR ack-filter-aggressive
+|
+.BR no-ack-filter*
+]
+.br
+[
+.BR memlimit
+LIMIT ]
+.br
+[
+.BR ptm
+|
+.BR atm
+|
+.BR noatm*
+]
+.br
+[
+.BR overhead
+N |
+.BR conservative
+|
+.BR raw*
+]
+.br
+[
+.BR mpu
+N ]
+.br
+[
+.BR ingress
+|
+.BR egress*
+]
+.br
+(* marks defaults)
+
+
+.SH DESCRIPTION
+CAKE (Common Applications Kept Enhanced) is a shaping-capable queue discipline
+which uses both AQM and FQ.  It combines COBALT, which is an AQM algorithm
+combining Codel and BLUE, a shaper which operates in deficit mode, and a variant
+of DRR++ for flow isolation.  8-way set-associative hashing is used to virtually
+eliminate hash collisions.  Priority queuing is available through a simplified
+diffserv implementation.  Overhead compensation for various encapsulation
+schemes is tightly integrated.
+
+All settings are optional; the default settings are chosen to be sensible in
+most common deployments.  Most people will only need to set the
+.B bandwidth
+parameter to get useful results, but reading the
+.B Overhead Compensation
+and
+.B Round Trip Time
+sections is strongly encouraged.
+
+.SH SHAPER PARAMETERS
+CAKE uses a deficit-mode shaper, which does not exhibit the initial burst
+typical of token-bucket shapers.  It will automatically burst precisely as much
+as required to maintain the configured throughput.  As such, it is very
+straightforward to configure.
+.PP
+.B unlimited
+(default)
+.br
+	No limit on the bandwidth.
+.PP
+.B bandwidth
+RATE
+.br
+	Set the shaper bandwidth.  See
+.BR tc(8)
+or examples below for details of the RATE value.
+.PP
+.B autorate_ingress
+.br
+	Automatic capacity estimation based on traffic arriving at this qdisc.
+This is most likely to be useful with cellular links, which tend to change
+quality randomly.  A
+.B bandwidth
+parameter can be used in conjunction to specify an initial estimate.  The shaper
+will periodically be set to a bandwidth slightly below the estimated rate.  This
+estimator cannot estimate the bandwidth of links downstream of itself.
+
+.SH OVERHEAD COMPENSATION PARAMETERS
+The size of each packet on the wire may differ from that seen by Linux.  The
+following parameters allow CAKE to compensate for this difference by internally
+considering each packet to be bigger than Linux reports it to be.  To assist
+users who are not expert network engineers, keywords have been provided to
+represent a number of common link technologies.
+
+.SS	Manual Overhead Specification
+.B overhead
+BYTES
+.br
+	Adds BYTES to the size of each packet.  BYTES may be negative; values
+between -64 and 256 (inclusive) are accepted.
+.PP
+.B mpu
+BYTES
+.br
+	Rounds each packet (including overhead) up to a minimum length of
+BYTES.  BYTES may not be negative; values between 0 and 256 (inclusive)
+are accepted.
+.PP
+.B atm
+.br
+	Compensates for ATM cell framing, which is normally found on ADSL links.
+This is performed after the
+.B overhead
+parameter above.  ATM uses fixed 53-byte cells, each of which can carry 48 bytes
+payload.
+.PP
+.B ptm
+.br
+	Compensates for PTM encoding, which is normally found on VDSL2 links and
+uses a 64b/65b encoding scheme. It is even more efficient to simply
+derate the specified shaper bandwidth by a factor of 64/65 or 0.984. See
+ITU G.992.3 Annex N and IEEE 802.3 Section 61.3 for details.
+.PP
+.B noatm
+.br
+	Disables ATM and PTM compensation.
+
+.SS	Failsafe Overhead Keywords
+These two keywords are provided for quick-and-dirty setup.  Use them if you
+can't be bothered to read the rest of this section.
+.PP
+.B raw
+(default)
+.br
+	Turns off all overhead compensation in CAKE.  The packet size reported
+by Linux will be used directly.
+.PP
+	Other overhead keywords may be added after "raw".  The effect of this is
+to make the overhead compensation operate relative to the reported packet size,
+not the underlying IP packet size.
+.PP
+.B conservative
+.br
+	Compensates for more overhead than is likely to occur on any
+widely-deployed link technology.
+.br
+	Equivalent to
+.B overhead 48 atm.
+
+.SS ADSL Overhead Keywords
+Most ADSL modems have a way to check which framing scheme is in use.  Often this
+is also specified in the settings document provided by the ISP.  The keywords in
+this section are intended to correspond with these sources of information.  All
+of them implicitly set the
+.B atm
+flag.
+.PP
+.B pppoa-vcmux
+.br
+	Equivalent to
+.B overhead 10 atm
+.PP
+.B pppoa-llc
+.br
+	Equivalent to
+.B overhead 14 atm
+.PP
+.B pppoe-vcmux
+.br
+	Equivalent to
+.B overhead 32 atm
+.PP
+.B pppoe-llcsnap
+.br
+	Equivalent to
+.B overhead 40 atm
+.PP
+.B bridged-vcmux
+.br
+	Equivalent to
+.B overhead 24 atm
+.PP
+.B bridged-llcsnap
+.br
+	Equivalent to
+.B overhead 32 atm
+.PP
+.B ipoa-vcmux
+.br
+	Equivalent to
+.B overhead 8 atm
+.PP
+.B ipoa-llcsnap
+.br
+	Equivalent to
+.B overhead 16 atm
+.PP
+See also the Ethernet Correction Factors section below.
+
+.SS VDSL2 Overhead Keywords
+ATM was dropped from VDSL2 in favour of PTM, which is a much more
+straightforward framing scheme.  Some ISPs retained PPPoE for compatibility with
+their existing back-end systems.
+.PP
+.B pppoe-ptm
+.br
+	Equivalent to
+.B overhead 30 ptm
+
+.br
+	PPPoE: 2B PPP + 6B PPPoE +
+.br
+	ETHERNET: 6B dest MAC + 6B src MAC + 2B ethertype + 4B Frame Check Sequence +
+.br
+	PTM: 1B Start of Frame (S) + 1B End of Frame (Ck) + 2B TC-CRC (PTM-FCS)
+.br
+.PP
+.B bridged-ptm
+.br
+	Equivalent to
+.B overhead 22 ptm
+.br
+	ETHERNET: 6B dest MAC + 6B src MAC + 2B ethertype + 4B Frame Check Sequence +
+.br
+	PTM: 1B Start of Frame (S) + 1B End of Frame (Ck) + 2B TC-CRC (PTM-FCS)
+.br
+.PP
+See also the Ethernet Correction Factors section below.
+
+.SS DOCSIS Cable Overhead Keyword
+DOCSIS is the universal standard for providing Internet service over cable-TV
+infrastructure.
+
+In this case, the actual on-wire overhead is less important than the packet size
+the head-end equipment uses for shaping and metering.  This is specified to be
+an Ethernet frame including the CRC (aka FCS).
+.PP
+.B docsis
+.br
+	Equivalent to
+.B overhead 18 mpu 64 noatm
+
+.SS Ethernet Overhead Keywords
+.PP
+.B ethernet
+.br
+	Accounts for Ethernet's preamble, inter-frame gap, and Frame Check
+Sequence.  Use this keyword when the bottleneck being shaped for is an
+actual Ethernet cable.
+.br
+	Equivalent to
+.B overhead 38 mpu 84 noatm
+.PP
+.B ether-vlan
+.br
+	Adds 4 bytes to the overhead compensation, accounting for an IEEE 802.1Q
+VLAN header appended to the Ethernet frame header.  NB: Some ISPs use one or
+even two of these within PPPoE; this keyword may be repeated as necessary to
+express this.
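As a hypothetical illustration (the interface name and rate are placeholders), a VDSL2 PPPoE link carrying two stacked VLAN tags could be configured by repeating the keyword:

```shell
# pppoe-ptm adds 30 bytes; each ether-vlan keyword adds 4 more (38 total)
tc qdisc add root dev eth0 cake bandwidth 20Mbit pppoe-ptm ether-vlan ether-vlan
```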
+
+.SH ROUND TRIP TIME PARAMETERS
+Active Queue Management (AQM) consists of embedding congestion signals in the
+packet flow, which receivers use to instruct senders to slow down when the queue
+is persistently occupied.  CAKE uses ECN signalling when available, and packet
+drops otherwise, according to a combination of the Codel and BLUE AQM algorithms
+called COBALT.
+
+Very short latencies require a very rapid AQM response to adequately control
+latency.  However, such a rapid response tends to impair throughput when the
+actual RTT is relatively long.  CAKE allows specifying the RTT it assumes for
+tuning various parameters.  Actual RTTs within an order of magnitude of this
+will generally work well for both throughput and latency management.
+
+At the 'lan' setting and below, the time constants are similar in magnitude to
+the jitter in the Linux kernel itself, so congestion might be signalled
+prematurely. The flows will then become sparse, and total throughput will be
+reduced, leaving little or no back-pressure for the fairness logic to work
+against. Use the "metro" setting for local LANs unless you have a custom kernel.
+.PP
+.B rtt
+TIME
+.br
+	Manually specify an RTT.
+.PP
+.B datacentre
+.br
+	For extremely high-performance 10GigE+ networks only.  Equivalent to
+.B rtt 100us.
+.PP
+.B lan
+.br
+	For pure Ethernet (not Wi-Fi) networks, at home or in the office.  Don't
+use this when shaping for an Internet access link.  Equivalent to
+.B rtt 1ms.
+.PP
+.B metro
+.br
+	For traffic mostly within a single city.  Equivalent to
+.B rtt 10ms.
+.PP
+.B regional
+.br
+	For traffic mostly within a European-sized country.  Equivalent to
+.B rtt 30ms.
+.PP
+.B internet
+(default)
+.br
+	This is suitable for most Internet traffic.  Equivalent to
+.B rtt 100ms.
+.PP
+.B oceanic
+.br
+	For Internet traffic with generally above-average latency, such as that
+suffered by Australasian residents.  Equivalent to
+.B rtt 300ms.
+.PP
+.B satellite
+.br
+	For traffic via geostationary satellites.  Equivalent to
+.B rtt 1000ms.
+.PP
+.B interplanetary
+.br
+	So named because Jupiter is about 1 light-hour from Earth.  Use this to
+(almost) completely disable AQM actions.  Equivalent to
+.B rtt 3600s.
+
+.SH FLOW ISOLATION PARAMETERS
+With flow isolation enabled, CAKE places packets from different flows into
+different queues, each of which carries its own AQM state.  Packets from each
+queue are then delivered fairly, according to a DRR++ algorithm which minimises
+latency for "sparse" flows.  CAKE uses a set-associative hashing algorithm to
+minimise flow collisions.
+
+These keywords specify whether fairness based on source address, destination
+address, individual flows, or any combination of those is desired.
+.PP
+.B flowblind
+.br
+	Disables flow isolation; all traffic passes through a single queue for
+each tin.
+.PP
+.B srchost
+.br
+	Flows are defined only by source address.  Could be useful on the egress
+path of an ISP backhaul.
+.PP
+.B dsthost
+.br
+	Flows are defined only by destination address.  Could be useful on the
+ingress path of an ISP backhaul.
+.PP
+.B hosts
+.br
+	Flows are defined by source-destination host pairs.  This is host
+isolation, rather than flow isolation.
+.PP
+.B flows
+.br
+	Flows are defined by the entire 5-tuple of source address, destination
+address, transport protocol, source port and destination port.  This is the type
+of flow isolation performed by SFQ and fq_codel.
+.PP
+.B dual-srchost
+.br
+	Flows are defined by the 5-tuple, and fairness is applied first over
+source addresses, then over individual flows.  Good for use on egress traffic
+from a LAN to the internet, where it'll prevent any one LAN host from
+monopolising the uplink, regardless of the number of flows they use.
+.PP
+.B dual-dsthost
+.br
+	Flows are defined by the 5-tuple, and fairness is applied first over
+destination addresses, then over individual flows.  Good for use on ingress
+traffic to a LAN from the internet, where it'll prevent any one LAN host from
+monopolising the downlink, regardless of the number of flows they use.
+.PP
+.B triple-isolate
+(default)
+.br
+	Flows are defined by the 5-tuple, and fairness is applied over source
+*and* destination addresses intelligently (i.e. not merely by host-pairs), and
+also over individual flows.  Use this if you're not certain whether to use
+dual-srchost or dual-dsthost; it'll do both jobs at once, preventing any one
+host on *either* side of the link from monopolising it with a large number of
+flows.
+.PP
+.B nat
+.br
+	Instructs Cake to perform a NAT lookup before applying flow-isolation
+rules, to determine the true addresses and port numbers of the packet, to
+improve fairness between hosts "inside" the NAT.  This has no practical effect
+in "flowblind" or "flows" modes, or if NAT is performed on a different host.
+.PP
+.B nonat
+(default)
+.br
+	Cake will not perform a NAT lookup.  Flow isolation will be performed
+using the addresses and port numbers directly visible to the interface Cake is
+attached to.
+
+.SH PRIORITY QUEUE PARAMETERS
+CAKE can divide traffic into "tins" based on the Diffserv field.  Each tin has
+its own independent set of flow-isolation queues, and is serviced based on a WRR
+algorithm.  To avoid perverse Diffserv marking incentives, tin weights have a
+"priority sharing" value when bandwidth used by that tin is below a threshold,
+and a lower "bandwidth sharing" value when above.  Bandwidth is compared against
+the threshold using the same algorithm as the deficit-mode shaper.
+
+Detailed customisation of tin parameters is not provided.  The following presets
+perform all necessary tuning, relative to the current shaper bandwidth and RTT
+settings.
+.PP
+.B besteffort
+.br
+	Disables priority queuing by placing all traffic in one tin.
+.PP
+.B precedence
+.br
+	Enables legacy interpretation of the TOS "Precedence" field.  Use of this
+preset on the modern Internet is firmly discouraged.
+.PP
+.B diffserv4
+.br
+	Provides a general-purpose Diffserv implementation with four tins:
+.br
+		Bulk (CS1), 6.25% threshold, generally low priority.
+.br
+		Best Effort (general), 100% threshold.
+.br
+		Video (AF4x, AF3x, CS3, AF2x, CS2, TOS4, TOS1), 50% threshold.
+.br
+		Voice (CS7, CS6, EF, VA, CS5, CS4), 25% threshold.
+.PP
+.B diffserv3
+(default)
+.br
+	Provides a simple, general-purpose Diffserv implementation with three tins:
+.br
+		Bulk (CS1), 6.25% threshold, generally low priority.
+.br
+		Best Effort (general), 100% threshold.
+.br
+		Voice (CS7, CS6, EF, VA, TOS4), 25% threshold, reduced Codel interval.
+
+.SH OTHER PARAMETERS
+.B memlimit
+LIMIT
+.br
+	Limit the memory consumed by Cake to LIMIT bytes. Note that this does
+not translate directly to queue size (so do not size this based on bandwidth
+delay product considerations, but rather on worst case acceptable memory
+consumption), as there is some overhead in the data structures containing the
+packets, especially for small packets.
+
+	By default, the limit is calculated based on the bandwidth and RTT
+settings.
+
+.PP
+.B wash
+
+.br
+	Traffic entering your diffserv domain is frequently mis-marked in
+transit from the perspective of your network, and traffic exiting yours may be
+mis-marked from the perspective of the transiting provider.
+
+Apply the wash option to clear all extra diffserv markings (but not ECN bits)
+after priority queuing has taken place.
+
+If you are shaping inbound, and cannot trust the diffserv markings (as is the
+case for Comcast Cable, among others), it is best to use a single queue
+"besteffort" mode with wash.
+
+.SH EXAMPLES
+# tc qdisc delete root dev eth0
+.br
+# tc qdisc add root dev eth0 cake bandwidth 100Mbit ethernet
+.br
+# tc -s qdisc show dev eth0
+.br
+qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms noatm overhead 38 mpu 84 
+ Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
+ backlog 0b 0p requeues 0
+ memory used: 0b of 5000000b
+ capacity estimate: 100Mbit
+ min/max network layer size:        65535 /       0
+ min/max overhead-adjusted size:    65535 /       0
+ average network hdr offset:            0
+
+                   Bulk  Best Effort        Voice
+  thresh       6250Kbit      100Mbit       25Mbit
+  target          5.0ms        5.0ms        5.0ms
+  interval      100.0ms      100.0ms      100.0ms
+  pk_delay          0us          0us          0us
+  av_delay          0us          0us          0us
+  sp_delay          0us          0us          0us
+  pkts                0            0            0
+  bytes               0            0            0
+  way_inds            0            0            0
+  way_miss            0            0            0
+  way_cols            0            0            0
+  drops               0            0            0
+  marks               0            0            0
+  ack_drop            0            0            0
+  sp_flows            0            0            0
+  bk_flows            0            0            0
+  un_flows            0            0            0
+  max_len             0            0            0
+  quantum           300         1514          762
+
+After some use:
+.br
+# tc -s qdisc show dev eth0
+
+qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms noatm overhead 38 mpu 84 
+ Sent 44709231 bytes 31931 pkt (dropped 45, overlimits 93782 requeues 0) 
+ backlog 33308b 22p requeues 0
+ memory used: 292352b of 5000000b
+ capacity estimate: 100Mbit
+ min/max network layer size:           28 /    1500
+ min/max overhead-adjusted size:       84 /    1538
+ average network hdr offset:           14
+
+                   Bulk  Best Effort        Voice
+  thresh       6250Kbit      100Mbit       25Mbit
+  target          5.0ms        5.0ms        5.0ms
+  interval      100.0ms      100.0ms      100.0ms
+  pk_delay        8.7ms        6.9ms        5.0ms
+  av_delay        4.9ms        5.3ms        3.8ms
+  sp_delay        727us        1.4ms        511us
+  pkts             2590        21271         8137
+  bytes         3081804     30302659     11426206
+  way_inds            0           46            0
+  way_miss            3           17            4
+  way_cols            0            0            0
+  drops              20           15           10
+  marks               0            0            0
+  ack_drop            0            0            0
+  sp_flows            2            4            1
+  bk_flows            1            2            1
+  un_flows            0            0            0
+  max_len          1514         1514         1514
+  quantum           300         1514          762
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-codel (8),
+.BR tc-fq_codel (8),
+.BR tc-red (8)
+
+.SH AUTHORS
+Cake's principal author is Jonathan Morton, with contributions from
+Tony Ambardar, Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen,
+Sebastian Moeller, Ryan Mounce, Dean Scarff, Nils Andreas Svee, and Dave Täht.
+
+This manual page was written by Loganaden Velvindron. Please report corrections
+to the Linux Networking mailing list <netdev@vger.kernel.org>.
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 840880fb..716dfec5 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -795,6 +795,7 @@ was written by Alexey N. Kuznetsov and added in Linux 2.2.
 .BR tc-basic (8),
 .BR tc-bfifo (8),
 .BR tc-bpf (8),
+.BR tc-cake (8),
 .BR tc-cbq (8),
 .BR tc-cgroup (8),
 .BR tc-choke (8),
diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..d9a43568 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -66,6 +66,7 @@ TCMODULES += q_codel.o
 TCMODULES += q_fq_codel.o
 TCMODULES += q_fq.o
 TCMODULES += q_pie.o
+TCMODULES += q_cake.o
 TCMODULES += q_hhf.o
 TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
diff --git a/tc/q_cake.c b/tc/q_cake.c
new file mode 100644
index 00000000..12263361
--- /dev/null
+++ b/tc/q_cake.c
@@ -0,0 +1,778 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Common Applications Kept Enhanced  --  CAKE
+ *
+ *  Copyright (C) 2014-2018 Jonathan Morton <chromatix99@gmail.com>
+ *  Copyright (C) 2017-2018 Toke Høiland-Jørgensen <toke@toke.dk>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions, and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ *    derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ *
+ */
+
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+"Usage: ... cake [ bandwidth RATE | unlimited* | autorate_ingress ]\n"
+"                [ rtt TIME | datacentre | lan | metro | regional |\n"
+"                  internet* | oceanic | satellite | interplanetary ]\n"
+"                [ besteffort | diffserv8 | diffserv4 | diffserv3* ]\n"
+"                [ flowblind | srchost | dsthost | hosts | flows |\n"
+"                  dual-srchost | dual-dsthost | triple-isolate* ]\n"
+"                [ nat | nonat* ]\n"
+"                [ wash | nowash* ]\n"
+"                [ ack-filter | ack-filter-aggressive | no-ack-filter* ]\n"
+"                [ memlimit LIMIT ]\n"
+"                [ ptm | atm | noatm* ] [ overhead N | conservative | raw* ]\n"
+"                [ mpu N ] [ ingress | egress* ]\n"
+"                (* marks defaults)\n");
+}
+
+static int cake_parse_opt(struct qdisc_util *qu, int argc, char **argv,
+			  struct nlmsghdr *n, const char *dev)
+{
+	int unlimited = 0;
+	unsigned bandwidth = 0;
+	unsigned interval = 0;
+	unsigned target = 0;
+	unsigned diffserv = 0;
+	unsigned memlimit = 0;
+	int  overhead = 0;
+	bool overhead_set = false;
+	bool overhead_override = false;
+	int mpu = 0;
+	int flowmode = -1;
+	int nat = -1;
+	int atm = -1;
+	int autorate = -1;
+	int wash = -1;
+	int ingress = -1;
+	int ack_filter = -1;
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "bandwidth") == 0) {
+			NEXT_ARG();
+			if (get_rate(&bandwidth, *argv)) {
+				fprintf(stderr, "Illegal \"bandwidth\"\n");
+				return -1;
+			}
+			unlimited = 0;
+			autorate = 0;
+		} else if (strcmp(*argv, "unlimited") == 0) {
+			bandwidth = 0;
+			unlimited = 1;
+			autorate = 0;
+		} else if (strcmp(*argv, "autorate_ingress") == 0) {
+			autorate = 1;
+
+		} else if (strcmp(*argv, "rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&interval, *argv)) {
+				fprintf(stderr, "Illegal \"rtt\"\n");
+				return -1;
+			}
+			target = interval / 20;
+			if (!target)
+				target = 1;
+		} else if (strcmp(*argv, "datacentre") == 0) {
+			interval = 100;
+			target   =   5;
+		} else if (strcmp(*argv, "lan") == 0) {
+			interval = 1000;
+			target   =   50;
+		} else if (strcmp(*argv, "metro") == 0) {
+			interval = 10000;
+			target   =   500;
+		} else if (strcmp(*argv, "regional") == 0) {
+			interval = 30000;
+			target   =   1500;
+		} else if (strcmp(*argv, "internet") == 0) {
+			interval = 100000;
+			target   =   5000;
+		} else if (strcmp(*argv, "oceanic") == 0) {
+			interval = 300000;
+			target   =  15000;
+		} else if (strcmp(*argv, "satellite") == 0) {
+			interval = 1000000;
+			target   =   50000;
+		} else if (strcmp(*argv, "interplanetary") == 0) {
+			interval = 1000000000;
+			target   =   50000000;
+
+		} else if (strcmp(*argv, "besteffort") == 0) {
+			diffserv = CAKE_DIFFSERV_BESTEFFORT;
+		} else if (strcmp(*argv, "precedence") == 0) {
+			diffserv = CAKE_DIFFSERV_PRECEDENCE;
+		} else if (strcmp(*argv, "diffserv8") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV8;
+		} else if (strcmp(*argv, "diffserv4") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV4;
+		} else if (strcmp(*argv, "diffserv") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV4;
+		} else if (strcmp(*argv, "diffserv3") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV3;
+
+		} else if (strcmp(*argv, "nowash") == 0) {
+			wash = 0;
+		} else if (strcmp(*argv, "wash") == 0) {
+			wash = 1;
+
+		} else if (strcmp(*argv, "flowblind") == 0) {
+			flowmode = CAKE_FLOW_NONE;
+		} else if (strcmp(*argv, "srchost") == 0) {
+			flowmode = CAKE_FLOW_SRC_IP;
+		} else if (strcmp(*argv, "dsthost") == 0) {
+			flowmode = CAKE_FLOW_DST_IP;
+		} else if (strcmp(*argv, "hosts") == 0) {
+			flowmode = CAKE_FLOW_HOSTS;
+		} else if (strcmp(*argv, "flows") == 0) {
+			flowmode = CAKE_FLOW_FLOWS;
+		} else if (strcmp(*argv, "dual-srchost") == 0) {
+			flowmode = CAKE_FLOW_DUAL_SRC;
+		} else if (strcmp(*argv, "dual-dsthost") == 0) {
+			flowmode = CAKE_FLOW_DUAL_DST;
+		} else if (strcmp(*argv, "triple-isolate") == 0) {
+			flowmode = CAKE_FLOW_TRIPLE;
+
+		} else if (strcmp(*argv, "nat") == 0) {
+			nat = 1;
+		} else if (strcmp(*argv, "nonat") == 0) {
+			nat = 0;
+
+		} else if (strcmp(*argv, "ptm") == 0) {
+			atm = CAKE_ATM_PTM;
+		} else if (strcmp(*argv, "atm") == 0) {
+			atm = CAKE_ATM_ATM;
+		} else if (strcmp(*argv, "noatm") == 0) {
+			atm = CAKE_ATM_NONE;
+
+		} else if (strcmp(*argv, "raw") == 0) {
+			atm = CAKE_ATM_NONE;
+			overhead = 0;
+			overhead_set = true;
+			overhead_override = true;
+		} else if (strcmp(*argv, "conservative") == 0) {
+			/*
+			 * Deliberately over-estimate overhead:
+			 * one whole ATM cell plus ATM framing.
+			 * A safe choice if the actual overhead is unknown.
+			 */
+			atm = CAKE_ATM_ATM;
+			overhead = 48;
+			overhead_set = true;
+
+		/* Various ADSL framing schemes, all over ATM cells */
+		} else if (strcmp(*argv, "ipoa-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 8;
+			overhead_set = true;
+		} else if (strcmp(*argv, "ipoa-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 16;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 24;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 32;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoa-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 10;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoa-llc") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 14;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoe-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 32;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoe-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 40;
+			overhead_set = true;
+
+		/* Typical VDSL2 framing schemes, both over PTM */
+		/* PTM has 64b/65b coding which absorbs some bandwidth */
+		} else if (strcmp(*argv, "pppoe-ptm") == 0) {
+			/* 2B PPP + 6B PPPoE + 6B dest MAC + 6B src MAC
+			 * + 2B ethertype + 4B Frame Check Sequence
+			 * + 1B Start of Frame (S) + 1B End of Frame (Ck)
+			 * + 2B TC-CRC (PTM-FCS) = 30B
+			 */
+			atm = CAKE_ATM_PTM;
+			overhead += 30;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-ptm") == 0) {
+			/* 6B dest MAC + 6B src MAC + 2B ethertype
+			 * + 4B Frame Check Sequence
+			 * + 1B Start of Frame (S) + 1B End of Frame (Ck)
+			 * + 2B TC-CRC (PTM-FCS) = 22B
+			 */
+			atm = CAKE_ATM_PTM;
+			overhead += 22;
+			overhead_set = true;
+
+		} else if (strcmp(*argv, "via-ethernet") == 0) {
+			/*
+			 * We used to use this flag to manually compensate for
+			 * Linux including the Ethernet header on Ethernet-type
+			 * interfaces, but not on IP-type interfaces.
+			 *
+			 * It is no longer needed, because Cake now adjusts for
+			 * that automatically, so the keyword is accepted but
+			 * ignored.
+			 *
+			 * It is kept here only because it still appears in the
+			 * stats output when the automatic compensation is
+			 * active.
+			 */
+
+		} else if (strcmp(*argv, "ethernet") == 0) {
+			/* Ethernet preamble, interframe gap & FCS;
+			 * add "ether-vlan" if a VLAN tag is present. */
+			overhead += 38;
+			overhead_set = true;
+			mpu = 84;
+
+		/* Additional Ethernet-related overhead used by some ISPs */
+		} else if (strcmp(*argv, "ether-vlan") == 0) {
+			/* 802.1q VLAN tag - may be repeated */
+			overhead += 4;
+			overhead_set = true;
+
+		/*
+		 * DOCSIS cable shapers account for Ethernet frame with FCS,
+		 * but not interframe gap or preamble.
+		 */
+		} else if (strcmp(*argv, "docsis") == 0) {
+			atm = CAKE_ATM_NONE;
+			overhead += 18;
+			overhead_set = true;
+			mpu = 64;
+
+		} else if (strcmp(*argv, "overhead") == 0) {
+			char *p = NULL;
+			NEXT_ARG();
+			overhead = strtol(*argv, &p, 10);
+			if (!p || *p || !*argv || overhead < -64 || overhead > 256) {
+				fprintf(stderr, "Illegal \"overhead\", valid range is -64 to 256\n");
+				return -1;
+			}
+			overhead_set = true;
+
+		} else if (strcmp(*argv, "mpu") == 0) {
+			char *p = NULL;
+			NEXT_ARG();
+			mpu = strtol(*argv, &p, 10);
+			if (!p || *p || !*argv || mpu < 0 || mpu > 256) {
+				fprintf(stderr, "Illegal \"mpu\", valid range is 0 to 256\n");
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "ingress") == 0) {
+			ingress = 1;
+		} else if (strcmp(*argv, "egress") == 0) {
+			ingress = 0;
+
+		} else if (strcmp(*argv, "no-ack-filter") == 0) {
+			ack_filter = CAKE_ACK_NONE;
+		} else if (strcmp(*argv, "ack-filter") == 0) {
+			ack_filter = CAKE_ACK_FILTER;
+		} else if (strcmp(*argv, "ack-filter-aggressive") == 0) {
+			ack_filter = CAKE_ACK_AGGRESSIVE;
+
+		} else if (strcmp(*argv, "memlimit") == 0) {
+			NEXT_ARG();
+			if (get_size(&memlimit, *argv)) {
+				fprintf(stderr, "Illegal value for \"memlimit\": \"%s\"\n", *argv);
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--; argv++;
+	}
+
+	tail = NLMSG_TAIL(n);
+	addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
+	if (bandwidth || unlimited)
+		addattr_l(n, 1024, TCA_CAKE_BASE_RATE, &bandwidth, sizeof(bandwidth));
+	if (diffserv)
+		addattr_l(n, 1024, TCA_CAKE_DIFFSERV_MODE, &diffserv, sizeof(diffserv));
+	if (atm != -1)
+		addattr_l(n, 1024, TCA_CAKE_ATM, &atm, sizeof(atm));
+	if (flowmode != -1)
+		addattr_l(n, 1024, TCA_CAKE_FLOW_MODE, &flowmode, sizeof(flowmode));
+	if (overhead_set)
+		addattr_l(n, 1024, TCA_CAKE_OVERHEAD, &overhead, sizeof(overhead));
+	if (overhead_override) {
+		unsigned zero = 0;
+		addattr_l(n, 1024, TCA_CAKE_RAW, &zero, sizeof(zero));
+	}
+	if (mpu > 0)
+		addattr_l(n, 1024, TCA_CAKE_MPU, &mpu, sizeof(mpu));
+	if (interval)
+		addattr_l(n, 1024, TCA_CAKE_RTT, &interval, sizeof(interval));
+	if (target)
+		addattr_l(n, 1024, TCA_CAKE_TARGET, &target, sizeof(target));
+	if (autorate != -1)
+		addattr_l(n, 1024, TCA_CAKE_AUTORATE, &autorate, sizeof(autorate));
+	if (memlimit)
+		addattr_l(n, 1024, TCA_CAKE_MEMORY, &memlimit, sizeof(memlimit));
+	if (nat != -1)
+		addattr_l(n, 1024, TCA_CAKE_NAT, &nat, sizeof(nat));
+	if (wash != -1)
+		addattr_l(n, 1024, TCA_CAKE_WASH, &wash, sizeof(wash));
+	if (ingress != -1)
+		addattr_l(n, 1024, TCA_CAKE_INGRESS, &ingress, sizeof(ingress));
+	if (ack_filter != -1)
+		addattr_l(n, 1024, TCA_CAKE_ACK_FILTER, &ack_filter, sizeof(ack_filter));
+
+	tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
+	return 0;
+}
+
+static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_CAKE_MAX + 1];
+	unsigned bandwidth = 0;
+	unsigned diffserv = 0;
+	unsigned flowmode = 0;
+	unsigned interval = 0;
+	unsigned memlimit = 0;
+	int overhead = 0;
+	int raw = 0;
+	int mpu = 0;
+	int atm = 0;
+	int nat = 0;
+	int autorate = 0;
+	int wash = 0;
+	int ingress = 0;
+	int ack_filter = 0;
+	SPRINT_BUF(b1);
+	SPRINT_BUF(b2);
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_CAKE_MAX, opt);
+
+	if (tb[TCA_CAKE_BASE_RATE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_BASE_RATE]) >= sizeof(__u32)) {
+		bandwidth = rta_getattr_u32(tb[TCA_CAKE_BASE_RATE]);
+		if (bandwidth) {
+			print_uint(PRINT_JSON, "bandwidth", NULL, bandwidth);
+			print_string(PRINT_FP, NULL, "bandwidth %s ", sprint_rate(bandwidth, b1));
+		} else
+			print_string(PRINT_ANY, "bandwidth", "bandwidth %s ", "unlimited");
+	}
+	if (tb[TCA_CAKE_AUTORATE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_AUTORATE]) >= sizeof(__u32)) {
+		autorate = rta_getattr_u32(tb[TCA_CAKE_AUTORATE]);
+		if (autorate == 1)
+			print_string(PRINT_ANY, "autorate", "autorate_%s ", "ingress");
+		else if (autorate)
+			print_string(PRINT_ANY, "autorate", "(?autorate?) ", "unknown");
+	}
+	if (tb[TCA_CAKE_DIFFSERV_MODE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_DIFFSERV_MODE]) >= sizeof(__u32)) {
+		diffserv = rta_getattr_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+		switch (diffserv) {
+		case CAKE_DIFFSERV_DIFFSERV3:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv3");
+			break;
+		case CAKE_DIFFSERV_DIFFSERV4:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv4");
+			break;
+		case CAKE_DIFFSERV_DIFFSERV8:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv8");
+			break;
+		case CAKE_DIFFSERV_BESTEFFORT:
+			print_string(PRINT_ANY, "diffserv", "%s ", "besteffort");
+			break;
+		case CAKE_DIFFSERV_PRECEDENCE:
+			print_string(PRINT_ANY, "diffserv", "%s ", "precedence");
+			break;
+		default:
+			print_string(PRINT_ANY, "diffserv", "(?diffserv?) ", "unknown");
+			break;
+		}
+	}
+	if (tb[TCA_CAKE_FLOW_MODE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_FLOW_MODE]) >= sizeof(__u32)) {
+		flowmode = rta_getattr_u32(tb[TCA_CAKE_FLOW_MODE]);
+		switch (flowmode) {
+		case CAKE_FLOW_NONE:
+			print_string(PRINT_ANY, "flowmode", "%s ", "flowblind");
+			break;
+		case CAKE_FLOW_SRC_IP:
+			print_string(PRINT_ANY, "flowmode", "%s ", "srchost");
+			break;
+		case CAKE_FLOW_DST_IP:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dsthost");
+			break;
+		case CAKE_FLOW_HOSTS:
+			print_string(PRINT_ANY, "flowmode", "%s ", "hosts");
+			break;
+		case CAKE_FLOW_FLOWS:
+			print_string(PRINT_ANY, "flowmode", "%s ", "flows");
+			break;
+		case CAKE_FLOW_DUAL_SRC:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dual-srchost");
+			break;
+		case CAKE_FLOW_DUAL_DST:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dual-dsthost");
+			break;
+		case CAKE_FLOW_TRIPLE:
+			print_string(PRINT_ANY, "flowmode", "%s ", "triple-isolate");
+			break;
+		default:
+			print_string(PRINT_ANY, "flowmode", "(?flowmode?) ", "unknown");
+			break;
+		}
+	}
+
+	if (tb[TCA_CAKE_NAT] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_NAT]) >= sizeof(__u32)) {
+		nat = rta_getattr_u32(tb[TCA_CAKE_NAT]);
+	}
+
+	if (nat)
+		print_string(PRINT_FP, NULL, "nat ", NULL);
+	print_bool(PRINT_JSON, "nat", NULL, nat);
+
+	if (tb[TCA_CAKE_WASH] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_WASH]) >= sizeof(__u32)) {
+		wash = rta_getattr_u32(tb[TCA_CAKE_WASH]);
+	}
+	if (tb[TCA_CAKE_ATM] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_ATM]) >= sizeof(__u32)) {
+		atm = rta_getattr_u32(tb[TCA_CAKE_ATM]);
+	}
+	if (tb[TCA_CAKE_OVERHEAD] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_OVERHEAD]) >= sizeof(__s32)) {
+		overhead = *(__s32 *) RTA_DATA(tb[TCA_CAKE_OVERHEAD]);
+	}
+	if (tb[TCA_CAKE_MPU] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_MPU]) >= sizeof(__u32)) {
+		mpu = rta_getattr_u32(tb[TCA_CAKE_MPU]);
+	}
+	if (tb[TCA_CAKE_INGRESS] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_INGRESS]) >= sizeof(__u32)) {
+		ingress = rta_getattr_u32(tb[TCA_CAKE_INGRESS]);
+	}
+	if (tb[TCA_CAKE_ACK_FILTER] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_ACK_FILTER]) >= sizeof(__u32)) {
+		ack_filter = rta_getattr_u32(tb[TCA_CAKE_ACK_FILTER]);
+	}
+	if (tb[TCA_CAKE_RAW]) {
+		raw = 1;
+	}
+	if (tb[TCA_CAKE_RTT] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_RTT]) >= sizeof(__u32)) {
+		interval = rta_getattr_u32(tb[TCA_CAKE_RTT]);
+	}
+	if (wash)
+		print_string(PRINT_FP, NULL, "wash ", NULL);
+	print_bool(PRINT_JSON, "wash", NULL, wash);
+
+	if (ingress)
+		print_string(PRINT_FP, NULL, "ingress ", NULL);
+	print_bool(PRINT_JSON, "ingress", NULL, ingress);
+
+	if (ack_filter == CAKE_ACK_AGGRESSIVE)
+		print_string(PRINT_ANY, "ack-filter", "ack-filter-%s ", "aggressive");
+	else if (ack_filter == CAKE_ACK_FILTER)
+		print_string(PRINT_ANY, "ack-filter", "ack-filter ", "enabled");
+	else
+		print_string(PRINT_JSON, "ack-filter", NULL, "disabled");
+
+	if (interval)
+		print_string(PRINT_FP, NULL, "rtt %s ", sprint_time(interval, b2));
+	print_uint(PRINT_JSON, "rtt", NULL, interval);
+
+	if (raw)
+		print_string(PRINT_FP, NULL, "raw ", NULL);
+	print_bool(PRINT_JSON, "raw", NULL, raw);
+
+	if (atm == CAKE_ATM_ATM)
+		print_string(PRINT_ANY, "atm", "%s ", "atm");
+	else if (atm == CAKE_ATM_PTM)
+		print_string(PRINT_ANY, "atm", "%s ", "ptm");
+	else if (!raw)
+		print_string(PRINT_ANY, "atm", "%s ", "noatm");
+
+	print_int(PRINT_ANY, "overhead", "overhead %d ", overhead);
+
+	if (mpu)
+		print_uint(PRINT_ANY, "mpu", "mpu %u ", mpu);
+
+	if (memlimit) {
+		print_uint(PRINT_JSON, "memlimit", NULL, memlimit);
+		print_string(PRINT_FP, NULL, "memlimit %s", sprint_size(memlimit, b1));
+	}
+
+	return 0;
+}
+
+#define FOR_EACH_TIN(xstats, tst, i)				\
+	for (tst = xstats->tin_stats, i = 0;			\
+	     i < xstats->tin_cnt;				\
+	     i++, tst = ((void *)xstats->tin_stats) + xstats->tin_stats_size * i)
+
+static void cake_print_json_tin(struct tc_cake_tin_stats *tst, uint version)
+{
+	open_json_object(NULL);
+	print_uint(PRINT_JSON, "threshold_rate", NULL, tst->threshold_rate);
+	print_uint(PRINT_JSON, "target", NULL, tst->target_us);
+	print_uint(PRINT_JSON, "interval", NULL, tst->interval_us);
+	print_uint(PRINT_JSON, "peak_delay", NULL, tst->peak_delay_us);
+	print_uint(PRINT_JSON, "average_delay", NULL, tst->avge_delay_us);
+	print_uint(PRINT_JSON, "base_delay", NULL, tst->base_delay_us);
+	print_uint(PRINT_JSON, "sent_packets", NULL, tst->sent.packets);
+	print_uint(PRINT_JSON, "sent_bytes", NULL, tst->sent.bytes);
+	print_uint(PRINT_JSON, "way_indirect_hits", NULL, tst->way_indirect_hits);
+	print_uint(PRINT_JSON, "way_misses", NULL, tst->way_misses);
+	print_uint(PRINT_JSON, "way_collisions", NULL, tst->way_collisions);
+	print_uint(PRINT_JSON, "drops", NULL, tst->dropped.packets);
+	print_uint(PRINT_JSON, "ecn_mark", NULL, tst->ecn_marked.packets);
+	print_uint(PRINT_JSON, "ack_drops", NULL, tst->ack_drops.packets);
+	print_uint(PRINT_JSON, "sparse_flows", NULL, tst->sparse_flows);
+	print_uint(PRINT_JSON, "bulk_flows", NULL, tst->bulk_flows);
+	print_uint(PRINT_JSON, "unresponsive_flows", NULL, tst->unresponse_flows);
+	print_uint(PRINT_JSON, "max_pkt_len", NULL, tst->max_skblen);
+	if (version >= 0x102)
+		print_uint(PRINT_JSON, "flow_quantum", NULL, tst->flow_quantum);
+	close_json_object();
+}
+
+static int cake_print_xstats(struct qdisc_util *qu, FILE *f,
+			     struct rtattr *xstats)
+{
+	struct tc_cake_xstats     *stnc;
+	struct tc_cake_tin_stats  *tst;
+	SPRINT_BUF(b1);
+	int i;
+
+	if (xstats == NULL)
+		return 0;
+
+	if (RTA_PAYLOAD(xstats) < sizeof(*stnc))
+		return -1;
+
+	stnc = RTA_DATA(xstats);
+
+	if (stnc->version < 0x101 ||
+	    RTA_PAYLOAD(xstats) < (sizeof(struct tc_cake_xstats) +
+				    stnc->tin_stats_size * stnc->tin_cnt))
+		return -1;
+
+	print_uint(PRINT_JSON, "memory_used", NULL, stnc->memory_used);
+	print_uint(PRINT_JSON, "memory_limit", NULL, stnc->memory_limit);
+	print_uint(PRINT_JSON, "capacity_estimate", NULL, stnc->capacity_estimate);
+
+	print_string(PRINT_FP, NULL, " memory used: %s",
+		sprint_size(stnc->memory_used, b1));
+	print_string(PRINT_FP, NULL, " of %s\n",
+		sprint_size(stnc->memory_limit, b1));
+	print_string(PRINT_FP, NULL, " capacity estimate: %s\n",
+		sprint_rate(stnc->capacity_estimate, b1));
+
+	print_uint(PRINT_ANY, "min_network_size", " min/max network layer size: %12" PRIu64,
+		stnc->min_netlen);
+	print_uint(PRINT_ANY, "max_network_size", " /%8" PRIu64 "\n", stnc->max_netlen);
+	print_uint(PRINT_ANY, "min_adj_size", " min/max overhead-adjusted size: %8" PRIu64,
+		stnc->min_adjlen);
+	print_uint(PRINT_ANY, "max_adj_size", " /%8" PRIu64 "\n", stnc->max_adjlen);
+	print_uint(PRINT_ANY, "avg_hdr_offset", " average network hdr offset: %12" PRIu64 "\n\n",
+		stnc->avg_trnoff);
+
+	if (is_json_context()) {
+		open_json_array(PRINT_JSON, "tins");
+		FOR_EACH_TIN(stnc, tst, i)
+			cake_print_json_tin(tst, stnc->version);
+		close_json_array(PRINT_JSON, NULL);
+		return 0;
+	}
+
+	switch (stnc->tin_cnt) {
+	case 3:
+		fprintf(f, "                   Bulk  Best Effort        Voice\n");
+		break;
+
+	case 4:
+		fprintf(f, "                   Bulk  Best Effort        Video        Voice\n");
+		break;
+
+	case 5:
+		fprintf(f, "               Low Loss  Best Effort    Low Delay         Bulk  Net Control\n");
+		break;
+
+	default:
+		fprintf(f, "          ");
+		for (i = 0; i < stnc->tin_cnt; i++)
+			fprintf(f, "       Tin %u", i);
+		fprintf(f, "\n");
+	}
+
+	fprintf(f, "  thresh  ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_rate(tst->threshold_rate, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  target  ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->target_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  interval");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->interval_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  pk_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->peak_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  av_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->avge_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  sp_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->base_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  pkts    ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->sent.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  bytes   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12llu", tst->sent.bytes);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_inds");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_indirect_hits);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_miss");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_misses);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_cols");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_collisions);
+	fprintf(f, "\n");
+
+	fprintf(f, "  drops   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->dropped.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  marks   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->ecn_marked.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  ack_drop");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->ack_drops.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  sp_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->sparse_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  bk_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->bulk_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  un_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->unresponse_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  max_len ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->max_skblen);
+	fprintf(f, "\n");
+
+	if (stnc->version >= 0x102) {
+		fprintf(f, "  quantum ");
+		FOR_EACH_TIN(stnc, tst, i)
+			fprintf(f, " %12u", tst->flow_quantum);
+		fprintf(f, "\n");
+	}
+
+	return 0;
+}
+
+struct qdisc_util cake_qdisc_util = {
+	.id		= "cake",
+	.parse_qopt	= cake_parse_opt,
+	.print_qopt	= cake_print_opt,
+	.print_xstats	= cake_print_xstats,
+};
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Cake] [PATCH iproute2-next v3] Add support for cake qdisc
  2018-04-24 12:30     ` [PATCH iproute2-next v3] " Toke Høiland-Jørgensen
@ 2018-04-24 14:44       ` Stephen Hemminger
  2018-04-24 14:45       ` Stephen Hemminger
  1 sibling, 0 replies; 20+ messages in thread
From: Stephen Hemminger @ 2018-04-24 14:44 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake

On Tue, 24 Apr 2018 14:30:46 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> diff --git a/tc/q_cake.c b/tc/q_cake.c
> new file mode 100644
> index 00000000..12263361
> --- /dev/null
> +++ b/tc/q_cake.c
> @@ -0,0 +1,778 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +/*
> + * Common Applications Kept Enhanced  --  CAKE
> + *
> + *  Copyright (C) 2014-2018 Jonathan Morton <chromatix99@gmail.com>
> + *  Copyright (C) 2017-2018 Toke Høiland-Jørgensen <toke@toke.dk>
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions, and the following disclaimer,
> + *    without modification.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. The names of the authors may not be used to endorse or promote products
> + *    derived from this software without specific prior written permission.
> + *
> + * Alternatively, provided that this notice is retained in full, this
> + * software may be distributed under the terms of the GNU General
> + * Public License ("GPL") version 2, in which case the provisions of the
> + * GPL apply INSTEAD OF those given above.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> + * DAMAGE.
> + *
> + */

If you  have SPDX tag you don't have to have all the GPL boilerplate anymore.

* Re: [Cake] [PATCH iproute2-next v3] Add support for cake qdisc
  2018-04-24 12:30     ` [PATCH iproute2-next v3] " Toke Høiland-Jørgensen
  2018-04-24 14:44       ` [Cake] " Stephen Hemminger
@ 2018-04-24 14:45       ` Stephen Hemminger
  2018-04-24 14:52         ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 20+ messages in thread
From: Stephen Hemminger @ 2018-04-24 14:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake

On Tue, 24 Apr 2018 14:30:46 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> +static void cake_print_json_tin(struct tc_cake_tin_stats *tst, uint version)
> +{
> +	open_json_object(NULL);
> +	print_uint(PRINT_JSON, "threshold_rate", NULL, tst->threshold_rate);
> +	print_uint(PRINT_JSON, "target", NULL, tst->target_us);
> +	print_uint(PRINT_JSON, "interval", NULL, tst->interval_us);
> +	print_uint(PRINT_JSON, "peak_delay", NULL, tst->peak_delay_us);
> +	print_uint(PRINT_JSON, "average_delay", NULL, tst->avge_delay_us);
> +	print_uint(PRINT_JSON, "base_delay", NULL, tst->base_delay_us);
> +	print_uint(PRINT_JSON, "sent_packets", NULL, tst->sent.packets);
> +	print_uint(PRINT_JSON, "sent_bytes", NULL, tst->sent.bytes);
> +	print_uint(PRINT_JSON, "way_indirect_hits", NULL, tst->way_indirect_hits);
> +	print_uint(PRINT_JSON, "way_misses", NULL, tst->way_misses);
> +	print_uint(PRINT_JSON, "way_collisions", NULL, tst->way_collisions);
> +	print_uint(PRINT_JSON, "drops", NULL, tst->dropped.packets);
> +	print_uint(PRINT_JSON, "ecn_mark", NULL, tst->ecn_marked.packets);
> +	print_uint(PRINT_JSON, "ack_drops", NULL, tst->ack_drops.packets);
> +	print_uint(PRINT_JSON, "sparse_flows", NULL, tst->sparse_flows);
> +	print_uint(PRINT_JSON, "bulk_flows", NULL, tst->bulk_flows);
> +	print_uint(PRINT_JSON, "unresponsive_flows", NULL, tst->unresponse_flows);
> +	print_uint(PRINT_JSON, "max_pkt_len", NULL, tst->max_skblen);
> +	if (version >= 0x102)
> +		print_uint(PRINT_JSON, "flow_quantum", NULL, tst->flow_quantum);

Please don't version objects in netlink. That is not how netlink is
supposed to be used.

* Re: [Cake] [PATCH iproute2-next v3] Add support for cake qdisc
  2018-04-24 14:45       ` Stephen Hemminger
@ 2018-04-24 14:52         ` Toke Høiland-Jørgensen
  2018-04-24 15:10           ` Stephen Hemminger
  0 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2018-04-24 14:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, cake

Stephen Hemminger <stephen@networkplumber.org> writes:

> On Tue, 24 Apr 2018 14:30:46 +0200
> Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
>> +static void cake_print_json_tin(struct tc_cake_tin_stats *tst, uint version)
>> +{
>> +	open_json_object(NULL);
>> +	print_uint(PRINT_JSON, "threshold_rate", NULL, tst->threshold_rate);
>> +	print_uint(PRINT_JSON, "target", NULL, tst->target_us);
>> +	print_uint(PRINT_JSON, "interval", NULL, tst->interval_us);
>> +	print_uint(PRINT_JSON, "peak_delay", NULL, tst->peak_delay_us);
>> +	print_uint(PRINT_JSON, "average_delay", NULL, tst->avge_delay_us);
>> +	print_uint(PRINT_JSON, "base_delay", NULL, tst->base_delay_us);
>> +	print_uint(PRINT_JSON, "sent_packets", NULL, tst->sent.packets);
>> +	print_uint(PRINT_JSON, "sent_bytes", NULL, tst->sent.bytes);
>> +	print_uint(PRINT_JSON, "way_indirect_hits", NULL, tst->way_indirect_hits);
>> +	print_uint(PRINT_JSON, "way_misses", NULL, tst->way_misses);
>> +	print_uint(PRINT_JSON, "way_collisions", NULL, tst->way_collisions);
>> +	print_uint(PRINT_JSON, "drops", NULL, tst->dropped.packets);
>> +	print_uint(PRINT_JSON, "ecn_mark", NULL, tst->ecn_marked.packets);
>> +	print_uint(PRINT_JSON, "ack_drops", NULL, tst->ack_drops.packets);
>> +	print_uint(PRINT_JSON, "sparse_flows", NULL, tst->sparse_flows);
>> +	print_uint(PRINT_JSON, "bulk_flows", NULL, tst->bulk_flows);
>> +	print_uint(PRINT_JSON, "unresponsive_flows", NULL, tst->unresponse_flows);
>> +	print_uint(PRINT_JSON, "max_pkt_len", NULL, tst->max_skblen);
>> +	if (version >= 0x102)
>> +		print_uint(PRINT_JSON, "flow_quantum", NULL, tst->flow_quantum);
>
> Please don't version objects in netlink. That is not how netlink is
> supposed to be used.

Well, this is leftover from keeping track of different versions of the
out-of-tree patch, and we already broke compatibility pretty thoroughly
as a preparation for upstreaming. So I'm fine with dropping the version
check; will resend.

That being said, the versioning comes from the XSTATS API, which does
not use netlink attributes for its members, but rather passes them
through as a struct. So what is one supposed to do in this case?

-Toke

* Re: [Cake] [PATCH iproute2-next v3] Add support for cake qdisc
  2018-04-24 14:52         ` Toke Høiland-Jørgensen
@ 2018-04-24 15:10           ` Stephen Hemminger
  2018-04-24 15:38             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 20+ messages in thread
From: Stephen Hemminger @ 2018-04-24 15:10 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake

On Tue, 24 Apr 2018 16:52:57 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> Well, this is leftover from keeping track of different versions of the
> out-of-tree patch, and we already broke compatibility pretty thoroughly
> as a preparation for upstreaming. So I'm fine with dropping the version
> check; will resend.
> 
> That being said, the versioning comes from the XSTATS API, which does
> not use netlink attributes for its members, but rather passes through as
> struct. So what is one supposed to do in this case?

If a structure is likely to change, then it should be decomposed into nested
netlink attributes. Once you send a structure over the userspace netlink API,
its layout is fixed forever.

* Re: [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
  2018-04-24 11:44 ` [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Toke Høiland-Jørgensen
  2018-04-24 11:44   ` [PATCH iproute2-next v2] Add support for cake qdisc Toke Høiland-Jørgensen
@ 2018-04-24 15:11   ` Stephen Hemminger
  2018-04-24 15:14   ` Stephen Hemminger
  2018-04-26 19:16   ` kbuild test robot
  3 siblings, 0 replies; 20+ messages in thread
From: Stephen Hemminger @ 2018-04-24 15:11 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake

On Tue, 24 Apr 2018 13:44:06 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> +struct tc_cake_xstats {
> +	__u16 version;
> +	__u16 tin_stats_size; /* == sizeof(struct tc_cake_tin_stats) */
> +	__u32 capacity_estimate;
> +	__u32 memory_limit;
> +	__u32 memory_used;
> +	__u8  tin_cnt;
> +	__u8  avg_trnoff;
> +	__u16 max_netlen;
> +	__u16 max_adjlen;
> +	__u16 min_netlen;
> +	__u16 min_adjlen;
> +
> +	__u16 spare1;
> +	__u32 spare2;
> +
> +	struct tc_cake_tin_stats tin_stats[0]; /* keep last */
> +};

No versioning allowed in userspace API. You need to drop version and make
it permanent.

* Re: [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
  2018-04-24 11:44 ` [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Toke Høiland-Jørgensen
  2018-04-24 11:44   ` [PATCH iproute2-next v2] Add support for cake qdisc Toke Høiland-Jørgensen
  2018-04-24 15:11   ` [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Stephen Hemminger
@ 2018-04-24 15:14   ` Stephen Hemminger
  2018-04-24 15:32     ` Georgios Amanakis
  2018-04-26 19:16   ` kbuild test robot
  3 siblings, 1 reply; 20+ messages in thread
From: Stephen Hemminger @ 2018-04-24 15:14 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake

On Tue, 24 Apr 2018 13:44:06 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> +static const u8 precedence[] = {0, 0, 0, 0, 0, 0, 0, 0,
> +				1, 1, 1, 1, 1, 1, 1, 1,
> +				2, 2, 2, 2, 2, 2, 2, 2,
> +				3, 3, 3, 3, 3, 3, 3, 3,
> +				4, 4, 4, 4, 4, 4, 4, 4,
> +				5, 5, 5, 5, 5, 5, 5, 5,
> +				6, 6, 6, 6, 6, 6, 6, 6,
> +				7, 7, 7, 7, 7, 7, 7, 7,
> +				};

Minor nit.  The kernel style for initializing array should be used
here.

static const u8 precedence[] = {
	0, 0, 0, 0, 0, 0, 0, 0,
	1, 1, 1, 1, 1, 1, 1, 1,
...
};

* Re: [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
  2018-04-24 15:14   ` Stephen Hemminger
@ 2018-04-24 15:32     ` Georgios Amanakis
  2018-04-24 15:41       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 20+ messages in thread
From: Georgios Amanakis @ 2018-04-24 15:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Cake List, netdev

On Tue, 24 Apr 2018 13:44:06 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> +config NET_SCH_CAKE
> +       tristate "Common Applications Kept Enhanced (CAKE)"
> +       help
> +         Say Y here if you want to use the Common Applications Kept Enhanced
> +          (CAKE) queue management algorithm.
> +
> +         To compile this driver as a module, choose M here: the module
> +         will be called sch_cake.

In net/sched/Kconfig we should probably add NF_CONNTRACK as a dependency:
"depends on NF_CONNTRACK"

Otherwise if NET_SCH_CAKE=y and NF_CONNTRACK=m compilation fails with:

net/sched/sch_cake.o: In function `cake_enqueue':
sch_cake.c:(.text+0x3e10): undefined reference to `nf_ct_get_tuplepr'
sch_cake.c:(.text+0x3e3a): undefined reference to `nf_conntrack_find_get'
make: *** [Makefile:1041: vmlinux] Error 1

* Re: [Cake] [PATCH iproute2-next v3] Add support for cake qdisc
  2018-04-24 15:10           ` Stephen Hemminger
@ 2018-04-24 15:38             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2018-04-24 15:38 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, cake

Stephen Hemminger <stephen@networkplumber.org> writes:

> On Tue, 24 Apr 2018 16:52:57 +0200
> Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
>> Well, this is leftover from keeping track of different versions of the
>> out-of-tree patch, and we already broke compatibility pretty thoroughly
>> as a preparation for upstreaming. So I'm fine with dropping the version
>> check; will resend.
>> 
>> That being said, the versioning comes from the XSTATS API, which does
>> not use netlink attributes for its members, but rather passes through as
>> struct. So what is one supposed to do in this case?
>
> If a structure is likely to change, then it should be decomposed into nested
> netlink attributes. Once you send a structure over userspace API in netlink
> it is fixed forever.

Right. Is decomposing stats into netlink attributes actually possible
within the qdisc dump_stats callback? Could we do something like:

nla_start_nest(d->skb, TCA_CAKE_STATS);

from within cake_dump_stats() and read it in cake_print_xstats in tc?

-Toke

* Re: [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
  2018-04-24 15:32     ` Georgios Amanakis
@ 2018-04-24 15:41       ` Toke Høiland-Jørgensen
       [not found]         ` <CACvFP_hzHY+=qPh6_=_++UhbnAU1xjPguY7fNzG0TSVhYm0V3Q@mail.gmail.com>
  0 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2018-04-24 15:41 UTC (permalink / raw)
  To: Georgios Amanakis; +Cc: Cake List, netdev

Georgios Amanakis <gamanakis@gmail.com> writes:

> On Tue, 24 Apr 2018 13:44:06 +0200
> Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
>> +config NET_SCH_CAKE
>> +       tristate "Common Applications Kept Enhanced (CAKE)"
>> +       help
>> +         Say Y here if you want to use the Common Applications Kept Enhanced
>> +          (CAKE) queue management algorithm.
>> +
>> +         To compile this driver as a module, choose M here: the module
>> +         will be called sch_cake.
>
> In net/sched/Kconfig we should probably add NF_CONNTRACK as a dependency:
> "depends on NF_CONNTRACK"
>
> Otherwise if NET_SCH_CAKE=y and NF_CONNTRACK=m compilation fails with:
>
> net/sched/sch_cake.o: In function `cake_enqueue':
> sch_cake.c:(.text+0x3e10): undefined reference to `nf_ct_get_tuplepr'
> sch_cake.c:(.text+0x3e3a): undefined reference to `nf_conntrack_find_get'
> make: *** [Makefile:1041: vmlinux] Error 1

Hmm, we really don't want a hard dependency on conntrack. We
currently ifdef the conntrack-specific bits thus:
#if IS_ENABLED(CONFIG_NF_CONNTRACK).

Does anyone know if there is a way to do this so the module/builtin
split doesn't bite us?

-Toke
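For reference, there is a Kconfig idiom that expresses "don't be built-in when conntrack is a module" without a hard dependency; the fragment below is a sketch of that idiom applied to the Kconfig entry quoted above, not something posted in this thread:

```kconfig
config NET_SCH_CAKE
	tristate "Common Applications Kept Enhanced (CAKE)"
	# Tristate logic: if NF_CONNTRACK=m this evaluates to m, limiting
	# NET_SCH_CAKE to m as well; if NF_CONNTRACK is y or n it evaluates
	# to y and places no restriction. So the failing y/m combination
	# is rejected by kconfig itself.
	depends on NF_CONNTRACK || !NF_CONNTRACK
	help
	  Say Y here if you want to use the Common Applications Kept Enhanced
	  (CAKE) queue management algorithm.
```

The IS_REACHABLE() approach the thread settles on below is the alternative: compile the conntrack calls out instead of constraining the config.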


* Re: [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
       [not found]         ` <CACvFP_hzHY+=qPh6_=_++UhbnAU1xjPguY7fNzG0TSVhYm0V3Q@mail.gmail.com>
@ 2018-04-24 15:51           ` Georgios Amanakis
  2018-04-24 16:03             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 20+ messages in thread
From: Georgios Amanakis @ 2018-04-24 15:51 UTC (permalink / raw)
  To: Cake List, netdev

On Tue, Apr 24, 2018 at 11:47 AM, Georgios Amanakis <gamanakis@gmail.com> wrote:
>>
>> Does anyone know if there is a way to do this so the module/builtin
>> split doesn't bite us?
>>
> #ifdef CONFIG_NF_CONNTRACK ??


* Re: [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
  2018-04-24 15:51           ` Georgios Amanakis
@ 2018-04-24 16:03             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2018-04-24 16:03 UTC (permalink / raw)
  To: Georgios Amanakis, Cake List, netdev

Georgios Amanakis <gamanakis@gmail.com> writes:

> On Tue, Apr 24, 2018 at 11:47 AM, Georgios Amanakis <gamanakis@gmail.com> wrote:
>>>
>>> Does anyone know if there is a way to do this so the module/builtin
>>> split doesn't bite us?
>>>
>> #ifdef CONFIG_NF_CONNTRACK ??

That is basically what we're doing. But it looks like there's an
IS_REACHABLE macro which does what we need. Will fix, thanks for
pointing it out :)

-Toke


* Re: [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc
  2018-04-24 11:44 ` [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Toke Høiland-Jørgensen
                     ` (2 preceding siblings ...)
  2018-04-24 15:14   ` Stephen Hemminger
@ 2018-04-26 19:16   ` kbuild test robot
  3 siblings, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2018-04-26 19:16 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: kbuild-all, netdev, cake, Toke Høiland-Jørgensen, Dave Taht

[-- Attachment #1: Type: text/plain, Size: 3367 bytes --]

Hi Toke,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Toke-H-iland-J-rgensen/Add-Common-Applications-Kept-Enhanced-cake-qdisc/20180426-064653
config: parisc-allmodconfig (attached as .config)
compiler: hppa-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=parisc 

All warnings (new ones prefixed by >>):


vim +2589 net//sched/sch_cake.c

  2525	
  2526	static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
  2527	{
  2528		struct cake_sched_data *q = qdisc_priv(sch);
  2529		struct tc_cake_xstats *st;
  2530		size_t size = (sizeof(*st) +
  2531			       sizeof(struct tc_cake_tin_stats) * q->tin_cnt);
  2532		int i;
  2533	
  2534		st = cake_zalloc(size);
  2535	
  2536		if (!st)
  2537			return -1;
  2538	
  2539		st->version = 0x102; /* old userspace code discards versions > 0xFF */
  2540		st->tin_stats_size = sizeof(struct tc_cake_tin_stats);
  2541		st->tin_cnt = q->tin_cnt;
  2542	
  2543		st->avg_trnoff = (q->avg_trnoff + 0x8000) >> 16;
  2544		st->max_netlen = q->max_netlen;
  2545		st->max_adjlen = q->max_adjlen;
  2546		st->min_netlen = q->min_netlen;
  2547		st->min_adjlen = q->min_adjlen;
  2548	
  2549		for (i = 0; i < q->tin_cnt; i++) {
  2550			struct cake_tin_data *b = &q->tins[q->tin_order[i]];
  2551			struct tc_cake_tin_stats *tstat = &st->tin_stats[i];
  2552	
  2553			tstat->threshold_rate = b->tin_rate_bps;
  2554			tstat->target_us      = cobalt_time_to_us(b->cparams.target);
  2555			tstat->interval_us    = cobalt_time_to_us(b->cparams.interval);
  2556	
  2557			/* TODO FIXME: add missing aspects of these composite stats */
  2558			tstat->sent.packets       = b->packets;
  2559			tstat->sent.bytes	  = b->bytes;
  2560			tstat->dropped.packets    = b->tin_dropped;
  2561			tstat->ecn_marked.packets = b->tin_ecn_mark;
  2562			tstat->backlog.bytes      = b->tin_backlog;
  2563			tstat->ack_drops.packets  = b->ack_drops;
  2564	
  2565			tstat->peak_delay_us = cobalt_time_to_us(b->peak_delay);
  2566			tstat->avge_delay_us = cobalt_time_to_us(b->avge_delay);
  2567			tstat->base_delay_us = cobalt_time_to_us(b->base_delay);
  2568	
  2569			tstat->way_indirect_hits = b->way_hits;
  2570			tstat->way_misses	 = b->way_misses;
  2571			tstat->way_collisions    = b->way_collisions;
  2572	
  2573			tstat->sparse_flows      = b->sparse_flow_count +
  2574						   b->decaying_flow_count;
  2575			tstat->bulk_flows	 = b->bulk_flow_count;
  2576			tstat->unresponse_flows  = b->unresponsive_flow_count;
  2577			tstat->spare		 = 0;
  2578			tstat->max_skblen	 = b->max_skblen;
  2579	
  2580			tstat->flow_quantum	 = b->flow_quantum;
  2581		}
  2582		st->capacity_estimate = q->avg_peak_bandwidth;
  2583		st->memory_limit      = q->buffer_limit;
  2584		st->memory_used       = q->buffer_max_used;
  2585	
  2586		i = gnet_stats_copy_app(d, st, size);
  2587		cake_free(st);
  2588		return i;
> 2589	}
  2590	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 52468 bytes --]


end of thread, other threads:[~2018-04-26 19:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-03 22:06 [PATCH net-next 0/3] Add Common Applications Kept Enhanced (cake) qdisc Dave Taht
2017-12-03 22:06 ` [PATCH net-next 1/3] pkt_sched.h: add support for sch_cake API Dave Taht
2017-12-03 22:06 ` [PATCH net-next 2/3] Add Common Applications Kept Enhanced (cake) qdisc Dave Taht
2017-12-03 22:06 ` [PATCH net-next 3/3] Add support for building the new cake qdisc Dave Taht
2017-12-05  4:41   ` kbuild test robot
2018-04-24 11:44 ` [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Toke Høiland-Jørgensen
2018-04-24 11:44   ` [PATCH iproute2-next v2] Add support for cake qdisc Toke Høiland-Jørgensen
2018-04-24 12:30     ` [PATCH iproute2-next v3] " Toke Høiland-Jørgensen
2018-04-24 14:44       ` [Cake] " Stephen Hemminger
2018-04-24 14:45       ` Stephen Hemminger
2018-04-24 14:52         ` Toke Høiland-Jørgensen
2018-04-24 15:10           ` Stephen Hemminger
2018-04-24 15:38             ` Toke Høiland-Jørgensen
2018-04-24 15:11   ` [Cake] [PATCH net-next v2] Add Common Applications Kept Enhanced (cake) qdisc Stephen Hemminger
2018-04-24 15:14   ` Stephen Hemminger
2018-04-24 15:32     ` Georgios Amanakis
2018-04-24 15:41       ` Toke Høiland-Jørgensen
     [not found]         ` <CACvFP_hzHY+=qPh6_=_++UhbnAU1xjPguY7fNzG0TSVhYm0V3Q@mail.gmail.com>
2018-04-24 15:51           ` Georgios Amanakis
2018-04-24 16:03             ` Toke Høiland-Jørgensen
2018-04-26 19:16   ` kbuild test robot
