netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 net-next] net/sched: add skbprio scheduler
@ 2018-07-07 10:13 Nishanth Devarajan
  2018-07-09 15:44 ` Marcelo Ricardo Leitner
  2018-07-11  2:57 ` Cong Wang
  0 siblings, 2 replies; 26+ messages in thread
From: Nishanth Devarajan @ 2018-07-07 10:13 UTC (permalink / raw)
  To: xiyou.wangcong, jhs, jiri, davem; +Cc: netdev, doucette, michel

net/sched: add skbprio scheduer

Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
according to their skb->priority field. Under congestion, already-enqueued lower
priority packets will be dropped to make space available for higher priority
packets. Skbprio was conceived as a solution for denial-of-service defenses that
need to route packets with different priorities as a means to overcome DoS
attacks.

v3
*Drop max_limit parameter in struct skbprio_sched_data and instead use
sch->limit.

*Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for
qdisc (previously being referenced every time qdisc changes).

*Move qdisc's detailed description from in-code to Documentation/networking.

*When qdisc is saturated, enqueue incoming packet first before dequeueing
lowest priority packet in queue - improves usage of call stack registers.

*Introduce and use overlimit stat to keep track of number of dropped packets.

v2
*Use skb->priority field rather than DS field. Rename queueing discipline as
SKB Priority Queue (previously Gatekeeper Priority Queue).

*Queueing discipline is made classful to expose Skbprio's internal priority
queues.

Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com>
Reviewed-by: Sachin Paryani <sachin.paryani@gmail.com>
Reviewed-by: Cody Doucette <doucette@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
---
 Documentation/networking/sch_skbprio.txt |  24 +++
 include/uapi/linux/pkt_sched.h           |  15 ++
 net/sched/Kconfig                        |  13 ++
 net/sched/Makefile                       |   1 +
 net/sched/sch_skbprio.c                  | 330 +++++++++++++++++++++++++++++++
 5 files changed, 383 insertions(+)
 create mode 100644 Documentation/networking/sch_skbprio.txt
 create mode 100644 net/sched/sch_skbprio.c

diff --git a/Documentation/networking/sch_skbprio.txt b/Documentation/networking/sch_skbprio.txt
new file mode 100644
index 0000000..3aa4d3e
--- /dev/null
+++ b/Documentation/networking/sch_skbprio.txt
@@ -0,0 +1,24 @@
+SKB Priority Queue
+==================
+
+This qdisc schedules a packet according to skb->priority, where a higher
+value places the packet closer to the exit of the queue. When the queue is
+full, the lowest priority packet in the queue is dropped to make room for
+the packet to be added if it has higher priority. If the packet to be added
+has lower priority than all packets in the queue, it is dropped.
+
+Without the SKB priority queue, queue length limits must be imposed
+for individual queues, and there is no easy way to enforce a global queue
+length limit across all priorities. With the SKBprio queue, a global
+queue length limit can be enforced while not restricting the queue lengths
+of individual priorities.
+
+This is especially useful for a denial-of-service defense system like
+Gatekeeper, which prioritizes packets in flows that demonstrate expected
+behavior of legitimate users. The queue is flexible to allow any number
+of packets of any priority up to the global limit of the scheduler
+without risking resource overconsumption by a flood of low priority packets.
+
+The Gatekeeper codebase is found here:
+
+		https://github.com/AltraMayor/gatekeeper
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 9491184..5c6429d 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -124,6 +124,21 @@ struct tc_fifo_qopt {
 	__u32	limit;	/* Queue length: bytes for bfifo, packets for pfifo */
 };
 
+/* SKBPRIO section */
+
+/*
+ * Priorities go from zero to (SKBPRIO_MAX_PRIORITY - 1).
+ * SKBPRIO_MAX_PRIORITY should be at least 64 in order for skbprio to be able
+ * to map one to one the DS field of IPV4 and IPV6 headers.
+ * Memory allocation grows linearly with SKBPRIO_MAX_PRIORITY.
+ */
+
+#define SKBPRIO_MAX_PRIORITY 64
+
+struct tc_skbprio_qopt {
+	__u32	limit;		/* Queue length in packets. */
+};
+
 /* PRIO section */
 
 #define TCQ_PRIO_BANDS	16
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index fcc8970..4aa6eb0 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -251,6 +251,19 @@ config NET_SCH_MQPRIO
 
 	  If unsure, say N.
 
+config NET_SCH_SKBPRIO
+	tristate "SKB priority queue scheduler (SKBPRIO)"
+	help
+	  Say Y here if you want to use the SKB priority queue
+	  scheduler. This schedules packets according to skb->priority,
+	  which is useful for request packets in DoS mitigation systems such
+	  as Gatekeeper.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called sch_skbprio.
+
+	  If unsure, say N.
+
 config NET_SCH_CHOKE
 	tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 9a5a707..ad5cd1e 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
 obj-$(CONFIG_NET_SCH_PLUG)	+= sch_plug.o
 obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
+obj-$(CONFIG_NET_SCH_SKBPRIO)	+= sch_skbprio.o
 obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c
new file mode 100644
index 0000000..6b94f54
--- /dev/null
+++ b/net/sched/sch_skbprio.c
@@ -0,0 +1,330 @@
+/*
+ * net/sched/sch_skbprio.c  SKB Priority Queue.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Nishanth Devarajan, <ndev2021@gmail.com>
+ *		Cody Doucette, <doucette@bu.edu>
+ *	        original idea by Michel Machado, Cody Doucette, and Qiaobin Fu
+ */
+
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+#include <net/inet_ecn.h>
+
+/*		SKB Priority Queue
+ *	=================================
+ *
+ * Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes
+ * packets according to their skb->priority field. Under congestion,
+ * Skbprio drops already-enqueued lower priority packets to make space
+ * available for higher priority packets; it was conceived as a solution
+ * for denial-of-service defenses that need to route packets with different
+ * priorities as a mean to overcome DoS attacks.
+ */
+
+struct skbprio_sched_data {
+	/* Queue state. */
+	struct sk_buff_head qdiscs[SKBPRIO_MAX_PRIORITY];
+	struct gnet_stats_queue qstats[SKBPRIO_MAX_PRIORITY];
+	u16 highest_prio;
+	u16 lowest_prio;
+};
+
+static u16 calc_new_high_prio(const struct skbprio_sched_data *q)
+{
+	int prio;
+
+	for (prio = q->highest_prio - 1; prio >= q->lowest_prio; prio--) {
+		if (!skb_queue_empty(&q->qdiscs[prio]))
+			return prio;
+	}
+
+	/* SKB queue is empty, return 0 (default highest priority setting). */
+	return 0;
+}
+
+static u16 calc_new_low_prio(const struct skbprio_sched_data *q)
+{
+	int prio;
+
+	for (prio = q->lowest_prio + 1; prio <= q->highest_prio; prio++) {
+		if (!skb_queue_empty(&q->qdiscs[prio]))
+			return prio;
+	}
+
+	/* SKB queue is empty, return SKBPRIO_MAX_PRIORITY - 1
+	 * (default lowest priority setting).
+	 */
+	return SKBPRIO_MAX_PRIORITY - 1;
+}
+
+static int skbprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			  struct sk_buff **to_free)
+{
+	const unsigned int max_priority = SKBPRIO_MAX_PRIORITY - 1;
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	struct sk_buff_head *qdisc;
+	struct sk_buff_head *lp_qdisc;
+	struct sk_buff *to_drop;
+	u16 prio, lp;
+
+	/* Obtain the priority of @skb. */
+	prio = min(skb->priority, max_priority);
+
+	qdisc = &q->qdiscs[prio];
+	if (sch->q.qlen < sch->limit) {
+		__skb_queue_tail(qdisc, skb);
+		qdisc_qstats_backlog_inc(sch, skb);
+		q->qstats[prio].backlog += qdisc_pkt_len(skb);
+
+		/* Check to update highest and lowest priorities. */
+		if (prio > q->highest_prio)
+			q->highest_prio = prio;
+
+		if (prio < q->lowest_prio)
+			q->lowest_prio = prio;
+
+		sch->q.qlen++;
+		return NET_XMIT_SUCCESS;
+	}
+
+	/* If this packet has the lowest priority, drop it. */
+	lp = q->lowest_prio;
+	if (prio <= lp) {
+		q->qstats[prio].drops++;
+		q->qstats[prio].overlimits++;
+		return qdisc_drop(skb, sch, to_free);
+	}
+
+	__skb_queue_tail(qdisc, skb);
+	qdisc_qstats_backlog_inc(sch, skb);
+	q->qstats[prio].backlog += qdisc_pkt_len(skb);
+
+	/* Drop the packet at the tail of the lowest priority qdisc. */
+	lp_qdisc = &q->qdiscs[lp];
+	to_drop = __skb_dequeue_tail(lp_qdisc);
+	BUG_ON(!to_drop);
+	qdisc_qstats_backlog_dec(sch, to_drop);
+	qdisc_drop(to_drop, sch, to_free);
+
+	q->qstats[lp].backlog -= qdisc_pkt_len(to_drop);
+	q->qstats[lp].drops++;
+	q->qstats[lp].overlimits++;
+
+	/* Check to update highest and lowest priorities. */
+	if (skb_queue_empty(lp_qdisc)) {
+		if (q->lowest_prio == q->highest_prio) {
+			/* The incoming packet is the only packet in queue. */
+			BUG_ON(sch->q.qlen != 1);
+			q->lowest_prio = prio;
+			q->highest_prio = prio;
+		} else {
+			q->lowest_prio = calc_new_low_prio(q);
+		}
+	}
+
+	if (prio > q->highest_prio)
+		q->highest_prio = prio;
+
+	return NET_XMIT_CN;
+}
+
+static struct sk_buff *skbprio_dequeue(struct Qdisc *sch)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	struct sk_buff_head *hpq = &q->qdiscs[q->highest_prio];
+	struct sk_buff *skb = __skb_dequeue(hpq);
+
+	if (unlikely(!skb))
+		return NULL;
+
+	sch->q.qlen--;
+	qdisc_qstats_backlog_dec(sch, skb);
+	qdisc_bstats_update(sch, skb);
+
+	q->qstats[q->highest_prio].backlog -= qdisc_pkt_len(skb);
+
+	/* Update highest priority field. */
+	if (skb_queue_empty(hpq)) {
+		if (q->lowest_prio == q->highest_prio) {
+			BUG_ON(sch->q.qlen);
+			q->highest_prio = 0;
+			q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+		} else {
+			q->highest_prio = calc_new_high_prio(q);
+		}
+	}
+	return skb;
+}
+
+static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	struct tc_skbprio_qopt *ctl = nla_data(opt);
+	const unsigned int min_limit = 1;
+
+	if (ctl->limit == (typeof(ctl->limit))-1)
+		sch->limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
+	else if (ctl->limit < min_limit)
+		return -EINVAL;
+	else
+		sch->limit = ctl->limit;
+
+	return 0;
+}
+
+static int skbprio_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	const unsigned int min_limit = 1;
+	int prio;
+
+	/* Initialise all queues, one for each possible priority. */
+	for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+		__skb_queue_head_init(&q->qdiscs[prio]);
+
+	memset(&q->qstats, 0, sizeof(q->qstats));
+	q->highest_prio = 0;
+	q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+	if (!opt) {
+		sch->limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
+		return 0;
+	}
+	return skbprio_change(sch, opt, extack);
+}
+
+static int skbprio_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	struct tc_skbprio_qopt opt;
+
+	opt.limit = sch->limit;
+
+	if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
+		return -1;
+
+	return skb->len;
+}
+
+static void skbprio_reset(struct Qdisc *sch)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	int prio;
+
+	sch->qstats.backlog = 0;
+	sch->q.qlen = 0;
+
+	for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+		__skb_queue_purge(&q->qdiscs[prio]);
+
+	memset(&q->qstats, 0, sizeof(q->qstats));
+	q->highest_prio = 0;
+	q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+}
+
+static void skbprio_destroy(struct Qdisc *sch)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	int prio;
+
+	for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+		__skb_queue_purge(&q->qdiscs[prio]);
+}
+
+static struct Qdisc *skbprio_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	return NULL;
+}
+
+static unsigned long skbprio_find(struct Qdisc *sch, u32 classid)
+{
+	return 0;
+}
+
+static int skbprio_dump_class(struct Qdisc *sch, unsigned long cl,
+			     struct sk_buff *skb, struct tcmsg *tcm)
+{
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	return 0;
+}
+
+static int skbprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+				   struct gnet_dump *d)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	if (gnet_stats_copy_queue(d, NULL, &q->qstats[cl - 1],
+		q->qstats[cl - 1].qlen) < 0)
+		return -1;
+	return 0;
+}
+
+static void skbprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < SKBPRIO_MAX_PRIORITY; i++) {
+		if (arg->count < arg->skip) {
+			arg->count++;
+			continue;
+		}
+		if (arg->fn(sch, i + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops skbprio_class_ops = {
+	.leaf		=	skbprio_leaf,
+	.find		=	skbprio_find,
+	.dump		=	skbprio_dump_class,
+	.dump_stats	=	skbprio_dump_class_stats,
+	.walk		=	skbprio_walk,
+};
+
+static struct Qdisc_ops skbprio_qdisc_ops __read_mostly = {
+	.cl_ops		=	&skbprio_class_ops,
+	.id		=	"skbprio",
+	.priv_size	=	sizeof(struct skbprio_sched_data),
+	.enqueue	=	skbprio_enqueue,
+	.dequeue	=	skbprio_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	skbprio_init,
+	.reset		=	skbprio_reset,
+	.change		=	skbprio_change,
+	.dump		=	skbprio_dump,
+	.destroy	=	skbprio_destroy,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init skbprio_module_init(void)
+{
+	return register_qdisc(&skbprio_qdisc_ops);
+}
+
+static void __exit skbprio_module_exit(void)
+{
+	unregister_qdisc(&skbprio_qdisc_ops);
+}
+
+module_init(skbprio_module_init)
+module_exit(skbprio_module_exit)
+
+MODULE_LICENSE("GPL");
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-07 10:13 [PATCH v3 net-next] net/sched: add skbprio scheduler Nishanth Devarajan
@ 2018-07-09 15:44 ` Marcelo Ricardo Leitner
  2018-07-09 18:18   ` Michel Machado
  2018-07-11  2:57 ` Cong Wang
  1 sibling, 1 reply; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-09 15:44 UTC (permalink / raw)
  To: Nishanth Devarajan
  Cc: xiyou.wangcong, jhs, jiri, davem, netdev, doucette, michel

On Sat, Jul 07, 2018 at 03:43:55PM +0530, Nishanth Devarajan wrote:
> net/sched: add skbprio scheduer
> 
> Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
> according to their skb->priority field. Under congestion, already-enqueued lower
> priority packets will be dropped to make space available for higher priority
> packets. Skbprio was conceived as a solution for denial-of-service defenses that
> need to route packets with different priorities as a means to overcome DoS
> attacks.

Why can't we implement this as a new flag for sch_prio.c?

I don't see why this duplication is needed, especially because it will
only be "slower" (as in, it will do more work) when qdisc is already
full and dropping packets anyway.

  Marcelo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-09 15:44 ` Marcelo Ricardo Leitner
@ 2018-07-09 18:18   ` Michel Machado
  2018-07-09 19:53     ` Marcelo Ricardo Leitner
  2018-07-11  2:38     ` Cong Wang
  0 siblings, 2 replies; 26+ messages in thread
From: Michel Machado @ 2018-07-09 18:18 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Nishanth Devarajan
  Cc: xiyou.wangcong, jhs, jiri, davem, netdev, doucette

On 07/09/2018 11:44 AM, Marcelo Ricardo Leitner wrote:
> On Sat, Jul 07, 2018 at 03:43:55PM +0530, Nishanth Devarajan wrote:
>> net/sched: add skbprio scheduer
>>
>> Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
>> according to their skb->priority field. Under congestion, already-enqueued lower
>> priority packets will be dropped to make space available for higher priority
>> packets. Skbprio was conceived as a solution for denial-of-service defenses that
>> need to route packets with different priorities as a means to overcome DoS
>> attacks.
> 
> Why can't we implement this as a new flag for sch_prio.c?
> 
> I don't see why this duplication is needed, especially because it will
> only be "slower" (as in, it will do more work) when qdisc is already
> full and dropping packets anyway.

    sch_prio.c and skbprio diverge on a number of aspects:

    1. sch_prio.c supports up to 16 priorities whereas skbprio 64. This 
is not just a matter of changing a constant since sch_prio.c doesn't use 
skb->priority.

    2. sch_prio.c does not have a global limit on the number of packets 
on all its queues, only a limit per queue.

    3. The queues of sch_prio.c are struct Qdisc, which don't have a 
method to drop at its tail.

    Given the divergences, adding flags to sch_prio.c will essentially 
keep both implementations together instead of being isolated as being 
proposed.

    On the speed point, there may not be noticeable difference between 
both qdiscs because the enqueueing and dequeueing costs of both qdics 
are O(1). Notice that the "extra work" (i.e. dropping lower priority 
packets) is a key aspect of skbprio since it gives routers a cheap way 
to choose which packets to drop during a DoS.

[ ]'s
Michel Machado

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-09 18:18   ` Michel Machado
@ 2018-07-09 19:53     ` Marcelo Ricardo Leitner
  2018-07-09 21:03       ` Michel Machado
  2018-07-11  2:32       ` Cong Wang
  2018-07-11  2:38     ` Cong Wang
  1 sibling, 2 replies; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-09 19:53 UTC (permalink / raw)
  To: Michel Machado
  Cc: Nishanth Devarajan, xiyou.wangcong, jhs, jiri, davem, netdev, doucette

On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
> On 07/09/2018 11:44 AM, Marcelo Ricardo Leitner wrote:
> > On Sat, Jul 07, 2018 at 03:43:55PM +0530, Nishanth Devarajan wrote:
> > > net/sched: add skbprio scheduer
> > > 
> > > Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
> > > according to their skb->priority field. Under congestion, already-enqueued lower
> > > priority packets will be dropped to make space available for higher priority
> > > packets. Skbprio was conceived as a solution for denial-of-service defenses that
> > > need to route packets with different priorities as a means to overcome DoS
> > > attacks.
> > 
> > Why can't we implement this as a new flag for sch_prio.c?
> > 
> > I don't see why this duplication is needed, especially because it will
> > only be "slower" (as in, it will do more work) when qdisc is already
> > full and dropping packets anyway.
> 
>    sch_prio.c and skbprio diverge on a number of aspects:
> 
>    1. sch_prio.c supports up to 16 priorities whereas skbprio 64. This is
> not just a matter of changing a constant since sch_prio.c doesn't use
> skb->priority.

Yes it does use skb->priority for classifying into a band:

prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
{
        struct prio_sched_data *q = qdisc_priv(sch);
        u32 band = skb->priority;
...

> 
>    2. sch_prio.c does not have a global limit on the number of packets on
> all its queues, only a limit per queue.

It can be useful to sch_prio.c as well, why not?
prio_enqueue()
{
...
+	if (count > sch->global_limit)
+		prio_tail_drop(sch);   /* to be implemented */
        ret = qdisc_enqueue(skb, qdisc, to_free);

> 
>    3. The queues of sch_prio.c are struct Qdisc, which don't have a method
> to drop at its tail.

That can be implemented, most likely as prio_tail_drop() as above.

> 
>    Given the divergences, adding flags to sch_prio.c will essentially keep
> both implementations together instead of being isolated as being proposed.

I don't agree. There aren't that many flags. I see only 2, one which
makes sense to sch_prio as it is already (the global limit) and from
where it should drop, the overflown packet or from tail.

All other code will be reused: stats handling, netlink handling,
enqueue and dequeue at least.

If we add this other qdisc, named as it is, it will be very confusing
to sysadmins: both are named very closely and work essentially in the
same way, but one drops from tail and another drops the incoming
packet.

> 
>    On the speed point, there may not be noticeable difference between both
> qdiscs because the enqueueing and dequeueing costs of both qdics are O(1).
> Notice that the "extra work" (i.e. dropping lower priority packets) is a key
> aspect of skbprio since it gives routers a cheap way to choose which packets
> to drop during a DoS.

On that I agree. I was more referring to something like: "lets not make
sch_prio slow and implement a new one instead.", which I don't it's
valid because the extra "cost" is only visible when it's already
dropping packets. Hopefully it's clearer now :)

[]s
Marcelo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-09 19:53     ` Marcelo Ricardo Leitner
@ 2018-07-09 21:03       ` Michel Machado
  2018-07-09 21:40         ` Marcelo Ricardo Leitner
  2018-07-11  2:32       ` Cong Wang
  1 sibling, 1 reply; 26+ messages in thread
From: Michel Machado @ 2018-07-09 21:03 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Nishanth Devarajan, xiyou.wangcong, jhs, jiri, davem, netdev, doucette

On 07/09/2018 03:53 PM, Marcelo Ricardo Leitner wrote:
> On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
>> On 07/09/2018 11:44 AM, Marcelo Ricardo Leitner wrote:
>>> On Sat, Jul 07, 2018 at 03:43:55PM +0530, Nishanth Devarajan wrote:
>>>> net/sched: add skbprio scheduer
>>>>
>>>> Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
>>>> according to their skb->priority field. Under congestion, already-enqueued lower
>>>> priority packets will be dropped to make space available for higher priority
>>>> packets. Skbprio was conceived as a solution for denial-of-service defenses that
>>>> need to route packets with different priorities as a means to overcome DoS
>>>> attacks.
>>>
>>> Why can't we implement this as a new flag for sch_prio.c?
>>>
>>> I don't see why this duplication is needed, especially because it will
>>> only be "slower" (as in, it will do more work) when qdisc is already
>>> full and dropping packets anyway.
>>
>>     sch_prio.c and skbprio diverge on a number of aspects:
>>
>>     1. sch_prio.c supports up to 16 priorities whereas skbprio 64. This is
>> not just a matter of changing a constant since sch_prio.c doesn't use
>> skb->priority.
> 
> Yes it does use skb->priority for classifying into a band:
> 
> prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
> {
>          struct prio_sched_data *q = qdisc_priv(sch);
>          u32 band = skb->priority;
> ...

    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward 
compatibility with applications.

>>     3. The queues of sch_prio.c are struct Qdisc, which don't have a method
>> to drop at its tail.
> 
> That can be implemented, most likely as prio_tail_drop() as above.

    struct Qdisc represents *all* qdiscs. My knowledge of the other 
qdiscs is limited, but not all qdiscs may have a meaningful method to 
drop at the tail. For example: a qdisc that works over flows may not 
know with flow is the tail. Not to mention that this would be a 
widespread patch to only support this new prio qdisc. It would be 
prudent to wait for the production success of the proposed, 
self-contained qdisc before making this commitment.

[ ]'s
Michel Machado

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-09 21:03       ` Michel Machado
@ 2018-07-09 21:40         ` Marcelo Ricardo Leitner
  2018-07-10 14:03           ` Michel Machado
  2018-07-11  2:25           ` Cong Wang
  0 siblings, 2 replies; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-09 21:40 UTC (permalink / raw)
  To: Michel Machado
  Cc: Nishanth Devarajan, xiyou.wangcong, jhs, jiri, davem, netdev, doucette

On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
> On 07/09/2018 03:53 PM, Marcelo Ricardo Leitner wrote:
> > On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
> > > On 07/09/2018 11:44 AM, Marcelo Ricardo Leitner wrote:
> > > > On Sat, Jul 07, 2018 at 03:43:55PM +0530, Nishanth Devarajan wrote:
> > > > > net/sched: add skbprio scheduer
> > > > > 
> > > > > Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
> > > > > according to their skb->priority field. Under congestion, already-enqueued lower
> > > > > priority packets will be dropped to make space available for higher priority
> > > > > packets. Skbprio was conceived as a solution for denial-of-service defenses that
> > > > > need to route packets with different priorities as a means to overcome DoS
> > > > > attacks.
> > > > 
> > > > Why can't we implement this as a new flag for sch_prio.c?
> > > > 
> > > > I don't see why this duplication is needed, especially because it will
> > > > only be "slower" (as in, it will do more work) when qdisc is already
> > > > full and dropping packets anyway.
> > > 
> > >     sch_prio.c and skbprio diverge on a number of aspects:
> > > 
> > >     1. sch_prio.c supports up to 16 priorities whereas skbprio 64. This is
> > > not just a matter of changing a constant since sch_prio.c doesn't use
> > > skb->priority.
> > 
> > Yes it does use skb->priority for classifying into a band:
> > 
> > prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
> > {
> >          struct prio_sched_data *q = qdisc_priv(sch);
> >          u32 band = skb->priority;
> > ...
> 
>    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
> with applications.

If done, it needs to be done carefully, indeed. I don't know if it's
doable, neither I know how hard is your requirement for 64 different
priorities.

You can get 64 different priorities by stacking sch_prio, btw. And if
you implement drop_from_tail() as part of Qdisc, you can even get it
working for this cascading case too.

> 
> > >     3. The queues of sch_prio.c are struct Qdisc, which don't have a method
> > > to drop at its tail.
> > 
> > That can be implemented, most likely as prio_tail_drop() as above.
> 
>    struct Qdisc represents *all* qdiscs. My knowledge of the other qdiscs is
> limited, but not all qdiscs may have a meaningful method to drop at the
> tail. For example: a qdisc that works over flows may not know with flow is

True, but it doesn't mean you have to implement it for all available qdiscs.

> the tail. Not to mention that this would be a widespread patch to only
> support this new prio qdisc. It would be prudent to wait for the production
> success of the proposed, self-contained qdisc before making this commitment.

On the other hand, by adding another qdisc you're adding more work
that one needs to do when dealing with qdisc infrastructure, such as
updating enqueue() prototype, for example.

Once this new qdisc is in, it won't be easy to deprecate it.

  Marcelo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-09 21:40         ` Marcelo Ricardo Leitner
@ 2018-07-10 14:03           ` Michel Machado
  2018-07-10 14:33             ` Marcelo Ricardo Leitner
  2018-07-11  2:25           ` Cong Wang
  1 sibling, 1 reply; 26+ messages in thread
From: Michel Machado @ 2018-07-10 14:03 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Nishanth Devarajan, xiyou.wangcong, jhs, jiri, davem, netdev, doucette

On 07/09/2018 05:40 PM, Marcelo Ricardo Leitner wrote:
> On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
>> On 07/09/2018 03:53 PM, Marcelo Ricardo Leitner wrote:
>>> On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
>>>> On 07/09/2018 11:44 AM, Marcelo Ricardo Leitner wrote:
>>>>> On Sat, Jul 07, 2018 at 03:43:55PM +0530, Nishanth Devarajan wrote:
>>>>>> net/sched: add skbprio scheduer
>>>>>>
>>>>>> Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
>>>>>> according to their skb->priority field. Under congestion, already-enqueued lower
>>>>>> priority packets will be dropped to make space available for higher priority
>>>>>> packets. Skbprio was conceived as a solution for denial-of-service defenses that
>>>>>> need to route packets with different priorities as a means to overcome DoS
>>>>>> attacks.
>>>>>
>>>>> Why can't we implement this as a new flag for sch_prio.c?
>>>>>
>>>>> I don't see why this duplication is needed, especially because it will
>>>>> only be "slower" (as in, it will do more work) when qdisc is already
>>>>> full and dropping packets anyway.
>>>>
>>>>      sch_prio.c and skbprio diverge on a number of aspects:
>>>>
>>>>      1. sch_prio.c supports up to 16 priorities whereas skbprio 64. This is
>>>> not just a matter of changing a constant since sch_prio.c doesn't use
>>>> skb->priority.
>>>
>>> Yes it does use skb->priority for classifying into a band:
>>>
>>> prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
>>> {
>>>           struct prio_sched_data *q = qdisc_priv(sch);
>>>           u32 band = skb->priority;
>>> ...
>>
>>     Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
>> with applications.
> 
> If done, it needs to be done carefully, indeed. I don't know if it's
> doable, neither I know how hard is your requirement for 64 different
> priorities.
> 
> You can get 64 different priorities by stacking sch_prio, btw. And if
> you implement drop_from_tail() as part of Qdisc, you can even get it
> working for this cascading case too.

    A solution would be to add another flag to switch between the 
current prio_classify() and a new one to just use skb->priority as in 
skbprio. This way we don't risk breaking applications that rely on 
tcf_classify() and this odd behavior that I found in prio_classify():

         band = TC_H_MIN(band) - 1;
         if (band >= q->bands)
                 return q->queues[q->prio2band[0]];
         return q->queues[band];

    When band is zero, it returns q->queues[q->prio2band[0]] instead of 
q->queues[band] as it would for other bands less than q->bands.

>>>>      3. The queues of sch_prio.c are struct Qdisc, which don't have a method
>>>> to drop at its tail.
>>>
>>> That can be implemented, most likely as prio_tail_drop() as above.
>>
>>     struct Qdisc represents *all* qdiscs. My knowledge of the other qdiscs is
>> limited, but not all qdiscs may have a meaningful method to drop at the
>> tail. For example: a qdisc that works over flows may not know with flow is
> 
> True, but it doesn't mean you have to implement it for all available qdiscs.

    If it is not implemented for all available qdiscs and the flag to 
drop at the tail is on, sch_prio.c would need to issue a log message 
whenever a packet goes into one of the subqueues that don't drop at the 
tail and have a failsafe behavior.

>> the tail. Not to mention that this would be a widespread patch to only
>> support this new prio qdisc. It would be prudent to wait for the production
>> success of the proposed, self-contained qdisc before making this commitment.
> 
> On the other hand, by adding another qdisc you're adding more work
> that one needs to do when dealing with qdisc infrastructure, such as
> updating enqueue() prototype, for example.
> 
> Once this new qdisc is in, it won't be easy to deprecate it.

    We need to choose between (1) having skbprio that has some duplicate 
code with sch_prio.c and (2) adding flags to sch_prio.c and make a major 
refactoring of the schedule subsystem to add drop an the tail to qdiscs.

    I mean major because we are not just talking about adding the method 
dequeue_tail() to struct Qdisc and adding dequeue_tail() to all qdiscs. 
One will need to come up with definitions of dequeue_tail() for qdiscs 
that don't naturally have it and even rewrite the data structures of 
qdiscs. To substantiate this last point, consider sch_fifo.c, one of the 
simplest qdiscs available. sch_fifo.c keeps its packets in sch->q, which 
is of type struct qdisc_skb_head. struct qdisc_skb_head doesn't set 
skb->prev, so it cannot drop at the tail without walking through its list.

    I do understand the motivation for minimizing duplicate code. But 
the small amount of duplicate code that skbprio adds is cheaper than 
refactoring the scheduler system to only support this new sch_prio.c.

[ ]'s
Michel Machado

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-10 14:03           ` Michel Machado
@ 2018-07-10 14:33             ` Marcelo Ricardo Leitner
  0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-10 14:33 UTC (permalink / raw)
  To: Michel Machado
  Cc: Nishanth Devarajan, xiyou.wangcong, jhs, jiri, davem, netdev, doucette

On Tue, Jul 10, 2018 at 10:03:22AM -0400, Michel Machado wrote:
...
> > You can get 64 different priorities by stacking sch_prio, btw. And if
> > you implement drop_from_tail() as part of Qdisc, you can even get it
> > working for this cascading case too.
> 
>    A solution would be to add another flag to switch between the current
> prio_classify() and a new one to just use skb->priority as in skbprio. This

Sounds promising.

> way we don't risk breaking applications that rely on tcf_classify() and this
> odd behavior that I found in prio_classify():
> 
>         band = TC_H_MIN(band) - 1;
>         if (band >= q->bands)
>                 return q->queues[q->prio2band[0]];
>         return q->queues[band];
> 
>    When band is zero, it returns q->queues[q->prio2band[0]] instead of
> q->queues[band] as it would for other bands less than q->bands.

Agreed, this looks odd. It came from 1d8ae3fdeb00 ("pkt_sched: Remove
RR scheduler."):

        band = TC_H_MIN(band) - 1;
        if (band >= q->bands)
-               band = q->prio2band[0];
-out:
-       if (q->mq)
-               skb_set_queue_mapping(skb, band);
+               return q->queues[q->prio2band[0]];
+
        return q->queues[band];
 }

I can see how it made sense before the change, but not after.

> 
> > > > >      3. The queues of sch_prio.c are struct Qdisc, which don't have a method
> > > > > to drop at its tail.
> > > > 
> > > > That can be implemented, most likely as prio_tail_drop() as above.
> > > 
> > >     struct Qdisc represents *all* qdiscs. My knowledge of the other qdiscs is
> > > limited, but not all qdiscs may have a meaningful method to drop at the
> > > tail. For example: a qdisc that works over flows may not know with flow is
> > 
> > True, but it doesn't mean you have to implement it for all available qdiscs.
> 
>    If it is not implemented for all available qdiscs and the flag to drop at
> the tail is on, sch_prio.c would need to issue a log message whenever a
> packet goes into one of the subqueues that don't drop at the tail and have a
> failsafe behavior.

That's fine. pr_warn_ratelimit() is probably what we need for logging
the error, so it a) doesn't flood kernel log and b) gets activated
even if the sysadmin later try again with another qdisc (as opposed to
pr_warn_once).

For the failsafe behavior, it probably can then just drop the incoming
packet. It is not what you want, yes, but it's an easy way out out of
a non-expected situation and that works well enough.

> 
> > > the tail. Not to mention that this would be a widespread patch to only
> > > support this new prio qdisc. It would be prudent to wait for the production
> > > success of the proposed, self-contained qdisc before making this commitment.
> > 
> > On the other hand, by adding another qdisc you're adding more work
> > that one needs to do when dealing with qdisc infrastructure, such as
> > updating enqueue() prototype, for example.
> > 
> > Once this new qdisc is in, it won't be easy to deprecate it.
> 
>    We need to choose between (1) having skbprio that has some duplicate code
> with sch_prio.c and (2) adding flags to sch_prio.c and make a major
> refactoring of the schedule subsystem to add drop an the tail to qdiscs.

Yes,

> 
>    I mean major because we are not just talking about adding the method
> dequeue_tail() to struct Qdisc and adding dequeue_tail() to all qdiscs. One

I think it is. :-)

> will need to come up with definitions of dequeue_tail() for qdiscs that
> don't naturally have it and even rewrite the data structures of qdiscs. To
> substantiate this last point, consider sch_fifo.c, one of the simplest
> qdiscs available. sch_fifo.c keeps its packets in sch->q, which is of type
> struct qdisc_skb_head. struct qdisc_skb_head doesn't set skb->prev, so it
> cannot drop at the tail without walking through its list.

Yes but this would only be needed for the qdiscs that you want to
support with this flag. Nobody said you need to implement it on all
qdiscs that we have...

> 
>    I do understand the motivation for minimizing duplicate code. But the
> small amount of duplicate code that skbprio adds is cheaper than refactoring
> the scheduler system to only support this new sch_prio.c.

I'm afraid that without the code for option (2) above, this discussion
will become subjective. I'll wait for other opinions here.

Cheers,
  Marcelo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-09 21:40         ` Marcelo Ricardo Leitner
  2018-07-10 14:03           ` Michel Machado
@ 2018-07-11  2:25           ` Cong Wang
  2018-07-11 19:33             ` Marcelo Ricardo Leitner
  1 sibling, 1 reply; 26+ messages in thread
From: Cong Wang @ 2018-07-11  2:25 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
> >    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
> > with applications.
>
> If done, it needs to be done carefully, indeed. I don't know if it's
> doable, neither I know how hard is your requirement for 64 different
> priorities.

struct tc_prio_qopt {
        int     bands;                  /* Number of bands */
        __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
};

How would you do it carefully?

Also, it is not only used by prio but also pfifo_fast.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-09 19:53     ` Marcelo Ricardo Leitner
  2018-07-09 21:03       ` Michel Machado
@ 2018-07-11  2:32       ` Cong Wang
  2018-07-11 18:37         ` Marcelo Ricardo Leitner
  1 sibling, 1 reply; 26+ messages in thread
From: Cong Wang @ 2018-07-11  2:32 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Mon, Jul 9, 2018 at 12:53 PM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
> >
> >    2. sch_prio.c does not have a global limit on the number of packets on
> > all its queues, only a limit per queue.
>
> It can be useful to sch_prio.c as well, why not?
> prio_enqueue()
> {
> ...
> +       if (count > sch->global_limit)
> +               prio_tail_drop(sch);   /* to be implemented */
>         ret = qdisc_enqueue(skb, qdisc, to_free);
>

Isn't the whole point of sch_prio offloading the queueing to
each class? If you need a limit, there is one for each child
qdisc if you use for example pfifo or bfifo (depending on you
want to limit bytes or packets).

Also, what's your plan for backward compatibility here?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-09 18:18   ` Michel Machado
  2018-07-09 19:53     ` Marcelo Ricardo Leitner
@ 2018-07-11  2:38     ` Cong Wang
  1 sibling, 0 replies; 26+ messages in thread
From: Cong Wang @ 2018-07-11  2:38 UTC (permalink / raw)
  To: Michel Machado
  Cc: Marcelo Ricardo Leitner, Nishanth Devarajan, Jamal Hadi Salim,
	Jiri Pirko, David Miller, Linux Kernel Network Developers,
	Cody Doucette

On Mon, Jul 9, 2018 at 11:18 AM Michel Machado <michel@digirati.com.br> wrote:
>     3. The queues of sch_prio.c are struct Qdisc, which don't have a
> method to drop at its tail.

This isn't true, you can install a qdisc which drops at tail as its
child qdisc, you can install different qdiscs for different children
too if you really want. The control is on you, sch_prio is simply a
wrapper.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-07 10:13 [PATCH v3 net-next] net/sched: add skbprio scheduler Nishanth Devarajan
  2018-07-09 15:44 ` Marcelo Ricardo Leitner
@ 2018-07-11  2:57 ` Cong Wang
  2018-07-11 15:24   ` Michel Machado
  1 sibling, 1 reply; 26+ messages in thread
From: Cong Wang @ 2018-07-11  2:57 UTC (permalink / raw)
  To: Nishanth Devarajan
  Cc: Jamal Hadi Salim, Jiri Pirko, David Miller,
	Linux Kernel Network Developers, Cody Doucette, Michel Machado

On Sat, Jul 7, 2018 at 3:14 AM Nishanth Devarajan <ndev2021@gmail.com> wrote:
> diff --git a/Documentation/networking/sch_skbprio.txt b/Documentation/networking/sch_skbprio.txt
> new file mode 100644
> index 0000000..3aa4d3e
> --- /dev/null
> +++ b/Documentation/networking/sch_skbprio.txt

We usually document each qdisc behavior in tc man pages in iproute2.

I don't mind you document it in kernel, but it kinda breaks the tradition.


> +static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
> +                       struct netlink_ext_ack *extack)
> +{
> +       struct skbprio_sched_data *q = qdisc_priv(sch);
> +       struct tc_skbprio_qopt *ctl = nla_data(opt);
> +       const unsigned int min_limit = 1;
> +
> +       if (ctl->limit == (typeof(ctl->limit))-1)
> +               sch->limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
> +       else if (ctl->limit < min_limit)
> +               return -EINVAL;
> +       else
> +               sch->limit = ctl->limit;

The dev->tx_queue_len is fundamentally non-sense since now
almost every real NIC is multi-queue and qdisc has a completely
different sch->limit. This is why I suggested you to simply
avoid it in your code.

There is no standard way to use dev->tx_queue_len in kernel,
so I can't claim your use is correct or not, but it still looks odd,
other qdisc seems just uses as a default, rather than picking
the smaller or bigger value as a cap.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-11  2:57 ` Cong Wang
@ 2018-07-11 15:24   ` Michel Machado
  2018-07-19 18:39     ` Cong Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Michel Machado @ 2018-07-11 15:24 UTC (permalink / raw)
  To: Cong Wang, Nishanth Devarajan
  Cc: Jamal Hadi Salim, Jiri Pirko, David Miller,
	Linux Kernel Network Developers, Cody Doucette

On 07/10/2018 10:57 PM, Cong Wang wrote:
> On Sat, Jul 7, 2018 at 3:14 AM Nishanth Devarajan <ndev2021@gmail.com> wrote:
>> diff --git a/Documentation/networking/sch_skbprio.txt b/Documentation/networking/sch_skbprio.txt
>> new file mode 100644
>> index 0000000..3aa4d3e
>> --- /dev/null
>> +++ b/Documentation/networking/sch_skbprio.txt
> 
> We usually document each qdisc behavior in tc man pages in iproute2.
> 
> I don't mind you document it in kernel, but it kinda breaks the tradition.

    We're going to move this text to tc man pages for v4.

>> +static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
>> +                       struct netlink_ext_ack *extack)
>> +{
>> +       struct skbprio_sched_data *q = qdisc_priv(sch);
>> +       struct tc_skbprio_qopt *ctl = nla_data(opt);
>> +       const unsigned int min_limit = 1;
>> +
>> +       if (ctl->limit == (typeof(ctl->limit))-1)
>> +               sch->limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
>> +       else if (ctl->limit < min_limit)
>> +               return -EINVAL;
>> +       else
>> +               sch->limit = ctl->limit;
> 
> The dev->tx_queue_len is fundamentally non-sense since now
> almost every real NIC is multi-queue and qdisc has a completely
> different sch->limit. This is why I suggested you to simply
> avoid it in your code.

    Would you be okay with a constant there? If so, we could just put 64 
there. The optimal number is hardware dependent, but we don't know how 
to calculate it.

> There is no standard way to use dev->tx_queue_len in kernel,
> so I can't claim your use is correct or not, but it still looks odd,
> other qdisc seems just uses as a default, rather than picking
> the smaller or bigger value as a cap.

    The reason for the `max(qdisc_dev(sch)->tx_queue_len, min_limit)` is 
to make sure that sch->limit is at least 1. We couldn't come up with a 
meaningful behavior for sch->limit being zero, so we defined the basis 
case of skbprio_enqueue() as sch->limit one. If there's a guarantee that 
qdisc_dev(sch)->tx_queue_len is always greater than zero, we don't need 
the max().

[ ]'s
Michel Machado

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-11  2:32       ` Cong Wang
@ 2018-07-11 18:37         ` Marcelo Ricardo Leitner
  2018-07-13  5:07           ` Cong Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-11 18:37 UTC (permalink / raw)
  To: Cong Wang
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Tue, Jul 10, 2018 at 07:32:43PM -0700, Cong Wang wrote:
> On Mon, Jul 9, 2018 at 12:53 PM Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> >
> > On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
> > >
> > >    2. sch_prio.c does not have a global limit on the number of packets on
> > > all its queues, only a limit per queue.
> >
> > It can be useful to sch_prio.c as well, why not?
> > prio_enqueue()
> > {
> > ...
> > +       if (count > sch->global_limit)
> > +               prio_tail_drop(sch);   /* to be implemented */
> >         ret = qdisc_enqueue(skb, qdisc, to_free);
> >
> 
> Isn't the whole point of sch_prio offloading the queueing to
> each class? If you need a limit, there is one for each child
> qdisc if you use for example pfifo or bfifo (depending on you
> want to limit bytes or packets).

Yes, but Michel wants to drop from other lower priorities if needed,
and that's not possible if you handle the limit already in a child
qdisc as they don't know about their siblings. The idea in the example
above is to discard it from whatever lower priority is needed, then
queue it. (ok, the example missed to check the priority level)

As for the different units, sch_prio holds a count of how many packets
are queued on its children, and that's what would be used for the limit.

> 
> Also, what's your plan for backward compatibility here?

say:
  if (sch->global_limit && count > sch->global_limit)
as in, only do the limit check/enforcing if needed.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-11  2:25           ` Cong Wang
@ 2018-07-11 19:33             ` Marcelo Ricardo Leitner
  2018-07-13  6:05               ` Cong Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-11 19:33 UTC (permalink / raw)
  To: Cong Wang
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Tue, Jul 10, 2018 at 07:25:53PM -0700, Cong Wang wrote:
> On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> >
> > On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
> > >    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
> > > with applications.
> >
> > If done, it needs to be done carefully, indeed. I don't know if it's
> > doable, neither I know how hard is your requirement for 64 different
> > priorities.
> 
> struct tc_prio_qopt {
>         int     bands;                  /* Number of bands */
>         __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
> };
> 
> How would you do it carefully?

quick shot, multiplex v1 and v2 formats based on bands and sizeof():

#define TCQ_PRIO_BANDS_V1	16
#define TCQ_PRIO_BANDS_V2	64
#define TC_PRIO_MAX_V2		64

struct tc_prio_qopt_v2 {
        int     bands;                  /* Number of bands */
        __u8    priomap[TC_PRIO_MAX_V2+1]; /* Map: logical priority -> PRIO band */
};

static int prio_tune(struct Qdisc *sch, struct nlattr *opt,
                     struct netlink_ext_ack *extack)
{
        struct prio_sched_data *q = qdisc_priv(sch);
        struct Qdisc *queues[TCQ_PRIO_BANDS_V2];
        int oldbands = q->bands, i;
        struct tc_prio_qopt_v2 *qopt;

        if (nla_len(opt) < sizeof(int))
                return -EINVAL;
        qopt = nla_data(opt);

	if (qopt->bands <= TCQ_PRIO_BANDS_V1 &&
            nla_len(opt) < sizeof(struct tc_prio_qopt))
                return -EINVAL;

	if (qopt->bands <= TCQ_PRIO_BANDS_V2 &&
            nla_len(opt) < sizeof(*qopt))
                return -EINVAL;

	/* By here, if it has up to 3 bands, we can assume it is using the _v1
	 * layout, while if it has up to TCQ_PRIO_BANDS_V2 it is using the _v2
	 * format.
	 */

        if (qopt->bands > TCQ_PRIO_BANDS_V2 || qopt->bands < 2)
                return -EINVAL;
...

With something like this I think it can keep compatibility with old
software while also allowing the new usage.

> Also, it is not only used by prio but also pfifo_fast.

Yes. More is needed, indeed. prio2band would also need to be expanded,
etc. Yet, I still don't see any blocker.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-11 18:37         ` Marcelo Ricardo Leitner
@ 2018-07-13  5:07           ` Cong Wang
  2018-07-13 13:00             ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 26+ messages in thread
From: Cong Wang @ 2018-07-13  5:07 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Wed, Jul 11, 2018 at 11:37 AM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Tue, Jul 10, 2018 at 07:32:43PM -0700, Cong Wang wrote:
> > On Mon, Jul 9, 2018 at 12:53 PM Marcelo Ricardo Leitner
> > <marcelo.leitner@gmail.com> wrote:
> > >
> > > On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
> > > >
> > > >    2. sch_prio.c does not have a global limit on the number of packets on
> > > > all its queues, only a limit per queue.
> > >
> > > It can be useful to sch_prio.c as well, why not?
> > > prio_enqueue()
> > > {
> > > ...
> > > +       if (count > sch->global_limit)
> > > +               prio_tail_drop(sch);   /* to be implemented */
> > >         ret = qdisc_enqueue(skb, qdisc, to_free);
> > >
> >
> > Isn't the whole point of sch_prio offloading the queueing to
> > each class? If you need a limit, there is one for each child
> > qdisc if you use for example pfifo or bfifo (depending on you
> > want to limit bytes or packets).
>
> Yes, but Michel wants to drop from other lower priorities if needed,
> and that's not possible if you handle the limit already in a child
> qdisc as they don't know about their siblings. The idea in the example
> above is to discard it from whatever lower priority is needed, then
> queue it. (ok, the example missed to check the priority level)

So it disproves your point of adding a flag to sch_prio, right?

Also, you have to re-introduce qdisc->ops->drop() if you really want
to go this direction.

>
> As for the different units, sch_prio holds a count of how many packets
> are queued on its children, and that's what would be used for the limit.
>
> >
> > Also, what's your plan for backward compatibility here?
>
> say:
>   if (sch->global_limit && count > sch->global_limit)
> as in, only do the limit check/enforcing if needed.

Obviously doesn't work, users could pass 0 to effectively
disable the qdisc from enqueue'ing any packet.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-11 19:33             ` Marcelo Ricardo Leitner
@ 2018-07-13  6:05               ` Cong Wang
  2018-07-13 13:04                 ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 26+ messages in thread
From: Cong Wang @ 2018-07-13  6:05 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Wed, Jul 11, 2018 at 12:33 PM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Tue, Jul 10, 2018 at 07:25:53PM -0700, Cong Wang wrote:
> > On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner
> > <marcelo.leitner@gmail.com> wrote:
> > >
> > > On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
> > > >    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
> > > > with applications.
> > >
> > > If done, it needs to be done carefully, indeed. I don't know if it's
> > > doable, neither I know how hard is your requirement for 64 different
> > > priorities.
> >
> > struct tc_prio_qopt {
> >         int     bands;                  /* Number of bands */
> >         __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
> > };
> >
> > How would you do it carefully?
>
> quick shot, multiplex v1 and v2 formats based on bands and sizeof():
>
> #define TCQ_PRIO_BANDS_V1       16
> #define TCQ_PRIO_BANDS_V2       64
> #define TC_PRIO_MAX_V2          64
>
> struct tc_prio_qopt_v2 {
>         int     bands;                  /* Number of bands */
>         __u8    priomap[TC_PRIO_MAX_V2+1]; /* Map: logical priority -> PRIO band */
> };
>

Good try, but:

1. You don't take padding into account, although the difference
between 16 and 64 is big here. If it were 16 and 20, almost certainly
wouldn't work.

2. What if I compile a new iproute2 on an old kernel? The iproute2
will use V2, while old kernel has no knowledge of V2, so it only
copies a part of V2 in the end....

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-13  5:07           ` Cong Wang
@ 2018-07-13 13:00             ` Marcelo Ricardo Leitner
  2018-07-13 18:17               ` Cong Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-13 13:00 UTC (permalink / raw)
  To: Cong Wang
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Thu, Jul 12, 2018 at 10:07:30PM -0700, Cong Wang wrote:
> On Wed, Jul 11, 2018 at 11:37 AM Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> >
> > On Tue, Jul 10, 2018 at 07:32:43PM -0700, Cong Wang wrote:
> > > On Mon, Jul 9, 2018 at 12:53 PM Marcelo Ricardo Leitner
> > > <marcelo.leitner@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
> > > > >
> > > > >    2. sch_prio.c does not have a global limit on the number of packets on
> > > > > all its queues, only a limit per queue.
> > > >
> > > > It can be useful to sch_prio.c as well, why not?
> > > > prio_enqueue()
> > > > {
> > > > ...
> > > > +       if (count > sch->global_limit)
> > > > +               prio_tail_drop(sch);   /* to be implemented */
> > > >         ret = qdisc_enqueue(skb, qdisc, to_free);
> > > >
> > >
> > > Isn't the whole point of sch_prio offloading the queueing to
> > > each class? If you need a limit, there is one for each child
> > > qdisc if you use for example pfifo or bfifo (depending on you
> > > want to limit bytes or packets).
> >
> > Yes, but Michel wants to drop from other lower priorities if needed,
> > and that's not possible if you handle the limit already in a child
> > qdisc as they don't know about their siblings. The idea in the example
> > above is to discard it from whatever lower priority is needed, then
> > queue it. (ok, the example missed to check the priority level)
> 
> So it disproves your point of adding a flag to sch_prio, right?

I don't see how?

> 
> Also, you have to re-introduce qdisc->ops->drop() if you really want
> to go this direction.

Again, yes. What's the deal with it?

> 
> >
> > As for the different units, sch_prio holds a count of how many packets
> > are queued on its children, and that's what would be used for the limit.
> >
> > >
> > > Also, what's your plan for backward compatibility here?
> >
> > say:
> >   if (sch->global_limit && count > sch->global_limit)
> > as in, only do the limit check/enforcing if needed.
> 
> Obviously doesn't work, users could pass 0 to effectively
> disable the qdisc from enqueue'ing any packet.

If you only had considered the right 'limit' variable, you would be
right here.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-13  6:05               ` Cong Wang
@ 2018-07-13 13:04                 ` Marcelo Ricardo Leitner
  2018-07-13 18:26                   ` Cong Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-13 13:04 UTC (permalink / raw)
  To: Cong Wang
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Thu, Jul 12, 2018 at 11:05:45PM -0700, Cong Wang wrote:
> On Wed, Jul 11, 2018 at 12:33 PM Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> >
> > On Tue, Jul 10, 2018 at 07:25:53PM -0700, Cong Wang wrote:
> > > On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner
> > > <marcelo.leitner@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
> > > > >    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
> > > > > with applications.
> > > >
> > > > If done, it needs to be done carefully, indeed. I don't know if it's
> > > > doable, neither I know how hard is your requirement for 64 different
> > > > priorities.
> > >
> > > struct tc_prio_qopt {
> > >         int     bands;                  /* Number of bands */
> > >         __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
> > > };
> > >
> > > How would you do it carefully?
> >
> > quick shot, multiplex v1 and v2 formats based on bands and sizeof():
> >
> > #define TCQ_PRIO_BANDS_V1       16
> > #define TCQ_PRIO_BANDS_V2       64
> > #define TC_PRIO_MAX_V2          64
> >
> > struct tc_prio_qopt_v2 {
> >         int     bands;                  /* Number of bands */
> >         __u8    priomap[TC_PRIO_MAX_V2+1]; /* Map: logical priority -> PRIO band */
> > };
> >
> 
> Good try, but:
> 
> 1. You don't take padding into account, although the difference
> between 16 and 64 is big here. If it were 16 and 20, almost certainly
> wouldn't work.

It still would work, no matter how much padding you have, as currently
you can't use more than 3 bands.

> 
> 2. What if I compile a new iproute2 on an old kernel? The iproute2
> will use V2, while old kernel has no knowledge of V2, so it only
> copies a part of V2 in the end....

Yes, and that's not a problem:
- Either bands is > 3 and it will return EINVAL, protecting from
  reading beyond the buffer.
- Or 2 <= bands <= 3 and it will handle it as a _v1 struct, and use
  only the original size.

iproute2 (or other app) may still use _v1 if it wants, btw.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-13 13:00             ` Marcelo Ricardo Leitner
@ 2018-07-13 18:17               ` Cong Wang
  2018-07-14  4:51                 ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 26+ messages in thread
From: Cong Wang @ 2018-07-13 18:17 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Fri, Jul 13, 2018 at 6:00 AM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Thu, Jul 12, 2018 at 10:07:30PM -0700, Cong Wang wrote:
> > On Wed, Jul 11, 2018 at 11:37 AM Marcelo Ricardo Leitner
> > <marcelo.leitner@gmail.com> wrote:
> > >
> > > On Tue, Jul 10, 2018 at 07:32:43PM -0700, Cong Wang wrote:
> > > > On Mon, Jul 9, 2018 at 12:53 PM Marcelo Ricardo Leitner
> > > > <marcelo.leitner@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote:
> > > > > >
> > > > > >    2. sch_prio.c does not have a global limit on the number of packets on
> > > > > > all its queues, only a limit per queue.
> > > > >
> > > > > It can be useful to sch_prio.c as well, why not?
> > > > > prio_enqueue()
> > > > > {
> > > > > ...
> > > > > +       if (count > sch->global_limit)
> > > > > +               prio_tail_drop(sch);   /* to be implemented */
> > > > >         ret = qdisc_enqueue(skb, qdisc, to_free);
> > > > >
> > > >
> > > > Isn't the whole point of sch_prio offloading the queueing to
> > > > each class? If you need a limit, there is one for each child
> > > > qdisc if you use for example pfifo or bfifo (depending on you
> > > > want to limit bytes or packets).
> > >
> > > Yes, but Michel wants to drop from other lower priorities if needed,
> > > and that's not possible if you handle the limit already in a child
> > > qdisc as they don't know about their siblings. The idea in the example
> > > above is to discard it from whatever lower priority is needed, then
> > > queue it. (ok, the example missed to check the priority level)
> >
> > So it disproves your point of adding a flag to sch_prio, right?
>
> I don't see how?

Interesting, you said "Michel wants to drop from other lower
priorities if needed", but sch_prio has no knowledge of this,
you confirmed with "...if you handle the limit already in a child
qdisc as they don't know about their siblings."

The if clause is true as the limit is indeed handled by its child
qdiscs as designed.

Therefore, a simple of adding a flag to sch_prio, as you
suggested and demonstrated above, doesn't work, as
confirmed by your own words.

What am I missing here?

Are you go further by suggesting moving the limit out of prio?
Or are you going to expand your definition of "adding a flag"?
Perhaps two flags? :)

I am very open for discussion to see how far we can go.

>
> >
> > Also, you have to re-introduce qdisc->ops->drop() if you really want
> > to go this direction.
>
> Again, yes. What's the deal with it?
>

Nothing, just want to tell you ops->drop() is nothing new, to help
your discussion.


> >
> > >
> > > As for the different units, sch_prio holds a count of how many packets
> > > are queued on its children, and that's what would be used for the limit.
> > >
> > > >
> > > > Also, what's your plan for backward compatibility here?
> > >
> > > say:
> > >   if (sch->global_limit && count > sch->global_limit)
> > > as in, only do the limit check/enforcing if needed.
> >
> > Obviously doesn't work, users could pass 0 to effectively
> > disable the qdisc from enqueue'ing any packet.
>
> If you only had considered the right 'limit' variable, you would be
> right here.

Yeah, that is exactly what you propose, isn't it? :)

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-13 13:04                 ` Marcelo Ricardo Leitner
@ 2018-07-13 18:26                   ` Cong Wang
  2018-07-14  4:39                     ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 26+ messages in thread
From: Cong Wang @ 2018-07-13 18:26 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Fri, Jul 13, 2018 at 6:04 AM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Thu, Jul 12, 2018 at 11:05:45PM -0700, Cong Wang wrote:
> > On Wed, Jul 11, 2018 at 12:33 PM Marcelo Ricardo Leitner
> > <marcelo.leitner@gmail.com> wrote:
> > >
> > > On Tue, Jul 10, 2018 at 07:25:53PM -0700, Cong Wang wrote:
> > > > On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner
> > > > <marcelo.leitner@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
> > > > > >    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
> > > > > > with applications.
> > > > >
> > > > > If done, it needs to be done carefully, indeed. I don't know if it's
> > > > > doable, neither I know how hard is your requirement for 64 different
> > > > > priorities.
> > > >
> > > > struct tc_prio_qopt {
> > > >         int     bands;                  /* Number of bands */
> > > >         __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
> > > > };
> > > >
> > > > How would you do it carefully?
> > >
> > > quick shot, multiplex v1 and v2 formats based on bands and sizeof():
> > >
> > > #define TCQ_PRIO_BANDS_V1       16
> > > #define TCQ_PRIO_BANDS_V2       64
> > > #define TC_PRIO_MAX_V2          64
> > >
> > > struct tc_prio_qopt_v2 {
> > >         int     bands;                  /* Number of bands */
> > >         __u8    priomap[TC_PRIO_MAX_V2+1]; /* Map: logical priority -> PRIO band */
> > > };
> > >
> >
> > Good try, but:
> >
> > 1. You don't take padding into account, although the difference
> > between 16 and 64 is big here. If it were 16 and 20, almost certainly
> > wouldn't work.
>
> It still would work, no matter how much padding you have, as currently
> you can't use more than 3 bands.

I am lost.

With your proposal above, you have 16 bands for V1 and 64 bands
for V2, where does 3 come from???


>
> >
> > 2. What if I compile a new iproute2 on an old kernel? The iproute2
> > will use V2, while old kernel has no knowledge of V2, so it only
> > copies a part of V2 in the end....
>
> Yes, and that's not a problem:
> - Either bands is > 3 and it will return EINVAL, protecting from
>   reading beyond the buffer.
> - Or 2 <= bands <= 3 and it will handle it as a _v1 struct, and use
>   only the original size.

Again why 3 not 16 or 64 ??

Also, why does an old kernel has the logic in its binary to determine
this?

>
> iproute2 (or other app) may still use _v1 if it wants, btw.

Yes, old iproute2 must still have v1, what's point? Are you
suggesting new iproute2 should still have v1 after you propose
v1 and v2 for kernel?

I must seriously miss something. Please help.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-13 18:26                   ` Cong Wang
@ 2018-07-14  4:39                     ` Marcelo Ricardo Leitner
  2018-07-17  6:41                       ` Cong Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-14  4:39 UTC (permalink / raw)
  To: Cong Wang
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Fri, Jul 13, 2018 at 11:26:28AM -0700, Cong Wang wrote:
> On Fri, Jul 13, 2018 at 6:04 AM Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> >
> > On Thu, Jul 12, 2018 at 11:05:45PM -0700, Cong Wang wrote:
> > > On Wed, Jul 11, 2018 at 12:33 PM Marcelo Ricardo Leitner
> > > <marcelo.leitner@gmail.com> wrote:
> > > >
> > > > On Tue, Jul 10, 2018 at 07:25:53PM -0700, Cong Wang wrote:
> > > > > On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner
> > > > > <marcelo.leitner@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
> > > > > > >    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
> > > > > > > with applications.
> > > > > >
> > > > > > If done, it needs to be done carefully, indeed. I don't know if it's
> > > > > > doable, neither I know how hard is your requirement for 64 different
> > > > > > priorities.
> > > > >
> > > > > struct tc_prio_qopt {
> > > > >         int     bands;                  /* Number of bands */
> > > > >         __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
> > > > > };
> > > > >
> > > > > How would you do it carefully?
> > > >
> > > > quick shot, multiplex v1 and v2 formats based on bands and sizeof():
> > > >
> > > > #define TCQ_PRIO_BANDS_V1       16
> > > > #define TCQ_PRIO_BANDS_V2       64
> > > > #define TC_PRIO_MAX_V2          64
> > > >
> > > > struct tc_prio_qopt_v2 {
> > > >         int     bands;                  /* Number of bands */
> > > >         __u8    priomap[TC_PRIO_MAX_V2+1]; /* Map: logical priority -> PRIO band */
> > > > };
> > > >
> > >
> > > Good try, but:
> > >
> > > 1. You don't take padding into account, although the difference
> > > between 16 and 64 is big here. If it were 16 and 20, almost certainly
> > > wouldn't work.
> >
> > It still would work, no matter how much padding you have, as currently
> > you can't use more than 3 bands.
> 
> I am lost.
> 
> With your proposal above, you have 16 bands for V1 and 64 bands
> for V2, where does 3 come from???

My bad. s/3/16/

> 
> 
> >
> > >
> > > 2. What if I compile a new iproute2 on an old kernel? The iproute2
> > > will use V2, while old kernel has no knowledge of V2, so it only
> > > copies a part of V2 in the end....
> >
> > Yes, and that's not a problem:
> > - Either bands is > 3 and it will return EINVAL, protecting from
> >   reading beyond the buffer.
> > - Or 2 <= bands <= 3 and it will handle it as a _v1 struct, and use
> >   only the original size.
> 
> Again why 3 not 16 or 64 ??

Again, s/3/16/

> 
> Also, why does an old kernel has the logic in its binary to determine
> this?

It won't, and it doesn't need to. If you use bands > 16 with an old
kernel, it will reject per current code (that I already pasted):

        if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2)
                return -EINVAL;

Simple as that. If you try to use more bands than it supports, it will
reject it.

> 
> >
> > iproute2 (or other app) may still use _v1 if it wants, btw.
> 
> Yes, old iproute2 must still have v1, what's point? Are you

??

> suggesting new iproute2 should still have v1 after you propose
> v1 and v2 for kernel?

I'm only saying that both versions will be accepted by a new kernel.

> 
> I must seriously miss something. Please help.
> 
> Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-13 18:17               ` Cong Wang
@ 2018-07-14  4:51                 ` Marcelo Ricardo Leitner
  2018-07-17  5:36                   ` Cong Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-07-14  4:51 UTC (permalink / raw)
  To: Cong Wang
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Fri, Jul 13, 2018 at 11:17:18AM -0700, Cong Wang wrote:
...
> > > > > Isn't the whole point of sch_prio offloading the queueing to
> > > > > each class? If you need a limit, there is one for each child
> > > > > qdisc if you use for example pfifo or bfifo (depending on you
> > > > > want to limit bytes or packets).
> > > >
> > > > Yes, but Michel wants to drop from other lower priorities if needed,
> > > > and that's not possible if you handle the limit already in a child
> > > > qdisc as they don't know about their siblings. The idea in the example
> > > > above is to discard it from whatever lower priority is needed, then
> > > > queue it. (ok, the example missed to check the priority level)
> > >
> > > So it disproves your point of adding a flag to sch_prio, right?
> >
> > I don't see how?
> 
> Interesting, you said "Michel wants to drop from other lower
> priorities if needed", but sch_prio has no knowledge of this,
> you confirmed with "...if you handle the limit already in a child
> qdisc as they don't know about their siblings."
> 
> The if clause is true as the limit is indeed handled by its child
> qdiscs as designed.
> 
> Therefore, a simple of adding a flag to sch_prio, as you
> suggested and demonstrated above, doesn't work, as
> confirmed by your own words.

Well, it would help if you didn't cut out key parts of my words.

> 
> What am I missing here?
> 
> Are you go further by suggesting moving the limit out of prio?
> Or are you going to expand your definition of "adding a flag"?
> Perhaps two flags? :)
> 
> I am very open for discussion to see how far we can go.

I am not keen on continuing this discussion if you keep twisting my
words just for fun.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-14  4:51                 ` Marcelo Ricardo Leitner
@ 2018-07-17  5:36                   ` Cong Wang
  0 siblings, 0 replies; 26+ messages in thread
From: Cong Wang @ 2018-07-17  5:36 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Fri, Jul 13, 2018 at 9:51 PM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> Well, it would help if you didn't cut out key parts of my words.

Sorry about it, please allow me to copy and paste all of your words
here:

"Yes, but Michel wants to drop from other lower priorities if needed,
and that's not possible if you handle the limit already in a child
qdisc as they don't know about their siblings. The idea in the example
above is to discard it from whatever lower priority is needed, then
queue it. (ok, the example missed to check the priority level)"

So from your own words, you agreed "the idea in the example"
is not what Michel wants, because "is to discard it from whatever
lower priority is needed", as "Michel wants to drop from other lower
priorities if needed".

You also agreed Michel's requirement is not possible (to implement
in sch_prio) because "you handle the limit already in a child qdisc
as they don't know about their siblings" is also true.

Based on the above, I said it "disproves your point of adding a flag
to sch_prio".

What am I missing?



>
> >
> > What am I missing here?
> >
> > Are you go further by suggesting moving the limit out of prio?
> > Or are you going to expand your definition of "adding a flag"?
> > Perhaps two flags? :)
> >
> > I am very open for discussion to see how far we can go.
>
> I am not keen on continuing this discussion if you keep twisting my
> words just for fun.

No, I am trying to understand seriously about what you suggest here.

Please be patient! I know I am stupid!!!! :)

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-14  4:39                     ` Marcelo Ricardo Leitner
@ 2018-07-17  6:41                       ` Cong Wang
  0 siblings, 0 replies; 26+ messages in thread
From: Cong Wang @ 2018-07-17  6:41 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Michel Machado, Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Linux Kernel Network Developers, Cody Doucette

On Fri, Jul 13, 2018 at 9:39 PM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Fri, Jul 13, 2018 at 11:26:28AM -0700, Cong Wang wrote:
> > On Fri, Jul 13, 2018 at 6:04 AM Marcelo Ricardo Leitner
> > <marcelo.leitner@gmail.com> wrote:
> > >
> > > On Thu, Jul 12, 2018 at 11:05:45PM -0700, Cong Wang wrote:
> > > > On Wed, Jul 11, 2018 at 12:33 PM Marcelo Ricardo Leitner
> > > > <marcelo.leitner@gmail.com> wrote:
> > > > >
> > > > > On Tue, Jul 10, 2018 at 07:25:53PM -0700, Cong Wang wrote:
> > > > > > On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner
> > > > > > <marcelo.leitner@gmail.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote:
> > > > > > > >    Changing TC_PRIO_MAX from 15 to 63 risks breaking backward compatibility
> > > > > > > > with applications.
> > > > > > >
> > > > > > > If done, it needs to be done carefully, indeed. I don't know if it's
> > > > > > > doable, neither I know how hard is your requirement for 64 different
> > > > > > > priorities.
> > > > > >
> > > > > > struct tc_prio_qopt {
> > > > > >         int     bands;                  /* Number of bands */
> > > > > >         __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
> > > > > > };
> > > > > >
> > > > > > How would you do it carefully?
> > > > >
> > > > > quick shot, multiplex v1 and v2 formats based on bands and sizeof():
> > > > >
> > > > > #define TCQ_PRIO_BANDS_V1       16
> > > > > #define TCQ_PRIO_BANDS_V2       64
> > > > > #define TC_PRIO_MAX_V2          64
> > > > >
> > > > > struct tc_prio_qopt_v2 {
> > > > >         int     bands;                  /* Number of bands */
> > > > >         __u8    priomap[TC_PRIO_MAX_V2+1]; /* Map: logical priority -> PRIO band */
> > > > > };
> > > > >
> > > >
> > > > Good try, but:
> > > >
> > > > 1. You don't take padding into account, although the difference
> > > > between 16 and 64 is big here. If it were 16 and 20, almost certainly
> > > > wouldn't work.
> > >
> > > It still would work, no matter how much padding you have, as currently
> > > you can't use more than 3 bands.
> >
> > I am lost.
> >
> > With your proposal above, you have 16 bands for V1 and 64 bands
> > for V2, where does 3 come from???
>
> My bad. s/3/16/


Ah, thanks for clarifying it! Please see below.


>
> >
> >
> > >
> > > >
> > > > 2. What if I compile a new iproute2 on an old kernel? The iproute2
> > > > will use V2, while old kernel has no knowledge of V2, so it only
> > > > copies a part of V2 in the end....
> > >
> > > Yes, and that's not a problem:
> > > - Either bands is > 3 and it will return EINVAL, protecting from
> > >   reading beyond the buffer.
> > > - Or 2 <= bands <= 3 and it will handle it as a _v1 struct, and use
> > >   only the original size.
> >
> > Again why 3 not 16 or 64 ??
>
> Again, s/3/16/
>
> >
> > Also, why does an old kernel has the logic in its binary to determine
> > this?
>
> It won't, and it doesn't need to. If you use bands > 16 with an old
> kernel, it will reject per current code (that I already pasted):
>
>         if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2)
>                 return -EINVAL;
>
> Simple as that. If you try to use more bands than it supports, it will
> reject it.

Hmm, I see, But in your demo code, you miss the following pieces:

        for (i = 0; i <= TC_PRIO_MAX; i++) {
                if (qopt->priomap[i] >= qopt->bands)
                        return -EINVAL;
        }

        memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1);


I guess you want to change TC_PRIO_MAX to qopt->bands
too.

With this together, your suggestion actually looks reasonable.

Do I understand it correctly?


>
> >
> > >
> > > iproute2 (or other app) may still use _v1 if it wants, btw.
> >
> > Yes, old iproute2 must still have v1, what's point? Are you
>
> ??
>
> > suggesting new iproute2 should still have v1 after you propose
> > v1 and v2 for kernel?
>
> I'm only saying that both versions will be accepted by a new kernel.

I see, I thought you suggest to completely move to V2 for new
kernel.

Nice compatibility trick!

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
  2018-07-11 15:24   ` Michel Machado
@ 2018-07-19 18:39     ` Cong Wang
  0 siblings, 0 replies; 26+ messages in thread
From: Cong Wang @ 2018-07-19 18:39 UTC (permalink / raw)
  To: Michel Machado
  Cc: Nishanth Devarajan, Jamal Hadi Salim, Jiri Pirko, David Miller,
	Linux Kernel Network Developers, Cody Doucette

(Sorry for missing this email, it is lost in other discussions.)

On Wed, Jul 11, 2018 at 8:25 AM Michel Machado <michel@digirati.com.br> wrote:
>
> On 07/10/2018 10:57 PM, Cong Wang wrote:
> > The dev->tx_queue_len is fundamentally non-sense since now
> > almost every real NIC is multi-queue and qdisc has a completely
> > different sch->limit. This is why I suggested you to simply
> > avoid it in your code.
>
>     Would you be okay with a constant there? If so, we could just put 64
> there. The optimal number is hardware dependent, but we don't know how
> to calculate it.

Yes, sure, fq_codel uses 10240 already. :)


>
> > There is no standard way to use dev->tx_queue_len in kernel,
> > so I can't claim your use is correct or not, but it still looks odd,
> > other qdisc seems just uses as a default, rather than picking
> > the smaller or bigger value as a cap.
>
>     The reason for the `max(qdisc_dev(sch)->tx_queue_len, min_limit)` is
> to make sure that sch->limit is at least 1. We couldn't come up with a
> meaningful behavior for sch->limit being zero, so we defined the basis
> case of skbprio_enqueue() as sch->limit one. If there's a guarantee that
> qdisc_dev(sch)->tx_queue_len is always greater than zero, we don't need
> the max().

I think tx_queue_len could be 0. But again, why do you need to care
about tx_queue_len being 0 or not here?

sch->limit could be 0 too, it means this qdisc should not queue any
packets.

Thanks

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-07-19 19:24 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-07 10:13 [PATCH v3 net-next] net/sched: add skbprio scheduler Nishanth Devarajan
2018-07-09 15:44 ` Marcelo Ricardo Leitner
2018-07-09 18:18   ` Michel Machado
2018-07-09 19:53     ` Marcelo Ricardo Leitner
2018-07-09 21:03       ` Michel Machado
2018-07-09 21:40         ` Marcelo Ricardo Leitner
2018-07-10 14:03           ` Michel Machado
2018-07-10 14:33             ` Marcelo Ricardo Leitner
2018-07-11  2:25           ` Cong Wang
2018-07-11 19:33             ` Marcelo Ricardo Leitner
2018-07-13  6:05               ` Cong Wang
2018-07-13 13:04                 ` Marcelo Ricardo Leitner
2018-07-13 18:26                   ` Cong Wang
2018-07-14  4:39                     ` Marcelo Ricardo Leitner
2018-07-17  6:41                       ` Cong Wang
2018-07-11  2:32       ` Cong Wang
2018-07-11 18:37         ` Marcelo Ricardo Leitner
2018-07-13  5:07           ` Cong Wang
2018-07-13 13:00             ` Marcelo Ricardo Leitner
2018-07-13 18:17               ` Cong Wang
2018-07-14  4:51                 ` Marcelo Ricardo Leitner
2018-07-17  5:36                   ` Cong Wang
2018-07-11  2:38     ` Cong Wang
2018-07-11  2:57 ` Cong Wang
2018-07-11 15:24   ` Michel Machado
2018-07-19 18:39     ` Cong Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).