* [net-next v8 0/2] net: sched: allow user to select txqueue
@ 2022-01-26 14:32 xiangxia.m.yue
  2022-01-26 14:32 ` [net-next v8 1/2] net: sched: use queue_mapping to pick tx queue xiangxia.m.yue
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: xiangxia.m.yue @ 2022-01-26 14:32 UTC (permalink / raw)
  To: netdev
  Cc: Tonghao Zhang, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

Patch 1 allows the user to select the tx queue in the clsact hook.
Patch 2 supports skbhash, classid, and cpuid for selecting the tx queue.
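
A quick usage sketch (illustrative only; the device, matches, and
filter details are assumptions, and the skbhash form follows the
syntax shown in patch 2):

  $ tc qdisc add dev eth0 clsact
  # patch 1: queue_mapping set from the clsact egress hook is honored
  $ tc filter add dev eth0 egress protocol ip flower ip_proto tcp dst_port 80 \
      action skbedit queue_mapping 1
  # patch 2: spread matching traffic across tx queues 0..3 by skb hash
  $ tc filter add dev eth0 egress protocol ip flower ip_proto udp \
      action skbedit queue_mapping skbhash 0 3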

Tonghao Zhang (2):
  net: sched: use queue_mapping to pick tx queue
  net: sched: support hash/classid/cpuid selecting tx queue

 include/linux/netdevice.h              |  3 +
 include/linux/rtnetlink.h              |  1 +
 include/net/tc_act/tc_skbedit.h        |  1 +
 include/uapi/linux/tc_act/tc_skbedit.h |  8 +++
 net/core/dev.c                         | 31 +++++++++-
 net/sched/act_skbedit.c                | 84 ++++++++++++++++++++++++--
 6 files changed, 122 insertions(+), 6 deletions(-)

-- 
v8:
* 1/2 remove the inline keyword.
v7:
* 1/2 fix a build warning; move tx queue picking under
  egress_needed_key to simplify the code.
v6:
* 1/2 use a static key; compile the code only when CONFIG_NET_EGRESS
  is configured.
v5:
* 1/2 merge netdev_xmit_reset_txqueue(void) and
  netdev_xmit_skip_txqueue(void) into netdev_xmit_skip_txqueue(bool skip).
v4:
* 1/2 introduce netdev_xmit_reset_txqueue(), invoked in
  __dev_queue_xmit(), so xmit.skip_txqueue will not affect tx queue
  selection in the next netdev or for subsequent packets.
  For more details, see the commit log.
* 2/2 fix the coding style, rename:
  SKBEDIT_F_QUEUE_MAPPING_HASH -> SKBEDIT_F_TXQ_SKBHASH
  SKBEDIT_F_QUEUE_MAPPING_CLASSID -> SKBEDIT_F_TXQ_CLASSID
  SKBEDIT_F_QUEUE_MAPPING_CPUID -> SKBEDIT_F_TXQ_CPUID
* 2/2 refactor tcf_skbedit_hash: if no hash type is specified, use
  queue_mapping directly, because hash % mapping_mod == 0 in "case 0:"
* 2/2 merge the checks and add extack messages
v3:
* 2/2 fix the warning; add the cpuid hash type.
v2:
* 1/2 change skb->tc_skip_txqueue to a per-cpu variable; expand the
  commit message.
* 2/2 optimize the code.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
--
2.27.0



* [net-next v8 1/2] net: sched: use queue_mapping to pick tx queue
  2022-01-26 14:32 [net-next v8 0/2] net: sched: allow user to select txqueue xiangxia.m.yue
@ 2022-01-26 14:32 ` xiangxia.m.yue
  2022-02-15  0:14   ` Jamal Hadi Salim
  2022-01-26 14:32 ` [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting " xiangxia.m.yue
  2022-01-27  2:24 ` [net-next v8 0/2] net: sched: allow user to select txqueue Jakub Kicinski
  2 siblings, 1 reply; 24+ messages in thread
From: xiangxia.m.yue @ 2022-01-26 14:32 UTC (permalink / raw)
  To: netdev
  Cc: Tonghao Zhang, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch fixes an issue:
* If we install tc filters with act_skbedit in the clsact hook,
  they don't work, because netdev_core_pick_tx() overwrites
  queue_mapping.

  $ tc filter ... action skbedit queue_mapping 1

And this patch is useful:
* We can use FQ + EDT to implement efficient policies. Tx queues
  are picked by XPS, the netdev driver's ndo_select_queue(), or the
  skb hash in netdev_core_pick_tx(). In practice, the netdev driver
  and the skb hash are _not_ under our control. XPS uses the CPU map
  to select Tx queues, but in most cases we can't figure out which
  pod/container task_struct is running on a given CPU. We can use
  clsact filters to classify one pod/container's traffic to one Tx
  queue. Why?

  In a container networking environment, there are two kinds of pod/
  container/net-namespace. For one kind (e.g. P1, P2), high throughput
  is key for the applications. To avoid exhausting network resources,
  the outbound traffic of these pods is limited, using or sharing
  dedicated Tx queues with an HTB/TBF/FQ Qdisc assigned. For the other
  kind of pods (e.g. Pn), low latency of data access is key, and their
  traffic is not limited; these pods use or share other dedicated Tx
  queues with a FIFO Qdisc assigned. This choice provides two benefits.
  First, contention on the HTB/FQ Qdisc lock is significantly reduced,
  since fewer CPUs contend for the same queue. More importantly, Qdisc
  contention can be eliminated completely if each CPU has its own FIFO
  Qdisc for the second kind of pods.

  There must be a mechanism in place to classify traffic from different
  pods/containers to different Tx queues. Note that clsact runs outside
  the Qdisc, while a Qdisc can only run a classifier to select a
  sub-queue under its lock.

  In general, recording the decision in the skb seems a little
  heavy-handed. This patch instead introduces a per-CPU variable, as
  suggested by Eric.

  The xmit.skip_txqueue flag is cleared first in __dev_queue_xmit():
  - A Tx Qdisc may also install skbedit actions; in that case the
    xmit.skip_txqueue flag is set in qdisc->enqueue() even though the
    tx queue has already been selected in netdev_tx_queue_mapping() or
    netdev_core_pick_tx(). Clearing the flag at the start of
    __dev_queue_xmit() is useful in two ways:
  - It avoids picking the Tx queue with netdev_tx_queue_mapping() in
    the next netdev of a stacked setup such as eth0 (macvlan) -
    eth0.3 (vlan) - eth0 (ixgbe PHY). For example, eth0 is a macvlan
    in a pod whose root Qdisc installs skbedit queue_mapping and sends
    packets to eth0.3, a vlan in the host. __dev_queue_xmit() for
    eth0.3 clears the flag and does not select the tx queue from
    skb->queue_mapping, because there are no filters in the clsact
    hook or tx Qdisc of that netdev. The same happens for eth0, the
    ixgbe device in the host.
  - It avoids carrying the Tx queue decision over to the next packet.
    If xmit.skip_txqueue is set in a tx Qdisc (qdisc->enqueue()), the
    proper place to clear it is __dev_queue_xmit() when processing the
    next packet.

  For performance reasons, a static key is used. If CONFIG_NET_EGRESS
  is not configured, the new code is not compiled in.

  +----+      +----+      +----+
  | P1 |      | P2 |      | Pn |
  +----+      +----+      +----+
    |           |           |
    +-----------+-----------+
                |
                | clsact/skbedit
                |      MQ
                v
    +-----------+-----------+
    | q0        | q1        | qn
    v           v           v
  HTB/FQ      HTB/FQ  ...  FIFO
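
  An illustrative configuration matching the diagram (a sketch; the
  device, pod address, rate, and queue numbers are assumptions):

    # one child Qdisc per tx queue under MQ
    $ tc qdisc add dev eth0 root handle 100: mq
    $ tc qdisc add dev eth0 parent 100:1 tbf rate 1gbit burst 128k latency 1ms
    $ tc qdisc add dev eth0 parent 100:2 fq
    $ tc qdisc add dev eth0 parent 100:3 pfifo
    # steer the rate-limited pod P1 (e.g. by source IP) to q0
    $ tc qdisc add dev eth0 clsact
    $ tc filter add dev eth0 egress protocol ip flower src_ip 172.17.0.2 \
        action skbedit queue_mapping 0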

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 include/linux/netdevice.h |  3 +++
 include/linux/rtnetlink.h |  1 +
 net/core/dev.c            | 31 +++++++++++++++++++++++++++++--
 net/sched/act_skbedit.c   |  6 +++++-
 4 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e490b84732d1..60e14b2b091d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3015,6 +3015,9 @@ struct softnet_data {
 	struct {
 		u16 recursion;
 		u8  more;
+#ifdef CONFIG_NET_EGRESS
+		u8  skip_txqueue;
+#endif
 	} xmit;
 #ifdef CONFIG_RPS
 	/* input_queue_head should be written by cpu owning this struct,
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index bb9cb84114c1..e87c2dccc4d5 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -100,6 +100,7 @@ void net_dec_ingress_queue(void);
 #ifdef CONFIG_NET_EGRESS
 void net_inc_egress_queue(void);
 void net_dec_egress_queue(void);
+void netdev_xmit_skip_txqueue(bool skip);
 #endif
 
 void rtnetlink_init(void);
diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f6..842473fa8e9f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3860,6 +3860,25 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 
 	return skb;
 }
+
+static struct netdev_queue *
+netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
+{
+	int qm = skb_get_queue_mapping(skb);
+
+	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
+}
+
+static bool netdev_xmit_txqueue_skipped(void)
+{
+	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+}
+
+void netdev_xmit_skip_txqueue(bool skip)
+{
+	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
+}
+EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
 #endif /* CONFIG_NET_EGRESS */
 
 #ifdef CONFIG_XPS
@@ -4030,7 +4049,7 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
 static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
-	struct netdev_queue *txq;
+	struct netdev_queue *txq = NULL;
 	struct Qdisc *q;
 	int rc = -ENOMEM;
 	bool again = false;
@@ -4058,11 +4077,17 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 			if (!skb)
 				goto out;
 		}
+
+		netdev_xmit_skip_txqueue(false);
+
 		nf_skip_egress(skb, true);
 		skb = sch_handle_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
 		nf_skip_egress(skb, false);
+
+		if (netdev_xmit_txqueue_skipped())
+			txq = netdev_tx_queue_mapping(dev, skb);
 	}
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
@@ -4073,7 +4098,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	else
 		skb_dst_force(skb);
 
-	txq = netdev_core_pick_tx(dev, skb, sb_dev);
+	if (likely(!txq))
+		txq = netdev_core_pick_tx(dev, skb, sb_dev);
+
 	q = rcu_dereference_bh(txq->qdisc);
 
 	trace_net_dev_queue(skb);
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index ceba11b198bb..d5799b4fc499 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -58,8 +58,12 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 		}
 	}
 	if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
-	    skb->dev->real_num_tx_queues > params->queue_mapping)
+	    skb->dev->real_num_tx_queues > params->queue_mapping) {
+#ifdef CONFIG_NET_EGRESS
+		netdev_xmit_skip_txqueue(true);
+#endif
 		skb_set_queue_mapping(skb, params->queue_mapping);
+	}
 	if (params->flags & SKBEDIT_F_MARK) {
 		skb->mark &= ~params->mask;
 		skb->mark |= params->mark & params->mask;
-- 
2.27.0



* [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-01-26 14:32 [net-next v8 0/2] net: sched: allow user to select txqueue xiangxia.m.yue
  2022-01-26 14:32 ` [net-next v8 1/2] net: sched: use queue_mapping to pick tx queue xiangxia.m.yue
@ 2022-01-26 14:32 ` xiangxia.m.yue
  2022-01-26 19:52   ` Cong Wang
  2022-02-15  0:22   ` Jamal Hadi Salim
  2022-01-27  2:24 ` [net-next v8 0/2] net: sched: allow user to select txqueue Jakub Kicinski
  2 siblings, 2 replies; 24+ messages in thread
From: xiangxia.m.yue @ 2022-01-26 14:32 UTC (permalink / raw)
  To: netdev
  Cc: Tonghao Zhang, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch allows the user to select a queue_mapping range
from A to B, and to use the skb hash, cgroup classid, or
cpuid to select Tx queues. Packets can then be load-balanced
across queues A to B. The range bounds are unsigned 16-bit
values in decimal format.

$ tc filter ... action skbedit queue_mapping skbhash A B

"skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
is enhanced with flags:
* SKBEDIT_F_TXQ_SKBHASH
* SKBEDIT_F_TXQ_CLASSID
* SKBEDIT_F_TXQ_CPUID

Use skb->hash, the cgroup classid, or the cpuid to distribute packets.
Then the same range of tx queues can be shared by different flows,
cgroups, or CPUs in a variety of scenarios.

For example, flow F1 may share range R1 with flow F2. The best
way to do that is to set the flag to SKBEDIT_F_TXQ_SKBHASH,
using skb->hash to spread load across the queues. If cgroup C1
wants to share R1 with cgroups C2 .. Cn, use
SKBEDIT_F_TXQ_CLASSID. Of course, in other scenarios C1 can use
R1 while Cn uses Rn.
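
A sketch of the three modes (illustrative; eth0 is an assumption, and
the classid/cpuid keywords are inferred from the flag names rather
than taken verbatim from the patch):

  # share queues 8..11 among flows, keyed by skb->hash
  $ tc filter add dev eth0 egress protocol ip flower \
      action skbedit queue_mapping skbhash 8 11
  # keyed by the net_cls cgroup classid
  $ tc filter add dev eth0 egress protocol ip flower \
      action skbedit queue_mapping classid 8 11
  # keyed by the CPU the sender runs on
  $ tc filter add dev eth0 egress protocol ip flower \
      action skbedit queue_mapping cpuid 8 11

With A=8 and B=11, mapping_mod = B - A + 1 = 4, so e.g. a packet whose
skb->hash is 42 is mapped to queue 8 + (42 % 4) = 10.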

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 include/net/tc_act/tc_skbedit.h        |  1 +
 include/uapi/linux/tc_act/tc_skbedit.h |  8 +++
 net/sched/act_skbedit.c                | 78 +++++++++++++++++++++++++-
 3 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/include/net/tc_act/tc_skbedit.h b/include/net/tc_act/tc_skbedit.h
index 00bfee70609e..ee96e0fa6566 100644
--- a/include/net/tc_act/tc_skbedit.h
+++ b/include/net/tc_act/tc_skbedit.h
@@ -17,6 +17,7 @@ struct tcf_skbedit_params {
 	u32 mark;
 	u32 mask;
 	u16 queue_mapping;
+	u16 mapping_mod;
 	u16 ptype;
 	struct rcu_head rcu;
 };
diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h
index 800e93377218..5ea1438a4d88 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -29,6 +29,13 @@
 #define SKBEDIT_F_PTYPE			0x8
 #define SKBEDIT_F_MASK			0x10
 #define SKBEDIT_F_INHERITDSFIELD	0x20
+#define SKBEDIT_F_TXQ_SKBHASH		0x40
+#define SKBEDIT_F_TXQ_CLASSID		0x80
+#define SKBEDIT_F_TXQ_CPUID		0x100
+
+#define SKBEDIT_F_TXQ_HASH_MASK (SKBEDIT_F_TXQ_SKBHASH | \
+				 SKBEDIT_F_TXQ_CLASSID | \
+				 SKBEDIT_F_TXQ_CPUID)
 
 struct tc_skbedit {
 	tc_gen;
@@ -45,6 +52,7 @@ enum {
 	TCA_SKBEDIT_PTYPE,
 	TCA_SKBEDIT_MASK,
 	TCA_SKBEDIT_FLAGS,
+	TCA_SKBEDIT_QUEUE_MAPPING_MAX,
 	__TCA_SKBEDIT_MAX
 };
 #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index d5799b4fc499..4c209689f8de 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -10,6 +10,7 @@
 #include <linux/kernel.h>
 #include <linux/skbuff.h>
 #include <linux/rtnetlink.h>
+#include <net/cls_cgroup.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/ip.h>
@@ -23,6 +24,38 @@
 static unsigned int skbedit_net_id;
 static struct tc_action_ops act_skbedit_ops;
 
+static u16 tcf_skbedit_hash(struct tcf_skbedit_params *params,
+			    struct sk_buff *skb)
+{
+	u32 mapping_hash_type = params->flags & SKBEDIT_F_TXQ_HASH_MASK;
+	u16 queue_mapping = params->queue_mapping;
+	u16 mapping_mod = params->mapping_mod;
+	u32 hash = 0;
+
+	switch (mapping_hash_type) {
+	case SKBEDIT_F_TXQ_CLASSID:
+		hash = task_get_classid(skb);
+		break;
+	case SKBEDIT_F_TXQ_SKBHASH:
+		hash = skb_get_hash(skb);
+		break;
+	case SKBEDIT_F_TXQ_CPUID:
+		hash = raw_smp_processor_id();
+		break;
+	case 0:
+		/* Hash type isn't specified. In this case:
+		 * hash % mapping_mod == 0
+		 */
+		break;
+	default:
+		net_warn_ratelimited("The type of queue_mapping hash is not supported. 0x%x\n",
+				     mapping_hash_type);
+	}
+
+	queue_mapping = queue_mapping + hash % mapping_mod;
+	return netdev_cap_txqueue(skb->dev, queue_mapping);
+}
+
 static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 			   struct tcf_result *res)
 {
@@ -62,7 +95,7 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 #ifdef CONFIG_NET_EGRESS
 		netdev_xmit_skip_txqueue(true);
 #endif
-		skb_set_queue_mapping(skb, params->queue_mapping);
+		skb_set_queue_mapping(skb, tcf_skbedit_hash(params, skb));
 	}
 	if (params->flags & SKBEDIT_F_MARK) {
 		skb->mark &= ~params->mask;
@@ -96,6 +129,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
 	[TCA_SKBEDIT_PTYPE]		= { .len = sizeof(u16) },
 	[TCA_SKBEDIT_MASK]		= { .len = sizeof(u32) },
 	[TCA_SKBEDIT_FLAGS]		= { .len = sizeof(u64) },
+	[TCA_SKBEDIT_QUEUE_MAPPING_MAX]	= { .len = sizeof(u16) },
 };
 
 static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
@@ -112,6 +146,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
 	struct tcf_skbedit *d;
 	u32 flags = 0, *priority = NULL, *mark = NULL, *mask = NULL;
 	u16 *queue_mapping = NULL, *ptype = NULL;
+	u16 mapping_mod = 1;
 	bool exists = false;
 	int ret = 0, err;
 	u32 index;
@@ -156,7 +191,34 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
 
 	if (tb[TCA_SKBEDIT_FLAGS] != NULL) {
 		u64 *pure_flags = nla_data(tb[TCA_SKBEDIT_FLAGS]);
-
+		u64 mapping_hash_type;
+
+		mapping_hash_type = *pure_flags & SKBEDIT_F_TXQ_HASH_MASK;
+		if (mapping_hash_type) {
+			u16 *queue_mapping_max;
+
+			/* Hash types are mutually exclusive. */
+			if (mapping_hash_type & (mapping_hash_type - 1)) {
+				NL_SET_ERR_MSG_MOD(extack, "Multi types of hash are specified.");
+				return -EINVAL;
+			}
+
+			if (!tb[TCA_SKBEDIT_QUEUE_MAPPING] ||
+			    !tb[TCA_SKBEDIT_QUEUE_MAPPING_MAX]) {
+				NL_SET_ERR_MSG_MOD(extack, "Missing required range of queue_mapping.");
+				return -EINVAL;
+			}
+
+			queue_mapping_max =
+				nla_data(tb[TCA_SKBEDIT_QUEUE_MAPPING_MAX]);
+			if (*queue_mapping_max < *queue_mapping) {
+				NL_SET_ERR_MSG_MOD(extack, "The range of queue_mapping is invalid, max < min.");
+				return -EINVAL;
+			}
+
+			mapping_mod = *queue_mapping_max - *queue_mapping + 1;
+			flags |= mapping_hash_type;
+		}
 		if (*pure_flags & SKBEDIT_F_INHERITDSFIELD)
 			flags |= SKBEDIT_F_INHERITDSFIELD;
 	}
@@ -208,8 +270,10 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
 	params_new->flags = flags;
 	if (flags & SKBEDIT_F_PRIORITY)
 		params_new->priority = *priority;
-	if (flags & SKBEDIT_F_QUEUE_MAPPING)
+	if (flags & SKBEDIT_F_QUEUE_MAPPING) {
 		params_new->queue_mapping = *queue_mapping;
+		params_new->mapping_mod = mapping_mod;
+	}
 	if (flags & SKBEDIT_F_MARK)
 		params_new->mark = *mark;
 	if (flags & SKBEDIT_F_PTYPE)
@@ -276,6 +340,13 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a,
 		goto nla_put_failure;
 	if (params->flags & SKBEDIT_F_INHERITDSFIELD)
 		pure_flags |= SKBEDIT_F_INHERITDSFIELD;
+	if (params->flags & SKBEDIT_F_TXQ_HASH_MASK) {
+		if (nla_put_u16(skb, TCA_SKBEDIT_QUEUE_MAPPING_MAX,
+				params->queue_mapping + params->mapping_mod - 1))
+			goto nla_put_failure;
+
+		pure_flags |= params->flags & SKBEDIT_F_TXQ_HASH_MASK;
+	}
 	if (pure_flags != 0 &&
 	    nla_put(skb, TCA_SKBEDIT_FLAGS, sizeof(pure_flags), &pure_flags))
 		goto nla_put_failure;
@@ -325,6 +396,7 @@ static size_t tcf_skbedit_get_fill_size(const struct tc_action *act)
 	return nla_total_size(sizeof(struct tc_skbedit))
 		+ nla_total_size(sizeof(u32)) /* TCA_SKBEDIT_PRIORITY */
 		+ nla_total_size(sizeof(u16)) /* TCA_SKBEDIT_QUEUE_MAPPING */
+		+ nla_total_size(sizeof(u16)) /* TCA_SKBEDIT_QUEUE_MAPPING_MAX */
 		+ nla_total_size(sizeof(u32)) /* TCA_SKBEDIT_MARK */
 		+ nla_total_size(sizeof(u16)) /* TCA_SKBEDIT_PTYPE */
 		+ nla_total_size(sizeof(u32)) /* TCA_SKBEDIT_MASK */
-- 
2.27.0



* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-01-26 14:32 ` [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting " xiangxia.m.yue
@ 2022-01-26 19:52   ` Cong Wang
  2022-01-27  1:23     ` Tonghao Zhang
  2022-01-31 13:12     ` Jamal Hadi Salim
  2022-02-15  0:22   ` Jamal Hadi Salim
  1 sibling, 2 replies; 24+ messages in thread
From: Cong Wang @ 2022-01-26 19:52 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Wed, Jan 26, 2022 at 6:32 AM <xiangxia.m.yue@gmail.com> wrote:
>
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>
> This patch allows user to select queue_mapping, range
> from A to B. And user can use skbhash, cgroup classid
> and cpuid to select Tx queues. Then we can load balance
> packets from A to B queue. The range is an unsigned 16bit
> value in decimal format.
>
> $ tc filter ... action skbedit queue_mapping skbhash A B
>
> "skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
> is enhanced with flags:
> * SKBEDIT_F_TXQ_SKBHASH
> * SKBEDIT_F_TXQ_CLASSID
> * SKBEDIT_F_TXQ_CPUID

NAK.

Resending the same nonsense over and over doesn't help anything at all.

You really should just use eBPF; with eBPF code you don't even need
to send anything upstream, and you can do whatever you want without
arguing with anyone. It is a win-win. I have no idea why you still
don't get this after wasting so much time.

Thanks.


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-01-26 19:52   ` Cong Wang
@ 2022-01-27  1:23     ` Tonghao Zhang
  2022-01-27  1:40       ` Tonghao Zhang
  2022-01-31 13:12     ` Jamal Hadi Salim
  1 sibling, 1 reply; 24+ messages in thread
From: Tonghao Zhang @ 2022-01-27  1:23 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Thu, Jan 27, 2022 at 3:52 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Jan 26, 2022 at 6:32 AM <xiangxia.m.yue@gmail.com> wrote:
> >
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> >
> > This patch allows user to select queue_mapping, range
> > from A to B. And user can use skbhash, cgroup classid
> > and cpuid to select Tx queues. Then we can load balance
> > packets from A to B queue. The range is an unsigned 16bit
> > value in decimal format.
> >
> > $ tc filter ... action skbedit queue_mapping skbhash A B
> >
> > "skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
> > is enhanced with flags:
> > * SKBEDIT_F_TXQ_SKBHASH
> > * SKBEDIT_F_TXQ_CLASSID
> > * SKBEDIT_F_TXQ_CPUID
>
> NAK.
>
> Keeping resending the same non-sense can't help anything at all.
>
> You really should just use eBPF, with eBPF code you don't even need
> to send anything to upstream, you can do whatever you want without
I know ebpf can do more things. but we can let everyone to use ebpf in
tc. For them, the
tc command is more easy to use in data center. we have talked this in
another thread.
> arguing with anyone. It is a win-win. I have no idea why you don't even
> get this after wasting so much time.
>
> Thanks.



-- 
Best regards, Tonghao


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-01-27  1:23     ` Tonghao Zhang
@ 2022-01-27  1:40       ` Tonghao Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Tonghao Zhang @ 2022-01-27  1:40 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Thu, Jan 27, 2022 at 9:23 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>
> On Thu, Jan 27, 2022 at 3:52 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Wed, Jan 26, 2022 at 6:32 AM <xiangxia.m.yue@gmail.com> wrote:
> > >
> > > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> > >
> > > This patch allows user to select queue_mapping, range
> > > from A to B. And user can use skbhash, cgroup classid
> > > and cpuid to select Tx queues. Then we can load balance
> > > packets from A to B queue. The range is an unsigned 16bit
> > > value in decimal format.
> > >
> > > $ tc filter ... action skbedit queue_mapping skbhash A B
> > >
> > > "skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
> > > is enhanced with flags:
> > > * SKBEDIT_F_TXQ_SKBHASH
> > > * SKBEDIT_F_TXQ_CLASSID
> > > * SKBEDIT_F_TXQ_CPUID
> >
> > NAK.
> >
> > Keeping resending the same non-sense can't help anything at all.
> >
> > You really should just use eBPF, with eBPF code you don't even need
> > to send anything to upstream, you can do whatever you want without
> I know ebpf can do more things. but we can let everyone to use ebpf in
can ->can't
> tc. For them, the
> tc command is more easy to use in data center. we have talked this in
> another thread.
> > arguing with anyone. It is a win-win. I have no idea why you don't even
> > get this after wasting so much time.
> >
> > Thanks.
>
>
>
> --
> Best regards, Tonghao



-- 
Best regards, Tonghao


* Re: [net-next v8 0/2] net: sched: allow user to select txqueue
  2022-01-26 14:32 [net-next v8 0/2] net: sched: allow user to select txqueue xiangxia.m.yue
  2022-01-26 14:32 ` [net-next v8 1/2] net: sched: use queue_mapping to pick tx queue xiangxia.m.yue
  2022-01-26 14:32 ` [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting " xiangxia.m.yue
@ 2022-01-27  2:24 ` Jakub Kicinski
  2 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2022-01-27  2:24 UTC (permalink / raw)
  To: netdev
  Cc: xiangxia.m.yue, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S. Miller, Jonathan Lemon, Eric Dumazet, Alexander Lobakin,
	Paolo Abeni, Talal Ahmad, Kevin Hao, Ilias Apalodimas,
	Antoine Tenart, Wei Wang

On Wed, 26 Jan 2022 22:32:04 +0800 xiangxia.m.yue@gmail.com wrote:
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> 
> Patch 1 allow user to select txqueue in clsact hook.
> Patch 2 support skbhash, classid, cpuid to select txqueue.

Does anyone else have an opinion on this one?


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-01-26 19:52   ` Cong Wang
  2022-01-27  1:23     ` Tonghao Zhang
@ 2022-01-31 13:12     ` Jamal Hadi Salim
  2022-02-05  7:25       ` Tonghao Zhang
  1 sibling, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2022-01-31 13:12 UTC (permalink / raw)
  To: Cong Wang, Tonghao Zhang
  Cc: Linux Kernel Network Developers, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Jonathan Lemon, Eric Dumazet, Alexander Lobakin,
	Paolo Abeni, Talal Ahmad, Kevin Hao, Ilias Apalodimas, Kees Cook,
	Kumar Kartikeya Dwivedi, Antoine Tenart, Wei Wang, Arnd Bergmann

On 2022-01-26 14:52, Cong Wang wrote:
> You really should just use eBPF, with eBPF code you don't even need
> to send anything to upstream, you can do whatever you want without
> arguing with anyone. It is a win-win.

Cong,

This doesn't work in some environments. Example:

1) Some data centres (telcos, large and medium-sized enterprises that
I have personally encountered) don't allow anything that requires
compilation to be introduced (including eBPF).
They depend on upstream - if something is already in the kernel and
only requires a script, it becomes an operational issue, which is a
simpler process.
This is unlike large organizations who have a staff of developers
dedicated to coding stuff. Most of the folks I am talking about
have zero developers in house. But even if they did have a few,
introducing code into the kernel that has to be vetted by a
multitude of internal organizations tends to be a very
long process.

2) In some cases adding new code voids the distro vendor's
support warranty, and you have to pay the distro vendor to
vet and put your changes through their regression testing.
Most of these organizations are tied to one or another distro
vendor, and they don't want to mess with the warranty or pay
extra fees, which causes more work for them (a lot of them
have their own vetting process after the distro vendor's vetting).

I am not sure what the OP's situation is - but what I described
above is _real_. If there is some extension to existing features like
skbedit and there is a good use case, IMO we should allow for it.

cheers,
jamal


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-01-31 13:12     ` Jamal Hadi Salim
@ 2022-02-05  7:25       ` Tonghao Zhang
  2022-02-09 14:17         ` Tonghao Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Tonghao Zhang @ 2022-02-05  7:25 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jakub Kicinski, Cong Wang
  Cc: Linux Kernel Network Developers, Jiri Pirko, David S. Miller,
	Jonathan Lemon, Eric Dumazet, Alexander Lobakin, Paolo Abeni,
	Talal Ahmad, Kevin Hao, Ilias Apalodimas, Kees Cook,
	Kumar Kartikeya Dwivedi, Antoine Tenart, Wei Wang, Arnd Bergmann

On Mon, Jan 31, 2022 at 9:12 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2022-01-26 14:52, Cong Wang wrote:
> > You really should just use eBPF, with eBPF code you don't even need
> > to send anything to upstream, you can do whatever you want without
> > arguing with anyone. It is a win-win.
>
> Cong,
>
> This doesnt work in some environments. Example:
>
> 1) Some data centres (telco large and medium sized enteprises that
> i have personally encountered) dont allow for anything that requires
> compilation to be introduced (including ebpf).
> They depend on upstream - if something is already in the kernel and
> requires a script it becomes an operational issue which is a simpler
> process.
> This is unlike large organizations who have staff of developers
> dedicated to coding stuff. Most of the folks i am talking about
> have zero developers in house. But even if they did have a few,
> introducing code into the kernel that has to be vetted by a
> multitude of internal organizations tends to be a very
> long process.
Yes, I really agree with that.
> 2) In some cases adding new code voids the distro vendor's
> support warranty and you have to pay the distro vendor to
> vet and put your changes via their regression testing.
> Most of these organizations are tied to one or other distro
> vendor and they dont want to mess with the warranty or pay
> extra fees which causes more work for them (a lot of them
> have their own vetting process after the distro vendors vetting).
>
> I am not sure what the OP's situation is - but what i described
> above is _real_. If there is some extension to existing features like
> skbedit and there is a good use case IMO we should allow for it.
>
> cheers,
> jamal



-- 
Best regards, Tonghao


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-05  7:25       ` Tonghao Zhang
@ 2022-02-09 14:17         ` Tonghao Zhang
  2022-02-09 15:26           ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Tonghao Zhang @ 2022-02-09 14:17 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jakub Kicinski, Cong Wang
  Cc: Linux Kernel Network Developers, Jiri Pirko, David S. Miller,
	Jonathan Lemon, Eric Dumazet, Alexander Lobakin, Paolo Abeni,
	Talal Ahmad, Kevin Hao, Ilias Apalodimas, Kees Cook,
	Kumar Kartikeya Dwivedi, Antoine Tenart, Wei Wang, Arnd Bergmann

On Sat, Feb 5, 2022 at 3:25 PM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>
> On Mon, Jan 31, 2022 at 9:12 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On 2022-01-26 14:52, Cong Wang wrote:
> > > You really should just use eBPF, with eBPF code you don't even need
> > > to send anything to upstream, you can do whatever you want without
> > > arguing with anyone. It is a win-win.
> >
> > Cong,
> >
> > This doesnt work in some environments. Example:
> >
> > 1) Some data centres (telco large and medium sized enteprises that
> > i have personally encountered) dont allow for anything that requires
> > compilation to be introduced (including ebpf).
> > They depend on upstream - if something is already in the kernel and
> > requires a script it becomes an operational issue which is a simpler
> > process.
> > This is unlike large organizations who have staff of developers
> > dedicated to coding stuff. Most of the folks i am talking about
> > have zero developers in house. But even if they did have a few,
> > introducing code into the kernel that has to be vetted by a
> > multitude of internal organizations tends to be a very
> > long process.
> Yes, really agree with that.
Hi Jakub, do you have an opinion?

> > 2) In some cases adding new code voids the distro vendor's
> > support warranty and you have to pay the distro vendor to
> > vet and put your changes via their regression testing.
> > Most of these organizations are tied to one or other distro
> > vendor and they dont want to mess with the warranty or pay
> > extra fees which causes more work for them (a lot of them
> > have their own vetting process after the distro vendors vetting).
> >
> > I am not sure what the OP's situation is - but what i described
> > above is _real_. If there is some extension to existing features like
> > skbedit and there is a good use case IMO we should allow for it.
> >
> > cheers,
> > jamal
>
>
>
> --
> Best regards, Tonghao



--
Best regards, Tonghao


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-09 14:17         ` Tonghao Zhang
@ 2022-02-09 15:26           ` Jakub Kicinski
  2022-02-14 11:16             ` Tonghao Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2022-02-09 15:26 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Jamal Hadi Salim, Cong Wang, Linux Kernel Network Developers,
	Jiri Pirko, David S. Miller, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Wed, 9 Feb 2022 22:17:24 +0800 Tonghao Zhang wrote:
> > > This doesnt work in some environments. Example:
> > >
> > > 1) Some data centres (telco large and medium sized enteprises that
> > > i have personally encountered) dont allow for anything that requires
> > > compilation to be introduced (including ebpf).
> > > They depend on upstream - if something is already in the kernel and
> > > requires a script it becomes an operational issue which is a simpler
> > > process.
> > > This is unlike large organizations who have staff of developers
> > > dedicated to coding stuff. Most of the folks i am talking about
> > > have zero developers in house. But even if they did have a few,
> > > introducing code into the kernel that has to be vetted by a
> > > multitude of internal organizations tends to be a very
> > > long process.  
> > Yes, really agree with that.  
> Hi Jakub, do you have an opinion?

I think the patches are perfectly acceptable, nothing changed.


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-09 15:26           ` Jakub Kicinski
@ 2022-02-14 11:16             ` Tonghao Zhang
  2022-02-14 15:35               ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Tonghao Zhang @ 2022-02-14 11:16 UTC (permalink / raw)
  To: Jakub Kicinski, Jamal Hadi Salim, Cong Wang
  Cc: Linux Kernel Network Developers, Jiri Pirko, David S. Miller,
	Jonathan Lemon, Eric Dumazet, Alexander Lobakin, Paolo Abeni,
	Talal Ahmad, Kevin Hao, Ilias Apalodimas, Kees Cook,
	Kumar Kartikeya Dwivedi, Antoine Tenart, Wei Wang, Arnd Bergmann

On Wed, Feb 9, 2022 at 11:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 9 Feb 2022 22:17:24 +0800 Tonghao Zhang wrote:
> > > > This doesnt work in some environments. Example:
> > > >
> > > > 1) Some data centres (telco large and medium sized enteprises that
> > > > i have personally encountered) dont allow for anything that requires
> > > > compilation to be introduced (including ebpf).
> > > > They depend on upstream - if something is already in the kernel and
> > > > requires a script it becomes an operational issue which is a simpler
> > > > process.
> > > > This is unlike large organizations who have staff of developers
> > > > dedicated to coding stuff. Most of the folks i am talking about
> > > > have zero developers in house. But even if they did have a few,
> > > > introducing code into the kernel that has to be vetted by a
> > > > multitude of internal organizations tends to be a very
> > > > long process.
> > > Yes, really agree with that.
> > Hi Jakub, do you have an opinion?
>
> I think the patches are perfectly acceptable, nothing changed.
Hi Jakub
Will this patchset be applied? We have waited a long time for Cong to
answer Jamal's comment.


-- 
Best regards, Tonghao


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-14 11:16             ` Tonghao Zhang
@ 2022-02-14 15:35               ` Jakub Kicinski
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2022-02-14 15:35 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Jamal Hadi Salim, Cong Wang, Linux Kernel Network Developers,
	Jiri Pirko, David S. Miller, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Mon, 14 Feb 2022 19:16:11 +0800 Tonghao Zhang wrote:
> > > Hi Jakub, do you have an opinion?  
> >
> > I think the patches are perfectly acceptable, nothing changed.  
> Hi Jakub
> Will we apply this patchset ? We waited for  Cong to answer Jamal's
> comment for a long time.

You'd need to repost, the code is 3 weeks old by now.


* Re: [net-next v8 1/2] net: sched: use queue_mapping to pick tx queue
  2022-01-26 14:32 ` [net-next v8 1/2] net: sched: use queue_mapping to pick tx queue xiangxia.m.yue
@ 2022-02-15  0:14   ` Jamal Hadi Salim
  2022-02-15  2:00     ` Tonghao Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2022-02-15  0:14 UTC (permalink / raw)
  To: xiangxia.m.yue, netdev
  Cc: Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Eric Dumazet, Alexander Lobakin, Paolo Abeni,
	Talal Ahmad, Kevin Hao, Ilias Apalodimas, Kees Cook,
	Kumar Kartikeya Dwivedi, Antoine Tenart, Wei Wang, Arnd Bergmann


On 2022-01-26 09:32, xiangxia.m.yue@gmail.com wrote:
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> 
> This patch fixes issue:
> * If we install tc filters with act_skbedit in clsact hook.
>    It doesn't work, because netdev_core_pick_tx() overwrites
>    queue_mapping.
> 
>    $ tc filter ... action skbedit queue_mapping 1

This seems reasonable, but your explanation below is confusing. You
mention things like XPS, which applies in the opposite direction.

Can you please help clarify?
If I understand correctly, you are interested in separating bulk vs
latency-sensitive traffic into different queues on outgoing packets.
The traditional way of doing this is setting skb->priority; a lot of
traditional hardware mechanisms and even qdiscs in s/w are already
mapped to handle this (and cgroup priority is also nicely mapped to
handle that).


The diagram shows traffic coming out of the pods towards
the wire. Out of curiosity:
What is the underlying driver that expects the queue mapping to
be maintained for queue selection?


cheers,
jamal

> 
> And this patch is useful:
> * We can use FQ + EDT to implement efficient policies. Tx queues
>    are picked by xps, ndo_select_queue of netdev driver, or skb hash
>    in netdev_core_pick_tx(). In fact, the netdev driver, and skb
>    hash are _not_ under control. xps uses the CPUs map to select Tx
>    queues, but we can't figure out which task_struct of pod/containter
>    running on this cpu in most case. We can use clsact filters to classify
>    one pod/container traffic to one Tx queue. Why ?
> 
>    In containter networking environment, there are two kinds of pod/
>    containter/net-namespace. One kind (e.g. P1, P2), the high throughput
>    is key in these applications. But avoid running out of network resource,
>    the outbound traffic of these pods is limited, using or sharing one
>    dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods
>    (e.g. Pn), the low latency of data access is key. And the traffic is not
>    limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc.
>    This choice provides two benefits. First, contention on the HTB/FQ Qdisc
>    lock is significantly reduced since fewer CPUs contend for the same queue.
>    More importantly, Qdisc contention can be eliminated completely if each
>    CPU has its own FIFO Qdisc for the second kind of pods.
> 
>    There must be a mechanism in place to support classifying traffic based on
>    pods/container to different Tx queues. Note that clsact is outside of Qdisc
>    while Qdisc can run a classifier to select a sub-queue under the lock.
> 
>    In general recording the decision in the skb seems a little heavy handed.
>    This patch introduces a per-CPU variable, suggested by Eric.
> 
>    The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit().
>    - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag
>      is set in qdisc->enqueue() though tx queue has been selected in
>      netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared
>      firstly in __dev_queue_xmit(), is useful:
>    - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev
>      in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy:
>      For example, eth0, macvlan in pod, which root Qdisc install skbedit
>      queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of
>      eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping
>      because there is no filters in clsact or tx Qdisc of this netdev.
>      Same action taked in eth0, ixgbe in Host.
>    - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue
>      in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it
>      in __dev_queue_xmit when processing next packets.
> 
>    For performance reasons, use the static key. If user does not config the NET_EGRESS,
>    the patch will not be compiled.
> 
>    +----+      +----+      +----+
>    | P1 |      | P2 |      | Pn |
>    +----+      +----+      +----+
>      |           |           |
>      +-----------+-----------+
>                  |
>                  | clsact/skbedit
>                  |      MQ
>                  v
>      +-----------+-----------+
>      | q0        | q1        | qn
>      v           v           v
>    HTB/FQ      HTB/FQ  ...  FIFO
> 
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Jiri Pirko <jiri@resnulli.us>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Alexander Lobakin <alobakin@pm.me>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Talal Ahmad <talalahmad@google.com>
> Cc: Kevin Hao <haokexin@gmail.com>
> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Antoine Tenart <atenart@kernel.org>
> Cc: Wei Wang <weiwan@google.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> ---
>   include/linux/netdevice.h |  3 +++
>   include/linux/rtnetlink.h |  1 +
>   net/core/dev.c            | 31 +++++++++++++++++++++++++++++--
>   net/sched/act_skbedit.c   |  6 +++++-
>   4 files changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index e490b84732d1..60e14b2b091d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3015,6 +3015,9 @@ struct softnet_data {
>   	struct {
>   		u16 recursion;
>   		u8  more;
> +#ifdef CONFIG_NET_EGRESS
> +		u8  skip_txqueue;
> +#endif
>   	} xmit;
>   #ifdef CONFIG_RPS
>   	/* input_queue_head should be written by cpu owning this struct,
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index bb9cb84114c1..e87c2dccc4d5 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -100,6 +100,7 @@ void net_dec_ingress_queue(void);
>   #ifdef CONFIG_NET_EGRESS
>   void net_inc_egress_queue(void);
>   void net_dec_egress_queue(void);
> +void netdev_xmit_skip_txqueue(bool skip);
>   #endif
>   
>   void rtnetlink_init(void);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1baab07820f6..842473fa8e9f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3860,6 +3860,25 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>   
>   	return skb;
>   }
> +
> +static struct netdev_queue *
> +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> +{
> +	int qm = skb_get_queue_mapping(skb);
> +
> +	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> +}
> +
> +static bool netdev_xmit_txqueue_skipped(void)
> +{
> +	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +}
> +
> +void netdev_xmit_skip_txqueue(bool skip)
> +{
> +	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +}
> +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
>   #endif /* CONFIG_NET_EGRESS */
>   
>   #ifdef CONFIG_XPS
> @@ -4030,7 +4049,7 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
>   static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   {
>   	struct net_device *dev = skb->dev;
> -	struct netdev_queue *txq;
> +	struct netdev_queue *txq = NULL;
>   	struct Qdisc *q;
>   	int rc = -ENOMEM;
>   	bool again = false;
> @@ -4058,11 +4077,17 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   			if (!skb)
>   				goto out;
>   		}
> +
> +		netdev_xmit_skip_txqueue(false);
> +
>   		nf_skip_egress(skb, true);
>   		skb = sch_handle_egress(skb, &rc, dev);
>   		if (!skb)
>   			goto out;
>   		nf_skip_egress(skb, false);
> +
> +		if (netdev_xmit_txqueue_skipped())
> +			txq = netdev_tx_queue_mapping(dev, skb);
>   	}
>   #endif
>   	/* If device/qdisc don't need skb->dst, release it right now while
> @@ -4073,7 +4098,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   	else
>   		skb_dst_force(skb);
>   
> -	txq = netdev_core_pick_tx(dev, skb, sb_dev);
> +	if (likely(!txq))
> +		txq = netdev_core_pick_tx(dev, skb, sb_dev);
> +
>   	q = rcu_dereference_bh(txq->qdisc);
>   
>   	trace_net_dev_queue(skb);
> diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
> index ceba11b198bb..d5799b4fc499 100644
> --- a/net/sched/act_skbedit.c
> +++ b/net/sched/act_skbedit.c
> @@ -58,8 +58,12 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
>   		}
>   	}
>   	if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
> -	    skb->dev->real_num_tx_queues > params->queue_mapping)
> +	    skb->dev->real_num_tx_queues > params->queue_mapping) {
> +#ifdef CONFIG_NET_EGRESS
> +		netdev_xmit_skip_txqueue(true);
> +#endif
>   		skb_set_queue_mapping(skb, params->queue_mapping);
> +	}
>   	if (params->flags & SKBEDIT_F_MARK) {
>   		skb->mark &= ~params->mask;
>   		skb->mark |= params->mark & params->mask;



* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-01-26 14:32 ` [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting " xiangxia.m.yue
  2022-01-26 19:52   ` Cong Wang
@ 2022-02-15  0:22   ` Jamal Hadi Salim
  2022-02-15  1:40     ` Tonghao Zhang
  1 sibling, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2022-02-15  0:22 UTC (permalink / raw)
  To: xiangxia.m.yue, netdev
  Cc: Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Eric Dumazet, Alexander Lobakin, Paolo Abeni,
	Talal Ahmad, Kevin Hao, Ilias Apalodimas, Kees Cook,
	Kumar Kartikeya Dwivedi, Antoine Tenart, Wei Wang, Arnd Bergmann

On 2022-01-26 09:32, xiangxia.m.yue@gmail.com wrote:
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> 
> This patch allows user to select queue_mapping, range
> from A to B. And user can use skbhash, cgroup classid
> and cpuid to select Tx queues. Then we can load balance
> packets from A to B queue. The range is an unsigned 16bit
> value in decimal format.
> 
> $ tc filter ... action skbedit queue_mapping skbhash A B
> 
> "skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
> is enhanced with flags:
> * SKBEDIT_F_TXQ_SKBHASH
> * SKBEDIT_F_TXQ_CLASSID
> * SKBEDIT_F_TXQ_CPUID
> 
> Use skb->hash, cgroup classid, or cpuid to distribute packets.
> Then same range of tx queues can be shared for different flows,
> cgroups, or CPUs in a variety of scenarios.
> 
> For example, F1 may share range R1 with F2. The best way to do
> that is to set flag to SKBEDIT_F_TXQ_HASH, using skb->hash to
> share the queues. If cgroup C1 want to share the R1 with cgroup
> C2 .. Cn, use the SKBEDIT_F_TXQ_CLASSID. Of course, in some other
> scenario, C1 use R1, while Cn can use the Rn.
> 

So while I don't agree that eBPF is the solution, for reasons I
mentioned earlier - after looking at the details I think I am confused
by this change, and maybe I didn't fully understand the use case.

What is the driver that would work with this?
You said earlier that packets are coming out of some pods and heading
to the wire, and that you are looking to balance and isolate bulk vs
latency-sensitive traffic - how are any of these metadata useful for
that? skb->priority seems more natural for that.


cheers,
jamal




* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-15  0:22   ` Jamal Hadi Salim
@ 2022-02-15  1:40     ` Tonghao Zhang
  2022-02-16  0:17       ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Tonghao Zhang @ 2022-02-15  1:40 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Tue, Feb 15, 2022 at 8:22 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2022-01-26 09:32, xiangxia.m.yue@gmail.com wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> >
> > This patch allows user to select queue_mapping, range
> > from A to B. And user can use skbhash, cgroup classid
> > and cpuid to select Tx queues. Then we can load balance
> > packets from A to B queue. The range is an unsigned 16bit
> > value in decimal format.
> >
> > $ tc filter ... action skbedit queue_mapping skbhash A B
> >
> > "skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
> > is enhanced with flags:
> > * SKBEDIT_F_TXQ_SKBHASH
> > * SKBEDIT_F_TXQ_CLASSID
> > * SKBEDIT_F_TXQ_CPUID
> >
> > Use skb->hash, cgroup classid, or cpuid to distribute packets.
> > Then same range of tx queues can be shared for different flows,
> > cgroups, or CPUs in a variety of scenarios.
> >
> > For example, F1 may share range R1 with F2. The best way to do
> > that is to set flag to SKBEDIT_F_TXQ_HASH, using skb->hash to
> > share the queues. If cgroup C1 want to share the R1 with cgroup
> > C2 .. Cn, use the SKBEDIT_F_TXQ_CLASSID. Of course, in some other
> > scenario, C1 use R1, while Cn can use the Rn.
> >
>
> So while i dont agree that ebpf is the solution for reasons i mentioned
> earlier - after looking at the details think iam confused by this change
> and maybe i didnt fully understand the use case.
>
> What is the driver that would work  with this?
> You said earlier packets are coming out of some pods and then heading to
> the wire and you are looking to balance and isolate between bulk and
> latency  sensitive traffic - how are any of these metadatum useful for
> that? skb->priority seems more natural for that.
Hi
Let me try to explain. There are two tx-queue ranges, e.g. A (Q0-Qn)
and B (Qn+1-Qm). A is used for latency-sensitive traffic, B for bulk
traffic. A may be shared by pods/containers for which low latency is
key; B may be shared by pods/containers for which high throughput is
key. So we can balance the latency-sensitive traffic within range A.
We can use skb->hash, the CPU ID, or the classid to distribute the
packets within range A or B; the different balance policies fit
different use cases.
With skb->hash, the packets from pods/containers share the range. Note
that one pod/container may use multiple TCP/UDP flows, and those flows
share the tx queue range.
With the CPU ID, even when a pod/container uses multiple flows, a pod
pinned to one CPU will use one tx queue in range A or B.
With the classid, a pod may contain multiple containers.

skb->priority may already be used by applications; we can't require
application developers to change it.
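
A sketch of the two ranges (illustrative; the queue numbers are
assumptions, using the syntax from the patch):

  # range A (q0-q3): latency-sensitive pods, balanced by skb hash
  $ tc filter ... action skbedit queue_mapping skbhash 0 3
  # range B (q4-q7): throughput-bound pods
  $ tc filter ... action skbedit queue_mapping skbhash 4 7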


>
> cheers,
> jamal
>
>


-- 
Best regards, Tonghao


* Re: [net-next v8 1/2] net: sched: use queue_mapping to pick tx queue
  2022-02-15  0:14   ` Jamal Hadi Salim
@ 2022-02-15  2:00     ` Tonghao Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Tonghao Zhang @ 2022-02-15  2:00 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Tue, Feb 15, 2022 at 8:14 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
>
> On 2022-01-26 09:32, xiangxia.m.yue@gmail.com wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> >
> > This patch fixes issue:
> > * If we install tc filters with act_skbedit in clsact hook.
> >    It doesn't work, because netdev_core_pick_tx() overwrites
> >    queue_mapping.
> >
> >    $ tc filter ... action skbedit queue_mapping 1
>
> This seems reasonable but your explanation below is confusing. You
> mention things like xps which apply in the opposite direction.
As I understand it, XPS maps CPUs to queues. The skbedit
queue_mapping is used on the tx path too.

> Can you please help clarify?
> If i understand correctly you are interested in separating the
> bulk vs latency sensitive traffic into different queues on
> outgoing packets, the traditional way of doing this is setting
> skb->priority; a lot of traditional hardware mechanisms and even
> qdiscs in s/w are already mapped to handle this (and cgroup
> priority is also nicely mapped to handle that).
Why don't we use skb->priority?
1. Application developers may already use skb->priority. We can't
require them to change that, and if the application sets the priority,
the cgroup priority doesn't take effect.
2. It's easy to use tc's powerful filters to classify packets and then
apply the action.
3. Patch 2/2 uses the skbedit action to balance across a tx queue range.
>
> The diagram shows traffic coming out of the pods towards
> the wire. Out of curiosity:
> What is the underlying driver that is expecting queue map to
> be maintained for queue selection?
In our production environment, we use ixgbe, i40e and mlx NICs, which
support multiple tx queues.
>
> cheers,
> jamal
>
> >
> > And this patch is useful:
> > * We can use FQ + EDT to implement efficient policies. Tx queues
> >    are picked by xps, the ndo_select_queue of the netdev driver, or
> >    the skb hash in netdev_core_pick_tx(). In fact, the netdev driver
> >    and the skb hash are _not_ under our control. xps uses the CPU map
> >    to select Tx queues, but in most cases we can't figure out which
> >    pod/container task_struct is running on a given cpu. We can use
> >    clsact filters to classify one pod/container's traffic to one Tx
> >    queue. Why?
> >
> >    In a container networking environment, there are two kinds of pod/
> >    container/net-namespace. For one kind (e.g. P1, P2), high throughput
> >    is the key requirement. To avoid running out of network resources,
> >    the outbound traffic of these pods is limited, using or sharing
> >    dedicated Tx queues with an HTB/TBF/FQ Qdisc assigned. For the other
> >    kind of pods (e.g. Pn), low latency of data access is the key
> >    requirement and the traffic is not limited; these pods use or share
> >    other dedicated Tx queues with a FIFO Qdisc assigned.
> >    This choice provides two benefits. First, contention on the HTB/FQ Qdisc
> >    lock is significantly reduced since fewer CPUs contend for the same queue.
> >    More importantly, Qdisc contention can be eliminated completely if each
> >    CPU has its own FIFO Qdisc for the second kind of pods.
> >
> >    There must be a mechanism in place to support classifying traffic
> >    based on pods/containers into different Tx queues. Note that clsact
> >    is outside of the Qdisc, while a Qdisc can run a classifier to
> >    select a sub-queue under the lock.
> >
> >    In general recording the decision in the skb seems a little heavy handed.
> >    This patch introduces a per-CPU variable, suggested by Eric.
> >
> >    The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
> >    - A Tx Qdisc may install skbedit actions, so the xmit.skip_txqueue
> >      flag may be set in qdisc->enqueue() even though the tx queue has
> >      already been selected in netdev_tx_queue_mapping() or
> >      netdev_core_pick_tx(). Clearing that flag first in
> >      __dev_queue_xmit() is useful to:
> >    - Avoid picking the Tx queue with netdev_tx_queue_mapping() in the
> >      next netdev in a stacked setup such as: eth0 macvlan - eth0.3 vlan
> >      - eth0 ixgbe-phy. For example, eth0 is a macvlan in a pod whose
> >      root Qdisc installs skbedit queue_mapping and sends packets to
> >      eth0.3, a vlan in the host. In __dev_queue_xmit() of eth0.3, the
> >      flag is cleared and the tx queue is not selected according to
> >      skb->queue_mapping, because there are no filters in the clsact or
> >      tx Qdisc of this netdev. The same action is taken in eth0, the
> >      ixgbe device in the host.
> >    - Avoid picking the Tx queue for the next packet. If we set
> >      xmit.skip_txqueue in the tx Qdisc (qdisc->enqueue()), the proper
> >      way to clear it is in __dev_queue_xmit() when processing the next
> >      packet.
> >
> >    For performance reasons, a static key is used. If the user does not
> >    enable CONFIG_NET_EGRESS, this code is not compiled in.
> >
> >    +----+      +----+      +----+
> >    | P1 |      | P2 |      | Pn |
> >    +----+      +----+      +----+
> >      |           |           |
> >      +-----------+-----------+
> >                  |
> >                  | clsact/skbedit
> >                  |      MQ
> >                  v
> >      +-----------+-----------+
> >      | q0        | q1        | qn
> >      v           v           v
> >    HTB/FQ      HTB/FQ  ...  FIFO
> >
> > Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> > Cc: Cong Wang <xiyou.wangcong@gmail.com>
> > Cc: Jiri Pirko <jiri@resnulli.us>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Alexander Lobakin <alobakin@pm.me>
> > Cc: Paolo Abeni <pabeni@redhat.com>
> > Cc: Talal Ahmad <talalahmad@google.com>
> > Cc: Kevin Hao <haokexin@gmail.com>
> > Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > Cc: Antoine Tenart <atenart@kernel.org>
> > Cc: Wei Wang <weiwan@google.com>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Suggested-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> > ---
> >   include/linux/netdevice.h |  3 +++
> >   include/linux/rtnetlink.h |  1 +
> >   net/core/dev.c            | 31 +++++++++++++++++++++++++++++--
> >   net/sched/act_skbedit.c   |  6 +++++-
> >   4 files changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index e490b84732d1..60e14b2b091d 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3015,6 +3015,9 @@ struct softnet_data {
> >       struct {
> >               u16 recursion;
> >               u8  more;
> > +#ifdef CONFIG_NET_EGRESS
> > +             u8  skip_txqueue;
> > +#endif
> >       } xmit;
> >   #ifdef CONFIG_RPS
> >       /* input_queue_head should be written by cpu owning this struct,
> > diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> > index bb9cb84114c1..e87c2dccc4d5 100644
> > --- a/include/linux/rtnetlink.h
> > +++ b/include/linux/rtnetlink.h
> > @@ -100,6 +100,7 @@ void net_dec_ingress_queue(void);
> >   #ifdef CONFIG_NET_EGRESS
> >   void net_inc_egress_queue(void);
> >   void net_dec_egress_queue(void);
> > +void netdev_xmit_skip_txqueue(bool skip);
> >   #endif
> >
> >   void rtnetlink_init(void);
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 1baab07820f6..842473fa8e9f 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3860,6 +3860,25 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> >
> >       return skb;
> >   }
> > +
> > +static struct netdev_queue *
> > +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> > +{
> > +     int qm = skb_get_queue_mapping(skb);
> > +
> > +     return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> > +}
> > +
> > +static bool netdev_xmit_txqueue_skipped(void)
> > +{
> > +     return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> > +}
> > +
> > +void netdev_xmit_skip_txqueue(bool skip)
> > +{
> > +     __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> > +}
> > +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> >   #endif /* CONFIG_NET_EGRESS */
> >
> >   #ifdef CONFIG_XPS
> > @@ -4030,7 +4049,7 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
> >   static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   {
> >       struct net_device *dev = skb->dev;
> > -     struct netdev_queue *txq;
> > +     struct netdev_queue *txq = NULL;
> >       struct Qdisc *q;
> >       int rc = -ENOMEM;
> >       bool again = false;
> > @@ -4058,11 +4077,17 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >                       if (!skb)
> >                               goto out;
> >               }
> > +
> > +             netdev_xmit_skip_txqueue(false);
> > +
> >               nf_skip_egress(skb, true);
> >               skb = sch_handle_egress(skb, &rc, dev);
> >               if (!skb)
> >                       goto out;
> >               nf_skip_egress(skb, false);
> > +
> > +             if (netdev_xmit_txqueue_skipped())
> > +                     txq = netdev_tx_queue_mapping(dev, skb);
> >       }
> >   #endif
> >       /* If device/qdisc don't need skb->dst, release it right now while
> > @@ -4073,7 +4098,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >       else
> >               skb_dst_force(skb);
> >
> > -     txq = netdev_core_pick_tx(dev, skb, sb_dev);
> > +     if (likely(!txq))
> > +             txq = netdev_core_pick_tx(dev, skb, sb_dev);
> > +
> >       q = rcu_dereference_bh(txq->qdisc);
> >
> >       trace_net_dev_queue(skb);
> > diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
> > index ceba11b198bb..d5799b4fc499 100644
> > --- a/net/sched/act_skbedit.c
> > +++ b/net/sched/act_skbedit.c
> > @@ -58,8 +58,12 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
> >               }
> >       }
> >       if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
> > -         skb->dev->real_num_tx_queues > params->queue_mapping)
> > +         skb->dev->real_num_tx_queues > params->queue_mapping) {
> > +#ifdef CONFIG_NET_EGRESS
> > +             netdev_xmit_skip_txqueue(true);
> > +#endif
> >               skb_set_queue_mapping(skb, params->queue_mapping);
> > +     }
> >       if (params->flags & SKBEDIT_F_MARK) {
> >               skb->mark &= ~params->mask;
> >               skb->mark |= params->mark & params->mask;
>


-- 
Best regards, Tonghao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-15  1:40     ` Tonghao Zhang
@ 2022-02-16  0:17       ` Jamal Hadi Salim
  2022-02-16 13:36         ` Tonghao Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2022-02-16  0:17 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On 2022-02-14 20:40, Tonghao Zhang wrote:
> On Tue, Feb 15, 2022 at 8:22 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>> On 2022-01-26 09:32, xiangxia.m.yue@gmail.com wrote:
>>> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>>>

>
>> So while i dont agree that ebpf is the solution for reasons i mentioned
>> earlier - after looking at the details think iam confused by this change
>> and maybe i didnt fully understand the use case.
>>
>> What is the driver that would work  with this?
>> You said earlier packets are coming out of some pods and then heading to
>> the wire and you are looking to balance and isolate between bulk and
>> latency  sensitive traffic - how are any of these metadatum useful for
>> that? skb->priority seems more natural for that.

Quote from your other email:

 > In our production env, we use the ixgbe, i40e and mlx nic which
 > support multi tx queue.

Please bear with me.
The part i was wondering about is how these drivers would use queue
mapping to select their hardware queues.
Maybe you meant the software queue (in the qdiscs?) - But even then
how does queue mapping select which queue is to be used.

> Hi
> I try to explain. there are two tx-queue range, e.g. A(Q0-Qn), B(Qn+1-Qm).
> A is used for latency sensitive traffic. B is used for bulk sensitive
> traffic. A may be shared by Pods/Containers which key is
> high throughput. B may be shared by Pods/Containers which key is low
> latency. So we can do the balance in range A for latency sensitive
> traffic.

So far makes sense. I am not sure if you get better performance but
thats unrelated to this discussion. Just trying to understand your
setup  first in order to understand the use case. IIUC:
You have packets coming out of the pods and hitting the host stack
where you are applying these rules on egress qdisc of one of these
ixgbe, i40e and mlx nics, correct?
And that egress qdisc then ends up selecting a queue based on queue
mapping?

Can you paste a more complete example of a sample setup on some egress
port including what the classifier would be looking at?
Your diagram was unclear how the load balancing was going to be
achieved using the qdiscs (or was it the hardware?).

> So we can use the skb->hash or CPUID or classid to classify the
> packets in range A or B. The balance policies are used for different
> use case.
> For skb->hash, the packets from Pods/Containers will share the range.
> Should to know that one Pod/Container may use the multi TCP/UDP flows.
> That flows share the tx queue range.
> For CPUID, while Pod/Container use the multi flows, pod pinned on one
> CPU will use one tx-queue in range A or B.
> For CLASSID, the Pod may contain the multi containters.
> 
> skb->priority may be used by applications. we can't require
> application developer to change them.

It can also be set by skbedit.
Note also: Other than user specifying via setsockopt and skbedit,
DSCP/TOS/COS are all translated into skb->priority. Most of those
L3/L2 fields are intended to map to either bulk or latency sensitive
traffic.
More importantly:
 From s/w level - most if not _all_ classful qdiscs look at skb->priority
to decide where to enqueue.
 From h/w level - skb->priority is typically mapped to qos hardware level
(example 802.1q).
Infact skb->priority could be translated by qdisc layer into
classid if you set the 32 bit value to be the major:minor number for
a specific configured classid.
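
As an illustration (an untested sketch; fd is assumed to be an
already-open socket), an application - or skbedit - can set the 32-bit
priority to a major:minor classid, e.g. to land in class 1:2 of a
classful qdisc:

	#include <sys/socket.h>

	/* 0x00010002 == classid 1:2; htb_classify() looks this value up
	 * directly via htb_find(skb->priority, sch). Note that values
	 * above 6 require CAP_NET_ADMIN. */
	int prio = (1 << 16) | 2;

	setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));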

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-16  0:17       ` Jamal Hadi Salim
@ 2022-02-16 13:36         ` Tonghao Zhang
  2022-02-16 23:39           ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Tonghao Zhang @ 2022-02-16 13:36 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2022-02-14 20:40, Tonghao Zhang wrote:
> > On Tue, Feb 15, 2022 at 8:22 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >>
> >> On 2022-01-26 09:32, xiangxia.m.yue@gmail.com wrote:
> >>> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> >>>
>
> >
> >> So while i dont agree that ebpf is the solution for reasons i mentioned
> >> earlier - after looking at the details think iam confused by this change
> >> and maybe i didnt fully understand the use case.
> >>
> >> What is the driver that would work  with this?
> >> You said earlier packets are coming out of some pods and then heading to
> >> the wire and you are looking to balance and isolate between bulk and
> >> latency  sensitive traffic - how are any of these metadatum useful for
> >> that? skb->priority seems more natural for that.
>
> Quote from your other email:
>
>  > In our production env, we use the ixgbe, i40e and mlx nic which
>  > support multi tx queue.
>
> Please bear with me.
> The part i was wondering about is how these drivers would use queue
> mapping to select their hardware queues.
Hi
For mlx5e, mlx5e_xmit() uses skb_get_queue_mapping() to pick the tx queue.
For ixgbe, __ixgbe_xmit_frame() uses skb_get_queue_mapping() to pick
the tx queue.
For i40e, i40e_lan_xmit_frame() uses skb->queue_mapping.

We can set skb->queue_mapping in skbedit.
> Maybe you meant the software queue (in the qdiscs?) - But even then
Yes, and more importantly, we take care of the software tx queue, which
may use a fifo or htb/fq qdisc.
> how does queue mapping select which queue is to be used.
We select the tx queue in clsact, so netdev_core_pick_tx() is not
invoked to pick the tx queue; we can then use the qdisc of this tx
queue to apply the tc policy (fifo/fq/htb qdisc enqueue/dequeue ...).
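
As a rough sketch of what those drivers' xmit paths do (the foo_* names
here are hypothetical, heavily simplified from the real drivers):

	static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
					  struct net_device *dev)
	{
		struct foo_priv *priv = netdev_priv(dev);
		/* honour whatever skbedit wrote into skb->queue_mapping */
		u16 qid = skb_get_queue_mapping(skb);

		return foo_xmit_frame_ring(skb, &priv->tx_ring[qid]);
	}

So the queue_mapping set in clsact ends up selecting both the software
qdisc and the hw ring the packet is DMA'd on.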

> > Hi
> > I try to explain. there are two tx-queue range, e.g. A(Q0-Qn), B(Qn+1-Qm).
> > A is used for latency sensitive traffic. B is used for bulk sensitive
> > traffic. A may be shared by Pods/Containers which key is
> > high throughput. B may be shared by Pods/Containers which key is low
> > latency. So we can do the balance in range A for latency sensitive
> > traffic.
>
> So far makes sense. I am not sure if you get better performance but
> thats unrelated to this discussion. Just trying to understand your
> setup  first in order to understand the use case. IIUC:
> You have packets coming out of the pods and hitting the host stack
> where you are applying these rules on egress qdisc of one of these
> ixgbe, i40e and mlx nics, correct?
> And that egress qdisc then ends up selecting a queue based on queue
> mapping?
>
> Can you paste a more complete example of a sample setup on some egress
> port including what the classifier would be looking at?
Hi

  +----+      +----+      +----+     +----+
  | P1 |      | P2 |      | PN |     | PM |
  +----+      +----+      +----+     +----+
    |           |           |           |
    +-----------+-----------+-----------+
                       |
                       | clsact/skbedit
                       |      MQ
                       v
    +-----------+-----------+-----------+
    | q0        | q1        | qn        | qm
    v           v           v           v
  HTB/FQ      HTB/FQ  ...  FIFO        FIFO

NETDEV=eth0
tc qdisc add dev $NETDEV clsact
tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash n m

The packets from pod P1, whose ip is 192.168.122.100, will use tx queues n~m.
P1 is the pod with latency sensitive traffic, so P1 uses the fifo qdisc.

tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash 0 1

The packets from pod P2, whose ip is 192.168.122.200, will use tx queues 0~1.
P2 is the pod with bulk traffic, so P2 uses the htb qdisc to limit its
network rate, because we don't want P2 to use all the bandwidth and
affect P1.

> Your diagram was unclear how the load balancing was going to be
> achieved using the qdiscs (or was it the hardware?).
First, in the clsact hook we select one tx queue from qn to qm for P1,
and use the qdisc of this tx queue, for example FIFO.
In the underlying driver, because we set skb->queue_mapping in skbedit,
the hw tx queue from qn to qm will be selected too.
In any case, in the clsact hook we can use skbedit queue_mapping to
select both the software tx queue and the hw tx queue.

For balancing, we can use the skbhash/cpuid/cgroup classid to select a
tx queue from qn to qm for P1:
tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid n m
tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid n m

Why do we want to balance? Because we don't want to pin the packets
from a Pod to one tx queue (in k8s the pods are created or destroyed
frequently, and the number of Pods > the number of tx queues);
sharing the tx queues equally is more important.

> > So we can use the skb->hash or CPUID or classid to classify the
> > packets in range A or B. The balance policies are used for different
> > use case.
> > For skb->hash, the packets from Pods/Containers will share the range.
> > Should to know that one Pod/Container may use the multi TCP/UDP flows.
> > That flows share the tx queue range.
> > For CPUID, while Pod/Container use the multi flows, pod pinned on one
> > CPU will use one tx-queue in range A or B.
> > For CLASSID, the Pod may contain the multi containters.
> >
> > skb->priority may be used by applications. we can't require
> > application developer to change them.
>
> It can also be set by skbedit.
> Note also: Other than user specifying via setsockopt and skbedit,
> DSCP/TOS/COS are all translated into skb->priority. Most of those
> L3/L2 fields are intended to map to either bulk or latency sensitive
> traffic.
> More importantly:
>  From s/w level - most if not _all_ classful qdiscs look at skb->priority
> to decide where to enqueue.
>  From h/w level - skb->priority is typically mapped to qos hardware level
> (example 802.1q).
> Infact skb->priority could be translated by qdisc layer into
> classid if you set the 32 bit value to be the major:minor number for
> a specific configured classid.
>
> cheers,
> jamal



-- 
Best regards, Tonghao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-16 13:36         ` Tonghao Zhang
@ 2022-02-16 23:39           ` Jamal Hadi Salim
  2022-02-18 12:43             ` Tonghao Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2022-02-16 23:39 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On 2022-02-16 08:36, Tonghao Zhang wrote:
> On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:


[...]
The mapping to hardware made sense. Sorry I missed it earlier.

>> Can you paste a more complete example of a sample setup on some egress
>> port including what the classifier would be looking at?
> Hi
> 
>    +----+      +----+      +----+     +----+
>    | P1 |      | P2 |      | PN |     | PM |
>    +----+      +----+      +----+     +----+
>      |           |           |           |
>      +-----------+-----------+-----------+
>                         |
>                         | clsact/skbedit
>                         |      MQ
>                         v
>      +-----------+-----------+-----------+
>      | q0        | q1        | qn        | qm
>      v           v           v           v
>    HTB/FQ      HTB/FQ  ...  FIFO        FIFO
> 

Below is still missing your MQ setup (If i understood your diagram
correctly). Can you please post that?
Are your classids essentially mapping to q0..m?
tc -s class show after you run some traffic should help

> NETDEV=eth0
> tc qdisc add dev $NETDEV clsact
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash
> n m
> 

Have you observed a nice distribution here?
for s/w side tc -s class show after you run some traffic should help
for h/w side ethtool -s

IIUC, the hash of the ip header with src_ip 192.168.122.100
(and the dst ip, etc.) is being distributed across queues n..m
[because either 192.168.122.100 is talking to many destination
IPs and/or ports?]
Is this correct if packets are being forwarded as opposed to
being sourced from the host?
ex: who sets the skb->hash (skb->l4_hash, skb->sw_hash etc)

> The packets from pod(P1) which ip is 192.168.122.100, will use the txqueue n ~m.
> P1 is the pod of latency sensitive traffic. so P1 use the fifo qdisc.
> 
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash
> 0 1
> 
> The packets from pod(P2) which ip is 192.168.122.200, will use the txqueue 0 ~1.
> P2 is the pod of bulk sensitive traffic. so P2 use the htb qdisc to
> limit its network rate, because we don't hope P2 use all bandwidth to
> affect P1.
> 

Understood.

>> Your diagram was unclear how the load balancing was going to be
>> achieved using the qdiscs (or was it the hardware?).
> Firstly, in clsact hook, we select one tx queue from qn to qm for P1,
> and use the qdisc of this tx queue, for example FIFO.
> in underlay driver, because the we set the skb->queue_mapping in
> skbedit, so the hw tx queue from qn to qm will be select too.
> any way, in clsact hook, we can use the skbedit queue_mapping to
> select software tx queue and hw tx queue.
> 

ethtool -s and tc -s class if you have this running somewhere..

> For doing balance, we can use the skbhash/cpuid/cgroup classid to
> select tx queue from qn to qm for P1.
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid n
> m
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid
> n m
> 

The skbhash should work fine if you have good entropy (varying dst ip
and dst port mostly; the srcip/srcport/protocol dont offer much entropy
unless you have a lot of pods on your system).
i.e if it works correctly (forwarding vs host - see my question above)
then you should be able to pin a 5tuple flow to a tx queue.
If you have a large number of flows/pods then you could potentially
get a nice distribution.

I may be missing something on the cpuid one - seems high likelihood
of having the same flow on multiple queues (based on what
raw_smp_processor_id() returns, which i believe is not guaranteed to be
consistent). IOW, you could be sending packets out of order for the
same 5 tuple flow (because they end up in different queues).

As for classid variant - if these packets are already outside the
pod and into the host stack, is that field even valid?

> Why we want to do the balance, because we don't want pin the packets
> from Pod to one tx queue. (in k8s the pods are created or destroy
> frequently, and the number of Pods > tx queue number).
> sharing the tx queue equally is more important.
> 

As long as the same flow is pinned to the same queue (see my comment
on cpuid).
Over a very long period what you describe maybe true but it also
seems depends on many other variables.
I think it would help to actually show some data on how true above
statement is (example the creation/destruction rate of the pods).
Or collect data over a very long period.

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-16 23:39           ` Jamal Hadi Salim
@ 2022-02-18 12:43             ` Tonghao Zhang
  2022-02-20 18:30               ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Tonghao Zhang @ 2022-02-18 12:43 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Thu, Feb 17, 2022 at 7:39 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2022-02-16 08:36, Tonghao Zhang wrote:
> > On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
>
> [...]
> The mapping to hardware made sense. Sorry I missed it earlier.
>
> >> Can you paste a more complete example of a sample setup on some egress
> >> port including what the classifier would be looking at?
> > Hi
> >
> >    +----+      +----+      +----+     +----+
> >    | P1 |      | P2 |      | PN |     | PM |
> >    +----+      +----+      +----+     +----+
> >      |           |           |           |
> >      +-----------+-----------+-----------+
> >                         |
> >                         | clsact/skbedit
> >                         |      MQ
> >                         v
> >      +-----------+-----------+-----------+
> >      | q0        | q1        | qn        | qm
> >      v           v           v           v
> >    HTB/FQ      HTB/FQ  ...  FIFO        FIFO
> >
>
> Below is still missing your MQ setup (If i understood your diagram
> correctly). Can you please post that?
> Are you classids essentially mapping to q0..m?
> tc -s class show after you run some traffic should help
Hi Jamal

The setup commands are shown below:
NETDEV=eth0
ip li set dev $NETDEV up
tc qdisc del dev $NETDEV clsact 2>/dev/null
tc qdisc add dev $NETDEV clsact

ip link add ipv1 link $NETDEV type ipvlan mode l2
ip netns add n1
ip link set ipv1 netns n1

ip netns exec n1 ip link set ipv1 up
ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up

tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
src_ip 2.2.2.100 action skbedit queue_mapping hash-type skbhash 2 6

tc qdisc add dev $NETDEV handle 1: root mq

tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit

tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
tc qdisc add dev $NETDEV parent 1:3 pfifo
tc qdisc add dev $NETDEV parent 1:4 pfifo
tc qdisc add dev $NETDEV parent 1:5 pfifo
tc qdisc add dev $NETDEV parent 1:6 pfifo
tc qdisc add dev $NETDEV parent 1:7 pfifo


Use iperf3 to generate packets:
ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 10 -P 10

We use skbedit to select a tx queue from 2 - 6:
# ethtool -S eth0 | grep -i [tr]x_queue_[0-9]_bytes
     rx_queue_0_bytes: 442
     rx_queue_1_bytes: 60966
     rx_queue_2_bytes: 10440203
     rx_queue_3_bytes: 6083863
     rx_queue_4_bytes: 3809726
     rx_queue_5_bytes: 3581460
     rx_queue_6_bytes: 5772099
     rx_queue_7_bytes: 148
     rx_queue_8_bytes: 368
     rx_queue_9_bytes: 383
     tx_queue_0_bytes: 42
     tx_queue_1_bytes: 0
     tx_queue_2_bytes: 11442586444
     tx_queue_3_bytes: 7383615334
     tx_queue_4_bytes: 3981365579
     tx_queue_5_bytes: 3983235051
     tx_queue_6_bytes: 6706236461
     tx_queue_7_bytes: 42
     tx_queue_8_bytes: 0
     tx_queue_9_bytes: 0

Tx queues 2-6 map to classids 1:3 - 1:7:
# tc -s class show dev eth0
class mq 1:1 root leaf 2:
 Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:2 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:3 root leaf 8002:
 Sent 11949133672 bytes 7929798 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:4 root leaf 8003:
 Sent 7710449050 bytes 5117279 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:5 root leaf 8004:
 Sent 4157648675 bytes 2758990 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:6 root leaf 8005:
 Sent 4159632195 bytes 2759990 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:7 root leaf 8006:
 Sent 7003169603 bytes 4646912 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:8 root
 Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:9 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:a root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class tbf 8001:1 parent 8001:

class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 2000000 ctokens: 2000000

class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 1000000 ctokens: 1000000

> > NETDEV=eth0
> > tc qdisc add dev $NETDEV clsact
> > tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> > src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash
> > n m
> >
>
> Have you observed a nice distribution here?
Yes, as shown above
> for s/w side tc -s class show after you run some traffic should help
> for h/w side ethtool -s
>
> IIUC, the hash of the ip header with src_ip 192.168.122.100
> (and dst ip,
> is being distributed across queues n..m
> [because either 192.168.122.100 is talking to many destination
> IPs and/or ports?]
Yes, we use the iperf3 -P option to send multiple flows.
> Is this correct if packets are being forwarded as opposed to
> being sourced from the host?
Good question. For TCP, we turn the ixgbe ntuple feature off:
ethtool -K ixgbe-dev ntuple off
so in the underlying driver, the hw records each flow and its tx queue
on transmit; when traffic comes back to the pod, the hw steers it to
the rx queue corresponding to that tx queue.

The code path:
ixgbe_xmit_frame/ixgbe_xmit_frame_ring -> ixgbe_atr() ->
ixgbe_fdir_add_signature_filter_82599
ixgbe_fdir_add_signature_filter_82599 installs the rule for the
incoming packets.

> ex: who sets the skb->hash (skb->l4_hash, skb->sw_hash etc)
For TCP:
__tcp_transmit_skb() -> skb_set_hash_from_sk()

For UDP:
udp_sendmsg() -> ip_make_skb() -> __ip_append_data() ->
sock_alloc_send_pskb() -> skb_set_owner_w()
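
For reference, skb_set_hash_from_sk() is tiny - roughly this
(abbreviated from include/net/sock.h of that era):

	static inline void skb_set_hash_from_sk(struct sk_buff *skb,
						struct sock *sk)
	{
		if (sk->sk_txhash) {
			skb->l4_hash = 1;	/* hash over l4 tuple */
			skb->hash = sk->sk_txhash;
		}
	}

So by the time the packet reaches clsact, skb->hash is normally already
valid on these paths and skb_get_hash() just returns it.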

> > The packets from pod(P1) which ip is 192.168.122.100, will use the txqueue n ~m.
> > P1 is the pod of latency sensitive traffic. so P1 use the fifo qdisc.
> >
> > tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> > src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash
> > 0 1
> >
> > The packets from pod(P2) which ip is 192.168.122.200, will use the txqueue 0 ~1.
> > P2 is the pod of bulk sensitive traffic. so P2 use the htb qdisc to
> > limit its network rate, because we don't hope P2 use all bandwidth to
> > affect P1.
> >
>
> Understood.
>
> >> Your diagram was unclear how the load balancing was going to be
> >> achieved using the qdiscs (or was it the hardware?).
> > Firstly, in clsact hook, we select one tx queue from qn to qm for P1,
> > and use the qdisc of this tx queue, for example FIFO.
> > in underlay driver, because the we set the skb->queue_mapping in
> > skbedit, so the hw tx queue from qn to qm will be select too.
> > any way, in clsact hook, we can use the skbedit queue_mapping to
> > select software tx queue and hw tx queue.
> >
>
> ethtool -s and tc -s class if you have this running somewhere..
>
> > For doing balance, we can use the skbhash/cpuid/cgroup classid to
> > select tx queue from qn to qm for P1.
> > tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> > src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid n
> > m
> > tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> > src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid
> > n m
> >
>
> The skbhash should work fine if you have good entropy (varying dst ip
> and dst port mostly, the srcip/srcport/protocol dont offer much  entropy
> unless you have a lot of pods on your system).
> i.e if it works correctly (forwarding vs host - see my question above)
> then you should be able to pin a 5tuple flow to a tx queue.
> If you have a large number of flows/pods then you could potentially
> get a nice distribution.
>
> I may be missing something on the cpuid one - seems high likelihood
> of having the same flow on multiple queues (based on what
> raw_smp_processor_id() returns, which i believe is not guaranteed to be
> consistent). IOW, you could be sending packets out of order for the
> same 5 tuple flow (because they end up in different queues).
Yes, but consider one case: we pin one pod to one cpu, so all the
processes of the pod will use the same cpu, and then all packets from
this pod will use the same tx queue.
> As for classid variant - if these packets are already outside the
> pod and into the host stack, is that field even valid?
Yes; ipvlan, macvlan and other virtual netdevs don't clear this field,
so the net_cls cgroup classid is still valid there.
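
For reference, the classid lookup is roughly this (abbreviated from
include/net/cls_cgroup.h; note the fallback to the socket's cgroup data
in softirq context, which is why the value survives these stacked
devices):

	static inline u32 task_get_classid(const struct sk_buff *skb)
	{
		u32 classid = task_cls_state(current)->classid;

		/* in softirq, `current' is not the sending task, so
		 * fall back to the classid stored on the socket */
		if (in_serving_softirq()) {
			struct sock *sk = skb_to_full_sk(skb);

			if (!sk || !sk_fullsock(sk))
				return 0;

			classid = sock_cgroup_classid(&sk->sk_cgrp_data);
		}

		return classid;
	}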
> > Why we want to do the balance, because we don't want pin the packets
> > from Pod to one tx queue. (in k8s the pods are created or destroy
> > frequently, and the number of Pods > tx queue number).
> > sharing the tx queue equally is more important.
> >
>
> As long as the same flow is pinned to the same queue (see my comment
> on cpuid).
> Over a very long period what you describe maybe true but it also
> seems depends on many other variables.
NETDEV=eth0

ip li set dev $NETDEV up

tc qdisc del dev $NETDEV clsact 2>/dev/null
tc qdisc add dev $NETDEV clsact

ip link add ipv1 link $NETDEV type ipvlan mode l2
ip netns add n1
ip link set ipv1 netns n1

ip netns exec n1 ip link set ipv1 up
ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up

tc filter add dev $NETDEV egress protocol ip prio 1 \
flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type cpuid 2 6

tc qdisc add dev $NETDEV handle 1: root mq

tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit

tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
tc qdisc add dev $NETDEV parent 1:3 pfifo
tc qdisc add dev $NETDEV parent 1:4 pfifo
tc qdisc add dev $NETDEV parent 1:5 pfifo
tc qdisc add dev $NETDEV parent 1:6 pfifo
tc qdisc add dev $NETDEV parent 1:7 pfifo

Pin iperf3 to one cpu:
# mkdir -p /sys/fs/cgroup/cpuset/n0
# echo 4 > /sys/fs/cgroup/cpuset/n0/cpuset.cpus
# echo 0 > /sys/fs/cgroup/cpuset/n0/cpuset.mems
# ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 1000 -P 10 -u -b 10G
# echo $(pidof iperf3) > /sys/fs/cgroup/cpuset/n0/tasks

# ethtool -S eth0 | grep -i tx_queue_[0-9]_bytes
     tx_queue_0_bytes: 7180
     tx_queue_1_bytes: 418
     tx_queue_2_bytes: 3015
     tx_queue_3_bytes: 4824
     tx_queue_4_bytes: 3738
     tx_queue_5_bytes: 716102781 # before setting iperf3 to cpu 4
     tx_queue_6_bytes: 17989642640 # after setting iperf3 to cpu 4,
                                   # skbedit uses this tx queue and no
                                   # longer uses tx queue 5
     tx_queue_7_bytes: 4364
     tx_queue_8_bytes: 42
     tx_queue_9_bytes: 3030


# tc -s class show dev eth0
class mq 1:1 root leaf 2:
 Sent 9874 bytes 63 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:2 root leaf 8001:
 Sent 418 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:3 root leaf 8002:
 Sent 3015 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:4 root leaf 8003:
 Sent 4824 bytes 8 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:5 root leaf 8004:
 Sent 4074 bytes 19 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:6 root leaf 8005:
 Sent 716102781 bytes 480624 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:7 root leaf 8006:
 Sent 18157071781 bytes 12186100 pkt (dropped 0, overlimits 0 requeues 18)
 backlog 0b 0p requeues 18
class mq 1:8 root
 Sent 4364 bytes 26 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:9 root
 Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:a root
 Sent 3030 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class tbf 8001:1 parent 8001:

class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 2000000 ctokens: 2000000

class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 1000000 ctokens: 1000000

> I think it would help to actually show some data on how true above
> statement is (example the creation/destruction rate of the pods).
> Or collect data over a very long period.
>
> cheers,
> jamal



-- 
Best regards, Tonghao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-18 12:43             ` Tonghao Zhang
@ 2022-02-20 18:30               ` Jamal Hadi Salim
  2022-02-21  1:43                 ` Tonghao Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2022-02-20 18:30 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On 2022-02-18 07:43, Tonghao Zhang wrote:
> On Thu, Feb 17, 2022 at 7:39 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>

> Hi Jamal
> 
> The setup commands is shown as below:
> NETDEV=eth0
> ip li set dev $NETDEV up
> tc qdisc del dev $NETDEV clsact 2>/dev/null
> tc qdisc add dev $NETDEV clsact
> 
> ip link add ipv1 link $NETDEV type ipvlan mode l2
> ip netns add n1
> ip link set ipv1 netns n1
> 
> ip netns exec n1 ip link set ipv1 up
> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
> 
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 2.2.2.100 action skbedit queue_mapping hash-type skbhash 2 6
> 
> tc qdisc add dev $NETDEV handle 1: root mq
> 
> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
> 
> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> tc qdisc add dev $NETDEV parent 1:3 pfifo
> tc qdisc add dev $NETDEV parent 1:4 pfifo
> tc qdisc add dev $NETDEV parent 1:5 pfifo
> tc qdisc add dev $NETDEV parent 1:6 pfifo
> tc qdisc add dev $NETDEV parent 1:7 pfifo
> 
> 
> use the perf to generate packets:
> ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 10 -P 10
> 
> we use the skbedit to select tx queue from 2 - 6
> # ethtool -S eth0 | grep -i [tr]x_queue_[0-9]_bytes
>       rx_queue_0_bytes: 442
>       rx_queue_1_bytes: 60966
>       rx_queue_2_bytes: 10440203
>       rx_queue_3_bytes: 6083863
>       rx_queue_4_bytes: 3809726
>       rx_queue_5_bytes: 3581460
>       rx_queue_6_bytes: 5772099
>       rx_queue_7_bytes: 148
>       rx_queue_8_bytes: 368
>       rx_queue_9_bytes: 383
>       tx_queue_0_bytes: 42
>       tx_queue_1_bytes: 0
>       tx_queue_2_bytes: 11442586444
>       tx_queue_3_bytes: 7383615334
>       tx_queue_4_bytes: 3981365579
>       tx_queue_5_bytes: 3983235051
>       tx_queue_6_bytes: 6706236461
>       tx_queue_7_bytes: 42
>       tx_queue_8_bytes: 0
>       tx_queue_9_bytes: 0
> 
> tx queues 2-6 are mapping to classid 1:3 - 1:7
> # tc -s class show dev eth0
> class mq 1:1 root leaf 2:
>   Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:2 root leaf 8001:
>   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:3 root leaf 8002:
>   Sent 11949133672 bytes 7929798 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:4 root leaf 8003:
>   Sent 7710449050 bytes 5117279 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:5 root leaf 8004:
>   Sent 4157648675 bytes 2758990 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:6 root leaf 8005:
>   Sent 4159632195 bytes 2759990 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:7 root leaf 8006:
>   Sent 7003169603 bytes 4646912 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:8 root
>   Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:9 root
>   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:a root
>   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class tbf 8001:1 parent 8001:
> 
> class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
>   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
>   lended: 0 borrowed: 0 giants: 0
>   tokens: 2000000 ctokens: 2000000
> 
> class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
>   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
>   lended: 0 borrowed: 0 giants: 0
>   tokens: 1000000 ctokens: 1000000
> 

Yes, this is a good example (which should have been in the commit
message of 0/2 or 2/2 - would have avoided long discussion).
The byte count doesnt map correctly between the DMA side and the
qdisc side; you probably had some additional experiments running
prior to installing the mq qdisc.
So not a big deal - it is close enough.

To Cong's comments earlier - I dont think you can correctly have
picked the queue in user space for this specific policy (hash-type
skbhash). Reason is you are dependent on the skb hash computation
which is based on things like ephemeral src port for the netperf
client - which cannot be determined in user space.

> Good question, for TCP, we set the ixgbe ntuple off.
> ethtool -K ixgbe-dev ntuple off
> so in the underlying driver, hw will record this flow, and its tx
> queue, when it comes back to pod.
> hw will send to rx queue corresponding to tx queue.
> 
> the codes:
> ixgbe_xmit_frame/ixgbe_xmit_frame_ring -->ixgbe_atr() ->
> ixgbe_fdir_add_signature_filter_82599
> ixgbe_fdir_add_signature_filter_82599 will install the rule for
> incoming packets.
> 
>> ex: who sets the skb->hash (skb->l4_hash, skb->sw_hash etc)
> for tcp:
> __tcp_transmit_skb -> skb_set_hash_from_sk
> 
> for udp
> udp_sendmsg -> ip_make_skb -> __ip_append_data -> sock_alloc_send_pskb
> -> skb_set_owner_w

Thats a different use case than what you are presenting here.
i.e the k8s pod scenario is purely a forwarding use case.
But it doesnt matter tbh since your data shows reasonable results.

[i didnt dig into the code but it is likely (based on your experimental
data) that both skb->l4_hash and skb->sw_hash  will _not be set_
and so skb_get_hash() will compute the skb->hash from scratch.]

>> I may be missing something on the cpuid one - seems high likelihood
>> of having the same flow on multiple queues (based on what
>> raw_smp_processor_id() returns, which i believe is not guaranteed to be
>> consistent). IOW, you could be sending packets out of order for the
>> same 5 tuple flow (because they end up in different queues).
> Yes, but think about one case, we pin one pod to one cpu, so all the
> processes of the pod will
> use the same cpu. then all packets from this pod will use the same tx queue.

To Cong's point - if you already knew the pinned-to cpuid then you could
just as easily set that queue map from user space?

> >> As for classid variant - if these packets are already outside the
>> pod and into the host stack, is that field even valid?
> Yes, ipvlan, macvlan and other virt netdev don't clean this field.
>>> Why we want to do the balance, because we don't want pin the packets
>>> from Pod to one tx queue. (in k8s the pods are created or destroy
>>> frequently, and the number of Pods > tx queue number).
>>> sharing the tx queue equally is more important.
>>>
>>
>> As long as the same flow is pinned to the same queue (see my comment
>> on cpuid).
>> Over a very long period what you describe maybe true but it also
>> seems depends on many other variables.
> NETDEV=eth0
> 
> ip li set dev $NETDEV up
> 
> tc qdisc del dev $NETDEV clsact 2>/dev/null
> tc qdisc add dev $NETDEV clsact
> 
> ip link add ipv1 link $NETDEV type ipvlan mode l2
> ip netns add n1
> ip link set ipv1 netns n1
> 
> ip netns exec n1 ip link set ipv1 up
> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
> 
> tc filter add dev $NETDEV egress protocol ip prio 1 \
> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type cpuid 2 6
> 
> tc qdisc add dev $NETDEV handle 1: root mq
> 
> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
> 
> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> tc qdisc add dev $NETDEV parent 1:3 pfifo
> tc qdisc add dev $NETDEV parent 1:4 pfifo
> tc qdisc add dev $NETDEV parent 1:5 pfifo
> tc qdisc add dev $NETDEV parent 1:6 pfifo
> tc qdisc add dev $NETDEV parent 1:7 pfifo
> 
> set the iperf3 to one cpu
> # mkdir -p /sys/fs/cgroup/cpuset/n0
> # echo 4 > /sys/fs/cgroup/cpuset/n0/cpuset.cpus
> # echo 0 > /sys/fs/cgroup/cpuset/n0/cpuset.mems
> # ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 1000 -P 10 -u -b 10G
> # echo $(pidof iperf3) > /sys/fs/cgroup/cpuset/n0/tasks
> 
> # ethtool -S eth0 | grep -i tx_queue_[0-9]_bytes
>       tx_queue_0_bytes: 7180
>       tx_queue_1_bytes: 418
>       tx_queue_2_bytes: 3015
>       tx_queue_3_bytes: 4824
>       tx_queue_4_bytes: 3738
>       tx_queue_5_bytes: 716102781 # before setting iperf3 to cpu 4
>       tx_queue_6_bytes: 17989642640 # after setting iperf3 to cpu 4,
> skbedit use this tx queue, and don't use tx queue 5
>       tx_queue_7_bytes: 4364
>       tx_queue_8_bytes: 42
>       tx_queue_9_bytes: 3030
> 
> 
> # tc -s class show dev eth0
> class mq 1:1 root leaf 2:
>   Sent 9874 bytes 63 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:2 root leaf 8001:
>   Sent 418 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:3 root leaf 8002:
>   Sent 3015 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:4 root leaf 8003:
>   Sent 4824 bytes 8 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:5 root leaf 8004:
>   Sent 4074 bytes 19 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:6 root leaf 8005:
>   Sent 716102781 bytes 480624 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:7 root leaf 8006:
>   Sent 18157071781 bytes 12186100 pkt (dropped 0, overlimits 0 requeues 18)
>   backlog 0b 0p requeues 18
> class mq 1:8 root
>   Sent 4364 bytes 26 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:9 root
>   Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class mq 1:a root
>   Sent 3030 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
> class tbf 8001:1 parent 8001:
> 
> class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
>   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
>   lended: 0 borrowed: 0 giants: 0
>   tokens: 2000000 ctokens: 2000000
> 
> class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
>   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
>   lended: 0 borrowed: 0 giants: 0
>   tokens: 1000000 ctokens: 1000000
> 

Yes, if you pin a flow/process to a cpu - this is expected. See my
earlier comment. You could argue that you are automating things but
it is not as strong as the hash setup (and it will have to be documented
that it works only if you pin processes doing network i/o to cpus).

Could you also post an example on the cgroups classid?

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-20 18:30               ` Jamal Hadi Salim
@ 2022-02-21  1:43                 ` Tonghao Zhang
  2022-02-22 11:44                   ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Tonghao Zhang @ 2022-02-21  1:43 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On Mon, Feb 21, 2022 at 2:30 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2022-02-18 07:43, Tonghao Zhang wrote:
> > On Thu, Feb 17, 2022 at 7:39 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >>
>
> > Hi Jamal
> >
> > The setup commands is shown as below:
> > NETDEV=eth0
> > ip li set dev $NETDEV up
> > tc qdisc del dev $NETDEV clsact 2>/dev/null
> > tc qdisc add dev $NETDEV clsact
> >
> > ip link add ipv1 link $NETDEV type ipvlan mode l2
> > ip netns add n1
> > ip link set ipv1 netns n1
> >
> > ip netns exec n1 ip link set ipv1 up
> > ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
> >
> > tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> > src_ip 2.2.2.100 action skbedit queue_mapping hash-type skbhash 2 6
> >
> > tc qdisc add dev $NETDEV handle 1: root mq
> >
> > tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> > tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> > tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
> >
> > tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> > tc qdisc add dev $NETDEV parent 1:3 pfifo
> > tc qdisc add dev $NETDEV parent 1:4 pfifo
> > tc qdisc add dev $NETDEV parent 1:5 pfifo
> > tc qdisc add dev $NETDEV parent 1:6 pfifo
> > tc qdisc add dev $NETDEV parent 1:7 pfifo
> >
> >
> > use the perf to generate packets:
> > ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 10 -P 10
> >
> > we use the skbedit to select tx queue from 2 - 6
> > # ethtool -S eth0 | grep -i [tr]x_queue_[0-9]_bytes
> >       rx_queue_0_bytes: 442
> >       rx_queue_1_bytes: 60966
> >       rx_queue_2_bytes: 10440203
> >       rx_queue_3_bytes: 6083863
> >       rx_queue_4_bytes: 3809726
> >       rx_queue_5_bytes: 3581460
> >       rx_queue_6_bytes: 5772099
> >       rx_queue_7_bytes: 148
> >       rx_queue_8_bytes: 368
> >       rx_queue_9_bytes: 383
> >       tx_queue_0_bytes: 42
> >       tx_queue_1_bytes: 0
> >       tx_queue_2_bytes: 11442586444
> >       tx_queue_3_bytes: 7383615334
> >       tx_queue_4_bytes: 3981365579
> >       tx_queue_5_bytes: 3983235051
> >       tx_queue_6_bytes: 6706236461
> >       tx_queue_7_bytes: 42
> >       tx_queue_8_bytes: 0
> >       tx_queue_9_bytes: 0
> >
> > tx queues 2-6 are mapping to classid 1:3 - 1:7
> > # tc -s class show dev eth0
> > class mq 1:1 root leaf 2:
> >   Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:2 root leaf 8001:
> >   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:3 root leaf 8002:
> >   Sent 11949133672 bytes 7929798 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:4 root leaf 8003:
> >   Sent 7710449050 bytes 5117279 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:5 root leaf 8004:
> >   Sent 4157648675 bytes 2758990 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:6 root leaf 8005:
> >   Sent 4159632195 bytes 2759990 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:7 root leaf 8006:
> >   Sent 7003169603 bytes 4646912 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:8 root
> >   Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:9 root
> >   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:a root
> >   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class tbf 8001:1 parent 8001:
> >
> > class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
> >   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> >   lended: 0 borrowed: 0 giants: 0
> >   tokens: 2000000 ctokens: 2000000
> >
> > class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
> >   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> >   lended: 0 borrowed: 0 giants: 0
> >   tokens: 1000000 ctokens: 1000000
> >
>
> Yes, this is a good example (which should have been in the commit
> message of 0/2 or 2/2 - would have avoided long discussion).
I will add this example to the commit message of 2/2 in the next version.
> The byte count doesnt map correctly between the DMA side and the
> qdisc side; you probably had some additional experiments running
> prior to installing the mq qdisc.
Yes; the tx queue index starts from 0, while the mq qdisc class index
starts from 1.
>
> To Cong's comments earlier - I don't think you could have correctly
> picked the queue in user space for this specific policy (hash-type
> skbhash). The reason is that you depend on the skb hash computation,
> which is based on things like the ephemeral src port of the iperf3
> client - which cannot be determined in user space.
>
> > Good question. For TCP, we turn the ixgbe ntuple feature off:
> > ethtool -K ixgbe-dev ntuple off
> > With that, the underlying driver/hw records each flow and its tx
> > queue, and when the reply traffic comes back toward the pod, the
> > hw steers it to the rx queue paired with that tx queue.
> >
> > the code path:
> > ixgbe_xmit_frame/ixgbe_xmit_frame_ring -> ixgbe_atr() ->
> > ixgbe_fdir_add_signature_filter_82599()
> > ixgbe_fdir_add_signature_filter_82599() installs the rule for the
> > incoming packets.
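
For completeness, a minimal sketch of toggling and checking that state
("eth0" stands in here for the ixgbe port):

# with ntuple off, ixgbe's ATR samples transmitted packets and installs
# signature filters that steer the return flow to the paired rx queue
ethtool -K eth0 ntuple off
ethtool -k eth0 | grep ntuple-filters   # expect: ntuple-filters: off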
> >
> >> ex: who sets the skb->hash (skb->l4_hash, skb->sw_hash etc)
> > for tcp:
> > __tcp_transmit_skb -> skb_set_hash_from_sk
> >
> > for udp
> > udp_sendmsg -> ip_make_skb -> __ip_append_data -> sock_alloc_send_pskb
> > -> skb_set_owner_w
>
> That's a different use case than what you are presenting here,
> i.e. the k8s pod scenario is purely a forwarding use case.
> But it doesn't matter tbh, since your data shows reasonable results.
>
> [I didn't dig into the code, but it is likely (based on your experimental
> data) that both skb->l4_hash and skb->sw_hash will _not be set_,
> and so skb_get_hash() will compute the skb->hash from scratch.]
No. For TCP, for example, the hash is already set in __tcp_transmit_skb(),
which invokes skb_set_hash_from_sk(); so in skbedit, skb_get_hash() only
reads back skb->hash.
> >> I may be missing something on the cpuid one - there seems to be a high
> >> likelihood of having the same flow on multiple queues (based on what
> >> raw_smp_processor_id() returns, which I believe is not guaranteed to be
> >> consistent). IOW, you could be sending packets out of order for the
> >> same 5-tuple flow (because they end up in different queues).
> > Yes, but consider one case: we pin one pod to one cpu, so all the
> > processes of the pod will use the same cpu. Then all packets from
> > this pod will use the same tx queue.
>
> To Cong's point - if you already knew the pinned-to cpuid then you could
> just as easily set that queue map from user space?
Yes, we can set it from user space. But if skbedit can discover the cpu
the pod is running on and select the corresponding tx queue
automatically, doesn't that make things easier?
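
In other words, instead of hand-coding the queue per pod, the difference
is roughly (a sketch - $POD_IP and the queue numbers are placeholders):

# static pick from user space; must be updated whenever the pod moves:
tc filter add dev $NETDEV egress protocol ip prio 1 \
flower skip_hw src_ip $POD_IP action skbedit queue_mapping 3
# vs. letting skbedit follow the current cpu at transmit time:
tc filter add dev $NETDEV egress protocol ip prio 1 \
flower skip_hw src_ip $POD_IP action skbedit queue_mapping hash-type cpuid 2 6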
> >> As for the classid variant - if these packets are already outside the
> >> pod and into the host stack, is that field even valid?
> > Yes, ipvlan, macvlan and other virtual netdevs don't clear this field.
> >>> Why do we want to balance? Because we don't want to pin the packets
> >>> from a pod to one tx queue (in k8s, pods are created and destroyed
> >>> frequently, and the number of pods > the number of tx queues).
> >>> Sharing the tx queues equally is more important.
> >>>
> >>
> >> As long as the same flow is pinned to the same queue (see my comment
> >> on cpuid).
> >> Over a very long period what you describe may be true, but it also
> >> seems to depend on many other variables.
> > NETDEV=eth0
> >
> > ip li set dev $NETDEV up
> >
> > tc qdisc del dev $NETDEV clsact 2>/dev/null
> > tc qdisc add dev $NETDEV clsact
> >
> > ip link add ipv1 link $NETDEV type ipvlan mode l2
> > ip netns add n1
> > ip link set ipv1 netns n1
> >
> > ip netns exec n1 ip link set ipv1 up
> > ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
> >
> > tc filter add dev $NETDEV egress protocol ip prio 1 \
> > flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type cpuid 2 6
> >
> > tc qdisc add dev $NETDEV handle 1: root mq
> >
> > tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> > tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> > tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
> >
> > tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> > tc qdisc add dev $NETDEV parent 1:3 pfifo
> > tc qdisc add dev $NETDEV parent 1:4 pfifo
> > tc qdisc add dev $NETDEV parent 1:5 pfifo
> > tc qdisc add dev $NETDEV parent 1:6 pfifo
> > tc qdisc add dev $NETDEV parent 1:7 pfifo
> >
> > pin the iperf3 process to one cpu (note: iperf3 is started first and
> > only then moved into the cpuset, hence the before/after split below):
> > # mkdir -p /sys/fs/cgroup/cpuset/n0
> > # echo 4 > /sys/fs/cgroup/cpuset/n0/cpuset.cpus
> > # echo 0 > /sys/fs/cgroup/cpuset/n0/cpuset.mems
> > # ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 1000 -P 10 -u -b 10G
> > # echo $(pidof iperf3) > /sys/fs/cgroup/cpuset/n0/tasks
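
(The pinning can be verified - this check was not part of the original run:)

taskset -cp $(pidof iperf3)    # prints iperf3's current cpu affinity list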
> >
> > # ethtool -S eth0 | grep -i tx_queue_[0-9]_bytes
> >       tx_queue_0_bytes: 7180
> >       tx_queue_1_bytes: 418
> >       tx_queue_2_bytes: 3015
> >       tx_queue_3_bytes: 4824
> >       tx_queue_4_bytes: 3738
> >       tx_queue_5_bytes: 716102781 # before pinning iperf3 to cpu 4
> >       tx_queue_6_bytes: 17989642640 # after pinning iperf3 to cpu 4,
> > skbedit uses this tx queue and no longer uses tx queue 5
> >       tx_queue_7_bytes: 4364
> >       tx_queue_8_bytes: 42
> >       tx_queue_9_bytes: 3030
> >
> >
> > # tc -s class show dev eth0
> > class mq 1:1 root leaf 2:
> >   Sent 9874 bytes 63 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:2 root leaf 8001:
> >   Sent 418 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:3 root leaf 8002:
> >   Sent 3015 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:4 root leaf 8003:
> >   Sent 4824 bytes 8 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:5 root leaf 8004:
> >   Sent 4074 bytes 19 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:6 root leaf 8005:
> >   Sent 716102781 bytes 480624 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:7 root leaf 8006:
> >   Sent 18157071781 bytes 12186100 pkt (dropped 0, overlimits 0 requeues 18)
> >   backlog 0b 0p requeues 18
> > class mq 1:8 root
> >   Sent 4364 bytes 26 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:9 root
> >   Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class mq 1:a root
> >   Sent 3030 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> > class tbf 8001:1 parent 8001:
> >
> > class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
> >   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> >   lended: 0 borrowed: 0 giants: 0
> >   tokens: 2000000 ctokens: 2000000
> >
> > class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
> >   Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> >   lended: 0 borrowed: 0 giants: 0
> >   tokens: 1000000 ctokens: 1000000
> >
>
> Yes, if you pin a flow/process to a cpu - this is expected. See my
> earlier comment. You could argue that you are automating things, but
> it is not as strong as the hash setup (and it will have to be documented
> that it works only if you pin processes doing network i/o to cpus).
Ok, it should be documented in iproute2, and we will document this in
the commit message too.
> Could you also post an example on the cgroups classid?

The setup commands:
NETDEV=eth0
ip li set dev $NETDEV up

tc qdisc del dev $NETDEV clsact 2>/dev/null
tc qdisc add dev $NETDEV clsact

ip link add ipv1 link $NETDEV type ipvlan mode l2
ip netns add n1
ip link set ipv1 netns n1

ip netns exec n1 ip link set ipv1 up
ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up

tc filter add dev $NETDEV egress protocol ip prio 1 \
flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping \
hash-type classid 2 6

tc qdisc add dev $NETDEV handle 1: root mq

tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit

tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
tc qdisc add dev $NETDEV parent 1:3 pfifo
tc qdisc add dev $NETDEV parent 1:4 pfifo
tc qdisc add dev $NETDEV parent 1:5 pfifo
tc qdisc add dev $NETDEV parent 1:6 pfifo
tc qdisc add dev $NETDEV parent 1:7 pfifo

set up the net_cls classid and move the running iperf3 into it:
# mkdir -p /sys/fs/cgroup/net_cls/n0
# echo 0x100001 > /sys/fs/cgroup/net_cls/n0/net_cls.classid
# echo $(pidof iperf3) > /sys/fs/cgroup/net_cls/n0/tasks
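
A note on the value (an aside, not from the original setup): net_cls.classid
packs a tc class handle as 0xAAAABBBB - major in the upper 16 bits, minor in
the lower 16 - so:

# 0x100001 == major 0x10, minor 0x0001, i.e. handle 10:1 in tc notation
printf '0x%x\n' $(( (0x10 << 16) | 0x1 ))   # prints 0x100001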

# ethtool -S eth0 | grep -i tx_queue_[0-9]_bytes
     tx_queue_0_bytes: 9660
     tx_queue_1_bytes: 0
     tx_queue_2_bytes: 102434986698 # before moving iperf3 into cgroup n0
     tx_queue_3_bytes: 2964
     tx_queue_4_bytes: 75041373396 # after moving iperf3 into cgroup n0
     tx_queue_5_bytes: 13458
     tx_queue_6_bytes: 1252
     tx_queue_7_bytes: 522
     tx_queue_8_bytes: 48000
     tx_queue_9_bytes: 0

# tc -s class show dev eth0
class mq 1:1 root leaf 2:
 Sent 11106 bytes 65 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:2 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:3 root leaf 8002:
 Sent 106986143484 bytes 70783214 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:4 root leaf 8003:
 Sent 2964 bytes 12 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:5 root leaf 8004:
 Sent 78364514238 bytes 51985575 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:6 root leaf 8005:
 Sent 13458 bytes 101 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:7 root leaf 8006:
 Sent 1252 bytes 6 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:8 root
 Sent 522 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:9 root
 Sent 48000 bytes 222 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:a root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class tbf 8001:1 parent 8001:

class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 2000000 ctokens: 2000000

class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 1000000 ctokens: 1000000

> cheers,
> jamal



-- 
Best regards, Tonghao


* Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
  2022-02-21  1:43                 ` Tonghao Zhang
@ 2022-02-22 11:44                   ` Jamal Hadi Salim
  0 siblings, 0 replies; 24+ messages in thread
From: Jamal Hadi Salim @ 2022-02-22 11:44 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Linux Kernel Network Developers, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
	Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
	Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
	Antoine Tenart, Wei Wang, Arnd Bergmann

On 2022-02-20 20:43, Tonghao Zhang wrote:
> On Mon, Feb 21, 2022 at 2:30 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>> On 2022-02-18 07:43, Tonghao Zhang wrote:
>>> On Thu, Feb 17, 2022 at 7:39 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>> That's a different use case than what you are presenting here,
>> i.e. the k8s pod scenario is purely a forwarding use case.
>> But it doesn't matter tbh, since your data shows reasonable results.
>>
>> [I didn't dig into the code, but it is likely (based on your experimental
>> data) that both skb->l4_hash and skb->sw_hash will _not be set_,
>> and so skb_get_hash() will compute the skb->hash from scratch.]
> No. For TCP, for example, the hash is already set in __tcp_transmit_skb(),
> which invokes skb_set_hash_from_sk(); so in skbedit, skb_get_hash() only
> reads back skb->hash.

There is no tcp anything in the forwarding case - your use case was
forwarding. I understand the local-host tcp/udp variant.

>>>> I may be missing something on the cpuid one - there seems to be a high
>>>> likelihood of having the same flow on multiple queues (based on what
>>>> raw_smp_processor_id() returns, which I believe is not guaranteed to be
>>>> consistent). IOW, you could be sending packets out of order for the
>>>> same 5-tuple flow (because they end up in different queues).
>>> Yes, but consider one case: we pin one pod to one cpu, so all the
>>> processes of the pod will use the same cpu. Then all packets from
>>> this pod will use the same tx queue.
>>
>> To Cong's point - if you already knew the pinned-to cpuid then you could
>> just as easily set that queue map from user space?
> Yes, we can set it from user space. But if skbedit can discover the cpu
> the pod is running on and select the corresponding tx queue
> automatically, doesn't that make things easier?

Yes, but you know the CPU - so Cong's point is valid. You knew the
CPU when you set up the cgroup for iperf by hand; the same hand could
set the skbedit queue map.

>>> ip li set dev $NETDEV up
>>> [...]
>>> tc filter add dev $NETDEV egress protocol ip prio 1 \
>>> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type cpuid 2 6
>>> [...]
>>> pin the iperf3 process to one cpu (note: iperf3 is started first and
>>> only then moved into the cpuset, hence the before/after split below):
>>> # mkdir -p /sys/fs/cgroup/cpuset/n0
>>> # echo 4 > /sys/fs/cgroup/cpuset/n0/cpuset.cpus
>>> # echo 0 > /sys/fs/cgroup/cpuset/n0/cpuset.mems
>>> # ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 1000 -P 10 -u -b 10G
>>> # echo $(pidof iperf3) > /sys/fs/cgroup/cpuset/n0/tasks
>>> [...]
>>
>> Yes, if you pin a flow/process to a cpu - this is expected. See my
>> earlier comment. You could argue that you are automating things, but
>> it is not as strong as the hash setup (and it will have to be documented
>> that it works only if you pin processes doing network i/o to cpus).
> Ok, it should be documented in iproute2, and we will document this in
> the commit message too.

I think this part is iffy. You could argue the automation point of
view, but I don't see much else.

>> Could you also post an example on the cgroups classid?
> 
> The setup commands:
> [...]
> tc filter add dev $NETDEV egress protocol ip prio 1 \
> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping \
> hash-type classid 2 6
> [...]
> set up the net_cls classid and move the running iperf3 into it:
> # mkdir -p /sys/fs/cgroup/net_cls/n0
> # echo 0x100001 > /sys/fs/cgroup/net_cls/n0/net_cls.classid
> # echo $(pidof iperf3) > /sys/fs/cgroup/net_cls/n0/tasks
> 


I would say the same thing here as well. You know the classid - you
manually set it above - so you could have said:

src_ip 2.2.2.100 action skbedit queue_mapping 1
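
(i.e., in full, mirroring the filter syntax used earlier:)

tc filter add dev $NETDEV egress protocol ip prio 1 \
flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping 1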

cheers,
jamal



end of thread

Thread overview: 24+ messages
2022-01-26 14:32 [net-next v8 0/2] net: sched: allow user to select txqueue xiangxia.m.yue
2022-01-26 14:32 ` [net-next v8 1/2] net: sched: use queue_mapping to pick tx queue xiangxia.m.yue
2022-02-15  0:14   ` Jamal Hadi Salim
2022-02-15  2:00     ` Tonghao Zhang
2022-01-26 14:32 ` [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting " xiangxia.m.yue
2022-01-26 19:52   ` Cong Wang
2022-01-27  1:23     ` Tonghao Zhang
2022-01-27  1:40       ` Tonghao Zhang
2022-01-31 13:12     ` Jamal Hadi Salim
2022-02-05  7:25       ` Tonghao Zhang
2022-02-09 14:17         ` Tonghao Zhang
2022-02-09 15:26           ` Jakub Kicinski
2022-02-14 11:16             ` Tonghao Zhang
2022-02-14 15:35               ` Jakub Kicinski
2022-02-15  0:22   ` Jamal Hadi Salim
2022-02-15  1:40     ` Tonghao Zhang
2022-02-16  0:17       ` Jamal Hadi Salim
2022-02-16 13:36         ` Tonghao Zhang
2022-02-16 23:39           ` Jamal Hadi Salim
2022-02-18 12:43             ` Tonghao Zhang
2022-02-20 18:30               ` Jamal Hadi Salim
2022-02-21  1:43                 ` Tonghao Zhang
2022-02-22 11:44                   ` Jamal Hadi Salim
2022-01-27  2:24 ` [net-next v8 0/2] net: sched: allow user to select txqueue Jakub Kicinski
