All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 net-next 1/2] net: introduce skb_postpush_rcsum() helper
@ 2015-04-10 23:33 Alexei Starovoitov
  2015-04-10 23:33 ` [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc Alexei Starovoitov
  2015-04-11  6:40 ` [PATCH v4 net-next 1/2] net: introduce skb_postpush_rcsum() helper Daniel Borkmann
  0 siblings, 2 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2015-04-10 23:33 UTC (permalink / raw)
  To: David S. Miller
  Cc: Eric Dumazet, Daniel Borkmann, Thomas Graf, Jiri Pirko,
	Jamal Hadi Salim, netdev

similar to skb_postpull_rcsum() introduce skb_postpush_rcsum() helper

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---

next patch is also using skb_postpush_rcsum()

 include/linux/skbuff.h |   13 +++++++++++++
 net/core/filter.c      |    4 ++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0991259643d6..049c2db872e0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2592,6 +2592,19 @@ static inline void skb_postpull_rcsum(struct sk_buff *skb,
 		skb->csum = csum_sub(skb->csum, csum_partial(start, len, 0));
 }
 
+/**
+ *	skb_postpush_rcsum - update checksum for skb after push
+ *	@skb: buffer to update
+ *	@start: start of data after push
+ *	@len: length of data pushed
+ */
+static inline void skb_postpush_rcsum(struct sk_buff *skb,
+				      const void *start, unsigned int len)
+{
+	if (skb->ip_summed == CHECKSUM_COMPLETE)
+		skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
+}
+
 unsigned char *skb_pull_rcsum(struct sk_buff *skb, unsigned int len);
 
 /**
diff --git a/net/core/filter.c b/net/core/filter.c
index b669e75d2b36..e5872704c28b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1213,8 +1213,8 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags)
 		/* skb_store_bits cannot return -EFAULT here */
 		skb_store_bits(skb, offset, ptr, len);
 
-	if (BPF_RECOMPUTE_CSUM(flags) && skb->ip_summed == CHECKSUM_COMPLETE)
-		skb->csum = csum_add(skb->csum, csum_partial(ptr, len, 0));
+	if (BPF_RECOMPUTE_CSUM(flags))
+		skb_postpush_rcsum(skb, ptr, len);
 	return 0;
 }
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-10 23:33 [PATCH v4 net-next 1/2] net: introduce skb_postpush_rcsum() helper Alexei Starovoitov
@ 2015-04-10 23:33 ` Alexei Starovoitov
  2015-04-11  6:46   ` Daniel Borkmann
  2015-04-13 14:16   ` Jamal Hadi Salim
  2015-04-11  6:40 ` [PATCH v4 net-next 1/2] net: introduce skb_postpush_rcsum() helper Daniel Borkmann
  1 sibling, 2 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2015-04-10 23:33 UTC (permalink / raw)
  To: David S. Miller
  Cc: Eric Dumazet, Daniel Borkmann, Thomas Graf, Jiri Pirko,
	Jamal Hadi Salim, netdev

TC classifers and actions attached to ingress and egress qdiscs see
inconsistent skb->data. For ingress L2 header is already pulled, whereas
for egress it's present. Introduce an optional flag for ingress qdisc
which if set will cause ingress to push L2 header before calling
into classifiers/actions and pull L2 back afterwards.

The cls_bpf/act_bpf are now marked as 'needs_l2'. The users can use them
on ingress qdisc created with 'needs_l2' flag and on any egress qdisc.
The use of them with vanilla ingress is disallowed.

The ingress_l2 qdisc can only be attached to devices that provide headers_ops.

When ingress is not enabled static_key avoids *(skb->dev->ingress_queue)

When ingress is enabled the difference old vs new to reach qdisc spinlock:
old:
*(skb->dev->ingress_queue), if, *(rxq->qdisc), if, *(rxq->qdisc), if
new:
*(skb->dev->ingress_queue), if, *(rxq->qdisc), if, if

This patch provides a foundation to use ingress_l2+cls_bpf to filter
interesting traffic and mirror small part of it to a different netdev for
capturing. This approach is significantly faster than traditional af_packet,
since skb_clone is called after filtering. dhclient and other tap-based tools
may consider switching to this style.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---

This patch is on top of Daniel's static_key patch:
http://patchwork.ozlabs.org/patch/460228/

V3->V4:
. drop skb_share_check() for ingress_l2, since it's called before taps
. drop 'needs_l2' marks in other cls/acts (only cls_bpf/act_bpf are marked)
. use static_key
. add dev->header_ops check to avoid attaching ingress_l2 to devices that
  don't support pushing of L2

 include/net/act_api.h          |    9 ++++---
 include/net/sch_generic.h      |    3 +++
 include/uapi/linux/pkt_sched.h |   10 +++++++
 net/core/dev.c                 |   57 +++++++++++++++++++++++++++++-----------
 net/sched/act_api.c            |   21 ++++++++++-----
 net/sched/act_bpf.c            |    1 +
 net/sched/cls_api.c            |   20 ++++++++++++--
 net/sched/cls_bpf.c            |    1 +
 net/sched/sch_api.c            |    5 ++++
 net/sched/sch_ingress.c        |   21 +++++++++++++++
 samples/bpf/tcbpf1_kern.c      |    6 -----
 11 files changed, 121 insertions(+), 33 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 3ee4c92afd1b..c52bd98b7dd9 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -95,6 +95,8 @@ struct tc_action_ops {
 			struct nlattr *est, struct tc_action *act, int ovr,
 			int bind);
 	int     (*walk)(struct sk_buff *, struct netlink_callback *, int, struct tc_action *);
+	u32	flags;
+#define TCA_NEEDS_L2	1
 };
 
 int tcf_hash_search(struct tc_action *a, u32 index);
@@ -112,12 +114,11 @@ int tcf_unregister_action(struct tc_action_ops *a);
 int tcf_action_destroy(struct list_head *actions, int bind);
 int tcf_action_exec(struct sk_buff *skb, const struct list_head *actions,
 		    struct tcf_result *res);
-int tcf_action_init(struct net *net, struct nlattr *nla,
-				  struct nlattr *est, char *n, int ovr,
-				  int bind, struct list_head *);
+int tcf_action_init(struct net *net, struct nlattr *nla, struct nlattr *est,
+		    char *n, int ovr, int bind, struct list_head *, bool);
 struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
 				    struct nlattr *est, char *n, int ovr,
-				    int bind);
+				    int bind, bool qdisc_has_l2);
 int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int);
 int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int);
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6d778efcfdfd..e755cc50a037 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -61,6 +61,7 @@ struct Qdisc {
 				      */
 #define TCQ_F_WARN_NONWC	(1 << 16)
 #define TCQ_F_CPUSTATS		0x20 /* run using percpu statistics */
+#define TCQ_F_EARLY_INGRESS_L2	0x40 /* ingress qdisc with L2 header */
 	u32			limit;
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
@@ -228,6 +229,8 @@ struct tcf_proto_ops {
 					struct sk_buff *skb, struct tcmsg*);
 
 	struct module		*owner;
+	u32			flags;
+#define TCF_NEEDS_L2	1
 };
 
 struct tcf_proto {
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 534b84710745..1076c56e6458 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -845,4 +845,14 @@ struct tc_pie_xstats {
 	__u32 maxq;             /* maximum queue size */
 	__u32 ecn_mark;         /* packets marked with ecn*/
 };
+
+/* INGRESS section */
+
+enum {
+	TCA_INGRESS_UNSPEC,
+	TCA_INGRESS_NEEDS_L2,
+	__TCA_INGRESS_MAX,
+};
+
+#define TCA_INGRESS_MAX (__TCA_INGRESS_MAX - 1)
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index af4a1b0adc10..91f62e5b6105 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3529,12 +3529,11 @@ EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
  * the ingress scheduler, you just can't add policies on ingress.
  *
  */
-static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
+static int ing_filter(struct sk_buff *skb, struct Qdisc *q)
 {
 	struct net_device *dev = skb->dev;
 	u32 ttl = G_TC_RTTL(skb->tc_verd);
 	int result = TC_ACT_OK;
-	struct Qdisc *q;
 
 	if (unlikely(MAX_RED_LOOP < ttl++)) {
 		net_warn_ratelimited("Redir loop detected Dropping packet (%d->%d)\n",
@@ -3545,24 +3544,20 @@ static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
 	skb->tc_verd = SET_TC_RTTL(skb->tc_verd, ttl);
 	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
 
-	q = rcu_dereference(rxq->qdisc);
-	if (q != &noop_qdisc) {
-		spin_lock(qdisc_lock(q));
-		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			result = qdisc_enqueue_root(skb, q);
-		spin_unlock(qdisc_lock(q));
-	}
+	spin_lock(qdisc_lock(q));
+	if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
+		result = qdisc_enqueue_root(skb, q);
+	spin_unlock(qdisc_lock(q));
 
 	return result;
 }
 
 static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 					 struct packet_type **pt_prev,
-					 int *ret, struct net_device *orig_dev)
+					 int *ret, struct net_device *orig_dev,
+					 struct Qdisc *q)
 {
-	struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
-
-	if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
+	if (q->flags != TCQ_F_INGRESS) /* this check includes q == &noop_qdisc */
 		return skb;
 
 	if (*pt_prev) {
@@ -3570,7 +3565,7 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 		*pt_prev = NULL;
 	}
 
-	switch (ing_filter(skb, rxq)) {
+	switch (ing_filter(skb, q)) {
 	case TC_ACT_SHOT:
 	case TC_ACT_STOLEN:
 		kfree_skb(skb);
@@ -3579,6 +3574,27 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 
 	return skb;
 }
+
+static inline struct sk_buff *handle_early_ing(struct sk_buff *skb, struct Qdisc *q)
+{
+	u32 hard_header_len;
+
+	if (q->flags != (TCQ_F_INGRESS | TCQ_F_EARLY_INGRESS_L2))
+		return skb;
+
+	hard_header_len = skb->dev->hard_header_len;
+	skb_push(skb, hard_header_len);
+	skb_postpush_rcsum(skb, skb->data, hard_header_len);
+
+	switch (ing_filter(skb, q)) {
+	case TC_ACT_SHOT:
+	case TC_ACT_STOLEN:
+		kfree_skb(skb);
+		return NULL;
+	}
+	skb_pull_rcsum(skb, hard_header_len);
+	return skb;
+}
 #endif
 
 /**
@@ -3658,6 +3674,8 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	rx_handler_func_t *rx_handler;
 	struct net_device *orig_dev;
 	bool deliver_exact = false;
+	struct Qdisc *qdisc = &noop_qdisc;
+	struct netdev_queue *rxq;
 	int ret = NET_RX_DROP;
 	__be16 type;
 
@@ -3693,6 +3711,15 @@ another_round:
 		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
 		goto ncls;
 	}
+	if (static_key_false(&ingress_needed)) {
+		rxq = rcu_dereference(skb->dev->ingress_queue);
+		if (rxq)
+			qdisc = rcu_dereference(rxq->qdisc);
+
+		skb = handle_early_ing(skb, qdisc);
+		if (!skb)
+			goto unlock;
+	}
 #endif
 
 	if (pfmemalloc)
@@ -3713,7 +3740,7 @@ another_round:
 skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
 	if (static_key_false(&ingress_needed)) {
-		skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
+		skb = handle_ing(skb, &pt_prev, &ret, orig_dev, qdisc);
 		if (!skb)
 			goto unlock;
 	}
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3d43e4979f27..0fe93a60c59b 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -484,7 +484,7 @@ errout:
 
 struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
 				    struct nlattr *est, char *name, int ovr,
-				    int bind)
+				    int bind, bool qdisc_has_l2)
 {
 	struct tc_action *a;
 	struct tc_action_ops *a_o;
@@ -533,6 +533,14 @@ struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
 		goto err_out;
 	}
 
+	if ((a_o->flags & TCA_NEEDS_L2) && !qdisc_has_l2) {
+		/* actions that require L2 cannot be attached
+		 * to vanilla ingress qdisc
+		 */
+		err = -EINVAL;
+		goto err_mod;
+	}
+
 	err = -ENOMEM;
 	a = kzalloc(sizeof(*a), GFP_KERNEL);
 	if (a == NULL)
@@ -565,9 +573,9 @@ err_out:
 	return ERR_PTR(err);
 }
 
-int tcf_action_init(struct net *net, struct nlattr *nla,
-				  struct nlattr *est, char *name, int ovr,
-				  int bind, struct list_head *actions)
+int tcf_action_init(struct net *net, struct nlattr *nla, struct nlattr *est,
+		    char *name, int ovr, int bind, struct list_head *actions,
+		    bool qdisc_has_l2)
 {
 	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
 	struct tc_action *act;
@@ -579,7 +587,8 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
 		return err;
 
 	for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
-		act = tcf_action_init_1(net, tb[i], est, name, ovr, bind);
+		act = tcf_action_init_1(net, tb[i], est, name, ovr, bind,
+					qdisc_has_l2);
 		if (IS_ERR(act)) {
 			err = PTR_ERR(act);
 			goto err;
@@ -931,7 +940,7 @@ tcf_action_add(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
 	int ret = 0;
 	LIST_HEAD(actions);
 
-	ret = tcf_action_init(net, nla, NULL, NULL, ovr, 0, &actions);
+	ret = tcf_action_init(net, nla, NULL, NULL, ovr, 0, &actions, true);
 	if (ret)
 		goto done;
 
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 4d2cede17468..8f89d97dfa4a 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -339,6 +339,7 @@ static struct tc_action_ops act_bpf_ops __read_mostly = {
 	.dump		=	tcf_bpf_dump,
 	.cleanup	=	tcf_bpf_cleanup,
 	.init		=	tcf_bpf_init,
+	.flags		=	TCA_NEEDS_L2,
 };
 
 static int __init bpf_init_module(void)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8b0470e418dc..3bbb033af864 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -111,6 +111,11 @@ static inline u32 tcf_auto_prio(struct tcf_proto *tp)
 	return first;
 }
 
+static bool qdisc_has_l2(const struct Qdisc *q)
+{
+	return !(q->flags & TCQ_F_INGRESS) || (q->flags & TCQ_F_EARLY_INGRESS_L2);
+}
+
 /* Add/change/delete/get a filter node */
 
 static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
@@ -257,6 +262,15 @@ replay:
 			kfree(tp);
 			goto errout;
 		}
+		if ((tp_ops->flags & TCF_NEEDS_L2) && !qdisc_has_l2(q)) {
+			/* classifiers that need L2 header cannot be
+			 * attached to vanilla ingress qdisc
+			 */
+			err = -EINVAL;
+			module_put(tp_ops->owner);
+			kfree(tp);
+			goto errout;
+		}
 		tp->ops = tp_ops;
 		tp->protocol = protocol;
 		tp->prio = nprio ? :
@@ -522,7 +536,8 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 		if (exts->police && tb[exts->police]) {
 			act = tcf_action_init_1(net, tb[exts->police], rate_tlv,
 						"police", ovr,
-						TCA_ACT_BIND);
+						TCA_ACT_BIND,
+						qdisc_has_l2(tp->q));
 			if (IS_ERR(act))
 				return PTR_ERR(act);
 
@@ -532,7 +547,8 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 			int err;
 			err = tcf_action_init(net, tb[exts->action], rate_tlv,
 					      NULL, ovr,
-					      TCA_ACT_BIND, &exts->actions);
+					      TCA_ACT_BIND, &exts->actions,
+					      qdisc_has_l2(tp->q));
 			if (err)
 				return err;
 		}
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 5c4171c5d2bd..b99e677e2800 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -476,6 +476,7 @@ static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
 	.delete		=	cls_bpf_delete,
 	.walk		=	cls_bpf_walk,
 	.dump		=	cls_bpf_dump,
+	.flags		=	TCF_NEEDS_L2,
 };
 
 static int __init cls_bpf_init_mod(void)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ad9eed70bc8f..521499bf610a 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -987,6 +987,11 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
 				goto err_out4;
 		}
 
+		if ((sch->flags & TCQ_F_EARLY_INGRESS_L2) && !dev->header_ops) {
+			err = -EINVAL;
+			goto err_out4;
+		}
+
 		qdisc_list_add(sch);
 
 		return sch;
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 4cdbfb85686a..6bdbb6cc3f49 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -88,8 +88,29 @@ static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 
 /* ------------------------------------------------------------- */
 
+static const struct nla_policy ingress_policy[TCA_INGRESS_MAX + 1] = {
+	[TCA_INGRESS_NEEDS_L2]	= { .type = NLA_U32 },
+};
+
 static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
 {
+	struct nlattr *tb[TCA_INGRESS_MAX + 1];
+	int err;
+
+	if (!opt)
+		goto out;
+
+	err = nla_parse_nested(tb, TCA_INGRESS_MAX, opt, ingress_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_INGRESS_NEEDS_L2]) {
+		if (nla_get_u32(tb[TCA_INGRESS_NEEDS_L2]))
+			sch->flags |= TCQ_F_EARLY_INGRESS_L2;
+		else
+			sch->flags &= ~TCQ_F_EARLY_INGRESS_L2;
+	}
+out:
 	net_inc_ingress_queue();
 
 	return 0;
diff --git a/samples/bpf/tcbpf1_kern.c b/samples/bpf/tcbpf1_kern.c
index 7cf3f42a6e39..76cdaab63058 100644
--- a/samples/bpf/tcbpf1_kern.c
+++ b/samples/bpf/tcbpf1_kern.c
@@ -14,12 +14,6 @@ static inline void set_dst_mac(struct __sk_buff *skb, char *mac)
 	bpf_skb_store_bytes(skb, 0, mac, ETH_ALEN, 1);
 }
 
-/* use 1 below for ingress qdisc and 0 for egress */
-#if 0
-#undef ETH_HLEN
-#define ETH_HLEN 0
-#endif
-
 #define IP_CSUM_OFF (ETH_HLEN + offsetof(struct iphdr, check))
 #define TOS_OFF (ETH_HLEN + offsetof(struct iphdr, tos))
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 1/2] net: introduce skb_postpush_rcsum() helper
  2015-04-10 23:33 [PATCH v4 net-next 1/2] net: introduce skb_postpush_rcsum() helper Alexei Starovoitov
  2015-04-10 23:33 ` [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc Alexei Starovoitov
@ 2015-04-11  6:40 ` Daniel Borkmann
  1 sibling, 0 replies; 12+ messages in thread
From: Daniel Borkmann @ 2015-04-11  6:40 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller
  Cc: Eric Dumazet, Thomas Graf, Jiri Pirko, Jamal Hadi Salim, netdev

On 04/11/2015 01:33 AM, Alexei Starovoitov wrote:
> similar to skb_postpull_rcsum() introduce skb_postpush_rcsum() helper
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-10 23:33 ` [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc Alexei Starovoitov
@ 2015-04-11  6:46   ` Daniel Borkmann
  2015-04-13 14:16   ` Jamal Hadi Salim
  1 sibling, 0 replies; 12+ messages in thread
From: Daniel Borkmann @ 2015-04-11  6:46 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller
  Cc: Eric Dumazet, Thomas Graf, Jiri Pirko, Jamal Hadi Salim, netdev

On 04/11/2015 01:33 AM, Alexei Starovoitov wrote:
> TC classifers and actions attached to ingress and egress qdiscs see
> inconsistent skb->data. For ingress L2 header is already pulled, whereas
> for egress it's present. Introduce an optional flag for ingress qdisc
> which if set will cause ingress to push L2 header before calling
> into classifiers/actions and pull L2 back afterwards.
>
> The cls_bpf/act_bpf are now marked as 'needs_l2'. The users can use them
> on ingress qdisc created with 'needs_l2' flag and on any egress qdisc.
> The use of them with vanilla ingress is disallowed.
>
> The ingress_l2 qdisc can only be attached to devices that provide headers_ops.
>
> When ingress is not enabled static_key avoids *(skb->dev->ingress_queue)
>
> When ingress is enabled the difference old vs new to reach qdisc spinlock:
> old:
> *(skb->dev->ingress_queue), if, *(rxq->qdisc), if, *(rxq->qdisc), if
> new:
> *(skb->dev->ingress_queue), if, *(rxq->qdisc), if, if
>
> This patch provides a foundation to use ingress_l2+cls_bpf to filter
> interesting traffic and mirror small part of it to a different netdev for
> capturing. This approach is significantly faster than traditional af_packet,
> since skb_clone is called after filtering. dhclient and other tap-based tools
> may consider switching to this style.
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Yes, that's the suggested alternative for the constraints we're having.
Looks good to me, thanks!

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-10 23:33 ` [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc Alexei Starovoitov
  2015-04-11  6:46   ` Daniel Borkmann
@ 2015-04-13 14:16   ` Jamal Hadi Salim
  2015-04-13 17:37     ` Alexei Starovoitov
  1 sibling, 1 reply; 12+ messages in thread
From: Jamal Hadi Salim @ 2015-04-13 14:16 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller
  Cc: Eric Dumazet, Daniel Borkmann, Thomas Graf, Jiri Pirko, netdev

On 04/10/15 19:33, Alexei Starovoitov wrote:
> TC classifers and actions attached to ingress and egress qdiscs see
> inconsistent skb->data. For ingress L2 header is already pulled, whereas
> for egress it's present. Introduce an optional flag for ingress qdisc
> which if set will cause ingress to push L2 header before calling
> into classifiers/actions and pull L2 back afterwards.
>
> The cls_bpf/act_bpf are now marked as 'needs_l2'. The users can use them
> on ingress qdisc created with 'needs_l2' flag and on any egress qdisc.
> The use of them with vanilla ingress is disallowed.
>
> The ingress_l2 qdisc can only be attached to devices that provide headers_ops.
>
> When ingress is not enabled static_key avoids *(skb->dev->ingress_queue)
>
> When ingress is enabled the difference old vs new to reach qdisc spinlock:
> old:
> *(skb->dev->ingress_queue), if, *(rxq->qdisc), if, *(rxq->qdisc), if
> new:
> *(skb->dev->ingress_queue), if, *(rxq->qdisc), if, if
>
> This patch provides a foundation to use ingress_l2+cls_bpf to filter
> interesting traffic and mirror small part of it to a different netdev for
> capturing. This approach is significantly faster than traditional af_packet,
> since skb_clone is called after filtering. dhclient and other tap-based tools
> may consider switching to this style.
>

Alexei,
I want to support this work but i am having difficulties. I see your
point as i hope you see mine. In my opinion, it is a stalemate.
We need Dave to make the call.

To repeat what i said earlier:
The only known user at this point is bpf. cls_bpf and cls_act could both
look at the AT field, find where they are being invoked from and react
accordingly. This is not very hard for a coder to do and the user
injecting the policy doesnt need to know about it.
If you do that then i think you need to also inform users downstream
from bpf that they should expect to see the packet at the Link header
and not the network header.


cheers,
jamal

PS:- note that __netif_receive_skb_core() at the beginning is what sets 
all these headers.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-13 14:16   ` Jamal Hadi Salim
@ 2015-04-13 17:37     ` Alexei Starovoitov
  0 siblings, 0 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2015-04-13 17:37 UTC (permalink / raw)
  To: Jamal Hadi Salim, David S. Miller
  Cc: Eric Dumazet, Daniel Borkmann, Thomas Graf, Jiri Pirko, netdev

On 4/13/15 7:16 AM, Jamal Hadi Salim wrote:
>
> To repeat what i said earlier:
> The only known user at this point is bpf. cls_bpf and cls_act could both
> look at the AT field, find where they are being invoked from and react
> accordingly. This is not very hard for a coder to do and the user
> injecting the policy doesnt need to know about it.

It is hard. That's my experience coding in bpf clearly tells me.
Burdening users will have long term consequences.
If Dave doesn't like ingress_l2 approach, my only choice left is to
abandon ld_abs/ld_ind instructions for tc+bpf programs (since ld_abs
must use skb->data) and introduce new helpers that always use
skb->mac_header. That is the ugly choice and the deeper you look the
uglier it becomes. Like I cannot reject programs with ld_abs in
verifier, since ld_abs with skf_ll_off negative offset is still usable,
but slower than new helpers, but regular ld_abs shouldn't be used and
user will be finding bugs the hard way. I cannot recode ld_abs
offsets on the fly either, since the same program can be attached to
both ingress and egress. I cannot pass l3-l2 delta into the program,
since negative offsets in ld_abs have different meaning, so ld_abs -14
cannot work. I cannot push/pull/skb_share_check when program is already
running. Etc, etc.
ingress_l2 cleanly solves it and it also makes ingress and egress
_consistent_ for all existing and future classifiers and actions.
I don't see it as being bpf specific. imo it should have been this way
from the beginning. That's why I tried to do it unconditionally in v2
patch, but ingress was this way for too long, so we have to add a flag.

> If you do that then i think you need to also inform users downstream
> from bpf that they should expect to see the packet at the Link header
> and not the network header.

I've analyzed all classifiers and actions. Few assume starting at l2
and I marked them as such in v3. In this v4 patch I dropped the marking,
since my reading of Dave's recommendation was to fix them to work
with/without l2. That's my plan. So nothing will be affected by this
ingress_l2. imo it's a good generic feature. I can see future non-bpf
actions that would also want to use TCA_NEEDS_L2 to avoid dealing with
ingress/egress differences.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-14  0:57     ` Alexei Starovoitov
@ 2015-04-14 18:05       ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2015-04-14 18:05 UTC (permalink / raw)
  To: ast; +Cc: cwang, daniel, edumazet, tgraf, jiri, jhs, netdev

From: Alexei Starovoitov <ast@plumgrid.com>
Date: Mon, 13 Apr 2015 17:57:21 -0700

> On 4/13/15 3:44 PM, Cong Wang wrote:
>>
>> My suggestion is fairly simple: move skb_pull() out of
>> eth_type_trans()
>> and move it to netif_receive_skb(), it apparently needs much more work
>> but it worth the effort. This would eliminate the skb_push() in mirred
>> and in ifb too, dev_forward_skb() would benefit too.
> 
> wow. That would be massive. We're talking months worth of work here.

I think it's worth the effort.

Alexei I cannot apply your needs_l2 patch, it's fundamentally broken
because it assumes the whole chain of classifiers and actions starting
from a given qdisc root want the skb->data in one place or another.
It can be mixed, and quite easily so, and for that scenerio it's
broken.

You've been told multiple ways to keep the existing stuff working
yet give cls_bpf what it wants.  I'll add another one so you'll have
three options:

1) Firstly, to propagate the packet offset into the code generated
   for the BPF classifier when being attached to the ingress qdisc.
   Not only is this doable, you'll know you aren't being shared with
   other kinds of qdiscs so it'll just work.

2) Adjust where skb_pull() happens, Cong's idea.

3) Do the SKB share_check/pull/push in cls_bpf.  No impact on anyone
   else in the qdisc paths, mixed configurations handled in a guaranteed
   way.

I know none of these fit into your unreasonable idealistic model of
how you want all of this to work, but surprise this is the real world
and you'll need to compromise some of your absolute requirements in
order to meet your goals of proper functionality.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-13 22:44   ` Cong Wang
@ 2015-04-14  0:57     ` Alexei Starovoitov
  2015-04-14 18:05       ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2015-04-14  0:57 UTC (permalink / raw)
  To: Cong Wang, Daniel Borkmann
  Cc: David S. Miller, Eric Dumazet, Thomas Graf, Jiri Pirko,
	Jamal Hadi Salim, netdev

On 4/13/15 3:44 PM, Cong Wang wrote:
>
> My suggestion is fairly simple: move skb_pull() out of eth_type_trans()
> and move it to netif_receive_skb(), it apparently needs much more work
> but it worth the effort. This would eliminate the skb_push() in mirred
> and in ifb too, dev_forward_skb() would benefit too.

wow. That would be massive. We're talking months worth of work here.
Though theoretically it's possible, we'd still need a flag to preserve
compatibility with user space and to be able to roll such monster step
by step. Also non-eth devices are calling into netif_receive_skb(). Just
thinking about all corner cases of such move makes me very skeptical
that we can pull it off. In general we don't care about out of tree
drivers, but changing eth_type_trans() this way is really scary.
I'd rather keep things simple which is this patch.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-11  6:53 ` Daniel Borkmann
@ 2015-04-13 22:44   ` Cong Wang
  2015-04-14  0:57     ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2015-04-13 22:44 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, David S. Miller, Eric Dumazet, Thomas Graf,
	Jiri Pirko, Jamal Hadi Salim, netdev

On Fri, Apr 10, 2015 at 11:53 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 04/11/2015 02:45 AM, Cong Wang wrote:
> ...
>>
>> The problem you try to solve is kernel's implementation problem,
>> users should not care, so we should make some effort to make ingress
>> align with egress, rather than making more differences.
>
>
> I guess you've followed the previous discussion up to this point,
> and are aware that we cannot break some users f.e. such as u32.

This is not hard to fix, e.g. pull it back in u32 for egress.

> There has been thought put into this and we went back and forth,
> if you have a better suggestion on how to get it resolved properly
> and generically, I'm all for it.

My suggestion is fairly simple: move skb_pull() out of eth_type_trans()
and move it to netif_receive_skb(), it apparently needs much more work
but it worth the effort. This would eliminate the skb_push() in mirred
and in ifb too, dev_forward_skb() would benefit too.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-11  0:45 [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc Cong Wang
  2015-04-11  1:39 ` Alexei Starovoitov
@ 2015-04-11  6:53 ` Daniel Borkmann
  2015-04-13 22:44   ` Cong Wang
  1 sibling, 1 reply; 12+ messages in thread
From: Daniel Borkmann @ 2015-04-11  6:53 UTC (permalink / raw)
  To: Cong Wang, Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Thomas Graf, Jiri Pirko,
	Jamal Hadi Salim, netdev

On 04/11/2015 02:45 AM, Cong Wang wrote:
...
> The problem you try to solve is kernel's implementation problem,
> users should not care, so we should make some effort to make ingress
> align with egress, rather than making more differences.

I guess you've followed the previous discussion up to this point,
and are aware that we cannot break some users f.e. such as u32.
There has been thought put into this and we went back and forth,
if you have a better suggestion on how to get it resolved properly
and generically, I'm all for it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
  2015-04-11  0:45 [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc Cong Wang
@ 2015-04-11  1:39 ` Alexei Starovoitov
  2015-04-11  6:53 ` Daniel Borkmann
  1 sibling, 0 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2015-04-11  1:39 UTC (permalink / raw)
  To: Cong Wang
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jiri Pirko, Jamal Hadi Salim, netdev

On 4/10/15 5:45 PM, Cong Wang wrote:
>
> The problem you try to solve is kernel's implementation problem,
> users should not care, so we should make some effort to make ingress
> align with egress, rather than making more differences.

Agree. I already tried to do this unconditionally in v2.
Folks have raised concerns that it may break some tc modules
and in general doesn't work for all types of netdevs that don't
set headers_ops. So this flag is the safe compromise.
Personally I will be always turning it on.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc
@ 2015-04-11  0:45 Cong Wang
  2015-04-11  1:39 ` Alexei Starovoitov
  2015-04-11  6:53 ` Daniel Borkmann
  0 siblings, 2 replies; 12+ messages in thread
From: Cong Wang @ 2015-04-11  0:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jiri Pirko, Jamal Hadi Salim, netdev

On Fri, Apr 10, 2015 at 4:33 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> TC classifers and actions attached to ingress and egress qdiscs see
> inconsistent skb->data. For ingress L2 header is already pulled, whereas
> for egress it's present. Introduce an optional flag for ingress qdisc
> which if set will cause ingress to push L2 header before calling
> into classifiers/actions and pull L2 back afterwards.
>
> The cls_bpf/act_bpf are now marked as 'needs_l2'. The users can use them
> on ingress qdisc created with 'needs_l2' flag and on any egress qdisc.
> The use of them with vanilla ingress is disallowed.
>
> The ingress_l2 qdisc can only be attached to devices that provide headers_ops.
>
> When ingress is not enabled static_key avoids *(skb->dev->ingress_queue)
>
> When ingress is enabled the difference old vs new to reach qdisc spinlock:
> old:
> *(skb->dev->ingress_queue), if, *(rxq->qdisc), if, *(rxq->qdisc), if
> new:
> *(skb->dev->ingress_queue), if, *(rxq->qdisc), if, if
>
> This patch provides a foundation to use ingress_l2+cls_bpf to filter
> interesting traffic and mirror small part of it to a different netdev for
> capturing. This approach is significantly faster than traditional af_packet,
> since skb_clone is called after filtering. dhclient and other tap-based tools
> may consider switching to this style.
>

Why the flag needs to be visible to user-space? IOW, why users
should learn the knowledge of the need of this flag?

The problem you try to solve is kernel's implementation problem,
users should not care, so we should make some effort to make ingress
align with egress, rather than making more differences.

The more I look at you patch, no matter where you put that flag into,
the more I think it is a workaround. _Maybe_ it fits for short-term,
but you introduce a new API which can't be removed once merged.

This issue should have been there for a rather long time, so it is not
urgent, take some effect to make ingress align with egress, we would
get more benefits.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-04-14 18:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-10 23:33 [PATCH v4 net-next 1/2] net: introduce skb_postpush_rcsum() helper Alexei Starovoitov
2015-04-10 23:33 ` [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc Alexei Starovoitov
2015-04-11  6:46   ` Daniel Borkmann
2015-04-13 14:16   ` Jamal Hadi Salim
2015-04-13 17:37     ` Alexei Starovoitov
2015-04-11  6:40 ` [PATCH v4 net-next 1/2] net: introduce skb_postpush_rcsum() helper Daniel Borkmann
2015-04-11  0:45 [PATCH v4 net-next 2/2] tc: add 'needs_l2' flag to ingress qdisc Cong Wang
2015-04-11  1:39 ` Alexei Starovoitov
2015-04-11  6:53 ` Daniel Borkmann
2015-04-13 22:44   ` Cong Wang
2015-04-14  0:57     ` Alexei Starovoitov
2015-04-14 18:05       ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.