All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch net-next RFC 0/6] Add support for offloading packet-sampling
@ 2016-10-12 12:41 Jiri Pirko
  2016-10-12 12:41 ` [patch net-next RFC 1/6] Introduce ife encapsulation module Jiri Pirko
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-10-12 12:41 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

From: Jiri Pirko <jiri@mellanox.com>

Add the sample tc action, which allows to sample packet matching
a classifier. The sample action peeks randomly packets, duplicates them,
truncates them and adds informative metadata on the packet, for example,
the input interface and the original packet length. The sampled packets
are marked to allow matching them and redirecting them to a specific
collector device.

The sampled packets metadata is packed using ife encapsulation. To do
that, this patch-set extracts ife logics from the tc_ife action into an
independent ife module, and uses that functionality to pack the metadata.
To include all the needed metadata, this patch-set introduces some new
IFE_META tlv types.

In addition, Add the support for offloading the matchall-sample tc command
in the Mellanox mlxsw driver, for ingress qdiscs.

Yotam Gigi (6):
  Introduce ife encapsulation module
  act_ife: Change to use ife module
  ife: Introduce new metadata tlv types
  Introduce sample tc action
  mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
  mlxsw: packet sample: Add packet sample offloading support

 MAINTAINERS                                    |   7 +
 drivers/net/ethernet/mellanox/mlxsw/reg.h      |  43 ++++
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 116 +++++++++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  11 +
 drivers/net/ethernet/mellanox/mlxsw/trap.h     |   1 +
 include/net/ife.h                              |  19 ++
 include/net/tc_act/tc_ife.h                    |   3 -
 include/net/tc_act/tc_sample.h                 |  88 ++++++++
 include/uapi/linux/Kbuild                      |   1 +
 include/uapi/linux/ife.h                       |  19 ++
 include/uapi/linux/tc_act/Kbuild               |   1 +
 include/uapi/linux/tc_act/tc_ife.h             |  10 +-
 include/uapi/linux/tc_act/tc_sample.h          |  31 +++
 net/Kconfig                                    |   1 +
 net/Makefile                                   |   1 +
 net/ife/Kconfig                                |  16 ++
 net/ife/Makefile                               |   5 +
 net/ife/ife.c                                  | 147 ++++++++++++
 net/sched/Kconfig                              |  14 ++
 net/sched/Makefile                             |   1 +
 net/sched/act_ife.c                            | 109 +++------
 net/sched/act_sample.c                         | 300 +++++++++++++++++++++++++
 22 files changed, 849 insertions(+), 95 deletions(-)
 create mode 100644 include/net/ife.h
 create mode 100644 include/net/tc_act/tc_sample.h
 create mode 100644 include/uapi/linux/ife.h
 create mode 100644 include/uapi/linux/tc_act/tc_sample.h
 create mode 100644 net/ife/Kconfig
 create mode 100644 net/ife/Makefile
 create mode 100644 net/ife/ife.c
 create mode 100644 net/sched/act_sample.c

-- 
2.5.5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [patch net-next RFC 1/6] Introduce ife encapsulation module
  2016-10-12 12:41 [patch net-next RFC 0/6] Add support for offloading packet-sampling Jiri Pirko
@ 2016-10-12 12:41 ` Jiri Pirko
  2016-10-12 12:41 ` [patch net-next RFC 2/6] act_ife: Change to use ife module Jiri Pirko
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-10-12 12:41 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

From: Yotam Gigi <yotam.gi@gmail.com>

This module is responsible for the ife encapsulation protocol
encode/decode logics. That module can:
 - ife_encode: encode skb and reserve space for the ife meta header
 - ife_decode: decode skb and extract the meta header size
 - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
   header space.
 - ife_tlv_meta_decode - decodes one tlv entry from the packet
 - ife_tlv_meta_next - advance to the next tlv

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 MAINTAINERS               |   7 +++
 include/net/ife.h         |  19 ++++++
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/ife.h  |  16 +++++
 net/Kconfig               |   1 +
 net/Makefile              |   1 +
 net/ife/Kconfig           |  16 +++++
 net/ife/Makefile          |   5 ++
 net/ife/ife.c             | 147 ++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 213 insertions(+)
 create mode 100644 include/net/ife.h
 create mode 100644 include/uapi/linux/ife.h
 create mode 100644 net/ife/Kconfig
 create mode 100644 net/ife/Makefile
 create mode 100644 net/ife/ife.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 464437d..8f6741f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6042,6 +6042,13 @@ F:	include/net/cfg802154.h
 F:	include/net/ieee802154_netdev.h
 F:	Documentation/networking/ieee802154.txt
 
+IFE PROTOCOL
+M:	Yotam Gigi <yotamg@mellanox.com>
+M:	Jamal Hadi Salim <jhs@mojatatu.com>
+F:	net/ife
+F:	include/net/ife.h
+F:	include/uapi/linux/ife.h
+
 IGORPLUG-USB IR RECEIVER
 M:	Sean Young <sean@mess.org>
 L:	linux-media@vger.kernel.org
diff --git a/include/net/ife.h b/include/net/ife.h
new file mode 100644
index 0000000..05d7bbb
--- /dev/null
+++ b/include/net/ife.h
@@ -0,0 +1,19 @@
+#ifndef __NET_IFE_H
+#define __NET_IFE_H
+
+#include <uapi/linux/ife.h>
+#include <linux/etherdevice.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <uapi/linux/ife.h>
+
+void *ife_encode(struct sk_buff *skb, u16 metalen);
+void *ife_decode(struct sk_buff *skb, u16 *metalen);
+
+void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+			const void *dval);
+
+void *ife_tlv_meta_next(void *skbdata);
+
+#endif // __NET_IFE_H
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index d0352a9..27f39bc 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -190,6 +190,7 @@ header-y += if_tun.h
 header-y += if_tunnel.h
 header-y += if_vlan.h
 header-y += if_x25.h
+header-y += ife.h
 header-y += igmp.h
 header-y += ila.h
 header-y += in6.h
diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
new file mode 100644
index 0000000..abec4e0
--- /dev/null
+++ b/include/uapi/linux/ife.h
@@ -0,0 +1,16 @@
+#ifndef __UAPI_IFE_H
+#define __UAPI_IFE_H
+
+#define IFE_METAHDRLEN 2
+
+#define IFE_META_SKBMARK 1
+#define IFE_META_HASHID 2
+#define IFE_META_PRIO 3
+#define IFE_META_QMAP 4
+#define IFE_META_TCINDEX 5
+
+/*Can be overridden at runtime by module option*/
+#define __IFE_META_MAX 6
+#define IFE_META_MAX (__IFE_META_MAX - 1)
+
+#endif
diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..3cf29b1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -393,6 +393,7 @@ source "net/9p/Kconfig"
 source "net/caif/Kconfig"
 source "net/ceph/Kconfig"
 source "net/nfc/Kconfig"
+source "net/ife/Kconfig"
 
 config LWTUNNEL
 	bool "Network light weight tunnels"
diff --git a/net/Makefile b/net/Makefile
index 4cafaa2..4ddc67e 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -69,6 +69,7 @@ obj-$(CONFIG_DNS_RESOLVER)	+= dns_resolver/
 obj-$(CONFIG_CEPH_LIB)		+= ceph/
 obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
 obj-$(CONFIG_NFC)		+= nfc/
+obj-$(CONFIG_NET_IFE)		+= ife/
 obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
 obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
 obj-$(CONFIG_MPLS)		+= mpls/
diff --git a/net/ife/Kconfig b/net/ife/Kconfig
new file mode 100644
index 0000000..31e48b6
--- /dev/null
+++ b/net/ife/Kconfig
@@ -0,0 +1,16 @@
+#
+# IFE subsystem configuration
+#
+
+menuconfig NET_IFE
+	depends on NET
+        tristate "Inter-FE based on IETF ForCES InterFE LFB"
+	default n
+	help
+	  Say Y here to add support of IFE encapsulation protocol
+	  For details refer to netdev01 paper:
+	  "Distributing Linux Traffic Control Classifier-Action Subsystem"
+	   Authors: Jamal Hadi Salim and Damascene M. Joachimpillai
+
+	  To compile this support as a module, choose M here: the module will
+	  be called ife.
diff --git a/net/ife/Makefile b/net/ife/Makefile
new file mode 100644
index 0000000..2a90d97
--- /dev/null
+++ b/net/ife/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the IFE encapsulation protocol
+#
+
+obj-$(CONFIG_NET_IFE) += ife.o
diff --git a/net/ife/ife.c b/net/ife/ife.c
new file mode 100644
index 0000000..6feaa9d
--- /dev/null
+++ b/net/ife/ife.c
@@ -0,0 +1,147 @@
+/*
+ * net/ife/ife.c	Inter-FE protocol based on ForCES WG InterFE LFB
+ *
+ *		Refer to:
+ *		draft-ietf-forces-interfelfb-03
+ *		and
+ *		netdev01 paper:
+ *		"Distributing Linux Traffic Control Classifier-Action
+ *		Subsystem"
+ *		Authors: Jamal Hadi Salim and Damascene M. Joachimpillai
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <linux/etherdevice.h>
+#include <net/ife.h>
+
+void *ife_encode(struct sk_buff *skb, u16 metalen)
+{
+	/* OUTERHDR:TOTMETALEN:{TLVHDR:Metadatum:TLVHDR..}:ORIGDATA
+	 * where ORIGDATA = original ethernet header ...
+	 */
+	int hdrm = metalen + IFE_METAHDRLEN;
+	int total_push = hdrm + skb->dev->hard_header_len;
+	struct ethhdr *iethh;	/* inner ether header */
+	int skboff = 0;
+	int err;
+
+	err = skb_cow_head(skb, total_push);
+	if (unlikely(err))
+		return NULL;
+
+	iethh = (struct ethhdr *) skb->data;
+
+	__skb_push(skb, total_push);
+	memcpy(skb->data, iethh, skb->dev->hard_header_len);
+	skb_reset_mac_header(skb);
+	skboff += skb->dev->hard_header_len;
+
+	/* total metadata length */
+	metalen += IFE_METAHDRLEN;
+	metalen = htons(metalen);
+	memcpy((skb->data + skboff), &metalen, IFE_METAHDRLEN);
+	skboff += IFE_METAHDRLEN;
+
+	return skb->data + skboff;
+}
+EXPORT_SYMBOL_GPL(ife_encode);
+
+struct ifeheadr {
+	__be16 metalen;
+	u8 tlv_data[];
+};
+
+void *ife_decode(struct sk_buff *skb, u16 *metalen)
+{
+	struct ifeheadr *ifehdr;
+	int total_pull;
+	u16 ifehdrln;
+
+	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
+	ifehdrln = ifehdr->metalen;
+	ifehdrln = ntohs(ifehdrln);
+	total_pull = skb->dev->hard_header_len + ifehdrln;
+
+	if (unlikely(ifehdrln < 2))
+		return NULL;
+
+	if (unlikely(!pskb_may_pull(skb, total_pull)))
+		return NULL;
+
+	skb_set_mac_header(skb, total_pull);
+	__skb_pull(skb, total_pull);
+	*metalen = ifehdrln - IFE_METAHDRLEN;
+
+	return &ifehdr->tlv_data;
+}
+EXPORT_SYMBOL_GPL(ife_decode);
+
+struct meta_tlvhdr {
+	__be16 type;
+	__be16 len;
+};
+
+/* Caller takes care of presenting data in network order
+*/
+void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
+{
+	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+
+	*dlen = ntohs(tlv->len) - NLA_HDRLEN;
+	*attrtype = ntohs(tlv->type);
+
+	if (totlen)
+		*totlen = nla_total_size(*dlen);
+
+	return skbdata + sizeof(struct meta_tlvhdr);
+}
+EXPORT_SYMBOL_GPL(ife_tlv_meta_decode);
+
+void *ife_tlv_meta_next(void *skbdata)
+{
+	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+	u16 tlvlen = tlv->len;
+
+	tlvlen = ntohs(tlvlen);
+	tlvlen = NLA_ALIGN(tlvlen);
+
+	return skbdata + tlvlen;
+}
+EXPORT_SYMBOL_GPL(ife_tlv_meta_next);
+
+/* Caller takes care of presenting data in network order
+*/
+int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
+{
+	u32 *tlv = (u32 *) (skbdata);
+	u16 totlen = nla_total_size(dlen);	/*alignment + hdr */
+	char *dptr = (char *) tlv + NLA_HDRLEN;
+	u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
+
+	*tlv = htonl(htlv);
+	memset(dptr, 0, totlen - NLA_HDRLEN);
+	memcpy(dptr, dval, dlen);
+
+	return totlen;
+}
+EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
+
+MODULE_AUTHOR("Jamal Hadi Salim (2015)");
+MODULE_DESCRIPTION("Inter-FE LFB action");
+MODULE_LICENSE("GPL");
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch net-next RFC 2/6] act_ife: Change to use ife module
  2016-10-12 12:41 [patch net-next RFC 0/6] Add support for offloading packet-sampling Jiri Pirko
  2016-10-12 12:41 ` [patch net-next RFC 1/6] Introduce ife encapsulation module Jiri Pirko
@ 2016-10-12 12:41 ` Jiri Pirko
  2016-10-12 12:41 ` [patch net-next RFC 3/6] ife: Introduce new metadata tlv types Jiri Pirko
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-10-12 12:41 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

From: Yotam Gigi <yotam.gi@gmail.com>

Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/tc_act/tc_ife.h        |   3 -
 include/uapi/linux/tc_act/tc_ife.h |  10 +---
 net/sched/Kconfig                  |   1 +
 net/sched/act_ife.c                | 109 +++++++++++--------------------------
 4 files changed, 34 insertions(+), 89 deletions(-)

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index 9fd2bea0..30ba459 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -6,7 +6,6 @@
 #include <linux/rtnetlink.h>
 #include <linux/module.h>
 
-#define IFE_METAHDRLEN 2
 struct tcf_ife_info {
 	struct tc_action common;
 	u8 eth_dst[ETH_ALEN];
@@ -45,8 +44,6 @@ struct tcf_meta_ops {
 
 int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi);
 int ife_get_meta_u16(struct sk_buff *skb, struct tcf_meta_info *mi);
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
-			const void *dval);
 int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi);
diff --git a/include/uapi/linux/tc_act/tc_ife.h b/include/uapi/linux/tc_act/tc_ife.h
index cd18360..7c28178 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -3,6 +3,7 @@
 
 #include <linux/types.h>
 #include <linux/pkt_cls.h>
+#include <linux/ife.h>
 
 #define TCA_ACT_IFE 25
 /* Flag bits for now just encoding/decoding; mutually exclusive */
@@ -28,13 +29,4 @@ enum {
 };
 #define TCA_IFE_MAX (__TCA_IFE_MAX - 1)
 
-#define IFE_META_SKBMARK 1
-#define IFE_META_HASHID 2
-#define	IFE_META_PRIO 3
-#define	IFE_META_QMAP 4
-#define	IFE_META_TCINDEX 5
-/*Can be overridden at runtime by module option*/
-#define	__IFE_META_MAX 6
-#define IFE_META_MAX (__IFE_META_MAX - 1)
-
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 87956a7..24f7cac 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -763,6 +763,7 @@ config NET_ACT_SKBMOD
 config NET_ACT_IFE
         tristate "Inter-FE action based on IETF ForCES InterFE LFB"
         depends on NET_CLS_ACT
+        select NET_IFE
         ---help---
 	  Say Y here to allow for sourcing and terminating metadata
 	  For details refer to netdev01 paper:
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 95c463c..5c2478a 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -32,6 +32,7 @@
 #include <uapi/linux/tc_act/tc_ife.h>
 #include <net/tc_act/tc_ife.h>
 #include <linux/etherdevice.h>
+#include <net/ife.h>
 
 #define IFE_TAB_MASK 15
 
@@ -46,23 +47,6 @@ static const struct nla_policy ife_policy[TCA_IFE_MAX + 1] = {
 	[TCA_IFE_TYPE] = { .type = NLA_U16},
 };
 
-/* Caller takes care of presenting data in network order
-*/
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
-{
-	u32 *tlv = (u32 *)(skbdata);
-	u16 totlen = nla_total_size(dlen);	/*alignment + hdr */
-	char *dptr = (char *)tlv + NLA_HDRLEN;
-	u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
-
-	*tlv = htonl(htlv);
-	memset(dptr, 0, totlen - NLA_HDRLEN);
-	memcpy(dptr, dval, dlen);
-
-	return totlen;
-}
-EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
-
 int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
 {
 	u16 edata = 0;
@@ -637,69 +621,60 @@ int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
 	return 0;
 }
 
-struct ifeheadr {
-	__be16 metalen;
-	u8 tlv_data[];
-};
-
-struct meta_tlvhdr {
-	__be16 type;
-	__be16 len;
-};
-
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 			  struct tcf_result *res)
 {
 	struct tcf_ife_info *ife = to_ife(a);
+	u32 at = G_TC_AT(skb->tc_verd);
 	int action = ife->tcf_action;
-	struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
-	int ifehdrln = (int)ifehdr->metalen;
-	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
+	u8 *ifehdr_end;
+	u8 *tlv_data;
+	u16 metalen;
 
 	spin_lock(&ife->tcf_lock);
 	bstats_update(&ife->tcf_bstats, skb);
 	tcf_lastuse_update(&ife->tcf_tm);
 	spin_unlock(&ife->tcf_lock);
 
-	ifehdrln = ntohs(ifehdrln);
-	if (unlikely(!pskb_may_pull(skb, ifehdrln))) {
+	if (!(at & AT_EGRESS))
+		skb_push(skb, skb->dev->hard_header_len);
+
+	tlv_data = ife_decode(skb, &metalen);
+	if (unlikely(!tlv_data)) {
 		spin_lock(&ife->tcf_lock);
 		ife->tcf_qstats.drops++;
 		spin_unlock(&ife->tcf_lock);
 		return TC_ACT_SHOT;
 	}
 
-	skb_set_mac_header(skb, ifehdrln);
-	__skb_pull(skb, ifehdrln);
-	skb->protocol = eth_type_trans(skb, skb->dev);
-	ifehdrln -= IFE_METAHDRLEN;
-
-	while (ifehdrln > 0) {
-		u8 *tlvdata = (u8 *)tlv;
-		u16 mtype = tlv->type;
-		u16 mlen = tlv->len;
-		u16 alen;
+	ifehdr_end = tlv_data + metalen;
+	for (; tlv_data < ifehdr_end; tlv_data = ife_tlv_meta_next(tlv_data)) {
+		u8 *curr_data;
+		u16 mtype;
+		u16 dlen;
 
-		mtype = ntohs(mtype);
-		mlen = ntohs(mlen);
-		alen = NLA_ALIGN(mlen);
+		curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
 
-		if (find_decode_metaid(skb, ife, mtype, (mlen - NLA_HDRLEN),
-				       (void *)(tlvdata + NLA_HDRLEN))) {
+		if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
 			/* abuse overlimits to count when we receive metadata
 			 * but dont have an ops for it
 			 */
-			pr_info_ratelimited("Unknown metaid %d alnlen %d\n",
-					    mtype, mlen);
+			pr_info_ratelimited("Unknown metaid %d dlen %d\n",
+					    mtype, dlen);
 			ife->tcf_qstats.overlimits++;
 		}
+	}
 
-		tlvdata += alen;
-		ifehdrln -= alen;
-		tlv = (struct meta_tlvhdr *)tlvdata;
+	if (WARN_ON(tlv_data != ifehdr_end)) {
+		spin_lock(&ife->tcf_lock);
+		ife->tcf_qstats.drops++;
+		spin_unlock(&ife->tcf_lock);
+		return TC_ACT_SHOT;
 	}
 
+	skb->protocol = eth_type_trans(skb, skb->dev);
 	skb_reset_network_header(skb);
+
 	return action;
 }
 
@@ -727,7 +702,6 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	struct tcf_ife_info *ife = to_ife(a);
 	int action = ife->tcf_action;
 	struct ethhdr *oethh;	/* outer ether header */
-	struct ethhdr *iethh;	/* inner eth header */
 	struct tcf_meta_info *e;
 	/*
 	   OUTERHDR:TOTMETALEN:{TLVHDR:Metadatum:TLVHDR..}:ORIGDATA
@@ -735,10 +709,11 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	 */
 	u16 metalen = ife_get_sz(skb, ife);
 	int hdrm = metalen + skb->dev->hard_header_len + IFE_METAHDRLEN;
-	unsigned int skboff = skb->dev->hard_header_len;
+	unsigned int skboff = 0;
 	u32 at = G_TC_AT(skb->tc_verd);
 	int new_len = skb->len + hdrm;
 	bool exceed_mtu = false;
+	void *ife_meta;
 	int err;
 
 	if (at & AT_EGRESS) {
@@ -766,27 +741,10 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 		return TC_ACT_SHOT;
 	}
 
-	err = skb_cow_head(skb, hdrm);
-	if (unlikely(err)) {
-		ife->tcf_qstats.drops++;
-		spin_unlock(&ife->tcf_lock);
-		return TC_ACT_SHOT;
-	}
-
 	if (!(at & AT_EGRESS))
 		skb_push(skb, skb->dev->hard_header_len);
 
-	iethh = (struct ethhdr *)skb->data;
-	__skb_push(skb, hdrm);
-	memcpy(skb->data, iethh, skb->mac_len);
-	skb_reset_mac_header(skb);
-	oethh = eth_hdr(skb);
-
-	/*total metadata length */
-	metalen += IFE_METAHDRLEN;
-	metalen = htons(metalen);
-	memcpy((skb->data + skboff), &metalen, IFE_METAHDRLEN);
-	skboff += IFE_METAHDRLEN;
+	ife_meta = ife_encode(skb, metalen);
 
 	/* XXX: we dont have a clever way of telling encode to
 	 * not repeat some of the computations that are done by
@@ -794,7 +752,7 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	 */
 	list_for_each_entry(e, &ife->metalist, metalist) {
 		if (e->ops->encode) {
-			err = e->ops->encode(skb, (void *)(skb->data + skboff),
+			err = e->ops->encode(skb, (void *)(ife_meta + skboff),
 					     e);
 		}
 		if (err < 0) {
@@ -805,15 +763,12 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 		}
 		skboff += err;
 	}
+	oethh = (struct ethhdr *)skb->data;
 
 	if (!is_zero_ether_addr(ife->eth_src))
 		ether_addr_copy(oethh->h_source, ife->eth_src);
-	else
-		ether_addr_copy(oethh->h_source, iethh->h_source);
 	if (!is_zero_ether_addr(ife->eth_dst))
 		ether_addr_copy(oethh->h_dest, ife->eth_dst);
-	else
-		ether_addr_copy(oethh->h_dest, iethh->h_dest);
 	oethh->h_proto = htons(ife->eth_type);
 
 	if (!(at & AT_EGRESS))
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch net-next RFC 3/6] ife: Introduce new metadata tlv types
  2016-10-12 12:41 [patch net-next RFC 0/6] Add support for offloading packet-sampling Jiri Pirko
  2016-10-12 12:41 ` [patch net-next RFC 1/6] Introduce ife encapsulation module Jiri Pirko
  2016-10-12 12:41 ` [patch net-next RFC 2/6] act_ife: Change to use ife module Jiri Pirko
@ 2016-10-12 12:41 ` Jiri Pirko
  2016-10-12 12:41 ` [patch net-next RFC 4/6] Introduce sample tc action Jiri Pirko
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-10-12 12:41 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

From: Yotam Gigi <yotam.gi@gmail.com>

 - IFE_META_IFINDEX: Allow to pass ifindex value as part of the ife
   metadata
 - IFE_META_ORIG_SIZE: Allow to pass the original packet size as part of
   the ife metadata. Can be used in case that the packet is truncated
 - IFE_META_SIZE: Allow to pass the size of the encapsulated packet as
   part of the ife metadata

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/uapi/linux/ife.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
index abec4e0..990f29e 100644
--- a/include/uapi/linux/ife.h
+++ b/include/uapi/linux/ife.h
@@ -8,6 +8,9 @@
 #define IFE_META_PRIO 3
 #define IFE_META_QMAP 4
 #define IFE_META_TCINDEX 5
+#define IFE_META_IFINDEX 6
+#define IFE_META_ORIGSIZE 7
+#define IFE_META_SIZE 8
 
 /*Can be overridden at runtime by module option*/
 #define __IFE_META_MAX 6
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-12 12:41 [patch net-next RFC 0/6] Add support for offloading packet-sampling Jiri Pirko
                   ` (2 preceding siblings ...)
  2016-10-12 12:41 ` [patch net-next RFC 3/6] ife: Introduce new metadata tlv types Jiri Pirko
@ 2016-10-12 12:41 ` Jiri Pirko
  2016-10-15 16:34   ` Roopa Prabhu
  2016-10-16 10:27   ` Or Gerlitz
  2016-10-12 12:41 ` [patch net-next RFC 5/6] mlxsw: reg: add the Monitoring Packet Sampling Configuration Register Jiri Pirko
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-10-12 12:41 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

From: Yotam Gigi <yotam.gi@gmail.com>

This action allow the user to sample traffic matched by tc classifier.
The sampling consists of choosing packets randomly, truncating them,
adding some informative metadata regarding the interface and the original
packet size and mark them with specific mark, to allow further tc rules to
match and process. The marked sample packets are then injected into the
device ingress qdisc using netif_receive_skb.

The packets metadata is packed using the ife encapsulation protocol, and
the outer packet's ethernet dest, source and eth_type, along with the
rate, mark and the optional truncation size can be configured from
userspace.

Example:
To sample ingress traffic from interface eth1, and redirect the sampled
the sampled packets to interface dummy0, one may use the commands:

tc qdisc add dev eth1 handle ffff: ingress

tc filter add dev eth1 parent ffff: \
	   matchall action sample rate 12 mark 17

tc filter add parent ffff: dev eth1 protocol all \
	   u32 match mark 172 0xff
	   action mirred egress redirect dev dummy0

Where the first command adds an ingress qdisc and the second starts
sampling every 12'th packet on dev eth0 and marks the sampled packets with
17. The command third catches the sampled packets, which are marked with
17, and redirects them to dev dummy0.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/tc_act/tc_sample.h        |  88 ++++++++++
 include/uapi/linux/tc_act/Kbuild      |   1 +
 include/uapi/linux/tc_act/tc_sample.h |  31 ++++
 net/sched/Kconfig                     |  13 ++
 net/sched/Makefile                    |   1 +
 net/sched/act_sample.c                | 300 ++++++++++++++++++++++++++++++++++
 6 files changed, 434 insertions(+)
 create mode 100644 include/net/tc_act/tc_sample.h
 create mode 100644 include/uapi/linux/tc_act/tc_sample.h
 create mode 100644 net/sched/act_sample.c

diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
new file mode 100644
index 0000000..a2b445a
--- /dev/null
+++ b/include/net/tc_act/tc_sample.h
@@ -0,0 +1,88 @@
+#ifndef __NET_TC_SAMPLE_H
+#define __NET_TC_SAMPLE_H
+
+#include <net/act_api.h>
+#include <linux/tc_act/tc_sample.h>
+
+struct tcf_sample {
+	struct tc_action	common;
+	u32			rate;
+	u32			mark;
+	bool			truncate;
+	u32			trunc_size;
+	u32			packet_counter;
+	u8			eth_dst[ETH_ALEN];
+	u8			eth_src[ETH_ALEN];
+	u16			eth_type;
+	bool			eth_type_set;
+	struct list_head	tcfm_list;
+};
+#define to_sample(a) ((struct tcf_sample *)a)
+
+struct sample_packet_metadata {
+	int sample_size;
+	int orig_size;
+	int ifindex;
+};
+
+#if IS_ENABLED(NET_ACT_SAMPLE)
+struct ethhdr *sample_packet_pack(struct sk_buff *skb,
+				  struct sample_packet_metadata *metadata);
+#else
+struct ethhdr *sample_packet_pack(struct sk_buff *skb,
+				  struct sample_packet_metadata *metadata)
+{
+	return NULL;
+}
+#endif
+
+static inline bool is_tcf_sample(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+	return a->ops && a->ops->type == TCA_ACT_SAMPLE;
+#else
+	return false;
+#endif
+}
+
+static inline __u32 tcf_sample_mark(const struct tc_action *a)
+{
+	return to_sample(a)->mark;
+}
+
+static inline __u32 tcf_sample_rate(const struct tc_action *a)
+{
+	return to_sample(a)->rate;
+}
+
+static inline bool tcf_sample_truncate(const struct tc_action *a)
+{
+	return to_sample(a)->truncate;
+}
+
+static inline int tcf_sample_trunc_size(const struct tc_action *a)
+{
+	return to_sample(a)->trunc_size;
+}
+
+static inline u16 tcf_sample_eth_type(const struct tc_action *a)
+{
+	return to_sample(a)->eth_type;
+}
+
+static inline bool tcf_sample_eth_type_set(const struct tc_action *a)
+{
+	return to_sample(a)->eth_type_set;
+}
+
+static inline void tcf_sample_eth_dst_addr(const struct tc_action *a, u8 *dst)
+{
+	ether_addr_copy(dst, to_sample(a)->eth_dst);
+}
+
+static inline void tcf_sample_eth_src_addr(const struct tc_action *a, u8 *src)
+{
+	ether_addr_copy(src, to_sample(a)->eth_src);
+}
+
+#endif /* __NET_TC_SAMPLE_H */
diff --git a/include/uapi/linux/tc_act/Kbuild b/include/uapi/linux/tc_act/Kbuild
index e3969bd..6c6b8d6 100644
--- a/include/uapi/linux/tc_act/Kbuild
+++ b/include/uapi/linux/tc_act/Kbuild
@@ -4,6 +4,7 @@ header-y += tc_defact.h
 header-y += tc_gact.h
 header-y += tc_ipt.h
 header-y += tc_mirred.h
+header-y += tc_sample.h
 header-y += tc_nat.h
 header-y += tc_pedit.h
 header-y += tc_skbedit.h
diff --git a/include/uapi/linux/tc_act/tc_sample.h b/include/uapi/linux/tc_act/tc_sample.h
new file mode 100644
index 0000000..654945b
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_sample.h
@@ -0,0 +1,31 @@
+#ifndef __LINUX_TC_SAMPLE_H
+#define __LINUX_TC_SAMPLE_H
+
+#include <linux/types.h>
+#include <linux/pkt_cls.h>
+#include <linux/if_ether.h>
+
+#define TCA_ACT_SAMPLE 26
+
+struct tc_sample {
+	tc_gen;
+	__u32		rate;		/* sample rate */
+	__u32		mark;		/* mark to put on the sampled packets */
+	bool		truncate;	/* whether to truncate the packets */
+	__u32		trunc_size;	/* truncation size */
+	__u8		eth_dst[ETH_ALEN]; /* encapsulated mac destination */
+	__u8		eth_src[ETH_ALEN]; /* encapsulated mac source */
+	bool		eth_type_set;	   /* whether to overrid ethtype */
+	__u16		eth_type;	   /* encapsulated mac ethtype */
+};
+
+enum {
+	TCA_SAMPLE_UNSPEC,
+	TCA_SAMPLE_TM,
+	TCA_SAMPLE_PARMS,
+	TCA_SAMPLE_PAD,
+	__TCA_SAMPLE_MAX
+};
+#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 24f7cac..c54ea6b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -650,6 +650,19 @@ config NET_ACT_MIRRED
 	  To compile this code as a module, choose M here: the
 	  module will be called act_mirred.
 
+config NET_ACT_SAMPLE
+        tristate "Traffic Sampling"
+        depends on NET_CLS_ACT
+        select NET_IFE
+        ---help---
+	  Say Y here to allow packet sampling tc action. The packet sample
+	  action consists of statistically duplicating packets, truncating them
+	  and adding descriptive metadata to them. The duplicated packets are
+	  then marked to allow further processing using tc.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called act_sample.
+
 config NET_ACT_IPT
         tristate "IPtables targets"
         depends on NET_CLS_ACT && NETFILTER && IP_NF_IPTABLES
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 4bdda36..7b915d2 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_NET_CLS_ACT)	+= act_api.o
 obj-$(CONFIG_NET_ACT_POLICE)	+= act_police.o
 obj-$(CONFIG_NET_ACT_GACT)	+= act_gact.o
 obj-$(CONFIG_NET_ACT_MIRRED)	+= act_mirred.o
+obj-$(CONFIG_NET_ACT_SAMPLE)	+= act_sample.o
 obj-$(CONFIG_NET_ACT_IPT)	+= act_ipt.o
 obj-$(CONFIG_NET_ACT_NAT)	+= act_nat.o
 obj-$(CONFIG_NET_ACT_PEDIT)	+= act_pedit.o
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
new file mode 100644
index 0000000..1477825
--- /dev/null
+++ b/net/sched/act_sample.c
@@ -0,0 +1,300 @@
+/*
+ * net/sched/act_sample.c	packet sampling tc action
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Yotam Gigi <yotamg@mellanox.com> (2016)
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/gfp.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <linux/tc_act/tc_sample.h>
+#include <net/tc_act/tc_sample.h>
+#include <net/ife.h>
+
+#include <linux/if_arp.h>
+
+#define SAMPLE_TAB_MASK     7
+static int sample_net_id;
+static struct tc_action_ops act_sample_ops;
+
+static const struct nla_policy sample_policy[TCA_SAMPLE_MAX + 1] = {
+	[TCA_SAMPLE_PARMS]	= { .len = sizeof(struct tc_sample) },
+};
+
+static int tcf_sample_init(struct net *net, struct nlattr *nla,
+			   struct nlattr *est, struct tc_action **a, int ovr,
+			   int bind)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+	struct nlattr *tb[TCA_SAMPLE_MAX + 1];
+	struct tc_sample *parm;
+	struct tcf_sample *s;
+	int ret;
+	bool exists = false;
+
+	if (!nla)
+		return -EINVAL;
+	ret = nla_parse_nested(tb, TCA_SAMPLE_MAX, nla, sample_policy);
+	if (ret < 0)
+		return ret;
+	if (!tb[TCA_SAMPLE_PARMS])
+		return -EINVAL;
+	parm = nla_data(tb[TCA_SAMPLE_PARMS]);
+
+	exists = tcf_hash_check(tn, parm->index, a, bind);
+	if (exists && bind)
+		return 0;
+
+	if (!exists) {
+		ret = tcf_hash_create(tn, parm->index, est, a,
+				      &act_sample_ops, bind, false);
+		if (ret)
+			return ret;
+		ret = ACT_P_CREATED;
+	} else {
+		tcf_hash_release(*a, bind);
+		if (!ovr)
+			return -EEXIST;
+	}
+	s = to_sample(*a);
+
+	ASSERT_RTNL();
+	s->tcf_action = parm->action;
+	s->rate = parm->rate;
+	s->mark = parm->mark;
+	s->truncate = parm->truncate;
+	s->trunc_size = parm->trunc_size;
+	s->eth_type = parm->eth_type;
+	s->eth_type_set = parm->eth_type_set;
+	s->packet_counter = 0;
+
+	if (parm->eth_dst)
+		ether_addr_copy(s->eth_dst, parm->eth_dst);
+	else
+		eth_zero_addr(s->eth_dst);
+	if (parm->eth_src)
+		ether_addr_copy(s->eth_src, parm->eth_src);
+	else
+		eth_zero_addr(s->eth_src);
+
+	if (ret == ACT_P_CREATED)
+		tcf_hash_insert(tn, *a);
+	return ret;
+}
+
+static bool dev_ok_push(struct net_device *dev)
+{
+	switch (dev->type) {
+	case ARPHRD_TUNNEL:
+	case ARPHRD_TUNNEL6:
+	case ARPHRD_SIT:
+	case ARPHRD_IPGRE:
+	case ARPHRD_VOID:
+	case ARPHRD_NONE:
+		return false;
+	default:
+		return true;
+	}
+}
+
+struct ethhdr *sample_packet_pack(struct sk_buff *skb,
+				  struct sample_packet_metadata *metadata)
+{
+	int sample_size;
+	int orig_size;
+	void *ifetlv;
+	int ifindex;
+	u16 metalen;
+
+	metalen = nla_total_size(sizeof(metadata->ifindex)) +
+		  nla_total_size(sizeof(metadata->orig_size)) +
+		  nla_total_size(sizeof(metadata->sample_size));
+
+	ifindex = htonl(metadata->ifindex);
+	orig_size = htonl(metadata->orig_size);
+	sample_size = htonl(metadata->sample_size);
+
+	ifetlv = ife_encode(skb, metalen);
+	if (!ifetlv)
+		return NULL;
+
+	ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_IFINDEX,
+				      sizeof(ifindex), &ifindex);
+
+	ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_ORIGSIZE,
+				      sizeof(orig_size), &orig_size);
+
+	ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SIZE,
+				      sizeof(sample_size), &sample_size);
+
+	return (struct ethhdr *) skb->data;
+}
+EXPORT_SYMBOL(sample_packet_pack);
+
+static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
+		      struct tcf_result *res)
+{
+	struct tcf_sample *s = to_sample(a);
+	struct sample_packet_metadata metadata;
+	static struct ethhdr *ethhdr;
+	struct sk_buff *skb2;
+	int retval;
+	u32 at;
+
+	tcf_lastuse_update(&s->tcf_tm);
+	bstats_cpu_update(this_cpu_ptr(s->common.cpu_bstats), skb);
+
+	rcu_read_lock();
+	retval = READ_ONCE(s->tcf_action);
+
+	if (++s->packet_counter % s->rate == 0) {
+		skb2 = skb_copy(skb, GFP_ATOMIC);
+		if (!skb2)
+			goto out;
+
+		if (s->truncate)
+			skb_trim(skb2, s->trunc_size);
+
+		at = G_TC_AT(skb->tc_verd);
+		skb2->mac_len = skb->mac_len;
+
+		/* on ingress, the mac header gets poped, so push it back */
+		if (!(at & AT_EGRESS) && dev_ok_push(skb->dev))
+			skb_push(skb2, skb2->mac_len);
+
+		metadata.ifindex = skb->dev->ifindex;
+		metadata.orig_size = skb->len + skb->dev->hard_header_len;
+		metadata.sample_size = skb2->len;
+		ethhdr = sample_packet_pack(skb2, (void *)&metadata);
+		if (!ethhdr)
+			goto out;
+
+		if (!is_zero_ether_addr(s->eth_src))
+			ether_addr_copy(ethhdr->h_source, s->eth_src);
+		if (!is_zero_ether_addr(s->eth_dst))
+			ether_addr_copy(ethhdr->h_dest, s->eth_dst);
+		if (s->eth_type_set)
+			ethhdr->h_proto = htons(s->eth_type);
+
+		skb2->mark = s->mark;
+		netif_receive_skb(skb2);
+
+		/* mirror is always swallowed */
+		skb2->tc_verd = SET_TC_FROM(skb2->tc_verd, at);
+	}
+out:
+	rcu_read_unlock();
+
+	return retval;
+}
+
+static int tcf_sample_dump(struct sk_buff *skb, struct tc_action *a,
+			   int bind, int ref)
+{
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tcf_sample *s = to_sample(a);
+	struct tc_sample opt = {
+		.index      = s->tcf_index,
+		.action     = s->tcf_action,
+		.refcnt     = s->tcf_refcnt - ref,
+		.bindcnt    = s->tcf_bindcnt - bind,
+		.rate       = s->rate,
+		.mark       = s->mark,
+		.trunc_size = s->trunc_size,
+		.truncate   = s->truncate,
+	};
+	struct tcf_t t;
+
+	if (nla_put(skb, TCA_SAMPLE_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	tcf_tm_dump(&t, &s->tcf_tm);
+	if (nla_put_64bit(skb, TCA_SAMPLE_TM, sizeof(t), &t, TCA_SAMPLE_PAD))
+		goto nla_put_failure;
+	return skb->len;
+
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int tcf_sample_walker(struct net *net, struct sk_buff *skb,
+			     struct netlink_callback *cb, int type,
+			     const struct tc_action_ops *ops)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+	return tcf_generic_walker(tn, skb, cb, type, ops);
+}
+
+static int tcf_sample_search(struct net *net, struct tc_action **a, u32 index)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+	return tcf_hash_search(tn, a, index);
+}
+
+static struct tc_action_ops act_sample_ops = {
+	.kind		=	"sample",
+	.type		=	TCA_ACT_SAMPLE,
+	.owner		=	THIS_MODULE,
+	.act		=	tcf_sample,
+	.dump		=	tcf_sample_dump,
+	.init		=	tcf_sample_init,
+	.walk		=	tcf_sample_walker,
+	.lookup		=	tcf_sample_search,
+	.size		=	sizeof(struct tcf_sample),
+};
+
+static __net_init int sample_init_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+	return tc_action_net_init(tn, &act_sample_ops, SAMPLE_TAB_MASK);
+}
+
+static void __net_exit sample_exit_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+	tc_action_net_exit(tn);
+}
+
+static struct pernet_operations sample_net_ops = {
+	.init = sample_init_net,
+	.exit = sample_exit_net,
+	.id   = &sample_net_id,
+	.size = sizeof(struct tc_action_net),
+};
+
+MODULE_AUTHOR("Yotam Gigi (2016)");
+MODULE_DESCRIPTION("Packet sampling action");
+MODULE_LICENSE("GPL");
+
+static int __init sample_init_module(void)
+{
+	return tcf_register_action(&act_sample_ops, &sample_net_ops);
+}
+
+static void __exit sample_cleanup_module(void)
+{
+	tcf_unregister_action(&act_sample_ops, &sample_net_ops);
+}
+
+module_init(sample_init_module);
+module_exit(sample_cleanup_module);
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch net-next RFC 5/6] mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
  2016-10-12 12:41 [patch net-next RFC 0/6] Add support for offloading packet-sampling Jiri Pirko
                   ` (3 preceding siblings ...)
  2016-10-12 12:41 ` [patch net-next RFC 4/6] Introduce sample tc action Jiri Pirko
@ 2016-10-12 12:41 ` Jiri Pirko
  2016-10-12 12:41 ` [patch net-next RFC 6/6] mlxsw: packet sample: Add packet sample offloading support Jiri Pirko
  2016-10-13  7:29 ` [patch net-next RFC 0/6] Add support for offloading packet-sampling Roopa Prabhu
  6 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-10-12 12:41 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

From: Yotam Gigi <yotam.gi@gmail.com>

The MPSC register allows to configure ingress packet sampling on specific
port of the mlxsw device. The sampled packets are then trapped via
PKT_SAMPLE trap.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 43 +++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 6460c72..e657865 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -4832,6 +4832,47 @@ static inline void mlxsw_reg_mlcr_pack(char *payload, u8 local_port,
 					   MLXSW_REG_MLCR_DURATION_MAX : 0);
 }
 
+/* MPSC - Monitoring Packet Sampling Configuration Register
+ * --------------------------------------------------------
+ * MPSC Register is used to configure the Packet Sampling mechanism.
+ */
+#define MLXSW_REG_MPSC_ID 0x9080
+#define MLXSW_REG_MPSC_LEN 0x14
+
+static const struct mlxsw_reg_info mlxsw_reg_mpsc = {
+	.id = MLXSW_REG_MPSC_ID,
+	.len = MLXSW_REG_MPSC_LEN,
+};
+
+/* reg_mpsc_local_port
+ * Local port number
+ * Not supported for CPU port
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, mpsc, local_port, 0x00, 16, 8);
+
+/* reg_mpsc_e
+ * Enable sampling on port local_port
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mpsc, e, 0x04, 30, 1);
+
+/* reg_mpsc_rate
+ * Sampling rate = 1 out of rate packets (with randomization around
+ * the point). Valid values are: 1 to 3.5*10^9
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mpsc, rate, 0x08, 0, 32);
+
+static inline void mlxsw_reg_mpsc_pack(char *payload, u8 local_port, bool e,
+				       u32 rate)
+{
+	MLXSW_REG_ZERO(mpsc, payload);
+	mlxsw_reg_mpsc_local_port_set(payload, local_port);
+	mlxsw_reg_mpsc_e_set(payload, e);
+	mlxsw_reg_mpsc_rate_set(payload, rate);
+}
+
 /* SBPR - Shared Buffer Pools Register
  * -----------------------------------
  * The SBPR configures and retrieves the shared buffer pools and configuration.
@@ -5367,6 +5408,8 @@ static inline const char *mlxsw_reg_id_str(u16 reg_id)
 		return "MTMP";
 	case MLXSW_REG_MLCR_ID:
 		return "MLCR";
+	case MLXSW_REG_MPSC_ID:
+		return "MPSC";
 	case MLXSW_REG_SBPR_ID:
 		return "SBPR";
 	case MLXSW_REG_SBCM_ID:
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch net-next RFC 6/6] mlxsw: packet sample: Add packet sample offloading support
  2016-10-12 12:41 [patch net-next RFC 0/6] Add support for offloading packet-sampling Jiri Pirko
                   ` (4 preceding siblings ...)
  2016-10-12 12:41 ` [patch net-next RFC 5/6] mlxsw: reg: add the Monitoring Packet Sampling Configuration Register Jiri Pirko
@ 2016-10-12 12:41 ` Jiri Pirko
  2016-10-13  7:29 ` [patch net-next RFC 0/6] Add support for offloading packet-sampling Roopa Prabhu
  6 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-10-12 12:41 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

From: Yotam Gigi <yotam.gi@gmail.com>

Using the MPSC regiter, add the functions that configure port packets
sampling in hardware and the necessary datatypes in the mlxsw_sp_port
struct. In addition, add the necessary trap for sampled packets and
integrate with matchall offloading to allow offloading of the sample tc
action.

The current offload support is for the tc command:

tc filter add dev <DEV> parent ffff:   \
	  matchall   \
	  action sample rate <RATE> mark <MARK> [trunc <SIZE>]   \
	  		[src <SADDR>] [dst <DADDR>] [type <TYPE>]

Where only ingress qdiscs are supported, and only a combination of
matchall classifier and sample action will lead to activating hardware
packet sampling.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 116 +++++++++++++++++++++++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  11 +++
 drivers/net/ethernet/mellanox/mlxsw/trap.h     |   1 +
 3 files changed, 122 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 1ec0a4c..f9504ae 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -57,6 +57,7 @@
 #include <net/pkt_cls.h>
 #include <net/tc_act/tc_mirred.h>
 #include <net/netevent.h>
+#include <net/tc_act/tc_sample.h>
 
 #include "spectrum.h"
 #include "core.h"
@@ -467,6 +468,16 @@ static void mlxsw_sp_span_mirror_remove(struct mlxsw_sp_port *from,
 	mlxsw_sp_span_inspected_port_unbind(from, span_entry, type);
 }
 
+static int mlxsw_sp_port_sample_set(struct mlxsw_sp_port *mlxsw_sp_port,
+				    bool enable, u32 rate)
+{
+	struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+	char mpsc_pl[MLXSW_REG_MPSC_LEN];
+
+	mlxsw_reg_mpsc_pack(mpsc_pl, mlxsw_sp_port->local_port, enable, rate);
+	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(mpsc), mpsc_pl);
+}
+
 static int mlxsw_sp_port_admin_status_set(struct mlxsw_sp_port *mlxsw_sp_port,
 					  bool is_up)
 {
@@ -1221,6 +1232,46 @@ err_mirror_add:
 	return err;
 }
 
+static int
+mlxsw_sp_port_add_cls_matchall_sample(struct mlxsw_sp_port *mlxsw_sp_port,
+				      struct tc_cls_matchall_offload *cls,
+				      const struct tc_action *a,
+				      bool ingress)
+{
+	struct mlxsw_sp_port_mall_tc_entry *mall_tc_entry;
+	int err;
+
+	if (mlxsw_sp_port->sample.enable) {
+		netdev_err(mlxsw_sp_port->dev, "Sample already active\n");
+		return -EEXIST;
+	}
+
+	err = mlxsw_sp_port_sample_set(mlxsw_sp_port, true, tcf_sample_rate(a));
+	if (err)
+		return err;
+
+	mlxsw_sp_port->sample.enable = true;
+	mlxsw_sp_port->sample.mark = tcf_sample_mark(a);
+	mlxsw_sp_port->sample.truncate = tcf_sample_truncate(a);
+	mlxsw_sp_port->sample.trunc_size = tcf_sample_trunc_size(a);
+	mlxsw_sp_port->sample.eth_type = tcf_sample_eth_type(a);
+	mlxsw_sp_port->sample.eth_type_set = tcf_sample_eth_type_set(a);
+	tcf_sample_eth_dst_addr(a, mlxsw_sp_port->sample.eth_dst);
+	tcf_sample_eth_src_addr(a, mlxsw_sp_port->sample.eth_src);
+
+	netdev_dbg(mlxsw_sp_port->dev, "Activate hardware sample\n");
+
+	mall_tc_entry = kzalloc(sizeof(*mall_tc_entry), GFP_KERNEL);
+	if (!mall_tc_entry)
+		return -ENOMEM;
+
+	mall_tc_entry->cookie = cls->cookie;
+	mall_tc_entry->type = MLXSW_SP_PORT_MALL_SAMPLE;
+	list_add_tail(&mall_tc_entry->list, &mlxsw_sp_port->mall_tc_list);
+
+	return 0;
+}
+
 static int mlxsw_sp_port_add_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
 					  __be16 protocol,
 					  struct tc_cls_matchall_offload *cls,
@@ -1236,15 +1287,19 @@ static int mlxsw_sp_port_add_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
 	}
 
 	tcf_exts_to_list(cls->exts, &actions);
-	list_for_each_entry(a, &actions, list) {
-		if (!is_tcf_mirred_mirror(a) || protocol != htons(ETH_P_ALL))
-			return -ENOTSUPP;
+	a = list_first_entry(&actions, struct tc_action, list);
 
+	if (is_tcf_mirred_mirror(a) && protocol == htons(ETH_P_ALL))
 		err = mlxsw_sp_port_add_cls_matchall_mirror(mlxsw_sp_port, cls,
 							    a, ingress);
-		if (err)
-			return err;
-	}
+	else if (is_tcf_sample(a) && protocol == htons(ETH_P_ALL))
+		err = mlxsw_sp_port_add_cls_matchall_sample(mlxsw_sp_port, cls,
+							    a, ingress);
+	else
+		return -ENOTSUPP;
+
+	if (err)
+		return err;
 
 	return 0;
 }
@@ -1272,6 +1327,10 @@ static void mlxsw_sp_port_del_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
 
 		mlxsw_sp_span_mirror_remove(mlxsw_sp_port, to_port, span_type);
 		break;
+	case MLXSW_SP_PORT_MALL_SAMPLE:
+		mlxsw_sp_port->sample.enable = false;
+		mlxsw_sp_port_sample_set(mlxsw_sp_port, false, 1);
+		break;
 	default:
 		WARN_ON(1);
 	}
@@ -2739,6 +2798,48 @@ static void mlxsw_sp_rx_listener_mark_func(struct sk_buff *skb, u8 local_port,
 	return mlxsw_sp_rx_listener_func(skb, local_port, priv);
 }
 
+static void mlxsw_sp_rx_listener_sample_func(struct sk_buff *skb, u8 local_port,
+					     void *priv)
+{
+	struct mlxsw_sp *mlxsw_sp = priv;
+	struct mlxsw_sp_port *mlxsw_sp_port = mlxsw_sp->ports[local_port];
+	struct sample_packet_metadata packet_metadata;
+	static struct ethhdr *ethhdr;
+
+	if (unlikely(!mlxsw_sp_port)) {
+		dev_warn_ratelimited(mlxsw_sp->bus_info->dev, "Port %d: sample skb received for non-existent port\n",
+				     local_port);
+		return;
+	}
+
+	skb->dev = mlxsw_sp_port->dev;
+	skb->mac_len = skb->dev->hard_header_len;
+	skb->mark = mlxsw_sp_port->sample.mark;
+
+	packet_metadata.ifindex = skb->dev->ifindex;
+	packet_metadata.orig_size = skb->len;
+
+	if (mlxsw_sp_port->sample.truncate)
+		skb_trim(skb, mlxsw_sp_port->sample.trunc_size);
+
+	packet_metadata.sample_size = skb->len;
+
+	ethhdr = sample_packet_pack(skb, &packet_metadata);
+	if (!ethhdr)
+		return;
+
+	if (!is_zero_ether_addr(mlxsw_sp_port->sample.eth_src))
+		ether_addr_copy(ethhdr->h_source,
+				mlxsw_sp_port->sample.eth_src);
+	if (!is_zero_ether_addr(mlxsw_sp_port->sample.eth_dst))
+		ether_addr_copy(ethhdr->h_dest, mlxsw_sp_port->sample.eth_dst);
+	if (mlxsw_sp_port->sample.eth_type_set)
+		ethhdr->h_proto = htons(mlxsw_sp_port->sample.eth_type);
+
+	skb->protocol = eth_type_trans(skb, skb->dev);
+	netif_receive_skb(skb);
+}
+
 #define MLXSW_SP_RXL(_func, _trap_id, _action)			\
 	{							\
 		.func = _func,					\
@@ -2749,6 +2850,9 @@ static void mlxsw_sp_rx_listener_mark_func(struct sk_buff *skb, u8 local_port,
 
 static const struct mlxsw_rx_listener mlxsw_sp_rx_listener[] = {
 	MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, FDB_MC, TRAP_TO_CPU),
+	MLXSW_SP_RXL(mlxsw_sp_rx_listener_sample_func, PKT_SAMPLE,
+		     MIRROR_TO_CPU),
+
 	/* Traps for specific L2 packet types, not trapped as FDB MC */
 	MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, STP, TRAP_TO_CPU),
 	MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, LACP, TRAP_TO_CPU),
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 9b22863..673a727 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -229,6 +229,7 @@ struct mlxsw_sp_span_entry {
 
 enum mlxsw_sp_port_mall_action_type {
 	MLXSW_SP_PORT_MALL_MIRROR,
+	MLXSW_SP_PORT_MALL_SAMPLE,
 };
 
 struct mlxsw_sp_port_mall_mirror_tc_entry {
@@ -362,6 +363,16 @@ struct mlxsw_sp_port {
 		struct rtnl_link_stats64 *cache;
 		struct delayed_work update_dw;
 	} hw_stats;
+	struct {
+		bool enable;
+		u32 mark;
+		bool truncate;
+		u32 trunc_size;
+		u8 eth_src[ETH_ALEN];
+		u8 eth_dst[ETH_ALEN];
+		u16 eth_type;
+		bool eth_type_set;
+	} sample;
 };
 
 struct mlxsw_sp_port *mlxsw_sp_port_lower_dev_hold(struct net_device *dev);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/trap.h b/drivers/net/ethernet/mellanox/mlxsw/trap.h
index ed8e301..eebd135 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/trap.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/trap.h
@@ -54,6 +54,7 @@ enum {
 	MLXSW_TRAP_ID_IGMP_V2_REPORT = 0x32,
 	MLXSW_TRAP_ID_IGMP_V2_LEAVE = 0x33,
 	MLXSW_TRAP_ID_IGMP_V3_REPORT = 0x34,
+	MLXSW_TRAP_ID_PKT_SAMPLE = 0x38,
 	MLXSW_TRAP_ID_ARPBC = 0x50,
 	MLXSW_TRAP_ID_ARPUC = 0x51,
 	MLXSW_TRAP_ID_MTUERROR = 0x52,
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 0/6] Add support for offloading packet-sampling
  2016-10-12 12:41 [patch net-next RFC 0/6] Add support for offloading packet-sampling Jiri Pirko
                   ` (5 preceding siblings ...)
  2016-10-12 12:41 ` [patch net-next RFC 6/6] mlxsw: packet sample: Add packet sample offloading support Jiri Pirko
@ 2016-10-13  7:29 ` Roopa Prabhu
  2016-10-13  8:48   ` Jiri Pirko
  6 siblings, 1 reply; 24+ messages in thread
From: Roopa Prabhu @ 2016-10-13  7:29 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

On 10/12/16, 5:41 AM, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Add the sample tc action, which allows to sample packet matching
> a classifier. The sample action peeks randomly packets, duplicates them,
> truncates them and adds informative metadata on the packet, for example,
> the input interface and the original packet length. The sampled packets
> are marked to allow matching them and redirecting them to a specific
> collector device.
>
> The sampled packets metadata is packed using ife encapsulation. To do
> that, this patch-set extracts ife logics from the tc_ife action into an
> independent ife module, and uses that functionality to pack the metadata.
> To include all the needed metadata, this patch-set introduces some new
> IFE_META tlv types.
>
> In addition, Add the support for offloading the matchall-sample tc command
> in the Mellanox mlxsw driver, for ingress qdiscs.
>
> Yotam Gigi (6):
>   Introduce ife encapsulation module
>   act_ife: Change to use ife module
>   ife: Introduce new metadata tlv types
>   Introduce sample tc action
>   mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
>   mlxsw: packet sample: Add packet sample offloading support
>

we spoke with yotam about this at netdev1.2. and also remember speaking about this on our switchdev calls:
Today our driver uses NFLOG to log packets to a netlink socket and hsflowd supported by the sflow
people (at http://www.sflow.net/) is capable of reading from a nflog socket. NFLOG has the required netlink
attribute markers for packet header/data (which we can possibly extend). We could also add nflog like action
in tc if needed.

sflow agents like hsflowd are capable of sending packets to an external collector with the required sflow header.
Instead of re-inventing a new API for sflow, would be better to standardize/unify on existing mechanisms.

Also, this patch series requires a new device to be created which can be avoided if we used
existing mechanisms like NFLOG.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 0/6] Add support for offloading packet-sampling
  2016-10-13  7:29 ` [patch net-next RFC 0/6] Add support for offloading packet-sampling Roopa Prabhu
@ 2016-10-13  8:48   ` Jiri Pirko
  2016-10-13 11:49     ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2016-10-13  8:48 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: netdev, davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

Thu, Oct 13, 2016 at 09:29:57AM CEST, roopa@cumulusnetworks.com wrote:
>On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> Add the sample tc action, which allows to sample packet matching
>> a classifier. The sample action peeks randomly packets, duplicates them,
>> truncates them and adds informative metadata on the packet, for example,
>> the input interface and the original packet length. The sampled packets
>> are marked to allow matching them and redirecting them to a specific
>> collector device.
>>
>> The sampled packets metadata is packed using ife encapsulation. To do
>> that, this patch-set extracts ife logics from the tc_ife action into an
>> independent ife module, and uses that functionality to pack the metadata.
>> To include all the needed metadata, this patch-set introduces some new
>> IFE_META tlv types.
>>
>> In addition, Add the support for offloading the matchall-sample tc command
>> in the Mellanox mlxsw driver, for ingress qdiscs.
>>
>> Yotam Gigi (6):
>>   Introduce ife encapsulation module
>>   act_ife: Change to use ife module
>>   ife: Introduce new metadata tlv types
>>   Introduce sample tc action
>>   mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
>>   mlxsw: packet sample: Add packet sample offloading support
>>
>
>we spoke with yotam about this at netdev1.2. and also remember speaking about this on our switchdev calls:
>Today our driver uses NFLOG to log packets to a netlink socket and hsflowd supported by the sflow
>people (at http://www.sflow.net/) is capable of reading from a nflog socket. NFLOG has the required netlink
>attribute markers for packet header/data (which we can possibly extend). We could also add nflog like action
>in tc if needed.
>
>sflow agents like hsflowd are capable of sending packets to an external collector with the required sflow header.
>Instead of re-inventing a new API for sflow, would be better to standardize/unify on existing mechanisms.
>
>Also, this patch series requires a new device to be created which can be avoided if we used
>existing mechanisms like NFLOG.

When I was first thinking about re-using NFLOG, it seemed like an
abusal. We need to call it from driver directly, which sounds odd.
However, since we use sample_packet_pack function to wrap it up, the
NFLOG is called from the tc action code, it does not look bad.
Yet still, this has nothing in common with netfilter, only using it's
log facilities. That is odd.

I think that the IFE ways is way more clear and generic and not-abusing.
However you are right the NFLOG way has advantage of existing user
component. I'm not sure how to do this :(

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 0/6] Add support for offloading packet-sampling
  2016-10-13  8:48   ` Jiri Pirko
@ 2016-10-13 11:49     ` Jamal Hadi Salim
  2016-10-13 12:10       ` Jiri Pirko
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2016-10-13 11:49 UTC (permalink / raw)
  To: Jiri Pirko, Roopa Prabhu
  Cc: netdev, davem, yotamg, idosch, eladr, nogahf, ogerlitz,
	geert+renesas, stephen, xiyou.wangcong, linux

On 16-10-13 04:48 AM, Jiri Pirko wrote:
> Thu, Oct 13, 2016 at 09:29:57AM CEST, roopa@cumulusnetworks.com wrote:
>> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko <jiri@mellanox.com>
>>>
[..]
>>
>> we spoke with yotam about this at netdev1.2. and also remember speaking about this on our switchdev calls:
>> Today our driver uses NFLOG to log packets to a netlink socket and hsflowd supported by the sflow
>> people (at http://www.sflow.net/) is capable of reading from a nflog socket. NFLOG has the required netlink
>> attribute markers for packet header/data (which we can possibly extend). We could also add nflog like action
>> in tc if needed.
>>
>> sflow agents like hsflowd are capable of sending packets to an external collector with the required sflow header.
>> Instead of re-inventing a new API for sflow, would be better to standardize/unify on existing mechanisms.
>>
>> Also, this patch series requires a new device to be created which can be avoided if we used
>> existing mechanisms like NFLOG.
>
> When I was first thinking about re-using NFLOG, it seemed like an
> abusal. We need to call it from driver directly, which sounds odd.
> However, since we use sample_packet_pack function to wrap it up, the
> NFLOG is called from the tc action code, it does not look bad.
> Yet still, this has nothing in common with netfilter, only using it's
> log facilities. That is odd.
>

Sorry, had not seen the code until now; helps me get perspective.
If you are going to require  netfilter just so you can do this - it
sounds so wrong (since you already provides a hook for tc offloading
into the switch for other functions).
Roopa, did you mean eth1 as the new device or did you mean just in
general config requiring a device to be specified or did you mean a new
cpu netdev being needed? I couldnt tell from the patch.

> I think that the IFE ways is way more clear and generic and not-abusing.
> However you are right the NFLOG way has advantage of existing user
> component. I'm not sure how to do this :(

Can you do NFLOG to user space without requiring netfilter compiled in?
One advantage with IFE is it is a wire protocol - so you can have the
sflow collector/aggregator sit on a different machine (for small cpu
switches makes sense). So modifying the sflow daemon to accept IFE
formatted data is an interesting!

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 0/6] Add support for offloading packet-sampling
  2016-10-13 11:49     ` Jamal Hadi Salim
@ 2016-10-13 12:10       ` Jiri Pirko
  2016-10-13 12:30         ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2016-10-13 12:10 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Roopa Prabhu, netdev, davem, yotamg, idosch, eladr, nogahf,
	ogerlitz, geert+renesas, stephen, xiyou.wangcong, linux

Thu, Oct 13, 2016 at 01:49:07PM CEST, jhs@mojatatu.com wrote:
>On 16-10-13 04:48 AM, Jiri Pirko wrote:
>> Thu, Oct 13, 2016 at 09:29:57AM CEST, roopa@cumulusnetworks.com wrote:
>> > On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>> > > From: Jiri Pirko <jiri@mellanox.com>
>> > > 
>[..]
>> > 
>> > we spoke with yotam about this at netdev1.2. and also remember speaking about this on our switchdev calls:
>> > Today our driver uses NFLOG to log packets to a netlink socket and hsflowd supported by the sflow
>> > people (at http://www.sflow.net/) is capable of reading from a nflog socket. NFLOG has the required netlink
>> > attribute markers for packet header/data (which we can possibly extend). We could also add nflog like action
>> > in tc if needed.
>> > 
>> > sflow agents like hsflowd are capable of sending packets to an external collector with the required sflow header.
>> > Instead of re-inventing a new API for sflow, would be better to standardize/unify on existing mechanisms.
>> > 
>> > Also, this patch series requires a new device to be created which can be avoided if we used
>> > existing mechanisms like NFLOG.
>> 
>> When I was first thinking about re-using NFLOG, it seemed like an
>> abusal. We need to call it from driver directly, which sounds odd.
>> However, since we use sample_packet_pack function to wrap it up, the
>> NFLOG is called from the tc action code, it does not look bad.
>> Yet still, this has nothing in common with netfilter, only using it's
>> log facilities. That is odd.
>> 
>
>Sorry, had not seen the code until now; helps me get perspective.
>If you are going to require  netfilter just so you can do this - it
>sounds so wrong (since you already provides a hook for tc offloading
>into the switch for other functions).

+1


>Roopa, did you mean eth1 as the new device or did you mean just in
>general config requiring a device to be specified or did you mean a new
>cpu netdev being needed? I couldnt tell from the patch.

You just have to have some netdev to use to funnel the IFE headered
sample skbs to userspace. A dummy or a tap.


>
>> I think that the IFE ways is way more clear and generic and not-abusing.
>> However you are right the NFLOG way has advantage of existing user
>> component. I'm not sure how to do this :(
>
>Can you do NFLOG to user space without requiring netfilter compiled in?
>One advantage with IFE is it is a wire protocol - so you can have the
>sflow collector/aggregator sit on a different machine (for small cpu
>switches makes sense). So modifying the sflow daemon to accept IFE
>formatted data is an interesting!

Agreed. For me, that looks like the correct way to do this.


>
>cheers,
>jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 0/6] Add support for offloading packet-sampling
  2016-10-13 12:10       ` Jiri Pirko
@ 2016-10-13 12:30         ` Jamal Hadi Salim
  2016-10-13 12:45           ` Jiri Pirko
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2016-10-13 12:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Roopa Prabhu, netdev, davem, yotamg, idosch, eladr, nogahf,
	ogerlitz, geert+renesas, stephen, xiyou.wangcong, linux

On 16-10-13 08:10 AM, Jiri Pirko wrote:
> Thu, Oct 13, 2016 at 01:49:07PM CEST, jhs@mojatatu.com wrote:
>> On 16-10-13 04:48 AM, Jiri Pirko wrote:

[..]
>> Roopa, did you mean eth1 as the new device or did you mean just in
>> general config requiring a device to be specified or did you mean a new
>> cpu netdev being needed? I couldnt tell from the patch.
>
> You just have to have some netdev to use to funnel the IFE headered
> sample skbs to userspace. A dummy or a tap.
>

I see.
So with nflog you get basically a backend using a netlink socket
but in your case you will redirect to tuntap for the case of local
sflow but some other device for remote? I am assuming using dummy
would require a packet socket as means of retrieving the data.
If you take the structuring of the metadata that nflog uses it should
be easy to transpose.
To Roopa's point, however: Would it not make sense to support nflog
(in addition?).

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 0/6] Add support for offloading packet-sampling
  2016-10-13 12:30         ` Jamal Hadi Salim
@ 2016-10-13 12:45           ` Jiri Pirko
  2016-10-14  5:02             ` Roopa Prabhu
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2016-10-13 12:45 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Roopa Prabhu, netdev, davem, yotamg, idosch, eladr, nogahf,
	ogerlitz, geert+renesas, stephen, xiyou.wangcong, linux

Thu, Oct 13, 2016 at 02:30:19PM CEST, jhs@mojatatu.com wrote:
>On 16-10-13 08:10 AM, Jiri Pirko wrote:
>> Thu, Oct 13, 2016 at 01:49:07PM CEST, jhs@mojatatu.com wrote:
>> > On 16-10-13 04:48 AM, Jiri Pirko wrote:
>
>[..]
>> > Roopa, did you mean eth1 as the new device or did you mean just in
>> > general config requiring a device to be specified or did you mean a new
>> > cpu netdev being needed? I couldnt tell from the patch.
>> 
>> You just have to have some netdev to use to funnel the IFE headered
>> sample skbs to userspace. A dummy or a tap.
>> 
>
>I see.
>So with nflog you get basically a backend using a netlink socket
>but in your case you will redirect to tuntap for the case of local
>sflow but some other device for remote? I am assuming using dummy
>would require a packet socket as means of retrieving the data.

Correct. The idea is that the userspace app would create a tap device,
setup the sampling packets to be sent there and recieve them
over chardev. Or the remote delivery could be use to push the sampling
packet to a remote host.


>If you take the structuring of the metadata that nflog uses it should
>be easy to transpose.

Yes, we do it with IFE, this patchset implements that.


>To Roopa's point, however: Would it not make sense to support nflog
>(in addition?).
>
>cheers,
>jamal
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 0/6] Add support for offloading packet-sampling
  2016-10-13 12:45           ` Jiri Pirko
@ 2016-10-14  5:02             ` Roopa Prabhu
  0 siblings, 0 replies; 24+ messages in thread
From: Roopa Prabhu @ 2016-10-14  5:02 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, netdev, davem, Yotam Gigi, Ido Schimmel,
	Elad Raz, Nogah Frankel, Or Gerlitz, geert+renesas, stephen,
	Cong Wang, Guenter Roeck, Shrijeet Mukherjee

On Thu, Oct 13, 2016 at 5:45 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Oct 13, 2016 at 02:30:19PM CEST, jhs@mojatatu.com wrote:
>>On 16-10-13 08:10 AM, Jiri Pirko wrote:
>>> Thu, Oct 13, 2016 at 01:49:07PM CEST, jhs@mojatatu.com wrote:
>>> > On 16-10-13 04:48 AM, Jiri Pirko wrote:
>>
>>[..]
>>> > Roopa, did you mean eth1 as the new device or did you mean just in
>>> > general config requiring a device to be specified or did you mean a new
>>> > cpu netdev being needed? I couldnt tell from the patch.
>>>
>>> You just have to have some netdev to use to funnel the IFE headered
>>> sample skbs to userspace. A dummy or a tap.
>>>
>>
>>I see.
>>So with nflog you get basically a backend using a netlink socket
>>but in your case you will redirect to tuntap for the case of local
>>sflow but some other device for remote? I am assuming using dummy
>>would require a packet socket as means of retrieving the data.
>
> Correct. The idea is that the userspace app would create a tap device,
> setup the sampling packets to be sent there and recieve them
> over chardev. Or the remote delivery could be use to push the sampling
> packet to a remote host.
>
>
>>If you take the structuring of the metadata that nflog uses it should
>>be easy to transpose.
>
> Yes, we do it with IFE, this patchset implements that.
>
>
>>To Roopa's point, however: Would it not make sense to support nflog
>>(in addition?).
>>

[sorry responding to all conversations so far here]

using ife for delivery of sampled packets to remote is a good option
to have if you have users.
so far I have seen agents collecting samples locally and have their
own protocol to ship them
to a collector (example sflow). Just bringing that up so that we don't
optimize for the less common case
and make the common case difficult to use :).

In my conversations with the sflow people (founders) and others,
netlink as a mechanism for sampled packet
delivery (similar to ulog/nflog) has proven useful and they see it as
a great API to standardize on going forward (Given they are
already using netlink for collecting other samples like stats etc).
something to thing about.
The people I know collecting samples are happy with having netfilter.
agreed that tc already has an existing hw
offload mechanism. and I was not suggesting giving up on tc either.

and also to jiri, agree, I don't think logging from the driver is a
good option. I was merely suggesting
having a similar option without the need for a new collector device.

The three steps in the patch series to collect samples + a device
seems a bit heavy weight.
but, if you think you have users for it, sure. having multiple api's
is also an option.
But api's come with a cost of maintaining them for ever.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-12 12:41 ` [patch net-next RFC 4/6] Introduce sample tc action Jiri Pirko
@ 2016-10-15 16:34   ` Roopa Prabhu
  2016-10-15 17:31     ` Roopa Prabhu
  2016-10-17 10:10     ` Jamal Hadi Salim
  2016-10-16 10:27   ` Or Gerlitz
  1 sibling, 2 replies; 24+ messages in thread
From: Roopa Prabhu @ 2016-10-15 16:34 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux

On 10/12/16, 5:41 AM, Jiri Pirko wrote:
> From: Yotam Gigi <yotam.gi@gmail.com>
>
> This action allow the user to sample traffic matched by tc classifier.
> The sampling consists of choosing packets randomly, truncating them,
> adding some informative metadata regarding the interface and the original
> packet size and mark them with specific mark, to allow further tc rules to
> match and process. The marked sample packets are then injected into the
> device ingress qdisc using netif_receive_skb.
>
> The packets metadata is packed using the ife encapsulation protocol, and
> the outer packet's ethernet dest, source and eth_type, along with the
> rate, mark and the optional truncation size can be configured from
> userspace.
>
> Example:
> To sample ingress traffic from interface eth1, and redirect the sampled
> the sampled packets to interface dummy0, one may use the commands:
>
> tc qdisc add dev eth1 handle ffff: ingress
>
> tc filter add dev eth1 parent ffff: \
> 	   matchall action sample rate 12 mark 17
>
> tc filter add parent ffff: dev eth1 protocol all \
> 	   u32 match mark 172 0xff
> 	   action mirred egress redirect dev dummy0
>
> Where the first command adds an ingress qdisc and the second starts
> sampling every 12'th packet on dev eth0 and marks the sampled packets with
> 17. The command third catches the sampled packets, which are marked with
> 17, and redirects them to dev dummy0.
>
> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
channeling some feedback from Peter Phaal @sflow inline below:

> ---
>  
> diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
> new file mode 100644
> index 0000000..a2b445a
> --- /dev/null
> +++ b/include/net/tc_act/tc_sample.h
> @@ -0,0 +1,88 @@
> +#ifndef __NET_TC_SAMPLE_H
> +#define __NET_TC_SAMPLE_H
> +
> +#include <net/act_api.h>
> +#include <linux/tc_act/tc_sample.h>
> +
> +struct tcf_sample {
> +	struct tc_action	common;
> +	u32			rate;
> +	u32			mark;
> +	bool			truncate;
> +	u32			trunc_size;
> +	u32			packet_counter;
> +	u8			eth_dst[ETH_ALEN];
> +	u8			eth_src[ETH_ALEN];
> +	u16			eth_type;
> +	bool			eth_type_set;
> +	struct list_head	tcfm_list;
> +};
> +#define to_sample(a) ((struct tcf_sample *)a)
> +
> +struct sample_packet_metadata {
> +	int sample_size;
> +	int orig_size;
> +	int ifindex;
> +};
> +
This metadata does not look extensible.. can it be made to ?

With sflow in context, you need a pair of ifindex numbers to encode ingress and egress ports. Ideally you would also include a sequence number and a count of the total number of packets that were candidates for sampling. The OVS implementation is a good example, the metadata includes all the actions applied to the packet in the kernel data path.



[snip]

> diff --git a/include/uapi/linux/tc_act/tc_sample.h b/include/uapi/linux/tc_act/tc_sample.h
> new file mode 100644
> index 0000000..654945b
> --- /dev/null
> +++ b/include/uapi/linux/tc_act/tc_sample.h
> @@ -0,0 +1,31 @@
> +#ifndef __LINUX_TC_SAMPLE_H
> +#define __LINUX_TC_SAMPLE_H
> +
> +#include <linux/types.h>
> +#include <linux/pkt_cls.h>
> +#include <linux/if_ether.h>
> +
> +#define TCA_ACT_SAMPLE 26
> +
> +struct tc_sample {
> +	tc_gen;
> +	__u32		rate;		/* sample rate */
> +	__u32		mark;		/* mark to put on the sampled packets */
> +	bool		truncate;	/* whether to truncate the packets */
> +	__u32		trunc_size;	/* truncation size */
> +	__u8		eth_dst[ETH_ALEN]; /* encapsulated mac destination */
> +	__u8		eth_src[ETH_ALEN]; /* encapsulated mac source */
> +	bool		eth_type_set;	   /* whether to overrid ethtype */
> +	__u16		eth_type;	   /* encapsulated mac ethtype */
> +};
> +
this does not look extensible and is part of UAPI ..

Doing the minimum in the kernel and leaving the rest to the user space agent is much more flexible. The user space agent can attach additional metadata and offer more flexibility in forwarding (sFlow uses XDR encoding over UDP and is routable over IPv4/IPv6).



> +enum {
> +	TCA_SAMPLE_UNSPEC,
> +	TCA_SAMPLE_TM,
> +	TCA_SAMPLE_PARMS,
> +	TCA_SAMPLE_PAD,
> +	__TCA_SAMPLE_MAX
> +};
> +#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
> +
> +#endif
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 24f7cac..c54ea6b 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -650,6 +650,19 @@ config NET_ACT_MIRRED
>  	  To compile this code as a module, choose M here: the
>  	  module will be called act_mirred.
>  
> +config NET_ACT_SAMPLE
> +        tristate "Traffic Sampling"
> +        depends on NET_CLS_ACT
> +        select NET_IFE
> +        ---help---
> +	  Say Y here to allow packet sampling tc action. The packet sample
> +	  action consists of statistically duplicating packets, truncating them
> +	  and adding descriptive metadata to them. The duplicated packets are
> +	  then marked to allow further processing using tc.
> +
> +	  To compile this code as a module, choose M here: the
> +	  module will be called act_sample.
> +
>  config NET_ACT_IPT
>          tristate "IPtables targets"
>          depends on NET_CLS_ACT && NETFILTER && IP_NF_IPTABLES
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index 4bdda36..7b915d2 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
[snip]
> +static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
> +		      struct tcf_result *res)
> +{
> +	struct tcf_sample *s = to_sample(a);
> +	struct sample_packet_metadata metadata;
> +	static struct ethhdr *ethhdr;
> +	struct sk_buff *skb2;
> +	int retval;
> +	u32 at;
> +
> +	tcf_lastuse_update(&s->tcf_tm);
> +	bstats_cpu_update(this_cpu_ptr(s->common.cpu_bstats), skb);
> +
> +	rcu_read_lock();
> +	retval = READ_ONCE(s->tcf_action);
> +
> +	if (++s->packet_counter % s->rate == 0) {

The sampling function isn’t random

if (++s->packet_counter % s->rate == 0) {

This is unsuitable for sFlow, which is specific about the random sampling function required. BPF, OVS, and the 
ULOG statistics module include efficient kernel based random sampling functions that could be used instead.


Thanks,
Roopa

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-15 16:34   ` Roopa Prabhu
@ 2016-10-15 17:31     ` Roopa Prabhu
  2016-10-17 10:10     ` Jamal Hadi Salim
  1 sibling, 0 replies; 24+ messages in thread
From: Roopa Prabhu @ 2016-10-15 17:31 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux,
	Shrijeet Mukherjee

On 10/15/16, 9:34 AM, Roopa Prabhu wrote:
> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>> From: Yotam Gigi <yotam.gi@gmail.com>
>>
>> This action allow the user to sample traffic matched by tc classifier.
>> The sampling consists of choosing packets randomly, truncating them,
>> adding some informative metadata regarding the interface and the original
>> packet size and mark them with specific mark, to allow further tc rules to
>> match and process. The marked sample packets are then injected into the
>> device ingress qdisc using netif_receive_skb.
>>
>> The packets metadata is packed using the ife encapsulation protocol, and
>> the outer packet's ethernet dest, source and eth_type, along with the
>> rate, mark and the optional truncation size can be configured from
>> userspace.
>>
>> Example:
>> To sample ingress traffic from interface eth1, and redirect the sampled
>> the sampled packets to interface dummy0, one may use the commands:
>>
>> tc qdisc add dev eth1 handle ffff: ingress
>>
>> tc filter add dev eth1 parent ffff: \
>> 	   matchall action sample rate 12 mark 17
>>
>> tc filter add parent ffff: dev eth1 protocol all \
>> 	   u32 match mark 172 0xff
>> 	   action mirred egress redirect dev dummy0
>>
>> Where the first command adds an ingress qdisc and the second starts
>> sampling every 12'th packet on dev eth0 and marks the sampled packets with
>> 17. The command third catches the sampled packets, which are marked with
>> 17, and redirects them to dev dummy0.
>>
>> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> channeling some feedback from Peter Phaal @sflow inline below:
>
>
If it helps, one more thing that came up was using bpf.
They also use bpf filters for pkt sampling in the non-offloaded case:
http://blog.sflow.com/2016/05/berkeley-packet-filter-bpf.html

so, existing apps (like sflow) that care about packet sampling do prefer to use
a socket api for sample delivery: netlink nflog or bpf like socket filters

also, to keep the software and hardware models the same, wondering if ebpf attach
can be a viable option (have not thought about the offloaded case completely yet).
This would give apps more control on attaching sample headers (like sflow) if needed.

thanks,
Roopa

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-12 12:41 ` [patch net-next RFC 4/6] Introduce sample tc action Jiri Pirko
  2016-10-15 16:34   ` Roopa Prabhu
@ 2016-10-16 10:27   ` Or Gerlitz
  2016-10-18  8:33     ` Yotam Gigi
  1 sibling, 1 reply; 24+ messages in thread
From: Or Gerlitz @ 2016-10-16 10:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Linux Netdev List, David Miller, Yotam Gigi, Ido Schimmel,
	Elad Raz, Nogah Frankel, Or Gerlitz, Jamal Hadi Salim,
	geert+renesas, Stephen Hemminger, Cong Wang, Guenter Roeck

On Wed, Oct 12, 2016 at 3:41 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> From: Yotam Gigi <yotam.gi@gmail.com>
>
> This action allow the user to sample traffic matched by tc classifier.
> The sampling consists of choosing packets randomly, truncating them,
> adding some informative metadata regarding the interface and the original
> packet size and mark them with specific mark, to allow further tc rules to
> match and process. The marked sample packets are then injected into the
> device ingress qdisc using netif_receive_skb.
>
> The packets metadata is packed using the ife encapsulation protocol, and
> the outer packet's ethernet dest, source and eth_type, along with the
> rate, mark and the optional truncation size can be configured from
> userspace.
>
> Example:
> To sample ingress traffic from interface eth1, and redirect the sampled
> the sampled packets to interface dummy0, one may use the commands:
>
> tc qdisc add dev eth1 handle ffff: ingress
>
> tc filter add dev eth1 parent ffff: \
>            matchall action sample rate 12 mark 17
>
> tc filter add parent ffff: dev eth1 protocol all \
>            u32 match mark 172 0xff
>            action mirred egress redirect dev dummy0
>
> Where the first command adds an ingress qdisc and the second starts
> sampling every 12'th packet on dev eth0 and marks the sampled packets with
> 17. The command third catches the sampled packets, which are marked with
> 17, and redirects them to dev dummy0.

eth0 --> eth1

command third --> third command

don't we need a re-classify directive for the u32 filter to apply
after the marking done by the matchall rule + sample action
or is that implicit?


> diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
> new file mode 100644
> index 0000000..a2b445a
> --- /dev/null
> +++ b/include/net/tc_act/tc_sample.h
> @@ -0,0 +1,88 @@
> +#ifndef __NET_TC_SAMPLE_H
> +#define __NET_TC_SAMPLE_H
> +
> +#include <net/act_api.h>
> +#include <linux/tc_act/tc_sample.h>
> +
> +struct tcf_sample {
> +       struct tc_action        common;
> +       u32                     rate;
> +       u32                     mark;
> +       bool                    truncate;
> +       u32                     trunc_size;
> +       u32                     packet_counter;
> +       u8                      eth_dst[ETH_ALEN];
> +       u8                      eth_src[ETH_ALEN];
> +       u16                     eth_type;
> +       bool                    eth_type_set;
> +       struct list_head        tcfm_list;
> +};

> +++ b/include/uapi/linux/tc_act/tc_sample.h
> @@ -0,0 +1,31 @@
> +#ifndef __LINUX_TC_SAMPLE_H
> +#define __LINUX_TC_SAMPLE_H
> +
> +#include <linux/types.h>
> +#include <linux/pkt_cls.h>
> +#include <linux/if_ether.h>
> +
> +#define TCA_ACT_SAMPLE 26
> +
> +struct tc_sample {
> +       tc_gen;
> +       __u32           rate;           /* sample rate */
> +       __u32           mark;           /* mark to put on the sampled packets */
> +       bool            truncate;       /* whether to truncate the packets */
> +       __u32           trunc_size;     /* truncation size */
> +       __u8            eth_dst[ETH_ALEN]; /* encapsulated mac destination */
> +       __u8            eth_src[ETH_ALEN]; /* encapsulated mac source */
> +       bool            eth_type_set;      /* whether to overrid ethtype */
> +       __u16           eth_type;          /* encapsulated mac ethtype */
> +};

overrid --> override

what do you mean by override here, to encapsulate?

consider using 0 as special value, e.g no truncation and no encapsulation

best if you just define the netlink attributes (document on the RHS
the type, see the uapi
for the new tunnel key action) and let the tc action in-kernel code to
decode them directly
into the non UAPI structure. This way you are extendable and also
avoid having two
structs which is sort of confusing.

> +
> +enum {
> +       TCA_SAMPLE_UNSPEC,
> +       TCA_SAMPLE_TM,
> +       TCA_SAMPLE_PARMS,
> +       TCA_SAMPLE_PAD,
> +       __TCA_SAMPLE_MAX
> +};
> +#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
> +
> +#endif


> +static bool dev_ok_push(struct net_device *dev)
> +{
> +       switch (dev->type) {
> +       case ARPHRD_TUNNEL:
> +       case ARPHRD_TUNNEL6:
> +       case ARPHRD_SIT:
> +       case ARPHRD_IPGRE:
> +       case ARPHRD_VOID:
> +       case ARPHRD_NONE:
> +               return false;
> +       default:
> +               return true;
> +       }
> +}
> +

> +static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
> +                     struct tcf_result *res)
> +{
> +       struct tcf_sample *s = to_sample(a);
> +       struct sample_packet_metadata metadata;
> +       static struct ethhdr *ethhdr;
> +       struct sk_buff *skb2;
> +       int retval;
> +       u32 at;
> +
> +       tcf_lastuse_update(&s->tcf_tm);
> +       bstats_cpu_update(this_cpu_ptr(s->common.cpu_bstats), skb);
> +
> +       rcu_read_lock();
> +       retval = READ_ONCE(s->tcf_action);
> +
> +       if (++s->packet_counter % s->rate == 0) {
> +               skb2 = skb_copy(skb, GFP_ATOMIC);
> +               if (!skb2)
> +                       goto out;
> +
> +               if (s->truncate)
> +                       skb_trim(skb2, s->trunc_size);
> +
> +               at = G_TC_AT(skb->tc_verd);
> +               skb2->mac_len = skb->mac_len;
> +
> +               /* on ingress, the mac header gets poped, so push it back */
> +               if (!(at & AT_EGRESS) && dev_ok_push(skb->dev))
> +                       skb_push(skb2, skb2->mac_len);
> +

what's the exact role of the !(at & AT_EGRESS) check?

and if !dev_ok_push(.) - are we just fine to continue here without
that push? maybe
worth documenting that corner a bit



> +               metadata.ifindex = skb->dev->ifindex;
> +               metadata.orig_size = skb->len + skb->dev->hard_header_len;
> +               metadata.sample_size = skb2->len;
> +               ethhdr = sample_packet_pack(skb2, (void *)&metadata);
> +               if (!ethhdr)
> +                       goto out;
> +
> +               if (!is_zero_ether_addr(s->eth_src))
> +                       ether_addr_copy(ethhdr->h_source, s->eth_src);
> +               if (!is_zero_ether_addr(s->eth_dst))
> +                       ether_addr_copy(ethhdr->h_dest, s->eth_dst);
> +               if (s->eth_type_set)
> +                       ethhdr->h_proto = htons(s->eth_type);
> +
> +               skb2->mark = s->mark;
> +               netif_receive_skb(skb2);
> +
> +               /* mirror is always swallowed */
> +               skb2->tc_verd = SET_TC_FROM(skb2->tc_verd, at);
> +       }
> +out:
> +       rcu_read_unlock();
> +
> +       return retval;
> +}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-15 16:34   ` Roopa Prabhu
  2016-10-15 17:31     ` Roopa Prabhu
@ 2016-10-17 10:10     ` Jamal Hadi Salim
  2016-10-18  0:17       ` Roopa Prabhu
  1 sibling, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2016-10-17 10:10 UTC (permalink / raw)
  To: Roopa Prabhu, Jiri Pirko
  Cc: netdev, davem, yotamg, idosch, eladr, nogahf, ogerlitz,
	geert+renesas, stephen, xiyou.wangcong, linux


Some comments:
IIUC, the main struggle seems to be whether the redirect to dummy0
is useful or not? i.e instead of just letting the packets go up the
stack on eth1?
It seems like sflowd needs to read off eth1 via packet socket?
To be backward compatible - supporting that approach seems sensible.

Note:
There is a clear efficiency benefit of both using IFE encoding and
redirecting to dummy0.
1) Redirecting to dummy0 implies you dont need to exercise a bpf
filter around every packet that comes off eth1.
I understand there are probably not millions of pps for this case;
but in a non-offloaded cases it could be millions pps.
And in case of sampling over many ethx devices, you can redirect
samples from many other ethx devices.
So making dummy0 the sflow device is a win.
2) Encaping an IFE header implies a much more efficient bpf filter
(IFE ethertype is an excellent discriminator for bpf).

Additional benefit is as mentioned before - redirecting to a device
means you can send it remotely over ethernet to a more powerful
machine without having to cross kernel-userspace. Redirecting instead
of mirroring to tuntap is also an interesting option.

More comments below (on the sflow person's comment - dont seem him
on the Cc):

On 16-10-15 12:34 PM, Roopa Prabhu wrote:
> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>> From: Yotam Gigi <yotam.gi@gmail.com>


>> +
>> +struct sample_packet_metadata {
>> +	int sample_size;
>> +	int orig_size;
>> +	int ifindex;
>> +};
>> +
> This metadata does not look extensible.. can it be made to ?
>

Sure it can...

> With sflow in context, you need a pair of ifindex numbers to encode ingress and egress ports.

What is the use case for both?
> Ideally you would also include a sequence number and a count of the total number of packets
 > that were candidates for sampling.

Sequence number may make sense (they will help show a gap if something
gets dropped). But i am not sure about the stats consuming such space.
Stats are something that can be queried (tc stats should have a record
of how many bytes/packets )

>The OVS implementation is a good example, the metadata includes all the actions applied
>to the packet in the kernel data path.
>

Again not sure what the use case would be (and why waste such space
especially when you are sending over the wire with such details).

>> +	rcu_read_lock();
>> +	retval = READ_ONCE(s->tcf_action);
>> +
>> +	if (++s->packet_counter % s->rate == 0) {
>
> The sampling function isn’t random
>
> if (++s->packet_counter % s->rate == 0) {
>
> This is unsuitable for sFlow, which is specific about the random sampling function required.
>BPF, OVS, and the
> ULOG statistics module include efficient kernel based random sampling functions that could be used instead.
>

If i understood correctly, the above is a fallback sampling algorithm.
In the case of the spectrum it already does the sampling in the ASIC
so there is no need to repeat it in software.
Agreed that in that case the sampling approach is not sufficiently
random.

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-17 10:10     ` Jamal Hadi Salim
@ 2016-10-18  0:17       ` Roopa Prabhu
  2016-10-18  5:07         ` Roopa Prabhu
  2016-10-18 10:58         ` Yotam Gigi
  0 siblings, 2 replies; 24+ messages in thread
From: Roopa Prabhu @ 2016-10-18  0:17 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, netdev, davem, yotamg, idosch, eladr, nogahf,
	ogerlitz, geert+renesas, stephen, xiyou.wangcong, linux,
	Shrijeet Mukherjee

On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
>
> Some comments:
> IIUC, the main struggle seems to be whether the redirect to dummy0
> is useful or not? i.e instead of just letting the packets go up the
> stack on eth1?

yep, correct...given existing workflow for the non-offloaded case is
to receive sample packets via bpf filter on socket or
use netlink as a sample delivery mechanism (NFLOG eg)


> It seems like sflowd needs to read off eth1 via packet socket?
> To be backward compatible - supporting that approach seems sensible.
>
> Note:
> There is a clear efficiency benefit of both using IFE encoding and
> redirecting to dummy0.
> 1) Redirecting to dummy0 implies you dont need to exercise a bpf
> filter around every packet that comes off eth1.
> I understand there are probably not millions of pps for this case;
> but in a non-offloaded cases it could be millions pps.
> And in case of sampling over many ethx devices, you can redirect
> samples from many other ethx devices.
> So making dummy0 the sflow device is a win.
> 2) Encaping an IFE header implies a much more efficient bpf filter
> (IFE ethertype is an excellent discriminator for bpf).
>
> Additional benefit is as mentioned before - redirecting to a device
> means you can send it remotely over ethernet to a more powerful
> machine without having to cross kernel-userspace. Redirecting instead
> of mirroring to tuntap is also an interesting option.

sure, this seems like a good option to have.
generally you have one instance of the sampling agent on a hyper visor or switch.
But, if you have use-cases where monitoring agents run external, sure.
would have preferred if it was optional or an addon and not the default.

Regarding the device, yeah, agree there are pros and cons.
An additional device just to sample packets seems like an overkill.
But, if there is no other other option, and there are benefits to it, no objections.
Hopefully we can add another option on the existing api to skip the device in the future.


>
>
> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>>> From: Yotam Gigi <yotam.gi@gmail.com>
>
>
>>> +
>>> +struct sample_packet_metadata {
>>> +    int sample_size;
>>> +    int orig_size;
>>> +    int ifindex;
>>> +};
>>> +
>> This metadata does not look extensible.. can it be made to ?
>>
>
> Sure it can...
>
>> With sflow in context, you need a pair of ifindex numbers to encode ingress and egress ports.
>
> What is the use case for both?

I have heard that most monitoring tools have moved to ingress only sampling because of operational
complexity (use case is sflow). I think hardware also supports ingress and egress only sampling.
better to have an option to reflect that in the api.

>> Ideally you would also include a sequence number and a count of the total number of packets
> > that were candidates for sampling.
>
> Sequence number may make sense (they will help show a gap if something
> gets dropped). But i am not sure about the stats consuming such space.
> Stats are something that can be queried (tc stats should have a record
> of how many bytes/packets )

sure, thats fine.
>
>> The OVS implementation is a good example, the metadata includes all the actions applied
>> to the packet in the kernel data path.
>>
>
> Again not sure what the use case would be (and why waste such space
> especially when you are sending over the wire with such details).

All this is being used currently.., But, this can be other api's sflow uses
for monitoring.
http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf

Does not have to be part of the main/basic sampling api...
it was just an example.

>
>>> +    rcu_read_lock();
>>> +    retval = READ_ONCE(s->tcf_action);
>>> +
>>> +    if (++s->packet_counter % s->rate == 0) {
>>
>> The sampling function isn’t random
>>
>> if (++s->packet_counter % s->rate == 0) {
>>
>> This is unsuitable for sFlow, which is specific about the random sampling function required.
>> BPF, OVS, and the
>> ULOG statistics module include efficient kernel based random sampling functions that could be used instead.
>>
>
> If i understood correctly, the above is a fallback sampling algorithm.
> In the case of the spectrum it already does the sampling in the ASIC
> so there is no need to repeat it in software.
> Agreed that in that case the sampling approach is not sufficiently
> random.

yes. and since the same sampling api will be used for offloaded and non-offloaded case,
the sampling algo here for the non-offloaded case...can do better .. atleast match the existing
api efficiency. We would want people to use the same api for the offload and non-offloaded case.

thanks,
Roopa

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-18  0:17       ` Roopa Prabhu
@ 2016-10-18  5:07         ` Roopa Prabhu
  2016-10-18 10:58         ` Yotam Gigi
  1 sibling, 0 replies; 24+ messages in thread
From: Roopa Prabhu @ 2016-10-18  5:07 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, netdev, davem, yotamg, idosch, eladr, nogahf,
	ogerlitz, geert+renesas, stephen, xiyou.wangcong, linux,
	Shrijeet Mukherjee, Peter Phaal

On 10/17/16, 5:17 PM, Roopa Prabhu wrote:
> On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
[snip]

inline below more data/context..

>>>> +
>>>> +struct sample_packet_metadata {
>>>> +    int sample_size;
>>>> +    int orig_size;
>>>> +    int ifindex;
>>>> +};
>>>> +
>>> This metadata does not look extensible.. can it be made to ?
>>>
>> Sure it can...
more sflow context here... [1]

An extensible metadata scheme is  highly desirable when passing data from the dataplane to 
the sampling agent in userspace. Looking forward, advanced instrumentation is being 
added to data planes and keeping the api future proof will help.



>>
>>> With sflow in context, you need a pair of ifindex numbers to encode ingress and egress ports.
>> What is the use case for both?
> I have heard that most monitoring tools have moved to ingress only sampling because of operational
> complexity (use case is sflow). I think hardware also supports ingress and egress only sampling.
> better to have an option to reflect that in the api.

The reason for having two ifindex numbers is to record the ingress and egress ports (i.e. the path that the packet takes through the datapath/ASIC). You may actually have three ifindex numbers associated with a sample:
1. The data source that made the measurement (on a linux system each bridge has its own ifindex)
2. The ifindex associated with the ingress switch port
3. The ifindex associated with the egress switch port.

All three apply irrespective of sampling direction.


thanks,
Roopa

[1] Additional extended flow attributes have been defined to further extend sFlow packet samples:
http://sflow.org/sflow_tunnels.txt <http://sflow.org/sflow_tunnels.txt>
http://sflow.org/sflow_openflow.txt <http://sflow.org/sflow_openflow.txt>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-16 10:27   ` Or Gerlitz
@ 2016-10-18  8:33     ` Yotam Gigi
  0 siblings, 0 replies; 24+ messages in thread
From: Yotam Gigi @ 2016-10-18  8:33 UTC (permalink / raw)
  To: Or Gerlitz, Jiri Pirko
  Cc: Linux Netdev List, David Miller, Ido Schimmel, Elad Raz,
	Nogah Frankel, Or Gerlitz, Jamal Hadi Salim, geert+renesas,
	Stephen Hemminger, Cong Wang, Guenter Roeck



>-----Original Message-----
>From: Or Gerlitz [mailto:gerlitz.or@gmail.com]
>Sent: Sunday, October 16, 2016 1:27 PM
>To: Jiri Pirko <jiri@resnulli.us>
>Cc: Linux Netdev List <netdev@vger.kernel.org>; David Miller
><davem@davemloft.net>; Yotam Gigi <yotamg@mellanox.com>; Ido Schimmel
><idosch@mellanox.com>; Elad Raz <eladr@mellanox.com>; Nogah Frankel
><nogahf@mellanox.com>; Or Gerlitz <ogerlitz@mellanox.com>; Jamal Hadi Salim
><jhs@mojatatu.com>; geert+renesas@glider.be; Stephen Hemminger
><stephen@networkplumber.org>; Cong Wang <xiyou.wangcong@gmail.com>;
>Guenter Roeck <linux@roeck-us.net>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On Wed, Oct 12, 2016 at 3:41 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> From: Yotam Gigi <yotam.gi@gmail.com>
>>
>> This action allow the user to sample traffic matched by tc classifier.
>> The sampling consists of choosing packets randomly, truncating them,
>> adding some informative metadata regarding the interface and the original
>> packet size and mark them with specific mark, to allow further tc rules to
>> match and process. The marked sample packets are then injected into the
>> device ingress qdisc using netif_receive_skb.
>>
>> The packets metadata is packed using the ife encapsulation protocol, and
>> the outer packet's ethernet dest, source and eth_type, along with the
>> rate, mark and the optional truncation size can be configured from
>> userspace.
>>
>> Example:
>> To sample ingress traffic from interface eth1, and redirect the sampled
>> the sampled packets to interface dummy0, one may use the commands:
>>
>> tc qdisc add dev eth1 handle ffff: ingress
>>
>> tc filter add dev eth1 parent ffff: \
>>            matchall action sample rate 12 mark 17
>>
>> tc filter add parent ffff: dev eth1 protocol all \
>>            u32 match mark 172 0xff
>>            action mirred egress redirect dev dummy0
>>
>> Where the first command adds an ingress qdisc and the second starts
>> sampling every 12'th packet on dev eth0 and marks the sampled packets with
>> 17. The command third catches the sampled packets, which are marked with
>> 17, and redirects them to dev dummy0.
>
>eth0 --> eth1
>
>command third --> third command
>

Missed that. Thanks :)

>don't we need a re-classify directive for the u32 filter to apply
>after the marking done by the matchall rule + sample action
>or is that implicit?

No, as the packets are re-injected to the ingress qdisc (as described in the 
commit message). Reclassify won't work as the sampled packets, which are a copy
of the chosen packets are generated inside the sample action and are not part of
the device packet stream.

>
>
>> diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
>> new file mode 100644
>> index 0000000..a2b445a
>> --- /dev/null
>> +++ b/include/net/tc_act/tc_sample.h
>> @@ -0,0 +1,88 @@
>> +#ifndef __NET_TC_SAMPLE_H
>> +#define __NET_TC_SAMPLE_H
>> +
>> +#include <net/act_api.h>
>> +#include <linux/tc_act/tc_sample.h>
>> +
>> +struct tcf_sample {
>> +       struct tc_action        common;
>> +       u32                     rate;
>> +       u32                     mark;
>> +       bool                    truncate;
>> +       u32                     trunc_size;
>> +       u32                     packet_counter;
>> +       u8                      eth_dst[ETH_ALEN];
>> +       u8                      eth_src[ETH_ALEN];
>> +       u16                     eth_type;
>> +       bool                    eth_type_set;
>> +       struct list_head        tcfm_list;
>> +};
>
>> +++ b/include/uapi/linux/tc_act/tc_sample.h
>> @@ -0,0 +1,31 @@
>> +#ifndef __LINUX_TC_SAMPLE_H
>> +#define __LINUX_TC_SAMPLE_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/pkt_cls.h>
>> +#include <linux/if_ether.h>
>> +
>> +#define TCA_ACT_SAMPLE 26
>> +
>> +struct tc_sample {
>> +       tc_gen;
>> +       __u32           rate;           /* sample rate */
>> +       __u32           mark;           /* mark to put on the sampled packets */
>> +       bool            truncate;       /* whether to truncate the packets */
>> +       __u32           trunc_size;     /* truncation size */
>> +       __u8            eth_dst[ETH_ALEN]; /* encapsulated mac destination */
>> +       __u8            eth_src[ETH_ALEN]; /* encapsulated mac source */
>> +       bool            eth_type_set;      /* whether to overrid ethtype */
>> +       __u16           eth_type;          /* encapsulated mac ethtype */
>> +};
>
>overrid --> override

Fixed. Thanks :)

>
>what do you mean by override here, to encapsulate?

No. It’s the IFE header eth_type.

>
>consider using 0 as special value, e.g no truncation and no encapsulation
>
>best if you just define the netlink attributes (document on the RHS
>the type, see the uapi
>for the new tunnel key action) and let the tc action in-kernel code to
>decode them directly
>into the non UAPI structure. This way you are extendable and also
>avoid having two
>structs which is sort of confusing.

You are right. Will do that.

>
>> +
>> +enum {
>> +       TCA_SAMPLE_UNSPEC,
>> +       TCA_SAMPLE_TM,
>> +       TCA_SAMPLE_PARMS,
>> +       TCA_SAMPLE_PAD,
>> +       __TCA_SAMPLE_MAX
>> +};
>> +#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
>> +
>> +#endif
>
>
>> +static bool dev_ok_push(struct net_device *dev)
>> +{
>> +       switch (dev->type) {
>> +       case ARPHRD_TUNNEL:
>> +       case ARPHRD_TUNNEL6:
>> +       case ARPHRD_SIT:
>> +       case ARPHRD_IPGRE:
>> +       case ARPHRD_VOID:
>> +       case ARPHRD_NONE:
>> +               return false;
>> +       default:
>> +               return true;
>> +       }
>> +}
>> +
>
>> +static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
>> +                     struct tcf_result *res)
>> +{
>> +       struct tcf_sample *s = to_sample(a);
>> +       struct sample_packet_metadata metadata;
>> +       static struct ethhdr *ethhdr;
>> +       struct sk_buff *skb2;
>> +       int retval;
>> +       u32 at;
>> +
>> +       tcf_lastuse_update(&s->tcf_tm);
>> +       bstats_cpu_update(this_cpu_ptr(s->common.cpu_bstats), skb);
>> +
>> +       rcu_read_lock();
>> +       retval = READ_ONCE(s->tcf_action);
>> +
>> +       if (++s->packet_counter % s->rate == 0) {
>> +               skb2 = skb_copy(skb, GFP_ATOMIC);
>> +               if (!skb2)
>> +                       goto out;
>> +
>> +               if (s->truncate)
>> +                       skb_trim(skb2, s->trunc_size);
>> +
>> +               at = G_TC_AT(skb->tc_verd);
>> +               skb2->mac_len = skb->mac_len;
>> +
>> +               /* on ingress, the mac header gets poped, so push it back */
>> +               if (!(at & AT_EGRESS) && dev_ok_push(skb->dev))
>> +                       skb_push(skb2, skb2->mac_len);
>> +
>
>what's the exact role of the !(at & AT_EGRESS) check?

Check whether we are in ingress (not in egress). As the doc sais:
/* on ingress, the mac header gets poped, so push it back */

>
>and if !dev_ok_push(.) - are we just fine to continue here without
>that push? maybe
>worth documenting that corner a bit
>

It might be. The ok_push was taken almost as-is from act_mirred.

>
>
>> +               metadata.ifindex = skb->dev->ifindex;
>> +               metadata.orig_size = skb->len + skb->dev->hard_header_len;
>> +               metadata.sample_size = skb2->len;
>> +               ethhdr = sample_packet_pack(skb2, (void *)&metadata);
>> +               if (!ethhdr)
>> +                       goto out;
>> +
>> +               if (!is_zero_ether_addr(s->eth_src))
>> +                       ether_addr_copy(ethhdr->h_source, s->eth_src);
>> +               if (!is_zero_ether_addr(s->eth_dst))
>> +                       ether_addr_copy(ethhdr->h_dest, s->eth_dst);
>> +               if (s->eth_type_set)
>> +                       ethhdr->h_proto = htons(s->eth_type);
>> +
>> +               skb2->mark = s->mark;
>> +               netif_receive_skb(skb2);
>> +
>> +               /* mirror is always swallowed */
>> +               skb2->tc_verd = SET_TC_FROM(skb2->tc_verd, at);
>> +       }
>> +out:
>> +       rcu_read_unlock();
>> +
>> +       return retval;
>> +}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-18  0:17       ` Roopa Prabhu
  2016-10-18  5:07         ` Roopa Prabhu
@ 2016-10-18 10:58         ` Yotam Gigi
  2016-10-19  7:33           ` Roopa Prabhu
  1 sibling, 1 reply; 24+ messages in thread
From: Yotam Gigi @ 2016-10-18 10:58 UTC (permalink / raw)
  To: Roopa Prabhu, Jamal Hadi Salim
  Cc: Jiri Pirko, netdev, davem, Ido Schimmel, Elad Raz, Nogah Frankel,
	Or Gerlitz, geert+renesas, stephen, xiyou.wangcong, linux,
	Shrijeet Mukherjee, Yotam Gigi

>-----Original Message-----
>From: Roopa Prabhu [mailto:roopa@cumulusnetworks.com]
>Sent: Tuesday, October 18, 2016 3:17 AM
>To: Jamal Hadi Salim <jhs@mojatatu.com>
>Cc: Jiri Pirko <jiri@resnulli.us>; netdev@vger.kernel.org; davem@davemloft.net;
>Yotam Gigi <yotamg@mellanox.com>; Ido Schimmel <idosch@mellanox.com>; Elad
>Raz <eladr@mellanox.com>; Nogah Frankel <nogahf@mellanox.com>; Or Gerlitz
><ogerlitz@mellanox.com>; geert+renesas@glider.be;
>stephen@networkplumber.org; xiyou.wangcong@gmail.com; linux@roeck-us.net;
>Shrijeet Mukherjee <shm@cumulusnetworks.com>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
>>
>> Some comments:
>> IIUC, the main struggle seems to be whether the redirect to dummy0
>> is useful or not? i.e instead of just letting the packets go up the
>> stack on eth1?
>
>yep, correct...given existing workflow for the non-offloaded case is
>to receive sample packets via bpf filter on socket or
>use netlink as a sample delivery mechanism (NFLOG eg)
>
>
>> It seems like sflowd needs to read off eth1 via packet socket?
>> To be backward compatible - supporting that approach seems sensible.

I am not sure whether using socket is backward compatible. hsflowd 
expects nflog packets, and ife packets will not be understood.

We planned on adding support on hsflowd in IFE encapsulated packets from
tuntap device. 

Did I understand you correctly?

>>
>> Note:
>> There is a clear efficiency benefit of both using IFE encoding and
>> redirecting to dummy0.
>> 1) Redirecting to dummy0 implies you dont need to exercise a bpf
>> filter around every packet that comes off eth1.
>> I understand there are probably not millions of pps for this case;
>> but in a non-offloaded cases it could be millions pps.
>> And in case of sampling over many ethx devices, you can redirect
>> samples from many other ethx devices.
>> So making dummy0 the sflow device is a win.
>> 2) Encaping an IFE header implies a much more efficient bpf filter
>> (IFE ethertype is an excellent discriminator for bpf).
>>
>> Additional benefit is as mentioned before - redirecting to a device
>> means you can send it remotely over ethernet to a more powerful
>> machine without having to cross kernel-userspace. Redirecting instead
>> of mirroring to tuntap is also an interesting option.
>
>sure, this seems like a good option to have.
>generally you have one instance of the sampling agent on a hyper visor or switch.
>But, if you have use-cases where monitoring agents run external, sure.
>would have preferred if it was optional or an addon and not the default.
>
>Regarding the device, yeah, agree there are pros and cons.
>An additional device just to sample packets seems like an overkill.
>But, if there is no other other option, and there are benefits to it, no objections.
>Hopefully we can add another option on the existing api to skip the device in the
>future.
>
>
>>
>>
>> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>>> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>>>> From: Yotam Gigi <yotam.gi@gmail.com>
>>
>>
>>>> +
>>>> +struct sample_packet_metadata {
>>>> +    int sample_size;
>>>> +    int orig_size;
>>>> +    int ifindex;
>>>> +};
>>>> +
>>> This metadata does not look extensible.. can it be made to ?
>>>
>>
>> Sure it can...

I will update the userspace API to be more generice: I will drop this struct and
let the user (iproute2 currently)  build the netlink packet himself (as Or 
Gerlitz suggested).

>>
>>> With sflow in context, you need a pair of ifindex numbers to encode ingress and
>egress ports.
>>
>> What is the use case for both?
>
>I have heard that most monitoring tools have moved to ingress only sampling
>because of operational
>complexity (use case is sflow). I think hardware also supports ingress and egress
>only sampling.
>better to have an option to reflect that in the api.

Agree. I will add both ingress and egress ports in the IFE. Both the hardware 
Implementation and kernel implementation don't support setting both, but 
It is good to have that option.

>
>>> Ideally you would also include a sequence number and a count of the total
>number of packets
>> > that were candidates for sampling.
>>
>> Sequence number may make sense (they will help show a gap if something
>> gets dropped). But i am not sure about the stats consuming such space.
>> Stats are something that can be queried (tc stats should have a record
>> of how many bytes/packets )
>
>sure, thats fine.

Will add sequence number.

>>
>>> The OVS implementation is a good example, the metadata includes all the
>actions applied
>>> to the packet in the kernel data path.
>>>
>>
>> Again not sure what the use case would be (and why waste such space
>> especially when you are sending over the wire with such details).
>
>All this is being used currently.., But, this can be other api's sflow uses
>for monitoring.
>http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf
>	
>Does not have to be part of the main/basic sampling api...
>it was just an example.
>

I guess that making the API extensible solves this, isn't it?

>>
>>>> +    rcu_read_lock();
>>>> +    retval = READ_ONCE(s->tcf_action);
>>>> +
>>>> +    if (++s->packet_counter % s->rate == 0) {
>>>
>>> The sampling function isn't random
>>>
>>> if (++s->packet_counter % s->rate == 0) {
>>>
>>> This is unsuitable for sFlow, which is specific about the random sampling
>function required.
>>> BPF, OVS, and the
>>> ULOG statistics module include efficient kernel based random sampling
>functions that could be used instead.
>>>
>>
>> If i understood correctly, the above is a fallback sampling algorithm.
>> In the case of the spectrum it already does the sampling in the ASIC
>> so there is no need to repeat it in software.
>> Agreed that in that case the sampling approach is not sufficiently
>> random.
>
>yes. and since the same sampling api will be used for offloaded and non-offloaded
>case,
>the sampling algo here for the non-offloaded case...can do better .. atleast match
>the existing
>api efficiency. We would want people to use the same api for the offload and non-
>offloaded case.

Yep, spectrum does not have this functionality, and this is why I did not implement in
sample sw implementation too. I agree that it should be there.

In my opinion, it should be optional param (like random_type or something), that the
spectrum offload support will not implement (until hardware will support it).

>
>thanks,
>Roopa

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-18 10:58         ` Yotam Gigi
@ 2016-10-19  7:33           ` Roopa Prabhu
  2016-10-19  8:28             ` Yotam Gigi
  0 siblings, 1 reply; 24+ messages in thread
From: Roopa Prabhu @ 2016-10-19  7:33 UTC (permalink / raw)
  To: Yotam Gigi
  Cc: Jamal Hadi Salim, Jiri Pirko, netdev, davem, Ido Schimmel,
	Elad Raz, Nogah Frankel, Or Gerlitz, geert+renesas, stephen,
	xiyou.wangcong, linux, Shrijeet Mukherjee, Yotam Gigi

On 10/18/16, 3:58 AM, Yotam Gigi wrote:

> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
[snip]

>> The OVS implementation is a good example, the metadata includes all the
>> actions applied
>>>> to the packet in the kernel data path.
>>>>
>>> Again not sure what the use case would be (and why waste such space
>>> especially when you are sending over the wire with such details).
>> All this is being used currently.., But, this can be other api's sflow uses
>> for monitoring.
>> http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf
>> 	
>> Does not have to be part of the main/basic sampling api...
>> it was just an example.
>>
> I guess that making the API extensible solves this, isn't it?

yes, that might help...

Just wanted to bring up the question/clarification on using mark again

tc qdisc add dev eth1 handle ffff: ingress

tc filter add dev eth1 parent ffff: \
           matchall action sample rate 12 mark 17

tc filter add parent ffff: dev eth1 protocol all \
           u32 match mark 172 0xff
           action mirred egress redirect dev dummy0

Like we discussed @ netdev, mark can be used by other things in the system.
A request to sample on an interface cannot be disruptive.
Does this require mark to be not used elsewhere in the system when sampling is enabled on an interface ?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [patch net-next RFC 4/6] Introduce sample tc action
  2016-10-19  7:33           ` Roopa Prabhu
@ 2016-10-19  8:28             ` Yotam Gigi
  0 siblings, 0 replies; 24+ messages in thread
From: Yotam Gigi @ 2016-10-19  8:28 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jamal Hadi Salim, Jiri Pirko, netdev, davem, Ido Schimmel,
	Elad Raz, Nogah Frankel, Or Gerlitz, geert+renesas, stephen,
	xiyou.wangcong, linux, Shrijeet Mukherjee, Yotam Gigi



>-----Original Message-----
>From: Roopa Prabhu [mailto:roopa@cumulusnetworks.com]
>Sent: Wednesday, October 19, 2016 10:33 AM
>To: Yotam Gigi <yotamg@mellanox.com>
>Cc: Jamal Hadi Salim <jhs@mojatatu.com>; Jiri Pirko <jiri@resnulli.us>;
>netdev@vger.kernel.org; davem@davemloft.net; Ido Schimmel
><idosch@mellanox.com>; Elad Raz <eladr@mellanox.com>; Nogah Frankel
><nogahf@mellanox.com>; Or Gerlitz <ogerlitz@mellanox.com>;
>geert+renesas@glider.be; stephen@networkplumber.org;
>xiyou.wangcong@gmail.com; linux@roeck-us.net; Shrijeet Mukherjee
><shm@cumulusnetworks.com>; Yotam Gigi <yotam.gi@gmail.com>
>Subject: Re: [patch net-next RFC 4/6] Introduce sample tc action
>
>On 10/18/16, 3:58 AM, Yotam Gigi wrote:
>
>> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>[snip]
>
>>> The OVS implementation is a good example, the metadata includes all the
>>> actions applied
>>>>> to the packet in the kernel data path.
>>>>>
>>>> Again not sure what the use case would be (and why waste such space
>>>> especially when you are sending over the wire with such details).
>>> All this is being used currently.., But, this can be other api's sflow uses
>>> for monitoring.
>>> http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf
>>>
>>> Does not have to be part of the main/basic sampling api...
>>> it was just an example.
>>>
>> I guess that making the API extensible solves this, isn't it?
>
>yes, that might help...
>
>Just wanted to bring up the question/clarification on using mark again
>
>tc qdisc add dev eth1 handle ffff: ingress
>
>tc filter add dev eth1 parent ffff: \
>           matchall action sample rate 12 mark 17
>
>tc filter add parent ffff: dev eth1 protocol all \
>           u32 match mark 172 0xff
>           action mirred egress redirect dev dummy0
>
>Like we discussed @ netdev, mark can be used by other things in the system.
>A request to sample on an interface cannot be disruptive.
>Does this require mark to be not used elsewhere in the system when sampling is
>enabled on an interface ?

I think the we can spare the usage of mark, or at least make it optional, as the user
can match on the packets according to the eth_type (as part of the IFE, the user 
can set the sampled packet eth_type).

I will do that, and update the documentation as well.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-10-19 20:44 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-12 12:41 [patch net-next RFC 0/6] Add support for offloading packet-sampling Jiri Pirko
2016-10-12 12:41 ` [patch net-next RFC 1/6] Introduce ife encapsulation module Jiri Pirko
2016-10-12 12:41 ` [patch net-next RFC 2/6] act_ife: Change to use ife module Jiri Pirko
2016-10-12 12:41 ` [patch net-next RFC 3/6] ife: Introduce new metadata tlv types Jiri Pirko
2016-10-12 12:41 ` [patch net-next RFC 4/6] Introduce sample tc action Jiri Pirko
2016-10-15 16:34   ` Roopa Prabhu
2016-10-15 17:31     ` Roopa Prabhu
2016-10-17 10:10     ` Jamal Hadi Salim
2016-10-18  0:17       ` Roopa Prabhu
2016-10-18  5:07         ` Roopa Prabhu
2016-10-18 10:58         ` Yotam Gigi
2016-10-19  7:33           ` Roopa Prabhu
2016-10-19  8:28             ` Yotam Gigi
2016-10-16 10:27   ` Or Gerlitz
2016-10-18  8:33     ` Yotam Gigi
2016-10-12 12:41 ` [patch net-next RFC 5/6] mlxsw: reg: add the Monitoring Packet Sampling Configuration Register Jiri Pirko
2016-10-12 12:41 ` [patch net-next RFC 6/6] mlxsw: packet sample: Add packet sample offloading support Jiri Pirko
2016-10-13  7:29 ` [patch net-next RFC 0/6] Add support for offloading packet-sampling Roopa Prabhu
2016-10-13  8:48   ` Jiri Pirko
2016-10-13 11:49     ` Jamal Hadi Salim
2016-10-13 12:10       ` Jiri Pirko
2016-10-13 12:30         ` Jamal Hadi Salim
2016-10-13 12:45           ` Jiri Pirko
2016-10-14  5:02             ` Roopa Prabhu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.