BPF Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support
@ 2020-04-15  8:54 Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-04-15  8:54 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov,
	Daniel Borkmann, Hangbin Liu

Hi all,

This is a prototype for xdp multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple
ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To be able to reuse the existing bpf_redirect_map() helper, we use a
containing map-in-map type to store the forwarding and exclude groups.
When a map-in-map type is passed to the redirect helper, it will
interpret the index as encoding the forwarding group in the upper 16
bits and the exclude group in the lower 16 bits. The enqueue logic will
unpack the two halves of the index and perform separate lookups in the
containing map. E.g., an index of 0x00010001 will look for the
forwarding group at map index 0x10000 and the exclude group at map index
0x1; the application is expected to populate the map accordingly.

For this RFC series we are primarily looking for feedback on the concept
and API: the example in patch 2 is functional, but not a lot of effort
has been made on performance optimisation.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

Hangbin Liu (2):
  xdp: add dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test

 include/linux/bpf.h                           |  29 ++
 include/net/xdp.h                             |   1 +
 kernel/bpf/arraymap.c                         |   2 +-
 kernel/bpf/devmap.c                           | 118 +++++++
 kernel/bpf/hashtab.c                          |   2 +-
 kernel/bpf/verifier.c                         |  15 +-
 net/core/filter.c                             |  69 +++-
 net/core/xdp.c                                |  26 ++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multicast.sh     | 142 ++++++++
 samples/bpf/xdp_redirect_map_multicast_kern.c | 147 +++++++++
 samples/bpf/xdp_redirect_map_multicast_user.c | 306 ++++++++++++++++++
 12 files changed, 854 insertions(+), 6 deletions(-)
 create mode 100755 samples/bpf/xdp_redirect_map_multicast.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multicast_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multicast_user.c

-- 
2.19.2


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [RFC PATCH bpf-next 1/2] xdp: add dev map multicast support
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-04-15  8:54 ` Hangbin Liu
  2020-04-20  9:52   ` Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-04-15  8:54 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov,
	Daniel Borkmann, Hangbin Liu

This is a prototype for xdp multicast support. In this implemention we
use map-in-map to store the multicast groups, because we may have both
include and exclude groups on one interface.

The include and exclude groups are seperated by a 32 bits map key.
the high 16 bits keys are used for include groups and low 16 bits
keys are for exclude groups.

The general data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls to
get better performace.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h   |  29 +++++++++++
 include/net/xdp.h     |   1 +
 kernel/bpf/arraymap.c |   2 +-
 kernel/bpf/devmap.c   | 118 ++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/hashtab.c  |   2 +-
 kernel/bpf/verifier.c |  15 +++++-
 net/core/filter.c     |  69 +++++++++++++++++++++++-
 net/core/xdp.c        |  26 ++++++++++
 8 files changed, 256 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fd2b2322412d..72797667bca8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1156,11 +1156,17 @@ struct sk_buff;
 
 struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
 struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key);
+void *array_of_map_lookup_elem(struct bpf_map *map, void *key);
+void *htab_of_map_lookup_elem(struct bpf_map *map, void *key);
 void __dev_flush(void);
 int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, u32 index);
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 
@@ -1276,6 +1282,16 @@ static inline struct net_device  *__dev_map_hash_lookup_elem(struct bpf_map *map
 	return NULL;
 }
 
+static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
+{
+
+}
+
+static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
+{
+
+}
+
 static inline void __dev_flush(void)
 {
 }
@@ -1297,6 +1313,19 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map)
+{
+	return true;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, u32 index)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 40c6d3398458..a214dce8579c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -92,6 +92,7 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
 }
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 /* Convert xdp_buff to xdp_frame */
 static inline
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 95d77770353c..26ac66a05015 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1031,7 +1031,7 @@ static void array_of_map_free(struct bpf_map *map)
 	fd_array_map_free(map);
 }
 
-static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
+void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_map **inner_map = array_map_lookup_elem(map, key);
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 58bdca5d978a..3a60cb209ae1 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -85,6 +85,9 @@ static DEFINE_PER_CPU(struct list_head, dev_flush_list);
 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);
 
+static void *dev_map_lookup_elem(struct bpf_map *map, void *key);
+static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key);
+
 static struct hlist_head *dev_map_create_hash(unsigned int entries)
 {
 	int i;
@@ -456,6 +459,121 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct call in fast path instead of  map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map)
+{
+	struct bpf_dtab_netdev *in_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			in_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			in_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, u32 index)
+{
+	struct bpf_dtab_netdev *obj = NULL;
+	struct bpf_map *in_map, *ex_map;
+	struct xdp_frame *xdpf, *nxdpf;
+	struct net_device *dev;
+	u32 in_index, ex_index;
+	u32 key, next_key;
+	int err;
+
+	in_index = index >> 16;
+	in_index = in_index << 16;
+	ex_index = in_index ^ index;
+
+	in_map = map->ops->map_lookup_elem(map, &in_index);
+	/* ex_map could be NULL */
+	ex_map = map->ops->map_lookup_elem(map, &ex_index);
+
+	devmap_get_next_key(in_map, NULL, &key);
+
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		switch (in_map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(in_map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(in_map, key);
+			break;
+		default:
+			break;
+		}
+		if (!obj)
+			goto find_next;
+
+		if (ex_map && !dev_in_exclude_map(obj, ex_map)) {
+			dev = obj->dev;
+
+			if (!dev->netdev_ops->ndo_xdp_xmit)
+				return -EOPNOTSUPP;
+
+			err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+			if (unlikely(err))
+				return err;
+
+			nxdpf = xdpf_clone(xdpf);
+			if (unlikely(!nxdpf))
+				return -ENOMEM;
+
+			bq_enqueue(dev, nxdpf, dev_rx);
+		}
+find_next:
+		err = devmap_get_next_key(in_map, &key, &next_key);
+		if (err)
+			break;
+		key = next_key;
+	}
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d541c8486c95..4e0a2eebd38d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1853,7 +1853,7 @@ static struct bpf_map *htab_of_map_alloc(union bpf_attr *attr)
 	return map;
 }
 
-static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
+void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_map **inner_map  = htab_map_lookup_elem(map, key);
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 04c6630cc18f..84d23418823a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3898,7 +3898,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
 	case BPF_MAP_TYPE_HASH_OF_MAPS:
-		if (func_id != BPF_FUNC_map_lookup_elem)
+		/* Used by multicast redirect */
+		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_SOCKMAP:
@@ -3968,8 +3970,17 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
 		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH &&
 		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
-		    map->map_type != BPF_MAP_TYPE_XSKMAP)
+		    map->map_type != BPF_MAP_TYPE_XSKMAP &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY_OF_MAPS &&
+		    map->map_type != BPF_MAP_TYPE_HASH_OF_MAPS)
 			goto error;
+		if (map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+		    map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+			/* FIXME: Maybe we should also strict the key size here ?? */
+			if (map->inner_map_meta->map_type != BPF_MAP_TYPE_DEVMAP &&
+			    map->inner_map_meta->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+				goto error;
+		}
 		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
diff --git a/net/core/filter.c b/net/core/filter.c
index 7628b947dbc3..7d2076f5b0a4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    u32 index)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
+		/* fall through */
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		return dev_map_enqueue(fwd, xdp, dev_rx);
+	case BPF_MAP_TYPE_HASH_OF_MAPS:
+	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
+		return dev_map_enqueue_multi(xdp, dev_rx, map, index);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3508,6 +3513,10 @@ static inline void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
 		return __cpu_map_lookup_elem(map, index);
 	case BPF_MAP_TYPE_XSKMAP:
 		return __xsk_map_lookup_elem(map, index);
+	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
+		return array_of_map_lookup_elem(map, (index >> 16) << 16);
+	case BPF_MAP_TYPE_HASH_OF_MAPS:
+		return htab_of_map_lookup_elem(map, (index >> 16) << 16);
 	default:
 		return NULL;
 	}
@@ -3552,7 +3561,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, index);
 	}
 
 	if (unlikely(err))
@@ -3566,6 +3575,55 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct sk_buff *skb, struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, u32 index)
+
+{
+	struct bpf_map *in_map, *ex_map;
+	struct bpf_dtab_netdev *dst;
+	u32 in_index, ex_index;
+	struct sk_buff *nskb;
+	u32 key, next_key;
+	int err;
+	void *fwd;
+
+	in_index = index >> 16;
+	in_index = in_index << 16;
+	ex_index = in_index ^ index;
+
+	in_map = map->ops->map_lookup_elem(map, &in_index);
+	/* ex_map could be NULL */
+	ex_map = map->ops->map_lookup_elem(map, &ex_index);
+
+	in_map->ops->map_get_next_key(in_map, NULL, &key);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(in_map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (ex_map && dev_in_exclude_map(dst, ex_map))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -EOVERFLOW;
+
+			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+			if (unlikely(err))
+				return err;
+		}
+
+find_next:
+		err = in_map->ops->map_get_next_key(in_map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3588,6 +3646,13 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
+	} else if (map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+		   map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+		/* Do multicast redirecting */
+		err = dev_map_redirect_multi(skb, xdp_prog, map, index);
+		if (unlikely(err))
+			goto err;
+		consume_skb(skb);
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
 		struct xdp_sock *xs = fwd;
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4c7ea85486af..70dfb4910f84 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	return xdpf;
 }
 EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
-- 
2.19.2


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [RFC PATCH bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
@ 2020-04-15  8:54 ` Hangbin Liu
  2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-04-15  8:54 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov,
	Daniel Borkmann, Hangbin Liu

This test is used for testing xdp multicast. It defined 3 groups
for different usage. Each interface in init net has different
exclude interfaces. In the test it tests both generic/native mode
and 3 different map-in-map types.

For more testing details, please see the test description in
xdp_redirect_map_multicast.sh.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multicast.sh     | 142 ++++++++
 samples/bpf/xdp_redirect_map_multicast_kern.c | 147 +++++++++
 samples/bpf/xdp_redirect_map_multicast_user.c | 306 ++++++++++++++++++
 4 files changed, 598 insertions(+)
 create mode 100755 samples/bpf/xdp_redirect_map_multicast.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multicast_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multicast_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 424f6fe7ce38..55555b0267cf 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multicast
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multicast-objs := bpf_load.o xdp_redirect_map_multicast_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multicast_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multicast.sh b/samples/bpf/xdp_redirect_map_multicast.sh
new file mode 100755
index 000000000000..01f825e33060
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multicast.sh
@@ -0,0 +1,142 @@
+#!/bin/bash
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Include Groups:
+#     Group 1 has interfaces: veth1, veth2, veth3, veth4 (All traffic except IPv4, IPv6)
+#     Group 2 has interfaces: veth1, veth3 (For IPv4 traffic only)
+#     Group 3 has interfaces: veth2, veth4 (For IPv6 traffic only)
+# Exclude Groups:
+#     veth1: exclude veth1
+#     veth2: exclude veth2
+#     veth3: exclude veth3, veth4
+#     veth4: exclude veth3, veth4
+#
+# Testing:
+# XDP modes: generic, native
+# map types: array of array, hash of array, hash of hash
+# Include:
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (pass)
+#     IPv6
+#        ns4 -> ns1 (fail), ns4 -> ns2 (pass)
+# Exclude:
+#     arp ns1 -> ns2: ns2, ns3, ns4 should receive the arp request
+#     arp ns4 -> ns1: ns1, ns2 should receive the arp request, ns3 should not
+#
+
+
+# netns numbers
+NUM=4
+IFACES=""
+DRV_MODE="generic native"
+MAP_TYPE="aa ha hh"
+
+test_pass()
+{
+	echo "Pass: $@"
+}
+
+test_fail()
+{
+	echo "fail: $@"
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip netns del ns$i
+	done
+}
+
+setup_ns()
+{
+	local mode=$1
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth0 type veth peer name veth$i
+	        ip link set veth0 netns ns$i
+		ip netns exec ns$i ip link set veth0 up
+		ip link set veth$i up
+
+		ip netns exec ns$i ip addr add 192.0.2.$i/24 dev veth0
+		ip netns exec ns$i ip addr add 2001:db8::$i/24 dev veth0
+		# Use xdp_redirect_map_kern.o because the dummy section in
+		# xdp_redirect_map_multicast_kern.o does not support iproute2 loading
+		ip netns exec ns$i ip link set veth0 xdp$mode obj xdp_redirect_map_kern.o sec xdp_redirect_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_tests()
+{
+	local drv_mode=$1
+	local map_type=$2
+	local drv_p
+
+	[ ${drv_mode} == "drv" ] && drv_p="-N" || drv_p="-S"
+
+	./xdp_redirect_map_multicast $drv_p -M $map_type $IFACES &> xdp_${drv_mode}_${map_type}.log &
+	xdp_pid=$!
+	sleep 10
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${drv_mode}_${map_type}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${drv_mode}_${map_type}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${drv_mode}_${map_type}.log &
+	ip netns exec ns1 ping 192.0.2.100 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.1" arp_ns1-2_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns1-2" || test_fail "$drv_mode $map_type arp ns1-2"
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.1" arp_ns1-3_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns1-3" || test_fail "$drv_mode $map_type arp ns1-3"
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.1" arp_ns1-4_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns1-4" || test_fail "$drv_mode $map_type arp ns1-4"
+
+	ip netns exec ns1 tcpdump -i veth0 -nn -l -e &> arp_ns4-1_${drv_mode}_${map_type}.log &
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns4-2_${drv_mode}_${map_type}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns4-3_${drv_mode}_${map_type}.log &
+	ip netns exec ns4 ping 192.0.2.100 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.4" arp_ns4-1_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns4-1" || test_fail "$drv_mode $map_type arp ns4-1"
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.4" arp_ns4-2_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns4-2" || test_fail "$drv_mode $map_type arp ns4-2"
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.4" arp_ns4-3_${drv_mode}_${map_type}.log && \
+		test_fail "$drv_mode $map_type arp ns4-3" || test_pass "$drv_mode $map_type arp ns4-3"
+
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$drv_mode $map_type ping ns1-2" || test_pass "$drv_mode $map_type ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_pass "$drv_mode $map_type ping ns1-3" || test_fail "$drv_mode $map_type ping ns1-3"
+
+	# ping6 test
+	ip netns exec ns4 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$drv_mode $map_type ping6 ns4-1" || test_pass "$drv_mode $map_type ping6 ns4-1"
+	ip netns exec ns4 ping6 2001:db8::2 -c 4 &> /dev/null && \
+		test_pass "$drv_mode $map_type ping6 ns4-2" || test_fail "$drv_mode $map_type ping6 ns4-2"
+
+	kill $xdp_pid
+}
+
+for mode in ${DRV_MODE}; do
+	sleep 2
+	setup_ns $mode
+	for type in ${MAP_TYPE}; do
+		do_tests $mode $type
+	done
+	sleep 20
+	clean_up
+done
diff --git a/samples/bpf/xdp_redirect_map_multicast_kern.c b/samples/bpf/xdp_redirect_map_multicast_kern.c
new file mode 100644
index 000000000000..f2ac36eed0e9
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multicast_kern.c
@@ -0,0 +1,147 @@
+/* This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+#define MAX_NR_PORTS 65536
+
+/* In the sample we will store all ports to group 1.
+ * And add two multicast groups:
+ * group 2 for even number interfaces, group 3 for odd number interfaces
+ */
+
+/* This is an array map template(NOT used) for multicast group storage
+ * The format could be lined by index:ifindex, like
+ * [0, 0], [1, 1], [2, 0], [3, 3], [4,4], [5, 0], [6, 0] ...
+ * which would be easier to modify and update.
+ *
+ * This map also could be used as multicast exclude array map.
+ * */
+struct bpf_map_def SEC("maps") group_a = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = MAX_NR_PORTS,
+};
+
+/* This is a hash map template(NOT used) for multicast group storage
+ * The format could be none-lined index:ifindex, like
+ * [1, 1], [3, 3], [4, 4]...
+ * which would save more spaces for storage
+ *
+ * This map also could be used as multicast exclude hash map.
+ * */
+struct bpf_map_def SEC("maps") group_h = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = MAX_NR_PORTS,
+};
+
+/* This is an array map-in-map, the inner maps will store all the
+ * include array maps and exclude array maps
+ *
+ * The max_entries is MAX_NR_PORTS * 32 as I only use 3 groups.
+ * */
+struct bpf_map_def SEC("maps") a_of_group_a = {
+	.type = BPF_MAP_TYPE_ARRAY_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_NR_PORTS * 32,
+};
+
+/* This is a hash map-in-map, the inner maps will store all the
+ * include array maps and exclude array maps
+ * */
+struct bpf_map_def SEC("maps") h_of_group_a = {
+	.type = BPF_MAP_TYPE_HASH_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_NR_PORTS * 32,
+};
+
+/* This is a hash map-in-map, the inner maps will store all the
+ * include hash maps and exclude hash maps
+ * */
+struct bpf_map_def SEC("maps") h_of_group_h = {
+	.type = BPF_MAP_TYPE_HASH_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_NR_PORTS * 32,
+};
+
+/* Note: This map is not used yet, we get the gourp id based on IP version at
+ * present.
+ *
+ * This map is used to store all the include groups fds based on ip/mac dest.
+ */
+struct bpf_map_def SEC("maps") mcast_route_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u16),
+	.max_entries = MAX_NR_PORTS,
+};
+
+/* TODO: This is used for broadcast redirecting/forwarding,
+ * how to do the redirecting/forwarding one on one based on neigh tables?? */
+SEC("xdp_redirect_map")
+int xdp_redirect_map_prog(struct xdp_md *ctx)
+{
+	u32 key, mcast_group_id, exclude_group_id, redirect_key;
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int *inmap_id;
+	u16 h_proto;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_IPV6))
+		mcast_group_id = 3;
+	else if (h_proto == htons(ETH_P_IP))
+		mcast_group_id = 2;
+	else
+		mcast_group_id = 1;
+
+	exclude_group_id = ctx->ingress_ifindex;
+	redirect_key = (mcast_group_id << 16) | exclude_group_id;
+
+	key = 1 << 16;
+	if ((inmap_id = bpf_map_lookup_elem(&a_of_group_a, &key)) && inmap_id)
+		return bpf_redirect_map(&a_of_group_a, redirect_key, 0);
+	else if ((inmap_id = bpf_map_lookup_elem(&h_of_group_a, &key)) && inmap_id)
+		return bpf_redirect_map(&h_of_group_a, redirect_key, 0);
+	else if ((inmap_id = bpf_map_lookup_elem(&h_of_group_h, &key)) && inmap_id)
+		return bpf_redirect_map(&h_of_group_h, redirect_key, 0);
+
+	return XDP_PASS;
+}
+
+/* FIXME: This prog could not be load by iproute2 as the map-in-map need
+ * set inner map fd first.
+ */
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multicast_user.c b/samples/bpf/xdp_redirect_map_multicast_user.c
new file mode 100644
index 000000000000..a451c90a05b6
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multicast_user.c
@@ -0,0 +1,306 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ */
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+#include <sys/resource.h>
+
+#include "bpf_load.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+#define MAX_NR_PORTS 65536
+
+static int ifaces[MAX_IFACE_NUM] = {};
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static int init_map_in_map(struct bpf_object *obj, char *outmap_name, bool inmap_hash)
+{
+	struct bpf_map *outmap;
+	int inmap_fd, ret;
+
+	inmap_fd = bpf_create_map(inmap_hash ? BPF_MAP_TYPE_DEVMAP_HASH : BPF_MAP_TYPE_DEVMAP, sizeof(__u32), sizeof(int), MAX_NR_PORTS, 0);
+	if (inmap_fd < 0) {
+		printf("Failed to create inner map '%s'!\n", strerror(errno));
+		return 1;
+	}
+	outmap = bpf_object__find_map_by_name(obj, outmap_name);
+	if (!outmap) {
+		printf("Failed to load map %s from test prog\n", outmap_name);
+		return 1;
+	}
+        ret = bpf_map__set_inner_map_fd(outmap, inmap_fd);
+        if (ret) {
+                printf("Failed to set inner_map_fd for map %s\n", outmap_name);
+		return 1;
+        }
+
+	return 0;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n"
+		"    -M    map-in-map mode, could be aa, ha, hh(default)\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	struct bpf_object_open_attr obj_open_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	struct bpf_map *outmap;
+	char *outmap_name;
+	char ifname[IF_NAMESIZE];
+	int pro_fd, inmap_fd, outmap_fd;
+	int i, j, ret, opt, ifindex;
+	__u32 inmap_id, key;
+	char filename[256];
+	bool inmap_hash;
+	char *mode = NULL;
+
+	while ((opt = getopt(argc, argv, "SNFM:")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		case 'M':
+			mode = optarg;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	/* array of array */
+	if (strncmp(mode, "aa", 2) == 0) {
+		outmap_name = "a_of_group_a";
+		inmap_hash = false;
+	/* hash of array */
+	} else if (strncmp(mode, "ha", 2) == 0) {
+		outmap_name = "h_of_group_a";
+		inmap_hash = false;
+	/* hash of hash */
+	} else {
+		outmap_name = "h_of_group_h";
+		inmap_hash = true;
+	}
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		perror("setrlimit(RLIMIT_MEMLOCK)");
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i ++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	/* open bpf obj, set inner fd for out map-in-map and load bpf obj */
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	obj_open_attr.file = filename;
+	obj = bpf_object__open_xattr(&obj_open_attr);
+
+	/* We need set inner map fd for all outmaps. */
+	if(init_map_in_map(obj, "a_of_group_a", false)) {
+		printf("Failed to init out map a_of_group_a\n");
+		goto err_out;
+	}
+	if(init_map_in_map(obj, "h_of_group_a", false)) {
+		printf("Failed to init out map h_of_group_a\n");
+		goto err_out;
+	}
+	if(init_map_in_map(obj, "h_of_group_h", true)) {
+		printf("Failed to init out map h_of_group_h\n");
+		goto err_out;
+	}
+
+	bpf_object__load(obj);
+
+	prog = bpf_program__next(NULL, obj);
+	pro_fd = bpf_program__fd(prog);
+
+	outmap = bpf_object__find_map_by_name(obj, outmap_name);
+	if (!outmap) {
+		printf("Failed to load map %s from test prog\n", outmap_name);
+		goto err_out;
+	}
+	outmap_fd = bpf_map__fd(outmap);
+	if (outmap_fd < 0) {
+		printf("Failed to get fd from map %s\n", outmap_name);
+		goto err_out;
+	}
+	/* Init 3 multicast groups first.
+	 * group 1: this is used for all ports group
+	 * group 2: this is used for even number interfaces
+	 * group 3: this is used for odd number interfaces
+	 * You can store the group number in mcast_route_map for furture
+	 * IP/MAC -> Multicast Group lookup.
+	 */
+	for (i = 1; i <= 3; i++) {
+		/* Split the include/exclude groups by 16 bit
+		 * FIXME: is there a flexible way? how to let
+		 * kernel side know this?
+		 */
+		key = i << 16;
+		inmap_fd = bpf_create_map(inmap_hash ? BPF_MAP_TYPE_DEVMAP_HASH : BPF_MAP_TYPE_DEVMAP, sizeof(__u32), sizeof(int), MAX_NR_PORTS, 0);
+		if (inmap_fd < 0) {
+			printf("Failed to create inner map '%s'!\n", strerror(errno));
+			goto err_out;
+		}
+		ret = bpf_map_update_elem(outmap_fd, &key, &inmap_fd, 0);
+		if (ret) {
+			printf("Failed to update map %s\n", outmap_name);
+			goto err_out;
+		}
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Set values for each map
+	 * 1. set all ports to group 1
+	 * 2. set ports to group 2 or 3 based on interface index
+	 * 3. set exclude group for each interface
+	 */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* bind pro_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, pro_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+		/* Add the interface to group 1 */
+		key = 1 << 16;
+		ret = bpf_map_lookup_elem(outmap_fd, &key, &inmap_id);
+		if (ret) {
+			printf("Failed to lookup inmap by key %u from map %s\n", key, outmap_name);
+			goto err_out;
+		}
+		inmap_fd = bpf_map_get_fd_by_id(inmap_id);
+		ret = bpf_map_update_elem(inmap_fd, &ifindex, &ifindex, 0);
+		if (ret) {
+			printf("Failed to update key %d, value %d for inmap id %u\n", ifindex, ifindex, inmap_id);
+			goto err_out;
+		}
+
+		/* Add the even number ifaces to group 2 and odd ifaces to group 3 */
+		if (i % 2 == 0)
+			key = 2 << 16;
+		else
+			key = 3 << 16;
+		ret = bpf_map_lookup_elem(outmap_fd, &key, &inmap_id);
+		if (ret) {
+			printf("Failed to lookup inmap by key %u from map %s\n", key, outmap_name);
+			goto err_out;
+		}
+		inmap_fd = bpf_map_get_fd_by_id(inmap_id);
+		ret = bpf_map_update_elem(inmap_fd, &ifindex, &ifindex, 0);
+		if (ret) {
+			printf("Failed to update key %d, value %d for inmap id %u\n", ifindex, ifindex, inmap_id);
+			goto err_out;
+		}
+
+		/* Set the exclude map for the interfaces */
+		key = ifindex;
+		inmap_fd = bpf_create_map(inmap_hash ? BPF_MAP_TYPE_DEVMAP_HASH : BPF_MAP_TYPE_DEVMAP, sizeof(__u32), sizeof(int), MAX_NR_PORTS, 0);
+		if (inmap_fd < 0) {
+			printf("Failed to create inner map '%s'!\n", strerror(errno));
+			goto err_out;
+		}
+		ret = bpf_map_update_elem(inmap_fd, &ifindex, &ifindex, 0);
+		if (ret) {
+			printf("Failed to update key %d, value %d for exclude inmap\n", ifindex, ifindex);
+			goto err_out;
+		}
+
+		/* let test exclude map by excluding all the interfaces except
+		 * the first one. The first two interfaces are not affect. e.g.
+		 * iface_1 = [1], iface_2 = [2], iface_3 = [3, 4,..],
+		 * iface_4 = [3, 4,...]
+		 */
+		for (j = 2; ifaces[j] > 0; j++) {
+			if (i <= 1 || i == j)
+				continue;
+			ifindex = ifaces[j];
+			ret = bpf_map_update_elem(inmap_fd, &ifindex, &ifindex, 0);
+			if (ret) {
+				printf("Failed to update key %d, value %d for exclude inmap\n", ifindex, ifindex);
+				goto err_out;
+		}
+
+		}
+		ret = bpf_map_update_elem(outmap_fd, &key, &inmap_fd, 0);
+		if (ret) {
+			printf("Failed to update map %s\n", outmap_name);
+			goto err_out;
+		}
+	}
+
+	sleep(600);
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.19.2


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH bpf-next 1/2] xdp: add dev map multicast support
  2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
@ 2020-04-20  9:52   ` Hangbin Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-04-20  9:52 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov, bpf

Hi Daniel,

Would you please help review the RFC and give some comments?
Especially for the ifindex parameter of bpf_redirect_map() which
contains both include and exclude map id. Should we keep the current
designing, or find a way to make it flexible, or even add a new syscall
to accept two index parameters?

Thanks
Hangbin

On Wed, Apr 15, 2020 at 04:54:36PM +0800, Hangbin Liu wrote:
> This is a prototype for xdp multicast support. In this implemention we
> use map-in-map to store the multicast groups, because we may have both
> include and exclude groups on one interface.
> 
> The include and exclude groups are seperated by a 32 bits map key.
> the high 16 bits keys are used for include groups and low 16 bits
> keys are for exclude groups.
> 
> The general data path is kept in net/core/filter.c. The native data
> path is in kernel/bpf/devmap.c so we can use direct calls to
> get better performace.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/bpf.h   |  29 +++++++++++
>  include/net/xdp.h     |   1 +
>  kernel/bpf/arraymap.c |   2 +-
>  kernel/bpf/devmap.c   | 118 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/hashtab.c  |   2 +-
>  kernel/bpf/verifier.c |  15 +++++-
>  net/core/filter.c     |  69 +++++++++++++++++++++++-
>  net/core/xdp.c        |  26 ++++++++++
>  8 files changed, 256 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fd2b2322412d..72797667bca8 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1156,11 +1156,17 @@ struct sk_buff;
>  
>  struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
>  struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key);
> +void *array_of_map_lookup_elem(struct bpf_map *map, void *key);
> +void *htab_of_map_lookup_elem(struct bpf_map *map, void *key);
>  void __dev_flush(void);
>  int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map);
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, u32 index);
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog);
>  
> @@ -1276,6 +1282,16 @@ static inline struct net_device  *__dev_map_hash_lookup_elem(struct bpf_map *map
>  	return NULL;
>  }
>  
> +static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +
> +}
> +
> +static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +
> +}
> +
>  static inline void __dev_flush(void)
>  {
>  }
> @@ -1297,6 +1313,19 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return 0;
>  }
>  
> +static inline
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map)
> +{
> +	return true;
> +}
> +
> +static inline
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, u32 index)
> +{
> +	return 0;
> +}
> +
>  struct sk_buff;
>  
>  static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 40c6d3398458..a214dce8579c 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -92,6 +92,7 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
>  }
>  
>  struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
>  
>  /* Convert xdp_buff to xdp_frame */
>  static inline
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 95d77770353c..26ac66a05015 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -1031,7 +1031,7 @@ static void array_of_map_free(struct bpf_map *map)
>  	fd_array_map_free(map);
>  }
>  
> -static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
> +void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
>  {
>  	struct bpf_map **inner_map = array_map_lookup_elem(map, key);
>  
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 58bdca5d978a..3a60cb209ae1 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -85,6 +85,9 @@ static DEFINE_PER_CPU(struct list_head, dev_flush_list);
>  static DEFINE_SPINLOCK(dev_map_lock);
>  static LIST_HEAD(dev_map_list);
>  
> +static void *dev_map_lookup_elem(struct bpf_map *map, void *key);
> +static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key);
> +
>  static struct hlist_head *dev_map_create_hash(unsigned int entries)
>  {
>  	int i;
> @@ -456,6 +459,121 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return __xdp_enqueue(dev, xdp, dev_rx);
>  }
>  
> +/* Use direct call in fast path instead of  map->ops->map_get_next_key() */
> +static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +
> +	switch (map->map_type) {
> +	case BPF_MAP_TYPE_DEVMAP:
> +		return dev_map_get_next_key(map, key, next_key);
> +	case BPF_MAP_TYPE_DEVMAP_HASH:
> +		return dev_map_hash_get_next_key(map, key, next_key);
> +	default:
> +		break;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map)
> +{
> +	struct bpf_dtab_netdev *in_obj = NULL;
> +	u32 key, next_key;
> +	int err;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			in_obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			in_obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
> +			return true;
> +
> +		err = devmap_get_next_key(map, &key, &next_key);
> +
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return false;
> +}
> +
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, u32 index)
> +{
> +	struct bpf_dtab_netdev *obj = NULL;
> +	struct bpf_map *in_map, *ex_map;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	struct net_device *dev;
> +	u32 in_index, ex_index;
> +	u32 key, next_key;
> +	int err;
> +
> +	in_index = index >> 16;
> +	in_index = in_index << 16;
> +	ex_index = in_index ^ index;
> +
> +	in_map = map->ops->map_lookup_elem(map, &in_index);
> +	/* ex_map could be NULL */
> +	ex_map = map->ops->map_lookup_elem(map, &ex_index);
> +
> +	devmap_get_next_key(in_map, NULL, &key);
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	for (;;) {
> +		switch (in_map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			obj = __dev_map_lookup_elem(in_map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			obj = __dev_map_hash_lookup_elem(in_map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +		if (!obj)
> +			goto find_next;
> +
> +		if (ex_map && !dev_in_exclude_map(obj, ex_map)) {
> +			dev = obj->dev;
> +
> +			if (!dev->netdev_ops->ndo_xdp_xmit)
> +				return -EOPNOTSUPP;
> +
> +			err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> +			if (unlikely(err))
> +				return err;
> +
> +			nxdpf = xdpf_clone(xdpf);
> +			if (unlikely(!nxdpf))
> +				return -ENOMEM;
> +
> +			bq_enqueue(dev, nxdpf, dev_rx);
> +		}
> +find_next:
> +		err = devmap_get_next_key(in_map, &key, &next_key);
> +		if (err)
> +			break;
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog)
>  {
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index d541c8486c95..4e0a2eebd38d 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1853,7 +1853,7 @@ static struct bpf_map *htab_of_map_alloc(union bpf_attr *attr)
>  	return map;
>  }
>  
> -static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
> +void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
>  {
>  	struct bpf_map **inner_map  = htab_map_lookup_elem(map, key);
>  
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 04c6630cc18f..84d23418823a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3898,7 +3898,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		break;
>  	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
>  	case BPF_MAP_TYPE_HASH_OF_MAPS:
> -		if (func_id != BPF_FUNC_map_lookup_elem)
> +		/* Used by multicast redirect */
> +		if (func_id != BPF_FUNC_redirect_map &&
> +		    func_id != BPF_FUNC_map_lookup_elem)
>  			goto error;
>  		break;
>  	case BPF_MAP_TYPE_SOCKMAP:
> @@ -3968,8 +3970,17 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
>  		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH &&
>  		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
> -		    map->map_type != BPF_MAP_TYPE_XSKMAP)
> +		    map->map_type != BPF_MAP_TYPE_XSKMAP &&
> +		    map->map_type != BPF_MAP_TYPE_ARRAY_OF_MAPS &&
> +		    map->map_type != BPF_MAP_TYPE_HASH_OF_MAPS)
>  			goto error;
> +		if (map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
> +		    map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
> +			/* FIXME: Maybe we should also strict the key size here ?? */
> +			if (map->inner_map_meta->map_type != BPF_MAP_TYPE_DEVMAP &&
> +			    map->inner_map_meta->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
> +				goto error;
> +		}
>  		break;
>  	case BPF_FUNC_sk_redirect_map:
>  	case BPF_FUNC_msg_redirect_map:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7628b947dbc3..7d2076f5b0a4 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>  };
>  
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> -			    struct bpf_map *map, struct xdp_buff *xdp)
> +			    struct bpf_map *map, struct xdp_buff *xdp,
> +			    u32 index)
>  {
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
> +		/* fall through */
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
>  		return dev_map_enqueue(fwd, xdp, dev_rx);
> +	case BPF_MAP_TYPE_HASH_OF_MAPS:
> +	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
> +		return dev_map_enqueue_multi(xdp, dev_rx, map, index);
>  	case BPF_MAP_TYPE_CPUMAP:
>  		return cpu_map_enqueue(fwd, xdp, dev_rx);
>  	case BPF_MAP_TYPE_XSKMAP:
> @@ -3508,6 +3513,10 @@ static inline void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
>  		return __cpu_map_lookup_elem(map, index);
>  	case BPF_MAP_TYPE_XSKMAP:
>  		return __xsk_map_lookup_elem(map, index);
> +	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
> +		return array_of_map_lookup_elem(map, (index >> 16) << 16);
> +	case BPF_MAP_TYPE_HASH_OF_MAPS:
> +		return htab_of_map_lookup_elem(map, (index >> 16) << 16);
>  	default:
>  		return NULL;
>  	}
> @@ -3552,7 +3561,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  		err = dev_xdp_enqueue(fwd, xdp, dev);
>  	} else {
> -		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
> +		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, index);
>  	}
>  
>  	if (unlikely(err))
> @@ -3566,6 +3575,55 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>  
> +static int dev_map_redirect_multi(struct sk_buff *skb, struct bpf_prog *xdp_prog,
> +				  struct bpf_map *map, u32 index)
> +
> +{
> +	struct bpf_map *in_map, *ex_map;
> +	struct bpf_dtab_netdev *dst;
> +	u32 in_index, ex_index;
> +	struct sk_buff *nskb;
> +	u32 key, next_key;
> +	int err;
> +	void *fwd;
> +
> +	in_index = index >> 16;
> +	in_index = in_index << 16;
> +	ex_index = in_index ^ index;
> +
> +	in_map = map->ops->map_lookup_elem(map, &in_index);
> +	/* ex_map could be NULL */
> +	ex_map = map->ops->map_lookup_elem(map, &ex_index);
> +
> +	in_map->ops->map_get_next_key(in_map, NULL, &key);
> +
> +	for (;;) {
> +		fwd = __xdp_map_lookup_elem(in_map, key);
> +		if (fwd) {
> +			dst = (struct bpf_dtab_netdev *)fwd;
> +			if (ex_map && dev_in_exclude_map(dst, ex_map))
> +				goto find_next;
> +
> +			nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (!nskb)
> +				return -EOVERFLOW;
> +
> +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> +			if (unlikely(err))
> +				return err;
> +		}
> +
> +find_next:
> +		err = in_map->ops->map_get_next_key(in_map, &key, &next_key);
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}
> +
>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct sk_buff *skb,
>  				       struct xdp_buff *xdp,
> @@ -3588,6 +3646,13 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  		err = dev_map_generic_redirect(dst, skb, xdp_prog);
>  		if (unlikely(err))
>  			goto err;
> +	} else if (map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
> +		   map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
> +		/* Do multicast redirecting */
> +		err = dev_map_redirect_multi(skb, xdp_prog, map, index);
> +		if (unlikely(err))
> +			goto err;
> +		consume_skb(skb);
>  	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
>  		struct xdp_sock *xs = fwd;
>  
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 4c7ea85486af..70dfb4910f84 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>  	return xdpf;
>  }
>  EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
> +
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> +{
> +	unsigned int headroom, totalsize;
> +	struct xdp_frame *nxdpf;
> +	struct page *page;
> +	void *addr;
> +
> +	headroom = xdpf->headroom + sizeof(*xdpf);
> +	totalsize = headroom + xdpf->len;
> +
> +	if (unlikely(totalsize > PAGE_SIZE))
> +		return NULL;
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +	addr = page_to_virt(page);
> +
> +	memcpy(addr, xdpf, totalsize);
> +
> +	nxdpf = addr;
> +	nxdpf->data = addr + headroom;
> +
> +	return nxdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdpf_clone);
> -- 
> 2.19.2
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
@ 2020-04-24  8:56 ` Hangbin Liu
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
  2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-05-26 14:05 ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  4 siblings, 2 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-04-24  8:56 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

Hi all,

This is a prototype for xdp multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple
ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I re-implement a new helper bpf_redirect_map_multi()
to accept two maps, the forwarding map and exclude map. If user
don't want to use exclude map and just want simply stop redirecting back
to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.

For this RFC series we are primarily looking for feedback on the concept
and API: the example in patch 2 is functional, but not a lot of effort
has been made on performance optimisation.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v2: Discussed with Jiri, Toke, Jesper, Eelco, we think the v1 is doing
a trick and may make user confused. So let's just add a new helper
to make the implemention more clear.

Hangbin Liu (2):
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test

 include/linux/bpf.h                       |  20 +++
 include/linux/filter.h                    |   1 +
 include/net/xdp.h                         |   1 +
 include/uapi/linux/bpf.h                  |  23 ++-
 kernel/bpf/devmap.c                       | 114 +++++++++++++++
 kernel/bpf/verifier.c                     |   6 +
 net/core/filter.c                         |  98 ++++++++++++-
 net/core/xdp.c                            |  26 ++++
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 124 ++++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 100 +++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 170 ++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h            |  23 ++-
 13 files changed, 702 insertions(+), 7 deletions(-)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

-- 
2.19.2


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-04-24  8:56   ` Hangbin Liu
  2020-04-24 14:19     ` Lorenzo Bianconi
  2020-04-24 14:34     ` Toke Høiland-Jørgensen
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
  1 sibling, 2 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-04-24  8:56 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a prototype for xdp multicast support. In this implemention we
add a new helper to accept two maps, forward map and exclude map.
We will redirect the packet to all the interfaces in *forward map*, but
exclude the interfaces that in *exclude map*.

To achive this I add a new ex_map for struct bpf_redirect_info.
in the helper I set tgt_value to NULL to make a difference with
bpf_xdp_redirect_map()

We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
create a exclude map for each interface and just want to exclude the
ingress interface.

The general data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls to
get better performace.

v2: add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h            |  20 ++++++
 include/linux/filter.h         |   1 +
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  23 ++++++-
 kernel/bpf/devmap.c            | 114 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |   6 ++
 net/core/filter.c              |  98 ++++++++++++++++++++++++++--
 net/core/xdp.c                 |  26 ++++++++
 tools/include/uapi/linux/bpf.h |  23 ++++++-
 9 files changed, 305 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fd2b2322412d..3fd2903def3f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1161,6 +1161,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 
@@ -1297,6 +1302,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	return false;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9b5aa5c483cc..5b4e1ccd2d37 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -614,6 +614,7 @@ struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	struct bpf_map *ex_map;
 	u32 kern_flags;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 40c6d3398458..a214dce8579c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -92,6 +92,7 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
 }
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 /* Convert xdp_buff to xdp_frame */
 static inline
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2e29a671d67e..1dbe42290223 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3025,6 +3025,21 @@ union bpf_attr {
  *		* **-EOPNOTSUPP**	Unsupported operation, for example a
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to all the interfaces in *map*, and
+ * 		exclude the interfaces that in *ex_map*. The *ex_map* could
+ * 		be NULL.
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which could exlcude redirect to the ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3151,7 +3166,8 @@ union bpf_attr {
 	FN(xdp_output),			\
 	FN(get_netns_cookie),		\
 	FN(get_current_ancestor_cgroup_id),	\
-	FN(sk_assign),
+	FN(sk_assign),			\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 58bdca5d978a..34b171f7826c 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -456,6 +456,120 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct call in fast path instead of  map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	struct bpf_dtab_netdev *in_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	if (!map)
+		return false;
+
+	if (obj->dev->ifindex == exclude_ifindex)
+		return true;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			in_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			in_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	struct bpf_dtab_netdev *obj = NULL;
+	struct xdp_frame *xdpf, *nxdpf;
+	struct net_device *dev;
+	u32 key, next_key;
+	int err;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (!obj || dev_in_exclude_map(obj, ex_map,
+					       exclude_ingress ? dev_rx->ifindex : 0))
+			goto find_next;
+
+		dev = obj->dev;
+
+		if (!dev->netdev_ops->ndo_xdp_xmit)
+			return -EOPNOTSUPP;
+
+		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+		if (unlikely(err))
+			return err;
+
+		nxdpf = xdpf_clone(xdpf);
+		if (unlikely(!nxdpf))
+			return -ENOMEM;
+
+		bq_enqueue(dev, nxdpf, dev_rx);
+
+find_next:
+		err = devmap_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+		key = next_key;
+	}
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 38cfcf701eeb..f77213a0e354 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3880,6 +3880,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_redirect_map_multi &&
 		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
@@ -3970,6 +3971,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map_multi:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
+		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+			goto error;
+		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
 	case BPF_FUNC_sock_map_update:
diff --git a/net/core/filter.c b/net/core/filter.c
index 7d6ceaa54d21..94d1530e5ac6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    struct bpf_map *ex_map, bool exclude_ingress)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
+		if (fwd)
+			return dev_map_enqueue(fwd, xdp, dev_rx);
+		else
+			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
+						     exclude_ingress);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
 	struct bpf_map *map = READ_ONCE(ri->map);
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
@@ -3552,7 +3559,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
 	}
 
 	if (unlikely(err))
@@ -3566,6 +3573,49 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, struct bpf_map *ex_map,
+				  bool exclude_ingress)
+
+{
+	struct bpf_dtab_netdev *dst;
+	struct sk_buff *nskb;
+	u32 key, next_key;
+	int err;
+	void *fwd;
+
+	/* Get first key from forward map */
+	map->ops->map_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (dev_in_exclude_map(dst, ex_map,
+					       exclude_ingress ? dev->ifindex : 0))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -EOVERFLOW;
+
+			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+			if (unlikely(err))
+				return err;
+		}
+
+find_next:
+		err = map->ops->map_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3573,6 +3623,8 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err = 0;
@@ -3583,9 +3635,16 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
 	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		struct bpf_dtab_netdev *dst = fwd;
+		if (fwd) {
+			struct bpf_dtab_netdev *dst = fwd;
+
+			err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		} else {
+			/* Deal with multicast maps */
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ex_map, exclude_ingress);
+		}
 
-		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
@@ -3699,6 +3758,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
+	   struct bpf_map *, ex_map, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
+		return XDP_ABORTED;
+
+	ri->tgt_index = 0;
+	ri->tgt_value = NULL;
+	ri->flags = flags;
+
+	WRITE_ONCE(ri->map, map);
+	WRITE_ONCE(ri->ex_map, ex_map);
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
+	.func           = bpf_xdp_redirect_map_multi,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -6304,6 +6390,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_redirect_map_multi:
+		return &bpf_xdp_redirect_map_multi_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4c7ea85486af..70dfb4910f84 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	return xdpf;
 }
 EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2e29a671d67e..1dbe42290223 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3025,6 +3025,21 @@ union bpf_attr {
  *		* **-EOPNOTSUPP**	Unsupported operation, for example a
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to all the interfaces in *map*, and
+ * 		exclude the interfaces that in *ex_map*. The *ex_map* could
+ * 		be NULL.
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which could exlcude redirect to the ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3151,7 +3166,8 @@ union bpf_attr {
 	FN(xdp_output),			\
 	FN(get_netns_cookie),		\
 	FN(get_current_ancestor_cgroup_id),	\
-	FN(sk_assign),
+	FN(sk_assign),			\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.19.2


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-04-24  8:56   ` Hangbin Liu
  2020-04-24 14:21     ` Lorenzo Bianconi
  1 sibling, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-04-24  8:56 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a sample for xdp multicast. In the sample we have 3 forward
groups and 1 exclude group. It will redirect each interface's
packets to all the interfaces in the forward group, and exclude
the interface in exclude map.

For more testing details, please see the test description in
xdp_redirect_map_multi.sh.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 124 ++++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 100 +++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 170 ++++++++++++++++++++++
 4 files changed, 397 insertions(+)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 424f6fe7ce38..eb7306efe85e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multi_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multi.sh b/samples/bpf/xdp_redirect_map_multi.sh
new file mode 100755
index 000000000000..1999f261a1e8
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi.sh
@@ -0,0 +1,124 @@
+#!/bin/bash
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Forward multicast groups:
+#     Forward group all has interfaces: veth1, veth2, veth3, veth4 (All traffic except IPv4, IPv6)
+#     Forward group v4 has interfaces: veth1, veth3, veth4 (For IPv4 traffic only)
+#     Forward group v6 has interfaces: veth2, veth3, veth4 (For IPv6 traffic only)
+# Exclude Groups:
+#     Exclude group: veth3 (assume ns3 is in black list)
+#
+# Test modules:
+# XDP modes: generic, native
+# map types: group v4 use DEVMAP, others use DEVMAP_HASH
+#
+# Test cases:
+#     ARP(we didn't exclude ns3 in kern.c for ARP):
+#        ns1 -> gw: ns2, ns3, ns4 should receive the arp request
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (fail), ns1 -> ns4 (pass)
+#     IPv6
+#        ns2 -> ns1 (fail), ns2 -> ns3 (fail), ns2 -> ns4 (pass)
+#
+
+
+# netns numbers
+NUM=4
+IFACES=""
+DRV_MODE="drv generic"
+
+test_pass()
+{
+	echo "Pass: $@"
+}
+
+test_fail()
+{
+	echo "fail: $@"
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip netns del ns$i
+	done
+}
+
+setup_ns()
+{
+	local mode=$1
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth0 type veth peer name veth$i
+	        ip link set veth0 netns ns$i
+		ip netns exec ns$i ip link set veth0 up
+		ip link set veth$i up
+
+		ip netns exec ns$i ip addr add 192.0.2.$i/24 dev veth0
+		ip netns exec ns$i ip addr add 2001:db8::$i/24 dev veth0
+		ip netns exec ns$i ip link set veth0 xdp$mode obj \
+			xdp_redirect_map_multi_kern.o sec xdp_redirect_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_tests()
+{
+	local drv_mode=$1
+	local drv_p
+
+	[ ${drv_mode} == "drv" ] && drv_p="-N" || drv_p="-S"
+
+	./xdp_redirect_map_multi $drv_p $IFACES &> xdp_${drv_mode}.log &
+	xdp_pid=$!
+	sleep 10
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${drv_mode}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${drv_mode}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${drv_mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-2_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-2" || test_fail "$drv_mode arp ns1-2"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-3_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-3" || test_fail "$drv_mode arp ns1-3"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-4_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-4" || test_fail "$drv_mode arp ns1-4"
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-2" || test_pass "$drv_mode ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-3" || test_pass "$drv_mode ping ns1-3"
+	ip netns exec ns1 ping 192.0.2.4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping ns1-4" || test_fail "$drv_mode ping ns1-4"
+
+	# ping6 test
+	ip netns exec ns2 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-1" || test_pass "$drv_mode ping6 ns2-1"
+	ip netns exec ns2 ping6 2001:db8::3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-3" || test_pass "$drv_mode ping6 ns2-3"
+	ip netns exec ns2 ping6 2001:db8::4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping6 ns2-4" || test_fail "$drv_mode ping6 ns2-4"
+
+	kill $xdp_pid
+}
+
+for mode in ${DRV_MODE}; do
+	sleep 2
+	setup_ns $mode
+	do_tests $mode
+	sleep 20
+	clean_up
+done
diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
new file mode 100644
index 000000000000..c98985683ba2
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_kern.c
@@ -0,0 +1,100 @@
+/*
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+/* In this sample we will use 3 forward maps and 1 exclude map to
+ * show how to use the helper bpf_redirect_map_multi().
+ *
+ * In real world, there may have multi forward maps and exclude map. You can
+ * use map-in-map type to store the forward and exlude maps. e.g.
+ * forward_map_in_map[group_a_index] = forward_group_a_map
+ * forward_map_in_map[group_b_index] = forward_group_b_map
+ * exclude_map_in_map[iface_1_index] = iface_1_exclude_map
+ * exclude_map_in_map[iface_2_index] = iface_2_exclude_map
+ * Then store the forward group indexes based on IP/MAC policy in another
+ * hash map, e.g.:
+ * mcast_route_map[hash(subnet_a)] = group_a_index
+ * mcast_route_map[hash(subnet_b)] = group_b_index
+ *
+ * You can init the maps in user.c, and find the forward group index from
+ * mcast_route_map bye key hash(subnet) in kern.c, Then you could find
+ * the forward group by the group index. You can also get the exclude map
+ * simply by iface index in exclude_map_in_map.
+ */
+struct bpf_map_def SEC("maps") forward_map_v4 = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_v6 = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_all = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") exclude_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	u32 key, mcast_group_id, exclude_group_id;
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int *inmap_id;
+	u16 h_proto;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_IP))
+		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else if (h_proto == htons(ETH_P_IPV6))
+		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else
+		return bpf_redirect_map_multi(&forward_map_all, NULL,
+					      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
new file mode 100644
index 000000000000..2fcd15322201
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -0,0 +1,170 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ */
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static int ifaces[MAX_IFACE_NUM] = {};
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ... \n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int prog_fd, group_all, group_v4, group_v6, exclude;
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i ++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
+	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
+	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
+	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
+
+	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* For testing: remove the 2nd interfaces from group v4 */
+		if (i != 1) {
+			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: remove the 1st interfaces from group v6 */
+		if (i != 0) {
+			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: add the 3rd interfaces to exclude map */
+		if (i == 2) {
+			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+	}
+
+	sleep(600);
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.19.2


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-04-24 14:19     ` Lorenzo Bianconi
  2020-04-28 11:09       ` Eelco Chaudron
  2020-05-06  9:35       ` Hangbin Liu
  2020-04-24 14:34     ` Toke Høiland-Jørgensen
  1 sibling, 2 replies; 72+ messages in thread
From: Lorenzo Bianconi @ 2020-04-24 14:19 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann


[-- Attachment #1: Type: text/plain, Size: 13985 bytes --]

> This is a prototype for xdp multicast support. In this implemention we
> add a new helper to accept two maps, forward map and exclude map.
> We will redirect the packet to all the interfaces in *forward map*, but
> exclude the interfaces that in *exclude map*.
> 
> To achive this I add a new ex_map for struct bpf_redirect_info.
> in the helper I set tgt_value to NULL to make a difference with
> bpf_xdp_redirect_map()
> 
> We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
> create a exclude map for each interface and just want to exclude the
> ingress interface.
> 
> The general data path is kept in net/core/filter.c. The native data
> path is in kernel/bpf/devmap.c so we can use direct calls to
> get better performace.
> 
> v2: add new syscall bpf_xdp_redirect_map_multi() which could accept
> include/exclude maps directly.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/bpf.h            |  20 ++++++
>  include/linux/filter.h         |   1 +
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  23 ++++++-
>  kernel/bpf/devmap.c            | 114 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |   6 ++
>  net/core/filter.c              |  98 ++++++++++++++++++++++++++--
>  net/core/xdp.c                 |  26 ++++++++
>  tools/include/uapi/linux/bpf.h |  23 ++++++-
>  9 files changed, 305 insertions(+), 7 deletions(-)
> 

[...]

> +{
> +
> +	switch (map->map_type) {
> +	case BPF_MAP_TYPE_DEVMAP:
> +		return dev_map_get_next_key(map, key, next_key);
> +	case BPF_MAP_TYPE_DEVMAP_HASH:
> +		return dev_map_hash_get_next_key(map, key, next_key);
> +	default:
> +		break;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	struct bpf_dtab_netdev *in_obj = NULL;
> +	u32 key, next_key;
> +	int err;
> +
> +	if (!map)
> +		return false;

doing so it seems mandatory to define an exclude_map even if we want just to do
not forward the packet to the "ingress" interface.
Moreover I was thinking that we can assume to never forward to in the incoming
interface. Doing so the code would be simpler I guess. Is there a use case for
it? (forward even to the ingress interface)

> +
> +	if (obj->dev->ifindex == exclude_ifindex)
> +		return true;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			in_obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			in_obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
> +			return true;
> +
> +		err = devmap_get_next_key(map, &key, &next_key);
> +
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return false;
> +}
> +
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	struct bpf_dtab_netdev *obj = NULL;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	struct net_device *dev;
> +	u32 key, next_key;
> +	int err;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (!obj || dev_in_exclude_map(obj, ex_map,
> +					       exclude_ingress ? dev_rx->ifindex : 0))
> +			goto find_next;
> +
> +		dev = obj->dev;
> +
> +		if (!dev->netdev_ops->ndo_xdp_xmit)
> +			return -EOPNOTSUPP;
> +
> +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> +		if (unlikely(err))
> +			return err;
> +
> +		nxdpf = xdpf_clone(xdpf);
> +		if (unlikely(!nxdpf))
> +			return -ENOMEM;
> +
> +		bq_enqueue(dev, nxdpf, dev_rx);
> +
> +find_next:
> +		err = devmap_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +		key = next_key;
> +	}

Do we need to free 'incoming' xdp buffer here? I think most of the drivers assume
the packet is owned by the stack if xdp_do_redirect returns 0

> +
> +	return 0;
> +}
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog)
>  {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 38cfcf701eeb..f77213a0e354 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3880,6 +3880,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
>  		if (func_id != BPF_FUNC_redirect_map &&
> +		    func_id != BPF_FUNC_redirect_map_multi &&
>  		    func_id != BPF_FUNC_map_lookup_elem)
>  			goto error;
>  		break;
> @@ -3970,6 +3971,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		    map->map_type != BPF_MAP_TYPE_XSKMAP)
>  			goto error;
>  		break;
> +	case BPF_FUNC_redirect_map_multi:
> +		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
> +		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
> +			goto error;
> +		break;
>  	case BPF_FUNC_sk_redirect_map:
>  	case BPF_FUNC_msg_redirect_map:
>  	case BPF_FUNC_sock_map_update:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7d6ceaa54d21..94d1530e5ac6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>  };
>  
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> -			    struct bpf_map *map, struct xdp_buff *xdp)
> +			    struct bpf_map *map, struct xdp_buff *xdp,
> +			    struct bpf_map *ex_map, bool exclude_ingress)
>  {
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		return dev_map_enqueue(fwd, xdp, dev_rx);
> +		if (fwd)
> +			return dev_map_enqueue(fwd, xdp, dev_rx);
> +		else
> +			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
> +						     exclude_ingress);

I guess it would be better to do not make it the default case. Maybe you can
add a bit in flags to mark it for "multicast"

>  	case BPF_MAP_TYPE_CPUMAP:
>  		return cpu_map_enqueue(fwd, xdp, dev_rx);
>  	case BPF_MAP_TYPE_XSKMAP:
> @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>  	struct bpf_map *map = READ_ONCE(ri->map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
> @@ -3552,7 +3559,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  		err = dev_xdp_enqueue(fwd, xdp, dev);
>  	} else {
> -		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
> +		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
>  	}
>  
>  	if (unlikely(err))
> @@ -3566,6 +3573,49 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>  
> +static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
> +				  struct bpf_prog *xdp_prog,
> +				  struct bpf_map *map, struct bpf_map *ex_map,
> +				  bool exclude_ingress)
> +
> +{
> +	struct bpf_dtab_netdev *dst;
> +	struct sk_buff *nskb;
> +	u32 key, next_key;
> +	int err;
> +	void *fwd;
> +
> +	/* Get first key from forward map */
> +	map->ops->map_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		fwd = __xdp_map_lookup_elem(map, key);
> +		if (fwd) {
> +			dst = (struct bpf_dtab_netdev *)fwd;
> +			if (dev_in_exclude_map(dst, ex_map,
> +					       exclude_ingress ? dev->ifindex : 0))
> +				goto find_next;
> +
> +			nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (!nskb)
> +				return -EOVERFLOW;
> +
> +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> +			if (unlikely(err))
> +				return err;
> +		}
> +
> +find_next:
> +		err = map->ops->map_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}
> +
>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct sk_buff *skb,
>  				       struct xdp_buff *xdp,
> @@ -3573,6 +3623,8 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct bpf_map *map)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
>  	int err = 0;
> @@ -3583,9 +3635,16 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  
>  	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
>  	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
> -		struct bpf_dtab_netdev *dst = fwd;
> +		if (fwd) {
> +			struct bpf_dtab_netdev *dst = fwd;
> +
> +			err = dev_map_generic_redirect(dst, skb, xdp_prog);
> +		} else {
> +			/* Deal with multicast maps */
> +			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> +						     ex_map, exclude_ingress);
> +		}
>  
> -		err = dev_map_generic_redirect(dst, skb, xdp_prog);
>  		if (unlikely(err))
>  			goto err;
>  	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
> @@ -3699,6 +3758,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
>  	.arg3_type      = ARG_ANYTHING,
>  };
>  
> +BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
> +	   struct bpf_map *, ex_map, u64, flags)
> +{
> +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
> +		return XDP_ABORTED;
> +
> +	ri->tgt_index = 0;
> +	ri->tgt_value = NULL;
> +	ri->flags = flags;
> +
> +	WRITE_ONCE(ri->map, map);
> +	WRITE_ONCE(ri->ex_map, ex_map);
> +
> +	return XDP_REDIRECT;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
> +	.func           = bpf_xdp_redirect_map_multi,
> +	.gpl_only       = false,
> +	.ret_type       = RET_INTEGER,
> +	.arg1_type      = ARG_CONST_MAP_PTR,
> +	.arg1_type      = ARG_CONST_MAP_PTR,
> +	.arg3_type      = ARG_ANYTHING,
> +};
> +
>  static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
>  				  unsigned long off, unsigned long len)
>  {
> @@ -6304,6 +6390,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_xdp_redirect_proto;
>  	case BPF_FUNC_redirect_map:
>  		return &bpf_xdp_redirect_map_proto;
> +	case BPF_FUNC_redirect_map_multi:
> +		return &bpf_xdp_redirect_map_multi_proto;
>  	case BPF_FUNC_xdp_adjust_tail:
>  		return &bpf_xdp_adjust_tail_proto;
>  	case BPF_FUNC_fib_lookup:
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 4c7ea85486af..70dfb4910f84 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>  	return xdpf;
>  }
>  EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
> +
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> +{
> +	unsigned int headroom, totalsize;
> +	struct xdp_frame *nxdpf;
> +	struct page *page;
> +	void *addr;
> +
> +	headroom = xdpf->headroom + sizeof(*xdpf);
> +	totalsize = headroom + xdpf->len;
> +
> +	if (unlikely(totalsize > PAGE_SIZE))
> +		return NULL;
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +	addr = page_to_virt(page);
> +
> +	memcpy(addr, xdpf, totalsize);
> +
> +	nxdpf = addr;
> +	nxdpf->data = addr + headroom;
> +
> +	return nxdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdpf_clone);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 2e29a671d67e..1dbe42290223 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -3025,6 +3025,21 @@ union bpf_attr {
>   *		* **-EOPNOTSUPP**	Unsupported operation, for example a
>   *					call from outside of TC ingress.
>   *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		Redirect the packet to all the interfaces in *map*, and
> + * 		exclude the interfaces that in *ex_map*. The *ex_map* could
> + * 		be NULL.
> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which could exlcude redirect to the ingress device.
> + *
> + * 		See also bpf_redirect_map(), which supports redirecting
> + * 		packet to a specific ifindex in the map.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3151,7 +3166,8 @@ union bpf_attr {
>  	FN(xdp_output),			\
>  	FN(get_netns_cookie),		\
>  	FN(get_current_ancestor_cgroup_id),	\
> -	FN(sk_assign),
> +	FN(sk_assign),			\
> +	FN(redirect_map_multi),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
>  	BPF_LWT_ENCAP_IP,
>  };
>  
> +/* BPF_FUNC_redirect_map_multi flags. */
> +enum {
> +	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
> +};
> +
>  #define __bpf_md_ptr(type, name)	\
>  union {					\
>  	type name;			\
> -- 
> 2.19.2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
@ 2020-04-24 14:21     ` Lorenzo Bianconi
  0 siblings, 0 replies; 72+ messages in thread
From: Lorenzo Bianconi @ 2020-04-24 14:21 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann


[-- Attachment #1: Type: text/plain, Size: 6979 bytes --]

On Apr 24, Hangbin Liu wrote:
> This is a sample for xdp multicast. In the sample we have 3 forward
> groups and 1 exclude group. It will redirect each interface's
> packets to all the interfaces in the forward group, and exclude
> the interface in exclude map.
> 
> For more testing details, please see the test description in
> xdp_redirect_map_multi.sh.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  samples/bpf/Makefile                      |   3 +
>  samples/bpf/xdp_redirect_map_multi.sh     | 124 ++++++++++++++++
>  samples/bpf/xdp_redirect_map_multi_kern.c | 100 +++++++++++++
>  samples/bpf/xdp_redirect_map_multi_user.c | 170 ++++++++++++++++++++++
>  4 files changed, 397 insertions(+)
>  create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
>  create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
>  create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 424f6fe7ce38..eb7306efe85e 100644

[...]

> +
> +SEC("xdp_redirect_map_multi")
> +int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
> +{
> +	u32 key, mcast_group_id, exclude_group_id;
> +	void *data_end = (void *)(long)ctx->data_end;
> +	void *data = (void *)(long)ctx->data;
> +	struct ethhdr *eth = data;
> +	int *inmap_id;
> +	u16 h_proto;
> +	u64 nh_off;
> +
> +	nh_off = sizeof(*eth);
> +	if (data + nh_off > data_end)
> +		return XDP_DROP;
> +
> +	h_proto = eth->h_proto;
> +
> +	if (h_proto == htons(ETH_P_IP))
> +		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
> +					      BPF_F_EXCLUDE_INGRESS);

Do we need the 'BPF_F_EXCLUDE_INGRESS' here?

> +	else if (h_proto == htons(ETH_P_IPV6))
> +		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
> +					      BPF_F_EXCLUDE_INGRESS);

ditto

> +	else
> +		return bpf_redirect_map_multi(&forward_map_all, NULL,
> +					      BPF_F_EXCLUDE_INGRESS);
> +}
> +
> +SEC("xdp_redirect_dummy")
> +int xdp_redirect_dummy_prog(struct xdp_md *ctx)
> +{
> +	return XDP_PASS;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
> new file mode 100644
> index 000000000000..2fcd15322201
> --- /dev/null
> +++ b/samples/bpf/xdp_redirect_map_multi_user.c
> @@ -0,0 +1,170 @@
> +/* SPDX-License-Identifier: GPL-2.0-only
> + */
> +#include <linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <net/if.h>
> +#include <unistd.h>
> +#include <libgen.h>
> +
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#define MAX_IFACE_NUM 32
> +
> +static int ifaces[MAX_IFACE_NUM] = {};
> +static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
> +
> +static void int_exit(int sig)
> +{
> +	__u32 prog_id = 0;
> +	int i;
> +
> +	for (i = 0; ifaces[i] > 0; i++) {
> +		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
> +			printf("bpf_get_link_xdp_id failed\n");
> +			exit(1);
> +		}
> +		if (prog_id)
> +			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
> +	}
> +
> +	exit(0);
> +}
> +
> +static void usage(const char *prog)
> +{
> +	fprintf(stderr,
> +		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ... \n"
> +		"OPTS:\n"
> +		"    -S    use skb-mode\n"
> +		"    -N    enforce native mode\n"
> +		"    -F    force loading prog\n",
> +		prog);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int prog_fd, group_all, group_v4, group_v6, exclude;
> +	struct bpf_prog_load_attr prog_load_attr = {
> +		.prog_type      = BPF_PROG_TYPE_XDP,
> +	};
> +	int i, ret, opt, ifindex;
> +	char ifname[IF_NAMESIZE];
> +	struct bpf_object *obj;
> +	char filename[256];
> +
> +	while ((opt = getopt(argc, argv, "SNF")) != -1) {
> +		switch (opt) {
> +		case 'S':
> +			xdp_flags |= XDP_FLAGS_SKB_MODE;
> +			break;
> +		case 'N':
> +			/* default, set below */
> +			break;
> +		case 'F':
> +			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
> +			break;
> +		default:
> +			usage(basename(argv[0]));
> +			return 1;
> +		}
> +	}
> +
> +	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
> +		xdp_flags |= XDP_FLAGS_DRV_MODE;
> +
> +	if (optind == argc) {
> +		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
> +		return 1;
> +	}
> +
> +	printf("Get interfaces");
> +	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i ++) {
> +		ifaces[i] = if_nametoindex(argv[optind + i]);
> +		if (!ifaces[i])
> +			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
> +		if (!if_indextoname(ifaces[i], ifname)) {
> +			perror("Invalid interface name or i");
> +			return 1;
> +		}
> +		printf(" %d", ifaces[i]);
> +	}
> +	printf("\n");
> +
> +	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
> +	prog_load_attr.file = filename;
> +
> +	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> +		return 1;
> +
> +	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
> +	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
> +	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
> +	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
> +
> +	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0) {
> +		printf("bpf_object__find_map_fd_by_name failed\n");
> +		return 1;
> +	}
> +
> +	signal(SIGINT, int_exit);
> +	signal(SIGTERM, int_exit);
> +
> +	/* Init forward multicast groups and exclude group */
> +	for (i = 0; ifaces[i] > 0; i++) {
> +		ifindex = ifaces[i];
> +
> +		/* Add all the interfaces to group all */
> +		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
> +		if (ret) {
> +			perror("bpf_map_update_elem");
> +			goto err_out;
> +		}
> +
> +		/* For testing: remove the 2nd interfaces from group v4 */
> +		if (i != 1) {
> +			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
> +			if (ret) {
> +				perror("bpf_map_update_elem");
> +				goto err_out;
> +			}
> +		}
> +
> +		/* For testing: remove the 1st interfaces from group v6 */
> +		if (i != 0) {
> +			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
> +			if (ret) {
> +				perror("bpf_map_update_elem");
> +				goto err_out;
> +			}
> +		}
> +
> +		/* For testing: add the 3rd interfaces to exclude map */
> +		if (i == 2) {
> +			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
> +			if (ret) {
> +				perror("bpf_map_update_elem");
> +				goto err_out;
> +			}
> +		}
> +
> +		/* bind prog_fd to each interface */
> +		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
> +		if (ret) {
> +			printf("Set xdp fd failed on %d\n", ifindex);
> +			goto err_out;
> +		}
> +
> +	}
> +
> +	sleep(600);
> +	return 0;
> +
> +err_out:
> +	return 1;
> +}
> -- 
> 2.19.2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-04-24 14:19     ` Lorenzo Bianconi
@ 2020-04-24 14:34     ` Toke Høiland-Jørgensen
  2020-05-06  9:14       ` Hangbin Liu
  2020-05-18  8:45       ` Hangbin Liu
  1 sibling, 2 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-24 14:34 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, Hangbin Liu

Hangbin Liu <liuhangbin@gmail.com> writes:

> This is a prototype for xdp multicast support. In this implemention we
> add a new helper to accept two maps, forward map and exclude map.
> We will redirect the packet to all the interfaces in *forward map*, but
> exclude the interfaces that in *exclude map*.

Yeah, the new helper is much cleaner!

> To achive this I add a new ex_map for struct bpf_redirect_info.
> in the helper I set tgt_value to NULL to make a difference with
> bpf_xdp_redirect_map()
>
> We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
> create a exclude map for each interface and just want to exclude the
> ingress interface.
>
> The general data path is kept in net/core/filter.c. The native data
> path is in kernel/bpf/devmap.c so we can use direct calls to
> get better performace.

Got any performance numbers? :)

> v2: add new syscall bpf_xdp_redirect_map_multi() which could accept
> include/exclude maps directly.
>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/bpf.h            |  20 ++++++
>  include/linux/filter.h         |   1 +
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  23 ++++++-
>  kernel/bpf/devmap.c            | 114 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |   6 ++
>  net/core/filter.c              |  98 ++++++++++++++++++++++++++--
>  net/core/xdp.c                 |  26 ++++++++
>  tools/include/uapi/linux/bpf.h |  23 ++++++-
>  9 files changed, 305 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fd2b2322412d..3fd2903def3f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1161,6 +1161,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex);
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress);
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog);
>  
> @@ -1297,6 +1302,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return 0;
>  }
>  
> +static inline
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	return false;
> +}
> +
> +static inline
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	return 0;
> +}
> +
>  struct sk_buff;
>  
>  static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 9b5aa5c483cc..5b4e1ccd2d37 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -614,6 +614,7 @@ struct bpf_redirect_info {
>  	u32 tgt_index;
>  	void *tgt_value;
>  	struct bpf_map *map;
> +	struct bpf_map *ex_map;
>  	u32 kern_flags;
>  };
>  
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 40c6d3398458..a214dce8579c 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -92,6 +92,7 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
>  }
>  
>  struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
>  
>  /* Convert xdp_buff to xdp_frame */
>  static inline
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2e29a671d67e..1dbe42290223 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3025,6 +3025,21 @@ union bpf_attr {
>   *		* **-EOPNOTSUPP**	Unsupported operation, for example a
>   *					call from outside of TC ingress.
>   *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		Redirect the packet to all the interfaces in *map*, and
> + * 		exclude the interfaces that in *ex_map*. The *ex_map* could
> + * 		be NULL.
> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which could exlcude redirect to the ingress device.

I'd suggest rewording this to:

* 		Redirect the packet to ALL the interfaces in *map*, but
* 		exclude the interfaces in *ex_map* (which may be NULL).
*
* 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
* 		which additionally excludes the current ingress device.


> + * 		See also bpf_redirect_map(), which supports redirecting
> + * 		packet to a specific ifindex in the map.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3151,7 +3166,8 @@ union bpf_attr {
>  	FN(xdp_output),			\
>  	FN(get_netns_cookie),		\
>  	FN(get_current_ancestor_cgroup_id),	\
> -	FN(sk_assign),
> +	FN(sk_assign),			\
> +	FN(redirect_map_multi),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
>  	BPF_LWT_ENCAP_IP,
>  };
>  
> +/* BPF_FUNC_redirect_map_multi flags. */
> +enum {
> +	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
> +};
> +
>  #define __bpf_md_ptr(type, name)	\
>  union {					\
>  	type name;			\
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 58bdca5d978a..34b171f7826c 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -456,6 +456,120 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return __xdp_enqueue(dev, xdp, dev_rx);
>  }
>  
> +/* Use direct call in fast path instead of  map->ops->map_get_next_key() */
> +static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +
> +	switch (map->map_type) {
> +	case BPF_MAP_TYPE_DEVMAP:
> +		return dev_map_get_next_key(map, key, next_key);
> +	case BPF_MAP_TYPE_DEVMAP_HASH:
> +		return dev_map_hash_get_next_key(map, key, next_key);
> +	default:
> +		break;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	struct bpf_dtab_netdev *in_obj = NULL;
> +	u32 key, next_key;
> +	int err;
> +
> +	if (!map)
> +		return false;
> +
> +	if (obj->dev->ifindex == exclude_ifindex)
> +		return true;

We probably want the EXCLUDE_INGRESS flag to work even if ex_map is
NULL, right? In that case you want to switch the order of the two checks
above.

> +	devmap_get_next_key(map, NULL, &key);
> +
> +	for (;;) {

I wonder if we should require DEVMAP_HASH maps to be indexed by ifindex
to avoid the loop?

> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			in_obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			in_obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
> +			return true;
> +
> +		err = devmap_get_next_key(map, &key, &next_key);
> +
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return false;
> +}
> +
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	struct bpf_dtab_netdev *obj = NULL;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	struct net_device *dev;
> +	u32 key, next_key;
> +	int err;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;

You do a clone for each map entry below, so I think you end up leaking
this initial xdpf? Also, you'll end up with one clone more than
necessary - redirecting to two interfaces should only require 1 clone,
you're doing 2.

> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (!obj || dev_in_exclude_map(obj, ex_map,
> +					       exclude_ingress ? dev_rx->ifindex : 0))
> +			goto find_next;
> +
> +		dev = obj->dev;
> +
> +		if (!dev->netdev_ops->ndo_xdp_xmit)
> +			return -EOPNOTSUPP;
> +
> +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> +		if (unlikely(err))
> +			return err;

These abort the whole operation midway through the loop if any error
occurs. That is probably not what we want? I think the right thing to do
is just continue the loop and only return an error if *all* of the
forwarding attempts failed. Maybe we need a tracepoint to catch
individual errors?

> +		nxdpf = xdpf_clone(xdpf);
> +		if (unlikely(!nxdpf))
> +			return -ENOMEM;

As this is a memory error it's likely fatal on the nest loop iteration
as well, so probably OK to abort everything here.

> +		bq_enqueue(dev, nxdpf, dev_rx);
> +
> +find_next:
> +		err = devmap_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog)
>  {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 38cfcf701eeb..f77213a0e354 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3880,6 +3880,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
>  		if (func_id != BPF_FUNC_redirect_map &&
> +		    func_id != BPF_FUNC_redirect_map_multi &&
>  		    func_id != BPF_FUNC_map_lookup_elem)
>  			goto error;
>  		break;
> @@ -3970,6 +3971,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		    map->map_type != BPF_MAP_TYPE_XSKMAP)
>  			goto error;
>  		break;
> +	case BPF_FUNC_redirect_map_multi:
> +		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
> +		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
> +			goto error;
> +		break;
>  	case BPF_FUNC_sk_redirect_map:
>  	case BPF_FUNC_msg_redirect_map:
>  	case BPF_FUNC_sock_map_update:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7d6ceaa54d21..94d1530e5ac6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>  };
>  
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> -			    struct bpf_map *map, struct xdp_buff *xdp)
> +			    struct bpf_map *map, struct xdp_buff *xdp,
> +			    struct bpf_map *ex_map, bool exclude_ingress)
>  {
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		return dev_map_enqueue(fwd, xdp, dev_rx);
> +		if (fwd)
> +			return dev_map_enqueue(fwd, xdp, dev_rx);
> +		else
> +			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
> +						     exclude_ingress);
>  	case BPF_MAP_TYPE_CPUMAP:
>  		return cpu_map_enqueue(fwd, xdp, dev_rx);
>  	case BPF_MAP_TYPE_XSKMAP:
> @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);

I don't think you need the READ_ONCE here since there's already one
below?

>  	struct bpf_map *map = READ_ONCE(ri->map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
> @@ -3552,7 +3559,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  		err = dev_xdp_enqueue(fwd, xdp, dev);
>  	} else {
> -		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
> +		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
>  	}
>  
>  	if (unlikely(err))
> @@ -3566,6 +3573,49 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>  
> +static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
> +				  struct bpf_prog *xdp_prog,
> +				  struct bpf_map *map, struct bpf_map *ex_map,
> +				  bool exclude_ingress)
> +
> +{
> +	struct bpf_dtab_netdev *dst;
> +	struct sk_buff *nskb;
> +	u32 key, next_key;
> +	int err;
> +	void *fwd;
> +
> +	/* Get first key from forward map */
> +	map->ops->map_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		fwd = __xdp_map_lookup_elem(map, key);
> +		if (fwd) {
> +			dst = (struct bpf_dtab_netdev *)fwd;
> +			if (dev_in_exclude_map(dst, ex_map,
> +					       exclude_ingress ? dev->ifindex : 0))
> +				goto find_next;
> +
> +			nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (!nskb)
> +				return -EOVERFLOW;
> +
> +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> +			if (unlikely(err))
> +				return err;
> +		}
> +
> +find_next:
> +		err = map->ops->map_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}

This duplication bugs me; maybe we should try to consolidate the generic
and native XDP code paths?

>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct sk_buff *skb,
>  				       struct xdp_buff *xdp,
> @@ -3573,6 +3623,8 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct bpf_map *map)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
>  	int err = 0;
> @@ -3583,9 +3635,16 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  
>  	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
>  	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
> -		struct bpf_dtab_netdev *dst = fwd;
> +		if (fwd) {
> +			struct bpf_dtab_netdev *dst = fwd;
> +
> +			err = dev_map_generic_redirect(dst, skb, xdp_prog);
> +		} else {
> +			/* Deal with multicast maps */
> +			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> +						     ex_map, exclude_ingress);
> +		}
>  
> -		err = dev_map_generic_redirect(dst, skb, xdp_prog);
>  		if (unlikely(err))
>  			goto err;
>  	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
> @@ -3699,6 +3758,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
>  	.arg3_type      = ARG_ANYTHING,
>  };
>  
> +BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
> +	   struct bpf_map *, ex_map, u64, flags)
> +{
> +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
> +		return XDP_ABORTED;
> +
> +	ri->tgt_index = 0;
> +	ri->tgt_value = NULL;
> +	ri->flags = flags;
> +
> +	WRITE_ONCE(ri->map, map);
> +	WRITE_ONCE(ri->ex_map, ex_map);
> +
> +	return XDP_REDIRECT;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
> +	.func           = bpf_xdp_redirect_map_multi,
> +	.gpl_only       = false,
> +	.ret_type       = RET_INTEGER,
> +	.arg1_type      = ARG_CONST_MAP_PTR,
> +	.arg1_type      = ARG_CONST_MAP_PTR,
> +	.arg3_type      = ARG_ANYTHING,
> +};
> +
>  static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
>  				  unsigned long off, unsigned long len)
>  {
> @@ -6304,6 +6390,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_xdp_redirect_proto;
>  	case BPF_FUNC_redirect_map:
>  		return &bpf_xdp_redirect_map_proto;
> +	case BPF_FUNC_redirect_map_multi:
> +		return &bpf_xdp_redirect_map_multi_proto;
>  	case BPF_FUNC_xdp_adjust_tail:
>  		return &bpf_xdp_adjust_tail_proto;
>  	case BPF_FUNC_fib_lookup:
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 4c7ea85486af..70dfb4910f84 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>  	return xdpf;
>  }
>  EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
> +
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> +{
> +	unsigned int headroom, totalsize;
> +	struct xdp_frame *nxdpf;
> +	struct page *page;
> +	void *addr;
> +
> +	headroom = xdpf->headroom + sizeof(*xdpf);
> +	totalsize = headroom + xdpf->len;
> +
> +	if (unlikely(totalsize > PAGE_SIZE))
> +		return NULL;
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +	addr = page_to_virt(page);
> +
> +	memcpy(addr, xdpf, totalsize);
> +
> +	nxdpf = addr;
> +	nxdpf->data = addr + headroom;
> +
> +	return nxdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdpf_clone);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 2e29a671d67e..1dbe42290223 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h

Updates to tools/include should generally go into a separate patch.

> @@ -3025,6 +3025,21 @@ union bpf_attr {
>   *		* **-EOPNOTSUPP**	Unsupported operation, for example a
>   *					call from outside of TC ingress.
>   *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		Redirect the packet to all the interfaces in *map*, and
> + * 		exclude the interfaces that in *ex_map*. The *ex_map* could
> + * 		be NULL.
> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which could exlcude redirect to the ingress device.
> + *
> + * 		See also bpf_redirect_map(), which supports redirecting
> + * 		packet to a specific ifindex in the map.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3151,7 +3166,8 @@ union bpf_attr {
>  	FN(xdp_output),			\
>  	FN(get_netns_cookie),		\
>  	FN(get_current_ancestor_cgroup_id),	\
> -	FN(sk_assign),
> +	FN(sk_assign),			\
> +	FN(redirect_map_multi),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
>  	BPF_LWT_ENCAP_IP,
>  };
>  
> +/* BPF_FUNC_redirect_map_multi flags. */
> +enum {
> +	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
> +};
> +
>  #define __bpf_md_ptr(type, name)	\
>  union {					\
>  	type name;			\
> -- 
> 2.19.2


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24 14:19     ` Lorenzo Bianconi
@ 2020-04-28 11:09       ` Eelco Chaudron
  2020-05-06  9:35       ` Hangbin Liu
  1 sibling, 0 replies; 72+ messages in thread
From: Eelco Chaudron @ 2020-04-28 11:09 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Hangbin Liu, bpf, netdev, Toke Høiland-Jørgensen,
	Jiri Benc, Jesper Dangaard Brouer, ast, Daniel Borkmann



On 24 Apr 2020, at 16:19, Lorenzo Bianconi wrote:

[...]

>> +{
>> +
>> +	switch (map->map_type) {
>> +	case BPF_MAP_TYPE_DEVMAP:
>> +		return dev_map_get_next_key(map, key, next_key);
>> +	case BPF_MAP_TYPE_DEVMAP_HASH:
>> +		return dev_map_hash_get_next_key(map, key, next_key);
>> +	default:
>> +		break;
>> +	}
>> +
>> +	return -ENOENT;
>> +}
>> +
>> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map 
>> *map,
>> +			int exclude_ifindex)
>> +{
>> +	struct bpf_dtab_netdev *in_obj = NULL;
>> +	u32 key, next_key;
>> +	int err;
>> +
>> +	if (!map)
>> +		return false;
>
> doing so it seems mandatory to define an exclude_map even if we want 
> just to do
> not forward the packet to the "ingress" interface.
> Moreover I was thinking that we can assume to never forward to in the 
> incoming
> interface. Doing so the code would be simpler I guess. Is there a use 
> case for
> it? (forward even to the ingress interface)
>

This part I can answer, it’s called VEPA, I think it’s part of IEEE 
802.1Qbg.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24 14:34     ` Toke Høiland-Jørgensen
@ 2020-05-06  9:14       ` Hangbin Liu
  2020-05-06 10:00         ` Toke Høiland-Jørgensen
  2020-05-18  8:45       ` Hangbin Liu
  1 sibling, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-05-06  9:14 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hi Toke,

Thanks for your review, please see replies below.

On Fri, Apr 24, 2020 at 04:34:49PM +0200, Toke Høiland-Jørgensen wrote:
> >
> > The general data path is kept in net/core/filter.c. The native data
> > path is in kernel/bpf/devmap.c so we can use direct calls to
> > get better performace.
> 
> Got any performance numbers? :)

No, I haven't test the performance. Do you have any suggestions about how
to test it? I'd like to try forwarding pkts to 10+ ports. But I don't know
how to test the throughput. I don't think netperf or iperf supports this.
> 
> > + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> > + * 	Description
> > + * 		Redirect the packet to all the interfaces in *map*, and
> > + * 		exclude the interfaces that in *ex_map*. The *ex_map* could
> > + * 		be NULL.
> > + *
> > + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> > + * 		which could exlcude redirect to the ingress device.
> 
> I'd suggest rewording this to:
> 
> * 		Redirect the packet to ALL the interfaces in *map*, but
> * 		exclude the interfaces in *ex_map* (which may be NULL).
> *
> * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> * 		which additionally excludes the current ingress device.

Thanks, I will update it
> > +
> > +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> > +			int exclude_ifindex)
> > +{
> > +	struct bpf_dtab_netdev *in_obj = NULL;
> > +	u32 key, next_key;
> > +	int err;
> > +
> > +	if (!map)
> > +		return false;
> > +
> > +	if (obj->dev->ifindex == exclude_ifindex)
> > +		return true;
> 
> We probably want the EXCLUDE_INGRESS flag to work even if ex_map is
> NULL, right? In that case you want to switch the order of the two checks
> above.

Yes, will fix it.

> 
> > +	devmap_get_next_key(map, NULL, &key);
> > +
> > +	for (;;) {
> 
> I wonder if we should require DEVMAP_HASH maps to be indexed by ifindex
> to avoid the loop?

I guess it's not easy to force user to index the map by ifindex.

> > +	xdpf = convert_to_xdp_frame(xdp);
> > +	if (unlikely(!xdpf))
> > +		return -EOVERFLOW;
> 
> You do a clone for each map entry below, so I think you end up leaking
> this initial xdpf? Also, you'll end up with one clone more than
> necessary - redirecting to two interfaces should only require 1 clone,
> you're doing 2.

We don't know which is the latest one. So we need to keep the initial
for clone. Is it enough to call xdp_release_frame() after the for loop?
> 
> > +	for (;;) {
> > +		switch (map->map_type) {
> > +		case BPF_MAP_TYPE_DEVMAP:
> > +			obj = __dev_map_lookup_elem(map, key);
> > +			break;
> > +		case BPF_MAP_TYPE_DEVMAP_HASH:
> > +			obj = __dev_map_hash_lookup_elem(map, key);
> > +			break;
> > +		default:
> > +			break;
> > +		}
> > +
> > +		if (!obj || dev_in_exclude_map(obj, ex_map,
> > +					       exclude_ingress ? dev_rx->ifindex : 0))
> > +			goto find_next;
> > +
> > +		dev = obj->dev;
> > +
> > +		if (!dev->netdev_ops->ndo_xdp_xmit)
> > +			return -EOPNOTSUPP;
> > +
> > +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> > +		if (unlikely(err))
> > +			return err;
> 
> These abort the whole operation midway through the loop if any error
> occurs. That is probably not what we want? I think the right thing to do
> is just continue the loop and only return an error if *all* of the
> forwarding attempts failed. Maybe we need a tracepoint to catch
> individual errors?

Makes sense. I will see if we can add a tracepoint here.
> >  
> > +static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
> > +				  struct bpf_prog *xdp_prog,
> > +				  struct bpf_map *map, struct bpf_map *ex_map,
> > +				  bool exclude_ingress)
> > +
> > +{
> > +	struct bpf_dtab_netdev *dst;
> > +	struct sk_buff *nskb;
> > +	u32 key, next_key;
> > +	int err;
> > +	void *fwd;
> > +
> > +	/* Get first key from forward map */
> > +	map->ops->map_get_next_key(map, NULL, &key);
> > +
> > +	for (;;) {
> > +		fwd = __xdp_map_lookup_elem(map, key);
> > +		if (fwd) {
> > +			dst = (struct bpf_dtab_netdev *)fwd;
> > +			if (dev_in_exclude_map(dst, ex_map,
> > +					       exclude_ingress ? dev->ifindex : 0))
> > +				goto find_next;
> > +
> > +			nskb = skb_clone(skb, GFP_ATOMIC);
> > +			if (!nskb)
> > +				return -EOVERFLOW;
> > +
> > +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> > +			if (unlikely(err))
> > +				return err;
> > +		}
> > +
> > +find_next:
> > +		err = map->ops->map_get_next_key(map, &key, &next_key);
> > +		if (err)
> > +			break;
> > +
> > +		key = next_key;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> This duplication bugs me; maybe we should try to consolidate the generic
> and native XDP code paths?

Yes, I have tried to combine these two functions together. But one is generic
code path and another is XDP code patch. One use skb_clone and another
use xdpf_clone(). There are also some extra checks for XDP code. So maybe
we'd better just keep it as it is.

> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 2e29a671d67e..1dbe42290223 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> 
> Updates to tools/include should generally go into a separate patch.

Will fix it, thanks.

Best Regards
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24 14:19     ` Lorenzo Bianconi
  2020-04-28 11:09       ` Eelco Chaudron
@ 2020-05-06  9:35       ` Hangbin Liu
  1 sibling, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-05-06  9:35 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann


Hi Lorenzo,

Thanks for the comments, please see replies below.

On Fri, Apr 24, 2020 at 04:19:08PM +0200, Lorenzo Bianconi wrote:
> > +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> > +			int exclude_ifindex)
> > +{
> > +	struct bpf_dtab_netdev *in_obj = NULL;
> > +	u32 key, next_key;
> > +	int err;
> > +
> > +	if (!map)
> > +		return false;
> 
> doing so it seems mandatory to define an exclude_map even if we want just to do
> not forward the packet to the "ingress" interface.
> Moreover I was thinking that we can assume to never forward to in the incoming
> interface. Doing so the code would be simpler I guess. Is there a use case for
> it? (forward even to the ingress interface)

Eelco has help answered one use case: VEPA. Another reason I added this flag
is that the other syscalls like bpf_redirect() or bpf_redirect_map() are
also able to forward to ingress interface. So we need to behave the same
by default.
> 
> > +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> > +			  struct bpf_map *map, struct bpf_map *ex_map,
> > +			  bool exclude_ingress)
> > +{

[...]
> > +	}
> 
> Do we need to free 'incoming' xdp buffer here? I think most of the drivers assume
> the packet is owned by the stack if xdp_do_redirect returns 0

Yes, we need. I will fix it.
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 7d6ceaa54d21..94d1530e5ac6 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
> >  };
> >  
> >  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> > -			    struct bpf_map *map, struct xdp_buff *xdp)
> > +			    struct bpf_map *map, struct xdp_buff *xdp,
> > +			    struct bpf_map *ex_map, bool exclude_ingress)
> >  {
> >  	switch (map->map_type) {
> >  	case BPF_MAP_TYPE_DEVMAP:
> >  	case BPF_MAP_TYPE_DEVMAP_HASH:
> > -		return dev_map_enqueue(fwd, xdp, dev_rx);
> > +		if (fwd)
> > +			return dev_map_enqueue(fwd, xdp, dev_rx);
> > +		else
> > +			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
> > +						     exclude_ingress);
> 
> I guess it would be better to do not make it the default case. Maybe you can
> add a bit in flags to mark it for "multicast"

But how do we distinguish the flag bit with other syscalls? e.g. If we define
0x02 as the "do_multicast" flag. What if other syscalls also used this flag.

Currently __bpf_tx_xdp_map() is only called by xdp_do_redirect(). If there
is a map and no fwd, it must be multicast forward. So we are still safe now.
Maybe we need an update in future.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-06  9:14       ` Hangbin Liu
@ 2020-05-06 10:00         ` Toke Høiland-Jørgensen
  2020-05-08  8:53           ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-05-06 10:00 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> Hi Toke,
>
> Thanks for your review, please see replies below.
>
> On Fri, Apr 24, 2020 at 04:34:49PM +0200, Toke Høiland-Jørgensen wrote:
>> >
>> > The general data path is kept in net/core/filter.c. The native data
>> > path is in kernel/bpf/devmap.c so we can use direct calls to
>> > get better performace.
>> 
>> Got any performance numbers? :)
>
> No, I haven't test the performance. Do you have any suggestions about how
> to test it? I'd like to try forwarding pkts to 10+ ports. But I don't know
> how to test the throughput. I don't think netperf or iperf supports
> this.

What I usually do when benchmarking XDP_REDIRECT is to just use pktgen
(samples/pktgen in the kernel source tree) on another machine,
specifically, like this:

./pktgen_sample03_burst_single_flow.sh  -i enp1s0f1 -d 10.70.2.2 -m ec:0d:9a:db:11:35 -t 4  -s 64

(adjust iface, IP and MAC address to your system, of course). That'll
flood the target machine with small UDP packets. On that machine, I then
run the 'xdp_redirect_map' program from samples/bpf. The bpf program
used by that sample will update an internal counter for every packet,
and the userspace prints it out, which gives you the performance (in
PPS). So just modifying that sample to using your new multicast helper
(and comparing it to regular REDIRECT to a single device) would be a
first approximation of a performance test.

[...]

>> > +	devmap_get_next_key(map, NULL, &key);
>> > +
>> > +	for (;;) {
>> 
>> I wonder if we should require DEVMAP_HASH maps to be indexed by ifindex
>> to avoid the loop?
>
> I guess it's not easy to force user to index the map by ifindex.

Well, the way to 'force the user' is just to assume that this is the
case, and if the map is filled in wrong, things just won't work ;)

>> > +	xdpf = convert_to_xdp_frame(xdp);
>> > +	if (unlikely(!xdpf))
>> > +		return -EOVERFLOW;
>> 
>> You do a clone for each map entry below, so I think you end up leaking
>> this initial xdpf? Also, you'll end up with one clone more than
>> necessary - redirecting to two interfaces should only require 1 clone,
>> you're doing 2.
>
> We don't know which is the latest one. So we need to keep the initial
> for clone. Is it enough to call xdp_release_frame() after the for
> loop?

You could do something like:

bool first = true;
for (;;) {

[...]

           if (!first) {
   		nxdpf = xdpf_clone(xdpf);
   		if (unlikely(!nxdpf))
   			return -ENOMEM;
   		bq_enqueue(dev, nxdpf, dev_rx);
           } else {
   		bq_enqueue(dev, xdpf, dev_rx);
   		first = false;
           }
}

/* didn't find anywhere to forward to, free buf */
if (first)
   xdp_return_frame_rx_napi(xdpf);



[...]

>> This duplication bugs me; maybe we should try to consolidate the generic
>> and native XDP code paths?
>
> Yes, I have tried to combine these two functions together. But one is generic
> code path and another is XDP code patch. One use skb_clone and another
> use xdpf_clone(). There are also some extra checks for XDP code. So maybe
> we'd better just keep it as it is.

Yeah, guess it may not be as simple as I'd like it to be ;)
Let's keep it this way for now at least; we can always consolidate in a
separate patch series.

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-06 10:00         ` Toke Høiland-Jørgensen
@ 2020-05-08  8:53           ` Hangbin Liu
  2020-05-08 14:58             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-05-08  8:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Wed, May 06, 2020 at 12:00:08PM +0200, Toke Høiland-Jørgensen wrote:
> > No, I haven't test the performance. Do you have any suggestions about how
> > to test it? I'd like to try forwarding pkts to 10+ ports. But I don't know
> > how to test the throughput. I don't think netperf or iperf supports
> > this.
> 
> What I usually do when benchmarking XDP_REDIRECT is to just use pktgen
> (samples/pktgen in the kernel source tree) on another machine,
> specifically, like this:
> 
> ./pktgen_sample03_burst_single_flow.sh  -i enp1s0f1 -d 10.70.2.2 -m ec:0d:9a:db:11:35 -t 4  -s 64
> 
> (adjust iface, IP and MAC address to your system, of course). That'll
> flood the target machine with small UDP packets. On that machine, I then
> run the 'xdp_redirect_map' program from samples/bpf. The bpf program
> used by that sample will update an internal counter for every packet,
> and the userspace prints it out, which gives you the performance (in
> PPS). So just modifying that sample to using your new multicast helper
> (and comparing it to regular REDIRECT to a single device) would be a
> first approximation of a performance test.

Thanks for this method. I will update the sample and do some more tests.
> 
> You could do something like:
> 
> bool first = true;
> for (;;) {
> 
> [...]
> 
>            if (!first) {
>    		nxdpf = xdpf_clone(xdpf);
>    		if (unlikely(!nxdpf))
>    			return -ENOMEM;
>    		bq_enqueue(dev, nxdpf, dev_rx);
>            } else {
>    		bq_enqueue(dev, xdpf, dev_rx);
>    		first = false;
>            }
> }
> 
> /* didn't find anywhere to forward to, free buf */
> if (first)
>    xdp_return_frame_rx_napi(xdpf);

I think the first xdpf will be consumed by the driver and the later
xdpf_clone() will failed, won't it?

How about just do a xdp_return_frame_rx_napi(xdpf) after all nxdpf enqueue?

> > @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct
> > xdp_buff *xdp,
> >                   struct bpf_prog *xdp_prog)
> >  {
> >       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > +     bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> > +     struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>
> I don't think you need the READ_ONCE here since there's already one
> below?

BTW, I forgot to ask, why we don't need the READ_ONCE for ex_map?
I though the map and ex_map are two different pointers.

> >       struct bpf_map *map = READ_ONCE(ri->map);

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-08  8:53           ` Hangbin Liu
@ 2020-05-08 14:58             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-05-08 14:58 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Wed, May 06, 2020 at 12:00:08PM +0200, Toke Høiland-Jørgensen wrote:
>> > No, I haven't test the performance. Do you have any suggestions about how
>> > to test it? I'd like to try forwarding pkts to 10+ ports. But I don't know
>> > how to test the throughput. I don't think netperf or iperf supports
>> > this.
>> 
>> What I usually do when benchmarking XDP_REDIRECT is to just use pktgen
>> (samples/pktgen in the kernel source tree) on another machine,
>> specifically, like this:
>> 
>> ./pktgen_sample03_burst_single_flow.sh  -i enp1s0f1 -d 10.70.2.2 -m ec:0d:9a:db:11:35 -t 4  -s 64
>> 
>> (adjust iface, IP and MAC address to your system, of course). That'll
>> flood the target machine with small UDP packets. On that machine, I then
>> run the 'xdp_redirect_map' program from samples/bpf. The bpf program
>> used by that sample will update an internal counter for every packet,
>> and the userspace prints it out, which gives you the performance (in
>> PPS). So just modifying that sample to using your new multicast helper
>> (and comparing it to regular REDIRECT to a single device) would be a
>> first approximation of a performance test.
>
> Thanks for this method. I will update the sample and do some more tests.

Great!

>> You could do something like:
>> 
>> bool first = true;
>> for (;;) {
>> 
>> [...]
>> 
>>            if (!first) {
>>    		nxdpf = xdpf_clone(xdpf);
>>    		if (unlikely(!nxdpf))
>>    			return -ENOMEM;
>>    		bq_enqueue(dev, nxdpf, dev_rx);
>>            } else {
>>    		bq_enqueue(dev, xdpf, dev_rx);
>>    		first = false;
>>            }
>> }
>> 
>> /* didn't find anywhere to forward to, free buf */
>> if (first)
>>    xdp_return_frame_rx_napi(xdpf);
>
> I think the first xdpf will be consumed by the driver and the later
> xdpf_clone() will failed, won't it?

No, bq_enqueue just sticks the frame on a list, it's not consumed until
after the NAPI cycle ends (and the driver calls xdp_do_flush()).

> How about just do a xdp_return_frame_rx_napi(xdpf) after all nxdpf enqueue?

Yeah, that would be the semantically obvious thing to do, but it is
wasteful in that you end up performing one more clone than you strictly
have to :)

>> > @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct
>> > xdp_buff *xdp,
>> >                   struct bpf_prog *xdp_prog)
>> >  {
>> >       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> > +     bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
>> > +     struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>>
>> I don't think you need the READ_ONCE here since there's already one
>> below?
>
> BTW, I forgot to ask, why we don't need the READ_ONCE for ex_map?
> I though the map and ex_map are two different pointers.

It isn't, but not for the reason I thought, so I can understand why my
comment might have been somewhat confusing (I have been confused by this
myself until just now...).

The READ_ONCE() is not needed because the ex_map field is only ever read
from or written to by the CPU owning the per-cpu pointer. Whereas the
'map' field is manipulated by remote CPUs in bpf_clear_redirect_map().
So you need neither READ_ONCE() nor WRITE_ONCE() on ex_map, just like
there are none on tgt_index and tgt_value.

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24 14:34     ` Toke Høiland-Jørgensen
  2020-05-06  9:14       ` Hangbin Liu
@ 2020-05-18  8:45       ` Hangbin Liu
  2020-05-19 10:15         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-05-18  8:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hi Toke,

On Fri, Apr 24, 2020 at 04:34:49PM +0200, Toke Høiland-Jørgensen wrote:
> 
> Yeah, the new helper is much cleaner!
> 
> > To achive this I add a new ex_map for struct bpf_redirect_info.
> > in the helper I set tgt_value to NULL to make a difference with
> > bpf_xdp_redirect_map()
> >
> > We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
> > create a exclude map for each interface and just want to exclude the
> > ingress interface.
> >
> > The general data path is kept in net/core/filter.c. The native data
> > path is in kernel/bpf/devmap.c so we can use direct calls to
> > get better performace.
> 
> Got any performance numbers? :)

Recently I tried with pktgen to get the performance number. It works
with native mode, although the number looks not high.

I tested it on VM with 1 cpu core. By forwarding to 7 ports, With pktgen
config like:
echo "count 10000000" > /proc/net/pktgen/veth0
echo "clone_skb 0" > /proc/net/pktgen/veth0
echo "pkt_size 64" > /proc/net/pktgen/veth0
echo "dst 224.1.1.10" > /proc/net/pktgen/veth0

I got forwarding number like:
Forwarding     159958 pkt/s
Forwarding     160213 pkt/s
Forwarding     160448 pkt/s

But when testing generic mode, I got system crashed directly. The code
path is:
do_xdp_generic()
  - netif_receive_generic_xdp()
    - pskb_expand_head()    <- skb_is_nonlinear(skb)
      - BUG_ON(skb_shared(skb))

So I want to ask do you have the same issue with pktgen? Any workaround?

> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 2e29a671d67e..1dbe42290223 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> 
> Updates to tools/include should generally go into a separate patch.

Is this a must to? It looks strange to separate the same implementation
into two patches.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-18  8:45       ` Hangbin Liu
@ 2020-05-19 10:15         ` Jesper Dangaard Brouer
  2020-05-20  1:24           ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2020-05-19 10:15 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: Toke Høiland-Jørgensen, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi, brouer

On Mon, 18 May 2020 16:45:27 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> Hi Toke,
> 
> On Fri, Apr 24, 2020 at 04:34:49PM +0200, Toke Høiland-Jørgensen wrote:
> > 
> > Yeah, the new helper is much cleaner!
> >   
> > > To achive this I add a new ex_map for struct bpf_redirect_info.
> > > in the helper I set tgt_value to NULL to make a difference with
> > > bpf_xdp_redirect_map()
> > >
> > > We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
> > > create a exclude map for each interface and just want to exclude the
> > > ingress interface.
> > >
> > > The general data path is kept in net/core/filter.c. The native data
> > > path is in kernel/bpf/devmap.c so we can use direct calls to
> > > get better performace.  
> > 
> > Got any performance numbers? :)  
> 
> Recently I tried with pktgen to get the performance number. It works
> with native mode, although the number looks not high.
> 
> I tested it on VM with 1 cpu core. 

Performance testing on a VM doesn't really make much sense.

> By forwarding to 7 ports, With pktgen
> config like:
> echo "count 10000000" > /proc/net/pktgen/veth0
> echo "clone_skb 0" > /proc/net/pktgen/veth0
> echo "pkt_size 64" > /proc/net/pktgen/veth0
> echo "dst 224.1.1.10" > /proc/net/pktgen/veth0
> 
> I got forwarding number like:
> Forwarding     159958 pkt/s
> Forwarding     160213 pkt/s
> Forwarding     160448 pkt/s
> 
> But when testing generic mode, I got system crashed directly. The code
> path is:
> do_xdp_generic()
>   - netif_receive_generic_xdp()
>     - pskb_expand_head()    <- skb_is_nonlinear(skb)
>       - BUG_ON(skb_shared(skb))
> 
> So I want to ask do you have the same issue with pktgen? Any workaround?

Pktgen is not meant to be used on virtual devices.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-19 10:15         ` Jesper Dangaard Brouer
@ 2020-05-20  1:24           ` Hangbin Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-05-20  1:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi

On Tue, May 19, 2020 at 12:15:12PM +0200, Jesper Dangaard Brouer wrote:
> Performance testing on a VM doesn't really make much sense.
> 
> Pktgen is not meant to be used on virtual devices.

Thanks, I will try on a physical machine.

Cheers
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
                   ` (2 preceding siblings ...)
  2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-05-23  6:05 ` Hangbin Liu
  2020-05-23  6:05   ` [PATCHv3 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-05-23  6:05   ` [PATCHv3 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
  2020-05-26 14:05 ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  4 siblings, 2 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-05-23  6:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

Hi all,

This patchset is for xdp multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple
ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I re-implement a new helper bpf_redirect_map_multi()
to accept two maps, the forwarding map and exclude map. If user
don't want to use exclude map and just want simply stop redirecting back
to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.

The example in patch 2 is functional, but not a lot of effort
has been made on performance optimisation. I did a simple test(pkt size 64)
with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
arrays:

bpf_redirect_map() with 1 ingress, 1 egress:
generic path: ~1600k pps
native path: ~980k pps

bpf_redirect_map_multi() with 1 ingress, 3 egress:
generic path: ~600k pps
native path: ~480k pps

bpf_redirect_map_multi() with 1 ingress, 9 egress:
generic path: ~125k pps
native path: ~100k pps

The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
the arrays and do clone skb/xdpf. The native path is slower than generic
path as we send skbs by pktgen. So the result looks reasonable.

We need also note that the performace number will get slower if we use large
BPF_MAP_TYPE_DEVMAP arrays.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.
f) Add rxcnt map to show the packet transmit speed in sample test.
g) Add performace test number.

I didn't split the tools/include to a separate patch because I think
they are all the same change, and I saw some others also do like this.
But I can re-post the patch and split it if you insist.

v2:
Discussed with Jiri, Toke, Jesper, Eelco, we think the v1 is doing
a trick and may make user confused. So let's just add a new helper
to make the implementation more clear.

Hangbin Liu (2):
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test

 include/linux/bpf.h                       |  20 +++
 include/linux/filter.h                    |   1 +
 include/net/xdp.h                         |   1 +
 include/uapi/linux/bpf.h                  |  22 ++-
 kernel/bpf/devmap.c                       | 124 ++++++++++++++
 kernel/bpf/verifier.c                     |   6 +
 net/core/filter.c                         | 101 ++++++++++-
 net/core/xdp.c                            |  26 +++
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 133 +++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 112 ++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 198 ++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h            |  22 ++-
 13 files changed, 762 insertions(+), 7 deletions(-)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv3 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-05-23  6:05   ` Hangbin Liu
  2020-05-26  7:34     ` kbuild test robot
  2020-05-23  6:05   ` [PATCHv3 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
  1 sibling, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-05-23  6:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This patch is for xdp multicast support. In this implementation we
add a new helper to accept two maps: forward map and exclude map.
We will redirect the packet to all the interfaces in *forward map*, but
exclude the interfaces that in *exclude map*.

To achive this I add a new ex_map for struct bpf_redirect_info.
in the helper I set tgt_value to NULL to make a difference with
bpf_xdp_redirect_map()

We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
create a exclude map for each interface and just want to exclude the
ingress interface.

The general data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls to
get better performace.

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h            |  20 ++++++
 include/linux/filter.h         |   1 +
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  22 +++++-
 kernel/bpf/devmap.c            | 124 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |   6 ++
 net/core/filter.c              | 101 +++++++++++++++++++++++++--
 net/core/xdp.c                 |  26 +++++++
 tools/include/uapi/linux/bpf.h |  22 +++++-
 9 files changed, 316 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index efe8836b5c48..d1c169bec6b5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1240,6 +1240,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 
@@ -1377,6 +1382,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	return false;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 73d06a39e2d6..5d9c6ac6ade3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -612,6 +612,7 @@ struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	struct bpf_map *ex_map;
 	u32 kern_flags;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 90f11760bd12..967684aa096a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -105,6 +105,7 @@ void xdp_warn(const char *msg, const char *func, const int line);
 #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 /* Convert xdp_buff to xdp_frame */
 static inline
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 97e1fd19ff58..000b0cf961ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3157,6 +3157,20 @@ union bpf_attr {
  *		**bpf_sk_cgroup_id**\ ().
  *	Return
  *		The id is returned or 0 in case the id could not be retrieved.
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map* (which may be NULL).
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3288,7 +3302,8 @@ union bpf_attr {
 	FN(seq_printf),			\
 	FN(seq_write),			\
 	FN(sk_cgroup_id),		\
-	FN(sk_ancestor_cgroup_id),
+	FN(sk_ancestor_cgroup_id),	\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3417,6 +3432,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a51d9fb7a359..ecc5c44a5bab 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -455,6 +455,130 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct call in fast path instead of  map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	struct bpf_dtab_netdev *in_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	if (obj->dev->ifindex == exclude_ifindex)
+		return true;
+
+	if (!map)
+		return false;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			in_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			in_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	struct bpf_dtab_netdev *obj = NULL;
+	struct xdp_frame *xdpf, *nxdpf;
+	struct net_device *dev;
+	bool first = true;
+	u32 key, next_key;
+	int err;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (!obj || dev_in_exclude_map(obj, ex_map,
+					       exclude_ingress ? dev_rx->ifindex : 0))
+			goto find_next;
+
+		dev = obj->dev;
+
+		if (!dev->netdev_ops->ndo_xdp_xmit)
+			goto find_next;
+
+		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+		if (unlikely(err))
+			goto find_next;
+
+		if (!first) {
+			nxdpf = xdpf_clone(xdpf);
+			if (unlikely(!nxdpf))
+				return -ENOMEM;
+
+			bq_enqueue(dev, nxdpf, dev_rx);
+		} else {
+			bq_enqueue(dev, xdpf, dev_rx);
+			first = false;
+		}
+
+find_next:
+		err = devmap_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+		key = next_key;
+	}
+
+	/* didn't find anywhere to forward to, free buf */
+	if (first)
+		xdp_return_frame_rx_napi(xdpf);
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d2e27dba4ac6..a5857953248d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3946,6 +3946,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_redirect_map_multi &&
 		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
@@ -4038,6 +4039,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map_multi:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
+		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+			goto error;
+		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
 	case BPF_FUNC_sock_map_update:
diff --git a/net/core/filter.c b/net/core/filter.c
index bd2853d23b50..f07eb1408f70 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    struct bpf_map *ex_map, bool exclude_ingress)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
+		if (fwd)
+			return dev_map_enqueue(fwd, xdp, dev_rx);
+		else
+			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
+						     exclude_ingress);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = ri->ex_map;
 	struct bpf_map *map = READ_ONCE(ri->map);
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
@@ -3541,6 +3548,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (unlikely(!map)) {
@@ -3552,7 +3560,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
 	}
 
 	if (unlikely(err))
@@ -3566,6 +3574,50 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, struct bpf_map *ex_map,
+				  bool exclude_ingress)
+
+{
+	struct bpf_dtab_netdev *dst;
+	struct sk_buff *nskb;
+	u32 key, next_key;
+	int err;
+	void *fwd;
+
+	/* Get first key from forward map */
+	map->ops->map_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (dev_in_exclude_map(dst, ex_map,
+					       exclude_ingress ? dev->ifindex : 0))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -ENOMEM;
+
+			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+			if (unlikely(err))
+				return err;
+		}
+
+find_next:
+		err = map->ops->map_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	consume_skb(skb);
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3573,19 +3625,29 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err = 0;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
 	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		struct bpf_dtab_netdev *dst = fwd;
+		if (fwd) {
+			struct bpf_dtab_netdev *dst = fwd;
+
+			err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		} else {
+			/* Deal with multicast maps */
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ex_map, exclude_ingress);
+		}
 
-		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
@@ -3699,6 +3761,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
+	   struct bpf_map *, ex_map, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
+		return XDP_ABORTED;
+
+	ri->tgt_index = 0;
+	ri->tgt_value = NULL;
+	ri->flags = flags;
+	ri->ex_map = ex_map;
+
+	WRITE_ONCE(ri->map, map);
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
+	.func           = bpf_xdp_redirect_map_multi,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -6363,6 +6452,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_redirect_map_multi:
+		return &bpf_xdp_redirect_map_multi_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 90f44f382115..acdc63833b1f 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -475,3 +475,29 @@ void xdp_warn(const char *msg, const char *func, const int line)
 	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
 };
 EXPORT_SYMBOL_GPL(xdp_warn);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 97e1fd19ff58..000b0cf961ea 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3157,6 +3157,20 @@ union bpf_attr {
  *		**bpf_sk_cgroup_id**\ ().
  *	Return
  *		The id is returned or 0 in case the id could not be retrieved.
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map* (which may be NULL).
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3288,7 +3302,8 @@ union bpf_attr {
 	FN(seq_printf),			\
 	FN(seq_write),			\
 	FN(sk_cgroup_id),		\
-	FN(sk_ancestor_cgroup_id),
+	FN(sk_ancestor_cgroup_id),	\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3417,6 +3432,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv3 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-05-23  6:05   ` [PATCHv3 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-05-23  6:05   ` Hangbin Liu
  1 sibling, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-05-23  6:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a sample for xdp multicast. In the sample we have 3 forward
groups and 1 exclude group. It will redirect each interface's
packets to all the interfaces in the forward group, and exclude
the interface in exclude map.

For more testing details, please see the test description in
xdp_redirect_map_multi.sh.

v3: add rxcnt map to show the packet transmit speed.
v2: no update.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 135 +++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 113 +++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 197 ++++++++++++++++++++++
 4 files changed, 448 insertions(+)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 8403e4762306..000709bb89c3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multi_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multi.sh b/samples/bpf/xdp_redirect_map_multi.sh
new file mode 100755
index 000000000000..bbf10ca06720
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi.sh
@@ -0,0 +1,135 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  ... init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |  ...
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Forward multicast groups:
+#     Forward group all has interfaces: veth1, veth2, veth3, veth4, ... (All traffic except IPv4, IPv6)
+#     Forward group v4 has interfaces: veth1, veth3, veth4, ... (For IPv4 traffic only)
+#     Forward group v6 has interfaces: veth2, veth3, veth4, ... (For IPv6 traffic only)
+# Exclude Groups:
+#     Exclude group: veth3 (assume ns3 is in black list)
+#
+# Test modules:
+# XDP modes: generic, native
+# map types: group v4 use DEVMAP, others use DEVMAP_HASH
+#
+# Test cases:
+#     ARP(we didn't exclude ns3 in kern.c for ARP):
+#        ns1 -> gw: ns2, ns3, ns4 should receive the arp request
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (fail), ns1 -> ns4 (pass)
+#     IPv6
+#        ns2 -> ns1 (fail), ns2 -> ns3 (fail), ns2 -> ns4 (pass)
+#
+
+
+# netns numbers
+NUM=10
+IFACES=""
+DRV_MODE="generic drv"
+
+test_pass()
+{
+	echo "Pass: $@"
+}
+
+test_fail()
+{
+	echo "fail: $@"
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip netns del ns$i
+	done
+}
+
+setup_ns()
+{
+	local mode=$1
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth0 type veth peer name veth$i
+	        ip link set veth0 netns ns$i
+		ip -n ns$i link set veth0 up
+		ip link set veth$i up
+
+		ip -n ns$i addr add 192.0.2.$i/24 dev veth0
+		ip -n ns$i addr add 2001:db8::$i/24 dev veth0
+		ip -n ns$i link set veth0 xdp$mode obj \
+			xdp_redirect_map_multi_kern.o sec xdp_redirect_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_ping_tests()
+{
+	local drv_mode=$1
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${drv_mode}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${drv_mode}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${drv_mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-2_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-2" || test_fail "$drv_mode arp ns1-2"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-3_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-3" || test_fail "$drv_mode arp ns1-3"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-4_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-4" || test_fail "$drv_mode arp ns1-4"
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-2" || test_pass "$drv_mode ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-3" || test_pass "$drv_mode ping ns1-3"
+	ip netns exec ns1 ping 192.0.2.4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping ns1-4" || test_fail "$drv_mode ping ns1-4"
+
+	# ping6 test
+	ip netns exec ns2 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-1" || test_pass "$drv_mode ping6 ns2-1"
+	ip netns exec ns2 ping6 2001:db8::3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-3" || test_pass "$drv_mode ping6 ns2-3"
+	ip netns exec ns2 ping6 2001:db8::4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping6 ns2-4" || test_fail "$drv_mode ping6 ns2-4"
+}
+
+do_tests()
+{
+	local drv_mode=$1
+	local drv_p
+
+	[ ${drv_mode} == "drv" ] && drv_p="-N" || drv_p="-S"
+
+	# run `ulimit -l unlimited` if you got errors like
+	# libbpf: Error in bpf_object__probe_global_data():Operation not permitted(1).
+	./xdp_redirect_map_multi $drv_p $IFACES &> xdp_${drv_mode}.log &
+	xdp_pid=$!
+	sleep 10
+
+	do_ping_tests $drv_mode
+
+	kill $xdp_pid
+}
+
+for mode in ${DRV_MODE}; do
+	setup_ns $mode
+	do_tests $mode
+	sleep 10
+	clean_up
+	sleep 5
+done
diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
new file mode 100644
index 000000000000..81f71461a252
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_kern.c
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+/* In this sample we will use 3 forward maps and 1 exclude map to
+ * show how to use the helper bpf_redirect_map_multi().
+ *
+ * In real world, there may have multi forward maps and exclude map. You can
+ * use map-in-map type to store the forward and exlude maps. e.g.
+ * forward_map_in_map[group_a_index] = forward_group_a_map
+ * forward_map_in_map[group_b_index] = forward_group_b_map
+ * exclude_map_in_map[iface_1_index] = iface_1_exclude_map
+ * exclude_map_in_map[iface_2_index] = iface_2_exclude_map
+ * Then store the forward group indexes based on IP/MAC policy in another
+ * hash map, e.g.:
+ * mcast_route_map[hash(subnet_a)] = group_a_index
+ * mcast_route_map[hash(subnet_b)] = group_b_index
+ *
+ * You can init the maps in user.c, and find the forward group index from
+ * mcast_route_map bye key hash(subnet) in kern.c, Then you could find
+ * the forward group by the group index. You can also get the exclude map
+ * simply by iface index in exclude_map_in_map.
+ */
+struct bpf_map_def SEC("maps") forward_map_v4 = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 4096,
+};
+
+struct bpf_map_def SEC("maps") forward_map_v6 = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_all = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") exclude_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 1,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	long *value;
+	u16 h_proto;
+	u32 key = 0;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	/* count packet in global counter */
+	value = bpf_map_lookup_elem(&rxcnt, &key);
+	if (value)
+		*value += 1;
+
+	if (h_proto == htons(ETH_P_IP))
+		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else if (h_proto == htons(ETH_P_IPV6))
+		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else
+		return bpf_redirect_map_multi(&forward_map_all, NULL,
+					      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
new file mode 100644
index 000000000000..7339ce4c7f9c
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static int ifaces[MAX_IFACE_NUM] = {};
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int rxcnt;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_cpus];
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		__u64 sum = 0;
+		__u32 key = 0;
+		int i;
+
+		sleep(interval);
+		assert(bpf_map_lookup_elem(rxcnt, &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += (values[i] - prev[i]);
+		if (sum)
+			printf("Forwarding %10llu pkt/s\n", sum / interval);
+		memcpy(prev, values, sizeof(values));
+	}
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int prog_fd, group_all, group_v4, group_v6, exclude;
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
+	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
+	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
+	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
+	rxcnt = bpf_object__find_map_fd_by_name(obj, "rxcnt");
+
+	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0 ||
+	    rxcnt < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* For testing: remove the 2nd interfaces from group v4 */
+		if (i != 1) {
+			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: remove the 1st interfaces from group v6 */
+		if (i != 0) {
+			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: add the 3rd interfaces to exclude map */
+		if (i == 2) {
+			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+	}
+
+	poll_stats(2);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv3 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-23  6:05   ` [PATCHv3 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-05-26  7:34     ` kbuild test robot
  0 siblings, 0 replies; 72+ messages in thread
From: kbuild test robot @ 2020-05-26  7:34 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: kbuild-all, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu


[-- Attachment #1: Type: text/plain, Size: 5721 bytes --]

Hi Hangbin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]
[also build test WARNING on net-next/master next-20200525]
[cannot apply to bpf/master net/master linus/master v5.7-rc7]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Hangbin-Liu/xdp-add-dev-map-multicast-support/20200523-141019
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: c6x-randconfig-r003-20200526 (attached as .config)
compiler: c6x-elf-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=c6x 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>, old ones prefixed by <<):

In file included from include/asm-generic/atomic.h:12,
from ./arch/c6x/include/generated/asm/atomic.h:1,
from include/linux/atomic.h:7,
from include/asm-generic/bitops/lock.h:5,
from arch/c6x/include/asm/bitops.h:87,
from include/linux/bitops.h:29,
from include/linux/kernel.h:12,
from include/linux/list.h:9,
from include/linux/module.h:12,
from net/core/filter.c:20:
net/core/filter.c: In function 'bpf_clear_redirect_map':
arch/c6x/include/asm/cmpxchg.h:55:3: warning: value computed is not used [-Wunused-value]
55 |  ((__typeof__(*(ptr)))__cmpxchg_local_generic((ptr),           |  ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
56 |            (unsigned long)(o),          |            ~~~~~~~~~~~~~~~~~~~~~
57 |            (unsigned long)(n),          |            ~~~~~~~~~~~~~~~~~~~~~
58 |            sizeof(*(ptr))))
|            ~~~~~~~~~~~~~~~~
include/asm-generic/cmpxchg.h:106:28: note: in expansion of macro 'cmpxchg_local'
106 | #define cmpxchg(ptr, o, n) cmpxchg_local((ptr), (o), (n))
|                            ^~~~~~~~~~~~~
net/core/filter.c:3534:4: note: in expansion of macro 'cmpxchg'
3534 |    cmpxchg(&ri->map, map, NULL);
|    ^~~~~~~
net/core/filter.c: At top level:
>> net/core/filter.c:3787:20: warning: initialized field overwritten [-Woverride-init]
3787 |  .arg1_type      = ARG_CONST_MAP_PTR,
|                    ^~~~~~~~~~~~~~~~~
net/core/filter.c:3787:20: note: (near initialization for 'bpf_xdp_redirect_map_multi_proto.<anonymous>.<anonymous>.arg1_type')
/tmp/cc2n7hPR.s: Assembler messages:
/tmp/cc2n7hPR.s:69347: Warning: ignoring changed section type for .far
/tmp/cc2n7hPR.s:69347: Warning: ignoring changed section attributes for .far
/tmp/cc2n7hPR.s:69454: Warning: ignoring changed section type for .far
/tmp/cc2n7hPR.s:69454: Warning: ignoring changed section attributes for .far
/tmp/cc2n7hPR.s:69503: Warning: ignoring changed section type for .far
/tmp/cc2n7hPR.s:69503: Warning: ignoring changed section attributes for .far
--
In file included from include/asm-generic/atomic.h:12,
from ./arch/c6x/include/generated/asm/atomic.h:1,
from include/linux/atomic.h:7,
from include/asm-generic/bitops/lock.h:5,
from arch/c6x/include/asm/bitops.h:87,
from include/linux/bitops.h:29,
from include/linux/kernel.h:12,
from include/linux/list.h:9,
from include/linux/module.h:12,
from net/core/filter.c:20:
net/core/filter.c: In function 'bpf_clear_redirect_map':
arch/c6x/include/asm/cmpxchg.h:55:3: warning: value computed is not used [-Wunused-value]
55 |  ((__typeof__(*(ptr)))__cmpxchg_local_generic((ptr),           |  ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
56 |            (unsigned long)(o),          |            ~~~~~~~~~~~~~~~~~~~~~
57 |            (unsigned long)(n),          |            ~~~~~~~~~~~~~~~~~~~~~
58 |            sizeof(*(ptr))))
|            ~~~~~~~~~~~~~~~~
include/asm-generic/cmpxchg.h:106:28: note: in expansion of macro 'cmpxchg_local'
106 | #define cmpxchg(ptr, o, n) cmpxchg_local((ptr), (o), (n))
|                            ^~~~~~~~~~~~~
net/core/filter.c:3534:4: note: in expansion of macro 'cmpxchg'
3534 |    cmpxchg(&ri->map, map, NULL);
|    ^~~~~~~
net/core/filter.c: At top level:
>> net/core/filter.c:3787:20: warning: initialized field overwritten [-Woverride-init]
3787 |  .arg1_type      = ARG_CONST_MAP_PTR,
|                    ^~~~~~~~~~~~~~~~~
net/core/filter.c:3787:20: note: (near initialization for 'bpf_xdp_redirect_map_multi_proto.<anonymous>.<anonymous>.arg1_type')
/tmp/ccHtg48M.s: Assembler messages:
/tmp/ccHtg48M.s:69347: Warning: ignoring changed section type for .far
/tmp/ccHtg48M.s:69347: Warning: ignoring changed section attributes for .far
/tmp/ccHtg48M.s:69454: Warning: ignoring changed section type for .far
/tmp/ccHtg48M.s:69454: Warning: ignoring changed section attributes for .far
/tmp/ccHtg48M.s:69503: Warning: ignoring changed section type for .far
/tmp/ccHtg48M.s:69503: Warning: ignoring changed section attributes for .far

vim +3787 net/core/filter.c

  3781	
  3782	static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
  3783		.func           = bpf_xdp_redirect_map_multi,
  3784		.gpl_only       = false,
  3785		.ret_type       = RET_INTEGER,
  3786		.arg1_type      = ARG_CONST_MAP_PTR,
> 3787		.arg1_type      = ARG_CONST_MAP_PTR,
  3788		.arg3_type      = ARG_ANYTHING,
  3789	};
  3790	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23268 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
                   ` (3 preceding siblings ...)
  2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-05-26 14:05 ` Hangbin Liu
  2020-05-26 14:05   ` [PATCHv4 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
                     ` (3 more replies)
  4 siblings, 4 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-05-26 14:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

Hi all,

This patchset is for xdp multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple
ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I re-implement a new helper bpf_redirect_map_multi()
to accept two maps, the forwarding map and exclude map. If user
don't want to use exclude map and just want simply stop redirecting back
to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.

The example in patch 2 is functional, but not a lot of effort
has been made on performance optimisation. I did a simple test(pkt size 64)
with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
arrays:

bpf_redirect_map() with 1 ingress, 1 egress:
generic path: ~1600k pps
native path: ~980k pps

bpf_redirect_map_multi() with 1 ingress, 3 egress:
generic path: ~600k pps
native path: ~480k pps

bpf_redirect_map_multi() with 1 ingress, 9 egress:
generic path: ~125k pps
native path: ~100k pps

The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
the arrays and do clone skb/xdpf. The native path is slower than generic
path as we send skbs by pktgen. So the result looks reasonable.

We need also note that the performace number will get slower if we use large
BPF_MAP_TYPE_DEVMAP arrays.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.
f) Add rxcnt map to show the packet transmit speed in sample test.
g) Add performace test number.

I didn't split the tools/include to a separate patch because I think
they are all the same change, and I saw some others also do like this.
But I can re-post the patch and split it if you insist.

v2:
Discussed with Jiri, Toke, Jesper, Eelco, we think the v1 is doing
a trick and may make user confused. So let's just add a new helper
to make the implementation more clear.

Hangbin Liu (2):
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test

 include/linux/bpf.h                       |  20 +++
 include/linux/filter.h                    |   1 +
 include/net/xdp.h                         |   1 +
 include/uapi/linux/bpf.h                  |  22 ++-
 kernel/bpf/devmap.c                       | 124 ++++++++++++++
 kernel/bpf/verifier.c                     |   6 +
 net/core/filter.c                         | 101 ++++++++++-
 net/core/xdp.c                            |  26 +++
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 133 +++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 112 ++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 198 ++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h            |  22 ++-
 13 files changed, 762 insertions(+), 7 deletions(-)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-26 14:05 ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-05-26 14:05   ` Hangbin Liu
  2020-05-27 10:29     ` Toke Høiland-Jørgensen
                       ` (2 more replies)
  2020-05-26 14:05   ` [PATCHv4 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-05-26 14:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This patch is for xdp multicast support. In this implementation we
add a new helper to accept two maps: forward map and exclude map.
We will redirect the packet to all the interfaces in *forward map*, but
exclude the interfaces that in *exclude map*.

To achive this I add a new ex_map for struct bpf_redirect_info.
in the helper I set tgt_value to NULL to make a difference with
bpf_xdp_redirect_map()

We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
create a exclude map for each interface and just want to exclude the
ingress interface.

The general data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls to
get better performace.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h            |  20 ++++++
 include/linux/filter.h         |   1 +
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  22 +++++-
 kernel/bpf/devmap.c            | 124 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |   6 ++
 net/core/filter.c              | 101 +++++++++++++++++++++++++--
 net/core/xdp.c                 |  26 +++++++
 tools/include/uapi/linux/bpf.h |  22 +++++-
 9 files changed, 316 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index efe8836b5c48..d1c169bec6b5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1240,6 +1240,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 
@@ -1377,6 +1382,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	return false;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 73d06a39e2d6..5d9c6ac6ade3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -612,6 +612,7 @@ struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	struct bpf_map *ex_map;
 	u32 kern_flags;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 90f11760bd12..967684aa096a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -105,6 +105,7 @@ void xdp_warn(const char *msg, const char *func, const int line);
 #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 /* Convert xdp_buff to xdp_frame */
 static inline
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 97e1fd19ff58..000b0cf961ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3157,6 +3157,20 @@ union bpf_attr {
  *		**bpf_sk_cgroup_id**\ ().
  *	Return
  *		The id is returned or 0 in case the id could not be retrieved.
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map* (which may be NULL).
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3288,7 +3302,8 @@ union bpf_attr {
 	FN(seq_printf),			\
 	FN(seq_write),			\
 	FN(sk_cgroup_id),		\
-	FN(sk_ancestor_cgroup_id),
+	FN(sk_ancestor_cgroup_id),	\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3417,6 +3432,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a51d9fb7a359..ecc5c44a5bab 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -455,6 +455,130 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct call in fast path instead of  map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	struct bpf_dtab_netdev *in_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	if (obj->dev->ifindex == exclude_ifindex)
+		return true;
+
+	if (!map)
+		return false;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			in_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			in_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	struct bpf_dtab_netdev *obj = NULL;
+	struct xdp_frame *xdpf, *nxdpf;
+	struct net_device *dev;
+	bool first = true;
+	u32 key, next_key;
+	int err;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (!obj || dev_in_exclude_map(obj, ex_map,
+					       exclude_ingress ? dev_rx->ifindex : 0))
+			goto find_next;
+
+		dev = obj->dev;
+
+		if (!dev->netdev_ops->ndo_xdp_xmit)
+			goto find_next;
+
+		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+		if (unlikely(err))
+			goto find_next;
+
+		if (!first) {
+			nxdpf = xdpf_clone(xdpf);
+			if (unlikely(!nxdpf))
+				return -ENOMEM;
+
+			bq_enqueue(dev, nxdpf, dev_rx);
+		} else {
+			bq_enqueue(dev, xdpf, dev_rx);
+			first = false;
+		}
+
+find_next:
+		err = devmap_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+		key = next_key;
+	}
+
+	/* didn't find anywhere to forward to, free buf */
+	if (first)
+		xdp_return_frame_rx_napi(xdpf);
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d2e27dba4ac6..a5857953248d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3946,6 +3946,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_redirect_map_multi &&
 		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
@@ -4038,6 +4039,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map_multi:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
+		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+			goto error;
+		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
 	case BPF_FUNC_sock_map_update:
diff --git a/net/core/filter.c b/net/core/filter.c
index bd2853d23b50..f07eb1408f70 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    struct bpf_map *ex_map, bool exclude_ingress)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
+		if (fwd)
+			return dev_map_enqueue(fwd, xdp, dev_rx);
+		else
+			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
+						     exclude_ingress);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = ri->ex_map;
 	struct bpf_map *map = READ_ONCE(ri->map);
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
@@ -3541,6 +3548,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (unlikely(!map)) {
@@ -3552,7 +3560,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
 	}
 
 	if (unlikely(err))
@@ -3566,6 +3574,50 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, struct bpf_map *ex_map,
+				  bool exclude_ingress)
+
+{
+	struct bpf_dtab_netdev *dst;
+	struct sk_buff *nskb;
+	u32 key, next_key;
+	int err;
+	void *fwd;
+
+	/* Get first key from forward map */
+	map->ops->map_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (dev_in_exclude_map(dst, ex_map,
+					       exclude_ingress ? dev->ifindex : 0))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -ENOMEM;
+
+			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+			if (unlikely(err))
+				return err;
+		}
+
+find_next:
+		err = map->ops->map_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	consume_skb(skb);
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3573,19 +3625,29 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err = 0;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
 	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		struct bpf_dtab_netdev *dst = fwd;
+		if (fwd) {
+			struct bpf_dtab_netdev *dst = fwd;
+
+			err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		} else {
+			/* Deal with multicast maps */
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ex_map, exclude_ingress);
+		}
 
-		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
@@ -3699,6 +3761,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
+	   struct bpf_map *, ex_map, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
+		return XDP_ABORTED;
+
+	ri->tgt_index = 0;
+	ri->tgt_value = NULL;
+	ri->flags = flags;
+	ri->ex_map = ex_map;
+
+	WRITE_ONCE(ri->map, map);
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
+	.func           = bpf_xdp_redirect_map_multi,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -6363,6 +6452,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_redirect_map_multi:
+		return &bpf_xdp_redirect_map_multi_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 90f44f382115..acdc63833b1f 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -475,3 +475,29 @@ void xdp_warn(const char *msg, const char *func, const int line)
 	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
 };
 EXPORT_SYMBOL_GPL(xdp_warn);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 97e1fd19ff58..000b0cf961ea 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3157,6 +3157,20 @@ union bpf_attr {
  *		**bpf_sk_cgroup_id**\ ().
  *	Return
  *		The id is returned or 0 in case the id could not be retrieved.
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map* (which may be NULL).
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3288,7 +3302,8 @@ union bpf_attr {
 	FN(seq_printf),			\
 	FN(seq_write),			\
 	FN(sk_cgroup_id),		\
-	FN(sk_ancestor_cgroup_id),
+	FN(sk_ancestor_cgroup_id),	\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3417,6 +3432,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv4 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-05-26 14:05 ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-05-26 14:05   ` [PATCHv4 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-05-26 14:05   ` Hangbin Liu
  2020-05-27 10:21   ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Toke Høiland-Jørgensen
  2020-07-01  4:19   ` [PATCHv5 bpf-next 0/3] xdp: add a new helper for " Hangbin Liu
  3 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-05-26 14:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a sample for xdp multicast. In the sample we have 3 forward
groups and 1 exclude group. It will redirect each interface's
packets to all the interfaces in the forward group, and exclude
the interface in exclude map.

For more testing details, please see the test description in
xdp_redirect_map_multi.sh.

v4: no update.
v3: add rxcnt map to show the packet transmit speed.
v2: no update.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 135 +++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 113 +++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 197 ++++++++++++++++++++++
 4 files changed, 448 insertions(+)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 8403e4762306..000709bb89c3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multi_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multi.sh b/samples/bpf/xdp_redirect_map_multi.sh
new file mode 100755
index 000000000000..bbf10ca06720
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi.sh
@@ -0,0 +1,135 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  ... init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |  ...
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Forward multicast groups:
+#     Forward group all has interfaces: veth1, veth2, veth3, veth4, ... (All traffic except IPv4, IPv6)
+#     Forward group v4 has interfaces: veth1, veth3, veth4, ... (For IPv4 traffic only)
+#     Forward group v6 has interfaces: veth2, veth3, veth4, ... (For IPv6 traffic only)
+# Exclude Groups:
+#     Exclude group: veth3 (assume ns3 is in black list)
+#
+# Test modules:
+# XDP modes: generic, native
+# map types: group v4 use DEVMAP, others use DEVMAP_HASH
+#
+# Test cases:
+#     ARP(we didn't exclude ns3 in kern.c for ARP):
+#        ns1 -> gw: ns2, ns3, ns4 should receive the arp request
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (fail), ns1 -> ns4 (pass)
+#     IPv6
+#        ns2 -> ns1 (fail), ns2 -> ns3 (fail), ns2 -> ns4 (pass)
+#
+
+
+# netns numbers
+NUM=10
+IFACES=""
+DRV_MODE="generic drv"
+
+test_pass()
+{
+	echo "Pass: $@"
+}
+
+test_fail()
+{
+	echo "fail: $@"
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip netns del ns$i
+	done
+}
+
+setup_ns()
+{
+	local mode=$1
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth0 type veth peer name veth$i
+	        ip link set veth0 netns ns$i
+		ip -n ns$i link set veth0 up
+		ip link set veth$i up
+
+		ip -n ns$i addr add 192.0.2.$i/24 dev veth0
+		ip -n ns$i addr add 2001:db8::$i/24 dev veth0
+		ip -n ns$i link set veth0 xdp$mode obj \
+			xdp_redirect_map_multi_kern.o sec xdp_redirect_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_ping_tests()
+{
+	local drv_mode=$1
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${drv_mode}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${drv_mode}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${drv_mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-2_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-2" || test_fail "$drv_mode arp ns1-2"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-3_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-3" || test_fail "$drv_mode arp ns1-3"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-4_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-4" || test_fail "$drv_mode arp ns1-4"
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-2" || test_pass "$drv_mode ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-3" || test_pass "$drv_mode ping ns1-3"
+	ip netns exec ns1 ping 192.0.2.4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping ns1-4" || test_fail "$drv_mode ping ns1-4"
+
+	# ping6 test
+	ip netns exec ns2 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-1" || test_pass "$drv_mode ping6 ns2-1"
+	ip netns exec ns2 ping6 2001:db8::3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-3" || test_pass "$drv_mode ping6 ns2-3"
+	ip netns exec ns2 ping6 2001:db8::4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping6 ns2-4" || test_fail "$drv_mode ping6 ns2-4"
+}
+
+do_tests()
+{
+	local drv_mode=$1
+	local drv_p
+
+	[ ${drv_mode} == "drv" ] && drv_p="-N" || drv_p="-S"
+
+	# run `ulimit -l unlimited` if you got errors like
+	# libbpf: Error in bpf_object__probe_global_data():Operation not permitted(1).
+	./xdp_redirect_map_multi $drv_p $IFACES &> xdp_${drv_mode}.log &
+	xdp_pid=$!
+	sleep 10
+
+	do_ping_tests $drv_mode
+
+	kill $xdp_pid
+}
+
+for mode in ${DRV_MODE}; do
+	setup_ns $mode
+	do_tests $mode
+	sleep 10
+	clean_up
+	sleep 5
+done
diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
new file mode 100644
index 000000000000..81f71461a252
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_kern.c
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+/* In this sample we will use 3 forward maps and 1 exclude map to
+ * show how to use the helper bpf_redirect_map_multi().
+ *
+ * In real world, there may have multi forward maps and exclude map. You can
+ * use map-in-map type to store the forward and exlude maps. e.g.
+ * forward_map_in_map[group_a_index] = forward_group_a_map
+ * forward_map_in_map[group_b_index] = forward_group_b_map
+ * exclude_map_in_map[iface_1_index] = iface_1_exclude_map
+ * exclude_map_in_map[iface_2_index] = iface_2_exclude_map
+ * Then store the forward group indexes based on IP/MAC policy in another
+ * hash map, e.g.:
+ * mcast_route_map[hash(subnet_a)] = group_a_index
+ * mcast_route_map[hash(subnet_b)] = group_b_index
+ *
+ * You can init the maps in user.c, and find the forward group index from
+ * mcast_route_map bye key hash(subnet) in kern.c, Then you could find
+ * the forward group by the group index. You can also get the exclude map
+ * simply by iface index in exclude_map_in_map.
+ */
+struct bpf_map_def SEC("maps") forward_map_v4 = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 4096,
+};
+
+struct bpf_map_def SEC("maps") forward_map_v6 = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_all = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") exclude_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 1,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	long *value;
+	u16 h_proto;
+	u32 key = 0;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	/* count packet in global counter */
+	value = bpf_map_lookup_elem(&rxcnt, &key);
+	if (value)
+		*value += 1;
+
+	if (h_proto == htons(ETH_P_IP))
+		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else if (h_proto == htons(ETH_P_IPV6))
+		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else
+		return bpf_redirect_map_multi(&forward_map_all, NULL,
+					      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
new file mode 100644
index 000000000000..7339ce4c7f9c
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static int ifaces[MAX_IFACE_NUM] = {};
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int rxcnt;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_cpus];
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		__u64 sum = 0;
+		__u32 key = 0;
+		int i;
+
+		sleep(interval);
+		assert(bpf_map_lookup_elem(rxcnt, &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += (values[i] - prev[i]);
+		if (sum)
+			printf("Forwarding %10llu pkt/s\n", sum / interval);
+		memcpy(prev, values, sizeof(values));
+	}
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int prog_fd, group_all, group_v4, group_v6, exclude;
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
+	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
+	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
+	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
+	rxcnt = bpf_object__find_map_fd_by_name(obj, "rxcnt");
+
+	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0 ||
+	    rxcnt < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* For testing: remove the 2nd interfaces from group v4 */
+		if (i != 1) {
+			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: remove the 1st interfaces from group v6 */
+		if (i != 0) {
+			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: add the 3rd interfaces to exclude map */
+		if (i == 2) {
+			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+	}
+
+	poll_stats(2);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-05-26 14:05 ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-05-26 14:05   ` [PATCHv4 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-05-26 14:05   ` [PATCHv4 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
@ 2020-05-27 10:21   ` Toke Høiland-Jørgensen
  2020-05-27 10:32     ` Eelco Chaudron
                       ` (2 more replies)
  2020-07-01  4:19   ` [PATCHv5 bpf-next 0/3] xdp: add a new helper for " Hangbin Liu
  3 siblings, 3 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-05-27 10:21 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, Hangbin Liu

Hangbin Liu <liuhangbin@gmail.com> writes:

> Hi all,
>
> This patchset is for xdp multicast support, which has been discussed
> before[0]. The goal is to be able to implement an OVS-like data plane in
> XDP, i.e., a software switch that can forward XDP frames to multiple
> ports.
>
> To achieve this, an application needs to specify a group of interfaces
> to forward a packet to. It is also common to want to exclude one or more
> physical interfaces from the forwarding operation - e.g., to forward a
> packet to all interfaces in the multicast group except the interface it
> arrived on. While this could be done simply by adding more groups, this
> quickly leads to a combinatorial explosion in the number of groups an
> application has to maintain.
>
> To avoid the combinatorial explosion, we propose to include the ability
> to specify an "exclude group" as part of the forwarding operation. This
> needs to be a group (instead of just a single port index), because a
> physical interface can be part of a logical grouping, such as a bond
> device.
>
> Thus, the logical forwarding operation becomes a "set difference"
> operation, i.e. "forward to all ports in group A that are not also in
> group B". This series implements such an operation using device maps to
> represent the groups. This means that the XDP program specifies two
> device maps, one containing the list of netdevs to redirect to, and the
> other containing the exclude list.
>
> To achieve this, I re-implement a new helper bpf_redirect_map_multi()
> to accept two maps, the forwarding map and exclude map. If user
> don't want to use exclude map and just want simply stop redirecting back
> to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.
>
> The example in patch 2 is functional, but not a lot of effort
> has been made on performance optimisation. I did a simple test(pkt size 64)
> with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
> arrays:
>
> bpf_redirect_map() with 1 ingress, 1 egress:
> generic path: ~1600k pps
> native path: ~980k pps
>
> bpf_redirect_map_multi() with 1 ingress, 3 egress:
> generic path: ~600k pps
> native path: ~480k pps
>
> bpf_redirect_map_multi() with 1 ingress, 9 egress:
> generic path: ~125k pps
> native path: ~100k pps
>
> The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
> the arrays and do clone skb/xdpf. The native path is slower than generic
> path as we send skbs by pktgen. So the result looks reasonable.

How are you running these tests? Still on virtual devices? We really
need results from a physical setup in native mode to assess the impact
on the native-XDP fast path. The numbers above don't tell much in this
regard. I'd also like to see a before/after patch for straight
bpf_redirect_map(), since you're messing with the fast path, and we want
to make sure it's not causing a performance regression for regular
redirect.

Finally, since the overhead seems to be quite substantial: A comparison
with a regular network stack bridge might make sense? After all we also
want to make sure it's a performance win over that :)

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-26 14:05   ` [PATCHv4 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-05-27 10:29     ` Toke Høiland-Jørgensen
  2020-06-10 10:18     ` Jesper Dangaard Brouer
  2020-06-10 10:21     ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-05-27 10:29 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, Hangbin Liu

Hangbin Liu <liuhangbin@gmail.com> writes:

> This patch is for xdp multicast support. In this implementation we
> add a new helper to accept two maps: forward map and exclude map.
> We will redirect the packet to all the interfaces in *forward map*, but
> exclude the interfaces that in *exclude map*.
>
> To achive this I add a new ex_map for struct bpf_redirect_info.
> in the helper I set tgt_value to NULL to make a difference with
> bpf_xdp_redirect_map()
>
> We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
> create a exclude map for each interface and just want to exclude the
> ingress interface.
>
> The general data path is kept in net/core/filter.c. The native data
> path is in kernel/bpf/devmap.c so we can use direct calls to
> get better performace.
>
> v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo
>
> v3: Based on Toke's suggestion, do the following update
> a) Update bpf_redirect_map_multi() description in bpf.h.
> b) Fix exclude_ifindex checking order in dev_in_exclude_map().
> c) Fix one more xdpf clone in dev_map_enqueue_multi().
> d) Go find next one in dev_map_enqueue_multi() if the interface is not
>    able to forward instead of abort the whole loop.
> e) Remove READ_ONCE/WRITE_ONCE for ex_map.
>
> v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
> include/exclude maps directly.
>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/bpf.h            |  20 ++++++
>  include/linux/filter.h         |   1 +
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  22 +++++-
>  kernel/bpf/devmap.c            | 124 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |   6 ++
>  net/core/filter.c              | 101 +++++++++++++++++++++++++--
>  net/core/xdp.c                 |  26 +++++++
>  tools/include/uapi/linux/bpf.h |  22 +++++-
>  9 files changed, 316 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index efe8836b5c48..d1c169bec6b5 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1240,6 +1240,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex);
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress);
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog);
>  
> @@ -1377,6 +1382,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return 0;
>  }
>  
> +static inline
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	return false;
> +}
> +
> +static inline
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	return 0;
> +}
> +
>  struct sk_buff;
>  
>  static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 73d06a39e2d6..5d9c6ac6ade3 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -612,6 +612,7 @@ struct bpf_redirect_info {
>  	u32 tgt_index;
>  	void *tgt_value;
>  	struct bpf_map *map;
> +	struct bpf_map *ex_map;
>  	u32 kern_flags;
>  };
>  
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 90f11760bd12..967684aa096a 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -105,6 +105,7 @@ void xdp_warn(const char *msg, const char *func, const int line);
>  #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
>  
>  struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
>  
>  /* Convert xdp_buff to xdp_frame */
>  static inline
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 97e1fd19ff58..000b0cf961ea 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3157,6 +3157,20 @@ union bpf_attr {
>   *		**bpf_sk_cgroup_id**\ ().
>   *	Return
>   *		The id is returned or 0 in case the id could not be retrieved.
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		Redirect the packet to ALL the interfaces in *map*, but
> + * 		exclude the interfaces in *ex_map* (which may be NULL).
> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which additionally excludes the current ingress device.
> + *
> + * 		See also bpf_redirect_map(), which supports redirecting
> + * 		packet to a specific ifindex in the map.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3288,7 +3302,8 @@ union bpf_attr {
>  	FN(seq_printf),			\
>  	FN(seq_write),			\
>  	FN(sk_cgroup_id),		\
> -	FN(sk_ancestor_cgroup_id),
> +	FN(sk_ancestor_cgroup_id),	\
> +	FN(redirect_map_multi),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3417,6 +3432,11 @@ enum bpf_lwt_encap_mode {
>  	BPF_LWT_ENCAP_IP,
>  };
>  
> +/* BPF_FUNC_redirect_map_multi flags. */
> +enum {
> +	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
> +};
> +
>  #define __bpf_md_ptr(type, name)	\
>  union {					\
>  	type name;			\
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index a51d9fb7a359..ecc5c44a5bab 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -455,6 +455,130 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return __xdp_enqueue(dev, xdp, dev_rx);
>  }
>  
> +/* Use direct call in fast path instead of  map->ops->map_get_next_key() */
> +static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +
> +	switch (map->map_type) {
> +	case BPF_MAP_TYPE_DEVMAP:
> +		return dev_map_get_next_key(map, key, next_key);
> +	case BPF_MAP_TYPE_DEVMAP_HASH:
> +		return dev_map_hash_get_next_key(map, key, next_key);
> +	default:
> +		break;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	struct bpf_dtab_netdev *in_obj = NULL;
> +	u32 key, next_key;
> +	int err;
> +
> +	if (obj->dev->ifindex == exclude_ifindex)
> +		return true;
> +
> +	if (!map)
> +		return false;
> +
> +	devmap_get_next_key(map, NULL, &key);

You also need to check if this fails; the map could be empty... This
goes for all the places you loop through maps below, but not going to
repeat the comment :)

> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			in_obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			in_obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
> +			return true;
> +
> +		err = devmap_get_next_key(map, &key, &next_key);
> +
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return false;
> +}
> +
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	struct bpf_dtab_netdev *obj = NULL;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	struct net_device *dev;
> +	bool first = true;
> +	u32 key, next_key;
> +	int err;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (!obj || dev_in_exclude_map(obj, ex_map,
> +					       exclude_ingress ? dev_rx->ifindex : 0))
> +			goto find_next;
> +
> +		dev = obj->dev;
> +
> +		if (!dev->netdev_ops->ndo_xdp_xmit)
> +			goto find_next;
> +
> +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> +		if (unlikely(err))
> +			goto find_next;
> +
> +		if (!first) {
> +			nxdpf = xdpf_clone(xdpf);
> +			if (unlikely(!nxdpf))
> +				return -ENOMEM;
> +
> +			bq_enqueue(dev, nxdpf, dev_rx);
> +		} else {
> +			bq_enqueue(dev, xdpf, dev_rx);
> +			first = false;
> +		}
> +
> +find_next:
> +		err = devmap_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +		key = next_key;
> +	}
> +
> +	/* didn't find anywhere to forward to, free buf */
> +	if (first)
> +		xdp_return_frame_rx_napi(xdpf);
> +
> +	return 0;
> +}
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog)
>  {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d2e27dba4ac6..a5857953248d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3946,6 +3946,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
>  		if (func_id != BPF_FUNC_redirect_map &&
> +		    func_id != BPF_FUNC_redirect_map_multi &&
>  		    func_id != BPF_FUNC_map_lookup_elem)
>  			goto error;
>  		break;
> @@ -4038,6 +4039,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		    map->map_type != BPF_MAP_TYPE_XSKMAP)
>  			goto error;
>  		break;
> +	case BPF_FUNC_redirect_map_multi:
> +		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
> +		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
> +			goto error;
> +		break;
>  	case BPF_FUNC_sk_redirect_map:
>  	case BPF_FUNC_msg_redirect_map:
>  	case BPF_FUNC_sock_map_update:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bd2853d23b50..f07eb1408f70 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>  };
>  
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> -			    struct bpf_map *map, struct xdp_buff *xdp)
> +			    struct bpf_map *map, struct xdp_buff *xdp,
> +			    struct bpf_map *ex_map, bool exclude_ingress)

Maybe just pass through the flags argument here?

>  {
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		return dev_map_enqueue(fwd, xdp, dev_rx);

Using a NULL target_value to distinguish between multicast and unicast
forwarding is clever, but bordering on 'too clever' :) - took me a
little while to figure out this was what you were doing, at least. So
please add a comment explaining this, here and in the helper.

> +		if (fwd)
> +			return dev_map_enqueue(fwd, xdp, dev_rx);
> +		else
> +			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
> +						     exclude_ingress);
>  	case BPF_MAP_TYPE_CPUMAP:
>  		return cpu_map_enqueue(fwd, xdp, dev_rx);
>  	case BPF_MAP_TYPE_XSKMAP:
> @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = ri->ex_map;
>  	struct bpf_map *map = READ_ONCE(ri->map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
> @@ -3541,6 +3548,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  	ri->tgt_index = 0;
>  	ri->tgt_value = NULL;
> +	ri->ex_map = NULL;
>  	WRITE_ONCE(ri->map, NULL);
>  
>  	if (unlikely(!map)) {
> @@ -3552,7 +3560,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  		err = dev_xdp_enqueue(fwd, xdp, dev);
>  	} else {
> -		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
> +		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
>  	}
>  
>  	if (unlikely(err))
> @@ -3566,6 +3574,50 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>  
> +static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
> +				  struct bpf_prog *xdp_prog,
> +				  struct bpf_map *map, struct bpf_map *ex_map,
> +				  bool exclude_ingress)
> +
> +{
> +	struct bpf_dtab_netdev *dst;
> +	struct sk_buff *nskb;
> +	u32 key, next_key;
> +	int err;
> +	void *fwd;
> +
> +	/* Get first key from forward map */
> +	map->ops->map_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		fwd = __xdp_map_lookup_elem(map, key);
> +		if (fwd) {
> +			dst = (struct bpf_dtab_netdev *)fwd;
> +			if (dev_in_exclude_map(dst, ex_map,
> +					       exclude_ingress ? dev->ifindex : 0))
> +				goto find_next;
> +
> +			nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (!nskb)
> +				return -ENOMEM;
> +
> +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> +			if (unlikely(err))
> +				return err;
> +		}
> +
> +find_next:
> +		err = map->ops->map_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	consume_skb(skb);
> +	return 0;
> +}
> +
>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct sk_buff *skb,
>  				       struct xdp_buff *xdp,
> @@ -3573,19 +3625,29 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct bpf_map *map)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = ri->ex_map;
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
>  	int err = 0;
>  
>  	ri->tgt_index = 0;
>  	ri->tgt_value = NULL;
> +	ri->ex_map = NULL;
>  	WRITE_ONCE(ri->map, NULL);
>  
>  	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
>  	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
> -		struct bpf_dtab_netdev *dst = fwd;

Same as above - please add a comment explaining this test...

> +		if (fwd) {
> +			struct bpf_dtab_netdev *dst = fwd;
> +
> +			err = dev_map_generic_redirect(dst, skb, xdp_prog);
> +		} else {
> +			/* Deal with multicast maps */
> +			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> +						     ex_map, exclude_ingress);
> +		}
>  
> -		err = dev_map_generic_redirect(dst, skb, xdp_prog);
>  		if (unlikely(err))
>  			goto err;
>  	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
> @@ -3699,6 +3761,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
>  	.arg3_type      = ARG_ANYTHING,
>  };
>  
> +BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
> +	   struct bpf_map *, ex_map, u64, flags)
> +{
> +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
> +		return XDP_ABORTED;
> +
> +	ri->tgt_index = 0;
> +	ri->tgt_value = NULL;
> +	ri->flags = flags;
> +	ri->ex_map = ex_map;
> +
> +	WRITE_ONCE(ri->map, map);
> +
> +	return XDP_REDIRECT;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
> +	.func           = bpf_xdp_redirect_map_multi,
> +	.gpl_only       = false,
> +	.ret_type       = RET_INTEGER,
> +	.arg1_type      = ARG_CONST_MAP_PTR,
> +	.arg2_type      = ARG_CONST_MAP_PTR,
> +	.arg3_type      = ARG_ANYTHING,
> +};
> +
>  static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
>  				  unsigned long off, unsigned long len)
>  {
> @@ -6363,6 +6452,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_xdp_redirect_proto;
>  	case BPF_FUNC_redirect_map:
>  		return &bpf_xdp_redirect_map_proto;
> +	case BPF_FUNC_redirect_map_multi:
> +		return &bpf_xdp_redirect_map_multi_proto;
>  	case BPF_FUNC_xdp_adjust_tail:
>  		return &bpf_xdp_adjust_tail_proto;
>  	case BPF_FUNC_fib_lookup:
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 90f44f382115..acdc63833b1f 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -475,3 +475,29 @@ void xdp_warn(const char *msg, const char *func, const int line)
>  	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
>  };
>  EXPORT_SYMBOL_GPL(xdp_warn);
> +
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> +{
> +	unsigned int headroom, totalsize;
> +	struct xdp_frame *nxdpf;
> +	struct page *page;
> +	void *addr;
> +
> +	headroom = xdpf->headroom + sizeof(*xdpf);
> +	totalsize = headroom + xdpf->len;
> +
> +	if (unlikely(totalsize > PAGE_SIZE))
> +		return NULL;
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +	addr = page_to_virt(page);
> +
> +	memcpy(addr, xdpf, totalsize);
> +
> +	nxdpf = addr;
> +	nxdpf->data = addr + headroom;
> +
> +	return nxdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdpf_clone);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 97e1fd19ff58..000b0cf961ea 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -3157,6 +3157,20 @@ union bpf_attr {
>   *		**bpf_sk_cgroup_id**\ ().
>   *	Return
>   *		The id is returned or 0 in case the id could not be retrieved.
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		Redirect the packet to ALL the interfaces in *map*, but
> + * 		exclude the interfaces in *ex_map* (which may be NULL).
> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which additionally excludes the current ingress device.
> + *
> + * 		See also bpf_redirect_map(), which supports redirecting
> + * 		packet to a specific ifindex in the map.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3288,7 +3302,8 @@ union bpf_attr {
>  	FN(seq_printf),			\
>  	FN(seq_write),			\
>  	FN(sk_cgroup_id),		\
> -	FN(sk_ancestor_cgroup_id),
> +	FN(sk_ancestor_cgroup_id),	\
> +	FN(redirect_map_multi),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3417,6 +3432,11 @@ enum bpf_lwt_encap_mode {
>  	BPF_LWT_ENCAP_IP,
>  };
>  
> +/* BPF_FUNC_redirect_map_multi flags. */
> +enum {
> +	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
> +};
> +
>  #define __bpf_md_ptr(type, name)	\
>  union {					\
>  	type name;			\
> -- 
> 2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-05-27 10:21   ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Toke Høiland-Jørgensen
@ 2020-05-27 10:32     ` Eelco Chaudron
  2020-05-27 12:38     ` Hangbin Liu
  2020-06-03  2:40     ` Hangbin Liu
  2 siblings, 0 replies; 72+ messages in thread
From: Eelco Chaudron @ 2020-05-27 10:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Hangbin Liu, bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, ast,
	Daniel Borkmann, Lorenzo Bianconi



On 27 May 2020, at 12:21, Toke Høiland-Jørgensen wrote:

> Hangbin Liu <liuhangbin@gmail.com> writes:
>
>> Hi all,
>>
>> This patchset is for xdp multicast support, which has been discussed
>> before[0]. The goal is to be able to implement an OVS-like data plane 
>> in
>> XDP, i.e., a software switch that can forward XDP frames to multiple
>> ports.
>>
>> To achieve this, an application needs to specify a group of 
>> interfaces
>> to forward a packet to. It is also common to want to exclude one or 
>> more
>> physical interfaces from the forwarding operation - e.g., to forward 
>> a
>> packet to all interfaces in the multicast group except the interface 
>> it
>> arrived on. While this could be done simply by adding more groups, 
>> this
>> quickly leads to a combinatorial explosion in the number of groups an
>> application has to maintain.
>>
>> To avoid the combinatorial explosion, we propose to include the 
>> ability
>> to specify an "exclude group" as part of the forwarding operation. 
>> This
>> needs to be a group (instead of just a single port index), because a
>> physical interface can be part of a logical grouping, such as a bond
>> device.
>>
>> Thus, the logical forwarding operation becomes a "set difference"
>> operation, i.e. "forward to all ports in group A that are not also in
>> group B". This series implements such an operation using device maps 
>> to
>> represent the groups. This means that the XDP program specifies two
>> device maps, one containing the list of netdevs to redirect to, and 
>> the
>> other containing the exclude list.
>>
>> To achieve this, I re-implement a new helper bpf_redirect_map_multi()
>> to accept two maps, the forwarding map and exclude map. If user
>> don't want to use exclude map and just want simply stop redirecting 
>> back
>> to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.
>>
>> The example in patch 2 is functional, but not a lot of effort
>> has been made on performance optimisation. I did a simple test(pkt 
>> size 64)
>> with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
>> arrays:
>>
>> bpf_redirect_map() with 1 ingress, 1 egress:
>> generic path: ~1600k pps
>> native path: ~980k pps
>>
>> bpf_redirect_map_multi() with 1 ingress, 3 egress:
>> generic path: ~600k pps
>> native path: ~480k pps
>>
>> bpf_redirect_map_multi() with 1 ingress, 9 egress:
>> generic path: ~125k pps
>> native path: ~100k pps
>>
>> The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we 
>> loop
>> the arrays and do clone skb/xdpf. The native path is slower than 
>> generic
>> path as we send skbs by pktgen. So the result looks reasonable.
>
> How are you running these tests? Still on virtual devices? We really
> need results from a physical setup in native mode to assess the impact
> on the native-XDP fast path. The numbers above don't tell much in this
> regard. I'd also like to see a before/after patch for straight
> bpf_redirect_map(), since you're messing with the fast path, and we 
> want
> to make sure it's not causing a performance regression for regular
> redirect.
>
> Finally, since the overhead seems to be quite substantial: A 
> comparison
> with a regular network stack bridge might make sense? After all we 
> also
> want to make sure it's a performance win over that :)

What about adding a test with only one egress port? So it compares 
better to bpf_redirect_map(), i.e. “bpf_redirect_map_multi() with 1 
ingress, 1 egress”.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-05-27 10:21   ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Toke Høiland-Jørgensen
  2020-05-27 10:32     ` Eelco Chaudron
@ 2020-05-27 12:38     ` Hangbin Liu
  2020-05-27 15:04       ` Toke Høiland-Jørgensen
  2020-06-03  2:40     ` Hangbin Liu
  2 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-05-27 12:38 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Wed, May 27, 2020 at 12:21:54PM +0200, Toke Høiland-Jørgensen wrote:
> > The example in patch 2 is functional, but not a lot of effort
> > has been made on performance optimisation. I did a simple test(pkt size 64)
> > with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
> > arrays:
> >
> > bpf_redirect_map() with 1 ingress, 1 egress:
> > generic path: ~1600k pps
> > native path: ~980k pps
> >
> > bpf_redirect_map_multi() with 1 ingress, 3 egress:
> > generic path: ~600k pps
> > native path: ~480k pps
> >
> > bpf_redirect_map_multi() with 1 ingress, 9 egress:
> > generic path: ~125k pps
> > native path: ~100k pps
> >
> > The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
> > the arrays and do clone skb/xdpf. The native path is slower than generic
> > path as we send skbs by pktgen. So the result looks reasonable.
> 
> How are you running these tests? Still on virtual devices? We really

I run it with the test topology in patch 2/2. The test is run on physical
machines, but I use veth interface. Do you mean use a physical NIC driver
for testing?


BTW, when using pktgen, I got an panic because the skb don't have enough
header room. The code path looks like

do_xdp_generic()
  - netif_receive_generic_xdp()
    - skb_headroom(skb) < XDP_PACKET_HEADROOM
      - pskb_expand_head()
        - BUG_ON(skb_shared(skb))

So I added a draft patch for pktgen, not sure if it has any influence.

index 08e2811b5274..fee17310c178 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -170,6 +170,7 @@
 #include <linux/uaccess.h>
 #include <asm/dma.h>
 #include <asm/div64.h>         /* do_div */
+#include <linux/bpf.h>

 #define VERSION        "2.75"
 #define IP_NAME_SZ 32
@@ -2692,7 +2693,7 @@ static void pktgen_finalize_skb(struct pktgen_dev *pkt_dev, struct sk_buff *skb,
 static struct sk_buff *pktgen_alloc_skb(struct net_device *dev,
                                        struct pktgen_dev *pkt_dev)
 {
-       unsigned int extralen = LL_RESERVED_SPACE(dev);
+       unsigned int extralen = LL_RESERVED_SPACE(dev) + XDP_PACKET_HEADROOM;
        struct sk_buff *skb = NULL;
        unsigned int size;

> need results from a physical setup in native mode to assess the impact
> on the native-XDP fast path. The numbers above don't tell much in this
> regard. I'd also like to see a before/after patch for straight
> bpf_redirect_map(), since you're messing with the fast path, and we want
> to make sure it's not causing a performance regression for regular
> redirect.

OK, I will write a test with 1 ingress + 1 egress for bpf_redirect_map_multi.
Just as Eelco said.
> 
> Finally, since the overhead seems to be quite substantial: A comparison
> with a regular network stack bridge might make sense? After all we also
> want to make sure it's a performance win over that :)

OK, Will do it.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-05-27 12:38     ` Hangbin Liu
@ 2020-05-27 15:04       ` Toke Høiland-Jørgensen
  2020-06-16  9:09         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-05-27 15:04 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Wed, May 27, 2020 at 12:21:54PM +0200, Toke Høiland-Jørgensen wrote:
>> > The example in patch 2 is functional, but not a lot of effort
>> > has been made on performance optimisation. I did a simple test(pkt size 64)
>> > with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
>> > arrays:
>> >
>> > bpf_redirect_map() with 1 ingress, 1 egress:
>> > generic path: ~1600k pps
>> > native path: ~980k pps
>> >
>> > bpf_redirect_map_multi() with 1 ingress, 3 egress:
>> > generic path: ~600k pps
>> > native path: ~480k pps
>> >
>> > bpf_redirect_map_multi() with 1 ingress, 9 egress:
>> > generic path: ~125k pps
>> > native path: ~100k pps
>> >
>> > The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
>> > the arrays and do clone skb/xdpf. The native path is slower than generic
>> > path as we send skbs by pktgen. So the result looks reasonable.
>> 
>> How are you running these tests? Still on virtual devices? We really
>
> I run it with the test topology in patch 2/2. The test is run on physical
> machines, but I use veth interface. Do you mean use a physical NIC driver
> for testing?

Yes, sorry, when I said 'physical machine' I should have also 'physical
NIC'. We really need to know how the performance of this is on the XDP
fast path, i.e., when there are no skbs involved at all.

> BTW, when using pktgen, I got an panic because the skb don't have enough
> header room. The code path looks like
>
> do_xdp_generic()
>   - netif_receive_generic_xdp()
>     - skb_headroom(skb) < XDP_PACKET_HEADROOM
>       - pskb_expand_head()
>         - BUG_ON(skb_shared(skb))
>
> So I added a draft patch for pktgen, not sure if it has any influence.

Hmm, as Jesper said pktgen was really not intended to be used this way,
so I guess that's why. I guess I'll let him comment on whether he thinks
it's worth fixing; or you could send this as a proper patch and see if
anyone complains about it ;)

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-05-27 10:21   ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Toke Høiland-Jørgensen
  2020-05-27 10:32     ` Eelco Chaudron
  2020-05-27 12:38     ` Hangbin Liu
@ 2020-06-03  2:40     ` Hangbin Liu
  2020-06-03 11:05       ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-03  2:40 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Wed, May 27, 2020 at 12:21:54PM +0200, Toke Høiland-Jørgensen wrote:
> > The example in patch 2 is functional, but not a lot of effort
> > has been made on performance optimisation. I did a simple test(pkt size 64)
> > with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
> > arrays:
> >
> > bpf_redirect_map() with 1 ingress, 1 egress:
> > generic path: ~1600k pps
> > native path: ~980k pps
> >
> > bpf_redirect_map_multi() with 1 ingress, 3 egress:
> > generic path: ~600k pps
> > native path: ~480k pps
> >
> > bpf_redirect_map_multi() with 1 ingress, 9 egress:
> > generic path: ~125k pps
> > native path: ~100k pps
> >
> > The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
> > the arrays and do clone skb/xdpf. The native path is slower than generic
> > path as we send skbs by pktgen. So the result looks reasonable.
> 
> How are you running these tests? Still on virtual devices? We really
> need results from a physical setup in native mode to assess the impact
> on the native-XDP fast path. The numbers above don't tell much in this
> regard. I'd also like to see a before/after patch for straight
> bpf_redirect_map(), since you're messing with the fast path, and we want
> to make sure it's not causing a performance regression for regular
> redirect.
> 
> Finally, since the overhead seems to be quite substantial: A comparison
> with a regular network stack bridge might make sense? After all we also
> want to make sure it's a performance win over that :)

Hi Toke,

Here is the result I tested with 2 i40e 10G ports on physical machine.
The pktgen pkt_size is 64.

Bridge forwarding(I use sample/bpf/xdp1 to count the PPS, so there are two modes data):
generic mode: 1.32M PPS
driver mode: 1.66M PPS

xdp_redirect_map:
generic mode: 1.88M PPS
driver mode: 2.74M PPS

xdp_redirect_map_multi:
generic mode: 1.38M PPS
driver mode: 2.73M PPS

So what do you think about the data. If you are OK, I will update
my patch and re-post it.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-03  2:40     ` Hangbin Liu
@ 2020-06-03 11:05       ` Toke Høiland-Jørgensen
  2020-06-04  4:09         ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-06-03 11:05 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Wed, May 27, 2020 at 12:21:54PM +0200, Toke Høiland-Jørgensen wrote:
>> > The example in patch 2 is functional, but not a lot of effort
>> > has been made on performance optimisation. I did a simple test(pkt size 64)
>> > with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
>> > arrays:
>> >
>> > bpf_redirect_map() with 1 ingress, 1 egress:
>> > generic path: ~1600k pps
>> > native path: ~980k pps
>> >
>> > bpf_redirect_map_multi() with 1 ingress, 3 egress:
>> > generic path: ~600k pps
>> > native path: ~480k pps
>> >
>> > bpf_redirect_map_multi() with 1 ingress, 9 egress:
>> > generic path: ~125k pps
>> > native path: ~100k pps
>> >
>> > The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
>> > the arrays and do clone skb/xdpf. The native path is slower than generic
>> > path as we send skbs by pktgen. So the result looks reasonable.
>> 
>> How are you running these tests? Still on virtual devices? We really
>> need results from a physical setup in native mode to assess the impact
>> on the native-XDP fast path. The numbers above don't tell much in this
>> regard. I'd also like to see a before/after patch for straight
>> bpf_redirect_map(), since you're messing with the fast path, and we want
>> to make sure it's not causing a performance regression for regular
>> redirect.
>> 
>> Finally, since the overhead seems to be quite substantial: A comparison
>> with a regular network stack bridge might make sense? After all we also
>> want to make sure it's a performance win over that :)
>
> Hi Toke,
>
> Here is the result I tested with 2 i40e 10G ports on physical machine.
> The pktgen pkt_size is 64.

These numbers seem a bit low (I'm getting ~8.5MPPS on my test machine
for a simple redirect). Some of that may just be performance of the
machine, I guess (what are you running this on?), but please check that
you are not limited by pktgen itself - i.e., that pktgen is generating
traffic at a higher rate than what XDP is processing.

> Bridge forwarding(I use sample/bpf/xdp1 to count the PPS, so there are two modes data):
> generic mode: 1.32M PPS
> driver mode: 1.66M PPS

I'm not sure I understand this - what are you measuring here exactly?

> xdp_redirect_map:
> generic mode: 1.88M PPS
> driver mode: 2.74M PPS

Please add numbers without your patch applied as well, for comparison.

> xdp_redirect_map_multi:
> generic mode: 1.38M PPS
> driver mode: 2.73M PPS

I assume this is with a single interface only, right? Could you please
add a test with a second interface (so the packet is cloned) as well?
You can just use a veth as the second target device.

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-03 11:05       ` Toke Høiland-Jørgensen
@ 2020-06-04  4:09         ` Hangbin Liu
  2020-06-04  9:44           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-04  4:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Wed, Jun 03, 2020 at 01:05:28PM +0200, Toke Høiland-Jørgensen wrote:
> > Hi Toke,
> >
> > Here is the result I tested with 2 i40e 10G ports on physical machine.
> > The pktgen pkt_size is 64.
> 
> These numbers seem a bit low (I'm getting ~8.5MPPS on my test machine
> for a simple redirect). Some of that may just be performance of the
> machine, I guess (what are you running this on?), but please check that
> you are not limited by pktgen itself - i.e., that pktgen is generating
> traffic at a higher rate than what XDP is processing.

Here is the test topology, which looks like

 Host A    |     Host B        |        Host C
 eth0      +    eth0 - eth1    +        eth0

I did pktgen sending on Host A, forwarding on Host B.
Host B is a Dell PowerEdge R730 (128G memory, Intel(R) Xeon(R) CPU E5-2690 v3)
eth0, eth1 is an onboard i40e 10G driver

Test 1: add eth0, eth1 to br0 and test bridge forwarding
Test 2: Test xdp_redirect_map(), eth0 is ingress, eth1 is egress
Test 3: Test xdp_redirect_map_multi(), eth0 is ingress, eth1 is egress

> 
> > Bridge forwarding(I use sample/bpf/xdp1 to count the PPS, so there are two modes data):
> > generic mode: 1.32M PPS
> > driver mode: 1.66M PPS
> 
> I'm not sure I understand this - what are you measuring here exactly?

> Finally, since the overhead seems to be quite substantial: A comparison
> with a regular network stack bridge might make sense? After all we also
> want to make sure it's a performance win over that :)

I though you want me also test with bridge forwarding. Am I missing something?

> 
> > xdp_redirect_map:
> > generic mode: 1.88M PPS
> > driver mode: 2.74M PPS
> 
> Please add numbers without your patch applied as well, for comparison.

OK, I will.
> 
> > xdp_redirect_map_multi:
> > generic mode: 1.38M PPS
> > driver mode: 2.73M PPS
> 
> I assume this is with a single interface only, right? Could you please
> add a test with a second interface (so the packet is cloned) as well?
> You can just use a veth as the second target device.

OK, so the topology on Host B should be like

eth0 + eth1 + veth0, eth0 as ingress, eth1 and veth0 as egress, right?

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-04  4:09         ` Hangbin Liu
@ 2020-06-04  9:44           ` Toke Høiland-Jørgensen
  2020-06-04 12:12             ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-06-04  9:44 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Wed, Jun 03, 2020 at 01:05:28PM +0200, Toke Høiland-Jørgensen wrote:
>> > Hi Toke,
>> >
>> > Here is the result I tested with 2 i40e 10G ports on physical machine.
>> > The pktgen pkt_size is 64.
>> 
>> These numbers seem a bit low (I'm getting ~8.5MPPS on my test machine
>> for a simple redirect). Some of that may just be performance of the
>> machine, I guess (what are you running this on?), but please check that
>> you are not limited by pktgen itself - i.e., that pktgen is generating
>> traffic at a higher rate than what XDP is processing.
>
> Here is the test topology, which looks like
>
>  Host A    |     Host B        |        Host C
>  eth0      +    eth0 - eth1    +        eth0
>
> I did pktgen sending on Host A, forwarding on Host B.
> Host B is a Dell PowerEdge R730 (128G memory, Intel(R) Xeon(R) CPU E5-2690 v3)
> eth0, eth1 is an onboard i40e 10G driver
>
> Test 1: add eth0, eth1 to br0 and test bridge forwarding
> Test 2: Test xdp_redirect_map(), eth0 is ingress, eth1 is egress
> Test 3: Test xdp_redirect_map_multi(), eth0 is ingress, eth1 is egress

Right, that all seems reasonable, but that machine is comparable to
my test machine, so you should be getting way more than 2.75 MPPS on a
regular redirect test. Are you bottlenecked on pktgen or something?

Could you please try running Jesper's ethtool stats poller:
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

on eth0 on Host B, and see what PPS values you get on the different counters?

>> > Bridge forwarding(I use sample/bpf/xdp1 to count the PPS, so there are two modes data):
>> > generic mode: 1.32M PPS
>> > driver mode: 1.66M PPS
>> 
>> I'm not sure I understand this - what are you measuring here exactly?
>
>> Finally, since the overhead seems to be quite substantial: A comparison
>> with a regular network stack bridge might make sense? After all we also
>> want to make sure it's a performance win over that :)
>
> I though you want me also test with bridge forwarding. Am I missing something?

Yes, but what does this mean:
> (I use sample/bpf/xdp1 to count the PPS, so there are two modes data):

or rather, why are there two numbers? :)

>> > xdp_redirect_map:
>> > generic mode: 1.88M PPS
>> > driver mode: 2.74M PPS
>> 
>> Please add numbers without your patch applied as well, for comparison.
>
> OK, I will.
>> 
>> > xdp_redirect_map_multi:
>> > generic mode: 1.38M PPS
>> > driver mode: 2.73M PPS
>> 
>> I assume this is with a single interface only, right? Could you please
>> add a test with a second interface (so the packet is cloned) as well?
>> You can just use a veth as the second target device.
>
> OK, so the topology on Host B should be like
>
> eth0 + eth1 + veth0, eth0 as ingress, eth1 and veth0 as egress, right?

Yup, exactly!

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-04  9:44           ` Toke Høiland-Jørgensen
@ 2020-06-04 12:12             ` Hangbin Liu
  2020-06-04 12:37               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-04 12:12 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Thu, Jun 04, 2020 at 11:44:24AM +0200, Toke Høiland-Jørgensen wrote:
> Hangbin Liu <liuhangbin@gmail.com> writes:
> > Here is the test topology, which looks like
> >
> >  Host A    |     Host B        |        Host C
> >  eth0      +    eth0 - eth1    +        eth0
> >
> > I did pktgen sending on Host A, forwarding on Host B.
> > Host B is a Dell PowerEdge R730 (128G memory, Intel(R) Xeon(R) CPU E5-2690 v3)
> > eth0, eth1 is an onboard i40e 10G driver
> >
> > Test 1: add eth0, eth1 to br0 and test bridge forwarding
> > Test 2: Test xdp_redirect_map(), eth0 is ingress, eth1 is egress
> > Test 3: Test xdp_redirect_map_multi(), eth0 is ingress, eth1 is egress
> 
> Right, that all seems reasonable, but that machine is comparable to
> my test machine, so you should be getting way more than 2.75 MPPS on a
> regular redirect test. Are you bottlenecked on pktgen or something?

Yes, I found the pktgen is bottleneck. I only use 1 thread.
By using the cmd you gave to me
./pktgen_sample03_burst_single_flow.sh  -i eno1 -d 192.168.200.1 -m f8:bc:12:14:11:20 -t 4  -s 64

Now I could get higher speed.

> 
> Could you please try running Jesper's ethtool stats poller:
> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

Nice tool.

> > I though you want me also test with bridge forwarding. Am I missing something?
> 
> Yes, but what does this mean:
> > (I use sample/bpf/xdp1 to count the PPS, so there are two modes data):
> 
> or rather, why are there two numbers? :)

Just as it said, to test bridge forwarding speed. I use the xdp tool
sample/bpf/xdp1 to count the PPS. But there are two modes when attach xdp
to eth0, general and driver mode. So there are 2 number..

Now I use the ethtool_stats.pl to count forwarding speed and here is the result:

With kernel 5.7(ingress i40e, egress i40e)
XDP:
bridge: 1.8M PPS
xdp_redirect_map:
  generic mode: 1.9M PPS
  driver mode: 10.4M PPS

Kernel 5.7 + my patch(ingress i40e, egress i40e)
bridge: 1.8M
xdp_redirect_map:
  generic mode: 1.86M PPS
  driver mode: 10.17M PPS
xdp_redirect_map_multi:
  generic mode: 1.53M PPS
  driver mode: 7.22M PPS

Kernel 5.7 + my patch(ingress i40e, egress veth)
xdp_redirect_map:
  generic mode: 1.38M PPS
  driver mode: 4.15M PPS
xdp_redirect_map_multi:
  generic mode: 1.13M PPS
  driver mode: 3.55M PPS

Kernel 5.7 + my patch(ingress i40e, egress i40e + veth)
xdp_redirect_map_multi:
  generic mode: 1.13M PPS
  driver mode: 3.47M PPS

I added a group that with i40e ingress and veth egress, which shows
a significant drop on the speed. It looks like veth driver is a bottleneck,
but I don't have more i40e NICs on the test bed...

Thanks
Hangbin


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-04 12:12             ` Hangbin Liu
@ 2020-06-04 12:37               ` Toke Høiland-Jørgensen
  2020-06-04 14:41                 ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-06-04 12:37 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Thu, Jun 04, 2020 at 11:44:24AM +0200, Toke Høiland-Jørgensen wrote:
>> Hangbin Liu <liuhangbin@gmail.com> writes:
>> > Here is the test topology, which looks like
>> >
>> >  Host A    |     Host B        |        Host C
>> >  eth0      +    eth0 - eth1    +        eth0
>> >
>> > I did pktgen sending on Host A, forwarding on Host B.
>> > Host B is a Dell PowerEdge R730 (128G memory, Intel(R) Xeon(R) CPU E5-2690 v3)
>> > eth0, eth1 is an onboard i40e 10G driver
>> >
>> > Test 1: add eth0, eth1 to br0 and test bridge forwarding
>> > Test 2: Test xdp_redirect_map(), eth0 is ingress, eth1 is egress
>> > Test 3: Test xdp_redirect_map_multi(), eth0 is ingress, eth1 is egress
>> 
>> Right, that all seems reasonable, but that machine is comparable to
>> my test machine, so you should be getting way more than 2.75 MPPS on a
>> regular redirect test. Are you bottlenecked on pktgen or something?
>
> Yes, I found the pktgen is bottleneck. I only use 1 thread.
> By using the cmd you gave to me
> ./pktgen_sample03_burst_single_flow.sh  -i eno1 -d 192.168.200.1 -m f8:bc:12:14:11:20 -t 4  -s 64
>
> Now I could get higher speed.
>
>> 
>> Could you please try running Jesper's ethtool stats poller:
>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>
> Nice tool.
>
>> > I though you want me also test with bridge forwarding. Am I missing something?
>> 
>> Yes, but what does this mean:
>> > (I use sample/bpf/xdp1 to count the PPS, so there are two modes data):
>> 
>> or rather, why are there two numbers? :)
>
> Just as it said, to test bridge forwarding speed. I use the xdp tool
> sample/bpf/xdp1 to count the PPS. But there are two modes when attach xdp
> to eth0, general and driver mode. So there are 2 number..
>
> Now I use the ethtool_stats.pl to count forwarding speed and here is the result:
>
> With kernel 5.7(ingress i40e, egress i40e)
> XDP:
> bridge: 1.8M PPS
> xdp_redirect_map:
>   generic mode: 1.9M PPS
>   driver mode: 10.4M PPS

Ah, now we're getting somewhere! :)

> Kernel 5.7 + my patch(ingress i40e, egress i40e)
> bridge: 1.8M
> xdp_redirect_map:
>   generic mode: 1.86M PPS
>   driver mode: 10.17M PPS

Right, so this corresponds to a ~2ns overhead (10**9/10400000 -
10**9/10170000). This is not too far from being in the noise, I suppose;
is the difference consistent?

> xdp_redirect_map_multi:
>   generic mode: 1.53M PPS
>   driver mode: 7.22M PPS
>
> Kernel 5.7 + my patch(ingress i40e, egress veth)
> xdp_redirect_map:
>   generic mode: 1.38M PPS
>   driver mode: 4.15M PPS
> xdp_redirect_map_multi:
>   generic mode: 1.13M PPS
>   driver mode: 3.55M PPS
>
> Kernel 5.7 + my patch(ingress i40e, egress i40e + veth)
> xdp_redirect_map_multi:
>   generic mode: 1.13M PPS
>   driver mode: 3.47M PPS
>
> I added a group that with i40e ingress and veth egress, which shows
> a significant drop on the speed. It looks like veth driver is a bottleneck,
> but I don't have more i40e NICs on the test bed...

I suspect this may be because veth ends up creating an SKB for each
packet after receiving the frame on the peer device (even though it's
immediately dropped). Could you please try adding an XDP program that
drops the packets on the veth peer of your target, and see if that
helps?

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-04 12:37               ` Toke Høiland-Jørgensen
@ 2020-06-04 14:41                 ` Hangbin Liu
  2020-06-04 16:02                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-04 14:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Thu, Jun 04, 2020 at 02:37:23PM +0200, Toke Høiland-Jørgensen wrote:
> > Now I use the ethtool_stats.pl to count forwarding speed and here is the result:
> >
> > With kernel 5.7(ingress i40e, egress i40e)
> > XDP:
> > bridge: 1.8M PPS
> > xdp_redirect_map:
> >   generic mode: 1.9M PPS
> >   driver mode: 10.4M PPS
> 
> Ah, now we're getting somewhere! :)
> 
> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
> > bridge: 1.8M
> > xdp_redirect_map:
> >   generic mode: 1.86M PPS
> >   driver mode: 10.17M PPS
> 
> Right, so this corresponds to a ~2ns overhead (10**9/10400000 -
> 10**9/10170000). This is not too far from being in the noise, I suppose;
> is the difference consistent?

Sorry, I didn't get, what different consistent do you mean?

> 
> > xdp_redirect_map_multi:
> >   generic mode: 1.53M PPS
> >   driver mode: 7.22M PPS
> >
> > Kernel 5.7 + my patch(ingress i40e, egress veth)
> > xdp_redirect_map:
> >   generic mode: 1.38M PPS
> >   driver mode: 4.15M PPS
> > xdp_redirect_map_multi:
> >   generic mode: 1.13M PPS
> >   driver mode: 3.55M PPS

With XDP_DROP in veth perr, the number looks much better

xdp_redirect_map:
  generic mode: 1.64M PPS
  driver mode: 13.3M PPS
xdp_redirect_map_multi:
  generic mode: 1.29M PPS
  driver mode: 8.5M PPS

> >
> > Kernel 5.7 + my patch(ingress i40e, egress i40e + veth)
> > xdp_redirect_map_multi:
> >   generic mode: 1.13M PPS
> >   driver mode: 3.47M PPS

But I don't know why this one get even a little slower..

xdp_redirect_map_multi:
  generic mode: 0.96M PPS
  driver mode: 3.14M PPS

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-04 14:41                 ` Hangbin Liu
@ 2020-06-04 16:02                   ` Toke Høiland-Jørgensen
  2020-06-05  6:26                     ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-06-04 16:02 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Thu, Jun 04, 2020 at 02:37:23PM +0200, Toke Høiland-Jørgensen wrote:
>> > Now I use the ethtool_stats.pl to count forwarding speed and here is the result:
>> >
>> > With kernel 5.7(ingress i40e, egress i40e)
>> > XDP:
>> > bridge: 1.8M PPS
>> > xdp_redirect_map:
>> >   generic mode: 1.9M PPS
>> >   driver mode: 10.4M PPS
>> 
>> Ah, now we're getting somewhere! :)
>> 
>> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
>> > bridge: 1.8M
>> > xdp_redirect_map:
>> >   generic mode: 1.86M PPS
>> >   driver mode: 10.17M PPS
>> 
>> Right, so this corresponds to a ~2ns overhead (10**9/10400000 -
>> 10**9/10170000). This is not too far from being in the noise, I suppose;
>> is the difference consistent?
>
> Sorry, I didn't get, what different consistent do you mean?

I meant, how much do the numbers vary between each test run?

>> > xdp_redirect_map_multi:
>> >   generic mode: 1.53M PPS
>> >   driver mode: 7.22M PPS
>> >
>> > Kernel 5.7 + my patch(ingress i40e, egress veth)
>> > xdp_redirect_map:
>> >   generic mode: 1.38M PPS
>> >   driver mode: 4.15M PPS
>> > xdp_redirect_map_multi:
>> >   generic mode: 1.13M PPS
>> >   driver mode: 3.55M PPS
>
> With XDP_DROP in veth perr, the number looks much better
>
> xdp_redirect_map:
>   generic mode: 1.64M PPS
>   driver mode: 13.3M PPS
> xdp_redirect_map_multi:
>   generic mode: 1.29M PPS
>   driver mode: 8.5M PPS

Is this for a single interface in both cases? Look a bit odd that you
get such a big difference all of a sudden; is the redirect failing in
one of those cases (should be a hint in the ethtool stats, I think,
otherwise check xdp_monitor)?

>> > Kernel 5.7 + my patch(ingress i40e, egress i40e + veth)
>> > xdp_redirect_map_multi:
>> >   generic mode: 1.13M PPS
>> >   driver mode: 3.47M PPS
>
> But I don't know why this one get even a little slower..
>
> xdp_redirect_map_multi:
>   generic mode: 0.96M PPS
>   driver mode: 3.14M PPS

Yeah, this does seem a bit odd. Don't have any good ideas off the top of
my head, but maybe worth double-checking where the time is spent. You
can use 'perf' for this, but you need to make sure it's recording the
CPU that is processing packets...

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-04 16:02                   ` Toke Høiland-Jørgensen
@ 2020-06-05  6:26                     ` Hangbin Liu
  2020-06-08 15:32                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-05  6:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Thu, Jun 04, 2020 at 06:02:54PM +0200, Toke Høiland-Jørgensen wrote:
> Hangbin Liu <liuhangbin@gmail.com> writes:
> 
> > On Thu, Jun 04, 2020 at 02:37:23PM +0200, Toke Høiland-Jørgensen wrote:
> >> > Now I use the ethtool_stats.pl to count forwarding speed and here is the result:
> >> >
> >> > With kernel 5.7(ingress i40e, egress i40e)
> >> > XDP:
> >> > bridge: 1.8M PPS
> >> > xdp_redirect_map:
> >> >   generic mode: 1.9M PPS
> >> >   driver mode: 10.4M PPS
> >> 
> >> Ah, now we're getting somewhere! :)
> >> 
> >> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
> >> > bridge: 1.8M
> >> > xdp_redirect_map:
> >> >   generic mode: 1.86M PPS
> >> >   driver mode: 10.17M PPS
> >> 
> >> Right, so this corresponds to a ~2ns overhead (10**9/10400000 -
> >> 10**9/10170000). This is not too far from being in the noise, I suppose;
> >> is the difference consistent?
> >
> > Sorry, I didn't get, what different consistent do you mean?
> 
> I meant, how much do the numbers vary between each test run?

Oh, when run it at the same period, the number is stable, the range is about
~0.05M PPS. But after a long time or reboot, the speed may changed a little.
Here is the new test result after I reboot the system:

Kernel 5.7 + my patch(ingress i40e, egress i40e)
xdp_redirect_map:
  generic mode: 1.9M PPS
  driver mode: 10.2M PPS

xdp_redirect_map_multi:
  generic mode: 1.58M PPS
  driver mode: 7.16M PPS

Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(No XDP on peer))
xdp_redirect_map:
  generic mode: 2.2M PPS
  driver mode: 14.2M PPS

xdp_redirect_map_multi:
  generic mode: 1.6M PPS
  driver mode: 9.9M PPS

Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(with XDP_DROP on peer))
xdp_redirect_map:
  generic mode: 1.6M PPS
  driver mode: 13.6M PPS

xdp_redirect_map_multi:
  generic mode: 1.3M PPS
  driver mode: 8.7M PPS

Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(No XDP on peer))
xdp_redirect_map_multi:
  generic mode: 1.15M PPS
  driver mode: 3.48M PPS

Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(with XDP_DROP on peer))
xdp_redirect_map_multi:
  generic mode: 0.98M PPS
  driver mode: 3.15M PPS

This time the number looks more reasonable.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-05  6:26                     ` Hangbin Liu
@ 2020-06-08 15:32                       ` Toke Høiland-Jørgensen
  2020-06-09  3:03                         ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-06-08 15:32 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Thu, Jun 04, 2020 at 06:02:54PM +0200, Toke Høiland-Jørgensen wrote:
>> Hangbin Liu <liuhangbin@gmail.com> writes:
>> 
>> > On Thu, Jun 04, 2020 at 02:37:23PM +0200, Toke Høiland-Jørgensen wrote:
>> >> > Now I use the ethtool_stats.pl to count forwarding speed and here is the result:
>> >> >
>> >> > With kernel 5.7(ingress i40e, egress i40e)
>> >> > XDP:
>> >> > bridge: 1.8M PPS
>> >> > xdp_redirect_map:
>> >> >   generic mode: 1.9M PPS
>> >> >   driver mode: 10.4M PPS
>> >> 
>> >> Ah, now we're getting somewhere! :)
>> >> 
>> >> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
>> >> > bridge: 1.8M
>> >> > xdp_redirect_map:
>> >> >   generic mode: 1.86M PPS
>> >> >   driver mode: 10.17M PPS
>> >> 
>> >> Right, so this corresponds to a ~2ns overhead (10**9/10400000 -
>> >> 10**9/10170000). This is not too far from being in the noise, I suppose;
>> >> is the difference consistent?
>> >
>> > Sorry, I didn't get, what different consistent do you mean?
>> 
>> I meant, how much do the numbers vary between each test run?
>
> Oh, when run it at the same period, the number is stable, the range is about
> ~0.05M PPS. But after a long time or reboot, the speed may changed a little.
> Here is the new test result after I reboot the system:
>
> Kernel 5.7 + my patch(ingress i40e, egress i40e)
> xdp_redirect_map:
>   generic mode: 1.9M PPS
>   driver mode: 10.2M PPS
>
> xdp_redirect_map_multi:
>   generic mode: 1.58M PPS
>   driver mode: 7.16M PPS
>
> Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(No XDP on peer))
> xdp_redirect_map:
>   generic mode: 2.2M PPS
>   driver mode: 14.2M PPS

This looks wrong - why is performance increasing when adding another
target? How are you even adding another target to regular
xdp_redirect_map?

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-08 15:32                       ` Toke Høiland-Jørgensen
@ 2020-06-09  3:03                         ` Hangbin Liu
  2020-06-09 20:31                           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-09  3:03 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Mon, Jun 08, 2020 at 05:32:54PM +0200, Toke Høiland-Jørgensen wrote:
> Hangbin Liu <liuhangbin@gmail.com> writes:
> 
> > On Thu, Jun 04, 2020 at 06:02:54PM +0200, Toke Høiland-Jørgensen wrote:
> >> Hangbin Liu <liuhangbin@gmail.com> writes:
> >> 
> >> > On Thu, Jun 04, 2020 at 02:37:23PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> > Now I use the ethtool_stats.pl to count forwarding speed and here is the result:
> >> >> >
> >> >> > With kernel 5.7(ingress i40e, egress i40e)
> >> >> > XDP:
> >> >> > bridge: 1.8M PPS
> >> >> > xdp_redirect_map:
> >> >> >   generic mode: 1.9M PPS
> >> >> >   driver mode: 10.4M PPS
> >> >> 
> >> >> Ah, now we're getting somewhere! :)
> >> >> 
> >> >> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
> >> >> > bridge: 1.8M
> >> >> > xdp_redirect_map:
> >> >> >   generic mode: 1.86M PPS
> >> >> >   driver mode: 10.17M PPS
> >> >> 
> >> >> Right, so this corresponds to a ~2ns overhead (10**9/10400000 -
> >> >> 10**9/10170000). This is not too far from being in the noise, I suppose;
> >> >> is the difference consistent?
> >> >
> >> > Sorry, I didn't get, what different consistent do you mean?
> >> 
> >> I meant, how much do the numbers vary between each test run?
> >
> > Oh, when run it at the same period, the number is stable, the range is about
> > ~0.05M PPS. But after a long time or reboot, the speed may changed a little.
> > Here is the new test result after I reboot the system:
> >
> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
> > xdp_redirect_map:
> >   generic mode: 1.9M PPS
> >   driver mode: 10.2M PPS
> >
> > xdp_redirect_map_multi:
> >   generic mode: 1.58M PPS
> >   driver mode: 7.16M PPS
> >
> > Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(No XDP on peer))
> > xdp_redirect_map:
> >   generic mode: 2.2M PPS
> >   driver mode: 14.2M PPS
> 
> This looks wrong - why is performance increasing when adding another
> target? How are you even adding another target to regular
> xdp_redirect_map?
> 
Oh, sorry for the typo, the numbers make me crazy, it should be only
ingress i40e, egress veth. Here is the right description:

Kernel 5.7 + my patch(ingress i40e, egress i40e)
xdp_redirect_map:
  generic mode: 1.9M PPS
  driver mode: 10.2M PPS

xdp_redirect_map_multi:
  generic mode: 1.58M PPS
  driver mode: 7.16M PPS

Kernel 5.7 + my patch(ingress i40e, egress veth(No XDP on peer))
xdp_redirect_map:
  generic mode: 2.2M PPS
  driver mode: 14.2M PPS

xdp_redirect_map_multi:
  generic mode: 1.6M PPS
  driver mode: 9.9M PPS

Kernel 5.7 + my patch(ingress i40e, egress veth(with XDP_DROP on peer))
xdp_redirect_map:
  generic mode: 1.6M PPS
  driver mode: 13.6M PPS

xdp_redirect_map_multi:
  generic mode: 1.3M PPS
  driver mode: 8.7M PPS

Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(No XDP on peer))
xdp_redirect_map_multi:
  generic mode: 1.15M PPS
  driver mode: 3.48M PPS

Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(with XDP_DROP on peer))
xdp_redirect_map_multi:
  generic mode: 0.98M PPS
  driver mode: 3.15M PPS

The performance number for xdp_redirect_map_multi is not very well.
But I think we can optimize after the implementation.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-09  3:03                         ` Hangbin Liu
@ 2020-06-09 20:31                           ` Toke Høiland-Jørgensen
  2020-06-10  2:35                             ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-06-09 20:31 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Mon, Jun 08, 2020 at 05:32:54PM +0200, Toke Høiland-Jørgensen wrote:
>> Hangbin Liu <liuhangbin@gmail.com> writes:
>> 
>> > On Thu, Jun 04, 2020 at 06:02:54PM +0200, Toke Høiland-Jørgensen wrote:
>> >> Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> 
>> >> > On Thu, Jun 04, 2020 at 02:37:23PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> > Now I use the ethtool_stats.pl to count forwarding speed and here is the result:
>> >> >> >
>> >> >> > With kernel 5.7(ingress i40e, egress i40e)
>> >> >> > XDP:
>> >> >> > bridge: 1.8M PPS
>> >> >> > xdp_redirect_map:
>> >> >> >   generic mode: 1.9M PPS
>> >> >> >   driver mode: 10.4M PPS
>> >> >> 
>> >> >> Ah, now we're getting somewhere! :)
>> >> >> 
>> >> >> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
>> >> >> > bridge: 1.8M
>> >> >> > xdp_redirect_map:
>> >> >> >   generic mode: 1.86M PPS
>> >> >> >   driver mode: 10.17M PPS
>> >> >> 
>> >> >> Right, so this corresponds to a ~2ns overhead (10**9/10400000 -
>> >> >> 10**9/10170000). This is not too far from being in the noise, I suppose;
>> >> >> is the difference consistent?
>> >> >
>> >> > Sorry, I didn't get, what different consistent do you mean?
>> >> 
>> >> I meant, how much do the numbers vary between each test run?
>> >
>> > Oh, when run it at the same period, the number is stable, the range is about
>> > ~0.05M PPS. But after a long time or reboot, the speed may changed a little.
>> > Here is the new test result after I reboot the system:
>> >
>> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
>> > xdp_redirect_map:
>> >   generic mode: 1.9M PPS
>> >   driver mode: 10.2M PPS
>> >
>> > xdp_redirect_map_multi:
>> >   generic mode: 1.58M PPS
>> >   driver mode: 7.16M PPS
>> >
>> > Kernel 5.7 + my patch(ingress i40e, egress i40e + veth(No XDP on peer))
>> > xdp_redirect_map:
>> >   generic mode: 2.2M PPS
>> >   driver mode: 14.2M PPS
>> 
>> This looks wrong - why is performance increasing when adding another
>> target? How are you even adding another target to regular
>> xdp_redirect_map?
>> 
> Oh, sorry for the typo, the numbers make me crazy, it should be only
> ingress i40e, egress veth. Here is the right description:
>
> Kernel 5.7 + my patch(ingress i40e, egress i40e)
> xdp_redirect_map:
>   generic mode: 1.9M PPS
>   driver mode: 10.2M PPS
>
> xdp_redirect_map_multi:
>   generic mode: 1.58M PPS
>   driver mode: 7.16M PPS
>
> Kernel 5.7 + my patch(ingress i40e, egress veth(No XDP on peer))
> xdp_redirect_map:
>   generic mode: 2.2M PPS
>   driver mode: 14.2M PPS

A few messages up-thread you were getting 4.15M PPS in this case - what
changed? It's inconsistencies like these that make me suspicious of the
whole set of results :/

Are you getting these numbers from ethtool_stats.pl or from the XDP
program? What counter are you looking at, exactly?

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-09 20:31                           ` Toke Høiland-Jørgensen
@ 2020-06-10  2:35                             ` Hangbin Liu
  2020-06-10 10:03                               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-10  2:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Tue, Jun 09, 2020 at 10:31:19PM +0200, Toke Høiland-Jørgensen wrote:
> > Oh, sorry for the typo, the numbers make me crazy, it should be only
> > ingress i40e, egress veth. Here is the right description:
> >
> > Kernel 5.7 + my patch(ingress i40e, egress i40e)
> > xdp_redirect_map:
> >   generic mode: 1.9M PPS
> >   driver mode: 10.2M PPS
> >
> > xdp_redirect_map_multi:
> >   generic mode: 1.58M PPS
> >   driver mode: 7.16M PPS
> >
> > Kernel 5.7 + my patch(ingress i40e, egress veth(No XDP on peer))
> > xdp_redirect_map:
> >   generic mode: 2.2M PPS
> >   driver mode: 14.2M PPS
> 
> A few messages up-thread you were getting 4.15M PPS in this case - what
> changed? It's inconsistencies like these that make me suspicious of the
> whole set of results :/

I got the number after a reboot, not sure what happened.
And I also feel surprised... But the result shows the number, so I have
to put it here.

> 
> Are you getting these numbers from ethtool_stats.pl or from the XDP
> program? What counter are you looking at, exactly?

For bridge testing I use ethtool_stats.pl. For later xdp_redirect_map
and xdp_redirect_map_multi testing, I checked that ethtool_stats.pl and
XDP program shows the same number. When run ethtool_stats.pl the number
will go a little bit slower. So at the end I use the xdp program's number.

I'm going to re-setup the test environment and share it with you. Hope
we could get a final number that we all accept.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-10  2:35                             ` Hangbin Liu
@ 2020-06-10 10:03                               ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-10 10:03 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: Toke Høiland-Jørgensen, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi, brouer

On Wed, 10 Jun 2020 10:35:08 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Tue, Jun 09, 2020 at 10:31:19PM +0200, Toke Høiland-Jørgensen wrote:
> > > Oh, sorry for the typo, the numbers make me crazy, it should be only
> > > ingress i40e, egress veth. Here is the right description:
> > >
> > > Kernel 5.7 + my patch(ingress i40e, egress i40e)
> > > xdp_redirect_map:
> > >   generic mode: 1.9M PPS
> > >   driver mode: 10.2M PPS
> > >
> > > xdp_redirect_map_multi:
> > >   generic mode: 1.58M PPS
> > >   driver mode: 7.16M PPS
> > >
> > > Kernel 5.7 + my patch(ingress i40e, egress veth(No XDP on peer))
> > > xdp_redirect_map:
> > >   generic mode: 2.2M PPS
> > >   driver mode: 14.2M PPS  
> > 
> > A few messages up-thread you were getting 4.15M PPS in this case - what
> > changed? It's inconsistencies like these that make me suspicious of the
> > whole set of results :/  
> 
> I got the number after a reboot, not sure what happened.
> And I also feel surprised... But the result shows the number, so I have
> to put it here.
> 
> > 
> > Are you getting these numbers from ethtool_stats.pl or from the XDP
> > program? What counter are you looking at, exactly?  
> 
> For bridge testing I use ethtool_stats.pl. For later xdp_redirect_map
> and xdp_redirect_map_multi testing, I checked that ethtool_stats.pl and
> XDP program shows the same number. When run ethtool_stats.pl the number
> will go a little bit slower. So at the end I use the xdp program's number.

You cannot trust the xdp program's number, because it just counts all
RX-packets, and don't take into account if the packets are getting
dropped.  We really want to verify (e.g. with ethtool_stats.pl) that
the packets were successfully transmitted.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-26 14:05   ` [PATCHv4 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-05-27 10:29     ` Toke Høiland-Jørgensen
@ 2020-06-10 10:18     ` Jesper Dangaard Brouer
  2020-06-12  8:54       ` Hangbin Liu
  2020-06-10 10:21     ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-10 10:18 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi, brouer

On Tue, 26 May 2020 22:05:38 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 90f44f382115..acdc63833b1f 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -475,3 +475,29 @@ void xdp_warn(const char *msg, const char *func, const int line)
>  	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
>  };
>  EXPORT_SYMBOL_GPL(xdp_warn);
> +
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> +{
> +	unsigned int headroom, totalsize;
> +	struct xdp_frame *nxdpf;
> +	struct page *page;
> +	void *addr;
> +
> +	headroom = xdpf->headroom + sizeof(*xdpf);
> +	totalsize = headroom + xdpf->len;
> +
> +	if (unlikely(totalsize > PAGE_SIZE))
> +		return NULL;
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +	addr = page_to_virt(page);
> +
> +	memcpy(addr, xdpf, totalsize);

I don't think this will work.  You are assuming that the memory model
(xdp_mem_info) is the same.

You happened to use i40, that have MEM_TYPE_PAGE_SHARED, and you should
have changed this to MEM_TYPE_PAGE_ORDER0, but it doesn't crash as they
are compatible.  If you were using mlx5, I suspect that this would
result in memory leaking.

You also need to update xdpf->frame_sz, as you also cannot assume it is
the same.

> +
> +	nxdpf = addr;
> +	nxdpf->data = addr + headroom;
> +
> +	return nxdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdpf_clone);


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


struct xdp_frame {
	void *data;
	u16 len;
	u16 headroom;
	u32 metasize:8;
	u32 frame_sz:24;
	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
	 * while mem info is valid on remote CPU.
	 */
	struct xdp_mem_info mem;
	struct net_device *dev_rx; /* used by cpumap */
};


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-26 14:05   ` [PATCHv4 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-05-27 10:29     ` Toke Høiland-Jørgensen
  2020-06-10 10:18     ` Jesper Dangaard Brouer
@ 2020-06-10 10:21     ` Jesper Dangaard Brouer
  2020-06-10 10:29       ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-10 10:21 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi, brouer

On Tue, 26 May 2020 22:05:38 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index a51d9fb7a359..ecc5c44a5bab 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
[...]

> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	struct bpf_dtab_netdev *obj = NULL;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	struct net_device *dev;
> +	bool first = true;
> +	u32 key, next_key;
> +	int err;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (!obj || dev_in_exclude_map(obj, ex_map,
> +					       exclude_ingress ? dev_rx->ifindex : 0))
> +			goto find_next;
> +
> +		dev = obj->dev;
> +
> +		if (!dev->netdev_ops->ndo_xdp_xmit)
> +			goto find_next;
> +
> +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> +		if (unlikely(err))
> +			goto find_next;
> +
> +		if (!first) {
> +			nxdpf = xdpf_clone(xdpf);
> +			if (unlikely(!nxdpf))
> +				return -ENOMEM;
> +
> +			bq_enqueue(dev, nxdpf, dev_rx);
> +		} else {
> +			bq_enqueue(dev, xdpf, dev_rx);

This looks racy.  You enqueue the original frame, and then later
xdpf_clone it.  The original frame might have been freed at that point.

> +			first = false;
> +		}
> +
> +find_next:
> +		err = devmap_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +		key = next_key;
> +	}
> +
> +	/* didn't find anywhere to forward to, free buf */
> +	if (first)
> +		xdp_return_frame_rx_napi(xdpf);
> +
> +	return 0;
> +}
> +


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-06-10 10:21     ` Jesper Dangaard Brouer
@ 2020-06-10 10:29       ` Toke Høiland-Jørgensen
  2020-06-16  9:04         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-06-10 10:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, brouer

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Tue, 26 May 2020 22:05:38 +0800
> Hangbin Liu <liuhangbin@gmail.com> wrote:
>
>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> index a51d9fb7a359..ecc5c44a5bab 100644
>> --- a/kernel/bpf/devmap.c
>> +++ b/kernel/bpf/devmap.c
> [...]
>
>> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
>> +			  struct bpf_map *map, struct bpf_map *ex_map,
>> +			  bool exclude_ingress)
>> +{
>> +	struct bpf_dtab_netdev *obj = NULL;
>> +	struct xdp_frame *xdpf, *nxdpf;
>> +	struct net_device *dev;
>> +	bool first = true;
>> +	u32 key, next_key;
>> +	int err;
>> +
>> +	devmap_get_next_key(map, NULL, &key);
>> +
>> +	xdpf = convert_to_xdp_frame(xdp);
>> +	if (unlikely(!xdpf))
>> +		return -EOVERFLOW;
>> +
>> +	for (;;) {
>> +		switch (map->map_type) {
>> +		case BPF_MAP_TYPE_DEVMAP:
>> +			obj = __dev_map_lookup_elem(map, key);
>> +			break;
>> +		case BPF_MAP_TYPE_DEVMAP_HASH:
>> +			obj = __dev_map_hash_lookup_elem(map, key);
>> +			break;
>> +		default:
>> +			break;
>> +		}
>> +
>> +		if (!obj || dev_in_exclude_map(obj, ex_map,
>> +					       exclude_ingress ? dev_rx->ifindex : 0))
>> +			goto find_next;
>> +
>> +		dev = obj->dev;
>> +
>> +		if (!dev->netdev_ops->ndo_xdp_xmit)
>> +			goto find_next;
>> +
>> +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
>> +		if (unlikely(err))
>> +			goto find_next;
>> +
>> +		if (!first) {
>> +			nxdpf = xdpf_clone(xdpf);
>> +			if (unlikely(!nxdpf))
>> +				return -ENOMEM;
>> +
>> +			bq_enqueue(dev, nxdpf, dev_rx);
>> +		} else {
>> +			bq_enqueue(dev, xdpf, dev_rx);
>
> This looks racy.  You enqueue the original frame, and then later
> xdpf_clone it.  The original frame might have been freed at that
> point.

This was actually my suggestion; on the assumption that bq_enqueue()
just puts the frame on a list that won't be flushed until we exit the
NAPI loop.

But I guess now that you mention it that bq_enqueue() may flush the
queue, so you're right that this won't work. Sorry about that, Hangbin :/

Jesper, the reason I suggested this was to avoid an "extra" copy (i.e.,
if we have two destinations, ideally we should only clone once instead
of twice). Got any clever ideas for a safe way to achieve this? :)

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-06-10 10:18     ` Jesper Dangaard Brouer
@ 2020-06-12  8:54       ` Hangbin Liu
  2020-06-16  8:55         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-12  8:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi

On Wed, Jun 10, 2020 at 12:18:59PM +0200, Jesper Dangaard Brouer wrote:
> On Tue, 26 May 2020 22:05:38 +0800
> Hangbin Liu <liuhangbin@gmail.com> wrote:
> 
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 90f44f382115..acdc63833b1f 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -475,3 +475,29 @@ void xdp_warn(const char *msg, const char *func, const int line)
> >  	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
> >  };
> >  EXPORT_SYMBOL_GPL(xdp_warn);
> > +
> > +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> > +{
> > +	unsigned int headroom, totalsize;
> > +	struct xdp_frame *nxdpf;
> > +	struct page *page;
> > +	void *addr;
> > +
> > +	headroom = xdpf->headroom + sizeof(*xdpf);
> > +	totalsize = headroom + xdpf->len;
> > +
> > +	if (unlikely(totalsize > PAGE_SIZE))
> > +		return NULL;
> > +	page = dev_alloc_page();
> > +	if (!page)
> > +		return NULL;
> > +	addr = page_to_virt(page);
> > +
> > +	memcpy(addr, xdpf, totalsize);
> 
> I don't think this will work.  You are assuming that the memory model
> (xdp_mem_info) is the same.
> 
> You happened to use i40, that have MEM_TYPE_PAGE_SHARED, and you should
> have changed this to MEM_TYPE_PAGE_ORDER0, but it doesn't crash as they
> are compatible.  If you were using mlx5, I suspect that this would
> result in memory leaking.

Is there anything else I should do except add the following line?
	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
> 
> You also need to update xdpf->frame_sz, as you also cannot assume it is
> the same.

Won't the memcpy() copy xdpf->frame_sz to nxdpf? 

And I didn't see xdpf->frame_sz is set in xdp_convert_zc_to_xdp_frame(),
do we need a fix?

Thanks
Hangbin
> 
> > +
> > +	nxdpf = addr;
> > +	nxdpf->data = addr + headroom;
> > +
> > +	return nxdpf;
> > +}
> > +EXPORT_SYMBOL_GPL(xdpf_clone);
> 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> 
> struct xdp_frame {
> 	void *data;
> 	u16 len;
> 	u16 headroom;
> 	u32 metasize:8;
> 	u32 frame_sz:24;
> 	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> 	 * while mem info is valid on remote CPU.
> 	 */
> 	struct xdp_mem_info mem;
> 	struct net_device *dev_rx; /* used by cpumap */
> };
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-06-12  8:54       ` Hangbin Liu
@ 2020-06-16  8:55         ` Jesper Dangaard Brouer
  2020-06-16 10:11           ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-16  8:55 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi, brouer

On Fri, 12 Jun 2020 16:54:08 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Wed, Jun 10, 2020 at 12:18:59PM +0200, Jesper Dangaard Brouer wrote:
> > On Tue, 26 May 2020 22:05:38 +0800
> > Hangbin Liu <liuhangbin@gmail.com> wrote:
> >   
> > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > index 90f44f382115..acdc63833b1f 100644
> > > --- a/net/core/xdp.c
> > > +++ b/net/core/xdp.c
> > > @@ -475,3 +475,29 @@ void xdp_warn(const char *msg, const char *func, const int line)
> > >  	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
> > >  };
> > >  EXPORT_SYMBOL_GPL(xdp_warn);
> > > +
> > > +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> > > +{
> > > +	unsigned int headroom, totalsize;
> > > +	struct xdp_frame *nxdpf;
> > > +	struct page *page;
> > > +	void *addr;
> > > +
> > > +	headroom = xdpf->headroom + sizeof(*xdpf);
> > > +	totalsize = headroom + xdpf->len;
> > > +
> > > +	if (unlikely(totalsize > PAGE_SIZE))
> > > +		return NULL;
> > > +	page = dev_alloc_page();
> > > +	if (!page)
> > > +		return NULL;
> > > +	addr = page_to_virt(page);
> > > +
> > > +	memcpy(addr, xdpf, totalsize);  
> > 
> > I don't think this will work.  You are assuming that the memory model
> > (xdp_mem_info) is the same.
> > 
> > You happened to use i40, that have MEM_TYPE_PAGE_SHARED, and you should
> > have changed this to MEM_TYPE_PAGE_ORDER0, but it doesn't crash as they
> > are compatible.  If you were using mlx5, I suspect that this would
> > result in memory leaking.  
> 
> Is there anything else I should do except add the following line?
> 	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;

You do realize that you also have copied over the mem.id, right?

And as I wrote below you also need to update frame_sz.

> > 
> > You also need to update xdpf->frame_sz, as you also cannot assume it is
> > the same.  
> 
> Won't the memcpy() copy xdpf->frame_sz to nxdpf? 

You obviously cannot use the frame_sz from the existing frame, as you
just allocated a new page for the new xdp_frame, that have another size
(here PAGE_SIZE).


> And I didn't see xdpf->frame_sz is set in xdp_convert_zc_to_xdp_frame(),
> do we need a fix?

Good catch, that sounds like a bug, that should be fixed.
Will you send a fix?


> > > +
> > > +	nxdpf = addr;
> > > +	nxdpf->data = addr + headroom;
> > > +
> > > +	return nxdpf;
> > > +}
> > > +EXPORT_SYMBOL_GPL(xdpf_clone);  
> > 
> > 
> > struct xdp_frame {
> > 	void *data;
> > 	u16 len;
> > 	u16 headroom;
> > 	u32 metasize:8;
> > 	u32 frame_sz:24;
> > 	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> > 	 * while mem info is valid on remote CPU.
> > 	 */
> > 	struct xdp_mem_info mem;
> > 	struct net_device *dev_rx; /* used by cpumap */
> > };
> >   
> 

struct xdp_mem_info {
	u32                        type;                 /*     0     4 */
	u32                        id;                   /*     4     4 */

	/* size: 8, cachelines: 1, members: 2 */
	/* last cacheline: 8 bytes */
};

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-06-10 10:29       ` Toke Høiland-Jørgensen
@ 2020-06-16  9:04         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-16  9:04 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Hangbin Liu, bpf, netdev, Jiri Benc, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, brouer


On Wed, 10 Jun 2020 12:29:35 +0200
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> 
> > On Tue, 26 May 2020 22:05:38 +0800
> > Hangbin Liu <liuhangbin@gmail.com> wrote:
> >  
> >> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> >> index a51d9fb7a359..ecc5c44a5bab 100644
> >> --- a/kernel/bpf/devmap.c
> >> +++ b/kernel/bpf/devmap.c  
> > [...]
> >  
> >> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> >> +			  struct bpf_map *map, struct bpf_map *ex_map,
> >> +			  bool exclude_ingress)
> >> +{
[...]
> >> +		if (!first) {
> >> +			nxdpf = xdpf_clone(xdpf);
> >> +			if (unlikely(!nxdpf))
> >> +				return -ENOMEM;
> >> +
> >> +			bq_enqueue(dev, nxdpf, dev_rx);
> >> +		} else {
> >> +			bq_enqueue(dev, xdpf, dev_rx);  
> >
> > This looks racy.  You enqueue the original frame, and then later
> > xdpf_clone it.  The original frame might have been freed at that
> > point.  
> 
> This was actually my suggestion; on the assumption that bq_enqueue()
> just puts the frame on a list that won't be flushed until we exit the
> NAPI loop.
> 
> But I guess now that you mention it that bq_enqueue() may flush the
> queue, so you're right that this won't work. Sorry about that, Hangbin :/
> 
> Jesper, the reason I suggested this was to avoid an "extra" copy (i.e.,
> if we have two destinations, ideally we should only clone once instead
> of twice). Got any clever ideas for a safe way to achieve this? :)

Maybe you/we could avoid the clone on the last destination?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-05-27 15:04       ` Toke Høiland-Jørgensen
@ 2020-06-16  9:09         ` Jesper Dangaard Brouer
  2020-06-16  9:47           ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-16  9:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Hangbin Liu, bpf, netdev, Jiri Benc, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, brouer

On Wed, 27 May 2020 17:04:50 +0200
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Hangbin Liu <liuhangbin@gmail.com> writes:
> 
> > On Wed, May 27, 2020 at 12:21:54PM +0200, Toke Høiland-Jørgensen wrote:  
> >> > The example in patch 2 is functional, but not a lot of effort
> >> > has been made on performance optimisation. I did a simple test(pkt size 64)
> >> > with pktgen. Here is the test result with BPF_MAP_TYPE_DEVMAP_HASH
> >> > arrays:
> >> >
> >> > bpf_redirect_map() with 1 ingress, 1 egress:
> >> > generic path: ~1600k pps
> >> > native path: ~980k pps
> >> >
> >> > bpf_redirect_map_multi() with 1 ingress, 3 egress:
> >> > generic path: ~600k pps
> >> > native path: ~480k pps
> >> >
> >> > bpf_redirect_map_multi() with 1 ingress, 9 egress:
> >> > generic path: ~125k pps
> >> > native path: ~100k pps
> >> >
> >> > The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
> >> > the arrays and do clone skb/xdpf. The native path is slower than generic
> >> > path as we send skbs by pktgen. So the result looks reasonable.  
> >> 
> >> How are you running these tests? Still on virtual devices? We really  
> >
> > I run it with the test topology in patch 2/2. The test is run on physical
> > machines, but I use veth interface. Do you mean use a physical NIC driver
> > for testing?  
> 
> Yes, sorry, when I said 'physical machine' I should have also 'physical
> NIC'. We really need to know how the performance of this is on the XDP
> fast path, i.e., when there are no skbs involved at all.
> 
> > BTW, when using pktgen, I got an panic because the skb don't have enough
> > header room. The code path looks like
> >
> > do_xdp_generic()
> >   - netif_receive_generic_xdp()
> >     - skb_headroom(skb) < XDP_PACKET_HEADROOM
> >       - pskb_expand_head()
> >         - BUG_ON(skb_shared(skb))
> >
> > So I added a draft patch for pktgen, not sure if it has any influence.  
> 
> Hmm, as Jesper said pktgen was really not intended to be used this way,
> so I guess that's why. I guess I'll let him comment on whether he thinks
> it's worth fixing; or you could send this as a proper patch and see if
> anyone complains about it ;)

Don't use pktgen in this way with veth.  If anything pktgen should
detect that you use pktgen in virtual interfaces and reject/disallow
that you do this.

 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support
  2020-06-16  9:09         ` Jesper Dangaard Brouer
@ 2020-06-16  9:47           ` Hangbin Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-06-16  9:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi

On Tue, Jun 16, 2020 at 11:09:22AM +0200, Jesper Dangaard Brouer wrote:
> > > BTW, when using pktgen, I got an panic because the skb don't have enough
> > > header room. The code path looks like
> > >
> > > do_xdp_generic()
> > >   - netif_receive_generic_xdp()
> > >     - skb_headroom(skb) < XDP_PACKET_HEADROOM
> > >       - pskb_expand_head()
> > >         - BUG_ON(skb_shared(skb))
> > >
> > > So I added a draft patch for pktgen, not sure if it has any influence.  
> > 
> > Hmm, as Jesper said pktgen was really not intended to be used this way,
> > so I guess that's why. I guess I'll let him comment on whether he thinks
> > it's worth fixing; or you could send this as a proper patch and see if
> > anyone complains about it ;)
> 
> Don't use pktgen in this way with veth.  If anything pktgen should
> detect that you use pktgen in virtual interfaces and reject/disallow
> that you do this.

OK, got it.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-06-16  8:55         ` Jesper Dangaard Brouer
@ 2020-06-16 10:11           ` Hangbin Liu
  2020-06-16 14:38             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-06-16 10:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi

HI Jesper,

On Tue, Jun 16, 2020 at 10:55:06AM +0200, Jesper Dangaard Brouer wrote:
> > Is there anything else I should do except add the following line?
> > 	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
> 
> You do realize that you also have copied over the mem.id, right?

Thanks for the reminding. To confirm, set mem.id to 0 is enough, right?
> 
> And as I wrote below you also need to update frame_sz.
> 
> > > 
> > > You also need to update xdpf->frame_sz, as you also cannot assume it is
> > > the same.  
> > 
> > Won't the memcpy() copy xdpf->frame_sz to nxdpf? 
> 
> You obviously cannot use the frame_sz from the existing frame, as you
> just allocated a new page for the new xdp_frame, that have another size
> (here PAGE_SIZE).

Thanks, I didn't understand the frame_sz correctly before.
> 
> 
> > And I didn't see xdpf->frame_sz is set in xdp_convert_zc_to_xdp_frame(),
> > do we need a fix?
> 
> Good catch, that sounds like a bug, that should be fixed.
> Will you send a fix?

OK, I will.

> 
> 
> > > > +
> > > > +	nxdpf = addr;
> > > > +	nxdpf->data = addr + headroom;
> > > > +
> > > > +	return nxdpf;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(xdpf_clone);  
> > > 
> > > 
> > > struct xdp_frame {
> > > 	void *data;
> > > 	u16 len;
> > > 	u16 headroom;
> > > 	u32 metasize:8;
> > > 	u32 frame_sz:24;
> > > 	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> > > 	 * while mem info is valid on remote CPU.
> > > 	 */
> > > 	struct xdp_mem_info mem;
> > > 	struct net_device *dev_rx; /* used by cpumap */
> > > };
> > >   
> > 
> 
> struct xdp_mem_info {
> 	u32                        type;                 /*     0     4 */
> 	u32                        id;                   /*     4     4 */
> 
> 	/* size: 8, cachelines: 1, members: 2 */
> 	/* last cacheline: 8 bytes */
> };
> 

Is this a struct reference or you want to remind me something else?

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv4 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-06-16 10:11           ` Hangbin Liu
@ 2020-06-16 14:38             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-16 14:38 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi, brouer

On Tue, 16 Jun 2020 18:11:33 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> HI Jesper,
> 
> On Tue, Jun 16, 2020 at 10:55:06AM +0200, Jesper Dangaard Brouer wrote:
> > > Is there anything else I should do except add the following line?
> > > 	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;  
> > 
> > You do realize that you also have copied over the mem.id, right?  
> 
> Thanks for the reminding. To confirm, set mem.id to 0 is enough, right?

Yes.

> > And as I wrote below you also need to update frame_sz.
> >   
> > > > 
> > > > You also need to update xdpf->frame_sz, as you also cannot assume it is
> > > > the same.    
> > > 
> > > Won't the memcpy() copy xdpf->frame_sz to nxdpf?   
> > 
> > You obviously cannot use the frame_sz from the existing frame, as you
> > just allocated a new page for the new xdp_frame, that have another size
> > (here PAGE_SIZE).  
> 
> Thanks, I didn't understand the frame_sz correctly before.
> > 
> >   
> > > And I didn't see xdpf->frame_sz is set in xdp_convert_zc_to_xdp_frame(),
> > > do we need a fix?  
> > 
> > Good catch, that sounds like a bug, that should be fixed.
> > Will you send a fix?  
> 
> OK, I will.

Thanks.
 
> >   
> > > > > +
> > > > > +	nxdpf = addr;
> > > > > +	nxdpf->data = addr + headroom;
> > > > > +
> > > > > +	return nxdpf;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(xdpf_clone);    
> > > > 
> > > > 
> > > > struct xdp_frame {
> > > > 	void *data;
> > > > 	u16 len;
> > > > 	u16 headroom;
> > > > 	u32 metasize:8;
> > > > 	u32 frame_sz:24;
> > > > 	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> > > > 	 * while mem info is valid on remote CPU.
> > > > 	 */
> > > > 	struct xdp_mem_info mem;
> > > > 	struct net_device *dev_rx; /* used by cpumap */
> > > > };
> > > >     
> > >   
> > 
> > struct xdp_mem_info {
> > 	u32                        type;                 /*     0     4 */
> > 	u32                        id;                   /*     4     4 */
> > 
> > 	/* size: 8, cachelines: 1, members: 2 */
> > 	/* last cacheline: 8 bytes */
> > };
> >   
> 
> Is this a struct reference or you want to remind me something else?

This is just a struct reference to help the readers of this email.
I had to lookup the struct to review this code, so I included it to
save time for other reviewers.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv5 bpf-next 0/3] xdp: add a new helper for dev map multicast support
  2020-05-26 14:05 ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
                     ` (2 preceding siblings ...)
  2020-05-27 10:21   ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Toke Høiland-Jørgensen
@ 2020-07-01  4:19   ` Hangbin Liu
  2020-07-01  4:19     ` [PATCHv5 bpf-next 1/3] " Hangbin Liu
                       ` (3 more replies)
  3 siblings, 4 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-01  4:19 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This patch is for xdp multicast support. which has been discussed before[0],
The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I re-implement a new helper bpf_redirect_map_multi()
to accept two maps, the forwarding map and exclude map. If user
don't want to use exclude map and just want simply stop redirecting back
to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.

The 2nd and 3rd patches are for usage sample and testing purpose, so there
is no effort has been made on performance optimisation. I did same tests
with pktgen(pkt size 64) to compire with xdp_redirect_map(). Here is the
test result(the veth peer has a dummy xdp program with XDP_DROP directly):

Version         | Test                                   | Native | Generic
5.8 rc1         | xdp_redirect_map       i40e->i40e      |  10.0M |   1.9M
5.8 rc1         | xdp_redirect_map       i40e->veth      |  12.7M |   1.6M
5.8 rc1 + patch | xdp_redirect_map       i40e->i40e      |  10.0M |   1.9M
5.8 rc1 + patch | xdp_redirect_map       i40e->veth      |  12.3M |   1.6M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e      |   7.2M |   1.5M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->veth      |   8.5M |   1.3M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e+veth |   3.0M |  0.98M

The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
the arrays and do clone skb/xdpf. The native path is slower than generic
path as we send skbs by pktgen. So the result looks reasonable.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first on.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Hangbin Liu (3):
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: add xdp_redirect_multi test

 include/linux/bpf.h                           |  20 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  25 ++-
 kernel/bpf/devmap.c                           | 154 ++++++++++++++++
 kernel/bpf/verifier.c                         |   6 +
 net/core/filter.c                             | 109 ++++++++++-
 net/core/xdp.c                                |  29 +++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  57 ++++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 166 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  25 ++-
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       |  90 +++++++++
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 164 +++++++++++++++++
 .../selftests/bpf/xdp_redirect_multi.c        | 173 ++++++++++++++++++
 16 files changed, 1019 insertions(+), 8 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv5 bpf-next 1/3] xdp: add a new helper for dev map multicast support
  2020-07-01  4:19   ` [PATCHv5 bpf-next 0/3] xdp: add a new helper for " Hangbin Liu
@ 2020-07-01  4:19     ` Hangbin Liu
  2020-07-01  5:09       ` Andrii Nakryiko
  2020-07-01 18:33       ` kernel test robot
  2020-07-01  4:19     ` [PATCHv5 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-01  4:19 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This patch is for xdp multicast support. In this implementation we
add a new helper to accept two maps: forward map and exclude map.
We will redirect the packet to all the interfaces in *forward map*, but
exclude the interfaces that in *exclude map*.

To achive this I add a new ex_map for struct bpf_redirect_info.
in the helper I set tgt_value to NULL to make a difference with
bpf_xdp_redirect_map()

We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
create a exclude map for each interface and just want to exclude the
ingress interface.

The general data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls to
get better performace.

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first on.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h            |  20 +++++
 include/linux/filter.h         |   1 +
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  25 +++++-
 kernel/bpf/devmap.c            | 154 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |   6 ++
 net/core/filter.c              | 109 +++++++++++++++++++++--
 net/core/xdp.c                 |  29 +++++++
 tools/include/uapi/linux/bpf.h |  25 +++++-
 9 files changed, 363 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3d2ade703a35..c77bc70dba87 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1264,6 +1264,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 bool dev_map_can_have_prog(struct bpf_map *map);
@@ -1406,6 +1411,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	return false;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 259377723603..cf5b5b1d9ae5 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -612,6 +612,7 @@ struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	struct bpf_map *ex_map;
 	u32 kern_flags;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 609f819ed08b..deb6c104e698 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -110,6 +110,7 @@ void xdp_warn(const char *msg, const char *func, const int line);
 #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 static inline
 void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0cb8ec948816..d7de6c0b32e4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3285,6 +3285,23 @@ union bpf_attr {
  *		Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
  *	Return
  *		*sk* if casting is valid, or NULL otherwise.
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		This is a multicast implementation for XDP redirect. It will
+ * 		redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map*.
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map() as a unicast implementation,
+ * 		which supports redirecting packet to a specific ifindex
+ * 		in the map. As both helpers use struct bpf_redirect_info
+ * 		to store the redirect info, we will use a a NULL tgt_value
+ * 		to distinguish multicast and unicast redirecting.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3427,7 +3444,8 @@ union bpf_attr {
 	FN(skc_to_tcp_sock),		\
 	FN(skc_to_tcp_timewait_sock),	\
 	FN(skc_to_tcp_request_sock),	\
-	FN(skc_to_udp6_sock),
+	FN(skc_to_udp6_sock),		\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3588,6 +3606,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 58acc46861ef..8a45fc9e2ccb 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -510,6 +510,160 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct call in fast path instead of map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	struct bpf_dtab_netdev *ex_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	if (obj->dev->ifindex == exclude_ifindex)
+		return true;
+
+	if (!map)
+		return false;
+
+	err = devmap_get_next_key(map, NULL, &key);
+	if (err)
+		return false;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			ex_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			ex_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (ex_obj && ex_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+struct bpf_dtab_netdev *devmap_get_next_obj(struct xdp_buff *xdp, struct bpf_map *map,
+					    struct bpf_map *ex_map, u32 *key,
+					    u32 *next_key, int ex_ifindex)
+{
+	struct bpf_dtab_netdev *obj;
+	struct net_device *dev;
+	u32 *tmp_key = key;
+	int err;
+
+	err = devmap_get_next_key(map, tmp_key, next_key);
+	if (err)
+		return NULL;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(map, *next_key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(map, *next_key);
+			break;
+		default:
+			break;
+		}
+
+		if (!obj || dev_in_exclude_map(obj, ex_map, ex_ifindex))
+			goto find_next;
+
+		dev = obj->dev;
+
+		if (!dev->netdev_ops->ndo_xdp_xmit)
+			goto find_next;
+
+		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+		if (unlikely(err))
+			goto find_next;
+
+		return obj;
+
+find_next:
+		tmp_key = next_key;
+		err = devmap_get_next_key(map, tmp_key, next_key);
+		if (err)
+			break;
+	}
+
+	return NULL;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags)
+{
+	struct bpf_dtab_netdev *obj = NULL, *next_obj = NULL;
+	struct xdp_frame *xdpf, *nxdpf;
+	bool last_one = false;
+	int ex_ifindex;
+	u32 key, next_key;
+
+	ex_ifindex = flags & BPF_F_EXCLUDE_INGRESS ? dev_rx->ifindex : 0;
+
+	/* Find first available obj */
+	obj = devmap_get_next_obj(xdp, map, ex_map, NULL, &key, ex_ifindex);
+	if (!obj)
+		return 0;
+
+	xdpf = xdp_convert_buff_to_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		/* Check if we still have one more available obj */
+		next_obj = devmap_get_next_obj(xdp, map, ex_map, &key,
+					       &next_key, ex_ifindex);
+		if (!next_obj)
+			last_one = true;
+
+		if (last_one) {
+			bq_enqueue(obj->dev, xdpf, dev_rx);
+			return 0;
+		}
+
+		nxdpf = xdpf_clone(xdpf);
+		if (unlikely(!nxdpf)) {
+			xdp_return_frame_rx_napi(xdpf);
+			return -ENOMEM;
+		}
+
+		bq_enqueue(obj->dev, nxdpf, dev_rx);
+
+		/* Deal with next obj */
+		obj = next_obj;
+		key = next_key;
+	}
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7de98906ddf4..8302b68ef953 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4110,6 +4110,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_redirect_map_multi &&
 		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
@@ -4202,6 +4203,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map_multi:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
+		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+			goto error;
+		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
 	case BPF_FUNC_sock_map_update:
diff --git a/net/core/filter.c b/net/core/filter.c
index c796e141ea8e..37df1ea747ae 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3515,12 +3515,19 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    struct bpf_map *ex_map, u32 flags)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
+		/* We use a NULL fwd value to distinguish multicast
+		 * and unicast forwarding
+		 */
+		if (fwd)
+			return dev_map_enqueue(fwd, xdp, dev_rx);
+		else
+			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map, flags);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3577,12 +3584,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct bpf_map *map = READ_ONCE(ri->map);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (unlikely(!map)) {
@@ -3594,7 +3603,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, ri->flags);
 	}
 
 	if (unlikely(err))
@@ -3608,6 +3617,55 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, struct bpf_map *ex_map,
+				  u32 flags)
+
+{
+	struct bpf_dtab_netdev *dst;
+	struct sk_buff *nskb;
+	bool exclude_ingress;
+	u32 key, next_key;
+	void *fwd;
+	int err;
+
+	/* Get first key from forward map */
+	err = map->ops->map_get_next_key(map, NULL, &key);
+	if (err)
+		return err;
+
+	exclude_ingress = !!(flags & BPF_F_EXCLUDE_INGRESS);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (dev_in_exclude_map(dst, ex_map,
+					       exclude_ingress ? dev->ifindex : 0))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -ENOMEM;
+
+			/* Try forword next one no mater the current forward
+			 * succeed or not */
+			dev_map_generic_redirect(dst, nskb, xdp_prog);
+		}
+
+find_next:
+		err = map->ops->map_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	consume_skb(skb);
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3615,19 +3673,30 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err = 0;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
 	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		struct bpf_dtab_netdev *dst = fwd;
+		/* We use a NULL fwd value to distinguish multicast
+		 * and unicast forwarding
+		 */
+		if (fwd) {
+			struct bpf_dtab_netdev *dst = fwd;
+
+			err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		} else {
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ex_map, ri->flags);
+		}
 
-		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
@@ -3741,6 +3810,34 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
+	   struct bpf_map *, ex_map, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
+		return XDP_ABORTED;
+
+	ri->tgt_index = 0;
+	/* Set the tgt_value to NULL to distinguish with bpf_xdp_redirect_map */
+	ri->tgt_value = NULL;
+	ri->flags = flags;
+	ri->ex_map = ex_map;
+
+	WRITE_ONCE(ri->map, map);
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
+	.func           = bpf_xdp_redirect_map_multi,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -6464,6 +6561,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_redirect_map_multi:
+		return &bpf_xdp_redirect_map_multi_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 90f44f382115..7e291f1015d7 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -475,3 +475,32 @@ void xdp_warn(const char *msg, const char *func, const int line)
 	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
 };
 EXPORT_SYMBOL_GPL(xdp_warn);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+	nxdpf->frame_sz = PAGE_SIZE;
+	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
+	nxdpf->mem.id = 0;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0cb8ec948816..d7de6c0b32e4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3285,6 +3285,23 @@ union bpf_attr {
  *		Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
  *	Return
  *		*sk* if casting is valid, or NULL otherwise.
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		This is a multicast implementation for XDP redirect. It will
+ * 		redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map*.
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map() as a unicast implementation,
+ * 		which supports redirecting packet to a specific ifindex
+ * 		in the map. As both helpers use struct bpf_redirect_info
+ * 		to store the redirect info, we will use a a NULL tgt_value
+ * 		to distinguish multicast and unicast redirecting.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3427,7 +3444,8 @@ union bpf_attr {
 	FN(skc_to_tcp_sock),		\
 	FN(skc_to_tcp_timewait_sock),	\
 	FN(skc_to_tcp_request_sock),	\
-	FN(skc_to_udp6_sock),
+	FN(skc_to_udp6_sock),		\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3588,6 +3606,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv5 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test
  2020-07-01  4:19   ` [PATCHv5 bpf-next 0/3] xdp: add a new helper for " Hangbin Liu
  2020-07-01  4:19     ` [PATCHv5 bpf-next 1/3] " Hangbin Liu
@ 2020-07-01  4:19     ` Hangbin Liu
  2020-07-01  4:19     ` [PATCHv5 bpf-next 3/3] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
  2020-07-09  1:30     ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Hangbin Liu
  3 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-01  4:19 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a sample for xdp multicast. In the sample we could forward all
packets between given interfaces.

v5: add a null_map as we have strict the arg2 to ARG_CONST_MAP_PTR.
    Move the testing part to bpf selftest in next patch.
v4: no update.
v3: add rxcnt map to show the packet transmit speed.
v2: no update.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c |  57 ++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 166 ++++++++++++++++++++++
 3 files changed, 226 insertions(+)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 8403e4762306..000709bb89c3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multi_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
new file mode 100644
index 000000000000..cc7ebaedf55a
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_kern.c
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct bpf_map_def SEC("maps") forward_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 256,
+};
+
+struct bpf_map_def SEC("maps") null_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 1,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	long *value;
+	u32 key = 0;
+
+	/* count packet in global counter */
+	value = bpf_map_lookup_elem(&rxcnt, &key);
+	if (value)
+		*value += 1;
+
+	return bpf_redirect_map_multi(&forward_map, &null_map,
+				      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_dummy")
+int xdp_pass(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
new file mode 100644
index 000000000000..49f44c91b672
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int ifaces[MAX_IFACE_NUM] = {};
+static int rxcnt;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_cpus];
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		__u64 sum = 0;
+		__u32 key = 0;
+		int i;
+
+		sleep(interval);
+		assert(bpf_map_lookup_elem(rxcnt, &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += (values[i] - prev[i]);
+		if (sum)
+			printf("Forwarding %10llu pkt/s\n", sum / interval);
+		memcpy(prev, values, sizeof(values));
+	}
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int prog_fd, forward_map;
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	forward_map = bpf_object__find_map_fd_by_name(obj, "forward_map");
+	rxcnt = bpf_object__find_map_fd_by_name(obj, "rxcnt");
+
+	if (forward_map < 0 || rxcnt < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(forward_map, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+	}
+
+	poll_stats(2);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv5 bpf-next 3/3] selftests/bpf: add xdp_redirect_multi test
  2020-07-01  4:19   ` [PATCHv5 bpf-next 0/3] xdp: add a new helper for " Hangbin Liu
  2020-07-01  4:19     ` [PATCHv5 bpf-next 1/3] " Hangbin Liu
  2020-07-01  4:19     ` [PATCHv5 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
@ 2020-07-01  4:19     ` Hangbin Liu
  2020-07-09  1:30     ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Hangbin Liu
  3 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-01  4:19 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

Add a bpf selftest for new helper xdp_redirect_map_multi(). In this
test we have 3 forward groups groups and 1 exclude group. The test will
redirect each interface's packets to all the interfaces in the forward
group, and exclude the interface in exclude map. We will also test both
DEVMAP and DEVMAP_HASH with xdp generic and drv.

For more test details, you can find it in the test script. Here is
the test result.
]# ./test_xdp_redirect_multi.sh
Pass: xdpgeneric arp ns1-2
Pass: xdpgeneric arp ns1-3
Pass: xdpgeneric arp ns1-4
Pass: xdpgeneric ping ns1-2
Pass: xdpgeneric ping ns1-3
Pass: xdpgeneric ping ns1-4
Pass: xdpgeneric ping6 ns2-1
Pass: xdpgeneric ping6 ns2-3
Pass: xdpgeneric ping6 ns2-4
Pass: xdpdrv arp ns1-2
Pass: xdpdrv arp ns1-3
Pass: xdpdrv arp ns1-4
Pass: xdpdrv ping ns1-2
Pass: xdpdrv ping ns1-3
Pass: xdpdrv ping ns1-4
Pass: xdpdrv ping6 ns2-1
Pass: xdpdrv ping6 ns2-3
Pass: xdpdrv ping6 ns2-4
Summary: PASS 18, FAIL 0

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       |  90 +++++++++
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 164 +++++++++++++++++
 .../selftests/bpf/xdp_redirect_multi.c        | 173 ++++++++++++++++++
 4 files changed, 430 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 1f9c696b3edf..66b857210814 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -51,6 +51,7 @@ TEST_FILES = test_lwt_ip_encap.o \
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
 	test_xdp_redirect.sh \
+	test_xdp_redirect_multi.sh \
 	test_xdp_meta.sh \
 	test_xdp_veth.sh \
 	test_offload.py \
@@ -79,7 +80,8 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
-	test_lirc_mode2_user xdping test_cpp runqslower bench
+	test_lirc_mode2_user xdping test_cpp runqslower bench \
+	xdp_redirect_multi
 
 TEST_CUSTOM_PROGS = urandom_read
 
diff --git a/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c b/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
new file mode 100644
index 000000000000..70b8476b9df3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <string.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct bpf_map_def SEC("maps") forward_map_v4 = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 4096,
+};
+
+struct bpf_map_def SEC("maps") forward_map_v6 = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_all = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") exclude_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") null_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	__u16 h_proto;
+	__u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == bpf_htons(ETH_P_IP))
+		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else if (h_proto == bpf_htons(ETH_P_IPV6))
+		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else
+		return bpf_redirect_map_multi(&forward_map_all, &null_map,
+					      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_dummy")
+int xdp_pass(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh b/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
new file mode 100755
index 000000000000..f4f8f751854e
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
@@ -0,0 +1,164 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  ... init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |  ...
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Forward multicast groups:
+#     Forward group all has interfaces: veth1, veth2, veth3, veth4, ... (All traffic except IPv4, IPv6)
+#     Forward group v4 has interfaces: veth1, veth3, veth4, ... (For IPv4 traffic only)
+#     Forward group v6 has interfaces: veth2, veth3, veth4, ... (For IPv6 traffic only)
+# Exclude Groups:
+#     Exclude group: veth3 (assume ns3 is in black list)
+#
+# Test modules:
+# XDP modes: generic, native
+# map types: group v4 use DEVMAP, others use DEVMAP_HASH
+#
+# Test cases:
+#     ARP(we didn't block ARP for ns3):
+#        ns1 -> gw: ns2, ns3, ns4 should receive the arp request
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (fail), ns1 -> ns4 (pass)
+#     IPv6
+#        ns2 -> ns1 (fail), ns2 -> ns3 (fail), ns2 -> ns4 (pass)
+#
+
+
+# netns numbers
+NUM=4
+IFACES=""
+DRV_MODE="xdpgeneric xdpdrv"
+PASS=0
+FAIL=0
+
+test_pass()
+{
+	echo "Pass: $@"
+	PASS=$((PASS + 1))
+}
+
+test_fail()
+{
+	echo "fail: $@"
+	FAIL=$((FAIL + 1))
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip link del veth$i 2> /dev/null
+		ip netns del ns$i 2> /dev/null
+	done
+	rm -f xdp_redirect_*.log arp_ns*.log
+}
+
+# Kselftest framework requirement - SKIP code is 4.
+check_env()
+{
+	ip link set dev lo xdpgeneric off &>/dev/null
+	if [ $? -ne 0 ];then
+		echo "selftests: [SKIP] Could not run test without the ip xdpgeneric support"
+		exit 4
+	fi
+
+	which tcpdump &>/dev/null
+	if [ $? -ne 0 ];then
+		echo "selftests: [SKIP] Could not run test without tcpdump"
+		exit 4
+	fi
+}
+
+setup_ns()
+{
+	local mode=$1
+	IFACES=""
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth$i type veth peer name veth0 netns ns$i
+		ip link set veth$i up
+		ip -n ns$i link set veth0 up
+
+		ip -n ns$i addr add 192.0.2.$i/24 dev veth0
+		ip -n ns$i addr add 2001:db8::$i/64 dev veth0
+		ip -n ns$i link set veth0 $mode obj \
+			xdp_redirect_multi_kern.o sec xdp_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_ping_tests()
+{
+	local mode=$1
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${mode}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${mode}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-2_${mode}.log && \
+		test_pass "$mode arp ns1-2" || test_fail "$mode arp ns1-2"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-3_${mode}.log && \
+		test_pass "$mode arp ns1-3" || test_fail "$mode arp ns1-3"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-4_${mode}.log && \
+		test_pass "$mode arp ns1-4" || test_fail "$mode arp ns1-4"
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$mode ping ns1-2" || test_pass "$mode ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_fail "$mode ping ns1-3" || test_pass "$mode ping ns1-3"
+	ip netns exec ns1 ping 192.0.2.4 -c 4 &> /dev/null && \
+		test_pass "$mode ping ns1-4" || test_fail "$mode ping ns1-4"
+
+	# ping6 test
+	ip netns exec ns2 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$mode ping6 ns2-1" || test_pass "$mode ping6 ns2-1"
+	ip netns exec ns2 ping6 2001:db8::3 -c 4 &> /dev/null && \
+		test_fail "$mode ping6 ns2-3" || test_pass "$mode ping6 ns2-3"
+	ip netns exec ns2 ping6 2001:db8::4 -c 4 &> /dev/null && \
+		test_pass "$mode ping6 ns2-4" || test_fail "$mode ping6 ns2-4"
+}
+
+do_tests()
+{
+	local mode=$1
+	local drv_p
+
+	[ ${mode} == "xdpdrv" ] && drv_p="-N" || drv_p="-S"
+
+	# run `ulimit -l unlimited` if you got errors like
+	# libbpf: Error in bpf_object__probe_global_data():Operation not permitted(1).
+	./xdp_redirect_multi $drv_p $IFACES &> xdp_redirect_${mode}.log &
+	xdp_pid=$!
+	sleep 10
+
+	do_ping_tests $mode
+
+	kill $xdp_pid
+}
+
+trap clean_up 0 2 3 6 9
+
+check_env
+
+for mode in ${DRV_MODE}; do
+	setup_ns $mode
+	do_tests $mode
+	sleep 10
+	clean_up
+	sleep 5
+done
+
+echo "Summary: PASS $PASS, FAIL $FAIL"
+[ $FAIL -eq 0 ] && exit 0 || exit 1
diff --git a/tools/testing/selftests/bpf/xdp_redirect_multi.c b/tools/testing/selftests/bpf/xdp_redirect_multi.c
new file mode 100644
index 000000000000..5626005cb679
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_redirect_multi.c
@@ -0,0 +1,173 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int ifaces[MAX_IFACE_NUM] = {};
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int prog_fd, group_all, group_v4, group_v6, exclude;
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
+	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
+	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
+	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
+
+	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* For testing: remove the 1st interfaces from group v6 */
+		if (i != 0) {
+			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: remove the 2nd interfaces from group v4 */
+		if (i != 1) {
+			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: add the 3rd interfaces to exclude map */
+		if (i == 2) {
+			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+	}
+
+	/* sleep some time for testing */
+	sleep(999);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv5 bpf-next 1/3] xdp: add a new helper for dev map multicast support
  2020-07-01  4:19     ` [PATCHv5 bpf-next 1/3] " Hangbin Liu
@ 2020-07-01  5:09       ` Andrii Nakryiko
  2020-07-01  6:51         ` Hangbin Liu
  2020-07-01 18:33       ` kernel test robot
  1 sibling, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2020-07-01  5:09 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, Networking, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov,
	Daniel Borkmann, Lorenzo Bianconi

On Tue, Jun 30, 2020 at 9:21 PM Hangbin Liu <liuhangbin@gmail.com> wrote:
>
> This patch is for xdp multicast support. In this implementation we
> add a new helper to accept two maps: forward map and exclude map.
> We will redirect the packet to all the interfaces in *forward map*, but
> exclude the interfaces that in *exclude map*.
>
> To achive this I add a new ex_map for struct bpf_redirect_info.
> in the helper I set tgt_value to NULL to make a difference with
> bpf_xdp_redirect_map()
>
> We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
> create a exclude map for each interface and just want to exclude the
> ingress interface.
>
> The general data path is kept in net/core/filter.c. The native data
> path is in kernel/bpf/devmap.c so we can use direct calls to
> get better performace.
>
> v5:
> a) Check devmap_get_next_key() return value.
> b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
> c) In function dev_map_enqueue_multi(), consume xdpf for the last
>    obj instead of the first on.
> d) Update helper description and code comments to explain that we
>    use NULL target value to distinguish multicast and unicast
>    forwarding.
> e) Update memory model, memory id and frame_sz in xdpf_clone().
>
> v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo
>
> v3: Based on Toke's suggestion, do the following update
> a) Update bpf_redirect_map_multi() description in bpf.h.
> b) Fix exclude_ifindex checking order in dev_in_exclude_map().
> c) Fix one more xdpf clone in dev_map_enqueue_multi().
> d) Go find next one in dev_map_enqueue_multi() if the interface is not
>    able to forward instead of abort the whole loop.
> e) Remove READ_ONCE/WRITE_ONCE for ex_map.
>
> v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
> include/exclude maps directly.
>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/bpf.h            |  20 +++++
>  include/linux/filter.h         |   1 +
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  25 +++++-
>  kernel/bpf/devmap.c            | 154 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |   6 ++
>  net/core/filter.c              | 109 +++++++++++++++++++++--
>  net/core/xdp.c                 |  29 +++++++
>  tools/include/uapi/linux/bpf.h |  25 +++++-
>  9 files changed, 363 insertions(+), 7 deletions(-)
>

[...]

> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 0cb8ec948816..d7de6c0b32e4 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -3285,6 +3285,23 @@ union bpf_attr {
>   *             Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
>   *     Return
>   *             *sk* if casting is valid, or NULL otherwise.
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)

We've recently converted all return types for helpers from int to
long, please update accordingly. Thanks.

[...]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv5 bpf-next 1/3] xdp: add a new helper for dev map multicast support
  2020-07-01  5:09       ` Andrii Nakryiko
@ 2020-07-01  6:51         ` Hangbin Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-01  6:51 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov,
	Daniel Borkmann, Lorenzo Bianconi

On Tue, Jun 30, 2020 at 10:09:39PM -0700, Andrii Nakryiko wrote:
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 0cb8ec948816..d7de6c0b32e4 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -3285,6 +3285,23 @@ union bpf_attr {
> >   *             Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
> >   *     Return
> >   *             *sk* if casting is valid, or NULL otherwise.
> > + *
> > + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> 
> We've recently converted all return types for helpers from int to
> long, please update accordingly. Thanks.
> 

Thanks, I will fix it.

- Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv5 bpf-next 1/3] xdp: add a new helper for dev map multicast support
  2020-07-01  4:19     ` [PATCHv5 bpf-next 1/3] " Hangbin Liu
  2020-07-01  5:09       ` Andrii Nakryiko
@ 2020-07-01 18:33       ` kernel test robot
  1 sibling, 0 replies; 72+ messages in thread
From: kernel test robot @ 2020-07-01 18:33 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: kbuild-all, clang-built-linux, netdev,
	Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu


[-- Attachment #1: Type: text/plain, Size: 3059 bytes --]

Hi Hangbin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Hangbin-Liu/xdp-add-a-new-helper-for-dev-map-multicast-support/20200701-122334
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm-randconfig-r013-20200701 (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project c8f1d442d0858f66fd4128fde6f67eb5202fa2b1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/bpf/devmap.c:571:25: warning: no previous prototype for function 'devmap_get_next_obj' [-Wmissing-prototypes]
   struct bpf_dtab_netdev *devmap_get_next_obj(struct xdp_buff *xdp, struct bpf_map *map,
                           ^
   kernel/bpf/devmap.c:571:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   struct bpf_dtab_netdev *devmap_get_next_obj(struct xdp_buff *xdp, struct bpf_map *map,
   ^
   static 
   1 warning generated.

vim +/devmap_get_next_obj +571 kernel/bpf/devmap.c

   570	
 > 571	struct bpf_dtab_netdev *devmap_get_next_obj(struct xdp_buff *xdp, struct bpf_map *map,
   572						    struct bpf_map *ex_map, u32 *key,
   573						    u32 *next_key, int ex_ifindex)
   574	{
   575		struct bpf_dtab_netdev *obj;
   576		struct net_device *dev;
   577		u32 *tmp_key = key;
   578		int err;
   579	
   580		err = devmap_get_next_key(map, tmp_key, next_key);
   581		if (err)
   582			return NULL;
   583	
   584		for (;;) {
   585			switch (map->map_type) {
   586			case BPF_MAP_TYPE_DEVMAP:
   587				obj = __dev_map_lookup_elem(map, *next_key);
   588				break;
   589			case BPF_MAP_TYPE_DEVMAP_HASH:
   590				obj = __dev_map_hash_lookup_elem(map, *next_key);
   591				break;
   592			default:
   593				break;
   594			}
   595	
   596			if (!obj || dev_in_exclude_map(obj, ex_map, ex_ifindex))
   597				goto find_next;
   598	
   599			dev = obj->dev;
   600	
   601			if (!dev->netdev_ops->ndo_xdp_xmit)
   602				goto find_next;
   603	
   604			err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
   605			if (unlikely(err))
   606				goto find_next;
   607	
   608			return obj;
   609	
   610	find_next:
   611			tmp_key = next_key;
   612			err = devmap_get_next_key(map, tmp_key, next_key);
   613			if (err)
   614				break;
   615		}
   616	
   617		return NULL;
   618	}
   619	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28561 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support
  2020-07-01  4:19   ` [PATCHv5 bpf-next 0/3] xdp: add a new helper for " Hangbin Liu
                       ` (2 preceding siblings ...)
  2020-07-01  4:19     ` [PATCHv5 bpf-next 3/3] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
@ 2020-07-09  1:30     ` Hangbin Liu
  2020-07-09  1:30       ` [PATCHv6 bpf-next 1/3] " Hangbin Liu
                         ` (3 more replies)
  3 siblings, 4 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-09  1:30 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This patch is for xdp multicast support. which has been discussed before[0],
The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I re-implement a new helper bpf_redirect_map_multi()
to accept two maps, the forwarding map and exclude map. If user
don't want to use exclude map and just want simply stop redirecting back
to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.

The 2nd and 3rd patches are for usage sample and testing purpose, so there
is no effort has been made on performance optimisation. I did same tests
with pktgen(pkt size 64) to compire with xdp_redirect_map(). Here is the
test result(the veth peer has a dummy xdp program with XDP_DROP directly):

Version         | Test                                   | Native | Generic
5.8 rc1         | xdp_redirect_map       i40e->i40e      |  10.0M |   1.9M
5.8 rc1         | xdp_redirect_map       i40e->veth      |  12.7M |   1.6M
5.8 rc1 + patch | xdp_redirect_map       i40e->i40e      |  10.0M |   1.9M
5.8 rc1 + patch | xdp_redirect_map       i40e->veth      |  12.3M |   1.6M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e      |   7.2M |   1.5M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->veth      |   8.5M |   1.3M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e+veth |   3.0M |  0.98M

The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
the arrays and do clone skb/xdpf. The native path is slower than generic
path as we send skbs by pktgen. So the result looks reasonable.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first on.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Hangbin Liu (3):
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: add xdp_redirect_multi test

 include/linux/bpf.h                           |  20 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  22 +++
 kernel/bpf/devmap.c                           | 154 ++++++++++++++++
 kernel/bpf/verifier.c                         |   6 +
 net/core/filter.c                             | 109 ++++++++++-
 net/core/xdp.c                                |  29 +++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  57 ++++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 166 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  22 +++
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       |  90 +++++++++
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 164 +++++++++++++++++
 .../selftests/bpf/xdp_redirect_multi.c        | 173 ++++++++++++++++++
 16 files changed, 1015 insertions(+), 6 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv6 bpf-next 1/3] xdp: add a new helper for dev map multicast support
  2020-07-09  1:30     ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Hangbin Liu
@ 2020-07-09  1:30       ` Hangbin Liu
  2020-07-09 16:33         ` David Ahern
  2020-07-09  1:30       ` [PATCHv6 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-07-09  1:30 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This patch is for xdp multicast support. In this implementation we
add a new helper to accept two maps: forward map and exclude map.
We will redirect the packet to all the interfaces in *forward map*, but
exclude the interfaces that in *exclude map*.

To achive this I add a new ex_map for struct bpf_redirect_info.
in the helper I set tgt_value to NULL to make a difference with
bpf_xdp_redirect_map()

We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
create a exclude map for each interface and just want to exclude the
ingress interface.

The general data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls to
get better performace.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first on.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h            |  20 +++++
 include/linux/filter.h         |   1 +
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  22 +++++
 kernel/bpf/devmap.c            | 154 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |   6 ++
 net/core/filter.c              | 109 +++++++++++++++++++++--
 net/core/xdp.c                 |  29 +++++++
 tools/include/uapi/linux/bpf.h |  22 +++++
 9 files changed, 359 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0cd7f6884c5c..b48d587b8b3b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1264,6 +1264,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 bool dev_map_can_have_prog(struct bpf_map *map);
@@ -1406,6 +1411,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	return false;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 259377723603..cf5b5b1d9ae5 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -612,6 +612,7 @@ struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	struct bpf_map *ex_map;
 	u32 kern_flags;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 609f819ed08b..deb6c104e698 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -110,6 +110,7 @@ void xdp_warn(const char *msg, const char *func, const int line);
 #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 static inline
 void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 548a749aebb3..a14e41309e73 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3319,6 +3319,22 @@ union bpf_attr {
  *		A non-negative value equal to or less than *size* on success,
  *		or a negative error in case of failure.
  *
+ * long bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		This is a multicast implementation for XDP redirect. It will
+ * 		redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map*.
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map() as a unicast implementation,
+ * 		which supports redirecting packet to a specific ifindex
+ * 		in the map. As both helpers use struct bpf_redirect_info
+ * 		to store the redirect info, we will use a a NULL tgt_value
+ * 		to distinguish multicast and unicast redirecting.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3463,6 +3479,7 @@ union bpf_attr {
 	FN(skc_to_tcp_request_sock),	\
 	FN(skc_to_udp6_sock),		\
 	FN(get_task_stack),		\
+	FN(redirect_map_multi),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -3624,6 +3641,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 10abb06065bb..617a51391971 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -512,6 +512,160 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct call in fast path instead of map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	struct bpf_dtab_netdev *ex_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	if (obj->dev->ifindex == exclude_ifindex)
+		return true;
+
+	if (!map)
+		return false;
+
+	err = devmap_get_next_key(map, NULL, &key);
+	if (err)
+		return false;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			ex_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			ex_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (ex_obj && ex_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+static struct bpf_dtab_netdev *devmap_get_next_obj(struct xdp_buff *xdp, struct bpf_map *map,
+						   struct bpf_map *ex_map, u32 *key,
+						   u32 *next_key, int ex_ifindex)
+{
+	struct bpf_dtab_netdev *obj;
+	struct net_device *dev;
+	u32 *tmp_key = key;
+	int err;
+
+	err = devmap_get_next_key(map, tmp_key, next_key);
+	if (err)
+		return NULL;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(map, *next_key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(map, *next_key);
+			break;
+		default:
+			break;
+		}
+
+		if (!obj || dev_in_exclude_map(obj, ex_map, ex_ifindex))
+			goto find_next;
+
+		dev = obj->dev;
+
+		if (!dev->netdev_ops->ndo_xdp_xmit)
+			goto find_next;
+
+		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+		if (unlikely(err))
+			goto find_next;
+
+		return obj;
+
+find_next:
+		tmp_key = next_key;
+		err = devmap_get_next_key(map, tmp_key, next_key);
+		if (err)
+			break;
+	}
+
+	return NULL;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags)
+{
+	struct bpf_dtab_netdev *obj = NULL, *next_obj = NULL;
+	struct xdp_frame *xdpf, *nxdpf;
+	bool last_one = false;
+	int ex_ifindex;
+	u32 key, next_key;
+
+	ex_ifindex = flags & BPF_F_EXCLUDE_INGRESS ? dev_rx->ifindex : 0;
+
+	/* Find first available obj */
+	obj = devmap_get_next_obj(xdp, map, ex_map, NULL, &key, ex_ifindex);
+	if (!obj)
+		return 0;
+
+	xdpf = xdp_convert_buff_to_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		/* Check if we still have one more available obj */
+		next_obj = devmap_get_next_obj(xdp, map, ex_map, &key,
+					       &next_key, ex_ifindex);
+		if (!next_obj)
+			last_one = true;
+
+		if (last_one) {
+			bq_enqueue(obj->dev, xdpf, dev_rx);
+			return 0;
+		}
+
+		nxdpf = xdpf_clone(xdpf);
+		if (unlikely(!nxdpf)) {
+			xdp_return_frame_rx_napi(xdpf);
+			return -ENOMEM;
+		}
+
+		bq_enqueue(obj->dev, nxdpf, dev_rx);
+
+		/* Deal with next obj */
+		obj = next_obj;
+		key = next_key;
+	}
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b608185e1ffd..ceaf28ec111a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4110,6 +4110,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_redirect_map_multi &&
 		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
@@ -4202,6 +4203,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map_multi:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
+		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+			goto error;
+		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
 	case BPF_FUNC_sock_map_update:
diff --git a/net/core/filter.c b/net/core/filter.c
index ddcc0d6209e1..673d12a051ef 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3515,12 +3515,19 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    struct bpf_map *ex_map, u32 flags)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
+		/* We use a NULL fwd value to distinguish multicast
+		 * and unicast forwarding
+		 */
+		if (fwd)
+			return dev_map_enqueue(fwd, xdp, dev_rx);
+		else
+			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map, flags);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3577,12 +3584,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct bpf_map *map = READ_ONCE(ri->map);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (unlikely(!map)) {
@@ -3594,7 +3603,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, ri->flags);
 	}
 
 	if (unlikely(err))
@@ -3608,6 +3617,55 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, struct bpf_map *ex_map,
+				  u32 flags)
+
+{
+	struct bpf_dtab_netdev *dst;
+	struct sk_buff *nskb;
+	bool exclude_ingress;
+	u32 key, next_key;
+	void *fwd;
+	int err;
+
+	/* Get first key from forward map */
+	err = map->ops->map_get_next_key(map, NULL, &key);
+	if (err)
+		return err;
+
+	exclude_ingress = !!(flags & BPF_F_EXCLUDE_INGRESS);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (dev_in_exclude_map(dst, ex_map,
+					       exclude_ingress ? dev->ifindex : 0))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -ENOMEM;
+
+			/* Try forword next one no mater the current forward
+			 * succeed or not */
+			dev_map_generic_redirect(dst, nskb, xdp_prog);
+		}
+
+find_next:
+		err = map->ops->map_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	consume_skb(skb);
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3615,19 +3673,30 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err = 0;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
 	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		struct bpf_dtab_netdev *dst = fwd;
+		/* We use a NULL fwd value to distinguish multicast
+		 * and unicast forwarding
+		 */
+		if (fwd) {
+			struct bpf_dtab_netdev *dst = fwd;
+
+			err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		} else {
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ex_map, ri->flags);
+		}
 
-		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
@@ -3741,6 +3810,34 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
+	   struct bpf_map *, ex_map, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
+		return XDP_ABORTED;
+
+	ri->tgt_index = 0;
+	/* Set the tgt_value to NULL to distinguish with bpf_xdp_redirect_map */
+	ri->tgt_value = NULL;
+	ri->flags = flags;
+	ri->ex_map = ex_map;
+
+	WRITE_ONCE(ri->map, map);
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
+	.func           = bpf_xdp_redirect_map_multi,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -6464,6 +6561,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_redirect_map_multi:
+		return &bpf_xdp_redirect_map_multi_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 3c45f99e26d5..9b43d0a208a7 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -476,3 +476,32 @@ void xdp_warn(const char *msg, const char *func, const int line)
 	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
 };
 EXPORT_SYMBOL_GPL(xdp_warn);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+	nxdpf->frame_sz = PAGE_SIZE;
+	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
+	nxdpf->mem.id = 0;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 548a749aebb3..a14e41309e73 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3319,6 +3319,22 @@ union bpf_attr {
  *		A non-negative value equal to or less than *size* on success,
  *		or a negative error in case of failure.
  *
+ * long bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		This is a multicast implementation for XDP redirect. It will
+ * 		redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map*.
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map() as a unicast implementation,
+ * 		which supports redirecting packet to a specific ifindex
+ * 		in the map. As both helpers use struct bpf_redirect_info
+ * 		to store the redirect info, we will use a a NULL tgt_value
+ * 		to distinguish multicast and unicast redirecting.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3463,6 +3479,7 @@ union bpf_attr {
 	FN(skc_to_tcp_request_sock),	\
 	FN(skc_to_udp6_sock),		\
 	FN(get_task_stack),		\
+	FN(redirect_map_multi),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -3624,6 +3641,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv6 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test
  2020-07-09  1:30     ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Hangbin Liu
  2020-07-09  1:30       ` [PATCHv6 bpf-next 1/3] " Hangbin Liu
@ 2020-07-09  1:30       ` Hangbin Liu
  2020-07-09 22:40         ` Daniel Borkmann
  2020-07-09  1:30       ` [PATCHv6 bpf-next 3/3] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
  2020-07-09 22:37       ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Daniel Borkmann
  3 siblings, 1 reply; 72+ messages in thread
From: Hangbin Liu @ 2020-07-09  1:30 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a sample for xdp multicast. In the sample we could forward all
packets between given interfaces.

v5: add a null_map as we have strict the arg2 to ARG_CONST_MAP_PTR.
    Move the testing part to bpf selftest in next patch.
v4: no update.
v3: add rxcnt map to show the packet transmit speed.
v2: no update.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c |  57 ++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 166 ++++++++++++++++++++++
 3 files changed, 226 insertions(+)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index f87ee02073ba..fddca6cb76b8 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multi_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
new file mode 100644
index 000000000000..cc7ebaedf55a
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_kern.c
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct bpf_map_def SEC("maps") forward_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 256,
+};
+
+struct bpf_map_def SEC("maps") null_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 1,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	long *value;
+	u32 key = 0;
+
+	/* count packet in global counter */
+	value = bpf_map_lookup_elem(&rxcnt, &key);
+	if (value)
+		*value += 1;
+
+	return bpf_redirect_map_multi(&forward_map, &null_map,
+				      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_dummy")
+int xdp_pass(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
new file mode 100644
index 000000000000..49f44c91b672
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int ifaces[MAX_IFACE_NUM] = {};
+static int rxcnt;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_cpus];
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		__u64 sum = 0;
+		__u32 key = 0;
+		int i;
+
+		sleep(interval);
+		assert(bpf_map_lookup_elem(rxcnt, &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += (values[i] - prev[i]);
+		if (sum)
+			printf("Forwarding %10llu pkt/s\n", sum / interval);
+		memcpy(prev, values, sizeof(values));
+	}
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int prog_fd, forward_map;
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	forward_map = bpf_object__find_map_fd_by_name(obj, "forward_map");
+	rxcnt = bpf_object__find_map_fd_by_name(obj, "rxcnt");
+
+	if (forward_map < 0 || rxcnt < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(forward_map, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+	}
+
+	poll_stats(2);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCHv6 bpf-next 3/3] selftests/bpf: add xdp_redirect_multi test
  2020-07-09  1:30     ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Hangbin Liu
  2020-07-09  1:30       ` [PATCHv6 bpf-next 1/3] " Hangbin Liu
  2020-07-09  1:30       ` [PATCHv6 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
@ 2020-07-09  1:30       ` Hangbin Liu
  2020-07-09 22:37       ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Daniel Borkmann
  3 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-09  1:30 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

Add a bpf selftest for new helper xdp_redirect_map_multi(). In this
test we have 3 forward groups groups and 1 exclude group. The test will
redirect each interface's packets to all the interfaces in the forward
group, and exclude the interface in exclude map. We will also test both
DEVMAP and DEVMAP_HASH with xdp generic and drv.

For more test details, you can find it in the test script. Here is
the test result.
]# ./test_xdp_redirect_multi.sh
Pass: xdpgeneric arp ns1-2
Pass: xdpgeneric arp ns1-3
Pass: xdpgeneric arp ns1-4
Pass: xdpgeneric ping ns1-2
Pass: xdpgeneric ping ns1-3
Pass: xdpgeneric ping ns1-4
Pass: xdpgeneric ping6 ns2-1
Pass: xdpgeneric ping6 ns2-3
Pass: xdpgeneric ping6 ns2-4
Pass: xdpdrv arp ns1-2
Pass: xdpdrv arp ns1-3
Pass: xdpdrv arp ns1-4
Pass: xdpdrv ping ns1-2
Pass: xdpdrv ping ns1-3
Pass: xdpdrv ping ns1-4
Pass: xdpdrv ping6 ns2-1
Pass: xdpdrv ping6 ns2-3
Pass: xdpdrv ping6 ns2-4
Summary: PASS 18, FAIL 0

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       |  90 +++++++++
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 164 +++++++++++++++++
 .../selftests/bpf/xdp_redirect_multi.c        | 173 ++++++++++++++++++
 4 files changed, 430 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 1f9c696b3edf..66b857210814 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -51,6 +51,7 @@ TEST_FILES = test_lwt_ip_encap.o \
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
 	test_xdp_redirect.sh \
+	test_xdp_redirect_multi.sh \
 	test_xdp_meta.sh \
 	test_xdp_veth.sh \
 	test_offload.py \
@@ -79,7 +80,8 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
-	test_lirc_mode2_user xdping test_cpp runqslower bench
+	test_lirc_mode2_user xdping test_cpp runqslower bench \
+	xdp_redirect_multi
 
 TEST_CUSTOM_PROGS = urandom_read
 
diff --git a/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c b/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
new file mode 100644
index 000000000000..70b8476b9df3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <string.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct bpf_map_def SEC("maps") forward_map_v4 = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 4096,
+};
+
+struct bpf_map_def SEC("maps") forward_map_v6 = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_all = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") exclude_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") null_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	__u16 h_proto;
+	__u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == bpf_htons(ETH_P_IP))
+		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else if (h_proto == bpf_htons(ETH_P_IPV6))
+		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else
+		return bpf_redirect_map_multi(&forward_map_all, &null_map,
+					      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_dummy")
+int xdp_pass(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh b/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
new file mode 100755
index 000000000000..f4f8f751854e
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
@@ -0,0 +1,164 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  ... init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |  ...
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Forward multicast groups:
+#     Forward group all has interfaces: veth1, veth2, veth3, veth4, ... (All traffic except IPv4, IPv6)
+#     Forward group v4 has interfaces: veth1, veth3, veth4, ... (For IPv4 traffic only)
+#     Forward group v6 has interfaces: veth2, veth3, veth4, ... (For IPv6 traffic only)
+# Exclude Groups:
+#     Exclude group: veth3 (assume ns3 is in black list)
+#
+# Test modules:
+# XDP modes: generic, native
+# map types: group v4 use DEVMAP, others use DEVMAP_HASH
+#
+# Test cases:
+#     ARP(we didn't block ARP for ns3):
+#        ns1 -> gw: ns2, ns3, ns4 should receive the arp request
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (fail), ns1 -> ns4 (pass)
+#     IPv6
+#        ns2 -> ns1 (fail), ns2 -> ns3 (fail), ns2 -> ns4 (pass)
+#
+
+
+# netns numbers
+NUM=4
+IFACES=""
+DRV_MODE="xdpgeneric xdpdrv"
+PASS=0
+FAIL=0
+
+test_pass()
+{
+	echo "Pass: $@"
+	PASS=$((PASS + 1))
+}
+
+test_fail()
+{
+	echo "fail: $@"
+	FAIL=$((FAIL + 1))
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip link del veth$i 2> /dev/null
+		ip netns del ns$i 2> /dev/null
+	done
+	rm -f xdp_redirect_*.log arp_ns*.log
+}
+
+# Kselftest framework requirement - SKIP code is 4.
+check_env()
+{
+	ip link set dev lo xdpgeneric off &>/dev/null
+	if [ $? -ne 0 ];then
+		echo "selftests: [SKIP] Could not run test without the ip xdpgeneric support"
+		exit 4
+	fi
+
+	which tcpdump &>/dev/null
+	if [ $? -ne 0 ];then
+		echo "selftests: [SKIP] Could not run test without tcpdump"
+		exit 4
+	fi
+}
+
+setup_ns()
+{
+	local mode=$1
+	IFACES=""
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth$i type veth peer name veth0 netns ns$i
+		ip link set veth$i up
+		ip -n ns$i link set veth0 up
+
+		ip -n ns$i addr add 192.0.2.$i/24 dev veth0
+		ip -n ns$i addr add 2001:db8::$i/64 dev veth0
+		ip -n ns$i link set veth0 $mode obj \
+			xdp_redirect_multi_kern.o sec xdp_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_ping_tests()
+{
+	local mode=$1
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${mode}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${mode}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-2_${mode}.log && \
+		test_pass "$mode arp ns1-2" || test_fail "$mode arp ns1-2"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-3_${mode}.log && \
+		test_pass "$mode arp ns1-3" || test_fail "$mode arp ns1-3"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-4_${mode}.log && \
+		test_pass "$mode arp ns1-4" || test_fail "$mode arp ns1-4"
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$mode ping ns1-2" || test_pass "$mode ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_fail "$mode ping ns1-3" || test_pass "$mode ping ns1-3"
+	ip netns exec ns1 ping 192.0.2.4 -c 4 &> /dev/null && \
+		test_pass "$mode ping ns1-4" || test_fail "$mode ping ns1-4"
+
+	# ping6 test
+	ip netns exec ns2 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$mode ping6 ns2-1" || test_pass "$mode ping6 ns2-1"
+	ip netns exec ns2 ping6 2001:db8::3 -c 4 &> /dev/null && \
+		test_fail "$mode ping6 ns2-3" || test_pass "$mode ping6 ns2-3"
+	ip netns exec ns2 ping6 2001:db8::4 -c 4 &> /dev/null && \
+		test_pass "$mode ping6 ns2-4" || test_fail "$mode ping6 ns2-4"
+}
+
+do_tests()
+{
+	local mode=$1
+	local drv_p
+
+	[ ${mode} == "xdpdrv" ] && drv_p="-N" || drv_p="-S"
+
+	# run `ulimit -l unlimited` if you got errors like
+	# libbpf: Error in bpf_object__probe_global_data():Operation not permitted(1).
+	./xdp_redirect_multi $drv_p $IFACES &> xdp_redirect_${mode}.log &
+	xdp_pid=$!
+	sleep 10
+
+	do_ping_tests $mode
+
+	kill $xdp_pid
+}
+
+trap clean_up 0 2 3 6 9
+
+check_env
+
+for mode in ${DRV_MODE}; do
+	setup_ns $mode
+	do_tests $mode
+	sleep 10
+	clean_up
+	sleep 5
+done
+
+echo "Summary: PASS $PASS, FAIL $FAIL"
+[ $FAIL -eq 0 ] && exit 0 || exit 1
diff --git a/tools/testing/selftests/bpf/xdp_redirect_multi.c b/tools/testing/selftests/bpf/xdp_redirect_multi.c
new file mode 100644
index 000000000000..5626005cb679
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_redirect_multi.c
@@ -0,0 +1,173 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int ifaces[MAX_IFACE_NUM] = {};
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int prog_fd, group_all, group_v4, group_v6, exclude;
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
+	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
+	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
+	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
+
+	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* For testing: remove the 1st interfaces from group v6 */
+		if (i != 0) {
+			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: remove the 2nd interfaces from group v4 */
+		if (i != 1) {
+			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: add the 3rd interfaces to exclude map */
+		if (i == 2) {
+			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+	}
+
+	/* sleep some time for testing */
+	sleep(999);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv6 bpf-next 1/3] xdp: add a new helper for dev map multicast support
  2020-07-09  1:30       ` [PATCHv6 bpf-next 1/3] " Hangbin Liu
@ 2020-07-09 16:33         ` David Ahern
  2020-07-10  6:55           ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: David Ahern @ 2020-07-09 16:33 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi

On 7/8/20 7:30 PM, Hangbin Liu wrote:
> This patch is for xdp multicast support. In this implementation we
> add a new helper to accept two maps: forward map and exclude map.
> We will redirect the packet to all the interfaces in *forward map*, but
> exclude the interfaces that in *exclude map*.
> 

good feature. I bet we could use this to create a simpler xdp dumper -
redirect to an xdpmon device which converts to an skb and passes to any
attached sockets.


> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 10abb06065bb..617a51391971 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -512,6 +512,160 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return __xdp_enqueue(dev, xdp, dev_rx);
>  }
>  
> +/* Use direct call in fast path instead of map->ops->map_get_next_key() */
> +static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +
> +	switch (map->map_type) {
> +	case BPF_MAP_TYPE_DEVMAP:
> +		return dev_map_get_next_key(map, key, next_key);
> +	case BPF_MAP_TYPE_DEVMAP_HASH:
> +		return dev_map_hash_get_next_key(map, key, next_key);
> +	default:
> +		break;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	struct bpf_dtab_netdev *ex_obj = NULL;
> +	u32 key, next_key;
> +	int err;
> +
> +	if (obj->dev->ifindex == exclude_ifindex)
> +		return true;
> +
> +	if (!map)
> +		return false;
> +
> +	err = devmap_get_next_key(map, NULL, &key);
> +	if (err)
> +		return false;
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			ex_obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			ex_obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (ex_obj && ex_obj->dev->ifindex == obj->dev->ifindex)

I'm probably missing something fundamental, but why do you need to walk
the keys? Why not just do a lookup on the device index?

> +			return true;
> +
> +		err = devmap_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return false;
> +}
> +




> @@ -3741,6 +3810,34 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
>  	.arg3_type      = ARG_ANYTHING,
>  };
>  
> +BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
> +	   struct bpf_map *, ex_map, u64, flags)
> +{
> +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))

If flags is a bitfield, the check should be:
    flags & ~BPF_F_EXCLUDE_INGRESS

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support
  2020-07-09  1:30     ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Hangbin Liu
                         ` (2 preceding siblings ...)
  2020-07-09  1:30       ` [PATCHv6 bpf-next 3/3] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
@ 2020-07-09 22:37       ` Daniel Borkmann
  2020-07-10  7:36         ` Hangbin Liu
  3 siblings, 1 reply; 72+ messages in thread
From: Daniel Borkmann @ 2020-07-09 22:37 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Lorenzo Bianconi

On 7/9/20 3:30 AM, Hangbin Liu wrote:
> This patch is for xdp multicast support. which has been discussed before[0],
> The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
> a software switch that can forward XDP frames to multiple ports.
> 
> To achieve this, an application needs to specify a group of interfaces
> to forward a packet to. It is also common to want to exclude one or more
> physical interfaces from the forwarding operation - e.g., to forward a
> packet to all interfaces in the multicast group except the interface it
> arrived on. While this could be done simply by adding more groups, this
> quickly leads to a combinatorial explosion in the number of groups an
> application has to maintain.
> 
> To avoid the combinatorial explosion, we propose to include the ability
> to specify an "exclude group" as part of the forwarding operation. This
> needs to be a group (instead of just a single port index), because a
> physical interface can be part of a logical grouping, such as a bond
> device.
> 
> Thus, the logical forwarding operation becomes a "set difference"
> operation, i.e. "forward to all ports in group A that are not also in
> group B". This series implements such an operation using device maps to
> represent the groups. This means that the XDP program specifies two
> device maps, one containing the list of netdevs to redirect to, and the
> other containing the exclude list.

Could you move this description as part of patch 1/3 instead of cover
letter? Mostly given this helps understanding the rationale wrt exclusion
map which is otherwise lacking from just looking at the patch itself.

Assuming you have a bond, how does this look in practice for your mentioned
ovs-like data plane in XDP? The map for 'group A' is shared among all XDP
progs and the map for 'group B' is managed per prog? The BPF_F_EXCLUDE_INGRESS
is clear, but how would this look wrt forwarding from a phys dev /to/ the
bond iface w/ XDP?

Also, what about tc BPF helper support for the case where not every device
might have native XDP (but they could still share the maps)?

> To achieve this, I re-implement a new helper bpf_redirect_map_multi()
> to accept two maps, the forwarding map and exclude map. If user
> don't want to use exclude map and just want simply stop redirecting back
> to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.
> 
> The 2nd and 3rd patches are for usage sample and testing purpose, so there
> is no effort has been made on performance optimisation. I did same tests
> with pktgen(pkt size 64) to compire with xdp_redirect_map(). Here is the
> test result(the veth peer has a dummy xdp program with XDP_DROP directly):
> 
> Version         | Test                                   | Native | Generic
> 5.8 rc1         | xdp_redirect_map       i40e->i40e      |  10.0M |   1.9M
> 5.8 rc1         | xdp_redirect_map       i40e->veth      |  12.7M |   1.6M
> 5.8 rc1 + patch | xdp_redirect_map       i40e->i40e      |  10.0M |   1.9M
> 5.8 rc1 + patch | xdp_redirect_map       i40e->veth      |  12.3M |   1.6M
> 5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e      |   7.2M |   1.5M
> 5.8 rc1 + patch | xdp_redirect_map_multi i40e->veth      |   8.5M |   1.3M
> 5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e+veth |   3.0M |  0.98M
> 
> The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
> the arrays and do clone skb/xdpf. The native path is slower than generic
> path as we send skbs by pktgen. So the result looks reasonable.
> 
> Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
> suggestions and help on implementation.
> 
> [0] https://xdp-project.net/#Handling-multicast
> 
> v6: converted helper return types from int to long
> 
> v5:
> a) Check devmap_get_next_key() return value.
> b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
> c) In function dev_map_enqueue_multi(), consume xdpf for the last
>     obj instead of the first on.
> d) Update helper description and code comments to explain that we
>     use NULL target value to distinguish multicast and unicast
>     forwarding.
> e) Update memory model, memory id and frame_sz in xdpf_clone().
> f) Split the tests from sample and add a bpf kernel selftest patch.
> 
> v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo
> 
> v3: Based on Toke's suggestion, do the following update
> a) Update bpf_redirect_map_multi() description in bpf.h.
> b) Fix exclude_ifindex checking order in dev_in_exclude_map().
> c) Fix one more xdpf clone in dev_map_enqueue_multi().
> d) Go find next one in dev_map_enqueue_multi() if the interface is not
>     able to forward instead of abort the whole loop.
> e) Remove READ_ONCE/WRITE_ONCE for ex_map.
> 
> v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
> include/exclude maps directly.
> 
> Hangbin Liu (3):
>    xdp: add a new helper for dev map multicast support
>    sample/bpf: add xdp_redirect_map_multicast test
>    selftests/bpf: add xdp_redirect_multi test
> 
>   include/linux/bpf.h                           |  20 ++
>   include/linux/filter.h                        |   1 +
>   include/net/xdp.h                             |   1 +
>   include/uapi/linux/bpf.h                      |  22 +++
>   kernel/bpf/devmap.c                           | 154 ++++++++++++++++
>   kernel/bpf/verifier.c                         |   6 +
>   net/core/filter.c                             | 109 ++++++++++-
>   net/core/xdp.c                                |  29 +++
>   samples/bpf/Makefile                          |   3 +
>   samples/bpf/xdp_redirect_map_multi_kern.c     |  57 ++++++
>   samples/bpf/xdp_redirect_map_multi_user.c     | 166 +++++++++++++++++
>   tools/include/uapi/linux/bpf.h                |  22 +++
>   tools/testing/selftests/bpf/Makefile          |   4 +-
>   .../bpf/progs/xdp_redirect_multi_kern.c       |  90 +++++++++
>   .../selftests/bpf/test_xdp_redirect_multi.sh  | 164 +++++++++++++++++
>   .../selftests/bpf/xdp_redirect_multi.c        | 173 ++++++++++++++++++
>   16 files changed, 1015 insertions(+), 6 deletions(-)
>   create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
>   create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
>   create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
>   create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
>   create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv6 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test
  2020-07-09  1:30       ` [PATCHv6 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
@ 2020-07-09 22:40         ` Daniel Borkmann
  2020-07-10  6:41           ` Hangbin Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Daniel Borkmann @ 2020-07-09 22:40 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Lorenzo Bianconi

On 7/9/20 3:30 AM, Hangbin Liu wrote:
> This is a sample for xdp multicast. In the sample we could forward all
> packets between given interfaces.
> 
> v5: add a null_map as we have strict the arg2 to ARG_CONST_MAP_PTR.
>      Move the testing part to bpf selftest in next patch.
> v4: no update.
> v3: add rxcnt map to show the packet transmit speed.
> v2: no update.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>   samples/bpf/Makefile                      |   3 +
>   samples/bpf/xdp_redirect_map_multi_kern.c |  57 ++++++++
>   samples/bpf/xdp_redirect_map_multi_user.c | 166 ++++++++++++++++++++++
>   3 files changed, 226 insertions(+)
>   create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
>   create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index f87ee02073ba..fddca6cb76b8 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
>   tprogs-y += per_socket_stats_example
>   tprogs-y += xdp_redirect
>   tprogs-y += xdp_redirect_map
> +tprogs-y += xdp_redirect_map_multi
>   tprogs-y += xdp_redirect_cpu
>   tprogs-y += xdp_monitor
>   tprogs-y += xdp_rxq_info
> @@ -97,6 +98,7 @@ test_map_in_map-objs := test_map_in_map_user.o
>   per_socket_stats_example-objs := cookie_uid_helper_example.o
>   xdp_redirect-objs := xdp_redirect_user.o
>   xdp_redirect_map-objs := xdp_redirect_map_user.o
> +xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
>   xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
>   xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
>   xdp_rxq_info-objs := xdp_rxq_info_user.o
> @@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
>   always-y += tcp_dumpstats_kern.o
>   always-y += xdp_redirect_kern.o
>   always-y += xdp_redirect_map_kern.o
> +always-y += xdp_redirect_map_multi_kern.o
>   always-y += xdp_redirect_cpu_kern.o
>   always-y += xdp_monitor_kern.o
>   always-y += xdp_rxq_info_kern.o
> diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
> new file mode 100644
> index 000000000000..cc7ebaedf55a
> --- /dev/null
> +++ b/samples/bpf/xdp_redirect_map_multi_kern.c
> @@ -0,0 +1,57 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#define KBUILD_MODNAME "foo"
> +#include <uapi/linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +
> +struct bpf_map_def SEC("maps") forward_map = {
> +	.type = BPF_MAP_TYPE_DEVMAP_HASH,
> +	.key_size = sizeof(u32),
> +	.value_size = sizeof(int),
> +	.max_entries = 256,
> +};
> +
> +struct bpf_map_def SEC("maps") null_map = {
> +	.type = BPF_MAP_TYPE_DEVMAP_HASH,
> +	.key_size = sizeof(u32),
> +	.value_size = sizeof(int),
> +	.max_entries = 1,
> +};
> +
> +struct bpf_map_def SEC("maps") rxcnt = {
> +	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
> +	.key_size = sizeof(u32),
> +	.value_size = sizeof(long),
> +	.max_entries = 1,
> +};
> +
> +SEC("xdp_redirect_map_multi")
> +int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
> +{
> +	long *value;
> +	u32 key = 0;
> +
> +	/* count packet in global counter */
> +	value = bpf_map_lookup_elem(&rxcnt, &key);
> +	if (value)
> +		*value += 1;
> +
> +	return bpf_redirect_map_multi(&forward_map, &null_map,
> +				      BPF_F_EXCLUDE_INGRESS);

Why not extending to allow use-case like ...

   return bpf_redirect_map_multi(&fwd_map, NULL, BPF_F_EXCLUDE_INGRESS);

... instead of requiring a dummy/'null' map?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv6 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test
  2020-07-09 22:40         ` Daniel Borkmann
@ 2020-07-10  6:41           ` Hangbin Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-10  6:41 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Lorenzo Bianconi

On Fri, Jul 10, 2020 at 12:40:11AM +0200, Daniel Borkmann wrote:
> > +SEC("xdp_redirect_map_multi")
> > +int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
> > +{
> > +	long *value;
> > +	u32 key = 0;
> > +
> > +	/* count packet in global counter */
> > +	value = bpf_map_lookup_elem(&rxcnt, &key);
> > +	if (value)
> > +		*value += 1;
> > +
> > +	return bpf_redirect_map_multi(&forward_map, &null_map,
> > +				      BPF_F_EXCLUDE_INGRESS);
> 
> Why not extending to allow use-case like ...
> 
>   return bpf_redirect_map_multi(&fwd_map, NULL, BPF_F_EXCLUDE_INGRESS);
> 
> ... instead of requiring a dummy/'null' map?
> 

I planed to let user set NULL, but the arg2_type is ARG_CONST_MAP_PTR, which
not allow NULL pointer.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv6 bpf-next 1/3] xdp: add a new helper for dev map multicast support
  2020-07-09 16:33         ` David Ahern
@ 2020-07-10  6:55           ` Hangbin Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-10  6:55 UTC (permalink / raw)
  To: David Ahern
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi

Hi David,
On Thu, Jul 09, 2020 at 10:33:38AM -0600, David Ahern wrote:
> > +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> > +			int exclude_ifindex)
> > +{
> > +	struct bpf_dtab_netdev *ex_obj = NULL;
> > +	u32 key, next_key;
> > +	int err;
> > +
> > +	if (obj->dev->ifindex == exclude_ifindex)
> > +		return true;
> > +
> > +	if (!map)
> > +		return false;
> > +
> > +	err = devmap_get_next_key(map, NULL, &key);
> > +	if (err)
> > +		return false;
> > +
> > +	for (;;) {
> > +		switch (map->map_type) {
> > +		case BPF_MAP_TYPE_DEVMAP:
> > +			ex_obj = __dev_map_lookup_elem(map, key);
> > +			break;
> > +		case BPF_MAP_TYPE_DEVMAP_HASH:
> > +			ex_obj = __dev_map_hash_lookup_elem(map, key);
> > +			break;
> > +		default:
> > +			break;
> > +		}
> > +
> > +		if (ex_obj && ex_obj->dev->ifindex == obj->dev->ifindex)
> 
> I'm probably missing something fundamental, but why do you need to walk
> the keys? Why not just do a lookup on the device index?

This functions is to check if the device index is in exclude map.

The device indexes are stored as values in the map. The user could store
the values by any key number. There is no way to lookup the device index
directly unless loop the map and check each values we stored.

Is there a map feature which could get an exact value directly?

> > +BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
> > +	   struct bpf_map *, ex_map, u64, flags)
> > +{
> > +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > +
> > +	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
> 
> If flags is a bitfield, the check should be:
>     flags & ~BPF_F_EXCLUDE_INGRESS

Thanks for the tips, I will fix it.

Cheers
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support
  2020-07-09 22:37       ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Daniel Borkmann
@ 2020-07-10  7:36         ` Hangbin Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Hangbin Liu @ 2020-07-10  7:36 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Lorenzo Bianconi

On Fri, Jul 10, 2020 at 12:37:59AM +0200, Daniel Borkmann wrote:
> On 7/9/20 3:30 AM, Hangbin Liu wrote:
> > This patch is for xdp multicast support. which has been discussed before[0],
> > The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
> > a software switch that can forward XDP frames to multiple ports.
> > 
> > To achieve this, an application needs to specify a group of interfaces
> > to forward a packet to. It is also common to want to exclude one or more
> > physical interfaces from the forwarding operation - e.g., to forward a
> > packet to all interfaces in the multicast group except the interface it
> > arrived on. While this could be done simply by adding more groups, this
> > quickly leads to a combinatorial explosion in the number of groups an
> > application has to maintain.
> > 
> > To avoid the combinatorial explosion, we propose to include the ability
> > to specify an "exclude group" as part of the forwarding operation. This
> > needs to be a group (instead of just a single port index), because a
> > physical interface can be part of a logical grouping, such as a bond
> > device.
> > 
> > Thus, the logical forwarding operation becomes a "set difference"
> > operation, i.e. "forward to all ports in group A that are not also in
> > group B". This series implements such an operation using device maps to
> > represent the groups. This means that the XDP program specifies two
> > device maps, one containing the list of netdevs to redirect to, and the
> > other containing the exclude list.
> 
> Could you move this description as part of patch 1/3 instead of cover
> letter? Mostly given this helps understanding the rationale wrt exclusion
> map which is otherwise lacking from just looking at the patch itself.

OK, I will

> 
> Assuming you have a bond, how does this look in practice for your mentioned
> ovs-like data plane in XDP? The map for 'group A' is shared among all XDP
> progs and the map for 'group B' is managed per prog? The BPF_F_EXCLUDE_INGRESS

Yes, kind of. Since we have two maps as parameter. The 'group A map'(include map)
will be shared between the interfaces in same group/vlan. The 'group B map'
(exclude map) is interface specific. Each interface will hold it's own exclude map.

As most time each interface only exclude itself, a null map + BPF_F_EXCLUDE_INGRESS
should be enough.

For bond situation. e.g. A active-backup bond0 with eth1 + eth2 as slaves.
If eth1 is active interface, we can add eth2 to the exclude map.

> is clear, but how would this look wrt forwarding from a phys dev /to/ the
> bond iface w/ XDP?

As bond interface doesn't support native XDP, This forwarding only works for
physical slave interfaces.

For generic xdp, maybe we can forward to bond interface directly, but I
haven't tried.

> 
> Also, what about tc BPF helper support for the case where not every device
> might have native XDP (but they could still share the maps)?

I haven't tried tc BPF. This helper works for both generic and native xdp
forwarding. I think it should also works if we load the prog with native
xdp mode in one interface and generic xdp mode in another interface, couldn't
we?

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, back to index

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
2020-04-20  9:52   ` Hangbin Liu
2020-04-15  8:54 ` [RFC PATCH bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
2020-04-24 14:19     ` Lorenzo Bianconi
2020-04-28 11:09       ` Eelco Chaudron
2020-05-06  9:35       ` Hangbin Liu
2020-04-24 14:34     ` Toke Høiland-Jørgensen
2020-05-06  9:14       ` Hangbin Liu
2020-05-06 10:00         ` Toke Høiland-Jørgensen
2020-05-08  8:53           ` Hangbin Liu
2020-05-08 14:58             ` Toke Høiland-Jørgensen
2020-05-18  8:45       ` Hangbin Liu
2020-05-19 10:15         ` Jesper Dangaard Brouer
2020-05-20  1:24           ` Hangbin Liu
2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
2020-04-24 14:21     ` Lorenzo Bianconi
2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
2020-05-23  6:05   ` [PATCHv3 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
2020-05-26  7:34     ` kbuild test robot
2020-05-23  6:05   ` [PATCHv3 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
2020-05-26 14:05 ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
2020-05-26 14:05   ` [PATCHv4 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
2020-05-27 10:29     ` Toke Høiland-Jørgensen
2020-06-10 10:18     ` Jesper Dangaard Brouer
2020-06-12  8:54       ` Hangbin Liu
2020-06-16  8:55         ` Jesper Dangaard Brouer
2020-06-16 10:11           ` Hangbin Liu
2020-06-16 14:38             ` Jesper Dangaard Brouer
2020-06-10 10:21     ` Jesper Dangaard Brouer
2020-06-10 10:29       ` Toke Høiland-Jørgensen
2020-06-16  9:04         ` Jesper Dangaard Brouer
2020-05-26 14:05   ` [PATCHv4 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
2020-05-27 10:21   ` [PATCHv4 bpf-next 0/2] xdp: add dev map multicast support Toke Høiland-Jørgensen
2020-05-27 10:32     ` Eelco Chaudron
2020-05-27 12:38     ` Hangbin Liu
2020-05-27 15:04       ` Toke Høiland-Jørgensen
2020-06-16  9:09         ` Jesper Dangaard Brouer
2020-06-16  9:47           ` Hangbin Liu
2020-06-03  2:40     ` Hangbin Liu
2020-06-03 11:05       ` Toke Høiland-Jørgensen
2020-06-04  4:09         ` Hangbin Liu
2020-06-04  9:44           ` Toke Høiland-Jørgensen
2020-06-04 12:12             ` Hangbin Liu
2020-06-04 12:37               ` Toke Høiland-Jørgensen
2020-06-04 14:41                 ` Hangbin Liu
2020-06-04 16:02                   ` Toke Høiland-Jørgensen
2020-06-05  6:26                     ` Hangbin Liu
2020-06-08 15:32                       ` Toke Høiland-Jørgensen
2020-06-09  3:03                         ` Hangbin Liu
2020-06-09 20:31                           ` Toke Høiland-Jørgensen
2020-06-10  2:35                             ` Hangbin Liu
2020-06-10 10:03                               ` Jesper Dangaard Brouer
2020-07-01  4:19   ` [PATCHv5 bpf-next 0/3] xdp: add a new helper for " Hangbin Liu
2020-07-01  4:19     ` [PATCHv5 bpf-next 1/3] " Hangbin Liu
2020-07-01  5:09       ` Andrii Nakryiko
2020-07-01  6:51         ` Hangbin Liu
2020-07-01 18:33       ` kernel test robot
2020-07-01  4:19     ` [PATCHv5 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
2020-07-01  4:19     ` [PATCHv5 bpf-next 3/3] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
2020-07-09  1:30     ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Hangbin Liu
2020-07-09  1:30       ` [PATCHv6 bpf-next 1/3] " Hangbin Liu
2020-07-09 16:33         ` David Ahern
2020-07-10  6:55           ` Hangbin Liu
2020-07-09  1:30       ` [PATCHv6 bpf-next 2/3] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
2020-07-09 22:40         ` Daniel Borkmann
2020-07-10  6:41           ` Hangbin Liu
2020-07-09  1:30       ` [PATCHv6 bpf-next 3/3] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
2020-07-09 22:37       ` [PATCHv6 bpf-next 0/3] xdp: add a new helper for dev map multicast support Daniel Borkmann
2020-07-10  7:36         ` Hangbin Liu

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git