BPF Archive on lore.kernel.org
* [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support
@ 2020-04-15  8:54 Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-04-15  8:54 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov,
	Daniel Borkmann, Hangbin Liu

Hi all,

This is a prototype for xdp multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple
ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To be able to reuse the existing bpf_redirect_map() helper, we use a
containing map-in-map type to store the forwarding and exclude groups.
When a map-in-map type is passed to the redirect helper, it will
interpret the index as encoding the forwarding group in the upper 16
bits and the exclude group in the lower 16 bits. The enqueue logic will
unpack the two halves of the index and perform separate lookups in the
containing map. E.g., an index of 0x00010001 will look for the
forwarding group at map index 0x10000 and the exclude group at map index
0x1; the application is expected to populate the map accordingly.

For this RFC series we are primarily looking for feedback on the concept
and API: the example in patch 2 is functional, but not a lot of effort
has been made on performance optimisation.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

Hangbin Liu (2):
  xdp: add dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test

 include/linux/bpf.h                           |  29 ++
 include/net/xdp.h                             |   1 +
 kernel/bpf/arraymap.c                         |   2 +-
 kernel/bpf/devmap.c                           | 118 +++++++
 kernel/bpf/hashtab.c                          |   2 +-
 kernel/bpf/verifier.c                         |  15 +-
 net/core/filter.c                             |  69 +++-
 net/core/xdp.c                                |  26 ++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multicast.sh     | 142 ++++++++
 samples/bpf/xdp_redirect_map_multicast_kern.c | 147 +++++++++
 samples/bpf/xdp_redirect_map_multicast_user.c | 306 ++++++++++++++++++
 12 files changed, 854 insertions(+), 6 deletions(-)
 create mode 100755 samples/bpf/xdp_redirect_map_multicast.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multicast_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multicast_user.c

-- 
2.19.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next 1/2] xdp: add dev map multicast support
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-04-15  8:54 ` Hangbin Liu
  2020-04-20  9:52   ` Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Hangbin Liu @ 2020-04-15  8:54 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov,
	Daniel Borkmann, Hangbin Liu

This is a prototype for XDP multicast support. In this implementation
we use a map-in-map to store the multicast groups, because an interface
may have both include and exclude groups.

The include and exclude groups are separated by the 32-bit map key:
the high 16 bits are used for include groups and the low 16 bits for
exclude groups.

The generic data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls to
get better performance.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h   |  29 +++++++++++
 include/net/xdp.h     |   1 +
 kernel/bpf/arraymap.c |   2 +-
 kernel/bpf/devmap.c   | 118 ++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/hashtab.c  |   2 +-
 kernel/bpf/verifier.c |  15 +++++-
 net/core/filter.c     |  69 +++++++++++++++++++++++-
 net/core/xdp.c        |  26 ++++++++++
 8 files changed, 256 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fd2b2322412d..72797667bca8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1156,11 +1156,17 @@ struct sk_buff;
 
 struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
 struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key);
+void *array_of_map_lookup_elem(struct bpf_map *map, void *key);
+void *htab_of_map_lookup_elem(struct bpf_map *map, void *key);
 void __dev_flush(void);
 int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, u32 index);
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 
@@ -1276,6 +1282,16 @@ static inline struct net_device  *__dev_map_hash_lookup_elem(struct bpf_map *map
 	return NULL;
 }
 
+static inline void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return NULL;
+}
+
+static inline void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return NULL;
+}
+
 static inline void __dev_flush(void)
 {
 }
@@ -1297,6 +1313,19 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map)
+{
+	return true;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, u32 index)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 40c6d3398458..a214dce8579c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -92,6 +92,7 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
 }
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 /* Convert xdp_buff to xdp_frame */
 static inline
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 95d77770353c..26ac66a05015 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1031,7 +1031,7 @@ static void array_of_map_free(struct bpf_map *map)
 	fd_array_map_free(map);
 }
 
-static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
+void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_map **inner_map = array_map_lookup_elem(map, key);
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 58bdca5d978a..3a60cb209ae1 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -85,6 +85,9 @@ static DEFINE_PER_CPU(struct list_head, dev_flush_list);
 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);
 
+static void *dev_map_lookup_elem(struct bpf_map *map, void *key);
+static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key);
+
 static struct hlist_head *dev_map_create_hash(unsigned int entries)
 {
 	int i;
@@ -456,6 +459,121 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct calls in the fast path instead of map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map)
+{
+	struct bpf_dtab_netdev *in_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			in_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			in_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, u32 index)
+{
+	struct bpf_dtab_netdev *obj = NULL;
+	struct bpf_map *in_map, *ex_map;
+	struct xdp_frame *xdpf, *nxdpf;
+	struct net_device *dev;
+	u32 in_index, ex_index;
+	u32 key, next_key;
+	int err;
+
+	in_index = index >> 16;
+	in_index = in_index << 16;
+	ex_index = in_index ^ index;
+
+	in_map = map->ops->map_lookup_elem(map, &in_index);
+	/* ex_map could be NULL */
+	ex_map = map->ops->map_lookup_elem(map, &ex_index);
+
+	devmap_get_next_key(in_map, NULL, &key);
+
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		switch (in_map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(in_map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(in_map, key);
+			break;
+		default:
+			break;
+		}
+		if (!obj)
+			goto find_next;
+
+		if (ex_map && !dev_in_exclude_map(obj, ex_map)) {
+			dev = obj->dev;
+
+			if (!dev->netdev_ops->ndo_xdp_xmit)
+				return -EOPNOTSUPP;
+
+			err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+			if (unlikely(err))
+				return err;
+
+			nxdpf = xdpf_clone(xdpf);
+			if (unlikely(!nxdpf))
+				return -ENOMEM;
+
+			bq_enqueue(dev, nxdpf, dev_rx);
+		}
+find_next:
+		err = devmap_get_next_key(in_map, &key, &next_key);
+		if (err)
+			break;
+		key = next_key;
+	}
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d541c8486c95..4e0a2eebd38d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1853,7 +1853,7 @@ static struct bpf_map *htab_of_map_alloc(union bpf_attr *attr)
 	return map;
 }
 
-static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
+void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_map **inner_map  = htab_map_lookup_elem(map, key);
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 04c6630cc18f..84d23418823a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3898,7 +3898,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
 	case BPF_MAP_TYPE_HASH_OF_MAPS:
-		if (func_id != BPF_FUNC_map_lookup_elem)
+		/* Used by multicast redirect */
+		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_SOCKMAP:
@@ -3968,8 +3970,17 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
 		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH &&
 		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
-		    map->map_type != BPF_MAP_TYPE_XSKMAP)
+		    map->map_type != BPF_MAP_TYPE_XSKMAP &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY_OF_MAPS &&
+		    map->map_type != BPF_MAP_TYPE_HASH_OF_MAPS)
 			goto error;
+		if (map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+		    map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+			/* FIXME: Maybe we should also restrict the key size here? */
+			if (map->inner_map_meta->map_type != BPF_MAP_TYPE_DEVMAP &&
+			    map->inner_map_meta->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+				goto error;
+		}
 		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
diff --git a/net/core/filter.c b/net/core/filter.c
index 7628b947dbc3..7d2076f5b0a4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    u32 index)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
+		/* fall through */
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		return dev_map_enqueue(fwd, xdp, dev_rx);
+	case BPF_MAP_TYPE_HASH_OF_MAPS:
+	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
+		return dev_map_enqueue_multi(xdp, dev_rx, map, index);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3508,6 +3513,10 @@ static inline void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
 		return __cpu_map_lookup_elem(map, index);
 	case BPF_MAP_TYPE_XSKMAP:
 		return __xsk_map_lookup_elem(map, index);
+	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
+		return array_of_map_lookup_elem(map, &(u32){ (index >> 16) << 16 });
+	case BPF_MAP_TYPE_HASH_OF_MAPS:
+		return htab_of_map_lookup_elem(map, &(u32){ (index >> 16) << 16 });
 	default:
 		return NULL;
 	}
@@ -3552,7 +3561,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, index);
 	}
 
 	if (unlikely(err))
@@ -3566,6 +3575,55 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct sk_buff *skb, struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, u32 index)
+
+{
+	struct bpf_map *in_map, *ex_map;
+	struct bpf_dtab_netdev *dst;
+	u32 in_index, ex_index;
+	struct sk_buff *nskb;
+	u32 key, next_key;
+	int err;
+	void *fwd;
+
+	in_index = index >> 16;
+	in_index = in_index << 16;
+	ex_index = in_index ^ index;
+
+	in_map = map->ops->map_lookup_elem(map, &in_index);
+	/* ex_map could be NULL */
+	ex_map = map->ops->map_lookup_elem(map, &ex_index);
+
+	in_map->ops->map_get_next_key(in_map, NULL, &key);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(in_map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (ex_map && dev_in_exclude_map(dst, ex_map))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -ENOMEM;
+
+			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+			if (unlikely(err))
+				return err;
+		}
+
+find_next:
+		err = in_map->ops->map_get_next_key(in_map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3588,6 +3646,13 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
+	} else if (map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+		   map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+		/* Do multicast redirecting */
+		err = dev_map_redirect_multi(skb, xdp_prog, map, index);
+		if (unlikely(err))
+			goto err;
+		consume_skb(skb);
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
 		struct xdp_sock *xs = fwd;
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4c7ea85486af..70dfb4910f84 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	return xdpf;
 }
 EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
-- 
2.19.2



* [RFC PATCH bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
@ 2020-04-15  8:54 ` Hangbin Liu
  2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  3 siblings, 0 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-04-15  8:54 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov,
	Daniel Borkmann, Hangbin Liu

This test exercises XDP multicast. It defines 3 groups for different
purposes, and each interface in the init netns has a different set of
exclude interfaces. The test covers both generic and native mode, and
3 different map-in-map types.

For more testing details, please see the test description in
xdp_redirect_map_multicast.sh.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multicast.sh     | 142 ++++++++
 samples/bpf/xdp_redirect_map_multicast_kern.c | 147 +++++++++
 samples/bpf/xdp_redirect_map_multicast_user.c | 306 ++++++++++++++++++
 4 files changed, 598 insertions(+)
 create mode 100755 samples/bpf/xdp_redirect_map_multicast.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multicast_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multicast_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 424f6fe7ce38..55555b0267cf 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multicast
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multicast-objs := bpf_load.o xdp_redirect_map_multicast_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multicast_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multicast.sh b/samples/bpf/xdp_redirect_map_multicast.sh
new file mode 100755
index 000000000000..01f825e33060
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multicast.sh
@@ -0,0 +1,142 @@
+#!/bin/bash
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Include Groups:
+#     Group 1 has interfaces: veth1, veth2, veth3, veth4 (All traffic except IPv4, IPv6)
+#     Group 2 has interfaces: veth1, veth3 (For IPv4 traffic only)
+#     Group 3 has interfaces: veth2, veth4 (For IPv6 traffic only)
+# Exclude Groups:
+#     veth1: exclude veth1
+#     veth2: exclude veth2
+#     veth3: exclude veth3, veth4
+#     veth4: exclude veth3, veth4
+#
+# Testing:
+# XDP modes: generic, native
+# map types: array of array, hash of array, hash of hash
+# Include:
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (pass)
+#     IPv6
+#        ns4 -> ns1 (fail), ns4 -> ns2 (pass)
+# Exclude:
+#     arp ns1 -> ns2: ns2, ns3, ns4 should receive the arp request
+#     arp ns4 -> ns1: ns1, ns2 should receive the arp request, ns3 should not
+#
+
+
+# netns numbers
+NUM=4
+IFACES=""
+DRV_MODE="generic native"
+MAP_TYPE="aa ha hh"
+
+test_pass()
+{
+	echo "Pass: $@"
+}
+
+test_fail()
+{
+	echo "fail: $@"
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip netns del ns$i
+	done
+}
+
+setup_ns()
+{
+	local mode=$1
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth0 type veth peer name veth$i
+	        ip link set veth0 netns ns$i
+		ip netns exec ns$i ip link set veth0 up
+		ip link set veth$i up
+
+		ip netns exec ns$i ip addr add 192.0.2.$i/24 dev veth0
+		ip netns exec ns$i ip addr add 2001:db8::$i/24 dev veth0
+		# Use xdp_redirect_map_kern.o because the dummy section in
+		# xdp_redirect_map_multicast_kern.o does not support iproute2 loading
+		ip netns exec ns$i ip link set veth0 xdp$mode obj xdp_redirect_map_kern.o sec xdp_redirect_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_tests()
+{
+	local drv_mode=$1
+	local map_type=$2
+	local drv_p
+
+	[ ${drv_mode} == "drv" ] && drv_p="-N" || drv_p="-S"
+
+	./xdp_redirect_map_multicast $drv_p -M $map_type $IFACES &> xdp_${drv_mode}_${map_type}.log &
+	xdp_pid=$!
+	sleep 10
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${drv_mode}_${map_type}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${drv_mode}_${map_type}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${drv_mode}_${map_type}.log &
+	ip netns exec ns1 ping 192.0.2.100 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.1" arp_ns1-2_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns1-2" || test_fail "$drv_mode $map_type arp ns1-2"
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.1" arp_ns1-3_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns1-3" || test_fail "$drv_mode $map_type arp ns1-3"
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.1" arp_ns1-4_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns1-4" || test_fail "$drv_mode $map_type arp ns1-4"
+
+	ip netns exec ns1 tcpdump -i veth0 -nn -l -e &> arp_ns4-1_${drv_mode}_${map_type}.log &
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns4-2_${drv_mode}_${map_type}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns4-3_${drv_mode}_${map_type}.log &
+	ip netns exec ns4 ping 192.0.2.100 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.4" arp_ns4-1_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns4-1" || test_fail "$drv_mode $map_type arp ns4-1"
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.4" arp_ns4-2_${drv_mode}_${map_type}.log && \
+		test_pass "$drv_mode $map_type arp ns4-2" || test_fail "$drv_mode $map_type arp ns4-2"
+	grep -q "Request who-has 192.0.2.100 tell 192.0.2.4" arp_ns4-3_${drv_mode}_${map_type}.log && \
+		test_fail "$drv_mode $map_type arp ns4-3" || test_pass "$drv_mode $map_type arp ns4-3"
+
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$drv_mode $map_type ping ns1-2" || test_pass "$drv_mode $map_type ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_pass "$drv_mode $map_type ping ns1-3" || test_fail "$drv_mode $map_type ping ns1-3"
+
+	# ping6 test
+	ip netns exec ns4 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$drv_mode $map_type ping6 ns4-1" || test_pass "$drv_mode $map_type ping6 ns4-1"
+	ip netns exec ns4 ping6 2001:db8::2 -c 4 &> /dev/null && \
+		test_pass "$drv_mode $map_type ping6 ns4-2" || test_fail "$drv_mode $map_type ping6 ns4-2"
+
+	kill $xdp_pid
+}
+
+for mode in ${DRV_MODE}; do
+	sleep 2
+	setup_ns $mode
+	for type in ${MAP_TYPE}; do
+		do_tests $mode $type
+	done
+	sleep 20
+	clean_up
+done
diff --git a/samples/bpf/xdp_redirect_map_multicast_kern.c b/samples/bpf/xdp_redirect_map_multicast_kern.c
new file mode 100644
index 000000000000..f2ac36eed0e9
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multicast_kern.c
@@ -0,0 +1,147 @@
+/* This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+#define MAX_NR_PORTS 65536
+
+/* In this sample we store all ports in group 1, and add two
+ * multicast groups: group 2 for even-numbered interfaces,
+ * group 3 for odd-numbered interfaces.
+ */
+
+/* This is an array map template (NOT used) for multicast group storage.
+ * The format is linear, indexed as index:ifindex, like
+ * [0, 0], [1, 1], [2, 0], [3, 3], [4, 4], [5, 0], [6, 0] ...
+ * which makes it easier to modify and update.
+ *
+ * This map could also be used as a multicast exclude array map.
+ */
+struct bpf_map_def SEC("maps") group_a = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = MAX_NR_PORTS,
+};
+
+/* This is a hash map template (NOT used) for multicast group storage.
+ * The format is sparse, indexed as index:ifindex, like
+ * [1, 1], [3, 3], [4, 4] ...
+ * which saves storage space.
+ *
+ * This map could also be used as a multicast exclude hash map.
+ */
+struct bpf_map_def SEC("maps") group_h = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = MAX_NR_PORTS,
+};
+
+/* This is an array map-in-map; the inner maps store all the
+ * include and exclude array maps.
+ *
+ * The max_entries is MAX_NR_PORTS * 32 as only 3 groups are used.
+ */
+struct bpf_map_def SEC("maps") a_of_group_a = {
+	.type = BPF_MAP_TYPE_ARRAY_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_NR_PORTS * 32,
+};
+
+/* This is a hash map-in-map; the inner maps store all the
+ * include and exclude array maps.
+ */
+struct bpf_map_def SEC("maps") h_of_group_a = {
+	.type = BPF_MAP_TYPE_HASH_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_NR_PORTS * 32,
+};
+
+/* This is a hash map-in-map; the inner maps store all the
+ * include and exclude hash maps.
+ */
+struct bpf_map_def SEC("maps") h_of_group_h = {
+	.type = BPF_MAP_TYPE_HASH_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_NR_PORTS * 32,
+};
+
+/* Note: This map is not used yet; at present we get the group id
+ * based on the IP version.
+ *
+ * This map would store all the include group fds keyed by IP/MAC dest.
+ */
+struct bpf_map_def SEC("maps") mcast_route_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u16),
+	.max_entries = MAX_NR_PORTS,
+};
+
+/* TODO: This is used for broadcast redirecting/forwarding. How should
+ * one-to-one redirecting/forwarding based on the neigh tables be done? */
+SEC("xdp_redirect_map")
+int xdp_redirect_map_prog(struct xdp_md *ctx)
+{
+	u32 key, mcast_group_id, exclude_group_id, redirect_key;
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int *inmap_id;
+	u16 h_proto;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_IPV6))
+		mcast_group_id = 3;
+	else if (h_proto == htons(ETH_P_IP))
+		mcast_group_id = 2;
+	else
+		mcast_group_id = 1;
+
+	exclude_group_id = ctx->ingress_ifindex;
+	redirect_key = (mcast_group_id << 16) | exclude_group_id;
+
+	key = 1 << 16;
+	if ((inmap_id = bpf_map_lookup_elem(&a_of_group_a, &key)))
+		return bpf_redirect_map(&a_of_group_a, redirect_key, 0);
+	else if ((inmap_id = bpf_map_lookup_elem(&h_of_group_a, &key)))
+		return bpf_redirect_map(&h_of_group_a, redirect_key, 0);
+	else if ((inmap_id = bpf_map_lookup_elem(&h_of_group_h, &key)))
+		return bpf_redirect_map(&h_of_group_h, redirect_key, 0);
+
+	return XDP_PASS;
+}
+
+/* FIXME: This prog could not be load by iproute2 as the map-in-map need
+ * set inner map fd first.
+ */
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multicast_user.c b/samples/bpf/xdp_redirect_map_multicast_user.c
new file mode 100644
index 000000000000..a451c90a05b6
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multicast_user.c
@@ -0,0 +1,306 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ */
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+#include <sys/resource.h>
+
+#include "bpf_load.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+#define MAX_NR_PORTS 65536
+
+static int ifaces[MAX_IFACE_NUM] = {};
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static int init_map_in_map(struct bpf_object *obj, char *outmap_name, bool inmap_hash)
+{
+	struct bpf_map *outmap;
+	int inmap_fd, ret;
+
+	inmap_fd = bpf_create_map(inmap_hash ? BPF_MAP_TYPE_DEVMAP_HASH : BPF_MAP_TYPE_DEVMAP, sizeof(__u32), sizeof(int), MAX_NR_PORTS, 0);
+	if (inmap_fd < 0) {
+		printf("Failed to create inner map '%s'!\n", strerror(errno));
+		return 1;
+	}
+	outmap = bpf_object__find_map_by_name(obj, outmap_name);
+	if (!outmap) {
+		printf("Failed to load map %s from test prog\n", outmap_name);
+		return 1;
+	}
+	ret = bpf_map__set_inner_map_fd(outmap, inmap_fd);
+	if (ret) {
+		printf("Failed to set inner_map_fd for map %s\n", outmap_name);
+		return 1;
+	}
+
+	return 0;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n"
+		"    -M    map-in-map mode: aa, ha, or hh (default)\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	struct bpf_object_open_attr obj_open_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	struct bpf_map *outmap;
+	char *outmap_name;
+	char ifname[IF_NAMESIZE];
+	int pro_fd, inmap_fd, outmap_fd;
+	int i, j, ret, opt, ifindex;
+	__u32 inmap_id, key;
+	char filename[256];
+	bool inmap_hash;
+	char *mode = "hh";	/* default: hash of hash */
+
+	while ((opt = getopt(argc, argv, "SNFM:")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		case 'M':
+			mode = optarg;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	/* array of array */
+	if (strncmp(mode, "aa", 2) == 0) {
+		outmap_name = "a_of_group_a";
+		inmap_hash = false;
+	/* hash of array */
+	} else if (strncmp(mode, "ha", 2) == 0) {
+		outmap_name = "h_of_group_a";
+		inmap_hash = false;
+	/* hash of hash */
+	} else {
+		outmap_name = "h_of_group_h";
+		inmap_hash = true;
+	}
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		perror("setrlimit(RLIMIT_MEMLOCK)");
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or index");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	/* open bpf obj, set inner fd for out map-in-map and load bpf obj */
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	obj_open_attr.file = filename;
+	obj = bpf_object__open_xattr(&obj_open_attr);
+	if (libbpf_get_error(obj)) {
+		printf("Failed to open bpf obj file %s\n", filename);
+		return 1;
+	}
+
+	/* We need set inner map fd for all outmaps. */
+	if (init_map_in_map(obj, "a_of_group_a", false)) {
+		printf("Failed to init out map a_of_group_a\n");
+		goto err_out;
+	}
+	if (init_map_in_map(obj, "h_of_group_a", false)) {
+		printf("Failed to init out map h_of_group_a\n");
+		goto err_out;
+	}
+	if (init_map_in_map(obj, "h_of_group_h", true)) {
+		printf("Failed to init out map h_of_group_h\n");
+		goto err_out;
+	}
+
+	if (bpf_object__load(obj)) {
+		printf("Failed to load bpf obj\n");
+		goto err_out;
+	}
+
+	prog = bpf_program__next(NULL, obj);
+	pro_fd = bpf_program__fd(prog);
+
+	outmap = bpf_object__find_map_by_name(obj, outmap_name);
+	if (!outmap) {
+		printf("Failed to load map %s from test prog\n", outmap_name);
+		goto err_out;
+	}
+	outmap_fd = bpf_map__fd(outmap);
+	if (outmap_fd < 0) {
+		printf("Failed to get fd from map %s\n", outmap_name);
+		goto err_out;
+	}
+	/* Init 3 multicast groups first.
+	 * group 1: all ports
+	 * group 2: even-numbered interfaces
+	 * group 3: odd-numbered interfaces
+	 * You can store the group number in mcast_route_map for future
+	 * IP/MAC -> multicast group lookups.
+	 */
+	for (i = 1; i <= 3; i++) {
+		/* Split the include/exclude groups by 16 bit
+		 * FIXME: is there a flexible way? how to let
+		 * kernel side know this?
+		 */
+		key = i << 16;
+		inmap_fd = bpf_create_map(inmap_hash ? BPF_MAP_TYPE_DEVMAP_HASH : BPF_MAP_TYPE_DEVMAP, sizeof(__u32), sizeof(int), MAX_NR_PORTS, 0);
+		if (inmap_fd < 0) {
+			printf("Failed to create inner map '%s'!\n", strerror(errno));
+			goto err_out;
+		}
+		ret = bpf_map_update_elem(outmap_fd, &key, &inmap_fd, 0);
+		if (ret) {
+			printf("Failed to update map %s\n", outmap_name);
+			goto err_out;
+		}
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Set values for each map
+	 * 1. set all ports to group 1
+	 * 2. set ports to group 2 or 3 based on interface index
+	 * 3. set exclude group for each interface
+	 */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* bind pro_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, pro_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+		/* Add the interface to group 1 */
+		key = 1 << 16;
+		ret = bpf_map_lookup_elem(outmap_fd, &key, &inmap_id);
+		if (ret) {
+			printf("Failed to lookup inmap by key %u from map %s\n", key, outmap_name);
+			goto err_out;
+		}
+		inmap_fd = bpf_map_get_fd_by_id(inmap_id);
+		ret = bpf_map_update_elem(inmap_fd, &ifindex, &ifindex, 0);
+		if (ret) {
+			printf("Failed to update key %d, value %d for inmap id %u\n", ifindex, ifindex, inmap_id);
+			goto err_out;
+		}
+
+		/* Add the even number ifaces to group 2 and odd ifaces to group 3 */
+		if (i % 2 == 0)
+			key = 2 << 16;
+		else
+			key = 3 << 16;
+		ret = bpf_map_lookup_elem(outmap_fd, &key, &inmap_id);
+		if (ret) {
+			printf("Failed to lookup inmap by key %u from map %s\n", key, outmap_name);
+			goto err_out;
+		}
+		inmap_fd = bpf_map_get_fd_by_id(inmap_id);
+		ret = bpf_map_update_elem(inmap_fd, &ifindex, &ifindex, 0);
+		if (ret) {
+			printf("Failed to update key %d, value %d for inmap id %u\n", ifindex, ifindex, inmap_id);
+			goto err_out;
+		}
+
+		/* Set the exclude map for the interfaces */
+		key = ifindex;
+		inmap_fd = bpf_create_map(inmap_hash ? BPF_MAP_TYPE_DEVMAP_HASH : BPF_MAP_TYPE_DEVMAP, sizeof(__u32), sizeof(int), MAX_NR_PORTS, 0);
+		if (inmap_fd < 0) {
+			printf("Failed to create inner map '%s'!\n", strerror(errno));
+			goto err_out;
+		}
+		ret = bpf_map_update_elem(inmap_fd, &ifindex, &ifindex, 0);
+		if (ret) {
+			printf("Failed to update key %d, value %d for exclude inmap\n", ifindex, ifindex);
+			goto err_out;
+		}
+
+		/* Test the exclude map by excluding all the interfaces except
+		 * the first one. The first two interfaces are not affected, e.g.
+		 * iface_1 = [1], iface_2 = [2], iface_3 = [3, 4, ...],
+		 * iface_4 = [3, 4, ...]
+		 */
+		for (j = 2; ifaces[j] > 0; j++) {
+			if (i <= 1 || i == j)
+				continue;
+			ifindex = ifaces[j];
+			ret = bpf_map_update_elem(inmap_fd, &ifindex, &ifindex, 0);
+			if (ret) {
+				printf("Failed to update key %d, value %d for exclude inmap\n", ifindex, ifindex);
+				goto err_out;
+			}
+		}
+		ret = bpf_map_update_elem(outmap_fd, &key, &inmap_fd, 0);
+		if (ret) {
+			printf("Failed to update map %s\n", outmap_name);
+			goto err_out;
+		}
+	}
+
+	sleep(600);
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.19.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next 1/2] xdp: add dev map multicast support
  2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
@ 2020-04-20  9:52   ` Hangbin Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-04-20  9:52 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, Alexei Starovoitov, bpf

Hi Daniel,

Would you please help review the RFC and give some comments?
Especially the ifindex parameter of bpf_redirect_map(), which contains
both the include and exclude map keys. Should we keep the current
design, find a way to make it more flexible, or even add a new helper
that accepts two index parameters?

Thanks
Hangbin

On Wed, Apr 15, 2020 at 04:54:36PM +0800, Hangbin Liu wrote:
> This is a prototype for xdp multicast support. In this implementation we
> use map-in-map to store the multicast groups, because we may have both
> include and exclude groups on one interface.
> 
> The include and exclude groups are separated by the 32-bit map key:
> the high 16 bits are used for include groups and the low 16 bits
> for exclude groups.
> 
> The general data path is kept in net/core/filter.c. The native data
> path is in kernel/bpf/devmap.c so we can use direct calls to
> get better performance.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/bpf.h   |  29 +++++++++++
>  include/net/xdp.h     |   1 +
>  kernel/bpf/arraymap.c |   2 +-
>  kernel/bpf/devmap.c   | 118 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/hashtab.c  |   2 +-
>  kernel/bpf/verifier.c |  15 +++++-
>  net/core/filter.c     |  69 +++++++++++++++++++++++-
>  net/core/xdp.c        |  26 ++++++++++
>  8 files changed, 256 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fd2b2322412d..72797667bca8 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1156,11 +1156,17 @@ struct sk_buff;
>  
>  struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
>  struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key);
> +void *array_of_map_lookup_elem(struct bpf_map *map, void *key);
> +void *htab_of_map_lookup_elem(struct bpf_map *map, void *key);
>  void __dev_flush(void);
>  int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map);
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, u32 index);
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog);
>  
> @@ -1276,6 +1282,16 @@ static inline struct net_device  *__dev_map_hash_lookup_elem(struct bpf_map *map
>  	return NULL;
>  }
>  
> +static inline void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return NULL;
> +}
> +
> +static inline void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return NULL;
> +}
> +
>  static inline void __dev_flush(void)
>  {
>  }
> @@ -1297,6 +1313,19 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return 0;
>  }
>  
> +static inline
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map)
> +{
> +	return true;
> +}
> +
> +static inline
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, u32 index)
> +{
> +	return 0;
> +}
> +
>  struct sk_buff;
>  
>  static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 40c6d3398458..a214dce8579c 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -92,6 +92,7 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
>  }
>  
>  struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
>  
>  /* Convert xdp_buff to xdp_frame */
>  static inline
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 95d77770353c..26ac66a05015 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -1031,7 +1031,7 @@ static void array_of_map_free(struct bpf_map *map)
>  	fd_array_map_free(map);
>  }
>  
> -static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
> +void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
>  {
>  	struct bpf_map **inner_map = array_map_lookup_elem(map, key);
>  
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 58bdca5d978a..3a60cb209ae1 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -85,6 +85,9 @@ static DEFINE_PER_CPU(struct list_head, dev_flush_list);
>  static DEFINE_SPINLOCK(dev_map_lock);
>  static LIST_HEAD(dev_map_list);
>  
> +static void *dev_map_lookup_elem(struct bpf_map *map, void *key);
> +static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key);
> +
>  static struct hlist_head *dev_map_create_hash(unsigned int entries)
>  {
>  	int i;
> @@ -456,6 +459,121 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return __xdp_enqueue(dev, xdp, dev_rx);
>  }
>  
> +/* Use direct call in fast path instead of  map->ops->map_get_next_key() */
> +static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +
> +	switch (map->map_type) {
> +	case BPF_MAP_TYPE_DEVMAP:
> +		return dev_map_get_next_key(map, key, next_key);
> +	case BPF_MAP_TYPE_DEVMAP_HASH:
> +		return dev_map_hash_get_next_key(map, key, next_key);
> +	default:
> +		break;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map)
> +{
> +	struct bpf_dtab_netdev *in_obj = NULL;
> +	u32 key, next_key;
> +	int err;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			in_obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			in_obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
> +			return true;
> +
> +		err = devmap_get_next_key(map, &key, &next_key);
> +
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return false;
> +}
> +
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, u32 index)
> +{
> +	struct bpf_dtab_netdev *obj = NULL;
> +	struct bpf_map *in_map, *ex_map;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	struct net_device *dev;
> +	u32 in_index, ex_index;
> +	u32 key, next_key;
> +	int err;
> +
> +	in_index = index >> 16;
> +	in_index = in_index << 16;
> +	ex_index = in_index ^ index;
> +
> +	in_map = map->ops->map_lookup_elem(map, &in_index);
> +	/* ex_map could be NULL */
> +	ex_map = map->ops->map_lookup_elem(map, &ex_index);
> +
> +	devmap_get_next_key(in_map, NULL, &key);
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	for (;;) {
> +		switch (in_map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			obj = __dev_map_lookup_elem(in_map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			obj = __dev_map_hash_lookup_elem(in_map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +		if (!obj)
> +			goto find_next;
> +
> +		if (ex_map && !dev_in_exclude_map(obj, ex_map)) {
> +			dev = obj->dev;
> +
> +			if (!dev->netdev_ops->ndo_xdp_xmit)
> +				return -EOPNOTSUPP;
> +
> +			err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> +			if (unlikely(err))
> +				return err;
> +
> +			nxdpf = xdpf_clone(xdpf);
> +			if (unlikely(!nxdpf))
> +				return -ENOMEM;
> +
> +			bq_enqueue(dev, nxdpf, dev_rx);
> +		}
> +find_next:
> +		err = devmap_get_next_key(in_map, &key, &next_key);
> +		if (err)
> +			break;
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog)
>  {
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index d541c8486c95..4e0a2eebd38d 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1853,7 +1853,7 @@ static struct bpf_map *htab_of_map_alloc(union bpf_attr *attr)
>  	return map;
>  }
>  
> -static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
> +void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
>  {
>  	struct bpf_map **inner_map  = htab_map_lookup_elem(map, key);
>  
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 04c6630cc18f..84d23418823a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3898,7 +3898,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		break;
>  	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
>  	case BPF_MAP_TYPE_HASH_OF_MAPS:
> -		if (func_id != BPF_FUNC_map_lookup_elem)
> +		/* Used by multicast redirect */
> +		if (func_id != BPF_FUNC_redirect_map &&
> +		    func_id != BPF_FUNC_map_lookup_elem)
>  			goto error;
>  		break;
>  	case BPF_MAP_TYPE_SOCKMAP:
> @@ -3968,8 +3970,17 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
>  		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH &&
>  		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
> -		    map->map_type != BPF_MAP_TYPE_XSKMAP)
> +		    map->map_type != BPF_MAP_TYPE_XSKMAP &&
> +		    map->map_type != BPF_MAP_TYPE_ARRAY_OF_MAPS &&
> +		    map->map_type != BPF_MAP_TYPE_HASH_OF_MAPS)
>  			goto error;
> +		if (map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
> +		    map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
> +			/* FIXME: Maybe we should also restrict the key size here? */
> +			if (map->inner_map_meta->map_type != BPF_MAP_TYPE_DEVMAP &&
> +			    map->inner_map_meta->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
> +				goto error;
> +		}
>  		break;
>  	case BPF_FUNC_sk_redirect_map:
>  	case BPF_FUNC_msg_redirect_map:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7628b947dbc3..7d2076f5b0a4 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>  };
>  
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> -			    struct bpf_map *map, struct xdp_buff *xdp)
> +			    struct bpf_map *map, struct xdp_buff *xdp,
> +			    u32 index)
>  {
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
> +		/* fall through */
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
>  		return dev_map_enqueue(fwd, xdp, dev_rx);
> +	case BPF_MAP_TYPE_HASH_OF_MAPS:
> +	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
> +		return dev_map_enqueue_multi(xdp, dev_rx, map, index);
>  	case BPF_MAP_TYPE_CPUMAP:
>  		return cpu_map_enqueue(fwd, xdp, dev_rx);
>  	case BPF_MAP_TYPE_XSKMAP:
> @@ -3508,6 +3513,10 @@ static inline void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
>  		return __cpu_map_lookup_elem(map, index);
>  	case BPF_MAP_TYPE_XSKMAP:
>  		return __xsk_map_lookup_elem(map, index);
> +	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
> +		return array_of_map_lookup_elem(map, (index >> 16) << 16);
> +	case BPF_MAP_TYPE_HASH_OF_MAPS:
> +		return htab_of_map_lookup_elem(map, (index >> 16) << 16);
>  	default:
>  		return NULL;
>  	}
> @@ -3552,7 +3561,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  		err = dev_xdp_enqueue(fwd, xdp, dev);
>  	} else {
> -		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
> +		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, index);
>  	}
>  
>  	if (unlikely(err))
> @@ -3566,6 +3575,55 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>  
> +static int dev_map_redirect_multi(struct sk_buff *skb, struct bpf_prog *xdp_prog,
> +				  struct bpf_map *map, u32 index)
> +
> +{
> +	struct bpf_map *in_map, *ex_map;
> +	struct bpf_dtab_netdev *dst;
> +	u32 in_index, ex_index;
> +	struct sk_buff *nskb;
> +	u32 key, next_key;
> +	int err;
> +	void *fwd;
> +
> +	in_index = index >> 16;
> +	in_index = in_index << 16;
> +	ex_index = in_index ^ index;
> +
> +	in_map = map->ops->map_lookup_elem(map, &in_index);
> +	/* ex_map could be NULL */
> +	ex_map = map->ops->map_lookup_elem(map, &ex_index);
> +
> +	in_map->ops->map_get_next_key(in_map, NULL, &key);
> +
> +	for (;;) {
> +		fwd = __xdp_map_lookup_elem(in_map, key);
> +		if (fwd) {
> +			dst = (struct bpf_dtab_netdev *)fwd;
> +			if (ex_map && dev_in_exclude_map(dst, ex_map))
> +				goto find_next;
> +
> +			nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (!nskb)
> +				return -EOVERFLOW;
> +
> +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> +			if (unlikely(err))
> +				return err;
> +		}
> +
> +find_next:
> +		err = in_map->ops->map_get_next_key(in_map, &key, &next_key);
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}
> +
>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct sk_buff *skb,
>  				       struct xdp_buff *xdp,
> @@ -3588,6 +3646,13 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  		err = dev_map_generic_redirect(dst, skb, xdp_prog);
>  		if (unlikely(err))
>  			goto err;
> +	} else if (map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
> +		   map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
> +		/* Do multicast redirecting */
> +		err = dev_map_redirect_multi(skb, xdp_prog, map, index);
> +		if (unlikely(err))
> +			goto err;
> +		consume_skb(skb);
>  	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
>  		struct xdp_sock *xs = fwd;
>  
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 4c7ea85486af..70dfb4910f84 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>  	return xdpf;
>  }
>  EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
> +
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> +{
> +	unsigned int headroom, totalsize;
> +	struct xdp_frame *nxdpf;
> +	struct page *page;
> +	void *addr;
> +
> +	headroom = xdpf->headroom + sizeof(*xdpf);
> +	totalsize = headroom + xdpf->len;
> +
> +	if (unlikely(totalsize > PAGE_SIZE))
> +		return NULL;
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +	addr = page_to_virt(page);
> +
> +	memcpy(addr, xdpf, totalsize);
> +
> +	nxdpf = addr;
> +	nxdpf->data = addr + headroom;
> +
> +	return nxdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdpf_clone);
> -- 
> 2.19.2
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 1/2] " Hangbin Liu
  2020-04-15  8:54 ` [RFC PATCH bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
@ 2020-04-24  8:56 ` Hangbin Liu
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
  2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  3 siblings, 2 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-04-24  8:56 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

Hi all,

This is a prototype for xdp multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple
ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I implement a new helper, bpf_redirect_map_multi(), which
accepts two maps: the forwarding map and the exclude map. If users don't
want to use an exclude map and simply want to stop redirecting back to the
ingress device, they can use the BPF_F_EXCLUDE_INGRESS flag.

For this RFC series we are primarily looking for feedback on the concept
and API: the example in patch 2 is functional, but not a lot of effort
has been made on performance optimisation.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v2: After discussing with Jiri, Toke, Jesper and Eelco, we think v1 relies
on a trick that may confuse users. So let's just add a new helper to make
the implementation clearer.

Hangbin Liu (2):
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test

 include/linux/bpf.h                       |  20 +++
 include/linux/filter.h                    |   1 +
 include/net/xdp.h                         |   1 +
 include/uapi/linux/bpf.h                  |  23 ++-
 kernel/bpf/devmap.c                       | 114 +++++++++++++++
 kernel/bpf/verifier.c                     |   6 +
 net/core/filter.c                         |  98 ++++++++++++-
 net/core/xdp.c                            |  26 ++++
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 124 ++++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 100 +++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 170 ++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h            |  23 ++-
 13 files changed, 702 insertions(+), 7 deletions(-)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

-- 
2.19.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-04-24  8:56   ` Hangbin Liu
  2020-04-24 14:19     ` Lorenzo Bianconi
  2020-04-24 14:34     ` Toke Høiland-Jørgensen
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
  1 sibling, 2 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-04-24  8:56 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a prototype for xdp multicast support. In this implementation we
add a new helper that accepts two maps: the forward map and the exclude
map. We will redirect the packet to all the interfaces in the *forward
map*, but exclude the interfaces that are in the *exclude map*.

To achieve this I add a new ex_map field to struct bpf_redirect_info.
In the helper I set tgt_value to NULL to distinguish this case from
bpf_xdp_redirect_map().

We also add a flag, *BPF_F_EXCLUDE_INGRESS*, in case you don't want to
create an exclude map for each interface and just want to exclude the
ingress interface.

The general data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls to
get better performance.

v2: add a new helper, bpf_xdp_redirect_map_multi(), which accepts
include/exclude maps directly.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h            |  20 ++++++
 include/linux/filter.h         |   1 +
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  23 ++++++-
 kernel/bpf/devmap.c            | 114 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |   6 ++
 net/core/filter.c              |  98 ++++++++++++++++++++++++++--
 net/core/xdp.c                 |  26 ++++++++
 tools/include/uapi/linux/bpf.h |  23 ++++++-
 9 files changed, 305 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fd2b2322412d..3fd2903def3f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1161,6 +1161,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 
@@ -1297,6 +1302,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	return false;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9b5aa5c483cc..5b4e1ccd2d37 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -614,6 +614,7 @@ struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	struct bpf_map *ex_map;
 	u32 kern_flags;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 40c6d3398458..a214dce8579c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -92,6 +92,7 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
 }
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 /* Convert xdp_buff to xdp_frame */
 static inline
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2e29a671d67e..1dbe42290223 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3025,6 +3025,21 @@ union bpf_attr {
  *		* **-EOPNOTSUPP**	Unsupported operation, for example a
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to all the interfaces in *map*, and
+ * 		Redirect the packet to all the interfaces in *map*, and
+ * 		exclude the interfaces that are in *ex_map*. The *ex_map*
+ * 		may be NULL.
+ *
+ * 		Currently *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which excludes redirecting to the ingress device.
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3151,7 +3166,8 @@ union bpf_attr {
 	FN(xdp_output),			\
 	FN(get_netns_cookie),		\
 	FN(get_current_ancestor_cgroup_id),	\
-	FN(sk_assign),
+	FN(sk_assign),			\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 58bdca5d978a..34b171f7826c 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -456,6 +456,120 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct call in fast path instead of map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	struct bpf_dtab_netdev *in_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	if (!map)
+		return false;
+
+	if (obj->dev->ifindex == exclude_ifindex)
+		return true;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			in_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			in_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	struct bpf_dtab_netdev *obj = NULL;
+	struct xdp_frame *xdpf, *nxdpf;
+	struct net_device *dev;
+	u32 key, next_key;
+	int err;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (!obj || dev_in_exclude_map(obj, ex_map,
+					       exclude_ingress ? dev_rx->ifindex : 0))
+			goto find_next;
+
+		dev = obj->dev;
+
+		if (!dev->netdev_ops->ndo_xdp_xmit)
+			return -EOPNOTSUPP;
+
+		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+		if (unlikely(err))
+			return err;
+
+		nxdpf = xdpf_clone(xdpf);
+		if (unlikely(!nxdpf))
+			return -ENOMEM;
+
+		bq_enqueue(dev, nxdpf, dev_rx);
+
+find_next:
+		err = devmap_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+		key = next_key;
+	}
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 38cfcf701eeb..f77213a0e354 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3880,6 +3880,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_redirect_map_multi &&
 		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
@@ -3970,6 +3971,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map_multi:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
+		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+			goto error;
+		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
 	case BPF_FUNC_sock_map_update:
diff --git a/net/core/filter.c b/net/core/filter.c
index 7d6ceaa54d21..94d1530e5ac6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    struct bpf_map *ex_map, bool exclude_ingress)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
+		if (fwd)
+			return dev_map_enqueue(fwd, xdp, dev_rx);
+		else
+			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
+						     exclude_ingress);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
 	struct bpf_map *map = READ_ONCE(ri->map);
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
@@ -3552,7 +3559,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
 	}
 
 	if (unlikely(err))
@@ -3566,6 +3573,49 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, struct bpf_map *ex_map,
+				  bool exclude_ingress)
+{
+	struct bpf_dtab_netdev *dst;
+	struct sk_buff *nskb;
+	u32 key, next_key;
+	int err;
+	void *fwd;
+
+	/* Get first key from forward map */
+	map->ops->map_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (dev_in_exclude_map(dst, ex_map,
+					       exclude_ingress ? dev->ifindex : 0))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -ENOMEM;
+
+			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+			if (unlikely(err))
+				return err;
+		}
+
+find_next:
+		err = map->ops->map_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3573,6 +3623,8 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err = 0;
@@ -3583,9 +3635,16 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
 	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		struct bpf_dtab_netdev *dst = fwd;
+		if (fwd) {
+			struct bpf_dtab_netdev *dst = fwd;
+
+			err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		} else {
+			/* Deal with multicast maps */
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ex_map, exclude_ingress);
+		}
 
-		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
@@ -3699,6 +3758,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
+	   struct bpf_map *, ex_map, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
+		return XDP_ABORTED;
+
+	ri->tgt_index = 0;
+	ri->tgt_value = NULL;
+	ri->flags = flags;
+
+	WRITE_ONCE(ri->map, map);
+	WRITE_ONCE(ri->ex_map, ex_map);
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
+	.func           = bpf_xdp_redirect_map_multi,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -6304,6 +6390,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_redirect_map_multi:
+		return &bpf_xdp_redirect_map_multi_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4c7ea85486af..70dfb4910f84 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	return xdpf;
 }
 EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2e29a671d67e..1dbe42290223 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3025,6 +3025,21 @@ union bpf_attr {
  *		* **-EOPNOTSUPP**	Unsupported operation, for example a
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to all the interfaces in *map*, but
+ * 		exclude the interfaces that are in *ex_map*. The *ex_map*
+ * 		may be NULL.
+ *
+ * 		Currently *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which excludes redirecting to the ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		a packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3151,7 +3166,8 @@ union bpf_attr {
 	FN(xdp_output),			\
 	FN(get_netns_cookie),		\
 	FN(get_current_ancestor_cgroup_id),	\
-	FN(sk_assign),
+	FN(sk_assign),			\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.19.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-04-24  8:56   ` Hangbin Liu
  2020-04-24 14:21     ` Lorenzo Bianconi
  1 sibling, 1 reply; 22+ messages in thread
From: Hangbin Liu @ 2020-04-24  8:56 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a sample for xdp multicast. The sample has 3 forward groups and
1 exclude group. It redirects each interface's packets to all the
interfaces in the matching forward group, excluding any interface that
is in the exclude map.

For more testing details, please see the test description in
xdp_redirect_map_multi.sh.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 124 ++++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 100 +++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 170 ++++++++++++++++++++++
 4 files changed, 397 insertions(+)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 424f6fe7ce38..eb7306efe85e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multi_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multi.sh b/samples/bpf/xdp_redirect_map_multi.sh
new file mode 100755
index 000000000000..1999f261a1e8
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi.sh
@@ -0,0 +1,124 @@
+#!/bin/bash
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Forward multicast groups:
+#     Forward group all has interfaces: veth1, veth2, veth3, veth4 (All traffic except IPv4, IPv6)
+#     Forward group v4 has interfaces: veth1, veth3, veth4 (For IPv4 traffic only)
+#     Forward group v6 has interfaces: veth2, veth3, veth4 (For IPv6 traffic only)
+# Exclude Groups:
+#     Exclude group: veth3 (assume ns3 is blacklisted)
+#
+# Test modules:
+# XDP modes: generic, native
+# map types: group v4 use DEVMAP, others use DEVMAP_HASH
+#
+# Test cases:
+#     ARP (we didn't exclude ns3 in kern.c for ARP):
+#        ns1 -> gw: ns2, ns3, ns4 should receive the arp request
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (fail), ns1 -> ns4 (pass)
+#     IPv6
+#        ns2 -> ns1 (fail), ns2 -> ns3 (fail), ns2 -> ns4 (pass)
+#
+
+
+# netns numbers
+NUM=4
+IFACES=""
+DRV_MODE="drv generic"
+
+test_pass()
+{
+	echo "Pass: $@"
+}
+
+test_fail()
+{
+	echo "Fail: $@"
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip netns del ns$i
+	done
+}
+
+setup_ns()
+{
+	local mode=$1
+
+	for i in $(seq $NUM); do
+		ip netns add ns$i
+		ip link add veth0 type veth peer name veth$i
+		ip link set veth0 netns ns$i
+		ip netns exec ns$i ip link set veth0 up
+		ip link set veth$i up
+
+		ip netns exec ns$i ip addr add 192.0.2.$i/24 dev veth0
+		ip netns exec ns$i ip addr add 2001:db8::$i/64 dev veth0
+		ip netns exec ns$i ip link set veth0 xdp$mode obj \
+			xdp_redirect_map_multi_kern.o sec xdp_redirect_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_tests()
+{
+	local drv_mode=$1
+	local drv_p
+
+	[ ${drv_mode} == "drv" ] && drv_p="-N" || drv_p="-S"
+
+	./xdp_redirect_map_multi $drv_p $IFACES &> xdp_${drv_mode}.log &
+	xdp_pid=$!
+	sleep 10
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${drv_mode}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${drv_mode}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${drv_mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-2_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-2" || test_fail "$drv_mode arp ns1-2"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-3_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-3" || test_fail "$drv_mode arp ns1-3"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-4_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-4" || test_fail "$drv_mode arp ns1-4"
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-2" || test_pass "$drv_mode ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-3" || test_pass "$drv_mode ping ns1-3"
+	ip netns exec ns1 ping 192.0.2.4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping ns1-4" || test_fail "$drv_mode ping ns1-4"
+
+	# ping6 test
+	ip netns exec ns2 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-1" || test_pass "$drv_mode ping6 ns2-1"
+	ip netns exec ns2 ping6 2001:db8::3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-3" || test_pass "$drv_mode ping6 ns2-3"
+	ip netns exec ns2 ping6 2001:db8::4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping6 ns2-4" || test_fail "$drv_mode ping6 ns2-4"
+
+	kill $xdp_pid
+}
+
+for mode in ${DRV_MODE}; do
+	sleep 2
+	setup_ns $mode
+	do_tests $mode
+	sleep 20
+	clean_up
+done
diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
new file mode 100644
index 000000000000..c98985683ba2
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_kern.c
@@ -0,0 +1,100 @@
+/* This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+/* In this sample we will use 3 forward maps and 1 exclude map to
+ * show how to use the helper bpf_redirect_map_multi().
+ *
+ * In the real world there may be multiple forward maps and exclude maps.
+ * You can use the map-in-map type to store the forward and exclude maps, e.g.
+ * forward_map_in_map[group_a_index] = forward_group_a_map
+ * forward_map_in_map[group_b_index] = forward_group_b_map
+ * exclude_map_in_map[iface_1_index] = iface_1_exclude_map
+ * exclude_map_in_map[iface_2_index] = iface_2_exclude_map
+ * Then store the forward group indexes based on IP/MAC policy in another
+ * hash map, e.g.:
+ * mcast_route_map[hash(subnet_a)] = group_a_index
+ * mcast_route_map[hash(subnet_b)] = group_b_index
+ *
+ * You can init the maps in user.c and find the forward group index from
+ * mcast_route_map by key hash(subnet) in kern.c. Then you can find
+ * the forward group by the group index. You can also get the exclude map
+ * simply by iface index in exclude_map_in_map.
+ */
+struct bpf_map_def SEC("maps") forward_map_v4 = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_v6 = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_all = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") exclude_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	u32 key, mcast_group_id, exclude_group_id;
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int *inmap_id;
+	u16 h_proto;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_IP))
+		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else if (h_proto == htons(ETH_P_IPV6))
+		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else
+		return bpf_redirect_map_multi(&forward_map_all, NULL,
+					      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
new file mode 100644
index 000000000000..2fcd15322201
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static int ifaces[MAX_IFACE_NUM] = {};
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ... \n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int prog_fd, group_all, group_v4, group_v6, exclude;
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or index");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
+	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
+	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
+	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
+
+	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* For testing: remove the 2nd interfaces from group v4 */
+		if (i != 1) {
+			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: remove the 1st interfaces from group v6 */
+		if (i != 0) {
+			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: add the 3rd interfaces to exclude map */
+		if (i == 2) {
+			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+	}
+
+	sleep(600);
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.19.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-04-24 14:19     ` Lorenzo Bianconi
  2020-04-28 11:09       ` Eelco Chaudron
  2020-05-06  9:35       ` Hangbin Liu
  2020-04-24 14:34     ` Toke Høiland-Jørgensen
  1 sibling, 2 replies; 22+ messages in thread
From: Lorenzo Bianconi @ 2020-04-24 14:19 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann


[-- Attachment #1: Type: text/plain, Size: 13985 bytes --]

> This is a prototype for xdp multicast support. In this implementation we
> add a new helper to accept two maps, forward map and exclude map.
> We will redirect the packet to all the interfaces in *forward map*, but
> exclude the interfaces that in *exclude map*.
> 
> To achieve this I add a new ex_map field to struct bpf_redirect_info.
> In the helper I set tgt_value to NULL to distinguish this case from
> bpf_xdp_redirect_map().
> 
> We also add a flag *BPF_F_EXCLUDE_INGRESS* in case you don't want to
> create an exclude map for each interface and just want to exclude the
> ingress interface.
> 
> The general data path is kept in net/core/filter.c. The native data
> path is in kernel/bpf/devmap.c so we can use direct calls to
> get better performance.
> 
> v2: add new syscall bpf_xdp_redirect_map_multi() which could accept
> include/exclude maps directly.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/bpf.h            |  20 ++++++
>  include/linux/filter.h         |   1 +
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  23 ++++++-
>  kernel/bpf/devmap.c            | 114 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |   6 ++
>  net/core/filter.c              |  98 ++++++++++++++++++++++++++--
>  net/core/xdp.c                 |  26 ++++++++
>  tools/include/uapi/linux/bpf.h |  23 ++++++-
>  9 files changed, 305 insertions(+), 7 deletions(-)
> 

[...]

> +{
> +
> +	switch (map->map_type) {
> +	case BPF_MAP_TYPE_DEVMAP:
> +		return dev_map_get_next_key(map, key, next_key);
> +	case BPF_MAP_TYPE_DEVMAP_HASH:
> +		return dev_map_hash_get_next_key(map, key, next_key);
> +	default:
> +		break;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	struct bpf_dtab_netdev *in_obj = NULL;
> +	u32 key, next_key;
> +	int err;
> +
> +	if (!map)
> +		return false;

Done this way, it seems mandatory to define an exclude_map even if we just
want to avoid forwarding the packet to the "ingress" interface.
Moreover, I was thinking that we could assume we never forward to the
incoming interface; that would make the code simpler, I guess. Is there a
use case for forwarding even to the ingress interface?

> +
> +	if (obj->dev->ifindex == exclude_ifindex)
> +		return true;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			in_obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			in_obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
> +			return true;
> +
> +		err = devmap_get_next_key(map, &key, &next_key);
> +
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return false;
> +}
> +
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	struct bpf_dtab_netdev *obj = NULL;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	struct net_device *dev;
> +	u32 key, next_key;
> +	int err;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (!obj || dev_in_exclude_map(obj, ex_map,
> +					       exclude_ingress ? dev_rx->ifindex : 0))
> +			goto find_next;
> +
> +		dev = obj->dev;
> +
> +		if (!dev->netdev_ops->ndo_xdp_xmit)
> +			return -EOPNOTSUPP;
> +
> +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> +		if (unlikely(err))
> +			return err;
> +
> +		nxdpf = xdpf_clone(xdpf);
> +		if (unlikely(!nxdpf))
> +			return -ENOMEM;
> +
> +		bq_enqueue(dev, nxdpf, dev_rx);
> +
> +find_next:
> +		err = devmap_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +		key = next_key;
> +	}

Do we need to free the 'incoming' xdp buffer here? I think most drivers
assume the packet is owned by the stack if xdp_do_redirect() returns 0.

> +
> +	return 0;
> +}
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog)
>  {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 38cfcf701eeb..f77213a0e354 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3880,6 +3880,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
>  		if (func_id != BPF_FUNC_redirect_map &&
> +		    func_id != BPF_FUNC_redirect_map_multi &&
>  		    func_id != BPF_FUNC_map_lookup_elem)
>  			goto error;
>  		break;
> @@ -3970,6 +3971,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		    map->map_type != BPF_MAP_TYPE_XSKMAP)
>  			goto error;
>  		break;
> +	case BPF_FUNC_redirect_map_multi:
> +		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
> +		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
> +			goto error;
> +		break;
>  	case BPF_FUNC_sk_redirect_map:
>  	case BPF_FUNC_msg_redirect_map:
>  	case BPF_FUNC_sock_map_update:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7d6ceaa54d21..94d1530e5ac6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>  };
>  
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> -			    struct bpf_map *map, struct xdp_buff *xdp)
> +			    struct bpf_map *map, struct xdp_buff *xdp,
> +			    struct bpf_map *ex_map, bool exclude_ingress)
>  {
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		return dev_map_enqueue(fwd, xdp, dev_rx);
> +		if (fwd)
> +			return dev_map_enqueue(fwd, xdp, dev_rx);
> +		else
> +			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
> +						     exclude_ingress);

I guess it would be better not to make this the default case. Maybe you can
add a bit in *flags* to mark the redirect as "multicast".
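Something along these lines, with a hypothetical BPF_F_MULTI_MAP flag bit
standing in for the implicit "fwd == NULL means multicast" convention
(untested, and the flag would need to be plumbed through to this function):

```c
	case BPF_MAP_TYPE_DEVMAP:
	case BPF_MAP_TYPE_DEVMAP_HASH:
		/* hypothetical flag instead of the fwd == NULL sentinel */
		if (flags & BPF_F_MULTI_MAP)
			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
						     exclude_ingress);
		return dev_map_enqueue(fwd, xdp, dev_rx);
```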

>  	case BPF_MAP_TYPE_CPUMAP:
>  		return cpu_map_enqueue(fwd, xdp, dev_rx);
>  	case BPF_MAP_TYPE_XSKMAP:
> @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>  	struct bpf_map *map = READ_ONCE(ri->map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
> @@ -3552,7 +3559,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  		err = dev_xdp_enqueue(fwd, xdp, dev);
>  	} else {
> -		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
> +		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
>  	}
>  
>  	if (unlikely(err))
> @@ -3566,6 +3573,49 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>  
> +static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
> +				  struct bpf_prog *xdp_prog,
> +				  struct bpf_map *map, struct bpf_map *ex_map,
> +				  bool exclude_ingress)
> +
> +{
> +	struct bpf_dtab_netdev *dst;
> +	struct sk_buff *nskb;
> +	u32 key, next_key;
> +	int err;
> +	void *fwd;
> +
> +	/* Get first key from forward map */
> +	map->ops->map_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		fwd = __xdp_map_lookup_elem(map, key);
> +		if (fwd) {
> +			dst = (struct bpf_dtab_netdev *)fwd;
> +			if (dev_in_exclude_map(dst, ex_map,
> +					       exclude_ingress ? dev->ifindex : 0))
> +				goto find_next;
> +
> +			nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (!nskb)
> +				return -EOVERFLOW;
> +
> +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> +			if (unlikely(err))
> +				return err;
> +		}
> +
> +find_next:
> +		err = map->ops->map_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}
> +
>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct sk_buff *skb,
>  				       struct xdp_buff *xdp,
> @@ -3573,6 +3623,8 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct bpf_map *map)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
>  	int err = 0;
> @@ -3583,9 +3635,16 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  
>  	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
>  	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
> -		struct bpf_dtab_netdev *dst = fwd;
> +		if (fwd) {
> +			struct bpf_dtab_netdev *dst = fwd;
> +
> +			err = dev_map_generic_redirect(dst, skb, xdp_prog);
> +		} else {
> +			/* Deal with multicast maps */
> +			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> +						     ex_map, exclude_ingress);
> +		}
>  
> -		err = dev_map_generic_redirect(dst, skb, xdp_prog);
>  		if (unlikely(err))
>  			goto err;
>  	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
> @@ -3699,6 +3758,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
>  	.arg3_type      = ARG_ANYTHING,
>  };
>  
> +BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
> +	   struct bpf_map *, ex_map, u64, flags)
> +{
> +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
> +		return XDP_ABORTED;
> +
> +	ri->tgt_index = 0;
> +	ri->tgt_value = NULL;
> +	ri->flags = flags;
> +
> +	WRITE_ONCE(ri->map, map);
> +	WRITE_ONCE(ri->ex_map, ex_map);
> +
> +	return XDP_REDIRECT;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
> +	.func           = bpf_xdp_redirect_map_multi,
> +	.gpl_only       = false,
> +	.ret_type       = RET_INTEGER,
> +	.arg1_type      = ARG_CONST_MAP_PTR,
> +	.arg2_type      = ARG_CONST_MAP_PTR,
> +	.arg3_type      = ARG_ANYTHING,
> +};
> +
>  static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
>  				  unsigned long off, unsigned long len)
>  {
> @@ -6304,6 +6390,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_xdp_redirect_proto;
>  	case BPF_FUNC_redirect_map:
>  		return &bpf_xdp_redirect_map_proto;
> +	case BPF_FUNC_redirect_map_multi:
> +		return &bpf_xdp_redirect_map_multi_proto;
>  	case BPF_FUNC_xdp_adjust_tail:
>  		return &bpf_xdp_adjust_tail_proto;
>  	case BPF_FUNC_fib_lookup:
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 4c7ea85486af..70dfb4910f84 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>  	return xdpf;
>  }
>  EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
> +
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> +{
> +	unsigned int headroom, totalsize;
> +	struct xdp_frame *nxdpf;
> +	struct page *page;
> +	void *addr;
> +
> +	headroom = xdpf->headroom + sizeof(*xdpf);
> +	totalsize = headroom + xdpf->len;
> +
> +	if (unlikely(totalsize > PAGE_SIZE))
> +		return NULL;
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +	addr = page_to_virt(page);
> +
> +	memcpy(addr, xdpf, totalsize);
> +
> +	nxdpf = addr;
> +	nxdpf->data = addr + headroom;
> +
> +	return nxdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdpf_clone);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 2e29a671d67e..1dbe42290223 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -3025,6 +3025,21 @@ union bpf_attr {
>   *		* **-EOPNOTSUPP**	Unsupported operation, for example a
>   *					call from outside of TC ingress.
>   *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		Redirect the packet to all the interfaces in *map*, and
> + * 		exclude the interfaces that are in *ex_map*. The *ex_map* may
> + * 		be NULL.
> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which can exclude redirecting to the ingress device.
> + *
> + * 		See also bpf_redirect_map(), which supports redirecting
> + * 		packet to a specific ifindex in the map.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3151,7 +3166,8 @@ union bpf_attr {
>  	FN(xdp_output),			\
>  	FN(get_netns_cookie),		\
>  	FN(get_current_ancestor_cgroup_id),	\
> -	FN(sk_assign),
> +	FN(sk_assign),			\
> +	FN(redirect_map_multi),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
>  	BPF_LWT_ENCAP_IP,
>  };
>  
> +/* BPF_FUNC_redirect_map_multi flags. */
> +enum {
> +	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
> +};
> +
>  #define __bpf_md_ptr(type, name)	\
>  union {					\
>  	type name;			\
> -- 
> 2.19.2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
@ 2020-04-24 14:21     ` Lorenzo Bianconi
  0 siblings, 0 replies; 22+ messages in thread
From: Lorenzo Bianconi @ 2020-04-24 14:21 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann


[-- Attachment #1: Type: text/plain, Size: 6979 bytes --]

On Apr 24, Hangbin Liu wrote:
> This is a sample for XDP multicast. In the sample we have 3 forward
> groups and 1 exclude group. It redirects each interface's packets to
> all the interfaces in the forward group, excluding the interfaces in
> the exclude map.
> 
> For more testing details, please see the test description in
> xdp_redirect_map_multi.sh.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  samples/bpf/Makefile                      |   3 +
>  samples/bpf/xdp_redirect_map_multi.sh     | 124 ++++++++++++++++
>  samples/bpf/xdp_redirect_map_multi_kern.c | 100 +++++++++++++
>  samples/bpf/xdp_redirect_map_multi_user.c | 170 ++++++++++++++++++++++
>  4 files changed, 397 insertions(+)
>  create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
>  create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
>  create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 424f6fe7ce38..eb7306efe85e 100644

[...]

> +
> +SEC("xdp_redirect_map_multi")
> +int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
> +{
> +	u32 key, mcast_group_id, exclude_group_id;
> +	void *data_end = (void *)(long)ctx->data_end;
> +	void *data = (void *)(long)ctx->data;
> +	struct ethhdr *eth = data;
> +	int *inmap_id;
> +	u16 h_proto;
> +	u64 nh_off;
> +
> +	nh_off = sizeof(*eth);
> +	if (data + nh_off > data_end)
> +		return XDP_DROP;
> +
> +	h_proto = eth->h_proto;
> +
> +	if (h_proto == htons(ETH_P_IP))
> +		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
> +					      BPF_F_EXCLUDE_INGRESS);

Do we need the 'BPF_F_EXCLUDE_INGRESS' here?

> +	else if (h_proto == htons(ETH_P_IPV6))
> +		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
> +					      BPF_F_EXCLUDE_INGRESS);

ditto

> +	else
> +		return bpf_redirect_map_multi(&forward_map_all, NULL,
> +					      BPF_F_EXCLUDE_INGRESS);
> +}
> +
> +SEC("xdp_redirect_dummy")
> +int xdp_redirect_dummy_prog(struct xdp_md *ctx)
> +{
> +	return XDP_PASS;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
> new file mode 100644
> index 000000000000..2fcd15322201
> --- /dev/null
> +++ b/samples/bpf/xdp_redirect_map_multi_user.c
> @@ -0,0 +1,170 @@
> +/* SPDX-License-Identifier: GPL-2.0-only
> + */
> +#include <linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <net/if.h>
> +#include <unistd.h>
> +#include <libgen.h>
> +
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#define MAX_IFACE_NUM 32
> +
> +static int ifaces[MAX_IFACE_NUM] = {};
> +static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
> +
> +static void int_exit(int sig)
> +{
> +	__u32 prog_id = 0;
> +	int i;
> +
> +	for (i = 0; ifaces[i] > 0; i++) {
> +		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
> +			printf("bpf_get_link_xdp_id failed\n");
> +			exit(1);
> +		}
> +		if (prog_id)
> +			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
> +	}
> +
> +	exit(0);
> +}
> +
> +static void usage(const char *prog)
> +{
> +	fprintf(stderr,
> +		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ... \n"
> +		"OPTS:\n"
> +		"    -S    use skb-mode\n"
> +		"    -N    enforce native mode\n"
> +		"    -F    force loading prog\n",
> +		prog);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int prog_fd, group_all, group_v4, group_v6, exclude;
> +	struct bpf_prog_load_attr prog_load_attr = {
> +		.prog_type      = BPF_PROG_TYPE_XDP,
> +	};
> +	int i, ret, opt, ifindex;
> +	char ifname[IF_NAMESIZE];
> +	struct bpf_object *obj;
> +	char filename[256];
> +
> +	while ((opt = getopt(argc, argv, "SNF")) != -1) {
> +		switch (opt) {
> +		case 'S':
> +			xdp_flags |= XDP_FLAGS_SKB_MODE;
> +			break;
> +		case 'N':
> +			/* default, set below */
> +			break;
> +		case 'F':
> +			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
> +			break;
> +		default:
> +			usage(basename(argv[0]));
> +			return 1;
> +		}
> +	}
> +
> +	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
> +		xdp_flags |= XDP_FLAGS_DRV_MODE;
> +
> +	if (optind == argc) {
> +		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
> +		return 1;
> +	}
> +
> +	printf("Get interfaces");
> +	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
> +		ifaces[i] = if_nametoindex(argv[optind + i]);
> +		if (!ifaces[i])
> +			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
> +		if (!if_indextoname(ifaces[i], ifname)) {
> +			perror("Invalid interface name or index");
> +			return 1;
> +		}
> +		printf(" %d", ifaces[i]);
> +	}
> +	printf("\n");
> +
> +	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
> +	prog_load_attr.file = filename;
> +
> +	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> +		return 1;
> +
> +	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
> +	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
> +	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
> +	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
> +
> +	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0) {
> +		printf("bpf_object__find_map_fd_by_name failed\n");
> +		return 1;
> +	}
> +
> +	signal(SIGINT, int_exit);
> +	signal(SIGTERM, int_exit);
> +
> +	/* Init forward multicast groups and exclude group */
> +	for (i = 0; ifaces[i] > 0; i++) {
> +		ifindex = ifaces[i];
> +
> +		/* Add all the interfaces to group all */
> +		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
> +		if (ret) {
> +			perror("bpf_map_update_elem");
> +			goto err_out;
> +		}
> +
> +		/* For testing: remove the 2nd interfaces from group v4 */
> +		if (i != 1) {
> +			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
> +			if (ret) {
> +				perror("bpf_map_update_elem");
> +				goto err_out;
> +			}
> +		}
> +
> +		/* For testing: remove the 1st interfaces from group v6 */
> +		if (i != 0) {
> +			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
> +			if (ret) {
> +				perror("bpf_map_update_elem");
> +				goto err_out;
> +			}
> +		}
> +
> +		/* For testing: add the 3rd interfaces to exclude map */
> +		if (i == 2) {
> +			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
> +			if (ret) {
> +				perror("bpf_map_update_elem");
> +				goto err_out;
> +			}
> +		}
> +
> +		/* bind prog_fd to each interface */
> +		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
> +		if (ret) {
> +			printf("Set xdp fd failed on %d\n", ifindex);
> +			goto err_out;
> +		}
> +
> +	}
> +
> +	sleep(600);
> +	return 0;
> +
> +err_out:
> +	return 1;
> +}
> -- 
> 2.19.2
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24  8:56   ` [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-04-24 14:19     ` Lorenzo Bianconi
@ 2020-04-24 14:34     ` Toke Høiland-Jørgensen
  2020-05-06  9:14       ` Hangbin Liu
  2020-05-18  8:45       ` Hangbin Liu
  1 sibling, 2 replies; 22+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-24 14:34 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, Hangbin Liu

Hangbin Liu <liuhangbin@gmail.com> writes:

> This is a prototype for xdp multicast support. In this implementation we
> add a new helper that accepts two maps: a forward map and an exclude map.
> We redirect the packet to all the interfaces in the *forward map*, but
> exclude the interfaces that are in the *exclude map*.

Yeah, the new helper is much cleaner!

> To achieve this I add a new ex_map field to struct bpf_redirect_info.
> In the helper I set tgt_value to NULL to distinguish this case from
> bpf_xdp_redirect_map().
>
> We also add a flag *BPF_F_EXCLUDE_INGRESS* in case you don't want to
> create an exclude map for each interface and just want to exclude the
> ingress interface.
>
> The general data path is kept in net/core/filter.c. The native data
> path is in kernel/bpf/devmap.c so we can use direct calls to
> get better performance.

Got any performance numbers? :)

> v2: add new syscall bpf_xdp_redirect_map_multi() which could accept
> include/exclude maps directly.
>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/bpf.h            |  20 ++++++
>  include/linux/filter.h         |   1 +
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  23 ++++++-
>  kernel/bpf/devmap.c            | 114 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |   6 ++
>  net/core/filter.c              |  98 ++++++++++++++++++++++++++--
>  net/core/xdp.c                 |  26 ++++++++
>  tools/include/uapi/linux/bpf.h |  23 ++++++-
>  9 files changed, 305 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fd2b2322412d..3fd2903def3f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1161,6 +1161,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex);
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress);
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog);
>  
> @@ -1297,6 +1302,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return 0;
>  }
>  
> +static inline
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	return false;
> +}
> +
> +static inline
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	return 0;
> +}
> +
>  struct sk_buff;
>  
>  static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 9b5aa5c483cc..5b4e1ccd2d37 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -614,6 +614,7 @@ struct bpf_redirect_info {
>  	u32 tgt_index;
>  	void *tgt_value;
>  	struct bpf_map *map;
> +	struct bpf_map *ex_map;
>  	u32 kern_flags;
>  };
>  
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 40c6d3398458..a214dce8579c 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -92,6 +92,7 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
>  }
>  
>  struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
>  
>  /* Convert xdp_buff to xdp_frame */
>  static inline
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2e29a671d67e..1dbe42290223 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3025,6 +3025,21 @@ union bpf_attr {
>   *		* **-EOPNOTSUPP**	Unsupported operation, for example a
>   *					call from outside of TC ingress.
>   *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		Redirect the packet to all the interfaces in *map*, and
> + * 		exclude the interfaces that in *ex_map*. The *ex_map* could
> + * 		be NULL.
> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which could exclude redirect to the ingress device.

I'd suggest rewording this to:

* 		Redirect the packet to ALL the interfaces in *map*, but
* 		exclude the interfaces in *ex_map* (which may be NULL).
*
* 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
* 		which additionally excludes the current ingress device.


> + * 		See also bpf_redirect_map(), which supports redirecting
> + * 		packet to a specific ifindex in the map.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3151,7 +3166,8 @@ union bpf_attr {
>  	FN(xdp_output),			\
>  	FN(get_netns_cookie),		\
>  	FN(get_current_ancestor_cgroup_id),	\
> -	FN(sk_assign),
> +	FN(sk_assign),			\
> +	FN(redirect_map_multi),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
>  	BPF_LWT_ENCAP_IP,
>  };
>  
> +/* BPF_FUNC_redirect_map_multi flags. */
> +enum {
> +	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
> +};
> +
>  #define __bpf_md_ptr(type, name)	\
>  union {					\
>  	type name;			\
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 58bdca5d978a..34b171f7826c 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -456,6 +456,120 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return __xdp_enqueue(dev, xdp, dev_rx);
>  }
>  
> +/* Use direct calls in the fast path instead of map->ops->map_get_next_key() */
> +static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +
> +	switch (map->map_type) {
> +	case BPF_MAP_TYPE_DEVMAP:
> +		return dev_map_get_next_key(map, key, next_key);
> +	case BPF_MAP_TYPE_DEVMAP_HASH:
> +		return dev_map_hash_get_next_key(map, key, next_key);
> +	default:
> +		break;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> +			int exclude_ifindex)
> +{
> +	struct bpf_dtab_netdev *in_obj = NULL;
> +	u32 key, next_key;
> +	int err;
> +
> +	if (!map)
> +		return false;
> +
> +	if (obj->dev->ifindex == exclude_ifindex)
> +		return true;

We probably want the EXCLUDE_INGRESS flag to work even if ex_map is
NULL, right? In that case you want to switch the order of the two checks
above.
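i.e. something like (untested):

```c
	if (obj->dev->ifindex == exclude_ifindex)
		return true;

	if (!map)
		return false;
```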

> +	devmap_get_next_key(map, NULL, &key);
> +
> +	for (;;) {

I wonder if we should require DEVMAP_HASH maps to be indexed by ifindex
to avoid the loop?
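If userspace were required to key DEVMAP_HASH exclude maps by ifindex, the
whole walk would collapse into a single lookup — a sketch under that
(hypothetical) convention:

```c
	/* only valid if the exclude map is keyed by ifindex */
	if (map->map_type == BPF_MAP_TYPE_DEVMAP_HASH)
		return __dev_map_hash_lookup_elem(map,
						  obj->dev->ifindex) != NULL;
```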

> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			in_obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			in_obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
> +			return true;
> +
> +		err = devmap_get_next_key(map, &key, &next_key);
> +
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return false;
> +}
> +
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  bool exclude_ingress)
> +{
> +	struct bpf_dtab_netdev *obj = NULL;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	struct net_device *dev;
> +	u32 key, next_key;
> +	int err;
> +
> +	devmap_get_next_key(map, NULL, &key);
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;

You do a clone for each map entry below, so I think you end up leaking
this initial xdpf? Also, you'll end up with one clone more than
necessary - redirecting to two interfaces should only require 1 clone,
you're doing 2.

> +	for (;;) {
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_DEVMAP:
> +			obj = __dev_map_lookup_elem(map, key);
> +			break;
> +		case BPF_MAP_TYPE_DEVMAP_HASH:
> +			obj = __dev_map_hash_lookup_elem(map, key);
> +			break;
> +		default:
> +			break;
> +		}
> +
> +		if (!obj || dev_in_exclude_map(obj, ex_map,
> +					       exclude_ingress ? dev_rx->ifindex : 0))
> +			goto find_next;
> +
> +		dev = obj->dev;
> +
> +		if (!dev->netdev_ops->ndo_xdp_xmit)
> +			return -EOPNOTSUPP;
> +
> +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> +		if (unlikely(err))
> +			return err;

These abort the whole operation midway through the loop if any error
occurs. That is probably not what we want? I think the right thing to do
is just continue the loop and only return an error if *all* of the
forwarding attempts failed. Maybe we need a tracepoint to catch
individual errors?
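E.g. (untested), swallowing per-port errors and moving on:

```c
		dev = obj->dev;

		/* skip this port on error instead of aborting the whole
		 * multicast operation; a tracepoint could record it here */
		if (!dev->netdev_ops->ndo_xdp_xmit ||
		    unlikely(xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data)))
			goto find_next;
```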

> +		nxdpf = xdpf_clone(xdpf);
> +		if (unlikely(!nxdpf))
> +			return -ENOMEM;

As this is a memory error it's likely fatal on the next loop iteration
as well, so probably OK to abort everything here.

> +		bq_enqueue(dev, nxdpf, dev_rx);
> +
> +find_next:
> +		err = devmap_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog)
>  {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 38cfcf701eeb..f77213a0e354 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3880,6 +3880,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
>  		if (func_id != BPF_FUNC_redirect_map &&
> +		    func_id != BPF_FUNC_redirect_map_multi &&
>  		    func_id != BPF_FUNC_map_lookup_elem)
>  			goto error;
>  		break;
> @@ -3970,6 +3971,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		    map->map_type != BPF_MAP_TYPE_XSKMAP)
>  			goto error;
>  		break;
> +	case BPF_FUNC_redirect_map_multi:
> +		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
> +		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
> +			goto error;
> +		break;
>  	case BPF_FUNC_sk_redirect_map:
>  	case BPF_FUNC_msg_redirect_map:
>  	case BPF_FUNC_sock_map_update:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7d6ceaa54d21..94d1530e5ac6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>  };
>  
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> -			    struct bpf_map *map, struct xdp_buff *xdp)
> +			    struct bpf_map *map, struct xdp_buff *xdp,
> +			    struct bpf_map *ex_map, bool exclude_ingress)
>  {
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		return dev_map_enqueue(fwd, xdp, dev_rx);
> +		if (fwd)
> +			return dev_map_enqueue(fwd, xdp, dev_rx);
> +		else
> +			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
> +						     exclude_ingress);
>  	case BPF_MAP_TYPE_CPUMAP:
>  		return cpu_map_enqueue(fwd, xdp, dev_rx);
>  	case BPF_MAP_TYPE_XSKMAP:
> @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);

I don't think you need the READ_ONCE here since there's already one
below?

>  	struct bpf_map *map = READ_ONCE(ri->map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
> @@ -3552,7 +3559,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  		err = dev_xdp_enqueue(fwd, xdp, dev);
>  	} else {
> -		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
> +		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
>  	}
>  
>  	if (unlikely(err))
> @@ -3566,6 +3573,49 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>  
> +static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
> +				  struct bpf_prog *xdp_prog,
> +				  struct bpf_map *map, struct bpf_map *ex_map,
> +				  bool exclude_ingress)
> +
> +{
> +	struct bpf_dtab_netdev *dst;
> +	struct sk_buff *nskb;
> +	u32 key, next_key;
> +	int err;
> +	void *fwd;
> +
> +	/* Get first key from forward map */
> +	map->ops->map_get_next_key(map, NULL, &key);
> +
> +	for (;;) {
> +		fwd = __xdp_map_lookup_elem(map, key);
> +		if (fwd) {
> +			dst = (struct bpf_dtab_netdev *)fwd;
> +			if (dev_in_exclude_map(dst, ex_map,
> +					       exclude_ingress ? dev->ifindex : 0))
> +				goto find_next;
> +
> +			nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (!nskb)
> +				return -EOVERFLOW;
> +
> +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> +			if (unlikely(err))
> +				return err;
> +		}
> +
> +find_next:
> +		err = map->ops->map_get_next_key(map, &key, &next_key);
> +		if (err)
> +			break;
> +
> +		key = next_key;
> +	}
> +
> +	return 0;
> +}

This duplication bugs me; maybe we should try to consolidate the generic
and native XDP code paths?

>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct sk_buff *skb,
>  				       struct xdp_buff *xdp,
> @@ -3573,6 +3623,8 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       struct bpf_map *map)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> +	struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
>  	int err = 0;
> @@ -3583,9 +3635,16 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  
>  	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
>  	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
> -		struct bpf_dtab_netdev *dst = fwd;
> +		if (fwd) {
> +			struct bpf_dtab_netdev *dst = fwd;
> +
> +			err = dev_map_generic_redirect(dst, skb, xdp_prog);
> +		} else {
> +			/* Deal with multicast maps */
> +			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> +						     ex_map, exclude_ingress);
> +		}
>  
> -		err = dev_map_generic_redirect(dst, skb, xdp_prog);
>  		if (unlikely(err))
>  			goto err;
>  	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
> @@ -3699,6 +3758,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
>  	.arg3_type      = ARG_ANYTHING,
>  };
>  
> +BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
> +	   struct bpf_map *, ex_map, u64, flags)
> +{
> +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
> +		return XDP_ABORTED;
> +
> +	ri->tgt_index = 0;
> +	ri->tgt_value = NULL;
> +	ri->flags = flags;
> +
> +	WRITE_ONCE(ri->map, map);
> +	WRITE_ONCE(ri->ex_map, ex_map);
> +
> +	return XDP_REDIRECT;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
> +	.func           = bpf_xdp_redirect_map_multi,
> +	.gpl_only       = false,
> +	.ret_type       = RET_INTEGER,
> +	.arg1_type      = ARG_CONST_MAP_PTR,
> +	.arg2_type      = ARG_CONST_MAP_PTR,
> +	.arg3_type      = ARG_ANYTHING,
> +};
> +
>  static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
>  				  unsigned long off, unsigned long len)
>  {
> @@ -6304,6 +6390,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_xdp_redirect_proto;
>  	case BPF_FUNC_redirect_map:
>  		return &bpf_xdp_redirect_map_proto;
> +	case BPF_FUNC_redirect_map_multi:
> +		return &bpf_xdp_redirect_map_multi_proto;
>  	case BPF_FUNC_xdp_adjust_tail:
>  		return &bpf_xdp_adjust_tail_proto;
>  	case BPF_FUNC_fib_lookup:
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 4c7ea85486af..70dfb4910f84 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -496,3 +496,29 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>  	return xdpf;
>  }
>  EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
> +
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> +{
> +	unsigned int headroom, totalsize;
> +	struct xdp_frame *nxdpf;
> +	struct page *page;
> +	void *addr;
> +
> +	headroom = xdpf->headroom + sizeof(*xdpf);
> +	totalsize = headroom + xdpf->len;
> +
> +	if (unlikely(totalsize > PAGE_SIZE))
> +		return NULL;
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +	addr = page_to_virt(page);
> +
> +	memcpy(addr, xdpf, totalsize);
> +
> +	nxdpf = addr;
> +	nxdpf->data = addr + headroom;
> +
> +	return nxdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdpf_clone);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 2e29a671d67e..1dbe42290223 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h

Updates to tools/include should generally go into a separate patch.

> @@ -3025,6 +3025,21 @@ union bpf_attr {
>   *		* **-EOPNOTSUPP**	Unsupported operation, for example a
>   *					call from outside of TC ingress.
>   *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
> + *
> + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		Redirect the packet to all the interfaces in *map*, and
> + * 		exclude the interfaces that are in *ex_map*. The *ex_map* could
> + * 		be NULL.
> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which could exclude redirecting to the ingress device.
> + *
> + * 		See also bpf_redirect_map(), which supports redirecting
> + * 		packet to a specific ifindex in the map.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3151,7 +3166,8 @@ union bpf_attr {
>  	FN(xdp_output),			\
>  	FN(get_netns_cookie),		\
>  	FN(get_current_ancestor_cgroup_id),	\
> -	FN(sk_assign),
> +	FN(sk_assign),			\
> +	FN(redirect_map_multi),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3280,6 +3296,11 @@ enum bpf_lwt_encap_mode {
>  	BPF_LWT_ENCAP_IP,
>  };
>  
> +/* BPF_FUNC_redirect_map_multi flags. */
> +enum {
> +	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
> +};
> +
>  #define __bpf_md_ptr(type, name)	\
>  union {					\
>  	type name;			\
> -- 
> 2.19.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24 14:19     ` Lorenzo Bianconi
@ 2020-04-28 11:09       ` Eelco Chaudron
  2020-05-06  9:35       ` Hangbin Liu
  1 sibling, 0 replies; 22+ messages in thread
From: Eelco Chaudron @ 2020-04-28 11:09 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Hangbin Liu, bpf, netdev, Toke Høiland-Jørgensen,
	Jiri Benc, Jesper Dangaard Brouer, ast, Daniel Borkmann



On 24 Apr 2020, at 16:19, Lorenzo Bianconi wrote:

[...]

>> +{
>> +
>> +	switch (map->map_type) {
>> +	case BPF_MAP_TYPE_DEVMAP:
>> +		return dev_map_get_next_key(map, key, next_key);
>> +	case BPF_MAP_TYPE_DEVMAP_HASH:
>> +		return dev_map_hash_get_next_key(map, key, next_key);
>> +	default:
>> +		break;
>> +	}
>> +
>> +	return -ENOENT;
>> +}
>> +
>> +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map 
>> *map,
>> +			int exclude_ifindex)
>> +{
>> +	struct bpf_dtab_netdev *in_obj = NULL;
>> +	u32 key, next_key;
>> +	int err;
>> +
>> +	if (!map)
>> +		return false;
>
> doing so it seems mandatory to define an exclude_map even if we want 
> just to do
> not forward the packet to the "ingress" interface.
> Moreover I was thinking that we can assume to never forward to in the 
> incoming
> interface. Doing so the code would be simpler I guess. Is there a use 
> case for
> it? (forward even to the ingress interface)
>

This part I can answer, it’s called VEPA, I think it’s part of IEEE 
802.1Qbg.



* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24 14:34     ` Toke Høiland-Jørgensen
@ 2020-05-06  9:14       ` Hangbin Liu
  2020-05-06 10:00         ` Toke Høiland-Jørgensen
  2020-05-18  8:45       ` Hangbin Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Hangbin Liu @ 2020-05-06  9:14 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hi Toke,

Thanks for your review, please see replies below.

On Fri, Apr 24, 2020 at 04:34:49PM +0200, Toke Høiland-Jørgensen wrote:
> >
> > The general data path is kept in net/core/filter.c. The native data
> > path is in kernel/bpf/devmap.c so we can use direct calls to
> > get better performace.
> 
> Got any performance numbers? :)

No, I haven't tested the performance. Do you have any suggestions about how
to test it? I'd like to try forwarding pkts to 10+ ports, but I don't know
how to test the throughput. I don't think netperf or iperf supports this.
> 
> > + * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> > + * 	Description
> > + * 		Redirect the packet to all the interfaces in *map*, and
> > + * 		exclude the interfaces that in *ex_map*. The *ex_map* could
> > + * 		be NULL.
> > + *
> > + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> > + * 		which could exlcude redirect to the ingress device.
> 
> I'd suggest rewording this to:
> 
> * 		Redirect the packet to ALL the interfaces in *map*, but
> * 		exclude the interfaces in *ex_map* (which may be NULL).
> *
> * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> * 		which additionally excludes the current ingress device.

Thanks, I will update it
> > +
> > +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> > +			int exclude_ifindex)
> > +{
> > +	struct bpf_dtab_netdev *in_obj = NULL;
> > +	u32 key, next_key;
> > +	int err;
> > +
> > +	if (!map)
> > +		return false;
> > +
> > +	if (obj->dev->ifindex == exclude_ifindex)
> > +		return true;
> 
> We probably want the EXCLUDE_INGRESS flag to work even if ex_map is
> NULL, right? In that case you want to switch the order of the two checks
> above.

Yes, will fix it.

> 
> > +	devmap_get_next_key(map, NULL, &key);
> > +
> > +	for (;;) {
> 
> I wonder if we should require DEVMAP_HASH maps to be indexed by ifindex
> to avoid the loop?

I guess it's not easy to force the user to index the map by ifindex.

> > +	xdpf = convert_to_xdp_frame(xdp);
> > +	if (unlikely(!xdpf))
> > +		return -EOVERFLOW;
> 
> You do a clone for each map entry below, so I think you end up leaking
> this initial xdpf? Also, you'll end up with one clone more than
> necessary - redirecting to two interfaces should only require 1 clone,
> you're doing 2.

We don't know which entry is the last one, so we need to keep the initial
frame for cloning. Is it enough to call xdp_release_frame() after the for loop?
> 
> > +	for (;;) {
> > +		switch (map->map_type) {
> > +		case BPF_MAP_TYPE_DEVMAP:
> > +			obj = __dev_map_lookup_elem(map, key);
> > +			break;
> > +		case BPF_MAP_TYPE_DEVMAP_HASH:
> > +			obj = __dev_map_hash_lookup_elem(map, key);
> > +			break;
> > +		default:
> > +			break;
> > +		}
> > +
> > +		if (!obj || dev_in_exclude_map(obj, ex_map,
> > +					       exclude_ingress ? dev_rx->ifindex : 0))
> > +			goto find_next;
> > +
> > +		dev = obj->dev;
> > +
> > +		if (!dev->netdev_ops->ndo_xdp_xmit)
> > +			return -EOPNOTSUPP;
> > +
> > +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
> > +		if (unlikely(err))
> > +			return err;
> 
> These abort the whole operation midway through the loop if any error
> occurs. That is probably not what we want? I think the right thing to do
> is just continue the loop and only return an error if *all* of the
> forwarding attempts failed. Maybe we need a tracepoint to catch
> individual errors?

Makes sense. I will see if we can add a tracepoint here.
> >  
> > +static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
> > +				  struct bpf_prog *xdp_prog,
> > +				  struct bpf_map *map, struct bpf_map *ex_map,
> > +				  bool exclude_ingress)
> > +
> > +{
> > +	struct bpf_dtab_netdev *dst;
> > +	struct sk_buff *nskb;
> > +	u32 key, next_key;
> > +	int err;
> > +	void *fwd;
> > +
> > +	/* Get first key from forward map */
> > +	map->ops->map_get_next_key(map, NULL, &key);
> > +
> > +	for (;;) {
> > +		fwd = __xdp_map_lookup_elem(map, key);
> > +		if (fwd) {
> > +			dst = (struct bpf_dtab_netdev *)fwd;
> > +			if (dev_in_exclude_map(dst, ex_map,
> > +					       exclude_ingress ? dev->ifindex : 0))
> > +				goto find_next;
> > +
> > +			nskb = skb_clone(skb, GFP_ATOMIC);
> > +			if (!nskb)
> > +				return -EOVERFLOW;
> > +
> > +			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
> > +			if (unlikely(err))
> > +				return err;
> > +		}
> > +
> > +find_next:
> > +		err = map->ops->map_get_next_key(map, &key, &next_key);
> > +		if (err)
> > +			break;
> > +
> > +		key = next_key;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> This duplication bugs me; maybe we should try to consolidate the generic
> and native XDP code paths?

Yes, I have tried to combine these two functions. But one is the generic
code path and the other is the XDP code path. One uses skb_clone() and the
other uses xdpf_clone(). There are also some extra checks for the XDP code.
So maybe we'd better just keep it as it is.

> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 2e29a671d67e..1dbe42290223 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> 
> Updates to tools/include should generally go into a separate patch.

Will fix it, thanks.

Best Regards
Hangbin


* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24 14:19     ` Lorenzo Bianconi
  2020-04-28 11:09       ` Eelco Chaudron
@ 2020-05-06  9:35       ` Hangbin Liu
  1 sibling, 0 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-05-06  9:35 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann


Hi Lorenzo,

Thanks for the comments, please see replies below.

On Fri, Apr 24, 2020 at 04:19:08PM +0200, Lorenzo Bianconi wrote:
> > +bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
> > +			int exclude_ifindex)
> > +{
> > +	struct bpf_dtab_netdev *in_obj = NULL;
> > +	u32 key, next_key;
> > +	int err;
> > +
> > +	if (!map)
> > +		return false;
> 
> doing so it seems mandatory to define an exclude_map even if we want just to do
> not forward the packet to the "ingress" interface.
> Moreover I was thinking that we can assume to never forward to in the incoming
> interface. Doing so the code would be simpler I guess. Is there a use case for
> it? (forward even to the ingress interface)

Eelco has already answered with one use case: VEPA. Another reason I added this
flag is that the other helpers like bpf_redirect() or bpf_redirect_map() are
also able to forward to the ingress interface, so we need to behave the same
way by default.
> 
> > +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> > +			  struct bpf_map *map, struct bpf_map *ex_map,
> > +			  bool exclude_ingress)
> > +{

[...]
> > +	}
> 
> Do we need to free 'incoming' xdp buffer here? I think most of the drivers assume
> the packet is owned by the stack if xdp_do_redirect returns 0

Yes, we need. I will fix it.
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 7d6ceaa54d21..94d1530e5ac6 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
> >  };
> >  
> >  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> > -			    struct bpf_map *map, struct xdp_buff *xdp)
> > +			    struct bpf_map *map, struct xdp_buff *xdp,
> > +			    struct bpf_map *ex_map, bool exclude_ingress)
> >  {
> >  	switch (map->map_type) {
> >  	case BPF_MAP_TYPE_DEVMAP:
> >  	case BPF_MAP_TYPE_DEVMAP_HASH:
> > -		return dev_map_enqueue(fwd, xdp, dev_rx);
> > +		if (fwd)
> > +			return dev_map_enqueue(fwd, xdp, dev_rx);
> > +		else
> > +			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
> > +						     exclude_ingress);
> 
> I guess it would be better to do not make it the default case. Maybe you can
> add a bit in flags to mark it for "multicast"

But how do we distinguish the flag bit from those of other helpers? E.g. if we
define 0x02 as the "do_multicast" flag, what if another helper also uses this flag?

Currently __bpf_tx_xdp_map() is only called by xdp_do_redirect(). If there
is a map but no fwd, it must be a multicast forward, so we are still safe for
now. Maybe this will need an update in the future.

Thanks
Hangbin


* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-06  9:14       ` Hangbin Liu
@ 2020-05-06 10:00         ` Toke Høiland-Jørgensen
  2020-05-08  8:53           ` Hangbin Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-05-06 10:00 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> Hi Toke,
>
> Thanks for your review, please see replies below.
>
> On Fri, Apr 24, 2020 at 04:34:49PM +0200, Toke Høiland-Jørgensen wrote:
>> >
>> > The general data path is kept in net/core/filter.c. The native data
>> > path is in kernel/bpf/devmap.c so we can use direct calls to
>> > get better performace.
>> 
>> Got any performance numbers? :)
>
> No, I haven't test the performance. Do you have any suggestions about how
> to test it? I'd like to try forwarding pkts to 10+ ports. But I don't know
> how to test the throughput. I don't think netperf or iperf supports
> this.

What I usually do when benchmarking XDP_REDIRECT is to just use pktgen
(samples/pktgen in the kernel source tree) on another machine,
specifically, like this:

./pktgen_sample03_burst_single_flow.sh  -i enp1s0f1 -d 10.70.2.2 -m ec:0d:9a:db:11:35 -t 4  -s 64

(adjust iface, IP and MAC address to your system, of course). That'll
flood the target machine with small UDP packets. On that machine, I then
run the 'xdp_redirect_map' program from samples/bpf. The bpf program
used by that sample will update an internal counter for every packet,
and the userspace prints it out, which gives you the performance (in
PPS). So just modifying that sample to using your new multicast helper
(and comparing it to regular REDIRECT to a single device) would be a
first approximation of a performance test.

[...]

>> > +	devmap_get_next_key(map, NULL, &key);
>> > +
>> > +	for (;;) {
>> 
>> I wonder if we should require DEVMAP_HASH maps to be indexed by ifindex
>> to avoid the loop?
>
> I guess it's not easy to force user to index the map by ifindex.

Well, the way to 'force the user' is just to assume that this is the
case, and if the map is filled in wrong, things just won't work ;)

>> > +	xdpf = convert_to_xdp_frame(xdp);
>> > +	if (unlikely(!xdpf))
>> > +		return -EOVERFLOW;
>> 
>> You do a clone for each map entry below, so I think you end up leaking
>> this initial xdpf? Also, you'll end up with one clone more than
>> necessary - redirecting to two interfaces should only require 1 clone,
>> you're doing 2.
>
> We don't know which is the latest one. So we need to keep the initial
> for clone. Is it enough to call xdp_release_frame() after the for
> loop?

You could do something like:

bool first = true;
for (;;) {

[...]

           if (!first) {
   		nxdpf = xdpf_clone(xdpf);
   		if (unlikely(!nxdpf))
   			return -ENOMEM;
   		bq_enqueue(dev, nxdpf, dev_rx);
           } else {
   		bq_enqueue(dev, xdpf, dev_rx);
   		first = false;
           }
}

/* didn't find anywhere to forward to, free buf */
if (first)
   xdp_return_frame_rx_napi(xdpf);



[...]

>> This duplication bugs me; maybe we should try to consolidate the generic
>> and native XDP code paths?
>
> Yes, I have tried to combine these two functions together. But one is generic
> code path and another is XDP code patch. One use skb_clone and another
> use xdpf_clone(). There are also some extra checks for XDP code. So maybe
> we'd better just keep it as it is.

Yeah, guess it may not be as simple as I'd like it to be ;)
Let's keep it this way for now at least; we can always consolidate in a
separate patch series.

-Toke



* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-06 10:00         ` Toke Høiland-Jørgensen
@ 2020-05-08  8:53           ` Hangbin Liu
  2020-05-08 14:58             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 22+ messages in thread
From: Hangbin Liu @ 2020-05-08  8:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

On Wed, May 06, 2020 at 12:00:08PM +0200, Toke Høiland-Jørgensen wrote:
> > No, I haven't test the performance. Do you have any suggestions about how
> > to test it? I'd like to try forwarding pkts to 10+ ports. But I don't know
> > how to test the throughput. I don't think netperf or iperf supports
> > this.
> 
> What I usually do when benchmarking XDP_REDIRECT is to just use pktgen
> (samples/pktgen in the kernel source tree) on another machine,
> specifically, like this:
> 
> ./pktgen_sample03_burst_single_flow.sh  -i enp1s0f1 -d 10.70.2.2 -m ec:0d:9a:db:11:35 -t 4  -s 64
> 
> (adjust iface, IP and MAC address to your system, of course). That'll
> flood the target machine with small UDP packets. On that machine, I then
> run the 'xdp_redirect_map' program from samples/bpf. The bpf program
> used by that sample will update an internal counter for every packet,
> and the userspace prints it out, which gives you the performance (in
> PPS). So just modifying that sample to using your new multicast helper
> (and comparing it to regular REDIRECT to a single device) would be a
> first approximation of a performance test.

Thanks for this method. I will update the sample and do some more tests.
> 
> You could do something like:
> 
> bool first = true;
> for (;;) {
> 
> [...]
> 
>            if (!first) {
>    		nxdpf = xdpf_clone(xdpf);
>    		if (unlikely(!nxdpf))
>    			return -ENOMEM;
>    		bq_enqueue(dev, nxdpf, dev_rx);
>            } else {
>    		bq_enqueue(dev, xdpf, dev_rx);
>    		first = false;
>            }
> }
> 
> /* didn't find anywhere to forward to, free buf */
> if (first)
>    xdp_return_frame_rx_napi(xdpf);

I think the first xdpf will be consumed by the driver and the later
xdpf_clone() will fail, won't it?

How about just doing a xdp_return_frame_rx_napi(xdpf) after all the nxdpf
are enqueued?

> > @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct
> > xdp_buff *xdp,
> >                   struct bpf_prog *xdp_prog)
> >  {
> >       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > +     bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
> > +     struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>
> I don't think you need the READ_ONCE here since there's already one
> below?

BTW, I forgot to ask, why don't we need the READ_ONCE for ex_map?
I thought the map and ex_map are two different pointers.

> >       struct bpf_map *map = READ_ONCE(ri->map);

Thanks
Hangbin


* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-08  8:53           ` Hangbin Liu
@ 2020-05-08 14:58             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 22+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-05-08 14:58 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Wed, May 06, 2020 at 12:00:08PM +0200, Toke Høiland-Jørgensen wrote:
>> > No, I haven't test the performance. Do you have any suggestions about how
>> > to test it? I'd like to try forwarding pkts to 10+ ports. But I don't know
>> > how to test the throughput. I don't think netperf or iperf supports
>> > this.
>> 
>> What I usually do when benchmarking XDP_REDIRECT is to just use pktgen
>> (samples/pktgen in the kernel source tree) on another machine,
>> specifically, like this:
>> 
>> ./pktgen_sample03_burst_single_flow.sh  -i enp1s0f1 -d 10.70.2.2 -m ec:0d:9a:db:11:35 -t 4  -s 64
>> 
>> (adjust iface, IP and MAC address to your system, of course). That'll
>> flood the target machine with small UDP packets. On that machine, I then
>> run the 'xdp_redirect_map' program from samples/bpf. The bpf program
>> used by that sample will update an internal counter for every packet,
>> and the userspace prints it out, which gives you the performance (in
>> PPS). So just modifying that sample to using your new multicast helper
>> (and comparing it to regular REDIRECT to a single device) would be a
>> first approximation of a performance test.
>
> Thanks for this method. I will update the sample and do some more tests.

Great!

>> You could do something like:
>> 
>> bool first = true;
>> for (;;) {
>> 
>> [...]
>> 
>>            if (!first) {
>>    		nxdpf = xdpf_clone(xdpf);
>>    		if (unlikely(!nxdpf))
>>    			return -ENOMEM;
>>    		bq_enqueue(dev, nxdpf, dev_rx);
>>            } else {
>>    		bq_enqueue(dev, xdpf, dev_rx);
>>    		first = false;
>>            }
>> }
>> 
>> /* didn't find anywhere to forward to, free buf */
>> if (first)
>>    xdp_return_frame_rx_napi(xdpf);
>
> I think the first xdpf will be consumed by the driver and the later
> xdpf_clone() will failed, won't it?

No, bq_enqueue just sticks the frame on a list, it's not consumed until
after the NAPI cycle ends (and the driver calls xdp_do_flush()).

> How about just do a xdp_return_frame_rx_napi(xdpf) after all nxdpf enqueue?

Yeah, that would be the semantically obvious thing to do, but it is
wasteful in that you end up performing one more clone than you strictly
have to :)

>> > @@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct
>> > xdp_buff *xdp,
>> >                   struct bpf_prog *xdp_prog)
>> >  {
>> >       struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> > +     bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
>> > +     struct bpf_map *ex_map = READ_ONCE(ri->ex_map);
>>
>> I don't think you need the READ_ONCE here since there's already one
>> below?
>
> BTW, I forgot to ask, why we don't need the READ_ONCE for ex_map?
> I though the map and ex_map are two different pointers.

It isn't, but not for the reason I thought, so I can understand why my
comment might have been somewhat confusing (I have been confused by this
myself until just now...).

The READ_ONCE() is not needed because the ex_map field is only ever read
from or written to by the CPU owning the per-cpu pointer. Whereas the
'map' field is manipulated by remote CPUs in bpf_clear_redirect_map().
So you need neither READ_ONCE() nor WRITE_ONCE() on ex_map, just like
there are none on tgt_index and tgt_value.

-Toke



* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-04-24 14:34     ` Toke Høiland-Jørgensen
  2020-05-06  9:14       ` Hangbin Liu
@ 2020-05-18  8:45       ` Hangbin Liu
  2020-05-19 10:15         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 22+ messages in thread
From: Hangbin Liu @ 2020-05-18  8:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi

Hi Toke,

On Fri, Apr 24, 2020 at 04:34:49PM +0200, Toke Høiland-Jørgensen wrote:
> 
> Yeah, the new helper is much cleaner!
> 
> > To achive this I add a new ex_map for struct bpf_redirect_info.
> > in the helper I set tgt_value to NULL to make a difference with
> > bpf_xdp_redirect_map()
> >
> > We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
> > create a exclude map for each interface and just want to exclude the
> > ingress interface.
> >
> > The general data path is kept in net/core/filter.c. The native data
> > path is in kernel/bpf/devmap.c so we can use direct calls to
> > get better performace.
> 
> Got any performance numbers? :)

Recently I tried with pktgen to get the performance numbers. It works
in native mode, although the numbers are not high.

I tested it on a VM with 1 CPU core, forwarding to 7 ports, with a pktgen
config like:
echo "count 10000000" > /proc/net/pktgen/veth0
echo "clone_skb 0" > /proc/net/pktgen/veth0
echo "pkt_size 64" > /proc/net/pktgen/veth0
echo "dst 224.1.1.10" > /proc/net/pktgen/veth0

I got forwarding number like:
Forwarding     159958 pkt/s
Forwarding     160213 pkt/s
Forwarding     160448 pkt/s

But when testing generic mode, the system crashed immediately. The code
path is:
do_xdp_generic()
  - netif_receive_generic_xdp()
    - pskb_expand_head()    <- skb_is_nonlinear(skb)
      - BUG_ON(skb_shared(skb))

So I want to ask: do you have the same issue with pktgen? Any workaround?

> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 2e29a671d67e..1dbe42290223 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> 
> Updates to tools/include should generally go into a separate patch.

Is this a must? It looks strange to separate the same implementation
into two patches.

Thanks
Hangbin


* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-18  8:45       ` Hangbin Liu
@ 2020-05-19 10:15         ` Jesper Dangaard Brouer
  2020-05-20  1:24           ` Hangbin Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2020-05-19 10:15 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: Toke Høiland-Jørgensen, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi, brouer

On Mon, 18 May 2020 16:45:27 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> Hi Toke,
> 
> On Fri, Apr 24, 2020 at 04:34:49PM +0200, Toke Høiland-Jørgensen wrote:
> > 
> > Yeah, the new helper is much cleaner!
> >   
> > > To achive this I add a new ex_map for struct bpf_redirect_info.
> > > in the helper I set tgt_value to NULL to make a difference with
> > > bpf_xdp_redirect_map()
> > >
> > > We also add a flag *BPF_F_EXCLUDE_INGRESS* incase you don't want to
> > > create a exclude map for each interface and just want to exclude the
> > > ingress interface.
> > >
> > > The general data path is kept in net/core/filter.c. The native data
> > > path is in kernel/bpf/devmap.c so we can use direct calls to
> > > get better performace.  
> > 
> > Got any performance numbers? :)  
> 
> Recently I tried with pktgen to get the performance number. It works
> with native mode, although the number looks not high.
> 
> I tested it on VM with 1 cpu core. 

Performance testing on a VM doesn't really make much sense.

> By forwarding to 7 ports, With pktgen
> config like:
> echo "count 10000000" > /proc/net/pktgen/veth0
> echo "clone_skb 0" > /proc/net/pktgen/veth0
> echo "pkt_size 64" > /proc/net/pktgen/veth0
> echo "dst 224.1.1.10" > /proc/net/pktgen/veth0
> 
> I got forwarding number like:
> Forwarding     159958 pkt/s
> Forwarding     160213 pkt/s
> Forwarding     160448 pkt/s
> 
> But when testing generic mode, I got system crashed directly. The code
> path is:
> do_xdp_generic()
>   - netif_receive_generic_xdp()
>     - pskb_expand_head()    <- skb_is_nonlinear(skb)
>       - BUG_ON(skb_shared(skb))
> 
> So I want to ask do you have the same issue with pktgen? Any workaround?

Pktgen is not meant to be used on virtual devices.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [RFC PATCHv2 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-19 10:15         ` Jesper Dangaard Brouer
@ 2020-05-20  1:24           ` Hangbin Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-05-20  1:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi

On Tue, May 19, 2020 at 12:15:12PM +0200, Jesper Dangaard Brouer wrote:
> Performance testing on a VM doesn't really make much sense.
> 
> Pktgen is not meant to be used on virtual devices.

Thanks, I will try on a physical machine.

Cheers
Hangbin


* [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support
  2020-04-15  8:54 [RFC PATCH bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
                   ` (2 preceding siblings ...)
  2020-04-24  8:56 ` [RFC PATCHv2 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-05-23  6:05 ` Hangbin Liu
  2020-05-23  6:05   ` [PATCHv3 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
  2020-05-23  6:05   ` [PATCHv3 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
  3 siblings, 2 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-05-23  6:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

Hi all,

This patchset is for xdp multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple
ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I re-implemented this as a new helper, bpf_redirect_map_multi(),
which accepts two maps: the forwarding map and the exclude map. If users
don't want an exclude map and simply want to stop redirecting back
to the ingress device, they can use the flag BPF_F_EXCLUDE_INGRESS.

The example in patch 2 is functional, but not a lot of effort
has been made on performance optimisation. I did a simple test (pkt size 64)
with pktgen. Here are the test results with BPF_MAP_TYPE_DEVMAP_HASH
arrays:

bpf_redirect_map() with 1 ingress, 1 egress:
generic path: ~1600k pps
native path: ~980k pps

bpf_redirect_map_multi() with 1 ingress, 3 egress:
generic path: ~600k pps
native path: ~480k pps

bpf_redirect_map_multi() with 1 ingress, 9 egress:
generic path: ~125k pps
native path: ~100k pps

bpf_redirect_map_multi() is slower than bpf_redirect_map() because we
loop over the maps and clone the skb/xdpf for each extra target. The
native path is slower than the generic path because pktgen sends skbs.
So the results look reasonable.

Note also that the performance numbers will drop further with large
BPF_MAP_TYPE_DEVMAP arrays.

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.
f) Add rxcnt map to show the packet transmit speed in sample test.
g) Add performance test numbers.

I didn't split the tools/include update into a separate patch because
it is all one logical change, and I have seen others do the same. But I
can re-post and split the patch if you insist.

v2:
After discussing with Jiri, Toke, Jesper and Eelco, we concluded that
v1 relied on a trick that could confuse users. So let's just add a new
helper to make the implementation clearer.

Hangbin Liu (2):
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test

 include/linux/bpf.h                       |  20 +++
 include/linux/filter.h                    |   1 +
 include/net/xdp.h                         |   1 +
 include/uapi/linux/bpf.h                  |  22 ++-
 kernel/bpf/devmap.c                       | 124 ++++++++++++++
 kernel/bpf/verifier.c                     |   6 +
 net/core/filter.c                         | 101 ++++++++++-
 net/core/xdp.c                            |  26 +++
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 133 +++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 112 ++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 198 ++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h            |  22 ++-
 13 files changed, 762 insertions(+), 7 deletions(-)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

-- 
2.25.4


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCHv3 bpf-next 1/2] xdp: add a new helper for dev map multicast support
  2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
@ 2020-05-23  6:05   ` Hangbin Liu
  2020-05-23  6:05   ` [PATCHv3 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test Hangbin Liu
  1 sibling, 0 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-05-23  6:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This patch adds xdp multicast support. In this implementation we add
a new helper that accepts two maps: a forward map and an exclude map.
The packet is redirected to all the interfaces in the *forward map*,
excluding the interfaces in the *exclude map*.

To achieve this I add a new ex_map field to struct bpf_redirect_info.
In the helper I set tgt_value to NULL to distinguish this case from
bpf_xdp_redirect_map().

We also add a flag, *BPF_F_EXCLUDE_INGRESS*, in case you don't want to
create an exclude map for each interface and just want to exclude the
ingress interface.

The generic data path is kept in net/core/filter.c. The native data
path is in kernel/bpf/devmap.c so we can use direct calls for
better performance.

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/bpf.h            |  20 ++++++
 include/linux/filter.h         |   1 +
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  22 +++++-
 kernel/bpf/devmap.c            | 124 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |   6 ++
 net/core/filter.c              | 101 +++++++++++++++++++++++++--
 net/core/xdp.c                 |  26 +++++++
 tools/include/uapi/linux/bpf.h |  22 +++++-
 9 files changed, 316 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index efe8836b5c48..d1c169bec6b5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1240,6 +1240,11 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 
@@ -1377,6 +1382,21 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	return false;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 73d06a39e2d6..5d9c6ac6ade3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -612,6 +612,7 @@ struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	struct bpf_map *ex_map;
 	u32 kern_flags;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 90f11760bd12..967684aa096a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -105,6 +105,7 @@ void xdp_warn(const char *msg, const char *func, const int line);
 #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 /* Convert xdp_buff to xdp_frame */
 static inline
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 97e1fd19ff58..000b0cf961ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3157,6 +3157,20 @@ union bpf_attr {
  *		**bpf_sk_cgroup_id**\ ().
  *	Return
  *		The id is returned or 0 in case the id could not be retrieved.
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map* (which may be NULL).
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3288,7 +3302,8 @@ union bpf_attr {
 	FN(seq_printf),			\
 	FN(seq_write),			\
 	FN(sk_cgroup_id),		\
-	FN(sk_ancestor_cgroup_id),
+	FN(sk_ancestor_cgroup_id),	\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3417,6 +3432,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a51d9fb7a359..ecc5c44a5bab 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -455,6 +455,130 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct calls in the fast path instead of map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	struct bpf_dtab_netdev *in_obj = NULL;
+	u32 key, next_key;
+	int err;
+
+	if (obj->dev->ifindex == exclude_ifindex)
+		return true;
+
+	if (!map)
+		return false;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			in_obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			in_obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (in_obj && in_obj->dev->ifindex == obj->dev->ifindex)
+			return true;
+
+		err = devmap_get_next_key(map, &key, &next_key);
+
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	return false;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  bool exclude_ingress)
+{
+	struct bpf_dtab_netdev *obj = NULL;
+	struct xdp_frame *xdpf, *nxdpf;
+	struct net_device *dev;
+	bool first = true;
+	u32 key, next_key;
+	int err;
+
+	devmap_get_next_key(map, NULL, &key);
+
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(map, key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(map, key);
+			break;
+		default:
+			break;
+		}
+
+		if (!obj || dev_in_exclude_map(obj, ex_map,
+					       exclude_ingress ? dev_rx->ifindex : 0))
+			goto find_next;
+
+		dev = obj->dev;
+
+		if (!dev->netdev_ops->ndo_xdp_xmit)
+			goto find_next;
+
+		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+		if (unlikely(err))
+			goto find_next;
+
+		if (!first) {
+			nxdpf = xdpf_clone(xdpf);
+			if (unlikely(!nxdpf))
+				return -ENOMEM;
+
+			bq_enqueue(dev, nxdpf, dev_rx);
+		} else {
+			bq_enqueue(dev, xdpf, dev_rx);
+			first = false;
+		}
+
+find_next:
+		err = devmap_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+		key = next_key;
+	}
+
+	/* didn't find anywhere to forward to, free buf */
+	if (first)
+		xdp_return_frame_rx_napi(xdpf);
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d2e27dba4ac6..a5857953248d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3946,6 +3946,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_redirect_map_multi &&
 		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
@@ -4038,6 +4039,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map_multi:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
+		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+			goto error;
+		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
 	case BPF_FUNC_sock_map_update:
diff --git a/net/core/filter.c b/net/core/filter.c
index bd2853d23b50..f07eb1408f70 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3473,12 +3473,17 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    struct bpf_map *ex_map, bool exclude_ingress)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
+		if (fwd)
+			return dev_map_enqueue(fwd, xdp, dev_rx);
+		else
+			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map,
+						     exclude_ingress);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3534,6 +3539,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = ri->ex_map;
 	struct bpf_map *map = READ_ONCE(ri->map);
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
@@ -3541,6 +3548,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (unlikely(!map)) {
@@ -3552,7 +3560,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, exclude_ingress);
 	}
 
 	if (unlikely(err))
@@ -3566,6 +3574,50 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, struct bpf_map *ex_map,
+				  bool exclude_ingress)
+
+{
+	struct bpf_dtab_netdev *dst;
+	struct sk_buff *nskb;
+	u32 key, next_key;
+	int err;
+	void *fwd;
+
+	/* Get first key from forward map */
+	map->ops->map_get_next_key(map, NULL, &key);
+
+	for (;;) {
+		fwd = __xdp_map_lookup_elem(map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (dev_in_exclude_map(dst, ex_map,
+					       exclude_ingress ? dev->ifindex : 0))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -ENOMEM;
+
+			err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+			if (unlikely(err))
+				return err;
+		}
+
+find_next:
+		err = map->ops->map_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	consume_skb(skb);
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3573,19 +3625,29 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	bool exclude_ingress = !!(ri->flags & BPF_F_EXCLUDE_INGRESS);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err = 0;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
 	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		struct bpf_dtab_netdev *dst = fwd;
+		if (fwd) {
+			struct bpf_dtab_netdev *dst = fwd;
+
+			err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		} else {
+			/* Deal with multicast maps */
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ex_map, exclude_ingress);
+		}
 
-		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
@@ -3699,6 +3761,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
+	   struct bpf_map *, ex_map, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(!map || flags > BPF_F_EXCLUDE_INGRESS))
+		return XDP_ABORTED;
+
+	ri->tgt_index = 0;
+	ri->tgt_value = NULL;
+	ri->flags = flags;
+	ri->ex_map = ex_map;
+
+	WRITE_ONCE(ri->map, map);
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
+	.func           = bpf_xdp_redirect_map_multi,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -6363,6 +6452,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_redirect_map_multi:
+		return &bpf_xdp_redirect_map_multi_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 90f44f382115..acdc63833b1f 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -475,3 +475,29 @@ void xdp_warn(const char *msg, const char *func, const int line)
 	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
 };
 EXPORT_SYMBOL_GPL(xdp_warn);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 97e1fd19ff58..000b0cf961ea 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3157,6 +3157,20 @@ union bpf_attr {
  *		**bpf_sk_cgroup_id**\ ().
  *	Return
  *		The id is returned or 0 in case the id could not be retrieved.
+ *
+ * int bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		Redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map* (which may be NULL).
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map(), which supports redirecting
+ * 		packet to a specific ifindex in the map.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3288,7 +3302,8 @@ union bpf_attr {
 	FN(seq_printf),			\
 	FN(seq_write),			\
 	FN(sk_cgroup_id),		\
-	FN(sk_ancestor_cgroup_id),
+	FN(sk_ancestor_cgroup_id),	\
+	FN(redirect_map_multi),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3417,6 +3432,11 @@ enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.25.4


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCHv3 bpf-next 2/2] sample/bpf: add xdp_redirect_map_multicast test
  2020-05-23  6:05 ` [PATCHv3 bpf-next 0/2] xdp: add dev map multicast support Hangbin Liu
  2020-05-23  6:05   ` [PATCHv3 bpf-next 1/2] xdp: add a new helper for " Hangbin Liu
@ 2020-05-23  6:05   ` Hangbin Liu
  1 sibling, 0 replies; 22+ messages in thread
From: Hangbin Liu @ 2020-05-23  6:05 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, Hangbin Liu

This is a sample for xdp multicast. In the sample we have 3 forward
groups and 1 exclude group. It redirects each interface's packets to
all the interfaces in the forward group, excluding the interfaces in
the exclude map.

For more testing details, please see the test description in
xdp_redirect_map_multi.sh.

v3: add rxcnt map to show the packet transmit speed.
v2: no update.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi.sh     | 135 +++++++++++++++
 samples/bpf/xdp_redirect_map_multi_kern.c | 113 +++++++++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 197 ++++++++++++++++++++++
 4 files changed, 448 insertions(+)
 create mode 100755 samples/bpf/xdp_redirect_map_multi.sh
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 8403e4762306..000709bb89c3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -97,6 +98,7 @@ test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
 xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -156,6 +158,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multi_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multi.sh b/samples/bpf/xdp_redirect_map_multi.sh
new file mode 100755
index 000000000000..bbf10ca06720
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi.sh
@@ -0,0 +1,135 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  ... init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |  ...
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Forward multicast groups:
+#     Forward group all has interfaces: veth1, veth2, veth3, veth4, ... (All traffic except IPv4, IPv6)
+#     Forward group v4 has interfaces: veth1, veth3, veth4, ... (For IPv4 traffic only)
+#     Forward group v6 has interfaces: veth2, veth3, veth4, ... (For IPv6 traffic only)
+# Exclude Groups:
+#     Exclude group: veth3 (assume ns3 is in black list)
+#
+# Test modules:
+# XDP modes: generic, native
+# map types: group v4 use DEVMAP, others use DEVMAP_HASH
+#
+# Test cases:
+#     ARP (we didn't exclude ns3 in kern.c for ARP):
+#        ns1 -> gw: ns2, ns3, ns4 should receive the arp request
+#     IPv4:
+#        ns1 -> ns2 (fail), ns1 -> ns3 (fail), ns1 -> ns4 (pass)
+#     IPv6
+#        ns2 -> ns1 (fail), ns2 -> ns3 (fail), ns2 -> ns4 (pass)
+#
+
+
+# netns numbers
+NUM=10
+IFACES=""
+DRV_MODE="generic drv"
+
+test_pass()
+{
+	echo "Pass: $@"
+}
+
+test_fail()
+{
+	echo "fail: $@"
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip netns del ns$i
+	done
+}
+
+setup_ns()
+{
+	local mode=$1
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth0 type veth peer name veth$i
+	        ip link set veth0 netns ns$i
+		ip -n ns$i link set veth0 up
+		ip link set veth$i up
+
+		ip -n ns$i addr add 192.0.2.$i/24 dev veth0
+		ip -n ns$i addr add 2001:db8::$i/24 dev veth0
+		ip -n ns$i link set veth0 xdp$mode obj \
+			xdp_redirect_map_multi_kern.o sec xdp_redirect_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+	done
+}
+
+do_ping_tests()
+{
+	local drv_mode=$1
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${drv_mode}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${drv_mode}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${drv_mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-2_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-2" || test_fail "$drv_mode arp ns1-2"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-3_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-3" || test_fail "$drv_mode arp ns1-3"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-4_${drv_mode}.log && \
+		test_pass "$drv_mode arp ns1-4" || test_fail "$drv_mode arp ns1-4"
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-2" || test_pass "$drv_mode ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping ns1-3" || test_pass "$drv_mode ping ns1-3"
+	ip netns exec ns1 ping 192.0.2.4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping ns1-4" || test_fail "$drv_mode ping ns1-4"
+
+	# ping6 test
+	ip netns exec ns2 ping6 2001:db8::1 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-1" || test_pass "$drv_mode ping6 ns2-1"
+	ip netns exec ns2 ping6 2001:db8::3 -c 4 &> /dev/null && \
+		test_fail "$drv_mode ping6 ns2-3" || test_pass "$drv_mode ping6 ns2-3"
+	ip netns exec ns2 ping6 2001:db8::4 -c 4 &> /dev/null && \
+		test_pass "$drv_mode ping6 ns2-4" || test_fail "$drv_mode ping6 ns2-4"
+}
+
+do_tests()
+{
+	local drv_mode=$1
+	local drv_p
+
+	[ ${drv_mode} == "drv" ] && drv_p="-N" || drv_p="-S"
+
+	# run `ulimit -l unlimited` if you get errors like
+	# libbpf: Error in bpf_object__probe_global_data():Operation not permitted(1).
+	./xdp_redirect_map_multi $drv_p $IFACES &> xdp_${drv_mode}.log &
+	xdp_pid=$!
+	sleep 10
+
+	do_ping_tests $drv_mode
+
+	kill $xdp_pid
+}
+
+for mode in ${DRV_MODE}; do
+	setup_ns $mode
+	do_tests $mode
+	sleep 10
+	clean_up
+	sleep 5
+done
diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
new file mode 100644
index 000000000000..81f71461a252
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_kern.c
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+/* In this sample we will use 3 forward maps and 1 exclude map to
+ * show how to use the helper bpf_redirect_map_multi().
+ *
+ * In the real world, there may be multiple forward maps and exclude maps.
+ * You can use a map-in-map type to store the forward and exclude maps, e.g.:
+ * forward_map_in_map[group_a_index] = forward_group_a_map
+ * forward_map_in_map[group_b_index] = forward_group_b_map
+ * exclude_map_in_map[iface_1_index] = iface_1_exclude_map
+ * exclude_map_in_map[iface_2_index] = iface_2_exclude_map
+ * Then store the forward group indexes based on IP/MAC policy in another
+ * hash map, e.g.:
+ * mcast_route_map[hash(subnet_a)] = group_a_index
+ * mcast_route_map[hash(subnet_b)] = group_b_index
+ *
+ * You can init the maps in user.c, look up the forward group index in
+ * mcast_route_map by key hash(subnet) in kern.c, and then find the
+ * forward group by that index. You can also get the exclude map
+ * simply by iface index in exclude_map_in_map.
+ */
+struct bpf_map_def SEC("maps") forward_map_v4 = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 4096,
+};
+
+struct bpf_map_def SEC("maps") forward_map_v6 = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") forward_map_all = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") exclude_map = {
+	.type = BPF_MAP_TYPE_DEVMAP_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 128,
+};
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 1,
+};
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	long *value;
+	u16 h_proto;
+	u32 key = 0;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	/* count packet in global counter */
+	value = bpf_map_lookup_elem(&rxcnt, &key);
+	if (value)
+		*value += 1;
+
+	if (h_proto == htons(ETH_P_IP))
+		return bpf_redirect_map_multi(&forward_map_v4, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else if (h_proto == htons(ETH_P_IPV6))
+		return bpf_redirect_map_multi(&forward_map_v6, &exclude_map,
+					      BPF_F_EXCLUDE_INGRESS);
+	else
+		return bpf_redirect_map_multi(&forward_map_all, NULL,
+					      BPF_F_EXCLUDE_INGRESS);
+}
+
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
new file mode 100644
index 000000000000..7339ce4c7f9c
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static int ifaces[MAX_IFACE_NUM] = {};
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int rxcnt;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_cpus];
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		__u64 sum = 0;
+		__u32 key = 0;
+		int i;
+
+		sleep(interval);
+		assert(bpf_map_lookup_elem(rxcnt, &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += (values[i] - prev[i]);
+		if (sum)
+			printf("Forwarding %10llu pkt/s\n", sum / interval);
+		memcpy(prev, values, sizeof(values));
+	}
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int prog_fd, group_all, group_v4, group_v6, exclude;
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	int i, ret, opt, ifindex;
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNF")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+	if (optind == argc) {
+		usage(basename(argv[0]));
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or index");
+			return 1;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	group_all = bpf_object__find_map_fd_by_name(obj, "forward_map_all");
+	group_v4 = bpf_object__find_map_fd_by_name(obj, "forward_map_v4");
+	group_v6 = bpf_object__find_map_fd_by_name(obj, "forward_map_v6");
+	exclude = bpf_object__find_map_fd_by_name(obj, "exclude_map");
+	rxcnt = bpf_object__find_map_fd_by_name(obj, "rxcnt");
+
+	if (group_all < 0 || group_v4 < 0 || group_v6 < 0 || exclude < 0 ||
+	    rxcnt < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; i < MAX_IFACE_NUM && ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* Add all the interfaces to group all */
+		ret = bpf_map_update_elem(group_all, &ifindex, &ifindex, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* For testing: leave the 2nd interface out of group v4 */
+		if (i != 1) {
+			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: leave the 1st interface out of group v6 */
+		if (i != 0) {
+			ret = bpf_map_update_elem(group_v6, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* For testing: add the 3rd interface to the exclude map */
+		if (i == 2) {
+			ret = bpf_map_update_elem(exclude, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+	}
+
+	poll_stats(2);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.25.4

