bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support
@ 2021-04-14 12:26 Hangbin Liu
  2021-04-14 12:26 ` [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue Hangbin Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 39+ messages in thread
From: Hangbin Liu @ 2021-04-14 12:26 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, Hangbin Liu

Hi,

This patchset is a new implementation for XDP multicast support based
on my previous 2 maps implementation[1]. The reason is that Daniel think
the exclude map implementation is missing proper bond support in XDP
context. And there is a plan to add native XDP bonding support. Adding a
exclude map in the helper also increase the complex of verifier and has
draw back of performace.

The new implementation just add two new flags BPF_F_BROADCAST and
BPF_F_EXCLUDE_INGRESS to extend xdp_redirect_map for broadcast support.

With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when do broadcasting.

The patchv6 link is here[2].

[1] https://lore.kernel.org/bpf/20210223125809.1376577-1-liuhangbin@gmail.com
[2] https://lore.kernel.org/bpf/20210414012341.3992365-1-liuhangbin@gmail.com

v7: No need to free xdpf in dev_map_enqueue_clone() if xdpf_clone failed.
v6: Fix a skb leak in the error path for generic XDP
v5: Just walk the map directly to get interfaces as get_next_key() of devmap
    hash may restart looping from the first key if the device get removed.
    After update the performace has improved 10% compired with v4.
v4: Fix flags never cleared issue in patch 02. Update selftest to cover this.
v3: Rebase the code based on latest bpf-next
v2: fix flag renaming issue in patch 02

Hangbin Liu (3):
  xdp: extend xdp_redirect_map with broadcast support
  sample/bpf: add xdp_redirect_map_multi for redirect_map broadcast test
  selftests/bpf: add xdp_redirect_multi test

Jesper Dangaard Brouer (1):
  bpf: run devmap xdp_prog on flush instead of bulk enqueue

 include/linux/bpf.h                           |  20 ++
 include/linux/filter.h                        |  18 +-
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  17 +-
 kernel/bpf/cpumap.c                           |   3 +-
 kernel/bpf/devmap.c                           | 304 +++++++++++++++---
 net/core/filter.c                             |  33 +-
 net/core/xdp.c                                |  29 ++
 net/xdp/xskmap.c                              |   3 +-
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  87 +++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 302 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  17 +-
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       |  99 ++++++
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 205 ++++++++++++
 .../selftests/bpf/xdp_redirect_multi.c        | 236 ++++++++++++++
 17 files changed, 1316 insertions(+), 64 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

-- 
2.26.3


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-14 12:26 [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
@ 2021-04-14 12:26 ` Hangbin Liu
  2021-04-15  0:17   ` Martin KaFai Lau
  2021-04-14 12:26 ` [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Hangbin Liu @ 2021-04-14 12:26 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, Hangbin Liu

From: Jesper Dangaard Brouer <brouer@redhat.com>

This changes the devmap XDP program support to run the program when the
bulk queue is flushed instead of before the frame is enqueued. This has
a couple of benefits:

- It "sorts" the packets by destination devmap entry, and then runs the
  same BPF program on all the packets in sequence. This ensures that we
  keep the XDP program and destination device properties hot in I-cache.

- It makes the multicast implementation simpler because it can just
  enqueue packets using bq_enqueue() without having to deal with the
  devmap program at all.

The drawback is that if the devmap program drops the packet, the enqueue
step is redundant. However, arguably this is mostly visible in a
micro-benchmark, and with more mixed traffic the I-cache benefit should
win out. The performance impact of just this patch is as follows:

When bq_xmit_all() is called from bq_enqueue(), another packet will
always be enqueued immediately after, so clearing dev_rx, xdp_prog and
flush_node in bq_xmit_all() is redundant. Move the clear to __dev_flush(),
and only check them once in bq_enqueue() since they are all modified
together.

Using 10Gb i40e NIC, do XDP_DROP on veth peer, with xdp_redirect_map in
sample/bpf, send pkts via pktgen cmd:
./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

There are about +/- 0.1M deviation for native testing, the performance
improved for the base-case, but some drop back with xdp devmap prog attached.

Version          | Test                           | Generic | Native | Native + 2nd xdp_prog
5.12 rc4         | xdp_redirect_map   i40e->i40e  |    1.9M |   9.6M |  8.4M
5.12 rc4         | xdp_redirect_map   i40e->veth  |    1.7M |  11.7M |  9.8M
5.12 rc4 + patch | xdp_redirect_map   i40e->i40e  |    1.9M |   9.8M |  8.0M
5.12 rc4 + patch | xdp_redirect_map   i40e->veth  |    1.7M |  12.0M |  9.4M

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

---
v3: rebase the code based on Lorenzo's "Move drop error path to devmap
    for XDP_REDIRECT"
v2: no update
---
 kernel/bpf/devmap.c | 127 ++++++++++++++++++++++++++------------------
 1 file changed, 76 insertions(+), 51 deletions(-)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index aa516472ce46..3980fb3bfb09 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue {
 	struct list_head flush_node;
 	struct net_device *dev;
 	struct net_device *dev_rx;
+	struct bpf_prog *xdp_prog;
 	unsigned int count;
 };
 
@@ -326,22 +327,71 @@ bool dev_map_can_have_prog(struct bpf_map *map)
 	return false;
 }
 
+static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
+				struct xdp_frame **frames, int n,
+				struct net_device *dev)
+{
+	struct xdp_txq_info txq = { .dev = dev };
+	struct xdp_buff xdp;
+	int i, nframes = 0;
+
+	for (i = 0; i < n; i++) {
+		struct xdp_frame *xdpf = frames[i];
+		u32 act;
+		int err;
+
+		xdp_convert_frame_to_buff(xdpf, &xdp);
+		xdp.txq = &txq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+		switch (act) {
+		case XDP_PASS:
+			err = xdp_update_frame_from_buff(&xdp, xdpf);
+			if (unlikely(err < 0))
+				xdp_return_frame_rx_napi(xdpf);
+			else
+				frames[nframes++] = xdpf;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			fallthrough;
+		case XDP_ABORTED:
+			trace_xdp_exception(dev, xdp_prog, act);
+			fallthrough;
+		case XDP_DROP:
+			xdp_return_frame_rx_napi(xdpf);
+			break;
+		}
+	}
+	return nframes; /* sent frames count */
+}
+
 static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
 {
 	struct net_device *dev = bq->dev;
-	int sent = 0, err = 0;
+	int sent = 0, drops = 0, err = 0;
+	unsigned int cnt = bq->count;
+	int to_send = cnt;
 	int i;
 
-	if (unlikely(!bq->count))
+	if (unlikely(!cnt))
 		return;
 
-	for (i = 0; i < bq->count; i++) {
+	for (i = 0; i < cnt; i++) {
 		struct xdp_frame *xdpf = bq->q[i];
 
 		prefetch(xdpf);
 	}
 
-	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
+	if (bq->xdp_prog) {
+		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
+		if (!to_send)
+			goto out;
+
+		drops = cnt - to_send;
+	}
+
+	sent = dev->netdev_ops->ndo_xdp_xmit(dev, to_send, bq->q, flags);
 	if (sent < 0) {
 		/* If ndo_xdp_xmit fails with an errno, no frames have
 		 * been xmit'ed.
@@ -353,13 +403,13 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
 	/* If not all frames have been transmitted, it is our
 	 * responsibility to free them
 	 */
-	for (i = sent; unlikely(i < bq->count); i++)
+	for (i = sent; unlikely(i < to_send); i++)
 		xdp_return_frame_rx_napi(bq->q[i]);
 
-	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, bq->count - sent, err);
-	bq->dev_rx = NULL;
+out:
+	drops = cnt - sent;
 	bq->count = 0;
-	__list_del_clearprev(&bq->flush_node);
+	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);
 }
 
 /* __dev_flush is called from xdp_do_flush() which _must_ be signaled
@@ -377,8 +427,12 @@ void __dev_flush(void)
 	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
 	struct xdp_dev_bulk_queue *bq, *tmp;
 
-	list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
+	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_xmit_all(bq, XDP_XMIT_FLUSH);
+		bq->dev_rx = NULL;
+		bq->xdp_prog = NULL;
+		__list_del_clearprev(&bq->flush_node);
+	}
 }
 
 /* rcu_read_lock (from syscall and BPF contexts) ensures that if a delete and/or
@@ -401,7 +455,7 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
  * Thus, safe percpu variable access.
  */
 static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
-		       struct net_device *dev_rx)
+		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
 {
 	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
 	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
@@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 	/* Ingress dev_rx will be the same for all xdp_frame's in
 	 * bulk_queue, because bq stored per-CPU and must be flushed
 	 * from net_device drivers NAPI func end.
+	 *
+	 * Do the same with xdp_prog and flush_list since these fields
+	 * are only ever modified together.
 	 */
-	if (!bq->dev_rx)
+	if (!bq->dev_rx) {
 		bq->dev_rx = dev_rx;
+		bq->xdp_prog = xdp_prog;
+		list_add(&bq->flush_node, flush_list);
+	}
 
 	bq->q[bq->count++] = xdpf;
-
-	if (!bq->flush_node.prev)
-		list_add(&bq->flush_node, flush_list);
 }
 
 static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
-			       struct net_device *dev_rx)
+				struct net_device *dev_rx,
+				struct bpf_prog *xdp_prog)
 {
 	struct xdp_frame *xdpf;
 	int err;
@@ -439,42 +497,14 @@ static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 	if (unlikely(!xdpf))
 		return -EOVERFLOW;
 
-	bq_enqueue(dev, xdpf, dev_rx);
+	bq_enqueue(dev, xdpf, dev_rx, xdp_prog);
 	return 0;
 }
 
-static struct xdp_buff *dev_map_run_prog(struct net_device *dev,
-					 struct xdp_buff *xdp,
-					 struct bpf_prog *xdp_prog)
-{
-	struct xdp_txq_info txq = { .dev = dev };
-	u32 act;
-
-	xdp_set_data_meta_invalid(xdp);
-	xdp->txq = &txq;
-
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
-	switch (act) {
-	case XDP_PASS:
-		return xdp;
-	case XDP_DROP:
-		break;
-	default:
-		bpf_warn_invalid_xdp_action(act);
-		fallthrough;
-	case XDP_ABORTED:
-		trace_xdp_exception(dev, xdp_prog, act);
-		break;
-	}
-
-	xdp_return_buff(xdp);
-	return NULL;
-}
-
 int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx)
 {
-	return __xdp_enqueue(dev, xdp, dev_rx);
+	return __xdp_enqueue(dev, xdp, dev_rx, NULL);
 }
 
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
@@ -482,12 +512,7 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 {
 	struct net_device *dev = dst->dev;
 
-	if (dst->xdp_prog) {
-		xdp = dev_map_run_prog(dev, xdp, dst->xdp_prog);
-		if (!xdp)
-			return 0;
-	}
-	return __xdp_enqueue(dev, xdp, dev_rx);
+	return __xdp_enqueue(dev, xdp, dev_rx, dst->xdp_prog);
 }
 
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support
  2021-04-14 12:26 [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
  2021-04-14 12:26 ` [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue Hangbin Liu
@ 2021-04-14 12:26 ` Hangbin Liu
  2021-04-15  0:23   ` Martin KaFai Lau
  2021-04-14 12:26 ` [PATCHv7 bpf-next 3/4] sample/bpf: add xdp_redirect_map_multi for redirect_map broadcast test Hangbin Liu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Hangbin Liu @ 2021-04-14 12:26 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, Hangbin Liu

This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
extend xdp_redirect_map for broadcast support.

With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when do broadcasting.

When getting the devices in dev hash map via dev_map_hash_get_next_key(),
there is a possibility that we fall back to the first key when a device
was removed. This will duplicate packets on some interfaces. So just walk
the whole buckets to avoid this issue. For dev array map, we also walk the
whole map to find valid interfaces.

Function bpf_clear_redirect_map() was removed in
commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
Add it back as we need to use ri->map again.

Here is the performance result by using 10Gb i40e NIC, do XDP_DROP on
veth peer, run xdp_redirect_{map, map_multi} in sample/bpf and send pkts
via pktgen cmd:
./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

There are some drop back as we need to loop the map and get each interface.

Version          | Test                                | Generic | Native
5.12 rc4         | redirect_map        i40e->i40e      |    1.9M |  9.6M
5.12 rc4         | redirect_map        i40e->veth      |    1.7M | 11.7M
5.12 rc4 + patch | redirect_map        i40e->i40e      |    1.9M |  9.3M
5.12 rc4 + patch | redirect_map        i40e->veth      |    1.7M | 11.4M
5.12 rc4 + patch | redirect_map multi  i40e->i40e      |    1.9M |  8.9M
5.12 rc4 + patch | redirect_map multi  i40e->veth      |    1.7M | 10.9M
5.12 rc4 + patch | redirect_map multi  i40e->mlx4+veth |    1.2M |  3.8M

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

---
v7:
no need to free xdpf in dev_map_enqueue_clone() if xdpf_clone failed.
Also return -EOVERFLOW if xdp_convert_buff_to_frame() failed the same
as other caller did.

v6:
Fix a skb leak in the error path for generic XDP

v5:
a) use xchg() instead of READ/WRITE_ONCE and no need to clear ri->flags
   in xdp_do_redirect()
b) Do not use get_next_key() as we may restart looping from the first key
   when remove/update a dev in hash map. Just walk the map directly to
   get all the devices and ignore the new added/deleted objects.
c) Loop all the array map instead stop at the first hole.

v4:
a) add a new argument flag_mask to __bpf_xdp_redirect_map() filter out
invalid map.
b) __bpf_xdp_redirect_map() sets the map pointer if the broadcast flag
is set and clears it if the flag isn't set
c) xdp_do_redirect() does the READ_ONCE/WRITE_ONCE on ri->map to check
if we should enqueue multi

v3:
a) Rebase the code on Björn's "bpf, xdp: Restructure redirect actions".
   - Add struct bpf_map *map back to struct bpf_redirect_info as we need
     it for multicast.
   - Add bpf_clear_redirect_map() back for devmap.c
   - Add devmap_lookup_elem() as we need it in general path.
b) remove tmp_key in devmap_get_next_obj()

v2: Fix flag renaming issue in v1
---
 include/linux/bpf.h            |  20 ++++
 include/linux/filter.h         |  18 +++-
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  17 +++-
 kernel/bpf/cpumap.c            |   3 +-
 kernel/bpf/devmap.c            | 181 ++++++++++++++++++++++++++++++++-
 net/core/filter.c              |  33 +++++-
 net/core/xdp.c                 |  29 ++++++
 net/xdp/xskmap.c               |   3 +-
 tools/include/uapi/linux/bpf.h |  17 +++-
 10 files changed, 308 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ff8cd68c01b3..ab6bde1f3b91 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1496,8 +1496,13 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+			   struct bpf_prog *xdp_prog, struct bpf_map *map,
+			   bool exclude_ingress);
 bool dev_map_can_have_prog(struct bpf_map *map);
 
 void __cpu_map_flush(void);
@@ -1665,6 +1670,13 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
@@ -1674,6 +1686,14 @@ static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
 	return 0;
 }
 
+static inline
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+			   struct bpf_prog *xdp_prog, struct bpf_map *map,
+			   bool exclude_ingress)
+{
+	return 0;
+}
+
 static inline void __cpu_map_flush(void)
 {
 }
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9a09547bc7ba..e4885b42d754 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -646,6 +646,7 @@ struct bpf_redirect_info {
 	u32 flags;
 	u32 tgt_index;
 	void *tgt_value;
+	struct bpf_map *map;
 	u32 map_id;
 	enum bpf_map_type map_type;
 	u32 kern_flags;
@@ -1464,17 +1465,18 @@ static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol,
 }
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex, u64 flags,
+static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex,
+						  u64 flags, u64 flag_mask,
 						  void *lookup_elem(struct bpf_map *map, u32 key))
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	/* Lower bits of the flags are used as return code on lookup failure */
-	if (unlikely(flags > XDP_TX))
+	if (unlikely(flags & ~(BPF_F_ACTION_MASK | flag_mask)))
 		return XDP_ABORTED;
 
 	ri->tgt_value = lookup_elem(map, ifindex);
-	if (unlikely(!ri->tgt_value)) {
+	if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
 		/* If the lookup fails we want to clear out the state in the
 		 * redirect_info struct completely, so that if an eBPF program
 		 * performs multiple lookups, the last one always takes
@@ -1482,13 +1484,21 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 		 */
 		ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */
 		ri->map_type = BPF_MAP_TYPE_UNSPEC;
-		return flags;
+		return flags & BPF_F_ACTION_MASK;
 	}
 
 	ri->tgt_index = ifindex;
 	ri->map_id = map->id;
 	ri->map_type = map->map_type;
 
+	if (flags & BPF_F_BROADCAST) {
+		WRITE_ONCE(ri->map, map);
+		ri->flags = flags;
+	} else {
+		WRITE_ONCE(ri->map, NULL);
+		ri->flags = 0;
+	}
+
 	return XDP_REDIRECT;
 }
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index a5bc214a49d9..5533f0ab2afc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -170,6 +170,7 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 					 struct net_device *dev);
 int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 static inline
 void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 85c924bc21b1..b178f5b0d3f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2534,8 +2534,12 @@ union bpf_attr {
  * 		The lower two bits of *flags* are used as the return code if
  * 		the map lookup fails. This is so that the return value can be
  * 		one of the XDP program return codes up to **XDP_TX**, as chosen
- * 		by the caller. Any higher bits in the *flags* argument must be
- * 		unset.
+ * 		by the caller. The higher bits of *flags* can be set to
+ * 		BPF_F_BROADCAST or BPF_F_EXCLUDE_INGRESS as defined below.
+ *
+ * 		With BPF_F_BROADCAST the packet will be broadcasted to all the
+ * 		interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress
+ * 		interface will be excluded when do broadcasting.
  *
  * 		See also **bpf_redirect**\ (), which only supports redirecting
  * 		to an ifindex, but doesn't require a map to do so.
@@ -5052,6 +5056,15 @@ enum {
 	BPF_F_BPRM_SECUREEXEC	= (1ULL << 0),
 };
 
+/* Flags for bpf_redirect_map helper */
+enum {
+	BPF_F_BROADCAST		= (1ULL << 3),
+	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
+};
+
+#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX)
+#define BPF_F_REDIR_MASK (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 0cf2791d5099..2c33a7a09783 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -601,7 +601,8 @@ static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 static int cpu_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __cpu_map_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, 0,
+				      __cpu_map_lookup_elem);
 }
 
 static int cpu_map_btf_id;
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3980fb3bfb09..9c860f5a467a 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -198,6 +198,7 @@ static void dev_map_free(struct bpf_map *map)
 	list_del_rcu(&dtab->list);
 	spin_unlock(&dev_map_lock);
 
+	bpf_clear_redirect_map(map);
 	synchronize_rcu();
 
 	/* Make sure prior __dev_map_entry_free() have completed. */
@@ -515,6 +516,99 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx, dst->xdp_prog);
 }
 
+static bool is_valid_dst(struct bpf_dtab_netdev *obj, struct xdp_buff *xdp,
+			 int exclude_ifindex)
+{
+	if (!obj || obj->dev->ifindex == exclude_ifindex ||
+	    !obj->dev->netdev_ops->ndo_xdp_xmit)
+		return false;
+
+	if (xdp_ok_fwd_dev(obj->dev, xdp->data_end - xdp->data))
+		return false;
+
+	return true;
+}
+
+static int dev_map_enqueue_clone(struct bpf_dtab_netdev *obj,
+				 struct net_device *dev_rx,
+				 struct xdp_frame *xdpf)
+{
+	struct xdp_frame *nxdpf;
+
+	nxdpf = xdpf_clone(xdpf);
+	if (!nxdpf)
+		return -ENOMEM;
+
+	bq_enqueue(obj->dev, nxdpf, dev_rx, obj->xdp_prog);
+
+	return 0;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	int exclude_ifindex = exclude_ingress ? dev_rx->ifindex : 0;
+	struct bpf_dtab_netdev *dst, *last_dst = NULL;
+	struct hlist_head *head;
+	struct hlist_node *next;
+	struct xdp_frame *xdpf;
+	unsigned int i;
+	int err;
+
+	xdpf = xdp_convert_buff_to_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+		for (i = 0; i < map->max_entries; i++) {
+			dst = READ_ONCE(dtab->netdev_map[i]);
+			if (!is_valid_dst(dst, xdp, exclude_ifindex))
+				continue;
+
+			/* we only need n-1 clones; last_dst enqueued below */
+			if (!last_dst) {
+				last_dst = dst;
+				continue;
+			}
+
+			err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
+			if (err)
+				return err;
+
+			last_dst = dst;
+		}
+	} else { /* BPF_MAP_TYPE_DEVMAP_HASH */
+		for (i = 0; i < dtab->n_buckets; i++) {
+			head = dev_map_index_hash(dtab, i);
+			hlist_for_each_entry_safe(dst, next, head, index_hlist) {
+				if (!is_valid_dst(dst, xdp, exclude_ifindex))
+					continue;
+
+				/* we only need n-1 clones; last_dst enqueued below */
+				if (!last_dst) {
+					last_dst = dst;
+					continue;
+				}
+
+				err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
+				if (err)
+					return err;
+
+				last_dst = dst;
+			}
+		}
+	}
+
+	/* consume the last copy of the frame */
+	if (last_dst)
+		bq_enqueue(last_dst->dev, xdpf, dev_rx, last_dst->xdp_prog);
+	else
+		xdp_return_frame_rx_napi(xdpf); /* dtab is empty */
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
@@ -529,6 +623,87 @@ int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 	return 0;
 }
 
+static int dev_map_redirect_clone(struct bpf_dtab_netdev *dst,
+				  struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog)
+{
+	struct sk_buff *nskb;
+	int err;
+
+	nskb = skb_clone(skb, GFP_ATOMIC);
+	if (!nskb)
+		return -ENOMEM;
+
+	err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+	if (unlikely(err)) {
+		consume_skb(nskb);
+		return err;
+	}
+
+	return 0;
+}
+
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+			   struct bpf_prog *xdp_prog, struct bpf_map *map,
+			   bool exclude_ingress)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	int exclude_ifindex = exclude_ingress ? dev->ifindex : 0;
+	struct bpf_dtab_netdev *dst, *last_dst = NULL;
+	struct hlist_head *head;
+	struct hlist_node *next;
+	unsigned int i;
+	int err;
+
+	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+		for (i = 0; i < map->max_entries; i++) {
+			dst = READ_ONCE(dtab->netdev_map[i]);
+			if (!dst || dst->dev->ifindex == exclude_ifindex)
+				continue;
+
+			/* we only need n-1 clones; last_dst enqueued below */
+			if (!last_dst) {
+				last_dst = dst;
+				continue;
+			}
+
+			err = dev_map_redirect_clone(last_dst, skb, xdp_prog);
+			if (err)
+				return err;
+
+			last_dst = dst;
+		}
+	} else { /* BPF_MAP_TYPE_DEVMAP_HASH */
+		for (i = 0; i < dtab->n_buckets; i++) {
+			head = dev_map_index_hash(dtab, i);
+			hlist_for_each_entry_safe(dst, next, head, index_hlist) {
+				if (!dst || dst->dev->ifindex == exclude_ifindex)
+					continue;
+
+				/* we only need n-1 clones; last_dst enqueued below */
+				if (!last_dst) {
+					last_dst = dst;
+					continue;
+				}
+
+				err = dev_map_redirect_clone(last_dst, skb, xdp_prog);
+				if (err)
+					return err;
+
+				last_dst = dst;
+			}
+		}
+	}
+
+	/* consume the first skb and return */
+	if (last_dst)
+		return dev_map_generic_redirect(last_dst, skb, xdp_prog);
+
+	/* dtab is empty */
+	consume_skb(skb);
+	return 0;
+}
+
 static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
@@ -755,12 +930,14 @@ static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value,
 
 static int dev_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __dev_map_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_REDIR_MASK,
+				      __dev_map_lookup_elem);
 }
 
 static int dev_hash_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __dev_map_hash_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_REDIR_MASK,
+				      __dev_map_hash_lookup_elem);
 }
 
 static int dev_map_btf_id;
diff --git a/net/core/filter.c b/net/core/filter.c
index cae56d08a670..afec192c3b21 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3926,6 +3926,23 @@ void xdp_do_flush(void)
 }
 EXPORT_SYMBOL_GPL(xdp_do_flush);
 
+void bpf_clear_redirect_map(struct bpf_map *map)
+{
+	struct bpf_redirect_info *ri;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
+		/* Avoid polluting remote cacheline due to writes if
+		 * not needed. Once we pass this test, we need the
+		 * cmpxchg() to make sure it hasn't been changed in
+		 * the meantime by remote CPU.
+		 */
+		if (unlikely(READ_ONCE(ri->map) == map))
+			cmpxchg(&ri->map, map, NULL);
+	}
+}
+
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
@@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	enum bpf_map_type map_type = ri->map_type;
 	void *fwd = ri->tgt_value;
 	u32 map_id = ri->map_id;
+	struct bpf_map *map;
 	int err;
 
 	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
@@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	case BPF_MAP_TYPE_DEVMAP:
 		fallthrough;
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		err = dev_map_enqueue(fwd, xdp, dev);
+		map = xchg(&ri->map, NULL);
+		if (map)
+			err = dev_map_enqueue_multi(xdp, dev, map,
+						    ri->flags & BPF_F_EXCLUDE_INGRESS);
+		else
+			err = dev_map_enqueue(fwd, xdp, dev);
 		break;
 	case BPF_MAP_TYPE_CPUMAP:
 		err = cpu_map_enqueue(fwd, xdp, dev);
@@ -3984,13 +4007,19 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       enum bpf_map_type map_type, u32 map_id)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_map *map;
 	int err;
 
 	switch (map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 		fallthrough;
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		err = dev_map_generic_redirect(fwd, skb, xdp_prog);
+		map = xchg(&ri->map, NULL);
+		if (map)
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ri->flags & BPF_F_EXCLUDE_INGRESS);
+		else
+			err = dev_map_generic_redirect(fwd, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 		break;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 05354976c1fc..aba84d04642b 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -583,3 +583,32 @@ struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 	return __xdp_build_skb_from_frame(xdpf, skb, dev);
 }
 EXPORT_SYMBOL_GPL(xdp_build_skb_from_frame);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+	nxdpf->frame_sz = PAGE_SIZE;
+	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
+	nxdpf->mem.id = 0;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 67b4ce504852..9df75ea4a567 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -226,7 +226,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key)
 
 static int xsk_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __xsk_map_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, 0,
+				      __xsk_map_lookup_elem);
 }
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 85c924bc21b1..b178f5b0d3f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2534,8 +2534,12 @@ union bpf_attr {
  * 		The lower two bits of *flags* are used as the return code if
  * 		the map lookup fails. This is so that the return value can be
  * 		one of the XDP program return codes up to **XDP_TX**, as chosen
- * 		by the caller. Any higher bits in the *flags* argument must be
- * 		unset.
+ * 		by the caller. The higher bits of *flags* can be set to
+ * 		BPF_F_BROADCAST or BPF_F_EXCLUDE_INGRESS as defined below.
+ *
+ * 		With BPF_F_BROADCAST the packet will be broadcasted to all the
+ * 		interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress
+ * 		interface will be excluded when do broadcasting.
  *
  * 		See also **bpf_redirect**\ (), which only supports redirecting
  * 		to an ifindex, but doesn't require a map to do so.
@@ -5052,6 +5056,15 @@ enum {
 	BPF_F_BPRM_SECUREEXEC	= (1ULL << 0),
 };
 
+/* Flags for bpf_redirect_map helper */
+enum {
+	BPF_F_BROADCAST		= (1ULL << 3),
+	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
+};
+
+#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX)
+#define BPF_F_REDIR_MASK (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCHv7 bpf-next 3/4] sample/bpf: add xdp_redirect_map_multi for redirect_map broadcast test
  2021-04-14 12:26 [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
  2021-04-14 12:26 ` [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue Hangbin Liu
  2021-04-14 12:26 ` [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
@ 2021-04-14 12:26 ` Hangbin Liu
  2021-04-14 12:26 ` [PATCHv7 bpf-next 4/4] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
  2021-04-14 14:16 ` [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Toke Høiland-Jørgensen
  4 siblings, 0 replies; 39+ messages in thread
From: Hangbin Liu @ 2021-04-14 12:26 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, Hangbin Liu

This is a sample for xdp redirect broadcast. In the sample we could forward
all packets between given interfaces. There is also an option -X that could
enable 2nd xdp_prog on egress interface.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 samples/bpf/Makefile                      |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c |  87 +++++++
 samples/bpf/xdp_redirect_map_multi_user.c | 302 ++++++++++++++++++++++
 3 files changed, 392 insertions(+)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 45ceca4e2c70..520434ea966f 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ tprogs-y += test_map_in_map
 tprogs-y += per_socket_stats_example
 tprogs-y += xdp_redirect
 tprogs-y += xdp_redirect_map
+tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_monitor
 tprogs-y += xdp_rxq_info
@@ -99,6 +100,7 @@ test_map_in_map-objs := test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
 xdp_redirect-objs := xdp_redirect_user.o
 xdp_redirect_map-objs := xdp_redirect_map_user.o
+xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o
 xdp_redirect_cpu-objs := xdp_redirect_cpu_user.o
 xdp_monitor-objs := xdp_monitor_user.o
 xdp_rxq_info-objs := xdp_rxq_info_user.o
@@ -160,6 +162,7 @@ always-y += tcp_tos_reflect_kern.o
 always-y += tcp_dumpstats_kern.o
 always-y += xdp_redirect_kern.o
 always-y += xdp_redirect_map_kern.o
+always-y += xdp_redirect_map_multi_kern.o
 always-y += xdp_redirect_cpu_kern.o
 always-y += xdp_monitor_kern.o
 always-y += xdp_rxq_info_kern.o
diff --git a/samples/bpf/xdp_redirect_map_multi_kern.c b/samples/bpf/xdp_redirect_map_multi_kern.c
new file mode 100644
index 000000000000..e6be70225ee1
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_kern.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
+	__uint(key_size, sizeof(int));
+	__uint(value_size, sizeof(int));
+	__uint(max_entries, 32);
+} forward_map_general SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
+	__uint(key_size, sizeof(int));
+	__uint(value_size, sizeof(struct bpf_devmap_val));
+	__uint(max_entries, 32);
+} forward_map_native SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__type(key, u32);
+	__type(value, long);
+	__uint(max_entries, 1);
+} rxcnt SEC(".maps");
+
+/* map to store egress interfaces mac addresses, set the
+ * max_entries to 1 and extend it in user sapce prog.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, u32);
+	__type(value, __be64);
+	__uint(max_entries, 1);
+} mac_map SEC(".maps");
+
+static int xdp_redirect_map(struct xdp_md *ctx, void *forward_map)
+{
+	long *value;
+	u32 key = 0;
+
+	/* count packet in global counter */
+	value = bpf_map_lookup_elem(&rxcnt, &key);
+	if (value)
+		*value += 1;
+
+	return bpf_redirect_map(forward_map, key, BPF_F_REDIR_MASK);
+}
+
+SEC("xdp_redirect_general")
+int xdp_redirect_map_general(struct xdp_md *ctx)
+{
+	return xdp_redirect_map(ctx, &forward_map_general);
+}
+
+SEC("xdp_redirect_native")
+int xdp_redirect_map_native(struct xdp_md *ctx)
+{
+	return xdp_redirect_map(ctx, &forward_map_native);
+}
+
+SEC("xdp_devmap/map_prog")
+int xdp_devmap_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	u32 key = ctx->egress_ifindex;
+	struct ethhdr *eth = data;
+	__be64 *mac;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	mac = bpf_map_lookup_elem(&mac_map, &key);
+	if (mac)
+		__builtin_memcpy(eth->h_source, mac, ETH_ALEN);
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
new file mode 100644
index 000000000000..84cdbbed20b7
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -0,0 +1,302 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+#include <sys/resource.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int ifaces[MAX_IFACE_NUM] = {};
+static int rxcnt_map_fd;
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_cpus];
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		__u64 sum = 0;
+		__u32 key = 0;
+		int i;
+
+		sleep(interval);
+		assert(bpf_map_lookup_elem(rxcnt_map_fd, &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += (values[i] - prev[i]);
+		if (sum)
+			printf("Forwarding %10llu pkt/s\n", sum / interval);
+		memcpy(prev, values, sizeof(values));
+	}
+}
+
+static int get_mac_addr(unsigned int ifindex, void *mac_addr)
+{
+	char ifname[IF_NAMESIZE];
+	struct ifreq ifr;
+	int fd, ret = -1;
+
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	if (fd < 0)
+		return ret;
+
+	if (!if_indextoname(ifindex, ifname))
+		goto err_out;
+
+	strcpy(ifr.ifr_name, ifname);
+
+	if (ioctl(fd, SIOCGIFHWADDR, &ifr) != 0)
+		goto err_out;
+
+	memcpy(mac_addr, ifr.ifr_hwaddr.sa_data, 6 * sizeof(char));
+	ret = 0;
+
+err_out:
+	close(fd);
+	return ret;
+}
+
+static int update_mac_map(struct bpf_object *obj)
+{
+	int i, ret = -1, mac_map_fd;
+	unsigned char mac_addr[6];
+	unsigned int ifindex;
+
+	mac_map_fd = bpf_object__find_map_fd_by_name(obj, "mac_map");
+	if (mac_map_fd < 0) {
+		printf("find mac map fd failed\n");
+		return ret;
+	}
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		ret = get_mac_addr(ifindex, mac_addr);
+		if (ret < 0) {
+			printf("get interface %d mac failed\n", ifindex);
+			return ret;
+		}
+
+		ret = bpf_map_update_elem(mac_map_fd, &ifindex, mac_addr, 0);
+		if (ret) {
+			perror("bpf_update_elem mac_map_fd");
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n"
+		"    -X    load xdp program on egress\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int i, ret, opt, forward_map_fd, max_ifindex = 0;
+	struct bpf_program *ingress_prog, *egress_prog;
+	int ingress_prog_fd, egress_prog_fd = 0;
+	struct bpf_devmap_val devmap_val;
+	bool attach_egress_prog = false;
+	char ifname[IF_NAMESIZE];
+	struct bpf_map *mac_map;
+	struct bpf_object *obj;
+	unsigned int ifindex;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNFX")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		case 'X':
+			attach_egress_prog = true;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE)) {
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+	} else if (attach_egress_prog) {
+		printf("Load xdp program on egress with SKB mode not supported yet\n");
+		return 1;
+	}
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		return 1;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			return 1;
+		}
+
+		/* Find the largest index number */
+		if (ifaces[i] > max_ifindex)
+			max_ifindex = ifaces[i];
+
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	obj = bpf_object__open(filename);
+	if (libbpf_get_error(obj)) {
+		printf("ERROR: opening BPF object file failed\n");
+		obj = NULL;
+		goto err_out;
+	}
+
+	/* Reset the map size to max ifindex + 1 */
+	if (attach_egress_prog) {
+		mac_map = bpf_object__find_map_by_name(obj, "mac_map");
+		ret = bpf_map__resize(mac_map, max_ifindex + 1);
+		if (ret < 0) {
+			printf("ERROR: reset mac map size failed\n");
+			goto err_out;
+		}
+	}
+
+	/* load BPF program */
+	if (bpf_object__load(obj)) {
+		printf("ERROR: loading BPF object file failed\n");
+		goto err_out;
+	}
+
+	if (xdp_flags & XDP_FLAGS_SKB_MODE) {
+		ingress_prog = bpf_object__find_program_by_name(obj, "xdp_redirect_map_general");
+		forward_map_fd = bpf_object__find_map_fd_by_name(obj, "forward_map_general");
+	} else {
+		ingress_prog = bpf_object__find_program_by_name(obj, "xdp_redirect_map_native");
+		forward_map_fd = bpf_object__find_map_fd_by_name(obj, "forward_map_native");
+	}
+	if (!ingress_prog || forward_map_fd < 0) {
+		printf("finding ingress_prog/forward_map in obj file failed\n");
+		goto err_out;
+	}
+
+	ingress_prog_fd = bpf_program__fd(ingress_prog);
+	if (ingress_prog_fd < 0) {
+		printf("find ingress_prog fd failed\n");
+		goto err_out;
+	}
+
+	rxcnt_map_fd = bpf_object__find_map_fd_by_name(obj, "rxcnt");
+	if (rxcnt_map_fd < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		goto err_out;
+	}
+
+	if (attach_egress_prog) {
+		/* Update mac_map with all egress interfaces' mac addr */
+		if (update_mac_map(obj) < 0) {
+			printf("Error: update mac map failed");
+			goto err_out;
+		}
+
+		/* Find egress prog fd */
+		egress_prog = bpf_object__find_program_by_name(obj, "xdp_devmap_prog");
+		if (!egress_prog) {
+			printf("finding egress_prog in obj file failed\n");
+			goto err_out;
+		}
+		egress_prog_fd = bpf_program__fd(egress_prog);
+		if (egress_prog_fd < 0) {
+			printf("find egress_prog fd failed\n");
+			goto err_out;
+		}
+	}
+
+	/* Remove attached program when program is interrupted or killed */
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, ingress_prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+
+		/* Add all the interfaces to forward group and attach
+		 * egress devmap programe if exist
+		 */
+		devmap_val.ifindex = ifindex;
+		devmap_val.bpf_prog.fd = egress_prog_fd;
+		ret = bpf_map_update_elem(forward_map_fd, &ifindex, &devmap_val, 0);
+		if (ret) {
+			perror("bpf_map_update_elem forward_map");
+			goto err_out;
+		}
+	}
+
+	poll_stats(2);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCHv7 bpf-next 4/4] selftests/bpf: add xdp_redirect_multi test
  2021-04-14 12:26 [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
                   ` (2 preceding siblings ...)
  2021-04-14 12:26 ` [PATCHv7 bpf-next 3/4] sample/bpf: add xdp_redirect_map_multi for redirect_map broadcast test Hangbin Liu
@ 2021-04-14 12:26 ` Hangbin Liu
  2021-04-14 14:16 ` [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Toke Høiland-Jørgensen
  4 siblings, 0 replies; 39+ messages in thread
From: Hangbin Liu @ 2021-04-14 12:26 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, Hangbin Liu

Add a bpf selftest for new helper xdp_redirect_map_multi(). In this
test there are 3 forward groups and 1 exclude group. The test will
redirect each interface's packets to all the interfaces in the forward
group, and exclude the interface in exclude map.

Two maps (DEVMAP, DEVMAP_HASH) and two xdp modes (generic, drive) will
be tested. XDP egress program will also be tested by setting pkt src MAC
to egress interface's MAC address.

For more test details, you can find it in the test script. Here is
the test result.
]# ./test_xdp_redirect_multi.sh
Pass: xdpgeneric arp ns1-2
Pass: xdpgeneric arp ns1-3
Pass: xdpgeneric arp ns1-4
Pass: xdpgeneric ping ns1-2
Pass: xdpgeneric ping ns1-3
Pass: xdpgeneric ping ns1-4
Pass: xdpgeneric ping6 ns1-2
Pass: xdpgeneric ping6 ns1-1 number
Pass: xdpgeneric ping6 ns1-2 number
Pass: xdpdrv arp ns1-2
Pass: xdpdrv arp ns1-3
Pass: xdpdrv arp ns1-4
Pass: xdpdrv ping ns1-2
Pass: xdpdrv ping ns1-3
Pass: xdpdrv ping ns1-4
Pass: xdpdrv ping6 ns1-2
Pass: xdpdrv ping6 ns1-1 number
Pass: xdpdrv ping6 ns1-2 number
Pass: xdpegress mac ns1-2
Pass: xdpegress mac ns1-3
Pass: xdpegress mac ns1-4
Summary: PASS 21, FAIL 0

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

---
v2: add a IPv6 test to validates that single redirect still works
after multicast redirect.
---
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       |  99 ++++++++
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 205 +++++++++++++++
 .../selftests/bpf/xdp_redirect_multi.c        | 236 ++++++++++++++++++
 4 files changed, 542 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 6448c626498f..0c08b662a64e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -49,6 +49,7 @@ TEST_FILES = xsk_prereqs.sh \
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
 	test_xdp_redirect.sh \
+	test_xdp_redirect_multi.sh \
 	test_xdp_meta.sh \
 	test_xdp_veth.sh \
 	test_offload.py \
@@ -79,7 +80,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
-	xdpxceiver
+	xdpxceiver xdp_redirect_multi
 
 TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read
 
diff --git a/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c b/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
new file mode 100644
index 000000000000..099bf444acab
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <string.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+/* It would be easier to use a key:if_index, value:if_index map, but it
+ * will need a very large entries as the if_index number may get very large,
+ * this would affect the performace. So the DEVMAP here is just for testing.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_DEVMAP);
+	__uint(key_size, sizeof(int));
+	__uint(value_size, sizeof(int));
+	__uint(max_entries, 1024);
+} map_v4 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
+	__uint(key_size, sizeof(int));
+	__uint(value_size, sizeof(int));
+	__uint(max_entries, 128);
+} map_all SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
+	__uint(key_size, sizeof(int));
+	__uint(value_size, sizeof(struct bpf_devmap_val));
+	__uint(max_entries, 128);
+} map_egress SEC(".maps");
+
+/* map to store egress interfaces mac addresses */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u32);
+	__type(value, __be64);
+	__uint(max_entries, 128);
+} mac_map SEC(".maps");
+
+SEC("xdp_redirect_map_multi")
+int xdp_redirect_map_multi_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	int if_index = ctx->ingress_ifindex;
+	struct ethhdr *eth = data;
+	__u16 h_proto;
+	__u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == bpf_htons(ETH_P_IP))
+		return bpf_redirect_map(&map_v4, 0, BPF_F_REDIR_MASK);
+	else if (h_proto == bpf_htons(ETH_P_IPV6))
+		return bpf_redirect_map(&map_all, if_index, 0);
+	else
+		return bpf_redirect_map(&map_all, 0, BPF_F_REDIR_MASK);
+}
+
+/* The following 2 progs are for 2nd devmap prog testing */
+SEC("xdp_redirect_map_ingress")
+int xdp_redirect_map_all_prog(struct xdp_md *ctx)
+{
+	return bpf_redirect_map(&map_egress, 0, BPF_F_REDIR_MASK);
+}
+
+SEC("xdp_devmap/map_prog")
+int xdp_devmap_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	__u32 key = ctx->egress_ifindex;
+	struct ethhdr *eth = data;
+	__u64 nh_off;
+	__be64 *mac;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	mac = bpf_map_lookup_elem(&mac_map, &key);
+	if (mac)
+		__builtin_memcpy(eth->h_source, mac, ETH_ALEN);
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh b/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
new file mode 100755
index 000000000000..414f331823d2
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
@@ -0,0 +1,205 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test topology:
+#     - - - - - - - - - - - - - - - - - - - - - - - - -
+#    | veth1         veth2         veth3         veth4 |  ... init net
+#     - -| - - - - - - | - - - - - - | - - - - - - | - -
+#    ---------     ---------     ---------     ---------
+#    | veth0 |     | veth0 |     | veth0 |     | veth0 |  ...
+#    ---------     ---------     ---------     ---------
+#       ns1           ns2           ns3           ns4
+#
+# Forward maps:
+#     map_all has interfaces: veth1, veth2, veth3, veth4, ... (All traffic except IPv4)
+#     map_v4 has interfaces: veth1, veth3, veth4, ... (For IPv4 traffic only)
+#     map_egress has all interfaces and redirect all pkts
+# Map type:
+#     map_v4 use DEVMAP, others use DEVMAP_HASH
+#
+# Test modules:
+# XDP modes: generic, native, native + egress_prog
+#
+# Test cases:
+#     ARP:
+#        ns1 -> gw: ns2, ns3, ns4 should receive the arp request
+#     IPv4:
+#        ping test: ns1 -> ns2 (block), ns1 -> ns3 (pass), ns1 -> ns4 (pass)
+#     IPv6:
+#        ping test: ns1 -> ns2 (block), echo requests will be redirect back
+#     egress_prog:
+#        all src mac should be egress interface's mac
+#
+
+
+# netns numbers
+NUM=4
+IFACES=""
+DRV_MODE="xdpgeneric xdpdrv xdpegress"
+PASS=0
+FAIL=0
+
+test_pass()
+{
+	echo "Pass: $@"
+	PASS=$((PASS + 1))
+}
+
+test_fail()
+{
+	echo "fail: $@"
+	FAIL=$((FAIL + 1))
+}
+
+clean_up()
+{
+	for i in $(seq $NUM); do
+		ip link del veth$i 2> /dev/null
+		ip netns del ns$i 2> /dev/null
+	done
+}
+
+# Kselftest framework requirement - SKIP code is 4.
+check_env()
+{
+	ip link set dev lo xdpgeneric off &>/dev/null
+	if [ $? -ne 0 ];then
+		echo "selftests: [SKIP] Could not run test without the ip xdpgeneric support"
+		exit 4
+	fi
+
+	which tcpdump &>/dev/null
+	if [ $? -ne 0 ];then
+		echo "selftests: [SKIP] Could not run test without tcpdump"
+		exit 4
+	fi
+}
+
+setup_ns()
+{
+	local mode=$1
+	IFACES=""
+
+	if [ "$mode" = "xdpegress" ]; then
+		mode="xdpdrv"
+	fi
+
+	for i in $(seq $NUM); do
+	        ip netns add ns$i
+	        ip link add veth$i type veth peer name veth0 netns ns$i
+		ip link set veth$i up
+		ip -n ns$i link set veth0 up
+
+		ip -n ns$i addr add 192.0.2.$i/24 dev veth0
+		ip -n ns$i addr add 2001:db8::$i/64 dev veth0
+		ip -n ns$i link set veth0 $mode obj \
+			xdp_dummy.o sec xdp_dummy &> /dev/null || \
+			{ test_fail "Unable to load dummy xdp" && exit 1; }
+		IFACES="$IFACES veth$i"
+		veth_mac[$i]=$(ip link show veth$i | awk '/link\/ether/ {print $2}')
+	done
+}
+
+do_egress_tests()
+{
+	local mode=$1
+
+	# mac test
+	ip netns exec ns2 tcpdump -e -i veth0 -nn -l -e &> mac_ns1-2_${mode}.log &
+	ip netns exec ns3 tcpdump -e -i veth0 -nn -l -e &> mac_ns1-3_${mode}.log &
+	ip netns exec ns4 tcpdump -e -i veth0 -nn -l -e &> mac_ns1-4_${mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+
+	# mac check
+	grep -q "${veth_mac[2]} > ff:ff:ff:ff:ff:ff" mac_ns1-2_${mode}.log && \
+	       test_pass "$mode mac ns1-2" || test_fail "$mode mac ns1-2"
+	grep -q "${veth_mac[3]} > ff:ff:ff:ff:ff:ff" mac_ns1-3_${mode}.log && \
+		test_pass "$mode mac ns1-3" || test_fail "$mode mac ns1-3"
+	grep -q "${veth_mac[4]} > ff:ff:ff:ff:ff:ff" mac_ns1-4_${mode}.log && \
+		test_pass "$mode mac ns1-4" || test_fail "$mode mac ns1-4"
+}
+
+do_ping_tests()
+{
+	local mode=$1
+
+	# arp test
+	ip netns exec ns2 tcpdump -i veth0 -nn -l -e &> arp_ns1-2_${mode}.log &
+	ip netns exec ns3 tcpdump -i veth0 -nn -l -e &> arp_ns1-3_${mode}.log &
+	ip netns exec ns4 tcpdump -i veth0 -nn -l -e &> arp_ns1-4_${mode}.log &
+	ip netns exec ns1 ping 192.0.2.254 -c 4 &> /dev/null
+	sleep 2
+	pkill -9 tcpdump
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-2_${mode}.log && \
+		test_pass "$mode arp ns1-2" || test_fail "$mode arp ns1-2"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-3_${mode}.log && \
+		test_pass "$mode arp ns1-3" || test_fail "$mode arp ns1-3"
+	grep -q "Request who-has 192.0.2.254 tell 192.0.2.1" arp_ns1-4_${mode}.log && \
+		test_pass "$mode arp ns1-4" || test_fail "$mode arp ns1-4"
+
+	# ping test
+	ip netns exec ns1 ping 192.0.2.2 -c 4 &> /dev/null && \
+		test_fail "$mode ping ns1-2" || test_pass "$mode ping ns1-2"
+	ip netns exec ns1 ping 192.0.2.3 -c 4 &> /dev/null && \
+		test_pass "$mode ping ns1-3" || test_fail "$mode ping ns1-3"
+	ip netns exec ns1 ping 192.0.2.4 -c 4 &> /dev/null && \
+		test_pass "$mode ping ns1-4" || test_fail "$mode ping ns1-4"
+
+	# ping6 test: echo request should be redirect back to itself, not others
+	ip netns exec ns1 ip neigh add 2001:db8::2 dev veth0 lladdr 00:00:00:00:00:02
+	ip netns exec ns1 tcpdump -i veth0 -nn -l &> ping6_ns1_${mode}.log &
+	ip netns exec ns2 tcpdump -i veth0 -nn -l &> ping6_ns2_${mode}.log &
+	sleep 2
+	ip netns exec ns1 ping6 2001:db8::2 -c 2 &> /dev/null && \
+		test_fail "$mode ping6 ns1-2" || test_pass "$mode ping6 ns1-2"
+	sleep 2
+	pkill -9 tcpdump
+	ns1_echo_num=$(grep "ICMP6, echo request" ping6_ns1_${mode}.log | wc -l)
+	[ $ns1_echo_num -eq 4 ] && test_pass "$mode ping6 ns1-1 number" || \
+		test_fail "$mode ping6 ns1-1 number"
+	ns2_echo_num=$(grep "ICMP6, echo request" ping6_ns2_${mode}.log | wc -l)
+	[ $ns2_echo_num -eq 0 ] && test_pass "$mode ping6 ns1-2 number" || \
+		test_fail "$mode ping6 ns1-2 number"
+}
+
+do_tests()
+{
+	local mode=$1
+	local drv_p
+
+	case ${mode} in
+		xdpdrv)  drv_p="-N";;
+		xdpegress) drv_p="-X";;
+		xdpgeneric) drv_p="-S";;
+	esac
+
+	./xdp_redirect_multi $drv_p $IFACES &> xdp_redirect_${mode}.log &
+	xdp_pid=$!
+	sleep 10
+
+	if [ "$mode" = "xdpegress" ]; then
+		do_egress_tests $mode
+	else
+		do_ping_tests $mode
+	fi
+
+	kill $xdp_pid
+}
+
+trap clean_up 0 2 3 6 9
+
+check_env
+rm -f xdp_redirect_*.log arp_ns*.log ping6_ns*.log mac_ns*.log
+
+for mode in ${DRV_MODE}; do
+	setup_ns $mode
+	do_tests $mode
+	sleep 10
+	clean_up
+	sleep 5
+done
+
+echo "Summary: PASS $PASS, FAIL $FAIL"
+[ $FAIL -eq 0 ] && exit 0 || exit 1
diff --git a/tools/testing/selftests/bpf/xdp_redirect_multi.c b/tools/testing/selftests/bpf/xdp_redirect_multi.c
new file mode 100644
index 000000000000..6a282dde90bd
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_redirect_multi.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/if.h>
+#include <unistd.h>
+#include <libgen.h>
+#include <sys/resource.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define MAX_IFACE_NUM 32
+#define MAX_INDEX_NUM 1024
+
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static int ifaces[MAX_IFACE_NUM] = {};
+
+static void int_exit(int sig)
+{
+	__u32 prog_id = 0;
+	int i;
+
+	for (i = 0; ifaces[i] > 0; i++) {
+		if (bpf_get_link_xdp_id(ifaces[i], &prog_id, xdp_flags)) {
+			printf("bpf_get_link_xdp_id failed\n");
+			exit(1);
+		}
+		if (prog_id)
+			bpf_set_link_xdp_fd(ifaces[i], -1, xdp_flags);
+	}
+
+	exit(0);
+}
+
+static int get_mac_addr(unsigned int ifindex, void *mac_addr)
+{
+	char ifname[IF_NAMESIZE];
+	struct ifreq ifr;
+	int fd, ret = -1;
+
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	if (fd < 0)
+		return ret;
+
+	if (!if_indextoname(ifindex, ifname))
+		goto err_out;
+
+	strcpy(ifr.ifr_name, ifname);
+
+	if (ioctl(fd, SIOCGIFHWADDR, &ifr) != 0)
+		goto err_out;
+
+	memcpy(mac_addr, ifr.ifr_hwaddr.sa_data, 6 * sizeof(char));
+	ret = 0;
+
+err_out:
+	close(fd);
+	return ret;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n"
+		"OPTS:\n"
+		"    -S    use skb-mode\n"
+		"    -N    enforce native mode\n"
+		"    -F    force loading prog\n"
+		"    -X    load xdp program on egress\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	int prog_fd, group_all, group_v4, mac_map;
+	struct bpf_program *ingress_prog, *egress_prog;
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type = BPF_PROG_TYPE_UNSPEC,
+	};
+	int i, ret, opt, egress_prog_fd = 0;
+	struct bpf_devmap_val devmap_val;
+	bool attach_egress_prog = false;
+	unsigned char mac_addr[6];
+	char ifname[IF_NAMESIZE];
+	struct bpf_object *obj;
+	unsigned int ifindex;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "SNFX")) != -1) {
+		switch (opt) {
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			/* default, set below */
+			break;
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		case 'X':
+			attach_egress_prog = true;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (!(xdp_flags & XDP_FLAGS_SKB_MODE)) {
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+	} else if (attach_egress_prog) {
+		printf("Load xdp program on egress with SKB mode not supported yet\n");
+		goto err_out;
+	}
+
+	if (optind == argc) {
+		printf("usage: %s <IFNAME|IFINDEX> <IFNAME|IFINDEX> ...\n", argv[0]);
+		goto err_out;
+	}
+
+	printf("Get interfaces");
+	for (i = 0; i < MAX_IFACE_NUM && argv[optind + i]; i++) {
+		ifaces[i] = if_nametoindex(argv[optind + i]);
+		if (!ifaces[i])
+			ifaces[i] = strtoul(argv[optind + i], NULL, 0);
+		if (!if_indextoname(ifaces[i], ifname)) {
+			perror("Invalid interface name or i");
+			goto err_out;
+		}
+		if (ifaces[i] > MAX_INDEX_NUM) {
+			printf("Interface index to large\n");
+			goto err_out;
+		}
+		printf(" %d", ifaces[i]);
+	}
+	printf("\n");
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		goto err_out;
+
+	if (attach_egress_prog)
+		group_all = bpf_object__find_map_fd_by_name(obj, "map_egress");
+	else
+		group_all = bpf_object__find_map_fd_by_name(obj, "map_all");
+	group_v4 = bpf_object__find_map_fd_by_name(obj, "map_v4");
+	mac_map = bpf_object__find_map_fd_by_name(obj, "mac_map");
+
+	if (group_all < 0 || group_v4 < 0 || mac_map < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		goto err_out;
+	}
+
+	if (attach_egress_prog) {
+		/* Find ingress/egress prog for 2nd xdp prog */
+		ingress_prog = bpf_object__find_program_by_name(obj, "xdp_redirect_map_all_prog");
+		egress_prog = bpf_object__find_program_by_name(obj, "xdp_devmap_prog");
+		if (!ingress_prog || !egress_prog) {
+			printf("finding ingress/egress_prog in obj file failed\n");
+			goto err_out;
+		}
+		prog_fd = bpf_program__fd(ingress_prog);
+		egress_prog_fd = bpf_program__fd(egress_prog);
+		if (prog_fd < 0 || egress_prog_fd < 0) {
+			printf("find egress_prog fd failed\n");
+			goto err_out;
+		}
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	/* Init forward multicast groups and exclude group */
+	for (i = 0; ifaces[i] > 0; i++) {
+		ifindex = ifaces[i];
+
+		if (attach_egress_prog) {
+			ret = get_mac_addr(ifindex, mac_addr);
+			if (ret < 0) {
+				printf("get interface %d mac failed\n", ifindex);
+				goto err_out;
+			}
+			ret = bpf_map_update_elem(mac_map, &ifindex, mac_addr, 0);
+			if (ret) {
+				perror("bpf_update_elem mac_map failed\n");
+				goto err_out;
+			}
+		}
+
+		/* Add all the interfaces to group all */
+		devmap_val.ifindex = ifindex;
+		devmap_val.bpf_prog.fd = egress_prog_fd;
+		ret = bpf_map_update_elem(group_all, &ifindex, &devmap_val, 0);
+		if (ret) {
+			perror("bpf_map_update_elem");
+			goto err_out;
+		}
+
+		/* For testing: skip adding the 2nd interfaces to group v4 */
+		if (i != 1) {
+			ret = bpf_map_update_elem(group_v4, &ifindex, &ifindex, 0);
+			if (ret) {
+				perror("bpf_map_update_elem");
+				goto err_out;
+			}
+		}
+
+		/* bind prog_fd to each interface */
+		ret = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+		if (ret) {
+			printf("Set xdp fd failed on %d\n", ifindex);
+			goto err_out;
+		}
+	}
+
+	/* sleep some time for testing */
+	sleep(999);
+
+	return 0;
+
+err_out:
+	return 1;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support
  2021-04-14 12:26 [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
                   ` (3 preceding siblings ...)
  2021-04-14 12:26 ` [PATCHv7 bpf-next 4/4] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
@ 2021-04-14 14:16 ` Toke Høiland-Jørgensen
  4 siblings, 0 replies; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-14 14:16 UTC (permalink / raw)
  To: Hangbin Liu, bpf
  Cc: netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, Hangbin Liu

Hangbin Liu <liuhangbin@gmail.com> writes:

> Hi,
>
> This patchset is a new implementation for XDP multicast support based
> on my previous 2 maps implementation[1]. The reason is that Daniel think
> the exclude map implementation is missing proper bond support in XDP
> context. And there is a plan to add native XDP bonding support. Adding a
> exclude map in the helper also increase the complex of verifier and has
> draw back of performace.
>
> The new implementation just add two new flags BPF_F_BROADCAST and
> BPF_F_EXCLUDE_INGRESS to extend xdp_redirect_map for broadcast support.
>
> With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
> in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
> excluded when do broadcasting.

Alright, I'm out of things to complain about - thanks for sticking with
it! :)

For the series:

Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-14 12:26 ` [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue Hangbin Liu
@ 2021-04-15  0:17   ` Martin KaFai Lau
  2021-04-15  2:37     ` Hangbin Liu
  0 siblings, 1 reply; 39+ messages in thread
From: Martin KaFai Lau @ 2021-04-15  0:17 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Wed, Apr 14, 2021 at 08:26:07PM +0800, Hangbin Liu wrote:
[ ... ]

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index aa516472ce46..3980fb3bfb09 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue {
>  	struct list_head flush_node;
>  	struct net_device *dev;
>  	struct net_device *dev_rx;
> +	struct bpf_prog *xdp_prog;
>  	unsigned int count;
>  };
>  
> @@ -326,22 +327,71 @@ bool dev_map_can_have_prog(struct bpf_map *map)
>  	return false;
>  }
>  
> +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
> +				struct xdp_frame **frames, int n,
> +				struct net_device *dev)
> +{
> +	struct xdp_txq_info txq = { .dev = dev };
> +	struct xdp_buff xdp;
> +	int i, nframes = 0;
> +
> +	for (i = 0; i < n; i++) {
> +		struct xdp_frame *xdpf = frames[i];
> +		u32 act;
> +		int err;
> +
> +		xdp_convert_frame_to_buff(xdpf, &xdp);
> +		xdp.txq = &txq;
> +
> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
> +		switch (act) {
> +		case XDP_PASS:
> +			err = xdp_update_frame_from_buff(&xdp, xdpf);
> +			if (unlikely(err < 0))
> +				xdp_return_frame_rx_napi(xdpf);
> +			else
> +				frames[nframes++] = xdpf;
> +			break;
> +		default:
> +			bpf_warn_invalid_xdp_action(act);
> +			fallthrough;
> +		case XDP_ABORTED:
> +			trace_xdp_exception(dev, xdp_prog, act);
> +			fallthrough;
> +		case XDP_DROP:
> +			xdp_return_frame_rx_napi(xdpf);
> +			break;
> +		}
> +	}
> +	return nframes; /* sent frames count */
> +}
> +
>  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>  {
>  	struct net_device *dev = bq->dev;
> -	int sent = 0, err = 0;
> +	int sent = 0, drops = 0, err = 0;
> +	unsigned int cnt = bq->count;
> +	int to_send = cnt;
>  	int i;
>  
> -	if (unlikely(!bq->count))
> +	if (unlikely(!cnt))
>  		return;
>  
> -	for (i = 0; i < bq->count; i++) {
> +	for (i = 0; i < cnt; i++) {
>  		struct xdp_frame *xdpf = bq->q[i];
>  
>  		prefetch(xdpf);
>  	}
>  
> -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> +	if (bq->xdp_prog) {
bq->xdp_prog is used here

> +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> +		if (!to_send)
> +			goto out;
> +
> +		drops = cnt - to_send;
> +	}
> +

[ ... ]

>  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> -		       struct net_device *dev_rx)
> +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>  {
>  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>  	/* Ingress dev_rx will be the same for all xdp_frame's in
>  	 * bulk_queue, because bq stored per-CPU and must be flushed
>  	 * from net_device drivers NAPI func end.
> +	 *
> +	 * Do the same with xdp_prog and flush_list since these fields
> +	 * are only ever modified together.
>  	 */
> -	if (!bq->dev_rx)
> +	if (!bq->dev_rx) {
>  		bq->dev_rx = dev_rx;
> +		bq->xdp_prog = xdp_prog;
bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
It is not very obvious after taking a quick look at xdp_do_flush[_map].

e.g. what if the devmap elem gets deleted.

[ ... ]

>  static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
> -			       struct net_device *dev_rx)
> +				struct net_device *dev_rx,
> +				struct bpf_prog *xdp_prog)
>  {
>  	struct xdp_frame *xdpf;
>  	int err;
> @@ -439,42 +497,14 @@ static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
>  	if (unlikely(!xdpf))
>  		return -EOVERFLOW;
>  
> -	bq_enqueue(dev, xdpf, dev_rx);
> +	bq_enqueue(dev, xdpf, dev_rx, xdp_prog);
>  	return 0;
>  }
>  
[ ... ]

> @@ -482,12 +512,7 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  {
>  	struct net_device *dev = dst->dev;
>  
> -	if (dst->xdp_prog) {
> -		xdp = dev_map_run_prog(dev, xdp, dst->xdp_prog);
> -		if (!xdp)
> -			return 0;
> -	}
> -	return __xdp_enqueue(dev, xdp, dev_rx);
> +	return __xdp_enqueue(dev, xdp, dev_rx, dst->xdp_prog);
>  }

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support
  2021-04-14 12:26 ` [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
@ 2021-04-15  0:23   ` Martin KaFai Lau
  2021-04-15  2:21     ` Hangbin Liu
  0 siblings, 1 reply; 39+ messages in thread
From: Martin KaFai Lau @ 2021-04-15  0:23 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Wed, Apr 14, 2021 at 08:26:08PM +0800, Hangbin Liu wrote:
[ ... ]

> +static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex,
> +						  u64 flags, u64 flag_mask,
>  						  void *lookup_elem(struct bpf_map *map, u32 key))
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>  
>  	/* Lower bits of the flags are used as return code on lookup failure */
> -	if (unlikely(flags > XDP_TX))
> +	if (unlikely(flags & ~(BPF_F_ACTION_MASK | flag_mask)))
>  		return XDP_ABORTED;
>  
>  	ri->tgt_value = lookup_elem(map, ifindex);
> -	if (unlikely(!ri->tgt_value)) {
> +	if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
>  		/* If the lookup fails we want to clear out the state in the
>  		 * redirect_info struct completely, so that if an eBPF program
>  		 * performs multiple lookups, the last one always takes
> @@ -1482,13 +1484,21 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
>  		 */
>  		ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */
>  		ri->map_type = BPF_MAP_TYPE_UNSPEC;
> -		return flags;
> +		return flags & BPF_F_ACTION_MASK;
>  	}
>  
>  	ri->tgt_index = ifindex;
>  	ri->map_id = map->id;
>  	ri->map_type = map->map_type;
>  
> +	if (flags & BPF_F_BROADCAST) {
> +		WRITE_ONCE(ri->map, map);
Why only WRITE_ONCE on ri->map?  Is it needed?

> +		ri->flags = flags;
> +	} else {
> +		WRITE_ONCE(ri->map, NULL);
> +		ri->flags = 0;
> +	}
> +
>  	return XDP_REDIRECT;
>  }
>  
[ ... ]

> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, bool exclude_ingress)
> +{
> +	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
> +	int exclude_ifindex = exclude_ingress ? dev_rx->ifindex : 0;
> +	struct bpf_dtab_netdev *dst, *last_dst = NULL;
> +	struct hlist_head *head;
> +	struct hlist_node *next;
> +	struct xdp_frame *xdpf;
> +	unsigned int i;
> +	int err;
> +
> +	xdpf = xdp_convert_buff_to_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
> +		for (i = 0; i < map->max_entries; i++) {
> +			dst = READ_ONCE(dtab->netdev_map[i]);
> +			if (!is_valid_dst(dst, xdp, exclude_ifindex))
> +				continue;
> +
> +			/* we only need n-1 clones; last_dst enqueued below */
> +			if (!last_dst) {
> +				last_dst = dst;
> +				continue;
> +			}
> +
> +			err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
> +			if (err)
> +				return err;
> +
> +			last_dst = dst;
> +		}
> +	} else { /* BPF_MAP_TYPE_DEVMAP_HASH */
> +		for (i = 0; i < dtab->n_buckets; i++) {
> +			head = dev_map_index_hash(dtab, i);
> +			hlist_for_each_entry_safe(dst, next, head, index_hlist) {
hmm.... should it be hlist_for_each_entry_rcu() instead?

> +				if (!is_valid_dst(dst, xdp, exclude_ifindex))
> +					continue;
> +
> +				/* we only need n-1 clones; last_dst enqueued below */
> +				if (!last_dst) {
> +					last_dst = dst;
> +					continue;
> +				}
> +
> +				err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
> +				if (err)
> +					return err;
> +
> +				last_dst = dst;
> +			}
> +		}
> +	}
> +
> +	/* consume the last copy of the frame */
> +	if (last_dst)
> +		bq_enqueue(last_dst->dev, xdpf, dev_rx, last_dst->xdp_prog);
> +	else
> +		xdp_return_frame_rx_napi(xdpf); /* dtab is empty */
> +
> +	return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support
  2021-04-15  0:23   ` Martin KaFai Lau
@ 2021-04-15  2:21     ` Hangbin Liu
  2021-04-15  9:29       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Hangbin Liu @ 2021-04-15  2:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Wed, Apr 14, 2021 at 05:23:50PM -0700, Martin KaFai Lau wrote:
> On Wed, Apr 14, 2021 at 08:26:08PM +0800, Hangbin Liu wrote:
> [ ... ]
> 
> > +static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex,
> > +						  u64 flags, u64 flag_mask,
> >  						  void *lookup_elem(struct bpf_map *map, u32 key))
> >  {
> >  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> >  
> >  	/* Lower bits of the flags are used as return code on lookup failure */
> > -	if (unlikely(flags > XDP_TX))
> > +	if (unlikely(flags & ~(BPF_F_ACTION_MASK | flag_mask)))
> >  		return XDP_ABORTED;
> >  
> >  	ri->tgt_value = lookup_elem(map, ifindex);
> > -	if (unlikely(!ri->tgt_value)) {
> > +	if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
> >  		/* If the lookup fails we want to clear out the state in the
> >  		 * redirect_info struct completely, so that if an eBPF program
> >  		 * performs multiple lookups, the last one always takes
> > @@ -1482,13 +1484,21 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
> >  		 */
> >  		ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */
> >  		ri->map_type = BPF_MAP_TYPE_UNSPEC;
> > -		return flags;
> > +		return flags & BPF_F_ACTION_MASK;
> >  	}
> >  
> >  	ri->tgt_index = ifindex;
> >  	ri->map_id = map->id;
> >  	ri->map_type = map->map_type;
> >  
> > +	if (flags & BPF_F_BROADCAST) {
> > +		WRITE_ONCE(ri->map, map);
> Why only WRITE_ONCE on ri->map?  Is it needed?

I think this is make sure the map pointer assigned to ri->map safely.
which starts from commit f6069b9aa993 ("bpf: fix redirect to map under tail
calls")

> 
> > +		ri->flags = flags;
> > +	} else {
> > +		WRITE_ONCE(ri->map, NULL);
> > +		ri->flags = 0;
> > +	}
> > +
> >  	return XDP_REDIRECT;
> >  }
> >  
> [ ... ]
> 
> > +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> > +			  struct bpf_map *map, bool exclude_ingress)
> > +{
> > +	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
> > +	int exclude_ifindex = exclude_ingress ? dev_rx->ifindex : 0;
> > +	struct bpf_dtab_netdev *dst, *last_dst = NULL;
> > +	struct hlist_head *head;
> > +	struct hlist_node *next;
> > +	struct xdp_frame *xdpf;
> > +	unsigned int i;
> > +	int err;
> > +
> > +	xdpf = xdp_convert_buff_to_frame(xdp);
> > +	if (unlikely(!xdpf))
> > +		return -EOVERFLOW;
> > +
> > +	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
> > +		for (i = 0; i < map->max_entries; i++) {
> > +			dst = READ_ONCE(dtab->netdev_map[i]);
> > +			if (!is_valid_dst(dst, xdp, exclude_ifindex))
> > +				continue;
> > +
> > +			/* we only need n-1 clones; last_dst enqueued below */
> > +			if (!last_dst) {
> > +				last_dst = dst;
> > +				continue;
> > +			}
> > +
> > +			err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
> > +			if (err)
> > +				return err;
> > +
> > +			last_dst = dst;
> > +		}
> > +	} else { /* BPF_MAP_TYPE_DEVMAP_HASH */
> > +		for (i = 0; i < dtab->n_buckets; i++) {
> > +			head = dev_map_index_hash(dtab, i);
> > +			hlist_for_each_entry_safe(dst, next, head, index_hlist) {
> hmm.... should it be hlist_for_each_entry_rcu() instead?

Ah, makes sense to me. I will fix it.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-15  0:17   ` Martin KaFai Lau
@ 2021-04-15  2:37     ` Hangbin Liu
  2021-04-15  9:22       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Hangbin Liu @ 2021-04-15  2:37 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Toke Høiland-Jørgensen, Jiri Benc,
	Jesper Dangaard Brouer, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:
> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >  {
> >  	struct net_device *dev = bq->dev;
> > -	int sent = 0, err = 0;
> > +	int sent = 0, drops = 0, err = 0;
> > +	unsigned int cnt = bq->count;
> > +	int to_send = cnt;
> >  	int i;
> >  
> > -	if (unlikely(!bq->count))
> > +	if (unlikely(!cnt))
> >  		return;
> >  
> > -	for (i = 0; i < bq->count; i++) {
> > +	for (i = 0; i < cnt; i++) {
> >  		struct xdp_frame *xdpf = bq->q[i];
> >  
> >  		prefetch(xdpf);
> >  	}
> >  
> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> > +	if (bq->xdp_prog) {
> bq->xdp_prog is used here
> 
> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> > +		if (!to_send)
> > +			goto out;
> > +
> > +		drops = cnt - to_send;
> > +	}
> > +
> 
> [ ... ]
> 
> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > -		       struct net_device *dev_rx)
> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >  {
> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >  	 * from net_device drivers NAPI func end.
> > +	 *
> > +	 * Do the same with xdp_prog and flush_list since these fields
> > +	 * are only ever modified together.
> >  	 */
> > -	if (!bq->dev_rx)
> > +	if (!bq->dev_rx) {
> >  		bq->dev_rx = dev_rx;
> > +		bq->xdp_prog = xdp_prog;
> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> 
> e.g. what if the devmap elem gets deleted.

Jesper knows better than me. From my veiw, based on the description of
__dev_flush():

On devmap tear down we ensure the flush list is empty before completing to
ensure all flush operations have completed. When drivers update the bpf
program they may need to ensure any flush ops are also complete.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-15  2:37     ` Hangbin Liu
@ 2021-04-15  9:22       ` Toke Høiland-Jørgensen
  2021-04-15 17:35         ` Martin KaFai Lau
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-15  9:22 UTC (permalink / raw)
  To: Hangbin Liu, Martin KaFai Lau
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi, David Ahern,
	Andrii Nakryiko, Alexei Starovoitov, John Fastabend,
	Maciej Fijalkowski, Björn Töpel

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:
>> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >  {
>> >  	struct net_device *dev = bq->dev;
>> > -	int sent = 0, err = 0;
>> > +	int sent = 0, drops = 0, err = 0;
>> > +	unsigned int cnt = bq->count;
>> > +	int to_send = cnt;
>> >  	int i;
>> >  
>> > -	if (unlikely(!bq->count))
>> > +	if (unlikely(!cnt))
>> >  		return;
>> >  
>> > -	for (i = 0; i < bq->count; i++) {
>> > +	for (i = 0; i < cnt; i++) {
>> >  		struct xdp_frame *xdpf = bq->q[i];
>> >  
>> >  		prefetch(xdpf);
>> >  	}
>> >  
>> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> > +	if (bq->xdp_prog) {
>> bq->xdp_prog is used here
>> 
>> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> > +		if (!to_send)
>> > +			goto out;
>> > +
>> > +		drops = cnt - to_send;
>> > +	}
>> > +
>> 
>> [ ... ]
>> 
>> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> > -		       struct net_device *dev_rx)
>> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >  {
>> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >  	 * from net_device drivers NAPI func end.
>> > +	 *
>> > +	 * Do the same with xdp_prog and flush_list since these fields
>> > +	 * are only ever modified together.
>> >  	 */
>> > -	if (!bq->dev_rx)
>> > +	if (!bq->dev_rx) {
>> >  		bq->dev_rx = dev_rx;
>> > +		bq->xdp_prog = xdp_prog;
>> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> 
>> e.g. what if the devmap elem gets deleted.
>
> Jesper knows better than me. From my veiw, based on the description of
> __dev_flush():
>
> On devmap tear down we ensure the flush list is empty before completing to
> ensure all flush operations have completed. When drivers update the bpf
> program they may need to ensure any flush ops are also complete.

Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
which also runs under one big rcu_read_lock(). So the storage in the
bulk queue is quite temporary, it's just used for bulking to increase
performance :)

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support
  2021-04-15  2:21     ` Hangbin Liu
@ 2021-04-15  9:29       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-15  9:29 UTC (permalink / raw)
  To: Hangbin Liu, Martin KaFai Lau
  Cc: bpf, netdev, Jiri Benc, Jesper Dangaard Brouer, Eelco Chaudron,
	ast, Daniel Borkmann, Lorenzo Bianconi, David Ahern,
	Andrii Nakryiko, Alexei Starovoitov, John Fastabend,
	Maciej Fijalkowski, Björn Töpel

Hangbin Liu <liuhangbin@gmail.com> writes:

> On Wed, Apr 14, 2021 at 05:23:50PM -0700, Martin KaFai Lau wrote:
>> On Wed, Apr 14, 2021 at 08:26:08PM +0800, Hangbin Liu wrote:
>> [ ... ]
>> 
>> > +static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex,
>> > +						  u64 flags, u64 flag_mask,
>> >  						  void *lookup_elem(struct bpf_map *map, u32 key))
>> >  {
>> >  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> >  
>> >  	/* Lower bits of the flags are used as return code on lookup failure */
>> > -	if (unlikely(flags > XDP_TX))
>> > +	if (unlikely(flags & ~(BPF_F_ACTION_MASK | flag_mask)))
>> >  		return XDP_ABORTED;
>> >  
>> >  	ri->tgt_value = lookup_elem(map, ifindex);
>> > -	if (unlikely(!ri->tgt_value)) {
>> > +	if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
>> >  		/* If the lookup fails we want to clear out the state in the
>> >  		 * redirect_info struct completely, so that if an eBPF program
>> >  		 * performs multiple lookups, the last one always takes
>> > @@ -1482,13 +1484,21 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
>> >  		 */
>> >  		ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */
>> >  		ri->map_type = BPF_MAP_TYPE_UNSPEC;
>> > -		return flags;
>> > +		return flags & BPF_F_ACTION_MASK;
>> >  	}
>> >  
>> >  	ri->tgt_index = ifindex;
>> >  	ri->map_id = map->id;
>> >  	ri->map_type = map->map_type;
>> >  
>> > +	if (flags & BPF_F_BROADCAST) {
>> > +		WRITE_ONCE(ri->map, map);
>> Why only WRITE_ONCE on ri->map?  Is it needed?
>
> I think this is make sure the map pointer assigned to ri->map safely.
> which starts from commit f6069b9aa993 ("bpf: fix redirect to map under tail
> calls")

The reason WRITE_ONCE() is only on the map field is because that's the
one that could be changed by a remote CPU (in bpf_clear_redirect_map())
- everything else is only accessed on the local CPU.

As for whether it's strictly needed from a memory model PoV, I'm not
actually sure (and should we be using smp_{store_release,load_acquire}()
instead?); I view it mostly as an annotation to make it clear that the
map field is 'special' in this respect...

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-15  9:22       ` Toke Høiland-Jørgensen
@ 2021-04-15 17:35         ` Martin KaFai Lau
  2021-04-15 18:21           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 39+ messages in thread
From: Martin KaFai Lau @ 2021-04-15 17:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Hangbin Liu, bpf, netdev, Jiri Benc, Jesper Dangaard Brouer,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi,
	David Ahern, Andrii Nakryiko, Alexei Starovoitov, John Fastabend,
	Maciej Fijalkowski, Björn Töpel

On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
> Hangbin Liu <liuhangbin@gmail.com> writes:
> 
> > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:
> >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >  {
> >> >  	struct net_device *dev = bq->dev;
> >> > -	int sent = 0, err = 0;
> >> > +	int sent = 0, drops = 0, err = 0;
> >> > +	unsigned int cnt = bq->count;
> >> > +	int to_send = cnt;
> >> >  	int i;
> >> >  
> >> > -	if (unlikely(!bq->count))
> >> > +	if (unlikely(!cnt))
> >> >  		return;
> >> >  
> >> > -	for (i = 0; i < bq->count; i++) {
> >> > +	for (i = 0; i < cnt; i++) {
> >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >  
> >> >  		prefetch(xdpf);
> >> >  	}
> >> >  
> >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> > +	if (bq->xdp_prog) {
> >> bq->xdp_prog is used here
> >> 
> >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> > +		if (!to_send)
> >> > +			goto out;
> >> > +
> >> > +		drops = cnt - to_send;
> >> > +	}
> >> > +
> >> 
> >> [ ... ]
> >> 
> >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> > -		       struct net_device *dev_rx)
> >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >  {
> >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >  	 * from net_device drivers NAPI func end.
> >> > +	 *
> >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> > +	 * are only ever modified together.
> >> >  	 */
> >> > -	if (!bq->dev_rx)
> >> > +	if (!bq->dev_rx) {
> >> >  		bq->dev_rx = dev_rx;
> >> > +		bq->xdp_prog = xdp_prog;
> >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> 
> >> e.g. what if the devmap elem gets deleted.
> >
> > Jesper knows better than me. From my veiw, based on the description of
> > __dev_flush():
> >
> > On devmap tear down we ensure the flush list is empty before completing to
> > ensure all flush operations have completed. When drivers update the bpf
> > program they may need to ensure any flush ops are also complete.
AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.

> 
> Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> which also runs under one big rcu_read_lock(). So the storage in the
> bulk queue is quite temporary, it's just used for bulking to increase
> performance :)
I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
in i40e_run_xdp() and it is fine.

In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
or I missed the big rcu_read_lock() in i40e_napi_poll()?

I do see the big rcu_read_lock() in mlx5e_napi_poll().

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-15 17:35         ` Martin KaFai Lau
@ 2021-04-15 18:21           ` Jesper Dangaard Brouer
  2021-04-15 20:29             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2021-04-15 18:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Toke Høiland-Jørgensen, Hangbin Liu, bpf, netdev,
	Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, brouer

On Thu, 15 Apr 2021 10:35:51 -0700
Martin KaFai Lau <kafai@fb.com> wrote:

> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >   
> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > >> >  {
> > >> >  	struct net_device *dev = bq->dev;
> > >> > -	int sent = 0, err = 0;
> > >> > +	int sent = 0, drops = 0, err = 0;
> > >> > +	unsigned int cnt = bq->count;
> > >> > +	int to_send = cnt;
> > >> >  	int i;
> > >> >  
> > >> > -	if (unlikely(!bq->count))
> > >> > +	if (unlikely(!cnt))
> > >> >  		return;
> > >> >  
> > >> > -	for (i = 0; i < bq->count; i++) {
> > >> > +	for (i = 0; i < cnt; i++) {
> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> > >> >  
> > >> >  		prefetch(xdpf);
> > >> >  	}
> > >> >  
> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> > >> > +	if (bq->xdp_prog) {  
> > >> bq->xdp_prog is used here
> > >>   
> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> > >> > +		if (!to_send)
> > >> > +			goto out;
> > >> > +
> > >> > +		drops = cnt - to_send;
> > >> > +	}
> > >> > +  
> > >> 
> > >> [ ... ]
> > >>   
> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > >> > -		       struct net_device *dev_rx)
> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> > >> >  {
> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> > >> >  	 * from net_device drivers NAPI func end.
> > >> > +	 *
> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> > >> > +	 * are only ever modified together.
> > >> >  	 */
> > >> > -	if (!bq->dev_rx)
> > >> > +	if (!bq->dev_rx) {
> > >> >  		bq->dev_rx = dev_rx;
> > >> > +		bq->xdp_prog = xdp_prog;  
> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> > >> 
> > >> e.g. what if the devmap elem gets deleted.  
> > >
> > > Jesper knows better than me. From my veiw, based on the description of
> > > __dev_flush():
> > >
> > > On devmap tear down we ensure the flush list is empty before completing to
> > > ensure all flush operations have completed. When drivers update the bpf
> > > program they may need to ensure any flush ops are also complete.  
>
> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> 
> > 
> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> > which also runs under one big rcu_read_lock(). So the storage in the
> > bulk queue is quite temporary, it's just used for bulking to increase
> > performance :)  
>
> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> in i40e_run_xdp() and it is fine.
> 
> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>
> I do see the big rcu_read_lock() in mlx5e_napi_poll().

I believed/assumed xdp_do_flush_map() was already protected under an
rcu_read_lock.  As the devmap and cpumap, which get called via
__dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
are operating on.

Perhaps it is a bug in i40e?

We are running in softirq in NAPI context, when xdp_do_flush_map() is
call, which I think means that this CPU will not go-through a RCU grace
period before we exit softirq, so in-practice it should be safe.  But
to be correct I do think we need a rcu_read_lock() around this call.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-15 18:21           ` Jesper Dangaard Brouer
@ 2021-04-15 20:29             ` Toke Høiland-Jørgensen
  2021-04-16  0:39               ` Martin KaFai Lau
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-15 20:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Martin KaFai Lau
  Cc: Hangbin Liu, bpf, netdev, Jiri Benc, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, brouer

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Thu, 15 Apr 2021 10:35:51 -0700
> Martin KaFai Lau <kafai@fb.com> wrote:
>
>> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
>> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >   
>> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
>> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> > >> >  {
>> > >> >  	struct net_device *dev = bq->dev;
>> > >> > -	int sent = 0, err = 0;
>> > >> > +	int sent = 0, drops = 0, err = 0;
>> > >> > +	unsigned int cnt = bq->count;
>> > >> > +	int to_send = cnt;
>> > >> >  	int i;
>> > >> >  
>> > >> > -	if (unlikely(!bq->count))
>> > >> > +	if (unlikely(!cnt))
>> > >> >  		return;
>> > >> >  
>> > >> > -	for (i = 0; i < bq->count; i++) {
>> > >> > +	for (i = 0; i < cnt; i++) {
>> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> > >> >  
>> > >> >  		prefetch(xdpf);
>> > >> >  	}
>> > >> >  
>> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> > >> > +	if (bq->xdp_prog) {  
>> > >> bq->xdp_prog is used here
>> > >>   
>> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> > >> > +		if (!to_send)
>> > >> > +			goto out;
>> > >> > +
>> > >> > +		drops = cnt - to_send;
>> > >> > +	}
>> > >> > +  
>> > >> 
>> > >> [ ... ]
>> > >>   
>> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> > >> > -		       struct net_device *dev_rx)
>> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> > >> >  {
>> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> > >> >  	 * from net_device drivers NAPI func end.
>> > >> > +	 *
>> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> > >> > +	 * are only ever modified together.
>> > >> >  	 */
>> > >> > -	if (!bq->dev_rx)
>> > >> > +	if (!bq->dev_rx) {
>> > >> >  		bq->dev_rx = dev_rx;
>> > >> > +		bq->xdp_prog = xdp_prog;  
>> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> > >> 
>> > >> e.g. what if the devmap elem gets deleted.  
>> > >
>> > > Jesper knows better than me. From my veiw, based on the description of
>> > > __dev_flush():
>> > >
>> > > On devmap tear down we ensure the flush list is empty before completing to
>> > > ensure all flush operations have completed. When drivers update the bpf
>> > > program they may need to ensure any flush ops are also complete.  
>>
>> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> 
>> > 
>> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> > which also runs under one big rcu_read_lock(). So the storage in the
>> > bulk queue is quite temporary, it's just used for bulking to increase
>> > performance :)  
>>
>> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> in i40e_run_xdp() and it is fine.
>> 
>> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>>
>> I do see the big rcu_read_lock() in mlx5e_napi_poll().
>
> I believed/assumed xdp_do_flush_map() was already protected under an
> rcu_read_lock.  As the devmap and cpumap, which get called via
> __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> are operating on.
>
> Perhaps it is a bug in i40e?
>
> We are running in softirq in NAPI context, when xdp_do_flush_map() is
> call, which I think means that this CPU will not go-through a RCU grace
> period before we exit softirq, so in-practice it should be safe.

Yup, this seems to be correct: rcu_softirq_qs() is only called between
full invocations of the softirq handler, which for networking is
net_rx_action(), and so translates into full NAPI poll cycles.

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-15 20:29             ` Toke Høiland-Jørgensen
@ 2021-04-16  0:39               ` Martin KaFai Lau
  2021-04-16 10:03                 ` Toke Høiland-Jørgensen
  2021-04-16 13:45                 ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 39+ messages in thread
From: Martin KaFai Lau @ 2021-04-16  0:39 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, Hangbin Liu, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi,
	David Ahern, Andrii Nakryiko, Alexei Starovoitov, John Fastabend,
	Maciej Fijalkowski, Björn Töpel

On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> 
> > On Thu, 15 Apr 2021 10:35:51 -0700
> > Martin KaFai Lau <kafai@fb.com> wrote:
> >
> >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
> >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >   
> >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
> >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> > >> >  {
> >> > >> >  	struct net_device *dev = bq->dev;
> >> > >> > -	int sent = 0, err = 0;
> >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> > >> > +	unsigned int cnt = bq->count;
> >> > >> > +	int to_send = cnt;
> >> > >> >  	int i;
> >> > >> >  
> >> > >> > -	if (unlikely(!bq->count))
> >> > >> > +	if (unlikely(!cnt))
> >> > >> >  		return;
> >> > >> >  
> >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> > >> > +	for (i = 0; i < cnt; i++) {
> >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> > >> >  
> >> > >> >  		prefetch(xdpf);
> >> > >> >  	}
> >> > >> >  
> >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> > >> > +	if (bq->xdp_prog) {  
> >> > >> bq->xdp_prog is used here
> >> > >>   
> >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> > >> > +		if (!to_send)
> >> > >> > +			goto out;
> >> > >> > +
> >> > >> > +		drops = cnt - to_send;
> >> > >> > +	}
> >> > >> > +  
> >> > >> 
> >> > >> [ ... ]
> >> > >>   
> >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> > >> > -		       struct net_device *dev_rx)
> >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> > >> >  {
> >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> > >> >  	 * from net_device drivers NAPI func end.
> >> > >> > +	 *
> >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> > >> > +	 * are only ever modified together.
> >> > >> >  	 */
> >> > >> > -	if (!bq->dev_rx)
> >> > >> > +	if (!bq->dev_rx) {
> >> > >> >  		bq->dev_rx = dev_rx;
> >> > >> > +		bq->xdp_prog = xdp_prog;  
> >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> > >> 
> >> > >> e.g. what if the devmap elem gets deleted.  
> >> > >
> >> > > Jesper knows better than me. From my veiw, based on the description of
> >> > > __dev_flush():
> >> > >
> >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> > > ensure all flush operations have completed. When drivers update the bpf
> >> > > program they may need to ensure any flush ops are also complete.  
> >>
> >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> 
> >> > 
> >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> > performance :)  
> >>
> >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> in i40e_run_xdp() and it is fine.
> >> 
> >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >>
> >> I do see the big rcu_read_lock() in mlx5e_napi_poll().
> >
> > I believed/assumed xdp_do_flush_map() was already protected under an
> > rcu_read_lock.  As the devmap and cpumap, which get called via
> > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> > are operating on.
What other rcu objects it is using during flush?

> >
> > Perhaps it is a bug in i40e?
A quick look into ixgbe falls into the same bucket.
didn't look at other drivers though.

> >
> > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> > call, which I think means that this CPU will not go-through a RCU grace
> > period before we exit softirq, so in-practice it should be safe.
> 
> Yup, this seems to be correct: rcu_softirq_qs() is only called between
> full invocations of the softirq handler, which for networking is
> net_rx_action(), and so translates into full NAPI poll cycles.
I don't know enough to comment on the rcu/softirq part, may be someone
can chime in.  There is also a recent napi_threaded_poll().

If it is the case, then some of the existing rcu_read_lock() is unnecessary?
At least, it sounds incorrect to only make an exception here while keeping
other rcu_read_lock() as-is.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-16  0:39               ` Martin KaFai Lau
@ 2021-04-16 10:03                 ` Toke Høiland-Jørgensen
  2021-04-16 18:20                   ` Martin KaFai Lau
  2021-04-16 13:45                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-16 10:03 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Jesper Dangaard Brouer, Hangbin Liu, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi,
	David Ahern, Andrii Nakryiko, Alexei Starovoitov, John Fastabend,
	Maciej Fijalkowski, Björn Töpel

Martin KaFai Lau <kafai@fb.com> writes:

> On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> 
>> > On Thu, 15 Apr 2021 10:35:51 -0700
>> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >
>> >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
>> >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> >   
>> >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
>> >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> > >> >  {
>> >> > >> >  	struct net_device *dev = bq->dev;
>> >> > >> > -	int sent = 0, err = 0;
>> >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> > >> > +	unsigned int cnt = bq->count;
>> >> > >> > +	int to_send = cnt;
>> >> > >> >  	int i;
>> >> > >> >  
>> >> > >> > -	if (unlikely(!bq->count))
>> >> > >> > +	if (unlikely(!cnt))
>> >> > >> >  		return;
>> >> > >> >  
>> >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> > >> >  
>> >> > >> >  		prefetch(xdpf);
>> >> > >> >  	}
>> >> > >> >  
>> >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> > >> > +	if (bq->xdp_prog) {  
>> >> > >> bq->xdp_prog is used here
>> >> > >>   
>> >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> > >> > +		if (!to_send)
>> >> > >> > +			goto out;
>> >> > >> > +
>> >> > >> > +		drops = cnt - to_send;
>> >> > >> > +	}
>> >> > >> > +  
>> >> > >> 
>> >> > >> [ ... ]
>> >> > >>   
>> >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> > >> > -		       struct net_device *dev_rx)
>> >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> > >> >  {
>> >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> > >> >  	 * from net_device drivers NAPI func end.
>> >> > >> > +	 *
>> >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> > >> > +	 * are only ever modified together.
>> >> > >> >  	 */
>> >> > >> > -	if (!bq->dev_rx)
>> >> > >> > +	if (!bq->dev_rx) {
>> >> > >> >  		bq->dev_rx = dev_rx;
>> >> > >> > +		bq->xdp_prog = xdp_prog;  
>> >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> > >> 
>> >> > >> e.g. what if the devmap elem gets deleted.  
>> >> > >
>> >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> > > __dev_flush():
>> >> > >
>> >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> > > program they may need to ensure any flush ops are also complete.  
>> >>
>> >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> 
>> >> > 
>> >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> > performance :)  
>> >>
>> >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> in i40e_run_xdp() and it is fine.
>> >> 
>> >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >>
>> >> I do see the big rcu_read_lock() in mlx5e_napi_poll().
>> >
>> > I believed/assumed xdp_do_flush_map() was already protected under an
>> > rcu_read_lock.  As the devmap and cpumap, which get called via
>> > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> > are operating on.
> What other rcu objects it is using during flush?

The bq_enqueue() function in cpumap.c puts the 'bq' pointer onto the
flush_list, and 'bq' lives inside struct bpf_cpu_map_entry, so that's a
reference to the map entry as well.

The devmap function used to work the same way, until we changed it in
75ccae62cb8d ("xdp: Move devmap bulk queue into struct net_device").

>> > Perhaps it is a bug in i40e?
> A quick look into ixgbe falls into the same bucket.
> didn't look at other drivers though.
>
>> >
>> > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> > call, which I think means that this CPU will not go-through a RCU grace
>> > period before we exit softirq, so in-practice it should be safe.
>> 
>> Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> full invocations of the softirq handler, which for networking is
>> net_rx_action(), and so translates into full NAPI poll cycles.
>
> I don't know enough to comment on the rcu/softirq part, may be someone
> can chime in.  There is also a recent napi_threaded_poll().
>
> If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> At least, it sounds incorrect to only make an exception here while keeping
> other rcu_read_lock() as-is.

I'd tend to agree that the correct thing to do is to fix any affected
drivers so there's a wide rcu_read_lock() around the full xdp+flush. If
nothing else, this serves as an annotation for the expected lifetime of
the objects involved.

However, given that this is not a new issue, I don't think it should be
holding up this patch series... We can start a new conversation on what
the right way to fix this is - and maybe bring in Paul for advice on the
RCU side? WDYT?

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-16  0:39               ` Martin KaFai Lau
  2021-04-16 10:03                 ` Toke Høiland-Jørgensen
@ 2021-04-16 13:45                 ` Jesper Dangaard Brouer
  2021-04-16 14:35                   ` Toke Høiland-Jørgensen
  2021-04-16 18:22                   ` Martin KaFai Lau
  1 sibling, 2 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2021-04-16 13:45 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Toke Høiland-Jørgensen, Hangbin Liu, bpf, netdev,
	Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, brouer, Paul E. McKenney

On Thu, 15 Apr 2021 17:39:13 -0700
Martin KaFai Lau <kafai@fb.com> wrote:

> On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >   
> > > On Thu, 15 Apr 2021 10:35:51 -0700
> > > Martin KaFai Lau <kafai@fb.com> wrote:
> > >  
> > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> > >> >     
> > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > >> > >> >  {
> > >> > >> >  	struct net_device *dev = bq->dev;
> > >> > >> > -	int sent = 0, err = 0;
> > >> > >> > +	int sent = 0, drops = 0, err = 0;
> > >> > >> > +	unsigned int cnt = bq->count;
> > >> > >> > +	int to_send = cnt;
> > >> > >> >  	int i;
> > >> > >> >  
> > >> > >> > -	if (unlikely(!bq->count))
> > >> > >> > +	if (unlikely(!cnt))
> > >> > >> >  		return;
> > >> > >> >  
> > >> > >> > -	for (i = 0; i < bq->count; i++) {
> > >> > >> > +	for (i = 0; i < cnt; i++) {
> > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> > >> > >> >  
> > >> > >> >  		prefetch(xdpf);
> > >> > >> >  	}
> > >> > >> >  
> > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> > >> > >> > +	if (bq->xdp_prog) {    
> > >> > >> bq->xdp_prog is used here
> > >> > >>     
> > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> > >> > >> > +		if (!to_send)
> > >> > >> > +			goto out;
> > >> > >> > +
> > >> > >> > +		drops = cnt - to_send;
> > >> > >> > +	}
> > >> > >> > +    
> > >> > >> 
> > >> > >> [ ... ]
> > >> > >>     
> > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > >> > >> > -		       struct net_device *dev_rx)
> > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> > >> > >> >  {
> > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> > >> > >> >  	 * from net_device drivers NAPI func end.
> > >> > >> > +	 *
> > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> > >> > >> > +	 * are only ever modified together.
> > >> > >> >  	 */
> > >> > >> > -	if (!bq->dev_rx)
> > >> > >> > +	if (!bq->dev_rx) {
> > >> > >> >  		bq->dev_rx = dev_rx;
> > >> > >> > +		bq->xdp_prog = xdp_prog;    
> > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> > >> > >> 
> > >> > >> e.g. what if the devmap elem gets deleted.    
> > >> > >
> > >> > > Jesper knows better than me. From my veiw, based on the description of
> > >> > > __dev_flush():
> > >> > >
> > >> > > On devmap tear down we ensure the flush list is empty before completing to
> > >> > > ensure all flush operations have completed. When drivers update the bpf
> > >> > > program they may need to ensure any flush ops are also complete.    
> > >>
> > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.

The bq->xdp_prog comes form the devmap "dev" element, and it is stored
in temporarily in the "bq" structure that is only valid for this
softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
the the xdp_prog here, more below (and Q for Paul).

> > >> > 
> > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> > >> > which also runs under one big rcu_read_lock(). So the storage in the
> > >> > bulk queue is quite temporary, it's just used for bulking to increase
> > >> > performance :)    
> > >>
> > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> > >> in i40e_run_xdp() and it is fine.
> > >> 
> > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> > >>
> > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> > >
> > > I believed/assumed xdp_do_flush_map() was already protected under an
> > > rcu_read_lock.  As the devmap and cpumap, which get called via
> > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> > > are operating on.  
>
> What other rcu objects it is using during flush?

Look at code:
 kernel/bpf/cpumap.c
 kernel/bpf/devmap.c

The devmap is filled with RCU code and complicated take-down steps.  
The devmap's elements are also RCU objects and the BPF xdp_prog is
embedded in this object (struct bpf_dtab_netdev).  The call_rcu
function is __dev_map_entry_free().


> > > Perhaps it is a bug in i40e?  
>
> A quick look into ixgbe falls into the same bucket.
> didn't look at other drivers though.

Intel driver are very much in copy-paste mode.
 
> > >
> > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> > > call, which I think means that this CPU will not go-through a RCU grace
> > > period before we exit softirq, so in-practice it should be safe.  
> > 
> > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> > full invocations of the softirq handler, which for networking is
> > net_rx_action(), and so translates into full NAPI poll cycles.  
>
> I don't know enough to comment on the rcu/softirq part, may be someone
> can chime in.  There is also a recent napi_threaded_poll().

CC added Paul. (link to patch[1][2] for context)

> If it is the case, then some of the existing rcu_read_lock() is unnecessary?

Well, in many cases, especially depending on how kernel is compiled,
that is true.  But we want to keep these, as they also document the
intend of the programmer.  And allow us to make the kernel even more
preempt-able in the future.

> At least, it sounds incorrect to only make an exception here while keeping
> other rcu_read_lock() as-is.

Let me be clear:  I think you have spotted a problem, and we need to
add rcu_read_lock() at least around the invocation of
bpf_prog_run_xdp() or before around if-statement that call
dev_map_bpf_prog_run(). (Hangbin please do this in V8).

Thank you Martin for reviewing the code carefully enough to find this
issue, that some drivers don't have a RCU-section around the full XDP
code path in their NAPI-loop.

Question to Paul.  (I will attempt to describe in generic terms what
happens, but ref real-function names).

We are running in softirq/NAPI context, the driver will call a
bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
some driver wrap this with a rcu_read_lock/unlock() section (other have
a large RCU-read section, that include the flush operation).

In the bq_enqueue() function we have a per_cpu_ptr (that store the
xdp_frame packets) that will get flushed/send in the call
xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
happen before we end our softirq/NAPI context.

The extension is that the per_cpu_ptr data structure (after this patch)
store a pointer to an xdp_prog (which is a RCU object).  In the flush
operation (which we will wrap with RCU-read section), we will use this
xdp_prog pointer.   I can see that it is in-principle wrong to pass
this-pointer between RCU-read sections, but I consider this safe as we
are running under softirq/NAPI and the per_cpu_ptr is only valid in
this short interval.

I claim a grace/quiescent RCU cannot happen between these two RCU-read
sections, but I might be wrong? (especially in the future or for RT).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://lore.kernel.org/netdev/20210414122610.4037085-2-liuhangbin@gmail.com/
[2] https://patchwork.kernel.org/project/netdevbpf/patch/20210414122610.4037085-2-liuhangbin@gmail.com/


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-16 13:45                 ` Jesper Dangaard Brouer
@ 2021-04-16 14:35                   ` Toke Høiland-Jørgensen
  2021-04-16 18:22                   ` Martin KaFai Lau
  1 sibling, 0 replies; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-16 14:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Martin KaFai Lau
  Cc: Hangbin Liu, bpf, netdev, Jiri Benc, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, brouer, Paul E. McKenney

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Thu, 15 Apr 2021 17:39:13 -0700
> Martin KaFai Lau <kafai@fb.com> wrote:
>
>> On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >   
>> > > On Thu, 15 Apr 2021 10:35:51 -0700
>> > > Martin KaFai Lau <kafai@fb.com> wrote:
>> > >  
>> > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> > >> >     
>> > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> > >> > >> >  {
>> > >> > >> >  	struct net_device *dev = bq->dev;
>> > >> > >> > -	int sent = 0, err = 0;
>> > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> > >> > >> > +	unsigned int cnt = bq->count;
>> > >> > >> > +	int to_send = cnt;
>> > >> > >> >  	int i;
>> > >> > >> >  
>> > >> > >> > -	if (unlikely(!bq->count))
>> > >> > >> > +	if (unlikely(!cnt))
>> > >> > >> >  		return;
>> > >> > >> >  
>> > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> > >> > >> > +	for (i = 0; i < cnt; i++) {
>> > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> > >> > >> >  
>> > >> > >> >  		prefetch(xdpf);
>> > >> > >> >  	}
>> > >> > >> >  
>> > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> > >> > >> > +	if (bq->xdp_prog) {    
>> > >> > >> bq->xdp_prog is used here
>> > >> > >>     
>> > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> > >> > >> > +		if (!to_send)
>> > >> > >> > +			goto out;
>> > >> > >> > +
>> > >> > >> > +		drops = cnt - to_send;
>> > >> > >> > +	}
>> > >> > >> > +    
>> > >> > >> 
>> > >> > >> [ ... ]
>> > >> > >>     
>> > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> > >> > >> > -		       struct net_device *dev_rx)
>> > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> > >> > >> >  {
>> > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> > >> > >> >  	 * from net_device drivers NAPI func end.
>> > >> > >> > +	 *
>> > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> > >> > >> > +	 * are only ever modified together.
>> > >> > >> >  	 */
>> > >> > >> > -	if (!bq->dev_rx)
>> > >> > >> > +	if (!bq->dev_rx) {
>> > >> > >> >  		bq->dev_rx = dev_rx;
>> > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> > >> > >> 
>> > >> > >> e.g. what if the devmap elem gets deleted.    
>> > >> > >
>> > >> > > Jesper knows better than me. From my veiw, based on the description of
>> > >> > > __dev_flush():
>> > >> > >
>> > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> > >> > > ensure all flush operations have completed. When drivers update the bpf
>> > >> > > program they may need to ensure any flush ops are also complete.    
>> > >>
>> > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>
> The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> in temporarily in the "bq" structure that is only valid for this
> softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> the the xdp_prog here, more below (and Q for Paul).
>
>> > >> > 
>> > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> > >> > performance :)    
>> > >>
>> > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> > >> in i40e_run_xdp() and it is fine.
>> > >> 
>> > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> > >>
>> > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> > >
>> > > I believed/assumed xdp_do_flush_map() was already protected under an
>> > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> > > are operating on.  
>>
>> What other rcu objects it is using during flush?
>
> Look at code:
>  kernel/bpf/cpumap.c
>  kernel/bpf/devmap.c
>
> The devmap is filled with RCU code and complicated take-down steps.  
> The devmap's elements are also RCU objects and the BPF xdp_prog is
> embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> function is __dev_map_entry_free().
>
>
>> > > Perhaps it is a bug in i40e?  
>>
>> A quick look into ixgbe falls into the same bucket.
>> didn't look at other drivers though.
>
> Intel driver are very much in copy-paste mode.
>  
>> > >
>> > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> > > call, which I think means that this CPU will not go-through a RCU grace
>> > > period before we exit softirq, so in-practice it should be safe.  
>> > 
>> > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> > full invocations of the softirq handler, which for networking is
>> > net_rx_action(), and so translates into full NAPI poll cycles.  
>>
>> I don't know enough to comment on the rcu/softirq part, may be someone
>> can chime in.  There is also a recent napi_threaded_poll().
>
> CC added Paul. (link to patch[1][2] for context)
>
>> If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>
> Well, in many cases, especially depending on how kernel is compiled,
> that is true.  But we want to keep these, as they also document the
> intend of the programmer.  And allow us to make the kernel even more
> preempt-able in the future.
>
>> At least, it sounds incorrect to only make an exception here while keeping
>> other rcu_read_lock() as-is.
>
> Let me be clear:  I think you have spotted a problem, and we need to
> add rcu_read_lock() at least around the invocation of
> bpf_prog_run_xdp() or before around if-statement that call
> dev_map_bpf_prog_run(). (Hangbin please do this in V8).

I'm not sure adding that is going to help, though? It'll make the
potential race window smaller (assuming there is one and we're not
protected by running inside the NAPI poll), but the pointer could still
be invalidated between the two rcu_read_lock() sections. So adding such
an (inner) rcu_read_lock() feels like just papering over the issue?

I think that to fix this properly we either (a) need to conclude that
it's not actually an issue because of the NAPI thing, (b) fix the
drivers to include everything in one big rcu_read_lock() or (c)
restructure the code so it doesn't assume the RCU protection at all.

(c) seems like a lot of work for little gain, so I guess we're left with
(b) unless Paul tells us that (a) is good enough? :)

I guess a variant of (b) that doesn't involve going through all the
drivers could be to just add an rcu_read_lock()/unlock() in the
top-level napi_poll() function? Any reason that couldn't work?

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-16 10:03                 ` Toke Høiland-Jørgensen
@ 2021-04-16 18:20                   ` Martin KaFai Lau
  0 siblings, 0 replies; 39+ messages in thread
From: Martin KaFai Lau @ 2021-04-16 18:20 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, Hangbin Liu, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi,
	David Ahern, Andrii Nakryiko, Alexei Starovoitov, John Fastabend,
	Maciej Fijalkowski, Björn Töpel

On Fri, Apr 16, 2021 at 12:03:41PM +0200, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <kafai@fb.com> writes:
> 
> > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> 
> >> > On Thu, 15 Apr 2021 10:35:51 -0700
> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >
> >> >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
> >> >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> >   
> >> >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
> >> >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> > >> >  {
> >> >> > >> >  	struct net_device *dev = bq->dev;
> >> >> > >> > -	int sent = 0, err = 0;
> >> >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> > >> > +	unsigned int cnt = bq->count;
> >> >> > >> > +	int to_send = cnt;
> >> >> > >> >  	int i;
> >> >> > >> >  
> >> >> > >> > -	if (unlikely(!bq->count))
> >> >> > >> > +	if (unlikely(!cnt))
> >> >> > >> >  		return;
> >> >> > >> >  
> >> >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> > >> >  
> >> >> > >> >  		prefetch(xdpf);
> >> >> > >> >  	}
> >> >> > >> >  
> >> >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> > >> > +	if (bq->xdp_prog) {  
> >> >> > >> bq->xdp_prog is used here
> >> >> > >>   
> >> >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> > >> > +		if (!to_send)
> >> >> > >> > +			goto out;
> >> >> > >> > +
> >> >> > >> > +		drops = cnt - to_send;
> >> >> > >> > +	}
> >> >> > >> > +  
> >> >> > >> 
> >> >> > >> [ ... ]
> >> >> > >>   
> >> >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> > >> > -		       struct net_device *dev_rx)
> >> >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> > >> >  {
> >> >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> > >> > +	 *
> >> >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> > >> > +	 * are only ever modified together.
> >> >> > >> >  	 */
> >> >> > >> > -	if (!bq->dev_rx)
> >> >> > >> > +	if (!bq->dev_rx) {
> >> >> > >> >  		bq->dev_rx = dev_rx;
> >> >> > >> > +		bq->xdp_prog = xdp_prog;  
> >> >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> > >> 
> >> >> > >> e.g. what if the devmap elem gets deleted.  
> >> >> > >
> >> >> > > Jesper knows better than me. From my veiw, based on the description of
> >> >> > > __dev_flush():
> >> >> > >
> >> >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> > > program they may need to ensure any flush ops are also complete.  
> >> >>
> >> >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> 
> >> >> > 
> >> >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> > performance :)  
> >> >>
> >> >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> in i40e_run_xdp() and it is fine.
> >> >> 
> >> >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> >>
> >> >> I do see the big rcu_read_lock() in mlx5e_napi_poll().
> >> >
> >> > I believed/assumed xdp_do_flush_map() was already protected under an
> >> > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> > are operating on.
> > What other rcu objects it is using during flush?
> 
> The bq_enqueue() function in cpumap.c puts the 'bq' pointer onto the
> flush_list, and 'bq' lives inside struct bpf_cpu_map_entry, so that's a
> reference to the map entry as well.
> 
> The devmap function used to work the same way, until we changed it in
> 75ccae62cb8d ("xdp: Move devmap bulk queue into struct net_device").
Got it. Thanks for the explanation in bq_enqueue() in cpumap.c.
I was under the impression that xdp_do_flush_map() should not
use any rcu object now since I don't see rcu_read_lock() there
and I use it as a hint in code reading.

> >> > Perhaps it is a bug in i40e?
> > A quick look into ixgbe falls into the same bucket.
> > didn't look at other drivers though.
> >
> >> >
> >> > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> > call, which I think means that this CPU will not go-through a RCU grace
> >> > period before we exit softirq, so in-practice it should be safe.
> >> 
> >> Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> full invocations of the softirq handler, which for networking is
> >> net_rx_action(), and so translates into full NAPI poll cycles.
> >
> > I don't know enough to comment on the rcu/softirq part, may be someone
> > can chime in.  There is also a recent napi_threaded_poll().
> >
> > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> > At least, it sounds incorrect to only make an exception here while keeping
> > other rcu_read_lock() as-is.
> 
> I'd tend to agree that the correct thing to do is to fix any affected
> drivers so there's a wide rcu_read_lock() around the full xdp+flush. If
> nothing else, this serves as an annotation for the expected lifetime of
> the objects involved.
> 
> However, given that this is not a new issue, I don't think it should be
> holding up this patch series... We can start a new conversation on what
> the right way to fix this is - and maybe bring in Paul for advice on the
> RCU side? WDYT?
Yeah...it falls into the same issue as the current bq_enqueue() in cpumap.c.
I am fine to put them together into the solve later bucket.  I will delegate
this decision to the maintainers.

I would wait a bit on Paul's reply though.

Also, patch 2 does not necessary depend on patch 1?  Another option is to post
patch 1 separately later as an optimization when the rcu discussion concluded.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-16 13:45                 ` Jesper Dangaard Brouer
  2021-04-16 14:35                   ` Toke Høiland-Jørgensen
@ 2021-04-16 18:22                   ` Martin KaFai Lau
  2021-04-17  0:23                     ` Paul E. McKenney
  1 sibling, 1 reply; 39+ messages in thread
From: Martin KaFai Lau @ 2021-04-16 18:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, Hangbin Liu, bpf, netdev,
	Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel, Paul E. McKenney

On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 15 Apr 2021 17:39:13 -0700
> Martin KaFai Lau <kafai@fb.com> wrote:
> 
> > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> > >   
> > > > On Thu, 15 Apr 2021 10:35:51 -0700
> > > > Martin KaFai Lau <kafai@fb.com> wrote:
> > > >  
> > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> > > >> >     
> > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > >> > >> >  {
> > > >> > >> >  	struct net_device *dev = bq->dev;
> > > >> > >> > -	int sent = 0, err = 0;
> > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> > > >> > >> > +	unsigned int cnt = bq->count;
> > > >> > >> > +	int to_send = cnt;
> > > >> > >> >  	int i;
> > > >> > >> >  
> > > >> > >> > -	if (unlikely(!bq->count))
> > > >> > >> > +	if (unlikely(!cnt))
> > > >> > >> >  		return;
> > > >> > >> >  
> > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> > > >> > >> > +	for (i = 0; i < cnt; i++) {
> > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> > > >> > >> >  
> > > >> > >> >  		prefetch(xdpf);
> > > >> > >> >  	}
> > > >> > >> >  
> > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> > > >> > >> > +	if (bq->xdp_prog) {    
> > > >> > >> bq->xdp_prog is used here
> > > >> > >>     
> > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> > > >> > >> > +		if (!to_send)
> > > >> > >> > +			goto out;
> > > >> > >> > +
> > > >> > >> > +		drops = cnt - to_send;
> > > >> > >> > +	}
> > > >> > >> > +    
> > > >> > >> 
> > > >> > >> [ ... ]
> > > >> > >>     
> > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > > >> > >> > -		       struct net_device *dev_rx)
> > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> > > >> > >> >  {
> > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> > > >> > >> >  	 * from net_device drivers NAPI func end.
> > > >> > >> > +	 *
> > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> > > >> > >> > +	 * are only ever modified together.
> > > >> > >> >  	 */
> > > >> > >> > -	if (!bq->dev_rx)
> > > >> > >> > +	if (!bq->dev_rx) {
> > > >> > >> >  		bq->dev_rx = dev_rx;
> > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> > > >> > >> 
> > > >> > >> e.g. what if the devmap elem gets deleted.    
> > > >> > >
> > > >> > > Jesper knows better than me. From my veiw, based on the description of
> > > >> > > __dev_flush():
> > > >> > >
> > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> > > >> > > ensure all flush operations have completed. When drivers update the bpf
> > > >> > > program they may need to ensure any flush ops are also complete.    
> > > >>
> > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> 
> The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> in temporarily in the "bq" structure that is only valid for this
> softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> the the xdp_prog here, more below (and Q for Paul).
> 
> > > >> > 
> > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> > > >> > performance :)    
> > > >>
> > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> > > >> in i40e_run_xdp() and it is fine.
> > > >> 
> > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> > > >>
> > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> > > >
> > > > I believed/assumed xdp_do_flush_map() was already protected under an
> > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> > > > are operating on.  
> >
> > What other rcu objects it is using during flush?
> 
> Look at code:
>  kernel/bpf/cpumap.c
>  kernel/bpf/devmap.c
> 
> The devmap is filled with RCU code and complicated take-down steps.  
> The devmap's elements are also RCU objects and the BPF xdp_prog is
> embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> function is __dev_map_entry_free().
> 
> 
> > > > Perhaps it is a bug in i40e?  
> >
> > A quick look into ixgbe falls into the same bucket.
> > didn't look at other drivers though.
> 
> Intel driver are very much in copy-paste mode.
>  
> > > >
> > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> > > > call, which I think means that this CPU will not go-through a RCU grace
> > > > period before we exit softirq, so in-practice it should be safe.  
> > > 
> > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> > > full invocations of the softirq handler, which for networking is
> > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >
> > I don't know enough to comment on the rcu/softirq part, may be someone
> > can chime in.  There is also a recent napi_threaded_poll().
> 
> CC added Paul. (link to patch[1][2] for context)
Updated Paul's email address.

> 
> > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> 
> Well, in many cases, especially depending on how kernel is compiled,
> that is true.  But we want to keep these, as they also document the
> intend of the programmer.  And allow us to make the kernel even more
> preempt-able in the future.
> 
> > At least, it sounds incorrect to only make an exception here while keeping
> > other rcu_read_lock() as-is.
> 
> Let me be clear:  I think you have spotted a problem, and we need to
> add rcu_read_lock() at least around the invocation of
> bpf_prog_run_xdp() or before around if-statement that call
> dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> 
> Thank you Martin for reviewing the code carefully enough to find this
> issue, that some drivers don't have a RCU-section around the full XDP
> code path in their NAPI-loop.
> 
> Question to Paul.  (I will attempt to describe in generic terms what
> happens, but ref real-function names).
> 
> We are running in softirq/NAPI context, the driver will call a
> bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> some driver wrap this with a rcu_read_lock/unlock() section (other have
> a large RCU-read section, that include the flush operation).
> 
> In the bq_enqueue() function we have a per_cpu_ptr (that store the
> xdp_frame packets) that will get flushed/send in the call
> xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> happen before we end our softirq/NAPI context.
> 
> The extension is that the per_cpu_ptr data structure (after this patch)
> store a pointer to an xdp_prog (which is a RCU object).  In the flush
> operation (which we will wrap with RCU-read section), we will use this
> xdp_prog pointer.   I can see that it is in-principle wrong to pass
> this-pointer between RCU-read sections, but I consider this safe as we
> are running under softirq/NAPI and the per_cpu_ptr is only valid in
> this short interval.
> 
> I claim a grace/quiescent RCU cannot happen between these two RCU-read
> sections, but I might be wrong? (especially in the future or for RT).
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer 
> 
> [1] https://lore.kernel.org/netdev/20210414122610.4037085-2-liuhangbin@gmail.com/
> [2] https://patchwork.kernel.org/project/netdevbpf/patch/20210414122610.4037085-2-liuhangbin@gmail.com/
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-16 18:22                   ` Martin KaFai Lau
@ 2021-04-17  0:23                     ` Paul E. McKenney
  2021-04-17 12:27                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-17  0:23 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Hangbin Liu, bpf, netdev, Jiri Benc, Eelco Chaudron, ast,
	Daniel Borkmann, Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 15 Apr 2021 17:39:13 -0700
> > Martin KaFai Lau <kafai@fb.com> wrote:
> > 
> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> > > >   
> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >  
> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> > > > >> >     
> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > > >> > >> >  {
> > > > >> > >> >  	struct net_device *dev = bq->dev;
> > > > >> > >> > -	int sent = 0, err = 0;
> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> > > > >> > >> > +	unsigned int cnt = bq->count;
> > > > >> > >> > +	int to_send = cnt;
> > > > >> > >> >  	int i;
> > > > >> > >> >  
> > > > >> > >> > -	if (unlikely(!bq->count))
> > > > >> > >> > +	if (unlikely(!cnt))
> > > > >> > >> >  		return;
> > > > >> > >> >  
> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> > > > >> > >> >  
> > > > >> > >> >  		prefetch(xdpf);
> > > > >> > >> >  	}
> > > > >> > >> >  
> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> > > > >> > >> > +	if (bq->xdp_prog) {    
> > > > >> > >> bq->xdp_prog is used here
> > > > >> > >>     
> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> > > > >> > >> > +		if (!to_send)
> > > > >> > >> > +			goto out;
> > > > >> > >> > +
> > > > >> > >> > +		drops = cnt - to_send;
> > > > >> > >> > +	}
> > > > >> > >> > +    
> > > > >> > >> 
> > > > >> > >> [ ... ]
> > > > >> > >>     
> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > > > >> > >> > -		       struct net_device *dev_rx)
> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> > > > >> > >> >  {
> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> > > > >> > >> > +	 *
> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> > > > >> > >> > +	 * are only ever modified together.
> > > > >> > >> >  	 */
> > > > >> > >> > -	if (!bq->dev_rx)
> > > > >> > >> > +	if (!bq->dev_rx) {
> > > > >> > >> >  		bq->dev_rx = dev_rx;
> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> > > > >> > >> 
> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> > > > >> > >
> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> > > > >> > > __dev_flush():
> > > > >> > >
> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> > > > >> > > program they may need to ensure any flush ops are also complete.    
> > > > >>
> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> > 
> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> > in temporarily in the "bq" structure that is only valid for this
> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> > the the xdp_prog here, more below (and Q for Paul).
> > 
> > > > >> > 
> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> > > > >> > performance :)    
> > > > >>
> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> > > > >> in i40e_run_xdp() and it is fine.
> > > > >> 
> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> > > > >>
> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> > > > >
> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> > > > > are operating on.  
> > >
> > > What other rcu objects it is using during flush?
> > 
> > Look at code:
> >  kernel/bpf/cpumap.c
> >  kernel/bpf/devmap.c
> > 
> > The devmap is filled with RCU code and complicated take-down steps.  
> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> > function is __dev_map_entry_free().
> > 
> > 
> > > > > Perhaps it is a bug in i40e?  
> > >
> > > A quick look into ixgbe falls into the same bucket.
> > > didn't look at other drivers though.
> > 
> > Intel driver are very much in copy-paste mode.
> >  
> > > > >
> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> > > > > call, which I think means that this CPU will not go-through a RCU grace
> > > > > period before we exit softirq, so in-practice it should be safe.  
> > > > 
> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> > > > full invocations of the softirq handler, which for networking is
> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> > >
> > > I don't know enough to comment on the rcu/softirq part, may be someone
> > > can chime in.  There is also a recent napi_threaded_poll().
> > 
> > CC added Paul. (link to patch[1][2] for context)
> Updated Paul's email address.
> 
> > 
> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> > 
> > Well, in many cases, especially depending on how kernel is compiled,
> > that is true.  But we want to keep these, as they also document the
> > intend of the programmer.  And allow us to make the kernel even more
> > preempt-able in the future.
> > 
> > > At least, it sounds incorrect to only make an exception here while keeping
> > > other rcu_read_lock() as-is.
> > 
> > Let me be clear:  I think you have spotted a problem, and we need to
> > add rcu_read_lock() at least around the invocation of
> > bpf_prog_run_xdp() or before around if-statement that call
> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> > 
> > Thank you Martin for reviewing the code carefully enough to find this
> > issue, that some drivers don't have a RCU-section around the full XDP
> > code path in their NAPI-loop.
> > 
> > Question to Paul.  (I will attempt to describe in generic terms what
> > happens, but ref real-function names).
> > 
> > We are running in softirq/NAPI context, the driver will call a
> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> > a large RCU-read section, that include the flush operation).
> > 
> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> > xdp_frame packets) that will get flushed/send in the call
> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> > happen before we end our softirq/NAPI context.
> > 
> > The extension is that the per_cpu_ptr data structure (after this patch)
> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> > operation (which we will wrap with RCU-read section), we will use this
> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> > this-pointer between RCU-read sections, but I consider this safe as we
> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> > this short interval.
> > 
> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> > sections, but I might be wrong? (especially in the future or for RT).

If I am reading this correctly (ha!), a very high-level summary of the
code in question is something like this:

	void foo(void)
	{
		local_bh_disable();

		rcu_read_lock();
		p = rcu_dereference(gp);
		do_something_with(p);
		rcu_read_unlock();

		do_something_else();

		rcu_read_lock();
		do_some_other_thing(p);
		rcu_read_unlock();

		local_bh_enable();
	}

	void bar(struct blat *new_gp)
	{
		struct blat *old_gp;

		spin_lock(my_lock);
		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
		rcu_assign_pointer(gp, new_gp);
		spin_unlock(my_lock);
		synchronize_rcu();
		kfree(old_gp);
	}

I need to check up on -rt.

But first... In recent mainline kernels, the local_bh_disable() region
will look like one big RCU read-side critical section.  But don't try
this prior to v4.20!!!  In v4.19 and earlier, you would need to use
both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).

Except that in that case, why not just drop the inner rcu_read_unlock()
and rcu_read_lock() pair?  Awkward function boundaries or some such?

Especially given that if this works on -rt, it is probably because
their variant of do_softirq() holds rcu_read_lock() across each softirq
handler invocation.  They do something similar for rwlocks.

							Thanx, Paul

> > -- 
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer 
> > 
> > [1] https://lore.kernel.org/netdev/20210414122610.4037085-2-liuhangbin@gmail.com/
> > [2] https://patchwork.kernel.org/project/netdevbpf/patch/20210414122610.4037085-2-liuhangbin@gmail.com/
> > 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-17  0:23                     ` Paul E. McKenney
@ 2021-04-17 12:27                       ` Toke Høiland-Jørgensen
  2021-04-19 16:58                         ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-17 12:27 UTC (permalink / raw)
  To: paulmck, Martin KaFai Lau
  Cc: Jesper Dangaard Brouer, Hangbin Liu, bpf, netdev, Jiri Benc,
	Eelco Chaudron, ast, Daniel Borkmann, Lorenzo Bianconi,
	David Ahern, Andrii Nakryiko, Alexei Starovoitov, John Fastabend,
	Maciej Fijalkowski, Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> > On Thu, 15 Apr 2021 17:39:13 -0700
>> > Martin KaFai Lau <kafai@fb.com> wrote:
>> > 
>> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> > > >   
>> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> > > > >  
>> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> > > > >> >     
>> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> > > > >> > >> >  {
>> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> > > > >> > >> > -	int sent = 0, err = 0;
>> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> > > > >> > >> > +	unsigned int cnt = bq->count;
>> > > > >> > >> > +	int to_send = cnt;
>> > > > >> > >> >  	int i;
>> > > > >> > >> >  
>> > > > >> > >> > -	if (unlikely(!bq->count))
>> > > > >> > >> > +	if (unlikely(!cnt))
>> > > > >> > >> >  		return;
>> > > > >> > >> >  
>> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> > > > >> > >> >  
>> > > > >> > >> >  		prefetch(xdpf);
>> > > > >> > >> >  	}
>> > > > >> > >> >  
>> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> > > > >> > >> > +	if (bq->xdp_prog) {    
>> > > > >> > >> bq->xdp_prog is used here
>> > > > >> > >>     
>> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> > > > >> > >> > +		if (!to_send)
>> > > > >> > >> > +			goto out;
>> > > > >> > >> > +
>> > > > >> > >> > +		drops = cnt - to_send;
>> > > > >> > >> > +	}
>> > > > >> > >> > +    
>> > > > >> > >> 
>> > > > >> > >> [ ... ]
>> > > > >> > >>     
>> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> > > > >> > >> > -		       struct net_device *dev_rx)
>> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> > > > >> > >> >  {
>> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> > > > >> > >> > +	 *
>> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> > > > >> > >> > +	 * are only ever modified together.
>> > > > >> > >> >  	 */
>> > > > >> > >> > -	if (!bq->dev_rx)
>> > > > >> > >> > +	if (!bq->dev_rx) {
>> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> > > > >> > >> 
>> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> > > > >> > >
>> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> > > > >> > > __dev_flush():
>> > > > >> > >
>> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> > > > >>
>> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> > 
>> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> > in temporarily in the "bq" structure that is only valid for this
>> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> > the the xdp_prog here, more below (and Q for Paul).
>> > 
>> > > > >> > 
>> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> > > > >> > performance :)    
>> > > > >>
>> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> > > > >> in i40e_run_xdp() and it is fine.
>> > > > >> 
>> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> > > > >>
>> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> > > > >
>> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> > > > > are operating on.  
>> > >
>> > > What other rcu objects it is using during flush?
>> > 
>> > Look at code:
>> >  kernel/bpf/cpumap.c
>> >  kernel/bpf/devmap.c
>> > 
>> > The devmap is filled with RCU code and complicated take-down steps.  
>> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> > function is __dev_map_entry_free().
>> > 
>> > 
>> > > > > Perhaps it is a bug in i40e?  
>> > >
>> > > A quick look into ixgbe falls into the same bucket.
>> > > didn't look at other drivers though.
>> > 
>> > Intel driver are very much in copy-paste mode.
>> >  
>> > > > >
>> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> > > > > period before we exit softirq, so in-practice it should be safe.  
>> > > > 
>> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> > > > full invocations of the softirq handler, which for networking is
>> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> > >
>> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> > > can chime in.  There is also a recent napi_threaded_poll().
>> > 
>> > CC added Paul. (link to patch[1][2] for context)
>> Updated Paul's email address.
>> 
>> > 
>> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> > 
>> > Well, in many cases, especially depending on how kernel is compiled,
>> > that is true.  But we want to keep these, as they also document the
>> > intend of the programmer.  And allow us to make the kernel even more
>> > preempt-able in the future.
>> > 
>> > > At least, it sounds incorrect to only make an exception here while keeping
>> > > other rcu_read_lock() as-is.
>> > 
>> > Let me be clear:  I think you have spotted a problem, and we need to
>> > add rcu_read_lock() at least around the invocation of
>> > bpf_prog_run_xdp() or before around if-statement that call
>> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> > 
>> > Thank you Martin for reviewing the code carefully enough to find this
>> > issue, that some drivers don't have a RCU-section around the full XDP
>> > code path in their NAPI-loop.
>> > 
>> > Question to Paul.  (I will attempt to describe in generic terms what
>> > happens, but ref real-function names).
>> > 
>> > We are running in softirq/NAPI context, the driver will call a
>> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> > a large RCU-read section, that include the flush operation).
>> > 
>> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> > xdp_frame packets) that will get flushed/send in the call
>> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> > happen before we end our softirq/NAPI context.
>> > 
>> > The extension is that the per_cpu_ptr data structure (after this patch)
>> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> > operation (which we will wrap with RCU-read section), we will use this
>> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> > this-pointer between RCU-read sections, but I consider this safe as we
>> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> > this short interval.
>> > 
>> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> > sections, but I might be wrong? (especially in the future or for RT).
>
> If I am reading this correctly (ha!), a very high-level summary of the
> code in question is something like this:
>
> 	void foo(void)
> 	{
> 		local_bh_disable();
>
> 		rcu_read_lock();
> 		p = rcu_dereference(gp);
> 		do_something_with(p);
> 		rcu_read_unlock();
>
> 		do_something_else();
>
> 		rcu_read_lock();
> 		do_some_other_thing(p);
> 		rcu_read_unlock();
>
> 		local_bh_enable();
> 	}
>
> 	void bar(struct blat *new_gp)
> 	{
> 		struct blat *old_gp;
>
> 		spin_lock(my_lock);
> 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> 		rcu_assign_pointer(gp, new_gp);
> 		spin_unlock(my_lock);
> 		synchronize_rcu();
> 		kfree(old_gp);
> 	}

Yeah, something like that (the object is freed using call_rcu() - but I
think that's equivalent, right?). And the question is whether we need to
extend foo() so that is has one big rcu_read_lock() that covers the
whole lifetime of p.

> I need to check up on -rt.
>
> But first... In recent mainline kernels, the local_bh_disable() region
> will look like one big RCU read-side critical section.  But don't try
> this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).

OK. Variants of this code has been around since before then, but I
honestly have no idea what it looked like back then exactly...

> Except that in that case, why not just drop the inner rcu_read_unlock()
> and rcu_read_lock() pair?  Awkward function boundaries or some such?

Well if we can just treat such a local_bh_disable()/enable() pair as the
equivalent of rcu_read_lock()/unlock() then I suppose we could just get
rid of the inner ones. What about tools like lockdep; do they understand
this, or are we likely to get complaints if we remove it?

> Especially given that if this works on -rt, it is probably because
> their variant of do_softirq() holds rcu_read_lock() across each
> softirq handler invocation. They do something similar for rwlocks.

Right. Guess we'll wait for your confirmation of that, then. Thanks! :)

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-17 12:27                       ` Toke Høiland-Jørgensen
@ 2021-04-19 16:58                         ` Paul E. McKenney
  2021-04-19 18:12                           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-19 16:58 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> >> > On Thu, 15 Apr 2021 17:39:13 -0700
> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> > 
> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> > > >   
> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> >> > > > >  
> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> > > > >> >     
> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> > > > >> > >> >  {
> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
> >> > > > >> > >> > -	int sent = 0, err = 0;
> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> > > > >> > >> > +	unsigned int cnt = bq->count;
> >> > > > >> > >> > +	int to_send = cnt;
> >> > > > >> > >> >  	int i;
> >> > > > >> > >> >  
> >> > > > >> > >> > -	if (unlikely(!bq->count))
> >> > > > >> > >> > +	if (unlikely(!cnt))
> >> > > > >> > >> >  		return;
> >> > > > >> > >> >  
> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> > > > >> > >> >  
> >> > > > >> > >> >  		prefetch(xdpf);
> >> > > > >> > >> >  	}
> >> > > > >> > >> >  
> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> > > > >> > >> > +	if (bq->xdp_prog) {    
> >> > > > >> > >> bq->xdp_prog is used here
> >> > > > >> > >>     
> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> > > > >> > >> > +		if (!to_send)
> >> > > > >> > >> > +			goto out;
> >> > > > >> > >> > +
> >> > > > >> > >> > +		drops = cnt - to_send;
> >> > > > >> > >> > +	}
> >> > > > >> > >> > +    
> >> > > > >> > >> 
> >> > > > >> > >> [ ... ]
> >> > > > >> > >>     
> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> > > > >> > >> > -		       struct net_device *dev_rx)
> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> > > > >> > >> >  {
> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> >> > > > >> > >> > +	 *
> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> > > > >> > >> > +	 * are only ever modified together.
> >> > > > >> > >> >  	 */
> >> > > > >> > >> > -	if (!bq->dev_rx)
> >> > > > >> > >> > +	if (!bq->dev_rx) {
> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> > > > >> > >> 
> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> >> > > > >> > >
> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> >> > > > >> > > __dev_flush():
> >> > > > >> > >
> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
> >> > > > >>
> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> > 
> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> >> > in temporarily in the "bq" structure that is only valid for this
> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> >> > the the xdp_prog here, more below (and Q for Paul).
> >> > 
> >> > > > >> > 
> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> > > > >> > performance :)    
> >> > > > >>
> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> > > > >> in i40e_run_xdp() and it is fine.
> >> > > > >> 
> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> > > > >>
> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> >> > > > >
> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> > > > > are operating on.  
> >> > >
> >> > > What other rcu objects it is using during flush?
> >> > 
> >> > Look at code:
> >> >  kernel/bpf/cpumap.c
> >> >  kernel/bpf/devmap.c
> >> > 
> >> > The devmap is filled with RCU code and complicated take-down steps.  
> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> >> > function is __dev_map_entry_free().
> >> > 
> >> > 
> >> > > > > Perhaps it is a bug in i40e?  
> >> > >
> >> > > A quick look into ixgbe falls into the same bucket.
> >> > > didn't look at other drivers though.
> >> > 
> >> > Intel driver are very much in copy-paste mode.
> >> >  
> >> > > > >
> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
> >> > > > > period before we exit softirq, so in-practice it should be safe.  
> >> > > > 
> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> > > > full invocations of the softirq handler, which for networking is
> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >> > >
> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
> >> > > can chime in.  There is also a recent napi_threaded_poll().
> >> > 
> >> > CC added Paul. (link to patch[1][2] for context)
> >> Updated Paul's email address.
> >> 
> >> > 
> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> >> > 
> >> > Well, in many cases, especially depending on how kernel is compiled,
> >> > that is true.  But we want to keep these, as they also document the
> >> > intend of the programmer.  And allow us to make the kernel even more
> >> > preempt-able in the future.
> >> > 
> >> > > At least, it sounds incorrect to only make an exception here while keeping
> >> > > other rcu_read_lock() as-is.
> >> > 
> >> > Let me be clear:  I think you have spotted a problem, and we need to
> >> > add rcu_read_lock() at least around the invocation of
> >> > bpf_prog_run_xdp() or before around if-statement that call
> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> >> > 
> >> > Thank you Martin for reviewing the code carefully enough to find this
> >> > issue, that some drivers don't have a RCU-section around the full XDP
> >> > code path in their NAPI-loop.
> >> > 
> >> > Question to Paul.  (I will attempt to describe in generic terms what
> >> > happens, but ref real-function names).
> >> > 
> >> > We are running in softirq/NAPI context, the driver will call a
> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> >> > a large RCU-read section, that include the flush operation).
> >> > 
> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> >> > xdp_frame packets) that will get flushed/send in the call
> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> >> > happen before we end our softirq/NAPI context.
> >> > 
> >> > The extension is that the per_cpu_ptr data structure (after this patch)
> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> >> > operation (which we will wrap with RCU-read section), we will use this
> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> >> > this-pointer between RCU-read sections, but I consider this safe as we
> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> >> > this short interval.
> >> > 
> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> >> > sections, but I might be wrong? (especially in the future or for RT).
> >
> > If I am reading this correctly (ha!), a very high-level summary of the
> > code in question is something like this:
> >
> > 	void foo(void)
> > 	{
> > 		local_bh_disable();
> >
> > 		rcu_read_lock();
> > 		p = rcu_dereference(gp);
> > 		do_something_with(p);
> > 		rcu_read_unlock();
> >
> > 		do_something_else();
> >
> > 		rcu_read_lock();
> > 		do_some_other_thing(p);
> > 		rcu_read_unlock();
> >
> > 		local_bh_enable();
> > 	}
> >
> > 	void bar(struct blat *new_gp)
> > 	{
> > 		struct blat *old_gp;
> >
> > 		spin_lock(my_lock);
> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> > 		rcu_assign_pointer(gp, new_gp);
> > 		spin_unlock(my_lock);
> > 		synchronize_rcu();
> > 		kfree(old_gp);
> > 	}
> 
> Yeah, something like that (the object is freed using call_rcu() - but I
> think that's equivalent, right?). And the question is whether we need to
> extend foo() so that is has one big rcu_read_lock() that covers the
> whole lifetime of p.

Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)

> > I need to check up on -rt.
> >
> > But first... In recent mainline kernels, the local_bh_disable() region
> > will look like one big RCU read-side critical section.  But don't try
> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
> 
> OK. Variants of this code has been around since before then, but I
> honestly have no idea what it looked like back then exactly...

I know that feeling...

> > Except that in that case, why not just drop the inner rcu_read_unlock()
> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
> 
> Well if we can just treat such a local_bh_disable()/enable() pair as the
> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
> rid of the inner ones. What about tools like lockdep; do they understand
> this, or are we likely to get complaints if we remove it?

If you just got rid of the first rcu_read_unlock() and the second
rcu_read_lock() in the code above, lockdep will understand.

However, if you instead get rid of -all- of the rcu_read_lock() and
rcu_read_unlock() invocations in the code above, you would need to let
lockdep know by adding rcu_read_lock_bh_held().  So instead of this:

	p = rcu_dereference(gp);

You would do this:

	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());

This would be needed for mainline, regardless of -rt.

> > Especially given that if this works on -rt, it is probably because
> > their variant of do_softirq() holds rcu_read_lock() across each
> > softirq handler invocation. They do something similar for rwlocks.
> 
> Right. Guess we'll wait for your confirmation of that, then. Thanks! :)

Looking at v5.11.4-rt11...

And __local_bh_disable_ip() has added the required rcu_read_lock(),
so dropping all the rcu_read_lock() and rcu_read_unlock() calls would
do the right thing in -rt.  And lockdep would understand without the
rcu_read_lock_bh_held(), but that is still required for mainline.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-19 16:58                         ` Paul E. McKenney
@ 2021-04-19 18:12                           ` Toke Høiland-Jørgensen
  2021-04-19 18:32                             ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-19 18:12 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> >> > On Thu, 15 Apr 2021 17:39:13 -0700
>> >> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> > 
>> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >> > > >   
>> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> > > > >  
>> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> > > > >> >     
>> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> > > > >> > >> >  {
>> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> >> > > > >> > >> > -	int sent = 0, err = 0;
>> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> > > > >> > >> > +	unsigned int cnt = bq->count;
>> >> > > > >> > >> > +	int to_send = cnt;
>> >> > > > >> > >> >  	int i;
>> >> > > > >> > >> >  
>> >> > > > >> > >> > -	if (unlikely(!bq->count))
>> >> > > > >> > >> > +	if (unlikely(!cnt))
>> >> > > > >> > >> >  		return;
>> >> > > > >> > >> >  
>> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> > > > >> > >> >  
>> >> > > > >> > >> >  		prefetch(xdpf);
>> >> > > > >> > >> >  	}
>> >> > > > >> > >> >  
>> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> > > > >> > >> > +	if (bq->xdp_prog) {    
>> >> > > > >> > >> bq->xdp_prog is used here
>> >> > > > >> > >>     
>> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> > > > >> > >> > +		if (!to_send)
>> >> > > > >> > >> > +			goto out;
>> >> > > > >> > >> > +
>> >> > > > >> > >> > +		drops = cnt - to_send;
>> >> > > > >> > >> > +	}
>> >> > > > >> > >> > +    
>> >> > > > >> > >> 
>> >> > > > >> > >> [ ... ]
>> >> > > > >> > >>     
>> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> > > > >> > >> > -		       struct net_device *dev_rx)
>> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> > > > >> > >> >  {
>> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> >> > > > >> > >> > +	 *
>> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> > > > >> > >> > +	 * are only ever modified together.
>> >> > > > >> > >> >  	 */
>> >> > > > >> > >> > -	if (!bq->dev_rx)
>> >> > > > >> > >> > +	if (!bq->dev_rx) {
>> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> > > > >> > >> 
>> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> >> > > > >> > >
>> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> > > > >> > > __dev_flush():
>> >> > > > >> > >
>> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> >> > > > >>
>> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> > 
>> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> >> > in temporarily in the "bq" structure that is only valid for this
>> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> >> > the the xdp_prog here, more below (and Q for Paul).
>> >> > 
>> >> > > > >> > 
>> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> > > > >> > performance :)    
>> >> > > > >>
>> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> > > > >> in i40e_run_xdp() and it is fine.
>> >> > > > >> 
>> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >> > > > >>
>> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> >> > > > >
>> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> >> > > > > are operating on.  
>> >> > >
>> >> > > What other rcu objects it is using during flush?
>> >> > 
>> >> > Look at code:
>> >> >  kernel/bpf/cpumap.c
>> >> >  kernel/bpf/devmap.c
>> >> > 
>> >> > The devmap is filled with RCU code and complicated take-down steps.  
>> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> >> > function is __dev_map_entry_free().
>> >> > 
>> >> > 
>> >> > > > > Perhaps it is a bug in i40e?  
>> >> > >
>> >> > > A quick look into ixgbe falls into the same bucket.
>> >> > > didn't look at other drivers though.
>> >> > 
>> >> > Intel driver are very much in copy-paste mode.
>> >> >  
>> >> > > > >
>> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> >> > > > > period before we exit softirq, so in-practice it should be safe.  
>> >> > > > 
>> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> >> > > > full invocations of the softirq handler, which for networking is
>> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> >> > >
>> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> >> > > can chime in.  There is also a recent napi_threaded_poll().
>> >> > 
>> >> > CC added Paul. (link to patch[1][2] for context)
>> >> Updated Paul's email address.
>> >> 
>> >> > 
>> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> >> > 
>> >> > Well, in many cases, especially depending on how kernel is compiled,
>> >> > that is true.  But we want to keep these, as they also document the
>> >> > intend of the programmer.  And allow us to make the kernel even more
>> >> > preempt-able in the future.
>> >> > 
>> >> > > At least, it sounds incorrect to only make an exception here while keeping
>> >> > > other rcu_read_lock() as-is.
>> >> > 
>> >> > Let me be clear:  I think you have spotted a problem, and we need to
>> >> > add rcu_read_lock() at least around the invocation of
>> >> > bpf_prog_run_xdp() or before around if-statement that call
>> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> >> > 
>> >> > Thank you Martin for reviewing the code carefully enough to find this
>> >> > issue, that some drivers don't have a RCU-section around the full XDP
>> >> > code path in their NAPI-loop.
>> >> > 
>> >> > Question to Paul.  (I will attempt to describe in generic terms what
>> >> > happens, but ref real-function names).
>> >> > 
>> >> > We are running in softirq/NAPI context, the driver will call a
>> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> >> > a large RCU-read section, that include the flush operation).
>> >> > 
>> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> >> > xdp_frame packets) that will get flushed/send in the call
>> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> >> > happen before we end our softirq/NAPI context.
>> >> > 
>> >> > The extension is that the per_cpu_ptr data structure (after this patch)
>> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> >> > operation (which we will wrap with RCU-read section), we will use this
>> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> >> > this-pointer between RCU-read sections, but I consider this safe as we
>> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> >> > this short interval.
>> >> > 
>> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> >> > sections, but I might be wrong? (especially in the future or for RT).
>> >
>> > If I am reading this correctly (ha!), a very high-level summary of the
>> > code in question is something like this:
>> >
>> > 	void foo(void)
>> > 	{
>> > 		local_bh_disable();
>> >
>> > 		rcu_read_lock();
>> > 		p = rcu_dereference(gp);
>> > 		do_something_with(p);
>> > 		rcu_read_unlock();
>> >
>> > 		do_something_else();
>> >
>> > 		rcu_read_lock();
>> > 		do_some_other_thing(p);
>> > 		rcu_read_unlock();
>> >
>> > 		local_bh_enable();
>> > 	}
>> >
>> > 	void bar(struct blat *new_gp)
>> > 	{
>> > 		struct blat *old_gp;
>> >
>> > 		spin_lock(my_lock);
>> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
>> > 		rcu_assign_pointer(gp, new_gp);
>> > 		spin_unlock(my_lock);
>> > 		synchronize_rcu();
>> > 		kfree(old_gp);
>> > 	}
>> 
>> Yeah, something like that (the object is freed using call_rcu() - but I
>> think that's equivalent, right?). And the question is whether we need to
>> extend foo() so that is has one big rcu_read_lock() that covers the
>> whole lifetime of p.
>
> Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
> In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)

Right, gotcha!

>> > I need to check up on -rt.
>> >
>> > But first... In recent mainline kernels, the local_bh_disable() region
>> > will look like one big RCU read-side critical section.  But don't try
>> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
>> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
>> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
>> 
>> OK. Variants of this code has been around since before then, but I
>> honestly have no idea what it looked like back then exactly...
>
> I know that feeling...
>
>> > Except that in that case, why not just drop the inner rcu_read_unlock()
>> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
>> 
>> Well if we can just treat such a local_bh_disable()/enable() pair as the
>> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
>> rid of the inner ones. What about tools like lockdep; do they understand
>> this, or are we likely to get complaints if we remove it?
>
> If you just got rid of the first rcu_read_unlock() and the second
> rcu_read_lock() in the code above, lockdep will understand.

Right, but doing so entails going through all the drivers, which is what
we're trying to avoid :)

> However, if you instead get rid of -all- of the rcu_read_lock() and
> rcu_read_unlock() invocations in the code above, you would need to let
> lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
>
> 	p = rcu_dereference(gp);
>
> You would do this:
>
> 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
>
> This would be needed for mainline, regardless of -rt.

OK. And as far as I can tell this is harmless for code paths that call
the same function but from a regular rcu_read_lock()-protected section
instead from a bh-disabled section, right?

What happens, BTW, if we *don't* get rid of all the existing
rcu_read_lock() sections? Going back to your foo() example above, what
we're discussing is whether to add that second rcu_read_lock() around
do_some_other_thing(p). I.e., the first one around the rcu_dereference()
is already there (in the particular driver we're discussing), and the
local_bh_disable/enable() pair is already there. AFAICT from our
discussion, there really is not much point in adding that second
rcu_read_lock/unlock(), is there?

And because that first rcu_read_lock() around the rcu_dereference() is
already there, lockdep is not likely to complain either, so we're
basically fine? Except that the code is somewhat confusing as-is, of
course; i.e., we should probably fix it but it's not terribly urgent. Or?

Hmm, looking at it now, it seems not all the lookup code is actually
doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
a comment above it saying that RCU ensures objects won't disappear[0];
so I suppose we're at least safe from lockdep in that sense :P - but we
should definitely clean this up.

[0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391

>> > Especially given that if this works on -rt, it is probably because
>> > their variant of do_softirq() holds rcu_read_lock() across each
>> > softirq handler invocation. They do something similar for rwlocks.
>> 
>> Right. Guess we'll wait for your confirmation of that, then. Thanks! :)
>
> Looking at v5.11.4-rt11...
>
> And __local_bh_disable_ip() has added the required rcu_read_lock(),
> so dropping all the rcu_read_lock() and rcu_read_unlock() calls would
> do the right thing in -rt.  And lockdep would understand without the
> rcu_read_lock_bh_held(), but that is still required for mainline.

Great, thanks for checking!

So this brings to mind another question: Are there any performance
implications to nesting rcu_read_locks() inside each other? One
thing that would be fairly easy to do (in terms of how much code we have
to touch) is to just add a top-level rcu_read_lock() around the
napi_poll() call in the core dev code, thus making -rt and mainline
equivalent in that respect. Also, this would make it obvious that all
the RCU usage inside of NAPI is safe, without having to know about
bh_disable() and all that. But we obviously don't want to do that if it
is going to slow things down; WDYT?

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-19 18:12                           ` Toke Høiland-Jørgensen
@ 2021-04-19 18:32                             ` Paul E. McKenney
  2021-04-19 21:21                               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-19 18:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> > 
> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> >> > > >   
> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> > > > >  
> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> > > > >> >     
> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> > > > >> > >> >  {
> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
> >> >> > > > >> > >> > -	int sent = 0, err = 0;
> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
> >> >> > > > >> > >> > +	int to_send = cnt;
> >> >> > > > >> > >> >  	int i;
> >> >> > > > >> > >> >  
> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
> >> >> > > > >> > >> > +	if (unlikely(!cnt))
> >> >> > > > >> > >> >  		return;
> >> >> > > > >> > >> >  
> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> > > > >> > >> >  
> >> >> > > > >> > >> >  		prefetch(xdpf);
> >> >> > > > >> > >> >  	}
> >> >> > > > >> > >> >  
> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
> >> >> > > > >> > >> bq->xdp_prog is used here
> >> >> > > > >> > >>     
> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> > > > >> > >> > +		if (!to_send)
> >> >> > > > >> > >> > +			goto out;
> >> >> > > > >> > >> > +
> >> >> > > > >> > >> > +		drops = cnt - to_send;
> >> >> > > > >> > >> > +	}
> >> >> > > > >> > >> > +    
> >> >> > > > >> > >> 
> >> >> > > > >> > >> [ ... ]
> >> >> > > > >> > >>     
> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> > > > >> > >> >  {
> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> > > > >> > >> > +	 *
> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> > > > >> > >> > +	 * are only ever modified together.
> >> >> > > > >> > >> >  	 */
> >> >> > > > >> > >> > -	if (!bq->dev_rx)
> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> > > > >> > >> 
> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> >> >> > > > >> > >
> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> >> >> > > > >> > > __dev_flush():
> >> >> > > > >> > >
> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
> >> >> > > > >>
> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> > 
> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> >> >> > in temporarily in the "bq" structure that is only valid for this
> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> >> >> > the the xdp_prog here, more below (and Q for Paul).
> >> >> > 
> >> >> > > > >> > 
> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> > > > >> > performance :)    
> >> >> > > > >>
> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> > > > >> in i40e_run_xdp() and it is fine.
> >> >> > > > >> 
> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> >> > > > >>
> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> >> >> > > > >
> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> >> > > > > are operating on.  
> >> >> > >
> >> >> > > What other rcu objects it is using during flush?
> >> >> > 
> >> >> > Look at code:
> >> >> >  kernel/bpf/cpumap.c
> >> >> >  kernel/bpf/devmap.c
> >> >> > 
> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> >> >> > function is __dev_map_entry_free().
> >> >> > 
> >> >> > 
> >> >> > > > > Perhaps it is a bug in i40e?  
> >> >> > >
> >> >> > > A quick look into ixgbe falls into the same bucket.
> >> >> > > didn't look at other drivers though.
> >> >> > 
> >> >> > Intel driver are very much in copy-paste mode.
> >> >> >  
> >> >> > > > >
> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
> >> >> > > > 
> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> >> > > > full invocations of the softirq handler, which for networking is
> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >> >> > >
> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
> >> >> > 
> >> >> > CC added Paul. (link to patch[1][2] for context)
> >> >> Updated Paul's email address.
> >> >> 
> >> >> > 
> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> >> >> > 
> >> >> > Well, in many cases, especially depending on how kernel is compiled,
> >> >> > that is true.  But we want to keep these, as they also document the
> >> >> > intend of the programmer.  And allow us to make the kernel even more
> >> >> > preempt-able in the future.
> >> >> > 
> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
> >> >> > > other rcu_read_lock() as-is.
> >> >> > 
> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
> >> >> > add rcu_read_lock() at least around the invocation of
> >> >> > bpf_prog_run_xdp() or before around if-statement that call
> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> >> >> > 
> >> >> > Thank you Martin for reviewing the code carefully enough to find this
> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
> >> >> > code path in their NAPI-loop.
> >> >> > 
> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
> >> >> > happens, but ref real-function names).
> >> >> > 
> >> >> > We are running in softirq/NAPI context, the driver will call a
> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> >> >> > a large RCU-read section, that include the flush operation).
> >> >> > 
> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> >> >> > xdp_frame packets) that will get flushed/send in the call
> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> >> >> > happen before we end our softirq/NAPI context.
> >> >> > 
> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> >> >> > operation (which we will wrap with RCU-read section), we will use this
> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> >> >> > this short interval.
> >> >> > 
> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> >> >> > sections, but I might be wrong? (especially in the future or for RT).
> >> >
> >> > If I am reading this correctly (ha!), a very high-level summary of the
> >> > code in question is something like this:
> >> >
> >> > 	void foo(void)
> >> > 	{
> >> > 		local_bh_disable();
> >> >
> >> > 		rcu_read_lock();
> >> > 		p = rcu_dereference(gp);
> >> > 		do_something_with(p);
> >> > 		rcu_read_unlock();
> >> >
> >> > 		do_something_else();
> >> >
> >> > 		rcu_read_lock();
> >> > 		do_some_other_thing(p);
> >> > 		rcu_read_unlock();
> >> >
> >> > 		local_bh_enable();
> >> > 	}
> >> >
> >> > 	void bar(struct blat *new_gp)
> >> > 	{
> >> > 		struct blat *old_gp;
> >> >
> >> > 		spin_lock(my_lock);
> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> >> > 		rcu_assign_pointer(gp, new_gp);
> >> > 		spin_unlock(my_lock);
> >> > 		synchronize_rcu();
> >> > 		kfree(old_gp);
> >> > 	}
> >> 
> >> Yeah, something like that (the object is freed using call_rcu() - but I
> >> think that's equivalent, right?). And the question is whether we need to
> >> extend foo() so that is has one big rcu_read_lock() that covers the
> >> whole lifetime of p.
> >
> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
> 
> Right, gotcha!
> 
> >> > I need to check up on -rt.
> >> >
> >> > But first... In recent mainline kernels, the local_bh_disable() region
> >> > will look like one big RCU read-side critical section.  But don't try
> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
> >> 
> >> OK. Variants of this code has been around since before then, but I
> >> honestly have no idea what it looked like back then exactly...
> >
> > I know that feeling...
> >
> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
> >> 
> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
> >> rid of the inner ones. What about tools like lockdep; do they understand
> >> this, or are we likely to get complaints if we remove it?
> >
> > If you just got rid of the first rcu_read_unlock() and the second
> > rcu_read_lock() in the code above, lockdep will understand.
> 
> Right, but doing so entails going through all the drivers, which is what
> we're trying to avoid :)

I was afraid of that...  ;-)

> > However, if you instead get rid of -all- of the rcu_read_lock() and
> > rcu_read_unlock() invocations in the code above, you would need to let
> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
> >
> > 	p = rcu_dereference(gp);
> >
> > You would do this:
> >
> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
> >
> > This would be needed for mainline, regardless of -rt.
> 
> OK. And as far as I can tell this is harmless for code paths that call
> the same function but from a regular rcu_read_lock()-protected section
> instead from a bh-disabled section, right?

That is correct.  That rcu_dereference_check() invocation will make
lockdep be OK with rcu_read_lock() or with softirq being disabled.
Or both, for that matter.

> What happens, BTW, if we *don't* get rid of all the existing
> rcu_read_lock() sections? Going back to your foo() example above, what
> we're discussing is whether to add that second rcu_read_lock() around
> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
> is already there (in the particular driver we're discussing), and the
> local_bh_disable/enable() pair is already there. AFAICT from our
> discussion, there really is not much point in adding that second
> rcu_read_lock/unlock(), is there?

From an algorithmic point of view, the second rcu_read_lock()
and rcu_read_unlock() are redundant.  Of course, there are also
software-engineering considerations, including copy-pasta issues.

> And because that first rcu_read_lock() around the rcu_dereference() is
> already there, lockdep is not likely to complain either, so we're
> basically fine? Except that the code is somewhat confusing as-is, of
> course; i.e., we should probably fix it but it's not terribly urgent. Or?

I am concerned about copy-pasta-induced bugs.  Someone looks just at
the code, fails to note the fact that softirq is disabled throughout,
and decides that leaking a pointer from one RCU read-side critical
section to a later one is just fine.  :-/

> Hmm, looking at it now, it seems not all the lookup code is actually
> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
> a comment above it saying that RCU ensures objects won't disappear[0];
> so I suppose we're at least safe from lockdep in that sense :P - but we
> should definitely clean this up.
> 
> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391

That use of READ_ONCE() will definitely avoid lockdep complaints,
including those complaints that point out bugs.  It also might get you
sparse complaints if the RCU-protected pointer is marked with __rcu.

> >> > Especially given that if this works on -rt, it is probably because
> >> > their variant of do_softirq() holds rcu_read_lock() across each
> >> > softirq handler invocation. They do something similar for rwlocks.
> >> 
> >> Right. Guess we'll wait for your confirmation of that, then. Thanks! :)
> >
> > Looking at v5.11.4-rt11...
> >
> > And __local_bh_disable_ip() has added the required rcu_read_lock(),
> > so dropping all the rcu_read_lock() and rcu_read_unlock() calls would
> > do the right thing in -rt.  And lockdep would understand without the
> > rcu_read_lock_bh_held(), but that is still required for mainline.
> 
> Great, thanks for checking!
> 
> So this brings to mind another question: Are there any performance
> implications to nesting rcu_read_locks() inside each other? One
> thing that would be fairly easy to do (in terms of how much code we have
> to touch) is to just add a top-level rcu_read_lock() around the
> napi_poll() call in the core dev code, thus making -rt and mainline
> equivalent in that respect. Also, this would make it obvious that all
> the RCU usage inside of NAPI is safe, without having to know about
> bh_disable() and all that. But we obviously don't want to do that if it
> is going to slow things down; WDYT?

Both rcu_read_lock() and rcu_read_unlock() are quite lightweight (zero for
CONFIG_PREEMPT=n and about two nanoseconds per pair for CONFIG_PREEMPT=y
on 2GHz x86) and can be nested quite deeply.  So that approach should
be fine from that viewpoint.

However, remaining in a single RCU read-side critical section forever
will eventually OOM the system, so the code should periodically exit
its top-level RCU read-side critical section, say, every few tens of
milliseconds.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-19 18:32                             ` Paul E. McKenney
@ 2021-04-19 21:21                               ` Toke Høiland-Jørgensen
  2021-04-19 21:41                                 ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-19 21:21 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
>> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> > 
>> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >> >> > > >   
>> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> > > > >  
>> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> >> > > > >> >     
>> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> >> > > > >> > >> >  {
>> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> >> >> > > > >> > >> > -	int sent = 0, err = 0;
>> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
>> >> >> > > > >> > >> > +	int to_send = cnt;
>> >> >> > > > >> > >> >  	int i;
>> >> >> > > > >> > >> >  
>> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
>> >> >> > > > >> > >> > +	if (unlikely(!cnt))
>> >> >> > > > >> > >> >  		return;
>> >> >> > > > >> > >> >  
>> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> >> > > > >> > >> >  
>> >> >> > > > >> > >> >  		prefetch(xdpf);
>> >> >> > > > >> > >> >  	}
>> >> >> > > > >> > >> >  
>> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
>> >> >> > > > >> > >> bq->xdp_prog is used here
>> >> >> > > > >> > >>     
>> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> >> > > > >> > >> > +		if (!to_send)
>> >> >> > > > >> > >> > +			goto out;
>> >> >> > > > >> > >> > +
>> >> >> > > > >> > >> > +		drops = cnt - to_send;
>> >> >> > > > >> > >> > +	}
>> >> >> > > > >> > >> > +    
>> >> >> > > > >> > >> 
>> >> >> > > > >> > >> [ ... ]
>> >> >> > > > >> > >>     
>> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
>> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> >> > > > >> > >> >  {
>> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> >> >> > > > >> > >> > +	 *
>> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> >> > > > >> > >> > +	 * are only ever modified together.
>> >> >> > > > >> > >> >  	 */
>> >> >> > > > >> > >> > -	if (!bq->dev_rx)
>> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
>> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> >> > > > >> > >> 
>> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> >> >> > > > >> > >
>> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> >> > > > >> > > __dev_flush():
>> >> >> > > > >> > >
>> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> >> >> > > > >>
>> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> >> > 
>> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> >> >> > in temporarily in the "bq" structure that is only valid for this
>> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> >> >> > the the xdp_prog here, more below (and Q for Paul).
>> >> >> > 
>> >> >> > > > >> > 
>> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> >> > > > >> > performance :)    
>> >> >> > > > >>
>> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> >> > > > >> in i40e_run_xdp() and it is fine.
>> >> >> > > > >> 
>> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >> >> > > > >>
>> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> >> >> > > > >
>> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> >> >> > > > > are operating on.  
>> >> >> > >
>> >> >> > > What other rcu objects it is using during flush?
>> >> >> > 
>> >> >> > Look at code:
>> >> >> >  kernel/bpf/cpumap.c
>> >> >> >  kernel/bpf/devmap.c
>> >> >> > 
>> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
>> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> >> >> > function is __dev_map_entry_free().
>> >> >> > 
>> >> >> > 
>> >> >> > > > > Perhaps it is a bug in i40e?  
>> >> >> > >
>> >> >> > > A quick look into ixgbe falls into the same bucket.
>> >> >> > > didn't look at other drivers though.
>> >> >> > 
>> >> >> > Intel driver are very much in copy-paste mode.
>> >> >> >  
>> >> >> > > > >
>> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
>> >> >> > > > 
>> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> >> >> > > > full invocations of the softirq handler, which for networking is
>> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> >> >> > >
>> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
>> >> >> > 
>> >> >> > CC added Paul. (link to patch[1][2] for context)
>> >> >> Updated Paul's email address.
>> >> >> 
>> >> >> > 
>> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> >> >> > 
>> >> >> > Well, in many cases, especially depending on how kernel is compiled,
>> >> >> > that is true.  But we want to keep these, as they also document the
>> >> >> > intend of the programmer.  And allow us to make the kernel even more
>> >> >> > preempt-able in the future.
>> >> >> > 
>> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
>> >> >> > > other rcu_read_lock() as-is.
>> >> >> > 
>> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
>> >> >> > add rcu_read_lock() at least around the invocation of
>> >> >> > bpf_prog_run_xdp() or before around if-statement that call
>> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> >> >> > 
>> >> >> > Thank you Martin for reviewing the code carefully enough to find this
>> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
>> >> >> > code path in their NAPI-loop.
>> >> >> > 
>> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
>> >> >> > happens, but ref real-function names).
>> >> >> > 
>> >> >> > We are running in softirq/NAPI context, the driver will call a
>> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> >> >> > a large RCU-read section, that include the flush operation).
>> >> >> > 
>> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> >> >> > xdp_frame packets) that will get flushed/send in the call
>> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> >> >> > happen before we end our softirq/NAPI context.
>> >> >> > 
>> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
>> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> >> >> > operation (which we will wrap with RCU-read section), we will use this
>> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
>> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> >> >> > this short interval.
>> >> >> > 
>> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> >> >> > sections, but I might be wrong? (especially in the future or for RT).
>> >> >
>> >> > If I am reading this correctly (ha!), a very high-level summary of the
>> >> > code in question is something like this:
>> >> >
>> >> > 	void foo(void)
>> >> > 	{
>> >> > 		local_bh_disable();
>> >> >
>> >> > 		rcu_read_lock();
>> >> > 		p = rcu_dereference(gp);
>> >> > 		do_something_with(p);
>> >> > 		rcu_read_unlock();
>> >> >
>> >> > 		do_something_else();
>> >> >
>> >> > 		rcu_read_lock();
>> >> > 		do_some_other_thing(p);
>> >> > 		rcu_read_unlock();
>> >> >
>> >> > 		local_bh_enable();
>> >> > 	}
>> >> >
>> >> > 	void bar(struct blat *new_gp)
>> >> > 	{
>> >> > 		struct blat *old_gp;
>> >> >
>> >> > 		spin_lock(my_lock);
>> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
>> >> > 		rcu_assign_pointer(gp, new_gp);
>> >> > 		spin_unlock(my_lock);
>> >> > 		synchronize_rcu();
>> >> > 		kfree(old_gp);
>> >> > 	}
>> >> 
>> >> Yeah, something like that (the object is freed using call_rcu() - but I
>> >> think that's equivalent, right?). And the question is whether we need to
>> >> extend foo() so that is has one big rcu_read_lock() that covers the
>> >> whole lifetime of p.
>> >
>> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
>> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
>> 
>> Right, gotcha!
>> 
>> >> > I need to check up on -rt.
>> >> >
>> >> > But first... In recent mainline kernels, the local_bh_disable() region
>> >> > will look like one big RCU read-side critical section.  But don't try
>> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
>> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
>> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
>> >> 
>> >> OK. Variants of this code has been around since before then, but I
>> >> honestly have no idea what it looked like back then exactly...
>> >
>> > I know that feeling...
>> >
>> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
>> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
>> >> 
>> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
>> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
>> >> rid of the inner ones. What about tools like lockdep; do they understand
>> >> this, or are we likely to get complaints if we remove it?
>> >
>> > If you just got rid of the first rcu_read_unlock() and the second
>> > rcu_read_lock() in the code above, lockdep will understand.
>> 
>> Right, but doing so entails going through all the drivers, which is what
>> we're trying to avoid :)
>
> I was afraid of that...  ;-)
>
>> > However, if you instead get rid of -all- of the rcu_read_lock() and
>> > rcu_read_unlock() invocations in the code above, you would need to let
>> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
>> >
>> > 	p = rcu_dereference(gp);
>> >
>> > You would do this:
>> >
>> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
>> >
>> > This would be needed for mainline, regardless of -rt.
>> 
>> OK. And as far as I can tell this is harmless for code paths that call
>> the same function but from a regular rcu_read_lock()-protected section
>> instead from a bh-disabled section, right?
>
> That is correct.  That rcu_dereference_check() invocation will make
> lockdep be OK with rcu_read_lock() or with softirq being disabled.
> Or both, for that matter.

OK, great, thank you for confirming my understanding!

>> What happens, BTW, if we *don't* get rid of all the existing
>> rcu_read_lock() sections? Going back to your foo() example above, what
>> we're discussing is whether to add that second rcu_read_lock() around
>> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
>> is already there (in the particular driver we're discussing), and the
>> local_bh_disable/enable() pair is already there. AFAICT from our
>> discussion, there really is not much point in adding that second
>> rcu_read_lock/unlock(), is there?
>
> From an algorithmic point of view, the second rcu_read_lock()
> and rcu_read_unlock() are redundant.  Of course, there are also
> software-engineering considerations, including copy-pasta issues.
>
>> And because that first rcu_read_lock() around the rcu_dereference() is
>> already there, lockdep is not likely to complain either, so we're
>> basically fine? Except that the code is somewhat confusing as-is, of
>> course; i.e., we should probably fix it but it's not terribly urgent. Or?
>
> I am concerned about copy-pasta-induced bugs.  Someone looks just at
> the code, fails to note the fact that softirq is disabled throughout,
> and decides that leaking a pointer from one RCU read-side critical
> section to a later one is just fine.  :-/

Yup, totally agreed that we need to fix this for the sake of the humans
reading the code; just wanted to make sure my understanding was correct
that we don't strictly need to do anything as far as the machines
executing it are concerned :)

>> Hmm, looking at it now, it seems not all the lookup code is actually
>> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
>> a comment above it saying that RCU ensures objects won't disappear[0];
>> so I suppose we're at least safe from lockdep in that sense :P - but we
>> should definitely clean this up.
>> 
>> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
>
> That use of READ_ONCE() will definitely avoid lockdep complaints,
> including those complaints that point out bugs.  It also might get you
> sparse complaints if the RCU-protected pointer is marked with __rcu.

It's not; it's the netdev_map member of this struct:

struct bpf_dtab {
	struct bpf_map map;
	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
	struct list_head list;

	/* these are only used for DEVMAP_HASH type maps */
	struct hlist_head *dev_index_head;
	spinlock_t index_lock;
	unsigned int items;
	u32 n_buckets;
};

Will adding __rcu to such a dynamic array member do the right thing when
paired with rcu_dereference() on array members (i.e., in place of the
READ_ONCE in the code linked above)?

Also, while you're being so nice about confirming my understanding of
things: I always understood the point of rcu_dereference() (and __rcu on
struct members) to be annotations that document the lifetime
expectations of the object being pointed to, rather than a functional
change vs READ_ONCE()? Documentation that the static checkers can turn
into warnings, of course, but totally transparent in terms of the
generated code. Right? :)

>> >> > Especially given that if this works on -rt, it is probably because
>> >> > their variant of do_softirq() holds rcu_read_lock() across each
>> >> > softirq handler invocation. They do something similar for rwlocks.
>> >> 
>> >> Right. Guess we'll wait for your confirmation of that, then. Thanks! :)
>> >
>> > Looking at v5.11.4-rt11...
>> >
>> > And __local_bh_disable_ip() has added the required rcu_read_lock(),
>> > so dropping all the rcu_read_lock() and rcu_read_unlock() calls would
>> > do the right thing in -rt.  And lockdep would understand without the
>> > rcu_read_lock_bh_held(), but that is still required for mainline.
>> 
>> Great, thanks for checking!
>> 
>> So this brings to mind another question: Are there any performance
>> implications to nesting rcu_read_locks() inside each other? One
>> thing that would be fairly easy to do (in terms of how much code we have
>> to touch) is to just add a top-level rcu_read_lock() around the
>> napi_poll() call in the core dev code, thus making -rt and mainline
>> equivalent in that respect. Also, this would make it obvious that all
>> the RCU usage inside of NAPI is safe, without having to know about
>> bh_disable() and all that. But we obviously don't want to do that if it
>> is going to slow things down; WDYT?
>
> Both rcu_read_lock() and rcu_read_unlock() are quite lightweight (zero for
> CONFIG_PREEMPT=n and about two nanoseconds per pair for CONFIG_PREEMPT=y
> on 2GHz x86) and can be nested quite deeply.  So that approach should
> be fine from that viewpoint.

OK, that may be fine, then. Guess I'll try it and benchmark (and compare
with the rcu_dereference_check() approach).

> However, remaining in a single RCU read-side critical section forever
> will eventually OOM the system, so the code should periodically exit
> its top-level RCU read-side critical section, say, every few tens of
> milliseconds.

Yup, NAPI already does this (there's a poll budget), so that should be
fine.

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-19 21:21                               ` Toke Høiland-Jørgensen
@ 2021-04-19 21:41                                 ` Paul E. McKenney
  2021-04-19 22:16                                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-19 21:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> 
> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> > 
> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> >> >> > > >   
> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> > > > >  
> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> >> > > > >> >     
> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> >> > > > >> > >> >  {
> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
> >> >> >> > > > >> > >> > +	int to_send = cnt;
> >> >> >> > > > >> > >> >  	int i;
> >> >> >> > > > >> > >> >  
> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
> >> >> >> > > > >> > >> >  		return;
> >> >> >> > > > >> > >> >  
> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> >> > > > >> > >> >  
> >> >> >> > > > >> > >> >  		prefetch(xdpf);
> >> >> >> > > > >> > >> >  	}
> >> >> >> > > > >> > >> >  
> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
> >> >> >> > > > >> > >> bq->xdp_prog is used here
> >> >> >> > > > >> > >>     
> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> >> > > > >> > >> > +		if (!to_send)
> >> >> >> > > > >> > >> > +			goto out;
> >> >> >> > > > >> > >> > +
> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
> >> >> >> > > > >> > >> > +	}
> >> >> >> > > > >> > >> > +    
> >> >> >> > > > >> > >> 
> >> >> >> > > > >> > >> [ ... ]
> >> >> >> > > > >> > >>     
> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> >> > > > >> > >> >  {
> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> >> > > > >> > >> > +	 *
> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> >> > > > >> > >> > +	 * are only ever modified together.
> >> >> >> > > > >> > >> >  	 */
> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> >> > > > >> > >> 
> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> >> >> >> > > > >> > >
> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> >> >> >> > > > >> > > __dev_flush():
> >> >> >> > > > >> > >
> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
> >> >> >> > > > >>
> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> >> > 
> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> >> >> >> > in temporarily in the "bq" structure that is only valid for this
> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
> >> >> >> > 
> >> >> >> > > > >> > 
> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> >> > > > >> > performance :)    
> >> >> >> > > > >>
> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
> >> >> >> > > > >> 
> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> >> >> > > > >>
> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> >> >> >> > > > >
> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> >> >> > > > > are operating on.  
> >> >> >> > >
> >> >> >> > > What other rcu objects it is using during flush?
> >> >> >> > 
> >> >> >> > Look at code:
> >> >> >> >  kernel/bpf/cpumap.c
> >> >> >> >  kernel/bpf/devmap.c
> >> >> >> > 
> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> >> >> >> > function is __dev_map_entry_free().
> >> >> >> > 
> >> >> >> > 
> >> >> >> > > > > Perhaps it is a bug in i40e?  
> >> >> >> > >
> >> >> >> > > A quick look into ixgbe falls into the same bucket.
> >> >> >> > > didn't look at other drivers though.
> >> >> >> > 
> >> >> >> > Intel driver are very much in copy-paste mode.
> >> >> >> >  
> >> >> >> > > > >
> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
> >> >> >> > > > 
> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> >> >> > > > full invocations of the softirq handler, which for networking is
> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >> >> >> > >
> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
> >> >> >> > 
> >> >> >> > CC added Paul. (link to patch[1][2] for context)
> >> >> >> Updated Paul's email address.
> >> >> >> 
> >> >> >> > 
> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> >> >> >> > 
> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
> >> >> >> > that is true.  But we want to keep these, as they also document the
> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
> >> >> >> > preempt-able in the future.
> >> >> >> > 
> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
> >> >> >> > > other rcu_read_lock() as-is.
> >> >> >> > 
> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
> >> >> >> > add rcu_read_lock() at least around the invocation of
> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> >> >> >> > 
> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
> >> >> >> > code path in their NAPI-loop.
> >> >> >> > 
> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
> >> >> >> > happens, but ref real-function names).
> >> >> >> > 
> >> >> >> > We are running in softirq/NAPI context, the driver will call a
> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> >> >> >> > a large RCU-read section, that include the flush operation).
> >> >> >> > 
> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> >> >> >> > xdp_frame packets) that will get flushed/send in the call
> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> >> >> >> > happen before we end our softirq/NAPI context.
> >> >> >> > 
> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> >> >> >> > this short interval.
> >> >> >> > 
> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
> >> >> >
> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
> >> >> > code in question is something like this:
> >> >> >
> >> >> > 	void foo(void)
> >> >> > 	{
> >> >> > 		local_bh_disable();
> >> >> >
> >> >> > 		rcu_read_lock();
> >> >> > 		p = rcu_dereference(gp);
> >> >> > 		do_something_with(p);
> >> >> > 		rcu_read_unlock();
> >> >> >
> >> >> > 		do_something_else();
> >> >> >
> >> >> > 		rcu_read_lock();
> >> >> > 		do_some_other_thing(p);
> >> >> > 		rcu_read_unlock();
> >> >> >
> >> >> > 		local_bh_enable();
> >> >> > 	}
> >> >> >
> >> >> > 	void bar(struct blat *new_gp)
> >> >> > 	{
> >> >> > 		struct blat *old_gp;
> >> >> >
> >> >> > 		spin_lock(my_lock);
> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> >> >> > 		rcu_assign_pointer(gp, new_gp);
> >> >> > 		spin_unlock(my_lock);
> >> >> > 		synchronize_rcu();
> >> >> > 		kfree(old_gp);
> >> >> > 	}
> >> >> 
> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
> >> >> think that's equivalent, right?). And the question is whether we need to
> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
> >> >> whole lifetime of p.
> >> >
> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
> >> 
> >> Right, gotcha!
> >> 
> >> >> > I need to check up on -rt.
> >> >> >
> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
> >> >> > will look like one big RCU read-side critical section.  But don't try
> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
> >> >> 
> >> >> OK. Variants of this code has been around since before then, but I
> >> >> honestly have no idea what it looked like back then exactly...
> >> >
> >> > I know that feeling...
> >> >
> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
> >> >> 
> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
> >> >> rid of the inner ones. What about tools like lockdep; do they understand
> >> >> this, or are we likely to get complaints if we remove it?
> >> >
> >> > If you just got rid of the first rcu_read_unlock() and the second
> >> > rcu_read_lock() in the code above, lockdep will understand.
> >> 
> >> Right, but doing so entails going through all the drivers, which is what
> >> we're trying to avoid :)
> >
> > I was afraid of that...  ;-)
> >
> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
> >> > rcu_read_unlock() invocations in the code above, you would need to let
> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
> >> >
> >> > 	p = rcu_dereference(gp);
> >> >
> >> > You would do this:
> >> >
> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
> >> >
> >> > This would be needed for mainline, regardless of -rt.
> >> 
> >> OK. And as far as I can tell this is harmless for code paths that call
> >> the same function but from a regular rcu_read_lock()-protected section
> >> instead from a bh-disabled section, right?
> >
> > That is correct.  That rcu_dereference_check() invocation will make
> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
> > Or both, for that matter.
> 
> OK, great, thank you for confirming my understanding!
> 
> >> What happens, BTW, if we *don't* get rid of all the existing
> >> rcu_read_lock() sections? Going back to your foo() example above, what
> >> we're discussing is whether to add that second rcu_read_lock() around
> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
> >> is already there (in the particular driver we're discussing), and the
> >> local_bh_disable/enable() pair is already there. AFAICT from our
> >> discussion, there really is not much point in adding that second
> >> rcu_read_lock/unlock(), is there?
> >
> > From an algorithmic point of view, the second rcu_read_lock()
> > and rcu_read_unlock() are redundant.  Of course, there are also
> > software-engineering considerations, including copy-pasta issues.
> >
> >> And because that first rcu_read_lock() around the rcu_dereference() is
> >> already there, lockdep is not likely to complain either, so we're
> >> basically fine? Except that the code is somewhat confusing as-is, of
> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
> >
> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
> > the code, fails to note the fact that softirq is disabled throughout,
> > and decides that leaking a pointer from one RCU read-side critical
> > section to a later one is just fine.  :-/
> 
> Yup, totally agreed that we need to fix this for the sake of the humans
> reading the code; just wanted to make sure my understanding was correct
> that we don't strictly need to do anything as far as the machines
> executing it are concerned :)
> 
> >> Hmm, looking at it now, it seems not all the lookup code is actually
> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
> >> a comment above it saying that RCU ensures objects won't disappear[0];
> >> so I suppose we're at least safe from lockdep in that sense :P - but we
> >> should definitely clean this up.
> >> 
> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
> >
> > That use of READ_ONCE() will definitely avoid lockdep complaints,
> > including those complaints that point out bugs.  It also might get you
> > sparse complaints if the RCU-protected pointer is marked with __rcu.
> 
> It's not; it's the netdev_map member of this struct:
> 
> struct bpf_dtab {
> 	struct bpf_map map;
> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
> 	struct list_head list;
> 
> 	/* these are only used for DEVMAP_HASH type maps */
> 	struct hlist_head *dev_index_head;
> 	spinlock_t index_lock;
> 	unsigned int items;
> 	u32 n_buckets;
> };
> 
> Will adding __rcu to such a dynamic array member do the right thing when
> paired with rcu_dereference() on array members (i.e., in place of the
> READ_ONCE in the code linked above)?

The only thing __rcu will do is provide information to the sparse static
analysis tool.  Which will then gripe at you for applying READ_ONCE()
to a __rcu pointer.  But it is already griping at you for applying
rcu_dereference() to something not marked __rcu, so...  ;-)

> Also, while you're being so nice about confirming my understanding of
> things: I always understood the point of rcu_dereference() (and __rcu on
> struct members) to be annotations that document the lifetime
> expectations of the object being pointed to, rather than a functional
> change vs READ_ONCE()? Documentation that the static checkers can turn
> into warnings, of course, but totally transparent in terms of the
> generated code. Right? :)

Yes for __rcu.

Maybe for rcu_dereference().  Yes in that it is functionally the same
as READ_ONCE(), no in that it is not the same as a simple C-language load.

> >> >> > Especially given that if this works on -rt, it is probably because
> >> >> > their variant of do_softirq() holds rcu_read_lock() across each
> >> >> > softirq handler invocation. They do something similar for rwlocks.
> >> >> 
> >> >> Right. Guess we'll wait for your confirmation of that, then. Thanks! :)
> >> >
> >> > Looking at v5.11.4-rt11...
> >> >
> >> > And __local_bh_disable_ip() has added the required rcu_read_lock(),
> >> > so dropping all the rcu_read_lock() and rcu_read_unlock() calls would
> >> > do the right thing in -rt.  And lockdep would understand without the
> >> > rcu_read_lock_bh_held(), but that is still required for mainline.
> >> 
> >> Great, thanks for checking!
> >> 
> >> So this brings to mind another question: Are there any performance
> >> implications to nesting rcu_read_locks() inside each other? One
> >> thing that would be fairly easy to do (in terms of how much code we have
> >> to touch) is to just add a top-level rcu_read_lock() around the
> >> napi_poll() call in the core dev code, thus making -rt and mainline
> >> equivalent in that respect. Also, this would make it obvious that all
> >> the RCU usage inside of NAPI is safe, without having to know about
> >> bh_disable() and all that. But we obviously don't want to do that if it
> >> is going to slow things down; WDYT?
> >
> > Both rcu_read_lock() and rcu_read_unlock() are quite lightweight (zero for
> > CONFIG_PREEMPT=n and about two nanoseconds per pair for CONFIG_PREEMPT=y
> > on 2GHz x86) and can be nested quite deeply.  So that approach should
> > be fine from that viewpoint.
> 
> OK, that may be fine, then. Guess I'll try it and benchmark (and compare
> with the rcu_dereference_check() approach).

Sounds good!

> > However, remaining in a single RCU read-side critical section forever
> > will eventually OOM the system, so the code should periodically exit
> > its top-level RCU read-side critical section, say, every few tens of
> > milliseconds.
> 
> Yup, NAPI already does this (there's a poll budget), so that should be
> fine.

Whew!!!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-19 21:41                                 ` Paul E. McKenney
@ 2021-04-19 22:16                                   ` Toke Høiland-Jørgensen
  2021-04-19 22:31                                     ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-19 22:16 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> 
>> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
>> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> > 
>> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >> >> >> > > >   
>> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> > > > >  
>> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> >> >> > > > >> >     
>> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> >> >> > > > >> > >> >  {
>> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
>> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
>> >> >> >> > > > >> > >> > +	int to_send = cnt;
>> >> >> >> > > > >> > >> >  	int i;
>> >> >> >> > > > >> > >> >  
>> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
>> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
>> >> >> >> > > > >> > >> >  		return;
>> >> >> >> > > > >> > >> >  
>> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> >> >> > > > >> > >> >  
>> >> >> >> > > > >> > >> >  		prefetch(xdpf);
>> >> >> >> > > > >> > >> >  	}
>> >> >> >> > > > >> > >> >  
>> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
>> >> >> >> > > > >> > >> bq->xdp_prog is used here
>> >> >> >> > > > >> > >>     
>> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> >> >> > > > >> > >> > +		if (!to_send)
>> >> >> >> > > > >> > >> > +			goto out;
>> >> >> >> > > > >> > >> > +
>> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
>> >> >> >> > > > >> > >> > +	}
>> >> >> >> > > > >> > >> > +    
>> >> >> >> > > > >> > >> 
>> >> >> >> > > > >> > >> [ ... ]
>> >> >> >> > > > >> > >>     
>> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
>> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> >> >> > > > >> > >> >  {
>> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> >> >> >> > > > >> > >> > +	 *
>> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> >> >> > > > >> > >> > +	 * are only ever modified together.
>> >> >> >> > > > >> > >> >  	 */
>> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
>> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
>> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> >> >> > > > >> > >> 
>> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> >> >> >> > > > >> > >
>> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> >> >> > > > >> > > __dev_flush():
>> >> >> >> > > > >> > >
>> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> >> >> >> > > > >>
>> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> >> >> > 
>> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> >> >> >> > in temporarily in the "bq" structure that is only valid for this
>> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
>> >> >> >> > 
>> >> >> >> > > > >> > 
>> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> >> >> > > > >> > performance :)    
>> >> >> >> > > > >>
>> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
>> >> >> >> > > > >> 
>> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >> >> >> > > > >>
>> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> >> >> >> > > > >
>> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> >> >> >> > > > > are operating on.  
>> >> >> >> > >
>> >> >> >> > > What other rcu objects it is using during flush?
>> >> >> >> > 
>> >> >> >> > Look at code:
>> >> >> >> >  kernel/bpf/cpumap.c
>> >> >> >> >  kernel/bpf/devmap.c
>> >> >> >> > 
>> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
>> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> >> >> >> > function is __dev_map_entry_free().
>> >> >> >> > 
>> >> >> >> > 
>> >> >> >> > > > > Perhaps it is a bug in i40e?  
>> >> >> >> > >
>> >> >> >> > > A quick look into ixgbe falls into the same bucket.
>> >> >> >> > > didn't look at other drivers though.
>> >> >> >> > 
>> >> >> >> > Intel driver are very much in copy-paste mode.
>> >> >> >> >  
>> >> >> >> > > > >
>> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
>> >> >> >> > > > 
>> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> >> >> >> > > > full invocations of the softirq handler, which for networking is
>> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> >> >> >> > >
>> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
>> >> >> >> > 
>> >> >> >> > CC added Paul. (link to patch[1][2] for context)
>> >> >> >> Updated Paul's email address.
>> >> >> >> 
>> >> >> >> > 
>> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> >> >> >> > 
>> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
>> >> >> >> > that is true.  But we want to keep these, as they also document the
>> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
>> >> >> >> > preempt-able in the future.
>> >> >> >> > 
>> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
>> >> >> >> > > other rcu_read_lock() as-is.
>> >> >> >> > 
>> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
>> >> >> >> > add rcu_read_lock() at least around the invocation of
>> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
>> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> >> >> >> > 
>> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
>> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
>> >> >> >> > code path in their NAPI-loop.
>> >> >> >> > 
>> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
>> >> >> >> > happens, but ref real-function names).
>> >> >> >> > 
>> >> >> >> > We are running in softirq/NAPI context, the driver will call a
>> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> >> >> >> > a large RCU-read section, that include the flush operation).
>> >> >> >> > 
>> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> >> >> >> > xdp_frame packets) that will get flushed/send in the call
>> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> >> >> >> > happen before we end our softirq/NAPI context.
>> >> >> >> > 
>> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
>> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
>> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
>> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> >> >> >> > this short interval.
>> >> >> >> > 
>> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
>> >> >> >
>> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
>> >> >> > code in question is something like this:
>> >> >> >
>> >> >> > 	void foo(void)
>> >> >> > 	{
>> >> >> > 		local_bh_disable();
>> >> >> >
>> >> >> > 		rcu_read_lock();
>> >> >> > 		p = rcu_dereference(gp);
>> >> >> > 		do_something_with(p);
>> >> >> > 		rcu_read_unlock();
>> >> >> >
>> >> >> > 		do_something_else();
>> >> >> >
>> >> >> > 		rcu_read_lock();
>> >> >> > 		do_some_other_thing(p);
>> >> >> > 		rcu_read_unlock();
>> >> >> >
>> >> >> > 		local_bh_enable();
>> >> >> > 	}
>> >> >> >
>> >> >> > 	void bar(struct blat *new_gp)
>> >> >> > 	{
>> >> >> > 		struct blat *old_gp;
>> >> >> >
>> >> >> > 		spin_lock(my_lock);
>> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
>> >> >> > 		rcu_assign_pointer(gp, new_gp);
>> >> >> > 		spin_unlock(my_lock);
>> >> >> > 		synchronize_rcu();
>> >> >> > 		kfree(old_gp);
>> >> >> > 	}
>> >> >> 
>> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
>> >> >> think that's equivalent, right?). And the question is whether we need to
>> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
>> >> >> whole lifetime of p.
>> >> >
>> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
>> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
>> >> 
>> >> Right, gotcha!
>> >> 
>> >> >> > I need to check up on -rt.
>> >> >> >
>> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
>> >> >> > will look like one big RCU read-side critical section.  But don't try
>> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
>> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
>> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
>> >> >> 
>> >> >> OK. Variants of this code has been around since before then, but I
>> >> >> honestly have no idea what it looked like back then exactly...
>> >> >
>> >> > I know that feeling...
>> >> >
>> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
>> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
>> >> >> 
>> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
>> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
>> >> >> rid of the inner ones. What about tools like lockdep; do they understand
>> >> >> this, or are we likely to get complaints if we remove it?
>> >> >
>> >> > If you just got rid of the first rcu_read_unlock() and the second
>> >> > rcu_read_lock() in the code above, lockdep will understand.
>> >> 
>> >> Right, but doing so entails going through all the drivers, which is what
>> >> we're trying to avoid :)
>> >
>> > I was afraid of that...  ;-)
>> >
>> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
>> >> > rcu_read_unlock() invocations in the code above, you would need to let
>> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
>> >> >
>> >> > 	p = rcu_dereference(gp);
>> >> >
>> >> > You would do this:
>> >> >
>> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
>> >> >
>> >> > This would be needed for mainline, regardless of -rt.
>> >> 
>> >> OK. And as far as I can tell this is harmless for code paths that call
>> >> the same function but from a regular rcu_read_lock()-protected section
>> >> instead from a bh-disabled section, right?
>> >
>> > That is correct.  That rcu_dereference_check() invocation will make
>> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
>> > Or both, for that matter.
>> 
>> OK, great, thank you for confirming my understanding!
>> 
>> >> What happens, BTW, if we *don't* get rid of all the existing
>> >> rcu_read_lock() sections? Going back to your foo() example above, what
>> >> we're discussing is whether to add that second rcu_read_lock() around
>> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
>> >> is already there (in the particular driver we're discussing), and the
>> >> local_bh_disable/enable() pair is already there. AFAICT from our
>> >> discussion, there really is not much point in adding that second
>> >> rcu_read_lock/unlock(), is there?
>> >
>> > From an algorithmic point of view, the second rcu_read_lock()
>> > and rcu_read_unlock() are redundant.  Of course, there are also
>> > software-engineering considerations, including copy-pasta issues.
>> >
>> >> And because that first rcu_read_lock() around the rcu_dereference() is
>> >> already there, lockdep is not likely to complain either, so we're
>> >> basically fine? Except that the code is somewhat confusing as-is, of
>> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
>> >
>> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
>> > the code, fails to note the fact that softirq is disabled throughout,
>> > and decides that leaking a pointer from one RCU read-side critical
>> > section to a later one is just fine.  :-/
>> 
>> Yup, totally agreed that we need to fix this for the sake of the humans
>> reading the code; just wanted to make sure my understanding was correct
>> that we don't strictly need to do anything as far as the machines
>> executing it are concerned :)
>> 
>> >> Hmm, looking at it now, it seems not all the lookup code is actually
>> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
>> >> a comment above it saying that RCU ensures objects won't disappear[0];
>> >> so I suppose we're at least safe from lockdep in that sense :P - but we
>> >> should definitely clean this up.
>> >> 
>> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
>> >
>> > That use of READ_ONCE() will definitely avoid lockdep complaints,
>> > including those complaints that point out bugs.  It also might get you
>> > sparse complaints if the RCU-protected pointer is marked with __rcu.
>> 
>> It's not; it's the netdev_map member of this struct:
>> 
>> struct bpf_dtab {
>> 	struct bpf_map map;
>> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
>> 	struct list_head list;
>> 
>> 	/* these are only used for DEVMAP_HASH type maps */
>> 	struct hlist_head *dev_index_head;
>> 	spinlock_t index_lock;
>> 	unsigned int items;
>> 	u32 n_buckets;
>> };
>> 
>> Will adding __rcu to such a dynamic array member do the right thing when
>> paired with rcu_dereference() on array members (i.e., in place of the
>> READ_ONCE in the code linked above)?
>
> The only thing __rcu will do is provide information to the sparse static
> analysis tool.  Which will then gripe at you for applying READ_ONCE()
> to a __rcu pointer.  But it is already griping at you for applying
> rcu_dereference() to something not marked __rcu, so...  ;-)

Right, hence the need for a cleanup ;)

My question was more if it understood arrays, though. I.e., that
'netdev_map' is an array of RCU pointers, not an RCU pointer to an
array... Or am I maybe thinking that tool is way smarter than it is, and
it just complains for any access to that field that doesn't use
rcu_dereference()?

>> Also, while you're being so nice about confirming my understanding of
>> things: I always understood the point of rcu_dereference() (and __rcu on
>> struct members) to be annotations that document the lifetime
>> expectations of the object being pointed to, rather than a functional
>> change vs READ_ONCE()? Documentation that the static checkers can turn
>> into warnings, of course, but totally transparent in terms of the
>> generated code. Right? :)
>
> Yes for __rcu.
>
> Maybe for rcu_dereference().  Yes in that it is functionally the same
> as READ_ONCE(), no in that it is not the same as a simple C-language load.

Right, was going for "functionally the same" - cool!

>> >> >> > Especially given that if this works on -rt, it is probably because
>> >> >> > their variant of do_softirq() holds rcu_read_lock() across each
>> >> >> > softirq handler invocation. They do something similar for rwlocks.
>> >> >> 
>> >> >> Right. Guess we'll wait for your confirmation of that, then. Thanks! :)
>> >> >
>> >> > Looking at v5.11.4-rt11...
>> >> >
>> >> > And __local_bh_disable_ip() has added the required rcu_read_lock(),
>> >> > so dropping all the rcu_read_lock() and rcu_read_unlock() calls would
>> >> > do the right thing in -rt.  And lockdep would understand without the
>> >> > rcu_read_lock_bh_held(), but that is still required for mainline.
>> >> 
>> >> Great, thanks for checking!
>> >> 
>> >> So this brings to mind another question: Are there any performance
>> >> implications to nesting rcu_read_locks() inside each other? One
>> >> thing that would be fairly easy to do (in terms of how much code we have
>> >> to touch) is to just add a top-level rcu_read_lock() around the
>> >> napi_poll() call in the core dev code, thus making -rt and mainline
>> >> equivalent in that respect. Also, this would make it obvious that all
>> >> the RCU usage inside of NAPI is safe, without having to know about
>> >> bh_disable() and all that. But we obviously don't want to do that if it
>> >> is going to slow things down; WDYT?
>> >
>> > Both rcu_read_lock() and rcu_read_unlock() are quite lightweight (zero for
>> > CONFIG_PREEMPT=n and about two nanoseconds per pair for CONFIG_PREEMPT=y
>> > on 2GHz x86) and can be nested quite deeply.  So that approach should
>> > be fine from that viewpoint.
>> 
>> OK, that may be fine, then. Guess I'll try it and benchmark (and compare
>> with the rcu_dereference_check() approach).
>
> Sounds good!

Awesome! Thanks a lot for explaining, and for bearing me and all my
stupid questions - I feel like I get closer to understanding RCU each
time I speak with you about it :)

>> > However, remaining in a single RCU read-side critical section forever
>> > will eventually OOM the system, so the code should periodically exit
>> > its top-level RCU read-side critical section, say, every few tens of
>> > milliseconds.
>> 
>> Yup, NAPI already does this (there's a poll budget), so that should be
>> fine.
>
> Whew!!!  ;-)

I know, right? ;) Although I do seem to recall you quite recently
helping me fix a case where it didn't quite interrupt itself enough, and
was causing hangs...

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-19 22:16                                   ` Toke Høiland-Jørgensen
@ 2021-04-19 22:31                                     ` Paul E. McKenney
  2021-04-21 14:24                                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-19 22:31 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> 
> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> 
> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> > 
> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> >> >> >> > > >   
> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> > > > >  
> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> >> >> > > > >> >     
> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
> >> >> >> >> > > > >> > >> >  	int i;
> >> >> >> >> > > > >> > >> >  
> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
> >> >> >> >> > > > >> > >> >  		return;
> >> >> >> >> > > > >> > >> >  
> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> >> >> > > > >> > >> >  
> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
> >> >> >> >> > > > >> > >> >  	}
> >> >> >> >> > > > >> > >> >  
> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
> >> >> >> >> > > > >> > >>     
> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> >> >> > > > >> > >> > +		if (!to_send)
> >> >> >> >> > > > >> > >> > +			goto out;
> >> >> >> >> > > > >> > >> > +
> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
> >> >> >> >> > > > >> > >> > +	}
> >> >> >> >> > > > >> > >> > +    
> >> >> >> >> > > > >> > >> 
> >> >> >> >> > > > >> > >> [ ... ]
> >> >> >> >> > > > >> > >>     
> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> >> >> > > > >> > >> > +	 *
> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
> >> >> >> >> > > > >> > >> >  	 */
> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> >> >> > > > >> > >> 
> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> >> >> >> >> > > > >> > >
> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> >> >> >> >> > > > >> > > __dev_flush():
> >> >> >> >> > > > >> > >
> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
> >> >> >> >> > > > >>
> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> >> >> > 
> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
> >> >> >> >> > 
> >> >> >> >> > > > >> > 
> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> >> >> > > > >> > performance :)    
> >> >> >> >> > > > >>
> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
> >> >> >> >> > > > >> 
> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> >> >> >> > > > >>
> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> >> >> >> >> > > > >
> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> >> >> >> > > > > are operating on.  
> >> >> >> >> > >
> >> >> >> >> > > What other rcu objects it is using during flush?
> >> >> >> >> > 
> >> >> >> >> > Look at code:
> >> >> >> >> >  kernel/bpf/cpumap.c
> >> >> >> >> >  kernel/bpf/devmap.c
> >> >> >> >> > 
> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> >> >> >> >> > function is __dev_map_entry_free().
> >> >> >> >> > 
> >> >> >> >> > 
> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
> >> >> >> >> > >
> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
> >> >> >> >> > > didn't look at other drivers though.
> >> >> >> >> > 
> >> >> >> >> > Intel driver are very much in copy-paste mode.
> >> >> >> >> >  
> >> >> >> >> > > > >
> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
> >> >> >> >> > > > 
> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >> >> >> >> > >
> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
> >> >> >> >> > 
> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
> >> >> >> >> Updated Paul's email address.
> >> >> >> >> 
> >> >> >> >> > 
> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> >> >> >> >> > 
> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
> >> >> >> >> > that is true.  But we want to keep these, as they also document the
> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
> >> >> >> >> > preempt-able in the future.
> >> >> >> >> > 
> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
> >> >> >> >> > > other rcu_read_lock() as-is.
> >> >> >> >> > 
> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
> >> >> >> >> > add rcu_read_lock() at least around the invocation of
> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> >> >> >> >> > 
> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
> >> >> >> >> > code path in their NAPI-loop.
> >> >> >> >> > 
> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
> >> >> >> >> > happens, but ref real-function names).
> >> >> >> >> > 
> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> >> >> >> >> > a large RCU-read section, that include the flush operation).
> >> >> >> >> > 
> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> >> >> >> >> > happen before we end our softirq/NAPI context.
> >> >> >> >> > 
> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> >> >> >> >> > this short interval.
> >> >> >> >> > 
> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
> >> >> >> >
> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
> >> >> >> > code in question is something like this:
> >> >> >> >
> >> >> >> > 	void foo(void)
> >> >> >> > 	{
> >> >> >> > 		local_bh_disable();
> >> >> >> >
> >> >> >> > 		rcu_read_lock();
> >> >> >> > 		p = rcu_dereference(gp);
> >> >> >> > 		do_something_with(p);
> >> >> >> > 		rcu_read_unlock();
> >> >> >> >
> >> >> >> > 		do_something_else();
> >> >> >> >
> >> >> >> > 		rcu_read_lock();
> >> >> >> > 		do_some_other_thing(p);
> >> >> >> > 		rcu_read_unlock();
> >> >> >> >
> >> >> >> > 		local_bh_enable();
> >> >> >> > 	}
> >> >> >> >
> >> >> >> > 	void bar(struct blat *new_gp)
> >> >> >> > 	{
> >> >> >> > 		struct blat *old_gp;
> >> >> >> >
> >> >> >> > 		spin_lock(my_lock);
> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
> >> >> >> > 		spin_unlock(my_lock);
> >> >> >> > 		synchronize_rcu();
> >> >> >> > 		kfree(old_gp);
> >> >> >> > 	}
> >> >> >> 
> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
> >> >> >> think that's equivalent, right?). And the question is whether we need to
> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
> >> >> >> whole lifetime of p.
> >> >> >
> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
> >> >> 
> >> >> Right, gotcha!
> >> >> 
> >> >> >> > I need to check up on -rt.
> >> >> >> >
> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
> >> >> >> > will look like one big RCU read-side critical section.  But don't try
> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
> >> >> >> 
> >> >> >> OK. Variants of this code has been around since before then, but I
> >> >> >> honestly have no idea what it looked like back then exactly...
> >> >> >
> >> >> > I know that feeling...
> >> >> >
> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
> >> >> >> 
> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
> >> >> >> this, or are we likely to get complaints if we remove it?
> >> >> >
> >> >> > If you just got rid of the first rcu_read_unlock() and the second
> >> >> > rcu_read_lock() in the code above, lockdep will understand.
> >> >> 
> >> >> Right, but doing so entails going through all the drivers, which is what
> >> >> we're trying to avoid :)
> >> >
> >> > I was afraid of that...  ;-)
> >> >
> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
> >> >> >
> >> >> > 	p = rcu_dereference(gp);
> >> >> >
> >> >> > You would do this:
> >> >> >
> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
> >> >> >
> >> >> > This would be needed for mainline, regardless of -rt.
> >> >> 
> >> >> OK. And as far as I can tell this is harmless for code paths that call
> >> >> the same function but from a regular rcu_read_lock()-protected section
> >> >> instead from a bh-disabled section, right?
> >> >
> >> > That is correct.  That rcu_dereference_check() invocation will make
> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
> >> > Or both, for that matter.
> >> 
> >> OK, great, thank you for confirming my understanding!
> >> 
> >> >> What happens, BTW, if we *don't* get rid of all the existing
> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
> >> >> we're discussing is whether to add that second rcu_read_lock() around
> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
> >> >> is already there (in the particular driver we're discussing), and the
> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
> >> >> discussion, there really is not much point in adding that second
> >> >> rcu_read_lock/unlock(), is there?
> >> >
> >> > From an algorithmic point of view, the second rcu_read_lock()
> >> > and rcu_read_unlock() are redundant.  Of course, there are also
> >> > software-engineering considerations, including copy-pasta issues.
> >> >
> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
> >> >> already there, lockdep is not likely to complain either, so we're
> >> >> basically fine? Except that the code is somewhat confusing as-is, of
> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
> >> >
> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
> >> > the code, fails to note the fact that softirq is disabled throughout,
> >> > and decides that leaking a pointer from one RCU read-side critical
> >> > section to a later one is just fine.  :-/
> >> 
> >> Yup, totally agreed that we need to fix this for the sake of the humans
> >> reading the code; just wanted to make sure my understanding was correct
> >> that we don't strictly need to do anything as far as the machines
> >> executing it are concerned :)
> >> 
> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
> >> >> should definitely clean this up.
> >> >> 
> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
> >> >
> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
> >> > including those complaints that point out bugs.  It also might get you
> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
> >> 
> >> It's not; it's the netdev_map member of this struct:
> >> 
> >> struct bpf_dtab {
> >> 	struct bpf_map map;
> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
> >> 	struct list_head list;
> >> 
> >> 	/* these are only used for DEVMAP_HASH type maps */
> >> 	struct hlist_head *dev_index_head;
> >> 	spinlock_t index_lock;
> >> 	unsigned int items;
> >> 	u32 n_buckets;
> >> };
> >> 
> >> Will adding __rcu to such a dynamic array member do the right thing when
> >> paired with rcu_dereference() on array members (i.e., in place of the
> >> READ_ONCE in the code linked above)?
> >
> > The only thing __rcu will do is provide information to the sparse static
> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
> > to a __rcu pointer.  But it is already griping at you for applying
> > rcu_dereference() to something not marked __rcu, so...  ;-)
> 
> Right, hence the need for a cleanup ;)
> 
> My question was more if it understood arrays, though. I.e., that
> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
> array... Or am I maybe thinking that tool is way smarter than it is, and
> it just complains for any access to that field that doesn't use
> rcu_dereference()?

I believe that sparse will know about the pointers being __rcu, but
not the array.  Unless you mark both levels.

> >> Also, while you're being so nice about confirming my understanding of
> >> things: I always understood the point of rcu_dereference() (and __rcu on
> >> struct members) to be annotations that document the lifetime
> >> expectations of the object being pointed to, rather than a functional
> >> change vs READ_ONCE()? Documentation that the static checkers can turn
> >> into warnings, of course, but totally transparent in terms of the
> >> generated code. Right? :)
> >
> > Yes for __rcu.
> >
> > Maybe for rcu_dereference().  Yes in that it is functionally the same
> > as READ_ONCE(), no in that it is not the same as a simple C-language load.
> 
> Right, was going for "functionally the same" - cool!
> 
> >> >> >> > Especially given that if this works on -rt, it is probably because
> >> >> >> > their variant of do_softirq() holds rcu_read_lock() across each
> >> >> >> > softirq handler invocation. They do something similar for rwlocks.
> >> >> >> 
> >> >> >> Right. Guess we'll wait for your confirmation of that, then. Thanks! :)
> >> >> >
> >> >> > Looking at v5.11.4-rt11...
> >> >> >
> >> >> > And __local_bh_disable_ip() has added the required rcu_read_lock(),
> >> >> > so dropping all the rcu_read_lock() and rcu_read_unlock() calls would
> >> >> > do the right thing in -rt.  And lockdep would understand without the
> >> >> > rcu_read_lock_bh_held(), but that is still required for mainline.
> >> >> 
> >> >> Great, thanks for checking!
> >> >> 
> >> >> So this brings to mind another question: Are there any performance
> >> >> implications to nesting rcu_read_locks() inside each other? One
> >> >> thing that would be fairly easy to do (in terms of how much code we have
> >> >> to touch) is to just add a top-level rcu_read_lock() around the
> >> >> napi_poll() call in the core dev code, thus making -rt and mainline
> >> >> equivalent in that respect. Also, this would make it obvious that all
> >> >> the RCU usage inside of NAPI is safe, without having to know about
> >> >> bh_disable() and all that. But we obviously don't want to do that if it
> >> >> is going to slow things down; WDYT?
> >> >
> >> > Both rcu_read_lock() and rcu_read_unlock() are quite lightweight (zero for
> >> > CONFIG_PREEMPT=n and about two nanoseconds per pair for CONFIG_PREEMPT=y
> >> > on 2GHz x86) and can be nested quite deeply.  So that approach should
> >> > be fine from that viewpoint.
> >> 
> >> OK, that may be fine, then. Guess I'll try it and benchmark (and compare
> >> with the rcu_dereference_check() approach).
> >
> > Sounds good!
> 
> Awesome! Thanks a lot for explaining, and for bearing me and all my
> stupid questions - I feel like I get closer to understanding RCU each
> time I speak with you about it :)

Glad it is producing a positive change.  ;-)

> >> > However, remaining in a single RCU read-side critical section forever
> >> > will eventually OOM the system, so the code should periodically exit
> >> > its top-level RCU read-side critical section, say, every few tens of
> >> > milliseconds.
> >> 
> >> Yup, NAPI already does this (there's a poll budget), so that should be
> >> fine.
> >
> > Whew!!!  ;-)
> 
> I know, right? ;) Although I do seem to recall you quite recently
> helping me fix a case where it didn't quite interrupt itself enough, and
> was causing hangs...

Done that myself as well...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-19 22:31                                     ` Paul E. McKenney
@ 2021-04-21 14:24                                       ` Toke Høiland-Jørgensen
  2021-04-21 14:59                                         ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-21 14:24 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> 
>> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> 
>> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
>> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> > 
>> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >> >> >> >> > > >   
>> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> > > > >  
>> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> >> >> >> > > > >> >     
>> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
>> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
>> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
>> >> >> >> >> > > > >> > >> >  	int i;
>> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
>> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
>> >> >> >> >> > > > >> > >> >  		return;
>> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
>> >> >> >> >> > > > >> > >> >  	}
>> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
>> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
>> >> >> >> >> > > > >> > >>     
>> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> >> >> >> > > > >> > >> > +		if (!to_send)
>> >> >> >> >> > > > >> > >> > +			goto out;
>> >> >> >> >> > > > >> > >> > +
>> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
>> >> >> >> >> > > > >> > >> > +	}
>> >> >> >> >> > > > >> > >> > +    
>> >> >> >> >> > > > >> > >> 
>> >> >> >> >> > > > >> > >> [ ... ]
>> >> >> >> >> > > > >> > >>     
>> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
>> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> >> >> >> >> > > > >> > >> > +	 *
>> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
>> >> >> >> >> > > > >> > >> >  	 */
>> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
>> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
>> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> >> >> >> > > > >> > >> 
>> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> >> >> >> >> > > > >> > >
>> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> >> >> >> > > > >> > > __dev_flush():
>> >> >> >> >> > > > >> > >
>> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> >> >> >> >> > > > >>
>> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> >> >> >> > 
>> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
>> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
>> >> >> >> >> > 
>> >> >> >> >> > > > >> > 
>> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> >> >> >> > > > >> > performance :)    
>> >> >> >> >> > > > >>
>> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
>> >> >> >> >> > > > >> 
>> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >> >> >> >> > > > >>
>> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> >> >> >> >> > > > >
>> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> >> >> >> >> > > > > are operating on.  
>> >> >> >> >> > >
>> >> >> >> >> > > What other rcu objects it is using during flush?
>> >> >> >> >> > 
>> >> >> >> >> > Look at code:
>> >> >> >> >> >  kernel/bpf/cpumap.c
>> >> >> >> >> >  kernel/bpf/devmap.c
>> >> >> >> >> > 
>> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
>> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> >> >> >> >> > function is __dev_map_entry_free().
>> >> >> >> >> > 
>> >> >> >> >> > 
>> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
>> >> >> >> >> > >
>> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
>> >> >> >> >> > > didn't look at other drivers though.
>> >> >> >> >> > 
>> >> >> >> >> > Intel driver are very much in copy-paste mode.
>> >> >> >> >> >  
>> >> >> >> >> > > > >
>> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
>> >> >> >> >> > > > 
>> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
>> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> >> >> >> >> > >
>> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
>> >> >> >> >> > 
>> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
>> >> >> >> >> Updated Paul's email address.
>> >> >> >> >> 
>> >> >> >> >> > 
>> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> >> >> >> >> > 
>> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
>> >> >> >> >> > that is true.  But we want to keep these, as they also document the
>> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
>> >> >> >> >> > preempt-able in the future.
>> >> >> >> >> > 
>> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
>> >> >> >> >> > > other rcu_read_lock() as-is.
>> >> >> >> >> > 
>> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
>> >> >> >> >> > add rcu_read_lock() at least around the invocation of
>> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
>> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> >> >> >> >> > 
>> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
>> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
>> >> >> >> >> > code path in their NAPI-loop.
>> >> >> >> >> > 
>> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
>> >> >> >> >> > happens, but ref real-function names).
>> >> >> >> >> > 
>> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
>> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> >> >> >> >> > a large RCU-read section, that include the flush operation).
>> >> >> >> >> > 
>> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
>> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> >> >> >> >> > happen before we end our softirq/NAPI context.
>> >> >> >> >> > 
>> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
>> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
>> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
>> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> >> >> >> >> > this short interval.
>> >> >> >> >> > 
>> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
>> >> >> >> >
>> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
>> >> >> >> > code in question is something like this:
>> >> >> >> >
>> >> >> >> > 	void foo(void)
>> >> >> >> > 	{
>> >> >> >> > 		local_bh_disable();
>> >> >> >> >
>> >> >> >> > 		rcu_read_lock();
>> >> >> >> > 		p = rcu_dereference(gp);
>> >> >> >> > 		do_something_with(p);
>> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >
>> >> >> >> > 		do_something_else();
>> >> >> >> >
>> >> >> >> > 		rcu_read_lock();
>> >> >> >> > 		do_some_other_thing(p);
>> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >
>> >> >> >> > 		local_bh_enable();
>> >> >> >> > 	}
>> >> >> >> >
>> >> >> >> > 	void bar(struct blat *new_gp)
>> >> >> >> > 	{
>> >> >> >> > 		struct blat *old_gp;
>> >> >> >> >
>> >> >> >> > 		spin_lock(my_lock);
>> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
>> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
>> >> >> >> > 		spin_unlock(my_lock);
>> >> >> >> > 		synchronize_rcu();
>> >> >> >> > 		kfree(old_gp);
>> >> >> >> > 	}
>> >> >> >> 
>> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
>> >> >> >> think that's equivalent, right?). And the question is whether we need to
>> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
>> >> >> >> whole lifetime of p.
>> >> >> >
>> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
>> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
>> >> >> 
>> >> >> Right, gotcha!
>> >> >> 
>> >> >> >> > I need to check up on -rt.
>> >> >> >> >
>> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
>> >> >> >> > will look like one big RCU read-side critical section.  But don't try
>> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
>> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
>> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
>> >> >> >> 
>> >> >> >> OK. Variants of this code has been around since before then, but I
>> >> >> >> honestly have no idea what it looked like back then exactly...
>> >> >> >
>> >> >> > I know that feeling...
>> >> >> >
>> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
>> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
>> >> >> >> 
>> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
>> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
>> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
>> >> >> >> this, or are we likely to get complaints if we remove it?
>> >> >> >
>> >> >> > If you just got rid of the first rcu_read_unlock() and the second
>> >> >> > rcu_read_lock() in the code above, lockdep will understand.
>> >> >> 
>> >> >> Right, but doing so entails going through all the drivers, which is what
>> >> >> we're trying to avoid :)
>> >> >
>> >> > I was afraid of that...  ;-)
>> >> >
>> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
>> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
>> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
>> >> >> >
>> >> >> > 	p = rcu_dereference(gp);
>> >> >> >
>> >> >> > You would do this:
>> >> >> >
>> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
>> >> >> >
>> >> >> > This would be needed for mainline, regardless of -rt.
>> >> >> 
>> >> >> OK. And as far as I can tell this is harmless for code paths that call
>> >> >> the same function but from a regular rcu_read_lock()-protected section
>> >> >> instead from a bh-disabled section, right?
>> >> >
>> >> > That is correct.  That rcu_dereference_check() invocation will make
>> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
>> >> > Or both, for that matter.
>> >> 
>> >> OK, great, thank you for confirming my understanding!
>> >> 
>> >> >> What happens, BTW, if we *don't* get rid of all the existing
>> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
>> >> >> we're discussing is whether to add that second rcu_read_lock() around
>> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
>> >> >> is already there (in the particular driver we're discussing), and the
>> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
>> >> >> discussion, there really is not much point in adding that second
>> >> >> rcu_read_lock/unlock(), is there?
>> >> >
>> >> > From an algorithmic point of view, the second rcu_read_lock()
>> >> > and rcu_read_unlock() are redundant.  Of course, there are also
>> >> > software-engineering considerations, including copy-pasta issues.
>> >> >
>> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
>> >> >> already there, lockdep is not likely to complain either, so we're
>> >> >> basically fine? Except that the code is somewhat confusing as-is, of
>> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
>> >> >
>> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
>> >> > the code, fails to note the fact that softirq is disabled throughout,
>> >> > and decides that leaking a pointer from one RCU read-side critical
>> >> > section to a later one is just fine.  :-/
>> >> 
>> >> Yup, totally agreed that we need to fix this for the sake of the humans
>> >> reading the code; just wanted to make sure my understanding was correct
>> >> that we don't strictly need to do anything as far as the machines
>> >> executing it are concerned :)
>> >> 
>> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
>> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
>> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
>> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
>> >> >> should definitely clean this up.
>> >> >> 
>> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
>> >> >
>> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
>> >> > including those complaints that point out bugs.  It also might get you
>> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
>> >> 
>> >> It's not; it's the netdev_map member of this struct:
>> >> 
>> >> struct bpf_dtab {
>> >> 	struct bpf_map map;
>> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
>> >> 	struct list_head list;
>> >> 
>> >> 	/* these are only used for DEVMAP_HASH type maps */
>> >> 	struct hlist_head *dev_index_head;
>> >> 	spinlock_t index_lock;
>> >> 	unsigned int items;
>> >> 	u32 n_buckets;
>> >> };
>> >> 
>> >> Will adding __rcu to such a dynamic array member do the right thing when
>> >> paired with rcu_dereference() on array members (i.e., in place of the
>> >> READ_ONCE in the code linked above)?
>> >
>> > The only thing __rcu will do is provide information to the sparse static
>> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
>> > to a __rcu pointer.  But it is already griping at you for applying
>> > rcu_dereference() to something not marked __rcu, so...  ;-)
>> 
>> Right, hence the need for a cleanup ;)
>> 
>> My question was more if it understood arrays, though. I.e., that
>> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
>> array... Or am I maybe thinking that tool is way smarter than it is, and
>> it just complains for any access to that field that doesn't use
>> rcu_dereference()?
>
> I believe that sparse will know about the pointers being __rcu, but
> not the array.  Unless you mark both levels.

Hi Paul

One more question, since I started adding the annotations: We are
currently swapping out the pointers using xchg():
https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555

and even cmpxchg():
https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831

Sparse complains about these if I add the __rcu annotation to the
definition (which otherwise works just fine with the double-pointer,
BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
atomic swaps or something? Or do we just keep the regular xchg() and
ignore those particular sparse warnings?

Thanks,
-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-21 14:24                                       ` Toke Høiland-Jørgensen
@ 2021-04-21 14:59                                         ` Paul E. McKenney
  2021-04-21 19:59                                           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-21 14:59 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Wed, Apr 21, 2021 at 04:24:41PM +0200, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> 
> >> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> 
> >> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> 
> >> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> >> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> >> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
> >> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> >> > 
> >> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> >> >> >> >> > > >   
> >> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> >> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> >> > > > >  
> >> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> >> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> >> >> >> > > > >> >     
> >> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> >> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
> >> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
> >> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
> >> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
> >> >> >> >> >> > > > >> > >> >  	int i;
> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
> >> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
> >> >> >> >> >> > > > >> > >> >  		return;
> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
> >> >> >> >> >> > > > >> > >> >  	}
> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
> >> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
> >> >> >> >> >> > > > >> > >>     
> >> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> >> >> >> > > > >> > >> > +		if (!to_send)
> >> >> >> >> >> > > > >> > >> > +			goto out;
> >> >> >> >> >> > > > >> > >> > +
> >> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
> >> >> >> >> >> > > > >> > >> > +	}
> >> >> >> >> >> > > > >> > >> > +    
> >> >> >> >> >> > > > >> > >> 
> >> >> >> >> >> > > > >> > >> [ ... ]
> >> >> >> >> >> > > > >> > >>     
> >> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
> >> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> >> >> >> > > > >> > >> > +	 *
> >> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
> >> >> >> >> >> > > > >> > >> >  	 */
> >> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
> >> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
> >> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
> >> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> >> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> >> >> >> > > > >> > >> 
> >> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> >> >> >> >> >> > > > >> > >
> >> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> >> >> >> >> >> > > > >> > > __dev_flush():
> >> >> >> >> >> > > > >> > >
> >> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
> >> >> >> >> >> > > > >>
> >> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> >> >> >> > 
> >> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> >> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
> >> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> >> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
> >> >> >> >> >> > 
> >> >> >> >> >> > > > >> > 
> >> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> >> >> >> > > > >> > performance :)    
> >> >> >> >> >> > > > >>
> >> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
> >> >> >> >> >> > > > >> 
> >> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> >> >> >> >> > > > >>
> >> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> >> >> >> >> >> > > > >
> >> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> >> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> >> >> >> >> > > > > are operating on.  
> >> >> >> >> >> > >
> >> >> >> >> >> > > What other rcu objects it is using during flush?
> >> >> >> >> >> > 
> >> >> >> >> >> > Look at code:
> >> >> >> >> >> >  kernel/bpf/cpumap.c
> >> >> >> >> >> >  kernel/bpf/devmap.c
> >> >> >> >> >> > 
> >> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
> >> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> >> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> >> >> >> >> >> > function is __dev_map_entry_free().
> >> >> >> >> >> > 
> >> >> >> >> >> > 
> >> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
> >> >> >> >> >> > >
> >> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
> >> >> >> >> >> > > didn't look at other drivers though.
> >> >> >> >> >> > 
> >> >> >> >> >> > Intel driver are very much in copy-paste mode.
> >> >> >> >> >> >  
> >> >> >> >> >> > > > >
> >> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
> >> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
> >> >> >> >> >> > > > 
> >> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
> >> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >> >> >> >> >> > >
> >> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
> >> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
> >> >> >> >> >> > 
> >> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
> >> >> >> >> >> Updated Paul's email address.
> >> >> >> >> >> 
> >> >> >> >> >> > 
> >> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> >> >> >> >> >> > 
> >> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
> >> >> >> >> >> > that is true.  But we want to keep these, as they also document the
> >> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
> >> >> >> >> >> > preempt-able in the future.
> >> >> >> >> >> > 
> >> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
> >> >> >> >> >> > > other rcu_read_lock() as-is.
> >> >> >> >> >> > 
> >> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
> >> >> >> >> >> > add rcu_read_lock() at least around the invocation of
> >> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
> >> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> >> >> >> >> >> > 
> >> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
> >> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
> >> >> >> >> >> > code path in their NAPI-loop.
> >> >> >> >> >> > 
> >> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
> >> >> >> >> >> > happens, but ref real-function names).
> >> >> >> >> >> > 
> >> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
> >> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> >> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> >> >> >> >> >> > a large RCU-read section, that include the flush operation).
> >> >> >> >> >> > 
> >> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> >> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
> >> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> >> >> >> >> >> > happen before we end our softirq/NAPI context.
> >> >> >> >> >> > 
> >> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
> >> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> >> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
> >> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> >> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
> >> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> >> >> >> >> >> > this short interval.
> >> >> >> >> >> > 
> >> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> >> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
> >> >> >> >> >
> >> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
> >> >> >> >> > code in question is something like this:
> >> >> >> >> >
> >> >> >> >> > 	void foo(void)
> >> >> >> >> > 	{
> >> >> >> >> > 		local_bh_disable();
> >> >> >> >> >
> >> >> >> >> > 		rcu_read_lock();
> >> >> >> >> > 		p = rcu_dereference(gp);
> >> >> >> >> > 		do_something_with(p);
> >> >> >> >> > 		rcu_read_unlock();
> >> >> >> >> >
> >> >> >> >> > 		do_something_else();
> >> >> >> >> >
> >> >> >> >> > 		rcu_read_lock();
> >> >> >> >> > 		do_some_other_thing(p);
> >> >> >> >> > 		rcu_read_unlock();
> >> >> >> >> >
> >> >> >> >> > 		local_bh_enable();
> >> >> >> >> > 	}
> >> >> >> >> >
> >> >> >> >> > 	void bar(struct blat *new_gp)
> >> >> >> >> > 	{
> >> >> >> >> > 		struct blat *old_gp;
> >> >> >> >> >
> >> >> >> >> > 		spin_lock(my_lock);
> >> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> >> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
> >> >> >> >> > 		spin_unlock(my_lock);
> >> >> >> >> > 		synchronize_rcu();
> >> >> >> >> > 		kfree(old_gp);
> >> >> >> >> > 	}
> >> >> >> >> 
> >> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
> >> >> >> >> think that's equivalent, right?). And the question is whether we need to
> >> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
> >> >> >> >> whole lifetime of p.
> >> >> >> >
> >> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
> >> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
> >> >> >> 
> >> >> >> Right, gotcha!
> >> >> >> 
> >> >> >> >> > I need to check up on -rt.
> >> >> >> >> >
> >> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
> >> >> >> >> > will look like one big RCU read-side critical section.  But don't try
> >> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> >> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> >> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
> >> >> >> >> 
> >> >> >> >> OK. Variants of this code has been around since before then, but I
> >> >> >> >> honestly have no idea what it looked like back then exactly...
> >> >> >> >
> >> >> >> > I know that feeling...
> >> >> >> >
> >> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
> >> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
> >> >> >> >> 
> >> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
> >> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
> >> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
> >> >> >> >> this, or are we likely to get complaints if we remove it?
> >> >> >> >
> >> >> >> > If you just got rid of the first rcu_read_unlock() and the second
> >> >> >> > rcu_read_lock() in the code above, lockdep will understand.
> >> >> >> 
> >> >> >> Right, but doing so entails going through all the drivers, which is what
> >> >> >> we're trying to avoid :)
> >> >> >
> >> >> > I was afraid of that...  ;-)
> >> >> >
> >> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
> >> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
> >> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
> >> >> >> >
> >> >> >> > 	p = rcu_dereference(gp);
> >> >> >> >
> >> >> >> > You would do this:
> >> >> >> >
> >> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
> >> >> >> >
> >> >> >> > This would be needed for mainline, regardless of -rt.
> >> >> >> 
> >> >> >> OK. And as far as I can tell this is harmless for code paths that call
> >> >> >> the same function but from a regular rcu_read_lock()-protected section
> >> >> >> instead from a bh-disabled section, right?
> >> >> >
> >> >> > That is correct.  That rcu_dereference_check() invocation will make
> >> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
> >> >> > Or both, for that matter.
> >> >> 
> >> >> OK, great, thank you for confirming my understanding!
> >> >> 
> >> >> >> What happens, BTW, if we *don't* get rid of all the existing
> >> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
> >> >> >> we're discussing is whether to add that second rcu_read_lock() around
> >> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
> >> >> >> is already there (in the particular driver we're discussing), and the
> >> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
> >> >> >> discussion, there really is not much point in adding that second
> >> >> >> rcu_read_lock/unlock(), is there?
> >> >> >
> >> >> > From an algorithmic point of view, the second rcu_read_lock()
> >> >> > and rcu_read_unlock() are redundant.  Of course, there are also
> >> >> > software-engineering considerations, including copy-pasta issues.
> >> >> >
> >> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
> >> >> >> already there, lockdep is not likely to complain either, so we're
> >> >> >> basically fine? Except that the code is somewhat confusing as-is, of
> >> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
> >> >> >
> >> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
> >> >> > the code, fails to note the fact that softirq is disabled throughout,
> >> >> > and decides that leaking a pointer from one RCU read-side critical
> >> >> > section to a later one is just fine.  :-/
> >> >> 
> >> >> Yup, totally agreed that we need to fix this for the sake of the humans
> >> >> reading the code; just wanted to make sure my understanding was correct
> >> >> that we don't strictly need to do anything as far as the machines
> >> >> executing it are concerned :)
> >> >> 
> >> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
> >> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
> >> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
> >> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
> >> >> >> should definitely clean this up.
> >> >> >> 
> >> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
> >> >> >
> >> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
> >> >> > including those complaints that point out bugs.  It also might get you
> >> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
> >> >> 
> >> >> It's not; it's the netdev_map member of this struct:
> >> >> 
> >> >> struct bpf_dtab {
> >> >> 	struct bpf_map map;
> >> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
> >> >> 	struct list_head list;
> >> >> 
> >> >> 	/* these are only used for DEVMAP_HASH type maps */
> >> >> 	struct hlist_head *dev_index_head;
> >> >> 	spinlock_t index_lock;
> >> >> 	unsigned int items;
> >> >> 	u32 n_buckets;
> >> >> };
> >> >> 
> >> >> Will adding __rcu to such a dynamic array member do the right thing when
> >> >> paired with rcu_dereference() on array members (i.e., in place of the
> >> >> READ_ONCE in the code linked above)?
> >> >
> >> > The only thing __rcu will do is provide information to the sparse static
> >> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
> >> > to a __rcu pointer.  But it is already griping at you for applying
> >> > rcu_dereference() to something not marked __rcu, so...  ;-)
> >> 
> >> Right, hence the need for a cleanup ;)
> >> 
> >> My question was more if it understood arrays, though. I.e., that
> >> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
> >> array... Or am I maybe thinking that tool is way smarter than it is, and
> >> it just complains for any access to that field that doesn't use
> >> rcu_dereference()?
> >
> > I believe that sparse will know about the pointers being __rcu, but
> > not the array.  Unless you mark both levels.
> 
> Hi Paul
> 
> One more question, since I started adding the annotations: We are
> currently swapping out the pointers using xchg():
> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555
> 
> and even cmpxchg():
> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831
> 
> Sparse complains about these if I add the __rcu annotation to the
> definition (which otherwise works just fine with the double-pointer,
> BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
> atomic swaps or something? Or do we just keep the regular xchg() and
> ignore those particular sparse warnings?

Sounds like I need to supply a unrcu_pointer() macro or some such.
This would operate something like the current open-coded casts
in __rcu_dereference_protected().

Would something like that work for you?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-21 14:59                                         ` Paul E. McKenney
@ 2021-04-21 19:59                                           ` Toke Høiland-Jørgensen
  2021-04-21 20:51                                             ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-21 19:59 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Wed, Apr 21, 2021 at 04:24:41PM +0200, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> 
>> >> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> 
>> >> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> 
>> >> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> >> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> >> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
>> >> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> >> > 
>> >> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >> >> >> >> >> > > >   
>> >> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> >> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> >> > > > >  
>> >> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> >> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> >> >> >> >> > > > >> >     
>> >> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> >> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> >> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
>> >> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
>> >> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
>> >> >> >> >> >> > > > >> > >> >  	int i;
>> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
>> >> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
>> >> >> >> >> >> > > > >> > >> >  		return;
>> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
>> >> >> >> >> >> > > > >> > >> >  	}
>> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
>> >> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
>> >> >> >> >> >> > > > >> > >>     
>> >> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> >> >> >> >> > > > >> > >> > +		if (!to_send)
>> >> >> >> >> >> > > > >> > >> > +			goto out;
>> >> >> >> >> >> > > > >> > >> > +
>> >> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
>> >> >> >> >> >> > > > >> > >> > +	}
>> >> >> >> >> >> > > > >> > >> > +    
>> >> >> >> >> >> > > > >> > >> 
>> >> >> >> >> >> > > > >> > >> [ ... ]
>> >> >> >> >> >> > > > >> > >>     
>> >> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
>> >> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> >> >> >> >> >> > > > >> > >> > +	 *
>> >> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
>> >> >> >> >> >> > > > >> > >> >  	 */
>> >> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
>> >> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
>> >> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> >> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> >> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> >> >> >> >> > > > >> > >> 
>> >> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> >> >> >> >> >> > > > >> > >
>> >> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> >> >> >> >> > > > >> > > __dev_flush():
>> >> >> >> >> >> > > > >> > >
>> >> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> >> >> >> >> >> > > > >>
>> >> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> >> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
>> >> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> >> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
>> >> >> >> >> >> > 
>> >> >> >> >> >> > > > >> > 
>> >> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> >> >> >> >> > > > >> > performance :)    
>> >> >> >> >> >> > > > >>
>> >> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
>> >> >> >> >> >> > > > >> 
>> >> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >> >> >> >> >> > > > >>
>> >> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> >> >> >> >> >> > > > >
>> >> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> >> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> >> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> >> >> >> >> >> > > > > are operating on.  
>> >> >> >> >> >> > >
>> >> >> >> >> >> > > What other rcu objects it is using during flush?
>> >> >> >> >> >> > 
>> >> >> >> >> >> > Look at code:
>> >> >> >> >> >> >  kernel/bpf/cpumap.c
>> >> >> >> >> >> >  kernel/bpf/devmap.c
>> >> >> >> >> >> > 
>> >> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
>> >> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> >> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> >> >> >> >> >> > function is __dev_map_entry_free().
>> >> >> >> >> >> > 
>> >> >> >> >> >> > 
>> >> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
>> >> >> >> >> >> > >
>> >> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
>> >> >> >> >> >> > > didn't look at other drivers though.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > Intel driver are very much in copy-paste mode.
>> >> >> >> >> >> >  
>> >> >> >> >> >> > > > >
>> >> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> >> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> >> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
>> >> >> >> >> >> > > > 
>> >> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> >> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
>> >> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> >> >> >> >> >> > >
>> >> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> >> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
>> >> >> >> >> >> > 
>> >> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
>> >> >> >> >> >> Updated Paul's email address.
>> >> >> >> >> >> 
>> >> >> >> >> >> > 
>> >> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> >> >> >> >> >> > 
>> >> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
>> >> >> >> >> >> > that is true.  But we want to keep these, as they also document the
>> >> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
>> >> >> >> >> >> > preempt-able in the future.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
>> >> >> >> >> >> > > other rcu_read_lock() as-is.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
>> >> >> >> >> >> > add rcu_read_lock() at least around the invocation of
>> >> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
>> >> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> >> >> >> >> >> > 
>> >> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
>> >> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
>> >> >> >> >> >> > code path in their NAPI-loop.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
>> >> >> >> >> >> > happens, but ref real-function names).
>> >> >> >> >> >> > 
>> >> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
>> >> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> >> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> >> >> >> >> >> > a large RCU-read section, that include the flush operation).
>> >> >> >> >> >> > 
>> >> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> >> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
>> >> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> >> >> >> >> >> > happen before we end our softirq/NAPI context.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
>> >> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> >> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
>> >> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> >> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
>> >> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> >> >> >> >> >> > this short interval.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> >> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
>> >> >> >> >> >
>> >> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
>> >> >> >> >> > code in question is something like this:
>> >> >> >> >> >
>> >> >> >> >> > 	void foo(void)
>> >> >> >> >> > 	{
>> >> >> >> >> > 		local_bh_disable();
>> >> >> >> >> >
>> >> >> >> >> > 		rcu_read_lock();
>> >> >> >> >> > 		p = rcu_dereference(gp);
>> >> >> >> >> > 		do_something_with(p);
>> >> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >> >
>> >> >> >> >> > 		do_something_else();
>> >> >> >> >> >
>> >> >> >> >> > 		rcu_read_lock();
>> >> >> >> >> > 		do_some_other_thing(p);
>> >> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >> >
>> >> >> >> >> > 		local_bh_enable();
>> >> >> >> >> > 	}
>> >> >> >> >> >
>> >> >> >> >> > 	void bar(struct blat *new_gp)
>> >> >> >> >> > 	{
>> >> >> >> >> > 		struct blat *old_gp;
>> >> >> >> >> >
>> >> >> >> >> > 		spin_lock(my_lock);
>> >> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
>> >> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
>> >> >> >> >> > 		spin_unlock(my_lock);
>> >> >> >> >> > 		synchronize_rcu();
>> >> >> >> >> > 		kfree(old_gp);
>> >> >> >> >> > 	}
>> >> >> >> >> 
>> >> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
>> >> >> >> >> think that's equivalent, right?). And the question is whether we need to
>> >> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
>> >> >> >> >> whole lifetime of p.
>> >> >> >> >
>> >> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
>> >> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
>> >> >> >> 
>> >> >> >> Right, gotcha!
>> >> >> >> 
>> >> >> >> >> > I need to check up on -rt.
>> >> >> >> >> >
>> >> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
>> >> >> >> >> > will look like one big RCU read-side critical section.  But don't try
>> >> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
>> >> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
>> >> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
>> >> >> >> >> 
>> >> >> >> >> OK. Variants of this code has been around since before then, but I
>> >> >> >> >> honestly have no idea what it looked like back then exactly...
>> >> >> >> >
>> >> >> >> > I know that feeling...
>> >> >> >> >
>> >> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
>> >> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
>> >> >> >> >> 
>> >> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
>> >> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
>> >> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
>> >> >> >> >> this, or are we likely to get complaints if we remove it?
>> >> >> >> >
>> >> >> >> > If you just got rid of the first rcu_read_unlock() and the second
>> >> >> >> > rcu_read_lock() in the code above, lockdep will understand.
>> >> >> >> 
>> >> >> >> Right, but doing so entails going through all the drivers, which is what
>> >> >> >> we're trying to avoid :)
>> >> >> >
>> >> >> > I was afraid of that...  ;-)
>> >> >> >
>> >> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
>> >> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
>> >> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
>> >> >> >> >
>> >> >> >> > 	p = rcu_dereference(gp);
>> >> >> >> >
>> >> >> >> > You would do this:
>> >> >> >> >
>> >> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
>> >> >> >> >
>> >> >> >> > This would be needed for mainline, regardless of -rt.
>> >> >> >> 
>> >> >> >> OK. And as far as I can tell this is harmless for code paths that call
>> >> >> >> the same function but from a regular rcu_read_lock()-protected section
>> >> >> >> instead from a bh-disabled section, right?
>> >> >> >
>> >> >> > That is correct.  That rcu_dereference_check() invocation will make
>> >> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
>> >> >> > Or both, for that matter.
>> >> >> 
>> >> >> OK, great, thank you for confirming my understanding!
>> >> >> 
>> >> >> >> What happens, BTW, if we *don't* get rid of all the existing
>> >> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
>> >> >> >> we're discussing is whether to add that second rcu_read_lock() around
>> >> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
>> >> >> >> is already there (in the particular driver we're discussing), and the
>> >> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
>> >> >> >> discussion, there really is not much point in adding that second
>> >> >> >> rcu_read_lock/unlock(), is there?
>> >> >> >
>> >> >> > From an algorithmic point of view, the second rcu_read_lock()
>> >> >> > and rcu_read_unlock() are redundant.  Of course, there are also
>> >> >> > software-engineering considerations, including copy-pasta issues.
>> >> >> >
>> >> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
>> >> >> >> already there, lockdep is not likely to complain either, so we're
>> >> >> >> basically fine? Except that the code is somewhat confusing as-is, of
>> >> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
>> >> >> >
>> >> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
>> >> >> > the code, fails to note the fact that softirq is disabled throughout,
>> >> >> > and decides that leaking a pointer from one RCU read-side critical
>> >> >> > section to a later one is just fine.  :-/
>> >> >> 
>> >> >> Yup, totally agreed that we need to fix this for the sake of the humans
>> >> >> reading the code; just wanted to make sure my understanding was correct
>> >> >> that we don't strictly need to do anything as far as the machines
>> >> >> executing it are concerned :)
>> >> >> 
>> >> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
>> >> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
>> >> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
>> >> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
>> >> >> >> should definitely clean this up.
>> >> >> >> 
>> >> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
>> >> >> >
>> >> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
>> >> >> > including those complaints that point out bugs.  It also might get you
>> >> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
>> >> >> 
>> >> >> It's not; it's the netdev_map member of this struct:
>> >> >> 
>> >> >> struct bpf_dtab {
>> >> >> 	struct bpf_map map;
>> >> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
>> >> >> 	struct list_head list;
>> >> >> 
>> >> >> 	/* these are only used for DEVMAP_HASH type maps */
>> >> >> 	struct hlist_head *dev_index_head;
>> >> >> 	spinlock_t index_lock;
>> >> >> 	unsigned int items;
>> >> >> 	u32 n_buckets;
>> >> >> };
>> >> >> 
>> >> >> Will adding __rcu to such a dynamic array member do the right thing when
>> >> >> paired with rcu_dereference() on array members (i.e., in place of the
>> >> >> READ_ONCE in the code linked above)?
>> >> >
>> >> > The only thing __rcu will do is provide information to the sparse static
>> >> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
>> >> > to a __rcu pointer.  But it is already griping at you for applying
>> >> > rcu_dereference() to something not marked __rcu, so...  ;-)
>> >> 
>> >> Right, hence the need for a cleanup ;)
>> >> 
>> >> My question was more if it understood arrays, though. I.e., that
>> >> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
>> >> array... Or am I maybe thinking that tool is way smarter than it is, and
>> >> it just complains for any access to that field that doesn't use
>> >> rcu_dereference()?
>> >
>> > I believe that sparse will know about the pointers being __rcu, but
>> > not the array.  Unless you mark both levels.
>> 
>> Hi Paul
>> 
>> One more question, since I started adding the annotations: We are
>> currently swapping out the pointers using xchg():
>> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555
>> 
>> and even cmpxchg():
>> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831
>> 
>> Sparse complains about these if I add the __rcu annotation to the
>> definition (which otherwise works just fine with the double-pointer,
>> BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
>> atomic swaps or something? Or do we just keep the regular xchg() and
>> ignore those particular sparse warnings?
>
> Sounds like I need to supply a unrcu_pointer() macro or some such.
> This would operate something like the current open-coded casts
> in __rcu_dereference_protected().

So with that, I would turn the existing:

	dev = READ_ONCE(dtab->netdev_map[i]);
	if (!dev || netdev != dev->dev)
		continue;
	odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);

into:

	dev = rcu_dereference(dtab->netdev_map[i]);
	if (!dev || netdev != dev->dev)
		continue;
	odev = cmpxchg(unrcu_pointer(&dtab->netdev_map[i]), dev, NULL);


and with a _check version:

	old_dev = xchg(unrcu_pointer_check(&dtab->netdev_map[k], rcu_read_lock_bh_held()), NULL);

right?

Or would it be:
	odev = cmpxchg(&unrcu_pointer(dtab->netdev_map[i]), dev, NULL);
?

> Would something like that work for you?

Yeah, I believe it would :)

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-21 19:59                                           ` Toke Høiland-Jørgensen
@ 2021-04-21 20:51                                             ` Paul E. McKenney
  2021-04-21 21:10                                               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-21 20:51 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Wed, Apr 21, 2021 at 09:59:55PM +0200, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Wed, Apr 21, 2021 at 04:24:41PM +0200, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> 
> >> >> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> 
> >> >> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> 
> >> >> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> >> 
> >> >> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> >> >> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> >> >> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
> >> >> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> >> >> >> >> >> > > >   
> >> >> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> >> >> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> >> >> > > > >  
> >> >> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> >> >> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> >> >> >> >> > > > >> >     
> >> >> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> >> >> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
> >> >> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
> >> >> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
> >> >> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
> >> >> >> >> >> >> > > > >> > >> >  	int i;
> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
> >> >> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
> >> >> >> >> >> >> > > > >> > >> >  		return;
> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
> >> >> >> >> >> >> > > > >> > >> >  	}
> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
> >> >> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
> >> >> >> >> >> >> > > > >> > >>     
> >> >> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> >> >> >> >> > > > >> > >> > +		if (!to_send)
> >> >> >> >> >> >> > > > >> > >> > +			goto out;
> >> >> >> >> >> >> > > > >> > >> > +
> >> >> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
> >> >> >> >> >> >> > > > >> > >> > +	}
> >> >> >> >> >> >> > > > >> > >> > +    
> >> >> >> >> >> >> > > > >> > >> 
> >> >> >> >> >> >> > > > >> > >> [ ... ]
> >> >> >> >> >> >> > > > >> > >>     
> >> >> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
> >> >> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> >> >> >> >> > > > >> > >> > +	 *
> >> >> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
> >> >> >> >> >> >> > > > >> > >> >  	 */
> >> >> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
> >> >> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
> >> >> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
> >> >> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> >> >> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> >> >> >> >> > > > >> > >> 
> >> >> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> >> >> >> >> >> >> > > > >> > >
> >> >> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> >> >> >> >> >> >> > > > >> > > __dev_flush():
> >> >> >> >> >> >> > > > >> > >
> >> >> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> >> >> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
> >> >> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> >> >> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > > > >> > 
> >> >> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> >> >> >> >> > > > >> > performance :)    
> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
> >> >> >> >> >> >> > > > >> 
> >> >> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> >> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> >> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> >> >> >> >> >> >> > > > >
> >> >> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> >> >> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> >> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> >> >> >> >> >> > > > > are operating on.  
> >> >> >> >> >> >> > >
> >> >> >> >> >> >> > > What other rcu objects it is using during flush?
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > Look at code:
> >> >> >> >> >> >> >  kernel/bpf/cpumap.c
> >> >> >> >> >> >> >  kernel/bpf/devmap.c
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
> >> >> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> >> >> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> >> >> >> >> >> >> > function is __dev_map_entry_free().
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
> >> >> >> >> >> >> > >
> >> >> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
> >> >> >> >> >> >> > > didn't look at other drivers though.
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > Intel driver are very much in copy-paste mode.
> >> >> >> >> >> >> >  
> >> >> >> >> >> >> > > > >
> >> >> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> >> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
> >> >> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
> >> >> >> >> >> >> > > > 
> >> >> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> >> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
> >> >> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >> >> >> >> >> >> > >
> >> >> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
> >> >> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
> >> >> >> >> >> >> Updated Paul's email address.
> >> >> >> >> >> >> 
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
> >> >> >> >> >> >> > that is true.  But we want to keep these, as they also document the
> >> >> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
> >> >> >> >> >> >> > preempt-able in the future.
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
> >> >> >> >> >> >> > > other rcu_read_lock() as-is.
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
> >> >> >> >> >> >> > add rcu_read_lock() at least around the invocation of
> >> >> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
> >> >> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
> >> >> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
> >> >> >> >> >> >> > code path in their NAPI-loop.
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
> >> >> >> >> >> >> > happens, but ref real-function names).
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
> >> >> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> >> >> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> >> >> >> >> >> >> > a large RCU-read section, that include the flush operation).
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> >> >> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
> >> >> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> >> >> >> >> >> >> > happen before we end our softirq/NAPI context.
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
> >> >> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> >> >> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
> >> >> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> >> >> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
> >> >> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> >> >> >> >> >> >> > this short interval.
> >> >> >> >> >> >> > 
> >> >> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> >> >> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
> >> >> >> >> >> >
> >> >> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
> >> >> >> >> >> > code in question is something like this:
> >> >> >> >> >> >
> >> >> >> >> >> > 	void foo(void)
> >> >> >> >> >> > 	{
> >> >> >> >> >> > 		local_bh_disable();
> >> >> >> >> >> >
> >> >> >> >> >> > 		rcu_read_lock();
> >> >> >> >> >> > 		p = rcu_dereference(gp);
> >> >> >> >> >> > 		do_something_with(p);
> >> >> >> >> >> > 		rcu_read_unlock();
> >> >> >> >> >> >
> >> >> >> >> >> > 		do_something_else();
> >> >> >> >> >> >
> >> >> >> >> >> > 		rcu_read_lock();
> >> >> >> >> >> > 		do_some_other_thing(p);
> >> >> >> >> >> > 		rcu_read_unlock();
> >> >> >> >> >> >
> >> >> >> >> >> > 		local_bh_enable();
> >> >> >> >> >> > 	}
> >> >> >> >> >> >
> >> >> >> >> >> > 	void bar(struct blat *new_gp)
> >> >> >> >> >> > 	{
> >> >> >> >> >> > 		struct blat *old_gp;
> >> >> >> >> >> >
> >> >> >> >> >> > 		spin_lock(my_lock);
> >> >> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> >> >> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
> >> >> >> >> >> > 		spin_unlock(my_lock);
> >> >> >> >> >> > 		synchronize_rcu();
> >> >> >> >> >> > 		kfree(old_gp);
> >> >> >> >> >> > 	}
> >> >> >> >> >> 
> >> >> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
> >> >> >> >> >> think that's equivalent, right?). And the question is whether we need to
> >> >> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
> >> >> >> >> >> whole lifetime of p.
> >> >> >> >> >
> >> >> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
> >> >> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
> >> >> >> >> 
> >> >> >> >> Right, gotcha!
> >> >> >> >> 
> >> >> >> >> >> > I need to check up on -rt.
> >> >> >> >> >> >
> >> >> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
> >> >> >> >> >> > will look like one big RCU read-side critical section.  But don't try
> >> >> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> >> >> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> >> >> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
> >> >> >> >> >> 
> >> >> >> >> >> OK. Variants of this code has been around since before then, but I
> >> >> >> >> >> honestly have no idea what it looked like back then exactly...
> >> >> >> >> >
> >> >> >> >> > I know that feeling...
> >> >> >> >> >
> >> >> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
> >> >> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
> >> >> >> >> >> 
> >> >> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
> >> >> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
> >> >> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
> >> >> >> >> >> this, or are we likely to get complaints if we remove it?
> >> >> >> >> >
> >> >> >> >> > If you just got rid of the first rcu_read_unlock() and the second
> >> >> >> >> > rcu_read_lock() in the code above, lockdep will understand.
> >> >> >> >> 
> >> >> >> >> Right, but doing so entails going through all the drivers, which is what
> >> >> >> >> we're trying to avoid :)
> >> >> >> >
> >> >> >> > I was afraid of that...  ;-)
> >> >> >> >
> >> >> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
> >> >> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
> >> >> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
> >> >> >> >> >
> >> >> >> >> > 	p = rcu_dereference(gp);
> >> >> >> >> >
> >> >> >> >> > You would do this:
> >> >> >> >> >
> >> >> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
> >> >> >> >> >
> >> >> >> >> > This would be needed for mainline, regardless of -rt.
> >> >> >> >> 
> >> >> >> >> OK. And as far as I can tell this is harmless for code paths that call
> >> >> >> >> the same function but from a regular rcu_read_lock()-protected section
> >> >> >> >> instead from a bh-disabled section, right?
> >> >> >> >
> >> >> >> > That is correct.  That rcu_dereference_check() invocation will make
> >> >> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
> >> >> >> > Or both, for that matter.
> >> >> >> 
> >> >> >> OK, great, thank you for confirming my understanding!
> >> >> >> 
> >> >> >> >> What happens, BTW, if we *don't* get rid of all the existing
> >> >> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
> >> >> >> >> we're discussing is whether to add that second rcu_read_lock() around
> >> >> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
> >> >> >> >> is already there (in the particular driver we're discussing), and the
> >> >> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
> >> >> >> >> discussion, there really is not much point in adding that second
> >> >> >> >> rcu_read_lock/unlock(), is there?
> >> >> >> >
> >> >> >> > From an algorithmic point of view, the second rcu_read_lock()
> >> >> >> > and rcu_read_unlock() are redundant.  Of course, there are also
> >> >> >> > software-engineering considerations, including copy-pasta issues.
> >> >> >> >
> >> >> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
> >> >> >> >> already there, lockdep is not likely to complain either, so we're
> >> >> >> >> basically fine? Except that the code is somewhat confusing as-is, of
> >> >> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
> >> >> >> >
> >> >> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
> >> >> >> > the code, fails to note the fact that softirq is disabled throughout,
> >> >> >> > and decides that leaking a pointer from one RCU read-side critical
> >> >> >> > section to a later one is just fine.  :-/
> >> >> >> 
> >> >> >> Yup, totally agreed that we need to fix this for the sake of the humans
> >> >> >> reading the code; just wanted to make sure my understanding was correct
> >> >> >> that we don't strictly need to do anything as far as the machines
> >> >> >> executing it are concerned :)
> >> >> >> 
> >> >> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
> >> >> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
> >> >> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
> >> >> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
> >> >> >> >> should definitely clean this up.
> >> >> >> >> 
> >> >> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
> >> >> >> >
> >> >> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
> >> >> >> > including those complaints that point out bugs.  It also might get you
> >> >> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
> >> >> >> 
> >> >> >> It's not; it's the netdev_map member of this struct:
> >> >> >> 
> >> >> >> struct bpf_dtab {
> >> >> >> 	struct bpf_map map;
> >> >> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
> >> >> >> 	struct list_head list;
> >> >> >> 
> >> >> >> 	/* these are only used for DEVMAP_HASH type maps */
> >> >> >> 	struct hlist_head *dev_index_head;
> >> >> >> 	spinlock_t index_lock;
> >> >> >> 	unsigned int items;
> >> >> >> 	u32 n_buckets;
> >> >> >> };
> >> >> >> 
> >> >> >> Will adding __rcu to such a dynamic array member do the right thing when
> >> >> >> paired with rcu_dereference() on array members (i.e., in place of the
> >> >> >> READ_ONCE in the code linked above)?
> >> >> >
> >> >> > The only thing __rcu will do is provide information to the sparse static
> >> >> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
> >> >> > to a __rcu pointer.  But it is already griping at you for applying
> >> >> > rcu_dereference() to something not marked __rcu, so...  ;-)
> >> >> 
> >> >> Right, hence the need for a cleanup ;)
> >> >> 
> >> >> My question was more if it understood arrays, though. I.e., that
> >> >> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
> >> >> array... Or am I maybe thinking that tool is way smarter than it is, and
> >> >> it just complains for any access to that field that doesn't use
> >> >> rcu_dereference()?
> >> >
> >> > I believe that sparse will know about the pointers being __rcu, but
> >> > not the array.  Unless you mark both levels.
> >> 
> >> Hi Paul
> >> 
> >> One more question, since I started adding the annotations: We are
> >> currently swapping out the pointers using xchg():
> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555
> >> 
> >> and even cmpxchg():
> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831
> >> 
> >> Sparse complains about these if I add the __rcu annotation to the
> >> definition (which otherwise works just fine with the double-pointer,
> >> BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
> >> atomic swaps or something? Or do we just keep the regular xchg() and
> >> ignore those particular sparse warnings?
> >
> > Sounds like I need to supply a unrcu_pointer() macro or some such.
> > This would operate something like the current open-coded casts
> > in __rcu_dereference_protected().
> 
> So with that, I would turn the existing:
> 
> 	dev = READ_ONCE(dtab->netdev_map[i]);
> 	if (!dev || netdev != dev->dev)
> 		continue;
> 	odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
> 
> into:
> 
> 	dev = rcu_dereference(dtab->netdev_map[i]);
> 	if (!dev || netdev != dev->dev)
> 		continue;
> 	odev = cmpxchg(unrcu_pointer(&dtab->netdev_map[i]), dev, NULL);
> 
> 
> and with a _check version:
> 
> 	old_dev = xchg(unrcu_pointer_check(&dtab->netdev_map[k], rcu_read_lock_bh_held()), NULL);
> 
> right?
> 
> Or would it be:
> 	odev = cmpxchg(&unrcu_pointer(dtab->netdev_map[i]), dev, NULL);
> ?
> 
> > Would something like that work for you?
> 
> Yeah, I believe it would :)

Except that I was forgetting that the __rcu decorates the pointed-to
data rather than the pointer itself.  :-/

But that is actually easier, as you can follow the example of
rcu_assign_pointer(), namely using RCU_INITIALIZER().

So like this:

	odev = cmpxchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev), NULL);

I -think- that the NULL doesn't need an RCU_INITIALIZER(), but it is
of course sparse's opinion that matters.

And of course like this:

	old_dev = xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap));

Does that work, or am I still confused?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-21 20:51                                             ` Paul E. McKenney
@ 2021-04-21 21:10                                               ` Toke Høiland-Jørgensen
  2021-04-21 21:30                                                 ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-21 21:10 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Wed, Apr 21, 2021 at 09:59:55PM +0200, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Wed, Apr 21, 2021 at 04:24:41PM +0200, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> 
>> >> >> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> 
>> >> >> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> 
>> >> >> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> >> 
>> >> >> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> >> >> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> >> >> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
>> >> >> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >> >> >> >> >> >> > > >   
>> >> >> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> >> >> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> >> >> > > > >  
>> >> >> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> >> >> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> >> >> >> >> >> > > > >> >     
>> >> >> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> >> >> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> >> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> >> >> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
>> >> >> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> >> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
>> >> >> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
>> >> >> >> >> >> >> > > > >> > >> >  	int i;
>> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
>> >> >> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
>> >> >> >> >> >> >> > > > >> > >> >  		return;
>> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> >> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> >> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
>> >> >> >> >> >> >> > > > >> > >> >  	}
>> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> >> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
>> >> >> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
>> >> >> >> >> >> >> > > > >> > >>     
>> >> >> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> >> >> >> >> >> > > > >> > >> > +		if (!to_send)
>> >> >> >> >> >> >> > > > >> > >> > +			goto out;
>> >> >> >> >> >> >> > > > >> > >> > +
>> >> >> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
>> >> >> >> >> >> >> > > > >> > >> > +	}
>> >> >> >> >> >> >> > > > >> > >> > +    
>> >> >> >> >> >> >> > > > >> > >> 
>> >> >> >> >> >> >> > > > >> > >> [ ... ]
>> >> >> >> >> >> >> > > > >> > >>     
>> >> >> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
>> >> >> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> >> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> >> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> >> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> >> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> >> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> >> >> >> >> >> >> > > > >> > >> > +	 *
>> >> >> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> >> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
>> >> >> >> >> >> >> > > > >> > >> >  	 */
>> >> >> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
>> >> >> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
>> >> >> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> >> >> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> >> >> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> >> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> >> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> >> >> >> >> >> > > > >> > >> 
>> >> >> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> >> >> >> >> >> >> > > > >> > >
>> >> >> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> >> >> >> >> >> > > > >> > > __dev_flush():
>> >> >> >> >> >> >> > > > >> > >
>> >> >> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> >> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> >> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> >> >> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
>> >> >> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> >> >> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > > > >> > 
>> >> >> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> >> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> >> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> >> >> >> >> >> > > > >> > performance :)    
>> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> >> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> >> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
>> >> >> >> >> >> >> > > > >> 
>> >> >> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> >> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> >> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> >> >> >> >> >> >> > > > >
>> >> >> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> >> >> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> >> >> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> >> >> >> >> >> >> > > > > are operating on.  
>> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> > > What other rcu objects it is using during flush?
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > Look at code:
>> >> >> >> >> >> >> >  kernel/bpf/cpumap.c
>> >> >> >> >> >> >> >  kernel/bpf/devmap.c
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
>> >> >> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> >> >> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> >> >> >> >> >> >> > function is __dev_map_entry_free().
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
>> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
>> >> >> >> >> >> >> > > didn't look at other drivers though.
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > Intel driver are very much in copy-paste mode.
>> >> >> >> >> >> >> >  
>> >> >> >> >> >> >> > > > >
>> >> >> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> >> >> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> >> >> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
>> >> >> >> >> >> >> > > > 
>> >> >> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> >> >> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
>> >> >> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> >> >> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
>> >> >> >> >> >> >> Updated Paul's email address.
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
>> >> >> >> >> >> >> > that is true.  But we want to keep these, as they also document the
>> >> >> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
>> >> >> >> >> >> >> > preempt-able in the future.
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
>> >> >> >> >> >> >> > > other rcu_read_lock() as-is.
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
>> >> >> >> >> >> >> > add rcu_read_lock() at least around the invocation of
>> >> >> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
>> >> >> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
>> >> >> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
>> >> >> >> >> >> >> > code path in their NAPI-loop.
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
>> >> >> >> >> >> >> > happens, but ref real-function names).
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
>> >> >> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> >> >> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> >> >> >> >> >> >> > a large RCU-read section, that include the flush operation).
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> >> >> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
>> >> >> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> >> >> >> >> >> >> > happen before we end our softirq/NAPI context.
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
>> >> >> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> >> >> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
>> >> >> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> >> >> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
>> >> >> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> >> >> >> >> >> >> > this short interval.
>> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> >> >> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
>> >> >> >> >> >> >
>> >> >> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
>> >> >> >> >> >> > code in question is something like this:
>> >> >> >> >> >> >
>> >> >> >> >> >> > 	void foo(void)
>> >> >> >> >> >> > 	{
>> >> >> >> >> >> > 		local_bh_disable();
>> >> >> >> >> >> >
>> >> >> >> >> >> > 		rcu_read_lock();
>> >> >> >> >> >> > 		p = rcu_dereference(gp);
>> >> >> >> >> >> > 		do_something_with(p);
>> >> >> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >> >> >
>> >> >> >> >> >> > 		do_something_else();
>> >> >> >> >> >> >
>> >> >> >> >> >> > 		rcu_read_lock();
>> >> >> >> >> >> > 		do_some_other_thing(p);
>> >> >> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >> >> >
>> >> >> >> >> >> > 		local_bh_enable();
>> >> >> >> >> >> > 	}
>> >> >> >> >> >> >
>> >> >> >> >> >> > 	void bar(struct blat *new_gp)
>> >> >> >> >> >> > 	{
>> >> >> >> >> >> > 		struct blat *old_gp;
>> >> >> >> >> >> >
>> >> >> >> >> >> > 		spin_lock(my_lock);
>> >> >> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
>> >> >> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
>> >> >> >> >> >> > 		spin_unlock(my_lock);
>> >> >> >> >> >> > 		synchronize_rcu();
>> >> >> >> >> >> > 		kfree(old_gp);
>> >> >> >> >> >> > 	}
>> >> >> >> >> >> 
>> >> >> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
>> >> >> >> >> >> think that's equivalent, right?). And the question is whether we need to
>> >> >> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
>> >> >> >> >> >> whole lifetime of p.
>> >> >> >> >> >
>> >> >> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
>> >> >> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
>> >> >> >> >> 
>> >> >> >> >> Right, gotcha!
>> >> >> >> >> 
>> >> >> >> >> >> > I need to check up on -rt.
>> >> >> >> >> >> >
>> >> >> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
>> >> >> >> >> >> > will look like one big RCU read-side critical section.  But don't try
>> >> >> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
>> >> >> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
>> >> >> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
>> >> >> >> >> >> 
>> >> >> >> >> >> OK. Variants of this code has been around since before then, but I
>> >> >> >> >> >> honestly have no idea what it looked like back then exactly...
>> >> >> >> >> >
>> >> >> >> >> > I know that feeling...
>> >> >> >> >> >
>> >> >> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
>> >> >> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
>> >> >> >> >> >> 
>> >> >> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
>> >> >> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
>> >> >> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
>> >> >> >> >> >> this, or are we likely to get complaints if we remove it?
>> >> >> >> >> >
>> >> >> >> >> > If you just got rid of the first rcu_read_unlock() and the second
>> >> >> >> >> > rcu_read_lock() in the code above, lockdep will understand.
>> >> >> >> >> 
>> >> >> >> >> Right, but doing so entails going through all the drivers, which is what
>> >> >> >> >> we're trying to avoid :)
>> >> >> >> >
>> >> >> >> > I was afraid of that...  ;-)
>> >> >> >> >
>> >> >> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
>> >> >> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
>> >> >> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
>> >> >> >> >> >
>> >> >> >> >> > 	p = rcu_dereference(gp);
>> >> >> >> >> >
>> >> >> >> >> > You would do this:
>> >> >> >> >> >
>> >> >> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
>> >> >> >> >> >
>> >> >> >> >> > This would be needed for mainline, regardless of -rt.
>> >> >> >> >> 
>> >> >> >> >> OK. And as far as I can tell this is harmless for code paths that call
>> >> >> >> >> the same function but from a regular rcu_read_lock()-protected section
>> >> >> >> >> instead from a bh-disabled section, right?
>> >> >> >> >
>> >> >> >> > That is correct.  That rcu_dereference_check() invocation will make
>> >> >> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
>> >> >> >> > Or both, for that matter.
>> >> >> >> 
>> >> >> >> OK, great, thank you for confirming my understanding!
>> >> >> >> 
>> >> >> >> >> What happens, BTW, if we *don't* get rid of all the existing
>> >> >> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
>> >> >> >> >> we're discussing is whether to add that second rcu_read_lock() around
>> >> >> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
>> >> >> >> >> is already there (in the particular driver we're discussing), and the
>> >> >> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
>> >> >> >> >> discussion, there really is not much point in adding that second
>> >> >> >> >> rcu_read_lock/unlock(), is there?
>> >> >> >> >
>> >> >> >> > From an algorithmic point of view, the second rcu_read_lock()
>> >> >> >> > and rcu_read_unlock() are redundant.  Of course, there are also
>> >> >> >> > software-engineering considerations, including copy-pasta issues.
>> >> >> >> >
>> >> >> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
>> >> >> >> >> already there, lockdep is not likely to complain either, so we're
>> >> >> >> >> basically fine? Except that the code is somewhat confusing as-is, of
>> >> >> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
>> >> >> >> >
>> >> >> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
>> >> >> >> > the code, fails to note the fact that softirq is disabled throughout,
>> >> >> >> > and decides that leaking a pointer from one RCU read-side critical
>> >> >> >> > section to a later one is just fine.  :-/
>> >> >> >> 
>> >> >> >> Yup, totally agreed that we need to fix this for the sake of the humans
>> >> >> >> reading the code; just wanted to make sure my understanding was correct
>> >> >> >> that we don't strictly need to do anything as far as the machines
>> >> >> >> executing it are concerned :)
>> >> >> >> 
>> >> >> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
>> >> >> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
>> >> >> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
>> >> >> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
>> >> >> >> >> should definitely clean this up.
>> >> >> >> >> 
>> >> >> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
>> >> >> >> >
>> >> >> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
>> >> >> >> > including those complaints that point out bugs.  It also might get you
>> >> >> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
>> >> >> >> 
>> >> >> >> It's not; it's the netdev_map member of this struct:
>> >> >> >> 
>> >> >> >> struct bpf_dtab {
>> >> >> >> 	struct bpf_map map;
>> >> >> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
>> >> >> >> 	struct list_head list;
>> >> >> >> 
>> >> >> >> 	/* these are only used for DEVMAP_HASH type maps */
>> >> >> >> 	struct hlist_head *dev_index_head;
>> >> >> >> 	spinlock_t index_lock;
>> >> >> >> 	unsigned int items;
>> >> >> >> 	u32 n_buckets;
>> >> >> >> };
>> >> >> >> 
>> >> >> >> Will adding __rcu to such a dynamic array member do the right thing when
>> >> >> >> paired with rcu_dereference() on array members (i.e., in place of the
>> >> >> >> READ_ONCE in the code linked above)?
>> >> >> >
>> >> >> > The only thing __rcu will do is provide information to the sparse static
>> >> >> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
>> >> >> > to a __rcu pointer.  But it is already griping at you for applying
>> >> >> > rcu_dereference() to something not marked __rcu, so...  ;-)
>> >> >> 
>> >> >> Right, hence the need for a cleanup ;)
>> >> >> 
>> >> >> My question was more if it understood arrays, though. I.e., that
>> >> >> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
>> >> >> array... Or am I maybe thinking that tool is way smarter than it is, and
>> >> >> it just complains for any access to that field that doesn't use
>> >> >> rcu_dereference()?
>> >> >
>> >> > I believe that sparse will know about the pointers being __rcu, but
>> >> > not the array.  Unless you mark both levels.
>> >> 
>> >> Hi Paul
>> >> 
>> >> One more question, since I started adding the annotations: We are
>> >> currently swapping out the pointers using xchg():
>> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555
>> >> 
>> >> and even cmpxchg():
>> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831
>> >> 
>> >> Sparse complains about these if I add the __rcu annotation to the
>> >> definition (which otherwise works just fine with the double-pointer,
>> >> BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
>> >> atomic swaps or something? Or do we just keep the regular xchg() and
>> >> ignore those particular sparse warnings?
>> >
>> > Sounds like I need to supply a unrcu_pointer() macro or some such.
>> > This would operate something like the current open-coded casts
>> > in __rcu_dereference_protected().
>> 
>> So with that, I would turn the existing:
>> 
>> 	dev = READ_ONCE(dtab->netdev_map[i]);
>> 	if (!dev || netdev != dev->dev)
>> 		continue;
>> 	odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
>> 
>> into:
>> 
>> 	dev = rcu_dereference(dtab->netdev_map[i]);
>> 	if (!dev || netdev != dev->dev)
>> 		continue;
>> 	odev = cmpxchg(unrcu_pointer(&dtab->netdev_map[i]), dev, NULL);
>> 
>> 
>> and with a _check version:
>> 
>> 	old_dev = xchg(unrcu_pointer_check(&dtab->netdev_map[k], rcu_read_lock_bh_held()), NULL);
>> 
>> right?
>> 
>> Or would it be:
>> 	odev = cmpxchg(&unrcu_pointer(dtab->netdev_map[i]), dev, NULL);
>> ?
>> 
>> > Would something like that work for you?
>> 
>> Yeah, I believe it would :)
>
> Except that I was forgetting that the __rcu decorates the pointed-to
> data rather than the pointer itself.  :-/
>
> But that is actually easier, as you can follow the example of
> rcu_assign_pointer(), namely using RCU_INITIALIZER().
>
> So like this:
>
> 	odev = cmpxchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev), NULL);
>
> I -think- that the NULL doesn't need an RCU_INITIALIZER(), but it is
> of course sparse's opinion that matters.
>
> And of course like this:
>
> 	old_dev = xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap));
>
> Does that work, or am I still confused?

That gets rid of one warning, but not the other. Before (plain xchg):

kernel/bpf/devmap.c:657:19: warning: incorrect type in initializer (different address spaces)
kernel/bpf/devmap.c:657:19:    expected struct bpf_dtab_netdev [noderef] __rcu *__ret
kernel/bpf/devmap.c:657:19:    got struct bpf_dtab_netdev *[assigned] dev
kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret

after (RCU_INITIALIZER() on the second argument to xchg):

kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret

I can get rid of that second one by marking old_dev as __rcu, but then I
get a new warning when dereferencing that in the subsequent
call_rcu()...

So I guess we still need that unrcu_pointer(), to wrap the xchg() in?

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-21 21:10                                               ` Toke Høiland-Jørgensen
@ 2021-04-21 21:30                                                 ` Paul E. McKenney
  2021-04-21 22:00                                                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-21 21:30 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Wed, Apr 21, 2021 at 11:10:38PM +0200, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Wed, Apr 21, 2021 at 09:59:55PM +0200, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Wed, Apr 21, 2021 at 04:24:41PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> 
> >> >> > On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> 
> >> >> >> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> 
> >> >> >> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> >> 
> >> >> >> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> >> >> 
> >> >> >> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> >> >> >> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> >> >> >> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
> >> >> >> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> >> >> >> >> >> >> > > >   
> >> >> >> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> >> >> >> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> >> >> >> > > > >  
> >> >> >> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> >> >> >> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> >> >> >> >> >> > > > >> >     
> >> >> >> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> >> >> >> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> >> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
> >> >> >> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
> >> >> >> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> >> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
> >> >> >> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
> >> >> >> >> >> >> >> > > > >> > >> >  	int i;
> >> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
> >> >> >> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
> >> >> >> >> >> >> >> > > > >> > >> >  		return;
> >> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> >> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> >> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
> >> >> >> >> >> >> >> > > > >> > >> >  	}
> >> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> >> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
> >> >> >> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
> >> >> >> >> >> >> >> > > > >> > >>     
> >> >> >> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> >> >> >> >> >> > > > >> > >> > +		if (!to_send)
> >> >> >> >> >> >> >> > > > >> > >> > +			goto out;
> >> >> >> >> >> >> >> > > > >> > >> > +
> >> >> >> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
> >> >> >> >> >> >> >> > > > >> > >> > +	}
> >> >> >> >> >> >> >> > > > >> > >> > +    
> >> >> >> >> >> >> >> > > > >> > >> 
> >> >> >> >> >> >> >> > > > >> > >> [ ... ]
> >> >> >> >> >> >> >> > > > >> > >>     
> >> >> >> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
> >> >> >> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> >> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> >> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> >> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> >> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> >> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> >> >> >> >> >> > > > >> > >> > +	 *
> >> >> >> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> >> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
> >> >> >> >> >> >> >> > > > >> > >> >  	 */
> >> >> >> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
> >> >> >> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
> >> >> >> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
> >> >> >> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> >> >> >> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> >> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> >> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> >> >> >> >> >> > > > >> > >> 
> >> >> >> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> >> >> >> >> >> >> >> > > > >> > >
> >> >> >> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> >> >> >> >> >> >> >> > > > >> > > __dev_flush():
> >> >> >> >> >> >> >> > > > >> > >
> >> >> >> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> >> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> >> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
> >> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> >> >> >> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
> >> >> >> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> >> >> >> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > > > >> > 
> >> >> >> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> >> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> >> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> >> >> >> >> >> > > > >> > performance :)    
> >> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> >> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> >> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
> >> >> >> >> >> >> >> > > > >> 
> >> >> >> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> >> >> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> >> >> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> >> >> >> >> >> >> >> > > > >
> >> >> >> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> >> >> >> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> >> >> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> >> >> >> >> >> >> > > > > are operating on.  
> >> >> >> >> >> >> >> > >
> >> >> >> >> >> >> >> > > What other rcu objects it is using during flush?
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > Look at code:
> >> >> >> >> >> >> >> >  kernel/bpf/cpumap.c
> >> >> >> >> >> >> >> >  kernel/bpf/devmap.c
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
> >> >> >> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> >> >> >> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> >> >> >> >> >> >> >> > function is __dev_map_entry_free().
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
> >> >> >> >> >> >> >> > >
> >> >> >> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
> >> >> >> >> >> >> >> > > didn't look at other drivers though.
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > Intel driver are very much in copy-paste mode.
> >> >> >> >> >> >> >> >  
> >> >> >> >> >> >> >> > > > >
> >> >> >> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> >> >> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
> >> >> >> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
> >> >> >> >> >> >> >> > > > 
> >> >> >> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> >> >> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
> >> >> >> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >> >> >> >> >> >> >> > >
> >> >> >> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
> >> >> >> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
> >> >> >> >> >> >> >> Updated Paul's email address.
> >> >> >> >> >> >> >> 
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
> >> >> >> >> >> >> >> > that is true.  But we want to keep these, as they also document the
> >> >> >> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
> >> >> >> >> >> >> >> > preempt-able in the future.
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
> >> >> >> >> >> >> >> > > other rcu_read_lock() as-is.
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
> >> >> >> >> >> >> >> > add rcu_read_lock() at least around the invocation of
> >> >> >> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
> >> >> >> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
> >> >> >> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
> >> >> >> >> >> >> >> > code path in their NAPI-loop.
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
> >> >> >> >> >> >> >> > happens, but ref real-function names).
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
> >> >> >> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> >> >> >> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> >> >> >> >> >> >> >> > a large RCU-read section, that include the flush operation).
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> >> >> >> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
> >> >> >> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> >> >> >> >> >> >> >> > happen before we end our softirq/NAPI context.
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
> >> >> >> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> >> >> >> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
> >> >> >> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> >> >> >> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
> >> >> >> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> >> >> >> >> >> >> >> > this short interval.
> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> >> >> >> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
> >> >> >> >> >> >> > code in question is something like this:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 	void foo(void)
> >> >> >> >> >> >> > 	{
> >> >> >> >> >> >> > 		local_bh_disable();
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 		rcu_read_lock();
> >> >> >> >> >> >> > 		p = rcu_dereference(gp);
> >> >> >> >> >> >> > 		do_something_with(p);
> >> >> >> >> >> >> > 		rcu_read_unlock();
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 		do_something_else();
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 		rcu_read_lock();
> >> >> >> >> >> >> > 		do_some_other_thing(p);
> >> >> >> >> >> >> > 		rcu_read_unlock();
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 		local_bh_enable();
> >> >> >> >> >> >> > 	}
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 	void bar(struct blat *new_gp)
> >> >> >> >> >> >> > 	{
> >> >> >> >> >> >> > 		struct blat *old_gp;
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 		spin_lock(my_lock);
> >> >> >> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> >> >> >> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
> >> >> >> >> >> >> > 		spin_unlock(my_lock);
> >> >> >> >> >> >> > 		synchronize_rcu();
> >> >> >> >> >> >> > 		kfree(old_gp);
> >> >> >> >> >> >> > 	}
> >> >> >> >> >> >> 
> >> >> >> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
> >> >> >> >> >> >> think that's equivalent, right?). And the question is whether we need to
> >> >> >> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
> >> >> >> >> >> >> whole lifetime of p.
> >> >> >> >> >> >
> >> >> >> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
> >> >> >> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
> >> >> >> >> >> 
> >> >> >> >> >> Right, gotcha!
> >> >> >> >> >> 
> >> >> >> >> >> >> > I need to check up on -rt.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
> >> >> >> >> >> >> > will look like one big RCU read-side critical section.  But don't try
> >> >> >> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> >> >> >> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> >> >> >> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
> >> >> >> >> >> >> 
> >> >> >> >> >> >> OK. Variants of this code has been around since before then, but I
> >> >> >> >> >> >> honestly have no idea what it looked like back then exactly...
> >> >> >> >> >> >
> >> >> >> >> >> > I know that feeling...
> >> >> >> >> >> >
> >> >> >> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
> >> >> >> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
> >> >> >> >> >> >> 
> >> >> >> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
> >> >> >> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
> >> >> >> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
> >> >> >> >> >> >> this, or are we likely to get complaints if we remove it?
> >> >> >> >> >> >
> >> >> >> >> >> > If you just got rid of the first rcu_read_unlock() and the second
> >> >> >> >> >> > rcu_read_lock() in the code above, lockdep will understand.
> >> >> >> >> >> 
> >> >> >> >> >> Right, but doing so entails going through all the drivers, which is what
> >> >> >> >> >> we're trying to avoid :)
> >> >> >> >> >
> >> >> >> >> > I was afraid of that...  ;-)
> >> >> >> >> >
> >> >> >> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
> >> >> >> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
> >> >> >> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
> >> >> >> >> >> >
> >> >> >> >> >> > 	p = rcu_dereference(gp);
> >> >> >> >> >> >
> >> >> >> >> >> > You would do this:
> >> >> >> >> >> >
> >> >> >> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
> >> >> >> >> >> >
> >> >> >> >> >> > This would be needed for mainline, regardless of -rt.
> >> >> >> >> >> 
> >> >> >> >> >> OK. And as far as I can tell this is harmless for code paths that call
> >> >> >> >> >> the same function but from a regular rcu_read_lock()-protected section
> >> >> >> >> >> instead from a bh-disabled section, right?
> >> >> >> >> >
> >> >> >> >> > That is correct.  That rcu_dereference_check() invocation will make
> >> >> >> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
> >> >> >> >> > Or both, for that matter.
> >> >> >> >> 
> >> >> >> >> OK, great, thank you for confirming my understanding!
> >> >> >> >> 
> >> >> >> >> >> What happens, BTW, if we *don't* get rid of all the existing
> >> >> >> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
> >> >> >> >> >> we're discussing is whether to add that second rcu_read_lock() around
> >> >> >> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
> >> >> >> >> >> is already there (in the particular driver we're discussing), and the
> >> >> >> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
> >> >> >> >> >> discussion, there really is not much point in adding that second
> >> >> >> >> >> rcu_read_lock/unlock(), is there?
> >> >> >> >> >
> >> >> >> >> > From an algorithmic point of view, the second rcu_read_lock()
> >> >> >> >> > and rcu_read_unlock() are redundant.  Of course, there are also
> >> >> >> >> > software-engineering considerations, including copy-pasta issues.
> >> >> >> >> >
> >> >> >> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
> >> >> >> >> >> already there, lockdep is not likely to complain either, so we're
> >> >> >> >> >> basically fine? Except that the code is somewhat confusing as-is, of
> >> >> >> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
> >> >> >> >> >
> >> >> >> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
> >> >> >> >> > the code, fails to note the fact that softirq is disabled throughout,
> >> >> >> >> > and decides that leaking a pointer from one RCU read-side critical
> >> >> >> >> > section to a later one is just fine.  :-/
> >> >> >> >> 
> >> >> >> >> Yup, totally agreed that we need to fix this for the sake of the humans
> >> >> >> >> reading the code; just wanted to make sure my understanding was correct
> >> >> >> >> that we don't strictly need to do anything as far as the machines
> >> >> >> >> executing it are concerned :)
> >> >> >> >> 
> >> >> >> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
> >> >> >> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
> >> >> >> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
> >> >> >> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
> >> >> >> >> >> should definitely clean this up.
> >> >> >> >> >> 
> >> >> >> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
> >> >> >> >> >
> >> >> >> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
> >> >> >> >> > including those complaints that point out bugs.  It also might get you
> >> >> >> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
> >> >> >> >> 
> >> >> >> >> It's not; it's the netdev_map member of this struct:
> >> >> >> >> 
> >> >> >> >> struct bpf_dtab {
> >> >> >> >> 	struct bpf_map map;
> >> >> >> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
> >> >> >> >> 	struct list_head list;
> >> >> >> >> 
> >> >> >> >> 	/* these are only used for DEVMAP_HASH type maps */
> >> >> >> >> 	struct hlist_head *dev_index_head;
> >> >> >> >> 	spinlock_t index_lock;
> >> >> >> >> 	unsigned int items;
> >> >> >> >> 	u32 n_buckets;
> >> >> >> >> };
> >> >> >> >> 
> >> >> >> >> Will adding __rcu to such a dynamic array member do the right thing when
> >> >> >> >> paired with rcu_dereference() on array members (i.e., in place of the
> >> >> >> >> READ_ONCE in the code linked above)?
> >> >> >> >
> >> >> >> > The only thing __rcu will do is provide information to the sparse static
> >> >> >> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
> >> >> >> > to a __rcu pointer.  But it is already griping at you for applying
> >> >> >> > rcu_dereference() to something not marked __rcu, so...  ;-)
> >> >> >> 
> >> >> >> Right, hence the need for a cleanup ;)
> >> >> >> 
> >> >> >> My question was more if it understood arrays, though. I.e., that
> >> >> >> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
> >> >> >> array... Or am I maybe thinking that tool is way smarter than it is, and
> >> >> >> it just complains for any access to that field that doesn't use
> >> >> >> rcu_dereference()?
> >> >> >
> >> >> > I believe that sparse will know about the pointers being __rcu, but
> >> >> > not the array.  Unless you mark both levels.
> >> >> 
> >> >> Hi Paul
> >> >> 
> >> >> One more question, since I started adding the annotations: We are
> >> >> currently swapping out the pointers using xchg():
> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555
> >> >> 
> >> >> and even cmpxchg():
> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831
> >> >> 
> >> >> Sparse complains about these if I add the __rcu annotation to the
> >> >> definition (which otherwise works just fine with the double-pointer,
> >> >> BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
> >> >> atomic swaps or something? Or do we just keep the regular xchg() and
> >> >> ignore those particular sparse warnings?
> >> >
> >> > Sounds like I need to supply a unrcu_pointer() macro or some such.
> >> > This would operate something like the current open-coded casts
> >> > in __rcu_dereference_protected().
> >> 
> >> So with that, I would turn the existing:
> >> 
> >> 	dev = READ_ONCE(dtab->netdev_map[i]);
> >> 	if (!dev || netdev != dev->dev)
> >> 		continue;
> >> 	odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
> >> 
> >> into:
> >> 
> >> 	dev = rcu_dereference(dtab->netdev_map[i]);
> >> 	if (!dev || netdev != dev->dev)
> >> 		continue;
> >> 	odev = cmpxchg(unrcu_pointer(&dtab->netdev_map[i]), dev, NULL);
> >> 
> >> 
> >> and with a _check version:
> >> 
> >> 	old_dev = xchg(unrcu_pointer_check(&dtab->netdev_map[k], rcu_read_lock_bh_held()), NULL);
> >> 
> >> right?
> >> 
> >> Or would it be:
> >> 	odev = cmpxchg(&unrcu_pointer(dtab->netdev_map[i]), dev, NULL);
> >> ?
> >> 
> >> > Would something like that work for you?
> >> 
> >> Yeah, I believe it would :)
> >
> > Except that I was forgetting that the __rcu decorates the pointed-to
> > data rather than the pointer itself.  :-/
> >
> > But that is actually easier, as you can follow the example of
> > rcu_assign_pointer(), namely using RCU_INITIALIZER().
> >
> > So like this:
> >
> > 	odev = cmpxchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev), NULL);
> >
> > I -think- that the NULL doesn't need an RCU_INITIALIZER(), but it is
> > of course sparse's opinion that matters.
> >
> > And of course like this:
> >
> > 	old_dev = xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap));
> >
> > Does that work, or am I still confused?
> 
> That gets rid of one warning, but not the other. Before (plain xchg):
> 
> kernel/bpf/devmap.c:657:19: warning: incorrect type in initializer (different address spaces)
> kernel/bpf/devmap.c:657:19:    expected struct bpf_dtab_netdev [noderef] __rcu *__ret
> kernel/bpf/devmap.c:657:19:    got struct bpf_dtab_netdev *[assigned] dev
> kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
> kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
> kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret
> 
> after (RCU_INITIALIZER() on the second argument to xchg):
> 
> kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
> kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
> kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret
> 
> I can get rid of that second one by marking old_dev as __rcu, but then I
> get a new warning when dereferencing that in the subsequent
> call_rcu()...
> 
> So I guess we still need that unrcu_pointer(), to wrap the xchg() in?

Well, at least this use case permits an lvalue.  ;-)

Please see below for an untested patch intended to permit the following:

	old_dev = unrcu_pointer(xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap)));

Does that do the trick?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1199ffd305d1..a10480f2b4ef 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -363,6 +363,20 @@ static inline void rcu_preempt_sleep_check(void) { }
 #define rcu_check_sparse(p, space)
 #endif /* #else #ifdef __CHECKER__ */
 
+/**
+ * unrcu_pointer - mark a pointer as not being RCU protected
+ * @p: pointer needing to lose its __rcu property
+ *
+ * Converts @p from an __rcu pointer to a __kernel pointer.
+ * This allows an __rcu pointer to be used with xchg() and friends.
+ */
+#define unrcu_pointer(p)						\
+({									\
+	typeof(*p) *_________p1 = (typeof(*p) *__force)(p);		\
+	rcu_check_sparse(p, __rcu); 					\
+	((typeof(*p) __force __kernel *)(_________p1)); 		\
+})
+
 #define __rcu_access_pointer(p, space) \
 ({ \
 	typeof(*p) *_________p1 = (typeof(*p) *__force)READ_ONCE(p); \

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-21 21:30                                                 ` Paul E. McKenney
@ 2021-04-21 22:00                                                   ` Toke Høiland-Jørgensen
  2021-04-21 22:31                                                     ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-21 22:00 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Wed, Apr 21, 2021 at 11:10:38PM +0200, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Wed, Apr 21, 2021 at 09:59:55PM +0200, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Wed, Apr 21, 2021 at 04:24:41PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> 
>> >> >> > On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> 
>> >> >> >> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> 
>> >> >> >> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> >> 
>> >> >> >> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> >> >> >> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> >> >> >> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
>> >> >> >> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >> >> >> >> >> >> >> > > >   
>> >> >> >> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> >> >> >> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> >> >> >> > > > >  
>> >> >> >> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> >> >> >> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> >> >> >> >> >> >> > > > >> >     
>> >> >> >> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> >> >> >> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> >> >> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> >> >> >> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
>> >> >> >> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> >> >> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
>> >> >> >> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
>> >> >> >> >> >> >> >> > > > >> > >> >  	int i;
>> >> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
>> >> >> >> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
>> >> >> >> >> >> >> >> > > > >> > >> >  		return;
>> >> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> >> >> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> >> >> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
>> >> >> >> >> >> >> >> > > > >> > >> >  	}
>> >> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> >> >> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
>> >> >> >> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
>> >> >> >> >> >> >> >> > > > >> > >>     
>> >> >> >> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> >> >> >> >> >> >> > > > >> > >> > +		if (!to_send)
>> >> >> >> >> >> >> >> > > > >> > >> > +			goto out;
>> >> >> >> >> >> >> >> > > > >> > >> > +
>> >> >> >> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
>> >> >> >> >> >> >> >> > > > >> > >> > +	}
>> >> >> >> >> >> >> >> > > > >> > >> > +    
>> >> >> >> >> >> >> >> > > > >> > >> 
>> >> >> >> >> >> >> >> > > > >> > >> [ ... ]
>> >> >> >> >> >> >> >> > > > >> > >>     
>> >> >> >> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
>> >> >> >> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> >> >> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> >> >> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> >> >> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> >> >> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> >> >> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> >> >> >> >> >> >> >> > > > >> > >> > +	 *
>> >> >> >> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> >> >> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
>> >> >> >> >> >> >> >> > > > >> > >> >  	 */
>> >> >> >> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
>> >> >> >> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
>> >> >> >> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> >> >> >> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> >> >> >> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> >> >> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> >> >> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> >> >> >> >> >> >> > > > >> > >> 
>> >> >> >> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> >> >> >> >> >> >> >> > > > >> > >
>> >> >> >> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> >> >> >> >> >> >> > > > >> > > __dev_flush():
>> >> >> >> >> >> >> >> > > > >> > >
>> >> >> >> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> >> >> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> >> >> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> >> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> >> >> >> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
>> >> >> >> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> >> >> >> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > > > >> > 
>> >> >> >> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> >> >> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> >> >> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> >> >> >> >> >> >> > > > >> > performance :)    
>> >> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> >> >> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> >> >> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
>> >> >> >> >> >> >> >> > > > >> 
>> >> >> >> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> >> >> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> >> >> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> >> >> >> >> >> >> >> > > > >
>> >> >> >> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> >> >> >> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> >> >> >> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> >> >> >> >> >> >> >> > > > > are operating on.  
>> >> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> >> > > What other rcu objects it is using during flush?
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > Look at code:
>> >> >> >> >> >> >> >> >  kernel/bpf/cpumap.c
>> >> >> >> >> >> >> >> >  kernel/bpf/devmap.c
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
>> >> >> >> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> >> >> >> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> >> >> >> >> >> >> >> > function is __dev_map_entry_free().
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
>> >> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
>> >> >> >> >> >> >> >> > > didn't look at other drivers though.
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > Intel driver are very much in copy-paste mode.
>> >> >> >> >> >> >> >> >  
>> >> >> >> >> >> >> >> > > > >
>> >> >> >> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> >> >> >> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> >> >> >> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
>> >> >> >> >> >> >> >> > > > 
>> >> >> >> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> >> >> >> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
>> >> >> >> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> >> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> >> >> >> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
>> >> >> >> >> >> >> >> Updated Paul's email address.
>> >> >> >> >> >> >> >> 
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
>> >> >> >> >> >> >> >> > that is true.  But we want to keep these, as they also document the
>> >> >> >> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
>> >> >> >> >> >> >> >> > preempt-able in the future.
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
>> >> >> >> >> >> >> >> > > other rcu_read_lock() as-is.
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
>> >> >> >> >> >> >> >> > add rcu_read_lock() at least around the invocation of
>> >> >> >> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
>> >> >> >> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
>> >> >> >> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
>> >> >> >> >> >> >> >> > code path in their NAPI-loop.
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
>> >> >> >> >> >> >> >> > happens, but ref real-function names).
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
>> >> >> >> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> >> >> >> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> >> >> >> >> >> >> >> > a large RCU-read section, that include the flush operation).
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> >> >> >> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
>> >> >> >> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> >> >> >> >> >> >> >> > happen before we end our softirq/NAPI context.
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
>> >> >> >> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> >> >> >> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
>> >> >> >> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> >> >> >> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
>> >> >> >> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> >> >> >> >> >> >> >> > this short interval.
>> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> >> >> >> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
>> >> >> >> >> >> >> > code in question is something like this:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 	void foo(void)
>> >> >> >> >> >> >> > 	{
>> >> >> >> >> >> >> > 		local_bh_disable();
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 		rcu_read_lock();
>> >> >> >> >> >> >> > 		p = rcu_dereference(gp);
>> >> >> >> >> >> >> > 		do_something_with(p);
>> >> >> >> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 		do_something_else();
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 		rcu_read_lock();
>> >> >> >> >> >> >> > 		do_some_other_thing(p);
>> >> >> >> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 		local_bh_enable();
>> >> >> >> >> >> >> > 	}
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 	void bar(struct blat *new_gp)
>> >> >> >> >> >> >> > 	{
>> >> >> >> >> >> >> > 		struct blat *old_gp;
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 		spin_lock(my_lock);
>> >> >> >> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
>> >> >> >> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
>> >> >> >> >> >> >> > 		spin_unlock(my_lock);
>> >> >> >> >> >> >> > 		synchronize_rcu();
>> >> >> >> >> >> >> > 		kfree(old_gp);
>> >> >> >> >> >> >> > 	}
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
>> >> >> >> >> >> >> think that's equivalent, right?). And the question is whether we need to
>> >> >> >> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
>> >> >> >> >> >> >> whole lifetime of p.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
>> >> >> >> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
>> >> >> >> >> >> 
>> >> >> >> >> >> Right, gotcha!
>> >> >> >> >> >> 
>> >> >> >> >> >> >> > I need to check up on -rt.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
>> >> >> >> >> >> >> > will look like one big RCU read-side critical section.  But don't try
>> >> >> >> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
>> >> >> >> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
>> >> >> >> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> OK. Variants of this code has been around since before then, but I
>> >> >> >> >> >> >> honestly have no idea what it looked like back then exactly...
>> >> >> >> >> >> >
>> >> >> >> >> >> > I know that feeling...
>> >> >> >> >> >> >
>> >> >> >> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
>> >> >> >> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
>> >> >> >> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
>> >> >> >> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
>> >> >> >> >> >> >> this, or are we likely to get complaints if we remove it?
>> >> >> >> >> >> >
>> >> >> >> >> >> > If you just got rid of the first rcu_read_unlock() and the second
>> >> >> >> >> >> > rcu_read_lock() in the code above, lockdep will understand.
>> >> >> >> >> >> 
>> >> >> >> >> >> Right, but doing so entails going through all the drivers, which is what
>> >> >> >> >> >> we're trying to avoid :)
>> >> >> >> >> >
>> >> >> >> >> > I was afraid of that...  ;-)
>> >> >> >> >> >
>> >> >> >> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
>> >> >> >> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
>> >> >> >> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
>> >> >> >> >> >> >
>> >> >> >> >> >> > 	p = rcu_dereference(gp);
>> >> >> >> >> >> >
>> >> >> >> >> >> > You would do this:
>> >> >> >> >> >> >
>> >> >> >> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
>> >> >> >> >> >> >
>> >> >> >> >> >> > This would be needed for mainline, regardless of -rt.
>> >> >> >> >> >> 
>> >> >> >> >> >> OK. And as far as I can tell this is harmless for code paths that call
>> >> >> >> >> >> the same function but from a regular rcu_read_lock()-protected section
>> >> >> >> >> >> instead from a bh-disabled section, right?
>> >> >> >> >> >
>> >> >> >> >> > That is correct.  That rcu_dereference_check() invocation will make
>> >> >> >> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
>> >> >> >> >> > Or both, for that matter.
>> >> >> >> >> 
>> >> >> >> >> OK, great, thank you for confirming my understanding!
>> >> >> >> >> 
>> >> >> >> >> >> What happens, BTW, if we *don't* get rid of all the existing
>> >> >> >> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
>> >> >> >> >> >> we're discussing is whether to add that second rcu_read_lock() around
>> >> >> >> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
>> >> >> >> >> >> is already there (in the particular driver we're discussing), and the
>> >> >> >> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
>> >> >> >> >> >> discussion, there really is not much point in adding that second
>> >> >> >> >> >> rcu_read_lock/unlock(), is there?
>> >> >> >> >> >
>> >> >> >> >> > From an algorithmic point of view, the second rcu_read_lock()
>> >> >> >> >> > and rcu_read_unlock() are redundant.  Of course, there are also
>> >> >> >> >> > software-engineering considerations, including copy-pasta issues.
>> >> >> >> >> >
>> >> >> >> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
>> >> >> >> >> >> already there, lockdep is not likely to complain either, so we're
>> >> >> >> >> >> basically fine? Except that the code is somewhat confusing as-is, of
>> >> >> >> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
>> >> >> >> >> >
>> >> >> >> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
>> >> >> >> >> > the code, fails to note the fact that softirq is disabled throughout,
>> >> >> >> >> > and decides that leaking a pointer from one RCU read-side critical
>> >> >> >> >> > section to a later one is just fine.  :-/
>> >> >> >> >> 
>> >> >> >> >> Yup, totally agreed that we need to fix this for the sake of the humans
>> >> >> >> >> reading the code; just wanted to make sure my understanding was correct
>> >> >> >> >> that we don't strictly need to do anything as far as the machines
>> >> >> >> >> executing it are concerned :)
>> >> >> >> >> 
>> >> >> >> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
>> >> >> >> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
>> >> >> >> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
>> >> >> >> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
>> >> >> >> >> >> should definitely clean this up.
>> >> >> >> >> >> 
>> >> >> >> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
>> >> >> >> >> >
>> >> >> >> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
>> >> >> >> >> > including those complaints that point out bugs.  It also might get you
>> >> >> >> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
>> >> >> >> >> 
>> >> >> >> >> It's not; it's the netdev_map member of this struct:
>> >> >> >> >> 
>> >> >> >> >> struct bpf_dtab {
>> >> >> >> >> 	struct bpf_map map;
>> >> >> >> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
>> >> >> >> >> 	struct list_head list;
>> >> >> >> >> 
>> >> >> >> >> 	/* these are only used for DEVMAP_HASH type maps */
>> >> >> >> >> 	struct hlist_head *dev_index_head;
>> >> >> >> >> 	spinlock_t index_lock;
>> >> >> >> >> 	unsigned int items;
>> >> >> >> >> 	u32 n_buckets;
>> >> >> >> >> };
>> >> >> >> >> 
>> >> >> >> >> Will adding __rcu to such a dynamic array member do the right thing when
>> >> >> >> >> paired with rcu_dereference() on array members (i.e., in place of the
>> >> >> >> >> READ_ONCE in the code linked above)?
>> >> >> >> >
>> >> >> >> > The only thing __rcu will do is provide information to the sparse static
>> >> >> >> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
>> >> >> >> > to a __rcu pointer.  But it is already griping at you for applying
>> >> >> >> > rcu_dereference() to something not marked __rcu, so...  ;-)
>> >> >> >> 
>> >> >> >> Right, hence the need for a cleanup ;)
>> >> >> >> 
>> >> >> >> My question was more if it understood arrays, though. I.e., that
>> >> >> >> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
>> >> >> >> array... Or am I maybe thinking that tool is way smarter than it is, and
>> >> >> >> it just complains for any access to that field that doesn't use
>> >> >> >> rcu_dereference()?
>> >> >> >
>> >> >> > I believe that sparse will know about the pointers being __rcu, but
>> >> >> > not the array.  Unless you mark both levels.
>> >> >> 
>> >> >> Hi Paul
>> >> >> 
>> >> >> One more question, since I started adding the annotations: We are
>> >> >> currently swapping out the pointers using xchg():
>> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555
>> >> >> 
>> >> >> and even cmpxchg():
>> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831
>> >> >> 
>> >> >> Sparse complains about these if I add the __rcu annotation to the
>> >> >> definition (which otherwise works just fine with the double-pointer,
>> >> >> BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
>> >> >> atomic swaps or something? Or do we just keep the regular xchg() and
>> >> >> ignore those particular sparse warnings?
>> >> >
>> >> > Sounds like I need to supply a unrcu_pointer() macro or some such.
>> >> > This would operate something like the current open-coded casts
>> >> > in __rcu_dereference_protected().
>> >> 
>> >> So with that, I would turn the existing:
>> >> 
>> >> 	dev = READ_ONCE(dtab->netdev_map[i]);
>> >> 	if (!dev || netdev != dev->dev)
>> >> 		continue;
>> >> 	odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
>> >> 
>> >> into:
>> >> 
>> >> 	dev = rcu_dereference(dtab->netdev_map[i]);
>> >> 	if (!dev || netdev != dev->dev)
>> >> 		continue;
>> >> 	odev = cmpxchg(unrcu_pointer(&dtab->netdev_map[i]), dev, NULL);
>> >> 
>> >> 
>> >> and with a _check version:
>> >> 
>> >> 	old_dev = xchg(unrcu_pointer_check(&dtab->netdev_map[k], rcu_read_lock_bh_held()), NULL);
>> >> 
>> >> right?
>> >> 
>> >> Or would it be:
>> >> 	odev = cmpxchg(&unrcu_pointer(dtab->netdev_map[i]), dev, NULL);
>> >> ?
>> >> 
>> >> > Would something like that work for you?
>> >> 
>> >> Yeah, I believe it would :)
>> >
>> > Except that I was forgetting that the __rcu decorates the pointed-to
>> > data rather than the pointer itself.  :-/
>> >
>> > But that is actually easier, as you can follow the example of
>> > rcu_assign_pointer(), namely using RCU_INITIALIZER().
>> >
>> > So like this:
>> >
>> > 	odev = cmpxchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev), NULL);
>> >
>> > I -think- that the NULL doesn't need an RCU_INITIALIZER(), but it is
>> > of course sparse's opinion that matters.
>> >
>> > And of course like this:
>> >
>> > 	old_dev = xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap));
>> >
>> > Does that work, or am I still confused?
>> 
>> That gets rid of one warning, but not the other. Before (plain xchg):
>> 
>> kernel/bpf/devmap.c:657:19: warning: incorrect type in initializer (different address spaces)
>> kernel/bpf/devmap.c:657:19:    expected struct bpf_dtab_netdev [noderef] __rcu *__ret
>> kernel/bpf/devmap.c:657:19:    got struct bpf_dtab_netdev *[assigned] dev
>> kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
>> kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
>> kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret
>> 
>> after (RCU_INITIALIZER() on the second argument to xchg):
>> 
>> kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
>> kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
>> kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret
>> 
>> I can get rid of that second one by marking old_dev as __rcu, but then I
>> get a new warning when dereferencing that in the subsequent
>> call_rcu()...
>> 
>> So I guess we still need that unrcu_pointer(), to wrap the xchg() in?
>
> Well, at least this use case permits an lvalue.  ;-)
>
> Please see below for an untested patch intended to permit the following:
>
> 	old_dev = unrcu_pointer(xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap)));
>
> Does that do the trick?

Yes, it does! With that I can mark the pointer as __rcu and get all uses
of it through sparse without complaints - awesome!

How do RCU patches usually make it into the kernel? Can you provide me
with a proper patch I can just include along with my cleanup patches
(taking it through the bpf tree)? Or do we need to go through some other
tree and wait for a merge?

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-21 22:00                                                   ` Toke Høiland-Jørgensen
@ 2021-04-21 22:31                                                     ` Paul E. McKenney
  2021-04-22 14:30                                                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2021-04-21 22:31 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

On Thu, Apr 22, 2021 at 12:00:24AM +0200, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Wed, Apr 21, 2021 at 11:10:38PM +0200, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Wed, Apr 21, 2021 at 09:59:55PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> 
> >> >> > On Wed, Apr 21, 2021 at 04:24:41PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> 
> >> >> >> > On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> 
> >> >> >> >> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> >> 
> >> >> >> >> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> >> >> 
> >> >> >> >> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> >> >> >> 
> >> >> >> >> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
> >> >> >> >> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> >> >> >> >> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
> >> >> >> >> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> >> >> >> >> >> >> >> > > >   
> >> >> >> >> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
> >> >> >> >> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> >> >> >> >> >> >> > > > >  
> >> >> >> >> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> >> >> >> >> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
> >> >> >> >> >> >> >> >> > > > >> >     
> >> >> >> >> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> >> >> >> >> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >> >> >> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
> >> >> >> >> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
> >> >> >> >> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
> >> >> >> >> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
> >> >> >> >> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
> >> >> >> >> >> >> >> >> > > > >> > >> >  	int i;
> >> >> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
> >> >> >> >> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
> >> >> >> >> >> >> >> >> > > > >> > >> >  		return;
> >> >> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
> >> >> >> >> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
> >> >> >> >> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> >> >> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
> >> >> >> >> >> >> >> >> > > > >> > >> >  	}
> >> >> >> >> >> >> >> >> > > > >> > >> >  
> >> >> >> >> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> >> >> >> >> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
> >> >> >> >> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
> >> >> >> >> >> >> >> >> > > > >> > >>     
> >> >> >> >> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> >> >> >> >> >> >> >> >> > > > >> > >> > +		if (!to_send)
> >> >> >> >> >> >> >> >> > > > >> > >> > +			goto out;
> >> >> >> >> >> >> >> >> > > > >> > >> > +
> >> >> >> >> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
> >> >> >> >> >> >> >> >> > > > >> > >> > +	}
> >> >> >> >> >> >> >> >> > > > >> > >> > +    
> >> >> >> >> >> >> >> >> > > > >> > >> 
> >> >> >> >> >> >> >> >> > > > >> > >> [ ... ]
> >> >> >> >> >> >> >> >> > > > >> > >>     
> >> >> >> >> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
> >> >> >> >> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> >> >> >> >> >> >> >> >> > > > >> > >> >  {
> >> >> >> >> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >> >> >> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> >> >> >> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> >> >> >> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> >> >> >> >> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> >> >> >> >> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
> >> >> >> >> >> >> >> >> > > > >> > >> > +	 *
> >> >> >> >> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> >> >> >> >> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
> >> >> >> >> >> >> >> >> > > > >> > >> >  	 */
> >> >> >> >> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
> >> >> >> >> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
> >> >> >> >> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
> >> >> >> >> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
> >> >> >> >> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> >> >> >> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> >> >> >> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> >> >> >> >> >> >> >> > > > >> > >> 
> >> >> >> >> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
> >> >> >> >> >> >> >> >> > > > >> > >
> >> >> >> >> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
> >> >> >> >> >> >> >> >> > > > >> > > __dev_flush():
> >> >> >> >> >> >> >> >> > > > >> > >
> >> >> >> >> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
> >> >> >> >> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
> >> >> >> >> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
> >> >> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
> >> >> >> >> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
> >> >> >> >> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
> >> >> >> >> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > > > >> > 
> >> >> >> >> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> >> >> >> >> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
> >> >> >> >> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
> >> >> >> >> >> >> >> >> > > > >> > performance :)    
> >> >> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> >> >> >> >> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> >> >> >> >> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
> >> >> >> >> >> >> >> >> > > > >> 
> >> >> >> >> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> >> >> >> >> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> >> >> >> >> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> >> >> >> >> >> >> >> >> > > > >>
> >> >> >> >> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> >> >> >> >> >> >> >> >> > > > >
> >> >> >> >> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
> >> >> >> >> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
> >> >> >> >> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> >> >> >> >> >> >> >> >> > > > > are operating on.  
> >> >> >> >> >> >> >> >> > >
> >> >> >> >> >> >> >> >> > > What other rcu objects it is using during flush?
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > Look at code:
> >> >> >> >> >> >> >> >> >  kernel/bpf/cpumap.c
> >> >> >> >> >> >> >> >> >  kernel/bpf/devmap.c
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
> >> >> >> >> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
> >> >> >> >> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
> >> >> >> >> >> >> >> >> > function is __dev_map_entry_free().
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
> >> >> >> >> >> >> >> >> > >
> >> >> >> >> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
> >> >> >> >> >> >> >> >> > > didn't look at other drivers though.
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > Intel driver are very much in copy-paste mode.
> >> >> >> >> >> >> >> >> >  
> >> >> >> >> >> >> >> >> > > > >
> >> >> >> >> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> >> >> >> >> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
> >> >> >> >> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
> >> >> >> >> >> >> >> >> > > > 
> >> >> >> >> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> >> >> >> >> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
> >> >> >> >> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
> >> >> >> >> >> >> >> >> > >
> >> >> >> >> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
> >> >> >> >> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
> >> >> >> >> >> >> >> >> Updated Paul's email address.
> >> >> >> >> >> >> >> >> 
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
> >> >> >> >> >> >> >> >> > that is true.  But we want to keep these, as they also document the
> >> >> >> >> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
> >> >> >> >> >> >> >> >> > preempt-able in the future.
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
> >> >> >> >> >> >> >> >> > > other rcu_read_lock() as-is.
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
> >> >> >> >> >> >> >> >> > add rcu_read_lock() at least around the invocation of
> >> >> >> >> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
> >> >> >> >> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
> >> >> >> >> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
> >> >> >> >> >> >> >> >> > code path in their NAPI-loop.
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
> >> >> >> >> >> >> >> >> > happens, but ref real-function names).
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
> >> >> >> >> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
> >> >> >> >> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
> >> >> >> >> >> >> >> >> > a large RCU-read section, that include the flush operation).
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
> >> >> >> >> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
> >> >> >> >> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
> >> >> >> >> >> >> >> >> > happen before we end our softirq/NAPI context.
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
> >> >> >> >> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
> >> >> >> >> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
> >> >> >> >> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
> >> >> >> >> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
> >> >> >> >> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
> >> >> >> >> >> >> >> >> > this short interval.
> >> >> >> >> >> >> >> >> > 
> >> >> >> >> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
> >> >> >> >> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
> >> >> >> >> >> >> >> > code in question is something like this:
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 	void foo(void)
> >> >> >> >> >> >> >> > 	{
> >> >> >> >> >> >> >> > 		local_bh_disable();
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 		rcu_read_lock();
> >> >> >> >> >> >> >> > 		p = rcu_dereference(gp);
> >> >> >> >> >> >> >> > 		do_something_with(p);
> >> >> >> >> >> >> >> > 		rcu_read_unlock();
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 		do_something_else();
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 		rcu_read_lock();
> >> >> >> >> >> >> >> > 		do_some_other_thing(p);
> >> >> >> >> >> >> >> > 		rcu_read_unlock();
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 		local_bh_enable();
> >> >> >> >> >> >> >> > 	}
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 	void bar(struct blat *new_gp)
> >> >> >> >> >> >> >> > 	{
> >> >> >> >> >> >> >> > 		struct blat *old_gp;
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 		spin_lock(my_lock);
> >> >> >> >> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
> >> >> >> >> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
> >> >> >> >> >> >> >> > 		spin_unlock(my_lock);
> >> >> >> >> >> >> >> > 		synchronize_rcu();
> >> >> >> >> >> >> >> > 		kfree(old_gp);
> >> >> >> >> >> >> >> > 	}
> >> >> >> >> >> >> >> 
> >> >> >> >> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
> >> >> >> >> >> >> >> think that's equivalent, right?). And the question is whether we need to
> >> >> >> >> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
> >> >> >> >> >> >> >> whole lifetime of p.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
> >> >> >> >> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
> >> >> >> >> >> >> 
> >> >> >> >> >> >> Right, gotcha!
> >> >> >> >> >> >> 
> >> >> >> >> >> >> >> > I need to check up on -rt.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
> >> >> >> >> >> >> >> > will look like one big RCU read-side critical section.  But don't try
> >> >> >> >> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
> >> >> >> >> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
> >> >> >> >> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
> >> >> >> >> >> >> >> 
> >> >> >> >> >> >> >> OK. Variants of this code has been around since before then, but I
> >> >> >> >> >> >> >> honestly have no idea what it looked like back then exactly...
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > I know that feeling...
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
> >> >> >> >> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
> >> >> >> >> >> >> >> 
> >> >> >> >> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
> >> >> >> >> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
> >> >> >> >> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
> >> >> >> >> >> >> >> this, or are we likely to get complaints if we remove it?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > If you just got rid of the first rcu_read_unlock() and the second
> >> >> >> >> >> >> > rcu_read_lock() in the code above, lockdep will understand.
> >> >> >> >> >> >> 
> >> >> >> >> >> >> Right, but doing so entails going through all the drivers, which is what
> >> >> >> >> >> >> we're trying to avoid :)
> >> >> >> >> >> >
> >> >> >> >> >> > I was afraid of that...  ;-)
> >> >> >> >> >> >
> >> >> >> >> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
> >> >> >> >> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
> >> >> >> >> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 	p = rcu_dereference(gp);
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > You would do this:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > This would be needed for mainline, regardless of -rt.
> >> >> >> >> >> >> 
> >> >> >> >> >> >> OK. And as far as I can tell this is harmless for code paths that call
> >> >> >> >> >> >> the same function but from a regular rcu_read_lock()-protected section
> >> >> >> >> >> >> instead from a bh-disabled section, right?
> >> >> >> >> >> >
> >> >> >> >> >> > That is correct.  That rcu_dereference_check() invocation will make
> >> >> >> >> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
> >> >> >> >> >> > Or both, for that matter.
> >> >> >> >> >> 
> >> >> >> >> >> OK, great, thank you for confirming my understanding!
> >> >> >> >> >> 
> >> >> >> >> >> >> What happens, BTW, if we *don't* get rid of all the existing
> >> >> >> >> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
> >> >> >> >> >> >> we're discussing is whether to add that second rcu_read_lock() around
> >> >> >> >> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
> >> >> >> >> >> >> is already there (in the particular driver we're discussing), and the
> >> >> >> >> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
> >> >> >> >> >> >> discussion, there really is not much point in adding that second
> >> >> >> >> >> >> rcu_read_lock/unlock(), is there?
> >> >> >> >> >> >
> >> >> >> >> >> > From an algorithmic point of view, the second rcu_read_lock()
> >> >> >> >> >> > and rcu_read_unlock() are redundant.  Of course, there are also
> >> >> >> >> >> > software-engineering considerations, including copy-pasta issues.
> >> >> >> >> >> >
> >> >> >> >> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
> >> >> >> >> >> >> already there, lockdep is not likely to complain either, so we're
> >> >> >> >> >> >> basically fine? Except that the code is somewhat confusing as-is, of
> >> >> >> >> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
> >> >> >> >> >> >
> >> >> >> >> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
> >> >> >> >> >> > the code, fails to note the fact that softirq is disabled throughout,
> >> >> >> >> >> > and decides that leaking a pointer from one RCU read-side critical
> >> >> >> >> >> > section to a later one is just fine.  :-/
> >> >> >> >> >> 
> >> >> >> >> >> Yup, totally agreed that we need to fix this for the sake of the humans
> >> >> >> >> >> reading the code; just wanted to make sure my understanding was correct
> >> >> >> >> >> that we don't strictly need to do anything as far as the machines
> >> >> >> >> >> executing it are concerned :)
> >> >> >> >> >> 
> >> >> >> >> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
> >> >> >> >> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
> >> >> >> >> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
> >> >> >> >> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
> >> >> >> >> >> >> should definitely clean this up.
> >> >> >> >> >> >> 
> >> >> >> >> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
> >> >> >> >> >> >
> >> >> >> >> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
> >> >> >> >> >> > including those complaints that point out bugs.  It also might get you
> >> >> >> >> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
> >> >> >> >> >> 
> >> >> >> >> >> It's not; it's the netdev_map member of this struct:
> >> >> >> >> >> 
> >> >> >> >> >> struct bpf_dtab {
> >> >> >> >> >> 	struct bpf_map map;
> >> >> >> >> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
> >> >> >> >> >> 	struct list_head list;
> >> >> >> >> >> 
> >> >> >> >> >> 	/* these are only used for DEVMAP_HASH type maps */
> >> >> >> >> >> 	struct hlist_head *dev_index_head;
> >> >> >> >> >> 	spinlock_t index_lock;
> >> >> >> >> >> 	unsigned int items;
> >> >> >> >> >> 	u32 n_buckets;
> >> >> >> >> >> };
> >> >> >> >> >> 
> >> >> >> >> >> Will adding __rcu to such a dynamic array member do the right thing when
> >> >> >> >> >> paired with rcu_dereference() on array members (i.e., in place of the
> >> >> >> >> >> READ_ONCE in the code linked above)?
> >> >> >> >> >
> >> >> >> >> > The only thing __rcu will do is provide information to the sparse static
> >> >> >> >> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
> >> >> >> >> > to a __rcu pointer.  But it is already griping at you for applying
> >> >> >> >> > rcu_dereference() to something not marked __rcu, so...  ;-)
> >> >> >> >> 
> >> >> >> >> Right, hence the need for a cleanup ;)
> >> >> >> >> 
> >> >> >> >> My question was more if it understood arrays, though. I.e., that
> >> >> >> >> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
> >> >> >> >> array... Or am I maybe thinking that tool is way smarter than it is, and
> >> >> >> >> it just complains for any access to that field that doesn't use
> >> >> >> >> rcu_dereference()?
> >> >> >> >
> >> >> >> > I believe that sparse will know about the pointers being __rcu, but
> >> >> >> > not the array.  Unless you mark both levels.
> >> >> >> 
> >> >> >> Hi Paul
> >> >> >> 
> >> >> >> One more question, since I started adding the annotations: We are
> >> >> >> currently swapping out the pointers using xchg():
> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555
> >> >> >> 
> >> >> >> and even cmpxchg():
> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831
> >> >> >> 
> >> >> >> Sparse complains about these if I add the __rcu annotation to the
> >> >> >> definition (which otherwise works just fine with the double-pointer,
> >> >> >> BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
> >> >> >> atomic swaps or something? Or do we just keep the regular xchg() and
> >> >> >> ignore those particular sparse warnings?
> >> >> >
> >> >> > Sounds like I need to supply a unrcu_pointer() macro or some such.
> >> >> > This would operate something like the current open-coded casts
> >> >> > in __rcu_dereference_protected().
> >> >> 
> >> >> So with that, I would turn the existing:
> >> >> 
> >> >> 	dev = READ_ONCE(dtab->netdev_map[i]);
> >> >> 	if (!dev || netdev != dev->dev)
> >> >> 		continue;
> >> >> 	odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
> >> >> 
> >> >> into:
> >> >> 
> >> >> 	dev = rcu_dereference(dtab->netdev_map[i]);
> >> >> 	if (!dev || netdev != dev->dev)
> >> >> 		continue;
> >> >> 	odev = cmpxchg(unrcu_pointer(&dtab->netdev_map[i]), dev, NULL);
> >> >> 
> >> >> 
> >> >> and with a _check version:
> >> >> 
> >> >> 	old_dev = xchg(unrcu_pointer_check(&dtab->netdev_map[k], rcu_read_lock_bh_held()), NULL);
> >> >> 
> >> >> right?
> >> >> 
> >> >> Or would it be:
> >> >> 	odev = cmpxchg(&unrcu_pointer(dtab->netdev_map[i]), dev, NULL);
> >> >> ?
> >> >> 
> >> >> > Would something like that work for you?
> >> >> 
> >> >> Yeah, I believe it would :)
> >> >
> >> > Except that I was forgetting that the __rcu decorates the pointed-to
> >> > data rather than the pointer itself.  :-/
> >> >
> >> > But that is actually easier, as you can follow the example of
> >> > rcu_assign_pointer(), namely using RCU_INITIALIZER().
> >> >
> >> > So like this:
> >> >
> >> > 	odev = cmpxchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev), NULL);
> >> >
> >> > I -think- that the NULL doesn't need an RCU_INITIALIZER(), but it is
> >> > of course sparse's opinion that matters.
> >> >
> >> > And of course like this:
> >> >
> >> > 	old_dev = xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap));
> >> >
> >> > Does that work, or am I still confused?
> >> 
> >> That gets rid of one warning, but not the other. Before (plain xchg):
> >> 
> >> kernel/bpf/devmap.c:657:19: warning: incorrect type in initializer (different address spaces)
> >> kernel/bpf/devmap.c:657:19:    expected struct bpf_dtab_netdev [noderef] __rcu *__ret
> >> kernel/bpf/devmap.c:657:19:    got struct bpf_dtab_netdev *[assigned] dev
> >> kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
> >> kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
> >> kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret
> >> 
> >> after (RCU_INITIALIZER() on the second argument to xchg):
> >> 
> >> kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
> >> kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
> >> kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret
> >> 
> >> I can get rid of that second one by marking old_dev as __rcu, but then I
> >> get a new warning when dereferencing that in the subsequent
> >> call_rcu()...
> >> 
> >> So I guess we still need that unrcu_pointer(), to wrap the xchg() in?
> >
> > Well, at least this use case permits an lvalue.  ;-)
> >
> > Please see below for an untested patch intended to permit the following:
> >
> > 	old_dev = unrcu_pointer(xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap)));
> >
> > Does that do the trick?
> 
> Yes, it does! With that I can mark the pointer as __rcu and get all uses
> of it through sparse without complaints - awesome!
> 
> How do RCU patches usually make it into the kernel? Can you provide me
> with a proper patch I can just include along with my cleanup patches
> (taking it through the bpf tree)? Or do we need to go through some other
> tree and wait for a merge?

Normally through the -rcu tree, but please feel free to pull this one
(shown formally below) along with your changes.  I have queued it in
the -rcu tree as well, but my normal process would submit it during the
v5.14 merge window, that is, not the upcoming one but the one after that.

So for example if your work makes it into the upcoming merge window,
I will drop my copy of my patch when I rebase onto v5.13-rc1.

							Thanx, Paul

------------------------------------------------------------------------

commit 0bc5db666120c8fb604520853b30c351a9659c82
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Apr 21 14:30:54 2021 -0700

    rcu: Create an unrcu_pointer() to remove __rcu from a pointer
    
    The xchg() and cmpxchg() functions are sometimes used to carry out RCU
    updates.  Unfortunately, this can result in sparse warnings for both
    the old-value and new-value arguments, as well as for the return value.
    The arguments can be dealt with using RCU_INITIALIZER():
    
            old_p = xchg(&p, RCU_INITIALIZER(new_p));
    
    But a sparse warning still remains due to assigning the __rcu pointer
    returned from xchg to the (most likely) non-__rcu pointer old_p.
    
    This commit therefore provides an unrcu_pointer() macro that strips
    the __rcu.  This macro can be used as follows:
    
            old_p = unrcu_pointer(xchg(&p, RCU_INITIALIZER(new_p)));
    
    Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1199ffd305d1..a10480f2b4ef 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -363,6 +363,20 @@ static inline void rcu_preempt_sleep_check(void) { }
 #define rcu_check_sparse(p, space)
 #endif /* #else #ifdef __CHECKER__ */
 
+/**
+ * unrcu_pointer - mark a pointer as not being RCU protected
+ * @p: pointer needing to lose its __rcu property
+ *
+ * Converts @p from an __rcu pointer to a __kernel pointer.
+ * This allows an __rcu pointer to be used with xchg() and friends.
+ */
+#define unrcu_pointer(p)						\
+({									\
+	typeof(*p) *_________p1 = (typeof(*p) *__force)(p);		\
+	rcu_check_sparse(p, __rcu); 					\
+	((typeof(*p) __force __kernel *)(_________p1)); 		\
+})
+
 #define __rcu_access_pointer(p, space) \
 ({ \
 	typeof(*p) *_________p1 = (typeof(*p) *__force)READ_ONCE(p); \

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue
  2021-04-21 22:31                                                     ` Paul E. McKenney
@ 2021-04-22 14:30                                                       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 39+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-22 14:30 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Hangbin Liu, bpf,
	netdev, Jiri Benc, Eelco Chaudron, ast, Daniel Borkmann,
	Lorenzo Bianconi, David Ahern, Andrii Nakryiko,
	Alexei Starovoitov, John Fastabend, Maciej Fijalkowski,
	Björn Töpel

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Thu, Apr 22, 2021 at 12:00:24AM +0200, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Wed, Apr 21, 2021 at 11:10:38PM +0200, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Wed, Apr 21, 2021 at 09:59:55PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> 
>> >> >> > On Wed, Apr 21, 2021 at 04:24:41PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> 
>> >> >> >> > On Tue, Apr 20, 2021 at 12:16:40AM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> 
>> >> >> >> >> > On Mon, Apr 19, 2021 at 11:21:41PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> >> 
>> >> >> >> >> >> > On Mon, Apr 19, 2021 at 08:12:27PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> > On Sat, Apr 17, 2021 at 02:27:19PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> >> >> >> 
>> >> >> >> >> >> >> >> > On Fri, Apr 16, 2021 at 11:22:52AM -0700, Martin KaFai Lau wrote:
>> >> >> >> >> >> >> >> >> On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
>> >> >> >> >> >> >> >> >> > On Thu, 15 Apr 2021 17:39:13 -0700
>> >> >> >> >> >> >> >> >> > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> >> >> >> > > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> >> >> >> >> >> >> >> >> > > >   
>> >> >> >> >> >> >> >> >> > > > > On Thu, 15 Apr 2021 10:35:51 -0700
>> >> >> >> >> >> >> >> >> > > > > Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> >> >> >> >> >> >> > > > >  
>> >> >> >> >> >> >> >> >> > > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
>> >> >> >> >> >> >> >> >> > > > >> > Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> >> >> >> >> >> >> >> > > > >> >     
>> >> >> >> >> >> >> >> >> > > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
>> >> >> >> >> >> >> >> >> > > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> >> >> >> >> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	struct net_device *dev = bq->dev;
>> >> >> >> >> >> >> >> >> > > > >> > >> > -	int sent = 0, err = 0;
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	int sent = 0, drops = 0, err = 0;
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	unsigned int cnt = bq->count;
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	int to_send = cnt;
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	int i;
>> >> >> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> >> >> > > > >> > >> > -	if (unlikely(!bq->count))
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	if (unlikely(!cnt))
>> >> >> >> >> >> >> >> >> > > > >> > >> >  		return;
>> >> >> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> >> >> > > > >> > >> > -	for (i = 0; i < bq->count; i++) {
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	for (i = 0; i < cnt; i++) {
>> >> >> >> >> >> >> >> >> > > > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
>> >> >> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> >> >> > > > >> > >> >  		prefetch(xdpf);
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	}
>> >> >> >> >> >> >> >> >> > > > >> > >> >  
>> >> >> >> >> >> >> >> >> > > > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	if (bq->xdp_prog) {    
>> >> >> >> >> >> >> >> >> > > > >> > >> bq->xdp_prog is used here
>> >> >> >> >> >> >> >> >> > > > >> > >>     
>> >> >> >> >> >> >> >> >> > > > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> >> >> >> >> >> >> >> >> > > > >> > >> > +		if (!to_send)
>> >> >> >> >> >> >> >> >> > > > >> > >> > +			goto out;
>> >> >> >> >> >> >> >> >> > > > >> > >> > +
>> >> >> >> >> >> >> >> >> > > > >> > >> > +		drops = cnt - to_send;
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	}
>> >> >> >> >> >> >> >> >> > > > >> > >> > +    
>> >> >> >> >> >> >> >> >> > > > >> > >> 
>> >> >> >> >> >> >> >> >> > > > >> > >> [ ... ]
>> >> >> >> >> >> >> >> >> > > > >> > >>     
>> >> >> >> >> >> >> >> >> > > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> >> >> >> >> > > > >> > >> > -		       struct net_device *dev_rx)
>> >> >> >> >> >> >> >> >> > > > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>> >> >> >> >> >> >> >> >> > > > >> > >> >  {
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>> >> >> >> >> >> >> >> >> > > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	 * from net_device drivers NAPI func end.
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	 *
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	 * are only ever modified together.
>> >> >> >> >> >> >> >> >> > > > >> > >> >  	 */
>> >> >> >> >> >> >> >> >> > > > >> > >> > -	if (!bq->dev_rx)
>> >> >> >> >> >> >> >> >> > > > >> > >> > +	if (!bq->dev_rx) {
>> >> >> >> >> >> >> >> >> > > > >> > >> >  		bq->dev_rx = dev_rx;
>> >> >> >> >> >> >> >> >> > > > >> > >> > +		bq->xdp_prog = xdp_prog;    
>> >> >> >> >> >> >> >> >> > > > >> > >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
>> >> >> >> >> >> >> >> >> > > > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
>> >> >> >> >> >> >> >> >> > > > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
>> >> >> >> >> >> >> >> >> > > > >> > >> 
>> >> >> >> >> >> >> >> >> > > > >> > >> e.g. what if the devmap elem gets deleted.    
>> >> >> >> >> >> >> >> >> > > > >> > >
>> >> >> >> >> >> >> >> >> > > > >> > > Jesper knows better than me. From my veiw, based on the description of
>> >> >> >> >> >> >> >> >> > > > >> > > __dev_flush():
>> >> >> >> >> >> >> >> >> > > > >> > >
>> >> >> >> >> >> >> >> >> > > > >> > > On devmap tear down we ensure the flush list is empty before completing to
>> >> >> >> >> >> >> >> >> > > > >> > > ensure all flush operations have completed. When drivers update the bpf
>> >> >> >> >> >> >> >> >> > > > >> > > program they may need to ensure any flush ops are also complete.    
>> >> >> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> >> >> > > > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > The bq->xdp_prog comes form the devmap "dev" element, and it is stored
>> >> >> >> >> >> >> >> >> > in temporarily in the "bq" structure that is only valid for this
>> >> >> >> >> >> >> >> >> > softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
>> >> >> >> >> >> >> >> >> > the the xdp_prog here, more below (and Q for Paul).
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > > > >> > 
>> >> >> >> >> >> >> >> >> > > > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
>> >> >> >> >> >> >> >> >> > > > >> > which also runs under one big rcu_read_lock(). So the storage in the
>> >> >> >> >> >> >> >> >> > > > >> > bulk queue is quite temporary, it's just used for bulking to increase
>> >> >> >> >> >> >> >> >> > > > >> > performance :)    
>> >> >> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> >> >> > > > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
>> >> >> >> >> >> >> >> >> > > > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
>> >> >> >> >> >> >> >> >> > > > >> in i40e_run_xdp() and it is fine.
>> >> >> >> >> >> >> >> >> > > > >> 
>> >> >> >> >> >> >> >> >> > > > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
>> >> >> >> >> >> >> >> >> > > > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
>> >> >> >> >> >> >> >> >> > > > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
>> >> >> >> >> >> >> >> >> > > > >>
>> >> >> >> >> >> >> >> >> > > > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
>> >> >> >> >> >> >> >> >> > > > >
>> >> >> >> >> >> >> >> >> > > > > I believed/assumed xdp_do_flush_map() was already protected under an
>> >> >> >> >> >> >> >> >> > > > > rcu_read_lock.  As the devmap and cpumap, which get called via
>> >> >> >> >> >> >> >> >> > > > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
>> >> >> >> >> >> >> >> >> > > > > are operating on.  
>> >> >> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> >> >> > > What other rcu objects it is using during flush?
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > Look at code:
>> >> >> >> >> >> >> >> >> >  kernel/bpf/cpumap.c
>> >> >> >> >> >> >> >> >> >  kernel/bpf/devmap.c
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > The devmap is filled with RCU code and complicated take-down steps.  
>> >> >> >> >> >> >> >> >> > The devmap's elements are also RCU objects and the BPF xdp_prog is
>> >> >> >> >> >> >> >> >> > embedded in this object (struct bpf_dtab_netdev).  The call_rcu
>> >> >> >> >> >> >> >> >> > function is __dev_map_entry_free().
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > > > > Perhaps it is a bug in i40e?  
>> >> >> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> >> >> > > A quick look into ixgbe falls into the same bucket.
>> >> >> >> >> >> >> >> >> > > didn't look at other drivers though.
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > Intel driver are very much in copy-paste mode.
>> >> >> >> >> >> >> >> >> >  
>> >> >> >> >> >> >> >> >> > > > >
>> >> >> >> >> >> >> >> >> > > > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
>> >> >> >> >> >> >> >> >> > > > > call, which I think means that this CPU will not go-through a RCU grace
>> >> >> >> >> >> >> >> >> > > > > period before we exit softirq, so in-practice it should be safe.  
>> >> >> >> >> >> >> >> >> > > > 
>> >> >> >> >> >> >> >> >> > > > Yup, this seems to be correct: rcu_softirq_qs() is only called between
>> >> >> >> >> >> >> >> >> > > > full invocations of the softirq handler, which for networking is
>> >> >> >> >> >> >> >> >> > > > net_rx_action(), and so translates into full NAPI poll cycles.  
>> >> >> >> >> >> >> >> >> > >
>> >> >> >> >> >> >> >> >> > > I don't know enough to comment on the rcu/softirq part, may be someone
>> >> >> >> >> >> >> >> >> > > can chime in.  There is also a recent napi_threaded_poll().
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > CC added Paul. (link to patch[1][2] for context)
>> >> >> >> >> >> >> >> >> Updated Paul's email address.
>> >> >> >> >> >> >> >> >> 
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > > If it is the case, then some of the existing rcu_read_lock() is unnecessary?
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > Well, in many cases, especially depending on how kernel is compiled,
>> >> >> >> >> >> >> >> >> > that is true.  But we want to keep these, as they also document the
>> >> >> >> >> >> >> >> >> > intend of the programmer.  And allow us to make the kernel even more
>> >> >> >> >> >> >> >> >> > preempt-able in the future.
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > > At least, it sounds incorrect to only make an exception here while keeping
>> >> >> >> >> >> >> >> >> > > other rcu_read_lock() as-is.
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > Let me be clear:  I think you have spotted a problem, and we need to
>> >> >> >> >> >> >> >> >> > add rcu_read_lock() at least around the invocation of
>> >> >> >> >> >> >> >> >> > bpf_prog_run_xdp() or before around if-statement that call
>> >> >> >> >> >> >> >> >> > dev_map_bpf_prog_run(). (Hangbin please do this in V8).
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > Thank you Martin for reviewing the code carefully enough to find this
>> >> >> >> >> >> >> >> >> > issue, that some drivers don't have a RCU-section around the full XDP
>> >> >> >> >> >> >> >> >> > code path in their NAPI-loop.
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > Question to Paul.  (I will attempt to describe in generic terms what
>> >> >> >> >> >> >> >> >> > happens, but ref real-function names).
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > We are running in softirq/NAPI context, the driver will call a
>> >> >> >> >> >> >> >> >> > bq_enqueue() function for every packet (if calling xdp_do_redirect) ,
>> >> >> >> >> >> >> >> >> > some driver wrap this with a rcu_read_lock/unlock() section (other have
>> >> >> >> >> >> >> >> >> > a large RCU-read section, that include the flush operation).
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > In the bq_enqueue() function we have a per_cpu_ptr (that store the
>> >> >> >> >> >> >> >> >> > xdp_frame packets) that will get flushed/send in the call
>> >> >> >> >> >> >> >> >> > xdp_do_flush() (that end-up calling bq_xmit_all()).  This flush will
>> >> >> >> >> >> >> >> >> > happen before we end our softirq/NAPI context.
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > The extension is that the per_cpu_ptr data structure (after this patch)
>> >> >> >> >> >> >> >> >> > store a pointer to an xdp_prog (which is a RCU object).  In the flush
>> >> >> >> >> >> >> >> >> > operation (which we will wrap with RCU-read section), we will use this
>> >> >> >> >> >> >> >> >> > xdp_prog pointer.   I can see that it is in-principle wrong to pass
>> >> >> >> >> >> >> >> >> > this-pointer between RCU-read sections, but I consider this safe as we
>> >> >> >> >> >> >> >> >> > are running under softirq/NAPI and the per_cpu_ptr is only valid in
>> >> >> >> >> >> >> >> >> > this short interval.
>> >> >> >> >> >> >> >> >> > 
>> >> >> >> >> >> >> >> >> > I claim a grace/quiescent RCU cannot happen between these two RCU-read
>> >> >> >> >> >> >> >> >> > sections, but I might be wrong? (especially in the future or for RT).
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > If I am reading this correctly (ha!), a very high-level summary of the
>> >> >> >> >> >> >> >> > code in question is something like this:
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 	void foo(void)
>> >> >> >> >> >> >> >> > 	{
>> >> >> >> >> >> >> >> > 		local_bh_disable();
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 		rcu_read_lock();
>> >> >> >> >> >> >> >> > 		p = rcu_dereference(gp);
>> >> >> >> >> >> >> >> > 		do_something_with(p);
>> >> >> >> >> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 		do_something_else();
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 		rcu_read_lock();
>> >> >> >> >> >> >> >> > 		do_some_other_thing(p);
>> >> >> >> >> >> >> >> > 		rcu_read_unlock();
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 		local_bh_enable();
>> >> >> >> >> >> >> >> > 	}
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 	void bar(struct blat *new_gp)
>> >> >> >> >> >> >> >> > 	{
>> >> >> >> >> >> >> >> > 		struct blat *old_gp;
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 		spin_lock(my_lock);
>> >> >> >> >> >> >> >> > 		old_gp = rcu_dereference_protected(gp, lock_held(my_lock));
>> >> >> >> >> >> >> >> > 		rcu_assign_pointer(gp, new_gp);
>> >> >> >> >> >> >> >> > 		spin_unlock(my_lock);
>> >> >> >> >> >> >> >> > 		synchronize_rcu();
>> >> >> >> >> >> >> >> > 		kfree(old_gp);
>> >> >> >> >> >> >> >> > 	}
>> >> >> >> >> >> >> >> 
>> >> >> >> >> >> >> >> Yeah, something like that (the object is freed using call_rcu() - but I
>> >> >> >> >> >> >> >> think that's equivalent, right?). And the question is whether we need to
>> >> >> >> >> >> >> >> extend foo() so that is has one big rcu_read_lock() that covers the
>> >> >> >> >> >> >> >> whole lifetime of p.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > Yes, use of call_rcu() is an asynchronous version of synchronize_rcu().
>> >> >> >> >> >> >> > In fact, synchronize_rcu() is implemented in terms of call_rcu().  ;-)
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> Right, gotcha!
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> >> > I need to check up on -rt.
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > But first... In recent mainline kernels, the local_bh_disable() region
>> >> >> >> >> >> >> >> > will look like one big RCU read-side critical section.  But don't try
>> >> >> >> >> >> >> >> > this prior to v4.20!!!  In v4.19 and earlier, you would need to use
>> >> >> >> >> >> >> >> > both synchronize_rcu() and synchronize_rcu_bh() to make this work, or,
>> >> >> >> >> >> >> >> > for less latency, synchronize_rcu_mult(call_rcu, call_rcu_bh).
>> >> >> >> >> >> >> >> 
>> >> >> >> >> >> >> >> OK. Variants of this code has been around since before then, but I
>> >> >> >> >> >> >> >> honestly have no idea what it looked like back then exactly...
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > I know that feeling...
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > Except that in that case, why not just drop the inner rcu_read_unlock()
>> >> >> >> >> >> >> >> > and rcu_read_lock() pair?  Awkward function boundaries or some such?
>> >> >> >> >> >> >> >> 
>> >> >> >> >> >> >> >> Well if we can just treat such a local_bh_disable()/enable() pair as the
>> >> >> >> >> >> >> >> equivalent of rcu_read_lock()/unlock() then I suppose we could just get
>> >> >> >> >> >> >> >> rid of the inner ones. What about tools like lockdep; do they understand
>> >> >> >> >> >> >> >> this, or are we likely to get complaints if we remove it?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > If you just got rid of the first rcu_read_unlock() and the second
>> >> >> >> >> >> >> > rcu_read_lock() in the code above, lockdep will understand.
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> Right, but doing so entails going through all the drivers, which is what
>> >> >> >> >> >> >> we're trying to avoid :)
>> >> >> >> >> >> >
>> >> >> >> >> >> > I was afraid of that...  ;-)
>> >> >> >> >> >> >
>> >> >> >> >> >> >> > However, if you instead get rid of -all- of the rcu_read_lock() and
>> >> >> >> >> >> >> > rcu_read_unlock() invocations in the code above, you would need to let
>> >> >> >> >> >> >> > lockdep know by adding rcu_read_lock_bh_held().  So instead of this:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 	p = rcu_dereference(gp);
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > You would do this:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 	p = rcu_dereference_check(gp, rcu_read_lock_bh_held());
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > This would be needed for mainline, regardless of -rt.
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> OK. And as far as I can tell this is harmless for code paths that call
>> >> >> >> >> >> >> the same function but from a regular rcu_read_lock()-protected section
>> >> >> >> >> >> >> instead from a bh-disabled section, right?
>> >> >> >> >> >> >
>> >> >> >> >> >> > That is correct.  That rcu_dereference_check() invocation will make
>> >> >> >> >> >> > lockdep be OK with rcu_read_lock() or with softirq being disabled.
>> >> >> >> >> >> > Or both, for that matter.
>> >> >> >> >> >> 
>> >> >> >> >> >> OK, great, thank you for confirming my understanding!
>> >> >> >> >> >> 
>> >> >> >> >> >> >> What happens, BTW, if we *don't* get rid of all the existing
>> >> >> >> >> >> >> rcu_read_lock() sections? Going back to your foo() example above, what
>> >> >> >> >> >> >> we're discussing is whether to add that second rcu_read_lock() around
>> >> >> >> >> >> >> do_some_other_thing(p). I.e., the first one around the rcu_dereference()
>> >> >> >> >> >> >> is already there (in the particular driver we're discussing), and the
>> >> >> >> >> >> >> local_bh_disable/enable() pair is already there. AFAICT from our
>> >> >> >> >> >> >> discussion, there really is not much point in adding that second
>> >> >> >> >> >> >> rcu_read_lock/unlock(), is there?
>> >> >> >> >> >> >
>> >> >> >> >> >> > From an algorithmic point of view, the second rcu_read_lock()
>> >> >> >> >> >> > and rcu_read_unlock() are redundant.  Of course, there are also
>> >> >> >> >> >> > software-engineering considerations, including copy-pasta issues.
>> >> >> >> >> >> >
>> >> >> >> >> >> >> And because that first rcu_read_lock() around the rcu_dereference() is
>> >> >> >> >> >> >> already there, lockdep is not likely to complain either, so we're
>> >> >> >> >> >> >> basically fine? Except that the code is somewhat confusing as-is, of
>> >> >> >> >> >> >> course; i.e., we should probably fix it but it's not terribly urgent. Or?
>> >> >> >> >> >> >
>> >> >> >> >> >> > I am concerned about copy-pasta-induced bugs.  Someone looks just at
>> >> >> >> >> >> > the code, fails to note the fact that softirq is disabled throughout,
>> >> >> >> >> >> > and decides that leaking a pointer from one RCU read-side critical
>> >> >> >> >> >> > section to a later one is just fine.  :-/
>> >> >> >> >> >> 
>> >> >> >> >> >> Yup, totally agreed that we need to fix this for the sake of the humans
>> >> >> >> >> >> reading the code; just wanted to make sure my understanding was correct
>> >> >> >> >> >> that we don't strictly need to do anything as far as the machines
>> >> >> >> >> >> executing it are concerned :)
>> >> >> >> >> >> 
>> >> >> >> >> >> >> Hmm, looking at it now, it seems not all the lookup code is actually
>> >> >> >> >> >> >> doing rcu_dereference() at all, but rather just a plain READ_ONCE() with
>> >> >> >> >> >> >> a comment above it saying that RCU ensures objects won't disappear[0];
>> >> >> >> >> >> >> so I suppose we're at least safe from lockdep in that sense :P - but we
>> >> >> >> >> >> >> should definitely clean this up.
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> [0] Exhibit A: https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L391
>> >> >> >> >> >> >
>> >> >> >> >> >> > That use of READ_ONCE() will definitely avoid lockdep complaints,
>> >> >> >> >> >> > including those complaints that point out bugs.  It also might get you
>> >> >> >> >> >> > sparse complaints if the RCU-protected pointer is marked with __rcu.
>> >> >> >> >> >> 
>> >> >> >> >> >> It's not; it's the netdev_map member of this struct:
>> >> >> >> >> >> 
>> >> >> >> >> >> struct bpf_dtab {
>> >> >> >> >> >> 	struct bpf_map map;
>> >> >> >> >> >> 	struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
>> >> >> >> >> >> 	struct list_head list;
>> >> >> >> >> >> 
>> >> >> >> >> >> 	/* these are only used for DEVMAP_HASH type maps */
>> >> >> >> >> >> 	struct hlist_head *dev_index_head;
>> >> >> >> >> >> 	spinlock_t index_lock;
>> >> >> >> >> >> 	unsigned int items;
>> >> >> >> >> >> 	u32 n_buckets;
>> >> >> >> >> >> };
>> >> >> >> >> >> 
>> >> >> >> >> >> Will adding __rcu to such a dynamic array member do the right thing when
>> >> >> >> >> >> paired with rcu_dereference() on array members (i.e., in place of the
>> >> >> >> >> >> READ_ONCE in the code linked above)?
>> >> >> >> >> >
>> >> >> >> >> > The only thing __rcu will do is provide information to the sparse static
>> >> >> >> >> > analysis tool.  Which will then gripe at you for applying READ_ONCE()
>> >> >> >> >> > to a __rcu pointer.  But it is already griping at you for applying
>> >> >> >> >> > rcu_dereference() to something not marked __rcu, so...  ;-)
>> >> >> >> >> 
>> >> >> >> >> Right, hence the need for a cleanup ;)
>> >> >> >> >> 
>> >> >> >> >> My question was more if it understood arrays, though. I.e., that
>> >> >> >> >> 'netdev_map' is an array of RCU pointers, not an RCU pointer to an
>> >> >> >> >> array... Or am I maybe thinking that tool is way smarter than it is, and
>> >> >> >> >> it just complains for any access to that field that doesn't use
>> >> >> >> >> rcu_dereference()?
>> >> >> >> >
>> >> >> >> > I believe that sparse will know about the pointers being __rcu, but
>> >> >> >> > not the array.  Unless you mark both levels.
>> >> >> >> 
>> >> >> >> Hi Paul
>> >> >> >> 
>> >> >> >> One more question, since I started adding the annotations: We are
>> >> >> >> currently swapping out the pointers using xchg():
>> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L555
>> >> >> >> 
>> >> >> >> and even cmpxchg():
>> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/devmap.c#L831
>> >> >> >> 
>> >> >> >> Sparse complains about these if I add the __rcu annotation to the
>> >> >> >> definition (which otherwise works just fine with the double-pointer,
>> >> >> >> BTW). Is there a way to fix that? Some kind of rcu_ macro version of the
>> >> >> >> atomic swaps or something? Or do we just keep the regular xchg() and
>> >> >> >> ignore those particular sparse warnings?
>> >> >> >
>> >> >> > Sounds like I need to supply a unrcu_pointer() macro or some such.
>> >> >> > This would operate something like the current open-coded casts
>> >> >> > in __rcu_dereference_protected().
>> >> >> 
>> >> >> So with that, I would turn the existing:
>> >> >> 
>> >> >> 	dev = READ_ONCE(dtab->netdev_map[i]);
>> >> >> 	if (!dev || netdev != dev->dev)
>> >> >> 		continue;
>> >> >> 	odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
>> >> >> 
>> >> >> into:
>> >> >> 
>> >> >> 	dev = rcu_dereference(dtab->netdev_map[i]);
>> >> >> 	if (!dev || netdev != dev->dev)
>> >> >> 		continue;
>> >> >> 	odev = cmpxchg(unrcu_pointer(&dtab->netdev_map[i]), dev, NULL);
>> >> >> 
>> >> >> 
>> >> >> and with a _check version:
>> >> >> 
>> >> >> 	old_dev = xchg(unrcu_pointer_check(&dtab->netdev_map[k], rcu_read_lock_bh_held()), NULL);
>> >> >> 
>> >> >> right?
>> >> >> 
>> >> >> Or would it be:
>> >> >> 	odev = cmpxchg(&unrcu_pointer(dtab->netdev_map[i]), dev, NULL);
>> >> >> ?
>> >> >> 
>> >> >> > Would something like that work for you?
>> >> >> 
>> >> >> Yeah, I believe it would :)
>> >> >
>> >> > Except that I was forgetting that the __rcu decorates the pointed-to
>> >> > data rather than the pointer itself.  :-/
>> >> >
>> >> > But that is actually easier, as you can follow the example of
>> >> > rcu_assign_pointer(), namely using RCU_INITIALIZER().
>> >> >
>> >> > So like this:
>> >> >
>> >> > 	odev = cmpxchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev), NULL);
>> >> >
>> >> > I -think- that the NULL doesn't need an RCU_INITIALIZER(), but it is
>> >> > of course sparse's opinion that matters.
>> >> >
>> >> > And of course like this:
>> >> >
>> >> > 	old_dev = xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap));
>> >> >
>> >> > Does that work, or am I still confused?
>> >> 
>> >> That gets rid of one warning, but not the other. Before (plain xchg):
>> >> 
>> >> kernel/bpf/devmap.c:657:19: warning: incorrect type in initializer (different address spaces)
>> >> kernel/bpf/devmap.c:657:19:    expected struct bpf_dtab_netdev [noderef] __rcu *__ret
>> >> kernel/bpf/devmap.c:657:19:    got struct bpf_dtab_netdev *[assigned] dev
>> >> kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
>> >> kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
>> >> kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret
>> >> 
>> >> after (RCU_INITIALIZER() on the second argument to xchg):
>> >> 
>> >> kernel/bpf/devmap.c:657:17: warning: incorrect type in assignment (different address spaces)
>> >> kernel/bpf/devmap.c:657:17:    expected struct bpf_dtab_netdev *old_dev
>> >> kernel/bpf/devmap.c:657:17:    got struct bpf_dtab_netdev [noderef] __rcu *[assigned] __ret
>> >> 
>> >> I can get rid of that second one by marking old_dev as __rcu, but then I
>> >> get a new warning when dereferencing that in the subsequent
>> >> call_rcu()...
>> >> 
>> >> So I guess we still need that unrcu_pointer(), to wrap the xchg() in?
>> >
>> > Well, at least this use case permits an lvalue.  ;-)
>> >
>> > Please see below for an untested patch intended to permit the following:
>> >
>> > 	old_dev = unrcu_pointer(xchg(&dtab->netdev_map[k], RCU_INITIALIZER(newmap)));
>> >
>> > Does that do the trick?
>> 
>> Yes, it does! With that I can mark the pointer as __rcu and get all uses
>> of it through sparse without complaints - awesome!
>> 
>> How do RCU patches usually make it into the kernel? Can you provide me
>> with a proper patch I can just include along with my cleanup patches
>> (taking it through the bpf tree)? Or do we need to go through some other
>> tree and wait for a merge?
>
> Normally through the -rcu tree, but please feel free to pull this one
> (shown formally below) along with your changes.  I have queued it in
> the -rcu tree as well, but my normal process would submit it during the
> v5.14 merge window, that is, not the upcoming one but the one after that.
>
> So for example if your work makes it into the upcoming merge window,
> I will drop my copy of my patch when I rebase onto v5.13-rc1.

Sounds good; not sure if I'll manage to get something in before this
merge window (which seems to be fast approaching); we'll see. Thanks! :)

-Toke


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2021-04-22 14:30 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-14 12:26 [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
2021-04-14 12:26 ` [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue Hangbin Liu
2021-04-15  0:17   ` Martin KaFai Lau
2021-04-15  2:37     ` Hangbin Liu
2021-04-15  9:22       ` Toke Høiland-Jørgensen
2021-04-15 17:35         ` Martin KaFai Lau
2021-04-15 18:21           ` Jesper Dangaard Brouer
2021-04-15 20:29             ` Toke Høiland-Jørgensen
2021-04-16  0:39               ` Martin KaFai Lau
2021-04-16 10:03                 ` Toke Høiland-Jørgensen
2021-04-16 18:20                   ` Martin KaFai Lau
2021-04-16 13:45                 ` Jesper Dangaard Brouer
2021-04-16 14:35                   ` Toke Høiland-Jørgensen
2021-04-16 18:22                   ` Martin KaFai Lau
2021-04-17  0:23                     ` Paul E. McKenney
2021-04-17 12:27                       ` Toke Høiland-Jørgensen
2021-04-19 16:58                         ` Paul E. McKenney
2021-04-19 18:12                           ` Toke Høiland-Jørgensen
2021-04-19 18:32                             ` Paul E. McKenney
2021-04-19 21:21                               ` Toke Høiland-Jørgensen
2021-04-19 21:41                                 ` Paul E. McKenney
2021-04-19 22:16                                   ` Toke Høiland-Jørgensen
2021-04-19 22:31                                     ` Paul E. McKenney
2021-04-21 14:24                                       ` Toke Høiland-Jørgensen
2021-04-21 14:59                                         ` Paul E. McKenney
2021-04-21 19:59                                           ` Toke Høiland-Jørgensen
2021-04-21 20:51                                             ` Paul E. McKenney
2021-04-21 21:10                                               ` Toke Høiland-Jørgensen
2021-04-21 21:30                                                 ` Paul E. McKenney
2021-04-21 22:00                                                   ` Toke Høiland-Jørgensen
2021-04-21 22:31                                                     ` Paul E. McKenney
2021-04-22 14:30                                                       ` Toke Høiland-Jørgensen
2021-04-14 12:26 ` [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
2021-04-15  0:23   ` Martin KaFai Lau
2021-04-15  2:21     ` Hangbin Liu
2021-04-15  9:29       ` Toke Høiland-Jørgensen
2021-04-14 12:26 ` [PATCHv7 bpf-next 3/4] sample/bpf: add xdp_redirect_map_multi for redirect_map broadcast test Hangbin Liu
2021-04-14 12:26 ` [PATCHv7 bpf-next 4/4] selftests/bpf: add xdp_redirect_multi test Hangbin Liu
2021-04-14 14:16 ` [PATCHv7 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Toke Høiland-Jørgensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).