netdev.vger.kernel.org archive mirror
* [PATCH v2 bpf-next 0/2] veth: Bulk XDP_TX
@ 2019-06-05  5:36 Toshiaki Makita
  2019-06-05  5:36 ` [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX Toshiaki Makita
  2019-06-05  5:36 ` [PATCH v2 bpf-next 2/2] veth: Support " Toshiaki Makita
  0 siblings, 2 replies; 9+ messages in thread
From: Toshiaki Makita @ 2019-06-05  5:36 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf,
	Toke Høiland-Jørgensen, Jason Wang

This introduces bulk XDP_TX in veth, which improves XDP_TX performance
by approximately 9%. The detailed explanation and performance numbers
are shown in patch 2.

v2:
- Use stack for bulk queue instead of a global variable.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>

Toshiaki Makita (2):
  xdp: Add tracepoint for bulk XDP_TX
  veth: Support bulk XDP_TX

 drivers/net/veth.c         | 60 ++++++++++++++++++++++++++++++++++++----------
 include/trace/events/xdp.h | 25 +++++++++++++++++++
 kernel/bpf/core.c          |  1 +
 3 files changed, 74 insertions(+), 12 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
  2019-06-05  5:36 [PATCH v2 bpf-next 0/2] veth: Bulk XDP_TX Toshiaki Makita
@ 2019-06-05  5:36 ` Toshiaki Makita
  2019-06-05  7:59   ` Jesper Dangaard Brouer
  2019-06-05  5:36 ` [PATCH v2 bpf-next 2/2] veth: Support " Toshiaki Makita
  1 sibling, 1 reply; 9+ messages in thread
From: Toshiaki Makita @ 2019-06-05  5:36 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf,
	Toke Høiland-Jørgensen, Jason Wang

This tracepoint is introduced so that admins can check what is happening
with XDP_TX when bulk XDP_TX is in use, which will first be introduced in
veth in the next commit.
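
For example, the tracepoint can be watched from tracefs like this (the
session below is only illustrative; the output format follows the
TP_printk added by this patch):

  # echo 1 > /sys/kernel/tracing/events/xdp/xdp_bulk_tx/enable
  # cat /sys/kernel/tracing/trace_pipe
  ... xdp_bulk_tx: ifindex=8 sent=16 drops=0 err=0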

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
 kernel/bpf/core.c          |  1 +
 2 files changed, 26 insertions(+)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index e95cb86..e06ea65 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -50,6 +50,31 @@
 		  __entry->ifindex)
 );
 
+TRACE_EVENT(xdp_bulk_tx,
+
+	TP_PROTO(const struct net_device *dev,
+		 int sent, int drops, int err),
+
+	TP_ARGS(dev, sent, drops, err),
+
+	TP_STRUCT__entry(
+		__field(int, ifindex)
+		__field(int, drops)
+		__field(int, sent)
+		__field(int, err)
+	),
+
+	TP_fast_assign(
+		__entry->ifindex	= dev->ifindex;
+		__entry->drops		= drops;
+		__entry->sent		= sent;
+		__entry->err		= err;
+	),
+
+	TP_printk("ifindex=%d sent=%d drops=%d err=%d",
+		  __entry->ifindex, __entry->sent, __entry->drops, __entry->err)
+);
+
 DECLARE_EVENT_CLASS(xdp_redirect_template,
 
 	TP_PROTO(const struct net_device *dev,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 33fb292..3a3f4af 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2106,3 +2106,4 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
 #include <linux/bpf_trace.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
+EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 bpf-next 2/2] veth: Support bulk XDP_TX
  2019-06-05  5:36 [PATCH v2 bpf-next 0/2] veth: Bulk XDP_TX Toshiaki Makita
  2019-06-05  5:36 ` [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX Toshiaki Makita
@ 2019-06-05  5:36 ` Toshiaki Makita
  1 sibling, 0 replies; 9+ messages in thread
From: Toshiaki Makita @ 2019-06-05  5:36 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf,
	Toke Høiland-Jørgensen, Jason Wang

XDP_TX is similar to XDP_REDIRECT in that it essentially redirects packets
to the device itself. XDP_REDIRECT has a bulk transmit mechanism to avoid
the heavy cost of indirect calls, and it also reduces lock acquisition on
destination devices that need locks, such as veth and tun.

XDP_TX does not use indirect calls, but drivers that require locks can
benefit from bulk transmit for XDP_TX as well.

This patch introduces a bulk transmit mechanism in veth using a bulk queue
on the stack, and improves XDP_TX performance by about 9%.

Here are single-core/single-flow XDP_TX test results. CPU consumption
numbers are taken from "perf report --no-child".

- Before:

  7.26 Mpps

  _raw_spin_lock  7.83%
  veth_xdp_xmit  12.23%

- After:

  7.94 Mpps

  _raw_spin_lock  1.08%
  veth_xdp_xmit   6.10%
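
For reference, a profile like the above can be collected with something
along these lines (the exact perf options are illustrative):

  # perf record -a -g -- sleep 10
  # perf report --no-children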

v2:
- Use stack for bulk queue instead of a global variable.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 drivers/net/veth.c | 60 +++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..b363a84 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -38,6 +38,8 @@
 #define VETH_XDP_TX		BIT(0)
 #define VETH_XDP_REDIR		BIT(1)
 
+#define VETH_XDP_TX_BULK_SIZE	16
+
 struct veth_rq_stats {
 	u64			xdp_packets;
 	u64			xdp_bytes;
@@ -64,6 +66,11 @@ struct veth_priv {
 	unsigned int		requested_headroom;
 };
 
+struct veth_xdp_tx_bq {
+	struct xdp_frame *q[VETH_XDP_TX_BULK_SIZE];
+	unsigned int count;
+};
+
 /*
  * ethtool interface
  */
@@ -442,13 +449,30 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	return ret;
 }
 
-static void veth_xdp_flush(struct net_device *dev)
+static void veth_xdp_flush_bq(struct net_device *dev, struct veth_xdp_tx_bq *bq)
+{
+	int sent, i, err = 0;
+
+	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
+	if (sent < 0) {
+		err = sent;
+		sent = 0;
+		for (i = 0; i < bq->count; i++)
+			xdp_return_frame(bq->q[i]);
+	}
+	trace_xdp_bulk_tx(dev, sent, bq->count - sent, err);
+
+	bq->count = 0;
+}
+
+static void veth_xdp_flush(struct net_device *dev, struct veth_xdp_tx_bq *bq)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
 	struct veth_rq *rq;
 
 	rcu_read_lock();
+	veth_xdp_flush_bq(dev, bq);
 	rcv = rcu_dereference(priv->peer);
 	if (unlikely(!rcv))
 		goto out;
@@ -464,19 +488,26 @@ static void veth_xdp_flush(struct net_device *dev)
 	rcu_read_unlock();
 }
 
-static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp,
+		       struct veth_xdp_tx_bq *bq)
 {
 	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
 
 	if (unlikely(!frame))
 		return -EOVERFLOW;
 
-	return veth_xdp_xmit(dev, 1, &frame, 0);
+	if (unlikely(bq->count == VETH_XDP_TX_BULK_SIZE))
+		veth_xdp_flush_bq(dev, bq);
+
+	bq->q[bq->count++] = frame;
+
+	return 0;
 }
 
 static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
 					struct xdp_frame *frame,
-					unsigned int *xdp_xmit)
+					unsigned int *xdp_xmit,
+					struct veth_xdp_tx_bq *bq)
 {
 	void *hard_start = frame->data - frame->headroom;
 	void *head = hard_start - sizeof(struct xdp_frame);
@@ -509,7 +540,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
 			orig_frame = *frame;
 			xdp.data_hard_start = head;
 			xdp.rxq->mem = frame->mem;
-			if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) {
+			if (unlikely(veth_xdp_tx(rq->dev, &xdp, bq) < 0)) {
 				trace_xdp_exception(rq->dev, xdp_prog, act);
 				frame = &orig_frame;
 				goto err_xdp;
@@ -559,7 +590,8 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
-					unsigned int *xdp_xmit)
+					unsigned int *xdp_xmit,
+					struct veth_xdp_tx_bq *bq)
 {
 	u32 pktlen, headroom, act, metalen;
 	void *orig_data, *orig_data_end;
@@ -635,7 +667,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
 		get_page(virt_to_page(xdp.data));
 		consume_skb(skb);
 		xdp.rxq->mem = rq->xdp_mem;
-		if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) {
+		if (unlikely(veth_xdp_tx(rq->dev, &xdp, bq) < 0)) {
 			trace_xdp_exception(rq->dev, xdp_prog, act);
 			goto err_xdp;
 		}
@@ -690,7 +722,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
 	return NULL;
 }
 
-static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
+static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit,
+			struct veth_xdp_tx_bq *bq)
 {
 	int i, done = 0, drops = 0, bytes = 0;
 
@@ -706,11 +739,11 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
 			struct xdp_frame *frame = veth_ptr_to_xdp(ptr);
 
 			bytes += frame->len;
-			skb = veth_xdp_rcv_one(rq, frame, &xdp_xmit_one);
+			skb = veth_xdp_rcv_one(rq, frame, &xdp_xmit_one, bq);
 		} else {
 			skb = ptr;
 			bytes += skb->len;
-			skb = veth_xdp_rcv_skb(rq, skb, &xdp_xmit_one);
+			skb = veth_xdp_rcv_skb(rq, skb, &xdp_xmit_one, bq);
 		}
 		*xdp_xmit |= xdp_xmit_one;
 
@@ -736,10 +769,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	struct veth_rq *rq =
 		container_of(napi, struct veth_rq, xdp_napi);
 	unsigned int xdp_xmit = 0;
+	struct veth_xdp_tx_bq bq;
 	int done;
 
+	bq.count = 0;
+
 	xdp_set_return_frame_no_direct();
-	done = veth_xdp_rcv(rq, budget, &xdp_xmit);
+	done = veth_xdp_rcv(rq, budget, &xdp_xmit, &bq);
 
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
@@ -751,7 +787,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	}
 
 	if (xdp_xmit & VETH_XDP_TX)
-		veth_xdp_flush(rq->dev);
+		veth_xdp_flush(rq->dev, &bq);
 	if (xdp_xmit & VETH_XDP_REDIR)
 		xdp_do_flush_map();
 	xdp_clear_return_frame_no_direct();
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
  2019-06-05  5:36 ` [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX Toshiaki Makita
@ 2019-06-05  7:59   ` Jesper Dangaard Brouer
  2019-06-06 11:04     ` Toshiaki Makita
  0 siblings, 1 reply; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2019-06-05  7:59 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf, Toke Høiland-Jørgensen, Jason Wang,
	brouer, Brendan Gregg

On Wed,  5 Jun 2019 14:36:12 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> This is introduced for admins to check what is happening on XDP_TX when
> bulk XDP_TX is in use, which will be first introduced in veth in next
> commit.

Is the plan that this tracepoint 'xdp:xdp_bulk_tx' should be used by
all drivers?

(more below)

> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
> ---
>  include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
>  kernel/bpf/core.c          |  1 +
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index e95cb86..e06ea65 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -50,6 +50,31 @@
>  		  __entry->ifindex)
>  );
>  
> +TRACE_EVENT(xdp_bulk_tx,
> +
> +	TP_PROTO(const struct net_device *dev,
> +		 int sent, int drops, int err),
> +
> +	TP_ARGS(dev, sent, drops, err),
> +
> +	TP_STRUCT__entry(

All other tracepoints in this file starts with:

		__field(int, prog_id)
		__field(u32, act)
or
		__field(int, map_id)
		__field(u32, act)

Could you please add those?

> +		__field(int, ifindex)
> +		__field(int, drops)
> +		__field(int, sent)
> +		__field(int, err)
> +	),

The reason is that this makes it easier to attach to multiple
tracepoints and extract the same value.

Example with bpftrace oneliner:

$ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }'
Attaching 8 probes...
^C

@action[4]: 30259246
@action[0]: 34489024

XDP_ABORTED  = 0
XDP_REDIRECT = 4


> +
> +	TP_fast_assign(

		__entry->act		= XDP_TX;

> +		__entry->ifindex	= dev->ifindex;
> +		__entry->drops		= drops;
> +		__entry->sent		= sent;
> +		__entry->err		= err;
> +	),
> +
> +	TP_printk("ifindex=%d sent=%d drops=%d err=%d",
> +		  __entry->ifindex, __entry->sent, __entry->drops, __entry->err)
> +);
> +

Other fun bpftrace stuff:

sudo bpftrace -e 'tracepoint:xdp:xdp_*map* { @map_id[comm, args->map_id] = count(); }'
Attaching 5 probes...
^C

@map_id[swapper/2, 113]: 1428
@map_id[swapper/0, 113]: 2085
@map_id[ksoftirqd/4, 113]: 2253491
@map_id[ksoftirqd/2, 113]: 25677560
@map_id[ksoftirqd/0, 113]: 29004338
@map_id[ksoftirqd/3, 113]: 31034885


$ bpftool map list id 113
113: devmap  name tx_port  flags 0x0
	key 4B  value 4B  max_entries 100  memlock 4096B


p.s. People should look out for Brendan Gregg's upcoming book on BPF
performance tools, from which I learned to use bpftrace :-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
  2019-06-05  7:59   ` Jesper Dangaard Brouer
@ 2019-06-06 11:04     ` Toshiaki Makita
  2019-06-06 19:41       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 9+ messages in thread
From: Toshiaki Makita @ 2019-06-06 11:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf, Toke Høiland-Jørgensen, Jason Wang,
	Brendan Gregg

On 2019/06/05 16:59, Jesper Dangaard Brouer wrote:
> On Wed,  5 Jun 2019 14:36:12 +0900
> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> 
>> This is introduced for admins to check what is happening on XDP_TX when
>> bulk XDP_TX is in use, which will be first introduced in veth in next
>> commit.
> 
> Is the plan that this tracepoint 'xdp:xdp_bulk_tx' should be used by
> all drivers?

I guess you mean all drivers that implement a similar mechanism should use 
this? Then yes.
(I don't think all drivers need a bulk tx mechanism though)

> (more below)
> 
>> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
>> ---
>>   include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
>>   kernel/bpf/core.c          |  1 +
>>   2 files changed, 26 insertions(+)
>>
>> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
>> index e95cb86..e06ea65 100644
>> --- a/include/trace/events/xdp.h
>> +++ b/include/trace/events/xdp.h
>> @@ -50,6 +50,31 @@
>>   		  __entry->ifindex)
>>   );
>>   
>> +TRACE_EVENT(xdp_bulk_tx,
>> +
>> +	TP_PROTO(const struct net_device *dev,
>> +		 int sent, int drops, int err),
>> +
>> +	TP_ARGS(dev, sent, drops, err),
>> +
>> +	TP_STRUCT__entry(
> 
> All other tracepoints in this file starts with:
> 
> 		__field(int, prog_id)
> 		__field(u32, act)
> or
> 		__field(int, map_id)
> 		__field(u32, act)
> 
> Could you please add those?

So... prog_id is the problem. The program can be changed while we are 
enqueueing packets into the bulk queue, so the prog_id at flush may be an 
unexpected one.

It can be fixed by disabling NAPI when changing XDP programs. This stops 
packet processing while changing XDP programs, but I guess it is an 
acceptable compromise. Having said that, I'm honestly not so eager to 
make this change, since it would require refurbishing one of the most 
delicate parts of veth XDP, the NAPI disabling/enabling mechanism.

WDYT?

>> +		__field(int, ifindex)
>> +		__field(int, drops)
>> +		__field(int, sent)
>> +		__field(int, err)
>> +	),
> 
> The reason is that this make is easier to attach to multiple
> tracepoints, and extract the same value.
> 
> Example with bpftrace oneliner:
> 
> $ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }'
> Attaching 8 probes...
> ^C
> 
> @action[4]: 30259246
> @action[0]: 34489024
> 
> XDP_ABORTED = 0 	
> XDP_REDIRECT= 4
> 
> 
>> +
>> +	TP_fast_assign(
> 
> 		__entry->act		= XDP_TX;

OK

> 
>> +		__entry->ifindex	= dev->ifindex;
>> +		__entry->drops		= drops;
>> +		__entry->sent		= sent;
>> +		__entry->err		= err;
>> +	),
>> +
>> +	TP_printk("ifindex=%d sent=%d drops=%d err=%d",
>> +		  __entry->ifindex, __entry->sent, __entry->drops, __entry->err)
>> +);
>> +
> 
> Other fun bpftrace stuff:
> 
> sudo bpftrace -e 'tracepoint:xdp:xdp_*map* { @map_id[comm, args->map_id] = count(); }'
> Attaching 5 probes...
> ^C
> 
> @map_id[swapper/2, 113]: 1428
> @map_id[swapper/0, 113]: 2085
> @map_id[ksoftirqd/4, 113]: 2253491
> @map_id[ksoftirqd/2, 113]: 25677560
> @map_id[ksoftirqd/0, 113]: 29004338
> @map_id[ksoftirqd/3, 113]: 31034885
> 
> 
> $ bpftool map list id 113
> 113: devmap  name tx_port  flags 0x0
> 	key 4B  value 4B  max_entries 100  memlock 4096B
> 
> 
> p.s. People should look out for Brendan Gregg's upcoming book on BPF
> performance tools, from which I learned to use bpftrace :-)

Where can I get information on the book?

--
Toshiaki Makita

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
  2019-06-06 11:04     ` Toshiaki Makita
@ 2019-06-06 19:41       ` Jesper Dangaard Brouer
  2019-06-07  2:22         ` Toshiaki Makita
  0 siblings, 1 reply; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2019-06-06 19:41 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf, Toke Høiland-Jørgensen, Jason Wang,
	Brendan Gregg, brouer

On Thu, 6 Jun 2019 20:04:20 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> On 2019/06/05 16:59, Jesper Dangaard Brouer wrote:
> > On Wed,  5 Jun 2019 14:36:12 +0900
> > Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> >   
> >> This is introduced for admins to check what is happening on XDP_TX when
> >> bulk XDP_TX is in use, which will be first introduced in veth in next
> >> commit.  
> > 
> > Is the plan that this tracepoint 'xdp:xdp_bulk_tx' should be used by
> > all drivers?  
> 
> I guess you mean all drivers that implement similar mechanism should use 
> this? Then yes.
> (I don't think all drivers needs bulk tx mechanism though)
> 
> > (more below)
> >   
> >> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
> >> ---
> >>   include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
> >>   kernel/bpf/core.c          |  1 +
> >>   2 files changed, 26 insertions(+)
> >>
> >> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> >> index e95cb86..e06ea65 100644
> >> --- a/include/trace/events/xdp.h
> >> +++ b/include/trace/events/xdp.h
> >> @@ -50,6 +50,31 @@
> >>   		  __entry->ifindex)
> >>   );
> >>   
> >> +TRACE_EVENT(xdp_bulk_tx,
> >> +
> >> +	TP_PROTO(const struct net_device *dev,
> >> +		 int sent, int drops, int err),
> >> +
> >> +	TP_ARGS(dev, sent, drops, err),
> >> +
> >> +	TP_STRUCT__entry(  
> > 
> > All other tracepoints in this file starts with:
> > 
> > 		__field(int, prog_id)
> > 		__field(u32, act)
> > or
> > 		__field(int, map_id)
> > 		__field(u32, act)
> > 
> > Could you please add those?  
> 
> So... prog_id is the problem. The program can be changed while we are 
> enqueueing packets to the bulk queue, so the prog_id at flush may be an 
> unexpected one.

Hmmm... that sounds problematic, if the XDP bpf_prog for veth can
change underneath, before the flush.  Our redirect system depends on
things being stable until the xdp_do_flush_map() operation, as it will
e.g. set the per-CPU (bpf_redirect_info) map_to_flush pointer (which
depends on the XDP prog) and expect it to be correct/valid.


> It can be fixed by disabling NAPI when changing XDP programs. This stops 
> packet processing while changing XDP programs, but I guess it is an 
> acceptable compromise. Having said that, I'm honestly not so eager to 
> make this change, since this will require refurbishment of one of the 
> most delicate part of veth XDP, NAPI disabling/enabling mechanism.
> 
> WDYT?

Sounds like a bug, if the XDP bpf_prog is not stable within the NAPI poll...

 
> >> +		__field(int, ifindex)
> >> +		__field(int, drops)
> >> +		__field(int, sent)
> >> +		__field(int, err)
> >> +	),  
> > 
> > The reason is that this make is easier to attach to multiple
> > tracepoints, and extract the same value.
> > 
> > Example with bpftrace oneliner:
> > 
> > $ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }'

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
  2019-06-06 19:41       ` Jesper Dangaard Brouer
@ 2019-06-07  2:22         ` Toshiaki Makita
  2019-06-07  9:32           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 9+ messages in thread
From: Toshiaki Makita @ 2019-06-07  2:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf, Toke Høiland-Jørgensen, Jason Wang,
	Brendan Gregg

On 2019/06/07 4:41, Jesper Dangaard Brouer wrote:
> On Thu, 6 Jun 2019 20:04:20 +0900
> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> 
>> On 2019/06/05 16:59, Jesper Dangaard Brouer wrote:
>>> On Wed,  5 Jun 2019 14:36:12 +0900
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
>>>    
>>>> This is introduced for admins to check what is happening on XDP_TX when
>>>> bulk XDP_TX is in use, which will be first introduced in veth in next
>>>> commit.
>>>
>>> Is the plan that this tracepoint 'xdp:xdp_bulk_tx' should be used by
>>> all drivers?
>>
>> I guess you mean all drivers that implement similar mechanism should use
>> this? Then yes.
>> (I don't think all drivers needs bulk tx mechanism though)
>>
>>> (more below)
>>>    
>>>> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
>>>> ---
>>>>    include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
>>>>    kernel/bpf/core.c          |  1 +
>>>>    2 files changed, 26 insertions(+)
>>>>
>>>> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
>>>> index e95cb86..e06ea65 100644
>>>> --- a/include/trace/events/xdp.h
>>>> +++ b/include/trace/events/xdp.h
>>>> @@ -50,6 +50,31 @@
>>>>    		  __entry->ifindex)
>>>>    );
>>>>    
>>>> +TRACE_EVENT(xdp_bulk_tx,
>>>> +
>>>> +	TP_PROTO(const struct net_device *dev,
>>>> +		 int sent, int drops, int err),
>>>> +
>>>> +	TP_ARGS(dev, sent, drops, err),
>>>> +
>>>> +	TP_STRUCT__entry(
>>>
>>> All other tracepoints in this file starts with:
>>>
>>> 		__field(int, prog_id)
>>> 		__field(u32, act)
>>> or
>>> 		__field(int, map_id)
>>> 		__field(u32, act)
>>>
>>> Could you please add those?
>>
>> So... prog_id is the problem. The program can be changed while we are
>> enqueueing packets to the bulk queue, so the prog_id at flush may be an
>> unexpected one.
> 
> Hmmm... that sounds problematic, if the XDP bpf_prog for veth can
> change underneath, before the flush.  Our redirect system, depend on
> things being stable until the xdp_do_flush_map() operation, as will
> e.g. set per-CPU (bpf_redirect_info) map_to_flush pointer (which depend
> on XDP prog), and expect it to be correct/valid.

Sorry, I don't get how maps depend on programs.
At least xdp_do_redirect_map() handles map_to_flush changes during NAPI. 
Is there a problem when the map is not changed but the program is changed?
Also I believe this is not veth-specific behavior. Looking at tun and 
i40e, they seem to change xdp_prog without stopping the data path.

>> It can be fixed by disabling NAPI when changing XDP programs. This stops
>> packet processing while changing XDP programs, but I guess it is an
>> acceptable compromise. Having said that, I'm honestly not so eager to
>> make this change, since this will require refurbishment of one of the
>> most delicate part of veth XDP, NAPI disabling/enabling mechanism.
>>
>> WDYT?
> 
> Sound like a bug, if XDP bpf_prog is not stable within the NAPI poll...
> 
>   
>>>> +		__field(int, ifindex)
>>>> +		__field(int, drops)
>>>> +		__field(int, sent)
>>>> +		__field(int, err)
>>>> +	),
>>>
>>> The reason is that this make is easier to attach to multiple
>>> tracepoints, and extract the same value.
>>>
>>> Example with bpftrace oneliner:
>>>
>>> $ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }'
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
  2019-06-07  2:22         ` Toshiaki Makita
@ 2019-06-07  9:32           ` Jesper Dangaard Brouer
  2019-06-10 11:41             ` Toshiaki Makita
  0 siblings, 1 reply; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2019-06-07  9:32 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf, Toke Høiland-Jørgensen, Jason Wang,
	brouer, Paul E. McKenney

On Fri, 7 Jun 2019 11:22:00 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> On 2019/06/07 4:41, Jesper Dangaard Brouer wrote:
> > On Thu, 6 Jun 2019 20:04:20 +0900
> > Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> >   
> >> On 2019/06/05 16:59, Jesper Dangaard Brouer wrote:  
> >>> On Wed,  5 Jun 2019 14:36:12 +0900
> >>> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> >>>      
[...]
> >>
> >> So... prog_id is the problem. The program can be changed while we are
> >> enqueueing packets to the bulk queue, so the prog_id at flush may be an
> >> unexpected one.  
> > 
> > Hmmm... that sounds problematic, if the XDP bpf_prog for veth can
> > change underneath, before the flush.  Our redirect system, depend on
> > things being stable until the xdp_do_flush_map() operation, as will
> > e.g. set per-CPU (bpf_redirect_info) map_to_flush pointer (which depend
> > on XDP prog), and expect it to be correct/valid.  
> 
> Sorry, I don't get how maps depend on programs.

BPF/XDP programs hold a reference count on the map (e.g. one used for
redirect), and when the XDP program is removed and the last refcnt for
the map is dropped, the map is also removed (redirect maps do a call_rcu
on shutdown).

> At least xdp_do_redirect_map() handles map_to_flush change during NAPI. 
> Is there a problem when the map is not changed but the program is changed?
> Also I believe this is not veth-specific behavior. Looking at tun and 
> i40e, they seem to change xdp_prog without stopping data path.
 
I guess this could actually happen, but we are "saved" because the
'map_to_flush' pointer is still valid due to RCU protection.

But it does look fishy, as our rcu_read_lock sections do not encapsulate
this.  There is an RCU read section in veth_xdp_rcv_skb(), which can
call xdp_do_redirect(), which sets the per-CPU ri->map_to_flush.

Do we get this protection by running under softirq, and does this
prevent an RCU grace period (call_rcu callbacks) from happening
(between veth_xdp_rcv_skb() and xdp_do_flush_map() in veth_poll())?


To Toshiaki, regarding your patch 2/2, you are not affected by this
per-CPU map storing, as you pass along the bulk queue.  I do see your
point that prog_id could change.  Could you change the tracepoint to
include 'act' and place 'ifindex' above it in the struct? This way the
'act' member is at the same location/offset as in other XDP
tracepoints.  I see 'ifindex' as the identifier for this tracepoint
(others have map_id or prog_id in this location).
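
Something along these lines, just to illustrate the suggested layout (an
untested sketch; only 'act' is added, 'ifindex' stays first as the
identifier, and the XDP_TX assignment follows the earlier comment):

	TP_STRUCT__entry(
		__field(int, ifindex)
		__field(u32, act)
		__field(int, drops)
		__field(int, sent)
		__field(int, err)
	),

	TP_fast_assign(
		__entry->act		= XDP_TX;
		__entry->ifindex	= dev->ifindex;
		__entry->drops		= drops;
		__entry->sent		= sent;
		__entry->err		= err;
	),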

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
  2019-06-07  9:32           ` Jesper Dangaard Brouer
@ 2019-06-10 11:41             ` Toshiaki Makita
  0 siblings, 0 replies; 9+ messages in thread
From: Toshiaki Makita @ 2019-06-10 11:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf, Toke Høiland-Jørgensen, Jason Wang,
	Paul E. McKenney

On 2019/06/07 18:32, Jesper Dangaard Brouer wrote:
> On Fri, 7 Jun 2019 11:22:00 +0900
> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> 
>> On 2019/06/07 4:41, Jesper Dangaard Brouer wrote:
>>> On Thu, 6 Jun 2019 20:04:20 +0900
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
>>>    
>>>> On 2019/06/05 16:59, Jesper Dangaard Brouer wrote:
>>>>> On Wed,  5 Jun 2019 14:36:12 +0900
>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
>>>>>       
> [...]
>>>>
>>>> So... prog_id is the problem. The program can be changed while we are
>>>> enqueueing packets to the bulk queue, so the prog_id at flush may be an
>>>> unexpected one.
>>>
>>> Hmmm... that sounds problematic, if the XDP bpf_prog for veth can
>>> change underneath, before the flush.  Our redirect system, depend on
>>> things being stable until the xdp_do_flush_map() operation, as will
>>> e.g. set per-CPU (bpf_redirect_info) map_to_flush pointer (which depend
>>> on XDP prog), and expect it to be correct/valid.
>>
>> Sorry, I don't get how maps depend on programs.
> 
> BPF/XDP programs have a reference count on the map (e.g. used for
> redirect) and when the XDP is removed, and last refcnt for the map is
> reached, then the map is also removed (redirect maps does a call_rcu
> when shutdown).

Thanks, now I understand what you mean.

>> At least xdp_do_redirect_map() handles map_to_flush change during NAPI.
>> Is there a problem when the map is not changed but the program is changed?
>> Also I believe this is not veth-specific behavior. Looking at tun and
>> i40e, they seem to change xdp_prog without stopping data path.
>   
> I guess this could actually happen, but we are "saved" by the
> 'map_to_flush' (pointer) is still valid due to RCU protection.
> 
> But it does look fishy, as our rcu_read_lock's does not encapsulation
> this. There is RCU-read-section in veth_xdp_rcv_skb(), which via can
> call xdp_do_redirect() which set per-CPU ri->map_to_flush.
> 
> Do we get this protection by running under softirq, and does this
> prevent an RCU grace-period (call_rcu callbacks) from happening?
> (between veth_xdp_rcv_skb() and xdp_do_flush_map() in veth_poll())

Are we trying to avoid the problem with this code in dev_map_free()?

	/* To ensure all pending flush operations have completed wait for flush
	 * bitmap to indicate all flush_needed bits to be zero on _all_ cpus.
	 * Because the above synchronize_rcu() ensures the map is disconnected
	 * from the program we can assume no new bits will be set.
	 */
	for_each_online_cpu(cpu) {
		unsigned long *bitmap = per_cpu_ptr(dtab->flush_needed, cpu);

		while (!bitmap_empty(bitmap, dtab->map.max_entries))
			cond_resched();
	}

Not sure if this is working as expected.

> 
> 
> To Toshiaki, regarding your patch 2/2, you are not affected by this
> per-CPU map storing, as you pass along the bulk-queue.  I do see you
> point, with prog_id could change.  Could you change the tracepoint to
> include the 'act' and place 'ifindex' above this in the struct, this way
> the 'act' member is in the same location/offset as other XDP
> tracepoints.  I see the 'ifindex' as the identifier for this tracepoint
> (other have map_id or prog_id in this location).

Sure, thanks.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-06-10 11:41 UTC | newest]

Thread overview: 9+ messages
2019-06-05  5:36 [PATCH v2 bpf-next 0/2] veth: Bulk XDP_TX Toshiaki Makita
2019-06-05  5:36 ` [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX Toshiaki Makita
2019-06-05  7:59   ` Jesper Dangaard Brouer
2019-06-06 11:04     ` Toshiaki Makita
2019-06-06 19:41       ` Jesper Dangaard Brouer
2019-06-07  2:22         ` Toshiaki Makita
2019-06-07  9:32           ` Jesper Dangaard Brouer
2019-06-10 11:41             ` Toshiaki Makita
2019-06-05  5:36 ` [PATCH v2 bpf-next 2/2] veth: Support " Toshiaki Makita
