* [PATCH bpf-next 0/3] veth: Bulk XDP_TX
@ 2019-05-23 10:56 Toshiaki Makita
  2019-05-23 10:56 ` [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue Toshiaki Makita
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-23 10:56 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf

This series adds infrastructure for bulk XDP_TX and makes veth use it,
improving XDP_TX performance by approximately 8%. The detailed
performance numbers are shown in patch 3.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Toshiaki Makita (3):
  xdp: Add bulk XDP_TX queue
  xdp: Add tracepoint for bulk XDP_TX
  veth: Support bulk XDP_TX

 drivers/net/veth.c         | 26 +++++++++++++++++++++++++-
 include/net/xdp.h          |  7 +++++++
 include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
 kernel/bpf/core.c          |  1 +
 net/core/xdp.c             |  3 +++
 5 files changed, 61 insertions(+), 1 deletion(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue
  2019-05-23 10:56 [PATCH bpf-next 0/3] veth: Bulk XDP_TX Toshiaki Makita
@ 2019-05-23 10:56 ` Toshiaki Makita
  2019-05-23 11:11     ` Toke Høiland-Jørgensen
  2019-05-23 10:56 ` [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX Toshiaki Makita
  2019-05-23 10:56 ` [PATCH bpf-next 3/3] veth: Support " Toshiaki Makita
  2 siblings, 1 reply; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-23 10:56 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf

XDP_TX is similar to XDP_REDIRECT in that it essentially redirects packets
to the device itself. XDP_REDIRECT has a bulk transmit mechanism to avoid
the heavy cost of indirect calls, and it also reduces lock acquisition on
destination devices that need locks, such as veth and tun.

XDP_TX does not use indirect calls, but drivers that require locks can
benefit from bulk transmit for XDP_TX as well.

This patch adds a per-cpu queue which can be used for bulk transmit on
XDP_TX. I did not add enqueue/flush helper functions but exposed the queue
directly, because we should avoid indirect calls on XDP_TX.

Note that the queue must be flushed, i.e. the "count" member must be set
to 0, before a NAPI handler which used this queue exits. Otherwise packets
left in the queue would later be transmitted from completely unintended
devices.
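
For illustration, a driver's XDP_TX path is expected to use the queue
roughly as follows (a rough sketch only; drv_xdp_flush_bq() is a
placeholder for the driver's own flush helper, patch 3 adds the real
one for veth):

	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);

	/* Enqueue one frame, flushing first if the queue is full. */
	if (unlikely(bq->count == XDP_TX_BULK_SIZE))
		drv_xdp_flush_bq(dev);	/* ndo_xdp_xmit the frames, reset bq->count */
	bq->q[bq->count++] = frame;

	/* drv_xdp_flush_bq() must also be called before the NAPI handler
	 * returns, so that bq->count is 0 when the handler exits.
	 */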

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/net/xdp.h | 7 +++++++
 net/core/xdp.c    | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0f25b36..30b36c8 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -84,6 +84,13 @@ struct xdp_frame {
 	struct net_device *dev_rx; /* used by cpumap */
 };
 
+#define XDP_TX_BULK_SIZE	16
+struct xdp_tx_bulk_queue {
+	struct xdp_frame *q[XDP_TX_BULK_SIZE];
+	unsigned int count;
+};
+DECLARE_PER_CPU(struct xdp_tx_bulk_queue, xdp_tx_bq);
+
 /* Clear kernel pointers in xdp_frame */
 static inline void xdp_scrub_frame(struct xdp_frame *frame)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4b2b194..0622f2d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -40,6 +40,9 @@ struct xdp_mem_allocator {
 	struct rcu_head rcu;
 };
 
+DEFINE_PER_CPU(struct xdp_tx_bulk_queue, xdp_tx_bq);
+EXPORT_PER_CPU_SYMBOL_GPL(xdp_tx_bq);
+
 static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
 {
 	const u32 *k = data;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX
  2019-05-23 10:56 [PATCH bpf-next 0/3] veth: Bulk XDP_TX Toshiaki Makita
  2019-05-23 10:56 ` [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue Toshiaki Makita
@ 2019-05-23 10:56 ` Toshiaki Makita
  2019-05-23 13:12   ` Jesper Dangaard Brouer
  2019-05-23 10:56 ` [PATCH bpf-next 3/3] veth: Support " Toshiaki Makita
  2 siblings, 1 reply; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-23 10:56 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf

This is introduced for admins to check what is happening on XDP_TX when
bulk XDP_TX is in use.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
 kernel/bpf/core.c          |  1 +
 2 files changed, 26 insertions(+)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index e95cb86..e06ea65 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -50,6 +50,31 @@
 		  __entry->ifindex)
 );
 
+TRACE_EVENT(xdp_bulk_tx,
+
+	TP_PROTO(const struct net_device *dev,
+		 int sent, int drops, int err),
+
+	TP_ARGS(dev, sent, drops, err),
+
+	TP_STRUCT__entry(
+		__field(int, ifindex)
+		__field(int, drops)
+		__field(int, sent)
+		__field(int, err)
+	),
+
+	TP_fast_assign(
+		__entry->ifindex	= dev->ifindex;
+		__entry->drops		= drops;
+		__entry->sent		= sent;
+		__entry->err		= err;
+	),
+
+	TP_printk("ifindex=%d sent=%d drops=%d err=%d",
+		  __entry->ifindex, __entry->sent, __entry->drops, __entry->err)
+);
+
 DECLARE_EVENT_CLASS(xdp_redirect_template,
 
 	TP_PROTO(const struct net_device *dev,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 242a643..7687488 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2108,3 +2108,4 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
 #include <linux/bpf_trace.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
+EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 10:56 [PATCH bpf-next 0/3] veth: Bulk XDP_TX Toshiaki Makita
  2019-05-23 10:56 ` [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue Toshiaki Makita
  2019-05-23 10:56 ` [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX Toshiaki Makita
@ 2019-05-23 10:56 ` Toshiaki Makita
  2019-05-23 11:25     ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-23 10:56 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf

This improves XDP_TX performance by about 8%.

Here are single core XDP_TX test results. CPU consumptions are taken
from "perf report --no-child".

- Before:

  7.26 Mpps

  _raw_spin_lock  7.83%
  veth_xdp_xmit  12.23%

- After:

  7.84 Mpps

  _raw_spin_lock  1.17%
  veth_xdp_xmit   6.45%

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..4edc75f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	return ret;
 }
 
+static void veth_xdp_flush_bq(struct net_device *dev)
+{
+	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
+	int sent, i, err = 0;
+
+	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
+	if (sent < 0) {
+		err = sent;
+		sent = 0;
+		for (i = 0; i < bq->count; i++)
+			xdp_return_frame(bq->q[i]);
+	}
+	trace_xdp_bulk_tx(dev, sent, bq->count - sent, err);
+
+	bq->count = 0;
+}
+
 static void veth_xdp_flush(struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
@@ -449,6 +466,7 @@ static void veth_xdp_flush(struct net_device *dev)
 	struct veth_rq *rq;
 
 	rcu_read_lock();
+	veth_xdp_flush_bq(dev);
 	rcv = rcu_dereference(priv->peer);
 	if (unlikely(!rcv))
 		goto out;
@@ -466,12 +484,18 @@ static void veth_xdp_flush(struct net_device *dev)
 
 static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
 {
+	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
 	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
 
 	if (unlikely(!frame))
 		return -EOVERFLOW;
 
-	return veth_xdp_xmit(dev, 1, &frame, 0);
+	if (unlikely(bq->count == XDP_TX_BULK_SIZE))
+		veth_xdp_flush_bq(dev);
+
+	bq->q[bq->count++] = frame;
+
+	return 0;
 }
 
 static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue
  2019-05-23 10:56 ` [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue Toshiaki Makita
@ 2019-05-23 11:11     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-05-23 11:11 UTC (permalink / raw)
  To: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf

Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:

> XDP_TX is similar to XDP_REDIRECT as it essentially redirects packets to
> the device itself. XDP_REDIRECT has bulk transmit mechanism to avoid the
> heavy cost of indirect call but it also reduces lock acquisition on the
> destination device that needs locks like veth and tun.
>
> XDP_TX does not use indirect calls but drivers which require locks can
> benefit from the bulk transmit for XDP_TX as well.

XDP_TX happens on the same device, so there's an implicit bulking
happening because of the NAPI cycle. So why is an additional mechanism
needed (in the general case)?

-Toke

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue
  2019-05-23 11:11     ` Toke Høiland-Jørgensen
  (?)
@ 2019-05-23 11:24     ` Toshiaki Makita
  2019-05-23 11:33       ` Toke Høiland-Jørgensen
  -1 siblings, 1 reply; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-23 11:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend
  Cc: netdev, xdp-newbies, bpf

On 2019/05/23 20:11, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
> 
>> XDP_TX is similar to XDP_REDIRECT as it essentially redirects packets to
>> the device itself. XDP_REDIRECT has bulk transmit mechanism to avoid the
>> heavy cost of indirect call but it also reduces lock acquisition on the
>> destination device that needs locks like veth and tun.
>>
>> XDP_TX does not use indirect calls but drivers which require locks can
>> benefit from the bulk transmit for XDP_TX as well.
> 
> XDP_TX happens on the same device, so there's an implicit bulking
> happening because of the NAPI cycle. So why is an additional mechanism
> needed (in the general case)?

Not sure what the implicit bulking you mention is. XDP_TX calls
.ndo_xdp_xmit() for each packet, and it acquires a lock in veth and tun.
To avoid this, we need additional storage for bulking like devmap for
XDP_REDIRECT.

-- 
Toshiaki Makita


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 10:56 ` [PATCH bpf-next 3/3] veth: Support " Toshiaki Makita
@ 2019-05-23 11:25     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-05-23 11:25 UTC (permalink / raw)
  To: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend
  Cc: Toshiaki Makita, netdev, xdp-newbies, bpf

Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:

> This improves XDP_TX performance by about 8%.
>
> Here are single core XDP_TX test results. CPU consumptions are taken
> from "perf report --no-child".
>
> - Before:
>
>   7.26 Mpps
>
>   _raw_spin_lock  7.83%
>   veth_xdp_xmit  12.23%
>
> - After:
>
>   7.84 Mpps
>
>   _raw_spin_lock  1.17%
>   veth_xdp_xmit   6.45%
>
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
>  drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 52110e5..4edc75f 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>  	return ret;
>  }
>  
> +static void veth_xdp_flush_bq(struct net_device *dev)
> +{
> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
> +	int sent, i, err = 0;
> +
> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);

Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
you're introducing an additional per-cpu bulk queue, only to avoid lock
contention around the existing pointer ring. But the pointer ring is
per-rq, so if you have lock contention, this means you must have
multiple CPUs servicing the same rq, no? So why not just fix that
instead?

-Toke

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue
  2019-05-23 11:24     ` Toshiaki Makita
@ 2019-05-23 11:33       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-05-23 11:33 UTC (permalink / raw)
  To: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend
  Cc: netdev, xdp-newbies, bpf

Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:

> On 2019/05/23 20:11, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>> 
>>> XDP_TX is similar to XDP_REDIRECT as it essentially redirects packets to
>>> the device itself. XDP_REDIRECT has bulk transmit mechanism to avoid the
>>> heavy cost of indirect call but it also reduces lock acquisition on the
>>> destination device that needs locks like veth and tun.
>>>
>>> XDP_TX does not use indirect calls but drivers which require locks can
>>> benefit from the bulk transmit for XDP_TX as well.
>> 
>> XDP_TX happens on the same device, so there's an implicit bulking
>> happening because of the NAPI cycle. So why is an additional mechanism
>> needed (in the general case)?
>
> Not sure what the implicit bulking you mention is. XDP_TX calls
> .ndo_xdp_xmit() for each packet, and it acquires a lock in veth and
> tun. To avoid this, we need additional storage for bulking like devmap
> for XDP_REDIRECT.

The bulking is in veth_poll(), where veth_xdp_flush() is only called at
the end. But see my other reply to the veth.c patch for the lock
contention issue...

-Toke

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 11:25     ` Toke Høiland-Jørgensen
  (?)
@ 2019-05-23 11:35     ` Toshiaki Makita
  2019-05-23 12:18       ` Toke Høiland-Jørgensen
  2019-05-23 13:29       ` Jesper Dangaard Brouer
  -1 siblings, 2 replies; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-23 11:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend
  Cc: netdev, xdp-newbies, bpf

On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
> 
>> This improves XDP_TX performance by about 8%.
>>
>> Here are single core XDP_TX test results. CPU consumptions are taken
>> from "perf report --no-child".
>>
>> - Before:
>>
>>   7.26 Mpps
>>
>>   _raw_spin_lock  7.83%
>>   veth_xdp_xmit  12.23%
>>
>> - After:
>>
>>   7.84 Mpps
>>
>>   _raw_spin_lock  1.17%
>>   veth_xdp_xmit   6.45%
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>> ---
>>  drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>  1 file changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 52110e5..4edc75f 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>  	return ret;
>>  }
>>  
>> +static void veth_xdp_flush_bq(struct net_device *dev)
>> +{
>> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>> +	int sent, i, err = 0;
>> +
>> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
> 
> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
> you're introducing an additional per-cpu bulk queue, only to avoid lock
> contention around the existing pointer ring. But the pointer ring is
> per-rq, so if you have lock contention, this means you must have
> multiple CPUs servicing the same rq, no?

Yes, it's possible. Not recommended though.

> So why not just fix that
> instead?

The queues are shared with packets sent from the peer's stack; that's
why I needed the lock. I tried to separate the queues, one for redirect
and one for the stack, but the receiver side got too complicated and it
ended up with worse performance.

-- 
Toshiaki Makita


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 11:35     ` Toshiaki Makita
@ 2019-05-23 12:18       ` Toke Høiland-Jørgensen
  2019-05-23 13:40         ` Toshiaki Makita
  2019-05-23 13:29       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-05-23 12:18 UTC (permalink / raw)
  To: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend
  Cc: netdev, xdp-newbies, bpf

Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:

> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>> 
>>> This improves XDP_TX performance by about 8%.
>>>
>>> Here are single core XDP_TX test results. CPU consumptions are taken
>>> from "perf report --no-child".
>>>
>>> - Before:
>>>
>>>   7.26 Mpps
>>>
>>>   _raw_spin_lock  7.83%
>>>   veth_xdp_xmit  12.23%
>>>
>>> - After:
>>>
>>>   7.84 Mpps
>>>
>>>   _raw_spin_lock  1.17%
>>>   veth_xdp_xmit   6.45%
>>>
>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>> ---
>>>  drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>  1 file changed, 25 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 52110e5..4edc75f 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>>  	return ret;
>>>  }
>>>  
>>> +static void veth_xdp_flush_bq(struct net_device *dev)
>>> +{
>>> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>> +	int sent, i, err = 0;
>>> +
>>> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>> 
>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>> you're introducing an additional per-cpu bulk queue, only to avoid lock
>> contention around the existing pointer ring. But the pointer ring is
>> per-rq, so if you have lock contention, this means you must have
>> multiple CPUs servicing the same rq, no?
>
> Yes, it's possible. Not recommended though.
>
>> So why not just fix that instead?
>
> The queues are shared with packets from stack sent from peer. That's
> because I needed the lock. I have tried to separate the queues, one for
> redirect and one for stack, but receiver side got too complicated and it
> ended up with worse performance.

I meant fix it with configuration. How many receive queues are you
running on the veth device in your benchmarks, and how have you
configured RPS?

-Toke

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX
  2019-05-23 10:56 ` [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX Toshiaki Makita
@ 2019-05-23 13:12   ` Jesper Dangaard Brouer
  2019-05-24  1:33     ` Toshiaki Makita
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-23 13:12 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf, brouer

On Thu, 23 May 2019 19:56:47 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> This is introduced for admins to check what is happening on XDP_TX when
> bulk XDP_TX is in use.
> 
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
>  include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
>  kernel/bpf/core.c          |  1 +
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index e95cb86..e06ea65 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -50,6 +50,31 @@
>  		  __entry->ifindex)
>  );
>  
> +TRACE_EVENT(xdp_bulk_tx,
> +

You are using this tracepoint like/instead of trace_xdp_devmap_xmit if
I understand correctly?  Or maybe the trace_xdp_redirect tracepoint.

The point is that it would be good if the tracepoints could share the
beginning of the TP_STRUCT layout, as that allows attaching and reusing
eBPF code that is only interested in the top part of the struct.

I would also want to see some identifier that trace programs can use to
group and correlate these events. You do have ifindex, but most other
XDP tracepoints also have "prog_id".

> +	TP_PROTO(const struct net_device *dev,
> +		 int sent, int drops, int err),
> +
> +	TP_ARGS(dev, sent, drops, err),
> +
> +	TP_STRUCT__entry(
> +		__field(int, ifindex)
> +		__field(int, drops)
> +		__field(int, sent)
> +		__field(int, err)
> +	),

The xdp_redirect_template have:

	TP_STRUCT__entry(
		__field(int, prog_id)
		__field(u32, act)
		__field(int, ifindex)
		__field(int, err)
		__field(int, to_ifindex)
		__field(u32, map_id)
		__field(int, map_index)
	),

And e.g. TRACE_EVENT xdp_exception have:

	TP_STRUCT__entry(
		__field(int, prog_id)
		__field(u32, act)
		__field(int, ifindex)
	),


> +
> +	TP_fast_assign(
> +		__entry->ifindex	= dev->ifindex;
> +		__entry->drops		= drops;
> +		__entry->sent		= sent;
> +		__entry->err		= err;
> +	),
> +
> +	TP_printk("ifindex=%d sent=%d drops=%d err=%d",
> +		  __entry->ifindex, __entry->sent, __entry->drops, __entry->err)
> +);
> +
>  DECLARE_EVENT_CLASS(xdp_redirect_template,
>  
>  	TP_PROTO(const struct net_device *dev,
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 242a643..7687488 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2108,3 +2108,4 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
>  #include <linux/bpf_trace.h>
>  
>  EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx);



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 11:35     ` Toshiaki Makita
  2019-05-23 12:18       ` Toke Høiland-Jørgensen
@ 2019-05-23 13:29       ` Jesper Dangaard Brouer
  2019-05-23 13:51         ` Toshiaki Makita
  1 sibling, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-23 13:29 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, netdev, xdp-newbies, bpf,
	brouer

On Thu, 23 May 2019 20:35:50 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
> > Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
> >   
> >> This improves XDP_TX performance by about 8%.
> >>
> >> Here are single core XDP_TX test results. CPU consumptions are taken
> >> from "perf report --no-child".
> >>
> >> - Before:
> >>
> >>   7.26 Mpps
> >>
> >>   _raw_spin_lock  7.83%
> >>   veth_xdp_xmit  12.23%
> >>
> >> - After:
> >>
> >>   7.84 Mpps
> >>
> >>   _raw_spin_lock  1.17%
> >>   veth_xdp_xmit   6.45%
> >>
> >> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> >> ---
> >>  drivers/net/veth.c | 26 +++++++++++++++++++++++++-
> >>  1 file changed, 25 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> >> index 52110e5..4edc75f 100644
> >> --- a/drivers/net/veth.c
> >> +++ b/drivers/net/veth.c
> >> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
> >>  	return ret;
> >>  }
> >>  
> >> +static void veth_xdp_flush_bq(struct net_device *dev)
> >> +{
> >> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
> >> +	int sent, i, err = 0;
> >> +
> >> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);  
> > 
> > Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
> > you're introducing an additional per-cpu bulk queue, only to avoid lock
> > contention around the existing pointer ring. But the pointer ring is
> > per-rq, so if you have lock contention, this means you must have
> > multiple CPUs servicing the same rq, no?  
> 
> Yes, it's possible. Not recommended though.
> 

I think the general per-cpu TX bulk queue is overkill.  There is a loop
over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
the caller veth_poll() will call veth_xdp_flush(rq->dev).

Why can't you store this "temp" bulk array in struct veth_rq?

You could even alloc/create it on the stack of veth_poll() and send it
along via a pointer to veth_xdp_rcv().
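
I.e. something along these lines (rough sketch only):

	/* Sketch: make the bulk array a per-rq member instead of per-cpu,
	 * so it is only touched from this rq's NAPI handler.
	 */
	struct veth_rq {
		struct napi_struct	xdp_napi;
		struct net_device	*dev;
		/* ... existing members ... */
		struct xdp_tx_bulk_queue xdp_tx_bq;
	};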

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 12:18       ` Toke Høiland-Jørgensen
@ 2019-05-23 13:40         ` Toshiaki Makita
  0 siblings, 0 replies; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-23 13:40 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Toshiaki Makita,
	Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: netdev, xdp-newbies, bpf

On 19/05/23 (木) 21:18:25, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
> 
>> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>>>
>>>> This improves XDP_TX performance by about 8%.
>>>>
>>>> Here are single core XDP_TX test results. CPU consumptions are taken
>>>> from "perf report --no-child".
>>>>
>>>> - Before:
>>>>
>>>>    7.26 Mpps
>>>>
>>>>    _raw_spin_lock  7.83%
>>>>    veth_xdp_xmit  12.23%
>>>>
>>>> - After:
>>>>
>>>>    7.84 Mpps
>>>>
>>>>    _raw_spin_lock  1.17%
>>>>    veth_xdp_xmit   6.45%
>>>>
>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>> ---
>>>>   drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index 52110e5..4edc75f 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>>>   	return ret;
>>>>   }
>>>>   
>>>> +static void veth_xdp_flush_bq(struct net_device *dev)
>>>> +{
>>>> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>>> +	int sent, i, err = 0;
>>>> +
>>>> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>>>
>>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>>> you're introducing an additional per-cpu bulk queue, only to avoid lock
>>> contention around the existing pointer ring. But the pointer ring is
>>> per-rq, so if you have lock contention, this means you must have
>>> multiple CPUs servicing the same rq, no?
>>
>> Yes, it's possible. Not recommended though.
>>
>>> So why not just fix that instead?
>>
>> The queues are shared with packets from stack sent from peer. That's
>> because I needed the lock. I have tried to separate the queues, one for
>> redirect and one for stack, but receiver side got too complicated and it
>> ended up with worse performance.
> 
> I meant fix it with configuration. Now many receive queues are you
> running on the veth device in your benchmarks, and how have you
> configured the RPS?

As I wrote, this test is a single-queue test and does not have any
contention. The per-packet lock has some overhead even in that
configuration.

Toshiaki Makita


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 13:29       ` Jesper Dangaard Brouer
@ 2019-05-23 13:51         ` Toshiaki Makita
  2019-05-24  3:13           ` Jason Wang
  2019-05-24  9:53           ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-23 13:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Toshiaki Makita
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, netdev, xdp-newbies, bpf

On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:
> On Thu, 23 May 2019 20:35:50 +0900
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>>>    
>>>> This improves XDP_TX performance by about 8%.
>>>>
>>>> Here are single core XDP_TX test results. CPU consumptions are taken
>>>> from "perf report --no-child".
>>>>
>>>> - Before:
>>>>
>>>>    7.26 Mpps
>>>>
>>>>    _raw_spin_lock  7.83%
>>>>    veth_xdp_xmit  12.23%
>>>>
>>>> - After:
>>>>
>>>>    7.84 Mpps
>>>>
>>>>    _raw_spin_lock  1.17%
>>>>    veth_xdp_xmit   6.45%
>>>>
>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>> ---
>>>>   drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index 52110e5..4edc75f 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>>>   	return ret;
>>>>   }
>>>>   
>>>> +static void veth_xdp_flush_bq(struct net_device *dev)
>>>> +{
>>>> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>>> +	int sent, i, err = 0;
>>>> +
>>>> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>>>
>>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>>> you're introducing an additional per-cpu bulk queue, only to avoid lock
>>> contention around the existing pointer ring. But the pointer ring is
>>> per-rq, so if you have lock contention, this means you must have
>>> multiple CPUs servicing the same rq, no?
>>
>> Yes, it's possible. Not recommended though.
>>
> 
> I think the general per-cpu TX bulk queue is overkill.  There is a loop
> over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
> the caller veth_poll() will call veth_xdp_flush(rq->dev).
> 
> Why can't you store this "temp" bulk array in struct veth_rq ?

Of course I can. But I thought tun has the same problem and we can
decrease the memory footprint by sharing the same storage between devices.
Or if other devices want to reduce their queue count so that we can use
XDP on many-cpu servers, and introduce locks for that, we can use this
storage for that case as well.

Do you still prefer a veth-specific solution?

> 
> You could even alloc/create it on the stack of veth_poll() and send it
> along via a pointer to veth_xdp_rcv).
> 

Toshiaki Makita

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX
  2019-05-23 13:12   ` Jesper Dangaard Brouer
@ 2019-05-24  1:33     ` Toshiaki Makita
  0 siblings, 0 replies; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-24  1:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf

On 2019/05/23 22:12, Jesper Dangaard Brouer wrote:
> On Thu, 23 May 2019 19:56:47 +0900
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> This is introduced for admins to check what is happening on XDP_TX when
>> bulk XDP_TX is in use.
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>> ---
>>  include/trace/events/xdp.h | 25 +++++++++++++++++++++++++
>>  kernel/bpf/core.c          |  1 +
>>  2 files changed, 26 insertions(+)
>>
>> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
>> index e95cb86..e06ea65 100644
>> --- a/include/trace/events/xdp.h
>> +++ b/include/trace/events/xdp.h
>> @@ -50,6 +50,31 @@
>>  		  __entry->ifindex)
>>  );
>>  
>> +TRACE_EVENT(xdp_bulk_tx,
>> +
> 
> You are using this tracepoint like/instead of trace_xdp_devmap_xmit if
> I understand correctly?  Or maybe the trace_xdp_redirect tracepoint.

Yes, I have trace_xdp_devmap_xmit in mind, which is for XDP_REDIRECT.

> The point is that is will be good if the tracepoints can share the
> TP_STRUCT layout beginning, as it allows for attaching and reusing eBPF
> code that is only interested in the top part of the struct.

It's a good point, but this tracepoint does not have the map concept, so
it differs from xdp_devmap_xmit.

> I would also want to see some identifier, that trace programs can use
> to group and corrolate these events, you do have ifindex, but most
> other XDP tracepoints also have "prog_id".

I have considered that too. The problem is that we cannot pass a
reliable prog_id since the bulk xmit happens after the RCU critical
section of XDP_TX.
xdp_devmap_xmit does not have prog_id either, and I guess there is a
similar reason for that?

-- 
Toshiaki Makita


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 13:51         ` Toshiaki Makita
@ 2019-05-24  3:13           ` Jason Wang
  2019-05-24  3:28             ` Toshiaki Makita
  2019-05-24  9:53           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 23+ messages in thread
From: Jason Wang @ 2019-05-24  3:13 UTC (permalink / raw)
  To: Toshiaki Makita, Jesper Dangaard Brouer, Toshiaki Makita
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, netdev, xdp-newbies, bpf


On 2019/5/23 下午9:51, Toshiaki Makita wrote:
> On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:
>> On Thu, 23 May 2019 20:35:50 +0900
>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
>>
>>> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
>>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>>>>> This improves XDP_TX performance by about 8%.
>>>>>
>>>>> Here are single core XDP_TX test results. CPU consumptions are taken
>>>>> from "perf report --no-child".
>>>>>
>>>>> - Before:
>>>>>
>>>>>    7.26 Mpps
>>>>>
>>>>>    _raw_spin_lock  7.83%
>>>>>    veth_xdp_xmit  12.23%
>>>>>
>>>>> - After:
>>>>>
>>>>>    7.84 Mpps
>>>>>
>>>>>    _raw_spin_lock  1.17%
>>>>>    veth_xdp_xmit   6.45%
>>>>>
>>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>> ---
>>>>>   drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>>> index 52110e5..4edc75f 100644
>>>>> --- a/drivers/net/veth.c
>>>>> +++ b/drivers/net/veth.c
>>>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device 
>>>>> *dev, int n,
>>>>>       return ret;
>>>>>   }
>>>>>   +static void veth_xdp_flush_bq(struct net_device *dev)
>>>>> +{
>>>>> +    struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>>>> +    int sent, i, err = 0;
>>>>> +
>>>>> +    sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>>>>
>>>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>>>> you're introducing an additional per-cpu bulk queue, only to avoid 
>>>> lock
>>>> contention around the existing pointer ring. But the pointer ring is
>>>> per-rq, so if you have lock contention, this means you must have
>>>> multiple CPUs servicing the same rq, no?
>>>
>>> Yes, it's possible. Not recommended though.
>>>
>>
>> I think the general per-cpu TX bulk queue is overkill.  There is a loop
>> over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
>> the caller veth_poll() will call veth_xdp_flush(rq->dev).
>>
>> Why can't you store this "temp" bulk array in struct veth_rq ?
>
> Of course I can. But I thought tun has the same problem and we can 
> decrease memory footprint by sharing the same storage between devices.


For TUN, and for its fast path where vhost passes a bulk of XDP frames
(through msg_control) to us, we probably just need a temporary bulk
array in tun_xdp_one() instead of a global one. I can post a patch, or
maybe you can, if you're interested in this.

Thanks


> Or if other devices want to reduce queues so that we can use XDP on 
> many-cpu servers and introduce locks, we can use this storage for that 
> case as well.
>
> Still do you prefer veth-specific solution?
>
>>
>> You could even alloc/create it on the stack of veth_poll() and send it
>> along via a pointer to veth_xdp_rcv).
>>
>
> Toshiaki Makita

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-24  3:13           ` Jason Wang
@ 2019-05-24  3:28             ` Toshiaki Makita
  2019-05-24  3:54               ` Jason Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-24  3:28 UTC (permalink / raw)
  To: Jason Wang, Toshiaki Makita, Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, netdev, xdp-newbies, bpf

On 2019/05/24 12:13, Jason Wang wrote:
> On 2019/5/23 下午9:51, Toshiaki Makita wrote:
>> On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:
>>> On Thu, 23 May 2019 20:35:50 +0900
>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
>>>
>>>> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
>>>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>>>>>> This improves XDP_TX performance by about 8%.
>>>>>>
>>>>>> Here are single core XDP_TX test results. CPU consumptions are taken
>>>>>> from "perf report --no-child".
>>>>>>
>>>>>> - Before:
>>>>>>
>>>>>>    7.26 Mpps
>>>>>>
>>>>>>    _raw_spin_lock  7.83%
>>>>>>    veth_xdp_xmit  12.23%
>>>>>>
>>>>>> - After:
>>>>>>
>>>>>>    7.84 Mpps
>>>>>>
>>>>>>    _raw_spin_lock  1.17%
>>>>>>    veth_xdp_xmit   6.45%
>>>>>>
>>>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>>> ---
>>>>>>   drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>>>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>>>> index 52110e5..4edc75f 100644
>>>>>> --- a/drivers/net/veth.c
>>>>>> +++ b/drivers/net/veth.c
>>>>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device
>>>>>> *dev, int n,
>>>>>>       return ret;
>>>>>>   }
>>>>>>   +static void veth_xdp_flush_bq(struct net_device *dev)
>>>>>> +{
>>>>>> +    struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>>>>> +    int sent, i, err = 0;
>>>>>> +
>>>>>> +    sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>>>>>
>>>>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>>>>> you're introducing an additional per-cpu bulk queue, only to avoid
>>>>> lock
>>>>> contention around the existing pointer ring. But the pointer ring is
>>>>> per-rq, so if you have lock contention, this means you must have
>>>>> multiple CPUs servicing the same rq, no?
>>>>
>>>> Yes, it's possible. Not recommended though.
>>>>
>>>
>>> I think the general per-cpu TX bulk queue is overkill.  There is a loop
>>> over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
>>> the caller veth_poll() will call veth_xdp_flush(rq->dev).
>>>
>>> Why can't you store this "temp" bulk array in struct veth_rq ?
>>
>> Of course I can. But I thought tun has the same problem and we can
>> decrease memory footprint by sharing the same storage between devices.
> 
> 
> For TUN and for its fast path where vhost passes a bulk of XDP frames
> (through msg_control) to us, we probably just need a temporary bulk
> array in tun_xdp_one() instead of a global one. I can post patch or
> maybe you if you're interested in this.

Of course you/I can. What I'm concerned about is that it could be a
waste of cache lines when softirq runs the veth napi handler and then
the tun napi handler.

> 
> Thanks
> 
> 
>> Or if other devices want to reduce queues so that we can use XDP on
>> many-cpu servers and introduce locks, we can use this storage for that
>> case as well.
>>
>> Still do you prefer veth-specific solution?
>>
>>>
>>> You could even alloc/create it on the stack of veth_poll() and send it
>>> along via a pointer to veth_xdp_rcv).
>>>
>>
>> Toshiaki Makita
> 
> 

-- 
Toshiaki Makita


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-24  3:28             ` Toshiaki Makita
@ 2019-05-24  3:54               ` Jason Wang
  2019-05-24  4:52                 ` Toshiaki Makita
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Wang @ 2019-05-24  3:54 UTC (permalink / raw)
  To: Toshiaki Makita, Toshiaki Makita, Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, netdev, xdp-newbies, bpf


On 2019/5/24 上午11:28, Toshiaki Makita wrote:
> On 2019/05/24 12:13, Jason Wang wrote:
>> On 2019/5/23 下午9:51, Toshiaki Makita wrote:
>>> On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:
>>>> On Thu, 23 May 2019 20:35:50 +0900
>>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
>>>>
>>>>> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
>>>>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>>>>>>> This improves XDP_TX performance by about 8%.
>>>>>>>
>>>>>>> Here are single core XDP_TX test results. CPU consumptions are taken
>>>>>>> from "perf report --no-child".
>>>>>>>
>>>>>>> - Before:
>>>>>>>
>>>>>>>     7.26 Mpps
>>>>>>>
>>>>>>>     _raw_spin_lock  7.83%
>>>>>>>     veth_xdp_xmit  12.23%
>>>>>>>
>>>>>>> - After:
>>>>>>>
>>>>>>>     7.84 Mpps
>>>>>>>
>>>>>>>     _raw_spin_lock  1.17%
>>>>>>>     veth_xdp_xmit   6.45%
>>>>>>>
>>>>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>>>> ---
>>>>>>>    drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>>>>>    1 file changed, 25 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>>>>> index 52110e5..4edc75f 100644
>>>>>>> --- a/drivers/net/veth.c
>>>>>>> +++ b/drivers/net/veth.c
>>>>>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device
>>>>>>> *dev, int n,
>>>>>>>        return ret;
>>>>>>>    }
>>>>>>>    +static void veth_xdp_flush_bq(struct net_device *dev)
>>>>>>> +{
>>>>>>> +    struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>>>>>> +    int sent, i, err = 0;
>>>>>>> +
>>>>>>> +    sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>>>>>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>>>>>> you're introducing an additional per-cpu bulk queue, only to avoid
>>>>>> lock
>>>>>> contention around the existing pointer ring. But the pointer ring is
>>>>>> per-rq, so if you have lock contention, this means you must have
>>>>>> multiple CPUs servicing the same rq, no?
>>>>> Yes, it's possible. Not recommended though.
>>>>>
>>>> I think the general per-cpu TX bulk queue is overkill.  There is a loop
>>>> over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
>>>> the caller veth_poll() will call veth_xdp_flush(rq->dev).
>>>>
>>>> Why can't you store this "temp" bulk array in struct veth_rq ?
>>> Of course I can. But I thought tun has the same problem and we can
>>> decrease memory footprint by sharing the same storage between devices.
>>
>> For TUN and for its fast path where vhost passes a bulk of XDP frames
>> (through msg_control) to us, we probably just need a temporary bulk
>> array in tun_xdp_one() instead of a global one. I can post patch or
>> maybe you if you're interested in this.
> Of course you/I can. What I'm concerned is that could be waste of cache
> line when softirq runs veth napi handler and then tun napi handler.
>

Well, technically the bulk queue passed to TUN could be reused. I admit
it may save a cacheline in the ideal case, but I wonder how much we could
gain on a real workload. (Note that TUN doesn't use a napi handler to do
XDP; it has a NAPI mode, but that is mainly used for hardening and XDP
was not implemented there. Maybe we should fix this.)

Thanks


>> Thanks
>>
>>
>>> Or if other devices want to reduce queues so that we can use XDP on
>>> many-cpu servers and introduce locks, we can use this storage for that
>>> case as well.
>>>
>>> Still do you prefer veth-specific solution?
>>>
>>>> You could even alloc/create it on the stack of veth_poll() and send it
>>>> along via a pointer to veth_xdp_rcv).
>>>>
>>> Toshiaki Makita
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-24  3:54               ` Jason Wang
@ 2019-05-24  4:52                 ` Toshiaki Makita
  0 siblings, 0 replies; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-24  4:52 UTC (permalink / raw)
  To: Jason Wang, Toshiaki Makita, Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, netdev, xdp-newbies, bpf

On 2019/05/24 12:54, Jason Wang wrote:
> On 2019/5/24 上午11:28, Toshiaki Makita wrote:
>> On 2019/05/24 12:13, Jason Wang wrote:
>>> On 2019/5/23 下午9:51, Toshiaki Makita wrote:
>>>> On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:
>>>>> On Thu, 23 May 2019 20:35:50 +0900
>>>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
>>>>>
>>>>>> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
>>>>>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>>>>>>>> This improves XDP_TX performance by about 8%.
>>>>>>>>
>>>>>>>> Here are single core XDP_TX test results. CPU consumptions are
>>>>>>>> taken
>>>>>>>> from "perf report --no-child".
>>>>>>>>
>>>>>>>> - Before:
>>>>>>>>
>>>>>>>>     7.26 Mpps
>>>>>>>>
>>>>>>>>     _raw_spin_lock  7.83%
>>>>>>>>     veth_xdp_xmit  12.23%
>>>>>>>>
>>>>>>>> - After:
>>>>>>>>
>>>>>>>>     7.84 Mpps
>>>>>>>>
>>>>>>>>     _raw_spin_lock  1.17%
>>>>>>>>     veth_xdp_xmit   6.45%
>>>>>>>>
>>>>>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>>>>> ---
>>>>>>>>    drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>>>>>>    1 file changed, 25 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>>>>>> index 52110e5..4edc75f 100644
>>>>>>>> --- a/drivers/net/veth.c
>>>>>>>> +++ b/drivers/net/veth.c
>>>>>>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device
>>>>>>>> *dev, int n,
>>>>>>>>        return ret;
>>>>>>>>    }
>>>>>>>>    +static void veth_xdp_flush_bq(struct net_device *dev)
>>>>>>>> +{
>>>>>>>> +    struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>>>>>>> +    int sent, i, err = 0;
>>>>>>>> +
>>>>>>>> +    sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
>>>>>>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>>>>>>> you're introducing an additional per-cpu bulk queue, only to avoid
>>>>>>> lock
>>>>>>> contention around the existing pointer ring. But the pointer ring is
>>>>>>> per-rq, so if you have lock contention, this means you must have
>>>>>>> multiple CPUs servicing the same rq, no?
>>>>>> Yes, it's possible. Not recommended though.
>>>>>>
>>>>> I think the general per-cpu TX bulk queue is overkill.  There is a
>>>>> loop
>>>>> over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
>>>>> the caller veth_poll() will call veth_xdp_flush(rq->dev).
>>>>>
>>>>> Why can't you store this "temp" bulk array in struct veth_rq ?
>>>> Of course I can. But I thought tun has the same problem and we can
>>>> decrease memory footprint by sharing the same storage between devices.
>>>
>>> For TUN and for its fast path where vhost passes a bulk of XDP frames
>>> (through msg_control) to us, we probably just need a temporary bulk
>>> array in tun_xdp_one() instead of a global one. I can post patch or
>>> maybe you if you're interested in this.
>> Of course you/I can. What I'm concerned is that could be waste of cache
>> line when softirq runs veth napi handler and then tun napi handler.
>>
> 
> Well, technically the bulk queue passed to TUN could be reused. I admit
> it may save cacheline in ideal case but I wonder how much we could gain
> on real workload.

I see the veth_rq ptr_ring suffering from cacheline misses, which makes
me conservative about adding more buffers for xdp_frames.
I'll wait for some more feedback from others.

> (Note TUN doesn't use napi handler to do XDP, it has a
> NAPI mode but it was mainly used for hardening and XDP was not
> implemented there, maybe we should fix this).

Ah, that's true. Sorry for confusion.

> 
> Thanks
> 
> 
>>> Thanks
>>>
>>>
>>>> Or if other devices want to reduce queues so that we can use XDP on
>>>> many-cpu servers and introduce locks, we can use this storage for that
>>>> case as well.
>>>>
>>>> Still do you prefer veth-specific solution?
>>>>
>>>>> You could even alloc/create it on the stack of veth_poll() and send it
>>>>> along via a pointer to veth_xdp_rcv).
>>>>>
>>>> Toshiaki Makita
>>>
> 
> 

-- 
Toshiaki Makita


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-23 13:51         ` Toshiaki Makita
  2019-05-24  3:13           ` Jason Wang
@ 2019-05-24  9:53           ` Jesper Dangaard Brouer
  2019-05-27  6:08             ` Toshiaki Makita
  1 sibling, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-24  9:53 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, Toke Høiland-Jørgensen,
	Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf, brouer

On Thu, 23 May 2019 22:51:34 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:
> > On Thu, 23 May 2019 20:35:50 +0900
> > Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:  
> >>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
> >>>      
> >>>> This improves XDP_TX performance by about 8%.
> >>>>
> >>>> Here are single core XDP_TX test results. CPU consumptions are taken
> >>>> from "perf report --no-child".
> >>>>
> >>>> - Before:
> >>>>
> >>>>    7.26 Mpps
> >>>>
> >>>>    _raw_spin_lock  7.83%
> >>>>    veth_xdp_xmit  12.23%
> >>>>
> >>>> - After:
> >>>>
> >>>>    7.84 Mpps
> >>>>
> >>>>    _raw_spin_lock  1.17%
> >>>>    veth_xdp_xmit   6.45%
> >>>>
> >>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> >>>> ---
> >>>>   drivers/net/veth.c | 26 +++++++++++++++++++++++++-
> >>>>   1 file changed, 25 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> >>>> index 52110e5..4edc75f 100644
> >>>> --- a/drivers/net/veth.c
> >>>> +++ b/drivers/net/veth.c
> >>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
> >>>>   	return ret;
> >>>>   }
> >>>>   
> >>>> +static void veth_xdp_flush_bq(struct net_device *dev)
> >>>> +{
> >>>> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
> >>>> +	int sent, i, err = 0;
> >>>> +
> >>>> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);  
> >>>
> >>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
> >>> you're introducing an additional per-cpu bulk queue, only to avoid lock
> >>> contention around the existing pointer ring. But the pointer ring is
> >>> per-rq, so if you have lock contention, this means you must have
> >>> multiple CPUs servicing the same rq, no?  
> >>
> >> Yes, it's possible. Not recommended though.
> >>  
> > 
> > I think the general per-cpu TX bulk queue is overkill.  There is a loop
> > over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
> > the caller veth_poll() will call veth_xdp_flush(rq->dev).
> > 
> > Why can't you store this "temp" bulk array in struct veth_rq ?  
> 
> Of course I can. But I thought tun has the same problem and we can 
> decrease memory footprint by sharing the same storage between devices.
> Or if other devices want to reduce queues so that we can use XDP on 
> many-cpu servers and introduce locks, we can use this storage for
> that case as well.
> 
> Still do you prefer veth-specific solution?

Yes.  Another reason is that with this shared/general per-cpu TX bulk
queue, I can easily see bugs resulting in xdp_frames getting transmitted
on a completely different NIC, which will be hard for people to debug.

> > 
> > You could even alloc/create it on the stack of veth_poll() and send
> > it along via a pointer to veth_xdp_rcv).

IMHO it would be cleaner code-wise to place the "temp" bulk array in
struct veth_rq.  But if you worry about performance and want a hot
cacheline for this, then you could just use the call stack of
veth_poll(), as I described.  It should not be too ugly code-wise to do
this, I think.
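
Rough sketch of what I mean (assuming veth_xdp_rcv() and
veth_xdp_flush() grow a bq argument, and veth_xdp_tx() enqueues into
it):

	static int veth_poll(struct napi_struct *napi, int budget)
	{
		struct veth_rq *rq =
			container_of(napi, struct veth_rq, xdp_napi);
		struct xdp_tx_bulk_queue bq;	/* lives on the stack */
		unsigned int xdp_xmit = 0;
		int done;

		bq.count = 0;

		xdp_set_return_frame_no_direct();
		done = veth_xdp_rcv(rq, budget, &xdp_xmit, &bq);

		/* ... existing napi_complete_done()/reschedule logic ... */

		if (xdp_xmit & VETH_XDP_TX)
			veth_xdp_flush(rq->dev, &bq);
		if (xdp_xmit & VETH_XDP_REDIR)
			xdp_do_flush_map();
		xdp_clear_return_frame_no_direct();

		return done;
	}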

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
  2019-05-24  9:53           ` Jesper Dangaard Brouer
@ 2019-05-27  6:08             ` Toshiaki Makita
  0 siblings, 0 replies; 23+ messages in thread
From: Toshiaki Makita @ 2019-05-27  6:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toshiaki Makita, Toke Høiland-Jørgensen,
	Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, netdev,
	xdp-newbies, bpf

On 2019/05/24 18:53, Jesper Dangaard Brouer wrote:
> On Thu, 23 May 2019 22:51:34 +0900
> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> 
>> On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:
>>> On Thu, 23 May 2019 20:35:50 +0900
>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
>>>   
>>>> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:  
>>>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:
>>>>>      
>>>>>> This improves XDP_TX performance by about 8%.
>>>>>>
>>>>>> Here are single core XDP_TX test results. CPU consumptions are taken
>>>>>> from "perf report --no-child".
>>>>>>
>>>>>> - Before:
>>>>>>
>>>>>>    7.26 Mpps
>>>>>>
>>>>>>    _raw_spin_lock  7.83%
>>>>>>    veth_xdp_xmit  12.23%
>>>>>>
>>>>>> - After:
>>>>>>
>>>>>>    7.84 Mpps
>>>>>>
>>>>>>    _raw_spin_lock  1.17%
>>>>>>    veth_xdp_xmit   6.45%
>>>>>>
>>>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>>> ---
>>>>>>   drivers/net/veth.c | 26 +++++++++++++++++++++++++-
>>>>>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>>>> index 52110e5..4edc75f 100644
>>>>>> --- a/drivers/net/veth.c
>>>>>> +++ b/drivers/net/veth.c
>>>>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>>>>>   	return ret;
>>>>>>   }
>>>>>>   
>>>>>> +static void veth_xdp_flush_bq(struct net_device *dev)
>>>>>> +{
>>>>>> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
>>>>>> +	int sent, i, err = 0;
>>>>>> +
>>>>>> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);  
>>>>>
>>>>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
>>>>> you're introducing an additional per-cpu bulk queue, only to avoid lock
>>>>> contention around the existing pointer ring. But the pointer ring is
>>>>> per-rq, so if you have lock contention, this means you must have
>>>>> multiple CPUs servicing the same rq, no?  
>>>>
>>>> Yes, it's possible. Not recommended though.
>>>>  
>>>
>>> I think the general per-cpu TX bulk queue is overkill.  There is a loop
>>> over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
>>> the caller veth_poll() will call veth_xdp_flush(rq->dev).
>>>
>>> Why can't you store this "temp" bulk array in struct veth_rq ?  
>>
>> Of course I can. But I thought tun has the same problem and we can 
>> decrease memory footprint by sharing the same storage between devices.
>> Or if other devices want to reduce queues so that we can use XDP on 
>> many-cpu servers and introduce locks, we can use this storage for
>> that case as well.
>>
>> Still do you prefer veth-specific solution?
> 
> Yes.  Another reason is that with this shared/general per-cpu TX bulk
> queue, I can easily see bugs resulting in xdp_frames getting
> transmitted on a completely other NIC, which will be hard to debug for
> people.
> 
>>>
>>> You could even alloc/create it on the stack of veth_poll() and send
>>> it along via a pointer to veth_xdp_rcv).
> 
> IHMO it would be cleaner code wise to place the "temp" bulk array in
> struct veth_rq.  But if you worry about performance and want a hot
> cacheline for this, then you could just use the call-stack for
> veth_poll(), as I described.  It should not be too ugly code wise to do
> this I think.

Rethinking this, I agree: let's not use the global queue, but use the
stack instead.

For performance you are right, the stack should be as hot as the global
queue if other drivers use the stack as well. I was a bit concerned about
stack size, but 128 bytes (16 frame pointers of 8 bytes each) is probably
acceptable these days.

Wrt debugging, indeed the global solution is probably more difficult.
When we fail to flush the bq, the stack solution can be tracked by
something like kmemleak, but the global one cannot. Also, the global
solution has the risk of sending packets from unintended devices, which
leads to a security problem. With the stack solution a missing flush just
causes packet loss and a memory leak.

-- 
Toshiaki Makita


^ permalink raw reply	[flat|nested] 23+ messages in thread

Thread overview: 23+ messages
2019-05-23 10:56 [PATCH bpf-next 0/3] veth: Bulk XDP_TX Toshiaki Makita
2019-05-23 10:56 ` [PATCH bpf-next 1/3] xdp: Add bulk XDP_TX queue Toshiaki Makita
2019-05-23 11:11   ` Toke Høiland-Jørgensen
2019-05-23 11:24     ` Toshiaki Makita
2019-05-23 11:33       ` Toke Høiland-Jørgensen
2019-05-23 10:56 ` [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX Toshiaki Makita
2019-05-23 13:12   ` Jesper Dangaard Brouer
2019-05-24  1:33     ` Toshiaki Makita
2019-05-23 10:56 ` [PATCH bpf-next 3/3] veth: Support " Toshiaki Makita
2019-05-23 11:25   ` Toke Høiland-Jørgensen
2019-05-23 11:35     ` Toshiaki Makita
2019-05-23 12:18       ` Toke Høiland-Jørgensen
2019-05-23 13:40         ` Toshiaki Makita
2019-05-23 13:29       ` Jesper Dangaard Brouer
2019-05-23 13:51         ` Toshiaki Makita
2019-05-24  3:13           ` Jason Wang
2019-05-24  3:28             ` Toshiaki Makita
2019-05-24  3:54               ` Jason Wang
2019-05-24  4:52                 ` Toshiaki Makita
2019-05-24  9:53           ` Jesper Dangaard Brouer
2019-05-27  6:08             ` Toshiaki Makita
