* [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect
@ 2019-04-12 15:07 Jesper Dangaard Brouer
  2019-04-12 15:07 ` [PATCH bpf-next V2 1/4] bpf: cpumap use ptr_ring_consume_batched Jesper Dangaard Brouer
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-12 15:07 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: songliubraving, Toke Høiland-Jørgensen,
	Ilias Apalodimas, Jesper Dangaard Brouer, ecree, bpf

This patchset utilizes a number of different kernel bulk APIs to optimize
the performance of the XDP cpumap redirect feature.

Benchmark details are available here:
 https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org

The performance measurements can be considered micro benchmarks, as they measure
the rate of dropping packets at different stages in the network stack.
Summary based on the above (numbers are packets per second):

Baseline benchmarks
- baseline-redirect: UdpNoPorts: 3,180,074
- baseline-redirect: iptables-raw drop: 6,193,534

Patch1: bpf: cpumap use ptr_ring_consume_batched
- redirect: UdpNoPorts: 3,327,729
- redirect: iptables-raw drop: 6,321,540

Patch2: net: core: introduce build_skb_around
- redirect: UdpNoPorts: 3,221,303
- redirect: iptables-raw drop: 6,320,066

Patch3: bpf: cpumap do bulk allocation of SKBs
- redirect: UdpNoPorts: 3,290,563
- redirect: iptables-raw drop: 6,650,112

Patch4: bpf: cpumap memory prefetchw optimizations for struct page
- redirect: UdpNoPorts: 3,520,250
- redirect: iptables-raw drop: 7,649,604

In this V2 submission I have chosen to drop the SKB-list patch using
netif_receive_skb_list(), as it did not show a performance improvement for
these micro benchmarks.
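
For orientation, here is a condensed sketch of the resulting cpumap kthread
loop after all four patches (derived from the diffs below; error handling and
the tracepoint are omitted, and names follow kernel/bpf/cpumap.c):

    n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH); /* patch 1 */

    for (i = 0; i < n; i++)                 /* patch 4: prefetch struct page */
            prefetchw(virt_to_page(frames[i]));

    /* patch 3: bulk-allocate SKBs, cleared by the slab via __GFP_ZERO */
    m = kmem_cache_alloc_bulk(skbuff_head_cache, __GFP_ZERO | GFP_ATOMIC,
                              n, skbs);

    local_bh_disable();
    for (i = 0; i < n; i++) {
            /* patch 2: build_skb_around() initializes the pre-allocated SKB */
            struct sk_buff *skb = cpu_map_build_skb(rcpu, frames[i], skbs[i]);

            if (skb)
                    netif_receive_skb_core(skb);
    }
    local_bh_enable();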

---

Jesper Dangaard Brouer (4):
      bpf: cpumap use ptr_ring_consume_batched
      net: core: introduce build_skb_around
      bpf: cpumap do bulk allocation of SKBs
      bpf: cpumap memory prefetchw optimizations for struct page


 include/linux/skbuff.h |    2 +
 kernel/bpf/cpumap.c    |   53 +++++++++++++++++++++++++-----------
 net/core/skbuff.c      |   71 +++++++++++++++++++++++++++++++++++-------------
 3 files changed, 91 insertions(+), 35 deletions(-)

--


* [PATCH bpf-next V2 1/4] bpf: cpumap use ptr_ring_consume_batched
  2019-04-12 15:07 [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
@ 2019-04-12 15:07 ` Jesper Dangaard Brouer
  2019-04-12 15:07 ` [PATCH bpf-next V2 2/4] net: core: introduce build_skb_around Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-12 15:07 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: songliubraving, Toke Høiland-Jørgensen,
	Ilias Apalodimas, Jesper Dangaard Brouer, ecree, bpf

Move the ptr_ring dequeue outside the loop that allocates SKBs and calls into
the network stack, as these operations can take some time. The ptr_ring is a
communication channel between CPUs, where we want to reduce/limit any
cacheline bouncing.

Do a concentrated bulk dequeue via ptr_ring_consume_batched() to shorten the
period and the number of times the remote cacheline in the ptr_ring is read.

A batch size of 8 was chosen to both (1) limit the BH-disable period, and
(2) consume exactly one cacheline on 64-bit archs (8 pointers * 8 bytes =
64 bytes). Once the BH-disable section is reduced further, we can consider
changing this, while still keeping the L1 cacheline size in mind.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 kernel/bpf/cpumap.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3c18260403dd..430103e182a0 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,6 +240,8 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 	}
 }
 
+#define CPUMAP_BATCH 8
+
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
@@ -252,8 +254,9 @@ static int cpu_map_kthread_run(void *data)
 	 * kthread_stop signal until queue is empty.
 	 */
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
-		unsigned int processed = 0, drops = 0, sched = 0;
-		struct xdp_frame *xdpf;
+		unsigned int drops = 0, sched = 0;
+		void *frames[CPUMAP_BATCH];
+		int i, n;
 
 		/* Release CPU reschedule checks */
 		if (__ptr_ring_empty(rcpu->queue)) {
@@ -269,14 +272,16 @@ static int cpu_map_kthread_run(void *data)
 			sched = cond_resched();
 		}
 
-		/* Process packets in rcpu->queue */
-		local_bh_disable();
 		/*
 		 * The bpf_cpu_map_entry is single consumer, with this
 		 * kthread CPU pinned. Lockless access to ptr_ring
 		 * consume side valid as no-resize allowed of queue.
 		 */
-		while ((xdpf = __ptr_ring_consume(rcpu->queue))) {
+		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
+
+		local_bh_disable();
+		for (i = 0; i < n; i++) {
+			struct xdp_frame *xdpf = frames[i];
 			struct sk_buff *skb;
 			int ret;
 
@@ -290,13 +295,9 @@ static int cpu_map_kthread_run(void *data)
 			ret = netif_receive_skb_core(skb);
 			if (ret == NET_RX_DROP)
 				drops++;
-
-			/* Limit BH-disable period */
-			if (++processed == 8)
-				break;
 		}
 		/* Feedback loop via tracepoint */
-		trace_xdp_cpumap_kthread(rcpu->map_id, processed, drops, sched);
+		trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);
 
 		local_bh_enable(); /* resched point, may call do_softirq() */
 	}



* [PATCH bpf-next V2 2/4] net: core: introduce build_skb_around
  2019-04-12 15:07 [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
  2019-04-12 15:07 ` [PATCH bpf-next V2 1/4] bpf: cpumap use ptr_ring_consume_batched Jesper Dangaard Brouer
@ 2019-04-12 15:07 ` Jesper Dangaard Brouer
  2019-04-12 17:59   ` Song Liu
  2019-04-17  3:01   ` Alexei Starovoitov
  2019-04-12 15:07 ` [PATCH bpf-next V2 3/4] bpf: cpumap do bulk allocation of SKBs Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-12 15:07 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: songliubraving, Toke Høiland-Jørgensen,
	Ilias Apalodimas, Jesper Dangaard Brouer, ecree, bpf

The function build_skb() also has the responsibility of allocating and
clearing the SKB structure. Introduce a new function, build_skb_around(),
that moves the responsibility of allocation and clearing to the caller. This
allows the caller to use the kmem_cache (slab/slub) bulk allocation API.

The next patch uses this function combined with kmem_cache_alloc_bulk().
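
As a minimal sketch of the new contract (a hypothetical helper, not part of
this patch): build_skb() effectively becomes "allocate + clear +
build_skb_around()", so a caller that allocates (or bulk-allocates) the
sk_buff itself only needs the last step:

    static struct sk_buff *example_build_skb(void *data, unsigned int frag_size)
    {
            struct sk_buff *skb;

            skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
            if (unlikely(!skb))
                    return NULL;

            /* build_skb_around() assumes a memset-cleared sk_buff */
            memset(skb, 0, offsetof(struct sk_buff, tail));

            return build_skb_around(skb, data, frag_size);
    }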

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |    2 +
 net/core/skbuff.c      |   71 +++++++++++++++++++++++++++++++++++-------------
 2 files changed, 54 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a06275a618f0..e81f2b0e8a83 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1042,6 +1042,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
 			    int node);
 struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
+struct sk_buff *build_skb_around(struct sk_buff *skb,
+				 void *data, unsigned int frag_size);
 
 /**
  * alloc_skb - allocate a network buffer
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9901f5322852..087622298d77 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -258,6 +258,33 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 }
 EXPORT_SYMBOL(__alloc_skb);
 
+/* Caller must provide SKB that is memset cleared */
+static struct sk_buff *__build_skb_around(struct sk_buff *skb,
+					  void *data, unsigned int frag_size)
+{
+	struct skb_shared_info *shinfo;
+	unsigned int size = frag_size ? : ksize(data);
+
+	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	/* Assumes caller memset cleared SKB */
+	skb->truesize = SKB_TRUESIZE(size);
+	refcount_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + size;
+	skb->mac_header = (typeof(skb->mac_header))~0U;
+	skb->transport_header = (typeof(skb->transport_header))~0U;
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+
+	return skb;
+}
+
 /**
  * __build_skb - build a network buffer
  * @data: data buffer provided by caller
@@ -279,32 +306,15 @@ EXPORT_SYMBOL(__alloc_skb);
  */
 struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 {
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
-	unsigned int size = frag_size ? : ksize(data);
 
 	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
-	if (!skb)
+	if (unlikely(!skb))
 		return NULL;
 
-	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-
 	memset(skb, 0, offsetof(struct sk_buff, tail));
-	skb->truesize = SKB_TRUESIZE(size);
-	refcount_set(&skb->users, 1);
-	skb->head = data;
-	skb->data = data;
-	skb_reset_tail_pointer(skb);
-	skb->end = skb->tail + size;
-	skb->mac_header = (typeof(skb->mac_header))~0U;
-	skb->transport_header = (typeof(skb->transport_header))~0U;
 
-	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-
-	return skb;
+	return __build_skb_around(skb, data, frag_size);
 }
 
 /* build_skb() is wrapper over __build_skb(), that specifically
@@ -325,6 +335,29 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 }
 EXPORT_SYMBOL(build_skb);
 
+/**
+ * build_skb_around - build a network buffer around provided skb
+ * @skb: sk_buff provide by caller, must be memset cleared
+ * @data: data buffer provided by caller
+ * @frag_size: size of data, or 0 if head was kmalloced
+ */
+struct sk_buff *build_skb_around(struct sk_buff *skb,
+				 void *data, unsigned int frag_size)
+{
+	if (unlikely(!skb))
+		return NULL;
+
+	skb = __build_skb_around(skb, data, frag_size);
+
+	if (skb && frag_size) {
+		skb->head_frag = 1;
+		if (page_is_pfmemalloc(virt_to_head_page(data)))
+			skb->pfmemalloc = 1;
+	}
+	return skb;
+}
+EXPORT_SYMBOL(build_skb_around);
+
 #define NAPI_SKB_CACHE_SIZE	64
 
 struct napi_alloc_cache {



* [PATCH bpf-next V2 3/4] bpf: cpumap do bulk allocation of SKBs
  2019-04-12 15:07 [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
  2019-04-12 15:07 ` [PATCH bpf-next V2 1/4] bpf: cpumap use ptr_ring_consume_batched Jesper Dangaard Brouer
  2019-04-12 15:07 ` [PATCH bpf-next V2 2/4] net: core: introduce build_skb_around Jesper Dangaard Brouer
@ 2019-04-12 15:07 ` Jesper Dangaard Brouer
  2019-04-12 18:01   ` Song Liu
  2019-04-12 15:07 ` [PATCH bpf-next V2 4/4] bpf: cpumap memory prefetchw optimizations for struct page Jesper Dangaard Brouer
  2019-04-18  2:15 ` [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect Alexei Starovoitov
  4 siblings, 1 reply; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-12 15:07 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: songliubraving, Toke Høiland-Jørgensen,
	Ilias Apalodimas, Jesper Dangaard Brouer, ecree, bpf

As cpumap now bulk consumes xdp_frame's from the ptr_ring, it knows how many
SKBs it needs to allocate. Thus, let's bulk allocate these SKBs via the
kmem_cache_alloc_bulk() API, and use the previously introduced function
build_skb_around().

Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the
memory for us. This clears a larger area than needed, but my micro benchmarks
on Intel CPUs show that this is slightly faster, because a cacheline-aligned
area is cleared for the SKBs. (For the SLUB allocator, there is future
optimization potential, because the SKBs will with high probability originate
from the same page. If we can find/identify contiguous memory areas, then the
Intel CPU 'rep stos' memset will give a real performance gain.)
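
A short sketch of what __GFP_ZERO replaces (illustrative; names follow the
diff below): instead of the partial per-SKB clear that build_skb() performs,
the bulk allocator hands back fully zeroed, cacheline-aligned sk_buff
objects, which build_skb_around() relies on:

    /* per-SKB clear that __build_skb()/build_skb() would otherwise do */
    memset(skb, 0, offsetof(struct sk_buff, tail));

    /* bulk path: whole objects are zeroed by the slab allocator instead */
    gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;

    m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);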

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 kernel/bpf/cpumap.c |   22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 430103e182a0..732d6ced3987 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -160,12 +160,12 @@ static void cpu_map_kthread_stop(struct work_struct *work)
 }
 
 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
-					 struct xdp_frame *xdpf)
+					 struct xdp_frame *xdpf,
+					 struct sk_buff *skb)
 {
 	unsigned int hard_start_headroom;
 	unsigned int frame_size;
 	void *pkt_data_start;
-	struct sk_buff *skb;
 
 	/* Part of headroom was reserved to xdpf */
 	hard_start_headroom = sizeof(struct xdp_frame) +  xdpf->headroom;
@@ -191,8 +191,8 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	pkt_data_start = xdpf->data - hard_start_headroom;
-	skb = build_skb(pkt_data_start, frame_size);
-	if (!skb)
+	skb = build_skb_around(skb, pkt_data_start, frame_size);
+	if (unlikely(!skb))
 		return NULL;
 
 	skb_reserve(skb, hard_start_headroom);
@@ -256,7 +256,9 @@ static int cpu_map_kthread_run(void *data)
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
 		unsigned int drops = 0, sched = 0;
 		void *frames[CPUMAP_BATCH];
-		int i, n;
+		void *skbs[CPUMAP_BATCH];
+		gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
+		int i, n, m;
 
 		/* Release CPU reschedule checks */
 		if (__ptr_ring_empty(rcpu->queue)) {
@@ -278,14 +280,20 @@ static int cpu_map_kthread_run(void *data)
 		 * consume side valid as no-resize allowed of queue.
 		 */
 		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
+		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
+		if (unlikely(m == 0)) {
+			for (i = 0; i < n; i++)
+				skbs[i] = NULL; /* effect: xdp_return_frame */
+			drops = n;
+		}
 
 		local_bh_disable();
 		for (i = 0; i < n; i++) {
 			struct xdp_frame *xdpf = frames[i];
-			struct sk_buff *skb;
+			struct sk_buff *skb = skbs[i];
 			int ret;
 
-			skb = cpu_map_build_skb(rcpu, xdpf);
+			skb = cpu_map_build_skb(rcpu, xdpf, skb);
 			if (!skb) {
 				xdp_return_frame(xdpf);
 				continue;



* [PATCH bpf-next V2 4/4] bpf: cpumap memory prefetchw optimizations for struct page
  2019-04-12 15:07 [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2019-04-12 15:07 ` [PATCH bpf-next V2 3/4] bpf: cpumap do bulk allocation of SKBs Jesper Dangaard Brouer
@ 2019-04-12 15:07 ` Jesper Dangaard Brouer
  2019-04-18  2:15 ` [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect Alexei Starovoitov
  4 siblings, 0 replies; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-12 15:07 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: songliubraving, Toke Høiland-Jørgensen,
	Ilias Apalodimas, Jesper Dangaard Brouer, ecree, bpf

A lot of the performance gain comes from this patch.

While analysing the performance overhead, it was found that the largest CPU
stalls were caused when touching the struct page area. It is first read with
a READ_ONCE from build_skb_around() via page_is_pfmemalloc(), and when the
frame is freed it is written by the page_frag_free() call.

Measurements show that the prefetchw (W) variant of the operation is needed
to achieve the performance gain. We believe this optimization is two-fold:
first, the W-variant saves one step in the cache-coherency protocol, and
second, it helps us avoid the non-temporal prefetch HW optimizations and
brings the data into all cache levels. It might be worth investigating
whether a prefetch into L2 would have the same benefit.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
---
 kernel/bpf/cpumap.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 732d6ced3987..cf727d77c6c6 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -280,6 +280,18 @@ static int cpu_map_kthread_run(void *data)
 		 * consume side valid as no-resize allowed of queue.
 		 */
 		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
+
+		for (i = 0; i < n; i++) {
+			void *f = frames[i];
+			struct page *page = virt_to_page(f);
+
+			/* Bring struct page memory area to curr CPU. Read by
+			 * build_skb_around via page_is_pfmemalloc(), and when
+			 * freed written by page_frag_free call.
+			 */
+			prefetchw(page);
+		}
+
 		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
 		if (unlikely(m == 0)) {
 			for (i = 0; i < n; i++)



* Re: [PATCH bpf-next V2 2/4] net: core: introduce build_skb_around
  2019-04-12 15:07 ` [PATCH bpf-next V2 2/4] net: core: introduce build_skb_around Jesper Dangaard Brouer
@ 2019-04-12 17:59   ` Song Liu
  2019-04-17  3:01   ` Alexei Starovoitov
  1 sibling, 0 replies; 10+ messages in thread
From: Song Liu @ 2019-04-12 17:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Song Liu, Toke Høiland-Jørgensen, Ilias Apalodimas,
	Edward Cree, bpf

On Fri, Apr 12, 2019 at 8:08 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> The function build_skb() also has the responsibility of allocating and
> clearing the SKB structure. Introduce a new function, build_skb_around(),
> that moves the responsibility of allocation and clearing to the caller. This
> allows the caller to use the kmem_cache (slab/slub) bulk allocation API.
>
> The next patch uses this function combined with kmem_cache_alloc_bulk().
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  include/linux/skbuff.h |    2 +
>  net/core/skbuff.c      |   71 +++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 54 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index a06275a618f0..e81f2b0e8a83 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1042,6 +1042,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
>                             int node);
>  struct sk_buff *__build_skb(void *data, unsigned int frag_size);
>  struct sk_buff *build_skb(void *data, unsigned int frag_size);
> +struct sk_buff *build_skb_around(struct sk_buff *skb,
> +                                void *data, unsigned int frag_size);
>
>  /**
>   * alloc_skb - allocate a network buffer
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9901f5322852..087622298d77 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -258,6 +258,33 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  }
>  EXPORT_SYMBOL(__alloc_skb);
>
> +/* Caller must provide SKB that is memset cleared */
> +static struct sk_buff *__build_skb_around(struct sk_buff *skb,
> +                                         void *data, unsigned int frag_size)
> +{
> +       struct skb_shared_info *shinfo;
> +       unsigned int size = frag_size ? : ksize(data);
> +
> +       size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +
> +       /* Assumes caller memset cleared SKB */
> +       skb->truesize = SKB_TRUESIZE(size);
> +       refcount_set(&skb->users, 1);
> +       skb->head = data;
> +       skb->data = data;
> +       skb_reset_tail_pointer(skb);
> +       skb->end = skb->tail + size;
> +       skb->mac_header = (typeof(skb->mac_header))~0U;
> +       skb->transport_header = (typeof(skb->transport_header))~0U;
> +
> +       /* make sure we initialize shinfo sequentially */
> +       shinfo = skb_shinfo(skb);
> +       memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +       atomic_set(&shinfo->dataref, 1);
> +
> +       return skb;
> +}
> +
>  /**
>   * __build_skb - build a network buffer
>   * @data: data buffer provided by caller
> @@ -279,32 +306,15 @@ EXPORT_SYMBOL(__alloc_skb);
>   */
>  struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>  {
> -       struct skb_shared_info *shinfo;
>         struct sk_buff *skb;
> -       unsigned int size = frag_size ? : ksize(data);
>
>         skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> -       if (!skb)
> +       if (unlikely(!skb))
>                 return NULL;
>
> -       size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -
>         memset(skb, 0, offsetof(struct sk_buff, tail));
> -       skb->truesize = SKB_TRUESIZE(size);
> -       refcount_set(&skb->users, 1);
> -       skb->head = data;
> -       skb->data = data;
> -       skb_reset_tail_pointer(skb);
> -       skb->end = skb->tail + size;
> -       skb->mac_header = (typeof(skb->mac_header))~0U;
> -       skb->transport_header = (typeof(skb->transport_header))~0U;
>
> -       /* make sure we initialize shinfo sequentially */
> -       shinfo = skb_shinfo(skb);
> -       memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> -       atomic_set(&shinfo->dataref, 1);
> -
> -       return skb;
> +       return __build_skb_around(skb, data, frag_size);
>  }
>
>  /* build_skb() is wrapper over __build_skb(), that specifically
> @@ -325,6 +335,29 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
>  }
>  EXPORT_SYMBOL(build_skb);
>
> +/**
> + * build_skb_around - build a network buffer around provided skb
> + * @skb: sk_buff provide by caller, must be memset cleared
> + * @data: data buffer provided by caller
> + * @frag_size: size of data, or 0 if head was kmalloced
> + */
> +struct sk_buff *build_skb_around(struct sk_buff *skb,
> +                                void *data, unsigned int frag_size)
> +{
> +       if (unlikely(!skb))
> +               return NULL;
> +
> +       skb = __build_skb_around(skb, data, frag_size);
> +
> +       if (skb && frag_size) {
> +               skb->head_frag = 1;
> +               if (page_is_pfmemalloc(virt_to_head_page(data)))
> +                       skb->pfmemalloc = 1;
> +       }
> +       return skb;
> +}
> +EXPORT_SYMBOL(build_skb_around);
> +
>  #define NAPI_SKB_CACHE_SIZE    64
>
>  struct napi_alloc_cache {
>


* Re: [PATCH bpf-next V2 3/4] bpf: cpumap do bulk allocation of SKBs
  2019-04-12 15:07 ` [PATCH bpf-next V2 3/4] bpf: cpumap do bulk allocation of SKBs Jesper Dangaard Brouer
@ 2019-04-12 18:01   ` Song Liu
  0 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2019-04-12 18:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Song Liu, Toke Høiland-Jørgensen, Ilias Apalodimas,
	Edward Cree, bpf

On Fri, Apr 12, 2019 at 8:08 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> As cpumap now bulk consumes xdp_frame's from the ptr_ring, it knows how many
> SKBs it needs to allocate. Thus, let's bulk allocate these SKBs via the
> kmem_cache_alloc_bulk() API, and use the previously introduced function
> build_skb_around().
>
> Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the
> memory for us. This clears a larger area than needed, but my micro benchmarks
> on Intel CPUs show that this is slightly faster, because a cacheline-aligned
> area is cleared for the SKBs. (For the SLUB allocator, there is future
> optimization potential, because the SKBs will with high probability originate
> from the same page. If we can find/identify contiguous memory areas, then the
> Intel CPU 'rep stos' memset will give a real performance gain.)
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  kernel/bpf/cpumap.c |   22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index 430103e182a0..732d6ced3987 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -160,12 +160,12 @@ static void cpu_map_kthread_stop(struct work_struct *work)
>  }
>
>  static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
> -                                        struct xdp_frame *xdpf)
> +                                        struct xdp_frame *xdpf,
> +                                        struct sk_buff *skb)
>  {
>         unsigned int hard_start_headroom;
>         unsigned int frame_size;
>         void *pkt_data_start;
> -       struct sk_buff *skb;
>
>         /* Part of headroom was reserved to xdpf */
>         hard_start_headroom = sizeof(struct xdp_frame) +  xdpf->headroom;
> @@ -191,8 +191,8 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
>                 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>
>         pkt_data_start = xdpf->data - hard_start_headroom;
> -       skb = build_skb(pkt_data_start, frame_size);
> -       if (!skb)
> +       skb = build_skb_around(skb, pkt_data_start, frame_size);
> +       if (unlikely(!skb))
>                 return NULL;
>
>         skb_reserve(skb, hard_start_headroom);
> @@ -256,7 +256,9 @@ static int cpu_map_kthread_run(void *data)
>         while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
>                 unsigned int drops = 0, sched = 0;
>                 void *frames[CPUMAP_BATCH];
> -               int i, n;
> +               void *skbs[CPUMAP_BATCH];
> +               gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
> +               int i, n, m;
>
>                 /* Release CPU reschedule checks */
>                 if (__ptr_ring_empty(rcpu->queue)) {
> @@ -278,14 +280,20 @@ static int cpu_map_kthread_run(void *data)
>                  * consume side valid as no-resize allowed of queue.
>                  */
>                 n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
> +               m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
> +               if (unlikely(m == 0)) {
> +                       for (i = 0; i < n; i++)
> +                               skbs[i] = NULL; /* effect: xdp_return_frame */
> +                       drops = n;
> +               }
>
>                 local_bh_disable();
>                 for (i = 0; i < n; i++) {
>                         struct xdp_frame *xdpf = frames[i];
> -                       struct sk_buff *skb;
> +                       struct sk_buff *skb = skbs[i];
>                         int ret;
>
> -                       skb = cpu_map_build_skb(rcpu, xdpf);
> +                       skb = cpu_map_build_skb(rcpu, xdpf, skb);
>                         if (!skb) {
>                                 xdp_return_frame(xdpf);
>                                 continue;
>


* Re: [PATCH bpf-next V2 2/4] net: core: introduce build_skb_around
  2019-04-12 15:07 ` [PATCH bpf-next V2 2/4] net: core: introduce build_skb_around Jesper Dangaard Brouer
  2019-04-12 17:59   ` Song Liu
@ 2019-04-17  3:01   ` Alexei Starovoitov
  2019-04-17  5:20     ` Eric Dumazet
  1 sibling, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2019-04-17  3:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet
  Cc: Network Development, Daniel Borkmann, David S. Miller, Song Liu,
	Toke Høiland-Jørgensen, Ilias Apalodimas, Edward Cree,
	bpf

On Fri, Apr 12, 2019 at 8:07 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> The function build_skb() also has the responsibility of allocating and
> clearing the SKB structure. Introduce a new function, build_skb_around(),
> that moves the responsibility of allocation and clearing to the caller. This
> allows the caller to use the kmem_cache (slab/slub) bulk allocation API.
>
> The next patch uses this function combined with kmem_cache_alloc_bulk().
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/skbuff.h |    2 +
>  net/core/skbuff.c      |   71 +++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 54 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index a06275a618f0..e81f2b0e8a83 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1042,6 +1042,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
>                             int node);
>  struct sk_buff *__build_skb(void *data, unsigned int frag_size);
>  struct sk_buff *build_skb(void *data, unsigned int frag_size);
> +struct sk_buff *build_skb_around(struct sk_buff *skb,
> +                                void *data, unsigned int frag_size);
>
>  /**
>   * alloc_skb - allocate a network buffer
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9901f5322852..087622298d77 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -258,6 +258,33 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  }
>  EXPORT_SYMBOL(__alloc_skb);
>
> +/* Caller must provide SKB that is memset cleared */
> +static struct sk_buff *__build_skb_around(struct sk_buff *skb,
> +                                         void *data, unsigned int frag_size)
> +{
> +       struct skb_shared_info *shinfo;
> +       unsigned int size = frag_size ? : ksize(data);
> +
> +       size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +
> +       /* Assumes caller memset cleared SKB */
> +       skb->truesize = SKB_TRUESIZE(size);
> +       refcount_set(&skb->users, 1);
> +       skb->head = data;
> +       skb->data = data;
> +       skb_reset_tail_pointer(skb);
> +       skb->end = skb->tail + size;
> +       skb->mac_header = (typeof(skb->mac_header))~0U;
> +       skb->transport_header = (typeof(skb->transport_header))~0U;
> +
> +       /* make sure we initialize shinfo sequentially */
> +       shinfo = skb_shinfo(skb);
> +       memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +       atomic_set(&shinfo->dataref, 1);
> +
> +       return skb;
> +}
> +
>  /**
>   * __build_skb - build a network buffer
>   * @data: data buffer provided by caller
> @@ -279,32 +306,15 @@ EXPORT_SYMBOL(__alloc_skb);
>   */
>  struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>  {
> -       struct skb_shared_info *shinfo;
>         struct sk_buff *skb;
> -       unsigned int size = frag_size ? : ksize(data);
>
>         skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> -       if (!skb)
> +       if (unlikely(!skb))
>                 return NULL;
>
> -       size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -
>         memset(skb, 0, offsetof(struct sk_buff, tail));
> -       skb->truesize = SKB_TRUESIZE(size);
> -       refcount_set(&skb->users, 1);
> -       skb->head = data;
> -       skb->data = data;
> -       skb_reset_tail_pointer(skb);
> -       skb->end = skb->tail + size;
> -       skb->mac_header = (typeof(skb->mac_header))~0U;
> -       skb->transport_header = (typeof(skb->transport_header))~0U;
>
> -       /* make sure we initialize shinfo sequentially */
> -       shinfo = skb_shinfo(skb);
> -       memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> -       atomic_set(&shinfo->dataref, 1);
> -
> -       return skb;
> +       return __build_skb_around(skb, data, frag_size);
>  }

It looks good to me, but I'd like to see Eric's Ack before applying.


* Re: [PATCH bpf-next V2 2/4] net: core: introduce build_skb_around
  2019-04-17  3:01   ` Alexei Starovoitov
@ 2019-04-17  5:20     ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2019-04-17  5:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Network Development, Daniel Borkmann,
	David S. Miller, Song Liu, Toke Høiland-Jørgensen,
	Ilias Apalodimas, Edward Cree, bpf

On Tue, Apr 16, 2019 at 8:01 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:

> It looks good to me, but I'd like to see Eric's Ack before applying.

This seems just fine, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>


* Re: [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect
  2019-04-12 15:07 [PATCH bpf-next V2 0/4] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2019-04-12 15:07 ` [PATCH bpf-next V2 4/4] bpf: cpumap memory prefetchw optimizations for struct page Jesper Dangaard Brouer
@ 2019-04-18  2:15 ` Alexei Starovoitov
  4 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2019-04-18  2:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Network Development, Daniel Borkmann, David S. Miller, Song Liu,
	Toke Høiland-Jørgensen, Ilias Apalodimas, Edward Cree,
	bpf

On Fri, Apr 12, 2019 at 8:07 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> This patchset utilizes a number of different kernel bulk APIs to optimize
> the performance of the XDP cpumap redirect feature.
>
> Benchmark details are available here:
>  https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org
>
> The performance measurements can be considered micro benchmarks, as they measure
> the rate of dropping packets at different stages in the network stack.
> Summary based on the above (numbers are packets per second):
>
> Baseline benchmarks
> - baseline-redirect: UdpNoPorts: 3,180,074
> - baseline-redirect: iptables-raw drop: 6,193,534
>
> Patch1: bpf: cpumap use ptr_ring_consume_batched
> - redirect: UdpNoPorts: 3,327,729
> - redirect: iptables-raw drop: 6,321,540
>
> Patch2: net: core: introduce build_skb_around
> - redirect: UdpNoPorts: 3,221,303
> - redirect: iptables-raw drop: 6,320,066
>
> Patch3: bpf: cpumap do bulk allocation of SKBs
> - redirect: UdpNoPorts: 3,290,563
> - redirect: iptables-raw drop: 6,650,112
>
> Patch4: bpf: cpumap memory prefetchw optimizations for struct page
> - redirect: UdpNoPorts: 3,520,250
> - redirect: iptables-raw drop: 7,649,604
>
> In this V2 submission I have chosen to drop the SKB-list patch using
> netif_receive_skb_list(), as it did not show a performance improvement for
> these micro benchmarks.

Applied. Thanks!

