* [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect
@ 2019-04-10 11:43 Jesper Dangaard Brouer
  2019-04-10 11:43 ` [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched Jesper Dangaard Brouer
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-10 11:43 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: Ilias Apalodimas, bpf, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

This patchset utilizes a number of different kernel bulk APIs to optimize
the performance of the XDP cpumap redirect feature.

Patch-1: ptr_ring batch consume
Patch-2: Send SKB-lists to network stack
Patch-3: Introduce SKB helper to alloc SKB outside net-core
Patch-4: kmem_cache bulk alloc of SKBs
Patch-5: Prefetch struct page to solve CPU stall

---

Jesper Dangaard Brouer (5):
      bpf: cpumap use ptr_ring_consume_batched
      bpf: cpumap use netif_receive_skb_list
      net: core: introduce build_skb_around
      bpf: cpumap do bulk allocation of SKBs
      bpf: cpumap memory prefetchw optimizations for struct page


 include/linux/netdevice.h |    1 +
 include/linux/skbuff.h    |    2 +
 kernel/bpf/cpumap.c       |   66 +++++++++++++++++++++++++++++-------------
 net/core/dev.c            |   18 +++++++++++
 net/core/skbuff.c         |   71 +++++++++++++++++++++++++++++++++------------
 5 files changed, 118 insertions(+), 40 deletions(-)

--

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched
  2019-04-10 11:43 [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
@ 2019-04-10 11:43 ` Jesper Dangaard Brouer
  2019-04-10 23:24   ` Song Liu
  2019-04-10 11:43 ` [PATCH bpf-next 2/5] bpf: cpumap use netif_receive_skb_list Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-10 11:43 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: Ilias Apalodimas, bpf, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

Move the ptr_ring dequeue outside the loop that allocates SKBs and calls the
network stack, as these operations can take some time. The ptr_ring is a
communication channel between CPUs, where we want to reduce/limit any
cacheline bouncing.

Do a concentrated bulk dequeue via ptr_ring_consume_batched(), to shorten the
period and the number of times the remote cacheline in the ptr_ring is read.

A batch size of 8 is chosen both to (1) limit the BH-disable period, and (2)
consume exactly one cacheline on 64-bit archs (8 pointers * 8 bytes = 64
bytes). Once the BH-disable section is reduced further, we can consider
changing this, while still keeping the L1 cacheline size in mind.
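
To make the batching concrete, here is a minimal sketch of the consume side
in cpu_map_kthread_run() (simplified from the diff below; error handling and
the tracepoint are omitted):

	void *frames[CPUMAP_BATCH];
	int i, n;

	/* One batched read of the shared ring, done outside the
	 * BH-disabled section, instead of one __ptr_ring_consume()
	 * call per frame inside it.
	 */
	n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);

	local_bh_disable();
	for (i = 0; i < n; i++) {
		struct xdp_frame *xdpf = frames[i];

		/* ... build an SKB from xdpf and inject it into the
		 * network stack ...
		 */
	}
	local_bh_enable();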

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 kernel/bpf/cpumap.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3c18260403dd..430103e182a0 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,6 +240,8 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 	}
 }
 
+#define CPUMAP_BATCH 8
+
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
@@ -252,8 +254,9 @@ static int cpu_map_kthread_run(void *data)
 	 * kthread_stop signal until queue is empty.
 	 */
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
-		unsigned int processed = 0, drops = 0, sched = 0;
-		struct xdp_frame *xdpf;
+		unsigned int drops = 0, sched = 0;
+		void *frames[CPUMAP_BATCH];
+		int i, n;
 
 		/* Release CPU reschedule checks */
 		if (__ptr_ring_empty(rcpu->queue)) {
@@ -269,14 +272,16 @@ static int cpu_map_kthread_run(void *data)
 			sched = cond_resched();
 		}
 
-		/* Process packets in rcpu->queue */
-		local_bh_disable();
 		/*
 		 * The bpf_cpu_map_entry is single consumer, with this
 		 * kthread CPU pinned. Lockless access to ptr_ring
 		 * consume side valid as no-resize allowed of queue.
 		 */
-		while ((xdpf = __ptr_ring_consume(rcpu->queue))) {
+		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
+
+		local_bh_disable();
+		for (i = 0; i < n; i++) {
+			struct xdp_frame *xdpf = frames[i];
 			struct sk_buff *skb;
 			int ret;
 
@@ -290,13 +295,9 @@ static int cpu_map_kthread_run(void *data)
 			ret = netif_receive_skb_core(skb);
 			if (ret == NET_RX_DROP)
 				drops++;
-
-			/* Limit BH-disable period */
-			if (++processed == 8)
-				break;
 		}
 		/* Feedback loop via tracepoint */
-		trace_xdp_cpumap_kthread(rcpu->map_id, processed, drops, sched);
+		trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);
 
 		local_bh_enable(); /* resched point, may call do_softirq() */
 	}


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 2/5] bpf: cpumap use netif_receive_skb_list
  2019-04-10 11:43 [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
  2019-04-10 11:43 ` [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched Jesper Dangaard Brouer
@ 2019-04-10 11:43 ` Jesper Dangaard Brouer
  2019-04-10 18:56   ` Edward Cree
  2019-04-10 11:43 ` [PATCH bpf-next 3/5] net: core: introduce build_skb_around Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-10 11:43 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: Ilias Apalodimas, bpf, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

Reduce the BH-disable period further by moving cpu_map_build_skb()
outside/before invoking the network stack, and build up an skb_list
that is then handed to netif_receive_skb_list. This is also
an I-cache optimization.

When injecting packets into the network stack, cpumap uses a special
function named netif_receive_skb_core(), and we create an equivalent
list version named netif_receive_skb_list_core().
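
A minimal sketch of the resulting flow (simplified from the diff below;
frames[] and n come from the batched dequeue in the previous patch):

	struct list_head skb_list;

	INIT_LIST_HEAD(&skb_list);

	/* Build all SKBs first, outside the BH-disabled section ... */
	for (i = 0; i < n; i++) {
		struct xdp_frame *xdpf = frames[i];
		struct sk_buff *skb = cpu_map_build_skb(rcpu, xdpf);

		if (!skb) {
			xdp_return_frame(xdpf);
			continue;
		}
		list_add_tail(&skb->list, &skb_list);
	}

	/* ... then inject the whole list into the stack in one go. */
	local_bh_disable();
	netif_receive_skb_list_core(&skb_list);
	local_bh_enable();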

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |    1 +
 kernel/bpf/cpumap.c       |   17 ++++++++++-------
 net/core/dev.c            |   18 ++++++++++++++++++
 3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 166fdc0a78b4..37e78dc9f30a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3621,6 +3621,7 @@ int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
 void netif_receive_skb_list(struct list_head *head);
+void netif_receive_skb_list_core(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 430103e182a0..cb93df200cd0 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -256,6 +256,7 @@ static int cpu_map_kthread_run(void *data)
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
 		unsigned int drops = 0, sched = 0;
 		void *frames[CPUMAP_BATCH];
+		struct list_head skb_list;
 		int i, n;
 
 		/* Release CPU reschedule checks */
@@ -279,23 +280,25 @@ static int cpu_map_kthread_run(void *data)
 		 */
 		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
 
-		local_bh_disable();
+		INIT_LIST_HEAD(&skb_list);
+
 		for (i = 0; i < n; i++) {
 			struct xdp_frame *xdpf = frames[i];
 			struct sk_buff *skb;
-			int ret;
 
 			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
 				xdp_return_frame(xdpf);
 				continue;
 			}
-
-			/* Inject into network stack */
-			ret = netif_receive_skb_core(skb);
-			if (ret == NET_RX_DROP)
-				drops++;
+			list_add_tail(&skb->list, &skb_list);
 		}
+
+		local_bh_disable();
+
+		/* Inject into network stack */
+		netif_receive_skb_list_core(&skb_list);
+
 		/* Feedback loop via tracepoint */
 		trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 9ca2d3abfd1a..1dee7bd895a0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5297,6 +5297,24 @@ void netif_receive_skb_list(struct list_head *head)
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 
+/**
+ *	netif_receive_skb_list_core - special version of netif_receive_skb_list
+ *	@head: list of skbs to process.
+ *
+ *	More direct receive version of netif_receive_skb_list().  It should
+ *	only be used by callers that have a need to skip RPS and Generic XDP.
+ *
+ *	This function may only be called from softirq context and interrupts
+ *	should be enabled.
+ */
+void netif_receive_skb_list_core(struct list_head *head)
+{
+	rcu_read_lock();
+	__netif_receive_skb_list(head);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(netif_receive_skb_list_core);
+
 DEFINE_PER_CPU(struct work_struct, flush_works);
 
 /* Network device is going away, flush any packets still pending */


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 3/5] net: core: introduce build_skb_around
  2019-04-10 11:43 [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
  2019-04-10 11:43 ` [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched Jesper Dangaard Brouer
  2019-04-10 11:43 ` [PATCH bpf-next 2/5] bpf: cpumap use netif_receive_skb_list Jesper Dangaard Brouer
@ 2019-04-10 11:43 ` Jesper Dangaard Brouer
  2019-04-10 23:34   ` Song Liu
  2019-04-11  5:33   ` Ilias Apalodimas
  2019-04-10 11:43 ` [PATCH bpf-next 4/5] bpf: cpumap do bulk allocation of SKBs Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-10 11:43 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: Ilias Apalodimas, bpf, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

The function build_skb() also has the responsibility to allocate and clear
the SKB structure. Introduce a new function build_skb_around(), that moves
the responsibility of allocation and clearing to the caller. This allows the
caller to use the kmem_cache (slab/slub) bulk allocation API.

The next patch uses this function combined with kmem_cache_alloc_bulk().
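
The intended caller-side pattern, as a minimal sketch (data[i] and frag_size
stand in for the caller's packet buffer and its size, BATCH for the batch
size; the real user is the cpumap code in the next patch):

	void *skbs[BATCH];
	gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;	/* skb must be memset cleared */
	int i, m;

	/* Caller now owns allocation and clearing of the sk_buff structs,
	 * so it can use the slab/slub bulk allocation API.
	 */
	m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, BATCH, skbs);

	for (i = 0; i < m; i++) {
		struct sk_buff *skb;

		skb = build_skb_around(skbs[i], data[i], frag_size);
		if (unlikely(!skb))
			continue;	/* only happens if skbs[i] was NULL */
		/* ... use skb ... */
	}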

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |    2 +
 net/core/skbuff.c      |   71 +++++++++++++++++++++++++++++++++++-------------
 2 files changed, 54 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9027a8c4219f..c40ffab8a9b0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1044,6 +1044,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
 			    int node);
 struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
+struct sk_buff *build_skb_around(struct sk_buff *skb,
+				 void *data, unsigned int frag_size);
 
 /**
  * alloc_skb - allocate a network buffer
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4782f9354dd1..d904b6e5fe08 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -258,6 +258,33 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 }
 EXPORT_SYMBOL(__alloc_skb);
 
+/* Caller must provide SKB that is memset cleared */
+static struct sk_buff *__build_skb_around(struct sk_buff *skb,
+					  void *data, unsigned int frag_size)
+{
+	struct skb_shared_info *shinfo;
+	unsigned int size = frag_size ? : ksize(data);
+
+	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	/* Assumes caller memset cleared SKB */
+	skb->truesize = SKB_TRUESIZE(size);
+	refcount_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + size;
+	skb->mac_header = (typeof(skb->mac_header))~0U;
+	skb->transport_header = (typeof(skb->transport_header))~0U;
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+
+	return skb;
+}
+
 /**
  * __build_skb - build a network buffer
  * @data: data buffer provided by caller
@@ -279,32 +306,15 @@ EXPORT_SYMBOL(__alloc_skb);
  */
 struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 {
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
-	unsigned int size = frag_size ? : ksize(data);
 
 	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
-	if (!skb)
+	if (unlikely(!skb))
 		return NULL;
 
-	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-
 	memset(skb, 0, offsetof(struct sk_buff, tail));
-	skb->truesize = SKB_TRUESIZE(size);
-	refcount_set(&skb->users, 1);
-	skb->head = data;
-	skb->data = data;
-	skb_reset_tail_pointer(skb);
-	skb->end = skb->tail + size;
-	skb->mac_header = (typeof(skb->mac_header))~0U;
-	skb->transport_header = (typeof(skb->transport_header))~0U;
 
-	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-
-	return skb;
+	return __build_skb_around(skb, data, frag_size);
 }
 
 /* build_skb() is wrapper over __build_skb(), that specifically
@@ -325,6 +335,29 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 }
 EXPORT_SYMBOL(build_skb);
 
+/**
+ * build_skb_around - build a network buffer around provided skb
+ * @skb: sk_buff provide by caller, must be memset cleared
+ * @data: data buffer provided by caller
+ * @frag_size: size of data, or 0 if head was kmalloced
+ */
+struct sk_buff *build_skb_around(struct sk_buff *skb,
+				 void *data, unsigned int frag_size)
+{
+	if (unlikely(!skb))
+		return NULL;
+
+	skb = __build_skb_around(skb, data, frag_size);
+
+	if (skb && frag_size) {
+		skb->head_frag = 1;
+		if (page_is_pfmemalloc(virt_to_head_page(data)))
+			skb->pfmemalloc = 1;
+	}
+	return skb;
+}
+EXPORT_SYMBOL(build_skb_around);
+
 #define NAPI_SKB_CACHE_SIZE	64
 
 struct napi_alloc_cache {


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 4/5] bpf: cpumap do bulk allocation of SKBs
  2019-04-10 11:43 [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2019-04-10 11:43 ` [PATCH bpf-next 3/5] net: core: introduce build_skb_around Jesper Dangaard Brouer
@ 2019-04-10 11:43 ` Jesper Dangaard Brouer
  2019-04-10 23:30   ` Song Liu
  2019-04-10 11:43 ` [PATCH bpf-next 5/5] bpf: cpumap memory prefetchw optimizations for struct page Jesper Dangaard Brouer
  2019-04-10 23:36 ` [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Song Liu
  5 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-10 11:43 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: Ilias Apalodimas, bpf, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

As cpumap now batch consumes xdp_frame's from the ptr_ring, it knows how many
SKBs it needs to allocate. Thus, let's bulk allocate these SKBs via the
kmem_cache_alloc_bulk() API, and use the previously introduced function
build_skb_around().

Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the
memory for us. This does clear a larger area than needed, but my micro
benchmarks on Intel CPUs show that this is slightly faster, because a
cacheline-aligned area is cleared for the SKBs. (For the SLUB allocator,
there is future optimization potential, because SKBs will with high
probability originate from the same page. If we can find/identify contiguous
memory areas, then the Intel CPU "rep stos" memset will give a real
performance gain.)
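
To make the __GFP_ZERO point concrete, a minimal sketch of the allocation
step (mirroring the diff below):

	void *skbs[CPUMAP_BATCH];
	gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
	int i, m;

	/* __GFP_ZERO makes the slab clear the whole sk_buff, so
	 * build_skb_around() can assume a memset-cleared skb and we skip
	 * the per-skb memset(skb, 0, offsetof(struct sk_buff, tail)) that
	 * __build_skb() would otherwise do.
	 */
	m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
	if (unlikely(m == 0)) {
		/* Bulk alloc failed: NULL the slots so cpu_map_build_skb()
		 * returns NULL and each xdp_frame is returned via
		 * xdp_return_frame().
		 */
		for (i = 0; i < n; i++)
			skbs[i] = NULL;
		drops = n;
	}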

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 kernel/bpf/cpumap.c |   22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index cb93df200cd0..b82a11556ad5 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -160,12 +160,12 @@ static void cpu_map_kthread_stop(struct work_struct *work)
 }
 
 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
-					 struct xdp_frame *xdpf)
+					 struct xdp_frame *xdpf,
+					 struct sk_buff *skb)
 {
 	unsigned int hard_start_headroom;
 	unsigned int frame_size;
 	void *pkt_data_start;
-	struct sk_buff *skb;
 
 	/* Part of headroom was reserved to xdpf */
 	hard_start_headroom = sizeof(struct xdp_frame) +  xdpf->headroom;
@@ -191,8 +191,8 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	pkt_data_start = xdpf->data - hard_start_headroom;
-	skb = build_skb(pkt_data_start, frame_size);
-	if (!skb)
+	skb = build_skb_around(skb, pkt_data_start, frame_size);
+	if (unlikely(!skb))
 		return NULL;
 
 	skb_reserve(skb, hard_start_headroom);
@@ -256,8 +256,10 @@ static int cpu_map_kthread_run(void *data)
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
 		unsigned int drops = 0, sched = 0;
 		void *frames[CPUMAP_BATCH];
+		void *skbs[CPUMAP_BATCH];
 		struct list_head skb_list;
-		int i, n;
+		gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
+		int i, n, m;
 
 		/* Release CPU reschedule checks */
 		if (__ptr_ring_empty(rcpu->queue)) {
@@ -279,14 +281,20 @@ static int cpu_map_kthread_run(void *data)
 		 * consume side valid as no-resize allowed of queue.
 		 */
 		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
+		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
+		if (unlikely(m == 0)) {
+			for (i = 0; i < n; i++)
+				skbs[i] = NULL; /* effect: xdp_return_frame */
+			drops = n;
+		}
 
 		INIT_LIST_HEAD(&skb_list);
 
 		for (i = 0; i < n; i++) {
 			struct xdp_frame *xdpf = frames[i];
-			struct sk_buff *skb;
+			struct sk_buff *skb = skbs[i];
 
-			skb = cpu_map_build_skb(rcpu, xdpf);
+			skb = cpu_map_build_skb(rcpu, xdpf, skb);
 			if (!skb) {
 				xdp_return_frame(xdpf);
 				continue;


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 5/5] bpf: cpumap memory prefetchw optimizations for struct page
  2019-04-10 11:43 [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2019-04-10 11:43 ` [PATCH bpf-next 4/5] bpf: cpumap do bulk allocation of SKBs Jesper Dangaard Brouer
@ 2019-04-10 11:43 ` Jesper Dangaard Brouer
  2019-04-10 23:35   ` Song Liu
  2019-04-11  5:47   ` Ilias Apalodimas
  2019-04-10 23:36 ` [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Song Liu
  5 siblings, 2 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-10 11:43 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller
  Cc: Ilias Apalodimas, bpf, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

A lot of the performance gain comes from this patch.

While analysing the performance overhead, it was found that the largest CPU
stalls were caused when touching the struct page area. It is first read with
a READ_ONCE from build_skb_around() via page_is_pfmemalloc(), and later
written by the page_frag_free() call when the page is freed.

Measurements show that the prefetchw (W) variant of the operation is needed
to achieve the performance gain. We believe this optimization is two-fold:
first, the W-variant saves one step in the cache-coherency protocol, and
second, it helps us avoid the non-temporal prefetch HW optimizations and
brings this into all cache-levels. It might be worth investigating whether
a prefetch into L2 will have the same benefit.
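
A minimal sketch of the prefetch step and the later accesses it targets
(simplified from the diff below):

	for (i = 0; i < n; i++) {
		struct page *page = virt_to_page(frames[i]);

		/* Bring the struct page cacheline to this CPU in a state
		 * ready for write.  It is read shortly after by
		 * build_skb_around() via page_is_pfmemalloc(), and written
		 * by page_frag_free() when the frame memory is freed.
		 */
		prefetchw(page);
	}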

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 kernel/bpf/cpumap.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index b82a11556ad5..4758482ab5b9 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -281,6 +281,18 @@ static int cpu_map_kthread_run(void *data)
 		 * consume side valid as no-resize allowed of queue.
 		 */
 		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
+
+		for (i = 0; i < n; i++) {
+			void *f = frames[i];
+			struct page *page = virt_to_page(f);
+
+			/* Bring struct page memory area to curr CPU. Read by
+			 * build_skb_around via page_is_pfmemalloc(), and when
+			 * freed written by page_frag_free call.
+			 */
+			prefetchw(page);
+		}
+
 		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
 		if (unlikely(m == 0)) {
 			for (i = 0; i < n; i++)


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: cpumap use netif_receive_skb_list
  2019-04-10 11:43 ` [PATCH bpf-next 2/5] bpf: cpumap use netif_receive_skb_list Jesper Dangaard Brouer
@ 2019-04-10 18:56   ` Edward Cree
  0 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2019-04-10 18:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, David S. Miller
  Cc: Ilias Apalodimas, bpf, Toke Høiland-Jørgensen

On 10/04/2019 12:43, Jesper Dangaard Brouer wrote:
> Reduce the BH-disable period further by moving cpu_map_build_skb()
> outside/before invoking the network stack, and build up an skb_list
> that is then handed to netif_receive_skb_list. This is also
> an I-cache optimization.
>
> When injecting packets into the network stack, cpumap uses a special
> function named netif_receive_skb_core(), and we create an equivalent
> list version named netif_receive_skb_list_core().
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/netdevice.h |    1 +
>  kernel/bpf/cpumap.c       |   17 ++++++++++-------
>  net/core/dev.c            |   18 ++++++++++++++++++
>  3 files changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 166fdc0a78b4..37e78dc9f30a 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3621,6 +3621,7 @@ int netif_rx_ni(struct sk_buff *skb);
>  int netif_receive_skb(struct sk_buff *skb);
>  int netif_receive_skb_core(struct sk_buff *skb);
>  void netif_receive_skb_list(struct list_head *head);
> +void netif_receive_skb_list_core(struct list_head *head);
>  gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
>  void napi_gro_flush(struct napi_struct *napi, bool flush_old);
>  struct sk_buff *napi_get_frags(struct napi_struct *napi);
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index 430103e182a0..cb93df200cd0 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -256,6 +256,7 @@ static int cpu_map_kthread_run(void *data)
>  	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
>  		unsigned int drops = 0, sched = 0;
>  		void *frames[CPUMAP_BATCH];
> +		struct list_head skb_list;
>  		int i, n;
>  
>  		/* Release CPU reschedule checks */
> @@ -279,23 +280,25 @@ static int cpu_map_kthread_run(void *data)
>  		 */
>  		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
>  
> -		local_bh_disable();
> +		INIT_LIST_HEAD(&skb_list);
> +
>  		for (i = 0; i < n; i++) {
>  			struct xdp_frame *xdpf = frames[i];
>  			struct sk_buff *skb;
> -			int ret;
>  
>  			skb = cpu_map_build_skb(rcpu, xdpf);
>  			if (!skb) {
>  				xdp_return_frame(xdpf);
>  				continue;
>  			}
> -
> -			/* Inject into network stack */
> -			ret = netif_receive_skb_core(skb);
> -			if (ret == NET_RX_DROP)
> -				drops++;
You're losing this `drops` incrementation and not doing anything to
 replace it...
> +			list_add_tail(&skb->list, &skb_list);
>  		}
> +
> +		local_bh_disable();
> +
> +		/* Inject into network stack */
> +		netif_receive_skb_list_core(&skb_list);
> +
>  		/* Feedback loop via tracepoint */
>  		trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);
... yet still feeding it to the tracepoint here.

I ran into something similar with my list-GRO patches (callers wanted to
 know how many packets from the list were received vs. dropped); check
 those to see how I wired that counting all the way through the listified
 stack.
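
As a rough, hypothetical sketch (not necessarily how the list-GRO patches
did it), the list-receive helper could report drops back to the caller:

	/* hypothetical signature, returning the number of dropped skbs */
	int netif_receive_skb_list_core(struct list_head *head);

	local_bh_disable();
	drops += netif_receive_skb_list_core(&skb_list);
	trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);
	local_bh_enable();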

Apart from that, I like this!

-Ed
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 9ca2d3abfd1a..1dee7bd895a0 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5297,6 +5297,24 @@ void netif_receive_skb_list(struct list_head *head)
>  }
>  EXPORT_SYMBOL(netif_receive_skb_list);
>  
> +/**
> + *	netif_receive_skb_list_core - special version of netif_receive_skb_list
> + *	@head: list of skbs to process.
> + *
> + *	More direct receive version of netif_receive_skb_list().  It should
> + *	only be used by callers that have a need to skip RPS and Generic XDP.
> + *
> + *	This function may only be called from softirq context and interrupts
> + *	should be enabled.
> + */
> +void netif_receive_skb_list_core(struct list_head *head)
> +{
> +	rcu_read_lock();
> +	__netif_receive_skb_list(head);
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(netif_receive_skb_list_core);
> +
>  DEFINE_PER_CPU(struct work_struct, flush_works);
>  
>  /* Network device is going away, flush any packets still pending */
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched
  2019-04-10 11:43 ` [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched Jesper Dangaard Brouer
@ 2019-04-10 23:24   ` Song Liu
  2019-04-11 11:23     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2019-04-10 23:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen

On Wed, Apr 10, 2019 at 6:03 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> Move the ptr_ring dequeue outside the loop that allocates SKBs and calls the
> network stack, as these operations can take some time. The ptr_ring is a
> communication channel between CPUs, where we want to reduce/limit any
> cacheline bouncing.
>
> Do a concentrated bulk dequeue via ptr_ring_consume_batched(), to shorten the
> period and the number of times the remote cacheline in the ptr_ring is read.
>
> A batch size of 8 is chosen both to (1) limit the BH-disable period, and (2)
> consume exactly one cacheline on 64-bit archs (8 pointers * 8 bytes = 64
> bytes). Once the BH-disable section is reduced further, we can consider
> changing this, while still keeping the L1 cacheline size in mind.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  kernel/bpf/cpumap.c |   21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index 3c18260403dd..430103e182a0 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -240,6 +240,8 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
>         }
>  }
>
> +#define CPUMAP_BATCH 8
> +
>  static int cpu_map_kthread_run(void *data)
>  {
>         struct bpf_cpu_map_entry *rcpu = data;
> @@ -252,8 +254,9 @@ static int cpu_map_kthread_run(void *data)
>          * kthread_stop signal until queue is empty.
>          */
>         while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
> -               unsigned int processed = 0, drops = 0, sched = 0;
> -               struct xdp_frame *xdpf;
> +               unsigned int drops = 0, sched = 0;
> +               void *frames[CPUMAP_BATCH];
> +               int i, n;
>
>                 /* Release CPU reschedule checks */
>                 if (__ptr_ring_empty(rcpu->queue)) {
> @@ -269,14 +272,16 @@ static int cpu_map_kthread_run(void *data)
>                         sched = cond_resched();
>                 }
>
> -               /* Process packets in rcpu->queue */
> -               local_bh_disable();
>                 /*
>                  * The bpf_cpu_map_entry is single consumer, with this
>                  * kthread CPU pinned. Lockless access to ptr_ring
>                  * consume side valid as no-resize allowed of queue.
>                  */
> -               while ((xdpf = __ptr_ring_consume(rcpu->queue))) {
> +               n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
> +
> +               local_bh_disable();
> +               for (i = 0; i < n; i++) {
> +                       struct xdp_frame *xdpf = frames[i];
>                         struct sk_buff *skb;
>                         int ret;
>
> @@ -290,13 +295,9 @@ static int cpu_map_kthread_run(void *data)
>                         ret = netif_receive_skb_core(skb);
>                         if (ret == NET_RX_DROP)
>                                 drops++;
> -
> -                       /* Limit BH-disable period */
> -                       if (++processed == 8)
> -                               break;
>                 }
>                 /* Feedback loop via tracepoint */
> -               trace_xdp_cpumap_kthread(rcpu->map_id, processed, drops, sched);
> +               trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);

btw: can we do the tracepoint after local_bh_enable()?

>
>                 local_bh_enable(); /* resched point, may call do_softirq() */
>         }
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf: cpumap do bulk allocation of SKBs
  2019-04-10 11:43 ` [PATCH bpf-next 4/5] bpf: cpumap do bulk allocation of SKBs Jesper Dangaard Brouer
@ 2019-04-10 23:30   ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2019-04-10 23:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen

On Wed, Apr 10, 2019 at 6:02 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> As cpumap now batch consumes xdp_frame's from the ptr_ring, it knows how many
> SKBs it needs to allocate. Thus, let's bulk allocate these SKBs via the
> kmem_cache_alloc_bulk() API, and use the previously introduced function
> build_skb_around().
>
> Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the
> memory for us. This does clear a larger area than needed, but my micro
> benchmarks on Intel CPUs show that this is slightly faster, because a
> cacheline-aligned area is cleared for the SKBs. (For the SLUB allocator,
> there is future optimization potential, because SKBs will with high
> probability originate from the same page. If we can find/identify contiguous
> memory areas, then the Intel CPU "rep stos" memset will give a real
> performance gain.)
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  kernel/bpf/cpumap.c |   22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index cb93df200cd0..b82a11556ad5 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -160,12 +160,12 @@ static void cpu_map_kthread_stop(struct work_struct *work)
>  }
>
>  static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
> -                                        struct xdp_frame *xdpf)
> +                                        struct xdp_frame *xdpf,
> +                                        struct sk_buff *skb)
>  {
>         unsigned int hard_start_headroom;
>         unsigned int frame_size;
>         void *pkt_data_start;
> -       struct sk_buff *skb;
>
>         /* Part of headroom was reserved to xdpf */
>         hard_start_headroom = sizeof(struct xdp_frame) +  xdpf->headroom;
> @@ -191,8 +191,8 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
>                 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>
>         pkt_data_start = xdpf->data - hard_start_headroom;
> -       skb = build_skb(pkt_data_start, frame_size);
> -       if (!skb)
> +       skb = build_skb_around(skb, pkt_data_start, frame_size);
> +       if (unlikely(!skb))
>                 return NULL;
>
>         skb_reserve(skb, hard_start_headroom);
> @@ -256,8 +256,10 @@ static int cpu_map_kthread_run(void *data)
>         while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
>                 unsigned int drops = 0, sched = 0;
>                 void *frames[CPUMAP_BATCH];
> +               void *skbs[CPUMAP_BATCH];
>                 struct list_head skb_list;
> -               int i, n;
> +               gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
> +               int i, n, m;
>
>                 /* Release CPU reschedule checks */
>                 if (__ptr_ring_empty(rcpu->queue)) {
> @@ -279,14 +281,20 @@ static int cpu_map_kthread_run(void *data)
>                  * consume side valid as no-resize allowed of queue.
>                  */
>                 n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
> +               m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
> +               if (unlikely(m == 0)) {
> +                       for (i = 0; i < n; i++)
> +                               skbs[i] = NULL; /* effect: xdp_return_frame */
> +                       drops = n;
> +               }
>
>                 INIT_LIST_HEAD(&skb_list);
>
>                 for (i = 0; i < n; i++) {
>                         struct xdp_frame *xdpf = frames[i];
> -                       struct sk_buff *skb;
> +                       struct sk_buff *skb = skbs[i];
>
> -                       skb = cpu_map_build_skb(rcpu, xdpf);
> +                       skb = cpu_map_build_skb(rcpu, xdpf, skb);
>                         if (!skb) {
>                                 xdp_return_frame(xdpf);
>                                 continue;
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 3/5] net: core: introduce build_skb_around
  2019-04-10 11:43 ` [PATCH bpf-next 3/5] net: core: introduce build_skb_around Jesper Dangaard Brouer
@ 2019-04-10 23:34   ` Song Liu
  2019-04-11 15:39     ` Jesper Dangaard Brouer
  2019-04-11  5:33   ` Ilias Apalodimas
  1 sibling, 1 reply; 21+ messages in thread
From: Song Liu @ 2019-04-10 23:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen

On Wed, Apr 10, 2019 at 6:02 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> The function build_skb() also has the responsibility to allocate and clear
> the SKB structure. Introduce a new function build_skb_around(), that moves
> the responsibility of allocation and clearing to the caller. This allows the
> caller to use the kmem_cache (slab/slub) bulk allocation API.
>
> The next patch uses this function combined with kmem_cache_alloc_bulk().
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/skbuff.h |    2 +
>  net/core/skbuff.c      |   71 +++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 54 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 9027a8c4219f..c40ffab8a9b0 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1044,6 +1044,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
>                             int node);
>  struct sk_buff *__build_skb(void *data, unsigned int frag_size);
>  struct sk_buff *build_skb(void *data, unsigned int frag_size);
> +struct sk_buff *build_skb_around(struct sk_buff *skb,
> +                                void *data, unsigned int frag_size);
>
>  /**
>   * alloc_skb - allocate a network buffer
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4782f9354dd1..d904b6e5fe08 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -258,6 +258,33 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  }
>  EXPORT_SYMBOL(__alloc_skb);
>
> +/* Caller must provide SKB that is memset cleared */
> +static struct sk_buff *__build_skb_around(struct sk_buff *skb,
> +                                         void *data, unsigned int frag_size)
> +{
> +       struct skb_shared_info *shinfo;
> +       unsigned int size = frag_size ? : ksize(data);
> +
> +       size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +
> +       /* Assumes caller memset cleared SKB */
> +       skb->truesize = SKB_TRUESIZE(size);
> +       refcount_set(&skb->users, 1);
> +       skb->head = data;
> +       skb->data = data;
> +       skb_reset_tail_pointer(skb);
> +       skb->end = skb->tail + size;
> +       skb->mac_header = (typeof(skb->mac_header))~0U;
> +       skb->transport_header = (typeof(skb->transport_header))~0U;
> +
> +       /* make sure we initialize shinfo sequentially */
> +       shinfo = skb_shinfo(skb);
> +       memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +       atomic_set(&shinfo->dataref, 1);
> +
> +       return skb;
> +}
> +
>  /**
>   * __build_skb - build a network buffer
>   * @data: data buffer provided by caller
> @@ -279,32 +306,15 @@ EXPORT_SYMBOL(__alloc_skb);
>   */
>  struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>  {
> -       struct skb_shared_info *shinfo;
>         struct sk_buff *skb;
> -       unsigned int size = frag_size ? : ksize(data);
>
>         skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> -       if (!skb)
> +       if (unlikely(!skb))
>                 return NULL;
>
> -       size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -
>         memset(skb, 0, offsetof(struct sk_buff, tail));
> -       skb->truesize = SKB_TRUESIZE(size);
> -       refcount_set(&skb->users, 1);
> -       skb->head = data;
> -       skb->data = data;
> -       skb_reset_tail_pointer(skb);
> -       skb->end = skb->tail + size;
> -       skb->mac_header = (typeof(skb->mac_header))~0U;
> -       skb->transport_header = (typeof(skb->transport_header))~0U;
>
> -       /* make sure we initialize shinfo sequentially */
> -       shinfo = skb_shinfo(skb);
> -       memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> -       atomic_set(&shinfo->dataref, 1);
> -
> -       return skb;
> +       return __build_skb_around(skb, data, frag_size);
>  }
>
>  /* build_skb() is wrapper over __build_skb(), that specifically
> @@ -325,6 +335,29 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
>  }
>  EXPORT_SYMBOL(build_skb);
>
> +/**
> + * build_skb_around - build a network buffer around provided skb
> + * @skb: sk_buff provide by caller, must be memset cleared
> + * @data: data buffer provided by caller
> + * @frag_size: size of data, or 0 if head was kmalloced
> + */
> +struct sk_buff *build_skb_around(struct sk_buff *skb,
> +                                void *data, unsigned int frag_size)
> +{
> +       if (unlikely(!skb))
> +               return NULL;
> +
> +       skb = __build_skb_around(skb, data, frag_size);


> +
> +       if (skb && frag_size) {
> +               skb->head_frag = 1;
> +               if (page_is_pfmemalloc(virt_to_head_page(data)))
> +                       skb->pfmemalloc = 1;
> +       }

I didn't find any explanation of this part (head_frag, pfmemalloc).
Shall we split it out to a separate patch?

Thanks,
Song

> +       return skb;
> +}
> +EXPORT_SYMBOL(build_skb_around);
> +
>  #define NAPI_SKB_CACHE_SIZE    64
>
>  struct napi_alloc_cache {
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 5/5] bpf: cpumap memory prefetchw optimizations for struct page
  2019-04-10 11:43 ` [PATCH bpf-next 5/5] bpf: cpumap memory prefetchw optimizations for struct page Jesper Dangaard Brouer
@ 2019-04-10 23:35   ` Song Liu
  2019-04-11  5:47   ` Ilias Apalodimas
  1 sibling, 0 replies; 21+ messages in thread
From: Song Liu @ 2019-04-10 23:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen

On Wed, Apr 10, 2019 at 6:02 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> A lot of the performance gain comes from this patch.
>
> While analysing the performance overhead, it was found that the largest CPU
> stalls were caused when touching the struct page area. It is first read with
> a READ_ONCE from build_skb_around() via page_is_pfmemalloc(), and later
> written by the page_frag_free() call when the page is freed.
>
> Measurements show that the prefetchw (W) variant of the operation is needed
> to achieve the performance gain. We believe this optimization is two-fold:
> first, the W-variant saves one step in the cache-coherency protocol, and
> second, it helps us avoid the non-temporal prefetch HW optimizations and
> brings this into all cache-levels. It might be worth investigating whether
> a prefetch into L2 will have the same benefit.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Song Liu <songliubraving@fb.com>


> ---
>  kernel/bpf/cpumap.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index b82a11556ad5..4758482ab5b9 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -281,6 +281,18 @@ static int cpu_map_kthread_run(void *data)
>                  * consume side valid as no-resize allowed of queue.
>                  */
>                 n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
> +
> +               for (i = 0; i < n; i++) {
> +                       void *f = frames[i];
> +                       struct page *page = virt_to_page(f);
> +
> +                       /* Bring struct page memory area to curr CPU. Read by
> +                        * build_skb_around via page_is_pfmemalloc(), and when
> +                        * freed written by page_frag_free call.
> +                        */
> +                       prefetchw(page);
> +               }
> +
>                 m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
>                 if (unlikely(m == 0)) {
>                         for (i = 0; i < n; i++)
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect
  2019-04-10 11:43 [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2019-04-10 11:43 ` [PATCH bpf-next 5/5] bpf: cpumap memory prefetchw optimizations for struct page Jesper Dangaard Brouer
@ 2019-04-10 23:36 ` Song Liu
  2019-04-11 13:18   ` Jesper Dangaard Brouer
  5 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2019-04-10 23:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen

On Wed, Apr 10, 2019 at 6:00 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> This patchset utilizes a number of different kernel bulk APIs to optimize
> the performance of the XDP cpumap redirect feature.

Could you please share some numbers about the optimization?

Thanks,
Song

>
> Patch-1: ptr_ring batch consume
> Patch-2: Send SKB-lists to network stack
> Patch-3: Introduce SKB helper to alloc SKB outside net-core
> Patch-4: kmem_cache bulk alloc of SKBs
> Patch-5: Prefetch struct page to solve CPU stall
>
> ---
>
> Jesper Dangaard Brouer (5):
>       bpf: cpumap use ptr_ring_consume_batched
>       bpf: cpumap use netif_receive_skb_list
>       net: core: introduce build_skb_around
>       bpf: cpumap do bulk allocation of SKBs
>       bpf: cpumap memory prefetchw optimizations for struct page
>
>
>  include/linux/netdevice.h |    1 +
>  include/linux/skbuff.h    |    2 +
>  kernel/bpf/cpumap.c       |   66 +++++++++++++++++++++++++++++-------------
>  net/core/dev.c            |   18 +++++++++++
>  net/core/skbuff.c         |   71 +++++++++++++++++++++++++++++++++------------
>  5 files changed, 118 insertions(+), 40 deletions(-)
>
> --

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 3/5] net: core: introduce build_skb_around
  2019-04-10 11:43 ` [PATCH bpf-next 3/5] net: core: introduce build_skb_around Jesper Dangaard Brouer
  2019-04-10 23:34   ` Song Liu
@ 2019-04-11  5:33   ` Ilias Apalodimas
  2019-04-11 11:17     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 21+ messages in thread
From: Ilias Apalodimas @ 2019-04-11  5:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	bpf, Toke Høiland-Jørgensen

On Wed, Apr 10, 2019 at 01:43:47PM +0200, Jesper Dangaard Brouer wrote:
> The function build_skb() also has the responsibility to allocate and clear
> the SKB structure. Introduce a new function build_skb_around(), that moves
> the responsibility of allocation and clearing to the caller. This allows the
> caller to use the kmem_cache (slab/slub) bulk allocation API.
>
> The next patch uses this function combined with kmem_cache_alloc_bulk().
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/skbuff.h |    2 +
>  net/core/skbuff.c      |   71 +++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 54 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 9027a8c4219f..c40ffab8a9b0 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1044,6 +1044,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
>  			    int node);
>  struct sk_buff *__build_skb(void *data, unsigned int frag_size);
>  struct sk_buff *build_skb(void *data, unsigned int frag_size);
> +struct sk_buff *build_skb_around(struct sk_buff *skb,
> +				 void *data, unsigned int frag_size);
>  
>  /**
>   * alloc_skb - allocate a network buffer
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4782f9354dd1..d904b6e5fe08 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -258,6 +258,33 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  }
>  EXPORT_SYMBOL(__alloc_skb);
>  
> +/* Caller must provide SKB that is memset cleared */
> +static struct sk_buff *__build_skb_around(struct sk_buff *skb,
> +					  void *data, unsigned int frag_size)
> +{
> +	struct skb_shared_info *shinfo;
> +	unsigned int size = frag_size ? : ksize(data);
> +
> +	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +
> +	/* Assumes caller memset cleared SKB */
> +	skb->truesize = SKB_TRUESIZE(size);
> +	refcount_set(&skb->users, 1);
> +	skb->head = data;
> +	skb->data = data;
> +	skb_reset_tail_pointer(skb);
> +	skb->end = skb->tail + size;
> +	skb->mac_header = (typeof(skb->mac_header))~0U;
> +	skb->transport_header = (typeof(skb->transport_header))~0U;
> +
> +	/* make sure we initialize shinfo sequentially */
> +	shinfo = skb_shinfo(skb);
> +	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +	atomic_set(&shinfo->dataref, 1);
> +
> +	return skb;
> +}
> +
>  /**
>   * __build_skb - build a network buffer
>   * @data: data buffer provided by caller
> @@ -279,32 +306,15 @@ EXPORT_SYMBOL(__alloc_skb);
>   */
>  struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>  {
> -	struct skb_shared_info *shinfo;
>  	struct sk_buff *skb;
> -	unsigned int size = frag_size ? : ksize(data);
>  
>  	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> -	if (!skb)
> +	if (unlikely(!skb))
>  		return NULL;
>  
> -	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -
>  	memset(skb, 0, offsetof(struct sk_buff, tail));
> -	skb->truesize = SKB_TRUESIZE(size);
> -	refcount_set(&skb->users, 1);
> -	skb->head = data;
> -	skb->data = data;
> -	skb_reset_tail_pointer(skb);
> -	skb->end = skb->tail + size;
> -	skb->mac_header = (typeof(skb->mac_header))~0U;
> -	skb->transport_header = (typeof(skb->transport_header))~0U;
>  
> -	/* make sure we initialize shinfo sequentially */
> -	shinfo = skb_shinfo(skb);
> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> -	atomic_set(&shinfo->dataref, 1);
> -
> -	return skb;
> +	return __build_skb_around(skb, data, frag_size);
>  }
>  
>  /* build_skb() is wrapper over __build_skb(), that specifically
> @@ -325,6 +335,29 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
>  }
>  EXPORT_SYMBOL(build_skb);
>  
> +/**
> + * build_skb_around - build a network buffer around provided skb
> + * @skb: sk_buff provide by caller, must be memset cleared
> + * @data: data buffer provided by caller
> + * @frag_size: size of data, or 0 if head was kmalloced
> + */
> +struct sk_buff *build_skb_around(struct sk_buff *skb,
> +				 void *data, unsigned int frag_size)
> +{
> +	if (unlikely(!skb))
Maybe add a warning here, indicating the buffer *must* be there before calling
this?

> +		return NULL;
> +
> +	skb = __build_skb_around(skb, data, frag_size);
> +
> +	if (skb && frag_size) {
> +		skb->head_frag = 1;
> +		if (page_is_pfmemalloc(virt_to_head_page(data)))
> +			skb->pfmemalloc = 1;
> +	}
> +	return skb;
> +}
> +EXPORT_SYMBOL(build_skb_around);
> +
>  #define NAPI_SKB_CACHE_SIZE	64
>  
>  struct napi_alloc_cache {
> 

/Ilias

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 5/5] bpf: cpumap memory prefetchw optimizations for struct page
  2019-04-10 11:43 ` [PATCH bpf-next 5/5] bpf: cpumap memory prefetchw optimizations for struct page Jesper Dangaard Brouer
  2019-04-10 23:35   ` Song Liu
@ 2019-04-11  5:47   ` Ilias Apalodimas
  1 sibling, 0 replies; 21+ messages in thread
From: Ilias Apalodimas @ 2019-04-11  5:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	bpf, Toke Høiland-Jørgensen

On Wed, Apr 10, 2019 at 01:43:58PM +0200, Jesper Dangaard Brouer wrote:
> A lot of the performance gain comes from this patch.
> 
> While analysing the performance overhead, it was found that the largest CPU
> stalls were caused when touching the struct page area. It is first read with
> a READ_ONCE from build_skb_around() via page_is_pfmemalloc(), and later
> written by the page_frag_free() call when the page is freed.
>
> Measurements show that the prefetchw (W) variant of the operation is needed
> to achieve the performance gain. We believe this optimization is two-fold:
> first, the W-variant saves one step in the cache-coherency protocol, and
> second, it helps us avoid the non-temporal prefetch HW optimizations and
> brings this into all cache-levels. It might be worth investigating whether
> a prefetch into L2 will have the same benefit.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  kernel/bpf/cpumap.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index b82a11556ad5..4758482ab5b9 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -281,6 +281,18 @@ static int cpu_map_kthread_run(void *data)
>  		 * consume side valid as no-resize allowed of queue.
>  		 */
>  		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
> +
> +		for (i = 0; i < n; i++) {
> +			void *f = frames[i];
> +			struct page *page = virt_to_page(f);
> +
> +			/* Bring struct page memory area to curr CPU. Read by
> +			 * build_skb_around via page_is_pfmemalloc(), and when
> +			 * freed written by page_frag_free call.
> +			 */
> +			prefetchw(page);
> +		}
> +
>  		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
>  		if (unlikely(m == 0)) {
>  			for (i = 0; i < n; i++)
> 
LGTM 

Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 3/5] net: core: introduce build_skb_around
  2019-04-11  5:33   ` Ilias Apalodimas
@ 2019-04-11 11:17     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-11 11:17 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	bpf, Toke Høiland-Jørgensen, brouer

On Thu, 11 Apr 2019 08:33:03 +0300
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> > +/**
> > + * build_skb_around - build a network buffer around provided skb
> > + * @skb: sk_buff provide by caller, must be memset cleared
> > + * @data: data buffer provided by caller
> > + * @frag_size: size of data, or 0 if head was kmalloced
> > + */
> > +struct sk_buff *build_skb_around(struct sk_buff *skb,
> > +				 void *data, unsigned int frag_size)
> > +{
> > +	if (unlikely(!skb))  
>
> Maybe add a warning here, indicating the buffer *must* be there before calling
> this?

No. I actually use this !skb case in the next patch; it only happens when
the kmem_cache_alloc_bulk() memory allocation fails.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched
  2019-04-10 23:24   ` Song Liu
@ 2019-04-11 11:23     ` Jesper Dangaard Brouer
  2019-04-11 17:38       ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-11 11:23 UTC (permalink / raw)
  To: Song Liu
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen, brouer

On Wed, 10 Apr 2019 16:24:37 -0700
Song Liu <liu.song.a23@gmail.com> wrote:

> >                 /* Feedback loop via tracepoint */
> > -               trace_xdp_cpumap_kthread(rcpu->map_id, processed, drops, sched);
> > +               trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);  
> 
> btw: can we do the tracepoint after local_bh_enable()?

I would rather not, as my experience is that this results in strange,
inaccurate readings, because (as the comment below says) this is a
CPU-process-reschedule point.  The test tool reads these values and
calculates PPS.

> >
> >                 local_bh_enable(); /* resched point, may call do_softirq() */
> >         }

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect
  2019-04-10 23:36 ` [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Song Liu
@ 2019-04-11 13:18   ` Jesper Dangaard Brouer
  2019-04-11 17:45     ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-11 13:18 UTC (permalink / raw)
  To: Song Liu
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen, brouer,
	Edward Cree

On Wed, 10 Apr 2019 16:36:40 -0700
Song Liu <liu.song.a23@gmail.com> wrote:

> On Wed, Apr 10, 2019 at 6:00 AM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > This patchset utilizes a number of different kernel bulk APIs to optimize
> > the performance of the XDP cpumap redirect feature.
> 
> Could you please share some numbers about the optimization?

I've documented ALL the details here:
 https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap02-optimizations.org

I seem to have found that the SKB-list approach is not a performance
advantage, which is very surprising.  BUT it might still be due to
invalid benchmarking, as I found that F27, behind my back, is auto-loading
iptables-filter modules, which changes performance.  Thus, I have to
redo a lot of the tests...

I'm considering removing the SKB-list patch from the patchset, as all the
other patches show a performance increase/improvement.  Then we can
merge that, and I can focus on the SKB-list approach in another
patchset.  BUT as I said above, I might have wrong/invalid
measurements... I have to retest before I conclude anything...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 3/5] net: core: introduce build_skb_around
  2019-04-10 23:34   ` Song Liu
@ 2019-04-11 15:39     ` Jesper Dangaard Brouer
  2019-04-11 17:43       ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-11 15:39 UTC (permalink / raw)
  To: Song Liu
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen, brouer

On Wed, 10 Apr 2019 16:34:29 -0700
Song Liu <liu.song.a23@gmail.com> wrote:

> > +struct sk_buff *build_skb_around(struct sk_buff *skb,
> > +                                void *data, unsigned int frag_size)
> > +{
> > +       if (unlikely(!skb))
> > +               return NULL;
> > +
> > +       skb = __build_skb_around(skb, data, frag_size);  
> 
> 
> > +
> > +       if (skb && frag_size) {
> > +               skb->head_frag = 1;
> > +               if (page_is_pfmemalloc(virt_to_head_page(data)))
> > +                       skb->pfmemalloc = 1;
> > +       }  
> 
> I didn't find any explanation of this part (head_frag, pfmemalloc).
> Shall we split it out to a separate patch?

No, it belongs here.  This mirrors the existing __build_skb() and
build_skb() split, and is needed so that __build_skb() can reuse the
code in __build_skb_around().
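
To spell out the parallel, a rough sketch (simplified, from memory; not
the exact patch code):

  /* Existing split: build_skb() is __build_skb() plus the head_frag /
   * pfmemalloc handling for page-fragment backed data.
   */
  struct sk_buff *build_skb(void *data, unsigned int frag_size)
  {
  	struct sk_buff *skb = __build_skb(data, frag_size);

  	if (skb && frag_size) {
  		skb->head_frag = 1;
  		if (page_is_pfmemalloc(virt_to_head_page(data)))
  			skb->pfmemalloc = 1;
  	}
  	return skb;
  }

  /* New split follows the same pattern: __build_skb_around() does the
   * core init on a caller-provided skb, so __build_skb() only has to
   * allocate and zero the skb before reusing that code.
   */
  struct sk_buff *__build_skb(void *data, unsigned int frag_size)
  {
  	struct sk_buff *skb;

  	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
  	if (unlikely(!skb))
  		return NULL;

  	memset(skb, 0, offsetof(struct sk_buff, tail));

  	return __build_skb_around(skb, data, frag_size);
  }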

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched
  2019-04-11 11:23     ` Jesper Dangaard Brouer
@ 2019-04-11 17:38       ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2019-04-11 17:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen

On Thu, Apr 11, 2019 at 4:23 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 10 Apr 2019 16:24:37 -0700
> Song Liu <liu.song.a23@gmail.com> wrote:
>
> > >                 /* Feedback loop via tracepoint */
> > > -               trace_xdp_cpumap_kthread(rcpu->map_id, processed, drops, sched);
> > > +               trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);
> >
> > btw: can we do the tracepoint after local_bh_enable()?
>
> I would rather not, as my experience is that this results in strange,
> inaccurate readings, because (as the comment below says) this is a
> CPU-process-reschedule point.  The test tool reads these values and
> calculates PPS.

Thanks for the explanation!

Song

>
> > >
> > >                 local_bh_enable(); /* resched point, may call do_softirq() */
> > >         }
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 3/5] net: core: introduce build_skb_around
  2019-04-11 15:39     ` Jesper Dangaard Brouer
@ 2019-04-11 17:43       ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2019-04-11 17:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen

On Thu, Apr 11, 2019 at 8:39 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 10 Apr 2019 16:34:29 -0700
> Song Liu <liu.song.a23@gmail.com> wrote:
>
> > > +struct sk_buff *build_skb_around(struct sk_buff *skb,
> > > +                                void *data, unsigned int frag_size)
> > > +{
> > > +       if (unlikely(!skb))
> > > +               return NULL;
> > > +
> > > +       skb = __build_skb_around(skb, data, frag_size);
> >
> >
> > > +
> > > +       if (skb && frag_size) {
> > > +               skb->head_frag = 1;
> > > +               if (page_is_pfmemalloc(virt_to_head_page(data)))
> > > +                       skb->pfmemalloc = 1;
> > > +       }
> >
> > I didn't find any explanation of this part (head_frag, pfmemalloc).
> > Shall we split it out to a separate patch?
>
> No, it belongs here.  This mirrors the existing __build_skb() and
> build_skb() split, and is needed so that __build_skb() can reuse the
> code in __build_skb_around().

I see. Thanks for the explanation.

Acked-by: Song Liu <songliubraving@fb.com>


>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect
  2019-04-11 13:18   ` Jesper Dangaard Brouer
@ 2019-04-11 17:45     ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2019-04-11 17:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, David S. Miller,
	Ilias Apalodimas, bpf, Toke Høiland-Jørgensen,
	Edward Cree

On Thu, Apr 11, 2019 at 6:18 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 10 Apr 2019 16:36:40 -0700
> Song Liu <liu.song.a23@gmail.com> wrote:
>
> > On Wed, Apr 10, 2019 at 6:00 AM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > >
> > > This patchset utilize a number of different kernel bulk APIs for optimizing
> > > the performance for the XDP cpumap redirect feature.
> >
> > Could you please share some numbers about the optimization?
>
> I've documented ALL the details here:
>  https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap02-optimizations.org

Thanks for the results! Please consider adding the results somewhere.

>
> I seem to have found that the SKB-list approach is not a performance
> advantage, which is very surprising.  BUT it might still be due to
> invalid benchmarking, as I found that F27 is auto-loading
> iptables-filter modules behind my back, which changes the performance.
> Thus, I have to redo a lot of the tests...
>
> I'm considering removing the SKB-list patch from the patchset, as all
> the other patches show a performance improvement.  Then we can merge
> the rest, and I can focus on the SKB-list approach in a separate
> patchset.  BUT as I said above, I might have wrong/invalid
> measurements... I have to retest before I conclude anything...

So we will wait for the new results?

Thanks,
Song

>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-04-11 17:45 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-10 11:43 [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Jesper Dangaard Brouer
2019-04-10 11:43 ` [PATCH bpf-next 1/5] bpf: cpumap use ptr_ring_consume_batched Jesper Dangaard Brouer
2019-04-10 23:24   ` Song Liu
2019-04-11 11:23     ` Jesper Dangaard Brouer
2019-04-11 17:38       ` Song Liu
2019-04-10 11:43 ` [PATCH bpf-next 2/5] bpf: cpumap use netif_receive_skb_list Jesper Dangaard Brouer
2019-04-10 18:56   ` Edward Cree
2019-04-10 11:43 ` [PATCH bpf-next 3/5] net: core: introduce build_skb_around Jesper Dangaard Brouer
2019-04-10 23:34   ` Song Liu
2019-04-11 15:39     ` Jesper Dangaard Brouer
2019-04-11 17:43       ` Song Liu
2019-04-11  5:33   ` Ilias Apalodimas
2019-04-11 11:17     ` Jesper Dangaard Brouer
2019-04-10 11:43 ` [PATCH bpf-next 4/5] bpf: cpumap do bulk allocation of SKBs Jesper Dangaard Brouer
2019-04-10 23:30   ` Song Liu
2019-04-10 11:43 ` [PATCH bpf-next 5/5] bpf: cpumap memory prefetchw optimizations for struct page Jesper Dangaard Brouer
2019-04-10 23:35   ` Song Liu
2019-04-11  5:47   ` Ilias Apalodimas
2019-04-10 23:36 ` [PATCH bpf-next 0/5] Bulk optimization for XDP cpumap redirect Song Liu
2019-04-11 13:18   ` Jesper Dangaard Brouer
2019-04-11 17:45     ` Song Liu
