* [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage
@ 2018-06-26 18:15 Edward Cree
  2018-06-26 18:17 ` [RFC PATCH v2 net-next 01/12] net: core: trivial netif_receive_skb_list() entry point Edward Cree
                   ` (12 more replies)
  0 siblings, 13 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:15 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

This patch series adds the capability for the network stack to receive a
 list of packets and process them as a unit, rather than handling each
 packet singly in sequence.  This is done by factoring out the existing
 datapath code at each layer and wrapping it in list handling code.

The motivation for this change is twofold:
* Instruction cache locality.  Currently, running the entire network
  stack receive path on a packet involves more code than will fit in the
  lowest-level icache, meaning that when the next packet is handled, the
  code has to be reloaded from more distant caches.  By handling packets
  in "row-major order", we ensure that the code at each layer is hot for
  most of the list.  (There is a corresponding downside in _data_ cache
  locality, since we are now touching every packet at every layer, but in
  practice there is easily enough room in dcache to hold one cacheline of
  each of the 64 packets in a NAPI poll.)
* Reduction of indirect calls.  Owing to Spectre mitigations, indirect
  function calls are now more expensive than ever; they are also heavily
  used in the network stack's architecture (see [1]).  By replacing 64
  indirect calls to the next-layer per-packet function with a single
  indirect call to the next-layer list function, we can save CPU cycles.

Drivers pass an SKB list to the stack at the end of the NAPI poll; this
 gives a natural batch size (the NAPI poll weight) and avoids waiting at
 the software level for further packets to make a larger batch (which
 would add latency).  It also means that the batch size is automatically
 tuned by the existing interrupt moderation mechanism.
The stack then runs each layer of processing over all the packets in the
 list before proceeding to the next layer.  Where the 'next layer' (or
 the context in which it must run) differs among the packets, the stack
 splits the list; this 'late demux' means that packets which differ only
 in later headers (e.g. same L2/L3 but different L4) can traverse the
 early part of the stack together.
Also, where the next layer is not (yet) list-aware, the stack can revert
 to calling the rest of the stack in a loop; this allows gradual/creeping
 listification, with no 'flag day' patch needed to listify everything.
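
To make the shape of this concrete, each listified layer follows roughly
 the pattern sketched below.  This is illustration only: the example_* names
 are stand-ins for the real per-layer code (e.g. __netif_receive_skb_list_core()
 or ip_list_rcv() later in the series), and the key used for splitting differs
 per layer (ptype, net, dst, ...).

        #include <linux/netdevice.h>
        #include <linux/skbuff.h>

        /* Stand-ins for the layer-specific pieces; purely illustrative. */
        static void *example_key_of(struct sk_buff *skb)
        {
                return skb->dev;        /* real layers key on e.g. ptype, net or dst */
        }

        static void example_deliver_one(struct sk_buff *skb, void *key)
        {
                netif_receive_skb(skb); /* stand-in for the next layer's per-packet entry */
        }

        /* Fallback used while the next layer is not yet list-aware. */
        static void example_deliver_sublist(struct sk_buff_head *list, void *key)
        {
                struct sk_buff *skb;

                while ((skb = __skb_dequeue(list)) != NULL)
                        example_deliver_one(skb, key);
        }

        /* One listified layer: walk the list, batch up runs of packets that
         * share a key (the 'late demux'), and dispatch each run as a unit.
         */
        static void example_layer_list_rcv(struct sk_buff_head *list)
        {
                struct sk_buff_head sublist;
                void *curr_key = NULL;
                struct sk_buff *skb;

                __skb_queue_head_init(&sublist);
                while ((skb = __skb_dequeue(list)) != NULL) {
                        void *key = example_key_of(skb);

                        if (!skb_queue_empty(&sublist) && key != curr_key) {
                                /* key changed: dispatch old sublist, start a new one */
                                example_deliver_sublist(&sublist, curr_key);
                                __skb_queue_head_init(&sublist);
                        }
                        curr_key = key;
                        __skb_queue_tail(&sublist, skb);
                }
                /* dispatch the final sublist */
                if (!skb_queue_empty(&sublist))
                        example_deliver_sublist(&sublist, curr_key);
        }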

Patches 1-2 simply place received packets on a list during the event
 processing loop on the sfc EF10 architecture, then call the normal stack
 for each packet singly at the end of the NAPI poll.  (Analogues of patch
 #2 for other NIC drivers should be fairly straightforward; see the sketch
 after this paragraph.)
Patches 3-9 extend the list processing as far as the IP receive handler.
Patches 10-12 apply the list techniques to Generic XDP, since the bpf_func
 there is an indirect call.  In patch #12 we JIT a list_func that performs
 list unwrapping and makes direct calls to the bpf_func.
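
As a rough illustration of what an analogue of patch #2 might look like in
 another NIC driver, here is a minimal sketch of a batching NAPI poll.  The
 foo_* names are hypothetical; foo_rx_next_skb() stands in for whatever
 driver-specific code builds an skb from the next completed RX descriptor.

        #include <linux/netdevice.h>
        #include <linux/skbuff.h>

        /* Hypothetical driver helper: returns the next received skb for this
         * poll, or NULL when there are no more completions.
         */
        static struct sk_buff *foo_rx_next_skb(struct napi_struct *napi);

        static int foo_napi_poll(struct napi_struct *napi, int budget)
        {
                struct sk_buff_head rx_list;
                struct sk_buff *skb;
                int spent = 0;

                __skb_queue_head_init(&rx_list);

                /* Accumulate this poll's packets instead of delivering them singly */
                while (spent < budget && (skb = foo_rx_next_skb(napi)) != NULL) {
                        __skb_queue_tail(&rx_list, skb);
                        spent++;
                }

                /* One call into the stack for the whole batch */
                netif_receive_skb_list(&rx_list);

                if (spent < budget)
                        napi_complete_done(napi, spent);
                return spent;
        }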

Patches 1-2 alone give about a 10% improvement in packet rate in the
 baseline test; adding patches 3-9 raises this to around 25%.  Patches 10-
 12, intended to improve Generic XDP performance, have in fact slightly
 worsened it; I am unsure why this is and have included them in this RFC
 in the hopes that someone will spot the reason.  If no progress is made I
 will drop them from the series.

Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
 packets and a single core to handle interrupts on the RX side; this was
 in order to measure as simply as possible the packet rate handled by a
 single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
 setup was tuned for maximum reproducibility, rather than raw performance.
 Full details and more results (both with and without retpolines) are
 presented in [2].

The baseline test uses four streams, and multiple RXQs all bound to a
 single CPU (the netperf binary is bound to a neighbouring CPU).  These
 tests were run with retpolines.
net-next: 6.60 Mb/s (datum)
 after 9: 8.35 Mb/s (datum + 26.6%)
after 12: 8.29 Mb/s (datum + 25.6%)
Note however that these results are not robust; changes in the parameters
 of the test often shrink the gain to single-digit percentages.  For
 instance, when using only a single RXQ, only a 4% gain was seen.  The
 results also seem to change significantly each time the patch series is
 rebased onto a new net-next; for instance the results in [3] with
 retpolines (slide 9) show only 11.6% gain in the same test as above (the
 post-patch performance is the same but the pre-patch datum is 7.5 Mb/s).

I also performed tests with Generic XDP enabled (using a simple map-based
 UDP port drop program with no entries in the map), both with and without
 the eBPF JIT enabled.
No JIT:
net-next: 3.52 Mb/s (datum)
 after 9: 4.91 Mb/s (datum + 39.5%)
after 12: 4.83 Mb/s (datum + 37.3%)

With JIT:
net-next: 5.23 Mb/s (datum)
 after 9: 6.64 Mb/s (datum + 27.0%)
after 12: 6.46 Mb/s (datum + 23.6%)

Another test variation was the use of software filtering/firewall rules.
 Adding a single iptables rule (a UDP port drop on a port range not
 matching the test traffic), thus making the netfilter hook have work to
 do, reduced baseline performance but showed a similar delta from the
 patches.  Similarly, testing with a set of TC flower filters (kindly
 supplied by Cong Wang) in the single-RXQ setup (that previously gave 4%)
 slowed down the baseline but not the patched performance, giving a 5.7%
 performance delta.  These data suggest that the batching approach
 remains effective in the presence of software switching rules.

Changes from v1 (see [3]):
* Rebased across 2 years' net-next movement (surprisingly straightforward).
  - Added Generic XDP handling to netif_receive_skb_list_internal()
  - Dealt with changes to PFMEMALLOC setting APIs
* General cleanup of code and comments.
* Skipped function calls for empty lists at various points in the stack
  (patch #9).
* Added listified Generic XDP handling (patches 10-12), though it doesn't
  seem to help (see above).
* Extended testing to cover software firewalls / netfilter etc.

[1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
[2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
[3] http://lists.openwall.net/netdev/2016/04/19/89

Edward Cree (12):
  net: core: trivial netif_receive_skb_list() entry point
  sfc: batch up RX delivery
  net: core: unwrap skb list receive slightly further
  net: core: Another step of skb receive list processing
  net: core: another layer of lists, around PF_MEMALLOC skb handling
  net: core: propagate SKB lists through packet_type lookup
  net: ipv4: listified version of ip_rcv
  net: ipv4: listify ip_rcv_finish
  net: don't bother calling list RX functions on empty lists
  net: listify Generic XDP processing, part 1
  net: listify Generic XDP processing, part 2
  net: listify jited Generic XDP processing on x86_64

 arch/x86/net/bpf_jit_comp.c           | 164 ++++++++++++++
 drivers/net/ethernet/sfc/efx.c        |  12 +
 drivers/net/ethernet/sfc/net_driver.h |   3 +
 drivers/net/ethernet/sfc/rx.c         |   7 +-
 include/linux/filter.h                |  43 +++-
 include/linux/netdevice.h             |   4 +
 include/linux/netfilter.h             |  27 +++
 include/linux/skbuff.h                |  16 ++
 include/net/ip.h                      |   2 +
 include/trace/events/net.h            |  14 ++
 kernel/bpf/core.c                     |  38 +++-
 net/core/dev.c                        | 415 +++++++++++++++++++++++++++++-----
 net/core/filter.c                     |  10 +-
 net/ipv4/af_inet.c                    |   1 +
 net/ipv4/ip_input.c                   | 129 ++++++++++-
 15 files changed, 810 insertions(+), 75 deletions(-)


* [RFC PATCH v2 net-next 01/12] net: core: trivial netif_receive_skb_list() entry point
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
@ 2018-06-26 18:17 ` Edward Cree
  2018-06-27  0:06   ` Eric Dumazet
  2018-06-26 18:17 ` [RFC PATCH v2 net-next 02/12] sfc: batch up RX delivery Edward Cree
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:17 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

Just calls netif_receive_skb() in a loop.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..105087369779 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3364,6 +3364,7 @@ int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
+void netif_receive_skb_list(struct sk_buff_head *list);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7444e6..473e24e31e38 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4792,6 +4792,26 @@ int netif_receive_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
+/**
+ *	netif_receive_skb_list - process many receive buffers from network
+ *	@list: list of skbs to process.  Must not be shareable (e.g. it may
+ *	be on the stack)
+ *
+ *	For now, just calls netif_receive_skb() in a loop, ignoring the
+ *	return value.
+ *
+ *	This function may only be called from softirq context and interrupts
+ *	should be enabled.
+ */
+void netif_receive_skb_list(struct sk_buff_head *list)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(list)) != NULL)
+		netif_receive_skb(skb);
+}
+EXPORT_SYMBOL(netif_receive_skb_list);
+
 DEFINE_PER_CPU(struct work_struct, flush_works);
 
 /* Network device is going away, flush any packets still pending */


* [RFC PATCH v2 net-next 02/12] sfc: batch up RX delivery
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
  2018-06-26 18:17 ` [RFC PATCH v2 net-next 01/12] net: core: trivial netif_receive_skb_list() entry point Edward Cree
@ 2018-06-26 18:17 ` Edward Cree
  2018-06-26 18:18 ` [RFC PATCH v2 net-next 03/12] net: core: unwrap skb list receive slightly further Edward Cree
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:17 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

Improves packet rate of 1-byte UDP receives by up to 10%.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c        | 12 ++++++++++++
 drivers/net/ethernet/sfc/net_driver.h |  3 +++
 drivers/net/ethernet/sfc/rx.c         |  7 ++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index ad4a354ce570..e84e4437dbbd 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -264,11 +264,17 @@ static int efx_check_disabled(struct efx_nic *efx)
 static int efx_process_channel(struct efx_channel *channel, int budget)
 {
 	struct efx_tx_queue *tx_queue;
+	struct sk_buff_head rx_list;
 	int spent;
 
 	if (unlikely(!channel->enabled))
 		return 0;
 
+	/* Prepare the batch receive list */
+	EFX_WARN_ON_PARANOID(channel->rx_list != NULL);
+	channel->rx_list = &rx_list;
+	__skb_queue_head_init(channel->rx_list);
+
 	efx_for_each_channel_tx_queue(tx_queue, channel) {
 		tx_queue->pkts_compl = 0;
 		tx_queue->bytes_compl = 0;
@@ -291,6 +297,10 @@ static int efx_process_channel(struct efx_channel *channel, int budget)
 		}
 	}
 
+	/* Receive any packets we queued up */
+	netif_receive_skb_list(channel->rx_list);
+	channel->rx_list = NULL;
+
 	return spent;
 }
 
@@ -555,6 +565,8 @@ static int efx_probe_channel(struct efx_channel *channel)
 			goto fail;
 	}
 
+	channel->rx_list = NULL;
+
 	return 0;
 
 fail:
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 65568925c3ef..e1d3ca3b90b5 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -448,6 +448,7 @@ enum efx_sync_events_state {
  *	__efx_rx_packet(), or zero if there is none
  * @rx_pkt_index: Ring index of first buffer for next packet to be delivered
  *	by __efx_rx_packet(), if @rx_pkt_n_frags != 0
+ * @rx_list: list of SKBs from current RX, awaiting processing
  * @rx_queue: RX queue for this channel
  * @tx_queue: TX queues for this channel
  * @sync_events_state: Current state of sync events on this channel
@@ -500,6 +501,8 @@ struct efx_channel {
 	unsigned int rx_pkt_n_frags;
 	unsigned int rx_pkt_index;
 
+	struct sk_buff_head *rx_list;
+
 	struct efx_rx_queue rx_queue;
 	struct efx_tx_queue tx_queue[EFX_TXQ_TYPES];
 
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index d2e254f2f72b..3e4d67d2d45d 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -634,7 +634,12 @@ static void efx_rx_deliver(struct efx_channel *channel, u8 *eh,
 			return;
 
 	/* Pass the packet up */
-	netif_receive_skb(skb);
+	if (channel->rx_list != NULL)
+		/* Add to list, will pass up later */
+		__skb_queue_tail(channel->rx_list, skb);
+	else
+		/* No list, so pass it up now */
+		netif_receive_skb(skb);
 }
 
 /* Handle a received packet.  Second half: Touches packet payload. */


* [RFC PATCH v2 net-next 03/12] net: core: unwrap skb list receive slightly further
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
  2018-06-26 18:17 ` [RFC PATCH v2 net-next 01/12] net: core: trivial netif_receive_skb_list() entry point Edward Cree
  2018-06-26 18:17 ` [RFC PATCH v2 net-next 02/12] sfc: batch up RX delivery Edward Cree
@ 2018-06-26 18:18 ` Edward Cree
  2018-06-26 18:18 ` [RFC PATCH v2 net-next 04/12] net: core: Another step of skb receive list processing Edward Cree
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:18 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

Adds iterator skb_queue_for_each() to run over a list without modifying it.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/skbuff.h     | 16 ++++++++++++++++
 include/trace/events/net.h |  7 +++++++
 net/core/dev.c             |  4 +++-
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c86885954994..a8c16c6700f3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1626,6 +1626,22 @@ static inline struct sk_buff *skb_peek_tail(const struct sk_buff_head *list_)
 }
 
 /**
+ *	skb_queue_for_each - iterate over an skb queue
+ *	@pos:        the &struct sk_buff to use as a loop cursor.
+ *	@head:       the &struct sk_buff_head for your list.
+ *
+ *	The reference count is not incremented and the reference is therefore
+ *	volatile; the list lock is not taken either. Use with caution.
+ *
+ *	The list must not be modified (though the individual skbs can be)
+ *	within the loop body.
+ *
+ *	After loop completion, @pos will be %NULL.
+ */
+#define skb_queue_for_each(pos, head) \
+	for (pos = skb_peek(head); pos != NULL; pos = skb_peek_next(pos, head))
+
+/**
  *	skb_queue_len	- get queue length
  *	@list_: list to measure
  *
diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 9c886739246a..00aa72ce0e7c 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -223,6 +223,13 @@ DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_entry,
 	TP_ARGS(skb)
 );
 
+DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_list_entry,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
 DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_entry,
 
 	TP_PROTO(const struct sk_buff *skb),
diff --git a/net/core/dev.c b/net/core/dev.c
index 473e24e31e38..0ab16941a651 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4807,8 +4807,10 @@ void netif_receive_skb_list(struct sk_buff_head *list)
 {
 	struct sk_buff *skb;
 
+	skb_queue_for_each(skb, list)
+		trace_netif_receive_skb_list_entry(skb);
 	while ((skb = __skb_dequeue(list)) != NULL)
-		netif_receive_skb(skb);
+		netif_receive_skb_internal(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 


* [RFC PATCH v2 net-next 04/12] net: core: Another step of skb receive list processing
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (2 preceding siblings ...)
  2018-06-26 18:18 ` [RFC PATCH v2 net-next 03/12] net: core: unwrap skb list receive slightly further Edward Cree
@ 2018-06-26 18:18 ` Edward Cree
  2018-06-26 18:19 ` [RFC PATCH v2 net-next 05/12] net: core: another layer of lists, around PF_MEMALLOC skb handling Edward Cree
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:18 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

netif_receive_skb_list_internal() now processes a list and hands it
 on to the next function.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 69 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 0ab16941a651..27980c13ad5c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4729,6 +4729,14 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 	return ret;
 }
 
+static void __netif_receive_skb_list(struct sk_buff_head *list)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(list)) != NULL)
+		__netif_receive_skb(skb);
+}
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4769,6 +4777,64 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 	return ret;
 }
 
+static void netif_receive_skb_list_internal(struct sk_buff_head *list)
+{
+	/* Two sublists so we can go back and forth between them */
+	struct sk_buff_head sublist, sublist2;
+	struct bpf_prog *xdp_prog = NULL;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&sublist);
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		net_timestamp_check(netdev_tstamp_prequeue, skb);
+		if (skb_defer_rx_timestamp(skb))
+			/* Handled, don't add to sublist */
+			continue;
+		__skb_queue_tail(&sublist, skb);
+	}
+
+	__skb_queue_head_init(&sublist2);
+	if (static_branch_unlikely(&generic_xdp_needed_key)) {
+		preempt_disable();
+		rcu_read_lock();
+		while ((skb = __skb_dequeue(&sublist)) != NULL) {
+			xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+			if (do_xdp_generic(xdp_prog, skb) != XDP_PASS)
+				/* Dropped, don't add to sublist */
+				continue;
+			__skb_queue_tail(&sublist2, skb);
+		}
+		rcu_read_unlock();
+		preempt_enable();
+		/* Move all packets onto first sublist */
+		skb_queue_splice_init(&sublist2, &sublist);
+	}
+
+	rcu_read_lock();
+#ifdef CONFIG_RPS
+	if (static_key_false(&rps_needed)) {
+		while ((skb = __skb_dequeue(&sublist)) != NULL) {
+			struct rps_dev_flow voidflow, *rflow = &voidflow;
+			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+
+			if (cpu >= 0) {
+				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+				/* Handled, don't add to sublist */
+				continue;
+			}
+
+			__skb_queue_tail(&sublist2, skb);
+		}
+
+		/* Move all packets onto first sublist */
+		skb_queue_splice_init(&sublist2, &sublist);
+	}
+#endif
+	__netif_receive_skb_list(&sublist);
+	rcu_read_unlock();
+}
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -4797,8 +4863,8 @@ EXPORT_SYMBOL(netif_receive_skb);
  *	@list: list of skbs to process.  Must not be shareable (e.g. it may
  *	be on the stack)
  *
- *	For now, just calls netif_receive_skb() in a loop, ignoring the
- *	return value.
+ *	Since return value of netif_receive_skb() is normally ignored, and
+ *	wouldn't be meaningful for a list, this function returns void.
  *
  *	This function may only be called from softirq context and interrupts
  *	should be enabled.
@@ -4809,8 +4875,7 @@ void netif_receive_skb_list(struct sk_buff_head *list)
 
 	skb_queue_for_each(skb, list)
 		trace_netif_receive_skb_list_entry(skb);
-	while ((skb = __skb_dequeue(list)) != NULL)
-		netif_receive_skb_internal(skb);
+	netif_receive_skb_list_internal(list);
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 


* [RFC PATCH v2 net-next 05/12] net: core: another layer of lists, around PF_MEMALLOC skb handling
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (3 preceding siblings ...)
  2018-06-26 18:18 ` [RFC PATCH v2 net-next 04/12] net: core: Another step of skb receive list processing Edward Cree
@ 2018-06-26 18:19 ` Edward Cree
  2018-06-26 18:19 ` [RFC PATCH v2 net-next 06/12] net: core: propagate SKB lists through packet_type lookup Edward Cree
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:19 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

First example of a layer splitting the list (rather than merely taking
 individual packets off it).

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 46 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 38 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 27980c13ad5c..92d78b3de656 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4670,6 +4670,14 @@ int netif_receive_skb_core(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
+static void __netif_receive_skb_list_core(struct sk_buff_head *list, bool pfmemalloc)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(list)) != NULL)
+		__netif_receive_skb_core(skb, pfmemalloc);
+}
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	int ret;
@@ -4695,6 +4703,36 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
+static void __netif_receive_skb_list(struct sk_buff_head *list)
+{
+	unsigned long noreclaim_flag = 0;
+	struct sk_buff_head sublist;
+	bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&sublist);
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
+			/* Handle the previous sublist */
+			__netif_receive_skb_list_core(&sublist, pfmemalloc);
+			pfmemalloc = !pfmemalloc;
+			/* See comments in __netif_receive_skb */
+			if (pfmemalloc)
+				noreclaim_flag = memalloc_noreclaim_save();
+			else
+				memalloc_noreclaim_restore(noreclaim_flag);
+			__skb_queue_head_init(&sublist);
+		}
+		__skb_queue_tail(&sublist, skb);
+	}
+	/* Handle the last sublist */
+	__netif_receive_skb_list_core(&sublist, pfmemalloc);
+	/* Restore pflags */
+	if (pfmemalloc)
+		memalloc_noreclaim_restore(noreclaim_flag);
+}
+
 static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
@@ -4729,14 +4767,6 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 	return ret;
 }
 
-static void __netif_receive_skb_list(struct sk_buff_head *list)
-{
-	struct sk_buff *skb;
-
-	while ((skb = __skb_dequeue(list)) != NULL)
-		__netif_receive_skb(skb);
-}
-
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;


* [RFC PATCH v2 net-next 06/12] net: core: propagate SKB lists through packet_type lookup
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (4 preceding siblings ...)
  2018-06-26 18:19 ` [RFC PATCH v2 net-next 05/12] net: core: another layer of lists, around PF_MEMALLOC skb handling Edward Cree
@ 2018-06-26 18:19 ` Edward Cree
  2018-06-27 14:36   ` Willem de Bruijn
  2018-06-26 18:20 ` [RFC PATCH v2 net-next 07/12] net: ipv4: listified version of ip_rcv Edward Cree
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:19 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

__netif_receive_skb_taps() does a depressingly large amount of per-packet
 work that can't easily be listified, because the another_round loop
 makes it nontrivial to slice up into smaller functions.
Fortunately, most of that work disappears in the fast path:
 * Hardware devices generally don't have an rx_handler
 * Unless you're tcpdumping or something, there is usually only one ptype
 * VLAN processing comes before the protocol ptype lookup, so doesn't force
   a pt_prev deliver
 so normally, __netif_receive_skb_taps() will run straight through and return
 the one ptype found in ptype_base[hash of skb->protocol].

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/trace/events/net.h |   7 +++
 net/core/dev.c             | 138 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 105 insertions(+), 40 deletions(-)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 00aa72ce0e7c..3c9b262896c1 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -131,6 +131,13 @@ DEFINE_EVENT(net_dev_template, netif_receive_skb,
 	TP_ARGS(skb)
 );
 
+DEFINE_EVENT(net_dev_template, netif_receive_skb_list,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
 DEFINE_EVENT(net_dev_template, netif_rx,
 
 	TP_PROTO(struct sk_buff *skb),
diff --git a/net/core/dev.c b/net/core/dev.c
index 92d78b3de656..2f46ed07c8d8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4494,12 +4494,13 @@ static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
 	return 0;
 }
 
-static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
+static int __netif_receive_skb_taps(struct sk_buff *skb, bool pfmemalloc,
+				    struct packet_type **pt_prev)
 {
-	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
 	struct net_device *orig_dev;
 	bool deliver_exact = false;
+	struct packet_type *ptype;
 	int ret = NET_RX_DROP;
 	__be16 type;
 
@@ -4514,7 +4515,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 		skb_reset_transport_header(skb);
 	skb_reset_mac_len(skb);
 
-	pt_prev = NULL;
+	*pt_prev = NULL;
 
 another_round:
 	skb->skb_iif = skb->dev->ifindex;
@@ -4535,25 +4536,25 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 		goto skip_taps;
 
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-		if (pt_prev)
-			ret = deliver_skb(skb, pt_prev, orig_dev);
-		pt_prev = ptype;
+		if (*pt_prev)
+			ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = ptype;
 	}
 
 	list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
-		if (pt_prev)
-			ret = deliver_skb(skb, pt_prev, orig_dev);
-		pt_prev = ptype;
+		if (*pt_prev)
+			ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = ptype;
 	}
 
 skip_taps:
 #ifdef CONFIG_NET_INGRESS
 	if (static_branch_unlikely(&ingress_needed_key)) {
-		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev);
+		skb = sch_handle_ingress(skb, pt_prev, &ret, orig_dev);
 		if (!skb)
 			goto out;
 
-		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
+		if (nf_ingress(skb, pt_prev, &ret, orig_dev) < 0)
 			goto out;
 	}
 #endif
@@ -4563,9 +4564,9 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 		goto drop;
 
 	if (skb_vlan_tag_present(skb)) {
-		if (pt_prev) {
-			ret = deliver_skb(skb, pt_prev, orig_dev);
-			pt_prev = NULL;
+		if (*pt_prev) {
+			ret = deliver_skb(skb, *pt_prev, orig_dev);
+			*pt_prev = NULL;
 		}
 		if (vlan_do_receive(&skb))
 			goto another_round;
@@ -4575,9 +4576,9 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
-		if (pt_prev) {
-			ret = deliver_skb(skb, pt_prev, orig_dev);
-			pt_prev = NULL;
+		if (*pt_prev) {
+			ret = deliver_skb(skb, *pt_prev, orig_dev);
+			*pt_prev = NULL;
 		}
 		switch (rx_handler(&skb)) {
 		case RX_HANDLER_CONSUMED:
@@ -4608,38 +4609,45 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 
 	/* deliver only exact match when indicated */
 	if (likely(!deliver_exact)) {
-		deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
+		deliver_ptype_list_skb(skb, pt_prev, orig_dev, type,
 				       &ptype_base[ntohs(type) &
 						   PTYPE_HASH_MASK]);
 	}
 
-	deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
+	deliver_ptype_list_skb(skb, pt_prev, orig_dev, type,
 			       &orig_dev->ptype_specific);
 
 	if (unlikely(skb->dev != orig_dev)) {
-		deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
+		deliver_ptype_list_skb(skb, pt_prev, orig_dev, type,
 				       &skb->dev->ptype_specific);
 	}
-
-	if (pt_prev) {
-		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
-			goto drop;
-		else
-			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
-	} else {
+	if (*pt_prev && unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
+		goto drop;
+	return ret;
 drop:
-		if (!deliver_exact)
-			atomic_long_inc(&skb->dev->rx_dropped);
-		else
-			atomic_long_inc(&skb->dev->rx_nohandler);
-		kfree_skb(skb);
-		/* Jamal, now you will not able to escape explaining
-		 * me how you were going to use this. :-)
-		 */
-		ret = NET_RX_DROP;
-	}
-
+	if (!deliver_exact)
+		atomic_long_inc(&skb->dev->rx_dropped);
+	else
+		atomic_long_inc(&skb->dev->rx_nohandler);
+	kfree_skb(skb);
+	/* Jamal, now you will not able to escape explaining
+	 * me how you were going to use this. :-)
+	 */
+	ret = NET_RX_DROP;
 out:
+	*pt_prev = NULL;
+	return ret;
+}
+
+static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
+{
+	struct net_device *orig_dev = skb->dev;
+	struct packet_type *pt_prev;
+	int ret;
+
+	ret = __netif_receive_skb_taps(skb, pfmemalloc, &pt_prev);
+	if (pt_prev)
+		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	return ret;
 }
 
@@ -4670,12 +4678,62 @@ int netif_receive_skb_core(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
-static void __netif_receive_skb_list_core(struct sk_buff_head *list, bool pfmemalloc)
+static inline void __netif_receive_skb_list_ptype(struct sk_buff_head *list,
+						  struct packet_type *pt_prev,
+						  struct net_device *orig_dev)
 {
 	struct sk_buff *skb;
 
 	while ((skb = __skb_dequeue(list)) != NULL)
-		__netif_receive_skb_core(skb, pfmemalloc);
+		pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+}
+
+static void __netif_receive_skb_list_core(struct sk_buff_head *list, bool pfmemalloc)
+{
+	/* Fast-path assumptions:
+	 * - There is no RX handler.
+	 * - Only one packet_type matches.
+	 * If either of these fails, we will end up doing some per-packet
+	 * processing in-line, then handling the 'last ptype' for the whole
+	 * sublist.  This can't cause out-of-order delivery to any single ptype,
+	 * because the 'last ptype' must be constant across the sublist, and all
+	 * other ptypes are handled per-packet.
+	 */
+	/* Current (common) ptype of sublist */
+	struct packet_type *pt_curr = NULL;
+	/* Current (common) orig_dev of sublist */
+	struct net_device *od_curr = NULL;
+	struct sk_buff_head sublist;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&sublist);
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		struct packet_type *pt_prev;
+		struct net_device *orig_dev = skb->dev;
+
+		__netif_receive_skb_taps(skb, pfmemalloc, &pt_prev);
+		if (pt_prev) {
+			if (skb_queue_empty(&sublist)) {
+				pt_curr = pt_prev;
+				od_curr = orig_dev;
+			} else if (!(pt_curr == pt_prev &&
+				     od_curr == orig_dev)) {
+				/* dispatch old sublist */
+				__netif_receive_skb_list_ptype(&sublist,
+							       pt_curr,
+							       od_curr);
+				/* start new sublist */
+				__skb_queue_head_init(&sublist);
+				pt_curr = pt_prev;
+				od_curr = orig_dev;
+			}
+			__skb_queue_tail(&sublist, skb);
+		}
+	}
+
+	/* dispatch final sublist */
+	__netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
 }
 
 static int __netif_receive_skb(struct sk_buff *skb)


* [RFC PATCH v2 net-next 07/12] net: ipv4: listified version of ip_rcv
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (5 preceding siblings ...)
  2018-06-26 18:19 ` [RFC PATCH v2 net-next 06/12] net: core: propagate SKB lists through packet_type lookup Edward Cree
@ 2018-06-26 18:20 ` Edward Cree
  2018-06-27 12:32   ` Florian Westphal
  2018-06-26 18:20 ` [RFC PATCH v2 net-next 08/12] net: ipv4: listify ip_rcv_finish Edward Cree
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:20 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

This also involved adding a way to run a netfilter hook over a list of
 packets.  Rather than attempting to make netfilter know about lists (which
 would be a major project in itself), we just let it call the regular okfn
 (in this case ip_rcv_finish()) for any packets it steals, and have it give
 us back a list of packets it has synchronously accepted (which NF_HOOK
 would normally call okfn() on automatically, but we want to be able to
 pass the list to a listified version of okfn()).
The netfilter hooks themselves are indirect calls that still happen per-
 packet (see nf_hook_entry_hookfn()), but again, changing that can be left
 for future work.

There is potential for out-of-order receives if the netfilter hook ends up
 synchronously stealing packets, as they will be processed before any
 accepts earlier in the list.  However, it was already possible for an
 asynchronous accept to cause out-of-order receives, so presumably this is
 considered OK.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  3 ++
 include/linux/netfilter.h | 27 +++++++++++++++++
 include/net/ip.h          |  2 ++
 net/core/dev.c            | 11 +++++--
 net/ipv4/af_inet.c        |  1 +
 net/ipv4/ip_input.c       | 75 ++++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 105087369779..5296354fa621 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2290,6 +2290,9 @@ struct packet_type {
 					 struct net_device *,
 					 struct packet_type *,
 					 struct net_device *);
+	void			(*list_func) (struct sk_buff_head *,
+					      struct packet_type *,
+					      struct net_device *);
 	bool			(*id_match)(struct packet_type *ptype,
 					    struct sock *sk);
 	void			*af_packet_priv;
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index dd2052f0efb7..42395a8a6e70 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -288,6 +288,22 @@ NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct
 	return ret;
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct sk_buff_head *list, struct sk_buff_head *sublist,
+	     struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(sublist); /* list of synchronously ACCEPTed skbs */
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
+		if (ret == 1)
+			__skb_queue_tail(sublist, skb);
+	}
+}
+
 /* Call setsockopt() */
 int nf_setsockopt(struct sock *sk, u_int8_t pf, int optval, char __user *opt,
 		  unsigned int len);
@@ -369,6 +385,17 @@ NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
 	return okfn(net, sk, skb);
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct sk_buff_head *list, struct sk_buff_head *sublist,
+	     struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	__skb_queue_head_init(sublist);
+	/* Move everything to the sublist */
+	skb_queue_splice_init(list, sublist);
+}
+
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
 			  struct sock *sk, struct sk_buff *skb,
 			  struct net_device *indev, struct net_device *outdev,
diff --git a/include/net/ip.h b/include/net/ip.h
index 0d2281b4b27a..fb3dfed537c0 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -138,6 +138,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk,
 			  struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	   struct net_device *orig_dev);
+void ip_list_rcv(struct sk_buff_head *list, struct packet_type *pt,
+		 struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2f46ed07c8d8..f0eb00e9fb57 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4684,8 +4684,15 @@ static inline void __netif_receive_skb_list_ptype(struct sk_buff_head *list,
 {
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue(list)) != NULL)
-		pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	if (!pt_prev)
+		return;
+	if (skb_queue_empty(list))
+		return;
+	if (pt_prev->list_func != NULL)
+		pt_prev->list_func(list, pt_prev, orig_dev);
+	else
+		while ((skb = __skb_dequeue(list)) != NULL)
+			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
 static void __netif_receive_skb_list_core(struct sk_buff_head *list, bool pfmemalloc)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 15e125558c76..e54381fe4b00 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1882,6 +1882,7 @@ fs_initcall(ipv4_offload_init);
 static struct packet_type ip_packet_type __read_mostly = {
 	.type = cpu_to_be16(ETH_P_IP),
 	.func = ip_rcv,
+	.list_func = ip_list_rcv,
 };
 
 static int __init inet_init(void)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7582713dd18f..7a8af8ff3f07 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -408,10 +408,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 /*
  * 	Main IP Receive routine.
  */
-int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 {
 	const struct iphdr *iph;
-	struct net *net;
 	u32 len;
 
 	/* When the interface is in promisc. mode, drop all the crap
@@ -421,7 +420,6 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		goto drop;
 
 
-	net = dev_net(dev);
 	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
@@ -489,9 +487,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	/* Must drop socket now because of tproxy. */
 	skb_orphan(skb);
 
-	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
-		       net, NULL, skb, dev, NULL,
-		       ip_rcv_finish);
+	return skb;
 
 csum_error:
 	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
@@ -500,5 +496,70 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 drop:
 	kfree_skb(skb);
 out:
-	return NET_RX_DROP;
+	return NULL;
+}
+
+/*
+ * IP receive entry point
+ */
+int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
+	   struct net_device *orig_dev)
+{
+	struct net *net = dev_net(dev);
+
+	skb = ip_rcv_core(skb, net);
+	if (skb == NULL)
+		return NET_RX_DROP;
+	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
+		       net, NULL, skb, dev, NULL,
+		       ip_rcv_finish);
+}
+
+static void ip_sublist_rcv(struct sk_buff_head *list, struct net_device *dev,
+			   struct net *net)
+{
+	struct sk_buff_head sublist;
+	struct sk_buff *skb;
+
+	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
+		     list, &sublist, dev, NULL, ip_rcv_finish);
+	while ((skb = __skb_dequeue(&sublist)) != NULL)
+		ip_rcv_finish(net, NULL, skb);
+}
+
+/* Receive a list of IP packets */
+void ip_list_rcv(struct sk_buff_head *list, struct packet_type *pt,
+		 struct net_device *orig_dev)
+{
+	struct net_device *curr_dev = NULL;
+	struct net *curr_net = NULL;
+	struct sk_buff_head sublist;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&sublist);
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		struct net_device *dev = skb->dev;
+		struct net *net = dev_net(dev);
+
+		skb = ip_rcv_core(skb, net);
+		if (skb == NULL)
+			continue;
+
+		if (skb_queue_empty(&sublist)) {
+			curr_dev = dev;
+			curr_net = net;
+		} else if (curr_dev != dev || curr_net != net) {
+			/* dispatch old sublist */
+			ip_sublist_rcv(&sublist, dev, net);
+			/* start new sublist */
+			__skb_queue_head_init(&sublist);
+			curr_dev = dev;
+			curr_net = net;
+		}
+		/* add to current sublist */
+		__skb_queue_tail(&sublist, skb);
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv(&sublist, curr_dev, curr_net);
 }


* [RFC PATCH v2 net-next 08/12] net: ipv4: listify ip_rcv_finish
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (6 preceding siblings ...)
  2018-06-26 18:20 ` [RFC PATCH v2 net-next 07/12] net: ipv4: listified version of ip_rcv Edward Cree
@ 2018-06-26 18:20 ` Edward Cree
  2018-06-26 18:21 ` [RFC PATCH v2 net-next 09/12] net: don't bother calling list RX functions on empty lists Edward Cree
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:20 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
 demux or route lookup.  The last step, calling dst_input(skb), is left to
 the caller; in the listified case, we split to form sublists with a common
 dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
The next step in listification would thus be to add a list_input() method
 to struct dst_entry.

Early demux is an indirect call based on iph->protocol; this is another
 opportunity for listification which is not taken here (it would require
 slicing up ip_rcv_finish_core() to allow splitting on protocol changes).
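
Purely to make that suggestion concrete (this is not part of the series,
 and the names and the list_input member are hypothetical), such a method
 might look roughly like:

        /* Hypothetical sketch only: suppose struct dst_entry grew, next to
         * its existing input() method, a member
         *      void (*list_input)(struct sk_buff_head *list);
         * Then ip_sublist_rcv_finish() could hand on a sublist (already known
         * to share a dst) with a single indirect call:
         */
        static void example_dst_list_input(struct sk_buff_head *list)
        {
                struct sk_buff *skb = skb_peek(list);
                struct dst_entry *dst;

                if (!skb)       /* empty sublist, nothing to do */
                        return;
                dst = skb_dst(skb);     /* common to the whole sublist */

                if (dst->list_input) {
                        /* one indirect call for the whole sublist */
                        dst->list_input(list);
                } else {
                        /* next layer not list-aware: fall back to a per-packet loop */
                        while ((skb = __skb_dequeue(list)) != NULL)
                                dst_input(skb);
                }
        }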

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/ipv4/ip_input.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7a8af8ff3f07..63d4dfdb1766 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -307,7 +307,8 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	return true;
 }
 
-static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+static int ip_rcv_finish_core(struct net *net, struct sock *sk,
+			      struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	int (*edemux)(struct sk_buff *skb);
@@ -393,7 +394,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	return dst_input(skb);
+	return NET_RX_SUCCESS;
 
 drop:
 	kfree_skb(skb);
@@ -405,6 +406,15 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	goto drop;
 }
 
+static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	int ret = ip_rcv_finish_core(net, sk, skb);
+
+	if (ret != NET_RX_DROP)
+		ret = dst_input(skb);
+	return ret;
+}
+
 /*
  * 	Main IP Receive routine.
  */
@@ -515,16 +525,54 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 }
 
+static void ip_sublist_rcv_finish(struct sk_buff_head *list)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(list)) != NULL)
+		dst_input(skb);
+}
+
+static void ip_list_rcv_finish(struct net *net, struct sock *sk,
+			       struct sk_buff_head *list)
+{
+	struct dst_entry *curr_dst = NULL;
+	struct sk_buff_head sublist;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&sublist);
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		struct dst_entry *dst;
+
+		if (ip_rcv_finish_core(net, sk, skb) == NET_RX_DROP)
+			continue;
+
+		dst = skb_dst(skb);
+		if (skb_queue_empty(&sublist)) {
+			curr_dst = dst;
+		} else if (curr_dst != dst) {
+			/* dispatch old sublist */
+			ip_sublist_rcv_finish(&sublist);
+			/* start new sublist */
+			__skb_queue_head_init(&sublist);
+			curr_dst = dst;
+		}
+		/* add to current sublist */
+		__skb_queue_tail(&sublist, skb);
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv_finish(&sublist);
+}
+
 static void ip_sublist_rcv(struct sk_buff_head *list, struct net_device *dev,
 			   struct net *net)
 {
 	struct sk_buff_head sublist;
-	struct sk_buff *skb;
 
 	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
 		     list, &sublist, dev, NULL, ip_rcv_finish);
-	while ((skb = __skb_dequeue(&sublist)) != NULL)
-		ip_rcv_finish(net, NULL, skb);
+	ip_list_rcv_finish(net, NULL, &sublist);
 }
 
 /* Receive a list of IP packets */


* [RFC PATCH v2 net-next 09/12] net: don't bother calling list RX functions on empty lists
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (7 preceding siblings ...)
  2018-06-26 18:20 ` [RFC PATCH v2 net-next 08/12] net: ipv4: listify ip_rcv_finish Edward Cree
@ 2018-06-26 18:21 ` Edward Cree
  2018-06-26 18:21 ` [RFC PATCH v2 net-next 10/12] net: listify Generic XDP processing, part 1 Edward Cree
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:21 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

Generally the check should be very cheap, as the sk_buff_head is in cache.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c      | 8 ++++++--
 net/ipv4/ip_input.c | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f0eb00e9fb57..11f80d4502b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4780,7 +4780,8 @@ static void __netif_receive_skb_list(struct sk_buff_head *list)
 	while ((skb = __skb_dequeue(list)) != NULL) {
 		if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
 			/* Handle the previous sublist */
-			__netif_receive_skb_list_core(&sublist, pfmemalloc);
+			if (!skb_queue_empty(&sublist))
+				__netif_receive_skb_list_core(&sublist, pfmemalloc);
 			pfmemalloc = !pfmemalloc;
 			/* See comments in __netif_receive_skb */
 			if (pfmemalloc)
@@ -4792,7 +4793,8 @@ static void __netif_receive_skb_list(struct sk_buff_head *list)
 		__skb_queue_tail(&sublist, skb);
 	}
 	/* Handle the last sublist */
-	__netif_receive_skb_list_core(&sublist, pfmemalloc);
+	if (!skb_queue_empty(&sublist))
+		__netif_receive_skb_list_core(&sublist, pfmemalloc);
 	/* Restore pflags */
 	if (pfmemalloc)
 		memalloc_noreclaim_restore(noreclaim_flag);
@@ -4968,6 +4970,8 @@ void netif_receive_skb_list(struct sk_buff_head *list)
 {
 	struct sk_buff *skb;
 
+	if (skb_queue_empty(list))
+		return;
 	skb_queue_for_each(skb, list)
 		trace_netif_receive_skb_list_entry(skb);
 	netif_receive_skb_list_internal(list);
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 63d4dfdb1766..65a5ed9e4b3c 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -570,6 +570,8 @@ static void ip_sublist_rcv(struct sk_buff_head *list, struct net_device *dev,
 {
 	struct sk_buff_head sublist;
 
+	if (skb_queue_empty(list))
+		return;
 	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
 		     list, &sublist, dev, NULL, ip_rcv_finish);
 	ip_list_rcv_finish(net, NULL, &sublist);


* [RFC PATCH v2 net-next 10/12] net: listify Generic XDP processing, part 1
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (8 preceding siblings ...)
  2018-06-26 18:21 ` [RFC PATCH v2 net-next 09/12] net: don't bother calling list RX functions on empty lists Edward Cree
@ 2018-06-26 18:21 ` Edward Cree
  2018-06-26 18:22 ` [RFC PATCH v2 net-next 11/12] net: listify Generic XDP processing, part 2 Edward Cree
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:21 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

Deals with all the pre- and post-amble to the BPF program itself, which is
 still called one packet at a time.
Involves some fiddly percpu variables to cope with XDP_REDIRECT handling.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/filter.h |  10 +++
 net/core/dev.c         | 165 +++++++++++++++++++++++++++++++++++++++++++------
 net/core/filter.c      |  10 +--
 3 files changed, 156 insertions(+), 29 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 20f2659dd829..75db6cbf78a3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -820,6 +820,16 @@ static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
 	return 0;
 }
 
+struct redirect_info {
+	u32 ifindex;
+	u32 flags;
+	struct bpf_map *map;
+	struct bpf_map *map_to_flush;
+	unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct redirect_info, redirect_info);
+
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
  * same cpu context. Further for best results no more than a single map
  * for the do_redirect/do_flush pair should be used. This limitation is
diff --git a/net/core/dev.c b/net/core/dev.c
index 11f80d4502b9..22cbd5314d56 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4015,15 +4015,14 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 	return rxqueue;
 }
 
-static u32 netif_receive_generic_xdp(struct sk_buff *skb,
-				     struct xdp_buff *xdp,
-				     struct bpf_prog *xdp_prog)
+static u32 netif_receive_generic_xdp_prepare(struct sk_buff *skb,
+					     struct xdp_buff *xdp,
+					     void **orig_data,
+					     void **orig_data_end,
+					     u32 *mac_len)
 {
 	struct netdev_rx_queue *rxqueue;
-	void *orig_data, *orig_data_end;
-	u32 metalen, act = XDP_DROP;
-	int hlen, off;
-	u32 mac_len;
+	int hlen;
 
 	/* Reinjected packets coming from act_mirred or similar should
 	 * not get XDP generic processing.
@@ -4054,19 +4053,35 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	/* The XDP program wants to see the packet starting at the MAC
 	 * header.
 	 */
-	mac_len = skb->data - skb_mac_header(skb);
-	hlen = skb_headlen(skb) + mac_len;
-	xdp->data = skb->data - mac_len;
+	*mac_len = skb->data - skb_mac_header(skb);
+	hlen = skb_headlen(skb) + *mac_len;
+	xdp->data = skb->data - *mac_len;
 	xdp->data_meta = xdp->data;
 	xdp->data_end = xdp->data + hlen;
 	xdp->data_hard_start = skb->data - skb_headroom(skb);
-	orig_data_end = xdp->data_end;
-	orig_data = xdp->data;
+	*orig_data_end = xdp->data_end;
+	*orig_data = xdp->data;
 
 	rxqueue = netif_get_rxqueue(skb);
 	xdp->rxq = &rxqueue->xdp_rxq;
+	/* is actually XDP_ABORTED, but here we use it to mean "go ahead and
+	 * run the xdp program"
+	 */
+	return 0;
+do_drop:
+	kfree_skb(skb);
+	return XDP_DROP;
+}
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+static u32 netif_receive_generic_xdp_finish(struct sk_buff *skb,
+					    struct xdp_buff *xdp,
+					    struct bpf_prog *xdp_prog,
+					    void *orig_data,
+					    void *orig_data_end,
+					    u32 act, u32 mac_len)
+{
+	u32 metalen;
+	int off;
 
 	off = xdp->data - orig_data;
 	if (off > 0)
@@ -4082,7 +4097,6 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	if (off != 0) {
 		skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
 		skb->len -= off;
-
 	}
 
 	switch (act) {
@@ -4102,7 +4116,6 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		trace_xdp_exception(skb->dev, xdp_prog, act);
 		/* fall through */
 	case XDP_DROP:
-	do_drop:
 		kfree_skb(skb);
 		break;
 	}
@@ -4110,6 +4123,23 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	return act;
 }
 
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct xdp_buff *xdp,
+				     struct bpf_prog *xdp_prog)
+{
+	void *orig_data, *orig_data_end;
+	u32 act, mac_len;
+
+	act = netif_receive_generic_xdp_prepare(skb, xdp, &orig_data,
+						&orig_data_end, &mac_len);
+	if (act)
+		return act;
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	return netif_receive_generic_xdp_finish(skb, xdp, xdp_prog,
+						orig_data, orig_data_end, act,
+						mac_len);
+}
+
 /* When doing generic XDP we have to bypass the qdisc layer and the
  * network taps in order to match in-driver-XDP behavior.
  */
@@ -4168,6 +4198,93 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(do_xdp_generic);
 
+struct bpf_work {
+	struct list_head list;
+	void *ctx;
+	struct redirect_info ri;
+	unsigned long ret;
+};
+
+struct xdp_work {
+	struct bpf_work w;
+	struct xdp_buff xdp;
+	struct sk_buff *skb;
+	void *orig_data;
+	void *orig_data_end;
+	u32 mac_len;
+};
+
+/* Storage area for per-packet Generic XDP metadata */
+static DEFINE_PER_CPU(struct xdp_work[NAPI_POLL_WEIGHT], xdp_work);
+
+static void do_xdp_list_generic(struct bpf_prog *xdp_prog,
+				struct sk_buff_head *list,
+				struct sk_buff_head *pass_list)
+{
+	struct xdp_work (*xwa)[NAPI_POLL_WEIGHT], *xw;
+	struct bpf_work *bw;
+	struct sk_buff *skb;
+	LIST_HEAD(xdp_list);
+	int n = 0, i, err;
+	u32 act;
+
+	if (!xdp_prog) {
+		/* PASS everything */
+		skb_queue_splice_init(list, pass_list);
+		return;
+	}
+
+	xwa = this_cpu_ptr(&xdp_work);
+
+	skb_queue_for_each(skb, list) {
+		if (WARN_ON(n > NAPI_POLL_WEIGHT))
+			 /* checked in caller, can't happen */
+			 return;
+		xw = (*xwa) + n++;
+		memset(xw, 0, sizeof(*xw));
+		xw->skb = skb;
+		xw->w.ctx = &xw->xdp;
+		act = netif_receive_generic_xdp_prepare(skb, &xw->xdp,
+							&xw->orig_data,
+							&xw->orig_data_end,
+							&xw->mac_len);
+		if (act)
+			xw->w.ret = act;
+		else
+			list_add_tail(&xw->w.list, &xdp_list);
+	}
+
+	list_for_each_entry(bw, &xdp_list, list) {
+		bw->ret = bpf_prog_run_xdp(xdp_prog, bw->ctx);
+		bw->ri = *this_cpu_ptr(&redirect_info);
+	}
+
+	for (i = 0; i < n; i++) {
+		xw = (*xwa) + i;
+		act = netif_receive_generic_xdp_finish(xw->skb, &xw->xdp,
+						       xdp_prog, xw->orig_data,
+						       xw->orig_data_end,
+						       xw->w.ret, xw->mac_len);
+		if (act != XDP_PASS) {
+			switch (act) {
+			case XDP_REDIRECT:
+				*this_cpu_ptr(&redirect_info) = xw->w.ri;
+				err = xdp_do_generic_redirect(xw->skb->dev,
+							      xw->skb, &xw->xdp,
+							      xdp_prog);
+				if (err) /* free and drop */
+					kfree_skb(xw->skb);
+				break;
+			case XDP_TX:
+				generic_xdp_tx(xw->skb, xdp_prog);
+				break;
+			}
+		} else {
+			__skb_queue_tail(pass_list, xw->skb);
+		}
+	}
+}
+
 static int netif_rx_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4878,7 +4995,7 @@ static void netif_receive_skb_list_internal(struct sk_buff_head *list)
 {
 	/* Two sublists so we can go back and forth between them */
 	struct sk_buff_head sublist, sublist2;
-	struct bpf_prog *xdp_prog = NULL;
+	struct bpf_prog *xdp_prog = NULL, *curr_prog = NULL;
 	struct sk_buff *skb;
 
 	__skb_queue_head_init(&sublist);
@@ -4893,15 +5010,23 @@ static void netif_receive_skb_list_internal(struct sk_buff_head *list)
 
 	__skb_queue_head_init(&sublist2);
 	if (static_branch_unlikely(&generic_xdp_needed_key)) {
+		struct sk_buff_head sublist3;
+		int n = 0;
+
+		__skb_queue_head_init(&sublist3);
 		preempt_disable();
 		rcu_read_lock();
 		while ((skb = __skb_dequeue(&sublist)) != NULL) {
 			xdp_prog = rcu_dereference(skb->dev->xdp_prog);
-			if (do_xdp_generic(xdp_prog, skb) != XDP_PASS)
-				/* Dropped, don't add to sublist */
-				continue;
-			__skb_queue_tail(&sublist2, skb);
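+			/* Flush the pending batch through the list XDP
+			 * handler when it reaches the NAPI budget or when
+			 * the XDP program changes, then start a new batch
+			 * under the current program.
+			 */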
+			if (++n >= NAPI_POLL_WEIGHT || xdp_prog != curr_prog) {
+				do_xdp_list_generic(curr_prog, &sublist3, &sublist2);
+				__skb_queue_head_init(&sublist3);
+				n = 0;
+				curr_prog = xdp_prog;
+			}
+			__skb_queue_tail(&sublist3, skb);
 		}
+		do_xdp_list_generic(curr_prog, &sublist3, &sublist2);
 		rcu_read_unlock();
 		preempt_enable();
 		/* Move all packets onto first sublist */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..c96aff14d76a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2039,15 +2039,7 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-struct redirect_info {
-	u32 ifindex;
-	u32 flags;
-	struct bpf_map *map;
-	struct bpf_map *map_to_flush;
-	unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct redirect_info, redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {

* [RFC PATCH v2 net-next 11/12] net: listify Generic XDP processing, part 2
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (9 preceding siblings ...)
  2018-06-26 18:21 ` [RFC PATCH v2 net-next 10/12] net: listify Generic XDP processing, part 1 Edward Cree
@ 2018-06-26 18:22 ` Edward Cree
  2018-06-26 18:22 ` [RFC PATCH v2 net-next 12/12] net: listify jited Generic XDP processing on x86_64 Edward Cree
  2018-06-26 20:48 ` [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Tom Herbert
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:22 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

Adds listified versions of the eBPF interpreter functions, and uses them when
 the single func is not JITed.  If the single func is JITed (and the list
 func is not, which currently it never is), then use the single func since
 the cost of interpreting is probably much worse than the cost of the extra
 indirect calls.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/filter.h | 38 +++++++++++++++++++++++++++++---------
 kernel/bpf/core.c      | 26 ++++++++++++++++++++++++++
 net/core/dev.c         | 19 ++++++++-----------
 3 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 75db6cbf78a3..7d813034e286 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -477,6 +477,21 @@ struct bpf_binary_header {
 	u8 image[] __aligned(4);
 };
 
+struct redirect_info {
+	u32 ifindex;
+	u32 flags;
+	struct bpf_map *map;
+	struct bpf_map *map_to_flush;
+	unsigned long   map_owner;
+};
+
+struct bpf_work {
+	struct list_head list;
+	void *ctx;
+	struct redirect_info ri;
+	unsigned long ret;
+};
+
 struct bpf_prog {
 	u16			pages;		/* Number of allocated pages */
 	u16			jited:1,	/* Is our filter JIT'ed? */
@@ -488,7 +503,9 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				jited_list:1;	/* Is list func JIT'ed? */
+				/* 5 bits left */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -498,6 +515,9 @@ struct bpf_prog {
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
 	unsigned int		(*bpf_func)(const void *ctx,
 					    const struct bpf_insn *insn);
+	/* Takes a list of struct bpf_work */
+	void			(*list_func)(struct list_head *list,
+					     const struct bpf_insn *insn);
 	/* Instructions for interpreter */
 	union {
 		struct sock_filter	insns[0];
@@ -512,6 +532,7 @@ struct sk_filter {
 };
 
 #define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
+#define BPF_LIST_PROG_RUN(filter, list) (*(filter)->list_func)(list, (filter)->insnsi)
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
@@ -616,6 +637,13 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 	return BPF_PROG_RUN(prog, xdp);
 }
 
+static __always_inline void bpf_list_prog_run_xdp(const struct bpf_prog *prog,
+						  struct list_head *list)
+{
+	/* Caller must hold rcu_read_lock(), as per bpf_prog_run_xdp(). */
+	BPF_LIST_PROG_RUN(prog, list);
+}
+
 static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
 {
 	return prog->len * sizeof(struct bpf_insn);
@@ -820,14 +848,6 @@ static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
 	return 0;
 }
 
-struct redirect_info {
-	u32 ifindex;
-	u32 flags;
-	struct bpf_map *map;
-	struct bpf_map *map_to_flush;
-	unsigned long   map_owner;
-};
-
 DECLARE_PER_CPU(struct redirect_info, redirect_info);
 
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a9e6c04d0f4a..c35da826cc3b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1356,6 +1356,18 @@ static u64 PROG_NAME_ARGS(stack_size)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5, \
 	return ___bpf_prog_run(regs, insn, stack); \
 }
 
+#define LIST_PROG_NAME(stack_size) __bpf_list_prog_run##stack_size
+#define DEFINE_BPF_LIST_PROG_RUN(stack_size) \
+static void LIST_PROG_NAME(stack_size)(struct list_head *list, const struct bpf_insn *insn) \
+{ \
+	struct bpf_work *work; \
+\
+	list_for_each_entry(work, list, list) { \
+		work->ret = PROG_NAME(stack_size)(work->ctx, insn); \
+		work->ri = *this_cpu_ptr(&redirect_info); \
+	} \
+}
+
 #define EVAL1(FN, X) FN(X)
 #define EVAL2(FN, X, Y...) FN(X) EVAL1(FN, Y)
 #define EVAL3(FN, X, Y...) FN(X) EVAL2(FN, Y)
@@ -1367,6 +1379,10 @@ EVAL6(DEFINE_BPF_PROG_RUN, 32, 64, 96, 128, 160, 192);
 EVAL6(DEFINE_BPF_PROG_RUN, 224, 256, 288, 320, 352, 384);
 EVAL4(DEFINE_BPF_PROG_RUN, 416, 448, 480, 512);
 
+EVAL6(DEFINE_BPF_LIST_PROG_RUN, 32, 64, 96, 128, 160, 192);
+EVAL6(DEFINE_BPF_LIST_PROG_RUN, 224, 256, 288, 320, 352, 384);
+EVAL4(DEFINE_BPF_LIST_PROG_RUN, 416, 448, 480, 512);
+
 EVAL6(DEFINE_BPF_PROG_RUN_ARGS, 32, 64, 96, 128, 160, 192);
 EVAL6(DEFINE_BPF_PROG_RUN_ARGS, 224, 256, 288, 320, 352, 384);
 EVAL4(DEFINE_BPF_PROG_RUN_ARGS, 416, 448, 480, 512);
@@ -1380,6 +1396,14 @@ EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
 EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
 };
 #undef PROG_NAME_LIST
+#define PROG_NAME_LIST(stack_size) LIST_PROG_NAME(stack_size),
+static void (*list_interpreters[])(struct list_head *list,
+				   const struct bpf_insn *insn) = {
+EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
+EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
+EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
+};
+#undef PROG_NAME_LIST
 #define PROG_NAME_LIST(stack_size) PROG_NAME_ARGS(stack_size),
 static u64 (*interpreters_args[])(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5,
 				  const struct bpf_insn *insn) = {
@@ -1472,8 +1496,10 @@ static void bpf_prog_select_func(struct bpf_prog *fp)
 	u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
 
 	fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
+	fp->list_func = list_interpreters[(round_up(stack_depth, 32) / 32) - 1];
 #else
 	fp->bpf_func = __bpf_prog_ret0_warn;
+	fp->list_func = NULL;
 #endif
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 22cbd5314d56..746112c22afd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4198,13 +4198,6 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(do_xdp_generic);
 
-struct bpf_work {
-	struct list_head list;
-	void *ctx;
-	struct redirect_info ri;
-	unsigned long ret;
-};
-
 struct xdp_work {
 	struct bpf_work w;
 	struct xdp_buff xdp;
@@ -4254,10 +4247,14 @@ static void do_xdp_list_generic(struct bpf_prog *xdp_prog,
 			list_add_tail(&xw->w.list, &xdp_list);
 	}
 
-	list_for_each_entry(bw, &xdp_list, list) {
-		bw->ret = bpf_prog_run_xdp(xdp_prog, bw->ctx);
-		bw->ri = *this_cpu_ptr(&redirect_info);
-	}
+	if (xdp_prog->list_func && (xdp_prog->jited_list ||
+				    !xdp_prog->jited))
+		bpf_list_prog_run_xdp(xdp_prog, &xdp_list);
+	else
+		list_for_each_entry(bw, &xdp_list, list) {
+			bw->ret = bpf_prog_run_xdp(xdp_prog, bw->ctx);
+			bw->ri = *this_cpu_ptr(&redirect_info);
+		}
 
 	for (i = 0; i < n; i++) {
 		xw = (*xwa) + i;

* [RFC PATCH v2 net-next 12/12] net: listify jited Generic XDP processing on x86_64
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (10 preceding siblings ...)
  2018-06-26 18:22 ` [RFC PATCH v2 net-next 11/12] net: listify Generic XDP processing, part 2 Edward Cree
@ 2018-06-26 18:22 ` Edward Cree
  2018-06-26 20:48 ` [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Tom Herbert
  12 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-26 18:22 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem

When JITing an eBPF program on x86_64, also JIT a list_func that calls the
 bpf_func in a loop.  Since this is a direct call, it should perform better
 than indirectly calling bpf_func in a loop.
Because computing the address of the percpu 'redirect_info' variable from
 JITed code would be ugly and magic, pass a pointer to it into the list_func
 as a parameter instead of calculating it inside the loop.  This is safe
 because BPF execution is not preemptible, so we cannot migrate off this CPU
 while the list_func runs and the pointer stays valid.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 arch/x86/net/bpf_jit_comp.c | 164 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/filter.h      |  19 +++--
 kernel/bpf/core.c           |  18 +++--
 net/core/dev.c              |   5 +-
 4 files changed, 195 insertions(+), 11 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 2580cd2e98b1..3e06dd79adda 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1060,6 +1060,169 @@ struct x64_jit_data {
 	struct jit_context ctx;
 };
 
+static void try_do_jit_list(struct bpf_prog *bpf_prog,
+			    unsigned int (*bpf_func)(const void *ctx,
+						     const struct bpf_insn *insn))
+{
+	/* list_func takes three arguments:
+	 * struct list_head *list [RDI]
+	 * const struct bpf_insn *insn [RSI]
+	 * const struct redirect_info *percpu_ri [RDX]
+	 *
+	 * Layout of struct bpf_work on x86_64:
+	 * struct list_head {
+	 *	struct list_head *next; // 0x0
+	 *	struct list_head *prev; // 0x8
+	 * } list; // 0x0
+	 * void *ctx; 0x10
+	 * struct redirect_info {
+	 *	u32 ifindex; // 0x18
+	 *	u32 flags; // 0x1c
+	 *	struct bpf_map *map; // 0x20
+	 *	struct bpf_map *map_to_flush; // 0x28
+	 *	unsigned long   map_owner; // 0x30
+	 * } ri; // 0x18
+	 * unsigned long ret; // 0x38
+	 * (total size 0x40 bytes)
+	 *
+	 * Desired function body:
+	 * struct redirect_info *ri = percpu_ri; [R12]
+	 * struct bpf_work *work; [RBP]
+	 *
+	 * list_for_each_entry(work, list, list) {
+	 *	work->ret = (*bpf_func)(work->ctx, insn);
+	 *	work->ri = *ri;
+	 * }
+	 *
+	 * Assembly to emit:
+	 * ; save CSRs
+	 *	push %rbx
+	 *	push %rbp
+	 *	push %r12
+	 * ; stash pointer to redirect_info
+	 *	mov %rdx,%r12	; ri = percpu_ri
+	 * ; start list
+	 *	mov %rdi,%rbp	; head = list
+	 * next:		; while (true) {
+	 *	mov (%rbp),%rbx	; rbx = head->next
+	 *	cmp %rbx,%rdi	; if (rbx == list)
+	 *	je out		;	break
+	 *	mov %rbx,%rbp	; head = rbx
+	 *	push %rdi	; save list
+	 *	push %rsi	; save insn (is still arg2)
+	 * ; struct bpf_work *work = head (container_of, but list is first member)
+	 *	mov 0x10(%rbp),%rdi; arg1 = work->ctx
+	 *	callq bpf_func  ; rax = (*bpf_func)(work->ctx, insn)
+	 *	mov %rax,0x38(%rbp); work->ret = rax
+	 * ; work->ri = *ri
+	 *	mov (%r12),%rdx
+	 *	mov %rdx,0x18(%rbp)
+	 *	mov 0x8(%r12),%rdx
+	 *	mov %rdx,0x20(%rbp)
+	 *	mov 0x10(%r12),%rdx
+	 *	mov %rdx,0x28(%rbp)
+	 *	mov 0x18(%r12),%rdx
+	 *	mov %rdx,0x30(%rbp)
+	 *	pop %rsi	; restore insn
+	 *	pop %rdi	; restore list
+	 *	jmp next	; }
+	 * out:
+	 * ; restore CSRs and return
+	 *	pop %r12
+	 *	pop %rbp
+	 *	pop %rbx
+	 *	retq
+	 */
+	u8 *image, *prog, *from_to_out, *next;
+	struct bpf_binary_header *header;
+	int off, cnt = 0;
+	s64 jmp_offset;
+
+	/* Prog should be 81 bytes; let's round up to 128 */
+	header = bpf_jit_binary_alloc(128, &image, 1, jit_fill_hole);
+	prog = image;
+
+	/* push rbx */
+	EMIT1(0x53);
+	/* push rbp */
+	EMIT1(0x55);
+	/* push %r12 */
+	EMIT2(0x41, 0x54);
+	/* mov %rdx,%r12 */
+	EMIT3(0x49, 0x89, 0xd4);
+	/* mov %rdi,%rbp */
+	EMIT3(0x48, 0x89, 0xfd);
+	next = prog;
+	/* mov 0x0(%rbp),%rbx */
+	EMIT4(0x48, 0x8b, 0x5d, 0x00);
+	/* cmp %rbx,%rdi */
+	EMIT3(0x48, 0x39, 0xdf);
+	/* je out */
+	EMIT2(X86_JE, 0);
+	from_to_out = prog; /* record . to patch this jump later */
+	/* mov %rbx,%rbp */
+	EMIT3(0x48, 0x89, 0xdd);
+	/* push %rdi */
+	EMIT1(0x57);
+	/* push %rsi */
+	EMIT1(0x56);
+	/* mov 0x10(%rbp),%rdi */
+	EMIT4(0x48, 0x8b, 0x7d, 0x10);
+	/* e8 callq bpf_func */
+	jmp_offset = (u8 *)bpf_func - (prog + 5);
+	if (!is_simm32(jmp_offset)) {
+		pr_err("call out of range to BPF func %p from list image %p\n",
+		       bpf_func, image);
+		goto fail;
+	}
+	EMIT1_off32(0xE8, jmp_offset);
+	/* mov %rax,0x38(%rbp) */
+	EMIT4(0x48, 0x89, 0x45, 0x38);
+	/* mov (%r12),%rdx */
+	EMIT4(0x49, 0x8b, 0x14, 0x24);
+	/* mov %rdx,0x18(%rbp) */
+	EMIT4(0x48, 0x89, 0x55, 0x18);
+	for (off = 0x8; off < 0x20; off += 0x8) {
+		/* mov off(%r12),%rdx */
+		EMIT4(0x49, 0x8b, 0x54, 0x24);
+		EMIT1(off);
+		/* mov %rdx,0x18+off(%rbp) */
+		EMIT4(0x48, 0x89, 0x55, 0x18 + off);
+	}
+	/* pop %rsi */
+	EMIT1(0x5e);
+	/* pop %rdi */
+	EMIT1(0x5f);
+	/* jmp next */
+	jmp_offset = next - (prog + 2);
+	if (WARN_ON(!is_imm8(jmp_offset))) /* can't happen */
+		goto fail;
+	EMIT2(0xeb, jmp_offset);
+	/* out: */
+	jmp_offset = prog - from_to_out;
+	if (WARN_ON(!is_imm8(jmp_offset))) /* can't happen */
+		goto fail;
+	from_to_out[-1] = jmp_offset;
+	/* pop %r12 */
+	EMIT2(0x41, 0x5c);
+	/* pop %rbp */
+	EMIT1(0x5d);
+	/* pop %rbx */
+	EMIT1(0x5b);
+	/* retq */
+	EMIT1(0xc3);
+	/* If we were wrong about how much space we needed, scream and shout */
+	WARN_ON(cnt != 81);
+	if (bpf_jit_enable > 1)
+		bpf_jit_dump(0, cnt, 0, image);
+	bpf_jit_binary_lock_ro(header);
+	bpf_prog->list_func = (void *)image;
+	bpf_prog->jited_list = 1;
+	return;
+fail:
+	bpf_jit_binary_free(header);
+}
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	struct bpf_binary_header *header = NULL;
@@ -1176,6 +1339,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		prog->bpf_func = (void *)image;
 		prog->jited = 1;
 		prog->jited_len = proglen;
+		try_do_jit_list(prog, prog->bpf_func);
 	} else {
 		prog = orig_prog;
 	}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7d813034e286..ad1e75bf0991 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -517,7 +517,8 @@ struct bpf_prog {
 					    const struct bpf_insn *insn);
 	/* Takes a list of struct bpf_work */
 	void			(*list_func)(struct list_head *list,
-					     const struct bpf_insn *insn);
+					     const struct bpf_insn *insn,
+					     const struct redirect_info *percpu_ri);
 	/* Instructions for interpreter */
 	union {
 		struct sock_filter	insns[0];
@@ -532,7 +533,7 @@ struct sk_filter {
 };
 
 #define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
-#define BPF_LIST_PROG_RUN(filter, list) (*(filter)->list_func)(list, (filter)->insnsi)
+#define BPF_LIST_PROG_RUN(filter, list, percpu) (*(filter)->list_func)(list, (filter)->insnsi, percpu)
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
@@ -638,10 +639,11 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 }
 
 static __always_inline void bpf_list_prog_run_xdp(const struct bpf_prog *prog,
-						  struct list_head *list)
+						  struct list_head *list,
+						  const struct redirect_info *percpu_ri)
 {
 	/* Caller must hold rcu_read_lock(), as per bpf_prog_run_xdp(). */
-	BPF_LIST_PROG_RUN(prog, list);
+	BPF_LIST_PROG_RUN(prog, list, percpu_ri);
 }
 
 static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
@@ -756,6 +758,15 @@ bpf_jit_binary_hdr(const struct bpf_prog *fp)
 	return (void *)addr;
 }
 
+static inline struct bpf_binary_header *
+bpf_list_jit_binary_hdr(const struct bpf_prog *fp)
+{
+	unsigned long real_start = (unsigned long)fp->list_func;
+	unsigned long addr = real_start & PAGE_MASK;
+
+	return (void *)addr;
+}
+
 #ifdef CONFIG_ARCH_HAS_SET_MEMORY
 static inline int bpf_prog_check_pages_ro_single(const struct bpf_prog *fp)
 {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c35da826cc3b..028be88c4af8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -621,15 +621,22 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
  */
 void __weak bpf_jit_free(struct bpf_prog *fp)
 {
-	if (fp->jited) {
-		struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
+	struct bpf_binary_header *hdr;
 
+	if (fp->jited) {
+		hdr = bpf_jit_binary_hdr(fp);
 		bpf_jit_binary_unlock_ro(hdr);
 		bpf_jit_binary_free(hdr);
 
 		WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
 	}
 
+	if (fp->jited_list) {
+		hdr = bpf_list_jit_binary_hdr(fp);
+		bpf_jit_binary_unlock_ro(hdr);
+		bpf_jit_binary_free(hdr);
+	}
+
 	bpf_prog_unlock_free(fp);
 }
 
@@ -1358,13 +1365,13 @@ static u64 PROG_NAME_ARGS(stack_size)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5, \
 
 #define LIST_PROG_NAME(stack_size) __bpf_list_prog_run##stack_size
 #define DEFINE_BPF_LIST_PROG_RUN(stack_size) \
-static void LIST_PROG_NAME(stack_size)(struct list_head *list, const struct bpf_insn *insn) \
+static void LIST_PROG_NAME(stack_size)(struct list_head *list, const struct bpf_insn *insn, const struct redirect_info *percpu_ri) \
 { \
 	struct bpf_work *work; \
 \
 	list_for_each_entry(work, list, list) { \
 		work->ret = PROG_NAME(stack_size)(work->ctx, insn); \
-		work->ri = *this_cpu_ptr(&redirect_info); \
+		work->ri = *percpu_ri; \
 	} \
 }
 
@@ -1398,7 +1405,8 @@ EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
 #undef PROG_NAME_LIST
 #define PROG_NAME_LIST(stack_size) LIST_PROG_NAME(stack_size),
 static void (*list_interpreters[])(struct list_head *list,
-				   const struct bpf_insn *insn) = {
+				   const struct bpf_insn *insn,
+				   const struct redirect_info *percpu_ri) = {
 EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
 EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
 EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
diff --git a/net/core/dev.c b/net/core/dev.c
index 746112c22afd..7c1879045ef8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4214,6 +4214,7 @@ static void do_xdp_list_generic(struct bpf_prog *xdp_prog,
 				struct sk_buff_head *list,
 				struct sk_buff_head *pass_list)
 {
+	const struct redirect_info *percpu_ri = this_cpu_ptr(&redirect_info);
 	struct xdp_work (*xwa)[NAPI_POLL_WEIGHT], *xw;
 	struct bpf_work *bw;
 	struct sk_buff *skb;
@@ -4249,11 +4250,11 @@ static void do_xdp_list_generic(struct bpf_prog *xdp_prog,
 
 	if (xdp_prog->list_func && (xdp_prog->jited_list ||
 				    !xdp_prog->jited))
-		bpf_list_prog_run_xdp(xdp_prog, &xdp_list);
+		bpf_list_prog_run_xdp(xdp_prog, &xdp_list, percpu_ri);
 	else
 		list_for_each_entry(bw, &xdp_list, list) {
 			bw->ret = bpf_prog_run_xdp(xdp_prog, bw->ctx);
-			bw->ri = *this_cpu_ptr(&redirect_info);
+			bw->ri = *percpu_ri;
 		}
 
 	for (i = 0; i < n; i++) {

* Re: [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage
  2018-06-26 18:15 [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage Edward Cree
                   ` (11 preceding siblings ...)
  2018-06-26 18:22 ` [RFC PATCH v2 net-next 12/12] net: listify jited Generic XDP processing on x86_64 Edward Cree
@ 2018-06-26 20:48 ` Tom Herbert
  12 siblings, 0 replies; 21+ messages in thread
From: Tom Herbert @ 2018-06-26 20:48 UTC (permalink / raw)
  To: Edward Cree
  Cc: linux-net-drivers, Linux Kernel Network Developers, David S. Miller

On Tue, Jun 26, 2018 at 11:15 AM, Edward Cree <ecree@solarflare.com> wrote:
>
> This patch series adds the capability for the network stack to receive a
>  list of packets and process them as a unit, rather than handling each
>  packet singly in sequence.  This is done by factoring out the existing
>  datapath code at each layer and wrapping it in list handling code.
>
> The motivation for this change is twofold:
> * Instruction cache locality.  Currently, running the entire network
>   stack receive path on a packet involves more code than will fit in the
>   lowest-level icache, meaning that when the next packet is handled, the
>   code has to be reloaded from more distant caches.  By handling packets
>   in "row-major order", we ensure that the code at each layer is hot for
>   most of the list.  (There is a corresponding downside in _data_ cache
>   locality, since we are now touching every packet at every layer, but in
>   practice there is easily enough room in dcache to hold one cacheline of
>   each of the 64 packets in a NAPI poll.)
> * Reduction of indirect calls.  Owing to Spectre mitigations, indirect
>   function calls are now more expensive than ever; they are also heavily
>   used in the network stack's architecture (see [1]).  By replacing 64
>   indirect calls to the next-layer per-packet function with a single
>   indirect call to the next-layer list function, we can save CPU cycles.
>
> Drivers pass an SKB list to the stack at the end of the NAPI poll; this
>  gives a natural batch size (the NAPI poll weight) and avoids waiting at
>  the software level for further packets to make a larger batch (which
>  would add latency).  It also means that the batch size is automatically
>  tuned by the existing interrupt moderation mechanism.
> The stack then runs each layer of processing over all the packets in the
>  list before proceeding to the next layer.  Where the 'next layer' (or
>  the context in which it must run) differs among the packets, the stack
>  splits the list; this 'late demux' means that packets which differ only
>  in later headers (e.g. same L2/L3 but different L4) can traverse the
>  early part of the stack together.
> Also, where the next layer is not (yet) list-aware, the stack can revert
>  to calling the rest of the stack in a loop; this allows gradual/creeping
>  listification, with no 'flag day' patch needed to listify everything.
>
> Patches 1-2 simply place received packets on a list during the event
>  processing loop on the sfc EF10 architecture, then call the normal stack
>  for each packet singly at the end of the NAPI poll.  (Analogues of patch
>  #2 for other NIC drivers should be fairly straightforward.)
> Patches 3-9 extend the list processing as far as the IP receive handler.
> Patches 10-12 apply the list techniques to Generic XDP, since the bpf_func
>  there is an indirect call.  In patch #12 we JIT a list_func that performs
>  list unwrapping and makes direct calls to the bpf_func.
>
> Patches 1-2 alone give about a 10% improvement in packet rate in the
>  baseline test; adding patches 3-9 raises this to around 25%.  Patches 10-
>  12, intended to improve Generic XDP performance, have in fact slightly
>  worsened it; I am unsure why this is and have included them in this RFC
>  in the hopes that someone will spot the reason.  If no progress is made I
>  will drop them from the series.
>
> Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
>  packets and a single core to handle interrupts on the RX side; this was
>  in order to measure as simply as possible the packet rate handled by a
>  single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
>  setup was tuned for maximum reproducibility, rather than raw performance.
>  Full details and more results (both with and without retpolines) are
>  presented in [2].
>
> The baseline test uses four streams, and multiple RXQs all bound to a
>  single CPU (the netperf binary is bound to a neighbouring CPU).  These
>  tests were run with retpolines.
> net-next: 6.60 Mb/s (datum)
>  after 9: 8.35 Mb/s (datum + 26.6%)
> after 12: 8.29 Mb/s (datum + 25.6%)
> Note however that these results are not robust; changes in the parameters
>  of the test often shrink the gain to single-digit percentages.  For
>  instance, when using only a single RXQ, only a 4% gain was seen.  The
>  results also seem to change significantly each time the patch series is
>  rebased onto a new net-next; for instance the results in [3] with
>  retpolines (slide 9) show only 11.6% gain in the same test as above (the
>  post-patch performance is the same but the pre-patch datum is 7.5Mb/s).
>
Very nice! I really like the deliberate progression of functionality
in the patches; it makes following them very readable. I do think that
the XDP-related patches at the end of the set should be separated out.

I suspect the effects will vary a lot between architectures and
configurations, so I'm not too worried about the variance mentioned in
the performance numbers. For future work, it might also be worth
comparing to the techniques used in VPP.

Tom

>
> I also performed tests with Generic XDP enabled (using a simple map-based
>  UDP port drop program with no entries in the map), both with and without
>  the eBPF JIT enabled.
> No JIT:
> net-next: 3.52 Mb/s (datum)
>  after 9: 4.91 Mb/s (datum + 39.5%)
> after 12: 4.83 Mb/s (datum + 37.3%)
>
> With JIT:
> net-next: 5.23 Mb/s (datum)
>  after 9: 6.64 Mb/s (datum + 27.0%)
> after 12: 6.46 Mb/s (datum + 23.6%)
>
> Another test variation was the use of software filtering/firewall rules.
>  Adding a single iptables rule (a UDP port drop on a port range not
>  matching the test traffic), thus making the netfilter hook have work to
>  do, reduced baseline performance but showed a similar delta from the
>  patches.  Similarly, testing with a set of TC flower filters (kindly
>  supplied by Cong Wang) in the single-RXQ setup (that previously gave 4%)
>  slowed down the baseline but not the patched performance, giving a 5.7%
>  performance delta.  These data suggest that the batching approach
>  remains effective in the presence of software switching rules.
>
> Changes from v1 (see [3]):
> * Rebased across 2 years' net-next movement (surprisingly straightforward).
>   - Added Generic XDP handling to netif_receive_skb_list_internal()
>   - Dealt with changes to PFMEMALLOC setting APIs
> * General cleanup of code and comments.
> * Skipped function calls for empty lists at various points in the stack
>   (patch #9).
> * Added listified Generic XDP handling (patches 10-12), though it doesn't
>   seem to help (see above).
> * Extended testing to cover software firewalls / netfilter etc.
>
> [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
> [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
> [3] http://lists.openwall.net/netdev/2016/04/19/89
>
> Edward Cree (12):
>   net: core: trivial netif_receive_skb_list() entry point
>   sfc: batch up RX delivery
>   net: core: unwrap skb list receive slightly further
>   net: core: Another step of skb receive list processing
>   net: core: another layer of lists, around PF_MEMALLOC skb handling
>   net: core: propagate SKB lists through packet_type lookup
>   net: ipv4: listified version of ip_rcv
>   net: ipv4: listify ip_rcv_finish
>   net: don't bother calling list RX functions on empty lists
>   net: listify Generic XDP processing, part 1
>   net: listify Generic XDP processing, part 2
>   net: listify jited Generic XDP processing on x86_64
>
>  arch/x86/net/bpf_jit_comp.c           | 164 ++++++++++++++
>  drivers/net/ethernet/sfc/efx.c        |  12 +
>  drivers/net/ethernet/sfc/net_driver.h |   3 +
>  drivers/net/ethernet/sfc/rx.c         |   7 +-
>  include/linux/filter.h                |  43 +++-
>  include/linux/netdevice.h             |   4 +
>  include/linux/netfilter.h             |  27 +++
>  include/linux/skbuff.h                |  16 ++
>  include/net/ip.h                      |   2 +
>  include/trace/events/net.h            |  14 ++
>  kernel/bpf/core.c                     |  38 +++-
>  net/core/dev.c                        | 415 +++++++++++++++++++++++++++++-----
>  net/core/filter.c                     |  10 +-
>  net/ipv4/af_inet.c                    |   1 +
>  net/ipv4/ip_input.c                   | 129 ++++++++++-
>  15 files changed, 810 insertions(+), 75 deletions(-)
>

* Re: [RFC PATCH v2 net-next 01/12] net: core: trivial netif_receive_skb_list() entry point
  2018-06-26 18:17 ` [RFC PATCH v2 net-next 01/12] net: core: trivial netif_receive_skb_list() entry point Edward Cree
@ 2018-06-27  0:06   ` Eric Dumazet
  2018-06-27 14:03     ` Edward Cree
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2018-06-27  0:06 UTC (permalink / raw)
  To: Edward Cree, linux-net-drivers, netdev; +Cc: davem



On 06/26/2018 11:17 AM, Edward Cree wrote:
> Just calls netif_receive_skb() in a loop.

...

> +void netif_receive_skb_list(struct sk_buff_head *list)


Please use a standard list_head and standard list operators.

(In all your patches)

1) We do not want to carry a spinlock_t + count per list...

2) We get nice debugging features with CONFIG_DEBUG_LIST=y

Note that we now have skb->list after 
commit d4546c2509b1e9cd082e3682dcec98472e37ee5a ("net: Convert GRO SKB handling to list_head.")
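
A minimal sketch of what the patch #1 entry point might then look like
(illustrative only; real code would need care with how skb->list overlaps
skb->next/prev):

void netif_receive_skb_list(struct list_head *head)
{
	struct sk_buff *skb, *next;

	list_for_each_entry_safe(skb, next, head, list) {
		/* detach before handing the skb to the normal path */
		list_del_init(&skb->list);
		netif_receive_skb(skb);
	}
}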

Thanks !

* Re: [RFC PATCH v2 net-next 07/12] net: ipv4: listified version of ip_rcv
  2018-06-26 18:20 ` [RFC PATCH v2 net-next 07/12] net: ipv4: listified version of ip_rcv Edward Cree
@ 2018-06-27 12:32   ` Florian Westphal
  0 siblings, 0 replies; 21+ messages in thread
From: Florian Westphal @ 2018-06-27 12:32 UTC (permalink / raw)
  To: Edward Cree; +Cc: linux-net-drivers, netdev, davem

Edward Cree <ecree@solarflare.com> wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
>  Rather than attempting to make netfilter know about lists (which would be
>  a major project in itself) we just let it call the regular okfn (in this
>  case ip_rcv_finish()) for any packets it steals, and have it give us back
>  a list of packets it's synchronously accepted (which normally NF_HOOK
>  would automatically call okfn() on, but we want to be able to potentially
>  pass the list to a listified version of okfn().)

okfn() is only used during async reinject in the NFQUEUE case: the skb is
queued in the kernel and we wait for a verdict from a userspace process.
If that's ACCEPT, then okfn() gets called to reinject the skb into the
network stack.

A normal -j ACCEPT doesn't call okfn in the netfilter core, which is why
this occurs on a '1' retval in NF_HOOK().

The only other user of okfn() is bridge netfilter, so a listified version
of okfn() doesn't make much sense to me; it's not used normally (unless
such a listified version makes the code simpler, of course).

AFAICS it's fine to unlink/free skbs from the list to handle
drops/queueing etc., so a future version of nf_hook() could propagate the
list into nf_hook_slow and mangle the list there to deal with hooks
that steal/drop/queue skbs.

Later on we can pass the list to the hook functions themselves.

We'll have to handle non-accept verdicts in-place in the hook functions
for this, but fortunately most hookfns only return NF_ACCEPT so I think
it is manageable.

I'll look into this once the series makes it to net-next.
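
For concreteness, that direction might look roughly like the sketch below
(hypothetical names, not necessarily the helper this series adds): unlink
each skb before running the hook, since a drop/steal/queue verdict consumes
it, and hand back only the synchronously accepted ones.

static void nf_hook_list(u_int8_t pf, unsigned int hook, struct net *net,
			 struct sock *sk, struct list_head *head,
			 struct net_device *in, struct net_device *out,
			 int (*okfn)(struct net *, struct sock *,
				     struct sk_buff *))
{
	struct sk_buff *skb, *next;
	LIST_HEAD(accepted);

	list_for_each_entry_safe(skb, next, head, list) {
		list_del(&skb->list);
		if (nf_hook(pf, hook, net, sk, skb, in, out, okfn) == 1)
			list_add_tail(&skb->list, &accepted);
		/* else: dropped, stolen or queued -- skb already consumed */
	}
	list_splice(&accepted, head);
}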

* Re: [RFC PATCH v2 net-next 01/12] net: core: trivial netif_receive_skb_list() entry point
  2018-06-27  0:06   ` Eric Dumazet
@ 2018-06-27 14:03     ` Edward Cree
  0 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-27 14:03 UTC (permalink / raw)
  To: Eric Dumazet, linux-net-drivers, netdev; +Cc: davem

On 27/06/18 01:06, Eric Dumazet wrote:
> On 06/26/2018 11:17 AM, Edward Cree wrote:
>> Just calls netif_receive_skb() in a loop.
> ...
>
>> +void netif_receive_skb_list(struct sk_buff_head *list)
>
> Please use a standard list_head and standard list operators.
>
> (In all your patches)
>
> 1) We do not want to carry a spinlock_t + count per list...
>
> 2) We get nice debugging features with CONFIG_DEBUG_LIST=y
>
> Note that we now have skb->list after 
> commit d4546c2509b1e9cd082e3682dcec98472e37ee5a ("net: Convert GRO SKB handling to list_head.")
So we do.  That hadn't gone in yet on Monday when I last pulled net-next; I'll make sure to use it in the next spin.
Thanks.

* Re: [RFC PATCH v2 net-next 06/12] net: core: propagate SKB lists through packet_type lookup
  2018-06-26 18:19 ` [RFC PATCH v2 net-next 06/12] net: core: propagate SKB lists through packet_type lookup Edward Cree
@ 2018-06-27 14:36   ` Willem de Bruijn
  2018-06-27 14:49     ` Edward Cree
  0 siblings, 1 reply; 21+ messages in thread
From: Willem de Bruijn @ 2018-06-27 14:36 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, Network Development, David Miller

On Tue, Jun 26, 2018 at 8:19 PM Edward Cree <ecree@solarflare.com> wrote:
>
> __netif_receive_skb_taps() does a depressingly large amount of per-packet
>  work that can't easily be listified, because the another_round looping
>  makes it nontrivial to slice up into smaller functions.
> Fortunately, most of that work disappears in the fast path:
>  * Hardware devices generally don't have an rx_handler
>  * Unless you're tcpdumping or something, there is usually only one ptype
>  * VLAN processing comes before the protocol ptype lookup, so doesn't force
>    a pt_prev deliver
>  so normally, __netif_receive_skb_taps() will run straight through and return
>  the one ptype found in ptype_base[hash of skb->protocol].
>
> Signed-off-by: Edward Cree <ecree@solarflare.com>
> ---

> -static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
> +static int __netif_receive_skb_taps(struct sk_buff *skb, bool pfmemalloc,
> +                                   struct packet_type **pt_prev)

A lot of code churn can be avoided by keeping local variable pt_prev and
calling this ppt_prev or so, then assigning just before returning on success.

Also, this function does more than just process network taps.

>  {
> -       struct packet_type *ptype, *pt_prev;
>         rx_handler_func_t *rx_handler;
>         struct net_device *orig_dev;
>         bool deliver_exact = false;
> +       struct packet_type *ptype;
>         int ret = NET_RX_DROP;
>         __be16 type;
>
> @@ -4514,7 +4515,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
>                 skb_reset_transport_header(skb);
>         skb_reset_mac_len(skb);
>
> -       pt_prev = NULL;
> +       *pt_prev = NULL;
>
>  another_round:
>         skb->skb_iif = skb->dev->ifindex;
> @@ -4535,25 +4536,25 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
>                 goto skip_taps;
>
>         list_for_each_entry_rcu(ptype, &ptype_all, list) {
> -               if (pt_prev)
> -                       ret = deliver_skb(skb, pt_prev, orig_dev);
> -               pt_prev = ptype;
> +               if (*pt_prev)
> +                       ret = deliver_skb(skb, *pt_prev, orig_dev);
> +               *pt_prev = ptype;
>         }
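
For illustration, the minimal-churn shape being suggested might look like
this (purely a sketch; the existing function body is elided):

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
				    struct packet_type **ppt_prev)
{
	struct packet_type *pt_prev = NULL;
	int ret = NET_RX_DROP;

	/* ... existing body, unchanged, still using the local pt_prev ... */

	*ppt_prev = pt_prev;
	return ret;
}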

* Re: [RFC PATCH v2 net-next 06/12] net: core: propagate SKB lists through packet_type lookup
  2018-06-27 14:36   ` Willem de Bruijn
@ 2018-06-27 14:49     ` Edward Cree
  2018-06-27 16:00       ` Willem de Bruijn
  0 siblings, 1 reply; 21+ messages in thread
From: Edward Cree @ 2018-06-27 14:49 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: linux-net-drivers, Network Development, David Miller

On 27/06/18 15:36, Willem de Bruijn wrote:
> On Tue, Jun 26, 2018 at 8:19 PM Edward Cree <ecree@solarflare.com> wrote:
>> __netif_receive_skb_taps() does a depressingly large amount of per-packet
>>  work that can't easily be listified, because the another_round looping
>>  makes it nontrivial to slice up into smaller functions.
>> Fortunately, most of that work disappears in the fast path:
>>  * Hardware devices generally don't have an rx_handler
>>  * Unless you're tcpdumping or something, there is usually only one ptype
>>  * VLAN processing comes before the protocol ptype lookup, so doesn't force
>>    a pt_prev deliver
>>  so normally, __netif_receive_skb_taps() will run straight through and return
>>  the one ptype found in ptype_base[hash of skb->protocol].
>>
>> Signed-off-by: Edward Cree <ecree@solarflare.com>
>> ---
>> -static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
>> +static int __netif_receive_skb_taps(struct sk_buff *skb, bool pfmemalloc,
>> +                                   struct packet_type **pt_prev)
> A lot of code churn can be avoided by keeping local variable pt_prev and
> calling this ppt_prev or so, then assigning just before returning on success.
Good idea, I'll try that.

> Also, this function does more than just process network taps.
This is true, but naming things is hard, and I couldn't think of either a
 better new name for this function or a name that could fit in between
 __netif_receive_skb() and __netif_receive_skb_core() for the new function
 in my patch named __netif_receive_skb_core().  Any suggestions?

* Re: [RFC PATCH v2 net-next 06/12] net: core: propagate SKB lists through packet_type lookup
  2018-06-27 14:49     ` Edward Cree
@ 2018-06-27 16:00       ` Willem de Bruijn
  2018-06-27 16:34         ` Edward Cree
  0 siblings, 1 reply; 21+ messages in thread
From: Willem de Bruijn @ 2018-06-27 16:00 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, Network Development, David Miller

On Wed, Jun 27, 2018 at 10:49 AM Edward Cree <ecree@solarflare.com> wrote:
>
> On 27/06/18 15:36, Willem de Bruijn wrote:
> > On Tue, Jun 26, 2018 at 8:19 PM Edward Cree <ecree@solarflare.com> wrote:
> >> __netif_receive_skb_taps() does a depressingly large amount of per-packet
> >>  work that can't easily be listified, because the another_round looping
> >>  makes it nontrivial to slice up into smaller functions.
> >> Fortunately, most of that work disappears in the fast path:
> >>  * Hardware devices generally don't have an rx_handler
> >>  * Unless you're tcpdumping or something, there is usually only one ptype
> >>  * VLAN processing comes before the protocol ptype lookup, so doesn't force
> >>    a pt_prev deliver
> >>  so normally, __netif_receive_skb_taps() will run straight through and return
> >>  the one ptype found in ptype_base[hash of skb->protocol].
> >>
> >> Signed-off-by: Edward Cree <ecree@solarflare.com>
> >> ---
> >> -static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
> >> +static int __netif_receive_skb_taps(struct sk_buff *skb, bool pfmemalloc,
> >> +                                   struct packet_type **pt_prev)
> > A lot of code churn can be avoided by keeping local variable pt_prev and
> > calling this ppt_prev or so, then assigning just before returning on success.
> Good idea, I'll try that.
>
> > Also, this function does more than just process network taps.
> This is true, but naming things is hard, and I couldn't think of either a
>  better new name for this function or a name that could fit in between
>  __netif_receive_skb() and __netif_receive_skb_core() for the new function
>  in my patch named __netif_receive_skb_core().  Any suggestions?

____netif_receive_skb_core? Not that four underscores are particularly
readable. Perhaps __netif_receive_skb_core_inner. It's indeed tricky (and
not the most important thing; I didn't mean to bikeshed).

Come to think of it, from your fast path assumptions, we could perhaps wrap
ptype_all and rx_handler logic in a static_branch similar to tc and netfilter
(and sk_memalloc_socks). Remaining branches like skip_classify, pfmemalloc
and deliver_exact can also not be reached if all these are off, so this entire
section can be skipped. Then it could become __netif_receive_skb_slow,
taken only on the static branch or for vlan packets.  I do not suggest it as
part of this patchset. It would be a pretty complex change on its own.
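
As a purely hypothetical sketch (all names invented, modelled on
generic_xdp_needed_key), the fast/slow split might look like:

static DEFINE_STATIC_KEY_FALSE(rx_needs_slow_path_key);
/* static_branch_inc()'d when a ptype_all tap or an rx_handler is
 * registered, static_branch_dec()'d on unregister.
 */

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
				    struct packet_type **ppt_prev)
{
	if (static_branch_unlikely(&rx_needs_slow_path_key) ||
	    skb_vlan_tag_present(skb) || pfmemalloc)
		/* taps, rx_handlers, VLAN or pfmemalloc: take the full path */
		return __netif_receive_skb_slow(skb, pfmemalloc, ppt_prev);

	/* Fast path: only the ptype_base[] protocol lookup remains. */
	return __netif_receive_skb_ptype(skb, ppt_prev);
}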

* Re: [RFC PATCH v2 net-next 06/12] net: core: propagate SKB lists through packet_type lookup
  2018-06-27 16:00       ` Willem de Bruijn
@ 2018-06-27 16:34         ` Edward Cree
  0 siblings, 0 replies; 21+ messages in thread
From: Edward Cree @ 2018-06-27 16:34 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: linux-net-drivers, Network Development, David Miller

On 27/06/18 17:00, Willem de Bruijn wrote:
> On Wed, Jun 27, 2018 at 10:49 AM Edward Cree <ecree@solarflare.com> wrote:
>> On 27/06/18 15:36, Willem de Bruijn wrote:
>>> Also, this function does more than just process network taps.
>> This is true, but naming things is hard, and I couldn't think of either a
>>  better new name for this function or a name that could fit in between
>>  __netif_receive_skb() and __netif_receive_skb_core() for the new function
>>  in my patch named __netif_receive_skb_core().  Any suggestions?
> ____netif_receive_skb_core? Not that four underscores is particularly
> readable. Perhaps __netif_receive_skb_core_inner. It's indeed tricky (and
> not the most important, I didn't mean to bikeshed).
I've gone with __netif_receive_skb_one_core() (by contrast to ..._list_core())
 for the outer function.  And I don't mind when people shed bikes :)

> Come to think of it, from your fast path assumptions, we could perhaps wrap
> ptype_all and rx_handler logic in a static_branch similar to tc and netfilter
> (and sk_memalloc_socks). Remaining branches like skip_classify, pfmemalloc
> and deliver_exact can also not be reached if all these are off, so this entire
> section can be skipped. Then it could become __netif_receive_skb_slow,
> taken only on the static branch or for vlan packets.  I do not suggest it as
> part of this patchset. it would be a pretty complex change on its own.

That is an interesting idea, but agreed that it'd be quite complex.
