* [PATCH v4 net-next 0/9] Handle multiple received packets at each stage
@ 2018-07-02 15:11 Edward Cree
  2018-07-02 15:12 ` [PATCH v4 net-next 1/9] net: core: trivial netif_receive_skb_list() entry point Edward Cree
                   ` (10 more replies)
  0 siblings, 11 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:11 UTC (permalink / raw)
  To: davem; +Cc: netdev

This patch series adds the capability for the network stack to receive a
 list of packets and process them as a unit, rather than handling each
 packet singly in sequence.  This is done by factoring out the existing
 datapath code at each layer and wrapping it in list handling code.

The motivation for this change is twofold:
* Instruction cache locality.  Currently, running the entire network
  stack receive path on a packet involves more code than will fit in the
  lowest-level icache, meaning that when the next packet is handled, the
  code has to be reloaded from more distant caches.  By handling packets
  in "row-major order", we ensure that the code at each layer is hot for
  most of the list.  (There is a corresponding downside in _data_ cache
  locality, since we are now touching every packet at every layer, but in
  practice there is easily enough room in dcache to hold one cacheline of
  each of the 64 packets in a NAPI poll.)
* Reduction of indirect calls.  Owing to Spectre mitigations, indirect
  function calls are now more expensive than ever; they are also heavily
  used in the network stack's architecture (see [1]).  By replacing 64
  indirect calls to the next-layer per-packet function with a single
  indirect call to the next-layer list function, we can save CPU cycles.

Drivers pass an SKB list to the stack at the end of the NAPI poll; this
 gives a natural batch size (the NAPI poll weight) and avoids waiting at
 the software level for further packets to make a larger batch (which
 would add latency).  It also means that the batch size is automatically
 tuned by the existing interrupt moderation mechanism.
The stack then runs each layer of processing over all the packets in the
 list before proceeding to the next layer.  Where the 'next layer' (or
 the context in which it must run) differs among the packets, the stack
 splits the list; this 'late demux' means that packets which differ only
 in later headers (e.g. same L2/L3 but different L4) can traverse the
 early part of the stack together.
Also, where the next layer is not (yet) list-aware, the stack can revert
 to calling the rest of the stack in a loop; this allows gradual/creeping
 listification, with no 'flag day' patch needed to listify everything.
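
To make the splitting concrete: the pattern used repeatedly in patches 5-8
 looks roughly like the sketch below (it is not itself part of any patch),
 using the list_cut_before() helper added in patch 5.  lookup_context() and
 deliver_sublist() are hypothetical stand-ins for whatever determines and
 handles the next-layer context (ptype, dst, netns, ...).

static void demux_list(struct list_head *head)
{
	void *curr_ctx = NULL;	/* context shared by the sublist built so far */
	struct sk_buff *skb, *next;
	struct list_head sublist;

	list_for_each_entry_safe(skb, next, head, list) {
		void *ctx = lookup_context(skb);	/* hypothetical */

		if (ctx != curr_ctx) {
			/* dispatch the old sublist (everything before @skb) */
			list_cut_before(&sublist, head, &skb->list);
			if (!list_empty(&sublist))
				deliver_sublist(&sublist, curr_ctx);	/* hypothetical */
			/* @skb starts a new sublist */
			curr_ctx = ctx;
		}
	}
	/* dispatch whatever remains; it all shares curr_ctx */
	if (!list_empty(head))
		deliver_sublist(head, curr_ctx);
}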

Patches 1-2 simply place received packets on a list during the event
 processing loop on the sfc EF10 architecture, then call the normal stack
 for each packet singly at the end of the NAPI poll.  (Analogues of patch
 #2 for other NIC drivers should be fairly straightforward.)
Patches 3-9 extend the list processing as far as the IP receive handler.

Patches 1-2 alone give about a 10% improvement in packet rate in the
 baseline test; adding patches 3-9 raises this to around 25%.

Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
 packets and a single core to handle interrupts on the RX side; this was
 in order to measure as simply as possible the packet rate handled by a
 single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
 setup was tuned for maximum reproducibility, rather than raw performance.
 Full details and more results (both with and without retpolines) from a
 previous version of the patch series are presented in [2].

The baseline test uses four streams, and multiple RXQs all bound to a
 single CPU (the netperf binary is bound to a neighbouring CPU).  These
 tests were run with retpolines.
net-next: 6.91 Mb/s (datum)
 after 9: 8.46 Mb/s (+22.5%)
Note however that these results are not robust; changes in the parameters
 of the test sometimes shrink the gain to single-digit percentages.  For
 instance, when using only a single RXQ, only a 4% gain was seen.

One test variation was the use of software filtering/firewall rules.
 Adding a single iptables rule (UDP port drop on a port range not matching
 the test traffic), thus making the netfilter hook have work to do,
 reduced baseline performance but showed a similar gain from the patches:
net-next: 5.02 Mb/s (datum)
 after 9: 6.78 Mb/s (+35.1%)

Similarly, testing with a set of TC flower filters (kindly supplied by
 Cong Wang) gave the following:
net-next: 6.83 Mb/s (datum)
 after 9: 8.86 Mb/s (+29.7%)

These data suggest that the batching approach remains effective in the
 presence of software switching rules, and perhaps even improves the
 performance of those rules by allowing them and their codepaths to stay
 in cache between packets.

Changes from v3:
* Fixed build error when CONFIG_NETFILTER=n (thanks kbuild).

Changes from v2:
* Used standard list handling (and skb->list) instead of the skb-queue
  functions (that use skb->next, skb->prev).
  - As part of this, changed from a "dequeue, process, enqueue" model to
    using list_for_each_safe, list_del, and (new) list_cut_before.
* Altered __netif_receive_skb_core() changes in patch 6 as per Willem de
  Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming).
* Removed patches to Generic XDP, since they were producing no benefit.
  I may revisit them later.
* Removed RFC tags.

Changes from v1:
* Rebased across 2 years' net-next movement (surprisingly straightforward).
  - Added Generic XDP handling to netif_receive_skb_list_internal()
  - Dealt with changes to PFMEMALLOC setting APIs
* General cleanup of code and comments.
* Skipped function calls for empty lists at various points in the stack
  (patch #9).
* Added listified Generic XDP handling (patches 10-12), though it doesn't
  seem to help (see above).
* Extended testing to cover software firewalls / netfilter etc.

[1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
[2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf

Edward Cree (9):
  net: core: trivial netif_receive_skb_list() entry point
  sfc: batch up RX delivery
  net: core: unwrap skb list receive slightly further
  net: core: Another step of skb receive list processing
  net: core: another layer of lists, around PF_MEMALLOC skb handling
  net: core: propagate SKB lists through packet_type lookup
  net: ipv4: listified version of ip_rcv
  net: ipv4: listify ip_rcv_finish
  net: don't bother calling list RX functions on empty lists

 drivers/net/ethernet/sfc/efx.c        |  12 +++
 drivers/net/ethernet/sfc/net_driver.h |   3 +
 drivers/net/ethernet/sfc/rx.c         |   7 +-
 include/linux/list.h                  |  30 ++++++
 include/linux/netdevice.h             |   4 +
 include/linux/netfilter.h             |  22 +++++
 include/net/ip.h                      |   2 +
 include/trace/events/net.h            |   7 ++
 net/core/dev.c                        | 174 ++++++++++++++++++++++++++++++++--
 net/ipv4/af_inet.c                    |   1 +
 net/ipv4/ip_input.c                   | 114 ++++++++++++++++++++--
 11 files changed, 360 insertions(+), 16 deletions(-)


* [PATCH v4 net-next 1/9] net: core: trivial netif_receive_skb_list() entry point
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
@ 2018-07-02 15:12 ` Edward Cree
  2018-07-02 15:12 ` [PATCH v4 net-next 2/9] sfc: batch up RX delivery Edward Cree
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:12 UTC (permalink / raw)
  To: davem; +Cc: netdev

Just calls netif_receive_skb() in a loop.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c6b377a15869..e104b2e4a735 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3365,6 +3365,7 @@ int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
+void netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
diff --git a/net/core/dev.c b/net/core/dev.c
index dffed642e686..110c8dfebc01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4792,6 +4792,25 @@ int netif_receive_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
+/**
+ *	netif_receive_skb_list - process many receive buffers from network
+ *	@head: list of skbs to process.
+ *
+ *	For now, just calls netif_receive_skb() in a loop, ignoring the
+ *	return value.
+ *
+ *	This function may only be called from softirq context and interrupts
+ *	should be enabled.
+ */
+void netif_receive_skb_list(struct list_head *head)
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list)
+		netif_receive_skb(skb);
+}
+EXPORT_SYMBOL(netif_receive_skb_list);
+
 DEFINE_PER_CPU(struct work_struct, flush_works);
 
 /* Network device is going away, flush any packets still pending */


* [PATCH v4 net-next 2/9] sfc: batch up RX delivery
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
  2018-07-02 15:12 ` [PATCH v4 net-next 1/9] net: core: trivial netif_receive_skb_list() entry point Edward Cree
@ 2018-07-02 15:12 ` Edward Cree
  2018-07-02 15:13 ` [PATCH v4 net-next 3/9] net: core: unwrap skb list receive slightly further Edward Cree
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:12 UTC (permalink / raw)
  To: davem; +Cc: netdev

Improves packet rate of 1-byte UDP receives by up to 10%.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c        | 12 ++++++++++++
 drivers/net/ethernet/sfc/net_driver.h |  3 +++
 drivers/net/ethernet/sfc/rx.c         |  7 ++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 570ec72266f3..b24c2e21db8e 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -264,11 +264,17 @@ static int efx_check_disabled(struct efx_nic *efx)
 static int efx_process_channel(struct efx_channel *channel, int budget)
 {
 	struct efx_tx_queue *tx_queue;
+	struct list_head rx_list;
 	int spent;
 
 	if (unlikely(!channel->enabled))
 		return 0;
 
+	/* Prepare the batch receive list */
+	EFX_WARN_ON_PARANOID(channel->rx_list != NULL);
+	INIT_LIST_HEAD(&rx_list);
+	channel->rx_list = &rx_list;
+
 	efx_for_each_channel_tx_queue(tx_queue, channel) {
 		tx_queue->pkts_compl = 0;
 		tx_queue->bytes_compl = 0;
@@ -291,6 +297,10 @@ static int efx_process_channel(struct efx_channel *channel, int budget)
 		}
 	}
 
+	/* Receive any packets we queued up */
+	netif_receive_skb_list(channel->rx_list);
+	channel->rx_list = NULL;
+
 	return spent;
 }
 
@@ -555,6 +565,8 @@ static int efx_probe_channel(struct efx_channel *channel)
 			goto fail;
 	}
 
+	channel->rx_list = NULL;
+
 	return 0;
 
 fail:
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 65568925c3ef..961b92979640 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -448,6 +448,7 @@ enum efx_sync_events_state {
  *	__efx_rx_packet(), or zero if there is none
  * @rx_pkt_index: Ring index of first buffer for next packet to be delivered
  *	by __efx_rx_packet(), if @rx_pkt_n_frags != 0
+ * @rx_list: list of SKBs from current RX, awaiting processing
  * @rx_queue: RX queue for this channel
  * @tx_queue: TX queues for this channel
  * @sync_events_state: Current state of sync events on this channel
@@ -500,6 +501,8 @@ struct efx_channel {
 	unsigned int rx_pkt_n_frags;
 	unsigned int rx_pkt_index;
 
+	struct list_head *rx_list;
+
 	struct efx_rx_queue rx_queue;
 	struct efx_tx_queue tx_queue[EFX_TXQ_TYPES];
 
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index d2e254f2f72b..396ff01298cd 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -634,7 +634,12 @@ static void efx_rx_deliver(struct efx_channel *channel, u8 *eh,
 			return;
 
 	/* Pass the packet up */
-	netif_receive_skb(skb);
+	if (channel->rx_list != NULL)
+		/* Add to list, will pass up later */
+		list_add_tail(&skb->list, channel->rx_list);
+	else
+		/* No list, so pass it up now */
+		netif_receive_skb(skb);
 }
 
 /* Handle a received packet.  Second half: Touches packet payload. */


* [PATCH v4 net-next 3/9] net: core: unwrap skb list receive slightly further
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
  2018-07-02 15:12 ` [PATCH v4 net-next 1/9] net: core: trivial netif_receive_skb_list() entry point Edward Cree
  2018-07-02 15:12 ` [PATCH v4 net-next 2/9] sfc: batch up RX delivery Edward Cree
@ 2018-07-02 15:13 ` Edward Cree
  2018-07-02 15:13 ` [PATCH v4 net-next 4/9] net: core: Another step of skb receive list processing Edward Cree
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:13 UTC (permalink / raw)
  To: davem; +Cc: netdev

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/trace/events/net.h | 7 +++++++
 net/core/dev.c             | 4 +++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 9c886739246a..00aa72ce0e7c 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -223,6 +223,13 @@ DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_entry,
 	TP_ARGS(skb)
 );
 
+DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_list_entry,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
 DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_entry,
 
 	TP_PROTO(const struct sk_buff *skb),
diff --git a/net/core/dev.c b/net/core/dev.c
index 110c8dfebc01..99167ff83919 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4806,8 +4806,10 @@ void netif_receive_skb_list(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
 
+	list_for_each_entry(skb, head, list)
+		trace_netif_receive_skb_list_entry(skb);
 	list_for_each_entry_safe(skb, next, head, list)
-		netif_receive_skb(skb);
+		netif_receive_skb_internal(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 


* [PATCH v4 net-next 4/9] net: core: Another step of skb receive list processing
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
                   ` (2 preceding siblings ...)
  2018-07-02 15:13 ` [PATCH v4 net-next 3/9] net: core: unwrap skb list receive slightly further Edward Cree
@ 2018-07-02 15:13 ` Edward Cree
  2018-07-02 15:13 ` [PATCH v4 net-next 5/9] net: core: another layer of lists, around PF_MEMALLOC skb handling Edward Cree
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:13 UTC (permalink / raw)
  To: davem; +Cc: netdev

Add netif_receive_skb_list_internal(), which handles the timestamp check,
 generic XDP and RPS for the whole list before handing whatever remains on
 to __netif_receive_skb_list().

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 99167ff83919..d7f2a880aeed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4729,6 +4729,14 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 	return ret;
 }
 
+static void __netif_receive_skb_list(struct list_head *head)
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list)
+		__netif_receive_skb(skb);
+}
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4769,6 +4777,50 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 	return ret;
 }
 
+static void netif_receive_skb_list_internal(struct list_head *head)
+{
+	struct bpf_prog *xdp_prog = NULL;
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		net_timestamp_check(netdev_tstamp_prequeue, skb);
+		if (skb_defer_rx_timestamp(skb))
+			/* Handled, remove from list */
+			list_del(&skb->list);
+	}
+
+	if (static_branch_unlikely(&generic_xdp_needed_key)) {
+		preempt_disable();
+		rcu_read_lock();
+		list_for_each_entry_safe(skb, next, head, list) {
+			xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+			if (do_xdp_generic(xdp_prog, skb) != XDP_PASS)
+				/* Dropped, remove from list */
+				list_del(&skb->list);
+		}
+		rcu_read_unlock();
+		preempt_enable();
+	}
+
+	rcu_read_lock();
+#ifdef CONFIG_RPS
+	if (static_key_false(&rps_needed)) {
+		list_for_each_entry_safe(skb, next, head, list) {
+			struct rps_dev_flow voidflow, *rflow = &voidflow;
+			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+
+			if (cpu >= 0) {
+				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+				/* Handled, remove from list */
+				list_del(&skb->list);
+			}
+		}
+	}
+#endif
+	__netif_receive_skb_list(head);
+	rcu_read_unlock();
+}
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -4796,20 +4848,19 @@ EXPORT_SYMBOL(netif_receive_skb);
  *	netif_receive_skb_list - process many receive buffers from network
  *	@head: list of skbs to process.
  *
- *	For now, just calls netif_receive_skb() in a loop, ignoring the
- *	return value.
+ *	Since return value of netif_receive_skb() is normally ignored, and
+ *	wouldn't be meaningful for a list, this function returns void.
  *
  *	This function may only be called from softirq context and interrupts
  *	should be enabled.
  */
 void netif_receive_skb_list(struct list_head *head)
 {
-	struct sk_buff *skb, *next;
+	struct sk_buff *skb;
 
 	list_for_each_entry(skb, head, list)
 		trace_netif_receive_skb_list_entry(skb);
-	list_for_each_entry_safe(skb, next, head, list)
-		netif_receive_skb_internal(skb);
+	netif_receive_skb_list_internal(head);
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 


* [PATCH v4 net-next 5/9] net: core: another layer of lists, around PF_MEMALLOC skb handling
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
                   ` (3 preceding siblings ...)
  2018-07-02 15:13 ` [PATCH v4 net-next 4/9] net: core: Another step of skb receive list processing Edward Cree
@ 2018-07-02 15:13 ` Edward Cree
  2018-07-02 15:13 ` [PATCH v4 net-next 6/9] net: core: propagate SKB lists through packet_type lookup Edward Cree
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:13 UTC (permalink / raw)
  To: davem; +Cc: netdev

First example of a layer splitting the list (rather than merely taking
 individual packets off it).
Involves new list.h function, list_cut_before(), like list_cut_position()
 but cuts on the other side of the given entry.
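
As an illustration of the difference (this snippet is not part of the patch):
 starting each time from a list head -> A -> B -> C -> D with @entry pointing
 at C,

	struct list_head sub;

	list_cut_position(&sub, head, entry);	/* sub = A, B, C;  head = D    */
	list_cut_before(&sub, head, entry);	/* sub = A, B;     head = C, D */

The "cut before" form is what the list-splitting callers in this series want,
 since the entry passed in is the first element of the *next* sublist.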

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/list.h | 30 ++++++++++++++++++++++++++++++
 net/core/dev.c       | 44 ++++++++++++++++++++++++++++++++++++--------
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 4b129df4d46b..de04cc5ed536 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -285,6 +285,36 @@ static inline void list_cut_position(struct list_head *list,
 		__list_cut_position(list, head, entry);
 }
 
+/**
+ * list_cut_before - cut a list into two, before given entry
+ * @list: a new list to add all removed entries
+ * @head: a list with entries
+ * @entry: an entry within head, could be the head itself
+ *
+ * This helper moves the initial part of @head, up to but
+ * excluding @entry, from @head to @list.  You should pass
+ * in @entry an element you know is on @head.  @list should
+ * be an empty list or a list you do not care about losing
+ * its data.
+ * If @entry == @head, all entries on @head are moved to
+ * @list.
+ */
+static inline void list_cut_before(struct list_head *list,
+				   struct list_head *head,
+				   struct list_head *entry)
+{
+	if (head->next == entry) {
+		INIT_LIST_HEAD(list);
+		return;
+	}
+	list->next = head->next;
+	list->next->prev = list;
+	list->prev = entry->prev;
+	list->prev->next = list;
+	head->next = entry;
+	entry->prev = head;
+}
+
 static inline void __list_splice(const struct list_head *list,
 				 struct list_head *prev,
 				 struct list_head *next)
diff --git a/net/core/dev.c b/net/core/dev.c
index d7f2a880aeed..d2454678bc82 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4670,6 +4670,14 @@ int netif_receive_skb_core(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
+static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list)
+		__netif_receive_skb_core(skb, pfmemalloc);
+}
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	int ret;
@@ -4695,6 +4703,34 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
+static void __netif_receive_skb_list(struct list_head *head)
+{
+	unsigned long noreclaim_flag = 0;
+	struct sk_buff *skb, *next;
+	bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
+			struct list_head sublist;
+
+			/* Handle the previous sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			__netif_receive_skb_list_core(&sublist, pfmemalloc);
+			pfmemalloc = !pfmemalloc;
+			/* See comments in __netif_receive_skb */
+			if (pfmemalloc)
+				noreclaim_flag = memalloc_noreclaim_save();
+			else
+				memalloc_noreclaim_restore(noreclaim_flag);
+		}
+	}
+	/* Handle the remaining sublist */
+	__netif_receive_skb_list_core(head, pfmemalloc);
+	/* Restore pflags */
+	if (pfmemalloc)
+		memalloc_noreclaim_restore(noreclaim_flag);
+}
+
 static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
@@ -4729,14 +4765,6 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 	return ret;
 }
 
-static void __netif_receive_skb_list(struct list_head *head)
-{
-	struct sk_buff *skb, *next;
-
-	list_for_each_entry_safe(skb, next, head, list)
-		__netif_receive_skb(skb);
-}
-
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;


* [PATCH v4 net-next 6/9] net: core: propagate SKB lists through packet_type lookup
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
                   ` (4 preceding siblings ...)
  2018-07-02 15:13 ` [PATCH v4 net-next 5/9] net: core: another layer of lists, around PF_MEMALLOC skb handling Edward Cree
@ 2018-07-02 15:13 ` Edward Cree
  2018-07-02 15:14 ` [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv Edward Cree
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:13 UTC (permalink / raw)
  To: davem; +Cc: netdev

__netif_receive_skb_core() does a depressingly large amount of per-packet
 work that can't easily be listified, because the another_round looping
 makes it nontrivial to slice up into smaller functions.
Fortunately, most of that work disappears in the fast path:
 * Hardware devices generally don't have an rx_handler
 * Unless you're tcpdumping or something, there is usually only one ptype
 * VLAN processing comes before the protocol ptype lookup, so doesn't force
   a pt_prev deliver
 so normally, __netif_receive_skb_core() will run straight through and pass
 back the one ptype found in ptype_base[hash of skb->protocol].

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d2454678bc82..edd67b1f1e12 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4494,7 +4494,8 @@ static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
 	return 0;
 }
 
-static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
+static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
+				    struct packet_type **ppt_prev)
 {
 	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
@@ -4624,8 +4625,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	if (pt_prev) {
 		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 			goto drop;
-		else
-			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+		*ppt_prev = pt_prev;
 	} else {
 drop:
 		if (!deliver_exact)
@@ -4643,6 +4643,18 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	return ret;
 }
 
+static int __netif_receive_skb_one_core(struct sk_buff *skb, bool pfmemalloc)
+{
+	struct net_device *orig_dev = skb->dev;
+	struct packet_type *pt_prev = NULL;
+	int ret;
+
+	ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+	if (pt_prev)
+		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	return ret;
+}
+
 /**
  *	netif_receive_skb_core - special purpose version of netif_receive_skb
  *	@skb: buffer to process
@@ -4663,19 +4675,63 @@ int netif_receive_skb_core(struct sk_buff *skb)
 	int ret;
 
 	rcu_read_lock();
-	ret = __netif_receive_skb_core(skb, false);
+	ret = __netif_receive_skb_one_core(skb, false);
 	rcu_read_unlock();
 
 	return ret;
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
-static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+static inline void __netif_receive_skb_list_ptype(struct list_head *head,
+						  struct packet_type *pt_prev,
+						  struct net_device *orig_dev)
 {
 	struct sk_buff *skb, *next;
 
+	if (!pt_prev)
+		return;
+	if (list_empty(head))
+		return;
+
 	list_for_each_entry_safe(skb, next, head, list)
-		__netif_receive_skb_core(skb, pfmemalloc);
+		pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+}
+
+static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+{
+	/* Fast-path assumptions:
+	 * - There is no RX handler.
+	 * - Only one packet_type matches.
+	 * If either of these fails, we will end up doing some per-packet
+	 * processing in-line, then handling the 'last ptype' for the whole
+	 * sublist.  This can't cause out-of-order delivery to any single ptype,
+	 * because the 'last ptype' must be constant across the sublist, and all
+	 * other ptypes are handled per-packet.
+	 */
+	/* Current (common) ptype of sublist */
+	struct packet_type *pt_curr = NULL;
+	/* Current (common) orig_dev of sublist */
+	struct net_device *od_curr = NULL;
+	struct list_head sublist;
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct net_device *orig_dev = skb->dev;
+		struct packet_type *pt_prev = NULL;
+
+		__netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+		if (pt_curr != pt_prev || od_curr != orig_dev) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			__netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
+			/* start new sublist */
+			pt_curr = pt_prev;
+			od_curr = orig_dev;
+		}
+	}
+
+	/* dispatch final sublist */
+	__netif_receive_skb_list_ptype(head, pt_curr, od_curr);
 }
 
 static int __netif_receive_skb(struct sk_buff *skb)
@@ -4695,10 +4751,10 @@ static int __netif_receive_skb(struct sk_buff *skb)
 		 * context down to all allocation sites.
 		 */
 		noreclaim_flag = memalloc_noreclaim_save();
-		ret = __netif_receive_skb_core(skb, true);
+		ret = __netif_receive_skb_one_core(skb, true);
 		memalloc_noreclaim_restore(noreclaim_flag);
 	} else
-		ret = __netif_receive_skb_core(skb, false);
+		ret = __netif_receive_skb_one_core(skb, false);
 
 	return ret;
 }


* [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
                   ` (5 preceding siblings ...)
  2018-07-02 15:13 ` [PATCH v4 net-next 6/9] net: core: propagate SKB lists through packet_type lookup Edward Cree
@ 2018-07-02 15:14 ` Edward Cree
  2018-07-03 10:50   ` Pablo Neira Ayuso
  2018-07-04 16:01   ` Edward Cree
  2018-07-02 15:14 ` [PATCH v4 net-next 8/9] net: ipv4: listify ip_rcv_finish Edward Cree
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:14 UTC (permalink / raw)
  To: davem; +Cc: netdev

Also involved adding a way to run a netfilter hook over a list of packets.
 Rather than attempting to make netfilter know about lists (which would be
 a major project in itself), we just let it call the regular okfn (in this
 case ip_rcv_finish()) for any packets it steals, and have it give us back
 a list of the packets it has synchronously accepted (normally NF_HOOK would
 call okfn() on those automatically, but we want to be able to pass the list
 on to a listified version of okfn()).
The netfilter hooks themselves are indirect calls that still happen per-
 packet (see nf_hook_entry_hookfn()), but again, changing that can be left
 for future work.

There is potential for out-of-order receives if the netfilter hook ends up
 synchronously stealing packets, as they will be processed before any
 accepts earlier in the list.  However, it was already possible for an
 asynchronous accept to cause out-of-order receives, so presumably this is
 considered OK.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  3 +++
 include/linux/netfilter.h | 22 +++++++++++++++
 include/net/ip.h          |  2 ++
 net/core/dev.c            |  8 +++---
 net/ipv4/af_inet.c        |  1 +
 net/ipv4/ip_input.c       | 68 ++++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e104b2e4a735..fe81a2bfcd08 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2291,6 +2291,9 @@ struct packet_type {
 					 struct net_device *,
 					 struct packet_type *,
 					 struct net_device *);
+	void			(*list_func) (struct list_head *,
+					      struct packet_type *,
+					      struct net_device *);
 	bool			(*id_match)(struct packet_type *ptype,
 					    struct sock *sk);
 	void			*af_packet_priv;
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index dd2052f0efb7..5a5e0a2ab2a3 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -288,6 +288,20 @@ NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct
 	return ret;
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct list_head *head, struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
+		if (ret != 1)
+			list_del(&skb->list);
+	}
+}
+
 /* Call setsockopt() */
 int nf_setsockopt(struct sock *sk, u_int8_t pf, int optval, char __user *opt,
 		  unsigned int len);
@@ -369,6 +383,14 @@ NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
 	return okfn(net, sk, skb);
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct list_head *head, struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	/* nothing to do */
+}
+
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
 			  struct sock *sk, struct sk_buff *skb,
 			  struct net_device *indev, struct net_device *outdev,
diff --git a/include/net/ip.h b/include/net/ip.h
index 0d2281b4b27a..1de72f9cb23c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -138,6 +138,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk,
 			  struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	   struct net_device *orig_dev);
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		 struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index edd67b1f1e12..4c5ebfab9bc8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4692,9 +4692,11 @@ static inline void __netif_receive_skb_list_ptype(struct list_head *head,
 		return;
 	if (list_empty(head))
 		return;
-
-	list_for_each_entry_safe(skb, next, head, list)
-		pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	if (pt_prev->list_func != NULL)
+		pt_prev->list_func(head, pt_prev, orig_dev);
+	else
+		list_for_each_entry_safe(skb, next, head, list)
+			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
 static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 06b218a2870f..3ff7659c9afd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1882,6 +1882,7 @@ fs_initcall(ipv4_offload_init);
 static struct packet_type ip_packet_type __read_mostly = {
 	.type = cpu_to_be16(ETH_P_IP),
 	.func = ip_rcv,
+	.list_func = ip_list_rcv,
 };
 
 static int __init inet_init(void)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7582713dd18f..914240830bdf 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -408,10 +408,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 /*
  * 	Main IP Receive routine.
  */
-int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 {
 	const struct iphdr *iph;
-	struct net *net;
 	u32 len;
 
 	/* When the interface is in promisc. mode, drop all the crap
@@ -421,7 +420,6 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		goto drop;
 
 
-	net = dev_net(dev);
 	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
@@ -489,9 +487,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	/* Must drop socket now because of tproxy. */
 	skb_orphan(skb);
 
-	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
-		       net, NULL, skb, dev, NULL,
-		       ip_rcv_finish);
+	return skb;
 
 csum_error:
 	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
@@ -500,5 +496,63 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 drop:
 	kfree_skb(skb);
 out:
-	return NET_RX_DROP;
+	return NULL;
+}
+
+/*
+ * IP receive entry point
+ */
+int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
+	   struct net_device *orig_dev)
+{
+	struct net *net = dev_net(dev);
+
+	skb = ip_rcv_core(skb, net);
+	if (skb == NULL)
+		return NET_RX_DROP;
+	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
+		       net, NULL, skb, dev, NULL,
+		       ip_rcv_finish);
+}
+
+static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
+			   struct net *net)
+{
+	struct sk_buff *skb, *next;
+
+	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
+		     head, dev, NULL, ip_rcv_finish);
+	list_for_each_entry_safe(skb, next, head, list)
+		ip_rcv_finish(net, NULL, skb);
+}
+
+/* Receive a list of IP packets */
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		 struct net_device *orig_dev)
+{
+	struct net_device *curr_dev = NULL;
+	struct net *curr_net = NULL;
+	struct sk_buff *skb, *next;
+	struct list_head sublist;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct net_device *dev = skb->dev;
+		struct net *net = dev_net(dev);
+
+		skb = ip_rcv_core(skb, net);
+		if (skb == NULL)
+			continue;
+
+		if (curr_dev != dev || curr_net != net) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			if (!list_empty(&sublist))
+				ip_sublist_rcv(&sublist, dev, net);
+			/* start new sublist */
+			curr_dev = dev;
+			curr_net = net;
+		}
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv(head, curr_dev, curr_net);
 }


* [PATCH v4 net-next 8/9] net: ipv4: listify ip_rcv_finish
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
                   ` (6 preceding siblings ...)
  2018-07-02 15:14 ` [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv Edward Cree
@ 2018-07-02 15:14 ` Edward Cree
  2018-07-02 15:14 ` [PATCH v4 net-next 9/9] net: don't bother calling list RX functions on empty lists Edward Cree
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:14 UTC (permalink / raw)
  To: davem; +Cc: netdev

ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
 demux or route lookup.  The last step, calling dst_input(skb), is left to
 the caller; in the listified case, we split to form sublists with a common
 dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
The next step in listification would thus be to add a list_input() method
 to struct dst_entry.

Early demux is an indirect call based on iph->protocol; this is another
 opportunity for listification which is not taken here (it would require
 slicing up ip_rcv_finish_core() to allow splitting on protocol changes).
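
As a purely hypothetical sketch of the list_input() idea mentioned above
 (nothing like this exists in the series), a list method next to dst->input
 could be preferred from ip_sublist_rcv_finish() roughly as follows; the
 list_input field is an assumption for illustration only.

/* hypothetical addition to struct dst_entry:
 *	void (*list_input)(struct list_head *head);
 */
static void ip_sublist_rcv_finish(struct list_head *head)
{
	struct sk_buff *first = list_first_entry(head, struct sk_buff, list);
	struct dst_entry *dst = skb_dst(first);	/* common to the whole sublist */
	struct sk_buff *skb, *next;

	if (dst->list_input) {			/* hypothetical method */
		dst->list_input(head);
		return;
	}
	list_for_each_entry_safe(skb, next, head, list)
		dst_input(skb);
}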

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/ipv4/ip_input.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 914240830bdf..24b9b0210aeb 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -307,7 +307,8 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	return true;
 }
 
-static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+static int ip_rcv_finish_core(struct net *net, struct sock *sk,
+			      struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	int (*edemux)(struct sk_buff *skb);
@@ -393,7 +394,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	return dst_input(skb);
+	return NET_RX_SUCCESS;
 
 drop:
 	kfree_skb(skb);
@@ -405,6 +406,15 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	goto drop;
 }
 
+static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	int ret = ip_rcv_finish_core(net, sk, skb);
+
+	if (ret != NET_RX_DROP)
+		ret = dst_input(skb);
+	return ret;
+}
+
 /*
  * 	Main IP Receive routine.
  */
@@ -515,15 +525,47 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 }
 
-static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
-			   struct net *net)
+static void ip_sublist_rcv_finish(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
 
+	list_for_each_entry_safe(skb, next, head, list)
+		dst_input(skb);
+}
+
+static void ip_list_rcv_finish(struct net *net, struct sock *sk,
+			       struct list_head *head)
+{
+	struct dst_entry *curr_dst = NULL;
+	struct sk_buff *skb, *next;
+	struct list_head sublist;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct dst_entry *dst;
+
+		if (ip_rcv_finish_core(net, sk, skb) == NET_RX_DROP)
+			continue;
+
+		dst = skb_dst(skb);
+		if (curr_dst != dst) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			if (!list_empty(&sublist))
+				ip_sublist_rcv_finish(&sublist);
+			/* start new sublist */
+			curr_dst = dst;
+		}
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv_finish(head);
+}
+
+static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
+			   struct net *net)
+{
 	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
 		     head, dev, NULL, ip_rcv_finish);
-	list_for_each_entry_safe(skb, next, head, list)
-		ip_rcv_finish(net, NULL, skb);
+	ip_list_rcv_finish(net, NULL, head);
 }
 
 /* Receive a list of IP packets */


* [PATCH v4 net-next 9/9] net: don't bother calling list RX functions on empty lists
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
                   ` (7 preceding siblings ...)
  2018-07-02 15:14 ` [PATCH v4 net-next 8/9] net: ipv4: listify ip_rcv_finish Edward Cree
@ 2018-07-02 15:14 ` Edward Cree
  2018-07-02 15:40 ` [PATCH v4 net-next 0/9] Handle multiple received packets at each stage David Ahern
  2018-07-04  5:09 ` David Miller
  10 siblings, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 15:14 UTC (permalink / raw)
  To: davem; +Cc: netdev

Generally the check should be very cheap, as the sk_buff_head is in cache.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4c5ebfab9bc8..d6084b0cd9ce 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4773,7 +4773,8 @@ static void __netif_receive_skb_list(struct list_head *head)
 
 			/* Handle the previous sublist */
 			list_cut_before(&sublist, head, &skb->list);
-			__netif_receive_skb_list_core(&sublist, pfmemalloc);
+			if (!list_empty(&sublist))
+				__netif_receive_skb_list_core(&sublist, pfmemalloc);
 			pfmemalloc = !pfmemalloc;
 			/* See comments in __netif_receive_skb */
 			if (pfmemalloc)
@@ -4783,7 +4784,8 @@ static void __netif_receive_skb_list(struct list_head *head)
 		}
 	}
 	/* Handle the remaining sublist */
-	__netif_receive_skb_list_core(head, pfmemalloc);
+	if (!list_empty(head))
+		__netif_receive_skb_list_core(head, pfmemalloc);
 	/* Restore pflags */
 	if (pfmemalloc)
 		memalloc_noreclaim_restore(noreclaim_flag);
@@ -4944,6 +4946,8 @@ void netif_receive_skb_list(struct list_head *head)
 {
 	struct sk_buff *skb;
 
+	if (list_empty(head))
+		return;
 	list_for_each_entry(skb, head, list)
 		trace_netif_receive_skb_list_entry(skb);
 	netif_receive_skb_list_internal(head);


* Re: [PATCH v4 net-next 0/9] Handle multiple received packets at each stage
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
                   ` (8 preceding siblings ...)
  2018-07-02 15:14 ` [PATCH v4 net-next 9/9] net: don't bother calling list RX functions on empty lists Edward Cree
@ 2018-07-02 15:40 ` David Ahern
  2018-07-02 18:09   ` Edward Cree
  2018-07-03  7:51   ` Paolo Abeni
  2018-07-04  5:09 ` David Miller
  10 siblings, 2 replies; 17+ messages in thread
From: David Ahern @ 2018-07-02 15:40 UTC (permalink / raw)
  To: Edward Cree, davem; +Cc: netdev

On 7/2/18 9:11 AM, Edward Cree wrote:
> This patch series adds the capability for the network stack to receive a
>  list of packets and process them as a unit, rather than handling each
>  packet singly in sequence.  This is done by factoring out the existing
>  datapath code at each layer and wrapping it in list handling code.
> 

...

>  drivers/net/ethernet/sfc/efx.c        |  12 +++
>  drivers/net/ethernet/sfc/net_driver.h |   3 +
>  drivers/net/ethernet/sfc/rx.c         |   7 +-
>  include/linux/list.h                  |  30 ++++++
>  include/linux/netdevice.h             |   4 +
>  include/linux/netfilter.h             |  22 +++++
>  include/net/ip.h                      |   2 +
>  include/trace/events/net.h            |   7 ++
>  net/core/dev.c                        | 174 ++++++++++++++++++++++++++++++++--
>  net/ipv4/af_inet.c                    |   1 +
>  net/ipv4/ip_input.c                   | 114 ++++++++++++++++++++--
>  11 files changed, 360 insertions(+), 16 deletions(-)
> 

Nice work. Have you looked at IPv6 support yet?


* Re: [PATCH v4 net-next 0/9] Handle multiple received packets at each stage
  2018-07-02 15:40 ` [PATCH v4 net-next 0/9] Handle multiple received packets at each stage David Ahern
@ 2018-07-02 18:09   ` Edward Cree
  2018-07-03  7:51   ` Paolo Abeni
  1 sibling, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-02 18:09 UTC (permalink / raw)
  To: David Ahern, davem; +Cc: netdev

On 02/07/18 16:40, David Ahern wrote:
> Nice work. Have you looked at IPv6 support yet? 
I hadn't looked at it yet, no.  After a quick glance at ip6_rcv() and
 ip6_rcv_finish(), it looks like it'd be basically the same as the IPv4
 code in patches 7 and 8.  I'll probably add it in a followup when (if)
 this series gets applied.
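
(Hypothetically, and only as a sketch since none of these names exist yet, the
 followup would presumably mirror patch 7: add a .list_func to the IPv6
 packet_type and an ipv6_list_rcv() whose body is essentially ip_list_rcv()
 with ip_rcv_core()/ip_rcv_finish() swapped for ip6 equivalents.)

static struct packet_type ipv6_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_IPV6),
	.func = ipv6_rcv,
	.list_func = ipv6_list_rcv,	/* hypothetical listified entry point */
};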

-Ed


* Re: [PATCH v4 net-next 0/9] Handle multiple received packets at each stage
  2018-07-02 15:40 ` [PATCH v4 net-next 0/9] Handle multiple received packets at each stage David Ahern
  2018-07-02 18:09   ` Edward Cree
@ 2018-07-03  7:51   ` Paolo Abeni
  1 sibling, 0 replies; 17+ messages in thread
From: Paolo Abeni @ 2018-07-03  7:51 UTC (permalink / raw)
  To: David Ahern, Edward Cree, davem; +Cc: netdev

On Mon, 2018-07-02 at 09:40 -0600, David Ahern wrote:
> On 7/2/18 9:11 AM, Edward Cree wrote:
> > This patch series adds the capability for the network stack to receive a
> >  list of packets and process them as a unit, rather than handling each
> >  packet singly in sequence.  This is done by factoring out the existing
> >  datapath code at each layer and wrapping it in list handling code.
> > 
> 
> ...
> 
> >  drivers/net/ethernet/sfc/efx.c        |  12 +++
> >  drivers/net/ethernet/sfc/net_driver.h |   3 +
> >  drivers/net/ethernet/sfc/rx.c         |   7 +-
> >  include/linux/list.h                  |  30 ++++++
> >  include/linux/netdevice.h             |   4 +
> >  include/linux/netfilter.h             |  22 +++++
> >  include/net/ip.h                      |   2 +
> >  include/trace/events/net.h            |   7 ++
> >  net/core/dev.c                        | 174 ++++++++++++++++++++++++++++++++--
> >  net/ipv4/af_inet.c                    |   1 +
> >  net/ipv4/ip_input.c                   | 114 ++++++++++++++++++++--
> >  11 files changed, 360 insertions(+), 16 deletions(-)
> > 
> 
> Nice work. Have you looked at IPv6 support yet?

I think this work opens opportunities for a lot of follow-ups, if there
is agreement on extending this approach to other areas. Another item
I'd like to investigate is TC processing.

Cheers,

Paolo


* Re: [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv
  2018-07-02 15:14 ` [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv Edward Cree
@ 2018-07-03 10:50   ` Pablo Neira Ayuso
  2018-07-03 12:13     ` Florian Westphal
  2018-07-04 16:01   ` Edward Cree
  1 sibling, 1 reply; 17+ messages in thread
From: Pablo Neira Ayuso @ 2018-07-03 10:50 UTC (permalink / raw)
  To: Edward Cree; +Cc: davem, netdev

On Mon, Jul 02, 2018 at 04:14:12PM +0100, Edward Cree wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
>  Rather than attempting to make netfilter know about lists (which would be
>  a major project in itself) we just let it call the regular okfn (in this
>  case ip_rcv_finish()) for any packets it steals, and have it give us back
>  a list of packets it's synchronously accepted (which normally NF_HOOK
>  would automatically call okfn() on, but we want to be able to potentially
>  pass the list to a listified version of okfn().)
> The netfilter hooks themselves are indirect calls that still happen per-
>  packet (see nf_hook_entry_hookfn()), but again, changing that can be left
>  for future work.
> 
> There is potential for out-of-order receives if the netfilter hook ends up
>  synchronously stealing packets, as they will be processed before any
>  accepts earlier in the list.  However, it was already possible for an
>  asynchronous accept to cause out-of-order receives, so presumably this is
>  considered OK.

I think we can simplify things if these chained packets don't follow
the standard forwarding path: that would require revisiting many
subsystems to handle the new chained packets - potentially a lot of
work and likely breaking many things - and I would expect that we (and
other subsystems too) will not get very much benefit from these
chained packets.

In general I like this infrastructure, but I think we can get
something simpler if we combine it with the flowtable idea, so chained
packets follow the non-standard flowtable forwarding path as described
in [1].

We could generalize and place the flowtable code in the core if
needed, and make it not netfilter dependent if that's a problem.

Thanks.

[1] https://marc.info/?l=netfilter-devel&m=152898601419841&w=2


* Re: [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv
  2018-07-03 10:50   ` Pablo Neira Ayuso
@ 2018-07-03 12:13     ` Florian Westphal
  0 siblings, 0 replies; 17+ messages in thread
From: Florian Westphal @ 2018-07-03 12:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Edward Cree, davem, netdev

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Jul 02, 2018 at 04:14:12PM +0100, Edward Cree wrote:
> > Also involved adding a way to run a netfilter hook over a list of packets.
> >  Rather than attempting to make netfilter know about lists (which would be
> >  a major project in itself) we just let it call the regular okfn (in this
> >  case ip_rcv_finish()) for any packets it steals, and have it give us back
> >  a list of packets it's synchronously accepted (which normally NF_HOOK
> >  would automatically call okfn() on, but we want to be able to potentially
> >  pass the list to a listified version of okfn().)
> > The netfilter hooks themselves are indirect calls that still happen per-
> >  packet (see nf_hook_entry_hookfn()), but again, changing that can be left
> >  for future work.
> > 
> > There is potential for out-of-order receives if the netfilter hook ends up
> >  synchronously stealing packets, as they will be processed before any
> >  accepts earlier in the list.  However, it was already possible for an
> >  asynchronous accept to cause out-of-order receives, so presumably this is
> >  considered OK.
> 
> I think we can simplify things if these chained packets don't follow
> the standard forwarding path, this would require to revisit many
> subsystems to handle these new chained packets - potentially a lot of
> work and likely breaking many things - and I would expect we (and
> other subsystems too) will not get very much benefits from these
> chained packets.

I guess it depends on what type of 'bundles' we can expect on the wire.
For an average netfilter ruleset it will require more unbundling
because we might have

ipv4/udp -> ipv4/tcp -> ipv6/udp -> ipv4/tcp -> ipv4/udp

From Ed's patch set, this would be rebundled to

ipv4/udp -> ipv4/tcp -> ipv4/tcp -> ipv4/udp
and
ipv6/udp

nf ipv4 rule eval has to untangle again, to
ipv4/udp -> ipv4/udp
ipv4/tcp -> ipv4/tcp

HOWEVER, there are hooks that are L4 agnostic, such as defragmentation,
so we might still be able to take advantage.

Defrag is extra silly at the moment: we eat the indirect call cost
only to return NF_ACCEPT without doing anything in 99.99% of all cases.

So I still think there is value in exploring passing the bundle
into the nf core, via a new nf_hook_slow_list() that can be used
from the forward path (ingress, prerouting, input, forward, postrouting).

We can make this step-by-step, first splitting everything in
nf_hook_slow_list() and just calling the hook fns for each skb in the list.
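
(For illustration only, a first cut along those lines could be as dumb as the
following; the nf_hook_slow() call and its return-value handling here are
assumptions on my part, mirroring how NF_HOOK_LIST in patch 7 removes from
the list anything that wasn't a plain accept:)

static void nf_hook_slow_list(struct list_head *head,
			      struct nf_hook_state *state,
			      const struct nf_hook_entries *e)
{
	struct sk_buff *skb, *next;

	list_for_each_entry_safe(skb, next, head, list) {
		int ret = nf_hook_slow(skb, state, e, 0);

		if (ret != 1)
			/* stolen, queued or dropped: off the list it goes */
			list_del(&skb->list);
	}
}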

> In general I like this infrastructure, but I think we can get
> something simpler if we combine it with the flowtable idea, so chained
> packets follow the non-standard flowtable forwarding path as described
> in [1].

Yes, if all packets were in the flow table (the software fallback for
offload) we might be able to keep the list as-is, provided everything
has the same nexthop.

However, I don't see any need to make changes to this series for this
now; it can be added on top.

Did I miss anything?

I do agree that from the netfilter p.o.v. the ingress hook looks like a
good initial candidate for listification.


* Re: [PATCH v4 net-next 0/9] Handle multiple received packets at each stage
  2018-07-02 15:11 [PATCH v4 net-next 0/9] Handle multiple received packets at each stage Edward Cree
                   ` (9 preceding siblings ...)
  2018-07-02 15:40 ` [PATCH v4 net-next 0/9] Handle multiple received packets at each stage David Ahern
@ 2018-07-04  5:09 ` David Miller
  10 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2018-07-04  5:09 UTC (permalink / raw)
  To: ecree; +Cc: netdev

From: Edward Cree <ecree@solarflare.com>
Date: Mon, 2 Jul 2018 16:11:36 +0100

> This patch series adds the capability for the network stack to receive a
>  list of packets and process them as a unit, rather than handling each
>  packet singly in sequence.  This is done by factoring out the existing
>  datapath code at each layer and wrapping it in list handling code.
 ...

This is really nice stuff.

I'll apply this, but please work on the ipv6 side too.

I hope that driver maintainers take a look at using the new
netif_receive_skb_list() interface and see how much it helps
performance with their devices.

Thanks!


* Re: [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv
  2018-07-02 15:14 ` [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv Edward Cree
  2018-07-03 10:50   ` Pablo Neira Ayuso
@ 2018-07-04 16:01   ` Edward Cree
  1 sibling, 0 replies; 17+ messages in thread
From: Edward Cree @ 2018-07-04 16:01 UTC (permalink / raw)
  To: davem; +Cc: netdev

On 02/07/18 16:14, Edward Cree wrote:
> +/* Receive a list of IP packets */
> +void ip_list_rcv(struct list_head *head, struct packet_type *pt,
> +		 struct net_device *orig_dev)
> +{
> +	struct net_device *curr_dev = NULL;
> +	struct net *curr_net = NULL;
> +	struct sk_buff *skb, *next;
> +	struct list_head sublist;
> +
> +	list_for_each_entry_safe(skb, next, head, list) {
> +		struct net_device *dev = skb->dev;
> +		struct net *net = dev_net(dev);
> +
> +		skb = ip_rcv_core(skb, net);
> +		if (skb == NULL)
> +			continue;
I've spotted a bug here, in that if ip_rcv_core() eats the skb (e.g. by
 freeing it) it won't list_del() it, so when we process the sublist we'll
 end up trying to process this skb anyway.
Thus, places where an skb could get freed (possibly remotely, as in nf
 hooks that can steal packets) need to use the dequeue/enqueue model
 rather than the list_cut_before() approach.
Followup patches soon.
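
(Not the actual fix, just a sketch of the dequeue/enqueue model described
 above: take every skb off the input list before ip_rcv_core() has a chance
 to eat it, and re-queue only the survivors onto a local sublist.)

void ip_list_rcv(struct list_head *head, struct packet_type *pt,
		 struct net_device *orig_dev)
{
	struct net_device *curr_dev = NULL;
	struct net *curr_net = NULL;
	struct sk_buff *skb, *next;
	struct list_head sublist;

	INIT_LIST_HEAD(&sublist);
	list_for_each_entry_safe(skb, next, head, list) {
		struct net_device *dev = skb->dev;
		struct net *net = dev_net(dev);

		list_del(&skb->list);	/* dequeue before anything can eat it */
		skb = ip_rcv_core(skb, net);
		if (skb == NULL)
			continue;

		if (curr_dev != dev || curr_net != net) {
			/* dispatch old sublist */
			if (!list_empty(&sublist))
				ip_sublist_rcv(&sublist, curr_dev, curr_net);
			/* start new sublist */
			INIT_LIST_HEAD(&sublist);
			curr_dev = dev;
			curr_net = net;
		}
		list_add_tail(&skb->list, &sublist);	/* enqueue survivor */
	}
	/* dispatch final sublist */
	if (!list_empty(&sublist))
		ip_sublist_rcv(&sublist, curr_dev, curr_net);
}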

-Ed

