* [PATCH v3 net-next 0/3] net: batched receive in GRO path
From: Edward Cree @ 2019-08-06 13:52 UTC
  To: David Miller; +Cc: netdev, Eric Dumazet, linux-net-drivers

This series listifies part of GRO processing, in a manner which allows those
 packets which are not GROed (i.e. for which dev_gro_receive returns
 GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor are the per-protocol GRO
 callbacks, since GRO's need to hold packets on lists under napi->gro_hash
 makes keeping the packets on other lists awkward, and since the GRO control
 block state of held skbs can refer only to one 'new' skb at a time.
Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
 onto a list in the napi struct; that list is passed up the stack at the end
 of the napi poll, or sooner if its length exceeds the (new) sysctl
 net.core.gro_normal_batch.
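
For reference, the core of the mechanism, condensed from patch #3 below:

	/* Pass the currently batched GRO_NORMAL skbs up to the stack,
	 * then reset the list.
	 */
	static void gro_normal_list(struct napi_struct *napi)
	{
		if (!napi->rx_count)
			return;
		netif_receive_skb_list_internal(&napi->rx_list);
		INIT_LIST_HEAD(&napi->rx_list);
		napi->rx_count = 0;
	}

	/* Queue one GRO_NORMAL skb; flush the batch once it reaches
	 * gro_normal_batch entries.
	 */
	static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
	{
		list_add_tail(&skb->list, &napi->rx_list);
		if (++napi->rx_count >= gro_normal_batch)
			gro_normal_list(napi);
	}

The batch size can be tuned at runtime via /proc/sys/net/core/gro_normal_batch.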

Performance figures for this series were collected on a back-to-back pair of
 Solarflare sfn8522-r2 NICs, using 120-second NetPerf tests.  In the stats,
 sample size n for old and new code is 6 runs each; p is from a Welch t-test.
Tests were run both with GRO enabled and disabled, the latter simulating
 uncoalesceable packets (e.g. due to IP or TCP options).  The receive side
 (which was the device under test) had the NetPerf process pinned to one CPU,
 and the device interrupts pinned to a second CPU.  CPU utilisation figures
 (used in cases of line-rate performance) are summed across all CPUs.
net.core.gro_normal_batch was left at its default value of 8.

TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)
TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next; but note * below)
TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)
TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)
TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload:
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next:   2.851 ms
after #1:   2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3:   2.937 ms (+ 3.0%, p<0.001 vs net-next)
TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results,
 meaning that statistically speaking the results were 'censored' by the
 upper bound, and were thus not normally distributed, making a Welch t-test
 mathematically invalid.  I therefore also calculated estimators according
 to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded
 either, so take that p-value with a grain of salt).
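
For reference, a sketch of the estimators from [1], in my own notation
 (applying Cohen's left-censored formulas to negated data to handle the
 right censoring is my gloss, so treat the details as an assumption rather
 than a quotation of the paper):

	% ML estimators for a normal sample of size n singly censored at
	% x_c, given the mean \bar{x} and variance s^2 of the m uncensored
	% observations; \hat{\lambda} = \lambda(\hat{h}, \hat{\gamma}) is
	% read from Cohen's tables.
	\hat{h} = (n - m)/n, \qquad \hat{\gamma} = s^2/(\bar{x} - x_c)^2
	\hat{\mu} = \bar{x} - \hat{\lambda}(\bar{x} - x_c), \qquad
	\hat{\sigma}^2 = s^2 + \hat{\lambda}(\bar{x} - x_c)^2
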
A further check came from dividing the bandwidth figure by the CPU usage for
 each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)

The above results are fairly mixed, and in most cases not statistically
 significant.  But I think we can roughly conclude that the series
 marginally improves non-GROable throughput, without hurting latency
 (except in the large-payload busy-polling case, which in any case yields
 horrid performance even on net-next: almost triple the latency without
 busy-poll).  Also, drivers which, unlike sfc, pass UDP traffic to GRO
 can expect to see a benefit from gaining access to batching.

Changed in v3:
 * gro_normal_batch sysctl now uses SYSCTL_ONE instead of &one
 * removed RFC tags (no comments after a week means no-one objects, right?)

Changed in v2:
 * During busy poll, call gro_normal_list() to receive batched packets
   after each cycle of the napi busy loop.  See comments in Patch #3 for
   complications of doing the same in busy_poll_stop().

[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859

Edward Cree (3):
  sfc: don't score irq moderation points for GRO
  sfc: falcon: don't score irq moderation points for GRO
  net: use listified RX for handling GRO_NORMAL skbs

 drivers/net/ethernet/sfc/falcon/rx.c |  5 +---
 drivers/net/ethernet/sfc/rx.c        |  5 +---
 include/linux/netdevice.h            |  3 ++
 net/core/dev.c                       | 44 ++++++++++++++++++++++++++--
 net/core/sysctl_net_core.c           |  8 +++++
 5 files changed, 54 insertions(+), 11 deletions(-)



* [PATCH v3 net-next 1/3] sfc: don't score irq moderation points for GRO
From: Edward Cree @ 2019-08-06 13:53 UTC
  To: David Miller; +Cc: netdev, Eric Dumazet, linux-net-drivers

We already score points when handling the RX event, and no-one else does
 this; looking at the history, it appears this was originally meant to
 score only on merges, not on GRO_NORMAL.  Moreover, it gets in the way
 of changing GRO to not immediately pass GRO_NORMAL skbs to the stack.
Performance testing with four TCP streams received on a single CPU (where
 throughput was at the line rate of 9.4Gbps in all tests) showed a 13.7%
 reduction in RX CPU usage (n=6, p=0.03).

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/rx.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index d5db045535d3..85ec07f5a674 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -412,7 +412,6 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
 		  unsigned int n_frags, u8 *eh)
 {
 	struct napi_struct *napi = &channel->napi_str;
-	gro_result_t gro_result;
 	struct efx_nic *efx = channel->efx;
 	struct sk_buff *skb;
 
@@ -449,9 +448,7 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
 
 	skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-	gro_result = napi_gro_frags(napi);
-	if (gro_result != GRO_DROP)
-		channel->irq_mod_score += 2;
+	napi_gro_frags(napi);
 }
 
 /* Allocate and construct an SKB around page fragments */



* [PATCH v3 net-next 2/3] sfc: falcon: don't score irq moderation points for GRO
From: Edward Cree @ 2019-08-06 13:53 UTC
  To: David Miller; +Cc: netdev, Eric Dumazet, linux-net-drivers

Same rationale as for sfc, except that this wasn't performance-tested.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/falcon/rx.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sfc/falcon/rx.c b/drivers/net/ethernet/sfc/falcon/rx.c
index fd850d3d8ec0..05ea3523890a 100644
--- a/drivers/net/ethernet/sfc/falcon/rx.c
+++ b/drivers/net/ethernet/sfc/falcon/rx.c
@@ -424,7 +424,6 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
 		  unsigned int n_frags, u8 *eh)
 {
 	struct napi_struct *napi = &channel->napi_str;
-	gro_result_t gro_result;
 	struct ef4_nic *efx = channel->efx;
 	struct sk_buff *skb;
 
@@ -460,9 +459,7 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
 
 	skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-	gro_result = napi_gro_frags(napi);
-	if (gro_result != GRO_DROP)
-		channel->irq_mod_score += 2;
+	napi_gro_frags(napi);
 }
 
 /* Allocate and construct an SKB around page fragments */



* [PATCH v3 net-next 3/3] net: use listified RX for handling GRO_NORMAL skbs
From: Edward Cree @ 2019-08-06 13:53 UTC
  To: David Miller; +Cc: netdev, Eric Dumazet, linux-net-drivers

When GRO decides not to coalesce a packet, in napi_frags_finish(), instead
 of passing it to the stack immediately, place it on a list in the napi
 struct.  Then, at flush time (napi_complete_done(), napi_poll(), or
 napi_busy_loop()), call netif_receive_skb_list_internal() on the list.
We'd like to do that in napi_gro_flush(), but it's not called if
 !napi->gro_bitmask, so we have to do it in the callers instead.  (There are
 a handful of drivers that call napi_gro_flush() themselves, but it's not
 clear why, or whether this will affect them.)
Because a full NAPI budget's worth of 64 packets is an inefficiently large
 batch, also consume the list whenever it exceeds gro_normal_batch, a new
 net/core sysctl that defaults to 8.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h  |  3 +++
 net/core/dev.c             | 44 +++++++++++++++++++++++++++++++++++---
 net/core/sysctl_net_core.c |  8 +++++++
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 88292953aa6f..55ac223553f8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -332,6 +332,8 @@ struct napi_struct {
 	struct net_device	*dev;
 	struct gro_list		gro_hash[GRO_HASH_BUCKETS];
 	struct sk_buff		*skb;
+	struct list_head	rx_list; /* Pending GRO_NORMAL skbs */
+	int			rx_count; /* length of rx_list */
 	struct hrtimer		timer;
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
@@ -4239,6 +4241,7 @@ extern int		dev_weight_rx_bias;
 extern int		dev_weight_tx_bias;
 extern int		dev_rx_weight;
 extern int		dev_tx_weight;
+extern int		gro_normal_batch;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/net/core/dev.c b/net/core/dev.c
index e2a11c62197b..4dd8db150ae5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3963,6 +3963,8 @@ int dev_weight_rx_bias __read_mostly = 1;  /* bias for backlog weight */
 int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
 int dev_rx_weight __read_mostly = 64;
 int dev_tx_weight __read_mostly = 64;
+/* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
+int gro_normal_batch __read_mostly = 8;
 
 /* Called with irq disabled */
 static inline void ____napi_schedule(struct softnet_data *sd,
@@ -5742,6 +5744,26 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(napi_get_frags);
 
+/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
+static void gro_normal_list(struct napi_struct *napi)
+{
+	if (!napi->rx_count)
+		return;
+	netif_receive_skb_list_internal(&napi->rx_list);
+	INIT_LIST_HEAD(&napi->rx_list);
+	napi->rx_count = 0;
+}
+
+/* Queue one GRO_NORMAL SKB up for list processing.  If batch size exceeded,
+ * pass the whole batch up to the stack.
+ */
+static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
+{
+	list_add_tail(&skb->list, &napi->rx_list);
+	if (++napi->rx_count >= gro_normal_batch)
+		gro_normal_list(napi);
+}
+
 static gro_result_t napi_frags_finish(struct napi_struct *napi,
 				      struct sk_buff *skb,
 				      gro_result_t ret)
@@ -5751,8 +5773,8 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
 	case GRO_HELD:
 		__skb_push(skb, ETH_HLEN);
 		skb->protocol = eth_type_trans(skb, skb->dev);
-		if (ret == GRO_NORMAL && netif_receive_skb_internal(skb))
-			ret = GRO_DROP;
+		if (ret == GRO_NORMAL)
+			gro_normal_one(napi, skb);
 		break;
 
 	case GRO_DROP:
@@ -6029,6 +6051,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 				 NAPIF_STATE_IN_BUSY_POLL)))
 		return false;
 
+	gro_normal_list(n);
+
 	if (n->gro_bitmask) {
 		unsigned long timeout = 0;
 
@@ -6114,10 +6138,19 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
 	 * Ideally, a new ndo_busy_poll_stop() could avoid another round.
 	 */
 	rc = napi->poll(napi, BUSY_POLL_BUDGET);
+	/* We can't gro_normal_list() here, because napi->poll() might have
+	 * rearmed the napi (napi_complete_done()) in which case it could
+	 * already be running on another CPU.
+	 */
 	trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
 	netpoll_poll_unlock(have_poll_lock);
-	if (rc == BUSY_POLL_BUDGET)
+	if (rc == BUSY_POLL_BUDGET) {
+		/* As the whole budget was spent, we still own the napi so can
+		 * safely handle the rx_list.
+		 */
+		gro_normal_list(napi);
 		__napi_schedule(napi);
+	}
 	local_bh_enable();
 }
 
@@ -6162,6 +6195,7 @@ void napi_busy_loop(unsigned int napi_id,
 		}
 		work = napi_poll(napi, BUSY_POLL_BUDGET);
 		trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
+		gro_normal_list(napi);
 count:
 		if (work > 0)
 			__NET_ADD_STATS(dev_net(napi->dev),
@@ -6267,6 +6301,8 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 	napi->timer.function = napi_watchdog;
 	init_gro_hash(napi);
 	napi->skb = NULL;
+	INIT_LIST_HEAD(&napi->rx_list);
+	napi->rx_count = 0;
 	napi->poll = poll;
 	if (weight > NAPI_POLL_WEIGHT)
 		netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6363,6 +6399,8 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 		goto out_unlock;
 	}
 
+	gro_normal_list(n);
+
 	if (n->gro_bitmask) {
 		/* flush too old packets
 		 * If HZ < 1000, flush all packets.
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8da5b3a54dac..eb29e5adc84d 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -567,6 +567,14 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_do_static_key,
 	},
+	{
+		.procname	= "gro_normal_batch",
+		.data		= &gro_normal_batch,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
+	},
 	{ }
 };
 


* Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
From: David Miller @ 2019-08-09  1:22 UTC
  To: ecree; +Cc: netdev, eric.dumazet, linux-net-drivers

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 6 Aug 2019 14:52:06 +0100

> This series listifies part of GRO processing, in a manner which allows those
>  packets which are not GROed (i.e. for which dev_gro_receive returns
>  GRO_NORMAL) to be passed on to the listified regular receive path.
> dev_gro_receive() itself is not listified, nor are the per-protocol GRO
>  callbacks, since GRO's need to hold packets on lists under napi->gro_hash
>  makes keeping the packets on other lists awkward, and since the GRO control
>  block state of held skbs can refer only to one 'new' skb at a time.
> Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
>  onto a list in the napi struct; that list is passed up the stack at the end
>  of the napi poll, or sooner if its length exceeds the (new) sysctl
>  net.core.gro_normal_batch.
 ...

Series applied, thanks Edward.


* RE: [PATCH v3 net-next 0/3] net: batched receive in GRO path
From: Ioana Ciocoi Radulescu @ 2019-08-09 17:14 UTC
  To: Edward Cree, David Miller; +Cc: netdev, Eric Dumazet, linux-net-drivers

> -----Original Message-----
> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Edward Cree
> Sent: Tuesday, August 6, 2019 4:52 PM
> To: David Miller <davem@davemloft.net>
> Cc: netdev <netdev@vger.kernel.org>; Eric Dumazet
> <eric.dumazet@gmail.com>; linux-net-drivers@solarflare.com
> Subject: [PATCH v3 net-next 0/3] net: batched receive in GRO path
> 
> This series listifies part of GRO processing, in a manner which allows those
>  packets which are not GROed (i.e. for which dev_gro_receive returns
>  GRO_NORMAL) to be passed on to the listified regular receive path.
> dev_gro_receive() itself is not listified, nor are the per-protocol GRO
>  callbacks, since GRO's need to hold packets on lists under napi->gro_hash
>  makes keeping the packets on other lists awkward, and since the GRO control
>  block state of held skbs can refer only to one 'new' skb at a time.
> Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
>  onto a list in the napi struct; that list is passed up the stack at the end
>  of the napi poll, or sooner if its length exceeds the (new) sysctl
>  net.core.gro_normal_batch.

Hi Edward,

I'm probably missing a lot of context here, but is there a reason
this change targets only the napi_gro_frags() path and not the
napi_gro_receive() one?

I'm trying to understand what drivers that don't call napi_gro_frags()
should do in order to benefit from this batching feature.

Thanks,
Ioana


* Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
From: Edward Cree @ 2019-08-09 17:32 UTC
  To: Ioana Ciocoi Radulescu
  Cc: David Miller, netdev, Eric Dumazet, linux-net-drivers

On 09/08/2019 18:14, Ioana Ciocoi Radulescu wrote:
> Hi Edward,
>
> I'm probably missing a lot of context here, but is there a reason
> this change targets only the napi_gro_frags() path and not the
> napi_gro_receive() one?
> I'm trying to understand what drivers that don't call napi_gro_frags()
> should do in order to benefit from this batching feature.
The sfc driver (which is what I have lots of hardware for, so I can
 test it) uses napi_gro_frags().
It should be possible to do a similar patch to napi_gro_receive(),
 if someone wants to put in the effort of writing and testing it.
However, there are many more callers, so more effort is required to
 make sure none of them care whether the return value is GRO_DROP
 or GRO_NORMAL (since the listified version cannot give that
 indication).
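
For concreteness, a rough sketch of what such a patch might look like
 (hypothetical and untested, not part of this series; it assumes
 napi_skb_finish() gains a napi argument so it can reach the rx_list,
 and it reuses gro_normal_one() from patch #3):

	static gro_result_t napi_skb_finish(struct napi_struct *napi,
					    struct sk_buff *skb,
					    gro_result_t ret)
	{
		switch (ret) {
		case GRO_NORMAL:
			/* Batch instead of calling
			 * netif_receive_skb_internal() directly.
			 */
			gro_normal_one(napi, skb);
			break;
		case GRO_DROP:
			kfree_skb(skb);
			break;
		case GRO_MERGED_FREE:
			if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
				napi_skb_free_stolen_head(skb);
			else
				__kfree_skb(skb);
			break;
		case GRO_HELD:
		case GRO_MERGED:
		case GRO_CONSUMED:
			break;
		}
		return ret;
	}

Note that the GRO_NORMAL case can then no longer degrade to GRO_DROP on a
 receive failure, which is exactly what the callers would need auditing for.
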
Also, the guidance from Eric is that drivers seeking high performance
 should use napi_gro_frags(), as this allows GRO to recycle the SKB.

All of this together means I don't plan to submit such a patch; I
 would however be happy to review a patch if someone else writes one.

HTH,
-Ed


* RE: [PATCH v3 net-next 0/3] net: batched receive in GRO path
From: Ioana Ciocoi Radulescu @ 2019-08-12 17:51 UTC
  To: Edward Cree; +Cc: David Miller, netdev, Eric Dumazet, linux-net-drivers

> -----Original Message-----
> From: Edward Cree <ecree@solarflare.com>
> Sent: Friday, August 9, 2019 8:32 PM
> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
> Cc: David Miller <davem@davemloft.net>; netdev <netdev@vger.kernel.org>;
> Eric Dumazet <eric.dumazet@gmail.com>; linux-net-drivers@solarflare.com
> Subject: Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
> 
> On 09/08/2019 18:14, Ioana Ciocoi Radulescu wrote:
> > Hi Edward,
> >
> > I'm probably missing a lot of context here, but is there a reason
> > this change targets only the napi_gro_frags() path and not the
> > napi_gro_receive() one?
> > I'm trying to understand what drivers that don't call napi_gro_frags()
> > should do in order to benefit from this batching feature.
> The sfc driver (which is what I have lots of hardware for, so I can
>  test it) uses napi_gro_frags().
> It should be possible to do a similar patch to napi_gro_receive(),
>  if someone wants to put in the effort of writing and testing it.

Rather tricky, since I'm not really familiar with GRO internals and
probably don't understand all the implications of such a change :-/
Any pointers to what I should pay attention to/sensitive areas that
need extra care?

> However, there are many more callers, so more effort is required to
>  make sure none of them care whether the return value is GRO_DROP
>  or GRO_NORMAL (since the listified version cannot give that
>  indication).

At a quick glance, there's only one driver that looks at the return
value of napi_gro_receive (drivers/net/ethernet/socionext/netsec.c),
and it only updates interface stats based on it.
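
For illustration, the pattern in question is roughly the following
(paraphrased, not an exact quote from netsec.c; priv, ndev and len stand
in for the driver's own variables):

	/* A caller that branches on the napi_gro_receive() return value.
	 * The listified GRO_NORMAL path cannot report a per-skb GRO_DROP,
	 * so accounting like this would need auditing.
	 */
	if (napi_gro_receive(&priv->napi, skb) != GRO_DROP) {
		ndev->stats.rx_packets++;
		ndev->stats.rx_bytes += len;
	}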

> Also, the guidance from Eric is that drivers seeking high performance
>  should use napi_gro_frags(), as this allows GRO to recycle the SKB.

But this guidance is for GRO-able frames only, right? If I try to use
napi_gro_frags() indiscriminately on the Rx path, I get a big
performance penalty in some cases - e.g. forwarding of non-TCP
single buffer frames.

On the other hand, Eric shot down my attempt to differentiate between
TCP and non-TCP frames inside the driver (see
https://patchwork.ozlabs.org/patch/1135817/#2222236), so I'm not
really sure what the recommended approach is here.

> 
> All of this together means I don't plan to submit such a patch; I
>  would however be happy to review a patch if someone else writes one.

Thanks a lot for the explanations!
Ioana


* Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
From: Eric Dumazet @ 2019-08-12 18:20 UTC
  To: Ioana Ciocoi Radulescu, Edward Cree
  Cc: David Miller, netdev, Eric Dumazet, linux-net-drivers



On 8/12/19 7:51 PM, Ioana Ciocoi Radulescu wrote:
>> -----Original Message-----
>> From: Edward Cree <ecree@solarflare.com>
>> Sent: Friday, August 9, 2019 8:32 PM
>> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
>> Cc: David Miller <davem@davemloft.net>; netdev <netdev@vger.kernel.org>;
>> Eric Dumazet <eric.dumazet@gmail.com>; linux-net-drivers@solarflare.com
>> Subject: Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
>>
>> On 09/08/2019 18:14, Ioana Ciocoi Radulescu wrote:
>>> Hi Edward,
>>>
>>> I'm probably missing a lot of context here, but is there a reason
>>> this change targets only the napi_gro_frags() path and not the
>>> napi_gro_receive() one?
>>> I'm trying to understand what drivers that don't call napi_gro_frags()
>>> should do in order to benefit from this batching feature.
>> The sfc driver (which is what I have lots of hardware for, so I can
>>  test it) uses napi_gro_frags().
>> It should be possible to do a similar patch to napi_gro_receive(),
>>  if someone wants to put in the effort of writing and testing it.
> 
> Rather tricky, since I'm not really familiar with GRO internals and
> probably don't understand all the implications of such a change :-/
> Any pointers to what I should pay attention to/sensitive areas that
> need extra care?
> 
>> However, there are many more callers, so more effort is required to
>>  make sure none of them care whether the return value is GRO_DROP
>>  or GRO_NORMAL (since the listified version cannot give that
>>  indication).
> 
> At a quick glance, there's only one driver that looks at the return
> value of napi_gro_receive (drivers/net/ethernet/socionext/netsec.c),
> and it only updates interface stats based on it.
> 
>> Also, the guidance from Eric is that drivers seeking high performance
>>  should use napi_gro_frags(), as this allows GRO to recycle the SKB.
> 
> But this guidance is for GRO-able frames only, right? If I try to use
> napi_gro_frags() indiscriminately on the Rx path, I get a big
> performance penalty in some cases - e.g. forwarding of non-TCP
> single buffer frames.

How big is big?

You cannot win all the time.

Some designs (or optimizations) target the most common case;
they might hurt some other use cases.

> 
> On the other hand, Eric shot down my attempt to differentiate between
> TCP and non-TCP frames inside the driver (see
> https://patchwork.ozlabs.org/patch/1135817/#2222236), so I'm not
> really sure what the recommended approach is here.

If GRO is not good enough for non-TCP single-buffer frames, please make the
change in GRO, or document that disabling GRO might help some setups.

We do not want each driver to implement its own logic; that would be a
maintenance nightmare.

GRO can aggregate non-TCP frames (say, if you add any encapsulation over TCP)
with a very significant gain, so detecting whether an incoming frame is a
'TCP packet' in the driver would be a serious problem if the traffic is,
for example, 100% SIT.


