All of lore.kernel.org
* [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs
@ 2020-04-22 16:13 Eric Dumazet
  2020-04-22 16:13 ` [PATCH net-next 1/3] net: napi: add hard irqs deferral feature Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Eric Dumazet @ 2020-04-22 16:13 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Luigi Rizzo, Eric Dumazet

This patch series augments the gro_flush_timeout feature with napi_defer_hard_irqs.

As extensively described in the first patch's changelog, this can suppress
the chit-chat traffic between the NIC and the host needed to signal interrupts
and re-arm them, which can be an issue on high speed NICs with many queues.

The last patch in this series converts mlx4 TX completion to
napi_complete_done(), to enable this new mechanism.

Eric Dumazet (3):
  net: napi: add hard irqs deferral feature
  net: napi: use READ_ONCE()/WRITE_ONCE()
  net/mlx4_en: use napi_complete_done() in TX completion

 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 20 +++++++-------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 +--
 include/linux/netdevice.h                    |  2 ++
 net/core/dev.c                               | 29 ++++++++++++--------
 net/core/net-sysfs.c                         | 20 +++++++++++++-
 6 files changed, 52 insertions(+), 25 deletions(-)

-- 
2.26.1.301.g55bc3eb7cb9-goog


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
  2020-04-22 16:13 [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs Eric Dumazet
@ 2020-04-22 16:13 ` Eric Dumazet
  2020-05-02 14:56   ` Julian Wiedmann
  2020-04-22 16:13 ` [PATCH net-next 2/3] net: napi: use READ_ONCE()/WRITE_ONCE() Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2020-04-22 16:13 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Luigi Rizzo, Eric Dumazet

Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
we added the ability to arm a high resolution timer, used
to keep incomplete packets in the GRO engine a bit longer, hoping that further
frames might be merged into them.

Since then, we added the napi_complete_done() interface, and commit
364b6055738b ("net: busy-poll: return busypolling status to drivers")
allowed drivers to avoid re-arming NIC interrupts if we made a promise
that their NAPI poll() handler would be called in the near future.

This infrastructure can now be leveraged, thanks to a new device parameter
which allows arming the napi hrtimer instead of re-arming the device
hard IRQ.

We have noticed that on some servers with 32 RX queues or more, the chit-chat
between the NIC and the host caused by IRQ delivery and re-arming could hurt
throughput by ~20% on a 100Gbit NIC.

In contrast, hrtimers use local (per-cpu) resources and might have a lower
cost.

The new tunable, named napi_defer_hard_irqs, is placed in the same sysfs
hierarchy as gro_flush_timeout (/sys/class/net/ethX/).

By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.

This patch does not change the prior behavior of gro_flush_timeout
if used alone: NIC hard irqs are re-armed as before.

One concrete usage can be:

echo 20000 >/sys/class/net/eth1/gro_flush_timeout
echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs

If at least one packet is retired, then the napi counter is reset
to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
of the queue.

On busy queues, this should avoid NIC hard IRQs, whereas before this patch
IRQ avoidance was only possible if napi->poll() exhausted its budget
and did not call napi_complete_done().

This feature also can be used to work around some non-optimal NIC irq
coalescing strategies.

Having the ability to insert XX usec delays between each napi->poll()
can increase cache efficiency, since we increase batch sizes.

It also keeps serving cpus from staying idle too long, reducing tail latencies.

Co-developed-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c            | 29 ++++++++++++++++++-----------
 net/core/net-sysfs.c      | 18 ++++++++++++++++++
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0750b54b37651890eb297e9f97fc956fb5cc48c7..5a8d40f1ffe2afce7ca36c290786d87eb730b8fc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -329,6 +329,7 @@ struct napi_struct {
 
 	unsigned long		state;
 	int			weight;
+	int			defer_hard_irqs_count;
 	unsigned long		gro_bitmask;
 	int			(*poll)(struct napi_struct *, int);
 #ifdef CONFIG_NETPOLL
@@ -1995,6 +1996,7 @@ struct net_device {
 
 	struct bpf_prog __rcu	*xdp_prog;
 	unsigned long		gro_flush_timeout;
+	int			napi_defer_hard_irqs;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fb61522b1ce163963f8d0bf54c7c5fa58fce7b9a..67585484ad32b698c6bc4bf17f5d87c345d77502 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6227,7 +6227,8 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
 
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
-	unsigned long flags, val, new;
+	unsigned long flags, val, new, timeout = 0;
+	bool ret = true;
 
 	/*
 	 * 1) Don't let napi dequeue from the cpu poll list
@@ -6239,20 +6240,23 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 				 NAPIF_STATE_IN_BUSY_POLL)))
 		return false;
 
-	if (n->gro_bitmask) {
-		unsigned long timeout = 0;
-
-		if (work_done)
+	if (work_done) {
+		if (n->gro_bitmask)
 			timeout = n->dev->gro_flush_timeout;
-
+		n->defer_hard_irqs_count = n->dev->napi_defer_hard_irqs;
+	}
+	if (n->defer_hard_irqs_count > 0) {
+		n->defer_hard_irqs_count--;
+		timeout = n->dev->gro_flush_timeout;
+		if (timeout)
+			ret = false;
+	}
+	if (n->gro_bitmask) {
 		/* When the NAPI instance uses a timeout and keeps postponing
 		 * it, we need to bound somehow the time packets are kept in
 		 * the GRO layer
 		 */
 		napi_gro_flush(n, !!timeout);
-		if (timeout)
-			hrtimer_start(&n->timer, ns_to_ktime(timeout),
-				      HRTIMER_MODE_REL_PINNED);
 	}
 
 	gro_normal_list(n);
@@ -6284,7 +6288,10 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		return false;
 	}
 
-	return true;
+	if (timeout)
+		hrtimer_start(&n->timer, ns_to_ktime(timeout),
+			      HRTIMER_MODE_REL_PINNED);
+	return ret;
 }
 EXPORT_SYMBOL(napi_complete_done);
 
@@ -6464,7 +6471,7 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
 	/* Note : we use a relaxed variant of napi_schedule_prep() not setting
 	 * NAPI_STATE_MISSED, since we do not react to a device IRQ.
 	 */
-	if (napi->gro_bitmask && !napi_disable_pending(napi) &&
+	if (!napi_disable_pending(napi) &&
 	    !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
 		__napi_schedule_irqoff(napi);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 0d9e46de205e9e5338b56ecd9441338929e07bc1..f3b650cd09231fd99604f6f66bab454eabaa06be 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -382,6 +382,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
 }
 NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
 
+static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val)
+{
+	dev->napi_defer_hard_irqs = val;
+	return 0;
+}
+
+static ssize_t napi_defer_hard_irqs_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t len)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	return netdev_store(dev, attr, buf, len, change_napi_defer_hard_irqs);
+}
+NETDEVICE_SHOW_RW(napi_defer_hard_irqs, fmt_dec);
+
 static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
@@ -545,6 +562,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
 	&dev_attr_flags.attr,
 	&dev_attr_tx_queue_len.attr,
 	&dev_attr_gro_flush_timeout.attr,
+	&dev_attr_napi_defer_hard_irqs.attr,
 	&dev_attr_phys_port_id.attr,
 	&dev_attr_phys_port_name.attr,
 	&dev_attr_phys_switch_id.attr,
-- 
2.26.1.301.g55bc3eb7cb9-goog



* [PATCH net-next 2/3] net: napi: use READ_ONCE()/WRITE_ONCE()
  2020-04-22 16:13 [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs Eric Dumazet
  2020-04-22 16:13 ` [PATCH net-next 1/3] net: napi: add hard irqs deferral feature Eric Dumazet
@ 2020-04-22 16:13 ` Eric Dumazet
  2020-04-22 16:13 ` [PATCH net-next 3/3] net/mlx4_en: use napi_complete_done() in TX completion Eric Dumazet
  2020-04-23 19:43 ` [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs David Miller
  3 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2020-04-22 16:13 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Luigi Rizzo, Eric Dumazet

gro_flush_timeout and napi_defer_hard_irqs can be read
from napi_complete_done() while other cpus write the value,
without explicit synchronization.

Use READ_ONCE()/WRITE_ONCE() to annotate the races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/dev.c       | 6 +++---
 net/core/net-sysfs.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 67585484ad32b698c6bc4bf17f5d87c345d77502..afff16849c261181b4f1043cf3d23627c75c7898 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6242,12 +6242,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 
 	if (work_done) {
 		if (n->gro_bitmask)
-			timeout = n->dev->gro_flush_timeout;
-		n->defer_hard_irqs_count = n->dev->napi_defer_hard_irqs;
+			timeout = READ_ONCE(n->dev->gro_flush_timeout);
+		n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
 	}
 	if (n->defer_hard_irqs_count > 0) {
 		n->defer_hard_irqs_count--;
-		timeout = n->dev->gro_flush_timeout;
+		timeout = READ_ONCE(n->dev->gro_flush_timeout);
 		if (timeout)
 			ret = false;
 	}
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index f3b650cd09231fd99604f6f66bab454eabaa06be..880e89c894f6f3669b132547926164c1f36fc986 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -367,7 +367,7 @@ NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
 
 static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
 {
-	dev->gro_flush_timeout = val;
+	WRITE_ONCE(dev->gro_flush_timeout, val);
 	return 0;
 }
 
@@ -384,7 +384,7 @@ NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
 
 static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val)
 {
-	dev->napi_defer_hard_irqs = val;
+	WRITE_ONCE(dev->napi_defer_hard_irqs, val);
 	return 0;
 }
 
-- 
2.26.1.301.g55bc3eb7cb9-goog



* [PATCH net-next 3/3] net/mlx4_en: use napi_complete_done() in TX completion
  2020-04-22 16:13 [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs Eric Dumazet
  2020-04-22 16:13 ` [PATCH net-next 1/3] net: napi: add hard irqs deferral feature Eric Dumazet
  2020-04-22 16:13 ` [PATCH net-next 2/3] net: napi: use READ_ONCE()/WRITE_ONCE() Eric Dumazet
@ 2020-04-22 16:13 ` Eric Dumazet
  2020-04-23 19:43 ` [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs David Miller
  3 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2020-04-22 16:13 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Luigi Rizzo, Eric Dumazet

In order to benefit from the new napi_defer_hard_irqs feature,
we need to use the napi_complete_done() variant in this driver.

The RX path already uses it; this patch converts the TX completion side.

mlx4_en_process_tx_cq() now returns the number of retired packets
instead of a boolean, so that mlx4_en_poll_tx_cq() can pass
this value to napi_complete_done().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 20 ++++++++++----------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 ++--
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index db3552f2d0877e37ce8dcf215d4c273e91c2326c..7871392198130fa7d1a09baf26a0a00f1bf2e1f5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -946,7 +946,7 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 		xdp_tx_cq = priv->tx_cq[TX_XDP][cq->ring];
 		if (xdp_tx_cq->xdp_busy) {
 			clean_complete = mlx4_en_process_tx_cq(dev, xdp_tx_cq,
-							       budget);
+							       budget) < budget;
 			xdp_tx_cq->xdp_busy = !clean_complete;
 		}
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 4d5ca302c067126b8627cb4809485b45c10e2460..a99d3ed49ed684db5d5b90e78e0767f97ee6cc9d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -382,8 +382,8 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 	return cnt;
 }
 
-bool mlx4_en_process_tx_cq(struct net_device *dev,
-			   struct mlx4_en_cq *cq, int napi_budget)
+int mlx4_en_process_tx_cq(struct net_device *dev,
+			  struct mlx4_en_cq *cq, int napi_budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_cq *mcq = &cq->mcq;
@@ -405,7 +405,7 @@ bool mlx4_en_process_tx_cq(struct net_device *dev,
 	u32 ring_cons;
 
 	if (unlikely(!priv->port_up))
-		return true;
+		return 0;
 
 	netdev_txq_bql_complete_prefetchw(ring->tx_queue);
 
@@ -480,7 +480,7 @@ bool mlx4_en_process_tx_cq(struct net_device *dev,
 	WRITE_ONCE(ring->cons, ring_cons + txbbs_skipped);
 
 	if (cq->type == TX_XDP)
-		return done < budget;
+		return done;
 
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
 
@@ -492,7 +492,7 @@ bool mlx4_en_process_tx_cq(struct net_device *dev,
 		ring->wake_queue++;
 	}
 
-	return done < budget;
+	return done;
 }
 
 void mlx4_en_tx_irq(struct mlx4_cq *mcq)
@@ -512,14 +512,14 @@ int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget)
 	struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
 	struct net_device *dev = cq->dev;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	bool clean_complete;
+	int work_done;
 
-	clean_complete = mlx4_en_process_tx_cq(dev, cq, budget);
-	if (!clean_complete)
+	work_done = mlx4_en_process_tx_cq(dev, cq, budget);
+	if (work_done >= budget)
 		return budget;
 
-	napi_complete(napi);
-	mlx4_en_arm_cq(priv, cq);
+	if (napi_complete_done(napi, work_done))
+		mlx4_en_arm_cq(priv, cq);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 630f15977f091c1e28eceb7b6bc33414a69d5694..9f5603612960303c5d9f37603d8f7e51ddee9ac6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -737,8 +737,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev,
 			  int budget);
 int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget);
 int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget);
-bool mlx4_en_process_tx_cq(struct net_device *dev,
-			   struct mlx4_en_cq *cq, int napi_budget);
+int mlx4_en_process_tx_cq(struct net_device *dev,
+			  struct mlx4_en_cq *cq, int napi_budget);
 u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			 struct mlx4_en_tx_ring *ring,
 			 int index, u64 timestamp,
-- 
2.26.1.301.g55bc3eb7cb9-goog



* Re: [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs
  2020-04-22 16:13 [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs Eric Dumazet
                   ` (2 preceding siblings ...)
  2020-04-22 16:13 ` [PATCH net-next 3/3] net/mlx4_en: use napi_complete_done() in TX completion Eric Dumazet
@ 2020-04-23 19:43 ` David Miller
  3 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2020-04-23 19:43 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, lrizzo, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Wed, 22 Apr 2020 09:13:26 -0700

> This patch series augments the gro_flush_timeout feature with napi_defer_hard_irqs.
> 
> As extensively described in the first patch's changelog, this can suppress
> the chit-chat traffic between the NIC and the host needed to signal interrupts
> and re-arm them, which can be an issue on high speed NICs with many queues.
> 
> The last patch in this series converts mlx4 TX completion to
> napi_complete_done(), to enable this new mechanism.

Series applied, thanks Eric.


* Re: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
  2020-04-22 16:13 ` [PATCH net-next 1/3] net: napi: add hard irqs deferral feature Eric Dumazet
@ 2020-05-02 14:56   ` Julian Wiedmann
  2020-05-02 15:40     ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Julian Wiedmann @ 2020-05-02 14:56 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Luigi Rizzo, Eric Dumazet

On 22.04.20 18:13, Eric Dumazet wrote:
> Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
> we added the ability to arm one high resolution timer, that we used
> to keep not-complete packets in GRO engine a bit longer, hoping that further
> frames might be added to them.
> 
> Since then, we added the napi_complete_done() interface, and commit
> 364b6055738b ("net: busy-poll: return busypolling status to drivers")
> allowed drivers to avoid re-arming NIC interrupts if we made a promise
> that their NAPI poll() handler would be called in the near future.
> 
> This infrastructure can be leveraged, thanks to a new device parameter,
> which allows to arm the napi hrtimer, instead of re-arming the device
> hard IRQ.
> 
> We have noticed that on some servers with 32 RX queues or more, the chit-chat
> between the NIC and the host caused by IRQ delivery and re-arming could hurt
> throughput by ~20% on 100Gbit NIC.
> 
> In contrast, hrtimers are using local (percpu) resources and might have lower
> cost.
> 
> The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
> than gro_flush_timeout (/sys/class/net/ethX/)
> 

Hi Eric,
could you please add some Documentation for this new sysfs tunable? Thanks!
Looks like gro_flush_timeout is missing the same :).

> By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
> 
> This patch does not change the prior behavior of gro_flush_timeout
> if used alone : NIC hard irqs should be rearmed as before.
> 
> One concrete usage can be :
> 
> echo 20000 >/sys/class/net/eth1/gro_flush_timeout
> echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
> 
> If at least one packet is retired, then we will reset napi counter
> to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
> of the queue.
> 
> On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
> avoidance was only possible if napi->poll() was exhausting its budget
> and not call napi_complete_done().
> 

I was confused here for a second, so let me just clarify how this is intended
to look for pure TX completion IRQs:

napi->poll() calls napi_complete_done() with an accurate work_done value, but
then still returns 0 because TX completion work doesn't consume NAPI budget.

> This feature also can be used to work around some non-optimal NIC irq
> coalescing strategies.
> 
> Having the ability to insert XX usec delays between each napi->poll()
> can increase cache efficiency, since we increase batch sizes.
> 
> It also keeps serving cpus not idle too long, reducing tail latencies.
> 
> Co-developed-by: Luigi Rizzo <lrizzo@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>


* Re: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
  2020-05-02 14:56   ` Julian Wiedmann
@ 2020-05-02 15:40     ` Eric Dumazet
  2020-05-02 16:10       ` Julian Wiedmann
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2020-05-02 15:40 UTC (permalink / raw)
  To: Julian Wiedmann; +Cc: David S . Miller, netdev, Luigi Rizzo, Eric Dumazet

On Sat, May 2, 2020 at 7:56 AM Julian Wiedmann <jwi@linux.ibm.com> wrote:
>
> On 22.04.20 18:13, Eric Dumazet wrote:
> > Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
> > we added the ability to arm one high resolution timer, that we used
> > to keep not-complete packets in GRO engine a bit longer, hoping that further
> > frames might be added to them.
> >
> > Since then, we added the napi_complete_done() interface, and commit
> > 364b6055738b ("net: busy-poll: return busypolling status to drivers")
> > allowed drivers to avoid re-arming NIC interrupts if we made a promise
> > that their NAPI poll() handler would be called in the near future.
> >
> > This infrastructure can be leveraged, thanks to a new device parameter,
> > which allows to arm the napi hrtimer, instead of re-arming the device
> > hard IRQ.
> >
> > We have noticed that on some servers with 32 RX queues or more, the chit-chat
> > between the NIC and the host caused by IRQ delivery and re-arming could hurt
> > throughput by ~20% on 100Gbit NIC.
> >
> > In contrast, hrtimers are using local (percpu) resources and might have lower
> > cost.
> >
> > The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
> > than gro_flush_timeout (/sys/class/net/ethX/)
> >
>
> Hi Eric,
> could you please add some Documentation for this new sysfs tunable? Thanks!
> Looks like gro_flush_timeout is missing the same :).


Yes. I was planning to add this in
Documentation/networking/scaling.rst, once our fires are extinguished.

>
>
> > By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
> >
> > This patch does not change the prior behavior of gro_flush_timeout
> > if used alone : NIC hard irqs should be rearmed as before.
> >
> > One concrete usage can be :
> >
> > echo 20000 >/sys/class/net/eth1/gro_flush_timeout
> > echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
> >
> > If at least one packet is retired, then we will reset napi counter
> > to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
> > of the queue.
> >
> > On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
> > avoidance was only possible if napi->poll() was exhausting its budget
> > and not call napi_complete_done().
> >
>
> I was confused here for a second, so let me just clarify how this is intended
> to look like for pure TX completion IRQs:
>
> napi->poll() calls napi_complete_done() with an accurate work_done value, but
> then still returns 0 because TX completion work doesn't consume NAPI budget.


If the napi budget was consumed, the driver does _not_ call
napi_complete() or napi_complete_done() anyway.

If the budget is not fully consumed, then napi_complete_done(napi, X>0) is
allowed to return 0 (false) when napi_defer_hard_irqs is not 0.

This means that the NIC hard irq will stay disabled for at least one more round.
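
The calling convention being described here is the usual NAPI poll shape; as a
kernel-side sketch (not compilable on its own, with my_clean_queue() and
my_rearm_irq() as hypothetical driver helpers):

```c
/* Sketch of a NAPI poll handler cooperating with napi_defer_hard_irqs.
 * my_clean_queue() and my_rearm_irq() are hypothetical placeholders.
 */
static int my_napi_poll(struct napi_struct *napi, int budget)
{
	int work_done = my_clean_queue(napi, budget);

	if (work_done >= budget)
		return budget;	/* stay scheduled; do not touch the IRQ */

	/* napi_complete_done() returns false while the deferral counter
	 * (backed by a non-zero gro_flush_timeout) lets the hrtimer
	 * schedule the next poll, so the hard IRQ must stay masked.
	 */
	if (napi_complete_done(napi, work_done))
		my_rearm_irq(napi);

	return work_done;
}
```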


>
>
> > This feature also can be used to work around some non-optimal NIC irq
> > coalescing strategies.
> >
> > Having the ability to insert XX usec delays between each napi->poll()
> > can increase cache efficiency, since we increase batch sizes.
> >
> > It also keeps serving cpus not idle too long, reducing tail latencies.
> >
> > Co-developed-by: Luigi Rizzo <lrizzo@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>


* Re: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
  2020-05-02 15:40     ` Eric Dumazet
@ 2020-05-02 16:10       ` Julian Wiedmann
  2020-05-02 16:24         ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Julian Wiedmann @ 2020-05-02 16:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, netdev, Luigi Rizzo, Eric Dumazet

On 02.05.20 17:40, Eric Dumazet wrote:
> On Sat, May 2, 2020 at 7:56 AM Julian Wiedmann <jwi@linux.ibm.com> wrote:
>>
>> On 22.04.20 18:13, Eric Dumazet wrote:
>>> Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
>>> we added the ability to arm one high resolution timer, that we used
>>> to keep not-complete packets in GRO engine a bit longer, hoping that further
>>> frames might be added to them.
>>>
>>> Since then, we added the napi_complete_done() interface, and commit
>>> 364b6055738b ("net: busy-poll: return busypolling status to drivers")
>>> allowed drivers to avoid re-arming NIC interrupts if we made a promise
>>> that their NAPI poll() handler would be called in the near future.
>>>
>>> This infrastructure can be leveraged, thanks to a new device parameter,
>>> which allows to arm the napi hrtimer, instead of re-arming the device
>>> hard IRQ.
>>>
>>> We have noticed that on some servers with 32 RX queues or more, the chit-chat
>>> between the NIC and the host caused by IRQ delivery and re-arming could hurt
>>> throughput by ~20% on 100Gbit NIC.
>>>
>>> In contrast, hrtimers are using local (percpu) resources and might have lower
>>> cost.
>>>
>>> The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
>>> than gro_flush_timeout (/sys/class/net/ethX/)
>>>
>>
>> Hi Eric,
>> could you please add some Documentation for this new sysfs tunable? Thanks!
>> Looks like gro_flush_timeout is missing the same :).
> 
> 
> Yes. I was planning adding this in
> Documentation/networking/scaling.rst, once our fires are extinguished.
> 
>>
>>
>>> By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
>>>
>>> This patch does not change the prior behavior of gro_flush_timeout
>>> if used alone : NIC hard irqs should be rearmed as before.
>>>
>>> One concrete usage can be :
>>>
>>> echo 20000 >/sys/class/net/eth1/gro_flush_timeout
>>> echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
>>>
>>> If at least one packet is retired, then we will reset napi counter
>>> to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
>>> of the queue.
>>>
>>> On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
>>> avoidance was only possible if napi->poll() was exhausting its budget
>>> and not call napi_complete_done().
>>>
>>
>> I was confused here for a second, so let me just clarify how this is intended
>> to look like for pure TX completion IRQs:
>>
>> napi->poll() calls napi_complete_done() with an accurate work_done value, but
>> then still returns 0 because TX completion work doesn't consume NAPI budget.
> 
> 
> If the napi budget was consumed, the driver does _not_ call
> napi_complete() or napi_complete_done() anyway.
> 

I was thinking of "TX completions are cheap and don't consume _any_ NAPI budget, ever"
as the current consensus, but looking at the mlx4 code that evidently isn't true
for all drivers.

> If the budget is consumed, then napi_complete_done(napi, X>0) allows
> napi_complete_done()
> to return 0 if napi_defer_hard_irqs is not 0
> 
> This means that the NIC hard irq will stay disabled for at least one more round.
>


* Re: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
  2020-05-02 16:10       ` Julian Wiedmann
@ 2020-05-02 16:24         ` Eric Dumazet
  2020-05-02 23:45           ` David Miller
  2020-05-04 15:25           ` Julian Wiedmann
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2020-05-02 16:24 UTC (permalink / raw)
  To: Julian Wiedmann, Eric Dumazet
  Cc: David S . Miller, netdev, Luigi Rizzo, Eric Dumazet



On 5/2/20 9:10 AM, Julian Wiedmann wrote:
> On 02.05.20 17:40, Eric Dumazet wrote:
>> On Sat, May 2, 2020 at 7:56 AM Julian Wiedmann <jwi@linux.ibm.com> wrote:
>>>
>>> On 22.04.20 18:13, Eric Dumazet wrote:
>>>> Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
>>>> we added the ability to arm one high resolution timer, that we used
>>>> to keep not-complete packets in GRO engine a bit longer, hoping that further
>>>> frames might be added to them.
>>>>
>>>> Since then, we added the napi_complete_done() interface, and commit
>>>> 364b6055738b ("net: busy-poll: return busypolling status to drivers")
>>>> allowed drivers to avoid re-arming NIC interrupts if we made a promise
>>>> that their NAPI poll() handler would be called in the near future.
>>>>
>>>> This infrastructure can be leveraged, thanks to a new device parameter,
>>>> which allows to arm the napi hrtimer, instead of re-arming the device
>>>> hard IRQ.
>>>>
>>>> We have noticed that on some servers with 32 RX queues or more, the chit-chat
>>>> between the NIC and the host caused by IRQ delivery and re-arming could hurt
>>>> throughput by ~20% on 100Gbit NIC.
>>>>
>>>> In contrast, hrtimers are using local (percpu) resources and might have lower
>>>> cost.
>>>>
>>>> The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
>>>> than gro_flush_timeout (/sys/class/net/ethX/)
>>>>
>>>
>>> Hi Eric,
>>> could you please add some Documentation for this new sysfs tunable? Thanks!
>>> Looks like gro_flush_timeout is missing the same :).
>>
>>
>> Yes. I was planning adding this in
>> Documentation/networking/scaling.rst, once our fires are extinguished.
>>
>>>
>>>
>>>> By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
>>>>
>>>> This patch does not change the prior behavior of gro_flush_timeout
>>>> if used alone : NIC hard irqs should be rearmed as before.
>>>>
>>>> One concrete usage can be :
>>>>
>>>> echo 20000 >/sys/class/net/eth1/gro_flush_timeout
>>>> echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
>>>>
>>>> If at least one packet is retired, then we will reset napi counter
>>>> to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
>>>> of the queue.
>>>>
>>>> On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
>>>> avoidance was only possible if napi->poll() was exhausting its budget
>>>> and not call napi_complete_done().
>>>>
>>>
>>> I was confused here for a second, so let me just clarify how this is intended
>>> to look like for pure TX completion IRQs:
>>>
>>> napi->poll() calls napi_complete_done() with an accurate work_done value, but
>>> then still returns 0 because TX completion work doesn't consume NAPI budget.
>>
>>
>> If the napi budget was consumed, the driver does _not_ call
>> napi_complete() or napi_complete_done() anyway.
>>
> 
> I was thinking of "TX completions are cheap and don't consume _any_ NAPI budget, ever"
> as the current consensus, but looking at the mlx4 code that evidently isn't true
> for all drivers.

TX completions are not cheap in many cases.

Doing the unmap stuff can be costly in an IOMMU world, and freeing skbs
can also be expensive.
Add to this that the TCP stack might be called back (via skb->destructor()) to add more packets to the qdisc/device.

So effectively using the budget as a limit might help in some stress situations,
by not re-enabling NIC interrupts, even before the napi_defer_hard_irqs addition.

> 
>> If the budget is consumed, then napi_complete_done(napi, X>0) allows
>> napi_complete_done()
>> to return 0 if napi_defer_hard_irqs is not 0
>>
>> This means that the NIC hard irq will stay disabled for at least one more round.
>>


* Re: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
  2020-05-02 16:24         ` Eric Dumazet
@ 2020-05-02 23:45           ` David Miller
  2020-05-04 15:25           ` Julian Wiedmann
  1 sibling, 0 replies; 12+ messages in thread
From: David Miller @ 2020-05-02 23:45 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jwi, edumazet, netdev, lrizzo

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 2 May 2020 09:24:19 -0700

> Doing the unmap stuff can be costly in an IOMMU world, and freeing the skb
> can also be expensive.  Add to this that the TCP stack might be called
> back (via skb->destructor()) to add more packets to the
> qdisc/device.
> 
> So effectively using the budget as a limit might help in some stress
> situations, by not re-enabling NIC interrupts, even before the
> napi_defer_hard_irqs addition.

Even with this logic, TX processing should only consume some fraction of
the budget compared to RX processing.  Even if TX work isn't free, it's
much cheaper than RX.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
  2020-05-02 16:24         ` Eric Dumazet
  2020-05-02 23:45           ` David Miller
@ 2020-05-04 15:25           ` Julian Wiedmann
  2020-05-04 15:33             ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Julian Wiedmann @ 2020-05-04 15:25 UTC (permalink / raw)
  To: Eric Dumazet, Eric Dumazet; +Cc: David S . Miller, netdev, Luigi Rizzo

On 02.05.20 18:24, Eric Dumazet wrote:
> 
> 
> On 5/2/20 9:10 AM, Julian Wiedmann wrote:
>> On 02.05.20 17:40, Eric Dumazet wrote:
>>> On Sat, May 2, 2020 at 7:56 AM Julian Wiedmann <jwi@linux.ibm.com> wrote:
>>>>
>>>> On 22.04.20 18:13, Eric Dumazet wrote:

[...]

>>>>
>>>>
>>>>> By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
>>>>>
>>>>> This patch does not change the prior behavior of gro_flush_timeout
>>>>> if used alone : NIC hard irqs should be rearmed as before.
>>>>>
>>>>> One concrete usage can be :
>>>>>
>>>>> echo 20000 >/sys/class/net/eth1/gro_flush_timeout
>>>>> echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
>>>>>
>>>>> If at least one packet is retired, then we will reset napi counter
>>>>> to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
>>>>> of the queue.
>>>>>
>>>>> On busy queues, this should avoid NIC hard IRQs, while before this patch IRQ
>>>>> avoidance was only possible if napi->poll() was exhausting its budget
>>>>> and not calling napi_complete_done().
>>>>>
>>>>
>>>> I was confused here for a second, so let me just clarify how this is intended
>>>> to look for pure TX completion IRQs:
>>>>
>>>> napi->poll() calls napi_complete_done() with an accurate work_done value, but
>>>> then still returns 0 because TX completion work doesn't consume NAPI budget.
>>>
>>>
>>> If the napi budget was consumed, the driver does _not_ call
>>> napi_complete() or napi_complete_done() anyway.
>>>
>>
>> I was thinking of "TX completions are cheap and don't consume _any_ NAPI budget, ever"
>> as the current consensus, but looking at the mlx4 code that evidently isn't true
>> for all drivers.
> 
> TX completions are not cheap in many cases.
> 
> Doing the unmap stuff can be costly in an IOMMU world, and freeing the skb
> can also be expensive.
> Add to this that the TCP stack might be called back (via skb->destructor()) to add more packets to the qdisc/device.
> 
> So effectively using the budget as a limit might help in some stress situations,
> by not re-enabling NIC interrupts, even before the napi_defer_hard_irqs addition.
> 

Neat, thanks for sharing this. Now I also see the tricks that mlx4 plays to still
get netpoll working.... fun.

>>
>>> If the budget is not consumed, then napi_complete_done(napi, X>0)
>>> is allowed to return 0 if napi_defer_hard_irqs is not 0.
>>>
>>> This means that the NIC hard irq will stay disabled for at least one more round.
>>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
  2020-05-04 15:25           ` Julian Wiedmann
@ 2020-05-04 15:33             ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2020-05-04 15:33 UTC (permalink / raw)
  To: Julian Wiedmann; +Cc: Eric Dumazet, David S . Miller, netdev, Luigi Rizzo

On Mon, May 4, 2020 at 8:28 AM Julian Wiedmann <jwi@linux.ibm.com> wrote:
>
> On 02.05.20 18:24, Eric Dumazet wrote:
> >
> >
> > On 5/2/20 9:10 AM, Julian Wiedmann wrote:
> >> On 02.05.20 17:40, Eric Dumazet wrote:
> >>> On Sat, May 2, 2020 at 7:56 AM Julian Wiedmann <jwi@linux.ibm.com> wrote:
> >>>>
> >>>> On 22.04.20 18:13, Eric Dumazet wrote:
>
> [...]
>
> >>>>
> >>>>
> >>>>> By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
> >>>>>
> >>>>> This patch does not change the prior behavior of gro_flush_timeout
> >>>>> if used alone : NIC hard irqs should be rearmed as before.
> >>>>>
> >>>>> One concrete usage can be :
> >>>>>
> >>>>> echo 20000 >/sys/class/net/eth1/gro_flush_timeout
> >>>>> echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
> >>>>>
> >>>>> If at least one packet is retired, then we will reset napi counter
> >>>>> to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
> >>>>> of the queue.
> >>>>>
> >>>>> On busy queues, this should avoid NIC hard IRQs, while before this patch IRQ
> >>>>> avoidance was only possible if napi->poll() was exhausting its budget
> >>>>> and not calling napi_complete_done().
> >>>>>
> >>>>
> >>>> I was confused here for a second, so let me just clarify how this is intended
> >>>> to look for pure TX completion IRQs:
> >>>>
> >>>> napi->poll() calls napi_complete_done() with an accurate work_done value, but
> >>>> then still returns 0 because TX completion work doesn't consume NAPI budget.
> >>>
> >>>
> >>> If the napi budget was consumed, the driver does _not_ call
> >>> napi_complete() or napi_complete_done() anyway.
> >>>
> >>
> >> I was thinking of "TX completions are cheap and don't consume _any_ NAPI budget, ever"
> >> as the current consensus, but looking at the mlx4 code that evidently isn't true
> >> for all drivers.
> >
> > TX completions are not cheap in many cases.
> >
> > Doing the unmap stuff can be costly in an IOMMU world, and freeing the skb
> > can also be expensive.
> > Add to this that the TCP stack might be called back (via skb->destructor()) to add more packets to the qdisc/device.
> > 
> > So effectively using the budget as a limit might help in some stress situations,
> > by not re-enabling NIC interrupts, even before the napi_defer_hard_irqs addition.
> >
>
> Neat, thanks for sharing this. Now I also see the tricks that mlx4 plays to still
> get netpoll working.... fun.
>

This is generic napi stuff :)

https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf

Driver authors are welcome to adapt their code, if not already updated.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-05-04 15:34 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-22 16:13 [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs Eric Dumazet
2020-04-22 16:13 ` [PATCH net-next 1/3] net: napi: add hard irqs deferral feature Eric Dumazet
2020-05-02 14:56   ` Julian Wiedmann
2020-05-02 15:40     ` Eric Dumazet
2020-05-02 16:10       ` Julian Wiedmann
2020-05-02 16:24         ` Eric Dumazet
2020-05-02 23:45           ` David Miller
2020-05-04 15:25           ` Julian Wiedmann
2020-05-04 15:33             ` Eric Dumazet
2020-04-22 16:13 ` [PATCH net-next 2/3] net: napi: use READ_ONCE()/WRITE_ONCE() Eric Dumazet
2020-04-22 16:13 ` [PATCH net-next 3/3] net/mlx4_en: use napi_complete_done() in TX completion Eric Dumazet
2020-04-23 19:43 ` [PATCH net-next 0/3] net: napi: addition of napi_defer_hard_irqs David Miller
