netdev.vger.kernel.org archive mirror
* [PATCH net-next v6 0/3] implement kthread based napi poll
@ 2021-01-15  0:31 Wei Wang
  2021-01-15  0:31 ` [PATCH net-next v6 1/3] net: extract napi poll functionality to __napi_poll() Wei Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Wei Wang @ 2021-01-15  0:31 UTC (permalink / raw)
  To: David Miller, netdev, Jakub Kicinski
  Cc: Eric Dumazet, Paolo Abeni, Hannes Frederic Sowa, Felix Fietkau

The idea of moving the napi poll process out of softirq context into a
kernel thread based context is not new.
Paolo Abeni and Hannes Frederic Sowa proposed patches to move napi
poll into a kthread back in 2016, and Felix Fietkau proposed patches
with a similar idea, using a workqueue to process napi poll, just a
few weeks ago.

The main reason we would like to push forward with this idea is that
the scheduler has poor visibility into cpu cycles spent in softirq
context, and is not able to make optimal scheduling decisions for user
threads. For example, in one of our application benchmarks with high
network load, the CPUs handling network softirqs reach ~80% cpu
utilization, yet user threads are still scheduled on those CPUs even
though more idle cpus are available in the system, and we see very
high tail latencies. In this case, we have to explicitly pin user
threads away from the CPUs handling network softirqs to ensure good
performance.
With napi poll moved to a kthread, the scheduler is in charge of
scheduling both the kthreads handling network load and the user
threads, and is able to make better decisions. In the previous
benchmark, if we pin the kthreads processing napi poll to specific
CPUs, the scheduler is able to schedule user threads away from these
CPUs automatically.

The reason we prefer 1 kthread per napi, instead of 1 workqueue
entity per host, is that a kthread is more configurable than a
workqueue, and we can leverage existing tuning tools for threads, such
as taskset and chrt, to adjust the scheduling class, cpu set, etc.
Another reason is that if we eventually want to provide a busy poll
feature using kernel threads for napi poll, a kthread seems more
suitable than a workqueue. Furthermore, for large platforms with 2
NICs attached to 2 sockets, kthreads are more flexible to pin to
different sets of CPUs.
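Since the napi kthreads show up as ordinary tasks, standard tooling can
tune them. Below is a minimal sketch (not part of this series; the
helper name, the SCHED_FIFO priority, and the device name are purely
illustrative) that pins every napi kthread of a device, relying on the
napi/<dev>-<napi_id> thread naming introduced in patch 2:

```shell
# Hypothetical helper: pin all napi kthreads of device $1 to the cpu
# list $2 and give them a real-time priority. Relies on the
# napi/<dev>-<napi_id> thread naming from this patch series.
pin_napi_kthreads() {
    dev="$1"
    cpus="$2"
    for pid in $(pgrep -f "napi/${dev}-"); do
        taskset -pc "$cpus" "$pid"   # restrict the thread's cpu set
        chrt -f -p 50 "$pid"         # SCHED_FIFO prio 50 (illustrative)
    done
}

# Example (hypothetical device): pin eth0's pollers to CPUs 2-3
# pin_napi_kthreads eth0 2-3
```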

In this patch series, I revived Paolo and Hannes's patch from 2016 and
made modifications, with further changes proposed by Felix, Jakub,
Paolo and myself on top, and suggestions from Eric Dumazet.

In terms of performance, I ran tcp_rr tests with 1000 flows and
various request/response sizes, with RFS/RPS disabled, and compared
performance between softirq, kthread, and workqueue (the patchset
proposed by Felix Fietkau).
The host has 56 hyperthreads, a 100Gbps nic with 8 rx queues, and
only 1 numa node. All threads are unpinned.

        req/resp   QPS   50%tile    90%tile    99%tile    99.9%tile
softirq   1B/1B   2.75M   337us       376us      1.04ms     3.69ms
kthread   1B/1B   2.67M   371us       408us      455us      550us
workq     1B/1B   2.56M   384us       435us      673us      822us

softirq 5KB/5KB   1.46M   678us       750us      969us      2.78ms
kthread 5KB/5KB   1.44M   695us       789us      891us      1.06ms
workq   5KB/5KB   1.34M   720us       905us     1.06ms      1.57ms

softirq 1MB/1MB   11.0K   79ms       166ms      306ms       630ms
kthread 1MB/1MB   11.0K   75ms       177ms      303ms       596ms
workq   1MB/1MB   11.0K   79ms       180ms      303ms       587ms

When running the workqueue implementation, I found that the number of
threads used is usually about twice that of the kthread
implementation. This probably introduces higher scheduling cost, which
results in higher tail latencies in most cases.

I also ran an application benchmark, which performs fixed-qps remote
SSD read/write operations with various sizes. Again, RFS/RPS is
disabled.
The result is as follows:
         op_size  QPS   50%tile 95%tile 99%tile 99.9%tile  
softirq   4K     572.6K   385us   1.5ms  3.16ms   6.41ms
kthread   4K     572.6K   390us   803us  2.21ms   6.83ms
workq     4K     572.6K   384us   763us  3.12ms   6.87ms

softirq   64K    157.9K   736us   1.17ms 3.40ms   13.75ms
kthread   64K    157.9K   745us   1.23ms 2.76ms    9.87ms 
workq     64K    157.9K   746us   1.23ms 2.76ms    9.96ms

softirq   1M     10.98K   2.03ms  3.10ms  3.7ms   11.56ms
kthread   1M     10.98K   2.13ms  3.21ms  4.02ms  13.3ms
workq     1M     10.98K   2.13ms  3.20ms  3.99ms  14.12ms

In this set of tests, the latency is dominated by the SSD operations.
Also, the user threads are much busier than in the tcp_rr tests. We
have to pin the kthreads/workqueue threads to a few CPUs so as not to
disturb the user threads, and to provide some isolation.

Changes since v5:
Removed ASSERT_RTNL() from napi_set_threaded() and removed rtnl_lock()
operation from napi_enable().

Changes since v4:
Recorded the threaded setting in dev and restore it in napi_enable().

Changes since v3:
Merged and rearranged patches in a logical order for easier review. 
Changed sysfs control to be per device.

Changes since v2:
Corrected typo in patch 1, and updated the cover letter with more
detailed and updated test results.

Changes since v1:
Replaced kthread_create() with kthread_run() in patch 5 as suggested by
Felix Fietkau.

Changes since RFC:
Renamed the kthreads to be napi/<dev>-<napi_id> in patch 5 as suggested
by Hannes Frederic Sowa.

Felix Fietkau (1):
  net: extract napi poll functionality to __napi_poll()

Wei Wang (2):
  net: implement threaded-able napi poll loop support
  net: add sysfs attribute to control napi threaded mode

 include/linux/netdevice.h |  14 +--
 net/core/dev.c            | 176 +++++++++++++++++++++++++++++++++++---
 net/core/net-sysfs.c      |  63 ++++++++++++++
 3 files changed, 236 insertions(+), 17 deletions(-)

-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net-next v6 1/3] net: extract napi poll functionality to __napi_poll()
  2021-01-15  0:31 [PATCH net-next v6 0/3] implement kthread based napi poll Wei Wang
@ 2021-01-15  0:31 ` Wei Wang
  2021-01-15  0:31 ` [PATCH net-next v6 2/3] net: implement threaded-able napi poll loop support Wei Wang
  2021-01-15  0:31 ` [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode Wei Wang
  2 siblings, 0 replies; 13+ messages in thread
From: Wei Wang @ 2021-01-15  0:31 UTC (permalink / raw)
  To: David Miller, netdev, Jakub Kicinski
  Cc: Eric Dumazet, Paolo Abeni, Hannes Frederic Sowa, Felix Fietkau

From: Felix Fietkau <nbd@nbd.name> 

This commit introduces a new function __napi_poll() which contains the
main logic of the existing napi_poll() function, and which will be
called by other functions in later commits.
The idea and implementation are by Felix Fietkau <nbd@nbd.name>,
proposed as part of his patch to move napi work to workqueue context.
By itself, this commit is a pure code restructure.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 net/core/dev.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index e4d77c8abe76..83b59e4c0f37 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6772,15 +6772,10 @@ void __netif_napi_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(__netif_napi_del);
 
-static int napi_poll(struct napi_struct *n, struct list_head *repoll)
+static int __napi_poll(struct napi_struct *n, bool *repoll)
 {
-	void *have;
 	int work, weight;
 
-	list_del_init(&n->poll_list);
-
-	have = netpoll_poll_lock(n);
-
 	weight = n->weight;
 
 	/* This NAPI_STATE_SCHED test is for avoiding a race
@@ -6800,7 +6795,7 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 			    n->poll, work, weight);
 
 	if (likely(work < weight))
-		goto out_unlock;
+		return work;
 
 	/* Drivers must not modify the NAPI state if they
 	 * consume the entire weight.  In such cases this code
@@ -6809,7 +6804,7 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 	 */
 	if (unlikely(napi_disable_pending(n))) {
 		napi_complete(n);
-		goto out_unlock;
+		return work;
 	}
 
 	/* The NAPI context has more processing work, but busy-polling
@@ -6822,7 +6817,7 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 			 */
 			napi_schedule(n);
 		}
-		goto out_unlock;
+		return work;
 	}
 
 	if (n->gro_bitmask) {
@@ -6840,9 +6835,29 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 	if (unlikely(!list_empty(&n->poll_list))) {
 		pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
 			     n->dev ? n->dev->name : "backlog");
-		goto out_unlock;
+		return work;
 	}
 
+	*repoll = true;
+
+	return work;
+}
+
+static int napi_poll(struct napi_struct *n, struct list_head *repoll)
+{
+	bool do_repoll = false;
+	void *have;
+	int work;
+
+	list_del_init(&n->poll_list);
+
+	have = netpoll_poll_lock(n);
+
+	work = __napi_poll(n, &do_repoll);
+
+	if (!do_repoll)
+		goto out_unlock;
+
 	list_add_tail(&n->poll_list, repoll);
 
 out_unlock:
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH net-next v6 2/3] net: implement threaded-able napi poll loop support
  2021-01-15  0:31 [PATCH net-next v6 0/3] implement kthread based napi poll Wei Wang
  2021-01-15  0:31 ` [PATCH net-next v6 1/3] net: extract napi poll functionality to __napi_poll() Wei Wang
@ 2021-01-15  0:31 ` Wei Wang
  2021-01-15  3:14   ` Alexander Duyck
  2021-01-15  0:31 ` [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode Wei Wang
  2 siblings, 1 reply; 13+ messages in thread
From: Wei Wang @ 2021-01-15  0:31 UTC (permalink / raw)
  To: David Miller, netdev, Jakub Kicinski
  Cc: Eric Dumazet, Paolo Abeni, Hannes Frederic Sowa, Felix Fietkau

This patch allows running each napi poll loop inside its own
kernel thread.
Threaded mode can be enabled through the napi_set_threaded()
api and does not require a device up/down. The kthread gets
created on demand when napi_set_threaded() is called, and is
eventually shut down in napi_disable().

Once threaded mode is enabled and the kthread is started,
napi_schedule() will wake up that thread instead of scheduling
the softirq.

The threaded poll loop behaves much like net_rx_action, but it
does not have to manipulate local irqs and uses an explicit
scheduling point based on netdev_budget.
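For readers who want to see the control flow in isolation, here is a
userspace analog of the wait/schedule/repoll pattern described above,
sketched in plain C with pthreads. The mini_* names are invented for
illustration only, and a condition variable stands in for the kernel's
wake_up_process()/TASK_INTERRUPTIBLE handshake; this is not kernel code.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mini_napi {
	atomic_bool scheduled;	/* analog of NAPI_STATE_SCHED */
	atomic_bool stop;	/* analog of kthread_should_stop() */
	atomic_int work;	/* packets "polled" so far */
	pthread_mutex_t lock;
	pthread_cond_t cond;
	pthread_t thread;
};

/* stand-in driver poll: handle one packet, ask to repoll until 5 done */
static bool mini_poll(struct mini_napi *n)
{
	return atomic_fetch_add(&n->work, 1) + 1 < 5;
}

static void *mini_thread(void *arg)
{
	struct mini_napi *n = arg;

	for (;;) {
		/* napi_thread_wait() analog: sleep until scheduled/stopped */
		pthread_mutex_lock(&n->lock);
		while (!atomic_load(&n->scheduled) && !atomic_load(&n->stop))
			pthread_cond_wait(&n->cond, &n->lock);
		pthread_mutex_unlock(&n->lock);

		if (atomic_load(&n->stop))
			return NULL;

		/* napi_threaded_poll() analog: repoll with an explicit
		 * scheduling point between rounds
		 */
		while (mini_poll(n))
			sched_yield();

		atomic_store(&n->scheduled, false);
	}
}

/* ____napi_schedule() analog: wake the poller, don't raise a softirq */
static void mini_schedule(struct mini_napi *n)
{
	pthread_mutex_lock(&n->lock);
	atomic_store(&n->scheduled, true);
	pthread_cond_signal(&n->cond);
	pthread_mutex_unlock(&n->lock);
}

/* run one schedule/poll/stop cycle; returns number of packets polled */
static int mini_run(void)
{
	struct mini_napi n = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.cond = PTHREAD_COND_INITIALIZER,
	};

	pthread_create(&n.thread, NULL, mini_thread, &n);
	mini_schedule(&n);

	while (atomic_load(&n.work) < 5)	/* wait for poll to finish */
		sched_yield();

	pthread_mutex_lock(&n.lock);
	atomic_store(&n.stop, true);
	pthread_cond_signal(&n.cond);
	pthread_mutex_unlock(&n.lock);
	pthread_join(n.thread, NULL);

	return atomic_load(&n.work);
}
```

The predicate loop around pthread_cond_wait() mirrors the
set_current_state()/schedule() loop in napi_thread_wait(): both
re-check the "scheduled" and "stop" conditions after every wakeup.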

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 include/linux/netdevice.h |  12 ++--
 net/core/dev.c            | 113 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5b949076ed23..c24ed232c746 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -347,6 +347,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+	struct task_struct	*thread;
 };
 
 enum {
@@ -358,6 +359,7 @@ enum {
 	NAPI_STATE_NO_BUSY_POLL,	/* Do not add in napi_hash, no busy polling */
 	NAPI_STATE_IN_BUSY_POLL,	/* sk_busy_loop() owns this NAPI */
 	NAPI_STATE_PREFER_BUSY_POLL,	/* prefer busy-polling over softirq processing*/
+	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
 };
 
 enum {
@@ -369,6 +371,7 @@ enum {
 	NAPIF_STATE_NO_BUSY_POLL	= BIT(NAPI_STATE_NO_BUSY_POLL),
 	NAPIF_STATE_IN_BUSY_POLL	= BIT(NAPI_STATE_IN_BUSY_POLL),
 	NAPIF_STATE_PREFER_BUSY_POLL	= BIT(NAPI_STATE_PREFER_BUSY_POLL),
+	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -510,13 +513,7 @@ void napi_disable(struct napi_struct *n);
  * Resume NAPI from being scheduled on this context.
  * Must be paired with napi_disable.
  */
-static inline void napi_enable(struct napi_struct *n)
-{
-	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
-	smp_mb__before_atomic();
-	clear_bit(NAPI_STATE_SCHED, &n->state);
-	clear_bit(NAPI_STATE_NPSVC, &n->state);
-}
+void napi_enable(struct napi_struct *n);
 
 /**
  *	napi_synchronize - wait until NAPI is not running
@@ -2140,6 +2137,7 @@ struct net_device {
 	struct lock_class_key	*qdisc_tx_busylock;
 	struct lock_class_key	*qdisc_running_key;
 	bool			proto_down;
+	bool			threaded;
 	unsigned		wol_enabled:1;
 
 	struct list_head	net_notifier_list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 83b59e4c0f37..edcfec1361e9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -91,6 +91,7 @@
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
 #include <linux/skbuff.h>
+#include <linux/kthread.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include <net/net_namespace.h>
@@ -1493,6 +1494,36 @@ void netdev_notify_peers(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_notify_peers);
 
+static int napi_threaded_poll(void *data);
+
+static int napi_kthread_create(struct napi_struct *n)
+{
+	int err = 0;
+
+	/* Create and wake up the kthread once to put it in
+	 * TASK_INTERRUPTIBLE mode to avoid the blocked task
+	 * warning and work with loadavg.
+	 */
+	n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
+				n->dev->name, n->napi_id);
+	if (IS_ERR(n->thread)) {
+		err = PTR_ERR(n->thread);
+		pr_err("kthread_run failed with err %d\n", err);
+		n->thread = NULL;
+	}
+
+	return err;
+}
+
+static void napi_kthread_stop(struct napi_struct *n)
+{
+	if (!n->thread)
+		return;
+	kthread_stop(n->thread);
+	clear_bit(NAPI_STATE_THREADED, &n->state);
+	n->thread = NULL;
+}
+
 static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -4252,6 +4283,11 @@ int gro_normal_batch __read_mostly = 8;
 static inline void ____napi_schedule(struct softnet_data *sd,
 				     struct napi_struct *napi)
 {
+	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
+		wake_up_process(napi->thread);
+		return;
+	}
+
 	list_add_tail(&napi->poll_list, &sd->poll_list);
 	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
@@ -6697,6 +6733,27 @@ static void init_gro_hash(struct napi_struct *napi)
 	napi->gro_bitmask = 0;
 }
 
+static int napi_set_threaded(struct napi_struct *n, bool threaded)
+{
+	int err = 0;
+
+	if (threaded == !!test_bit(NAPI_STATE_THREADED, &n->state))
+		return 0;
+	if (threaded) {
+		if (!n->thread) {
+			err = napi_kthread_create(n);
+			if (err)
+				goto out;
+		}
+		set_bit(NAPI_STATE_THREADED, &n->state);
+	} else {
+		clear_bit(NAPI_STATE_THREADED, &n->state);
+	}
+
+out:
+	return err;
+}
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -6738,12 +6795,23 @@ void napi_disable(struct napi_struct *n)
 		msleep(1);
 
 	hrtimer_cancel(&n->timer);
+	napi_kthread_stop(n);
 
 	clear_bit(NAPI_STATE_PREFER_BUSY_POLL, &n->state);
 	clear_bit(NAPI_STATE_DISABLE, &n->state);
 }
 EXPORT_SYMBOL(napi_disable);
 
+void napi_enable(struct napi_struct *n)
+{
+	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
+	smp_mb__before_atomic();
+	clear_bit(NAPI_STATE_SCHED, &n->state);
+	clear_bit(NAPI_STATE_NPSVC, &n->state);
+	WARN_ON(napi_set_threaded(n, n->dev->threaded));
+}
+EXPORT_SYMBOL(napi_enable);
+
 static void flush_gro_hash(struct napi_struct *napi)
 {
 	int i;
@@ -6866,6 +6934,51 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 	return work;
 }
 
+static int napi_thread_wait(struct napi_struct *napi)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	while (!kthread_should_stop() && !napi_disable_pending(napi)) {
+		if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+			WARN_ON(!list_empty(&napi->poll_list));
+			__set_current_state(TASK_RUNNING);
+			return 0;
+		}
+
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+	__set_current_state(TASK_RUNNING);
+	return -1;
+}
+
+static int napi_threaded_poll(void *data)
+{
+	struct napi_struct *napi = data;
+	void *have;
+
+	while (!napi_thread_wait(napi)) {
+		for (;;) {
+			bool repoll = false;
+
+			local_bh_disable();
+
+			have = netpoll_poll_lock(napi);
+			__napi_poll(napi, &repoll);
+			netpoll_poll_unlock(have);
+
+			__kfree_skb_flush();
+			local_bh_enable();
+
+			if (!repoll)
+				break;
+
+			cond_resched();
+		}
+	}
+	return 0;
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode
  2021-01-15  0:31 [PATCH net-next v6 0/3] implement kthread based napi poll Wei Wang
  2021-01-15  0:31 ` [PATCH net-next v6 1/3] net: extract napi poll functionality to __napi_poll() Wei Wang
  2021-01-15  0:31 ` [PATCH net-next v6 2/3] net: implement threaded-able napi poll loop support Wei Wang
@ 2021-01-15  0:31 ` Wei Wang
  2021-01-15  3:34   ` Alexander Duyck
  2 siblings, 1 reply; 13+ messages in thread
From: Wei Wang @ 2021-01-15  0:31 UTC (permalink / raw)
  To: David Miller, netdev, Jakub Kicinski
  Cc: Eric Dumazet, Paolo Abeni, Hannes Frederic Sowa, Felix Fietkau

This patch adds a new sysfs attribute to the network device class.
Said attribute provides a per-device control to enable/disable the
threaded mode for all the napi instances of the given network device.
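As a usage sketch (the device name is an example; the attribute path
follows from the "threaded" attribute this patch adds to the net
class), threaded mode can be toggled and read back like this:

```shell
# Hypothetical helper: toggle threaded napi for a device via the sysfs
# attribute added in this patch, then read it back. The read reports 1
# only if every napi instance of the device switched successfully.
set_napi_threaded() {
    dev="$1"
    val="$2"
    attr="/sys/class/net/${dev}/threaded"
    [ -e "$attr" ] || { echo "no threaded attribute for ${dev}" >&2; return 1; }
    echo "$val" > "$attr"
    cat "$attr"
}

# Example (hypothetical device):
# set_napi_threaded eth0 1
```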

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c            | 28 +++++++++++++++++
 net/core/net-sysfs.c      | 63 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 93 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c24ed232c746..11ae0c9b9350 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -497,6 +497,8 @@ static inline bool napi_complete(struct napi_struct *n)
 	return napi_complete_done(n, 0);
 }
 
+int dev_set_threaded(struct net_device *dev, bool threaded);
+
 /**
  *	napi_disable - prevent NAPI from scheduling
  *	@n: NAPI context
diff --git a/net/core/dev.c b/net/core/dev.c
index edcfec1361e9..d5fb95316ea8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6754,6 +6754,34 @@ static int napi_set_threaded(struct napi_struct *n, bool threaded)
 	return err;
 }
 
+static void dev_disable_threaded_all(struct net_device *dev)
+{
+	struct napi_struct *napi;
+
+	list_for_each_entry(napi, &dev->napi_list, dev_list)
+		napi_set_threaded(napi, false);
+}
+
+int dev_set_threaded(struct net_device *dev, bool threaded)
+{
+	struct napi_struct *napi;
+	int ret;
+
+	dev->threaded = threaded;
+	list_for_each_entry(napi, &dev->napi_list, dev_list) {
+		ret = napi_set_threaded(napi, threaded);
+		if (ret) {
+			/* Error occurred on one of the napi,
+			 * reset threaded mode on all napi.
+			 */
+			dev_disable_threaded_all(dev);
+			break;
+		}
+	}
+
+	return ret;
+}
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index daf502c13d6d..2017f8f07b8d 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -538,6 +538,68 @@ static ssize_t phys_switch_id_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(phys_switch_id);
 
+static ssize_t threaded_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct napi_struct *n;
+	bool enabled;
+	int ret;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	if (!dev_isalive(netdev)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	if (list_empty(&netdev->napi_list)) {
+		ret = -EOPNOTSUPP;
+		goto unlock;
+	}
+
+	/* Only return true if all napi have threaded mode.
+	 * The inconsistency could happen when the device driver calls
+	 * napi_disable()/napi_enable() with dev->threaded set to true,
+	 * but napi_kthread_create() fails.
+	 * We return false in this case to remind the user that one or
+	 * more napi did not have threaded mode enabled properly.
+	 */
+	list_for_each_entry(n, &netdev->napi_list, dev_list) {
+		enabled = !!test_bit(NAPI_STATE_THREADED, &n->state);
+		if (!enabled)
+			break;
+	}
+
+	ret = sprintf(buf, fmt_dec, enabled);
+
+unlock:
+	rtnl_unlock();
+	return ret;
+}
+
+static int modify_napi_threaded(struct net_device *dev, unsigned long val)
+{
+	struct napi_struct *napi;
+	int ret;
+
+	if (list_empty(&dev->napi_list))
+		return -EOPNOTSUPP;
+
+	ret = dev_set_threaded(dev, !!val);
+
+	return ret;
+}
+
+static ssize_t threaded_store(struct device *dev,
+			      struct device_attribute *attr,
+			      const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, modify_napi_threaded);
+}
+static DEVICE_ATTR_RW(threaded);
+
 static struct attribute *net_class_attrs[] __ro_after_init = {
 	&dev_attr_netdev_group.attr,
 	&dev_attr_type.attr,
@@ -570,6 +632,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
 	&dev_attr_proto_down.attr,
 	&dev_attr_carrier_up_count.attr,
 	&dev_attr_carrier_down_count.attr,
+	&dev_attr_threaded.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(net_class);
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* Re: [PATCH net-next v6 2/3] net: implement threaded-able napi poll loop support
  2021-01-15  0:31 ` [PATCH net-next v6 2/3] net: implement threaded-able napi poll loop support Wei Wang
@ 2021-01-15  3:14   ` Alexander Duyck
  2021-01-15  3:38     ` Alexander Duyck
       [not found]     ` <CAEA6p_BxQG7ObXG2Qf=fe5ja1hQr_9B=1_0XAn5w+Cf5+YByog@mail.gmail.com>
  0 siblings, 2 replies; 13+ messages in thread
From: Alexander Duyck @ 2021-01-15  3:14 UTC (permalink / raw)
  To: Wei Wang
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

On Thu, Jan 14, 2021 at 4:33 PM Wei Wang <weiwan@google.com> wrote:
>
> This patch allows running each napi poll loop inside its own
> kernel thread.
> Threaded mode can be enabled through the napi_set_threaded()
> api and does not require a device up/down. The kthread gets
> created on demand when napi_set_threaded() is called, and is
> eventually shut down in napi_disable().
>
> Once threaded mode is enabled and the kthread is started,
> napi_schedule() will wake up that thread instead of scheduling
> the softirq.
>
> The threaded poll loop behaves much like net_rx_action, but it
> does not have to manipulate local irqs and uses an explicit
> scheduling point based on netdev_budget.
>
> Co-developed-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Co-developed-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Wei Wang <weiwan@google.com>
> ---
>  include/linux/netdevice.h |  12 ++--
>  net/core/dev.c            | 113 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 118 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5b949076ed23..c24ed232c746 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -347,6 +347,7 @@ struct napi_struct {
>         struct list_head        dev_list;
>         struct hlist_node       napi_hash_node;
>         unsigned int            napi_id;
> +       struct task_struct      *thread;
>  };
>
>  enum {
> @@ -358,6 +359,7 @@ enum {
>         NAPI_STATE_NO_BUSY_POLL,        /* Do not add in napi_hash, no busy polling */
>         NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */
>         NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over softirq processing*/
> +       NAPI_STATE_THREADED,            /* The poll is performed inside its own thread*/
>  };
>
>  enum {
> @@ -369,6 +371,7 @@ enum {
>         NAPIF_STATE_NO_BUSY_POLL        = BIT(NAPI_STATE_NO_BUSY_POLL),
>         NAPIF_STATE_IN_BUSY_POLL        = BIT(NAPI_STATE_IN_BUSY_POLL),
>         NAPIF_STATE_PREFER_BUSY_POLL    = BIT(NAPI_STATE_PREFER_BUSY_POLL),
> +       NAPIF_STATE_THREADED            = BIT(NAPI_STATE_THREADED),
>  };
>
>  enum gro_result {
> @@ -510,13 +513,7 @@ void napi_disable(struct napi_struct *n);
>   * Resume NAPI from being scheduled on this context.
>   * Must be paired with napi_disable.
>   */
> -static inline void napi_enable(struct napi_struct *n)
> -{
> -       BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> -       smp_mb__before_atomic();
> -       clear_bit(NAPI_STATE_SCHED, &n->state);
> -       clear_bit(NAPI_STATE_NPSVC, &n->state);
> -}
> +void napi_enable(struct napi_struct *n);

If you are reducing the function to just a prototype it might make
sense to move the doxygen comments to the definition of the function
rather than leaving them here. That way when you change the
functionality it is easier to update the comments as it is all in the
same place.

>
>  /**
>   *     napi_synchronize - wait until NAPI is not running
> @@ -2140,6 +2137,7 @@ struct net_device {
>         struct lock_class_key   *qdisc_tx_busylock;
>         struct lock_class_key   *qdisc_running_key;
>         bool                    proto_down;
> +       bool                    threaded;
>         unsigned                wol_enabled:1;

I would prefer to see this done like the wol_enabled  as a bit field
or at least converted to a u8 instead of a bool. Last I knew we
weren't really supposed to use bool in structures since some
architectures will make it a 32b value.
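Alexander's sizeof point can be demonstrated with a small standalone C
sketch (the struct names are invented; the fields only mirror the
flags in net_device for illustration):

```c
#include <stdbool.h>

/* Each bool member occupies at least its own byte, and the following
 * int-based bitfield starts a new storage unit, so the flags spread
 * across several bytes of the struct.
 */
struct flags_with_bool {
	bool proto_down;
	bool threaded;
	unsigned wol_enabled:1;
};

/* As single-bit bitfields, all three flags pack into one storage
 * unit, which is what is being suggested for the new "threaded" flag.
 */
struct flags_with_bits {
	unsigned proto_down:1;
	unsigned threaded:1;
	unsigned wol_enabled:1;
};
```

On a typical LP64 ABI the bool variant is 8 bytes while the bitfield
variant is 4, though exact sizes are ABI-dependent.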

>         struct list_head        net_notifier_list;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 83b59e4c0f37..edcfec1361e9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -91,6 +91,7 @@
>  #include <linux/etherdevice.h>
>  #include <linux/ethtool.h>
>  #include <linux/skbuff.h>
> +#include <linux/kthread.h>
>  #include <linux/bpf.h>
>  #include <linux/bpf_trace.h>
>  #include <net/net_namespace.h>
> @@ -1493,6 +1494,36 @@ void netdev_notify_peers(struct net_device *dev)
>  }
>  EXPORT_SYMBOL(netdev_notify_peers);
>
> +static int napi_threaded_poll(void *data);
> +
> +static int napi_kthread_create(struct napi_struct *n)
> +{
> +       int err = 0;
> +
> +       /* Create and wake up the kthread once to put it in
> +        * TASK_INTERRUPTIBLE mode to avoid the blocked task
> +        * warning and work with loadavg.
> +        */
> +       n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
> +                               n->dev->name, n->napi_id);
> +       if (IS_ERR(n->thread)) {
> +               err = PTR_ERR(n->thread);
> +               pr_err("kthread_run failed with err %d\n", err);
> +               n->thread = NULL;
> +       }
> +
> +       return err;
> +}
> +
> +static void napi_kthread_stop(struct napi_struct *n)
> +{
> +       if (!n->thread)
> +               return;

I would add a blank line after this if statement.

> +       kthread_stop(n->thread);
> +       clear_bit(NAPI_STATE_THREADED, &n->state);
> +       n->thread = NULL;
> +}
> +
>  static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
>  {
>         const struct net_device_ops *ops = dev->netdev_ops;
> @@ -4252,6 +4283,11 @@ int gro_normal_batch __read_mostly = 8;
>  static inline void ____napi_schedule(struct softnet_data *sd,
>                                      struct napi_struct *napi)
>  {
> +       if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
> +               wake_up_process(napi->thread);
> +               return;
> +       }
> +
>         list_add_tail(&napi->poll_list, &sd->poll_list);
>         __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>  }
> @@ -6697,6 +6733,27 @@ static void init_gro_hash(struct napi_struct *napi)
>         napi->gro_bitmask = 0;
>  }
>
> +static int napi_set_threaded(struct napi_struct *n, bool threaded)
> +{
> +       int err = 0;
> +
> +       if (threaded == !!test_bit(NAPI_STATE_THREADED, &n->state))
> +               return 0;

A blank line between these two statements might be nice.

> +       if (threaded) {
> +               if (!n->thread) {
> +                       err = napi_kthread_create(n);
> +                       if (err)
> +                               goto out;
> +               }
> +               set_bit(NAPI_STATE_THREADED, &n->state);
> +       } else {
> +               clear_bit(NAPI_STATE_THREADED, &n->state);
> +       }

You could probably flatten this bit by doing:
if (!threaded) {
    clear_bit(NAPI_STATE_THREADED, &n->state);
    return 0;
}

Then you could just pull the rest of this out of the if statement and
flatten it a bit.

> +
> +out:
> +       return err;
> +}
> +
>  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>                     int (*poll)(struct napi_struct *, int), int weight)
>  {
> @@ -6738,12 +6795,23 @@ void napi_disable(struct napi_struct *n)
>                 msleep(1);
>
>         hrtimer_cancel(&n->timer);
> +       napi_kthread_stop(n);
>
>         clear_bit(NAPI_STATE_PREFER_BUSY_POLL, &n->state);
>         clear_bit(NAPI_STATE_DISABLE, &n->state);
>  }
>  EXPORT_SYMBOL(napi_disable);
>
> +void napi_enable(struct napi_struct *n)
> +{
> +       BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> +       smp_mb__before_atomic();
> +       clear_bit(NAPI_STATE_SCHED, &n->state);
> +       clear_bit(NAPI_STATE_NPSVC, &n->state);
> +       WARN_ON(napi_set_threaded(n, n->dev->threaded));

I am not sure what the point is in having a return value if you are
just using it to trigger a WARN_ON. It might make more sense to
actually set the WARN_ON inside of napi_set_threaded instead of having
it here as you could then identify the error much more easily. Or for
that matter you might be able to use something like pr_warn which
would allow you a more detailed message about the specific netdev that
experienced the failure.

> +}
> +EXPORT_SYMBOL(napi_enable);
> +
>  static void flush_gro_hash(struct napi_struct *napi)
>  {
>         int i;
> @@ -6866,6 +6934,51 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
>         return work;
>  }
>
> +static int napi_thread_wait(struct napi_struct *napi)
> +{
> +       set_current_state(TASK_INTERRUPTIBLE);
> +
> +       while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> +               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> +                       WARN_ON(!list_empty(&napi->poll_list));
> +                       __set_current_state(TASK_RUNNING);
> +                       return 0;
> +               }
> +
> +               schedule();
> +               set_current_state(TASK_INTERRUPTIBLE);
> +       }
> +       __set_current_state(TASK_RUNNING);
> +       return -1;
> +}
> +
> +static int napi_threaded_poll(void *data)
> +{
> +       struct napi_struct *napi = data;
> +       void *have;
> +
> +       while (!napi_thread_wait(napi)) {
> +               for (;;) {
> +                       bool repoll = false;
> +
> +                       local_bh_disable();
> +
> +                       have = netpoll_poll_lock(napi);
> +                       __napi_poll(napi, &repoll);
> +                       netpoll_poll_unlock(have);
> +

So it looks like v3 of this patch set was making use of the budget but
for v4 that went away and you had this. I was wondering why? One thing
that seems odd to me is that we are limiting the device to only
clearing one NAPI_POLL_WEIGHT of packets between flushing the freed
skbs and reenabling the bottom halves.

I'm wondering if it might make more sense to have a tunable like
netdev_budget that could be used for this to allow for more time
between flushes, bh enables, and cond_resched checks.

> +                       __kfree_skb_flush();
> +                       local_bh_enable();
> +
> +                       if (!repoll)
> +                               break;
> +
> +                       cond_resched();
> +               }
> +       }
> +       return 0;
> +}
> +
>  static __latent_entropy void net_rx_action(struct softirq_action *h)
>  {
>         struct softnet_data *sd = this_cpu_ptr(&softnet_data);
> --
> 2.30.0.284.gd98b1dd5eaa7-goog
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode
  2021-01-15  0:31 ` [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode Wei Wang
@ 2021-01-15  3:34   ` Alexander Duyck
  2021-01-15 21:54     ` Wei Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Duyck @ 2021-01-15  3:34 UTC (permalink / raw)
  To: Wei Wang
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

On Thu, Jan 14, 2021 at 4:34 PM Wei Wang <weiwan@google.com> wrote:
>
> This patch adds a new sysfs attribute to the network device class.
> Said attribute provides a per-device control to enable/disable the
> threaded mode for all the napi instances of the given network device.
>
> Co-developed-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Co-developed-by: Felix Fietkau <nbd@nbd.name>
> Signed-off-by: Felix Fietkau <nbd@nbd.name>
> Signed-off-by: Wei Wang <weiwan@google.com>
> ---
>  include/linux/netdevice.h |  2 ++
>  net/core/dev.c            | 28 +++++++++++++++++
>  net/core/net-sysfs.c      | 63 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 93 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index c24ed232c746..11ae0c9b9350 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -497,6 +497,8 @@ static inline bool napi_complete(struct napi_struct *n)
>         return napi_complete_done(n, 0);
>  }
>
> +int dev_set_threaded(struct net_device *dev, bool threaded);
> +
>  /**
>   *     napi_disable - prevent NAPI from scheduling
>   *     @n: NAPI context
> diff --git a/net/core/dev.c b/net/core/dev.c
> index edcfec1361e9..d5fb95316ea8 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6754,6 +6754,34 @@ static int napi_set_threaded(struct napi_struct *n, bool threaded)
>         return err;
>  }
>
> +static void dev_disable_threaded_all(struct net_device *dev)
> +{
> +       struct napi_struct *napi;
> +
> +       list_for_each_entry(napi, &dev->napi_list, dev_list)
> +               napi_set_threaded(napi, false);
> +}
> +
> +int dev_set_threaded(struct net_device *dev, bool threaded)
> +{
> +       struct napi_struct *napi;
> +       int ret;
> +
> +       dev->threaded = threaded;
> +       list_for_each_entry(napi, &dev->napi_list, dev_list) {
> +               ret = napi_set_threaded(napi, threaded);
> +               if (ret) {
> +                       /* Error occurred on one of the napi,
> +                        * reset threaded mode on all napi.
> +                        */
> +                       dev_disable_threaded_all(dev);
> +                       break;
> +               }
> +       }
> +
> +       return ret;

This doesn't seem right. The NAPI instances can be active while this
is occurring, can they not? I would think at a minimum you need to go
through a napi_disable/napi_enable in order to toggle this value for
each NAPI instance. Otherwise aren't you at risk for racing and having
a napi_schedule attempt to wake up the thread before it has been
allocated?

> +}
> +
>  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>                     int (*poll)(struct napi_struct *, int), int weight)
>  {
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index daf502c13d6d..2017f8f07b8d 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -538,6 +538,68 @@ static ssize_t phys_switch_id_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(phys_switch_id);
>
> +static ssize_t threaded_show(struct device *dev,
> +                            struct device_attribute *attr, char *buf)
> +{
> +       struct net_device *netdev = to_net_dev(dev);
> +       struct napi_struct *n;
> +       bool enabled;
> +       int ret;
> +
> +       if (!rtnl_trylock())
> +               return restart_syscall();
> +
> +       if (!dev_isalive(netdev)) {
> +               ret = -EINVAL;
> +               goto unlock;
> +       }
> +
> +       if (list_empty(&netdev->napi_list)) {
> +               ret = -EOPNOTSUPP;
> +               goto unlock;
> +       }
> +
> +       /* Only return true if all napi have threaded mode.
> +        * The inconsistency could happen when the device driver calls
> +        * napi_disable()/napi_enable() with dev->threaded set to true,
> +        * but napi_kthread_create() fails.
> +        * We return false in this case to remind the user that one or
> +        * more napi did not have threaded mode enabled properly.
> +        */
> +       list_for_each_entry(n, &netdev->napi_list, dev_list) {
> +               enabled = !!test_bit(NAPI_STATE_THREADED, &n->state);
> +               if (!enabled)
> +                       break;
> +       }
> +

This logic seems backwards to me. If we have it enabled for any of
them it seems like we should report it was enabled. Otherwise we are
going to be leaking out instances of threaded napi and not be able to
easily find where they are coming from. If nothing else it might make
sense to have this as a ternary value where it is either enabled,
disabled, or partial/broken.

Also why bother testing each queue when you already have dev->threaded?

> +       ret = sprintf(buf, fmt_dec, enabled);
> +
> +unlock:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static int modify_napi_threaded(struct net_device *dev, unsigned long val)
> +{
> +       struct napi_struct *napi;
> +       int ret;
> +
> +       if (list_empty(&dev->napi_list))
> +               return -EOPNOTSUPP;
> +
> +       ret = dev_set_threaded(dev, !!val);
> +
> +       return ret;
> +}
> +
> +static ssize_t threaded_store(struct device *dev,
> +                             struct device_attribute *attr,
> +                             const char *buf, size_t len)
> +{
> +       return netdev_store(dev, attr, buf, len, modify_napi_threaded);
> +}
> +static DEVICE_ATTR_RW(threaded);
> +
>  static struct attribute *net_class_attrs[] __ro_after_init = {
>         &dev_attr_netdev_group.attr,
>         &dev_attr_type.attr,
> @@ -570,6 +632,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
>         &dev_attr_proto_down.attr,
>         &dev_attr_carrier_up_count.attr,
>         &dev_attr_carrier_down_count.attr,
> +       &dev_attr_threaded.attr,
>         NULL,
>  };
>  ATTRIBUTE_GROUPS(net_class);
> --
> 2.30.0.284.gd98b1dd5eaa7-goog
>


* Re: [PATCH net-next v6 2/3] net: implement threaded-able napi poll loop support
  2021-01-15  3:14   ` Alexander Duyck
@ 2021-01-15  3:38     ` Alexander Duyck
       [not found]     ` <CAEA6p_BxQG7ObXG2Qf=fe5ja1hQr_9B=1_0XAn5w+Cf5+YByog@mail.gmail.com>
  1 sibling, 0 replies; 13+ messages in thread
From: Alexander Duyck @ 2021-01-15  3:38 UTC (permalink / raw)
  To: Wei Wang
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

On Thu, Jan 14, 2021 at 7:14 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Thu, Jan 14, 2021 at 4:33 PM Wei Wang <weiwan@google.com> wrote:
> >

<snip>

> > +void napi_enable(struct napi_struct *n)
> > +{
> > +       BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> > +       smp_mb__before_atomic();
> > +       clear_bit(NAPI_STATE_SCHED, &n->state);
> > +       clear_bit(NAPI_STATE_NPSVC, &n->state);
> > +       WARN_ON(napi_set_threaded(n, n->dev->threaded));
>
> I am not sure what the point is in having a return value if you are
> just using it to trigger a WARN_ON. It might make more sense to
> actually set the WARN_ON inside of napi_set_threaded instead of having
> it here as you could then identify the error much more easily. Or for
> that matter you might be able to use something like pr_warn which
> would allow you a more detailed message about the specific netdev that
> experienced the failure.

One additional change I would make here. The call to napi_set_threaded
should be moved to before the smp_mb__before_atomic(). That way we can
guarantee that the threaded flag and task_struct pointer are visible
to all consumers before they can set NAPI_STATE_SCHED. Otherwise I
think we run the risk of a race where a napi request could fire before
we have finished configuring it.


* Re: [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode
  2021-01-15  3:34   ` Alexander Duyck
@ 2021-01-15 21:54     ` Wei Wang
  2021-01-15 23:08       ` Alexander Duyck
  0 siblings, 1 reply; 13+ messages in thread
From: Wei Wang @ 2021-01-15 21:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

On Thu, Jan 14, 2021 at 7:34 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Thu, Jan 14, 2021 at 4:34 PM Wei Wang <weiwan@google.com> wrote:
> >
> > This patch adds a new sysfs attribute to the network device class.
> > Said attribute provides a per-device control to enable/disable the
> > threaded mode for all the napi instances of the given network device.
> >
> > Co-developed-by: Paolo Abeni <pabeni@redhat.com>
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > Co-developed-by: Felix Fietkau <nbd@nbd.name>
> > Signed-off-by: Felix Fietkau <nbd@nbd.name>
> > Signed-off-by: Wei Wang <weiwan@google.com>
> > ---
> >  include/linux/netdevice.h |  2 ++
> >  net/core/dev.c            | 28 +++++++++++++++++
> >  net/core/net-sysfs.c      | 63 +++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 93 insertions(+)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index c24ed232c746..11ae0c9b9350 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -497,6 +497,8 @@ static inline bool napi_complete(struct napi_struct *n)
> >         return napi_complete_done(n, 0);
> >  }
> >
> > +int dev_set_threaded(struct net_device *dev, bool threaded);
> > +
> >  /**
> >   *     napi_disable - prevent NAPI from scheduling
> >   *     @n: NAPI context
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index edcfec1361e9..d5fb95316ea8 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6754,6 +6754,34 @@ static int napi_set_threaded(struct napi_struct *n, bool threaded)
> >         return err;
> >  }
> >
> > +static void dev_disable_threaded_all(struct net_device *dev)
> > +{
> > +       struct napi_struct *napi;
> > +
> > +       list_for_each_entry(napi, &dev->napi_list, dev_list)
> > +               napi_set_threaded(napi, false);
> > +}
> > +
> > +int dev_set_threaded(struct net_device *dev, bool threaded)
> > +{
> > +       struct napi_struct *napi;
> > +       int ret;
> > +
> > +       dev->threaded = threaded;
> > +       list_for_each_entry(napi, &dev->napi_list, dev_list) {
> > +               ret = napi_set_threaded(napi, threaded);
> > +               if (ret) {
> > +                       /* Error occurred on one of the napi,
> > +                        * reset threaded mode on all napi.
> > +                        */
> > +                       dev_disable_threaded_all(dev);
> > +                       break;
> > +               }
> > +       }
> > +
> > +       return ret;
>
> This doesn't seem right. The NAPI instances can be active while this
> > is occurring, can they not? I would think at a minimum you need to go
> through a napi_disable/napi_enable in order to toggle this value for
> each NAPI instance. Otherwise aren't you at risk for racing and having
> a napi_schedule attempt to wake up the thread before it has been
> allocated?


Yes. The napi instance could be active when this occurs, and I think
that is OK, because napi_set_threaded() only sets the
NAPI_STATE_THREADED bit after successfully creating the kthread, and
____napi_schedule() only tries to wake up the kthread after testing the
THREADED bit.

>
>
> > +}
> > +
> >  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> >                     int (*poll)(struct napi_struct *, int), int weight)
> >  {
> > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> > index daf502c13d6d..2017f8f07b8d 100644
> > --- a/net/core/net-sysfs.c
> > +++ b/net/core/net-sysfs.c
> > @@ -538,6 +538,68 @@ static ssize_t phys_switch_id_show(struct device *dev,
> >  }
> >  static DEVICE_ATTR_RO(phys_switch_id);
> >
> > +static ssize_t threaded_show(struct device *dev,
> > +                            struct device_attribute *attr, char *buf)
> > +{
> > +       struct net_device *netdev = to_net_dev(dev);
> > +       struct napi_struct *n;
> > +       bool enabled;
> > +       int ret;
> > +
> > +       if (!rtnl_trylock())
> > +               return restart_syscall();
> > +
> > +       if (!dev_isalive(netdev)) {
> > +               ret = -EINVAL;
> > +               goto unlock;
> > +       }
> > +
> > +       if (list_empty(&netdev->napi_list)) {
> > +               ret = -EOPNOTSUPP;
> > +               goto unlock;
> > +       }
> > +
> > +       /* Only return true if all napi have threaded mode.
> > +        * The inconsistency could happen when the device driver calls
> > +        * napi_disable()/napi_enable() with dev->threaded set to true,
> > +        * but napi_kthread_create() fails.
> > +        * We return false in this case to remind the user that one or
> > +        * more napi did not have threaded mode enabled properly.
> > +        */
> > +       list_for_each_entry(n, &netdev->napi_list, dev_list) {
> > +               enabled = !!test_bit(NAPI_STATE_THREADED, &n->state);
> > +               if (!enabled)
> > +                       break;
> > +       }
> > +
>
> This logic seems backwards to me. If we have it enabled for any of
> them it seems like we should report it was enabled. Otherwise we are
> going to be leaking out instances of threaded napi and not be able to
> easily find where they are coming from. If nothing else it might make
> sense to have this as a ternary value where it is either enabled,
> disabled, or partial/broken.


Good point. The reason to return true only if all napi have threaded
mode enabled is that I would like this return value to serve as a
signal to the user: if they try to enable threaded mode but then read
back "disabled", they know it was not enabled successfully for all
napi instances.
But maybe using a ternary value is a better idea. I will see how to change that.


>
> Also why bother testing each queue when you already have dev->threaded?


That is because I use dev->threaded to store what the user wants the
threaded mode set to. But if it is only partially set, or broken, I'd
like to return "disabled".
Again, I will see how to implement a ternary value.

>
>
> > +       ret = sprintf(buf, fmt_dec, enabled);
> > +
> > +unlock:
> > +       rtnl_unlock();
> > +       return ret;
> > +}
> > +
> > +static int modify_napi_threaded(struct net_device *dev, unsigned long val)
> > +{
> > +       struct napi_struct *napi;
> > +       int ret;
> > +
> > +       if (list_empty(&dev->napi_list))
> > +               return -EOPNOTSUPP;
> > +
> > +       ret = dev_set_threaded(dev, !!val);
> > +
> > +       return ret;
> > +}
> > +
> > +static ssize_t threaded_store(struct device *dev,
> > +                             struct device_attribute *attr,
> > +                             const char *buf, size_t len)
> > +{
> > +       return netdev_store(dev, attr, buf, len, modify_napi_threaded);
> > +}
> > +static DEVICE_ATTR_RW(threaded);
> > +
> >  static struct attribute *net_class_attrs[] __ro_after_init = {
> >         &dev_attr_netdev_group.attr,
> >         &dev_attr_type.attr,
> > @@ -570,6 +632,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
> >         &dev_attr_proto_down.attr,
> >         &dev_attr_carrier_up_count.attr,
> >         &dev_attr_carrier_down_count.attr,
> > +       &dev_attr_threaded.attr,
> >         NULL,
> >  };
> >  ATTRIBUTE_GROUPS(net_class);
> > --
> > 2.30.0.284.gd98b1dd5eaa7-goog
> >


* Re: [PATCH net-next v6 2/3] net: implement threaded-able napi poll loop support
       [not found]     ` <CAEA6p_BxQG7ObXG2Qf=fe5ja1hQr_9B=1_0XAn5w+Cf5+YByog@mail.gmail.com>
@ 2021-01-15 21:56       ` Wei Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Wei Wang @ 2021-01-15 21:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

Resending the replies in plain text mode to avoid delivery failure.

On Fri, Jan 15, 2021 at 1:38 PM Wei Wang <weiwan@google.com> wrote:
>
>
>
> On Thu, Jan 14, 2021 at 7:14 PM Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>
>> On Thu, Jan 14, 2021 at 4:33 PM Wei Wang <weiwan@google.com> wrote:
>> >
>> > This patch allows running each napi poll loop inside its own
>> > kernel thread.
>> > The threaded mode could be enabled through napi_set_threaded()
>> > api, and does not require a device up/down. The kthread gets
>> > created on demand when napi_set_threaded() is called, and gets
>> > shut down eventually in napi_disable().
>> >
>> > Once that threaded mode is enabled and the kthread is
>> > started, napi_schedule() will wake-up such thread instead
>> > of scheduling the softirq.
>> >
>> > The threaded poll loop behaves quite like the net_rx_action,
>> > but it does not have to manipulate local irqs and uses
>> > an explicit scheduling point based on netdev_budget.
>> >
>> > Co-developed-by: Paolo Abeni <pabeni@redhat.com>
>> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> > Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> > Co-developed-by: Jakub Kicinski <kuba@kernel.org>
>> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>> > Signed-off-by: Wei Wang <weiwan@google.com>
>> > ---
>> >  include/linux/netdevice.h |  12 ++--
>> >  net/core/dev.c            | 113 ++++++++++++++++++++++++++++++++++++++
>> >  2 files changed, 118 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > index 5b949076ed23..c24ed232c746 100644
>> > --- a/include/linux/netdevice.h
>> > +++ b/include/linux/netdevice.h
>> > @@ -347,6 +347,7 @@ struct napi_struct {
>> >         struct list_head        dev_list;
>> >         struct hlist_node       napi_hash_node;
>> >         unsigned int            napi_id;
>> > +       struct task_struct      *thread;
>> >  };
>> >
>> >  enum {
>> > @@ -358,6 +359,7 @@ enum {
>> >         NAPI_STATE_NO_BUSY_POLL,        /* Do not add in napi_hash, no busy polling */
>> >         NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */
>> >         NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over softirq processing*/
>> > +       NAPI_STATE_THREADED,            /* The poll is performed inside its own thread*/
>> >  };
>> >
>> >  enum {
>> > @@ -369,6 +371,7 @@ enum {
>> >         NAPIF_STATE_NO_BUSY_POLL        = BIT(NAPI_STATE_NO_BUSY_POLL),
>> >         NAPIF_STATE_IN_BUSY_POLL        = BIT(NAPI_STATE_IN_BUSY_POLL),
>> >         NAPIF_STATE_PREFER_BUSY_POLL    = BIT(NAPI_STATE_PREFER_BUSY_POLL),
>> > +       NAPIF_STATE_THREADED            = BIT(NAPI_STATE_THREADED),
>> >  };
>> >
>> >  enum gro_result {
>> > @@ -510,13 +513,7 @@ void napi_disable(struct napi_struct *n);
>> >   * Resume NAPI from being scheduled on this context.
>> >   * Must be paired with napi_disable.
>> >   */
>> > -static inline void napi_enable(struct napi_struct *n)
>> > -{
>> > -       BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
>> > -       smp_mb__before_atomic();
>> > -       clear_bit(NAPI_STATE_SCHED, &n->state);
>> > -       clear_bit(NAPI_STATE_NPSVC, &n->state);
>> > -}
>> > +void napi_enable(struct napi_struct *n);
>>
>> If you are reducing the function to just a prototype it might make
>> sense to move the doxygen comments to the definition of the function
>> rather than leaving them here. That way when you change the
>> functionality it is easier to update the comments as it is all in the
>> same place.
>>
>> >
>> >  /**
>> >   *     napi_synchronize - wait until NAPI is not running
>> > @@ -2140,6 +2137,7 @@ struct net_device {
>> >         struct lock_class_key   *qdisc_tx_busylock;
>> >         struct lock_class_key   *qdisc_running_key;
>> >         bool                    proto_down;
>> > +       bool                    threaded;
>> >         unsigned                wol_enabled:1;
>>
>> I would prefer to see this done like the wol_enabled  as a bit field
>> or at least converted to a u8 instead of a bool. Last I knew we
>> weren't really supposed to use bool in structures since some
>> architectures will make it a 32b value.
>>
>>
>> >         struct list_head        net_notifier_list;
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 83b59e4c0f37..edcfec1361e9 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -91,6 +91,7 @@
>> >  #include <linux/etherdevice.h>
>> >  #include <linux/ethtool.h>
>> >  #include <linux/skbuff.h>
>> > +#include <linux/kthread.h>
>> >  #include <linux/bpf.h>
>> >  #include <linux/bpf_trace.h>
>> >  #include <net/net_namespace.h>
>> > @@ -1493,6 +1494,36 @@ void netdev_notify_peers(struct net_device *dev)
>> >  }
>> >  EXPORT_SYMBOL(netdev_notify_peers);
>> >
>> > +static int napi_threaded_poll(void *data);
>> > +
>> > +static int napi_kthread_create(struct napi_struct *n)
>> > +{
>> > +       int err = 0;
>> > +
>> > +       /* Create and wake up the kthread once to put it in
>> > +        * TASK_INTERRUPTIBLE mode to avoid the blocked task
>> > +        * warning and work with loadavg.
>> > +        */
>> > +       n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
>> > +                               n->dev->name, n->napi_id);
>> > +       if (IS_ERR(n->thread)) {
>> > +               err = PTR_ERR(n->thread);
>> > +               pr_err("kthread_run failed with err %d\n", err);
>> > +               n->thread = NULL;
>> > +       }
>> > +
>> > +       return err;
>> > +}
>> > +
>> > +static void napi_kthread_stop(struct napi_struct *n)
>> > +{
>> > +       if (!n->thread)
>> > +               return;
>>
>> I would add a blank line after this if statement.
>>
>> > +       kthread_stop(n->thread);
>> > +       clear_bit(NAPI_STATE_THREADED, &n->state);
>> > +       n->thread = NULL;
>> > +}
>> > +
>> >  static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
>> >  {
>> >         const struct net_device_ops *ops = dev->netdev_ops;
>> > @@ -4252,6 +4283,11 @@ int gro_normal_batch __read_mostly = 8;
>> >  static inline void ____napi_schedule(struct softnet_data *sd,
>> >                                      struct napi_struct *napi)
>> >  {
>> > +       if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
>> > +               wake_up_process(napi->thread);
>> > +               return;
>> > +       }
>> > +
>> >         list_add_tail(&napi->poll_list, &sd->poll_list);
>> >         __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>> >  }
>> > @@ -6697,6 +6733,27 @@ static void init_gro_hash(struct napi_struct *napi)
>> >         napi->gro_bitmask = 0;
>> >  }
>> >
>> > +static int napi_set_threaded(struct napi_struct *n, bool threaded)
>> > +{
>> > +       int err = 0;
>> > +
>> > +       if (threaded == !!test_bit(NAPI_STATE_THREADED, &n->state))
>> > +               return 0;
>>
>> A blank line between these two statements might be nice.
>>
>> > +       if (threaded) {
>> > +               if (!n->thread) {
>> > +                       err = napi_kthread_create(n);
>> > +                       if (err)
>> > +                               goto out;
>> > +               }
>> > +               set_bit(NAPI_STATE_THREADED, &n->state);
>> > +       } else {
>> > +               clear_bit(NAPI_STATE_THREADED, &n->state);
>> > +       }
>>
>> You could probably flatten this bit by doing:
>> if (!threaded) {
>>     clear_bit(NAPI_STATE_THREADED, &n->state);
>>     return 0;
>> }
>>
>> Then you could just pull the rest of this out of the if statement and
>> flatten it a bit.
>
>
>
> Ack for all comments before this. Will update in next version.
>
>>
>>
>> > +
>> > +out:
>> > +       return err;
>> > +}
>> > +
>> >  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>> >                     int (*poll)(struct napi_struct *, int), int weight)
>> >  {
>> > @@ -6738,12 +6795,23 @@ void napi_disable(struct napi_struct *n)
>> >                 msleep(1);
>> >
>> >         hrtimer_cancel(&n->timer);
>> > +       napi_kthread_stop(n);
>> >
>> >         clear_bit(NAPI_STATE_PREFER_BUSY_POLL, &n->state);
>> >         clear_bit(NAPI_STATE_DISABLE, &n->state);
>> >  }
>> >  EXPORT_SYMBOL(napi_disable);
>> >
>> > +void napi_enable(struct napi_struct *n)
>> > +{
>> > +       BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
>> > +       smp_mb__before_atomic();
>> > +       clear_bit(NAPI_STATE_SCHED, &n->state);
>> > +       clear_bit(NAPI_STATE_NPSVC, &n->state);
>> > +       WARN_ON(napi_set_threaded(n, n->dev->threaded));
>>
>> I am not sure what the point is in having a return value if you are
>> just using it to trigger a WARN_ON. It might make more sense to
>> actually set the WARN_ON inside of napi_set_threaded instead of having
>> it here as you could then identify the error much more easily. Or for
>> that matter you might be able to use something like pr_warn which
>> would allow you a more detailed message about the specific netdev that
>> experienced the failure.
>
>
> This function is called from modify_napi_threaded() from sysfs path. The return value is used from that path.
>
>>
>>
>> > +}
>> > +EXPORT_SYMBOL(napi_enable);
>> > +
>> >  static void flush_gro_hash(struct napi_struct *napi)
>> >  {
>> >         int i;
>> > @@ -6866,6 +6934,51 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
>> >         return work;
>> >  }
>> >
>> > +static int napi_thread_wait(struct napi_struct *napi)
>> > +{
>> > +       set_current_state(TASK_INTERRUPTIBLE);
>> > +
>> > +       while (!kthread_should_stop() && !napi_disable_pending(napi)) {
>> > +               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
>> > +                       WARN_ON(!list_empty(&napi->poll_list));
>> > +                       __set_current_state(TASK_RUNNING);
>> > +                       return 0;
>> > +               }
>> > +
>> > +               schedule();
>> > +               set_current_state(TASK_INTERRUPTIBLE);
>> > +       }
>> > +       __set_current_state(TASK_RUNNING);
>> > +       return -1;
>> > +}
>> > +
>> > +static int napi_threaded_poll(void *data)
>> > +{
>> > +       struct napi_struct *napi = data;
>> > +       void *have;
>> > +
>> > +       while (!napi_thread_wait(napi)) {
>> > +               for (;;) {
>> > +                       bool repoll = false;
>> > +
>> > +                       local_bh_disable();
>> > +
>> > +                       have = netpoll_poll_lock(napi);
>> > +                       __napi_poll(napi, &repoll);
>> > +                       netpoll_poll_unlock(have);
>> > +
>>
>> So it looks like v3 of this patch set was making use of the budget but
>> for v4 that went away and you had this. I was wondering why? One thing
>> that seems odd to me is that we are limiting the device to only
>
>
> In v3, the budget existed in the previous proposed patch from Paolo. It also got modified to this same implementation in patch 4 of v3.
>
>>
>> clearing one NAPI_POLL_WEIGHT of packets between flushing the freed
>> skbs and reenabling the bottom halves.
>>
>>
>> I'm wondering if it might make more sense to have a tunable like
>> netdev_budget that could be used for this to allow for more time
>> between flushes, bh enables, and cond_resched checks.
>
>
> I think this depends on how long we want to keep the CPU before yielding. And I think there is a slight difference between this function and napi_poll() called from net_rx_action().
> net_rx_action() has a 'budget' parameter. But in that function, there are potentially multiple napi instances queued, and each napi also only gets polled for at most NAPI_POLL_WEIGHT. The 'budget' is actually controlling how many napi instances to poll at most.
> In napi_threaded_poll(), we are only responsible for polling 1 napi instance. So the behavior here is somewhat consistent: the napi instance gets polled once before reenabling bottom halves and yielding the CPU.
>
>>
>>
>> > +                       __kfree_skb_flush();
>> > +                       local_bh_enable();
>> > +
>> > +                       if (!repoll)
>> > +                               break;
>> > +
>> > +                       cond_resched();
>> > +               }
>> > +       }
>> > +       return 0;
>> > +}
>> > +
>> >  static __latent_entropy void net_rx_action(struct softirq_action *h)
>> >  {
>> >         struct softnet_data *sd = this_cpu_ptr(&softnet_data);
>> > --
>> > 2.30.0.284.gd98b1dd5eaa7-goog
>> >


* Re: [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode
  2021-01-15 21:54     ` Wei Wang
@ 2021-01-15 23:08       ` Alexander Duyck
  2021-01-16  0:44         ` Wei Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Duyck @ 2021-01-15 23:08 UTC (permalink / raw)
  To: Wei Wang
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

On Fri, Jan 15, 2021 at 1:54 PM Wei Wang <weiwan@google.com> wrote:
>
> On Thu, Jan 14, 2021 at 7:34 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Thu, Jan 14, 2021 at 4:34 PM Wei Wang <weiwan@google.com> wrote:
> > >
> > > This patch adds a new sysfs attribute to the network device class.
> > > Said attribute provides a per-device control to enable/disable the
> > > threaded mode for all the napi instances of the given network device.
> > >
> > > Co-developed-by: Paolo Abeni <pabeni@redhat.com>
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > Co-developed-by: Felix Fietkau <nbd@nbd.name>
> > > Signed-off-by: Felix Fietkau <nbd@nbd.name>
> > > Signed-off-by: Wei Wang <weiwan@google.com>
> > > ---
> > >  include/linux/netdevice.h |  2 ++
> > >  net/core/dev.c            | 28 +++++++++++++++++
> > >  net/core/net-sysfs.c      | 63 +++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 93 insertions(+)
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index c24ed232c746..11ae0c9b9350 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -497,6 +497,8 @@ static inline bool napi_complete(struct napi_struct *n)
> > >         return napi_complete_done(n, 0);
> > >  }
> > >
> > > +int dev_set_threaded(struct net_device *dev, bool threaded);
> > > +
> > >  /**
> > >   *     napi_disable - prevent NAPI from scheduling
> > >   *     @n: NAPI context
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index edcfec1361e9..d5fb95316ea8 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6754,6 +6754,34 @@ static int napi_set_threaded(struct napi_struct *n, bool threaded)
> > >         return err;
> > >  }
> > >
> > > +static void dev_disable_threaded_all(struct net_device *dev)
> > > +{
> > > +       struct napi_struct *napi;
> > > +
> > > +       list_for_each_entry(napi, &dev->napi_list, dev_list)
> > > +               napi_set_threaded(napi, false);
> > > +}
> > > +
> > > +int dev_set_threaded(struct net_device *dev, bool threaded)
> > > +{
> > > +       struct napi_struct *napi;
> > > +       int ret;
> > > +
> > > +       dev->threaded = threaded;
> > > +       list_for_each_entry(napi, &dev->napi_list, dev_list) {
> > > +               ret = napi_set_threaded(napi, threaded);
> > > +               if (ret) {
> > > +                       /* Error occurred on one of the napi,
> > > +                        * reset threaded mode on all napi.
> > > +                        */
> > > +                       dev_disable_threaded_all(dev);
> > > +                       break;
> > > +               }
> > > +       }
> > > +
> > > +       return ret;
> >
> > This doesn't seem right. The NAPI instances can be active while this
> > is occurring can they not? I would think at a minimum you need to go
> > through a napi_disable/napi_enable in order to toggle this value for
> > each NAPI instance. Otherwise aren't you at risk for racing and having
> > a napi_schedule attempt to wake up the thread before it has been
> > allocated?
>
>
> Yes. The napi instance could be active when this occurs. And I think
> it is OK. That is because napi_set_threaded() only sets the
> NAPI_STATE_THREADED bit after successfully creating the kthread. And
> ___napi_schedule() only tries to wake up the kthread after testing the
> THREADED bit.

But what do you have guaranteeing that the kthread pointer has been
written to memory? That is what I was getting at. Just because you have
written the value doesn't mean it is in memory yet, so you would
probably need an smp_mb__before_atomic() barrier call before you set
the bit.

Also I am not sure it is entirely safe to have the instance polling
while you are doing this. That is why I am thinking if the instance is
enabled then a napi_disable/napi_enable would be preferable.


* Re: [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode
  2021-01-15 23:08       ` Alexander Duyck
@ 2021-01-16  0:44         ` Wei Wang
  2021-01-16  2:18           ` Alexander Duyck
  0 siblings, 1 reply; 13+ messages in thread
From: Wei Wang @ 2021-01-16  0:44 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

On Fri, Jan 15, 2021 at 3:08 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Jan 15, 2021 at 1:54 PM Wei Wang <weiwan@google.com> wrote:
> >
> > On Thu, Jan 14, 2021 at 7:34 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Thu, Jan 14, 2021 at 4:34 PM Wei Wang <weiwan@google.com> wrote:
> > > >
> > > > This patch adds a new sysfs attribute to the network device class.
> > > > Said attribute provides a per-device control to enable/disable the
> > > > threaded mode for all the napi instances of the given network device.
> > > >
> > > > Co-developed-by: Paolo Abeni <pabeni@redhat.com>
> > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > > Co-developed-by: Felix Fietkau <nbd@nbd.name>
> > > > Signed-off-by: Felix Fietkau <nbd@nbd.name>
> > > > Signed-off-by: Wei Wang <weiwan@google.com>
> > > > ---
> > > >  include/linux/netdevice.h |  2 ++
> > > >  net/core/dev.c            | 28 +++++++++++++++++
> > > >  net/core/net-sysfs.c      | 63 +++++++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 93 insertions(+)
> > > >
> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > index c24ed232c746..11ae0c9b9350 100644
> > > > --- a/include/linux/netdevice.h
> > > > +++ b/include/linux/netdevice.h
> > > > @@ -497,6 +497,8 @@ static inline bool napi_complete(struct napi_struct *n)
> > > >         return napi_complete_done(n, 0);
> > > >  }
> > > >
> > > > +int dev_set_threaded(struct net_device *dev, bool threaded);
> > > > +
> > > >  /**
> > > >   *     napi_disable - prevent NAPI from scheduling
> > > >   *     @n: NAPI context
> > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > index edcfec1361e9..d5fb95316ea8 100644
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -6754,6 +6754,34 @@ static int napi_set_threaded(struct napi_struct *n, bool threaded)
> > > >         return err;
> > > >  }
> > > >
> > > > +static void dev_disable_threaded_all(struct net_device *dev)
> > > > +{
> > > > +       struct napi_struct *napi;
> > > > +
> > > > +       list_for_each_entry(napi, &dev->napi_list, dev_list)
> > > > +               napi_set_threaded(napi, false);
> > > > +}
> > > > +
> > > > +int dev_set_threaded(struct net_device *dev, bool threaded)
> > > > +{
> > > > +       struct napi_struct *napi;
> > > > +       int ret;
> > > > +
> > > > +       dev->threaded = threaded;
> > > > +       list_for_each_entry(napi, &dev->napi_list, dev_list) {
> > > > +               ret = napi_set_threaded(napi, threaded);
> > > > +               if (ret) {
> > > > +                       /* Error occurred on one of the napi,
> > > > +                        * reset threaded mode on all napi.
> > > > +                        */
> > > > +                       dev_disable_threaded_all(dev);
> > > > +                       break;
> > > > +               }
> > > > +       }
> > > > +
> > > > +       return ret;
> > >
> > > This doesn't seem right. The NAPI instances can be active while this
> > > is occurring can they not? I would think at a minimum you need to go
> > > through a napi_disable/napi_enable in order to toggle this value for
> > > each NAPI instance. Otherwise aren't you at risk for racing and having
> > > a napi_schedule attempt to wake up the thread before it has been
> > > allocated?
> >
> >
> > Yes. The napi instance could be active when this occurs. And I think
> > it is OK. That is because napi_set_threaded() only sets the
> > NAPI_STATE_THREADED bit after successfully creating the kthread. And
> > ___napi_schedule() only tries to wake up the kthread after testing the
> > THREADED bit.
>
> But what do you have guaranteeing that the kthread pointer has been
> written to memory? That is what I was getting at. Just because you have
> written the value doesn't mean it is in memory yet, so you would
> probably need an smp_mb__before_atomic() barrier call before you set
> the bit.
>
Noted. Will look into this.

> Also I am not sure it is entirely safe to have the instance polling
> while you are doing this. That is why I am thinking if the instance is
> enabled then a napi_disable/napi_enable would be preferable.
When the napi is actively being polled in threaded mode, we will keep
rescheduling the kthread and calling __napi_poll() until
NAPI_STATE_SCHED is cleared by napi_complete_done(). And the next
time napi_schedule() is called, we re-evaluate the
NAPI_STATE_THREADED bit to see if we should wake up the kthread or
generate a softirq.
And for the other way around, if the napi is being handled during
net_rx_action(), toggling the bit won't cause an immediate wake-up of
the kthread; it takes effect the next time napi_schedule() is called,
after NAPI_STATE_SCHED has been cleared.
I think it is OK. WDYT?


* Re: [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode
  2021-01-16  0:44         ` Wei Wang
@ 2021-01-16  2:18           ` Alexander Duyck
  2021-01-18 19:58             ` Wei Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Duyck @ 2021-01-16  2:18 UTC (permalink / raw)
  To: Wei Wang
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

On Fri, Jan 15, 2021 at 4:44 PM Wei Wang <weiwan@google.com> wrote:
>
> On Fri, Jan 15, 2021 at 3:08 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Fri, Jan 15, 2021 at 1:54 PM Wei Wang <weiwan@google.com> wrote:
> > >
> > > On Thu, Jan 14, 2021 at 7:34 PM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 14, 2021 at 4:34 PM Wei Wang <weiwan@google.com> wrote:
> > > > >
> > > > > This patch adds a new sysfs attribute to the network device class.
> > > > > Said attribute provides a per-device control to enable/disable the
> > > > > threaded mode for all the napi instances of the given network device.
> > > > >
> > > > > Co-developed-by: Paolo Abeni <pabeni@redhat.com>
> > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > > Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > > > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > > > Co-developed-by: Felix Fietkau <nbd@nbd.name>
> > > > > Signed-off-by: Felix Fietkau <nbd@nbd.name>
> > > > > Signed-off-by: Wei Wang <weiwan@google.com>
> > > > > ---
> > > > >  include/linux/netdevice.h |  2 ++
> > > > >  net/core/dev.c            | 28 +++++++++++++++++
> > > > >  net/core/net-sysfs.c      | 63 +++++++++++++++++++++++++++++++++++++++
> > > > >  3 files changed, 93 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > > index c24ed232c746..11ae0c9b9350 100644
> > > > > --- a/include/linux/netdevice.h
> > > > > +++ b/include/linux/netdevice.h
> > > > > @@ -497,6 +497,8 @@ static inline bool napi_complete(struct napi_struct *n)
> > > > >         return napi_complete_done(n, 0);
> > > > >  }
> > > > >
> > > > > +int dev_set_threaded(struct net_device *dev, bool threaded);
> > > > > +
> > > > >  /**
> > > > >   *     napi_disable - prevent NAPI from scheduling
> > > > >   *     @n: NAPI context
> > > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > > index edcfec1361e9..d5fb95316ea8 100644
> > > > > --- a/net/core/dev.c
> > > > > +++ b/net/core/dev.c
> > > > > @@ -6754,6 +6754,34 @@ static int napi_set_threaded(struct napi_struct *n, bool threaded)
> > > > >         return err;
> > > > >  }
> > > > >
> > > > > +static void dev_disable_threaded_all(struct net_device *dev)
> > > > > +{
> > > > > +       struct napi_struct *napi;
> > > > > +
> > > > > +       list_for_each_entry(napi, &dev->napi_list, dev_list)
> > > > > +               napi_set_threaded(napi, false);
> > > > > +}
> > > > > +
> > > > > +int dev_set_threaded(struct net_device *dev, bool threaded)
> > > > > +{
> > > > > +       struct napi_struct *napi;
> > > > > +       int ret;
> > > > > +
> > > > > +       dev->threaded = threaded;
> > > > > +       list_for_each_entry(napi, &dev->napi_list, dev_list) {
> > > > > +               ret = napi_set_threaded(napi, threaded);
> > > > > +               if (ret) {
> > > > > +                       /* Error occurred on one of the napi,
> > > > > +                        * reset threaded mode on all napi.
> > > > > +                        */
> > > > > +                       dev_disable_threaded_all(dev);
> > > > > +                       break;
> > > > > +               }
> > > > > +       }
> > > > > +
> > > > > +       return ret;
> > > >
> > > > This doesn't seem right. The NAPI instances can be active while this
> > > > is occurring can they not? I would think at a minimum you need to go
> > > > through a napi_disable/napi_enable in order to toggle this value for
> > > > each NAPI instance. Otherwise aren't you at risk for racing and having
> > > > a napi_schedule attempt to wake up the thread before it has been
> > > > allocated?
> > >
> > >
> > > Yes. The napi instance could be active when this occurs. And I think
> > > it is OK. That is because napi_set_threaded() only sets the
> > > NAPI_STATE_THREADED bit after successfully creating the kthread. And
> > > ___napi_schedule() only tries to wake up the kthread after testing the
> > > THREADED bit.
> >
> > But what do you have guaranteeing that the kthread pointer has been
> > written to memory? That is what I was getting at. Just because you have
> > written the value doesn't mean it is in memory yet, so you would
> > probably need an smp_mb__before_atomic() barrier call before you set
> > the bit.
> >
> Noted. Will look into this.
>
> > Also I am not sure it is entirely safe to have the instance polling
> > while you are doing this. That is why I am thinking if the instance is
> > enabled then a napi_disable/napi_enable would be preferable.
> When the napi is actively being polled in threaded mode, we will keep
> rescheduling the kthread and calling __napi_poll() until
> NAPI_STATE_SCHED is cleared by napi_complete_done(). And the next
> time napi_schedule() is called, we re-evaluate the
> NAPI_STATE_THREADED bit to see if we should wake up the kthread or
> generate a softirq.
> And for the other way around, if the napi is being handled during
> net_rx_action(), toggling the bit won't cause an immediate wake-up of
> the kthread; it takes effect the next time napi_schedule() is called,
> after NAPI_STATE_SCHED has been cleared.
> I think it is OK. WDYT?

It is hard to say. The one spot that gives me a bit of concern is the
NAPIF_STATE_MISSED case in napi_complete_done(). It would essentially
become a switchover point between the two while we are actively
polling inside the driver. You end up with NAPI_STATE_SCHED not being
toggled but jumping from one to the other.


* Re: [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode
  2021-01-16  2:18           ` Alexander Duyck
@ 2021-01-18 19:58             ` Wei Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Wei Wang @ 2021-01-18 19:58 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Netdev, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Hannes Frederic Sowa, Felix Fietkau

On Fri, Jan 15, 2021 at 6:18 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Jan 15, 2021 at 4:44 PM Wei Wang <weiwan@google.com> wrote:
> >
> > On Fri, Jan 15, 2021 at 3:08 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Fri, Jan 15, 2021 at 1:54 PM Wei Wang <weiwan@google.com> wrote:
> > > >
> > > > On Thu, Jan 14, 2021 at 7:34 PM Alexander Duyck
> > > > <alexander.duyck@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jan 14, 2021 at 4:34 PM Wei Wang <weiwan@google.com> wrote:
> > > > > >
> > > > > > This patch adds a new sysfs attribute to the network device class.
> > > > > > Said attribute provides a per-device control to enable/disable the
> > > > > > threaded mode for all the napi instances of the given network device.
> > > > > >
> > > > > > Co-developed-by: Paolo Abeni <pabeni@redhat.com>
> > > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > > > Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > > > > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > > > > Co-developed-by: Felix Fietkau <nbd@nbd.name>
> > > > > > Signed-off-by: Felix Fietkau <nbd@nbd.name>
> > > > > > Signed-off-by: Wei Wang <weiwan@google.com>
> > > > > > ---
> > > > > >  include/linux/netdevice.h |  2 ++
> > > > > >  net/core/dev.c            | 28 +++++++++++++++++
> > > > > >  net/core/net-sysfs.c      | 63 +++++++++++++++++++++++++++++++++++++++
> > > > > >  3 files changed, 93 insertions(+)
> > > > > >
> > > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > > > index c24ed232c746..11ae0c9b9350 100644
> > > > > > --- a/include/linux/netdevice.h
> > > > > > +++ b/include/linux/netdevice.h
> > > > > > @@ -497,6 +497,8 @@ static inline bool napi_complete(struct napi_struct *n)
> > > > > >         return napi_complete_done(n, 0);
> > > > > >  }
> > > > > >
> > > > > > +int dev_set_threaded(struct net_device *dev, bool threaded);
> > > > > > +
> > > > > >  /**
> > > > > >   *     napi_disable - prevent NAPI from scheduling
> > > > > >   *     @n: NAPI context
> > > > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > > > index edcfec1361e9..d5fb95316ea8 100644
> > > > > > --- a/net/core/dev.c
> > > > > > +++ b/net/core/dev.c
> > > > > > @@ -6754,6 +6754,34 @@ static int napi_set_threaded(struct napi_struct *n, bool threaded)
> > > > > >         return err;
> > > > > >  }
> > > > > >
> > > > > > +static void dev_disable_threaded_all(struct net_device *dev)
> > > > > > +{
> > > > > > +       struct napi_struct *napi;
> > > > > > +
> > > > > > +       list_for_each_entry(napi, &dev->napi_list, dev_list)
> > > > > > +               napi_set_threaded(napi, false);
> > > > > > +}
> > > > > > +
> > > > > > +int dev_set_threaded(struct net_device *dev, bool threaded)
> > > > > > +{
> > > > > > +       struct napi_struct *napi;
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       dev->threaded = threaded;
> > > > > > +       list_for_each_entry(napi, &dev->napi_list, dev_list) {
> > > > > > +               ret = napi_set_threaded(napi, threaded);
> > > > > > +               if (ret) {
> > > > > > +                       /* Error occurred on one of the napi,
> > > > > > +                        * reset threaded mode on all napi.
> > > > > > +                        */
> > > > > > +                       dev_disable_threaded_all(dev);
> > > > > > +                       break;
> > > > > > +               }
> > > > > > +       }
> > > > > > +
> > > > > > +       return ret;
> > > > >
> > > > > This doesn't seem right. The NAPI instances can be active while this
> > > > > is occurring can they not? I would think at a minimum you need to go
> > > > > through a napi_disable/napi_enable in order to toggle this value for
> > > > > each NAPI instance. Otherwise aren't you at risk for racing and having
> > > > > a napi_schedule attempt to wake up the thread before it has been
> > > > > allocated?
> > > >
> > > >
> > > > Yes. The napi instance could be active when this occurs. And I think
> > > > it is OK. That is because napi_set_threaded() only sets the
> > > > NAPI_STATE_THREADED bit after successfully creating the kthread. And
> > > > ___napi_schedule() only tries to wake up the kthread after testing the
> > > > THREADED bit.
> > >
> > > But what do you have guaranteeing that the kthread pointer has been
> > > written to memory? That is what I was getting at. Just because you have
> > > written the value doesn't mean it is in memory yet, so you would
> > > probably need an smp_mb__before_atomic() barrier call before you set
> > > the bit.
> > >
> > Noted. Will look into this.
> >
> > > Also I am not sure it is entirely safe to have the instance polling
> > > while you are doing this. That is why I am thinking if the instance is
> > > enabled then a napi_disable/napi_enable would be preferable.
> > When the napi is actively being polled in threaded mode, we will keep
> > rescheduling the kthread and calling __napi_poll() until
> > NAPI_STATE_SCHED is cleared by napi_complete_done(). And the next
> > time napi_schedule() is called, we re-evaluate the
> > NAPI_STATE_THREADED bit to see if we should wake up the kthread or
> > generate a softirq.
> > And for the other way around, if the napi is being handled during
> > net_rx_action(), toggling the bit won't cause an immediate wake-up of
> > the kthread; it takes effect the next time napi_schedule() is called,
> > after NAPI_STATE_SCHED has been cleared.
> > I think it is OK. WDYT?
>
> It is hard to say. The one spot that gives me a bit of concern is the
> NAPIF_STATE_MISSED case in napi_complete_done(). It would essentially
> become a switchover point between the two while we are actively
> polling inside the driver. You end up with NAPI_STATE_SCHED not being
> toggled but jumping from one to the other.

Hmm.. Right. That is the one case where NAPI_STATE_SCHED will not be
toggled, but could potentially change the processing mode.
But still, I don't see any race in this case. The napi instance will
still either be processed in softirq mode by net_rx_action(), or in
kthread mode, after napi_complete_done() calls __napi_schedule().


end of thread, other threads:[~2021-01-18 20:00 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-15  0:31 [PATCH net-next v6 0/3] implement kthread based napi poll Wei Wang
2021-01-15  0:31 ` [PATCH net-next v6 1/3] net: extract napi poll functionality to __napi_poll() Wei Wang
2021-01-15  0:31 ` [PATCH net-next v6 2/3] net: implement threaded-able napi poll loop support Wei Wang
2021-01-15  3:14   ` Alexander Duyck
2021-01-15  3:38     ` Alexander Duyck
     [not found]     ` <CAEA6p_BxQG7ObXG2Qf=fe5ja1hQr_9B=1_0XAn5w+Cf5+YByog@mail.gmail.com>
2021-01-15 21:56       ` Wei Wang
2021-01-15  0:31 ` [PATCH net-next v6 3/3] net: add sysfs attribute to control napi threaded mode Wei Wang
2021-01-15  3:34   ` Alexander Duyck
2021-01-15 21:54     ` Wei Wang
2021-01-15 23:08       ` Alexander Duyck
2021-01-16  0:44         ` Wei Wang
2021-01-16  2:18           ` Alexander Duyck
2021-01-18 19:58             ` Wei Wang
