* [RFC PATCH 00/11] ptq: Per Thread Queues
@ 2020-06-24 17:17 Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 01/11] cgroup: Export cgroup_{procs,threads}_start and cgroup_procs_next Tom Herbert
                   ` (10 more replies)
  0 siblings, 11 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Per Thread Queues (PTQ) allows application threads to be assigned
dedicated hardware network queues for both transmit and receive. This
facility provides a high degree of traffic isolation between
applications and can also improve performance through fine-grained
packet steering. An overview and design considerations for Per Thread
Queues have been added to Documentation/networking/scaling.rst.

This patch set provides a basic implementation of Per Thread Queues.
The patch set includes:

	- Minor infrastructure changes to cgroups (just export a
	  couple of functions)
	- netqueue.h to hold generic definitions for network queues
	- Minor infrastructure in aRFS and net-sysfs to accommodate
	  PTQ
	- Introduce the concept of "global queues". These are used
	  in cgroup configuration of PTQ. Global queues can be
	  mapped to real device queues. A per device queue sysfs
	  attribute is added to configure the mapping of a device
	  queue to a global queue
	- Creation of a new cgroup controller, "net_queues", that
	  is used to configure Per Thread Queues
	- Hook up the transmit path. This has two parts: 1) in the
	  send socket operations, record the transmit queue
	  associated with the task in the sock structure; 2) in
	  netdev_pick_tx, check if the sock structure of the skb
	  has a valid transmit global queue set. If so, convert the
	  queue identifier to a device queue identifier based on the
	  per device mapping table (a sketch of this conversion
	  follows the list). This selection precedes XPS
	- Hook up the receive path. This has two parts: 1) in
	  rps_record_sock_flow, check if a receive global queue is
	  assigned to the running task; if so, set it in the
	  sock_flow_table entry for the flow. Note this is in lieu of
	  setting the running CPU in the entry. 2) Change get_rps_cpu to
	  query the sock_flow_table to see if a queue index has been
	  stored (as opposed to a CPU number). If a queue index is
	  present, use it for steering, including as the target of
	  ndo_rx_flow_steer.
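
For illustration, here is a minimal sketch of the transmit-side
conversion mentioned above. The gqid-to-dqid helper comes from the
global queues patch in this series; sk_tx_gqid() is a hypothetical
accessor standing in for the sock field added later in the series, so
treat this as a sketch rather than the actual netdev_pick_tx hook:

	/* Hedged sketch only: convert a global transmit queue recorded
	 * against the sending task/sock into a device queue before XPS
	 * runs. sk_tx_gqid() is a hypothetical accessor.
	 */
	static u16 ptq_pick_tx_queue(const struct net_device *dev,
				     const struct sock *sk)
	{
		u16 gqid = sk_tx_gqid(sk);	/* hypothetical accessor */
		u16 dqid;

		if (gqid == NO_QUEUE)
			return NO_QUEUE;	/* fall back to XPS */

		/* Per device gqid -> dqid mapping table */
		dqid = netdev_tx_gqid_to_dqid(dev, gqid);
		if (dqid >= dev->real_num_tx_queues)
			return NO_QUEUE;	/* unmapped or out of range */

		return dqid;
	}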

Related features and concepts:

	- netprio and prio_tc_map: Similar to those, PTQ allows control,
	  via cgroups and per device maps, over mapping applications'
	  packets to transmit queues. However, PTQ is intended to
	  perform fine-grained per application mapping to queues such
	  that each application thread, possibly thousands of them, can
	  have its own dedicated transmit queue.
	- aRFS: On the receive side PTQ extends aRFS to steer packets
	  for a flow based on the assigned global queue as opposed to
	  only the running CPU of the processing thread. In PTQ, the
	  queue "follows" the thread so that when threads are scheduled
	  to run on a different CPU, the packets for flows of the thread
	  continue to be received on the right queue. This addresses
	  a problem in aRFS where, when a thread is rescheduled, all
	  of its aRFS-steered flows may be moved to a different queue
	  (i.e. ndo_rx_flow_steer needs to be called for each flow).
	- Busy polling: PTQ silos an application's packets into
	  queues, and busy polling of those queues can then be
	  applied for high performance. This is likely to be the first
	  instantiation of PTQ, combining it with busy polling
	  (moving interrupts for those queues as threads are scheduled
	  is most likely prohibitive). Busy polling is only practical
	  with a few queues, like maybe at most one per CPU, and
	  won't scale to thousands of per thread queues in use
	  (to address that, a sleeping-busy-poll with completion
	  queues model is suggested below).
	- Making Networking Queues a First Class Citizen in the Kernel
	  https://linuxplumbersconf.org/event/4/contributions/462/
	  attachments/241/422/LPC_2019_kernel_queue_manager.pdf:
	  The concept of "global queues" should be a good complement
	  to this proposal. Global queues provide an abstract
	  representation of device queues. The abstraction is resolved
	  when a global queue is mapped to a real hardware queue. This
	  layering allows exposing queues to users and to configuration,
	  where they might be associated with general attributes (like
	  high priority, QoS characteristics, etc.). The mapping to a
	  specific device queue gives the low level queue that satisfies
	  the implied service of the global queue. Any attributes and
	  associations are configured and in no way hardcoded, so that
	  the use of queues in this manner is fully extensible and can
	  be driven by arbitrary user defined policy. Since global
	  queues are device agnostic, they can be managed not just as a
	  local system resource, but also across the distributed tasks
	  of a job in the datacenter, for example as a property of a
	  container in Kubernetes (similar to how we might manage
	  network priority as a global DC resource, but global queues
	  provide much more granularity and richness in what they can
	  convey).

There are a number of possible extensions to this work:

	- Queue selection could be done on a per process basis
	  or a per socket basis as well as a per thread basis. (A per
	  packet basis probably makes little sense due to out-of-order
	  delivery.)
	- The mechanism for selecting a queue to assign to a thread
	  could be programmed. For instance, an eBPF hook could be
	  added that would allow very fine grained policies to do
	  queue selection.
	- "Global queue groups" could be created where a global queue
	  identifier maps to some group of device queues and there is
	  a selection algorithm, possibly another eBPF hook, that
	  maps to a specific device queue for use.
	- Another attribute in the cgroup could be added to enable
	  or disable aRFS on a per thread basis.
	- Extend the net_queues cgroup to allow control over
	  busy-polling on a per cgroup basis. This could further
	  be enhanced by eBPF hooks to control busy-polling for
	  individual sockets of the cgroup per some arbitrary policy
	  (similar to the eBPF hook for SO_REUSEPORT).
	- Elasticity in listener sockets. As described in the
	  Documentation we expect that a filter can be installed to
	  direct an application's packets to the set of queues for the
	  application. The problem is that the application may
	  create threads on demand so that we don't know a priori
	  how many queues the application needs. Optimally, we
	  want a mechanism to dynamically enable/disable a
	  queue in the filter set so that at any given time the
	  application receives packets only on queues it is
	  actively using. This may entail a new ndo function.
	- The sleeping-busy-poll with completion queue model
	  described in the documentation could be integrated. This
	  would mostly entail creating a reverse mapping from queue
	  to threads, and then allowing the thread processing a
	  device completion queue to schedule the threads of interest.


Tom Herbert (11):
  cgroup: Export cgroup_{procs,threads}_start and cgroup_procs_next
  net: Create netqueue.h and define NO_QUEUE
  arfs: Create set_arfs_queue
  net-sysfs: Create rps_create_sock_flow_table
  net: Infrastructure for per queue aRFS
  net: Function to check against maximum number for RPS queues
  net: Introduce global queues
  ptq: Per Thread Queues
  ptq: Hook up transmit side of Per Queue Threads
  ptq: Hook up receive side of Per Queue Threads
  doc: Documentation for Per Thread Queues

 Documentation/networking/scaling.rst | 195 +++++++-
 include/linux/cgroup.h               |   3 +
 include/linux/cgroup_subsys.h        |   4 +
 include/linux/netdevice.h            | 204 +++++++-
 include/linux/netqueue.h             |  25 +
 include/linux/sched.h                |   4 +
 include/net/ptq.h                    |  45 ++
 include/net/sock.h                   |  75 ++-
 kernel/cgroup/cgroup.c               |   9 +-
 kernel/fork.c                        |   4 +
 net/Kconfig                          |  18 +
 net/core/Makefile                    |   1 +
 net/core/dev.c                       | 177 +++++--
 net/core/filter.c                    |   4 +-
 net/core/net-sysfs.c                 | 201 +++++++-
 net/core/ptq.c                       | 688 +++++++++++++++++++++++++++
 net/core/sysctl_net_core.c           | 152 ++++--
 net/ipv4/af_inet.c                   |   6 +
 18 files changed, 1693 insertions(+), 122 deletions(-)
 create mode 100644 include/linux/netqueue.h
 create mode 100644 include/net/ptq.h
 create mode 100644 net/core/ptq.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH 01/11] cgroup: Export cgroup_{procs,threads}_start and cgroup_procs_next
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 02/11] net: Create netqueue.h and define NO_QUEUE Tom Herbert
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Export the functions and put prototypes in linux/cgroup.h. This allows
creating cgroup entries that provide per task information.
---
 include/linux/cgroup.h | 3 +++
 kernel/cgroup/cgroup.c | 9 ++++++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..59837f6f4e54 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -119,6 +119,9 @@ int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
 int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk);
+void *cgroup_procs_start(struct seq_file *s, loff_t *pos);
+void *cgroup_threads_start(struct seq_file *s, loff_t *pos);
+void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos);
 
 void cgroup_fork(struct task_struct *p);
 extern int cgroup_can_fork(struct task_struct *p,
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..69cd14201cf0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4597,7 +4597,7 @@ static void cgroup_procs_release(struct kernfs_open_file *of)
 	}
 }
 
-static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
+void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 {
 	struct kernfs_open_file *of = s->private;
 	struct css_task_iter *it = of->priv;
@@ -4607,6 +4607,7 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 
 	return css_task_iter_next(it);
 }
+EXPORT_SYMBOL_GPL(cgroup_procs_next);
 
 static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
 				  unsigned int iter_flags)
@@ -4637,7 +4638,7 @@ static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
 	return cgroup_procs_next(s, NULL, NULL);
 }
 
-static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 {
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 
@@ -4653,6 +4654,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	return __cgroup_procs_start(s, pos, CSS_TASK_ITER_PROCS |
 					    CSS_TASK_ITER_THREADED);
 }
+EXPORT_SYMBOL_GPL(cgroup_procs_start);
 
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
@@ -4764,10 +4766,11 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
+void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
 {
 	return __cgroup_procs_start(s, pos, 0);
 }
+EXPORT_SYMBOL_GPL(cgroup_threads_start);
 
 static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 02/11] net: Create netqueue.h and define NO_QUEUE
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 01/11] cgroup: Export cgroup_{procs,threads}_start and cgroup_procs_next Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 03/11] arfs: Create set_arfs_queue Tom Herbert
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Create linux/netqueue.h to hold generic network queue definitions.

Define NO_QUEUE to replace NO_QUEUE_MAPPING in net/sock.h. NO_QUEUE
can generally be used to indicate that a 16 bit queue index does not
refer to a queue.

Also, define net_queue_pair which will be used as a generic way to store a
transmit/receive pair of network queues.
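
As a small illustrative sketch (assuming only what the new header
defines), a holder of a net_queue_pair starts with both indices
unassigned:

	struct net_queue_pair qpair;

	init_net_queue_pair(&qpair);	/* txq_id == rxq_id == NO_QUEUE */
	if (qpair.rxq_id == NO_QUEUE) {
		/* no receive queue has been assigned yet */
	}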
---
 include/linux/netdevice.h |  1 +
 include/linux/netqueue.h  | 25 +++++++++++++++++++++++++
 include/net/sock.h        | 12 +++++-------
 net/core/filter.c         |  4 ++--
 4 files changed, 33 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/netqueue.h

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6fc613ed8eae..bf5f2a85da97 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -32,6 +32,7 @@
 #include <linux/percpu.h>
 #include <linux/rculist.h>
 #include <linux/workqueue.h>
+#include <linux/netqueue.h>
 #include <linux/dynamic_queue_limits.h>
 
 #include <linux/ethtool.h>
diff --git a/include/linux/netqueue.h b/include/linux/netqueue.h
new file mode 100644
index 000000000000..5a4d39821ada
--- /dev/null
+++ b/include/linux/netqueue.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Network queue identifier definitions
+ *
+ * Copyright (c) 2020 Tom Herbert <tom@herbertland.com>
+ */
+
+#ifndef _LINUX_NETQUEUE_H
+#define _LINUX_NETQUEUE_H
+
+/* Indicates no network queue is present in 16 bit queue number */
+#define NO_QUEUE	USHRT_MAX
+
+struct net_queue_pair {
+	unsigned short txq_id;
+	unsigned short rxq_id;
+};
+
+static inline void init_net_queue_pair(struct net_queue_pair *qpair)
+{
+	qpair->rxq_id = NO_QUEUE;
+	qpair->txq_id = NO_QUEUE;
+}
+
+#endif /* _LINUX_NETQUEUE_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index c53cc42b5ab9..acb76cfaae1b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1800,16 +1800,14 @@ static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 	sk->sk_tx_queue_mapping = tx_queue;
 }
 
-#define NO_QUEUE_MAPPING	USHRT_MAX
-
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
-	sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
+	sk->sk_tx_queue_mapping = NO_QUEUE;
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-	if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING)
+	if (sk && sk->sk_tx_queue_mapping != NO_QUEUE)
 		return sk->sk_tx_queue_mapping;
 
 	return -1;
@@ -1821,7 +1819,7 @@ static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
 	if (skb_rx_queue_recorded(skb)) {
 		u16 rx_queue = skb_get_rx_queue(skb);
 
-		if (WARN_ON_ONCE(rx_queue == NO_QUEUE_MAPPING))
+		if (WARN_ON_ONCE(rx_queue == NO_QUEUE))
 			return;
 
 		sk->sk_rx_queue_mapping = rx_queue;
@@ -1832,14 +1830,14 @@ static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
 static inline void sk_rx_queue_clear(struct sock *sk)
 {
 #ifdef CONFIG_XPS
-	sk->sk_rx_queue_mapping = NO_QUEUE_MAPPING;
+	sk->sk_rx_queue_mapping = NO_QUEUE;
 #endif
 }
 
 #ifdef CONFIG_XPS
 static inline int sk_rx_queue_get(const struct sock *sk)
 {
-	if (sk && sk->sk_rx_queue_mapping != NO_QUEUE_MAPPING)
+	if (sk && sk->sk_rx_queue_mapping != NO_QUEUE)
 		return sk->sk_rx_queue_mapping;
 
 	return -1;
diff --git a/net/core/filter.c b/net/core/filter.c
index 73395384afe2..d696aaabe3af 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7544,7 +7544,7 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 
 	case offsetof(struct __sk_buff, queue_mapping):
 		if (type == BPF_WRITE) {
-			*insn++ = BPF_JMP_IMM(BPF_JGE, si->src_reg, NO_QUEUE_MAPPING, 1);
+			*insn++ = BPF_JMP_IMM(BPF_JGE, si->src_reg, NO_QUEUE, 1);
 			*insn++ = BPF_STX_MEM(BPF_H, si->dst_reg, si->src_reg,
 					      bpf_target_off(struct sk_buff,
 							     queue_mapping,
@@ -7981,7 +7981,7 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 				       sizeof_field(struct sock,
 						    sk_rx_queue_mapping),
 				       target_size));
-		*insn++ = BPF_JMP_IMM(BPF_JNE, si->dst_reg, NO_QUEUE_MAPPING,
+		*insn++ = BPF_JMP_IMM(BPF_JNE, si->dst_reg, NO_QUEUE,
 				      1);
 		*insn++ = BPF_MOV64_IMM(si->dst_reg, -1);
 #else
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 03/11] arfs: Create set_arfs_queue
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 01/11] cgroup: Export cgroup_{procs,threads}_start and cgroup_procs_next Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 02/11] net: Create netqueue.h and define NO_QUEUE Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 04/11] net-sysfs: Create rps_create_sock_flow_table Tom Herbert
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Abstract out the code for steering a flow to an aRFS queue (via
ndo_rx_flow_steer) into its own function. This allows the function to
be called in other use cases.
---
 net/core/dev.c | 67 +++++++++++++++++++++++++++++---------------------
 1 file changed, 39 insertions(+), 28 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6bc2388141f6..9f7a3e78e23a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4250,42 +4250,53 @@ EXPORT_SYMBOL(rps_needed);
 struct static_key_false rfs_needed __read_mostly;
 EXPORT_SYMBOL(rfs_needed);
 
+#ifdef CONFIG_RFS_ACCEL
+static void set_arfs_queue(struct net_device *dev, struct sk_buff *skb,
+			   struct rps_dev_flow *rflow, u16 rxq_index)
+{
+	struct rps_dev_flow_table *flow_table;
+	struct netdev_rx_queue *rxqueue;
+	struct rps_dev_flow *old_rflow;
+	u32 flow_id;
+	int rc;
+
+	rxqueue = dev->_rx + rxq_index;
+
+	flow_table = rcu_dereference(rxqueue->rps_flow_table);
+	if (!flow_table)
+		return;
+
+	flow_id = skb_get_hash(skb) & flow_table->mask;
+	rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb,
+						rxq_index, flow_id);
+	if (rc < 0)
+		return;
+
+	old_rflow = rflow;
+	rflow = &flow_table->flows[flow_id];
+	rflow->filter = rc;
+	if (old_rflow->filter == rflow->filter)
+		old_rflow->filter = RPS_NO_FILTER;
+}
+#endif
+
 static struct rps_dev_flow *
 set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	    struct rps_dev_flow *rflow, u16 next_cpu)
 {
 	if (next_cpu < nr_cpu_ids) {
 #ifdef CONFIG_RFS_ACCEL
-		struct netdev_rx_queue *rxqueue;
-		struct rps_dev_flow_table *flow_table;
-		struct rps_dev_flow *old_rflow;
-		u32 flow_id;
-		u16 rxq_index;
-		int rc;
 
 		/* Should we steer this flow to a different hardware queue? */
-		if (!skb_rx_queue_recorded(skb) || !dev->rx_cpu_rmap ||
-		    !(dev->features & NETIF_F_NTUPLE))
-			goto out;
-		rxq_index = cpu_rmap_lookup_index(dev->rx_cpu_rmap, next_cpu);
-		if (rxq_index == skb_get_rx_queue(skb))
-			goto out;
-
-		rxqueue = dev->_rx + rxq_index;
-		flow_table = rcu_dereference(rxqueue->rps_flow_table);
-		if (!flow_table)
-			goto out;
-		flow_id = skb_get_hash(skb) & flow_table->mask;
-		rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb,
-							rxq_index, flow_id);
-		if (rc < 0)
-			goto out;
-		old_rflow = rflow;
-		rflow = &flow_table->flows[flow_id];
-		rflow->filter = rc;
-		if (old_rflow->filter == rflow->filter)
-			old_rflow->filter = RPS_NO_FILTER;
-	out:
+		if (skb_rx_queue_recorded(skb) && dev->rx_cpu_rmap &&
+		    (dev->features & NETIF_F_NTUPLE)) {
+			u16 rxq_index;
+
+			rxq_index = cpu_rmap_lookup_index(dev->rx_cpu_rmap,
+							  next_cpu);
+			if (rxq_index != skb_get_rx_queue(skb))
+				set_arfs_queue(dev, skb, rflow, rxq_index);
+		}
 #endif
 		rflow->last_qtail =
 			per_cpu(softnet_data, next_cpu).input_queue_head;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 04/11] net-sysfs: Create rps_create_sock_flow_table
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
                   ` (2 preceding siblings ...)
  2020-06-24 17:17 ` [RFC PATCH 03/11] arfs: Create set_arfs_queue Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 05/11] net: Infrastructure for per queue aRFS Tom Herbert
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Move code for writing a sock_flow_table to its own function so that it
can be called for other use cases.
---
 net/core/sysctl_net_core.c | 102 +++++++++++++++++++++----------------
 1 file changed, 57 insertions(+), 45 deletions(-)

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index f93f8ace6c56..9c7d46fbb75a 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -46,66 +46,78 @@ int sysctl_devconf_inherit_init_net __read_mostly;
 EXPORT_SYMBOL(sysctl_devconf_inherit_init_net);
 
 #ifdef CONFIG_RPS
+static int rps_create_sock_flow_table(size_t size, size_t orig_size,
+				      struct rps_sock_flow_table *orig_table,
+				      bool force)
+{
+	struct rps_sock_flow_table *sock_table;
+	int i;
+
+	if (size) {
+		if (size > 1 << 29) {
+			/* Enforce limit to prevent overflow */
+			return -EINVAL;
+		}
+		size = roundup_pow_of_two(size);
+		if (size != orig_size || force) {
+			sock_table = vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size));
+			if (!sock_table)
+				return -ENOMEM;
+
+			sock_table->mask = size - 1;
+		} else {
+			sock_table = orig_table;
+		}
+
+		for (i = 0; i < size; i++)
+			sock_table->ents[i] = RPS_NO_CPU;
+	} else {
+		sock_table = NULL;
+	}
+
+	if (sock_table != orig_table) {
+		rcu_assign_pointer(rps_sock_flow_table, sock_table);
+		if (sock_table) {
+			static_branch_inc(&rps_needed);
+			static_branch_inc(&rfs_needed);
+		}
+		if (orig_table) {
+			static_branch_dec(&rps_needed);
+			static_branch_dec(&rfs_needed);
+			synchronize_rcu();
+			vfree(orig_table);
+		}
+	}
+
+	return 0;
+}
+
+static DEFINE_MUTEX(sock_flow_mutex);
+
 static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
 				void *buffer, size_t *lenp, loff_t *ppos)
 {
 	unsigned int orig_size, size;
-	int ret, i;
+	int ret;
 	struct ctl_table tmp = {
 		.data = &size,
 		.maxlen = sizeof(size),
 		.mode = table->mode
 	};
-	struct rps_sock_flow_table *orig_sock_table, *sock_table;
-	static DEFINE_MUTEX(sock_flow_mutex);
+	struct rps_sock_flow_table *sock_table;
 
 	mutex_lock(&sock_flow_mutex);
 
-	orig_sock_table = rcu_dereference_protected(rps_sock_flow_table,
-					lockdep_is_held(&sock_flow_mutex));
-	size = orig_size = orig_sock_table ? orig_sock_table->mask + 1 : 0;
+	sock_table = rcu_dereference_protected(rps_sock_flow_table,
+					       lockdep_is_held(&sock_flow_mutex));
+	size = sock_table ? sock_table->mask + 1 : 0;
+	orig_size = size;
 
 	ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
 
-	if (write) {
-		if (size) {
-			if (size > 1<<29) {
-				/* Enforce limit to prevent overflow */
-				mutex_unlock(&sock_flow_mutex);
-				return -EINVAL;
-			}
-			size = roundup_pow_of_two(size);
-			if (size != orig_size) {
-				sock_table =
-				    vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size));
-				if (!sock_table) {
-					mutex_unlock(&sock_flow_mutex);
-					return -ENOMEM;
-				}
-				rps_cpu_mask = roundup_pow_of_two(nr_cpu_ids) - 1;
-				sock_table->mask = size - 1;
-			} else
-				sock_table = orig_sock_table;
-
-			for (i = 0; i < size; i++)
-				sock_table->ents[i] = RPS_NO_CPU;
-		} else
-			sock_table = NULL;
-
-		if (sock_table != orig_sock_table) {
-			rcu_assign_pointer(rps_sock_flow_table, sock_table);
-			if (sock_table) {
-				static_branch_inc(&rps_needed);
-				static_branch_inc(&rfs_needed);
-			}
-			if (orig_sock_table) {
-				static_branch_dec(&rps_needed);
-				static_branch_dec(&rfs_needed);
-				synchronize_rcu();
-				vfree(orig_sock_table);
-			}
-		}
-	}
+	if (write)
+		ret = rps_create_sock_flow_table(size, orig_size,
+						 sock_table, false);
 
 	mutex_unlock(&sock_flow_mutex);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 05/11] net: Infrastructure for per queue aRFS
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
                   ` (3 preceding siblings ...)
  2020-06-24 17:17 ` [RFC PATCH 04/11] net-sysfs: Create rps_create_sock_flow_table Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-28  8:55   ` kernel test robot
  2020-06-24 17:17 ` [RFC PATCH 06/11] net: Function to check against maximum number for RPS queues Tom Herbert
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Infrastructure changes to allow aRFS to be based on Per Thread Queues
instead of just the CPU. The basic change is to make the field in
rps_dev_flow hold either a CPU number or a queue index (rather than
only a CPU).

Changes include:
	- Replace u16 cpu field in rps_dev_flow structure with
	  rps_cpu_qid structure that contains either a CPU or a device
	  queue index. Note the structure is still sixteen bits
	- Helper functions to clear and set the cpu in the
	  rps_cpu_qid of rps_dev_flow
	- Create a sock_masks structure that defines the partitioning
	  of the thirty-two bit entry in rps_sock_flow_table. The
	  structure contains two masks, one to extract the upper bits
	  of the hash and one to extract the CPU number or queue index
	  (a sketch of the resulting queue encoding follows this list)
	- Replace rps_cpu_mask with sock_masks from rps_sock_flow_table
	- Add rps_max_num_queues which will be used when creating
	  sock_masks for queue entries in rps_sock_flow_table
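
A hedged sketch of how a queue entry would be encoded with these masks,
mirroring the existing CPU encoding in rps_record_sock_flow (the helper
name below is hypothetical and not part of this patch):

	/* Illustrative only: encode a global queue index into a
	 * sock_flow_table entry. The high bit marks a queue entry, the
	 * upper bits carry the flow hash, the low bits carry the qid.
	 */
	static inline void rps_record_sock_queue(struct rps_sock_flow_table *table,
						 u32 hash, u16 qid)
	{
		if (table && hash && qid <= RPS_MAX_QID) {
			unsigned int index = hash & table->mask;
			u32 val = RPS_SOCK_FLOW_USE_QID |
				  (hash & table->queue_masks.hash_mask) |
				  (qid & table->queue_masks.mask);

			if (table->ents[index] != val)
				table->ents[index] = val;
		}
	}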
---
 include/linux/netdevice.h  | 94 +++++++++++++++++++++++++++++++++-----
 net/core/dev.c             | 47 ++++++++++++-------
 net/core/net-sysfs.c       |  2 +-
 net/core/sysctl_net_core.c |  6 ++-
 4 files changed, 119 insertions(+), 30 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bf5f2a85da97..d528aa61fea3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -674,18 +674,65 @@ struct rps_map {
 };
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + ((_num) * sizeof(u16)))
 
+/* The rps_cpu_qid structure is sixteen bits and holds either a CPU number or
+ * a queue index. The use_qid field specifies which type of value is set (i.e.
+ * if use_qid is 1 then cpu_qid contains a fifteen bit queue identifier, and if
+ * use_qid is 0 then cpu_qid contains a fifteen bit CPU number). No entry is
+ * signified by RPS_NO_CPU_QID in val which is set to NO_QUEUE (0xffff). So the
+ * range of CPU numbers that can be stored is 0..32,767 (0x7fff) and the range
+ * of queue identifiers is 0..32,766. Note that CPU numbers are limited by
+ * CONFIG_NR_CPUS which currently has a maximum supported value of 8,192 (per
+ * arch/x86/Kconfig), so WARN_ON is used to check that a CPU number is less
+ * than 0x8000 when setting the cpu in rps_cpu_qid. The queue index is limited
+ * by configuration.
+ */
+struct rps_cpu_qid {
+	union {
+		u16 val;
+		struct {
+			u16 use_qid: 1;
+			union {
+				u16 cpu: 15;
+				u16 qid: 15;
+			};
+		};
+	};
+};
+
+#define RPS_NO_CPU_QID	NO_QUEUE	/* No CPU or qid in rps_cpu_qid */
+#define RPS_MAX_CPU	0x7fff		/* Maximum cpu in rps_cpu_qid */
+#define RPS_MAX_QID	0x7ffe		/* Maximum qid in rps_cpu_qid */
+
 /*
  * The rps_dev_flow structure contains the mapping of a flow to a CPU, the
  * tail pointer for that CPU's input queue at the time of last enqueue, and
  * a hardware filter index.
  */
 struct rps_dev_flow {
-	u16 cpu;
+	struct rps_cpu_qid cpu_qid;
 	u16 filter;
 	unsigned int last_qtail;
 };
 #define RPS_NO_FILTER 0xffff
 
+static inline void rps_dev_flow_clear(struct rps_dev_flow *dev_flow)
+{
+	dev_flow->cpu_qid.val = RPS_NO_CPU_QID;
+}
+
+static inline void rps_dev_flow_set_cpu(struct rps_dev_flow *dev_flow, u16 cpu)
+{
+	struct rps_cpu_qid cpu_qid;
+
+	if (WARN_ON(cpu > RPS_MAX_CPU))
+		return;
+
+	/* Set the rflow target to the CPU atomically */
+	cpu_qid.use_qid = 0;
+	cpu_qid.cpu = cpu;
+	dev_flow->cpu_qid = cpu_qid;
+}
+
 /*
  * The rps_dev_flow_table structure contains a table of flow mappings.
  */
@@ -697,34 +744,57 @@ struct rps_dev_flow_table {
 #define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
     ((_num) * sizeof(struct rps_dev_flow)))
 
+struct rps_sock_masks {
+	u32 mask;
+	u32 hash_mask;
+};
+
 /*
- * The rps_sock_flow_table contains mappings of flows to the last CPU
- * on which they were processed by the application (set in recvmsg).
- * Each entry is a 32bit value. Upper part is the high-order bits
- * of flow hash, lower part is CPU number.
- * rps_cpu_mask is used to partition the space, depending on number of
- * possible CPUs : rps_cpu_mask = roundup_pow_of_two(nr_cpu_ids) - 1
- * For example, if 64 CPUs are possible, rps_cpu_mask = 0x3f,
- * meaning we use 32-6=26 bits for the hash.
+ * The rps_sock_flow_table contains mappings of flows to the last CPU on which
+ * they were processed by the application (set in recvmsg), or the mapping of
+ * the flow to a per thread queue for the application. Each entry is a 32bit
+ * value. The high order bit indicates whether a CPU number or a queue index is
+ * stored. The next high-order bits contain the flow hash, and the lower bits
+ * contain the CPU number or queue index. The sock_flow table contains two
+ * sets of masks, one for CPU entries (cpu_masks) and one for queue entries
+ * (queue_masks), that are used to partition the space between the hash bits
+ * and the CPU number or queue index. For the cpu masks, cpu_masks.mask is set
+ * to roundup_pow_of_two(nr_cpu_ids) - 1 and the corresponding hash mask,
+ * cpu_masks.hash_mask, is set to (~cpu_masks.mask & ~RPS_SOCK_FLOW_USE_QID).
+ * For example, if 64 CPUs are possible, cpu_masks.mask == 0x3f, meaning we use
+ * 31-6=25 bits for the hash (so cpu_masks.hash_mask == 0x7fffffc0). Similarly,
+ * queue_masks in rps_sock_flow_table is used to partition the space when a
+ * queue index is present.
  */
 struct rps_sock_flow_table {
 	u32	mask;
+	struct	rps_sock_masks cpu_masks;
+	struct	rps_sock_masks queue_masks;
 
 	u32	ents[] ____cacheline_aligned_in_smp;
 };
 #define	RPS_SOCK_FLOW_TABLE_SIZE(_num) (offsetof(struct rps_sock_flow_table, ents[_num]))
 
-#define RPS_NO_CPU 0xffff
+#define RPS_SOCK_FLOW_USE_QID	(1 << 31)
+#define RPS_SOCK_FLOW_NO_IDENT	-1U
 
-extern u32 rps_cpu_mask;
 extern struct rps_sock_flow_table __rcu *rps_sock_flow_table;
+extern unsigned int rps_max_num_queues;
+
+static inline void rps_init_sock_masks(struct rps_sock_masks *masks, u32 num)
+{
+	u32 mask = roundup_pow_of_two(num) - 1;
+
+	masks->mask = mask;
+	masks->hash_mask = (~mask & ~RPS_SOCK_FLOW_USE_QID);
+}
 
 static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
 					u32 hash)
 {
 	if (table && hash) {
+		u32 val = hash & table->cpu_masks.hash_mask;
 		unsigned int index = hash & table->mask;
-		u32 val = hash & ~rps_cpu_mask;
 
 		/* We only give a hint, preemption can change CPU under us */
 		val |= raw_smp_processor_id();
diff --git a/net/core/dev.c b/net/core/dev.c
index 9f7a3e78e23a..946940bdd583 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4242,8 +4242,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
 /* One global table that all flow-based protocols share. */
 struct rps_sock_flow_table __rcu *rps_sock_flow_table __read_mostly;
 EXPORT_SYMBOL(rps_sock_flow_table);
-u32 rps_cpu_mask __read_mostly;
-EXPORT_SYMBOL(rps_cpu_mask);
+unsigned int rps_max_num_queues;
 
 struct static_key_false rps_needed __read_mostly;
 EXPORT_SYMBOL(rps_needed);
@@ -4302,7 +4301,7 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 			per_cpu(softnet_data, next_cpu).input_queue_head;
 	}
 
-	rflow->cpu = next_cpu;
+	rps_dev_flow_set_cpu(rflow, next_cpu);
 	return rflow;
 }
 
@@ -4349,22 +4348,39 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 	sock_flow_table = rcu_dereference(rps_sock_flow_table);
 	if (flow_table && sock_flow_table) {
+		u32 next_cpu, comparator, ident;
 		struct rps_dev_flow *rflow;
-		u32 next_cpu;
-		u32 ident;
 
 		/* First check into global flow table if there is a match */
 		ident = sock_flow_table->ents[hash & sock_flow_table->mask];
-		if ((ident ^ hash) & ~rps_cpu_mask)
-			goto try_rps;
+		comparator = ((ident & RPS_SOCK_FLOW_USE_QID) ?
+				sock_flow_table->queue_masks.hash_mask :
+				sock_flow_table->cpu_masks.hash_mask);
 
-		next_cpu = ident & rps_cpu_mask;
+		if ((ident ^ hash) & comparator)
+			goto try_rps;
 
 		/* OK, now we know there is a match,
 		 * we can look at the local (per receive queue) flow table
 		 */
 		rflow = &flow_table->flows[hash & flow_table->mask];
-		tcpu = rflow->cpu;
+
+		/* The flow_sock entry may refer to either a queue or a
+		 * CPU. Proceed accordingly.
+		 */
+		if (ident & RPS_SOCK_FLOW_USE_QID) {
+			/* A queue identifier is in the sock_flow_table entry */
+
+			/* Don't use aRFS to set CPU in this case, skip to
+			 * trying RPS
+			 */
+			goto try_rps;
+		}
+
+		/* A CPU number is in the sock_flow_table entry */
+
+		next_cpu = ident & sock_flow_table->cpu_masks.mask;
+		tcpu = rflow->cpu_qid.use_qid ? NO_QUEUE : rflow->cpu_qid.cpu;
 
 		/*
 		 * If the desired CPU (where last recvmsg was done) is
@@ -4396,10 +4412,8 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 	if (map) {
 		tcpu = map->cpus[reciprocal_scale(hash, map->len)];
-		if (cpu_online(tcpu)) {
+		if (cpu_online(tcpu))
 			cpu = tcpu;
-			goto done;
-		}
 	}
 
 done:
@@ -4424,17 +4438,18 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index,
 {
 	struct netdev_rx_queue *rxqueue = dev->_rx + rxq_index;
 	struct rps_dev_flow_table *flow_table;
+	struct rps_cpu_qid cpu_qid;
 	struct rps_dev_flow *rflow;
 	bool expire = true;
-	unsigned int cpu;
 
 	rcu_read_lock();
 	flow_table = rcu_dereference(rxqueue->rps_flow_table);
 	if (flow_table && flow_id <= flow_table->mask) {
 		rflow = &flow_table->flows[flow_id];
-		cpu = READ_ONCE(rflow->cpu);
-		if (rflow->filter == filter_id && cpu < nr_cpu_ids &&
-		    ((int)(per_cpu(softnet_data, cpu).input_queue_head -
+		cpu_qid = READ_ONCE(rflow->cpu_qid);
+		if (rflow->filter == filter_id && !cpu_qid.use_qid &&
+		    cpu_qid.cpu < nr_cpu_ids &&
+		    ((int)(per_cpu(softnet_data, cpu_qid.cpu).input_queue_head -
 			   rflow->last_qtail) <
 		     (int)(10 * flow_table->mask)))
 			expire = false;
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e353b822bb15..56d27463d466 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -858,7 +858,7 @@ static ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
 
 		table->mask = mask;
 		for (count = 0; count <= mask; count++)
-			table->flows[count].cpu = RPS_NO_CPU;
+			rps_dev_flow_clear(&table->flows[count]);
 	} else {
 		table = NULL;
 	}
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 9c7d46fbb75a..d09471f29d89 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -65,12 +65,16 @@ static int rps_create_sock_flow_table(size_t size, size_t orig_size,
 				return -ENOMEM;
 
 			sock_table->mask = size - 1;
+			rps_init_sock_masks(&sock_table->cpu_masks,
+					    nr_cpu_ids);
+			rps_init_sock_masks(&sock_table->queue_masks,
+					    rps_max_num_queues);
 		} else {
 			sock_table = orig_table;
 		}
 
 		for (i = 0; i < size; i++)
-			sock_table->ents[i] = RPS_NO_CPU;
+			sock_table->ents[i] = RPS_NO_CPU_QID;
 	} else {
 		sock_table = NULL;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 06/11] net: Function to check against maximum number for RPS queues
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
                   ` (4 preceding siblings ...)
  2020-06-24 17:17 ` [RFC PATCH 05/11] net: Infrastructure for per queue aRFS Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 07/11] net: Introduce global queues Tom Herbert
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Add the rps_check_max_queues function, which checks whether the input
queue index is at least rps_max_num_queues. If it is, rps_max_num_queues
is set to that value and the sock_flow_table is recreated to update the
queue masks used in table entries.
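
For illustration, a caller that accepts a global queue index from
configuration would look roughly like the sketch below (the actual call
sites land in later patches of the series and are not shown here):

	/* Make sure the sock flow table masks can represent this queue
	 * index before it is used for RPS/aRFS steering.
	 */
	int err = rps_check_max_queues(gqid);

	if (err)
		return err;	/* e.g. gqid exceeds RPS_MAX_QID */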
---
 include/linux/netdevice.h  | 10 ++++++++
 net/core/sysctl_net_core.c | 48 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d528aa61fea3..48ba1c1fc644 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -804,6 +804,16 @@ static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
 	}
 }
 
+int __rps_check_max_queues(unsigned int idx);
+
+static inline int rps_check_max_queues(unsigned int idx)
+{
+	if (idx < rps_max_num_queues)
+		return 0;
+
+	return __rps_check_max_queues(idx);
+}
+
 #ifdef CONFIG_RFS_ACCEL
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 			 u16 filter_id);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index d09471f29d89..743c46148135 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -127,6 +127,54 @@ static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
 
 	return ret;
 }
+
+int __rps_check_max_queues(unsigned int idx)
+{
+	unsigned int old;
+	size_t size;
+	int ret = 0;
+
+	/* Assume maximum queues should be at least the number of CPUs.
+	 * This avoids too much thrashing of the sock flow table at
+	 * initialization.
+	 */
+	if (idx < nr_cpu_ids && nr_cpu_ids < RPS_MAX_QID)
+		idx = nr_cpu_ids;
+
+	if (idx > RPS_MAX_QID)
+		return -EINVAL;
+
+	mutex_lock(&sock_flow_mutex);
+
+	old = rps_max_num_queues;
+	rps_max_num_queues = idx;
+
+	/* No need to reallocate table since nothing is changing */
+
+	if (roundup_pow_of_two(old) != roundup_pow_of_two(idx)) {
+		struct rps_sock_flow_table *sock_table;
+
+		sock_table = rcu_dereference_protected(rps_sock_flow_table,
+						       lockdep_is_held(&sock_flow_mutex));
+		size = sock_table ? sock_table->mask + 1 : 0;
+
+		/* Force creation of a new rps_sock_flow_table. It's
+		 * the same size as the existing table, but we expunge
+		 * any stale queue entries that would refer to the old
+		 * queue mask.
+		 */
+		ret = rps_create_sock_flow_table(size, size,
+						 sock_table, true);
+		if (ret)
+			rps_max_num_queues = old;
+	}
+
+	mutex_unlock(&sock_flow_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL(__rps_check_max_queues);
+
 #endif /* CONFIG_RPS */
 
 #ifdef CONFIG_NET_FLOW_LIMIT
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 07/11] net: Introduce global queues
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
                   ` (5 preceding siblings ...)
  2020-06-24 17:17 ` [RFC PATCH 06/11] net: Function to check against maximum number for RPS queues Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 23:00   ` kernel test robot
                     ` (3 more replies)
  2020-06-24 17:17 ` [RFC PATCH 08/11] ptq: Per Thread Queues Tom Herbert
                   ` (3 subsequent siblings)
  10 siblings, 4 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Global queues, or gqids, are an abstract representation of NIC
device queues. They are global in the sense that each gqid
can be mapped to a queue in each device, i.e. if there are multiple
devices in the system, a gqid can map to a different queue, a dqid,
in each device in a one-to-many mapping. gqids are used for
configuring packet steering on both send and receive in a generic
way not bound to a particular device.

Each transmit or receive device queue may be reverse mapped to
one gqid. Each device maintains a table mapping gqids to local
device queues; those tables are used in the data path to convert
a gqid for a receive or transmit queue into a device queue relative
to the sending or receiving device.

Changes in the patch:
	- Add a simple index to netdev_queue and netdev_rx_queue.
	  This serves as the dqid (it's just the index in the
	  receive or transmit queue array for the device)
	- Add gqid to netdev_queue and netdev_rx_queue. This is the
	  mapping of a device queue to a gqid. If gqid is NO_QUEUE
	  then the queue is unmapped
	- The per device gqid to dqid maps are maintained in an
	  array of netdev_queue_map structures in a net_device for
	  both transmit and receive
	- Functions that return a dqid where the input is a gqid and
	  a net_device
	- Sysfs to set device queue mappings via the
	  global_queue_mapping attribute of the sysfs rx- and tx-
	  queue directories
	- Create the per device gqid to dqid maps in the sysfs function
	  (an illustrative lookup sketch follows this list)
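
On the data path, the mapping is resolved per device. Below is a hedged
sketch of the receive-side lookup using the helpers added by this patch
(the function name is hypothetical; the surrounding steering logic lands
later in the series and is not shown):

	/* Illustrative only: resolve a global receive queue (gqid) to
	 * this device's queue index. Returns NO_QUEUE when there is no
	 * usable mapping so the caller can fall back to normal RPS/RFS.
	 */
	static u16 ptq_resolve_rx_queue(const struct net_device *dev, u16 gqid)
	{
		u16 dqid = netdev_rx_gqid_to_dqid(dev, gqid);

		if (dqid >= dev->real_num_rx_queues)
			return NO_QUEUE;

		return dqid;
	}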
---
 include/linux/netdevice.h |  75 ++++++++++++++
 net/core/dev.c            |  20 +++-
 net/core/net-sysfs.c      | 199 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 290 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 48ba1c1fc644..ca163925211a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -606,6 +606,10 @@ struct netdev_queue {
 #endif
 #if defined(CONFIG_XPS) && defined(CONFIG_NUMA)
 	int			numa_node;
+#endif
+#ifdef CONFIG_RPS
+	u16			index;
+	u16			gqid;
 #endif
 	unsigned long		tx_maxrate;
 	/*
@@ -823,6 +827,8 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
 #ifdef CONFIG_RPS
+	u16			index;
+	u16			gqid;
 	struct rps_map __rcu		*rps_map;
 	struct rps_dev_flow_table __rcu	*rps_flow_table;
 #endif
@@ -875,6 +881,25 @@ struct xps_dev_maps {
 
 #endif /* CONFIG_XPS */
 
+#ifdef CONFIG_RPS
+/* Structure to map a global queue to a device queue */
+struct netdev_queue_map {
+	struct rcu_head rcu;
+	unsigned int max_ents;
+	unsigned int set_count;
+	u16 map[0];
+};
+
+/* Allocate queue map in blocks to avoid thrashing */
+#define QUEUE_MAP_ALLOC_BLOCK 128
+
+#define QUEUE_MAP_ALLOC_NUMBER(_num)					\
+	((((_num - 1) / QUEUE_MAP_ALLOC_BLOCK) + 1) * QUEUE_MAP_ALLOC_BLOCK)
+
+#define QUEUE_MAP_ALLOC_SIZE(_num) (sizeof(struct netdev_queue_map) +	\
+	(_num) * sizeof(u16))
+#endif /* CONFIG_RPS */
+
 #define TC_MAX_QUEUE	16
 #define TC_BITMASK	15
 /* HW offloaded queuing disciplines txq count and offset maps */
@@ -2092,6 +2117,10 @@ struct net_device {
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
+#ifdef CONFIG_RPS
+	struct netdev_queue_map __rcu *rx_gqueue_map;
+#endif
+
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_ingress;
 #endif
@@ -2122,6 +2151,9 @@ struct net_device {
 	struct xps_dev_maps __rcu *xps_cpus_map;
 	struct xps_dev_maps __rcu *xps_rxqs_map;
 #endif
+#ifdef CONFIG_RPS
+	struct netdev_queue_map __rcu *tx_gqueue_map;
+#endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
 #endif
@@ -2218,6 +2250,36 @@ struct net_device {
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
+#ifdef CONFIG_RPS
+static inline u16 netdev_gqid_to_dqid(const struct netdev_queue_map *map,
+				      u16 gqid)
+{
+	return (map && gqid < map->max_ents) ? map->map[gqid] : NO_QUEUE;
+}
+
+static inline u16 netdev_tx_gqid_to_dqid(const struct net_device *dev, u16 gqid)
+{
+	u16 dqid;
+
+	rcu_read_lock();
+	dqid = netdev_gqid_to_dqid(rcu_dereference(dev->tx_gqueue_map), gqid);
+	rcu_read_unlock();
+
+	return dqid;
+}
+
+static inline u16 netdev_rx_gqid_to_dqid(const struct net_device *dev, u16 gqid)
+{
+	u16 dqid;
+
+	rcu_read_lock();
+	dqid = netdev_gqid_to_dqid(rcu_dereference(dev->rx_gqueue_map), gqid);
+	rcu_read_unlock();
+
+	return dqid;
+}
+#endif
+
 static inline bool netif_elide_gro(const struct net_device *dev)
 {
 	if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
@@ -2290,6 +2352,19 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
 		f(dev, &dev->_tx[i], arg);
 }
 
+static inline void netdev_for_each_tx_queue_index(struct net_device *dev,
+						  void (*f)(struct net_device *,
+							    struct netdev_queue *,
+							    unsigned int index,
+							    void *),
+						  void *arg)
+{
+	unsigned int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++)
+		f(dev, &dev->_tx[i], i, arg);
+}
+
 #define netdev_lockdep_set_classes(dev)				\
 {								\
 	static struct lock_class_key qdisc_tx_busylock_key;	\
diff --git a/net/core/dev.c b/net/core/dev.c
index 946940bdd583..f64bf6608775 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9331,6 +9331,10 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	for (i = 0; i < count; i++) {
 		rx[i].dev = dev;
+#ifdef CONFIG_RPS
+		rx[i].index = i;
+		rx[i].gqid = NO_QUEUE;
+#endif
 
 		/* XDP RX-queue setup */
 		err = xdp_rxq_info_reg(&rx[i].xdp_rxq, dev, i);
@@ -9363,7 +9367,8 @@ static void netif_free_rx_queues(struct net_device *dev)
 }
 
 static void netdev_init_one_queue(struct net_device *dev,
-				  struct netdev_queue *queue, void *_unused)
+				  struct netdev_queue *queue,
+				  unsigned int index, void *_unused)
 {
 	/* Initialize queue lock */
 	spin_lock_init(&queue->_xmit_lock);
@@ -9371,6 +9376,10 @@ static void netdev_init_one_queue(struct net_device *dev,
 	queue->xmit_lock_owner = -1;
 	netdev_queue_numa_node_write(queue, NUMA_NO_NODE);
 	queue->dev = dev;
+#ifdef CONFIG_RPS
+	queue->index = index;
+	queue->gqid = NO_QUEUE;
+#endif
 #ifdef CONFIG_BQL
 	dql_init(&queue->dql, HZ);
 #endif
@@ -9396,7 +9405,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 
 	dev->_tx = tx;
 
-	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
+	netdev_for_each_tx_queue_index(dev, netdev_init_one_queue, NULL);
 	spin_lock_init(&dev->tx_global_lock);
 
 	return 0;
@@ -9884,7 +9893,7 @@ struct netdev_queue *dev_ingress_queue_create(struct net_device *dev)
 	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
 	if (!queue)
 		return NULL;
-	netdev_init_one_queue(dev, queue, NULL);
+	netdev_init_one_queue(dev, queue, 0, NULL);
 	RCU_INIT_POINTER(queue->qdisc, &noop_qdisc);
 	queue->qdisc_sleeping = &noop_qdisc;
 	rcu_assign_pointer(dev->ingress_queue, queue);
@@ -10041,6 +10050,11 @@ void free_netdev(struct net_device *dev)
 {
 	struct napi_struct *p, *n;
 
+#ifdef CONFIG_RPS
+	WARN_ON(rcu_dereference_protected(dev->tx_gqueue_map, 1));
+	WARN_ON(rcu_dereference_protected(dev->rx_gqueue_map, 1));
+#endif
+
 	might_sleep();
 	netif_free_tx_queues(dev);
 	netif_free_rx_queues(dev);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 56d27463d466..3a9d3d9ee8e0 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -875,18 +875,166 @@ static ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
 	return len;
 }
 
+static void queue_map_release(struct rcu_head *rcu)
+{
+	struct netdev_queue_map *q_map = container_of(rcu,
+	    struct netdev_queue_map, rcu);
+	vfree(q_map);
+}
+
+static int set_device_queue_mapping(struct netdev_queue_map **pmap,
+				    u16 gqid, u16 dqid, u16 *p_gqid)
+{
+	static DEFINE_MUTEX(global_mapping_table);
+	struct netdev_queue_map *gq_map, *old_gq_map;
+	u16 old_gqid;
+	int ret = 0;
+
+	mutex_lock(&global_mapping_table);
+
+	old_gqid = *p_gqid;
+	if (old_gqid == gqid) {
+		/* Nothing changing */
+		goto out;
+	}
+
+	gq_map = rcu_dereference_protected(*pmap,
+					   lockdep_is_held(&global_mapping_table));
+	old_gq_map = gq_map;
+
+	if (gqid == NO_QUEUE) {
+		/* Remove any old mapping (we know that old_gqid cannot be
+		 * NO_QUEUE from above)
+		 */
+		if (!WARN_ON(!gq_map || old_gqid > gq_map->max_ents ||
+			     gq_map->map[old_gqid] != dqid)) {
+			/* Unset old mapping */
+			gq_map->map[old_gqid] = NO_QUEUE;
+			if (--gq_map->set_count == 0) {
+				/* Done with map so free */
+				rcu_assign_pointer(*pmap, NULL);
+				call_rcu(&gq_map->rcu, queue_map_release);
+			}
+		}
+		*p_gqid = NO_QUEUE;
+
+		goto out;
+	}
+
+	if (!gq_map || gqid >= gq_map->max_ents) {
+		unsigned int max_queues;
+		int i = 0;
+
+		/* Need to create or expand queue map */
+
+		max_queues = QUEUE_MAP_ALLOC_NUMBER(gqid + 1);
+
+		gq_map = vmalloc(QUEUE_MAP_ALLOC_SIZE(max_queues));
+		if (!gq_map) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		gq_map->max_ents = max_queues;
+
+		if (old_gq_map) {
+			/* Copy old map entries */
+
+			memcpy(gq_map->map, old_gq_map->map,
+			       old_gq_map->max_ents * sizeof(gq_map->map[0]));
+			gq_map->set_count = old_gq_map->set_count;
+			i = old_gq_map->max_ents;
+		} else {
+			gq_map->set_count = 0;
+		}
+
+		/* Initialize entries not copied from old map */
+		for (; i < max_queues; i++)
+			gq_map->map[i] = NO_QUEUE;
+	} else if (gq_map->map[gqid] != NO_QUEUE) {
+		/* The global qid is already mapped to another device qid */
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* Set map entry */
+	gq_map->map[gqid] = dqid;
+	gq_map->set_count++;
+
+	if (old_gqid != NO_QUEUE) {
+		/* We know old_gqid is not equal to gqid */
+		if (!WARN_ON(!old_gq_map ||
+			     old_gqid > old_gq_map->max_ents ||
+			     old_gq_map->map[old_gqid] != dqid)) {
+			/* Unset old mapping in (new) table */
+			gq_map->map[old_gqid] = NO_QUEUE;
+			gq_map->set_count--;
+		}
+	}
+
+	if (gq_map != old_gq_map) {
+		rcu_assign_pointer(*pmap, gq_map);
+		if (old_gq_map)
+			call_rcu(&old_gq_map->rcu, queue_map_release);
+	}
+
+	/* Save for caller */
+	*p_gqid = gqid;
+
+out:
+	mutex_unlock(&global_mapping_table);
+
+	return ret;
+}
+
+static ssize_t show_rx_queue_global_mapping(struct netdev_rx_queue *queue,
+					    char *buf)
+{
+	u16 gqid = queue->gqid;
+
+	if (gqid == NO_QUEUE)
+		return sprintf(buf, "none\n");
+	else
+		return sprintf(buf, "%u\n", gqid);
+}
+
+static ssize_t store_rx_queue_global_mapping(struct netdev_rx_queue *queue,
+					     const char *buf, size_t len)
+{
+	unsigned long gqid;
+	int ret;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	ret = kstrtoul(buf, 0, &gqid);
+	if (ret < 0)
+		return ret;
+
+	if (gqid > RPS_MAX_QID || WARN_ON(queue->index > RPS_MAX_QID))
+		return -EINVAL;
+
+	ret = set_device_queue_mapping(&queue->dev->rx_gqueue_map,
+				       gqid, queue->index, &queue->gqid);
+	return ret ? : len;
+}
+
 static struct rx_queue_attribute rps_cpus_attribute __ro_after_init
 	= __ATTR(rps_cpus, 0644, show_rps_map, store_rps_map);
 
 static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute __ro_after_init
 	= __ATTR(rps_flow_cnt, 0644,
 		 show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+static struct rx_queue_attribute rx_queue_global_mapping_attribute __ro_after_init =
+	__ATTR(global_queue_mapping, 0644,
+	       show_rx_queue_global_mapping, store_rx_queue_global_mapping);
 #endif /* CONFIG_RPS */
 
 static struct attribute *rx_queue_default_attrs[] __ro_after_init = {
 #ifdef CONFIG_RPS
 	&rps_cpus_attribute.attr,
 	&rps_dev_flow_table_cnt_attribute.attr,
+	&rx_queue_global_mapping_attribute.attr,
 #endif
 	NULL
 };
@@ -896,8 +1044,11 @@ static void rx_queue_release(struct kobject *kobj)
 {
 	struct netdev_rx_queue *queue = to_rx_queue(kobj);
 #ifdef CONFIG_RPS
-	struct rps_map *map;
 	struct rps_dev_flow_table *flow_table;
+	struct rps_map *map;
+
+	set_device_queue_mapping(&queue->dev->rx_gqueue_map, NO_QUEUE,
+				 queue->index, &queue->gqid);
 
 	map = rcu_dereference_protected(queue->rps_map, 1);
 	if (map) {
@@ -1152,6 +1303,46 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 				 sprintf(buf, "%u\n", tc);
 }
 
+#ifdef CONFIG_RPS
+static ssize_t show_queue_global_queue_mapping(struct netdev_queue *queue,
+					       char *buf)
+{
+	u16 gqid = queue->gqid;
+
+	if (gqid == NO_QUEUE)
+		return sprintf(buf, "none\n");
+	else
+		return sprintf(buf, "%u\n", gqid);
+	return 0;
+}
+
+static ssize_t store_queue_global_queue_mapping(struct netdev_queue *queue,
+						const char *buf, size_t len)
+{
+	unsigned long gqid;
+	int ret;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	ret = kstrtoul(buf, 0, &gqid);
+	if (ret < 0)
+		return ret;
+
+	if (gqid > RPS_MAX_QID || WARN_ON(queue->index > RPS_MAX_QID))
+		return -EINVAL;
+
+	ret = set_device_queue_mapping(&queue->dev->tx_gqueue_map,
+				       gqid, queue->index, &queue->gqid);
+	return ret ? : len;
+}
+
+static struct netdev_queue_attribute global_queue_mapping_attribute __ro_after_init =
+	__ATTR(global_queue_mapping, 0644,
+	       show_queue_global_queue_mapping,
+	       store_queue_global_queue_mapping);
+#endif
+
 #ifdef CONFIG_XPS
 static ssize_t tx_maxrate_show(struct netdev_queue *queue,
 			       char *buf)
@@ -1483,6 +1674,9 @@ static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
 static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
 	&queue_trans_timeout.attr,
 	&queue_traffic_class.attr,
+#ifdef CONFIG_RPS
+	&global_queue_mapping_attribute.attr,
+#endif
 #ifdef CONFIG_XPS
 	&xps_cpus_attribute.attr,
 	&xps_rxqs_attribute.attr,
@@ -1496,6 +1690,9 @@ static void netdev_queue_release(struct kobject *kobj)
 {
 	struct netdev_queue *queue = to_netdev_queue(kobj);
 
+	set_device_queue_mapping(&queue->dev->tx_gqueue_map, NO_QUEUE,
+				 queue->index, &queue->gqid);
+
 	memset(kobj, 0, sizeof(*kobj));
 	dev_put(queue->dev);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 08/11] ptq: Per Thread Queues
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
                   ` (6 preceding siblings ...)
  2020-06-24 17:17 ` [RFC PATCH 07/11] net: Introduce global queues Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 21:20   ` kernel test robot
                     ` (2 more replies)
  2020-06-24 17:17 ` [RFC PATCH 09/11] ptq: Hook up transmit side of Per Queue Threads Tom Herbert
                   ` (2 subsequent siblings)
  10 siblings, 3 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Per Thread Queues allows assigning a transmit and receive queue to each
thread via a cgroup controller. These queues are "global queues" that
are mapped to a real device queue when the sending or receiving device
is known.

Patch includes:
	- Create the net_queues cgroup controller
	- The cgroup controller includes attributes to set a transmit
	  and receive queue range, unique assignment attributes, and a
	  symmetric assignment attribute. Also, a read-only
	  attribute that lists the tasks of the cgroup and their
	  assigned queues
	- Add a net_queue_pair to task_struct
	- Make ptq_cgroup_queue_desc which defines an index range by
	  a base index and the length of the range
	- Assign queues to tasks when they attach to the cgroup.
	  For each of receive and transmit, a queue is selected
	  from the respective range configured in the cgroup. If the
	  "assign" attribute is set for receive or transmit, a
	  unique queue (one not previously assigned to another task)
	  is chosen (a sketch of this step follows the list). If the
	  "symmetric" attribute is set then the receive and transmit
	  queues are selected to be the same number. If there are no
	  queues available (e.g. the assign attribute is set for
	  receive and all the queues in the receive range are already
	  assigned) then assignment silently fails.
	- The assigned transmit and receive queues are set in the
	  net_queue_pair structure of the task_struct
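
A hedged sketch of the unique-assignment step referenced above, based
on the ptq_cgroup_queue_desc layout added in this patch (the function
name is hypothetical; the real code in net/core/ptq.c also handles the
symmetric and non-unique cases):

	/* Illustrative only: pick a free global queue from a
	 * descriptor's range when the "assign" (unique) attribute is
	 * set. Returns the chosen gqid or NO_QUEUE if the range is
	 * exhausted, in which case assignment silently fails.
	 */
	static u16 ptq_assign_unique(struct ptq_cgroup_queue_desc *desc)
	{
		unsigned long idx;

		idx = find_first_zero_bit(desc->alloced, desc->num);
		if (idx >= desc->num)
			return NO_QUEUE;

		set_bit(idx, desc->alloced);
		return desc->base + idx;
	}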
---
 include/linux/cgroup_subsys.h |   4 +
 include/linux/sched.h         |   4 +
 include/net/ptq.h             |  45 +++
 kernel/fork.c                 |   4 +
 net/Kconfig                   |  18 +
 net/core/Makefile             |   1 +
 net/core/ptq.c                | 688 ++++++++++++++++++++++++++++++++++
 7 files changed, 764 insertions(+)
 create mode 100644 include/net/ptq.h
 create mode 100644 net/core/ptq.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcff3b4..9f80cde69890 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -49,6 +49,10 @@ SUBSYS(perf_event)
 SUBSYS(net_prio)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_NET_QUEUES)
+SUBSYS(net_queues)
+#endif
+
 #if IS_ENABLED(CONFIG_CGROUP_HUGETLB)
 SUBSYS(hugetlb)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b62e6aaf28f0..97cb8288faca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -32,6 +32,7 @@
 #include <linux/posix-timers.h>
 #include <linux/rseq.h>
 #include <linux/kcsan.h>
+#include <linux/netqueue.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -1313,6 +1314,9 @@ struct task_struct {
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
 #endif
+#ifdef CONFIG_PER_THREAD_QUEUES
+	struct net_queue_pair		ptq_queues;
+#endif
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/include/net/ptq.h b/include/net/ptq.h
new file mode 100644
index 000000000000..a8ce39a85136
--- /dev/null
+++ b/include/net/ptq.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Per thread queues
+ *
+ * Copyright (c) 2020 Tom Herbert <tom@herbertland.com>
+ */
+
+#ifndef _NET_PTQ_H
+#define _NET_PTQ_H
+
+#include <linux/cgroup.h>
+#include <linux/netqueue.h>
+
+struct ptq_cgroup_queue_desc {
+	struct rcu_head rcu;
+
+	unsigned short base;
+	unsigned short num;
+	unsigned long alloced[0];
+};
+
+struct ptq_css {
+	struct cgroup_subsys_state css;
+
+	struct ptq_cgroup_queue_desc __rcu *txqs;
+	struct ptq_cgroup_queue_desc __rcu *rxqs;
+
+	unsigned short flags;
+#define PTQ_F_RX_ASSIGN		BIT(0)
+#define PTQ_F_TX_ASSIGN		BIT(1)
+#define PTQ_F_SYMMETRIC		BIT(2)
+};
+
+static inline struct ptq_css *css_to_ptq_css(struct cgroup_subsys_state *css)
+{
+	return (struct ptq_css *)css;
+}
+
+static inline struct ptq_cgroup_queue_desc **pcqd_select_desc(
+		struct ptq_css *pss, bool doing_tx)
+{
+	return doing_tx ? &pss->txqs : &pss->rxqs;
+}
+
+#endif /* _NET_PTQ_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..5d604e778f4d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -958,6 +958,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
 #endif
+
+#ifdef CONFIG_PER_THREAD_QUEUES
+	init_net_queue_pair(&tsk->ptq_queues);
+#endif
 	return tsk;
 
 free_stack:
diff --git a/net/Kconfig b/net/Kconfig
index d1672280d6a4..fd2d1da89cb9 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -256,6 +256,24 @@ config RFS_ACCEL
 	select CPU_RMAP
 	default y
 
+config CGROUP_NET_QUEUES
+	depends on PER_THREAD_QUEUES
+	depends on CGROUPS
+	bool
+
+config PER_THREAD_QUEUES
+	bool "Per thread queues"
+	depends on RPS
+	depends on RFS_ACCEL
+	select CGROUP_NET_QUEUES
+	default y
+	help
+	  Assign network hardware queues to tasks. This creates a
+	  cgroup subsys net_queues that allows associating a hardware
+	  transmit queue and a receive queue with a thread. The interface
+	  specifies a range of queues for each side from which queues
+	  are assigned to each task.
+
 config XPS
 	bool
 	depends on SMP
diff --git a/net/core/Makefile b/net/core/Makefile
index 3e2c378e5f31..156a152e2b0a 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -35,3 +35,4 @@ obj-$(CONFIG_NET_DEVLINK) += devlink.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
 obj-$(CONFIG_FAILOVER) += failover.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
+obj-$(CONFIG_PER_THREAD_QUEUES) += ptq.o
diff --git a/net/core/ptq.c b/net/core/ptq.c
new file mode 100644
index 000000000000..edf6718e0a71
--- /dev/null
+++ b/net/core/ptq.c
@@ -0,0 +1,688 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* net/core/ptq.c
+ *
+ * Copyright (c) 2020 Tom Herbert
+ */
+#include <linux/bitmap.h>
+#include <linux/mutex.h>
+#include <linux/netdevice.h>
+#include <linux/types.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <net/ptq.h>
+
+struct ptq_cgroup_queue_desc null_pcdesc;
+
+static DEFINE_MUTEX(ptq_mutex);
+
+#define NETPTQ_ID_MAX	USHRT_MAX
+
+/* Check if a queue identifier is in the range of a descriptor */
+
+static inline bool idx_in_range(struct ptq_cgroup_queue_desc *pcdesc,
+				unsigned short idx)
+{
+	return (idx >= pcdesc->base && idx < (pcdesc->base + pcdesc->num));
+}
+
+/* Mutex held */
+static int assign_one(struct ptq_cgroup_queue_desc *pcdesc,
+		      bool assign, unsigned short requested_idx)
+{
+	unsigned short idx;
+
+	if (!pcdesc->num)
+		return NO_QUEUE;
+
+	if (idx_in_range(pcdesc, requested_idx)) {
+		/* Try to use requested queue id */
+
+		if (assign) {
+			idx = requested_idx - pcdesc->base;
+			if (!test_bit(idx, pcdesc->alloced)) {
+				set_bit(idx, pcdesc->alloced);
+				return requested_idx;
+			}
+		} else {
+			return requested_idx;
+		}
+	}
+
+	/* Need new queue id */
+
+	if (assign)  {
+		idx = find_first_zero_bit(pcdesc->alloced, pcdesc->num);
+		if (idx >= pcdesc->num)
+			return -EBUSY;
+		set_bit(idx, pcdesc->alloced);
+		return pcdesc->base + idx;
+	}
+
+	/* Checked for zero range above */
+	return pcdesc->base + (get_random_u32() % pcdesc->num);
+}
+
+/* Compute the overlap between two queue ranges. Indicate
+ * the overlap by returning the relative offsets for the two
+ * queue descriptors where the overlap starts and the
+ * length of the overlap region.
+ */
+static inline unsigned short
+make_relative_idxs(struct ptq_cgroup_queue_desc *pcdesc0,
+		   struct ptq_cgroup_queue_desc *pcdesc1,
+		   unsigned short *rel_idx0,
+		   unsigned short *rel_idx1)
+{
+	if (pcdesc0->base + pcdesc0->num <= pcdesc1->base ||
+	    pcdesc1->base + pcdesc1->num <= pcdesc0->base) {
+		/* No overlap */
+		return 0;
+	}
+
+	if (pcdesc0->base >= pcdesc1->base) {
+		*rel_idx0 = 0;
+		*rel_idx1 = pcdesc0->base - pcdesc1->base;
+	} else {
+		*rel_idx0 = pcdesc1->base - pcdesc0->base;
+		*rel_idx1 = 0;
+	}
+
+	return min_t(unsigned short, pcdesc0->num - *rel_idx0,
+		     pcdesc1->num - *rel_idx1);
+}
+
+/* Mutex held */
+static int assign_symmetric(struct ptq_css *pss,
+			    struct ptq_cgroup_queue_desc *tx_pcdesc,
+			    struct ptq_cgroup_queue_desc *rx_pcdesc,
+			    unsigned short requested_idx1,
+			    unsigned short requested_idx2)
+{
+	unsigned short base_tidx, base_ridx, overlap;
+	unsigned short tidx, ridx, num_tx, num_rx;
+	unsigned int requested_idx = NO_QUEUE;
+	int ret;
+
+	if (idx_in_range(tx_pcdesc, requested_idx1) &&
+	    idx_in_range(rx_pcdesc, requested_idx1))
+		requested_idx = requested_idx1;
+	else if (idx_in_range(tx_pcdesc, requested_idx2) &&
+		 idx_in_range(rx_pcdesc, requested_idx2))
+		requested_idx = requested_idx2;
+
+	if (requested_idx != NO_QUEUE) {
+		unsigned short tidx = requested_idx - tx_pcdesc->base;
+		unsigned short ridx = requested_idx - rx_pcdesc->base;
+
+		/* Try to use requested queue id */
+
+		ret = requested_idx; /* Be optimistic */
+
+		if ((pss->flags & (PTQ_F_TX_ASSIGN | PTQ_F_RX_ASSIGN)) ==
+		    (PTQ_F_TX_ASSIGN | PTQ_F_RX_ASSIGN)) {
+			if (!test_bit(tidx, tx_pcdesc->alloced) &&
+			    !test_bit(ridx, rx_pcdesc->alloced)) {
+				set_bit(tidx, tx_pcdesc->alloced);
+				set_bit(ridx, rx_pcdesc->alloced);
+
+				goto out;
+			}
+		} else if (pss->flags & PTQ_F_TX_ASSIGN) {
+			if (!test_bit(tidx, tx_pcdesc->alloced)) {
+				set_bit(tidx, tx_pcdesc->alloced);
+
+				goto out;
+			}
+		} else if (pss->flags & PTQ_F_RX_ASSIGN) {
+			if (!test_bit(ridx, rx_pcdesc->alloced)) {
+				set_bit(ridx, rx_pcdesc->alloced);
+
+				goto out;
+			}
+		} else {
+			goto out;
+		}
+	}
+
+	/* Need new queue id */
+
+	overlap = make_relative_idxs(tx_pcdesc, rx_pcdesc, &base_tidx,
+				     &base_ridx);
+	if (!overlap) {
+		/* No overlap in ranges */
+		ret = -ERANGE;
+		goto out;
+	}
+
+	num_tx = base_tidx + overlap;
+	num_rx = base_ridx + overlap;
+
+	ret = -EBUSY;
+
+	if ((pss->flags & (PTQ_F_TX_ASSIGN | PTQ_F_RX_ASSIGN)) ==
+	    (PTQ_F_TX_ASSIGN | PTQ_F_RX_ASSIGN)) {
+		/* Both sides need to be assigned, find common cleared
+		 * bit in respective bitmaps
+		 */
+		for (tidx = base_tidx;
+		     (tidx = find_next_zero_bit(tx_pcdesc->alloced,
+						num_tx, tidx)) < num_tx;
+		     tidx++) {
+			ridx = base_ridx + (tidx - base_tidx);
+			if (!test_bit(ridx, rx_pcdesc->alloced))
+				break;
+		}
+		if (tidx < num_tx) {
+			/* Found symmetric queue index that is unassigned
+			 * for both transmit and receive
+			 */
+
+			set_bit(tidx, tx_pcdesc->alloced);
+			set_bit(ridx, rx_pcdesc->alloced);
+			ret = tx_pcdesc->base + tidx;
+		}
+	} else if (pss->flags & PTQ_F_TX_ASSIGN) {
+		tidx = find_next_zero_bit(tx_pcdesc->alloced,
+					  num_tx, base_tidx);
+		if (tidx < num_tx) {
+			set_bit(tidx, tx_pcdesc->alloced);
+			ret = tx_pcdesc->base + tidx;
+		}
+	} else if (pss->flags & PTQ_F_RX_ASSIGN) {
+		ridx = find_next_zero_bit(rx_pcdesc->alloced,
+					  num_rx, base_ridx);
+		if (ridx < num_rx) {
+			set_bit(ridx, rx_pcdesc->alloced);
+			ret = rx_pcdesc->base + ridx;
+		}
+	} else {
+		/* Overlap can't be zero from check above */
+		ret = tx_pcdesc->base + base_tidx +
+		    (get_random_u32() % overlap);
+	}
+out:
+	return ret;
+}
+
+/* Mutex held */
+static int assign_queues(struct ptq_css *pss, struct task_struct *task)
+{
+	struct ptq_cgroup_queue_desc *tx_pcdesc, *rx_pcdesc;
+	unsigned short txq_id = NO_QUEUE, rxq_id = NO_QUEUE;
+	struct net_queue_pair *qpair = &task->ptq_queues;
+	int ret, ret2;
+
+	tx_pcdesc = rcu_dereference_protected(pss->txqs,
+					      mutex_is_locked(&ptq_mutex));
+	rx_pcdesc = rcu_dereference_protected(pss->rxqs,
+					      mutex_is_locked(&ptq_mutex));
+
+	if (pss->flags & PTQ_F_SYMMETRIC) {
+		/* Assigning symmetric queues. The requested identifier is
+		 * taken from the existing queue pair, for the side (TX or RX)
+		 * being tracked based on the assign flags.
+		 */
+		ret =  assign_symmetric(pss, tx_pcdesc, rx_pcdesc,
+					qpair->txq_id, qpair->rxq_id);
+		if (ret >= 0) {
+			txq_id = ret;
+			rxq_id = ret;
+			ret = 0;
+		}
+	} else {
+		/* Not doing symmetric assignment. Assign transmit and
+		 * receive queues independently.
+		 */
+		ret = assign_one(tx_pcdesc, pss->flags & PTQ_F_TX_ASSIGN,
+				 qpair->txq_id);
+		if (ret >= 0)
+			txq_id = ret;
+
+		ret2 = assign_one(rx_pcdesc, pss->flags & PTQ_F_RX_ASSIGN,
+				  qpair->rxq_id);
+		if (ret2 >= 0)
+			rxq_id = ret2;
+
+		/* Return an error if either assignment failed. Note that the
+		 * assignment for one side may succeed while the other fails.
+		 */
+		if (ret2 < 0)
+			ret = ret2;
+		else if (ret >= 0)
+			ret = 0;
+	}
+
+	qpair->txq_id = txq_id;
+	qpair->rxq_id = rxq_id;
+
+	return ret;
+}
+
+/* Mutex held */
+static void unassign_one(struct ptq_cgroup_queue_desc *pcdesc,
+			 unsigned short idx, bool assign)
+{
+	if (!pcdesc->num) {
+		WARN_ON(idx != NO_QUEUE);
+		return;
+	}
+	if (!assign || WARN_ON(!idx_in_range(pcdesc, idx)))
+		return;
+
+	idx -= pcdesc->base;
+	clear_bit(idx, pcdesc->alloced);
+}
+
+/* Mutex held */
+static void unassign_queues(struct ptq_css *pss, struct task_struct *task)
+{
+	struct ptq_cgroup_queue_desc *tx_pcdesc, *rx_pcdesc;
+	struct net_queue_pair *qpair = &task->ptq_queues;
+
+	tx_pcdesc = rcu_dereference_protected(pss->txqs,
+					      mutex_is_locked(&ptq_mutex));
+	rx_pcdesc = rcu_dereference_protected(pss->rxqs,
+					      mutex_is_locked(&ptq_mutex));
+
+	unassign_one(tx_pcdesc, qpair->txq_id, pss->flags & PTQ_F_TX_ASSIGN);
+	unassign_one(rx_pcdesc, qpair->rxq_id, pss->flags & PTQ_F_RX_ASSIGN);
+
+	init_net_queue_pair(qpair);
+}
+
+/* Mutex held */
+static void reassign_queues_all(struct ptq_css *pss)
+{
+	struct ptq_cgroup_queue_desc *tx_pcdesc, *rx_pcdesc;
+	struct task_struct *task;
+	struct css_task_iter it;
+
+	tx_pcdesc = rcu_dereference_protected(pss->txqs,
+					      mutex_is_locked(&ptq_mutex));
+	rx_pcdesc = rcu_dereference_protected(pss->rxqs,
+					      mutex_is_locked(&ptq_mutex));
+
+	/* PTQ configuration has changed, attempt to reassign queues for new
+	 * configuration. The assignment functions try to keep threads using
+	 * the same queues as much as possible to avoid thrashing.
+	 */
+
+	/* Clear the bitmaps, we will reconstruct them in the assignments */
+	bitmap_zero(tx_pcdesc->alloced, tx_pcdesc->num);
+	bitmap_zero(rx_pcdesc->alloced, rx_pcdesc->num);
+
+	css_task_iter_start(&pss->css, 0, &it);
+	while ((task = css_task_iter_next(&it)))
+		assign_queues(pss, task);
+	css_task_iter_end(&it);
+}
+
+static struct cgroup_subsys_state *
+cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct ptq_css *pss;
+
+	pss = kzalloc(sizeof(*pss), GFP_KERNEL);
+	if (!pss)
+		return ERR_PTR(-ENOMEM);
+
+	RCU_INIT_POINTER(pss->txqs, &null_pcdesc);
+	RCU_INIT_POINTER(pss->rxqs, &null_pcdesc);
+
+	return &pss->css;
+}
+
+static int cgrp_css_online(struct cgroup_subsys_state *css)
+{
+	struct cgroup_subsys_state *parent_css = css->parent;
+	int ret = 0;
+
+	if (css->id > NETPTQ_ID_MAX)
+		return -ENOSPC;
+
+	if (!parent_css)
+		return 0;
+
+	/* Don't inherit from parent for the time being */
+
+	return ret;
+}
+
+static void cgrp_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css);
+}
+
+static u64 read_ptqidx(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return css->id;
+}
+
+/* Takes mutex */
+static int ptq_can_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *dst_css, *src_css;
+	struct task_struct *task;
+
+	/* Unassign queues for tasks in preparation for attaching the tasks
+	 * to a different css
+	 */
+
+	mutex_lock(&ptq_mutex);
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		src_css = task_css(task, net_queues_cgrp_id);
+		unassign_queues(css_to_ptq_css(src_css), task);
+	}
+
+	mutex_unlock(&ptq_mutex);
+
+	return 0;
+}
+
+/* Takes mutex */
+static void ptq_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct task_struct *task;
+
+	mutex_lock(&ptq_mutex);
+
+	/* Assign queues for tasks in their new css */
+
+	cgroup_taskset_for_each(task, css, tset)
+		assign_queues(css_to_ptq_css(css), task);
+
+	mutex_unlock(&ptq_mutex);
+}
+
+/* Takes mutex */
+static void ptq_cancel_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *dst_css, *src_css;
+	struct task_struct *task;
+
+	mutex_lock(&ptq_mutex);
+
+	/* Attach failed, reassign queues for tasks in their original
+	 * cgroup (previously they were unassigned in can_attach)
+	 */
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		/* Reassign in old cgroup */
+		src_css = task_css(task, net_queues_cgrp_id);
+		assign_queues(css_to_ptq_css(src_css), task);
+	}
+
+	mutex_unlock(&ptq_mutex);
+}
+
+/* Takes mutex */
+static void ptq_fork(struct task_struct *task)
+{
+	struct cgroup_subsys_state *css =
+		task_css(task, net_queues_cgrp_id);
+
+	mutex_lock(&ptq_mutex);
+	assign_queues(css_to_ptq_css(css), task);
+	mutex_unlock(&ptq_mutex);
+}
+
+/* Takes mutex */
+static void ptq_exit(struct task_struct *task)
+{
+	struct cgroup_subsys_state *css =
+		task_css(task, net_queues_cgrp_id);
+
+	mutex_lock(&ptq_mutex);
+	unassign_queues(css_to_ptq_css(css), task);
+	mutex_unlock(&ptq_mutex);
+}
+
+static u64 read_flag(struct cgroup_subsys_state *css, unsigned int flag)
+{
+	return !!(css_to_ptq_css(css)->flags & flag);
+}
+
+/* Takes mutex */
+static int write_flag(struct cgroup_subsys_state *css, unsigned int flag,
+		      u64 val)
+{
+	struct ptq_css *pss = css_to_ptq_css(css);
+	int ret = 0;
+
+	mutex_lock(&ptq_mutex);
+
+	if (val)
+		pss->flags |= flag;
+	else
+		pss->flags &= ~flag;
+
+	/* If we've changed a flag that affects how queues are assigned then
+	 * reassign the queues.
+	 */
+	if (flag & (PTQ_F_TX_ASSIGN | PTQ_F_RX_ASSIGN | PTQ_F_SYMMETRIC))
+		reassign_queues_all(pss);
+
+	mutex_unlock(&ptq_mutex);
+
+	return ret;
+}
+
+static int show_queue_desc(struct seq_file *sf,
+			   struct ptq_cgroup_queue_desc *pcdesc)
+{
+	seq_printf(sf, "%u:%u\n", pcdesc->base, pcdesc->num);
+
+	return 0;
+}
+
+static int parse_queues(char *buf, unsigned short *base,
+			unsigned short *num)
+{
+	return (sscanf(buf, "%hu:%hu", base, num) != 2) ? -EINVAL : 0;
+}
+
+static void format_queue(char *buf, unsigned short idx)
+{
+	if (idx == NO_QUEUE)
+		sprintf(buf, "none");
+	else
+		sprintf(buf, "%hu", idx);
+}
+
+static int cgroup_procs_show(struct seq_file *sf, void *v)
+{
+	struct net_queue_pair *qpair;
+	struct task_struct *task = v;
+	char buf1[32], buf2[32];
+
+	qpair = &task->ptq_queues;
+	format_queue(buf1, qpair->txq_id);
+	format_queue(buf2, qpair->rxq_id);
+
+	seq_printf(sf, "%d: %s %s\n", task_pid_vnr(v), buf1, buf2);
+	return 0;
+}
+
+#define QDESC_LEN(NUM) (sizeof(struct ptq_cgroup_queue_desc) + \
+			      BITS_TO_LONGS(NUM) * sizeof(unsigned long))
+
+/* Takes mutex */
+static int set_queue_desc(struct ptq_css *pss,
+			  struct ptq_cgroup_queue_desc **pcdescp,
+			  unsigned short base, unsigned short num)
+{
+	struct ptq_cgroup_queue_desc *new_pcdesc = &null_pcdesc, *old_pcdesc;
+	int ret = 0;
+
+	/* Check if RPS maximum queues can accommodate the range */
+	ret = rps_check_max_queues(base + num);
+	if (ret)
+		return ret;
+
+	mutex_lock(&ptq_mutex);
+
+	old_pcdesc = rcu_dereference_protected(*pcdescp,
+					       mutex_is_locked(&ptq_mutex));
+
+	if (old_pcdesc && old_pcdesc->base == base && old_pcdesc->num == num) {
+		/* Nothing to do */
+		goto out;
+	}
+
+	if (num != 0) {
+		new_pcdesc = kzalloc(QDESC_LEN(num), GFP_KERNEL);
+		if (!new_pcdesc) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		new_pcdesc->base = base;
+		new_pcdesc->num = num;
+	}
+	rcu_assign_pointer(*pcdescp, new_pcdesc);
+	if (old_pcdesc != &null_pcdesc)
+		kfree_rcu(old_pcdesc, rcu);
+
+	reassign_queues_all(pss);
+out:
+	mutex_unlock(&ptq_mutex);
+
+	return ret;
+}
+
+static ssize_t write_tx_queues(struct kernfs_open_file *of,
+			       char *buf, size_t nbytes, loff_t off)
+{
+	struct ptq_css *pss = css_to_ptq_css(of_css(of));
+	unsigned short base, num;
+	int ret;
+
+	ret = parse_queues(buf, &base, &num);
+	if (ret < 0)
+		return ret;
+
+	return set_queue_desc(pss, &pss->txqs, base, num) ? : nbytes;
+}
+
+static int read_tx_queues(struct seq_file *sf, void *v)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
+								 txqs));
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static ssize_t write_rx_queues(struct kernfs_open_file *of,
+			       char *buf, size_t nbytes, loff_t off)
+{
+	struct ptq_css *pss = css_to_ptq_css(of_css(of));
+	unsigned short base, num;
+	int ret;
+
+	ret = parse_queues(buf, &base, &num);
+	if (ret < 0)
+		return ret;
+
+	return set_queue_desc(pss, &pss->rxqs, base, num) ? : nbytes;
+}
+
+static int read_rx_queues(struct seq_file *sf, void *v)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
+								 rxqs));
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static u64 read_tx_assign(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return read_flag(css, PTQ_F_TX_ASSIGN);
+}
+
+static int write_tx_assign(struct cgroup_subsys_state *css,
+			   struct cftype *cft, u64 val)
+{
+	return write_flag(css, PTQ_F_TX_ASSIGN, val);
+}
+
+static u64 read_rx_assign(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return read_flag(css, PTQ_F_RX_ASSIGN);
+}
+
+static int write_rx_assign(struct cgroup_subsys_state *css,
+			   struct cftype *cft, u64 val)
+{
+	return write_flag(css, PTQ_F_RX_ASSIGN, val);
+}
+
+static u64 read_symmetric(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return read_flag(css, PTQ_F_SYMMETRIC);
+}
+
+static int write_symmetric(struct cgroup_subsys_state *css,
+			   struct cftype *cft, u64 val)
+{
+	return write_flag(css, PTQ_F_SYMMETRIC, val);
+}
+
+static struct cftype ss_files[] = {
+	{
+		.name = "ptqidx",
+		.read_u64 = read_ptqidx,
+	},
+	{
+		.name = "rx-queues",
+		.seq_show = read_rx_queues,
+		.write = write_rx_queues,
+	},
+	{
+		.name = "tx-queues",
+		.seq_show = read_tx_queues,
+		.write = write_tx_queues,
+	},
+	{
+		.name = "rx-assign",
+		.read_u64 = read_rx_assign,
+		.write_u64 = write_rx_assign,
+	},
+	{
+		.name = "tx-assign",
+		.read_u64 = read_tx_assign,
+		.write_u64 = write_tx_assign,
+	},
+	{
+		.name = "symmetric",
+		.read_u64 = read_symmetric,
+		.write_u64 = write_symmetric,
+	},
+	{
+		.name = "task-queues",
+		.seq_start = cgroup_threads_start,
+		.seq_next = cgroup_procs_next,
+		.seq_show = cgroup_procs_show,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys net_queues_cgrp_subsys = {
+	.css_alloc	= cgrp_css_alloc,
+	.css_online	= cgrp_css_online,
+	.css_free	= cgrp_css_free,
+	.attach		= ptq_attach,
+	.can_attach	= ptq_can_attach,
+	.cancel_attach	= ptq_cancel_attach,
+	.fork		= ptq_fork,
+	.exit		= ptq_exit,
+	.legacy_cftypes	= ss_files,
+};
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 09/11] ptq: Hook up transmit side of Per Queue Threads
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
                   ` (7 preceding siblings ...)
  2020-06-24 17:17 ` [RFC PATCH 08/11] ptq: Per Thread Queues Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 10/11] ptq: Hook up receive " Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 11/11] doc: Documentation for Per Thread Queues Tom Herbert
  10 siblings, 0 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Add support for selecting the transmit device queue based on the per
thread transmit queue.

Patch includes:
	- Add a global queue (gqid) mapping to sock
	- Function to convert gqid in a sock to a device queue (dqid) by
	  calling sk_tx_gqid_to_dqid_get
	- Function sock_record_tx_queue to record a queue in a socket,
	  taken from ptq_queues in struct task_struct
	- Call sock_record_tx_queue from af_inet send, listen, connect,
	  and accept functions to populate the socket's gqid for steering
	- In netdev_pick_tx try to take the queue index from the socket
	  using sk_tx_gqid_to_dqid_get
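
For reference, the intended selection order can be sketched as follows
(a condensed standalone illustration; the helper functions here are
placeholders, not the kernel APIs, and locking/validation are omitted):

	#include <stdio.h>

	/* Placeholder helpers standing in for the kernel functions used
	 * by this patch; the mappings returned here are arbitrary.
	 */
	static int gqid_to_dqid(int gqid) { return gqid == 7 ? 2 : -1; }
	static int xps_queue(void)        { return -1; } /* no XPS match */
	static int hash_queue(void)       { return 0; }  /* hash fallback */

	/* Selection order: per-thread global queue, then XPS, then hash */
	static int pick_tx_queue(int sock_gqid)
	{
		int q = (sock_gqid >= 0) ? gqid_to_dqid(sock_gqid) : -1;

		if (q < 0)
			q = xps_queue();
		if (q < 0)
			q = hash_queue();
		return q;
	}

	int main(void)
	{
		printf("gqid 7  -> txq %d\n", pick_tx_queue(7));  /* mapped */
		printf("no gqid -> txq %d\n", pick_tx_queue(-1)); /* falls back */
		return 0;
	}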
---
 include/net/sock.h | 63 ++++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c     |  9 ++++---
 net/ipv4/af_inet.c |  6 +++++
 3 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index acb76cfaae1b..5ec9d02e7ad0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -140,6 +140,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_tx_gqid_mapping: global tx queue number for sending
  *	@skc_rx_queue_mapping: rx queue number for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
@@ -225,6 +226,9 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	unsigned short		skc_tx_queue_mapping;
+#ifdef CONFIG_RPS
+	unsigned short		skc_tx_gqid_mapping;
+#endif
 #ifdef CONFIG_XPS
 	unsigned short		skc_rx_queue_mapping;
 #endif
@@ -353,6 +357,9 @@ struct sock {
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_RPS
+#define sk_tx_gqid_mapping	__sk_common.skc_tx_gqid_mapping
+#endif
 #ifdef CONFIG_XPS
 #define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
 #endif
@@ -1792,6 +1799,34 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 	return __sk_receive_skb(sk, skb, nested, 1, true);
 }
 
+static inline int sk_tx_gqid_get(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	if (sk && sk->sk_tx_gqid_mapping != NO_QUEUE)
+		return sk->sk_tx_gqid_mapping;
+#endif
+
+	return -1;
+}
+
+static inline void sk_tx_gqid_set(struct sock *sk, int gqid)
+{
+#ifdef CONFIG_RPS
+	/* sk_tx_gqid_mapping accepts only up to RPS_MAX_QID (0x7ffe) */
+	if (WARN_ON_ONCE((unsigned int)gqid > RPS_MAX_QID &&
+			 gqid != NO_QUEUE))
+		return;
+	sk->sk_tx_gqid_mapping = gqid;
+#endif
+}
+
+static inline void sk_tx_gqid_clear(struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	sk->sk_tx_gqid_mapping = NO_QUEUE;
+#endif
+}
+
 static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 {
 	/* sk_tx_queue_mapping accept only upto a 16-bit value */
@@ -1803,6 +1838,9 @@ static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
 	sk->sk_tx_queue_mapping = NO_QUEUE;
+
+	/* Clear tx_gqid at the same points */
+	sk_tx_gqid_clear(sk);
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
@@ -1813,6 +1851,31 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return -1;
 }
 
+static inline int sk_tx_gqid_to_dqid_get(const struct net_device *dev,
+					 const struct sock *sk)
+{
+	int ret = -1;
+#ifdef CONFIG_RPS
+	int gqid;
+	u16 dqid;
+
+	gqid = sk_tx_gqid_get(sk);
+	if (gqid >= 0) {
+		dqid = netdev_tx_gqid_to_dqid(dev, gqid);
+		if (dqid != NO_QUEUE)
+			ret = dqid;
+	}
+#endif
+	return ret;
+}
+
+static inline void sock_record_tx_queue(struct sock *sk)
+{
+#ifdef CONFIG_PER_THREAD_QUEUES
+	sk_tx_gqid_set(sk, current->ptq_queues.txq_id);
+#endif
+}
+
 static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
diff --git a/net/core/dev.c b/net/core/dev.c
index f64bf6608775..f4478c9b1c9c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3982,10 +3982,13 @@ u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
 
 	if (queue_index < 0 || skb->ooo_okay ||
 	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, sb_dev, skb);
+		int new_index = sk_tx_gqid_to_dqid_get(dev, sk);
 
-		if (new_index < 0)
-			new_index = skb_tx_hash(dev, sb_dev, skb);
+		if (new_index < 0) {
+			new_index = get_xps_queue(dev, sb_dev, skb);
+			if (new_index < 0)
+				new_index = skb_tx_hash(dev, sb_dev, skb);
+		}
 
 		if (queue_index != new_index && sk &&
 		    sk_fullsock(sk) &&
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 02aa5cb3a4fd..9b36aa3d1622 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -201,6 +201,8 @@ int inet_listen(struct socket *sock, int backlog)
 
 	lock_sock(sk);
 
+	sock_record_tx_queue(sk);
+
 	err = -EINVAL;
 	if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM)
 		goto out;
@@ -630,6 +632,8 @@ int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 		}
 	}
 
+	sock_record_tx_queue(sk);
+
 	switch (sock->state) {
 	default:
 		err = -EINVAL;
@@ -742,6 +746,7 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags,
 	lock_sock(sk2);
 
 	sock_rps_record_flow(sk2);
+	sock_record_tx_queue(sk2);
 	WARN_ON(!((1 << sk2->sk_state) &
 		  (TCPF_ESTABLISHED | TCPF_SYN_RECV |
 		  TCPF_CLOSE_WAIT | TCPF_CLOSE)));
@@ -794,6 +799,7 @@ EXPORT_SYMBOL(inet_getname);
 int inet_send_prepare(struct sock *sk)
 {
 	sock_rps_record_flow(sk);
+	sock_record_tx_queue(sk);
 
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && !sk->sk_prot->no_autobind &&
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 10/11] ptq: Hook up receive side of Per Queue Threads
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
                   ` (8 preceding siblings ...)
  2020-06-24 17:17 ` [RFC PATCH 09/11] ptq: Hook up transmit side of Per Queue Threads Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-24 17:17 ` [RFC PATCH 11/11] doc: Documentation for Per Thread Queues Tom Herbert
  10 siblings, 0 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Add code to set the queue in an rflow as opposed to just setting the
CPU in an rps_dev_flow entry. set_rps_qid is the analogue of
set_rps_cpu but for setting queues. In get_rps_cpu, a check is
performed to see whether the identifier in the sock_flow_table refers
to a queue; when it does, set_rps_qid is called after converting the
global qid in the sock_flow_table to a device qid.

In rps_record_sock_flow, check if there is a per task receive queue
for current (i.e. current->ptq_queues.rxq_id != NO_QUEUE). If there
is a queue then set it in the sock_flow_table instead of setting the
running CPU. Subsequently, the receive queue for the flow can be
programmed by the aRFS logic (ndo_rx_flow_steer).
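
To illustrate how a single sock_flow_table entry can carry either a CPU
or a queue, here is a toy encoding (the flag bit and mask values below
are made up for illustration; the real layout comes from the earlier
patches in this series):

	#include <stdio.h>

	/* Illustrative layout only: top bit selects queue vs CPU, the low
	 * 15 bits carry the identifier, the middle bits carry hash bits.
	 */
	#define USE_QID   0x80000000u
	#define ID_MASK   0x00007fffu
	#define HASH_MASK 0x7fff8000u

	static unsigned int make_entry(unsigned int hash, unsigned int id,
				       int is_queue)
	{
		unsigned int val = (hash & HASH_MASK) | (id & ID_MASK);

		return is_queue ? (val | USE_QID) : val;
	}

	int main(void)
	{
		unsigned int e1 = make_entry(0xdeadbeef, 3, 0); /* CPU 3 */
		unsigned int e2 = make_entry(0xdeadbeef, 9, 1); /* queue 9 */

		printf("0x%08x -> %s %u\n", e1,
		       (e1 & USE_QID) ? "queue" : "cpu", e1 & ID_MASK);
		printf("0x%08x -> %s %u\n", e2,
		       (e2 & USE_QID) ? "queue" : "cpu", e2 & ID_MASK);
		return 0;
	}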
---
 include/linux/netdevice.h | 28 ++++++++++++++++++++++++----
 net/core/dev.c            | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca163925211a..3b39be470720 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -731,12 +731,25 @@ static inline void rps_dev_flow_set_cpu(struct rps_dev_flow *dev_flow, u16 cpu)
 	if (WARN_ON(cpu > RPS_MAX_CPU))
 		return;
 
-	/* Set the rflow target to the CPU atomically */
+	/* Set the device flow target to the CPU atomically */
 	cpu_qid.use_qid = 0;
 	cpu_qid.cpu = cpu;
 	dev_flow->cpu_qid = cpu_qid;
 }
 
+static inline void rps_dev_flow_set_qid(struct rps_dev_flow *dev_flow, u16 qid)
+{
+	struct rps_cpu_qid cpu_qid;
+
+	if (WARN_ON(qid > RPS_MAX_QID))
+		return;
+
+	/* Set the device flow target to the queue atomically */
+	cpu_qid.use_qid = 1;
+	cpu_qid.qid = qid;
+	dev_flow->cpu_qid = cpu_qid;
+}
+
 /*
  * The rps_dev_flow_table structure contains a table of flow mappings.
  */
@@ -797,11 +810,18 @@ static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
 					u32 hash)
 {
 	if (table && hash) {
-		u32 val = hash & table->cpu_masks.hash_mask;
 		unsigned int index = hash & table->mask;
+		u32 val;
 
-		/* We only give a hint, preemption can change CPU under us */
-		val |= raw_smp_processor_id();
+#ifdef CONFIG_PER_THREAD_QUEUES
+		if (current->ptq_queues.rxq_id != NO_QUEUE)
+			val = RPS_SOCK_FLOW_USE_QID |
+			      (hash & table->queue_masks.hash_mask) |
+			      current->ptq_queues.rxq_id;
+		else
+#endif
+			val = (hash & table->cpu_masks.hash_mask) |
+			      raw_smp_processor_id();
 
 		if (table->ents[index] != val)
 			table->ents[index] = val;
diff --git a/net/core/dev.c b/net/core/dev.c
index f4478c9b1c9c..1cad776e8847 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4308,6 +4308,25 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	return rflow;
 }
 
+static struct rps_dev_flow *
+set_rps_qid(struct net_device *dev, struct sk_buff *skb,
+	    struct rps_dev_flow *rflow, u16 qid)
+{
+	if (qid > RPS_MAX_QID) {
+		rps_dev_flow_clear(rflow);
+		return rflow;
+	}
+
+#ifdef CONFIG_RFS_ACCEL
+	/* Should we steer this flow to a different hardware queue? */
+	if (skb_rx_queue_recorded(skb) && (dev->features & NETIF_F_NTUPLE) &&
+	    qid != skb_get_rx_queue(skb) && qid < dev->real_num_rx_queues)
+		set_arfs_queue(dev, skb, rflow, qid);
+#endif
+	rps_dev_flow_set_qid(rflow, qid);
+	return rflow;
+}
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
@@ -4356,6 +4375,10 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 		/* First check into global flow table if there is a match */
 		ident = sock_flow_table->ents[hash & sock_flow_table->mask];
+
+		if (ident == RPS_SOCK_FLOW_NO_IDENT)
+			goto try_rps;
+
 		comparator = ((ident & RPS_SOCK_FLOW_USE_QID) ?
 				sock_flow_table->queue_masks.hash_mask :
 				sock_flow_table->cpu_masks.hash_mask);
@@ -4372,8 +4395,21 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		 * CPU. Proceed accordingly.
 		 */
 		if (ident & RPS_SOCK_FLOW_USE_QID) {
+			u16 dqid, gqid;
+
 			/* A queue identifier is in the sock_flow_table entry */
 
+			gqid = ident & sock_flow_table->queue_masks.mask;
+			dqid = netdev_rx_gqid_to_dqid(dev, gqid);
+
+			/* rflow has desired receive qid. Just set the qid in
+			 * HW and return to use current CPU. Note that we
+			 * don't consider OOO in this case.
+			 */
+			rflow = set_rps_qid(dev, skb, rflow, dqid);
+
+			*rflowp = rflow;
+
 			/* Don't use aRFS to set CPU in this case, skip to
 			 * trying RPS
 			 */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 11/11] doc: Documentation for Per Thread Queues
  2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
                   ` (9 preceding siblings ...)
  2020-06-24 17:17 ` [RFC PATCH 10/11] ptq: Hook up receive " Tom Herbert
@ 2020-06-24 17:17 ` Tom Herbert
  2020-06-25  2:20   ` kernel test robot
                     ` (2 more replies)
  10 siblings, 3 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-24 17:17 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Add a section on Per Thread Queues to scaling.rst.
---
 Documentation/networking/scaling.rst | 195 ++++++++++++++++++++++++++-
 1 file changed, 194 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 8f0347b9fb3d..42f1dc639ab7 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -250,7 +250,7 @@ RFS: Receive Flow Steering
 While RPS steers packets solely based on hash, and thus generally
 provides good load distribution, it does not take into account
 application locality. This is accomplished by Receive Flow Steering
-(RFS). The goal of RFS is to increase datacache hitrate by steering
+(RFS). The goal of RFS is to increase datacache hit rate by steering
 kernel processing of packets to the CPU where the application thread
 consuming the packet is running. RFS relies on the same RPS mechanisms
 to enqueue packets onto the backlog of another CPU and to wake up that
@@ -508,6 +508,199 @@ a max-rate attribute is supported, by setting a Mbps value to::
 A value of zero means disabled, and this is the default.
 
 
+PTQ: Per Thread Queues
+======================
+
+Per Thread Queues allows application threads to be assigned dedicated
+hardware network queues for both transmit and receive. This facility
+provides a high degree of traffic isolation between applications and
+can also help facilitate high performance due to fine grained packet
+steering.
+
+PTQ has three major design components:
+	- A method to assign transmit and receive queues to threads
+	- A means to associate packets with threads and then to steer
+	  those packets to the queues assigned to the threads
+	- Mechanisms to process the per thread hardware queues
+
+Global network queues
+~~~~~~~~~~~~~~~~~~~~~
+
+Global network queues are an abstraction of hardware networking
+queues that can be used in generic non-device specific configuration.
+Global queues may be mapped to real device queues. The mapping is
+performed on a per device queue basis. A device sysfs parameter
+"global_queue_mapping" in queues/{tx,rx}-<num> indicates the mapping
+of a device queue to a global queue. Each device maintains a table
+that maps global queues to device queues for the device. Note that
+for a single device, the global to device queue mapping is 1 to 1,
+however each device may map a global queue to a different device
+queue.
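+
+For example, to map receive queue 0 of device eth0 (the device name and
+queue numbers are purely illustrative) to global queue 100::
+
+  echo 100 > /sys/class/net/eth0/queues/rx-0/global_queue_mapping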
+
+net_queues cgroup controller
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For assigning queues to the threads, a cgroup controller named
+"net_queues" is used. A cgroup can be configured with pools of transmit
+and receive global queues from which individual threads are assigned
+queues. The contents of the net_queues controller are described below in
+the configuration section.
+
+Handling PTQ in the transmit path
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a socket operation is performed that may result in sending packets
+(i.e. listen, accept, sendmsg, sendpage), the task structure for the
+current thread is consulted to see if there is an assigned transmit
+queue for the thread. If there is a queue assignment, the queue index is
+set in a field of the sock structure for the corresponding socket.
+Subsequently, when transmit queue selection is performed, the sock
+structure associated with the packet being sent is consulted. If a transmit
+global queue is set in the sock then that index is mapped to a device
+queue for the output networking device. If a valid device queue is
+found then that queue is used; otherwise queue selection proceeds to
+XPS.
+
+Handling PTQ in the receive path
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The receive path uses the infrastructure of RFS which is extended
+to steer based on the assigned received global queue for a thread in
+addition to steering based on the CPU. The rps_sock_flow_table is
+modified to contain either the desired CPU for flows or the desired
+receive global queue. A queue is updated at the same time that the
+desired CPU would be updated during calls to recvmsg and sendmsg (see RFS
+description above). The process is to consult the running task structure
+to see if a receive queue is assigned to the task. If a queue is assigned
+to the task then the corresponding queue index is set in the
+rps_sock_flow_table; if no queue is assigned then the current CPU is
+set as the desired CPU, per canonical RFS.
+
+When packets are received, the rps_sock_flow_table is consulted to check
+if they were received on the proper queue. If the rps_sock_flow_table
+entry for a corresponding flow of a received packet contains a global
+queue index, then the index is mapped to a device queue on the receiving
+device. If the mapped device queue is equal to the receive queue then
+packets are being steered properly. If there is a mismatch then the
+local flow to queue mapping in the device is changed and
+ndo_rx_flow_steer is invoked to set the receive queue for the flow in
+the device as described in the aRFS section.
+
+Processing queues in Per Thread Queues
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When Per Thread Queues is used, the queue "follows" the thread. So when
+a thread is rescheduled from one CPU to another we expect that the
+device queues that map to the thread are processed on
+the CPU where the thread is currently running. This is a bit tricky
+especially with respect to the canonical device interrupt driven model.
+There are at least three possible approaches:
+	- Arrange for interrupts to follow threads as they are
+	  rescheduled, or alternatively pin threads to CPUs and
+	  statically configure the interrupt mappings for the queues for
+	  each thread
+	- Use busy polling
+	- Use "sleeping busy-poll" with completion queues. The basic
+	  idea is to have one CPU busy poll a device completion queue
+	  that reports device queues with received or completed transmit
+	  packets. When a queue is ready, the thread associated with the
+	  queue (derived by reverse mapping the queue back to its
+	  assigned thread) is scheduled. When the thread runs it polls
+	  its queues to process any packets.
+
+Future work may further elaborate on solutions in this area.
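+
+For reference, the busy polling option above can be requested per
+socket with the existing SO_BUSY_POLL socket option, for example (the
+50 microsecond budget is only illustrative)::
+
+  int usecs = 50;
+
+  setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));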
+
+Reducing flow state in devices
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+PTQ (and aRFS as well) potentially creates per flow state in a device.
+This is costly in at least two ways: 1) State requires device memory,
+which is almost always much smaller than host memory, and thus the
+number of flows that can be instantiated in a device is less than in
+the host. 2) State requires instantiation and synchronization
+messages, i.e. ndo_rx_flow_steer causes a message over the PCIe bus; if
+there is a high turnover rate of connections this messaging becomes
+a bottleneck.
+
+Mitigations to reduce the amount of flow state in the device should be
+considered.
+
+In PTQ (and aRFS) the device flow state is considered a cache. A flow
+entry is only set in the device on a cache miss, which occurs when the
+receive queue for a packet doesn't match the desired receive queue. So
+conceptually, if packets for a flow are always received on the desired
+queue from the beginning of the flow then flow state might never need
+to be instantiated in the device. This motivates a strategy of trying
+stateless steering mechanisms before resorting to stateful ones.
+
+As an example of applying this strategy, consider an application that
+creates four threads where each thread creates a TCP listener socket
+for some port that is shared amongst the threads via SO_REUSEPORT.
+Four global queues can be assigned to the application (via a cgroup
+for the application), and a filter rule can be set up in each device
+that matches the listener port and any bound destination address. The
+filter maps to a set of four device queues that map to the four global
+queues for the application. When a packet is received that matches the
+filter, one of the four queues is chosen via a hash over the packet's
+four tuple. So in this manner, packets for the application are
+distributed amongst the four threads. As long as processing for sockets
+doesn't move between threads and the number of listener threads is
+constant then packets are always received on the desired queue and no
+flow state needs to be instantiated. In practice, we want to allow
+elasticity in applications to create and destroy threads on demand, so
+additional techniques, such as consistent hashing, are probably needed.
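+
+The listener setup in this example is the standard SO_REUSEPORT
+pattern, sketched here only for context (error handling omitted)::
+
+  int one = 1;
+  int fd = socket(AF_INET, SOCK_STREAM, 0);
+
+  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
+  bind(fd, (struct sockaddr *)&addr, sizeof(addr));
+  listen(fd, backlog);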
+
+Per Thread Queues Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Per Thread Queues is only available if the kernel is compiled with
+CONFIG_PER_THREAD_QUEUES. For PTQ in the receive path, aRFS needs to be
+supported and configured (see aRFS section above).
+
+The net_queues cgroup controller is in:
+	/sys/fs/cgroup/<cgrp>/net_queues
+
+The net_queues controller contains the following attributes:
+	- tx-queues, rx-queues
+		Specifies the transmit queue pool and receive queue pool
+		respectively as a range of global queue indices. The
+		format of these entries is "<base>:<extent>" where
+		<base> is the first queue index in the pool, and
+		<extent> is the number of queues in the pool.
+		If <extent> is zero the queue pool is empty.
+	- tx-assign, rx-assign
+		Boolean attributes ("0" or "1") that indicate unique
+		queue assignment from the respective transmit or receive
+		queue pool. When the "assign" attribute is enabled, a
+		thread is assigned a queue that is not already assigned
+		to another thread.
+	- symmetric
+		A boolean attribute ("0" or "1") that indicates the
+		receive and transmit queue assignment for a thread
+		should be the same. That is, the assigned transmit queue
+		index is equal to the assigned receive queue index.
+	- task-queues
+		A read-only attribute that lists the threads of the
+		cgroup and their assigned queues.
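+
+Assuming the controller hierarchy is mounted under /sys/fs/cgroup and
+the cgroup is named "app" (the mount point, file names and values here
+are purely illustrative), a configuration might look like::
+
+  echo "0:8" > /sys/fs/cgroup/app/net_queues.rx-queues
+  echo "0:8" > /sys/fs/cgroup/app/net_queues.tx-queues
+  echo 1 > /sys/fs/cgroup/app/net_queues.rx-assign
+  echo 1 > /sys/fs/cgroup/app/net_queues.tx-assign
+  echo 1 > /sys/fs/cgroup/app/net_queues.symmetric
+  cat /sys/fs/cgroup/app/net_queues.task-queues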
+
+The mapping of global queues to device queues is in:
+
+  /sys/class/net/<dev>/queues/tx-<n>/global_queue_mapping
+	-and -
+  /sys/class/net/<dev>/queues/rx-<n>/global_queue_mapping
+
+A value of "none" indicates no mapping, an integer value (up to
+a maximum of 32,766) indicates a global queue.
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Unlike aRFS, PTQ requires per-application configuration. To
+most effectively use PTQ some understanding of the threading model of
+the application is warranted. The section above describes one possible
+configuration strategy for a canonical application using SO_REUSEPORT.
+
+
 Further Information
 ===================
 RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 08/11] ptq: Per Thread Queues
  2020-06-24 17:17 ` [RFC PATCH 08/11] ptq: Per Thread Queues Tom Herbert
@ 2020-06-24 21:20   ` kernel test robot
  2020-06-25  1:50   ` [RFC PATCH] ptq: null_pcdesc can be static kernel test robot
  2020-06-25  7:26   ` [RFC PATCH 08/11] ptq: Per Thread Queues kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-06-24 21:20 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 17349 bytes --]

Hi Tom,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on net/master]
[also build test ERROR on ipvs/master net-next/master linus/master v5.8-rc2 next-20200624]
[cannot apply to cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Tom-Herbert/ptq-Per-Thread-Queues/20200625-012135
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 0275875530f692c725c6f993aced2eca2d6ac50c
config: arc-defconfig (attached as .config)
compiler: arc-elf-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=arc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from net/core/ptq.c:12:
>> include/net/ptq.h:23:29: error: field 'css' has incomplete type
      23 |  struct cgroup_subsys_state css;
         |                             ^~~
   net/core/ptq.c: In function 'reassign_queues_all':
>> net/core/ptq.c:298:23: error: storage size of 'it' isn't known
     298 |  struct css_task_iter it;
         |                       ^~
>> net/core/ptq.c:314:2: error: implicit declaration of function 'css_task_iter_start'; did you mean 'sg_miter_start'? [-Werror=implicit-function-declaration]
     314 |  css_task_iter_start(&pss->css, 0, &it);
         |  ^~~~~~~~~~~~~~~~~~~
         |  sg_miter_start
>> net/core/ptq.c:315:17: error: implicit declaration of function 'css_task_iter_next'; did you mean 'class_dev_iter_next'? [-Werror=implicit-function-declaration]
     315 |  while ((task = css_task_iter_next(&it)))
         |                 ^~~~~~~~~~~~~~~~~~
         |                 class_dev_iter_next
>> net/core/ptq.c:317:2: error: implicit declaration of function 'css_task_iter_end' [-Werror=implicit-function-declaration]
     317 |  css_task_iter_end(&it);
         |  ^~~~~~~~~~~~~~~~~
   net/core/ptq.c:298:23: warning: unused variable 'it' [-Wunused-variable]
     298 |  struct css_task_iter it;
         |                       ^~
   net/core/ptq.c: In function 'cgrp_css_online':
>> net/core/ptq.c:337:46: error: dereferencing pointer to incomplete type 'struct cgroup_subsys_state'
     337 |  struct cgroup_subsys_state *parent_css = css->parent;
         |                                              ^~
   net/core/ptq.c: At top level:
>> net/core/ptq.c:356:64: warning: 'struct cftype' declared inside parameter list will not be visible outside of this definition or declaration
     356 | static u64 read_ptqidx(struct cgroup_subsys_state *css, struct cftype *cft)
         |                                                                ^~~~~~
>> net/core/ptq.c:362:34: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
     362 | static int ptq_can_attach(struct cgroup_taskset *tset)
         |                                  ^~~~~~~~~~~~~~
   net/core/ptq.c: In function 'ptq_can_attach':
>> net/core/ptq.c:373:2: error: implicit declaration of function 'cgroup_taskset_for_each'; did you mean 'cgroup_task_frozen'? [-Werror=implicit-function-declaration]
     373 |  cgroup_taskset_for_each(task, dst_css, tset) {
         |  ^~~~~~~~~~~~~~~~~~~~~~~
         |  cgroup_task_frozen
>> net/core/ptq.c:373:46: error: expected ';' before '{' token
     373 |  cgroup_taskset_for_each(task, dst_css, tset) {
         |                                              ^~
         |                                              ;
   net/core/ptq.c:364:40: warning: unused variable 'src_css' [-Wunused-variable]
     364 |  struct cgroup_subsys_state *dst_css, *src_css;
         |                                        ^~~~~~~
   net/core/ptq.c: At top level:
   net/core/ptq.c:384:31: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
     384 | static void ptq_attach(struct cgroup_taskset *tset)
         |                               ^~~~~~~~~~~~~~
   net/core/ptq.c: In function 'ptq_attach':
>> net/core/ptq.c:393:42: error: expected ';' before 'assign_queues'
     393 |  cgroup_taskset_for_each(task, css, tset)
         |                                          ^
         |                                          ;
     394 |   assign_queues(css_to_ptq_css(css), task);
         |   ~~~~~~~~~~~~~                           
   net/core/ptq.c: At top level:
   net/core/ptq.c:400:38: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
     400 | static void ptq_cancel_attach(struct cgroup_taskset *tset)
         |                                      ^~~~~~~~~~~~~~
   net/core/ptq.c: In function 'ptq_cancel_attach':
   net/core/ptq.c:411:46: error: expected ';' before '{' token
     411 |  cgroup_taskset_for_each(task, dst_css, tset) {
         |                                              ^~
         |                                              ;
   net/core/ptq.c:402:40: warning: unused variable 'src_css' [-Wunused-variable]
     402 |  struct cgroup_subsys_state *dst_css, *src_css;
         |                                        ^~~~~~~
   net/core/ptq.c: In function 'ptq_fork':
>> net/core/ptq.c:424:3: error: implicit declaration of function 'task_css'; did you mean 'task_cpu'? [-Werror=implicit-function-declaration]
     424 |   task_css(task, net_queues_cgrp_id);
         |   ^~~~~~~~
         |   task_cpu
>> net/core/ptq.c:424:18: error: 'net_queues_cgrp_id' undeclared (first use in this function); did you mean 'net_queue_pair'?
     424 |   task_css(task, net_queues_cgrp_id);
         |                  ^~~~~~~~~~~~~~~~~~
         |                  net_queue_pair
   net/core/ptq.c:424:18: note: each undeclared identifier is reported only once for each function it appears in
   net/core/ptq.c: In function 'ptq_exit':
   net/core/ptq.c:435:18: error: 'net_queues_cgrp_id' undeclared (first use in this function); did you mean 'net_queue_pair'?
     435 |   task_css(task, net_queues_cgrp_id);
         |                  ^~~~~~~~~~~~~~~~~~
         |                  net_queue_pair
   net/core/ptq.c: In function 'write_tx_queues':
>> net/core/ptq.c:557:39: error: implicit declaration of function 'of_css' [-Werror=implicit-function-declaration]
     557 |  struct ptq_css *pss = css_to_ptq_css(of_css(of));
         |                                       ^~~~~~
>> net/core/ptq.c:557:39: warning: passing argument 1 of 'css_to_ptq_css' makes pointer from integer without a cast [-Wint-conversion]
     557 |  struct ptq_css *pss = css_to_ptq_css(of_css(of));
         |                                       ^~~~~~~~~~
         |                                       |
         |                                       int
   In file included from net/core/ptq.c:12:
   include/net/ptq.h:34:74: note: expected 'struct cgroup_subsys_state *' but argument is of type 'int'
      34 | static inline struct ptq_css *css_to_ptq_css(struct cgroup_subsys_state *css)
         |                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   In file included from include/linux/rculist.h:11,
                    from include/linux/netdevice.h:33,
                    from net/core/ptq.c:8:
   net/core/ptq.c: In function 'read_tx_queues':
>> net/core/ptq.c:573:59: error: implicit declaration of function 'seq_css' [-Werror=implicit-function-declaration]
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                                                           ^~~~~~~
   include/linux/rcupdate.h:352:10: note: in definition of macro '__rcu_dereference_check'
     352 |  typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
         |          ^
>> include/linux/rcupdate.h:549:28: note: in expansion of macro 'rcu_dereference_check'
     549 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
>> net/core/ptq.c:573:28: note: in expansion of macro 'rcu_dereference'
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                            ^~~~~~~~~~~~~~~
   net/core/ptq.c:573:59: warning: passing argument 1 of 'css_to_ptq_css' makes pointer from integer without a cast [-Wint-conversion]
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                                                           ^~~~~~~~~~~
         |                                                           |
         |                                                           int
   include/linux/rcupdate.h:352:10: note: in definition of macro '__rcu_dereference_check'
     352 |  typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
         |          ^
>> include/linux/rcupdate.h:549:28: note: in expansion of macro 'rcu_dereference_check'
     549 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
>> net/core/ptq.c:573:28: note: in expansion of macro 'rcu_dereference'
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                            ^~~~~~~~~~~~~~~
   In file included from net/core/ptq.c:12:
   include/net/ptq.h:34:74: note: expected 'struct cgroup_subsys_state *' but argument is of type 'int'
      34 | static inline struct ptq_css *css_to_ptq_css(struct cgroup_subsys_state *css)
         |                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   In file included from include/linux/rculist.h:11,
                    from include/linux/netdevice.h:33,
                    from net/core/ptq.c:8:
   net/core/ptq.c:573:59: warning: passing argument 1 of 'css_to_ptq_css' makes pointer from integer without a cast [-Wint-conversion]
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                                                           ^~~~~~~~~~~
         |                                                           |
         |                                                           int
   include/linux/rcupdate.h:352:36: note: in definition of macro '__rcu_dereference_check'
     352 |  typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
         |                                    ^
   include/linux/rcupdate.h:549:28: note: in expansion of macro 'rcu_dereference_check'
     549 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   net/core/ptq.c:573:28: note: in expansion of macro 'rcu_dereference'
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                            ^~~~~~~~~~~~~~~
   In file included from net/core/ptq.c:12:
   include/net/ptq.h:34:74: note: expected 'struct cgroup_subsys_state *' but argument is of type 'int'
      34 | static inline struct ptq_css *css_to_ptq_css(struct cgroup_subsys_state *css)
         |                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   In file included from include/linux/build_bug.h:5,
                    from include/linux/bits.h:23,
                    from include/linux/bitops.h:5,
                    from include/linux/bitmap.h:8,
                    from net/core/ptq.c:6:
   net/core/ptq.c:573:59: warning: passing argument 1 of 'css_to_ptq_css' makes pointer from integer without a cast [-Wint-conversion]
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                                                           ^~~~~~~~~~~
         |                                                           |
         |                                                           int
   include/linux/compiler.h:372:9: note: in definition of macro '__compiletime_assert'
     372 |   if (!(condition))     \
         |         ^~~~~~~~~
   include/linux/compiler.h:392:2: note: in expansion of macro '_compiletime_assert'
     392 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |  ^~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:405:2: note: in expansion of macro 'compiletime_assert'
     405 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |  ^~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:405:21: note: in expansion of macro '__native_word'
     405 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |                     ^~~~~~~~~~~~~
   include/linux/compiler.h:291:2: note: in expansion of macro 'compiletime_assert_rwonce_type'
     291 |  compiletime_assert_rwonce_type(x);    \
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:352:48: note: in expansion of macro 'READ_ONCE'
     352 |  typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                ^~~~~~~~~
   include/linux/rcupdate.h:491:2: note: in expansion of macro '__rcu_dereference_check'
     491 |  __rcu_dereference_check((p), (c) || rcu_read_lock_held(), __rcu)
         |  ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:549:28: note: in expansion of macro 'rcu_dereference_check'
     549 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   net/core/ptq.c:573:28: note: in expansion of macro 'rcu_dereference'
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                            ^~~~~~~~~~~~~~~
   In file included from net/core/ptq.c:12:
   include/net/ptq.h:34:74: note: expected 'struct cgroup_subsys_state *' but argument is of type 'int'
      34 | static inline struct ptq_css *css_to_ptq_css(struct cgroup_subsys_state *css)
         |                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   In file included from include/linux/build_bug.h:5,
                    from include/linux/bits.h:23,
                    from include/linux/bitops.h:5,
                    from include/linux/bitmap.h:8,
                    from net/core/ptq.c:6:
   net/core/ptq.c:573:59: warning: passing argument 1 of 'css_to_ptq_css' makes pointer from integer without a cast [-Wint-conversion]
     573 |  ret = show_queue_desc(sf, rcu_dereference(css_to_ptq_css(seq_css(sf))->
         |                                                           ^~~~~~~~~~~
         |                                                           |
         |                                                           int
   include/linux/compiler.h:372:9: note: in definition of macro '__compiletime_assert'
     372 |   if (!(condition))     \
         |         ^~~~~~~~~
   include/linux/compiler.h:392:2: note: in expansion of macro '_compiletime_assert'
     392 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |  ^~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:405:2: note: in expansion of macro 'compiletime_assert'
     405 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |  ^~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:405:21: note: in expansion of macro '__native_word'
     405 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |                     ^~~~~~~~~~~~~
   include/linux/compiler.h:291:2: note: in expansion of macro 'compiletime_assert_rwonce_type'
     291 |  compiletime_assert_rwonce_type(x);    \
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:352:48: note: in expansion of macro 'READ_ONCE'
     352 |  typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                ^~~~~~~~~
   include/linux/rcupdate.h:491:2: note: in expansion of macro '__rcu_dereference_check'
     491 |  __rcu_dereference_check((p), (c) || rcu_read_lock_held(), __rcu)
         |  ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:549:28: note: in expansion of macro 'rcu_dereference_check'

vim +/css +23 include/net/ptq.h

    21	
    22	struct ptq_css {
  > 23		struct cgroup_subsys_state css;
    24	
    25		struct ptq_cgroup_queue_desc __rcu *txqs;
    26		struct ptq_cgroup_queue_desc __rcu *rxqs;
    27	
    28		unsigned short flags;
    29	#define PTQ_F_RX_ASSIGN		BIT(0)
    30	#define PTQ_F_TX_ASSIGN		BIT(1)
    31	#define PTQ_F_SYMMETRIC		BIT(2)
    32	};
    33	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 9353 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/11] net: Introduce global queues
  2020-06-24 17:17 ` [RFC PATCH 07/11] net: Introduce global queues Tom Herbert
@ 2020-06-24 23:00   ` kernel test robot
  2020-06-24 23:58   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-06-24 23:00 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3091 bytes --]

Hi Tom,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on net/master]
[also build test ERROR on ipvs/master net-next/master linus/master v5.8-rc2 next-20200624]
[cannot apply to cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Tom-Herbert/ptq-Per-Thread-Queues/20200625-012135
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 0275875530f692c725c6f993aced2eca2d6ac50c
config: arm-randconfig-r004-20200624 (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 1d4c87335d5236ea1f35937e1014980ba961ae34)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> net/core/net-sysfs.c:1693:2: error: implicit declaration of function 'set_device_queue_mapping' [-Werror,-Wimplicit-function-declaration]
           set_device_queue_mapping(&queue->dev->tx_gqueue_map, NO_QUEUE,
           ^
>> net/core/net-sysfs.c:1694:13: error: no member named 'index' in 'struct netdev_queue'
                                    queue->index, &queue->gqid);
                                    ~~~~~  ^
>> net/core/net-sysfs.c:1694:28: error: no member named 'gqid' in 'struct netdev_queue'
                                    queue->index, &queue->gqid);
                                                   ~~~~~  ^
>> net/core/net-sysfs.c:1693:40: error: no member named 'tx_gqueue_map' in 'struct net_device'; did you mean 'tx_queue_len'?
           set_device_queue_mapping(&queue->dev->tx_gqueue_map, NO_QUEUE,
                                                 ^~~~~~~~~~~~~
                                                 tx_queue_len
   include/linux/netdevice.h:2145:16: note: 'tx_queue_len' declared here
           unsigned int            tx_queue_len;
                                   ^
   4 errors generated.

vim +/set_device_queue_mapping +1693 net/core/net-sysfs.c

  1688	
  1689	static void netdev_queue_release(struct kobject *kobj)
  1690	{
  1691		struct netdev_queue *queue = to_netdev_queue(kobj);
  1692	
> 1693		set_device_queue_mapping(&queue->dev->tx_gqueue_map, NO_QUEUE,
> 1694					 queue->index, &queue->gqid);
  1695	
  1696		memset(kobj, 0, sizeof(*kobj));
  1697		dev_put(queue->dev);
  1698	}
  1699	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 34853 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/11] net: Introduce global queues
  2020-06-24 17:17 ` [RFC PATCH 07/11] net: Introduce global queues Tom Herbert
  2020-06-24 23:00   ` kernel test robot
@ 2020-06-24 23:58   ` kernel test robot
  2020-06-25  0:23   ` kernel test robot
  2020-06-30 21:06   ` Jonathan Lemon
  3 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-06-24 23:58 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6204 bytes --]

Hi Tom,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on net/master]
[also build test WARNING on ipvs/master net-next/master linus/master v5.8-rc2 next-20200624]
[cannot apply to cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Tom-Herbert/ptq-Per-Thread-Queues/20200625-012135
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 0275875530f692c725c6f993aced2eca2d6ac50c
config: s390-randconfig-s031-20200624 (attached as .config)
compiler: s390-linux-gcc (GCC) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 ARCH=s390 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> net/core/net-sysfs.c:901:18: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/core/net-sysfs.c:901:18: sparse:    struct netdev_queue_map [noderef] <asn:4> *
>> net/core/net-sysfs.c:901:18: sparse:    struct netdev_queue_map *
   net/core/net-sysfs.c:915:33: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/core/net-sysfs.c:915:33: sparse:    struct netdev_queue_map [noderef] <asn:4> *
   net/core/net-sysfs.c:915:33: sparse:    struct netdev_queue_map *
   net/core/net-sysfs.c:976:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/core/net-sysfs.c:976:17: sparse:    struct netdev_queue_map [noderef] <asn:4> *
   net/core/net-sysfs.c:976:17: sparse:    struct netdev_queue_map *
   net/core/net-sysfs.c:1017:46: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct netdev_queue_map **pmap @@     got struct netdev_queue_map [noderef] <asn:4> ** @@
   net/core/net-sysfs.c:1050:40: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct netdev_queue_map **pmap @@     got struct netdev_queue_map [noderef] <asn:4> ** @@
   net/core/net-sysfs.c:1335:46: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct netdev_queue_map **pmap @@     got struct netdev_queue_map [noderef] <asn:4> ** @@
   net/core/net-sysfs.c:1693:40: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct netdev_queue_map **pmap @@     got struct netdev_queue_map [noderef] <asn:4> ** @@

vim +901 net/core/net-sysfs.c

   884	
   885	static int set_device_queue_mapping(struct netdev_queue_map **pmap,
   886					    u16 gqid, u16 dqid, u16 *p_gqid)
   887	{
   888		static DEFINE_MUTEX(global_mapping_table);
   889		struct netdev_queue_map *gq_map, *old_gq_map;
   890		u16 old_gqid;
   891		int ret = 0;
   892	
   893		mutex_lock(&global_mapping_table);
   894	
   895		old_gqid = *p_gqid;
   896		if (old_gqid == gqid) {
   897			/* Nothing changing */
   898			goto out;
   899		}
   900	
 > 901		gq_map = rcu_dereference_protected(*pmap,
   902						   lockdep_is_held(&global_mapping_table));
   903		old_gq_map = gq_map;
   904	
   905		if (gqid == NO_QUEUE) {
   906			/* Remove any old mapping (we know that old_gqid cannot be
   907			 * NO_QUEUE from above)
   908			 */
   909			if (!WARN_ON(!gq_map || old_gqid > gq_map->max_ents ||
   910				     gq_map->map[old_gqid] != dqid)) {
   911				/* Unset old mapping */
   912				gq_map->map[old_gqid] = NO_QUEUE;
   913				if (--gq_map->set_count == 0) {
   914					/* Done with map so free */
   915					rcu_assign_pointer(*pmap, NULL);
   916					call_rcu(&gq_map->rcu, queue_map_release);
   917				}
   918			}
   919			*p_gqid = NO_QUEUE;
   920	
   921			goto out;
   922		}
   923	
   924		if (!gq_map || gqid >= gq_map->max_ents) {
   925			unsigned int max_queues;
   926			int i = 0;
   927	
   928			/* Need to create or expand queue map */
   929	
   930			max_queues = QUEUE_MAP_ALLOC_NUMBER(gqid + 1);
   931	
   932			gq_map = vmalloc(QUEUE_MAP_ALLOC_SIZE(max_queues));
   933			if (!gq_map) {
   934				ret = -ENOMEM;
   935				goto out;
   936			}
   937	
   938			gq_map->max_ents = max_queues;
   939	
   940			if (old_gq_map) {
   941				/* Copy old map entries */
   942	
   943				memcpy(gq_map->map, old_gq_map->map,
   944				       old_gq_map->max_ents * sizeof(gq_map->map[0]));
   945				gq_map->set_count = old_gq_map->set_count;
   946				i = old_gq_map->max_ents;
   947			} else {
   948				gq_map->set_count = 0;
   949			}
   950	
   951			/* Initialize entries not copied from old map */
   952			for (; i < max_queues; i++)
   953				gq_map->map[i] = NO_QUEUE;
   954		} else if (gq_map->map[gqid] != NO_QUEUE) {
   955			/* The global qid is already mapped to another device qid */
   956			ret = -EBUSY;
   957			goto out;
   958		}
   959	
   960		/* Set map entry */
   961		gq_map->map[gqid] = dqid;
   962		gq_map->set_count++;
   963	
   964		if (old_gqid != NO_QUEUE) {
   965			/* We know old_gqid is not equal to gqid */
   966			if (!WARN_ON(!old_gq_map ||
   967				     old_gqid > old_gq_map->max_ents ||
   968				     old_gq_map->map[old_gqid] != dqid)) {
   969				/* Unset old mapping in (new) table */
   970				gq_map->map[old_gqid] = NO_QUEUE;
   971				gq_map->set_count--;
   972			}
   973		}
   974	
   975		if (gq_map != old_gq_map) {
   976			rcu_assign_pointer(*pmap, gq_map);
   977			if (old_gq_map)
   978				call_rcu(&old_gq_map->rcu, queue_map_release);
   979		}
   980	
   981		/* Save for caller */
   982		*p_gqid = gqid;
   983	
   984	out:
   985		mutex_unlock(&global_mapping_table);
   986	
   987		return ret;
   988	}
   989	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 19564 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/11] net: Introduce global queues
  2020-06-24 17:17 ` [RFC PATCH 07/11] net: Introduce global queues Tom Herbert
  2020-06-24 23:00   ` kernel test robot
  2020-06-24 23:58   ` kernel test robot
@ 2020-06-25  0:23   ` kernel test robot
  2020-06-30 21:06   ` Jonathan Lemon
  3 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-06-25  0:23 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2821 bytes --]

Hi Tom,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on net/master]
[also build test ERROR on ipvs/master net-next/master linus/master v5.8-rc2 next-20200624]
[cannot apply to cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Tom-Herbert/ptq-Per-Thread-Queues/20200625-012135
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 0275875530f692c725c6f993aced2eca2d6ac50c
config: c6x-randconfig-r003-20200624 (attached as .config)
compiler: c6x-elf-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=c6x 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   net/core/net-sysfs.c: In function 'netdev_queue_release':
>> net/core/net-sysfs.c:1693:2: error: implicit declaration of function 'set_device_queue_mapping'; did you mean 'skb_get_queue_mapping'? [-Werror=implicit-function-declaration]
    1693 |  set_device_queue_mapping(&queue->dev->tx_gqueue_map, NO_QUEUE,
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  skb_get_queue_mapping
>> net/core/net-sysfs.c:1693:40: error: 'struct net_device' has no member named 'tx_gqueue_map'; did you mean 'tx_queue_len'?
    1693 |  set_device_queue_mapping(&queue->dev->tx_gqueue_map, NO_QUEUE,
         |                                        ^~~~~~~~~~~~~
         |                                        tx_queue_len
>> net/core/net-sysfs.c:1694:11: error: 'struct netdev_queue' has no member named 'index'
    1694 |      queue->index, &queue->gqid);
         |           ^~
>> net/core/net-sysfs.c:1694:26: error: 'struct netdev_queue' has no member named 'gqid'
    1694 |      queue->index, &queue->gqid);
         |                          ^~
   cc1: some warnings being treated as errors

vim +1693 net/core/net-sysfs.c

  1688	
  1689	static void netdev_queue_release(struct kobject *kobj)
  1690	{
  1691		struct netdev_queue *queue = to_netdev_queue(kobj);
  1692	
> 1693		set_device_queue_mapping(&queue->dev->tx_gqueue_map, NO_QUEUE,
> 1694					 queue->index, &queue->gqid);
  1695	
  1696		memset(kobj, 0, sizeof(*kobj));
  1697		dev_put(queue->dev);
  1698	}
  1699	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 28873 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH] ptq: null_pcdesc can be static
  2020-06-24 17:17 ` [RFC PATCH 08/11] ptq: Per Thread Queues Tom Herbert
  2020-06-24 21:20   ` kernel test robot
@ 2020-06-25  1:50   ` kernel test robot
  2020-06-25  7:26   ` [RFC PATCH 08/11] ptq: Per Thread Queues kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-06-25  1:50 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 466 bytes --]


Signed-off-by: kernel test robot <lkp@intel.com>
---
 ptq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/ptq.c b/net/core/ptq.c
index edf6718e0a713..62ff7f6543632 100644
--- a/net/core/ptq.c
+++ b/net/core/ptq.c
@@ -11,7 +11,7 @@
 #include <linux/rcupdate.h>
 #include <net/ptq.h>
 
-struct ptq_cgroup_queue_desc null_pcdesc;
+static struct ptq_cgroup_queue_desc null_pcdesc;
 
 static DEFINE_MUTEX(ptq_mutex);
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 11/11] doc: Documentation for Per Thread Queues
  2020-06-24 17:17 ` [RFC PATCH 11/11] doc: Documentation for Per Thread Queues Tom Herbert
@ 2020-06-25  2:20   ` kernel test robot
  2020-06-25 23:00   ` Jacob Keller
  2020-06-29  6:28   ` Saeed Mahameed
  2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-06-25  2:20 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1461 bytes --]

Hi Tom,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on net/master]
[also build test ERROR on ipvs/master net-next/master linus/master v5.8-rc2 next-20200624]
[cannot apply to cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Tom-Herbert/ptq-Per-Thread-Queues/20200625-012135
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 0275875530f692c725c6f993aced2eca2d6ac50c
reproduce: make htmldocs

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

Documentation/networking/scaling.rst:526: (SEVERE/4) Title level inconsistent:

vim +526 Documentation/networking/scaling.rst

   519	
   520	PTQ has three major design components:
   521		- A method to assign transmit and receive queues to threads
   522		- A means to associate packets with threads and then to steer
   523		  those packets to the queues assigned to the threads
   524		- Mechanisms to process the per thread hardware queues
   525	
 > 526	Global network queues
   527	~~~~~~~~~~~~~~~~~~~~~
   528	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 7415 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 08/11] ptq: Per Thread Queues
  2020-06-24 17:17 ` [RFC PATCH 08/11] ptq: Per Thread Queues Tom Herbert
  2020-06-24 21:20   ` kernel test robot
  2020-06-25  1:50   ` [RFC PATCH] ptq: null_pcdesc can be static kernel test robot
@ 2020-06-25  7:26   ` kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-06-25  7:26 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 13582 bytes --]

Hi Tom,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on net/master]
[also build test WARNING on ipvs/master net-next/master linus/master v5.8-rc2 next-20200624]
[cannot apply to cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Tom-Herbert/ptq-Per-Thread-Queues/20200625-012135
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 0275875530f692c725c6f993aced2eca2d6ac50c
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from drivers/scsi/ibmvscsi_tgt/libsrp.c:22:
>> drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h:199: warning: "NO_QUEUE" redefined
     199 | #define NO_QUEUE                    0x00
         | 
   In file included from include/linux/sched.h:35,
                    from include/linux/mm.h:31,
                    from include/linux/scatterlist.h:8,
                    from include/linux/kfifo.h:42,
                    from drivers/scsi/ibmvscsi_tgt/libsrp.c:15:
   include/linux/netqueue.h:12: note: this is the location of the previous definition
      12 | #define NO_QUEUE USHRT_MAX
         | 
--
   In file included from drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:34:
>> drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h:199: warning: "NO_QUEUE" redefined
     199 | #define NO_QUEUE                    0x00
         | 
   In file included from include/linux/sched.h:35,
                    from arch/powerpc/include/asm/elf.h:8,
                    from include/linux/elf.h:6,
                    from include/linux/module.h:18,
                    from drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:18:
   include/linux/netqueue.h:12: note: this is the location of the previous definition
      12 | #define NO_QUEUE USHRT_MAX
         | 
   In file included from arch/powerpc/include/asm/paca.h:15,
                    from arch/powerpc/include/asm/current.h:13,
                    from include/linux/thread_info.h:21,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/powerpc/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:78,
                    from include/linux/spinlock.h:51,
                    from include/linux/seqlock.h:36,
                    from include/linux/time.h:6,
                    from include/linux/stat.h:19,
                    from include/linux/module.h:13,
                    from drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:18:
   In function 'strncpy',
       inlined from 'ibmvscsis_get_system_info' at drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:3666:3,
       inlined from 'ibmvscsis_init' at drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:4106:7:
   include/linux/string.h:297:30: warning: '__builtin_strncpy' specified bound 96 equals destination size [-Wstringop-truncation]
     297 | #define __underlying_strncpy __builtin_strncpy
         |                              ^
   include/linux/string.h:307:9: note: in expansion of macro '__underlying_strncpy'
     307 |  return __underlying_strncpy(p, q, size);
         |         ^~~~~~~~~~~~~~~~~~~~
   In function 'strncpy',
       inlined from 'ibmvscsis_cap_mad' at drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:1647:3,
       inlined from 'ibmvscsis_process_mad' at drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:1743:8,
       inlined from 'ibmvscsis_mad' at drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:2077:8,
       inlined from 'ibmvscsis_parse_command' at drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c:2543:10:
   include/linux/string.h:297:30: warning: '__builtin_strncpy' specified bound 32 equals destination size [-Wstringop-truncation]
     297 | #define __underlying_strncpy __builtin_strncpy
         |                              ^
   include/linux/string.h:307:9: note: in expansion of macro '__underlying_strncpy'
     307 |  return __underlying_strncpy(p, q, size);
         |         ^~~~~~~~~~~~~~~~~~~~

vim +/NO_QUEUE +199 drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h

88a678bbc34cec Bryant G. Ly 2016-06-28  192  
88a678bbc34cec Bryant G. Ly 2016-06-28  193  struct scsi_info {
88a678bbc34cec Bryant G. Ly 2016-06-28  194  	struct list_head list;
88a678bbc34cec Bryant G. Ly 2016-06-28  195  	char eye[MAX_EYE];
88a678bbc34cec Bryant G. Ly 2016-06-28  196  
88a678bbc34cec Bryant G. Ly 2016-06-28  197  	/* commands waiting for space on repsonse queue */
88a678bbc34cec Bryant G. Ly 2016-06-28  198  	struct list_head waiting_rsp;
88a678bbc34cec Bryant G. Ly 2016-06-28 @199  #define NO_QUEUE                    0x00
88a678bbc34cec Bryant G. Ly 2016-06-28  200  #define WAIT_ENABLED                0X01
88a678bbc34cec Bryant G. Ly 2016-06-28  201  #define WAIT_CONNECTION             0x04
88a678bbc34cec Bryant G. Ly 2016-06-28  202  	/* have established a connection */
88a678bbc34cec Bryant G. Ly 2016-06-28  203  #define CONNECTED                   0x08
88a678bbc34cec Bryant G. Ly 2016-06-28  204  	/* at least one port is processing SRP IU */
88a678bbc34cec Bryant G. Ly 2016-06-28  205  #define SRP_PROCESSING              0x10
88a678bbc34cec Bryant G. Ly 2016-06-28  206  	/* remove request received */
88a678bbc34cec Bryant G. Ly 2016-06-28  207  #define UNCONFIGURING               0x20
88a678bbc34cec Bryant G. Ly 2016-06-28  208  	/* disconnect by letting adapter go idle, no error */
88a678bbc34cec Bryant G. Ly 2016-06-28  209  #define WAIT_IDLE                   0x40
88a678bbc34cec Bryant G. Ly 2016-06-28  210  	/* disconnecting to clear an error */
88a678bbc34cec Bryant G. Ly 2016-06-28  211  #define ERR_DISCONNECT              0x80
88a678bbc34cec Bryant G. Ly 2016-06-28  212  	/* disconnect to clear error state, then come back up */
88a678bbc34cec Bryant G. Ly 2016-06-28  213  #define ERR_DISCONNECT_RECONNECT    0x100
88a678bbc34cec Bryant G. Ly 2016-06-28  214  	/* disconnected after clearing an error */
88a678bbc34cec Bryant G. Ly 2016-06-28  215  #define ERR_DISCONNECTED            0x200
88a678bbc34cec Bryant G. Ly 2016-06-28  216  	/* A series of errors caused unexpected errors */
88a678bbc34cec Bryant G. Ly 2016-06-28  217  #define UNDEFINED                   0x400
88a678bbc34cec Bryant G. Ly 2016-06-28  218  	u16  state;
88a678bbc34cec Bryant G. Ly 2016-06-28  219  	int fast_fail;
88a678bbc34cec Bryant G. Ly 2016-06-28  220  	struct target_dds dds;
88a678bbc34cec Bryant G. Ly 2016-06-28  221  	char *cmd_pool;
88a678bbc34cec Bryant G. Ly 2016-06-28  222  	/* list of free commands */
88a678bbc34cec Bryant G. Ly 2016-06-28  223  	struct list_head free_cmd;
88a678bbc34cec Bryant G. Ly 2016-06-28  224  	/* command elements ready for scheduler */
88a678bbc34cec Bryant G. Ly 2016-06-28  225  	struct list_head schedule_q;
88a678bbc34cec Bryant G. Ly 2016-06-28  226  	/* commands sent to TCM */
88a678bbc34cec Bryant G. Ly 2016-06-28  227  	struct list_head active_q;
88a678bbc34cec Bryant G. Ly 2016-06-28  228  	caddr_t *map_buf;
88a678bbc34cec Bryant G. Ly 2016-06-28  229  	/* ioba of map buffer */
88a678bbc34cec Bryant G. Ly 2016-06-28  230  	dma_addr_t map_ioba;
88a678bbc34cec Bryant G. Ly 2016-06-28  231  	/* allowable number of outstanding SRP requests */
88a678bbc34cec Bryant G. Ly 2016-06-28  232  	int request_limit;
88a678bbc34cec Bryant G. Ly 2016-06-28  233  	/* extra credit */
88a678bbc34cec Bryant G. Ly 2016-06-28  234  	int credit;
88a678bbc34cec Bryant G. Ly 2016-06-28  235  	/* outstanding transactions against credit limit */
88a678bbc34cec Bryant G. Ly 2016-06-28  236  	int debit;
88a678bbc34cec Bryant G. Ly 2016-06-28  237  
88a678bbc34cec Bryant G. Ly 2016-06-28  238  	/* allow only one outstanding mad request */
88a678bbc34cec Bryant G. Ly 2016-06-28  239  #define PROCESSING_MAD                0x00002
88a678bbc34cec Bryant G. Ly 2016-06-28  240  	/* Waiting to go idle */
88a678bbc34cec Bryant G. Ly 2016-06-28  241  #define WAIT_FOR_IDLE		      0x00004
88a678bbc34cec Bryant G. Ly 2016-06-28  242  	/* H_REG_CRQ called */
88a678bbc34cec Bryant G. Ly 2016-06-28  243  #define CRQ_CLOSED                    0x00010
88a678bbc34cec Bryant G. Ly 2016-06-28  244  	/* detected that client has failed */
88a678bbc34cec Bryant G. Ly 2016-06-28  245  #define CLIENT_FAILED                 0x00040
88a678bbc34cec Bryant G. Ly 2016-06-28  246  	/* detected that transport event occurred */
88a678bbc34cec Bryant G. Ly 2016-06-28  247  #define TRANS_EVENT                   0x00080
88a678bbc34cec Bryant G. Ly 2016-06-28  248  	/* don't attempt to send anything to the client */
88a678bbc34cec Bryant G. Ly 2016-06-28  249  #define RESPONSE_Q_DOWN               0x00100
88a678bbc34cec Bryant G. Ly 2016-06-28  250  	/* request made to schedule disconnect handler */
88a678bbc34cec Bryant G. Ly 2016-06-28  251  #define SCHEDULE_DISCONNECT           0x00400
88a678bbc34cec Bryant G. Ly 2016-06-28  252  	/* disconnect handler is scheduled */
88a678bbc34cec Bryant G. Ly 2016-06-28  253  #define DISCONNECT_SCHEDULED          0x00800
8bf11557d44d00 Michael Cyr  2016-10-13  254  	/* remove function is sleeping */
8bf11557d44d00 Michael Cyr  2016-10-13  255  #define CFG_SLEEPING                  0x01000
464fd6419c68bc Michael Cyr  2017-05-16  256  	/* Register for Prepare for Suspend Transport Events */
464fd6419c68bc Michael Cyr  2017-05-16  257  #define PREP_FOR_SUSPEND_ENABLED      0x02000
464fd6419c68bc Michael Cyr  2017-05-16  258  	/* Prepare for Suspend event sent */
464fd6419c68bc Michael Cyr  2017-05-16  259  #define PREP_FOR_SUSPEND_PENDING      0x04000
464fd6419c68bc Michael Cyr  2017-05-16  260  	/* Resume from Suspend event sent */
464fd6419c68bc Michael Cyr  2017-05-16  261  #define PREP_FOR_SUSPEND_ABORTED      0x08000
464fd6419c68bc Michael Cyr  2017-05-16  262  	/* Prepare for Suspend event overwrote another CRQ entry */
464fd6419c68bc Michael Cyr  2017-05-16  263  #define PREP_FOR_SUSPEND_OVERWRITE    0x10000
88a678bbc34cec Bryant G. Ly 2016-06-28  264  	u32 flags;
88a678bbc34cec Bryant G. Ly 2016-06-28  265  	/* adapter lock */
88a678bbc34cec Bryant G. Ly 2016-06-28  266  	spinlock_t intr_lock;
88a678bbc34cec Bryant G. Ly 2016-06-28  267  	/* information needed to manage command queue */
88a678bbc34cec Bryant G. Ly 2016-06-28  268  	struct cmd_queue cmd_q;
88a678bbc34cec Bryant G. Ly 2016-06-28  269  	/* used in hcall to copy response back into srp buffer */
88a678bbc34cec Bryant G. Ly 2016-06-28  270  	u64  empty_iu_id;
88a678bbc34cec Bryant G. Ly 2016-06-28  271  	/* used in crq, to tag what iu the response is for */
88a678bbc34cec Bryant G. Ly 2016-06-28  272  	u64  empty_iu_tag;
88a678bbc34cec Bryant G. Ly 2016-06-28  273  	uint new_state;
464fd6419c68bc Michael Cyr  2017-05-16  274  	uint resume_state;
88a678bbc34cec Bryant G. Ly 2016-06-28  275  	/* control block for the response queue timer */
88a678bbc34cec Bryant G. Ly 2016-06-28  276  	struct timer_cb rsp_q_timer;
88a678bbc34cec Bryant G. Ly 2016-06-28  277  	/* keep last client to enable proper accounting */
88a678bbc34cec Bryant G. Ly 2016-06-28  278  	struct client_info client_data;
88a678bbc34cec Bryant G. Ly 2016-06-28  279  	/* what can this client do */
88a678bbc34cec Bryant G. Ly 2016-06-28  280  	u32 client_cap;
88a678bbc34cec Bryant G. Ly 2016-06-28  281  	/*
88a678bbc34cec Bryant G. Ly 2016-06-28  282  	 * The following two fields capture state and flag changes that
88a678bbc34cec Bryant G. Ly 2016-06-28  283  	 * can occur when the lock is given up.  In the orginal design,
88a678bbc34cec Bryant G. Ly 2016-06-28  284  	 * the lock was held during calls into phyp;
88a678bbc34cec Bryant G. Ly 2016-06-28  285  	 * however, phyp did not meet PAPR architecture.  This is
88a678bbc34cec Bryant G. Ly 2016-06-28  286  	 * a work around.
88a678bbc34cec Bryant G. Ly 2016-06-28  287  	 */
88a678bbc34cec Bryant G. Ly 2016-06-28  288  	u16  phyp_acr_state;
88a678bbc34cec Bryant G. Ly 2016-06-28  289  	u32 phyp_acr_flags;
88a678bbc34cec Bryant G. Ly 2016-06-28  290  
88a678bbc34cec Bryant G. Ly 2016-06-28  291  	struct workqueue_struct *work_q;
88a678bbc34cec Bryant G. Ly 2016-06-28  292  	struct completion wait_idle;
8bf11557d44d00 Michael Cyr  2016-10-13  293  	struct completion unconfig;
88a678bbc34cec Bryant G. Ly 2016-06-28  294  	struct device dev;
88a678bbc34cec Bryant G. Ly 2016-06-28  295  	struct vio_dev *dma_dev;
88a678bbc34cec Bryant G. Ly 2016-06-28  296  	struct srp_target target;
88a678bbc34cec Bryant G. Ly 2016-06-28  297  	struct ibmvscsis_tport tport;
88a678bbc34cec Bryant G. Ly 2016-06-28  298  	struct tasklet_struct work_task;
88a678bbc34cec Bryant G. Ly 2016-06-28  299  	struct work_struct proc_work;
88a678bbc34cec Bryant G. Ly 2016-06-28  300  };
88a678bbc34cec Bryant G. Ly 2016-06-28  301  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 69768 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 11/11] doc: Documentation for Per Thread Queues
  2020-06-24 17:17 ` [RFC PATCH 11/11] doc: Documentation for Per Thread Queues Tom Herbert
  2020-06-25  2:20   ` kernel test robot
@ 2020-06-25 23:00   ` Jacob Keller
  2020-06-29  6:28   ` Saeed Mahameed
  2 siblings, 0 replies; 24+ messages in thread
From: Jacob Keller @ 2020-06-25 23:00 UTC (permalink / raw)
  To: Tom Herbert, netdev



On 6/24/2020 10:17 AM, Tom Herbert wrote:
> Add a section on Per Thread Queues to scaling.rst.
> ---
>  Documentation/networking/scaling.rst | 195 ++++++++++++++++++++++++++-
>  1 file changed, 194 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
> index 8f0347b9fb3d..42f1dc639ab7 100644
> --- a/Documentation/networking/scaling.rst
> +++ b/Documentation/networking/scaling.rst
> @@ -250,7 +250,7 @@ RFS: Receive Flow Steering
>  While RPS steers packets solely based on hash, and thus generally
>  provides good load distribution, it does not take into account
>  application locality. This is accomplished by Receive Flow Steering
> -(RFS). The goal of RFS is to increase datacache hitrate by steering
> +(RFS). The goal of RFS is to increase datacache hit rate by steering
>  kernel processing of packets to the CPU where the application thread
>  consuming the packet is running. RFS relies on the same RPS mechanisms
>  to enqueue packets onto the backlog of another CPU and to wake up that
> @@ -508,6 +508,199 @@ a max-rate attribute is supported, by setting a Mbps value to::
>  A value of zero means disabled, and this is the default.
>  
> 

It might be helpful to expand this with a few examples that show the
user experience for setting this up via the sysfs entries or similar.

That would also aid in giving an example of what a reviewer might want
to do to try this out!

Thanks,
Jake


> +PTQ: Per Thread Queues
> +======================
> +
> +Per Thread Queues allows application threads to be assigned dedicated
> +hardware network queues for both transmit and receive. This facility
> +provides a high degree of traffic isolation between applications and
> +can also help facilitate high performance due to fine grained packet
> +steering.
> +
> +PTQ has three major design components:
> +	- A method to assign transmit and receive queues to threads
> +	- A means to associate packets with threads and then to steer
> +	  those packets to the queues assigned to the threads
> +	- Mechanisms to process the per thread hardware queues
> +
> +Global network queues
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Global network queues are an abstraction of hardware networking
> +queues that can be used in generic non-device specific configuration.
> +Global queues may be mapped to real device queues. The mapping is
> +performed on a per device queue basis. A device sysfs parameter
> +"global_queue_mapping" in queues/{tx,rx}-<num> indicates the mapping
> +of a device queue to a global queue. Each device maintains a table
> +that maps global queues to device queues for the device. Note that
> +for a single device, the global to device queue mapping is 1 to 1,
> +however each device may map a global queue to a different device
> +queue.
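
For illustration, assuming a device named eth0 and made-up queue numbers,
the sysfs attribute described above could be driven along these lines
(writing "none" to clear a mapping is an assumption based on the value
format described in the configuration section further down):

    # map eth0 tx queue 0 and rx queue 0 to global queue 10
    echo 10 > /sys/class/net/eth0/queues/tx-0/global_queue_mapping
    echo 10 > /sys/class/net/eth0/queues/rx-0/global_queue_mapping

    # clear the rx mapping again
    echo none > /sys/class/net/eth0/queues/rx-0/global_queue_mapping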
> +
> +net_queues cgroup controller
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For assigning queues to the threads, a cgroup controller named
> +"net_queues" is used. A cgroup can be configured with pools of transmit
> +and receive global queues from which individual threads are assigned
> +queues. The contents of the net_queues controller are described below in
> +the configuration section.
> +
> +Handling PTQ in the transmit path
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a socket operation is performed that may result in sending packets
> +(i.e. listen, accept, sendmsg, sendpage), the task structure for the
> +current thread is consulted to see if there is an assigned transmit
> +queue for the thread. If there is a queue assignment, the queue index is
> +set in a field of the sock structure for the corresponding socket.
> +Subsequently, when transmit queue selection is performed, the sock
> +structure associated with the packet being sent is consulted. If a transmit
> +global queue is set in the sock then that index is mapped to a device
> +queue for the output networking device. If a valid device queue is
> +discovered then that queue is used, else if a device queue is not found
> +then queue selection proceeds to XPS.
> +
> +Handling PTQ in the receive path
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The receive path uses the infrastructure of RFS which is extended
> +to steer based on the assigned receive global queue for a thread in
> +addition to steering based on the CPU. The rps_sock_flow_table is
> +modified to contain either the desired CPU for flows or the desired
> +receive global queue. A queue is updated at the same time that the
> +desired CPU would be updated during calls to recvmsg and sendmsg (see RFS
> +description above). The process is to consult the running task structure
> +to see if a receive queue is assigned to the task. If a queue is assigned
> +to the task then the corresponding queue index is set in the
> +rps_sock_flow_table; if no queue is assigned then the current CPU is
> +set as the desired CPU per canonical RFS.
> +
> +When packets are received, the rps_sock_flow_table is consulted to check
> +if they were received on the proper queue. If the rps_sock_flow_table
> +entry for a corresponding flow of a received packet contains a global
> +queue index, then the index is mapped to a device queue on the receiving
> +device. If the mapped device queue is equal to the receive queue then
> +packets are being steered properly. If there is a mismatch then the
> +local flow to queue mapping in the device is changed and
> +ndo_rx_flow_steer is invoked to set the receive queue for the flow in
> +the device as described in the aRFS section.
> +
> +Processing queues in Per Queue Threads
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When Per Queue Threads is used, the queue "follows" the thread. So when
> +a thread is rescheduled from one CPU to another we expect the device
> +queues that map to the thread to be processed on
> +the CPU where the thread is currently running. This is a bit tricky
> +especially with respect to the canonical device interrupt driven model.
> +There are at least three possible approaches:
> +	- Arrange for interrupts to follow threads as they are
> +	  rescheduled, or alternatively pin threads to CPUs and
> +	  statically configure the interrupt mappings for the queues for
> +	  each thread
> +	- Use busy polling
> +	- Use "sleeping busy-poll" with completion queues. The basic
> +	  idea is to have one CPU busy poll a device completion queue
> +	  that reports device queues with received or completed transmit
> +	  packets. When a queue is ready, the thread associated with the
> +	  queue (derived by reverse mapping the queue back to its
> +	  assigned thread) is scheduled. When the thread runs it polls
> +	  its queues to process any packets.
> +
> +Future work may further elaborate on solutions in this area.
> +
> +Reducing flow state in devices
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +PTQ (and aRFS as well) potentially create per flow state in a device.
> +This is costly in at least two ways: 1) State requires device memory
> +which is almost always much smaller than host memory, and thus the
> +number of flows that can be instantiated in a device is less than that
> +in the host. 2) State requires instantiation and synchronization
> +messages, i.e. ndo_rx_flow_steer causes a message over PCIe bus; if
> +there is a high turnover rate of connections this messaging becomes
> +a bottleneck.
> +
> +Mitigations to reduce the amount of flow state in the device should be
> +considered.
> +
> +In PTQ (and aRFS) the device flow state is considered a cache. A flow
> +entry is only set in the device on a cache miss which occurs when the
> +receive queue for a packet doesn't match the desired receive queue. So
> +conceptually, if packets for a flow are always received on the desired
> +queue from the beginning of the flow then a flow state might never need
> +to be instantiated in the device. This motivates a strategy to try to
> +use stateless steering mechanisms before resorting to stateful ones.
> +
> +As an example of applying this strategy, consider an application that
> +creates four threads where each thread creates a TCP listener socket
> +for some port that is shared amongst the threads via SO_REUSEPORT.
> +Four global queues can be assigned to the application (via a cgroup
> +for the application), and a filter rule can be set up in each device
> +that matches the listener port and any bound destination address. The
> +filter maps to a set of four device queues that map to the four global
> +queues for the application. When a packet is received that matches the
> +filter, one of the four queues is chosen via a hash over the packet's
> +four tuple. So in this manner, packets for the application are
> +distributed amongst the four threads. As long as processing for sockets
> +doesn't move between threads and the number of listener threads is
> +constant then packets are always received on the desired queue and no
> +flow state needs to be instantiated. In practice, we want to allow
> +elasticity in applications to create and destroy threads on demand, so
> +additional techniques, such as consistent hashing, are probably needed.
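
As a concrete sketch of the stateless approach above: assuming a NIC
whose driver supports ntuple filters and additional RSS contexts via
ethtool (driver support and exact syntax vary, so treat this as
indicative only), the four device queues could be grouped and the
shared listener port hashed across them without installing any
per-flow state:

    # group rx queues 4-7 of eth0 into a new RSS context
    ethtool -X eth0 start 4 equal 4 context new

    # hash flows destined to the shared listener port across context 1
    # (assuming "1" is the context id reported by the previous command)
    ethtool -N eth0 flow-type tcp4 dst-port 8080 context 1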
> +
> +Per Thread Queues Configuration
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Per Thread Queues is only available if the kernel is compiled with
> +CONFIG_PER_THREAD_QUEUES. For PTQ in the receive path, aRFS needs to be
> +supported and configured (see aRFS section above).
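
For example, the usual aRFS prerequisites (see the aRFS section above)
would be configured along these lines before relying on PTQ receive
steering; the values and device name are only illustrative:

    # size the global flow table used by RFS/aRFS
    echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

    # per rx queue flow count
    echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

    # enable ntuple filters so ndo_rx_flow_steer can be used
    ethtool -K eth0 ntuple on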
> +
> +The net_queues cgroup controller is in:
> +	/sys/fs/cgroup/<cgrp>/net_queues
> +
> +The net_queues controller contains the following attributes:
> +	- tx-queues, rx-queues
> +		Specifies the transmit queue pool and receive queue pool
> +		respectively as a range of global queue indices. The
> +		format of these entries is "<base>:<extent>" where
> +		<base> is the first queue index in the pool, and
> +		<extent> is the number of queues in the pool.
> +		If <extent> is zero the queue pool is empty.
> +	- tx-assign,rx-assign
> +		Boolean attributes ("0" or "1") that indicate unique
> +		queue assignment from the respective transmit or receive
> +		queue pool. When the "assign" attribute is enabled, a
> +		thread is assigned a queue that is not already assigned
> +		to another thread.
> +	- symmetric
> +		A boolean attribute ("0" or "1") that indicates the
> +		receive and transmit queue assignment for a thread
> +		should be the same. That is the assigned transmit queue
> +		index is equal to the assigned receive queue index.
> +	- task-queues
> +		A read-only attribute that lists the threads of the
> +		cgroup and their assigned queues.
> +
> +The mapping of global queues to device queues is in:
> +
> +  /sys/class/net/<dev>/queues/tx-<n>/global_queue_mapping
> +	-and -
> +  /sys/class/net/<dev>/queues/rx-<n>/global_queue_mapping
> +
> +A value of "none" indicates no mapping, an integer value (up to
> +a maximum of 32,766) indicates a global queue.
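
Putting the pieces together, a hypothetical configuration might look as
follows, assuming cgroup v2 with the controller's attributes exposed as
net_queues.* files (the exact file names are an assumption, as is the
cgroup path):

    # give the "app" cgroup a pool of 8 tx and 8 rx global queues (0-7)
    echo "0:8" > /sys/fs/cgroup/app/net_queues.tx-queues
    echo "0:8" > /sys/fs/cgroup/app/net_queues.rx-queues

    # assign each thread its own queue, same index for tx and rx
    echo 1 > /sys/fs/cgroup/app/net_queues.tx-assign
    echo 1 > /sys/fs/cgroup/app/net_queues.rx-assign
    echo 1 > /sys/fs/cgroup/app/net_queues.symmetric

    # inspect the resulting thread to queue assignments
    cat /sys/fs/cgroup/app/net_queues.task-queues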
> +
> +Suggested Configuration
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Unlike aRFS, PTQ requires per-application configuration. To
> +most effectively use PTQ some understanding of the threading model of
> +the application is warranted. The section above describes one possible
> +configuration strategy for a canonical application using SO_REUSEPORT.
> +
> +
>  Further Information
>  ===================
>  RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 05/11] net: Infrastructure for per queue aRFS
  2020-06-24 17:17 ` [RFC PATCH 05/11] net: Infrastructure for per queue aRFS Tom Herbert
@ 2020-06-28  8:55   ` kernel test robot
  0 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-06-28  8:55 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 7732 bytes --]

Hi Tom,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on net/master]
[also build test WARNING on ipvs/master net-next/master linus/master v5.8-rc2 next-20200624]
[cannot apply to cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Tom-Herbert/ptq-Per-Thread-Queues/20200625-012135
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 0275875530f692c725c6f993aced2eca2d6ac50c
config: s390-randconfig-s031-20200624 (attached as .config)
compiler: s390-linux-gcc (GCC) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 ARCH=s390 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

   net/core/dev.c:3264:23: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __wsum [usertype] csum @@     got unsigned int @@
   net/core/dev.c:3264:23: sparse:     expected restricted __wsum [usertype] csum
   net/core/dev.c:3264:23: sparse:     got unsigned int
   net/core/dev.c:3264:23: sparse: sparse: cast from restricted __wsum
   net/core/dev.c:3264:23: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __wsum [usertype] csum @@     got unsigned int @@
   net/core/dev.c:3264:23: sparse:     expected restricted __wsum [usertype] csum
   net/core/dev.c:3264:23: sparse:     got unsigned int
   net/core/dev.c:3264:23: sparse: sparse: incorrect type in argument 1 (different base types) @@     expected unsigned int [usertype] val @@     got restricted __wsum @@
   net/core/dev.c:3264:23: sparse:     expected unsigned int [usertype] val
   net/core/dev.c:3264:23: sparse:     got restricted __wsum
   net/core/dev.c:3264:23: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __wsum [usertype] csum @@     got unsigned int @@
   net/core/dev.c:3264:23: sparse:     expected restricted __wsum [usertype] csum
   net/core/dev.c:3264:23: sparse:     got unsigned int
   net/core/dev.c:3264:23: sparse: sparse: cast from restricted __wsum
   net/core/dev.c:3264:23: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __wsum [usertype] csum @@     got unsigned int @@
   net/core/dev.c:3264:23: sparse:     expected restricted __wsum [usertype] csum
   net/core/dev.c:3264:23: sparse:     got unsigned int
   net/core/dev.c:3264:23: sparse: sparse: cast from restricted __wsum
   net/core/dev.c:3264:23: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __wsum [usertype] csum @@     got unsigned int @@
   net/core/dev.c:3264:23: sparse:     expected restricted __wsum [usertype] csum
   net/core/dev.c:3264:23: sparse:     got unsigned int
   net/core/dev.c:3264:23: sparse: sparse: cast from restricted __wsum
   net/core/dev.c:3264:23: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __wsum [usertype] csum @@     got unsigned int @@
   net/core/dev.c:3264:23: sparse:     expected restricted __wsum [usertype] csum
   net/core/dev.c:3264:23: sparse:     got unsigned int
   net/core/dev.c:3264:23: sparse: sparse: cast from restricted __wsum
>> net/core/dev.c:4451:27: sparse: sparse: cast to non-scalar
>> net/core/dev.c:4451:27: sparse: sparse: cast from non-scalar
   net/core/dev.c:5614:1: sparse: sparse: symbol '__pcpu_scope_flush_works' was not declared. Should it be static?
   net/core/dev.c:3747:26: sparse: sparse: context imbalance in '__dev_queue_xmit' - different lock contexts for basic block
   net/core/dev.c:4922:44: sparse: sparse: context imbalance in 'net_tx_action' - unexpected unlock

# https://github.com/0day-ci/linux/commit/8cf630e2a48d7b6e18be2f46f90cebf8ec5d506c
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 8cf630e2a48d7b6e18be2f46f90cebf8ec5d506c
vim +4451 net/core/dev.c

c445477d74ab37 Ben Hutchings 2011-01-19  4426  
c445477d74ab37 Ben Hutchings 2011-01-19  4427  /**
c445477d74ab37 Ben Hutchings 2011-01-19  4428   * rps_may_expire_flow - check whether an RFS hardware filter may be removed
c445477d74ab37 Ben Hutchings 2011-01-19  4429   * @dev: Device on which the filter was set
c445477d74ab37 Ben Hutchings 2011-01-19  4430   * @rxq_index: RX queue index
c445477d74ab37 Ben Hutchings 2011-01-19  4431   * @flow_id: Flow ID passed to ndo_rx_flow_steer()
c445477d74ab37 Ben Hutchings 2011-01-19  4432   * @filter_id: Filter ID returned by ndo_rx_flow_steer()
c445477d74ab37 Ben Hutchings 2011-01-19  4433   *
c445477d74ab37 Ben Hutchings 2011-01-19  4434   * Drivers that implement ndo_rx_flow_steer() should periodically call
c445477d74ab37 Ben Hutchings 2011-01-19  4435   * this function for each installed filter and remove the filters for
c445477d74ab37 Ben Hutchings 2011-01-19  4436   * which it returns %true.
c445477d74ab37 Ben Hutchings 2011-01-19  4437   */
c445477d74ab37 Ben Hutchings 2011-01-19  4438  bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index,
c445477d74ab37 Ben Hutchings 2011-01-19  4439  			 u32 flow_id, u16 filter_id)
c445477d74ab37 Ben Hutchings 2011-01-19  4440  {
c445477d74ab37 Ben Hutchings 2011-01-19  4441  	struct netdev_rx_queue *rxqueue = dev->_rx + rxq_index;
c445477d74ab37 Ben Hutchings 2011-01-19  4442  	struct rps_dev_flow_table *flow_table;
8cf630e2a48d7b Tom Herbert   2020-06-24  4443  	struct rps_cpu_qid cpu_qid;
c445477d74ab37 Ben Hutchings 2011-01-19  4444  	struct rps_dev_flow *rflow;
c445477d74ab37 Ben Hutchings 2011-01-19  4445  	bool expire = true;
c445477d74ab37 Ben Hutchings 2011-01-19  4446  
c445477d74ab37 Ben Hutchings 2011-01-19  4447  	rcu_read_lock();
c445477d74ab37 Ben Hutchings 2011-01-19  4448  	flow_table = rcu_dereference(rxqueue->rps_flow_table);
c445477d74ab37 Ben Hutchings 2011-01-19  4449  	if (flow_table && flow_id <= flow_table->mask) {
c445477d74ab37 Ben Hutchings 2011-01-19  4450  		rflow = &flow_table->flows[flow_id];
8cf630e2a48d7b Tom Herbert   2020-06-24 @4451  		cpu_qid = READ_ONCE(rflow->cpu_qid);
8cf630e2a48d7b Tom Herbert   2020-06-24  4452  		if (rflow->filter == filter_id && !cpu_qid.use_qid &&
8cf630e2a48d7b Tom Herbert   2020-06-24  4453  		    cpu_qid.cpu < nr_cpu_ids &&
8cf630e2a48d7b Tom Herbert   2020-06-24  4454  		    ((int)(per_cpu(softnet_data, cpu_qid.cpu).input_queue_head -
c445477d74ab37 Ben Hutchings 2011-01-19  4455  			   rflow->last_qtail) <
c445477d74ab37 Ben Hutchings 2011-01-19  4456  		     (int)(10 * flow_table->mask)))
c445477d74ab37 Ben Hutchings 2011-01-19  4457  			expire = false;
c445477d74ab37 Ben Hutchings 2011-01-19  4458  	}
c445477d74ab37 Ben Hutchings 2011-01-19  4459  	rcu_read_unlock();
c445477d74ab37 Ben Hutchings 2011-01-19  4460  	return expire;
c445477d74ab37 Ben Hutchings 2011-01-19  4461  }
c445477d74ab37 Ben Hutchings 2011-01-19  4462  EXPORT_SYMBOL(rps_may_expire_flow);
c445477d74ab37 Ben Hutchings 2011-01-19  4463  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

_______________________________________________
kbuild mailing list -- kbuild(a)lists.01.org
To unsubscribe send an email to kbuild-leave(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 19564 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 11/11] doc: Documentation for Per Thread Queues
  2020-06-24 17:17 ` [RFC PATCH 11/11] doc: Documentation for Per Thread Queues Tom Herbert
  2020-06-25  2:20   ` kernel test robot
  2020-06-25 23:00   ` Jacob Keller
@ 2020-06-29  6:28   ` Saeed Mahameed
  2020-06-29 15:10     ` Tom Herbert
  2 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2020-06-29  6:28 UTC (permalink / raw)
  To: netdev, tom

On Wed, 2020-06-24 at 10:17 -0700, Tom Herbert wrote:
> Add a section on Per Thread Queues to scaling.rst.
> ---
>  Documentation/networking/scaling.rst | 195
> ++++++++++++++++++++++++++-
>  1 file changed, 194 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/scaling.rst
> b/Documentation/networking/scaling.rst
> index 8f0347b9fb3d..42f1dc639ab7 100644
> --- a/Documentation/networking/scaling.rst
> +++ b/Documentation/networking/scaling.rst
> @@ -250,7 +250,7 @@ RFS: Receive Flow Steering
>  While RPS steers packets solely based on hash, and thus generally
>  provides good load distribution, it does not take into account
>  application locality. This is accomplished by Receive Flow Steering
> -(RFS). The goal of RFS is to increase datacache hitrate by steering
> +(RFS). The goal of RFS is to increase datacache hit rate by steering
>  kernel processing of packets to the CPU where the application thread
>  consuming the packet is running. RFS relies on the same RPS
> mechanisms
>  to enqueue packets onto the backlog of another CPU and to wake up
> that
> @@ -508,6 +508,199 @@ a max-rate attribute is supported, by setting a
> Mbps value to::
>  A value of zero means disabled, and this is the default.
>  
>  
> +PTQ: Per Thread Queues
> +======================
> +
> +Per Thread Queues allows application threads to be assigned 

I think I am a bit confused about the definition of Thread in this
context. Is it a networking kernel thread, or an actual application
user thread?

> dedicated
> +hardware network queues for both transmit and receive. This facility

"dedicated hardware network queues" seems a bit out of context here, as
from the looks of it, the series only deals with lightweight mapping
between threads and hardware queues, as opposed to CPUs (XPS) and aRFS
(HW flow steering) mappings.

But for someone like me, from device drivers and XDP-XSK/RDMA world,
"dedicated hardware queues", means a dedicated and isolated hw queue
per thread/app which can't be shared with others, and I don't see any
mention of creation of new dedicated hw queues per thread in this
patchset, just re-mapping of pre-existing HW queues.

So no matter how you look at it, there will only be #CPUs hardware
queues (at most) that must be shared somehow if you want to run more
than #CPUs threads.

So either I am missing something, or this is actually the point of the
series, in which case this limitation on the #threads or the sharing
should be clarified.


> +provides a high degree of traffic isolation between applications and
> +can also help facilitate high performance due to fine grained packet
> +steering.
> +
> +PTQ has three major design components:
> +	- A method to assign transmit and receive queues to threads
> +	- A means to associate packets with threads and then to steer
> +	  those packets to the queues assigned to the threads
> +	- Mechanisms to process the per thread hardware queues
> +
> +Global network queues
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Global network queues are an abstraction of hardware networking
> +queues that can be used in generic non-device specific
> configuration.
> +Global queues may be mapped to real device queues. The mapping is
> +performed on a per device queue basis. A device sysfs parameter
> +"global_queue_mapping" in queues/{tx,rx}-<num> indicates the mapping
> +of a device queue to a global queue. Each device maintains a table
> +that maps global queues to device queues for the device. Note that
> +for a single device, the global to device queue mapping is 1 to 1,
> +however each device may map a global queue to a different device
> +queue.
> +
> +net_queues cgroup controller
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For assigning queues to the threads, a cgroup controller named
> +"net_queues" is used. A cgroup can be configured with pools of
> transmit
> +and receive global queues from which individual threads are assigned
> +queues. The contents of the net_queues controller are described
> below in
> +the configuration section.
> +
> +Handling PTQ in the transmit path
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a socket operation is performed that may result in sending
> packets
> +(i.e. listen, accept, sendmsg, sendpage), the task structure for the
> +current thread is consulted to see if there is an assigned transmit
> +queue for the thread. If there is a queue assignment, the queue
> index is
> +set in a field of the sock structure for the corresponding socket.
> +Subsequently, when transmit queue selection is performed, the sock
> +structure associated with the packet being sent is consulted. If a
> transmit
> +global queue is set in the sock then that index is mapped to a
> device
> +queue for the output networking device. If a valid device queue is
> +discovered then that queue is used, else if a device queue is not
> found
> +then queue selection proceeds to XPS.
> +
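To make the precedence just described concrete, here is a minimal
userspace sketch of the selection order (PTQ global queue first,
XPS-style fallback second). The structures and names (tx_gqid,
gqid_to_dqid) are illustrative stand-ins, not the identifiers used in
the patches:

/* Toy model of the transmit path described above: a socket carries a
 * global queue id (gqid) recorded at socket-op time; if the egress
 * device maps that gqid to one of its own queues, use it, otherwise
 * fall back to XPS-style selection.  Illustrative only.
 */
#include <stdio.h>

#define NO_QUEUE  (-1)
#define NUM_GQIDS 8

struct toy_sock {
	int tx_gqid;			/* global TX queue recorded for the thread */
};

struct toy_netdev {
	int num_tx_queues;
	int gqid_to_dqid[NUM_GQIDS];	/* per-device global->device queue table */
};

static int pick_tx_queue(const struct toy_netdev *dev, const struct toy_sock *sk)
{
	if (sk->tx_gqid >= 0 && sk->tx_gqid < NUM_GQIDS) {
		int dqid = dev->gqid_to_dqid[sk->tx_gqid];

		if (dqid != NO_QUEUE && dqid < dev->num_tx_queues)
			return dqid;	/* PTQ mapping takes precedence */
	}
	return 0;			/* stand-in for the XPS/hash fallback */
}

int main(void)
{
	struct toy_netdev dev = {
		.num_tx_queues = 4,
		.gqid_to_dqid = { [0 ... NUM_GQIDS - 1] = NO_QUEUE },
	};
	struct toy_sock sk = { .tx_gqid = 5 };

	dev.gqid_to_dqid[5] = 3;	/* global queue 5 -> device queue 3 */
	printf("selected txq %d\n", pick_tx_queue(&dev, &sk));
	return 0;
}
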
> +Handling PTQ in the receive path
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The receive path uses the infrastructure of RFS which is extended
> +to steer based on the assigned receive global queue for a thread in
> +addition to steering based on the CPU. The rps_sock_flow_table is
> +modified to contain either the desired CPU for flows or the desired
> +receive global queue. A queue is updated at the same time that the
> +desired CPU would be updated during calls to recvmsg and sendmsg (see
> RFS
> +description above). The process is to consult the running task
> structure
> +to see if a receive queue is assigned to the task. If a queue is
> assigned
> +to the task then the corresponding queue index is set in the
> +rps_sock_flow_table; if no queue is assigned then the current CPU is
> +set as the desired CPU per canonical RFS.
> +
> +When packets are received, the rps_sock_flow table is consulted to
> check
> +if they were received on the proper queue. If the
> rps_sock_flow_table
> +entry for a corresponding flow of a received packet contains a
> global
> +queue index, then the index is mapped to a device queue on the
> receiving
> +device. If the mapped device queue is equal to the receive queue
> then
> +packets are being steered properly. If there is a mismatch then the
> +local flow to queue mapping in the device is changed and
> +ndo_rx_flow_steer is invoked to set the receive queue for the flow
> in
> +the device as described in the aRFS section.
> +
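A similarly rough sketch of the receive-side check described above,
assuming a flow-table entry can hold either a CPU or a global RX queue
id; every name here is hypothetical and the device callback is reduced
to a print:

/* Toy model: on receive, if the flow entry records a global RX queue,
 * map it to a device queue and reprogram steering (the
 * ndo_rx_flow_steer stand-in) only when the packet arrived on a
 * different queue.
 */
#include <stdbool.h>
#include <stdio.h>

#define NO_QUEUE  (-1)
#define NUM_GQIDS 8

struct flow_entry {
	bool use_qid;			/* entry holds a queue id, not a CPU */
	int  cpu;
	int  rx_gqid;
};

struct toy_netdev {
	int gqid_to_rx_dqid[NUM_GQIDS];	/* per-device global->device RX map */
	int steered_dqid;		/* last queue programmed for this flow */
};

static void steer_flow(struct toy_netdev *dev, int dqid)
{
	dev->steered_dqid = dqid;	/* stand-in for ndo_rx_flow_steer() */
	printf("reprogram flow to device rx queue %d\n", dqid);
}

static void check_rx_queue(struct toy_netdev *dev, const struct flow_entry *e,
			   int arrived_on)
{
	int want;

	if (!e->use_qid || e->rx_gqid < 0 || e->rx_gqid >= NUM_GQIDS)
		return;			/* canonical RFS/CPU path */

	want = dev->gqid_to_rx_dqid[e->rx_gqid];
	if (want == NO_QUEUE || want == arrived_on)
		return;			/* already steered correctly */

	steer_flow(dev, want);		/* cache miss: install flow state */
}

int main(void)
{
	struct toy_netdev dev = {
		.gqid_to_rx_dqid = { [0 ... NUM_GQIDS - 1] = NO_QUEUE },
	};
	struct flow_entry e = { .use_qid = true, .rx_gqid = 2 };

	dev.gqid_to_rx_dqid[2] = 6;
	check_rx_queue(&dev, &e, 1);	/* arrived on queue 1, want 6: steer */
	check_rx_queue(&dev, &e, 6);	/* already on the desired queue */
	return 0;
}
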
> +Processing queues in Per Queue Threads
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When Per Queue Threads is used, the queue "follows" the thread. So
> when
> +a thread is rescheduled from one CPU to another we expect that the
> +device queues that map to the thread are processed
> on
> +the CPU where the thread is currently running. This is a bit tricky
> +especially with respect to the canonical device interrupt driven
> model.
> +There are at least three possible approaches:
> +	- Arrange for interrupts to follow threads as they are
> +	  rescheduled, or alternatively pin threads to CPUs and
> +	  statically configure the interrupt mappings for the queues
> for
> +	  each thread
> +	- Use busy polling
> +	- Use "sleeping busy-poll" with completion queues. The basic
> +	  idea is to have one CPU busy poll a device completion queue
> +	  that reports device queues with received or completed
> transmit
> +	  packets. When a queue is ready, the thread associated with
> the
> +	  queue (derived by reverse mapping the queue back to its
> +	  assigned thread) is scheduled. When the thread runs it polls
> +	  its queues to process any packets.
> +
> +Future work may further elaborate on solutions in this area.
> +
> +Reducing flow state in devices
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +PTQ (and aRFS as well) potentially create per flow state in a
> device.
> +This is costly in at least two ways: 1) State requires device memory
> +which is almost always much smaller than host memory, and thus the
> +number of flows that can be instantiated in a device is less than
> that
> +in the host. 2) State requires instantiation and synchronization
> +messages, i.e. ndo_rx_flow_steer causes a message over PCIe bus; if
> +there is a high turnover rate of connections this messaging
> becomes
> +a bottleneck.
> +
> +Mitigations to reduce the amount of flow state in the device should
> be
> +considered.
> +
> +In PTQ (and aRFS) the device flow state is considered a cache. A
> flow
> +entry is only set in the device on a cache miss which occurs when
> the
> +receive queue for a packet doesn't match the desired receive queue.
> So
> +conceptually, if packets for a flow are always received on the
> desired
> +queue from the beginning of the flow then flow state might never
> need
> +to be instantiated in the device. This motivates a strategy to try
> to
> +use stateless steering mechanisms before resorting to stateful ones.
> +
> +As an example of applying this strategy, consider an application
> that
> +creates four threads where each threads creates a TCP listener
> socket
> +for some port that is shared amongst the threads via SO_REUSEPORT.
> +Four global queues can be assigned to the application (via a cgroup
> +for the application), and a filter rule can be set up in each device
> +that matches the listener port and any bound destination address.
> The
> +filter maps to a set of four device queues that map to the four
> global
> +queues for the application. When a packet is received that matches
> the
> +filter, one of the four queues is chosen via a hash over the
> packet's
> +four tuple. So in this manner, packets for the application are
> +distributed amongst the four threads. As long as processing for
> sockets
> +doesn't move between threads and the number of listener threads is
> +constant then packets are always received on the desired queue and
> no
> +flow state needs to be instantiated. In practice, we want to allow
> +elasticity in applications to create and destroy threads on demand,
> so
> +additional techniques, such as consistent hashing, are probably
> needed.
> +
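The userspace half of the example above can be as simple as the sketch
below: four threads each open a TCP listener on the same port with
SO_REUSEPORT. The accompanying device filter and queue setup is
configured separately and is not shown; port 8000 and the thread count
are arbitrary choices for illustration:

/* Four threads, one SO_REUSEPORT TCP listener each on the same port. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15		/* Linux value, in case libc headers lack it */
#endif

#define PORT    8000
#define NTHREAD 4

static void *listener(void *arg)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr;

	if (fd < 0)
		return NULL;
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(PORT);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
	    listen(fd, 128) == 0)
		printf("thread %ld listening on port %d\n", (long)arg, PORT);

	pause();		/* accept loop omitted for brevity */
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREAD];
	long i;

	for (i = 0; i < NTHREAD; i++)
		pthread_create(&tid[i], NULL, listener, (void *)i);
	for (i = 0; i < NTHREAD; i++)
		pthread_join(tid[i], NULL);
	return 0;
}
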
> +Per Thread Queues Configuration
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Per Thread Queues is only available if the kernel is compiled with
> +CONFIG_PER_THREAD_QUEUES. For PTQ in the receive path, aRFS needs to
> be
> +supported and configured (see aRFS section above).
> +
> +The net_queues cgroup controller is in:
> +	/sys/fs/cgroup/<cgrp>/net_queues
> +
> +The net_queues controller contains the following attributes:
> +	- tx-queues, rx-queues
> +		Specifies the transmit queue pool and receive queue
> pool
> +		respectively as a range of global queue indices. The
> +		format of these entries is "<base>:<extent>" where
> +		<base> is the first queue index in the pool, and
> +		<extent> is the number of queues in the pool.
> +		If <extent> is zero the queue pool is empty.
> +	- tx-assign,rx-assign
> +		Boolean attributes ("0" or "1") that indicate unique
> +		queue assignment from the respective transmit or
> receive
> +		queue pool. When the "assign" attribute is enabled, a
> +		thread is assigned a queue that is not already assigned
> +		to another thread.
> +	- symmetric
> +		A boolean attribute ("0" or "1") that indicates the
> +		receive and transmit queue assignment for a thread
> +		should be the same. That is the assigned transmit queue
> +		index is equal to the assigned receive queue index.
> +	- task-queues
> +		A read-only attribute that lists the threads of the
> +		cgroup and their assigned queues.
> +
> +The mapping of global queues to device queues is in:
> +
> +  /sys/class/net/<dev>/queues/tx-<n>/global_queue_mapping
> +	-and -
> +  /sys/class/net/<dev>/queues/rx-<n>/global_queue_mapping
> +
> +A value of "none" indicates no mapping, an integer value (up to
> +a maximum of 32,766) indicates a global queue.
> +
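A small sketch of driving this configuration from a program. The
global_queue_mapping path is taken verbatim from the text above; the
cgroup path (/sys/fs/cgroup/app) and the attribute file names
(net_queues.tx-queues and so on) are assumptions based on cgroup v2
naming conventions and may not match the actual patches:

/* Write the assumed cgroup attributes and the documented per-queue
 * sysfs mapping file.  Paths and values are illustrative only.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Give the cgroup a pool of 4 global TX and RX queues starting at
	 * 800, with unique per-thread assignment (file names assumed). */
	write_str("/sys/fs/cgroup/app/net_queues.tx-queues", "800:4");
	write_str("/sys/fs/cgroup/app/net_queues.rx-queues", "800:4");
	write_str("/sys/fs/cgroup/app/net_queues.tx-assign", "1");
	write_str("/sys/fs/cgroup/app/net_queues.rx-assign", "1");

	/* Map device queue rx-0 of eth0 to global queue 800 (path from the
	 * documentation above). */
	write_str("/sys/class/net/eth0/queues/rx-0/global_queue_mapping", "800");
	return 0;
}
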
> +Suggested Configuration
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Unlike aRFS, PTQ requires per application configuration.
> To
> +most effectively use PTQ some understanding of the threading model
> of
> +the application is warranted. The section above describes one
> possible
> +configuration strategy for a canonical application using
> SO_REUSEPORT.
> +
> +
>  Further Information
>  ===================
>  RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated
> into

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 11/11] doc: Documentation for Per Thread Queues
  2020-06-29  6:28   ` Saeed Mahameed
@ 2020-06-29 15:10     ` Tom Herbert
  0 siblings, 0 replies; 24+ messages in thread
From: Tom Herbert @ 2020-06-29 15:10 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: netdev

On Sun, Jun 28, 2020 at 11:28 PM Saeed Mahameed <saeedm@mellanox.com> wrote:
>
> On Wed, 2020-06-24 at 10:17 -0700, Tom Herbert wrote:
> > Add a section on Per Thread Queues to scaling.rst.
> > ---
> >  Documentation/networking/scaling.rst | 195
> > ++++++++++++++++++++++++++-
> >  1 file changed, 194 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/networking/scaling.rst
> > b/Documentation/networking/scaling.rst
> > index 8f0347b9fb3d..42f1dc639ab7 100644
> > --- a/Documentation/networking/scaling.rst
> > +++ b/Documentation/networking/scaling.rst
> > @@ -250,7 +250,7 @@ RFS: Receive Flow Steering
> >  While RPS steers packets solely based on hash, and thus generally
> >  provides good load distribution, it does not take into account
> >  application locality. This is accomplished by Receive Flow Steering
> > -(RFS). The goal of RFS is to increase datacache hitrate by steering
> > +(RFS). The goal of RFS is to increase datacache hit rate by steering
> >  kernel processing of packets to the CPU where the application thread
> >  consuming the packet is running. RFS relies on the same RPS
> > mechanisms
> >  to enqueue packets onto the backlog of another CPU and to wake up
> > that
> > @@ -508,6 +508,199 @@ a max-rate attribute is supported, by setting a
> > Mbps value to::
> >  A value of zero means disabled, and this is the default.
> >
> >
> > +PTQ: Per Thread Queues
> > +======================
> > +
> > +Per Thread Queues allows application threads to be assigned
>
> I think I am a bit confused about the definition of Thread in this
> context. Is it a networking kernel thread? Or an actual application
> user thread?
>
Actual application user threads.

> > dedicated
> > +hardware network queues for both transmit and receive. This facility
>
> "dedicated hardware network queues" seems a bit out of context here, as
> from the looks of it, the series only deals with lightweight mapping
> between threads and hardware queues, as opposed to CPUs (XPS) and aRFS
> (HW flow steering) mappings.

Global queues are mapped to hardware queues via some function that
takes both the global queue and device as input. If the mapping is 1-1
per device then it is equivalent to providing dedicated HW queues to
threads.
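
A rough illustration of that point, assuming the mapping is simply a
per-device table indexed by global queue id; the same gqid can resolve
to different device queues on different NICs while staying 1-1 within
each NIC (all names hypothetical):

#include <stdio.h>

#define NO_QUEUE  (-1)
#define NUM_GQIDS 8

struct toy_netdev {
	const char *name;
	int gqid_to_dqid[NUM_GQIDS];
};

static void init_dev(struct toy_netdev *dev, const char *name)
{
	int i;

	dev->name = name;
	for (i = 0; i < NUM_GQIDS; i++)
		dev->gqid_to_dqid[i] = NO_QUEUE;	/* unmapped by default */
}

int main(void)
{
	struct toy_netdev eth0, eth1;
	int gqid = 5;

	init_dev(&eth0, "eth0");
	init_dev(&eth1, "eth1");
	eth0.gqid_to_dqid[gqid] = 3;	/* gqid 5 is device queue 3 on eth0 */
	eth1.gqid_to_dqid[gqid] = 12;	/* ...and device queue 12 on eth1 */

	printf("gqid %d -> %s:%d, %s:%d\n", gqid,
	       eth0.name, eth0.gqid_to_dqid[gqid],
	       eth1.name, eth1.gqid_to_dqid[gqid]);
	return 0;
}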

>
> But for someone like me, from device drivers and XDP-XSK/RDMA world,
> "dedicated hardware queues", means a dedicated and isolated hw queue
> per thread/app which can't be shared with others, and I don't see any
> mention of creation of new dedicated hw queues per thread in this
> patchset, just re-mapping of pre-existing HW queues.
>
Hardware queues are generic entities that have no particular semantics
until they're instantiated with some sort of packet steering
configuration (e.g. RSS, aRFS, tc filters on receive; XPS, hash
mapping on selection). For a dedicated hardware receive queue, tc
receive filtering and aRFS (steering to a queue instead of a CPU)
provide the necessary isolation for receive, and grabbing the assigned
transmit queue from the running thread and using that in lieu of XPS
provides for transmit isolation. Note that dedicated queues won't be in
RSS or aRFS maps, and neither in XPS maps. So if queue #800 is assigned
to a thread for TX and RX then we expect all packets for the sockets
processed by that thread to be received and sent on that queue. The TX
side should be completely accurate in this regard since we select the
queue with full state for the socket, its thread, and the running CPU.
Receive is accurate up to the ability to program the device for all
possible packets that might be received on the sockets.
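
One way to picture the isolation described here: dedicated queues are
simply left out of the RSS indirection table (and, analogously, out of
XPS maps), so hashed traffic never lands on them. A toy sketch, not a
driver API:

/* Build an indirection table that spreads only over non-dedicated
 * (shared) queues; queues assigned to threads never appear in it.
 */
#include <stdbool.h>
#include <stdio.h>

#define NUM_RXQ    8
#define INDIR_SIZE 16

static void build_indir(const bool dedicated[NUM_RXQ], int indir[INDIR_SIZE])
{
	int shared[NUM_RXQ], nshared = 0, i;

	for (i = 0; i < NUM_RXQ; i++)
		if (!dedicated[i])
			shared[nshared++] = i;

	if (!nshared)
		return;			/* nothing left to hash over */

	for (i = 0; i < INDIR_SIZE; i++)
		indir[i] = shared[i % nshared];
}

int main(void)
{
	bool dedicated[NUM_RXQ] = { false };
	int indir[INDIR_SIZE], i;

	dedicated[6] = true;		/* queue 6 assigned to a thread */
	dedicated[7] = true;		/* queue 7 as well */

	build_indir(dedicated, indir);
	for (i = 0; i < INDIR_SIZE; i++)
		printf("%d ", indir[i]);	/* 6 and 7 never appear */
	printf("\n");
	return 0;
}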

> So no matter how you look at it, there will only be #CPUs hardware
> queues (at most) that must be shared somehow if you want to run more
> than #CPUs threads.

Yes, there is no requirement that _all_ threads have to use dedicated
queues. We will always need some number of general non-dedicated
queues to handle "other traffic". In fact, I think the dedicated
queues would more likely be used in cases where isolation or
performance is crucial.

>
> So either I am missing something, or this is actually the point of the
> series, in which case this limitation on the #threads or the sharing
> should be clarified.
>
I'm not sure what the limitation here is -- I believe this patch set is
overcoming limitations. Some hardware devices support thousands of
queues, and we have been unable to leverage those since, for packet
steering to date, having more queues than CPUs didn't make much sense.
Using queues for isolation of threads' traffic makes sense, we just
need the right infrastructure. This also solves one of the nagging
problems in aRFS: when a thread is scheduled on a different CPU all of
the thread's sockets thrash to using a different queue (i.e. a whole
bunch of ndo_rx_flow_steer calls). In PTQ, the receive queue for the
thread's sockets follows the thread, so when it's rescheduled to a new
CPU there's no work to do. This does mean that we probably need
alternative mechanisms to the canonical interrupt-per-queue model; for
that we can use busy polling or we can leverage device completion
queues (the latter is follow-on work).
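
For the busy-polling option, the existing per-socket knob already gives
a feel for it; the sketch below sets a 50 microsecond busy-poll budget
on a socket. SO_BUSY_POLL is a long-standing Linux socket option; the
fallback define covers older libc headers, and the budget value is
arbitrary:

/* Ask the kernel to busy poll the socket's receive queue for up to
 * 50 usec before sleeping.
 */
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int usecs = 50;

	if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
				 &usecs, sizeof(usecs)) < 0) {
		perror("SO_BUSY_POLL");
		return 1;
	}
	printf("busy poll budget set to %d usec\n", usecs);
	return 0;
}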

>
> > +provides a high degree of traffic isolation between applications and
> > +can also help facilitate high performance due to fine grained packet
> > +steering.
> > +
> > +PTQ has three major design components:
> > +     - A method to assign transmit and receive queues to threads
> > +     - A means to associate packets with threads and then to steer
> > +       those packets to the queues assigned to the threads
> > +     - Mechanisms to process the per thread hardware queues
> > +
> > +Global network queues
> > +~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Global network queues are an abstraction of hardware networking
> > +queues that can be used in generic non-device specific
> > configuration.
> > +Global queues may be mapped to real device queues. The mapping is
> > +performed on a per device queue basis. A device sysfs parameter
> > +"global_queue_mapping" in queues/{tx,rx}-<num> indicates the mapping
> > +of a device queue to a global queue. Each device maintains a table
> > +that maps global queues to device queues for the device. Note that
> > +for a single device, the global to device queue mapping is 1 to 1,
> > +however each device may map a global queue to a different device
> > +queue.
> > +
> > +net_queues cgroup controller
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +For assigning queues to the threads, a cgroup controller named
> > +"net_queues" is used. A cgroup can be configured with pools of
> > transmit
> > +and receive global queues from which individual threads are assigned
> > +queues. The contents of the net_queues controller are described
> > below in
> > +the configuration section.
> > +
> > +Handling PTQ in the transmit path
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When a socket operation is performed that may result in sending
> > packets
> > +(i.e. listen, accept, sendmsg, sendpage), the task structure for the
> > +current thread is consulted to see if there is an assigned transmit
> > +queue for the thread. If there is a queue assignment, the queue
> > index is
> > +set in a field of the sock structure for the corresponding socket.
> > +Subsequently, when transmit queue selection is performed, the sock
> > +structure associated with the packet being sent is consulted. If a
> > transmit
> > +global queue is set in the sock then that index is mapped to a
> > device
> > +queue for the output networking device. If a valid device queue is
> > +discovered then that queue is used, else if a device queue is not
> > found
> > +then queue selection proceeds to XPS.
> > +
> > +Handling PTQ in the receive path
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The receive path uses the infrastructure of RFS which is extended
> > +to steer based on the assigned receive global queue for a thread in
> > +addition to steering based on the CPU. The rps_sock_flow_table is
> > +modified to contain either the desired CPU for flows or the desired
> > +receive global queue. A queue is updated at the same time that the
> > +desired CPU would be updated during calls to recvmsg and sendmsg (see
> > RFS
> > +description above). The process is to consult the running task
> > structure
> > +to see if a receive queue is assigned to the task. If a queue is
> > assigned
> > +to the task then the corresponding queue index is set in the
> > +rps_sock_flow_table; if no queue is assigned then the current CPU is
> > +set as the desired CPU per canonical RFS.
> > +
> > +When packets are received, the rps_sock_flow table is consulted to
> > check
> > +if they were received on the proper queue. If the
> > rps_sock_flow_table
> > +entry for a corresponding flow of a received packet contains a
> > global
> > +queue index, then the index is mapped to a device queue on the
> > receiving
> > +device. If the mapped device queue is equal to the receive queue
> > then
> > +packets are being steered properly. If there is a mismatch then the
> > +local flow to queue mapping in the device is changed and
> > +ndo_rx_flow_steer is invoked to set the receive queue for the flow
> > in
> > +the device as described in the aRFS section.
> > +
> > +Processing queues in Per Queue Threads
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When Per Queue Threads is used, the queue "follows" the thread. So
> > when
> > +a thread is rescheduled from one CPU to another we expect that the
> > +device queues that map to the thread are processed
> > on
> > +the CPU where the thread is currently running. This is a bit tricky
> > +especially with respect to the canonical device interrupt driven
> > model.
> > +There are at least three possible approaches:
> > +     - Arrange for interrupts to follow threads as they are
> > +       rescheduled, or alternatively pin threads to CPUs and
> > +       statically configure the interrupt mappings for the queues
> > for
> > +       each thread
> > +     - Use busy polling
> > +     - Use "sleeping busy-poll" with completion queues. The basic
> > +       idea is to have one CPU busy poll a device completion queue
> > +       that reports device queues with received or completed
> > transmit
> > +       packets. When a queue is ready, the thread associated with
> > the
> > +       queue (derived by reverse mapping the queue back to its
> > +       assigned thread) is scheduled. When the thread runs it polls
> > +       its queues to process any packets.
> > +
> > +Future work may further elaborate on solutions in this area.
> > +
> > +Reducing flow state in devices
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +PTQ (and aRFS as well) potentially create per flow state in a
> > device.
> > +This is costly in at least two ways: 1) State requires device memory
> > +which is almost always much smaller than host memory, and thus the
> > +number of flows that can be instantiated in a device is less than
> > that
> > +in the host. 2) State requires instantiation and synchronization
> > +messages, i.e. ndo_rx_flow_steer causes a message over PCIe bus; if
> > +there is a high turnover rate of connections this messaging
> > becomes
> > +a bottleneck.
> > +
> > +Mitigations to reduce the amount of flow state in the device should
> > be
> > +considered.
> > +
> > +In PTQ (and aRFS) the device flow state is considered a cache. A
> > flow
> > +entry is only set in the device on a cache miss which occurs when
> > the
> > +receive queue for a packet doesn't match the desired receive queue.
> > So
> > +conceptually, if packets for a flow are always received on the
> > desired
> > +queue from the beginning of the flow then flow state might never
> > need
> > +to be instantiated in the device. This motivates a strategy to try
> > to
> > +use stateless steering mechanisms before resorting to stateful ones.
> > +
> > +As an example of applying this strategy, consider an application
> > that
> > +creates four threads where each threads creates a TCP listener
> > socket
> > +for some port that is shared amongst the threads via SO_REUSEPORT.
> > +Four global queues can be assigned to the application (via a cgroup
> > +for the application), and a filter rule can be set up in each device
> > +that matches the listener port and any bound destination address.
> > The
> > +filter maps to a set of four device queues that map to the four
> > global
> > +queues for the application. When a packet is received that matches
> > the
> > +filter, one of the four queues is chosen via a hash over the
> > packet's
> > +four tuple. So in this manner, packets for the application are
> > +distributed amongst the four threads. As long as processing for
> > sockets
> > +doesn't move between threads and the number of listener threads is
> > +constant then packets are always received on the desired queue and
> > no
> > +flow state needs to be instantiated. In practice, we want to allow
> > +elasticity in applications to create and destroy threads on demand,
> > so
> > +additional techniques, such as consistent hashing, are probably
> > needed.
> > +
> > +Per Thread Queues Configuration
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Per Thread Queues is only available if the kernel is compiled with
> > +CONFIG_PER_THREAD_QUEUES. For PTQ in the receive path, aRFS needs to
> > be
> > +supported and configured (see aRFS section above).
> > +
> > +The net_queues cgroup controller is in:
> > +     /sys/fs/cgroup/<cgrp>/net_queues
> > +
> > +The net_queues controller contains the following attributes:
> > +     - tx-queues, rx-queues
> > +             Specifies the transmit queue pool and receive queue
> > pool
> > +             respectively as a range of global queue indices. The
> > +             format of these entries is "<base>:<extent>" where
> > +             <base> is the first queue index in the pool, and
> > +             <extent> is the number of queues in the pool.
> > +             If <extent> is zero the queue pool is empty.
> > +     - tx-assign,rx-assign
> > +             Boolean attributes ("0" or "1") that indicate unique
> > +             queue assignment from the respective transmit or
> > receive
> > +             queue pool. When the "assign" attribute is enabled, a
> > +             thread is assigned a queue that is not already assigned
> > +             to another thread.
> > +     - symmetric
> > +             A boolean attribute ("0" or "1") that indicates the
> > +             receive and transmit queue assignment for a thread
> > +             should be the same. That is the assigned transmit queue
> > +             index is equal to the assigned receive queue index.
> > +     - task-queues
> > +             A read-only attribute that lists the threads of the
> > +             cgroup and their assigned queues.
> > +
> > +The mapping of global queues to device queues is in:
> > +
> > +  /sys/class/net/<dev>/queues/tx-<n>/global_queue_mapping
> > +     -and -
> > +  /sys/class/net/<dev>/queues/rx-<n>/global_queue_mapping
> > +
> > +A value of "none" indicates no mapping, an integer value (up to
> > +a maximum of 32,766) indicates a global queue.
> > +
> > +Suggested Configuration
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Unlike aRFS, PTQ requires per application configuration.
> > To
> > +most effectively use PTQ some understanding of the threading model
> > of
> > +the application is warranted. The section above describes one
> > possible
> > +configuration strategy for a canonical application using
> > SO_REUSEPORT.
> > +
> > +
> >  Further Information
> >  ===================
> >  RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated
> > into

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/11] net: Introduce global queues
  2020-06-24 17:17 ` [RFC PATCH 07/11] net: Introduce global queues Tom Herbert
                     ` (2 preceding siblings ...)
  2020-06-25  0:23   ` kernel test robot
@ 2020-06-30 21:06   ` Jonathan Lemon
  3 siblings, 0 replies; 24+ messages in thread
From: Jonathan Lemon @ 2020-06-30 21:06 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev

On Wed, Jun 24, 2020 at 10:17:46AM -0700, Tom Herbert wrote:
> Global queues, or gqids, are an abstract representation of NIC
> device queues. They are global in the sense that the each gqid
> can be map to a queue in each device, i.e. if there are multiple
> devices in the system, a gqid can map to a different queue, a dqid,
> in each device in a one to many mapping.  gqids are used for
> configuring packet steering on both send and receive in a generic
> way not bound to a particular device.
> 
> Each transmit or receive device queue may be reversed mapped to
> one gqid. Each device maintains a table mapping gqids to local
> device queues, those tables are used in the data path to convert
> a gqid receive or transmit queue into a device queue relative to
> the sending or receiving device.

I'm confused by this word salad, can it be simplified?

So a RX device queue maps to one global queue, implying that there's a
one way relationship here.  But at the same time, the second sentence
implies each device can map a global RX queue to a device queue.

This would logically mean that for a given device, there's a 1:1
relationship between global and device queue, and the only 'one-to-many'
portion is coming from mapping global queues across different devices.

How would I do this:
    given device eth0
    create new RSS context 200
    create RX queues 800, 801, added to RSS context 200
    create global RX queue for context 200 
    attach 4 sockets to context 200

I'm assuming that each socket ends up being flow-assigned to one of the
underlying device queues (800 or 801), correct?
-- 
Jonathan

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-06-30 21:06 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-24 17:17 [RFC PATCH 00/11] ptq: Per Thread Queues Tom Herbert
2020-06-24 17:17 ` [RFC PATCH 01/11] cgroup: Export cgroup_{procs,threads}_start and cgroup_procs_next Tom Herbert
2020-06-24 17:17 ` [RFC PATCH 02/11] net: Create netqueue.h and define NO_QUEUE Tom Herbert
2020-06-24 17:17 ` [RFC PATCH 03/11] arfs: Create set_arfs_queue Tom Herbert
2020-06-24 17:17 ` [RFC PATCH 04/11] net-sysfs: Create rps_create_sock_flow_table Tom Herbert
2020-06-24 17:17 ` [RFC PATCH 05/11] net: Infrastructure for per queue aRFS Tom Herbert
2020-06-28  8:55   ` kernel test robot
2020-06-24 17:17 ` [RFC PATCH 06/11] net: Function to check against maximum number for RPS queues Tom Herbert
2020-06-24 17:17 ` [RFC PATCH 07/11] net: Introduce global queues Tom Herbert
2020-06-24 23:00   ` kernel test robot
2020-06-24 23:58   ` kernel test robot
2020-06-25  0:23   ` kernel test robot
2020-06-30 21:06   ` Jonathan Lemon
2020-06-24 17:17 ` [RFC PATCH 08/11] ptq: Per Thread Queues Tom Herbert
2020-06-24 21:20   ` kernel test robot
2020-06-25  1:50   ` [RFC PATCH] ptq: null_pcdesc can be static kernel test robot
2020-06-25  7:26   ` [RFC PATCH 08/11] ptq: Per Thread Queues kernel test robot
2020-06-24 17:17 ` [RFC PATCH 09/11] ptq: Hook up transmit side of Per Queue Threads Tom Herbert
2020-06-24 17:17 ` [RFC PATCH 10/11] ptq: Hook up receive " Tom Herbert
2020-06-24 17:17 ` [RFC PATCH 11/11] doc: Documentation for Per Thread Queues Tom Herbert
2020-06-25  2:20   ` kernel test robot
2020-06-25 23:00   ` Jacob Keller
2020-06-29  6:28   ` Saeed Mahameed
2020-06-29 15:10     ` Tom Herbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.