* [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues
@ 2018-06-25 18:04 Amritha Nambiar
  2018-06-25 18:04 ` [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and " Amritha Nambiar
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

This patch series implements support for Tx queue selection based on
an Rx queue(s) map. This is done by configuring the Rx queue(s) map
per Tx queue via a sysfs attribute. If the user configuration for Rx
queues does not apply, Tx queue selection falls back to XPS using the
CPUs map, and finally to hashing.

XPS is refactored to support Tx queue selection based on either the
CPUs map or the Rx queues map. The config option CONFIG_XPS needs to
be enabled. By default, no receive queues are configured for a Tx
queue; the map is set per Tx queue via the sysfs attribute:

- /sys/class/net/<dev>/queues/tx-*/xps_rxqs

A set of receive queues can be mapped to a set of transmit queues
(many:many), although the common use case is a 1:1 mapping. This
enables sending packets on the same Tx-Rx queue association, which is
useful for busy-polling multi-threaded workloads where it is not
possible to pin the threads to a CPU. This is a rework of Sridhar's
patch for symmetric queueing via a socket option:
https://www.spinics.net/lists/netdev/msg453106.html

Testing Hints:
Kernel:  Linux 4.17.0-rc7+
Interface:
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x00015e0b

Configuration:
ethtool -L $iface combined 16
ethtool -C $iface rx-usecs 1000
sysctl net.core.busy_poll=1000
ATR disabled (on ixgbe, enabling ntuple filters turns off ATR):
ethtool -K $iface ntuple on

Workload:
A modified memcached whose thread selection policy assigns each
connection to a service thread based on the connection's incoming Rx
queue, obtained via the SO_INCOMING_NAPI_ID socket option. The default
policy is round-robin.
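
As a rough illustration of that policy (a sketch, not the actual
memcached patch), a thread can query which NAPI instance, and hence
which Rx queue, delivered a connection's traffic:

    /* Minimal sketch: read the NAPI ID for a connected socket.
     * Requires a kernel with CONFIG_NET_RX_BUSY_POLL; the socket
     * must already have received traffic, otherwise the ID is 0.
     */
    #include <sys/socket.h>

    #ifndef SO_INCOMING_NAPI_ID
    #define SO_INCOMING_NAPI_ID 56  /* from asm-generic/socket.h */
    #endif

    static unsigned int get_incoming_napi_id(int fd)
    {
            unsigned int napi_id = 0;
            socklen_t len = sizeof(napi_id);

            if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                           &napi_id, &len) < 0)
                    return 0;       /* error or unsupported kernel */
            return napi_id;
    }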

Default: no rxqs_map configured
Symmetric queues: rxqs_map enabled for every Tx queue, 1:1 mapped to
the Rx queue of the same index (sketched below)
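
For reference, a sketch of the symmetric (1:1) setup used here,
assuming the 16 queue pairs configured above and the same $iface:

    # xps_rxqs takes a hex bitmap of receive queues, analogous to
    # the CPU bitmap in xps_cpus. Map rx-queue i to tx-queue i:
    for i in $(seq 0 15); do
            printf '%x' $((1 << i)) > /sys/class/net/$iface/queues/tx-$i/xps_rxqs
    done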

System:
Architecture:          x86_64
CPU(s):                72
Model name:            Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

16 threads  400K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                 4/51/2215               2/30/5163
(usec)


intr/sec                        26655                   18606

contextswitch/sec               5145                    4044

insn per cycle                  0.43                    0.72

cache-misses                    6.919                   4.310
(% of all cache refs)

L1-dcache-load-misses           4.49                    3.29
(% of all L1-dcache hits)

LLC-load-misses                 13.26                   8.96
(% of all LL-cache hits)

-------------------------------------------------------------------------------

32 threads  400K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                 10/112/5562             9/46/4637
(usec)


intr/sec                        30456                   27666

contextswitch/sec               7552                    5133

insn per cycle                  0.41                    0.49

cache-misses                    9.357                   2.769
(% of all cache refs)

L1-dcache-load-misses           4.09                    3.98
(% of all L1-dcache hits)

LLC-load-misses                 12.96                   3.96
(% of all LL-cache hits)

-------------------------------------------------------------------------------

16 threads  800K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                  5/151/4989             9/69/2611
(usec)


intr/sec                        35686                   22907

contextswitch/sec               25522                   12281

insn per cycle                  0.67                    0.74

cache-misses                    8.652                   6.38
(% of all cache refs)

L1-dcache-load-misses           3.19                    2.86
(% of all L1-dcache hits)

LLC-load-misses                 16.53                   11.99
(% of all LL-cache hits)

-------------------------------------------------------------------------------

32 threads  800K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                  6/163/6152             8/88/4209
(usec)


intr/sec                        47079                   26548

contextswitch/sec               42190                   39168

insn per cycle                  0.45                    0.54

cache-misses                    8.798                   4.668
(% of all cache refs)

L1-dcache-load-misses           6.55                    6.29
(% of all L1-dcache hits)

LLC-load-misses                 13.91                   10.44
(% of all LL-cache hits)

-------------------------------------------------------------------------------

v4:
- Removed enum for map types and used boolean to identify rxqs_map vs cpus_map.
- Added comments for helper functions.
- Added another static_key for rxqs_map (xps_rxqs_needed).
- New patch to change tx_queue_mapping in sock_common to unsigned short.
- Separated marking receive queue number into a standalone patch.
- Changed wording in documentation (queue-pair to queue-association)

---

Amritha Nambiar (7):
      net: Refactor XPS for CPUs and Rx queues
      net: Use static_key for XPS maps
      net: sock: Change tx_queue_mapping in sock_common to unsigned short
      net: Record receive queue number for a connection
      net: Enable Tx queue selection based on Rx queues
      net-sysfs: Add interface for Rx queue(s) map per Tx queue
      Documentation: Add explanation for XPS using Rx-queue(s) map


 Documentation/ABI/testing/sysfs-class-net-queues |   11 +
 Documentation/networking/scaling.txt             |   57 ++++
 include/linux/cpumask.h                          |   11 +
 include/linux/netdevice.h                        |  100 ++++++++
 include/net/busy_poll.h                          |    1 
 include/net/sock.h                               |   28 ++
 net/core/dev.c                                   |  283 +++++++++++++++-------
 net/core/net-sysfs.c                             |   85 ++++++-
 net/core/sock.c                                  |    4 
 net/ipv4/tcp_input.c                             |    3 
 10 files changed, 474 insertions(+), 109 deletions(-)


* [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues
  2018-06-25 18:04 [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues Amritha Nambiar
@ 2018-06-25 18:04 ` Amritha Nambiar
  2018-06-26 22:53   ` Tom Herbert
  2018-06-25 18:04 ` [net-next PATCH v4 2/7] net: Use static_key for XPS maps Amritha Nambiar
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

Refactor the XPS code so that Tx queue selection can be based on
either the CPU(s) map or the Rx queue(s) map.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/linux/cpumask.h   |   11 ++
 include/linux/netdevice.h |  100 +++++++++++++++++++++
 net/core/dev.c            |  211 ++++++++++++++++++++++++++++++---------------
 net/core/net-sysfs.c      |    4 -
 4 files changed, 246 insertions(+), 80 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bf53d89..57f20a0 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
 #define cpu_active(cpu)		((cpu) == 0)
 #endif
 
-/* verify cpu argument to cpumask_* operators */
-static inline unsigned int cpumask_check(unsigned int cpu)
+static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
 {
 #ifdef CONFIG_DEBUG_PER_CPU_MAPS
-	WARN_ON_ONCE(cpu >= nr_cpumask_bits);
+	WARN_ON_ONCE(cpu >= bits);
 #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+}
+
+/* verify cpu argument to cpumask_* operators */
+static inline unsigned int cpumask_check(unsigned int cpu)
+{
+	cpu_max_bits_warn(cpu, nr_cpumask_bits);
 	return cpu;
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850..c534f03 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -730,10 +730,15 @@ struct xps_map {
  */
 struct xps_dev_maps {
 	struct rcu_head rcu;
-	struct xps_map __rcu *cpu_map[0];
+	struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
 };
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +		\
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +	\
 	(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+	(_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
 #endif /* CONFIG_XPS */
 
 #define TC_MAX_QUEUE	16
@@ -1909,7 +1914,8 @@ struct net_device {
 	int			watchdog_timeo;
 
 #ifdef CONFIG_XPS
-	struct xps_dev_maps __rcu *xps_maps;
+	struct xps_dev_maps __rcu *xps_cpus_map;
+	struct xps_dev_maps __rcu *xps_rxqs_map;
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
@@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 #ifdef CONFIG_XPS
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, bool is_rxqs_map);
+
+/**
+ *	attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues mask
+ *	@j: CPU/Rx queue index
+ *	@mask: bitmask of all cpus/rx queues
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
+ */
+static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
+				  unsigned int nr_bits)
+{
+	cpu_max_bits_warn(j, nr_bits);
+	return test_bit(j, mask);
+}
+
+/**
+ *	attr_test_online - Test for online CPU/Rx queue
+ *	@j: CPU/Rx queue index
+ *	@online_mask: bitmask for CPUs/Rx queues that are online
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns true if a CPU/Rx queue is online.
+ */
+static inline bool attr_test_online(unsigned long j,
+				    const unsigned long *online_mask,
+				    unsigned int nr_bits)
+{
+	cpu_max_bits_warn(j, nr_bits);
+
+	if (online_mask)
+		return test_bit(j, online_mask);
+
+	if (j >= 0 && j < nr_bits)
+		return true;
+
+	return false;
+}
+
+/**
+ *	attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask
+ *	@n: CPU/Rx queue index
+ *	@srcp: the cpumask/Rx queue mask pointer
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set.
+ */
+static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
+					 unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpu_max_bits_warn(n, nr_bits);
+
+	if (srcp)
+		return find_next_bit(srcp, nr_bits, n + 1);
+
+	return n + 1;
+}
+
+/**
+ *	attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
+ *	@n: CPU/Rx queue index
+ *	@src1p: the first CPUs/Rx queues mask pointer
+ *	@src2p: the second CPUs/Rx queues mask pointer
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set in both.
+ */
+static inline int attrmask_next_and(int n, const unsigned long *src1p,
+				    const unsigned long *src2p,
+				    unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpu_max_bits_warn(n, nr_bits);
+
+	if (src1p && src2p)
+		return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
+	else if (src1p)
+		return find_next_bit(src1p, nr_bits, n + 1);
+	else if (src2p)
+		return find_next_bit(src2p, nr_bits, n + 1);
+
+	return n + 1;
+}
 #else
 static inline int netif_set_xps_queue(struct net_device *dev,
 				      const struct cpumask *mask,
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7..2552556 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 	int pos;
 
 	if (dev_maps)
-		map = xmap_dereference(dev_maps->cpu_map[tci]);
+		map = xmap_dereference(dev_maps->attr_map[tci]);
 	if (!map)
 		return false;
 
@@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 			break;
 		}
 
-		RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
+		RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
 		kfree_rcu(map, rcu);
 		return false;
 	}
@@ -2135,31 +2135,58 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 	return active;
 }
 
+static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
+			   struct xps_dev_maps *dev_maps, unsigned int nr_ids,
+			   u16 offset, u16 count, bool is_rxqs_map)
+{
+	bool active = false;
+	int i, j;
+
+	for (j = -1; j = attrmask_next(j, mask, nr_ids),
+	     j < nr_ids;)
+		active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
+					       count);
+	if (!active) {
+		if (is_rxqs_map) {
+			RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
+		} else {
+			RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
+
+			for (i = offset + (count - 1); count--; i--)
+				netdev_queue_numa_node_write(
+					netdev_get_tx_queue(dev, i),
+							NUMA_NO_NODE);
+		}
+		kfree_rcu(dev_maps, rcu);
+	}
+}
+
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 				   u16 count)
 {
+	const unsigned long *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps;
-	int cpu, i;
-	bool active = false;
+	unsigned int nr_ids;
 
 	mutex_lock(&xps_map_mutex);
-	dev_maps = xmap_dereference(dev->xps_maps);
 
-	if (!dev_maps)
-		goto out_no_maps;
-
-	for_each_possible_cpu(cpu)
-		active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
-					       offset, count);
+	dev_maps = xmap_dereference(dev->xps_rxqs_map);
+	if (dev_maps) {
+		nr_ids = dev->num_rx_queues;
+		clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
+			       count, true);
 
-	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
-		kfree_rcu(dev_maps, rcu);
 	}
 
-	for (i = offset + (count - 1); count--; i--)
-		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
-					     NUMA_NO_NODE);
+	dev_maps = xmap_dereference(dev->xps_cpus_map);
+	if (!dev_maps)
+		goto out_no_maps;
+
+	if (num_possible_cpus() > 1)
+		possible_mask = cpumask_bits(cpu_possible_mask);
+	nr_ids = nr_cpu_ids;
+	clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
+		       false);
 
 out_no_maps:
 	mutex_unlock(&xps_map_mutex);
@@ -2170,8 +2197,8 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
 	netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
 }
 
-static struct xps_map *expand_xps_map(struct xps_map *map,
-				      int cpu, u16 index)
+static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
+				      u16 index, bool is_rxqs_map)
 {
 	struct xps_map *new_map;
 	int alloc_len = XPS_MIN_MAP_ALLOC;
@@ -2183,7 +2210,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		return map;
 	}
 
-	/* Need to add queue to this CPU's existing map */
+	/* Need to add tx-queue to this CPU's/rx-queue's existing map */
 	if (map) {
 		if (pos < map->alloc_len)
 			return map;
@@ -2191,9 +2218,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		alloc_len = map->alloc_len * 2;
 	}
 
-	/* Need to allocate new map to store queue on this CPU's map */
-	new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
-			       cpu_to_node(cpu));
+	/* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
+	 *  map
+	 */
+	if (is_rxqs_map)
+		new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
+	else
+		new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
+				       cpu_to_node(attr_index));
 	if (!new_map)
 		return NULL;
 
@@ -2205,14 +2237,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 	return new_map;
 }
 
-int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
-			u16 index)
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, bool is_rxqs_map)
 {
+	const unsigned long *online_mask = NULL, *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
-	int i, cpu, tci, numa_node_id = -2;
+	int i, j, tci, numa_node_id = -2;
 	int maps_sz, num_tc = 1, tc = 0;
 	struct xps_map *map, *new_map;
 	bool active = false;
+	unsigned int nr_ids;
 
 	if (dev->num_tc) {
 		num_tc = dev->num_tc;
@@ -2221,16 +2255,27 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -EINVAL;
 	}
 
-	maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
-	if (maps_sz < L1_CACHE_BYTES)
-		maps_sz = L1_CACHE_BYTES;
-
 	mutex_lock(&xps_map_mutex);
+	if (is_rxqs_map) {
+		maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
+		dev_maps = xmap_dereference(dev->xps_rxqs_map);
+		nr_ids = dev->num_rx_queues;
+	} else {
+		maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
+		if (num_possible_cpus() > 1) {
+			online_mask = cpumask_bits(cpu_online_mask);
+			possible_mask = cpumask_bits(cpu_possible_mask);
+		}
+		dev_maps = xmap_dereference(dev->xps_cpus_map);
+		nr_ids = nr_cpu_ids;
+	}
 
-	dev_maps = xmap_dereference(dev->xps_maps);
+	if (maps_sz < L1_CACHE_BYTES)
+		maps_sz = L1_CACHE_BYTES;
 
 	/* allocate memory for queue storage */
-	for_each_cpu_and(cpu, cpu_online_mask, mask) {
+	for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
+	     j < nr_ids;) {
 		if (!new_dev_maps)
 			new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
 		if (!new_dev_maps) {
@@ -2238,73 +2283,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -ENOMEM;
 		}
 
-		tci = cpu * num_tc + tc;
-		map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
+		tci = j * num_tc + tc;
+		map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
 				 NULL;
 
-		map = expand_xps_map(map, cpu, index);
+		map = expand_xps_map(map, j, index, is_rxqs_map);
 		if (!map)
 			goto error;
 
-		RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+		RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 	}
 
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
-	for_each_possible_cpu(cpu) {
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
-		for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
+		for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* We need to explicitly update tci as prevous loop
 		 * could break out early if dev_maps is NULL.
 		 */
-		tci = cpu * num_tc + tc;
+		tci = j * num_tc + tc;
 
-		if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
-			/* add queue to CPU maps */
+		if (attr_test_mask(j, mask, nr_ids) &&
+		    attr_test_online(j, online_mask, nr_ids)) {
+			/* add tx-queue to CPU/rx-queue maps */
 			int pos = 0;
 
-			map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+			map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			while ((pos < map->len) && (map->queues[pos] != index))
 				pos++;
 
 			if (pos == map->len)
 				map->queues[map->len++] = index;
 #ifdef CONFIG_NUMA
-			if (numa_node_id == -2)
-				numa_node_id = cpu_to_node(cpu);
-			else if (numa_node_id != cpu_to_node(cpu))
-				numa_node_id = -1;
+			if (!is_rxqs_map) {
+				if (numa_node_id == -2)
+					numa_node_id = cpu_to_node(j);
+				else if (numa_node_id != cpu_to_node(j))
+					numa_node_id = -1;
+			}
 #endif
 		} else if (dev_maps) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* copy maps belonging to foreign traffic classes */
 		for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 	}
 
-	rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+	if (is_rxqs_map)
+		rcu_assign_pointer(dev->xps_rxqs_map, new_dev_maps);
+	else
+		rcu_assign_pointer(dev->xps_cpus_map, new_dev_maps);
 
 	/* Cleanup old maps */
 	if (!dev_maps)
 		goto out_no_old_maps;
 
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
 			if (map && map != new_map)
 				kfree_rcu(map, rcu);
 		}
@@ -2317,19 +2370,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	active = true;
 
 out_no_new_maps:
-	/* update Tx queue numa node */
-	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
-				     (numa_node_id >= 0) ? numa_node_id :
-				     NUMA_NO_NODE);
+	if (!is_rxqs_map) {
+		/* update Tx queue numa node */
+		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+					     (numa_node_id >= 0) ?
+					     numa_node_id : NUMA_NO_NODE);
+	}
 
 	if (!dev_maps)
 		goto out_no_maps;
 
-	/* removes queue from unused CPUs */
-	for_each_possible_cpu(cpu) {
-		for (i = tc, tci = cpu * num_tc; i--; tci++)
+	/* removes tx-queue from unused CPUs/rx-queues */
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = tc, tci = j * num_tc; i--; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
-		if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
+		if (!attr_test_mask(j, mask, nr_ids) ||
+		    !attr_test_online(j, online_mask, nr_ids))
 			active |= remove_xps_queue(dev_maps, tci, index);
 		for (i = num_tc - tc, tci++; --i; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
@@ -2337,7 +2394,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 
 	/* free map if not active */
 	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
+		if (is_rxqs_map)
+			RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
+		else
+			RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
 		kfree_rcu(dev_maps, rcu);
 	}
 
@@ -2347,11 +2407,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	return 0;
 error:
 	/* remove any maps that we added */
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			map = dev_maps ?
-			      xmap_dereference(dev_maps->cpu_map[tci]) :
+			      xmap_dereference(dev_maps->attr_map[tci]) :
 			      NULL;
 			if (new_map && new_map != map)
 				kfree(new_map);
@@ -2363,6 +2424,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	kfree(new_dev_maps);
 	return -ENOMEM;
 }
+
+int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
+			u16 index)
+{
+	return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
+}
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
@@ -3384,7 +3451,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3393,7 +3460,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			tci += netdev_get_prio_tc_map(dev, skb->priority);
 		}
 
-		map = rcu_dereference(dev_maps->cpu_map[tci]);
+		map = rcu_dereference(dev_maps->attr_map[tci]);
 		if (map) {
 			if (map->len == 1)
 				queue_index = map->queues[0];
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..b39987c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 		return -ENOMEM;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {
 		for_each_possible_cpu(cpu) {
 			int i, tci = cpu * num_tc + tc;
 			struct xps_map *map;
 
-			map = rcu_dereference(dev_maps->cpu_map[tci]);
+			map = rcu_dereference(dev_maps->attr_map[tci]);
 			if (!map)
 				continue;
 


* [net-next PATCH v4 2/7] net: Use static_key for XPS maps
  2018-06-25 18:04 [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues Amritha Nambiar
  2018-06-25 18:04 ` [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and " Amritha Nambiar
@ 2018-06-25 18:04 ` Amritha Nambiar
  2018-06-26 22:54   ` Tom Herbert
  2018-06-25 18:04 ` [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short Amritha Nambiar
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

Use static_keys for the XPS maps to avoid the cost of the extra map
checks when XPS is not in use, similar to how they are used for RPS
and RFS. This adds the static_key 'xps_needed' for XPS in general, and
a second one, 'xps_rxqs_needed', for XPS using the Rx queues map.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 net/core/dev.c |   26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 2552556..df2a78d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 EXPORT_SYMBOL(netdev_txq_to_tc);
 
 #ifdef CONFIG_XPS
+struct static_key xps_needed __read_mostly;
+EXPORT_SYMBOL(xps_needed);
+struct static_key xps_rxqs_needed __read_mostly;
+EXPORT_SYMBOL(xps_rxqs_needed);
 static DEFINE_MUTEX(xps_map_mutex);
 #define xmap_dereference(P)		\
 	rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
@@ -2170,12 +2174,14 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 
 	mutex_lock(&xps_map_mutex);
 
-	dev_maps = xmap_dereference(dev->xps_rxqs_map);
-	if (dev_maps) {
-		nr_ids = dev->num_rx_queues;
-		clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
-			       count, true);
-
+	if (static_key_false(&xps_rxqs_needed)) {
+		dev_maps = xmap_dereference(dev->xps_rxqs_map);
+		if (dev_maps) {
+			nr_ids = dev->num_rx_queues;
+			clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
+				       offset, count, true);
+		}
+		static_key_slow_dec(&xps_rxqs_needed);
 	}
 
 	dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2189,6 +2195,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 		       false);
 
 out_no_maps:
+	static_key_slow_dec(&xps_needed);
 	mutex_unlock(&xps_map_mutex);
 }
 
@@ -2297,6 +2304,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
+	static_key_slow_inc(&xps_needed);
+	if (is_rxqs_map)
+		static_key_slow_inc(&xps_rxqs_needed);
+
 	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
 	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
@@ -3450,6 +3461,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
+	if (!static_key_false(&xps_needed))
+		return -1;
+
 	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {


* [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
  2018-06-25 18:04 [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues Amritha Nambiar
  2018-06-25 18:04 ` [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and " Amritha Nambiar
  2018-06-25 18:04 ` [net-next PATCH v4 2/7] net: Use static_key for XPS maps Amritha Nambiar
@ 2018-06-25 18:04 ` Amritha Nambiar
       [not found]   ` <CALx6S37uFs1shuPmno+L=p_Hyy1Q2qNaK+AqYvrk4HXTApL_Vg@mail.gmail.com>
  2018-06-26 10:58   ` Willem de Bruijn
  2018-06-25 18:04 ` [net-next PATCH v4 4/7] net: Record receive queue number for a connection Amritha Nambiar
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 23+ messages in thread
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

Change the 'skc_tx_queue_mapping' field in struct sock_common from
'int' to 'unsigned short', with 0 indicating unset and a positive
value encoding the queue number (stored as queue + 1). This makes it
consistent with the queue_mapping field in the sk_buff, and also makes
room for adding a new 'unsigned short' field, rx_queue_mapping, to
sock_common in the next patch.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/net/sock.h |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b7541..009fd30 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -214,7 +214,7 @@ struct sock_common {
 		struct hlist_node	skc_node;
 		struct hlist_nulls_node skc_nulls_node;
 	};
-	int			skc_tx_queue_mapping;
+	unsigned short		skc_tx_queue_mapping;
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 
 static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 {
-	sk->sk_tx_queue_mapping = tx_queue;
+	/* sk_tx_queue_mapping accept only upto a 16-bit value */
+	WARN_ON((unsigned short)tx_queue > USHRT_MAX);
+	sk->sk_tx_queue_mapping = tx_queue + 1;
 }
 
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
-	sk->sk_tx_queue_mapping = -1;
+	sk->sk_tx_queue_mapping = 0;
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-	return sk ? sk->sk_tx_queue_mapping : -1;
+	return sk ? sk->sk_tx_queue_mapping - 1 : -1;
 }
 
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)


* [net-next PATCH v4 4/7] net: Record receive queue number for a connection
  2018-06-25 18:04 [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues Amritha Nambiar
                   ` (2 preceding siblings ...)
  2018-06-25 18:04 ` [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short Amritha Nambiar
@ 2018-06-25 18:04 ` Amritha Nambiar
  2018-06-25 18:04 ` [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues Amritha Nambiar
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

This patch adds a new field, 'skc_rx_queue_mapping', to sock_common,
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() so that a client application can use
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for the socket. The Rx queue is also marked in
tcp_conn_request() so that the SYN-ACK goes out on the Tx queue
associated with the queue on which the SYN was received.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/net/busy_poll.h |    1 +
 include/net/sock.h      |   14 ++++++++++++++
 net/core/sock.c         |    4 ++++
 net/ipv4/tcp_input.c    |    3 +++
 4 files changed, 22 insertions(+)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c518743..9e36fda6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -151,6 +151,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id = skb->napi_id;
 #endif
+	sk_rx_queue_set(sk, skb);
 }
 
 /* variant used for unconnected sockets */
diff --git a/include/net/sock.h b/include/net/sock.h
index 009fd30..0ff4416 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_rx_queue_mapping: rx queue number for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +216,9 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	unsigned short		skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+	unsigned short		skc_rx_queue_mapping;
+#endif
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -326,6 +330,9 @@ struct sock {
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
+#endif
 
 #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
@@ -1696,6 +1703,13 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return sk ? sk->sk_tx_queue_mapping - 1 : -1;
 }
 
+static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+	sk->sk_rx_queue_mapping = skb_get_rx_queue(skb) + 1;
+#endif
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index bcc4182..5e4715b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2818,6 +2818,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_pacing_rate = ~0U;
 	sk->sk_pacing_shift = 10;
 	sk->sk_incoming_cpu = -1;
+
+#ifdef CONFIG_XPS
+	sk->sk_rx_queue_mapping = ~0;
+#endif
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
 	 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f..c404c53 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <net/busy_poll.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -5584,6 +5585,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 	if (skb) {
 		icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
 		security_inet_conn_established(sk, skb);
+		sk_mark_napi_id(sk, skb);
 	}
 
 	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
@@ -6412,6 +6414,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_openreq_init_rwin(req, sk, dst);
+	sk_rx_queue_set(req_to_sk(req), skb);
 	if (!want_cookie) {
 		tcp_reqsk_record_syn(sk, req, skb);
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);


* [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
  2018-06-25 18:04 [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues Amritha Nambiar
                   ` (3 preceding siblings ...)
  2018-06-25 18:04 ` [net-next PATCH v4 4/7] net: Record receive queue number for a connection Amritha Nambiar
@ 2018-06-25 18:04 ` Amritha Nambiar
  2018-06-26  2:04   ` kbuild test robot
  2018-06-26 11:04   ` Willem de Bruijn
  2018-06-25 18:04 ` [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue Amritha Nambiar
  2018-06-25 18:04 ` [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map Amritha Nambiar
  6 siblings, 2 replies; 23+ messages in thread
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

This patch adds support for picking the Tx queue based on the Rx
queue(s) map configured by the administrator through the per-Tx-queue
sysfs attribute. If the user configuration for the receive queue(s)
map does not apply, Tx queue selection falls back to the CPU(s) map
based selection, and finally to hashing.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/net/sock.h |    4 +++
 net/core/dev.c     |   62 ++++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0ff4416..cb18139 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1710,6 +1710,10 @@ static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
 #endif
 }
 
+static inline int sk_rx_queue_get(const struct sock *sk)
+{
+	return sk ? sk->sk_rx_queue_mapping - 1 : -1;
+}
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index df2a78d..2450c5e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3454,35 +3454,63 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+#ifdef CONFIG_XPS
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+			       struct xps_dev_maps *dev_maps, unsigned int tci)
+{
+	struct xps_map *map;
+	int queue_index = -1;
+
+	if (dev->num_tc) {
+		tci *= dev->num_tc;
+		tci += netdev_get_prio_tc_map(dev, skb->priority);
+	}
+
+	map = rcu_dereference(dev_maps->attr_map[tci]);
+	if (map) {
+		if (map->len == 1)
+			queue_index = map->queues[0];
+		else
+			queue_index = map->queues[reciprocal_scale(
+						skb_get_hash(skb), map->len)];
+		if (unlikely(queue_index >= dev->real_num_tx_queues))
+			queue_index = -1;
+	}
+	return queue_index;
+}
+#endif
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps *dev_maps;
-	struct xps_map *map;
+	struct sock *sk = skb->sk;
 	int queue_index = -1;
 
 	if (!static_key_false(&xps_needed))
 		return -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_cpus_map);
+	if (!static_key_false(&xps_rxqs_needed))
+		goto get_cpus_map;
+
+	dev_maps = rcu_dereference(dev->xps_rxqs_map);
 	if (dev_maps) {
-		unsigned int tci = skb->sender_cpu - 1;
+		int tci = sk_rx_queue_get(sk);
 
-		if (dev->num_tc) {
-			tci *= dev->num_tc;
-			tci += netdev_get_prio_tc_map(dev, skb->priority);
-		}
+		if (tci >= 0 && tci < dev->num_rx_queues)
+			queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+							  tci);
+	}
 
-		map = rcu_dereference(dev_maps->attr_map[tci]);
-		if (map) {
-			if (map->len == 1)
-				queue_index = map->queues[0];
-			else
-				queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
-									   map->len)];
-			if (unlikely(queue_index >= dev->real_num_tx_queues))
-				queue_index = -1;
+get_cpus_map:
+	if (queue_index < 0) {
+		dev_maps = rcu_dereference(dev->xps_cpus_map);
+		if (dev_maps) {
+			unsigned int tci = skb->sender_cpu - 1;
+
+			queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+							  tci);
 		}
 	}
 	rcu_read_unlock();


* [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
  2018-06-25 18:04 [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues Amritha Nambiar
                   ` (4 preceding siblings ...)
  2018-06-25 18:04 ` [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues Amritha Nambiar
@ 2018-06-25 18:04 ` Amritha Nambiar
  2018-06-26 10:55   ` Willem de Bruijn
  2018-06-25 18:04 ` [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map Amritha Nambiar
  6 siblings, 1 reply; 23+ messages in thread
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

Extend the transmit queue sysfs attributes to allow configuring an Rx
queue(s) map per Tx queue. By default, no receive queues are
configured for a Tx queue.

- /sys/class/net/eth0/queues/tx-*/xps_rxqs
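
For illustration, assuming eth0 has at least four receive queues,
mapping rx-queues 0-3 to tx-queue 0 and reading the mask back could
look like:

    echo f > /sys/class/net/eth0/queues/tx-0/xps_rxqs
    cat /sys/class/net/eth0/queues/tx-0/xps_rxqs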

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 net/core/net-sysfs.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b39987c..5d2ed02 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 
 static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
 	= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+	struct net_device *dev = queue->dev;
+	struct xps_dev_maps *dev_maps;
+	unsigned long *mask, index;
+	int j, len, num_tc = 1, tc = 0;
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	if (dev->num_tc) {
+		num_tc = dev->num_tc;
+		tc = netdev_txq_to_tc(dev, index);
+		if (tc < 0)
+			return -EINVAL;
+	}
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(dev->xps_rxqs_map);
+	if (dev_maps) {
+		for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues),
+		     j < dev->num_rx_queues;) {
+			int i, tci = j * num_tc + tc;
+			struct xps_map *map;
+
+			map = rcu_dereference(dev_maps->attr_map[tci]);
+			if (!map)
+				continue;
+
+			for (i = map->len; i--;) {
+				if (map->queues[i] == index) {
+					set_bit(j, mask);
+					break;
+				}
+			}
+		}
+	}
+
+	len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+	rcu_read_unlock();
+	kfree(mask);
+
+	return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+			      size_t len)
+{
+	struct net_device *dev = queue->dev;
+	unsigned long *mask, index;
+	int err;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+	if (err) {
+		kfree(mask);
+		return err;
+	}
+
+	err = __netif_set_xps_queue(dev, mask, index, true);
+	kfree(mask);
+	return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+	= __ATTR_RW(xps_rxqs);
 #endif /* CONFIG_XPS */
 
 static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
 	&queue_traffic_class.attr,
 #ifdef CONFIG_XPS
 	&xps_cpus_attribute.attr,
+	&xps_rxqs_attribute.attr,
 	&queue_tx_maxrate.attr,
 #endif
 	NULL


* [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map
  2018-06-25 18:04 [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues Amritha Nambiar
                   ` (5 preceding siblings ...)
  2018-06-25 18:04 ` [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue Amritha Nambiar
@ 2018-06-25 18:04 ` Amritha Nambiar
  2018-06-26 22:59   ` Tom Herbert
  6 siblings, 1 reply; 23+ messages in thread
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 Documentation/ABI/testing/sysfs-class-net-queues |   11 ++++
 Documentation/networking/scaling.txt             |   57 ++++++++++++++++++----
 2 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
index 0c0df91..978b763 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -42,6 +42,17 @@ Description:
 		network device transmit queue. Possible vaules depend on the
 		number of available CPU(s) in the system.
 
+What:		/sys/class/<iface>/queues/tx-<queue>/xps_rxqs
+Date:		June 2018
+KernelVersion:	4.18.0
+Contact:	netdev@vger.kernel.org
+Description:
+		Mask of the receive queue(s) currently enabled to participate
+		into the Transmit Packet Steering packet processing flow for this
+		network device transmit queue. Possible values depend on the
+		number of available receive queue(s) in the network device.
+		Default is disabled.
+
 What:		/sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/hold_time
 Date:		November 2011
 KernelVersion:	3.3
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index f55639d..8336116 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
 
 Transmit Packet Steering is a mechanism for intelligently selecting
 which transmit queue to use when transmitting a packet on a multi-queue
-device. To accomplish this, a mapping from CPU to hardware queue(s) is
-recorded. The goal of this mapping is usually to assign queues
+device. This can be accomplished by recording two kinds of maps, either
+a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
+to hardware transmit queue(s).
+
+1. XPS using CPUs map
+
+The goal of this mapping is usually to assign queues
 exclusively to a subset of CPUs, where the transmit completions for
 these queues are processed on a CPU within this set. This choice
 provides two benefits. First, contention on the device queue lock is
@@ -377,12 +382,35 @@ transmit queue). Secondly, cache miss rate on transmit completion is
 reduced, in particular for data cache lines that hold the sk_buff
 structures.
 
-XPS is configured per transmit queue by setting a bitmap of CPUs that
-may use that queue to transmit. The reverse mapping, from CPUs to
-transmit queues, is computed and maintained for each network device.
-When transmitting the first packet in a flow, the function
-get_xps_queue() is called to select a queue. This function uses the ID
-of the running CPU as a key into the CPU-to-queue lookup table. If the
+2. XPS using receive queues map
+
+This mapping is used to pick transmit queue based on the receive
+queue(s) map configuration set by the administrator. A set of receive
+queues can be mapped to a set of transmit queues (many:many), although
+the common use case is a 1:1 mapping. This will enable sending packets
+on the same queue associations for transmit and receive. This is useful for
+busy polling multi-threaded workloads where there are challenges in
+associating a given CPU to a given application thread. The application
+threads are not pinned to CPUs and each thread handles packets
+received on a single queue. The receive queue number is cached in the
+socket for the connection. In this model, sending the packets on the same
+transmit queue corresponding to the associated receive queue has benefits
+in keeping the CPU overhead low. Transmit completion work is locked into
+the same queue-association that a given application is polling on. This
+avoids the overhead of triggering an interrupt on another CPU. When the
+application cleans up the packets during the busy poll, transmit completion
+may be processed along with it in the same thread context and so result in
+reduced latency.
+
+XPS is configured per transmit queue by setting a bitmap of
+CPUs/receive-queues that may use that queue to transmit. The reverse
+mapping, from CPUs to transmit queues or from receive-queues to transmit
+queues, is computed and maintained for each network device. When
+transmitting the first packet in a flow, the function get_xps_queue() is
+called to select a queue. This function uses the ID of the receive queue
+for the socket connection for a match in the receive queue-to-transmit queue
+lookup table. Alternatively, this function can also use the ID of the
+running CPU as a key into the CPU-to-queue lookup table. If the
 ID matches a single queue, that is used for transmission. If multiple
 queues match, one is selected by using the flow hash to compute an index
 into the set.
@@ -404,11 +432,15 @@ acknowledged.
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
 default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+configured. To enable XPS, the bitmap of CPUs/receive-queues that may
+use a transmit queue is configured using the sysfs file entry:
 
+For selection based on CPUs map:
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+For selection based on receive-queues map:
+/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
@@ -421,6 +453,11 @@ best CPUs to share a given queue are probably those that share the cache
 with the CPU that processes transmit completions for that queue
 (transmit interrupts).
 
+For transmit queue selection based on receive queue(s), XPS has to be
+explicitly configured mapping receive-queue(s) to transmit queue(s). If the
+user configuration for receive-queue map does not apply, then the transmit
+queue is selected based on the CPUs map.
+
 Per TX Queue rate limitation:
 =============================
 


* Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
  2018-06-25 18:04 ` [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues Amritha Nambiar
@ 2018-06-26  2:04   ` kbuild test robot
  2018-06-26 11:04   ` Willem de Bruijn
  1 sibling, 0 replies; 23+ messages in thread
From: kbuild test robot @ 2018-06-26  2:04 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: kbuild-all, netdev, davem, alexander.h.duyck,
	willemdebruijn.kernel, amritha.nambiar, sridhar.samudrala,
	alexander.duyck, edumazet, hannes, tom

[-- Attachment #1: Type: text/plain, Size: 1295 bytes --]

Hi Amritha,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Amritha-Nambiar/Symmetric-queue-selection-using-XPS-for-Rx-queues/20180626-070944
config: i386-randconfig-x078-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from include/net/cls_cgroup.h:19:0,
                    from net/socket.c:98:
   include/net/sock.h: In function 'sk_rx_queue_get':
>> include/net/sock.h:1715:16: error: 'const struct sock' has no member named 'sk_rx_queue_mapping'
     return sk ? sk->sk_rx_queue_mapping - 1 : -1;
                   ^~

vim +1715 include/net/sock.h

  1712	
  1713	static inline int sk_rx_queue_get(const struct sock *sk)
  1714	{
> 1715		return sk ? sk->sk_rx_queue_mapping - 1 : -1;
  1716	}
  1717	static inline void sk_set_socket(struct sock *sk, struct socket *sock)
  1718	{
  1719		sk_tx_queue_clear(sk);
  1720		sk->sk_socket = sock;
  1721	}
  1722	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26631 bytes --]


* Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
       [not found]   ` <CALx6S37uFs1shuPmno+L=p_Hyy1Q2qNaK+AqYvrk4HXTApL_Vg@mail.gmail.com>
@ 2018-06-26  3:25     ` Alexander Duyck
  2018-06-26 23:54       ` Nambiar, Amritha
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Duyck @ 2018-06-26  3:25 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Amritha Nambiar, Linux Kernel Network Developers,
	David S. Miller, Alexander Duyck, Willem de Bruijn,
	Sridhar Samudrala, Eric Dumazet, Hannes Frederic Sowa

On Mon, Jun 25, 2018 at 6:34 PM, Tom Herbert <tom@herbertland.com> wrote:
>
>
> On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
>>
>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>> 'int' to 'unsigned short' type with 0 indicating unset and
>> a positive queue value being set. This way it is consistent with
>> the queue_mapping field in the sk_buff. This will also accommodate
>> adding a new 'unsigned short' field in sock_common in the next
>> patch for rx_queue_mapping.
>>
>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>> ---
>>  include/net/sock.h |   10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index b3b7541..009fd30 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -214,7 +214,7 @@ struct sock_common {
>>                 struct hlist_node       skc_node;
>>                 struct hlist_nulls_node skc_nulls_node;
>>         };
>> -       int                     skc_tx_queue_mapping;
>> +       unsigned short          skc_tx_queue_mapping;
>>         union {
>>                 int             skc_incoming_cpu;
>>                 u32             skc_rcv_wnd;
>> @@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk,
>> struct sk_buff *skb,
>>
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> -       sk->sk_tx_queue_mapping = tx_queue;
>> +       /* sk_tx_queue_mapping accept only upto a 16-bit value */
>> +       WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>
>
> Shouldn't this be USHRT_MAX - 1 ?

Actually just a ">=" would probably do as well.

>
>> +       sk->sk_tx_queue_mapping = tx_queue + 1;
>>  }
>>
>>  static inline void sk_tx_queue_clear(struct sock *sk)
>>  {
>> -       sk->sk_tx_queue_mapping = -1;
>>
>> +       sk->sk_tx_queue_mapping = 0;
>
>
> I think it's slightly better to define a new constant like NO_QUEUE_MAPPING
> to be USHRT_MAX. That avoids needing to do the arithmetic every time the
> value is accessed.
>>
>>  }
>>
>>  static inline int sk_tx_queue_get(const struct sock *sk)
>>  {
>> -       return sk ? sk->sk_tx_queue_mapping : -1;
>> +       return sk ? sk->sk_tx_queue_mapping - 1 : -1;
>
>
> Doesn't the comparison in __netdev_pick_tx need to be simultaneously changed
> for this?

This doesn't change the result. It is still -1 if the queue mapping
is not set. The field is just initialized to 0 instead of -1, so we
have to perform the subtraction to get there.

Also, in regards to the comment above about needing an extra operation,
I am not sure it makes much difference.

In the case of us starting with 0 as the reserved value, I think the
instruction count should be about the same. We move the unsigned short
into an unsigned int, then decrement, and if the value is non-negative
we can assume it is valid. Although maybe I should double check the
code to make certain it is doing what I thought it was supposed to be
doing.
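
The caller-side pattern is unchanged either way; roughly what
__netdev_pick_tx() does (a sketch, not the exact code):

	int queue_index = sk_tx_queue_get(sk);	/* -1 when unset */

	if (queue_index < 0)
		queue_index = get_xps_queue(dev, skb);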

>
>>
>>
>>
>>  }
>>
>>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
  2018-06-25 18:04 ` [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue Amritha Nambiar
@ 2018-06-26 10:55   ` Willem de Bruijn
  2018-06-27  0:21     ` Nambiar, Amritha
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-06-26 10:55 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Network Development, David Miller, Alexander Duyck, Samudrala,
	Sridhar, Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa,
	Tom Herbert

On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
>
> Extend transmit queue sysfs attribute to configure Rx queue(s) map
> per Tx queue. By default no receive queues are configured for the
> Tx queue.
>
> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---

> +static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
> +{
> +       struct net_device *dev = queue->dev;
> +       struct xps_dev_maps *dev_maps;
> +       unsigned long *mask, index;
> +       int j, len, num_tc = 1, tc = 0;
> +
> +       mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
> +                      GFP_KERNEL);
> +       if (!mask)
> +               return -ENOMEM;
> +
> +       index = get_netdev_queue_index(queue);
> +
> +       if (dev->num_tc) {
> +               num_tc = dev->num_tc;
> +               tc = netdev_txq_to_tc(dev, index);
> +               if (tc < 0)
> +                       return -EINVAL;

Must free mask
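
i.e. the early return leaks the bitmap allocated above; roughly
(sketch):

	if (dev->num_tc) {
		num_tc = dev->num_tc;
		tc = netdev_txq_to_tc(dev, index);
		if (tc < 0) {
			kfree(mask);
			return -EINVAL;
		}
	}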

> +static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
> +                             size_t len)
> +{
> +       struct net_device *dev = queue->dev;
> +       unsigned long *mask, index;
> +       int err;
> +
> +       if (!capable(CAP_NET_ADMIN))
> +               return -EPERM;

ns_capable?
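
e.g. (a sketch, assuming the netdev's namespace is the right scope
for this check):

	if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
		return -EPERM;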

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
  2018-06-25 18:04 ` [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short Amritha Nambiar
       [not found]   ` <CALx6S37uFs1shuPmno+L=p_Hyy1Q2qNaK+AqYvrk4HXTApL_Vg@mail.gmail.com>
@ 2018-06-26 10:58   ` Willem de Bruijn
  2018-06-27  0:00     ` Nambiar, Amritha
  1 sibling, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-06-26 10:58 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Network Development, David Miller, Alexander Duyck, Samudrala,
	Sridhar, Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa,
	Tom Herbert

On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
>
> Change 'skc_tx_queue_mapping' field in sock_common structure from
> 'int' to 'unsigned short' type with 0 indicating unset and
> a positive queue value being set. This way it is consistent with
> the queue_mapping field in the sk_buff. This will also accommodate
> adding a new 'unsigned short' field in sock_common in the next
> patch for rx_queue_mapping.
>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---

>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>  {
> -       sk->sk_tx_queue_mapping = tx_queue;
> +       /* sk_tx_queue_mapping accept only upto a 16-bit value */
> +       WARN_ON((unsigned short)tx_queue > USHRT_MAX);
> +       sk->sk_tx_queue_mapping = tx_queue + 1;
>  }

WARN_ON_ONCE to avoid flooding the kernel buffer.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
  2018-06-25 18:04 ` [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues Amritha Nambiar
  2018-06-26  2:04   ` kbuild test robot
@ 2018-06-26 11:04   ` Willem de Bruijn
  2018-06-27  0:36     ` Nambiar, Amritha
  1 sibling, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-06-26 11:04 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Network Development, David Miller, Alexander Duyck, Samudrala,
	Sridhar, Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa,
	Tom Herbert

On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
>
> This patch adds support to pick Tx queue based on the Rx queue(s) map
> configuration set by the admin through the sysfs attribute
> for each Tx queue. If the user configuration for receive queue(s) map
> does not apply, then the Tx queue selection falls back to CPU(s) map
> based selection and finally to hashing.
>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---

> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  {
>  #ifdef CONFIG_XPS
>         struct xps_dev_maps *dev_maps;
> -       struct xps_map *map;
> +       struct sock *sk = skb->sk;
>         int queue_index = -1;
>
>         if (!static_key_false(&xps_needed))
>                 return -1;
>
>         rcu_read_lock();
> -       dev_maps = rcu_dereference(dev->xps_cpus_map);
> +       if (!static_key_false(&xps_rxqs_needed))
> +               goto get_cpus_map;
> +
> +       dev_maps = rcu_dereference(dev->xps_rxqs_map);
>         if (dev_maps) {
> -               unsigned int tci = skb->sender_cpu - 1;
> +               int tci = sk_rx_queue_get(sk);

What if the rx device differs from the tx device?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues
  2018-06-25 18:04 ` [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and " Amritha Nambiar
@ 2018-06-26 22:53   ` Tom Herbert
  2018-06-28  0:47     ` Nambiar, Amritha
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Herbert @ 2018-06-26 22:53 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexander Duyck, Willem de Bruijn, Sridhar Samudrala,
	Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa

On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Refactor XPS code to support Tx queue selection based on
> CPU(s) map or Rx queue(s) map.
>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---
>  include/linux/cpumask.h   |   11 ++
>  include/linux/netdevice.h |  100 +++++++++++++++++++++
>  net/core/dev.c            |  211 ++++++++++++++++++++++++++++++---------------
>  net/core/net-sysfs.c      |    4 -
>  4 files changed, 246 insertions(+), 80 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index bf53d89..57f20a0 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
>  #define cpu_active(cpu)                ((cpu) == 0)
>  #endif
>
> -/* verify cpu argument to cpumask_* operators */
> -static inline unsigned int cpumask_check(unsigned int cpu)
> +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
>  {
>  #ifdef CONFIG_DEBUG_PER_CPU_MAPS
> -       WARN_ON_ONCE(cpu >= nr_cpumask_bits);
> +       WARN_ON_ONCE(cpu >= bits);
>  #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
> +}
> +
> +/* verify cpu argument to cpumask_* operators */
> +static inline unsigned int cpumask_check(unsigned int cpu)
> +{
> +       cpu_max_bits_warn(cpu, nr_cpumask_bits);
>         return cpu;
>  }
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3ec9850..c534f03 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -730,10 +730,15 @@ struct xps_map {
>   */
>  struct xps_dev_maps {
>         struct rcu_head rcu;
> -       struct xps_map __rcu *cpu_map[0];
> +       struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
>  };
> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +         \
> +
> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +     \
>         (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
> +
> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
> +       (_rxqs * (_tcs) * sizeof(struct xps_map *)))
> +
>  #endif /* CONFIG_XPS */
>
>  #define TC_MAX_QUEUE   16
> @@ -1909,7 +1914,8 @@ struct net_device {
>         int                     watchdog_timeo;
>
>  #ifdef CONFIG_XPS
> -       struct xps_dev_maps __rcu *xps_maps;
> +       struct xps_dev_maps __rcu *xps_cpus_map;
> +       struct xps_dev_maps __rcu *xps_rxqs_map;
>  #endif
>  #ifdef CONFIG_NET_CLS_ACT
>         struct mini_Qdisc __rcu *miniq_egress;
> @@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
>  #ifdef CONFIG_XPS
>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         u16 index);
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> +                         u16 index, bool is_rxqs_map);
> +
> +/**
> + *     attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues mask
> + *     @j: CPU/Rx queue index
> + *     @mask: bitmask of all cpus/rx queues
> + *     @nr_bits: number of bits in the bitmask
> + *
> + * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
> + */
> +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
> +                                 unsigned int nr_bits)
> +{
> +       cpu_max_bits_warn(j, nr_bits);
> +       return test_bit(j, mask);
> +}
> +
> +/**
> + *     attr_test_online - Test for online CPU/Rx queue
> + *     @j: CPU/Rx queue index
> + *     @online_mask: bitmask for CPUs/Rx queues that are online
> + *     @nr_bits: number of bits in the bitmask
> + *
> + * Returns true if a CPU/Rx queue is online.
> + */
> +static inline bool attr_test_online(unsigned long j,
> +                                   const unsigned long *online_mask,
> +                                   unsigned int nr_bits)
> +{
> +       cpu_max_bits_warn(j, nr_bits);
> +
> +       if (online_mask)
> +               return test_bit(j, online_mask);
> +
> +       if (j >= 0 && j < nr_bits)

j is unsigned so j >= 0 is superfluous.

> +               return true;
> +
> +       return false;

Could just do:

return (j < nr_bits);

> +}
> +
> +/**
> + *     attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask
> + *     @n: CPU/Rx queue index
> + *     @srcp: the cpumask/Rx queue mask pointer
> + *     @nr_bits: number of bits in the bitmask
> + *
> + * Returns >= nr_bits if no further CPUs/Rx queues set.
> + */
> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
> +                                        unsigned int nr_bits)
> +{
> +       /* -1 is a legal arg here. */
> +       if (n != -1)
> +               cpu_max_bits_warn(n, nr_bits);
> +
> +       if (srcp)
> +               return find_next_bit(srcp, nr_bits, n + 1);
> +
> +       return n + 1;
> +}
> +
> +/**
> + *     attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
> + *     @n: CPU/Rx queue index
> + *     @src1p: the first CPUs/Rx queues mask pointer
> + *     @src2p: the second CPUs/Rx queues mask pointer
> + *     @nr_bits: number of bits in the bitmask
> + *
> + * Returns >= nr_bits if no further CPUs/Rx queues set in both.
> + */
> +static inline int attrmask_next_and(int n, const unsigned long *src1p,
> +                                   const unsigned long *src2p,
> +                                   unsigned int nr_bits)
> +{
> +       /* -1 is a legal arg here. */
> +       if (n != -1)
> +               cpu_max_bits_warn(n, nr_bits);
> +
> +       if (src1p && src2p)
> +               return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
> +       else if (src1p)
> +               return find_next_bit(src1p, nr_bits, n + 1);
> +       else if (src2p)
> +               return find_next_bit(src2p, nr_bits, n + 1);
> +
> +       return n + 1;
> +}
>  #else
>  static inline int netif_set_xps_queue(struct net_device *dev,
>                                       const struct cpumask *mask,
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a5aa1c7..2552556 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
>         int pos;
>
>         if (dev_maps)
> -               map = xmap_dereference(dev_maps->cpu_map[tci]);
> +               map = xmap_dereference(dev_maps->attr_map[tci]);
>         if (!map)
>                 return false;
>
> @@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
>                         break;
>                 }
>
> -               RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
> +               RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
>                 kfree_rcu(map, rcu);
>                 return false;
>         }
> @@ -2135,31 +2135,58 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
>         return active;
>  }
>
> +static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
> +                          struct xps_dev_maps *dev_maps, unsigned int nr_ids,
> +                          u16 offset, u16 count, bool is_rxqs_map)
> +{
> +       bool active = false;
> +       int i, j;
> +
> +       for (j = -1; j = attrmask_next(j, mask, nr_ids),
> +            j < nr_ids;)
> +               active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
> +                                              count);
> +       if (!active) {
> +               if (is_rxqs_map) {
> +                       RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
> +               } else {
> +                       RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
> +
> +                       for (i = offset + (count - 1); count--; i--)
> +                               netdev_queue_numa_node_write(
> +                                       netdev_get_tx_queue(dev, i),
> +                                                       NUMA_NO_NODE);
> +               }
> +               kfree_rcu(dev_maps, rcu);
> +       }
> +}
> +
>  static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>                                    u16 count)
>  {
> +       const unsigned long *possible_mask = NULL;
>         struct xps_dev_maps *dev_maps;
> -       int cpu, i;
> -       bool active = false;
> +       unsigned int nr_ids;
>
>         mutex_lock(&xps_map_mutex);
> -       dev_maps = xmap_dereference(dev->xps_maps);
>
> -       if (!dev_maps)
> -               goto out_no_maps;
> -
> -       for_each_possible_cpu(cpu)
> -               active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
> -                                              offset, count);
> +       dev_maps = xmap_dereference(dev->xps_rxqs_map);
> +       if (dev_maps) {
> +               nr_ids = dev->num_rx_queues;
> +               clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
> +                              count, true);
>
> -       if (!active) {
> -               RCU_INIT_POINTER(dev->xps_maps, NULL);
> -               kfree_rcu(dev_maps, rcu);
>         }
>
> -       for (i = offset + (count - 1); count--; i--)
> -               netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
> -                                            NUMA_NO_NODE);
> +       dev_maps = xmap_dereference(dev->xps_cpus_map);
> +       if (!dev_maps)
> +               goto out_no_maps;
> +
> +       if (num_possible_cpus() > 1)
> +               possible_mask = cpumask_bits(cpu_possible_mask);
> +       nr_ids = nr_cpu_ids;
> +       clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
> +                      false);
>
>  out_no_maps:
>         mutex_unlock(&xps_map_mutex);
> @@ -2170,8 +2197,8 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
>         netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
>  }
>
> -static struct xps_map *expand_xps_map(struct xps_map *map,
> -                                     int cpu, u16 index)
> +static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
> +                                     u16 index, bool is_rxqs_map)
>  {
>         struct xps_map *new_map;
>         int alloc_len = XPS_MIN_MAP_ALLOC;
> @@ -2183,7 +2210,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>                 return map;
>         }
>
> -       /* Need to add queue to this CPU's existing map */
> +       /* Need to add tx-queue to this CPU's/rx-queue's existing map */
>         if (map) {
>                 if (pos < map->alloc_len)
>                         return map;
> @@ -2191,9 +2218,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>                 alloc_len = map->alloc_len * 2;
>         }
>
> -       /* Need to allocate new map to store queue on this CPU's map */
> -       new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
> -                              cpu_to_node(cpu));
> +       /* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
> +        *  map
> +        */
> +       if (is_rxqs_map)
> +               new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
> +       else
> +               new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
> +                                      cpu_to_node(attr_index));
>         if (!new_map)
>                 return NULL;
>
> @@ -2205,14 +2237,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>         return new_map;
>  }
>
> -int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> -                       u16 index)
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> +                         u16 index, bool is_rxqs_map)
>  {
> +       const unsigned long *online_mask = NULL, *possible_mask = NULL;
>         struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
> -       int i, cpu, tci, numa_node_id = -2;
> +       int i, j, tci, numa_node_id = -2;
>         int maps_sz, num_tc = 1, tc = 0;
>         struct xps_map *map, *new_map;
>         bool active = false;
> +       unsigned int nr_ids;
>
>         if (dev->num_tc) {
>                 num_tc = dev->num_tc;
> @@ -2221,16 +2255,27 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         return -EINVAL;
>         }
>
> -       maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
> -       if (maps_sz < L1_CACHE_BYTES)
> -               maps_sz = L1_CACHE_BYTES;
> -
>         mutex_lock(&xps_map_mutex);
> +       if (is_rxqs_map) {
> +               maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
> +               dev_maps = xmap_dereference(dev->xps_rxqs_map);
> +               nr_ids = dev->num_rx_queues;
> +       } else {
> +               maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
> +               if (num_possible_cpus() > 1) {
> +                       online_mask = cpumask_bits(cpu_online_mask);
> +                       possible_mask = cpumask_bits(cpu_possible_mask);
> +               }
> +               dev_maps = xmap_dereference(dev->xps_cpus_map);
> +               nr_ids = nr_cpu_ids;
> +       }
>
> -       dev_maps = xmap_dereference(dev->xps_maps);
> +       if (maps_sz < L1_CACHE_BYTES)
> +               maps_sz = L1_CACHE_BYTES;
>
>         /* allocate memory for queue storage */
> -       for_each_cpu_and(cpu, cpu_online_mask, mask) {
> +       for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
> +            j < nr_ids;) {
>                 if (!new_dev_maps)
>                         new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
>                 if (!new_dev_maps) {
> @@ -2238,73 +2283,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         return -ENOMEM;
>                 }
>
> -               tci = cpu * num_tc + tc;
> -               map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
> +               tci = j * num_tc + tc;
> +               map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
>                                  NULL;
>
> -               map = expand_xps_map(map, cpu, index);
> +               map = expand_xps_map(map, j, index, is_rxqs_map);
>                 if (!map)
>                         goto error;
>
> -               RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +               RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>         }
>
>         if (!new_dev_maps)
>                 goto out_no_new_maps;
>
> -       for_each_possible_cpu(cpu) {
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
>                 /* copy maps belonging to foreign traffic classes */
> -               for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
> +               for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>
>                 /* We need to explicitly update tci as prevous loop
>                  * could break out early if dev_maps is NULL.
>                  */
> -               tci = cpu * num_tc + tc;
> +               tci = j * num_tc + tc;
>
> -               if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
> -                       /* add queue to CPU maps */
> +               if (attr_test_mask(j, mask, nr_ids) &&
> +                   attr_test_online(j, online_mask, nr_ids)) {
> +                       /* add tx-queue to CPU/rx-queue maps */
>                         int pos = 0;
>
> -                       map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> +                       map = xmap_dereference(new_dev_maps->attr_map[tci]);
>                         while ((pos < map->len) && (map->queues[pos] != index))
>                                 pos++;
>
>                         if (pos == map->len)
>                                 map->queues[map->len++] = index;
>  #ifdef CONFIG_NUMA
> -                       if (numa_node_id == -2)
> -                               numa_node_id = cpu_to_node(cpu);
> -                       else if (numa_node_id != cpu_to_node(cpu))
> -                               numa_node_id = -1;

Seems like there should be a comment here about meaning of -2 and -1
in NUMA node. Better yet, seems like there should be constants defined
for these special values. Maybe something to clean up in the future.
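
For instance (hypothetical names, matching the current semantics):

	#define XPS_NUMA_NODE_UNSET	-2	/* no CPU seen yet */
	#define XPS_NUMA_NODE_MIXED	-1	/* CPUs span multiple nodes */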

> +                       if (!is_rxqs_map) {
> +                               if (numa_node_id == -2)
> +                                       numa_node_id = cpu_to_node(j);
> +                               else if (numa_node_id != cpu_to_node(j))
> +                                       numa_node_id = -1;
> +                       }
>  #endif
>                 } else if (dev_maps) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>
>                 /* copy maps belonging to foreign traffic classes */
>                 for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>         }
>
> -       rcu_assign_pointer(dev->xps_maps, new_dev_maps);
> +       if (is_rxqs_map)
> +               rcu_assign_pointer(dev->xps_rxqs_map, new_dev_maps);
> +       else
> +               rcu_assign_pointer(dev->xps_cpus_map, new_dev_maps);
>
>         /* Cleanup old maps */
>         if (!dev_maps)
>                 goto out_no_old_maps;
>
> -       for_each_possible_cpu(cpu) {
> -               for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
> -                       new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = num_tc, tci = j * num_tc; i--; tci++) {
> +                       new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
>                         if (map && map != new_map)
>                                 kfree_rcu(map, rcu);
>                 }
> @@ -2317,19 +2370,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         active = true;
>
>  out_no_new_maps:
> -       /* update Tx queue numa node */
> -       netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
> -                                    (numa_node_id >= 0) ? numa_node_id :
> -                                    NUMA_NO_NODE);
> +       if (!is_rxqs_map) {
> +               /* update Tx queue numa node */
> +               netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
> +                                            (numa_node_id >= 0) ?
> +                                            numa_node_id : NUMA_NO_NODE);
> +       }
>
>         if (!dev_maps)
>                 goto out_no_maps;
>
> -       /* removes queue from unused CPUs */
> -       for_each_possible_cpu(cpu) {
> -               for (i = tc, tci = cpu * num_tc; i--; tci++)
> +       /* removes tx-queue from unused CPUs/rx-queues */
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = tc, tci = j * num_tc; i--; tci++)
>                         active |= remove_xps_queue(dev_maps, tci, index);
> -               if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
> +               if (!attr_test_mask(j, mask, nr_ids) ||
> +                   !attr_test_online(j, online_mask, nr_ids))
>                         active |= remove_xps_queue(dev_maps, tci, index);
>                 for (i = num_tc - tc, tci++; --i; tci++)
>                         active |= remove_xps_queue(dev_maps, tci, index);
> @@ -2337,7 +2394,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>
>         /* free map if not active */
>         if (!active) {
> -               RCU_INIT_POINTER(dev->xps_maps, NULL);
> +               if (is_rxqs_map)
> +                       RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
> +               else
> +                       RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
>                 kfree_rcu(dev_maps, rcu);
>         }
>
> @@ -2347,11 +2407,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         return 0;
>  error:
>         /* remove any maps that we added */
> -       for_each_possible_cpu(cpu) {
> -               for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
> -                       new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = num_tc, tci = j * num_tc; i--; tci++) {
> +                       new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
>                         map = dev_maps ?
> -                             xmap_dereference(dev_maps->cpu_map[tci]) :
> +                             xmap_dereference(dev_maps->attr_map[tci]) :
>                               NULL;
>                         if (new_map && new_map != map)
>                                 kfree(new_map);
> @@ -2363,6 +2424,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         kfree(new_dev_maps);
>         return -ENOMEM;
>  }
> +
> +int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> +                       u16 index)
> +{
> +       return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
> +}
>  EXPORT_SYMBOL(netif_set_xps_queue);
>
>  #endif
> @@ -3384,7 +3451,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>         int queue_index = -1;
>
>         rcu_read_lock();
> -       dev_maps = rcu_dereference(dev->xps_maps);
> +       dev_maps = rcu_dereference(dev->xps_cpus_map);
>         if (dev_maps) {
>                 unsigned int tci = skb->sender_cpu - 1;
>
> @@ -3393,7 +3460,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>                         tci += netdev_get_prio_tc_map(dev, skb->priority);
>                 }
>
> -               map = rcu_dereference(dev_maps->cpu_map[tci]);
> +               map = rcu_dereference(dev_maps->attr_map[tci]);
>                 if (map) {
>                         if (map->len == 1)
>                                 queue_index = map->queues[0];
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index bb7e80f..b39987c 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
>                 return -ENOMEM;
>
>         rcu_read_lock();
> -       dev_maps = rcu_dereference(dev->xps_maps);
> +       dev_maps = rcu_dereference(dev->xps_cpus_map);
>         if (dev_maps) {
>                 for_each_possible_cpu(cpu) {
>                         int i, tci = cpu * num_tc + tc;
>                         struct xps_map *map;
>
> -                       map = rcu_dereference(dev_maps->cpu_map[tci]);
> +                       map = rcu_dereference(dev_maps->attr_map[tci]);
>                         if (!map)
>                                 continue;
>
>

Acked-by: Tom Herbert <tom@quantonium.net>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 2/7] net: Use static_key for XPS maps
  2018-06-25 18:04 ` [net-next PATCH v4 2/7] net: Use static_key for XPS maps Amritha Nambiar
@ 2018-06-26 22:54   ` Tom Herbert
  0 siblings, 0 replies; 23+ messages in thread
From: Tom Herbert @ 2018-06-26 22:54 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexander Duyck, Willem de Bruijn, Sridhar Samudrala,
	Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa

On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Use static_key for XPS maps to reduce the cost of extra map checks,
> similar to how it is used for RPS and RFS. This includes static_key
> 'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
> Rx queues map.
>

Acked-by: Tom Herbert <tom@quantonium.net>

> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---
>  net/core/dev.c |   26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2552556..df2a78d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
>  EXPORT_SYMBOL(netdev_txq_to_tc);
>
>  #ifdef CONFIG_XPS
> +struct static_key xps_needed __read_mostly;
> +EXPORT_SYMBOL(xps_needed);
> +struct static_key xps_rxqs_needed __read_mostly;
> +EXPORT_SYMBOL(xps_rxqs_needed);
>  static DEFINE_MUTEX(xps_map_mutex);
>  #define xmap_dereference(P)            \
>         rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
> @@ -2170,12 +2174,14 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>
>         mutex_lock(&xps_map_mutex);
>
> -       dev_maps = xmap_dereference(dev->xps_rxqs_map);
> -       if (dev_maps) {
> -               nr_ids = dev->num_rx_queues;
> -               clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
> -                              count, true);
> -
> +       if (static_key_false(&xps_rxqs_needed)) {
> +               dev_maps = xmap_dereference(dev->xps_rxqs_map);
> +               if (dev_maps) {
> +                       nr_ids = dev->num_rx_queues;
> +                       clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
> +                                      offset, count, true);
> +               }
> +               static_key_slow_dec(&xps_rxqs_needed);
>         }
>
>         dev_maps = xmap_dereference(dev->xps_cpus_map);
> @@ -2189,6 +2195,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>                        false);
>
>  out_no_maps:
> +       static_key_slow_dec(&xps_needed);
>         mutex_unlock(&xps_map_mutex);
>  }
>
> @@ -2297,6 +2304,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
>         if (!new_dev_maps)
>                 goto out_no_new_maps;
>
> +       static_key_slow_inc(&xps_needed);
> +       if (is_rxqs_map)
> +               static_key_slow_inc(&xps_rxqs_needed);
> +
>         for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
>              j < nr_ids;) {
>                 /* copy maps belonging to foreign traffic classes */
> @@ -3450,6 +3461,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>         struct xps_map *map;
>         int queue_index = -1;
>
> +       if (!static_key_false(&xps_needed))
> +               return -1;
> +
>         rcu_read_lock();
>         dev_maps = rcu_dereference(dev->xps_cpus_map);
>         if (dev_maps) {
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map
  2018-06-25 18:04 ` [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map Amritha Nambiar
@ 2018-06-26 22:59   ` Tom Herbert
  0 siblings, 0 replies; 23+ messages in thread
From: Tom Herbert @ 2018-06-26 22:59 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexander Duyck, Willem de Bruijn, Sridhar Samudrala,
	Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa

On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>

Acked-by: Tom Herbert <tom@quantonium.net>

> ---
>  Documentation/ABI/testing/sysfs-class-net-queues |   11 ++++
>  Documentation/networking/scaling.txt             |   57 ++++++++++++++++++----
>  2 files changed, 58 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
> index 0c0df91..978b763 100644
> --- a/Documentation/ABI/testing/sysfs-class-net-queues
> +++ b/Documentation/ABI/testing/sysfs-class-net-queues
> @@ -42,6 +42,17 @@ Description:
>                 network device transmit queue. Possible vaules depend on the
>                 number of available CPU(s) in the system.
>
> +What:          /sys/class/<iface>/queues/tx-<queue>/xps_rxqs
> +Date:          June 2018
> +KernelVersion: 4.18.0
> +Contact:       netdev@vger.kernel.org
> +Description:
> +               Mask of the receive queue(s) currently enabled to participate
> +               into the Transmit Packet Steering packet processing flow for this
> +               network device transmit queue. Possible values depend on the
> +               number of available receive queue(s) in the network device.
> +               Default is disabled.
> +
>  What:          /sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/hold_time
>  Date:          November 2011
>  KernelVersion: 3.3
> diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
> index f55639d..8336116 100644
> --- a/Documentation/networking/scaling.txt
> +++ b/Documentation/networking/scaling.txt
> @@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
>
>  Transmit Packet Steering is a mechanism for intelligently selecting
>  which transmit queue to use when transmitting a packet on a multi-queue
> -device. To accomplish this, a mapping from CPU to hardware queue(s) is
> -recorded. The goal of this mapping is usually to assign queues
> +device. This can be accomplished by recording two kinds of maps, either
> +a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
> +to hardware transmit queue(s).
> +
> +1. XPS using CPUs map
> +
> +The goal of this mapping is usually to assign queues
>  exclusively to a subset of CPUs, where the transmit completions for
>  these queues are processed on a CPU within this set. This choice
>  provides two benefits. First, contention on the device queue lock is
> @@ -377,12 +382,35 @@ transmit queue). Secondly, cache miss rate on transmit completion is
>  reduced, in particular for data cache lines that hold the sk_buff
>  structures.
>
> -XPS is configured per transmit queue by setting a bitmap of CPUs that
> -may use that queue to transmit. The reverse mapping, from CPUs to
> -transmit queues, is computed and maintained for each network device.
> -When transmitting the first packet in a flow, the function
> -get_xps_queue() is called to select a queue. This function uses the ID
> -of the running CPU as a key into the CPU-to-queue lookup table. If the
> +2. XPS using receive queues map
> +
> +This mapping is used to pick transmit queue based on the receive
> +queue(s) map configuration set by the administrator. A set of receive
> +queues can be mapped to a set of transmit queues (many:many), although
> +the common use case is a 1:1 mapping. This will enable sending packets
> +on the same queue associations for transmit and receive. This is useful for
> +busy polling multi-threaded workloads where there are challenges in
> +associating a given CPU to a given application thread. The application
> +threads are not pinned to CPUs and each thread handles packets
> +received on a single queue. The receive queue number is cached in the
> +socket for the connection. In this model, sending the packets on the same
> +transmit queue corresponding to the associated receive queue has benefits
> +in keeping the CPU overhead low. Transmit completion work is locked into
> +the same queue-association that a given application is polling on. This
> +avoids the overhead of triggering an interrupt on another CPU. When the
> +application cleans up the packets during the busy poll, transmit completion
> +may be processed along with it in the same thread context and so result in
> +reduced latency.
> +
> +XPS is configured per transmit queue by setting a bitmap of
> +CPUs/receive-queues that may use that queue to transmit. The reverse
> +mapping, from CPUs to transmit queues or from receive-queues to transmit
> +queues, is computed and maintained for each network device. When
> +transmitting the first packet in a flow, the function get_xps_queue() is
> +called to select a queue. This function uses the ID of the receive queue
> +for the socket connection for a match in the receive queue-to-transmit queue
> +lookup table. Alternatively, this function can also use the ID of the
> +running CPU as a key into the CPU-to-queue lookup table. If the
>  ID matches a single queue, that is used for transmission. If multiple
>  queues match, one is selected by using the flow hash to compute an index
>  into the set.
> @@ -404,11 +432,15 @@ acknowledged.
>
>  XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
>  default for SMP). The functionality remains disabled until explicitly
> -configured. To enable XPS, the bitmap of CPUs that may use a transmit
> -queue is configured using the sysfs file entry:
> +configured. To enable XPS, the bitmap of CPUs/receive-queues that may
> +use a transmit queue is configured using the sysfs file entry:
>
> +For selection based on CPUs map:
>  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
>
> +For selection based on receive-queues map:
> +/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
> +
>  == Suggested Configuration
>
>  For a network device with a single transmission queue, XPS configuration
> @@ -421,6 +453,11 @@ best CPUs to share a given queue are probably those that share the cache
>  with the CPU that processes transmit completions for that queue
>  (transmit interrupts).
>
> +For transmit queue selection based on receive queue(s), XPS has to be
> +explicitly configured by mapping receive-queue(s) to transmit queue(s). If the
> +user configuration for receive-queue map does not apply, then the transmit
> +queue is selected based on the CPUs map.
> +
>  Per TX Queue rate limitation:
>  =============================
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
  2018-06-26  3:25     ` Alexander Duyck
@ 2018-06-26 23:54       ` Nambiar, Amritha
  0 siblings, 0 replies; 23+ messages in thread
From: Nambiar, Amritha @ 2018-06-26 23:54 UTC (permalink / raw)
  To: Alexander Duyck, Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexander Duyck, Willem de Bruijn, Sridhar Samudrala,
	Eric Dumazet, Hannes Frederic Sowa

On 6/25/2018 8:25 PM, Alexander Duyck wrote:
> On Mon, Jun 25, 2018 at 6:34 PM, Tom Herbert <tom@herbertland.com> wrote:
>>
>>
>> On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
>> <amritha.nambiar@intel.com> wrote:
>>>
>>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>>> 'int' to 'unsigned short' type with 0 indicating unset and
>>> a positive queue value being set. This way it is consistent with
>>> the queue_mapping field in the sk_buff. This will also accommodate
>>> adding a new 'unsigned short' field in sock_common in the next
>>> patch for rx_queue_mapping.
>>>
>>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>>> ---
>>>  include/net/sock.h |   10 ++++++----
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index b3b7541..009fd30 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -214,7 +214,7 @@ struct sock_common {
>>>                 struct hlist_node       skc_node;
>>>                 struct hlist_nulls_node skc_nulls_node;
>>>         };
>>> -       int                     skc_tx_queue_mapping;
>>> +       unsigned short          skc_tx_queue_mapping;
>>>         union {
>>>                 int             skc_incoming_cpu;
>>>                 u32             skc_rcv_wnd;
>>> @@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk,
>>> struct sk_buff *skb,
>>>
>>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>>  {
>>> -       sk->sk_tx_queue_mapping = tx_queue;
>>> +       /* sk_tx_queue_mapping accepts only up to a 16-bit value */
>>> +       WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>>
>>
>> Shouldn't this be USHRT_MAX - 1 ?
> 
> Actually just a ">=" would probably do as well.

Ugh! Will definitely fix this.

> 
>>
>>> +       sk->sk_tx_queue_mapping = tx_queue + 1;
>>>  }
>>>
>>>  static inline void sk_tx_queue_clear(struct sock *sk)
>>>  {
>>> -       sk->sk_tx_queue_mapping = -1;
>>>
>>> +       sk->sk_tx_queue_mapping = 0;
>>
>>
>> I think it's slightly better to define a new constant like NO_QUEUE_MAPPING
>> to be USHRT_MAX. That avoids needing to do the arithmetic every time the
>> value is accessed.

The idea was to avoid having to make any changes to the callers of
these functions and to make this similar to queue_mapping in the
skbuff, with 0 indicating unset and a positive value meaning set.
sk_tx_queue_get() returns -1 on invalid, and the callers were
validating negative values. With sk_tx_queue_mapping initialized to
USHRT_MAX, and an additional check in sk_tx_queue_get() to return -1
if sk_tx_queue_mapping holds USHRT_MAX, I think I can keep the changes
minimal and avoid the arithmetic, if that's a more acceptable solution.
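
Roughly, a sketch of that alternative (NO_QUEUE_MAPPING is a
hypothetical constant equal to USHRT_MAX):

	#define NO_QUEUE_MAPPING	USHRT_MAX

	static inline void sk_tx_queue_clear(struct sock *sk)
	{
		sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
	}

	static inline int sk_tx_queue_get(const struct sock *sk)
	{
		if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING)
			return sk->sk_tx_queue_mapping;
		return -1;
	}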

>>>
>>>  }
>>>
>>>  static inline int sk_tx_queue_get(const struct sock *sk)
>>>  {
>>> -       return sk ? sk->sk_tx_queue_mapping : -1;
>>> +       return sk ? sk->sk_tx_queue_mapping - 1 : -1;
>>
>>
>> Doesn't the comparison in __netdev_pick_tx need to be simultaneously changed
>> for this?
> 
> This doesn't change the result. It is still -1 if the queue mapping
> is not set. The field is just initialized to 0 instead of -1, so we
> have to perform the subtraction to get there.
> 
> Also, in regards to the comment above about needing an extra operation,
> I am not sure it makes much difference.
> 
> In the case of us starting with 0 as the reserved value, I think the
> instruction count should be about the same. We move the unsigned short
> into an unsigned int, then decrement, and if the value is non-negative
> we can assume it is valid. Although maybe I should double check the
> code to make certain it is doing what I thought it was supposed to be
> doing.
> 
>>
>>>
>>>
>>>
>>>  }
>>>
>>>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>>>
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
  2018-06-26 10:58   ` Willem de Bruijn
@ 2018-06-27  0:00     ` Nambiar, Amritha
  0 siblings, 0 replies; 23+ messages in thread
From: Nambiar, Amritha @ 2018-06-27  0:00 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Alexander Duyck, Samudrala,
	Sridhar, Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa,
	Tom Herbert

On 6/26/2018 3:58 AM, Willem de Bruijn wrote:
> On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
>>
>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>> 'int' to 'unsigned short' type with 0 indicating unset and
>> a positive queue value being set. This way it is consistent with
>> the queue_mapping field in the sk_buff. This will also accommodate
>> adding a new 'unsigned short' field in sock_common in the next
>> patch for rx_queue_mapping.
>>
>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>> ---
> 
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> -       sk->sk_tx_queue_mapping = tx_queue;
>> +       /* sk_tx_queue_mapping accepts only up to a 16-bit value */
>> +       WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>> +       sk->sk_tx_queue_mapping = tx_queue + 1;
>>  }
> 
> WARN_ON_ONCE to avoid flooding the kernel buffer.
> 
Will fix.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
  2018-06-26 10:55   ` Willem de Bruijn
@ 2018-06-27  0:21     ` Nambiar, Amritha
  0 siblings, 0 replies; 23+ messages in thread
From: Nambiar, Amritha @ 2018-06-27  0:21 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Alexander Duyck, Samudrala,
	Sridhar, Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa,
	Tom Herbert

On 6/26/2018 3:55 AM, Willem de Bruijn wrote:
> On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
>>
>> Extend transmit queue sysfs attribute to configure Rx queue(s) map
>> per Tx queue. By default no receive queues are configured for the
>> Tx queue.
>>
>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>>
>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>> ---
> 
>> +static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
>> +{
>> +       struct net_device *dev = queue->dev;
>> +       struct xps_dev_maps *dev_maps;
>> +       unsigned long *mask, index;
>> +       int j, len, num_tc = 1, tc = 0;
>> +
>> +       mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
>> +                      GFP_KERNEL);
>> +       if (!mask)
>> +               return -ENOMEM;
>> +
>> +       index = get_netdev_queue_index(queue);
>> +
>> +       if (dev->num_tc) {
>> +               num_tc = dev->num_tc;
>> +               tc = netdev_txq_to_tc(dev, index);
>> +               if (tc < 0)
>> +                       return -EINVAL;
> 
> Must free mask
> 
>> +static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
>> +                             size_t len)
>> +{
>> +       struct net_device *dev = queue->dev;
>> +       unsigned long *mask, index;
>> +       int err;
>> +
>> +       if (!capable(CAP_NET_ADMIN))
>> +               return -EPERM;
> 
> ns_capable?
> 
Will change this to ns_capable.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
  2018-06-26 11:04   ` Willem de Bruijn
@ 2018-06-27  0:36     ` Nambiar, Amritha
  2018-06-27 10:47       ` Willem de Bruijn
  0 siblings, 1 reply; 23+ messages in thread
From: Nambiar, Amritha @ 2018-06-27  0:36 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Alexander Duyck, Samudrala,
	Sridhar, Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa,
	Tom Herbert

On 6/26/2018 4:04 AM, Willem de Bruijn wrote:
> On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
>>
>> This patch adds support to pick Tx queue based on the Rx queue(s) map
>> configuration set by the admin through the sysfs attribute
>> for each Tx queue. If the user configuration for receive queue(s) map
>> does not apply, then the Tx queue selection falls back to CPU(s) map
>> based selection and finally to hashing.
>>
>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>> ---
> 
>> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>  {
>>  #ifdef CONFIG_XPS
>>         struct xps_dev_maps *dev_maps;
>> -       struct xps_map *map;
>> +       struct sock *sk = skb->sk;
>>         int queue_index = -1;
>>
>>         if (!static_key_false(&xps_needed))
>>                 return -1;
>>
>>         rcu_read_lock();
>> -       dev_maps = rcu_dereference(dev->xps_cpus_map);
>> +       if (!static_key_false(&xps_rxqs_needed))
>> +               goto get_cpus_map;
>> +
>> +       dev_maps = rcu_dereference(dev->xps_rxqs_map);
>>         if (dev_maps) {
>> -               unsigned int tci = skb->sender_cpu - 1;
>> +               int tci = sk_rx_queue_get(sk);
> 
> What if the rx device differs from the tx device?
> 
I think I have 3 options here:
1. Cache the ifindex in sock_common, which introduces an additional
field in sock_common.
2. Use dev_get_by_napi_id() to get the device id. This could be
expensive: if the rxqs_map is set, the lookup runs on every packet and
involves walking the hashlist for the napi_id.
3. Skip validating the device id, similar to skb_tx_hash() where the
recorded rx_queue is used and, if unset, we fall through to the flow
hash calculation (a sketch follows below).
What do you think is suitable here?
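
A sketch of option 3 (the lookup helper name is hypothetical; the
range check guards against a queue id recorded by a different device):

	int rxq = sk_rx_queue_get(sk);	/* -1 when nothing recorded */

	if (rxq >= 0 && rxq < dev->num_rx_queues)
		queue_index = xps_rxqs_map_lookup(dev, skb, rxq);
	if (queue_index < 0)
		goto get_cpus_map;	/* fall back to the CPUs map */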

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
  2018-06-27  0:36     ` Nambiar, Amritha
@ 2018-06-27 10:47       ` Willem de Bruijn
  2018-06-28  0:48         ` Nambiar, Amritha
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-06-27 10:47 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Network Development, David Miller, Alexander Duyck, Samudrala,
	Sridhar, Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa,
	Tom Herbert

> >> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
> >>  {
> >>  #ifdef CONFIG_XPS
> >>         struct xps_dev_maps *dev_maps;
> >> -       struct xps_map *map;
> >> +       struct sock *sk = skb->sk;
> >>         int queue_index = -1;
> >>
> >>         if (!static_key_false(&xps_needed))
> >>                 return -1;
> >>
> >>         rcu_read_lock();
> >> -       dev_maps = rcu_dereference(dev->xps_cpus_map);
> >> +       if (!static_key_false(&xps_rxqs_needed))
> >> +               goto get_cpus_map;
> >> +
> >> +       dev_maps = rcu_dereference(dev->xps_rxqs_map);
> >>         if (dev_maps) {
> >> -               unsigned int tci = skb->sender_cpu - 1;
> >> +               int tci = sk_rx_queue_get(sk);
> >
> > What if the rx device differs from the tx device?
> >
> I think I have 3 options here:
> 1. Cache the ifindex in sock_common, which introduces an additional
> field in sock_common.
> 2. Use dev_get_by_napi_id() to get the device id. This could be
> expensive: if the rxqs_map is set, the lookup runs on every packet and
> involves walking the hashlist for the napi_id.

The tx queue mapping is cached in the sk for connected sockets, but
indeed this would be expensive for many workloads.

> 3. Skip validating the device id, similar to skb_tx_hash() where the
> recorded rx_queue is used and, if unset, we fall through to the flow
> hash calculation.
> What do you think is suitable here?

Alternatively, just accept the misprediction in this rare case. But do
make the caveat explicit in the documentation.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues
  2018-06-26 22:53   ` Tom Herbert
@ 2018-06-28  0:47     ` Nambiar, Amritha
  0 siblings, 0 replies; 23+ messages in thread
From: Nambiar, Amritha @ 2018-06-28  0:47 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexander Duyck, Willem de Bruijn, Sridhar Samudrala,
	Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa

On 6/26/2018 3:53 PM, Tom Herbert wrote:
> On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
>> Refactor XPS code to support Tx queue selection based on
>> CPU(s) map or Rx queue(s) map.
>>
>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>> ---
>>  include/linux/cpumask.h   |   11 ++
>>  include/linux/netdevice.h |  100 +++++++++++++++++++++
>>  net/core/dev.c            |  211 ++++++++++++++++++++++++++++++---------------
>>  net/core/net-sysfs.c      |    4 -
>>  4 files changed, 246 insertions(+), 80 deletions(-)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index bf53d89..57f20a0 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
>>  #define cpu_active(cpu)                ((cpu) == 0)
>>  #endif
>>
>> -/* verify cpu argument to cpumask_* operators */
>> -static inline unsigned int cpumask_check(unsigned int cpu)
>> +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
>>  {
>>  #ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> -       WARN_ON_ONCE(cpu >= nr_cpumask_bits);
>> +       WARN_ON_ONCE(cpu >= bits);
>>  #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
>> +}
>> +
>> +/* verify cpu argument to cpumask_* operators */
>> +static inline unsigned int cpumask_check(unsigned int cpu)
>> +{
>> +       cpu_max_bits_warn(cpu, nr_cpumask_bits);
>>         return cpu;
>>  }
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 3ec9850..c534f03 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -730,10 +730,15 @@ struct xps_map {
>>   */
>>  struct xps_dev_maps {
>>         struct rcu_head rcu;
>> -       struct xps_map __rcu *cpu_map[0];
>> +       struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
>>  };
>> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +         \
>> +
>> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +     \
>>         (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
>> +
>> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
>> +       (_rxqs * (_tcs) * sizeof(struct xps_map *)))
>> +
>>  #endif /* CONFIG_XPS */
>>
>>  #define TC_MAX_QUEUE   16
>> @@ -1909,7 +1914,8 @@ struct net_device {
>>         int                     watchdog_timeo;
>>
>>  #ifdef CONFIG_XPS
>> -       struct xps_dev_maps __rcu *xps_maps;
>> +       struct xps_dev_maps __rcu *xps_cpus_map;
>> +       struct xps_dev_maps __rcu *xps_rxqs_map;
>>  #endif
>>  #ifdef CONFIG_NET_CLS_ACT
>>         struct mini_Qdisc __rcu *miniq_egress;
>> @@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
>>  #ifdef CONFIG_XPS
>>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>>                         u16 index);
>> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
>> +                         u16 index, bool is_rxqs_map);
>> +
>> +/**
>> + *     attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues mask
>> + *     @j: CPU/Rx queue index
>> + *     @mask: bitmask of all cpus/rx queues
>> + *     @nr_bits: number of bits in the bitmask
>> + *
>> + * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
>> + */
>> +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
>> +                                 unsigned int nr_bits)
>> +{
>> +       cpu_max_bits_warn(j, nr_bits);
>> +       return test_bit(j, mask);
>> +}
>> +
>> +/**
>> + *     attr_test_online - Test for online CPU/Rx queue
>> + *     @j: CPU/Rx queue index
>> + *     @online_mask: bitmask for CPUs/Rx queues that are online
>> + *     @nr_bits: number of bits in the bitmask
>> + *
>> + * Returns true if a CPU/Rx queue is online.
>> + */
>> +static inline bool attr_test_online(unsigned long j,
>> +                                   const unsigned long *online_mask,
>> +                                   unsigned int nr_bits)
>> +{
>> +       cpu_max_bits_warn(j, nr_bits);
>> +
>> +       if (online_mask)
>> +               return test_bit(j, online_mask);
>> +
>> +       if (j >= 0 && j < nr_bits)
> 
> j is unsigned so j >= 0 is superfluous.
> 
>> +               return true;
>> +
>> +       return false;
> 
> Could just do:
> 
> return (j < nr_bits);

Will fix.
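
With that folded in, the helper reduces to (a sketch of the planned
fix):

static inline bool attr_test_online(unsigned long j,
				    const unsigned long *online_mask,
				    unsigned int nr_bits)
{
	cpu_max_bits_warn(j, nr_bits);

	if (online_mask)
		return test_bit(j, online_mask);

	/* Without an online mask, any in-range index counts as online. */
	return j < nr_bits;
}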

> 
>> +}
>> +
>> +/**
>> + *     attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask
>> + *     @n: CPU/Rx queue index
>> + *     @srcp: the cpumask/Rx queue mask pointer
>> + *     @nr_bits: number of bits in the bitmask
>> + *
>> + * Returns >= nr_bits if no further CPUs/Rx queues set.
>> + */
>> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
>> +                                        unsigned int nr_bits)
>> +{
>> +       /* -1 is a legal arg here. */
>> +       if (n != -1)
>> +               cpu_max_bits_warn(n, nr_bits);
>> +
>> +       if (srcp)
>> +               return find_next_bit(srcp, nr_bits, n + 1);
>> +
>> +       return n + 1;
>> +}
>> +
>> +/**
>> + *     attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
>> + *     @n: CPU/Rx queue index
>> + *     @src1p: the first CPUs/Rx queues mask pointer
>> + *     @src2p: the second CPUs/Rx queues mask pointer
>> + *     @nr_bits: number of bits in the bitmask
>> + *
>> + * Returns >= nr_bits if no further CPUs/Rx queues set in both.
>> + */
>> +static inline int attrmask_next_and(int n, const unsigned long *src1p,
>> +                                   const unsigned long *src2p,
>> +                                   unsigned int nr_bits)
>> +{
>> +       /* -1 is a legal arg here. */
>> +       if (n != -1)
>> +               cpu_max_bits_warn(n, nr_bits);
>> +
>> +       if (src1p && src2p)
>> +               return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
>> +       else if (src1p)
>> +               return find_next_bit(src1p, nr_bits, n + 1);
>> +       else if (src2p)
>> +               return find_next_bit(src2p, nr_bits, n + 1);
>> +
>> +       return n + 1;
>> +}
>>  #else
>>  static inline int netif_set_xps_queue(struct net_device *dev,
>>                                       const struct cpumask *mask,
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index a5aa1c7..2552556 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
>>         int pos;
>>
>>         if (dev_maps)
>> -               map = xmap_dereference(dev_maps->cpu_map[tci]);
>> +               map = xmap_dereference(dev_maps->attr_map[tci]);
>>         if (!map)
>>                 return false;
>>
>> @@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
>>                         break;
>>                 }
>>
>> -               RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
>> +               RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
>>                 kfree_rcu(map, rcu);
>>                 return false;
>>         }
>> @@ -2135,31 +2135,58 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
>>         return active;
>>  }
>>
>> +static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
>> +                          struct xps_dev_maps *dev_maps, unsigned int nr_ids,
>> +                          u16 offset, u16 count, bool is_rxqs_map)
>> +{
>> +       bool active = false;
>> +       int i, j;
>> +
>> +       for (j = -1; j = attrmask_next(j, mask, nr_ids),
>> +            j < nr_ids;)
>> +               active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
>> +                                              count);
>> +       if (!active) {
>> +               if (is_rxqs_map) {
>> +                       RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
>> +               } else {
>> +                       RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
>> +
>> +                       for (i = offset + (count - 1); count--; i--)
>> +                               netdev_queue_numa_node_write(
>> +                                       netdev_get_tx_queue(dev, i),
>> +                                                       NUMA_NO_NODE);
>> +               }
>> +               kfree_rcu(dev_maps, rcu);
>> +       }
>> +}
>> +
>>  static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>>                                    u16 count)
>>  {
>> +       const unsigned long *possible_mask = NULL;
>>         struct xps_dev_maps *dev_maps;
>> -       int cpu, i;
>> -       bool active = false;
>> +       unsigned int nr_ids;
>>
>>         mutex_lock(&xps_map_mutex);
>> -       dev_maps = xmap_dereference(dev->xps_maps);
>>
>> -       if (!dev_maps)
>> -               goto out_no_maps;
>> -
>> -       for_each_possible_cpu(cpu)
>> -               active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
>> -                                              offset, count);
>> +       dev_maps = xmap_dereference(dev->xps_rxqs_map);
>> +       if (dev_maps) {
>> +               nr_ids = dev->num_rx_queues;
>> +               clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
>> +                              count, true);
>>
>> -       if (!active) {
>> -               RCU_INIT_POINTER(dev->xps_maps, NULL);
>> -               kfree_rcu(dev_maps, rcu);
>>         }
>>
>> -       for (i = offset + (count - 1); count--; i--)
>> -               netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
>> -                                            NUMA_NO_NODE);
>> +       dev_maps = xmap_dereference(dev->xps_cpus_map);
>> +       if (!dev_maps)
>> +               goto out_no_maps;
>> +
>> +       if (num_possible_cpus() > 1)
>> +               possible_mask = cpumask_bits(cpu_possible_mask);
>> +       nr_ids = nr_cpu_ids;
>> +       clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
>> +                      false);
>>
>>  out_no_maps:
>>         mutex_unlock(&xps_map_mutex);
>> @@ -2170,8 +2197,8 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
>>         netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
>>  }
>>
>> -static struct xps_map *expand_xps_map(struct xps_map *map,
>> -                                     int cpu, u16 index)
>> +static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
>> +                                     u16 index, bool is_rxqs_map)
>>  {
>>         struct xps_map *new_map;
>>         int alloc_len = XPS_MIN_MAP_ALLOC;
>> @@ -2183,7 +2210,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>>                 return map;
>>         }
>>
>> -       /* Need to add queue to this CPU's existing map */
>> +       /* Need to add tx-queue to this CPU's/rx-queue's existing map */
>>         if (map) {
>>                 if (pos < map->alloc_len)
>>                         return map;
>> @@ -2191,9 +2218,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>>                 alloc_len = map->alloc_len * 2;
>>         }
>>
>> -       /* Need to allocate new map to store queue on this CPU's map */
>> -       new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
>> -                              cpu_to_node(cpu));
>> +       /* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
>> +        *  map
>> +        */
>> +       if (is_rxqs_map)
>> +               new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
>> +       else
>> +               new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
>> +                                      cpu_to_node(attr_index));
>>         if (!new_map)
>>                 return NULL;
>>
>> @@ -2205,14 +2237,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>>         return new_map;
>>  }
>>
>> -int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>> -                       u16 index)
>> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
>> +                         u16 index, bool is_rxqs_map)
>>  {
>> +       const unsigned long *online_mask = NULL, *possible_mask = NULL;
>>         struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
>> -       int i, cpu, tci, numa_node_id = -2;
>> +       int i, j, tci, numa_node_id = -2;
>>         int maps_sz, num_tc = 1, tc = 0;
>>         struct xps_map *map, *new_map;
>>         bool active = false;
>> +       unsigned int nr_ids;
>>
>>         if (dev->num_tc) {
>>                 num_tc = dev->num_tc;
>> @@ -2221,16 +2255,27 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>>                         return -EINVAL;
>>         }
>>
>> -       maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
>> -       if (maps_sz < L1_CACHE_BYTES)
>> -               maps_sz = L1_CACHE_BYTES;
>> -
>>         mutex_lock(&xps_map_mutex);
>> +       if (is_rxqs_map) {
>> +               maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
>> +               dev_maps = xmap_dereference(dev->xps_rxqs_map);
>> +               nr_ids = dev->num_rx_queues;
>> +       } else {
>> +               maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
>> +               if (num_possible_cpus() > 1) {
>> +                       online_mask = cpumask_bits(cpu_online_mask);
>> +                       possible_mask = cpumask_bits(cpu_possible_mask);
>> +               }
>> +               dev_maps = xmap_dereference(dev->xps_cpus_map);
>> +               nr_ids = nr_cpu_ids;
>> +       }
>>
>> -       dev_maps = xmap_dereference(dev->xps_maps);
>> +       if (maps_sz < L1_CACHE_BYTES)
>> +               maps_sz = L1_CACHE_BYTES;
>>
>>         /* allocate memory for queue storage */
>> -       for_each_cpu_and(cpu, cpu_online_mask, mask) {
>> +       for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
>> +            j < nr_ids;) {
>>                 if (!new_dev_maps)
>>                         new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
>>                 if (!new_dev_maps) {
>> @@ -2238,73 +2283,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>>                         return -ENOMEM;
>>                 }
>>
>> -               tci = cpu * num_tc + tc;
>> -               map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
>> +               tci = j * num_tc + tc;
>> +               map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
>>                                  NULL;
>>
>> -               map = expand_xps_map(map, cpu, index);
>> +               map = expand_xps_map(map, j, index, is_rxqs_map);
>>                 if (!map)
>>                         goto error;
>>
>> -               RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
>> +               RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>>         }
>>
>>         if (!new_dev_maps)
>>                 goto out_no_new_maps;
>>
>> -       for_each_possible_cpu(cpu) {
>> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
>> +            j < nr_ids;) {
>>                 /* copy maps belonging to foreign traffic classes */
>> -               for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
>> +               for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
>>                         /* fill in the new device map from the old device map */
>> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
>> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
>> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
>> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>>                 }
>>
>>                 /* We need to explicitly update tci as prevous loop
>>                  * could break out early if dev_maps is NULL.
>>                  */
>> -               tci = cpu * num_tc + tc;
>> +               tci = j * num_tc + tc;
>>
>> -               if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
>> -                       /* add queue to CPU maps */
>> +               if (attr_test_mask(j, mask, nr_ids) &&
>> +                   attr_test_online(j, online_mask, nr_ids)) {
>> +                       /* add tx-queue to CPU/rx-queue maps */
>>                         int pos = 0;
>>
>> -                       map = xmap_dereference(new_dev_maps->cpu_map[tci]);
>> +                       map = xmap_dereference(new_dev_maps->attr_map[tci]);
>>                         while ((pos < map->len) && (map->queues[pos] != index))
>>                                 pos++;
>>
>>                         if (pos == map->len)
>>                                 map->queues[map->len++] = index;
>>  #ifdef CONFIG_NUMA
>> -                       if (numa_node_id == -2)
>> -                               numa_node_id = cpu_to_node(cpu);
>> -                       else if (numa_node_id != cpu_to_node(cpu))
>> -                               numa_node_id = -1;
> 
> Seems like there should be a comment here about meaning of -2 and -1
> in NUMA node. Better yet, seems like there should be constants defined
> for these special values. Maybe something to clean up in the future.

Will have a separate patch (not part of this series) for this.
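
Such a cleanup might introduce named sentinels, e.g. (hypothetical
names, not part of this series):

/* Hypothetical sentinels for the numa_node_id bookkeeping:
 * UNSET: no CPU examined yet, MISMATCH: CPUs span multiple nodes.
 */
#define XPS_NUMA_NODE_UNSET	-2
#define XPS_NUMA_NODE_MISMATCH	-1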

> 
>> +                       if (!is_rxqs_map) {
>> +                               if (numa_node_id == -2)
>> +                                       numa_node_id = cpu_to_node(j);
>> +                               else if (numa_node_id != cpu_to_node(j))
>> +                                       numa_node_id = -1;
>> +                       }
>>  #endif
>>                 } else if (dev_maps) {
>>                         /* fill in the new device map from the old device map */
>> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
>> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
>> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
>> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>>                 }
>>
>>                 /* copy maps belonging to foreign traffic classes */
>>                 for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
>>                         /* fill in the new device map from the old device map */
>> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
>> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
>> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
>> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>>                 }
>>         }
>>
>> -       rcu_assign_pointer(dev->xps_maps, new_dev_maps);
>> +       if (is_rxqs_map)
>> +               rcu_assign_pointer(dev->xps_rxqs_map, new_dev_maps);
>> +       else
>> +               rcu_assign_pointer(dev->xps_cpus_map, new_dev_maps);
>>
>>         /* Cleanup old maps */
>>         if (!dev_maps)
>>                 goto out_no_old_maps;
>>
>> -       for_each_possible_cpu(cpu) {
>> -               for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
>> -                       new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
>> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
>> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
>> +            j < nr_ids;) {
>> +               for (i = num_tc, tci = j * num_tc; i--; tci++) {
>> +                       new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
>> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
>>                         if (map && map != new_map)
>>                                 kfree_rcu(map, rcu);
>>                 }
>> @@ -2317,19 +2370,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>>         active = true;
>>
>>  out_no_new_maps:
>> -       /* update Tx queue numa node */
>> -       netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
>> -                                    (numa_node_id >= 0) ? numa_node_id :
>> -                                    NUMA_NO_NODE);
>> +       if (!is_rxqs_map) {
>> +               /* update Tx queue numa node */
>> +               netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
>> +                                            (numa_node_id >= 0) ?
>> +                                            numa_node_id : NUMA_NO_NODE);
>> +       }
>>
>>         if (!dev_maps)
>>                 goto out_no_maps;
>>
>> -       /* removes queue from unused CPUs */
>> -       for_each_possible_cpu(cpu) {
>> -               for (i = tc, tci = cpu * num_tc; i--; tci++)
>> +       /* removes tx-queue from unused CPUs/rx-queues */
>> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
>> +            j < nr_ids;) {
>> +               for (i = tc, tci = j * num_tc; i--; tci++)
>>                         active |= remove_xps_queue(dev_maps, tci, index);
>> -               if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
>> +               if (!attr_test_mask(j, mask, nr_ids) ||
>> +                   !attr_test_online(j, online_mask, nr_ids))
>>                         active |= remove_xps_queue(dev_maps, tci, index);
>>                 for (i = num_tc - tc, tci++; --i; tci++)
>>                         active |= remove_xps_queue(dev_maps, tci, index);
>> @@ -2337,7 +2394,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>>
>>         /* free map if not active */
>>         if (!active) {
>> -               RCU_INIT_POINTER(dev->xps_maps, NULL);
>> +               if (is_rxqs_map)
>> +                       RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
>> +               else
>> +                       RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
>>                 kfree_rcu(dev_maps, rcu);
>>         }
>>
>> @@ -2347,11 +2407,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>>         return 0;
>>  error:
>>         /* remove any maps that we added */
>> -       for_each_possible_cpu(cpu) {
>> -               for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
>> -                       new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
>> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
>> +            j < nr_ids;) {
>> +               for (i = num_tc, tci = j * num_tc; i--; tci++) {
>> +                       new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
>>                         map = dev_maps ?
>> -                             xmap_dereference(dev_maps->cpu_map[tci]) :
>> +                             xmap_dereference(dev_maps->attr_map[tci]) :
>>                               NULL;
>>                         if (new_map && new_map != map)
>>                                 kfree(new_map);
>> @@ -2363,6 +2424,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>>         kfree(new_dev_maps);
>>         return -ENOMEM;
>>  }
>> +
>> +int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>> +                       u16 index)
>> +{
>> +       return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
>> +}
>>  EXPORT_SYMBOL(netif_set_xps_queue);
>>
>>  #endif
>> @@ -3384,7 +3451,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>         int queue_index = -1;
>>
>>         rcu_read_lock();
>> -       dev_maps = rcu_dereference(dev->xps_maps);
>> +       dev_maps = rcu_dereference(dev->xps_cpus_map);
>>         if (dev_maps) {
>>                 unsigned int tci = skb->sender_cpu - 1;
>>
>> @@ -3393,7 +3460,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>                         tci += netdev_get_prio_tc_map(dev, skb->priority);
>>                 }
>>
>> -               map = rcu_dereference(dev_maps->cpu_map[tci]);
>> +               map = rcu_dereference(dev_maps->attr_map[tci]);
>>                 if (map) {
>>                         if (map->len == 1)
>>                                 queue_index = map->queues[0];
>> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
>> index bb7e80f..b39987c 100644
>> --- a/net/core/net-sysfs.c
>> +++ b/net/core/net-sysfs.c
>> @@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
>>                 return -ENOMEM;
>>
>>         rcu_read_lock();
>> -       dev_maps = rcu_dereference(dev->xps_maps);
>> +       dev_maps = rcu_dereference(dev->xps_cpus_map);
>>         if (dev_maps) {
>>                 for_each_possible_cpu(cpu) {
>>                         int i, tci = cpu * num_tc + tc;
>>                         struct xps_map *map;
>>
>> -                       map = rcu_dereference(dev_maps->cpu_map[tci]);
>> +                       map = rcu_dereference(dev_maps->attr_map[tci]);
>>                         if (!map)
>>                                 continue;
>>
>>
> 
> Acked-by: Tom Herbert <tom@quantonium.net>
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
  2018-06-27 10:47       ` Willem de Bruijn
@ 2018-06-28  0:48         ` Nambiar, Amritha
  0 siblings, 0 replies; 23+ messages in thread
From: Nambiar, Amritha @ 2018-06-28  0:48 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Alexander Duyck, Samudrala,
	Sridhar, Alexander Duyck, Eric Dumazet, Hannes Frederic Sowa,
	Tom Herbert

On 6/27/2018 3:47 AM, Willem de Bruijn wrote:
>>>> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>>>  {
>>>>  #ifdef CONFIG_XPS
>>>>         struct xps_dev_maps *dev_maps;
>>>> -       struct xps_map *map;
>>>> +       struct sock *sk = skb->sk;
>>>>         int queue_index = -1;
>>>>
>>>>         if (!static_key_false(&xps_needed))
>>>>                 return -1;
>>>>
>>>>         rcu_read_lock();
>>>> -       dev_maps = rcu_dereference(dev->xps_cpus_map);
>>>> +       if (!static_key_false(&xps_rxqs_needed))
>>>> +               goto get_cpus_map;
>>>> +
>>>> +       dev_maps = rcu_dereference(dev->xps_rxqs_map);
>>>>         if (dev_maps) {
>>>> -               unsigned int tci = skb->sender_cpu - 1;
>>>> +               int tci = sk_rx_queue_get(sk);
>>>
>>> What if the rx device differs from the tx device?
>>>
>> I think I have 3 options here:
>> 1. Cache the ifindex in sock_common, which introduces an additional
>> field in sock_common.
>> 2. Use dev_get_by_napi_id to get the device id. This could be
>> expensive: if the rxqs_map is set, the lookup runs on every packet and
>> walks the hashlist to find the napi_id.
> 
> The tx queue mapping is cached in the sk for connected sockets, but
> indeed this would be expensive for many workloads.
> 
>> 3. Skip device id validation entirely, similar to skb_tx_hash, where
>> the recorded rx_queue is used if present and we otherwise fall through
>> to flow hash calculation.
>> What do you think is suitable here?
> 
> Alternatively, just accept the misprediction in this rare case. But do
> make the caveat explicit in the documentation.
> 
Okay, I will add this to the documentation.
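
A possible wording for that caveat (suggested text only, not the final
patch):

  The receive queue recorded in a socket is not validated against the
  transmit device. If a flow is received on a different device than the
  one it is transmitted on, the rx-queue based selection may pick a
  suboptimal, though still valid, transmit queue.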

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-06-28  0:48 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-25 18:04 [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues Amritha Nambiar
2018-06-25 18:04 ` [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and " Amritha Nambiar
2018-06-26 22:53   ` Tom Herbert
2018-06-28  0:47     ` Nambiar, Amritha
2018-06-25 18:04 ` [net-next PATCH v4 2/7] net: Use static_key for XPS maps Amritha Nambiar
2018-06-26 22:54   ` Tom Herbert
2018-06-25 18:04 ` [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short Amritha Nambiar
     [not found]   ` <CALx6S37uFs1shuPmno+L=p_Hyy1Q2qNaK+AqYvrk4HXTApL_Vg@mail.gmail.com>
2018-06-26  3:25     ` Alexander Duyck
2018-06-26 23:54       ` Nambiar, Amritha
2018-06-26 10:58   ` Willem de Bruijn
2018-06-27  0:00     ` Nambiar, Amritha
2018-06-25 18:04 ` [net-next PATCH v4 4/7] net: Record receive queue number for a connection Amritha Nambiar
2018-06-25 18:04 ` [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues Amritha Nambiar
2018-06-26  2:04   ` kbuild test robot
2018-06-26 11:04   ` Willem de Bruijn
2018-06-27  0:36     ` Nambiar, Amritha
2018-06-27 10:47       ` Willem de Bruijn
2018-06-28  0:48         ` Nambiar, Amritha
2018-06-25 18:04 ` [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue Amritha Nambiar
2018-06-26 10:55   ` Willem de Bruijn
2018-06-27  0:21     ` Nambiar, Amritha
2018-06-25 18:04 ` [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map Amritha Nambiar
2018-06-26 22:59   ` Tom Herbert
