* [PATCH v2 net-next 0/5]  xps_flows: XPS flow steering when there is no socket
@ 2016-09-29  3:54 Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk Tom Herbert
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

This patch set introduces transmit flow steering for socketless packets.
The idea is that we record the transmit queues in a flow table that is
indexed by skbuff hash.  The flow table entries have two values: the
queue_index and the head cnt of packets from the TX queue. We only allow
a queue to change for a flow if the tail cnt in the TX queue advances
beyond the recorded head cnt. That condition indicates that all
outstanding packets for the flow have completed transmission, so it is
safe to change queues.

Tracking the in-flight packets for a queue is performed as part of DQL.
Two fields are added to the dql structure: num_enqueue_ops and
num_completed_ops. num_enqueue_ops is incremented in dql_queued, and
num_completed_ops is incremented in dql_completed by the number of
operations completed (a new argument to the function).
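
For reference, the queue-change check works out to roughly the following
(a simplified sketch of the logic in patch 4, not the exact code;
xps_flow_may_move is just an illustrative name):

  /* ent.queue_ptr holds dql->num_enqueue_ops sampled when the flow last
   * picked ent.queue_index.  The flow may move to a different queue only
   * once the completion counter of the old queue has caught up with it.
   */
  static bool xps_flow_may_move(const struct netdev_queue *old_txq,
                                struct xps_dev_flow ent)
  {
          return (int)(old_txq->dql.num_completed_ops - ent.queue_ptr) >= 0;
  }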

This patch set creates /sys/class/net/eth*/xps_dev_flow_table_cnt,
which sets the number of entries in the XPS flow table.

Note that the functionality here is technically best effort (for
instance we don't obtain a lock while processing a flow table entry).
Under high load it is possible that OOO packets can still be generated
due to XPS if two threads are hammering on the same flow table entry.
The assumption of these patches is that OOO packets are not the end of
the world, and that these patches should prevent OOO in the most common
XPS use cases.

This is a followup to the previous RFC version. Fixes from the RFC are:

  - Move counters to DQL
  - Fixed typo
  - Simplified the get flow index function
  - Fixed sysfs flow_table_cnt to properly use DEVICE_ATTR_RW
  - Renamed the mechanism

V2:
  - Added documentation in scaling.txt and sysfs documentation
  - Call skb_tx_hash directly from get_xps_queue. This allows
    the socketless transmit flow steering to work properly if
    a flow is bouncing between non-XPS and XPS CPUs. (suggested
    by Alexander Duyck).
  - Added a whole bunch of test results provided by Rick Jones
    (Thanks Rick!)

Tested:
  Manually forced all packets to go through the xps_flows path.
  Observed that some flows were deferred from changing queues because
  packets were still in flight for the flow bucket.

Testing done by Rick Jones:

  Here is a quick look at performance results from trying the prototype
  fix for the packet reordering problem with VMs sending over an
  XPS-configured NIC, in particular the Emulex/Avago/Broadcom Skyhawk.
  The fix was applied to a 4.4 kernel.

  Before: 3884 Mbit/s
  After: 8897 Mbit/s

  That was from a VM on a node with a Skyhawk and 2 E5-2640 processors
  to baremetal E5-2640 with a BE3.  Physical MTU was 1500, the VM's
  vNIC's MTU was 1400.  Systems were HPE ProLiants in OS Control Mode
  for power management, with the "performance" frequency governor
  loaded. An OpenStack Mitaka setup with Distributed Virtual Router.

  We had some other NIC types in the setup as well.  XPS was also
  enabled on the ConnectX3-Pro.  It was not enabled on the 82599ES (a
  function of the kernel being used, which had it disabled from the
  first reports of XPS negatively affecting VM traffic at the beginning
  of the year).

  Average Mbit/s From NIC type To Bare Metal BE3:

   NIC Type,
   CPU on VM Host            Before        After
  ------------------------------------------------
  ConnectX-3 Pro,E5-2670v3    9224         9271
  BE3, E5-2640                9016         9022
  82599, E5-2640              9192         9003
  BCM57840, E5-2640           9213         9153
  Skyhawk, E5-2640            3884         8897

  For completeness:
  Average Mbit/s To NIC type from Bare Metal BE3:

  NIC Type,
  CPU on VM Host            Before        After
  ------------------------------------------------
  ConnectX-3 Pro,E5-2670v3    9322         9144
  BE3, E5-2640                9074         9017
  82599, E5-2640              8670         8564
  BCM57840, E5-2640           2468 *       7979
  Skyhawk, E5-2640            8897         9269

  * This is the busted bnx2x NIC FW GRO implementation issue.  It was
    not visible in the "After" column because the system was set up to
    disable the NIC FW GRO by the time it booted on the fixed kernel.

  Average Transactions/s Between NIC type and Bare Metal BE3:

  NIC Type,
  CPU on VM Host            Before        After
  ------------------------------------------------
  ConnectX-3 Pro,E5-2670v3   12421         12612
  BE3, E5-2640                8178          8484
  82599, E5-2640              8499          8549
  BCM57840, E5-2640           8544          8560
  Skyhawk, E5-2640            8537          8701

Tom Herbert (5):
  net: Set SW hash in skb_set_hash_from_sk
  dql: Add counters for number of queuing and completion operations
  net: Add xps_dev_flow_table_cnt
  xps_flows: XPS for packets that don't have a socket
  xps: Documentation for transmit socketless flow steering

 Documentation/ABI/testing/sysfs-class-net |   8 +++
 Documentation/networking/scaling.txt      |  26 ++++++++
 include/linux/dynamic_queue_limits.h      |   7 +-
 include/linux/netdevice.h                 |  26 +++++++-
 include/net/sock.h                        |   6 +-
 lib/dynamic_queue_limits.c                |   3 +-
 net/Kconfig                               |   6 ++
 net/core/dev.c                            |  87 +++++++++++++++++++------
 net/core/net-sysfs.c                      | 103 ++++++++++++++++++++++++++++++
 9 files changed, 246 insertions(+), 26 deletions(-)

-- 
2.9.3

* [PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 2/5] dql: Add counters for number of queuing and completion operations Tom Herbert
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

Use __skb_set_sw_hash() to set the hash in an skbuff from the socket
txhash.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/net/sock.h | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ebf75db..17d379a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1920,10 +1920,8 @@ static inline void sock_poll_wait(struct file *filp,
 
 static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
 {
-	if (sk->sk_txhash) {
-		skb->l4_hash = 1;
-		skb->hash = sk->sk_txhash;
-	}
+	if (sk->sk_txhash)
+		__skb_set_sw_hash(skb, sk->sk_txhash, true);
 }
 
 void skb_set_owner_w(struct sk_buff *skb, struct sock *sk);
-- 
2.9.3

* [PATCH v2 net-next 2/5] dql: Add counters for number of queuing and completion operations
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 3/5] net: Add xps_dev_flow_table_cnt Tom Herbert
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

Add two new counters to struct dql: num_enqueue_ops and
num_completed_ops. num_enqueue_ops is incremented by one in each call to
dql_queued. num_completed_ops is incremented in dql_completed, which
takes a new argument indicating the number of operations completed.
These counters are only intended for statistics and do not impact the
BQL algorithm.

We add a new sysfs entry in byte_queue_limits named inflight_pkts.
It reports the number of packets in flight for the queue, computed as
dql->num_enqueue_ops - dql->num_completed_ops.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/dynamic_queue_limits.h |  7 ++++++-
 include/linux/netdevice.h            |  2 +-
 lib/dynamic_queue_limits.c           |  3 ++-
 net/core/net-sysfs.c                 | 14 ++++++++++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index a4be703..b6a4804 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -43,6 +43,8 @@ struct dql {
 	unsigned int	adj_limit;		/* limit + num_completed */
 	unsigned int	last_obj_cnt;		/* Count at last queuing */
 
+	unsigned int	num_enqueue_ops;	/* Number of queue operations */
+
 	/* Fields accessed only by completion path (dql_completed) */
 
 	unsigned int	limit ____cacheline_aligned_in_smp; /* Current limit */
@@ -55,6 +57,8 @@ struct dql {
 	unsigned int	lowest_slack;		/* Lowest slack found */
 	unsigned long	slack_start_time;	/* Time slacks seen */
 
+	unsigned int	num_completed_ops;	/* Number of complete ops */
+
 	/* Configuration */
 	unsigned int	max_limit;		/* Max limit */
 	unsigned int	min_limit;		/* Minimum limit */
@@ -83,6 +87,7 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
 	barrier();
 
 	dql->num_queued += count;
+	dql->num_enqueue_ops++;
 }
 
 /* Returns how many objects can be queued, < 0 indicates over limit. */
@@ -92,7 +97,7 @@ static inline int dql_avail(const struct dql *dql)
 }
 
 /* Record number of completed objects and recalculate the limit. */
-void dql_completed(struct dql *dql, unsigned int count);
+void dql_completed(struct dql *dql, unsigned int count, unsigned int ops);
 
 /* Reset dql state */
 void dql_reset(struct dql *dql);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 136ae6bb..9567107 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3015,7 +3015,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue,
 	if (unlikely(!bytes))
 		return;
 
-	dql_completed(&dev_queue->dql, bytes);
+	dql_completed(&dev_queue->dql, bytes, pkts);
 
 	/*
 	 * Without the memory barrier there is a small possiblity that
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index f346715..d5e7a27 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -14,7 +14,7 @@
 #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
 
 /* Records completed count and recalculates the queue limit */
-void dql_completed(struct dql *dql, unsigned int count)
+void dql_completed(struct dql *dql, unsigned int count, unsigned int ops)
 {
 	unsigned int inprogress, prev_inprogress, limit;
 	unsigned int ovlimit, completed, num_queued;
@@ -108,6 +108,7 @@ void dql_completed(struct dql *dql, unsigned int count)
 	dql->prev_ovlimit = ovlimit;
 	dql->prev_last_obj_cnt = dql->last_obj_cnt;
 	dql->num_completed = completed;
+	dql->num_completed_ops += ops;
 	dql->prev_num_queued = num_queued;
 }
 EXPORT_SYMBOL(dql_completed);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 6e4f347..ab7b0b6 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1147,6 +1147,19 @@ static ssize_t bql_show_inflight(struct netdev_queue *queue,
 static struct netdev_queue_attribute bql_inflight_attribute =
 	__ATTR(inflight, S_IRUGO, bql_show_inflight, NULL);
 
+static ssize_t bql_show_inflight_pkts(struct netdev_queue *queue,
+				      struct netdev_queue_attribute *attr,
+				      char *buf)
+{
+	struct dql *dql = &queue->dql;
+
+	return sprintf(buf, "%u\n",
+		       dql->num_enqueue_ops - dql->num_completed_ops);
+}
+
+static struct netdev_queue_attribute bql_inflight_pkts_attribute =
+	__ATTR(inflight_pkts, S_IRUGO, bql_show_inflight_pkts, NULL);
+
 #define BQL_ATTR(NAME, FIELD)						\
 static ssize_t bql_show_ ## NAME(struct netdev_queue *queue,		\
 				 struct netdev_queue_attribute *attr,	\
@@ -1176,6 +1189,7 @@ static struct attribute *dql_attrs[] = {
 	&bql_limit_min_attribute.attr,
 	&bql_hold_time_attribute.attr,
 	&bql_inflight_attribute.attr,
+	&bql_inflight_pkts_attribute.attr,
 	NULL
 };
 
-- 
2.9.3

* [PATCH v2 net-next 3/5] net: Add xps_dev_flow_table_cnt
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 2/5] dql: Add counters for number of queuing and completion operations Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 5/5] xps: Documentation for transmit socketless flow steering Tom Herbert
  4 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

Add infrastructure and definitions to create XPS flow tables. This
creates the new sysfs entry /sys/class/net/eth*/xps_dev_flow_table_cnt.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/netdevice.h | 24 +++++++++++++
 net/core/net-sysfs.c      | 89 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9567107..60063b3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -736,6 +736,27 @@ struct xps_dev_maps {
     (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */
 
+#ifdef CONFIG_XPS_FLOWS
+struct xps_dev_flow {
+	union {
+		u64	v64;
+		struct {
+			int		queue_index;
+			unsigned int	queue_ptr;
+		};
+	};
+};
+
+struct xps_dev_flow_table {
+	unsigned int mask;
+	struct rcu_head rcu;
+	struct xps_dev_flow flows[0];
+};
+#define XPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct xps_dev_flow_table) + \
+	((_num) * sizeof(struct xps_dev_flow)))
+
+#endif /* CONFIG_XPS_FLOWS */
+
 #define TC_MAX_QUEUE	16
 #define TC_BITMASK	15
 /* HW offloaded queuing disciplines txq count and offset maps */
@@ -1825,6 +1846,9 @@ struct net_device {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_maps;
 #endif
+#ifdef CONFIG_XPS_FLOWS
+	struct xps_dev_flow_table __rcu *xps_flow_table;
+#endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct tcf_proto __rcu  *egress_cl_list;
 #endif
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index ab7b0b6..0d00b9c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -503,6 +503,92 @@ static ssize_t phys_switch_id_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(phys_switch_id);
 
+#ifdef CONFIG_XPS_FLOWS
+static void xps_dev_flow_table_release(struct rcu_head *rcu)
+{
+	struct xps_dev_flow_table *table = container_of(rcu,
+	    struct xps_dev_flow_table, rcu);
+	vfree(table);
+}
+
+static int change_xps_dev_flow_table_cnt(struct net_device *dev,
+					 unsigned long count)
+{
+	unsigned long mask;
+	struct xps_dev_flow_table *table, *old_table;
+	static DEFINE_SPINLOCK(xps_dev_flow_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (count) {
+		mask = count - 1;
+		/* mask = roundup_pow_of_two(count) - 1;
+		 * without overflows...
+		 */
+		while ((mask | (mask >> 1)) != mask)
+			mask |= (mask >> 1);
+		/* On 64 bit arches, must check mask fits in table->mask (u32),
+		 * and on 32bit arches, must check
+		 * XPS_DEV_FLOW_TABLE_SIZE(mask + 1) doesn't overflow.
+		 */
+#if BITS_PER_LONG > 32
+		if (mask > (unsigned long)(u32)mask)
+			return -EINVAL;
+#else
+		if (mask > (ULONG_MAX - XPS_DEV_FLOW_TABLE_SIZE(1))
+				/ sizeof(struct xps_dev_flow)) {
+			/* Enforce a limit to prevent overflow */
+			return -EINVAL;
+		}
+#endif
+		table = vmalloc(XPS_DEV_FLOW_TABLE_SIZE(mask + 1));
+		if (!table)
+			return -ENOMEM;
+
+		table->mask = mask;
+		for (count = 0; count <= mask; count++)
+			table->flows[count].queue_index = -1;
+	} else
+		table = NULL;
+
+	spin_lock(&xps_dev_flow_lock);
+	old_table = rcu_dereference_protected(dev->xps_flow_table,
+					      lockdep_is_held(&xps_dev_flow_lock));
+	rcu_assign_pointer(dev->xps_flow_table, table);
+	spin_unlock(&xps_dev_flow_lock);
+
+	if (old_table)
+		call_rcu(&old_table->rcu, xps_dev_flow_table_release);
+
+	return 0;
+}
+
+static ssize_t xps_dev_flow_table_cnt_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_xps_dev_flow_table_cnt);
+}
+
+static ssize_t xps_dev_flow_table_cnt_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct xps_dev_flow_table *table;
+	unsigned int cnt = 0;
+
+	rcu_read_lock();
+	table = rcu_dereference(netdev->xps_flow_table);
+	if (table)
+		cnt = table->mask + 1;
+	rcu_read_unlock();
+
+	return sprintf(buf, fmt_dec, cnt);
+}
+DEVICE_ATTR_RW(xps_dev_flow_table_cnt);
+#endif /* CONFIG_XPS_FLOWS */
+
 static struct attribute *net_class_attrs[] = {
 	&dev_attr_netdev_group.attr,
 	&dev_attr_type.attr,
@@ -531,6 +617,9 @@ static struct attribute *net_class_attrs[] = {
 	&dev_attr_phys_port_name.attr,
 	&dev_attr_phys_switch_id.attr,
 	&dev_attr_proto_down.attr,
+#ifdef CONFIG_XPS_FLOWS
+	&dev_attr_xps_dev_flow_table_cnt.attr,
+#endif
 	NULL,
 };
 ATTRIBUTE_GROUPS(net_class);
-- 
2.9.3

* [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
                   ` (2 preceding siblings ...)
  2016-09-29  3:54 ` [PATCH v2 net-next 3/5] net: Add xps_dev_flow_table_cnt Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  2016-09-29  4:54   ` Eric Dumazet
  2016-09-29  3:54 ` [PATCH v2 net-next 5/5] xps: Documentation for transmit socketless flow steering Tom Herbert
  4 siblings, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

xps_flows maintains a per device flow table that is indexed by the
skbuff hash. The table is only consulted when there is no queue saved in
a transmit socket for an skbuff.

Each entry in the flow table contains a queue index and a queue
pointer. The queue pointer is set when a queue is chosen using a
flow table entry. This pointer is set to the head pointer in the
transmit queue (which is maintained by BQL).

The new function get_xps_flows_index looks up flows in the xps_flows
table. The entry returned gives the last queue a matching flow
used. The returned queue is compared against the normal XPS queue. If
they are different, then we only switch if the tail pointer in the TX
queue has advanced past the pointer saved in the entry. In this
way OOO should be avoided when XPS wants to use a different queue.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 net/Kconfig    |  6 ++++
 net/core/dev.c | 87 +++++++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..f77fad1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -255,6 +255,12 @@ config XPS
 	depends on SMP
 	default y
 
+config XPS_FLOWS
+	bool
+	depends on XPS
+	depends on BQL
+	default y
+
 config HWBM
        bool
 
diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f..1ca08b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
+/* Must be called with RCU read_lock */
 static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
@@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
-	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_maps);
 	if (dev_maps) {
 		map = rcu_dereference(
@@ -3228,15 +3228,62 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			else
 				queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
 									   map->len)];
-			if (unlikely(queue_index >= dev->real_num_tx_queues))
-				queue_index = -1;
+			if (queue_index >= 0 &&
+			    likely(queue_index < dev->real_num_tx_queues))
+				return queue_index;
 		}
 	}
-	rcu_read_unlock();
+#endif
+	return skb_tx_hash(dev, skb);
+}
+
+/* Must be called with RCU read_lock */
+static int get_xps_flows_index(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS_FLOWS
+	struct xps_dev_flow_table *flow_table;
+	struct xps_dev_flow ent;
+	int queue_index;
+	struct netdev_queue *txq;
+	u32 hash;
+
+	queue_index = get_xps_queue(dev, skb);
+	if (queue_index < 0)
+		return -1;
+
+	flow_table = rcu_dereference(dev->xps_flow_table);
+	if (!flow_table)
+		return queue_index;
+
+	hash = skb_get_hash(skb);
+	if (!hash)
+		return queue_index;
+
+	ent.v64 = flow_table->flows[hash & flow_table->mask].v64;
+
+	if (queue_index != ent.queue_index &&
+	    ent.queue_index >= 0 &&
+	    ent.queue_index < dev->real_num_tx_queues) {
+		txq = netdev_get_tx_queue(dev, ent.queue_index);
+		if ((int)(txq->dql.num_completed_ops - ent.queue_ptr) < 0)  {
+			/* The current queue's tail has not advanced beyond the
+			 * last packet that was enqueued using the table entry.
+			 * We can't change queues without risking OOO. Stick
+			 * with the queue listed in the flow table.
+			 */
+			queue_index = ent.queue_index;
+		}
+	}
+
+	/* Save the updated entry */
+	txq = netdev_get_tx_queue(dev, queue_index);
+	ent.queue_index = queue_index;
+	ent.queue_ptr = txq->dql.num_enqueue_ops;
+	flow_table->flows[hash & flow_table->mask].v64 = ent.v64;
 
 	return queue_index;
 #else
-	return -1;
+	return get_xps_queue(dev, skb);
 #endif
 }
 
@@ -3244,22 +3291,24 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 	int queue_index = sk_tx_queue_get(sk);
-
-	if (queue_index < 0 || skb->ooo_okay ||
-	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, skb);
-		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
-
-		if (queue_index != new_index && sk &&
-		    sk_fullsock(sk) &&
-		    rcu_access_pointer(sk->sk_dst_cache))
-			sk_tx_queue_set(sk, new_index);
-
-		queue_index = new_index;
+	int new_index;
+
+	if (queue_index < 0) {
+		/* Socket did not provide a queue index, try xps_flows */
+		new_index = get_xps_flows_index(dev, skb);
+	} else if (skb->ooo_okay || queue_index >= dev->real_num_tx_queues) {
+		/* Queue index in socket, see if we can find a better one */
+		new_index = get_xps_queue(dev, skb);
+	} else {
+		/* Valid queue in socket and can't send OOO. Just return it */
+		return queue_index;
 	}
 
-	return queue_index;
+	if (queue_index != new_index && sk && sk_fullsock(sk) &&
+	    rcu_access_pointer(sk->sk_dst_cache))
+		sk_tx_queue_set(sk, new_index);
+
+	return new_index;
 }
 
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
-- 
2.9.3

* [PATCH v2 net-next 5/5] xps: Documentation for transmit socketless flow steering
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
                   ` (3 preceding siblings ...)
  2016-09-29  3:54 ` [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  4 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 Documentation/ABI/testing/sysfs-class-net |  8 ++++++++
 Documentation/networking/scaling.txt      | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index 668604f..0d2a6a9 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -251,3 +251,11 @@ Contact:	netdev@vger.kernel.org
 Description:
 		Indicates the unique physical switch identifier of a switch this
 		port belongs to, as a string.
+
+What:		/sys/class/net/<iface>/xps_dev_flow_table_cnt
+Date:		October 2016
+KernelVersion:	4.8
+Contact:	netdev@vger.kernel.org
+Description:
+		Indicates the number of entries in the XPS socketless flow
+		table.
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 59f4db2..7f9590c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,6 +400,26 @@ transport layer is responsible for setting ooo_okay appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+XPS: Transmit Socketless Flow Steering
+======================================
+
+Not all flows have an associated socket in which the transmit queue and
+ooo information can be saved. For instance, packets being transmitted over
+OVS for VMs do not have a socket that is visible to the kernel. To allow
+XPS to be effectively used a flow hash table is employed to save queue and
+ooo information for such socketless flows.
+
+The idea is that transmit queues are saved for socketless flows in a
+flow table that is indexed by skbuff hash. The flow table entries have
+two values: the queue_index and the head cnt of packets from the TX
+queue. When a packet is being sent without an associated socket the
+hash table is consulted which returns the queue_index to use. The
+returned queue is compared to the queue that would be selected by XPS.
+If they are different, the XPS queue is selected only if the tail cnt
+in the TX queue advances beyond the recorded head cnt. This checks if
+sending the packet on the new queue could cause ooo packets for the
+flow.
+
 ==== XPS Configuration
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
@@ -409,6 +429,12 @@ queue is configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+Transmit socketless flow steering is enabled by setting the number of
+entries in the XPS flow table. This is configured in the sysfs file
+entry:
+
+/sys/class/net/<dev>/xps_dev_flow_table_cnt
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
-- 
2.9.3

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29  3:54 ` [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket Tom Herbert
@ 2016-09-29  4:54   ` Eric Dumazet
  2016-09-29 12:53     ` Tom Herbert
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-29  4:54 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, kernel-team, rick.jones2, alexander.duyck

On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> xps_flows maintains a per device flow table that is indexed by the
> skbuff hash. The table is only consulted when there is no queue saved in
> a transmit socket for an skbuff.
> 
> Each entry in the flow table contains a queue index and a queue
> pointer. The queue pointer is set when a queue is chosen using a
> flow table entry. This pointer is set to the head pointer in the
> transmit queue (which is maintained by BQL).
> 
> The new function get_xps_flows_index that looks up flows in the
> xps_flows table. The entry returned gives the last queue a matching flow
> used. The returned queue is compared against the normal XPS queue. If
> they are different, then we only switch if the tail pointer in the TX
> queue has advanced past the pointer saved in the entry. In this
> way OOO should be avoided when XPS wants to use a different queue.

There is something I do not understand with this.

If this OOO avoidance is tied to BQL, it means that packets sitting in a
qdisc won't be part of the detection.

So packets of flow X could possibly be queued on multiple qdiscs.

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29  4:54   ` Eric Dumazet
@ 2016-09-29 12:53     ` Tom Herbert
  2016-09-29 13:18       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-09-29 12:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
>> xps_flows maintains a per device flow table that is indexed by the
>> skbuff hash. The table is only consulted when there is no queue saved in
>> a transmit socket for an skbuff.
>>
>> Each entry in the flow table contains a queue index and a queue
>> pointer. The queue pointer is set when a queue is chosen using a
>> flow table entry. This pointer is set to the head pointer in the
>> transmit queue (which is maintained by BQL).
>>
>> The new function get_xps_flows_index that looks up flows in the
>> xps_flows table. The entry returned gives the last queue a matching flow
>> used. The returned queue is compared against the normal XPS queue. If
>> they are different, then we only switch if the tail pointer in the TX
>> queue has advanced past the pointer saved in the entry. In this
>> way OOO should be avoided when XPS wants to use a different queue.
>
> There is something I do not understand with this.
>
> If this OOO avoidance is tied to BQL, it means that packets sitting in a
> qdisc wont be part of the detection.
>
> So packets of flow X could possibly be queued on multiple qdiscs.
>
Hi Eric,

I'm not sure I understand your concern. If packets for flow X can be
queued on multiple qdiscs wouldn't that mean that flow will be subject
to ooo transmission regardless of this patch? In other words here
we're trying to keep packets for the flow in order as they are emitted
from the qdiscs which we assume maintains ordering of packets in a
flow.

Tom

>
>

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 12:53     ` Tom Herbert
@ 2016-09-29 13:18       ` Eric Dumazet
  2016-09-29 14:08         ` Tom Herbert
  2016-09-29 16:35         ` Rick Jones
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2016-09-29 13:18 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote:
> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> >> xps_flows maintains a per device flow table that is indexed by the
> >> skbuff hash. The table is only consulted when there is no queue saved in
> >> a transmit socket for an skbuff.
> >>
> >> Each entry in the flow table contains a queue index and a queue
> >> pointer. The queue pointer is set when a queue is chosen using a
> >> flow table entry. This pointer is set to the head pointer in the
> >> transmit queue (which is maintained by BQL).
> >>
> >> The new function get_xps_flows_index that looks up flows in the
> >> xps_flows table. The entry returned gives the last queue a matching flow
> >> used. The returned queue is compared against the normal XPS queue. If
> >> they are different, then we only switch if the tail pointer in the TX
> >> queue has advanced past the pointer saved in the entry. In this
> >> way OOO should be avoided when XPS wants to use a different queue.
> >
> > There is something I do not understand with this.
> >
> > If this OOO avoidance is tied to BQL, it means that packets sitting in a
> > qdisc wont be part of the detection.
> >
> > So packets of flow X could possibly be queued on multiple qdiscs.
> >
> Hi Eric,
> 
> I'm not sure I understand your concern. If packets for flow X can be
> queued on multiple qdiscs wouldn't that mean that flow will be subject
> to ooo transmission regardless of this patch? In other words here
> we're trying to keep packets for the flow in order as they are emitted
> from the qdiscs which we assume maintains ordering of packets in a
> flow.

Well, then what is this patch series solving?

You have a producer of packets running on 8 vcpus in a VM.

Packets are exiting the VM and need to be queued on a mq NIC in the
hypervisor.

Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
selecting different TXQ.

Your patch aims to detect this and steer packets to one TXQ, but not
using a static hash() % num_real_queues (as RSS would have done on a NIC
at RX), but trying to please XPS (as : selecting a queue close to the
CPU producing the packet. VM with one vcpu, and cpu bounded, would
really be happy to not spread packets all over the queues)

Your mechanism relies on counters updated at the time packets are given
to the NIC, not at the time packets are given to the txq (and eventually
sit for a while in the qdisc right before BQL layer)

dev_queue_xmit() is not talking to the device directly...


All I am saying is that the producer/consumer counters you added are not
at the right place.

You tried hard to not change the drivers, by adding something to
existing BQL. But packets can sit in a qdisc for a while...

So you added 2 potential cache lines misses per outgoing packet, but
with no guarantee that OOO are really avoided.

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 13:18       ` Eric Dumazet
@ 2016-09-29 14:08         ` Tom Herbert
  2016-09-29 14:51           ` Eric Dumazet
  2016-09-29 16:35         ` Rick Jones
  1 sibling, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-09-29 14:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, Sep 29, 2016 at 9:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote:
>> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
>> >> xps_flows maintains a per device flow table that is indexed by the
>> >> skbuff hash. The table is only consulted when there is no queue saved in
>> >> a transmit socket for an skbuff.
>> >>
>> >> Each entry in the flow table contains a queue index and a queue
>> >> pointer. The queue pointer is set when a queue is chosen using a
>> >> flow table entry. This pointer is set to the head pointer in the
>> >> transmit queue (which is maintained by BQL).
>> >>
>> >> The new function get_xps_flows_index that looks up flows in the
>> >> xps_flows table. The entry returned gives the last queue a matching flow
>> >> used. The returned queue is compared against the normal XPS queue. If
>> >> they are different, then we only switch if the tail pointer in the TX
>> >> queue has advanced past the pointer saved in the entry. In this
>> >> way OOO should be avoided when XPS wants to use a different queue.
>> >
>> > There is something I do not understand with this.
>> >
>> > If this OOO avoidance is tied to BQL, it means that packets sitting in a
>> > qdisc wont be part of the detection.
>> >
>> > So packets of flow X could possibly be queued on multiple qdiscs.
>> >
>> Hi Eric,
>>
>> I'm not sure I understand your concern. If packets for flow X can be
>> queued on multiple qdiscs wouldn't that mean that flow will be subject
>> to ooo transmission regardless of this patch? In other words here
>> we're trying to keep packets for the flow in order as they are emitted
>> from the qdiscs which we assume maintains ordering of packets in a
>> flow.
>
> Well, then what this patch series is solving ?
>
It addresses  the issue that Rick Jones pointed out was happening with
XPS. When packets are sent for a flow that has no socket and XPS is
enabled then each packet uses the XPS queue based on the running CPU.
Since the thread sending on a flow can be rescheduled on different
CPUs this is creating ooo packets. In this case the ooo is being
caused by interaction with XPS.

> You have a producer of packets running on 8 vcpus in a VM.
>
> Packets are exiting the VM and need to be queued on a mq NIC in the
> hypervisor.
>
> Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
> selecting different TXQ.
>
> Your patch aims to detect this and steer packets to one TXQ, but not
> using a static hash() % num_real_queues (as RSS would have done on a NIC
> at RX), but trying to please XPS (as : selecting a queue close to the
> CPU producing the packet. VM with one vcpu, and cpu bounded, would
> really be happy to not spread packets all over the queues)
>
You're not describing this patch, you're describing how XPS works in
the first place. This patch is done to make socketless packets work
well with XPS. If all you want is a static hash() % num_real_queues
then just disable XPS to get that.

> Your mechanism relies on counters updated at the time packets are given
> to the NIC, not at the time packets are given to the txq (and eventually
> sit for a while in the qdisc right before BQL layer)
>
> dev_queue_xmit() is not talking to the device directly...
>
>
> All I am saying is that the producer/consumer counters you added are not
> at the right place.
>
> You tried hard to not change the drivers, by adding something to
> existing BQL. But packets can sit in a qdisc for a while...
>
But again, this patch is to prevent ooo being caused by an interaction
with XPS. It does not address ooo being caused by qdiscs or VMs, but
then I don't see that as being the issue reported by Rick.

> So you added 2 potential cache lines misses per outgoing packet, but
> with no guarantee that OOO are really avoided.
>
>
>

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 14:08         ` Tom Herbert
@ 2016-09-29 14:51           ` Eric Dumazet
  2016-09-29 15:15             ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-29 14:51 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:

> It addresses  the issue that Rick Jones pointed out was happening with
> XPS. When packets are sent for a flow that has no socket and XPS is
> enabled then each packet uses the XPS queue based on the running CPU.
> Since the thread sending on a flow can be rescheduled on different
> CPUs this is creating ooo packets. In this case the ooo is being
> caused by interaction with XPS.
> 

Nope, your patch does not address the problem properly.

I am not sure I want to spend more time explaining the issue.

Lets talk about this in Tokyo next week.

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 14:51           ` Eric Dumazet
@ 2016-09-29 15:15             ` Eric Dumazet
  2016-09-29 20:26               ` Tom Herbert
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-29 15:15 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote:
> On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:
> 
> > It addresses  the issue that Rick Jones pointed out was happening with
> > XPS. When packets are sent for a flow that has no socket and XPS is
> > enabled then each packet uses the XPS queue based on the running CPU.
> > Since the thread sending on a flow can be rescheduled on different
> > CPUs this is creating ooo packets. In this case the ooo is being
> > caused by interaction with XPS.
> > 
> 
> Nope, your patch does not address the problem properly.
> 
> I am not sure I want to spend more time explaining the issue.
> 
> Lets talk about this in Tokyo next week.
> 

Just as a reminder, sorry to bother you, stating some obvious facts for
both of us. We have public exchanges, so we also need to re-explain how
things work.

Queue selection on xmit happens before we hit the qdisc and its delays.

So when you access txq->dql.num_completed_ops and
txq->dql.num_enqueue_ops you can observe values that do not change for a
while.

Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow
(skb_get_hash() returns the same value for these 2 packets)

P1 is sent on behalf of CPU 1, we pick up queue txq1, and queue the
packet on its qdisc. Transmit does not happen because of some
constraint such as rate limiting or scheduling.

P2 is sent on behalf of CPU 2, we pick up queue txq2 and notice that the
prior packet chose txq1. We check txq1->dql and decide it is fine to use
txq2, since the dql params of txq1 were not changed yet.

( txq->dql.num_completed_ops == ent.queue_ptr )

Note that in RFS case, we have the guarantee that we observe 'live
queues' since they are the per cpu backlog.

So input_queue_head_incr() and input_queue_tail_incr_save() are
correctly doing the OOO prevention, because a queued packet immediately
changes the state.

So really your patch works if you have no qdisc, or a non-congested
qdisc. (Think if P1 is dropped by a full pfifo or pfifo_fast: We really
want to avoid steering P2, P3, ..., PN on this full pfifo while maybe
other txq are idle). Strange attractors are back (check commit
9b462d02d6dd6 )

You could avoid (ab)using BQL with a different method, grabbing
skb->destructor for the packets that are socketless.

The hash table would simply track the sum of skb->truesize to allow flow
migration. This would be self-contained and not intrusive.
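
Roughly something like the sketch below (all names are purely
illustrative, not existing kernel API; just to show the idea):

  /* Charge the flow entry when a socketless skb picks a queue, and
   * credit it back from the skb destructor once the skb is freed.
   * The flow may migrate to another txq only while pending_truesize
   * is zero.
   */
  static void xps_flow_skb_destructor(struct sk_buff *skb)
  {
          struct xps_dev_flow *ent = xps_flow_lookup(skb->dev, skb->hash);

          if (ent)
                  atomic_sub(skb->truesize, &ent->pending_truesize);
  }

  static void xps_flow_charge(struct xps_dev_flow *ent, struct sk_buff *skb)
  {
          atomic_add(skb->truesize, &ent->pending_truesize);
          skb->destructor = xps_flow_skb_destructor;
  }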

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 13:18       ` Eric Dumazet
  2016-09-29 14:08         ` Tom Herbert
@ 2016-09-29 16:35         ` Rick Jones
  1 sibling, 0 replies; 14+ messages in thread
From: Rick Jones @ 2016-09-29 16:35 UTC (permalink / raw)
  To: Eric Dumazet, Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Alexander Duyck

On 09/29/2016 06:18 AM, Eric Dumazet wrote:
> Well, then what this patch series is solving ?
>
> You have a producer of packets running on 8 vcpus in a VM.
>
> Packets are exiting the VM and need to be queued on a mq NIC in the
> hypervisor.
>
> Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
> selecting different TXQ.

Just for completeness, in my testing, the VMs were single-vCPU.

rick jones

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 15:15             ` Eric Dumazet
@ 2016-09-29 20:26               ` Tom Herbert
  0 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, Sep 29, 2016 at 11:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote:
>> On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:
>>
>> > It addresses  the issue that Rick Jones pointed out was happening with
>> > XPS. When packets are sent for a flow that has no socket and XPS is
>> > enabled then each packet uses the XPS queue based on the running CPU.
>> > Since the thread sending on a flow can be rescheduled on different
>> > CPUs this is creating ooo packets. In this case the ooo is being
>> > caused by interaction with XPS.
>> >
>>
>> Nope, your patch does not address the problem properly.
>>
>> I am not sure I want to spend more time explaining the issue.
>>
>> Lets talk about this in Tokyo next week.
>>
>
> Just as a reminder, sorry to bother you, stating some obvious facts for
> both of us. We have public exchanges, so we also need to re-explain how
> things work.
>
> Queue selection on xmit happens before we hit the qdisc and its delays.
>
> So when you access txq->dql.num_completed_ops and
> txq->dql.num_enqueue_ops you can observe values that do not change for a
> while.
>
> Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow
> (skb_get_hash() returns the same value for these 2 packets)
>
> P1 is sent on behalf of CPU 1, we pickup queue txq1, and queue the
> packet on its qdisc . Transmit does not happen because of some
> constraints like rate limiting or scheduling constraints.
>
> P2 is sent on behalf of CPU 2, we pickup queue txq2, notice that prior
> packet chose txq1. We check txq1->dql and decide it is fine to use txq2,
> since the dql params of txq1 were not changed yet.
>
> ( txq->dql.num_completed_ops == ent.queue_ptr )
>
> Note that in RFS case, we have the guarantee that we observe 'live
> queues' since they are the per cpu backlog.
>
> So input_queue_head_incr() and input_queue_tail_incr_save() are
> correctly doing the OOO prevention, because a queued packet immediately
> changes the state.
>
> So really your patch works if you have no qdisc, or a non congested
> qdisc. (Think if P1 is dropped by a full pfifo or pfifo_fast : We really
> want to avoid steering P2, P3, ..., PN on this full pfifo while maybe
> other txq are idle). Strange attractors are back (check commit
> 9b462d02d6dd6 )
>
Understood.

> You could avoid (ab)using BQL with a different method, grabbing
> skb->destructor for the packets that are socketless
>
> The hash table would simply track the sum of skb->truesize to allow flow
> migration. This would be self contained and not intrusive.
>
Okay, will look at that.

>
>
>
