From: Kirill Tkhai <ktkhai@virtuozzo.com>
To: peterz@infradead.org, davem@davemloft.net, daniel@iogearbox.net,
	edumazet@google.com, tom@quantonium.net, ktkhai@virtuozzo.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC] net;sched: Try to find idle cpu for RPS to handle packets
Date: Wed, 19 Sep 2018 15:28:54 +0300	[thread overview]
Message-ID: <153736009982.24033.13696245431713246950.stgit@localhost.localdomain> (raw)

Many workloads operate in a polling mode: the application
checks for incoming packets from time to time, but it also
has work to do when there are no packets. This RFC
develops the idea of queueing RPS packets on an idle
CPU in the L3 domain of the consumer, so that backlog
processing of the packets and the application can execute
in parallel.

We need this when the network card does not have
enough RX queues to cover all online CPUs (which seems
to be the case for most cards), get_rps_cpu() actually chooses
a remote cpu, and an SMP interrupt is sent. Here we may try
our best to find an idle CPU near the consumer's CPU.
Note that when the consumer works in poll mode and does
not wait for incoming packets, its CPU will not be idle,
while the CPU of a sleeping consumer may be idle. So
non-polling consumers will still have their skbs
handled on their own CPU.
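
[Not part of the patch, just an illustration of what "idle CPU
 near the consumer's CPU" means here: a CPU that shares the last
 level cache with the consumer and is currently idle. A minimal
 sketch using existing kernel helpers (the helper name is made up;
 the patch itself reuses select_idle_sibling() instead):

	#include <linux/cpumask.h>
	#include <linux/sched.h>
	#include <linux/sched/topology.h>

	/* Hypothetical helper: pick an idle CPU sharing the LLC with
	 * @target, falling back to @target when none is idle.
	 */
	static int pick_idle_llc_cpu(int target)
	{
		int cpu;

		if (available_idle_cpu(target))
			return target;

		for_each_online_cpu(cpu) {
			if (cpus_share_cache(cpu, target) &&
			    available_idle_cpu(cpu))
				return cpu;
		}
		return target;
	}
]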

When the network card has many queues, the device
interrupts arrive on the consumer's CPU, and this patch
won't try to find an idle cpu for them.

I've tried a simple netperf test for this:
netserver -p 1234
netperf -L 127.0.0.1 -p 1234 -l 100

Before (columns are netperf's recv socket size, send socket size
and message size in bytes, elapsed time in seconds, and throughput
in 10^6 bits/sec):
 87380  16384  16384    100.00   60323.56
 87380  16384  16384    100.00   60388.46
 87380  16384  16384    100.00   60217.68
 87380  16384  16384    100.00   57995.41
 87380  16384  16384    100.00   60659.00

After:
 87380  16384  16384    100.00   64569.09
 87380  16384  16384    100.00   64569.25
 87380  16384  16384    100.00   64691.63
 87380  16384  16384    100.00   64930.14
 87380  16384  16384    100.00   62670.15

The difference between the best runs is +7%,
and between the worst runs it is +8%.

What do you think about moving forward in this direction?

[This also requires a pre-patch, which exports
 select_idle_sibling() and teaches it to handle
 a NULL task argument, but since it's not very
 interesting to look at, I've skipped sending it].
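
[Purely as an assumption about its shape, not the real pre-patch:
 it would presumably make select_idle_sibling() visible outside
 kernel/sched/fair.c and let it tolerate a NULL task by skipping
 the task-affinity heuristics, roughly:

	/* include/linux/sched.h (assumed placement):
	 * @p may be NULL when there is no task context, as in the
	 * RPS caller in net/core/dev.c below.
	 */
	extern int select_idle_sibling(struct task_struct *p,
				       int prev, int target);

	/* kernel/sched/fair.c: the function loses "static" and is
	 * exported next to its definition.
	 */
	EXPORT_SYMBOL_GPL(select_idle_sibling);
]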

Kirill
---
 net/core/dev.c |   34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 559a91271f82..9a867ff34622 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3738,13 +3738,12 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		       struct rps_dev_flow **rflowp)
 {
-	const struct rps_sock_flow_table *sock_flow_table;
+	struct rps_sock_flow_table *sock_flow_table;
 	struct netdev_rx_queue *rxqueue = dev->_rx;
 	struct rps_dev_flow_table *flow_table;
 	struct rps_map *map;
+	u32 tcpu, hash, val;
 	int cpu = -1;
-	u32 tcpu;
-	u32 hash;
 
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
@@ -3774,6 +3773,9 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	sock_flow_table = rcu_dereference(rps_sock_flow_table);
 	if (flow_table && sock_flow_table) {
 		struct rps_dev_flow *rflow;
+		bool want_new_cpu = false;
+		unsigned long flags;
+		unsigned int qhead;
 		u32 next_cpu;
 		u32 ident;
 
@@ -3801,12 +3803,26 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		 *     This guarantees that all previous packets for the flow
 		 *     have been dequeued, thus preserving in order delivery.
 		 */
-		if (unlikely(tcpu != next_cpu) &&
-		    (tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
-		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
-		      rflow->last_qtail)) >= 0)) {
-			tcpu = next_cpu;
-			rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
+		if (tcpu != next_cpu) {
+			qhead = per_cpu(softnet_data, tcpu).input_queue_head;
+			if (tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
+			    (int)(qhead - rflow->last_qtail) >= 0)
+				want_new_cpu = true;
+		} else if (tcpu < nr_cpu_ids && cpu_online(tcpu) &&
+			   tcpu != smp_processor_id() && !available_idle_cpu(tcpu)) {
+			want_new_cpu = true;
+		}
+
+		if (want_new_cpu) {
+			local_irq_save(flags);
+			next_cpu = select_idle_sibling(NULL, next_cpu, next_cpu);
+			local_irq_restore(flags);
+			if (tcpu != next_cpu) {
+				tcpu = next_cpu;
+				rflow = set_rps_cpu(dev, skb, rflow, tcpu);
+				val = (hash & ~rps_cpu_mask) | tcpu;
+				sock_flow_table->ents[hash & sock_flow_table->mask] = val;
+			}
 		}
 
 		if (tcpu < nr_cpu_ids && cpu_online(tcpu)) {

