* Poor TCP performance with XPS enabled after scrubbing skb
@ 2018-05-15 19:31 Flavio Leitner
  2018-05-15 21:08 ` Eric Dumazet
  0 siblings, 1 reply; 4+ messages in thread
From: Flavio Leitner @ 2018-05-15 19:31 UTC (permalink / raw)
  To: netdev; +Cc: Paolo Abeni

Hi,

There is a significant throughput issue (~50% drop) for a single TCP
stream when the skb is scrubbed and XPS is enabled.

If I turn CONFIG_XPS off, then the issue never happens and the test
reaches line rate.  The same happens if I echo 0 to tx-*/xps_cpus.
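
(For reference, a minimal sketch of what the xps_disable.sh / xps_restore.sh
scripts used in the runs below could look like -- the device name eth0 and
the temp file path are just assumptions:)

  # xps_disable.sh sketch: save the current masks, then clear them
  # (an empty mask means XPS is off for that queue)
  DEV=${1:-eth0}
  : > /tmp/xps_masks.$DEV
  for q in /sys/class/net/$DEV/queues/tx-*; do
          echo "$(basename $q) $(cat $q/xps_cpus)" >> /tmp/xps_masks.$DEV
          echo 0 > $q/xps_cpus
  done

  # xps_restore.sh sketch: write the saved masks back
  DEV=${1:-eth0}
  while read q mask; do
          echo $mask > /sys/class/net/$DEV/queues/$q/xps_cpus
  done < /tmp/xps_masks.$DEV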

It looks like when the skb is scrubbed there is no longer a reference
to the struct sock, which forces XPS to use a TX queue mapped to the
running CPU. However, since there is no mapping between RX queues and
TX queues, the returning traffic usually ends up on another CPU. That
other CPU processes the skb, and if the stack needs to send something,
we end up with two TX queues being used in parallel for the same stream,
which TCP does not like (out-of-order segments, dup ACKs, retransmissions...).
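
(One way to see the two TX queues being used in parallel is to watch the
per-queue counters while the test runs -- a sketch, assuming the NIC is
eth0 with the default mq root qdisc; per-queue counter names under
ethtool -S vary by driver:)

  tc -s qdisc show dev eth0    # each per-queue child qdisc reports its own "Sent ... pkt"
  ethtool -S eth0 | grep -i tx # per-queue TX counters, if the driver exposes them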

The test environment is quite simple. The iperf/iperf3 -s side can be
just a NIC with an IP address.  The peer running iperf/iperf3 -c needs
to go through a veth (to scrub the packet), so create a veth pair, attach
one end to a Linux bridge together with the NIC, and add the IP address
to the other end:
      Bridge
NIC ---/  \--- veth0 ---- veth1 [ IP address ]
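
(A sketch of that sender-side setup, with example interface names and the
addresses taken from the test runs below:)

  ip link add veth0 type veth peer name veth1
  ip link add br0 type bridge
  ip link set eth0 master br0             # eth0 = the physical NIC (example name)
  ip link set veth0 master br0
  ip link set br0 up
  ip link set eth0 up
  ip link set veth0 up
  ip link set veth1 up
  ip addr add 192.168.1.1/24 dev veth1    # iperf -c source address
  # the iperf/iperf3 -s peer answers on 192.168.1.2 behind the NIC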

Paolo and I discussed the issue and we came up with a patch[1] that
supports the explanation above. It may not be the best way to fix the
problem though, so for now consider it just as an experiment :-)

Kernel: net-next, updated with today's
commit f3002c1374fb2367c9d8dbb28852791ef90d2bac
Date:   Mon May 14 08:14:49 2018 -0400


Default config (CONFIG_XPS on)
# iperf -c 192.168.1.2 -t 30
------------------------------------------------------------
Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.1 port 40332 connected with 192.168.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  16.8 GBytes  4.80 Gbits/sec


# ./xps_disable.sh; iperf -c 192.168.1.2 -t 30
------------------------------------------------------------
Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.1 port 40334 connected with 192.168.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  32.2 GBytes  9.21 Gbits/sec


[root@dell-r430-23 ~]# ./xps_restore.sh; iperf -c 192.168.1.2 -t 30
------------------------------------------------------------
Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.1 port 40336 connected with 192.168.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  16.0 GBytes  4.59 Gbits/sec


Experimental patch applied and XPS functioning:

# iperf -c 192.168.1.2 -t 30
------------------------------------------------------------
Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.1 port 34202 connected with 192.168.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  32.2 GBytes  9.21 Gbits/sec


Sometimes the return traffic ends up on the same CPU that is running
iperf -c.  When that happens, the same TX queue is used and I see line rate.
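
(To exercise that case deliberately -- this is a sketch, not part of the
original runs -- one can steer the NIC's RX interrupt and the iperf client
to the same CPU; the IRQ number 42 below is just an example:)

  grep eth0 /proc/interrupts               # find the RX queue IRQ number(s)
  echo 4 > /proc/irq/42/smp_affinity       # mask 0x4 == CPU 2 (example IRQ)
  taskset -c 2 iperf -c 192.168.1.2 -t 30  # run the client on that same CPU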

The issue always happens with MLX and be2net NICs, but so far I have been
unable to reproduce it with i40e, even though I could see two TX queues
being used in parallel as in the other cases.

[1]
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 71c72a9..482d046 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -31,9 +31,10 @@
 
 /*		0 - Reserved to indicate value not set
  *     1..NR_CPUS - Reserved for sender_cpu
- *  NR_CPUS+1..~0 - Region available for NAPI IDs
+ *      NR_CPUS+1 - Scrubbed packet, do not use XPS
+ *  NR_CPUS+2..~0 - Region available for NAPI IDs
  */
-#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
+#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 2))
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 
diff --git a/net/core/dev.c b/net/core/dev.c
index af0558b..5567d4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3398,6 +3398,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
+	if (skb->sender_cpu ==  (u32)(NR_CPUS + 1))
+		return -1;
+
 	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_maps);
 	if (dev_maps) {
@@ -3459,7 +3462,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 #ifdef CONFIG_XPS
 	u32 sender_cpu = skb->sender_cpu - 1;
 
-	if (sender_cpu >= (u32)NR_CPUS)
+	if (sender_cpu >= (u32)NR_CPUS + 1)
 		skb->sender_cpu = raw_smp_processor_id() + 1;
 #endif
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b518..99040a0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4898,6 +4898,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 	ipvs_reset(skb);
 	skb_orphan(skb);
 	skb->mark = 0;
+	skb->sender_cpu = (u32)(NR_CPUS + 1);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 

-- 
Flavio


* Re: Poor TCP performance with XPS enabled after scrubbing skb
  2018-05-15 19:31 Poor TCP performance with XPS enabled after scrubbing skb Flavio Leitner
@ 2018-05-15 21:08 ` Eric Dumazet
  2018-05-24 19:17   ` Flavio Leitner
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Dumazet @ 2018-05-15 21:08 UTC (permalink / raw)
  To: Flavio Leitner, netdev; +Cc: Paolo Abeni



On 05/15/2018 12:31 PM, Flavio Leitner wrote:
> Hi,
> 
> There is a significant throughput issue (~50% drop) for a single TCP
> stream when the skb is scrubbed and XPS is enabled.
> 
> If I turn CONFIG_XPS off, then the issue never happens and the test
> reaches line rate.  The same happens if I echo 0 to tx-*/xps_cpus.
> 
> It looks like when the skb is scrubbed there is no longer a reference
> to the struct sock, 

And this is really the problem here, since it breaks back pressure (and TCP Small queues)

I am not sure why skb_orphan() is used in this scrubbing really.


* Re: Poor TCP performance with XPS enabled after scrubbing skb
  2018-05-15 21:08 ` Eric Dumazet
@ 2018-05-24 19:17   ` Flavio Leitner
  2018-05-25 20:29     ` David Miller
  0 siblings, 1 reply; 4+ messages in thread
From: Flavio Leitner @ 2018-05-24 19:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Paolo Abeni

On Tue, May 15, 2018 at 02:08:09PM -0700, Eric Dumazet wrote:
> 
> 
> On 05/15/2018 12:31 PM, Flavio Leitner wrote:
> > Hi,
> > 
> > There is a significant throughput issue (~50% drop) for a single TCP
> > stream when the skb is scrubbed and XPS is enabled.
> > 
> > If I turn CONFIG_XPS off, then the issue never happens and the test
> > reaches line rate.  The same happens if I echo 0 to tx-*/xps_cpus.
> > 
> > It looks like when the skb is scrubbed there is no longer a reference
> > to the struct sock, 
> 
> And this is really the problem here, since it breaks back pressure (and TCP Small queues)
> 
> I am not sure why skb_orphan() is used in this scrubbing really.
> 

veth originally called skb_orphan() on veth_xmit() most probably
because there was no TX completion. Then the code got generalized to
dev_forward_skb() and later on moved to skb_scrub_packet().

The issue is that we call skb_scrub_packet() on TX and RX paths and
that is done while crossing netns.  It doesn't look correct to keep
the ->sk because I suspect that iptables/selinux/bpf, or some code
path that I am probably missing could expose/use the wrong ->sk, for
example.

However, netdev_pick_tx() can't store the queue mapping without ->sk.

The hack in the first email relies on the headers (skb_tx_hash) to
always select the same TX queue, which solves the original problem
but not the TCP Small Queues issue you mentioned.

-- 
Flavio


* Re: Poor TCP performance with XPS enabled after scrubbing skb
  2018-05-24 19:17   ` Flavio Leitner
@ 2018-05-25 20:29     ` David Miller
  0 siblings, 0 replies; 4+ messages in thread
From: David Miller @ 2018-05-25 20:29 UTC (permalink / raw)
  To: fbl; +Cc: eric.dumazet, netdev, pabeni

From: Flavio Leitner <fbl@sysclose.org>
Date: Thu, 24 May 2018 16:17:29 -0300

> veth originally called skb_orphan() on veth_xmit() most probably
> because there was no TX completion. Then the code got generalized to
> dev_forward_skb() and later on moved to skb_scrub_packet().
> 
> The issue is that we call skb_scrub_packet() on TX and RX paths and
> that is done while crossing netns.  It doesn't look correct to keep
> the ->sk because I suspect that iptables/selinux/bpf, or some code
> path that I am probably missing could expose/use the wrong ->sk, for
> example.
> 
> However, netdev_pick_tx() can't store the queue mapping without ->sk.
> 
> The hack in the first email relies on the headers (skb_tx_hash) to
> always select the same TX queue, which solves the original problem
> but not the TCP Small Queues issue you mentioned.

Right, we can't allow a socket reference to escape over a netns
crossing.

However, that is where we get the queue mapping state.

We might need to put the sk-based decision into the skb somehow in
order to satisfy these two incompatible requirements.


