* High perf top ip_idents_reserve doing netperf UDP_STREAM
@ 2014-09-03 14:59 Jesper Dangaard Brouer
  2014-09-03 15:17 ` Eric Dumazet
  0 siblings, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-03 14:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: brouer, netdev, Daniel Borkmann, Hannes Frederic Sowa, Florian Westphal


Hi Eric,

When doing:
 super_netperf 120 -H 192.168.8.2 -t UDP_STREAM -l 100 -- -m 256

I'm seeing the function ip_idents_reserve() consuming most CPU.  Could
you help explain what is going on, and how I can avoid it?

Perf top:
  11.67%  [kernel]   [k] ip_idents_reserve
   8.37%  [kernel]   [k] fib_table_lookup
   4.46%  [kernel]   [k] _raw_spin_lock
   3.21%  [kernel]   [k] copy_user_enhanced_fast_string
   2.92%  [kernel]   [k] sock_alloc_send_pskb
   2.88%  [kernel]   [k] udp_sendmsg

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: High perf top ip_idents_reserve doing netperf UDP_STREAM
  2014-09-03 14:59 High perf top ip_idents_reserve doing netperf UDP_STREAM Jesper Dangaard Brouer
@ 2014-09-03 15:17 ` Eric Dumazet
  2016-11-16 12:16   ` Netperf UDP issue with connected sockets Jesper Dangaard Brouer
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2014-09-03 15:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Hannes Frederic Sowa, Florian Westphal

On Wed, 2014-09-03 at 16:59 +0200, Jesper Dangaard Brouer wrote:
> Hi Eric,
> 
> When doing:
>  super_netperf 120 -H 192.168.8.2 -t UDP_STREAM -l 100 -- -m 256
> 
> I'm seeing function ip_idents_reserve() consuming most CPU.  Could you
> help, explain what is going on, and how I can avoid this?
> 
> Perf top:
>   11.67%  [kernel]   [k] ip_idents_reserve
>    8.37%  [kernel]   [k] fib_table_lookup
>    4.46%  [kernel]   [k] _raw_spin_lock
>    3.21%  [kernel]   [k] copy_user_enhanced_fast_string
>    2.92%  [kernel]   [k] sock_alloc_send_pskb
>    2.88%  [kernel]   [k] udp_sendmsg
> 

Because you use a single destination, all flows compete on a single
atomic to get their next IP identifier.

You can try to use netperf options  (-- -N -n) so that netperf uses
connected UDP sockets.

In this case, the IP identifier generator is held in each socket.
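
Roughly speaking (a simplified sketch, not the actual kernel source;
names and sizes are only illustrative), the shared generator is an
array of atomic counters indexed by a hash of the packet's addresses
and protocol, so a single destination funnels every flow onto one
counter and one cache line:

 /* simplified userspace sketch of the idea behind ip_idents_reserve() */
 #include <stdatomic.h>
 #include <stdint.h>

 #define IP_IDENTS_SZ 2048            /* bucket count only illustrative */
 static atomic_uint ip_idents[IP_IDENTS_SZ];

 static uint32_t idents_reserve(uint32_t hash, unsigned int segs)
 {
         /* all flows towards the same destination hash to the same
          * bucket, so every CPU contends on this one atomic add */
         return atomic_fetch_add(&ip_idents[hash % IP_IDENTS_SZ], segs);
 }

With connected sockets, e.g.

 super_netperf 120 -H 192.168.8.2 -t UDP_STREAM -l 100 -- -m 256 -N -n

each socket uses its own identifier counter and the shared atomic is
avoided.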

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Netperf UDP issue with connected sockets
  2014-09-03 15:17 ` Eric Dumazet
@ 2016-11-16 12:16   ` Jesper Dangaard Brouer
  2016-11-16 17:46     ` Rick Jones
  0 siblings, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-16 12:16 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, Rick Jones, brouer, Eric Dumazet


While optimizing the kernel RX path, I've run into an issue where I
cannot use netperf UDP_STREAM for testing, because the sender is
slower than the receiver.  Thus, it cannot show my receiver
improvements (as the receiver has idle cycles).

Eric Dumazet previously told me[1] this was related to netperf needing
to use connected sockets for UDP.  The netperf options "-- -n -N"
should enable connected UDP sockets, but that never worked!  The
options are documented, but netperf seems to have a bug.

Called like:
 netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N

The problem on the sender side is "__ip_select_ident".

 Samples: 15K of event 'cycles', Event count (approx.): 16409681913
   Overhead  Command         Shared Object     Symbol
 +   11.18%  netperf         [kernel.vmlinux]  [k] __ip_select_ident
 +    6.93%  netperf         [kernel.vmlinux]  [k] _raw_spin_lock
 +    6.12%  netperf         [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
 +    4.31%  netperf         [kernel.vmlinux]  [k] __ip_make_skb
 +    3.97%  netperf         [kernel.vmlinux]  [k] fib_table_lookup
 +    3.51%  netperf         [mlx5_core]       [k] mlx5e_sq_xmit
 +    2.43%  netperf         [kernel.vmlinux]  [k] __ip_route_output_key_hash
 +    2.24%  netperf         netperf           [.] send_omni_inner
 +    2.17%  netperf         netperf           [.] send_data

[1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
 - https://www.spinics.net/lists/netdev/msg294752.html

Not fixed in version 2.7.0.
 - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz

Used extra netperf configure compile options:
 ./configure  --enable-histogram --enable-demo

It seems like some fix attempts exist in the SVN repository:

 svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
 svn log -r709
 # A quick stab at getting remote connect going for UDP_STREAM
 svn diff -r708:709

Testing with the SVN version still shows __ip_select_ident() as top #1.

(p.s. is netperf ever going to be converted from SVN to git?)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

(old email below)

On Wed, 03 Sep 2014 08:17:06 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2014-09-03 at 16:59 +0200, Jesper Dangaard Brouer wrote:
> > Hi Eric,
> > 
> > When doing:
> >  super_netperf 120 -H 192.168.8.2 -t UDP_STREAM -l 100 -- -m 256
> > 
> > I'm seeing function ip_idents_reserve() consuming most CPU.  Could you
> > help, explain what is going on, and how I can avoid this?
> > 
> > Perf top:
> >   11.67%  [kernel]   [k] ip_idents_reserve
> >    8.37%  [kernel]   [k] fib_table_lookup
> >    4.46%  [kernel]   [k] _raw_spin_lock
> >    3.21%  [kernel]   [k] copy_user_enhanced_fast_string
> >    2.92%  [kernel]   [k] sock_alloc_send_pskb
> >    2.88%  [kernel]   [k] udp_sendmsg
> >   
> 
> Because you use a single destination, all flows compete on a single
> atomic to get their next IP identifier.
> 
> You can try to use netperf options  (-- -N -n) so that netperf uses
> connected UDP sockets.
> 
> In this case, the IP identifier generator is held in each socket.
 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-16 12:16   ` Netperf UDP issue with connected sockets Jesper Dangaard Brouer
@ 2016-11-16 17:46     ` Rick Jones
  2016-11-16 22:40       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 60+ messages in thread
From: Rick Jones @ 2016-11-16 17:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, Eric Dumazet

On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote:
> [1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
>  - https://www.spinics.net/lists/netdev/msg294752.html
>
> Not fixed in version 2.7.0.
>  - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz
>
> Used extra netperf configure compile options:
>  ./configure  --enable-histogram --enable-demo
>
> It seems like some fix attempts exists in the SVN repository::
>
>  svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
>  svn log -r709
>  # A quick stab at getting remote connect going for UDP_STREAM
>  svn diff -r708:709
>
> Testing with SVN version, still show __ip_select_ident() in top#1.

Indeed, there was a fix for getting the remote side connect()ed. 
Looking at what I have for the top of trunk I do though see a connect() 
call being made at the local end:

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
brk(0xe53000)                           = 0xe53000
getsockname(4, {sa_family=AF_INET, sin_port=htons(59758), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
sendto(3, "\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"..., 656, 0, NULL, 0) = 656
select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
recvfrom(3, "\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"..., 656, 0, NULL, NULL) = 656
write(1, "need to connect is 1\n", 21)  = 21
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f2824eb2cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f2824eb2cb0}, NULL, 8) = 0
alarm(1)                                = 0
connect(4, {sa_family=AF_INET, sin_port=htons(34832), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
The only difference there from the top of trunk is the "need to connect" 
write/printf I just put in the code to be a nice marker in the system 
call trace.

It is a wild guess, but does setting SO_DONTROUTE affect whether or not 
a connect() would have the desired effect?  That is there to protect 
people from themselves (long story about people using UDP_STREAM to 
stress improperly air-gapped systems during link up/down testing....) 
It can be disabled with a test-specific -R 1 option, so your netperf 
command would become:

netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1

>
> (p.s. is netperf ever going to be converted from SVN to git?)
>

Well....  my git-fu could use some work (gentle, offline taps with a 
clueful tutorial bat would be welcome), and at least in the past, going 
to git was held back because there were a bunch of netperf users on 
Windows and there wasn't (at the time) support for git under Windows.

But I am not against the idea in principle.

happy benchmarking,

rick jones

PS - rick.jones@hp.com no longer works.  rick.jones2@hpe.com should be 
used instead.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-16 17:46     ` Rick Jones
@ 2016-11-16 22:40       ` Jesper Dangaard Brouer
  2016-11-16 22:50         ` Rick Jones
  2016-11-17  0:34         ` Eric Dumazet
  0 siblings, 2 replies; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-16 22:40 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, Eric Dumazet, brouer

On Wed, 16 Nov 2016 09:46:37 -0800
Rick Jones <rick.jones2@hpe.com> wrote:

> On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote:
> > [1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
> >  - https://www.spinics.net/lists/netdev/msg294752.html
> >
> > Not fixed in version 2.7.0.
> >  - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz
> >
> > Used extra netperf configure compile options:
> >  ./configure  --enable-histogram --enable-demo
> >
> > It seems like some fix attempts exists in the SVN repository::
> >
> >  svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
> >  svn log -r709
> >  # A quick stab at getting remote connect going for UDP_STREAM
> >  svn diff -r708:709
> >
> > Testing with SVN version, still show __ip_select_ident() in top#1.  
> 
> Indeed, there was a fix for getting the remote side connect()ed. 
> Looking at what I have for the top of trunk I do though see a connect() 
> call being made at the local end:
> 
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), 
> sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
> setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
> brk(0xe53000)                           = 0xe53000
> getsockname(4, {sa_family=AF_INET, sin_port=htons(59758), 
> sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
> sendto(3, 
> "\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"..., 
> 656, 0, NULL, 0) = 656
> select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
> recvfrom(3, 
> "\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"..., 
> 656, 0, NULL, NULL) = 656
> write(1, "need to connect is 1\n", 21)  = 21
> rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
> 0x7f2824eb2cb0}, NULL, 8) = 0
> rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
> 0x7f2824eb2cb0}, NULL, 8) = 0
> alarm(1)                                = 0
> connect(4, {sa_family=AF_INET, sin_port=htons(34832), 
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
> 1024
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
> 1024
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
> 1024
> 
> the only difference there with top of trunk is that "need to connect" 
> write/printf I just put in the code to be a nice marker in the system 
> call trace.
> 
> It is a wild guess, but does setting SO_DONTROUTE affect whether or not 
> a connect() would have the desired effect?  That is there to protect 
> people from themselves (long story about people using UDP_STREAM to 
> stress improperly air-gapped systems during link up/down testing....) 
> It can be disabled with a test-specific -R 1 option, so your netperf 
> command would become:
> 
> netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1

Using -R 1 does not seem to help remove __ip_select_ident()

Samples: 56K of event 'cycles', Event count (approx.): 78628132661
  Overhead  Command        Shared Object        Symbol
+    9.11%  netperf        [kernel.vmlinux]     [k] __ip_select_ident
+    6.98%  netperf        [kernel.vmlinux]     [k] _raw_spin_lock
+    6.21%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
+    5.03%  netperf        [kernel.vmlinux]     [k] copy_user_enhanced_fast_string
+    4.69%  netperf        [kernel.vmlinux]     [k] __ip_make_skb
+    4.63%  netperf        [kernel.vmlinux]     [k] skb_set_owner_w
+    4.15%  swapper        [kernel.vmlinux]     [k] __slab_free
+    3.80%  netperf        [mlx5_core]          [k] mlx5e_sq_xmit
+    2.00%  swapper        [kernel.vmlinux]     [k] sock_wfree
+    1.94%  netperf        netperf              [.] send_data
+    1.92%  netperf        netperf              [.] send_omni_inner


> >
> > (p.s. is netperf ever going to be converted from SVN to git?)
> >  
> 
> Well....  my git-fu could use some work (gentle, offlinetaps with a 
> clueful tutorial bat would be welcome), and at least in the past, going 
> to git was held back because there were a bunch of netperf users on 
> Windows and there wasn't (at the time) support for git under Windows.
> 
> But I am not against the idea in principle.

Once you have learned git, you will never go back to SVN. Just do it! :-)

Here are even nice writeups of how to convert and preserve history:
 http://john.albin.net/git/convert-subversion-to-git
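
A minimal one-shot import (no author mapping, just preserving trunk
history) would be something along the lines of:

 git svn clone http://www.netperf.org/svn/netperf2/trunk/ netperf2-git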

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-16 22:40       ` Jesper Dangaard Brouer
@ 2016-11-16 22:50         ` Rick Jones
  2016-11-17  0:34         ` Eric Dumazet
  1 sibling, 0 replies; 60+ messages in thread
From: Rick Jones @ 2016-11-16 22:50 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, Eric Dumazet

On 11/16/2016 02:40 PM, Jesper Dangaard Brouer wrote:
> On Wed, 16 Nov 2016 09:46:37 -0800
> Rick Jones <rick.jones2@hpe.com> wrote:
>> It is a wild guess, but does setting SO_DONTROUTE affect whether or not
>> a connect() would have the desired effect?  That is there to protect
>> people from themselves (long story about people using UDP_STREAM to
>> stress improperly air-gapped systems during link up/down testing....)
>> It can be disabled with a test-specific -R 1 option, so your netperf
>> command would become:
>>
>> netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1
>
> Using -R 1 does not seem to help remove __ip_select_ident()

Bummer.  It was a wild guess anyway, since I was seeing a connect() call 
on the data socket.

> Samples: 56K of event 'cycles', Event count (approx.): 78628132661
>   Overhead  Command        Shared Object        Symbol
> +    9.11%  netperf        [kernel.vmlinux]     [k] __ip_select_ident
> +    6.98%  netperf        [kernel.vmlinux]     [k] _raw_spin_lock
> +    6.21%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
> +    5.03%  netperf        [kernel.vmlinux]     [k] copy_user_enhanced_fast_string
> +    4.69%  netperf        [kernel.vmlinux]     [k] __ip_make_skb
> +    4.63%  netperf        [kernel.vmlinux]     [k] skb_set_owner_w
> +    4.15%  swapper        [kernel.vmlinux]     [k] __slab_free
> +    3.80%  netperf        [mlx5_core]          [k] mlx5e_sq_xmit
> +    2.00%  swapper        [kernel.vmlinux]     [k] sock_wfree
> +    1.94%  netperf        netperf              [.] send_data
> +    1.92%  netperf        netperf              [.] send_omni_inner

Well, the next step I suppose is to have you try a quick netperf 
UDP_STREAM under strace to see if your netperf binary does what mine did:

strace -v -o /tmp/netperf.strace netperf -H 198.18.50.1 -t UDP_STREAM -l 1 -- -m 1472 -n -N -R 1

And see if you see the connect() I saw. (Note, I make the runtime 1 second)

rick

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-16 22:40       ` Jesper Dangaard Brouer
  2016-11-16 22:50         ` Rick Jones
@ 2016-11-17  0:34         ` Eric Dumazet
  2016-11-17  8:16           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-11-17  0:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev

On Wed, 2016-11-16 at 23:40 +0100, Jesper Dangaard Brouer wrote:

> Using -R 1 does not seem to help remove __ip_select_ident()
> 
> Samples: 56K of event 'cycles', Event count (approx.): 78628132661
>   Overhead  Command        Shared Object        Symbol
> +    9.11%  netperf        [kernel.vmlinux]     [k] __ip_select_ident
> +    6.98%  netperf        [kernel.vmlinux]     [k] _raw_spin_lock
> +    6.21%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
> +    5.03%  netperf        [kernel.vmlinux]     [k] copy_user_enhanced_fast_string
> +    4.69%  netperf        [kernel.vmlinux]     [k] __ip_make_skb
> +    4.63%  netperf        [kernel.vmlinux]     [k] skb_set_owner_w
> +    4.15%  swapper        [kernel.vmlinux]     [k] __slab_free
> +    3.80%  netperf        [mlx5_core]          [k] mlx5e_sq_xmit
> +    2.00%  swapper        [kernel.vmlinux]     [k] sock_wfree
> +    1.94%  netperf        netperf              [.] send_data
> +    1.92%  netperf        netperf              [.] send_omni_inner

Check "ss -nu"  ?

You will see if sockets are connected (present in ss output or not)

UDP being connected does not prevent __ip_select_ident() being used.

    if ((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) {

So you need IP_DF being set, and skb->ignore_df being 0

-> time to try IP_MTU_DISCOVER ;)
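
For context, that check lives in ip_select_ident_segs() in
include/net/ip.h; slightly simplified, the logic is roughly:

    static inline void ip_select_ident_segs(struct net *net, struct sk_buff *skb,
                                            struct sock *sk, int segs)
    {
            struct iphdr *iph = ip_hdr(skb);

            if ((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) {
                    /* DF set: a connected socket can use its cheap
                     * per-socket counter */
                    if (sk && inet_sk(sk)->inet_daddr) {
                            iph->id = htons(inet_sk(sk)->inet_id);
                            inet_sk(sk)->inet_id += segs;
                    } else {
                            iph->id = 0;
                    }
            } else {
                    /* no DF: fall back to the shared generator */
                    __ip_select_ident(net, iph, segs);
            }
    }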

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17  0:34         ` Eric Dumazet
@ 2016-11-17  8:16           ` Jesper Dangaard Brouer
  2016-11-17 13:20             ` Eric Dumazet
                               ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-17  8:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	brouer@redhat.com

On Wed, 16 Nov 2016 16:34:09 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-11-16 at 23:40 +0100, Jesper Dangaard Brouer wrote:
> 
> > Using -R 1 does not seem to help remove __ip_select_ident()
> > 
> > Samples: 56K of event 'cycles', Event count (approx.): 78628132661
> >   Overhead  Command        Shared Object        Symbol
> > +    9.11%  netperf        [kernel.vmlinux]     [k] __ip_select_ident
> > +    6.98%  netperf        [kernel.vmlinux]     [k] _raw_spin_lock
> > +    6.21%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
> > +    5.03%  netperf        [kernel.vmlinux]     [k] copy_user_enhanced_fast_string
> > +    4.69%  netperf        [kernel.vmlinux]     [k] __ip_make_skb
> > +    4.63%  netperf        [kernel.vmlinux]     [k] skb_set_owner_w
> > +    4.15%  swapper        [kernel.vmlinux]     [k] __slab_free
> > +    3.80%  netperf        [mlx5_core]          [k] mlx5e_sq_xmit
> > +    2.00%  swapper        [kernel.vmlinux]     [k] sock_wfree
> > +    1.94%  netperf        netperf              [.] send_data
> > +    1.92%  netperf        netperf              [.] send_omni_inner  
> 
> Check "ss -nu"  ?
> 
> You will see if sockets are connected (present in ss output or not)

I tested different versions of netperf; the commands used are below my signature:

 netperf-2.6.0: connected "broken"
 netperf-2.7.0: connected works
 SVN-r709     : connected works

I noticed there is a Send-Q, and the #2 perf top entry is _raw_spin_lock,
which looks like it comes from __dev_queue_xmit(), but we know from
experience that this stall is actually caused by writing the
tailptr/doorbell in the HW.  Thus, this could benefit a lot from
bulk/xmit_more support in the qdisc layer.


> UDP being connected does not prevent __ip_select_ident() being used.
> 
>     if ((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) {
> 
> So you need IP_DF being set, and skb->ignore_df being 0

Thanks for explaining that! :-)

http://lxr.free-electrons.com/source/include/net/ip.h?v=4.8#L332
http://lxr.free-electrons.com/source/net/ipv4/ip_output.c?v=4.8#L449

Netperf UDP_STREAM by default sends 64K messages that get fragmented,
which is very unfortunate because people end up testing a code path in
the kernel they didn't expect.  That is why I use the option
"-- -m 1472" (1472 bytes fills a 1500-byte MTU once the 20-byte IPv4
and 8-byte UDP headers are added).


> time to try IP_MTU_DISCOVER ;)  

To Rick: maybe you can find a good solution or option based on Eric's
hint, to send appropriately sized UDP packets with Don't Fragment (DF)
set.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Testing with ss -nua

$ /usr/local/stow/netperf-2.6.0-demo/bin/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una

State      Recv-Q Send-Q       Local Address:Port          Peer Address:Port
UNCONN     0      11520                    *:54589                    *:*

$ /usr/local/stow/netperf-2.7.0-demo/bin/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una
State      Recv-Q Send-Q       Local Address:Port          Peer Address:Port
ESTAB      0      18432          198.18.50.3:46803          198.18.50.1:51851

$ ~/tools/netperf2-svn/src/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una
State      Recv-Q Send-Q       Local Address:Port          Peer Address:Port
ESTAB      0      43776          198.18.50.3:42965          198.18.50.1:51948

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17  8:16           ` Jesper Dangaard Brouer
@ 2016-11-17 13:20             ` Eric Dumazet
  2016-11-17 13:42               ` Jesper Dangaard Brouer
  2016-11-17 17:42             ` Rick Jones
  2016-11-28 18:33             ` Rick Jones
  2 siblings, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-11-17 13:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev

On Thu, 2016-11-17 at 09:16 +0100, Jesper Dangaard Brouer wrote:

> 
> I noticed there is a Send-Q, and the perf-top2 is _raw_spin_lock, which
> looks like it comes from __dev_queue_xmit(), but we know from
> experience that this stall is actually caused by writing the
> tailptr/doorbell in the HW.  Thus, this could benefit a lot from
> bulk/xmit_more into the qdisc layer.

The Send-Q is there because TX completions are being delayed a bit,
due to IRQ mitigation.

(ethtool -c eth0)

It happens even if you do not have a qdisc in the first place.

And we do have xmit_more in the qdisc layer already.
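
To make TX completions come back sooner one could tighten the TX
coalescing, e.g. (device name and values only illustrative):

 ethtool -C eth0 tx-usecs 8 tx-frames 8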

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 13:20             ` Eric Dumazet
@ 2016-11-17 13:42               ` Jesper Dangaard Brouer
  2016-11-17 14:17                 ` Eric Dumazet
  0 siblings, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-17 13:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, brouer

On Thu, 17 Nov 2016 05:20:50 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 09:16 +0100, Jesper Dangaard Brouer wrote:
> 
> > 
> > I noticed there is a Send-Q, and the perf-top2 is _raw_spin_lock, which
> > looks like it comes from __dev_queue_xmit(), but we know from
> > experience that this stall is actually caused by writing the
> > tailptr/doorbell in the HW.  Thus, this could benefit a lot from
> > bulk/xmit_more into the qdisc layer.  
> 
> The Send-Q is there because of TX-completions being delayed a bit,
> because of IRQ mitigation.
> 
> (ethtool -c eth0)
> 
> It happens even if you do not have a qdisc in the first place.
> 
> And we do have xmit_more in the qdisc layer already.

I can see that the qdisc layer does not activate xmit_more in this case.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


$ ethtool -c mlx5p4
Coalesce parameters for mlx5p4:
Adaptive RX: on  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 3
rx-frames: 32
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 32
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 13:42               ` Jesper Dangaard Brouer
@ 2016-11-17 14:17                 ` Eric Dumazet
  2016-11-17 14:57                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-11-17 14:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev

On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:

> I can see that qdisc layer does not activate xmit_more in this case.
> 

Sure. Not enough pressure from the sender(s).

The bottleneck is not the NIC or qdisc in your case, meaning that BQL
limit is kept at a small value.

(BTW not all NIC have expensive doorbells)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 14:17                 ` Eric Dumazet
@ 2016-11-17 14:57                   ` Jesper Dangaard Brouer
  2016-11-17 16:21                     ` Eric Dumazet
  2016-11-17 17:34                     ` Netperf UDP issue with connected sockets David Laight
  0 siblings, 2 replies; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-17 14:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, brouer

On Thu, 17 Nov 2016 06:17:38 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> 
> > I can see that qdisc layer does not activate xmit_more in this case.
> >   
> 
> Sure. Not enough pressure from the sender(s).
> 
> The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> limit is kept at a small value.
> 
> (BTW not all NIC have expensive doorbells)

I believe this NIC mlx5 (50G edition) does.

I'm seeing UDP TX of 1656017.55 pps, which is per packet:
2414 cycles(tsc) 603.86 ns

Perf top shows (with my own udp_flood, that avoids __ip_select_ident):

 Samples: 56K of event 'cycles', Event count (approx.): 51613832267
   Overhead  Command        Shared Object        Symbol
 +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
   - _raw_spin_lock
      + 90.78% __dev_queue_xmit
      + 7.83% dev_queue_xmit
      + 1.30% ___slab_alloc
 +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w
 +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
 +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup
 +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
 +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash
 +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free

In this setup the spinlock in __dev_queue_xmit should be uncongested.
An uncongested spin_lock+unlock costs 32 cycles(tsc), 8.198 ns, on this
system.

But 8.92% of the time is spent on it, which corresponds to a cost of 215
cycles (2414*0.0892).  This cost is too high, thus something else is
going on... I claim this mysterious extra cost is the tailptr/doorbell.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 14:57                   ` Jesper Dangaard Brouer
@ 2016-11-17 16:21                     ` Eric Dumazet
  2016-11-17 18:30                       ` Jesper Dangaard Brouer
  2016-11-17 17:34                     ` Netperf UDP issue with connected sockets David Laight
  1 sibling, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-11-17 16:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev

On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >   
> > 
> > Sure. Not enough pressure from the sender(s).
> > 
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> > 
> > (BTW not all NIC have expensive doorbells)
> 
> I believe this NIC mlx5 (50G edition) does.
> 
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
> 
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> 
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>    Overhead  Command        Shared Object        Symbol
>  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
>    - _raw_spin_lock
>       + 90.78% __dev_queue_xmit
>       + 7.83% dev_queue_xmit
>       + 1.30% ___slab_alloc
>  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w

Does TX completion happens on same cpu ?

>  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
>  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup

Why fib_table_lookup() is used with connected UDP flow ???

>  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
>  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash

Same here, this is suspect.

>  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
> 
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> 
> But 8.92% of the time is spend on it, which corresponds to a cost of 215
> cycles (2414*0.0892).  This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Well, with no pressure, the doorbell is triggered for each packet.

Since we cannot predict that your application is going to send yet
another packet one usec later, we cannot avoid this.

Note that with the patches I am working on (busypolling extensions),
we could decide to let the busypoller do the doorbells, say one every
10 usec (or after ~16 packets were queued).

But mlx4 uses two different NAPI contexts for TX and RX; maybe mlx5 has
the same strategy.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: Netperf UDP issue with connected sockets
  2016-11-17 14:57                   ` Jesper Dangaard Brouer
  2016-11-17 16:21                     ` Eric Dumazet
@ 2016-11-17 17:34                     ` David Laight
  2016-11-17 22:39                       ` Alexander Duyck
  1 sibling, 1 reply; 60+ messages in thread
From: David Laight @ 2016-11-17 17:34 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', Eric Dumazet; +Cc: Rick Jones, netdev

From: Jesper Dangaard Brouer
> Sent: 17 November 2016 14:58
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> >
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> >
> > Sure. Not enough pressure from the sender(s).
> >
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> >
> > (BTW not all NIC have expensive doorbells)
> 
> I believe this NIC mlx5 (50G edition) does.
> 
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
> 
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> 
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>    Overhead  Command        Shared Object        Symbol
>  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
>    - _raw_spin_lock
>       + 90.78% __dev_queue_xmit
>       + 7.83% dev_queue_xmit
>       + 1.30% ___slab_alloc
>  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w
>  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
>  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup
>  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
>  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash
>  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
> 
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> 
> But 8.92% of the time is spend on it, which corresponds to a cost of 215
> cycles (2414*0.0892).  This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Try adding code to ring the doorbell twice.
If this doesn't slow things down then it isn't (likely to be) responsible
for the delay you are seeing.

	David

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17  8:16           ` Jesper Dangaard Brouer
  2016-11-17 13:20             ` Eric Dumazet
@ 2016-11-17 17:42             ` Rick Jones
  2016-11-28 18:33             ` Rick Jones
  2 siblings, 0 replies; 60+ messages in thread
From: Rick Jones @ 2016-11-17 17:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet; +Cc: netdev, brouer

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
>> time to try IP_MTU_DISCOVER ;)
>
> To Rick, maybe you can find a good solution or option with Eric's hint,
> to send appropriate sized UDP packets with Don't Fragment (DF).

Well, I suppose adding another setsockopt() to the data socket creation 
wouldn't be too difficult, along with another command-line option to 
cause it to happen.

Could we leave things as "make sure you don't need fragmentation when 
you use this" or would netperf have to start processing ICMP messages?
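
Presumably that would boil down to something like this on the data
socket (just a sketch; set_dont_fragment() and data_socket are
placeholder names, not netperf code):

 /* sketch: request DF on outgoing datagrams (Linux-specific option) */
 #include <netinet/in.h>
 #include <stdio.h>
 #include <sys/socket.h>

 static int set_dont_fragment(int data_socket)
 {
         int pmtu = IP_PMTUDISC_DO;  /* set DF, never fragment locally */

         if (setsockopt(data_socket, IPPROTO_IP, IP_MTU_DISCOVER,
                        &pmtu, sizeof(pmtu)) < 0) {
                 perror("setsockopt(IP_MTU_DISCOVER)");
                 return -1;
         }
         return 0;
 }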

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 16:21                     ` Eric Dumazet
@ 2016-11-17 18:30                       ` Jesper Dangaard Brouer
  2016-11-17 18:51                         ` Eric Dumazet
  0 siblings, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-17 18:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, brouer, Saeed Mahameed, Tariq Toukan

On Thu, 17 Nov 2016 08:21:19 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 17 Nov 2016 06:17:38 -0800
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >   
> > > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > >   
> > > > I can see that qdisc layer does not activate xmit_more in this case.
> > > >     
> > > 
> > > Sure. Not enough pressure from the sender(s).
> > > 
> > > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > > limit is kept at a small value.
> > > 
> > > (BTW not all NIC have expensive doorbells)  
> > 
> > I believe this NIC mlx5 (50G edition) does.
> > 
> > I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> > 2414 cycles(tsc) 603.86 ns
> > 
> > Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> > 
> >  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> >    Overhead  Command        Shared Object        Symbol
> >  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
> >    - _raw_spin_lock
> >       + 90.78% __dev_queue_xmit
> >       + 7.83% dev_queue_xmit
> >       + 1.30% ___slab_alloc
> >  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w  
> 
> Does TX completion happens on same cpu ?
> 
> >  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
> >  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup  
> 
> Why fib_table_lookup() is used with connected UDP flow ???

I don't know. I would be interested in hints on how to avoid this!...

I also see it with netperf, and my udp_flood code is here:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c


> >  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
> >  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash  
> 
> Same here, this is suspect.

It is the function calling fib_table_lookup(), and it is called by ip_route_output_flow().
 
> >  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
> > 
> > In this setup the spinlock in __dev_queue_xmit should be uncongested.
> > An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> > 
> > But 8.92% of the time is spend on it, which corresponds to a cost of 215
> > cycles (2414*0.0892).  This cost is too high, thus something else is
> > going on... I claim this mysterious extra cost is the tailptr/doorbell.  
> 
> Well, with no pressure, doorbell is triggered for each packet.
> 
> Since we can not predict that your application is going to send yet
> another packet one usec later, we can not avoid this.

The point is I can see a socket Send-Q forming, thus we do know the
application has something to send, so there is a possibility for
non-opportunistic bulking.  Allowing/implementing bulk enqueue from the
socket layer into the qdisc layer should be fairly simple (and the rest
of xmit_more is already in place).


> Note that with the patches I am working on (busypolling extentions),
> we could decide to let the busypoller doing the doorbells, say one every
> 10 usec. (or after ~16 packets were queued)

Sounds interesting!  But that is an opportunistic approach.

> But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
> strategy .

It is a possibility that TX completions were happening on another CPU
(but I don't think so for mlx5).

To rule that out, I reran the experiment making sure to pin everything
to CPU-0 and the results are the same.

sudo ethtool -L mlx5p2 combined 1

sudo sh -c '\                                                    
 for x in /proc/irq/*/mlx5*/../smp_affinity; do \
   echo 01 > $x; grep . -H $x; done \
'

$ taskset -c 0 ./udp_flood --sendto 198.18.50.1 --count $((10**9))

Perf report validating CPU in use:

$ perf report -g --no-children --sort cpu,comm,dso,symbol --stdio --call-graph none

# Overhead  CPU  Command         Shared Object        Symbol                                   
# ........  ...  ..............  ...................  .........................................
#
     9.97%  000  udp_flood       [kernel.vmlinux]     [k] _raw_spin_lock                       
     4.37%  000  udp_flood       [kernel.vmlinux]     [k] fib_table_lookup                     
     3.97%  000  udp_flood       [mlx5_core]          [k] mlx5e_sq_xmit                        
     3.06%  000  udp_flood       [kernel.vmlinux]     [k] __ip_route_output_key_hash           
     2.51%  000  udp_flood       [mlx5_core]          [k] mlx5e_poll_tx_cq                     
     2.48%  000  udp_flood       [kernel.vmlinux]     [k] copy_user_enhanced_fast_string       
     2.47%  000  udp_flood       [kernel.vmlinux]     [k] entry_SYSCALL_64                     
     2.42%  000  udp_flood       [kernel.vmlinux]     [k] udp_sendmsg                          
     2.39%  000  udp_flood       [kernel.vmlinux]     [k] __ip_append_data.isra.47             
     2.29%  000  udp_flood       [kernel.vmlinux]     [k] sock_def_write_space                 
     2.19%  000  udp_flood       [mlx5_core]          [k] mlx5e_get_cqe                        
     1.95%  000  udp_flood       [kernel.vmlinux]     [k] __ip_make_skb                        
     1.94%  000  udp_flood       [kernel.vmlinux]     [k] __alloc_skb                          
     1.90%  000  udp_flood       [kernel.vmlinux]     [k] sock_wfree                           
     1.85%  000  udp_flood       [kernel.vmlinux]     [k] skb_set_owner_w                      
     1.62%  000  udp_flood       [kernel.vmlinux]     [k] ip_finish_output2                    
     1.61%  000  udp_flood       [kernel.vmlinux]     [k] kfree                                
     1.54%  000  udp_flood       [kernel.vmlinux]     [k] entry_SYSCALL_64_fastpath            
     1.35%  000  udp_flood       libc-2.17.so         [.] __sendmsg_nocancel                   
     1.26%  000  udp_flood       [kernel.vmlinux]     [k] __kmalloc_node_track_caller          
     1.24%  000  udp_flood       [kernel.vmlinux]     [k] __rcu_read_unlock                    
     1.23%  000  udp_flood       [kernel.vmlinux]     [k] __local_bh_enable_ip                 
     1.22%  000  udp_flood       [kernel.vmlinux]     [k] ip_idents_reserve         

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 18:30                       ` Jesper Dangaard Brouer
@ 2016-11-17 18:51                         ` Eric Dumazet
  2016-11-17 21:19                           ` Jesper Dangaard Brouer
  2016-11-21 16:03                           ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-11-17 18:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan

On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:

> The point is I can see a socket Send-Q forming, thus we do know the
> application have something to send. Thus, and possibility for
> non-opportunistic bulking. Allowing/implementing bulk enqueue from
> socket layer into qdisc layer, should be fairly simple (and rest of
> xmit_more is already in place).  


As I said, you are fooled by TX completions.

Please make sure to increase the sndbuf limits !

echo 2129920 >/proc/sys/net/core/wmem_default

lpaa23:~# sar -n DEV 1 10|grep eth1
10:49:25         eth1      7.00 9273283.00      0.61 2187214.90      0.00      0.00      0.00
10:49:26         eth1      1.00 9230795.00      0.06 2176787.57      0.00      0.00      1.00
10:49:27         eth1      2.00 9247906.00      0.17 2180915.45      0.00      0.00      0.00
10:49:28         eth1      3.00 9246542.00      0.23 2180790.38      0.00      0.00      1.00
10:49:29         eth1      1.00 9239218.00      0.06 2179044.83      0.00      0.00      0.00
10:49:30         eth1      3.00 9248775.00      0.23 2181257.84      0.00      0.00      1.00
10:49:31         eth1      4.00 9225471.00      0.65 2175772.75      0.00      0.00      0.00
10:49:32         eth1      2.00 9253536.00      0.33 2182666.44      0.00      0.00      1.00
10:49:33         eth1      1.00 9265900.00      0.06 2185598.40      0.00      0.00      0.00
10:49:34         eth1      1.00 6949031.00      0.06 1638889.63      0.00      0.00      1.00
Average:         eth1      2.50 9018045.70      0.25 2126893.82      0.00      0.00      0.50


lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
     xmit_more: 2251366909
     xmit_more: 2256011392

lpaa23:~# echo 2256011392-2251366909 | bc
4644483

   PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------

    11.64%  [kernel]  [k] skb_set_owner_w               
     6.21%  [kernel]  [k] queued_spin_lock_slowpath     
     4.76%  [kernel]  [k] _raw_spin_lock                
     4.40%  [kernel]  [k] __ip_make_skb                 
     3.10%  [kernel]  [k] sock_wfree                    
     2.87%  [kernel]  [k] ipt_do_table                  
     2.76%  [kernel]  [k] fq_dequeue                    
     2.71%  [kernel]  [k] mlx4_en_xmit                  
     2.50%  [kernel]  [k] __dev_queue_xmit              
     2.29%  [kernel]  [k] __ip_append_data.isra.40      
     2.28%  [kernel]  [k] udp_sendmsg                   
     2.01%  [kernel]  [k] __alloc_skb                   
     1.90%  [kernel]  [k] napi_consume_skb              
     1.63%  [kernel]  [k] udp_send_skb                  
     1.62%  [kernel]  [k] skb_release_data              
     1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath     
     1.56%  [kernel]  [k] dev_hard_start_xmit           
     1.55%  udpsnd    [.] __libc_send                   
     1.48%  [kernel]  [k] netif_skb_features            
     1.42%  [kernel]  [k] __qdisc_run                   
     1.35%  [kernel]  [k] sk_dst_check                  
     1.33%  [kernel]  [k] sock_def_write_space          
     1.30%  [kernel]  [k] kmem_cache_alloc_node_trace   
     1.29%  [kernel]  [k] __local_bh_enable_ip          
     1.21%  [kernel]  [k] copy_user_enhanced_fast_string
     1.08%  [kernel]  [k] __kmalloc_reserve.isra.40     
     1.08%  [kernel]  [k] SYSC_sendto                   
     1.07%  [kernel]  [k] kmem_cache_alloc_node         
     0.95%  [kernel]  [k] ip_finish_output2             
     0.95%  [kernel]  [k] ktime_get                     
     0.91%  [kernel]  [k] validate_xmit_skb             
     0.88%  [kernel]  [k] sock_alloc_send_pskb          
     0.82%  [kernel]  [k] sock_sendmsg                  

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 18:51                         ` Eric Dumazet
@ 2016-11-17 21:19                           ` Jesper Dangaard Brouer
  2016-11-17 21:44                             ` Eric Dumazet
  2016-11-21 16:03                           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-17 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, brouer

On Thu, 17 Nov 2016 10:51:23 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> 
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application have something to send. Thus, and possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > socket layer into qdisc layer, should be fairly simple (and rest of
> > xmit_more is already in place).    
> 
> 
> As I said, you are fooled by TX completions.
> 
> Please make sure to increase the sndbuf limits !
> 
> echo 2129920 >/proc/sys/net/core/wmem_default
> 
> lpaa23:~# sar -n DEV 1 10|grep eth1
> 10:49:25         eth1      7.00 9273283.00      0.61 2187214.90      0.00      0.00      0.00
> 10:49:26         eth1      1.00 9230795.00      0.06 2176787.57      0.00      0.00      1.00
> 10:49:27         eth1      2.00 9247906.00      0.17 2180915.45      0.00      0.00      0.00
> 10:49:28         eth1      3.00 9246542.00      0.23 2180790.38      0.00      0.00      1.00
> 10:49:29         eth1      1.00 9239218.00      0.06 2179044.83      0.00      0.00      0.00
> 10:49:30         eth1      3.00 9248775.00      0.23 2181257.84      0.00      0.00      1.00
> 10:49:31         eth1      4.00 9225471.00      0.65 2175772.75      0.00      0.00      0.00
> 10:49:32         eth1      2.00 9253536.00      0.33 2182666.44      0.00      0.00      1.00
> 10:49:33         eth1      1.00 9265900.00      0.06 2185598.40      0.00      0.00      0.00
> 10:49:34         eth1      1.00 6949031.00      0.06 1638889.63      0.00      0.00      1.00
> Average:         eth1      2.50 9018045.70      0.25 2126893.82      0.00      0.00      0.50
> 
> 
> lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
>      xmit_more: 2251366909
>      xmit_more: 2256011392
> 
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

xmit_more does not happen that frequently in my setup, though it does
happen sometimes.  And I do monitor with "ethtool -S".

~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
Ethtool(mlx5p2  ) stat:     92900913 (     92,900,913) <= tx0_bytes /sec
Ethtool(mlx5p2  ) stat:        36073 (         36,073) <= tx0_nop /sec
Ethtool(mlx5p2  ) stat:      1548349 (      1,548,349) <= tx0_packets /sec
Ethtool(mlx5p2  ) stat:            1 (              1) <= tx0_xmit_more /sec
Ethtool(mlx5p2  ) stat:     92884899 (     92,884,899) <= tx_bytes /sec
Ethtool(mlx5p2  ) stat:     99297696 (     99,297,696) <= tx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:      1548082 (      1,548,082) <= tx_csum_partial /sec
Ethtool(mlx5p2  ) stat:      1548082 (      1,548,082) <= tx_packets /sec
Ethtool(mlx5p2  ) stat:      1551527 (      1,551,527) <= tx_packets_phy /sec
Ethtool(mlx5p2  ) stat:     99076658 (     99,076,658) <= tx_prio1_bytes /sec
Ethtool(mlx5p2  ) stat:      1548073 (      1,548,073) <= tx_prio1_packets /sec
Ethtool(mlx5p2  ) stat:     92936078 (     92,936,078) <= tx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:      1548934 (      1,548,934) <= tx_vport_unicast_packets /sec
Ethtool(mlx5p2  ) stat:            1 (              1) <= tx_xmit_more /sec

(after several attempts I got:)
$ ethtool -S mlx5p2|grep more; sleep 1;ethtool -S mlx5p2|grep more
     tx_xmit_more: 14048
     tx0_xmit_more: 14048
     tx_xmit_more: 14049
     tx0_xmit_more: 14049

This was with:
 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920

>    PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
> ---------------------------------------------------------------------------------------------
> 
>     11.64%  [kernel]  [k] skb_set_owner_w               
>      6.21%  [kernel]  [k] queued_spin_lock_slowpath     
>      4.76%  [kernel]  [k] _raw_spin_lock                
>      4.40%  [kernel]  [k] __ip_make_skb                 
>      3.10%  [kernel]  [k] sock_wfree                    
>      2.87%  [kernel]  [k] ipt_do_table                  
>      2.76%  [kernel]  [k] fq_dequeue                    
>      2.71%  [kernel]  [k] mlx4_en_xmit                  
>      2.50%  [kernel]  [k] __dev_queue_xmit              
>      2.29%  [kernel]  [k] __ip_append_data.isra.40      
>      2.28%  [kernel]  [k] udp_sendmsg                   
>      2.01%  [kernel]  [k] __alloc_skb                   
>      1.90%  [kernel]  [k] napi_consume_skb              
>      1.63%  [kernel]  [k] udp_send_skb                  
>      1.62%  [kernel]  [k] skb_release_data              
>      1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath     
>      1.56%  [kernel]  [k] dev_hard_start_xmit           
>      1.55%  udpsnd    [.] __libc_send                   
>      1.48%  [kernel]  [k] netif_skb_features            
>      1.42%  [kernel]  [k] __qdisc_run                   
>      1.35%  [kernel]  [k] sk_dst_check                  
>      1.33%  [kernel]  [k] sock_def_write_space          
>      1.30%  [kernel]  [k] kmem_cache_alloc_node_trace   
>      1.29%  [kernel]  [k] __local_bh_enable_ip          
>      1.21%  [kernel]  [k] copy_user_enhanced_fast_string
>      1.08%  [kernel]  [k] __kmalloc_reserve.isra.40     
>      1.08%  [kernel]  [k] SYSC_sendto                   
>      1.07%  [kernel]  [k] kmem_cache_alloc_node         
>      0.95%  [kernel]  [k] ip_finish_output2             
>      0.95%  [kernel]  [k] ktime_get                     
>      0.91%  [kernel]  [k] validate_xmit_skb             
>      0.88%  [kernel]  [k] sock_alloc_send_pskb          
>      0.82%  [kernel]  [k] sock_sendmsg                  

I'm more interested in why I see fib_table_lookup() and
__ip_route_output_key_hash() when you don't?!?  There must be some
mistake in my setup!

Maybe you can share your udp flood "udpsnd" program source?

Maybe I'm missing some important sysctl under /proc/sys/net/ ?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

p.s. I placed my testing software here:
 https://github.com/netoptimizer/network-testing/tree/master/src

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 21:19                           ` Jesper Dangaard Brouer
@ 2016-11-17 21:44                             ` Eric Dumazet
  2016-11-17 23:08                               ` Rick Jones
  2016-11-18 17:12                               ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-11-17 21:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan

On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:

> 
> Maybe you can share your udp flood "udpsnd" program source?

Very ugly. This is based on what I wrote when tracking the UDP v6
checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
checksum in skb_checksum_help()), because netperf sends the same message
over and over...


Use -d 2   to remove the ip_idents_reserve() overhead.


#define _GNU_SOURCE

#include <arpa/inet.h>	/* inet_pton() */
#include <errno.h>
#include <error.h>
#include <linux/errqueue.h>
#include <netinet/in.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

char buffer[1400];

int main(int argc, char** argv) {
  int fd, i;
  struct sockaddr_in6 addr;
  char *host = "2002:af6:798::1";
  int family = AF_INET6;
  int discover = -1;

  while ((i = getopt(argc, argv, "4H:d:")) != -1) {
    switch (i) {
    case 'H': host = optarg; break;
    case '4': family = AF_INET; break;
    case 'd': discover = atoi(optarg); break;
    }
  }
  fd = socket(family, SOCK_DGRAM, 0);
  if (fd < 0)
    error(1, errno, "failed to create socket");
  if (discover != -1)
    setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
               &discover, sizeof(discover));

  memset(&addr, 0, sizeof(addr));
  if (family == AF_INET6) {
    addr.sin6_family = AF_INET6;
    addr.sin6_port = htons(9);
    inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
  } else {
    struct sockaddr_in *in = (struct sockaddr_in *)&addr;
    in->sin_family = family;
    in->sin_port = htons(9);
    inet_pton(family, host, &in->sin_addr);
  }
  connect(fd, (struct sockaddr *)&addr,
          (family == AF_INET6) ? sizeof(addr) :
                                 sizeof(struct sockaddr_in));
  memset(buffer, 1, 1400);
  for (i = 0; i < 655360000; i++) {
    memcpy(buffer, &i, sizeof(i));
    send(fd, buffer, 100 + rand() % 200, 0);
  }
  return 0;
}
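
(To build and run, assuming the above is saved as udpsnd.c:

 gcc -O2 -Wall -o udpsnd udpsnd.c
 ./udpsnd -4 -H 10.246.7.152 -d 2

adjust the -H destination for your own setup.)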

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 17:34                     ` Netperf UDP issue with connected sockets David Laight
@ 2016-11-17 22:39                       ` Alexander Duyck
  0 siblings, 0 replies; 60+ messages in thread
From: Alexander Duyck @ 2016-11-17 22:39 UTC (permalink / raw)
  To: David Laight; +Cc: Jesper Dangaard Brouer, Eric Dumazet, Rick Jones, netdev

On Thu, Nov 17, 2016 at 9:34 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Jesper Dangaard Brouer
>> Sent: 17 November 2016 14:58
>> On Thu, 17 Nov 2016 06:17:38 -0800
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>> >
>> > > I can see that qdisc layer does not activate xmit_more in this case.
>> > >
>> >
>> > Sure. Not enough pressure from the sender(s).
>> >
>> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
>> > limit is kept at a small value.
>> >
>> > (BTW not all NIC have expensive doorbells)
>>
>> I believe this NIC mlx5 (50G edition) does.
>>
>> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
>> 2414 cycles(tsc) 603.86 ns
>>
>> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>>
>>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>>    Overhead  Command        Shared Object        Symbol
>>  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
>>    - _raw_spin_lock
>>       + 90.78% __dev_queue_xmit
>>       + 7.83% dev_queue_xmit
>>       + 1.30% ___slab_alloc
>>  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w
>>  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
>>  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup
>>  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
>>  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash
>>  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
>>
>> In this setup the spinlock in __dev_queue_xmit should be uncongested.
>> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
>>
>> But 8.92% of the time is spend on it, which corresponds to a cost of 215
>> cycles (2414*0.0892).  This cost is too high, thus something else is
>> going on... I claim this mysterious extra cost is the tailptr/doorbell.
>
> Try adding code to ring the doorbell twice.
> If this doesn't slow things down then it isn't (likely to be) responsible
> for the delay you are seeing.
>
>         David
>

The problem isn't only the doorbell.  It is that a doorbell plus a locked
transaction on x86 results in a long wait until the doorbell write has
been completed.

You could batch a bunch of doorbell writes together and it isn't an
issue, but if you do something like writel(), wmb(), writel(), wmb(),
then you will see the effect double, since the write memory barrier is
what is forcing the delays.
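
Roughly, the pattern looks like this (queue_desc() and the ring fields
are just placeholders here, not any particular driver's code):

	/* per-packet doorbells: a barrier plus an MMIO write every time */
	for (i = 0; i < n; i++) {
		queue_desc(ring, pkt[i]);
		wmb();
		writel(ring->prod, ring->doorbell);
	}

	/* batched: queue everything first, one barrier + one doorbell */
	for (i = 0; i < n; i++)
		queue_desc(ring, pkt[i]);
	wmb();
	writel(ring->prod, ring->doorbell);

The first loop pays the write memory barrier and the posted doorbell
write n times; the second pays them once for the whole burst.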

- Alex

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 21:44                             ` Eric Dumazet
@ 2016-11-17 23:08                               ` Rick Jones
  2016-11-18  0:37                                 ` Julian Anastasov
  2016-11-18 17:12                               ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 60+ messages in thread
From: Rick Jones @ 2016-11-17 23:08 UTC (permalink / raw)
  To: Eric Dumazet, Jesper Dangaard Brouer; +Cc: netdev, Saeed Mahameed, Tariq Toukan

On 11/17/2016 01:44 PM, Eric Dumazet wrote:
> because netperf sends the same message
> over and over...

Well, sort of, by default.  That can be altered to a degree.

The global -F option should cause netperf to fill the buffers in its 
send ring with data from the specified file.  The number of buffers in 
the send ring can be controlled via the global -W option.  The number of 
elements in the ring will default to one more than the initial SO_SNDBUF 
size divided by the send size.
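For example, with the 212992 byte SO_SNDBUF shown in the strace output 
below and 1472 byte sends, that works out to 212992/1472 ~= 144, so a 
send ring of roughly 145 buffers.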

raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf 
-F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472

...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY)    = 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, 
st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, 
st_size=230027, st_atime=2016/11/16-09:49:29, 
st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f3099f62000
read(5, "#ifdef HAVE_CONFIG_H\n#include <c"..., 4096) = 4096
read(5, "_INTEGER *intvl_two_ptr = &intvl"..., 4096) = 4096
read(5, "interval_count = interval_burst;"..., 4096) = 4096
read(5, ";\n\n/* these will control the wid"..., 4096) = 4096
read(5, "\n  LOCAL_SECURITY_ENABLED_NUM,\n "..., 4096) = 4096
read(5, "      &dwBytes,  \n              "..., 4096) = 4096

...

rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
alarm(1)                                = 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include <c"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n    -m local,remote   S"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472

Of course, it will continue to send the same messages from the send_ring 
over and over instead of putting different data into the buffers each 
time, but if one has a sufficiently large -W option specified...
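
For example, something along these lines (assuming I have not mangled 
the options) should cycle roughly 512 * 1472 ~= 736 KB of distinct 
payload through the send ring:

 src/netperf -F src/nettest_omni.c -W 512 -t UDP_STREAM -l 1 -- -m 1472
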
happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 23:08                               ` Rick Jones
@ 2016-11-18  0:37                                 ` Julian Anastasov
  2016-11-18  0:42                                   ` Rick Jones
  0 siblings, 1 reply; 60+ messages in thread
From: Julian Anastasov @ 2016-11-18  0:37 UTC (permalink / raw)
  To: Rick Jones
  Cc: Eric Dumazet, Jesper Dangaard Brouer, netdev, Saeed Mahameed,
	Tariq Toukan


	Hello,

On Thu, 17 Nov 2016, Rick Jones wrote:

> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
> src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
> 
> ...
> 
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
> 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0

	connected socket can benefit from dst cached in socket
but not if SO_DONTROUTE is set. If we do not want to send packets
via gateway this -l 1 should help but I don't see IP_TTL setsockopt
in your first example with connect() to 127.0.0.1.

	Also, maybe there can be another default: if -l is used to
specify TTL then SO_DONTROUTE should not be set. I.e. we should
avoid SO_DONTROUTE, if possible.

Regards

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-18  0:37                                 ` Julian Anastasov
@ 2016-11-18  0:42                                   ` Rick Jones
  0 siblings, 0 replies; 60+ messages in thread
From: Rick Jones @ 2016-11-18  0:42 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, Jesper Dangaard Brouer, netdev, Saeed Mahameed,
	Tariq Toukan

On 11/17/2016 04:37 PM, Julian Anastasov wrote:
> On Thu, 17 Nov 2016, Rick Jones wrote:
>
>> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
>> src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
>>
>> ...
>>
>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
>> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
>> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
>> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
>> 16) = 0
>> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
>
> 	connected socket can benefit from dst cached in socket
> but not if SO_DONTROUTE is set. If we do not want to send packets
> via gateway this -l 1 should help but I don't see IP_TTL setsockopt
> in your first example with connect() to 127.0.0.1.
>
> 	Also, maybe there can be another default: if -l is used to
> specify TTL then SO_DONTROUTE should not be set. I.e. we should
> avoid SO_DONTROUTE, if possible.

The global -l option specifies the duration of the test.  It doesn't 
specify the TTL of the IP datagrams being generated by the actions of 
the test.

I resisted setting SO_DONTROUTE for a number of years after the first 
instance of UDP_STREAM being used in link up/down testing took out a 
company's network (including security camera feeds to galactic HQ) but 
at this point I'm likely to keep it in there because there ended up 
being a second such incident.  It is set only for UDP_STREAM.  It isn't 
set for UDP_RR or TCP_*.  And for UDP_STREAM it can be overridden by the 
test-specific -R option.
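
For example, something like the following (from memory, so please 
double-check the option) should clear SO_DONTROUTE on the data socket:

 netperf -t UDP_STREAM -H <host> -- -m 1472 -R 1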

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 21:44                             ` Eric Dumazet
  2016-11-17 23:08                               ` Rick Jones
@ 2016-11-18 17:12                               ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-18 17:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, brouer

On Thu, 17 Nov 2016 13:44:02 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:
> 
> > 
> > Maybe you can share your udp flood "udpsnd" program source?  
> 
> Very ugly. This is based on what I wrote when tracking the UDP v6
> checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
> checksum in skb_checksum_help()), because netperf sends the same message
> over and over...

Thanks a lot, hope you don't mind; I added the code to my github repo:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_snd.c

So I identified the difference and the reason behind the route lookups.
Your program is using send() and I was using sendmsg().  Given
udp_flood is designed to test different calls, I simply added --send as
a new possibility.
 https://github.com/netoptimizer/network-testing/commit/16166c2cd1fa8

If I use --write instead, then I can also avoid the fib_table_lookup
and __ip_route_output_key_hash calls.
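
A stand-alone illustration of the difference (just a sketch, not the
udp_flood code itself; 198.18.50.1 is simply my testbed target):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in dst = { .sin_family = AF_INET,
				   .sin_port = htons(9) };
	char buf[256];
	int fd, i;

	inet_pton(AF_INET, "198.18.50.1", &dst.sin_addr);
	memset(buf, 1, sizeof(buf));
	fd = socket(AF_INET, SOCK_DGRAM, 0);
	connect(fd, (struct sockaddr *)&dst, sizeof(dst));

	for (i = 0; i < 1000000; i++) {
		/* cached dst from connect() => no per-packet route lookup */
		send(fd, buf, sizeof(buf), 0);
		/* passing the address again, as below, makes udp_sendmsg()
		 * redo the route lookup even on a connected socket:
		 * sendto(fd, buf, sizeof(buf), 0,
		 *        (struct sockaddr *)&dst, sizeof(dst));
		 */
	}
	return 0;
}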


> Use -d 2   to remove the ip_idents_reserve() overhead.

#define IP_PMTUDISC_DO	2 /* Always DF	*/

Added a --pmtu option to my udp_flood program
 https://github.com/netoptimizer/network-testing/commit/23a78caf4bb5b

 
> #define _GNU_SOURCE
> 
> #include <arpa/inet.h>	/* inet_pton() */
> #include <errno.h>
> #include <error.h>
> #include <linux/errqueue.h>
> #include <netinet/in.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <unistd.h>
> 
> char buffer[1400];
> 
> int main(int argc, char** argv) {
>   int fd, i;
>   struct sockaddr_in6 addr;
>   char *host = "2002:af6:798::1";
>   int family = AF_INET6;
>   int discover = -1;
> 
>   while ((i = getopt(argc, argv, "4H:d:")) != -1) {
>     switch (i) {
>     case 'H': host = optarg; break;
>     case '4': family = AF_INET; break;
>     case 'd': discover = atoi(optarg); break;
>     }
>   }
>   fd = socket(family, SOCK_DGRAM, 0);
>   if (fd < 0)
>     error(1, errno, "failed to create socket");
>   if (discover != -1)
>     setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
>                &discover, sizeof(discover));
> 
>   memset(&addr, 0, sizeof(addr));
>   if (family == AF_INET6) {
> 	  addr.sin6_family = AF_INET6;
> 	  addr.sin6_port = htons(9);
>       inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
>   } else {
>     struct sockaddr_in *in = (struct sockaddr_in *)&addr;
>     in->sin_family = family;
>     in->sin_port = htons(9);
>       inet_pton(family, host, &in->sin_addr);
>   }
>   connect(fd, (struct sockaddr *)&addr,
>           (family == AF_INET6) ? sizeof(addr) :
>                                  sizeof(struct sockaddr_in));
>   memset(buffer, 1, 1400);
>   for (i = 0; i < 655360000; i++) {
>     memcpy(buffer, &i, sizeof(i));
>     send(fd, buffer, 100 + rand() % 200, 0);

Using send() avoids the fib_table_lookup on a connected UDP socket.

>   }
>   return 0;
> }


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17 18:51                         ` Eric Dumazet
  2016-11-17 21:19                           ` Jesper Dangaard Brouer
@ 2016-11-21 16:03                           ` Jesper Dangaard Brouer
  2016-11-21 18:10                             ` Eric Dumazet
  1 sibling, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-21 16:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, brouer


On Thu, 17 Nov 2016 10:51:23 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> 
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application has something to send. Thus, there is a possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from the
> > socket layer into the qdisc layer should be fairly simple (and the rest
> > of xmit_more is already in place).
> 
> 
> As I said, you are fooled by TX completions.

Obviously TX completions play a role yes, and I bet I can adjust the
TX completion to cause xmit_more to happen, at the expense of
introducing added latency.

The point is the "bloated" spinlock in __dev_queue_xmit is still caused
by the MMIO tailptr/doorbell.  The added cost occurs when enqueueing
packets, and results in the inability to get enough packets into the
qdisc for xmit_more to kick in (on my system).  I argue that a bulk enqueue
API would allow us to get past the hurdle of transitioning into
xmit_more mode more easily.


> Please make sure to increase the sndbuf limits !
> 
> echo 2129920 >/proc/sys/net/core/wmem_default

Testing with this makes no difference.

 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920


> lpaa23:~# sar -n DEV 1 10|grep eth1
                  IFACE   rxpck/s    txpck/s    rxkB/s     txkB/s   rxcmp/s   txcmp/s  rxmcst/s
> 10:49:25         eth1      7.00 9273283.00      0.61 2187214.90      0.00      0.00      0.00
> 10:49:26         eth1      1.00 9230795.00      0.06 2176787.57      0.00      0.00      1.00
> 10:49:27         eth1      2.00 9247906.00      0.17 2180915.45      0.00      0.00      0.00
> 10:49:28         eth1      3.00 9246542.00      0.23 2180790.38      0.00      0.00      1.00
> Average:         eth1      2.50 9018045.70      0.25 2126893.82      0.00      0.00      0.50

Very impressive numbers 9.2Mpps TX.

What is this test?  What kind of traffic? Multiple CPUs?


> lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
>      xmit_more: 2251366909
>      xmit_more: 2256011392
> 
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

The xmit_more definitely works on your system, but I cannot get it to
"kick-in" on my setup.  Once the xmit_more is active, then the
"bloated" spinlock problem should go way.


(Tests with "udp_flood --pmtu 3 --send")

Forcing TX completion to happen on the same CPU, no xmit_more:

 ~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
 Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
 Ethtool(mlx5p2  ) stat:    104592908 (    104,592,908) <= tx0_bytes /sec
 Ethtool(mlx5p2  ) stat:        39059 (         39,059) <= tx0_nop /sec
 Ethtool(mlx5p2  ) stat:      1743215 (      1,743,215) <= tx0_packets /sec 
 Ethtool(mlx5p2  ) stat:    104719986 (    104,719,986) <= tx_bytes /sec
 Ethtool(mlx5p2  ) stat:    111774540 (    111,774,540) <= tx_bytes_phy /sec
 Ethtool(mlx5p2  ) stat:      1745333 (      1,745,333) <= tx_csum_partial /sec
 Ethtool(mlx5p2  ) stat:      1745333 (      1,745,333) <= tx_packets /sec
 Ethtool(mlx5p2  ) stat:      1746477 (      1,746,477) <= tx_packets_phy /sec
 Ethtool(mlx5p2  ) stat:    111483434 (    111,483,434) <= tx_prio1_bytes /sec
 Ethtool(mlx5p2  ) stat:      1741928 (      1,741,928) <= tx_prio1_packets /sec

Forcing TX completion to happen on remote CPU, some xmit_more:

 Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
 Ethtool(mlx5p2  ) stat:    128485892 (    128,485,892) <= tx0_bytes /sec
 Ethtool(mlx5p2  ) stat:        31840 (         31,840) <= tx0_nop /sec
 Ethtool(mlx5p2  ) stat:      2141432 (      2,141,432) <= tx0_packets /sec
 Ethtool(mlx5p2  ) stat:          350 (            350) <= tx0_xmit_more /sec
 Ethtool(mlx5p2  ) stat:    128486459 (    128,486,459) <= tx_bytes /sec
 Ethtool(mlx5p2  ) stat:    137052191 (    137,052,191) <= tx_bytes_phy /sec
 Ethtool(mlx5p2  ) stat:      2141441 (      2,141,441) <= tx_csum_partial /sec
 Ethtool(mlx5p2  ) stat:      2141441 (      2,141,441) <= tx_packets /sec
 Ethtool(mlx5p2  ) stat:      2141441 (      2,141,441) <= tx_packets_phy /sec
 Ethtool(mlx5p2  ) stat:    137051300 (    137,051,300) <= tx_prio1_bytes /sec
 Ethtool(mlx5p2  ) stat:      2141427 (      2,141,427) <= tx_prio1_packets /sec
 Ethtool(mlx5p2  ) stat:          350 (            350) <= tx_xmit_more /sec



>    PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
>---------------------------------------------------------------------------------------------
>     11.64%  [kernel]  [k] skb_set_owner_w               
>      6.21%  [kernel]  [k] queued_spin_lock_slowpath     
>      4.76%  [kernel]  [k] _raw_spin_lock                
>      4.40%  [kernel]  [k] __ip_make_skb                 
>      3.10%  [kernel]  [k] sock_wfree                    
>      2.87%  [kernel]  [k] ipt_do_table                  
>      2.76%  [kernel]  [k] fq_dequeue                    
>      2.71%  [kernel]  [k] mlx4_en_xmit                  
>      2.50%  [kernel]  [k] __dev_queue_xmit              
>      2.29%  [kernel]  [k] __ip_append_data.isra.40      
>      2.28%  [kernel]  [k] udp_sendmsg                   
>      2.01%  [kernel]  [k] __alloc_skb                   
>      1.90%  [kernel]  [k] napi_consume_skb              
>      1.63%  [kernel]  [k] udp_send_skb                  
>      1.62%  [kernel]  [k] skb_release_data              
>      1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath     
>      1.56%  [kernel]  [k] dev_hard_start_xmit           
>      1.55%  udpsnd    [.] __libc_send                   
>      1.48%  [kernel]  [k] netif_skb_features            
>      1.42%  [kernel]  [k] __qdisc_run                   
>      1.35%  [kernel]  [k] sk_dst_check                  
>      1.33%  [kernel]  [k] sock_def_write_space          
>      1.30%  [kernel]  [k] kmem_cache_alloc_node_trace   
>      1.29%  [kernel]  [k] __local_bh_enable_ip          
>      1.21%  [kernel]  [k] copy_user_enhanced_fast_string
>      1.08%  [kernel]  [k] __kmalloc_reserve.isra.40     
>      1.08%  [kernel]  [k] SYSC_sendto                   
>      1.07%  [kernel]  [k] kmem_cache_alloc_node         
>      0.95%  [kernel]  [k] ip_finish_output2             
>      0.95%  [kernel]  [k] ktime_get                     
>      0.91%  [kernel]  [k] validate_xmit_skb             
>      0.88%  [kernel]  [k] sock_alloc_send_pskb          
>      0.82%  [kernel]  [k] sock_sendmsg                  

My perf outputs below...

Forcing TX completion to happen on the same CPU, no xmit_more:

# Overhead  CPU  Command     Shared Object     Symbol                         
# ........  ...  ..........  ................. ...............................
#
    12.17%  000  udp_flood   [kernel.vmlinux]  [k] _raw_spin_lock             
     5.03%  000  udp_flood   [mlx5_core]       [k] mlx5e_sq_xmit              
     3.13%  000  udp_flood   [kernel.vmlinux]  [k] __ip_append_data.isra.47   
     2.85%  000  udp_flood   [kernel.vmlinux]  [k] entry_SYSCALL_64           
     2.75%  000  udp_flood   [mlx5_core]       [k] mlx5e_poll_tx_cq           
     2.61%  000  udp_flood   [kernel.vmlinux]  [k] sock_def_write_space       
     2.48%  000  udp_flood   [kernel.vmlinux]  [k] skb_set_owner_w            
     2.25%  000  udp_flood   [kernel.vmlinux]  [k] __alloc_skb                
     2.21%  000  udp_flood   [kernel.vmlinux]  [k] udp_sendmsg                
     2.19%  000  udp_flood   [kernel.vmlinux]  [k] __slab_free                
     2.08%  000  udp_flood   [kernel.vmlinux]  [k] sock_wfree                 
     2.06%  000  udp_flood   [kernel.vmlinux]  [k] __ip_make_skb              
     1.93%  000  udp_flood   [mlx5_core]       [k] mlx5e_get_cqe              
     1.93%  000  udp_flood   libc-2.17.so      [.] __libc_send                
     1.80%  000  udp_flood   [kernel.vmlinux]  [k] entry_SYSCALL_64_fastpath  
     1.64%  000  udp_flood   [kernel.vmlinux]  [k] kfree                      
     1.61%  000  udp_flood   [kernel.vmlinux]  [k] ip_finish_output2          
     1.59%  000  udp_flood   [kernel.vmlinux]  [k] __local_bh_enable_ip       
     1.57%  000  udp_flood   [kernel.vmlinux]  [k] __dev_queue_xmit           
     1.49%  000  udp_flood   [kernel.vmlinux]  [k] __kmalloc_node_track_caller
     1.38%  000  udp_flood   [kernel.vmlinux]  [k] kmem_cache_alloc_node      
     1.30%  000  udp_flood   [kernel.vmlinux]  [k] dst_release                
     1.26%  000  udp_flood   [kernel.vmlinux]  [k] ksize                      
     1.26%  000  udp_flood   [kernel.vmlinux]  [k] sk_dst_check               
     1.22%  000  udp_flood   [kernel.vmlinux]  [k] SYSC_sendto                
     1.22%  000  udp_flood   [kernel.vmlinux]  [k] ip_send_check              


Forcing TX completion to happen on remote CPU, some xmit_more:

# Overhead  CPU  Command      Shared Object     Symbol                        
# ........  ...  ............ ................  ..............................
#
    11.67%  002  udp_flood   [kernel.vmlinux]  [k] _raw_spin_lock             
     7.61%  002  udp_flood   [kernel.vmlinux]  [k] skb_set_owner_w            
     6.15%  002  udp_flood   [mlx5_core]       [k] mlx5e_sq_xmit              
     3.05%  002  udp_flood   [kernel.vmlinux]  [k] entry_SYSCALL_64           
     2.89%  002  udp_flood   [kernel.vmlinux]  [k] __ip_append_data.isra.47   
     2.78%  000  swapper     [mlx5_core]       [k] mlx5e_poll_tx_cq           
     2.65%  002  udp_flood   [kernel.vmlinux]  [k] sk_dst_check               
     2.36%  002  udp_flood   [kernel.vmlinux]  [k] __alloc_skb                
     2.22%  002  udp_flood   [kernel.vmlinux]  [k] ip_finish_output2          
     2.07%  000  swapper     [kernel.vmlinux]  [k] __slab_free                
     2.06%  002  udp_flood   [kernel.vmlinux]  [k] udp_sendmsg                
     1.97%  002  udp_flood   [kernel.vmlinux]  [k] ksize                      
     1.92%  002  udp_flood   [kernel.vmlinux]  [k] entry_SYSCALL_64_fastpath  
     1.82%  002  udp_flood   [kernel.vmlinux]  [k] __ip_make_skb              
     1.79%  002  udp_flood   libc-2.17.so      [.] __libc_send                
     1.62%  002  udp_flood   [kernel.vmlinux]  [k] __kmalloc_node_track_caller
     1.53%  002  udp_flood   [kernel.vmlinux]  [k] __local_bh_enable_ip       
     1.48%  002  udp_flood   [kernel.vmlinux]  [k] sock_alloc_send_pskb       
     1.43%  002  udp_flood   [kernel.vmlinux]  [k] __dev_queue_xmit           
     1.39%  002  udp_flood   [kernel.vmlinux]  [k] ip_send_check              
     1.39%  002  udp_flood   [kernel.vmlinux]  [k] kmem_cache_alloc_node      
     1.37%  002  udp_flood   [kernel.vmlinux]  [k] dst_release                
     1.21%  002  udp_flood   [kernel.vmlinux]  [k] udp_send_skb               
     1.18%  002  udp_flood   [kernel.vmlinux]  [k] __fget_light               
     1.16%  002  udp_flood   [kernel.vmlinux]  [k] kfree                      
     1.15%  000  swapper     [kernel.vmlinux]  [k] sock_wfree                 
     1.14%  002  udp_flood   [kernel.vmlinux]  [k] SYSC_sendto                

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-21 16:03                           ` Jesper Dangaard Brouer
@ 2016-11-21 18:10                             ` Eric Dumazet
  2016-11-29  6:58                               ` [WIP] net+mlx4: auto doorbell Eric Dumazet
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-11-21 18:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan

On Mon, 2016-11-21 at 17:03 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 17 Nov 2016 10:51:23 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > The point is I can see a socket Send-Q forming, thus we do know the
> > > application has something to send. Thus, there is a possibility for
> > > non-opportunistic bulking. Allowing/implementing bulk enqueue from the
> > > socket layer into the qdisc layer should be fairly simple (and the rest
> > > of xmit_more is already in place).
> > 
> > 
> > As I said, you are fooled by TX completions.
> 
> Obviously TX completions play a role yes, and I bet I can adjust the
> TX completion to cause xmit_more to happen, at the expense of
> introducing added latency.
> 
> The point is the "bloated" spinlock in __dev_queue_xmit is still caused
> by the MMIO tailptr/doorbell.  The added cost occurs when enqueueing
> packets, and results in the inability to get enough packets into the
> qdisc for xmit_more to kick in (on my system).  I argue that a bulk enqueue
> API would allow us to get past the hurdle of transitioning into
> xmit_more mode more easily.
> 


This is very nice, but we already have bulk enqueue; it is called
xmit_more.

The kernel does not know that your application will send another packet
after the one you just sent.

xmit_more is not often used, since applications/stacks send many small
packets one at a time.

qdisc is empty (one enqueued packet is immediately dequeued so
skb->xmit_more is 0), and even bypassed (TCQ_F_CAN_BYPASS)

Not sure if this has been tried before, but the doorbell avoidance could
be done by the driver itself, because it knows a TX completion will come
shortly (well... if softirqs are not delayed too much !)

Doorbell would be forced only if :

(    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
OR
( too many [1] packets were put in TX ring buffer, no point deferring
more)

Start the pump, but once it is started, let the doorbells be done by
TX completion.

ndo_start_xmit and TX completion handler would have to maintain a shared
state describing if packets were ready but doorbell deferred.
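
In pseudo-code, that shared state could look something like this (names
are only placeholders, this is not the mlx4 patch itself):

struct tx_db_state {
	u32 prod;	/* descriptors posted by ndo_start_xmit() */
	u32 prod_bell;	/* producer index last written to the doorbell */
	u32 cons;	/* descriptors reaped by the TX completion handler */
};

static bool need_doorbell(const struct tx_db_state *s, bool xmit_more,
			  u32 tx_frames_budget)
{
	bool engine_idle = (s->prod_bell == s->cons);	/* nothing in flight */
	bool too_many = (s->prod - s->prod_bell) > tx_frames_budget;

	return (!xmit_more && engine_idle) || too_many;
}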


Note that TX completion means "if at least one packet was drained",
otherwise busy polling, constantly calling napi->poll() would force a
doorbell too soon for devices sharing a NAPI for both RX and TX.

But then, maybe busy poll would like to force a doorbell...

I could try these ideas on mlx4 shortly.


[1] limit could be derived from active "ethtool -c" params, eg tx-frames

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-17  8:16           ` Jesper Dangaard Brouer
  2016-11-17 13:20             ` Eric Dumazet
  2016-11-17 17:42             ` Rick Jones
@ 2016-11-28 18:33             ` Rick Jones
  2016-11-28 18:40               ` Rick Jones
  2016-11-30 10:43               ` Jesper Dangaard Brouer
  2 siblings, 2 replies; 60+ messages in thread
From: Rick Jones @ 2016-11-28 18:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet; +Cc: netdev, brouer

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
>> time to try IP_MTU_DISCOVER ;)
>
> To Rick, maybe you can find a good solution or option with Eric's hint,
> to send appropriate sized UDP packets with Don't Fragment (DF).

Jesper -

Top of trunk has a change adding an omni, test-specific -f option which 
will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that 
sufficient to your needs?

happy benchmarking,

rick

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-28 18:33             ` Rick Jones
@ 2016-11-28 18:40               ` Rick Jones
  2016-11-30 10:43               ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 60+ messages in thread
From: Rick Jones @ 2016-11-28 18:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet; +Cc: netdev

On 11/28/2016 10:33 AM, Rick Jones wrote:
> On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
>>> time to try IP_MTU_DISCOVER ;)
>>
>> To Rick, maybe you can find a good solution or option with Eric's hint,
>> to send appropriate sized UDP packets with Don't Fragment (DF).
>
> Jesper -
>
> Top of trunk has a change adding an omni, test-specific -f option which
> will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that
> sufficient to your needs?

Usage examples:

raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H 
raj-folio.americas.hpqcorp.net -- -m 1472 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
raj-folio.americas.hpqcorp.net () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   1.00        77495      0     912.35
212992           1.00        77495            912.35

[1]+  Done                    emacs nettest_omni.c
raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H 
raj-folio.americas.hpqcorp.net -- -m 14720 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
raj-folio.americas.hpqcorp.net () port 0 AF_INET
send_data: data send error: Message too long (errno 90)
netperf: send_omni: send_data failed: Message too long

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [WIP] net+mlx4: auto doorbell
  2016-11-21 18:10                             ` Eric Dumazet
@ 2016-11-29  6:58                               ` Eric Dumazet
  2016-11-30 11:38                                 ` Jesper Dangaard Brouer
                                                   ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-11-29  6:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan

On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:


> Not sure if this has been tried before, but the doorbell avoidance could
> be done by the driver itself, because it knows a TX completion will come
> shortly (well... if softirqs are not delayed too much !)
> 
> Doorbell would be forced only if :
> 
> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
> OR
> ( too many [1] packets were put in TX ring buffer, no point deferring
> more)
> 
> Start the pump, but once it is started, let the doorbells be done by
> TX completion.
> 
> ndo_start_xmit and TX completion handler would have to maintain a shared
> state describing if packets were ready but doorbell deferred.
> 
> 
> Note that TX completion means "if at least one packet was drained",
> otherwise busy polling, constantly calling napi->poll() would force a
> doorbell too soon for devices sharing a NAPI for both RX and TX.
> 
> But then, maybe busy poll would like to force a doorbell...
> 
> I could try these ideas on mlx4 shortly.
> 
> 
> [1] limit could be derived from active "ethtool -c" params, eg tx-frames

I have a WIP that increases pktgen rate by 75% on mlx4 when bulking is
not used.

lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt 
lpaa23:~# sar -n DEV 1 10|grep eth0
22:43:26         eth0      0.00 5822800.00      0.00 597064.41      0.00      0.00      1.00
22:43:27         eth0     24.00 5788237.00      2.09 593520.26      0.00      0.00      0.00
22:43:28         eth0     12.00 5817777.00      1.43 596551.47      0.00      0.00      1.00
22:43:29         eth0     22.00 5841516.00      1.61 598982.87      0.00      0.00      0.00
22:43:30         eth0      4.00 4389137.00      0.71 450058.08      0.00      0.00      1.00
22:43:31         eth0      4.00 5871008.00      0.72 602007.79      0.00      0.00      0.00
22:43:32         eth0     12.00 5891809.00      1.43 604142.60      0.00      0.00      1.00
22:43:33         eth0     10.00 5901904.00      1.12 605175.70      0.00      0.00      0.00
22:43:34         eth0      5.00 5907982.00      0.69 605798.99      0.00      0.00      1.00
22:43:35         eth0      2.00 5847086.00      0.12 599554.71      0.00      0.00      0.00
Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt 
lpaa23:~# sar -n DEV 1 10|grep eth0
22:43:47         eth0      9.00 10226424.00      1.02 1048608.05      0.00      0.00      1.00
22:43:48         eth0      1.00 10316955.00      0.06 1057890.89      0.00      0.00      0.00
22:43:49         eth0      1.00 10310104.00      0.10 1057188.32      0.00      0.00      1.00
22:43:50         eth0      0.00 10249423.00      0.00 1050966.23      0.00      0.00      0.00
22:43:51         eth0      0.00 10210441.00      0.00 1046969.05      0.00      0.00      1.00
22:43:52         eth0      2.00 10198389.00      0.16 1045733.17      0.00      0.00      1.00
22:43:53         eth0      8.00 10079257.00      1.43 1033517.83      0.00      0.00      0.00
22:43:54         eth0      0.00 7693509.00      0.00 788885.16      0.00      0.00      0.00
22:43:55         eth0      2.00 10343076.00      0.20 1060569.32      0.00      0.00      1.00
22:43:56         eth0      1.00 10224571.00      0.14 1048417.93      0.00      0.00      0.00
Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50

And about 11 % improvement on an mono-flow UDP_STREAM test.

skb_set_owner_w() is now the most consuming function.


lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
[1] 13696
lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
lpaa23:~# sar -n DEV 1 10|grep eth0
22:50:47         eth0      3.00 1355422.00      0.45 319706.04      0.00      0.00      0.00
22:50:48         eth0      2.00 1344270.00      0.42 317035.21      0.00      0.00      1.00
22:50:49         eth0      3.00 1350503.00      0.51 318478.34      0.00      0.00      0.00
22:50:50         eth0     29.00 1348593.00      2.86 318113.02      0.00      0.00      1.00
22:50:51         eth0     14.00 1354855.00      1.83 319508.56      0.00      0.00      0.00
22:50:52         eth0      7.00 1357794.00      0.73 320226.89      0.00      0.00      1.00
22:50:53         eth0      5.00 1326130.00      0.63 312784.72      0.00      0.00      0.00
22:50:54         eth0      2.00 994584.00      0.12 234598.40      0.00      0.00      1.00
22:50:55         eth0      5.00 1318209.00      0.75 310932.46      0.00      0.00      0.00
22:50:56         eth0     20.00 1323445.00      1.73 312178.11      0.00      0.00      1.00
Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
lpaa23:~# sar -n DEV 1 10|grep eth0
22:51:03         eth0      4.00 1512055.00      0.54 356599.40      0.00      0.00      0.00
22:51:04         eth0      4.00 1507631.00      0.55 355609.46      0.00      0.00      1.00
22:51:05         eth0      4.00 1487789.00      0.42 350917.47      0.00      0.00      0.00
22:51:06         eth0      7.00 1474460.00      1.22 347811.16      0.00      0.00      1.00
22:51:07         eth0      2.00 1496529.00      0.24 352995.18      0.00      0.00      0.00
22:51:08         eth0      3.00 1485856.00      0.49 350425.65      0.00      0.00      1.00
22:51:09         eth0      1.00 1114808.00      0.06 262905.38      0.00      0.00      0.00
22:51:10         eth0      2.00 1510924.00      0.30 356397.53      0.00      0.00      1.00
22:51:11         eth0      2.00 1506408.00      0.30 355345.76      0.00      0.00      0.00
22:51:12         eth0      2.00 1499122.00      0.32 353668.75      0.00      0.00      1.00
Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50

 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2 
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4 
 include/linux/netdevice.h                    |    1 
 net/core/net-sysfs.c                         |   18 +++
 5 files changed, 83 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 6562f78b07f4..fbea83218fc0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 
 	if (polled) {
 		if (doorbell_pending)
-			mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
+			mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
 
 		mlx4_cq_set_ci(&cq->mcq);
 		wmb(); /* ensure HW sees CQ consumer before we post new buffers */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 4b597dca5c52..affebb435679 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	ring->size = size;
 	ring->size_mask = size - 1;
 	ring->sp_stride = stride;
-	ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
+	ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
 	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
@@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	ring->sp_cqn = cq;
 	ring->prod = 0;
 	ring->cons = 0xffffffff;
+	ring->ncons = 0;
 	ring->last_nr_txbb = 1;
 	memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
 	memset(ring->buf, 0, ring->buf_size);
@@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
 		       MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
 }
 
-static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
+static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
 {
-	return ring->prod - ring->cons > ring->full_size;
+	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
 }
 
 static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
@@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 
 	/* Skip last polled descriptor */
 	ring->cons += ring->last_nr_txbb;
+	ring->ncons += ring->last_nr_txbb;
 	en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
 		 ring->cons, ring->prod);
 
@@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 						!!(ring->cons & ring->size), 0,
 						0 /* Non-NAPI caller */);
 		ring->cons += ring->last_nr_txbb;
+		ring->ncons += ring->last_nr_txbb;
 		cnt++;
 	}
 
@@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 	return cnt;
 }
 
+void mlx4_en_xmit_doorbell(const struct net_device *dev,
+			   struct mlx4_en_tx_ring *ring)
+{
+
+	if (dev->doorbell_opt & 1) {
+		u32 oval = READ_ONCE(ring->prod_bell);
+		u32 nval = READ_ONCE(ring->prod);
+
+		if (oval == nval)
+			return;
+
+		/* I can not tell yet if a cmpxchg() is needed or not */
+		if (dev->doorbell_opt & 2)
+			WRITE_ONCE(ring->prod_bell, nval);
+		else
+			if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
+				return;
+	}
+	/* Since there is no iowrite*_native() that writes the
+	 * value as is, without byteswapping - using the one
+	 * the doesn't do byteswapping in the relevant arch
+	 * endianness.
+	 */
+#if defined(__LITTLE_ENDIAN)
+	iowrite32(
+#else
+	iowrite32be(
+#endif
+		  ring->doorbell_qpn,
+		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
+}
+
 static bool mlx4_en_process_tx_cq(struct net_device *dev,
 				  struct mlx4_en_cq *cq, int napi_budget)
 {
@@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	wmb();
 
 	/* we want to dirty this cache line once */
-	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
-	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
+	WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
+	ring_cons += txbbs_skipped;
+	WRITE_ONCE(ring->cons, ring_cons);
+	WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
+
+	if (dev->doorbell_opt)
+		mlx4_en_xmit_doorbell(dev, ring);
 
 	if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
 		return done < budget;
@@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
 	__iowrite64_copy(dst, src, bytecnt / 8);
 }
 
-void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
-{
-	wmb();
-	/* Since there is no iowrite*_native() that writes the
-	 * value as is, without byteswapping - using the one
-	 * the doesn't do byteswapping in the relevant arch
-	 * endianness.
-	 */
-#if defined(__LITTLE_ENDIAN)
-	iowrite32(
-#else
-	iowrite32be(
-#endif
-		  ring->doorbell_qpn,
-		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
-}
 
 static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
 				  struct mlx4_en_tx_desc *tx_desc,
 				  union mlx4_wqe_qpn_vlan qpn_vlan,
 				  int desc_size, int bf_index,
 				  __be32 op_own, bool bf_ok,
-				  bool send_doorbell)
+				  bool send_doorbell,
+				  const struct net_device *dev, int nr_txbb)
 {
 	tx_desc->ctrl.qpn_vlan = qpn_vlan;
 
@@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
 
 		wmb();
 
+		ring->prod += nr_txbb;
 		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
 			     desc_size);
 
@@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
 		 */
 		dma_wmb();
 		tx_desc->ctrl.owner_opcode = op_own;
+		ring->prod += nr_txbb;
 		if (send_doorbell)
-			mlx4_en_xmit_doorbell(ring);
+			mlx4_en_xmit_doorbell(dev, ring);
 		else
 			ring->xmit_more++;
 	}
@@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
 	}
 
-	ring->prod += nr_txbb;
-
 	/* If we used a bounce buffer then copy descriptor back into place */
 	if (unlikely(bounce))
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
@@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
 
+	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
+	 * will happen shortly.
+	 */
+	if (send_doorbell &&
+	    dev->doorbell_opt &&
+	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
+		send_doorbell = false;
+
 	real_size = (real_size / 16) & 0x3f;
 
 	bf_ok &= desc_size <= MAX_BF && send_doorbell;
@@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		qpn_vlan.fence_size = real_size;
 
 	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
-			      op_own, bf_ok, send_doorbell);
+			      op_own, bf_ok, send_doorbell, dev, nr_txbb);
 
 	if (unlikely(stop_queue)) {
 		/* If queue was emptied after the if (stop_queue) , and before
@@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		 */
 		smp_rmb();
 
-		ring_cons = ACCESS_ONCE(ring->cons);
 		if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
 			netif_tx_wake_queue(ring->tx_queue);
 			ring->wake_queue++;
@@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 	rx_ring->xdp_tx++;
 	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
 
-	ring->prod += nr_txbb;
-
 	stop_queue = mlx4_en_is_tx_ring_full(ring);
 	send_doorbell = stop_queue ||
 				*doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
@@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 		qpn_vlan.fence_size = real_size;
 
 	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
-			      op_own, bf_ok, send_doorbell);
+			      op_own, bf_ok, send_doorbell, dev, nr_txbb);
 	*doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
 
 	return NETDEV_TX_OK;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 574bcbb1b38f..c3fd0deda198 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
 	 */
 	u32			last_nr_txbb;
 	u32			cons;
+	u32			ncons;
 	unsigned long		wake_queue;
 	struct netdev_queue	*tx_queue;
 	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
@@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
 
 	/* cache line used and dirtied in mlx4_en_xmit() */
 	u32			prod ____cacheline_aligned_in_smp;
+	u32			prod_bell;
 	unsigned int		tx_dropped;
 	unsigned long		bytes;
 	unsigned long		packets;
@@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 			       struct mlx4_en_rx_alloc *frame,
 			       struct net_device *dev, unsigned int length,
 			       int tx_ind, int *doorbell_pending);
-void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
+void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
 bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
 			struct mlx4_en_rx_alloc *frame);
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4ffcd874cc20..39565b5425a6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1816,6 +1816,7 @@ struct net_device {
 	DECLARE_HASHTABLE	(qdisc_hash, 4);
 #endif
 	unsigned long		tx_queue_len;
+	unsigned long		doorbell_opt;
 	spinlock_t		tx_global_lock;
 	int			watchdog_timeo;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b0c04cf4851d..df05f81f5150 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
 }
 NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
 
+static int change_doorbell_opt(struct net_device *dev, unsigned long val)
+{
+	dev->doorbell_opt = val;
+	return 0;
+}
+
+static ssize_t doorbell_opt_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	return netdev_store(dev, attr, buf, len, change_doorbell_opt);
+}
+NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
+
 static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
@@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
 	&dev_attr_phys_port_name.attr,
 	&dev_attr_phys_switch_id.attr,
 	&dev_attr_proto_down.attr,
+	&dev_attr_doorbell_opt.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(net_class);

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-28 18:33             ` Rick Jones
  2016-11-28 18:40               ` Rick Jones
@ 2016-11-30 10:43               ` Jesper Dangaard Brouer
  2016-11-30 17:42                 ` Rick Jones
  1 sibling, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 10:43 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, netdev, brouer


On Mon, 28 Nov 2016 10:33:49 -0800 Rick Jones <rick.jones2@hpe.com> wrote:

> On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
> >> time to try IP_MTU_DISCOVER ;)  
> >
> > To Rick, maybe you can find a good solution or option with Eric's hint,
> > to send appropriate sized UDP packets with Don't Fragment (DF).  
> 
> Jesper -
> 
> Top of trunk has a change adding an omni, test-specific -f option which 
> will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that 
> sufficient to your needs?

The "-- -f" option makes the __ip_select_ident lookup go away.  So,
confirming your new option works.

Notice the "fib_lookup" cost is still present, even when I use
option "-- -n -N" to create a connected socket.  As Eric taught us,
this is because we should use syscalls "send" or "write" on a connected
socket.

My udp_flood tool[1] cycle through the different syscalls:

taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2
             	ns/pkt	pps		cycles/pkt
send      	473.08	2113816.28	1891
sendto    	558.58	1790265.84	2233
sendmsg   	587.24	1702873.80	2348
sendMmsg/32  	547.57	1826265.90	2189
write     	518.36	1929175.52	2072

Using "send" seems to be the fastest option.

Some notes on the test: I've forced TX completions to happen on another
CPU (CPU0) and pinned the udp_flood program to CPU2, as I want to avoid
the CPU scheduler moving udp_flood around, since that causes fluctuations
in the results (as it stresses the memory allocations more).

My udp_flood --pmtu option is documented in the --help usage text (see below signature)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

$ uname -a
Linux canyon 4.9.0-rc6-page_pool07-baseline+ #185 SMP PREEMPT Wed Nov 30 10:07:51 CET 2016 x86_64

[1] udp_flood tool:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c

Quick command used for verifying  __ip_select_ident is removed:

 # First run benchmark
 sudo perf record -g -a ~/tools/netperf2-svn/src/netperf -H 198.18.50.1 \
  -t UDP_STREAM -l 3 -- -m 1472 -f

 # Second grep in perf output for functions
 sudo perf report --no-children --call-graph none --stdio |\
  egrep -e '__ip_select_ident|fib_table_lookup'


$ ./udp_flood --help

DOCUMENTATION:
 This tool is a UDP flood that measures the outgoing packet rate.
 Default cycles through tests with different send system calls.
 What function-call to invoke can also be specified as a command
 line option (see below).

 Default transmit 1000000 packets per test, adjust via --count

 Usage: ./udp_flood (options-see-below) IPADDR
 Listing options:
 --help         short-option: -h
 --ipv4         short-option: -4
 --ipv6         short-option: -6
 --sendmsg      short-option: -u
 --sendmmsg     short-option: -U
 --sendto       short-option: -t
 --write        short-option: -T
 --send         short-option: -S
 --batch        short-option: -b
 --count        short-option: -c
 --port         short-option: -p
 --payload      short-option: -m
 --pmtu         short-option: -d
 --verbose      short-option: -v

 Multiple tests can be selected:
     default: all tests
     -u -U -t -T -S: run any combination of sendmsg/sendmmsg/sendto/write/send

Option --pmtu <N>  for Path MTU discover socket option IP_MTU_DISCOVER
 This affects the DF(Don't-Fragment) bit setting.
 Following values are selectable:
  0 = IP_PMTUDISC_DONT
  1 = IP_PMTUDISC_WANT
  2 = IP_PMTUDISC_DO
  3 = IP_PMTUDISC_PROBE
  4 = IP_PMTUDISC_INTERFACE
  5 = IP_PMTUDISC_OMIT
 Documentation see under IP_MTU_DISCOVER in 'man 7 ip'

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-29  6:58                               ` [WIP] net+mlx4: auto doorbell Eric Dumazet
@ 2016-11-30 11:38                                 ` Jesper Dangaard Brouer
  2016-11-30 15:56                                   ` Eric Dumazet
  2016-11-30 13:50                                 ` Saeed Mahameed
  2016-12-01 21:32                                 ` Alexander Duyck
  2 siblings, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 11:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, brouer, Achiad Shochat

[-- Attachment #1: Type: text/plain, Size: 3942 bytes --]


I've played with a somewhat similar patch (from Achiad Shochat) for
mlx5 (attached).  While it gives huge improvements, the problem I ran
into was that TX performance became a function of the TX completion
time/interrupt and could easily be throttled if configured too
high/slow.

Can your patch be affected by this too?

Adjustable via:
 ethtool -C mlx5p2 tx-usecs 16 tx-frames 32
 

On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I have a WIP that increases pktgen rate by 75% on mlx4 when bulking is
> not used.
> 
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt 
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt 
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50

This +75% number is pktgen without "burst", and definitely shows that
your patch activates xmit_more.
What is the pps performance number when using the pktgen "burst" option?


> And about 11 % improvement on an mono-flow UDP_STREAM test.
> 
> skb_set_owner_w() is now the most consuming function.
> 
> 
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50

The +11% number seems consistent with my perf observations that approx
12% was "fakely" spend on the xmit spin_lock.


[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
[...]
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -	return ring->prod - ring->cons > ring->full_size;
> +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
[...]

> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>  
> +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +	 * will happen shortly.
> +	 */
> +	if (send_doorbell &&
> +	    dev->doorbell_opt &&
> +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)

It would be nicer to have a helper function with an appropriate name, instead
of an open-coded queue size check.  I'm also confused by the "ncons" name.

> +		send_doorbell = false;
> +
[...]

> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>  	 */
>  	u32			last_nr_txbb;
>  	u32			cons;
> +	u32			ncons;

Maybe we can find a better name than "ncons" ?

>  	unsigned long		wake_queue;
>  	struct netdev_queue	*tx_queue;
>  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>  
>  	/* cache line used and dirtied in mlx4_en_xmit() */
>  	u32			prod ____cacheline_aligned_in_smp;
> +	u32			prod_bell;

Good descriptive variable name.

>  	unsigned int		tx_dropped;
>  	unsigned long		bytes;
>  	unsigned long		packets;


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[-- Attachment #2: net_mlx5e__force_tx_skb_bulking.patch --]
[-- Type: text/x-patch, Size: 5079 bytes --]

From: Tariq Toukan <tariqt@mellanox.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>,
        Achiad Shochat <achiad@mellanox.com>,
        Rana Shahout <ranas@mellanox.com>,
        Saeed Mahameed <saeedm@mellanox.com>
Subject: [PATCH] net/mlx5e: force tx skb bulking
Date: Wed, 17 Aug 2016 12:14:51 +0300
Message-Id: <1471425291-1782-1-git-send-email-tariqt@mellanox.com>

From: Achiad Shochat <achiad@mellanox.com>

Improve the SW message rate in case the HW is faster.
Heuristically detect cases where the message rate is high but there
is still no skb bulking and, if so, stop the txq for a while to try
to force bulking.

Change-Id: Icb925135e69b030943cb4666117c47d1cc04da97
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    | 5 +++++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 9 ++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 74edd01..78a0661 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -394,6 +394,10 @@ enum {
 	MLX5E_SQ_STATE_TX_TIMEOUT,
 };
 
+enum {
+	MLX5E_SQ_STOP_ONCE,
+};
+
 struct mlx5e_ico_wqe_info {
 	u8  opcode;
 	u8  num_wqebbs;
@@ -403,6 +407,7 @@ struct mlx5e_sq {
 	/* data path */
 
 	/* dirtied @completion */
+	unsigned long              flags;
 	u16                        cc;
 	u32                        dma_fifo_cc;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index e073bf59..034eef0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -351,8 +351,10 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
 		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 
-	if (unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
+	if (test_bit(MLX5E_SQ_STOP_ONCE, &sq->flags) ||
+	    unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
 		netif_tx_stop_queue(sq->txq);
+		clear_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
 		sq->stats.stopped++;
 	}
 
@@ -429,6 +431,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 	u32 dma_fifo_cc;
 	u32 nbytes;
 	u16 npkts;
+	u16 ncqes;
 	u16 sqcc;
 	int i;
 
@@ -439,6 +442,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
 	npkts = 0;
 	nbytes = 0;
+	ncqes = 0;
 
 	/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
 	 * otherwise a cq overrun may occur
@@ -458,6 +462,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			break;
 
 		mlx5_cqwq_pop(&cq->wq);
+		ncqes++;
 
 		wqe_counter = be16_to_cpu(cqe->wqe_counter);
 
@@ -508,6 +513,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
 	sq->dma_fifo_cc = dma_fifo_cc;
 	sq->cc = sqcc;
+	if ((npkts > 7) && ((npkts >> (ilog2(ncqes))) < 8))
+		set_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
 
 	netdev_tx_completed_queue(sq->txq, npkts, nbytes);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-29  6:58                               ` [WIP] net+mlx4: auto doorbell Eric Dumazet
  2016-11-30 11:38                                 ` Jesper Dangaard Brouer
@ 2016-11-30 13:50                                 ` Saeed Mahameed
  2016-11-30 15:44                                   ` Eric Dumazet
  2016-12-01 21:32                                 ` Alexander Duyck
  2 siblings, 1 reply; 60+ messages in thread
From: Saeed Mahameed @ 2016-11-30 13:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
	Saeed Mahameed, Tariq Toukan

On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>
>> Not sure it this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much !)
>>
>> Doorbell would be forced only if :
>>
>> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> OR
>> ( too many [1] packets were put in TX ring buffer, no point deferring
>> more)
>>
>> Start the pump, but once it is started, let the doorbells being done by
>> TX completion.
>>
>> ndo_start_xmit and TX completion handler would have to maintain a shared
>> state describing if packets were ready but doorbell deferred.
>>
>>
>> Note that TX completion means "if at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll() would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>>
>> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.

Hi Eric, nice idea indeed, and we need something like this;
today we almost don't exploit TX bulking at all.

But please see below: I am not sure different contexts should share
the doorbell ringing, it is really risky.

>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>  include/linux/netdevice.h                    |    1
>  net/core/net-sysfs.c                         |   18 +++
>  5 files changed, 83 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6562f78b07f4..fbea83218fc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>
>         if (polled) {
>                 if (doorbell_pending)
> -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>
>                 mlx4_cq_set_ci(&cq->mcq);
>                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>         ring->size = size;
>         ring->size_mask = size - 1;
>         ring->sp_stride = stride;
> -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>
>         tmp = size * sizeof(struct mlx4_en_tx_info);
>         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>         ring->sp_cqn = cq;
>         ring->prod = 0;
>         ring->cons = 0xffffffff;
> +       ring->ncons = 0;
>         ring->last_nr_txbb = 1;
>         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>         memset(ring->buf, 0, ring->buf_size);
> @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>  }
>
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -       return ring->prod - ring->cons > ring->full_size;
> +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
>
>  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>
>         /* Skip last polled descriptor */
>         ring->cons += ring->last_nr_txbb;
> +       ring->ncons += ring->last_nr_txbb;
>         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>                  ring->cons, ring->prod);
>
> @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>                                                 !!(ring->cons & ring->size), 0,
>                                                 0 /* Non-NAPI caller */);
>                 ring->cons += ring->last_nr_txbb;
> +               ring->ncons += ring->last_nr_txbb;
>                 cnt++;
>         }
>
> @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>         return cnt;
>  }
>
> +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> +                          struct mlx4_en_tx_ring *ring)
> +{
> +
> +       if (dev->doorbell_opt & 1) {
> +               u32 oval = READ_ONCE(ring->prod_bell);
> +               u32 nval = READ_ONCE(ring->prod);
> +
> +               if (oval == nval)
> +                       return;
> +
> +               /* I can not tell yet if a cmpxchg() is needed or not */
> +               if (dev->doorbell_opt & 2)
> +                       WRITE_ONCE(ring->prod_bell, nval);
> +               else
> +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> +                               return;
> +       }
> +       /* Since there is no iowrite*_native() that writes the
> +        * value as is, without byteswapping - using the one
> +        * the doesn't do byteswapping in the relevant arch
> +        * endianness.
> +        */
> +#if defined(__LITTLE_ENDIAN)
> +       iowrite32(
> +#else
> +       iowrite32be(
> +#endif
> +                 ring->doorbell_qpn,
> +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> +}
> +
>  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>                                   struct mlx4_en_cq *cq, int napi_budget)
>  {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>         wmb();
>
>         /* we want to dirty this cache line once */
> -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> +       ring_cons += txbbs_skipped;
> +       WRITE_ONCE(ring->cons, ring_cons);
> +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> +       if (dev->doorbell_opt)
> +               mlx4_en_xmit_doorbell(dev, ring);
>
>         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>                 return done < budget;
> @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>         __iowrite64_copy(dst, src, bytecnt / 8);
>  }
>
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> -{
> -       wmb();

You missed/removed this "wmb()" in the new function. Why? Where did
you compensate for it?

> -       /* Since there is no iowrite*_native() that writes the
> -        * value as is, without byteswapping - using the one
> -        * the doesn't do byteswapping in the relevant arch
> -        * endianness.
> -        */
> -#if defined(__LITTLE_ENDIAN)
> -       iowrite32(
> -#else
> -       iowrite32be(
> -#endif
> -                 ring->doorbell_qpn,
> -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> -}
>
>  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                                   struct mlx4_en_tx_desc *tx_desc,
>                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>                                   int desc_size, int bf_index,
>                                   __be32 op_own, bool bf_ok,
> -                                 bool send_doorbell)
> +                                 bool send_doorbell,
> +                                 const struct net_device *dev, int nr_txbb)
>  {
>         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>
> @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>
>                 wmb();
>
> +               ring->prod += nr_txbb;
>                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>                              desc_size);
>
> @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                  */
>                 dma_wmb();
>                 tx_desc->ctrl.owner_opcode = op_own;
> +               ring->prod += nr_txbb;
>                 if (send_doorbell)
> -                       mlx4_en_xmit_doorbell(ring);
> +                       mlx4_en_xmit_doorbell(dev, ring);
>                 else
>                         ring->xmit_more++;
>         }
> @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>         }
>
> -       ring->prod += nr_txbb;
> -
>         /* If we used a bounce buffer then copy descriptor back into place */
>         if (unlikely(bounce))
>                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +        * will happen shortly.
> +        */
> +       if (send_doorbell &&
> +           dev->doorbell_opt &&
> +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)

Alexei already expressed his worries about synchronization, and I
think the problem sits here (in this exact line):
what if, exactly at this point, the TX completion handler has just
finished and rung the last doorbell?
You haven't written the new TX descriptor yet (mlx4_en_tx_write_desc), so
the last doorbell from the CQ handler missed it.
Even if you wrote the TX descriptor before the doorbell decision here,
you would need a memory barrier to make sure the HW sees
the new packet, which was typically done before ringing the doorbell.
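
Roughly the interleaving I am worried about (just a sketch of the
timeline, not actual code):

/*
 *   CPU A: mlx4_en_xmit()                 CPU B: TX completion (NAPI)
 *   -------------------------             ---------------------------
 *   sees prod_bell > ncons                completes all outstanding WQEs,
 *    -> decides send_doorbell = false     updates cons/ncons,
 *                                         rings the last doorbell
 *   writes the new WQE, sets
 *   owner_opcode, no doorbell             (no further completion pending)
 *
 * The new WQE sits in the ring with no doorbell until some later xmit.
 */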

All in all, this is risky business :). The right way to go is to
force the upper layer to use xmit_more and delay doorbells / use bulking,
but from the same context (the xmit routine).  For example, see Achiad's
suggestion (attached in Jesper's response): he used queue stop to force
the stack to queue up packets (TX bulking), which sets xmit_more and uses
the next completion to release the "stopped" ring TXQ rather than hitting
the doorbell on behalf of it.

> +               send_doorbell = false;
> +
>         real_size = (real_size / 16) & 0x3f;
>
>         bf_ok &= desc_size <= MAX_BF && send_doorbell;
> @@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>
>         if (unlikely(stop_queue)) {
>                 /* If queue was emptied after the if (stop_queue) , and before
> @@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                  */
>                 smp_rmb();
>
> -               ring_cons = ACCESS_ONCE(ring->cons);
>                 if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
>                         netif_tx_wake_queue(ring->tx_queue);
>                         ring->wake_queue++;
> @@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>         rx_ring->xdp_tx++;
>         AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
>
> -       ring->prod += nr_txbb;
> -
>         stop_queue = mlx4_en_is_tx_ring_full(ring);
>         send_doorbell = stop_queue ||
>                                 *doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
> @@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>         *doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
>
>         return NETDEV_TX_OK;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>          */
>         u32                     last_nr_txbb;
>         u32                     cons;
> +       u32                     ncons;
>         unsigned long           wake_queue;
>         struct netdev_queue     *tx_queue;
>         u32                     (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
>         /* cache line used and dirtied in mlx4_en_xmit() */
>         u32                     prod ____cacheline_aligned_in_smp;
> +       u32                     prod_bell;
>         unsigned int            tx_dropped;
>         unsigned long           bytes;
>         unsigned long           packets;
> @@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                                struct mlx4_en_rx_alloc *frame,
>                                struct net_device *dev, unsigned int length,
>                                int tx_ind, int *doorbell_pending);
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
> +void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
>  bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>                         struct mlx4_en_rx_alloc *frame);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4ffcd874cc20..39565b5425a6 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1816,6 +1816,7 @@ struct net_device {
>         DECLARE_HASHTABLE       (qdisc_hash, 4);
>  #endif
>         unsigned long           tx_queue_len;
> +       unsigned long           doorbell_opt;
>         spinlock_t              tx_global_lock;
>         int                     watchdog_timeo;
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b0c04cf4851d..df05f81f5150 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
>  }
>  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
> +static int change_doorbell_opt(struct net_device *dev, unsigned long val)
> +{
> +       dev->doorbell_opt = val;
> +       return 0;
> +}
> +
> +static ssize_t doorbell_opt_store(struct device *dev,
> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t len)
> +{
> +       if (!capable(CAP_NET_ADMIN))
> +               return -EPERM;
> +
> +       return netdev_store(dev, attr, buf, len, change_doorbell_opt);
> +}
> +NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
> +
>  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
>                              const char *buf, size_t len)
>  {
> @@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
>         &dev_attr_phys_port_name.attr,
>         &dev_attr_phys_switch_id.attr,
>         &dev_attr_proto_down.attr,
> +       &dev_attr_doorbell_opt.attr,
>         NULL,
>  };
>  ATTRIBUTE_GROUPS(net_class);
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 13:50                                 ` Saeed Mahameed
@ 2016-11-30 15:44                                   ` Eric Dumazet
  2016-11-30 16:27                                     ` Saeed Mahameed
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-11-30 15:44 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
	Saeed Mahameed, Tariq Toukan

On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
> >
> >
> >> Not sure it this has been tried before, but the doorbell avoidance could
> >> be done by the driver itself, because it knows a TX completion will come
> >> shortly (well... if softirqs are not delayed too much !)
> >>
> >> Doorbell would be forced only if :
> >>
> >> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
> >> OR
> >> ( too many [1] packets were put in TX ring buffer, no point deferring
> >> more)
> >>
> >> Start the pump, but once it is started, let the doorbells being done by
> >> TX completion.
> >>
> >> ndo_start_xmit and TX completion handler would have to maintain a shared
> >> state describing if packets were ready but doorbell deferred.
> >>
> >>
> >> Note that TX completion means "if at least one packet was drained",
> >> otherwise busy polling, constantly calling napi->poll() would force a
> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
> >>
> >> But then, maybe busy poll would like to force a doorbell...
> >>
> >> I could try these ideas on mlx4 shortly.
> >>
> >>
> >> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
> >
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
> 
> Hi Eric, Nice Idea indeed and we need something like this,
> today we almost don't exploit the TX bulking at all.
> 
> But please see below, i am not sure different contexts should share
> the doorbell ringing, it is really risky.
> 
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
> >  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
> >  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
> >  include/linux/netdevice.h                    |    1
> >  net/core/net-sysfs.c                         |   18 +++
> >  5 files changed, 83 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 6562f78b07f4..fbea83218fc0 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >
> >         if (polled) {
> >                 if (doorbell_pending)
> > -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> > +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
> >
> >                 mlx4_cq_set_ci(&cq->mcq);
> >                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
> >         ring->size = size;
> >         ring->size_mask = size - 1;
> >         ring->sp_stride = stride;
> > -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> > +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
> >
> >         tmp = size * sizeof(struct mlx4_en_tx_info);
> >         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
> >         ring->sp_cqn = cq;
> >         ring->prod = 0;
> >         ring->cons = 0xffffffff;
> > +       ring->ncons = 0;
> >         ring->last_nr_txbb = 1;
> >         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
> >         memset(ring->buf, 0, ring->buf_size);
> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
> >                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
> >  }
> >
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> >  {
> > -       return ring->prod - ring->cons > ring->full_size;
> > +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> >  }
> >
> >  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >
> >         /* Skip last polled descriptor */
> >         ring->cons += ring->last_nr_txbb;
> > +       ring->ncons += ring->last_nr_txbb;
> >         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
> >                  ring->cons, ring->prod);
> >
> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >                                                 !!(ring->cons & ring->size), 0,
> >                                                 0 /* Non-NAPI caller */);
> >                 ring->cons += ring->last_nr_txbb;
> > +               ring->ncons += ring->last_nr_txbb;
> >                 cnt++;
> >         }
> >
> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >         return cnt;
> >  }
> >
> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> > +                          struct mlx4_en_tx_ring *ring)
> > +{
> > +
> > +       if (dev->doorbell_opt & 1) {
> > +               u32 oval = READ_ONCE(ring->prod_bell);
> > +               u32 nval = READ_ONCE(ring->prod);
> > +
> > +               if (oval == nval)
> > +                       return;
> > +
> > +               /* I can not tell yet if a cmpxchg() is needed or not */
> > +               if (dev->doorbell_opt & 2)
> > +                       WRITE_ONCE(ring->prod_bell, nval);
> > +               else
> > +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> > +                               return;
> > +       }
> > +       /* Since there is no iowrite*_native() that writes the
> > +        * value as is, without byteswapping - using the one
> > +        * the doesn't do byteswapping in the relevant arch
> > +        * endianness.
> > +        */
> > +#if defined(__LITTLE_ENDIAN)
> > +       iowrite32(
> > +#else
> > +       iowrite32be(
> > +#endif
> > +                 ring->doorbell_qpn,
> > +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > +}
> > +
> >  static bool mlx4_en_process_tx_cq(struct net_device *dev,
> >                                   struct mlx4_en_cq *cq, int napi_budget)
> >  {
> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> >         wmb();
> >
> >         /* we want to dirty this cache line once */
> > -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> > -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> > +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> > +       ring_cons += txbbs_skipped;
> > +       WRITE_ONCE(ring->cons, ring_cons);
> > +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> > +
> > +       if (dev->doorbell_opt)
> > +               mlx4_en_xmit_doorbell(dev, ring);
> >
> >         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
> >                 return done < budget;
> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
> >         __iowrite64_copy(dst, src, bytecnt / 8);
> >  }
> >
> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> > -{
> > -       wmb();
> 
> you missed/removed this "wmb()" in the new function, why ? where did
> you compensate for it ?

I removed it because I have a cmpxchg() there, which acts as the barrier
if one is needed.

My patch is a WIP, where you can set the bit of value 2 to ask to replace
the cmpxchg() by a simple write, only for performance testing/comparisons.
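
For reference, the values I used in the tests above map to:

 echo 0 > /sys/class/net/eth0/doorbell_opt   # off: doorbell from xmit, as today
 echo 1 > /sys/class/net/eth0/doorbell_opt   # defer doorbells, cmpxchg() on prod_bell
 echo 3 > /sys/class/net/eth0/doorbell_opt   # defer doorbells, plain write (bit 2 set)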


> 
> > -       /* Since there is no iowrite*_native() that writes the
> > -        * value as is, without byteswapping - using the one
> > -        * the doesn't do byteswapping in the relevant arch
> > -        * endianness.
> > -        */
> > -#if defined(__LITTLE_ENDIAN)
> > -       iowrite32(
> > -#else
> > -       iowrite32be(
> > -#endif
> > -                 ring->doorbell_qpn,
> > -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > -}
> >
> >  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >                                   struct mlx4_en_tx_desc *tx_desc,
> >                                   union mlx4_wqe_qpn_vlan qpn_vlan,
> >                                   int desc_size, int bf_index,
> >                                   __be32 op_own, bool bf_ok,
> > -                                 bool send_doorbell)
> > +                                 bool send_doorbell,
> > +                                 const struct net_device *dev, int nr_txbb)
> >  {
> >         tx_desc->ctrl.qpn_vlan = qpn_vlan;
> >
> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >
> >                 wmb();
> >
> > +               ring->prod += nr_txbb;
> >                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
> >                              desc_size);
> >
> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >                  */
> >                 dma_wmb();
> >                 tx_desc->ctrl.owner_opcode = op_own;
> > +               ring->prod += nr_txbb;
> >                 if (send_doorbell)
> > -                       mlx4_en_xmit_doorbell(ring);
> > +                       mlx4_en_xmit_doorbell(dev, ring);
> >                 else
> >                         ring->xmit_more++;
> >         }
> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
> >         }
> >
> > -       ring->prod += nr_txbb;
> > -
> >         /* If we used a bounce buffer then copy descriptor back into place */
> >         if (unlikely(bounce))
> >                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >         }
> >         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >
> > +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > +        * will happen shortly.
> > +        */
> > +       if (send_doorbell &&
> > +           dev->doorbell_opt &&
> > +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> 
> Aelexi already expressed his worries about synchronization, and i
> think here (in this exact line) sits the problem,
> what about if exactly at this point the TX completion handler just
> finished and rang the last doorbell,
> you didn't write the new TX descriptor yet (mlx4_en_tx_write_desc), so
> the last doorbell from the CQ handler missed it.
> even if you wrote the TX descriptor before the doorbell decision here,
> you will need a memory barrier to make sure the HW sees
> the new packet, which was typically done before ringing the doorbell.


My patch is a WIP, meaning it is not completed ;)

Surely we can find a non racy way to handle this.


> All in all, this is risky business :),  the right way to go is to
> force the upper layer to use xmit-more and delay doorbells/use bulking
> but from the same context
> (xmit routine).  For example see Achiad's suggestion (attached in
> Jesper's response), he used stop queue to force the stack to queue up
> packets (TX bulking)
> which would set xmit-more and will use the next completion to release
> the "stopped" ring TXQ rather than hit the doorbell on behalf of it.



Well, you depend on having a higher level queue like a qdisc.

Some users do not use a qdisc.
If you stop the queue, they can no longer send anything -> drops.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 11:38                                 ` Jesper Dangaard Brouer
@ 2016-11-30 15:56                                   ` Eric Dumazet
  2016-11-30 19:17                                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-11-30 15:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat

On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> I've played with a somewhat similar patch (from Achiad Shochat) for
> mlx5 (attached).  While it gives huge improvements, the problem I ran
> into was that; TX performance became a function of the TX completion
> time/interrupt and could easily be throttled if configured too
> high/slow.
> 
> Can your patch be affected by this too?

Like all TX business, you should know this, Jesper.

No need to constantly remind us of something very well known.

> 
> Adjustable via:
>  ethtool -C mlx5p2 tx-usecs 16 tx-frames 32
>  

Generally speaking, these settings impact TX throughput/latencies.

Since NIC IRQs then trigger softirqs, we already know that IRQ handling
is critical, and some tuning can help, depending on the particular load.

Now the trick is when you want a generic kernel, deployed on hosts
shared by millions of jobs.


> 
> On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
> > 
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt 
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> > lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt 
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50
> 
> These +75% number is pktgen without "burst", and definitely show that
> your patch activate xmit_more.
> What is the pps performance number when using pktgen "burst" option?

About the same, really. Almost all packets now get the xmit_more effect.

> 
> 
> > And about 11 % improvement on an mono-flow UDP_STREAM test.
> > 
> > skb_set_owner_w() is now the most consuming function.
> > 
> > 
> > lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> > [1] 13696
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> > lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50
> 
> The +11% number seems consistent with my perf observations that approx
> 12% was "fakely" spend on the xmit spin_lock.
> 
> 
> [...]
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> [...]
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> >  {
> > -	return ring->prod - ring->cons > ring->full_size;
> > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> >  }
> [...]
> 
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	}
> >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >  
> > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > +	 * will happen shortly.
> > +	 */
> > +	if (send_doorbell &&
> > +	    dev->doorbell_opt &&
> > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> 
> It would be nice with a function call with an appropriate name, instead
> of an open-coded queue size check.  I'm also confused by the "ncons" name.
> 
> > +		send_doorbell = false;
> > +
> [...]
> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > index 574bcbb1b38f..c3fd0deda198 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> >  	 */
> >  	u32			last_nr_txbb;
> >  	u32			cons;
> > +	u32			ncons;
> 
> Maybe we can find a better name than "ncons" ?

That's because 'cons' in this driver really means 'old cons',

and new cons = old cons + last_nr_txbb;
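
Or spelled out as a comment (sketch):

/*
 * ring->cons  : consumer index, excluding the last polled descriptor
 * ring->ncons : ring->cons + last_nr_txbb, i.e. the real ("new") consumer,
 *               which the doorbell-deferral check compares against prod_bell
 */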


> 
> >  	unsigned long		wake_queue;
> >  	struct netdev_queue	*tx_queue;
> >  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> >  
> >  	/* cache line used and dirtied in mlx4_en_xmit() */
> >  	u32			prod ____cacheline_aligned_in_smp;
> > +	u32			prod_bell;
> 
> Good descriptive variable name.
> 
> >  	unsigned int		tx_dropped;
> >  	unsigned long		bytes;
> >  	unsigned long		packets;
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 15:44                                   ` Eric Dumazet
@ 2016-11-30 16:27                                     ` Saeed Mahameed
  2016-11-30 17:28                                       ` Eric Dumazet
  2016-12-01 12:05                                       ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 60+ messages in thread
From: Saeed Mahameed @ 2016-11-30 16:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
	Saeed Mahameed, Tariq Toukan

On Wed, Nov 30, 2016 at 5:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
>> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>> >
>> >
>> >> Not sure it this has been tried before, but the doorbell avoidance could
>> >> be done by the driver itself, because it knows a TX completion will come
>> >> shortly (well... if softirqs are not delayed too much !)
>> >>
>> >> Doorbell would be forced only if :
>> >>
>> >> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> >> OR
>> >> ( too many [1] packets were put in TX ring buffer, no point deferring
>> >> more)
>> >>
>> >> Start the pump, but once it is started, let the doorbells being done by
>> >> TX completion.
>> >>
>> >> ndo_start_xmit and TX completion handler would have to maintain a shared
>> >> state describing if packets were ready but doorbell deferred.
>> >>
>> >>
>> >> Note that TX completion means "if at least one packet was drained",
>> >> otherwise busy polling, constantly calling napi->poll() would force a
>> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
>> >>
>> >> But then, maybe busy poll would like to force a doorbell...
>> >>
>> >> I could try these ideas on mlx4 shortly.
>> >>
>> >>
>> >> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>> >
>> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
>> > not used.
>>
>> Hi Eric, Nice Idea indeed and we need something like this,
>> today we almost don't exploit the TX bulking at all.
>>
>> But please see below, i am not sure different contexts should share
>> the doorbell ringing, it is really risky.
>>
>> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>> >  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>> >  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>> >  include/linux/netdevice.h                    |    1
>> >  net/core/net-sysfs.c                         |   18 +++
>> >  5 files changed, 83 insertions(+), 32 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > index 6562f78b07f4..fbea83218fc0 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >
>> >         if (polled) {
>> >                 if (doorbell_pending)
>> > -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
>> > +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>> >
>> >                 mlx4_cq_set_ci(&cq->mcq);
>> >                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > index 4b597dca5c52..affebb435679 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>> >         ring->size = size;
>> >         ring->size_mask = size - 1;
>> >         ring->sp_stride = stride;
>> > -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
>> > +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>> >
>> >         tmp = size * sizeof(struct mlx4_en_tx_info);
>> >         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
>> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>> >         ring->sp_cqn = cq;
>> >         ring->prod = 0;
>> >         ring->cons = 0xffffffff;
>> > +       ring->ncons = 0;
>> >         ring->last_nr_txbb = 1;
>> >         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>> >         memset(ring->buf, 0, ring->buf_size);
>> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>> >                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>> >  }
>> >
>> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
>> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>> >  {
>> > -       return ring->prod - ring->cons > ring->full_size;
>> > +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>> >  }
>> >
>> >  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
>> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >
>> >         /* Skip last polled descriptor */
>> >         ring->cons += ring->last_nr_txbb;
>> > +       ring->ncons += ring->last_nr_txbb;
>> >         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>> >                  ring->cons, ring->prod);
>> >
>> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >                                                 !!(ring->cons & ring->size), 0,
>> >                                                 0 /* Non-NAPI caller */);
>> >                 ring->cons += ring->last_nr_txbb;
>> > +               ring->ncons += ring->last_nr_txbb;
>> >                 cnt++;
>> >         }
>> >
>> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >         return cnt;
>> >  }
>> >
>> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
>> > +                          struct mlx4_en_tx_ring *ring)
>> > +{
>> > +
>> > +       if (dev->doorbell_opt & 1) {
>> > +               u32 oval = READ_ONCE(ring->prod_bell);
>> > +               u32 nval = READ_ONCE(ring->prod);
>> > +
>> > +               if (oval == nval)
>> > +                       return;
>> > +
>> > +               /* I can not tell yet if a cmpxchg() is needed or not */
>> > +               if (dev->doorbell_opt & 2)
>> > +                       WRITE_ONCE(ring->prod_bell, nval);
>> > +               else
>> > +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
>> > +                               return;
>> > +       }
>> > +       /* Since there is no iowrite*_native() that writes the
>> > +        * value as is, without byteswapping - using the one
>> > +        * the doesn't do byteswapping in the relevant arch
>> > +        * endianness.
>> > +        */
>> > +#if defined(__LITTLE_ENDIAN)
>> > +       iowrite32(
>> > +#else
>> > +       iowrite32be(
>> > +#endif
>> > +                 ring->doorbell_qpn,
>> > +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
>> > +}
>> > +
>> >  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>> >                                   struct mlx4_en_cq *cq, int napi_budget)
>> >  {
>> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>> >         wmb();
>> >
>> >         /* we want to dirty this cache line once */
>> > -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
>> > -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
>> > +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
>> > +       ring_cons += txbbs_skipped;
>> > +       WRITE_ONCE(ring->cons, ring_cons);
>> > +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
>> > +
>> > +       if (dev->doorbell_opt)
>> > +               mlx4_en_xmit_doorbell(dev, ring);
>> >
>> >         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>> >                 return done < budget;
>> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>> >         __iowrite64_copy(dst, src, bytecnt / 8);
>> >  }
>> >
>> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
>> > -{
>> > -       wmb();
>>
>> you missed/removed this "wmb()" in the new function, why ? where did
>> you compensate for it ?
>
> I removed it because I had a cmpxchg() there if the barrier was needed.
>

Cool, so the answer is yes: the barrier is needed in order for the HW
to see the last step of mlx4_en_tx_write_desc(), where we write the
ownership bit (which marks the descriptor as valid for HW processing):

	tx_desc->ctrl.owner_opcode = op_own;

Ringing the doorbell without this wmb() might cause the HW to miss that
last packet.
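
So any path that ends up ringing the doorbell has to keep this
producer-side ordering (sketch of what the driver does today):

	dma_wmb();                           /* descriptor data before ownership */
	tx_desc->ctrl.owner_opcode = op_own; /* hand the WQE to the HW           */
	wmb();                               /* make the WQE visible before ...  */
	iowrite32(ring->doorbell_qpn,        /* ... the doorbell MMIO write      */
		  ring->bf.uar->map + MLX4_SEND_DOORBELL);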

> My patch is a WIP, where you can set the bit 2 to ask to replace the
> cmpxchg() by a simple write, only for performance testing/comparisons.
>
>
>>
>> > -       /* Since there is no iowrite*_native() that writes the
>> > -        * value as is, without byteswapping - using the one
>> > -        * the doesn't do byteswapping in the relevant arch
>> > -        * endianness.
>> > -        */
>> > -#if defined(__LITTLE_ENDIAN)
>> > -       iowrite32(
>> > -#else
>> > -       iowrite32be(
>> > -#endif
>> > -                 ring->doorbell_qpn,
>> > -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
>> > -}
>> >
>> >  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >                                   struct mlx4_en_tx_desc *tx_desc,
>> >                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>> >                                   int desc_size, int bf_index,
>> >                                   __be32 op_own, bool bf_ok,
>> > -                                 bool send_doorbell)
>> > +                                 bool send_doorbell,
>> > +                                 const struct net_device *dev, int nr_txbb)
>> >  {
>> >         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>> >
>> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >
>> >                 wmb();
>> >
>> > +               ring->prod += nr_txbb;
>> >                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>> >                              desc_size);
>> >
>> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >                  */
>> >                 dma_wmb();
>> >                 tx_desc->ctrl.owner_opcode = op_own;
>> > +               ring->prod += nr_txbb;
>> >                 if (send_doorbell)
>> > -                       mlx4_en_xmit_doorbell(ring);
>> > +                       mlx4_en_xmit_doorbell(dev, ring);
>> >                 else
>> >                         ring->xmit_more++;
>> >         }
>> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>> >                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>> >         }
>> >
>> > -       ring->prod += nr_txbb;
>> > -
>> >         /* If we used a bounce buffer then copy descriptor back into place */
>> >         if (unlikely(bounce))
>> >                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>> >         }
>> >         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>> >
>> > +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
>> > +        * will happen shortly.
>> > +        */
>> > +       if (send_doorbell &&
>> > +           dev->doorbell_opt &&
>> > +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>>
>> Aelexi already expressed his worries about synchronization, and i
>> think here (in this exact line) sits the problem,
>> what about if exactly at this point the TX completion handler just
>> finished and rang the last doorbell,
>> you didn't write the new TX descriptor yet (mlx4_en_tx_write_desc), so
>> the last doorbell from the CQ handler missed it.
>> even if you wrote the TX descriptor before the doorbell decision here,
>> you will need a memory barrier to make sure the HW sees
>> the new packet, which was typically done before ringing the doorbell.
>
>
> My patch is a WIP, meaning it is not completed ;)
>
> Surely we can find a non racy way to handle this.

The question is, can it be done without locking? Maybe ring the
doorbell every N msecs, just in case.
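
Something along these lines perhaps (pure sketch; the timer field and
the dev back-pointer on the ring are invented for illustration):

static void mlx4_en_doorbell_timeout(unsigned long data)
{
	struct mlx4_en_tx_ring *ring = (struct mlx4_en_tx_ring *)data;

	/* catch a deferred doorbell that no TX completion rang for us */
	mlx4_en_xmit_doorbell(ring->dev, ring);
}

/* re-armed from mlx4_en_xmit() whenever send_doorbell is dropped, e.g.
 *	mod_timer(&ring->doorbell_timer, jiffies + msecs_to_jiffies(1));
 */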

>
>
>> All in all, this is risky business :),  the right way to go is to
>> force the upper layer to use xmit-more and delay doorbells/use bulking
>> but from the same context
>> (xmit routine).  For example see Achiad's suggestion (attached in
>> Jesper's response), he used stop queue to force the stack to queue up
>> packets (TX bulking)
>> which would set xmit-more and will use the next completion to release
>> the "stopped" ring TXQ rather than hit the doorbell on behalf of it.
>
>
>
> Well, you depend on having a higher level queue like a qdisc.
>
> Some users do not use a qdisc.
> If you stop the queue, they no longer can send anything -> drops.
>

In this case, I think they should implement their own bulking (pktgen
is not a good example), but XDP can predict whether it has more packets
to xmit, as long as all of them fall in the same NAPI cycle.
Others should try to do the same.
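
(For reference, this is roughly what the mlx4 XDP-TX path already does
today, simplified from the en_rx.c hunk above: the doorbell is deferred
while forwarding inside one NAPI poll and rung once at the end.)

	if (polled) {
		if (doorbell_pending)
			mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
		...
	}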

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 16:27                                     ` Saeed Mahameed
@ 2016-11-30 17:28                                       ` Eric Dumazet
  2016-12-01 12:05                                       ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-11-30 17:28 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
	Saeed Mahameed, Tariq Toukan

On Wed, 2016-11-30 at 18:27 +0200, Saeed Mahameed wrote:

> 
> In this case, i think they should implement their own bulking (pktgen
> is not a good example)
> but XDP can predict if it has more packets to xmit  as long as all of
> them fall in the same NAPI cycle.


> Others should try and do the same.

We cannot trust user space to signal us when it is the right time to
perform a doorbell.

trafgen is using af_packet with qdisc bypass for example.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-30 10:43               ` Jesper Dangaard Brouer
@ 2016-11-30 17:42                 ` Rick Jones
  2016-11-30 18:11                   ` David Miller
  0 siblings, 1 reply; 60+ messages in thread
From: Rick Jones @ 2016-11-30 17:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Eric Dumazet, netdev

On 11/30/2016 02:43 AM, Jesper Dangaard Brouer wrote:
> Notice the "fib_lookup" cost is still present, even when I use
> option "-- -n -N" to create a connected socket.  As Eric taught us,
> this is because we should use syscalls "send" or "write" on a connected
> socket.

In theory, once the data socket is connected, the send_data() call in 
src/nettest_omni.c is supposed to use send() rather than sendto().

And indeed, based on a quick check, send() is what is being called, 
though it seems it becomes a sendto() system call, with the destination
information NULL:

write(1, "send\n", 5)                   = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
write(1, "send\n", 5)                   = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024

So I'm not sure what might be going on there.

You can get netperf to use write() instead of send() by adding a 
test-specific -I option.
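
That is, something like (untested):

 netperf -H 198.18.50.1 -t UDP_STREAM -l 60 -- -m 1472 -n -N -I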

happy benchmarking,

rick

>
> My udp_flood tool[1] cycle through the different syscalls:
>
> taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2
>              	ns/pkt	pps		cycles/pkt
> send      	473.08	2113816.28	1891
> sendto    	558.58	1790265.84	2233
> sendmsg   	587.24	1702873.80	2348
> sendMmsg/32  	547.57	1826265.90	2189
> write     	518.36	1929175.52	2072
>
> Using "send" seems to be the fastest option.
>
> Some notes on test: I've forced TX completions to happen on another CPU0
> and pinned the udp_flood program (to CPU2) as I want to avoid the CPU
> scheduler to move udp_flood around as this cause fluctuations in the
> results (as it stress the memory allocations more).
>
> My udp_flood --pmtu option is documented in the --help usage text (see below signature)
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Netperf UDP issue with connected sockets
  2016-11-30 17:42                 ` Rick Jones
@ 2016-11-30 18:11                   ` David Miller
  0 siblings, 0 replies; 60+ messages in thread
From: David Miller @ 2016-11-30 18:11 UTC (permalink / raw)
  To: rick.jones2; +Cc: brouer, eric.dumazet, netdev

From: Rick Jones <rick.jones2@hpe.com>
Date: Wed, 30 Nov 2016 09:42:40 -0800

> And indeed, based on a quick check, send() is what is being called,
> though it seems it becomes a sendto() system call - with the
> destination information NULL:
> 
> write(1, "send\n", 5)                   = 5
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0)
> = 1024
> write(1, "send\n", 5)                   = 5
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0)
> = 1024
> 
> So I'm not sure what might be going-on there.

It's because of glibc's implementation of send() which is:

ssize_t
__libc_send (int sockfd, const void *buffer, size_t len, int flags)
{
  return SYSCALL_CANCEL (sendto, sockfd, buffer, len, flags, NULL, 0);
}
strong_alias (__libc_send, __send)
weak_alias (__libc_send, send)
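
This is harmless for a connected socket: with a NULL destination address
the kernel uses the address cached by connect().  For completeness, a
minimal sketch of the connected-UDP pattern under discussion (address,
port and packet count are placeholders):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	char payload[1472];
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port   = htons(9),	/* placeholder port */
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	inet_pton(AF_INET, "198.18.50.1", &dst.sin_addr);

	/* one route lookup here, cached for all following sends */
	if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
		return 1;

	memset(payload, 0, sizeof(payload));
	for (int i = 0; i < 1000000; i++)
		send(fd, payload, sizeof(payload), 0); /* no per-packet destination */

	close(fd);
	return 0;
}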

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 15:56                                   ` Eric Dumazet
@ 2016-11-30 19:17                                     ` Jesper Dangaard Brouer
  2016-11-30 19:30                                       ` Eric Dumazet
  0 siblings, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 19:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat, brouer

On Wed, 30 Nov 2016 07:56:26 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> > I've played with a somewhat similar patch (from Achiad Shochat) for
> > mlx5 (attached).  While it gives huge improvements, the problem I ran
> > into was that; TX performance became a function of the TX completion
> > time/interrupt and could easily be throttled if configured too
> > high/slow.
> > 
> > Can your patch be affected by this too?  
> 
> Like all TX business, you should know this Jesper.
> No need to constantly remind us something very well known.

Don't take it as critique, Eric.  I was hoping your patch would have
solved this issue of being sensitive to TX completion adjustments.  You
usually have good solutions for difficult issues. I basically rejected
Achiad's approach/patch because it was too sensitive to these kind of
adjustments.


> > On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
[...]
> > 
> > These +75% number is pktgen without "burst", and definitely show that
> > your patch activate xmit_more.
> > What is the pps performance number when using pktgen "burst" option?  
> 
> About the same really. About all packets now get the xmit_more effect.

Perfect!

> > [...]  
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > index 4b597dca5c52..affebb435679 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c  
> > [...]  
> > > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > >  {
> > > -	return ring->prod - ring->cons > ring->full_size;
> > > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > >  }  
> > [...]
> >   
> > > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > >  	}
> > >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> > >  
> > > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > > +	 * will happen shortly.
> > > +	 */
> > > +	if (send_doorbell &&
> > > +	    dev->doorbell_opt &&
> > > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)  
> > 
> > It would be nice with a function call with an appropriate name, instead
> > of an open-coded queue size check.  I'm also confused by the "ncons" name.
> >   
> > > +		send_doorbell = false;
> > > +  
> > [...]
> >   
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > index 574bcbb1b38f..c3fd0deda198 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> > >  	 */
> > >  	u32			last_nr_txbb;
> > >  	u32			cons;
> > > +	u32			ncons;  
> > 
> > Maybe we can find a better name than "ncons" ?  
> 
> Thats because 'cons' in this driver really means 'old cons' 
> 
> and new cons = old cons + last_nr_txbb;

It was not clear to me that "n" meant "new".  It was also not clear that
this driver has an issue where "cons" (consumer) is tracking the "old" cons.

  
> > >  	unsigned long		wake_queue;
> > >  	struct netdev_queue	*tx_queue;
> > >  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> > > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> > >  
> > >  	/* cache line used and dirtied in mlx4_en_xmit() */
> > >  	u32			prod ____cacheline_aligned_in_smp;
> > > +	u32			prod_bell;  
> > 
> > Good descriptive variable name.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 19:17                                     ` Jesper Dangaard Brouer
@ 2016-11-30 19:30                                       ` Eric Dumazet
  2016-11-30 22:30                                         ` Jesper Dangaard Brouer
  2016-12-01  0:27                                         ` Eric Dumazet
  0 siblings, 2 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-11-30 19:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat

On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:

> Don't take it as critique, Eric.  I was hoping your patch would have
> solved this issue of being sensitive to TX completion adjustments.  You
> usually have good solutions for difficult issues. I basically rejected
> Achiad's approach/patch because it was too sensitive to these kind of
> adjustments.

Well, this patch can hurt latencies, because a doorbell can be delayed,
and softirqs can be delayed by many hundreds of usec in some cases.

I would not enable this behavior by default.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 19:30                                       ` Eric Dumazet
@ 2016-11-30 22:30                                         ` Jesper Dangaard Brouer
  2016-11-30 22:40                                           ` Eric Dumazet
  2016-12-01  0:27                                         ` Eric Dumazet
  1 sibling, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 22:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat, brouer

On Wed, 30 Nov 2016 11:30:00 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> 
> > Don't take it as critique, Eric.  I was hoping your patch would have
> > solved this issue of being sensitive to TX completion adjustments.  You
> > usually have good solutions for difficult issues. I basically rejected
> > Achiad's approach/patch because it was too sensitive to these kind of
> > adjustments.  
> 
> Well, this patch can hurt latencies, because a doorbell can be delayed,
> and softirqs can be delayed by many hundred of usec in some cases.
> 
> I would not enable this behavior by default.

What about another scheme, where dev_hard_start_xmit() can return an
indication that the driver chose not to flush (based on TX queue depth),
thereby requesting the stack to call flush at a later point.

Would that introduce less latency issues?


Patch muckup (not even compile tested):

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4ffcd874cc20..d7d15e4e6766 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -109,6 +109,7 @@ enum netdev_tx {
 	__NETDEV_TX_MIN	 = INT_MIN,	/* make sure enum is signed */
 	NETDEV_TX_OK	 = 0x00,	/* driver took care of packet */
 	NETDEV_TX_BUSY	 = 0x10,	/* driver tx path was busy*/
+	NETDEV_TX_FLUSHME= 0x04,	/* driver request doorbell/flush later */
 };
 typedef enum netdev_tx netdev_tx_t;
 
@@ -536,6 +537,8 @@ enum netdev_queue_state_t {
 	__QUEUE_STATE_DRV_XOFF,
 	__QUEUE_STATE_STACK_XOFF,
 	__QUEUE_STATE_FROZEN,
+	// __QUEUE_STATE_NEED_FLUSH
+	// is it better to store in txq state?
 };
 
 #define QUEUE_STATE_DRV_XOFF	(1 << __QUEUE_STATE_DRV_XOFF)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e6aa0a249672..7480e44c5a50 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -75,6 +75,7 @@ struct Qdisc {
 	void			*u32_node;
 
 	struct netdev_queue	*dev_queue;
+	struct netdev_queue	*flush_dev_queue; // store txq to flush here?
 
 	struct gnet_stats_rate_est64	rate_est;
 	struct gnet_stats_basic_cpu __percpu *cpu_bstats;
@@ -98,6 +99,20 @@ struct Qdisc {
 	spinlock_t		busylock ____cacheline_aligned_in_smp;
 };
 
+static inline void qdisc_request_txq_flush(struct Qdisc *qdisc,
+					   struct netdev_queue *txq)
+{
+	struct net_device *dev = qdisc_dev(qdisc);
+
+	if (qdisc->flush_dev_queue) {
+		if (likely(qdisc->flush_dev_queue == txq))
+			return;
+		/* Flush existing txq before reassignment */
+		dev_flush_xmit(dev, qdisc->flush_dev_queue);
+	}
+	qdisc->flush_dev_queue = txq;
+}
+
 static inline bool qdisc_is_running(const struct Qdisc *qdisc)
 {
 	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
@@ -117,6 +132,19 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
+	/* flush device txq here, if needed */
+	if (qdisc->flush_dev_queue) {
+		struct netdev_queue *txq = qdisc->flush_dev_queue;
+		struct net_device *dev = qdisc_dev(qdisc);
+
+		qdisc->flush_dev_queue = NULL;
+		dev_flush_xmit(dev, txq);
+		/*
+		 * DISCUSS: it is too soon to flush here? What about
+		 * rescheduling a NAPI poll cycle for this device,
+		 * before calling flush.
+		 */
+	}
 	write_seqcount_end(&qdisc->running);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 048b46b7c92a..70339c267f33 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2880,6 +2880,15 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_skb_features);
 
+static void dev_flush_xmit(struct net_device *dev,
+			  struct netdev_queue *txq)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	ops->ndo_flush_xmit(dev, txq);
+	// Oh oh, do we need to take HARD_TX_LOCK ??
+}
+
 static int xmit_one(struct sk_buff *skb, struct net_device *dev,
 		    struct netdev_queue *txq, bool more)
 {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6cfb6e9038c2..55c01b6f6311 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -190,6 +190,13 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 	if (dev_xmit_complete(ret)) {
 		/* Driver sent out skb successfully or skb was consumed */
+		if (ret == NETDEV_TX_FLUSHME) {
+			/* Driver choose no-TX-doorbell MMIO write.
+			 * This made taking qdisc root_lock less expensive.
+			 */
+			qdisc_request_txq_flush(q, txq);
+			// Flush happens later in qdisc_run_end()
+		}
 		ret = qdisc_qlen(q);
 	} else {
 		/* Driver returned NETDEV_TX_BUSY - requeue skb */


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 22:30                                         ` Jesper Dangaard Brouer
@ 2016-11-30 22:40                                           ` Eric Dumazet
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-11-30 22:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat

On Wed, 2016-11-30 at 23:30 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 30 Nov 2016 11:30:00 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > Don't take it as critique, Eric.  I was hoping your patch would have
> > > solved this issue of being sensitive to TX completion adjustments.  You
> > > usually have good solutions for difficult issues. I basically rejected
> > > Achiad's approach/patch because it was too sensitive to these kind of
> > > adjustments.  
> > 
> > Well, this patch can hurt latencies, because a doorbell can be delayed,
> > and softirqs can be delayed by many hundred of usec in some cases.
> > 
> > I would not enable this behavior by default.
> 
> What about another scheme, where dev_hard_start_xmit() can return an
> indication that driver choose not to flush (based on TX queue depth),
> and there by requesting stack to call flush at a later point.
> 
> Would that introduce less latency issues?


Again, how is it going to change anything when your netperf UDP sends
one packet at a time ?

qdisc gets one packet , dequeue it immediately.

No pressure. -> doorbell will be sent as before.

Some TX producers do not even use a qdisc to begin with.

Please think of a solution that does not involve qdisc layer at all.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 19:30                                       ` Eric Dumazet
  2016-11-30 22:30                                         ` Jesper Dangaard Brouer
@ 2016-12-01  0:27                                         ` Eric Dumazet
  2016-12-01  1:16                                           ` Tom Herbert
  1 sibling, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-12-01  0:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Tom Herbert, Willem de Bruijn
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat


Another issue I found during my tests last days, is a problem with BQL,
and more generally when driver stops/starts the queue.

When under stress and BQL stops the queue, driver TX completion does a
lot of work, and servicing CPU also takes over further qdisc_run().

The work-flow is :

1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
unmap things, queue skbs for freeing.

2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes); 

if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
     netif_schedule_queue(dev_queue);

This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
(They absolutely have no chance to grab it)

So we end up with one cpu doing the ndo_start_xmit() and TX completions,
and RX work.

This problem is magnified when XPS is used, if one mono-threaded application deals with
thousands of TCP sockets.

We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
to allow another cpu to service the qdisc and spare us.
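
For reference, the BQL hooks mentioned above sit in the driver roughly as
follows (a generic sketch with invented mydrv_* names, not mlx4 code):

static netdev_tx_t mydrv_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

	mydrv_post_descriptor(dev, skb);
	/* BQL accounting; may set __QUEUE_STATE_STACK_XOFF on this txq */
	netdev_tx_sent_queue(txq, skb->len);

	if (!skb->xmit_more || netif_xmit_stopped(txq))
		mydrv_ring_doorbell(dev);	/* the MMIO write discussed here */
	return NETDEV_TX_OK;
}

static void mydrv_tx_complete(struct netdev_queue *txq,
			      unsigned int pkts, unsigned int bytes)
{
	/* step 2) above: clears __QUEUE_STATE_STACK_XOFF and calls
	 * netif_schedule_queue() if the stack had been throttled
	 */
	netdev_tx_completed_queue(txq, pkts, bytes);
}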

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01  0:27                                         ` Eric Dumazet
@ 2016-12-01  1:16                                           ` Tom Herbert
  2016-12-01  2:32                                             ` Eric Dumazet
  0 siblings, 1 reply; 60+ messages in thread
From: Tom Herbert @ 2016-12-01  1:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat

On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Another issue I found during my tests last days, is a problem with BQL,
> and more generally when driver stops/starts the queue.
>
> When under stress and BQL stops the queue, driver TX completion does a
> lot of work, and servicing CPU also takes over further qdisc_run().
>
Hmm, hard to say if this is a problem or a feature I think ;-). One way
to "solve" this would be to use IRQ round robin, which would spread the
load of networking across CPUs, but that gives us no additional
parallelism and reduces locality-- it's generally considered a bad
idea. The question might be: is it better to continuously ding one CPU
with all the networking work or try to artificially spread it out?
Note this is orthogonal to MQ also, so we should already have multiple
CPUs doing netif_schedule_queue for queues they manage.

Do you have a test or application that shows this is causing pain?

> The work-flow is :
>
> 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> unmap things, queue skbs for freeing.
>
> 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>
> if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>      netif_schedule_queue(dev_queue);
>
> This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> (They absolutely have no chance to grab it)
>
> So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> and RX work.
>
> This problem is magnified when XPS is used, if one mono-threaded application deals with
> thousands of TCP sockets.
>
Do you know of an application doing this? The typical way XPS and most
of the other steering mechanisms are configured assume that workloads
tend towards a normal distribution. Such mechanisms can become
problematic under asymmetric loads, but then I would expect these are
likely dedicated servers so that the mechanisms can be tuned
accordingly. For instance, XPS can be configured to map more than one
queue to a CPU. Alternatively, IIRC Windows has some functionality to
tune networking for the load (spin up queues, reconfigure things
similar to XPS/RPS, etc.)-- that's promising up to the point that we need
a lot of heuristics and measurement; but then we lose determinism and
risk edge case where we get completely unsatisfied results (one of my
concerns with the recent attempt to put configuration in the kernel).

> We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> to allow another cpu to service the qdisc and spare us.
>
Wouldn't this need to be an active operation? That is to queue the
qdisc on another CPU's output_queue?

Tom

>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01  1:16                                           ` Tom Herbert
@ 2016-12-01  2:32                                             ` Eric Dumazet
  2016-12-01  2:50                                               ` Eric Dumazet
  2016-12-01  5:03                                               ` Tom Herbert
  0 siblings, 2 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-12-01  2:32 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat

On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Another issue I found during my tests last days, is a problem with BQL,
> > and more generally when driver stops/starts the queue.
> >
> > When under stress and BQL stops the queue, driver TX completion does a
> > lot of work, and servicing CPU also takes over further qdisc_run().
> >
> Hmm, hard to say if this is problem or a feature I think ;-). One way
> to "solve" this would be to use IRQ round robin, that would spread the
> load of networking across CPUs, but that gives us no additional
> parallelism and reduces locality-- it's generally considered a bad
> idea. The question might be: is it better to continuously ding one CPU
> with all the networking work or try to artificially spread it out?
> Note this is orthogonal to MQ also, so we should already have multiple
> CPUs doing netif_schedule_queue for queues they manage.
> 
> Do you have a test or application that shows this is causing pain?

Yes, just launch enough TCP senders (more than 10,000) to fully utilize
the NIC, with small messages.

super_netperf is not good for that, because you would need 10,000
processes and would spend too many cycles just dealing with an enormous
working set; you would not activate BQL.


> 
> > The work-flow is :
> >
> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> > unmap things, queue skbs for freeing.
> >
> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
> >
> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
> >      netif_schedule_queue(dev_queue);
> >
> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> > (They absolutely have no chance to grab it)
> >
> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> > and RX work.
> >
> > This problem is magnified when XPS is used, if one mono-threaded application deals with
> > thousands of TCP sockets.
> >
> Do you know of an application doing this? The typical way XPS and most
> of the other steering mechanisms are configured assume that workloads
> tend towards a normal distribution. Such mechanisms can become
> problematic under asymmetric loads, but then I would expect these are
> likely dedicated servers so that the mechanisms can be tuned
> accordingly. For instance, XPS can be configured to map more than one
> queue to a CPU. Alternatively, IIRC Windows has some functionality to
> tune networking for the load (spin up queues, reconfigure things
> similar to XPS/RPS, etc.)-- that's promising up the point that we need
> a lot of heuristics and measurement; but then we lose determinism and
> risk edge case where we get completely unsatisfied results (one of my
> concerns with the recent attempt to put configuration in the kernel).

We have thousands of applications, and some of them 'kind of multicast'
events to a broad number of TCP sockets.

Very often, applications writers use a single thread for doing this,
when all they need is to send small packets to 10,000 sockets, and they
do not really care of doing this very fast. They also do not want to
hurt other applications sharing the NIC.

Very often, process scheduler will also run this single thread in a
single cpu, ie avoiding expensive migrations if they are not needed.

Problem is this behavior locks one TX queue for the duration of the
multicast, since XPS will force all the TX packets to go to one TX
queue.

Other flows that would share the locked CPU experience high P99
latencies.


> 
> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> > to allow another cpu to service the qdisc and spare us.
> >
> Wouldn't this need to be an active operation? That is to queue the
> qdisc on another CPU's output_queue?

I simply suggest we try to queue the qdisc for further servicing as we
do today, from net_tx_action(), but we might use a different bit, so
that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
before we grab it from net_tx_action(), maybe 100 usec later (time to
flush all skbs queued in napi_consume_skb() and maybe RX processing,
since most NICs handle TX completion before doing RX processing from their
napi poll handler).

Should be doable with few changes in __netif_schedule() and
net_tx_action(), plus some control paths that will need to take care of
the new bit at dismantle time, right ?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01  2:32                                             ` Eric Dumazet
@ 2016-12-01  2:50                                               ` Eric Dumazet
  2016-12-02 18:16                                                 ` Eric Dumazet
  2016-12-01  5:03                                               ` Tom Herbert
  1 sibling, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-12-01  2:50 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat

On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:

> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NICs handle TX completion before doing RX processing from their
> napi poll handler).
> 
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right ?

Hmm... this is silly. Code already implements a different bit.

qdisc_run() seems to run more often from net_tx_action(), I have to find
out why.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01  2:32                                             ` Eric Dumazet
  2016-12-01  2:50                                               ` Eric Dumazet
@ 2016-12-01  5:03                                               ` Tom Herbert
  2016-12-01 19:24                                                 ` Willem de Bruijn
  1 sibling, 1 reply; 60+ messages in thread
From: Tom Herbert @ 2016-12-01  5:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat

On Wed, Nov 30, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
>> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Another issue I found during my tests last days, is a problem with BQL,
>> > and more generally when driver stops/starts the queue.
>> >
>> > When under stress and BQL stops the queue, driver TX completion does a
>> > lot of work, and servicing CPU also takes over further qdisc_run().
>> >
>> Hmm, hard to say if this is problem or a feature I think ;-). One way
>> to "solve" this would be to use IRQ round robin, that would spread the
>> load of networking across CPUs, but that gives us no additional
>> parallelism and reduces locality-- it's generally considered a bad
>> idea. The question might be: is it better to continuously ding one CPU
>> with all the networking work or try to artificially spread it out?
>> Note this is orthogonal to MQ also, so we should already have multiple
>> CPUs doing netif_schedule_queue for queues they manage.
>>
>> Do you have a test or application that shows this is causing pain?
>
> Yes, just launch enough TCP senders (more than 10,000) to fully utilize
> the NIC, with small messages.
>
> super_netperf is not good for that, because you would need 10,000
> processes and would spend too much cycles just dealing with an enormous
> working set, you would not activate BQL.
>
>
>>
>> > The work-flow is :
>> >
>> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
>> > unmap things, queue skbs for freeing.
>> >
>> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>> >
>> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>> >      netif_schedule_queue(dev_queue);
>> >
>> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
>> > (They absolutely have no chance to grab it)
>> >
>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>> > and RX work.
>> >
>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>> > thousands of TCP sockets.
>> >
>> Do you know of an application doing this? The typical way XPS and most
>> of the other steering mechanisms are configured assume that workloads
>> tend towards a normal distribution. Such mechanisms can become
>> problematic under asymmetric loads, but then I would expect these are
>> likely dedicated servers so that the mechanisms can be tuned
>> accordingly. For instance, XPS can be configured to map more than one
>> queue to a CPU. Alternatively, IIRC Windows has some functionality to
>> tune networking for the load (spin up queues, reconfigure things
>> similar to XPS/RPS, etc.)-- that's promising up the point that we need
>> a lot of heuristics and measurement; but then we lose determinism and
>> risk edge case where we get completely unsatisfied results (one of my
>> concerns with the recent attempt to put configuration in the kernel).
>
> We have thousands of applications, and some of them 'kind of multicast'
> events to a broad number of TCP sockets.
>
> Very often, applications writers use a single thread for doing this,
> when all they need is to send small packets to 10,000 sockets, and they
> do not really care of doing this very fast. They also do not want to
> hurt other applications sharing the NIC.
>
> Very often, process scheduler will also run this single thread in a
> single cpu, ie avoiding expensive migrations if they are not needed.
>
> Problem is this behavior locks one TX queue for the duration of the
> multicast, since XPS will force all the TX packets to go to one TX
> queue.
>
The fact that XPS is forcing TX packets to go over one CPU is a result
of how you've configured XPS. There are other potential
configurations that present different tradeoffs. For instance, TX
queues are plentiful nowadays, to the point that we could map a number
of queues to each CPU while respecting NUMA locality between the
sending CPU and where TX completions occur. If the set is sufficiently
large this would also help to address device lock contention. The
other thing I'm wondering is if Willem's concepts for distributing DOS
attacks in RPS might be applicable in XPS. If we could detect that a
TX queue is "under attack" maybe we can automatically back off to
distributing the load to more queues in XPS somehow.

Tom

> Other flows that would share the locked CPU experience high P99
> latencies.
>
>
>>
>> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
>> > to allow another cpu to service the qdisc and spare us.
>> >
>> Wouldn't this need to be an active operation? That is to queue the
>> qdisc on another CPU's output_queue?
>
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NICs handle TX completion before doing RX processing from their
> napi poll handler).
>
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right ?
>
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-30 16:27                                     ` Saeed Mahameed
  2016-11-30 17:28                                       ` Eric Dumazet
@ 2016-12-01 12:05                                       ` Jesper Dangaard Brouer
  2016-12-01 14:24                                         ` Eric Dumazet
  1 sibling, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-01 12:05 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Eric Dumazet, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan, brouer


On Wed, 30 Nov 2016 18:27:45 +0200
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> >> All in all, this is risky business :),  the right way to go is to
> >> force the upper layer to use xmit-more and delay doorbells/use bulking
> >> but from the same context (xmit routine).  For example see
> >> Achiad's suggestion (attached in Jesper's response), he used stop
> >> queue to force the stack to queue up packets (TX bulking)
> >> which would set xmit-more and will use the next completion to
> >> release the "stopped" ring TXQ rather than hit the doorbell on
> >> behalf of it.  
> >
> > Well, you depend on having a higher level queue like a qdisc.
> >
> > Some users do not use a qdisc.
> > If you stop the queue, they no longer can send anything -> drops.
> >

You do have a point that stopping the device might not be the best way
to create a push-back (to allow the stack to queue packets).

 netif_tx_stop_queue() / __QUEUE_STATE_DRV_XOFF


> In this case, i think they should implement their own bulking (pktgen
> is not a good example) but XDP can predict if it has more packets to
> xmit  as long as all of them fall in the same NAPI cycle.
> Others should try and do the same.

I actually agree with Saeed here.

Maybe we can come up with another __QUEUE_STATE_xxx that informs the
upper layer what the driver is doing.  Then users not using a qdisc can
use this indication (like the qdisc could).  (qdisc-bypass users already
check the QUEUE_STATE flags e.g. via netif_xmit_frozen_or_drv_stopped).

My main objection is that this is a driver-local optimization.  By not
involving the upper layers, the netstack loses the ability to amortize
its cost, as it still does per-packet handling.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 12:05                                       ` Jesper Dangaard Brouer
@ 2016-12-01 14:24                                         ` Eric Dumazet
  2016-12-01 16:04                                           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Dumazet @ 2016-12-01 14:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan

On Thu, 2016-12-01 at 13:05 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 30 Nov 2016 18:27:45 +0200
> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> 
> > >> All in all, this is risky business :),  the right way to go is to
> > >> force the upper layer to use xmit-more and delay doorbells/use bulking
> > >> but from the same context (xmit routine).  For example see
> > >> Achiad's suggestion (attached in Jesper's response), he used stop
> > >> queue to force the stack to queue up packets (TX bulking)
> > >> which would set xmit-more and will use the next completion to
> > >> release the "stopped" ring TXQ rather than hit the doorbell on
> > >> behalf of it.  
> > >
> > > Well, you depend on having a higher level queue like a qdisc.
> > >
> > > Some users do not use a qdisc.
> > > If you stop the queue, they no longer can send anything -> drops.
> > >
> 
> You do have a point that stopping the device might not be the best way
> to create a push-back (to allow stack queue packets).
> 
>  netif_tx_stop_queue() / __QUEUE_STATE_DRV_XOFF
> 
> 
> > In this case, i think they should implement their own bulking (pktgen
> > is not a good example) but XDP can predict if it has more packets to
> > xmit  as long as all of them fall in the same NAPI cycle.
> > Others should try and do the same.
> 
> I actually agree with Saeed here.
> 
> Maybe we can come up with another __QUEUE_STATE_xxx that informs the
> upper layer what the driver is doing.  Then users not using a qdisc can
> use this indication (like the qdisc could).  (qdisc-bypass users already
> check the QUEUE_STATE flags e.g. via netif_xmit_frozen_or_drv_stopped).

Can you explain how this is going to help trafgen using AF_PACKET with
Qdisc bypass ?

Say trafgen wants to send 10 or 1000 packets back to back (as fast as
possible)

With my proposal, only the first is triggering a doorbell from
ndo_start_xmit(). Following ones are driven by TX completion logic, or
BQL if we can push packets faster than TX interrupt can be
delivered/handled.

If you stop the queue (with yet another atomic operation to stop/unstop
btw), packet_direct_xmit() will have to drop trafgen packets on the
floor.

We already have BQL stopping the queue at a fine granularity.
I suspect that Saeed proposal will interfere with BQL logic.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 14:24                                         ` Eric Dumazet
@ 2016-12-01 16:04                                           ` Jesper Dangaard Brouer
  2016-12-01 17:04                                             ` Eric Dumazet
  0 siblings, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-01 16:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan, brouer

On Thu, 01 Dec 2016 06:24:34 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-12-01 at 13:05 +0100, Jesper Dangaard Brouer wrote:
> > On Wed, 30 Nov 2016 18:27:45 +0200
> > Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> >   
> > > >> All in all, this is risky business :),  the right way to go is to
> > > >> force the upper layer to use xmit-more and delay doorbells/use bulking
> > > >> but from the same context (xmit routine).  For example see
> > > >> Achiad's suggestion (attached in Jesper's response), he used stop
> > > >> queue to force the stack to queue up packets (TX bulking)
> > > >> which would set xmit-more and will use the next completion to
> > > >> release the "stopped" ring TXQ rather than hit the doorbell on
> > > >> behalf of it.    
> > > >
> > > > Well, you depend on having a higher level queue like a qdisc.
> > > >
> > > > Some users do not use a qdisc.
> > > > If you stop the queue, they no longer can send anything -> drops.
> > > >  
> > 
> > You do have a point that stopping the device might not be the best way
> > to create a push-back (to allow stack queue packets).
> > 
> >  netif_tx_stop_queue() / __QUEUE_STATE_DRV_XOFF
> > 
> >   
> > > In this case, i think they should implement their own bulking (pktgen
> > > is not a good example) but XDP can predict if it has more packets to
> > > xmit  as long as all of them fall in the same NAPI cycle.
> > > Others should try and do the same.  
> > 
> > I actually agree with Saeed here.
> > 
> > Maybe we can come up with another __QUEUE_STATE_xxx that informs the
> > upper layer what the driver is doing.  Then users not using a qdisc can
> > use this indication (like the qdisc could).  (qdisc-bypass users already
> > check the QUEUE_STATE flags e.g. via netif_xmit_frozen_or_drv_stopped).  
> 
> Can you explain how this is going to help trafgen using AF_PACKET with
> Qdisc bypass ?
> 
> Say trafgen wants to send 10 or 1000 packets back to back (as fast as
> possible)
>
> With my proposal, only the first is triggering a doorbell from
> ndo_start_xmit(). Following ones are driven by TX completion logic, or
> BQL if we can push packets faster than TX interrupt can be
> delivered/handled.
> 
> If you stop the queue (with yet another atomic operations to stop/unstop
> btw), packet_direct_xmit() will have to drop trafgen packets on the
> floor.

I think you misunderstood my concept[1].  I don't want to stop the
queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue, it
just indicates that someone needs to flush/ring the doorbell.  Maybe it
needs another name, because it also indicates that the driver can see
that its TX queue is so busy that we don't need to call it immediately.
The qdisc layer can then choose to enqueue instead of doing direct xmit.

When the qdisc layer or trafgen/af_packet sees this indication it knows
it should/must flush the queue when it doesn't have more work left.
Perhaps through net_tx_action(), by registering itself and e.g. if
qdisc_run() is called and the queue is empty, then checking if the queue
needs a flush. I would also allow the driver to flush and clear this bit.

I just see it as an extension of your solution, as we still need the
driver to figure out when the doorbell/flush can be delayed.

p.s. don't be discouraged by this feedback, I'm just very excited and
happy that you are working on a solution in this area. This is a
problem area that I've not been able to solve myself for the last
approx 2 years. Keep up the good work!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://lkml.kernel.org/r/20161130233015.3de95356@redhat.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 16:04                                           ` Jesper Dangaard Brouer
@ 2016-12-01 17:04                                             ` Eric Dumazet
  2016-12-01 19:17                                               ` Jesper Dangaard Brouer
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-12-01 17:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan

On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:

> I think you misunderstood my concept[1].  I don't want to stop the
> queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue, is
> it just indicating that someone need to flush/ring-doorbell.  Maybe it
> need another name, because it also indicate that the driver can see
> that its TX queue is so busy that we don't need to call it immediately.
> The qdisc layer can then choose to enqueue instead if doing direct xmit.

But driver ndo_start_xmit() does not have a pointer to qdisc.

Also the concept of 'queue busy' just because we queued one packet is a
bit flaky.

> 
> When qdisc layer or trafgen/af_packet see this indication it knows it
> should/must flush the queue when it don't have more work left.  Perhaps
> through net_tx_action(), by registering itself and e.g. if qdisc_run()
> is called and queue is empty then check if queue needs a flush. I would
> also allow driver to flush and clear this bit.

net_tx_action() is not normally called, unless BQL limit is hit and/or
some qdiscs with throttling (HTB, TBF, FQ, ...)

> 
> I just see it as an extension of your solution, as we still need the
> driver to figure out then the doorbell/flush can be delayed.
> p.s. don't be discouraged by this feedback, I'm just very excited and
> happy that your are working on a solution in this area. As this is a
> problem area that I've not been able to solve myself for the last
> approx 2 years. Keep up the good work!

Do not worry, I appreciate the feedbacks ;)

BTW, if you are doing tests on mlx4 40Gbit, would you check the
following quick/dirty hack, using lots of low-rate flows ?

mlx4 has a really hard time transmitting small TSO packets (2 or 3 MSS)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 12ea3405f442..96940666abd3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct  net_device *dev,
        queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
 }
 
+static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments are segmented */
+module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
+MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
+
+
 static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
                                                struct net_device *dev,
                                                netdev_features_t features)
@@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
                    (udp_hdr(skb)->dest != priv->vxlan_port))
                        features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
        }
+       if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
+               features &= ~NETIF_F_GSO_MASK;
 
        return features;
 }

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 17:04                                             ` Eric Dumazet
@ 2016-12-01 19:17                                               ` Jesper Dangaard Brouer
  2016-12-01 20:11                                                 ` Eric Dumazet
  2016-12-01 20:20                                               ` David Miller
  2016-12-02 14:23                                               ` Eric Dumazet
  2 siblings, 1 reply; 60+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-01 19:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan, brouer


On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> BTW, if you are doing tests on mlx4 40Gbit,

I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
machines too.

>  would you check the
> following quick/dirty hack, using lots of low-rate flows ?

What tool should I use to send "low-rate flows"?

And what am I looking for?

> mlx4 has really hard time to transmit small TSO packets (2 or 3 MSS)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442..96940666abd3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct  net_device *dev,
>         queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
>  }
>  
> +static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments are segmented */
> +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
> +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
> +
> +
>  static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                                                 struct net_device *dev,
>                                                 netdev_features_t features)
> @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                     (udp_hdr(skb)->dest != priv->vxlan_port))
>                         features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>         }
> +       if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
> +               features &= ~NETIF_F_GSO_MASK;
>  
>         return features;
>  }
 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01  5:03                                               ` Tom Herbert
@ 2016-12-01 19:24                                                 ` Willem de Bruijn
  0 siblings, 0 replies; 60+ messages in thread
From: Willem de Bruijn @ 2016-12-01 19:24 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Willem de Bruijn,
	Rick Jones, Linux Kernel Network Developers, Saeed Mahameed,
	Tariq Toukan, Achiad Shochat

>>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>>> > and RX work.

This problem is somewhat tangential to the doorbell avoidance discussion.

>>> >
>>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>>> > thousands of TCP sockets.
>>> >
>> We have thousands of applications, and some of them 'kind of multicast'
>> events to a broad number of TCP sockets.
>>
>> Very often, applications writers use a single thread for doing this,
>> when all they need is to send small packets to 10,000 sockets, and they
>> do not really care of doing this very fast. They also do not want to
>> hurt other applications sharing the NIC.
>>
>> Very often, process scheduler will also run this single thread in a
>> single cpu, ie avoiding expensive migrations if they are not needed.
>>
>> Problem is this behavior locks one TX queue for the duration of the
>> multicast, since XPS will force all the TX packets to go to one TX
>> queue.
>>
> The fact that XPS is forcing TX packets to go over one CPU is a result
> of how you've configured XPS. There are other potentially
> configurations that present different tradeoffs.

Right, XPS supports multiple txqueue mappings, using skb_tx_hash
to decide among them. Unfortunately cross-cpu is more expensive
than tx + completion on the same core, so this is suboptimal for
the common case where there is no excessive load imbalance.

> For instance, TX
> queues are plentiful now days to the point that we could map a number
> of queues to each CPU while respecting NUMA locality between the
> sending CPU and where TX completions occur. If the set is sufficiently
> large this would also helps to address device lock contention. The
> other thing I'm wondering is if Willem's concepts distributing DOS
> attacks in RPS might be applicable in XPS. If we could detect that a
> TX queue is "under attack" maybe we can automatically backoff to
> distributing the load to more queues in XPS somehow.

If only targeting states of imbalance, that indeed could work. For the
10,000 socket case, instead of load balancing qdisc servicing, we
could perhaps modify tx queue selection in __netdev_pick_tx to
choose another queue if the initial choice is paused or otherwise
backlogged.
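
A hypothetical sketch of that fallback (not a tested patch; the helper
standing in for the existing XPS/hash selection is invented):

static u16 netdev_pick_tx_with_fallback(struct net_device *dev,
					struct sk_buff *skb)
{
	/* placeholder for the current XPS / socket-cached / hash logic */
	u16 qidx = netdev_pick_tx_default(dev, skb);
	struct netdev_queue *txq = netdev_get_tx_queue(dev, qidx);

	/* initial choice is backlogged: spread this packet elsewhere.
	 * A real patch would want something smarter than a plain rehash,
	 * which may of course land on another busy queue.
	 */
	if (netif_xmit_frozen_or_drv_stopped(txq))
		qidx = skb_tx_hash(dev, skb);

	return qidx;
}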

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 19:17                                               ` Jesper Dangaard Brouer
@ 2016-12-01 20:11                                                 ` Eric Dumazet
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-12-01 20:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan

On Thu, 2016-12-01 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > BTW, if you are doing tests on mlx4 40Gbit,
> 
> I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
> machines too.
> 
> >  would you check the
> > following quick/dirty hack, using lots of low-rate flows ?
> 
> What tool should I use to send "low-rate flows"?
> 

You could use https://github.com/google/neper

It supports SO_MAX_PACING_RATE, and you could launch 1600 flows, rate
limited to 3028000 bytes per second  (so sending one 2-MSS TSO packet
every ms per flow)
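
If neper is not handy, the pacing part is just a setsockopt.  A minimal
sketch (destination address/port are placeholders; the rate is enforced
by the fq qdisc on the sending side):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47	/* value on most architectures */
#endif

int main(void)
{
	unsigned int rate = 3028000;	/* bytes per second, as suggested above */
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port   = htons(5001),	/* placeholder port */
	};
	char buf[4096] = { 0 };
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);	/* placeholder address */

	if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &rate, sizeof(rate)))
		perror("SO_MAX_PACING_RATE");
	if (connect(fd, (struct sockaddr *)&dst, sizeof(dst))) {
		perror("connect");
		return 1;
	}
	/* pacing spreads this into small (2-3 MSS) TSO packets on the wire */
	while (send(fd, buf, sizeof(buf), 0) > 0)
		;
	close(fd);
	return 0;
}

Run ~1600 of these in parallel to approximate the load described above.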



> And what am I looking for?

Max throughput, in packets per second :/

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 17:04                                             ` Eric Dumazet
  2016-12-01 19:17                                               ` Jesper Dangaard Brouer
@ 2016-12-01 20:20                                               ` David Miller
  2016-12-01 22:10                                                 ` Eric Dumazet
  2016-12-02 14:23                                               ` Eric Dumazet
  2 siblings, 1 reply; 60+ messages in thread
From: David Miller @ 2016-12-01 20:20 UTC (permalink / raw)
  To: eric.dumazet; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 01 Dec 2016 09:04:17 -0800

> On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> 
>> When qdisc layer or trafgen/af_packet see this indication it knows it
>> should/must flush the queue when it don't have more work left.  Perhaps
>> through net_tx_action(), by registering itself and e.g. if qdisc_run()
>> is called and queue is empty then check if queue needs a flush. I would
>> also allow driver to flush and clear this bit.
> 
> net_tx_action() is not normally called, unless BQL limit is hit and/or
> some qdiscs with throttling (HTB, TBF, FQ, ...)

The one thing I wonder about is whether we should "ramp up" into a mode
where the NAPI poll does the doorbells instead of going directly there.

Maybe I misunderstand your algorithm, but it looks to me like if there
are any active packets in the TX queue at enqueue time you will defer
the doorbell to the interrupt handler.

Let's say we put 1 packet in, and hit the doorbell.

Then another packet comes in and we defer the doorbell to the IRQ.

At this point there are a couple things I'm unclear about.

For example, if we didn't hit the doorbell, will the chip still take a
peek at the second descriptor?  Depending upon how the doorbell works
it might, or it might not.

Either way, wouldn't there be a possible condition where the chip
wouldn't see the second enqueued packet and we'd thus have the wire
idle until the interrupt + NAPI runs and hits the doorbell?

This is why I think we should "ramp up" the doorbell deferral, in
order to avoid this potential wire idle time situation.

Maybe the situation I'm worried about is not possible, so please
explain it to me :-)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-11-29  6:58                               ` [WIP] net+mlx4: auto doorbell Eric Dumazet
  2016-11-30 11:38                                 ` Jesper Dangaard Brouer
  2016-11-30 13:50                                 ` Saeed Mahameed
@ 2016-12-01 21:32                                 ` Alexander Duyck
  2016-12-01 22:04                                   ` Eric Dumazet
  2 siblings, 1 reply; 60+ messages in thread
From: Alexander Duyck @ 2016-12-01 21:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed, Tariq Toukan

On Mon, Nov 28, 2016 at 10:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>
>> Not sure if this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much !)
>>
>> Doorbell would be forced only if :
>>
>> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> OR
>> ( too many [1] packets were put in TX ring buffer, no point deferring
>> more)
>>
>> Start the pump, but once it is started, let the doorbells being done by
>> TX completion.
>>
>> ndo_start_xmit and TX completion handler would have to maintain a shared
>> state describing if packets were ready but doorbell deferred.
>>
>>
>> Note that TX completion means "if at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll() would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>>
>> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
>
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:26         eth0      0.00 5822800.00      0.00 597064.41      0.00      0.00      1.00
> 22:43:27         eth0     24.00 5788237.00      2.09 593520.26      0.00      0.00      0.00
> 22:43:28         eth0     12.00 5817777.00      1.43 596551.47      0.00      0.00      1.00
> 22:43:29         eth0     22.00 5841516.00      1.61 598982.87      0.00      0.00      0.00
> 22:43:30         eth0      4.00 4389137.00      0.71 450058.08      0.00      0.00      1.00
> 22:43:31         eth0      4.00 5871008.00      0.72 602007.79      0.00      0.00      0.00
> 22:43:32         eth0     12.00 5891809.00      1.43 604142.60      0.00      0.00      1.00
> 22:43:33         eth0     10.00 5901904.00      1.12 605175.70      0.00      0.00      0.00
> 22:43:34         eth0      5.00 5907982.00      0.69 605798.99      0.00      0.00      1.00
> 22:43:35         eth0      2.00 5847086.00      0.12 599554.71      0.00      0.00      0.00
> Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:47         eth0      9.00 10226424.00      1.02 1048608.05      0.00      0.00      1.00
> 22:43:48         eth0      1.00 10316955.00      0.06 1057890.89      0.00      0.00      0.00
> 22:43:49         eth0      1.00 10310104.00      0.10 1057188.32      0.00      0.00      1.00
> 22:43:50         eth0      0.00 10249423.00      0.00 1050966.23      0.00      0.00      0.00
> 22:43:51         eth0      0.00 10210441.00      0.00 1046969.05      0.00      0.00      1.00
> 22:43:52         eth0      2.00 10198389.00      0.16 1045733.17      0.00      0.00      1.00
> 22:43:53         eth0      8.00 10079257.00      1.43 1033517.83      0.00      0.00      0.00
> 22:43:54         eth0      0.00 7693509.00      0.00 788885.16      0.00      0.00      0.00
> 22:43:55         eth0      2.00 10343076.00      0.20 1060569.32      0.00      0.00      1.00
> 22:43:56         eth0      1.00 10224571.00      0.14 1048417.93      0.00      0.00      0.00
> Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50
>
> And about 11 % improvement on an mono-flow UDP_STREAM test.
>
> skb_set_owner_w() is now the most consuming function.

A few years back when I did something like this on ixgbe I was told by
you that the issue was that doing something like this would add too
much latency.  I was just wondering what the latency impact is on a
change like this and if this now isn't that much of an issue?

>
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:50:47         eth0      3.00 1355422.00      0.45 319706.04      0.00      0.00      0.00
> 22:50:48         eth0      2.00 1344270.00      0.42 317035.21      0.00      0.00      1.00
> 22:50:49         eth0      3.00 1350503.00      0.51 318478.34      0.00      0.00      0.00
> 22:50:50         eth0     29.00 1348593.00      2.86 318113.02      0.00      0.00      1.00
> 22:50:51         eth0     14.00 1354855.00      1.83 319508.56      0.00      0.00      0.00
> 22:50:52         eth0      7.00 1357794.00      0.73 320226.89      0.00      0.00      1.00
> 22:50:53         eth0      5.00 1326130.00      0.63 312784.72      0.00      0.00      0.00
> 22:50:54         eth0      2.00 994584.00      0.12 234598.40      0.00      0.00      1.00
> 22:50:55         eth0      5.00 1318209.00      0.75 310932.46      0.00      0.00      0.00
> 22:50:56         eth0     20.00 1323445.00      1.73 312178.11      0.00      0.00      1.00
> Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:51:03         eth0      4.00 1512055.00      0.54 356599.40      0.00      0.00      0.00
> 22:51:04         eth0      4.00 1507631.00      0.55 355609.46      0.00      0.00      1.00
> 22:51:05         eth0      4.00 1487789.00      0.42 350917.47      0.00      0.00      0.00
> 22:51:06         eth0      7.00 1474460.00      1.22 347811.16      0.00      0.00      1.00
> 22:51:07         eth0      2.00 1496529.00      0.24 352995.18      0.00      0.00      0.00
> 22:51:08         eth0      3.00 1485856.00      0.49 350425.65      0.00      0.00      1.00
> 22:51:09         eth0      1.00 1114808.00      0.06 262905.38      0.00      0.00      0.00
> 22:51:10         eth0      2.00 1510924.00      0.30 356397.53      0.00      0.00      1.00
> 22:51:11         eth0      2.00 1506408.00      0.30 355345.76      0.00      0.00      0.00
> 22:51:12         eth0      2.00 1499122.00      0.32 353668.75      0.00      0.00      1.00
> Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50
>
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>  include/linux/netdevice.h                    |    1
>  net/core/net-sysfs.c                         |   18 +++
>  5 files changed, 83 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6562f78b07f4..fbea83218fc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>
>         if (polled) {
>                 if (doorbell_pending)
> -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>
>                 mlx4_cq_set_ci(&cq->mcq);
>                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>         ring->size = size;
>         ring->size_mask = size - 1;
>         ring->sp_stride = stride;
> -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>

Just wondering if this should be a separate change?  It seems like
something that could apply outside of your series, since it now
reserves enough room for two maximum-sized descriptors.

>         tmp = size * sizeof(struct mlx4_en_tx_info);
>         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>         ring->sp_cqn = cq;
>         ring->prod = 0;
>         ring->cons = 0xffffffff;
> +       ring->ncons = 0;
>         ring->last_nr_txbb = 1;
>         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>         memset(ring->buf, 0, ring->buf_size);
> @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>  }
>
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -       return ring->prod - ring->cons > ring->full_size;
> +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
>

With the use of READ_ONCE, are there some barriers that will need to
be added to make sure these values are observed in the correct order?
Are there any ordering constraints on the updates to these fields?
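
For reference, the generic pattern I am thinking of looks roughly like
this (a toy ring, not the mlx4 structures, just to make the ordering
question concrete):

  /* Toy example only: READ_ONCE()/WRITE_ONCE() prevent torn or refetched
   * accesses, but give no ordering by themselves, so the descriptor vs.
   * index ordering still needs explicit barriers somewhere.
   */
  struct toy_ring {
          u32     prod;   /* written by the xmit path       */
          u32     cons;   /* written by the completion path */
  };

  static void toy_publish(struct toy_ring *r, u32 nr_txbb)
  {
          /* ... descriptor writes go here ... */
          smp_wmb();      /* make descriptors visible before the index */
          WRITE_ONCE(r->prod, r->prod + nr_txbb);
  }

  static bool toy_ring_full(const struct toy_ring *r, u32 full_size)
  {
          return READ_ONCE(r->prod) - READ_ONCE(r->cons) > full_size;
  }

So the question is really whether the wmb()/dma_wmb() calls already in
the driver give you that pairing, or whether something more is needed
around the new ncons/prod_bell updates.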

>  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>
>         /* Skip last polled descriptor */
>         ring->cons += ring->last_nr_txbb;
> +       ring->ncons += ring->last_nr_txbb;
>         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>                  ring->cons, ring->prod);
>
> @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>                                                 !!(ring->cons & ring->size), 0,
>                                                 0 /* Non-NAPI caller */);
>                 ring->cons += ring->last_nr_txbb;
> +               ring->ncons += ring->last_nr_txbb;
>                 cnt++;
>         }
>
> @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>         return cnt;
>  }
>
> +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> +                          struct mlx4_en_tx_ring *ring)
> +{
> +
> +       if (dev->doorbell_opt & 1) {
> +               u32 oval = READ_ONCE(ring->prod_bell);
> +               u32 nval = READ_ONCE(ring->prod);
> +
> +               if (oval == nval)
> +                       return;
> +
> +               /* I can not tell yet if a cmpxchg() is needed or not */
> +               if (dev->doorbell_opt & 2)
> +                       WRITE_ONCE(ring->prod_bell, nval);
> +               else
> +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> +                               return;
> +       }
> +       /* Since there is no iowrite*_native() that writes the
> +        * value as is, without byteswapping - using the one
> +        * the doesn't do byteswapping in the relevant arch
> +        * endianness.
> +        */
> +#if defined(__LITTLE_ENDIAN)
> +       iowrite32(
> +#else
> +       iowrite32be(
> +#endif
> +                 ring->doorbell_qpn,
> +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> +}
> +

I realize you are just copying from the function further down, but
couldn't this just be __raw_writel instead of iowrite32?

Also, you may be able to squeeze a bit more out of this with some
barrier clean-ups.  In most drivers the full wmb() is only needed
between the writes to DMA (descriptor) memory and the MMIO doorbell
write.  So odds are you could hold off on the wmb() until this function
is called, and only issue it just before the iowrite32/__raw_writel;
the rest of the path could use either an smp_wmb() or dma_wmb().
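
Roughly what I mean, as an untested sketch (the helper and field names
here are made up, this is not a patch against your WIP):

  /* Keep the expensive wmb() inside the doorbell helper itself, right
   * before the MMIO write it actually orders against.
   */
  static void toy_xmit_doorbell(struct toy_tx_ring *ring)
  {
          wmb();          /* order DMA descriptor writes vs. the MMIO write */
          __raw_writel(ring->doorbell_qpn,
                       ring->uar_map + TOY_SEND_DOORBELL);
  }

  /* The descriptor path then only needs the cheaper dma_wmb(): */
          tx_desc->ctrl.qpn_vlan = qpn_vlan;
          dma_wmb();      /* device must see data before the owner bit */
          tx_desc->ctrl.owner_opcode = op_own;
          if (send_doorbell)
                  toy_xmit_doorbell(ring);

That way the full wmb() is only paid on the calls that actually ring
the doorbell.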

>  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>                                   struct mlx4_en_cq *cq, int napi_budget)
>  {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>         wmb();
>
>         /* we want to dirty this cache line once */
> -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> +       ring_cons += txbbs_skipped;
> +       WRITE_ONCE(ring->cons, ring_cons);
> +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> +       if (dev->doorbell_opt)
> +               mlx4_en_xmit_doorbell(dev, ring);
>

It might be more readable to move the ring_cons addition out in front
of all the WRITE_ONCE() calls.
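
i.e. something along the lines of (same statements, just reordered):

          ring_cons += txbbs_skipped;

          /* we want to dirty this cache line once */
          WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
          WRITE_ONCE(ring->cons, ring_cons);
          WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);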

>         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>                 return done < budget;
> @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>         __iowrite64_copy(dst, src, bytecnt / 8);
>  }
>
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> -{
> -       wmb();
> -       /* Since there is no iowrite*_native() that writes the
> -        * value as is, without byteswapping - using the one
> -        * the doesn't do byteswapping in the relevant arch
> -        * endianness.
> -        */
> -#if defined(__LITTLE_ENDIAN)
> -       iowrite32(
> -#else
> -       iowrite32be(
> -#endif
> -                 ring->doorbell_qpn,
> -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> -}
>
>  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                                   struct mlx4_en_tx_desc *tx_desc,
>                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>                                   int desc_size, int bf_index,
>                                   __be32 op_own, bool bf_ok,
> -                                 bool send_doorbell)
> +                                 bool send_doorbell,
> +                                 const struct net_device *dev, int nr_txbb)
>  {
>         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>
> @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>
>                 wmb();
>
> +               ring->prod += nr_txbb;
>                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>                              desc_size);
>
> @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                  */
>                 dma_wmb();
>                 tx_desc->ctrl.owner_opcode = op_own;
> +               ring->prod += nr_txbb;
>                 if (send_doorbell)
> -                       mlx4_en_xmit_doorbell(ring);
> +                       mlx4_en_xmit_doorbell(dev, ring);
>                 else
>                         ring->xmit_more++;
>         }
> @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>         }
>
> -       ring->prod += nr_txbb;
> -
>         /* If we used a bounce buffer then copy descriptor back into place */
>         if (unlikely(bounce))
>                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +        * will happen shortly.
> +        */
> +       if (send_doorbell &&
> +           dev->doorbell_opt &&
> +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> +               send_doorbell = false;
> +
>         real_size = (real_size / 16) & 0x3f;
>
>         bf_ok &= desc_size <= MAX_BF && send_doorbell;
> @@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>
>         if (unlikely(stop_queue)) {
>                 /* If queue was emptied after the if (stop_queue) , and before
> @@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                  */
>                 smp_rmb();
>
> -               ring_cons = ACCESS_ONCE(ring->cons);
>                 if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
>                         netif_tx_wake_queue(ring->tx_queue);
>                         ring->wake_queue++;
> @@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>         rx_ring->xdp_tx++;
>         AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
>
> -       ring->prod += nr_txbb;
> -
>         stop_queue = mlx4_en_is_tx_ring_full(ring);
>         send_doorbell = stop_queue ||
>                                 *doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
> @@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>         *doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
>
>         return NETDEV_TX_OK;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>          */
>         u32                     last_nr_txbb;
>         u32                     cons;
> +       u32                     ncons;
>         unsigned long           wake_queue;
>         struct netdev_queue     *tx_queue;
>         u32                     (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
>         /* cache line used and dirtied in mlx4_en_xmit() */
>         u32                     prod ____cacheline_aligned_in_smp;
> +       u32                     prod_bell;
>         unsigned int            tx_dropped;
>         unsigned long           bytes;
>         unsigned long           packets;
> @@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                                struct mlx4_en_rx_alloc *frame,
>                                struct net_device *dev, unsigned int length,
>                                int tx_ind, int *doorbell_pending);
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
> +void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
>  bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>                         struct mlx4_en_rx_alloc *frame);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4ffcd874cc20..39565b5425a6 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1816,6 +1816,7 @@ struct net_device {
>         DECLARE_HASHTABLE       (qdisc_hash, 4);
>  #endif
>         unsigned long           tx_queue_len;
> +       unsigned long           doorbell_opt;
>         spinlock_t              tx_global_lock;
>         int                     watchdog_timeo;
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b0c04cf4851d..df05f81f5150 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
>  }
>  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
> +static int change_doorbell_opt(struct net_device *dev, unsigned long val)
> +{
> +       dev->doorbell_opt = val;
> +       return 0;
> +}
> +
> +static ssize_t doorbell_opt_store(struct device *dev,
> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t len)
> +{
> +       if (!capable(CAP_NET_ADMIN))
> +               return -EPERM;
> +
> +       return netdev_store(dev, attr, buf, len, change_doorbell_opt);
> +}
> +NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
> +
>  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
>                              const char *buf, size_t len)
>  {
> @@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
>         &dev_attr_phys_port_name.attr,
>         &dev_attr_phys_switch_id.attr,
>         &dev_attr_proto_down.attr,
> +       &dev_attr_doorbell_opt.attr,
>         NULL,
>  };
>  ATTRIBUTE_GROUPS(net_class);
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 21:32                                 ` Alexander Duyck
@ 2016-12-01 22:04                                   ` Eric Dumazet
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-12-01 22:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed, Tariq Toukan

On Thu, 2016-12-01 at 13:32 -0800, Alexander Duyck wrote:

> A few years back when I did something like this on ixgbe I was told by
> you that the issue was that doing something like this would add too
> much latency.  I was just wondering what the latency impact is on a
> change like this and if this now isn't that much of an issue?

I believe it is clear that we cannot use this trick without an
explicit admin opt-in.

In my experience, mlx4 can deliver way more interrupts than ixgbe.
(No idea why)

This "auto doorbell" is tied to proper "ethtool -c tx-usecs|tx-frames|
tx-frames-irq", or simply a choice for hosts dedicated to content
delivery (like video servers) with 40GBit+ cards.

Under stress, softirq can be handled by ksoftirqd, and might be delayed
by ~10 ms or even more.

This WIP was minimal, but we certainly need to add belts and suspenders.

1) <maybe> timestamps
  use ktime_get_ns() to remember the timestamp of the last doorbell,
  and force a doorbell if it gets too old; checked in ndo_start_xmit()
  every time we would otherwise skip the doorbell.

2) <maybe> a maximum number of packets not yet doorbelled (a rough
   sketch of both checks follows below).

But we cannot rely on another high-resolution timer, since those also
require softirqs to be processed in a timely manner.
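
As a rough, untested sketch of 1) + 2) on the xmit path (the
last_bell_ns field and the MAX_BELL_AGE_NS/MAX_BELL_PKTS limits are
invented names, only for illustration):

          /* checked every time we would like to skip the doorbell */
          if (!send_doorbell) {
                  u64 now = ktime_get_ns();

                  if (now - ring->last_bell_ns > MAX_BELL_AGE_NS ||        /* 1) */
                      ring->prod - READ_ONCE(ring->prod_bell) > MAX_BELL_PKTS) /* 2) */
                          send_doorbell = true;   /* force the doorbell */
          }
          if (send_doorbell)
                  ring->last_bell_ns = ktime_get_ns();

The real limits would probably be derived from "ethtool -c" params, as
for the existing threshold.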

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 20:20                                               ` David Miller
@ 2016-12-01 22:10                                                 ` Eric Dumazet
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-12-01 22:10 UTC (permalink / raw)
  To: David Miller; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt

On Thu, 2016-12-01 at 15:20 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 01 Dec 2016 09:04:17 -0800
> 
> > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> > 
> >> When qdisc layer or trafgen/af_packet see this indication it knows it
> >> should/must flush the queue when it don't have more work left.  Perhaps
> >> through net_tx_action(), by registering itself and e.g. if qdisc_run()
> >> is called and queue is empty then check if queue needs a flush. I would
> >> also allow driver to flush and clear this bit.
> > 
> > net_tx_action() is not normally called, unless BQL limit is hit and/or
> > some qdiscs with throttling (HTB, TBF, FQ, ...)
> 
> The one thing I wonder about is whether we should "ramp up" into a mode
> where the NAPI poll does the doorbells instead of going directly there.
> 
> Maybe I misunderstand your algorithm, but it looks to me like if there
> are any active packets in the TX queue at enqueue time you will defer
> the doorbell to the interrupt handler.
> 
> Let's say we put 1 packet in, and hit the doorbell.
> 
> Then another packet comes in and we defer the doorbell to the IRQ.
> 
> At this point there are a couple things I'm unclear about.


> 
> For example, if we didn't hit the doorbell, will the chip still take a
> peek at the second descriptor?  Depending upon how the doorbell works
> it might, or it might not.

It might depend on the hardware. I can easily check on mlx4, by
increasing tx-usecs and tx-frames, and sending 2 packets back to back.

> 
> Either way, wouldn't there be a possible condition where the chip
> wouldn't see the second enqueued packet and we'd thus have the wire
> idle until the interrupt + NAPI runs and hits the doorbell?
> 
> This is why I think we should "ramp up" the doorbell deferral, in
> order to avoid this potential wire idle time situation.


> 
> Maybe the situation I'm worried about is not possible, so please
> explain it to me :-)

This is absolutely the problem.  We might need to enable this mode only
above a given load.  We could keep an EWMA of the number of packets each
TX completion run dequeues, and enable the auto doorbell only when at
least that many packets are sitting in the TX ring (instead of the
"1 packet" threshold of the WIP).

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01 17:04                                             ` Eric Dumazet
  2016-12-01 19:17                                               ` Jesper Dangaard Brouer
  2016-12-01 20:20                                               ` David Miller
@ 2016-12-02 14:23                                               ` Eric Dumazet
  2 siblings, 0 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-12-02 14:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan

On Thu, 2016-12-01 at 09:04 -0800, Eric Dumazet wrote:
> On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> 
> > I think you misunderstood my concept[1].  I don't want to stop the
> > queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue, is
> > it just indicating that someone need to flush/ring-doorbell.  Maybe it
> > need another name, because it also indicate that the driver can see
> > that its TX queue is so busy that we don't need to call it immediately.
> > The qdisc layer can then choose to enqueue instead if doing direct xmit.
> 
> But driver ndo_start_xmit() does not have a pointer to qdisc.
> 
> Also the concept of 'queue busy' just because we queued one packet is a
> bit flaky.
> 
> > 
> > When qdisc layer or trafgen/af_packet see this indication it knows it
> > should/must flush the queue when it don't have more work left.  Perhaps
> > through net_tx_action(), by registering itself and e.g. if qdisc_run()
> > is called and queue is empty then check if queue needs a flush. I would
> > also allow driver to flush and clear this bit.
> 
> net_tx_action() is not normally called, unless BQL limit is hit and/or
> some qdiscs with throttling (HTB, TBF, FQ, ...)
> 
> > 
> > I just see it as an extension of your solution, as we still need the
> > driver to figure out then the doorbell/flush can be delayed.
> > p.s. don't be discouraged by this feedback, I'm just very excited and
> > happy that your are working on a solution in this area. As this is a
> > problem area that I've not been able to solve myself for the last
> > approx 2 years. Keep up the good work!
> 
> Do not worry, I appreciate the feedbacks ;)
> 
> BTW, if you are doing tests on mlx4 40Gbit, would you check the
> following quick/dirty hack, using lots of low-rate flows ?
> 
> mlx4 has really hard time to transmit small TSO packets (2 or 3 MSS)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442..96940666abd3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct  net_device *dev,
>         queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
>  }
>  
> +static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments are segmented */
> +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
> +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
> +
> +
>  static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                                                 struct net_device *dev,
>                                                 netdev_features_t features)
> @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                     (udp_hdr(skb)->dest != priv->vxlan_port))
>                         features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>         }
> +       if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
> +               features &= NETIF_F_GSO_MASK;


Sorry, stupid typo.
This should be "features &= ~NETIF_F_GSO_MASK;" of course

>  
>         return features;
>  }
> 
> 
> 
> 
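
So with the typo fixed, the hunk reads:

          if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
                  features &= ~NETIF_F_GSO_MASK;

i.e. TSO packets with fewer than mlx4_gso_segs_min segments lose the
GSO feature bits and are segmented in software before reaching the
driver, as the module parameter description says.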

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP] net+mlx4: auto doorbell
  2016-12-01  2:50                                               ` Eric Dumazet
@ 2016-12-02 18:16                                                 ` Eric Dumazet
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Dumazet @ 2016-12-02 18:16 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat

On Wed, 2016-11-30 at 18:50 -0800, Eric Dumazet wrote:
> On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:
> 
> > I simply suggest we try to queue the qdisc for further servicing as we
> > do today, from net_tx_action(), but we might use a different bit, so
> > that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> > before we grab it from net_tx_action(), maybe 100 usec later (time to
> > flush all skbs queued in napi_consume_skb() and maybe RX processing,
> > since most NIC handle TX completion before doing RX processing from thei
> > napi poll handler.
> > 
> > Should be doable with few changes in __netif_schedule() and
> > net_tx_action(), plus some control paths that will need to take care of
> > the new bit at dismantle time, right ?
> 
> Hmm... this is silly. Code already implements a different bit.
> 
> qdisc_run() seems to run more often from net_tx_action(), I have to find
> out why.

After more analysis I believe TSQ was one of the bottlenecks.

I prepared a patch series that helped my use cases.

^ permalink raw reply	[flat|nested] 60+ messages in thread
