netdev.vger.kernel.org archive mirror
* Re: Bad XDP performance with mlx5
       [not found]         ` <2218141a-7026-1cb8-c594-37e38eef7b15@kth.se>
@ 2019-05-31 16:18           ` Jesper Dangaard Brouer
  2019-05-31 18:00             ` David Miller
  2019-05-31 18:06             ` Saeed Mahameed
  0 siblings, 2 replies; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-31 16:18 UTC (permalink / raw)
  To: Tom Barbette
  Cc: xdp-newbies, Toke Høiland-Jørgensen, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, brouer, netdev


On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette <barbette@kth.se> wrote:

> CCing mlx5 maintainers and committers of bce2b2b. TL;DR: there is a huge
> CPU increase on CX5 when introducing an XDP program.
>
> See https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be
> around 0:40. We're talking something like 15% while it's near 0 for
> other drivers. The machine is a recent Skylake. For us it makes XDP
> unusable. Is that a known problem?

I have a similar test setup, and I can reproduce. I have found the
root cause, see below.  But on my system it was even worse: with an
XDP_PASS program loaded and iperf (6 parallel TCP flows) I would see
100% CPU usage and a total of 83.3 Gbits/sec. In the non-XDP case, I saw
58% CPU (43% idle) and a total of 89.7 Gbits/sec.
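
(For reference, the "XDP_PASS program" in a test like this can be as small
as the sketch below.  This is only an illustrative stand-in, not the exact
program from either setup, and the bpf_helpers.h include path depends on
your libbpf version:)

/* Minimal XDP program: pass every packet on to the normal stack.
 * Even a trivial program like this makes the driver switch to the
 * XDP RX memory model (one page per frame), which is what the rest
 * of this thread is about. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_pass(struct xdp_md *ctx)
{
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

(Attach with e.g. "ip link set dev <ifname> xdp obj xdp_pass.o sec xdp".)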

 
> I wonder if it doesn't simply come from mlx5/en_main.c:
> rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> 

Nope, that is not the problem.

> Which would be in line with my observation that memory access seems
> heavier. I guess this is for the XDP_TX case.
> 
> If this is indeed the problem, any chance we can:
> a) detect automatically that a program will not return XDP_TX (I'm not
> quite sure about what the BPF limitations allow to guess in advance) or
> b) add a flag such as XDP_FLAGS_NO_TX to avoid such a performance hit
> when not needed?

This was kind of hard to root-cause, but I solved it by increasing the TCP
socket buffer (window) size used by the iperf tool, like this (please reproduce):

$ iperf -s --window 4M
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  416 KByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------

Given I could reproduce, I took a closer look at perf record/report stats,
and it was actually quite clear that this was related to stalling on getting
pages from the page allocator (get_page_from_freelist and free_pcppages_bulk
were among the top #6 function calls).

Using my tool: ethtool_stats.pl
 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

It was clear that the mlx5 driver page-cache was not working:
 Ethtool(mlx5p1  ) stat:     6653761 (   6,653,761) <= rx_cache_busy /sec
 Ethtool(mlx5p1  ) stat:     6653732 (   6,653,732) <= rx_cache_full /sec
 Ethtool(mlx5p1  ) stat:      669481 (     669,481) <= rx_cache_reuse /sec
 Ethtool(mlx5p1  ) stat:           1 (           1) <= rx_congst_umr /sec
 Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_csum_unnecessary /sec
 Ethtool(mlx5p1  ) stat:        1034 (       1,034) <= rx_discards_phy /sec
 Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_packets /sec
 Ethtool(mlx5p1  ) stat:     7324244 (   7,324,244) <= rx_packets_phy /sec

While the non-XDP case looked like this:
 Ethtool(mlx5p1  ) stat:      298929 (     298,929) <= rx_cache_busy /sec
 Ethtool(mlx5p1  ) stat:      298971 (     298,971) <= rx_cache_full /sec
 Ethtool(mlx5p1  ) stat:     3548789 (   3,548,789) <= rx_cache_reuse /sec
 Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_csum_complete /sec
 Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_packets /sec
 Ethtool(mlx5p1  ) stat:     7695169 (   7,695,169) <= rx_packets_phy /sec
Manual consistency calc: 7695476-((3548789*2)+(298971*2)) = -44

With the increased TCP window size, the mlx5 driver cache is working better,
but not optimally, see below. I'm getting 88.0 Gbits/sec with 68% CPU usage.
 Ethtool(mlx5p1  ) stat:      894438 (     894,438) <= rx_cache_busy /sec
 Ethtool(mlx5p1  ) stat:      894453 (     894,453) <= rx_cache_full /sec
 Ethtool(mlx5p1  ) stat:     6638518 (   6,638,518) <= rx_cache_reuse /sec
 Ethtool(mlx5p1  ) stat:           6 (           6) <= rx_congst_umr /sec
 Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_csum_unnecessary /sec
 Ethtool(mlx5p1  ) stat:         164 (         164) <= rx_discards_phy /sec
 Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_packets /sec
 Ethtool(mlx5p1  ) stat:     7533193 (   7,533,193) <= rx_packets_phy /sec
Manual consistency calc: 7532983-(6638518+894453) = 12

To understand why this is happening, you first have to know that the
difference lies in the two RX-memory modes used by mlx5 for non-XDP vs.
XDP. With non-XDP, two frames are stored per memory page, while for XDP
only a single frame per page is used.  The number of packets available in
the RX-rings is actually the same, as the ring sizes are non-XDP=512 vs.
XDP=1024.

I believe the real issue is that TCP uses the SKB->truesize (based on
frame size) for its memory-pressure and window calculations, which is
why increasing the window size manually solved the issue.
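
A rough, purely illustrative calculation (assuming 4 KiB pages, so the
truesize per frame roughly doubles once XDP is loaded):

 non-XDP: ~2048 bytes truesize/frame -> 416K window / 2048 ~= 208 packets
 XDP:     ~4096 bytes truesize/frame -> 416K window / 4096 ~= 104 packets

So with the default 416 KByte window the receiver's memory accounting
allows roughly half as many packets in flight under XDP, which fits with
why bumping --window to 4M hides most of the problem.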

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: Bad XDP performance with mlx5
  2019-05-31 16:18           ` Bad XDP performance with mlx5 Jesper Dangaard Brouer
@ 2019-05-31 18:00             ` David Miller
  2019-05-31 18:06             ` Saeed Mahameed
  1 sibling, 0 replies; 5+ messages in thread
From: David Miller @ 2019-05-31 18:00 UTC (permalink / raw)
  To: brouer; +Cc: barbette, xdp-newbies, toke, saeedm, leonro, tariqt, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 31 May 2019 18:18:17 +0200

> On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette <barbette@kth.se> wrote:
> 
>> I wonder if it doesn't simply come from mlx5/en_main.c:
>> rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
>> 
> 
> Nope, that is not the problem.

And it's easy to test this theory by forcing DMA_FROM_DEVICE.
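
(A one-line test hack along these lines in mlx5/en_main.c would do it --
untested sketch, not a submitted patch, and XDP_TX would not be safe with
it:)

	/* Test hack: always map RX buffers DMA_FROM_DEVICE, even with an
	 * XDP program attached, to rule out the bidirectional DMA mapping
	 * as the source of the overhead. */
	rq->buff.map_dir = DMA_FROM_DEVICE;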


* Re: Bad XDP performance with mlx5
  2019-05-31 16:18           ` Bad XDP performance with mlx5 Jesper Dangaard Brouer
  2019-05-31 18:00             ` David Miller
@ 2019-05-31 18:06             ` Saeed Mahameed
  2019-05-31 21:57               ` Jesper Dangaard Brouer
       [not found]               ` <9f116335-0fad-079b-4070-89f24af4ab55@kth.se>
  1 sibling, 2 replies; 5+ messages in thread
From: Saeed Mahameed @ 2019-05-31 18:06 UTC (permalink / raw)
  To: barbette, brouer; +Cc: toke, xdp-newbies, netdev, Leon Romanovsky, Tariq Toukan

On Fri, 2019-05-31 at 18:18 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette <barbette@kth.se>
> wrote:
> 
> > CCing mlx5 maintainers and committers of bce2b2b. TL;DR: there is a
> > huge CPU increase on CX5 when introducing an XDP program.
> > 
> > See https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be
> > around 0:40. We're talking something like 15% while it's near 0 for
> > other drivers. The machine is a recent Skylake. For us it makes XDP
> > unusable. Is that a known problem?
> 

The question is: on the same packet rate/bandwidth, do you see higher
CPU utilization on mlx5 compared to other drivers? You have to compare
apples to apples.


> I have a similar test setup, and I can reproduce. I have found the
> root cause, see below.  But on my system it was even worse: with an
> XDP_PASS program loaded and iperf (6 parallel TCP flows) I would see
> 100% CPU usage and a total of 83.3 Gbits/sec. In the non-XDP case, I
> saw 58% CPU (43% idle) and a total of 89.7 Gbits/sec.
> 
>  
> > I wonder if it doesn't simply come from mlx5/en_main.c:
> > rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> > 
> 
> Nope, that is not the problem.
> 
> > Which would be in line with my observation that memory access seems
> > heavier. I guess this is for the XDP_TX case.
> > 
> > If this is indeed the problem, any chance we can:
> > a) detect automatically that a program will not return XDP_TX (I'm not
> > quite sure about what the BPF limitations allow to guess in advance) or
> > b) add a flag such as XDP_FLAGS_NO_TX to avoid such a performance hit
> > when not needed?
> 
> This was kind of hard to root-cause, but I solved it by increasing the
> TCP socket buffer (window) size used by the iperf tool, like this
> (please reproduce):
> 
> $ iperf -s --window 4M
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size:  416 KByte (WARNING: requested 4.00 MByte)
> ------------------------------------------------------------
> 
> Given I could reproduce, I took a closer look at perf record/report
> stats, and it was actually quite clear that this was related to stalling
> on getting pages from the page allocator (get_page_from_freelist and
> free_pcppages_bulk were among the top #6 function calls).
> 
> Using my tool: ethtool_stats.pl
>  
> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> 
> It was clear that the mlx5 driver page-cache was not working:
>  Ethtool(mlx5p1  ) stat:     6653761 (   6,653,761) <= rx_cache_busy /sec
>  Ethtool(mlx5p1  ) stat:     6653732 (   6,653,732) <= rx_cache_full /sec
>  Ethtool(mlx5p1  ) stat:      669481 (     669,481) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1  ) stat:           1 (           1) <= rx_congst_umr /sec
>  Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_csum_unnecessary /sec
>  Ethtool(mlx5p1  ) stat:        1034 (       1,034) <= rx_discards_phy /sec
>  Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_packets /sec
>  Ethtool(mlx5p1  ) stat:     7324244 (   7,324,244) <= rx_packets_phy /sec
> 
> While the non-XDP case looked like this:
>  Ethtool(mlx5p1  ) stat:      298929 (     298,929) <= rx_cache_busy /sec
>  Ethtool(mlx5p1  ) stat:      298971 (     298,971) <= rx_cache_full /sec
>  Ethtool(mlx5p1  ) stat:     3548789 (   3,548,789) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_csum_complete /sec
>  Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_packets /sec
>  Ethtool(mlx5p1  ) stat:     7695169 (   7,695,169) <= rx_packets_phy /sec
> Manual consistency calc: 7695476-((3548789*2)+(298971*2)) = -44
> 
> With the increased TCP window size, the mlx5 driver cache is working
> better, but not optimally, see below. I'm getting 88.0 Gbits/sec with
> 68% CPU usage.
>  Ethtool(mlx5p1  ) stat:      894438 (     894,438) <= rx_cache_busy /sec
>  Ethtool(mlx5p1  ) stat:      894453 (     894,453) <= rx_cache_full /sec
>  Ethtool(mlx5p1  ) stat:     6638518 (   6,638,518) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1  ) stat:           6 (           6) <= rx_congst_umr /sec
>  Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_csum_unnecessary /sec
>  Ethtool(mlx5p1  ) stat:         164 (         164) <= rx_discards_phy /sec
>  Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_packets /sec
>  Ethtool(mlx5p1  ) stat:     7533193 (   7,533,193) <= rx_packets_phy /sec
> Manual consistency calc: 7532983-(6638518+894453) = 12
> 
> To understand why this is happening, you first have to know that the
> difference lies in the two RX-memory modes used by mlx5 for non-XDP
> vs. XDP. With non-XDP, two frames are stored per memory page, while
> for XDP only a single frame per page is used.  The number of packets
> available in the RX-rings is actually the same, as the ring sizes are
> non-XDP=512 vs. XDP=1024.
> 

Thanks Jesper! This was a well put together explanation.
I want to point out that some other drivers are using the alloc_skb
APIs, which provide a good caching mechanism that is even better than
the mlx5 internal one (which uses the alloc_page APIs directly); this
can explain the difference, and your explanation shows the root cause
of the higher CPU utilization with XDP on mlx5, since the mlx5 page
cache works at half of its capacity when XDP is enabled.
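
(Roughly the contrast being described -- an illustrative sketch with
placeholder variables, not actual driver code:)

	/* Many drivers build RX SKBs via the napi allocator, which refills
	 * from a per-CPU page-frag cache: */
	skb = napi_alloc_skb(napi, frag_len);

	/* mlx5 instead allocates whole pages directly and only recycles
	 * them through its own small per-RQ page cache (the rx_cache_*
	 * counters above): */
	page = dev_alloc_pages(0);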

Now, do we really need to keep this page-per-packet in mlx5 when XDP is
enabled? I think it is time to drop that...

> I believe the real issue is that TCP uses the SKB->truesize (based on
> frame size) for its memory-pressure and window calculations, which is
> why increasing the window size manually solved the issue.
> 


* Re: Bad XDP performance with mlx5
  2019-05-31 18:06             ` Saeed Mahameed
@ 2019-05-31 21:57               ` Jesper Dangaard Brouer
       [not found]               ` <9f116335-0fad-079b-4070-89f24af4ab55@kth.se>
  1 sibling, 0 replies; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-31 21:57 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: barbette, toke, xdp-newbies, netdev, Leon Romanovsky,
	Tariq Toukan, brouer


On Fri, 31 May 2019 18:06:01 +0000 Saeed Mahameed <saeedm@mellanox.com> wrote:

> On Fri, 2019-05-31 at 18:18 +0200, Jesper Dangaard Brouer wrote:
[...]
> > 
> > To understand why this is happening, you first have to know that the
> > difference lies in the two RX-memory modes used by mlx5 for non-XDP
> > vs. XDP. With non-XDP, two frames are stored per memory page, while
> > for XDP only a single frame per page is used.  The number of packets
> > available in the RX-rings is actually the same, as the ring sizes are
> > non-XDP=512 vs. XDP=1024.
> 
> Thanks Jesper! This was a well put together explanation.
> I want to point out that some other drivers are using the alloc_skb
> APIs, which provide a good caching mechanism that is even better than
> the mlx5 internal one (which uses the alloc_page APIs directly); this
> can explain the difference, and your explanation shows the root cause
> of the higher CPU utilization with XDP on mlx5, since the mlx5 page
> cache works at half of its capacity when XDP is enabled.
> 
> Now, do we really need to keep this page-per-packet in mlx5 when XDP
> is enabled? I think it is time to drop that...

No, we need to keep the page per packet (at least until I've solved
some corner-cases with page_pool, which will likely require getting a
page-flag).

> > I believe the real issue is that TCP uses the SKB->truesize (based
> > on frame size) for its memory-pressure and window calculations,
> > which is why increasing the window size manually solved the issue.

The TCP performance issue is not solely an SKB->truesize issue, but also
an issue with how the driver-level page-cache works.  It is actually
very fragile, as a single page with an elevated refcnt can block the
cache (see mlx5e_rx_cache_get()).  This easily happens with TCP packets
that are waiting to be re-transmitted in case of loss.  That is what is
happening here, as indicated by rx_cache_busy and rx_cache_full being
the same.
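
(A simplified sketch of that check -- not the exact upstream code, but the
shape of it:)

	/* The page cache is a small ring of recently used pages.  A page
	 * can only be reused if nobody else still holds a reference, so a
	 * single page parked at the head with refcount > 1 (e.g. still
	 * sitting in a TCP retransmit queue) stalls all reuse: */
	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
		stats->cache_busy++;
		return false;	/* fall back to the page allocator */
	}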

We (Ilias, Tariq and I) have been planning to remove this small driver
cache and instead use the page_pool, and create a page-return path for
SKBs, which should make this problem go away.  I'm going to be working
on this for the next couple of weeks (the tricky part is all the corner
cases).
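
(For the curious, the direction is roughly the existing page_pool API --
a sketch assuming the current API, with ring_size/numa_node/pdev as
placeholders:)

	#include <net/page_pool.h>

	struct page_pool_params pp = {
		.order     = 0,               /* one page per packet */
		.flags     = PP_FLAG_DMA_MAP, /* pool handles DMA mapping */
		.pool_size = ring_size,       /* e.g. RX ring size */
		.nid       = numa_node,
		.dev       = &pdev->dev,
		.dma_dir   = DMA_FROM_DEVICE,
	};
	struct page_pool *pool = page_pool_create(&pp);

	/* RX refill: get a (possibly recycled) page from the pool */
	struct page *page = page_pool_dev_alloc_pages(pool);

	/* Return path -- the part that is still missing for pages that
	 * ended up in SKBs -- recycles instead of freeing: */
	page_pool_recycle_direct(pool, page);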

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: Bad XDP performance with mlx5
       [not found]               ` <9f116335-0fad-079b-4070-89f24af4ab55@kth.se>
@ 2019-06-04  9:15                 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2019-06-04  9:15 UTC (permalink / raw)
  To: Tom Barbette
  Cc: Saeed Mahameed, toke, xdp-newbies, Leon Romanovsky, Tariq Toukan,
	brouer, Björn Töpel, Karlsson, Magnus, Jakub Kicinski,
	netdev

On Tue, 4 Jun 2019 09:28:22 +0200
Tom Barbette <barbette@kth.se> wrote:

> Thanks Jesper for looking into this!
> 
> I don't think I will be of much help further on this matter. My
> takeaway would be: as a first-time user looking into XDP after
> watching a dozen XDP talks, I would have expected the XDP default
> settings to be identical to SKB, so I don't have to watch out for a
> per-driver checklist of parameters to avoid increasing my CPU
> consumption by 15% when inserting "a super efficient and light BPF
> program". But I understand it's not that easy...

The gap should not be this large, but as I demonstrated it was primarily
because you hit an unfortunate interaction between TCP and how the mlx5
driver does page-caching (p.s. we are working on removing this
driver-local recycle-cache).
  When loading an XDP/eBPF-prog, the driver changes the underlying RX
memory model, which wastes memory to gain packets-per-sec speed, but TCP
sees this memory waste and gives us a penalty.

It is important to understand that XDP is not optimized for TCP.  XDP
is designed and optimized for L2-L3 handling of packets (TCP is L4).
Before XDP, these L2-L3 use-cases were "slow", because the kernel
netstack assumes an L4/socket use-case (full SKB) when less was really
needed.

This is actually another good example of why XDP programs per RX-queue
will be useful (note: this is not implemented upstream, yet...).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

