From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ot1-f44.google.com ([209.85.210.44]:37293 "EHLO mail-ot1-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726575AbgBENbo (ORCPT ); Wed, 5 Feb 2020 08:31:44 -0500 Received: by mail-ot1-f44.google.com with SMTP id d3so1912851otp.4 for ; Wed, 05 Feb 2020 05:31:43 -0800 (PST) MIME-Version: 1.0 References: <20200113151159.GB68570@smtp.ads.isi.edu> <20200114205250.GA85903@smtp.ads.isi.edu> <20200115014137.GA105434@smtp.ads.isi.edu> <20200116020414.GA46831@smtp.ads.isi.edu> <4c03813d-5edf-7e9e-8905-31902b5acb71@mellanox.com> <20200122214352.GA13201@smtp.ads.isi.edu> In-Reply-To: From: Magnus Karlsson Date: Wed, 5 Feb 2020 14:31:31 +0100 Message-ID: Subject: Re: zero-copy between interfaces Sender: xdp-newbies-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable To: Maxim Mikityanskiy Cc: Ryan Goodfellow , "xdp-newbies@vger.kernel.org" , Tariq Toukan , Saeed Mahameed , Moshe Shemesh On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson wrote: > > On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson > wrote: > > > > On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson > > wrote: > > > > > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > > > wrote: > > > > > > > > > -----Original Message----- > > > > > From: Magnus Karlsson > > > > > Sent: 27 January, 2020 17:55 > > > > > To: Maxim Mikityanskiy > > > > > Cc: Ryan Goodfellow ; xdp-newbies@vger.kernel.o= rg; Tariq > > > > > Toukan ; Saeed Mahameed ; Moshe > > > > > Shemesh > > > > > Subject: Re: zero-copy between interfaces > > > > > > > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy > > > > > wrote: > > > > > > > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy = wrote: > > > > > > >>>> I've posted output from the program in debugging mode here > > > > > > >>>> > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930375 > > > > > > >>>> > > > > > > >>>> Yes, you are correct in that forwarding works for a brief = period and > > > > > then stops. > > > > > > >>>> I've noticed that the number of packets that are forwarded= is equal > > > > > to the size > > > > > > >>>> of the producer/consumer descriptor rings. I've posted two= ping > > > > > traces from a > > > > > > >>>> client ping that shows this. > > > > > > >>>> > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930376 > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930377 > > > > > > >> > > > > > > >> These snippets are not available. > > > > > > > > > > > > > > Apologies, I had the wrong permissions set. They should be av= ailable > > > > > now. > > > > > > > > > > > > > >> > > > > > > >>>> > > > > > > >>>> I've also noticed that when the forwarding stops, the CPU = usage for > > > > > the proc > > > > > > >>>> running the program is pegged, which is not the norm for t= his program > > > > > as it uses > > > > > > >>>> a poll call with a timeout on the xsk fd. > > > > > > >> > > > > > > >> This information led me to a guess what may be happening. On= the RX > > > > > > >> side, mlx5e allocates pages in bulks for performance reasons= and to > > > > > > >> leverage hardware features targeted to performance. In AF_XD= P mode, > > > > > > >> bulking of frames is also used (on x86, the bulk size is 64 = with > > > > > > >> striding RQ enabled, and 8 otherwise, however, it's implemen= tation > > > > > > >> details that might change later). If you don't put enough fr= ames to XSK > > > > > > >> Fill Ring, the driver will be demanding more frames and retu= rn from > > > > > > >> poll() immediately. Basically, in the application, you shoul= d put as > > > > > > >> many frames to the Fill Ring as you can. Please check if tha= t could be > > > > > > >> the root cause of your issue. > > > > > > > > > > > > > > The code in this application makes an effort to relenish the = fill ring > > > > > as fast > > > > > > > as possible. The basic loop of the application is to first ch= eck if > > > > > there are > > > > > > > any descriptors to be consumed from the completion queue or a= ny > > > > > descriptors that > > > > > > > can be added to the fill queue, and only then to move on to m= oving > > > > > packets > > > > > > > through the rx and tx rings. > > > > > > > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob= /v5.5- > > > > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > > > > > > > I reproduced your issue and did my investigation, and here is w= hat I > > > > > found: > > > > > > > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduce= s an > > > > > > important behavioral change (which I thought was not that impor= tant). > > > > > > xskq_nb_avail used to return min(entries, dcnt), but after the = change it > > > > > > just returns entries, which may be as big as the ring size. > > > > > > > > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_t= ail > > > > > > catches up with q->cons_head. So, before that patch and one pre= vious > > > > > > patch, cons_head - cons_tail was not more than 16, so the consu= mer index > > > > > > was updated periodically. Now consumer is updated only when the= whole > > > > > > ring is exhausted. > > > > > > > > > > > > 3. The application can't replenish the fill ring if the consume= r index > > > > > > doesn't move. As a consequence, refilling the descriptors by th= e > > > > > > application can't happen in parallel with using them by the dri= ver. It > > > > > > should have some performance penalty and possibly even lead to = packet > > > > > > drops, because the driver uses all the descriptors and only the= n > > > > > > advances the consumer index, and then it has to wait until the > > > > > > application refills the ring, busy-looping and losing packets. > > > > > > > > > > This will happen if user space always fills up the whole ring, wh= ich > > > > > might or might not happen all depending on the app. > > > > > > > > Yes, that's right, and as far as I know, it's common to fill as man= y > > > > frames as the application can (there was no reason to do it other w= ay > > > > till now). > > > > > > > > > With that said, it > > > > > might provide better performance to update it once in a while. It > > > > > might also be the case that we will get better performance with t= he > > > > > new scheme if we only fill half the fill ring. > > > > > > > > Yes, it may improve performance. However, I don't think it's correc= t to > > > > set such a limitation on the app. > > > > Actually, a much worse limitation to put on an application is to say > > that you have to have a certain amount of buffers on some ring for the > > zero-copy feature to work. For example that we need at least 64 > > buffers on the fill ring for all the NIC cards to work in zero-copy > > mode. That would be a bad thing to have to put in the documentation. > > An OS is supposed to abstract away HW differences, and with this > > current limitation in your driver, it shines through for sure. What we > > would like to put in the documentation is a statement along the lines > > of: "for best performance, make sure you have plenty of buffers on the > > fill ring so that the NIC can work as efficiently as possible". Not a > > statement that it does not work on Mellanox unless you put enough > > buffers on the fill ring. So my advice (and wish) is that you fix this > > in your driver. With that said, I will still look into what is the > > best way to get at least the sample to work for you. But there is no > > way to make sure every single app works for you in zero-copy mode, > > unless you support arbitrary amount of buffers on the fill ring. I > > guess that sooner or later, a customer of yours will get into this > > situation one way or the other, so why not fix it now. > > > > /Magnus > > > > > I will examine what provides the best performance. On one hand it is > > > the number of updates to shared ring state (minimized by current > > > scheme) and the ability for the user app to but buffers on the fill > > > ring. Stating that putting e.g. half the packets on the fill ring > > > provides better performance (for some application) is not a > > > limitation. It is performance tuning advise :-). > > I have now made a set of measurements. First I just made a variation > study using the xdpsock app, varying the amount of packets the kernel > can withdraw from a consumer ring (fill and Tx) before updating global > state. For the 1 core case (app and driver on the same core) the more > frequent you do this update, the better. The reason for this is that > it costs very little to update the state since the application is not > running. And it is beneficial for the app to have a freshly updated > state when it starts to execute as it can operate on more packets. For > the 2 core case (app on one core, driver on another) it is the > complete opposite: the fewer updates to global state, the better. The > reason for this is that it costs a lot to update global state as it > triggers cache coherency actions between the two cores. > > What I did then was to compare the current scheme, update only when > grabbing new packets, to a new scheme were we also update the global > consumer pointer when we are exiting Rx or Tx processing in the NAPI > context. On two cores the current scheme gets 0.5 to 1 Mpps more in > throughput than also updating the pointer at the end of NAPI. But for > 1 core case, the new scheme is the best and generates between 0.2 and > 0.3 Mpps more in throughput than the current one. But all in all, the > current scheme is more beneficial than the proposed one if we say that > both the 1 core and the 2 core case is equally important. > > Something to note is that the xdpsock application only puts batch size > (64) of packets in the fill ring in every iteration, and this might > lead to some good pacing for the current scheme and the 2 core case. > I.e., we do not get into the case of the fill ring only being full or > empty. But I will run this on some real apps to get some more results, > and I know that Bj=C3=B6rn has an optimized xdpsock application that puts > many more packets into the fill queue than 64. This optimization might > actually make the new proposal (also updating at the end of NAPI) be > better and make the current scheme suffer. We will examine this > further and get back. Actually, after some more thought and discussions I think we should optimize for the 1 core case, since that is what gives the whole system the best performance, provided that you can scale your application with instantiation that is. For a 4 core system, 4 x the 1 core performance > 2 x 2 core performance by a lot. I think that the 1 core case is the one that is going to be used out there. At least that is what I hear and see. So, when the merge window opens I am going to submit a patch that updates the consumer pointer when we exit NAPI too. Will increase the performance of the 1 core case. /Magnus > /Magnus > > > > > > I will look into this > > > > > and see what I get. > > > > > > > > > > > 4. As mlx5e allocates frames in batches, the consequences are e= ven more > > > > > > severe: it's a deadlock where the driver waits for the applicat= ion, and > > > > > > vice versa. The driver never reaches the point where cons_tail = gets > > > > > > equal to cons_head. E.g., if cons_tail + 3 =3D=3D cons_head, an= d the batch > > > > > > size requested by the driver is 8, the driver won't peek anythi= ng from > > > > > > the fill ring waiting for difference between cons_tail and cons= _head to > > > > > > increase to be at least 8. On the other hand, the application c= an't put > > > > > > anything to the ring, because it still thinks that the consumer= index is > > > > > > 0. As cons_tail never reaches cons_head, the consumer index doe= sn't get > > > > > > updated, hence the deadlock. > > > > > > > > > > Good thing that you detected this. Maybe I should get a Mellanox = card > > > > > :-). This is different from how we implemented Intel's drivers th= at > > > > > just work on any batch size. If it gets 3 packets back, it will u= se > > > > > those. How do you deal with the case when the application just pu= ts a > > > > > single buffer in the fill ring and wants to receive a single pack= et? > > > > > > > > mlx5e will wait until the full batch is available. As AF_XDP is int= ended > > > > for high-performance apps, this scenario is less expected. We prefe= r to > > > > leverage our performance features. > > > > > > That you cannot support all applications on top of AF_XDP with your > > > zero-copy support seems broken to me. But I give you that you might > > > support all the ones you care about with your current batching > > > support. Like you say, most apps will put plenty of buffers on the > > > fill ring, so this should not be a problem. Can you not implement som= e > > > slow path for these cases? You must have one for the skb case. > > > > > > > > Why does the Mellanox driver need a specific batch size? This is = just > > > > > for my understanding so we can find a good solution. > > > > > > > > The main reason is our performance feature called striding RQ. Skip= ping > > > > all irrelevant details, a batch of 64 pages is posted to the NIC wi= th a > > > > single request, and the NIC fills them progressively. This feature = is > > > > turned on by default on modern NICs, and it's really good for > > > > performance. > > > > > > > > It might be possible to post a smaller batch though, I'm not sure a= bout > > > > it, it needs to be checked, but anyway it's not something that we w= ill > > > > likely do, because it's a complication of the data path, and if we = know > > > > more frames are coming, it's much better for the performance to wai= t for > > > > them, rather than to post several incomplete batches. > > > > > > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and peri= odic > > > > > > updates of the consumer index was wrong. It totally breaks mlx5= e, that > > > > > > does batching, and it will affect the performance of any driver= , because > > > > > > the application can't refill the ring until it gets completely = empty and > > > > > > the driver starts waiting for frames. I suggest that periodic u= pdates of > > > > > > the consumer index should be readded to xskq_cons_peek_addr. > > > > > > > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > > > > developers complained about it giving rise to counter intuitive > > > > > behavior in user space. I will try to dig out the complaints and = the > > > > > explanations Bj=C3=B6rn and I had to send which it seemed that us= ers really > > > > > should not have to care about. It should just work. > > > > > > > > I think the counter that doesn't update till the very last moment a= nd > > > > then advances by the ring size will also be something to complain a= bout > > > > (and I am the first one to complain :D). Such bursts are not good i= n any > > > > case. > > > > > > Do you have any performance data that shows this scheme is bad for > > > performance? The good thing about this scheme is that global state is > > > updated less frequently. And the bad thing is what you mentioned. But > > > which one has the greatest effect, is the question. > > > > > > > > I also do not like > > > > > to have arbitrary constants like this. Why 16? > > > > > > > > I believe any batching mechanism has a constant that look arbitrary= . > > > > This constant should be the golden mean: if it's too small, there i= s > > > > little effect from batching; if it's too big, it gets too bursty. > > > > > > > > Basically, after your patch it just changed from 16 to the ring siz= e. > > > > Maybe we can tie that constant to ring size? Make it ring_size / > > > > another_arbitrary_constant? :) > > > > > > Agreed, I thought about this too. Something tied to ring_size might b= e > > > a good compromise. Will examine this. But I want to base this on > > > performance data not idle speculation, so need to run some experiment= s > > > first. > > > > > > /Magnus > > > > > > > > Would much prefer not > > > > > having to deal with this, unless of course it horribly breaks > > > > > something or gives rise to worse performance. Might still be the = case > > > > > here, but if not, I would like to remove it. > > > > > > > > > > Thanks: Magnus > > > > > > > > > > > Magnus, what do you think of the suggestion above? > > > > > > > > > > > > Thanks, > > > > > > Max > > > > > > > > > > > > >> > > > > > > >> I tracked this issue in our internal bug tracker in case we = need to > > > > > > >> perform actual debugging of mlx5e. I'm looking forward to yo= ur feedback > > > > > > >> on my assumption above. > > > > > > >> > > > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G car= d (MCX416A- > > > > > CCAT) > > > > > > >>>> running the mlx5 driver. > > > > > > >> > > > > > > >> This one should run without striding RQ, please verify it wi= th ethtool > > > > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > > > > > > > I do not remember changing this option, so whatever the defau= lt is, is > > > > > what it > > > > > > > was running with. I am traveling this week and do not have ac= cess to > > > > > these > > > > > > > systems, but will ensure that this flag is set properly when = I get back. > > > > > > > > > > > > >