From: Eric Dumazet
Subject: Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
Date: Mon, 13 Mar 2017 14:09:23 -0700
To: Alexei Starovoitov
Cc: "David S. Miller", netdev, Tariq Toukan, Saeed Mahameed,
 Willem de Bruijn, Alexei Starovoitov, Eric Dumazet, Alexander Duyck
References: <20170313005848.7076-1-edumazet@google.com>
 <20170313173432.GA31333@ast-mbp.thefacebook.com>
 <20170313183121.GA36306@ast-mbp.thefacebook.com>
 <20170313202300.GA52396@ast-mbp.thefacebook.com>
In-Reply-To: <20170313202300.GA52396@ast-mbp.thefacebook.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8

On Mon, Mar 13, 2017 at 1:23 PM, Alexei Starovoitov wrote:
> On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
>> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov wrote:
>> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
>> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov wrote:
>> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
>> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >> >>  		case XDP_PASS:
>> >> >>  			break;
>> >> >>  		case XDP_TX:
>> >> >> +			/* Make sure we have one page ready to replace this one */
>> >> >> +			npage = NULL;
>> >> >> +			if (!ring->page_cache.index) {
>> >> >> +				npage = mlx4_alloc_page(priv, ring,
>> >> >> +							&ndma, numa_mem_id(),
>> >> >> +							GFP_ATOMIC | __GFP_MEMALLOC);
>> >> >
>> >> > did you test this with the xdp2 test ?
>> >> > under what conditions does it allocate ?
>> >> > It looks dangerous from a security point of view to do allocations here.
>> >> > Can it be exploited by an attacker ?
>> >> > we use xdp for ddos and lb and this is the fast path.
>> >> > If 1 out of 100s of XDP_TX packets hits this allocation we will have a
>> >> > serious perf regression.
>> >> > In general I don't think it's a good idea to penalize x86 in favor of powerpc.
>> >> > Can you #ifdef this new code somehow, so we won't have these concerns on x86 ?
>> >>
>> >> Normal paths would never hit this point really. I wanted to be extra
>> >> safe, because who knows, some guys could be tempted to set
>> >> ethtool -G ethX rx 512 tx 8192
>> >>
>> >> Before this patch, if you were able to push enough frames into the TX ring,
>> >> you would also eventually be forced to allocate memory, or drop frames...
>> >
>> > hmm. not following.
>> > Packets don't come into the xdp tx queues from the stack. They can only
>> > arrive via xdp_tx. So this rx page belongs to the driver, is not shared
>> > with anyone, and only needs to be put onto the tx ring, so I don't
>> > understand why the driver needs to allocate anything here. To refill
>> > the rx ring ? but why here ?
>>
>> Because the RX ring can be shared, by packets going to the upper stacks
>> (say TCP).
>>
>> So there is no guarantee that the pages in the quarantine pool have a
>> page count of 1.
>>
>> The normal XDP_TX path will recycle pages in ring->page_cache.
>>
>> This is exactly where I pick up a replacement.
>>
>> Pages in ring->page_cache are guaranteed to have no users other than
>> ourselves (the mlx4 driver).
>>
>> You might not have noticed that the current mlx4 driver has a lazy refill
>> of RX ring buffers that eventually removes all the pages from the RX ring,
>> and we have to recover with this lazy mlx4_en_recover_from_oom() thing
>> that will attempt to restart the allocations.
>>
>> After my patch, we have the guarantee that the RX ring buffer is
>> always fully populated.
>>
>> When we receive a frame (XDP or not), we drop it if we can not
>> replenish the RX slot, in case the oldest page in quarantine is not a
>> recycling candidate and we can not allocate a new page.
>
> Got it. Could you please add the above explanation to the commit log,
> since it's a huge difference vs other drivers.
> I don't think any other driver in the tree follows this strategy.
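For readers following the thread, the drop-or-replenish policy described in
the quoted explanation amounts to roughly the following. This is a sketch of
the idea only, not code from the patch: the helper name
mlx4_try_replenish_rx_slot() and the quarantine fields are made up for
illustration, and DMA mapping is omitted.

/* Sketch only: "recycle the oldest quarantined page if we are its only
 * user, else allocate a fresh order-0 page, else let the caller drop the
 * frame and keep the old page in place so the RX ring stays populated".
 */
static bool mlx4_try_replenish_rx_slot(struct mlx4_en_rx_ring *ring,
				       struct mlx4_en_rx_alloc *slot)
{
	/* Hypothetical field: oldest page sitting in the quarantine pool. */
	struct page *old = ring->quarantine[ring->quarantine_head];

	if (page_ref_count(old) == 1) {
		/* The stack has released it: safe to reuse. */
		slot->page = old;
		return true;
	}

	/* Fall back to a fresh order-0 page. */
	slot->page = alloc_pages_node(numa_mem_id(),
				      GFP_ATOMIC | __GFP_MEMALLOC, 0);
	if (!slot->page)
		return false;	/* caller drops the frame, keeps the old page */

	return true;
}

The key property being argued for is that an RX slot is never left empty:
if neither recycling nor allocation succeeds, the incoming frame is dropped
and the old page stays in the ring.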
sk_buff allocation can fail before we even consider adding a frag to the
skb anyway...

Look, all of this is completely irrelevant for XDP_TX, since there is no
allocation at all once ~128 pages have been put into page_cache.

Do you want to prefill this cache when XDP is loaded ?

> I think that's the right call, but it shouldn't be hidden in details.

ring->page_cache contains the pages used by XDP_TX; they are recycled
through mlx4_en_rx_recycle(). There is a nice comment there. I did not
change this part. (A simplified sketch of that recycle helper is at the
end of this mail.)

These pages _used_ to be directly taken from mlx4_en_prepare_rx_desc(),
because there was no page recycling for the normal path.

All my patch does is provide a generic implementation of page recycling,
leaving the XDP_TX pages in their own pool, because we do not even have to
check their page count, which would add a cache line miss.

So I do have 2 different pools, simply to let the XDP_TX path stay ultra
fast, as before. That does not look like a detail to me.

> If that's the right approach (probably is) we should do it
> in the other drivers too.
> Though I don't see you dropping "to the stack" packets in this patch.
> The idea is to drop the packet (whether xdp or not) if the rx
> buffer cannot be replenished, _regardless_ of whether the driver
> implements recycling, right ?
>
> Theory vs practice. We don't see this issue running xdp.

Sure, so let's ignore the syzkaller guys and stop doing code reviews,
if all this is working for you.

> My understanding is that, since rx and tx napi are scheduled on the same
> cpu, there is a guarantee that during normal invocation they won't race,
> and the only problem is due to mlx4_en_recover_from_oom(), which can
> run rx napi on some random cpu.

Exactly. So under pressure, the host will panic :/

Same if you let one application use busy polling, or if the IRQ is moved
to another cpu (/proc/irq/*/smp_affinity).

Looks fragile, but whatever, maybe you control everything on your hosts.
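For reference, the XDP_TX recycle path discussed above boils down to parking
the page of a completed XDP_TX frame in ring->page_cache when there is room,
so the RX refill path can reuse it without touching the page refcount. The
sketch below is reconstructed from memory of the in-tree helper, so the exact
field layout may differ slightly; treat it as illustrative, not authoritative.

/* Simplified sketch of the XDP_TX recycle helper: stash the frame's page
 * (and its DMA address) in ring->page_cache. If the cache is full, return
 * false so the caller unmaps and frees the page instead.
 */
bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
			struct mlx4_en_rx_alloc *frame)
{
	struct mlx4_en_page_cache *cache = &ring->page_cache;

	if (cache->index >= MLX4_EN_CACHE_SIZE)
		return false;		/* cache full: caller frees the page */

	cache->buf[cache->index++] = *frame;	/* keep page + dma address */
	return true;
}

IIRC the RX refill path (mlx4_en_prepare_rx_desc()) pops from this cache
before doing anything else, which is why the XDP_TX pool never needs the
page count check that the generic recycling pool requires.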