From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: netdev@vger.kernel.org, jaswinder.singh@linaro.org,
	ard.biesheuvel@linaro.org, masami.hiramatsu@linaro.org,
	arnd@arndb.de, bjorn.topel@intel.com, magnus.karlsson@intel.com,
	daniel@iogearbox.net, ast@kernel.org,
	jesus.sanchez-palencia@intel.com, vinicius.gomes@intel.com,
	makita.toshiaki@lab.ntt.co.jp, Tariq Toukan <tariqt@mellanox.com>,
	Tariq Toukan <ttoukan.linux@gmail.com>
Subject: Re: [net-next, PATCH 1/2, v3] net: socionext: different approach on DMA
Date: Mon, 1 Oct 2018 17:37:06 +0300	[thread overview]
Message-ID: <20181001143706.GB810@apalos> (raw)
In-Reply-To: <20181001154845.4cd1d5dc@redhat.com>

On Mon, Oct 01, 2018 at 03:48:45PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 1 Oct 2018 14:20:21 +0300
> Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> 
> > On Mon, Oct 01, 2018 at 01:03:13PM +0200, Jesper Dangaard Brouer wrote:
> > > On Mon, 1 Oct 2018 12:56:58 +0300
> > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > >   
> > > > > > #2: You have allocations on the XDP fast-path.
> > > > > > 
> > > > > > The REAL secret behind the XDP performance is to avoid allocations on
> > > > > > the fast-path.  While I just told you to use the page-allocator and
> > > > > > order-0 pages, this will actually kill performance.  Thus, to make this
> > > > > > fast, you need a driver local recycle scheme that avoids going through
> > > > > > the page allocator, which makes XDP_DROP and XDP_TX extremely fast.
> > > > > > For the XDP_REDIRECT action (which you seems to be interested in, as
> > > > > > this is needed for AF_XDP), there is a xdp_return_frame() API that can
> > > > > > make this fast.    
> > > > >
> > > > > I had an initial implementation that did exactly that (that's why the
> > > > > dma_sync_single_for_cpu() -> dma_unmap_single_attrs() is there). In the case
> > > > > of AF_XDP isn't that introducing a 'bottleneck' though? I mean you'll feed fresh
> > > > > buffers back to the hardware only when your packets have been processed by
> > > > > your userspace application
> > > >
> > > > Just a clarification here. This is the case if ZC is implemented. In my case
> > > > the buffers will be 'ok' to be passed back to the hardware once the
> > > > userspace payload has been copied by xdp_do_redirect().
> > > 
> > > Thanks for clarifying.  But no, this is not introducing a 'bottleneck'
> > > for AF_XDP.
> > > 
> > > For (1) the copy-mode-AF_XDP the frame (as you noticed) is "freed" or
> > > "returned" very quickly after it is copied.  The code is a bit hard to
> > > follow, but in __xsk_rcv() it calls xdp_return_buff() after the memcpy.
> > > Thus, the frame can be kept DMA mapped and reused in RX-ring quickly.
> >  
> > Ok, makes sense. I'll send a v4 with page reuse, while using your
> > API for page allocation.
> 
> Sounds good, BUT do notice that using the bare page_pool will/should
> give you increased XDP performance, but might slow down normal network
> stack delivery, because the netstack will not call xdp_return_frame() and
> instead falls back to returning the pages through the page allocator.
> 
> I'm very interested in knowing what performance increase you see with
> XDP_DROP, with just a "bare" page_pool implementation.
When I was just syncing the page fragments instead of unmap -> alloc -> map, I
was getting ~340kpps (with XDP_REDIRECT). I ended up with 320kpps on this patch.
I did a couple more changes though (like the dma mapping when allocating
the buffers), so I am not 100% sure what caused the difference.
I'll let you know once I finish up the code using the API for page allocation.
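Roughly, the page_pool-backed RX setup I have in mind looks like the below.
This is only a sketch against the page_pool API as I read it; the ring size,
device pointer, helper name and error handling are placeholders, not the
actual netsec code:

	#include <net/page_pool.h>
	#include <net/xdp.h>

	/* Hypothetical helper: create a page_pool for one RX ring and
	 * register it as the XDP memory model for that ring.
	 */
	static int rx_ring_setup_pool(struct device *dev,
				      struct xdp_rxq_info *rxq,
				      struct page_pool **out_pool)
	{
		struct page_pool_params pp_params = {
			.order		= 0,			/* order-0 pages only */
			.flags		= PP_FLAG_DMA_MAP,	/* pool keeps pages DMA mapped */
			.pool_size	= 256,			/* ~ RX ring size, illustrative */
			.nid		= NUMA_NO_NODE,
			.dev		= dev,
			.dma_dir	= DMA_FROM_DEVICE,
		};
		struct page_pool *pool;
		int err;

		pool = page_pool_create(&pp_params);
		if (IS_ERR(pool))
			return PTR_ERR(pool);

		/* Frames freed with xdp_return_frame() now go back to the
		 * pool instead of the page allocator.
		 */
		err = xdp_rxq_info_reg_mem_model(rxq, MEM_TYPE_PAGE_POOL, pool);
		if (err)
			return err;	/* pool teardown omitted in this sketch */

		*out_pool = pool;
		return 0;
	}

The RX refill path would then call page_pool_dev_alloc_pages() instead of the
current alloc + dma_map_single() pair.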

Regarding the change and the 'bottleneck' discussion we had: XDP_REDIRECT is
straightforward (non-ZC mode). I agree with you that since the payload is
pretty much immediately copied before being flushed to userspace, it's
unlikely you'll end up delaying the hardware (starving it of buffers).
Do you think that's the same for XDP_TX? The DMA buffer will need to be synced
for the CPU, then you ring a doorbell with X packets. After that you'll have to
wait for the Tx completion and resync the buffers to the device. So you actually
make your Rx descriptors dependent on your Tx completions (and keep in mind this
NIC only has 1 queue per direction).
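To make the concern concrete, the per-buffer flow I picture for XDP_TX is
roughly the following (illustrative only, not the netsec code; dev, dma_addr,
len and the queueing step are placeholders):

	/* RX: make the buffer visible to the CPU and run the program */
	dma_sync_single_for_cpu(dev, dma_addr, len, DMA_BIDIRECTIONAL);
	act = bpf_prog_run_xdp(prog, &xdp);

	if (act == XDP_TX) {
		/* hand the same buffer back to the device for transmit */
		dma_sync_single_for_device(dev, dma_addr, len, DMA_BIDIRECTIONAL);
		/* queue it on the (single) TX ring and ring the doorbell */
	}

	/* The buffer can only be synced and posted back on the RX ring after
	 * the TX completion fires, so RX refill ends up waiting on TX
	 * completions, with one queue per direction.
	 */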

Now for the measurements part, I'll have to check with the vendor whether the
interface can do more than 340kpps and we are missing something
performance-wise.
Have you done any tests with the IOMMU enabled/disabled? In theory the DMA
recycling will shine against map/unmap when the IOMMU is on (and stressed,
i.e. have a different NIC doing a traffic test at the same time).

> 
> The mlx5 driver does not see this netstack slowdown, because it has a
> hybrid approach of maintaining a recycle ring for frames going into the
> netstack, by bumping the refcnt.  I think Tariq is cleaning this up.
> The mlx5 code is hard to follow... in mlx5e_xdp_handle()[1] the
> refcnt is 1 and a bit is set. And in [2] the refcnt is bumped via
> page_ref_inc(), and the bit is caught in [3].  (This really needs to be
> cleaned up and generalized.)
I've read most of the XDP-related code in the Intel/Mellanox drivers
before starting my patch series. I'll have a closer look now, thanks!
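For reference, my reading of that refcnt trick is roughly the following
(a paraphrase of the idea, not the mlx5 code; the flag and helper names are
made up):

	/* Keep an extra reference on a page handed to the netstack and only
	 * recycle it once the stack has dropped its own reference.
	 */
	page_ref_inc(page);			/* driver keeps its reference   */
	buf->recycle_pending = true;		/* hypothetical per-buffer flag */

	/* ... later, when the driver wants to reuse this RX buffer ... */
	if (buf->recycle_pending && page_ref_count(page) == 1) {
		/* stack is done: keep the DMA mapping and put the page
		 * straight back on the RX ring
		 */
		reuse_rx_buffer(buf);		/* hypothetical helper */
	} else {
		/* still referenced elsewhere: drop our reference and fall
		 * back to a fresh allocation
		 */
		put_page(page);
	}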
> 
> 
> 
> [1] https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L83-L88
>     https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L952-L959
> 
> [2] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L1015-L1025
> 
> [3] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L1094-L1098
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Thread overview: 13+ messages
2018-09-29 11:28 [net-next, PATCH 0/2, v3] net: socionext: XDP support Ilias Apalodimas
2018-09-29 11:28 ` [net-next, PATCH 1/2, v3] net: socionext: different approach on DMA Ilias Apalodimas
2018-10-01  9:26   ` Jesper Dangaard Brouer
2018-10-01  9:44     ` Ilias Apalodimas
2018-10-01  9:56       ` Ilias Apalodimas
2018-10-01 11:03         ` Jesper Dangaard Brouer
2018-10-01 11:20           ` Ilias Apalodimas
2018-10-01 13:48             ` Jesper Dangaard Brouer
2018-10-01 14:37               ` Ilias Apalodimas [this message]
2018-10-01 15:58                 ` Jesper Dangaard Brouer
2018-09-29 11:28 ` [net-next, PATCH 2/2, v3] net: socionext: add XDP support Ilias Apalodimas
2018-10-01 12:48 ` [net-next, PATCH 0/2, v3] net: socionext: " Björn Töpel
2018-10-01 13:59   ` Ilias Apalodimas
