> > Lorenzo Bianconi wrote:
> > > > Lorenzo Bianconi wrote:
> > > > > > Lorenzo Bianconi wrote:
> > > > > > [...]
> > > > > > > + *	Description
> > > > > > > + *		Adjust frame headers, moving *offset* bytes from/to the second
> > > > > > > + *		buffer to/from the first one. This helper can be used to move
> > > > > > > + *		headers when the hw DMA SG does not copy all the headers in
> > > > > > > + *		the first fragment.
> > > > >
> > > > > + Eric to the discussion
> > > > >
> > > > [...]
> > [...]
> >
> > Still in a normal L2/L3/L4 use case I expect all the headers you
> > need to be in the first buffer, so it's unlikely for use cases that
> > send most traffic via XDP_TX, for example, to ever need the extra
> > info. In these cases I think you are paying some penalty for
> > having to do the work of populating the shinfo. Maybe it's measurable,
> > maybe not, I'm not sure.
> >
> > Also, if we make it required for multi-buffer then we also need
> > the shinfo on 40gbps or 100gbps nics, and now even small costs
> > matter.
>
> Now I realize I used the word "split" in an unclear way here,
> I apologize for that.
> What I mean is not related to "header" split; I am referring to the case where
> the hw is configured with a given rx buffer size (e.g. 1 PAGE) and we have
> set a higher MTU/max receive size (e.g. 9K).
> In this case the hw will "split" the jumbo received frame over multiple rx
> buffers/descriptors. By populating the "xdp_shared_info" we forward this
> layout info to the eBPF sandbox and to a remote driver/cpu.
> Please note this use case is not currently covered by XDP, so if we develop it
> in a proper way I guess we should not get any performance hit for the legacy
> single-buffer mode, since we will not populate the shared_info for it (I think
> you refer to the "legacy" use case in your "normal L2/L3/L4" example, right?)
> Anyway, I will run some tests to verify that performance for the single-buffer
> use case is not hit.
>
> Regards,
> Lorenzo

I carried out some performance measurements on my Espressobin to check whether
the XDP "single buffer" use case is hit by introducing xdp multi-buff support.
Each test was carried out sending ~900 Kpps (pkt length 64B). The rx buffer
size was set to 1 PAGE (the default value). The results are roughly the same:

commit: f2ca673d2cd5 "net: mvneta: fix use of state->speed"
===========================================================

- XDP-DROP:           ~ 740 Kpps
- XDP-TX:             ~ 286 Kpps
- XDP-PASS + tc drop: ~ 219.5 Kpps

xdp multi-buff:
===============

- XDP-DROP:           ~ 739-740 Kpps
- XDP-TX:             ~ 285 Kpps
- XDP-PASS + tc drop: ~ 223 Kpps

I will add these results to the v3 cover letter.
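For reference, the XDP-DROP and XDP-TX cases need nothing more than a trivial
program along these lines (just a minimal sketch, not necessarily the exact
object used for the runs above):

/* Minimal XDP test program sketch: bounce every frame back out of the
 * same interface (XDP_TX); return XDP_DROP instead for the drop test.
 * Built with clang -O2 -target bpf.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_tx_test(struct xdp_md *ctx)
{
	return XDP_TX;	/* XDP_DROP for the drop test */
}

char _license[] SEC("license") = "GPL";

Keeping the program this small is what makes the comparison sensitive to the
per-packet cost of the multi-buff changes themselves.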
Regards,
Lorenzo

> >
> > > >
> > > > If you take the simplest possible program that just returns XDP_TX
> > > > and run a pkt generator against it, I believe (haven't run any
> > > > tests) that you will see overhead now just from populating this
> > > > shinfo. I think it needs to only be done when it's needed, e.g. when
> > > > the user makes this helper call or we need to build the skb and
> > > > populate the frags there.
> > >
> > > sure, I will carry out some tests.
> >
> > Thanks!
> >
> > >
> > > >
> > > > I think a smart driver will just keep the frags list in whatever
> > > > form it has them (rx descriptors?) and push them over to the
> > > > tx descriptors without having to do extra work with frag lists.
> > >
> > > I think there are many use-cases where we want to have this info available
> > > in xdp_buff/xdp_frame. E.g. let's consider the following jumbo frame example:
> > > - MTU > 1 PAGE (so the driver will split the received data across multiple
> > >   rx descriptors)
> > > - the driver performs an XDP_REDIRECT to a veth or cpumap
> > >
> > > Relying on the proposed architecture we could enable GRO in veth or cpumap,
> > > I guess, since we can build a non-linear skb from the xdp multi-buff, right?
> >
> > I'm not disputing there are use-cases. But I'm trying to see if we
> > can cover those without introducing additional latency in other
> > cases. Hence the extra benchmarks request ;)
> >
> > > >
> > > > > >
> > > > > > Did you benchmark this?
> > > > >
> > > > > will do, I need to understand if we can use tiny buffers in mvneta.
> > > >
> > > > Why tiny buffers? How does mvneta lay out the frags when doing
> > > > header split? Can we just benchmark what mvneta is doing at the
> > > > end of this patch series?
> > >
> > > for the moment mvneta can split the received data when the previous buffer
> > > is full (e.g. when the first page is completely written). I want to explore
> > > if I can set a tiny buffer (e.g. 128B) as the max receive buffer size, to
> > > run some performance tests and get results that are "comparable" with the
> > > ones I got when I added XDP support to mvneta.
> >
> > OK would be great.
> >
> > > >
> > > > Also can you try the basic XDP_TX case mentioned above.
> > > > I don't want this to degrade existing use cases if at all
> > > > possible.
> > >
> > > sure, will do.
> >
> > Thanks!
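P.S. To make the multi-buffer layout discussed above a bit more concrete, here
is a rough sketch of how a driver could record the extra rx buffers of a jumbo
frame. The names below (example_xdp_shared_info, example_add_rx_frag) are
purely illustrative and do not match the actual definitions in the series:

/* Illustrative sketch only: the real structure/helpers in the series may
 * differ; this just shows the kind of layout info a driver could export.
 */
#include <linux/skbuff.h>

struct example_xdp_shared_info {
	u16 nr_frags;				/* rx buffers beyond the first one */
	skb_frag_t frags[MAX_SKB_FRAGS];	/* page + offset + len of each chunk */
};

/* rx path: the first descriptor fills the xdp_buff as usual; every
 * additional descriptor belonging to the same jumbo frame is recorded
 * as a frag, so the eBPF program (or a remote cpu/driver after an
 * XDP_REDIRECT) can see the full frame layout.
 */
static void example_add_rx_frag(struct example_xdp_shared_info *sinfo,
				struct page *page, unsigned int offset,
				unsigned int len)
{
	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags++];

	__skb_frag_set_page(frag, page);
	skb_frag_off_set(frag, offset);
	skb_frag_size_set(frag, len);
}

A consumer on the other side of an XDP_REDIRECT (veth, cpumap) could then walk
these frags to build a non-linear skb, which is what would make GRO possible
there.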