> > Lorenzo Bianconi wrote:
> > > > Lorenzo Bianconi wrote:
> > > > > > Lorenzo Bianconi wrote:
> > > > > > [...]
> > > > > > > + *	Description
> > > > > > > + *		Adjust frame headers, moving *offset* bytes from/to the second
> > > > > > > + *		buffer to/from the first one. This helper can be used to move
> > > > > > > + *		headers when the hw DMA SG does not copy all the headers in
> > > > > > > + *		the first fragment.
> > > > >
> > > > > + Eric to the discussion
> > > > >
> > > > [...]
> > [...]
> >
> > Still in a normal L2/L3/L4 use case I expect all the headers you
> > need to be in the first buffer, so it's unlikely for use cases that
> > send most traffic via XDP_TX, for example, to ever need the extra
> > info. In these cases I think you are paying some penalty for
> > having to do the work of populating the shinfo. Maybe it's measurable,
> > maybe not, I'm not sure.
> >
> > Also, if we make it required for multi-buffer then we also need
> > the shinfo on 40gbps or 100gbps nics, and now even small costs
> > matter.
>
> Now I realize I used the word "split" in an unclear way here,
> I apologize for that.
> What I mean is not related to "header" split; I am referring to the case where
> the hw is configured with a given rx buffer size (e.g. 1 PAGE) and we have
> set a higher MTU/max receive size (e.g. 9K).
> In this case the hw will "split" the jumbo received frame over multiple rx
> buffers/descriptors. By populating the "xdp_shared_info" we forward this
> layout info to the eBPF sandbox and to a remote driver/cpu.
> Please note this use case is not currently covered by XDP, so if we develop it
> in a proper way I guess we should not get any performance hit for the legacy
> single-buffer mode, since we will not populate the shared_info for it (I think
> you refer to the "legacy" use case in your "normal L2/L3/L4" example, right?)
> Anyway, I will run some tests to verify that performance for the single-buffer
> use case is not hit.
>
> Regards,
> Lorenzo

I carried out some performance measurements on my Espressobin to check whether
the XDP "single buffer" use case is hit by introducing xdp multi-buff support.
Each test was carried out sending ~900 Kpps (pkt length 64B). The rx buffer
size was set to 1 PAGE (the default value). The results are roughly the same:

commit: f2ca673d2cd5 "net: mvneta: fix use of state->speed"
===========================================================

- XDP-DROP:           ~ 740 Kpps
- XDP-TX:             ~ 286 Kpps
- XDP-PASS + tc drop: ~ 219.5 Kpps

xdp multi-buff:
===============

- XDP-DROP:           ~ 739-740 Kpps
- XDP-TX:             ~ 285 Kpps
- XDP-PASS + tc drop: ~ 223 Kpps

I will add these results to the v3 cover letter.
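For reference, the XDP-DROP and XDP-TX cases need nothing more than a trivial
program along these lines (just a minimal sketch, not necessarily the exact
object used for the runs above):

/* Minimal XDP test program sketch: bounce every frame back out of the
 * same interface (XDP_TX); return XDP_DROP instead for the drop test.
 * Built with clang -O2 -target bpf.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_tx_test(struct xdp_md *ctx)
{
	return XDP_TX;	/* XDP_DROP for the drop test */
}

char _license[] SEC("license") = "GPL";

Keeping the program this small is what makes the comparison sensitive to the
per-packet cost of the multi-buff changes themselves.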
Regards,
Lorenzo

> >
> > > >
> > > > If you take the simplest possible program that just returns XDP_TX
> > > > and run a pkt generator against it, I believe (haven't run any
> > > > tests) that you will see overhead now just from populating this
> > > > shinfo. I think it needs to only be done when it's needed, e.g. when
> > > > the user makes this helper call or we need to build the skb and
> > > > populate the frags there.
> > >
> > > sure, I will carry out some tests.
> >
> > Thanks!
> >
> > >
> > > >
> > > > I think a smart driver will just keep the frags list in whatever
> > > > form it has them (rx descriptors?) and push them over to the
> > > > tx descriptors without having to do extra work with frag lists.
> > >
> > > I think there are many use-cases where we want to have this info available
> > > in xdp_buff/xdp_frame. E.g. let's consider the following jumbo frame example:
> > > - MTU > 1 PAGE (so the driver will split the received data across multiple
> > >   rx descriptors)
> > > - the driver performs an XDP_REDIRECT to a veth or cpumap
> > >
> > > Relying on the proposed architecture we could enable GRO in veth or cpumap,
> > > I guess, since we can build a non-linear skb from the xdp multi-buff, right?
> >
> > I'm not disputing there are use-cases. But I'm trying to see if we
> > can cover those without introducing additional latency in other
> > cases. Hence the extra benchmarks request ;)
> >
> > > >
> > > > > >
> > > > > > Did you benchmark this?
> > > > >
> > > > > will do, I need to understand if we can use tiny buffers in mvneta.
> > > >
> > > > Why tiny buffers? How does mvneta lay out the frags when doing
> > > > header split? Can we just benchmark what mvneta is doing at the
> > > > end of this patch series?
> > >
> > > for the moment mvneta can split the received data when the previous buffer
> > > is full (e.g. when the first page is completely written). I want to explore
> > > if I can set a tiny buffer (e.g. 128B) as the max receive buffer size, to
> > > run some performance tests and get results that are "comparable" with the
> > > ones I got when I added XDP support to mvneta.
> >
> > OK would be great.
> >
> > > >
> > > > Also can you try the basic XDP_TX case mentioned above.
> > > > I don't want this to degrade existing use cases if at all
> > > > possible.
> > >
> > > sure, will do.
> >
> > Thanks!
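P.S. To make the multi-buffer layout discussed above a bit more concrete, here
is a rough sketch of how a driver could record the extra rx buffers of a jumbo
frame. The names below (example_xdp_shared_info, example_add_rx_frag) are
purely illustrative and do not match the actual definitions in the series:

/* Illustrative sketch only: the real structure/helpers in the series may
 * differ; this just shows the kind of layout info a driver could export.
 */
#include <linux/skbuff.h>

struct example_xdp_shared_info {
	u16 nr_frags;				/* rx buffers beyond the first one */
	skb_frag_t frags[MAX_SKB_FRAGS];	/* page + offset + len of each chunk */
};

/* rx path: the first descriptor fills the xdp_buff as usual; every
 * additional descriptor belonging to the same jumbo frame is recorded
 * as a frag, so the eBPF program (or a remote cpu/driver after an
 * XDP_REDIRECT) can see the full frame layout.
 */
static void example_add_rx_frag(struct example_xdp_shared_info *sinfo,
				struct page *page, unsigned int offset,
				unsigned int len)
{
	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags++];

	__skb_frag_set_page(frag, page);
	skb_frag_off_set(frag, offset);
	skb_frag_size_set(frag, len);
}

A consumer on the other side of an XDP_REDIRECT (veth, cpumap) could then walk
these frags to build a non-linear skb, which is what would make GRO possible
there.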