* XDP redirect measurements, gotchas and tracepoints
@ 2017-08-21 19:25 Jesper Dangaard Brouer
  2017-08-21 22:35 ` Alexei Starovoitov
  2017-08-22 18:02 ` Michael Chan
  0 siblings, 2 replies; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-21 19:25 UTC (permalink / raw)
  To: xdp-newbies
  Cc: brouer, John Fastabend, Daniel Borkmann, Andy Gospodarek, netdev,
	Paweł Staszewski


I've been playing with the latest XDP_REDIRECT feature, which was
accepted into net-next (for ixgbe); see merge commit[1].
 [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31

At a first glance the performance looks awesome, and it is(!) when
your system is tuned for this workload. When perfectly tuned I can
show 13,096,427 pps forwarding, which is very close to 10Gbit/s
wirespeed at 64bytes (14.88Mpps).  Using only a single CPU (E5-1650 v4
@3.60GHz) core.

First gotcha(1): be aware of what you measure.  The numbers reported
by xdp_redirect_map are how many packets the XDP program received.
They carry no info on whether the packets were actually transmitted
out.  This info is available via TX counters[2] or an xdp tracepoint.

[2] ethtool_stats:
    https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
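
For context, the XDP side of the xdp_redirect_map sample boils down to
something like the following minimal sketch (samples/bpf-style helpers
assumed; the map size and section names are illustrative).  Note that
its return value only tells you the redirect was accepted by the helper:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* samples/bpf-style SEC() and map defs */

/* devmap: index -> egress ifindex, populated from userspace */
struct bpf_map_def SEC("maps") tx_port = {
	.type        = BPF_MAP_TYPE_DEVMAP,
	.key_size    = sizeof(int),
	.value_size  = sizeof(int),
	.max_entries = 1,
};

SEC("xdp_redirect_map")
int xdp_redirect_map_prog(struct xdp_md *ctx)
{
	/* A successful return here is what gets counted as "received";
	 * it says nothing about the frame leaving the egress device.
	 */
	return bpf_redirect_map(&tx_port, 0, 0);
}

char _license[] SEC("license") = "GPL";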

Second gotcha(2): you cannot TX out a device unless it also has an
XDP bpf program attached. (This is an implicit dependency, as the
driver code needs to set up XDP resources before it can ndo_xdp_xmit).
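
A minimal "dummy" XDP program that satisfies this dependency can be as
small as the sketch below (samples/bpf-style; attach it to the egress
device with your loader of choice):

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* for the SEC() macro */

/* Does nothing except put the device into XDP mode, so the driver
 * allocates the XDP TX resources needed for ndo_xdp_xmit.
 */
SEC("xdp_pass")
int xdp_pass_prog(struct xdp_md *ctx)
{
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";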

Third gotcha(3): you got this far, loaded XDP on both interfaces, and
now notice that (with the default setup) you can RX at 14Mpps but only
TX at 6.9Mpps (and might have 5% idle cycles).  I debugged this via
the perf tracepoint event xdp:xdp_redirect, and found it was due to
overrunning the XDP TX ring-queue size.

 Thus, for this workload, we need to adjust either the TX ring-queue
size (ethtool -G) or the DMA completion interval (ethtool -C rx-usecs).
See tuning and measurements below the signature.

Fourth gotcha(4): monitoring XDP redirect performance via the
tracepoint xdp:xdp_redirect is too slow, and affects the measurements
themselves.  I'm working on optimizing these tracepoints, and will
share results tomorrow.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


No-tuning (default auto-tuning rx-usecs 1):
 Notice tx_packets is too low compared to RX

Show adapter(s) (ixgbe1 ixgbe2) statistics (ONLY that changed!)
Ethtool(ixgbe1  ) stat:     14720134 (     14,720,134) <= fdir_miss /sec
Ethtool(ixgbe1  ) stat:    874951205 (    874,951,205) <= rx_bytes /sec
Ethtool(ixgbe1  ) stat:    952434290 (    952,434,290) <= rx_bytes_nic /sec
Ethtool(ixgbe1  ) stat:       271737 (        271,737) <= rx_missed_errors /sec
Ethtool(ixgbe1  ) stat:        27631 (         27,631) <= rx_no_dma_resources /sec
Ethtool(ixgbe1  ) stat:     14582520 (     14,582,520) <= rx_packets /sec
Ethtool(ixgbe1  ) stat:     14610072 (     14,610,072) <= rx_pkts_nic /sec
Ethtool(ixgbe1  ) stat:    874947566 (    874,947,566) <= rx_queue_2_bytes /sec
Ethtool(ixgbe1  ) stat:     14582459 (     14,582,459) <= rx_queue_2_packets /sec
Ethtool(ixgbe2  ) stat:    417934735 (    417,934,735) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:    445801114 (    445,801,114) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat:      6965579 (      6,965,579) <= tx_packets /sec
Ethtool(ixgbe2  ) stat:      6965771 (      6,965,771) <= tx_pkts_nic /sec


Tuned with rx-usecs 25:
 ethtool -C ixgbe1 rx-usecs 25 ;\
 ethtool -C ixgbe2 rx-usecs 25

Show adapter(s) (ixgbe1 ixgbe2) statistics (ONLY that changed!)
Ethtool(ixgbe1  ) stat:     14123764 (     14,123,764) <= fdir_miss /sec
Ethtool(ixgbe1  ) stat:    786101618 (    786,101,618) <= rx_bytes /sec
Ethtool(ixgbe1  ) stat:    952807289 (    952,807,289) <= rx_bytes_nic /sec
Ethtool(ixgbe1  ) stat:      1047989 (      1,047,989) <= rx_missed_errors /sec
Ethtool(ixgbe1  ) stat:       737938 (        737,938) <= rx_no_dma_resources /sec
Ethtool(ixgbe1  ) stat:     13101694 (     13,101,694) <= rx_packets /sec
Ethtool(ixgbe1  ) stat:     13839620 (     13,839,620) <= rx_pkts_nic /sec
Ethtool(ixgbe1  ) stat:    786101618 (    786,101,618) <= rx_queue_2_bytes /sec
Ethtool(ixgbe1  ) stat:     13101694 (     13,101,694) <= rx_queue_2_packets /sec
Ethtool(ixgbe2  ) stat:    785785590 (    785,785,590) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:    838179358 (    838,179,358) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat:     13096427 (     13,096,427) <= tx_packets /sec
Ethtool(ixgbe2  ) stat:     13096519 (     13,096,519) <= tx_pkts_nic /sec

Tuned by adjusting ring-queue sizes:
 ethtool -G ixgbe1 rx 1024 tx 1024 ;\
 ethtool -G ixgbe2 rx 1024 tx 1024

Show adapter(s) (ixgbe1 ixgbe2) statistics (ONLY that changed!)
Ethtool(ixgbe1  ) stat:     14169252 (     14,169,252) <= fdir_miss /sec
Ethtool(ixgbe1  ) stat:    783666937 (    783,666,937) <= rx_bytes /sec
Ethtool(ixgbe1  ) stat:    957332815 (    957,332,815) <= rx_bytes_nic /sec
Ethtool(ixgbe1  ) stat:      1053052 (      1,053,052) <= rx_missed_errors /sec
Ethtool(ixgbe1  ) stat:       844113 (        844,113) <= rx_no_dma_resources /sec
Ethtool(ixgbe1  ) stat:     13061116 (     13,061,116) <= rx_packets /sec
Ethtool(ixgbe1  ) stat:     13905221 (     13,905,221) <= rx_pkts_nic /sec
Ethtool(ixgbe1  ) stat:    783666937 (    783,666,937) <= rx_queue_2_bytes /sec
Ethtool(ixgbe1  ) stat:     13061116 (     13,061,116) <= rx_queue_2_packets /sec
Ethtool(ixgbe2  ) stat:    783312119 (    783,312,119) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:    835526092 (    835,526,092) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat:     13055202 (     13,055,202) <= tx_packets /sec
Ethtool(ixgbe2  ) stat:     13055093 (     13,055,093) <= tx_pkts_nic /sec


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-21 19:25 XDP redirect measurements, gotchas and tracepoints Jesper Dangaard Brouer
@ 2017-08-21 22:35 ` Alexei Starovoitov
  2017-08-22  6:37   ` Jesper Dangaard Brouer
  2017-08-22 18:02 ` Michael Chan
  1 sibling, 1 reply; 26+ messages in thread
From: Alexei Starovoitov @ 2017-08-21 22:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: xdp-newbies, John Fastabend, Daniel Borkmann, Andy Gospodarek,
	netdev, Paweł Staszewski

On Mon, Aug 21, 2017 at 09:25:06PM +0200, Jesper Dangaard Brouer wrote:
> 
> Third gotcha(3): you got this far, loaded XDP on both interfaces, and
> now notice that (with the default setup) you can RX at 14Mpps but only
> TX at 6.9Mpps (and might have 5% idle cycles).  I debugged this via
> the perf tracepoint event xdp:xdp_redirect, and found it was due to
> overrunning the XDP TX ring-queue size.

we should probably fix this somehow.
Once tx-ing netdev added to devmap we can enable xdp on it automatically?


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-21 22:35 ` Alexei Starovoitov
@ 2017-08-22  6:37   ` Jesper Dangaard Brouer
  2017-08-22 17:09     ` Alexei Starovoitov
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-22  6:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: xdp-newbies, John Fastabend, Daniel Borkmann, Andy Gospodarek,
	netdev, Paweł Staszewski, brouer

On Mon, 21 Aug 2017 15:35:42 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Mon, Aug 21, 2017 at 09:25:06PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > Third gotcha(3): you got this far, loaded XDP on both interfaces, and
> > now notice that (with the default setup) you can RX at 14Mpps but only
> > TX at 6.9Mpps (and might have 5% idle cycles).  I debugged this via
> > the perf tracepoint event xdp:xdp_redirect, and found it was due to
> > overrunning the XDP TX ring-queue size.
> 
> we should probably fix this somehow.

Gotcha-3 (quoted above) is an interesting problem.  At first it looks
like a driver tuning problem, but it is actually an inherent property
of XDP: as there is no queueing or push-back flow control with XDP,
there is no way to handle a TX queue overrun.
 My proposed solution: I want to provide a facility for userspace to
load another eBPF program (at the tracepoint xdp:xdp_redirect), which
can "see" the issue occurring.  This allows an XDP/BPF developer to
implement their own reaction/mitigation flow control (e.g. via a map
shared with the XDP program).
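
To make that concrete, here is a rough sketch of such a tracepoint
program (the context struct below is an assumption for illustration
only; the real field layout must be taken from
/sys/kernel/debug/tracing/events/xdp/xdp_redirect/format):

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* samples/bpf-style SEC() and map helpers */

/* Assumed tracepoint context layout -- illustrative, check the format file */
struct xdp_redirect_ctx {
	unsigned long long pad;		/* common tracepoint fields */
	int ifindex;
	int to_ifindex;
	int err;
};

/* Shared with the XDP program, which can back off when errors rise */
struct bpf_map_def SEC("maps") redirect_err_cnt = {
	.type        = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size    = sizeof(unsigned int),
	.value_size  = sizeof(unsigned long long),
	.max_entries = 2,		/* 0 = ok, 1 = failed */
};

SEC("tracepoint/xdp/xdp_redirect")
int trace_xdp_redirect(struct xdp_redirect_ctx *ctx)
{
	unsigned int key = ctx->err ? 1 : 0;
	unsigned long long *cnt;

	cnt = bpf_map_lookup_elem(&redirect_err_cnt, &key);
	if (cnt)
		*cnt += 1;
	return 0;
}

char _license[] SEC("license") = "GPL";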


> Once tx-ing netdev added to devmap we can enable xdp on it automatically?

I think you are referring to Gotcha-2 here:

  Second gotcha(2): you cannot TX out a device unless it also has an
  XDP bpf program attached. (This is an implicit dependency, as the
  driver code needs to set up XDP resources before it can ndo_xdp_xmit).

Yes, we should work on improving this situation.  Auto enabling XDP
when a netdev is added to a devmap is a good solution.  Currently this
is tied to loading an XDP bpf_prog.  Do you propose loading a dummy
bpf_prog on the netdev? (then we need to handle 1. not replacing
existing bpf_prog, 2. on take-down don't remove "later" loaded
bpf_prog).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-22  6:37   ` Jesper Dangaard Brouer
@ 2017-08-22 17:09     ` Alexei Starovoitov
  2017-08-22 17:17       ` John Fastabend
  0 siblings, 1 reply; 26+ messages in thread
From: Alexei Starovoitov @ 2017-08-22 17:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: xdp-newbies, John Fastabend, Daniel Borkmann, Andy Gospodarek,
	netdev, Paweł Staszewski

On Tue, Aug 22, 2017 at 08:37:10AM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> > Once tx-ing netdev added to devmap we can enable xdp on it automatically?
> 
> I think you are referring to Gotcha-2 here:

oops. yes :)

> 
>   Second gotcha(2): you cannot TX out a device unless it also has an
>   XDP bpf program attached. (This is an implicit dependency, as the
>   driver code needs to set up XDP resources before it can ndo_xdp_xmit).
> 
> Yes, we should work on improving this situation.  Auto enabling XDP
> when a netdev is added to a devmap is a good solution.  Currently this
> is tied to loading an XDP bpf_prog.  Do you propose loading a dummy
> bpf_prog on the netdev? (then we need to handle 1. not replacing
> existing bpf_prog, 2. on take-down don't remove "later" loaded
> bpf_prog).

right. these things need to be taken care of.
Technically for ndo_xdp_xmit to work the program doesn't need
to be attached, but the device needs to be in xdp mode with
configured xdp tx rings.
The easiest, of course, is just to document it :)
and maybe add some sort of warning that if a netdev is added
to a devmap and it's not in xdp mode, return a warning or error.


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-22 17:09     ` Alexei Starovoitov
@ 2017-08-22 17:17       ` John Fastabend
  2017-08-23  8:56         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 26+ messages in thread
From: John Fastabend @ 2017-08-22 17:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Jesper Dangaard Brouer
  Cc: xdp-newbies, Daniel Borkmann, Andy Gospodarek, netdev,
	Paweł Staszewski

On 08/22/2017 10:09 AM, Alexei Starovoitov wrote:
> On Tue, Aug 22, 2017 at 08:37:10AM +0200, Jesper Dangaard Brouer wrote:
>>
>>
>>> Once tx-ing netdev added to devmap we can enable xdp on it automatically?
>>
>> I think you are referring to Gotcha-2 here:
> 
> oops. yes :)
> 
>>
>>   Second gotcha(2): you cannot TX out a device unless it also has an
>>   XDP bpf program attached. (This is an implicit dependency, as the
>>   driver code needs to set up XDP resources before it can ndo_xdp_xmit).
>>
>> Yes, we should work on improving this situation.  Auto enabling XDP
>> when a netdev is added to a devmap is a good solution.  Currently this
>> is tied to loading an XDP bpf_prog.  Do you propose loading a dummy
>> bpf_prog on the netdev? (then we need to handle 1. not replacing
>> existing bpf_prog, 2. on take-down don't remove "later" loaded
>> bpf_prog).
> 
> right. these things need to be taken care of.
> Technically for ndo_xdp_xmit to work the program doesn't need
> to be attached, but the device needs to be in xdp mode with
> configured xdp tx rings.
> The easiest, of course, is just to document it :)
> and may be add some sort of warning that if netdev is added
> to devmap and it's not in xdp mode, return warning or error.
> 

When I wrote this I assumed some user space piece could
load the "dummy" nop program on devices as needed. It seemed
easier than putting semi-complex logic in the kernel to load
programs on update_elem, but only if the user hasn't already
loaded a program and then unload it but again only if some
criteria is met. Then we would have one more kernel path into
load/unload BPF programs and would need all the tests and what
not.

+1 for documenting and userland usability patches.

.John


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-21 19:25 XDP redirect measurements, gotchas and tracepoints Jesper Dangaard Brouer
  2017-08-21 22:35 ` Alexei Starovoitov
@ 2017-08-22 18:02 ` Michael Chan
  2017-08-22 18:17   ` John Fastabend
  1 sibling, 1 reply; 26+ messages in thread
From: Michael Chan @ 2017-08-22 18:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: xdp-newbies, John Fastabend, Daniel Borkmann, Andy Gospodarek,
	netdev, Paweł Staszewski

On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> I've been playing with the latest XDP_REDIRECT feature, which was
> accepted into net-next (for ixgbe); see merge commit[1].
>  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
>

Just catching up on XDP_REDIRECT and I have a very basic question.  The
ingress device passes the XDP buffer to the egress device for XDP
redirect transmission.  When the egress device has transmitted the
packet, is it supposed to just free the buffer?  Or is it supposed to
be recycled?

In XDP_TX, the buffer is recycled back to the rx ring.


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-22 18:02 ` Michael Chan
@ 2017-08-22 18:17   ` John Fastabend
  2017-08-22 18:30     ` Duyck, Alexander H
  0 siblings, 1 reply; 26+ messages in thread
From: John Fastabend @ 2017-08-22 18:17 UTC (permalink / raw)
  To: Michael Chan, Jesper Dangaard Brouer
  Cc: xdp-newbies, Daniel Borkmann, Andy Gospodarek, netdev,
	Paweł Staszewski, Alexander H Duyck

On 08/22/2017 11:02 AM, Michael Chan wrote:
> On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> I've been playing with the latest XDP_REDIRECT feature, which was
>> accepted into net-next (for ixgbe); see merge commit[1].
>>  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
>>
> 
> Just catching up on XDP_REDIRECT and I have a very basic question.  The
> ingress device passes the XDP buffer to the egress device for XDP
> redirect transmission.  When the egress device has transmitted the
> packet, is it supposed to just free the buffer?  Or is it supposed to
> be recycled?
> 
> In XDP_TX, the buffer is recycled back to the rx ring.
> 

With XDP_REDIRECT we must "just free the buffer"; in ixgbe this means
page_frag_free() on the data. There is no way to know where the xdp
buffer came from; it could be a different NIC, for example.

However, with how ixgbe is coded up, recycling will work as long as
the memory is freed before the driver ring tries to use it again. In
normal usage this should be the case. And if we are over-running a device,
it doesn't really hurt to slow down the sender a bit.

I think this is a pretty good model, we could probably provide a set
of APIs for drivers to use so that we get some consistency across
vendors here, ala Jesper's page pool ideas.
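
To illustrate what "just free it" looks like on the TX-completion side,
here is a simplified sketch (not the actual ixgbe code; the struct is
made up for illustration):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>		/* page_frag_free() */

/* Illustrative only -- not the real ixgbe tx_buffer layout */
struct xdp_tx_buffer {
	void		*data;	/* xdp->data handed over on redirect */
	dma_addr_t	dma;
	unsigned int	len;
};

static void xdp_tx_complete(struct device *dev, struct xdp_tx_buffer *buf)
{
	dma_unmap_single(dev, buf->dma, buf->len, DMA_TO_DEVICE);
	/* The frame may have come from any ingress driver, so the only
	 * safe option is to drop the page-frag reference; if the RX side
	 * still holds its own reference, its recycling keeps working.
	 */
	page_frag_free(buf->data);
}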

(+Alex, for ixgbe details)

Thanks,
John


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-22 18:17   ` John Fastabend
@ 2017-08-22 18:30     ` Duyck, Alexander H
  2017-08-22 20:04       ` Michael Chan
  0 siblings, 1 reply; 26+ messages in thread
From: Duyck, Alexander H @ 2017-08-22 18:30 UTC (permalink / raw)
  To: john.fastabend, brouer, michael.chan
  Cc: pstaszewski, netdev, xdp-newbies, andy, borkmann

On Tue, 2017-08-22 at 11:17 -0700, John Fastabend wrote:
> On 08/22/2017 11:02 AM, Michael Chan wrote:
> > On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > > 
> > > I've been playing with the latest XDP_REDIRECT feature, which was
> > > accepted into net-next (for ixgbe); see merge commit[1].
> > >  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
> > > 
> > 
> > Just catching up on XDP_REDIRECT and I have a very basic question.  The
> > ingress device passes the XDP buffer to the egress device for XDP
> > redirect transmission.  When the egress device has transmitted the
> > packet, is it supposed to just free the buffer?  Or is it supposed to
> > be recycled?
> > 
> > In XDP_TX, the buffer is recycled back to the rx ring.
> > 
> 
> With XDP_REDIRECT we must "just free the buffer" in ixgbe this means
> page_frag_free() on the data. There is no way to know where the xdp
> buffer came from it could be a different NIC for example.
> 
> However with how ixgbe is coded up recycling will work as long as
> the memory is free'd before the driver ring tries to use it again. In
> normal usage this should be the case. And if we are over-running a device
> it doesn't really hurt to slow down the sender a bit.
> 
> I think this is a pretty good model, we could probably provide a set
> of APIs for drivers to use so that we get some consistency across
> vendors here, ala Jesper's page pool ideas.
> 
> (+Alex, for ixgbe details)
> 
> Thanks,
> John

I think you pretty much covered the inner workings for the ixgbe bits.

The only piece I would add is that the recycling trick normally only
works if the same interface/driver is doing both the Tx and the Rx. The
redirect code cannot assume that is the case and that is the reason why
it must always be freeing the traffic on clean-up.

- Alex


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-22 18:30     ` Duyck, Alexander H
@ 2017-08-22 20:04       ` Michael Chan
  2017-08-23  1:06         ` Alexander Duyck
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Chan @ 2017-08-22 20:04 UTC (permalink / raw)
  To: Duyck, Alexander H
  Cc: john.fastabend, brouer, pstaszewski, netdev, xdp-newbies, andy, borkmann

On Tue, Aug 22, 2017 at 11:30 AM, Duyck, Alexander H
<alexander.h.duyck@intel.com> wrote:
> On Tue, 2017-08-22 at 11:17 -0700, John Fastabend wrote:
>> On 08/22/2017 11:02 AM, Michael Chan wrote:
>> > On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
>> > <brouer@redhat.com> wrote:
>> > >
>> > > I've been playing with the latest XDP_REDIRECT feature, which was
>> > > accepted into net-next (for ixgbe); see merge commit[1].
>> > >  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
>> > >
>> >
>> > Just catching up on XDP_REDIRECT and I have a very basic question.  The
>> > ingress device passes the XDP buffer to the egress device for XDP
>> > redirect transmission.  When the egress device has transmitted the
>> > packet, is it supposed to just free the buffer?  Or is it supposed to
>> > be recycled?
>> >
>> > In XDP_TX, the buffer is recycled back to the rx ring.
>> >
>>
>> With XDP_REDIRECT we must "just free the buffer" in ixgbe this means
>> page_frag_free() on the data. There is no way to know where the xdp
>> buffer came from it could be a different NIC for example.
>>
>> However with how ixgbe is coded up recycling will work as long as
>> the memory is free'd before the driver ring tries to use it again. In
>> normal usage this should be the case. And if we are over-running a device
>> it doesn't really hurt to slow down the sender a bit.
>>
>> I think this is a pretty good model, we could probably provide a set
>> of APIs for drivers to use so that we get some consistency across
>> vendors here, ala Jesper's page pool ideas.
>>
>> (+Alex, for ixgbe details)
>>
>> Thanks,
>> John
>
> I think you pretty much covered the inner workings for the ixgbe bits.
>
> The only piece I would add is that the recycling trick normally only
> works if the same interface/driver is doing both the Tx and the Rx. The
> redirect code cannot assume that is the case and that is the reason why
> it must always be freeing the traffic on clean-up.
>

Right, but it's conceivable to add an API to "return" the buffer to
the input device, right?


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-22 20:04       ` Michael Chan
@ 2017-08-23  1:06         ` Alexander Duyck
  2017-08-23  6:59           ` Michael Chan
  0 siblings, 1 reply; 26+ messages in thread
From: Alexander Duyck @ 2017-08-23  1:06 UTC (permalink / raw)
  To: Michael Chan
  Cc: Duyck, Alexander H, john.fastabend, brouer, pstaszewski, netdev,
	xdp-newbies, andy, borkmann

On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
> On Tue, Aug 22, 2017 at 11:30 AM, Duyck, Alexander H
> <alexander.h.duyck@intel.com> wrote:
>> On Tue, 2017-08-22 at 11:17 -0700, John Fastabend wrote:
>>> On 08/22/2017 11:02 AM, Michael Chan wrote:
>>> > On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
>>> > <brouer@redhat.com> wrote:
>>> > >
>>> > > I've been playing with the latest XDP_REDIRECT feature, which was
>>> > > accepted into net-next (for ixgbe); see merge commit[1].
>>> > >  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
>>> > >
>>> >
>>> > Just catching up on XDP_REDIRECT and I have a very basic question.  The
>>> > ingress device passes the XDP buffer to the egress device for XDP
>>> > redirect transmission.  When the egress device has transmitted the
>>> > packet, is it supposed to just free the buffer?  Or is it supposed to
>>> > be recycled?
>>> >
>>> > In XDP_TX, the buffer is recycled back to the rx ring.
>>> >
>>>
>>> With XDP_REDIRECT we must "just free the buffer" in ixgbe this means
>>> page_frag_free() on the data. There is no way to know where the xdp
>>> buffer came from it could be a different NIC for example.
>>>
>>> However with how ixgbe is coded up recycling will work as long as
>>> the memory is free'd before the driver ring tries to use it again. In
>>> normal usage this should be the case. And if we are over-running a device
>>> it doesn't really hurt to slow down the sender a bit.
>>>
>>> I think this is a pretty good model, we could probably provide a set
>>> of APIs for drivers to use so that we get some consistency across
>>> vendors here, ala Jesper's page pool ideas.
>>>
>>> (+Alex, for ixgbe details)
>>>
>>> Thanks,
>>> John
>>
>> I think you pretty much covered the inner workings for the ixgbe bits.
>>
>> The only piece I would add is that the recycling trick normally only
>> works if the same interface/driver is doing both the Tx and the Rx. The
>> redirect code cannot assume that is the case and that is the reason why
>> it must always be freeing the traffic on clean-up.
>>
>
> Right, but it's conceivable to add an API to "return" the buffer to
> the input device, right?

You could, it is just added complexity. "just free the buffer" in
ixgbe usually just amounts to one atomic operation to decrement the
total page count since page recycling is already implemented in the
driver. You still would have to unmap the buffer regardless of if you
were recycling it or not so all you would save is 1.000015259 atomic
operations per packet. The fraction is because once every 64K uses we
have to bulk update the count on the page.

There are still thoughts at some point in the future to consider
changing the layout so that we lay things out linearly instead of
interleaving the page halves. However that is a bit of optimization
and right now I don't really have the spare time to explore it. It
would help the performance by making sure the pages are warm on the
second freeing assuming all the packets in a given flow are received
back to back.

- Alex


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-23  1:06         ` Alexander Duyck
@ 2017-08-23  6:59           ` Michael Chan
  2017-08-23  8:29             ` Jesper Dangaard Brouer
  2017-08-23 14:51             ` Alexander Duyck
  0 siblings, 2 replies; 26+ messages in thread
From: Michael Chan @ 2017-08-23  6:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, john.fastabend, brouer, pstaszewski, netdev,
	xdp-newbies, andy, borkmann

On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>>
>> Right, but it's conceivable to add an API to "return" the buffer to
>> the input device, right?
>
> You could, it is just added complexity. "just free the buffer" in
> ixgbe usually just amounts to one atomic operation to decrement the
> total page count since page recycling is already implemented in the
> driver. You still would have to unmap the buffer regardless of if you
> were recycling it or not so all you would save is 1.000015259 atomic
> operations per packet. The fraction is because once every 64K uses we
> have to bulk update the count on the page.
>

If the buffer is returned to the input device, the input device can
keep the DMA mapping.  All it needs to do is to dma_sync it back to
the input device when the buffer is returned.
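
Conceptually something like this, assuming the RX driver kept the
original streaming mapping for the page (a sketch, not driver code):

#include <linux/dma-mapping.h>

/* When a redirected buffer comes back, only a dma_sync for device is
 * needed before it is reposted on the RX ring -- no new dma_map, since
 * the original mapping was kept by the RX driver.
 */
static void rx_reuse_returned_buffer(struct device *dma_dev, dma_addr_t dma,
				     unsigned int offset, unsigned int len)
{
	dma_sync_single_range_for_device(dma_dev, dma, offset, len,
					 DMA_FROM_DEVICE);
	/* ...then put the buffer back on the RX descriptor ring */
}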


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-23  6:59           ` Michael Chan
@ 2017-08-23  8:29             ` Jesper Dangaard Brouer
  2017-08-25  3:36               ` Michael Chan
  2017-08-23 14:51             ` Alexander Duyck
  1 sibling, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-23  8:29 UTC (permalink / raw)
  To: Michael Chan
  Cc: Alexander Duyck, Duyck, Alexander H, john.fastabend, pstaszewski,
	netdev, xdp-newbies, andy, borkmann, brouer

On Tue, 22 Aug 2017 23:59:05 -0700
Michael Chan <michael.chan@broadcom.com> wrote:

> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:  
> >>
> >> Right, but it's conceivable to add an API to "return" the buffer to
> >> the input device, right?

Yes, I would really like to see an API like this.

> >
> > You could, it is just added complexity. "just free the buffer" in
> > ixgbe usually just amounts to one atomic operation to decrement the
> > total page count since page recycling is already implemented in the
> > driver. You still would have to unmap the buffer regardless of if you
> > were recycling it or not so all you would save is 1.000015259 atomic
> > operations per packet. The fraction is because once every 64K uses we
> > have to bulk update the count on the page.
> >  
> 
> If the buffer is returned to the input device, the input device can
> keep the DMA mapping.  All it needs to do is to dma_sync it back to
> the input device when the buffer is returned.

Yes, exactly, return to the input device. I really think we should
work on a solution where we can keep the DMA mapping around.  We have
an opportunity here to make ndo_xdp_xmit TX queues use a specialized
page return call to achieve this. (I imagine other archs have a higher
DMA overhead than Intel.)

I'm not sure how the API should look.  The ixgbe recycle mechanism and
splitting the page (into two packets) actually complicate things, and
tie us into a page-refcnt based model.  We could get around this by
having each driver implement a page-return callback that allows us to
return the page to the input device?  Then, drivers implementing the
1-packet-per-page model can simply check/read the page refcnt, and if
it is "1", DMA-sync and reuse the page in the RX queue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-22 17:17       ` John Fastabend
@ 2017-08-23  8:56         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-23  8:56 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, xdp-newbies, Daniel Borkmann,
	Andy Gospodarek, netdev, Paweł Staszewski, brouer

On Tue, 22 Aug 2017 10:17:59 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 08/22/2017 10:09 AM, Alexei Starovoitov wrote:
> > On Tue, Aug 22, 2017 at 08:37:10AM +0200, Jesper Dangaard Brouer wrote:  
> >>
> >>  
> >>> Once tx-ing netdev added to devmap we can enable xdp on it automatically?  
> >>
> >> I think you are referring to Gotcha-2 here:  
> > 
> > oops. yes :)
> >   
> >>
> >>   Second gotcha(2): you cannot TX out a device unless it also has an
> >>   XDP bpf program attached. (This is an implicit dependency, as the
> >>   driver code needs to set up XDP resources before it can ndo_xdp_xmit).
> >>
> >> Yes, we should work on improving this situation.  Auto enabling XDP
> >> when a netdev is added to a devmap is a good solution.  Currently this
> >> is tied to loading an XDP bpf_prog.  Do you propose loading a dummy
> >> bpf_prog on the netdev? (then we need to handle 1. not replacing
> >> existing bpf_prog, 2. on take-down don't remove "later" loaded
> >> bpf_prog).  
> > 
> > right. these things need to be taken care of.
> > Technically for ndo_xdp_xmit to work the program doesn't need
> > to be attached, but the device needs to be in xdp mode with
> > configured xdp tx rings.
> > The easiest, of course, is just to document it :)
> > and may be add some sort of warning that if netdev is added
> > to devmap and it's not in xdp mode, return warning or error.
> >   
> 
> When I wrote this I assumed some user space piece could
> load the "dummy" nop program on devices as needed. It seemed
> easier than putting semi-complex logic in the kernel to load
> programs on update_elem, but only if the user hasn't already
> loaded a program and then unload it but again only if some
> criteria is met. Then we would have one more kernel path into
> load/unload BPF programs and would need all the tests and what
> not.

I agree, it is not good to add this kind of logic to the kernel. We
would create too many funny race conditions with userspace.

> +1 for documenting and userland usability patches.

We still don't have a good place for XDP documentation.  My own[1]
attempt has gotten out-of-sync, and I need to restructure the
rst-docs before I propose it for the kernel.

[1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html

I want to create some samples/examples with XDP_REDIRECT that
highlight this property.  Maybe the samples/bpf/xdp_redirect{,_map}
samples should automatically attach an XDP bpf_prog to the egress device?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-23  6:59           ` Michael Chan
  2017-08-23  8:29             ` Jesper Dangaard Brouer
@ 2017-08-23 14:51             ` Alexander Duyck
  1 sibling, 0 replies; 26+ messages in thread
From: Alexander Duyck @ 2017-08-23 14:51 UTC (permalink / raw)
  To: Michael Chan
  Cc: Duyck, Alexander H, john.fastabend, brouer, pstaszewski, netdev,
	xdp-newbies, andy, borkmann

On Tue, Aug 22, 2017 at 11:59 PM, Michael Chan
<michael.chan@broadcom.com> wrote:
> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>>>
>>> Right, but it's conceivable to add an API to "return" the buffer to
>>> the input device, right?
>>
>> You could, it is just added complexity. "just free the buffer" in
>> ixgbe usually just amounts to one atomic operation to decrement the
>> total page count since page recycling is already implemented in the
>> driver. You still would have to unmap the buffer regardless of if you
>> were recycling it or not so all you would save is 1.000015259 atomic
>> operations per packet. The fraction is because once every 64K uses we
>> have to bulk update the count on the page.
>>
>
> If the buffer is returned to the input device, the input device can
> keep the DMA mapping.  All it needs to do is to dma_sync it back to
> the input device when the buffer is returned.

That is what ixgbe is already doing. The Rx path doesn't free the
page, it just treats it as a bounce buffer and uses the page count to
make certain we don't use a section of the buffer that is already in
use.

- Alex


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-23  8:29             ` Jesper Dangaard Brouer
@ 2017-08-25  3:36               ` Michael Chan
  2017-08-25 12:45                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Chan @ 2017-08-25  3:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, Duyck, Alexander H, john.fastabend, pstaszewski,
	netdev, xdp-newbies, andy, borkmann

On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Tue, 22 Aug 2017 23:59:05 -0700
> Michael Chan <michael.chan@broadcom.com> wrote:
>
>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>> > On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>> >>
>> >> Right, but it's conceivable to add an API to "return" the buffer to
>> >> the input device, right?
>
> Yes, I would really like to see an API like this.
>
>> >
>> > You could, it is just added complexity. "just free the buffer" in
>> > ixgbe usually just amounts to one atomic operation to decrement the
>> > total page count since page recycling is already implemented in the
>> > driver. You still would have to unmap the buffer regardless of if you
>> > were recycling it or not so all you would save is 1.000015259 atomic
>> > operations per packet. The fraction is because once every 64K uses we
>> > have to bulk update the count on the page.
>> >
>>
>> If the buffer is returned to the input device, the input device can
>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>> the input device when the buffer is returned.
>
> Yes, exactly, return to the input device. I really think we should
> work on a solution where we can keep the DMA mapping around.  We have
> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
> page return call, to achieve this. (I imagine other arch's have a high
> DMA overhead than Intel)
>
> I'm not sure how the API should look.  The ixgbe recycle mechanism and
> splitting the page (into two packets) actually complicates things, and
> tie us into a page-refcnt based model.  We could get around this by
> each driver implementing a page-return-callback, that allow us to
> return the page to the input device?  Then, drivers implementing the
> 1-packet-per-page can simply check/read the page-refcnt, and if it is
> "1" DMA-sync and reuse it in the RX queue.
>

Yeah, based on Alex' description, it's not clear to me whether ixgbe
redirecting to a non-intel NIC or vice versa will actually work.  It
sounds like the output device has to make some assumptions about how
the page was allocated by the input device.  With buffer return API,
each driver can cleanly recycle or free its own buffers properly.

Let me discuss this further with Andy to see if we can come up with a
good scheme.


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-25  3:36               ` Michael Chan
@ 2017-08-25 12:45                 ` Jesper Dangaard Brouer
  2017-08-25 15:10                   ` John Fastabend
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-25 12:45 UTC (permalink / raw)
  To: Michael Chan
  Cc: Alexander Duyck, Duyck, Alexander H, john.fastabend, pstaszewski,
	netdev, xdp-newbies, andy, borkmann, brouer

On Thu, 24 Aug 2017 20:36:28 -0700
Michael Chan <michael.chan@broadcom.com> wrote:

> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Tue, 22 Aug 2017 23:59:05 -0700
> > Michael Chan <michael.chan@broadcom.com> wrote:
> >  
> >> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
> >> <alexander.duyck@gmail.com> wrote:  
> >> > On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:  
> >> >>
> >> >> Right, but it's conceivable to add an API to "return" the buffer to
> >> >> the input device, right?  
> >
> > Yes, I would really like to see an API like this.
> >  
> >> >
> >> > You could, it is just added complexity. "just free the buffer" in
> >> > ixgbe usually just amounts to one atomic operation to decrement the
> >> > total page count since page recycling is already implemented in the
> >> > driver. You still would have to unmap the buffer regardless of if you
> >> > were recycling it or not so all you would save is 1.000015259 atomic
> >> > operations per packet. The fraction is because once every 64K uses we
> >> > have to bulk update the count on the page.
> >> >  
> >>
> >> If the buffer is returned to the input device, the input device can
> >> keep the DMA mapping.  All it needs to do is to dma_sync it back to
> >> the input device when the buffer is returned.  
> >
> > Yes, exactly, return to the input device. I really think we should
> > work on a solution where we can keep the DMA mapping around.  We have
> > an opportunity here to make ndo_xdp_xmit TX queues use a specialized
> > page return call, to achieve this. (I imagine other arch's have a high
> > DMA overhead than Intel)
> >
> > I'm not sure how the API should look.  The ixgbe recycle mechanism and
> > splitting the page (into two packets) actually complicates things, and
> > tie us into a page-refcnt based model.  We could get around this by
> > each driver implementing a page-return-callback, that allow us to
> > return the page to the input device?  Then, drivers implementing the
> > 1-packet-per-page can simply check/read the page-refcnt, and if it is
> > "1" DMA-sync and reuse it in the RX queue.
> >  
> 
> Yeah, based on Alex' description, it's not clear to me whether ixgbe
> redirecting to a non-intel NIC or vice versa will actually work.  It
> sounds like the output device has to make some assumptions about how
> the page was allocated by the input device. 

Yes, exactly. We are tied into a page refcnt based scheme.

Besides, the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
is also tied to the RX queue size, plus how fast the pages are returned.
This makes it very hard to tune.  As I demonstrated, the default ixgbe
settings do not work well with XDP_REDIRECT.  I needed to increase the
TX-ring size, but that broke page recycling (dropping perf from 13Mpps to
10Mpps), so I also needed to increase the RX-ring size.  But perf is best
if the RX-ring size is smaller, thus two contradicting tunings are needed.


> With buffer return API,
> each driver can cleanly recycle or free its own buffers properly.

Yes, exactly. And RX-driver can implement a special memory model for
this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
which is never used for SKBs, thus opening for new RX memory models.

Another advantage of a return API: there is also an opportunity to
avoid the DMA map on TX, as we need to know the from-device.  Thus,
we can add a DMA API where we can query whether the two devices use
the same DMA engine, and reuse the same DMA address the RX-side
already knows.
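
For the "query" part, a strawman could be as crude as checking whether
both netdevs sit behind the same IOMMU domain (illustrative only; real
criteria would need to consider dma_ops, offsets and coherency too):

#include <linux/iommu.h>
#include <linux/netdevice.h>

/* Strawman: if both devices share an IOMMU domain (or both have none),
 * a DMA address produced for the RX device should also be usable by
 * the TX device, so ndo_xdp_xmit could skip its own dma_map.
 */
static bool xdp_same_dma_domain(struct net_device *rx_dev,
				struct net_device *tx_dev)
{
	return iommu_get_domain_for_dev(rx_dev->dev.parent) ==
	       iommu_get_domain_for_dev(tx_dev->dev.parent);
}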


> Let me discuss this further with Andy to see if we can come up with a
> good scheme.

Sound good, looking forward to hear what you come-up with :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-25 12:45                 ` Jesper Dangaard Brouer
@ 2017-08-25 15:10                   ` John Fastabend
  2017-08-25 15:28                     ` Michael Chan
  0 siblings, 1 reply; 26+ messages in thread
From: John Fastabend @ 2017-08-25 15:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Michael Chan
  Cc: Alexander Duyck, Duyck, Alexander H, pstaszewski, netdev,
	xdp-newbies, andy, borkmann

On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
> On Thu, 24 Aug 2017 20:36:28 -0700
> Michael Chan <michael.chan@broadcom.com> wrote:
> 
>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>> On Tue, 22 Aug 2017 23:59:05 -0700
>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>  
>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>>>> <alexander.duyck@gmail.com> wrote:  
>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:  
>>>>>>
>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
>>>>>> the input device, right?  
>>>
>>> Yes, I would really like to see an API like this.
>>>  
>>>>>
>>>>> You could, it is just added complexity. "just free the buffer" in
>>>>> ixgbe usually just amounts to one atomic operation to decrement the
>>>>> total page count since page recycling is already implemented in the
>>>>> driver. You still would have to unmap the buffer regardless of if you
>>>>> were recycling it or not so all you would save is 1.000015259 atomic
>>>>> operations per packet. The fraction is because once every 64K uses we
>>>>> have to bulk update the count on the page.
>>>>>  
>>>>
>>>> If the buffer is returned to the input device, the input device can
>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>>>> the input device when the buffer is returned.  
>>>
>>> Yes, exactly, return to the input device. I really think we should
>>> work on a solution where we can keep the DMA mapping around.  We have
>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
>>> page return call, to achieve this. (I imagine other arch's have a high
>>> DMA overhead than Intel)
>>>
>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
>>> splitting the page (into two packets) actually complicates things, and
>>> tie us into a page-refcnt based model.  We could get around this by
>>> each driver implementing a page-return-callback, that allow us to
>>> return the page to the input device?  Then, drivers implementing the
>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
>>> "1" DMA-sync and reuse it in the RX queue.
>>>  
>>
>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
>> redirecting to a non-intel NIC or vice versa will actually work.  It
>> sounds like the output device has to make some assumptions about how
>> the page was allocated by the input device. 
> 
> Yes, exactly. We are tied into a page refcnt based scheme.
> 
> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
> is also tied to the RX queue size, plus how fast the pages are returned.
> This makes it very hard to tune.  As I demonstrated, default ixgbe
> settings does not work well with XDP_REDIRECT.  I needed to increase
> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
> RX-ring size is smaller, thus two contradicting tuning needed.
> 

The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
split into two halves being the default) from the number of descriptors
don't look too bad IMO. It seems like it could be done by having some
extra pages allocated upfront and pulling those in when we need another
page.

This would be a nice iterative step we could take on the existing API.

> 
>> With buffer return API,
>> each driver can cleanly recycle or free its own buffers properly.
> 
> Yes, exactly. And RX-driver can implement a special memory model for
> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
> which is never used for SKBs, thus opening for new RX memory models.
> 
> Another advantage of a return API.  There is also an opportunity for
> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
> we can add a DMA API, where we can query if the two devices uses the
> same DMA engine, and can reuse the same DMA address the RX-side already
> knows.
> 
> 
>> Let me discuss this further with Andy to see if we can come up with a
>> good scheme.
> 
> Sound good, looking forward to hear what you come-up with :-)
> 

I guess by this thread we will see a broadcom nic with redirect support
soon ;)


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-25 15:10                   ` John Fastabend
@ 2017-08-25 15:28                     ` Michael Chan
  2017-08-28 16:02                       ` Andy Gospodarek
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Chan @ 2017-08-25 15:28 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, Alexander Duyck, Duyck, Alexander H,
	pstaszewski, netdev, xdp-newbies, andy, borkmann

On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
>> On Thu, 24 Aug 2017 20:36:28 -0700
>> Michael Chan <michael.chan@broadcom.com> wrote:
>>
>>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
>>> <brouer@redhat.com> wrote:
>>>> On Tue, 22 Aug 2017 23:59:05 -0700
>>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>>
>>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>>>>> <alexander.duyck@gmail.com> wrote:
>>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>>>>>>>
>>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
>>>>>>> the input device, right?
>>>>
>>>> Yes, I would really like to see an API like this.
>>>>
>>>>>>
>>>>>> You could, it is just added complexity. "just free the buffer" in
>>>>>> ixgbe usually just amounts to one atomic operation to decrement the
>>>>>> total page count since page recycling is already implemented in the
>>>>>> driver. You still would have to unmap the buffer regardless of if you
>>>>>> were recycling it or not so all you would save is 1.000015259 atomic
>>>>>> operations per packet. The fraction is because once every 64K uses we
>>>>>> have to bulk update the count on the page.
>>>>>>
>>>>>
>>>>> If the buffer is returned to the input device, the input device can
>>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>>>>> the input device when the buffer is returned.
>>>>
>>>> Yes, exactly, return to the input device. I really think we should
>>>> work on a solution where we can keep the DMA mapping around.  We have
>>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
>>>> page return call, to achieve this. (I imagine other arch's have a high
>>>> DMA overhead than Intel)
>>>>
>>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
>>>> splitting the page (into two packets) actually complicates things, and
>>>> tie us into a page-refcnt based model.  We could get around this by
>>>> each driver implementing a page-return-callback, that allow us to
>>>> return the page to the input device?  Then, drivers implementing the
>>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
>>>> "1" DMA-sync and reuse it in the RX queue.
>>>>
>>>
>>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
>>> redirecting to a non-intel NIC or vice versa will actually work.  It
>>> sounds like the output device has to make some assumptions about how
>>> the page was allocated by the input device.
>>
>> Yes, exactly. We are tied into a page refcnt based scheme.
>>
>> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
>> is also tied to the RX queue size, plus how fast the pages are returned.
>> This makes it very hard to tune.  As I demonstrated, default ixgbe
>> settings does not work well with XDP_REDIRECT.  I needed to increase
>> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
>> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
>> RX-ring size is smaller, thus two contradicting tuning needed.
>>
>
> The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
> split into two halves being the default) from the number of descriptors
> doesn't look too bad IMO. It seems like it could be done by having some
> extra pages allocated upfront and pulling those in when we need another
> page.
>
> This would be a nice iterative step we could take on the existing API.
>
>>
>>> With buffer return API,
>>> each driver can cleanly recycle or free its own buffers properly.
>>
>> Yes, exactly. And RX-driver can implement a special memory model for
>> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
>> which is never used for SKBs, thus opening for new RX memory models.
>>
>> Another advantage of a return API.  There is also an opportunity for
>> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
>> we can add a DMA API, where we can query if the two devices uses the
>> same DMA engine, and can reuse the same DMA address the RX-side already
>> knows.
>>
>>
>>> Let me discuss this further with Andy to see if we can come up with a
>>> good scheme.
>>
>> Sound good, looking forward to hear what you come-up with :-)
>>
>
> I guess by this thread we will see a broadcom nic with redirect support
> soon ;)

Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
buffer recycling scheme has some problems.  We can make it work for
Broadcom to Broadcom only, but we want a better solution.


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-25 15:28                     ` Michael Chan
@ 2017-08-28 16:02                       ` Andy Gospodarek
  2017-08-28 16:11                         ` Alexander Duyck
  2017-08-28 16:14                         ` John Fastabend
  0 siblings, 2 replies; 26+ messages in thread
From: Andy Gospodarek @ 2017-08-28 16:02 UTC (permalink / raw)
  To: Michael Chan
  Cc: John Fastabend, Jesper Dangaard Brouer, Alexander Duyck, Duyck,
	Alexander H, pstaszewski, netdev, xdp-newbies, borkmann

On Fri, Aug 25, 2017 at 08:28:55AM -0700, Michael Chan wrote:
> On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
> <john.fastabend@gmail.com> wrote:
> > On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
> >> On Thu, 24 Aug 2017 20:36:28 -0700
> >> Michael Chan <michael.chan@broadcom.com> wrote:
> >>
> >>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
> >>> <brouer@redhat.com> wrote:
> >>>> On Tue, 22 Aug 2017 23:59:05 -0700
> >>>> Michael Chan <michael.chan@broadcom.com> wrote:
> >>>>
> >>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
> >>>>> <alexander.duyck@gmail.com> wrote:
> >>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
> >>>>>>>
> >>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
> >>>>>>> the input device, right?
> >>>>
> >>>> Yes, I would really like to see an API like this.
> >>>>
> >>>>>>
> >>>>>> You could, it is just added complexity. "just free the buffer" in
> >>>>>> ixgbe usually just amounts to one atomic operation to decrement the
> >>>>>> total page count since page recycling is already implemented in the
> >>>>>> driver. You still would have to unmap the buffer regardless of if you
> >>>>>> were recycling it or not so all you would save is 1.000015259 atomic
> >>>>>> operations per packet. The fraction is because once every 64K uses we
> >>>>>> have to bulk update the count on the page.
> >>>>>>
> >>>>>
> >>>>> If the buffer is returned to the input device, the input device can
> >>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
> >>>>> the input device when the buffer is returned.
> >>>>
> >>>> Yes, exactly, return to the input device. I really think we should
> >>>> work on a solution where we can keep the DMA mapping around.  We have
> >>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
> >>>> page return call, to achieve this. (I imagine other arch's have a high
> >>>> DMA overhead than Intel)
> >>>>
> >>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
> >>>> splitting the page (into two packets) actually complicates things, and
> >>>> tie us into a page-refcnt based model.  We could get around this by
> >>>> each driver implementing a page-return-callback, that allow us to
> >>>> return the page to the input device?  Then, drivers implementing the
> >>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
> >>>> "1" DMA-sync and reuse it in the RX queue.
> >>>>
> >>>
> >>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
> >>> redirecting to a non-intel NIC or vice versa will actually work.  It
> >>> sounds like the output device has to make some assumptions about how
> >>> the page was allocated by the input device.
> >>
> >> Yes, exactly. We are tied into a page refcnt based scheme.
> >>
> >> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
> >> is also tied to the RX queue size, plus how fast the pages are returned.
> >> This makes it very hard to tune.  As I demonstrated, default ixgbe
> >> settings does not work well with XDP_REDIRECT.  I needed to increase
> >> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
> >> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
> >> RX-ring size is smaller, thus two contradicting tuning needed.
> >>
> >
> > The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
> > split into two halves being the default) from the number of descriptors
> > doesn't look too bad IMO. It seems like it could be done by having some
> > extra pages allocated upfront and pulling those in when we need another
> > page.
> >
> > This would be a nice iterative step we could take on the existing API.
> >
> >>
> >>> With buffer return API,
> >>> each driver can cleanly recycle or free its own buffers properly.
> >>
> >> Yes, exactly. And RX-driver can implement a special memory model for
> >> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
> >> which is never used for SKBs, thus opening for new RX memory models.
> >>
> >> Another advantage of a return API.  There is also an opportunity for
> >> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
> >> we can add a DMA API, where we can query if the two devices uses the
> >> same DMA engine, and can reuse the same DMA address the RX-side already
> >> knows.
> >>
> >>
> >>> Let me discuss this further with Andy to see if we can come up with a
> >>> good scheme.
> >>
> >> Sound good, looking forward to hear what you come-up with :-)
> >>
> >
> > I guess by this thread we will see a broadcom nic with redirect support
> > soon ;)
> 
> Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
> buffer recycling scheme has some problems.  We can make it work for
> Broadcom to Broadcom only, but we want a better solution.

(Sorry for the radio silence I was AFK last week...)

I finished it a little while ago, but Michael and I both have concerns
that in a heterogeneous hardware setup one can quickly run into issues,
and we haven't had time to work up a few solutions before bringing this
up formally.  It also isn't a major problem until the second
optimized/native XDP driver appears on the scene.

I can run a test where XDP redirects between an ixgbe and a bnxt_en based
device, and I get OOM kills after only a few seconds, due to the lack of
feedback between the different drivers about when the pointer to xdp->data
can be freed/reused/etc, and the different buffer allocation schemes used.

Initially I did not think this was an issue and that xdp_do_flush_map()
would handle this, but I think there is still a need to be able to
signal back to the receiving device that the allocated buffer has been
xmitted by the transmitter and can be freed.  There is really no
guarantee that completion of an XDP_REDIRECT action means it is safe to
free the area pointed to by xdp->data that contains the packet to be
xmitted, and since the packet-done interrupt handler in the transmitting
driver cannot signal back to the receiving driver that the buffer is now
safe to reuse/free, there is a chance for trouble.

I was hoping to spend some time this week cooking up a patch that just
did not allow use of XDP_REDIRECT when the ifindex of the outgoing
device did not match that of the device to which the XDP prog was
attached, but that is probably not worth the trouble when we could just
fix it for real.  (It would also require some really terrible hacks to
enforce this in the kernel when all that is being done is setting up a
map that contains the redirect table, so it is probably not useful.)

The basic prototype would be something like this:

(rx packet interrupt on eth0, leads to napi_poll)
napi_poll (eth0)
  call xdp_prog (eth0)
    xdp_do_redirect (eth0)
      ndo_xdp_xmit (eth1)
      mark buffer with information netdev/ring/etc
      place buffer on tx ring for eth1

(tx done interrupt on eth1, leads to napi_poll)
napi_poll (eth1)
  process tx interrupt (eth1)
    look up information about netdev/ring/etc
    ndo_xdp_data_free (eth0, ring, etc)
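
For illustration only, here is a rough C sketch of how such a return
hook might look.  Nothing like ndo_xdp_data_free exists today, and every
name below is invented; this is a non-compiling sketch of the idea, not
a proposal for the actual API.

/* RX driver (eth0): recycle or free the page it allocated; it can keep
 * the DMA mapping and only dma_sync before putting the page back on
 * the RX ring. */
static void eth0_xdp_data_free(struct net_device *rx_dev,
                               struct xdp_buff *xdp, u16 rx_queue)
{
        struct eth0_rx_ring *ring = eth0_get_rx_ring(rx_dev, rx_queue);

        eth0_recycle_rx_page(ring, virt_to_page(xdp->data));
}

/* TX driver (eth1): in its TX-done napi_poll path, hand the buffer back
 * to the device it came from instead of freeing it locally. */
static void eth1_clean_xdp_tx_buf(struct eth1_tx_buf *tx_buf)
{
        struct net_device *rx_dev = tx_buf->rx_dev;     /* marked at xmit */

        rx_dev->netdev_ops->ndo_xdp_data_free(rx_dev, &tx_buf->xdp,
                                              tx_buf->rx_queue);
}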

Thoughts?


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-28 16:02                       ` Andy Gospodarek
@ 2017-08-28 16:11                         ` Alexander Duyck
  2017-08-29 13:26                           ` Jesper Dangaard Brouer
  2017-08-28 16:14                         ` John Fastabend
  1 sibling, 1 reply; 26+ messages in thread
From: Alexander Duyck @ 2017-08-28 16:11 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Michael Chan, John Fastabend, Jesper Dangaard Brouer, Duyck,
	Alexander H, pstaszewski, netdev, xdp-newbies, borkmann

On Mon, Aug 28, 2017 at 9:02 AM, Andy Gospodarek <andy@greyhouse.net> wrote:
> On Fri, Aug 25, 2017 at 08:28:55AM -0700, Michael Chan wrote:
>> On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
>> <john.fastabend@gmail.com> wrote:
>> > On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
>> >> On Thu, 24 Aug 2017 20:36:28 -0700
>> >> Michael Chan <michael.chan@broadcom.com> wrote:
>> >>
>> >>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
>> >>> <brouer@redhat.com> wrote:
>> >>>> On Tue, 22 Aug 2017 23:59:05 -0700
>> >>>> Michael Chan <michael.chan@broadcom.com> wrote:
>> >>>>
>> >>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>> >>>>> <alexander.duyck@gmail.com> wrote:
>> >>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>> >>>>>>>
>> >>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
>> >>>>>>> the input device, right?
>> >>>>
>> >>>> Yes, I would really like to see an API like this.
>> >>>>
>> >>>>>>
>> >>>>>> You could, it is just added complexity. "just free the buffer" in
>> >>>>>> ixgbe usually just amounts to one atomic operation to decrement the
>> >>>>>> total page count since page recycling is already implemented in the
>> >>>>>> driver. You still would have to unmap the buffer regardless of if you
>> >>>>>> were recycling it or not so all you would save is 1.000015259 atomic
>> >>>>>> operations per packet. The fraction is because once every 64K uses we
>> >>>>>> have to bulk update the count on the page.
>> >>>>>>
>> >>>>>
>> >>>>> If the buffer is returned to the input device, the input device can
>> >>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>> >>>>> the input device when the buffer is returned.
>> >>>>
>> >>>> Yes, exactly, return to the input device. I really think we should
>> >>>> work on a solution where we can keep the DMA mapping around.  We have
>> >>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
>> >>>> page return call, to achieve this. (I imagine other arch's have a high
>> >>>> DMA overhead than Intel)
>> >>>>
>> >>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
>> >>>> splitting the page (into two packets) actually complicates things, and
>> >>>> tie us into a page-refcnt based model.  We could get around this by
>> >>>> each driver implementing a page-return-callback, that allow us to
>> >>>> return the page to the input device?  Then, drivers implementing the
>> >>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
>> >>>> "1" DMA-sync and reuse it in the RX queue.
>> >>>>
>> >>>
>> >>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
>> >>> redirecting to a non-intel NIC or vice versa will actually work.  It
>> >>> sounds like the output device has to make some assumptions about how
>> >>> the page was allocated by the input device.
>> >>
>> >> Yes, exactly. We are tied into a page refcnt based scheme.
>> >>
>> >> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
>> >> is also tied to the RX queue size, plus how fast the pages are returned.
>> >> This makes it very hard to tune.  As I demonstrated, default ixgbe
>> >> settings does not work well with XDP_REDIRECT.  I needed to increase
>> >> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
>> >> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
>> >> RX-ring size is smaller, thus two contradicting tuning needed.
>> >>
>> >
>> > The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
>> > split into two halves being the default) from the number of descriptors
>> > doesn't look too bad IMO. It seems like it could be done by having some
>> > extra pages allocated upfront and pulling those in when we need another
>> > page.
>> >
>> > This would be a nice iterative step we could take on the existing API.
>> >
>> >>
>> >>> With buffer return API,
>> >>> each driver can cleanly recycle or free its own buffers properly.
>> >>
>> >> Yes, exactly. And RX-driver can implement a special memory model for
>> >> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
>> >> which is never used for SKBs, thus opening for new RX memory models.
>> >>
>> >> Another advantage of a return API.  There is also an opportunity for
>> >> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
>> >> we can add a DMA API, where we can query if the two devices uses the
>> >> same DMA engine, and can reuse the same DMA address the RX-side already
>> >> knows.
>> >>
>> >>
>> >>> Let me discuss this further with Andy to see if we can come up with a
>> >>> good scheme.
>> >>
>> >> Sound good, looking forward to hear what you come-up with :-)
>> >>
>> >
>> > I guess by this thread we will see a broadcom nic with redirect support
>> > soon ;)
>>
>> Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
>> buffer recycling scheme has some problems.  We can make it work for
>> Broadcom to Broadcom only, but we want a better solution.
>
> (Sorry for the radio silence I was AFK last week...)
>
> I finished it a little while ago, but Michael and I both have concerns
> that in a heterogenous hardware setup one can quickly run into issues
> and haven't had time to work-up a few solutions before bringing this up
> formally.  It also isn't a major problem until the second
> optimized/native XDP driver appears on the scene.
>
> I can run a test where XDP redirects from an ixgbe <-> bnxt_en based
> device I get OOM kills after only a few seconds, due to the lack of
> feedback between the different drivers that the pointer to xdp->data can
> be freed/reused/etc and the different buffer allocation schemes used.
>
> Initially I did not think this was an issue and that xdp_do_flush_map()
> would handle this, but I think there is a still a need to be able to
> signal back to the receving device that the buffer allocated has been
> xmitted by the transmitter and can be freed.  Since there is really no
> guarantee that completion of an XDP_REDIRECT action means that it is
> safe to free area pointed to by xdp->data area that contains the packet
> to be xmitted.  Since the packet done interrupt handler in a driver
> cannot signal back the the receiving driver that the buffer is now safe
> to reuse/free there is a chance for trouble.
>
> I was hoping to spend some time this week cooking up a patch that just
> did not allow use of XDP_REDIRECT when the ifindex of the outgoing
> device did not match that of the device to which the XDP prog was
> attached, but that probably is not worth the trouble when we would just
> fix it for real.  (It would also require some really terrible hacks to
> enforce this in the kernel when all that is being done is setting up a
> map that contains the redirect table, so it is probably not useful.)
>
> The basic prototype would be something like this:
>
> (rx packet interrupt on eth0, leads to napi_poll)
> napi_poll (eth0)
>   call xdp_prog (eth0)
>     xdp_do_redirect (eth0)
>       ndo_xdp_xmit (eth1)
>       mark buffer with information netdev/ring/etc
>       place buffer on tx ring for eth1
>
> (tx done interrupt on eth1, leads to napi_poll)
> napi_poll (eth1)
>   process tx interrupt (eth1)
>     look up information about netdev/ring/etc
>     ndo_xdp_data_free (eth0, ring, etc)
>
> Thoughts?
>

My advice would be to not over complicate this. My big concern with
all this buffer recycling is what happens the first time somebody
introduces something like mirroring? Are you going to copy the data to
a new page which would be quite expensive or just have to introduce
reference counts? You are going to have to deal with stuff like
reference counts eventually so you might as well bite that bullet now.
My advice would be to not bother with optimizing for performance right
now and instead focus on just getting functionality. The approach we
took in ixgbe for the transmit path should work for almost any other
driver since all you are looking at is having to free the page
reference which takes care of reference counting already.
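
For reference, a minimal sketch of the TX-done cleanup style described
above; structure and helper names are illustrative only, not the actual
ixgbe code:

/* Drop the XDP TX buffer's page reference; the RX side's page-refcnt
 * recycling then decides whether the page gets reused or really freed. */
static void clean_xdp_tx_buffer(struct device *dma_dev,
                                struct my_xdp_tx_buffer *tx_buf)
{
        dma_unmap_page(dma_dev, tx_buf->dma, PAGE_SIZE, DMA_TO_DEVICE);
        page_frag_free(tx_buf->data);   /* one atomic page-ref decrement */
}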

- Alex


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-28 16:02                       ` Andy Gospodarek
  2017-08-28 16:11                         ` Alexander Duyck
@ 2017-08-28 16:14                         ` John Fastabend
  2017-08-28 19:39                           ` Andy Gospodarek
  1 sibling, 1 reply; 26+ messages in thread
From: John Fastabend @ 2017-08-28 16:14 UTC (permalink / raw)
  To: Andy Gospodarek, Michael Chan
  Cc: Jesper Dangaard Brouer, Alexander Duyck, Duyck, Alexander H,
	pstaszewski, netdev, xdp-newbies, borkmann

On 08/28/2017 09:02 AM, Andy Gospodarek wrote:
> On Fri, Aug 25, 2017 at 08:28:55AM -0700, Michael Chan wrote:
>> On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
>> <john.fastabend@gmail.com> wrote:
>>> On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
>>>> On Thu, 24 Aug 2017 20:36:28 -0700
>>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>>
>>>>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
>>>>> <brouer@redhat.com> wrote:
>>>>>> On Tue, 22 Aug 2017 23:59:05 -0700
>>>>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>>>>
>>>>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>>>>>>> <alexander.duyck@gmail.com> wrote:
>>>>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>>>>>>>>>
>>>>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
>>>>>>>>> the input device, right?
>>>>>>
>>>>>> Yes, I would really like to see an API like this.
>>>>>>
>>>>>>>>
>>>>>>>> You could, it is just added complexity. "just free the buffer" in
>>>>>>>> ixgbe usually just amounts to one atomic operation to decrement the
>>>>>>>> total page count since page recycling is already implemented in the
>>>>>>>> driver. You still would have to unmap the buffer regardless of if you
>>>>>>>> were recycling it or not so all you would save is 1.000015259 atomic
>>>>>>>> operations per packet. The fraction is because once every 64K uses we
>>>>>>>> have to bulk update the count on the page.
>>>>>>>>
>>>>>>>
>>>>>>> If the buffer is returned to the input device, the input device can
>>>>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>>>>>>> the input device when the buffer is returned.
>>>>>>
>>>>>> Yes, exactly, return to the input device. I really think we should
>>>>>> work on a solution where we can keep the DMA mapping around.  We have
>>>>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
>>>>>> page return call, to achieve this. (I imagine other arch's have a high
>>>>>> DMA overhead than Intel)
>>>>>>
>>>>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
>>>>>> splitting the page (into two packets) actually complicates things, and
>>>>>> tie us into a page-refcnt based model.  We could get around this by
>>>>>> each driver implementing a page-return-callback, that allow us to
>>>>>> return the page to the input device?  Then, drivers implementing the
>>>>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
>>>>>> "1" DMA-sync and reuse it in the RX queue.
>>>>>>
>>>>>
>>>>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
>>>>> redirecting to a non-intel NIC or vice versa will actually work.  It
>>>>> sounds like the output device has to make some assumptions about how
>>>>> the page was allocated by the input device.
>>>>
>>>> Yes, exactly. We are tied into a page refcnt based scheme.
>>>>
>>>> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
>>>> is also tied to the RX queue size, plus how fast the pages are returned.
>>>> This makes it very hard to tune.  As I demonstrated, default ixgbe
>>>> settings does not work well with XDP_REDIRECT.  I needed to increase
>>>> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
>>>> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
>>>> RX-ring size is smaller, thus two contradicting tuning needed.
>>>>
>>>
>>> The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
>>> split into two halves being the default) from the number of descriptors
>>> doesn't look too bad IMO. It seems like it could be done by having some
>>> extra pages allocated upfront and pulling those in when we need another
>>> page.
>>>
>>> This would be a nice iterative step we could take on the existing API.
>>>
>>>>
>>>>> With buffer return API,
>>>>> each driver can cleanly recycle or free its own buffers properly.
>>>>
>>>> Yes, exactly. And RX-driver can implement a special memory model for
>>>> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
>>>> which is never used for SKBs, thus opening for new RX memory models.
>>>>
>>>> Another advantage of a return API.  There is also an opportunity for
>>>> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
>>>> we can add a DMA API, where we can query if the two devices uses the
>>>> same DMA engine, and can reuse the same DMA address the RX-side already
>>>> knows.
>>>>
>>>>
>>>>> Let me discuss this further with Andy to see if we can come up with a
>>>>> good scheme.
>>>>
>>>> Sound good, looking forward to hear what you come-up with :-)
>>>>
>>>
>>> I guess by this thread we will see a broadcom nic with redirect support
>>> soon ;)
>>
>> Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
>> buffer recycling scheme has some problems.  We can make it work for
>> Broadcom to Broadcom only, but we want a better solution.
> 
> (Sorry for the radio silence I was AFK last week...)
> 
> I finished it a little while ago, but Michael and I both have concerns
> that in a heterogenous hardware setup one can quickly run into issues
> and haven't had time to work-up a few solutions before bringing this up
> formally.  It also isn't a major problem until the second
> optimized/native XDP driver appears on the scene.
> 
> I can run a test where XDP redirects from an ixgbe <-> bnxt_en based
> device I get OOM kills after only a few seconds, due to the lack of
> feedback between the different drivers that the pointer to xdp->data can
> be freed/reused/etc and the different buffer allocation schemes used.
> 

Hmm, so how do you get OOM here?  I expect the number of in-flight xdp
bufs to be limited by the number of frames that can be posted to the
outgoing interface.  If we are hitting OOM, that _should_ mean the size of
the tx queue is too large.  ixgbe should be freeing the buffer if an error
is returned from the xdp xmit routines (will check this today), and bnxt
should return an error if we hit some high water mark on xmit.
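
A rough sketch of that kind of back-pressure, with made-up driver and
helper names (only the general shape is intended here):

/* TX side: refuse the frame when the XDP TX ring has no room left. */
static int my_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
{
        struct my_tx_ring *txr = my_pick_xdp_ring(dev);

        if (!my_tx_desc_avail(txr))
                return -ENOSPC;         /* high water mark hit */

        my_queue_xdp_frame(txr, xdp);
        return 0;
}

/* RX side: if the xmit was refused, recycle the buffer right away so
 * the number of in-flight buffers stays bounded by the TX ring size. */
static void my_handle_redirect(struct net_device *tx_dev,
                               struct my_rx_ring *rx_ring,
                               struct page *page, struct xdp_buff *xdp)
{
        if (tx_dev->netdev_ops->ndo_xdp_xmit(tx_dev, xdp))
                my_recycle_rx_page(rx_ring, page);
}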

> Initially I did not think this was an issue and that xdp_do_flush_map()
> would handle this, but I think there is a still a need to be able to
> signal back to the receving device that the buffer allocated has been
> xmitted by the transmitter and can be freed.  Since there is really no
> guarantee that completion of an XDP_REDIRECT action means that it is
> safe to free area pointed to by xdp->data area that contains the packet
> to be xmitted.  Since the packet done interrupt handler in a driver
> cannot signal back the the receiving driver that the buffer is now safe
> to reuse/free there is a chance for trouble.  

There should be some high water mark on how many outstanding packets
can be in-flight.  At the moment I assumed this was something related to
queue lengths; a more explicit high water mark could be added to the
xmit path and tracked in the xdp infrastructure.

> 
> I was hoping to spend some time this week cooking up a patch that just
> did not allow use of XDP_REDIRECT when the ifindex of the outgoing
> device did not match that of the device to which the XDP prog was
> attached, but that probably is not worth the trouble when we would just
> fix it for real.  (It would also require some really terrible hacks to
> enforce this in the kernel when all that is being done is setting up a
> map that contains the redirect table, so it is probably not useful.)
> 

I would prefer to solve the problem rather than limit the implementation.

> The basic prototype would be something like this:
> 
> (rx packet interrupt on eth0, leads to napi_poll)
> napi_poll (eth0)
>   call xdp_prog (eth0)
>     xdp_do_redirect (eth0)
>       ndo_xdp_xmit (eth1)
>       mark buffer with information netdev/ring/etc
>       place buffer on tx ring for eth1
> 
> (tx done interrupt on eth1, leads to napi_poll)
> napi_poll (eth1)
>   process tx interrupt (eth1)
>     look up information about netdev/ring/etc
>     ndo_xdp_data_free (eth0, ring, etc)
> 
> Thoughts?
> 


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-28 16:14                         ` John Fastabend
@ 2017-08-28 19:39                           ` Andy Gospodarek
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Gospodarek @ 2017-08-28 19:39 UTC (permalink / raw)
  To: John Fastabend
  Cc: Michael Chan, Jesper Dangaard Brouer, Alexander Duyck, Duyck,
	Alexander H, pstaszewski, netdev, xdp-newbies, borkmann

On Mon, Aug 28, 2017 at 09:14:20AM -0700, John Fastabend wrote:
> On 08/28/2017 09:02 AM, Andy Gospodarek wrote:
> > On Fri, Aug 25, 2017 at 08:28:55AM -0700, Michael Chan wrote:
> >> On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
> >> <john.fastabend@gmail.com> wrote:
> >>> On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
> >>>> On Thu, 24 Aug 2017 20:36:28 -0700
> >>>> Michael Chan <michael.chan@broadcom.com> wrote:
> >>>>
> >>>>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
> >>>>> <brouer@redhat.com> wrote:
> >>>>>> On Tue, 22 Aug 2017 23:59:05 -0700
> >>>>>> Michael Chan <michael.chan@broadcom.com> wrote:
> >>>>>>
> >>>>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
> >>>>>>> <alexander.duyck@gmail.com> wrote:
> >>>>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
> >>>>>>>>> the input device, right?
> >>>>>>
> >>>>>> Yes, I would really like to see an API like this.
> >>>>>>
> >>>>>>>>
> >>>>>>>> You could, it is just added complexity. "just free the buffer" in
> >>>>>>>> ixgbe usually just amounts to one atomic operation to decrement the
> >>>>>>>> total page count since page recycling is already implemented in the
> >>>>>>>> driver. You still would have to unmap the buffer regardless of if you
> >>>>>>>> were recycling it or not so all you would save is 1.000015259 atomic
> >>>>>>>> operations per packet. The fraction is because once every 64K uses we
> >>>>>>>> have to bulk update the count on the page.
> >>>>>>>>
> >>>>>>>
> >>>>>>> If the buffer is returned to the input device, the input device can
> >>>>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
> >>>>>>> the input device when the buffer is returned.
> >>>>>>
> >>>>>> Yes, exactly, return to the input device. I really think we should
> >>>>>> work on a solution where we can keep the DMA mapping around.  We have
> >>>>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
> >>>>>> page return call, to achieve this. (I imagine other arch's have a high
> >>>>>> DMA overhead than Intel)
> >>>>>>
> >>>>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
> >>>>>> splitting the page (into two packets) actually complicates things, and
> >>>>>> tie us into a page-refcnt based model.  We could get around this by
> >>>>>> each driver implementing a page-return-callback, that allow us to
> >>>>>> return the page to the input device?  Then, drivers implementing the
> >>>>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
> >>>>>> "1" DMA-sync and reuse it in the RX queue.
> >>>>>>
> >>>>>
> >>>>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
> >>>>> redirecting to a non-intel NIC or vice versa will actually work.  It
> >>>>> sounds like the output device has to make some assumptions about how
> >>>>> the page was allocated by the input device.
> >>>>
> >>>> Yes, exactly. We are tied into a page refcnt based scheme.
> >>>>
> >>>> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
> >>>> is also tied to the RX queue size, plus how fast the pages are returned.
> >>>> This makes it very hard to tune.  As I demonstrated, default ixgbe
> >>>> settings does not work well with XDP_REDIRECT.  I needed to increase
> >>>> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
> >>>> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
> >>>> RX-ring size is smaller, thus two contradicting tuning needed.
> >>>>
> >>>
> >>> The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
> >>> split into two halves being the default) from the number of descriptors
> >>> doesn't look too bad IMO. It seems like it could be done by having some
> >>> extra pages allocated upfront and pulling those in when we need another
> >>> page.
> >>>
> >>> This would be a nice iterative step we could take on the existing API.
> >>>
> >>>>
> >>>>> With buffer return API,
> >>>>> each driver can cleanly recycle or free its own buffers properly.
> >>>>
> >>>> Yes, exactly. And RX-driver can implement a special memory model for
> >>>> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
> >>>> which is never used for SKBs, thus opening for new RX memory models.
> >>>>
> >>>> Another advantage of a return API.  There is also an opportunity for
> >>>> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
> >>>> we can add a DMA API, where we can query if the two devices uses the
> >>>> same DMA engine, and can reuse the same DMA address the RX-side already
> >>>> knows.
> >>>>
> >>>>
> >>>>> Let me discuss this further with Andy to see if we can come up with a
> >>>>> good scheme.
> >>>>
> >>>> Sound good, looking forward to hear what you come-up with :-)
> >>>>
> >>>
> >>> I guess by this thread we will see a broadcom nic with redirect support
> >>> soon ;)
> >>
> >> Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
> >> buffer recycling scheme has some problems.  We can make it work for
> >> Broadcom to Broadcom only, but we want a better solution.
> > 
> > (Sorry for the radio silence I was AFK last week...)
> > 
> > I finished it a little while ago, but Michael and I both have concerns
> > that in a heterogenous hardware setup one can quickly run into issues
> > and haven't had time to work-up a few solutions before bringing this up
> > formally.  It also isn't a major problem until the second
> > optimized/native XDP driver appears on the scene.
> > 
> > I can run a test where XDP redirects from an ixgbe <-> bnxt_en based
> > device I get OOM kills after only a few seconds, due to the lack of
> > feedback between the different drivers that the pointer to xdp->data can
> > be freed/reused/etc and the different buffer allocation schemes used.
> > 
> 
> hmm so how do you get OOM here, I expect the number of in-flight xdp
> bufs should be limited by the number of xdps that can be posted to the
> outgoing interface. If we are hitting OOM that _should_ mean the size of
> the tx queue is too large. Ixgbe should be free'ing the buffer if an error
> is returned from xdp xmit routines (will check this today). And bnxt should
> return an error if we hit some high water mark on xmit.

I reconfigured the hardware after I was done with the bnxt_en devel, but I
should be able to set it up and provide some more detail.  Let me repro it and
debug a bit more.

> 
> > Initially I did not think this was an issue and that xdp_do_flush_map()
> > would handle this, but I think there is a still a need to be able to
> > signal back to the receving device that the buffer allocated has been
> > xmitted by the transmitter and can be freed.  Since there is really no
> > guarantee that completion of an XDP_REDIRECT action means that it is
> > safe to free area pointed to by xdp->data area that contains the packet
> > to be xmitted.  Since the packet done interrupt handler in a driver
> > cannot signal back the the receiving driver that the buffer is now safe
> > to reuse/free there is a chance for trouble.  
> 
> There should be some high water mark on how many outstanding packets
> can be in-flight. At the moment I assumed this was something related to
> queue lengths a more explicit high water mark could added to the xmit path
> and tracked in xdp infrastructure.
> 
> > 
> > I was hoping to spend some time this week cooking up a patch that just
> > did not allow use of XDP_REDIRECT when the ifindex of the outgoing
> > device did not match that of the device to which the XDP prog was
> > attached, but that probably is not worth the trouble when we would just
> > fix it for real.  (It would also require some really terrible hacks to
> > enforce this in the kernel when all that is being done is setting up a
> > map that contains the redirect table, so it is probably not useful.)
> > 
> 
> I would prefer to solve the problem vs limiting the implementation
> 

Agreed.

> > The basic prototype would be something like this:
> > 
> > (rx packet interrupt on eth0, leads to napi_poll)
> > napi_poll (eth0)
> >   call xdp_prog (eth0)
> >     xdp_do_redirect (eth0)
> >       ndo_xdp_xmit (eth1)
> >       mark buffer with information netdev/ring/etc
> >       place buffer on tx ring for eth1
> > 
> > (tx done interrupt on eth1, leads to napi_poll)
> > napi_poll (eth1)
> >   process tx interrupt (eth1)
> >     look up information about netdev/ring/etc
> >     ndo_xdp_data_free (eth0, ring, etc)
> > 
> > Thoughts?
> > 
> 


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-28 16:11                         ` Alexander Duyck
@ 2017-08-29 13:26                           ` Jesper Dangaard Brouer
  2017-08-29 16:23                             ` Alexander Duyck
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-29 13:26 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andy Gospodarek, Michael Chan, John Fastabend, Duyck,
	Alexander H, pstaszewski, netdev, xdp-newbies, borkmann, brouer


On Mon, 28 Aug 2017 09:11:25 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:

> My advice would be to not over complicate this. My big concern with
> all this buffer recycling is what happens the first time somebody
> introduces something like mirroring? Are you going to copy the data to
> a new page which would be quite expensive or just have to introduce
> reference counts? You are going to have to deal with stuff like
> reference counts eventually so you might as well bite that bullet now.
> My advice would be to not bother with optimizing for performance right
> now and instead focus on just getting functionality. The approach we
> took in ixgbe for the transmit path should work for almost any other
> driver since all you are looking at is having to free the page
> reference which takes care of reference counting already.

This return API is not about optimizing performance right now.  It is
actually about allowing us to change the underlying memory model per RX
queue for XDP.

If an RX-ring is used for both SKBs and XDP, then the refcnt model is
still enforced, although a driver using the 1-packet-per-page model
should be able to reuse refcnt==1 pages when they are returned from XDP.

If an RX-ring is _ONLY_ used for XDP, then the driver has the freedom to
implement another memory model behind the return-API.  We need to
experiment to find the most optimal memory model.  The 1-packet-per-page
model is actually not the fastest, because of PCI-e bottlenecks.  With
HW support for packing descriptors and packets over the PCI-e bus, much
higher rates can be achieved.  Mellanox mlx5-Lx already has the needed HW
support, and companies like NetCope also have 100G HW that does similar
tricks; they even have a whitepaper[1][2] on how they are faster than
DPDK with their NDP (Netcope Data Plane) API.

We do need the ability/flexibility to change the RX memory model, to
take advantage of this new NIC hardware.

[1] https://www.netcope.com/en/resources/improving-dpdk-performance
[2] https://www.netcope.com/en/company/press-center/press-releases/read-new-netcope-whitepaper-on-dpdk-acceleration
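
Purely to illustrate the flexibility being argued for, a hypothetical
per-RX-queue memory-model interface could look something like the sketch
below.  Nothing like this exists in the kernel today; all names are
invented.

/* One set of buffer ops per RX queue; an XDP-only queue may plug in a
 * completely different model (refcnt pages, a dedicated pool, packed HW
 * buffers) without the rest of the stack knowing. */
struct xdp_rxq_mem_ops {
        void *(*alloc)(void *pool);              /* refill the RX ring   */
        void  (*ret)(void *pool, void *data);    /* return from TX-done  */
};

struct xdp_rxq_mem_model {
        const struct xdp_rxq_mem_ops *ops;
        void *pool;                              /* driver-private state */
};

/* The redirecting device's TX-done path returns buffers through the
 * model registered by the RX queue that produced them. */
static inline void xdp_return_data(struct xdp_rxq_mem_model *mm, void *data)
{
        mm->ops->ret(mm->pool, data);
}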

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-29 13:26                           ` Jesper Dangaard Brouer
@ 2017-08-29 16:23                             ` Alexander Duyck
  2017-08-29 19:02                               ` Andy Gospodarek
  0 siblings, 1 reply; 26+ messages in thread
From: Alexander Duyck @ 2017-08-29 16:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andy Gospodarek, Michael Chan, John Fastabend, Duyck,
	Alexander H, pstaszewski, netdev, xdp-newbies, borkmann

On Tue, Aug 29, 2017 at 6:26 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Mon, 28 Aug 2017 09:11:25 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> My advice would be to not over complicate this. My big concern with
>> all this buffer recycling is what happens the first time somebody
>> introduces something like mirroring? Are you going to copy the data to
>> a new page which would be quite expensive or just have to introduce
>> reference counts? You are going to have to deal with stuff like
>> reference counts eventually so you might as well bite that bullet now.
>> My advice would be to not bother with optimizing for performance right
>> now and instead focus on just getting functionality. The approach we
>> took in ixgbe for the transmit path should work for almost any other
>> driver since all you are looking at is having to free the page
>> reference which takes care of reference counting already.
>
> This return API is not about optimizing performance right now.  It is
> actually about allowing us to change the underlying memory model per RX
> queue for XDP.

I would disagree. To me this is an obvious case of premature optimization.

> If a RX-ring is use for both SKBs and XDP, then the refcnt model is
> still enforced.  Although a driver using the 1-packet-per-page model,
> should be able to reuse refcnt==1 pages when returned from XDP.

Isn't this the case for all Rx on XDP-enabled rings?  Last I knew there
was an option to pass packets up via an SKB if XDP_PASS is returned.
Are you saying we need to do a special allocation path if an XDP
program doesn't make use of XDP_PASS?

> If a RX-ring is _ONLY_ used for XDP, then the driver have freedom to
> implement another memory model, with the return-API.  We need to
> experiment with the most optimal memory model.  The 1-packet-per-page
> model is actually not the fastest, because of PCI-e bottlenecks.  With
> HW support for packing descriptors and packets over the PCI-e bus, much
> higher rates can be achieved.  Mellanox mlx5-Lx already have the needed HW
> support.  And companies like NetCope also have 100G HW that does
> similar tricks, and they even have a whitepaper[1][2] how they are
> faster than DPDK with their NDP (Netcope Data Plane) API.
>
> We do need the ability/flexibility to change the RX memory model, to
> take advantage of this new NIC hardware.

Looking over the white paper I see nothing that prevents us from using
the same memory model we do with the Intel NICs. If anything I think
the Intel drivers in "legacy-rx" mode could support something like
this now, even if the hardware doesn't, simply because we can get away
with keeping the memory pseudo-pinned. My bigger concern is that we
keep coming back to this idea that we need to have the network stack
taking care of the 1-page-per-packet recycling, when I really think it
has no business being there. We either need to implement this the way
we did in the Intel drivers, where we use the reference counts, or
implement our own memory handling API like SLUB or something similar
based on compound page destructors. I would much rather see us focus on
getting this going with an agnostic memory model where we don't have to
make the stack aware of where the memory came from or where it has to
be returned to.

> [1] https://www.netcope.com/en/resources/improving-dpdk-performance
> [2] https://www.netcope.com/en/company/press-center/press-releases/read-new-netcope-whitepaper-on-dpdk-acceleration

My only concern with something like this is the fact that it is
optimized for a setup where the data is left in place and nothing
extra is added. Trying to work with something like this gets more
expensive when you have to deal with the full stack as you have to
copy out the headers and still deal with all the skb metadata. I fully
agree with the basic premise that writing in large blocks provides
significant gains in throughput, specifically with small packets. The
only gotcha you would have to deal with is SKB allocation and data
copying overhead to make room and fill in metadata for the frame and
any extra headers needed.

- Alex


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-29 16:23                             ` Alexander Duyck
@ 2017-08-29 19:02                               ` Andy Gospodarek
  2017-08-29 19:52                                 ` Alexander Duyck
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Gospodarek @ 2017-08-29 19:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, Michael Chan, John Fastabend, Duyck,
	Alexander H, pstaszewski, netdev, xdp-newbies, borkmann

On Tue, Aug 29, 2017 at 09:23:49AM -0700, Alexander Duyck wrote:
> On Tue, Aug 29, 2017 at 6:26 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Mon, 28 Aug 2017 09:11:25 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >
> >> My advice would be to not over complicate this. My big concern with
> >> all this buffer recycling is what happens the first time somebody
> >> introduces something like mirroring? Are you going to copy the data to
> >> a new page which would be quite expensive or just have to introduce
> >> reference counts? You are going to have to deal with stuff like
> >> reference counts eventually so you might as well bite that bullet now.
> >> My advice would be to not bother with optimizing for performance right
> >> now and instead focus on just getting functionality. The approach we
> >> took in ixgbe for the transmit path should work for almost any other
> >> driver since all you are looking at is having to free the page
> >> reference which takes care of reference counting already.
> >
> > This return API is not about optimizing performance right now.  It is
> > actually about allowing us to change the underlying memory model per RX
> > queue for XDP.
> 
>  I would disagree. To me this is a obvious case of premature optimization.
> 

I'm with Jesper on this.  Though it may seem to you that this is an
optimization, that is not the goal.

> > If a RX-ring is use for both SKBs and XDP, then the refcnt model is
> > still enforced.  Although a driver using the 1-packet-per-page model,
> > should be able to reuse refcnt==1 pages when returned from XDP.
> 
> Isn't this the case for all Rx on XDP enabled rings. Last I knew there
> was an option to pass packets up via an SKB if XDP_PASS is returned.
> Are you saying we need to do a special allocation path if an XDP
> program doesn't make use of XDP_PASS?

I am not proposing that a special allocation path is needed depending on the
return code from the XDP program.  I'm proposing that in a case where
the return code is XDP_REDIRECT (or really anytime the ndo_xdp_xmit
operation is called), there should be either:

(1) notification back to the driver/resource/etc that allocated the page
that resources are no longer in use.

or 

(2) a common alloc/free framework used by drivers that operate on
xdp->data, so that the framework takes care of refcounting, etc.

My preference is (1) since it provides drivers the most flexibility in
the event that some hardware resource (rx ring buffer pointer) or
software resource (page or other chunk of memory) can be freed.
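
As a contrast, the smallest possible form of option (2) is essentially a
shared, refcount-based alloc/free helper.  A sketch, with invented names:

/* Option (2) sketch: a tiny common alloc/free layer for xdp->data, so
 * refcounting lives in one place instead of in every driver. */
static inline void *xdp_common_data_alloc(gfp_t gfp)
{
        struct page *page = alloc_page(gfp);

        return page ? page_address(page) : NULL;
}

static inline void xdp_common_data_free(void *data)
{
        /* Called by both the RX owner and the redirect TX-done path;
         * the page refcount decides when the page really goes away. */
        put_page(virt_to_page(data));
}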

> > If a RX-ring is _ONLY_ used for XDP, then the driver have freedom to
> > implement another memory model, with the return-API.  We need to
> > experiment with the most optimal memory model.  The 1-packet-per-page
> > model is actually not the fastest, because of PCI-e bottlenecks.  With
> > HW support for packing descriptors and packets over the PCI-e bus, much
> > higher rates can be achieved.  Mellanox mlx5-Lx already have the needed HW
> > support.  And companies like NetCope also have 100G HW that does
> > similar tricks, and they even have a whitepaper[1][2] how they are
> > faster than DPDK with their NDP (Netcope Data Plane) API.
> >
> > We do need the ability/flexibility to change the RX memory model, to
> > take advantage of this new NIC hardware.
> 
> Looking over the white paper I see nothing that prevents us from using
> the same memory model we do with the Intel NICs. If anything I think
> the Intel drivers in "legacy-rx" mode could support something like
> this now, even if the hardware doesn't simply because we can get away
> with keeping the memory pseudo-pinned. My bigger concern is that we
> keep coming back to this idea that we need to have the network stack
> taking care of the 1 page per packet recycling when I really think it
> has no business being there. We either need to look at implementing
> this in the way we did in the Intel drivers where we use the reference
> counts or implement our own memory handling API like SLUB or something
> similar based on compound page destructors. I would much rather see us
> focus on getting this going with an agnostic memory model where we
> don't have to make the stack aware of where the memory came from or
> where it has to be returned to.
> 
> > [1] https://www.netcope.com/en/resources/improving-dpdk-performance
> > [2] https://www.netcope.com/en/company/press-center/press-releases/read-new-netcope-whitepaper-on-dpdk-acceleration
> 
> My only concern with something like this is the fact that it is
> optimized for a setup where the data is left in place and nothing
> extra is added. Trying to work with something like this gets more
> expensive when you have to deal with the full stack as you have to
> copy out the headers and still deal with all the skb metadata. I fully
> agree with the basic premise that writing in large blocks provides
> significant gains in throughput, specifically with small packets. The
> only gotcha you would have to deal with is SKB allocation and data
> copying overhead to make room and fill in metadata for the frame and
> any extra headers needed.
> 
> - Alex


* Re: XDP redirect measurements, gotchas and tracepoints
  2017-08-29 19:02                               ` Andy Gospodarek
@ 2017-08-29 19:52                                 ` Alexander Duyck
  0 siblings, 0 replies; 26+ messages in thread
From: Alexander Duyck @ 2017-08-29 19:52 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Jesper Dangaard Brouer, Michael Chan, John Fastabend, Duyck,
	Alexander H, pstaszewski, netdev, xdp-newbies, borkmann

On Tue, Aug 29, 2017 at 12:02 PM, Andy Gospodarek <andy@greyhouse.net> wrote:
> On Tue, Aug 29, 2017 at 09:23:49AM -0700, Alexander Duyck wrote:
>> On Tue, Aug 29, 2017 at 6:26 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> >
>> > On Mon, 28 Aug 2017 09:11:25 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>> >
>> >> My advice would be to not over complicate this. My big concern with
>> >> all this buffer recycling is what happens the first time somebody
>> >> introduces something like mirroring? Are you going to copy the data to
>> >> a new page which would be quite expensive or just have to introduce
>> >> reference counts? You are going to have to deal with stuff like
>> >> reference counts eventually so you might as well bite that bullet now.
>> >> My advice would be to not bother with optimizing for performance right
>> >> now and instead focus on just getting functionality. The approach we
>> >> took in ixgbe for the transmit path should work for almost any other
>> >> driver since all you are looking at is having to free the page
>> >> reference which takes care of reference counting already.
>> >
>> > This return API is not about optimizing performance right now.  It is
>> > actually about allowing us to change the underlying memory model per RX
>> > queue for XDP.
>>
>>  I would disagree. To me this is a obvious case of premature optimization.
>>
>
> I'm with Jesper on this.  Though it may seem to you that this is an
> optimization that is not a goal.
>
>> > If a RX-ring is use for both SKBs and XDP, then the refcnt model is
>> > still enforced.  Although a driver using the 1-packet-per-page model,
>> > should be able to reuse refcnt==1 pages when returned from XDP.
>>
>> Isn't this the case for all Rx on XDP enabled rings. Last I knew there
>> was an option to pass packets up via an SKB if XDP_PASS is returned.
>> Are you saying we need to do a special allocation path if an XDP
>> program doesn't make use of XDP_PASS?
>
> I am not proposing that a special allocation path is needed depending on the
> return code from the XDP program.  I'm proposing that in a case where
> the return code is XDP_REDIRECT (or really anytime the ndo_xdp_xmit
> operation is called), that there should be:
>
> (1) notification back to the driver/resource/etc that allocated the page
> that resources are no longer in use.
>
> or
>
> (2) common alloc/free framework used by drivers that operate on
> xdp->data so that framework takes care of refcounting, etc.
>
> My preference is (1) since it provides drivers the most flexibility in
> the event that some hardware resource (rx ring buffer pointer) or
> software resource (page or other chunk of memory) can be freed.

So my preference would be (2) rather than (1), the simple reason being
that I would prefer it if we didn't have every driver doing its own
memory management API. With that said, I realize we need to have a
solid proof-of-concept before we can get there, so I am okay with each
driver doing its own thing until we have a clear winner of some sort in
all these discussions.

The biggest issue I see with (1) is that it assumes the Tx side should
somehow be responsible for the recycling, and the whole thing ends up
being way too racy. The problem is that you will have two drivers
sharing a region of memory, and each will have to do some sort of
additional reference counting if not using the existing page count. The
result is that either the Rx device needs to free resources, which will
have to somehow be tracked from the Rx side, or the Tx side has to
handle the freeing and dump the packet back into the Rx pool if it is
the Tx that needs to free the resources. The standard case gets messy,
but the exception cases, such as removing or resetting a driver, can
get messy quickly.


Thread overview: 26+ messages
2017-08-21 19:25 XDP redirect measurements, gotchas and tracepoints Jesper Dangaard Brouer
2017-08-21 22:35 ` Alexei Starovoitov
2017-08-22  6:37   ` Jesper Dangaard Brouer
2017-08-22 17:09     ` Alexei Starovoitov
2017-08-22 17:17       ` John Fastabend
2017-08-23  8:56         ` Jesper Dangaard Brouer
2017-08-22 18:02 ` Michael Chan
2017-08-22 18:17   ` John Fastabend
2017-08-22 18:30     ` Duyck, Alexander H
2017-08-22 20:04       ` Michael Chan
2017-08-23  1:06         ` Alexander Duyck
2017-08-23  6:59           ` Michael Chan
2017-08-23  8:29             ` Jesper Dangaard Brouer
2017-08-25  3:36               ` Michael Chan
2017-08-25 12:45                 ` Jesper Dangaard Brouer
2017-08-25 15:10                   ` John Fastabend
2017-08-25 15:28                     ` Michael Chan
2017-08-28 16:02                       ` Andy Gospodarek
2017-08-28 16:11                         ` Alexander Duyck
2017-08-29 13:26                           ` Jesper Dangaard Brouer
2017-08-29 16:23                             ` Alexander Duyck
2017-08-29 19:02                               ` Andy Gospodarek
2017-08-29 19:52                                 ` Alexander Duyck
2017-08-28 16:14                         ` John Fastabend
2017-08-28 19:39                           ` Andy Gospodarek
2017-08-23 14:51             ` Alexander Duyck
