* Re: Questions on XDP
@ 2017-02-18 23:31 Alexei Starovoitov
  2017-02-18 23:48 ` John Fastabend
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2017-02-18 23:31 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Jesper Dangaard Brouer, John Fastabend, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
>> XDP_DROP does not require having one page per frame.
>
> Agreed.

Why do you think so?
xdp_drop is targeting ddos, where in the good case all traffic is
passed up and in the bad case most of the traffic is dropped, but the
good traffic still needs to be serviced by the layers after it, like
other xdp programs and the stack.
Say ixgbe+xdp goes with 2k per packet: very soon we will have a bunch
of half pages sitting in the stack and the other halves requiring
complex refcounting, making the actual ddos mitigation ineffective and
forcing the nic to drop packets because it runs out of buffers. Why
complicate things?
The packet-per-page approach is simple and effective.
virtio is different; there we don't have hw that needs
to have buffers ready for dma.

> Looking at the Mellanox way of doing it I am not entirely sure it is
> useful.  It looks good for benchmarks but that is about it.  Also I

It's the opposite. It already runs very nicely in production.
In real life it's always a combination of xdp_drop, xdp_tx and
xdp_pass actions.
Sounds like ixgbe wants to do things differently because
of not-invented-here. That new approach may turn
out to be good or bad, but why risk it?
The mlx4 approach works.
mlx5 has a few issues though, because page recycling
was done too simplistically. A generic page pool/recycler
that all drivers will use should solve that, I hope.
Is the proposal to have a generic split-page recycler?
How is that going to work?

> don't see it extending out to the point that we would be able to
> exchange packets between interfaces which really seems like it should
> be the ultimate goal for XDP_TX.

we don't have a use case for multi-port xdp_tx,
but I'm not objecting to doing it in general.
Just right now I don't see a need to complicate
drivers to do so.

> It seems like eventually we want to be able to peel off the buffer and
> send it to something other than ourselves.  For example it seems like
> it might be useful at some point to use XDP to do traffic
> classification and have it route packets between multiple interfaces
> on a host and it wouldn't make sense to have all of them map every
> page as bidirectional because it starts becoming ridiculous if you
> have dozens of interfaces in a system.

A dozen interfaces? Like a single nic with a dozen ports?
Or many nics with many ports on the same system?
Are you trying to build a switch out of x86?
I don't think it's realistic to have a multi-terabit x86 box.
Is it all because of the dpdk/6wind demos?
I saw how dpdk was bragging that it can saturate
the pcie bus. So? Why is this useful?
Why would anyone care to put a bunch of nics
into x86 and demonstrate that the bandwidth of pcie is now
the limiting factor?

> Also as far as the one page per frame it occurs to me that you will
> have to eventually deal with things like frame replication.

... only in cases where one needs to demo a multi-port
bridge with lots of nics in one x86 box.
I don't see the practicality of such a setup, and I think
that copying a full page every time xdp needs to
broadcast is preferable to doing atomic refcounting
that would slow down the main case. Broadcast is the slow path.

My strong belief is that xdp should not care about
niche architectures. It was never meant to be a solution
for everyone and for all use cases.
If xdp sucks on powerpc, so be it.
CPUs with 64k pages are doomed. We should
not sacrifice performance on x86 because of ppc.
I think it was a mistake that ixgbe chose to do that
in the past, when mb()s were added because
of powerpc and it took years to introduce dma_rmb()/dma_wmb()
and return performance to good levels.
btw, the dma_rmb()/dma_wmb() work was awesome.
In xdp I don't want to make such trade-offs.
Really only the x86 and arm64 archs matter today.
Everything else is best effort.


* Re: Questions on XDP
  2017-02-18 23:31 Questions on XDP Alexei Starovoitov
@ 2017-02-18 23:48 ` John Fastabend
  2017-02-18 23:59   ` Eric Dumazet
  2017-02-19  2:16   ` Alexander Duyck
  0 siblings, 2 replies; 21+ messages in thread
From: John Fastabend @ 2017-02-18 23:48 UTC (permalink / raw)
  To: Alexei Starovoitov, Alexander Duyck
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Netdev, Tom Herbert,
	Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On 17-02-18 03:31 PM, Alexei Starovoitov wrote:
> On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>>
>>> XDP_DROP does not require having one page per frame.
>>
>> Agreed.
> 
> why do you think so?
> xdp_drop is targeting ddos where in good case
> all traffic is passed up and in bad case
> most of the traffic is dropped, but good traffic still needs
> to be serviced by the layers after. Like other xdp
> programs and the stack.
> Say ixgbe+xdp goes with 2k per packet,
> very soon we will have a bunch of half pages
> sitting in the stack and other halfs requiring
> complex refcnting and making the actual
> ddos mitigation ineffective and forcing nic to drop packets

I'm not seeing the distinction here. If it's a 4k page sitting
in the stack, the driver will get overrun as well.

> because it runs out of buffers. Why complicate things?

It doesn't seem complex to me, and the driver already handles this
case, so it actually makes the drivers simpler because there is only
a single buffer management path.

> packet per page approach is simple and effective.
> virtio is different. there we don't have hw that needs
> to have buffers ready for dma.
> 
>> Looking at the Mellanox way of doing it I am not entirely sure it is
>> useful.  It looks good for benchmarks but that is about it.  Also I
> 
> it's the opposite. It already runs very nicely in production.
> In real life it's always a combination of xdp_drop, xdp_tx and
> xdp_pass actions.
> Sounds like ixgbe wants to do things differently because
> of not-invented-here. That new approach may turn
> out to be good or bad, but why risk it?
> mlx4 approach works.
> mlx5 has few issues though, because page recycling
> was done too simplistic. Generic page pool/recycling
> that all drivers will use should solve that. I hope.
> Is the proposal to have generic split-page recycler ?
> How that is going to work?
> 

No, just give the driver a page when it asks for it. How the
driver uses the page is not the pool's concern.

>> don't see it extending out to the point that we would be able to
>> exchange packets between interfaces which really seems like it should
>> be the ultimate goal for XDP_TX.
> 
> we don't have a use case for multi-port xdp_tx,
> but I'm not objecting to doing it in general.
> Just right now I don't see a need to complicate
> drivers to do so.

We are running our vswitch in userspace now for many workloads;
it would be nice to have these in the kernel if possible.

> 
>> It seems like eventually we want to be able to peel off the buffer and
>> send it to something other than ourselves.  For example it seems like
>> it might be useful at some point to use XDP to do traffic
>> classification and have it route packets between multiple interfaces
>> on a host and it wouldn't make sense to have all of them map every
>> page as bidirectional because it starts becoming ridiculous if you
>> have dozens of interfaces in a system.
> 
> dozen interfaces? Like a single nic with dozen ports?
> or many nics with many ports on the same system?
> are you trying to build a switch out of x86?
> I don't think it's realistic to have multi-terrabit x86 box.
> Is it all because of dpdk/6wind demos?
> I saw how dpdk was bragging that they can saturate
> pcie bus. So? Why is this useful?
> Why anyone would care to put a bunch of nics
> into x86 and demonstrate that bandwidth of pcie is now
> a limiting factor ?

Maybe Alex had something else in mind, but we have many virtual
interfaces plus physical interfaces in the vswitch use case. Possibly
thousands.

> 
>> Also as far as the one page per frame it occurs to me that you will
>> have to eventually deal with things like frame replication.
> 
> ... only in cases where one needs to demo a multi-port
> bridge with lots of nics in one x86 box.
> I don't see practicality of such setup and I think
> that copying full page every time xdp needs to
> broadcast is preferred vs doing atomic refcnting
> that will slow down the main case. broadcast is slow path.
> 
> My strong believe that xdp should not care about
> niche architectures. It never meant to be a solution
> for everyone and for all use cases.
> If xdp sucks on powerpc, so be it.
> cpus with 64k pages are doomed. We should
> not sacrifice performance on x86 because of ppc.
> I think it was a mistake that ixgbe choose to do that
> in the past. When mb()s were added because
> of powerpc and it took years to introduce dma_mb()
> and return performance to good levels.
> btw, dma_mb work was awesome.
> In xdp I don't want to make such trade-offs.
> Really only x86 and arm64 archs matter today.
> Everything else is best effort.
> 


* Re: Questions on XDP
  2017-02-18 23:48 ` John Fastabend
@ 2017-02-18 23:59   ` Eric Dumazet
  2017-02-19  2:16   ` Alexander Duyck
  1 sibling, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2017-02-18 23:59 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Alexander Duyck, Jesper Dangaard Brouer,
	Netdev, Tom Herbert, Alexei Starovoitov, John Fastabend,
	Daniel Borkmann, David Miller

On Sat, 2017-02-18 at 15:48 -0800, John Fastabend wrote:

> I'm not seeing the distinction here. If its a 4k page and
> in the stack the driver will get overrun as well.

Agree.

Using a full page per Ethernet frame does not change the attack vector.

It makes the attacker's job easier.


* Re: Questions on XDP
  2017-02-18 23:48 ` John Fastabend
  2017-02-18 23:59   ` Eric Dumazet
@ 2017-02-19  2:16   ` Alexander Duyck
  2017-02-19  3:48     ` John Fastabend
  2017-02-21  3:18     ` Alexei Starovoitov
  1 sibling, 2 replies; 21+ messages in thread
From: Alexander Duyck @ 2017-02-19  2:16 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Eric Dumazet, Jesper Dangaard Brouer, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Sat, Feb 18, 2017 at 3:48 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 17-02-18 03:31 PM, Alexei Starovoitov wrote:
>> On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>>
>>>> XDP_DROP does not require having one page per frame.
>>>
>>> Agreed.
>>
>> why do you think so?
>> xdp_drop is targeting ddos where in good case
>> all traffic is passed up and in bad case
>> most of the traffic is dropped, but good traffic still needs
>> to be serviced by the layers after. Like other xdp
>> programs and the stack.
>> Say ixgbe+xdp goes with 2k per packet,
>> very soon we will have a bunch of half pages
>> sitting in the stack and other halfs requiring
>> complex refcnting and making the actual
>> ddos mitigation ineffective and forcing nic to drop packets
>
> I'm not seeing the distinction here. If its a 4k page and
> in the stack the driver will get overrun as well.
>
>> because it runs out of buffers. Why complicate things?
>
> It doesn't seem complex to me and the driver already handles this
> case so it actually makes the drivers simpler because there is only
> a single buffer management path.
>
>> packet per page approach is simple and effective.
>> virtio is different. there we don't have hw that needs
>> to have buffers ready for dma.
>>
>>> Looking at the Mellanox way of doing it I am not entirely sure it is
>>> useful.  It looks good for benchmarks but that is about it.  Also I
>>
>> it's the opposite. It already runs very nicely in production.
>> In real life it's always a combination of xdp_drop, xdp_tx and
>> xdp_pass actions.
>> Sounds like ixgbe wants to do things differently because
>> of not-invented-here. That new approach may turn
>> out to be good or bad, but why risk it?
>> mlx4 approach works.
>> mlx5 has few issues though, because page recycling
>> was done too simplistic. Generic page pool/recycling
>> that all drivers will use should solve that. I hope.
>> Is the proposal to have generic split-page recycler ?
>> How that is going to work?
>>
>
> No, just give the driver a page when it asks for it. How the
> driver uses the page is not the pools concern.
>
>>> don't see it extending out to the point that we would be able to
>>> exchange packets between interfaces which really seems like it should
>>> be the ultimate goal for XDP_TX.
>>
>> we don't have a use case for multi-port xdp_tx,
>> but I'm not objecting to doing it in general.
>> Just right now I don't see a need to complicate
>> drivers to do so.
>
> We are running our vswitch in userspace now for many workloads
> it would be nice to have these in kernel if possible.
>
>>
>>> It seems like eventually we want to be able to peel off the buffer and
>>> send it to something other than ourselves.  For example it seems like
>>> it might be useful at some point to use XDP to do traffic
>>> classification and have it route packets between multiple interfaces
>>> on a host and it wouldn't make sense to have all of them map every
>>> page as bidirectional because it starts becoming ridiculous if you
>>> have dozens of interfaces in a system.
>>
>> dozen interfaces? Like a single nic with dozen ports?
>> or many nics with many ports on the same system?
>> are you trying to build a switch out of x86?
>> I don't think it's realistic to have multi-terrabit x86 box.
>> Is it all because of dpdk/6wind demos?
>> I saw how dpdk was bragging that they can saturate
>> pcie bus. So? Why is this useful?

Actually I was thinking more of an OVS, bridge, or routing
replacement.  Basically with a couple of physical interfaces and then
either veth and/or vhost interfaces.

>> Why anyone would care to put a bunch of nics
>> into x86 and demonstrate that bandwidth of pcie is now
>> a limiting factor ?
>
> Maybe Alex had something else in mind but we have many virtual interfaces
> plus physical interfaces in vswitch use case. Possibly thousands.

I was thinking about the fact that the Mellanox driver is currently
mapping pages as bidirectional, so I was sticking to the
device-to-device case in regard to that discussion.  For virtual
interfaces we don't even need the DMA mapping; it is just a copy to
user space we have to deal with in the case of vhost.  In that regard
I was thinking we need to start looking at taking XDP_TX one step
further and possibly look at supporting the transmit of an xdp_buf on
an unrelated netdev.  Although it looks like that means adding a
netdev pointer to xdp_buf in order to support returning it.
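
(Purely to illustrate that last point, and not code from any posted
patch: today's xdp_buff only carries the data pointers, and the
extension being discussed would add something like the hypothetical
rx_dev field below.)

struct xdp_buff {
	void *data;
	void *data_end;
	void *data_hard_start;
	/* hypothetical addition: the netdev the buffer came from, so it
	 * can be returned/recycled after a transmit on another device */
	struct net_device *rx_dev;
};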

Anyway, I am just running on conjecture at this point.  But it seems
like if we want to make XDP capable of doing transmit we should
support something other than a bounce on the same port, since that
seems like a "just saturate the bus" use case more than anything.  I
suppose you can do a one-armed router, or have it do encap/decap for a
tunnel, but that is about the limit of it.  If we allow it to do
transmit on other netdevs then suddenly this has the potential to
replace significant existing infrastructure.

Sorry if I am stirring the hornet's nest here.  I just finished the DMA
API changes to allow DMA page reuse with writable pages on ixgbe, and
igb/i40e/i40evf should be getting the same treatment shortly.  So now
I am looking forward at XDP and just noticing a few things that didn't
seem to make sense given the work I was doing to enable the API.


* Re: Questions on XDP
  2017-02-19  2:16   ` Alexander Duyck
@ 2017-02-19  3:48     ` John Fastabend
  2017-02-20 20:06       ` Jakub Kicinski
  2017-02-21  3:18     ` Alexei Starovoitov
  1 sibling, 1 reply; 21+ messages in thread
From: John Fastabend @ 2017-02-19  3:48 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexei Starovoitov, Eric Dumazet, Jesper Dangaard Brouer, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On 17-02-18 06:16 PM, Alexander Duyck wrote:
> On Sat, Feb 18, 2017 at 3:48 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 17-02-18 03:31 PM, Alexei Starovoitov wrote:
>>> On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
>>> <alexander.duyck@gmail.com> wrote:
>>>>
>>>>> XDP_DROP does not require having one page per frame.
>>>>
>>>> Agreed.
>>>
>>> why do you think so?
>>> xdp_drop is targeting ddos where in good case
>>> all traffic is passed up and in bad case
>>> most of the traffic is dropped, but good traffic still needs
>>> to be serviced by the layers after. Like other xdp
>>> programs and the stack.
>>> Say ixgbe+xdp goes with 2k per packet,
>>> very soon we will have a bunch of half pages
>>> sitting in the stack and other halfs requiring
>>> complex refcnting and making the actual
>>> ddos mitigation ineffective and forcing nic to drop packets
>>
>> I'm not seeing the distinction here. If its a 4k page and
>> in the stack the driver will get overrun as well.
>>
>>> because it runs out of buffers. Why complicate things?
>>
>> It doesn't seem complex to me and the driver already handles this
>> case so it actually makes the drivers simpler because there is only
>> a single buffer management path.
>>
>>> packet per page approach is simple and effective.
>>> virtio is different. there we don't have hw that needs
>>> to have buffers ready for dma.
>>>
>>>> Looking at the Mellanox way of doing it I am not entirely sure it is
>>>> useful.  It looks good for benchmarks but that is about it.  Also I
>>>
>>> it's the opposite. It already runs very nicely in production.
>>> In real life it's always a combination of xdp_drop, xdp_tx and
>>> xdp_pass actions.
>>> Sounds like ixgbe wants to do things differently because
>>> of not-invented-here. That new approach may turn
>>> out to be good or bad, but why risk it?
>>> mlx4 approach works.
>>> mlx5 has few issues though, because page recycling
>>> was done too simplistic. Generic page pool/recycling
>>> that all drivers will use should solve that. I hope.
>>> Is the proposal to have generic split-page recycler ?
>>> How that is going to work?
>>>
>>
>> No, just give the driver a page when it asks for it. How the
>> driver uses the page is not the pools concern.
>>
>>>> don't see it extending out to the point that we would be able to
>>>> exchange packets between interfaces which really seems like it should
>>>> be the ultimate goal for XDP_TX.
>>>
>>> we don't have a use case for multi-port xdp_tx,
>>> but I'm not objecting to doing it in general.
>>> Just right now I don't see a need to complicate
>>> drivers to do so.
>>
>> We are running our vswitch in userspace now for many workloads
>> it would be nice to have these in kernel if possible.
>>
>>>
>>>> It seems like eventually we want to be able to peel off the buffer and
>>>> send it to something other than ourselves.  For example it seems like
>>>> it might be useful at some point to use XDP to do traffic
>>>> classification and have it route packets between multiple interfaces
>>>> on a host and it wouldn't make sense to have all of them map every
>>>> page as bidirectional because it starts becoming ridiculous if you
>>>> have dozens of interfaces in a system.
>>>
>>> dozen interfaces? Like a single nic with dozen ports?
>>> or many nics with many ports on the same system?
>>> are you trying to build a switch out of x86?
>>> I don't think it's realistic to have multi-terrabit x86 box.
>>> Is it all because of dpdk/6wind demos?
>>> I saw how dpdk was bragging that they can saturate
>>> pcie bus. So? Why is this useful?
> 
> Actually I was thinking more of an OVS, bridge, or routing
> replacement.  Basically with a couple of physical interfaces and then
> either veth and/or vhost interfaces.
> 

Yep, a valid use case for me. We would use this with Intel Clear Linux,
assuming we can sort it out and the perf metrics are good.

>>> Why anyone would care to put a bunch of nics
>>> into x86 and demonstrate that bandwidth of pcie is now
>>> a limiting factor ?
>>
>> Maybe Alex had something else in mind but we have many virtual interfaces
>> plus physical interfaces in vswitch use case. Possibly thousands.
> 
> I was thinking about the fact that the Mellanox driver is currently
> mapping pages as bidirectional, so I was sticking to the device to
> device case in regards to that discussion.  For virtual interfaces we
> don't even need the DMA mapping, it is just a copy to user space we
> have to deal with in the case of vhost.  In that regard I was thinking
> we need to start looking at taking XDP_TX one step further and
> possibly look at supporting the transmit of an xdp_buf on an unrelated
> netdev.  Although it looks like that means adding a netdev pointer to
> xdp_buf in order to support returning that.
> 
> Anyway I am just running on conjecture at this point.  But it seems
> like if we want to make XDP capable of doing transmit we should
> support something other than bounce on the same port since that seems
> like a "just saturate the bus" use case more than anything.  I suppose
> you can do a one armed router, or have it do encap/decap for a tunnel,
> but that is about the limits of it.  If we allow it to do transmit on
> other netdevs then suddenly this has the potential to replace
> significant existing infrastructure.
> 
> Sorry if I am stirring the hornets nest here.  I just finished the DMA
> API changes to allow DMA page reuse with writable pages on ixgbe, and
> igb/i40e/i40evf should be getting the same treatment shortly.  So now
> I am looking forward at XDP and just noticing a few things that didn't
> seem to make sense given the work I was doing to enable the API.
> 

Yep, good to push on it IMO. So as I hinted, here is the
forward-to-another-port interface I've been looking at. I'm not
claiming it's the best possible solution, just the simplest thing I
could come up with that works. I was hoping to think about it more
next week.

Here are the XDP extensions for redirect (they need to be rebased though):

 https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee

And here is a sample program,

 https://github.com/jrfastab/linux/commit/19d0a5de3f6e934baa8df23d95e766bab7f026d0

Probably the most relevant piece in the above patch is a new ndo op, as follows:

 +	void			(*ndo_xdp_xmit)(struct net_device *dev,
 +						struct xdp_buff *xdp);


Then support for redirect in xdp ebpf,

 +BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 +{
 +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 +
 +	if (unlikely(flags))
 +		return XDP_ABORTED;
 +
 +	ri->ifindex = ifindex;
 +	return XDP_REDIRECT;
 +}
 +

And then a routine for drivers to use to push packets with the XDP_REDIRECT
action around,

+static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
+{
+	if (dev->netdev_ops->ndo_xdp_xmit) {
+		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		return 0;
+	}
+	bpf_warn_invalid_xdp_redirect(dev->ifindex);
+	return -EOPNOTSUPP;
+}
+
+int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+	dev = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
+	ri->ifindex = 0;
+	if (unlikely(!dev)) {
+		bpf_warn_invalid_xdp_redirect(ri->ifindex);
+		return -EINVAL;
+	}
+
+	return __bpf_tx_xdp(dev, xdp);
+}
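
(For completeness, the program side of this might look roughly like the
following. This is only a sketch in samples/bpf style: the helper
declaration and the BPF_FUNC_xdp_redirect id below are placeholders;
the sample commit linked above has the real thing.)

#include <linux/bpf.h>
#include "bpf_helpers.h"

/* assumed declaration of the new helper for the sample build */
static int (*bpf_xdp_redirect)(int ifindex, int flags) =
	(void *) BPF_FUNC_xdp_redirect;

SEC("xdp_redirect")
int xdp_redirect_prog(struct xdp_md *ctx)
{
	/* send everything out ifindex 3; a real program would pick the
	 * egress port from a map or from the packet headers */
	return bpf_xdp_redirect(3, 0);
}

char _license[] SEC("license") = "GPL";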


Still thinking on it, though, to see if I might come up with a better
mechanism, and I need benchmarks to show various metrics.

Thanks,
John


* Re: Questions on XDP
  2017-02-19  3:48     ` John Fastabend
@ 2017-02-20 20:06       ` Jakub Kicinski
  2017-02-22  5:02         ` John Fastabend
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2017-02-20 20:06 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Duyck, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer, Netdev, Tom Herbert, Alexei Starovoitov,
	John Fastabend, Daniel Borkmann, David Miller

On Sat, 18 Feb 2017 19:48:25 -0800, John Fastabend wrote:
> On 17-02-18 06:16 PM, Alexander Duyck wrote:
> > On Sat, Feb 18, 2017 at 3:48 PM, John Fastabend
> > <john.fastabend@gmail.com> wrote:  
> >> On 17-02-18 03:31 PM, Alexei Starovoitov wrote:  
> >>> On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
> >>> <alexander.duyck@gmail.com> wrote:  
> >>>>  
> >>>>> XDP_DROP does not require having one page per frame.  
> >>>>
> >>>> Agreed.  
> >>>
> >>> why do you think so?
> >>> xdp_drop is targeting ddos where in good case
> >>> all traffic is passed up and in bad case
> >>> most of the traffic is dropped, but good traffic still needs
> >>> to be serviced by the layers after. Like other xdp
> >>> programs and the stack.
> >>> Say ixgbe+xdp goes with 2k per packet,
> >>> very soon we will have a bunch of half pages
> >>> sitting in the stack and other halfs requiring
> >>> complex refcnting and making the actual
> >>> ddos mitigation ineffective and forcing nic to drop packets  
> >>
> >> I'm not seeing the distinction here. If its a 4k page and
> >> in the stack the driver will get overrun as well.
> >>  
> >>> because it runs out of buffers. Why complicate things?  
> >>
> >> It doesn't seem complex to me and the driver already handles this
> >> case so it actually makes the drivers simpler because there is only
> >> a single buffer management path.
> >>  
> >>> packet per page approach is simple and effective.
> >>> virtio is different. there we don't have hw that needs
> >>> to have buffers ready for dma.
> >>>  
> >>>> Looking at the Mellanox way of doing it I am not entirely sure it is
> >>>> useful.  It looks good for benchmarks but that is about it.  Also I  
> >>>
> >>> it's the opposite. It already runs very nicely in production.
> >>> In real life it's always a combination of xdp_drop, xdp_tx and
> >>> xdp_pass actions.
> >>> Sounds like ixgbe wants to do things differently because
> >>> of not-invented-here. That new approach may turn
> >>> out to be good or bad, but why risk it?
> >>> mlx4 approach works.
> >>> mlx5 has few issues though, because page recycling
> >>> was done too simplistic. Generic page pool/recycling
> >>> that all drivers will use should solve that. I hope.
> >>> Is the proposal to have generic split-page recycler ?
> >>> How that is going to work?
> >>>  
> >>
> >> No, just give the driver a page when it asks for it. How the
> >> driver uses the page is not the pools concern.
> >>  
> >>>> don't see it extending out to the point that we would be able to
> >>>> exchange packets between interfaces which really seems like it should
> >>>> be the ultimate goal for XDP_TX.  
> >>>
> >>> we don't have a use case for multi-port xdp_tx,
> >>> but I'm not objecting to doing it in general.
> >>> Just right now I don't see a need to complicate
> >>> drivers to do so.  
> >>
> >> We are running our vswitch in userspace now for many workloads
> >> it would be nice to have these in kernel if possible.
> >>  
> >>>  
> >>>> It seems like eventually we want to be able to peel off the buffer and
> >>>> send it to something other than ourselves.  For example it seems like
> >>>> it might be useful at some point to use XDP to do traffic
> >>>> classification and have it route packets between multiple interfaces
> >>>> on a host and it wouldn't make sense to have all of them map every
> >>>> page as bidirectional because it starts becoming ridiculous if you
> >>>> have dozens of interfaces in a system.  
> >>>
> >>> dozen interfaces? Like a single nic with dozen ports?
> >>> or many nics with many ports on the same system?
> >>> are you trying to build a switch out of x86?
> >>> I don't think it's realistic to have multi-terrabit x86 box.
> >>> Is it all because of dpdk/6wind demos?
> >>> I saw how dpdk was bragging that they can saturate
> >>> pcie bus. So? Why is this useful?  
> > 
> > Actually I was thinking more of an OVS, bridge, or routing
> > replacement.  Basically with a couple of physical interfaces and then
> > either veth and/or vhost interfaces.
> >   
> 
> Yep valid use case for me. We would use this with Intel Clear Linux
> assuming we can sort it out and perf metrics are good.

FWIW the limitation of having to remap buffers to TX to another netdev
also does not apply to NICs which share the same PCI device among all
ports (mlx4, nfp off the top of my head).  I wonder if it would be
worthwhile to mentally separate high-performance NICs, of which there
is a limited number, from swarms of slow "devices" like VF interfaces;
perhaps we will want to choose different solutions for the two down
the road.

> Here is XDP extensions for redirect (need to be rebased though)
> 
>  https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee
> 
> And here is a sample program,
> 
>  https://github.com/jrfastab/linux/commit/19d0a5de3f6e934baa8df23d95e766bab7f026d0
> 
> Probably the most relevant pieces in the above patch is a new ndo op as follows,
> 
>  +	void			(*ndo_xdp_xmit)(struct net_device *dev,
>  +						struct xdp_buff *xdp);
> 
> 
> Then support for redirect in xdp ebpf,
> 
>  +BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
>  +{
>  +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>  +
>  +	if (unlikely(flags))
>  +		return XDP_ABORTED;
>  +
>  +	ri->ifindex = ifindex;
>  +	return XDP_REDIRECT;
>  +}
>  +
> 
> And then a routine for drivers to use to push packets with the XDP_REDIRECT
> action around,
> 
> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
> +{
> +	if (dev->netdev_ops->ndo_xdp_xmit) {
> +		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
> +		return 0;
> +	}
> +	bpf_warn_invalid_xdp_redirect(dev->ifindex);
> +	return -EOPNOTSUPP;
> +}
> +
> +int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
> +{
> +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
> +
> +	dev = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
> +	ri->ifindex = 0;
> +	if (unlikely(!dev)) {
> +		bpf_warn_invalid_xdp_redirect(ri->ifindex);
> +		return -EINVAL;
> +	}
> +
> +	return __bpf_tx_xdp(dev, xdp);
> +}
> 
> 
> Still thinking on it though to see if I might have a better mechanism and
> need benchmarks to show various metrics.

Would it perhaps make sense to consider this work as a first step on
the path towards a lightweight skb, rather than leaking XDP constructs
outside of drivers?  If we forced all XDP drivers to produce
build_skb-able buffers, we could define the new .ndo as accepting skbs
which are not fully initialized but can be turned into real skbs if
needed?
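
(To make that concrete, a rough sketch of what such a conversion could
look like on the driver side, assuming the driver reserved headroom and
tailroom for skb_shared_info when it set the buffer up; illustrative
only, not from any posted patch:)

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/filter.h>

static struct sk_buff *xdp_buff_to_skb(struct xdp_buff *xdp,
				       struct net_device *dev,
				       unsigned int buf_size)
{
	unsigned int len = xdp->data_end - xdp->data;
	struct sk_buff *skb;

	/* buf_size must cover headroom + frame + skb_shared_info */
	skb = build_skb(xdp->data_hard_start, buf_size);
	if (!skb)
		return NULL;

	skb_reserve(skb, xdp->data - xdp->data_hard_start);
	skb_put(skb, len);
	skb->protocol = eth_type_trans(skb, dev);
	return skb;
}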


* Re: Questions on XDP
  2017-02-19  2:16   ` Alexander Duyck
  2017-02-19  3:48     ` John Fastabend
@ 2017-02-21  3:18     ` Alexei Starovoitov
  2017-02-21  3:39       ` John Fastabend
  1 sibling, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2017-02-21  3:18 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: John Fastabend, Eric Dumazet, Jesper Dangaard Brouer, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Sat, Feb 18, 2017 at 06:16:47PM -0800, Alexander Duyck wrote:
>
> I was thinking about the fact that the Mellanox driver is currently
> mapping pages as bidirectional, so I was sticking to the device to
> device case in regards to that discussion.  For virtual interfaces we
> don't even need the DMA mapping, it is just a copy to user space we
> have to deal with in the case of vhost.  In that regard I was thinking
> we need to start looking at taking XDP_TX one step further and
> possibly look at supporting the transmit of an xdp_buf on an unrelated
> netdev.  Although it looks like that means adding a netdev pointer to
> xdp_buf in order to support returning that.

The xdp_tx variant (via bpf_xdp_redirect as John proposed) should work.
I don't see why such a tx into another netdev cannot be done today.
The only requirement is that it shouldn't be driver specific.
Whichever way it's implemented in ixgbe/i40e should be applicable
to mlx*, bnx*, nfp at least.

> Anyway I am just running on conjecture at this point.  But it seems
> like if we want to make XDP capable of doing transmit we should
> support something other than bounce on the same port since that seems
> like a "just saturate the bus" use case more than anything.  I suppose
> you can do a one armed router, or have it do encap/decap for a tunnel,
> but that is about the limits of it. 

one armed router is exactly our ILA router use case.
encap/decap is our load balancer use case.

From your other email:
> 1.  The Tx code is mostly just a toy.  We need support for more
> functional use cases.

this tx toy is serving real traffic.
Adding more use cases to xdp is nice, but we cannot sacrifice
performance of these bread and butter use cases like ddos and lb.

> 2.  1 page per packet is costly, and blocks use on the intel drivers,
> mlx4 (after Eric's patches), and 64K page architectures.

1 page per packet is costly on archs with a 64k page size; that's it.
I see no reason to waste x86 cycles to improve perf on such archs.
If the argument is truesize socket limits due to 4k vs 2k, then
please show the patch where split pages can work just as fast
as a page per packet, and everyone will give two thumbs up.
If we can have 2k truesize with the same xdp_drop/tx performance
then by all means please do it.

But I suspect what is really happening is a premature defense
of likely mediocre ixgbe xdp performance on xdp_drop due to split
pages... If so, that's only ixgbe's fault, and trying to make other
nics slower to be apples to apples with ixgbe is just wrong.

> 3.  Should we support scatter-gather to support 9K jumbo frames
> instead of allocating order 2 pages?

We can, if the main use case of mtu < 4k doesn't suffer.

> If we allow it to do transmit on
> other netdevs then suddenly this has the potential to replace
> significant existing infrastructure.

What existing infrastructure are we talking about?
The Clear Containers need is clear :)
xdp_redirect into vhost/virtio would be great to have,
but xdp_tx from one port of a physical nic into another
is much less clear. That's the 'saturate pci' demo.

> Sorry if I am stirring the hornets nest here.  I just finished the DMA
> API changes to allow DMA page reuse with writable pages on ixgbe, and
> igb/i40e/i40evf should be getting the same treatment shortly.  So now
> I am looking forward at XDP and just noticing a few things that didn't
> seem to make sense given the work I was doing to enable the API.

Did I miss the patches that already landed?
I don't see any recycling done by i40e_clean_tx_irq or by
ixgbe_clean_tx_irq...


* Re: Questions on XDP
  2017-02-21  3:18     ` Alexei Starovoitov
@ 2017-02-21  3:39       ` John Fastabend
  2017-02-21  4:00         ` Alexander Duyck
  0 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2017-02-21  3:39 UTC (permalink / raw)
  To: Alexei Starovoitov, Alexander Duyck
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Netdev, Tom Herbert,
	Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On 17-02-20 07:18 PM, Alexei Starovoitov wrote:
> On Sat, Feb 18, 2017 at 06:16:47PM -0800, Alexander Duyck wrote:
>>
>> I was thinking about the fact that the Mellanox driver is currently
>> mapping pages as bidirectional, so I was sticking to the device to
>> device case in regards to that discussion.  For virtual interfaces we
>> don't even need the DMA mapping, it is just a copy to user space we
>> have to deal with in the case of vhost.  In that regard I was thinking
>> we need to start looking at taking XDP_TX one step further and
>> possibly look at supporting the transmit of an xdp_buf on an unrelated
>> netdev.  Although it looks like that means adding a netdev pointer to
>> xdp_buf in order to support returning that.
> 
> xdp_tx variant (via bpf_xdp_redirect as John proposed) should work.
> I don't see why such tx into another netdev cannot be done today.
> The only requirement that it shouldn't be driver specific.
> Whichever way it's implemented in igxbe/i40e should be applicable
> to mlx*, bnx*, nfp at least.

I'm working on it this week, so I'll let everyone know how it goes. But
it should work. On virtio it runs OK, but I will test out ixgbe soon.

> 
>> Anyway I am just running on conjecture at this point.  But it seems
>> like if we want to make XDP capable of doing transmit we should
>> support something other than bounce on the same port since that seems
>> like a "just saturate the bus" use case more than anything.  I suppose
>> you can do a one armed router, or have it do encap/decap for a tunnel,
>> but that is about the limits of it. 
> 
> one armed router is exactly our ILA router use case.
> encap/decap is our load balancer use case.
> 
> From your other email:
>> 1.  The Tx code is mostly just a toy.  We need support for more
>> functional use cases.
> 
> this tx toy is serving real traffic.
> Adding more use cases to xdp is nice, but we cannot sacrifice
> performance of these bread and butter use cases like ddos and lb.
> 

Sure, but above redirect is needed for my use case ;) which is why
I'm pushing for it.

>> 2.  1 page per packet is costly, and blocks use on the intel drivers,
>> mlx4 (after Eric's patches), and 64K page architectures.
> 
> 1 page per packet is costly on archs with 64k pagesize. that's it.
> I see no reason to waste x86 cycles to improve perf on such archs.
> If the argument is truesize socket limits due to 4k vs 2k, then
> please show the patch where split page can work just as fast
> as page per packet and everyone will be giving two thumbs up.
> If we can have 2k truesize with the same xdp_drop/tx performance
> then by all means please do it.
> 
> But I suspect what is really happening is a premature defense
> of likely mediocre ixgbe xdp performance on xdp_drop due to split page...
> If so, that's only ixgbe's fault and trying to make other
> nics slower to have apple to apples with ixgbe is just wrong.
> 

Nope, I don't think this is the case; drop rates seem good on my side,
at least after initial tests. And XDP_TX is a bit slow at the moment,
but I suspect batching the sends with xmit_more should get it up to
line rate.

>> 3.  Should we support scatter-gather to support 9K jumbo frames
>> instead of allocating order 2 pages?
> 
> we can, if main use case of mtu < 4k doesn't suffer.

Agreed, I don't think it should degrade <4k performance. That said,
for VM traffic this is absolutely needed. Without TSO enabled, VM
traffic is 50% slower in my tests :/.

With tap/vhost support for XDP this becomes necessary. vhost/tap
support for XDP is on my list directly behind ixgbe and redirect
support.

> 
>> If we allow it to do transmit on
>> other netdevs then suddenly this has the potential to replace
>> significant existing infrastructure.
> 
> what existing infrastructure are we talking about?
> The clear containers need is clear :)
> The xdp_redirect into vhost/virtio would be great to have,
> but xdp_tx from one port into another of physical nic
> is much less clear. That's 'saturate pci' demo.

middlebox use cases exist but I doubt those stacks will move to
XDP anytime soon.

> 
>> Sorry if I am stirring the hornets nest here.  I just finished the DMA
>> API changes to allow DMA page reuse with writable pages on ixgbe, and
>> igb/i40e/i40evf should be getting the same treatment shortly.  So now
>> I am looking forward at XDP and just noticing a few things that didn't
>> seem to make sense given the work I was doing to enable the API.
> 
> did I miss the patches that already landed ?
> I don't see any recycling done by i40e_clean_tx_irq or by
> ixgbe_clean_tx_irq ...
> 

ixgbe (and I believe i40e) already does recycling, so there is nothing
to add to support this. For example, running XDP_DROP and XDP_TX tests
I never see any allocations occurring after the initial buffers are set
up, with the caveat that XDP_TX is still a bit slow.

.John


* Re: Questions on XDP
  2017-02-21  3:39       ` John Fastabend
@ 2017-02-21  4:00         ` Alexander Duyck
  2017-02-21  7:55           ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Duyck @ 2017-02-21  4:00 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Eric Dumazet, Jesper Dangaard Brouer, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Mon, Feb 20, 2017 at 7:39 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 17-02-20 07:18 PM, Alexei Starovoitov wrote:
>> On Sat, Feb 18, 2017 at 06:16:47PM -0800, Alexander Duyck wrote:
>>>
>>> I was thinking about the fact that the Mellanox driver is currently
>>> mapping pages as bidirectional, so I was sticking to the device to
>>> device case in regards to that discussion.  For virtual interfaces we
>>> don't even need the DMA mapping, it is just a copy to user space we
>>> have to deal with in the case of vhost.  In that regard I was thinking
>>> we need to start looking at taking XDP_TX one step further and
>>> possibly look at supporting the transmit of an xdp_buf on an unrelated
>>> netdev.  Although it looks like that means adding a netdev pointer to
>>> xdp_buf in order to support returning that.
>>
>> xdp_tx variant (via bpf_xdp_redirect as John proposed) should work.
>> I don't see why such tx into another netdev cannot be done today.
>> The only requirement that it shouldn't be driver specific.
>> Whichever way it's implemented in igxbe/i40e should be applicable
>> to mlx*, bnx*, nfp at least.
>
> I'm working on it this week so I'll let everyone know how it goes. But
> it should work. On virtio it runs OK but will test out ixgbe soon.
>
>>
>>> Anyway I am just running on conjecture at this point.  But it seems
>>> like if we want to make XDP capable of doing transmit we should
>>> support something other than bounce on the same port since that seems
>>> like a "just saturate the bus" use case more than anything.  I suppose
>>> you can do a one armed router, or have it do encap/decap for a tunnel,
>>> but that is about the limits of it.
>>
>> one armed router is exactly our ILA router use case.
>> encap/decap is our load balancer use case.
>>
>> From your other email:
>>> 1.  The Tx code is mostly just a toy.  We need support for more
>>> functional use cases.
>>
>> this tx toy is serving real traffic.
>> Adding more use cases to xdp is nice, but we cannot sacrifice
>> performance of these bread and butter use cases like ddos and lb.
>>
>
> Sure, but above redirect is needed for my use case ;) which is why
> I'm pushing for it.

I assumed "toy Tx" since I wasn't aware that they were actually
allowing writing to the page.  I think that might work for the XDP_TX
case, but the case where encap/decap is done and then passed up to the
stack runs the risk of causing data corruption on some architectures
if they unmap the page before the stack is done with the skb.  I
already pointed out the issue to the Mellanox guys and that will
hopefully be addressed shortly.

>>> 2.  1 page per packet is costly, and blocks use on the intel drivers,
>>> mlx4 (after Eric's patches), and 64K page architectures.
>>
>> 1 page per packet is costly on archs with 64k pagesize. that's it.
>> I see no reason to waste x86 cycles to improve perf on such archs.
>> If the argument is truesize socket limits due to 4k vs 2k, then
>> please show the patch where split page can work just as fast
>> as page per packet and everyone will be giving two thumbs up.
>> If we can have 2k truesize with the same xdp_drop/tx performance
>> then by all means please do it.
>>
>> But I suspect what is really happening is a premature defense
>> of likely mediocre ixgbe xdp performance on xdp_drop due to split page...
>> If so, that's only ixgbe's fault and trying to make other
>> nics slower to have apple to apples with ixgbe is just wrong.
>>
>
> Nope I don't think this is the case drop rates seem good on my side
> at least after initial tests. And XDP_TX is a bit slow at the moment
> but I suspect batching the send with xmit_more should get it up to
> line rate.

In the case of drop it is just a matter of updating the local
pagecnt_bias.  It's just a local increment, so there is no cost there.

As far as the Tx goes, I need to work with John, since his current
solution doesn't have any batching support that I saw, and that is a
major requirement if we want to get above 7 Mpps for a single core.

>>> 3.  Should we support scatter-gather to support 9K jumbo frames
>>> instead of allocating order 2 pages?
>>
>> we can, if main use case of mtu < 4k doesn't suffer.
>
> Agreed I don't think it should degrade <4k performance. That said
> for VM traffic this is absolutely needed. Without TSO enabled VM
> traffic is 50% slower on my tests :/.
>
> With tap/vhost support for XDP this becomes necessary. vhost/tap
> support for XDP is on my list directly behind ixgbe and redirect
> support.

I'm thinking we just need to turn XDP into something like a
scatterlist for such cases.  It wouldn't take much to just convert the
single xdp_buf into an array of xdp_buf.
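
(Just to illustrate the shape of that idea, and not an existing or
proposed structure: something along these lines, where the first entry
is the linear part the program actually parses.)

struct xdp_frame_sg {
	int nr_bufs;
	/* entry 0 is the linear part, the rest are continuation buffers */
	struct xdp_buff bufs[MAX_SKB_FRAGS + 1];
};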

>>
>>> If we allow it to do transmit on
>>> other netdevs then suddenly this has the potential to replace
>>> significant existing infrastructure.
>>
>> what existing infrastructure are we talking about?
>> The clear containers need is clear :)
>> The xdp_redirect into vhost/virtio would be great to have,
>> but xdp_tx from one port into another of physical nic
>> is much less clear. That's 'saturate pci' demo.
>
> middlebox use cases exist but I doubt those stacks will move to
> XDP anytime soon.
>
>>
>>> Sorry if I am stirring the hornets nest here.  I just finished the DMA
>>> API changes to allow DMA page reuse with writable pages on ixgbe, and
>>> igb/i40e/i40evf should be getting the same treatment shortly.  So now
>>> I am looking forward at XDP and just noticing a few things that didn't
>>> seem to make sense given the work I was doing to enable the API.
>>
>> did I miss the patches that already landed ?
>> I don't see any recycling done by i40e_clean_tx_irq or by
>> ixgbe_clean_tx_irq ...
>>
>
> ixgbe (and I believe i40e) already do recycling so there is nothing to add
> to support this. For example running XDP_DROP tests and XDP_TX tests I never
> see any allocations occurring after initial buffers are setup. With the
> caveat that XDP_TX is a bit slow still.
>
> .John
>

The ixgbe driver has been doing page recycling for years.  I believe
Dave just pulled the bits from Jeff to enable ixgbe to use build_skb,
update the DMA API, and bulk the page count additions.  There are still
a few tweaks I plan to make to increase the headroom, since it is
currently only NET_SKB_PAD + NET_IP_ALIGN and I think we have enough
room for 192 bytes of headroom as I recall.

The code for i40e/i40evf is still pending, though there is an earlier
version of the page recycling code there that is doing the old
get_page/page_ref_inc approach that Eric just pushed for mlx4.  I have
it done in our out-of-tree i40e and i40evf drivers and it will take a
little while to make it all the way to the kernel.

- Alex


* Re: Questions on XDP
  2017-02-21  4:00         ` Alexander Duyck
@ 2017-02-21  7:55           ` Alexei Starovoitov
  2017-02-21 17:44             ` Alexander Duyck
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2017-02-21  7:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: John Fastabend, Eric Dumazet, Jesper Dangaard Brouer, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Mon, Feb 20, 2017 at 08:00:57PM -0800, Alexander Duyck wrote:
> 
> I assumed "toy Tx" since I wasn't aware that they were actually
> allowing writing to the page.  I think that might work for the XDP_TX
> case, 

Take a look at samples/bpf/xdp_tx_iptunnel_kern.c
It's a close enough approximation of a load balancer.
The packet header is rewritten by the bpf program.
That's where the DMA_BIDIRECTIONAL requirement came from.

> but the case where encap/decap is done and then passed up to the
> stack runs the risk of causing data corruption on some architectures
> if they unmap the page before the stack is done with the skb.  I
> already pointed out the issue to the Mellanox guys and that will
> hopefully be addressed shortly.

Sure. The path where the xdp program does decap and passes the packet
to the stack is not finished. To make it work properly we need to
expose the csum complete field to the program, at least.

> As far as the Tx I need to work with John since his current solution
> doesn't have any batching support that I saw and that is a major
> requirement if we want to get above 7 Mpps for a single core.

I think we need to focus on both Mpps and 'perf report' together.
A single core doing 7 Mpps and scaling linearly to 40Gbps line rate
is much better than a single core doing 20 Mpps and not scaling at all.
There could be sw inefficiencies and hw limits, hence 'perf report'
is a must-have when discussing numbers.

I think long term we will be able to agree on a set of real-life
use cases and a corresponding set of 'blessed' bpf programs, and
create a table of nic, driver, use case 1, 2, 3, single core, multi core.
Making a level playing field for all nic vendors is one of the goals.

Right now we have the xdp1, xdp2 and xdp_tx_iptunnel benchmarks.
They are approximations of the ddos, router and load balancer
use cases. They obviously need work to get to 'blessed' shape,
but imo they are quite good for vendor vs vendor comparison for
the use cases that we care about.
Eventually nic->vm and vm->vm use cases via xdp_redirect should
be added to such a set of 'blessed' benchmarks too.
I think so far we have avoided falling into the trap of
microbenchmarking wars.

> >>> 3.  Should we support scatter-gather to support 9K jumbo frames
> >>> instead of allocating order 2 pages?
> >>
> >> we can, if main use case of mtu < 4k doesn't suffer.
> >
> > Agreed I don't think it should degrade <4k performance. That said
> > for VM traffic this is absolutely needed. Without TSO enabled VM
> > traffic is 50% slower on my tests :/.
> >
> > With tap/vhost support for XDP this becomes necessary. vhost/tap
> > support for XDP is on my list directly behind ixgbe and redirect
> > support.
> 
> I'm thinking we just need to turn XDP into something like a
> scatterlist for such cases.  It wouldn't take much to just convert the
> single xdp_buf into an array of xdp_buf.

The datapath has to be fast. If the xdp program needs to look at all
the bytes of the packet, the performance is gone. Therefore I don't see
a need to expose an array of xdp_buffs to the program.
The alternative would be to add a hidden field to xdp_buff that keeps
the SG list in some form, while data_end points to the end of the
linear chunk. But you cannot put only headers into the linear part.
If the program needs to see something that is after data_end, it will
drop the packet. So it's not at all a split-header model; the
data..data_end chunk should cover as much as possible. We cannot get
into SG games while parsing the packet inside the program. Everything
that the program needs to see has to be in the linear part.
I think such a model will work well for the jumbo packet case,
but I don't think it will work for a VM's TSO.
For an xdp program to pass a csum_partial packet from a VM into the
nic in a meaningful way, it needs to gain knowledge of ip, l4, csum
and a bunch of other metadata fields that the nic needs to do TSO.
I'm not sure it's worth exposing all that to xdp. Instead,
can we make the VM do segmentation, so that the xdp program doesn't
need to deal with gso packets?
I think the main cost is the packet csum, and for this something
like an xdp_tx_with_pseudo_header() helper can work.
The xdp program will see individual packets with a pseudo header,
and the hw nic will do the final csum over the packet.
The program will see the csum field as part of xdp_buff,
and if it's csum_partial it will use xdp_tx_with_pseudo_header()
to transmit the packet instead of xdp_redirect or xdp_tx.
The call may look like:
xdp_tx_with_pseudo_header(xdp_buff, ip_off, tcp_off);
and the program will compute these two offsets based on
the packet itself and not on metadata that came from the VM.
In other words, I'd like the xdp program to deal with raw packets
as much as possible. The pseudo header is part of the packet,
so the only metadata the program needs is whether the packet has
a pseudo header or not.
Saying it differently: either the packet came from the physical nic
and in xdp_buff we have a csum field (from the hw rx descriptor)
that has csum-complete meaning, or the packet came from the VM,
the pseudo header is populated and xdp_buff->csum is empty.
From the physical nic the packet will travel through xdp programs
into the VM, and csum complete nicely covers all encap/decap cases,
whether they're done by an xdp program or by the stack inside the VM.
From the VM the packet similarly travels through xdp programs,
and when it's about to hit the physical nic the last program
calls xdp_tx_with_pseudo_header(). Any packet manipulations
that are done in between are done cleanly, without worrying
about gso and adjustments to metadata.
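
(The "everything the program needs to see has to be in the linear part"
rule above is what the usual data_end bounds check already enforces
today; a minimal sketch of that program-side pattern, nothing specific
to this proposal:)

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_linear_only(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;

	/* bytes past data_end cannot be read; a program that needs
	 * them has no choice but to drop (or pass) the packet */
	if ((void *)(eth + 1) > data_end)
		return XDP_DROP;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";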

> The ixgbe driver has been doing page recycling for years.  I believe
> Dave just pulled the bits from Jeff to enable ixgbe to use build_skb,
> update the DMA API, and bulk the page count additions.  There is still
> a few tweaks I plan to make to increase the headroom since it is
> currently only NET_SKB_PAD + NET_IP_ALIGN and I think we have enough
> room for 192 bytes of headroom as I recall.

Nice. Why keep the old ixgbe_construct_skb code around?
With split pages and build_skb, shouldn't perf be great for small and
large packets?

Unless I wasn't clear earlier:
Please release your ixgbe and i40e xdp patches in whatever shape
they are right now. I'm ready to test with xdp1+xdp2+xdp_tx_iptunnel :)


* Re: Questions on XDP
  2017-02-21  7:55           ` Alexei Starovoitov
@ 2017-02-21 17:44             ` Alexander Duyck
  2017-02-22 17:08               ` John Fastabend
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Duyck @ 2017-02-21 17:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Eric Dumazet, Jesper Dangaard Brouer, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Mon, Feb 20, 2017 at 11:55 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 08:00:57PM -0800, Alexander Duyck wrote:
>>
>> I assumed "toy Tx" since I wasn't aware that they were actually
>> allowing writing to the page.  I think that might work for the XDP_TX
>> case,
>
> Take a look at samples/bpf/xdp_tx_iptunnel_kern.c
> It's close enough approximation of load balancer.
> The packet header is rewritten by the bpf program.
> That's where dma_bidirectional requirement came from.

Thanks.  I will take a look at it.

>> but the case where encap/decap is done and then passed up to the
>> stack runs the risk of causing data corruption on some architectures
>> if they unmap the page before the stack is done with the skb.  I
>> already pointed out the issue to the Mellanox guys and that will
>> hopefully be addressed shortly.
>
> sure. the path were xdp program does decap and passes to the stack
> is not finished. To make it work properly we need to expose
> csum complete field to the program at least.

I would think the checksum is something that could be validated after
the frame has been modified.  In the case of encapsulating or
decapsulating a TCP frame you could probably assume the inner TCP
checksum is valid, and then you only have to deal with the checksum if
it is present in the outer tunnel header.  Basically, deal with it like
we do local checksum offload, only you would have to compute the
pseudo-header checksum for the inner and outer headers since you can't
use the partial checksum of the inner header.
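
(A rough sketch of the pseudo-header step being described, assuming iph
and th point at the inner headers in the linear buffer after the
rewrite; the wrapper name is illustrative, csum_tcpudp_magic() is the
existing kernel primitive:)

#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <net/checksum.h>

static void seed_inner_tcp_csum(struct iphdr *iph, struct tcphdr *th,
				unsigned int tcp_len)
{
	/* seed the inner TCP checksum with its pseudo-header sum, as is
	 * done for CHECKSUM_PARTIAL; the device (or a later
	 * csum_partial() pass over the payload) covers the rest */
	th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
				       tcp_len, IPPROTO_TCP, 0);
}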

>> As far as the Tx I need to work with John since his current solution
>> doesn't have any batching support that I saw and that is a major
>> requirement if we want to get above 7 Mpps for a single core.
>
> I think we need to focus on both Mpps and 'perf report' together.

Agreed, I usually look over both as one tells you how fast you are
going and the other tells you where the bottlenecks are.

> Single core doing 7Mpps and scaling linearly to 40Gbps line rate
> is much better than single core doing 20Mpps and not scaling at all.
> There could be sw inefficiencies and hw limits, hence 'perf report'
> is must have when discussing numbers.

Agreed.

> I think long term we will be able to agree on a set of real life
> use cases and corresponding set of 'blessed' bpf programs and
> create a table of nic, driver, use case 1, 2, 3, single core, multi.
> Making level playing field for all nic vendors is one of the goals.
>
> Right now we have xdp1, xdp2 and xdp_tx_iptunnel benchmarks.
> They are approximations of ddos, router, load balancer
> use cases. They obviously need work to get to 'blessed' shape,
> but imo quite good to do vendor vs vendor comparison for
> the use cases that we care about.
> Eventually nic->vm and vm->vm use cases via xdp_redirect should
> be added to such set of 'blessed' benchmarks too.
> I think so far we avoided falling into trap of microbenchmarking wars.

I'll keep this in mind for upcoming patches.

>> >>> 3.  Should we support scatter-gather to support 9K jumbo frames
>> >>> instead of allocating order 2 pages?
>> >>
>> >> we can, if main use case of mtu < 4k doesn't suffer.
>> >
>> > Agreed I don't think it should degrade <4k performance. That said
>> > for VM traffic this is absolutely needed. Without TSO enabled VM
>> > traffic is 50% slower on my tests :/.
>> >
>> > With tap/vhost support for XDP this becomes necessary. vhost/tap
>> > support for XDP is on my list directly behind ixgbe and redirect
>> > support.
>>
>> I'm thinking we just need to turn XDP into something like a
>> scatterlist for such cases.  It wouldn't take much to just convert the
>> single xdp_buf into an array of xdp_buf.
>
> datapath has to be fast. If xdp program needs to look at all
> bytes of the packet the performance is gone. Therefore I don't see
> a need to expose an array of xdp_buffs to the program.

The program itself may not care, but if we are going to deal with
things like Tx and Drop we need to make sure we drop all the parts of
the frame.  An alternate idea I have been playing around with is just
having the driver repeat the last action until it hits the end of a
frame.  So XDP would analyze the first 1.5K or 3K of the frame, and
then tell us to either drop it, pass it, or xmit it.  After that we
would just repeat that action until we hit the end of the frame.  The
only limitation is that XDP would be restricted to accessing the
first 1514 bytes.
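
A self-contained sketch of that "repeat the verdict" idea (illustrative
only; the types and helpers below are stand-ins, not driver code):

#include <stdio.h>
#include <stdbool.h>

enum verdict { VERDICT_DROP, VERDICT_PASS, VERDICT_TX };

struct rx_buf {
	unsigned int len;
	bool eop;		/* end-of-packet bit from the RX descriptor */
};

/* Stand-ins for the real XDP run and the driver consume paths. */
static enum verdict run_xdp(struct rx_buf *b)
{
	return b->len ? VERDICT_PASS : VERDICT_DROP;
}

static void consume(struct rx_buf *b, enum verdict v)
{
	printf("buf len=%u -> verdict %d\n", b->len, (int)v);
}

/* Run XDP on the first buffer only, then replay its verdict on every
 * continuation buffer until the EOP descriptor is reached. */
static void handle_frame(struct rx_buf *bufs, int n)
{
	enum verdict v = run_xdp(&bufs[0]);	/* sees the first chunk only */

	for (int i = 0; i < n; i++) {
		consume(&bufs[i], v);
		if (bufs[i].eop)
			break;
	}
}

int main(void)
{
	struct rx_buf frame[3] = { {1500, false}, {1500, false}, {600, true} };

	handle_frame(frame, 3);
	return 0;
}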

> The alternative would be to add a hidden field to xdp_buff that keeps
> SG in some form and data_end will point to the end of linear chunk.
> But you cannot put only headers into linear part. If program
> needs to see something that is after data_end, it will drop the packet.
> So it's not at all split-header model. 'data-data_end' chunk
> should cover as much as possible. We cannot get into sg games
> while parsing the packet inside the program. Everything
> that program needs to see has to be in the linear part.

Agreed, I wouldn't want to do header split.  My thought is to break
things up so that you have 1.5K at least.  The rest should hopefully
just be data that we wouldn't need to look at.

> I think such model will work well for jumbo packet case.
> but I don't think it will work for VM's tso.
> For xdp program to pass csum_partial packet from vm into nic
> in a meaningful way it needs to gain knowledge of ip, l4, csum
> and bunch of other meta-data fields that nic needs to do TSO.
> I'm not sure it's worth exposing all that to xdp. Instead
> can we make VM to do segmentation, so that xdp program don't
> need to deal with gso packets ?

GSO/TSO is getting into advanced stuff I would rather not have to get
into right now.  I figure we need to take this portion one step at a
time.  To support GSO we need more information like the mss.

I think for now if we can get xmit bulking working that should get us
behavior close enough to see some of the benefits of GRO as we can
avoid having vhost transition between user space and kernel space
quite so much.
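
As a minimal sketch of what the bulking amounts to on the xmit side
(the names below are illustrative stand-ins, not a proposed API):
queue descriptors locally and pay the doorbell/kick cost once per
batch instead of once per packet.

#include <stddef.h>

#define BULK	16

struct pkt {
	void *data;
	size_t len;
};

struct tx_bulk {
	struct pkt *q[BULK];
	int n;
};

/* Stand-ins for the per-packet and the per-batch costs. */
static void post_descriptor(struct pkt *p) { (void)p; }
static void ring_doorbell(void) { /* one MMIO write or one vhost kick */ }

static void tx_flush(struct tx_bulk *b)
{
	for (int i = 0; i < b->n; i++)
		post_descriptor(b->q[i]);
	if (b->n)
		ring_doorbell();	/* amortized over the whole batch */
	b->n = 0;
}

/* Queue a packet; only flush (and notify) when the batch fills up. */
static void tx_queue(struct tx_bulk *b, struct pkt *p)
{
	b->q[b->n++] = p;
	if (b->n == BULK)
		tx_flush(b);
}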

> I think the main cost is packet csum and for this something
> like xdp_tx_with_pseudo_header() helper can work.
> xdp program will see individual packets with pseudo header
> and hw nic will do final csum over packet.
> The program will see csum field as part of xdp_buff
> and if it's csum_partial it will use xdp_tx_with_pseudo_header()
> to transmit the packet instead of xdp_redirect or xdp_tx.
> The call may look like:
> xdp_tx_with_pseudo_header(xdp_buff, ip_off, tcp_off);
> and these two offsets the program will compute based on
> the packet itself and not metadata that came from VM.
> In other words I'd like xdp program to deal with raw packets
> as much as possible. pseudo header is part of the packet.
> So the only metadata program needs is whether packet has
> pseudo header or not.
> Saying it differently: whether the packet came from physical
> nic and in xdp_buff we have csum field (from hw rx descriptor)
> that has csum complete meaning or the packet came from VM,
> pseudo header is populated and xdp_buff->csum is empty.
> From physical nic the packet will travel through xdp program
> into VM and csum complete nicely covers all encap/decap
> cases whether they're done by xdp program or by stack inside VM.
> From VM the packet similarly travels through xdp programs
> and when it's about to hit physical nic the last program
> calls xdp_tx_with_pseudo_header(). Any packet manipulations
> that are done in between are done cleanly without worrying
> about gso and adjustments to metadata.
>
>> The ixgbe driver has been doing page recycling for years.  I believe
>> Dave just pulled the bits from Jeff to enable ixgbe to use build_skb,
>> update the DMA API, and bulk the page count additions.  There is still
>> a few tweaks I plan to make to increase the headroom since it is
>> currently only NET_SKB_PAD + NET_IP_ALIGN and I think we have enough
>> room for 192 bytes of headroom as I recall.
>
> Nice. Why keep old ixgbe_construct_skb code around?
> With split page and build_skb perf should be great for small and large
> packets ?

I kept it around mostly just as a fall back in case something goes
wrong.  The last time I pushed the build_skb code I had to revert it
shortly afterwards as we found the DMA API didn't support the use of
build_skb on DMA mapped pages.  This way if we find there is an
architecture that doesn't play well with this we have a fall back we
can enable until we can get it fixed.

> Unless I wasn't clear earlier:
> Please release your ixgbe and i40e xdp patches in whatever shape
> they are right now. I'm ready to test with xdp1+xdp2+xdp_tx_iptunnel :)

The i40e patches are going to be redone since they need to be rebased
on top of the build_skb changes for our out-of-tree driver and then
resubmitted once those changes are upstream.  So it will probably be a
few weeks before we have them ready.

I'll let John talk to when the ixgbe changes can be submitted.

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
  2017-02-20 20:06       ` Jakub Kicinski
@ 2017-02-22  5:02         ` John Fastabend
  0 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2017-02-22  5:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer, Netdev, Tom Herbert, Alexei Starovoitov,
	John Fastabend, Daniel Borkmann, David Miller

On 17-02-20 12:06 PM, Jakub Kicinski wrote:
> On Sat, 18 Feb 2017 19:48:25 -0800, John Fastabend wrote:
>> On 17-02-18 06:16 PM, Alexander Duyck wrote:
>>> On Sat, Feb 18, 2017 at 3:48 PM, John Fastabend
>>> <john.fastabend@gmail.com> wrote:  
>>>> On 17-02-18 03:31 PM, Alexei Starovoitov wrote:  
>>>>> On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
>>>>> <alexander.duyck@gmail.com> wrote:  
>>>>>>  
>>>>>>> XDP_DROP does not require having one page per frame.  
>>>>>>
>>>>>> Agreed.  
>>>>>
>>>>> why do you think so?
>>>>> xdp_drop is targeting ddos where in good case
>>>>> all traffic is passed up and in bad case
>>>>> most of the traffic is dropped, but good traffic still needs
>>>>> to be serviced by the layers after. Like other xdp
>>>>> programs and the stack.
>>>>> Say ixgbe+xdp goes with 2k per packet,
>>>>> very soon we will have a bunch of half pages
>>>>> sitting in the stack and other halfs requiring
>>>>> complex refcnting and making the actual
>>>>> ddos mitigation ineffective and forcing nic to drop packets  
>>>>
>>>> I'm not seeing the distinction here. If its a 4k page and
>>>> in the stack the driver will get overrun as well.
>>>>  
>>>>> because it runs out of buffers. Why complicate things?  
>>>>
>>>> It doesn't seem complex to me and the driver already handles this
>>>> case so it actually makes the drivers simpler because there is only
>>>> a single buffer management path.
>>>>  
>>>>> packet per page approach is simple and effective.
>>>>> virtio is different. there we don't have hw that needs
>>>>> to have buffers ready for dma.
>>>>>  
>>>>>> Looking at the Mellanox way of doing it I am not entirely sure it is
>>>>>> useful.  It looks good for benchmarks but that is about it.  Also I  
>>>>>
>>>>> it's the opposite. It already runs very nicely in production.
>>>>> In real life it's always a combination of xdp_drop, xdp_tx and
>>>>> xdp_pass actions.
>>>>> Sounds like ixgbe wants to do things differently because
>>>>> of not-invented-here. That new approach may turn
>>>>> out to be good or bad, but why risk it?
>>>>> mlx4 approach works.
>>>>> mlx5 has few issues though, because page recycling
>>>>> was done too simplistic. Generic page pool/recycling
>>>>> that all drivers will use should solve that. I hope.
>>>>> Is the proposal to have generic split-page recycler ?
>>>>> How that is going to work?
>>>>>  
>>>>
>>>> No, just give the driver a page when it asks for it. How the
>>>> driver uses the page is not the pools concern.
>>>>  
>>>>>> don't see it extending out to the point that we would be able to
>>>>>> exchange packets between interfaces which really seems like it should
>>>>>> be the ultimate goal for XDP_TX.  
>>>>>
>>>>> we don't have a use case for multi-port xdp_tx,
>>>>> but I'm not objecting to doing it in general.
>>>>> Just right now I don't see a need to complicate
>>>>> drivers to do so.  
>>>>
>>>> We are running our vswitch in userspace now for many workloads
>>>> it would be nice to have these in kernel if possible.
>>>>  
>>>>>  
>>>>>> It seems like eventually we want to be able to peel off the buffer and
>>>>>> send it to something other than ourselves.  For example it seems like
>>>>>> it might be useful at some point to use XDP to do traffic
>>>>>> classification and have it route packets between multiple interfaces
>>>>>> on a host and it wouldn't make sense to have all of them map every
>>>>>> page as bidirectional because it starts becoming ridiculous if you
>>>>>> have dozens of interfaces in a system.  
>>>>>
>>>>> dozen interfaces? Like a single nic with dozen ports?
>>>>> or many nics with many ports on the same system?
>>>>> are you trying to build a switch out of x86?
>>>>> I don't think it's realistic to have multi-terrabit x86 box.
>>>>> Is it all because of dpdk/6wind demos?
>>>>> I saw how dpdk was bragging that they can saturate
>>>>> pcie bus. So? Why is this useful?  
>>>
>>> Actually I was thinking more of an OVS, bridge, or routing
>>> replacement.  Basically with a couple of physical interfaces and then
>>> either veth and/or vhost interfaces.
>>>   
>>
>> Yep valid use case for me. We would use this with Intel Clear Linux
>> assuming we can sort it out and perf metrics are good.
> 
> FWIW the limitation of having to remap buffers to TX to other netdev
> also does not apply to NICs which share the same PCI device among all ports
> (mlx4, nfp off the top of my head).  I wonder if it would be worthwhile
> to mentally separate high-performance NICs of which there is a limited
> number from swarms of slow "devices" like VF interfaces, perhaps we
> will want to choose different solutions for the two down the road.
> 
>> Here is XDP extensions for redirect (need to be rebased though)
>>
>>  https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee
>>
>> And here is a sample program,
>>
>>  https://github.com/jrfastab/linux/commit/19d0a5de3f6e934baa8df23d95e766bab7f026d0
>>
>> Probably the most relevant pieces in the above patch is a new ndo op as follows,
>>
>>  +	void			(*ndo_xdp_xmit)(struct net_device *dev,
>>  +						struct xdp_buff *xdp);
>>
>>
>> Then support for redirect in xdp ebpf,
>>
>>  +BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
>>  +{
>>  +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>>  +
>>  +	if (unlikely(flags))
>>  +		return XDP_ABORTED;
>>  +
>>  +	ri->ifindex = ifindex;
>>  +	return XDP_REDIRECT;
>>  +}
>>  +
>>
>> And then a routine for drivers to use to push packets with the XDP_REDIRECT
>> action around,
>>
>> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
>> +{
>> +	if (dev->netdev_ops->ndo_xdp_xmit) {
>> +		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
>> +		return 0;
>> +	}
>> +	bpf_warn_invalid_xdp_redirect(dev->ifindex);
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
>> +{
>> +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>> +
>> +	dev = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
>> +	ri->ifindex = 0;
>> +	if (unlikely(!dev)) {
>> +		bpf_warn_invalid_xdp_redirect(ri->ifindex);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return __bpf_tx_xdp(dev, xdp);
>> +}
>>
>>
>> Still thinking on it though to see if I might have a better mechanism and
>> need benchmarks to show various metrics.
> 
> Would it perhaps make sense to consider this work as first step on the
> path towards lightweight-skb rather than leaking XDP constructs outside
> of drivers?  If we forced all XDP drivers to produce build_skb-able 
> buffers, we could define the new .ndo as accepting skbs which are not
> fully initialized but can be turned into real skbs if needed?
> 

I believe this is a good idea. But I need a few iterations on existing code
base :) before I can try to realize something like this.

.John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
  2017-02-21 17:44             ` Alexander Duyck
@ 2017-02-22 17:08               ` John Fastabend
  2017-02-22 21:59                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2017-02-22 17:08 UTC (permalink / raw)
  To: Alexander Duyck, Alexei Starovoitov
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Netdev, Tom Herbert,
	Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On 17-02-21 09:44 AM, Alexander Duyck wrote:
> On Mon, Feb 20, 2017 at 11:55 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Feb 20, 2017 at 08:00:57PM -0800, Alexander Duyck wrote:
>>>
>>> I assumed "toy Tx" since I wasn't aware that they were actually
>>> allowing writing to the page.  I think that might work for the XDP_TX
>>> case,
>>
>> Take a look at samples/bpf/xdp_tx_iptunnel_kern.c
>> It's close enough approximation of load balancer.
>> The packet header is rewritten by the bpf program.
>> That's where dma_bidirectional requirement came from.
> 
> Thanks.  I will take a look at it.
> 
>>> but the case where encap/decap is done and then passed up to the
>>> stack runs the risk of causing data corruption on some architectures
>>> if they unmap the page before the stack is done with the skb.  I
>>> already pointed out the issue to the Mellanox guys and that will
>>> hopefully be addressed shortly.
>>
>> sure. the path where xdp program does decap and passes to the stack
>> is not finished. To make it work properly we need to expose
>> csum complete field to the program at least.
> 
> I would think the checksum is something that could be validated after
> the frame has been modified.  In the case of encapsulating or
> decapsulating a TCP frame you could probably assume the inner TCP
> checksum is valid and then you only have to deal with the checksum if
> it is present in the outer tunnel header.  Basically deal with it like
> we do the local checksum offload, only you would have to compute the
> pseudo header checksum for the inner and outer headers since you can't
> use the partial checksum of the inner header.
> 
>>> As far as the Tx I need to work with John since his current solution
>>> doesn't have any batching support that I saw and that is a major
>>> requirement if we want to get above 7 Mpps for a single core.
>>
>> I think we need to focus on both Mpps and 'perf report' together.
> 
> Agreed, I usually look over both as one tells you how fast you are
> going and the other tells you where the bottlenecks are.
> 
>> Single core doing 7Mpps and scaling linearly to 40Gbps line rate
>> is much better than single core doing 20Mpps and not scaling at all.
>> There could be sw inefficiencies and hw limits, hence 'perf report'
>> is must have when discussing numbers.
> 
> Agreed.
> 
>> I think long term we will be able to agree on a set of real life
>> use cases and corresponding set of 'blessed' bpf programs and
>> create a table of nic, driver, use case 1, 2, 3, single core, multi.
>> Making level playing field for all nic vendors is one of the goals.
>>
>> Right now we have xdp1, xdp2 and xdp_tx_iptunnel benchmarks.
>> They are approximations of ddos, router, load balancer
>> use cases. They obviously need work to get to 'blessed' shape,
>> but imo quite good to do vendor vs vendor comparison for
>> the use cases that we care about.
>> Eventually nic->vm and vm->vm use cases via xdp_redirect should
>> be added to such set of 'blessed' benchmarks too.
>> I think so far we avoided falling into trap of microbenchmarking wars.
> 
> I'll keep this in mind for upcoming patches.
> 

Yep, agreed although having some larger examples in the wild even if
not in the kernel source would be great. I think we will see these
soon.

>>>>>> 3.  Should we support scatter-gather to support 9K jumbo frames
>>>>>> instead of allocating order 2 pages?
>>>>>
>>>>> we can, if main use case of mtu < 4k doesn't suffer.
>>>>
>>>> Agreed I don't think it should degrade <4k performance. That said
>>>> for VM traffic this is absolutely needed. Without TSO enabled VM
>>>> traffic is 50% slower on my tests :/.
>>>>
>>>> With tap/vhost support for XDP this becomes necessary. vhost/tap
>>>> support for XDP is on my list directly behind ixgbe and redirect
>>>> support.
>>>
>>> I'm thinking we just need to turn XDP into something like a
>>> scatterlist for such cases.  It wouldn't take much to just convert the
>>> single xdp_buf into an array of xdp_buf.
>>
>> datapath has to be fast. If xdp program needs to look at all
>> bytes of the packet the performance is gone. Therefore I don't see
>> a need to expose an array of xdp_buffs to the program.
> 
> The program itself may not care, but if we are going to deal with
> things like Tx and Drop we need to make sure we drop all the parts of
> the frame.  An alternate idea I have been playing around with is just
> having the driver repeat the last action until it hits the end of a
> frame.  So XDP would analyze the first 1.5K or 3K of the frame, and
> then tell us to either drop it, pass it, or xmit it.  After that we
> would just repeat that action until we hit the end of the frame.  The
> only limitation is that XDP would be restricted to accessing the
> first 1514 bytes.
> 
>> The alternative would be to add a hidden field to xdp_buff that keeps
>> SG in some form and data_end will point to the end of linear chunk.
>> But you cannot put only headers into linear part. If program
>> needs to see something that is after data_end, it will drop the packet.
>> So it's not at all split-header model. 'data-data_end' chunk
>> should cover as much as possible. We cannot get into sg games
>> while parsing the packet inside the program. Everything
>> that program needs to see has to be in the linear part.
> 
> Agreed, I wouldn't want to do header split.  My thought is to break
> things up so that you have 1.5K at least.  The rest should hopefully
> just be data that we wouldn't need to look at.
> 
>> I think such model will work well for jumbo packet case.
>> but I don't think it will work for VM's tso.
>> For xdp program to pass csum_partial packet from vm into nic
>> in a meaningful way it needs to gain knowledge of ip, l4, csum
>> and bunch of other meta-data fields that nic needs to do TSO.
>> I'm not sure it's worth exposing all that to xdp. Instead
>> can we make VM to do segmentation, so that xdp program don't
>> need to deal with gso packets ?
> 
> GSO/TSO is getting into advanced stuff I would rather not have to get
> into right now.  I figure we need to take this portion one step at a
> time.  To support GSO we need more information like the mss.
> 

Agreed, let's get the driver support for basic things first. But this
is on my list. I'm just repeating myself but VM to VM performance uses
TSO/LRO heavily.

> I think for now if we can get xmit bulking working that should get us
> behavior close enough to see some of the benefits of GRO as we can
> avoid having vhost transition between user space and kernel space
> quite so much.
> 

I'll try this out on ixgbe.

>> I think the main cost is packet csum and for this something
>> like xdp_tx_with_pseudo_header() helper can work.
>> xdp program will see individual packets with pseudo header
>> and hw nic will do final csum over packet.
>> The program will see csum field as part of xdp_buff
>> and if it's csum_partial it will use xdp_tx_with_pseudo_header()
>> to transmit the packet instead of xdp_redirect or xdp_tx.
>> The call may look like:
>> xdp_tx_with_pseudo_header(xdp_buff, ip_off, tcp_off);
>> and these two offsets the program will compute based on
>> the packet itself and not metadata that came from VM.
>> In other words I'd like xdp program to deal with raw packets
>> as much as possible. pseudo header is part of the packet.
>> So the only metadata program needs is whether packet has
>> pseudo header or not.
>> Saying it differently: whether the packet came from physical
>> nic and in xdp_buff we have csum field (from hw rx descriptor)
>> that has csum complete meaning or the packet came from VM,
>> pseudo header is populated and xdp_buff->csum is empty.
>> From physical nic the packet will travel through xdp program
>> into VM and csum complete is nicely covers all encap/decap
>> cases whether they're done by xdp program or by stack inside VM.
>> From VM the packet similarly travels through xdp programs
>> and when it's about to hit physical nic the last program
>> calls xdp_tx_with_pseudo_header(). Any packet manipulations
>> that are done in between are done cleanly without worrying
>> about gso and adjustments to metadata.
>>
>>> The ixgbe driver has been doing page recycling for years.  I believe
>>> Dave just pulled the bits from Jeff to enable ixgbe to use build_skb,
>>> update the DMA API, and bulk the page count additions.  There is still
>>> a few tweaks I plan to make to increase the headroom since it is
>>> currently only NET_SKB_PAD + NET_IP_ALIGN and I think we have enough
>>> room for 192 bytes of headroom as I recall.
>>
>> Nice. Why keep old ixgbe_construct_skb code around?
>> With split page and build_skb perf should be great for small and large
>> packets ?
> 
> I kept it around mostly just as a fall back in case something goes
> wrong.  The last time I pushed the build_skb code I had to revert it
> shortly afterwards as we found the DMA API didn't support the use of
> build_skb on DMA mapped pages.  This way if we find there is an
> architecture that doesn't play well with this we have a fall back we
> can enable until we can get it fixed.
> 
>> Unless I wasn't clear earlier:
>> Please release your ixgbe and i40e xdp patches in whatever shape
>> they are right now. I'm ready to test with xdp1+xdp2+xdp_tx_iptunnel :)
> 
> The i40e patches are going to be redone since they need to be rebased
> on top of the build_skb changes for our out-of-tree driver and then
> resubmitted once those changes are upstream.  So it will probably be a
> few weeks before we have them ready.
> 
> I'll let John talk to when the ixgbe changes can be submitted.

I'm addressing Alex's comments now and should have a v2 out soon. Certainly
in the next day or two.

Thanks,
John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
  2017-02-22 17:08               ` John Fastabend
@ 2017-02-22 21:59                 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-22 21:59 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Duyck, Alexei Starovoitov, Eric Dumazet, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller, brouer

On Wed, 22 Feb 2017 09:08:53 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> > GSO/TSO is getting into advanced stuff I would rather not have to get
> > into right now.  I figure we need to take this portion one step at a
> > time.  To support GSO we need more information like the mss.
> >   
> 
> Agreed, let's get the driver support for basic things first. But this
> is on my list. I'm just repeating myself but VM to VM performance uses
> TSO/LRO heavily.

Sorry, but I get annoyed every time I hear we need to support
TSO/LRO/GRO for performance reasons.  If you take one step back, you
are actually saying we need bulking for better performance.  And the
bulking you are proposing is a TCP protocol specific bulking mechanism.

What I'm saying is: let's make bulking protocol agnostic, by doing it
at the packet level.  And once the bulk enters the VM, by all means it
should construct a GRO packet it can send into its own network stack.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
@ 2017-02-18 23:59 Alexei Starovoitov
  0 siblings, 0 replies; 21+ messages in thread
From: Alexei Starovoitov @ 2017-02-18 23:59 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Duyck, Eric Dumazet, Jesper Dangaard Brouer, Netdev,
	Tom Herbert, Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Sat, Feb 18, 2017 at 3:48 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
>
> We are running our vswitch in userspace now for many workloads
> it would be nice to have these in kernel if possible.
...
> Maybe Alex had something else in mind but we have many virtual interfaces
> plus physical interfaces in vswitch use case. Possibly thousands.

virtual interfaces towards many VMs are certainly a good use case
that we need to address.
we'd still need to copy the packet from memory of one vm into another,
right? so per packet allocation strategy for virtual interface can
be anything.

Sounds like you already have patches that do that?
Excellent. Please share.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
  2017-02-18 18:18       ` Alexander Duyck
@ 2017-02-18 23:28         ` John Fastabend
  0 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2017-02-18 23:28 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: Jesper Dangaard Brouer, Netdev, Tom Herbert, Alexei Starovoitov,
	John Fastabend, Daniel Borkmann, David Miller

On 17-02-18 10:18 AM, Alexander Duyck wrote:
> On Sat, Feb 18, 2017 at 9:41 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sat, 2017-02-18 at 17:34 +0100, Jesper Dangaard Brouer wrote:
>>> On Thu, 16 Feb 2017 14:36:41 -0800
>>> John Fastabend <john.fastabend@gmail.com> wrote:
>>>
>>>> On 17-02-16 12:41 PM, Alexander Duyck wrote:
>>>>> So I'm in the process of working on enabling XDP for the Intel NICs
>>>>> and I had a few questions so I just thought I would put them out here
>>>>> to try and get everything sorted before I paint myself into a corner.
>>>>>
>>>>> So my first question is why does the documentation mention 1 frame per
>>>>> page for XDP?
>>>
>>> Yes, XDP defines upfront a memory model where there is only one packet
>>> per page[1], please respect that!
>>>
>>> This is currently used/needed for fast-direct recycling of pages inside
>>> the driver for XDP_DROP and XDP_TX, _without_ performing any atomic
>>> refcnt operations on the page. E.g. see mlx4_en_rx_recycle().

Alex, does your pagecnt_bias trick resolve this? It seems to me that the
recycling is working in ixgbe patches just fine (at least I never see the
allocator being triggered with simple XDP programs). The biggest win for
me right now is to avoid the dma mapping operations.
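
For reference, a simplified model of the bias trick (not the actual
ixgbe code): pay one atomic refcount bump up front, hand out
references by decrementing a driver-local counter, and only touch the
atomic again when checking whether the page can be reused or when the
bias is nearly exhausted.

#include <stdatomic.h>
#include <stdbool.h>

#define BIAS_MAX	0xffff

struct rx_page {
	atomic_int refcount;	/* stands in for the struct page refcount */
	int pagecnt_bias;	/* driver-local, no atomics needed */
};

static void rx_page_init(struct rx_page *p)
{
	/* One atomic add covers many future uses of the page. */
	atomic_store(&p->refcount, 1);
	atomic_fetch_add(&p->refcount, BIAS_MAX - 1);
	p->pagecnt_bias = BIAS_MAX;
}

/* Hand part of the page to the stack (or XDP_TX): pure local math.
 * Whoever received it drops one atomic reference when done. */
static void rx_page_get(struct rx_page *p)
{
	p->pagecnt_bias--;
}

/* Reusable only if nobody but the driver still holds a reference. */
static bool rx_page_can_reuse(struct rx_page *p)
{
	if (atomic_load(&p->refcount) != p->pagecnt_bias)
		return false;
	if (p->pagecnt_bias == 1) {
		/* Re-arm with a single atomic instead of one per use. */
		atomic_fetch_add(&p->refcount, BIAS_MAX - 1);
		p->pagecnt_bias = BIAS_MAX;
	}
	return true;
}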

>>
>>
>> XDP_DROP does not require having one page per frame.
> 
> Agreed.
> 
>> (Look after my recent mlx4 patch series if you need to be convinced)
>>
>> Only XDP_TX is.

I'm still not sure what page per packet buys us on XDP_TX. What was the
explanation again?

>>
>> This requirement makes XDP useless (very OOM likely) on arches with 64K
>> pages.
> 
> Actually I have been having a side discussion with John about XDP_TX.
> Looking at the Mellanox way of doing it I am not entirely sure it is
> useful.  It looks good for benchmarks but that is about it.  Also I
> don't see it extending out to the point that we would be able to
> exchange packets between interfaces which really seems like it should
> be the ultimate goal for XDP_TX.

This is needed if we want XDP to be used for vswitch use cases. We have
a patch running on virtio but really need to get it working on real
hardware before we push it.

> 
> It seems like eventually we want to be able to peel off the buffer and
> send it to something other than ourselves.  For example it seems like
> it might be useful at some point to use XDP to do traffic
> classification and have it route packets between multiple interfaces
> on a host and it wouldn't make sense to have all of them map every
> page as bidirectional because it starts becoming ridiculous if you
> have dozens of interfaces in a system.
> 
> As per our original discussion at netconf if we want to be able to do
> XDP Tx with a fully lockless Tx ring we needed to have a Tx ring per
> CPU that is performing XDP.  The Tx path will end up needing to do the
> map/unmap itself in the case of physical devices but the expense of
> that can be somewhat mitigated on x86 at least by either disabling the
> IOMMU or using identity mapping.  I think this might be the route
> worth exploring as we could then start looking at doing things like
> implementing bridges and routers in XDP and see what performance gains
> can be had there.

One issue I have with TX ring per CPU per device is in my current use
case I have 2k tap/vhost devices and need to scale up to more than that.
Taking the naive approach and making each tap/vhost create a per cpu
ring would be 128k rings on my current dev box. I think locking could
be optional without too much difficulty.
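
A purely illustrative sketch of optional locking (a pthread mutex
standing in for a spinlock): only rings that are actually shared
between CPUs pay for the lock, so a true per-CPU ring keeps a
lock-free fast path.

#include <pthread.h>
#include <stdbool.h>

struct xdp_tx_ring {
	bool shared;		/* true when several CPUs map to this ring */
	pthread_mutex_t lock;
	unsigned int tail;
};

static void xdp_tx_ring_init(struct xdp_tx_ring *r, bool shared)
{
	r->shared = shared;
	r->tail = 0;
	pthread_mutex_init(&r->lock, NULL);
}

static void xdp_tx_enqueue(struct xdp_tx_ring *r, const void *pkt)
{
	if (r->shared)
		pthread_mutex_lock(&r->lock);

	/* post a descriptor for pkt and bump the tail ... */
	(void)pkt;
	r->tail++;

	if (r->shared)
		pthread_mutex_unlock(&r->lock);
}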

> 
> Also as far as the one page per frame it occurs to me that you will
> have to eventually deal with things like frame replication.  Once that
> comes into play everything becomes much more difficult because the
> recycling doesn't work without some sort of reference counting, and
> since the device interrupt can migrate you could end up with clean-up
> occurring on a different CPUs so you need to have some sort of
> synchronization mechanism.
> 
> Thanks.
> 
> - Alex
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
  2017-02-18 17:41     ` Eric Dumazet
@ 2017-02-18 18:18       ` Alexander Duyck
  2017-02-18 23:28         ` John Fastabend
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Duyck @ 2017-02-18 18:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, John Fastabend, Netdev, Tom Herbert,
	Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Sat, Feb 18, 2017 at 9:41 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2017-02-18 at 17:34 +0100, Jesper Dangaard Brouer wrote:
>> On Thu, 16 Feb 2017 14:36:41 -0800
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>
>> > On 17-02-16 12:41 PM, Alexander Duyck wrote:
>> > > So I'm in the process of working on enabling XDP for the Intel NICs
>> > > and I had a few questions so I just thought I would put them out here
>> > > to try and get everything sorted before I paint myself into a corner.
>> > >
>> > > So my first question is why does the documentation mention 1 frame per
>> > > page for XDP?
>>
>> Yes, XDP defines upfront a memory model where there is only one packet
>> per page[1], please respect that!
>>
>> This is currently used/needed for fast-direct recycling of pages inside
>> the driver for XDP_DROP and XDP_TX, _without_ performing any atomic
>> refcnt operations on the page. E.g. see mlx4_en_rx_recycle().
>
>
> XDP_DROP does not require having one page per frame.

Agreed.

> (Look after my recent mlx4 patch series if you need to be convinced)
>
> Only XDP_TX is.
>
> This requirement makes XDP useless (very OOM likely) on arches with 64K
> pages.

Actually I have been having a side discussion with John about XDP_TX.
Looking at the Mellanox way of doing it I am not entirely sure it is
useful.  It looks good for benchmarks but that is about it.  Also I
don't see it extending out to the point that we would be able to
exchange packets between interfaces which really seems like it should
be the ultimate goal for XDP_TX.

It seems like eventually we want to be able to peel off the buffer and
send it to something other than ourselves.  For example it seems like
it might be useful at some point to use XDP to do traffic
classification and have it route packets between multiple interfaces
on a host and it wouldn't make sense to have all of them map every
page as bidirectional because it starts becoming ridiculous if you
have dozens of interfaces in a system.

As per our original discussion at netconf if we want to be able to do
XDP Tx with a fully lockless Tx ring we needed to have a Tx ring per
CPU that is performing XDP.  The Tx path will end up needing to do the
map/unmap itself in the case of physical devices but the expense of
that can be somewhat mitigated on x86 at least by either disabling the
IOMMU or using identity mapping.  I think this might be the route
worth exploring as we could then start looking at doing things like
implementing bridges and routers in XDP and see what performance gains
can be had there.

Also as far as the one page per frame it occurs to me that you will
have to eventually deal with things like frame replication.  Once that
comes into play everything becomes much more difficult because the
recycling doesn't work without some sort of reference counting, and
since the device interrupt can migrate you could end up with clean-up
occurring on a different CPUs so you need to have some sort of
synchronization mechanism.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
  2017-02-18 16:34   ` Jesper Dangaard Brouer
@ 2017-02-18 17:41     ` Eric Dumazet
  2017-02-18 18:18       ` Alexander Duyck
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2017-02-18 17:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Alexander Duyck, Netdev, Tom Herbert,
	Alexei Starovoitov, John Fastabend, Daniel Borkmann,
	David Miller

On Sat, 2017-02-18 at 17:34 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 16 Feb 2017 14:36:41 -0800
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > On 17-02-16 12:41 PM, Alexander Duyck wrote:
> > > So I'm in the process of working on enabling XDP for the Intel NICs
> > > and I had a few questions so I just thought I would put them out here
> > > to try and get everything sorted before I paint myself into a corner.
> > >   
> > > So my first question is why does the documentation mention 1 frame per
> > > page for XDP?  
> 
> Yes, XDP defines upfront a memory model where there is only one packet
> per page[1], please respect that!
> 
> This is currently used/needed for fast-direct recycling of pages inside
> the driver for XDP_DROP and XDP_TX, _without_ performing any atomic
> refcnt operations on the page. E.g. see mlx4_en_rx_recycle().


XDP_DROP does not require having one page per frame.

(Look after my recent mlx4 patch series if you need to be convinced)

Only XDP_TX is.

This requirement makes XDP useless (very OOM likely) on arches with 64K
pages.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
  2017-02-16 22:36 ` John Fastabend
@ 2017-02-18 16:34   ` Jesper Dangaard Brouer
  2017-02-18 17:41     ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-18 16:34 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Duyck, Netdev, Tom Herbert, Alexei Starovoitov,
	John Fastabend, Daniel Borkmann, brouer, David Miller

On Thu, 16 Feb 2017 14:36:41 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 17-02-16 12:41 PM, Alexander Duyck wrote:
> > So I'm in the process of working on enabling XDP for the Intel NICs
> > and I had a few questions so I just thought I would put them out here
> > to try and get everything sorted before I paint myself into a corner.
> >   
> > So my first question is why does the documentation mention 1 frame per
> > page for XDP?  

Yes, XDP defines upfront a memory model where there is only one packet
per page[1], please respect that!

This is currently used/needed for fast-direct recycling of pages inside
the driver for XDP_DROP and XDP_TX, _without_ performing any atomic
refcnt operations on the page. E.g. see mlx4_en_rx_recycle().
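
Roughly, the recycle path this enables looks like the following
(loosely modeled on the idea, not on mlx4_en_rx_recycle() itself):
with one packet per page, a page coming back from XDP_DROP or a
completed XDP_TX has no other users, so it can go straight into a
small per-ring cache without any atomic refcount work.

#include <stdbool.h>
#include <stddef.h>

#define CACHE_SIZE	256

struct page_cache {
	void *pages[CACHE_SIZE];
	unsigned int count;
};

/* Called from the XDP_DROP / XDP_TX-completion paths: the driver is
 * the sole owner of the page, so no refcount needs to be touched. */
static bool cache_put(struct page_cache *c, void *page)
{
	if (c->count == CACHE_SIZE)
		return false;	/* fall back to the normal page free path */
	c->pages[c->count++] = page;
	return true;
}

/* RX refill tries the cache first; NULL means allocate a fresh page. */
static void *cache_get(struct page_cache *c)
{
	return c->count ? c->pages[--c->count] : NULL;
}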

This is also about controlling the cache-coherency state of the
struct-page cache-line.  (With two (or more) packets per page,
the struct-page cache-line will be jumping around.) Controlling this is
essential when packets are transferred between CPUs. We need an
architecture where we can control this, please.

[1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#page-per-packet

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Questions on XDP
  2017-02-16 20:41 Alexander Duyck
@ 2017-02-16 22:36 ` John Fastabend
  2017-02-18 16:34   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2017-02-16 22:36 UTC (permalink / raw)
  To: Alexander Duyck, Netdev
  Cc: Tom Herbert, Alexei Starovoitov, John Fastabend,
	Jesper Dangaard Brouer, Daniel Borkmann

On 17-02-16 12:41 PM, Alexander Duyck wrote:
> So I'm in the process of working on enabling XDP for the Intel NICs
> and I had a few questions so I just thought I would put them out here
> to try and get everything sorted before I paint myself into a corner.
> 

Added Daniel.

> So my first question is why does the documentation mention 1 frame per
> page for XDP?  Is this with the intention at some point to try and
> support page flipping into user space, or is it supposed to have been
> for the use with an API such as the AF_PACKET mmap stuff?  If I am not
> mistaken the page flipping has been tried in the past and failed, and
> as far as the AF_PACKET stuff my understanding is that the pages had
> to be mapped beforehand so it doesn't gain us anything without a
> hardware offload to a pre-mapped queue.

+1 here. The implementation for virtio does not use page per packet and
works fine. And agreed AF_PACKET does not require it.

If anyone has page-flipping code I would be happy to benchmark it.

> 
> Second I was wondering about supporting jumbo frames and scatter
> gather.  Specifically if I let XDP handle the first 2-3K of a frame,
> and then processed the remaining portion of the frame following the
> directive set forth based on the first frame would that be good enough
> to satisfy XDP or do I actually have to support 1 linear buffer
> always.

For now yes. But, I need a solution to support 64k TSO packets or else
VM to VM traffic is severely degraded in my vswitch use case.

> 
> Finally I was looking at xdp_adjust_head.  From what I can tell all
> that is technically required to support it is allowing the head to be
> adjusted either in or out.  I'm assuming there is some amount of
> padding that is preferred.  With the setup I have currently I am
> guaranteeing at least NET_SKB_PAD + NET_IP_ALIGN, however I have found
> that there should be enough room for 192 bytes on an x86 system if I
> am using a 2K buffer.  I'm just wondering if that is enough padding or
> if we need more for XDP.
> 

Not surprisingly I'm also in agreement here; it would help the ixgbe
implementation out.

> Anyway sorry for the stupid questions but I haven't been paying close
> attention to this and was mostly focused on the DMA bits needed to
> support this so now I am playing catch-up.

None of the above are stupid IMO. Let me send out the ixgbe implementation
later this afternoon so you can have a look at my interpretation of the
rules.

> 
> - Alex
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Questions on XDP
@ 2017-02-16 20:41 Alexander Duyck
  2017-02-16 22:36 ` John Fastabend
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Duyck @ 2017-02-16 20:41 UTC (permalink / raw)
  To: Netdev
  Cc: Tom Herbert, Alexei Starovoitov, John Fastabend, Jesper Dangaard Brouer

So I'm in the process of working on enabling XDP for the Intel NICs
and I had a few questions so I just thought I would put them out here
to try and get everything sorted before I paint myself into a corner.

So my first question is why does the documentation mention 1 frame per
page for XDP?  Is this with the intention at some point to try and
support page flipping into user space, or is it supposed to have been
for the use with an API such as the AF_PACKET mmap stuff?  If I am not
mistaken the page flipping has been tried in the past and failed, and
as far as the AF_PACKET stuff my understanding is that the pages had
to be mapped beforehand so it doesn't gain us anything without a
hardware offload to a pre-mapped queue.

Second I was wondering about supporting jumbo frames and scatter
gather.  Specifically if I let XDP handle the first 2-3K of a frame,
and then processed the remaining portion of the frame following the
directive set forth based on the first frame would that be good enough
to satisfy XDP or do I actually have to support 1 linear buffer
always.

Finally I was looking at xdp_adjust_head.  From what I can tell all
that is technically required to support it is allowing the head to be
adjusted either in or out.  I'm assuming there is some amount of
padding that is preferred.  With the setup I have currently I am
guaranteeing at least NET_SKB_PAD + NET_IP_ALIGN, however I have found
that there should be enough room for 192 bytes on an x86 system if I
am using a 2K buffer.  I'm just wondering if that is enough padding or
if we need more for XDP.
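
For context, that headroom is what gets consumed by programs that push
headers via bpf_xdp_adjust_head(); a rough sketch of the pattern
(section/program names are arbitrary, includes follow the samples/bpf
convention, and the outer-header construction is elided):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include "bpf_helpers.h"

SEC("xdp_push_encap")
int xdp_push_encap(struct xdp_md *ctx)
{
	/* Grow the packet at the front by one outer IPv4 header. */
	if (bpf_xdp_adjust_head(ctx, 0 - (int)sizeof(struct iphdr)))
		return XDP_DROP;

	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	/* Bounds must be re-validated after every adjust before the new
	 * Ethernet + outer IP headers are written into the headroom. */
	if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) > data_end)
		return XDP_DROP;

	/* ... build the outer headers here ... */
	return XDP_TX;
}

char _license[] SEC("license") = "GPL";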

Anyway sorry for the stupid questions but I haven't been paying close
attention to this and was mostly focused on the DMA bits needed to
support this so now I am playing catch-up.

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-02-22 22:00 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-18 23:31 Questions on XDP Alexei Starovoitov
2017-02-18 23:48 ` John Fastabend
2017-02-18 23:59   ` Eric Dumazet
2017-02-19  2:16   ` Alexander Duyck
2017-02-19  3:48     ` John Fastabend
2017-02-20 20:06       ` Jakub Kicinski
2017-02-22  5:02         ` John Fastabend
2017-02-21  3:18     ` Alexei Starovoitov
2017-02-21  3:39       ` John Fastabend
2017-02-21  4:00         ` Alexander Duyck
2017-02-21  7:55           ` Alexei Starovoitov
2017-02-21 17:44             ` Alexander Duyck
2017-02-22 17:08               ` John Fastabend
2017-02-22 21:59                 ` Jesper Dangaard Brouer
  -- strict thread matches above, loose matches on Subject: below --
2017-02-18 23:59 Alexei Starovoitov
2017-02-16 20:41 Alexander Duyck
2017-02-16 22:36 ` John Fastabend
2017-02-18 16:34   ` Jesper Dangaard Brouer
2017-02-18 17:41     ` Eric Dumazet
2017-02-18 18:18       ` Alexander Duyck
2017-02-18 23:28         ` John Fastabend

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.