From: John Fastabend <john.fastabend@gmail.com>
To: Jakub Kicinski <kubakici@wp.pl>
Cc: Alexander Duyck <alexander.duyck@gmail.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Netdev <netdev@vger.kernel.org>,
	Tom Herbert <tom@herbertland.com>,
	Alexei Starovoitov <ast@kernel.org>,
	John Fastabend <john.r.fastabend@intel.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	David Miller <davem@davemloft.net>
Subject: Re: Questions on XDP
Date: Tue, 21 Feb 2017 21:02:29 -0800
Message-ID: <58AD1B65.6010901@gmail.com>
In-Reply-To: <20170220120625.524bc425@cakuba.lan>

On 17-02-20 12:06 PM, Jakub Kicinski wrote:
> On Sat, 18 Feb 2017 19:48:25 -0800, John Fastabend wrote:
>> On 17-02-18 06:16 PM, Alexander Duyck wrote:
>>> On Sat, Feb 18, 2017 at 3:48 PM, John Fastabend
>>> <john.fastabend@gmail.com> wrote:  
>>>> On 17-02-18 03:31 PM, Alexei Starovoitov wrote:  
>>>>> On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
>>>>> <alexander.duyck@gmail.com> wrote:  
>>>>>>  
>>>>>>> XDP_DROP does not require having one page per frame.  
>>>>>>
>>>>>> Agreed.  
>>>>>
>>>>> why do you think so?
>>>>> xdp_drop is targeting ddos, where in the good case
>>>>> all traffic is passed up and in the bad case
>>>>> most of the traffic is dropped, but good traffic still needs
>>>>> to be serviced by the layers after, like other xdp
>>>>> programs and the stack.
>>>>> Say ixgbe+xdp goes with 2k per packet:
>>>>> very soon we will have a bunch of half pages
>>>>> sitting in the stack and the other halves requiring
>>>>> complex refcnting, making the actual
>>>>> ddos mitigation ineffective and forcing the nic to drop packets
>>>>
>>>> I'm not seeing the distinction here. If it's a 4k page and it is
>>>> held in the stack, the driver will get overrun as well.
>>>>  
>>>>> because it runs out of buffers. Why complicate things?  
>>>>
>>>> It doesn't seem complex to me, and the driver already handles this
>>>> case, so it actually makes the drivers simpler because there is only
>>>> a single buffer management path.
>>>>  
>>>>> The packet-per-page approach is simple and effective.
>>>>> virtio is different: there we don't have hw that needs
>>>>> to have buffers ready for dma.
>>>>>  
>>>>>> Looking at the Mellanox way of doing it, I am not entirely sure it is
>>>>>> useful.  It looks good for benchmarks, but that is about it.  Also I
>>>>>
>>>>> it's the opposite. It already runs very nicely in production.
>>>>> In real life it's always a combination of xdp_drop, xdp_tx and
>>>>> xdp_pass actions.
>>>>> Sounds like ixgbe wants to do things differently because
>>>>> of not-invented-here. That new approach may turn
>>>>> out to be good or bad, but why risk it?
>>>>> The mlx4 approach works.
>>>>> mlx5 has a few issues though, because page recycling
>>>>> was done too simplistically. A generic page pool/recycling
>>>>> layer that all drivers will use should solve that, I hope.
>>>>> Is the proposal to have a generic split-page recycler?
>>>>> How is that going to work?
>>>>>  
>>>>
>>>> No, just give the driver a page when it asks for it. How the
>>>> driver uses the page is not the pool's concern.
>>>>  
>>>>>> don't see it extending out to the point that we would be able to
>>>>>> exchange packets between interfaces, which really seems like it should
>>>>>> be the ultimate goal for XDP_TX.
>>>>>
>>>>> we don't have a use case for multi-port xdp_tx,
>>>>> but I'm not objecting to doing it in general.
>>>>> Just right now I don't see a need to complicate
>>>>> drivers to do so.  
>>>>
>>>> We are running our vswitch in userspace now for many workloads;
>>>> it would be nice to have these in the kernel if possible.
>>>>  
>>>>>  
>>>>>> It seems like eventually we want to be able to peel off the buffer and
>>>>>> send it to something other than ourselves.  For example it seems like
>>>>>> it might be useful at some point to use XDP to do traffic
>>>>>> classification and have it route packets between multiple interfaces
>>>>>> on a host, and it wouldn't make sense to have all of them map every
>>>>>> page as bidirectional because it starts becoming ridiculous if you
>>>>>> have dozens of interfaces in a system.  
>>>>>
>>>>> dozen interfaces? Like a single nic with a dozen ports?
>>>>> or many nics with many ports on the same system?
>>>>> are you trying to build a switch out of x86?
>>>>> I don't think it's realistic to have a multi-terabit x86 box.
>>>>> Is it all because of dpdk/6wind demos?
>>>>> I saw how dpdk was bragging that they can saturate
>>>>> the pcie bus. So? Why is this useful?
>>>
>>> Actually I was thinking more of an OVS, bridge, or routing
>>> replacement.  Basically with a couple of physical interfaces and then
>>> either veth and/or vhost interfaces.
>>>   
>>
>> Yep, a valid use case for me. We would use this with Intel Clear Linux,
>> assuming we can sort it out and the perf metrics are good.
> 
> FWIW the limitation of having to remap buffers to TX to another netdev
> also does not apply to NICs which share the same PCI device among all ports
> (mlx4 and nfp, off the top of my head).  I wonder if it would be worthwhile
> to mentally separate high-performance NICs, of which there is a limited
> number, from swarms of slow "devices" like VF interfaces; perhaps we
> will want to choose different solutions for the two down the road.
> 
>> Here are the XDP extensions for redirect (they need to be rebased though):
>>
>>  https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee
>>
>> And here is a sample program,
>>
>>  https://github.com/jrfastab/linux/commit/19d0a5de3f6e934baa8df23d95e766bab7f026d0
>>
>> Probably the most relevant piece in the above patch is a new ndo op, as follows:
>>
>>  +	void			(*ndo_xdp_xmit)(struct net_device *dev,
>>  +						struct xdp_buff *xdp);
>>
>>
>> Then, support for redirect in xdp ebpf:
>>
>>  +BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
>>  +{
>>  +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>>  +
>>  +	if (unlikely(flags))
>>  +		return XDP_ABORTED;
>>  +
>>  +	ri->ifindex = ifindex;
>>  +	return XDP_REDIRECT;
>>  +}
>>  +
>>
>> And then a routine for drivers to use to push packets around when the
>> program returns the XDP_REDIRECT action:
>>
>> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
>> +{
>> +	if (dev->netdev_ops->ndo_xdp_xmit) {
>> +		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
>> +		return 0;
>> +	}
>> +	bpf_warn_invalid_xdp_redirect(dev->ifindex);
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
>> +{
>> +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>> +	u32 index = ri->ifindex;
>> +
>> +	dev = dev_get_by_index_rcu(dev_net(dev), index);
>> +	ri->ifindex = 0;
>> +	if (unlikely(!dev)) {
>> +		/* warn with the requested ifindex, not the cleared value */
>> +		bpf_warn_invalid_xdp_redirect(index);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return __bpf_tx_xdp(dev, xdp);
>> +}
>>
>>
>> Still thinking on it though, to see if I might have a better mechanism,
>> and I need benchmarks to show various metrics.
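
To make the driver side of the above concrete: the idea is that the Rx path,
after running the program, acts on the new XDP_REDIRECT return code by calling
xdp_do_redirect() under rcu_read_lock(), and the target device's ndo_xdp_xmit()
does the actual send. Roughly like the sketch below -- this is not part of the
patch, the function name is made up, and error handling is simplified:

#include <linux/filter.h>
#include <linux/netdevice.h>

/* Hypothetical driver Rx hook, illustration only. Returns true if the
 * driver consumed the frame (TX, redirect, or drop), false if it should
 * be passed up the stack.
 */
static bool sample_run_xdp(struct net_device *netdev, struct bpf_prog *prog,
			   struct xdp_buff *xdp)
{
	u32 act = bpf_prog_run_xdp(prog, xdp);

	switch (act) {
	case XDP_PASS:
		return false;
	case XDP_TX:
		/* driver's existing same-port transmit path goes here */
		return true;
	case XDP_REDIRECT:
		/* looks up ri->ifindex set by bpf_xdp_redirect() and calls
		 * the target's ndo_xdp_xmit(); failures just drop the frame
		 */
		xdp_do_redirect(netdev, xdp);
		return true;
	default:
		bpf_warn_invalid_xdp_action(act);
		/* fall through */
	case XDP_ABORTED:
	case XDP_DROP:
		return true;		/* recycle the buffer */
	}
}

On the program side this pairs with a bpf_xdp_redirect(ifindex, 0) call, whose
XDP_REDIRECT return value is what the driver sees here.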
> 
> Would it perhaps make sense to consider this work as a first step on the
> path towards a lightweight skb, rather than leaking XDP constructs outside
> of drivers?  If we forced all XDP drivers to produce build_skb-able
> buffers, we could define the new .ndo as accepting skbs which are not
> fully initialized but can be turned into real skbs if needed?
> 

I believe this is a good idea, but I need a few iterations on the existing
code base :) before I can try to realize something like this.
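
Roughly what I imagine the conversion looking like, purely as a sketch (none
of this is in the patches above and the helper name is made up): if drivers
reserve headroom in front of xdp->data and leave room for the skb_shared_info
after the frame, the core could build the real skb lazily:

#include <linux/skbuff.h>
#include <linux/filter.h>

/* Hypothetical: turn an xdp_buff backed by a build_skb-able buffer into a
 * real skb. frag_size is the size of the buffer the driver allocated.
 */
static struct sk_buff *xdp_buff_to_skb(struct xdp_buff *xdp,
				       unsigned int frag_size)
{
	unsigned int headroom = xdp->data - xdp->data_hard_start;
	unsigned int len = xdp->data_end - xdp->data;
	struct sk_buff *skb;

	skb = build_skb(xdp->data_hard_start, frag_size);
	if (unlikely(!skb))
		return NULL;

	skb_reserve(skb, headroom);	/* skip the driver/XDP headroom */
	skb_put(skb, len);		/* frame length after the program ran */
	return skb;
}

If every XDP driver laid its buffers out that way, the redirect ndo could take
these half-built skbs (or something convertible like the above) instead of raw
xdp_buffs.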

.John
