From mboxrd@z Thu Jan  1 00:00:00 1970
From: Toshiaki Makita <toshiaki.makita1@gmail.com>
Subject: Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
Date: Wed, 2 May 2018 12:33:47 +0900
Message-ID: <874a1c42-bbdf-0ab4-b96e-3b07e6fb4aef@gmail.com>
References: <20180424143923.26519-1-toshiaki.makita1@gmail.com>
 <20180424143923.26519-7-toshiaki.makita1@gmail.com>
 <20180425222452.1ea5c69f@redhat.com>
 <da8cb809-ab79-1373-f69b-9eed323b99f6@lab.ntt.co.jp>
 <20180430192709.1fc47265@redhat.com>
 <6ed7c0c5-d6cb-7870-38da-4bb49707a63c@lab.ntt.co.jp>
 <20180501101420.544ff225@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>,
        netdev@vger.kernel.org, Tariq Toukan <tariqt@mellanox.com>,
        Daniel Borkmann <borkmann@iogearbox.net>,
        Alexei Starovoitov <alexei.starovoitov@gmail.com>,
        Eran Ben Elisha <eranbe@mellanox.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-pf0-f195.google.com ([209.85.192.195]:32805 "EHLO
        mail-pf0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750785AbeEBDdy (ORCPT
        <rfc822;netdev@vger.kernel.org>); Tue, 1 May 2018 23:33:54 -0400
Received: by mail-pf0-f195.google.com with SMTP id f20so1786544pfn.0
        for <netdev@vger.kernel.org>; Tue, 01 May 2018 20:33:53 -0700 (PDT)
In-Reply-To: <20180501101420.544ff225@redhat.com>
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
>>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
>>> <makita.toshiaki@lab.ntt.co.jp> wrote:
>>> 
>>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
>>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
>>>>> <toshiaki.makita1@gmail.com> wrote:
>>>>> 
>>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct 
>>>>>> xdp_frame *frame) +{ +	struct veth_priv *rcv_priv, *priv = 
>>>>>> netdev_priv(dev); +	int headroom = frame->data - (void 
>>>>>> *)frame; +	struct net_device *rcv; +	int err = 0; + +	rcv
>>>>>> = rcu_dereference(priv->peer); +	if (unlikely(!rcv)) + 
>>>>>> return -ENXIO; + +	rcv_priv = netdev_priv(rcv); +	/* 
>>>>>> xdp_ring is initialized on receive side? */ +	if 
>>>>>> (rcu_access_pointer(rcv_priv->xdp_prog)) { +		err = 
>>>>>> xdp_ok_fwd_dev(rcv, frame->len); +		if (unlikely(err)) + 
>>>>>> return err; + +		err = veth_xdp_enqueue(rcv_priv, 
>>>>>> veth_xdp_to_ptr(frame)); +	} else { +		struct sk_buff *skb;
>>>>>> + +		skb = veth_build_skb(frame, headroom, frame->len, 0);
>>>>>> +		if (unlikely(!skb)) +			return -ENOMEM; + +		/* Get page
>>>>>> ref in case skb is dropped in netif_rx. + * The caller is
>>>>>> responsible for freeing the page on error. +		 */ +
>>>>>> get_page(virt_to_page(frame->data));
>>>>> 
>>>>> I'm not sure you can make this assumption, that xdp_frames 
>>>>> coming from another device driver uses a refcnt based memory 
>>>>> model. But maybe I'm confused, as this looks like an SKB 
>>>>> receive path, but in the ndo_xdp_xmit().
>>>> 
>>>> I find this path similar to cpumap, which creates skb from 
>>>> redirected xdp frame. Once it is converted to skb, skb head is 
>>>> freed by page_frag_free, so anyway I needed to get the
>>>> refcount here regardless of memory model.
>>> 
>>> Yes I know, I wrote cpumap ;-)
>>> 
>>> First of all, I don't want to see such xdp_frame to SKB 
>>> conversion code in every driver.  Because that increase the 
>>> chances of errors.  And when looking at the details, then it 
>>> seems that you have made the mistake of making it possible to 
>>> leak xdp_frame info to the SKB (which cpumap takes into 
>>> account).
>> 
>> Do you mean leaving xdp_frame in skb->head is leaking something? 
>> how?
> 
> Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> in frame data on page reuse").

Thanks for sharing the info.

> But this time, the concern is a bpf_prog attached at TC/bpf_cls 
> level, that can that can adjust head via bpf_skb_change_head (for
> XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As 
> described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in 
> is not super critical at the moment, as this _currently_ runs as 
> CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.

What I don't get is why special casing xdp_frame info. My assumption is
that the area above skb->mac_header is uninit kernel memory and should
not be readable by unprivileged users anyway. So I didn't clear the area
at this point.

>>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
>>> should be "owned" by XDP and have the proper refcnt to deliver
>>> it directly to the network stack.
>>> 
>>> Third, if we choose that we want a fallback, in-case XDP is not 
>>> enabled on egress dev (but it have an ndo_xdp_xmit), then it 
>>> should be placed in the generic/core code.  E.g. 
>>> __bpf_tx_xdp_map() could look at the return code from 
>>> dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint, this would 
>>> make it easy to implement TX bulking towards the dev).
>> 
>> Right, this is a much cleaner way.
>> 
>> Although I feel like we should add this fallback for veth because 
>> it requires something which is different from other drivers 
>> (enabling XDP on the peer device of the egress device),
> 
> (This is why I Cc'ed Tariq...)
> 
> This is actually a general problem with the xdp "xmit" side (and not
>  specific to veth driver). The problem exists for other drivers as 
> well.
> 
> The problem is that a driver can implement ndo_xdp_xmit(), but the 
> driver might not have allocated the necessary XDP TX-queue resources
>  yet (or it might not be possible due to system resource limits).
> 
> The current "hack" is to load an XDP prog on the egress device, and 
> then assume that this cause the driver to also allocate the XDP 
> ndo_xdo_xmit side HW resources.  This is IMHO a wrong assumption.
 >
> We need a more generic way to test if a net_device is "ready/enabled"
> for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> on how to implement this...

My assumption on REDIRECT requirement came from this.
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

I guess you are saying thing are changing, and having an XDP program 
attached on the egress device is no longer generally sufficient. Looking 
forward to Tariq's solution.

Toshiaki Makita

> 
> My opinion is that it is a waste of (HW/mem) resources to always 
> allocate resources for ndo_xdp_xmit when loading an XDP program. 
> Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP 
> redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g. 
> today using ixgbe on machines with more than 96 CPUs, will fail due 
> to limited TX queue resources. Thus, blocking the mentioned 
> use-cases.
> 
> 
>> I'll drop the part for now. It should not be resolved in the driver
>> code.
> 
> Thank you.
>