From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jesper Dangaard Brouer
Subject: Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
Date: Tue, 1 May 2018 10:14:20 +0200
Message-ID: <20180501101420.544ff225@redhat.com>
References: <20180424143923.26519-1-toshiaki.makita1@gmail.com>
	<20180424143923.26519-7-toshiaki.makita1@gmail.com>
	<20180425222452.1ea5c69f@redhat.com>
	<20180430192709.1fc47265@redhat.com>
	<6ed7c0c5-d6cb-7870-38da-4bb49707a63c@lab.ntt.co.jp>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Toshiaki Makita, netdev@vger.kernel.org, Tariq Toukan, brouer@redhat.com,
	Daniel Borkmann, Alexei Starovoitov, Eran Ben Elisha
To: Toshiaki Makita
Return-path:
Received: from mx3-rdu2.redhat.com ([66.187.233.73]:47544 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1751081AbeEAIOY (ORCPT ); Tue, 1 May 2018 04:14:24 -0400
In-Reply-To: <6ed7c0c5-d6cb-7870-38da-4bb49707a63c@lab.ntt.co.jp>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita wrote:

> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
> > On Thu, 26 Apr 2018 19:52:40 +0900
> > Toshiaki Makita wrote:
> > 
> >> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
> >>> On Tue, 24 Apr 2018 23:39:20 +0900
> >>> Toshiaki Makita wrote:
> >>> 
> >>>> +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
> >>>> +{
> >>>> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> >>>> +	int headroom = frame->data - (void *)frame;
> >>>> +	struct net_device *rcv;
> >>>> +	int err = 0;
> >>>> +
> >>>> +	rcv = rcu_dereference(priv->peer);
> >>>> +	if (unlikely(!rcv))
> >>>> +		return -ENXIO;
> >>>> +
> >>>> +	rcv_priv = netdev_priv(rcv);
> >>>> +	/* xdp_ring is initialized on receive side? */
> >>>> +	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
> >>>> +		err = xdp_ok_fwd_dev(rcv, frame->len);
> >>>> +		if (unlikely(err))
> >>>> +			return err;
> >>>> +
> >>>> +		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
> >>>> +	} else {
> >>>> +		struct sk_buff *skb;
> >>>> +
> >>>> +		skb = veth_build_skb(frame, headroom, frame->len, 0);
> >>>> +		if (unlikely(!skb))
> >>>> +			return -ENOMEM;
> >>>> +
> >>>> +		/* Get page ref in case skb is dropped in netif_rx.
> >>>> +		 * The caller is responsible for freeing the page on error.
> >>>> +		 */
> >>>> +		get_page(virt_to_page(frame->data));
> >>> 
> >>> I'm not sure you can make this assumption, that xdp_frames coming from
> >>> another device driver uses a refcnt based memory model. But maybe I'm
> >>> confused, as this looks like an SKB receive path, but in the
> >>> ndo_xdp_xmit().
> >> 
> >> I find this path similar to cpumap, which creates skb from redirected
> >> xdp frame. Once it is converted to skb, skb head is freed by
> >> page_frag_free, so anyway I needed to get the refcount here regardless
> >> of memory model.
> > 
> > Yes I know, I wrote cpumap ;-)
> > 
> > First of all, I don't want to see such xdp_frame to SKB conversion code
> > in every driver. Because that increase the chances of errors. And
> > when looking at the details, then it seems that you have made the
> > mistake of making it possible to leak xdp_frame info to the SKB (which
> > cpumap takes into account).
> 
> Do you mean leaving xdp_frame in skb->head is leaking something? how?

Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom")
and commit 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data
on page reuse").
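The struct xdp_frame is stored at the top of the frame's headroom, which
is why that area is both reserved from bpf_xdp_adjust_head (first commit)
and cleared before the buffer is handed to the SKB path (second commit).
A rough sketch of the conversion side, not the actual cpumap code (the
helper name and the frame_size calculation are illustrative only):

#include <linux/skbuff.h>
#include <linux/etherdevice.h>
#include <linux/string.h>
#include <net/xdp.h>

static struct sk_buff *xdpf_to_skb_sketch(struct xdp_frame *xdpf,
					  struct net_device *dev_rx)
{
	unsigned int hard_start_headroom = sizeof(*xdpf) + xdpf->headroom;
	void *hard_start = (void *)xdpf; /* xdp_frame lives at data_hard_start */
	unsigned int frame_size;
	struct sk_buff *skb;

	/* build_skb() places skb_shared_info after the data area */
	frame_size = SKB_DATA_ALIGN(hard_start_headroom + xdpf->len) +
		     SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

	skb = build_skb(hard_start, frame_size);
	if (!skb)
		return NULL;

	skb_reserve(skb, hard_start_headroom);
	__skb_put(skb, xdpf->len);
	skb->protocol = eth_type_trans(skb, dev_rx);

	/* Clear the area that held the xdp_frame, so a later
	 * bpf_skb_change_head()/bpf_xdp_adjust_head() cannot read stale
	 * xdp_frame contents (cf. commit 6dfb970d3dbd).  XDP metadata
	 * handling is omitted in this sketch.
	 */
	memset(xdpf, 0, sizeof(*xdpf));

	return skb;
}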
But this time, the concern is a bpf_prog attached at TC/bpf_cls level,
that can adjust head via bpf_skb_change_head (for XDP it is
bpf_xdp_adjust_head) into the area used by xdp_frame. As described in
https://git.kernel.org/davem/net-next/c/6dfb970d3dbd it is not super
critical at the moment, as this _currently_ runs as CAP_SYS_ADMIN, but
we would like to move towards CAP_NET_ADMIN.

> > Second, I think the refcnt scheme here is wrong. The xdp_frame should
> > be "owned" by XDP and have the proper refcnt to deliver it directly to
> > the network stack.
> > 
> > Third, if we choose that we want a fallback, in-case XDP is not enabled
> > on egress dev (but it have an ndo_xdp_xmit), then it should be placed
> > in the generic/core code. E.g. __bpf_tx_xdp_map() could look at the
> > return code from dev->netdev_ops->ndo_xdp() and create an SKB. (Hint,
> > this would make it easy to implement TX bulking towards the dev).
> 
> Right, this is a much cleaner way.
> 
> Although I feel like we should add this fallback for veth because it
> requires something which is different from other drivers (enabling XDP
> on the peer device of the egress device),

(This is why I Cc'ed Tariq...)

This is actually a general problem with the xdp "xmit" side (and not
specific to the veth driver); it exists for other drivers as well. The
problem is that a driver can implement ndo_xdp_xmit(), but might not
have allocated the necessary XDP TX-queue resources yet (or it might
not be possible due to system resource limits). The current "hack" is
to load an XDP prog on the egress device, and then assume that this
causes the driver to also allocate the XDP ndo_xdp_xmit side HW
resources. This is IMHO a wrong assumption.

We need a more generic way to test if a net_device is "ready/enabled"
for handling xdp_frames via ndo_xdp_xmit. And Tariq had some ideas on
how to implement this...

My opinion is that it is a waste of (HW/mem) resources to always
allocate resources for ndo_xdp_xmit when loading an XDP program. What
if my use-cases are an XDP_DROP DDoS filter, or a CPUMAP redirect
load-balancer? Then I don't want/need ndo_xdp_xmit. E.g. today, using
ixgbe on machines with more than 96 CPUs will fail due to limited TX
queue resources, thus blocking the mentioned use-cases.

> I'll drop the part for now. It should not be resolved in the driver
> code.

Thank you.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
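To illustrate the "ready/enabled" test mentioned above, a purely
hypothetical sketch: neither IFF_XDP_XMIT_READY nor xdp_xmit_ready()
exist in the kernel, the names are made up here to show the idea of an
explicit "this device has its ndo_xdp_xmit resources allocated" check,
instead of using "an XDP prog is loaded on the egress device" as a
proxy for that.

#include <linux/netdevice.h>

#define IFF_XDP_XMIT_READY	(1U << 30)	/* assumed/illustrative bit, not a real IFF_* flag */

static bool xdp_xmit_ready(const struct net_device *dev)
{
	return dev->netdev_ops->ndo_xdp_xmit &&
	       (dev->priv_flags & IFF_XDP_XMIT_READY);
}

/* A redirect path could then do something like:
 *
 *	if (!xdp_xmit_ready(fwd))
 *		return -ENXIO;	// or fall back to an SKB built in core code
 *	err = fwd->netdev_ops->ndo_xdp_xmit(fwd, xdpf);
 */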