From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Brenden Blanco <bblanco@plumgrid.com>,
	davem@davemloft.net, netdev@vger.kernel.org, tom@herbertland.com,
	ogerlitz@mellanox.com, daniel@iogearbox.net,
	eric.dumazet@gmail.com, ecree@solarflare.com,
	john.fastabend@gmail.com, tgraf@suug.ch,
	johannes@sipsolutions.net, eranlinuxmellanox@gmail.com,
	lorenzo@google.com, linux-mm <linux-mm@kvack.org>
Subject: Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
Date: Fri, 8 Apr 2016 10:26:53 -0700
Message-ID: <20160408172651.GA38264@ast-mbp.thefacebook.com>
In-Reply-To: <20160408143340.10e5b1d0@redhat.com>

On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > > +/* user return codes for PHYS_DEV prog type */
> > > +enum bpf_phys_dev_action {
> > > +	BPF_PHYS_DEV_DROP,
> > > +	BPF_PHYS_DEV_OK,
> > > +};  
> > 
> > I can imagine these extra return codes:
> > 
> >  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
> >  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
> >  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
> > 
> > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> > which we can look at when we get that far...
> 
> I want to point out something which is quite FUNDAMENTAL, for
> understanding these return codes (and network stack).
> 
> 
> At driver RX time, the network stack basically has two ways of
> building an SKB, which is sent up the stack.
> 
> Option-A (fastest): The packet page is writable. The SKB can be
> allocated and skb->data/head can point directly to the page.  And
> we place/write skb_shared_info in the end/tail-room. (This is done by
> calling build_skb()).
> 
> Option-B (slower): The packet page is read-only.  The SKB cannot point
> skb->data/head directly to the page, because skb_shared_info needs to be
> written into skb->end (slightly hidden via skb_shinfo() casting).  To
> get around this, a separate piece of memory is allocated (sped up by
> __alloc_page_frag) for skb->data/head to point to, so skb_shared_info can
> be written. (This is done when calling netdev/napi_alloc_skb()).
>   Drivers then need to copy over packet headers, and assign + adjust
> skb_shinfo(skb)->frags[0] offset to skip copied headers.
> 
> 
> Unfortunately most drivers use option-B, due to the cost of calling the
> page allocator.  It is only slightly more expensive to get a larger
> compound page from the page allocator, which then can be partitioned into
> page-fragments, thus amortizing the page alloc cost.  Unfortunately the
> cost is added later, when constructing the SKB.
>  Another reason for option-B is that archs with expensive IOMMU
> requirements (like PowerPC) don't need to dma_unmap on every packet,
> but only on the compound page level.
> 
> Side-note: Most drivers have a "copy-break" optimization, especially
> for option-B, where header data is copied anyhow. For small packets, one
> might as well free (or recycle) the RX page, if the header size fits into
> the newly allocated memory (for skb_shared_info).

I think you guys are going into overdesign territory, so
. nack on read-only pages
. nack on copy-break approach
. nack on per-ring programs
. nack on modified/stolen/shared return codes

The whole thing must be dead simple to use. The above is not simple by any means.
Programs must see only writable pages, and the only return codes are:
drop, pass to stack, redirect to xmit.
If a program wishes to modify a packet before passing it to the stack, it
shouldn't need to deal with different return values.
No special handling for small or large packets. No header splits.
The program must not be aware of any such things.
Drivers can use DMA_BIDIRECTIONAL to allow the received page to be
modified by the program and immediately sent to xmit.
No dma map/unmap/sync per packet. If some odd architectures/dma setups
cannot do it, then XDP will not be applicable there.
We are not going to sacrifice performance for generality.
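
To make the three-return-code model concrete, here is a rough userspace
sketch of what a program could look like under it. All names
(sketch_action, sketch_pkt, sketch_prog) and both policies are made up for
illustration and are not from the patch set; the only assumption taken
from the text above is that the program sees one flat writable buffer and
returns drop, pass, or xmit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names: only "drop, pass to stack, redirect to xmit"
 * comes from the discussion; the identifiers are illustrative. */
enum sketch_action {
	SKETCH_DROP,	/* driver frees or recycles the page */
	SKETCH_PASS,	/* driver hands the packet to the stack */
	SKETCH_TX,	/* driver sends the same page back out */
};

/* The program sees one flat, writable buffer: no frags, no header split. */
struct sketch_pkt {
	uint8_t *data;
	uint8_t *data_end;
};

#define SKETCH_ETH_ALEN	6
#define SKETCH_ETH_HLEN	14

static enum sketch_action sketch_prog(struct sketch_pkt *pkt)
{
	uint8_t tmp[SKETCH_ETH_ALEN];
	uint16_t proto;

	assert(pkt->data <= pkt->data_end);

	/* Too short to hold an Ethernet header: drop. */
	if (pkt->data_end - pkt->data < SKETCH_ETH_HLEN)
		return SKETCH_DROP;

	/* EtherType is stored big-endian at offset 12. */
	proto = (uint16_t)((pkt->data[12] << 8) | pkt->data[13]);

	/* Made-up policy: reflect frames using the IEEE local
	 * experimental EtherType by swapping MACs in place. */
	if (proto == 0x88b5) {
		memcpy(tmp, pkt->data, SKETCH_ETH_ALEN);
		memcpy(pkt->data, pkt->data + SKETCH_ETH_ALEN, SKETCH_ETH_ALEN);
		memcpy(pkt->data + SKETCH_ETH_ALEN, tmp, SKETCH_ETH_ALEN);
		return SKETCH_TX;
	}

	/* Made-up policy: rewrite one byte of the destination MAC.
	 * The packet was modified, yet the return code is plain PASS. */
	pkt->data[5] = 0x7f;
	return SKETCH_PASS;
}
```

Note that the modifying path returns the same code as an untouched pass
would, so the driver needs no "modified" case.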

