All of lore.kernel.org
 help / color / mirror / Atom feed
* XDP question: best API for returning/setting egress port?
@ 2017-04-18 19:58 Jesper Dangaard Brouer
  2017-04-18 20:54 ` John Fastabend
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-18 19:58 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov
  Cc: brouer, netdev, xdp-newbies, John Fastabend


As I argued in NetConf presentation[1] (from slide #9) we need a port
mapping table (instead of using ifindex'es).  Both for supporting
other "port" types than net_devices (think sockets), and for
sandboxing what XDP can bypass.

I want to create a new XDP action called XDP_REDIRECT, that instruct
XDP to send the xdp_buff to another "port" (get translated into a
net_device, or something else depending on internal port type).

Looking at the userspace/eBPF interface, I'm wondering what is the
best API for "returning" this port number from eBPF?

The options I see is:

1) Split-up the u32 action code, and e.g let the high-16-bit be the
   port number and lower-16bit the (existing) action verdict.

 Pros: Simple API
 Cons: Number of ports limited to 64K

2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
   eBPF to update xdp_md->port.

 Pros: Larger number of ports.
 Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
       (see xdp_convert_ctx_access)

3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.

 Pros: Hides impl details, and allows helper to give eBPF code feedback
       (on e.g. if port doesn't exist any longer)
 Cons: Helper function call likely slower?


(Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://people.netfilter.org/hawk/presentations/NetConf2017/xdp_work_ahead_NetConf_April_2017.pdf

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-18 19:58 XDP question: best API for returning/setting egress port? Jesper Dangaard Brouer
@ 2017-04-18 20:54 ` John Fastabend
  2017-04-19 12:00   ` Jesper Dangaard Brouer
  2017-04-19 12:25 ` Hannes Frederic Sowa
  2017-04-19 20:02 ` Andy Gospodarek
  2 siblings, 1 reply; 34+ messages in thread
From: John Fastabend @ 2017-04-18 20:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov,
	Alexei Starovoitov
  Cc: netdev, xdp-newbies

On 17-04-18 12:58 PM, Jesper Dangaard Brouer wrote:
> 
> As I argued in NetConf presentation[1] (from slide #9) we need a port
> mapping table (instead of using ifindex'es).  Both for supporting
> other "port" types than net_devices (think sockets), and for
> sandboxing what XDP can bypass.
> 
> I want to create a new XDP action called XDP_REDIRECT, that instruct
> XDP to send the xdp_buff to another "port" (get translated into a
> net_device, or something else depending on internal port type).
> 
> Looking at the userspace/eBPF interface, I'm wondering what is the
> best API for "returning" this port number from eBPF?
> 
> The options I see is:
> 
> 1) Split-up the u32 action code, and e.g let the high-16-bit be the
>    port number and lower-16bit the (existing) action verdict.
> 
>  Pros: Simple API
>  Cons: Number of ports limited to 64K
> 
> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
>    eBPF to update xdp_md->port.
> 
>  Pros: Larger number of ports.
>  Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
>        (see xdp_convert_ctx_access)
> 
> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.
> 
>  Pros: Hides impl details, and allows helper to give eBPF code feedback
>        (on e.g. if port doesn't exist any longer)
>  Cons: Helper function call likely slower?
> 
> 

How about doing this the same way redirect is done in the tc case? I have this
patch under test,

 https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee

that should give you some idea. It just needs a port mapping table in the
bpf_tx_xdp() call.


> (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-18 20:54 ` John Fastabend
@ 2017-04-19 12:00   ` Jesper Dangaard Brouer
  2017-04-19 12:33     ` Daniel Borkmann
  0 siblings, 1 reply; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-19 12:00 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev,
	xdp-newbies, brouer

On Tue, 18 Apr 2017 13:54:45 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 17-04-18 12:58 PM, Jesper Dangaard Brouer wrote:
> > 
> > As I argued in NetConf presentation[1] (from slide #9) we need a port
> > mapping table (instead of using ifindex'es).  Both for supporting
> > other "port" types than net_devices (think sockets), and for
> > sandboxing what XDP can bypass.
> > 
> > I want to create a new XDP action called XDP_REDIRECT, that instruct
> > XDP to send the xdp_buff to another "port" (get translated into a
> > net_device, or something else depending on internal port type).
> > 
> > Looking at the userspace/eBPF interface, I'm wondering what is the
> > best API for "returning" this port number from eBPF?
> > 
> > The options I see is:
> > 
> > 1) Split-up the u32 action code, and e.g let the high-16-bit be the
> >    port number and lower-16bit the (existing) action verdict.
> > 
> >  Pros: Simple API
> >  Cons: Number of ports limited to 64K
> > 
> > 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
> >    eBPF to update xdp_md->port.
> > 
> >  Pros: Larger number of ports.
> >  Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
> >        (see xdp_convert_ctx_access)
> > 
> > 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.
> > 
> >  Pros: Hides impl details, and allows helper to give eBPF code feedback
> >        (on e.g. if port doesn't exist any longer)
> >  Cons: Helper function call likely slower?
> > 
> >   
> 
> How about doing this the same way redirect is done in the tc case? I have this
> patch under test,
> 
>  https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee

I have been looking at this approach, which is close to option #3 above.

The problem with your implementation that you use a per-cpu store.
This creates the problem of storing state between packets. First packet
can call helper bpf_xdp_redirect() setting an ifindex, but program can
still return XDP_PASS.  Next packet can call XDP_REDIRECT and use the
ifindex set from the first packet.  IMHO this is a problematic API to
expose.

I do see that the TC interface that uses the same approach, via helper
bpf_redirect().  Maybe it have the same API problem?  Looking at
sch_handle_ingress() I don't see this is handled (e.g. by always
clearing this_cpu_ptr(redirect_info)->ifindex = 0).


> that should give you some idea. It just needs a port mapping table in the
> bpf_tx_xdp() call.

I'll take a closer look. I don't think we need the per-cpu-store
approach for XDP, as we might as well store the port info in xdp_buff,
or return it directly option #1.

(TC redirect need the per-cpu-store to avoid extending the SKB).


> > (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)

I would still like people to comment on the above options?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-18 19:58 XDP question: best API for returning/setting egress port? Jesper Dangaard Brouer
  2017-04-18 20:54 ` John Fastabend
@ 2017-04-19 12:25 ` Hannes Frederic Sowa
  2017-04-19 20:02 ` Andy Gospodarek
  2 siblings, 0 replies; 34+ messages in thread
From: Hannes Frederic Sowa @ 2017-04-19 12:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov,
	Alexei Starovoitov
  Cc: netdev, xdp-newbies, John Fastabend

Hi,

On 18.04.2017 21:58, Jesper Dangaard Brouer wrote:
> 
> As I argued in NetConf presentation[1] (from slide #9) we need a port
> mapping table (instead of using ifindex'es).  Both for supporting
> other "port" types than net_devices (think sockets), and for
> sandboxing what XDP can bypass.
> 
> I want to create a new XDP action called XDP_REDIRECT, that instruct
> XDP to send the xdp_buff to another "port" (get translated into a
> net_device, or something else depending on internal port type).
> 
> Looking at the userspace/eBPF interface, I'm wondering what is the
> best API for "returning" this port number from eBPF?
> 
> The options I see is:
> 
> 1) Split-up the u32 action code, and e.g let the high-16-bit be the
>    port number and lower-16bit the (existing) action verdict.
> 
>  Pros: Simple API
>  Cons: Number of ports limited to 64K
> 
> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
>    eBPF to update xdp_md->port.
> 
>  Pros: Larger number of ports.
>  Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
>        (see xdp_convert_ctx_access)
> 
> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.
> 
>  Pros: Hides impl details, and allows helper to give eBPF code feedback
>        (on e.g. if port doesn't exist any longer)
>  Cons: Helper function call likely slower?
> 
> 
> (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)

I am not sure how the socket interface should look like, it seems to be
too far away to me right now.

Regarding having stable ifindexes, I wonder if we could do something:

int ifindexes_in_use_by_ebpf_program[] __section("ifindex") = {
1,2,3,8,9,10 };

and we can make sure that the ifindexes automatically stay stable for
the lifetime while the ebpf program is loaded?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 12:00   ` Jesper Dangaard Brouer
@ 2017-04-19 12:33     ` Daniel Borkmann
  2017-04-19 15:24       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Borkmann @ 2017-04-19 12:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, John Fastabend
  Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev,
	xdp-newbies

On 04/19/2017 02:00 PM, Jesper Dangaard Brouer wrote:
> On Tue, 18 Apr 2017 13:54:45 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
>> On 17-04-18 12:58 PM, Jesper Dangaard Brouer wrote:
>>>
>>> As I argued in NetConf presentation[1] (from slide #9) we need a port
>>> mapping table (instead of using ifindex'es).  Both for supporting
>>> other "port" types than net_devices (think sockets), and for
>>> sandboxing what XDP can bypass.
>>>
>>> I want to create a new XDP action called XDP_REDIRECT, that instruct
>>> XDP to send the xdp_buff to another "port" (get translated into a
>>> net_device, or something else depending on internal port type).
>>>
>>> Looking at the userspace/eBPF interface, I'm wondering what is the
>>> best API for "returning" this port number from eBPF?
>>>
>>> The options I see is:
>>>
>>> 1) Split-up the u32 action code, and e.g let the high-16-bit be the
>>>     port number and lower-16bit the (existing) action verdict.
>>>
>>>   Pros: Simple API
>>>   Cons: Number of ports limited to 64K
>>>
>>> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
>>>     eBPF to update xdp_md->port.
>>>
>>>   Pros: Larger number of ports.
>>>   Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
>>>         (see xdp_convert_ctx_access)
>>>
>>> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.
>>>
>>>   Pros: Hides impl details, and allows helper to give eBPF code feedback
>>>         (on e.g. if port doesn't exist any longer)
>>>   Cons: Helper function call likely slower?
>>
>> How about doing this the same way redirect is done in the tc case? I have this
>> patch under test,
>>
>>   https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee
>
> I have been looking at this approach, which is close to option #3 above.
>
> The problem with your implementation that you use a per-cpu store.
> This creates the problem of storing state between packets. First packet
> can call helper bpf_xdp_redirect() setting an ifindex, but program can
> still return XDP_PASS.  Next packet can call XDP_REDIRECT and use the
> ifindex set from the first packet.  IMHO this is a problematic API to
> expose.
>
> I do see that the TC interface that uses the same approach, via helper
> bpf_redirect().  Maybe it have the same API problem?  Looking at
> sch_handle_ingress() I don't see this is handled (e.g. by always
> clearing this_cpu_ptr(redirect_info)->ifindex = 0).

It's cleared in {skb,xdp}_do_redirect() right after fetching the
ifindex. I think this approach is just fine. The example described
above is a misuse of the API by a buggy program calling bpf_xdp_redirect()
and returning XDP_PASS while another time it returns XDP_REDIRECT
without the bpf_xdp_redirect() helper, sounds very exotic, but it's
as buggy as, say, a program doing the csum update wrong, a program
writing the wrong data to the packet, doing adjust head on the wrong
header offset, jumping into the wrong tail call entry and other things.

I think encoding this into an action code is rather limiting, f.e.
where would we place a flags argument if needed in future? Would
that mean, we need a XDP_REDIRECT2 return code that also allows for
encoding flags?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 12:33     ` Daniel Borkmann
@ 2017-04-19 15:24       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-19 15:24 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: John Fastabend, Daniel Borkmann, Alexei Starovoitov,
	Alexei Starovoitov, netdev, xdp-newbies, brouer

On Wed, 19 Apr 2017 14:33:27 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 04/19/2017 02:00 PM, Jesper Dangaard Brouer wrote:
> > On Tue, 18 Apr 2017 13:54:45 -0700
> > John Fastabend <john.fastabend@gmail.com> wrote:  
> >> On 17-04-18 12:58 PM, Jesper Dangaard Brouer wrote:  
> >>>
> >>> As I argued in NetConf presentation[1] (from slide #9) we need a port
> >>> mapping table (instead of using ifindex'es).  Both for supporting
> >>> other "port" types than net_devices (think sockets), and for
> >>> sandboxing what XDP can bypass.
> >>>
> >>> I want to create a new XDP action called XDP_REDIRECT, that instruct
> >>> XDP to send the xdp_buff to another "port" (get translated into a
> >>> net_device, or something else depending on internal port type).
> >>>
> >>> Looking at the userspace/eBPF interface, I'm wondering what is the
> >>> best API for "returning" this port number from eBPF?
> >>>
> >>> The options I see is:
> >>>
> >>> 1) Split-up the u32 action code, and e.g let the high-16-bit be the
> >>>     port number and lower-16bit the (existing) action verdict.
> >>>
> >>>   Pros: Simple API
> >>>   Cons: Number of ports limited to 64K
> >>>
> >>> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
> >>>     eBPF to update xdp_md->port.
> >>>
> >>>   Pros: Larger number of ports.
> >>>   Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
> >>>         (see xdp_convert_ctx_access)
> >>>
> >>> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.
> >>>
> >>>   Pros: Hides impl details, and allows helper to give eBPF code feedback
> >>>         (on e.g. if port doesn't exist any longer)
> >>>   Cons: Helper function call likely slower?  
> >>
> >> How about doing this the same way redirect is done in the tc case? I have this
> >> patch under test,
> >>
> >>   https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee  
> >
> > I have been looking at this approach, which is close to option #3 above.
> >
> > The problem with your implementation that you use a per-cpu store.
> > This creates the problem of storing state between packets. First packet
> > can call helper bpf_xdp_redirect() setting an ifindex, but program can
> > still return XDP_PASS.  Next packet can call XDP_REDIRECT and use the
> > ifindex set from the first packet.  IMHO this is a problematic API to
> > expose.
> >
> > I do see that the TC interface that uses the same approach, via helper
> > bpf_redirect().  Maybe it have the same API problem?  Looking at
> > sch_handle_ingress() I don't see this is handled (e.g. by always
> > clearing this_cpu_ptr(redirect_info)->ifindex = 0).  
> 
> It's cleared in {skb,xdp}_do_redirect() right after fetching the
> ifindex. I think this approach is just fine. The example described
> above is a misuse of the API by a buggy program calling bpf_xdp_redirect()
> and returning XDP_PASS while another time it returns XDP_REDIRECT
> without the bpf_xdp_redirect() helper, sounds very exotic, but it's
> as buggy as, say, a program doing the csum update wrong, a program
> writing the wrong data to the packet, doing adjust head on the wrong
> header offset, jumping into the wrong tail call entry and other things.

For TC I guess it is fine to keep it as is, because it is needed to
avoid extending skb.  IHMO for XDP I see no reason to keep a
per-cpu-store (which besides will be slower), simply update
xdp_buff.port should be sufficient (which is only relevant for this
packet).

As noted in option#3, my concern is that calling a helper function call
will be slower, than simply returning the needed port info? 

Maybe some bpf experts can tell me if such helper call could be
optimized out with some bpf magic?

> I think encoding this into an action code is rather limiting, f.e.
> where would we place a flags argument if needed in future? Would
> that mean, we need a XDP_REDIRECT2 return code that also allows for
> encoding flags?

Nope, it will be extensible.

We can start with:

 struct xdp_ret {
     union {
         __u32 act;
         struct {
             __u16 action;
             __u16 port;
         };
 };

And later change it to:

 struct xdp_ret {
     union {
         __u32 act;
         struct {
             __u8  action;
             __u8  flags;
             __u16 port;
         };
 };

If actions does not go above 255.  I would prefer that we start with
the latter, else people would argue that we need to extend the
structure like:

 struct xdp_ret {
     union {
         __u32 act;
         struct {
             union {
                 __u16 action;
                 struct {
                     __u8 action2;
                     __u8 flags;
                 };
             };
             __u16 port;
         };
 };


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-18 19:58 XDP question: best API for returning/setting egress port? Jesper Dangaard Brouer
  2017-04-18 20:54 ` John Fastabend
  2017-04-19 12:25 ` Hannes Frederic Sowa
@ 2017-04-19 20:02 ` Andy Gospodarek
  2017-04-19 21:42   ` Daniel Borkmann
                     ` (2 more replies)
  2 siblings, 3 replies; 34+ messages in thread
From: Andy Gospodarek @ 2017-04-19 20:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev,
	xdp-newbies, John Fastabend

On Tue, Apr 18, 2017 at 09:58:56PM +0200, Jesper Dangaard Brouer wrote:
> 
> As I argued in NetConf presentation[1] (from slide #9) we need a port
> mapping table (instead of using ifindex'es).  Both for supporting
> other "port" types than net_devices (think sockets), and for
> sandboxing what XDP can bypass.
> 
> I want to create a new XDP action called XDP_REDIRECT, that instruct
> XDP to send the xdp_buff to another "port" (get translated into a
> net_device, or something else depending on internal port type).
> 
> Looking at the userspace/eBPF interface, I'm wondering what is the
> best API for "returning" this port number from eBPF?
> 
> The options I see is:
> 
> 1) Split-up the u32 action code, and e.g let the high-16-bit be the
>    port number and lower-16bit the (existing) action verdict.
> 
>  Pros: Simple API
>  Cons: Number of ports limited to 64K

Practically speaking this may be seem reserving 64k for the port number
might be enough space, but I would also like to see a new return option
for flags so I start to get concerned about space.  Daniel also
hightlights the fact that encoding the port in the action may does not
leave room for flags and could get confusing.

One unfortunate side-effect of dropping or transmitting frames with XDP
is that we lose the opportunity to statistically sample in netfilter
since the frames were dropped so early and I'd like to bring that back
with a call to parse flags and possibly call psample_sample_packet()
after the xdp action.  Packet sampling cannot simply be an action
since there are times when a frame should be dropped but should also be
sampled, so it seems logical to add this as a flag.

> 
> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
>    eBPF to update xdp_md->port.
> 
>  Pros: Larger number of ports.
>  Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
>        (see xdp_convert_ctx_access)

I think I would lean towards this based on the fact that I'd like to see
a flags field added to the u32 return (maybe the top 8 bits) as
mentioned above.

So if we follow down this path and add a 'dest' field to xdp_buff and
xdp_md like this:

struct xdp_buff {
        void *data;
        void *data_end;
        void *data_hard_start;
	__u32 dest; 
};

struct xdp_md {
        __u32 data;
        __u32 data_end;
	__u32 dest;
};

and then lookup this dest in a table we have the option to make that
dest an ifindex/socket/other.

I did also look at JohnF's patch and I do like the simplicity of the redirect
action and new ndo_xdp_xmit and how it moves towards a way to transmit the
frame.  The downside is that it presumes an ifindex, so it might not be ideal
we want the lookup to return something other than an ifindex.

Before aligning on a direction for the return values from exiting xdp
call, it seems like we should also think about the tx side and how that
would be handled.  If we are ultimately going to need a new netdev op to
handle the redirect then what may be the issue with not providing the
destination port the return code and the option proposed by JohnF looks
good to me with maybe a small tweak to not presume ifindex in some manner.

JohnF, any test results with this you can share?  Presumably you tested with
virtio-net, right?

> 
> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.
> 
>  Pros: Hides impl details, and allows helper to give eBPF code feedback
>        (on e.g. if port doesn't exist any longer)
>  Cons: Helper function call likely slower?
> 
> 
> (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> [1] http://people.netfilter.org/hawk/presentations/NetConf2017/xdp_work_ahead_NetConf_April_2017.pdf

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 20:02 ` Andy Gospodarek
@ 2017-04-19 21:42   ` Daniel Borkmann
  2017-04-20 17:12     ` Andy Gospodarek
  2017-04-19 22:51   ` Daniel Borkmann
  2017-04-20  4:43   ` John Fastabend
  2 siblings, 1 reply; 34+ messages in thread
From: Daniel Borkmann @ 2017-04-19 21:42 UTC (permalink / raw)
  To: Andy Gospodarek, Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev,
	xdp-newbies, John Fastabend

On 04/19/2017 10:02 PM, Andy Gospodarek wrote:
> On Tue, Apr 18, 2017 at 09:58:56PM +0200, Jesper Dangaard Brouer wrote:
>>
>> As I argued in NetConf presentation[1] (from slide #9) we need a port
>> mapping table (instead of using ifindex'es).  Both for supporting
>> other "port" types than net_devices (think sockets), and for
>> sandboxing what XDP can bypass.
>>
>> I want to create a new XDP action called XDP_REDIRECT, that instruct
>> XDP to send the xdp_buff to another "port" (get translated into a
>> net_device, or something else depending on internal port type).
>>
>> Looking at the userspace/eBPF interface, I'm wondering what is the
>> best API for "returning" this port number from eBPF?
>>
>> The options I see is:
>>
>> 1) Split-up the u32 action code, and e.g let the high-16-bit be the
>>     port number and lower-16bit the (existing) action verdict.
>>
>>   Pros: Simple API
>>   Cons: Number of ports limited to 64K
>
> Practically speaking this may be seem reserving 64k for the port number
> might be enough space, but I would also like to see a new return option
> for flags so I start to get concerned about space.  Daniel also
> hightlights the fact that encoding the port in the action may does not
> leave room for flags and could get confusing.
>
> One unfortunate side-effect of dropping or transmitting frames with XDP
> is that we lose the opportunity to statistically sample in netfilter
> since the frames were dropped so early and I'd like to bring that back
> with a call to parse flags and possibly call psample_sample_packet()
> after the xdp action.  Packet sampling cannot simply be an action
> since there are times when a frame should be dropped but should also be
> sampled, so it seems logical to add this as a flag.

Hmm, you can do that already today with bpf_xdp_event_output() helper
and the program decides when to sample and what custom meta data to
prepend along with the (partial or full) xdp packet, see also [1], slide
11 for the "XDP dump" part.

No need to change drivers for this, psample_sample_packet() would also
be slower.

   [1] http://netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 20:02 ` Andy Gospodarek
  2017-04-19 21:42   ` Daniel Borkmann
@ 2017-04-19 22:51   ` Daniel Borkmann
  2017-04-20  2:56     ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov
  2017-04-20  6:39     ` XDP question: " Jesper Dangaard Brouer
  2017-04-20  4:43   ` John Fastabend
  2 siblings, 2 replies; 34+ messages in thread
From: Daniel Borkmann @ 2017-04-19 22:51 UTC (permalink / raw)
  To: Andy Gospodarek, Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev,
	xdp-newbies, John Fastabend

On 04/19/2017 10:02 PM, Andy Gospodarek wrote:
[...]
> and then lookup this dest in a table we have the option to make that
> dest an ifindex/socket/other.
>
> I did also look at JohnF's patch and I do like the simplicity of the redirect
> action and new ndo_xdp_xmit and how it moves towards a way to transmit the
> frame.  The downside is that it presumes an ifindex, so it might not be ideal
> we want the lookup to return something other than an ifindex.
>
[...]
> would be handled.  If we are ultimately going to need a new netdev op to
> handle the redirect then what may be the issue with not providing the
> destination port the return code and the option proposed by JohnF looks
> good to me with maybe a small tweak to not presume ifindex in some manner.

Is there a concrete reason that all the proposed future cases like sockets
have to be handled within the very same XDP_REDIRECT return code? F.e. why
not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
ones would get a different return code f.e. XDP_TX_SK only handling sockets
when we get there implementation-wise?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-19 22:51   ` Daniel Borkmann
@ 2017-04-20  2:56     ` Alexei Starovoitov
  2017-04-20  4:38       ` John Fastabend
  2017-04-20  6:10       ` Jesper Dangaard Brouer
  2017-04-20  6:39     ` XDP question: " Jesper Dangaard Brouer
  1 sibling, 2 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-20  2:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andy Gospodarek, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, netdev, xdp-newbies, John Fastabend

On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> 
> Is there a concrete reason that all the proposed future cases like sockets
> have to be handled within the very same XDP_REDIRECT return code? F.e. why
> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
> ones would get a different return code f.e. XDP_TX_SK only handling sockets
> when we get there implementation-wise?

yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
out of this discussion.
XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev.
If we make it too generic it will lose performance.

For cls_bpf the ifindex concept is symmetric. The program can access it as
skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper.
Since netdev is not locked, it's actually big plus, since container management
control plane can simply delete netns+veth and it goes away. The program can
have dangling ifindex (if control plane is buggy and didn't update the bpf side),
but it's harmless. Packets that redirect to non-existing ifindex are dropped.
This approach already understood and works well, so for XDP I suggest to use
the same approach initially before starting to reinvent the wheel.
struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
to L2 netdev only. That's it. Simple and easy.
I think the main use cases in John's and Jesper's minds is something like
xdpswitch where packets are coming from VMs and from physical eths and
being redirected to either physical eth or to VM via upcoming vhost+xdp support.
This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.

Once we have vhost+xdp and all other bits implemented, we must come back
to this discussion about having port mapping table. As I mentioned
during netconf I think it's very useful, but I don't think we should
gate vhost+xdp and xdp_redirect work on this discussion.
As far as this port mapping table we would need 'port' field in xdp_md as well
and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md
plus another XDP_REDIRECT_PORT action code.
The actual port table (array) should be populated by user space with netdevs
and these netdev will have their refcnt incremented. Then we'll have discussion
what to do with netdev_unregister notifiers, whether they should be auto-removed
from port table or bpf should have a chance to be notified and act on it.
Such port mapping will allow us to optimize inevitable call
dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
away, since netdevs will be stored in the port table and direct deref
port_map_array[xdp_md->out_port] will give us target netdev quickly.
It's nice optimization and there are other more powerful optimizations we
can do with such port table (since we will know in advance which netdevs
the program will be redirecting too), but I still think we should do
ifindex based xdp_redirect first and only add this port table later.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-20  2:56     ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov
@ 2017-04-20  4:38       ` John Fastabend
  2017-04-20  4:58         ` Alexei Starovoitov
  2017-04-20  6:10       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 34+ messages in thread
From: John Fastabend @ 2017-04-20  4:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Andy Gospodarek, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, netdev, xdp-newbies

On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>
>> Is there a concrete reason that all the proposed future cases like sockets
>> have to be handled within the very same XDP_REDIRECT return code? F.e. why
>> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
>> ones would get a different return code f.e. XDP_TX_SK only handling sockets
>> when we get there implementation-wise?
> 
> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
> out of this discussion.
> XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev.
> If we make it too generic it will lose performance.
> 
> For cls_bpf the ifindex concept is symmetric. The program can access it as
> skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper.
> Since netdev is not locked, it's actually big plus, since container management
> control plane can simply delete netns+veth and it goes away. The program can
> have dangling ifindex (if control plane is buggy and didn't update the bpf side),
> but it's harmless. Packets that redirect to non-existing ifindex are dropped.
> This approach already understood and works well, so for XDP I suggest to use
> the same approach initially before starting to reinvent the wheel.
> struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
> to L2 netdev only. That's it. Simple and easy.
> I think the main use cases in John's and Jesper's minds is something like
> xdpswitch where packets are coming from VMs and from physical eths and
> being redirected to either physical eth or to VM via upcoming vhost+xdp support.
> This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.

hmm I must be missing something, bpf_redirect() helper should be used as a
return statement, e.g.

	return bpf_redirect(ifindex, flags);

Its a bit awkward to use in any other way. You would have to ensure the
program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly
fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu
data and expects the core to call skb_do_redirect() to push the packet into
the correct netdev. bpf_redirect() does not modify the skb->ifindex value,
looking at the code now.

Are you suggesting using xdp_md to store the ifindex value instead of a per
cpu variable in the redirect helper? Do you really mean the xdp_md struct in
the uapi headers? I don't see why it needs to be in the UAPI at all. If we
don't like per cpu variables it could be pushed as part of xdp_buff I guess.

My suggestion is we could add an ifindex to the xdp_md structure and have the
receiving NIC driver populate the value assuming it is useful to programs. But,
if we use cls_bpf as a model then xdp_md ifindex is more or less independent of
the redirect helper. In my opinion we should avoid diverging cls bpf and xdp bpf
in subtle ways like handling of ifindex and redirect.

> 
> Once we have vhost+xdp and all other bits implemented, we must come back
> to this discussion about having port mapping table. As I mentioned
> during netconf I think it's very useful, but I don't think we should
> gate vhost+xdp and xdp_redirect work on this discussion.
> As far as this port mapping table we would need 'port' field in xdp_md as well
> and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md
> plus another XDP_REDIRECT_PORT action code.
> The actual port table (array) should be populated by user space with netdevs
> and these netdev will have their refcnt incremented. Then we'll have discussion
> what to do with netdev_unregister notifiers, whether they should be auto-removed
> from port table or bpf should have a chance to be notified and act on it.
> Such port mapping will allow us to optimize inevitable call
> dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
> away, since netdevs will be stored in the port table and direct deref
> port_map_array[xdp_md->out_port] will give us target netdev quickly.
> It's nice optimization and there are other more powerful optimizations we
> can do with such port table (since we will know in advance which netdevs
> the program will be redirecting too), but I still think we should do
> ifindex based xdp_redirect first and only add this port table later.
> 

Agreed on all this further optimization would be welcome of course after
basic implementation is in place.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 20:02 ` Andy Gospodarek
  2017-04-19 21:42   ` Daniel Borkmann
  2017-04-19 22:51   ` Daniel Borkmann
@ 2017-04-20  4:43   ` John Fastabend
  2 siblings, 0 replies; 34+ messages in thread
From: John Fastabend @ 2017-04-20  4:43 UTC (permalink / raw)
  To: Andy Gospodarek, Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev,
	xdp-newbies

[...]

> JohnF, any test results with this you can share?  Presumably you tested with
> virtio-net, right?
> 

For my patches using the xdp_redirect with virtio-net showed no measurable
difference from running just the straight XDP_TX in the PPS column. However
virtio_net XDP_TX was quite low to start with (and I was mostly just looking
at functionality so I don't recall if there was a CPU usage hit) so I expect on
ixgbe to see a difference. However I have plenty of cpu overhead to consume on
the 10Gbps card so it will likely just be a CPU % cost not a PPS cost. I'll
bother Alex about the 40Gbps cards ;)

Anyways, I'll put it on my list to rerun. But like the xdp generic pieces it
wont be for a couple days until I get around to it. Sorry need to take care of
a few things first.

Thanks,
John

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-20  4:38       ` John Fastabend
@ 2017-04-20  4:58         ` Alexei Starovoitov
  2017-04-20  5:14           ` John Fastabend
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-20  4:58 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Andy Gospodarek, Jesper Dangaard Brouer,
	Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies

On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote:
> On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
> > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> >>
> >> Is there a concrete reason that all the proposed future cases like sockets
> >> have to be handled within the very same XDP_REDIRECT return code? F.e. why
> >> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
> >> ones would get a different return code f.e. XDP_TX_SK only handling sockets
> >> when we get there implementation-wise?
> > 
> > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
> > out of this discussion.
> > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev.
> > If we make it too generic it will lose performance.
> > 
> > For cls_bpf the ifindex concept is symmetric. The program can access it as
> > skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper.
> > Since netdev is not locked, it's actually big plus, since container management
> > control plane can simply delete netns+veth and it goes away. The program can
> > have dangling ifindex (if control plane is buggy and didn't update the bpf side),
> > but it's harmless. Packets that redirect to non-existing ifindex are dropped.
> > This approach already understood and works well, so for XDP I suggest to use
> > the same approach initially before starting to reinvent the wheel.
> > struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
> > to L2 netdev only. That's it. Simple and easy.
> > I think the main use cases in John's and Jesper's minds is something like
> > xdpswitch where packets are coming from VMs and from physical eths and
> > being redirected to either physical eth or to VM via upcoming vhost+xdp support.
> > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
> 
> hmm I must be missing something, bpf_redirect() helper should be used as a
> return statement, e.g.
> 
> 	return bpf_redirect(ifindex, flags);
> 
> Its a bit awkward to use in any other way. You would have to ensure the
> program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly
> fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu
> data and expects the core to call skb_do_redirect() to push the packet into
> the correct netdev. bpf_redirect() does not modify the skb->ifindex value,
> looking at the code now.
> 
> Are you suggesting using xdp_md to store the ifindex value instead of a per
> cpu variable in the redirect helper? 

no. i'm suggesting to store in per-cpu scratch area just like cls_bpf does.

> Do you really mean the xdp_md struct in
> the uapi headers? 

yes. since 'ifindex' needs to be part of xdp_md struct in read-only way.
Just like in cls_bpf does it.
Otherwise if we attach the same program to multiple taps it won't
know which tap the traffic arriving on and won't be able to redirect properly.

> I don't see why it needs to be in the UAPI at all. If we
> don't like per cpu variables it could be pushed as part of xdp_buff I guess.

It's not about like or dont-like per-cpu scratch area.
My main point: it works just fine for cls_bpf and i'm suggesting to do
the same for xdp_redirect, since no one ever complained about that bit of cls_bpf.

> My suggestion is we could add an ifindex to the xdp_md structure and have the
> receiving NIC driver populate the value assuming it is useful to programs. But,
> if we use cls_bpf as a model then xdp_md ifindex is more or less independent of
> the redirect helper. 

exactly. arriving ifindex is independent of xdp_redirect helper.

> In my opinion we should avoid diverging cls bpf and xdp bpf
> in subtle ways like handling of ifindex and redirect.

exactly. I'm saying the same thing.
I'm not sure which part of my proposal was so confusing.
Sorry about that.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-20  4:58         ` Alexei Starovoitov
@ 2017-04-20  5:14           ` John Fastabend
  0 siblings, 0 replies; 34+ messages in thread
From: John Fastabend @ 2017-04-20  5:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Andy Gospodarek, Jesper Dangaard Brouer,
	Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies

On 17-04-19 09:58 PM, Alexei Starovoitov wrote:
> On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote:
>> On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
>>> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>>>
>>>> Is there a concrete reason that all the proposed future cases like sockets
>>>> have to be handled within the very same XDP_REDIRECT return code? F.e. why
>>>> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
>>>> ones would get a different return code f.e. XDP_TX_SK only handling sockets
>>>> when we get there implementation-wise?
>>>
>>> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
>>> out of this discussion.
>>> XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev.
>>> If we make it too generic it will lose performance.
>>>
>>> For cls_bpf the ifindex concept is symmetric. The program can access it as
>>> skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper.
>>> Since netdev is not locked, it's actually big plus, since container management
>>> control plane can simply delete netns+veth and it goes away. The program can
>>> have dangling ifindex (if control plane is buggy and didn't update the bpf side),
>>> but it's harmless. Packets that redirect to non-existing ifindex are dropped.
>>> This approach already understood and works well, so for XDP I suggest to use
>>> the same approach initially before starting to reinvent the wheel.
>>> struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
>>> to L2 netdev only. That's it. Simple and easy.
>>> I think the main use cases in John's and Jesper's minds is something like
>>> xdpswitch where packets are coming from VMs and from physical eths and
>>> being redirected to either physical eth or to VM via upcoming vhost+xdp support.
>>> This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
>>
>> hmm I must be missing something, bpf_redirect() helper should be used as a
>> return statement, e.g.
>>
>> 	return bpf_redirect(ifindex, flags);
>>
>> Its a bit awkward to use in any other way. You would have to ensure the
>> program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly
>> fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu
>> data and expects the core to call skb_do_redirect() to push the packet into
>> the correct netdev. bpf_redirect() does not modify the skb->ifindex value,
>> looking at the code now.
>>
>> Are you suggesting using xdp_md to store the ifindex value instead of a per
>> cpu variable in the redirect helper? 
> 
> no. i'm suggesting to store in per-cpu scratch area just like cls_bpf does.
> 
>> Do you really mean the xdp_md struct in
>> the uapi headers? 
> 
> yes. since 'ifindex' needs to be part of xdp_md struct in read-only way.
> Just like in cls_bpf does it.
> Otherwise if we attach the same program to multiple taps it won't
> know which tap the traffic arriving on and won't be able to redirect properly.
> 
>> I don't see why it needs to be in the UAPI at all. If we
>> don't like per cpu variables it could be pushed as part of xdp_buff I guess.
> 
> It's not about like or dont-like per-cpu scratch area.
> My main point: it works just fine for cls_bpf and i'm suggesting to do
> the same for xdp_redirect, since no one ever complained about that bit of cls_bpf.
> 
>> My suggestion is we could add an ifindex to the xdp_md structure and have the
>> receiving NIC driver populate the value assuming it is useful to programs. But,
>> if we use cls_bpf as a model then xdp_md ifindex is more or less independent of
>> the redirect helper. 
> 
> exactly. arriving ifindex is independent of xdp_redirect helper.
> 
>> In my opinion we should avoid diverging cls bpf and xdp bpf
>> in subtle ways like handling of ifindex and redirect.
> 
> exactly. I'm saying the same thing.
> I'm not sure which part of my proposal was so confusing.
> Sorry about that.
> 

Aha what you are suggesting is exactly what I prototyped on virtio_net so I'm
happy :) I just didn't manage to parse it for whatever reason must be getting
tired.

Thanks,
John

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-20  2:56     ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov
  2017-04-20  4:38       ` John Fastabend
@ 2017-04-20  6:10       ` Jesper Dangaard Brouer
  2017-04-20 17:10         ` Alexei Starovoitov
  1 sibling, 1 reply; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-20  6:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann,
	Alexei Starovoitov, netdev, xdp-newbies, John Fastabend, brouer

On Wed, 19 Apr 2017 19:56:13 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> > 
> > Is there a concrete reason that all the proposed future cases like sockets
> > have to be handled within the very same XDP_REDIRECT return code? F.e. why
> > not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
> > ones would get a different return code f.e. XDP_TX_SK only handling sockets
> > when we get there implementation-wise?  
> 
> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
> out of this discussion.
> XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev.
> If we make it too generic it will lose performance.
> 
> For cls_bpf the ifindex concept is symmetric. The program can access it as
> skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper.
> Since netdev is not locked, it's actually big plus, since container management
> control plane can simply delete netns+veth and it goes away. The program can
> have dangling ifindex (if control plane is buggy and didn't update the bpf side),
> but it's harmless. Packets that redirect to non-existing ifindex are dropped.
> This approach already understood and works well, so for XDP I suggest to use
> the same approach initially before starting to reinvent the wheel.
> struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
> to L2 netdev only. That's it. Simple and easy.
> I think the main use cases in John's and Jesper's minds is something like
> xdpswitch where packets are coming from VMs and from physical eths and
> being redirected to either physical eth or to VM via upcoming vhost+xdp support.
> This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
> 
> Once we have vhost+xdp and all other bits implemented, we must come back
> to this discussion about having port mapping table. As I mentioned
> during netconf I think it's very useful, but I don't think we should
> gate vhost+xdp and xdp_redirect work on this discussion.
> As far as this port mapping table we would need 'port' field in xdp_md as well
> and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md
> plus another XDP_REDIRECT_PORT action code.

Guess it would be easier to talk about if we name it "ingress_port" and
"egress_port".

> The actual port table (array) should be populated by user space with netdevs
> and these netdev will have their refcnt incremented. Then we'll have discussion
> what to do with netdev_unregister notifiers, whether they should be auto-removed
> from port table or bpf should have a chance to be notified and act on it.
> Such port mapping will allow us to optimize inevitable call
> dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
> away, since netdevs will be stored in the port table and direct deref
> port_map_array[xdp_md->out_port] will give us target netdev quickly.

I agree with above paragraph, and is happy that you can see that this
will actually be faster than using ifindex'es.

> It's nice optimization and there are other more powerful optimizations we
> can do with such port table (since we will know in advance which netdevs
> the program will be redirecting too), but I still think we should do
> ifindex based xdp_redirect first and only add this port table later.

No, we cannot first do an ifindex based xdp_redirect. The point of the
port table is to sandbox which ports XDP can use.

XDP is different than TC/cls_bpf, as it does "bypass", there is no
other layer that can stop or inspect these packets. The TC hooks
redirect into the network stack, which have all the usual facilities
available for filtering, inspection and debugging what is going on
(e.g. tcpdump works for TC redirect).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 22:51   ` Daniel Borkmann
  2017-04-20  2:56     ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov
@ 2017-04-20  6:39     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-20  6:39 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov,
	Alexei Starovoitov, netdev, xdp-newbies, John Fastabend, brouer

On Thu, 20 Apr 2017 00:51:31 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 04/19/2017 10:02 PM, Andy Gospodarek wrote:
> [...]
> > and then lookup this dest in a table we have the option to make that
> > dest an ifindex/socket/other.
> >
> > I did also look at JohnF's patch and I do like the simplicity of the redirect
> > action and new ndo_xdp_xmit and how it moves towards a way to transmit the
> > frame.  The downside is that it presumes an ifindex, so it might not be ideal
> > we want the lookup to return something other than an ifindex.
> >  
> [...]
> > would be handled.  If we are ultimately going to need a new netdev op to
> > handle the redirect then what may be the issue with not providing the
> > destination port the return code and the option proposed by JohnF looks
> > good to me with maybe a small tweak to not presume ifindex in some manner.  
> 
> Is there a concrete reason that all the proposed future cases like sockets
> have to be handled within the very same XDP_REDIRECT return code? F.e. why
> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
> ones would get a different return code f.e. XDP_TX_SK only handling sockets
> when we get there implementation-wise?

The concrete reason for an "unified" return code is to simplify changes
needed in the drivers.  The purpose is moving code out of the drivers
into the core.  IMHO we should keep drivers as simple as possible and
stop adding more return action codes. XDP_REDIRECT combined with a
port-table should be the last action-code needed in the drivers.
As the port table will know the "type" of the port.

Keeping action XDP_DROP and XDP_TX inside the drivers make sense,
because it allow for driver level optimizations. Actions that want to
transfer the packet/xdp_buff "outside" the driver should call a core
function (given it will be called with the xdp_buff struct, you can
hide all your extra extensions there without changing the driver code).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-20  6:10       ` Jesper Dangaard Brouer
@ 2017-04-20 17:10         ` Alexei Starovoitov
  2017-04-25  9:34           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-20 17:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann,
	Alexei Starovoitov, netdev, xdp-newbies, John Fastabend

On Thu, Apr 20, 2017 at 08:10:51AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 19 Apr 2017 19:56:13 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> > > 
> > > Is there a concrete reason that all the proposed future cases like sockets
> > > have to be handled within the very same XDP_REDIRECT return code? F.e. why
> > > not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
> > > ones would get a different return code f.e. XDP_TX_SK only handling sockets
> > > when we get there implementation-wise?  
> > 
> > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
> > out of this discussion.
> > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev.
> > If we make it too generic it will lose performance.
> > 
> > For cls_bpf the ifindex concept is symmetric. The program can access it as
> > skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper.
> > Since netdev is not locked, it's actually big plus, since container management
> > control plane can simply delete netns+veth and it goes away. The program can
> > have dangling ifindex (if control plane is buggy and didn't update the bpf side),
> > but it's harmless. Packets that redirect to non-existing ifindex are dropped.
> > This approach already understood and works well, so for XDP I suggest to use
> > the same approach initially before starting to reinvent the wheel.
> > struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
> > to L2 netdev only. That's it. Simple and easy.
> > I think the main use cases in John's and Jesper's minds is something like
> > xdpswitch where packets are coming from VMs and from physical eths and
> > being redirected to either physical eth or to VM via upcoming vhost+xdp support.
> > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
> > 
> > Once we have vhost+xdp and all other bits implemented, we must come back
> > to this discussion about having port mapping table. As I mentioned
> > during netconf I think it's very useful, but I don't think we should
> > gate vhost+xdp and xdp_redirect work on this discussion.
> > As far as this port mapping table we would need 'port' field in xdp_md as well
> > and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md
> > plus another XDP_REDIRECT_PORT action code.
> 
> Guess it would be easier to talk about if we name it "ingress_port" and
> "egress_port".
> 
> > The actual port table (array) should be populated by user space with netdevs
> > and these netdev will have their refcnt incremented. Then we'll have discussion
> > what to do with netdev_unregister notifiers, whether they should be auto-removed
> > from port table or bpf should have a chance to be notified and act on it.
> > Such port mapping will allow us to optimize inevitable call
> > dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
> > away, since netdevs will be stored in the port table and direct deref
> > port_map_array[xdp_md->out_port] will give us target netdev quickly.
> 
> I agree with above paragraph, and is happy that you can see that this
> will actually be faster than using ifindex'es.
> 
> > It's nice optimization and there are other more powerful optimizations we
> > can do with such port table (since we will know in advance which netdevs
> > the program will be redirecting too), but I still think we should do
> > ifindex based xdp_redirect first and only add this port table later.
> 
> No, we cannot first do an ifindex based xdp_redirect. The point of the
> port table is to sandbox which ports XDP can use.

hmm. port table cannot sandbox the ports. The only thing it does
from 'safety' point of view is moving the checks from run-time into
static insertion time.
So the checks that we would do on netdev after looking it up
based on ifindex are the same checks we will do at insertion time
into port table. The user space will insert/delete them live
from that port table, so from program point of view it's exactly
the same as ifindex. The ports can disappear and can be added
while the program is running.

Note the very first bpf patchset years ago contained the port table
abstraction. ovs has concept of vports as well. These two very
different projects needed port table to provide a layer of
indirection between ifindex==netdev and virtual port number.
This is still the case and I'd like to see this port table to be
implemented for both cls_bpf and xdp. In that sense xdp is not special.

> XDP is different than TC/cls_bpf, as it does "bypass", there is no
> other layer that can stop or inspect these packets. The TC hooks
> redirect into the network stack, which have all the usual facilities
> available for filtering, inspection and debugging what is going on
> (e.g. tcpdump works for TC redirect).

not true. when bpf_redirect() drops the packet due to incorrect ifindex
that packet disappears without a trace. No tcpdump and no counter.
And this is fine. We can add tracepoint there for debugging,
but it wasn't a problem for anyone who's using it today, so it's
'nice to have', but certainly not mandatory.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 21:42   ` Daniel Borkmann
@ 2017-04-20 17:12     ` Andy Gospodarek
  0 siblings, 0 replies; 34+ messages in thread
From: Andy Gospodarek @ 2017-04-20 17:12 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov,
	Alexei Starovoitov, netdev, xdp-newbies, John Fastabend

On Wed, Apr 19, 2017 at 11:42:30PM +0200, Daniel Borkmann wrote:
> On 04/19/2017 10:02 PM, Andy Gospodarek wrote:
> > On Tue, Apr 18, 2017 at 09:58:56PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > As I argued in NetConf presentation[1] (from slide #9) we need a port
> > > mapping table (instead of using ifindex'es).  Both for supporting
> > > other "port" types than net_devices (think sockets), and for
> > > sandboxing what XDP can bypass.
> > > 
> > > I want to create a new XDP action called XDP_REDIRECT, that instruct
> > > XDP to send the xdp_buff to another "port" (get translated into a
> > > net_device, or something else depending on internal port type).
> > > 
> > > Looking at the userspace/eBPF interface, I'm wondering what is the
> > > best API for "returning" this port number from eBPF?
> > > 
> > > The options I see is:
> > > 
> > > 1) Split-up the u32 action code, and e.g let the high-16-bit be the
> > >     port number and lower-16bit the (existing) action verdict.
> > > 
> > >   Pros: Simple API
> > >   Cons: Number of ports limited to 64K
> > 
> > Practically speaking this may be seem reserving 64k for the port number
> > might be enough space, but I would also like to see a new return option
> > for flags so I start to get concerned about space.  Daniel also
> > hightlights the fact that encoding the port in the action may does not
> > leave room for flags and could get confusing.
> > 
> > One unfortunate side-effect of dropping or transmitting frames with XDP
> > is that we lose the opportunity to statistically sample in netfilter
> > since the frames were dropped so early and I'd like to bring that back
> > with a call to parse flags and possibly call psample_sample_packet()
> > after the xdp action.  Packet sampling cannot simply be an action
> > since there are times when a frame should be dropped but should also be
> > sampled, so it seems logical to add this as a flag.
> 
> Hmm, you can do that already today with bpf_xdp_event_output() helper
> and the program decides when to sample and what custom meta data to
> prepend along with the (partial or full) xdp packet, see also [1], slide
> 11 for the "XDP dump" part.
> 
> No need to change drivers for this, psample_sample_packet() would also
> be slower.
> 
>   [1] http://netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf

Thanks, Daniel.  Not sure how I missed this, but I appreciate the pointer.
I'll add coding up a program to demo this to my TODO list.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-20 17:10         ` Alexei Starovoitov
@ 2017-04-25  9:34           ` Jesper Dangaard Brouer
  2017-04-26  0:26             ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-25  9:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann,
	Alexei Starovoitov, netdev, xdp-newbies, John Fastabend, brouer

On Thu, 20 Apr 2017 10:10:08 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Apr 20, 2017 at 08:10:51AM +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 19 Apr 2017 19:56:13 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >   
> > > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:  
> > > > 
> > > > Is there a concrete reason that all the proposed future cases like sockets
> > > > have to be handled within the very same XDP_REDIRECT return code? F.e. why
> > > > not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
> > > > ones would get a different return code f.e. XDP_TX_SK only handling sockets
> > > > when we get there implementation-wise?    
> > > 
> > > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things
> > > out of this discussion.
> > > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev.
> > > If we make it too generic it will lose performance.
> > > 
> > > For cls_bpf the ifindex concept is symmetric. The program can access it as
> > > skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper.
> > > Since netdev is not locked, it's actually big plus, since container management
> > > control plane can simply delete netns+veth and it goes away. The program can
> > > have dangling ifindex (if control plane is buggy and didn't update the bpf side),
> > > but it's harmless. Packets that redirect to non-existing ifindex are dropped.
> > > This approach already understood and works well, so for XDP I suggest to use
> > > the same approach initially before starting to reinvent the wheel.
> > > struct xdp_md needs ifindex field and xdp_redirect() helper that redirects
> > > to L2 netdev only. That's it. Simple and easy.
> > > I think the main use cases in John's and Jesper's minds is something like
> > > xdpswitch where packets are coming from VMs and from physical eths and
> > > being redirected to either physical eth or to VM via upcoming vhost+xdp support.
> > > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
> > > 
> > > Once we have vhost+xdp and all other bits implemented, we must come back
> > > to this discussion about having port mapping table. As I mentioned
> > > during netconf I think it's very useful, but I don't think we should
> > > gate vhost+xdp and xdp_redirect work on this discussion.
> > > As far as this port mapping table we would need 'port' field in xdp_md as well
> > > and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md
> > > plus another XDP_REDIRECT_PORT action code.  
> > 
> > Guess it would be easier to talk about if we name it "ingress_port" and
> > "egress_port".
> >   
> > > The actual port table (array) should be populated by user space with netdevs
> > > and these netdev will have their refcnt incremented. Then we'll have discussion
> > > what to do with netdev_unregister notifiers, whether they should be auto-removed
> > > from port table or bpf should have a chance to be notified and act on it.
> > > Such port mapping will allow us to optimize inevitable call
> > > dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
> > > away, since netdevs will be stored in the port table and direct deref
> > > port_map_array[xdp_md->out_port] will give us target netdev quickly.  
> > 
> > I agree with above paragraph, and is happy that you can see that this
> > will actually be faster than using ifindex'es.
> >   
> > > It's nice optimization and there are other more powerful optimizations we
> > > can do with such port table (since we will know in advance which netdevs
> > > the program will be redirecting too), but I still think we should do
> > > ifindex based xdp_redirect first and only add this port table later.  
> > 
> > No, we cannot first do an ifindex based xdp_redirect. The point of the
> > port table is to sandbox which ports XDP can use.  
> 
> hmm. port table cannot sandbox the ports. The only thing it does
> from 'safety' point of view is moving the checks from run-time into
> static insertion time.
> So the checks that we would do on netdev after looking it up
> based on ifindex are the same checks we will do at insertion time
> into port table. The user space will insert/delete them live
> from that port table, so from program point of view it's exactly
> the same as ifindex. The ports can disappear and can be added
> while the program is running.

I agree, that from the eBPF programs point of view using an ifindex or a
port number is the same.  And I do like this model, that this is just a
number seem from bpf.  It provides a clean separation between the
kernel and ebpf program world.


> Note the very first bpf patchset years ago contained the port table
> abstraction. ovs has concept of vports as well. These two very
> different projects needed port table to provide a layer of
> indirection between ifindex==netdev and virtual port number.
> This is still the case and I'd like to see this port table to be
> implemented for both cls_bpf and xdp. In that sense xdp is not
> special.

Glad to hear you want to see this implemented, I will start coding on
this then.  Good point with cls_bpf, I was planning to make this port
table strongly connected to XDP, guess I should also think of cls_bpf.


> > XDP is different than TC/cls_bpf, as it does "bypass", there is no
> > other layer that can stop or inspect these packets. The TC hooks
> > redirect into the network stack, which have all the usual facilities
> > available for filtering, inspection and debugging what is going on
> > (e.g. tcpdump works for TC redirect).  
> 
> not true. when bpf_redirect() drops the packet due to incorrect ifindex
> that packet disappears without a trace. No tcpdump and no counter.
> And this is fine. We can add tracepoint there for debugging,
> but it wasn't a problem for anyone who's using it today, so it's
> 'nice to have', but certainly not mandatory.
 
I'm not worried about the DROP case, I agree that is fine (as you also
say).  The problem is unintentionally sending a packet to a wrong
ifindex.  This is clearly an eBPF program error, BUT with XDP this
becomes a very hard to debug program error.  With TC-redirect/cls_bpf
we can tcpdump the packets, with XDP there is no visibility into this
happening (the NSA is going to love this "feature").  Maybe we could add
yet-another tracepoint to allow debugging this.  My proposal that we
simply remove the possibility for such program errors, by as you say
move the validation from run-time into static insertion-time, via a
port table.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-25  9:34           ` Jesper Dangaard Brouer
@ 2017-04-26  0:26             ` Alexei Starovoitov
  2017-04-26  3:07               ` John Fastabend
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-26  0:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann,
	Alexei Starovoitov, netdev, xdp-newbies, John Fastabend

On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:
> > Note the very first bpf patchset years ago contained the port table
> > abstraction. ovs has concept of vports as well. These two very
> > different projects needed port table to provide a layer of
> > indirection between ifindex==netdev and virtual port number.
> > This is still the case and I'd like to see this port table to be
> > implemented for both cls_bpf and xdp. In that sense xdp is not
> > special.
> 
> Glad to hear you want to see this implemented, I will start coding on
> this then.  Good point with cls_bpf, I was planning to make this port
> table strongly connected to XDP, guess I should also think of cls_bpf.

perfect.
I think we should try to make all additions to bpf networking world
to be usable for both tc and xdp, since both are actively used and
it wouldn't be great to have cool feature for one, but not the other.
I think port table is an excellent candidate that applies to both.

> I'm not worried about the DROP case, I agree that is fine (as you also
> say).  The problem is unintentionally sending a packet to a wrong
> ifindex.  This is clearly an eBPF program error, BUT with XDP this
> becomes a very hard to debug program error.  With TC-redirect/cls_bpf
> we can tcpdump the packets, with XDP there is no visibility into this
> happening (the NSA is going to love this "feature").  Maybe we could add
> yet-another tracepoint to allow debugging this.  My proposal that we
> simply remove the possibility for such program errors, by as you say
> move the validation from run-time into static insertion-time, via a
> port table.

I think lack of tcpdump-like debugging in xdp is a separate issue.
As I was saying in the other thread we have trivial 'xdpdump' kern+user
app that emits pcap file, but it's too specific to how we use
tail_calls+prog_array in our xdp setup. I'm working on the program
chaining that will be generic and allow us transparently add multiple
xdp or tc progs to the same attachment point and will allow us to
do 'xdpdump' at any point of this pipeline, so debugging of what
happened to the packet will be easier and done in the same way
for both tc and xdp.
btw in our experience working with both tc and xdp the tc+bpf was
actually harder to use and more bug prone.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-26  0:26             ` Alexei Starovoitov
@ 2017-04-26  3:07               ` John Fastabend
  2017-04-26  9:11                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 34+ messages in thread
From: John Fastabend @ 2017-04-26  3:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann,
	Alexei Starovoitov, netdev, xdp-newbies

On 17-04-25 05:26 PM, Alexei Starovoitov wrote:
> On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:
>>> Note the very first bpf patchset years ago contained the port table
>>> abstraction. ovs has concept of vports as well. These two very
>>> different projects needed port table to provide a layer of
>>> indirection between ifindex==netdev and virtual port number.
>>> This is still the case and I'd like to see this port table to be
>>> implemented for both cls_bpf and xdp. In that sense xdp is not
>>> special.
>>
>> Glad to hear you want to see this implemented, I will start coding on
>> this then.  Good point with cls_bpf, I was planning to make this port
>> table strongly connected to XDP, guess I should also think of cls_bpf.
> 
> perfect.
> I think we should try to make all additions to bpf networking world
> to be usable for both tc and xdp, since both are actively used and
> it wouldn't be great to have cool feature for one, but not the other.
> I think port table is an excellent candidate that applies to both.

+1

Jesper, I was working up the code for the redirect piece for ixgbe and
virtio, please use this as a base for your virtual port number table. I'll
push an update onto github tomorrow. I think the table should drop in fairly
nicely.

One piece that isn't clear to me is how do you plan to instantiate and
program this table. Is it a new static bpf map that is created any time we see
the redirect command? I think this would be preferred.

> 
>> I'm not worried about the DROP case, I agree that is fine (as you also
>> say).  The problem is unintentionally sending a packet to a wrong
>> ifindex.  This is clearly an eBPF program error, BUT with XDP this
>> becomes a very hard to debug program error.  With TC-redirect/cls_bpf
>> we can tcpdump the packets, with XDP there is no visibility into this
>> happening (the NSA is going to love this "feature").  Maybe we could add
>> yet-another tracepoint to allow debugging this.  My proposal that we
>> simply remove the possibility for such program errors, by as you say
>> move the validation from run-time into static insertion-time, via a
>> port table.
> 
> I think lack of tcpdump-like debugging in xdp is a separate issue.
> As I was saying in the other thread we have trivial 'xdpdump' kern+user
> app that emits pcap file, but it's too specific to how we use
> tail_calls+prog_array in our xdp setup. I'm working on the program
> chaining that will be generic and allow us transparently add multiple
> xdp or tc progs to the same attachment point and will allow us to
> do 'xdpdump' at any point of this pipeline, so debugging of what
> happened to the packet will be easier and done in the same way
> for both tc and xdp.
> btw in our experience working with both tc and xdp the tc+bpf was
> actually harder to use and more bug prone.
> 

Nice, the tcpdump-like debugging looks interesting.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-26  3:07               ` John Fastabend
@ 2017-04-26  9:11                 ` Jesper Dangaard Brouer
  2017-04-26 16:35                   ` John Fastabend
  0 siblings, 1 reply; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-26  9:11 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Daniel Borkmann, Andy Gospodarek,
	Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies, brouer

On Tue, 25 Apr 2017 20:07:34 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 17-04-25 05:26 PM, Alexei Starovoitov wrote:
> > On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:  
> >>> Note the very first bpf patchset years ago contained the port table
> >>> abstraction. ovs has concept of vports as well. These two very
> >>> different projects needed port table to provide a layer of
> >>> indirection between ifindex==netdev and virtual port number.
> >>> This is still the case and I'd like to see this port table to be
> >>> implemented for both cls_bpf and xdp. In that sense xdp is not
> >>> special.  
> >>
> >> Glad to hear you want to see this implemented, I will start coding on
> >> this then.  Good point with cls_bpf, I was planning to make this port
> >> table strongly connected to XDP, guess I should also think of cls_bpf.  
> > 
> > perfect.
> > I think we should try to make all additions to bpf networking world
> > to be usable for both tc and xdp, since both are actively used and
> > it wouldn't be great to have cool feature for one, but not the other.
> > I think port table is an excellent candidate that applies to both.  
> 
> +1
> 
> Jesper, I was working up the code for the redirect piece for ixgbe and
> virtio, please use this as a base for your virtual port number table. I'll
> push an update onto github tomorrow. I think the table should drop in fairly
> nicely.

Cool, I will do that. Then, I'll also have a redirect method to shape
this around, and I would have to benchmark/test your ixgbe redirect.

(John please let me know, what github tree we are talking about, and
what branch)


> One piece that isn't clear to me is how do you plan to instantiate and
> program this table. Is it a new static bpf map that is created any
> time we see the redirect command? I think this would be preferred.

(This is difficult to explain without us misunderstanding each-other)

As Alexei also mentioned before, ifindex vs port makes no real
difference seen from the bpf program side.  It is userspace's
responsibility to add ifindex/port's to the bpf-maps, according to how
the bpf program "policy" want to "connect" these ports.  The
port-table system add one extra step, of also adding this port to the
port-table (which lives inside the kernel). 

When loading the XDP program, we also need to pass along a port table
"id" this XDP program is associated with (and if it doesn't exists you
create it).  And your userspace "control-plane" application also need
to know this port table "id", when adding a new port.

The concept of having multiple port tables is key.  As this implies we
can have several simultaneous "data-planes" that is *isolated* from
each-other.  Think about how network-namespaces/containers want
isolation. A subtle thing I'm afraid to mention, is that oppose to the
ifindex model, a port table with mapping to a net_device pointer, would
allow (faster) delivery into the container's inner net_device, which
sort of violates the isolation, but I would argue it is not a problem
as this net_device pointer could only be added from a process within the
namespace.  I like this feature, but it could easily be disallowed via
port insertion-time validation.

   
> >> I'm not worried about the DROP case, I agree that is fine (as you
> >> also say).  The problem is unintentionally sending a packet to a
> >> wrong ifindex.  This is clearly an eBPF program error, BUT with
> >> XDP this becomes a very hard to debug program error.  With
> >> TC-redirect/cls_bpf we can tcpdump the packets, with XDP there is
> >> no visibility into this happening (the NSA is going to love this
> >> "feature").  Maybe we could add yet-another tracepoint to allow
> >> debugging this.  My proposal that we simply remove the possibility
> >> for such program errors, by as you say move the validation from
> >> run-time into static insertion-time, via a port table.  
> > 
> > I think lack of tcpdump-like debugging in xdp is a separate issue.
> > As I was saying in the other thread we have trivial 'xdpdump'
> > kern+user app that emits pcap file, but it's too specific to how we
> > use tail_calls+prog_array in our xdp setup. I'm working on the
> > program chaining that will be generic and allow us transparently
> > add multiple xdp or tc progs to the same attachment point and will
> > allow us to do 'xdpdump' at any point of this pipeline, so
> > debugging of what happened to the packet will be easier and done in
> > the same way for both tc and xdp.
> > btw in our experience working with both tc and xdp the tc+bpf was
> > actually harder to use and more bug prone.
> >   
> 
> Nice, the tcpdump-like debugging looks interesting.

Yes, this xdpdump sound like a very useful tool.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-26  9:11                 ` Jesper Dangaard Brouer
@ 2017-04-26 16:35                   ` John Fastabend
  2017-04-26 17:58                     ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: John Fastabend @ 2017-04-26 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, Andy Gospodarek,
	Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies

[...]

>> Jesper, I was working up the code for the redirect piece for ixgbe and
>> virtio, please use this as a base for your virtual port number table. I'll
>> push an update onto github tomorrow. I think the table should drop in fairly
>> nicely.
> 
> Cool, I will do that. Then, I'll also have a redirect method to shape
> this around, and I would have to benchmark/test your ixgbe redirect.
> 
> (John please let me know, what github tree we are talking about, and
> what branch)
> 
> 
>> One piece that isn't clear to me is how do you plan to instantiate and
>> program this table. Is it a new static bpf map that is created any
>> time we see the redirect command? I think this would be preferred.
> 
> (This is difficult to explain without us misunderstanding each-other)
> 

Yep and I'm not sure I follow :)

> As Alexei also mentioned before, ifindex vs port makes no real
> difference seen from the bpf program side.  It is userspace's
> responsibility to add ifindex/port's to the bpf-maps, according to how
> the bpf program "policy" want to "connect" these ports.  The
> port-table system add one extra step, of also adding this port to the
> port-table (which lives inside the kernel). 
> 

I'm not sure I understand the "lives inside the kernel" bit. I assumed
the 'map' should be a bpf map and behave like any other bpf map.

I wanted a new map to be defined, something like this from the bpf programmer
side.

struct bpf_map_def SEC("maps") port_table =
	.type = BPF_MAP_TYPE_PORT_CONNECTION,
	.key_size = sizeof(u32),
	.value_size = BPF_PORT_CONNECTION_SIZE,
	.max_entries = 256,
};

> When loading the XDP program, we also need to pass along a port table
> "id" this XDP program is associated with (and if it doesn't exists you
> create it).  And your userspace "control-plane" application also need
> to know this port table "id", when adding a new port.

So the user space application that is loading the program also needs
to handle this map. This seems correct to me. But I don't see the
value in making some new port table when we already have well understood
framework for maps.

> 
> The concept of having multiple port tables is key.  As this implies we
> can have several simultaneous "data-planes" that is *isolated* from
> each-other.  Think about how network-namespaces/containers want
> isolation. A subtle thing I'm afraid to mention, is that oppose to the
> ifindex model, a port table with mapping to a net_device pointer, would
> allow (faster) delivery into the container's inner net_device, which
> sort of violates the isolation, but I would argue it is not a problem
> as this net_device pointer could only be added from a process within the
> namespace.  I like this feature, but it could easily be disallowed via
> port insertion-time validation.
> 

I think the above optimization should be allowed. And agree multiple port
tables (maps?) is needed. Again all this points to using standard maps
logic in my mind. For permissions and different domains, which I think
you were starting to touch on, it looks like we could extend the pinning API.
At the moment it does an inode_permission(inode, MAY_WRITE) check but I
presume this could be extended. None of this would be needed in v1 and
could be added subsequently. read-only maps seems doable.

>    
>>>> I'm not worried about the DROP case, I agree that is fine (as you
>>>> also say).  The problem is unintentionally sending a packet to a
>>>> wrong ifindex.  This is clearly an eBPF program error, BUT with
>>>> XDP this becomes a very hard to debug program error.  With
>>>> TC-redirect/cls_bpf we can tcpdump the packets, with XDP there is
>>>> no visibility into this happening (the NSA is going to love this
>>>> "feature").  Maybe we could add yet-another tracepoint to allow
>>>> debugging this.  My proposal that we simply remove the possibility
>>>> for such program errors, by as you say move the validation from
>>>> run-time into static insertion-time, via a port table.  
>>>
>>> I think lack of tcpdump-like debugging in xdp is a separate issue.
>>> As I was saying in the other thread we have trivial 'xdpdump'
>>> kern+user app that emits pcap file, but it's too specific to how we
>>> use tail_calls+prog_array in our xdp setup. I'm working on the
>>> program chaining that will be generic and allow us transparently
>>> add multiple xdp or tc progs to the same attachment point and will
>>> allow us to do 'xdpdump' at any point of this pipeline, so
>>> debugging of what happened to the packet will be easier and done in
>>> the same way for both tc and xdp.
>>> btw in our experience working with both tc and xdp the tc+bpf was
>>> actually harder to use and more bug prone.
>>>   
>>
>> Nice, the tcpdump-like debugging looks interesting.
> 
> Yes, this xdpdump sound like a very useful tool.
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-26 16:35                   ` John Fastabend
@ 2017-04-26 17:58                     ` Alexei Starovoitov
  2017-04-26 20:55                       ` Andy Gospodarek
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-26 17:58 UTC (permalink / raw)
  To: John Fastabend, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, Andy Gospodarek,
	Daniel Borkmann, netdev, xdp-newbies

On 4/26/17 9:35 AM, John Fastabend wrote:
>
>> As Alexei also mentioned before, ifindex vs port makes no real
>> difference seen from the bpf program side.  It is userspace's
>> responsibility to add ifindex/port's to the bpf-maps, according to how
>> the bpf program "policy" want to "connect" these ports.  The
>> port-table system add one extra step, of also adding this port to the
>> port-table (which lives inside the kernel).
>>
>
> I'm not sure I understand the "lives inside the kernel" bit. I assumed
> the 'map' should be a bpf map and behave like any other bpf map.
>
> I wanted a new map to be defined, something like this from the bpf programmer
> side.
>
> struct bpf_map_def SEC("maps") port_table =
> 	.type = BPF_MAP_TYPE_PORT_CONNECTION,
> 	.key_size = sizeof(u32),
> 	.value_size = BPF_PORT_CONNECTION_SIZE,
> 	.max_entries = 256,
> };

I like the idea.
We have prog_array, perf_event_array, cgroup_array map specializations.
This one can be new netdev_array with some new bpf_redirect-like helper
accessing it.

>> When loading the XDP program, we also need to pass along a port table
>> "id" this XDP program is associated with (and if it doesn't exists you
>> create it).  And your userspace "control-plane" application also need
>> to know this port table "id", when adding a new port.
>
> So the user space application that is loading the program also needs
> to handle this map. This seems correct to me. But I don't see the
> value in making some new port table when we already have well understood
> framework for maps.

+1

>>
>> The concept of having multiple port tables is key.  As this implies we
>> can have several simultaneous "data-planes" that is *isolated* from
>> each-other.  Think about how network-namespaces/containers want
>> isolation. A subtle thing I'm afraid to mention, is that oppose to the
>> ifindex model, a port table with mapping to a net_device pointer, would
>> allow (faster) delivery into the container's inner net_device, which
>> sort of violates the isolation, but I would argue it is not a problem
>> as this net_device pointer could only be added from a process within the
>> namespace.  I like this feature, but it could easily be disallowed via
>> port insertion-time validation.
>>
>
> I think the above optimization should be allowed. And agree multiple port
> tables (maps?) is needed. Again all this points to using standard maps
> logic in my mind. For permissions and different domains, which I think
> you were starting to touch on, it looks like we could extend the pinning API.
> At the moment it does an inode_permission(inode, MAY_WRITE) check but I
> presume this could be extended. None of this would be needed in v1 and
> could be added subsequently. read-only maps seems doable.

this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
the user space can make it readonly to prevent further changes.

 From user space it can be done similar to perf_events/cgroups as well.
bpf_map_update_elem(&netdev_array, &port_num, &ifindex)
should work.
For bpf_map_lookup_elem() from such netdev_array we can return
ifindex back.
The bpf_map_show_fdinfo() can be customized as well to pretty print
ifindexes of netdevs stored in there.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-26 17:58                     ` Alexei Starovoitov
@ 2017-04-26 20:55                       ` Andy Gospodarek
  2017-04-27  8:41                         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Gospodarek @ 2017-04-26 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Jesper Dangaard Brouer, Alexei Starovoitov,
	Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies

On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote:
> On 4/26/17 9:35 AM, John Fastabend wrote:
> > 
> > > As Alexei also mentioned before, ifindex vs port makes no real
> > > difference seen from the bpf program side.  It is userspace's
> > > responsibility to add ifindex/port's to the bpf-maps, according to how
> > > the bpf program "policy" want to "connect" these ports.  The
> > > port-table system add one extra step, of also adding this port to the
> > > port-table (which lives inside the kernel).
> > > 
> > 
> > I'm not sure I understand the "lives inside the kernel" bit. I assumed
> > the 'map' should be a bpf map and behave like any other bpf map.
> > 
> > I wanted a new map to be defined, something like this from the bpf programmer
> > side.
> > 
> > struct bpf_map_def SEC("maps") port_table =
> > 	.type = BPF_MAP_TYPE_PORT_CONNECTION,
> > 	.key_size = sizeof(u32),
> > 	.value_size = BPF_PORT_CONNECTION_SIZE,
> > 	.max_entries = 256,
> > };
> 
> I like the idea.
> We have prog_array, perf_event_array, cgroup_array map specializations.
> This one can be new netdev_array with some new bpf_redirect-like helper
> accessing it.
> 
> > > When loading the XDP program, we also need to pass along a port table
> > > "id" this XDP program is associated with (and if it doesn't exists you
> > > create it).  And your userspace "control-plane" application also need
> > > to know this port table "id", when adding a new port.
> > 
> > So the user space application that is loading the program also needs
> > to handle this map. This seems correct to me. But I don't see the
> > value in making some new port table when we already have well understood
> > framework for maps.
> 
> +1
> 
> > > 
> > > The concept of having multiple port tables is key.  As this implies we
> > > can have several simultaneous "data-planes" that is *isolated* from
> > > each-other.  Think about how network-namespaces/containers want
> > > isolation. A subtle thing I'm afraid to mention, is that oppose to the
> > > ifindex model, a port table with mapping to a net_device pointer, would
> > > allow (faster) delivery into the container's inner net_device, which
> > > sort of violates the isolation, but I would argue it is not a problem
> > > as this net_device pointer could only be added from a process within the
> > > namespace.  I like this feature, but it could easily be disallowed via
> > > port insertion-time validation.
> > > 
> > 
> > I think the above optimization should be allowed. And agree multiple port
> > tables (maps?) is needed. Again all this points to using standard maps
> > logic in my mind. For permissions and different domains, which I think
> > you were starting to touch on, it looks like we could extend the pinning API.
> > At the moment it does an inode_permission(inode, MAY_WRITE) check but I
> > presume this could be extended. None of this would be needed in v1 and
> > could be added subsequently. read-only maps seems doable.
> 
> this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
> the user space can make it readonly to prevent further changes.
> 
> From user space it can be done similar to perf_events/cgroups as well.
> bpf_map_update_elem(&netdev_array, &port_num, &ifindex)
> should work.
> For bpf_map_lookup_elem() from such netdev_array we can return
> ifindex back.
> The bpf_map_show_fdinfo() can be customized as well to pretty print
> ifindexes of netdevs stored in there.
> 

I agree with both of you on all of these points.  Having the port
redirection in a new type of map and/or array seems like the way to go.

I understood Jesper's perspecitive when thinking about a way to pass a
port-table id down, but I think the idea that the userspace loader code
defining the maps is going to be the one making this link is the right
idea and handling things like ifindex changes (rather than identifiers
that perform lookups in other tables) is going to have to be yet another
exercise left up to the...user.  :-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-26 20:55                       ` Andy Gospodarek
@ 2017-04-27  8:41                         ` Jesper Dangaard Brouer
  2017-04-27 23:31                           ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-27  8:41 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Alexei Starovoitov, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies, brouer

On Wed, 26 Apr 2017 16:55:44 -0400
Andy Gospodarek <andy@greyhouse.net> wrote:

> On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote:
> > On 4/26/17 9:35 AM, John Fastabend wrote:  
> > >   
> > > > As Alexei also mentioned before, ifindex vs port makes no real
> > > > difference seen from the bpf program side.  It is userspace's
> > > > responsibility to add ifindex/port's to the bpf-maps, according to how
> > > > the bpf program "policy" want to "connect" these ports.  The
> > > > port-table system add one extra step, of also adding this port to the
> > > > port-table (which lives inside the kernel).
> > > >   
> > > 
> > > I'm not sure I understand the "lives inside the kernel" bit. I assumed
> > > the 'map' should be a bpf map and behave like any other bpf map.
> > > 
> > > I wanted a new map to be defined, something like this from the bpf programmer
> > > side.
> > > 
> > > struct bpf_map_def SEC("maps") port_table =
> > > 	.type = BPF_MAP_TYPE_PORT_CONNECTION,
> > > 	.key_size = sizeof(u32),
> > > 	.value_size = BPF_PORT_CONNECTION_SIZE,
> > > 	.max_entries = 256,
> > > };  
> > 
> > I like the idea.
> > We have prog_array, perf_event_array, cgroup_array map specializations.
> > This one can be new netdev_array with some new bpf_redirect-like helper
> > accessing it.
> >   
> > > > When loading the XDP program, we also need to pass along a port table
> > > > "id" this XDP program is associated with (and if it doesn't exists you
> > > > create it).  And your userspace "control-plane" application also need
> > > > to know this port table "id", when adding a new port.  
> > > 
> > > So the user space application that is loading the program also needs
> > > to handle this map. This seems correct to me. But I don't see the
> > > value in making some new port table when we already have well understood
> > > framework for maps.  
> > 
> > +1
> >   
> > > > 
> > > > The concept of having multiple port tables is key.  As this implies we
> > > > can have several simultaneous "data-planes" that is *isolated* from
> > > > each-other.  Think about how network-namespaces/containers want
> > > > isolation. A subtle thing I'm afraid to mention, is that oppose to the
> > > > ifindex model, a port table with mapping to a net_device pointer, would
> > > > allow (faster) delivery into the container's inner net_device, which
> > > > sort of violates the isolation, but I would argue it is not a problem
> > > > as this net_device pointer could only be added from a process within the
> > > > namespace.  I like this feature, but it could easily be disallowed via
> > > > port insertion-time validation.
> > > >   
> > > 
> > > I think the above optimization should be allowed. And agree multiple port
> > > tables (maps?) is needed. Again all this points to using standard maps
> > > logic in my mind. For permissions and different domains, which I think
> > > you were starting to touch on, it looks like we could extend the pinning API.
> > > At the moment it does an inode_permission(inode, MAY_WRITE) check but I
> > > presume this could be extended. None of this would be needed in v1 and
> > > could be added subsequently. read-only maps seems doable.  
> > 
> > this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
> > the user space can make it readonly to prevent further changes.
> > 
> > From user space it can be done similar to perf_events/cgroups as well.
> > bpf_map_update_elem(&netdev_array, &port_num, &ifindex)
> > should work.
> > For bpf_map_lookup_elem() from such netdev_array we can return
> > ifindex back.
> > The bpf_map_show_fdinfo() can be customized as well to pretty print
> > ifindexes of netdevs stored in there.
> >   
> 
> I agree with both of you on all of these points.  Having the port
> redirection in a new type of map and/or array seems like the way to go.
> 
> I understood Jesper's perspecitive when thinking about a way to pass a
> port-table id down, but I think the idea that the userspace loader code
> defining the maps is going to be the one making this link is the right
> idea and handling things like ifindex changes (rather than identifiers
> that perform lookups in other tables) is going to have to be yet another
> exercise left up to the...user.  :-)
> 

I love this idea. Integrating the port table closer with the bpf-maps
infrastructure makes sense.  This gives me a place to hook the code into,
instead of (re)inventing a new infrastructure for this port table, and the
interface will be more natural from a bpf-API point of view.

When registering/attaching a XDP/bpf program, we would just send the
file-descriptor for this port-map along (like we do with the bpf_prog
FD). Plus, it own ingress-port number this program is in the port-map.

It is not clear to me, in-which-data-structure on the kernel-side we
store this reference to the port-map and ingress-port. As today we only
have the "raw" struct bpf_prog pointer. I see several options:

1. Create a new xdp_prog struct that contains existing bpf_prog,
a port-map pointer and ingress-port. (IMHO easiest solution)

2. Just create a new pointer to port-map and store it in driver rx-ring
struct (like existing bpf_prog), but this create a race-challenge
replacing (cmpxchg) the program (or perhaps it's not a problem as it
runs under rcu and RTNL-lock).

3. Extend bpf_prog to store this port-map and ingress-port, and have a
fast-way to access it.  I assume it will be accessible via
bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-27  8:41                         ` Jesper Dangaard Brouer
@ 2017-04-27 23:31                           ` Alexei Starovoitov
  2017-04-28  5:06                             ` John Fastabend
  2017-04-28 10:58                             ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-27 23:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Andy Gospodarek
  Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Daniel Borkmann, netdev, xdp-newbies

On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
> When registering/attaching a XDP/bpf program, we would just send the
> file-descriptor for this port-map along (like we do with the bpf_prog
> FD). Plus, it own ingress-port number this program is in the port-map.
>
> It is not clear to me, in-which-data-structure on the kernel-side we
> store this reference to the port-map and ingress-port. As today we only
> have the "raw" struct bpf_prog pointer. I see several options:
>
> 1. Create a new xdp_prog struct that contains existing bpf_prog,
> a port-map pointer and ingress-port. (IMHO easiest solution)
>
> 2. Just create a new pointer to port-map and store it in driver rx-ring
> struct (like existing bpf_prog), but this create a race-challenge
> replacing (cmpxchg) the program (or perhaps it's not a problem as it
> runs under rcu and RTNL-lock).
>
> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
> fast-way to access it.  I assume it will be accessible via
> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.

I'm not sure I completely follow the 3 proposals.
Are you suggesting to have only one netdev_array per program?
Why not to allow any number like we do for tailcall+prog_array, etc?
We can teach verifier to allow new helper
bpf_tx_port(netdev_array, port_num);
to only be used with netdev_array map type.
It will fetch netdevice pointer from netdev_array[port_num]
and will tx the packet into it.
We can make it similar to bpf_tail_call(), so that program will
finish on successful bpf_tx_port() or
make it into 'delayed' tx which will be executed when program finishes.
Not sure which approach is better.

We can also extend this netdev_array into broadcast/multicast. Like
bpf_tx_allports(&netdev_array);
call from the program will xmit the packet to all netdevices
in that 'netdev_array' map type.

The map-in-map support can be trivially extended to allow netdev_array,
then the program can create N multicast groups of netdevices.
Each multicast group == one netdev_array map.
The user space will populate a hashmap with these netdev_arrays and
bpf kernel side can select dynamically which multicast group to use
to send the packets to.
bpf kernel side may look like:
struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key);
if (!netdev_array)
   ...
if (my_condition)
    bpf_tx_allports(netdev_array);  /* broadcast to all netdevices */
else
    bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */

that's an artificial example. Just trying to point out
that we shouldn't restrict the feature too soon.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-27 23:31                           ` Alexei Starovoitov
@ 2017-04-28  5:06                             ` John Fastabend
  2017-04-28  5:30                               ` Alexei Starovoitov
  2017-04-28 10:58                             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 34+ messages in thread
From: John Fastabend @ 2017-04-28  5:06 UTC (permalink / raw)
  To: Alexei Starovoitov, Jesper Dangaard Brouer, Andy Gospodarek
  Cc: Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev,
	xdp-newbies

On 17-04-27 04:31 PM, Alexei Starovoitov wrote:
> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>> When registering/attaching a XDP/bpf program, we would just send the
>> file-descriptor for this port-map along (like we do with the bpf_prog
>> FD). Plus, it own ingress-port number this program is in the port-map.
>>
>> It is not clear to me, in-which-data-structure on the kernel-side we
>> store this reference to the port-map and ingress-port. As today we only
>> have the "raw" struct bpf_prog pointer. I see several options:
>>
>> 1. Create a new xdp_prog struct that contains existing bpf_prog,
>> a port-map pointer and ingress-port. (IMHO easiest solution)
>>
>> 2. Just create a new pointer to port-map and store it in driver rx-ring
>> struct (like existing bpf_prog), but this create a race-challenge
>> replacing (cmpxchg) the program (or perhaps it's not a problem as it
>> runs under rcu and RTNL-lock).
>>
>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>> fast-way to access it.  I assume it will be accessible via
>> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.
> 
> I'm not sure I completely follow the 3 proposals.
> Are you suggesting to have only one netdev_array per program?
> Why not to allow any number like we do for tailcall+prog_array, etc?
> We can teach verifier to allow new helper
> bpf_tx_port(netdev_array, port_num);
> to only be used with netdev_array map type.
> It will fetch netdevice pointer from netdev_array[port_num]
> and will tx the packet into it.
> We can make it similar to bpf_tail_call(), so that program will
> finish on successful bpf_tx_port() or
> make it into 'delayed' tx which will be executed when program finishes.
> Not sure which approach is better.

My reaction would be to make it finish on success but would like to write
a few programs first and see. I can't think of any use _not_ to terminate
but maybe there is something I'm missing.

> 
> We can also extend this netdev_array into broadcast/multicast. Like
> bpf_tx_allports(&netdev_array);
> call from the program will xmit the packet to all netdevices
> in that 'netdev_array' map type.

Yep nice solution to the multicast problem.

> 
> The map-in-map support can be trivially extended to allow netdev_array,
> then the program can create N multicast groups of netdevices.
> Each multicast group == one netdev_array map.
> The user space will populate a hashmap with these netdev_arrays and
> bpf kernel side can select dynamically which multicast group to use
> to send the packets to.
> bpf kernel side may look like:
> struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key);
> if (!netdev_array)
>   ...
> if (my_condition)
>    bpf_tx_allports(netdev_array);  /* broadcast to all netdevices */
> else
>    bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */
> 
> that's an artificial example. Just trying to point out
> that we shouldn't restrict the feature too soon.
> 

That is more or less what I was thinking as well. The other question
I have though is should we have a bpf_redirect() call for the simple
case where I use the ifindex directly. This will be helpful for taking
existing programs from tc_cls into xdp. I think it makes sense to have
both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().

.John

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-28  5:06                             ` John Fastabend
@ 2017-04-28  5:30                               ` Alexei Starovoitov
  2017-04-28 19:43                                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-28  5:30 UTC (permalink / raw)
  To: John Fastabend, Jesper Dangaard Brouer, Andy Gospodarek
  Cc: Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev,
	xdp-newbies

On 4/27/17 10:06 PM, John Fastabend wrote:
> That is more or less what I was thinking as well. The other question
> I have though is should we have a bpf_redirect() call for the simple
> case where I use the ifindex directly. This will be helpful for taking
> existing programs from tc_cls into xdp. I think it makes sense to have
> both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().

I think so too.
Once netdevice is stored into netdev_array map the netdevice is pinned
and we need to figure out what to do if somebody tries to delete it.
Should we add a new netlink notifier that this netdev's refcnt is
almost zero and it's only in netdev_array(s) ?
or should it be deleted from the array(s) automatically and
then user space will be notified post-deletion?
Both approaches have their pros and cons.

Whereas raw ifindex approach (via bpf_redirect) doesn't have these
caveats. It's clear to both bpf prog and user space that ifindex
can be stale and user space needs to monitor netdevs and update
programs/maps.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-27 23:31                           ` Alexei Starovoitov
  2017-04-28  5:06                             ` John Fastabend
@ 2017-04-28 10:58                             ` Jesper Dangaard Brouer
  2017-04-30  1:04                               ` Alexei Starovoitov
  1 sibling, 1 reply; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-28 10:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Gospodarek, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies, brouer

On Thu, 27 Apr 2017 16:31:14 -0700
Alexei Starovoitov <ast@fb.com> wrote:

> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
> > When registering/attaching a XDP/bpf program, we would just send the
> > file-descriptor for this port-map along (like we do with the bpf_prog
> > FD). Plus, it own ingress-port number this program is in the port-map.
> >
> > It is not clear to me, in-which-data-structure on the kernel-side we
> > store this reference to the port-map and ingress-port. As today we only
> > have the "raw" struct bpf_prog pointer. I see several options:
> >
> > 1. Create a new xdp_prog struct that contains existing bpf_prog,
> > a port-map pointer and ingress-port. (IMHO easiest solution)
> >
> > 2. Just create a new pointer to port-map and store it in driver rx-ring
> > struct (like existing bpf_prog), but this create a race-challenge
> > replacing (cmpxchg) the program (or perhaps it's not a problem as it
> > runs under rcu and RTNL-lock).
> >
> > 3. Extend bpf_prog to store this port-map and ingress-port, and have a
> > fast-way to access it.  I assume it will be accessible via
> > bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.  
> 
> I'm not sure I completely follow the 3 proposals.
> Are you suggesting to have only one netdev_array per program?

Yes, but I can see you have a more clever idea below.

> Why not to allow any number like we do for tailcall+prog_array, etc?

> We can teach verifier to allow new helper
>  bpf_tx_port(netdev_array, port_num);
> to only be used with netdev_array map type.
> It will fetch netdevice pointer from netdev_array[port_num]
> and will tx the packet into it.

I love it. 

I just don't like the "netdev" part of the name "netdev_array" as one
basic ideas of a port tabel, is that a port can be anything that can
consume a XDP_buff packet.  This generalization allow us to move code
out of the drivers.  We might be on the same page, as I do imagine that
netdev_array or port_array is just a struct bpf_map pointer, and the
bpf_map->map_type will tell us that this bpf_map contains net_device
pointers.  Thus, when later introducing a new type of redirect (like to
a socket or remote-CPU) then we just add a new bpf_map_type for this,
without needing to change anything in the drivers, right?

Do you imagine that bpf-side bpf_tx_port() returns XDP_REDIRECT?
Or does it return if the call was successful (e.g validate port_num
existed in map)?

On the kernel side, we need to receive this info "port_array" and
"port_num", given you don't provide the call a xdp_buff/ctx, then I
assume you want the per-CPU temp-store solution.  Then during the
XDP_REDIRECT action we call a core redirect function that based on the
bpf_map_type does a lookup, and find the net_device ptr.


> We can make it similar to bpf_tail_call(), so that program will
> finish on successful bpf_tx_port() or
> make it into 'delayed' tx which will be executed when program finishes.
> Not sure which approach is better.

I know you are talking about something slightly different, about
delaying TX.

But I want to mention (as I've done before) that it is important (for
me) that we get bulking working/integrated.   I imagine the driver will
call a function that will delay the TX/redirect action and at the end
of the NAPI cycle have a function that flush packets, bulk per
destination port.

I was wondering where to store these delayed TX packets, but now that
we have an associated bpf_map data-structure (netdev_array), I'm thinking
about storing packets (ordered by port) inside that.  And then have a
bpf_tx_flush(netdev_array) call in the driver (for every port-table-map
seen, which will likely be small).


> We can also extend this netdev_array into broadcast/multicast. Like
> bpf_tx_allports(&netdev_array);
> call from the program will xmit the packet to all netdevices
> in that 'netdev_array' map type.

When broadcasting you often don't want to broadcast the packet out of
the incoming interface.  How can you support this?

Normally you would know your ingress port, and then excluded that port
in the broadcast.  But with many netdev_array's how do the program know
it's own ingress port.


> The map-in-map support can be trivially extended to allow netdev_array,
> then the program can create N multicast groups of netdevices.
> Each multicast group == one netdev_array map.
> The user space will populate a hashmap with these netdev_arrays and
> bpf kernel side can select dynamically which multicast group to use
> to send the packets to.
> bpf kernel side may look like:
> struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key);
> if (!netdev_array)
>    ...
> if (my_condition)
>     bpf_tx_allports(netdev_array);  /* broadcast to all netdevices */
> else
>     bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */
> 
> that's an artificial example. Just trying to point out
> that we shouldn't restrict the feature too soon.
 
I like how you solve the multicast problem.  (But I do need to learn
some more of the inner-workings of bpf map-in-map to follow this
completely).

Thanks a lot for all this input, I got a much more clear picture of how
I can/should implement this :-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-28  5:30                               ` Alexei Starovoitov
@ 2017-04-28 19:43                                 ` Hannes Frederic Sowa
  2017-04-30  1:35                                   ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Hannes Frederic Sowa @ 2017-04-28 19:43 UTC (permalink / raw)
  To: Alexei Starovoitov, John Fastabend, Jesper Dangaard Brouer,
	Andy Gospodarek
  Cc: Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev,
	xdp-newbies

On 28.04.2017 07:30, Alexei Starovoitov wrote:
> On 4/27/17 10:06 PM, John Fastabend wrote:
>> That is more or less what I was thinking as well. The other question
>> I have though is should we have a bpf_redirect() call for the simple
>> case where I use the ifindex directly. This will be helpful for taking
>> existing programs from tc_cls into xdp. I think it makes sense to have
>> both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().
> 
> I think so too.
> Once netdevice is stored into netdev_array map the netdevice is pinned
> and we need to figure out what to do if somebody tries to delete it.
> Should we add a new netlink notifier that this netdev's refcnt is
> almost zero and it's only in netdev_array(s) ?

We basically do that automatically in netdev_wait_allrefs:

pr_emerg("unregister_netdevice: waiting for %s to become free. Usage
count = %d\n",
         dev->name, refcnt);

It is a very unpleasant warning and users probably think about a bug in
the kernel at first.

I don't think we should wait for user space to clean that up but have to
do it automatically from the kernel. Maybe we can introduce a special
value that basically NOPs the transmission. The hash table itself would
install a netdevice notifier and would clean all tables. Could
definitely cause some storm in the kernel, if a lot of keys are mapped
to the same interface.

> or should it be deleted from the array(s) automatically and
> then user space will be notified post-deletion?
> Both approaches have their pros and cons.

I am leaning more towards deleting it automatically. But walking all
tables and in there all keys might cause some unwanted load spikes.

> Whereas raw ifindex approach (via bpf_redirect) doesn't have these
> caveats. It's clear to both bpf prog and user space that ifindex
> can be stale and user space needs to monitor netdevs and update
> programs/maps.

A separate type for ifindex as key or value might be nice to expose this
information directly via the kernel (fdinfo etc.) but at the same time,
debugging infrastructure from user space can also easily deal with that.

Another approach would be:

ifindexes are allocated cyclic and also are signed int and not u32
during allocation. Maybe we can negate the ifindex during transmission
in the table and thus mark it as stale (or set it to -1)? This update
would be done by bpf_tx_*ports() that take a reference to a table and a
key and submit the packets on the appropriate ports and can flag the
relevant ifindexes as stale.

Just wanted to draft this idea, I am not particular happy with that
idea. Maybe someone comes up with a better one.

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-28 10:58                             ` Jesper Dangaard Brouer
@ 2017-04-30  1:04                               ` Alexei Starovoitov
  2017-04-30 22:55                                 ` John Fastabend
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-30  1:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andy Gospodarek, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies

On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote:
> On Thu, 27 Apr 2017 16:31:14 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
>
>> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>>> When registering/attaching a XDP/bpf program, we would just send the
>>> file-descriptor for this port-map along (like we do with the bpf_prog
>>> FD). Plus, it own ingress-port number this program is in the port-map.
>>>
>>> It is not clear to me, in-which-data-structure on the kernel-side we
>>> store this reference to the port-map and ingress-port. As today we only
>>> have the "raw" struct bpf_prog pointer. I see several options:
>>>
>>> 1. Create a new xdp_prog struct that contains existing bpf_prog,
>>> a port-map pointer and ingress-port. (IMHO easiest solution)
>>>
>>> 2. Just create a new pointer to port-map and store it in driver rx-ring
>>> struct (like existing bpf_prog), but this create a race-challenge
>>> replacing (cmpxchg) the program (or perhaps it's not a problem as it
>>> runs under rcu and RTNL-lock).
>>>
>>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>> fast-way to access it.  I assume it will be accessible via
>>> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.
>>
>> I'm not sure I completely follow the 3 proposals.
>> Are you suggesting to have only one netdev_array per program?
>
> Yes, but I can see you have a more clever idea below.
>
>> Why not to allow any number like we do for tailcall+prog_array, etc?
>
>> We can teach verifier to allow new helper
>>  bpf_tx_port(netdev_array, port_num);
>> to only be used with netdev_array map type.
>> It will fetch netdevice pointer from netdev_array[port_num]
>> and will tx the packet into it.
>
> I love it.
>
> I just don't like the "netdev" part of the name "netdev_array" as one
> basic ideas of a port tabel, is that a port can be anything that can
> consume a XDP_buff packet.  This generalization allow us to move code
> out of the drivers.  We might be on the same page, as I do imagine that
> netdev_array or port_array is just a struct bpf_map pointer, and the
> bpf_map->map_type will tell us that this bpf_map contains net_device
> pointers.  Thus, when later introducing a new type of redirect (like to
> a socket or remote-CPU) then we just add a new bpf_map_type for this,
> without needing to change anything in the drivers, right?

In theory, yes, but in practice I doubt it will be so easy.
We probably shouldn't allow very different types of netdev
into the same netdev_array or port_array (whatever the name).
They need to be similar enough, otherwise we'd have to do run-time
checks. If they're all the same, these checks can be done at
insertion time instead.

> Do you imagine that bpf-side bpf_tx_port() returns XDP_REDIRECT?
> Or does it return if the call was successful (e.g validate port_num
> existed in map)?

don't know :)
we need to brainstorm pros and cons.

> On the kernel side, we need to receive this info "port_array" and
> "port_num", given you don't provide the call a xdp_buff/ctx, then I
> assume you want the per-CPU temp-store solution.  Then during the
> XDP_REDIRECT action we call a core redirect function that based on the
> bpf_map_type does a lookup, and find the net_device ptr.

hmm. didn't think that far either :)
indeed makes sense to pass 'ctx' into such helper as well,
so it's easier to deal with original netdev.

>> We can make it similar to bpf_tail_call(), so that program will
>> finish on successful bpf_tx_port() or
>> make it into 'delayed' tx which will be executed when program finishes.
>> Not sure which approach is better.
>
> I know you are talking about something slightly different, about
> delaying TX.
>
> But I want to mention (as I've done before) that it is important (for
> me) that we get bulking working/integrated.   I imagine the driver will
> call a function that will delay the TX/redirect action and at the end
> of the NAPI cycle have a function that flush packets, bulk per
> destination port.
>
> I was wondering where to store these delayed TX packets, but now that
> we have an associated bpf_map data-structure (netdev_array), I'm thinking
> about storing packets (ordered by port) inside that.  And then have a
> bpf_tx_flush(netdev_array) call in the driver (for every port-table-map
> seen, which will likely be small).

makes sense to me as well.
Ideally we should try to make an api such, that batching or no-batching
can be kernel choice. The program will just do
xdp_tx_port(...something here...)
and the kernel does the best for performance. If it needs to delay
the result to do batching the api should allow that transparently.
Like right now xdp program does 'return XDP_TX;' and
on the kernel side we can either do immediate xmit (like we do today)
or can change it to do batching and programs don't need to change.

>> We can also extend this netdev_array into broadcast/multicast. Like
>> bpf_tx_allports(&netdev_array);
>> call from the program will xmit the packet to all netdevices
>> in that 'netdev_array' map type.
>
> When broadcasting you often don't want to broadcast the packet out of
> the incoming interface.  How can you support this?
>
> Normally you would know your ingress port, and then excluded that port
> in the broadcast.  But with many netdev_array's how do the program know
> it's own ingress port.

absolutely!
bpf_tx_allports() should somehow exclude the port packet arrived on.
What you're proposing about passing 'ctx' into this helper,
should solve it, I guess.

> Thanks a lot for all this input, I got a much more clear picture of how
> I can/should implement this :-)

awesome :) Let's brainstorm more and get John's opinion on it as well,
since sounds like he'll be heavy user of such api.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-28 19:43                                 ` Hannes Frederic Sowa
@ 2017-04-30  1:35                                   ` Alexei Starovoitov
  0 siblings, 0 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2017-04-30  1:35 UTC (permalink / raw)
  To: Hannes Frederic Sowa, John Fastabend, Jesper Dangaard Brouer,
	Andy Gospodarek
  Cc: Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev,
	xdp-newbies

On 4/28/17 12:43 PM, Hannes Frederic Sowa wrote:
> On 28.04.2017 07:30, Alexei Starovoitov wrote:
>> On 4/27/17 10:06 PM, John Fastabend wrote:
>>> That is more or less what I was thinking as well. The other question
>>> I have though is should we have a bpf_redirect() call for the simple
>>> case where I use the ifindex directly. This will be helpful for taking
>>> existing programs from tc_cls into xdp. I think it makes sense to have
>>> both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().
>>
>> I think so too.
>> Once netdevice is stored into netdev_array map the netdevice is pinned
>> and we need to figure out what to do if somebody tries to delete it.
>> Should we add a new netlink notifier that this netdev's refcnt is
>> almost zero and it's only in netdev_array(s) ?
>
> We basically do that automatically in netdev_wait_allrefs:
>
> pr_emerg("unregister_netdevice: waiting for %s to become free. Usage
> count = %d\n",
>          dev->name, refcnt);
>
> It is a very unpleasant warning and users probably think about a bug in
> the kernel at first.
>
> I don't think we should wait for user space to clean that up but have to
> do it automatically from the kernel. Maybe we can introduce a special
> value that basically NOPs the transmission. The hash table itself would
> install a netdevice notifier and would clean all tables. Could
> definitely cause some storm in the kernel, if a lot of keys are mapped
> to the same interface.
>
>> or should it be deleted from the array(s) automatically and
>> then user space will be notified post-deletion?
>> Both approaches have their pros and cons.
>
> I am leaning more towards deleting it automatically. But walking all
> tables and in there all keys might cause some unwanted load spikes.

yeah, when netdev is being unregistered we have to drop the ref
to avoid the warning.

Speaking of delete notifiers for maps.
Until recently all deletions were explicit and the program
could do extra perf_event_submit() if it needed to notify
user space of deletion.
But right now we have LRU map that deletes automatically and this
netdev_array will be deleting automatically too, so we probably
need some sort of generic map notifier for all map types.
It can be as simple as user space sleeping on read(map_fd, buf);
or waiting for epoll on map_fd and kernel wakes it up
when map element is deleted and pushes 'key/value' pair
into the buf. That should be generic for all map types,
but it needs some mechanism to avoid blocking, since number
of deletions can be huge. Something similar to perf's event record
'lost N records' should do the trick.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
  2017-04-30  1:04                               ` Alexei Starovoitov
@ 2017-04-30 22:55                                 ` John Fastabend
  0 siblings, 0 replies; 34+ messages in thread
From: John Fastabend @ 2017-04-30 22:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Jesper Dangaard Brouer
  Cc: Andy Gospodarek, Alexei Starovoitov, Daniel Borkmann,
	Daniel Borkmann, netdev, xdp-newbies

On 17-04-29 06:04 PM, Alexei Starovoitov wrote:
> On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote:
>> On Thu, 27 Apr 2017 16:31:14 -0700
>> Alexei Starovoitov <ast@fb.com> wrote:
>>
>>> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>>>> When registering/attaching a XDP/bpf program, we would just send the
>>>> file-descriptor for this port-map along (like we do with the bpf_prog
>>>> FD). Plus, it own ingress-port number this program is in the port-map.
>>>>
>>>> It is not clear to me, in-which-data-structure on the kernel-side we
>>>> store this reference to the port-map and ingress-port. As today we only
>>>> have the "raw" struct bpf_prog pointer. I see several options:
>>>>
>>>> 1. Create a new xdp_prog struct that contains existing bpf_prog,
>>>> a port-map pointer and ingress-port. (IMHO easiest solution)
>>>>
>>>> 2. Just create a new pointer to port-map and store it in driver rx-ring
>>>> struct (like existing bpf_prog), but this create a race-challenge
>>>> replacing (cmpxchg) the program (or perhaps it's not a problem as it
>>>> runs under rcu and RTNL-lock).
>>>>
>>>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>>> fast-way to access it.  I assume it will be accessible via
>>>> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.
>>>
>>> I'm not sure I completely follow the 3 proposals.
>>> Are you suggesting to have only one netdev_array per program?
>>
>> Yes, but I can see you have a more clever idea below.
>>
>>> Why not to allow any number like we do for tailcall+prog_array, etc?
>>
>>> We can teach verifier to allow new helper
>>>  bpf_tx_port(netdev_array, port_num);
>>> to only be used with netdev_array map type.
>>> It will fetch netdevice pointer from netdev_array[port_num]
>>> and will tx the packet into it.
>>
>> I love it.
>>
>> I just don't like the "netdev" part of the name "netdev_array" as one
>> basic ideas of a port tabel, is that a port can be anything that can
>> consume a XDP_buff packet.  This generalization allow us to move code
>> out of the drivers.  We might be on the same page, as I do imagine that
>> netdev_array or port_array is just a struct bpf_map pointer, and the
>> bpf_map->map_type will tell us that this bpf_map contains net_device
>> pointers.  Thus, when later introducing a new type of redirect (like to
>> a socket or remote-CPU) then we just add a new bpf_map_type for this,
>> without needing to change anything in the drivers, right?
> 
> In theory, yes, but in practice I doubt it will be so easy.
> We probably shouldn't allow very different types of netdev
> into the same netdev_array or port_array (whatever the name).
> They need to be similar enough, otherwise we'd have to do run-time
> checks. If they're all the same, these checks can be done at
> insertion time instead.
> 

I think we can just have different map types for each redirect type. So
a netdev_map_array, socket_map_array, etc.

>> Do you imagine that bpf-side bpf_tx_port() returns XDP_REDIRECT?
>> Or does it return if the call was successful (e.g validate port_num
>> existed in map)?
> 
> don't know :)
> we need to brainstorm pros and cons.
> 
>> On the kernel side, we need to receive this info "port_array" and
>> "port_num", given you don't provide the call a xdp_buff/ctx, then I
>> assume you want the per-CPU temp-store solution.  Then during the
>> XDP_REDIRECT action we call a core redirect function that based on the
>> bpf_map_type does a lookup, and find the net_device ptr.
> 
> hmm. didn't think that far either :)
> indeed makes sense to pass 'ctx' into such helper as well,
> so it's easier to deal with original netdev.
> 
>>> We can make it similar to bpf_tail_call(), so that program will
>>> finish on successful bpf_tx_port() or
>>> make it into 'delayed' tx which will be executed when program finishes.
>>> Not sure which approach is better.
>>
>> I know you are talking about something slightly different, about
>> delaying TX.
>>
>> But I want to mention (as I've done before) that it is important (for
>> me) that we get bulking working/integrated.   I imagine the driver will
>> call a function that will delay the TX/redirect action and at the end
>> of the NAPI cycle have a function that flush packets, bulk per
>> destination port.
>>
>> I was wondering where to store these delayed TX packets, but now that
>> we have an associated bpf_map data-structure (netdev_array), I'm thinking
>> about storing packets (ordered by port) inside that.  And then have a
>> bpf_tx_flush(netdev_array) call in the driver (for every port-table-map
>> seen, which will likely be small).
> 
> makes sense to me as well.
> Ideally we should try to make an api such, that batching or no-batching
> can be kernel choice. The program will just do
> xdp_tx_port(...something here...)
> and the kernel does the best for performance. If it needs to delay
> the result to do batching the api should allow that transparently.
> Like right now xdp program does 'return XDP_TX;' and
> on the kernel side we can either do immediate xmit (like we do today)
> or can change it to do batching and programs don't need to change.

I think on the kernel side we can just add a flag to tell the xmit routine
to hold off hitting the tail. Then when we have the last packet we can
just set the flag. For the case where you don't know when the last packet
is coming you can send a NULL xdp buff and have the tail updated. This seems
to give the most flexibility to driver writers to optimize for their hardware.

I also tried a xmit routine to send many xdp_bufs in one call. This
became a bit ugly IMO and I decided the above was actually better. The
basic problem is, in general, we get packets for all sorts of ports and
do not get say 10 packets all for one port in one call.

> 
>>> We can also extend this netdev_array into broadcast/multicast. Like
>>> bpf_tx_allports(&netdev_array);
>>> call from the program will xmit the packet to all netdevices
>>> in that 'netdev_array' map type.
>>
>> When broadcasting you often don't want to broadcast the packet out of
>> the incoming interface.  How can you support this?
>>
>> Normally you would know your ingress port, and then excluded that port
>> in the broadcast.  But with many netdev_array's how do the program know
>> it's own ingress port.
> 
> absolutely!
> bpf_tx_allports() should somehow exclude the port packet arrived on.
> What you're proposing about passing 'ctx' into this helper,
> should solve it, I guess.

+1 seems reasonable.

> 
>> Thanks a lot for all this input, I got a much more clear picture of how
>> I can/should implement this :-)
> 
> awesome :) Let's brainstorm more and get John's opinion on it as well,
> since sounds like he'll be heavy user of such api.
> 

Yep I have a couple cls_tc programs that I'm eager to port as soon as the
infrastructure is there. I think this is generally headed the "right" direction.
I'm trying to wrap up the qdisc codes now and then will start working on getting
the basic xdp_redirect working. The qdisc stuff took a bit longer than I hoped
:/

Thanks,
John

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-04-30 22:55 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-18 19:58 XDP question: best API for returning/setting egress port? Jesper Dangaard Brouer
2017-04-18 20:54 ` John Fastabend
2017-04-19 12:00   ` Jesper Dangaard Brouer
2017-04-19 12:33     ` Daniel Borkmann
2017-04-19 15:24       ` Jesper Dangaard Brouer
2017-04-19 12:25 ` Hannes Frederic Sowa
2017-04-19 20:02 ` Andy Gospodarek
2017-04-19 21:42   ` Daniel Borkmann
2017-04-20 17:12     ` Andy Gospodarek
2017-04-19 22:51   ` Daniel Borkmann
2017-04-20  2:56     ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov
2017-04-20  4:38       ` John Fastabend
2017-04-20  4:58         ` Alexei Starovoitov
2017-04-20  5:14           ` John Fastabend
2017-04-20  6:10       ` Jesper Dangaard Brouer
2017-04-20 17:10         ` Alexei Starovoitov
2017-04-25  9:34           ` Jesper Dangaard Brouer
2017-04-26  0:26             ` Alexei Starovoitov
2017-04-26  3:07               ` John Fastabend
2017-04-26  9:11                 ` Jesper Dangaard Brouer
2017-04-26 16:35                   ` John Fastabend
2017-04-26 17:58                     ` Alexei Starovoitov
2017-04-26 20:55                       ` Andy Gospodarek
2017-04-27  8:41                         ` Jesper Dangaard Brouer
2017-04-27 23:31                           ` Alexei Starovoitov
2017-04-28  5:06                             ` John Fastabend
2017-04-28  5:30                               ` Alexei Starovoitov
2017-04-28 19:43                                 ` Hannes Frederic Sowa
2017-04-30  1:35                                   ` Alexei Starovoitov
2017-04-28 10:58                             ` Jesper Dangaard Brouer
2017-04-30  1:04                               ` Alexei Starovoitov
2017-04-30 22:55                                 ` John Fastabend
2017-04-20  6:39     ` XDP question: " Jesper Dangaard Brouer
2017-04-20  4:43   ` John Fastabend

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.