* XDP question: best API for returning/setting egress port?
@ 2017-04-18 19:58 Jesper Dangaard Brouer
  2017-04-18 20:54 ` John Fastabend
  ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread

From: Jesper Dangaard Brouer @ 2017-04-18 19:58 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov
Cc: brouer, netdev, xdp-newbies, John Fastabend

As I argued in my NetConf presentation[1] (from slide #9), we need a port
mapping table (instead of using ifindexes), both for supporting other
"port" types than net_devices (think sockets) and for sandboxing what
XDP can bypass.

I want to create a new XDP action called XDP_REDIRECT that instructs XDP
to send the xdp_buff to another "port" (which gets translated into a
net_device, or something else, depending on the internal port type).

Looking at the userspace/eBPF interface, I'm wondering what the best API
is for "returning" this port number from eBPF? The options I see are:

1) Split up the u32 action code, e.g. let the high 16 bits be the port
   number and the low 16 bits the (existing) action verdict.

   Pros: Simple API
   Cons: Number of ports limited to 64K

2) Extend both xdp_buff + xdp_md to contain a (u32) port number, and
   allow eBPF to update xdp_md->port.

   Pros: Larger number of ports.
   Cons: Requires some eBPF translation steps between xdp_buff <-> xdp_md
         (see xdp_convert_ctx_access).

3) Extend only xdp_buff and create a bpf_helper that sets the port in
   xdp_buff.

   Pros: Hides implementation details, and allows the helper to give the
         eBPF code feedback (e.g. if the port doesn't exist any longer).
   Cons: Helper function call likely slower?

(Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://people.netfilter.org/hawk/presentations/NetConf2017/xdp_work_ahead_NetConf_April_2017.pdf

^ permalink raw reply	[flat|nested] 34+ messages in thread
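[Editor's sketch: option 1's 16/16 split of the u32 return value can be written down concretely. This is an illustrative user-space sketch, not an agreed UAPI; the helper names and the use of verdict value 4 for a hypothetical XDP_REDIRECT are assumptions for the example.]

```c
#include <stdint.h>

/* Hypothetical option-1 encoding (not an agreed UAPI): the low 16 bits
 * keep the existing action verdict, the high 16 bits carry the egress
 * port. The port space is therefore capped at 64K, the con noted above. */
static inline uint32_t xdp_ret_encode(uint16_t action, uint16_t port)
{
	return ((uint32_t)port << 16) | action;
}

static inline uint16_t xdp_ret_action(uint32_t ret)
{
	return (uint16_t)(ret & 0xffff);
}

static inline uint16_t xdp_ret_port(uint32_t ret)
{
	return (uint16_t)(ret >> 16);
}
```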
* Re: XDP question: best API for returning/setting egress port? 2017-04-18 19:58 XDP question: best API for returning/setting egress port? Jesper Dangaard Brouer @ 2017-04-18 20:54 ` John Fastabend 2017-04-19 12:00 ` Jesper Dangaard Brouer 2017-04-19 12:25 ` Hannes Frederic Sowa 2017-04-19 20:02 ` Andy Gospodarek 2 siblings, 1 reply; 34+ messages in thread From: John Fastabend @ 2017-04-18 20:54 UTC (permalink / raw) To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov Cc: netdev, xdp-newbies On 17-04-18 12:58 PM, Jesper Dangaard Brouer wrote: > > As I argued in NetConf presentation[1] (from slide #9) we need a port > mapping table (instead of using ifindex'es). Both for supporting > other "port" types than net_devices (think sockets), and for > sandboxing what XDP can bypass. > > I want to create a new XDP action called XDP_REDIRECT, that instruct > XDP to send the xdp_buff to another "port" (get translated into a > net_device, or something else depending on internal port type). > > Looking at the userspace/eBPF interface, I'm wondering what is the > best API for "returning" this port number from eBPF? > > The options I see is: > > 1) Split-up the u32 action code, and e.g let the high-16-bit be the > port number and lower-16bit the (existing) action verdict. > > Pros: Simple API > Cons: Number of ports limited to 64K > > 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow > eBPF to update xdp_md->port. > > Pros: Larger number of ports. > Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md. > (see xdp_convert_ctx_access) > > 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff. > > Pros: Hides impl details, and allows helper to give eBPF code feedback > (on e.g. if port doesn't exist any longer) > Cons: Helper function call likely slower? > > How about doing this the same way redirect is done in the tc case? 
I have this patch under test,

https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee

that should give you some idea. It just needs a port mapping table in
the bpf_tx_xdp() call.

> (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: XDP question: best API for returning/setting egress port? 2017-04-18 20:54 ` John Fastabend @ 2017-04-19 12:00 ` Jesper Dangaard Brouer 2017-04-19 12:33 ` Daniel Borkmann 0 siblings, 1 reply; 34+ messages in thread From: Jesper Dangaard Brouer @ 2017-04-19 12:00 UTC (permalink / raw) To: John Fastabend Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev, xdp-newbies, brouer On Tue, 18 Apr 2017 13:54:45 -0700 John Fastabend <john.fastabend@gmail.com> wrote: > On 17-04-18 12:58 PM, Jesper Dangaard Brouer wrote: > > > > As I argued in NetConf presentation[1] (from slide #9) we need a port > > mapping table (instead of using ifindex'es). Both for supporting > > other "port" types than net_devices (think sockets), and for > > sandboxing what XDP can bypass. > > > > I want to create a new XDP action called XDP_REDIRECT, that instruct > > XDP to send the xdp_buff to another "port" (get translated into a > > net_device, or something else depending on internal port type). > > > > Looking at the userspace/eBPF interface, I'm wondering what is the > > best API for "returning" this port number from eBPF? > > > > The options I see is: > > > > 1) Split-up the u32 action code, and e.g let the high-16-bit be the > > port number and lower-16bit the (existing) action verdict. > > > > Pros: Simple API > > Cons: Number of ports limited to 64K > > > > 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow > > eBPF to update xdp_md->port. > > > > Pros: Larger number of ports. > > Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md. > > (see xdp_convert_ctx_access) > > > > 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff. > > > > Pros: Hides impl details, and allows helper to give eBPF code feedback > > (on e.g. if port doesn't exist any longer) > > Cons: Helper function call likely slower? > > > > > > How about doing this the same way redirect is done in the tc case? 
> I have this patch under test,
>
> https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee

I have been looking at this approach, which is close to option #3 above.

The problem with your implementation is that you use a per-cpu store,
which creates the problem of storing state between packets. The first
packet can call the helper bpf_xdp_redirect(), setting an ifindex, but
the program can still return XDP_PASS. The next packet can return
XDP_REDIRECT and use the ifindex set by the first packet. IMHO this is
a problematic API to expose.

I do see that the TC interface uses the same approach, via the helper
bpf_redirect(). Maybe it has the same API problem? Looking at
sch_handle_ingress() I don't see this being handled (e.g. by always
clearing this_cpu_ptr(redirect_info)->ifindex = 0).

> that should give you some idea. It just needs a port mapping table in the
> bpf_tx_xdp() call.

I'll take a closer look. I don't think we need the per-cpu-store
approach for XDP, as we might as well store the port info in xdp_buff,
or return it directly as in option #1. (TC redirect needs the per-cpu
store to avoid extending the SKB.)

> > (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)

I would still like people to comment on the above options?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread
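[Editor's sketch: the state-between-packets concern can be illustrated with a small user-space analogue of a per-cpu redirect store. All names here are invented for the example; they are not the actual kernel symbols.]

```c
#include <stdint.h>

/* Stand-in for a per-cpu redirect slot: its content survives from one
 * packet to the next unless the datapath clears it on every fetch. */
static uint32_t redirect_ifindex;

enum xdp_verdict { VERDICT_PASS, VERDICT_REDIRECT };

static void helper_xdp_redirect(uint32_t ifindex)
{
	redirect_ifindex = ifindex;	/* what the helper would do */
}

/* Packet 1: buggy program sets an ifindex but then returns PASS. */
static enum xdp_verdict prog_packet1(void)
{
	helper_xdp_redirect(7);
	return VERDICT_PASS;
}

/* Packet 2: returns REDIRECT without ever calling the helper. */
static enum xdp_verdict prog_packet2(void)
{
	return VERDICT_REDIRECT;
}

/* If the datapath does not clear the slot when fetching it, packet 2
 * gets redirected using the stale ifindex left behind by packet 1. */
static uint32_t datapath_fetch(int clear_after_fetch)
{
	uint32_t ifindex = redirect_ifindex;

	if (clear_after_fetch)
		redirect_ifindex = 0;
	return ifindex;
}
```

Whether this is an API problem or merely a buggy program hinges on the `clear_after_fetch` behaviour, which is exactly the point debated below.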
* Re: XDP question: best API for returning/setting egress port? 2017-04-19 12:00 ` Jesper Dangaard Brouer @ 2017-04-19 12:33 ` Daniel Borkmann 2017-04-19 15:24 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 34+ messages in thread From: Daniel Borkmann @ 2017-04-19 12:33 UTC (permalink / raw) To: Jesper Dangaard Brouer, John Fastabend Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev, xdp-newbies On 04/19/2017 02:00 PM, Jesper Dangaard Brouer wrote: > On Tue, 18 Apr 2017 13:54:45 -0700 > John Fastabend <john.fastabend@gmail.com> wrote: >> On 17-04-18 12:58 PM, Jesper Dangaard Brouer wrote: >>> >>> As I argued in NetConf presentation[1] (from slide #9) we need a port >>> mapping table (instead of using ifindex'es). Both for supporting >>> other "port" types than net_devices (think sockets), and for >>> sandboxing what XDP can bypass. >>> >>> I want to create a new XDP action called XDP_REDIRECT, that instruct >>> XDP to send the xdp_buff to another "port" (get translated into a >>> net_device, or something else depending on internal port type). >>> >>> Looking at the userspace/eBPF interface, I'm wondering what is the >>> best API for "returning" this port number from eBPF? >>> >>> The options I see is: >>> >>> 1) Split-up the u32 action code, and e.g let the high-16-bit be the >>> port number and lower-16bit the (existing) action verdict. >>> >>> Pros: Simple API >>> Cons: Number of ports limited to 64K >>> >>> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow >>> eBPF to update xdp_md->port. >>> >>> Pros: Larger number of ports. >>> Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md. >>> (see xdp_convert_ctx_access) >>> >>> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff. >>> >>> Pros: Hides impl details, and allows helper to give eBPF code feedback >>> (on e.g. if port doesn't exist any longer) >>> Cons: Helper function call likely slower? 
>>
>> How about doing this the same way redirect is done in the tc case? I have this
>> patch under test,
>>
>> https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee
>
> I have been looking at this approach, which is close to option #3 above.
>
> The problem with your implementation that you use a per-cpu store.
> This creates the problem of storing state between packets. First packet
> can call helper bpf_xdp_redirect() setting an ifindex, but program can
> still return XDP_PASS. Next packet can call XDP_REDIRECT and use the
> ifindex set from the first packet. IMHO this is a problematic API to
> expose.
>
> I do see that the TC interface that uses the same approach, via helper
> bpf_redirect(). Maybe it have the same API problem? Looking at
> sch_handle_ingress() I don't see this is handled (e.g. by always
> clearing this_cpu_ptr(redirect_info)->ifindex = 0).

It's cleared in {skb,xdp}_do_redirect() right after fetching the
ifindex. I think this approach is just fine. The example described
above is a misuse of the API: a buggy program calling
bpf_xdp_redirect() and returning XDP_PASS one time, while another time
returning XDP_REDIRECT without the bpf_xdp_redirect() helper, sounds
very exotic, but it's as buggy as, say, a program doing the csum update
wrong, a program writing the wrong data to the packet, doing adjust
head on the wrong header offset, jumping into the wrong tail call
entry, and other things.

I think encoding this into an action code is rather limiting, f.e.
where would we place a flags argument if needed in future? Would that
mean we need an XDP_REDIRECT2 return code that also allows for encoding
flags?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: XDP question: best API for returning/setting egress port? 2017-04-19 12:33 ` Daniel Borkmann @ 2017-04-19 15:24 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 34+ messages in thread From: Jesper Dangaard Brouer @ 2017-04-19 15:24 UTC (permalink / raw) To: Daniel Borkmann Cc: John Fastabend, Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev, xdp-newbies, brouer On Wed, 19 Apr 2017 14:33:27 +0200 Daniel Borkmann <daniel@iogearbox.net> wrote: > On 04/19/2017 02:00 PM, Jesper Dangaard Brouer wrote: > > On Tue, 18 Apr 2017 13:54:45 -0700 > > John Fastabend <john.fastabend@gmail.com> wrote: > >> On 17-04-18 12:58 PM, Jesper Dangaard Brouer wrote: > >>> > >>> As I argued in NetConf presentation[1] (from slide #9) we need a port > >>> mapping table (instead of using ifindex'es). Both for supporting > >>> other "port" types than net_devices (think sockets), and for > >>> sandboxing what XDP can bypass. > >>> > >>> I want to create a new XDP action called XDP_REDIRECT, that instruct > >>> XDP to send the xdp_buff to another "port" (get translated into a > >>> net_device, or something else depending on internal port type). > >>> > >>> Looking at the userspace/eBPF interface, I'm wondering what is the > >>> best API for "returning" this port number from eBPF? > >>> > >>> The options I see is: > >>> > >>> 1) Split-up the u32 action code, and e.g let the high-16-bit be the > >>> port number and lower-16bit the (existing) action verdict. > >>> > >>> Pros: Simple API > >>> Cons: Number of ports limited to 64K > >>> > >>> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow > >>> eBPF to update xdp_md->port. > >>> > >>> Pros: Larger number of ports. > >>> Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md. > >>> (see xdp_convert_ctx_access) > >>> > >>> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff. > >>> > >>> Pros: Hides impl details, and allows helper to give eBPF code feedback > >>> (on e.g. 
if port doesn't exist any longer) > >>> Cons: Helper function call likely slower? > >> > >> How about doing this the same way redirect is done in the tc case? I have this > >> patch under test, > >> > >> https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee > > > > I have been looking at this approach, which is close to option #3 above. > > > > The problem with your implementation that you use a per-cpu store. > > This creates the problem of storing state between packets. First packet > > can call helper bpf_xdp_redirect() setting an ifindex, but program can > > still return XDP_PASS. Next packet can call XDP_REDIRECT and use the > > ifindex set from the first packet. IMHO this is a problematic API to > > expose. > > > > I do see that the TC interface that uses the same approach, via helper > > bpf_redirect(). Maybe it have the same API problem? Looking at > > sch_handle_ingress() I don't see this is handled (e.g. by always > > clearing this_cpu_ptr(redirect_info)->ifindex = 0). > > It's cleared in {skb,xdp}_do_redirect() right after fetching the > ifindex. I think this approach is just fine. The example described > above is a misuse of the API by a buggy program calling bpf_xdp_redirect() > and returning XDP_PASS while another time it returns XDP_REDIRECT > without the bpf_xdp_redirect() helper, sounds very exotic, but it's > as buggy as, say, a program doing the csum update wrong, a program > writing the wrong data to the packet, doing adjust head on the wrong > header offset, jumping into the wrong tail call entry and other things. For TC I guess it is fine to keep it as is, because it is needed to avoid extending skb. IHMO for XDP I see no reason to keep a per-cpu-store (which besides will be slower), simply update xdp_buff.port should be sufficient (which is only relevant for this packet). As noted in option#3, my concern is that calling a helper function call will be slower, than simply returning the needed port info? 
Maybe some bpf experts can tell me if such a helper call could be
optimized out with some bpf magic?

> I think encoding this into an action code is rather limiting, f.e.
> where would we place a flags argument if needed in future? Would
> that mean, we need a XDP_REDIRECT2 return code that also allows for
> encoding flags?

Nope, it will be extensible. We can start with:

 struct xdp_ret {
	union {
		__u32 act;
		struct {
			__u16 action;
			__u16 port;
		};
	};
 };

And later change it to:

 struct xdp_ret {
	union {
		__u32 act;
		struct {
			__u8  action;
			__u8  flags;
			__u16 port;
		};
	};
 };

if actions do not go above 255. I would prefer that we start with the
latter, else people would argue that we need to extend the structure
like:

 struct xdp_ret {
	union {
		__u32 act;
		struct {
			union {
				__u16 action;
				struct {
					__u8 action2;
					__u8 flags;
				};
			};
			__u16 port;
		};
	};
 };

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread
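[Editor's sketch: the 8/8/16 layout proposed above can be size-checked directly. This is a user-space rendering with fixed-width stdint types standing in for __u8/__u16/__u32; it relies on C11 anonymous structs/unions, as the kernel does.]

```c
#include <stdint.h>

/* The 8-bit-action variant: action, flags and port all packed into
 * the single u32 that XDP programs return today. The key property is
 * that the whole struct stays exactly 32 bits wide. */
struct xdp_ret {
	union {
		uint32_t act;
		struct {
			uint8_t  action;
			uint8_t  flags;
			uint16_t port;
		};
	};
};
```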
* Re: XDP question: best API for returning/setting egress port? 2017-04-18 19:58 XDP question: best API for returning/setting egress port? Jesper Dangaard Brouer 2017-04-18 20:54 ` John Fastabend @ 2017-04-19 12:25 ` Hannes Frederic Sowa 2017-04-19 20:02 ` Andy Gospodarek 2 siblings, 0 replies; 34+ messages in thread From: Hannes Frederic Sowa @ 2017-04-19 12:25 UTC (permalink / raw) To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov Cc: netdev, xdp-newbies, John Fastabend Hi, On 18.04.2017 21:58, Jesper Dangaard Brouer wrote: > > As I argued in NetConf presentation[1] (from slide #9) we need a port > mapping table (instead of using ifindex'es). Both for supporting > other "port" types than net_devices (think sockets), and for > sandboxing what XDP can bypass. > > I want to create a new XDP action called XDP_REDIRECT, that instruct > XDP to send the xdp_buff to another "port" (get translated into a > net_device, or something else depending on internal port type). > > Looking at the userspace/eBPF interface, I'm wondering what is the > best API for "returning" this port number from eBPF? > > The options I see is: > > 1) Split-up the u32 action code, and e.g let the high-16-bit be the > port number and lower-16bit the (existing) action verdict. > > Pros: Simple API > Cons: Number of ports limited to 64K > > 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow > eBPF to update xdp_md->port. > > Pros: Larger number of ports. > Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md. > (see xdp_convert_ctx_access) > > 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff. > > Pros: Hides impl details, and allows helper to give eBPF code feedback > (on e.g. if port doesn't exist any longer) > Cons: Helper function call likely slower? > > > (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?) 
I am not sure what the socket interface should look like; it seems too
far away to me right now.

Regarding having stable ifindexes, I wonder if we could do something
like:

 int ifindexes_in_use_by_ebpf_program[] __section("ifindex") =
	{ 1, 2, 3, 8, 9, 10 };

and make sure that the ifindexes automatically stay stable for the
lifetime of the loaded ebpf program?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: XDP question: best API for returning/setting egress port?
  2017-04-18 19:58 XDP question: best API for returning/setting egress port? Jesper Dangaard Brouer
  2017-04-18 20:54 ` John Fastabend
  2017-04-19 12:25 ` Hannes Frederic Sowa
@ 2017-04-19 20:02 ` Andy Gospodarek
  2017-04-19 21:42 ` Daniel Borkmann
  ` (2 more replies)
  2 siblings, 3 replies; 34+ messages in thread

From: Andy Gospodarek @ 2017-04-19 20:02 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev,
    xdp-newbies, John Fastabend

On Tue, Apr 18, 2017 at 09:58:56PM +0200, Jesper Dangaard Brouer wrote:
> 
> As I argued in NetConf presentation[1] (from slide #9) we need a port
> mapping table (instead of using ifindex'es). Both for supporting
> other "port" types than net_devices (think sockets), and for
> sandboxing what XDP can bypass.
> 
> I want to create a new XDP action called XDP_REDIRECT, that instruct
> XDP to send the xdp_buff to another "port" (get translated into a
> net_device, or something else depending on internal port type).
> 
> Looking at the userspace/eBPF interface, I'm wondering what is the
> best API for "returning" this port number from eBPF?
> 
> The options I see is:
> 
> 1) Split-up the u32 action code, and e.g let the high-16-bit be the
>    port number and lower-16bit the (existing) action verdict.
> 
>    Pros: Simple API
>    Cons: Number of ports limited to 64K

Practically speaking, reserving 64k for the port number might seem like
enough space, but I would also like to see a new return option for
flags, so I start to get concerned about space. Daniel also highlights
the fact that encoding the port in the action may not leave room for
flags and could get confusing.
One unfortunate side-effect of dropping or transmitting frames with XDP
is that we lose the opportunity to statistically sample in netfilter,
since the frames were dropped so early, and I'd like to bring that back
with a call to parse flags and possibly call psample_sample_packet()
after the xdp action. Packet sampling cannot simply be an action, since
there are times when a frame should be dropped but should also be
sampled, so it seems logical to add this as a flag.

> 2) Extend both xdp_buff + xdp_md to contain a (u32) port number, allow
>    eBPF to update xdp_md->port.
>
>    Pros: Larger number of ports.
>    Cons: This require some ebpf translation steps between xdp_buff <-> xdp_md.
>    (see xdp_convert_ctx_access)

I think I would lean towards this, based on the fact that I'd like to
see a flags field added to the u32 return (maybe the top 8 bits) as
mentioned above. So if we follow down this path and add a 'dest' field
to xdp_buff and xdp_md like this:

 struct xdp_buff {
	void *data;
	void *data_end;
	void *data_hard_start;
	__u32 dest;
 };

 struct xdp_md {
	__u32 data;
	__u32 data_end;
	__u32 dest;
 };

and then look up this dest in a table, we have the option to make that
dest an ifindex/socket/other.

I did also look at JohnF's patch, and I do like the simplicity of the
redirect action and new ndo_xdp_xmit, and how it moves towards a way to
transmit the frame. The downside is that it presumes an ifindex, so it
might not be ideal if we want the lookup to return something other than
an ifindex.

Before aligning on a direction for the return values from the xdp call,
it seems like we should also think about the tx side and how that would
be handled. If we are ultimately going to need a new netdev op to
handle the redirect, then what may be the issue with not providing the
destination port in the return code? The option proposed by JohnF looks
good to me, with maybe a small tweak to not presume an ifindex in some
manner.

JohnF, any test results with this you can share?
Presumably you tested with virtio-net, right?

> 3) Extend only xdp_buff and create bpf_helper that set port in xdp_buff.
>
> Pros: Hides impl details, and allows helper to give eBPF code feedback
> (on e.g. if port doesn't exist any longer)
> Cons: Helper function call likely slower?
>
>
> (Cc'ed xdp-newbies as end-users might have an opinion on UAPI?)
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> [1] http://people.netfilter.org/hawk/presentations/NetConf2017/xdp_work_ahead_NetConf_April_2017.pdf

^ permalink raw reply	[flat|nested] 34+ messages in thread
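[Editor's sketch: Andy's "dest that may be an ifindex/socket/other" lookup can be sketched as a tagged port table. Every name below is hypothetical; this is a user-space illustration of the design choice, not kernel code.]

```c
#include <stdint.h>
#include <stddef.h>

/* Each port slot records what kind of object it maps to, so the
 * transmit path can branch on the type instead of assuming an ifindex
 * the way the prototype ndo_xdp_xmit patch does. */
enum port_type { PORT_UNUSED, PORT_NETDEV, PORT_SOCKET };

struct port_entry {
	enum port_type type;
	void *obj;		/* net_device *, socket *, ... */
};

#define MAX_PORTS 256

static struct port_entry port_table[MAX_PORTS];

/* Resolve xdp_md->dest into a typed object; PORT_UNUSED means the
 * caller should drop the packet. */
static enum port_type port_lookup(uint32_t dest, void **obj)
{
	if (dest >= MAX_PORTS || port_table[dest].type == PORT_UNUSED)
		return PORT_UNUSED;
	*obj = port_table[dest].obj;
	return port_table[dest].type;
}
```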
* Re: XDP question: best API for returning/setting egress port? 2017-04-19 20:02 ` Andy Gospodarek @ 2017-04-19 21:42 ` Daniel Borkmann 2017-04-20 17:12 ` Andy Gospodarek 2017-04-19 22:51 ` Daniel Borkmann 2017-04-20 4:43 ` John Fastabend 2 siblings, 1 reply; 34+ messages in thread From: Daniel Borkmann @ 2017-04-19 21:42 UTC (permalink / raw) To: Andy Gospodarek, Jesper Dangaard Brouer Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev, xdp-newbies, John Fastabend On 04/19/2017 10:02 PM, Andy Gospodarek wrote: > On Tue, Apr 18, 2017 at 09:58:56PM +0200, Jesper Dangaard Brouer wrote: >> >> As I argued in NetConf presentation[1] (from slide #9) we need a port >> mapping table (instead of using ifindex'es). Both for supporting >> other "port" types than net_devices (think sockets), and for >> sandboxing what XDP can bypass. >> >> I want to create a new XDP action called XDP_REDIRECT, that instruct >> XDP to send the xdp_buff to another "port" (get translated into a >> net_device, or something else depending on internal port type). >> >> Looking at the userspace/eBPF interface, I'm wondering what is the >> best API for "returning" this port number from eBPF? >> >> The options I see is: >> >> 1) Split-up the u32 action code, and e.g let the high-16-bit be the >> port number and lower-16bit the (existing) action verdict. >> >> Pros: Simple API >> Cons: Number of ports limited to 64K > > Practically speaking this may be seem reserving 64k for the port number > might be enough space, but I would also like to see a new return option > for flags so I start to get concerned about space. Daniel also > hightlights the fact that encoding the port in the action may does not > leave room for flags and could get confusing. 
> > One unfortunate side-effect of dropping or transmitting frames with XDP > is that we lose the opportunity to statistically sample in netfilter > since the frames were dropped so early and I'd like to bring that back > with a call to parse flags and possibly call psample_sample_packet() > after the xdp action. Packet sampling cannot simply be an action > since there are times when a frame should be dropped but should also be > sampled, so it seems logical to add this as a flag. Hmm, you can do that already today with bpf_xdp_event_output() helper and the program decides when to sample and what custom meta data to prepend along with the (partial or full) xdp packet, see also [1], slide 11 for the "XDP dump" part. No need to change drivers for this, psample_sample_packet() would also be slower. [1] http://netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: XDP question: best API for returning/setting egress port? 2017-04-19 21:42 ` Daniel Borkmann @ 2017-04-20 17:12 ` Andy Gospodarek 0 siblings, 0 replies; 34+ messages in thread From: Andy Gospodarek @ 2017-04-20 17:12 UTC (permalink / raw) To: Daniel Borkmann Cc: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev, xdp-newbies, John Fastabend On Wed, Apr 19, 2017 at 11:42:30PM +0200, Daniel Borkmann wrote: > On 04/19/2017 10:02 PM, Andy Gospodarek wrote: > > On Tue, Apr 18, 2017 at 09:58:56PM +0200, Jesper Dangaard Brouer wrote: > > > > > > As I argued in NetConf presentation[1] (from slide #9) we need a port > > > mapping table (instead of using ifindex'es). Both for supporting > > > other "port" types than net_devices (think sockets), and for > > > sandboxing what XDP can bypass. > > > > > > I want to create a new XDP action called XDP_REDIRECT, that instruct > > > XDP to send the xdp_buff to another "port" (get translated into a > > > net_device, or something else depending on internal port type). > > > > > > Looking at the userspace/eBPF interface, I'm wondering what is the > > > best API for "returning" this port number from eBPF? > > > > > > The options I see is: > > > > > > 1) Split-up the u32 action code, and e.g let the high-16-bit be the > > > port number and lower-16bit the (existing) action verdict. > > > > > > Pros: Simple API > > > Cons: Number of ports limited to 64K > > > > Practically speaking this may be seem reserving 64k for the port number > > might be enough space, but I would also like to see a new return option > > for flags so I start to get concerned about space. Daniel also > > hightlights the fact that encoding the port in the action may does not > > leave room for flags and could get confusing. 
> > > > One unfortunate side-effect of dropping or transmitting frames with XDP > > is that we lose the opportunity to statistically sample in netfilter > > since the frames were dropped so early and I'd like to bring that back > > with a call to parse flags and possibly call psample_sample_packet() > > after the xdp action. Packet sampling cannot simply be an action > > since there are times when a frame should be dropped but should also be > > sampled, so it seems logical to add this as a flag. > > Hmm, you can do that already today with bpf_xdp_event_output() helper > and the program decides when to sample and what custom meta data to > prepend along with the (partial or full) xdp packet, see also [1], slide > 11 for the "XDP dump" part. > > No need to change drivers for this, psample_sample_packet() would also > be slower. > > [1] http://netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf Thanks, Daniel. Not sure how I missed this, but I appreciate the pointer. I'll add coding up a program to demo this to my TODO list. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: XDP question: best API for returning/setting egress port? 2017-04-19 20:02 ` Andy Gospodarek 2017-04-19 21:42 ` Daniel Borkmann @ 2017-04-19 22:51 ` Daniel Borkmann 2017-04-20 2:56 ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov 2017-04-20 6:39 ` XDP question: " Jesper Dangaard Brouer 2017-04-20 4:43 ` John Fastabend 2 siblings, 2 replies; 34+ messages in thread From: Daniel Borkmann @ 2017-04-19 22:51 UTC (permalink / raw) To: Andy Gospodarek, Jesper Dangaard Brouer Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev, xdp-newbies, John Fastabend On 04/19/2017 10:02 PM, Andy Gospodarek wrote: [...] > and then lookup this dest in a table we have the option to make that > dest an ifindex/socket/other. > > I did also look at JohnF's patch and I do like the simplicity of the redirect > action and new ndo_xdp_xmit and how it moves towards a way to transmit the > frame. The downside is that it presumes an ifindex, so it might not be ideal > we want the lookup to return something other than an ifindex. > [...] > would be handled. If we are ultimately going to need a new netdev op to > handle the redirect then what may be the issue with not providing the > destination port the return code and the option proposed by JohnF looks > good to me with maybe a small tweak to not presume ifindex in some manner. Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise? ^ permalink raw reply [flat|nested] 34+ messages in thread
* xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-19 22:51 ` Daniel Borkmann @ 2017-04-20 2:56 ` Alexei Starovoitov 2017-04-20 4:38 ` John Fastabend 2017-04-20 6:10 ` Jesper Dangaard Brouer 2017-04-20 6:39 ` XDP question: " Jesper Dangaard Brouer 1 sibling, 2 replies; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-20 2:56 UTC (permalink / raw) To: Daniel Borkmann Cc: Andy Gospodarek, Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies, John Fastabend On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote: > > Is there a concrete reason that all the proposed future cases like sockets > have to be handled within the very same XDP_REDIRECT return code? F.e. why > not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future > ones would get a different return code f.e. XDP_TX_SK only handling sockets > when we get there implementation-wise? yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance. For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper. Since netdev is not locked, it's actually big plus, since container management control plane can simply delete netns+veth and it goes away. The program can have dangling ifindex (if control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to non-existing ifindex are dropped. This approach already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs ifindex field and xdp_redirect() helper that redirects to L2 netdev only. That's it. Simple and easy. 
I think the main use case in John's and Jesper's minds is something like xdpswitch where packets are coming from VMs and from physical eths and being redirected to either physical eth or to VM via upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine. Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table we would need 'port' field in xdp_md as well and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code. The actual port table (array) should be populated by user space with netdevs and these netdevs will have their refcnt incremented. Then we'll have discussion what to do with netdev_unregister notifiers, whether they should be auto-removed from port table or bpf should have a chance to be notified and act on it. Such port mapping will allow us to optimize inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and direct deref port_map_array[xdp_md->out_port] will give us target netdev quickly. It's a nice optimization and there are other more powerful optimizations we can do with such port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do ifindex based xdp_redirect first and only add this port table later. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-20 2:56 ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov @ 2017-04-20 4:38 ` John Fastabend 2017-04-20 4:58 ` Alexei Starovoitov 2017-04-20 6:10 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 34+ messages in thread From: John Fastabend @ 2017-04-20 4:38 UTC (permalink / raw) To: Alexei Starovoitov, Daniel Borkmann Cc: Andy Gospodarek, Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies On 17-04-19 07:56 PM, Alexei Starovoitov wrote: > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote: >> >> Is there a concrete reason that all the proposed future cases like sockets >> have to be handled within the very same XDP_REDIRECT return code? F.e. why >> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future >> ones would get a different return code f.e. XDP_TX_SK only handling sockets >> when we get there implementation-wise? > > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things > out of this discussion. > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev. > If we make it too generic it will lose performance. > > For cls_bpf the ifindex concept is symmetric. The program can access it as > skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper. > Since netdev is not locked, it's actually big plus, since container management > control plane can simply delete netns+veth and it goes away. The program can > have dangling ifindex (if control plane is buggy and didn't update the bpf side), > but it's harmless. Packets that redirect to non-existing ifindex are dropped. > This approach already understood and works well, so for XDP I suggest to use > the same approach initially before starting to reinvent the wheel. > struct xdp_md needs ifindex field and xdp_redirect() helper that redirects > to L2 netdev only. That's it. 
> Simple and easy. > I think the main use cases in John's and Jesper's minds is something like > xdpswitch where packets are coming from VMs and from physical eths and > being redirected to either physical eth or to VM via upcoming vhost+xdp support. > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine. hmm I must be missing something, bpf_redirect() helper should be used as a return statement, e.g. return bpf_redirect(ifindex, flags); It's a bit awkward to use in any other way. You would have to ensure the program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu data and expects the core to call skb_do_redirect() to push the packet into the correct netdev. bpf_redirect() does not modify the skb->ifindex value, looking at the code now. Are you suggesting using xdp_md to store the ifindex value instead of a per cpu variable in the redirect helper? Do you really mean the xdp_md struct in the uapi headers? I don't see why it needs to be in the UAPI at all. If we don't like per cpu variables it could be pushed as part of xdp_buff I guess. My suggestion is we could add an ifindex to the xdp_md structure and have the receiving NIC driver populate the value assuming it is useful to programs. But, if we use cls_bpf as a model then xdp_md ifindex is more or less independent of the redirect helper. In my opinion we should avoid diverging cls bpf and xdp bpf in subtle ways like handling of ifindex and redirect. > > Once we have vhost+xdp and all other bits implemented, we must come back > to this discussion about having port mapping table. As I mentioned > during netconf I think it's very useful, but I don't think we should > gate vhost+xdp and xdp_redirect work on this discussion.
> As far as this port mapping table we would need 'port' field in xdp_md as well > and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md > plus another XDP_REDIRECT_PORT action code. > The actual port table (array) should be populated by user space with netdevs > and these netdev will have their refcnt incremented. Then we'll have discussion > what to do with netdev_unregister notifiers, whether they should be auto-removed > from port table or bpf should have a chance to be notified and act on it. > Such port mapping will allow us to optimize inevitable call > dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); > away, since netdevs will be stored in the port table and direct deref > port_map_array[xdp_md->out_port] will give us target netdev quickly. > It's nice optimization and there are other more powerful optimizations we > can do with such port table (since we will know in advance which netdevs > the program will be redirecting too), but I still think we should do > ifindex based xdp_redirect first and only add this port table later. > Agreed on all this; further optimization would be welcome of course after the basic implementation is in place. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-20 4:38 ` John Fastabend @ 2017-04-20 4:58 ` Alexei Starovoitov 2017-04-20 5:14 ` John Fastabend 0 siblings, 1 reply; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-20 4:58 UTC (permalink / raw) To: John Fastabend Cc: Daniel Borkmann, Andy Gospodarek, Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote: > On 17-04-19 07:56 PM, Alexei Starovoitov wrote: > > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote: > >> > >> Is there a concrete reason that all the proposed future cases like sockets > >> have to be handled within the very same XDP_REDIRECT return code? F.e. why > >> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future > >> ones would get a different return code f.e. XDP_TX_SK only handling sockets > >> when we get there implementation-wise? > > > > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things > > out of this discussion. > > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev. > > If we make it too generic it will lose performance. > > > > For cls_bpf the ifindex concept is symmetric. The program can access it as > > skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper. > > Since netdev is not locked, it's actually big plus, since container management > > control plane can simply delete netns+veth and it goes away. The program can > > have dangling ifindex (if control plane is buggy and didn't update the bpf side), > > but it's harmless. Packets that redirect to non-existing ifindex are dropped. > > This approach already understood and works well, so for XDP I suggest to use > > the same approach initially before starting to reinvent the wheel. 
> > struct xdp_md needs ifindex field and xdp_redirect() helper that redirects > > to L2 netdev only. That's it. Simple and easy. > > I think the main use cases in John's and Jesper's minds is something like > > xdpswitch where packets are coming from VMs and from physical eths and > > being redirected to either physical eth or to VM via upcoming vhost+xdp support. > > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine. > > hmm I must be missing something, bpf_redirect() helper should be used as a > return statement, e.g. > > return bpf_redirect(ifindex, flags); > > Its a bit awkward to use in any other way. You would have to ensure the > program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly > fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu > data and expects the core to call skb_do_redirect() to push the packet into > the correct netdev. bpf_redirect() does not modify the skb->ifindex value, > looking at the code now. > > Are you suggesting using xdp_md to store the ifindex value instead of a per > cpu variable in the redirect helper? no. i'm suggesting to store in per-cpu scratch area just like cls_bpf does. > Do you really mean the xdp_md struct in > the uapi headers? yes. since 'ifindex' needs to be part of xdp_md struct in a read-only way. Just like cls_bpf does it. Otherwise if we attach the same program to multiple taps it won't know which tap the traffic is arriving on and won't be able to redirect properly. > I don't see why it needs to be in the UAPI at all. If we > don't like per cpu variables it could be pushed as part of xdp_buff I guess. It's not about like or don't-like per-cpu scratch area. My main point: it works just fine for cls_bpf and i'm suggesting to do the same for xdp_redirect, since no one ever complained about that bit of cls_bpf.
> My suggestion is we could add an ifindex to the xdp_md structure and have the > receiving NIC driver populate the value assuming it is useful to programs. But, > if we use cls_bpf as a model then xdp_md ifindex is more or less independent of > the redirect helper. exactly. arriving ifindex is independent of xdp_redirect helper. > In my opinion we should avoid diverging cls bpf and xdp bpf > in subtle ways like handling of ifindex and redirect. exactly. I'm saying the same thing. I'm not sure which part of my proposal was so confusing. Sorry about that. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-20 4:58 ` Alexei Starovoitov @ 2017-04-20 5:14 ` John Fastabend 0 siblings, 0 replies; 34+ messages in thread From: John Fastabend @ 2017-04-20 5:14 UTC (permalink / raw) To: Alexei Starovoitov Cc: Daniel Borkmann, Andy Gospodarek, Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies On 17-04-19 09:58 PM, Alexei Starovoitov wrote: > On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote: >> On 17-04-19 07:56 PM, Alexei Starovoitov wrote: >>> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote: >>>> >>>> Is there a concrete reason that all the proposed future cases like sockets >>>> have to be handled within the very same XDP_REDIRECT return code? F.e. why >>>> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future >>>> ones would get a different return code f.e. XDP_TX_SK only handling sockets >>>> when we get there implementation-wise? >>> >>> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things >>> out of this discussion. >>> XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev. >>> If we make it too generic it will lose performance. >>> >>> For cls_bpf the ifindex concept is symmetric. The program can access it as >>> skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper. >>> Since netdev is not locked, it's actually big plus, since container management >>> control plane can simply delete netns+veth and it goes away. The program can >>> have dangling ifindex (if control plane is buggy and didn't update the bpf side), >>> but it's harmless. Packets that redirect to non-existing ifindex are dropped. >>> This approach already understood and works well, so for XDP I suggest to use >>> the same approach initially before starting to reinvent the wheel. 
>>> struct xdp_md needs ifindex field and xdp_redirect() helper that redirects >>> to L2 netdev only. That's it. Simple and easy. >>> I think the main use cases in John's and Jesper's minds is something like >>> xdpswitch where packets are coming from VMs and from physical eths and >>> being redirected to either physical eth or to VM via upcoming vhost+xdp support. >>> This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine. >> >> hmm I must be missing something, bpf_redirect() helper should be used as a >> return statement, e.g. >> >> return bpf_redirect(ifindex, flags); >> >> Its a bit awkward to use in any other way. You would have to ensure the >> program returns TC_ACT_REDIRECT in all cases I presume. It seems incredibly >> fragile and gains nothing as far as I can tell. bpf_redirect uses per_cpu >> data and expects the core to call skb_do_redirect() to push the packet into >> the correct netdev. bpf_redirect() does not modify the skb->ifindex value, >> looking at the code now. >> >> Are you suggesting using xdp_md to store the ifindex value instead of a per >> cpu variable in the redirect helper? > > no. i'm suggesting to store in per-cpu scratch area just like cls_bpf does. > >> Do you really mean the xdp_md struct in >> the uapi headers? > > yes. since 'ifindex' needs to be part of xdp_md struct in read-only way. > Just like in cls_bpf does it. > Otherwise if we attach the same program to multiple taps it won't > know which tap the traffic arriving on and won't be able to redirect properly. > >> I don't see why it needs to be in the UAPI at all. If we >> don't like per cpu variables it could be pushed as part of xdp_buff I guess. > > It's not about like or dont-like per-cpu scratch area. > My main point: it works just fine for cls_bpf and i'm suggesting to do > the same for xdp_redirect, since no one ever complained about that bit of cls_bpf. 
> >> My suggestion is we could add an ifindex to the xdp_md structure and have the >> receiving NIC driver populate the value assuming it is useful to programs. But, >> if we use cls_bpf as a model then xdp_md ifindex is more or less independent of >> the redirect helper. > > exactly. arriving ifindex is independent of xdp_redirect helper. > >> In my opinion we should avoid diverging cls bpf and xdp bpf >> in subtle ways like handling of ifindex and redirect. > > exactly. I'm saying the same thing. > I'm not sure which part of my proposal was so confusing. > Sorry about that. > Aha what you are suggesting is exactly what I prototyped on virtio_net so I'm happy :) I just didn't manage to parse it for whatever reason must be getting tired. Thanks, John ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-20 2:56 ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov 2017-04-20 4:38 ` John Fastabend @ 2017-04-20 6:10 ` Jesper Dangaard Brouer 2017-04-20 17:10 ` Alexei Starovoitov 1 sibling, 1 reply; 34+ messages in thread From: Jesper Dangaard Brouer @ 2017-04-20 6:10 UTC (permalink / raw) To: Alexei Starovoitov Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies, John Fastabend, brouer On Wed, 19 Apr 2017 19:56:13 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote: > > > > Is there a concrete reason that all the proposed future cases like sockets > > have to be handled within the very same XDP_REDIRECT return code? F.e. why > > not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future > > ones would get a different return code f.e. XDP_TX_SK only handling sockets > > when we get there implementation-wise? > > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things > out of this discussion. > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev. > If we make it too generic it will lose performance. > > For cls_bpf the ifindex concept is symmetric. The program can access it as > skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper. > Since netdev is not locked, it's actually big plus, since container management > control plane can simply delete netns+veth and it goes away. The program can > have dangling ifindex (if control plane is buggy and didn't update the bpf side), > but it's harmless. Packets that redirect to non-existing ifindex are dropped. > This approach already understood and works well, so for XDP I suggest to use > the same approach initially before starting to reinvent the wheel. 
> struct xdp_md needs ifindex field and xdp_redirect() helper that redirects > to L2 netdev only. That's it. Simple and easy. > I think the main use cases in John's and Jesper's minds is something like > xdpswitch where packets are coming from VMs and from physical eths and > being redirected to either physical eth or to VM via upcoming vhost+xdp support. > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine. > > Once we have vhost+xdp and all other bits implemented, we must come back > to this discussion about having port mapping table. As I mentioned > during netconf I think it's very useful, but I don't think we should > gate vhost+xdp and xdp_redirect work on this discussion. > As far as this port mapping table we would need 'port' field in xdp_md as well > and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md > plus another XDP_REDIRECT_PORT action code. Guess it would be easier to talk about if we name it "ingress_port" and "egress_port". > The actual port table (array) should be populated by user space with netdevs > and these netdev will have their refcnt incremented. Then we'll have discussion > what to do with netdev_unregister notifiers, whether they should be auto-removed > from port table or bpf should have a chance to be notified and act on it. > Such port mapping will allow us to optimize inevitable call > dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); > away, since netdevs will be stored in the port table and direct deref > port_map_array[xdp_md->out_port] will give us target netdev quickly. I agree with the above paragraph, and am happy that you can see that this will actually be faster than using ifindex'es.
No, we cannot first do an ifindex based xdp_redirect. The point of the port table is to sandbox which ports XDP can use. XDP is different than TC/cls_bpf, as it does "bypass", there is no other layer that can stop or inspect these packets. The TC hooks redirect into the network stack, which have all the usual facilities available for filtering, inspection and debugging what is going on (e.g. tcpdump works for TC redirect). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-20 6:10 ` Jesper Dangaard Brouer @ 2017-04-20 17:10 ` Alexei Starovoitov 2017-04-25 9:34 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-20 17:10 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies, John Fastabend On Thu, Apr 20, 2017 at 08:10:51AM +0200, Jesper Dangaard Brouer wrote: > On Wed, 19 Apr 2017 19:56:13 -0700 > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote: > > > > > > Is there a concrete reason that all the proposed future cases like sockets > > > have to be handled within the very same XDP_REDIRECT return code? F.e. why > > > not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future > > > ones would get a different return code f.e. XDP_TX_SK only handling sockets > > > when we get there implementation-wise? > > > > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things > > out of this discussion. > > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev. > > If we make it too generic it will lose performance. > > > > For cls_bpf the ifindex concept is symmetric. The program can access it as > > skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper. > > Since netdev is not locked, it's actually big plus, since container management > > control plane can simply delete netns+veth and it goes away. The program can > > have dangling ifindex (if control plane is buggy and didn't update the bpf side), > > but it's harmless. Packets that redirect to non-existing ifindex are dropped. > > This approach already understood and works well, so for XDP I suggest to use > > the same approach initially before starting to reinvent the wheel. 
> > struct xdp_md needs ifindex field and xdp_redirect() helper that redirects > > to L2 netdev only. That's it. Simple and easy. > > I think the main use cases in John's and Jesper's minds is something like > > xdpswitch where packets are coming from VMs and from physical eths and > > being redirected to either physical eth or to VM via upcoming vhost+xdp support. > > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine. > > > > Once we have vhost+xdp and all other bits implemented, we must come back > > to this discussion about having port mapping table. As I mentioned > > during netconf I think it's very useful, but I don't think we should > > gate vhost+xdp and xdp_redirect work on this discussion. > > As far as this port mapping table we would need 'port' field in xdp_md as well > > and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md > > plus another XDP_REDIRECT_PORT action code. > > Guess it would be easier to talk about if we name it "ingress_port" and > "egress_port". > > > The actual port table (array) should be populated by user space with netdevs > > and these netdev will have their refcnt incremented. Then we'll have discussion > > what to do with netdev_unregister notifiers, whether they should be auto-removed > > from port table or bpf should have a chance to be notified and act on it. > > Such port mapping will allow us to optimize inevitable call > > dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); > > away, since netdevs will be stored in the port table and direct deref > > port_map_array[xdp_md->out_port] will give us target netdev quickly. > > I agree with above paragraph, and is happy that you can see that this > will actually be faster than using ifindex'es. 
> > > It's nice optimization and there are other more powerful optimizations we > > can do with such port table (since we will know in advance which netdevs > > the program will be redirecting too), but I still think we should do > > ifindex based xdp_redirect first and only add this port table later. > > No, we cannot first do an ifindex based xdp_redirect. The point of the > port table is to sandbox which ports XDP can use. hmm. port table cannot sandbox the ports. The only thing it does from 'safety' point of view is moving the checks from run-time into static insertion time. So the checks that we would do on netdev after looking it up based on ifindex are the same checks we will do at insertion time into port table. The user space will insert/delete them live from that port table, so from program point of view it's exactly the same as ifindex. The ports can disappear and can be added while the program is running. Note the very first bpf patchset years ago contained the port table abstraction. ovs has concept of vports as well. These two very different projects needed port table to provide a layer of indirection between ifindex==netdev and virtual port number. This is still the case and I'd like to see this port table to be implemented for both cls_bpf and xdp. In that sense xdp is not special. > XDP is different than TC/cls_bpf, as it does "bypass", there is no > other layer that can stop or inspect these packets. The TC hooks > redirect into the network stack, which have all the usual facilities > available for filtering, inspection and debugging what is going on > (e.g. tcpdump works for TC redirect). not true. when bpf_redirect() drops the packet due to incorrect ifindex that packet disappears without a trace. No tcpdump and no counter. And this is fine. We can add tracepoint there for debugging, but it wasn't a problem for anyone who's using it today, so it's 'nice to have', but certainly not mandatory. 
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-20 17:10 ` Alexei Starovoitov @ 2017-04-25 9:34 ` Jesper Dangaard Brouer 2017-04-26 0:26 ` Alexei Starovoitov 0 siblings, 1 reply; 34+ messages in thread From: Jesper Dangaard Brouer @ 2017-04-25 9:34 UTC (permalink / raw) To: Alexei Starovoitov Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies, John Fastabend, brouer On Thu, 20 Apr 2017 10:10:08 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Thu, Apr 20, 2017 at 08:10:51AM +0200, Jesper Dangaard Brouer wrote: > > On Wed, 19 Apr 2017 19:56:13 -0700 > > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > > > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote: > > > > > > > > Is there a concrete reason that all the proposed future cases like sockets > > > > have to be handled within the very same XDP_REDIRECT return code? F.e. why > > > > not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future > > > > ones would get a different return code f.e. XDP_TX_SK only handling sockets > > > > when we get there implementation-wise? > > > > > > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things > > > out of this discussion. > > > XDP_REDIRECT should assume L2 raw packet is being redirected to another L2 netdev. > > > If we make it too generic it will lose performance. > > > > > > For cls_bpf the ifindex concept is symmetric. The program can access it as > > > skb->ifindex on receive and can redirect to another ifindex via bpf_redirect() helper. > > > Since netdev is not locked, it's actually big plus, since container management > > > control plane can simply delete netns+veth and it goes away. The program can > > > have dangling ifindex (if control plane is buggy and didn't update the bpf side), > > > but it's harmless. Packets that redirect to non-existing ifindex are dropped. 
> > > This approach already understood and works well, so for XDP I suggest to use > > > the same approach initially before starting to reinvent the wheel. > > > struct xdp_md needs ifindex field and xdp_redirect() helper that redirects > > > to L2 netdev only. That's it. Simple and easy. > > > I think the main use cases in John's and Jesper's minds is something like > > > xdpswitch where packets are coming from VMs and from physical eths and > > > being redirected to either physical eth or to VM via upcoming vhost+xdp support. > > > This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine. > > > > > > Once we have vhost+xdp and all other bits implemented, we must come back > > > to this discussion about having port mapping table. As I mentioned > > > during netconf I think it's very useful, but I don't think we should > > > gate vhost+xdp and xdp_redirect work on this discussion. > > > As far as this port mapping table we would need 'port' field in xdp_md as well > > > and xdp_redirect_port() done via helper or via extra 'out_port' field in xdp_md > > > plus another XDP_REDIRECT_PORT action code. > > > > Guess it would be easier to talk about if we name it "ingress_port" and > > "egress_port". > > > > > The actual port table (array) should be populated by user space with netdevs > > > and these netdev will have their refcnt incremented. Then we'll have discussion > > > what to do with netdev_unregister notifiers, whether they should be auto-removed > > > from port table or bpf should have a chance to be notified and act on it. > > > Such port mapping will allow us to optimize inevitable call > > > dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); > > > away, since netdevs will be stored in the port table and direct deref > > > port_map_array[xdp_md->out_port] will give us target netdev quickly. > > > > I agree with above paragraph, and is happy that you can see that this > > will actually be faster than using ifindex'es. 
> > > > It's nice optimization and there are other more powerful optimizations we > > > can do with such port table (since we will know in advance which netdevs > > > the program will be redirecting too), but I still think we should do > > > ifindex based xdp_redirect first and only add this port table later. > > > > No, we cannot first do an ifindex based xdp_redirect. The point of the > > port table is to sandbox which ports XDP can use. > > hmm. port table cannot sandbox the ports. The only thing it does > from 'safety' point of view is moving the checks from run-time into > static insertion time. > So the checks that we would do on netdev after looking it up > based on ifindex are the same checks we will do at insertion time > into port table. The user space will insert/delete them live > from that port table, so from program point of view it's exactly > the same as ifindex. The ports can disappear and can be added > while the program is running. I agree that from the eBPF program's point of view using an ifindex or a port number is the same. And I do like this model, that this is just a number seen from bpf. It provides a clean separation between the kernel and ebpf program world. > Note the very first bpf patchset years ago contained the port table > abstraction. ovs has concept of vports as well. These two very > different projects needed port table to provide a layer of > indirection between ifindex==netdev and virtual port number. > This is still the case and I'd like to see this port table to be > implemented for both cls_bpf and xdp. In that sense xdp is not > special. Glad to hear you want to see this implemented, I will start coding on this then. Good point with cls_bpf, I was planning to make this port table strongly connected to XDP, guess I should also think of cls_bpf. > > XDP is different than TC/cls_bpf, as it does "bypass", there is no > > other layer that can stop or inspect these packets.
> > The TC hooks > > redirect into the network stack, which have all the usual facilities > > available for filtering, inspection and debugging what is going on > > (e.g. tcpdump works for TC redirect). > > not true. when bpf_redirect() drops the packet due to incorrect ifindex > that packet disappears without a trace. No tcpdump and no counter. > And this is fine. We can add tracepoint there for debugging, > but it wasn't a problem for anyone who's using it today, so it's > 'nice to have', but certainly not mandatory. I'm not worried about the DROP case, I agree that is fine (as you also say). The problem is unintentionally sending a packet to a wrong ifindex. This is clearly an eBPF program error, BUT with XDP this becomes a very hard to debug program error. With TC-redirect/cls_bpf we can tcpdump the packets, with XDP there is no visibility into this happening (the NSA is going to love this "feature"). Maybe we could add yet another tracepoint to allow debugging this. My proposal is that we simply remove the possibility for such program errors, by as you say move the validation from run-time into static insertion-time, via a port table. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-25 9:34 ` Jesper Dangaard Brouer @ 2017-04-26 0:26 ` Alexei Starovoitov 2017-04-26 3:07 ` John Fastabend 0 siblings, 1 reply; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-26 0:26 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies, John Fastabend On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote: > > Note the very first bpf patchset years ago contained the port table > > abstraction. ovs has concept of vports as well. These two very > > different projects needed port table to provide a layer of > > indirection between ifindex==netdev and virtual port number. > > This is still the case and I'd like to see this port table to be > > implemented for both cls_bpf and xdp. In that sense xdp is not > > special. > > Glad to hear you want to see this implemented, I will start coding on > this then. Good point with cls_bpf, I was planning to make this port > table strongly connected to XDP, guess I should also think of cls_bpf. perfect. I think we should try to make all additions to bpf networking world to be usable for both tc and xdp, since both are actively used and it wouldn't be great to have cool feature for one, but not the other. I think port table is an excellent candidate that applies to both. > I'm not worried about the DROP case, I agree that is fine (as you also > say). The problem is unintentionally sending a packet to a wrong > ifindex. This is clearly an eBPF program error, BUT with XDP this > becomes a very hard to debug program error. With TC-redirect/cls_bpf > we can tcpdump the packets, with XDP there is no visibility into this > happening (the NSA is going to love this "feature"). Maybe we could add > yet-another tracepoint to allow debugging this. 
My proposal that we > simply remove the possibility for such program errors, by as you say > move the validation from run-time into static insertion-time, via a > port table. I think lack of tcpdump-like debugging in xdp is a separate issue. As I was saying in the other thread we have trivial 'xdpdump' kern+user app that emits pcap file, but it's too specific to how we use tail_calls+prog_array in our xdp setup. I'm working on the program chaining that will be generic and allow us transparently add multiple xdp or tc progs to the same attachment point and will allow us to do 'xdpdump' at any point of this pipeline, so debugging of what happened to the packet will be easier and done in the same way for both tc and xdp. btw in our experience working with both tc and xdp the tc+bpf was actually harder to use and more bug prone. ^ permalink raw reply [flat|nested] 34+ messages in thread
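For reference, the userspace half of an 'xdpdump'-style tool like the one Alexei describes mostly amounts to wrapping captured packets in standard pcap framing; the capture transport itself (e.g. a perf event buffer fed from the kernel side) is out of scope here. A minimal, self-contained sketch of that framing:

```c
#include <stdint.h>
#include <string.h>

/* Minimal pcap framing for a hypothetical 'xdpdump'-style tool: the kernel
 * side pushes raw packets, and the user side only needs to prepend the
 * standard pcap global header and one record header per packet. */
struct pcap_file_hdr {
	uint32_t magic;                 /* 0xa1b2c3d4 */
	uint16_t ver_major, ver_minor;  /* 2.4 */
	int32_t  thiszone;
	uint32_t sigfigs, snaplen;
	uint32_t linktype;              /* 1 == LINKTYPE_ETHERNET */
};

struct pcap_rec_hdr {
	uint32_t ts_sec, ts_usec;
	uint32_t caplen, len;
};

/* Write the 24-byte pcap global header; returns bytes written. */
static size_t pcap_emit_file_hdr(uint8_t *out)
{
	struct pcap_file_hdr h = {
		.magic = 0xa1b2c3d4, .ver_major = 2, .ver_minor = 4,
		.snaplen = 65535, .linktype = 1,
	};
	memcpy(out, &h, sizeof(h));
	return sizeof(h);
}

/* Write one packet record (16-byte header plus payload). */
static size_t pcap_emit_record(uint8_t *out, const uint8_t *pkt,
			       uint32_t len, uint32_t ts_sec, uint32_t ts_usec)
{
	struct pcap_rec_hdr r = { ts_sec, ts_usec, len, len };
	memcpy(out, &r, sizeof(r));
	memcpy(out + sizeof(r), pkt, len);
	return sizeof(r) + len;
}
```

The interesting part of xdpdump is where in the pipeline the capture hook sits, which is exactly what the program-chaining work above is about; the file format itself is the easy bit.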
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-26 0:26 ` Alexei Starovoitov @ 2017-04-26 3:07 ` John Fastabend 2017-04-26 9:11 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 34+ messages in thread From: John Fastabend @ 2017-04-26 3:07 UTC (permalink / raw) To: Alexei Starovoitov, Jesper Dangaard Brouer Cc: Daniel Borkmann, Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies On 17-04-25 05:26 PM, Alexei Starovoitov wrote: > On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote: >>> Note the very first bpf patchset years ago contained the port table >>> abstraction. ovs has concept of vports as well. These two very >>> different projects needed port table to provide a layer of >>> indirection between ifindex==netdev and virtual port number. >>> This is still the case and I'd like to see this port table to be >>> implemented for both cls_bpf and xdp. In that sense xdp is not >>> special. >> >> Glad to hear you want to see this implemented, I will start coding on >> this then. Good point with cls_bpf, I was planning to make this port >> table strongly connected to XDP, guess I should also think of cls_bpf. > > perfect. > I think we should try to make all additions to bpf networking world > to be usable for both tc and xdp, since both are actively used and > it wouldn't be great to have cool feature for one, but not the other. > I think port table is an excellent candidate that applies to both. +1 Jesper, I was working up the code for the redirect piece for ixgbe and virtio, please use this as a base for your virtual port number table. I'll push an update onto github tomorrow. I think the table should drop in fairly nicely. One piece that isn't clear to me is how do you plan to instantiate and program this table. Is it a new static bpf map that is created any time we see the redirect command? I think this would be preferred. 
> >> I'm not worried about the DROP case, I agree that is fine (as you also >> say). The problem is unintentionally sending a packet to a wrong >> ifindex. This is clearly an eBPF program error, BUT with XDP this >> becomes a very hard to debug program error. With TC-redirect/cls_bpf >> we can tcpdump the packets, with XDP there is no visibility into this >> happening (the NSA is going to love this "feature"). Maybe we could add >> yet-another tracepoint to allow debugging this. My proposal that we >> simply remove the possibility for such program errors, by as you say >> move the validation from run-time into static insertion-time, via a >> port table. > > I think lack of tcpdump-like debugging in xdp is a separate issue. > As I was saying in the other thread we have trivial 'xdpdump' kern+user > app that emits pcap file, but it's too specific to how we use > tail_calls+prog_array in our xdp setup. I'm working on the program > chaining that will be generic and allow us transparently add multiple > xdp or tc progs to the same attachment point and will allow us to > do 'xdpdump' at any point of this pipeline, so debugging of what > happened to the packet will be easier and done in the same way > for both tc and xdp. > btw in our experience working with both tc and xdp the tc+bpf was > actually harder to use and more bug prone. > Nice, the tcpdump-like debugging looks interesting. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-26 3:07 ` John Fastabend @ 2017-04-26 9:11 ` Jesper Dangaard Brouer 2017-04-26 16:35 ` John Fastabend 0 siblings, 1 reply; 34+ messages in thread From: Jesper Dangaard Brouer @ 2017-04-26 9:11 UTC (permalink / raw) To: John Fastabend Cc: Alexei Starovoitov, Daniel Borkmann, Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies, brouer On Tue, 25 Apr 2017 20:07:34 -0700 John Fastabend <john.fastabend@gmail.com> wrote: > On 17-04-25 05:26 PM, Alexei Starovoitov wrote: > > On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote: > >>> Note the very first bpf patchset years ago contained the port table > >>> abstraction. ovs has concept of vports as well. These two very > >>> different projects needed port table to provide a layer of > >>> indirection between ifindex==netdev and virtual port number. > >>> This is still the case and I'd like to see this port table to be > >>> implemented for both cls_bpf and xdp. In that sense xdp is not > >>> special. > >> > >> Glad to hear you want to see this implemented, I will start coding on > >> this then. Good point with cls_bpf, I was planning to make this port > >> table strongly connected to XDP, guess I should also think of cls_bpf. > > > > perfect. > > I think we should try to make all additions to bpf networking world > > to be usable for both tc and xdp, since both are actively used and > > it wouldn't be great to have cool feature for one, but not the other. > > I think port table is an excellent candidate that applies to both. > > +1 > > Jesper, I was working up the code for the redirect piece for ixgbe and > virtio, please use this as a base for your virtual port number table. I'll > push an update onto github tomorrow. I think the table should drop in fairly > nicely. Cool, I will do that. 
Then, I'll also have a redirect method to shape this around, and I would have to benchmark/test your ixgbe redirect. (John please let me know, what github tree we are talking about, and what branch) > One piece that isn't clear to me is how do you plan to instantiate and > program this table. Is it a new static bpf map that is created any > time we see the redirect command? I think this would be preferred. (This is difficult to explain without us misunderstanding each-other) As Alexei also mentioned before, ifindex vs port makes no real difference seen from the bpf program side. It is userspace's responsibility to add ifindex/port's to the bpf-maps, according to how the bpf program "policy" want to "connect" these ports. The port-table system add one extra step, of also adding this port to the port-table (which lives inside the kernel). When loading the XDP program, we also need to pass along a port table "id" this XDP program is associated with (and if it doesn't exists you create it). And your userspace "control-plane" application also need to know this port table "id", when adding a new port. The concept of having multiple port tables is key. As this implies we can have several simultaneous "data-planes" that is *isolated* from each-other. Think about how network-namespaces/containers want isolation. A subtle thing I'm afraid to mention, is that oppose to the ifindex model, a port table with mapping to a net_device pointer, would allow (faster) delivery into the container's inner net_device, which sort of violates the isolation, but I would argue it is not a problem as this net_device pointer could only be added from a process within the namespace. I like this feature, but it could easily be disallowed via port insertion-time validation. > >> I'm not worried about the DROP case, I agree that is fine (as you > >> also say). The problem is unintentionally sending a packet to a > >> wrong ifindex. 
This is clearly an eBPF program error, BUT with > >> XDP this becomes a very hard to debug program error. With > >> TC-redirect/cls_bpf we can tcpdump the packets, with XDP there is > >> no visibility into this happening (the NSA is going to love this > >> "feature"). Maybe we could add yet-another tracepoint to allow > >> debugging this. My proposal that we simply remove the possibility > >> for such program errors, by as you say move the validation from > >> run-time into static insertion-time, via a port table. > > > > I think lack of tcpdump-like debugging in xdp is a separate issue. > > As I was saying in the other thread we have trivial 'xdpdump' > > kern+user app that emits pcap file, but it's too specific to how we > > use tail_calls+prog_array in our xdp setup. I'm working on the > > program chaining that will be generic and allow us transparently > > add multiple xdp or tc progs to the same attachment point and will > > allow us to do 'xdpdump' at any point of this pipeline, so > > debugging of what happened to the packet will be easier and done in > > the same way for both tc and xdp. > > btw in our experience working with both tc and xdp the tc+bpf was > > actually harder to use and more bug prone. > > > > Nice, the tcpdump-like debugging looks interesting. Yes, this xdpdump sound like a very useful tool. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-26 9:11 ` Jesper Dangaard Brouer @ 2017-04-26 16:35 ` John Fastabend 2017-04-26 17:58 ` Alexei Starovoitov 0 siblings, 1 reply; 34+ messages in thread From: John Fastabend @ 2017-04-26 16:35 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Alexei Starovoitov, Daniel Borkmann, Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov, netdev, xdp-newbies [...] >> Jesper, I was working up the code for the redirect piece for ixgbe and >> virtio, please use this as a base for your virtual port number table. I'll >> push an update onto github tomorrow. I think the table should drop in fairly >> nicely. > > Cool, I will do that. Then, I'll also have a redirect method to shape > this around, and I would have to benchmark/test your ixgbe redirect. > > (John please let me know, what github tree we are talking about, and > what branch) > > >> One piece that isn't clear to me is how do you plan to instantiate and >> program this table. Is it a new static bpf map that is created any >> time we see the redirect command? I think this would be preferred. > > (This is difficult to explain without us misunderstanding each-other) > Yep and I'm not sure I follow :) > As Alexei also mentioned before, ifindex vs port makes no real > difference seen from the bpf program side. It is userspace's > responsibility to add ifindex/port's to the bpf-maps, according to how > the bpf program "policy" want to "connect" these ports. The > port-table system add one extra step, of also adding this port to the > port-table (which lives inside the kernel). > I'm not sure I understand the "lives inside the kernel" bit. I assumed the 'map' should be a bpf map and behave like any other bpf map. I wanted a new map to be defined, something like this from the bpf programmer side. 
struct bpf_map_def SEC("maps") port_table = {
	.type = BPF_MAP_TYPE_PORT_CONNECTION,
	.key_size = sizeof(u32),
	.value_size = BPF_PORT_CONNECTION_SIZE,
	.max_entries = 256,
};

> When loading the XDP program, we also need to pass along a port table > "id" this XDP program is associated with (and if it doesn't exists you > create it). And your userspace "control-plane" application also need > to know this port table "id", when adding a new port. So the user space application that is loading the program also needs to handle this map. This seems correct to me. But I don't see the value in making some new port table when we already have well understood framework for maps. > > The concept of having multiple port tables is key. As this implies we > can have several simultaneous "data-planes" that is *isolated* from > each-other. Think about how network-namespaces/containers want > isolation. A subtle thing I'm afraid to mention, is that oppose to the > ifindex model, a port table with mapping to a net_device pointer, would > allow (faster) delivery into the container's inner net_device, which > sort of violates the isolation, but I would argue it is not a problem > as this net_device pointer could only be added from a process within the > namespace. I like this feature, but it could easily be disallowed via > port insertion-time validation. > I think the above optimization should be allowed. And agree multiple port tables (maps?) is needed. Again all this points to using standard maps logic in my mind. For permissions and different domains, which I think you were starting to touch on, it looks like we could extend the pinning API. At the moment it does an inode_permission(inode, MAY_WRITE) check but I presume this could be extended. None of this would be needed in v1 and could be added subsequently. read-only maps seems doable. > >>>> I'm not worried about the DROP case, I agree that is fine (as you >>>> also say).
The problem is unintentionally sending a packet to a >>>> wrong ifindex. This is clearly an eBPF program error, BUT with >>>> XDP this becomes a very hard to debug program error. With >>>> TC-redirect/cls_bpf we can tcpdump the packets, with XDP there is >>>> no visibility into this happening (the NSA is going to love this >>>> "feature"). Maybe we could add yet-another tracepoint to allow >>>> debugging this. My proposal that we simply remove the possibility >>>> for such program errors, by as you say move the validation from >>>> run-time into static insertion-time, via a port table. >>> >>> I think lack of tcpdump-like debugging in xdp is a separate issue. >>> As I was saying in the other thread we have trivial 'xdpdump' >>> kern+user app that emits pcap file, but it's too specific to how we >>> use tail_calls+prog_array in our xdp setup. I'm working on the >>> program chaining that will be generic and allow us transparently >>> add multiple xdp or tc progs to the same attachment point and will >>> allow us to do 'xdpdump' at any point of this pipeline, so >>> debugging of what happened to the packet will be easier and done in >>> the same way for both tc and xdp. >>> btw in our experience working with both tc and xdp the tc+bpf was >>> actually harder to use and more bug prone. >>> >> >> Nice, the tcpdump-like debugging looks interesting. > > Yes, this xdpdump sound like a very useful tool. > ^ permalink raw reply [flat|nested] 34+ messages in thread
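To make John's map proposal concrete, the program side might look roughly like the sketch below. BPF_MAP_TYPE_PORT_CONNECTION and bpf_redirect_port() are hypothetical names from this thread (neither exists in the kernel), and the kernel headers are stubbed out so the sketch compiles on its own:

```c
#include <stdint.h>

/* All names here are hypothetical: BPF_MAP_TYPE_PORT_CONNECTION and
 * bpf_redirect_port() are proposals from the thread, and these stubs
 * stand in for kernel/bpf headers purely for illustration. */
enum xdp_action { XDP_DROP = 1, XDP_PASS = 2, XDP_REDIRECT = 4 };

struct xdp_md { uint32_t data, data_end; };

struct bpf_map_def {
	uint32_t type, key_size, value_size, max_entries;
};

#define BPF_MAP_TYPE_PORT_CONNECTION 0xffff          /* placeholder value */
#define BPF_PORT_CONNECTION_SIZE     sizeof(uint32_t) /* placeholder value */
#define SEC(name)

static struct bpf_map_def SEC("maps") port_table = {
	.type        = BPF_MAP_TYPE_PORT_CONNECTION,
	.key_size    = sizeof(uint32_t),
	.value_size  = BPF_PORT_CONNECTION_SIZE,
	.max_entries = 256,
};

/* Stub for the hypothetical helper: record the chosen port, return verdict. */
static uint32_t chosen_port;
static int bpf_redirect_port(struct bpf_map_def *map, uint32_t port)
{
	(void)map;
	chosen_port = port;
	return XDP_REDIRECT;
}

/* Toy program: redirect everything out of port 7 in the attached table. */
static int xdp_prog(struct xdp_md *ctx)
{
	(void)ctx;
	return bpf_redirect_port(&port_table, 7);
}
```

The appeal of this shape is that the verifier can tie the helper to the map type at load time, exactly as it does for bpf_tail_call() and prog_array today.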
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-26 16:35 ` John Fastabend @ 2017-04-26 17:58 ` Alexei Starovoitov 2017-04-26 20:55 ` Andy Gospodarek 0 siblings, 1 reply; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-26 17:58 UTC (permalink / raw) To: John Fastabend, Jesper Dangaard Brouer Cc: Alexei Starovoitov, Daniel Borkmann, Andy Gospodarek, Daniel Borkmann, netdev, xdp-newbies On 4/26/17 9:35 AM, John Fastabend wrote: > >> As Alexei also mentioned before, ifindex vs port makes no real >> difference seen from the bpf program side. It is userspace's >> responsibility to add ifindex/port's to the bpf-maps, according to how >> the bpf program "policy" want to "connect" these ports. The >> port-table system add one extra step, of also adding this port to the >> port-table (which lives inside the kernel). >> > > I'm not sure I understand the "lives inside the kernel" bit. I assumed > the 'map' should be a bpf map and behave like any other bpf map. > > I wanted a new map to be defined, something like this from the bpf programmer > side. > > struct bpf_map_def SEC("maps") port_table = > .type = BPF_MAP_TYPE_PORT_CONNECTION, > .key_size = sizeof(u32), > .value_size = BPF_PORT_CONNECTION_SIZE, > .max_entries = 256, > }; I like the idea. We have prog_array, perf_event_array, cgroup_array map specializations. This one can be new netdev_array with some new bpf_redirect-like helper accessing it. >> When loading the XDP program, we also need to pass along a port table >> "id" this XDP program is associated with (and if it doesn't exists you >> create it). And your userspace "control-plane" application also need >> to know this port table "id", when adding a new port. > > So the user space application that is loading the program also needs > to handle this map. This seems correct to me. But I don't see the > value in making some new port table when we already have well understood > framework for maps. 
+1 >> >> The concept of having multiple port tables is key. As this implies we >> can have several simultaneous "data-planes" that is *isolated* from >> each-other. Think about how network-namespaces/containers want >> isolation. A subtle thing I'm afraid to mention, is that oppose to the >> ifindex model, a port table with mapping to a net_device pointer, would >> allow (faster) delivery into the container's inner net_device, which >> sort of violates the isolation, but I would argue it is not a problem >> as this net_device pointer could only be added from a process within the >> namespace. I like this feature, but it could easily be disallowed via >> port insertion-time validation. >> > > I think the above optimization should be allowed. And agree multiple port > tables (maps?) is needed. Again all this points to using standard maps > logic in my mind. For permissions and different domains, which I think > you were starting to touch on, it looks like we could extend the pinning API. > At the moment it does an inode_permission(inode, MAY_WRITE) check but I > presume this could be extended. None of this would be needed in v1 and > could be added subsequently. read-only maps seems doable. this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated the user space can make it readonly to prevent further changes. From user space it can be done similar to perf_events/cgroups as well. bpf_map_update_elem(&netdev_array, &port_num, &ifindex) should work. For bpf_map_lookup_elem() from such netdev_array we can return ifindex back. The bpf_map_show_fdinfo() can be customized as well to pretty print ifindexes of netdevs stored in there. ^ permalink raw reply [flat|nested] 34+ messages in thread
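Alexei's populate-then-freeze flow can be modeled in userspace C. The read-only switch and the error codes below are assumptions about how such a netdev_array map could behave, not an existing kernel API:

```c
#include <errno.h>
#include <stdint.h>

/* Userspace model of the proposed flow: populate a netdev_array map
 * (port -> ifindex), then make it read-only so the data plane is frozen.
 * The freeze semantics are an assumption for illustration. */
#define MAX_PORTS 64

struct netdev_array {
	uint32_t ifindex[MAX_PORTS];
	int read_only;
};

static int map_update_elem(struct netdev_array *m, uint32_t port,
			   uint32_t ifindex)
{
	if (m->read_only)
		return -EPERM;          /* further changes rejected */
	if (port >= MAX_PORTS || ifindex == 0)
		return -EINVAL;         /* ifindex 0 is never a valid device */
	m->ifindex[port] = ifindex;
	return 0;
}

static int map_lookup_elem(const struct netdev_array *m, uint32_t port,
			   uint32_t *out)
{
	if (port >= MAX_PORTS || m->ifindex[port] == 0)
		return -ENOENT;
	*out = m->ifindex[port];
	return 0;
}

static void map_set_read_only(struct netdev_array *m)
{
	m->read_only = 1;
}
```

In the real system the update path would also be where the netdevice is resolved and validated, which is what moves the checks from run-time to insertion-time.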
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-26 17:58 ` Alexei Starovoitov @ 2017-04-26 20:55 ` Andy Gospodarek 2017-04-27 8:41 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 34+ messages in thread From: Andy Gospodarek @ 2017-04-26 20:55 UTC (permalink / raw) To: Alexei Starovoitov Cc: John Fastabend, Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote: > On 4/26/17 9:35 AM, John Fastabend wrote: > > > > > As Alexei also mentioned before, ifindex vs port makes no real > > > difference seen from the bpf program side. It is userspace's > > > responsibility to add ifindex/port's to the bpf-maps, according to how > > > the bpf program "policy" want to "connect" these ports. The > > > port-table system add one extra step, of also adding this port to the > > > port-table (which lives inside the kernel). > > > > > > > I'm not sure I understand the "lives inside the kernel" bit. I assumed > > the 'map' should be a bpf map and behave like any other bpf map. > > > > I wanted a new map to be defined, something like this from the bpf programmer > > side. > > > > struct bpf_map_def SEC("maps") port_table = > > .type = BPF_MAP_TYPE_PORT_CONNECTION, > > .key_size = sizeof(u32), > > .value_size = BPF_PORT_CONNECTION_SIZE, > > .max_entries = 256, > > }; > > I like the idea. > We have prog_array, perf_event_array, cgroup_array map specializations. > This one can be new netdev_array with some new bpf_redirect-like helper > accessing it. > > > > When loading the XDP program, we also need to pass along a port table > > > "id" this XDP program is associated with (and if it doesn't exists you > > > create it). And your userspace "control-plane" application also need > > > to know this port table "id", when adding a new port. > > > > So the user space application that is loading the program also needs > > to handle this map. 
This seems correct to me. But I don't see the > > value in making some new port table when we already have well understood > > framework for maps. > > +1 > > > > > > > The concept of having multiple port tables is key. As this implies we > > > can have several simultaneous "data-planes" that is *isolated* from > > > each-other. Think about how network-namespaces/containers want > > > isolation. A subtle thing I'm afraid to mention, is that oppose to the > > > ifindex model, a port table with mapping to a net_device pointer, would > > > allow (faster) delivery into the container's inner net_device, which > > > sort of violates the isolation, but I would argue it is not a problem > > > as this net_device pointer could only be added from a process within the > > > namespace. I like this feature, but it could easily be disallowed via > > > port insertion-time validation. > > > > > > > I think the above optimization should be allowed. And agree multiple port > > tables (maps?) is needed. Again all this points to using standard maps > > logic in my mind. For permissions and different domains, which I think > > you were starting to touch on, it looks like we could extend the pinning API. > > At the moment it does an inode_permission(inode, MAY_WRITE) check but I > > presume this could be extended. None of this would be needed in v1 and > > could be added subsequently. read-only maps seems doable. > > this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated > the user space can make it readonly to prevent further changes. > > From user space it can be done similar to perf_events/cgroups as well. > bpf_map_update_elem(&netdev_array, &port_num, &ifindex) > should work. > For bpf_map_lookup_elem() from such netdev_array we can return > ifindex back. > The bpf_map_show_fdinfo() can be customized as well to pretty print > ifindexes of netdevs stored in there. > I agree with both of you on all of these points. 
Having the port redirection in a new type of map and/or array seems like the way to go. I understood Jesper's perspective when thinking about a way to pass a port-table id down, but I think the idea that the userspace loader code defining the maps is going to be the one making this link is the right idea and handling things like ifindex changes (rather than identifiers that perform lookups in other tables) is going to have to be yet another exercise left up to the...user. :-) ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-26 20:55 ` Andy Gospodarek @ 2017-04-27 8:41 ` Jesper Dangaard Brouer 2017-04-27 23:31 ` Alexei Starovoitov 0 siblings, 1 reply; 34+ messages in thread From: Jesper Dangaard Brouer @ 2017-04-27 8:41 UTC (permalink / raw) To: Andy Gospodarek Cc: Alexei Starovoitov, John Fastabend, Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies, brouer On Wed, 26 Apr 2017 16:55:44 -0400 Andy Gospodarek <andy@greyhouse.net> wrote: > On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote: > > On 4/26/17 9:35 AM, John Fastabend wrote: > > > > > > > As Alexei also mentioned before, ifindex vs port makes no real > > > > difference seen from the bpf program side. It is userspace's > > > > responsibility to add ifindex/port's to the bpf-maps, according to how > > > > the bpf program "policy" want to "connect" these ports. The > > > > port-table system add one extra step, of also adding this port to the > > > > port-table (which lives inside the kernel). > > > > > > > > > > I'm not sure I understand the "lives inside the kernel" bit. I assumed > > > the 'map' should be a bpf map and behave like any other bpf map. > > > > > > I wanted a new map to be defined, something like this from the bpf programmer > > > side. > > > > > > struct bpf_map_def SEC("maps") port_table = > > > .type = BPF_MAP_TYPE_PORT_CONNECTION, > > > .key_size = sizeof(u32), > > > .value_size = BPF_PORT_CONNECTION_SIZE, > > > .max_entries = 256, > > > }; > > > > I like the idea. > > We have prog_array, perf_event_array, cgroup_array map specializations. > > This one can be new netdev_array with some new bpf_redirect-like helper > > accessing it. > > > > > > When loading the XDP program, we also need to pass along a port table > > > > "id" this XDP program is associated with (and if it doesn't exists you > > > > create it). 
And your userspace "control-plane" application also need > > > > to know this port table "id", when adding a new port. > > > > > > So the user space application that is loading the program also needs > > > to handle this map. This seems correct to me. But I don't see the > > > value in making some new port table when we already have well understood > > > framework for maps. > > > > +1 > > > > > > > > > > The concept of having multiple port tables is key. As this implies we > > > > can have several simultaneous "data-planes" that is *isolated* from > > > > each-other. Think about how network-namespaces/containers want > > > > isolation. A subtle thing I'm afraid to mention, is that oppose to the > > > > ifindex model, a port table with mapping to a net_device pointer, would > > > > allow (faster) delivery into the container's inner net_device, which > > > > sort of violates the isolation, but I would argue it is not a problem > > > > as this net_device pointer could only be added from a process within the > > > > namespace. I like this feature, but it could easily be disallowed via > > > > port insertion-time validation. > > > > > > > > > > I think the above optimization should be allowed. And agree multiple port > > > tables (maps?) is needed. Again all this points to using standard maps > > > logic in my mind. For permissions and different domains, which I think > > > you were starting to touch on, it looks like we could extend the pinning API. > > > At the moment it does an inode_permission(inode, MAY_WRITE) check but I > > > presume this could be extended. None of this would be needed in v1 and > > > could be added subsequently. read-only maps seems doable. > > > > this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated > > the user space can make it readonly to prevent further changes. > > > > From user space it can be done similar to perf_events/cgroups as well. > > bpf_map_update_elem(&netdev_array, &port_num, &ifindex) > > should work. 
> > For bpf_map_lookup_elem() from such netdev_array we can return > > ifindex back. > > The bpf_map_show_fdinfo() can be customized as well to pretty print > > ifindexes of netdevs stored in there. > > > > I agree with both of you on all of these points. Having the port > redirection in a new type of map and/or array seems like the way to go. > > I understood Jesper's perspecitive when thinking about a way to pass a > port-table id down, but I think the idea that the userspace loader code > defining the maps is going to be the one making this link is the right > idea and handling things like ifindex changes (rather than identifiers > that perform lookups in other tables) is going to have to be yet another > exercise left up to the...user. :-) > I love this idea. Integrating the port table closer with the bpf-maps infrastructure makes sense. This gives me a place to hook the code into, instead of (re)inventing a new infrastructure for this port table, and the interface will be more natural from a bpf-API point of view. When registering/attaching a XDP/bpf program, we would just send the file-descriptor for this port-map along (like we do with the bpf_prog FD). Plus, it own ingress-port number this program is in the port-map. It is not clear to me, in-which-data-structure on the kernel-side we store this reference to the port-map and ingress-port. As today we only have the "raw" struct bpf_prog pointer. I see several options: 1. Create a new xdp_prog struct that contains existing bpf_prog, a port-map pointer and ingress-port. (IMHO easiest solution) 2. Just create a new pointer to port-map and store it in driver rx-ring struct (like existing bpf_prog), but this create a race-challenge replacing (cmpxchg) the program (or perhaps it's not a problem as it runs under rcu and RTNL-lock). 3. Extend bpf_prog to store this port-map and ingress-port, and have a fast-way to access it. 
I assume it will be accessible via bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 34+ messages in thread
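A minimal sketch of option 1 above, with all names hypothetical: the driver would publish a single struct xdp_prog pointer instead of the raw bpf_prog pointer it swaps today, so the program, its port-map, and its own ingress port are replaced atomically together:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of option 1: wrap the existing bpf_prog together
 * with the port-map reference and this attachment's own ingress port. */
struct bpf_prog;   /* opaque here */
struct bpf_map;    /* opaque here */

struct xdp_prog {
	struct bpf_prog *prog;         /* existing program, unchanged */
	struct bpf_map  *port_map;     /* table this program may redirect into */
	uint32_t         ingress_port; /* this attachment's own port number */
};

static struct xdp_prog xdp_prog_init(struct bpf_prog *p, struct bpf_map *m,
				     uint32_t port)
{
	struct xdp_prog xp = { p, m, port };
	return xp;
}
```

Because the wrapper is one allocation, the existing rcu/RTNL pointer-swap discipline from option 2 would carry over unchanged, which is presumably why option 1 is the easiest of the three.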
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-27 8:41 ` Jesper Dangaard Brouer @ 2017-04-27 23:31 ` Alexei Starovoitov 2017-04-28 5:06 ` John Fastabend 2017-04-28 10:58 ` Jesper Dangaard Brouer 0 siblings, 2 replies; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-27 23:31 UTC (permalink / raw) To: Jesper Dangaard Brouer, Andy Gospodarek Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote: > When registering/attaching a XDP/bpf program, we would just send the > file-descriptor for this port-map along (like we do with the bpf_prog > FD). Plus, it own ingress-port number this program is in the port-map. > > It is not clear to me, in-which-data-structure on the kernel-side we > store this reference to the port-map and ingress-port. As today we only > have the "raw" struct bpf_prog pointer. I see several options: > > 1. Create a new xdp_prog struct that contains existing bpf_prog, > a port-map pointer and ingress-port. (IMHO easiest solution) > > 2. Just create a new pointer to port-map and store it in driver rx-ring > struct (like existing bpf_prog), but this create a race-challenge > replacing (cmpxchg) the program (or perhaps it's not a problem as it > runs under rcu and RTNL-lock). > > 3. Extend bpf_prog to store this port-map and ingress-port, and have a > fast-way to access it. I assume it will be accessible via > bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP. I'm not sure I completely follow the 3 proposals. Are you suggesting to have only one netdev_array per program? Why not to allow any number like we do for tailcall+prog_array, etc? We can teach verifier to allow new helper bpf_tx_port(netdev_array, port_num); to only be used with netdev_array map type. It will fetch netdevice pointer from netdev_array[port_num] and will tx the packet into it. 
We can make it similar to bpf_tail_call(), so that program will finish on successful bpf_tx_port() or make it into 'delayed' tx which will be executed when program finishes. Not sure which approach is better. We can also extend this netdev_array into broadcast/multicast. Like bpf_tx_allports(&netdev_array); call from the program will xmit the packet to all netdevices in that 'netdev_array' map type. The map-in-map support can be trivially extended to allow netdev_array, then the program can create N multicast groups of netdevices. Each multicast group == one netdev_array map. The user space will populate a hashmap with these netdev_arrays and bpf kernel side can select dynamically which multicast group to use to send the packets to. bpf kernel side may look like: struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key); if (!netdev_array) ... if (my_condition) bpf_tx_allports(netdev_array); /* broadcast to all netdevices */ else bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */ that's an artificial example. Just trying to point out that we shouldn't restrict the feature too soon. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-27 23:31 ` Alexei Starovoitov @ 2017-04-28 5:06 ` John Fastabend 2017-04-28 5:30 ` Alexei Starovoitov 2017-04-28 10:58 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 34+ messages in thread From: John Fastabend @ 2017-04-28 5:06 UTC (permalink / raw) To: Alexei Starovoitov, Jesper Dangaard Brouer, Andy Gospodarek Cc: Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies On 17-04-27 04:31 PM, Alexei Starovoitov wrote: > On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote: >> When registering/attaching a XDP/bpf program, we would just send the >> file-descriptor for this port-map along (like we do with the bpf_prog >> FD). Plus, it own ingress-port number this program is in the port-map. >> >> It is not clear to me, in-which-data-structure on the kernel-side we >> store this reference to the port-map and ingress-port. As today we only >> have the "raw" struct bpf_prog pointer. I see several options: >> >> 1. Create a new xdp_prog struct that contains existing bpf_prog, >> a port-map pointer and ingress-port. (IMHO easiest solution) >> >> 2. Just create a new pointer to port-map and store it in driver rx-ring >> struct (like existing bpf_prog), but this create a race-challenge >> replacing (cmpxchg) the program (or perhaps it's not a problem as it >> runs under rcu and RTNL-lock). >> >> 3. Extend bpf_prog to store this port-map and ingress-port, and have a >> fast-way to access it. I assume it will be accessible via >> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP. > > I'm not sure I completely follow the 3 proposals. > Are you suggesting to have only one netdev_array per program? > Why not to allow any number like we do for tailcall+prog_array, etc? > We can teach verifier to allow new helper > bpf_tx_port(netdev_array, port_num); > to only be used with netdev_array map type. 
> It will fetch netdevice pointer from netdev_array[port_num] > and will tx the packet into it. > We can make it similar to bpf_tail_call(), so that program will > finish on successful bpf_tx_port() or > make it into 'delayed' tx which will be executed when program finishes. > Not sure which approach is better. My reaction would be to make it finish on success but would like to write a few programs first and see. I can't think of any use _not_ to terminate but maybe there is something I'm missing. > > We can also extend this netdev_array into broadcast/multicast. Like > bpf_tx_allports(&netdev_array); > call from the program will xmit the packet to all netdevices > in that 'netdev_array' map type. Yep nice solution to the multicast problem. > > The map-in-map support can be trivially extended to allow netdev_array, > then the program can create N multicast groups of netdevices. > Each multicast group == one netdev_array map. > The user space will populate a hashmap with these netdev_arrays and > bpf kernel side can select dynamically which multicast group to use > to send the packets to. > bpf kernel side may look like: > struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key); > if (!netdev_array) > ... > if (my_condition) > bpf_tx_allports(netdev_array); /* broadcast to all netdevices */ > else > bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */ > > that's an artificial example. Just trying to point out > that we shouldn't restrict the feature too soon. > That is more or less what I was thinking as well. The other question I have though is should we have a bpf_redirect() call for the simple case where I use the ifindex directly. This will be helpful for taking existing programs from tc_cls into xdp. I think it makes sense to have both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect(). .John ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-28 5:06 ` John Fastabend @ 2017-04-28 5:30 ` Alexei Starovoitov 2017-04-28 19:43 ` Hannes Frederic Sowa 0 siblings, 1 reply; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-28 5:30 UTC (permalink / raw) To: John Fastabend, Jesper Dangaard Brouer, Andy Gospodarek Cc: Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies On 4/27/17 10:06 PM, John Fastabend wrote: > That is more or less what I was thinking as well. The other question > I have though is should we have a bpf_redirect() call for the simple > case where I use the ifindex directly. This will be helpful for taking > existing programs from tc_cls into xdp. I think it makes sense to have > both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect(). I think so too. Once netdevice is stored into netdev_array map the netdevice is pinned and we need to figure out what to do if somebody tries to delete it. Should we add a new netlink notifier that this netdev's refcnt is almost zero and it's only in netdev_array(s) ? or should it be deleted from the array(s) automatically and then user space will be notified post-deletion? Both approaches have their pros and cons. Whereas raw ifindex approach (via bpf_redirect) doesn't have these caveats. It's clear to both bpf prog and user space that ifindex can be stale and user space needs to monitor netdevs and update programs/maps. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-28 5:30 ` Alexei Starovoitov @ 2017-04-28 19:43 ` Hannes Frederic Sowa 2017-04-30 1:35 ` Alexei Starovoitov 0 siblings, 1 reply; 34+ messages in thread From: Hannes Frederic Sowa @ 2017-04-28 19:43 UTC (permalink / raw) To: Alexei Starovoitov, John Fastabend, Jesper Dangaard Brouer, Andy Gospodarek Cc: Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies On 28.04.2017 07:30, Alexei Starovoitov wrote: > On 4/27/17 10:06 PM, John Fastabend wrote: >> That is more or less what I was thinking as well. The other question >> I have though is should we have a bpf_redirect() call for the simple >> case where I use the ifindex directly. This will be helpful for taking >> existing programs from tc_cls into xdp. I think it makes sense to have >> both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect(). > > I think so too. > Once netdevice is stored into netdev_array map the netdevice is pinned > and we need to figure out what to do if somebody tries to delete it. > Should we add a new netlink notifier that this netdev's refcnt is > almost zero and it's only in netdev_array(s) ? We basically do that automatically in netdev_wait_allrefs: pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n", dev->name, refcnt); It is a very unpleasant warning and users probably think about a bug in the kernel at first. I don't think we should wait for user space to clean that up but have to do it automatically from the kernel. Maybe we can introduce a special value that basically NOPs the transmission. The hash table itself would install a netdevice notifier and would clean all tables. Could definitely cause some storm in the kernel, if a lot of keys are mapped to the same interface. > or should it be deleted from the array(s) automatically and > then user space will be notified post-deletion? > Both approaches have their pros and cons. 
I am leaning more towards deleting it automatically. But walking all tables and all keys in them might cause some unwanted load spikes. > Whereas raw ifindex approach (via bpf_redirect) doesn't have these > caveats. It's clear to both bpf prog and user space that ifindex > can be stale and user space needs to monitor netdevs and update > programs/maps. A separate type for ifindex as key or value might be nice to expose this information directly via the kernel (fdinfo etc.), but at the same time, debugging infrastructure from user space can also easily deal with that. Another approach would be: ifindexes are allocated cyclically and also are signed int and not u32 during allocation. Maybe we can negate the ifindex during transmission in the table and thus mark it as stale (or set it to -1)? This update would be done by bpf_tx_*ports(), which take a reference to a table and a key, submit the packets on the appropriate ports, and can flag the relevant ifindexes as stale. I just wanted to draft this idea; I am not particularly happy with it. Maybe someone comes up with a better one. Thanks, Hannes ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-28 19:43 ` Hannes Frederic Sowa @ 2017-04-30 1:35 ` Alexei Starovoitov 0 siblings, 0 replies; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-30 1:35 UTC (permalink / raw) To: Hannes Frederic Sowa, John Fastabend, Jesper Dangaard Brouer, Andy Gospodarek Cc: Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies On 4/28/17 12:43 PM, Hannes Frederic Sowa wrote: > On 28.04.2017 07:30, Alexei Starovoitov wrote: >> On 4/27/17 10:06 PM, John Fastabend wrote: >>> That is more or less what I was thinking as well. The other question >>> I have though is should we have a bpf_redirect() call for the simple >>> case where I use the ifindex directly. This will be helpful for taking >>> existing programs from tc_cls into xdp. I think it makes sense to have >>> both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect(). >> >> I think so too. >> Once netdevice is stored into netdev_array map the netdevice is pinned >> and we need to figure out what to do if somebody tries to delete it. >> Should we add a new netlink notifier that this netdev's refcnt is >> almost zero and it's only in netdev_array(s) ? > > We basically do that automatically in netdev_wait_allrefs: > > pr_emerg("unregister_netdevice: waiting for %s to become free. Usage > count = %d\n", > dev->name, refcnt); > > It is a very unpleasant warning and users probably think about a bug in > the kernel at first. > > I don't think we should wait for user space to clean that up but have to > do it automatically from the kernel. Maybe we can introduce a special > value that basically NOPs the transmission. The hash table itself would > install a netdevice notifier and would clean all tables. Could > definitely cause some storm in the kernel, if a lot of keys are mapped > to the same interface. 
> >> or should it be deleted from the array(s) automatically and >> then user space will be notified post-deletion? >> Both approaches have their pros and cons. > > I am leaning more towards deleting it automatically. But walking all > tables and in there all keys might cause some unwanted load spikes. yeah, when netdev is being unregistered we have to drop the ref to avoid the warning. Speaking of delete notifiers for maps. Until recently all deletions were explicit and the program could do extra perf_event_submit() if it needed to notify user space of deletion. But right now we have LRU map that deletes automatically and this netdev_array will be deleting automatically too, so we probably need some sort of generic map notifier for all map types. It can be as simple as user space sleeping on read(map_fd, buf); or waiting for epoll on map_fd and kernel wakes it up when map element is deleted and pushes 'key/value' pair into the buf. That should be generic for all map types, but it needs some mechanism to avoid blocking, since number of deletions can be huge. Something similar to perf's event record 'lost N records' should do the trick. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-27 23:31 ` Alexei Starovoitov 2017-04-28 5:06 ` John Fastabend @ 2017-04-28 10:58 ` Jesper Dangaard Brouer 2017-04-30 1:04 ` Alexei Starovoitov 1 sibling, 1 reply; 34+ messages in thread From: Jesper Dangaard Brouer @ 2017-04-28 10:58 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andy Gospodarek, John Fastabend, Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies, brouer On Thu, 27 Apr 2017 16:31:14 -0700 Alexei Starovoitov <ast@fb.com> wrote: > On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote: > > When registering/attaching a XDP/bpf program, we would just send the > > file-descriptor for this port-map along (like we do with the bpf_prog > > FD). Plus, it own ingress-port number this program is in the port-map. > > > > It is not clear to me, in-which-data-structure on the kernel-side we > > store this reference to the port-map and ingress-port. As today we only > > have the "raw" struct bpf_prog pointer. I see several options: > > > > 1. Create a new xdp_prog struct that contains existing bpf_prog, > > a port-map pointer and ingress-port. (IMHO easiest solution) > > > > 2. Just create a new pointer to port-map and store it in driver rx-ring > > struct (like existing bpf_prog), but this create a race-challenge > > replacing (cmpxchg) the program (or perhaps it's not a problem as it > > runs under rcu and RTNL-lock). > > > > 3. Extend bpf_prog to store this port-map and ingress-port, and have a > > fast-way to access it. I assume it will be accessible via > > bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP. > > I'm not sure I completely follow the 3 proposals. > Are you suggesting to have only one netdev_array per program? Yes, but I can see you have a more clever idea below. > Why not to allow any number like we do for tailcall+prog_array, etc? 
> We can teach verifier to allow new helper > bpf_tx_port(netdev_array, port_num); > to only be used with netdev_array map type. > It will fetch netdevice pointer from netdev_array[port_num] > and will tx the packet into it. I love it. I just don't like the "netdev" part of the name "netdev_array", as one basic idea of a port table is that a port can be anything that can consume an xdp_buff packet. This generalization allows us to move code out of the drivers. We might be on the same page, as I do imagine that netdev_array or port_array is just a struct bpf_map pointer, and the bpf_map->map_type will tell us that this bpf_map contains net_device pointers. Thus, when later introducing a new type of redirect (like to a socket or remote-CPU) then we just add a new bpf_map_type for this, without needing to change anything in the drivers, right? Do you imagine that bpf-side bpf_tx_port() returns XDP_REDIRECT? Or does it return whether the call was successful (e.g. validating that port_num exists in the map)? On the kernel side, we need to receive this info "port_array" and "port_num"; given you don't provide the call a xdp_buff/ctx, I assume you want the per-CPU temp-store solution. Then during the XDP_REDIRECT action we call a core redirect function that, based on the bpf_map_type, does a lookup and finds the net_device ptr. > We can make it similar to bpf_tail_call(), so that program will > finish on successful bpf_tx_port() or > make it into 'delayed' tx which will be executed when program finishes. > Not sure which approach is better. I know you are talking about something slightly different, about delaying TX. But I want to mention (as I've done before) that it is important (for me) that we get bulking working/integrated. I imagine the driver will call a function that will delay the TX/redirect action and at the end of the NAPI cycle have a function that flushes packets, bulked per destination port. 
I was wondering where to store these delayed TX packets, but now that we have an associated bpf_map data-structure (netdev_array), I'm thinking about storing packets (ordered by port) inside that. And then have a bpf_tx_flush(netdev_array) call in the driver (for every port-table-map seen, which will likely be small). > We can also extend this netdev_array into broadcast/multicast. Like > bpf_tx_allports(&netdev_array); > call from the program will xmit the packet to all netdevices > in that 'netdev_array' map type. When broadcasting you often don't want to broadcast the packet out of the incoming interface. How can you support this? Normally you would know your ingress port, and then excluded that port in the broadcast. But with many netdev_array's how do the program know it's own ingress port. > The map-in-map support can be trivially extended to allow netdev_array, > then the program can create N multicast groups of netdevices. > Each multicast group == one netdev_array map. > The user space will populate a hashmap with these netdev_arrays and > bpf kernel side can select dynamically which multicast group to use > to send the packets to. > bpf kernel side may look like: > struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key); > if (!netdev_array) > ... > if (my_condition) > bpf_tx_allports(netdev_array); /* broadcast to all netdevices */ > else > bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */ > > that's an artificial example. Just trying to point out > that we shouldn't restrict the feature too soon. I like how you solve the multicast problem. (But I do need to learn some more of the inner-workings of bpf map-in-map to follow this completely). 
Thanks a lot for all this input, I got a much clearer picture of how I can/should implement this :-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-28 10:58 ` Jesper Dangaard Brouer @ 2017-04-30 1:04 ` Alexei Starovoitov 2017-04-30 22:55 ` John Fastabend 0 siblings, 1 reply; 34+ messages in thread From: Alexei Starovoitov @ 2017-04-30 1:04 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Andy Gospodarek, John Fastabend, Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote: > On Thu, 27 Apr 2017 16:31:14 -0700 > Alexei Starovoitov <ast@fb.com> wrote: > >> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote: >>> When registering/attaching a XDP/bpf program, we would just send the >>> file-descriptor for this port-map along (like we do with the bpf_prog >>> FD). Plus, it own ingress-port number this program is in the port-map. >>> >>> It is not clear to me, in-which-data-structure on the kernel-side we >>> store this reference to the port-map and ingress-port. As today we only >>> have the "raw" struct bpf_prog pointer. I see several options: >>> >>> 1. Create a new xdp_prog struct that contains existing bpf_prog, >>> a port-map pointer and ingress-port. (IMHO easiest solution) >>> >>> 2. Just create a new pointer to port-map and store it in driver rx-ring >>> struct (like existing bpf_prog), but this create a race-challenge >>> replacing (cmpxchg) the program (or perhaps it's not a problem as it >>> runs under rcu and RTNL-lock). >>> >>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a >>> fast-way to access it. I assume it will be accessible via >>> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP. >> >> I'm not sure I completely follow the 3 proposals. >> Are you suggesting to have only one netdev_array per program? > > Yes, but I can see you have a more clever idea below. > >> Why not to allow any number like we do for tailcall+prog_array, etc? 
> >> We can teach verifier to allow new helper >> bpf_tx_port(netdev_array, port_num); >> to only be used with netdev_array map type. >> It will fetch netdevice pointer from netdev_array[port_num] >> and will tx the packet into it. > > I love it. > > I just don't like the "netdev" part of the name "netdev_array" as one > basic ideas of a port tabel, is that a port can be anything that can > consume a XDP_buff packet. This generalization allow us to move code > out of the drivers. We might be on the same page, as I do imagine that > netdev_array or port_array is just a struct bpf_map pointer, and the > bpf_map->map_type will tell us that this bpf_map contains net_device > pointers. Thus, when later introducing a new type of redirect (like to > a socket or remote-CPU) then we just add a new bpf_map_type for this, > without needing to change anything in the drivers, right? In theory, yes, but in practice I doubt it will be so easy. We probably shouldn't allow very different types of netdev into the same netdev_array or port_array (whatever the name). They need to be similar enough, otherwise we'd have to do run-time checks. If they're all the same, these checks can be done at insertion time instead. > Do you imagine that bpf-side bpf_tx_port() returns XDP_REDIRECT? > Or does it return if the call was successful (e.g validate port_num > existed in map)? don't know :) we need to brainstorm pros and cons. > On the kernel side, we need to receive this info "port_array" and > "port_num", given you don't provide the call a xdp_buff/ctx, then I > assume you want the per-CPU temp-store solution. Then during the > XDP_REDIRECT action we call a core redirect function that based on the > bpf_map_type does a lookup, and find the net_device ptr. hmm. didn't think that far either :) indeed makes sense to pass 'ctx' into such helper as well, so it's easier to deal with original netdev. 
>> We can make it similar to bpf_tail_call(), so that program will >> finish on successful bpf_tx_port() or >> make it into 'delayed' tx which will be executed when program finishes. >> Not sure which approach is better. > > I know you are talking about something slightly different, about > delaying TX. > > But I want to mention (as I've done before) that it is important (for > me) that we get bulking working/integrated. I imagine the driver will > call a function that will delay the TX/redirect action and at the end > of the NAPI cycle have a function that flush packets, bulk per > destination port. > > I was wondering where to store these delayed TX packets, but now that > we have an associated bpf_map data-structure (netdev_array), I'm thinking > about storing packets (ordered by port) inside that. And then have a > bpf_tx_flush(netdev_array) call in the driver (for every port-table-map > seen, which will likely be small). makes sense to me as well. Ideally we should try to make an api such, that batching or no-batching can be kernel choice. The program will just do xdp_tx_port(...something here...) and the kernel does the best for performance. If it needs to delay the result to do batching the api should allow that transparently. Like right now xdp program does 'return XDP_TX;' and on the kernel side we can either do immediate xmit (like we do today) or can change it to do batching and programs don't need to change. >> We can also extend this netdev_array into broadcast/multicast. Like >> bpf_tx_allports(&netdev_array); >> call from the program will xmit the packet to all netdevices >> in that 'netdev_array' map type. > > When broadcasting you often don't want to broadcast the packet out of > the incoming interface. How can you support this? > > Normally you would know your ingress port, and then excluded that port > in the broadcast. But with many netdev_array's how do the program know > it's own ingress port. absolutely! 
bpf_tx_allports() should somehow exclude the port packet arrived on. What you're proposing about passing 'ctx' into this helper, should solve it, I guess. > Thanks a lot for all this input, I got a much more clear picture of how > I can/should implement this :-) awesome :) Let's brainstorm more and get John's opinion on it as well, since sounds like he'll be heavy user of such api. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port? 2017-04-30 1:04 ` Alexei Starovoitov @ 2017-04-30 22:55 ` John Fastabend 0 siblings, 0 replies; 34+ messages in thread From: John Fastabend @ 2017-04-30 22:55 UTC (permalink / raw) To: Alexei Starovoitov, Jesper Dangaard Brouer Cc: Andy Gospodarek, Alexei Starovoitov, Daniel Borkmann, Daniel Borkmann, netdev, xdp-newbies On 17-04-29 06:04 PM, Alexei Starovoitov wrote: > On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote: >> On Thu, 27 Apr 2017 16:31:14 -0700 >> Alexei Starovoitov <ast@fb.com> wrote: >> >>> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote: >>>> When registering/attaching a XDP/bpf program, we would just send the >>>> file-descriptor for this port-map along (like we do with the bpf_prog >>>> FD). Plus, it own ingress-port number this program is in the port-map. >>>> >>>> It is not clear to me, in-which-data-structure on the kernel-side we >>>> store this reference to the port-map and ingress-port. As today we only >>>> have the "raw" struct bpf_prog pointer. I see several options: >>>> >>>> 1. Create a new xdp_prog struct that contains existing bpf_prog, >>>> a port-map pointer and ingress-port. (IMHO easiest solution) >>>> >>>> 2. Just create a new pointer to port-map and store it in driver rx-ring >>>> struct (like existing bpf_prog), but this create a race-challenge >>>> replacing (cmpxchg) the program (or perhaps it's not a problem as it >>>> runs under rcu and RTNL-lock). >>>> >>>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a >>>> fast-way to access it. I assume it will be accessible via >>>> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP. >>> >>> I'm not sure I completely follow the 3 proposals. >>> Are you suggesting to have only one netdev_array per program? >> >> Yes, but I can see you have a more clever idea below. >> >>> Why not to allow any number like we do for tailcall+prog_array, etc? 
>> >>> We can teach verifier to allow new helper >>> bpf_tx_port(netdev_array, port_num); >>> to only be used with netdev_array map type. >>> It will fetch netdevice pointer from netdev_array[port_num] >>> and will tx the packet into it. >> >> I love it. >> >> I just don't like the "netdev" part of the name "netdev_array" as one >> basic ideas of a port tabel, is that a port can be anything that can >> consume a XDP_buff packet. This generalization allow us to move code >> out of the drivers. We might be on the same page, as I do imagine that >> netdev_array or port_array is just a struct bpf_map pointer, and the >> bpf_map->map_type will tell us that this bpf_map contains net_device >> pointers. Thus, when later introducing a new type of redirect (like to >> a socket or remote-CPU) then we just add a new bpf_map_type for this, >> without needing to change anything in the drivers, right? > > In theory, yes, but in practice I doubt it will be so easy. > We probably shouldn't allow very different types of netdev > into the same netdev_array or port_array (whatever the name). > They need to be similar enough, otherwise we'd have to do run-time > checks. If they're all the same, these checks can be done at > insertion time instead. > I think we can just have different map types for each redirect type. So a netdev_map_array, socket_map_array, etc. >> Do you imagine that bpf-side bpf_tx_port() returns XDP_REDIRECT? >> Or does it return if the call was successful (e.g validate port_num >> existed in map)? > > don't know :) > we need to brainstorm pros and cons. > >> On the kernel side, we need to receive this info "port_array" and >> "port_num", given you don't provide the call a xdp_buff/ctx, then I >> assume you want the per-CPU temp-store solution. Then during the >> XDP_REDIRECT action we call a core redirect function that based on the >> bpf_map_type does a lookup, and find the net_device ptr. > > hmm. 
didn't think that far either :) > indeed makes sense to pass 'ctx' into such helper as well, > so it's easier to deal with original netdev. > >>> We can make it similar to bpf_tail_call(), so that program will >>> finish on successful bpf_tx_port() or >>> make it into 'delayed' tx which will be executed when program finishes. >>> Not sure which approach is better. >> >> I know you are talking about something slightly different, about >> delaying TX. >> >> But I want to mention (as I've done before) that it is important (for >> me) that we get bulking working/integrated. I imagine the driver will >> call a function that will delay the TX/redirect action and at the end >> of the NAPI cycle have a function that flush packets, bulk per >> destination port. >> >> I was wondering where to store these delayed TX packets, but now that >> we have an associated bpf_map data-structure (netdev_array), I'm thinking >> about storing packets (ordered by port) inside that. And then have a >> bpf_tx_flush(netdev_array) call in the driver (for every port-table-map >> seen, which will likely be small). > > makes sense to me as well. > Ideally we should try to make an api such, that batching or no-batching > can be kernel choice. The program will just do > xdp_tx_port(...something here...) > and the kernel does the best for performance. If it needs to delay > the result to do batching the api should allow that transparently. > Like right now xdp program does 'return XDP_TX;' and > on the kernel side we can either do immediate xmit (like we do today) > or can change it to do batching and programs don't need to change. I think on the kernel side we can just add a flag to tell the xmit routine to hold off hitting the tail. Then when we have the last packet we can just set the flag. For the case where you don't know when the last packet is coming you can send a NULL xdp buff and have the tail updated. 
This seems to give the most flexibility to driver writers to optimize for their hardware. I also tried a xmit routine to send many xdp_bufs in one call. This became a bit ugly IMO and I decided the above was actually better. The basic problem is, in general, we get packets for all sorts of ports and do not get say 10 packets all for one port in one call. > >>> We can also extend this netdev_array into broadcast/multicast. Like >>> bpf_tx_allports(&netdev_array); >>> call from the program will xmit the packet to all netdevices >>> in that 'netdev_array' map type. >> >> When broadcasting you often don't want to broadcast the packet out of >> the incoming interface. How can you support this? >> >> Normally you would know your ingress port, and then excluded that port >> in the broadcast. But with many netdev_array's how do the program know >> it's own ingress port. > > absolutely! > bpf_tx_allports() should somehow exclude the port packet arrived on. > What you're proposing about passing 'ctx' into this helper, > should solve it, I guess. +1 seems reasonable. > >> Thanks a lot for all this input, I got a much more clear picture of how >> I can/should implement this :-) > > awesome :) Let's brainstorm more and get John's opinion on it as well, > since sounds like he'll be heavy user of such api. > Yep I have a couple cls_tc programs that I'm eager to port as soon as the infrastructure is there. I think this is generally headed the "right" direction. I'm trying to wrap up the qdisc codes now and then will start working on getting the basic xdp_redirect working. The qdisc stuff took a bit longer than I hoped :/ Thanks, John ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 22:51 ` Daniel Borkmann
  2017-04-20  2:56   ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov
@ 2017-04-20  6:39 ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-20  6:39 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Andy Gospodarek, Daniel Borkmann, Alexei Starovoitov,
	Alexei Starovoitov, netdev, xdp-newbies, John Fastabend, brouer

On Thu, 20 Apr 2017 00:51:31 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 04/19/2017 10:02 PM, Andy Gospodarek wrote:
> [...]
> > and then lookup this dest in a table we have the option to make that
> > dest an ifindex/socket/other.
> >
> > I did also look at JohnF's patch and I do like the simplicity of the redirect
> > action and new ndo_xdp_xmit and how it moves towards a way to transmit the
> > frame. The downside is that it presumes an ifindex, so it might not be ideal
> > if we want the lookup to return something other than an ifindex.
> >
> [...]
> > would be handled. If we are ultimately going to need a new netdev op to
> > handle the redirect then what may be the issue with not providing the
> > destination port in the return code, and the option proposed by JohnF looks
> > good to me with maybe a small tweak to not presume ifindex in some manner.
>
> Is there a concrete reason that all the proposed future cases like sockets
> have to be handled within the very same XDP_REDIRECT return code? F.e. why
> not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future
> ones would get a different return code f.e. XDP_TX_SK only handling sockets
> when we get there implementation-wise?

The concrete reason for a "unified" return code is to simplify changes
needed in the drivers. The purpose is moving code out of the drivers
into the core. IMHO we should keep drivers as simple as possible and
stop adding more return action codes.
XDP_REDIRECT combined with a port-table should be the last action-code
needed in the drivers, as the port table will know the "type" of the
port.

Keeping actions XDP_DROP and XDP_TX inside the drivers makes sense,
because it allows for driver-level optimizations. Actions that want to
transfer the packet/xdp_buff "outside" the driver should call a core
function (given it will be called with the xdp_buff struct, you can
hide all your extra extensions there without changing the driver code).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: XDP question: best API for returning/setting egress port?
  2017-04-19 20:02 ` Andy Gospodarek
  2017-04-19 21:42   ` Daniel Borkmann
  2017-04-19 22:51   ` Daniel Borkmann
@ 2017-04-20  4:43 ` John Fastabend
  2 siblings, 0 replies; 34+ messages in thread
From: John Fastabend @ 2017-04-20  4:43 UTC (permalink / raw)
To: Andy Gospodarek, Jesper Dangaard Brouer
Cc: Daniel Borkmann, Alexei Starovoitov, Alexei Starovoitov, netdev,
	xdp-newbies

[...]
> JohnF, any test results with this you can share? Presumably you tested with
> virtio-net, right?

For my patches, using xdp_redirect with virtio-net showed no measurable
difference from running just the straight XDP_TX in the PPS column.
However, virtio_net XDP_TX was quite low to start with (and I was mostly
just looking at functionality, so I don't recall if there was a CPU usage
hit), so I expect to see a difference on ixgbe. However, I have plenty of
CPU overhead to consume on the 10Gbps card, so it will likely just be a
CPU % cost, not a PPS cost. I'll bother Alex about the 40Gbps cards ;)

Anyways, I'll put it on my list to rerun. But like the xdp generic pieces
it won't be for a couple days until I get around to it. Sorry, need to
take care of a few things first.

Thanks,
John

^ permalink raw reply	[flat|nested] 34+ messages in thread
end of thread, other threads:[~2017-04-30 22:55 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-18 19:58 XDP question: best API for returning/setting egress port? Jesper Dangaard Brouer
2017-04-18 20:54 ` John Fastabend
2017-04-19 12:00 ` Jesper Dangaard Brouer
2017-04-19 12:33 ` Daniel Borkmann
2017-04-19 15:24 ` Jesper Dangaard Brouer
2017-04-19 12:25 ` Hannes Frederic Sowa
2017-04-19 20:02 ` Andy Gospodarek
2017-04-19 21:42 ` Daniel Borkmann
2017-04-20 17:12 ` Andy Gospodarek
2017-04-19 22:51 ` Daniel Borkmann
2017-04-20  2:56 ` xdp_redirect ifindex vs port. Was: " Alexei Starovoitov
2017-04-20  4:38 ` John Fastabend
2017-04-20  4:58 ` Alexei Starovoitov
2017-04-20  5:14 ` John Fastabend
2017-04-20  6:10 ` Jesper Dangaard Brouer
2017-04-20 17:10 ` Alexei Starovoitov
2017-04-25  9:34 ` Jesper Dangaard Brouer
2017-04-26  0:26 ` Alexei Starovoitov
2017-04-26  3:07 ` John Fastabend
2017-04-26  9:11 ` Jesper Dangaard Brouer
2017-04-26 16:35 ` John Fastabend
2017-04-26 17:58 ` Alexei Starovoitov
2017-04-26 20:55 ` Andy Gospodarek
2017-04-27  8:41 ` Jesper Dangaard Brouer
2017-04-27 23:31 ` Alexei Starovoitov
2017-04-28  5:06 ` John Fastabend
2017-04-28  5:30 ` Alexei Starovoitov
2017-04-28 19:43 ` Hannes Frederic Sowa
2017-04-30  1:35 ` Alexei Starovoitov
2017-04-28 10:58 ` Jesper Dangaard Brouer
2017-04-30  1:04 ` Alexei Starovoitov
2017-04-30 22:55 ` John Fastabend
2017-04-20  6:39 ` XDP question: " Jesper Dangaard Brouer
2017-04-20  4:43 ` John Fastabend