netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* BPF redirect API design issue for BPF-prog MTU feedback?
@ 2020-09-17 12:38 Jesper Dangaard Brouer
  2020-09-17 12:54 ` Maciej Żenczykowski
  0 siblings, 1 reply; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-17 12:38 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, BPF-dev-list
  Cc: brouer, netdev, Maciej Zenczykowski, Lorenzo Bianconi,
	Lorenz Bauer, John Fastabend, Jakub Kicinski, Shaun Crampton,
	David Miller, Marek Majkowski


As you likely know[1] I'm looking into moving the MTU check (for TC-BPF)
in __bpf_skb_max_len() when e.g. called by bpf_skb_adjust_room(),
because when redirecting packets to another netdev it is not correct to
limit the MTU based on the incoming netdev.

I was looking at doing the MTU check in bpf_redirect() helper, because
at this point we know the redirect to netdev, and returning an
indication/error that MTU was exceed, would allow the BPF-prog logic to
react, e.g. sending ICMP (instead of packet getting silently dropped).
BUT this is not possible because bpf_redirect(index, flags) helper
don't provide the packet context-object (so I cannot lookup the packet
length).

Seeking input:

Should/can we change the bpf_redirect API or create a new helper with
packet-context?

 Note: We have the same need for the packet context for XDP when
 redirecting the new multi-buffer packets, as not all destination netdev
 will support these new multi-buffer packets.

I can of-cause do the MTU checks on kernel-side in skb_do_redirect, but
then how do people debug this? as packet will basically be silently dropped.



(Looking at how does BPF-prog logic handle MTU today)

How do bpf_skb_adjust_room() report that the MTU was exceeded?
Unfortunately it uses a common return code -ENOTSUPP which used for
multiple cases (include MTU exceeded). Thus, the BPF-prog logic cannot
use this reliably to know if this is a MTU exceeded event. (Looked
BPF-prog code and they all simply exit with TC_ACT_SHOT for all error
codes, cloudflare have the most advanced handling with
metrics->errors_total_encap_adjust_failed++).


[1] https://lore.kernel.org/bpf/159921182827.1260200.9699352760916903781.stgit@firesoul/
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-17 12:38 BPF redirect API design issue for BPF-prog MTU feedback? Jesper Dangaard Brouer
@ 2020-09-17 12:54 ` Maciej Żenczykowski
  2020-09-17 19:11   ` Saeed Mahameed
  2020-09-21 10:42   ` Lorenz Bauer
  0 siblings, 2 replies; 17+ messages in thread
From: Maciej Żenczykowski @ 2020-09-17 12:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, BPF-dev-list, netdev,
	Lorenzo Bianconi, Lorenz Bauer, John Fastabend, Jakub Kicinski,
	Shaun Crampton, David Miller, Marek Majkowski

On Thu, Sep 17, 2020 at 5:39 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
>
> As you likely know[1] I'm looking into moving the MTU check (for TC-BPF)
> in __bpf_skb_max_len() when e.g. called by bpf_skb_adjust_room(),
> because when redirecting packets to another netdev it is not correct to
> limit the MTU based on the incoming netdev.
>
> I was looking at doing the MTU check in bpf_redirect() helper, because
> at this point we know the redirect to netdev, and returning an
> indication/error that MTU was exceed, would allow the BPF-prog logic to
> react, e.g. sending ICMP (instead of packet getting silently dropped).
> BUT this is not possible because bpf_redirect(index, flags) helper
> don't provide the packet context-object (so I cannot lookup the packet
> length).
>
> Seeking input:
>
> Should/can we change the bpf_redirect API or create a new helper with
> packet-context?
>
>  Note: We have the same need for the packet context for XDP when
>  redirecting the new multi-buffer packets, as not all destination netdev
>  will support these new multi-buffer packets.
>
> I can of-cause do the MTU checks on kernel-side in skb_do_redirect, but
> then how do people debug this? as packet will basically be silently dropped.
>
>
>
> (Looking at how does BPF-prog logic handle MTU today)
>
> How do bpf_skb_adjust_room() report that the MTU was exceeded?
> Unfortunately it uses a common return code -ENOTSUPP which used for
> multiple cases (include MTU exceeded). Thus, the BPF-prog logic cannot
> use this reliably to know if this is a MTU exceeded event. (Looked
> BPF-prog code and they all simply exit with TC_ACT_SHOT for all error
> codes, cloudflare have the most advanced handling with
> metrics->errors_total_encap_adjust_failed++).
>
>
> [1] https://lore.kernel.org/bpf/159921182827.1260200.9699352760916903781.stgit@firesoul/
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>

(a) the current state of the world seems very hard to use correctly,
so adding new apis,
or even changing existing ones seems ok to me.
especially if this just means changing what error code they return

(b) another complexity with bpf_redirect() is you can call it, it can succeed,
but then you can not return TC_ACT_REDIRECT from the bpf program,
which effectively makes the earlier *successful* bpf_redirect() call
an utter no-op.

(bpf_redirect() just determines what a future return TC_ACT_REDIRECT will do)

so if you bpf_redirect to interface with larger mtu, then increase packet size,
then return TC_ACT_OK, then you potentially end up with excessively large
packet egressing through original interface (with small mtu).

My vote would be to return a new distinct error from bpf_redirect()
based on then current
packet size and interface being redirected to, save this interface mtu
somewhere,
then in operations that increase packet size check against this saved mtu,
for correctness you still have to check mtu after the bpf program is done,
but this is then just to deal with braindead bpf code (that calls
bpf_redirect and returns TC_ACT_OK, or calls bpf_redirect() multiple
times, or something...).

Maciej Żenczykowski, Kernel Networking Developer @ Google

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-17 12:54 ` Maciej Żenczykowski
@ 2020-09-17 19:11   ` Saeed Mahameed
  2020-09-18 10:00     ` Jesper Dangaard Brouer
  2020-09-21 10:42   ` Lorenz Bauer
  1 sibling, 1 reply; 17+ messages in thread
From: Saeed Mahameed @ 2020-09-17 19:11 UTC (permalink / raw)
  To: Maciej Żenczykowski, Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, BPF-dev-list, netdev,
	Lorenzo Bianconi, Lorenz Bauer, John Fastabend, Jakub Kicinski,
	Shaun Crampton, David Miller, Marek Majkowski

On Thu, 2020-09-17 at 05:54 -0700, Maciej Żenczykowski wrote:
> On Thu, Sep 17, 2020 at 5:39 AM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > 
> > As you likely know[1] I'm looking into moving the MTU check (for
> > TC-BPF)
> > in __bpf_skb_max_len() when e.g. called by bpf_skb_adjust_room(),
> > because when redirecting packets to another netdev it is not
> > correct to
> > limit the MTU based on the incoming netdev.
> > 
> > I was looking at doing the MTU check in bpf_redirect() helper,
> > because
> > at this point we know the redirect to netdev, and returning an
> > indication/error that MTU was exceed, would allow the BPF-prog
> > logic to
> > react, e.g. sending ICMP (instead of packet getting silently
> > dropped).
> > BUT this is not possible because bpf_redirect(index, flags) helper
> > don't provide the packet context-object (so I cannot lookup the
> > packet
> > length).
> > 
> > Seeking input:
> > 
> > Should/can we change the bpf_redirect API or create a new helper
> > with
> > packet-context?
> > 
> >  Note: We have the same need for the packet context for XDP when
> >  redirecting the new multi-buffer packets, as not all destination
> > netdev
> >  will support these new multi-buffer packets.
> > 
> > I can of-cause do the MTU checks on kernel-side in skb_do_redirect,
> > but
> > then how do people debug this? as packet will basically be silently
> > dropped.
> > 
> > 
> > 
> > (Looking at how does BPF-prog logic handle MTU today)
> > 
> > How do bpf_skb_adjust_room() report that the MTU was exceeded?
> > Unfortunately it uses a common return code -ENOTSUPP which used for
> > multiple cases (include MTU exceeded). Thus, the BPF-prog logic
> > cannot
> > use this reliably to know if this is a MTU exceeded event. (Looked
> > BPF-prog code and they all simply exit with TC_ACT_SHOT for all
> > error
> > codes, cloudflare have the most advanced handling with
> > metrics->errors_total_encap_adjust_failed++).
> > 
> > 
> > [1] 
> > https://lore.kernel.org/bpf/159921182827.1260200.9699352760916903781.stgit@firesoul/
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> > 
> 
> (a) the current state of the world seems very hard to use correctly,
> so adding new apis,
> or even changing existing ones seems ok to me.
> especially if this just means changing what error code they return
> 
> (b) another complexity with bpf_redirect() is you can call it, it can
> succeed,
> but then you can not return TC_ACT_REDIRECT from the bpf program,
> which effectively makes the earlier *successful* bpf_redirect() call
> an utter no-op.
> 
> (bpf_redirect() just determines what a future return TC_ACT_REDIRECT
> will do)
> 
> so if you bpf_redirect to interface with larger mtu, then increase
> packet size,

why would you redirect then touch the packet afterwards ? 
if you have a bad program, then it is a user issue.

> then return TC_ACT_OK, then you potentially end up with excessively
> large
> packet egressing through original interface (with small mtu).
> 
> My vote would be to return a new distinct error from bpf_redirect()
> based on then current
> packet size and interface being redirected to, save this interface
> mtu
> somewhere,
> then in operations that increase packet size check against this saved
> mtu,
> for correctness you still have to check mtu after the bpf program is
> done,
> but this is then just to deal with braindead bpf code (that calls
> bpf_redirect and returns TC_ACT_OK, or calls bpf_redirect() multiple
> times, or something...).
> 


Another solution is to have an exception function defined in the
BPF_prog, this function by itself is another program that can be
executed to notify the prog about any exception/err that happened after
the main BPF_program exited and let the XDP program react by its own
logic.

example:

BPF_prog:
    int XDP_main_prog(xdp_buff) {
        xdp_adjust_head/tail(xdp_buff);
        return xdp_redirect(ifindex, flags);
    }

    int XDP_exception(xdp_buff, excption_code) {
        if (excetption_code == XDP_REDIRECRT_MTU_EXCEEDED) {
                ICMP_response(xdp_buff);
                return XDP_TX;
        }
        return XDP_DROP;
    }


netdev_driver_xdp_handle():
   act = bpf_prog_run_xdp(prog, xdp); // Run XDP_main_prog
   if (act == XDP_REDIRECT)
       err = xdp_do_redirect(netdev, xdp, prog);
       if (err) { 
          // Run XDP_exception() function in the user prog
          // finds the exception handler of active program
          act = bpf_prog_run_xdp_exciption(prog, xdp, err);
          // then handle exception action in the driver
(XDP_TX/DROP/FORWARD).. 
       }

of-course a user program will be notified only on the first err .. 
if it fails on the 2nd time .. just drop..

-Saeed.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-17 19:11   ` Saeed Mahameed
@ 2020-09-18 10:00     ` Jesper Dangaard Brouer
  2020-09-18 10:34       ` Toke Høiland-Jørgensen
  2020-09-18 23:06       ` Maciej Żenczykowski
  0 siblings, 2 replies; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-18 10:00 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Maciej Żenczykowski, Daniel Borkmann, Alexei Starovoitov,
	BPF-dev-list, netdev, Lorenzo Bianconi, Lorenz Bauer,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski, brouer

On Thu, 17 Sep 2020 12:11:33 -0700
Saeed Mahameed <saeed@kernel.org> wrote:

> On Thu, 2020-09-17 at 05:54 -0700, Maciej Żenczykowski wrote:
> > On Thu, Sep 17, 2020 at 5:39 AM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:  
> > > 
> > > As you likely know[1] I'm looking into moving the MTU check (for
> > > TC-BPF) in __bpf_skb_max_len() when e.g. called by
> > > bpf_skb_adjust_room(), because when redirecting packets to
> > > another netdev it is not correct to limit the MTU based on the
> > > incoming netdev.
> > > 
> > > I was looking at doing the MTU check in bpf_redirect() helper,
> > > because at this point we know the redirect to netdev, and
> > > returning an indication/error that MTU was exceed, would allow
> > > the BPF-prog logic to react, e.g. sending ICMP (instead of packet
> > > getting silently dropped). 
> > > BUT this is not possible because bpf_redirect(index, flags) helper
> > > don't provide the packet context-object (so I cannot lookup the
> > > packet length).
> > > 
> > > Seeking input:
> > > 
> > > Should/can we change the bpf_redirect API or create a new helper
> > > with packet-context?
> > > 
> > >  Note: We have the same need for the packet context for XDP when
> > >  redirecting the new multi-buffer packets, as not all destination
> > >  netdev will support these new multi-buffer packets.
> > > 
> > > I can of-cause do the MTU checks on kernel-side in
> > > skb_do_redirect, but then how do people debug this? as packet
> > > will basically be silently dropped.
> > > 
> > > 
> > > 
> > > (Looking at how does BPF-prog logic handle MTU today)
> > > 
> > > How do bpf_skb_adjust_room() report that the MTU was exceeded?
> > > Unfortunately it uses a common return code -ENOTSUPP which used
> > > for multiple cases (include MTU exceeded). Thus, the BPF-prog
> > > logic cannot use this reliably to know if this is a MTU exceeded
> > > event. (Looked BPF-prog code and they all simply exit with
> > > TC_ACT_SHOT for all error codes, cloudflare have the most
> > > advanced handling with metrics->errors_total_encap_adjust_failed++).
> > > 
> > > 
> > > [1] 
> > > https://lore.kernel.org/bpf/159921182827.1260200.9699352760916903781.stgit@firesoul/
> > > --
> > > Best regards,
> > >   Jesper Dangaard Brouer
> > >   MSc.CS, Principal Kernel Engineer at Red Hat
> > >   LinkedIn: http://www.linkedin.com/in/brouer
> > >   
> > 
> > (a) the current state of the world seems very hard to use correctly,
> > so adding new apis, or even changing existing ones seems ok to me.
> > especially if this just means changing what error code they return
> > 
> > (b) another complexity with bpf_redirect() is you can call it, it
> > can succeed, but then you can not return TC_ACT_REDIRECT from the
> > bpf program, which effectively makes the earlier *successful*
> > bpf_redirect() call an utter no-op.
> > 
> > (bpf_redirect() just determines what a future return TC_ACT_REDIRECT
> > will do)
> > 
> > so if you bpf_redirect to interface with larger mtu, then increase
> > packet size,  
> 
> why would you redirect then touch the packet afterwards ? 
> if you have a bad program, then it is a user issue.
> 
> > then return TC_ACT_OK, then you potentially end up with excessively
> > large packet egressing through original interface (with small mtu).
> > 

This is a good point.  As bpf_skb_adjust_room() can just be run after
bpf_redirect() call, then a MTU check in bpf_redirect() actually
doesn't make much sense.  As clever/bad BPF program can then avoid the
MTU check anyhow.  This basically means that we have to do the MTU
check (again) on kernel side anyhow to catch such clever/bad BPF
programs.  (And I don't like wasting cycles on doing the same check two
times).

If we do the MTU check on the kernel side, then there are no feedback
to the program, and how are end-users going to debug this?


> > My vote would be to return a new distinct error from bpf_redirect()
> > based on then current packet size and interface being redirected
> > to, save this interface mtu  somewhere, then in operations that
> > increase packet size check against this saved mtu, for correctness
> > you still have to check mtu after the bpf program is done,
> > but this is then just to deal with braindead bpf code (that calls
> > bpf_redirect and returns TC_ACT_OK, or calls bpf_redirect() multiple
> > times, or something...).
> >   

Hmm, I don't like wasting cycles on doing the same check multiple times.

In bpf_redirect() we store the netdev we want to redirect to, thus, if
bpf_skb_adjust_room() is run after bpf_redirect(), we could also do a
MTU check in bpf_skb_adjust_room() based on this netdev. (maybe there
are still ways for a BPF-prog to cheat the MTU check?).

> 
> Another solution is to have an exception function defined in the
> BPF_prog, this function by itself is another program that can be
> executed to notify the prog about any exception/err that happened
> after the main BPF_program exited and let the XDP program react by
> its own logic.

We are talking about TC-BPF programs here, but the concept and
usability issue with bpf_redirect() is the same for XDP.

If doing the MTU check (and drop) on kernel side, as was thinking about
adding a simple tracepoint, for end-users to debug this. It would also
allow for attaching a BPF-prog to the tracepoint to get more direct
feedback, but it would not allow sending back an ICMP response.

Your approach of calling a 2nd BPF-prog to deal with exceptions, and
allow it to alter the packet and flow "action" is definitely
interesting.  What do others think?


> example:
> 
> BPF_prog:
>     int XDP_main_prog(xdp_buff) {
>         xdp_adjust_head/tail(xdp_buff);
>         return xdp_redirect(ifindex, flags);
>     }
> 
>     int XDP_exception(xdp_buff, excption_code) {
>         if (excetption_code == XDP_REDIRECRT_MTU_EXCEEDED) {
>                 ICMP_response(xdp_buff);
>                 return XDP_TX;
>         }
>         return XDP_DROP;
>     }
> 
> 
> netdev_driver_xdp_handle():
>    act = bpf_prog_run_xdp(prog, xdp); // Run XDP_main_prog
>    if (act == XDP_REDIRECT)
>        err = xdp_do_redirect(netdev, xdp, prog);
>        if (err) { 
>           // Run XDP_exception() function in the user prog
>           // finds the exception handler of active program
>           act = bpf_prog_run_xdp_exciption(prog, xdp, err);
>           // then handle exception action in the driver
> (XDP_TX/DROP/FORWARD).. 
>        }
> 
> of-course a user program will be notified only on the first err .. 
> if it fails on the 2nd time .. just drop..


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-18 10:00     ` Jesper Dangaard Brouer
@ 2020-09-18 10:34       ` Toke Høiland-Jørgensen
  2020-09-18 23:06       ` Maciej Żenczykowski
  1 sibling, 0 replies; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-09-18 10:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Saeed Mahameed
  Cc: Maciej Żenczykowski, Daniel Borkmann, Alexei Starovoitov,
	BPF-dev-list, netdev, Lorenzo Bianconi, Lorenz Bauer,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski, brouer

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Thu, 17 Sep 2020 12:11:33 -0700
> Saeed Mahameed <saeed@kernel.org> wrote:
>
>> On Thu, 2020-09-17 at 05:54 -0700, Maciej Żenczykowski wrote:
>> > On Thu, Sep 17, 2020 at 5:39 AM Jesper Dangaard Brouer
>> > <brouer@redhat.com> wrote:  
>> > > 
>> > > As you likely know[1] I'm looking into moving the MTU check (for
>> > > TC-BPF) in __bpf_skb_max_len() when e.g. called by
>> > > bpf_skb_adjust_room(), because when redirecting packets to
>> > > another netdev it is not correct to limit the MTU based on the
>> > > incoming netdev.
>> > > 
>> > > I was looking at doing the MTU check in bpf_redirect() helper,
>> > > because at this point we know the redirect to netdev, and
>> > > returning an indication/error that MTU was exceed, would allow
>> > > the BPF-prog logic to react, e.g. sending ICMP (instead of packet
>> > > getting silently dropped). 
>> > > BUT this is not possible because bpf_redirect(index, flags) helper
>> > > don't provide the packet context-object (so I cannot lookup the
>> > > packet length).
>> > > 
>> > > Seeking input:
>> > > 
>> > > Should/can we change the bpf_redirect API or create a new helper
>> > > with packet-context?
>> > > 
>> > >  Note: We have the same need for the packet context for XDP when
>> > >  redirecting the new multi-buffer packets, as not all destination
>> > >  netdev will support these new multi-buffer packets.
>> > > 
>> > > I can of-cause do the MTU checks on kernel-side in
>> > > skb_do_redirect, but then how do people debug this? as packet
>> > > will basically be silently dropped.
>> > > 
>> > > 
>> > > 
>> > > (Looking at how does BPF-prog logic handle MTU today)
>> > > 
>> > > How do bpf_skb_adjust_room() report that the MTU was exceeded?
>> > > Unfortunately it uses a common return code -ENOTSUPP which used
>> > > for multiple cases (include MTU exceeded). Thus, the BPF-prog
>> > > logic cannot use this reliably to know if this is a MTU exceeded
>> > > event. (Looked BPF-prog code and they all simply exit with
>> > > TC_ACT_SHOT for all error codes, cloudflare have the most
>> > > advanced handling with metrics->errors_total_encap_adjust_failed++).
>> > > 
>> > > 
>> > > [1] 
>> > > https://lore.kernel.org/bpf/159921182827.1260200.9699352760916903781.stgit@firesoul/
>> > > --
>> > > Best regards,
>> > >   Jesper Dangaard Brouer
>> > >   MSc.CS, Principal Kernel Engineer at Red Hat
>> > >   LinkedIn: http://www.linkedin.com/in/brouer
>> > >   
>> > 
>> > (a) the current state of the world seems very hard to use correctly,
>> > so adding new apis, or even changing existing ones seems ok to me.
>> > especially if this just means changing what error code they return
>> > 
>> > (b) another complexity with bpf_redirect() is you can call it, it
>> > can succeed, but then you can not return TC_ACT_REDIRECT from the
>> > bpf program, which effectively makes the earlier *successful*
>> > bpf_redirect() call an utter no-op.
>> > 
>> > (bpf_redirect() just determines what a future return TC_ACT_REDIRECT
>> > will do)
>> > 
>> > so if you bpf_redirect to interface with larger mtu, then increase
>> > packet size,  
>> 
>> why would you redirect then touch the packet afterwards ? 
>> if you have a bad program, then it is a user issue.
>> 
>> > then return TC_ACT_OK, then you potentially end up with excessively
>> > large packet egressing through original interface (with small mtu).
>> > 
>
> This is a good point.  As bpf_skb_adjust_room() can just be run after
> bpf_redirect() call, then a MTU check in bpf_redirect() actually
> doesn't make much sense.  As clever/bad BPF program can then avoid the
> MTU check anyhow.  This basically means that we have to do the MTU
> check (again) on kernel side anyhow to catch such clever/bad BPF
> programs.  (And I don't like wasting cycles on doing the same check two
> times).
>
> If we do the MTU check on the kernel side, then there are no feedback
> to the program, and how are end-users going to debug this?

The same way any other MTU-related error is seen? Isn't there a counter
or something? Presumably (since this is in the skb path) it would also
be caught by drop_monitor?

-Toke


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-18 10:00     ` Jesper Dangaard Brouer
  2020-09-18 10:34       ` Toke Høiland-Jørgensen
@ 2020-09-18 23:06       ` Maciej Żenczykowski
  2020-09-21 10:37         ` Lorenz Bauer
  1 sibling, 1 reply; 17+ messages in thread
From: Maciej Żenczykowski @ 2020-09-18 23:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Daniel Borkmann, Alexei Starovoitov,
	BPF-dev-list, netdev, Lorenzo Bianconi, Lorenz Bauer,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski

> This is a good point.  As bpf_skb_adjust_room() can just be run after
> bpf_redirect() call, then a MTU check in bpf_redirect() actually
> doesn't make much sense.  As clever/bad BPF program can then avoid the
> MTU check anyhow.  This basically means that we have to do the MTU
> check (again) on kernel side anyhow to catch such clever/bad BPF
> programs.  (And I don't like wasting cycles on doing the same check two
> times).

If you get rid of the check in bpf_redirect() you might as well get
rid of *all* the checks for excessive mtu in all the helpers that
adjust packet size one way or another way.  They *all* then become
useless overhead.

I don't like that.  There may be something the bpf program could do to
react to the error condition (for example in my case, not modify
things and just let the core stack deal with things - which will
probably just generate packet too big icmp error).

btw. right now our forwarding programs first adjust the packet size
then call bpf_redirect() and almost immediately return what it
returned.

but this could I think easily be changed to reverse the ordering, so
we wouldn't increase packet size before the core stack was informed we
would be forwarding via a different interface.

> If we do the MTU check on the kernel side, then there are no feedback
> to the program, and how are end-users going to debug this?

What about just adding a net_ratelimited printk in the >mtu but too
late to return error case?
This is only useful though if this should never happen unless you do
something 'evil' in your bpf code.
(ie. to be useful this requires all the bpf helpers to have correct
mtu checks in them)

Alternatively a sysctl that would generate icmp error on such a drop?
Or just always do that anyway?

> Hmm, I don't like wasting cycles on doing the same check multiple times.

I'm not sure this is avoidable without making bpf code error prone...

> In bpf_redirect() we store the netdev we want to redirect to, thus, if
> bpf_skb_adjust_room() is run after bpf_redirect(), we could also do a
> MTU check in bpf_skb_adjust_room() based on this netdev. (maybe there
> are still ways for a BPF-prog to cheat the MTU check?).

Sure but, bpf_redirect() can be from large to small mtu device without
packet ever being modified.
bpf_redirect() needs to check mtu in such a case.
bpf code may decide that if can't redirect then it just wants core
stack to do something with it (like frag it).

> > Another solution is to have an exception function defined in the
> > BPF_prog, this function by itself is another program that can be
> > executed to notify the prog about any exception/err that happened
> > after the main BPF_program exited and let the XDP program react by
> > its own logic.
>
> We are talking about TC-BPF programs here, but the concept and
> usability issue with bpf_redirect() is the same for XDP.
>
> If doing the MTU check (and drop) on kernel side, as was thinking about
> adding a simple tracepoint, for end-users to debug this. It would also
> allow for attaching a BPF-prog to the tracepoint to get more direct
> feedback, but it would not allow sending back an ICMP response.
>
> Your approach of calling a 2nd BPF-prog to deal with exceptions, and
> allow it to alter the packet and flow "action" is definitely
> interesting.  What do others think?
>
>
> > example:
> >
> > BPF_prog:
> >     int XDP_main_prog(xdp_buff) {
> >         xdp_adjust_head/tail(xdp_buff);
> >         return xdp_redirect(ifindex, flags);
> >     }
> >
> >     int XDP_exception(xdp_buff, excption_code) {
> >         if (excetption_code == XDP_REDIRECRT_MTU_EXCEEDED) {
> >                 ICMP_response(xdp_buff);
> >                 return XDP_TX;
> >         }
> >         return XDP_DROP;
> >     }
> >
> >
> > netdev_driver_xdp_handle():
> >    act = bpf_prog_run_xdp(prog, xdp); // Run XDP_main_prog
> >    if (act == XDP_REDIRECT)
> >        err = xdp_do_redirect(netdev, xdp, prog);
> >        if (err) {
> >           // Run XDP_exception() function in the user prog
> >           // finds the exception handler of active program
> >           act = bpf_prog_run_xdp_exciption(prog, xdp, err);
> >           // then handle exception action in the driver
> > (XDP_TX/DROP/FORWARD)..
> >        }
> >
> > of-course a user program will be notified only on the first err ..
> > if it fails on the 2nd time .. just drop..
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
Maciej Żenczykowski, Kernel Networking Developer @ Google

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-18 23:06       ` Maciej Żenczykowski
@ 2020-09-21 10:37         ` Lorenz Bauer
  2020-09-21 12:49           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 17+ messages in thread
From: Lorenz Bauer @ 2020-09-21 10:37 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Jesper Dangaard Brouer, Saeed Mahameed, Daniel Borkmann,
	Alexei Starovoitov, BPF-dev-list, netdev, Lorenzo Bianconi,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski

On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski <maze@google.com> wrote:
>
> > This is a good point.  As bpf_skb_adjust_room() can just be run after
> > bpf_redirect() call, then a MTU check in bpf_redirect() actually
> > doesn't make much sense.  As clever/bad BPF program can then avoid the
> > MTU check anyhow.  This basically means that we have to do the MTU
> > check (again) on kernel side anyhow to catch such clever/bad BPF
> > programs.  (And I don't like wasting cycles on doing the same check two
> > times).
>
> If you get rid of the check in bpf_redirect() you might as well get
> rid of *all* the checks for excessive mtu in all the helpers that
> adjust packet size one way or another way.  They *all* then become
> useless overhead.
>
> I don't like that.  There may be something the bpf program could do to
> react to the error condition (for example in my case, not modify
> things and just let the core stack deal with things - which will
> probably just generate packet too big icmp error).
>
> btw. right now our forwarding programs first adjust the packet size
> then call bpf_redirect() and almost immediately return what it
> returned.
>
> but this could I think easily be changed to reverse the ordering, so
> we wouldn't increase packet size before the core stack was informed we
> would be forwarding via a different interface.

We do the same, except that we also use XDP_TX when appropriate. This
complicates the matter, because there is no helper call we could
return an error from.

My preference would be to have three helpers: get MTU for a device,
redirect ctx to a device (with MTU check), resize ctx (without MTU
check) but that doesn't work with XDP_TX. Your idea of doing checks in
redirect and adjust_room is pragmatic and seems easier to implement.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-17 12:54 ` Maciej Żenczykowski
  2020-09-17 19:11   ` Saeed Mahameed
@ 2020-09-21 10:42   ` Lorenz Bauer
  1 sibling, 0 replies; 17+ messages in thread
From: Lorenz Bauer @ 2020-09-21 10:42 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov,
	BPF-dev-list, netdev, Lorenzo Bianconi, John Fastabend,
	Jakub Kicinski, Shaun Crampton, David Miller, Marek Majkowski

On Thu, 17 Sep 2020 at 13:55, Maciej Żenczykowski <maze@google.com> wrote:
>
>
> (b) another complexity with bpf_redirect() is you can call it, it can succeed,
> but then you can not return TC_ACT_REDIRECT from the bpf program,
> which effectively makes the earlier *successful* bpf_redirect() call
> an utter no-op.
>
> (bpf_redirect() just determines what a future return TC_ACT_REDIRECT will do)
>
> so if you bpf_redirect to interface with larger mtu, then increase packet size,
> then return TC_ACT_OK, then you potentially end up with excessively large
> packet egressing through original interface (with small mtu).

Yeah, this isn't nice. What is the use case for allowing this in the
first place?

For sk_lookup programs, we have a similar situation, except that we
"redirect" to a socket. Here the redirect happens if the helper call
is successful and the program returns SK_PASS. Maybe that is a
feasible approach if we introduce new helpers.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 10:37         ` Lorenz Bauer
@ 2020-09-21 12:49           ` Jesper Dangaard Brouer
  2020-09-21 15:08             ` Daniel Borkmann
  0 siblings, 1 reply; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-21 12:49 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Maciej Żenczykowski, Saeed Mahameed, Daniel Borkmann,
	Alexei Starovoitov, BPF-dev-list, netdev, Lorenzo Bianconi,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski, brouer

On Mon, 21 Sep 2020 11:37:18 +0100
Lorenz Bauer <lmb@cloudflare.com> wrote:

> On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski <maze@google.com> wrote:
> >  
> > > This is a good point.  As bpf_skb_adjust_room() can just be run after
> > > bpf_redirect() call, then a MTU check in bpf_redirect() actually
> > > doesn't make much sense.  As clever/bad BPF program can then avoid the
> > > MTU check anyhow.  This basically means that we have to do the MTU
> > > check (again) on kernel side anyhow to catch such clever/bad BPF
> > > programs.  (And I don't like wasting cycles on doing the same check two
> > > times).  
> >
> > If you get rid of the check in bpf_redirect() you might as well get
> > rid of *all* the checks for excessive mtu in all the helpers that
> > adjust packet size one way or another way.  They *all* then become
> > useless overhead.
> >
> > I don't like that.  There may be something the bpf program could do to
> > react to the error condition (for example in my case, not modify
> > things and just let the core stack deal with things - which will
> > probably just generate packet too big icmp error).
> >
> > btw. right now our forwarding programs first adjust the packet size
> > then call bpf_redirect() and almost immediately return what it
> > returned.
> >
> > but this could I think easily be changed to reverse the ordering, so
> > we wouldn't increase packet size before the core stack was informed we
> > would be forwarding via a different interface.  
> 
> We do the same, except that we also use XDP_TX when appropriate. This
> complicates the matter, because there is no helper call we could
> return an error from.

Do notice that my MTU work is focused on TC-BPF.  For XDP-redirect the
MTU check is done in xdp_ok_fwd_dev() via __xdp_enqueue(), which also
happens too late to give BPF-prog knowledge/feedback.  For XDP_TX I
audited the drivers when I implemented xdp_buff.frame_sz, and they
handled (or I added) handling against max HW MTU. E.g. mlx5 [1].

[1] https://elixir.bootlin.com/linux/v5.9-rc6/source/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L267



> My preference would be to have three helpers: get MTU for a device,
> redirect ctx to a device (with MTU check), resize ctx (without MTU
> check) but that doesn't work with XDP_TX. Your idea of doing checks
> in redirect and adjust_room is pragmatic and seems easier to
> implement.
 
I do like this plan/proposal (with 3 helpers), but it is not possible
with current API.  The main problem is the current bpf_redirect API
doesn't provide the ctx, so we cannot do the check in the BPF-helper.

Are you saying we should create a new bpf_redirect API (that incl packet ctx)?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 12:49           ` Jesper Dangaard Brouer
@ 2020-09-21 15:08             ` Daniel Borkmann
  2020-09-21 16:21               ` Marek Zavodsky
                                 ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Daniel Borkmann @ 2020-09-21 15:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Lorenz Bauer
  Cc: Maciej Żenczykowski, Saeed Mahameed, Daniel Borkmann,
	Alexei Starovoitov, BPF-dev-list, netdev, Lorenzo Bianconi,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski

On 9/21/20 2:49 PM, Jesper Dangaard Brouer wrote:
> On Mon, 21 Sep 2020 11:37:18 +0100
> Lorenz Bauer <lmb@cloudflare.com> wrote:
>> On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski <maze@google.com> wrote:
>>>   
>>>> This is a good point.  As bpf_skb_adjust_room() can just be run after
>>>> bpf_redirect() call, then a MTU check in bpf_redirect() actually
>>>> doesn't make much sense.  As clever/bad BPF program can then avoid the
>>>> MTU check anyhow.  This basically means that we have to do the MTU
>>>> check (again) on kernel side anyhow to catch such clever/bad BPF
>>>> programs.  (And I don't like wasting cycles on doing the same check two
>>>> times).
>>>
>>> If you get rid of the check in bpf_redirect() you might as well get
>>> rid of *all* the checks for excessive mtu in all the helpers that
>>> adjust packet size one way or another way.  They *all* then become
>>> useless overhead.
>>>
>>> I don't like that.  There may be something the bpf program could do to
>>> react to the error condition (for example in my case, not modify
>>> things and just let the core stack deal with things - which will
>>> probably just generate packet too big icmp error).
>>>
>>> btw. right now our forwarding programs first adjust the packet size
>>> then call bpf_redirect() and almost immediately return what it
>>> returned.
>>>
>>> but this could I think easily be changed to reverse the ordering, so
>>> we wouldn't increase packet size before the core stack was informed we
>>> would be forwarding via a different interface.
>>
>> We do the same, except that we also use XDP_TX when appropriate. This
>> complicates the matter, because there is no helper call we could
>> return an error from.
> 
> Do notice that my MTU work is focused on TC-BPF.  For XDP-redirect the
> MTU check is done in xdp_ok_fwd_dev() via __xdp_enqueue(), which also
> happens too late to give BPF-prog knowledge/feedback.  For XDP_TX I
> audited the drivers when I implemented xdp_buff.frame_sz, and they
> handled (or I added) handling against max HW MTU. E.g. mlx5 [1].
> 
> [1] https://elixir.bootlin.com/linux/v5.9-rc6/source/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L267
> 
>> My preference would be to have three helpers: get MTU for a device,
>> redirect ctx to a device (with MTU check), resize ctx (without MTU
>> check) but that doesn't work with XDP_TX. Your idea of doing checks
>> in redirect and adjust_room is pragmatic and seems easier to
>> implement.
>   
> I do like this plan/proposal (with 3 helpers), but it is not possible
> with current API.  The main problem is the current bpf_redirect API
> doesn't provide the ctx, so we cannot do the check in the BPF-helper.
> 
> Are you saying we should create a new bpf_redirect API (that incl packet ctx)?

Sorry for jumping in late here... one thing that is not clear to me is that if
we are fully sure that skb is dropped by stack anyway due to invalid MTU (redirect
to ingress does this via dev_forward_skb(), it's not fully clear to me whether it's
also the case for the dev_queue_xmiy()), then why not dropping all the MTU checks
aside from SKB_MAX_ALLOC sanity check for BPF helpers and have something like a
device object (similar to e.g. TCP sockets) exposed to BPF prog where we can retrieve
the object and read dev->mtu from the prog, so the BPF program could then do the
"exception" handling internally w/o extra prog needed (we also already expose whether
skb is GSO or not).

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 15:08             ` Daniel Borkmann
@ 2020-09-21 16:21               ` Marek Zavodsky
  2020-09-21 21:17                 ` Willem de Bruijn
  2020-09-21 16:26               ` Jesper Dangaard Brouer
  2020-09-21 18:04               ` John Fastabend
  2 siblings, 1 reply; 17+ messages in thread
From: Marek Zavodsky @ 2020-09-21 16:21 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jesper Dangaard Brouer, Lorenz Bauer, Maciej Żenczykowski,
	Saeed Mahameed, Daniel Borkmann, Alexei Starovoitov,
	BPF-dev-list, netdev, Lorenzo Bianconi, John Fastabend,
	Jakub Kicinski, Shaun Crampton, David Miller, Marek Majkowski

Hi guys,

My kernel knowledge is small, but I experienced this (similar) issue
with packet encapsulation (not a redirect), therefore modifying the
redirect branch would not help in my case.

I'm working on a TC program to do GUE encap/decap (IP + UDP + GUE,
outer header has extra 52B).
There are no issues with small packets. But when I use curl to
download big file HTTP server chunks data randomly. Some packets have
MTU size, others are even bigger. Big packets are not an issue,
however MTU sized packets fail on bpf_skb_adjust_room with -524
(ENOTSUPP).

Below are some (annotated) logs for small, MTU-sized and beefy packets
I collected from /sys/kernel/debug/tracing/trace.
Log is produced by egress TC on server side (node1 TX)

SMALL PACKET:
            curl-14470 [000] .Ns1   402.561940: 0: PFC ifindex 52, len
66                                            <- skb->ifindex,
skb->len
            curl-14470 [000] .Ns1   402.561940: 0:   gso_segs 1
                                                   <- skb->gso_segs
(no GSO)
            curl-14470 [000] .Ns1   402.561941: 0:   gso_size 0
                                                    <- skb->gso_size
(5.7+) ... skb_is_gso(skb) -> "false"
            curl-14470 [000] .Ns1   402.561941: 0: ID node1 TX
                                                  <- serverside egress
(GUE encap)
            curl-14470 [000] .Ns1   402.561942: 0: DUMP:
=====================                      <- original packet
            curl-14470 [000] .Ns1   402.561942: 0:   Size : 66 B
                                                    <- skb->len
            curl-14470 [000] .Ns1   402.561942: 0:   ETH  : ac010005
-> ac010002, proto 800
            curl-14470 [000] .Ns1   402.561943: 0:   IPv4 : 1010101 ->
ac010002, id 62439
            curl-14470 [000] .Ns1   402.561943: 0:     csum 0x98d7
            curl-14470 [000] .Ns1   402.561944: 0:   TCP  : 4000 ->
60296, Flags [A]
            curl-14470 [000] .Ns1   402.561944: 0:     csum 0xae2b
            curl-14470 [000] .Ns1   402.561944: 0: DUMP: =====================
            curl-14470 [000] .Ns1   402.561946: 0: GUE Encap Tunnel:
id 65636                               <- subject to encap...
            curl-14470 [000] .Ns1   402.561947: 0:     FROM ac010005:6000
            curl-14470 [000] .Ns1   402.561947: 0:     TO
ac010003:5000       (ac010003)
            curl-14470 [000] .Ns1   402.561948: 0: DUMP:
=====================                    <- encapsulated packet
            curl-14470 [000] .Ns1   402.561948: 0:   Size : 118 B
                                                  <- new skb->len
            curl-14470 [000] .Ns1   402.561948: 0:   ETH  : ac010005
-> ac010003, proto 800
            curl-14470 [000] .Ns1   402.561949: 0:   IPv4 : ac010005
-> ac010003, id 62439
            curl-14470 [000] .Ns1   402.561949: 0:     csum 0xee92
            curl-14470 [000] .Ns1   402.561949: 0:   UDP  : 6000 -> 5000
            curl-14470 [000] .Ns1   402.561950: 0:     csum 0x0
            curl-14470 [000] .Ns1   402.561950: 0: DUMP: =====================
            curl-14470 [000] .Ns1   402.561950: 0: Action: TC_ACT_OK

MTU-SIZED PACKET (orig-len <= MTU && orig-len + GUE-overhead > MTU):
         systemd-1693  [000] .Ns.   408.640488: 0: PFC ifindex 52, len 1502
         systemd-1693  [000] .Ns.   408.640540: 0:   gso_segs 1
                                                 <- skb->gso_segs (no
GSO)
         systemd-1693  [000] .Ns.   408.640561: 0:   gso_size 0
                                                  <- skb->gso_size
(5.7+) ... skb_is_gso(skb) -> "false"
         systemd-1693  [000] .Ns.   408.640582: 0: ID node1 TX
         systemd-1693  [000] .Ns.   408.640601: 0: DUMP: =====================
         systemd-1693  [000] .Ns.   408.640617: 0:   Size : 1502 B
         systemd-1693  [000] .Ns.   408.640632: 0:   ETH  : ac010005
-> ac010002, proto 800
         systemd-1693  [000] .Ns.   408.640646: 0:   IPv4 : 1010101 ->
ac010002, id 62737
         systemd-1693  [000] .Ns.   408.640659: 0:     csum 0x9211
         systemd-1693  [000] .Ns.   408.640670: 0:   TCP  : 4000 ->
60296, Flags [A]
         systemd-1693  [000] .Ns.   408.640681: 0:     csum 0xb3c7
         systemd-1693  [000] .Ns.   408.640690: 0: DUMP: =====================
         systemd-1693  [000] .Ns.   408.640729: 0: GUE Encap Tunnel: id 65636
         systemd-1693  [000] .Ns.   408.640735: 0:     FROM ac010005:6000
         systemd-1693  [000] .Ns.   408.640740: 0:     TO
ac010003:5000       (ac010003)
         systemd-1693  [000] .Ns.   408.640746: 0:
bpf_skb_adjust_room: -524                                    <- FAILED
to enlarge skbuff
         systemd-1693  [000] .Ns.   408.640750: 0: GUE Encap Failed!
         systemd-1693  [000] .Ns.   408.640755: 0: Action: TC_ACT_SHOT

HUGE PACKET:
            curl-14470 [000] ..s1   402.566490: 0: PFC ifindex 52, len 63778
            curl-14470 [000] ..s1   402.566491: 0:   gso_segs 44
                                                  <- skb->gso_segs
(GSO in progress)
            curl-14470 [000] ..s1   402.566491: 0:   gso_size 1448
                                                 <- skb->gso_size
(5.7+) ... skb_is_gso(skb) -> "true"
            curl-14470 [000] ..s1   402.566492: 0: ID node1 TX
            curl-14470 [000] ..s1   402.566492: 0: DUMP: =====================
            curl-14470 [000] ..s1   402.566493: 0:   Size : 63778 B
            curl-14470 [000] ..s1   402.566493: 0:   ETH  : ac010005
-> ac010002, proto 800
            curl-14470 [000] ..s1   402.566493: 0:   IPv4 : 1010101 ->
ac010002, id 62674
            curl-14470 [000] ..s1   402.566494: 0:     csum 0x9f0b
            curl-14470 [000] ..s1   402.566494: 0:   TCP  : 4000 ->
60296, Flags [A]
            curl-14470 [000] ..s1   402.566494: 0:     csum 0xa70c
            curl-14470 [000] ..s1   402.566495: 0: DUMP: =====================
            curl-14470 [000] ..s1   402.566497: 0: GUE Encap Tunnel: id 65636
            curl-14470 [000] ..s1   402.566497: 0:     FROM ac010005:6000
            curl-14470 [000] ..s1   402.566497: 0:     TO
ac010003:5000       (ac010003)
            curl-14470 [000] ..s1   402.566498: 0: DUMP: =====================
            curl-14470 [000] ..s1   402.566498: 0:   Size : 63830 B
                                                      <- and it still
works
            curl-14470 [000] ..s1   402.566499: 0:   ETH  : ac010005
-> ac010003, proto 800
            curl-14470 [000] ..s1   402.566499: 0:   IPv4 : ac010005
-> ac010003, id 62674
            curl-14470 [000] ..s1   402.566499: 0:     csum 0xf4c6
            curl-14470 [000] ..s1   402.566500: 0:   UDP  : 6000 -> 5000
            curl-14470 [000] ..s1   402.566500: 0:     csum 0x0
            curl-14470 [000] ..s1   402.566500: 0: DUMP: =====================
            curl-14470 [000] ..s1   402.566500: 0: Action: TC_ACT_OK

I tried kernels from 5.2 to 5.8 and they behave the same, only
difference I noticed is skb->gso_size is supported by 5.7+.
bpf_skb_net_grow should recompute GSO, but base on visual code check
it not getting there because of:
https://github.com/torvalds/linux/blob/9907ab371426da8b3cffa6cc3e4ae54829559207/net/core/filter.c
  line: 3256
    if ((shrink && (len_diff_abs >= len_cur ||
            len_cur - len_diff_abs < len_min)) ||
        (!shrink && (skb->len + len_diff_abs > len_max &&          <-
enlarge && new-size-exceeds-MTU && gso-size is not set yet
             !skb_is_gso(skb))))
        return -ENOTSUPP;

IMHO len_max could help, but I do not understand the
"!skb_is_gso(skb)" check under current conditions. I'm probably
missing something here, but if it was a "system wide" setting it would
make sense (we can handle > MTU packets because GSO kicks in OR we
can't because of no GSO). But it doesn't make sense to me why some
packets in the same stream support GSO and others do not.

https://github.com/torvalds/linux/blob/3e8d3bdc2a757cc6be5470297947799a7df445cc/include/linux/skbuff.h
static inline bool skb_is_gso(const struct sk_buff *skb)
{
    return skb_shinfo(skb)->gso_size;
}

Regards,
Marek

On Mon, Sep 21, 2020 at 5:09 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 9/21/20 2:49 PM, Jesper Dangaard Brouer wrote:
> > On Mon, 21 Sep 2020 11:37:18 +0100
> > Lorenz Bauer <lmb@cloudflare.com> wrote:
> >> On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski <maze@google.com> wrote:
> >>>
> >>>> This is a good point.  As bpf_skb_adjust_room() can just be run after
> >>>> bpf_redirect() call, then a MTU check in bpf_redirect() actually
> >>>> doesn't make much sense.  As clever/bad BPF program can then avoid the
> >>>> MTU check anyhow.  This basically means that we have to do the MTU
> >>>> check (again) on kernel side anyhow to catch such clever/bad BPF
> >>>> programs.  (And I don't like wasting cycles on doing the same check two
> >>>> times).
> >>>
> >>> If you get rid of the check in bpf_redirect() you might as well get
> >>> rid of *all* the checks for excessive mtu in all the helpers that
> >>> adjust packet size one way or another way.  They *all* then become
> >>> useless overhead.
> >>>
> >>> I don't like that.  There may be something the bpf program could do to
> >>> react to the error condition (for example in my case, not modify
> >>> things and just let the core stack deal with things - which will
> >>> probably just generate packet too big icmp error).
> >>>
> >>> btw. right now our forwarding programs first adjust the packet size
> >>> then call bpf_redirect() and almost immediately return what it
> >>> returned.
> >>>
> >>> but this could I think easily be changed to reverse the ordering, so
> >>> we wouldn't increase packet size before the core stack was informed we
> >>> would be forwarding via a different interface.
> >>
> >> We do the same, except that we also use XDP_TX when appropriate. This
> >> complicates the matter, because there is no helper call we could
> >> return an error from.
> >
> > Do notice that my MTU work is focused on TC-BPF.  For XDP-redirect the
> > MTU check is done in xdp_ok_fwd_dev() via __xdp_enqueue(), which also
> > happens too late to give BPF-prog knowledge/feedback.  For XDP_TX I
> > audited the drivers when I implemented xdp_buff.frame_sz, and they
> > handled (or I added) handling against max HW MTU. E.g. mlx5 [1].
> >
> > [1] https://elixir.bootlin.com/linux/v5.9-rc6/source/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L267
> >
> >> My preference would be to have three helpers: get MTU for a device,
> >> redirect ctx to a device (with MTU check), resize ctx (without MTU
> >> check) but that doesn't work with XDP_TX. Your idea of doing checks
> >> in redirect and adjust_room is pragmatic and seems easier to
> >> implement.
> >
> > I do like this plan/proposal (with 3 helpers), but it is not possible
> > with current API.  The main problem is the current bpf_redirect API
> > doesn't provide the ctx, so we cannot do the check in the BPF-helper.
> >
> > Are you saying we should create a new bpf_redirect API (that incl packet ctx)?
>
> Sorry for jumping in late here... one thing that is not clear to me is that if
> we are fully sure that skb is dropped by stack anyway due to invalid MTU (redirect
> to ingress does this via dev_forward_skb(), it's not fully clear to me whether it's
> also the case for the dev_queue_xmiy()), then why not dropping all the MTU checks
> aside from SKB_MAX_ALLOC sanity check for BPF helpers and have something like a
> device object (similar to e.g. TCP sockets) exposed to BPF prog where we can retrieve
> the object and read dev->mtu from the prog, so the BPF program could then do the
> "exception" handling internally w/o extra prog needed (we also already expose whether
> skb is GSO or not).
>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 15:08             ` Daniel Borkmann
  2020-09-21 16:21               ` Marek Zavodsky
@ 2020-09-21 16:26               ` Jesper Dangaard Brouer
  2020-09-22  6:56                 ` Eyal Birger
  2020-09-21 18:04               ` John Fastabend
  2 siblings, 1 reply; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-21 16:26 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Lorenz Bauer, Maciej Żenczykowski, Saeed Mahameed,
	Daniel Borkmann, Alexei Starovoitov, BPF-dev-list, netdev,
	Lorenzo Bianconi, John Fastabend, Jakub Kicinski, Shaun Crampton,
	David Miller, Marek Majkowski, brouer

On Mon, 21 Sep 2020 17:08:17 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 9/21/20 2:49 PM, Jesper Dangaard Brouer wrote:
> > On Mon, 21 Sep 2020 11:37:18 +0100
> > Lorenz Bauer <lmb@cloudflare.com> wrote:  
> >> On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski <maze@google.com> wrote:  
> >>>     
> >>>> This is a good point.  As bpf_skb_adjust_room() can just be run after
> >>>> bpf_redirect() call, then a MTU check in bpf_redirect() actually
> >>>> doesn't make much sense.  As clever/bad BPF program can then avoid the
> >>>> MTU check anyhow.  This basically means that we have to do the MTU
> >>>> check (again) on kernel side anyhow to catch such clever/bad BPF
> >>>> programs.  (And I don't like wasting cycles on doing the same check two
> >>>> times).  
> >>>
> >>> If you get rid of the check in bpf_redirect() you might as well get
> >>> rid of *all* the checks for excessive mtu in all the helpers that
> >>> adjust packet size one way or another way.  They *all* then become
> >>> useless overhead.
> >>>
> >>> I don't like that.  There may be something the bpf program could do to
> >>> react to the error condition (for example in my case, not modify
> >>> things and just let the core stack deal with things - which will
> >>> probably just generate packet too big icmp error).
> >>>
> >>> btw. right now our forwarding programs first adjust the packet size
> >>> then call bpf_redirect() and almost immediately return what it
> >>> returned.
> >>>
> >>> but this could I think easily be changed to reverse the ordering, so
> >>> we wouldn't increase packet size before the core stack was informed we
> >>> would be forwarding via a different interface.  
> >>
> >> We do the same, except that we also use XDP_TX when appropriate. This
> >> complicates the matter, because there is no helper call we could
> >> return an error from.  
> > 
> > Do notice that my MTU work is focused on TC-BPF.  For XDP-redirect the
> > MTU check is done in xdp_ok_fwd_dev() via __xdp_enqueue(), which also
> > happens too late to give BPF-prog knowledge/feedback.  For XDP_TX I
> > audited the drivers when I implemented xdp_buff.frame_sz, and they
> > handled (or I added) handling against max HW MTU. E.g. mlx5 [1].
> > 
> > [1] https://elixir.bootlin.com/linux/v5.9-rc6/source/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L267
> >   
> >> My preference would be to have three helpers: get MTU for a device,
> >> redirect ctx to a device (with MTU check), resize ctx (without MTU
> >> check) but that doesn't work with XDP_TX. Your idea of doing checks
> >> in redirect and adjust_room is pragmatic and seems easier to
> >> implement.  
> >   
> > I do like this plan/proposal (with 3 helpers), but it is not possible
> > with current API.  The main problem is the current bpf_redirect API
> > doesn't provide the ctx, so we cannot do the check in the BPF-helper.
> > 
> > Are you saying we should create a new bpf_redirect API (that incl packet ctx)?  
> 
> Sorry for jumping in late here... one thing that is not clear to me
> is that if we are fully sure that skb is dropped by stack anyway due
> to invalid MTU (redirect to ingress does this via dev_forward_skb(),

Yes, TC-redirecting to *INGRESS* have a slightly relaxed MTU check via
is_skb_forwardable() called via ____dev_forward_skb().  This MTU check
seems redundant as netstack will do MTU checks anyhow.

> it's not fully clear to me whether it's also the case for the
> dev_queue_xmit()),

This seems the problematic case; TC-ingress redirect to netdev-egress,
that basically calls dev_queue_xmit().  I tried to follow the code all
the way into ixgbe driver, and I didn't see any MTU checks.

We might have to add a MTU check here, as it could be considered a
bug/problematic that we allow this. (e.g. netdev with large MTU can
redirect frames larger than MTU of egress netdev).


> then why not dropping all the MTU checks aside
> from SKB_MAX_ALLOC sanity check for BPF helpers 

I agree, and think that the MTU checks in the BPF-helpers, make little
sense, as we have found ways to circumvent these checks (as discussed
in thread).

> and have something like a device object (similar to e.g. TCP sockets)
> exposed to BPF prog where we can retrieve the object and read
> dev->mtu from the prog, so the BPF program could then do the
> "exception" handling internally w/o extra prog needed (we also
> already expose whether skb is GSO or not).

I do think we need some BPF-helper that allows BPF-prog to lookup MTU
of a netdev, so it can do proper ICMP exception handling.

I looked at doing ICMP exception handling on kernel-side, but realized
that this is not possible at the TC-redirect layer, as we have not
decoded the L3 protocol at this point (e.g. cannot know if I need to
call icmp_send or icmp6_send).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 15:08             ` Daniel Borkmann
  2020-09-21 16:21               ` Marek Zavodsky
  2020-09-21 16:26               ` Jesper Dangaard Brouer
@ 2020-09-21 18:04               ` John Fastabend
  2020-10-06 11:45                 ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 17+ messages in thread
From: John Fastabend @ 2020-09-21 18:04 UTC (permalink / raw)
  To: Daniel Borkmann, Jesper Dangaard Brouer, Lorenz Bauer
  Cc: Maciej Żenczykowski, Saeed Mahameed, Daniel Borkmann,
	Alexei Starovoitov, BPF-dev-list, netdev, Lorenzo Bianconi,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski

Daniel Borkmann wrote:
> On 9/21/20 2:49 PM, Jesper Dangaard Brouer wrote:
> > On Mon, 21 Sep 2020 11:37:18 +0100
> > Lorenz Bauer <lmb@cloudflare.com> wrote:
> >> On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski <maze@google.com> wrote:
> >>>   
> >>>> This is a good point.  As bpf_skb_adjust_room() can just be run after
> >>>> bpf_redirect() call, then a MTU check in bpf_redirect() actually
> >>>> doesn't make much sense.  As clever/bad BPF program can then avoid the
> >>>> MTU check anyhow.  This basically means that we have to do the MTU
> >>>> check (again) on kernel side anyhow to catch such clever/bad BPF
> >>>> programs.  (And I don't like wasting cycles on doing the same check two
> >>>> times).
> >>>
> >>> If you get rid of the check in bpf_redirect() you might as well get
> >>> rid of *all* the checks for excessive mtu in all the helpers that
> >>> adjust packet size one way or another way.  They *all* then become
> >>> useless overhead.
> >>>
> >>> I don't like that.  There may be something the bpf program could do to
> >>> react to the error condition (for example in my case, not modify
> >>> things and just let the core stack deal with things - which will
> >>> probably just generate packet too big icmp error).
> >>>
> >>> btw. right now our forwarding programs first adjust the packet size
> >>> then call bpf_redirect() and almost immediately return what it
> >>> returned.
> >>>
> >>> but this could I think easily be changed to reverse the ordering, so
> >>> we wouldn't increase packet size before the core stack was informed we
> >>> would be forwarding via a different interface.
> >>
> >> We do the same, except that we also use XDP_TX when appropriate. This
> >> complicates the matter, because there is no helper call we could
> >> return an error from.
> > 
> > Do notice that my MTU work is focused on TC-BPF.  For XDP-redirect the
> > MTU check is done in xdp_ok_fwd_dev() via __xdp_enqueue(), which also
> > happens too late to give BPF-prog knowledge/feedback.  For XDP_TX I
> > audited the drivers when I implemented xdp_buff.frame_sz, and they
> > handled (or I added) handling against max HW MTU. E.g. mlx5 [1].
> > 
> > [1] https://elixir.bootlin.com/linux/v5.9-rc6/source/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L267
> > 
> >> My preference would be to have three helpers: get MTU for a device,
> >> redirect ctx to a device (with MTU check), resize ctx (without MTU
> >> check) but that doesn't work with XDP_TX. Your idea of doing checks
> >> in redirect and adjust_room is pragmatic and seems easier to
> >> implement.
> >   
> > I do like this plan/proposal (with 3 helpers), but it is not possible
> > with current API.  The main problem is the current bpf_redirect API
> > doesn't provide the ctx, so we cannot do the check in the BPF-helper.
> > 
> > Are you saying we should create a new bpf_redirect API (that incl packet ctx)?
> 
> Sorry for jumping in late here... one thing that is not clear to me is that if
> we are fully sure that skb is dropped by stack anyway due to invalid MTU (redirect
> to ingress does this via dev_forward_skb(), it's not fully clear to me whether it's
> also the case for the dev_queue_xmiy()), then why not dropping all the MTU checks
> aside from SKB_MAX_ALLOC sanity check for BPF helpers and have something like a
> device object (similar to e.g. TCP sockets) exposed to BPF prog where we can retrieve
> the object and read dev->mtu from the prog, so the BPF program could then do the
> "exception" handling internally w/o extra prog needed (we also already expose whether
> skb is GSO or not).
> 
> Thanks,
> Daniel

My $.02 is MTU should only apply to transmitted packets so redirect to
ingress should be OK. Then on transmit shouldn't the user know the MTU
on their devices?

I'm for dropping all the MTU checks and if a driver tosses a packet then
the user should be more careful. Having a bpf helper to check MTU of a
dev seems useful although the workaround would be a map the user could
put the max MTU in. Of course that would be a bit fragile if the BPF program
and person managing MTU are not in-sync.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 16:21               ` Marek Zavodsky
@ 2020-09-21 21:17                 ` Willem de Bruijn
  2020-09-22  9:15                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 17+ messages in thread
From: Willem de Bruijn @ 2020-09-21 21:17 UTC (permalink / raw)
  To: Marek Zavodsky
  Cc: Daniel Borkmann, Jesper Dangaard Brouer, Lorenz Bauer,
	Maciej Żenczykowski, Saeed Mahameed, Daniel Borkmann,
	Alexei Starovoitov, BPF-dev-list, netdev, Lorenzo Bianconi,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski

On Mon, Sep 21, 2020 at 6:22 PM Marek Zavodsky <marek.zavodsky@gmail.com> wrote:
>
> Hi guys,
>
> My kernel knowledge is small, but I experienced this (similar) issue
> with packet encapsulation (not a redirect), therefore modifying the
> redirect branch would not help in my case.
>
> I'm working on a TC program to do GUE encap/decap (IP + UDP + GUE,
> outer header has extra 52B).
> There are no issues with small packets. But when I use curl to
> download big file HTTP server chunks data randomly. Some packets have
> MTU size, others are even bigger. Big packets are not an issue,
> however MTU sized packets fail on bpf_skb_adjust_room with -524
> (ENOTSUPP).

This is a related, but different, unresolved issue at the boundary of
GSO packets. Packets that are not GSO, but would exceed MTU once
encapsulated, will cause adjust room to fail:

            (!shrink && (skb->len + len_diff_abs > len_max &&
                         !skb_is_gso(skb))))
                return -ENOTSUPP;

As admin, this can be addressed by setting a lower route MTU on routes
that may be encapsulated. But that is not very obvious or transparent.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 16:26               ` Jesper Dangaard Brouer
@ 2020-09-22  6:56                 ` Eyal Birger
  0 siblings, 0 replies; 17+ messages in thread
From: Eyal Birger @ 2020-09-22  6:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Lorenz Bauer, Maciej Żenczykowski,
	Saeed Mahameed, Daniel Borkmann, Alexei Starovoitov,
	BPF-dev-list, netdev, Lorenzo Bianconi, John Fastabend,
	Jakub Kicinski, Shaun Crampton, David Miller, Marek Majkowski

On Mon, Sep 21, 2020 at 7:30 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Mon, 21 Sep 2020 17:08:17 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> > On 9/21/20 2:49 PM, Jesper Dangaard Brouer wrote:
> > > On Mon, 21 Sep 2020 11:37:18 +0100
> > > Lorenz Bauer <lmb@cloudflare.com> wrote:
> > >> On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski <maze@google.com> wrote:
> > >>>
> > >>>> This is a good point.  As bpf_skb_adjust_room() can just be run after
> > >>>> bpf_redirect() call, then a MTU check in bpf_redirect() actually
> > >>>> doesn't make much sense.  As clever/bad BPF program can then avoid the
> > >>>> MTU check anyhow.  This basically means that we have to do the MTU
> > >>>> check (again) on kernel side anyhow to catch such clever/bad BPF
> > >>>> programs.  (And I don't like wasting cycles on doing the same check two
> > >>>> times).
> > >>>
> > >>> If you get rid of the check in bpf_redirect() you might as well get
> > >>> rid of *all* the checks for excessive mtu in all the helpers that
> > >>> adjust packet size one way or another way.  They *all* then become
> > >>> useless overhead.
> > >>>
> > >>> I don't like that.  There may be something the bpf program could do to
> > >>> react to the error condition (for example in my case, not modify
> > >>> things and just let the core stack deal with things - which will
> > >>> probably just generate packet too big icmp error).
> > >>>
> > >>> btw. right now our forwarding programs first adjust the packet size
> > >>> then call bpf_redirect() and almost immediately return what it
> > >>> returned.
> > >>>
> > >>> but this could I think easily be changed to reverse the ordering, so
> > >>> we wouldn't increase packet size before the core stack was informed we
> > >>> would be forwarding via a different interface.
> > >>
> > >> We do the same, except that we also use XDP_TX when appropriate. This
> > >> complicates the matter, because there is no helper call we could
> > >> return an error from.
> > >
> > > Do notice that my MTU work is focused on TC-BPF.  For XDP-redirect the
> > > MTU check is done in xdp_ok_fwd_dev() via __xdp_enqueue(), which also
> > > happens too late to give BPF-prog knowledge/feedback.  For XDP_TX I
> > > audited the drivers when I implemented xdp_buff.frame_sz, and they
> > > handled (or I added) handling against max HW MTU. E.g. mlx5 [1].
> > >
> > > [1] https://elixir.bootlin.com/linux/v5.9-rc6/source/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L267
> > >
> > >> My preference would be to have three helpers: get MTU for a device,
> > >> redirect ctx to a device (with MTU check), resize ctx (without MTU
> > >> check) but that doesn't work with XDP_TX. Your idea of doing checks
> > >> in redirect and adjust_room is pragmatic and seems easier to
> > >> implement.
> > >
> > > I do like this plan/proposal (with 3 helpers), but it is not possible
> > > with current API.  The main problem is the current bpf_redirect API
> > > doesn't provide the ctx, so we cannot do the check in the BPF-helper.
> > >
> > > Are you saying we should create a new bpf_redirect API (that incl packet ctx)?
> >
> > Sorry for jumping in late here... one thing that is not clear to me
> > is that if we are fully sure that skb is dropped by stack anyway due
> > to invalid MTU (redirect to ingress does this via dev_forward_skb(),
>
> Yes, TC-redirecting to *INGRESS* have a slightly relaxed MTU check via
> is_skb_forwardable() called via ____dev_forward_skb().  This MTU check
> seems redundant as netstack will do MTU checks anyhow.
>

I found the MTU check on redirect-to-ingress to be very unexpected.

We hit this when implementing NAT64 as a tc egress program which translates
the packet and redirects it to ingress from the same device.

It is beneficial to have the MTU of the device set to a limit fitting the
IPv4 MTU, so that the IP stack would fragment as needed on the IPv4->IPv6
path. But when translating the packet to IPv6, it can no longer be ingressed
from the same device because of the MTU check. Packets are silently dropped
without any hint.

So would definitely be nice if this check is removed, or a flag to avoid
it is supported in bpf_redirect().

Eyal.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 21:17                 ` Willem de Bruijn
@ 2020-09-22  9:15                   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-22  9:15 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Marek Zavodsky, Daniel Borkmann, Lorenz Bauer,
	Maciej Żenczykowski, Saeed Mahameed, Daniel Borkmann,
	Alexei Starovoitov, BPF-dev-list, netdev, Lorenzo Bianconi,
	John Fastabend, Jakub Kicinski, Shaun Crampton, David Miller,
	Marek Majkowski, brouer

On Mon, 21 Sep 2020 23:17:16 +0200
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> On Mon, Sep 21, 2020 at 6:22 PM Marek Zavodsky <marek.zavodsky@gmail.com> wrote:
> >
> > Hi guys,
> >
> > My kernel knowledge is small, but I experienced this (similar) issue
> > with packet encapsulation (not a redirect), therefore modifying the
> > redirect branch would not help in my case.
> >
> > I'm working on a TC program to do GUE encap/decap (IP + UDP + GUE,
> > outer header has extra 52B).
> > There are no issues with small packets. But when I use curl to
> > download big file HTTP server chunks data randomly. Some packets have
> > MTU size, others are even bigger. Big packets are not an issue,
> > however MTU sized packets fail on bpf_skb_adjust_room with -524
> > (ENOTSUPP).  
> 
> This is a related, but different, unresolved issue at the boundary of
> GSO packets. Packets that are not GSO, but would exceed MTU once
> encapsulated, will cause adjust room to fail:
> 
>             (!shrink && (skb->len + len_diff_abs > len_max &&
>                          !skb_is_gso(skb))))
>                 return -ENOTSUPP;
> 
> As admin, this can be addressed by setting a lower route MTU on routes
> that may be encapsulated. But that is not very obvious or transparent.

Your issue is very much related, and even-though it is not related to
redirect, I also want to address and allow your use-case (in the
patchset that I'm collecting input for now).

I do think this patch[1][2] will actually solve your problem.  Could
you please try to apply and test this to make sure?  (as we have
discussed on this list, that patch is not a 100% solution, and I will
work on a better solution).

[1] https://lore.kernel.org/bpf/159921182827.1260200.9699352760916903781.stgit@firesoul
[2] http://patchwork.ozlabs.org/project/netdev/patch/159921182827.1260200.9699352760916903781.stgit@firesoul/
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BPF redirect API design issue for BPF-prog MTU feedback?
  2020-09-21 18:04               ` John Fastabend
@ 2020-10-06 11:45                 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-06 11:45 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Lorenz Bauer, Maciej Żenczykowski,
	Saeed Mahameed, Daniel Borkmann, Alexei Starovoitov,
	BPF-dev-list, netdev, Lorenzo Bianconi, Jakub Kicinski,
	Shaun Crampton, David Miller, Marek Majkowski, brouer

On Mon, 21 Sep 2020 11:04:09 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Daniel Borkmann wrote:
> > On 9/21/20 2:49 PM, Jesper Dangaard Brouer wrote:  
> > > On Mon, 21 Sep 2020 11:37:18 +0100
> > > Lorenz Bauer <lmb@cloudflare.com> wrote:  
> > >> On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski <maze@google.com> wrote:  
> > >>>     
> > >>>> This is a good point.  As bpf_skb_adjust_room() can just be run after
> > >>>> bpf_redirect() call, then a MTU check in bpf_redirect() actually
> > >>>> doesn't make much sense.  As clever/bad BPF program can then avoid the
> > >>>> MTU check anyhow.  This basically means that we have to do the MTU
> > >>>> check (again) on kernel side anyhow to catch such clever/bad BPF
> > >>>> programs.  (And I don't like wasting cycles on doing the same check two
> > >>>> times).  
> > >>>
> > >>> If you get rid of the check in bpf_redirect() you might as well get
> > >>> rid of *all* the checks for excessive mtu in all the helpers that
> > >>> adjust packet size one way or another way.  They *all* then become
> > >>> useless overhead.
> > >>>
[...]
> > 
> > Sorry for jumping in late here... one thing that is not clear to me is that if
> > we are fully sure that skb is dropped by stack anyway due to invalid MTU (redirect
> > to ingress does this via dev_forward_skb(), it's not fully clear to me whether it's
> > also the case for the dev_queue_xmiy()), then why not dropping all the MTU checks
> > aside from SKB_MAX_ALLOC sanity check for BPF helpers and have something like a
> > device object (similar to e.g. TCP sockets) exposed to BPF prog where we can retrieve
> > the object and read dev->mtu from the prog, so the BPF program could then do the
> > "exception" handling internally w/o extra prog needed (we also already expose whether
> > skb is GSO or not).
> > 
> > Thanks,
> > Daniel  
> 
> My $.02 is MTU should only apply to transmitted packets so redirect to
> ingress should be OK. Then on transmit shouldn't the user know the MTU
> on their devices?

I like the point that "MTU should only apply to transmitted packets". 
 
> I'm for dropping all the MTU checks and if a driver tosses a packet then
> the user should be more careful. Having a bpf helper to check MTU of a
> dev seems useful although the workaround would be a map the user could
> put the max MTU in. Of course that would be a bit fragile if the BPF program
> and person managing MTU are not in-sync.

I'm coding this up. Dropping all the MTU checks in helpers, but adding
helper to lookup/check the MTU.  I've also extended the bpf_fib_lookup
to return MTU value (it already does MTU check), as it can be more
specific.

The problematic code path seems to be when TC-ingress redirect packet
to egress on another netdev, then the normal netstack MTU checks are
skipped and driver level will not catch any MTU violation (only checked
ixgbe code path).

First I looked at adding MTU check in the egress code path of
skb_do_redirect() prior to calling dev_queue_xmit(), but I found this
to be the wrong approach.  This is because it is still possible to run
another BPF egress program that will shrink/consume headers, which will
make packet comply with netdev MTU. This use-case might already be in
production use (allowed if ingress MTU is larger than egress MTU).

Instead I'm currently coding up doing the MTU check after
sch_handle_egress() step, for the cases that require this.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-10-06 11:45 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-17 12:38 BPF redirect API design issue for BPF-prog MTU feedback? Jesper Dangaard Brouer
2020-09-17 12:54 ` Maciej Żenczykowski
2020-09-17 19:11   ` Saeed Mahameed
2020-09-18 10:00     ` Jesper Dangaard Brouer
2020-09-18 10:34       ` Toke Høiland-Jørgensen
2020-09-18 23:06       ` Maciej Żenczykowski
2020-09-21 10:37         ` Lorenz Bauer
2020-09-21 12:49           ` Jesper Dangaard Brouer
2020-09-21 15:08             ` Daniel Borkmann
2020-09-21 16:21               ` Marek Zavodsky
2020-09-21 21:17                 ` Willem de Bruijn
2020-09-22  9:15                   ` Jesper Dangaard Brouer
2020-09-21 16:26               ` Jesper Dangaard Brouer
2020-09-22  6:56                 ` Eyal Birger
2020-09-21 18:04               ` John Fastabend
2020-10-06 11:45                 ` Jesper Dangaard Brouer
2020-09-21 10:42   ` Lorenz Bauer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).