* How to limit TCP packet lengths given to TC egress EBPF programs?
@ 2021-07-09 18:38 Fingerhut, John Andy
  2021-07-13 23:51 ` Alexei Starovoitov
  0 siblings, 1 reply; 4+ messages in thread
From: Fingerhut, John Andy @ 2021-07-09 18:38 UTC (permalink / raw)
  To: netdev

Greetings:

I am working on a project that runs an EBPF program on the Linux
Traffic Control egress hook, which modifies selected packets to add
headers to them that we use for some network telemetry.

I know that this is _not_ what one wants to do to get maximum TCP
performance, but at least for development purposes I was hoping to
find a way to limit the length of all TCP packets that are processed
by this EBPF program to be at most one MTU.

Towards that goal, we have tried several things, but regardless of
which subset of the following things we have tried, there are some
packets processed by our EBPF program that have IPv4 Total Length
field that is some multiple of the MSS size, sometimes nearly 64
KBytes.  If it makes a difference in configuration options available,
we have primarily been testing with Ubuntu 20.04 Linux running the
Linux kernel versions near 5.8.0-50-generic distributed by Canonical.

Disable TSO and GSO on the network interface:

    ethtool -K enp0s8 tso off gso off

Configuring TCP MSS using 'ip route' command:

    ip route change 10.0.3.0/24 dev enp0s8 advmss 1424

The last command _does_ have some effect, in that many packets
processed by our EBPF program have a length affected by that advmss
value, but we still see many packets that are about twice as large,
about three times as large, etc., which fit into that MSS after being
segmented, I believe in the kernel GSO code.

Is there some other configuration option we can change that can
guarantee that when a TCP packet is given to a TC egress EBPF program,
it will always be at most a specified length?
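
In case a concrete example helps, this is the kind of check we can use
at TC egress to observe this (a simplified sketch, not the actual
host-int code; the 1500-byte threshold and the names are illustrative
assumptions for this example only):

    /* Simplified sketch (illustrative): log IPv4 packets whose Total
     * Length still exceeds a one-MTU threshold at the TC egress hook.
     */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    #define LEN_LIMIT 1500   /* assumed debugging threshold */

    SEC("classifier")
    int tc_egress_len_check(struct __sk_buff *skb)
    {
        void *data     = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph = (void *)(eth + 1);

        /* Bounds check covers the Ethernet and fixed IPv4 headers. */
        if ((void *)(iph + 1) > data_end)
            return TC_ACT_OK;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return TC_ACT_OK;

        if (bpf_ntohs(iph->tot_len) > LEN_LIMIT)
            bpf_printk("oversized skb: tot_len=%u gso_segs=%u\n",
                       bpf_ntohs(iph->tot_len), skb->gso_segs);

        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";

(skb->gso_segs shows whether the skb is still a GSO aggregate when it
reaches the program.)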


Background:

Intel is developing and releasing some open source EBPF programs and
associated user space programs that modify packets to add INT (Inband
Network Telemetry) headers, which can be used for some kinds of
performance debugging reasons, e.g. triggering events when packet
losses are detected, or significant changes in one-way packet latency
between two hosts configured to run this Host INT code.  See the
project home page for more details if you are interested:

https://github.com/intel/host-int

Note: The code published now is an alpha release.  We know there are
bugs.  We know our development team is not what you would call EBPF
experts (at least not yet), so feel free to point out bugs and/or
anything that code is doing that might be a bad idea.

Thanks,
Andy Fingerhut
Principal Engineer
Intel Corporation


* Re: How to limit TCP packet lengths given to TC egress EBPF programs?
  2021-07-09 18:38 How to limit TCP packet lengths given to TC egress EBPF programs? Fingerhut, John Andy
@ 2021-07-13 23:51 ` Alexei Starovoitov
       [not found]   ` <CABX6iNqj-ojymaPhPtgeOGxtUS6evyrvN69MrLD7s_+Z3xAK+w@mail.gmail.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Alexei Starovoitov @ 2021-07-13 23:51 UTC (permalink / raw)
  To: Fingerhut, John Andy, bpf
  Cc: netdev, Petr Lapukhov, Sandesh Dhawaskar Sathyanarayana,
	Daniel Borkmann, Toke Høiland-Jørgensen

On Fri, Jul 9, 2021 at 11:40 AM Fingerhut, John Andy
<john.andy.fingerhut@intel.com> wrote:
>
> Greetings:
>
> I am working on a project that runs an EBPF program on the Linux
> Traffic Control egress hook, which modifies selected packets to add
> headers to them that we use for some network telemetry.
>
> I know that this is _not_ what one wants to do to get maximum TCP
> performance, but at least for development purposes I was hoping to
> find a way to limit the length of all TCP packets that are processed
> by this EBPF program to be at most one MTU.
>
> Towards that goal, we have tried several things, but regardless of
> which subset of the following things we have tried, there are some
> packets processed by our EBPF program that have IPv4 Total Length
> field that is some multiple of the MSS size, sometimes nearly 64
> KBytes.  If it makes a difference in configuration options available,
> we have primarily been testing with Ubuntu 20.04 Linux running the
> Linux kernel versions near 5.8.0-50-generic distributed by Canonical.
>
> Disable TSO and GSO on the network interface:
>
>     ethtool -K enp0s8 tso off gso off
>
> Configuring TCP MSS using 'ip route' command:
>
>     ip route change 10.0.3.0/24 dev enp0s8 advmss 1424
>
> The last command _does_ have some effect, in that many packets
> processed by our EBPF program have a length affected by that advmss
> value, but we still see many packets that are about twice as large,
> about three times as large, etc., which fit into that MSS after being
> segmented, I believe in the kernel GSO code.
>
> Is there some other configuration option we can change that can
> guarantee that when a TCP packet is given to a TC egress EBPF program,
> it will always be at most a specified length?
>
>
> Background:
>
> Intel is developing and releasing some open source EBPF programs and
> associated user space programs that modify packets to add INT (Inband
> Network Telemetry) headers, which can be used for some kinds of
> performance debugging reasons, e.g. triggering events when packet
> losses are detected, or significant changes in one-way packet latency
> between two hosts configured to run this Host INT code.  See the
> project home page for more details if you are interested:
>
> https://github.com/intel/host-int

I suspect the MTU/MSS issue is only the tip of the iceberg.

https://github.com/intel/host-int/blob/main/docs/Host_INT_fmt.md
That's an interesting design !
A few things should probably be addressed sooner rather than later:
"Host INT currently only supports adding INT headers to IPv4 packets."
For such a feature to be considered for Tofino switches, IPv6 has to be supported.
That shouldn't be hard to do, right?

https://github.com/intel/host-int/blob/main/docs/host-int-project.pptx
That's a lot of bpf programs :)
Looks like in the bridge case (last slide) every incoming packet will be
processed by two XDP programs.
XDP is certainly fast, but it still adds overhead.
Not every packet will have such an INT header, so most of the packets will
be passing through the XDP prog into the stack or from the stack through
the TC egress program.  Such XDP ingress and TC egress progs will add
overhead that might be unacceptable in a production deployment.
Have you considered using the new TCP header option instead?
https://lore.kernel.org/bpf/CAADnVQJ21Tt2HaJ5P4wbxBLVo1YT-PwN3bOHBQK+17reK5HxOg@mail.gmail.com/
A BPF prog can conditionally add it for a few packets/flows, and another
BPF prog on the receive side will process such a header option.
Meanwhile, the Tofino switch will find packets with the special TCP header
option and fill them in with telemetry data.
"INT report packets are sent as UDP datagrams" part of the design can stay.
Looks like you're reserving a UDP port for such a purpose, so no need
for the receive side to have an XDP program to process every packet.

With the TCP header option approach, the MTU issue will go away as well.
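
Roughly (just a sketch, with an assumed port number and no error
handling), the gating half could be a sockops prog that turns on the
header option callbacks only for the flows of interest:

    /* Illustrative sketch: enable the TCP header-option callbacks only
     * for selected flows so the rest of the traffic stays untouched.
     * The port number is an assumption, not part of Host INT.
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    #define MONITORED_PORT 5001   /* illustrative */

    SEC("sockops")
    int int_opt_gate(struct bpf_sock_ops *skops)
    {
        switch (skops->op) {
        case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
            /* remote_port is stored in network byte order */
            if (bpf_ntohl(skops->remote_port) == MONITORED_PORT ||
                skops->local_port == MONITORED_PORT)
                bpf_sock_ops_cb_flags_set(skops,
                        skops->bpf_sock_ops_cb_flags |
                        BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
            break;
        }
        return 1;
    }

    char _license[] SEC("license") = "GPL";

The option itself is then reserved and written in the
BPF_SOCK_OPS_HDR_OPT_LEN_CB and BPF_SOCK_OPS_WRITE_HDR_OPT_CB callbacks.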

> Note: The code published now is an alpha release.  We know there are
> bugs.  We know our development team is not what you would call EBPF
> experts (at least not yet), so feel free to point out bugs and/or
> anything that code is doing that might be a bad idea.

Thank you for reaching out. We're here to help with your BPF/XDP needs :)

> Thanks,
> Andy Fingerhut
> Principal Engineer
> Intel Corporation


* Re: How to limit TCP packet lengths given to TC egress EBPF programs?
       [not found]   ` <CABX6iNqj-ojymaPhPtgeOGxtUS6evyrvN69MrLD7s_+Z3xAK+w@mail.gmail.com>
@ 2021-07-16  0:14     ` Alexei Starovoitov
  2021-07-16 20:48       ` Martin KaFai Lau
  0 siblings, 1 reply; 4+ messages in thread
From: Alexei Starovoitov @ 2021-07-16  0:14 UTC (permalink / raw)
  To: Sandesh Dhawaskar Sathyanarayana, Martin KaFai Lau
  Cc: Fingerhut, John Andy, bpf, netdev, Petr Lapukhov,
	Daniel Borkmann, Toke Høiland-Jørgensen

On Thu, Jul 15, 2021 at 4:26 PM Sandesh Dhawaskar Sathyanarayana
<Sandesh.DhawaskarSathyanarayana@colorado.edu> wrote:
>
> Hi ,

Please do not top post and do not use html in your replies.

> I tested the new TCP experimental headers as INT headers in TCP options. But this does not work.
> The programmable switch looks for the INT header after the first 20 bytes of the TCP header. If it finds INT, then it just appends its own INT data by parsing the INT field in TCP; else it appends its own INT header with data after those 20 bytes, and if any TCP option is present it will append that after the INT.
> Now if we use the TCP options field in the end host as INT fields, the switch treats the TCP header options as INT and appends just the data. Now that the switch has consumed the TCP options as INT data, it will not find TCP options to append after it puts in its INT data; as a result the packets will be dropped in the switch.
>
> Hence we need a new way to create INT header space in the TCP kernel stack itself.
>
>
> Here is what I did:
>
> 1. Reserved the space in the TCP header option using BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> 2. Used the TC-eBPF at egress to write INT header in this field.

Hard to guess without looking at the actual code,
but sounds like you did bpf_reserve_hdr_opt() as sockops program,
but didn't do bpf_store_hdr_opt() in BPF_SOCK_OPS_WRITE_HDR_OPT_CB ?
and instead tried to write it in TC layer?
That won't work of course.
Please see progs/test_tcp_hdr_options.c example.
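
For reference, a minimal sketch of that flow (the option kind, ExID and
payload here are illustrative assumptions, not the Host INT format), for
sockets where BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG has been enabled:

    /* Illustrative sketch only; progs/test_tcp_hdr_options.c is the
     * real example.  Reserve space in HDR_OPT_LEN_CB, write it in
     * WRITE_HDR_OPT_CB; the stack then fixes up doff and the checksum.
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct int_opt {
        __u8  kind;   /* 254 = experimental (RFC 6994) */
        __u8  len;    /* total option length, incl. kind and len */
        __u16 exid;   /* experiment ID, illustrative */
        __u32 data;   /* placeholder for telemetry */
    } __attribute__((packed));

    SEC("sockops")
    int int_hdr_opt(struct bpf_sock_ops *skops)
    {
        struct int_opt opt = {
            .kind = 254,
            .len  = sizeof(opt),
            .exid = bpf_htons(0xeb9f),   /* illustrative magic */
            .data = 0,                   /* telemetry would go here */
        };

        switch (skops->op) {
        case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
            /* Ask the stack to leave room for the option. */
            bpf_reserve_hdr_opt(skops, sizeof(opt), 0);
            break;
        case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
            /* Write the option into the reserved space. */
            bpf_store_hdr_opt(skops, &opt, sizeof(opt), 0);
            break;
        }
        return 1;
    }

    char _license[] SEC("license") = "GPL";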

cc-ing Martin for further questions.

> But these packets get dropped at the switch as the TCP length doesn't match.
>
> Also, another challenge in appending the INT in the end host at TC-eBPF (currently no support for TCP) is that it breaks the TCP SYN and ACK mechanism, resulting in spurious retransmissions, as the kernel is not aware of the data appended by TC-eBPF at egress.
>
> If anyone has suggestions please do let me know. Currently I am just thinking of creating the space in the kernel TCP stack itself when sk_buff is allocated.
>
> Thanks
> Sandesh

* Re: How to limit TCP packet lengths given to TC egress EBPF programs?
  2021-07-16  0:14     ` Alexei Starovoitov
@ 2021-07-16 20:48       ` Martin KaFai Lau
  0 siblings, 0 replies; 4+ messages in thread
From: Martin KaFai Lau @ 2021-07-16 20:48 UTC (permalink / raw)
  To: Sandesh Dhawaskar Sathyanarayana
  Cc: Alexei Starovoitov, Fingerhut, John Andy, bpf, netdev,
	Petr Lapukhov, Daniel Borkmann, Toke Høiland-Jørgensen,
	Prashanth Kannan

On Thu, Jul 15, 2021 at 05:14:45PM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 15, 2021 at 4:26 PM Sandesh Dhawaskar Sathyanarayana
> <Sandesh.DhawaskarSathyanarayana@colorado.edu> wrote:
> >
> > Hi ,
> 
> Please do not top post and do not use html in your replies.
> 
> > I tested the new TCP experimental headers as INT headers in TCP options. But this does not work.
> > The programmable switch looks for the INT header after the first 20 bytes of the TCP header. If it finds INT, then it just appends its own INT data by parsing the INT field in TCP; else it appends its own INT header with data after those 20 bytes, and if any TCP option is present it will append that after the INT.
Does it mean the switch can only look for INT at a fixed 20-byte offset?

> > Now if we use the TCP options field in the end host as INT fields, the switch treats the TCP header options as INT and appends just the data. Now that the switch has consumed the TCP options as INT data, it will not find TCP options to append after it puts in its INT data; as a result the packets will be dropped in the switch.
This part I still don't understand after parsing a few times.

Does it mean the switch has an issue when "INT" is put under a proper
TCP header option like (kind, length, INT)?

> >
> > Hence we need a new way to create INT header space in the TCP kernel stack itself.
> >
> >
> > Here is what I did:
> >
> > 1. Reserved the space in the TCP header option using BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> > 2. Used the TC-eBPF at egress to write INT header in this field.
> 
> Hard to guess without looking at the actual code,
> but sounds like you did bpf_reserve_hdr_opt() as sockops program,
> but didn't do bpf_store_hdr_opt() in BPF_SOCK_OPS_WRITE_HDR_OPT_CB ?
> and instead tried to write it in TC layer?
> That won't work of course.
> Please see progs/test_tcp_hdr_options.c example.
> 
> cc-ing Martin for further questions.
> 
> > But these packets get dropped at the switch as the TCP length doesn't match.
Why does the length not match?  Which box changed the header option
without changing the data offset in the TCP header?

> >
> > Also, another challenge in appending the INT in the end host at TC-eBPF (currently no support for TCP) is that it breaks the TCP SYN and ACK mechanism, resulting in spurious retransmissions, as the kernel is not aware of the data appended by TC-eBPF at egress.
hmm... so this TC-eBPF approach is a separate attempt unrelated to
the tcp@BPF_SOCK_OPS_HDR_OPT_LEN_CB above, right?

And what caused the TCP SYN to get dropped?  The TCP data offset (th->doff)?

I think we are not on the same page.  Maybe start with these questions:
1. Please share the bpf code handling BPF_SOCK_OPS_HDR_OPT_LEN_CB.
2. Exactly how you want the TCP header and the INT header option to look
   for an outgoing packet from the sender end-host.  To be practical,
   let's assume there is always a timestamp option.
   Please spell out the value in th->doff (data offset) and
   also which header option "kind" you have used to store the "INT".
   It seems you have used the experimental kind (254)?
   (A purely illustrative example follows after these questions.)
3. When the switch sees the "INT" header option, how does it
   append its data, and how will the TCP header look at the end?
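
For example, purely for illustration (an assumed layout, not a claim
about your setup): with the usual timestamp option plus one 8-byte
experimental option carrying the INT data, the header could be

    20 bytes  base TCP header
     2 bytes  NOP, NOP
    10 bytes  Timestamps option (kind 8)
     8 bytes  experimental option: kind 254, len 8, 2-byte ExID, 4-byte INT
    --------
    40 bytes total  ->  th->doff = 40 / 4 = 10

If a switch then inserts more INT bytes, it has to pad them to a multiple
of 4 and bump both th->doff and the IP total length, otherwise the
receiver sees a length mismatch.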

