Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

From: Dave Taht <dave.taht@gmail.com>
To: Ivan Babrou <ivan@cloudflare.com>
Cc: bpf <bpf@vger.kernel.org>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>,
	Eric Dumazet <edumazet@google.com>
Subject: Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt
Date: Sat, 15 Jan 2022 08:46:52 -0800
Message-ID: <CAA93jw435mThYcBA_7Sf1Z6W_bZrLuK8FLHw8AgAwg0+3y6PBw@mail.gmail.com>
In-Reply-To: <CABWYdi1p=rRQM3oySw2+N+mcrUq3bXA5MXm8cHmC3=qfCU5SDA@mail.gmail.com>

On Fri, Jan 14, 2022 at 2:21 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Thu, Jan 13, 2022 at 9:44 PM Dave Taht <dave.taht@gmail.com> wrote:
> > Yes, but with the caveats below. I'm fine with you just saying round trips,
> > and making this api possible.
> >
> > It would comfort me further if you could provide an actual scenario.
>
> The actual scenario is getting a response as quickly as possible on a
> fresh connection across long distances (200ms+ RTT). If an RPC
> response doesn't fit into the initial 64k of rcv_ssthresh, we end up
> requiring more roundtrips to receive the response. Some customers are
> very picky about the latency they measure and cutting the extra
> roundtrips made a very visible difference in the tests.
>
> > See also:
> >
> > https://datatracker.ietf.org/doc/html/rfc6928
> >
> > which predates packet pacing (are you using sch_fq?)
>
> We are using fq and bbr.
>
> > > Congestion window is a learned property, not a static number. You
> > > won't get a large initcwnd towards a poor connection.
> >
> > initcwnd is set globally or on a per route basis.

As I said, retaining window state learned from an existing connection
is OK. I think arbitrarily declaring a window like this for a new
connection is not.

> With TCP_BPF_IW the world is your oyster.

The oyster has to cohabit this ocean with all the other life there. I
would be comforted if your customer also tracked various other TCP_INFO
statistics, like RTT growth, loss, marks, and retransmits, and was aware
not just of the self-harm inflicted but also of the collateral damage.

In fact, I really wish more people were instrumenting everything with
that. Of late we've seen a lot of need for TCP_NOTSENT_LOWAT in things
like Apache Traffic Server in containers. A simple one-line patch for a
widely used app I can't talk about did wonders for throughput and
responsiveness as actually perceived by the end user.

Measuring from the receiver is far, far more important than measuring
from the sender. So is collecting long-term statistics over many
connections in the real world. I hope y'all have been instrumenting
your work as well as Google has on these fronts.
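
For anyone wanting to start on that kind of instrumentation, here is a
minimal sketch, assuming Linux and Python's standard socket module. It
decodes only the long-stable leading part of struct tcp_info (RTT,
rcv_ssthresh, cwnd, lost and retransmitted segments) and sets
TCP_NOTSENT_LOWAT on a socket. The helper names (parse_tcp_info,
read_tcp_info, set_notsent_lowat) and the 16 KiB threshold are
illustrative assumptions, not taken from any patch discussed in this
thread. Newer counters, such as the CE-mark counts, live further into
the struct and are not decoded here.

```python
import socket
import struct

# Linux option numbers; fall back to the numeric values if this Python
# build does not export the names.
TCP_INFO = getattr(socket, "TCP_INFO", 11)
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

# Leading part of struct tcp_info: 8 one-byte fields, then 24 u32 fields.
_TCP_INFO_PREFIX = struct.Struct("=8B24I")  # 104 bytes

_U8_FIELDS = (
    "state", "ca_state", "retransmits", "probes",
    "backoff", "options", "wscale", "flags",
)
_U32_FIELDS = (
    "rto", "ato", "snd_mss", "rcv_mss", "unacked", "sacked", "lost",
    "retrans", "fackets", "last_data_sent", "last_ack_sent",
    "last_data_recv", "last_ack_recv", "pmtu", "rcv_ssthresh", "rtt",
    "rttvar", "snd_ssthresh", "snd_cwnd", "advmss", "reordering",
    "rcv_rtt", "rcv_space", "total_retrans",
)


def parse_tcp_info(raw):
    """Decode the leading fields of a raw tcp_info buffer into a dict.

    rtt and rttvar are in microseconds; lost, retrans and total_retrans
    are segment counts.
    """
    if len(raw) < _TCP_INFO_PREFIX.size:
        raise ValueError("tcp_info buffer too short: %d bytes" % len(raw))
    values = _TCP_INFO_PREFIX.unpack_from(raw)
    return dict(zip(_U8_FIELDS + _U32_FIELDS, values))


def read_tcp_info(sock):
    """Fetch and decode TCP_INFO for a connected TCP socket (Linux)."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, TCP_INFO, _TCP_INFO_PREFIX.size)
    return parse_tcp_info(raw)


def set_notsent_lowat(sock, nbytes=16384):
    """Report the socket writable only while unsent data is below nbytes.

    The 16 KiB default is an arbitrary illustrative value.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, nbytes)
```

Calling read_tcp_info(sock) periodically on the receiving side and
logging rtt, rcv_ssthresh, lost, and total_retrans over many connections
gives a rough picture of RTT growth and loss over time.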

I know that I'm getting old and crunchy and scarred by seeing so many
(corporate wifi mostly) networks over the last decade essentially in
congestion collapse!

https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-but-its-not-over-yet/

I'm very happy with how well sch_fq + packet pacing works to mitigate
impulses like this, as well as with so many other things like BBR and
BQL, but pacing out != pacing in, and despite my fervent wish for more
FQ+AQM techniques on more bottleneck links, we're not there yet.

I like very much that BPF is allowing rapid innovation, but with great
power comes great responsibility.
-- 
I tried to build a better future, a few times:
https://wayforward.archive.org/?site=https%3A%2F%2Fwww.icei.org

Dave Täht CEO, TekLibre, LLC