Re: [Patch bpf v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb

From: John Fastabend <john.fastabend@gmail.com>
To: Jakub Sitnicki <jakub@cloudflare.com>,
	Cong Wang <xiyou.wangcong@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>, Cong Wang <cong.wang@bytedance.com>,
	Jiang Wang <jiang.wang@bytedance.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Lorenz Bauer <lmb@cloudflare.com>
Subject: Re: [Patch bpf v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb
Date: Mon, 05 Jul 2021 09:24:43 -0700	[thread overview]
Message-ID: <60e3324bbc766_20ea208ce@john-XPS-13-9370.notmuch> (raw)
In-Reply-To: <8735strwwg.fsf@cloudflare.com>

Jakub Sitnicki wrote:
> On Sun, Jul 04, 2021 at 09:53 PM CEST, Cong Wang wrote:
> > On Sat, Jul 3, 2021 at 10:52 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> When running with just the verdict prog attached, the -EIO error from
> >> sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
> >> maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.
> >>
> >> tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
> >> on sk_receive_queue.
> >
> > Are you sure?
> >
> > When recv_actor() returns 0, the while loop breaks:
> >
> > 1661                         used = recv_actor(desc, skb, offset, len);
> > 1662                         if (used <= 0) {
> > 1663                                 if (!copied)
> > 1664                                         copied = used;
> > 1665                                 break;
> >
> > then it calls sk_eat_skb() a few lines after the loop:
> > ...
> > 1690                 sk_eat_skb(sk, skb);
> 
> This sk_eat_skb is still within the loop:
> 
> 1636:int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> 1637-		  sk_read_actor_t recv_actor)
> 1638-{
> 	…
> 1643-	int copied = 0;
>         …
> 1647-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> 1648-		if (offset < skb->len) {
> 			…
> 1661-			used = recv_actor(desc, skb, offset, len);
> 1662-			if (used <= 0) {
> 1663-				if (!copied)
> 1664-					copied = used;
> 1665-				break;
> 1666-			} else if (used <= len) {
> 1667-				seq += used;
> 1668-				copied += used;
> 1669-				offset += used;
> 1670-			}
> 			…
> 1684-		}
> 		…
> 1690-		sk_eat_skb(sk, skb);
> 		…
> 1694-	}
> 	…
> 1699-	/* Clean up data we have read: This will do ACK frames. */
> 1700-	if (copied > 0) {
> 1701-		tcp_recv_skb(sk, seq, &offset);
> 1702-		tcp_cleanup_rbuf(sk, copied);
> 1703-	}
> 1704-	return copied;
> 1705-}
> 
> sk_eat_skb could get called by tcp_recv_skb → sk_eat_skb if recv_actor
> returned > 0 (the case when we have parser attached).
> 
> >
> >>
> >>   sk->sk_data_ready
> >>     sk_psock_verdict_data_ready
> >>       ->read_sock(..., sk_psock_verdict_recv)
> >>         tcp_read_sock (used = copied = 0)
> >>           sk_psock_verdict_recv -> ret = 0
> >>             sk_psock_verdict_apply -> -EIO
> >>               sk_psock_skb_redirect -> -EIO
> >>
> >> However, I think this gets us stuck. What if no more data gets queued,
> >> and sk_data_ready doesn't get called again?

Agree looks like it will be stuck. I found a similar stuck queue
in the skmsg backlog case for this I'm testing changing our workqueue
over to a delayed workqueue and then calling it again after some
delay.

We could do the same here I think. Return 0 to stop the tcp_read_sock
as you show. Then the data is still on the sk_receive_queue and
memory should still be accounted for. Solving the memory overrun
on the dest socket. Then we kick a workqueue item that will call
data_ready at some point in the future. And we do a backoff so
we don't keep hitting it repeatedly. I think this should work and
I have the patches testing now for doing it on the backlog paths.

> >
> > I think it is dropped by sk_eat_skb() in TCP case and of course the
> > sender will retransmit it after detecting this loss. So from this point of
> > view, there is no difference between drops due to overlimit and drops
> > due to eBPF program policy.
> 
> I'm not sure the retransmit will happen.
> 
> We update tp->rcv_nxt (tcp_rcv_nxt_update) when skb gets pushed onto
> sk_receive_queue in either:
> 
>  - tcp_rcv_established -> tcp_queue_rcv, or
>  - tcp_rcv_established -> tcp_data_queue -> tcp_queue_rcv
> 
> ... and schedule ACK (tcp_event_data_recv) to be sent.

Right, this is what we've observed before when we did have a few drops
on error cases. And then some applications will break and user will
send bugs and we have to fix them. We can't unintentionally drop TCP
packets. A policy drop though is different in this case we want to
break the session. FWIW I'm considering adding a reset socket helper
so we can do this cleanly and tear the socket down.

> 
> Say we are in quickack mode, then
> tcp_ack_snd_check()/__tcp_ack_snd_check() would cause ACK to be sent
> out.

Yep.

For the stream parser case. I propose we stop the stream parser so it
wont pull in more data. Then requeue the skb we don't have memory for.
Finally use same delayed workqueue to start up the stream parser again.

It requires some patches to get working, but this should get us the
correct handling for TCP. And avoids drops in UDP stack which may
or may not be useful.

I'll try to flush my queue of patches tomorrow its a holiday today
here. With those I think above becomes not so difficult.

.John