RE: [Patch bpf v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb

From: John Fastabend <john.fastabend@gmail.com>
To: Cong Wang <xiyou.wangcong@gmail.com>, netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, Cong Wang <cong.wang@bytedance.com>,
	Jiang Wang <jiang.wang@bytedance.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Lorenz Bauer <lmb@cloudflare.com>,
	Jakub Sitnicki <jakub@cloudflare.com>
Subject: RE: [Patch bpf v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb
Date: Thu, 01 Jul 2021 09:23:29 -0700	[thread overview]
Message-ID: <60ddec01c651b_3fe24208dc@john-XPS-13-9370.notmuch> (raw)
In-Reply-To: <20210701061656.34150-1-xiyou.wangcong@gmail.com>

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Jiang observed OOM frequently when testing our AF_UNIX/UDP
> proxy. This is due to the fact that we do not actually limit
> the socket memory before queueing skb to ingress_skb. We
> charge the skb memory later when handling the psock backlog,
> but it is not limited either.

Right, its not limiting but charging it should push back on
the stack so it stops feeding skbs to us. Maybe this doesn't
happen in UDP side?

> 
> This patch adds checks for sk->sk_rcvbuf right before queuing
> to ingress_skb and drops packets if this limit exceeds. This
> is very similar to UDP receive path. Ideally we should set the
> skb owner before this check too, but it is hard to make TCP
> happy about sk_forward_alloc.

But it breaks TCP side see below.

> 
> Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  net/core/skmsg.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 9b6160a191f8..a5185c781332 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
>  		return -EIO;
>  	}
>  	spin_lock_bh(&psock_other->ingress_lock);
> -	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
> +	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
> +	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
>  		spin_unlock_bh(&psock_other->ingress_lock);
>  		skb_bpf_redirect_clear(skb);
>  		sock_drop(from->sk, skb);
> @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
>  		}
>  		if (err < 0) {
>  			spin_lock_bh(&psock->ingress_lock);
> -			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> +			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
> +			    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
>  				skb_queue_tail(&psock->ingress_skb, skb);

We can't just drop the packet in the memory overrun case here. This will
break TCP because the data will be gone and no one will retransmit.

Thats why in the current scheme on redirect we can push back when we
move it to the other queues ingress message queue or redirect into
the other socket via send.

At one point I considered charging the data sitting in the ingress_skb?
Would that solve the problem here? I think it would cause the enqueue
at the UDP to start dropping packets from __udp_enqueue_schedule_skb()?

>  				schedule_work(&psock->work);
>  				err = 0;
> -- 
> 2.27.0
>