Re: [Patch bpf v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb

From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	Cong Wang <cong.wang@bytedance.com>,
	Jiang Wang <jiang.wang@bytedance.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Lorenz Bauer <lmb@cloudflare.com>
Subject: Re: [Patch bpf v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb
Date: Sat, 03 Jul 2021 19:52:35 +0200	[thread overview]
Message-ID: <875yxrs2sc.fsf@cloudflare.com> (raw)
In-Reply-To: <60ddec01c651b_3fe24208dc@john-XPS-13-9370.notmuch>

On Thu, Jul 01, 2021 at 06:23 PM CEST, John Fastabend wrote:

[...]

>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 9b6160a191f8..a5185c781332 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
>>  		return -EIO;
>>  	}
>>  	spin_lock_bh(&psock_other->ingress_lock);
>> -	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
>> +	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
>> +	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
>>  		spin_unlock_bh(&psock_other->ingress_lock);
>>  		skb_bpf_redirect_clear(skb);
>>  		sock_drop(from->sk, skb);
>> @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
>>  		}
>>  		if (err < 0) {
>>  			spin_lock_bh(&psock->ingress_lock);
>> -			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>> +			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
>> +			    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
>>  				skb_queue_tail(&psock->ingress_skb, skb);
>
> We can't just drop the packet in the memory overrun case here. This will
> break TCP because the data will be gone and no one will retransmit.

I don't think it's always the case that data will be gone. But you're
right that it breaks TCP. I was too quick to Ack this patch.

When running with just the verdict prog attached, the -EIO error from
sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.

tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
on sk_receive_queue.

  sk->sk_data_ready
    sk_psock_verdict_data_ready
      ->read_sock(..., sk_psock_verdict_recv)
        tcp_read_sock (used = copied = 0)
          sk_psock_verdict_recv -> ret = 0
            sk_psock_verdict_apply -> -EIO
              sk_psock_skb_redirect -> -EIO

However, I think this gets us stuck. What if no more data gets queued,
and sk_data_ready doesn't get called again?

Then there is the case when a parser prog is attached. In this case the
skb is really gone if we drop it on redirect.

In sk_psock_strp_read, we ignore the -EIO error from
sk_psock_verdict_apply, and return to tcp_read_sock how many bytes have
been parsed.

  sk->sk_data_ready
    sk_psock_verdict_data_ready
      ->read_sock(..., sk_psock_verdict_recv)
        tcp_read_sock (used = copied = eaten)
          strp_recv -> ret = eaten
            __strp_recv -> ret = eaten
              strp->cb.rcv_msg -> -EIO
                sk_psock_verdict_apply -> -EIO
                  sk_psock_redirect -> -EIO

Maybe we could put the skb back on strp->skb_head list on error, though?

But again some notification would need to trigger a re-read, or we are
stuck.