From: Martin KaFai Lau <martin.lau@linux.dev>
To: Aditi Ghag <aditi.ghag@isovalent.com>
Cc: sdf@google.com, edumazet@google.com,
	Martin KaFai Lau <martin.lau@kernel.org>,
	bpf@vger.kernel.org
Subject: Re: [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
Date: Tue, 28 Mar 2023 14:33:55 -0700	[thread overview]
Message-ID: <d03ee937-5c78-1c7a-c487-4fae508627aa@linux.dev> (raw)
In-Reply-To: <FF565E79-7C76-4525-8835-931146319F21@isovalent.com>

On 3/28/23 10:06 AM, Aditi Ghag wrote:
>>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>> +{
>>> +	struct bpf_udp_iter_state *iter = seq->private;
>>> +	struct udp_iter_state *state = &iter->state;
>>> +	struct net *net = seq_file_net(seq);
>>> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
>>> +	struct udp_table *udptable;
>>> +	struct sock *first_sk = NULL;
>>> +	struct sock *sk;
>>> +	unsigned int bucket_sks = 0;
>>> +	bool resized = false;
>>> +	int offset = 0;
>>> +	int new_offset;
>>> +
>>> +	/* The current batch is done, so advance the bucket. */
>>> +	if (iter->st_bucket_done) {
>>> +		state->bucket++;
>>> +		state->offset = 0;
>>> +	}
>>> +
>>> +	udptable = udp_get_table_afinfo(afinfo, net);
>>> +
>>> +	if (state->bucket > udptable->mask) {
>>> +		state->bucket = 0;
>>> +		state->offset = 0;
>>> +		return NULL;
>>> +	}
>>> +
>>> +again:
>>> +	/* New batch for the next bucket.
>>> +	 * Iterate over the hash table to find a bucket with sockets matching
>>> +	 * the iterator attributes, and return the first matching socket from
>>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>>> +	 * before releasing the bucket lock. This allows BPF programs that are
>>> +	 * called in seq_show to acquire the bucket lock if needed.
>>> +	 */
>>> +	iter->cur_sk = 0;
>>> +	iter->end_sk = 0;
>>> +	iter->st_bucket_done = false;
>>> +	first_sk = NULL;
>>> +	bucket_sks = 0;
>>> +	offset = state->offset;
>>> +	new_offset = offset;
>>> +
>>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>>> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];
>>
>> Use udptable->hash2, which is hashed by addr and port. It will help to get a smaller batch. This was the comment given in v2.
> 
> I thought I replied to your review comment, but looks like I didn't. My bad!
> I did give it a shot, but I'll need to understand better how udptable->hash2 is populated. When I swapped hash with hash2, there were no sockets to iterate. Am I missing something obvious?

Take a look at udp_lib_lport_inuse2() on how it iterates.
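
For reference, a rough (untested and uncompiled) sketch of such a walk, modeled on udp_lib_lport_inuse2(). The hash2 chains are linked through skc_portaddr_node rather than the regular sk node, so a plain sk_for_each() over hslot2->head finds nothing; udp_portaddr_for_each_entry() has to be used instead:

	struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
	struct sock *sk;

	spin_lock_bh(&hslot2->lock);
	udp_portaddr_for_each_entry(sk, &hslot2->head) {
		if (seq_sk_match(seq, sk)) {
			/* batch sk here, the same way
			 * bpf_iter_udp_batch() does for hash
			 */
		}
	}
	spin_unlock_bh(&hslot2->lock);

That may explain why swapping hash with hash2 alone showed no sockets.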

> 
>>
>>> +
>>> +		if (hlist_empty(&hslot->head)) {
>>> +			offset = 0;
>>> +			continue;
>>> +		}
>>> +
>>> +		spin_lock_bh(&hslot->lock);
>>> +		/* Resume from the last saved position in a bucket before
>>> +		 * iterator was stopped.
>>> +		 */
>>> +		while (offset-- > 0) {
>>> +			sk_for_each(sk, &hslot->head)
>>> +				continue;
>>> +		}
>>
>> hmm... how does the above while loop and the sk_for_each loop actually work? Each pass of the while loop walks the whole hslot->head and discards the result.
>>
>>> +		sk_for_each(sk, &hslot->head) {
>>
>> This starts from the beginning of hslot->head again, which doesn't look right either.
>>
>> Am I missing something here?
>>
>>> +			if (seq_sk_match(seq, sk)) {
>>> +				if (!first_sk)
>>> +					first_sk = sk;
>>> +				if (iter->end_sk < iter->max_sk) {
>>> +					sock_hold(sk);
>>> +					iter->batch[iter->end_sk++] = sk;
>>> +				}
>>> +				bucket_sks++;
>>> +			}
>>> +			new_offset++;
>>
>> And this new_offset is incremented outside of seq_sk_match(), so it is not counting sockets of the seq_file_net(seq) netns alone.
> 
> This logic to resume the iterator is indeed buggy! I was trying to account for cases where the current bucket could have been updated after we released the bucket lock.
> This is what I intended to do -
> 
> +loop:
>                  sk_for_each(sk, &hslot->head) {
>                          if (seq_sk_match(seq, sk)) {
> +                               /* Resume from the last saved position in the
> +                                * bucket before iterator was stopped.
> +                                */
> +                               while (offset && offset-- > 0)
> +                                       goto loop;

Still does not look right: isn't it merely a loop decrementing offset one at a time and then going back to the beginning of hslot->head?

A quick (untested and uncompiled) thought:

				/* Skip the first 'offset' number of sk
				 * without putting them in the iter->batch[].
				 */
				if (offset) {
					offset--;
					continue;
				}
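
Putting it together (still untested), the skip would sit inside the seq_sk_match() branch, so that offset only counts sockets of the seq_file_net(seq) netns:

		sk_for_each(sk, &hslot->head) {
			if (seq_sk_match(seq, sk)) {
				/* Skip the first 'offset' matching sk;
				 * they were iterated in the previous
				 * round before the iterator stopped.
				 */
				if (offset) {
					offset--;
					continue;
				}
				if (!first_sk)
					first_sk = sk;
				if (iter->end_sk < iter->max_sk) {
					sock_hold(sk);
					iter->batch[iter->end_sk++] = sk;
				}
				bucket_sks++;
			}
		}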

>                                  if (!first_sk)
>                                          first_sk = sk;
>                                  if (iter->end_sk < iter->max_sk) {
> @@ -3245,8 +3244,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>                                          iter->batch[iter->end_sk++] = sk;
>                                  }
>                                  bucket_sks++;
> +                              new_offset++;
>                          }
> 
> This handles the case where sockets that weren't iterated in the previous round got deleted by the time the iterator was resumed. But it's also possible that previously iterated sockets were deleted before the iterator resumed, leaving the saved offset outdated. Ideally, the iterator should be invalidated in that case, but there is no way to track this, is there? Any thoughts?

I would not worry about this in-between update case. A race will happen anyway whenever the bucket lock is released. It should also be very unlikely once hash2 is used, since a hash2 bucket is keyed by addr and port and thus holds far fewer sockets.



Thread overview: 29+ messages
2023-03-23 20:06 [PATCH v4 bpf-next 0/5] bpf-nex: Add socket destroy capability Aditi Ghag
2023-03-23 20:06 ` [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator Aditi Ghag
2023-03-24 21:56   ` Stanislav Fomichev
2023-03-27 15:52     ` Aditi Ghag
2023-03-27 16:52       ` Stanislav Fomichev
2023-03-27 22:28   ` Martin KaFai Lau
2023-03-28 17:06     ` Aditi Ghag
2023-03-28 21:33       ` Martin KaFai Lau [this message]
2023-03-29 16:20         ` Aditi Ghag
2023-03-23 20:06 ` [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
2023-03-23 23:58   ` Martin KaFai Lau
2023-03-24 21:37   ` Stanislav Fomichev
2023-03-30 14:42     ` Aditi Ghag
2023-03-30 16:32       ` Stanislav Fomichev
2023-03-30 17:30         ` Martin KaFai Lau
2023-04-03 15:58           ` Aditi Ghag
2023-03-23 20:06 ` [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator Aditi Ghag
2023-03-24 21:45   ` Stanislav Fomichev
2023-03-28 15:20     ` Aditi Ghag
2023-03-27 22:34   ` Martin KaFai Lau
2023-03-23 20:06 ` [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
2023-03-24 21:52   ` Stanislav Fomichev
2023-03-27 15:57     ` Aditi Ghag
2023-03-27 16:54       ` Stanislav Fomichev
2023-03-28 17:50         ` Aditi Ghag
2023-03-28 18:35           ` Stanislav Fomichev
2023-03-29 23:13             ` Aditi Ghag
2023-03-29 23:25               ` Aditi Ghag
2023-03-29 23:25               ` Stanislav Fomichev
