Re: [PATCH v2 bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos

From: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
To: <kafai@fb.com>
Cc: <ast@kernel.org>, <bpf@vger.kernel.org>, <daniel@iogearbox.net>,
	<edumazet@google.com>, <kernel-team@fb.com>,
	<ncardwell@google.com>, <netdev@vger.kernel.org>,
	<ycheng@google.com>, <yhs@fb.com>
Subject: Re: [PATCH v2 bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos
Date: Thu, 22 Jul 2021 23:16:37 +0900
Message-ID: <20210722141637.68161-1-kuniyu@amazon.co.jp>
In-Reply-To: <20210701200541.1033917-1-kafai@fb.com>

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 1 Jul 2021 13:05:41 -0700
> st->bucket stores the current bucket number.
> st->offset stores the offset within this bucket that is the sk to be
> seq_show().  Thus, st->offset only makes sense within the same
> st->bucket.
> 
> These two variables are an optimization for the common no-lseek case.
> When resuming the seq_file iteration (i.e. seq_start()),
> tcp_seek_last_pos() tries to continue from the st->offset
> at bucket st->bucket.
> 
> However, it is possible that the bucket pointed by st->bucket
> has changed and st->offset may end up skipping the whole st->bucket
> without finding a sk.  In this case, tcp_seek_last_pos() currently
> continues to satisfy the offset condition in the next (and incorrect)
> bucket.  Instead, regardless of the offset value, the first sk of the
> next bucket should be returned.  Thus, "bucket == st->bucket" check is
> added to tcp_seek_last_pos().
> 
> The chance of hitting this is small and the issue is a decade old,
> so targeting for the next tree.

Multiple read()s or lseek()+read() can call tcp_seek_last_pos().

IIUC, the problem happens when the sockets placed before the last shown
socket in the list are closed between some read()s or lseek() and read().

I think there is still a case where bucket is valid but offset is invalid:

  listening_hash[1] -> sk1 -> sk2 -> sk3 -> nulls
  listening_hash[2] -> sk4 -> sk5 -> nulls

  read(/proc/net/tcp)
    ends up with sk2

  close(sk1)

  listening_hash[1] -> sk2 -> sk3 -> nulls
  listening_hash[2] -> sk4 -> sk5 -> nulls

  read(/proc/net/tcp) (resume)
    offset = 2

    listening_get_next() returns sk2

    while (offset--)
      1st loop listening_get_next() returns sk3 (bucket == st->bucket)
      2nd loop listening_get_next() returns sk4 (bucket != st->bucket)

    show() starts from sk4

    sk3 is skipped, but it should be shown.
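
To make the walk above concrete, here is a minimal user-space sketch of it.
It is only a toy model, not the kernel code: struct table, struct cursor,
get_first(), get_next() and seek_last_pos() are hypothetical stand-ins, the
hash table is a small fixed array, and sockets are plain ints (0 plays the
role of "nulls").  The resume loop assumes the "bucket == st->bucket" check
from this patch is already applied.

```c
#include <assert.h>

#define NBUCKETS 3
#define MAXSK    4

/* Toy stand-in for listening_hash[]: each bucket holds socket ids,
 * terminated by 0 (the "nulls" marker in the diagrams above). */
struct table {
	int sk[NBUCKETS][MAXSK + 1];
};

/* Hypothetical iterator cursor: current bucket and position in it. */
struct cursor {
	int bucket;
	int pos;
};

/* Return the first socket at or after c->bucket, or 0 if none is left. */
static int get_first(const struct table *t, struct cursor *c)
{
	for (; c->bucket < NBUCKETS; c->bucket++) {
		if (t->sk[c->bucket][0]) {
			c->pos = 0;
			return t->sk[c->bucket][0];
		}
	}
	return 0;
}

/* Return the socket after the cursor, moving to the next non-empty
 * bucket when the current one is exhausted.  Returns 0 at the end. */
static int get_next(const struct table *t, struct cursor *c)
{
	assert(c->bucket < NBUCKETS);
	if (t->sk[c->bucket][c->pos + 1]) {
		c->pos++;
		return t->sk[c->bucket][c->pos];
	}
	c->bucket++;
	return get_first(t, c);
}

/* Resume from (saved_bucket, offset).  The offset is only applied while
 * the cursor stays in saved_bucket, mirroring the check added by the
 * patch.  Returns the socket that show() would start from. */
static int seek_last_pos(const struct table *t, int saved_bucket, int offset)
{
	struct cursor c = { saved_bucket, 0 };
	int sk = get_first(t, &c);

	while (offset-- && sk && c.bucket == saved_bucket)
		sk = get_next(t, &c);
	return sk;
}
```

With bucket 1 holding {1, 2, 3} and bucket 2 holding {4, 5},
seek_last_pos(&t, 1, 2) returns 3, which is the expected resume point.
After removing socket 1 so that bucket 1 holds {2, 3}, the same call
returns 4: the bucket check is still true when the second step starts,
so sk3 is stepped over.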

In listening_get_next(), we can check whether we passed through sk2, but
this does not work well if sk2 itself is closed... then there is no way to
check whether the offset is valid.

Handling this may be too much, though.  What do you think?