* [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt()
From: Martin KaFai Lau @ 2022-07-27  6:08 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

The code in bpf_setsockopt() is mostly copied and pasted from
sock_setsockopt(), do_tcp_setsockopt(), do_ipv6_setsockopt(),
and do_ip_setsockopt().  As the list of allowed optnames in
bpf_setsockopt() grows, so does the duplicated code.  The copies
have also slowly drifted apart.

This set is an effort to clean this up and reuse the existing
{sock,do_tcp,do_ipv6,do_ip}_setsockopt() as much as possible.

After the cleanup, this set also adds a few optnames that we
need to bpf_setsockopt().

The initial attempt was to clean up both bpf_setsockopt() and
bpf_getsockopt() together.  However, the patch set was getting
too long, so it is better to leave bpf_getsockopt() for another
patch set.  Thus, this set focuses on bpf_setsockopt().

Martin KaFai Lau (14):
  net: Change sock_setsockopt from taking sock ptr to sk ptr
  bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  bpf: net: Consider optval.is_bpf before capable check in
    sock_setsockopt()
  bpf: net: Avoid do_tcp_setsockopt() taking sk lock when called from
    bpf
  bpf: net: Avoid do_ip_setsockopt() taking sk lock when called from bpf
  bpf: net: Avoid do_ipv6_setsockopt() taking sk lock when called from
    bpf
  bpf: Embed kernel CONFIG check into the if statement in bpf_setsockopt
  bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sock_setsockopt()
  bpf: Refactor bpf specific tcp optnames to a new function
  bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
  bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()
  bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt()
  bpf: Add a few optnames to bpf_setsockopt
  selftests/bpf: bpf_setsockopt tests

 drivers/nvme/host/tcp.c                       |   2 +-
 fs/ksmbd/transport_tcp.c                      |   2 +-
 include/linux/sockptr.h                       |   8 +-
 include/net/ip.h                              |   2 +
 include/net/ipv6.h                            |   2 +
 include/net/ipv6_stubs.h                      |   2 +
 include/net/sock.h                            |  14 +-
 include/net/tcp.h                             |   2 +
 net/core/filter.c                             | 378 +++++-------
 net/core/sock.c                               |  25 +-
 net/ipv4/ip_sockglue.c                        |  10 +-
 net/ipv4/tcp.c                                |  21 +-
 net/ipv6/af_inet6.c                           |   1 +
 net/ipv6/ipv6_sockglue.c                      |  10 +-
 net/mptcp/sockopt.c                           |  12 +-
 net/socket.c                                  |   2 +-
 .../selftests/bpf/prog_tests/setget_sockopt.c | 125 ++++
 .../selftests/bpf/progs/setget_sockopt.c      | 538 ++++++++++++++++++
 18 files changed, 890 insertions(+), 266 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
 create mode 100644 tools/testing/selftests/bpf/progs/setget_sockopt.c

-- 
2.30.2


* [PATCH bpf-next 01/14] net: Change sock_setsockopt from taking sock ptr to sk ptr
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

A later patch refactors bpf_setsockopt(SOL_SOCKET) to reuse
sock_setsockopt() in order to avoid code duplication and code
drift between the two copies.

The current sock_setsockopt() takes a sock ptr as the argument.
The very first thing this function does is get back the sk ptr
with 'sk = sock->sk'.

bpf_setsockopt() could be called when the sk does not have
a userspace owner, meaning sk->sk_socket is NULL.  For example,
when a passive tcp connection has just been established.  Thus,
it cannot call sock_setsockopt(sk->sk_socket) or it would pass
a NULL sock ptr.

All existing callers have both a sock->sk and an sk->sk_socket
pointer.  Thus, this patch changes sock_setsockopt() to take a
sk ptr instead of the sock ptr.  The bpf_setsockopt() only allows
optnames that do not require a sock ptr.
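
As an illustration, a kernel-side caller that only has the sk ptr
can then do something like the following sketch (bpf_sk_set_mark()
is a hypothetical helper, not part of this set):

	static int bpf_sk_set_mark(struct sock *sk, int mark)
	{
		sockptr_t optval = KERNEL_SOCKPTR(&mark);

		/* sk->sk_socket may be NULL here (e.g. a freshly
		 * established passive connection), so sock_setsockopt()
		 * must not be handed sk->sk_socket.
		 */
		return sock_setsockopt(sk, SOL_SOCKET, SO_MARK,
				       optval, sizeof(mark));
	}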

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/nvme/host/tcp.c  |  2 +-
 fs/ksmbd/transport_tcp.c |  2 +-
 include/net/sock.h       |  2 +-
 net/core/sock.c          |  4 ++--
 net/mptcp/sockopt.c      | 12 ++++++------
 net/socket.c             |  2 +-
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 7a9e6ffa2342..60e14cc39e49 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1555,7 +1555,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 		char *iface = nctrl->opts->host_iface;
 		sockptr_t optval = KERNEL_SOCKPTR(iface);
 
-		ret = sock_setsockopt(queue->sock, SOL_SOCKET, SO_BINDTODEVICE,
+		ret = sock_setsockopt(queue->sock->sk, SOL_SOCKET, SO_BINDTODEVICE,
 				      optval, strlen(iface));
 		if (ret) {
 			dev_err(nctrl->device,
diff --git a/fs/ksmbd/transport_tcp.c b/fs/ksmbd/transport_tcp.c
index 143bba4e4db8..982eed2dd575 100644
--- a/fs/ksmbd/transport_tcp.c
+++ b/fs/ksmbd/transport_tcp.c
@@ -420,7 +420,7 @@ static int create_socket(struct interface *iface)
 	ksmbd_tcp_nodelay(ksmbd_socket);
 	ksmbd_tcp_reuseaddr(ksmbd_socket);
 
-	ret = sock_setsockopt(ksmbd_socket,
+	ret = sock_setsockopt(ksmbd_socket->sk,
 			      SOL_SOCKET,
 			      SO_BINDTODEVICE,
 			      KERNEL_SOCKPTR(iface->name),
diff --git a/include/net/sock.h b/include/net/sock.h
index f7ad1a7705e9..9e2539dcc293 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1795,7 +1795,7 @@ void sock_pfree(struct sk_buff *skb);
 #define sock_edemux sock_efree
 #endif
 
-int sock_setsockopt(struct socket *sock, int level, int op,
+int sock_setsockopt(struct sock *sk, int level, int op,
 		    sockptr_t optval, unsigned int optlen);
 
 int sock_getsockopt(struct socket *sock, int level, int op,
diff --git a/net/core/sock.c b/net/core/sock.c
index 4cb957d934a2..18bb4f269cf1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1041,12 +1041,12 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
  *	at the socket level. Everything here is generic.
  */
 
-int sock_setsockopt(struct socket *sock, int level, int optname,
+int sock_setsockopt(struct sock *sk, int level, int optname,
 		    sockptr_t optval, unsigned int optlen)
 {
 	struct so_timestamping timestamping;
+	struct socket *sock = sk->sk_socket;
 	struct sock_txtime sk_txtime;
-	struct sock *sk = sock->sk;
 	int val;
 	int valbool;
 	struct linger ling;
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 423d3826ca1e..5684499b4d39 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -124,7 +124,7 @@ static int mptcp_sol_socket_intval(struct mptcp_sock *msk, int optname, int val)
 	struct sock *sk = (struct sock *)msk;
 	int ret;
 
-	ret = sock_setsockopt(sk->sk_socket, SOL_SOCKET, optname,
+	ret = sock_setsockopt(sk, SOL_SOCKET, optname,
 			      optval, sizeof(val));
 	if (ret)
 		return ret;
@@ -149,7 +149,7 @@ static int mptcp_setsockopt_sol_socket_tstamp(struct mptcp_sock *msk, int optnam
 	struct sock *sk = (struct sock *)msk;
 	int ret;
 
-	ret = sock_setsockopt(sk->sk_socket, SOL_SOCKET, optname,
+	ret = sock_setsockopt(sk, SOL_SOCKET, optname,
 			      optval, sizeof(val));
 	if (ret)
 		return ret;
@@ -225,7 +225,7 @@ static int mptcp_setsockopt_sol_socket_timestamping(struct mptcp_sock *msk,
 		return -EINVAL;
 	}
 
-	ret = sock_setsockopt(sk->sk_socket, SOL_SOCKET, optname,
+	ret = sock_setsockopt(sk, SOL_SOCKET, optname,
 			      KERNEL_SOCKPTR(&timestamping),
 			      sizeof(timestamping));
 	if (ret)
@@ -262,7 +262,7 @@ static int mptcp_setsockopt_sol_socket_linger(struct mptcp_sock *msk, sockptr_t
 		return -EFAULT;
 
 	kopt = KERNEL_SOCKPTR(&ling);
-	ret = sock_setsockopt(sk->sk_socket, SOL_SOCKET, SO_LINGER, kopt, sizeof(ling));
+	ret = sock_setsockopt(sk, SOL_SOCKET, SO_LINGER, kopt, sizeof(ling));
 	if (ret)
 		return ret;
 
@@ -306,7 +306,7 @@ static int mptcp_setsockopt_sol_socket(struct mptcp_sock *msk, int optname,
 			return -EINVAL;
 		}
 
-		ret = sock_setsockopt(ssock, SOL_SOCKET, optname, optval, optlen);
+		ret = sock_setsockopt(ssock->sk, SOL_SOCKET, optname, optval, optlen);
 		if (ret == 0) {
 			if (optname == SO_REUSEPORT)
 				sk->sk_reuseport = ssock->sk->sk_reuseport;
@@ -349,7 +349,7 @@ static int mptcp_setsockopt_sol_socket(struct mptcp_sock *msk, int optname,
 	case SO_PREFER_BUSY_POLL:
 	case SO_BUSY_POLL_BUDGET:
 		/* No need to copy: only relevant for msk */
-		return sock_setsockopt(sk->sk_socket, SOL_SOCKET, optname, optval, optlen);
+		return sock_setsockopt(sk, SOL_SOCKET, optname, optval, optlen);
 	case SO_NO_CHECK:
 	case SO_DONTROUTE:
 	case SO_BROADCAST:
diff --git a/net/socket.c b/net/socket.c
index b6bd4cf44d3f..c6911d613ae2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2245,7 +2245,7 @@ int __sys_setsockopt(int fd, int level, int optname, char __user *user_optval,
 	if (kernel_optval)
 		optval = KERNEL_SOCKPTR(kernel_optval);
 	if (level == SOL_SOCKET && !sock_use_custom_sol_socket(sock))
-		err = sock_setsockopt(sock, level, optname, optval, optlen);
+		err = sock_setsockopt(sock->sk, level, optname, optval, optlen);
 	else if (unlikely(!sock->ops->setsockopt))
 		err = -EOPNOTSUPP;
 	else
-- 
2.30.2


* [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
sock_setsockopt().  The number of supported optnames keeps
increasing, and so does the duplicated code.

One issue in reusing sock_setsockopt() is that the bpf prog
has already acquired the sk lock.  sockptr_t is useful for
handling this: it already has an 'is_kernel' bit for the
kernel-or-user memory copy.  This patch adds an 'is_bpf' bit to
tell whether the sk lock has already been taken by the bpf prog.

{lock,release}_sock_sockopt() helpers are added for
sock_setsockopt() to use.  These helpers handle the new
'is_bpf' bit.

Note the change in sock_setbindtodevice(): lock_sock_sockopt()
is now done in sock_setbindtodevice() instead of taking the lock
in sock_bindtoindex(..., lock_sk = true).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/sockptr.h |  8 +++++++-
 include/net/sock.h      | 12 ++++++++++++
 net/core/sock.c         |  8 +++++---
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/linux/sockptr.h b/include/linux/sockptr.h
index d45902fb4cad..787d0be08fb7 100644
--- a/include/linux/sockptr.h
+++ b/include/linux/sockptr.h
@@ -16,7 +16,8 @@ typedef struct {
 		void		*kernel;
 		void __user	*user;
 	};
-	bool		is_kernel : 1;
+	bool		is_kernel : 1,
+			is_bpf : 1;
 } sockptr_t;
 
 static inline bool sockptr_is_kernel(sockptr_t sockptr)
@@ -24,6 +25,11 @@ static inline bool sockptr_is_kernel(sockptr_t sockptr)
 	return sockptr.is_kernel;
 }
 
+static inline sockptr_t KERNEL_SOCKPTR_BPF(void *p)
+{
+	return (sockptr_t) { .kernel = p, .is_kernel = true, .is_bpf = true };
+}
+
 static inline sockptr_t KERNEL_SOCKPTR(void *p)
 {
 	return (sockptr_t) { .kernel = p, .is_kernel = true };
diff --git a/include/net/sock.h b/include/net/sock.h
index 9e2539dcc293..2f913bdf0035 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1721,6 +1721,18 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
 	}
 }
 
+static inline void lock_sock_sockopt(struct sock *sk, sockptr_t optval)
+{
+	if (!optval.is_bpf)
+		lock_sock(sk);
+}
+
+static inline void release_sock_sockopt(struct sock *sk, sockptr_t optval)
+{
+	if (!optval.is_bpf)
+		release_sock(sk);
+}
+
 /* Used by processes to "lock" a socket state, so that
  * interrupts and bottom half handlers won't change it
  * from under us. It essentially blocks any incoming
diff --git a/net/core/sock.c b/net/core/sock.c
index 18bb4f269cf1..61d927a5f6cb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -703,7 +703,9 @@ static int sock_setbindtodevice(struct sock *sk, sockptr_t optval, int optlen)
 			goto out;
 	}
 
-	return sock_bindtoindex(sk, index, true);
+	lock_sock_sockopt(sk, optval);
+	ret = sock_bindtoindex_locked(sk, index);
+	release_sock_sockopt(sk, optval);
 out:
 #endif
 
@@ -1067,7 +1069,7 @@ int sock_setsockopt(struct sock *sk, int level, int optname,
 
 	valbool = val ? 1 : 0;
 
-	lock_sock(sk);
+	lock_sock_sockopt(sk, optval);
 
 	switch (optname) {
 	case SO_DEBUG:
@@ -1496,7 +1498,7 @@ int sock_setsockopt(struct sock *sk, int level, int optname,
 		ret = -ENOPROTOOPT;
 		break;
 	}
-	release_sock(sk);
+	release_sock_sockopt(sk, optval);
 	return ret;
 }
 EXPORT_SYMBOL(sock_setsockopt);
-- 
2.30.2


* [PATCH bpf-next 03/14] bpf: net: Consider optval.is_bpf before capable check in sock_setsockopt()
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

When a bpf program calls bpf_setsockopt(SOL_SOCKET), it may run in
softirq context, where it does not make sense to do the capable
check.  There was a similar situation in bpf_setsockopt(TCP_CONGESTION):
commit 8d650cdedaab ("tcp: fix tcp_set_congestion_control() use from bpf hook")
added the tcp_set_congestion_control(..., cap_net_admin) argument to
skip the cap check for bpf progs.

A similar change is done in this patch for SO_MARK, SO_PRIORITY,
and SO_BINDTO{DEVICE,IFINDEX}, which are the optnames allowed by
bpf_setsockopt(SOL_SOCKET).  This will allow sock_setsockopt()
to be reused by bpf_setsockopt(SOL_SOCKET) in a later patch.
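
For context, a sketch of the kind of bpf prog that reaches this
path from softirq (illustrative only):

	SEC("sockops")
	int set_mark(struct bpf_sock_ops *skops)
	{
		int mark = 42;

		/* Runs from softirq when a passive connection is
		 * established; capable() checks against the current
		 * task make no sense here.
		 */
		if (skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
			bpf_setsockopt(skops, SOL_SOCKET, SO_MARK,
				       &mark, sizeof(mark));
		return 1;
	}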

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/sock.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 61d927a5f6cb..f2c582491d5f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -620,7 +620,7 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
 }
 EXPORT_SYMBOL(sk_dst_check);
 
-static int sock_bindtoindex_locked(struct sock *sk, int ifindex)
+static int sock_bindtoindex_locked(struct sock *sk, int ifindex, bool cap_check)
 {
 	int ret = -ENOPROTOOPT;
 #ifdef CONFIG_NETDEVICES
@@ -628,7 +628,8 @@ static int sock_bindtoindex_locked(struct sock *sk, int ifindex)
 
 	/* Sorry... */
 	ret = -EPERM;
-	if (sk->sk_bound_dev_if && !ns_capable(net->user_ns, CAP_NET_RAW))
+	if (sk->sk_bound_dev_if && cap_check &&
+	    !ns_capable(net->user_ns, CAP_NET_RAW))
 		goto out;
 
 	ret = -EINVAL;
@@ -656,7 +657,7 @@ int sock_bindtoindex(struct sock *sk, int ifindex, bool lock_sk)
 
 	if (lock_sk)
 		lock_sock(sk);
-	ret = sock_bindtoindex_locked(sk, ifindex);
+	ret = sock_bindtoindex_locked(sk, ifindex, true);
 	if (lock_sk)
 		release_sock(sk);
 
@@ -704,7 +705,7 @@ static int sock_setbindtodevice(struct sock *sk, sockptr_t optval, int optlen)
 	}
 
 	lock_sock_sockopt(sk, optval);
-	ret = sock_bindtoindex_locked(sk, index);
+	ret = sock_bindtoindex_locked(sk, index, !optval.is_bpf);
 	release_sock_sockopt(sk, optval);
 out:
 #endif
@@ -1166,6 +1167,7 @@ int sock_setsockopt(struct sock *sk, int level, int optname,
 
 	case SO_PRIORITY:
 		if ((val >= 0 && val <= 6) ||
+		    optval.is_bpf ||
 		    ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) ||
 		    ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 			sk->sk_priority = val;
@@ -1312,7 +1314,8 @@ int sock_setsockopt(struct sock *sk, int level, int optname,
 			clear_bit(SOCK_PASSSEC, &sock->flags);
 		break;
 	case SO_MARK:
-		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
+		if (!optval.is_bpf &&
+		    !ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
 		    !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
@@ -1456,7 +1459,7 @@ int sock_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case SO_BINDTOIFINDEX:
-		ret = sock_bindtoindex_locked(sk, val);
+		ret = sock_bindtoindex_locked(sk, val, !optval.is_bpf);
 		break;
 
 	case SO_BUF_LOCK:
-- 
2.30.2


* [PATCH bpf-next 04/14] bpf: net: Avoid do_tcp_setsockopt() taking sk lock when called from bpf
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

Similar to the earlier patch that avoids taking the sk lock in
sock_setsockopt() when called from bpf, this patch changes
do_tcp_setsockopt() to use {lock,release}_sock_sockopt().

This patch also changes do_tcp_setsockopt() to consider
optval.is_bpf when passing the cap_net_admin arg to
tcp_set_congestion_control(..., cap_net_admin).  This matches
how bpf_setsockopt(TCP_CONGESTION) calls
tcp_set_congestion_control() today.
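
In other words, the call effectively behaves as below (a sketch;
the hunk uses the equivalent "?:" form):

	/* "load" is false for a bpf caller since it may run in a
	 * context where a ca module cannot be loaded;
	 * "cap_net_admin" is passed as true for a bpf caller,
	 * matching the existing bpf_setsockopt(TCP_CONGESTION)
	 * behavior.
	 */
	err = tcp_set_congestion_control(sk, name,
					 !optval.is_bpf,
					 optval.is_bpf ||
					 ns_capable(sock_net(sk)->user_ns,
						    CAP_NET_ADMIN));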

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/tcp.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ba2bdc811374..7f8d81befa8e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3462,11 +3462,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 			return -EFAULT;
 		name[val] = 0;
 
-		lock_sock(sk);
-		err = tcp_set_congestion_control(sk, name, true,
-						 ns_capable(sock_net(sk)->user_ns,
-							    CAP_NET_ADMIN));
-		release_sock(sk);
+		lock_sock_sockopt(sk, optval);
+		err = tcp_set_congestion_control(sk, name, !optval.is_bpf,
+						 optval.is_bpf ?
+						 true : ns_capable(sock_net(sk)->user_ns,
+								   CAP_NET_ADMIN));
+		release_sock_sockopt(sk, optval);
 		return err;
 	}
 	case TCP_ULP: {
@@ -3482,9 +3483,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 			return -EFAULT;
 		name[val] = 0;
 
-		lock_sock(sk);
+		lock_sock_sockopt(sk, optval);
 		err = tcp_set_ulp(sk, name);
-		release_sock(sk);
+		release_sock_sockopt(sk, optval);
 		return err;
 	}
 	case TCP_FASTOPEN_KEY: {
@@ -3517,7 +3518,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 	if (copy_from_sockptr(&val, optval, sizeof(val)))
 		return -EFAULT;
 
-	lock_sock(sk);
+	lock_sock_sockopt(sk, optval);
 
 	switch (optname) {
 	case TCP_MAXSEG:
@@ -3739,7 +3740,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 		break;
 	}
 
-	release_sock(sk);
+	release_sock_sockopt(sk, optval);
 	return err;
 }
 
-- 
2.30.2


* [PATCH bpf-next 05/14] bpf: net: Avoid do_ip_setsockopt() taking sk lock when called from bpf
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

Similar to the earlier patch that avoids taking the sk lock in
sock_setsockopt() when called from bpf, this patch changes
do_ip_setsockopt() to use {lock,release}_sock_sockopt().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/ip_sockglue.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index a8a323ecbb54..8271ad565a3a 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -944,7 +944,7 @@ static int do_ip_setsockopt(struct sock *sk, int level, int optname,
 	err = 0;
 	if (needs_rtnl)
 		rtnl_lock();
-	lock_sock(sk);
+	lock_sock_sockopt(sk, optval);
 
 	switch (optname) {
 	case IP_OPTIONS:
@@ -1368,13 +1368,13 @@ static int do_ip_setsockopt(struct sock *sk, int level, int optname,
 		err = -ENOPROTOOPT;
 		break;
 	}
-	release_sock(sk);
+	release_sock_sockopt(sk, optval);
 	if (needs_rtnl)
 		rtnl_unlock();
 	return err;
 
 e_inval:
-	release_sock(sk);
+	release_sock_sockopt(sk, optval);
 	if (needs_rtnl)
 		rtnl_unlock();
 	return -EINVAL;
-- 
2.30.2


* [PATCH bpf-next 06/14] bpf: net: Avoid do_ipv6_setsockopt() taking sk lock when called from bpf
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

Similar to the earlier patch that avoids taking the sk lock in
sock_setsockopt() when called from bpf, this patch changes
do_ipv6_setsockopt() to use {lock,release}_sock_sockopt().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv6/ipv6_sockglue.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 222f6bf220ba..4559f02ab4a8 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -417,7 +417,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 
 	if (needs_rtnl)
 		rtnl_lock();
-	lock_sock(sk);
+	lock_sock_sockopt(sk, optval);
 
 	switch (optname) {
 
@@ -994,14 +994,14 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		break;
 	}
 
-	release_sock(sk);
+	release_sock_sockopt(sk, optval);
 	if (needs_rtnl)
 		rtnl_unlock();
 
 	return retv;
 
 e_inval:
-	release_sock(sk);
+	release_sock_sockopt(sk, optval);
 	if (needs_rtnl)
 		rtnl_unlock();
 	return -EINVAL;
-- 
2.30.2


* [PATCH bpf-next 07/14] bpf: Embed kernel CONFIG check into the if statement in bpf_setsockopt
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

This patch moves the "#ifdef CONFIG_XXX" check into the "if/else"
statement itself.  The change is done for the bpf_setsockopt()
function only.  It makes the later patches easier to follow,
without the surrounding #ifdef macros.
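
For readers unfamiliar with the idiom, a small sketch (the function
name here is hypothetical):

	/* IS_ENABLED(CONFIG_INET) expands to a compile-time 0 or 1,
	 * so with the config off the compiler sees "if (0 && ...)"
	 * and discards the dead branch, giving the same object code
	 * as the #ifdef while keeping the if/else chain intact.  The
	 * dead branch must still parse, so the symbols it references
	 * need to be declared even when the config is off.
	 */
	if (IS_ENABLED(CONFIG_INET) && level == SOL_IP)
		ret = handle_sol_ip(sk, optname, optval, optlen);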

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 5669248aff25..01cb4a01b1c1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5113,8 +5113,7 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		default:
 			ret = -EINVAL;
 		}
-#ifdef CONFIG_INET
-	} else if (level == SOL_IP) {
+	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
 		if (optlen != sizeof(int) || sk->sk_family != AF_INET)
 			return -EINVAL;
 
@@ -5135,8 +5134,7 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		default:
 			ret = -EINVAL;
 		}
-#if IS_ENABLED(CONFIG_IPV6)
-	} else if (level == SOL_IPV6) {
+	} else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6) {
 		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
 			return -EINVAL;
 
@@ -5157,8 +5155,7 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		default:
 			ret = -EINVAL;
 		}
-#endif
-	} else if (level == SOL_TCP &&
+	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
 		   sk->sk_prot->setsockopt == tcp_setsockopt) {
 		if (optname == TCP_CONGESTION) {
 			char name[TCP_CA_NAME_MAX];
@@ -5250,7 +5247,6 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 				ret = -EINVAL;
 			}
 		}
-#endif
 	} else {
 		ret = -EINVAL;
 	}
-- 
2.30.2


* [PATCH bpf-next 08/14] bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sock_setsockopt()
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

After the prep work in the previous patches,
this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET)
and reuses sock_setsockopt() instead.

The only exception is SO_RCVLOWAT.  bpf_setsockopt() does not
always have the sock ptr (sk->sk_socket), and sock_setsockopt()
needs sock->ops->set_rcvlowat.  tcp_set_rcvlowat() is the only
implementation of set_rcvlowat, and bpf_setsockopt() needs a
special case for SO_RCVLOWAT on tcp_sock, so leave
SO_RCVLOWAT in bpf_setsockopt() for now.
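
For reference, the generic path in sock_setsockopt() looks roughly
like this, and the sock->ops dereference is what a NULL
sk->sk_socket would break:

	case SO_RCVLOWAT:
		if (val < 0)
			val = INT_MAX;
		if (sock->ops->set_rcvlowat)
			ret = sock->ops->set_rcvlowat(sk, val);
		else
			WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
		break;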

The existing optname white-list is refactored into a new
function sol_socket_setsockopt().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 130 ++++++++++++++--------------------------------
 1 file changed, 38 insertions(+), 92 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 01cb4a01b1c1..77a9aa17a1c0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5013,106 +5013,52 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
+static int sol_socket_setsockopt(struct sock *sk, int optname,
+				 char *optval, int optlen)
+{
+	switch (optname) {
+	case SO_SNDBUF:
+	case SO_RCVBUF:
+	case SO_KEEPALIVE:
+	case SO_PRIORITY:
+	case SO_REUSEPORT:
+	case SO_RCVLOWAT:
+	case SO_MARK:
+	case SO_MAX_PACING_RATE:
+	case SO_BINDTOIFINDEX:
+	case SO_TXREHASH:
+		if (optlen != sizeof(int))
+			return -EINVAL;
+		break;
+	case SO_BINDTODEVICE:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (optname == SO_RCVLOWAT) {
+		int val = *(int *)optval;
+
+		if (val < 0)
+			val = INT_MAX;
+		WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
+		return 0;
+	}
+
+	return sock_setsockopt(sk, SOL_SOCKET, optname,
+			       KERNEL_SOCKPTR_BPF(optval), optlen);
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
-	char devname[IFNAMSIZ];
-	int val, valbool;
-	struct net *net;
-	int ifindex;
-	int ret = 0;
+	int val, ret = 0;
 
 	if (!sk_fullsock(sk))
 		return -EINVAL;
 
 	if (level == SOL_SOCKET) {
-		if (optlen != sizeof(int) && optname != SO_BINDTODEVICE)
-			return -EINVAL;
-		val = *((int *)optval);
-		valbool = val ? 1 : 0;
-
-		/* Only some socketops are supported */
-		switch (optname) {
-		case SO_RCVBUF:
-			val = min_t(u32, val, sysctl_rmem_max);
-			val = min_t(int, val, INT_MAX / 2);
-			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-			WRITE_ONCE(sk->sk_rcvbuf,
-				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
-			break;
-		case SO_SNDBUF:
-			val = min_t(u32, val, sysctl_wmem_max);
-			val = min_t(int, val, INT_MAX / 2);
-			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
-			WRITE_ONCE(sk->sk_sndbuf,
-				   max_t(int, val * 2, SOCK_MIN_SNDBUF));
-			break;
-		case SO_MAX_PACING_RATE: /* 32bit version */
-			if (val != ~0U)
-				cmpxchg(&sk->sk_pacing_status,
-					SK_PACING_NONE,
-					SK_PACING_NEEDED);
-			sk->sk_max_pacing_rate = (val == ~0U) ?
-						 ~0UL : (unsigned int)val;
-			sk->sk_pacing_rate = min(sk->sk_pacing_rate,
-						 sk->sk_max_pacing_rate);
-			break;
-		case SO_PRIORITY:
-			sk->sk_priority = val;
-			break;
-		case SO_RCVLOWAT:
-			if (val < 0)
-				val = INT_MAX;
-			WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
-			break;
-		case SO_MARK:
-			if (sk->sk_mark != val) {
-				sk->sk_mark = val;
-				sk_dst_reset(sk);
-			}
-			break;
-		case SO_BINDTODEVICE:
-			optlen = min_t(long, optlen, IFNAMSIZ - 1);
-			strncpy(devname, optval, optlen);
-			devname[optlen] = 0;
-
-			ifindex = 0;
-			if (devname[0] != '\0') {
-				struct net_device *dev;
-
-				ret = -ENODEV;
-
-				net = sock_net(sk);
-				dev = dev_get_by_name(net, devname);
-				if (!dev)
-					break;
-				ifindex = dev->ifindex;
-				dev_put(dev);
-			}
-			fallthrough;
-		case SO_BINDTOIFINDEX:
-			if (optname == SO_BINDTOIFINDEX)
-				ifindex = val;
-			ret = sock_bindtoindex(sk, ifindex, false);
-			break;
-		case SO_KEEPALIVE:
-			if (sk->sk_prot->keepalive)
-				sk->sk_prot->keepalive(sk, valbool);
-			sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
-			break;
-		case SO_REUSEPORT:
-			sk->sk_reuseport = valbool;
-			break;
-		case SO_TXREHASH:
-			if (val < -1 || val > 1) {
-				ret = -EINVAL;
-				break;
-			}
-			sk->sk_txrehash = (u8)val;
-			break;
-		default:
-			ret = -EINVAL;
-		}
+		return sol_socket_setsockopt(sk, optname, optval, optlen);
 	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
 		if (optlen != sizeof(int) || sk->sk_family != AF_INET)
 			return -EINVAL;
-- 
2.30.2


* [PATCH bpf-next 09/14] bpf: Refactor bpf specific tcp optnames to a new function
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

This patch moves all bpf-specific tcp optnames (TCP_BPF_XXX)
to a new function, bpf_sol_tcp_setsockopt().  This will make
the next patch easier to follow.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 79 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 29 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 77a9aa17a1c0..8dd195b9b860 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5049,6 +5049,52 @@ static int sol_socket_setsockopt(struct sock *sk, int optname,
 			       KERNEL_SOCKPTR_BPF(optval), optlen);
 }
 
+static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
+				  char *optval, int optlen)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	unsigned long timeout;
+	int val;
+
+	if (optlen != sizeof(int))
+		return -EINVAL;
+
+	val = *(int *)optval;
+
+	/* Only some options are supported */
+	switch (optname) {
+	case TCP_BPF_IW:
+		if (val <= 0 || tp->data_segs_out > tp->syn_data)
+			return -EINVAL;
+		tcp_snd_cwnd_set(tp, val);
+		break;
+	case TCP_BPF_SNDCWND_CLAMP:
+		if (val <= 0)
+			return -EINVAL;
+		tp->snd_cwnd_clamp = val;
+		tp->snd_ssthresh = val;
+		break;
+	case TCP_BPF_DELACK_MAX:
+		timeout = usecs_to_jiffies(val);
+		if (timeout > TCP_DELACK_MAX ||
+		    timeout < TCP_TIMEOUT_MIN)
+			return -EINVAL;
+		inet_csk(sk)->icsk_delack_max = timeout;
+		break;
+	case TCP_BPF_RTO_MIN:
+		timeout = usecs_to_jiffies(val);
+		if (timeout > TCP_RTO_MIN ||
+		    timeout < TCP_TIMEOUT_MIN)
+			return -EINVAL;
+		inet_csk(sk)->icsk_rto_min = timeout;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
@@ -5103,6 +5149,10 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		}
 	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
 		   sk->sk_prot->setsockopt == tcp_setsockopt) {
+		if (optname >= TCP_BPF_IW)
+			return bpf_sol_tcp_setsockopt(sk, optname,
+						      optval, optlen);
+
 		if (optname == TCP_CONGESTION) {
 			char name[TCP_CA_NAME_MAX];
 
@@ -5113,7 +5163,6 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		} else {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 			struct tcp_sock *tp = tcp_sk(sk);
-			unsigned long timeout;
 
 			if (optlen != sizeof(int))
 				return -EINVAL;
@@ -5121,34 +5170,6 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			val = *((int *)optval);
 			/* Only some options are supported */
 			switch (optname) {
-			case TCP_BPF_IW:
-				if (val <= 0 || tp->data_segs_out > tp->syn_data)
-					ret = -EINVAL;
-				else
-					tcp_snd_cwnd_set(tp, val);
-				break;
-			case TCP_BPF_SNDCWND_CLAMP:
-				if (val <= 0) {
-					ret = -EINVAL;
-				} else {
-					tp->snd_cwnd_clamp = val;
-					tp->snd_ssthresh = val;
-				}
-				break;
-			case TCP_BPF_DELACK_MAX:
-				timeout = usecs_to_jiffies(val);
-				if (timeout > TCP_DELACK_MAX ||
-				    timeout < TCP_TIMEOUT_MIN)
-					return -EINVAL;
-				inet_csk(sk)->icsk_delack_max = timeout;
-				break;
-			case TCP_BPF_RTO_MIN:
-				timeout = usecs_to_jiffies(val);
-				if (timeout > TCP_RTO_MIN ||
-				    timeout < TCP_TIMEOUT_MIN)
-					return -EINVAL;
-				inet_csk(sk)->icsk_rto_min = timeout;
-				break;
 			case TCP_SAVE_SYN:
 				if (val < 0 || val > 1)
 					ret = -EINVAL;
-- 
2.30.2


* [PATCH bpf-next 10/14] bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
From: Martin KaFai Lau @ 2022-07-27  6:09 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

After the prep work in the previous patches,
this patch removes all the dup code from bpf_setsockopt(SOL_TCP)
and reuses do_tcp_setsockopt() instead.

The existing optname white-list is refactored into a new
function sol_tcp_setsockopt().  sol_tcp_setsockopt() also
calls bpf_sol_tcp_setsockopt() to handle the bpf-specific
TCP_BPF_XXX optnames.

bpf_setsockopt(TCP_SAVE_SYN) now also allows the value 2 to
save the eth header as well, and this comes for free from
do_tcp_setsockopt().
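
An illustrative sketch from a sockops prog (skops is the
bpf_sock_ops ctx):

	int save_syn = 2;	/* 1: save the SYN; 2: also keep
				 * the eth header
				 */

	bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
		       &save_syn, sizeof(save_syn));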

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/tcp.h |  2 +
 net/core/filter.c | 97 +++++++++++++++--------------------------------
 net/ipv4/tcp.c    |  2 +-
 3 files changed, 33 insertions(+), 68 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f9e7c85ea829..06b63a807c33 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -405,6 +405,8 @@ __poll_t tcp_poll(struct file *file, struct socket *sock,
 int tcp_getsockopt(struct sock *sk, int level, int optname,
 		   char __user *optval, int __user *optlen);
 bool tcp_bpf_bypass_getsockopt(int level, int optname);
+int do_tcp_setsockopt(struct sock *sk, int level, int optname,
+		      sockptr_t optval, unsigned int optlen);
 int tcp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		   unsigned int optlen);
 void tcp_set_keepalive(struct sock *sk, int val);
diff --git a/net/core/filter.c b/net/core/filter.c
index 8dd195b9b860..97aed6575810 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5095,6 +5095,34 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
 	return 0;
 }
 
+static int sol_tcp_setsockopt(struct sock *sk, int optname,
+			      char *optval, int optlen)
+{
+	if (sk->sk_prot->setsockopt != tcp_setsockopt)
+		return -EINVAL;
+
+	switch (optname) {
+	case TCP_KEEPIDLE:
+	case TCP_KEEPINTVL:
+	case TCP_KEEPCNT:
+	case TCP_SYNCNT:
+	case TCP_WINDOW_CLAMP:
+	case TCP_USER_TIMEOUT:
+	case TCP_NOTSENT_LOWAT:
+	case TCP_SAVE_SYN:
+		if (optlen != sizeof(int))
+			return -EINVAL;
+		break;
+	case TCP_CONGESTION:
+		break;
+	default:
+		return bpf_sol_tcp_setsockopt(sk, optname, optval, optlen);
+	}
+
+	return do_tcp_setsockopt(sk, SOL_TCP, optname,
+				 KERNEL_SOCKPTR_BPF(optval), optlen);
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
@@ -5147,73 +5175,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		default:
 			ret = -EINVAL;
 		}
-	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
-		   sk->sk_prot->setsockopt == tcp_setsockopt) {
-		if (optname >= TCP_BPF_IW)
-			return bpf_sol_tcp_setsockopt(sk, optname,
-						      optval, optlen);
-
-		if (optname == TCP_CONGESTION) {
-			char name[TCP_CA_NAME_MAX];
-
-			strncpy(name, optval, min_t(long, optlen,
-						    TCP_CA_NAME_MAX-1));
-			name[TCP_CA_NAME_MAX-1] = 0;
-			ret = tcp_set_congestion_control(sk, name, false, true);
-		} else {
-			struct inet_connection_sock *icsk = inet_csk(sk);
-			struct tcp_sock *tp = tcp_sk(sk);
-
-			if (optlen != sizeof(int))
-				return -EINVAL;
-
-			val = *((int *)optval);
-			/* Only some options are supported */
-			switch (optname) {
-			case TCP_SAVE_SYN:
-				if (val < 0 || val > 1)
-					ret = -EINVAL;
-				else
-					tp->save_syn = val;
-				break;
-			case TCP_KEEPIDLE:
-				ret = tcp_sock_set_keepidle_locked(sk, val);
-				break;
-			case TCP_KEEPINTVL:
-				if (val < 1 || val > MAX_TCP_KEEPINTVL)
-					ret = -EINVAL;
-				else
-					tp->keepalive_intvl = val * HZ;
-				break;
-			case TCP_KEEPCNT:
-				if (val < 1 || val > MAX_TCP_KEEPCNT)
-					ret = -EINVAL;
-				else
-					tp->keepalive_probes = val;
-				break;
-			case TCP_SYNCNT:
-				if (val < 1 || val > MAX_TCP_SYNCNT)
-					ret = -EINVAL;
-				else
-					icsk->icsk_syn_retries = val;
-				break;
-			case TCP_USER_TIMEOUT:
-				if (val < 0)
-					ret = -EINVAL;
-				else
-					icsk->icsk_user_timeout = val;
-				break;
-			case TCP_NOTSENT_LOWAT:
-				tp->notsent_lowat = val;
-				sk->sk_write_space(sk);
-				break;
-			case TCP_WINDOW_CLAMP:
-				ret = tcp_set_window_clamp(sk, val);
-				break;
-			default:
-				ret = -EINVAL;
-			}
-		}
+	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP) {
+		return sol_tcp_setsockopt(sk, optname, optval, optlen);
 	} else {
 		ret = -EINVAL;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7f8d81befa8e..5a327a0e1af9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3439,7 +3439,7 @@ int tcp_set_window_clamp(struct sock *sk, int val)
 /*
  *	Socket option code for TCP.
  */
-static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
+int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 		sockptr_t optval, unsigned int optlen)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-- 
2.30.2


* [PATCH bpf-next 11/14] bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()
From: Martin KaFai Lau @ 2022-07-27  6:10 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

After the prep work in the previous patches,
this patch removes the dup code from bpf_setsockopt(SOL_IP)
and reuses the implementation in do_ip_setsockopt().

The existing optname white-list is refactored into a new
function sol_ip_setsockopt().

The current bpf_setsockopt(IP_TOS) does not take
INET_ECN_MASK into account for tcp.  do_ip_setsockopt(IP_TOS)
handles it correctly.
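
For reference, the tos/ecn handling in question looks roughly like
this in __ip_sock_set_tos():

	if (sk->sk_type == SOCK_STREAM) {
		val &= ~INET_ECN_MASK;
		val |= inet_sk(sk)->tos & INET_ECN_MASK;
	}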

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/ip.h       |  2 ++
 net/core/filter.c      | 40 ++++++++++++++++++++--------------------
 net/ipv4/ip_sockglue.c |  4 ++--
 3 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 1c979fd1904c..34fa5b0f0a0e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -743,6 +743,8 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sock *sk,
 int ip_cmsg_send(struct sock *sk, struct msghdr *msg,
 		 struct ipcm_cookie *ipc, bool allow_ipv6);
 DECLARE_STATIC_KEY_FALSE(ip4_min_ttl);
+int do_ip_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
+		     unsigned int optlen);
 int ip_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		  unsigned int optlen);
 int ip_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
diff --git a/net/core/filter.c b/net/core/filter.c
index 97aed6575810..67c87d7acb23 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5123,6 +5123,25 @@ static int sol_tcp_setsockopt(struct sock *sk, int optname,
 				 KERNEL_SOCKPTR_BPF(optval), optlen);
 }
 
+static int sol_ip_setsockopt(struct sock *sk, int optname,
+			     char *optval, int optlen)
+{
+	if (sk->sk_family != AF_INET)
+		return -EINVAL;
+
+	switch (optname) {
+	case IP_TOS:
+		if (optlen != sizeof(int))
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return do_ip_setsockopt(sk, SOL_IP, optname,
+				KERNEL_SOCKPTR_BPF(optval), optlen);
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
@@ -5134,26 +5153,7 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 	if (level == SOL_SOCKET) {
 		return sol_socket_setsockopt(sk, optname, optval, optlen);
 	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
-		if (optlen != sizeof(int) || sk->sk_family != AF_INET)
-			return -EINVAL;
-
-		val = *((int *)optval);
-		/* Only some options are supported */
-		switch (optname) {
-		case IP_TOS:
-			if (val < -1 || val > 0xff) {
-				ret = -EINVAL;
-			} else {
-				struct inet_sock *inet = inet_sk(sk);
-
-				if (val == -1)
-					val = 0;
-				inet->tos = val;
-			}
-			break;
-		default:
-			ret = -EINVAL;
-		}
+		return sol_ip_setsockopt(sk, optname, optval, optlen);
 	} else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6) {
 		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
 			return -EINVAL;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 8271ad565a3a..e70572142d99 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -888,8 +888,8 @@ static int compat_ip_mcast_join_leave(struct sock *sk, int optname,
 
 DEFINE_STATIC_KEY_FALSE(ip4_min_ttl);
 
-static int do_ip_setsockopt(struct sock *sk, int level, int optname,
-		sockptr_t optval, unsigned int optlen)
+int do_ip_setsockopt(struct sock *sk, int level, int optname,
+		     sockptr_t optval, unsigned int optlen)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct net *net = sock_net(sk);
-- 
2.30.2


* [PATCH bpf-next 12/14] bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt()
From: Martin KaFai Lau @ 2022-07-27  6:10 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

After the prep work in the previous patches,
this patch removes the dup code from bpf_setsockopt(SOL_IPV6)
and reuses the implementation in do_ipv6_setsockopt().

ipv6 can be compiled as a module.  Like other code that solves
this with stubs in ipv6_stubs.h, this patch adds do_ipv6_setsockopt
to the ipv6_bpf_stub.
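
A rough sketch of the stub indirection, mirroring the hunks below:

	/* Built-in code (net/core/filter.c) cannot call into the
	 * ipv6 module directly, so it goes through a function ptr:
	 */
	return ipv6_bpf_stub->ipv6_setsockopt(sk, SOL_IPV6, optname,
					      KERNEL_SOCKPTR_BPF(optval),
					      optlen);

	/* ...which the ipv6 module fills in at init time: */
	static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
		.ipv6_setsockopt = do_ipv6_setsockopt,
	};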

The current bpf_setsockopt(IPV6_TCLASS) does not take
INET_ECN_MASK into account for tcp.  do_ipv6_setsockopt(IPV6_TCLASS)
handles it correctly.

The existing optname white-list is refactored into a new
function sol_ipv6_setsockopt().

After this last SOL_IPV6 dup-code removal, __bpf_setsockopt()
is simplified enough that the extra "{ }" around the if statements
can be removed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/ipv6.h       |  2 ++
 include/net/ipv6_stubs.h |  2 ++
 net/core/filter.c        | 57 +++++++++++++++++++---------------------
 net/ipv6/af_inet6.c      |  1 +
 net/ipv6/ipv6_sockglue.c |  4 +--
 5 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index de9dcc5652c4..c110d9032083 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1156,6 +1156,8 @@ struct in6_addr *fl6_update_dst(struct flowi6 *fl6,
  */
 DECLARE_STATIC_KEY_FALSE(ip6_min_hopcount);
 
+int do_ipv6_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
+		       unsigned int optlen);
 int ipv6_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		    unsigned int optlen);
 int ipv6_getsockopt(struct sock *sk, int level, int optname,
diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h
index 45e0339be6fa..8692698b01cf 100644
--- a/include/net/ipv6_stubs.h
+++ b/include/net/ipv6_stubs.h
@@ -81,6 +81,8 @@ struct ipv6_bpf_stub {
 				     const struct in6_addr *daddr, __be16 dport,
 				     int dif, int sdif, struct udp_table *tbl,
 				     struct sk_buff *skb);
+	int (*ipv6_setsockopt)(struct sock *sk, int level, int optname,
+			       sockptr_t optval, unsigned int optlen);
 };
 extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 67c87d7acb23..7b510e009bb3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5142,45 +5142,42 @@ static int sol_ip_setsockopt(struct sock *sk, int optname,
 				KERNEL_SOCKPTR_BPF(optval), optlen);
 }
 
+static int sol_ipv6_setsockopt(struct sock *sk, int optname,
+			       char *optval, int optlen)
+{
+	if (sk->sk_family != AF_INET6)
+		return -EINVAL;
+
+	switch (optname) {
+	case IPV6_TCLASS:
+		if (optlen != sizeof(int))
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ipv6_bpf_stub->ipv6_setsockopt(sk, SOL_IPV6, optname,
+					      KERNEL_SOCKPTR_BPF(optval),
+					      optlen);
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
-	int val, ret = 0;
-
 	if (!sk_fullsock(sk))
 		return -EINVAL;
 
-	if (level == SOL_SOCKET) {
+	if (level == SOL_SOCKET)
 		return sol_socket_setsockopt(sk, optname, optval, optlen);
-	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
+	else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP)
 		return sol_ip_setsockopt(sk, optname, optval, optlen);
-	} else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6) {
-		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
-			return -EINVAL;
-
-		val = *((int *)optval);
-		/* Only some options are supported */
-		switch (optname) {
-		case IPV6_TCLASS:
-			if (val < -1 || val > 0xff) {
-				ret = -EINVAL;
-			} else {
-				struct ipv6_pinfo *np = inet6_sk(sk);
-
-				if (val == -1)
-					val = 0;
-				np->tclass = val;
-			}
-			break;
-		default:
-			ret = -EINVAL;
-		}
-	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP) {
+	else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6)
+		return sol_ipv6_setsockopt(sk, optname, optval, optlen);
+	else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP)
 		return sol_tcp_setsockopt(sk, optname, optval, optlen);
-	} else {
-		ret = -EINVAL;
-	}
-	return ret;
+
+	return -EINVAL;
 }
 
 static int _bpf_setsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 2ce0c44d0081..cadc97852787 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1057,6 +1057,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
 static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
 	.inet6_bind = __inet6_bind,
 	.udp6_lib_lookup = __udp6_lib_lookup,
+	.ipv6_setsockopt = do_ipv6_setsockopt,
 };
 
 static int __init inet6_init(void)
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 4559f02ab4a8..0eef5a11dc3c 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -391,8 +391,8 @@ static int ipv6_set_opt_hdr(struct sock *sk, int optname, sockptr_t optval,
 	return err;
 }
 
-static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
-		   sockptr_t optval, unsigned int optlen)
+int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+		       sockptr_t optval, unsigned int optlen)
 {
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct net *net = sock_net(sk);
-- 
2.30.2


* [PATCH bpf-next 13/14] bpf: Add a few optnames to bpf_setsockopt
From: Martin KaFai Lau @ 2022-07-27  6:10 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

This patch adds a few optnames for bpf_setsockopt:
SO_REUSEADDR, IPV6_AUTOFLOWLABEL, TCP_MAXSEG, TCP_NODELAY,
and TCP_THIN_LINEAR_TIMEOUTS.

Thanks to the previous patches of this set, all additions can reuse
sock_setsockopt(), do_ipv6_setsockopt(), and do_tcp_setsockopt().
The only change here is to allow them in bpf_setsockopt.

A bpf prog has already been able to read all members of a sk by
using a PTR_TO_BTF_ID of the sk, and the optnames added here can
be read back by the same approach.

These optnames can also be added to bpf_getsockopt() later, with
another patch set that makes bpf_getsockopt() reuse
sock_getsockopt(), tcp_getsockopt(), and ip[v6]_getsockopt().
Thus, this patch does not add more duplicated code to
bpf_getsockopt() now.
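
Illustrative usage from a bpf prog (a sketch only; error handling
omitted):

	SEC("sockops")
	int set_new_opts(struct bpf_sock_ops *skops)
	{
		int one = 1;

		/* Two of the newly allowed optnames */
		bpf_setsockopt(skops, SOL_SOCKET, SO_REUSEADDR,
			       &one, sizeof(one));
		bpf_setsockopt(skops, SOL_TCP, TCP_NODELAY,
			       &one, sizeof(one));
		return 1;
	}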

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 7b510e009bb3..899ee7b4a04a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5017,6 +5017,7 @@ static int sol_socket_setsockopt(struct sock *sk, int optname,
 				 char *optval, int optlen)
 {
 	switch (optname) {
+	case SO_REUSEADDR:
 	case SO_SNDBUF:
 	case SO_RCVBUF:
 	case SO_KEEPALIVE:
@@ -5102,11 +5103,14 @@ static int sol_tcp_setsockopt(struct sock *sk, int optname,
 		return -EINVAL;
 
 	switch (optname) {
+	case TCP_NODELAY:
+	case TCP_MAXSEG:
 	case TCP_KEEPIDLE:
 	case TCP_KEEPINTVL:
 	case TCP_KEEPCNT:
 	case TCP_SYNCNT:
 	case TCP_WINDOW_CLAMP:
+	case TCP_THIN_LINEAR_TIMEOUTS:
 	case TCP_USER_TIMEOUT:
 	case TCP_NOTSENT_LOWAT:
 	case TCP_SAVE_SYN:
@@ -5150,6 +5154,7 @@ static int sol_ipv6_setsockopt(struct sock *sk, int optname,
 
 	switch (optname) {
 	case IPV6_TCLASS:
+	case IPV6_AUTOFLOWLABEL:
 		if (optlen != sizeof(int))
 			return -EINVAL;
 		break;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH bpf-next 14/14] selftests/bpf: bpf_setsockopt tests
  2022-07-27  6:08 [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt() Martin KaFai Lau
                   ` (12 preceding siblings ...)
  2022-07-27  6:10 ` [PATCH bpf-next 13/14] bpf: Add a few optnames to bpf_setsockopt Martin KaFai Lau
@ 2022-07-27  6:10 ` Martin KaFai Lau
  2022-07-27 17:14 ` [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt() Jakub Kicinski
  14 siblings, 0 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-27  6:10 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

This patch adds tests to exercise optnames that are allowed
in bpf_setsockopt().
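
As with other bpf selftests, the new test can typically be run via the
test_progs binary, e.g.:

	# from tools/testing/selftests/bpf
	./test_progs -t setget_sockopt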

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../selftests/bpf/prog_tests/setget_sockopt.c | 125 ++++
 .../selftests/bpf/progs/setget_sockopt.c      | 538 ++++++++++++++++++
 2 files changed, 663 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
 create mode 100644 tools/testing/selftests/bpf/progs/setget_sockopt.c

diff --git a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
new file mode 100644
index 000000000000..018611e6b248
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) Meta Platforms, Inc. and affiliates. */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <linux/socket.h>
+#include <net/if.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+#include "setget_sockopt.skel.h"
+
+#define CG_NAME "/setget-sockopt-test"
+
+static const char addr4_str[] = "127.0.0.1";
+static const char addr6_str[] = "::1";
+static struct setget_sockopt *skel;
+static int cg_fd;
+
+static int create_netns(void)
+{
+	if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link add dev binddevtest1 type veth peer name binddevtest2"),
+		       "add veth"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link set dev binddevtest1 up"),
+		       "bring veth up"))
+		return -1;
+
+	return 0;
+}
+
+static void test_tcp(int family)
+{
+	struct setget_sockopt__bss *bss = skel->bss;
+	int sfd, cfd;
+
+	memset(bss, 0, sizeof(*bss));
+
+	sfd = start_server(family, SOCK_STREAM,
+			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
+	if (!ASSERT_GE(sfd, 0, "start_server"))
+		return;
+
+	cfd = connect_to_fd(sfd, 0);
+	if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
+		close(sfd);
+		return;
+	}
+	close(sfd);
+	close(cfd);
+
+	ASSERT_EQ(bss->nr_listen, 1, "nr_listen");
+	ASSERT_EQ(bss->nr_connect, 1, "nr_connect");
+	ASSERT_EQ(bss->nr_active, 1, "nr_active");
+	ASSERT_EQ(bss->nr_passive, 1, "nr_passive");
+	ASSERT_EQ(bss->nr_socket_post_create, 2, "nr_socket_post_create");
+	ASSERT_EQ(bss->nr_binddev, 2, "nr_bind");
+}
+
+static void test_udp(int family)
+{
+	struct setget_sockopt__bss *bss = skel->bss;
+	int sfd;
+
+	memset(bss, 0, sizeof(*bss));
+
+	sfd = start_server(family, SOCK_DGRAM,
+			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
+	if (!ASSERT_GE(sfd, 0, "start_server"))
+		return;
+	close(sfd);
+
+	ASSERT_GE(bss->nr_socket_post_create, 1, "nr_socket_post_create");
+	ASSERT_EQ(bss->nr_binddev, 1, "nr_bind");
+}
+
+void test_setget_sockopt(void)
+{
+	cg_fd = test__join_cgroup(CG_NAME);
+	if (cg_fd < 0)
+		return;
+
+	if (create_netns())
+		goto done;
+
+	skel = setget_sockopt__open();
+	if (!ASSERT_OK_PTR(skel, "open skel"))
+		goto done;
+
+	strcpy(skel->rodata->veth, "binddevtest1");
+	skel->rodata->veth_ifindex = if_nametoindex("binddevtest1");
+	if (!ASSERT_GT(skel->rodata->veth_ifindex, 0, "if_nametoindex"))
+		goto done;
+
+	if (!ASSERT_OK(setget_sockopt__load(skel), "load skel"))
+		goto done;
+
+	skel->links.skops_sockopt =
+		bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
+	if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
+		goto done;
+
+	skel->links.socket_post_create =
+		bpf_program__attach_cgroup(skel->progs.socket_post_create, cg_fd);
+	if (!ASSERT_OK_PTR(skel->links.socket_post_create, "attach_cgroup"))
+		goto done;
+
+	test_tcp(AF_INET6);
+	test_tcp(AF_INET);
+	test_udp(AF_INET6);
+	test_udp(AF_INET);
+
+done:
+	setget_sockopt__destroy(skel);
+	close(cg_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/setget_sockopt.c b/tools/testing/selftests/bpf/progs/setget_sockopt.c
new file mode 100644
index 000000000000..e52b96cf85fb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/setget_sockopt.c
@@ -0,0 +1,538 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) Meta Platforms, Inc. and affiliates. */
+
+#include <stddef.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/in.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
+#include <linux/bpf.h>
+#include <linux/if.h>
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+
+#ifndef SO_TXREHASH
+#define SO_TXREHASH 74
+#endif
+
+#ifndef TCP_NAGLE_OFF
+#define TCP_NAGLE_OFF 1
+#endif
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+extern unsigned long CONFIG_HZ __kconfig;
+
+const volatile char veth[IFNAMSIZ];
+const volatile int veth_ifindex;
+const char cubic_cc[] = "cubic";
+const char reno_cc[] = "reno";
+
+int nr_listen;
+int nr_passive;
+int nr_active;
+int nr_connect;
+int nr_binddev;
+int nr_socket_post_create;
+
+struct sockopt_test {
+	int opt;
+	int new;
+	int restore;
+	int expected;
+	int tcp_expected;
+	int toggle:1;
+};
+
+static const struct sockopt_test sol_socket_tests[] = {
+	{ .opt = SO_SNDBUF, .new = 8123, .expected = 8123 * 2, },
+	{ .opt = SO_RCVBUF, .new = 8123, .expected = 8123 * 2, },
+	{ .opt = SO_KEEPALIVE, .toggle = 1, },
+	{ .opt = SO_PRIORITY, .new = 0xeb9f, .expected = 0xeb9f, },
+	{ .opt = SO_REUSEPORT, .toggle = 1, },
+	{ .opt = SO_RCVLOWAT, .new = 8123, .expected = 8123, },
+	{ .opt = SO_MARK, .new = 0xeb9f, .expected = 0xeb9f, },
+	{ .opt = SO_MAX_PACING_RATE, .new = 0xeb9f, .expected = 0xeb9f, },
+	{ .opt = SO_TXREHASH, .toggle = 1, },
+	{ .opt = 0, },
+};
+
+static const struct sockopt_test sol_tcp_tests[] = {
+	{ .opt = TCP_NODELAY, .toggle = 1, },
+	{ .opt = TCP_MAXSEG, .new = 1314, .expected = 1314, },
+	{ .opt = TCP_KEEPIDLE, .new = 123, .expected = 123, .restore = 321, },
+	{ .opt = TCP_KEEPINTVL, .new = 123, .expected = 123, .restore = 321, },
+	{ .opt = TCP_KEEPCNT, .new = 123, .expected = 123, .restore = 124, },
+	{ .opt = TCP_SYNCNT, .new = 123, .expected = 123, .restore = 124, },
+	{ .opt = TCP_WINDOW_CLAMP, .new = 8123, .expected = 8123, .restore = 8124, },
+	{ .opt = TCP_CONGESTION, },
+	{ .opt = TCP_THIN_LINEAR_TIMEOUTS, .toggle = 1, },
+	{ .opt = TCP_USER_TIMEOUT, .new = 123400, .expected = 123400, },
+	{ .opt = TCP_NOTSENT_LOWAT, .new = 1314, .expected = 1314, },
+	{ .opt = TCP_SAVE_SYN, .new = 1, .expected = 1, },
+	{ .opt = 0, },
+};
+
+static const struct sockopt_test sol_ip_tests[] = {
+	{ .opt = IP_TOS, .new = 0xe1, .expected = 0xe1, .tcp_expected = 0xe0, },
+	{ .opt = 0, },
+};
+
+static const struct sockopt_test sol_ipv6_tests[] = {
+	{ .opt = IPV6_TCLASS, .new = 0xe1, .expected = 0xe1, .tcp_expected = 0xe0, },
+	{ .opt = IPV6_AUTOFLOWLABEL, .toggle = 1, },
+	{ .opt = 0, },
+};
+
+struct sock_common {
+	unsigned short	skc_family;
+	unsigned long	skc_flags;
+} __attribute__((preserve_access_index));
+
+struct sock {
+	struct sock_common	__sk_common;
+	__u16			sk_type;
+	__u16			sk_protocol;
+	int			sk_rcvlowat;
+	__u32			sk_mark;
+	unsigned long		sk_max_pacing_rate;
+	unsigned int		keepalive_time;
+	unsigned int		keepalive_intvl;
+} __attribute__((preserve_access_index));
+
+struct tcp_options_received {
+	__u16 user_mss;
+} __attribute__((preserve_access_index));
+
+struct ipv6_pinfo {
+	__u16			recverr:1,
+				sndflow:1,
+				repflow:1,
+				pmtudisc:3,
+				padding:1,
+				srcprefs:3,
+				dontfrag:1,
+				autoflowlabel:1,
+				autoflowlabel_set:1,
+				mc_all:1,
+				recverr_rfc4884:1,
+				rtalert_isolate:1;
+}  __attribute__((preserve_access_index));
+
+struct inet_sock {
+	/* sk and pinet6 have to be the first two members of inet_sock */
+	struct sock		sk;
+	struct ipv6_pinfo	*pinet6;
+} __attribute__((preserve_access_index));
+
+struct inet_connection_sock {
+	__u32			  icsk_user_timeout;
+	__u8			  icsk_syn_retries;
+} __attribute__((preserve_access_index));
+
+struct tcp_sock {
+	struct inet_connection_sock	inet_conn;
+	struct tcp_options_received rx_opt;
+	__u8	save_syn:2,
+		syn_data:1,
+		syn_fastopen:1,
+		syn_fastopen_exp:1,
+		syn_fastopen_ch:1,
+		syn_data_acked:1,
+		is_cwnd_limited:1;
+	__u32	window_clamp;
+	__u8	nonagle     : 4,
+		thin_lto    : 1,
+		recvmsg_inq : 1,
+		repair      : 1,
+		frto        : 1;
+	__u32	notsent_lowat;
+	__u8	keepalive_probes;
+	unsigned int		keepalive_time;
+	unsigned int		keepalive_intvl;
+} __attribute__((preserve_access_index));
+
+struct socket {
+	struct sock *sk;
+} __attribute__((preserve_access_index));
+
+struct loop_ctx {
+	void *ctx;
+	struct sock *sk;
+};
+
+static int __bpf_getsockopt(void *ctx, struct sock *sk,
+			    int level, int opt, int *optval,
+			    int optlen)
+{
+	if (level == SOL_SOCKET) {
+		switch (opt) {
+		case SO_KEEPALIVE:
+			*optval = !!(sk->__sk_common.skc_flags & (1UL << 3));
+			break;
+		case SO_RCVLOWAT:
+			*optval = sk->sk_rcvlowat;
+			break;
+		case SO_MARK:
+			*optval = sk->sk_mark;
+			break;
+		case SO_MAX_PACING_RATE:
+			*optval = sk->sk_max_pacing_rate;
+			break;
+		default:
+			return bpf_getsockopt(ctx, level, opt, optval, optlen);
+		}
+		return 0;
+	}
+
+	if (level == IPPROTO_TCP) {
+		struct tcp_sock *tp = bpf_skc_to_tcp_sock(sk);
+
+		if (!tp)
+			return -1;
+
+		switch (opt) {
+		case TCP_NODELAY:
+			*optval = !!(tp->nonagle & TCP_NAGLE_OFF);
+			break;
+		case TCP_MAXSEG:
+			*optval = tp->rx_opt.user_mss;
+			break;
+		case TCP_KEEPIDLE:
+			*optval = tp->keepalive_time / CONFIG_HZ;
+			break;
+		case TCP_SYNCNT:
+			*optval = tp->inet_conn.icsk_syn_retries;
+			break;
+		case TCP_KEEPINTVL:
+			*optval = tp->keepalive_intvl / CONFIG_HZ;
+			break;
+		case TCP_KEEPCNT:
+			*optval = tp->keepalive_probes;
+			break;
+		case TCP_WINDOW_CLAMP:
+			*optval = tp->window_clamp;
+			break;
+		case TCP_THIN_LINEAR_TIMEOUTS:
+			*optval = tp->thin_lto;
+			break;
+		case TCP_USER_TIMEOUT:
+			*optval = tp->inet_conn.icsk_user_timeout;
+			break;
+		case TCP_NOTSENT_LOWAT:
+			*optval = tp->notsent_lowat;
+			break;
+		case TCP_SAVE_SYN:
+			*optval = tp->save_syn;
+			break;
+		default:
+			return bpf_getsockopt(ctx, level, opt, optval, optlen);
+		}
+		return 0;
+	}
+
+	if (level == IPPROTO_IPV6) {
+		switch (opt) {
+		case IPV6_AUTOFLOWLABEL: {
+			__u16 proto = sk->sk_protocol;
+			struct inet_sock *inet_sk;
+
+			if (proto == IPPROTO_TCP)
+				inet_sk = (struct inet_sock *)bpf_skc_to_tcp_sock(sk);
+			else
+				inet_sk = (struct inet_sock *)bpf_skc_to_udp6_sock(sk);
+
+			if (!inet_sk)
+				return -1;
+
+			*optval = !!inet_sk->pinet6->autoflowlabel;
+			break;
+		}
+		default:
+			return bpf_getsockopt(ctx, level, opt, optval, optlen);
+		}
+		return 0;
+	}
+
+	return bpf_getsockopt(ctx, level, opt, optval, optlen);
+}
+
+static int bpf_test_sockopt_flip(void *ctx, struct sock *sk,
+				 const struct sockopt_test *t,
+				 int level)
+{
+	int old, tmp, new, opt = t->opt;
+
+	/* flip the value, verify it reads back flipped, then restore it */
+
+	if (__bpf_getsockopt(ctx, sk, level, opt, &old, sizeof(old)))
+		return 1;
+	/* the kernel initializes txrehash to 255 by default */
+	if (level == SOL_SOCKET && opt == SO_TXREHASH && old != 0 && old != 1)
+		old = 1;
+
+	new = !old;
+	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
+		return 1;
+	if (__bpf_getsockopt(ctx, sk, level, opt, &tmp, sizeof(tmp)) ||
+	    tmp != new)
+		return 1;
+
+	if (bpf_setsockopt(ctx, level, opt, &old, sizeof(old)))
+		return 1;
+
+	return 0;
+}
+
+static int bpf_test_sockopt_int(void *ctx, struct sock *sk,
+				const struct sockopt_test *t,
+				int level)
+{
+	int old, tmp, new, expected, opt;
+
+	opt = t->opt;
+	new = t->new;
+	if (sk->sk_type == SOCK_STREAM && t->tcp_expected)
+		expected = t->tcp_expected;
+	else
+		expected = t->expected;
+
+	if (__bpf_getsockopt(ctx, sk, level, opt, &old, sizeof(old)) ||
+	    old == new)
+		return 1;
+
+	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
+		return 1;
+	if (__bpf_getsockopt(ctx, sk, level, opt, &tmp, sizeof(tmp)) ||
+	    tmp != expected)
+		return 1;
+
+	if (t->restore)
+		old = t->restore;
+	if (bpf_setsockopt(ctx, level, opt, &old, sizeof(old)))
+		return 1;
+
+	return 0;
+}
+
+static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+
+	if (i >= ARRAY_SIZE(sol_socket_tests))
+		return 1;
+
+	t = &sol_socket_tests[i];
+	if (!t->opt)
+		return 1;
+
+	if (t->toggle)
+		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, SOL_SOCKET);
+
+	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
+}
+
+static int bpf_test_ip_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+
+	if (i >= ARRAY_SIZE(sol_ip_tests))
+		return 1;
+
+	t = &sol_ip_tests[i];
+	if (!t->opt)
+		return 1;
+
+	if (t->toggle)
+		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, IPPROTO_IP);
+
+	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, IPPROTO_IP);
+}
+
+static int bpf_test_ipv6_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+
+	if (i >= ARRAY_SIZE(sol_ipv6_tests))
+		return 1;
+
+	t = &sol_ipv6_tests[i];
+	if (!t->opt)
+		return 1;
+
+	if (t->toggle)
+		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, IPPROTO_IPV6);
+
+	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, IPPROTO_IPV6);
+}
+
+static int bpf_test_tcp_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+	struct sock *sk;
+	void *ctx;
+
+	if (i >= ARRAY_SIZE(sol_tcp_tests))
+		return 1;
+
+	t = &sol_tcp_tests[i];
+	if (!t->opt)
+		return 1;
+
+	ctx = lc->ctx;
+	sk = lc->sk;
+
+	if (t->opt == TCP_CONGESTION) {
+		char old_cc[16], tmp_cc[16];
+		const char *new_cc;
+
+		if (bpf_getsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, old_cc, sizeof(old_cc)))
+			return 1;
+		if (!bpf_strncmp(old_cc, sizeof(old_cc), cubic_cc))
+			new_cc = reno_cc;
+		else
+			new_cc = cubic_cc;
+		if (bpf_setsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, (void *)new_cc,
+				   sizeof(new_cc)))
+			return 1;
+		if (bpf_getsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, tmp_cc, sizeof(tmp_cc)))
+			return 1;
+		if (bpf_strncmp(tmp_cc, sizeof(tmp_cc), new_cc))
+			return 1;
+		if (bpf_setsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, old_cc, sizeof(old_cc)))
+			return 1;
+		return 0;
+	}
+
+	if (t->toggle)
+		return bpf_test_sockopt_flip(ctx, sk, t, IPPROTO_TCP);
+
+	return bpf_test_sockopt_int(ctx, sk, t, IPPROTO_TCP);
+}
+
+static int bpf_test_sockopt(void *ctx, struct sock *sk)
+{
+	struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
+	__u16 family, proto;
+	int n;
+
+	family = sk->__sk_common.skc_family;
+	proto = sk->sk_protocol;
+
+	n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
+	if (n != ARRAY_SIZE(sol_socket_tests))
+		return -1;
+
+	if (proto == IPPROTO_TCP) {
+		n = bpf_loop(ARRAY_SIZE(sol_tcp_tests), bpf_test_tcp_sockopt, &lc, 0);
+		if (n != ARRAY_SIZE(sol_tcp_tests))
+			return -1;
+	}
+
+	if (family == AF_INET) {
+		n = bpf_loop(ARRAY_SIZE(sol_ip_tests), bpf_test_ip_sockopt, &lc, 0);
+		if (n != ARRAY_SIZE(sol_ip_tests))
+			return -1;
+	} else {
+		n = bpf_loop(ARRAY_SIZE(sol_ipv6_tests), bpf_test_ipv6_sockopt, &lc, 0);
+		if (n != ARRAY_SIZE(sol_ipv6_tests))
+			return -1;
+	}
+
+	return 0;
+}
+
+static int binddev_test(void *ctx)
+{
+	const char empty_ifname[] = "";
+	int ifindex, zero = 0;
+
+	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE,
+			   (void *)veth, sizeof(veth)))
+		return -1;
+	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &ifindex, sizeof(int)) ||
+	    ifindex != veth_ifindex)
+		return -1;
+
+	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE,
+			   (void *)empty_ifname, sizeof(empty_ifname)))
+		return -1;
+	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &ifindex, sizeof(int)) ||
+	    ifindex != 0)
+		return -1;
+
+	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   (void *)&veth_ifindex, sizeof(int)))
+		return -1;
+	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &ifindex, sizeof(int)) ||
+	    ifindex != veth_ifindex)
+		return -1;
+
+	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &zero, sizeof(int)))
+		return -1;
+	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &ifindex, sizeof(int)) ||
+	    ifindex != 0)
+		return -1;
+
+	return 0;
+}
+
+SEC("lsm_cgroup/socket_post_create")
+int BPF_PROG(socket_post_create, struct socket *sock, int family,
+	     int type, int protocol, int kern)
+{
+	struct sock *sk = sock->sk;
+
+	if (!sk)
+		return 1;
+
+	nr_socket_post_create += !bpf_test_sockopt(sk, sk);
+	nr_binddev += !binddev_test(sk);
+
+	return 1;
+}
+
+SEC("sockops")
+int skops_sockopt(struct bpf_sock_ops *skops)
+{
+	struct bpf_sock *bpf_sk = skops->sk;
+	struct sock *sk;
+
+	if (!bpf_sk)
+		return 1;
+
+	sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
+	if (!sk)
+		return 1;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_TCP_LISTEN_CB:
+		nr_listen += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		nr_connect += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		nr_active += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		nr_passive += !bpf_test_sockopt(skops, sk);
+		break;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* RE: [PATCH bpf-next 01/14] net: Change sock_setsockopt from taking sock ptr to sk ptr
  2022-07-27  6:09 ` [PATCH bpf-next 01/14] net: Change sock_setsockopt from taking sock ptr to sk ptr Martin KaFai Lau
@ 2022-07-27  8:11   ` David Laight
  2022-07-27 20:42     ` Martin KaFai Lau
  2022-07-27  8:16   ` Eric Dumazet
  1 sibling, 1 reply; 38+ messages in thread
From: David Laight @ 2022-07-27  8:11 UTC (permalink / raw)
  To: 'Martin KaFai Lau', bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

From: Martin KaFai Lau
> Sent: 27 July 2022 07:09
> 
> A later patch refactors bpf_setsockopt(SOL_SOCKET) with the
> sock_setsockopt() to avoid code duplication and code
> drift between the two duplicates.
> 
> The current sock_setsockopt() takes sock ptr as the argument.
> The very first thing this function does is to get back the sk ptr
> by 'sk = sock->sk'.
> 
> bpf_setsockopt() could be called when the sk does not have
> a userspace owner, meaning sk->sk_socket is NULL.  For example,
> when a passive tcp connection has just been established.  Thus,
> it cannot use the sock_setsockopt(sk->sk_socket) or else it will
> pass a NULL sock ptr.

I'm intrigued, I've some code that uses sock_create_kern() to create
sockets without a userspace owner - I'd have thought bpf is doing
much the same.

I end up doing:
        if (level == SOL_SOCKET)
                err = sock_setsockopt(sock, level, optname, koptval, optlen);
        else
                err = sock->ops->setsockopt(sock, level, optname, koptval,
                                            optlen);
to set options.
(This code used to use kern_setsockopt() - but that got removed.)

I'd have thought bpf would need similar code??

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 01/14] net: Change sock_setsockopt from taking sock ptr to sk ptr
  2022-07-27  6:09 ` [PATCH bpf-next 01/14] net: Change sock_setsockopt from taking sock ptr to sk ptr Martin KaFai Lau
  2022-07-27  8:11   ` David Laight
@ 2022-07-27  8:16   ` Eric Dumazet
  2022-07-27 18:50     ` Martin KaFai Lau
  1 sibling, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2022-07-27  8:16 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Jakub Kicinski, kernel-team,
	Paolo Abeni

On Wed, Jul 27, 2022 at 8:09 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> A later patch refactors bpf_setsockopt(SOL_SOCKET) with the
> sock_setsockopt() to avoid code duplication and code
> drift between the two duplicates.
>
> The current sock_setsockopt() takes sock ptr as the argument.
> The very first thing this function does is to get back the sk ptr
> by 'sk = sock->sk'.
>
> bpf_setsockopt() could be called when the sk does not have
> a userspace owner, meaning sk->sk_socket is NULL.  For example,
> when a passive tcp connection has just been established.  Thus,
> it cannot use the sock_setsockopt(sk->sk_socket) or else it will
> pass a NULL sock ptr.
>
> All existing callers have both sock->sk and sk->sk_socket pointers.
> Thus, this patch changes the sock_setsockopt() to take a sk ptr
> instead of the sock ptr.  The bpf_setsockopt() only allows
> optnames that do not require a sock ptr.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

...

> diff --git a/include/net/sock.h b/include/net/sock.h
> index f7ad1a7705e9..9e2539dcc293 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1795,7 +1795,7 @@ void sock_pfree(struct sk_buff *skb);
>  #define sock_edemux sock_efree
>  #endif
>
> -int sock_setsockopt(struct socket *sock, int level, int op,
> +int sock_setsockopt(struct sock *sk, int level, int op,
>                     sockptr_t optval, unsigned int optlen);
>

SGTM, but I feel we should rename this to sk_setsockopt() ?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-27  6:09 ` [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf Martin KaFai Lau
@ 2022-07-27  8:36   ` David Laight
  2022-07-27 20:05     ` Martin KaFai Lau
  2022-07-27 16:47   ` sdf
  1 sibling, 1 reply; 38+ messages in thread
From: David Laight @ 2022-07-27  8:36 UTC (permalink / raw)
  To: 'Martin KaFai Lau', bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni

From: Martin KaFai Lau
> Sent: 27 July 2022 07:09
> 
> Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> the sock_setsockopt().  The number of supported options keeps
> increasing and so does the duplicated code.
> 
> One issue in reusing sock_setsockopt() is that the bpf prog
> has already acquired the sk lock.  sockptr_t is useful to handle this.
> sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> has already been ensured by the bpf prog.

That is a really horrid place to hide an 'is locked' bit.

You'd be better off splitting sock_setsockopt() to add a function
that is called with sk_lock held and the value already read.
That would also save the churn of all the callers.
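
For illustration, a split along those lines might look roughly like
this (a sketch; the name of the locked variant is hypothetical):

	/* sketch: called with the sk lock already held */
	int sk_setsockopt_locked(struct sock *sk, int level, int optname,
				 sockptr_t optval, unsigned int optlen);

	int sock_setsockopt(struct socket *sock, int level, int optname,
			    sockptr_t optval, unsigned int optlen)
	{
		struct sock *sk = sock->sk;
		int err;

		lock_sock(sk);
		err = sk_setsockopt_locked(sk, level, optname, optval, optlen);
		release_sock(sk);
		return err;
	}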

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-27  6:09 ` [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf Martin KaFai Lau
  2022-07-27  8:36   ` David Laight
@ 2022-07-27 16:47   ` sdf
  2022-07-27 18:37     ` Martin KaFai Lau
  1 sibling, 1 reply; 38+ messages in thread
From: sdf @ 2022-07-27 16:47 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On 07/26, Martin KaFai Lau wrote:
> Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> the sock_setsockopt().  The number of supported options keeps
> increasing and so does the duplicated code.

> One issue in reusing sock_setsockopt() is that the bpf prog
> has already acquired the sk lock.  sockptr_t is useful to handle this.
> sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> has already been ensured by the bpf prog.

Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some  
point,
we can have code paths in bpf where the socket has been already locked by
the stack?
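
For reference, sockptr_t in include/linux/sockptr.h is roughly the
following, so the suggestion amounts to naming the new bit after what
it guarantees (a sketch; the is_locked field is the hypothetical part):

	typedef struct {
		union {
			void		*kernel;
			void __user	*user;
		};
		bool		is_kernel : 1;
		bool		is_locked : 1;	/* hypothetical: sk lock held */
	} sockptr_t;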

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 03/14] bpf: net: Consider optval.is_bpf before capable check in sock_setsockopt()
  2022-07-27  6:09 ` [PATCH bpf-next 03/14] bpf: net: Consider optval.is_bpf before capable check in sock_setsockopt() Martin KaFai Lau
@ 2022-07-27 16:54   ` sdf
  2022-07-27 18:47     ` Martin KaFai Lau
  0 siblings, 1 reply; 38+ messages in thread
From: sdf @ 2022-07-27 16:54 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On 07/26, Martin KaFai Lau wrote:
> When a bpf program calls bpf_setsockopt(SOL_SOCKET),
> it could be running in softirq and it doesn't make sense to do the capable
> check.  There was a similar situation in bpf_setsockopt(TCP_CONGESTION).

Should we instead skip these capability checks based on something like
in_serving_softirq? I wonder if we might be mixing too much into that
is_bpf flag (locking assumptions, context assumptions, etc)?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt()
  2022-07-27  6:08 [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt() Martin KaFai Lau
                   ` (13 preceding siblings ...)
  2022-07-27  6:10 ` [PATCH bpf-next 14/14] selftests/bpf: bpf_setsockopt tests Martin KaFai Lau
@ 2022-07-27 17:14 ` Jakub Kicinski
  2022-07-27 20:42   ` Martin KaFai Lau
  14 siblings, 1 reply; 38+ messages in thread
From: Jakub Kicinski @ 2022-07-27 17:14 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, kernel-team,
	Paolo Abeni

On Tue, 26 Jul 2022 23:08:56 -0700 Martin KaFai Lau wrote:
> bpf: net: Remove duplicated codes from bpf_setsockopt()

nit: "code" is a mass noun, uncountable and always singular

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-27 16:47   ` sdf
@ 2022-07-27 18:37     ` Martin KaFai Lau
  2022-07-27 20:39       ` Stanislav Fomichev
  0 siblings, 1 reply; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-27 18:37 UTC (permalink / raw)
  To: sdf
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> On 07/26, Martin KaFai Lau wrote:
> > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > the sock_setsockopt().  The number of supported options keeps
> > increasing and so does the duplicated code.
> 
> > One issue in reusing sock_setsockopt() is that the bpf prog
> > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> > has already been ensured by the bpf prog.
> 
> Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> point,
is_locked was my initial attempt.  The bpf_setsockopt() also skips
the ns_capable() check, like in patch 3.  I ended up using
one is_bpf bit here to do both.

> we can have code paths in bpf where the socket has been already locked by
> the stack?
hmm... You meant the opposite, like the bpf hook does not have the
lock pre-acquired before the bpf prog gets run and sock_setsockopt()
should do lock_sock() as usual?

I was thinking a likely situation is a bpf 'sleepable' hook that does not
have the lock pre-acquired.  In that case, the bpf_setsockopt() could
always acquire the lock first but it may turn out to be too
pessimistic for the future bpf_getsockopt() refactoring.

Or we could do this 'bit' break-up (into one is_locked bit
for locking and one is_bpf to skip the capable check).  I was waiting until a real
need comes up instead of having both bits always true now.  I don't mind
adding is_locked now since the bpf_lsm_cgroup may become sleepable soon.
I can do this in the next spin.
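
For example, with such a break-up, the bpf side could mark the optval
roughly like this (a sketch; the function name and the is_locked field
follow the discussion above and are hypothetical):

	static int bpf_sol_socket_setsockopt(struct sock *sk, int optname,
					     char *optval, int optlen)
	{
		sockptr_t koptval = KERNEL_SOCKPTR(optval);

		/* the bpf prog runs with the sk lock already held */
		koptval.is_locked = 1;
		return sock_setsockopt(sk, SOL_SOCKET, optname, koptval, optlen);
	}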

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 03/14] bpf: net: Consider optval.is_bpf before capable check in sock_setsockopt()
  2022-07-27 16:54   ` sdf
@ 2022-07-27 18:47     ` Martin KaFai Lau
  0 siblings, 0 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-27 18:47 UTC (permalink / raw)
  To: sdf
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 09:54:08AM -0700, sdf@google.com wrote:
> On 07/26, Martin KaFai Lau wrote:
> > When a bpf program calls bpf_setsockopt(SOL_SOCKET),
> > it could be running in softirq and it doesn't make sense to do the capable
> > check.  There was a similar situation in bpf_setsockopt(TCP_CONGESTION).
> 
> Should we instead skip these capability checks based on something like
> in_serving_softirq? I wonder if we might be mixing too much into that
> is_bpf flag (locking assumptions, context assumptions, etc)?
Yes, the bit can be split, as in another reply in patch 2.
I don't think in_serving_softirq is a good fit for the name.  Some of the
hooks are not in_serving_softirq.  is_bpf should be a better name
for this.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 01/14] net: Change sock_setsockopt from taking sock ptr to sk ptr
  2022-07-27  8:16   ` Eric Dumazet
@ 2022-07-27 18:50     ` Martin KaFai Lau
  0 siblings, 0 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-27 18:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Jakub Kicinski, kernel-team,
	Paolo Abeni

On Wed, Jul 27, 2022 at 10:16:28AM +0200, Eric Dumazet wrote:
> On Wed, Jul 27, 2022 at 8:09 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > A later patch refactors bpf_setsockopt(SOL_SOCKET) with the
> > sock_setsockopt() to avoid code duplication and code
> > drift between the two duplicates.
> >
> > The current sock_setsockopt() takes sock ptr as the argument.
> > The very first thing this function does is to get back the sk ptr
> > by 'sk = sock->sk'.
> >
> > bpf_setsockopt() could be called when the sk does not have
> > a userspace owner, meaning sk->sk_socket is NULL.  For example,
> > when a passive tcp connection has just been established.  Thus,
> > it cannot use the sock_setsockopt(sk->sk_socket) or else it will
> > pass a NULL sock ptr.
> >
> > All existing callers have both sock->sk and sk->sk_socket pointers.
> > Thus, this patch changes the sock_setsockopt() to take a sk ptr
> > instead of the sock ptr.  The bpf_setsockopt() only allows
> > optnames that do not require a sock ptr.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> 
> ...
> 
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index f7ad1a7705e9..9e2539dcc293 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1795,7 +1795,7 @@ void sock_pfree(struct sk_buff *skb);
> >  #define sock_edemux sock_efree
> >  #endif
> >
> > -int sock_setsockopt(struct socket *sock, int level, int op,
> > +int sock_setsockopt(struct sock *sk, int level, int op,
> >                     sockptr_t optval, unsigned int optlen);
> >
> 
> SGTM, but I feel we should rename this to sk_setsockopt() ?
Ah, right.  Will rename it.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-27  8:36   ` David Laight
@ 2022-07-27 20:05     ` Martin KaFai Lau
  0 siblings, 0 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-27 20:05 UTC (permalink / raw)
  To: David Laight
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 08:36:18AM +0000, David Laight wrote:
> From: Martin KaFai Lau
> > Sent: 27 July 2022 07:09
> > 
> > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > the sock_setsockopt().  The number of supported options keeps
> > increasing and so does the duplicated code.
> > 
> > One issue in reusing sock_setsockopt() is that the bpf prog
> > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> > has already been ensured by the bpf prog.
> 
> That is a really horrid place to hide an 'is locked' bit.
> 
> You'd be better off splitting sock_setsockopt() to add a function
> that is called with sk_lock held and the value read.
> That would also save the churn of all the callers.
There is no churn to the callers after this patch, so quite
the opposite.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-27 18:37     ` Martin KaFai Lau
@ 2022-07-27 20:39       ` Stanislav Fomichev
  2022-07-27 21:21         ` Martin KaFai Lau
  0 siblings, 1 reply; 38+ messages in thread
From: Stanislav Fomichev @ 2022-07-27 20:39 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> > On 07/26, Martin KaFai Lau wrote:
> > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > the sock_setsockopt().  The number of supported options keeps
> > > increasing and so does the duplicated code.
> >
> > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> > > has already been ensured by the bpf prog.
> >
> > Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> > point,
> is_locked was my initial attempt.  The bpf_setsockopt() also skips
> the ns_capable() check, like in patch 3.  I ended up using
> one is_bpf bit here to do both.

Yeah, sorry, I hadn't read the whole series before I sent my first
reply. Let's discuss it here.

This reminds me of ns_capable in __inet_bind where we also had to add
special handling.

In general, not specific to the series, I wonder if we want some new
in_bpf() context indication and bypass ns_capable() from those
contexts?
Then we can do things like:

  if (sk->sk_bound_dev_if && !in_bpf() &&
      !ns_capable(net->user_ns, CAP_NET_RAW))
    return ...;

Or would it make things more confusing?



> > we can have code paths in bpf where the socket has been already locked by
> > the stack?
> hmm... You meant the opposite, like the bpf hook does not have the
> lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> should do lock_sock() as usual?
>
> I was thinking a likely situation is a bpf 'sleepable' hook that does not
> have the lock pre-acquired.  In that case, the bpf_setsockopt() could
> always acquire the lock first but it may turn out to be too
> pessimistic for the future bpf_getsockopt() refactoring.
>
> Or we could do this 'bit' break-up (into one is_locked bit
> for locking and one is_bpf to skip the capable check).  I was waiting until a real
> need comes up instead of having both bits always true now.  I don't mind
> adding is_locked now since the bpf_lsm_cgroup may become sleepable soon.
> I can do this in the next spin.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 01/14] net: Change sock_setsockopt from taking sock ptr to sk ptr
  2022-07-27  8:11   ` David Laight
@ 2022-07-27 20:42     ` Martin KaFai Lau
  0 siblings, 0 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-27 20:42 UTC (permalink / raw)
  To: David Laight
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 08:11:26AM +0000, David Laight wrote:
> From: Martin KaFai Lau
> > Sent: 27 July 2022 07:09
> > 
> > A later patch refactors bpf_setsockopt(SOL_SOCKET) with the
> > sock_setsockopt() to avoid code duplication and code
> > drift between the two duplicates.
> > 
> > The current sock_setsockopt() takes sock ptr as the argument.
> > The very first thing this function does is to get back the sk ptr
> > by 'sk = sock->sk'.
> > 
> > bpf_setsockopt() could be called when the sk does not have
> > a userspace owner, meaning sk->sk_socket is NULL.  For example,
> > when a passive tcp connection has just been established.  Thus,
> > it cannot use the sock_setsockopt(sk->sk_socket) or else it will
> > pass a NULL sock ptr.
> 
> I'm intrigued, I've some code that uses sock_create_kern() to create
> sockets without a userspace owner - I'd have thought bpf is doing
> much the same.
> 
> I end up doing:
>         if (level == SOL_SOCKET)
>                 err = sock_setsockopt(sock, level, optname, koptval, optlen);
>         else
>                 err = sock->ops->setsockopt(sock, level, optname, koptval,
>                                             optlen);
> to set options.
> (This code used to use kern_setsockopt() - but that got removed.)
> 
> I'd have thought bpf would need similar code??
By no userspace owner, I was referring to an sk that has not been
accept()-ed by userspace yet, as opposed to a 'struct socket *sock'
created for kernel-internal use.  On second thought, that tcp_sock is
sort of owned by the listen sk, so I will rephrase the commit
message to avoid the confusion.

The bpf prog does not have a 'sock' ptr because the sock
has not been created yet.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt()
  2022-07-27 17:14 ` [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt() Jakub Kicinski
@ 2022-07-27 20:42   ` Martin KaFai Lau
  0 siblings, 0 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-27 20:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, kernel-team,
	Paolo Abeni

On Wed, Jul 27, 2022 at 10:14:05AM -0700, Jakub Kicinski wrote:
> On Tue, 26 Jul 2022 23:08:56 -0700 Martin KaFai Lau wrote:
> > bpf: net: Remove duplicated codes from bpf_setsockopt()
> 
> nit: "code" is a mass noun, uncountable and always singular
will fix.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-27 20:39       ` Stanislav Fomichev
@ 2022-07-27 21:21         ` Martin KaFai Lau
  2022-07-27 21:38           ` Stanislav Fomichev
  0 siblings, 1 reply; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-27 21:21 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 01:39:08PM -0700, Stanislav Fomichev wrote:
> On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> > > On 07/26, Martin KaFai Lau wrote:
> > > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > > the sock_setsockopt().  The number of supported options keeps
> > > > increasing and so does the duplicated code.
> > >
> > > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > > memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> > > > has already been ensured by the bpf prog.
> > >
> > > Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> > > point,
> > is_locked was my initial attempt.  The bpf_setsockopt() also skips
> > the ns_capable() check, like in patch 3.  I ended up using
> > one is_bpf bit here to do both.
> 
> Yeah, sorry, I haven't read the whole series before I sent my first
> reply. Let's discuss it here.
> 
> This reminds me of ns_capable in __inet_bind where we also had to add
> special handling.
> 
> In general, not specific to the series, I wonder if we want some new
> in_bpf() context indication and bypass ns_capable() from those
> contexts?
> Then we can do things like:
> 
>   if (sk->sk_bound_dev_if && !in_bpf() &&
>       !ns_capable(net->user_ns, CAP_NET_RAW))
>     return ...;
I don't see a way to implement in_bpf() after some thought.
Do you have an idea?

> 
> Or would it make things more confusing?
> 
> 
> 
> > > we can have code paths in bpf where the socket has been already locked by
> > > the stack?
> > hmm... You meant the opposite, like the bpf hook does not have the
> > lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> > should do lock_sock() as usual?
> >
> > I was thinking a likely situation is a bpf 'sleepable' hook that does not
> > have the lock pre-acquired.  In that case, the bpf_setsockopt() could
> > always acquire the lock first but it may turn out to be too
> > pessimistic for the future bpf_getsockopt() refactoring.
> >
> > Or we could do this 'bit' break-up (into one is_locked bit
> > for locking and one is_bpf to skip the capable check).  I was waiting until a real
> > need comes up instead of having both bits always true now.  I don't mind
> > adding is_locked now since the bpf_lsm_cgroup may become sleepable soon.
> > I can do this in the next spin.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-27 21:21         ` Martin KaFai Lau
@ 2022-07-27 21:38           ` Stanislav Fomichev
  2022-07-28  0:45             ` Martin KaFai Lau
  0 siblings, 1 reply; 38+ messages in thread
From: Stanislav Fomichev @ 2022-07-27 21:38 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 2:21 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Jul 27, 2022 at 01:39:08PM -0700, Stanislav Fomichev wrote:
> > On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> > > > On 07/26, Martin KaFai Lau wrote:
> > > > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > > > the sock_setsockopt().  The number of supported options keeps
> > > > > increasing and so does the duplicated code.
> > > >
> > > > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > > > memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> > > > > has already been ensured by the bpf prog.
> > > >
> > > > Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> > > > point,
> > > is_locked was my initial attempt.  The bpf_setsockopt() also skips
> > > the ns_capable() check, like in patch 3.  I ended up using
> > > one is_bpf bit here to do both.
> >
> > Yeah, sorry, I haven't read the whole series before I sent my first
> > reply. Let's discuss it here.
> >
> > This reminds me of ns_capable in __inet_bind where we also had to add
> > special handling.
> >
> > In general, not specific to the series, I wonder if we want some new
> > in_bpf() context indication and bypass ns_capable() from those
> > contexts?
> > Then we can do things like:
> >
> >   if (sk->sk_bound_dev_if && !in_bpf() && !ns_capable(net->user_ns,
> > CAP_NET_RAW))
> >     return ...;
> I don't see a way to implement in_bpf() after some thought.
> Do you have an idea?

I wonder if we can cheat a bit with the following:

bool setsockopt_capable(struct user_namespace *ns, int cap)
{
       if (!in_task()) {
             /* Running in irq/softirq -> setsockopt invoked by bpf program.
              * [not sure, is it safe to assume no regular path leads
              *  to setsockopt from sirq?]
              */
             return true;
       }

       /* Running in process context, task has bpf_ctx set ->
        * invoked by bpf program. */
       if (current->bpf_ctx != NULL)
             return true;

       return ns_capable(ns, cap);
}

And then do /ns_capable/setsockopt_capable/ in net/core/sock.c

But that might be more fragile than passing the flag, idk.

> > Or would it make things more confusing?
> >
> >
> >
> > > > we can have code paths in bpf where the socket has been already locked by
> > > > the stack?
> > > hmm... You meant the opposite, like the bpf hook does not have the
> > > lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> > > should do lock_sock() as usual?
> > >
> > > I was thinking a likely situation is a bpf 'sleepable' hook that does not
> > > have the lock pre-acquired.  In that case, the bpf_setsockopt() could
> > > always acquire the lock first but it may turn out to be too
> > > pessimistic for the future bpf_getsockopt() refactoring.
> > >
> > > Or we could do this 'bit' break-up (into one is_locked bit
> > > for locking and one is_bpf to skip the capable check).  I was waiting until a real
> > > need comes up instead of having both bits always true now.  I don't mind
> > > adding is_locked now since the bpf_lsm_cgroup may become sleepable soon.
> > > I can do this in the next spin.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-27 21:38           ` Stanislav Fomichev
@ 2022-07-28  0:45             ` Martin KaFai Lau
  2022-07-28  1:49               ` Jakub Kicinski
  0 siblings, 1 reply; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-28  0:45 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 02:38:51PM -0700, Stanislav Fomichev wrote:
> On Wed, Jul 27, 2022 at 2:21 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Jul 27, 2022 at 01:39:08PM -0700, Stanislav Fomichev wrote:
> > > On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> > > > > On 07/26, Martin KaFai Lau wrote:
> > > > > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > > > > the sock_setsockopt().  The number of supported options keeps
> > > > > > increasing and so does the duplicated code.
> > > > >
> > > > > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > > > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > > > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > > > > memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> > > > > > has already been ensured by the bpf prog.
> > > > >
> > > > > Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> > > > > point,
> > > > is_locked was my initial attempt.  The bpf_setsockopt() also skips
> > > > the ns_capable() check, like in patch 3.  I ended up using
> > > > one is_bpf bit here to do both.
> > >
> > > Yeah, sorry, I haven't read the whole series before I sent my first
> > > reply. Let's discuss it here.
> > >
> > > This reminds me of ns_capable in __inet_bind where we also had to add
> > > special handling.
> > >
> > > In general, not specific to the series, I wonder if we want some new
> > > in_bpf() context indication and bypass ns_capable() from those
> > > contexts?
> > > Then we can do things like:
> > >
> > >   if (sk->sk_bound_dev_if && !in_bpf() &&
> > >       !ns_capable(net->user_ns, CAP_NET_RAW))
> > >     return ...;
> > I don't see a way to implement in_bpf() after some thought.
> > Do you have an idea?
> 
> I wonder if we can cheat a bit with the following:
> 
> bool setsockopt_capable(struct user_namespace *ns, int cap)
> {
>        if (!in_task()) {
>              /* Running in irq/softirq -> setsockopt invoked by bpf program.
>               * [not sure, is it safe to assume no regular path leads
>               *  to setsockopt from sirq?]
>               */
>              return true;
>        }
> 
>        /* Running in process context, task has bpf_ctx set ->
>         * invoked by bpf program. */
>        if (current->bpf_ctx != NULL)
>              return true;
> 
>        return ns_capable(ns, cap);
> }
> 
> And then do /ns_capable/setsockopt_capable/ in net/core/sock.c
> 
> But that might be more fragile than passing the flag, idk.
I think it should work.  From a quick look, all bpf_setsockopt usage has
bpf_ctx.  The ones from bpf_tcp_ca (struct_ops) and bpf_iter are trampolines
which also have bpf_ctx.  Not sure about the future use cases.

To be honest, I am not sure if I have missed cases and also have similar
questions to yours about the above sample code.  This may deserve a separate
patch set for discussion.  Using a bit in sockptr is mostly free now.
WDYT?

> 
> > > Or would it make things more confusing?
> > >
> > >
> > >
> > > > > we can have code paths in bpf where the socket has been already locked by
> > > > > the stack?
> > > > hmm... You meant the opposite, like the bpf hook does not have the
> > > > lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> > > > should do lock_sock() as usual?
> > > >
> > > > I was thinking a likely situation is a bpf 'sleepable' hook that does not
> > > > have the lock pre-acquired.  In that case, the bpf_setsockopt() could
> > > > always acquire the lock first but it may turn out to be too
> > > > pessimistic for the future bpf_getsockopt() refactoring.
> > > >
> > > > Or we could do this 'bit' break-up (into one is_locked bit
> > > > for locking and one is_bpf to skip the capable check).  I was waiting until a real
> > > > need comes up instead of having both bits always true now.  I don't mind
> > > > adding is_locked now since the bpf_lsm_cgroup may become sleepable soon.
> > > > I can do this in the next spin.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-28  0:45             ` Martin KaFai Lau
@ 2022-07-28  1:49               ` Jakub Kicinski
  2022-07-28 16:31                 ` Martin KaFai Lau
  0 siblings, 1 reply; 38+ messages in thread
From: Jakub Kicinski @ 2022-07-28  1:49 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, David Miller, Eric Dumazet,
	kernel-team, Paolo Abeni

On Wed, 27 Jul 2022 17:45:46 -0700 Martin KaFai Lau wrote:
> > bool setsockopt_capable(struct user_namespace *ns, int cap)
> > {
> >        if (!in_task()) {
> >              /* Running in irq/softirq -> setsockopt invoked by bpf program.
> >               * [not sure, is it safe to assume no regular path leads
> >               *  to setsockopt from sirq?]
> >               */
> >              return true;
> >        }
> > 
> >        /* Running in process context, task has bpf_ctx set ->
> >         * invoked by bpf program. */
> >        if (current->bpf_ctx != NULL)
> >              return true;
> > 
> >        return ns_capable(ns, cap);
> > }
> > 
> > And then do /ns_capable/setsockopt_capable/ in net/core/sock.c
> > 
> > But that might be more fragile than passing the flag, idk.  
> > I think it should work.  From a quick look, all bpf_setsockopt usage has
> > bpf_ctx.  The ones from bpf_tcp_ca (struct_ops) and bpf_iter are trampolines
> > which also have bpf_ctx.  Not sure about the future use cases.
> > 
> > To be honest, I am not sure if I have missed cases and also have similar
> > questions to yours about the above sample code.  This may deserve a separate
> > patch set for discussion.  Using a bit in sockptr is mostly free now.
> > WDYT?

Sorry to chime in but I vote against @in_bpf. I had to search the git
history recently to figure out what SK_USER_DATA_BPF means. It's not
going to be obvious to a networking person what semantics to attribute
to "in bpf".

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-28  1:49               ` Jakub Kicinski
@ 2022-07-28 16:31                 ` Martin KaFai Lau
  2022-07-28 16:56                   ` Jakub Kicinski
  0 siblings, 1 reply; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-28 16:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, David Miller, Eric Dumazet,
	kernel-team, Paolo Abeni

On Wed, Jul 27, 2022 at 06:49:03PM -0700, Jakub Kicinski wrote:
> On Wed, 27 Jul 2022 17:45:46 -0700 Martin KaFai Lau wrote:
> > > bool setsockopt_capable(struct user_namespace *ns, int cap)
> > > {
> > >        if (!in_task()) {
> > >              /* Running in irq/softirq -> setsockopt invoked by bpf program.
> > >               * [not sure, is it safe to assume no regular path leads
> > >               *  to setsockopt from sirq?]
> > >               */
> > >              return true;
> > >        }
> > > 
> > >        /* Running in process context, task has bpf_ctx set ->
> > >         * invoked by bpf program. */
> > >        if (current->bpf_ctx != NULL)
> > >              return true;
> > > 
> > >        return ns_capable(ns, cap);
> > > }
> > > 
> > > And then do /ns_capable/setsockopt_capable/ in net/core/sock.c
> > > 
> > > But that might be more fragile than passing the flag, idk.  
> > I think it should work.  From a quick look, all bpf_setsockopt usage has
> > bpf_ctx.  The ones from bpf_tcp_ca (struct_ops) and bpf_iter are trampolines
> > which also have bpf_ctx.  Not sure about the future use cases.
> > 
> > To be honest, I am not sure if I have missed cases and also have similar
> > questions to yours about the above sample code.  This may deserve a separate
> > patch set for discussion.  Using a bit in sockptr is mostly free now.
> > WDYT?
> 
> Sorry to chime in but I vote against @in_bpf. I had to search the git
> history recently to figure out what SK_USER_DATA_BPF means. It's not
> going to be obvious to a networking person what semantics to attribute
> to "in bpf".
If I understand the concern correctly, it may not be straightforward to
grasp the reason behind the tests in in_bpf() [ the in_task() and
the current->bpf_ctx test ]?  Yes, it is a valid point.

The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
helper and should be easier to reason about.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-28 16:31                 ` Martin KaFai Lau
@ 2022-07-28 16:56                   ` Jakub Kicinski
  2022-07-28 17:20                     ` Martin KaFai Lau
  2022-07-29 10:04                     ` David Laight
  0 siblings, 2 replies; 38+ messages in thread
From: Jakub Kicinski @ 2022-07-28 16:56 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, David Miller, Eric Dumazet,
	kernel-team, Paolo Abeni

On Thu, 28 Jul 2022 09:31:04 -0700 Martin KaFai Lau wrote:
> If I understand the concern correctly, it may not be straightforward to
> grasp the reason behind the checks in in_bpf() [ the in_task() and
> the current->bpf_ctx test ]?  Yes, it is a valid point.
> 
> The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
> helper and should be easier to reason about.

I think we're saying opposite things. in_bpf(), the context-checking
function, is fine. There is a clear parallel to in_task(), and combined
with the capability check it should be pretty obvious what the code
is intending to achieve.

sockptr_t::in_bpf, which randomly implies that the lock is already held,
will be hard to understand for anyone not intimately familiar with the
BPF code. Naming that bit is_locked seems much clearer.

Which is what I believe Stan was proposing.
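
IOW something like this (the new field and its placement are made up
here to illustrate, not an actual patch):

	typedef struct {
		union {
			void		*kernel;
			void __user	*user;
		};
		bool	is_kernel : 1;
		bool	is_locked : 1;	/* caller already holds lock_sock(sk) */
	} sockptr_t;

	/* in sock_setsockopt() and friends */
	if (!optval.is_locked)
		lock_sock(sk);

The name then says exactly what the callee may rely on.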

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-28 16:56                   ` Jakub Kicinski
@ 2022-07-28 17:20                     ` Martin KaFai Lau
  2022-07-28 17:40                       ` Jakub Kicinski
  2022-07-29 10:04                     ` David Laight
  1 sibling, 1 reply; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-28 17:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, David Miller, Eric Dumazet,
	kernel-team, Paolo Abeni

On Thu, Jul 28, 2022 at 09:56:29AM -0700, Jakub Kicinski wrote:
> On Thu, 28 Jul 2022 09:31:04 -0700 Martin KaFai Lau wrote:
> > If I understand the concern correctly, it may not be straightforward to
> > grasp the reason behind the checks in in_bpf() [ the in_task() and
> > the current->bpf_ctx test ]?  Yes, it is a valid point.
> > 
> > The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
> > helper and should be easier to reason about.
> 
> I think we're saying opposite things. in_bpf(), the context-checking
> function, is fine. There is a clear parallel to in_task(), and combined
> with the capability check it should be pretty obvious what the code
> is intending to achieve.
> 
> sockptr_t::in_bpf, which randomly implies that the lock is already held,
> will be hard to understand for anyone not intimately familiar with the
> BPF code. Naming that bit is_locked seems much clearer.
> 
> Which is what I believe Stan was proposing.
Yeah, I think I read the 'vote against @in_bpf' the other way around. :)

Sure. I will do s/is_bpf/is_locked/ and do the in_bpf() context
checking before ns_capable().
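
IOW, roughly (untested sketch of the direction, not the final diff):

	static inline bool in_bpf(void)
	{
		/* bpf progs run either in irq/softirq context or,
		 * for trampoline-based progs (struct_ops, iter),
		 * with current->bpf_ctx set
		 */
		return !in_task() || current->bpf_ctx;
	}

and then in the capable check:

	if (!in_bpf() && !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
		return -EPERM;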

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-28 17:20                     ` Martin KaFai Lau
@ 2022-07-28 17:40                       ` Jakub Kicinski
  0 siblings, 0 replies; 38+ messages in thread
From: Jakub Kicinski @ 2022-07-28 17:40 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, David Miller, Eric Dumazet,
	kernel-team, Paolo Abeni

On Thu, 28 Jul 2022 10:20:04 -0700 Martin KaFai Lau wrote:
> > Which is what I believe Stan was proposing.  
> Yeah, I think I read the 'vote against @in_bpf' the other way around. :)

My bad, I didn't read the proposal in sufficient detail to realize 
the helper is called the same thing as the bit :D

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-28 16:56                   ` Jakub Kicinski
  2022-07-28 17:20                     ` Martin KaFai Lau
@ 2022-07-29 10:04                     ` David Laight
  2022-07-29 19:06                       ` Martin KaFai Lau
  1 sibling, 1 reply; 38+ messages in thread
From: David Laight @ 2022-07-29 10:04 UTC (permalink / raw)
  To: 'Jakub Kicinski', Martin KaFai Lau
  Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, David Miller, Eric Dumazet,
	kernel-team, Paolo Abeni

From: Jakub Kicinski
> Sent: 28 July 2022 17:56
> 
> On Thu, 28 Jul 2022 09:31:04 -0700 Martin KaFai Lau wrote:
> > If I understand the concern correctly, it may not be straightforward to
> > grasp the reason behind the checks in in_bpf() [ the in_task() and
> > the current->bpf_ctx test ]?  Yes, it is a valid point.
> >
> > The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
> > helper and should be easier to reason about.
> 
> I think we're saying opposite things. in_bpf(), the context-checking
> function, is fine. There is a clear parallel to in_task(), and combined
> with the capability check it should be pretty obvious what the code
> is intending to achieve.
> 
> sockptr_t::in_bpf, which randomly implies that the lock is already held,
> will be hard to understand for anyone not intimately familiar with the
> BPF code. Naming that bit is_locked seems much clearer.
> 
> Which is what I believe Stan was proposing.

Or make sk_setsockopt() be called after the integer value
has been read, and with the lock held.

That avoids any (horrid) conditional locking.

Also sockptr_t should probably have been a structure with separate
user and kernel address fields.
Putting the length in there would (probably) save code.

There might then be scope for pre-copying short user buffers
into a kernel buffer while still allowing the requests that
ignore the length to copy directly from a user buffer.
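
Something like this (hypothetical layout, just to show the shape):

	typedef struct {
		void __user	*user;		/* valid when kernel is NULL */
		void		*kernel;	/* valid when user is NULL */
		size_t		len;		/* optlen travels with optval */
	} sockptr_t;

The is_kernel bit then falls out of which pointer is set, and a helper
could bounce short user buffers through a kernel copy up front.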

	David


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf
  2022-07-29 10:04                     ` David Laight
@ 2022-07-29 19:06                       ` Martin KaFai Lau
  0 siblings, 0 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-07-29 19:06 UTC (permalink / raw)
  To: David Laight
  Cc: 'Jakub Kicinski',
	Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, David Miller, Eric Dumazet,
	kernel-team, Paolo Abeni

On Fri, Jul 29, 2022 at 10:04:29AM +0000, David Laight wrote:
> From: Jakub Kicinski
> > Sent: 28 July 2022 17:56
> > 
> > On Thu, 28 Jul 2022 09:31:04 -0700 Martin KaFai Lau wrote:
> > > If I understand the concern correctly, it may not be straightforward to
> > > grasp the reason behind the checks in in_bpf() [ the in_task() and
> > > the current->bpf_ctx test ]?  Yes, it is a valid point.
> > >
> > > The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
> > > helper and should be easier to reason about.
> > 
> > I think we're saying opposite things. in_bpf(), the context-checking
> > function, is fine. There is a clear parallel to in_task(), and combined
> > with the capability check it should be pretty obvious what the code
> > is intending to achieve.
> > 
> > sockptr_t::in_bpf, which randomly implies that the lock is already held,
> > will be hard to understand for anyone not intimately familiar with the
> > BPF code. Naming that bit is_locked seems much clearer.
> > 
> > Which is what I believe Stan was proposing.
> 
> Or make sk_setsockopt() be called after the integer value
> has been read and with the lock held.
> 
> That saves any (horrid) conditional locking.
> 
> Also sockptr_t should probably have been a structure with separate
> user and kernel address fields.
> Putting the length in there would (probably) save code.
> 
> There then might be scope for pre-copying short user buffers
> into a kernel buffer while still allowing the requests that
> ignore the length copy directly from a user buffer.
Some optnames take their own lock, e.g. some in do_tcp_setsockopt.
Those will need to be broken down into their own locked and unlocked
functions.  This applies not only to setsockopt but also to the future
[g]etsockopt() refactoring, where most optnames are not under one
lock_sock() and each optname could take or release the lock on its own.

imo, this is unnecessary code churn for long switch statements like
setsockopt, without a clear benefit.  While the patch is not the first
use of conditional locking in the kernel, I would like to hear what
others think about doing this in a helper like lock_sock_sockopt() for
set/getsockopt(), e.g. the sketch below.
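
e.g. (sketch only, assuming the is_locked bit stays in sockptr_t):

	static inline void lock_sock_sockopt(struct sock *sk, sockptr_t optval)
	{
		if (!optval.is_locked)
			lock_sock(sk);
	}

	static inline void release_sock_sockopt(struct sock *sk, sockptr_t optval)
	{
		if (!optval.is_locked)
			release_sock(sk);
	}

so the conditional locking lives in exactly one place for the
set/getsockopt paths.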

With the in_bpf() helper suggested by Stan, is_locked could instead be
passed as an additional argument.  Then there is no need to change
sockptr_t, and sockptr_t keeps carrying only the optval itself.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2022-07-29 19:08 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-27  6:08 [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt() Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 01/14] net: Change sock_setsockopt from taking sock ptr to sk ptr Martin KaFai Lau
2022-07-27  8:11   ` David Laight
2022-07-27 20:42     ` Martin KaFai Lau
2022-07-27  8:16   ` Eric Dumazet
2022-07-27 18:50     ` Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf Martin KaFai Lau
2022-07-27  8:36   ` David Laight
2022-07-27 20:05     ` Martin KaFai Lau
2022-07-27 16:47   ` sdf
2022-07-27 18:37     ` Martin KaFai Lau
2022-07-27 20:39       ` Stanislav Fomichev
2022-07-27 21:21         ` Martin KaFai Lau
2022-07-27 21:38           ` Stanislav Fomichev
2022-07-28  0:45             ` Martin KaFai Lau
2022-07-28  1:49               ` Jakub Kicinski
2022-07-28 16:31                 ` Martin KaFai Lau
2022-07-28 16:56                   ` Jakub Kicinski
2022-07-28 17:20                     ` Martin KaFai Lau
2022-07-28 17:40                       ` Jakub Kicinski
2022-07-29 10:04                     ` David Laight
2022-07-29 19:06                       ` Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 03/14] bpf: net: Consider optval.is_bpf before capable check in sock_setsockopt() Martin KaFai Lau
2022-07-27 16:54   ` sdf
2022-07-27 18:47     ` Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 04/14] bpf: net: Avoid do_tcp_setsockopt() taking sk lock when called from bpf Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 05/14] bpf: net: Avoid do_ip_setsockopt() " Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 06/14] bpf: net: Avoid do_ipv6_setsockopt() " Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 07/14] bpf: Embed kernel CONFIG check into the if statement in bpf_setsockopt Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 08/14] bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sock_setsockopt() Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 09/14] bpf: Refactor bpf specific tcp optnames to a new function Martin KaFai Lau
2022-07-27  6:09 ` [PATCH bpf-next 10/14] bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt() Martin KaFai Lau
2022-07-27  6:10 ` [PATCH bpf-next 11/14] bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt() Martin KaFai Lau
2022-07-27  6:10 ` [PATCH bpf-next 12/14] bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt() Martin KaFai Lau
2022-07-27  6:10 ` [PATCH bpf-next 13/14] bpf: Add a few optnames to bpf_setsockopt Martin KaFai Lau
2022-07-27  6:10 ` [PATCH bpf-next 14/14] selftests/bpf: bpf_setsockopt tests Martin KaFai Lau
2022-07-27 17:14 ` [PATCH bpf-next 00/14] bpf: net: Remove duplicated codes from bpf_setsockopt() Jakub Kicinski
2022-07-27 20:42   ` Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).