* [PATCH v2 bpf-next 00/15] bpf: net: Remove duplicated code from bpf_setsockopt()
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

The code in bpf_setsockopt() is mostly a copy-and-paste from
sock_setsockopt(), do_tcp_setsockopt(), do_ipv6_setsockopt(),
and do_ip_setsockopt().  As the list of allowed optnames in
bpf_setsockopt() grows, so does the duplicated code.  The code
between the copies has also slowly drifted.

This set is an effort to clean this up and reuse the existing
{sock,do_tcp,do_ipv6,do_ip}_setsockopt() as much as possible.

After the clean up, this set also adds a few needed optnames
to bpf_setsockopt().

The initial attempt was to clean up both bpf_setsockopt() and
bpf_getsockopt() together.  However, the patch set was getting
too long, so bpf_getsockopt() is left for another patch set.
Thus, this set focuses on bpf_setsockopt().
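
For context, a minimal sketch of a bpf prog that reaches this code
path (illustrative only, not part of this set; sockops is just one
of several attach points that can call bpf_setsockopt(), and the
SOL_SOCKET/SO_KEEPALIVE values are copied from the uapi headers):

  /* bpf prog side, compiled with clang -target bpf */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  #define SOL_SOCKET	1	/* from the uapi headers */
  #define SO_KEEPALIVE	9

  SEC("sockops")
  int set_opts(struct bpf_sock_ops *skops)
  {
          int one = 1;

          /* routed to the kernel's __bpf_setsockopt() */
          bpf_setsockopt(skops, SOL_SOCKET, SO_KEEPALIVE,
                         &one, sizeof(one));
          return 1;
  }

  char _license[] SEC("license") = "GPL";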

v2:
- A major change is to use in_bpf() to test if a setsockopt()
  is called by a bpf prog and to skip the capable() check in
  that case.  Suggested by Stanislav.
- Instead of passing is_locked through sockptr_t or through an extra
  argument to sk_setsockopt, v2 uses in_bpf() to also skip the
  lock_sock() because the bpf prog already has the lock acquired.
- No change to the current sockptr_t in this revision
- s/codes/code/

Martin KaFai Lau (15):
  net: Add sk_setsockopt() to take the sk ptr instead of the sock ptr
  bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  bpf: net: Consider in_bpf() when testing capable() in sk_setsockopt()
  bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock()
    and capable()
  bpf: net: Change do_ip_setsockopt() to use the sockopt's lock_sock()
    and capable()
  bpf: net: Change do_ipv6_setsockopt() to use the sockopt's lock_sock()
    and capable()
  bpf: Initialize the bpf_run_ctx in bpf_iter_run_prog()
  bpf: Embed kernel CONFIG check into the if statement in bpf_setsockopt
  bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()
  bpf: Refactor bpf specific tcp optnames to a new function
  bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
  bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()
  bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt()
  bpf: Add a few optnames to bpf_setsockopt
  selftests/bpf: bpf_setsockopt tests

 include/linux/bpf.h                           |   8 +
 include/net/ip.h                              |   2 +
 include/net/ipv6.h                            |   2 +
 include/net/ipv6_stubs.h                      |   2 +
 include/net/sock.h                            |   7 +
 include/net/tcp.h                             |   2 +
 kernel/bpf/bpf_iter.c                         |   5 +
 net/core/filter.c                             | 377 +++++-------
 net/core/sock.c                               |  77 ++-
 net/ipv4/ip_sockglue.c                        |  16 +-
 net/ipv4/tcp.c                                |  22 +-
 net/ipv6/af_inet6.c                           |   1 +
 net/ipv6/ipv6_sockglue.c                      |  18 +-
 .../selftests/bpf/prog_tests/setget_sockopt.c | 125 ++++
 .../selftests/bpf/progs/setget_sockopt.c      | 547 ++++++++++++++++++
 15 files changed, 939 insertions(+), 272 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
 create mode 100644 tools/testing/selftests/bpf/progs/setget_sockopt.c

-- 
2.30.2



* [PATCH v2 bpf-next 01/15] net: Add sk_setsockopt() to take the sk ptr instead of the sock ptr
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

A later patch refactors bpf_setsockopt(SOL_SOCKET) to reuse
sock_setsockopt() in order to avoid code duplication and code
drift between the two copies.

The current sock_setsockopt() takes a sock ptr as the argument.
The very first thing this function does is get the sk ptr
by 'sk = sock->sk'.

bpf_setsockopt() could be called when the sk does not have
the sock ptr created, meaning sk->sk_socket is NULL.  For example,
when a passive tcp connection has just been established but has
not yet been accept()-ed.  Thus, it cannot use
sock_setsockopt(sk->sk_socket) or else it would pass a NULL ptr.

This patch moves the whole sock_setsockopt implementation to the
newly added sk_setsockopt().  The new sk_setsockopt() takes a sk ptr
and immediately gets the sock ptr by 'sock = sk->sk_socket'.

The existing sock_setsockopt(sock) is changed to call
sk_setsockopt(sock->sk).  All existing callers have both the
sock->sk and sk->sk_socket pointers.

A later patch will make bpf_setsockopt(SOL_SOCKET) call
sk_setsockopt(sk) directly.  bpf_setsockopt(SOL_SOCKET) does
not use the optnames that require sk->sk_socket, so it will
be safe.
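
To make the resulting call shapes concrete, a hedged sketch
(illustrative only):

  /* existing user/kernel paths: sock and sock->sk both exist */
  err = sock_setsockopt(sock, SOL_SOCKET, optname, optval, optlen);

  /* bpf path (wired up in a later patch): sk->sk_socket may be
   * NULL, e.g. an established but not yet accept()-ed tcp sk
   */
  err = sk_setsockopt(sk, SOL_SOCKET, optname, optval, optlen);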

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/sock.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 4cb957d934a2..20269c37ab3b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1041,12 +1041,12 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
  *	at the socket level. Everything here is generic.
  */
 
-int sock_setsockopt(struct socket *sock, int level, int optname,
-		    sockptr_t optval, unsigned int optlen)
+static int sk_setsockopt(struct sock *sk, int level, int optname,
+			 sockptr_t optval, unsigned int optlen)
 {
 	struct so_timestamping timestamping;
+	struct socket *sock = sk->sk_socket;
 	struct sock_txtime sk_txtime;
-	struct sock *sk = sock->sk;
 	int val;
 	int valbool;
 	struct linger ling;
@@ -1499,6 +1499,13 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	release_sock(sk);
 	return ret;
 }
+
+int sock_setsockopt(struct socket *sock, int level, int optname,
+		    sockptr_t optval, unsigned int optlen)
+{
+	return sk_setsockopt(sock->sk, level, optname,
+			     optval, optlen);
+}
 EXPORT_SYMBOL(sock_setsockopt);
 
 static const struct cred *sk_get_peer_cred(struct sock *sk)
-- 
2.30.2



* [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
sk_setsockopt().  The number of supported optnames keeps
increasing, and so does the duplicated code.

One issue in reusing sk_setsockopt() is that the bpf prog
has already acquired the sk lock.  This patch adds an in_bpf()
helper to tell if sk_setsockopt() is called from a bpf prog.
A bpf prog calling bpf_setsockopt() is running either in_task()
or in_serving_softirq().  Both cases have current->bpf_ctx
initialized.  Thus, in_bpf() only needs to test !!current->bpf_ctx.
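
For reference, current->bpf_ctx is set because the prog runners
bracket each invocation with a run context, roughly like this
(patch 7 adds the same bracketing to bpf_iter_run_prog()):

  struct bpf_run_ctx run_ctx, *old_run_ctx;

  old_run_ctx = bpf_set_run_ctx(&run_ctx); /* current->bpf_ctx = &run_ctx */
  ret = bpf_prog_run(prog, ctx);
  bpf_reset_run_ctx(old_run_ctx);          /* restore the previous ctx */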

This patch also adds sockopt_{lock,release}_sock() helpers
for sk_setsockopt() to use.  These helpers test in_bpf()
before acquiring/releasing the lock.  They are exported with
EXPORT_SYMBOL for the ipv6 module to use in a later patch.

Note the change in sock_setbindtodevice(): sockopt_lock_sock()
is now done in sock_setbindtodevice() itself instead of having
sock_bindtoindex(..., lock_sk = true) do the lock_sock().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h |  8 ++++++++
 include/net/sock.h  |  3 +++
 net/core/sock.c     | 26 +++++++++++++++++++++++---
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 20c26aed7896..b905b1b34fe4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
 	return !sysctl_unprivileged_bpf_disabled;
 }
 
+static inline bool in_bpf(void)
+{
+	return !!current->bpf_ctx;
+}
 #else /* !CONFIG_BPF_SYSCALL */
 static inline struct bpf_prog *bpf_prog_get(u32 ufd)
 {
@@ -2175,6 +2179,10 @@ static inline bool unprivileged_ebpf_enabled(void)
 	return false;
 }
 
+static inline bool in_bpf(void)
+{
+	return false;
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
diff --git a/include/net/sock.h b/include/net/sock.h
index a7273b289188..b2ff230860c6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1721,6 +1721,9 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
 	}
 }
 
+void sockopt_lock_sock(struct sock *sk);
+void sockopt_release_sock(struct sock *sk);
+
 /* Used by processes to "lock" a socket state, so that
  * interrupts and bottom half handlers won't change it
  * from under us. It essentially blocks any incoming
diff --git a/net/core/sock.c b/net/core/sock.c
index 20269c37ab3b..82759540ae2c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -703,7 +703,9 @@ static int sock_setbindtodevice(struct sock *sk, sockptr_t optval, int optlen)
 			goto out;
 	}
 
-	return sock_bindtoindex(sk, index, true);
+	sockopt_lock_sock(sk);
+	ret = sock_bindtoindex_locked(sk, index);
+	sockopt_release_sock(sk);
 out:
 #endif
 
@@ -1036,6 +1038,24 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	return 0;
 }
 
+void sockopt_lock_sock(struct sock *sk)
+{
+	if (in_bpf())
+		return;
+
+	lock_sock(sk);
+}
+EXPORT_SYMBOL(sockopt_lock_sock);
+
+void sockopt_release_sock(struct sock *sk)
+{
+	if (in_bpf())
+		return;
+
+	release_sock(sk);
+}
+EXPORT_SYMBOL(sockopt_release_sock);
+
 /*
  *	This is meant for all protocols to use and covers goings on
  *	at the socket level. Everything here is generic.
@@ -1067,7 +1087,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 
 	valbool = val ? 1 : 0;
 
-	lock_sock(sk);
+	sockopt_lock_sock(sk);
 
 	switch (optname) {
 	case SO_DEBUG:
@@ -1496,7 +1516,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 		ret = -ENOPROTOOPT;
 		break;
 	}
-	release_sock(sk);
+	sockopt_release_sock(sk);
 	return ret;
 }
 
-- 
2.30.2



* [PATCH v2 bpf-next 03/15] bpf: net: Consider in_bpf() when testing capable() in sk_setsockopt()
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

When a bpf program calls bpf_setsockopt(SOL_SOCKET),
it could be running in softirq, where it does not make sense to do
the capable() check.  There was a similar situation in
bpf_setsockopt(TCP_CONGESTION).
In commit 8d650cdedaab ("tcp: fix tcp_set_congestion_control() use from bpf hook"),
tcp_set_congestion_control(..., cap_net_admin) was added to skip
the cap check for a bpf prog.

This patch adds sockopt_ns_capable() and sockopt_capable() for
sk_setsockopt() to use.  They check in_bpf() before doing
the ns_capable() and capable() tests.  They are exported with
EXPORT_SYMBOL for the ipv6 module to use in a later patch.
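
Concretely, a plain capable() test cannot be used here because a
bpf prog may run in softirq, where 'current' is whatever task
happened to be interrupted (illustrative sketch):

  /* softirq: this checks an unrelated task's credentials */
  if (!capable(CAP_NET_ADMIN))
          return -EPERM;

  /* with this patch: short-circuits when in_bpf() is true */
  if (!sockopt_capable(CAP_NET_ADMIN))
          return -EPERM;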

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/sock.h |  2 ++
 net/core/sock.c    | 38 +++++++++++++++++++++++++-------------
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b2ff230860c6..72b78c2b6f83 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1723,6 +1723,8 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
 
 void sockopt_lock_sock(struct sock *sk);
 void sockopt_release_sock(struct sock *sk);
+bool sockopt_ns_capable(struct user_namespace *ns, int cap);
+bool sockopt_capable(int cap);
 
 /* Used by processes to "lock" a socket state, so that
  * interrupts and bottom half handlers won't change it
diff --git a/net/core/sock.c b/net/core/sock.c
index 82759540ae2c..2d88c06c27b7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1056,6 +1056,18 @@ void sockopt_release_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(sockopt_release_sock);
 
+bool sockopt_ns_capable(struct user_namespace *ns, int cap)
+{
+	return in_bpf() || ns_capable(ns, cap);
+}
+EXPORT_SYMBOL(sockopt_ns_capable);
+
+bool sockopt_capable(int cap)
+{
+	return in_bpf() || capable(cap);
+}
+EXPORT_SYMBOL(sockopt_capable);
+
 /*
  *	This is meant for all protocols to use and covers goings on
  *	at the socket level. Everything here is generic.
@@ -1091,7 +1103,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 
 	switch (optname) {
 	case SO_DEBUG:
-		if (val && !capable(CAP_NET_ADMIN))
+		if (val && !sockopt_capable(CAP_NET_ADMIN))
 			ret = -EACCES;
 		else
 			sock_valbool_flag(sk, SOCK_DBG, valbool);
@@ -1135,7 +1147,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case SO_SNDBUFFORCE:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!sockopt_capable(CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -1157,7 +1169,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case SO_RCVBUFFORCE:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!sockopt_capable(CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -1184,8 +1196,8 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 
 	case SO_PRIORITY:
 		if ((val >= 0 && val <= 6) ||
-		    ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) ||
-		    ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+		    sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) ||
+		    sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 			sk->sk_priority = val;
 		else
 			ret = -EPERM;
@@ -1330,8 +1342,8 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 			clear_bit(SOCK_PASSSEC, &sock->flags);
 		break;
 	case SO_MARK:
-		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
-		    !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+		if (!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
+		    !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -1339,8 +1351,8 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 		__sock_set_mark(sk, val);
 		break;
 	case SO_RCVMARK:
-		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
-		    !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+		if (!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
+		    !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -1374,7 +1386,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	case SO_BUSY_POLL:
 		/* allow unprivileged users to decrease the value */
-		if ((val > sk->sk_ll_usec) && !capable(CAP_NET_ADMIN))
+		if ((val > sk->sk_ll_usec) && !sockopt_capable(CAP_NET_ADMIN))
 			ret = -EPERM;
 		else {
 			if (val < 0)
@@ -1384,13 +1396,13 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 		}
 		break;
 	case SO_PREFER_BUSY_POLL:
-		if (valbool && !capable(CAP_NET_ADMIN))
+		if (valbool && !sockopt_capable(CAP_NET_ADMIN))
 			ret = -EPERM;
 		else
 			WRITE_ONCE(sk->sk_prefer_busy_poll, valbool);
 		break;
 	case SO_BUSY_POLL_BUDGET:
-		if (val > READ_ONCE(sk->sk_busy_poll_budget) && !capable(CAP_NET_ADMIN)) {
+		if (val > READ_ONCE(sk->sk_busy_poll_budget) && !sockopt_capable(CAP_NET_ADMIN)) {
 			ret = -EPERM;
 		} else {
 			if (val < 0 || val > U16_MAX)
@@ -1461,7 +1473,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
 		 * scheduler has enough safe guards.
 		 */
 		if (sk_txtime.clockid != CLOCK_MONOTONIC &&
-		    !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+		    !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
-- 
2.30.2



* [PATCH v2 bpf-next 04/15] bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

Similar to the earlier patch that avoids having sk_setsockopt()
take the sk lock and do the capable() test when called by bpf,
this patch changes do_tcp_setsockopt() to use the
sockopt_{lock,release}_sock() and sockopt_[ns_]capable() helpers.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/tcp.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 970e9a2cca4a..cfed84b1883f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3202,7 +3202,7 @@ EXPORT_SYMBOL(tcp_disconnect);
 
 static inline bool tcp_can_repair_sock(const struct sock *sk)
 {
-	return ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) &&
+	return sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) &&
 		(sk->sk_state != TCP_LISTEN);
 }
 
@@ -3502,11 +3502,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 			return -EFAULT;
 		name[val] = 0;
 
-		lock_sock(sk);
+		sockopt_lock_sock(sk);
 		err = tcp_set_congestion_control(sk, name, true,
-						 ns_capable(sock_net(sk)->user_ns,
-							    CAP_NET_ADMIN));
-		release_sock(sk);
+						 sockopt_ns_capable(sock_net(sk)->user_ns,
+								    CAP_NET_ADMIN));
+		sockopt_release_sock(sk);
 		return err;
 	}
 	case TCP_ULP: {
@@ -3522,9 +3522,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 			return -EFAULT;
 		name[val] = 0;
 
-		lock_sock(sk);
+		sockopt_lock_sock(sk);
 		err = tcp_set_ulp(sk, name);
-		release_sock(sk);
+		sockopt_release_sock(sk);
 		return err;
 	}
 	case TCP_FASTOPEN_KEY: {
@@ -3557,7 +3557,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 	if (copy_from_sockptr(&val, optval, sizeof(val)))
 		return -EFAULT;
 
-	lock_sock(sk);
+	sockopt_lock_sock(sk);
 
 	switch (optname) {
 	case TCP_MAXSEG:
@@ -3779,7 +3779,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 		break;
 	}
 
-	release_sock(sk);
+	sockopt_release_sock(sk);
 	return err;
 }
 
-- 
2.30.2



* [PATCH v2 bpf-next 05/15] bpf: net: Change do_ip_setsockopt() to use the sockopt's lock_sock() and capable()
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

Similar to the earlier patch that avoids having sk_setsockopt()
take the sk lock and do the capable() test when called by bpf,
this patch changes do_ip_setsockopt() to use the
sockopt_{lock,release}_sock() and sockopt_[ns_]capable() helpers.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/ip_sockglue.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index a8a323ecbb54..a3c496580e6b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -944,7 +944,7 @@ static int do_ip_setsockopt(struct sock *sk, int level, int optname,
 	err = 0;
 	if (needs_rtnl)
 		rtnl_lock();
-	lock_sock(sk);
+	sockopt_lock_sock(sk);
 
 	switch (optname) {
 	case IP_OPTIONS:
@@ -1333,14 +1333,14 @@ static int do_ip_setsockopt(struct sock *sk, int level, int optname,
 	case IP_IPSEC_POLICY:
 	case IP_XFRM_POLICY:
 		err = -EPERM;
-		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+		if (!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 			break;
 		err = xfrm_user_policy(sk, optname, optval, optlen);
 		break;
 
 	case IP_TRANSPARENT:
-		if (!!val && !ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
-		    !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+		if (!!val && !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
+		    !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
 			err = -EPERM;
 			break;
 		}
@@ -1368,13 +1368,13 @@ static int do_ip_setsockopt(struct sock *sk, int level, int optname,
 		err = -ENOPROTOOPT;
 		break;
 	}
-	release_sock(sk);
+	sockopt_release_sock(sk);
 	if (needs_rtnl)
 		rtnl_unlock();
 	return err;
 
 e_inval:
-	release_sock(sk);
+	sockopt_release_sock(sk);
 	if (needs_rtnl)
 		rtnl_unlock();
 	return -EINVAL;
-- 
2.30.2



* [PATCH v2 bpf-next 06/15] bpf: net: Change do_ipv6_setsockopt() to use the sockopt's lock_sock() and capable()
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

Similar to the earlier patch that avoids having sk_setsockopt()
take the sk lock and do the capable() test when called by bpf,
this patch changes do_ipv6_setsockopt() to use the
sockopt_{lock,release}_sock() and sockopt_[ns_]capable() helpers.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv6/ipv6_sockglue.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 222f6bf220ba..d23de48ff612 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -327,7 +327,7 @@ static int ipv6_set_opt_hdr(struct sock *sk, int optname, sockptr_t optval,
 	int err;
 
 	/* hop-by-hop / destination options are privileged option */
-	if (optname != IPV6_RTHDR && !ns_capable(net->user_ns, CAP_NET_RAW))
+	if (optname != IPV6_RTHDR && !sockopt_ns_capable(net->user_ns, CAP_NET_RAW))
 		return -EPERM;
 
 	/* remove any sticky options header with a zero option
@@ -417,7 +417,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 
 	if (needs_rtnl)
 		rtnl_lock();
-	lock_sock(sk);
+	sockopt_lock_sock(sk);
 
 	switch (optname) {
 
@@ -634,8 +634,8 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case IPV6_TRANSPARENT:
-		if (valbool && !ns_capable(net->user_ns, CAP_NET_RAW) &&
-		    !ns_capable(net->user_ns, CAP_NET_ADMIN)) {
+		if (valbool && !sockopt_ns_capable(net->user_ns, CAP_NET_RAW) &&
+		    !sockopt_ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			retv = -EPERM;
 			break;
 		}
@@ -946,7 +946,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 	case IPV6_IPSEC_POLICY:
 	case IPV6_XFRM_POLICY:
 		retv = -EPERM;
-		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+		if (!sockopt_ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		retv = xfrm_user_policy(sk, optname, optval, optlen);
 		break;
@@ -994,14 +994,14 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		break;
 	}
 
-	release_sock(sk);
+	sockopt_release_sock(sk);
 	if (needs_rtnl)
 		rtnl_unlock();
 
 	return retv;
 
 e_inval:
-	release_sock(sk);
+	sockopt_release_sock(sk);
 	if (needs_rtnl)
 		rtnl_unlock();
 	return -EINVAL;
-- 
2.30.2



* [PATCH v2 bpf-next 07/15] bpf: Initialize the bpf_run_ctx in bpf_iter_run_prog()
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

A bpf-iter prog for tcp and unix sk can call bpf_setsockopt(),
which needs in_bpf() to decide whether it is called by a bpf prog.
This patch initializes the bpf_run_ctx in bpf_iter_run_prog()
for in_bpf() to use.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/bpf_iter.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 7e8fd49406f6..5d3a982eb553 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -681,19 +681,24 @@ struct bpf_prog *bpf_iter_get_info(struct bpf_iter_meta *meta, bool in_stop)
 
 int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
 {
+	struct bpf_run_ctx run_ctx, *old_run_ctx;
 	int ret;
 
 	if (prog->aux->sleepable) {
 		rcu_read_lock_trace();
 		migrate_disable();
 		might_fault();
+		old_run_ctx = bpf_set_run_ctx(&run_ctx);
 		ret = bpf_prog_run(prog, ctx);
+		bpf_reset_run_ctx(old_run_ctx);
 		migrate_enable();
 		rcu_read_unlock_trace();
 	} else {
 		rcu_read_lock();
 		migrate_disable();
+		old_run_ctx = bpf_set_run_ctx(&run_ctx);
 		ret = bpf_prog_run(prog, ctx);
+		bpf_reset_run_ctx(old_run_ctx);
 		migrate_enable();
 		rcu_read_unlock();
 	}
-- 
2.30.2



* [PATCH v2 bpf-next 08/15] bpf: Embed kernel CONFIG check into the if statement in bpf_setsockopt
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

This patch moves the "#ifdef CONFIG_XXX" check into the "if/else"
statement itself.  The change is done for the bpf_setsockopt()
function only.  It will make the later patches easier to follow
without the surrounding ifdef macros.
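
The pattern in general form (IS_ENABLED() expands to a constant
0/1 for both =y and =m, so the compiler discards the dead branch
while still parsing and type-checking it):

  if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6) {
          /* compiled out when CONFIG_IPV6 is not set; the
           * branch must still build, which is why the ipv6
           * calls go through stubs in a later patch
           */
  }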

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 5669248aff25..01cb4a01b1c1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5113,8 +5113,7 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		default:
 			ret = -EINVAL;
 		}
-#ifdef CONFIG_INET
-	} else if (level == SOL_IP) {
+	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
 		if (optlen != sizeof(int) || sk->sk_family != AF_INET)
 			return -EINVAL;
 
@@ -5135,8 +5134,7 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		default:
 			ret = -EINVAL;
 		}
-#if IS_ENABLED(CONFIG_IPV6)
-	} else if (level == SOL_IPV6) {
+	} else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6) {
 		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
 			return -EINVAL;
 
@@ -5157,8 +5155,7 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		default:
 			ret = -EINVAL;
 		}
-#endif
-	} else if (level == SOL_TCP &&
+	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
 		   sk->sk_prot->setsockopt == tcp_setsockopt) {
 		if (optname == TCP_CONGESTION) {
 			char name[TCP_CA_NAME_MAX];
@@ -5250,7 +5247,6 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 				ret = -EINVAL;
 			}
 		}
-#endif
 	} else {
 		ret = -EINVAL;
 	}
-- 
2.30.2



* [PATCH v2 bpf-next 09/15] bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()
From: Martin KaFai Lau @ 2022-08-03 20:46 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

After the prep work in the previous patches,
this patch removes most of the duplicated code from
bpf_setsockopt(SOL_SOCKET) and reuses sk_setsockopt() instead.

The only exception is SO_RCVLOWAT.  bpf_setsockopt() does not
always have the sock ptr (sk->sk_socket), and sk_setsockopt()
needs sock->ops->set_rcvlowat.  tcp_set_rcvlowat() is the only
implementation of set_rcvlowat, so bpf_setsockopt() only needs
one special handling of SO_RCVLOWAT for tcp_sock.  Leave
SO_RCVLOWAT in bpf_setsockopt() for now.
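
For reference, the sk_setsockopt() handling of SO_RCVLOWAT looks
roughly like this, which is why it cannot be reused while the sock
ptr may be NULL:

  case SO_RCVLOWAT:
          if (val < 0)
                  val = INT_MAX;
          if (sock->ops->set_rcvlowat)    /* sock may be NULL from bpf */
                  ret = sock->ops->set_rcvlowat(sk, val);
          else
                  WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
          break;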

The existing optname white-list is refactored into a new
function sol_socket_setsockopt().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/sock.h |   2 +
 net/core/filter.c  | 130 +++++++++++++--------------------------------
 net/core/sock.c    |   4 +-
 3 files changed, 42 insertions(+), 94 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 72b78c2b6f83..b7e159f9d7bf 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1800,6 +1800,8 @@ void sock_pfree(struct sk_buff *skb);
 #define sock_edemux sock_efree
 #endif
 
+int sk_setsockopt(struct sock *sk, int level, int optname,
+		  sockptr_t optval, unsigned int optlen);
 int sock_setsockopt(struct socket *sock, int level, int op,
 		    sockptr_t optval, unsigned int optlen);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 01cb4a01b1c1..d1f9f8360f60 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5013,106 +5013,52 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
+static int sol_socket_setsockopt(struct sock *sk, int optname,
+				 char *optval, int optlen)
+{
+	switch (optname) {
+	case SO_SNDBUF:
+	case SO_RCVBUF:
+	case SO_KEEPALIVE:
+	case SO_PRIORITY:
+	case SO_REUSEPORT:
+	case SO_RCVLOWAT:
+	case SO_MARK:
+	case SO_MAX_PACING_RATE:
+	case SO_BINDTOIFINDEX:
+	case SO_TXREHASH:
+		if (optlen != sizeof(int))
+			return -EINVAL;
+		break;
+	case SO_BINDTODEVICE:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (optname == SO_RCVLOWAT) {
+		int val = *(int *)optval;
+
+		if (val < 0)
+			val = INT_MAX;
+		WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
+		return 0;
+	}
+
+	return sk_setsockopt(sk, SOL_SOCKET, optname,
+			     KERNEL_SOCKPTR(optval), optlen);
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
-	char devname[IFNAMSIZ];
-	int val, valbool;
-	struct net *net;
-	int ifindex;
-	int ret = 0;
+	int val, ret = 0;
 
 	if (!sk_fullsock(sk))
 		return -EINVAL;
 
 	if (level == SOL_SOCKET) {
-		if (optlen != sizeof(int) && optname != SO_BINDTODEVICE)
-			return -EINVAL;
-		val = *((int *)optval);
-		valbool = val ? 1 : 0;
-
-		/* Only some socketops are supported */
-		switch (optname) {
-		case SO_RCVBUF:
-			val = min_t(u32, val, sysctl_rmem_max);
-			val = min_t(int, val, INT_MAX / 2);
-			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-			WRITE_ONCE(sk->sk_rcvbuf,
-				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
-			break;
-		case SO_SNDBUF:
-			val = min_t(u32, val, sysctl_wmem_max);
-			val = min_t(int, val, INT_MAX / 2);
-			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
-			WRITE_ONCE(sk->sk_sndbuf,
-				   max_t(int, val * 2, SOCK_MIN_SNDBUF));
-			break;
-		case SO_MAX_PACING_RATE: /* 32bit version */
-			if (val != ~0U)
-				cmpxchg(&sk->sk_pacing_status,
-					SK_PACING_NONE,
-					SK_PACING_NEEDED);
-			sk->sk_max_pacing_rate = (val == ~0U) ?
-						 ~0UL : (unsigned int)val;
-			sk->sk_pacing_rate = min(sk->sk_pacing_rate,
-						 sk->sk_max_pacing_rate);
-			break;
-		case SO_PRIORITY:
-			sk->sk_priority = val;
-			break;
-		case SO_RCVLOWAT:
-			if (val < 0)
-				val = INT_MAX;
-			WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
-			break;
-		case SO_MARK:
-			if (sk->sk_mark != val) {
-				sk->sk_mark = val;
-				sk_dst_reset(sk);
-			}
-			break;
-		case SO_BINDTODEVICE:
-			optlen = min_t(long, optlen, IFNAMSIZ - 1);
-			strncpy(devname, optval, optlen);
-			devname[optlen] = 0;
-
-			ifindex = 0;
-			if (devname[0] != '\0') {
-				struct net_device *dev;
-
-				ret = -ENODEV;
-
-				net = sock_net(sk);
-				dev = dev_get_by_name(net, devname);
-				if (!dev)
-					break;
-				ifindex = dev->ifindex;
-				dev_put(dev);
-			}
-			fallthrough;
-		case SO_BINDTOIFINDEX:
-			if (optname == SO_BINDTOIFINDEX)
-				ifindex = val;
-			ret = sock_bindtoindex(sk, ifindex, false);
-			break;
-		case SO_KEEPALIVE:
-			if (sk->sk_prot->keepalive)
-				sk->sk_prot->keepalive(sk, valbool);
-			sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
-			break;
-		case SO_REUSEPORT:
-			sk->sk_reuseport = valbool;
-			break;
-		case SO_TXREHASH:
-			if (val < -1 || val > 1) {
-				ret = -EINVAL;
-				break;
-			}
-			sk->sk_txrehash = (u8)val;
-			break;
-		default:
-			ret = -EINVAL;
-		}
+		return sol_socket_setsockopt(sk, optname, optval, optlen);
 	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
 		if (optlen != sizeof(int) || sk->sk_family != AF_INET)
 			return -EINVAL;
diff --git a/net/core/sock.c b/net/core/sock.c
index 2d88c06c27b7..1ea320e09794 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1073,8 +1073,8 @@ EXPORT_SYMBOL(sockopt_capable);
  *	at the socket level. Everything here is generic.
  */
 
-static int sk_setsockopt(struct sock *sk, int level, int optname,
-			 sockptr_t optval, unsigned int optlen)
+int sk_setsockopt(struct sock *sk, int level, int optname,
+		  sockptr_t optval, unsigned int optlen)
 {
 	struct so_timestamping timestamping;
 	struct socket *sock = sk->sk_socket;
-- 
2.30.2



* [PATCH v2 bpf-next 10/15] bpf: Refactor bpf specific tcp optnames to a new function
From: Martin KaFai Lau @ 2022-08-03 20:47 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

This patch moves all bpf-specific tcp optnames (TCP_BPF_XXX)
to a new function, bpf_sol_tcp_setsockopt().  This will make
the next patch easier to follow.
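
The TCP_BPF_XXX optnames occupy a reserved numeric range above the
regular TCP optnames (TCP_BPF_IW is 1001 in the uapi), so a single
range check can route them to the new function:

  /* all bpf-only tcp optnames sit at or above TCP_BPF_IW */
  if (optname >= TCP_BPF_IW)
          return bpf_sol_tcp_setsockopt(sk, optname, optval, optlen);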

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 79 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 29 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d1f9f8360f60..200e79a1fbfd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5049,6 +5049,52 @@ static int sol_socket_setsockopt(struct sock *sk, int optname,
 			     KERNEL_SOCKPTR(optval), optlen);
 }
 
+static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
+				  char *optval, int optlen)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	unsigned long timeout;
+	int val;
+
+	if (optlen != sizeof(int))
+		return -EINVAL;
+
+	val = *(int *)optval;
+
+	/* Only some options are supported */
+	switch (optname) {
+	case TCP_BPF_IW:
+		if (val <= 0 || tp->data_segs_out > tp->syn_data)
+			return -EINVAL;
+		tcp_snd_cwnd_set(tp, val);
+		break;
+	case TCP_BPF_SNDCWND_CLAMP:
+		if (val <= 0)
+			return -EINVAL;
+		tp->snd_cwnd_clamp = val;
+		tp->snd_ssthresh = val;
+		break;
+	case TCP_BPF_DELACK_MAX:
+		timeout = usecs_to_jiffies(val);
+		if (timeout > TCP_DELACK_MAX ||
+		    timeout < TCP_TIMEOUT_MIN)
+			return -EINVAL;
+		inet_csk(sk)->icsk_delack_max = timeout;
+		break;
+	case TCP_BPF_RTO_MIN:
+		timeout = usecs_to_jiffies(val);
+		if (timeout > TCP_RTO_MIN ||
+		    timeout < TCP_TIMEOUT_MIN)
+			return -EINVAL;
+		inet_csk(sk)->icsk_rto_min = timeout;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
@@ -5103,6 +5149,10 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		}
 	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
 		   sk->sk_prot->setsockopt == tcp_setsockopt) {
+		if (optname >= TCP_BPF_IW)
+			return bpf_sol_tcp_setsockopt(sk, optname,
+						      optval, optlen);
+
 		if (optname == TCP_CONGESTION) {
 			char name[TCP_CA_NAME_MAX];
 
@@ -5113,7 +5163,6 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		} else {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 			struct tcp_sock *tp = tcp_sk(sk);
-			unsigned long timeout;
 
 			if (optlen != sizeof(int))
 				return -EINVAL;
@@ -5121,34 +5170,6 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			val = *((int *)optval);
 			/* Only some options are supported */
 			switch (optname) {
-			case TCP_BPF_IW:
-				if (val <= 0 || tp->data_segs_out > tp->syn_data)
-					ret = -EINVAL;
-				else
-					tcp_snd_cwnd_set(tp, val);
-				break;
-			case TCP_BPF_SNDCWND_CLAMP:
-				if (val <= 0) {
-					ret = -EINVAL;
-				} else {
-					tp->snd_cwnd_clamp = val;
-					tp->snd_ssthresh = val;
-				}
-				break;
-			case TCP_BPF_DELACK_MAX:
-				timeout = usecs_to_jiffies(val);
-				if (timeout > TCP_DELACK_MAX ||
-				    timeout < TCP_TIMEOUT_MIN)
-					return -EINVAL;
-				inet_csk(sk)->icsk_delack_max = timeout;
-				break;
-			case TCP_BPF_RTO_MIN:
-				timeout = usecs_to_jiffies(val);
-				if (timeout > TCP_RTO_MIN ||
-				    timeout < TCP_TIMEOUT_MIN)
-					return -EINVAL;
-				inet_csk(sk)->icsk_rto_min = timeout;
-				break;
 			case TCP_SAVE_SYN:
 				if (val < 0 || val > 1)
 					ret = -EINVAL;
-- 
2.30.2



* [PATCH v2 bpf-next 11/15] bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
From: Martin KaFai Lau @ 2022-08-03 20:47 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

After the prep work in the previous patches,
this patch removes all the duplicated code from
bpf_setsockopt(SOL_TCP) and reuses do_tcp_setsockopt().

The existing optname white-list is refactored into a new
function sol_tcp_setsockopt().  sol_tcp_setsockopt()
also calls bpf_sol_tcp_setsockopt() to handle
the TCP_BPF_XXX specific optnames.

bpf_setsockopt(TCP_SAVE_SYN) now also allows the value 2 to
save the eth header as well, and this comes for free from
do_tcp_setsockopt().
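
A hedged usage sketch from the bpf prog side (illustrative; 2 asks
do_tcp_setsockopt() to save the mac header as well):

  int save = 2;   /* 0: off, 1: save syn, 2: also save mac header */

  bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN, &save, sizeof(save));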

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/tcp.h |  2 +
 net/core/filter.c | 97 +++++++++++++++--------------------------------
 net/ipv4/tcp.c    |  4 +-
 3 files changed, 34 insertions(+), 69 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d10962b9f0d0..c03a50c72f40 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -405,6 +405,8 @@ __poll_t tcp_poll(struct file *file, struct socket *sock,
 int tcp_getsockopt(struct sock *sk, int level, int optname,
 		   char __user *optval, int __user *optlen);
 bool tcp_bpf_bypass_getsockopt(int level, int optname);
+int do_tcp_setsockopt(struct sock *sk, int level, int optname,
+		      sockptr_t optval, unsigned int optlen);
 int tcp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		   unsigned int optlen);
 void tcp_set_keepalive(struct sock *sk, int val);
diff --git a/net/core/filter.c b/net/core/filter.c
index 200e79a1fbfd..0c5361b8906d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5095,6 +5095,34 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
 	return 0;
 }
 
+static int sol_tcp_setsockopt(struct sock *sk, int optname,
+			      char *optval, int optlen)
+{
+	if (sk->sk_prot->setsockopt != tcp_setsockopt)
+		return -EINVAL;
+
+	switch (optname) {
+	case TCP_KEEPIDLE:
+	case TCP_KEEPINTVL:
+	case TCP_KEEPCNT:
+	case TCP_SYNCNT:
+	case TCP_WINDOW_CLAMP:
+	case TCP_USER_TIMEOUT:
+	case TCP_NOTSENT_LOWAT:
+	case TCP_SAVE_SYN:
+		if (optlen != sizeof(int))
+			return -EINVAL;
+		break;
+	case TCP_CONGESTION:
+		break;
+	default:
+		return bpf_sol_tcp_setsockopt(sk, optname, optval, optlen);
+	}
+
+	return do_tcp_setsockopt(sk, SOL_TCP, optname,
+				 KERNEL_SOCKPTR(optval), optlen);
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
@@ -5147,73 +5175,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 		default:
 			ret = -EINVAL;
 		}
-	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
-		   sk->sk_prot->setsockopt == tcp_setsockopt) {
-		if (optname >= TCP_BPF_IW)
-			return bpf_sol_tcp_setsockopt(sk, optname,
-						      optval, optlen);
-
-		if (optname == TCP_CONGESTION) {
-			char name[TCP_CA_NAME_MAX];
-
-			strncpy(name, optval, min_t(long, optlen,
-						    TCP_CA_NAME_MAX-1));
-			name[TCP_CA_NAME_MAX-1] = 0;
-			ret = tcp_set_congestion_control(sk, name, false, true);
-		} else {
-			struct inet_connection_sock *icsk = inet_csk(sk);
-			struct tcp_sock *tp = tcp_sk(sk);
-
-			if (optlen != sizeof(int))
-				return -EINVAL;
-
-			val = *((int *)optval);
-			/* Only some options are supported */
-			switch (optname) {
-			case TCP_SAVE_SYN:
-				if (val < 0 || val > 1)
-					ret = -EINVAL;
-				else
-					tp->save_syn = val;
-				break;
-			case TCP_KEEPIDLE:
-				ret = tcp_sock_set_keepidle_locked(sk, val);
-				break;
-			case TCP_KEEPINTVL:
-				if (val < 1 || val > MAX_TCP_KEEPINTVL)
-					ret = -EINVAL;
-				else
-					tp->keepalive_intvl = val * HZ;
-				break;
-			case TCP_KEEPCNT:
-				if (val < 1 || val > MAX_TCP_KEEPCNT)
-					ret = -EINVAL;
-				else
-					tp->keepalive_probes = val;
-				break;
-			case TCP_SYNCNT:
-				if (val < 1 || val > MAX_TCP_SYNCNT)
-					ret = -EINVAL;
-				else
-					icsk->icsk_syn_retries = val;
-				break;
-			case TCP_USER_TIMEOUT:
-				if (val < 0)
-					ret = -EINVAL;
-				else
-					icsk->icsk_user_timeout = val;
-				break;
-			case TCP_NOTSENT_LOWAT:
-				tp->notsent_lowat = val;
-				sk->sk_write_space(sk);
-				break;
-			case TCP_WINDOW_CLAMP:
-				ret = tcp_set_window_clamp(sk, val);
-				break;
-			default:
-				ret = -EINVAL;
-			}
-		}
+	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP) {
+		return sol_tcp_setsockopt(sk, optname, optval, optlen);
 	} else {
 		ret = -EINVAL;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cfed84b1883f..a6986f201f92 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3479,8 +3479,8 @@ int tcp_set_window_clamp(struct sock *sk, int val)
 /*
  *	Socket option code for TCP.
  */
-static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
-		sockptr_t optval, unsigned int optlen)
+int do_tcp_setsockopt(struct sock *sk, int level, int optname,
+		      sockptr_t optval, unsigned int optlen)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
-- 
2.30.2



* [PATCH v2 bpf-next 12/15] bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()
From: Martin KaFai Lau @ 2022-08-03 20:47 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

After the prep work in the previous patches,
this patch removes the duplicated code from bpf_setsockopt(SOL_IP)
and reuses the implementation in do_ip_setsockopt().

The existing optname white-list is refactored into a new
function sol_ip_setsockopt().

NOTE: the current bpf_setsockopt(IP_TOS) is quite different from
do_ip_setsockopt(IP_TOS).  For example, it does not take
the INET_ECN_MASK into account for tcp and also does not adjust
sk->sk_priority.  It looks like the current bpf_setsockopt(IP_TOS)
was referencing the IPV6_TCLASS implementation instead of IP_TOS.
This patch tries to rectify that by using do_ip_setsockopt(IP_TOS).
While this is a behavior change, the do_ip_setsockopt(IP_TOS)
behavior is arguably what the user expects.  At least, the
INET_ECN_MASK bits should be masked out for tcp.
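
For reference, the relevant do_ip_setsockopt(IP_TOS) behavior being
picked up looks roughly like this (from __ip_sock_set_tos()):

  if (sk->sk_type == SOCK_STREAM) {
          /* tcp: the ECN bits are owned by the stack */
          val &= ~INET_ECN_MASK;
          val |= inet_sk(sk)->tos & INET_ECN_MASK;
  }
  if (inet_sk(sk)->tos != val) {
          inet_sk(sk)->tos = val;
          sk->sk_priority = rt_tos2priority(val);
          sk_dst_reset(sk);
  }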

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/ip.h       |  2 ++
 net/core/filter.c      | 40 ++++++++++++++++++++--------------------
 net/ipv4/ip_sockglue.c |  4 ++--
 3 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 1c979fd1904c..34fa5b0f0a0e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -743,6 +743,8 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sock *sk,
 int ip_cmsg_send(struct sock *sk, struct msghdr *msg,
 		 struct ipcm_cookie *ipc, bool allow_ipv6);
 DECLARE_STATIC_KEY_FALSE(ip4_min_ttl);
+int do_ip_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
+		     unsigned int optlen);
 int ip_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		  unsigned int optlen);
 int ip_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
diff --git a/net/core/filter.c b/net/core/filter.c
index 0c5361b8906d..d236f71ff501 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5123,6 +5123,25 @@ static int sol_tcp_setsockopt(struct sock *sk, int optname,
 				 KERNEL_SOCKPTR(optval), optlen);
 }
 
+static int sol_ip_setsockopt(struct sock *sk, int optname,
+			     char *optval, int optlen)
+{
+	if (sk->sk_family != AF_INET)
+		return -EINVAL;
+
+	switch (optname) {
+	case IP_TOS:
+		if (optlen != sizeof(int))
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return do_ip_setsockopt(sk, SOL_IP, optname,
+				KERNEL_SOCKPTR(optval), optlen);
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
@@ -5134,26 +5153,7 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 	if (level == SOL_SOCKET) {
 		return sol_socket_setsockopt(sk, optname, optval, optlen);
 	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
-		if (optlen != sizeof(int) || sk->sk_family != AF_INET)
-			return -EINVAL;
-
-		val = *((int *)optval);
-		/* Only some options are supported */
-		switch (optname) {
-		case IP_TOS:
-			if (val < -1 || val > 0xff) {
-				ret = -EINVAL;
-			} else {
-				struct inet_sock *inet = inet_sk(sk);
-
-				if (val == -1)
-					val = 0;
-				inet->tos = val;
-			}
-			break;
-		default:
-			ret = -EINVAL;
-		}
+		return sol_ip_setsockopt(sk, optname, optval, optlen);
 	} else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6) {
 		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
 			return -EINVAL;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index a3c496580e6b..751fa69cb557 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -888,8 +888,8 @@ static int compat_ip_mcast_join_leave(struct sock *sk, int optname,
 
 DEFINE_STATIC_KEY_FALSE(ip4_min_ttl);
 
-static int do_ip_setsockopt(struct sock *sk, int level, int optname,
-		sockptr_t optval, unsigned int optlen)
+int do_ip_setsockopt(struct sock *sk, int level, int optname,
+		     sockptr_t optval, unsigned int optlen)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct net *net = sock_net(sk);
-- 
2.30.2



* [PATCH v2 bpf-next 13/15] bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt()
From: Martin KaFai Lau @ 2022-08-03 20:47 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

After the prep work in the previous patches,
this patch removes the duplicated code from bpf_setsockopt(SOL_IPV6)
and reuses the implementation in do_ipv6_setsockopt().

ipv6 could be compiled as a module.  As other code has solved
this with stubs in ipv6_stubs.h, this patch adds do_ipv6_setsockopt
to the ipv6_bpf_stub.
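
The stub pattern in miniature (illustrative; it mirrors what
ipv6_stubs.h already does for other symbols):

  /* builtin side: a function-pointer table, filled in when
   * the ipv6 module loads
   */
  extern const struct ipv6_bpf_stub *ipv6_bpf_stub;

  /* module side, registered from inet6_init(): */
  static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
          .ipv6_setsockopt = do_ipv6_setsockopt,
  };
  ipv6_bpf_stub = &ipv6_bpf_stub_impl;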

The current bpf_setsockopt(IPV6_TCLASS) does not take the
INET_ECN_MASK into account for tcp.
do_ipv6_setsockopt(IPV6_TCLASS) will handle it correctly.

The existing optname white-list is refactored into a new
function sol_ipv6_setsockopt().

After this last piece of SOL_IPV6 duplicated code is removed,
__bpf_setsockopt() is simplified enough that the extra "{ }"
around the if statements can be removed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/ipv6.h       |  2 ++
 include/net/ipv6_stubs.h |  2 ++
 net/core/filter.c        | 56 +++++++++++++++++++---------------------
 net/ipv6/af_inet6.c      |  1 +
 net/ipv6/ipv6_sockglue.c |  4 +--
 5 files changed, 33 insertions(+), 32 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index de9dcc5652c4..c110d9032083 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1156,6 +1156,8 @@ struct in6_addr *fl6_update_dst(struct flowi6 *fl6,
  */
 DECLARE_STATIC_KEY_FALSE(ip6_min_hopcount);
 
+int do_ipv6_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
+		       unsigned int optlen);
 int ipv6_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		    unsigned int optlen);
 int ipv6_getsockopt(struct sock *sk, int level, int optname,
diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h
index 45e0339be6fa..8692698b01cf 100644
--- a/include/net/ipv6_stubs.h
+++ b/include/net/ipv6_stubs.h
@@ -81,6 +81,8 @@ struct ipv6_bpf_stub {
 				     const struct in6_addr *daddr, __be16 dport,
 				     int dif, int sdif, struct udp_table *tbl,
 				     struct sk_buff *skb);
+	int (*ipv6_setsockopt)(struct sock *sk, int level, int optname,
+			       sockptr_t optval, unsigned int optlen);
 };
 extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index d236f71ff501..de6f90a2ec0d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5142,45 +5142,41 @@ static int sol_ip_setsockopt(struct sock *sk, int optname,
 				KERNEL_SOCKPTR(optval), optlen);
 }
 
+static int sol_ipv6_setsockopt(struct sock *sk, int optname,
+			       char *optval, int optlen)
+{
+	if (sk->sk_family != AF_INET6)
+		return -EINVAL;
+
+	switch (optname) {
+	case IPV6_TCLASS:
+		if (optlen != sizeof(int))
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ipv6_bpf_stub->ipv6_setsockopt(sk, SOL_IPV6, optname,
+					      KERNEL_SOCKPTR(optval), optlen);
+}
+
 static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 			    char *optval, int optlen)
 {
-	int val, ret = 0;
-
 	if (!sk_fullsock(sk))
 		return -EINVAL;
 
-	if (level == SOL_SOCKET) {
+	if (level == SOL_SOCKET)
 		return sol_socket_setsockopt(sk, optname, optval, optlen);
-	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
+	else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP)
 		return sol_ip_setsockopt(sk, optname, optval, optlen);
-	} else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6) {
-		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
-			return -EINVAL;
-
-		val = *((int *)optval);
-		/* Only some options are supported */
-		switch (optname) {
-		case IPV6_TCLASS:
-			if (val < -1 || val > 0xff) {
-				ret = -EINVAL;
-			} else {
-				struct ipv6_pinfo *np = inet6_sk(sk);
-
-				if (val == -1)
-					val = 0;
-				np->tclass = val;
-			}
-			break;
-		default:
-			ret = -EINVAL;
-		}
-	} else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP) {
+	else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6)
+		return sol_ipv6_setsockopt(sk, optname, optval, optlen);
+	else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP)
 		return sol_tcp_setsockopt(sk, optname, optval, optlen);
-	} else {
-		ret = -EINVAL;
-	}
-	return ret;
+
+	return -EINVAL;
 }
 
 static int _bpf_setsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 2ce0c44d0081..cadc97852787 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1057,6 +1057,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
 static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
 	.inet6_bind = __inet6_bind,
 	.udp6_lib_lookup = __udp6_lib_lookup,
+	.ipv6_setsockopt = do_ipv6_setsockopt,
 };
 
 static int __init inet6_init(void)
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index d23de48ff612..a4535bdbd310 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -391,8 +391,8 @@ static int ipv6_set_opt_hdr(struct sock *sk, int optname, sockptr_t optval,
 	return err;
 }
 
-static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
-		   sockptr_t optval, unsigned int optlen)
+int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+		       sockptr_t optval, unsigned int optlen)
 {
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct net *net = sock_net(sk);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 bpf-next 14/15] bpf: Add a few optnames to bpf_setsockopt
  2022-08-03 20:46 [PATCH v2 bpf-next 00/15] bpf: net: Remove duplicated code from bpf_setsockopt() Martin KaFai Lau
                   ` (12 preceding siblings ...)
  2022-08-03 20:47 ` [PATCH v2 bpf-next 13/15] bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt() Martin KaFai Lau
@ 2022-08-03 20:47 ` Martin KaFai Lau
  2022-08-03 20:47 ` [PATCH v2 bpf-next 15/15] selftests/bpf: bpf_setsockopt tests Martin KaFai Lau
  14 siblings, 0 replies; 29+ messages in thread
From: Martin KaFai Lau @ 2022-08-03 20:47 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

This patch adds a few optnames for bpf_setsockopt:
SO_REUSEADDR, IPV6_AUTOFLOWLABEL, TCP_MAXSEG, TCP_NODELAY,
and TCP_THIN_LINEAR_TIMEOUTS.

Thanks to the previous patches of this set, all additions can reuse
the sk_setsockopt(), do_ipv6_setsockopt(), and do_tcp_setsockopt().
The only change here is to allow them in bpf_setsockopt.

The bpf prog has already been able to read all members of a sk by
using a PTR_TO_BTF_ID of the sk.  The values set by the optnames
added here can be read back through the same approach.
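
As an illustration (not part of this patch), a minimal sketch of
reading one of the added values back, assuming vmlinux.h-style types
plus bpf_helpers.h and the bpf_skc_to_tcp_sock() helper; the prog
name and the printed field are arbitrary choices for the example:

	SEC("sockops")
	int read_back(struct bpf_sock_ops *skops)
	{
		struct tcp_sock *tp;

		/* tp is a PTR_TO_BTF_ID; members can be read directly */
		tp = bpf_skc_to_tcp_sock(skops->sk);
		if (!tp)
			return 1;

		/* e.g. the value written by bpf_setsockopt(TCP_NOTSENT_LOWAT) */
		bpf_printk("notsent_lowat %u", tp->notsent_lowat);
		return 1;
	}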

These optnames can also be added to bpf_getsockopt() later by
another patch set that makes bpf_getsockopt() reuse
sock_getsockopt(), tcp_getsockopt(), and ip[v6]_getsockopt().
Thus, this patch does not add more duplicated code to
bpf_getsockopt() for now.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index de6f90a2ec0d..52736c833a5f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5017,6 +5017,7 @@ static int sol_socket_setsockopt(struct sock *sk, int optname,
 				 char *optval, int optlen)
 {
 	switch (optname) {
+	case SO_REUSEADDR:
 	case SO_SNDBUF:
 	case SO_RCVBUF:
 	case SO_KEEPALIVE:
@@ -5102,11 +5103,14 @@ static int sol_tcp_setsockopt(struct sock *sk, int optname,
 		return -EINVAL;
 
 	switch (optname) {
+	case TCP_NODELAY:
+	case TCP_MAXSEG:
 	case TCP_KEEPIDLE:
 	case TCP_KEEPINTVL:
 	case TCP_KEEPCNT:
 	case TCP_SYNCNT:
 	case TCP_WINDOW_CLAMP:
+	case TCP_THIN_LINEAR_TIMEOUTS:
 	case TCP_USER_TIMEOUT:
 	case TCP_NOTSENT_LOWAT:
 	case TCP_SAVE_SYN:
@@ -5150,6 +5154,7 @@ static int sol_ipv6_setsockopt(struct sock *sk, int optname,
 
 	switch (optname) {
 	case IPV6_TCLASS:
+	case IPV6_AUTOFLOWLABEL:
 		if (optlen != sizeof(int))
 			return -EINVAL;
 		break;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 bpf-next 15/15] selftests/bpf: bpf_setsockopt tests
  2022-08-03 20:46 [PATCH v2 bpf-next 00/15] bpf: net: Remove duplicated code from bpf_setsockopt() Martin KaFai Lau
                   ` (13 preceding siblings ...)
  2022-08-03 20:47 ` [PATCH v2 bpf-next 14/15] bpf: Add a few optnames to bpf_setsockopt Martin KaFai Lau
@ 2022-08-03 20:47 ` Martin KaFai Lau
  2022-08-03 23:30   ` sdf
  14 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2022-08-03 20:47 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Paolo Abeni, Stanislav Fomichev

This patch adds tests to exercise optnames that are allowed
in bpf_setsockopt().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../selftests/bpf/prog_tests/setget_sockopt.c | 125 ++++
 .../selftests/bpf/progs/setget_sockopt.c      | 547 ++++++++++++++++++
 2 files changed, 672 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
 create mode 100644 tools/testing/selftests/bpf/progs/setget_sockopt.c

diff --git a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
new file mode 100644
index 000000000000..018611e6b248
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) Meta Platforms, Inc. and affiliates. */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <linux/socket.h>
+#include <net/if.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+#include "setget_sockopt.skel.h"
+
+#define CG_NAME "/setget-sockopt-test"
+
+static const char addr4_str[] = "127.0.0.1";
+static const char addr6_str[] = "::1";
+static struct setget_sockopt *skel;
+static int cg_fd;
+
+static int create_netns(void)
+{
+	if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link add dev binddevtest1 type veth peer name binddevtest2"),
+		       "add veth"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link set dev binddevtest1 up"),
+		       "bring veth up"))
+		return -1;
+
+	return 0;
+}
+
+static void test_tcp(int family)
+{
+	struct setget_sockopt__bss *bss = skel->bss;
+	int sfd, cfd;
+
+	memset(bss, 0, sizeof(*bss));
+
+	sfd = start_server(family, SOCK_STREAM,
+			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
+	if (!ASSERT_GE(sfd, 0, "start_server"))
+		return;
+
+	cfd = connect_to_fd(sfd, 0);
+	if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
+		close(sfd);
+		return;
+	}
+	close(sfd);
+	close(cfd);
+
+	ASSERT_EQ(bss->nr_listen, 1, "nr_listen");
+	ASSERT_EQ(bss->nr_connect, 1, "nr_connect");
+	ASSERT_EQ(bss->nr_active, 1, "nr_active");
+	ASSERT_EQ(bss->nr_passive, 1, "nr_passive");
+	ASSERT_EQ(bss->nr_socket_post_create, 2, "nr_socket_post_create");
+	ASSERT_EQ(bss->nr_binddev, 2, "nr_bind");
+}
+
+static void test_udp(int family)
+{
+	struct setget_sockopt__bss *bss = skel->bss;
+	int sfd;
+
+	memset(bss, 0, sizeof(*bss));
+
+	sfd = start_server(family, SOCK_DGRAM,
+			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
+	if (!ASSERT_GE(sfd, 0, "start_server"))
+		return;
+	close(sfd);
+
+	ASSERT_GE(bss->nr_socket_post_create, 1, "nr_socket_post_create");
+	ASSERT_EQ(bss->nr_binddev, 1, "nr_bind");
+}
+
+void test_setget_sockopt(void)
+{
+	cg_fd = test__join_cgroup(CG_NAME);
+	if (cg_fd < 0)
+		return;
+
+	if (create_netns())
+		goto done;
+
+	skel = setget_sockopt__open();
+	if (!ASSERT_OK_PTR(skel, "open skel"))
+		goto done;
+
+	strcpy(skel->rodata->veth, "binddevtest1");
+	skel->rodata->veth_ifindex = if_nametoindex("binddevtest1");
+	if (!ASSERT_GT(skel->rodata->veth_ifindex, 0, "if_nametoindex"))
+		goto done;
+
+	if (!ASSERT_OK(setget_sockopt__load(skel), "load skel"))
+		goto done;
+
+	skel->links.skops_sockopt =
+		bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
+	if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
+		goto done;
+
+	skel->links.socket_post_create =
+		bpf_program__attach_cgroup(skel->progs.socket_post_create, cg_fd);
+	if (!ASSERT_OK_PTR(skel->links.socket_post_create, "attach_cgroup"))
+		goto done;
+
+	test_tcp(AF_INET6);
+	test_tcp(AF_INET);
+	test_udp(AF_INET6);
+	test_udp(AF_INET);
+
+done:
+	setget_sockopt__destroy(skel);
+	close(cg_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/setget_sockopt.c b/tools/testing/selftests/bpf/progs/setget_sockopt.c
new file mode 100644
index 000000000000..560cf4b92d65
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/setget_sockopt.c
@@ -0,0 +1,547 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) Meta Platforms, Inc. and affiliates. */
+
+#include <stddef.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/in.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
+#include <linux/bpf.h>
+#include <linux/if.h>
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+
+#ifndef SO_TXREHASH
+#define SO_TXREHASH 74
+#endif
+
+#ifndef TCP_NAGLE_OFF
+#define TCP_NAGLE_OFF 1
+#endif
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+extern unsigned long CONFIG_HZ __kconfig;
+
+const volatile char veth[IFNAMSIZ];
+const volatile int veth_ifindex;
+
+int nr_listen;
+int nr_passive;
+int nr_active;
+int nr_connect;
+int nr_binddev;
+int nr_socket_post_create;
+
+struct sockopt_test {
+	int opt;
+	int new;
+	int restore;
+	int expected;
+	int tcp_expected;
+	int flip:1;
+};
+
+static const char cubic_cc[] = "cubic";
+static const char reno_cc[] = "reno";
+
+static const struct sockopt_test sol_socket_tests[] = {
+	{ .opt = SO_REUSEADDR, .flip = 1, },
+	{ .opt = SO_SNDBUF, .new = 8123, .expected = 8123 * 2, },
+	{ .opt = SO_RCVBUF, .new = 8123, .expected = 8123 * 2, },
+	{ .opt = SO_KEEPALIVE, .flip = 1, },
+	{ .opt = SO_PRIORITY, .new = 0xeb9f, .expected = 0xeb9f, },
+	{ .opt = SO_REUSEPORT, .flip = 1, },
+	{ .opt = SO_RCVLOWAT, .new = 8123, .expected = 8123, },
+	{ .opt = SO_MARK, .new = 0xeb9f, .expected = 0xeb9f, },
+	{ .opt = SO_MAX_PACING_RATE, .new = 0xeb9f, .expected = 0xeb9f, },
+	{ .opt = SO_TXREHASH, .flip = 1, },
+	{ .opt = 0, },
+};
+
+static const struct sockopt_test sol_tcp_tests[] = {
+	{ .opt = TCP_NODELAY, .flip = 1, },
+	{ .opt = TCP_MAXSEG, .new = 1314, .expected = 1314, },
+	{ .opt = TCP_KEEPIDLE, .new = 123, .expected = 123, .restore = 321, },
+	{ .opt = TCP_KEEPINTVL, .new = 123, .expected = 123, .restore = 321, },
+	{ .opt = TCP_KEEPCNT, .new = 123, .expected = 123, .restore = 124, },
+	{ .opt = TCP_SYNCNT, .new = 123, .expected = 123, .restore = 124, },
+	{ .opt = TCP_WINDOW_CLAMP, .new = 8123, .expected = 8123, .restore = 8124, },
+	{ .opt = TCP_CONGESTION, },
+	{ .opt = TCP_THIN_LINEAR_TIMEOUTS, .flip = 1, },
+	{ .opt = TCP_USER_TIMEOUT, .new = 123400, .expected = 123400, },
+	{ .opt = TCP_NOTSENT_LOWAT, .new = 1314, .expected = 1314, },
+	{ .opt = TCP_SAVE_SYN, .new = 1, .expected = 1, },
+	{ .opt = 0, },
+};
+
+static const struct sockopt_test sol_ip_tests[] = {
+	{ .opt = IP_TOS, .new = 0xe1, .expected = 0xe1, .tcp_expected = 0xe0, },
+	{ .opt = 0, },
+};
+
+static const struct sockopt_test sol_ipv6_tests[] = {
+	{ .opt = IPV6_TCLASS, .new = 0xe1, .expected = 0xe1, .tcp_expected = 0xe0, },
+	{ .opt = IPV6_AUTOFLOWLABEL, .flip = 1, },
+	{ .opt = 0, },
+};
+
+struct sock_common {
+	unsigned short	skc_family;
+	unsigned long	skc_flags;
+	unsigned char	skc_reuse:4;
+	unsigned char	skc_reuseport:1;
+	unsigned char	skc_ipv6only:1;
+	unsigned char	skc_net_refcnt:1;
+} __attribute__((preserve_access_index));
+
+struct sock {
+	struct sock_common	__sk_common;
+	__u16			sk_type;
+	__u16			sk_protocol;
+	int			sk_rcvlowat;
+	__u32			sk_mark;
+	unsigned long		sk_max_pacing_rate;
+	unsigned int		keepalive_time;
+	unsigned int		keepalive_intvl;
+} __attribute__((preserve_access_index));
+
+struct tcp_options_received {
+	__u16 user_mss;
+} __attribute__((preserve_access_index));
+
+struct ipv6_pinfo {
+	__u16			recverr:1,
+				sndflow:1,
+				repflow:1,
+				pmtudisc:3,
+				padding:1,
+				srcprefs:3,
+				dontfrag:1,
+				autoflowlabel:1,
+				autoflowlabel_set:1,
+				mc_all:1,
+				recverr_rfc4884:1,
+				rtalert_isolate:1;
+}  __attribute__((preserve_access_index));
+
+struct inet_sock {
+	/* sk and pinet6 has to be the first two members of inet_sock */
+	struct sock		sk;
+	struct ipv6_pinfo	*pinet6;
+} __attribute__((preserve_access_index));
+
+struct inet_connection_sock {
+	__u32			  icsk_user_timeout;
+	__u8			  icsk_syn_retries;
+} __attribute__((preserve_access_index));
+
+struct tcp_sock {
+	struct inet_connection_sock	inet_conn;
+	struct tcp_options_received rx_opt;
+	__u8	save_syn:2,
+		syn_data:1,
+		syn_fastopen:1,
+		syn_fastopen_exp:1,
+		syn_fastopen_ch:1,
+		syn_data_acked:1,
+		is_cwnd_limited:1;
+	__u32	window_clamp;
+	__u8	nonagle     : 4,
+		thin_lto    : 1,
+		recvmsg_inq : 1,
+		repair      : 1,
+		frto        : 1;
+	__u32	notsent_lowat;
+	__u8	keepalive_probes;
+	unsigned int		keepalive_time;
+	unsigned int		keepalive_intvl;
+} __attribute__((preserve_access_index));
+
+struct socket {
+	struct sock *sk;
+} __attribute__((preserve_access_index));
+
+struct loop_ctx {
+	void *ctx;
+	struct sock *sk;
+};
+
+static int __bpf_getsockopt(void *ctx, struct sock *sk,
+			    int level, int opt, int *optval,
+			    int optlen)
+{
+	if (level == SOL_SOCKET) {
+		switch (opt) {
+		case SO_REUSEADDR:
+			*optval = !!(sk->__sk_common.skc_reuse);
+			break;
+		case SO_KEEPALIVE:
+			*optval = !!(sk->__sk_common.skc_flags & (1UL << 3));
+			break;
+		case SO_RCVLOWAT:
+			*optval = sk->sk_rcvlowat;
+			break;
+		case SO_MARK:
+			*optval = sk->sk_mark;
+			break;
+		case SO_MAX_PACING_RATE:
+			*optval = sk->sk_max_pacing_rate;
+			break;
+		default:
+			return bpf_getsockopt(ctx, level, opt, optval, optlen);
+		}
+		return 0;
+	}
+
+	if (level == IPPROTO_TCP) {
+		struct tcp_sock *tp = bpf_skc_to_tcp_sock(sk);
+
+		if (!tp)
+			return -1;
+
+		switch (opt) {
+		case TCP_NODELAY:
+			*optval = !!(tp->nonagle & TCP_NAGLE_OFF);
+			break;
+		case TCP_MAXSEG:
+			*optval = tp->rx_opt.user_mss;
+			break;
+		case TCP_KEEPIDLE:
+			*optval = tp->keepalive_time / CONFIG_HZ;
+			break;
+		case TCP_SYNCNT:
+			*optval = tp->inet_conn.icsk_syn_retries;
+			break;
+		case TCP_KEEPINTVL:
+			*optval = tp->keepalive_intvl / CONFIG_HZ;
+			break;
+		case TCP_KEEPCNT:
+			*optval = tp->keepalive_probes;
+			break;
+		case TCP_WINDOW_CLAMP:
+			*optval = tp->window_clamp;
+			break;
+		case TCP_THIN_LINEAR_TIMEOUTS:
+			*optval = tp->thin_lto;
+			break;
+		case TCP_USER_TIMEOUT:
+			*optval = tp->inet_conn.icsk_user_timeout;
+			break;
+		case TCP_NOTSENT_LOWAT:
+			*optval = tp->notsent_lowat;
+			break;
+		case TCP_SAVE_SYN:
+			*optval = tp->save_syn;
+			break;
+		default:
+			return bpf_getsockopt(ctx, level, opt, optval, optlen);
+		}
+		return 0;
+	}
+
+	if (level == IPPROTO_IPV6) {
+		switch (opt) {
+		case IPV6_AUTOFLOWLABEL: {
+			__u16 proto = sk->sk_protocol;
+			struct inet_sock *inet_sk;
+
+			if (proto == IPPROTO_TCP)
+				inet_sk = (struct inet_sock *)bpf_skc_to_tcp_sock(sk);
+			else
+				inet_sk = (struct inet_sock *)bpf_skc_to_udp6_sock(sk);
+
+			if (!inet_sk)
+				return -1;
+
+			*optval = !!inet_sk->pinet6->autoflowlabel;
+			break;
+		}
+		default:
+			return bpf_getsockopt(ctx, level, opt, optval, optlen);
+		}
+		return 0;
+	}
+
+	return bpf_getsockopt(ctx, level, opt, optval, optlen);
+}
+
+static int bpf_test_sockopt_flip(void *ctx, struct sock *sk,
+				 const struct sockopt_test *t,
+				 int level)
+{
+	int old, tmp, new, opt = t->opt;
+
+	opt = t->opt;
+
+	if (__bpf_getsockopt(ctx, sk, level, opt, &old, sizeof(old)))
+		return 1;
+	/* kernel initialized txrehash to 255 */
+	if (level == SOL_SOCKET && opt == SO_TXREHASH && old != 0 && old != 1)
+		old = 1;
+
+	new = !old;
+	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
+		return 1;
+	if (__bpf_getsockopt(ctx, sk, level, opt, &tmp, sizeof(tmp)) ||
+	    tmp != new)
+		return 1;
+
+	if (bpf_setsockopt(ctx, level, opt, &old, sizeof(old)))
+		return 1;
+
+	return 0;
+}
+
+static int bpf_test_sockopt_int(void *ctx, struct sock *sk,
+				const struct sockopt_test *t,
+				int level)
+{
+	int old, tmp, new, expected, opt;
+
+	opt = t->opt;
+	new = t->new;
+	if (sk->sk_type == SOCK_STREAM && t->tcp_expected)
+		expected = t->tcp_expected;
+	else
+		expected = t->expected;
+
+	if (__bpf_getsockopt(ctx, sk, level, opt, &old, sizeof(old)) ||
+	    old == new)
+		return 1;
+
+	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
+		return 1;
+	if (__bpf_getsockopt(ctx, sk, level, opt, &tmp, sizeof(tmp)) ||
+	    tmp != expected)
+		return 1;
+
+	if (t->restore)
+		old = t->restore;
+	if (bpf_setsockopt(ctx, level, opt, &old, sizeof(old)))
+		return 1;
+
+	return 0;
+}
+
+static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+
+	if (i >= ARRAY_SIZE(sol_socket_tests))
+		return 1;
+
+	t = &sol_socket_tests[i];
+	if (!t->opt)
+		return 1;
+
+	if (t->flip)
+		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, SOL_SOCKET);
+
+	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
+}
+
+static int bpf_test_ip_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+
+	if (i >= ARRAY_SIZE(sol_ip_tests))
+		return 1;
+
+	t = &sol_ip_tests[i];
+	if (!t->opt)
+		return 1;
+
+	if (t->flip)
+		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, IPPROTO_IP);
+
+	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, IPPROTO_IP);
+}
+
+static int bpf_test_ipv6_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+
+	if (i >= ARRAY_SIZE(sol_ipv6_tests))
+		return 1;
+
+	t = &sol_ipv6_tests[i];
+	if (!t->opt)
+		return 1;
+
+	if (t->flip)
+		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, IPPROTO_IPV6);
+
+	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, IPPROTO_IPV6);
+}
+
+static int bpf_test_tcp_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+	struct sock *sk;
+	void *ctx;
+
+	if (i >= ARRAY_SIZE(sol_tcp_tests))
+		return 1;
+
+	t = &sol_tcp_tests[i];
+	if (!t->opt)
+		return 1;
+
+	ctx = lc->ctx;
+	sk = lc->sk;
+
+	if (t->opt == TCP_CONGESTION) {
+		char old_cc[16], tmp_cc[16];
+		const char *new_cc;
+
+		if (bpf_getsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, old_cc, sizeof(old_cc)))
+			return 1;
+		if (!bpf_strncmp(old_cc, sizeof(old_cc), cubic_cc))
+			new_cc = reno_cc;
+		else
+			new_cc = cubic_cc;
+		if (bpf_setsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, (void *)new_cc,
+				   sizeof(new_cc)))
+			return 1;
+		if (bpf_getsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, tmp_cc, sizeof(tmp_cc)))
+			return 1;
+		if (bpf_strncmp(tmp_cc, sizeof(tmp_cc), new_cc))
+			return 1;
+		if (bpf_setsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, old_cc, sizeof(old_cc)))
+			return 1;
+		return 0;
+	}
+
+	if (t->flip)
+		return bpf_test_sockopt_flip(ctx, sk, t, IPPROTO_TCP);
+
+	return bpf_test_sockopt_int(ctx, sk, t, IPPROTO_TCP);
+}
+
+static int bpf_test_sockopt(void *ctx, struct sock *sk)
+{
+	struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
+	__u16 family, proto;
+	int n;
+
+	family = sk->__sk_common.skc_family;
+	proto = sk->sk_protocol;
+
+	n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
+	if (n != ARRAY_SIZE(sol_socket_tests))
+		return -1;
+
+	if (proto == IPPROTO_TCP) {
+		n = bpf_loop(ARRAY_SIZE(sol_tcp_tests), bpf_test_tcp_sockopt, &lc, 0);
+		if (n != ARRAY_SIZE(sol_tcp_tests))
+			return -1;
+	}
+
+	if (family == AF_INET) {
+		n = bpf_loop(ARRAY_SIZE(sol_ip_tests), bpf_test_ip_sockopt, &lc, 0);
+		if (n != ARRAY_SIZE(sol_ip_tests))
+			return -1;
+	} else {
+		n = bpf_loop(ARRAY_SIZE(sol_ipv6_tests), bpf_test_ipv6_sockopt, &lc, 0);
+		if (n != ARRAY_SIZE(sol_ipv6_tests))
+			return -1;
+	}
+
+	return 0;
+}
+
+static int binddev_test(void *ctx)
+{
+	const char empty_ifname[] = "";
+	int ifindex, zero = 0;
+
+	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE,
+			   (void *)veth, sizeof(veth)))
+		return -1;
+	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &ifindex, sizeof(int)) ||
+	    ifindex != veth_ifindex)
+		return -1;
+
+	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE,
+			   (void *)empty_ifname, sizeof(empty_ifname)))
+		return -1;
+	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &ifindex, sizeof(int)) ||
+	    ifindex != 0)
+		return -1;
+
+	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   (void *)&veth_ifindex, sizeof(int)))
+		return -1;
+	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &ifindex, sizeof(int)) ||
+	    ifindex != veth_ifindex)
+		return -1;
+
+	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &zero, sizeof(int)))
+		return -1;
+	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
+			   &ifindex, sizeof(int)) ||
+	    ifindex != 0)
+		return -1;
+
+	return 0;
+}
+
+SEC("lsm_cgroup/socket_post_create")
+int BPF_PROG(socket_post_create, struct socket *sock, int family,
+	     int type, int protocol, int kern)
+{
+	struct sock *sk = sock->sk;
+
+	if (!sk)
+		return 1;
+
+	nr_socket_post_create += !bpf_test_sockopt(sk, sk);
+	nr_binddev += !binddev_test(sk);
+
+	return 1;
+}
+
+SEC("sockops")
+int skops_sockopt(struct bpf_sock_ops *skops)
+{
+	struct bpf_sock *bpf_sk = skops->sk;
+	struct sock *sk;
+
+	if (!bpf_sk)
+		return 1;
+
+	sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
+	if (!sk)
+		return 1;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_TCP_LISTEN_CB:
+		nr_listen += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		nr_connect += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		nr_active += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		nr_passive += !bpf_test_sockopt(skops, sk);
+		break;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-03 20:46 ` [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf Martin KaFai Lau
@ 2022-08-03 22:59   ` sdf
  2022-08-03 23:19     ` Martin KaFai Lau
  2022-08-04 19:03   ` Andrii Nakryiko
  1 sibling, 1 reply; 29+ messages in thread
From: sdf @ 2022-08-03 22:59 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On 08/03, Martin KaFai Lau wrote:
> Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> the sk_setsockopt().  The number of supported optnames keeps
> increasing and so does the duplicated code.

> One issue in reusing sk_setsockopt() is that the bpf prog
> has already acquired the sk lock.  This patch adds an in_bpf()
> to tell if the sk_setsockopt() is called from a bpf prog.
> The bpf prog calling bpf_setsockopt() is either running in_task()
> or in_serving_softirq().  Both cases have the current->bpf_ctx
> initialized.  Thus, the in_bpf() only needs to test !!current->bpf_ctx.

> This patch also adds sockopt_{lock,release}_sock() helpers
> for sk_setsockopt() to use.  These helpers will test in_bpf()
> before acquiring/releasing the lock.  They are exported with
> EXPORT_SYMBOL for the ipv6 module to use in a later patch.

> Note on the change in sock_setbindtodevice().  sockopt_lock_sock()
> is done in sock_setbindtodevice() instead of doing the lock_sock
> in sock_bindtoindex(..., lock_sk = true).

> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   include/linux/bpf.h |  8 ++++++++
>   include/net/sock.h  |  3 +++
>   net/core/sock.c     | 26 +++++++++++++++++++++++---
>   3 files changed, 34 insertions(+), 3 deletions(-)

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 20c26aed7896..b905b1b34fe4 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
>   	return !sysctl_unprivileged_bpf_disabled;
>   }

> +static inline bool in_bpf(void)
> +{
> +	return !!current->bpf_ctx;
> +}

Good point on not needing to care about softirq!
That actually turned out even nicer :-)

QQ: do we need to add a comment here about potential false-negatives?
I see you're adding ctx to the iter, but there are still a bunch of places
that don't use it.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-03 22:59   ` sdf
@ 2022-08-03 23:19     ` Martin KaFai Lau
  2022-08-03 23:24       ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2022-08-03 23:19 UTC (permalink / raw)
  To: sdf
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Aug 03, 2022 at 03:59:26PM -0700, sdf@google.com wrote:
> On 08/03, Martin KaFai Lau wrote:
> > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > the sk_setsockopt().  The number of supported optnames keeps
> > increasing and so does the duplicated code.
> 
> > One issue in reusing sk_setsockopt() is that the bpf prog
> > has already acquired the sk lock.  This patch adds an in_bpf()
> > to tell if the sk_setsockopt() is called from a bpf prog.
> > The bpf prog calling bpf_setsockopt() is either running in_task()
> > or in_serving_softirq().  Both cases have the current->bpf_ctx
> > initialized.  Thus, the in_bpf() only needs to test !!current->bpf_ctx.
> 
> > This patch also adds sockopt_{lock,release}_sock() helpers
> > for sk_setsockopt() to use.  These helpers will test in_bpf()
> > before acquiring/releasing the lock.  They are exported with
> > EXPORT_SYMBOL for the ipv6 module to use in a later patch.
> 
> > Note on the change in sock_setbindtodevice().  sockopt_lock_sock()
> > is done in sock_setbindtodevice() instead of doing the lock_sock
> > in sock_bindtoindex(..., lock_sk = true).
> 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >   include/linux/bpf.h |  8 ++++++++
> >   include/net/sock.h  |  3 +++
> >   net/core/sock.c     | 26 +++++++++++++++++++++++---
> >   3 files changed, 34 insertions(+), 3 deletions(-)
> 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 20c26aed7896..b905b1b34fe4 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
> >   	return !sysctl_unprivileged_bpf_disabled;
> >   }
> 
> > +static inline bool in_bpf(void)
> > +{
> > +	return !!current->bpf_ctx;
> > +}
> 
> Good point on not needing to care about softirq!
> That actually turned out even nicer :-)
> 
> QQ: do we need to add a comment here about potential false-negatives?
> I see you're adding ctx to the iter, but there are still a bunch of places
> that don't use it.
Makes sense.  I will add a comment on the requirement that the bpf prog type
needs to set up the bpf_run_ctx.
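
For illustration, such a comment might read roughly like this (the
wording is a sketch, not the final patch):

	/* in_bpf() relies on the calling bpf prog type having set up
	 * current->bpf_ctx (a bpf_run_ctx) before running the prog.
	 * Prog types that do not set it up will see a false negative
	 * here.
	 */
	static inline bool in_bpf(void)
	{
		return !!current->bpf_ctx;
	}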

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-03 23:19     ` Martin KaFai Lau
@ 2022-08-03 23:24       ` Stanislav Fomichev
  2022-08-03 23:35         ` Martin KaFai Lau
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2022-08-03 23:24 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Aug 3, 2022 at 4:19 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Aug 03, 2022 at 03:59:26PM -0700, sdf@google.com wrote:
> > On 08/03, Martin KaFai Lau wrote:
> > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > the sk_setsockopt().  The number of supported optnames keeps
> > > increasing and so does the duplicated code.
> >
> > > One issue in reusing sk_setsockopt() is that the bpf prog
> > > has already acquired the sk lock.  This patch adds an in_bpf()
> > > to tell if the sk_setsockopt() is called from a bpf prog.
> > > The bpf prog calling bpf_setsockopt() is either running in_task()
> > > or in_serving_softirq().  Both cases have the current->bpf_ctx
> > > initialized.  Thus, the in_bpf() only needs to test !!current->bpf_ctx.
> >
> > > This patch also adds sockopt_{lock,release}_sock() helpers
> > > for sk_setsockopt() to use.  These helpers will test in_bpf()
> > > before acquiring/releasing the lock.  They are exported with
> > > EXPORT_SYMBOL for the ipv6 module to use in a later patch.
> >
> > > Note on the change in sock_setbindtodevice().  sockopt_lock_sock()
> > > is done in sock_setbindtodevice() instead of doing the lock_sock
> > > in sock_bindtoindex(..., lock_sk = true).
> >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> > >   include/linux/bpf.h |  8 ++++++++
> > >   include/net/sock.h  |  3 +++
> > >   net/core/sock.c     | 26 +++++++++++++++++++++++---
> > >   3 files changed, 34 insertions(+), 3 deletions(-)
> >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 20c26aed7896..b905b1b34fe4 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
> > >     return !sysctl_unprivileged_bpf_disabled;
> > >   }
> >
> > > +static inline bool in_bpf(void)
> > > +{
> > > +   return !!current->bpf_ctx;
> > > +}
> >
> > Good point on not needing to care about softirq!
> > That actually turned out even nicer :-)
> >
> > QQ: do we need to add a comment here about potential false-negatives?
> > I see you're adding ctx to the iter, but there are still a bunch of places
> > that don't use it.
> Makes sense.  I will add a comment on the requirement that the bpf prog type
> needs to set up the bpf_run_ctx.

Thanks! While at it, is it worth adding a short sentence to
sockopt_lock_sock on why it's safe to skip locking in the bpf case as
well?
Feels like the current state where bpf always runs with the locked
socket might change in the future.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 15/15] selftests/bpf: bpf_setsockopt tests
  2022-08-03 20:47 ` [PATCH v2 bpf-next 15/15] selftests/bpf: bpf_setsockopt tests Martin KaFai Lau
@ 2022-08-03 23:30   ` sdf
  2022-08-04  0:04     ` Martin KaFai Lau
  0 siblings, 1 reply; 29+ messages in thread
From: sdf @ 2022-08-03 23:30 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On 08/03, Martin KaFai Lau wrote:
> This patch adds tests to exercise optnames that are allowed
> in bpf_setsockopt().

> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   .../selftests/bpf/prog_tests/setget_sockopt.c | 125 ++++
>   .../selftests/bpf/progs/setget_sockopt.c      | 547 ++++++++++++++++++
>   2 files changed, 672 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
>   create mode 100644 tools/testing/selftests/bpf/progs/setget_sockopt.c

> diff --git a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
> new file mode 100644
> index 000000000000..018611e6b248
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
> @@ -0,0 +1,125 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) Meta Platforms, Inc. and affiliates. */
> +
> +#define _GNU_SOURCE
> +#include <sched.h>
> +#include <linux/socket.h>
> +#include <net/if.h>
> +
> +#include "test_progs.h"
> +#include "cgroup_helpers.h"
> +#include "network_helpers.h"
> +
> +#include "setget_sockopt.skel.h"
> +
> +#define CG_NAME "/setget-sockopt-test"
> +
> +static const char addr4_str[] = "127.0.0.1";
> +static const char addr6_str[] = "::1";
> +static struct setget_sockopt *skel;
> +static int cg_fd;
> +
> +static int create_netns(void)
> +{
> +	if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
> +		return -1;
> +
> +	if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
> +		return -1;
> +
> +	if (!ASSERT_OK(system("ip link add dev binddevtest1 type veth peer name binddevtest2"),
> +		       "add veth"))
> +		return -1;
> +
> +	if (!ASSERT_OK(system("ip link set dev binddevtest1 up"),
> +		       "bring veth up"))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static void test_tcp(int family)
> +{
> +	struct setget_sockopt__bss *bss = skel->bss;
> +	int sfd, cfd;
> +
> +	memset(bss, 0, sizeof(*bss));
> +
> +	sfd = start_server(family, SOCK_STREAM,
> +			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
> +	if (!ASSERT_GE(sfd, 0, "start_server"))
> +		return;
> +
> +	cfd = connect_to_fd(sfd, 0);
> +	if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
> +		close(sfd);
> +		return;
> +	}
> +	close(sfd);
> +	close(cfd);
> +
> +	ASSERT_EQ(bss->nr_listen, 1, "nr_listen");
> +	ASSERT_EQ(bss->nr_connect, 1, "nr_connect");
> +	ASSERT_EQ(bss->nr_active, 1, "nr_active");
> +	ASSERT_EQ(bss->nr_passive, 1, "nr_passive");
> +	ASSERT_EQ(bss->nr_socket_post_create, 2, "nr_socket_post_create");
> +	ASSERT_EQ(bss->nr_binddev, 2, "nr_bind");
> +}
> +
> +static void test_udp(int family)
> +{
> +	struct setget_sockopt__bss *bss = skel->bss;
> +	int sfd;
> +
> +	memset(bss, 0, sizeof(*bss));
> +
> +	sfd = start_server(family, SOCK_DGRAM,
> +			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
> +	if (!ASSERT_GE(sfd, 0, "start_server"))
> +		return;
> +	close(sfd);
> +
> +	ASSERT_GE(bss->nr_socket_post_create, 1, "nr_socket_post_create");
> +	ASSERT_EQ(bss->nr_binddev, 1, "nr_bind");
> +}
> +
> +void test_setget_sockopt(void)
> +{
> +	cg_fd = test__join_cgroup(CG_NAME);
> +	if (cg_fd < 0)
> +		return;
> +
> +	if (create_netns())
> +		goto done;
> +
> +	skel = setget_sockopt__open();
> +	if (!ASSERT_OK_PTR(skel, "open skel"))
> +		goto done;
> +
> +	strcpy(skel->rodata->veth, "binddevtest1");
> +	skel->rodata->veth_ifindex = if_nametoindex("binddevtest1");
> +	if (!ASSERT_GT(skel->rodata->veth_ifindex, 0, "if_nametoindex"))
> +		goto done;
> +
> +	if (!ASSERT_OK(setget_sockopt__load(skel), "load skel"))
> +		goto done;
> +
> +	skel->links.skops_sockopt =
> +		bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
> +	if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
> +		goto done;
> +
> +	skel->links.socket_post_create =
> +		bpf_program__attach_cgroup(skel->progs.socket_post_create, cg_fd);
> +	if (!ASSERT_OK_PTR(skel->links.socket_post_create, "attach_cgroup"))
> +		goto done;
> +
> +	test_tcp(AF_INET6);
> +	test_tcp(AF_INET);
> +	test_udp(AF_INET6);
> +	test_udp(AF_INET);
> +
> +done:
> +	setget_sockopt__destroy(skel);
> +	close(cg_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/setget_sockopt.c b/tools/testing/selftests/bpf/progs/setget_sockopt.c
> new file mode 100644
> index 000000000000..560cf4b92d65
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/setget_sockopt.c
> @@ -0,0 +1,547 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) Meta Platforms, Inc. and affiliates. */
> +
> +#include <stddef.h>
> +#include <stdbool.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <linux/in.h>
> +#include <linux/ipv6.h>
> +#include <linux/tcp.h>
> +#include <linux/socket.h>
> +#include <linux/bpf.h>
> +#include <linux/if.h>
> +#include <linux/types.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <errno.h>
> +
> +#ifndef SO_TXREHASH
> +#define SO_TXREHASH 74
> +#endif
> +
> +#ifndef TCP_NAGLE_OFF
> +#define TCP_NAGLE_OFF 1
> +#endif
> +
> +#ifndef ARRAY_SIZE
> +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> +#endif
> +
> +extern unsigned long CONFIG_HZ __kconfig;
> +
> +const volatile char veth[IFNAMSIZ];
> +const volatile int veth_ifindex;
> +
> +int nr_listen;
> +int nr_passive;
> +int nr_active;
> +int nr_connect;
> +int nr_binddev;
> +int nr_socket_post_create;
> +
> +struct sockopt_test {
> +	int opt;
> +	int new;
> +	int restore;
> +	int expected;
> +	int tcp_expected;
> +	int flip:1;
> +};
> +
> +static const char cubic_cc[] = "cubic";
> +static const char reno_cc[] = "reno";
> +
> +static const struct sockopt_test sol_socket_tests[] = {
> +	{ .opt = SO_REUSEADDR, .flip = 1, },
> +	{ .opt = SO_SNDBUF, .new = 8123, .expected = 8123 * 2, },
> +	{ .opt = SO_RCVBUF, .new = 8123, .expected = 8123 * 2, },
> +	{ .opt = SO_KEEPALIVE, .flip = 1, },
> +	{ .opt = SO_PRIORITY, .new = 0xeb9f, .expected = 0xeb9f, },
> +	{ .opt = SO_REUSEPORT, .flip = 1, },
> +	{ .opt = SO_RCVLOWAT, .new = 8123, .expected = 8123, },
> +	{ .opt = SO_MARK, .new = 0xeb9f, .expected = 0xeb9f, },
> +	{ .opt = SO_MAX_PACING_RATE, .new = 0xeb9f, .expected = 0xeb9f, },
> +	{ .opt = SO_TXREHASH, .flip = 1, },
> +	{ .opt = 0, },
> +};
> +
> +static const struct sockopt_test sol_tcp_tests[] = {
> +	{ .opt = TCP_NODELAY, .flip = 1, },
> +	{ .opt = TCP_MAXSEG, .new = 1314, .expected = 1314, },
> +	{ .opt = TCP_KEEPIDLE, .new = 123, .expected = 123, .restore = 321, },
> +	{ .opt = TCP_KEEPINTVL, .new = 123, .expected = 123, .restore = 321, },
> +	{ .opt = TCP_KEEPCNT, .new = 123, .expected = 123, .restore = 124, },
> +	{ .opt = TCP_SYNCNT, .new = 123, .expected = 123, .restore = 124, },
> +	{ .opt = TCP_WINDOW_CLAMP, .new = 8123, .expected = 8123, .restore = 8124, },
> +	{ .opt = TCP_CONGESTION, },
> +	{ .opt = TCP_THIN_LINEAR_TIMEOUTS, .flip = 1, },
> +	{ .opt = TCP_USER_TIMEOUT, .new = 123400, .expected = 123400, },
> +	{ .opt = TCP_NOTSENT_LOWAT, .new = 1314, .expected = 1314, },
> +	{ .opt = TCP_SAVE_SYN, .new = 1, .expected = 1, },
> +	{ .opt = 0, },
> +};
> +
> +static const struct sockopt_test sol_ip_tests[] = {
> +	{ .opt = IP_TOS, .new = 0xe1, .expected = 0xe1, .tcp_expected = 0xe0, },
> +	{ .opt = 0, },
> +};
> +
> +static const struct sockopt_test sol_ipv6_tests[] = {
> +	{ .opt = IPV6_TCLASS, .new = 0xe1, .expected = 0xe1, .tcp_expected = 0xe0, },
> +	{ .opt = IPV6_AUTOFLOWLABEL, .flip = 1, },
> +	{ .opt = 0, },
> +};
> +
> +struct sock_common {
> +	unsigned short	skc_family;
> +	unsigned long	skc_flags;
> +	unsigned char	skc_reuse:4;
> +	unsigned char	skc_reuseport:1;
> +	unsigned char	skc_ipv6only:1;
> +	unsigned char	skc_net_refcnt:1;
> +} __attribute__((preserve_access_index));
> +
> +struct sock {
> +	struct sock_common	__sk_common;
> +	__u16			sk_type;
> +	__u16			sk_protocol;
> +	int			sk_rcvlowat;
> +	__u32			sk_mark;
> +	unsigned long		sk_max_pacing_rate;
> +	unsigned int		keepalive_time;
> +	unsigned int		keepalive_intvl;
> +} __attribute__((preserve_access_index));
> +
> +struct tcp_options_received {
> +	__u16 user_mss;
> +} __attribute__((preserve_access_index));

I'm assuming you're not using vmlinux here because it doesn't bring
in most of the defines? Should we add missing stuff to bpf_tracing_net.h
instead?
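
Concretely, that would mean moving defines like these into
bpf_tracing_net.h (a hypothetical sketch mirroring the #ifndef
fallbacks in the prog):

	#define SO_TXREHASH	74
	#define TCP_NAGLE_OFF	1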

> +struct ipv6_pinfo {
> +	__u16			recverr:1,
> +				sndflow:1,
> +				repflow:1,
> +				pmtudisc:3,
> +				padding:1,
> +				srcprefs:3,
> +				dontfrag:1,
> +				autoflowlabel:1,
> +				autoflowlabel_set:1,
> +				mc_all:1,
> +				recverr_rfc4884:1,
> +				rtalert_isolate:1;
> +}  __attribute__((preserve_access_index));
> +
> +struct inet_sock {
> +	/* sk and pinet6 has to be the first two members of inet_sock */
> +	struct sock		sk;
> +	struct ipv6_pinfo	*pinet6;
> +} __attribute__((preserve_access_index));
> +
> +struct inet_connection_sock {
> +	__u32			  icsk_user_timeout;
> +	__u8			  icsk_syn_retries;
> +} __attribute__((preserve_access_index));
> +
> +struct tcp_sock {
> +	struct inet_connection_sock	inet_conn;
> +	struct tcp_options_received rx_opt;
> +	__u8	save_syn:2,
> +		syn_data:1,
> +		syn_fastopen:1,
> +		syn_fastopen_exp:1,
> +		syn_fastopen_ch:1,
> +		syn_data_acked:1,
> +		is_cwnd_limited:1;
> +	__u32	window_clamp;
> +	__u8	nonagle     : 4,
> +		thin_lto    : 1,
> +		recvmsg_inq : 1,
> +		repair      : 1,
> +		frto        : 1;
> +	__u32	notsent_lowat;
> +	__u8	keepalive_probes;
> +	unsigned int		keepalive_time;
> +	unsigned int		keepalive_intvl;
> +} __attribute__((preserve_access_index));
> +
> +struct socket {
> +	struct sock *sk;
> +} __attribute__((preserve_access_index));
> +
> +struct loop_ctx {
> +	void *ctx;
> +	struct sock *sk;
> +};
> +
> +static int __bpf_getsockopt(void *ctx, struct sock *sk,
> +			    int level, int opt, int *optval,
> +			    int optlen)
> +{
> +	if (level == SOL_SOCKET) {
> +		switch (opt) {
> +		case SO_REUSEADDR:
> +			*optval = !!(sk->__sk_common.skc_reuse);
> +			break;
> +		case SO_KEEPALIVE:
> +			*optval = !!(sk->__sk_common.skc_flags & (1UL << 3));
> +			break;
> +		case SO_RCVLOWAT:
> +			*optval = sk->sk_rcvlowat;
> +			break;

What's the idea with the options above? Why not allow them in
bpf_getsockopt instead?

> +		case SO_MARK:
> +			*optval = sk->sk_mark;
> +			break;

SO_MARK should be handled by bpf_getsockopt?

> +		case SO_MAX_PACING_RATE:
> +			*optval = sk->sk_max_pacing_rate;
> +			break;
> +		default:
> +			return bpf_getsockopt(ctx, level, opt, optval, optlen);
> +		}
> +		return 0;
> +	}
> +
> +	if (level == IPPROTO_TCP) {
> +		struct tcp_sock *tp = bpf_skc_to_tcp_sock(sk);
> +
> +		if (!tp)
> +			return -1;
> +
> +		switch (opt) {
> +		case TCP_NODELAY:
> +			*optval = !!(tp->nonagle & TCP_NAGLE_OFF);
> +			break;
> +		case TCP_MAXSEG:
> +			*optval = tp->rx_opt.user_mss;
> +			break;
> +		case TCP_KEEPIDLE:
> +			*optval = tp->keepalive_time / CONFIG_HZ;
> +			break;
> +		case TCP_SYNCNT:
> +			*optval = tp->inet_conn.icsk_syn_retries;
> +			break;
> +		case TCP_KEEPINTVL:
> +			*optval = tp->keepalive_intvl / CONFIG_HZ;
> +			break;
> +		case TCP_KEEPCNT:
> +			*optval = tp->keepalive_probes;
> +			break;
> +		case TCP_WINDOW_CLAMP:
> +			*optval = tp->window_clamp;
> +			break;
> +		case TCP_THIN_LINEAR_TIMEOUTS:
> +			*optval = tp->thin_lto;
> +			break;
> +		case TCP_USER_TIMEOUT:
> +			*optval = tp->inet_conn.icsk_user_timeout;
> +			break;
> +		case TCP_NOTSENT_LOWAT:
> +			*optval = tp->notsent_lowat;
> +			break;
> +		case TCP_SAVE_SYN:
> +			*optval = tp->save_syn;
> +			break;
> +		default:
> +			return bpf_getsockopt(ctx, level, opt, optval, optlen);
> +		}
> +		return 0;
> +	}
> +
> +	if (level == IPPROTO_IPV6) {
> +		switch (opt) {
> +		case IPV6_AUTOFLOWLABEL: {
> +			__u16 proto = sk->sk_protocol;
> +			struct inet_sock *inet_sk;
> +
> +			if (proto == IPPROTO_TCP)
> +				inet_sk = (struct inet_sock *)bpf_skc_to_tcp_sock(sk);
> +			else
> +				inet_sk = (struct inet_sock *)bpf_skc_to_udp6_sock(sk);
> +
> +			if (!inet_sk)
> +				return -1;
> +
> +			*optval = !!inet_sk->pinet6->autoflowlabel;
> +			break;
> +		}
> +		default:
> +			return bpf_getsockopt(ctx, level, opt, optval, optlen);
> +		}
> +		return 0;
> +	}
> +
> +	return bpf_getsockopt(ctx, level, opt, optval, optlen);
> +}
> +
> +static int bpf_test_sockopt_flip(void *ctx, struct sock *sk,
> +				 const struct sockopt_test *t,
> +				 int level)
> +{
> +	int old, tmp, new, opt = t->opt;
> +
> +	opt = t->opt;
> +
> +	if (__bpf_getsockopt(ctx, sk, level, opt, &old, sizeof(old)))
> +		return 1;
> +	/* kernel initialized txrehash to 255 */
> +	if (level == SOL_SOCKET && opt == SO_TXREHASH && old != 0 && old != 1)
> +		old = 1;
> +
> +	new = !old;
> +	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
> +		return 1;
> +	if (__bpf_getsockopt(ctx, sk, level, opt, &tmp, sizeof(tmp)) ||
> +	    tmp != new)
> +		return 1;
> +
> +	if (bpf_setsockopt(ctx, level, opt, &old, sizeof(old)))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static int bpf_test_sockopt_int(void *ctx, struct sock *sk,
> +				const struct sockopt_test *t,
> +				int level)
> +{
> +	int old, tmp, new, expected, opt;
> +
> +	opt = t->opt;
> +	new = t->new;
> +	if (sk->sk_type == SOCK_STREAM && t->tcp_expected)
> +		expected = t->tcp_expected;
> +	else
> +		expected = t->expected;
> +
> +	if (__bpf_getsockopt(ctx, sk, level, opt, &old, sizeof(old)) ||
> +	    old == new)
> +		return 1;
> +
> +	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
> +		return 1;
> +	if (__bpf_getsockopt(ctx, sk, level, opt, &tmp, sizeof(tmp)) ||
> +	    tmp != expected)
> +		return 1;
> +
> +	if (t->restore)
> +		old = t->restore;
> +	if (bpf_setsockopt(ctx, level, opt, &old, sizeof(old)))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
> +{
> +	const struct sockopt_test *t;
> +
> +	if (i >= ARRAY_SIZE(sol_socket_tests))
> +		return 1;
> +
> +	t = &sol_socket_tests[i];
> +	if (!t->opt)
> +		return 1;
> +
> +	if (t->flip)
> +		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, SOL_SOCKET);
> +
> +	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
> +}
> +
> +static int bpf_test_ip_sockopt(__u32 i, struct loop_ctx *lc)
> +{
> +	const struct sockopt_test *t;
> +
> +	if (i >= ARRAY_SIZE(sol_ip_tests))
> +		return 1;
> +
> +	t = &sol_ip_tests[i];
> +	if (!t->opt)
> +		return 1;
> +
> +	if (t->flip)
> +		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, IPPROTO_IP);
> +
> +	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, IPPROTO_IP);
> +}
> +
> +static int bpf_test_ipv6_sockopt(__u32 i, struct loop_ctx *lc)
> +{
> +	const struct sockopt_test *t;
> +
> +	if (i >= ARRAY_SIZE(sol_ipv6_tests))
> +		return 1;
> +
> +	t = &sol_ipv6_tests[i];
> +	if (!t->opt)
> +		return 1;
> +
> +	if (t->flip)
> +		return bpf_test_sockopt_flip(lc->ctx, lc->sk, t, IPPROTO_IPV6);
> +
> +	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, IPPROTO_IPV6);
> +}
> +
> +static int bpf_test_tcp_sockopt(__u32 i, struct loop_ctx *lc)
> +{
> +	const struct sockopt_test *t;
> +	struct sock *sk;
> +	void *ctx;
> +
> +	if (i >= ARRAY_SIZE(sol_tcp_tests))
> +		return 1;
> +
> +	t = &sol_tcp_tests[i];
> +	if (!t->opt)
> +		return 1;
> +
> +	ctx = lc->ctx;
> +	sk = lc->sk;
> +
> +	if (t->opt == TCP_CONGESTION) {
> +		char old_cc[16], tmp_cc[16];
> +		const char *new_cc;
> +
> +		if (bpf_getsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, old_cc, sizeof(old_cc)))
> +			return 1;
> +		if (!bpf_strncmp(old_cc, sizeof(old_cc), cubic_cc))
> +			new_cc = reno_cc;
> +		else
> +			new_cc = cubic_cc;
> +		if (bpf_setsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, (void *)new_cc,
> +				   sizeof(new_cc)))
> +			return 1;
> +		if (bpf_getsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, tmp_cc, sizeof(tmp_cc)))
> +			return 1;
> +		if (bpf_strncmp(tmp_cc, sizeof(tmp_cc), new_cc))
> +			return 1;
> +		if (bpf_setsockopt(ctx, IPPROTO_TCP, TCP_CONGESTION, old_cc, sizeof(old_cc)))
> +			return 1;
> +		return 0;
> +	}
> +
> +	if (t->flip)
> +		return bpf_test_sockopt_flip(ctx, sk, t, IPPROTO_TCP);
> +
> +	return bpf_test_sockopt_int(ctx, sk, t, IPPROTO_TCP);
> +}
> +
> +static int bpf_test_sockopt(void *ctx, struct sock *sk)
> +{
> +	struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
> +	__u16 family, proto;
> +	int n;
> +
> +	family = sk->__sk_common.skc_family;
> +	proto = sk->sk_protocol;
> +
> +	n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
> +	if (n != ARRAY_SIZE(sol_socket_tests))
> +		return -1;
> +
> +	if (proto == IPPROTO_TCP) {
> +		n = bpf_loop(ARRAY_SIZE(sol_tcp_tests), bpf_test_tcp_sockopt, &lc, 0);
> +		if (n != ARRAY_SIZE(sol_tcp_tests))
> +			return -1;
> +	}
> +
> +	if (family == AF_INET) {
> +		n = bpf_loop(ARRAY_SIZE(sol_ip_tests), bpf_test_ip_sockopt, &lc, 0);
> +		if (n != ARRAY_SIZE(sol_ip_tests))
> +			return -1;
> +	} else {
> +		n = bpf_loop(ARRAY_SIZE(sol_ipv6_tests), bpf_test_ipv6_sockopt, &lc, 0);
> +		if (n != ARRAY_SIZE(sol_ipv6_tests))
> +			return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int binddev_test(void *ctx)
> +{
> +	const char empty_ifname[] = "";
> +	int ifindex, zero = 0;
> +
> +	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE,
> +			   (void *)veth, sizeof(veth)))
> +		return -1;
> +	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
> +			   &ifindex, sizeof(int)) ||
> +	    ifindex != veth_ifindex)
> +		return -1;
> +
> +	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE,
> +			   (void *)empty_ifname, sizeof(empty_ifname)))
> +		return -1;
> +	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
> +			   &ifindex, sizeof(int)) ||
> +	    ifindex != 0)
> +		return -1;
> +
> +	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
> +			   (void *)&veth_ifindex, sizeof(int)))
> +		return -1;
> +	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
> +			   &ifindex, sizeof(int)) ||
> +	    ifindex != veth_ifindex)
> +		return -1;
> +
> +	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
> +			   &zero, sizeof(int)))
> +		return -1;
> +	if (bpf_getsockopt(ctx, SOL_SOCKET, SO_BINDTOIFINDEX,
> +			   &ifindex, sizeof(int)) ||
> +	    ifindex != 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +SEC("lsm_cgroup/socket_post_create")
> +int BPF_PROG(socket_post_create, struct socket *sock, int family,
> +	     int type, int protocol, int kern)
> +{
> +	struct sock *sk = sock->sk;
> +
> +	if (!sk)
> +		return 1;
> +
> +	nr_socket_post_create += !bpf_test_sockopt(sk, sk);
> +	nr_binddev += !binddev_test(sk);
> +
> +	return 1;
> +}
> +
> +SEC("sockops")
> +int skops_sockopt(struct bpf_sock_ops *skops)
> +{
> +	struct bpf_sock *bpf_sk = skops->sk;
> +	struct sock *sk;
> +
> +	if (!bpf_sk)
> +		return 1;
> +
> +	sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
> +	if (!sk)
> +		return 1;
> +
> +	switch (skops->op) {
> +	case BPF_SOCK_OPS_TCP_LISTEN_CB:
> +		nr_listen += !bpf_test_sockopt(skops, sk);
> +		break;
> +	case BPF_SOCK_OPS_TCP_CONNECT_CB:
> +		nr_connect += !bpf_test_sockopt(skops, sk);
> +		break;
> +	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> +		nr_active += !bpf_test_sockopt(skops, sk);
> +		break;
> +	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
> +		nr_passive += !bpf_test_sockopt(skops, sk);
> +		break;
> +	}
> +
> +	return 1;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.30.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-03 23:24       ` Stanislav Fomichev
@ 2022-08-03 23:35         ` Martin KaFai Lau
  0 siblings, 0 replies; 29+ messages in thread
From: Martin KaFai Lau @ 2022-08-03 23:35 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Aug 03, 2022 at 04:24:49PM -0700, Stanislav Fomichev wrote:
> On Wed, Aug 3, 2022 at 4:19 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Aug 03, 2022 at 03:59:26PM -0700, sdf@google.com wrote:
> > > On 08/03, Martin KaFai Lau wrote:
> > > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > > the sk_setsockopt().  The number of supported optnames keeps
> > > > increasing and so does the duplicated code.
> > >
> > > > One issue in reusing sk_setsockopt() is that the bpf prog
> > > > has already acquired the sk lock.  This patch adds an in_bpf()
> > > > to tell if the sk_setsockopt() is called from a bpf prog.
> > > > The bpf prog calling bpf_setsockopt() is either running in_task()
> > > > or in_serving_softirq().  Both cases have the current->bpf_ctx
> > > > initialized.  Thus, the in_bpf() only needs to test !!current->bpf_ctx.
> > >
> > > > This patch also adds sockopt_{lock,release}_sock() helpers
> > > > for sk_setsockopt() to use.  These helpers will test in_bpf()
> > > > before acquiring/releasing the lock.  They are exported with
> > > > EXPORT_SYMBOL for the ipv6 module to use in a later patch.
> > >
> > > > Note on the change in sock_setbindtodevice().  sockopt_lock_sock()
> > > > is done in sock_setbindtodevice() instead of doing the lock_sock
> > > > in sock_bindtoindex(..., lock_sk = true).
> > >
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > > >   include/linux/bpf.h |  8 ++++++++
> > > >   include/net/sock.h  |  3 +++
> > > >   net/core/sock.c     | 26 +++++++++++++++++++++++---
> > > >   3 files changed, 34 insertions(+), 3 deletions(-)
> > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 20c26aed7896..b905b1b34fe4 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
> > > >     return !sysctl_unprivileged_bpf_disabled;
> > > >   }
> > >
> > > > +static inline bool in_bpf(void)
> > > > +{
> > > > +   return !!current->bpf_ctx;
> > > > +}
> > >
> > > Good point on not needing to care about softirq!
> > > That actually turned out even nicer :-)
> > >
> > > QQ: do we need to add a comment here about potential false-negatives?
> > > I see you're adding ctx to the iter, but there are still a bunch of places
> > > that don't use it.
> > Makes sense.  I will add a comment on the requirement that the bpf prog type
> > needs to set up the bpf_run_ctx.
> 
> Thanks! While at it, is it worth adding a short sentence to
> sockopt_lock_sock on why it's safe to skip locking in the bpf case as
> well?
Yep. will do.
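
For reference, a sketch of sockopt_lock_sock() with the kind of note
being suggested; the comment wording here is illustrative only:

	void sockopt_lock_sock(struct sock *sk)
	{
		/* When a bpf prog calls bpf_setsockopt(), the bpf
		 * runtime has already acquired the sk lock, so it is
		 * safe to skip taking it again here.
		 */
		if (in_bpf())
			return;

		lock_sock(sk);
	}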

> Feels like the current state where bpf always runs with the locked
> socket might change in the future.
That will likely come from the sleepable bpf prog.
It can probably either acquire the lock in __bpf_setsockopt() before
calling sk_setsockopt(), or flag the bpf_run_ctx to say the lock is
not acquired.  The former should be more straightforward.
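
A rough sketch of the first option (hypothetical; the wrapper name
is made up for illustration):

	/* A sleepable prog would not already hold the sk lock, so
	 * take it before reusing the same setsockopt path.  With
	 * bpf_ctx set, sockopt_lock_sock() skips locking, so the
	 * lock taken here is the only one.
	 */
	static int bpf_setsockopt_sleepable(struct sock *sk, int level,
					    int optname, char *optval,
					    int optlen)
	{
		int err;

		lock_sock(sk);
		err = __bpf_setsockopt(sk, level, optname, optval, optlen);
		release_sock(sk);
		return err;
	}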

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 15/15] selftests/bpf: bpf_setsockopt tests
  2022-08-03 23:30   ` sdf
@ 2022-08-04  0:04     ` Martin KaFai Lau
  2022-08-04 17:03       ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2022-08-04  0:04 UTC (permalink / raw)
  To: sdf
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Aug 03, 2022 at 04:30:54PM -0700, sdf@google.com wrote:
> > +struct sock_common {
> > +	unsigned short	skc_family;
> > +	unsigned long	skc_flags;
> > +	unsigned char	skc_reuse:4;
> > +	unsigned char	skc_reuseport:1;
> > +	unsigned char	skc_ipv6only:1;
> > +	unsigned char	skc_net_refcnt:1;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct sock {
> > +	struct sock_common	__sk_common;
> > +	__u16			sk_type;
> > +	__u16			sk_protocol;
> > +	int			sk_rcvlowat;
> > +	__u32			sk_mark;
> > +	unsigned long		sk_max_pacing_rate;
> > +	unsigned int		keepalive_time;
> > +	unsigned int		keepalive_intvl;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct tcp_options_received {
> > +	__u16 user_mss;
> > +} __attribute__((preserve_access_index));
> 
> I'm assuming you're not using vmlinux here because it doesn't bring
> in most of the defines? Should we add missing stuff to bpf_tracing_net.h
> instead?
Ah, actually my first attempt was to use vmlinux.h, and I had all
the defines ready for addition to bpf_tracing_net.h.

However, I hit an issue in reading a bitfield.  That is why the
bitfields in the tcp_sock below are sandwiched between __u32 members.
I think it is likely LLVM and/or CO-RE related; Yonghong is
helping to investigate it.

In the meantime, I define those mini structs here.
Once the bitfield issue is resolved, we can go back to
using vmlinux.h.
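
Fwiw, the mini structs are used the same way the vmlinux.h types
would be.  A minimal sketch (assumes a sockops prog; names here are
just for illustration):

        SEC("sockops")
        int read_clamp(struct bpf_sock_ops *skops)
        {
                struct tcp_sock *tp = bpf_skc_to_tcp_sock(skops->sk);

                /* plain field access; __attribute__((preserve_access_index))
                 * makes the compiler emit CO-RE relocations, so libbpf
                 * adjusts the offsets against the running kernel's BTF
                 */
                if (tp)
                        bpf_printk("window_clamp=%u", tp->window_clamp);
                return 1;
        }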

> 
> > +struct ipv6_pinfo {
> > +	__u16			recverr:1,
> > +				sndflow:1,
> > +				repflow:1,
> > +				pmtudisc:3,
> > +				padding:1,
> > +				srcprefs:3,
> > +				dontfrag:1,
> > +				autoflowlabel:1,
> > +				autoflowlabel_set:1,
> > +				mc_all:1,
> > +				recverr_rfc4884:1,
> > +				rtalert_isolate:1;
> > +}  __attribute__((preserve_access_index));
> > +
> > +struct inet_sock {
> > +	/* sk and pinet6 have to be the first two members of inet_sock */
> > +	struct sock		sk;
> > +	struct ipv6_pinfo	*pinet6;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct inet_connection_sock {
> > +	__u32			  icsk_user_timeout;
> > +	__u8			  icsk_syn_retries;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct tcp_sock {
> > +	struct inet_connection_sock	inet_conn;
> > +	struct tcp_options_received rx_opt;
> > +	__u8	save_syn:2,
> > +		syn_data:1,
> > +		syn_fastopen:1,
> > +		syn_fastopen_exp:1,
> > +		syn_fastopen_ch:1,
> > +		syn_data_acked:1,
> > +		is_cwnd_limited:1;
> > +	__u32	window_clamp;
> > +	__u8	nonagle     : 4,
> > +		thin_lto    : 1,
> > +		recvmsg_inq : 1,
> > +		repair      : 1,
> > +		frto        : 1;
> > +	__u32	notsent_lowat;
> > +	__u8	keepalive_probes;
> > +	unsigned int		keepalive_time;
> > +	unsigned int		keepalive_intvl;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct socket {
> > +	struct sock *sk;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct loop_ctx {
> > +	void *ctx;
> > +	struct sock *sk;
> > +};
> > +
> > +static int __bpf_getsockopt(void *ctx, struct sock *sk,
> > +			    int level, int opt, int *optval,
> > +			    int optlen)
> > +{
> > +	if (level == SOL_SOCKET) {
> > +		switch (opt) {
> > +		case SO_REUSEADDR:
> > +			*optval = !!(sk->__sk_common.skc_reuse);
> > +			break;
> > +		case SO_KEEPALIVE:
> > +			*optval = !!(sk->__sk_common.skc_flags & (1UL << 3));
> > +			break;
> > +		case SO_RCVLOWAT:
> > +			*optval = sk->sk_rcvlowat;
> > +			break;
> 
> What's the idea with the options above? Why not allow them in
> bpf_getsockopt instead?
I am planning to refactor bpf_getsockopt() also, so I am trying to
avoid adding more duplicated code at this point while these values
can be read directly from the sk through PTR_TO_BTF_ID.

btw, since we are on bpf_getsockopt(), do you still see a use for
bpf_getsockopt() for those 'integer-value' optnames that can easily
be read from the sk pointer?
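
e.g. something like this (sketch only) instead of a helper call:

        static __u32 read_notsent_lowat(struct sock *sk)
        {
                struct tcp_sock *tp = bpf_skc_to_tcp_sock(sk);

                /* direct read through the BTF pointer; no
                 * bpf_getsockopt() round trip needed
                 */
                return tp ? tp->notsent_lowat : 0;
        }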

> 
> > +		case SO_MARK:
> > +			*optval = sk->sk_mark;
> > +			break;
> 
> SO_MARK should be handled by bpf_getsockopt?
Good point, will remove the SO_MARK case.

Thanks for the review!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 15/15] selftests/bpf: bpf_setsockopt tests
  2022-08-04  0:04     ` Martin KaFai Lau
@ 2022-08-04 17:03       ` Stanislav Fomichev
  2022-08-04 19:17         ` Martin KaFai Lau
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2022-08-04 17:03 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Wed, Aug 3, 2022 at 5:04 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Aug 03, 2022 at 04:30:54PM -0700, sdf@google.com wrote:
> > > +struct sock_common {
> > > +   unsigned short  skc_family;
> > > +   unsigned long   skc_flags;
> > > +   unsigned char   skc_reuse:4;
> > > +   unsigned char   skc_reuseport:1;
> > > +   unsigned char   skc_ipv6only:1;
> > > +   unsigned char   skc_net_refcnt:1;
> > > +} __attribute__((preserve_access_index));
> > > +
> > > +struct sock {
> > > +   struct sock_common      __sk_common;
> > > +   __u16                   sk_type;
> > > +   __u16                   sk_protocol;
> > > +   int                     sk_rcvlowat;
> > > +   __u32                   sk_mark;
> > > +   unsigned long           sk_max_pacing_rate;
> > > +   unsigned int            keepalive_time;
> > > +   unsigned int            keepalive_intvl;
> > > +} __attribute__((preserve_access_index));
> > > +
> > > +struct tcp_options_received {
> > > +   __u16 user_mss;
> > > +} __attribute__((preserve_access_index));
> >
> > I'm assuming you're not using vmlinux here because it doesn't bring
> > in most of the defines? Should we add the missing stuff to
> > bpf_tracing_net.h instead?
> Ah, actually my first attempt was to use vmlinux.h, and I had all
> the defines ready for addition to bpf_tracing_net.h.
>
> However, I hit an issue in reading a bitfield.  That is why the
> bitfields in the tcp_sock below are sandwiched between __u32 members.
> I think it is likely LLVM and/or CO-RE related; Yonghong is
> helping to investigate it.
>
> In the meantime, I define those mini structs here.
> Once the bitfield issue is resolved, we can go back to
> using vmlinux.h.

Oh, interesting :-)

> > > +struct ipv6_pinfo {
> > > +   __u16                   recverr:1,
> > > +                           sndflow:1,
> > > +                           repflow:1,
> > > +                           pmtudisc:3,
> > > +                           padding:1,
> > > +                           srcprefs:3,
> > > +                           dontfrag:1,
> > > +                           autoflowlabel:1,
> > > +                           autoflowlabel_set:1,
> > > +                           mc_all:1,
> > > +                           recverr_rfc4884:1,
> > > +                           rtalert_isolate:1;
> > > +}  __attribute__((preserve_access_index));
> > > +
> > > +struct inet_sock {
> > > +   /* sk and pinet6 have to be the first two members of inet_sock */
> > > +   struct sock             sk;
> > > +   struct ipv6_pinfo       *pinet6;
> > > +} __attribute__((preserve_access_index));
> > > +
> > > +struct inet_connection_sock {
> > > +   __u32                     icsk_user_timeout;
> > > +   __u8                      icsk_syn_retries;
> > > +} __attribute__((preserve_access_index));
> > > +
> > > +struct tcp_sock {
> > > +   struct inet_connection_sock     inet_conn;
> > > +   struct tcp_options_received rx_opt;
> > > +   __u8    save_syn:2,
> > > +           syn_data:1,
> > > +           syn_fastopen:1,
> > > +           syn_fastopen_exp:1,
> > > +           syn_fastopen_ch:1,
> > > +           syn_data_acked:1,
> > > +           is_cwnd_limited:1;
> > > +   __u32   window_clamp;
> > > +   __u8    nonagle     : 4,
> > > +           thin_lto    : 1,
> > > +           recvmsg_inq : 1,
> > > +           repair      : 1,
> > > +           frto        : 1;
> > > +   __u32   notsent_lowat;
> > > +   __u8    keepalive_probes;
> > > +   unsigned int            keepalive_time;
> > > +   unsigned int            keepalive_intvl;
> > > +} __attribute__((preserve_access_index));
> > > +
> > > +struct socket {
> > > +   struct sock *sk;
> > > +} __attribute__((preserve_access_index));
> > > +
> > > +struct loop_ctx {
> > > +   void *ctx;
> > > +   struct sock *sk;
> > > +};
> > > +
> > > +static int __bpf_getsockopt(void *ctx, struct sock *sk,
> > > +                       int level, int opt, int *optval,
> > > +                       int optlen)
> > > +{
> > > +   if (level == SOL_SOCKET) {
> > > +           switch (opt) {
> > > +           case SO_REUSEADDR:
> > > +                   *optval = !!(sk->__sk_common.skc_reuse);
> > > +                   break;
> > > +           case SO_KEEPALIVE:
> > > +                   *optval = !!(sk->__sk_common.skc_flags & (1UL << 3));
> > > +                   break;
> > > +           case SO_RCVLOWAT:
> > > +                   *optval = sk->sk_rcvlowat;
> > > +                   break;
> >
> > What's the idea with the options above? Why not allow them in
> > bpf_getsockopt instead?
> I am planning to refactor bpf_getsockopt() also, so I am trying to
> avoid adding more duplicated code at this point while these values
> can be read directly from the sk through PTR_TO_BTF_ID.
>
> btw, since we are on bpf_getsockopt(), do you still see a use for
> bpf_getsockopt() for those 'integer-value' optnames that can easily
> be read from the sk pointer?

Writing is still done via bpf_setsockopt, so having the same interface
to read the settings seems useful?
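
e.g. (sketch, assuming the same optname ends up accepted on both
sides):

        int val = 600;

        /* set from the bpf prog ... */
        bpf_setsockopt(ctx, SOL_TCP, TCP_KEEPIDLE, &val, sizeof(val));
        /* ... and later read it back through the symmetric interface */
        bpf_getsockopt(ctx, SOL_TCP, TCP_KEEPIDLE, &val, sizeof(val));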

> > > +           case SO_MARK:
> > > +                   *optval = sk->sk_mark;
> > > +                   break;
> >
> > SO_MARK should be handled by bpf_getsockopt?
> Good point, will remove the SO_MARK case.
>
> Thanks for the review!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-03 20:46 ` [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf Martin KaFai Lau
  2022-08-03 22:59   ` sdf
@ 2022-08-04 19:03   ` Andrii Nakryiko
  2022-08-04 19:29     ` Martin KaFai Lau
  1 sibling, 1 reply; 29+ messages in thread
From: Andrii Nakryiko @ 2022-08-04 19:03 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni, Stanislav Fomichev

On Wed, Aug 3, 2022 at 1:49 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> sk_setsockopt().  The number of supported optnames keeps
> increasing, and so does the duplicated code.
>
> One issue in reusing sk_setsockopt() is that the bpf prog
> has already acquired the sk lock.  This patch adds an in_bpf()
> to tell if sk_setsockopt() is called from a bpf prog.
> The bpf prog calling bpf_setsockopt() is either running in_task()
> or in_serving_softirq().  Both cases have current->bpf_ctx
> initialized.  Thus, in_bpf() only needs to test !!current->bpf_ctx.
>
> This patch also adds sockopt_{lock,release}_sock() helpers
> for sk_setsockopt() to use.  These helpers will test in_bpf()
> before acquiring/releasing the lock.  They are exported
> (EXPORT_SYMBOL) for the ipv6 module to use in a later patch.
>
> Note on the change in sock_setbindtodevice(): sockopt_lock_sock()
> is done in sock_setbindtodevice() instead of taking the lock
> in sock_bindtoindex(..., lock_sk = true).
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/bpf.h |  8 ++++++++
>  include/net/sock.h  |  3 +++
>  net/core/sock.c     | 26 +++++++++++++++++++++++---
>  3 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 20c26aed7896..b905b1b34fe4 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
>         return !sysctl_unprivileged_bpf_disabled;
>  }
>
> +static inline bool in_bpf(void)

I think this function deserves a big comment explaining that it's not
100% accurate, as not every BPF program type sets bpf_ctx.  As named,
in_bpf() promises a lot more generality than it actually provides.

Should this be named something more specific, maybe has_current_bpf_ctx()?
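
i.e. keep the same body under a narrower name:

        static inline bool has_current_bpf_ctx(void)
        {
                return !!current->bpf_ctx;
        }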

Also, separately, should we make an effort to set bpf_ctx for all
program types (instead of or in addition to the above)?

> +{
> +       return !!current->bpf_ctx;
> +}
>  #else /* !CONFIG_BPF_SYSCALL */
>  static inline struct bpf_prog *bpf_prog_get(u32 ufd)
>  {
> @@ -2175,6 +2179,10 @@ static inline bool unprivileged_ebpf_enabled(void)
>         return false;
>  }
>
> +static inline bool in_bpf(void)
> +{
> +       return false;
> +}
>  #endif /* CONFIG_BPF_SYSCALL */
>
>  void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
> diff --git a/include/net/sock.h b/include/net/sock.h
> index a7273b289188..b2ff230860c6 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1721,6 +1721,9 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
>         }
>  }
>
> +void sockopt_lock_sock(struct sock *sk);
> +void sockopt_release_sock(struct sock *sk);
> +
>  /* Used by processes to "lock" a socket state, so that
>   * interrupts and bottom half handlers won't change it
>   * from under us. It essentially blocks any incoming
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 20269c37ab3b..82759540ae2c 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -703,7 +703,9 @@ static int sock_setbindtodevice(struct sock *sk, sockptr_t optval, int optlen)
>                         goto out;
>         }
>
> -       return sock_bindtoindex(sk, index, true);
> +       sockopt_lock_sock(sk);
> +       ret = sock_bindtoindex_locked(sk, index);
> +       sockopt_release_sock(sk);
>  out:
>  #endif
>
> @@ -1036,6 +1038,24 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
>         return 0;
>  }
>
> +void sockopt_lock_sock(struct sock *sk)
> +{
> +       if (in_bpf())
> +               return;
> +
> +       lock_sock(sk);
> +}
> +EXPORT_SYMBOL(sockopt_lock_sock);
> +
> +void sockopt_release_sock(struct sock *sk)
> +{
> +       if (in_bpf())
> +               return;
> +
> +       release_sock(sk);
> +}
> +EXPORT_SYMBOL(sockopt_release_sock);
> +
>  /*
>   *     This is meant for all protocols to use and covers goings on
>   *     at the socket level. Everything here is generic.
> @@ -1067,7 +1087,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
>
>         valbool = val ? 1 : 0;
>
> -       lock_sock(sk);
> +       sockopt_lock_sock(sk);
>
>         switch (optname) {
>         case SO_DEBUG:
> @@ -1496,7 +1516,7 @@ static int sk_setsockopt(struct sock *sk, int level, int optname,
>                 ret = -ENOPROTOOPT;
>                 break;
>         }
> -       release_sock(sk);
> +       sockopt_release_sock(sk);
>         return ret;
>  }
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 15/15] selftests/bpf: bpf_setsockopt tests
  2022-08-04 17:03       ` Stanislav Fomichev
@ 2022-08-04 19:17         ` Martin KaFai Lau
  0 siblings, 0 replies; 29+ messages in thread
From: Martin KaFai Lau @ 2022-08-04 19:17 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni

On Thu, Aug 04, 2022 at 10:03:58AM -0700, Stanislav Fomichev wrote:
> > I am planning to refactor the bpf_getsockopt also,
> > so trying to avoid adding more dup code at this point
> > while they can directly be read from sk through PTR_TO_BTF_ID.
> >
> > btw, since we are on bpf_getsockopt(), do you still see a usage on
> > bpf_getsockopt() for those 'integer-value' optnames that can be
> > easily read from the sk pointer ?
> 
> Writing is still done via bpf_setsockopt, so having the same interface
> to read the settings seems useful?
Makes sense.  It will probably be less surprising to have a
symmetrical optname expectation on set/getsockopt.  It will be
cheaper to add to bpf_getsockopt() anyway once it is cleaned up.
I was asking because I just haven't had a new use case (adding
optnames) for bpf_getsockopt() since the bpf_skc_to_*() helpers
were introduced.

> > > > +           case SO_MARK:
> > > > +                   *optval = sk->sk_mark;
> > > > +                   break;
> > >
> > > SO_MARK should be handled by bpf_getsockopt?
> > Good point, will remove the SO_MARK case.
> >
> > Thanks for the review!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-04 19:03   ` Andrii Nakryiko
@ 2022-08-04 19:29     ` Martin KaFai Lau
  2022-08-04 20:51       ` Universally available bpf_ctx WAS: " Andrii Nakryiko
  0 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2022-08-04 19:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Paolo Abeni, Stanislav Fomichev

On Thu, Aug 04, 2022 at 12:03:04PM -0700, Andrii Nakryiko wrote:
> On Wed, Aug 3, 2022 at 1:49 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > sk_setsockopt().  The number of supported optnames keeps
> > increasing, and so does the duplicated code.
> >
> > One issue in reusing sk_setsockopt() is that the bpf prog
> > has already acquired the sk lock.  This patch adds an in_bpf()
> > to tell if sk_setsockopt() is called from a bpf prog.
> > The bpf prog calling bpf_setsockopt() is either running in_task()
> > or in_serving_softirq().  Both cases have current->bpf_ctx
> > initialized.  Thus, in_bpf() only needs to test !!current->bpf_ctx.
> >
> > This patch also adds sockopt_{lock,release}_sock() helpers
> > for sk_setsockopt() to use.  These helpers will test in_bpf()
> > before acquiring/releasing the lock.  They are exported
> > (EXPORT_SYMBOL) for the ipv6 module to use in a later patch.
> >
> > Note on the change in sock_setbindtodevice(): sockopt_lock_sock()
> > is done in sock_setbindtodevice() instead of taking the lock
> > in sock_bindtoindex(..., lock_sk = true).
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  include/linux/bpf.h |  8 ++++++++
> >  include/net/sock.h  |  3 +++
> >  net/core/sock.c     | 26 +++++++++++++++++++++++---
> >  3 files changed, 34 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 20c26aed7896..b905b1b34fe4 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
> >         return !sysctl_unprivileged_bpf_disabled;
> >  }
> >
> > +static inline bool in_bpf(void)
> 
> I think this function deserves a big comment explaining that it's not
> 100% accurate, as not every BPF program type sets bpf_ctx.  As named,
> in_bpf() promises a lot more generality than it actually provides.
>
> Should this be named something more specific, maybe has_current_bpf_ctx()?
Stan also made a similar point about adding a comment.
The rename makes sense until all bpf progs have bpf_ctx.  in_bpf() was
just the name used in the v1 discussion for the setsockopt context.

> Also, separately, should we make an effort to set bpf_ctx for all
> program types (instead of or in addition to the above)?
I would prefer to keep this as a separate effort.  This set is
getting pretty long and the bpf_getsockopt() part is still not posted.

If you prefer that this be done first, I can do that also.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Universally available bpf_ctx WAS: Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-04 19:29     ` Martin KaFai Lau
@ 2022-08-04 20:51       ` Andrii Nakryiko
  2022-08-04 21:43         ` Stanislav Fomichev
  2022-08-05  0:29         ` Martin KaFai Lau
  0 siblings, 2 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2022-08-04 20:51 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Networking, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	Kernel Team, Paolo Abeni, Stanislav Fomichev

On Thu, Aug 4, 2022 at 12:29 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Thu, Aug 04, 2022 at 12:03:04PM -0700, Andrii Nakryiko wrote:
> > On Wed, Aug 3, 2022 at 1:49 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > sk_setsockopt().  The number of supported optnames keeps
> > > increasing, and so does the duplicated code.
> > >
> > > One issue in reusing sk_setsockopt() is that the bpf prog
> > > has already acquired the sk lock.  This patch adds an in_bpf()
> > > to tell if sk_setsockopt() is called from a bpf prog.
> > > The bpf prog calling bpf_setsockopt() is either running in_task()
> > > or in_serving_softirq().  Both cases have current->bpf_ctx
> > > initialized.  Thus, in_bpf() only needs to test !!current->bpf_ctx.
> > >
> > > This patch also adds sockopt_{lock,release}_sock() helpers
> > > for sk_setsockopt() to use.  These helpers will test in_bpf()
> > > before acquiring/releasing the lock.  They are exported
> > > (EXPORT_SYMBOL) for the ipv6 module to use in a later patch.
> > >
> > > Note on the change in sock_setbindtodevice(): sockopt_lock_sock()
> > > is done in sock_setbindtodevice() instead of taking the lock
> > > in sock_bindtoindex(..., lock_sk = true).
> > >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> > >  include/linux/bpf.h |  8 ++++++++
> > >  include/net/sock.h  |  3 +++
> > >  net/core/sock.c     | 26 +++++++++++++++++++++++---
> > >  3 files changed, 34 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 20c26aed7896..b905b1b34fe4 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
> > >         return !sysctl_unprivileged_bpf_disabled;
> > >  }
> > >
> > > +static inline bool in_bpf(void)
> >
> > I think this function deserves a big comment explaining that it's not
> > 100% accurate, as not every BPF program type sets bpf_ctx.  As named,
> > in_bpf() promises a lot more generality than it actually provides.
> >
> > Should this be named something more specific, maybe has_current_bpf_ctx()?
> Stan also made a similar point about adding a comment.
> The rename makes sense until all bpf progs have bpf_ctx.  in_bpf() was
> just the name used in the v1 discussion for the setsockopt context.
>
> > Also, separately, should we make an effort to set bpf_ctx for all
> > program types (instead of or in addition to the above)?
> I would prefer to keep this as a separate effort.  This set is
> getting pretty long and the bpf_getsockopt() part is still not posted.

Yeah, sure, I don't think you should be blocked on that.

>
> If you prefer that this be done first, I can do that also.

I wanted to bring this up for discussion.  I find bpf_ctx a very useful
construct; if we had it available universally, we could use it
(reliably) for this in_bpf() check.  We could also have a sleepable vs
non-sleepable flag stored in such a context and thus avoid all the
special handling we have for providing different gfp flags, etc.
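
Something like this, as a completely hypothetical sketch (the flag
and helper names are invented):

        /* hypothetical: extend the run ctx with a sleepable flag */
        struct bpf_run_ctx {
                /* ...existing members... */
                bool sleepable;
        };

        static inline gfp_t bpf_run_gfp_flags(void)
        {
                struct bpf_run_ctx *run_ctx = current->bpf_ctx;

                /* sleepable progs can block; others need atomic allocs */
                return run_ctx && run_ctx->sleepable ? GFP_KERNEL : GFP_ATOMIC;
        }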

But it's not just up to me to decide whether we want to add it for all
program types (e.g., I wouldn't be surprised if I got pushback adding
this to XDP).  Most program types I normally use already have bpf_ctx
(and bpf_cookie built on top), but I was wondering how others feel
about making this (bpf_ctx in general, bpf_cookie in particular)
universally available.

So please proceed with your changes, I just used your patch as an
anchor for this discussion :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Universally available bpf_ctx WAS: Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-04 20:51       ` Universally available bpf_ctx WAS: " Andrii Nakryiko
@ 2022-08-04 21:43         ` Stanislav Fomichev
  2022-08-05  0:29         ` Martin KaFai Lau
  1 sibling, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2022-08-04 21:43 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Martin KaFai Lau, bpf, Networking, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, David Miller, Eric Dumazet,
	Jakub Kicinski, Kernel Team, Paolo Abeni

On Thu, Aug 4, 2022 at 1:51 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Aug 4, 2022 at 12:29 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Thu, Aug 04, 2022 at 12:03:04PM -0700, Andrii Nakryiko wrote:
> > > On Wed, Aug 3, 2022 at 1:49 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > > sk_setsockopt().  The number of supported optnames keeps
> > > > increasing, and so does the duplicated code.
> > > >
> > > > One issue in reusing sk_setsockopt() is that the bpf prog
> > > > has already acquired the sk lock.  This patch adds an in_bpf()
> > > > to tell if sk_setsockopt() is called from a bpf prog.
> > > > The bpf prog calling bpf_setsockopt() is either running in_task()
> > > > or in_serving_softirq().  Both cases have current->bpf_ctx
> > > > initialized.  Thus, in_bpf() only needs to test !!current->bpf_ctx.
> > > >
> > > > This patch also adds sockopt_{lock,release}_sock() helpers
> > > > for sk_setsockopt() to use.  These helpers will test in_bpf()
> > > > before acquiring/releasing the lock.  They are exported
> > > > (EXPORT_SYMBOL) for the ipv6 module to use in a later patch.
> > > >
> > > > Note on the change in sock_setbindtodevice(): sockopt_lock_sock()
> > > > is done in sock_setbindtodevice() instead of taking the lock
> > > > in sock_bindtoindex(..., lock_sk = true).
> > > >
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > > >  include/linux/bpf.h |  8 ++++++++
> > > >  include/net/sock.h  |  3 +++
> > > >  net/core/sock.c     | 26 +++++++++++++++++++++++---
> > > >  3 files changed, 34 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 20c26aed7896..b905b1b34fe4 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
> > > >         return !sysctl_unprivileged_bpf_disabled;
> > > >  }
> > > >
> > > > +static inline bool in_bpf(void)
> > >
> > > I think this function deserves a big comment explaining that it's not
> > > 100% accurate, as not every BPF program type sets bpf_ctx.  As named,
> > > in_bpf() promises a lot more generality than it actually provides.
> > >
> > > Should this be named something more specific, maybe has_current_bpf_ctx()?
> > Stan also made a similar point about adding a comment.
> > The rename makes sense until all bpf progs have bpf_ctx.  in_bpf() was
> > just the name used in the v1 discussion for the setsockopt context.
> >
> > > Also, separately, should we make an effort to set bpf_ctx for all
> > > program types (instead of or in addition to the above)?
> > I would prefer to keep this as a separate effort.  This set is
> > getting pretty long and the bpf_getsockopt() part is still not posted.
>
> Yeah, sure, I don't think you should be blocked on that.
>
> >
> > If you prefer that this be done first, I can do that also.
>
> I wanted to bring this up for discussion.  I find bpf_ctx a very useful
> construct; if we had it available universally, we could use it
> (reliably) for this in_bpf() check.  We could also have a sleepable vs
> non-sleepable flag stored in such a context and thus avoid all the
> special handling we have for providing different gfp flags, etc.

+1

> But it's not just up to me to decide whether we want to add it for all
> program types (e.g., I wouldn't be surprised if I got pushback adding
> this to XDP).  Most program types I normally use already have bpf_ctx
> (and bpf_cookie built on top), but I was wondering how others feel
> about making this (bpf_ctx in general, bpf_cookie in particular)
> universally available.

If we can get a universal bpf_ctx, do we still need bpf_prog_active?
Regarding XDP: hopefully assigning a bunch of pointers shouldn't be
that big of a deal?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Universally available bpf_ctx WAS: Re: [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
  2022-08-04 20:51       ` Universally available bpf_ctx WAS: " Andrii Nakryiko
  2022-08-04 21:43         ` Stanislav Fomichev
@ 2022-08-05  0:29         ` Martin KaFai Lau
  1 sibling, 0 replies; 29+ messages in thread
From: Martin KaFai Lau @ 2022-08-05  0:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	Kernel Team, Paolo Abeni, Stanislav Fomichev

On Thu, Aug 04, 2022 at 01:51:12PM -0700, Andrii Nakryiko wrote:
> On Thu, Aug 4, 2022 at 12:29 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Thu, Aug 04, 2022 at 12:03:04PM -0700, Andrii Nakryiko wrote:
> > > On Wed, Aug 3, 2022 at 1:49 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > > sk_setsockopt().  The number of supported optnames keeps
> > > > increasing, and so does the duplicated code.
> > > >
> > > > One issue in reusing sk_setsockopt() is that the bpf prog
> > > > has already acquired the sk lock.  This patch adds an in_bpf()
> > > > to tell if sk_setsockopt() is called from a bpf prog.
> > > > The bpf prog calling bpf_setsockopt() is either running in_task()
> > > > or in_serving_softirq().  Both cases have current->bpf_ctx
> > > > initialized.  Thus, in_bpf() only needs to test !!current->bpf_ctx.
> > > >
> > > > This patch also adds sockopt_{lock,release}_sock() helpers
> > > > for sk_setsockopt() to use.  These helpers will test in_bpf()
> > > > before acquiring/releasing the lock.  They are exported
> > > > (EXPORT_SYMBOL) for the ipv6 module to use in a later patch.
> > > >
> > > > Note on the change in sock_setbindtodevice(): sockopt_lock_sock()
> > > > is done in sock_setbindtodevice() instead of taking the lock
> > > > in sock_bindtoindex(..., lock_sk = true).
> > > >
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > > >  include/linux/bpf.h |  8 ++++++++
> > > >  include/net/sock.h  |  3 +++
> > > >  net/core/sock.c     | 26 +++++++++++++++++++++++---
> > > >  3 files changed, 34 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 20c26aed7896..b905b1b34fe4 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1966,6 +1966,10 @@ static inline bool unprivileged_ebpf_enabled(void)
> > > >         return !sysctl_unprivileged_bpf_disabled;
> > > >  }
> > > >
> > > > +static inline bool in_bpf(void)
> > >
> > > I think this function deserves a big comment explaining that it's not
> > > 100% accurate, as not every BPF program type sets bpf_ctx.  As named,
> > > in_bpf() promises a lot more generality than it actually provides.
> > >
> > > Should this be named something more specific, maybe has_current_bpf_ctx()?
> > Stan also made a similar point about adding a comment.
> > The rename makes sense until all bpf progs have bpf_ctx.  in_bpf() was
> > just the name used in the v1 discussion for the setsockopt context.
> >
> > > Also, separately, should we make an effort to set bpf_ctx for all
> > > program types (instead of or in addition to the above)?
> > I would prefer to keep this as a separate effort.  This set is
> > getting pretty long and the bpf_getsockopt() part is still not posted.
> 
> Yeah, sure, I don't think you should be blocked on that.
> 
> >
> > If you prefer that this be done first, I can do that also.
> 
> I wanted to bring this up for discussion.  I find bpf_ctx a very useful
> construct; if we had it available universally, we could use it
> (reliably) for this in_bpf() check.  We could also have a sleepable vs
> non-sleepable flag stored in such a context and thus avoid all the
> special handling we have for providing different gfp flags, etc.
> 
> But it's not just up to me to decide whether we want to add it for all
> program types (e.g., I wouldn't be surprised if I got pushback adding
> this to XDP).  Most program types I normally use already have bpf_ctx
> (and bpf_cookie built on top), but I was wondering how others feel
> about making this (bpf_ctx in general, bpf_cookie in particular)
> universally available.
It may be easier to reason about adding bpf_ctx together with a use
case.  Among networking progs, cgroup-bpf has one (for storage and
retval), and struct_ops has one (it came automatically from the
trampoline but finally becomes useful in setsockopt here).

I don't think other network prog types have it now.  As for sleepable
or not, I am not sure those other network programs will ever be
sleepable.  tc-bpf cannot call setsockopt either.

> 
> So please proceed with your changes, I just used your patch as an
> anchor for this discussion :)
+1

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-08-05  0:30 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-03 20:46 [PATCH v2 bpf-next 00/15] bpf: net: Remove duplicated code from bpf_setsockopt() Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 01/15] net: Add sk_setsockopt() to take the sk ptr instead of the sock ptr Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 02/15] bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf Martin KaFai Lau
2022-08-03 22:59   ` sdf
2022-08-03 23:19     ` Martin KaFai Lau
2022-08-03 23:24       ` Stanislav Fomichev
2022-08-03 23:35         ` Martin KaFai Lau
2022-08-04 19:03   ` Andrii Nakryiko
2022-08-04 19:29     ` Martin KaFai Lau
2022-08-04 20:51       ` Universally available bpf_ctx WAS: " Andrii Nakryiko
2022-08-04 21:43         ` Stanislav Fomichev
2022-08-05  0:29         ` Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 03/15] bpf: net: Consider in_bpf() when testing capable() in sk_setsockopt() Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 04/15] bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable() Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 05/15] bpf: net: Change do_ip_setsockopt() " Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 06/15] bpf: net: Change do_ipv6_setsockopt() " Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 07/15] bpf: Initialize the bpf_run_ctx in bpf_iter_run_prog() Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 08/15] bpf: Embed kernel CONFIG check into the if statement in bpf_setsockopt Martin KaFai Lau
2022-08-03 20:46 ` [PATCH v2 bpf-next 09/15] bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt() Martin KaFai Lau
2022-08-03 20:47 ` [PATCH v2 bpf-next 10/15] bpf: Refactor bpf specific tcp optnames to a new function Martin KaFai Lau
2022-08-03 20:47 ` [PATCH v2 bpf-next 11/15] bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt() Martin KaFai Lau
2022-08-03 20:47 ` [PATCH v2 bpf-next 12/15] bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt() Martin KaFai Lau
2022-08-03 20:47 ` [PATCH v2 bpf-next 13/15] bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt() Martin KaFai Lau
2022-08-03 20:47 ` [PATCH v2 bpf-next 14/15] bpf: Add a few optnames to bpf_setsockopt Martin KaFai Lau
2022-08-03 20:47 ` [PATCH v2 bpf-next 15/15] selftests/bpf: bpf_setsockopt tests Martin KaFai Lau
2022-08-03 23:30   ` sdf
2022-08-04  0:04     ` Martin KaFai Lau
2022-08-04 17:03       ` Stanislav Fomichev
2022-08-04 19:17         ` Martin KaFai Lau
