bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: bpf@vger.kernel.org
Cc: netdev@vger.kernel.org, kernel-team@cloudflare.com,
	John Fastabend <john.fastabend@gmail.com>,
	Martin KaFai Lau <kafai@fb.com>
Subject: [PATCH bpf-next v5 03/12] net, sk_msg: Clear sk_user_data pointer on clone if tagged
Date: Mon, 27 Jan 2020 13:55:25 +0100	[thread overview]
Message-ID: <20200127125534.137492-4-jakub@cloudflare.com> (raw)
In-Reply-To: <20200127125534.137492-1-jakub@cloudflare.com>

sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, we don't hold any locks at that time.

This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned at least 2 bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms our assumption.

Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
when the flag-bit is set when setting sk_user_data to catch any future
misuses.

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/sock.h | 37 +++++++++++++++++++++++++++++++++++--
 net/core/skmsg.c   |  2 +-
 net/core/sock.c    |  6 ++++++
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 02162b0378f7..9f37fdfd15d4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -502,10 +502,43 @@ enum sk_pacing {
 	SK_PACING_FQ		= 2,
 };
 
+/* Pointer stored in sk_user_data might not be suitable for copying
+ * when cloning the socket. For instance, it can point to a reference
+ * counted object. sk_user_data bottom bit is set if pointer must not
+ * be copied.
+ */
+#define SK_USER_DATA_NOCOPY	1UL
+#define SK_USER_DATA_PTRMASK	~(SK_USER_DATA_NOCOPY)
+
+/**
+ * sk_user_data_is_nocopy - Test if sk_user_data pointer must not be copied
+ * @sk: socket
+ */
+static inline bool sk_user_data_is_nocopy(const struct sock *sk)
+{
+	return ((uintptr_t)sk->sk_user_data & SK_USER_DATA_NOCOPY);
+}
+
 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
 
-#define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
-#define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
+#define rcu_dereference_sk_user_data(sk)				\
+({									\
+	void *__tmp = rcu_dereference(__sk_user_data((sk)));		\
+	(void *)((uintptr_t)__tmp & SK_USER_DATA_PTRMASK);		\
+})
+#define rcu_assign_sk_user_data(sk, ptr)				\
+({									\
+	uintptr_t __tmp = (uintptr_t)(ptr);				\
+	WARN_ON_ONCE(__tmp & ~SK_USER_DATA_PTRMASK);			\
+	rcu_assign_pointer(__sk_user_data((sk)), __tmp);		\
+})
+#define rcu_assign_sk_user_data_nocopy(sk, ptr)				\
+({									\
+	uintptr_t __tmp = (uintptr_t)(ptr);				\
+	WARN_ON_ONCE(__tmp & ~SK_USER_DATA_PTRMASK);			\
+	rcu_assign_pointer(__sk_user_data((sk)),			\
+			   __tmp | SK_USER_DATA_NOCOPY);		\
+})
 
 /*
  * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 3866d7e20c07..cd133688fd78 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -512,7 +512,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
 	refcount_set(&psock->refcnt, 1);
 
-	rcu_assign_sk_user_data(sk, psock);
+	rcu_assign_sk_user_data_nocopy(sk, psock);
 	sock_hold(sk);
 
 	return psock;
diff --git a/net/core/sock.c b/net/core/sock.c
index 3953bb23f4d0..74662943af5c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1864,6 +1864,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 			goto out;
 		}
 
+		/* Clear sk_user_data if parent had the pointer tagged
+		 * as not suitable for copying when cloning.
+		 */
+		if (sk_user_data_is_nocopy(newsk))
+			RCU_INIT_POINTER(newsk->sk_user_data, NULL);
+
 		newsk->sk_err	   = 0;
 		newsk->sk_err_soft = 0;
 		newsk->sk_priority = 0;
-- 
2.24.1


  parent reply	other threads:[~2020-01-27 12:55 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-27 12:55 [PATCH bpf-next v5 00/12] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 01/12] bpf, sk_msg: Don't clear saved sock proto on restore Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 02/12] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
2020-01-27 12:55 ` Jakub Sitnicki [this message]
2020-01-27 12:55 ` [PATCH bpf-next v5 04/12] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 05/12] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 06/12] bpf, sockmap: Don't set up sockmap progs for listening sockets Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 07/12] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 08/12] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 09/12] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 10/12] net: Generate reuseport group ID on group creation Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 11/12] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2020-01-27 12:55 ` [PATCH bpf-next v5 12/12] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
2020-03-11 18:48   ` Andrii Nakryiko
2020-03-11 19:49     ` Jakub Sitnicki
2020-03-11 21:02       ` Andrii Nakryiko
2020-03-12 17:25         ` Jakub Sitnicki
2020-01-27 13:01 ` [PATCH bpf-next v5 00/12] Extend SOCKMAP to store " Jakub Sitnicki
2020-01-27 13:05 ` Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200127125534.137492-4-jakub@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=bpf@vger.kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=kafai@fb.com \
    --cc=kernel-team@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    --subject='Re: [PATCH bpf-next v5 03/12] net, sk_msg: Clear sk_user_data pointer on clone if tagged' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).