All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net v3 0/3] Stop corrupting socket's task_frag
@ 2022-12-13 11:10 Benjamin Coddington
  2022-12-13 11:10 ` [PATCH net v3 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Benjamin Coddington @ 2022-12-13 11:10 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

The networking code uses flags in sk_allocation to determine if it can use
current->task_frag, however in-kernel users of sockets may stop setting
sk_allocation when they convert to the preferred memalloc_nofs_save/restore,
as SUNRPC has done in commit a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save()
on all rpciod/xprtiod jobs").

This will cause corruption in current->task_frag when recursing into the
network layer for those subsystems during page fault or reclaim.  The
corruption is difficult to diagnose because stack traces may not contain the
offending subsystem at all.  The corruption is unlikely to show up in
testing because it requires memory pressure, and so subsystems that
convert to memalloc_nofs_save/restore are likely to continue to run into
this issue.

Previous reports and proposed fixes:
https://lore.kernel.org/netdev/96a18bd00cbc6cb554603cc0d6ef1c551965b078.1663762494.git.gnault@redhat.com/
https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
https://lore.kernel.org/linux-nfs/de6d99321d1dcaa2ad456b92b3680aa77c07a747.1665401788.git.gnault@redhat.com/

Guilluame Nault has done all of the hard work tracking this problem down and
finding the best fix for this issue.  I'm just taking a turn posting another
fix.

Changes on v2:
	- rebased on -net
	- set sk_use_task_frag = false for xfrm/espintcp.c

Changes on v3:
	- fixup comments in sock.h for kernel-doc

Benjamin Coddington (2):
  Treewide: Stop corrupting socket's task_frag
  net: simplify sk_page_frag

Guillaume Nault (1):
  net: Introduce sk_use_task_frag in struct sock.

 drivers/block/drbd/drbd_receiver.c |  3 +++
 drivers/block/nbd.c                |  1 +
 drivers/nvme/host/tcp.c            |  1 +
 drivers/scsi/iscsi_tcp.c           |  1 +
 drivers/usb/usbip/usbip_common.c   |  1 +
 fs/afs/rxrpc.c                     |  1 +
 fs/cifs/connect.c                  |  1 +
 fs/dlm/lowcomms.c                  |  2 ++
 fs/ocfs2/cluster/tcp.c             |  1 +
 include/net/sock.h                 | 10 ++++++----
 net/9p/trans_fd.c                  |  1 +
 net/ceph/messenger.c               |  1 +
 net/core/sock.c                    |  1 +
 net/sunrpc/xprtsock.c              |  3 +++
 net/xfrm/espintcp.c                |  1 +
 15 files changed, 25 insertions(+), 4 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net v3 1/3] net: Introduce sk_use_task_frag in struct sock.
  2022-12-13 11:10 [PATCH net v3 0/3] Stop corrupting socket's task_frag Benjamin Coddington
@ 2022-12-13 11:10 ` Benjamin Coddington
  2022-12-15 12:08   ` Paolo Abeni
  2022-12-13 11:10 ` [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Benjamin Coddington @ 2022-12-13 11:10 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

From: Guillaume Nault <gnault@redhat.com>

Sockets that can be used while recursing into memory reclaim, like
those used by network block devices and file systems, mustn't use
current->task_frag: if the current process is already using it, then
the inner memory reclaim call would corrupt the task_frag structure.

To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
that mustn't use current->task_frag, assuming that those used during
memory reclaim had their allocation constraints reflected in
->sk_allocation.

This unfortunately doesn't cover all cases: in an attempt to remove all
usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
->sk_allocation, and used memalloc_nofs critical sections instead.
This breaks the sk_page_frag() heuristic since the allocation
constraints are now stored in current->flags, which sk_page_frag()
can't read without risking triggering a cache miss and slowing down
TCP's fast path.

This patch creates a new field in struct sock, named sk_use_task_frag,
which sockets with memory reclaim constraints can set to false if they
can't safely use current->task_frag. In such cases, sk_page_frag() now
always returns the socket's page_frag (->sk_frag). The first user is
sunrpc, which needs to avoid using current->task_frag but can keep
->sk_allocation set to GFP_KERNEL otherwise.

Eventually, it might be possible to simplify sk_page_frag() by only
testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
heuristic entirely (assuming other sockets will set ->sk_use_task_frag
according to their constraints in the future).

The new ->sk_use_task_frag field is placed in a hole in struct sock and
belongs to a cache line shared with ->sk_shutdown. Therefore it should
be hot and shouldn't have negative performance impacts on TCP's fast
path (sk_shutdown is tested just before the while() loop in
tcp_sendmsg_locked()).

Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
---
 include/net/sock.h | 11 +++++++++--
 net/core/sock.c    |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index e0517ecc6531..44380c6dc6c4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -318,6 +318,9 @@ struct sk_filter;
   *	@sk_stamp: time stamp of last packet received
   *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
   *	@sk_tsflags: SO_TIMESTAMPING flags
+  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
+  *			   Sockets that can be used under memory reclaim should
+  *			   set this to false.
   *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
   *	              for timestamping
   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
@@ -505,6 +508,7 @@ struct sock {
 #endif
 	u16			sk_tsflags;
 	u8			sk_shutdown;
+	bool			sk_use_task_frag;
 	atomic_t		sk_tskey;
 	atomic_t		sk_zckey;
 
@@ -2561,14 +2565,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
  * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags.
+ * This assumes that page fault handlers use the GFP_NOFS flags or
+ * explicitly disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
+	if (sk->sk_use_task_frag &&
+	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
+				  __GFP_FS)) ==
 	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
 		return &current->task_frag;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index a3ba0358c77c..cc113500d442 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3368,6 +3368,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_rcvbuf		=	READ_ONCE(sysctl_rmem_default);
 	sk->sk_sndbuf		=	READ_ONCE(sysctl_wmem_default);
 	sk->sk_state		=	TCP_CLOSE;
+	sk->sk_use_task_frag	=	true;
 	sk_set_socket(sk, sock);
 
 	sock_set_flag(sk, SOCK_ZAPPED);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-13 11:10 [PATCH net v3 0/3] Stop corrupting socket's task_frag Benjamin Coddington
  2022-12-13 11:10 ` [PATCH net v3 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
@ 2022-12-13 11:10 ` Benjamin Coddington
  2022-12-13 11:10 ` [PATCH net v3 3/3] net: simplify sk_page_frag Benjamin Coddington
  2022-12-15 12:12 ` [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag David Howells
  3 siblings, 0 replies; 9+ messages in thread
From: Benjamin Coddington @ 2022-12-13 11:10 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.  The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code.  I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false.  Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
---
 drivers/block/drbd/drbd_receiver.c | 3 +++
 drivers/block/nbd.c                | 1 +
 drivers/nvme/host/tcp.c            | 1 +
 drivers/scsi/iscsi_tcp.c           | 1 +
 drivers/usb/usbip/usbip_common.c   | 1 +
 fs/afs/rxrpc.c                     | 1 +
 fs/cifs/connect.c                  | 1 +
 fs/dlm/lowcomms.c                  | 2 ++
 fs/ocfs2/cluster/tcp.c             | 1 +
 net/9p/trans_fd.c                  | 1 +
 net/ceph/messenger.c               | 1 +
 net/sunrpc/xprtsock.c              | 3 +++
 net/xfrm/espintcp.c                | 1 +
 13 files changed, 18 insertions(+)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index ee69d50ba4fd..0d3f910ae347 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1030,6 +1030,9 @@ static int conn_connect(struct drbd_connection *connection)
 	sock.socket->sk->sk_allocation = GFP_NOIO;
 	msock.socket->sk->sk_allocation = GFP_NOIO;
 
+	sock.socket->sk->sk_use_task_frag = false;
+	msock.socket->sk->sk_use_task_frag = false;
+
 	sock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE_BULK;
 	msock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE;
 
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 5cffd96ef2d7..3a46b776354d 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -512,6 +512,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 	noreclaim_flag = memalloc_noreclaim_save();
 	do {
 		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
+		sock->sk->sk_use_task_frag = false;
 		msg.msg_name = NULL;
 		msg.msg_namelen = 0;
 		msg.msg_control = NULL;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9b47dcb2a7d9..fe772d6c4c96 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1537,6 +1537,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
 	queue->sock->sk->sk_rcvtimeo = 10 * HZ;
 
 	queue->sock->sk->sk_allocation = GFP_ATOMIC;
+	queue->sock->sk->sk_use_task_frag = false;
 	nvme_tcp_set_queue_io_cpu(queue);
 	queue->request = NULL;
 	queue->data_remaining = 0;
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 5fb1f364e815..1d1cf641937c 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -738,6 +738,7 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
 	sk->sk_reuse = SK_CAN_REUSE;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
 	sk->sk_allocation = GFP_ATOMIC;
+	sk->sk_use_task_frag = false;
 	sk_set_memalloc(sk);
 	sock_no_linger(sk);
 
diff --git a/drivers/usb/usbip/usbip_common.c b/drivers/usb/usbip/usbip_common.c
index 053a2bca4c47..e15ae6ca95ea 100644
--- a/drivers/usb/usbip/usbip_common.c
+++ b/drivers/usb/usbip/usbip_common.c
@@ -315,6 +315,7 @@ int usbip_recv(struct socket *sock, void *buf, int size)
 
 	do {
 		sock->sk->sk_allocation = GFP_NOIO;
+		sock->sk->sk_use_task_frag = false;
 
 		result = sock_recvmsg(sock, &msg, MSG_WAITALL);
 		if (result <= 0)
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index eccc3cd0cb70..ac75ad18db83 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -46,6 +46,7 @@ int afs_open_socket(struct afs_net *net)
 		goto error_1;
 
 	socket->sk->sk_allocation = GFP_NOFS;
+	socket->sk->sk_use_task_frag = false;
 
 	/* bind the callback manager's address to make this a server socket */
 	memset(&srx, 0, sizeof(srx));
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 9db9527c61cf..d84f1660cacb 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -2944,6 +2944,7 @@ generic_ip_connect(struct TCP_Server_Info *server)
 		cifs_dbg(FYI, "Socket created\n");
 		server->ssocket = socket;
 		socket->sk->sk_allocation = GFP_NOFS;
+		socket->sk->sk_use_task_frag = false;
 		if (sfamily == AF_INET6)
 			cifs_reclassify_socket6(socket);
 		else
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 59f64c596233..120be782edbc 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -699,6 +699,7 @@ static void add_listen_sock(struct socket *sock, struct listen_connection *con)
 
 	sk->sk_user_data = con;
 	sk->sk_allocation = GFP_NOFS;
+	sk->sk_use_task_frag = false;
 	/* Install a data_ready callback */
 	sk->sk_data_ready = lowcomms_listen_data_ready;
 	release_sock(sk);
@@ -718,6 +719,7 @@ static void add_sock(struct socket *sock, struct connection *con)
 	sk->sk_write_space = lowcomms_write_space;
 	sk->sk_state_change = lowcomms_state_change;
 	sk->sk_allocation = GFP_NOFS;
+	sk->sk_use_task_frag = false;
 	sk->sk_error_report = lowcomms_error_report;
 	release_sock(sk);
 }
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index f660c0dbdb63..3eaafa5e5ec4 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1604,6 +1604,7 @@ static void o2net_start_connect(struct work_struct *work)
 	sc->sc_sock = sock; /* freed by sc_kref_release */
 
 	sock->sk->sk_allocation = GFP_ATOMIC;
+	sock->sk->sk_use_task_frag = false;
 
 	myaddr.sin_family = AF_INET;
 	myaddr.sin_addr.s_addr = mynode->nd_ipv4_address;
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 07db2f436d44..d9120f14684b 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -868,6 +868,7 @@ static int p9_socket_open(struct p9_client *client, struct socket *csocket)
 	}
 
 	csocket->sk->sk_allocation = GFP_NOIO;
+	csocket->sk->sk_use_task_frag = false;
 	file = sock_alloc_file(csocket, 0, NULL);
 	if (IS_ERR(file)) {
 		pr_err("%s (%d): failed to map fd\n",
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index dfa237fbd5a3..1d06e114ba3f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -446,6 +446,7 @@ int ceph_tcp_connect(struct ceph_connection *con)
 	if (ret)
 		return ret;
 	sock->sk->sk_allocation = GFP_NOFS;
+	sock->sk->sk_use_task_frag = false;
 
 #ifdef CONFIG_LOCKDEP
 	lockdep_set_class(&sock->sk->sk_lock, &socket_class);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 915b9902f673..41ffc2169743 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1882,6 +1882,7 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt,
 		sk->sk_write_space = xs_udp_write_space;
 		sk->sk_state_change = xs_local_state_change;
 		sk->sk_error_report = xs_error_report;
+		sk->sk_use_task_frag = false;
 
 		xprt_clear_connected(xprt);
 
@@ -2082,6 +2083,7 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		sk->sk_user_data = xprt;
 		sk->sk_data_ready = xs_data_ready;
 		sk->sk_write_space = xs_udp_write_space;
+		sk->sk_use_task_frag = false;
 
 		xprt_set_connected(xprt);
 
@@ -2249,6 +2251,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		sk->sk_state_change = xs_tcp_state_change;
 		sk->sk_write_space = xs_tcp_write_space;
 		sk->sk_error_report = xs_error_report;
+		sk->sk_use_task_frag = false;
 
 		/* socket options */
 		sock_reset_flag(sk, SOCK_LINGER);
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index 29a540dcb5a7..4ca2c5927ace 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -489,6 +489,7 @@ static int espintcp_init_sk(struct sock *sk)
 
 	/* avoid using task_frag */
 	sk->sk_allocation = GFP_ATOMIC;
+	sk->sk_use_task_frag = false;
 
 	return 0;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net v3 3/3] net: simplify sk_page_frag
  2022-12-13 11:10 [PATCH net v3 0/3] Stop corrupting socket's task_frag Benjamin Coddington
  2022-12-13 11:10 ` [PATCH net v3 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
  2022-12-13 11:10 ` [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
@ 2022-12-13 11:10 ` Benjamin Coddington
  2022-12-15 12:12 ` [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag David Howells
  3 siblings, 0 replies; 9+ messages in thread
From: Benjamin Coddington @ 2022-12-13 11:10 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

Now that in-kernel socket users that may recurse during reclaim have benn
converted to sk_use_task_frag = false, we can have sk_page_frag() simply
check that value.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
---
 include/net/sock.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 44380c6dc6c4..8e23c6d5e899 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2564,19 +2564,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * Both direct reclaim and page faults can nest inside other
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
- * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags or
- * explicitly disable sk_use_task_frag.
+ * when users disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (sk->sk_use_task_frag &&
-	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
-				  __GFP_FS)) ==
-	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
+	if (sk->sk_use_task_frag)
 		return &current->task_frag;
 
 	return &sk->sk_frag;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH net v3 1/3] net: Introduce sk_use_task_frag in struct sock.
  2022-12-13 11:10 ` [PATCH net v3 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
@ 2022-12-15 12:08   ` Paolo Abeni
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2022-12-15 12:08 UTC (permalink / raw)
  To: Benjamin Coddington, Jakub Kicinski, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

On Tue, 2022-12-13 at 06:10 -0500, Benjamin Coddington wrote:
> From: Guillaume Nault <gnault@redhat.com>
> 
> Sockets that can be used while recursing into memory reclaim, like
> those used by network block devices and file systems, mustn't use
> current->task_frag: if the current process is already using it, then
> the inner memory reclaim call would corrupt the task_frag structure.
> 
> To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
> that mustn't use current->task_frag, assuming that those used during
> memory reclaim had their allocation constraints reflected in
> ->sk_allocation.
> 
> This unfortunately doesn't cover all cases: in an attempt to remove all
> usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
> ->sk_allocation, and used memalloc_nofs critical sections instead.
> This breaks the sk_page_frag() heuristic since the allocation
> constraints are now stored in current->flags, which sk_page_frag()
> can't read without risking triggering a cache miss and slowing down
> TCP's fast path.
> 
> This patch creates a new field in struct sock, named sk_use_task_frag,
> which sockets with memory reclaim constraints can set to false if they
> can't safely use current->task_frag. In such cases, sk_page_frag() now
> always returns the socket's page_frag (->sk_frag). The first user is
> sunrpc, which needs to avoid using current->task_frag but can keep
> ->sk_allocation set to GFP_KERNEL otherwise.
> 
> Eventually, it might be possible to simplify sk_page_frag() by only
> testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
> heuristic entirely (assuming other sockets will set ->sk_use_task_frag
> according to their constraints in the future).
> 
> The new ->sk_use_task_frag field is placed in a hole in struct sock and
> belongs to a cache line shared with ->sk_shutdown. Therefore it should
> be hot and shouldn't have negative performance impacts on TCP's fast
> path (sk_shutdown is tested just before the while() loop in
> tcp_sendmsg_locked()).
> 
> Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
> Signed-off-by: Guillaume Nault <gnault@redhat.com>
> Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
> ---
>  include/net/sock.h | 11 +++++++++--
>  net/core/sock.c    |  1 +
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e0517ecc6531..44380c6dc6c4 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -318,6 +318,9 @@ struct sk_filter;
>    *	@sk_stamp: time stamp of last packet received
>    *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
>    *	@sk_tsflags: SO_TIMESTAMPING flags
> +  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
> +  *			   Sockets that can be used under memory reclaim should
> +  *			   set this to false.
>    *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
>    *	              for timestamping
>    *	@sk_tskey: counter to disambiguate concurrent tstamp requests
> @@ -505,6 +508,7 @@ struct sock {
>  #endif
>  	u16			sk_tsflags;
>  	u8			sk_shutdown;
> +	bool			sk_use_task_frag;
>  	atomic_t		sk_tskey;
>  	atomic_t		sk_zckey;

I'm sorry, but after the post PR -net -> net-next merge, this does not
apply cleanly any-more, you need to rebase it once more.

Note that commit b534dc46c8ae ("net_tstamp: add
SOF_TIMESTAMPING_OPT_ID_TCP") moved the surronding fields a bit but
there is still one byte hole after sk_txtime_unused, before sk_socket.

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-13 11:10 [PATCH net v3 0/3] Stop corrupting socket's task_frag Benjamin Coddington
                   ` (2 preceding siblings ...)
  2022-12-13 11:10 ` [PATCH net v3 3/3] net: simplify sk_page_frag Benjamin Coddington
@ 2022-12-15 12:12 ` David Howells
  2022-12-15 13:59   ` Guillaume Nault
  2022-12-15 14:36   ` David Howells
  3 siblings, 2 replies; 9+ messages in thread
From: David Howells @ 2022-12-15 12:12 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
	Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev


Benjamin Coddington <bcodding@redhat.com> wrote:

> diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
> index eccc3cd0cb70..ac75ad18db83 100644
> --- a/fs/afs/rxrpc.c
> +++ b/fs/afs/rxrpc.c
> @@ -46,6 +46,7 @@ int afs_open_socket(struct afs_net *net)
>  		goto error_1;
>  
>  	socket->sk->sk_allocation = GFP_NOFS;
> +	socket->sk->sk_use_task_frag = false;
>  
>  	/* bind the callback manager's address to make this a server socket */
>  	memset(&srx, 0, sizeof(srx));

Possibly this should be done in net/rxrpc/local_object.c too?  Or maybe in
udp_sock_create() or sock_create_kern()?

David


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-15 12:12 ` [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag David Howells
@ 2022-12-15 13:59   ` Guillaume Nault
  2022-12-15 14:36   ` David Howells
  1 sibling, 0 replies; 9+ messages in thread
From: Guillaume Nault @ 2022-12-15 13:59 UTC (permalink / raw)
  To: David Howells
  Cc: Benjamin Coddington, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, Ilya Dryomov, Xiubo Li, Chuck Lever,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Steffen Klassert,
	Herbert Xu, netdev

On Thu, Dec 15, 2022 at 12:12:42PM +0000, David Howells wrote:
> 
> Benjamin Coddington <bcodding@redhat.com> wrote:
> 
> > diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
> > index eccc3cd0cb70..ac75ad18db83 100644
> > --- a/fs/afs/rxrpc.c
> > +++ b/fs/afs/rxrpc.c
> > @@ -46,6 +46,7 @@ int afs_open_socket(struct afs_net *net)
> >  		goto error_1;
> >  
> >  	socket->sk->sk_allocation = GFP_NOFS;
> > +	socket->sk->sk_use_task_frag = false;
> >  
> >  	/* bind the callback manager's address to make this a server socket */
> >  	memset(&srx, 0, sizeof(srx));
> 
> Possibly this should be done in net/rxrpc/local_object.c too?  Or maybe in
> udp_sock_create() or sock_create_kern()?

UDP tunnels typically don't need to set sk_use_task_frag, as they don't
call sk_page_frag(). One exception would be if they called
ip_append_data() (or ip6_append_data()), but none of them seem to do
that (and I can't see any reason why they would).

And net/rxrpc/local_object.c doesn't seems very different in this regard.

Maybe setting sk_use_task_frag in fs/afs/rxrpc.c was overzealous but
I'm not familiar enough with the AF_RXRPC family to tell. If AF_RXRPC
sockets can't call sk_page_frag() and have no reason to do so in the
future, then it should be safe to drop this chunk.

> David
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-15 12:12 ` [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag David Howells
  2022-12-15 13:59   ` Guillaume Nault
@ 2022-12-15 14:36   ` David Howells
  2022-12-15 15:37     ` Guillaume Nault
  1 sibling, 1 reply; 9+ messages in thread
From: David Howells @ 2022-12-15 14:36 UTC (permalink / raw)
  To: Guillaume Nault
  Cc: dhowells, Benjamin Coddington, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, Ilya Dryomov, Xiubo Li, Chuck Lever,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Steffen Klassert,
	Herbert Xu, netdev


Guillaume Nault <gnault@redhat.com> wrote:

> Maybe setting sk_use_task_frag in fs/afs/rxrpc.c was overzealous but
> I'm not familiar enough with the AF_RXRPC family to tell. If AF_RXRPC
> sockets can't call sk_page_frag() and have no reason to do so in the
> future, then it should be safe to drop this chunk.

As of this merge window, AF_RXRPC doesn't actually allocate sk_buffs apart
from when it calls skb_unshare().  It does steal the incoming sk_buffs from
the UDP socket it uses as a transport, but they're allocated in the IP/IP6
stack somewhere.

The UDP transport socket, on the other hand, will allocate sk_buffs for
transmission, but rxrpc sends an entire UDP packet at a time, each with a
single sendmsg call.

Further, this mostly now moved such that the UDP sendmsg calls are performed
inside an I/O thread.  The application thread does not interact directly with
the UDP transport socket.

David


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-15 14:36   ` David Howells
@ 2022-12-15 15:37     ` Guillaume Nault
  0 siblings, 0 replies; 9+ messages in thread
From: Guillaume Nault @ 2022-12-15 15:37 UTC (permalink / raw)
  To: David Howells
  Cc: Benjamin Coddington, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, Ilya Dryomov, Xiubo Li, Chuck Lever,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Steffen Klassert,
	Herbert Xu, netdev

On Thu, Dec 15, 2022 at 02:36:52PM +0000, David Howells wrote:
> 
> Guillaume Nault <gnault@redhat.com> wrote:
> 
> > Maybe setting sk_use_task_frag in fs/afs/rxrpc.c was overzealous but
> > I'm not familiar enough with the AF_RXRPC family to tell. If AF_RXRPC
> > sockets can't call sk_page_frag() and have no reason to do so in the
> > future, then it should be safe to drop this chunk.
> 
> As of this merge window, AF_RXRPC doesn't actually allocate sk_buffs apart
> from when it calls skb_unshare().  It does steal the incoming sk_buffs from
> the UDP socket it uses as a transport, but they're allocated in the IP/IP6
> stack somewhere.
> 
> The UDP transport socket, on the other hand, will allocate sk_buffs for
> transmission, but rxrpc sends an entire UDP packet at a time, each with a
> single sendmsg call.
> 
> Further, this mostly now moved such that the UDP sendmsg calls are performed
> inside an I/O thread.  The application thread does not interact directly with
> the UDP transport socket.
> 
> David

Thanks for the explanations. Looks like we could drop the fs/afs/rxrpc.c
chunk then.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-12-15 15:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-13 11:10 [PATCH net v3 0/3] Stop corrupting socket's task_frag Benjamin Coddington
2022-12-13 11:10 ` [PATCH net v3 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
2022-12-15 12:08   ` Paolo Abeni
2022-12-13 11:10 ` [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
2022-12-13 11:10 ` [PATCH net v3 3/3] net: simplify sk_page_frag Benjamin Coddington
2022-12-15 12:12 ` [PATCH net v3 2/3] Treewide: Stop corrupting socket's task_frag David Howells
2022-12-15 13:59   ` Guillaume Nault
2022-12-15 14:36   ` David Howells
2022-12-15 15:37     ` Guillaume Nault

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.