All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net v4 0/3] Stop corrupting socket's task_frag
@ 2022-12-16 12:45 Benjamin Coddington
  2022-12-16 12:45 ` [PATCH net v4 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Benjamin Coddington @ 2022-12-16 12:45 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

The networking code uses flags in sk_allocation to determine if it can use
current->task_frag, however in-kernel users of sockets may stop setting
sk_allocation when they convert to the preferred memalloc_nofs_save/restore,
as SUNRPC has done in commit a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save()
on all rpciod/xprtiod jobs").

This will cause corruption in current->task_frag when recursing into the
network layer for those subsystems during page fault or reclaim.  The
corruption is difficult to diagnose because stack traces may not contain the
offending subsystem at all.  The corruption is unlikely to show up in
testing because it requires memory pressure, and so subsystems that
convert to memalloc_nofs_save/restore are likely to continue to run into
this issue.

Previous reports and proposed fixes:
https://lore.kernel.org/netdev/96a18bd00cbc6cb554603cc0d6ef1c551965b078.1663762494.git.gnault@redhat.com/
https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
https://lore.kernel.org/linux-nfs/de6d99321d1dcaa2ad456b92b3680aa77c07a747.1665401788.git.gnault@redhat.com/

Guilluame Nault has done all of the hard work tracking this problem down and
finding the best fix for this issue.  I'm just taking a turn posting another
fix.

Changes on v2:
	- rebased on -net
	- set sk_use_task_frag = false for xfrm/espintcp.c

Changes on v3:
	- fixup comments in sock.h for kernel-doc

Changes on v4:
	- rebased on -net
	- sk_use_task_frag moved to hole after sk_txtime_unused
	- droppd afs/rxrpc.c hunk, not needed

Benjamin Coddington (2):
  Treewide: Stop corrupting socket's task_frag
  net: simplify sk_page_frag

Guillaume Nault (1):
  net: Introduce sk_use_task_frag in struct sock.

 drivers/block/drbd/drbd_receiver.c |  3 +++
 drivers/block/nbd.c                |  1 +
 drivers/nvme/host/tcp.c            |  1 +
 drivers/scsi/iscsi_tcp.c           |  1 +
 drivers/usb/usbip/usbip_common.c   |  1 +
 fs/cifs/connect.c                  |  1 +
 fs/dlm/lowcomms.c                  |  2 ++
 fs/ocfs2/cluster/tcp.c             |  1 +
 include/net/sock.h                 | 10 ++++++----
 net/9p/trans_fd.c                  |  1 +
 net/ceph/messenger.c               |  1 +
 net/core/sock.c                    |  1 +
 net/sunrpc/xprtsock.c              |  3 +++
 net/xfrm/espintcp.c                |  1 +
 14 files changed, 24 insertions(+), 4 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net v4 1/3] net: Introduce sk_use_task_frag in struct sock.
  2022-12-16 12:45 [PATCH net v4 0/3] Stop corrupting socket's task_frag Benjamin Coddington
@ 2022-12-16 12:45 ` Benjamin Coddington
  2022-12-16 12:45 ` [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Benjamin Coddington @ 2022-12-16 12:45 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

From: Guillaume Nault <gnault@redhat.com>

Sockets that can be used while recursing into memory reclaim, like
those used by network block devices and file systems, mustn't use
current->task_frag: if the current process is already using it, then
the inner memory reclaim call would corrupt the task_frag structure.

To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
that mustn't use current->task_frag, assuming that those used during
memory reclaim had their allocation constraints reflected in
->sk_allocation.

This unfortunately doesn't cover all cases: in an attempt to remove all
usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
->sk_allocation, and used memalloc_nofs critical sections instead.
This breaks the sk_page_frag() heuristic since the allocation
constraints are now stored in current->flags, which sk_page_frag()
can't read without risking triggering a cache miss and slowing down
TCP's fast path.

This patch creates a new field in struct sock, named sk_use_task_frag,
which sockets with memory reclaim constraints can set to false if they
can't safely use current->task_frag. In such cases, sk_page_frag() now
always returns the socket's page_frag (->sk_frag). The first user is
sunrpc, which needs to avoid using current->task_frag but can keep
->sk_allocation set to GFP_KERNEL otherwise.

Eventually, it might be possible to simplify sk_page_frag() by only
testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
heuristic entirely (assuming other sockets will set ->sk_use_task_frag
according to their constraints in the future).

The new ->sk_use_task_frag field is placed in a hole in struct sock and
belongs to a cache line shared with ->sk_shutdown. Therefore it should
be hot and shouldn't have negative performance impacts on TCP's fast
path (sk_shutdown is tested just before the while() loop in
tcp_sendmsg_locked()).

Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
---
 include/net/sock.h | 11 +++++++++--
 net/core/sock.c    |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ecea3dcc2217..fefe1f4abf19 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -318,6 +318,9 @@ struct sk_filter;
   *	@sk_stamp: time stamp of last packet received
   *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
   *	@sk_tsflags: SO_TIMESTAMPING flags
+  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
+  *			   Sockets that can be used under memory reclaim should
+  *			   set this to false.
   *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
   *	              for timestamping
   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
@@ -512,6 +515,7 @@ struct sock {
 	u8			sk_txtime_deadline_mode : 1,
 				sk_txtime_report_errors : 1,
 				sk_txtime_unused : 6;
+	bool			sk_use_task_frag;
 
 	struct socket		*sk_socket;
 	void			*sk_user_data;
@@ -2561,14 +2565,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
  * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags.
+ * This assumes that page fault handlers use the GFP_NOFS flags or
+ * explicitly disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
+	if (sk->sk_use_task_frag &&
+	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
+				  __GFP_FS)) ==
 	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
 		return &current->task_frag;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index d2587d8712db..f954d5893e79 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3390,6 +3390,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_rcvbuf		=	READ_ONCE(sysctl_rmem_default);
 	sk->sk_sndbuf		=	READ_ONCE(sysctl_wmem_default);
 	sk->sk_state		=	TCP_CLOSE;
+	sk->sk_use_task_frag	=	true;
 	sk_set_socket(sk, sock);
 
 	sock_set_flag(sk, SOCK_ZAPPED);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-16 12:45 [PATCH net v4 0/3] Stop corrupting socket's task_frag Benjamin Coddington
  2022-12-16 12:45 ` [PATCH net v4 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
@ 2022-12-16 12:45 ` Benjamin Coddington
  2023-01-03 14:26   ` Eric Dumazet
  2022-12-16 12:45 ` [PATCH net v4 3/3] net: simplify sk_page_frag Benjamin Coddington
  2022-12-20  2:00 ` [PATCH net v4 0/3] Stop corrupting socket's task_frag patchwork-bot+netdevbpf
  3 siblings, 1 reply; 9+ messages in thread
From: Benjamin Coddington @ 2022-12-16 12:45 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.  The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code.  I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false.  Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
---
 drivers/block/drbd/drbd_receiver.c | 3 +++
 drivers/block/nbd.c                | 1 +
 drivers/nvme/host/tcp.c            | 1 +
 drivers/scsi/iscsi_tcp.c           | 1 +
 drivers/usb/usbip/usbip_common.c   | 1 +
 fs/cifs/connect.c                  | 1 +
 fs/dlm/lowcomms.c                  | 2 ++
 fs/ocfs2/cluster/tcp.c             | 1 +
 net/9p/trans_fd.c                  | 1 +
 net/ceph/messenger.c               | 1 +
 net/sunrpc/xprtsock.c              | 3 +++
 net/xfrm/espintcp.c                | 1 +
 12 files changed, 17 insertions(+)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 0e58a3187345..757f4692b5bd 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1030,6 +1030,9 @@ static int conn_connect(struct drbd_connection *connection)
 	sock.socket->sk->sk_allocation = GFP_NOIO;
 	msock.socket->sk->sk_allocation = GFP_NOIO;
 
+	sock.socket->sk->sk_use_task_frag = false;
+	msock.socket->sk->sk_use_task_frag = false;
+
 	sock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE_BULK;
 	msock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE;
 
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index e379ccc63c52..592cfa8b765a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -512,6 +512,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 	noreclaim_flag = memalloc_noreclaim_save();
 	do {
 		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
+		sock->sk->sk_use_task_frag = false;
 		msg.msg_name = NULL;
 		msg.msg_namelen = 0;
 		msg.msg_control = NULL;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b69b89166b6b..8cedc1ef496c 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1537,6 +1537,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
 	queue->sock->sk->sk_rcvtimeo = 10 * HZ;
 
 	queue->sock->sk->sk_allocation = GFP_ATOMIC;
+	queue->sock->sk->sk_use_task_frag = false;
 	nvme_tcp_set_queue_io_cpu(queue);
 	queue->request = NULL;
 	queue->data_remaining = 0;
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 5fb1f364e815..1d1cf641937c 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -738,6 +738,7 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
 	sk->sk_reuse = SK_CAN_REUSE;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
 	sk->sk_allocation = GFP_ATOMIC;
+	sk->sk_use_task_frag = false;
 	sk_set_memalloc(sk);
 	sock_no_linger(sk);
 
diff --git a/drivers/usb/usbip/usbip_common.c b/drivers/usb/usbip/usbip_common.c
index f8b326eed54d..a2b2da1255dd 100644
--- a/drivers/usb/usbip/usbip_common.c
+++ b/drivers/usb/usbip/usbip_common.c
@@ -315,6 +315,7 @@ int usbip_recv(struct socket *sock, void *buf, int size)
 
 	do {
 		sock->sk->sk_allocation = GFP_NOIO;
+		sock->sk->sk_use_task_frag = false;
 
 		result = sock_recvmsg(sock, &msg, MSG_WAITALL);
 		if (result <= 0)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index e80252a83225..7bc7b5e03c51 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -2944,6 +2944,7 @@ generic_ip_connect(struct TCP_Server_Info *server)
 		cifs_dbg(FYI, "Socket created\n");
 		server->ssocket = socket;
 		socket->sk->sk_allocation = GFP_NOFS;
+		socket->sk->sk_use_task_frag = false;
 		if (sfamily == AF_INET6)
 			cifs_reclassify_socket6(socket);
 		else
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 8b80ca0cd65f..4450721ec83c 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -645,6 +645,7 @@ static void add_sock(struct socket *sock, struct connection *con)
 	if (dlm_config.ci_protocol == DLM_PROTO_SCTP)
 		sk->sk_state_change = lowcomms_state_change;
 	sk->sk_allocation = GFP_NOFS;
+	sk->sk_use_task_frag = false;
 	sk->sk_error_report = lowcomms_error_report;
 	release_sock(sk);
 }
@@ -1769,6 +1770,7 @@ static int dlm_listen_for_all(void)
 	listen_con.sock = sock;
 
 	sock->sk->sk_allocation = GFP_NOFS;
+	sock->sk->sk_use_task_frag = false;
 	sock->sk->sk_data_ready = lowcomms_listen_data_ready;
 	release_sock(sock->sk);
 
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 37d222bdfc8c..a07b24d170f2 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1602,6 +1602,7 @@ static void o2net_start_connect(struct work_struct *work)
 	sc->sc_sock = sock; /* freed by sc_kref_release */
 
 	sock->sk->sk_allocation = GFP_ATOMIC;
+	sock->sk->sk_use_task_frag = false;
 
 	myaddr.sin_family = AF_INET;
 	myaddr.sin_addr.s_addr = mynode->nd_ipv4_address;
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 07db2f436d44..d9120f14684b 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -868,6 +868,7 @@ static int p9_socket_open(struct p9_client *client, struct socket *csocket)
 	}
 
 	csocket->sk->sk_allocation = GFP_NOIO;
+	csocket->sk->sk_use_task_frag = false;
 	file = sock_alloc_file(csocket, 0, NULL);
 	if (IS_ERR(file)) {
 		pr_err("%s (%d): failed to map fd\n",
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index dfa237fbd5a3..1d06e114ba3f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -446,6 +446,7 @@ int ceph_tcp_connect(struct ceph_connection *con)
 	if (ret)
 		return ret;
 	sock->sk->sk_allocation = GFP_NOFS;
+	sock->sk->sk_use_task_frag = false;
 
 #ifdef CONFIG_LOCKDEP
 	lockdep_set_class(&sock->sk->sk_lock, &socket_class);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index c0506d0d7478..aaa5b2741b79 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1882,6 +1882,7 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt,
 		sk->sk_write_space = xs_udp_write_space;
 		sk->sk_state_change = xs_local_state_change;
 		sk->sk_error_report = xs_error_report;
+		sk->sk_use_task_frag = false;
 
 		xprt_clear_connected(xprt);
 
@@ -2082,6 +2083,7 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		sk->sk_user_data = xprt;
 		sk->sk_data_ready = xs_data_ready;
 		sk->sk_write_space = xs_udp_write_space;
+		sk->sk_use_task_frag = false;
 
 		xprt_set_connected(xprt);
 
@@ -2249,6 +2251,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		sk->sk_state_change = xs_tcp_state_change;
 		sk->sk_write_space = xs_tcp_write_space;
 		sk->sk_error_report = xs_error_report;
+		sk->sk_use_task_frag = false;
 
 		/* socket options */
 		sock_reset_flag(sk, SOCK_LINGER);
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index d6fece1ed982..74a54295c164 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -489,6 +489,7 @@ static int espintcp_init_sk(struct sock *sk)
 
 	/* avoid using task_frag */
 	sk->sk_allocation = GFP_ATOMIC;
+	sk->sk_use_task_frag = false;
 
 	return 0;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net v4 3/3] net: simplify sk_page_frag
  2022-12-16 12:45 [PATCH net v4 0/3] Stop corrupting socket's task_frag Benjamin Coddington
  2022-12-16 12:45 ` [PATCH net v4 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
  2022-12-16 12:45 ` [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
@ 2022-12-16 12:45 ` Benjamin Coddington
  2022-12-20  2:00 ` [PATCH net v4 0/3] Stop corrupting socket's task_frag patchwork-bot+netdevbpf
  3 siblings, 0 replies; 9+ messages in thread
From: Benjamin Coddington @ 2022-12-16 12:45 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

Now that in-kernel socket users that may recurse during reclaim have benn
converted to sk_use_task_frag = false, we can have sk_page_frag() simply
check that value.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
---
 include/net/sock.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index fefe1f4abf19..dcd72e6285b2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2564,19 +2564,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * Both direct reclaim and page faults can nest inside other
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
- * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags or
- * explicitly disable sk_use_task_frag.
+ * when users disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (sk->sk_use_task_frag &&
-	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
-				  __GFP_FS)) ==
-	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
+	if (sk->sk_use_task_frag)
 		return &current->task_frag;
 
 	return &sk->sk_frag;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH net v4 0/3] Stop corrupting socket's task_frag
  2022-12-16 12:45 [PATCH net v4 0/3] Stop corrupting socket's task_frag Benjamin Coddington
                   ` (2 preceding siblings ...)
  2022-12-16 12:45 ` [PATCH net v4 3/3] net: simplify sk_page_frag Benjamin Coddington
@ 2022-12-20  2:00 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 9+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-12-20  2:00 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: kuba, pabeni, davem, edumazet, gnault, philipp.reisner,
	lars.ellenberg, christoph.boehmwalder, axboe, josef, kbusch, hch,
	sagi, lduncan, cleech, michael.christie, jejb, martin.petersen,
	valentina.manea.m, shuah, gregkh, dhowells, marc.dionne, sfrench,
	ccaulfie, teigland, mark, jlbec, joseph.qi, ericvh, lucho,
	asmadeus, idryomov, xiubli, chuck.lever, jlayton,
	trond.myklebust, anna, steffen.klassert, herbert, netdev

Hello:

This series was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 16 Dec 2022 07:45:25 -0500 you wrote:
> The networking code uses flags in sk_allocation to determine if it can use
> current->task_frag, however in-kernel users of sockets may stop setting
> sk_allocation when they convert to the preferred memalloc_nofs_save/restore,
> as SUNRPC has done in commit a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save()
> on all rpciod/xprtiod jobs").
> 
> This will cause corruption in current->task_frag when recursing into the
> network layer for those subsystems during page fault or reclaim.  The
> corruption is difficult to diagnose because stack traces may not contain the
> offending subsystem at all.  The corruption is unlikely to show up in
> testing because it requires memory pressure, and so subsystems that
> convert to memalloc_nofs_save/restore are likely to continue to run into
> this issue.
> 
> [...]

Here is the summary with links:
  - [net,v4,1/3] net: Introduce sk_use_task_frag in struct sock.
    https://git.kernel.org/netdev/net/c/fb87bd47516d
  - [net,v4,2/3] Treewide: Stop corrupting socket's task_frag
    https://git.kernel.org/netdev/net/c/98123866fcf3
  - [net,v4,3/3] net: simplify sk_page_frag
    https://git.kernel.org/netdev/net/c/08f65892c5ee

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-16 12:45 ` [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
@ 2023-01-03 14:26   ` Eric Dumazet
  2023-01-03 15:14     ` Guillaume Nault
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2023-01-03 14:26 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Guillaume Nault,
	Philipp Reisner, Lars Ellenberg, Christoph Böhmwalder,
	Jens Axboe, Josef Bacik, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Lee Duncan, Chris Leech, Mike Christie,
	James E.J. Bottomley, Martin K. Petersen, Valentina Manea,
	Shuah Khan, Greg Kroah-Hartman, David Howells, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, Ilya Dryomov, Xiubo Li, Chuck Lever,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Steffen Klassert,
	Herbert Xu, netdev

On Fri, Dec 16, 2022 at 1:45 PM Benjamin Coddington <bcodding@redhat.com> wrote:
>
> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> GFP_NOIO flag on sk_allocation which the networking system uses to decide
> when it is safe to use current->task_frag.  The results of this are
> unexpected corruption in task_frag when SUNRPC is involved in memory
> reclaim.
>
> The corruption can be seen in crashes, but the root cause is often
> difficult to ascertain as a crashing machine's stack trace will have no
> evidence of being near NFS or SUNRPC code.  I believe this problem to
> be much more pervasive than reports to the community may indicate.
>
> Fix this by having kernel users of sockets that may corrupt task_frag due
> to reclaim set sk_use_task_frag = false.  Preemptively correcting this
> situation for users that still set sk_allocation allows them to convert to
> memalloc_nofs_save/restore without the same unexpected corruptions that are
> sure to follow, unlikely to show up in testing, and difficult to bisect.
>

I am back from PTO.

It seems inet_ctl_sock_create() has been forgotten.

Without following fix, ICMP messages sent from softirq would corrupt
innocent thread task_frag.

(I will submit this patch formally a bit later today)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index ab4a06be489b5d410cec603bf56248d31dbc90dd..6c0ec27899431eb56e2f9d0c3a936b77f44ccaca
100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1665,6 +1665,7 @@ int inet_ctl_sock_create(struct sock **sk,
unsigned short family,
        if (rc == 0) {
                *sk = sock->sk;
                (*sk)->sk_allocation = GFP_ATOMIC;
+               (*sk)->sk_use_task_frag = false;
                /*
                 * Unhash it so that IP input processing does not even see it,
                 * we do not wish this socket to see incoming packets.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag
  2023-01-03 14:26   ` Eric Dumazet
@ 2023-01-03 15:14     ` Guillaume Nault
  2023-01-03 16:10       ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Guillaume Nault @ 2023-01-03 15:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benjamin Coddington, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

On Tue, Jan 03, 2023 at 03:26:27PM +0100, Eric Dumazet wrote:
> On Fri, Dec 16, 2022 at 1:45 PM Benjamin Coddington <bcodding@redhat.com> wrote:
> >
> > Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> > GFP_NOIO flag on sk_allocation which the networking system uses to decide
> > when it is safe to use current->task_frag.  The results of this are
> > unexpected corruption in task_frag when SUNRPC is involved in memory
> > reclaim.
> >
> > The corruption can be seen in crashes, but the root cause is often
> > difficult to ascertain as a crashing machine's stack trace will have no
> > evidence of being near NFS or SUNRPC code.  I believe this problem to
> > be much more pervasive than reports to the community may indicate.
> >
> > Fix this by having kernel users of sockets that may corrupt task_frag due
> > to reclaim set sk_use_task_frag = false.  Preemptively correcting this
> > situation for users that still set sk_allocation allows them to convert to
> > memalloc_nofs_save/restore without the same unexpected corruptions that are
> > sure to follow, unlikely to show up in testing, and difficult to bisect.
> >
> 
> I am back from PTO.
> 
> It seems inet_ctl_sock_create() has been forgotten.
> 
> Without following fix, ICMP messages sent from softirq would corrupt
> innocent thread task_frag.

I didn't consider setting ->sk_use_task_frag on ICMP sockets as my
understanding was that only TCP and ip_append_data() could eventually
call sk_page_frag(). Therefore, I didn't see how ICMP sockets could be
affected. Did I miss something?

> (I will submit this patch formally a bit later today)
> 
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index ab4a06be489b5d410cec603bf56248d31dbc90dd..6c0ec27899431eb56e2f9d0c3a936b77f44ccaca
> 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1665,6 +1665,7 @@ int inet_ctl_sock_create(struct sock **sk,
> unsigned short family,
>         if (rc == 0) {
>                 *sk = sock->sk;
>                 (*sk)->sk_allocation = GFP_ATOMIC;
> +               (*sk)->sk_use_task_frag = false;
>                 /*
>                  * Unhash it so that IP input processing does not even see it,
>                  * we do not wish this socket to see incoming packets.
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag
  2023-01-03 15:14     ` Guillaume Nault
@ 2023-01-03 16:10       ` Eric Dumazet
  2023-01-03 22:37         ` Guillaume Nault
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2023-01-03 16:10 UTC (permalink / raw)
  To: Guillaume Nault
  Cc: Benjamin Coddington, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

On Tue, Jan 3, 2023 at 4:14 PM Guillaume Nault <gnault@redhat.com> wrote:
>
> On Tue, Jan 03, 2023 at 03:26:27PM +0100, Eric Dumazet wrote:
> > On Fri, Dec 16, 2022 at 1:45 PM Benjamin Coddington <bcodding@redhat.com> wrote:
> > >
> > > Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> > > GFP_NOIO flag on sk_allocation which the networking system uses to decide
> > > when it is safe to use current->task_frag.  The results of this are
> > > unexpected corruption in task_frag when SUNRPC is involved in memory
> > > reclaim.
> > >
> > > The corruption can be seen in crashes, but the root cause is often
> > > difficult to ascertain as a crashing machine's stack trace will have no
> > > evidence of being near NFS or SUNRPC code.  I believe this problem to
> > > be much more pervasive than reports to the community may indicate.
> > >
> > > Fix this by having kernel users of sockets that may corrupt task_frag due
> > > to reclaim set sk_use_task_frag = false.  Preemptively correcting this
> > > situation for users that still set sk_allocation allows them to convert to
> > > memalloc_nofs_save/restore without the same unexpected corruptions that are
> > > sure to follow, unlikely to show up in testing, and difficult to bisect.
> > >
> >
> > I am back from PTO.
> >
> > It seems inet_ctl_sock_create() has been forgotten.
> >
> > Without following fix, ICMP messages sent from softirq would corrupt
> > innocent thread task_frag.
>
> I didn't consider setting ->sk_use_task_frag on ICMP sockets as my
> understanding was that only TCP and ip_append_data() could eventually
> call sk_page_frag(). Therefore, I didn't see how ICMP sockets could be
> affected. Did I miss something?

net/ipv4/ping.c

ICMP uses per-cpu sockets (look in icmp_init(), icmp_xmit_lock()...)

icmp_rcv()
  -> icmp_echo()
     -> icmp_reply()
       -> icmp_push_reply()
         -> ip_append_data()
           -> sk_page_frag_refill()


>
> > (I will submit this patch formally a bit later today)
> >
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index ab4a06be489b5d410cec603bf56248d31dbc90dd..6c0ec27899431eb56e2f9d0c3a936b77f44ccaca
> > 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1665,6 +1665,7 @@ int inet_ctl_sock_create(struct sock **sk,
> > unsigned short family,
> >         if (rc == 0) {
> >                 *sk = sock->sk;
> >                 (*sk)->sk_allocation = GFP_ATOMIC;
> > +               (*sk)->sk_use_task_frag = false;
> >                 /*
> >                  * Unhash it so that IP input processing does not even see it,
> >                  * we do not wish this socket to see incoming packets.
> >
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag
  2023-01-03 16:10       ` Eric Dumazet
@ 2023-01-03 22:37         ` Guillaume Nault
  0 siblings, 0 replies; 9+ messages in thread
From: Guillaume Nault @ 2023-01-03 22:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benjamin Coddington, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

On Tue, Jan 03, 2023 at 05:10:24PM +0100, Eric Dumazet wrote:
> On Tue, Jan 3, 2023 at 4:14 PM Guillaume Nault <gnault@redhat.com> wrote:
> >
> > On Tue, Jan 03, 2023 at 03:26:27PM +0100, Eric Dumazet wrote:
> > > On Fri, Dec 16, 2022 at 1:45 PM Benjamin Coddington <bcodding@redhat.com> wrote:
> > > >
> > > > Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> > > > GFP_NOIO flag on sk_allocation which the networking system uses to decide
> > > > when it is safe to use current->task_frag.  The results of this are
> > > > unexpected corruption in task_frag when SUNRPC is involved in memory
> > > > reclaim.
> > > >
> > > > The corruption can be seen in crashes, but the root cause is often
> > > > difficult to ascertain as a crashing machine's stack trace will have no
> > > > evidence of being near NFS or SUNRPC code.  I believe this problem to
> > > > be much more pervasive than reports to the community may indicate.
> > > >
> > > > Fix this by having kernel users of sockets that may corrupt task_frag due
> > > > to reclaim set sk_use_task_frag = false.  Preemptively correcting this
> > > > situation for users that still set sk_allocation allows them to convert to
> > > > memalloc_nofs_save/restore without the same unexpected corruptions that are
> > > > sure to follow, unlikely to show up in testing, and difficult to bisect.
> > > >
> > >
> > > I am back from PTO.
> > >
> > > It seems inet_ctl_sock_create() has been forgotten.
> > >
> > > Without following fix, ICMP messages sent from softirq would corrupt
> > > innocent thread task_frag.
> >
> > I didn't consider setting ->sk_use_task_frag on ICMP sockets as my
> > understanding was that only TCP and ip_append_data() could eventually
> > call sk_page_frag(). Therefore, I didn't see how ICMP sockets could be
> > affected. Did I miss something?
> 
> net/ipv4/ping.c
> 
> ICMP uses per-cpu sockets (look in icmp_init(), icmp_xmit_lock()...)
> 
> icmp_rcv()
>   -> icmp_echo()
>      -> icmp_reply()
>        -> icmp_push_reply()
>          -> ip_append_data()
>            -> sk_page_frag_refill()

Thanks, that looks so obvious now :/
Sorry for missing that case.

> >
> > > (I will submit this patch formally a bit later today)
> > >
> > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > index ab4a06be489b5d410cec603bf56248d31dbc90dd..6c0ec27899431eb56e2f9d0c3a936b77f44ccaca
> > > 100644
> > > --- a/net/ipv4/af_inet.c
> > > +++ b/net/ipv4/af_inet.c
> > > @@ -1665,6 +1665,7 @@ int inet_ctl_sock_create(struct sock **sk,
> > > unsigned short family,
> > >         if (rc == 0) {
> > >                 *sk = sock->sk;
> > >                 (*sk)->sk_allocation = GFP_ATOMIC;
> > > +               (*sk)->sk_use_task_frag = false;
> > >                 /*
> > >                  * Unhash it so that IP input processing does not even see it,
> > >                  * we do not wish this socket to see incoming packets.
> > >
> >
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-01-03 22:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-16 12:45 [PATCH net v4 0/3] Stop corrupting socket's task_frag Benjamin Coddington
2022-12-16 12:45 ` [PATCH net v4 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
2022-12-16 12:45 ` [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
2023-01-03 14:26   ` Eric Dumazet
2023-01-03 15:14     ` Guillaume Nault
2023-01-03 16:10       ` Eric Dumazet
2023-01-03 22:37         ` Guillaume Nault
2022-12-16 12:45 ` [PATCH net v4 3/3] net: simplify sk_page_frag Benjamin Coddington
2022-12-20  2:00 ` [PATCH net v4 0/3] Stop corrupting socket's task_frag patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.