netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/3] Stop corrupting socket's task_frag
@ 2022-11-21 13:35 Benjamin Coddington
  2022-11-21 13:35 ` [PATCH v1 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Benjamin Coddington @ 2022-11-21 13:35 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

The networking code uses flags in sk_allocation to determine if it can use
current->task_frag, however in-kernel users of sockets may stop setting
sk_allocation when they convert to the preferred memalloc_nofs_save/restore,
as SUNRPC has done in commit a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save()
on all rpciod/xprtiod jobs").

This will cause corruption in current->task_frag when recursing into the
network layer for those subsystems during page fault or reclaim.  The
corruption is difficult to diagnose because stack traces may not contain the
offending subsystem at all.  The corruption is unlikely to show up in
testing because it requires memory pressure, and so subsystems that
convert to memalloc_nofs_save/restore are likely to continue to run into
this issue.

Previous reports and proposed fixes:
https://lore.kernel.org/netdev/96a18bd00cbc6cb554603cc0d6ef1c551965b078.1663762494.git.gnault@redhat.com/
https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
https://lore.kernel.org/linux-nfs/de6d99321d1dcaa2ad456b92b3680aa77c07a747.1665401788.git.gnault@redhat.com/

Guilluame Nault has done all of the hard work tracking this problem down and
finding the best fix for this issue.  I'm just taking a turn posting another
fix.

Benjamin Coddington (2):
  Treewide: Stop corrupting socket's task_frag
  net: simplify sk_page_frag

Guillaume Nault (1):
  net: Introduce sk_use_task_frag in struct sock.

 drivers/block/drbd/drbd_receiver.c |  3 +++
 drivers/block/nbd.c                |  1 +
 drivers/nvme/host/tcp.c            |  1 +
 drivers/scsi/iscsi_tcp.c           |  1 +
 drivers/usb/usbip/usbip_common.c   |  1 +
 fs/afs/rxrpc.c                     |  1 +
 fs/cifs/connect.c                  |  1 +
 fs/dlm/lowcomms.c                  |  2 ++
 fs/ocfs2/cluster/tcp.c             |  1 +
 include/net/sock.h                 | 10 ++++++----
 net/9p/trans_fd.c                  |  1 +
 net/ceph/messenger.c               |  1 +
 net/core/sock.c                    |  1 +
 net/sunrpc/xprtsock.c              |  3 +++
 14 files changed, 24 insertions(+), 4 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v1 1/3] net: Introduce sk_use_task_frag in struct sock.
  2022-11-21 13:35 [PATCH v1 0/3] Stop corrupting socket's task_frag Benjamin Coddington
@ 2022-11-21 13:35 ` Benjamin Coddington
  2022-12-09 12:09   ` Paolo Abeni
  2022-11-21 13:35 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Benjamin Coddington @ 2022-11-21 13:35 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Soheil Hassas Yeganeh, Shakeel Butt, Pavel Begunkov,
	Kuniyuki Iwashima, Maciej Żenczykowski, Menglong Dong,
	Akhmat Karakotov, Alexander Duyck

From: Guillaume Nault <gnault@redhat.com>

Sockets that can be used while recursing into memory reclaim, like
those used by network block devices and file systems, mustn't use
current->task_frag: if the current process is already using it, then
the inner memory reclaim call would corrupt the task_frag structure.

To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
that mustn't use current->task_frag, assuming that those used during
memory reclaim had their allocation constraints reflected in
->sk_allocation.

This unfortunately doesn't cover all cases: in an attempt to remove all
usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
->sk_allocation, and used memalloc_nofs critical sections instead.
This breaks the sk_page_frag() heuristic since the allocation
constraints are now stored in current->flags, which sk_page_frag()
can't read without risking triggering a cache miss and slowing down
TCP's fast path.

This patch creates a new field in struct sock, named sk_use_task_frag,
which sockets with memory reclaim constraints can set to false if they
can't safely use current->task_frag. In such cases, sk_page_frag() now
always returns the socket's page_frag (->sk_frag). The first user is
sunrpc, which needs to avoid using current->task_frag but can keep
->sk_allocation set to GFP_KERNEL otherwise.

Eventually, it might be possible to simplify sk_page_frag() by only
testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
heuristic entirely (assuming other sockets will set ->sk_use_task_frag
according to their constraints in the future).

The new ->sk_use_task_frag field is placed in a hole in struct sock and
belongs to a cache line shared with ->sk_shutdown. Therefore it should
be hot and shouldn't have negative performance impacts on TCP's fast
path (sk_shutdown is tested just before the while() loop in
tcp_sendmsg_locked()).

Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 include/net/sock.h | 11 +++++++++--
 net/core/sock.c    |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index d08cfe190a78..ffba9e95470d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -318,6 +318,9 @@ struct sk_filter;
   *	@sk_stamp: time stamp of last packet received
   *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
   *	@sk_tsflags: SO_TIMESTAMPING flags
+  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
+			   Sockets that can be used under memory reclaim should
+			   set this to false.
   *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
   *	              for timestamping
   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
@@ -504,6 +507,7 @@ struct sock {
 #endif
 	u16			sk_tsflags;
 	u8			sk_shutdown;
+	bool			sk_use_task_frag;
 	atomic_t		sk_tskey;
 	atomic_t		sk_zckey;
 
@@ -2536,14 +2540,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
  * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags.
+ * This assumes that page fault handlers use the GFP_NOFS flags or
+ * explicitly disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
+	if (sk->sk_use_task_frag &&
+	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
+				  __GFP_FS)) ==
 	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
 		return &current->task_frag;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 788c1372663c..1ab781be9fbe 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3314,6 +3314,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_rcvbuf		=	READ_ONCE(sysctl_rmem_default);
 	sk->sk_sndbuf		=	READ_ONCE(sysctl_wmem_default);
 	sk->sk_state		=	TCP_CLOSE;
+	sk->sk_use_task_frag	=	true;
 	sk_set_socket(sk, sock);
 
 	sock_set_flag(sk, SOCK_ZAPPED);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 13:35 [PATCH v1 0/3] Stop corrupting socket's task_frag Benjamin Coddington
  2022-11-21 13:35 ` [PATCH v1 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
@ 2022-11-21 13:35 ` Benjamin Coddington
  2022-11-29 14:02   ` Christoph Hellwig
  2022-12-09 12:37   ` Paolo Abeni
  2022-11-21 13:35 ` [PATCH v1 3/3] net: simplify sk_page_frag Benjamin Coddington
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Benjamin Coddington @ 2022-11-21 13:35 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Ilya Dryomov,
	Xiubo Li, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, drbd-dev, linux-block, nbd, linux-nvme, open-iscsi,
	linux-scsi, linux-usb, linux-afs, linux-cifs, samba-technical,
	cluster-devel, ocfs2-devel, v9fs-developer, ceph-devel,
	linux-nfs

Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.  The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code.  I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false.  Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: drbd-dev@lists.linbit.com
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: nbd@other.debian.org
CC: linux-nvme@lists.infradead.org
CC: open-iscsi@googlegroups.com
CC: linux-scsi@vger.kernel.org
CC: linux-usb@vger.kernel.org
CC: linux-afs@lists.infradead.org
CC: linux-cifs@vger.kernel.org
CC: samba-technical@lists.samba.org
CC: cluster-devel@redhat.com
CC: ocfs2-devel@oss.oracle.com
CC: v9fs-developer@lists.sourceforge.net
CC: netdev@vger.kernel.org
CC: ceph-devel@vger.kernel.org
CC: linux-nfs@vger.kernel.org

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 drivers/block/drbd/drbd_receiver.c | 3 +++
 drivers/block/nbd.c                | 1 +
 drivers/nvme/host/tcp.c            | 1 +
 drivers/scsi/iscsi_tcp.c           | 1 +
 drivers/usb/usbip/usbip_common.c   | 1 +
 fs/afs/rxrpc.c                     | 1 +
 fs/cifs/connect.c                  | 1 +
 fs/dlm/lowcomms.c                  | 2 ++
 fs/ocfs2/cluster/tcp.c             | 1 +
 net/9p/trans_fd.c                  | 1 +
 net/ceph/messenger.c               | 1 +
 net/sunrpc/xprtsock.c              | 3 +++
 12 files changed, 17 insertions(+)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index af4c7d65490b..09ad8d82c200 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1030,6 +1030,9 @@ static int conn_connect(struct drbd_connection *connection)
 	sock.socket->sk->sk_allocation = GFP_NOIO;
 	msock.socket->sk->sk_allocation = GFP_NOIO;
 
+	sock.socket->sk->sk_use_task_frag = false;
+	msock.socket->sk->sk_use_task_frag = false;
+
 	sock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE_BULK;
 	msock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE;
 
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 2a709daefbc4..815ee631ed30 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -514,6 +514,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 	noreclaim_flag = memalloc_noreclaim_save();
 	do {
 		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
+		sock->sk->sk_use_task_frag = false;
 		msg.msg_name = NULL;
 		msg.msg_namelen = 0;
 		msg.msg_control = NULL;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d5871fd6f769..e01d78858cb4 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1531,6 +1531,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 	queue->sock->sk->sk_rcvtimeo = 10 * HZ;
 
 	queue->sock->sk->sk_allocation = GFP_ATOMIC;
+	queue->sock->sk->sk_use_task_frag = false;
 	nvme_tcp_set_queue_io_cpu(queue);
 	queue->request = NULL;
 	queue->data_remaining = 0;
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 29b1bd755afe..733e540d0abf 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -733,6 +733,7 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
 	sk->sk_reuse = SK_CAN_REUSE;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
 	sk->sk_allocation = GFP_ATOMIC;
+	sk->sk_use_task_frag = false;
 	sk_set_memalloc(sk);
 	sock_no_linger(sk);
 
diff --git a/drivers/usb/usbip/usbip_common.c b/drivers/usb/usbip/usbip_common.c
index 2ab99244bc31..76bfc6e43881 100644
--- a/drivers/usb/usbip/usbip_common.c
+++ b/drivers/usb/usbip/usbip_common.c
@@ -315,6 +315,7 @@ int usbip_recv(struct socket *sock, void *buf, int size)
 
 	do {
 		sock->sk->sk_allocation = GFP_NOIO;
+		sock->sk->sk_use_task_frag = false;
 
 		result = sock_recvmsg(sock, &msg, MSG_WAITALL);
 		if (result <= 0)
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index eccc3cd0cb70..ac75ad18db83 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -46,6 +46,7 @@ int afs_open_socket(struct afs_net *net)
 		goto error_1;
 
 	socket->sk->sk_allocation = GFP_NOFS;
+	socket->sk->sk_use_task_frag = false;
 
 	/* bind the callback manager's address to make this a server socket */
 	memset(&srx, 0, sizeof(srx));
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 7ae6f2c08153..c2b0d6f59f79 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -2935,6 +2935,7 @@ generic_ip_connect(struct TCP_Server_Info *server)
 		cifs_dbg(FYI, "Socket created\n");
 		server->ssocket = socket;
 		socket->sk->sk_allocation = GFP_NOFS;
+		socket->sk->sk_use_task_frag = false;
 		if (sfamily == AF_INET6)
 			cifs_reclassify_socket6(socket);
 		else
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index a4e84e8d94c8..4cf29ac3c428 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -699,6 +699,7 @@ static void add_listen_sock(struct socket *sock, struct listen_connection *con)
 
 	sk->sk_user_data = con;
 	sk->sk_allocation = GFP_NOFS;
+	sk->sk_use_task_frag = false;
 	/* Install a data_ready callback */
 	sk->sk_data_ready = lowcomms_listen_data_ready;
 	release_sock(sk);
@@ -718,6 +719,7 @@ static void add_sock(struct socket *sock, struct connection *con)
 	sk->sk_write_space = lowcomms_write_space;
 	sk->sk_state_change = lowcomms_state_change;
 	sk->sk_allocation = GFP_NOFS;
+	sk->sk_use_task_frag = false;
 	sk->sk_error_report = lowcomms_error_report;
 	release_sock(sk);
 }
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index f660c0dbdb63..3eaafa5e5ec4 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1604,6 +1604,7 @@ static void o2net_start_connect(struct work_struct *work)
 	sc->sc_sock = sock; /* freed by sc_kref_release */
 
 	sock->sk->sk_allocation = GFP_ATOMIC;
+	sock->sk->sk_use_task_frag = false;
 
 	myaddr.sin_family = AF_INET;
 	myaddr.sin_addr.s_addr = mynode->nd_ipv4_address;
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index e758978b44be..96f803499323 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -851,6 +851,7 @@ static int p9_socket_open(struct p9_client *client, struct socket *csocket)
 		return -ENOMEM;
 
 	csocket->sk->sk_allocation = GFP_NOIO;
+	csocket->sk->sk_use_task_frag = false;
 	file = sock_alloc_file(csocket, 0, NULL);
 	if (IS_ERR(file)) {
 		pr_err("%s (%d): failed to map fd\n",
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index d3bb656308b4..cad8e0ca8432 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -446,6 +446,7 @@ int ceph_tcp_connect(struct ceph_connection *con)
 	if (ret)
 		return ret;
 	sock->sk->sk_allocation = GFP_NOFS;
+	sock->sk->sk_use_task_frag = false;
 
 #ifdef CONFIG_LOCKDEP
 	lockdep_set_class(&sock->sk->sk_lock, &socket_class);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index e976007f4fd0..d3170b753dfc 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1882,6 +1882,7 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt,
 		sk->sk_write_space = xs_udp_write_space;
 		sk->sk_state_change = xs_local_state_change;
 		sk->sk_error_report = xs_error_report;
+		sk->sk_use_task_frag = false;
 
 		xprt_clear_connected(xprt);
 
@@ -2083,6 +2084,7 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		sk->sk_user_data = xprt;
 		sk->sk_data_ready = xs_data_ready;
 		sk->sk_write_space = xs_udp_write_space;
+		sk->sk_use_task_frag = false;
 
 		xprt_set_connected(xprt);
 
@@ -2250,6 +2252,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		sk->sk_state_change = xs_tcp_state_change;
 		sk->sk_write_space = xs_tcp_write_space;
 		sk->sk_error_report = xs_error_report;
+		sk->sk_use_task_frag = false;
 
 		/* socket options */
 		sock_reset_flag(sk, SOCK_LINGER);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v1 3/3] net: simplify sk_page_frag
  2022-11-21 13:35 [PATCH v1 0/3] Stop corrupting socket's task_frag Benjamin Coddington
  2022-11-21 13:35 ` [PATCH v1 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
  2022-11-21 13:35 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
@ 2022-11-21 13:35 ` Benjamin Coddington
  2022-12-09 16:42   ` Paolo Abeni
                     ` (2 more replies)
  2022-11-21 13:56 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag David Howells
  2022-12-07 11:44 ` [PATCH v1 0/3] " Benjamin Coddington
  4 siblings, 3 replies; 25+ messages in thread
From: Benjamin Coddington @ 2022-11-21 13:35 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni

Now that in-kernel socket users that may recurse during reclaim have benn
converted to sk_use_task_frag = false, we can have sk_page_frag() simply
check that value.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 include/net/sock.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ffba9e95470d..fac24c6ee30d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2539,19 +2539,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * Both direct reclaim and page faults can nest inside other
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
- * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags or
- * explicitly disable sk_use_task_frag.
+ * when users disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (sk->sk_use_task_frag &&
-	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
-				  __GFP_FS)) ==
-	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
+	if (sk->sk_use_task_frag)
 		return &current->task_frag;
 
 	return &sk->sk_frag;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 13:35 [PATCH v1 0/3] Stop corrupting socket's task_frag Benjamin Coddington
                   ` (2 preceding siblings ...)
  2022-11-21 13:35 ` [PATCH v1 3/3] net: simplify sk_page_frag Benjamin Coddington
@ 2022-11-21 13:56 ` David Howells
  2022-11-21 14:34   ` Benjamin Coddington
  2022-12-07 11:44 ` [PATCH v1 0/3] " Benjamin Coddington
  4 siblings, 1 reply; 25+ messages in thread
From: David Howells @ 2022-11-21 13:56 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: netdev, linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs


Benjamin Coddington <bcodding@redhat.com> wrote:

> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> GFP_NOIO flag on sk_allocation which the networking system uses to decide
> when it is safe to use current->task_frag.

Um, what's task_frag?

David


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 13:56 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag David Howells
@ 2022-11-21 14:34   ` Benjamin Coddington
  2022-11-21 21:40     ` Shuah Khan
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Coddington @ 2022-11-21 14:34 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs

On 21 Nov 2022, at 8:56, David Howells wrote:

> Benjamin Coddington <bcodding@redhat.com> wrote:
>
>> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
>> GFP_NOIO flag on sk_allocation which the networking system uses to decide
>> when it is safe to use current->task_frag.
>
> Um, what's task_frag?

Its a per-task page_frag used to coalesce small writes for networking -- see:

5640f7685831 net: use a per task frag allocator

Ben


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 14:34   ` Benjamin Coddington
@ 2022-11-21 21:40     ` Shuah Khan
  2022-11-21 21:43       ` Shuah Khan
  0 siblings, 1 reply; 25+ messages in thread
From: Shuah Khan @ 2022-11-21 21:40 UTC (permalink / raw)
  To: Benjamin Coddington, David Howells
  Cc: netdev, linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs, Shuah Khan

On 11/21/22 07:34, Benjamin Coddington wrote:
> On 21 Nov 2022, at 8:56, David Howells wrote:
> 
>> Benjamin Coddington <bcodding@redhat.com> wrote:
>>
>>> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
>>> GFP_NOIO flag on sk_allocation which the networking system uses to decide
>>> when it is safe to use current->task_frag.
>>
>> Um, what's task_frag?
> 
> Its a per-task page_frag used to coalesce small writes for networking -- see:
> 
> 5640f7685831 net: use a per task frag allocator
> 
> Ben
> 
> 

I am not seeing this in the mainline. Where can find this commit?

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 21:40     ` Shuah Khan
@ 2022-11-21 21:43       ` Shuah Khan
  2022-11-21 22:01         ` Benjamin Coddington
  0 siblings, 1 reply; 25+ messages in thread
From: Shuah Khan @ 2022-11-21 21:43 UTC (permalink / raw)
  To: Benjamin Coddington, David Howells
  Cc: netdev, linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs, Shuah Khan

On 11/21/22 14:40, Shuah Khan wrote:
> On 11/21/22 07:34, Benjamin Coddington wrote:
>> On 21 Nov 2022, at 8:56, David Howells wrote:
>>
>>> Benjamin Coddington <bcodding@redhat.com> wrote:
>>>
>>>> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
>>>> GFP_NOIO flag on sk_allocation which the networking system uses to decide
>>>> when it is safe to use current->task_frag.
>>>
>>> Um, what's task_frag?
>>
>> Its a per-task page_frag used to coalesce small writes for networking -- see:
>>
>> 5640f7685831 net: use a per task frag allocator
>>
>> Ben
>>
>>
> 
> I am not seeing this in the mainline. Where can find this commit?
> 

Okay. I see this commit in the mainline. However, I don't see the
sk_use_task_frag in mainline.

thanks,
-- Shuah


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 21:43       ` Shuah Khan
@ 2022-11-21 22:01         ` Benjamin Coddington
  2022-11-21 22:32           ` Shuah Khan
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Coddington @ 2022-11-21 22:01 UTC (permalink / raw)
  To: Shuah Khan
  Cc: David Howells, netdev, linux-kernel, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Josef Bacik, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Lee Duncan, Chris Leech, Mike Christie, James E.J. Bottomley,
	Martin K. Petersen, Valentina Manea, Shuah Khan,
	Greg Kroah-Hartman, Marc Dionne, Steve French,
	Christine Caulfield, David Teigland, Mark Fasheh, Joel Becker,
	Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs

On 21 Nov 2022, at 16:43, Shuah Khan wrote:

> On 11/21/22 14:40, Shuah Khan wrote:
>> On 11/21/22 07:34, Benjamin Coddington wrote:
>>> On 21 Nov 2022, at 8:56, David Howells wrote:
>>>
>>>> Benjamin Coddington <bcodding@redhat.com> wrote:
>>>>
>>>>> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
>>>>> GFP_NOIO flag on sk_allocation which the networking system uses to decide
>>>>> when it is safe to use current->task_frag.
>>>>
>>>> Um, what's task_frag?
>>>
>>> Its a per-task page_frag used to coalesce small writes for networking -- see:
>>>
>>> 5640f7685831 net: use a per task frag allocator
>>>
>>> Ben
>>>
>>>
>>
>> I am not seeing this in the mainline. Where can find this commit?
>>
>
> Okay. I see this commit in the mainline. However, I don't see the
> sk_use_task_frag in mainline.

sk_use_task_frag is in patch 1/3 in this posting.

https://lore.kernel.org/netdev/26d98c8f-372b-b9c8-c29f-096cddaff149@linuxfoundation.org/T/#m3271959c4cf8dcff1c0c6ba023b2b3821d9e7e99

Ben


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 22:01         ` Benjamin Coddington
@ 2022-11-21 22:32           ` Shuah Khan
  2022-11-22 14:23             ` Benjamin Coddington
  0 siblings, 1 reply; 25+ messages in thread
From: Shuah Khan @ 2022-11-21 22:32 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: David Howells, netdev, linux-kernel, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Josef Bacik, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Lee Duncan, Chris Leech, Mike Christie, James E.J. Bottomley,
	Martin K. Petersen, Valentina Manea, Shuah Khan,
	Greg Kroah-Hartman, Marc Dionne, Steve French,
	Christine Caulfield, David Teigland, Mark Fasheh, Joel Becker,
	Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs, Shuah Khan

On 11/21/22 15:01, Benjamin Coddington wrote:
> On 21 Nov 2022, at 16:43, Shuah Khan wrote:
> 
>> On 11/21/22 14:40, Shuah Khan wrote:
>>> On 11/21/22 07:34, Benjamin Coddington wrote:
>>>> On 21 Nov 2022, at 8:56, David Howells wrote:
>>>>
>>>>> Benjamin Coddington <bcodding@redhat.com> wrote:
>>>>>
>>>>>> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
>>>>>> GFP_NOIO flag on sk_allocation which the networking system uses to decide
>>>>>> when it is safe to use current->task_frag.
>>>>>
>>>>> Um, what's task_frag?
>>>>
>>>> Its a per-task page_frag used to coalesce small writes for networking -- see:
>>>>
>>>> 5640f7685831 net: use a per task frag allocator
>>>>
>>>> Ben
>>>>
>>>>
>>>
>>> I am not seeing this in the mainline. Where can find this commit?
>>>
>>
>> Okay. I see this commit in the mainline. However, I don't see the
>> sk_use_task_frag in mainline.
> 
> sk_use_task_frag is in patch 1/3 in this posting.
> 
> https://lore.kernel.org/netdev/26d98c8f-372b-b9c8-c29f-096cddaff149@linuxfoundation.org/T/#m3271959c4cf8dcff1c0c6ba023b2b3821d9e7e99
> 

Aha. I don't have 1/3 in my Inbox - I think it would make
sense to cc people on the first patch so we can understand
the premise for the change.

thanks,
-- Shuah
  


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 22:32           ` Shuah Khan
@ 2022-11-22 14:23             ` Benjamin Coddington
  0 siblings, 0 replies; 25+ messages in thread
From: Benjamin Coddington @ 2022-11-22 14:23 UTC (permalink / raw)
  To: Shuah Khan
  Cc: David Howells, netdev, linux-kernel, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Josef Bacik, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Lee Duncan, Chris Leech, Mike Christie, James E.J. Bottomley,
	Martin K. Petersen, Valentina Manea, Shuah Khan,
	Greg Kroah-Hartman, Marc Dionne, Steve French,
	Christine Caulfield, David Teigland, Mark Fasheh, Joel Becker,
	Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs

On 21 Nov 2022, at 17:32, Shuah Khan wrote:

> On 11/21/22 15:01, Benjamin Coddington wrote:
>> On 21 Nov 2022, at 16:43, Shuah Khan wrote:
>>
>>> On 11/21/22 14:40, Shuah Khan wrote:
>>>> On 11/21/22 07:34, Benjamin Coddington wrote:
>>>>> On 21 Nov 2022, at 8:56, David Howells wrote:
>>>>>
>>>>>> Benjamin Coddington <bcodding@redhat.com> wrote:
>>>>>>
>>>>>>> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
>>>>>>> GFP_NOIO flag on sk_allocation which the networking system uses to decide
>>>>>>> when it is safe to use current->task_frag.
>>>>>>
>>>>>> Um, what's task_frag?
>>>>>
>>>>> Its a per-task page_frag used to coalesce small writes for networking -- see:
>>>>>
>>>>> 5640f7685831 net: use a per task frag allocator
>>>>>
>>>>> Ben
>>>>>
>>>>>
>>>>
>>>> I am not seeing this in the mainline. Where can find this commit?
>>>>
>>>
>>> Okay. I see this commit in the mainline. However, I don't see the
>>> sk_use_task_frag in mainline.
>>
>> sk_use_task_frag is in patch 1/3 in this posting.
>>
>> https://lore.kernel.org/netdev/26d98c8f-372b-b9c8-c29f-096cddaff149@linuxfoundation.org/T/#m3271959c4cf8dcff1c0c6ba023b2b3821d9e7e99
>>
>
> Aha. I don't have 1/3 in my Inbox - I think it would make
> sense to cc people on the first patch so we can understand
> the premise for the change.

Yeah, I can do that if it goes to another version, I was just trying to be
considerate of all the noise this sort of posting generates.

Ben


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 13:35 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
@ 2022-11-29 14:02   ` Christoph Hellwig
  2022-11-29 16:47     ` Benjamin Coddington
                       ` (2 more replies)
  2022-12-09 12:37   ` Paolo Abeni
  1 sibling, 3 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-11-29 14:02 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: netdev, linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Ilya Dryomov,
	Xiubo Li, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, drbd-dev, linux-block, nbd, linux-nvme, open-iscsi,
	linux-scsi, linux-usb, linux-afs, linux-cifs, samba-technical,
	cluster-devel, ocfs2-devel, v9fs-developer, ceph-devel,
	linux-nfs

Hmm.  Having to set a flag to not accidentally corrupt per-task
state seems a bit fragile.  Wouldn't it make sense to find a way to opt
into the feature only for sockets created from the syscall layer?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-29 14:02   ` Christoph Hellwig
@ 2022-11-29 16:47     ` Benjamin Coddington
  2022-11-30 12:07       ` Guillaume Nault
  2022-11-29 17:42     ` Jeff Layton
  2022-11-30 11:48     ` Guillaume Nault
  2 siblings, 1 reply; 25+ messages in thread
From: Benjamin Coddington @ 2022-11-29 16:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Sagi Grimberg, Lee Duncan, Chris Leech, Mike Christie,
	James E.J. Bottomley, Martin K. Petersen, Valentina Manea,
	Shuah Khan, Greg Kroah-Hartman, David Howells, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs

On 29 Nov 2022, at 9:02, Christoph Hellwig wrote:

> Hmm.  Having to set a flag to not accidentally corrupt per-task
> state seems a bit fragile.  Wouldn't it make sense to find a way to opt
> into the feature only for sockets created from the syscall layer?

It's totally fragile, and that's why it's currently broken in production.
The fragile ship sailed when networking decided to depend on users setting
the socket's GFP_ flags correctly to avoid corruption.

Meantime, this problem needs fixing in a way that makes everyone happy.
This fix doesn't make it less fragile, but it may (hopefully) address the
previous criticisms enough that something gets done to fix it.

Ben


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-29 14:02   ` Christoph Hellwig
  2022-11-29 16:47     ` Benjamin Coddington
@ 2022-11-29 17:42     ` Jeff Layton
  2022-11-30 11:48     ` Guillaume Nault
  2 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2022-11-29 17:42 UTC (permalink / raw)
  To: Christoph Hellwig, Benjamin Coddington
  Cc: netdev, linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Sagi Grimberg, Lee Duncan, Chris Leech, Mike Christie,
	James E.J. Bottomley, Martin K. Petersen, Valentina Manea,
	Shuah Khan, Greg Kroah-Hartman, David Howells, Marc Dionne,
	Steve French, Christine Caulfield, David Teigland, Mark Fasheh,
	Joel Becker, Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, drbd-dev,
	linux-block, nbd, linux-nvme, open-iscsi, linux-scsi, linux-usb,
	linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs

On Tue, 2022-11-29 at 15:02 +0100, Christoph Hellwig wrote:
> Hmm.  Having to set a flag to not accidentally corrupt per-task
> state seems a bit fragile.  Wouldn't it make sense to find a way to opt
> into the feature only for sockets created from the syscall layer?

I agree that that would be cleaner. task_frag should have been an opt-in
thing all along. That change regressed all of the in-kernel users of
sockets.

Where would be the right place to set that flag for only userland
sockets? A lot of the in-kernel socket users hook into the socket API at
a fairly high-level. 9P and CIFS, for instance, call __sock_create.

We could set it in the syscall handlers (and maybe in iouring) I
suppose, but that seems like the wrong thing to do too.

In the absence of a clean place to do this, I think we're going to be
stuck doing it the way Ben has proposed...
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-29 14:02   ` Christoph Hellwig
  2022-11-29 16:47     ` Benjamin Coddington
  2022-11-29 17:42     ` Jeff Layton
@ 2022-11-30 11:48     ` Guillaume Nault
  2 siblings, 0 replies; 25+ messages in thread
From: Guillaume Nault @ 2022-11-30 11:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Benjamin Coddington, netdev, linux-kernel, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Josef Bacik, Keith Busch, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Ilya Dryomov,
	Xiubo Li, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, drbd-dev, linux-block, nbd, linux-nvme, open-iscsi,
	linux-scsi, linux-usb, linux-afs, linux-cifs, samba-technical,
	cluster-devel, ocfs2-devel, v9fs-developer, ceph-devel,
	linux-nfs

On Tue, Nov 29, 2022 at 03:02:42PM +0100, Christoph Hellwig wrote:
> Hmm.  Having to set a flag to not accidentally corrupt per-task
> state seems a bit fragile.  Wouldn't it make sense to find a way to opt
> into the feature only for sockets created from the syscall layer?

That's something I originally considered. But, as far as I can see, nbd
needs this flag _and_ uses sockets created in user space. So it'd still
need to opt out manually.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-29 16:47     ` Benjamin Coddington
@ 2022-11-30 12:07       ` Guillaume Nault
  0 siblings, 0 replies; 25+ messages in thread
From: Guillaume Nault @ 2022-11-30 12:07 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: Christoph Hellwig, netdev, linux-kernel, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Josef Bacik, Keith Busch, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Ilya Dryomov,
	Xiubo Li, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, drbd-dev, linux-block, nbd, linux-nvme, open-iscsi,
	linux-scsi, linux-usb, linux-afs, linux-cifs, samba-technical,
	cluster-devel, ocfs2-devel, v9fs-developer, ceph-devel,
	linux-nfs

On Tue, Nov 29, 2022 at 11:47:47AM -0500, Benjamin Coddington wrote:
> On 29 Nov 2022, at 9:02, Christoph Hellwig wrote:
> 
> > Hmm.  Having to set a flag to not accidentally corrupt per-task
> > state seems a bit fragile.  Wouldn't it make sense to find a way to opt
> > into the feature only for sockets created from the syscall layer?
> 
> It's totally fragile, and that's why it's currently broken in production.
> The fragile ship sailed when networking decided to depend on users setting
> the socket's GFP_ flags correctly to avoid corruption.
> 
> Meantime, this problem needs fixing in a way that makes everyone happy.
> This fix doesn't make it less fragile, but it may (hopefully) address the
> previous criticisms enough that something gets done to fix it.

Also, let's remember that while we're discussing how the kernel sould
work in an ideal world, the reality is that production NFS systems
crash randomly upon memory reclaim since commit a1231fda7e94 ("SUNRPC:
Set memalloc_nofs_save() on all rpciod/xprtiod jobs"). Fixing that is
just a matter of re-introducing GFP_NOFS on SUNRPC sockets (which has
been proposed several times already). Then we'll have plenty of time
to argue about how networking should use the per-task page_frag and
how to remove GFP_NOFS in the long term.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 0/3] Stop corrupting socket's task_frag
  2022-11-21 13:35 [PATCH v1 0/3] Stop corrupting socket's task_frag Benjamin Coddington
                   ` (3 preceding siblings ...)
  2022-11-21 13:56 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag David Howells
@ 2022-12-07 11:44 ` Benjamin Coddington
  4 siblings, 0 replies; 25+ messages in thread
From: Benjamin Coddington @ 2022-12-07 11:44 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, netdev

Hi Dave, Eric, Jakub, Paolo,

I think it makes sense for all three of these to go together through netdev.
If you agree, would you like me to chase down individual ACKs for each
treewide touch?

What can I do from netdev's perspective to move this forward?

Ben

On 21 Nov 2022, at 8:35, Benjamin Coddington wrote:

> The networking code uses flags in sk_allocation to determine if it can use
> current->task_frag, however in-kernel users of sockets may stop setting
> sk_allocation when they convert to the preferred memalloc_nofs_save/restore,
> as SUNRPC has done in commit a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save()
> on all rpciod/xprtiod jobs").
>
> This will cause corruption in current->task_frag when recursing into the
> network layer for those subsystems during page fault or reclaim.  The
> corruption is difficult to diagnose because stack traces may not contain the
> offending subsystem at all.  The corruption is unlikely to show up in
> testing because it requires memory pressure, and so subsystems that
> convert to memalloc_nofs_save/restore are likely to continue to run into
> this issue.
>
> Previous reports and proposed fixes:
> https://lore.kernel.org/netdev/96a18bd00cbc6cb554603cc0d6ef1c551965b078.1663762494.git.gnault@redhat.com/
> https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
> https://lore.kernel.org/linux-nfs/de6d99321d1dcaa2ad456b92b3680aa77c07a747.1665401788.git.gnault@redhat.com/
>
> Guilluame Nault has done all of the hard work tracking this problem down and
> finding the best fix for this issue.  I'm just taking a turn posting another
> fix.
>
> Benjamin Coddington (2):
>   Treewide: Stop corrupting socket's task_frag
>   net: simplify sk_page_frag
>
> Guillaume Nault (1):
>   net: Introduce sk_use_task_frag in struct sock.
>
>  drivers/block/drbd/drbd_receiver.c |  3 +++
>  drivers/block/nbd.c                |  1 +
>  drivers/nvme/host/tcp.c            |  1 +
>  drivers/scsi/iscsi_tcp.c           |  1 +
>  drivers/usb/usbip/usbip_common.c   |  1 +
>  fs/afs/rxrpc.c                     |  1 +
>  fs/cifs/connect.c                  |  1 +
>  fs/dlm/lowcomms.c                  |  2 ++
>  fs/ocfs2/cluster/tcp.c             |  1 +
>  include/net/sock.h                 | 10 ++++++----
>  net/9p/trans_fd.c                  |  1 +
>  net/ceph/messenger.c               |  1 +
>  net/core/sock.c                    |  1 +
>  net/sunrpc/xprtsock.c              |  3 +++
>  14 files changed, 24 insertions(+), 4 deletions(-)
>
> -- 
> 2.31.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 1/3] net: Introduce sk_use_task_frag in struct sock.
  2022-11-21 13:35 ` [PATCH v1 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
@ 2022-12-09 12:09   ` Paolo Abeni
  2022-12-09 14:16     ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2022-12-09 12:09 UTC (permalink / raw)
  To: Benjamin Coddington, netdev
  Cc: linux-kernel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Soheil Hassas Yeganeh, Shakeel Butt, Pavel Begunkov,
	Kuniyuki Iwashima, Maciej Żenczykowski, Menglong Dong,
	Akhmat Karakotov, Alexander Duyck

On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> From: Guillaume Nault <gnault@redhat.com>
> 
> Sockets that can be used while recursing into memory reclaim, like
> those used by network block devices and file systems, mustn't use
> current->task_frag: if the current process is already using it, then
> the inner memory reclaim call would corrupt the task_frag structure.
> 
> To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
> that mustn't use current->task_frag, assuming that those used during
> memory reclaim had their allocation constraints reflected in
> ->sk_allocation.
> 
> This unfortunately doesn't cover all cases: in an attempt to remove all
> usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
> ->sk_allocation, and used memalloc_nofs critical sections instead.
> This breaks the sk_page_frag() heuristic since the allocation
> constraints are now stored in current->flags, which sk_page_frag()
> can't read without risking triggering a cache miss and slowing down
> TCP's fast path.
> 
> This patch creates a new field in struct sock, named sk_use_task_frag,
> which sockets with memory reclaim constraints can set to false if they
> can't safely use current->task_frag. In such cases, sk_page_frag() now
> always returns the socket's page_frag (->sk_frag). The first user is
> sunrpc, which needs to avoid using current->task_frag but can keep
> ->sk_allocation set to GFP_KERNEL otherwise.
> 
> Eventually, it might be possible to simplify sk_page_frag() by only
> testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
> heuristic entirely (assuming other sockets will set ->sk_use_task_frag
> according to their constraints in the future).
> 
> The new ->sk_use_task_frag field is placed in a hole in struct sock and
> belongs to a cache line shared with ->sk_shutdown. Therefore it should
> be hot and shouldn't have negative performance impacts on TCP's fast
> path (sk_shutdown is tested just before the while() loop in
> tcp_sendmsg_locked()).
> 
> Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
> Signed-off-by: Guillaume Nault <gnault@redhat.com>
> ---
>  include/net/sock.h | 11 +++++++++--
>  net/core/sock.c    |  1 +
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index d08cfe190a78..ffba9e95470d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -318,6 +318,9 @@ struct sk_filter;
>    *	@sk_stamp: time stamp of last packet received
>    *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
>    *	@sk_tsflags: SO_TIMESTAMPING flags
> +  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
> +			   Sockets that can be used under memory reclaim should
> +			   set this to false.
>    *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
>    *	              for timestamping
>    *	@sk_tskey: counter to disambiguate concurrent tstamp requests
> @@ -504,6 +507,7 @@ struct sock {
>  #endif
>  	u16			sk_tsflags;
>  	u8			sk_shutdown;
> +	bool			sk_use_task_frag;
>  	atomic_t		sk_tskey;
>  	atomic_t		sk_zckey;

I think the above should be fine from a data locality PoV, as the used
cacheline should be hot at sk_page_frag_refill() usage time, as
sk_tsflags has been accessed just before.

@Eric, does the above fit with the planned sock fields reordering?

Jakub noted we could use a bitfield here to be future proof for
additional flags addition. I think in this specific case a bool is
preferable, because we actually wont to discourage people to add more
of such flags, and the search for holes (or the bool -> bitflag
conversion) should give to such eventual future changes some additional
thoughts.

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-11-21 13:35 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
  2022-11-29 14:02   ` Christoph Hellwig
@ 2022-12-09 12:37   ` Paolo Abeni
  2022-12-09 16:11     ` Jakub Kicinski
  1 sibling, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2022-12-09 12:37 UTC (permalink / raw)
  To: Benjamin Coddington, netdev
  Cc: linux-kernel, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Ilya Dryomov, Xiubo Li,
	Trond Myklebust, Anna Schumaker, Chuck Lever, Jeff Layton,
	drbd-dev, linux-block, nbd, linux-nvme, open-iscsi, linux-scsi,
	linux-usb, linux-afs, linux-cifs, samba-technical, cluster-devel,
	ocfs2-devel, v9fs-developer, ceph-devel, linux-nfs

On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> GFP_NOIO flag on sk_allocation which the networking system uses to decide
> when it is safe to use current->task_frag.  The results of this are
> unexpected corruption in task_frag when SUNRPC is involved in memory
> reclaim.
> 
> The corruption can be seen in crashes, but the root cause is often
> difficult to ascertain as a crashing machine's stack trace will have no
> evidence of being near NFS or SUNRPC code.  I believe this problem to
> be much more pervasive than reports to the community may indicate.
> 
> Fix this by having kernel users of sockets that may corrupt task_frag due
> to reclaim set sk_use_task_frag = false.  Preemptively correcting this
> situation for users that still set sk_allocation allows them to convert to
> memalloc_nofs_save/restore without the same unexpected corruptions that are
> sure to follow, unlikely to show up in testing, and difficult to bisect.
> 
> CC: Philipp Reisner <philipp.reisner@linbit.com>
> CC: Lars Ellenberg <lars.ellenberg@linbit.com>
> CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
> CC: Jens Axboe <axboe@kernel.dk>
> CC: Josef Bacik <josef@toxicpanda.com>
> CC: Keith Busch <kbusch@kernel.org>
> CC: Christoph Hellwig <hch@lst.de>
> CC: Sagi Grimberg <sagi@grimberg.me>
> CC: Lee Duncan <lduncan@suse.com>
> CC: Chris Leech <cleech@redhat.com>
> CC: Mike Christie <michael.christie@oracle.com>
> CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
> CC: "Martin K. Petersen" <martin.petersen@oracle.com>
> CC: Valentina Manea <valentina.manea.m@gmail.com>
> CC: Shuah Khan <shuah@kernel.org>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> CC: David Howells <dhowells@redhat.com>
> CC: Marc Dionne <marc.dionne@auristor.com>
> CC: Steve French <sfrench@samba.org>
> CC: Christine Caulfield <ccaulfie@redhat.com>
> CC: David Teigland <teigland@redhat.com>
> CC: Mark Fasheh <mark@fasheh.com>
> CC: Joel Becker <jlbec@evilplan.org>
> CC: Joseph Qi <joseph.qi@linux.alibaba.com>
> CC: Eric Van Hensbergen <ericvh@gmail.com>
> CC: Latchesar Ionkov <lucho@ionkov.net>
> CC: Dominique Martinet <asmadeus@codewreck.org>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Eric Dumazet <edumazet@google.com>
> CC: Jakub Kicinski <kuba@kernel.org>
> CC: Paolo Abeni <pabeni@redhat.com>
> CC: Ilya Dryomov <idryomov@gmail.com>
> CC: Xiubo Li <xiubli@redhat.com>
> CC: Chuck Lever <chuck.lever@oracle.com>
> CC: Jeff Layton <jlayton@kernel.org>
> CC: Trond Myklebust <trond.myklebust@hammerspace.com>
> CC: Anna Schumaker <anna@kernel.org>
> CC: drbd-dev@lists.linbit.com
> CC: linux-block@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> CC: nbd@other.debian.org
> CC: linux-nvme@lists.infradead.org
> CC: open-iscsi@googlegroups.com
> CC: linux-scsi@vger.kernel.org
> CC: linux-usb@vger.kernel.org
> CC: linux-afs@lists.infradead.org
> CC: linux-cifs@vger.kernel.org
> CC: samba-technical@lists.samba.org
> CC: cluster-devel@redhat.com
> CC: ocfs2-devel@oss.oracle.com
> CC: v9fs-developer@lists.sourceforge.net
> CC: netdev@vger.kernel.org
> CC: ceph-devel@vger.kernel.org
> CC: linux-nfs@vger.kernel.org
> 
> Suggested-by: Guillaume Nault <gnault@redhat.com>
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>

I think this is the most feasible way out of the existing issue, and I
think this patchset should go via the networking tree, targeting the
Linux 6.2.

If someone has disagreement with the above, please speak! 

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 1/3] net: Introduce sk_use_task_frag in struct sock.
  2022-12-09 12:09   ` Paolo Abeni
@ 2022-12-09 14:16     ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2022-12-09 14:16 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Benjamin Coddington, netdev, linux-kernel, David S. Miller,
	Jakub Kicinski, Soheil Hassas Yeganeh, Shakeel Butt,
	Pavel Begunkov, Kuniyuki Iwashima, Maciej Żenczykowski,
	Menglong Dong, Akhmat Karakotov, Alexander Duyck

On Fri, Dec 9, 2022 at 1:09 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> > From: Guillaume Nault <gnault@redhat.com>
> >
> > Sockets that can be used while recursing into memory reclaim, like
> > those used by network block devices and file systems, mustn't use
> > current->task_frag: if the current process is already using it, then
> > the inner memory reclaim call would corrupt the task_frag structure.
> >
> > To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
> > that mustn't use current->task_frag, assuming that those used during
> > memory reclaim had their allocation constraints reflected in
> > ->sk_allocation.
> >
> > This unfortunately doesn't cover all cases: in an attempt to remove all
> > usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
> > ->sk_allocation, and used memalloc_nofs critical sections instead.
> > This breaks the sk_page_frag() heuristic since the allocation
> > constraints are now stored in current->flags, which sk_page_frag()
> > can't read without risking triggering a cache miss and slowing down
> > TCP's fast path.
> >
> > This patch creates a new field in struct sock, named sk_use_task_frag,
> > which sockets with memory reclaim constraints can set to false if they
> > can't safely use current->task_frag. In such cases, sk_page_frag() now
> > always returns the socket's page_frag (->sk_frag). The first user is
> > sunrpc, which needs to avoid using current->task_frag but can keep
> > ->sk_allocation set to GFP_KERNEL otherwise.
> >
> > Eventually, it might be possible to simplify sk_page_frag() by only
> > testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
> > heuristic entirely (assuming other sockets will set ->sk_use_task_frag
> > according to their constraints in the future).
> >
> > The new ->sk_use_task_frag field is placed in a hole in struct sock and
> > belongs to a cache line shared with ->sk_shutdown. Therefore it should
> > be hot and shouldn't have negative performance impacts on TCP's fast
> > path (sk_shutdown is tested just before the while() loop in
> > tcp_sendmsg_locked()).
> >
> > Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
> > Signed-off-by: Guillaume Nault <gnault@redhat.com>
> > ---
> >  include/net/sock.h | 11 +++++++++--
> >  net/core/sock.c    |  1 +
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index d08cfe190a78..ffba9e95470d 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -318,6 +318,9 @@ struct sk_filter;
> >    *  @sk_stamp: time stamp of last packet received
> >    *  @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
> >    *  @sk_tsflags: SO_TIMESTAMPING flags
> > +  *  @sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
> > +                        Sockets that can be used under memory reclaim should
> > +                        set this to false.
> >    *  @sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
> >    *                for timestamping
> >    *  @sk_tskey: counter to disambiguate concurrent tstamp requests
> > @@ -504,6 +507,7 @@ struct sock {
> >  #endif
> >       u16                     sk_tsflags;
> >       u8                      sk_shutdown;
> > +     bool                    sk_use_task_frag;
> >       atomic_t                sk_tskey;
> >       atomic_t                sk_zckey;
>
> I think the above should be fine from a data locality PoV, as the used
> cacheline should be hot at sk_page_frag_refill() usage time, as
> sk_tsflags has been accessed just before.
>
> @Eric, does the above fit with the planned sock fields reordering?

Do not worry about that, this can be handled later if needed.

>
> Jakub noted we could use a bitfield here to be future proof for
> additional flags addition. I think in this specific case a bool is
> preferable, because we actually wont to discourage people to add more
> of such flags, and the search for holes (or the bool -> bitflag
> conversion) should give to such eventual future changes some additional
> thoughts.
>
> Thanks!
>
> Paolo
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-09 12:37   ` Paolo Abeni
@ 2022-12-09 16:11     ` Jakub Kicinski
  2022-12-09 16:50       ` Paolo Abeni
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2022-12-09 16:11 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Benjamin Coddington, netdev, linux-kernel, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Josef Bacik, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Lee Duncan, Chris Leech, Mike Christie, James E.J. Bottomley,
	Martin K. Petersen, Valentina Manea, Shuah Khan,
	Greg Kroah-Hartman, David Howells, Marc Dionne, Steve French,
	Christine Caulfield, David Teigland, Mark Fasheh, Joel Becker,
	Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet, Ilya Dryomov,
	Xiubo Li, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, drbd-dev, linux-block, nbd, linux-nvme, open-iscsi,
	linux-scsi, linux-usb, linux-afs, linux-cifs, samba-technical,
	cluster-devel, ocfs2-devel, v9fs-developer, ceph-devel,
	linux-nfs

On Fri, 09 Dec 2022 13:37:08 +0100 Paolo Abeni wrote:
> I think this is the most feasible way out of the existing issue, and I
> think this patchset should go via the networking tree, targeting the
> Linux 6.2.

FWIW some fields had been moved so this will not longer apply cleanly,
see b534dc46c8ae016. But I think we can apply it to net since the merge
window is upon us? Just a heads up.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 3/3] net: simplify sk_page_frag
  2022-11-21 13:35 ` [PATCH v1 3/3] net: simplify sk_page_frag Benjamin Coddington
@ 2022-12-09 16:42   ` Paolo Abeni
  2022-12-09 16:44   ` Paolo Abeni
  2022-12-09 16:44   ` Paolo Abeni
  2 siblings, 0 replies; 25+ messages in thread
From: Paolo Abeni @ 2022-12-09 16:42 UTC (permalink / raw)
  To: Benjamin Coddington, netdev
  Cc: linux-kernel, David S. Miller, Eric Dumazet, Jakub Kicinski

On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> Now that in-kernel socket users that may recurse during reclaim have benn
> converted to sk_use_task_frag = false, we can have sk_page_frag() simply
> check that value.
> 
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
>  include/net/sock.h | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ffba9e95470d..fac24c6ee30d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2539,19 +2539,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
>   * Both direct reclaim and page faults can nest inside other
>   * socket operations and end up recursing into sk_page_frag()
>   * while it's already in use: explicitly avoid task page_frag
> - * usage if the caller is potentially doing any of them.
> - * This assumes that page fault handlers use the GFP_NOFS flags or
> - * explicitly disable sk_use_task_frag.
> + * when users disable sk_use_task_frag.
>   *
>   * Return: a per task page_frag if context allows that,
>   * otherwise a per socket one.
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -	if (sk->sk_use_task_frag &&
> -	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
> -				  __GFP_FS)) ==
> -	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
> +	if (sk->sk_use_task_frag)
>  		return &current->task_frag;
>  
>  	return &sk->sk_frag;

To make the above as safe as possible I think we should double-check
the in-kernel users explicitly setting sk_allocation to GFP_ATOMIC, as
that has the side effect of disabling the task_frag usage, too.

Patch 2/3 already catches some of such users, and we can safely leave
alone few others, (specifically l2tp, fou and inet_ctl_sock_create()).

Even wireguard and tls looks safe IMHO.

So the only left-over should be espintcp, I suggest updating patch 2/3
clearing sk_use_task_frag even in espintcp_init_sk().

Other than that LGTM.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 3/3] net: simplify sk_page_frag
  2022-11-21 13:35 ` [PATCH v1 3/3] net: simplify sk_page_frag Benjamin Coddington
  2022-12-09 16:42   ` Paolo Abeni
@ 2022-12-09 16:44   ` Paolo Abeni
  2022-12-09 16:44   ` Paolo Abeni
  2 siblings, 0 replies; 25+ messages in thread
From: Paolo Abeni @ 2022-12-09 16:44 UTC (permalink / raw)
  To: Benjamin Coddington, netdev
  Cc: linux-kernel, David S. Miller, Eric Dumazet, Jakub Kicinski

On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> Now that in-kernel socket users that may recurse during reclaim have benn
> converted to sk_use_task_frag = false, we can have sk_page_frag() simply
> check that value.
> 
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
>  include/net/sock.h | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ffba9e95470d..fac24c6ee30d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2539,19 +2539,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
>   * Both direct reclaim and page faults can nest inside other
>   * socket operations and end up recursing into sk_page_frag()
>   * while it's already in use: explicitly avoid task page_frag
> - * usage if the caller is potentially doing any of them.
> - * This assumes that page fault handlers use the GFP_NOFS flags or
> - * explicitly disable sk_use_task_frag.
> + * when users disable sk_use_task_frag.
>   *
>   * Return: a per task page_frag if context allows that,
>   * otherwise a per socket one.
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -	if (sk->sk_use_task_frag &&
> -	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
> -				  __GFP_FS)) ==
> -	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
> +	if (sk->sk_use_task_frag)
>  		return &current->task_frag;
>  
>  	return &sk->sk_frag;

To make the above as safe as possible I think we should double-check
the in-kernel users explicitly setting sk_allocation to GFP_ATOMIC, as
that has the side effect of disabling the task_frag usage, too.

Patch 2/3 already catches some of such users, and we can safely leave
alone few others, (specifically l2tp, fou and inet_ctl_sock_create()).

Even wireguard and tls looks safe IMHO.

So the only left-over should be espintcp, I suggest updating patch 2/3
clearing sk_use_task_frag even in espintcp_init_sk().

Other than that LGTM.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 3/3] net: simplify sk_page_frag
  2022-11-21 13:35 ` [PATCH v1 3/3] net: simplify sk_page_frag Benjamin Coddington
  2022-12-09 16:42   ` Paolo Abeni
  2022-12-09 16:44   ` Paolo Abeni
@ 2022-12-09 16:44   ` Paolo Abeni
  2 siblings, 0 replies; 25+ messages in thread
From: Paolo Abeni @ 2022-12-09 16:44 UTC (permalink / raw)
  To: Benjamin Coddington, netdev
  Cc: linux-kernel, David S. Miller, Eric Dumazet, Jakub Kicinski

On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> Now that in-kernel socket users that may recurse during reclaim have benn
> converted to sk_use_task_frag = false, we can have sk_page_frag() simply
> check that value.
> 
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
>  include/net/sock.h | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ffba9e95470d..fac24c6ee30d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2539,19 +2539,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
>   * Both direct reclaim and page faults can nest inside other
>   * socket operations and end up recursing into sk_page_frag()
>   * while it's already in use: explicitly avoid task page_frag
> - * usage if the caller is potentially doing any of them.
> - * This assumes that page fault handlers use the GFP_NOFS flags or
> - * explicitly disable sk_use_task_frag.
> + * when users disable sk_use_task_frag.
>   *
>   * Return: a per task page_frag if context allows that,
>   * otherwise a per socket one.
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -	if (sk->sk_use_task_frag &&
> -	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
> -				  __GFP_FS)) ==
> -	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
> +	if (sk->sk_use_task_frag)
>  		return &current->task_frag;
>  
>  	return &sk->sk_frag;

To make the above as safe as possible I think we should double-check
the in-kernel users explicitly setting sk_allocation to GFP_ATOMIC, as
that has the side effect of disabling the task_frag usage, too.

Patch 2/3 already catches some of such users, and we can safely leave
alone few others, (specifically l2tp, fou and inet_ctl_sock_create()).

Even wireguard and tls looks safe IMHO.

So the only left-over should be espintcp, I suggest updating patch 2/3
clearing sk_use_task_frag even in espintcp_init_sk().

Other than that LGTM.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
  2022-12-09 16:11     ` Jakub Kicinski
@ 2022-12-09 16:50       ` Paolo Abeni
  0 siblings, 0 replies; 25+ messages in thread
From: Paolo Abeni @ 2022-12-09 16:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Benjamin Coddington, netdev, linux-kernel, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Josef Bacik, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Lee Duncan, Chris Leech, Mike Christie, James E.J. Bottomley,
	Martin K. Petersen, Valentina Manea, Shuah Khan,
	Greg Kroah-Hartman, David Howells, Marc Dionne, Steve French,
	Christine Caulfield, David Teigland, Mark Fasheh, Joel Becker,
	Joseph Qi, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Eric Dumazet, Ilya Dryomov,
	Xiubo Li, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, drbd-dev, linux-block, nbd, linux-nvme, open-iscsi,
	linux-scsi, linux-usb, linux-afs, linux-cifs, samba-technical,
	cluster-devel, ocfs2-devel, v9fs-developer, ceph-devel,
	linux-nfs

On Fri, 2022-12-09 at 08:11 -0800, Jakub Kicinski wrote:
> On Fri, 09 Dec 2022 13:37:08 +0100 Paolo Abeni wrote:
> > I think this is the most feasible way out of the existing issue, and I
> > think this patchset should go via the networking tree, targeting the
> > Linux 6.2.
> 
> FWIW some fields had been moved so this will not longer apply cleanly,
> see b534dc46c8ae016. But I think we can apply it to net since the merge
> window is upon us? Just a heads up.

We will need an additional revision, see my reply to patch 3/3. I think
the -net tree should be the appropriate target.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-12-09 16:55 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-21 13:35 [PATCH v1 0/3] Stop corrupting socket's task_frag Benjamin Coddington
2022-11-21 13:35 ` [PATCH v1 1/3] net: Introduce sk_use_task_frag in struct sock Benjamin Coddington
2022-12-09 12:09   ` Paolo Abeni
2022-12-09 14:16     ` Eric Dumazet
2022-11-21 13:35 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag Benjamin Coddington
2022-11-29 14:02   ` Christoph Hellwig
2022-11-29 16:47     ` Benjamin Coddington
2022-11-30 12:07       ` Guillaume Nault
2022-11-29 17:42     ` Jeff Layton
2022-11-30 11:48     ` Guillaume Nault
2022-12-09 12:37   ` Paolo Abeni
2022-12-09 16:11     ` Jakub Kicinski
2022-12-09 16:50       ` Paolo Abeni
2022-11-21 13:35 ` [PATCH v1 3/3] net: simplify sk_page_frag Benjamin Coddington
2022-12-09 16:42   ` Paolo Abeni
2022-12-09 16:44   ` Paolo Abeni
2022-12-09 16:44   ` Paolo Abeni
2022-11-21 13:56 ` [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag David Howells
2022-11-21 14:34   ` Benjamin Coddington
2022-11-21 21:40     ` Shuah Khan
2022-11-21 21:43       ` Shuah Khan
2022-11-21 22:01         ` Benjamin Coddington
2022-11-21 22:32           ` Shuah Khan
2022-11-22 14:23             ` Benjamin Coddington
2022-12-07 11:44 ` [PATCH v1 0/3] " Benjamin Coddington

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).