[RFC net] Should sk_page_frag() also look at the current GFP context?

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC net] Should sk_page_frag() also look at the current GFP context?
@ 2022-07-01 18:41 Guillaume Nault
  2022-07-07 15:31 ` Benjamin Coddington
  2022-07-07 16:29 ` Eric Dumazet
  0 siblings, 2 replies; 9+ messages in thread
From: Guillaume Nault @ 2022-07-01 18:41 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Chuck Lever, Jeff Layton, Trond Myklebust,
	Anna Schumaker, Steve French, Josef Bacik, Scott Mayhew,
	Benjamin Coddington, Tejun Heo

I'm investigating a kernel oops that looks similar to
20eb4f29b602 ("net: fix sk_page_frag() recursion from memory reclaim")
and dacb5d8875cc ("tcp: fix page frag corruption on page fault").

This time the problem happens on an NFS client, while the previous bzs
respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS in
their socket's ->sk_allocation field (using GFP_NOIO or GFP_NOFS), NFS
leaves sk_allocation to its default value since commit a1231fda7e94
("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").

To recap the original problems, in commit 20eb4f29b602 and dacb5d8875cc,
memory reclaim happened while executing tcp_sendmsg_locked(). The code
path entered tcp_sendmsg_locked() recursively as pages to be reclaimed
were backed by files on the network. The problem was that both the
outer and the inner tcp_sendmsg_locked() calls used current->task_frag,
thus leaving it in an inconsistent state. The fix was to use the
socket's ->sk_frag instead for the file system socket, so that the
inner and outer calls wouln't step on each other's toes.

But now that NFS doesn't modify ->sk_allocation anymore, sk_page_frag()
sees sunrpc sockets as plain TCP ones and returns ->task_frag in the
inner tcp_sendmsg_locked() call.

Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO and use
memalloc_no{fs,io}_save() instead. So maybe other network file systems
will also stop setting ->sk_allocation in the future and we should
teach sk_page_frag() to look at the current GFP flags. Or should we
stick to ->sk_allocation and make NFS drop __GFP_FS again?

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 include/net/sock.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 72ca97ccb460..b934c9851058 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -46,6 +46,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
+#include <linux/sched/mm.h>
 #include <linux/security.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
@@ -2503,14 +2504,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
  * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags.
+ * This assumes that page fault handlers use the GFP_NOFS flags
+ * or run under memalloc_nofs_save() protection.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
+	gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
+
+	if ((gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
 	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
 		return &current->task_frag;

-- 
2.21.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
  2022-07-01 18:41 [RFC net] Should sk_page_frag() also look at the current GFP context? Guillaume Nault
@ 2022-07-07 15:31 ` Benjamin Coddington
  2022-07-07 16:29 ` Eric Dumazet
  1 sibling, 0 replies; 9+ messages in thread
From: Benjamin Coddington @ 2022-07-07 15:31 UTC (permalink / raw)
  To: Guillaume Nault
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Chuck Lever, Jeff Layton, Trond Myklebust,
	Anna Schumaker, Steve French, Josef Bacik, Scott Mayhew,
	Tejun Heo

On 1 Jul 2022, at 14:41, Guillaume Nault wrote:

> I'm investigating a kernel oops that looks similar to
> 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory reclaim")
> and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
>
> This time the problem happens on an NFS client, while the previous bzs
> respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS in
> their socket's ->sk_allocation field (using GFP_NOIO or GFP_NOFS), NFS
> leaves sk_allocation to its default value since commit a1231fda7e94
> ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
>
> To recap the original problems, in commit 20eb4f29b602 and 
> dacb5d8875cc,
> memory reclaim happened while executing tcp_sendmsg_locked(). The code
> path entered tcp_sendmsg_locked() recursively as pages to be reclaimed
> were backed by files on the network. The problem was that both the
> outer and the inner tcp_sendmsg_locked() calls used 
> current->task_frag,
> thus leaving it in an inconsistent state. The fix was to use the
> socket's ->sk_frag instead for the file system socket, so that the
> inner and outer calls wouln't step on each other's toes.
>
> But now that NFS doesn't modify ->sk_allocation anymore, 
> sk_page_frag()
> sees sunrpc sockets as plain TCP ones and returns ->task_frag in the
> inner tcp_sendmsg_locked() call.
>
> Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO and use
> memalloc_no{fs,io}_save() instead. So maybe other network file systems
> will also stop setting ->sk_allocation in the future and we should
> teach sk_page_frag() to look at the current GFP flags. Or should we
> stick to ->sk_allocation and make NFS drop __GFP_FS again?

We need this fix in NFS.

I think we should try to get the other filesystems to move toward
memalloc_nofs_save() as per:
Documentation/core-api/gfp_mask-from-fs-io.rst

So this looks like the right fix to me and I think if you resend it 
without
the RFC and and question in the subject, it would get picked up.

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>

Ben

>
> Signed-off-by: Guillaume Nault <gnault@redhat.com>
> ---
>  include/net/sock.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 72ca97ccb460..b934c9851058 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -46,6 +46,7 @@
>  #include <linux/netdevice.h>
>  #include <linux/skbuff.h>	/* struct sk_buff */
>  #include <linux/mm.h>
> +#include <linux/sched/mm.h>
>  #include <linux/security.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> @@ -2503,14 +2504,17 @@ static inline void 
> sk_stream_moderate_sndbuf(struct sock *sk)
>   * socket operations and end up recursing into sk_page_frag()
>   * while it's already in use: explicitly avoid task page_frag
>   * usage if the caller is potentially doing any of them.
> - * This assumes that page fault handlers use the GFP_NOFS flags.
> + * This assumes that page fault handlers use the GFP_NOFS flags
> + * or run under memalloc_nofs_save() protection.
>   *
>   * Return: a per task page_frag if context allows that,
>   * otherwise a per socket one.
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | 
> __GFP_FS)) ==
> +	gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
> +
> +	if ((gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) 
> ==
>  	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
>  		return &current->task_frag;
>
> -- 
> 2.21.3

This seems like th


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
  2022-07-01 18:41 [RFC net] Should sk_page_frag() also look at the current GFP context? Guillaume Nault
  2022-07-07 15:31 ` Benjamin Coddington
@ 2022-07-07 16:29 ` Eric Dumazet
  2022-07-08 17:51   ` Guillaume Nault
  2022-07-08 18:10   ` Benjamin Coddington
  1 sibling, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2022-07-07 16:29 UTC (permalink / raw)
  To: Guillaume Nault
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steve French, Josef Bacik, Scott Mayhew, Benjamin Coddington,
	Tejun Heo

On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@redhat.com> wrote:
>
> I'm investigating a kernel oops that looks similar to
> 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory reclaim")
> and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
>
> This time the problem happens on an NFS client, while the previous bzs
> respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS in
> their socket's ->sk_allocation field (using GFP_NOIO or GFP_NOFS), NFS
> leaves sk_allocation to its default value since commit a1231fda7e94
> ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
>
> To recap the original problems, in commit 20eb4f29b602 and dacb5d8875cc,
> memory reclaim happened while executing tcp_sendmsg_locked(). The code
> path entered tcp_sendmsg_locked() recursively as pages to be reclaimed
> were backed by files on the network. The problem was that both the
> outer and the inner tcp_sendmsg_locked() calls used current->task_frag,
> thus leaving it in an inconsistent state. The fix was to use the
> socket's ->sk_frag instead for the file system socket, so that the
> inner and outer calls wouln't step on each other's toes.
>
> But now that NFS doesn't modify ->sk_allocation anymore, sk_page_frag()
> sees sunrpc sockets as plain TCP ones and returns ->task_frag in the
> inner tcp_sendmsg_locked() call.
>
> Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO and use
> memalloc_no{fs,io}_save() instead. So maybe other network file systems
> will also stop setting ->sk_allocation in the future and we should
> teach sk_page_frag() to look at the current GFP flags. Or should we
> stick to ->sk_allocation and make NFS drop __GFP_FS again?
>
> Signed-off-by: Guillaume Nault <gnault@redhat.com>

Can you provide a Fixes: tag ?

> ---
>  include/net/sock.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 72ca97ccb460..b934c9851058 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -46,6 +46,7 @@
>  #include <linux/netdevice.h>
>  #include <linux/skbuff.h>      /* struct sk_buff */
>  #include <linux/mm.h>
> +#include <linux/sched/mm.h>
>  #include <linux/security.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> @@ -2503,14 +2504,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
>   * socket operations and end up recursing into sk_page_frag()
>   * while it's already in use: explicitly avoid task page_frag
>   * usage if the caller is potentially doing any of them.
> - * This assumes that page fault handlers use the GFP_NOFS flags.
> + * This assumes that page fault handlers use the GFP_NOFS flags
> + * or run under memalloc_nofs_save() protection.
>   *
>   * Return: a per task page_frag if context allows that,
>   * otherwise a per socket one.
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
> +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);

This is slowing down TCP sendmsg() fast path, reading current->flags,
possibly cold value.

I would suggest using one bit in sk, close to sk->sk_allocation to
make the decision,
instead of testing sk->sk_allocation for various flags.

Not sure if we have available holes.

> +
> +       if ((gfp_mask & ( | __GFP_MEMALLOC | __GFP_FS)) ==
>             (__GFP_DIRECT_RECLAIM | __GFP_FS))
>                 return &current->task_frag;
>
> --
> 2.21.3
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
  2022-07-07 16:29 ` Eric Dumazet
@ 2022-07-08 17:51   ` Guillaume Nault
  2022-07-08 18:10   ` Benjamin Coddington
  1 sibling, 0 replies; 9+ messages in thread
From: Guillaume Nault @ 2022-07-08 17:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steve French, Josef Bacik, Scott Mayhew, Benjamin Coddington,
	Tejun Heo

On Thu, Jul 07, 2022 at 06:29:03PM +0200, Eric Dumazet wrote:
> On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@redhat.com> wrote:
> >
> > I'm investigating a kernel oops that looks similar to
> > 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory reclaim")
> > and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
> >
> > This time the problem happens on an NFS client, while the previous bzs
> > respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS in
> > their socket's ->sk_allocation field (using GFP_NOIO or GFP_NOFS), NFS
> > leaves sk_allocation to its default value since commit a1231fda7e94
> > ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
> >
> > To recap the original problems, in commit 20eb4f29b602 and dacb5d8875cc,
> > memory reclaim happened while executing tcp_sendmsg_locked(). The code
> > path entered tcp_sendmsg_locked() recursively as pages to be reclaimed
> > were backed by files on the network. The problem was that both the
> > outer and the inner tcp_sendmsg_locked() calls used current->task_frag,
> > thus leaving it in an inconsistent state. The fix was to use the
> > socket's ->sk_frag instead for the file system socket, so that the
> > inner and outer calls wouln't step on each other's toes.
> >
> > But now that NFS doesn't modify ->sk_allocation anymore, sk_page_frag()
> > sees sunrpc sockets as plain TCP ones and returns ->task_frag in the
> > inner tcp_sendmsg_locked() call.
> >
> > Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO and use
> > memalloc_no{fs,io}_save() instead. So maybe other network file systems
> > will also stop setting ->sk_allocation in the future and we should
> > teach sk_page_frag() to look at the current GFP flags. Or should we
> > stick to ->sk_allocation and make NFS drop __GFP_FS again?
> >
> > Signed-off-by: Guillaume Nault <gnault@redhat.com>
> 
> Can you provide a Fixes: tag ?

Fixes: a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs")

> > ---
> >  include/net/sock.h | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 72ca97ccb460..b934c9851058 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -46,6 +46,7 @@
> >  #include <linux/netdevice.h>
> >  #include <linux/skbuff.h>      /* struct sk_buff */
> >  #include <linux/mm.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/security.h>
> >  #include <linux/slab.h>
> >  #include <linux/uaccess.h>
> > @@ -2503,14 +2504,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
> >   * socket operations and end up recursing into sk_page_frag()
> >   * while it's already in use: explicitly avoid task page_frag
> >   * usage if the caller is potentially doing any of them.
> > - * This assumes that page fault handlers use the GFP_NOFS flags.
> > + * This assumes that page fault handlers use the GFP_NOFS flags
> > + * or run under memalloc_nofs_save() protection.
> >   *
> >   * Return: a per task page_frag if context allows that,
> >   * otherwise a per socket one.
> >   */
> >  static inline struct page_frag *sk_page_frag(struct sock *sk)
> >  {
> > -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
> > +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
> 
> This is slowing down TCP sendmsg() fast path, reading current->flags,
> possibly cold value.
> 
> I would suggest using one bit in sk, close to sk->sk_allocation to
> make the decision,
> instead of testing sk->sk_allocation for various flags.

current_gfp_context() looked quite elegant to me as it avoided the need
to duplicate the NOFS/NOIO flag in the socket. But I understand the
performance concern.

> Not sure if we have available holes.

Nothing in the same cache line at least. There's a 1 bit hole in
struct sock_common after skc_net_refcnt. And it should be hot because
of sk->sk_state. We could add a "skc_use_task_frag" bit there, but I'm
not sure if it's worth using this last available bit for this.

Otherwise, the next available hole is right after sk_bind_phc.
According to pahole, it's two cache lines away from sk_allocation on my
x86_64 build, but that will depend of the size of spinlock_t and thus
on CONFIG_ options. It doesn't look very natural to add a no-reclaim
bit there.

Or maybe we could base the test on sk_kern_sock since the problem
happens on kernel sockets. But that looks like a hack to me, and it
might impact MPTCP, which also creates kernel TCP sockets but shouldn't
have the same constraints as NFS.

> > +
> > +       if ((gfp_mask & ( | __GFP_MEMALLOC | __GFP_FS)) ==
> >             (__GFP_DIRECT_RECLAIM | __GFP_FS))
> >                 return &current->task_frag;
> >
> > --
> > 2.21.3
> >
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
  2022-07-07 16:29 ` Eric Dumazet
  2022-07-08 17:51   ` Guillaume Nault
@ 2022-07-08 18:10   ` Benjamin Coddington
  2022-07-08 20:04     ` Trond Myklebust
  1 sibling, 1 reply; 9+ messages in thread
From: Benjamin Coddington @ 2022-07-08 18:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Guillaume Nault, David S. Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Chuck Lever, Jeff Layton, Trond Myklebust,
	Anna Schumaker, Steve French, Josef Bacik, Scott Mayhew,
	Tejun Heo

On 7 Jul 2022, at 12:29, Eric Dumazet wrote:

> On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@redhat.com> 
> wrote:
>>
>> I'm investigating a kernel oops that looks similar to
>> 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory 
>> reclaim")
>> and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
>>
>> This time the problem happens on an NFS client, while the previous 
>> bzs
>> respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS in
>> their socket's ->sk_allocation field (using GFP_NOIO or GFP_NOFS), 
>> NFS
>> leaves sk_allocation to its default value since commit a1231fda7e94
>> ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
>>
>> To recap the original problems, in commit 20eb4f29b602 and 
>> dacb5d8875cc,
>> memory reclaim happened while executing tcp_sendmsg_locked(). The 
>> code
>> path entered tcp_sendmsg_locked() recursively as pages to be 
>> reclaimed
>> were backed by files on the network. The problem was that both the
>> outer and the inner tcp_sendmsg_locked() calls used 
>> current->task_frag,
>> thus leaving it in an inconsistent state. The fix was to use the
>> socket's ->sk_frag instead for the file system socket, so that the
>> inner and outer calls wouln't step on each other's toes.
>>
>> But now that NFS doesn't modify ->sk_allocation anymore, 
>> sk_page_frag()
>> sees sunrpc sockets as plain TCP ones and returns ->task_frag in the
>> inner tcp_sendmsg_locked() call.
>>
>> Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO and 
>> use
>> memalloc_no{fs,io}_save() instead. So maybe other network file 
>> systems
>> will also stop setting ->sk_allocation in the future and we should
>> teach sk_page_frag() to look at the current GFP flags. Or should we
>> stick to ->sk_allocation and make NFS drop __GFP_FS again?
>>
>> Signed-off-by: Guillaume Nault <gnault@redhat.com>
>
> Can you provide a Fixes: tag ?
>
>> ---
>>  include/net/sock.h | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 72ca97ccb460..b934c9851058 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -46,6 +46,7 @@
>>  #include <linux/netdevice.h>
>>  #include <linux/skbuff.h>      /* struct sk_buff */
>>  #include <linux/mm.h>
>> +#include <linux/sched/mm.h>
>>  #include <linux/security.h>
>>  #include <linux/slab.h>
>>  #include <linux/uaccess.h>
>> @@ -2503,14 +2504,17 @@ static inline void 
>> sk_stream_moderate_sndbuf(struct sock *sk)
>>   * socket operations and end up recursing into sk_page_frag()
>>   * while it's already in use: explicitly avoid task page_frag
>>   * usage if the caller is potentially doing any of them.
>> - * This assumes that page fault handlers use the GFP_NOFS flags.
>> + * This assumes that page fault handlers use the GFP_NOFS flags
>> + * or run under memalloc_nofs_save() protection.
>>   *
>>   * Return: a per task page_frag if context allows that,
>>   * otherwise a per socket one.
>>   */
>>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>>  {
>> -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | 
>> __GFP_MEMALLOC | __GFP_FS)) ==
>> +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
>
> This is slowing down TCP sendmsg() fast path, reading current->flags,
> possibly cold value.

True - current->flags is pretty distant from current->task_frag.

> I would suggest using one bit in sk, close to sk->sk_allocation to
> make the decision,
> instead of testing sk->sk_allocation for various flags.
>
> Not sure if we have available holes.

Its looking pretty packed on my build.. the nearest hole is 5 cachelines
away.

It'd be nice to allow network filesystem to use task_frag when possible.

If we expect sk_page_frag() to only return task_frag once per call 
stack,
then can we simply check it's already in use, perhaps by looking at the
size field?

Or maybe can we set sk_allocation early from current_gfp_context() 
outside
the fast path?

Ben


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
  2022-07-08 18:10   ` Benjamin Coddington
@ 2022-07-08 20:04     ` Trond Myklebust
  2022-07-11 14:07       ` Benjamin Coddington
  0 siblings, 1 reply; 9+ messages in thread
From: Trond Myklebust @ 2022-07-08 20:04 UTC (permalink / raw)
  To: edumazet, bcodding
  Cc: smayhew, davem, chuck.lever, sfrench, tj, anna, kuba, jlayton,
	gnault, josef, netdev, pabeni

On Fri, 2022-07-08 at 14:10 -0400, Benjamin Coddington wrote:
> On 7 Jul 2022, at 12:29, Eric Dumazet wrote:
> 
> > On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@redhat.com> 
> > wrote:
> > > 
> > > I'm investigating a kernel oops that looks similar to
> > > 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory 
> > > reclaim")
> > > and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
> > > 
> > > This time the problem happens on an NFS client, while the
> > > previous 
> > > bzs
> > > respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS
> > > in
> > > their socket's ->sk_allocation field (using GFP_NOIO or
> > > GFP_NOFS), 
> > > NFS
> > > leaves sk_allocation to its default value since commit
> > > a1231fda7e94
> > > ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
> > > 
> > > To recap the original problems, in commit 20eb4f29b602 and 
> > > dacb5d8875cc,
> > > memory reclaim happened while executing tcp_sendmsg_locked(). The
> > > code
> > > path entered tcp_sendmsg_locked() recursively as pages to be 
> > > reclaimed
> > > were backed by files on the network. The problem was that both
> > > the
> > > outer and the inner tcp_sendmsg_locked() calls used 
> > > current->task_frag,
> > > thus leaving it in an inconsistent state. The fix was to use the
> > > socket's ->sk_frag instead for the file system socket, so that
> > > the
> > > inner and outer calls wouln't step on each other's toes.
> > > 
> > > But now that NFS doesn't modify ->sk_allocation anymore, 
> > > sk_page_frag()
> > > sees sunrpc sockets as plain TCP ones and returns ->task_frag in
> > > the
> > > inner tcp_sendmsg_locked() call.
> > > 
> > > Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO
> > > and 
> > > use
> > > memalloc_no{fs,io}_save() instead. So maybe other network file 
> > > systems
> > > will also stop setting ->sk_allocation in the future and we
> > > should
> > > teach sk_page_frag() to look at the current GFP flags. Or should
> > > we
> > > stick to ->sk_allocation and make NFS drop __GFP_FS again?
> > > 
> > > Signed-off-by: Guillaume Nault <gnault@redhat.com>
> > 
> > Can you provide a Fixes: tag ?
> > 
> > > ---
> > >  include/net/sock.h | 8 ++++++--
> > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 72ca97ccb460..b934c9851058 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -46,6 +46,7 @@
> > >  #include <linux/netdevice.h>
> > >  #include <linux/skbuff.h>      /* struct sk_buff */
> > >  #include <linux/mm.h>
> > > +#include <linux/sched/mm.h>
> > >  #include <linux/security.h>
> > >  #include <linux/slab.h>
> > >  #include <linux/uaccess.h>
> > > @@ -2503,14 +2504,17 @@ static inline void 
> > > sk_stream_moderate_sndbuf(struct sock *sk)
> > >   * socket operations and end up recursing into sk_page_frag()
> > >   * while it's already in use: explicitly avoid task page_frag
> > >   * usage if the caller is potentially doing any of them.
> > > - * This assumes that page fault handlers use the GFP_NOFS flags.
> > > + * This assumes that page fault handlers use the GFP_NOFS flags
> > > + * or run under memalloc_nofs_save() protection.
> > >   *
> > >   * Return: a per task page_frag if context allows that,
> > >   * otherwise a per socket one.
> > >   */
> > >  static inline struct page_frag *sk_page_frag(struct sock *sk)
> > >  {
> > > -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | 
> > > __GFP_MEMALLOC | __GFP_FS)) ==
> > > +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
> > 
> > This is slowing down TCP sendmsg() fast path, reading current-
> > >flags,
> > possibly cold value.
> 
> True - current->flags is pretty distant from current->task_frag.
> 
> > I would suggest using one bit in sk, close to sk->sk_allocation to
> > make the decision,
> > instead of testing sk->sk_allocation for various flags.
> > 
> > Not sure if we have available holes.
> 
> Its looking pretty packed on my build.. the nearest hole is 5
> cachelines
> away.
> 
> It'd be nice to allow network filesystem to use task_frag when
> possible.
> 
> If we expect sk_page_frag() to only return task_frag once per call 
> stack,
> then can we simply check it's already in use, perhaps by looking at
> the
> size field?
> 
> Or maybe can we set sk_allocation early from current_gfp_context() 
> outside
> the fast path?

Why not just add a bit to sk->sk_allocation itself, and have
__sock_create() default to setting it when the 'kern' parameter is non-
zero? NFS is not alone in following the request of the mm team to
deprecate use of GFP_NOFS and GFP_NOIO.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
  2022-07-08 20:04     ` Trond Myklebust
@ 2022-07-11 14:07       ` Benjamin Coddington
  2022-07-11 15:31         ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Coddington @ 2022-07-11 14:07 UTC (permalink / raw)
  To: Trond Myklebust, Eric Dumazet
  Cc: edumazet, smayhew, davem, chuck.lever, sfrench, tj, anna, kuba,
	jlayton, gnault, josef, netdev, pabeni

On 8 Jul 2022, at 16:04, Trond Myklebust wrote:

> On Fri, 2022-07-08 at 14:10 -0400, Benjamin Coddington wrote:
>> On 7 Jul 2022, at 12:29, Eric Dumazet wrote:
>>
>>> On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@redhat.com>
>>> wrote:
>>>>
>>>> I'm investigating a kernel oops that looks similar to
>>>> 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory
>>>> reclaim")
>>>> and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
>>>>
>>>> This time the problem happens on an NFS client, while the
>>>> previous
>>>> bzs
>>>> respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS
>>>> in
>>>> their socket's ->sk_allocation field (using GFP_NOIO or
>>>> GFP_NOFS),
>>>> NFS
>>>> leaves sk_allocation to its default value since commit
>>>> a1231fda7e94
>>>> ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
>>>>
>>>> To recap the original problems, in commit 20eb4f29b602 and
>>>> dacb5d8875cc,
>>>> memory reclaim happened while executing tcp_sendmsg_locked(). The
>>>> code
>>>> path entered tcp_sendmsg_locked() recursively as pages to be
>>>> reclaimed
>>>> were backed by files on the network. The problem was that both
>>>> the
>>>> outer and the inner tcp_sendmsg_locked() calls used
>>>> current->task_frag,
>>>> thus leaving it in an inconsistent state. The fix was to use the
>>>> socket's ->sk_frag instead for the file system socket, so that
>>>> the
>>>> inner and outer calls wouln't step on each other's toes.
>>>>
>>>> But now that NFS doesn't modify ->sk_allocation anymore,
>>>> sk_page_frag()
>>>> sees sunrpc sockets as plain TCP ones and returns ->task_frag in
>>>> the
>>>> inner tcp_sendmsg_locked() call.
>>>>
>>>> Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO
>>>> and
>>>> use
>>>> memalloc_no{fs,io}_save() instead. So maybe other network file
>>>> systems
>>>> will also stop setting ->sk_allocation in the future and we
>>>> should
>>>> teach sk_page_frag() to look at the current GFP flags. Or should
>>>> we
>>>> stick to ->sk_allocation and make NFS drop __GFP_FS again?
>>>>
>>>> Signed-off-by: Guillaume Nault <gnault@redhat.com>
>>>
>>> Can you provide a Fixes: tag ?
>>>
>>>> ---
>>>>  include/net/sock.h | 8 ++++++--
>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>>> index 72ca97ccb460..b934c9851058 100644
>>>> --- a/include/net/sock.h
>>>> +++ b/include/net/sock.h
>>>> @@ -46,6 +46,7 @@
>>>>  #include <linux/netdevice.h>
>>>>  #include <linux/skbuff.h>      /* struct sk_buff */
>>>>  #include <linux/mm.h>
>>>> +#include <linux/sched/mm.h>
>>>>  #include <linux/security.h>
>>>>  #include <linux/slab.h>
>>>>  #include <linux/uaccess.h>
>>>> @@ -2503,14 +2504,17 @@ static inline void
>>>> sk_stream_moderate_sndbuf(struct sock *sk)
>>>>   * socket operations and end up recursing into sk_page_frag()
>>>>   * while it's already in use: explicitly avoid task page_frag
>>>>   * usage if the caller is potentially doing any of them.
>>>> - * This assumes that page fault handlers use the GFP_NOFS flags.
>>>> + * This assumes that page fault handlers use the GFP_NOFS flags
>>>> + * or run under memalloc_nofs_save() protection.
>>>>   *
>>>>   * Return: a per task page_frag if context allows that,
>>>>   * otherwise a per socket one.
>>>>   */
>>>>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>>>>  {
>>>> -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM |
>>>> __GFP_MEMALLOC | __GFP_FS)) ==
>>>> +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
>>>
>>> This is slowing down TCP sendmsg() fast path, reading current-
>>>> flags,
>>> possibly cold value.
>>
>> True - current->flags is pretty distant from current->task_frag.
>>
>>> I would suggest using one bit in sk, close to sk->sk_allocation to
>>> make the decision,
>>> instead of testing sk->sk_allocation for various flags.
>>>
>>> Not sure if we have available holes.
>>
>> Its looking pretty packed on my build.. the nearest hole is 5
>> cachelines
>> away.
>>
>> It'd be nice to allow network filesystem to use task_frag when
>> possible.
>>
>> If we expect sk_page_frag() to only return task_frag once per call
>> stack,
>> then can we simply check it's already in use, perhaps by looking at
>> the
>> size field?
>>
>> Or maybe can we set sk_allocation early from current_gfp_context()
>> outside
>> the fast path?
>
> Why not just add a bit to sk->sk_allocation itself, and have
> __sock_create() default to setting it when the 'kern' parameter is non-
> zero? NFS is not alone in following the request of the mm team to
> deprecate use of GFP_NOFS and GFP_NOIO.

Can we overload sk_allocation safely?  There's 28 GFP flags already, I'm
worried about unintended consequences if sk_allocation gets passed on.

What about a flag in sk_gso_type?  Looks like there's 13 free there, and its
in the same cacheline as sk_allocation and sk_frag.

Ben


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
  2022-07-11 14:07       ` Benjamin Coddington
@ 2022-07-11 15:31         ` Eric Dumazet
  2022-09-20 18:50           ` Guillaume Nault
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2022-07-11 15:31 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: Trond Myklebust, Scott Mayhew, David Miller, Chuck Lever,
	Steve French, Tejun Heo, Anna Schumaker, Jakub Kicinski,
	Jeff Layton, Guillaume Nault, Josef Bacik, netdev, Paolo Abeni

On Mon, Jul 11, 2022 at 4:07 PM Benjamin Coddington <bcodding@redhat.com> wrote:
>
> On 8 Jul 2022, at 16:04, Trond Myklebust wrote:
>
> > On Fri, 2022-07-08 at 14:10 -0400, Benjamin Coddington wrote:
> >> On 7 Jul 2022, at 12:29, Eric Dumazet wrote:
> >>
> >>> On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@redhat.com>
> >>> wrote:
> >>>>
> >>>> I'm investigating a kernel oops that looks similar to
> >>>> 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory
> >>>> reclaim")
> >>>> and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
> >>>>
> >>>> This time the problem happens on an NFS client, while the
> >>>> previous
> >>>> bzs
> >>>> respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS
> >>>> in
> >>>> their socket's ->sk_allocation field (using GFP_NOIO or
> >>>> GFP_NOFS),
> >>>> NFS
> >>>> leaves sk_allocation to its default value since commit
> >>>> a1231fda7e94
> >>>> ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
> >>>>
> >>>> To recap the original problems, in commit 20eb4f29b602 and
> >>>> dacb5d8875cc,
> >>>> memory reclaim happened while executing tcp_sendmsg_locked(). The
> >>>> code
> >>>> path entered tcp_sendmsg_locked() recursively as pages to be
> >>>> reclaimed
> >>>> were backed by files on the network. The problem was that both
> >>>> the
> >>>> outer and the inner tcp_sendmsg_locked() calls used
> >>>> current->task_frag,
> >>>> thus leaving it in an inconsistent state. The fix was to use the
> >>>> socket's ->sk_frag instead for the file system socket, so that
> >>>> the
> >>>> inner and outer calls wouln't step on each other's toes.
> >>>>
> >>>> But now that NFS doesn't modify ->sk_allocation anymore,
> >>>> sk_page_frag()
> >>>> sees sunrpc sockets as plain TCP ones and returns ->task_frag in
> >>>> the
> >>>> inner tcp_sendmsg_locked() call.
> >>>>
> >>>> Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO
> >>>> and
> >>>> use
> >>>> memalloc_no{fs,io}_save() instead. So maybe other network file
> >>>> systems
> >>>> will also stop setting ->sk_allocation in the future and we
> >>>> should
> >>>> teach sk_page_frag() to look at the current GFP flags. Or should
> >>>> we
> >>>> stick to ->sk_allocation and make NFS drop __GFP_FS again?
> >>>>
> >>>> Signed-off-by: Guillaume Nault <gnault@redhat.com>
> >>>
> >>> Can you provide a Fixes: tag ?
> >>>
> >>>> ---
> >>>>  include/net/sock.h | 8 ++++++--
> >>>>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/include/net/sock.h b/include/net/sock.h
> >>>> index 72ca97ccb460..b934c9851058 100644
> >>>> --- a/include/net/sock.h
> >>>> +++ b/include/net/sock.h
> >>>> @@ -46,6 +46,7 @@
> >>>>  #include <linux/netdevice.h>
> >>>>  #include <linux/skbuff.h>      /* struct sk_buff */
> >>>>  #include <linux/mm.h>
> >>>> +#include <linux/sched/mm.h>
> >>>>  #include <linux/security.h>
> >>>>  #include <linux/slab.h>
> >>>>  #include <linux/uaccess.h>
> >>>> @@ -2503,14 +2504,17 @@ static inline void
> >>>> sk_stream_moderate_sndbuf(struct sock *sk)
> >>>>   * socket operations and end up recursing into sk_page_frag()
> >>>>   * while it's already in use: explicitly avoid task page_frag
> >>>>   * usage if the caller is potentially doing any of them.
> >>>> - * This assumes that page fault handlers use the GFP_NOFS flags.
> >>>> + * This assumes that page fault handlers use the GFP_NOFS flags
> >>>> + * or run under memalloc_nofs_save() protection.
> >>>>   *
> >>>>   * Return: a per task page_frag if context allows that,
> >>>>   * otherwise a per socket one.
> >>>>   */
> >>>>  static inline struct page_frag *sk_page_frag(struct sock *sk)
> >>>>  {
> >>>> -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM |
> >>>> __GFP_MEMALLOC | __GFP_FS)) ==
> >>>> +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
> >>>
> >>> This is slowing down TCP sendmsg() fast path, reading current-
> >>>> flags,
> >>> possibly cold value.
> >>
> >> True - current->flags is pretty distant from current->task_frag.
> >>
> >>> I would suggest using one bit in sk, close to sk->sk_allocation to
> >>> make the decision,
> >>> instead of testing sk->sk_allocation for various flags.
> >>>
> >>> Not sure if we have available holes.
> >>
> >> Its looking pretty packed on my build.. the nearest hole is 5
> >> cachelines
> >> away.
> >>
> >> It'd be nice to allow network filesystem to use task_frag when
> >> possible.
> >>
> >> If we expect sk_page_frag() to only return task_frag once per call
> >> stack,
> >> then can we simply check it's already in use, perhaps by looking at
> >> the
> >> size field?
> >>
> >> Or maybe can we set sk_allocation early from current_gfp_context()
> >> outside
> >> the fast path?
> >
> > Why not just add a bit to sk->sk_allocation itself, and have
> > __sock_create() default to setting it when the 'kern' parameter is non-
> > zero? NFS is not alone in following the request of the mm team to
> > deprecate use of GFP_NOFS and GFP_NOIO.
>
> Can we overload sk_allocation safely?  There's 28 GFP flags already, I'm
> worried about unintended consequences if sk_allocation gets passed on.
>
> What about a flag in sk_gso_type?  Looks like there's 13 free there, and its
> in the same cacheline as sk_allocation and sk_frag.

I think we could overload GFP_COMP with little risk.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
  2022-07-11 15:31         ` Eric Dumazet
@ 2022-09-20 18:50           ` Guillaume Nault
  0 siblings, 0 replies; 9+ messages in thread
From: Guillaume Nault @ 2022-09-20 18:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benjamin Coddington, Trond Myklebust, Scott Mayhew, David Miller,
	Chuck Lever, Steve French, Tejun Heo, Anna Schumaker,
	Jakub Kicinski, Jeff Layton, Josef Bacik, netdev, Paolo Abeni

On Mon, Jul 11, 2022 at 05:31:24PM +0200, Eric Dumazet wrote:
> On Mon, Jul 11, 2022 at 4:07 PM Benjamin Coddington <bcodding@redhat.com> wrote:
> >
> > On 8 Jul 2022, at 16:04, Trond Myklebust wrote:
> >
> > > On Fri, 2022-07-08 at 14:10 -0400, Benjamin Coddington wrote:
> > >> On 7 Jul 2022, at 12:29, Eric Dumazet wrote:
> > >>
> > >>> On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@redhat.com>
> > >>> wrote:
> > >>>>
> > >>>> diff --git a/include/net/sock.h b/include/net/sock.h
> > >>>> index 72ca97ccb460..b934c9851058 100644
> > >>>> --- a/include/net/sock.h
> > >>>> +++ b/include/net/sock.h
> > >>>> @@ -46,6 +46,7 @@
> > >>>>  #include <linux/netdevice.h>
> > >>>>  #include <linux/skbuff.h>      /* struct sk_buff */
> > >>>>  #include <linux/mm.h>
> > >>>> +#include <linux/sched/mm.h>
> > >>>>  #include <linux/security.h>
> > >>>>  #include <linux/slab.h>
> > >>>>  #include <linux/uaccess.h>
> > >>>> @@ -2503,14 +2504,17 @@ static inline void
> > >>>> sk_stream_moderate_sndbuf(struct sock *sk)
> > >>>>   * socket operations and end up recursing into sk_page_frag()
> > >>>>   * while it's already in use: explicitly avoid task page_frag
> > >>>>   * usage if the caller is potentially doing any of them.
> > >>>> - * This assumes that page fault handlers use the GFP_NOFS flags.
> > >>>> + * This assumes that page fault handlers use the GFP_NOFS flags
> > >>>> + * or run under memalloc_nofs_save() protection.
> > >>>>   *
> > >>>>   * Return: a per task page_frag if context allows that,
> > >>>>   * otherwise a per socket one.
> > >>>>   */
> > >>>>  static inline struct page_frag *sk_page_frag(struct sock *sk)
> > >>>>  {
> > >>>> -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM |
> > >>>> __GFP_MEMALLOC | __GFP_FS)) ==
> > >>>> +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
> > >>>
> > >>> This is slowing down TCP sendmsg() fast path, reading current-
> > >>>> flags,
> > >>> possibly cold value.
> > >>
> > >> True - current->flags is pretty distant from current->task_frag.
> > >>
> > >>> I would suggest using one bit in sk, close to sk->sk_allocation to
> > >>> make the decision,
> > >>> instead of testing sk->sk_allocation for various flags.
> > >>>
> > >>> Not sure if we have available holes.
> > >>
> > >> Its looking pretty packed on my build.. the nearest hole is 5
> > >> cachelines
> > >> away.
> > >>
> > >> It'd be nice to allow network filesystem to use task_frag when
> > >> possible.
> > >>
> > >> If we expect sk_page_frag() to only return task_frag once per call
> > >> stack,
> > >> then can we simply check it's already in use, perhaps by looking at
> > >> the
> > >> size field?
> > >>
> > >> Or maybe can we set sk_allocation early from current_gfp_context()
> > >> outside
> > >> the fast path?
> > >
> > > Why not just add a bit to sk->sk_allocation itself, and have
> > > __sock_create() default to setting it when the 'kern' parameter is non-
> > > zero? NFS is not alone in following the request of the mm team to
> > > deprecate use of GFP_NOFS and GFP_NOIO.
> >
> > Can we overload sk_allocation safely?  There's 28 GFP flags already, I'm
> > worried about unintended consequences if sk_allocation gets passed on.
> >
> > What about a flag in sk_gso_type?  Looks like there's 13 free there, and its
> > in the same cacheline as sk_allocation and sk_frag.
> 
> I think we could overload GFP_COMP with little risk.

Reviving this semi-old thread after discussions at LPC.

It seems we won't have a clear path for how to make sk_page_frag()
aware of memalloc_nofs or memalloc_noio contexts before some time.
Since we continue to get bug reports for this issue, I'm thinking of
just setting sk_allocation to GFP_NOFS for sunrpc sockets. Then we'll
can think of a better way of removing GFP_NOFS or GFP_NOIO from the
different networking file systems and block devices.

Here's a summary of all the long term options discussed so far.

1) Use a special bit in sk_allocation.

Either repurpose an existing bit like GFP_COMP or allocate a free one.
Then sk_page_frag() could test this bit and avoid returning
current->task_frag when it's set. That bit would have to be masked
every time sk_allocation is used to allocate memory.
Overall, it looks a bit like using GFP_NOFS with a different name,
apart that it allows the socket to allocate memory with GFP_KERNEL when
not in memalloc_nofs critical sections (but I'm not sure if it's a
practical gain for NFS).

Alternatively, there's a one bit hole in struct sock_common, right
after skc_state, which could be used to store a 'skc_no_task_frag'
flag (the cache line should be hot because of skc_state). Any socket
user that could be called during memory reclaim could set this bit to
prevent sk_page_frag() from using current->taskfrag.

2) Avoid using current->task_frag for kernel sockets.

Since sk_kern_sock is in the same cache line as sk_allocation, we
probably could let sk_page_frag() test if sk is a kernel socket and
avoid using current->task_frag in this case. Alternatively, we could
define a new flag as in option 1 and automatically set it when creating
kernel sockets (as proposed by Trond).

However, there are many kernel socket users and, so far, only NFS (and
maybe other networking FS in the future) need this protection. So it
looks like a pretty big hammer. Also, NBD uses sockets passed from user
space. Therefore if it were to phase out GFP_NOFS, sk_page_frag() could
return current->task_frag again (maybe NBD should transfrom its sockets
to kernel sockets, but that's a bit of a tangential discussion).

3) Adjust sk_allocation when necessary.

The idea would be to update sk_allocation before entering TCP fast
path. Something like:

  old_sk_allocation = sk->sk_allocation;
  sk->sk_allocation = current_gfp_context(sk->sk_allocation);
  ... use sk ...
  sk->sk_allocation = old_sk_allocation;

That doesn't seem feasible though, as this assumes exclusive access to
the socket, but we grab the socket lock only after entering
tcp_sendmsg().

A similar idea was to do this automatically in kernel_sendmsg(), but
it faces the same exclusive access problem. Furthermore, not all
GFP_NOFS users use kernel_sendmsg() (same problem as with option 2).

4) Detect if current->task_frag is already in use.

There may be a way for sk_page_frag() to detect if current->task_frag
is already in use, that is, if it's in a recursive call. That'd be nice
as that'd avoid the need for any heuristic based on sk_allocation.
However, I can't think of any way to do that efficiently and without
adding bits in current.

Thanks to all those involved in this thread, and to Paolo and Eric for
the fruitful discussions at LPC.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-09-20 18:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-01 18:41 [RFC net] Should sk_page_frag() also look at the current GFP context? Guillaume Nault
2022-07-07 15:31 ` Benjamin Coddington
2022-07-07 16:29 ` Eric Dumazet
2022-07-08 17:51   ` Guillaume Nault
2022-07-08 18:10   ` Benjamin Coddington
2022-07-08 20:04     ` Trond Myklebust
2022-07-11 14:07       ` Benjamin Coddington
2022-07-11 15:31         ` Eric Dumazet
2022-09-20 18:50           ` Guillaume Nault

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).