linux-kernel.vger.kernel.org archive mirror
* [PATCH] net: fix sk_page_frag() recursion from memory reclaim
@ 2019-10-19 17:01 Tejun Heo
  2019-10-19 18:15 ` Eric Dumazet
  2019-10-24 20:50 ` [PATCH v2] " Tejun Heo
  0 siblings, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2019-10-19 17:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, kernel-team, linux-kernel, Josef Bacik

From f0335a5d14d3596d36e3ffddb2fd4fa0dc6ca9c2 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Sat, 19 Oct 2019 09:10:57 -0700

sk_page_frag() optimizes skb_frag allocations by using per-task
skb_frag cache when it knows it's the only user.  The condition is
determined by seeing whether the socket allocation mask allows
blocking - if the allocation may block, it obviously owns the task's
context and ergo exclusively owns current->task_frag.

Unfortunately, this misses recursion through the memory reclaim path.
Please take a look at the following backtrace.

 [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
     ...
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     sock_xmit.isra.24+0xa1/0x170 [nbd]
     nbd_send_cmd+0x1d2/0x690 [nbd]
     nbd_queue_rq+0x1b5/0x3b0 [nbd]
     __blk_mq_try_issue_directly+0x108/0x1b0
     blk_mq_request_issue_directly+0xbd/0xe0
     blk_mq_try_issue_list_directly+0x41/0xb0
     blk_mq_sched_insert_requests+0xa2/0xe0
     blk_mq_flush_plug_list+0x205/0x2a0
     blk_flush_plug_list+0xc3/0xf0
 [1] blk_finish_plug+0x21/0x2e
     _xfs_buf_ioapply+0x313/0x460
     __xfs_buf_submit+0x67/0x220
     xfs_buf_read_map+0x113/0x1a0
     xfs_trans_read_buf_map+0xbf/0x330
     xfs_btree_read_buf_block.constprop.42+0x95/0xd0
     xfs_btree_lookup_get_block+0x95/0x170
     xfs_btree_lookup+0xcc/0x470
     xfs_bmap_del_extent_real+0x254/0x9a0
     __xfs_bunmapi+0x45c/0xab0
     xfs_bunmapi+0x15/0x30
     xfs_itruncate_extents_flags+0xca/0x250
     xfs_free_eofblocks+0x181/0x1e0
     xfs_fs_destroy_inode+0xa8/0x1b0
     destroy_inode+0x38/0x70
     dispose_list+0x35/0x50
     prune_icache_sb+0x52/0x70
     super_cache_scan+0x120/0x1a0
     do_shrink_slab+0x120/0x290
     shrink_slab+0x216/0x2b0
     shrink_node+0x1b6/0x4a0
     do_try_to_free_pages+0xc6/0x370
     try_to_free_mem_cgroup_pages+0xe3/0x1e0
     try_charge+0x29e/0x790
     mem_cgroup_charge_skmem+0x6a/0x100
     __sk_mem_raise_allocated+0x18e/0x390
     __sk_mem_schedule+0x2a/0x40
 [0] tcp_sendmsg_locked+0x8eb/0xe10
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     ___sys_sendmsg+0x26d/0x2b0
     __sys_sendmsg+0x57/0xa0
     do_syscall_64+0x42/0x100
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

In [0], tcp_sendmsg_locked() was using current->task_frag when it
called sk_wmem_schedule().  It had already calculated how many bytes
fit into current->task_frag.  Due to memory pressure,
sk_wmem_schedule() called into the memory reclaim path, which called
into xfs and then the IO issue path.  Because the filesystem in
question is backed by nbd, control goes back into the tcp layer -
back into tcp_sendmsg_locked().

nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC), which makes
sense - it's in the process of freeing memory and wants to be able to,
e.g., drop clean pages to make forward progress.  However, this
confuses sk_page_frag() called from [2].  Because it only tests
whether the allocation allows blocking, which it does, it now thinks
current->task_frag can be used again although it is already in use
in [0].

After [2] uses current->task_frag, the offset is advanced by the
amount consumed.  When control returns to [0], current->task_frag's
offset has moved and the previously calculated number of bytes may now
overrun the end of the allocated memory, leading to silent memory
corruption.
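
To make the window concrete, the relevant part of tcp_sendmsg_locked()
looks roughly like this (simplified sketch, not the exact code):

	struct page_frag *pfrag = sk_page_frag(sk);	/* [0]: &current->task_frag */
	int copy = min_t(int, msg_data_left(msg),
			 pfrag->size - pfrag->offset);	/* sized against the cached offset */

	/* may enter direct reclaim; via the nbd-backed filesystem that sends
	 * on a socket with sk_allocation = GFP_NOIO | __GFP_MEMALLOC, and the
	 * nested tcp_sendmsg_locked() [2] also picks current->task_frag and
	 * advances pfrag->offset behind our back */
	if (!sk_wmem_schedule(sk, copy))
		goto wait_for_memory;

	/* back in [0]: 'copy' was sized for the old offset, so this copy can
	 * now run past the end of the fragment */
	skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
				 pfrag->page, pfrag->offset, copy);
	pfrag->offset += copy;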

Fix it by updating sk_page_frag() to test __GFP_MEMALLOC and not use
current->task_frag if set.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: stable@vger.kernel.org
---
 include/net/sock.h | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 2c53f1a1d905..4e2ca38acc3c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2233,12 +2233,21 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
  * sk_page_frag - return an appropriate page_frag
  * @sk: socket
  *
- * If socket allocation mode allows current thread to sleep, it means its
- * safe to use the per task page_frag instead of the per socket one.
+ * Use the per task page_frag instead of the per socket one for
+ * optimization when we know there can be no other users.
+ *
+ * 1. The socket allocation mode allows current thread to sleep.  This is
+ *    the sleepable context which owns the task page_frag.
+ *
+ * 2. The socket allocation mode doesn't indicate that the socket is being
+ *    used to reclaim memory.  Memory reclaim may nest inside other socket
+ *    operations and end up recursing into sk_page_frag() while it's
+ *    already in use.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (gfpflags_allow_blocking(sk->sk_allocation))
+	if (gfpflags_allow_blocking(sk->sk_allocation) &&
+	    !(sk->sk_allocation & __GFP_MEMALLOC))
 		return &current->task_frag;
 
 	return &sk->sk_frag;
-- 
2.17.1



* Re: [PATCH] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-19 17:01 [PATCH] net: fix sk_page_frag() recursion from memory reclaim Tejun Heo
@ 2019-10-19 18:15 ` Eric Dumazet
  2019-10-19 21:18   ` Tejun Heo
  2019-10-24 20:50 ` [PATCH v2] " Tejun Heo
  1 sibling, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-10-19 18:15 UTC (permalink / raw)
  To: Tejun Heo, David S. Miller; +Cc: netdev, kernel-team, linux-kernel, Josef Bacik



On 10/19/19 10:01 AM, Tejun Heo wrote:
> [ changelog, backtrace and comment hunk trimmed - identical to the patch above ]
>
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -	if (gfpflags_allow_blocking(sk->sk_allocation))
> +	if (gfpflags_allow_blocking(sk->sk_allocation) &&
> +	    !(sk->sk_allocation & __GFP_MEMALLOC))
>  		return &current->task_frag;
>  
>  	return &sk->sk_frag;
> 

It seems compiler generates better code with :

diff --git a/include/net/sock.h b/include/net/sock.h
index ab905c4b1f0efd42ebdcae333b3f0a2c7c1b2248..56de6ac99f0952bd0bc003353c094ce3a5a852f4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2238,7 +2238,8 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-       if (gfpflags_allow_blocking(sk->sk_allocation))
+       if (likely((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
+                   __GFP_DIRECT_RECLAIM))
                return &current->task_frag;
 
        return &sk->sk_frag;


WDYT ?

Thanks !
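
For reference, both forms test the same two bits of sk->sk_allocation, so
they are equivalent; the masked version is a single AND plus compare on
the loaded value, while the two separate tests may compile to two branches
(the helper names below are made up for illustration):

static inline bool frag_ok_two_tests(gfp_t flags)
{
	/* original fix: blockable, and not a __GFP_MEMALLOC (reclaim) socket */
	return gfpflags_allow_blocking(flags) &&	/* flags & __GFP_DIRECT_RECLAIM */
	       !(flags & __GFP_MEMALLOC);
}

static inline bool frag_ok_masked(gfp_t flags)
{
	/* suggested form: mask both bits, require exactly __GFP_DIRECT_RECLAIM */
	return (flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
		__GFP_DIRECT_RECLAIM;
}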


* Re: [PATCH] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-19 18:15 ` Eric Dumazet
@ 2019-10-19 21:18   ` Tejun Heo
  2019-10-19 21:25     ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2019-10-19 21:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, kernel-team, linux-kernel, Josef Bacik

Hello,

On Sat, Oct 19, 2019 at 11:15:28AM -0700, Eric Dumazet wrote:
> It seems compiler generates better code with :
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ab905c4b1f0efd42ebdcae333b3f0a2c7c1b2248..56de6ac99f0952bd0bc003353c094ce3a5a852f4 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2238,7 +2238,8 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -       if (gfpflags_allow_blocking(sk->sk_allocation))
> +       if (likely((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
> +                   __GFP_DIRECT_RECLAIM))
>                 return &current->task_frag;
>  
>         return &sk->sk_frag;
> 
> 
> WDYT ?

Whatever works is fine by me.  gfpflags_allow_blocking() is clearer
than testing __GFP_DIRECT_RECLAIM directly tho.  Maybe a better way is
introducing a new gfpflags_ helper?

Thanks.

-- 
tejun


* Re: [PATCH] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-19 21:18   ` Tejun Heo
@ 2019-10-19 21:25     ` Eric Dumazet
  2019-10-22 17:19       ` Jakub Kicinski
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-10-19 21:25 UTC (permalink / raw)
  To: Tejun Heo, Eric Dumazet
  Cc: David S. Miller, netdev, kernel-team, linux-kernel, Josef Bacik



On 10/19/19 2:18 PM, Tejun Heo wrote:

> Whatever works is fine by me.  gfpflags_allow_blocking() is clearer
> than testing __GFP_DIRECT_RECLAIM directly tho.  Maybe a better way is
> introducing a new gfpflags_ helper?

Sounds good to me !



* Re: [PATCH] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-19 21:25     ` Eric Dumazet
@ 2019-10-22 17:19       ` Jakub Kicinski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Kicinski @ 2019-10-22 17:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tejun Heo, David S. Miller, netdev, kernel-team, linux-kernel,
	Josef Bacik

On Sat, 19 Oct 2019 14:25:57 -0700, Eric Dumazet wrote:
> On 10/19/19 2:18 PM, Tejun Heo wrote:
> 
> > Whatever works is fine by me.  gfpflags_allow_blocking() is clearer
> > than testing __GFP_DIRECT_RECLAIM directly tho.  Maybe a better way is
> > introducing a new gfpflags_ helper?  
> 
> Sounds good to me !

IIUC there will be a v2 with a new helper, dropping this from patchwork.


* [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-19 17:01 [PATCH] net: fix sk_page_frag() recursion from memory reclaim Tejun Heo
  2019-10-19 18:15 ` Eric Dumazet
@ 2019-10-24 20:50 ` Tejun Heo
  2019-10-28 23:18   ` David Miller
  2019-10-31 17:35   ` Shakeel Butt
  1 sibling, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2019-10-24 20:50 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, kernel-team, linux-kernel, Josef Bacik, Eric Dumazet,
	Jakub Kicinski, Johannes Weiner, linux-mm, Mel Gorman,
	Andrew Morton

sk_page_frag() optimizes skb_frag allocations by using per-task
skb_frag cache when it knows it's the only user.  The condition is
determined by seeing whether the socket allocation mask allows
blocking - if the allocation may block, it obviously owns the task's
context and ergo exclusively owns current->task_frag.

Unfortunately, this misses recursion through memory reclaim path.
Please take a look at the following backtrace.

 [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
     ...
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     sock_xmit.isra.24+0xa1/0x170 [nbd]
     nbd_send_cmd+0x1d2/0x690 [nbd]
     nbd_queue_rq+0x1b5/0x3b0 [nbd]
     __blk_mq_try_issue_directly+0x108/0x1b0
     blk_mq_request_issue_directly+0xbd/0xe0
     blk_mq_try_issue_list_directly+0x41/0xb0
     blk_mq_sched_insert_requests+0xa2/0xe0
     blk_mq_flush_plug_list+0x205/0x2a0
     blk_flush_plug_list+0xc3/0xf0
 [1] blk_finish_plug+0x21/0x2e
     _xfs_buf_ioapply+0x313/0x460
     __xfs_buf_submit+0x67/0x220
     xfs_buf_read_map+0x113/0x1a0
     xfs_trans_read_buf_map+0xbf/0x330
     xfs_btree_read_buf_block.constprop.42+0x95/0xd0
     xfs_btree_lookup_get_block+0x95/0x170
     xfs_btree_lookup+0xcc/0x470
     xfs_bmap_del_extent_real+0x254/0x9a0
     __xfs_bunmapi+0x45c/0xab0
     xfs_bunmapi+0x15/0x30
     xfs_itruncate_extents_flags+0xca/0x250
     xfs_free_eofblocks+0x181/0x1e0
     xfs_fs_destroy_inode+0xa8/0x1b0
     destroy_inode+0x38/0x70
     dispose_list+0x35/0x50
     prune_icache_sb+0x52/0x70
     super_cache_scan+0x120/0x1a0
     do_shrink_slab+0x120/0x290
     shrink_slab+0x216/0x2b0
     shrink_node+0x1b6/0x4a0
     do_try_to_free_pages+0xc6/0x370
     try_to_free_mem_cgroup_pages+0xe3/0x1e0
     try_charge+0x29e/0x790
     mem_cgroup_charge_skmem+0x6a/0x100
     __sk_mem_raise_allocated+0x18e/0x390
     __sk_mem_schedule+0x2a/0x40
 [0] tcp_sendmsg_locked+0x8eb/0xe10
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     ___sys_sendmsg+0x26d/0x2b0
     __sys_sendmsg+0x57/0xa0
     do_syscall_64+0x42/0x100
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

In [0], tcp_send_msg_locked() was using current->page_frag when it
called sk_wmem_schedule().  It already calculated how many bytes can
be fit into current->page_frag.  Due to memory pressure,
sk_wmem_schedule() called into memory reclaim path which called into
xfs and then IO issue path.  Because the filesystem in question is
backed by nbd, the control goes back into the tcp layer - back into
tcp_sendmsg_locked().

nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
sense - it's in the process of freeing memory and wants to be able to,
e.g., drop clean pages to make forward progress.  However, this
confused sk_page_frag() called from [2].  Because it only tests
whether the allocation allows blocking which it does, it now thinks
current->page_frag can be used again although it already was being
used in [0].

After [2] used current->page_frag, the offset would be increased by
the used amount.  When the control returns to [0],
current->page_frag's offset is increased and the previously calculated
number of bytes now may overrun the end of allocated memory leading to
silent memory corruptions.

Fix it by adding gfpflags_normal_context() which tests sleepable &&
!reclaim and use it to determine whether to use current->task_frag.

v2: Eric didn't like gfp flags being tested twice.  Introduce a new
    helper gfpflags_normal_context() and combine the two tests.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
---
 include/linux/gfp.h |   23 +++++++++++++++++++++++
 include/net/sock.h  |   11 ++++++++---
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..61f2f6ff9467 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -325,6 +325,29 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+/**
+ * gfpflags_normal_context - is gfp_flags a normal sleepable context?
+ * @gfp_flags: gfp_flags to test
+ *
+ * Test whether @gfp_flags indicates that the allocation is from the
+ * %current context and allowed to sleep.
+ *
+ * An allocation being allowed to block doesn't mean it owns the %current
+ * context.  When direct reclaim path tries to allocate memory, the
+ * allocation context is nested inside whatever %current was doing at the
+ * time of the original allocation.  The nested allocation may be allowed
+ * to block but modifying anything %current owns can corrupt the outer
+ * context's expectations.
+ *
+ * %true result from this function indicates that the allocation context
+ * can sleep and use anything that's associated with %current.
+ */
+static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
+{
+	return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
+		__GFP_DIRECT_RECLAIM;
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
diff --git a/include/net/sock.h b/include/net/sock.h
index f69b58bff7e5..c31a9ed86d5a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2242,12 +2242,17 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
  * sk_page_frag - return an appropriate page_frag
  * @sk: socket
  *
- * If socket allocation mode allows current thread to sleep, it means its
- * safe to use the per task page_frag instead of the per socket one.
+ * Use the per task page_frag instead of the per socket one for
+ * optimization when we know that we're in the normal context and owns
+ * everything that's associated with %current.
+ *
+ * gfpflags_allow_blocking() isn't enough here as direct reclaim may nest
+ * inside other socket operations and end up recursing into sk_page_frag()
+ * while it's already in use.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (gfpflags_allow_blocking(sk->sk_allocation))
+	if (gfpflags_normal_context(sk->sk_allocation))
 		return &current->task_frag;
 
 	return &sk->sk_frag;
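
For the allocation modes that appear in this thread, the new test selects
(illustrative notes, not part of the patch):

	gfpflags_normal_context(GFP_KERNEL);	/* true  -> current->task_frag */
	gfpflags_normal_context(GFP_ATOMIC);	/* false -> sk->sk_frag (can't block) */
	gfpflags_normal_context(GFP_NOIO | __GFP_MEMALLOC);
						/* false -> sk->sk_frag (reclaim socket, e.g. nbd) */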


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-24 20:50 ` [PATCH v2] " Tejun Heo
@ 2019-10-28 23:18   ` David Miller
  2019-10-31 17:35   ` Shakeel Butt
  1 sibling, 0 replies; 17+ messages in thread
From: David Miller @ 2019-10-28 23:18 UTC (permalink / raw)
  To: tj
  Cc: netdev, kernel-team, linux-kernel, josef, eric.dumazet,
	jakub.kicinski, hannes, linux-mm, mgorman, akpm

From: Tejun Heo <tj@kernel.org>
Date: Thu, 24 Oct 2019 13:50:27 -0700

> sk_page_frag() optimizes skb_frag allocations by using per-task
> skb_frag cache when it knows it's the only user.  The condition is
> determined by seeing whether the socket allocation mask allows
> blocking - if the allocation may block, it obviously owns the task's
> context and ergo exclusively owns current->task_frag.
> 
> Unfortunately, this misses recursion through memory reclaim path.
> Please take a look at the following backtrace.
 ...
> Signed-off-by: Tejun Heo <tj@kernel.org>

Applied and queued up for -stable, thanks Tejun.


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-24 20:50 ` [PATCH v2] " Tejun Heo
  2019-10-28 23:18   ` David Miller
@ 2019-10-31 17:35   ` Shakeel Butt
  2019-10-31 17:47     ` Eric Dumazet
  2019-10-31 23:20     ` Andrew Morton
  1 sibling, 2 replies; 17+ messages in thread
From: Shakeel Butt @ 2019-10-31 17:35 UTC (permalink / raw)
  To: Tejun Heo, Michal Hocko
  Cc: David S. Miller, netdev, Kernel Team, LKML, Josef Bacik,
	Eric Dumazet, Jakub Kicinski, Johannes Weiner, Linux MM,
	Mel Gorman, Andrew Morton

+Michal Hocko

On Thu, Oct 24, 2019 at 1:50 PM Tejun Heo <tj@kernel.org> wrote:
>
> [ changelog, backtrace and diffstat trimmed ]
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index fb07b503dc45..61f2f6ff9467 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -325,6 +325,29 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
>         return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
>  }
>
> +/**
> + * gfpflags_normal_context - is gfp_flags a normal sleepable context?
> + * @gfp_flags: gfp_flags to test
> + *
> + * Test whether @gfp_flags indicates that the allocation is from the
> + * %current context and allowed to sleep.
> + *
> + * An allocation being allowed to block doesn't mean it owns the %current
> + * context.  When direct reclaim path tries to allocate memory, the
> + * allocation context is nested inside whatever %current was doing at the
> + * time of the original allocation.  The nested allocation may be allowed
> + * to block but modifying anything %current owns can corrupt the outer
> + * context's expectations.
> + *
> + * %true result from this function indicates that the allocation context
> + * can sleep and use anything that's associated with %current.
> + */
> +static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
> +{
> +       return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
> +               __GFP_DIRECT_RECLAIM;

I think we should be checking PF_MEMALLOC here instead. Something like:

return gfpflags_allow_blocking(gfp_flags) && !(current->flags & PF_MEMALLOC);

In my limited understanding, __GFP_MEMALLOC gives access to the
reserves, but we have overloaded PF_MEMALLOC to also mark the reclaim
context.  There are PF_MEMALLOC users that do not use __GFP_MEMALLOC,
like iscsi_sw_tcp_pdu_xmit(), which can call sock_sendmsg().


> +}
> +
> [ rest of the patch (sock.h hunk) trimmed ]
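
A minimal sketch of sk_page_frag() with the PF_MEMALLOC check suggested
above (illustration only, not a tested patch):

static inline struct page_frag *sk_page_frag(struct sock *sk)
{
	/* Only trust current->task_frag when the socket may block AND the
	 * task is not itself running reclaim.  PF_MEMALLOC marks the task
	 * as being inside reclaim even when __GFP_MEMALLOC isn't passed in
	 * the allocation mask. */
	if (gfpflags_allow_blocking(sk->sk_allocation) &&
	    !(current->flags & PF_MEMALLOC))
		return &current->task_frag;

	return &sk->sk_frag;
}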


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 17:35   ` Shakeel Butt
@ 2019-10-31 17:47     ` Eric Dumazet
  2019-10-31 18:30       ` Shakeel Butt
  2019-10-31 23:20     ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-10-31 17:47 UTC (permalink / raw)
  To: Shakeel Butt, Tejun Heo, Michal Hocko
  Cc: David S. Miller, netdev, Kernel Team, LKML, Josef Bacik,
	Eric Dumazet, Jakub Kicinski, Johannes Weiner, Linux MM,
	Mel Gorman, Andrew Morton



On 10/31/19 10:35 AM, Shakeel Butt wrote:
> +Michal Hocko
> 
> On Thu, Oct 24, 2019 at 1:50 PM Tejun Heo <tj@kernel.org> wrote:
>>
>> [ quoted patch trimmed ]
>>
>> +static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
>> +{
>> +       return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
>> +               __GFP_DIRECT_RECLAIM;
> 
> I think we should be checking PF_MEMALLOC here instead. Something like:
> 
> return gfpflags_allow_blocking(gfp_flags) && !(current->flags & PF_MEMALLOC);
> 
> In my limited understanding, __GFP_MEMALLOC gives access to the
> reserves, but we have overloaded PF_MEMALLOC to also mark the reclaim
> context.  There are PF_MEMALLOC users that do not use __GFP_MEMALLOC,
> like iscsi_sw_tcp_pdu_xmit(), which can call sock_sendmsg().

Why would this layer not set sk->sk_allocation to GFP_ATOMIC ?

And it also might call sk_set_memalloc() too.

Please double check scsi layer, I am pretty sure it did well at some point.



* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 17:47     ` Eric Dumazet
@ 2019-10-31 18:30       ` Shakeel Butt
  2019-10-31 18:43         ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Shakeel Butt @ 2019-10-31 18:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tejun Heo, Michal Hocko, David S. Miller, netdev, Kernel Team,
	LKML, Josef Bacik, Jakub Kicinski, Johannes Weiner, Linux MM,
	Mel Gorman, Andrew Morton

On Thu, Oct 31, 2019 at 10:47 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 10/31/19 10:35 AM, Shakeel Butt wrote:
> > +Michal Hocko
> >
> > On Thu, Oct 24, 2019 at 1:50 PM Tejun Heo <tj@kernel.org> wrote:
> >>
> >> [ quoted patch trimmed ]
> >>
> >> +static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
> >> +{
> >> +       return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
> >> +               __GFP_DIRECT_RECLAIM;
> >
> > I think we should be checking PF_MEMALLOC here instead. Something like:
> >
> > return gfpflags_allow_blocking(gfp_flags) && !(current->flags & PF_MEMALLOC);
> >
> > In my limited understanding, __GFP_MEMALLOC gives access to the
> > reserves, but we have overloaded PF_MEMALLOC to also mark the reclaim
> > context.  There are PF_MEMALLOC users that do not use __GFP_MEMALLOC,
> > like iscsi_sw_tcp_pdu_xmit(), which can call sock_sendmsg().
>
> Why would this layer not set sk->sk_allocation to GFP_ATOMIC ?
>
> And it also might call sk_set_memalloc() too.
>
> Please double check scsi layer, I am pretty sure it did well at some point.
>

Yes, you are right, quoted the wrong example. SCSI is indeed setting
sk->sk_allocation to GFP_ATOMIC and sk_set_memalloc() in
iscsi_sw_tcp_conn_bind().
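
That setup looks roughly like this (sketch; the exact lines in the driver
may differ slightly):

	sk->sk_allocation = GFP_ATOMIC;	/* can't block, so sk_page_frag() already
					 * falls back to sk->sk_frag */
	sk_set_memalloc(sk);		/* sets SOCK_MEMALLOC and ORs __GFP_MEMALLOC
					 * into sk->sk_allocation so the socket may
					 * dip into reserves while doing reclaim IO */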

Basically, what I wanted to say is that MM treats PF_MEMALLOC as the
reclaim context, while __GFP_MEMALLOC just grants access to the
reserves.  As gfpflags_allow_blocking() can be used beyond the net
subsystem, my only concern is its potential usage in a PF_MEMALLOC
context but without __GFP_MEMALLOC.


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 18:30       ` Shakeel Butt
@ 2019-10-31 18:43         ` Tejun Heo
  2019-10-31 18:51           ` Shakeel Butt
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2019-10-31 18:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Eric Dumazet, Michal Hocko, David S. Miller, netdev, Kernel Team,
	LKML, Josef Bacik, Jakub Kicinski, Johannes Weiner, Linux MM,
	Mel Gorman, Andrew Morton

Hello,

On Thu, Oct 31, 2019 at 11:30:57AM -0700, Shakeel Butt wrote:
> Basically, what I wanted to say is that MM treats PF_MEMALLOC as the
> reclaim context, while __GFP_MEMALLOC just grants access to the
> reserves.  As gfpflags_allow_blocking() can be used beyond the net
> subsystem, my only concern is its potential usage in a PF_MEMALLOC
> context but without __GFP_MEMALLOC.

Yeah, PF_MEMALLOC is likely the better condition to check here as we
primarily want to know whether %current might be recursing and that
should be indicated reliably with PF_MEMALLOC.  Wanna prep a patch for
it?

Thanks.

-- 
tejun


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 18:43         ` Tejun Heo
@ 2019-10-31 18:51           ` Shakeel Butt
  2019-10-31 19:00             ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Shakeel Butt @ 2019-10-31 18:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric Dumazet, Michal Hocko, David S. Miller, netdev, Kernel Team,
	LKML, Josef Bacik, Jakub Kicinski, Johannes Weiner, Linux MM,
	Mel Gorman, Andrew Morton

On Thu, Oct 31, 2019 at 11:43 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Oct 31, 2019 at 11:30:57AM -0700, Shakeel Butt wrote:
> > Basically what I wanted to say that MM treats PF_MEMALLOC as the
> > reclaim context while __GFP_MEMALLOC just tells to give access to the
> > reserves. As gfpflags_allow_blocking() can be used beyond net
> > subsystem, my only concern is its potential usage under PF_MEMALLOC
> > context but without __GFP_MEMALLOC.
>
> Yeah, PF_MEMALLOC is likely the better condition to check here as we
> primarily want to know whether %current might be recursing and that
> should be indicated reliably with PF_MEMALLOC.  Wanna prep a patch for
> it?

Sure, I will keep your commit message and authorship (if you are ok with it).

>
> Thanks.
>
> --
> tejun


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 18:51           ` Shakeel Butt
@ 2019-10-31 19:00             ` Tejun Heo
  2019-10-31 19:14               ` Shakeel Butt
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2019-10-31 19:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Eric Dumazet, Michal Hocko, David S. Miller, netdev, Kernel Team,
	LKML, Josef Bacik, Jakub Kicinski, Johannes Weiner, Linux MM,
	Mel Gorman, Andrew Morton

On Thu, Oct 31, 2019 at 11:51:57AM -0700, Shakeel Butt wrote:
> > Yeah, PF_MEMALLOC is likely the better condition to check here as we
> > primarily want to know whether %current might be recursing and that
> > should be indicated reliably with PF_MEMALLOC.  Wanna prep a patch for
> > it?
> 
> Sure, I will keep your commit message and authorship (if you are ok with it).

I think the patch already got merged, so it may be easier to do an
incremental one.

Thanks.

-- 
tejun


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 19:00             ` Tejun Heo
@ 2019-10-31 19:14               ` Shakeel Butt
  2019-10-31 19:16                 ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Shakeel Butt @ 2019-10-31 19:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric Dumazet, Michal Hocko, David S. Miller, netdev, Kernel Team,
	LKML, Josef Bacik, Jakub Kicinski, Johannes Weiner, Linux MM,
	Mel Gorman, Andrew Morton

On Thu, Oct 31, 2019 at 12:00 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Oct 31, 2019 at 11:51:57AM -0700, Shakeel Butt wrote:
> > > Yeah, PF_MEMALLOC is likely the better condition to check here as we
> > > primarily want to know whether %current might be recursing and that
> > > should be indicated reliably with PF_MEMALLOC.  Wanna prep a patch for
> > > it?
> >
> > Sure, I will keep your commit message and authorship (if you are ok with it).
>
> I think the patch already got merged, so it may be easier to do an
> incremental one.

Oh ok, I will send the incremental one once I have this patch in the mm tree.

>
> Thanks.
>
> --
> tejun


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 19:14               ` Shakeel Butt
@ 2019-10-31 19:16                 ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2019-10-31 19:16 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Eric Dumazet, Michal Hocko, David S. Miller, netdev, Kernel Team,
	LKML, Josef Bacik, Jakub Kicinski, Johannes Weiner, Linux MM,
	Mel Gorman, Andrew Morton

On Thu, Oct 31, 2019 at 12:14:58PM -0700, Shakeel Butt wrote:
> Oh ok, I will send the incremental one once I have this patch in the mm tree.

It's in the networking tree already.

Thanks.

-- 
tejun


* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 17:35   ` Shakeel Butt
  2019-10-31 17:47     ` Eric Dumazet
@ 2019-10-31 23:20     ` Andrew Morton
  2019-11-01 17:12       ` Tejun Heo
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2019-10-31 23:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Michal Hocko, David S. Miller, netdev, Kernel Team,
	LKML, Josef Bacik, Eric Dumazet, Jakub Kicinski, Johannes Weiner,
	Linux MM, Mel Gorman

On Thu, 31 Oct 2019 10:35:21 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> +Michal Hocko
> 
> On Thu, Oct 24, 2019 at 1:50 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > [ changelog and backtrace trimmed ]
> >
> > In [0], tcp_send_msg_locked() was using current->page_frag when it

"tcp_sendmsg_locked" and "current->task_frag".  Stuff like this makes
review harder :(

> > [ ... ]
> >
> > Fix it by adding gfpflags_normal_context() which tests sleepable &&
> > !reclaim and use it to determine whether to use current->task_frag.
> >

Dumb-but-obvious question.  Rather than putzing with allocation modes,
is it not feasible to change the net layer to copy the current value of
current->task_frag into a local then restore its value when it has
finished being used?



* Re: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
  2019-10-31 23:20     ` Andrew Morton
@ 2019-11-01 17:12       ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2019-11-01 17:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shakeel Butt, Michal Hocko, David S. Miller, netdev, Kernel Team,
	LKML, Josef Bacik, Eric Dumazet, Jakub Kicinski, Johannes Weiner,
	Linux MM, Mel Gorman

Hello,

On Thu, Oct 31, 2019 at 04:20:49PM -0700, Andrew Morton wrote:
> > > In [0], tcp_send_msg_locked() was using current->page_frag when it
> 
> "tcp_sendmsg_locked" and "current->task_frag".  Stuff like this makes
> review harder :(

lol, sorry about that.

> > > Fix it by adding gfpflags_normal_context() which tests sleepable &&
> > > !reclaim and use it to determine whether to use current->task_frag.
> > >
> 
> Dumb-but-obvious question.  Rather than putzing with allocation modes,
> is it not feasible to change the net layer to copy the current value of
> current->task_frag into a local then restore its value when it has
> finished being used?

It's being used as an allocation cache, so doing that would lead to
the same area getting allocated for two packets at the same time,
instead of overrunning the end of the cache.
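
Concretely, the cache is just (sketch; the real definition in
include/linux/mm_types.h picks field widths based on page size):

	struct page_frag {
		struct page	*page;		/* page currently being carved up */
		__u32		offset;		/* next free byte within that page */
		__u32		size;		/* bytes available in 'page' */
	};

Saving and restoring it around the nested send would rewind 'offset', so
the outer and the nested sender would be handed overlapping bytes of the
same page - corruption of a different flavour rather than a fix.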

Thanks.

-- 
tejun

