Stable Archive on lore.kernel.org
 help / color / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	Josef Bacik <josef@toxicpanda.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	"David S. Miller" <davem@davemloft.net>
Subject: [PATCH 4.14 29/62] net: fix sk_page_frag() recursion from memory reclaim
Date: Fri,  8 Nov 2019 19:50:17 +0100
Message-ID: <20191108174742.381272669@linuxfoundation.org> (raw)
In-Reply-To: <20191108174719.228826381@linuxfoundation.org>

From: Tejun Heo <tj@kernel.org>

[ Upstream commit 20eb4f29b60286e0d6dc01d9c260b4bd383c58fb ]

sk_page_frag() optimizes skb_frag allocations by using per-task
skb_frag cache when it knows it's the only user.  The condition is
determined by seeing whether the socket allocation mask allows
blocking - if the allocation may block, it obviously owns the task's
context and ergo exclusively owns current->task_frag.

Unfortunately, this misses recursion through memory reclaim path.
Please take a look at the following backtrace.

 [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
     ...
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     sock_xmit.isra.24+0xa1/0x170 [nbd]
     nbd_send_cmd+0x1d2/0x690 [nbd]
     nbd_queue_rq+0x1b5/0x3b0 [nbd]
     __blk_mq_try_issue_directly+0x108/0x1b0
     blk_mq_request_issue_directly+0xbd/0xe0
     blk_mq_try_issue_list_directly+0x41/0xb0
     blk_mq_sched_insert_requests+0xa2/0xe0
     blk_mq_flush_plug_list+0x205/0x2a0
     blk_flush_plug_list+0xc3/0xf0
 [1] blk_finish_plug+0x21/0x2e
     _xfs_buf_ioapply+0x313/0x460
     __xfs_buf_submit+0x67/0x220
     xfs_buf_read_map+0x113/0x1a0
     xfs_trans_read_buf_map+0xbf/0x330
     xfs_btree_read_buf_block.constprop.42+0x95/0xd0
     xfs_btree_lookup_get_block+0x95/0x170
     xfs_btree_lookup+0xcc/0x470
     xfs_bmap_del_extent_real+0x254/0x9a0
     __xfs_bunmapi+0x45c/0xab0
     xfs_bunmapi+0x15/0x30
     xfs_itruncate_extents_flags+0xca/0x250
     xfs_free_eofblocks+0x181/0x1e0
     xfs_fs_destroy_inode+0xa8/0x1b0
     destroy_inode+0x38/0x70
     dispose_list+0x35/0x50
     prune_icache_sb+0x52/0x70
     super_cache_scan+0x120/0x1a0
     do_shrink_slab+0x120/0x290
     shrink_slab+0x216/0x2b0
     shrink_node+0x1b6/0x4a0
     do_try_to_free_pages+0xc6/0x370
     try_to_free_mem_cgroup_pages+0xe3/0x1e0
     try_charge+0x29e/0x790
     mem_cgroup_charge_skmem+0x6a/0x100
     __sk_mem_raise_allocated+0x18e/0x390
     __sk_mem_schedule+0x2a/0x40
 [0] tcp_sendmsg_locked+0x8eb/0xe10
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     ___sys_sendmsg+0x26d/0x2b0
     __sys_sendmsg+0x57/0xa0
     do_syscall_64+0x42/0x100
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

In [0], tcp_send_msg_locked() was using current->page_frag when it
called sk_wmem_schedule().  It already calculated how many bytes can
be fit into current->page_frag.  Due to memory pressure,
sk_wmem_schedule() called into memory reclaim path which called into
xfs and then IO issue path.  Because the filesystem in question is
backed by nbd, the control goes back into the tcp layer - back into
tcp_sendmsg_locked().

nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
sense - it's in the process of freeing memory and wants to be able to,
e.g., drop clean pages to make forward progress.  However, this
confused sk_page_frag() called from [2].  Because it only tests
whether the allocation allows blocking which it does, it now thinks
current->page_frag can be used again although it already was being
used in [0].

After [2] used current->page_frag, the offset would be increased by
the used amount.  When the control returns to [0],
current->page_frag's offset is increased and the previously calculated
number of bytes now may overrun the end of allocated memory leading to
silent memory corruptions.

Fix it by adding gfpflags_normal_context() which tests sleepable &&
!reclaim and use it to determine whether to use current->task_frag.

v2: Eric didn't like gfp flags being tested twice.  Introduce a new
    helper gfpflags_normal_context() and combine the two tests.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/linux/gfp.h |   23 +++++++++++++++++++++++
 include/net/sock.h  |   11 ++++++++---
 2 files changed, 31 insertions(+), 3 deletions(-)

--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -313,6 +313,29 @@ static inline bool gfpflags_allow_blocki
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+/**
+ * gfpflags_normal_context - is gfp_flags a normal sleepable context?
+ * @gfp_flags: gfp_flags to test
+ *
+ * Test whether @gfp_flags indicates that the allocation is from the
+ * %current context and allowed to sleep.
+ *
+ * An allocation being allowed to block doesn't mean it owns the %current
+ * context.  When direct reclaim path tries to allocate memory, the
+ * allocation context is nested inside whatever %current was doing at the
+ * time of the original allocation.  The nested allocation may be allowed
+ * to block but modifying anything %current owns can corrupt the outer
+ * context's expectations.
+ *
+ * %true result from this function indicates that the allocation context
+ * can sleep and use anything that's associated with %current.
+ */
+static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
+{
+	return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
+		__GFP_DIRECT_RECLAIM;
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2131,12 +2131,17 @@ struct sk_buff *sk_stream_alloc_skb(stru
  * sk_page_frag - return an appropriate page_frag
  * @sk: socket
  *
- * If socket allocation mode allows current thread to sleep, it means its
- * safe to use the per task page_frag instead of the per socket one.
+ * Use the per task page_frag instead of the per socket one for
+ * optimization when we know that we're in the normal context and owns
+ * everything that's associated with %current.
+ *
+ * gfpflags_allow_blocking() isn't enough here as direct reclaim may nest
+ * inside other socket operations and end up recursing into sk_page_frag()
+ * while it's already in use.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (gfpflags_allow_blocking(sk->sk_allocation))
+	if (gfpflags_normal_context(sk->sk_allocation))
 		return &current->task_frag;
 
 	return &sk->sk_frag;



  parent reply index

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-08 18:49 [PATCH 4.14 00/62] 4.14.153-stable review Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 01/62] arm64: dts: Fix gpio to pinmux mapping Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 02/62] regulator: ti-abb: Fix timeout in ti_abb_wait_txdone/ti_abb_clear_all_txdone Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 03/62] regulator: pfuze100-regulator: Variable "val" in pfuze100_regulator_probe() could be uninitialized Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 04/62] ASoC: wm_adsp: Dont generate kcontrols without READ flags Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 05/62] ASoc: rockchip: i2s: Fix RPM imbalance Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 06/62] ARM: dts: logicpd-torpedo-som: Remove twl_keypad Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 07/62] pinctrl: ns2: Fix off by one bugs in ns2_pinmux_enable() Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 08/62] ARM: mm: fix alignment handler faults under memory pressure Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 09/62] scsi: scsi_dh_alua: handle RTPG sense code correctly during state transitions Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 10/62] scsi: sni_53c710: fix compilation error Greg Kroah-Hartman
2019-11-08 18:49 ` [PATCH 4.14 11/62] scsi: fix kconfig dependency warning related to 53C700_LE_ON_BE Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 12/62] ARM: dts: imx7s: Correct GPTs ipg clock source Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 13/62] perf c2c: Fix memory leak in build_cl_output() Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 14/62] perf kmem: Fix memory leak in compact_gfp_flags() Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 15/62] ARM: davinci: dm365: Fix McBSP dma_slave_map entry Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 16/62] scsi: target: core: Do not overwrite CDB byte 1 Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 17/62] ARM: 8926/1: v7m: remove register save to stack before svc Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 18/62] of: unittest: fix memory leak in unittest_data_add Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 19/62] MIPS: bmips: mark exception vectors as char arrays Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 20/62] i2c: stm32f7: remove warning when compiling with W=1 Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 21/62] cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 22/62] nbd: handle racing with errored out commands Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 23/62] cxgb4: fix panic when attaching to ULD fail Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 24/62] dccp: do not leak jiffies on the wire Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 25/62] net: annotate accesses to sk->sk_incoming_cpu Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 26/62] net: annotate lockless accesses to sk->sk_napi_id Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 27/62] net: dsa: bcm_sf2: Fix IMP setup for port different than 8 Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 28/62] net: ethernet: ftgmac100: Fix DMA coherency issue with SW checksum Greg Kroah-Hartman
2019-11-08 18:50 ` Greg Kroah-Hartman [this message]
2019-11-08 18:50 ` [PATCH 4.14 30/62] net: hisilicon: Fix ping latency when deal with high throughput Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 31/62] net/mlx4_core: Dynamically set guaranteed amount of counters per VF Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 32/62] net: Zeroing the structure ethtool_wolinfo in ethtool_get_wol() Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 33/62] selftests: net: reuseport_dualstack: fix uninitalized parameter Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 34/62] udp: fix data-race in udp_set_dev_scratch() Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 35/62] net: add READ_ONCE() annotation in __skb_wait_for_more_packets() Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 36/62] net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budget Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 37/62] net: dsa: b53: Do not clear existing mirrored port mask Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 38/62] net: usb: lan78xx: Connect PHY before registering MAC Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 39/62] r8152: add device id for Lenovo ThinkPad USB-C Dock Gen 2 Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 40/62] net: dsa: fix switch tree list Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 41/62] net: bcmgenet: reset 40nm EPHY on energy detect Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 42/62] net: add skb_queue_empty_lockless() Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 43/62] udp: use skb_queue_empty_lockless() Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 44/62] net: use skb_queue_empty_lockless() in poll() handlers Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 45/62] net: use skb_queue_empty_lockless() in busy poll contexts Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 46/62] vxlan: check tun_info options_len properly Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 47/62] erspan: fix the tun_info options_len check for erspan Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 48/62] inet: stop leaking jiffies on the wire Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 49/62] net/flow_dissector: switch to siphash Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 50/62] dmaengine: qcom: bam_dma: Fix resource leak Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 51/62] sched/wake_q: Fix wakeup ordering for wake_q Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 52/62] kbuild: use -fmacro-prefix-map to make __FILE__ a relative path Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 53/62] kbuild: add -fcf-protection=none when using retpoline flags Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 54/62] platform/x86: pmc_atom: Add Siemens SIMATIC IPC227E to critclk_systems DMI table Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 55/62] iio: adc: stm32-adc: move registers definitions Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 56/62] iio: adc: stm32-adc: fix a race when using several adcs with dma and irq Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 57/62] powerpc/mm: Fixup tlbie vs store ordering issue on POWER9 Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 58/62] powerpc/book3s64/mm: Dont do tlbie fixup for some hardware revisions Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 59/62] powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 60/62] powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9 Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 61/62] selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue Greg Kroah-Hartman
2019-11-08 18:50 ` [PATCH 4.14 62/62] selftests/powerpc: Fix compile error on tlbie_test due to newer gcc Greg Kroah-Hartman
2019-11-09  1:57 ` [PATCH 4.14 00/62] 4.14.153-stable review kernelci.org bot
2019-11-09 10:19 ` Naresh Kamboju
2019-11-09 15:40 ` Guenter Roeck

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191108174742.381272669@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=davem@davemloft.net \
    --cc=eric.dumazet@gmail.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Stable Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/stable/0 stable/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 stable stable/ https://lore.kernel.org/stable \
		stable@vger.kernel.org
	public-inbox-index stable

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.stable


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git