From: gregkh@linuxfoundation.org
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Qu Wenruo <wqu@suse.com>,
	Nikolay Borisov <nborisov@suse.com>,
	David Sterba <dsterba@suse.com>,
	Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Subject: [PATCH 5.11 16/36] btrfs: dont flush from btrfs_delayed_inode_reserve_metadata
Date: Wed, 10 Mar 2021 14:23:29 +0100	[thread overview]
Message-ID: <20210310132321.026545448@linuxfoundation.org> (raw)
In-Reply-To: <20210310132320.510840709@linuxfoundation.org>

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

From: Nikolay Borisov <nborisov@suse.com>

commit 4d14c5cde5c268a2bc26addecf09489cb953ef64 upstream

Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is deadlock
prone. In the past, multiple commits:

 * ae5e070eaca9 ("btrfs: qgroup: don't try to wait flushing if we're
   already holding a transaction")

 * 6f23277a49e6 ("btrfs: qgroup: don't commit transaction when we already
   hold the handle")

tried to solve various aspects of this, but it was always a
whack-a-mole game. Unfortunately, those two fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
can call btrfs_dirty_inode as a result of reading a file and modifying
its atime:

  PID: 6963   TASK: ffff8c7f3f94c000  CPU: 2   COMMAND: "test"
  #0  __schedule at ffffffffa529e07d
  #1  schedule at ffffffffa529e4ff
  #2  schedule_timeout at ffffffffa52a1bdd
  #3  wait_for_completion at ffffffffa529eeea             <-- sleeps with delayed node mutex held
  #4  start_delalloc_inodes at ffffffffc0380db5
  #5  btrfs_start_delalloc_snapshot at ffffffffc0393836
  #6  try_flush_qgroup at ffffffffc03f04b2
  #7  __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6     <-- tries to reserve space and starts delalloc inodes.
  #8  btrfs_delayed_update_inode at ffffffffc03e31aa      <-- acquires delayed node mutex
  #9  btrfs_update_inode at ffffffffc0385ba8
 #10  btrfs_dirty_inode at ffffffffc038627b               <-- TRANSACTION OPENED
 #11  touch_atime at ffffffffa4cf0000
 #12  generic_file_read_iter at ffffffffa4c1f123
 #13  new_sync_read at ffffffffa4ccdc8a
 #14  vfs_read at ffffffffa4cd0849
 #15  ksys_read at ffffffffa4cd0bd1
 #16  do_syscall_64 at ffffffffa4a052eb
 #17  entry_SYSCALL_64_after_hwframe at ffffffffa540008c

This queues asynchronous work to flush the delalloc inodes, and that
work can then try to acquire the same delayed_node mutex:

  PID: 455    TASK: ffff8c8085fa4000  CPU: 5   COMMAND: "kworker/u16:30"
  #0  __schedule at ffffffffa529e07d
  #1  schedule at ffffffffa529e4ff
  #2  schedule_preempt_disabled at ffffffffa529e80a
  #3  __mutex_lock at ffffffffa529fdcb                    <-- goes to sleep, never wakes up.
  #4  btrfs_delayed_update_inode at ffffffffc03e3143      <-- tries to acquire the mutex
  #5  btrfs_update_inode at ffffffffc0385ba8              <-- this is the same inode that pid 6963 is holding
  #6  cow_file_range_inline.constprop.78 at ffffffffc0386be7
  #7  cow_file_range at ffffffffc03879c1
  #8  btrfs_run_delalloc_range at ffffffffc038894c
  #9  writepage_delalloc at ffffffffc03a3c8f
 #10  __extent_writepage at ffffffffc03a4c01
 #11  extent_write_cache_pages at ffffffffc03a500b
 #12  extent_writepages at ffffffffc03a6de2
 #13  do_writepages at ffffffffa4c277eb
 #14  __filemap_fdatawrite_range at ffffffffa4c1e5bb
 #15  btrfs_run_delalloc_work at ffffffffc0380987         <-- starts running delayed nodes
 #16  normal_work_helper at ffffffffc03b706c
 #17  process_one_work at ffffffffa4aba4e4
 #18  worker_thread at ffffffffa4aba6fd
 #19  kthread at ffffffffa4ac0a3d
 #20  ret_from_fork at ffffffffa54001ff
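
The circular wait shown by the two traces can be modelled in a few lines
of userspace C. This is only an illustrative sketch, not kernel code: the
task names, the waits_on[] array and model_has_deadlock() are all
hypothetical. The reader (pid 6963) holds the delayed node mutex and
waits for the flush work to complete; the worker (pid 455) runs that
work and waits for the same mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task ids standing in for pid 6963 and pid 455. */
enum { TASK_READER, TASK_WORKER, NR_TASKS };

/*
 * waits_on[t] is the task that must make progress before task t can
 * continue, or -1 if t is not blocked on anyone.
 * Returns true if following the chain from 'start' leads back to it.
 */
static bool model_has_deadlock(const int waits_on[], size_t n, int start)
{
	int cur = start;

	assert(start >= 0 && (size_t)start < n);
	for (size_t steps = 0; steps < n; steps++) {
		cur = waits_on[cur];
		if (cur < 0)
			return false;	/* chain ends, someone can run */
		if (cur == start)
			return true;	/* back where we began: a cycle */
	}
	return false;
}
```

With waits_on = { TASK_WORKER, TASK_READER } (reader waits for the flush
completion, worker waits for the mutex) the chain loops back to the
reader. If the reader never waits for a flush, i.e. waits_on[TASK_READER]
is -1, the chain terminates and the worker eventually gets the mutex.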

To fully address those cases, the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves that by calling btrfs_qgroup_reserve_meta directly, which
will either succeed without flushing or fail and return -EDQUOT. In the
latter case the return value is propagated to btrfs_dirty_inode, which
will fall back to starting a new transaction. That's fine, as the
majority of the time we expect the inode to have the
BTRFS_DELAYED_NODE_INODE_DIRTY flag set, which results in directly
copying the in-memory state.
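
The control flow described above can be sketched as a small userspace
model. Everything here is an assumption made for illustration: struct
model_qgroup and the model_*() helpers are hypothetical and heavily
simplified, and the real reservation and transaction code does much
more. The point is only the shape: the fast path never flushes, and
-EDQUOT (like -ENOSPC) sends the caller to a slow path where flushing is
safe because no delayed node mutex is held yet.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a qgroup with a byte limit. */
struct model_qgroup {
	unsigned long limit;		/* quota limit in bytes */
	unsigned long reserved;		/* bytes currently reserved */
	unsigned long reclaimable;	/* bytes a flush could release */
};

/* Non-flushing reserve: succeed or return -EDQUOT, never wait. */
static int model_reserve_noflush(struct model_qgroup *q, unsigned long bytes)
{
	assert(q != NULL);
	if (q->reserved + bytes > q->limit)
		return -EDQUOT;
	q->reserved += bytes;
	return 0;
}

/* Models flushing; only called where no locks are held in this sketch. */
static void model_flush(struct model_qgroup *q)
{
	unsigned long freed = q->reclaimable;

	if (freed > q->reserved)
		freed = q->reserved;
	q->reserved -= freed;
	q->reclaimable = 0;
}

/* Models the caller: try the fast path, fall back on -ENOSPC/-EDQUOT. */
static int model_dirty_inode(struct model_qgroup *q, unsigned long bytes,
			     bool *used_fallback)
{
	int ret = model_reserve_noflush(q, bytes);

	*used_fallback = false;
	if (ret == -ENOSPC || ret == -EDQUOT) {
		/* "End" the light transaction, then retry on the slow path. */
		*used_fallback = true;
		model_flush(q);
		ret = model_reserve_noflush(q, bytes);
	}
	return ret;
}
```

For example, with limit 100, reserved 90 and reclaimable 50, a 20-byte
request fails on the fast path, takes the fallback, and succeeds once
the modelled flush has released space. If nothing is reclaimable, the
caller still sees -EDQUOT after the retry.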

Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/btrfs/delayed-inode.c |    3 ++-
 fs/btrfs/inode.c         |    2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -627,7 +627,8 @@ static int btrfs_delayed_inode_reserve_m
 	 */
 	if (!src_rsv || (!trans->bytes_reserved &&
 			 src_rsv->type != BTRFS_BLOCK_RSV_DELALLOC)) {
-		ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true);
+		ret = btrfs_qgroup_reserve_meta(root, num_bytes,
+					  BTRFS_QGROUP_RSV_META_PREALLOC, true);
 		if (ret < 0)
 			return ret;
 		ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5916,7 +5916,7 @@ static int btrfs_dirty_inode(struct inod
 		return PTR_ERR(trans);
 
 	ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
-	if (ret && ret == -ENOSPC) {
+	if (ret && (ret == -ENOSPC || ret == -EDQUOT)) {
 		/* whoops, lets try again with the full transaction */
 		btrfs_end_transaction(trans);
 		trans = btrfs_start_transaction(root, 1);



