From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Omar Sandoval <osandov@fb.com>, Tejun Heo <tj@kernel.org>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>,
	linux-btrfs@vger.kernel.org
Subject: [PATCH AUTOSEL 4.9 80/91] btrfs: don't prematurely free work in run_ordered_work()
Date: Tue, 10 Dec 2019 17:30:24 -0500
Message-ID: <20191210223035.14270-80-sashal@kernel.org>
In-Reply-To: <20191210223035.14270-1-sashal@kernel.org>

From: Omar Sandoval <osandov@fb.com>

[ Upstream commit c495dcd6fbe1dce51811a76bb85b4675f6494938 ]

We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:

1. The top (loop device) filesystem queues an async_cow work item from
   cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
   (run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
   (work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
   to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
   work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
   thread A, so it schedules X to be executed _after_ worker thread A
   finishes (see the find_worker_executing_work() call in
   process_one_work()).

Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.

This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.
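
The deferred free described above can be sketched in user space as
follows. This is only an illustration under stated assumptions, not the
kernel code: struct work_item, queue_item(), run_ordered() and the
free_order[] log are hypothetical names, and the sketch leaves out the
list lock, the WORK_DONE_BIT check and the trace event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_ITEMS 8

/* Hypothetical stand-in for struct btrfs_work; not the kernel type. */
struct work_item {
	int id;
	struct work_item *next;
};

/* Records the order in which items are freed, for inspection. */
static int free_order[MAX_ITEMS];
static int nr_freed;

/* Log the id, then release the item (models work->ordered_free()). */
static void ordered_free(struct work_item *w)
{
	assert(nr_freed < MAX_ITEMS);
	free_order[nr_freed++] = w->id;
	free(w);
}

/* Append a new item to the tail of the singly linked ordered list. */
static struct work_item *queue_item(struct work_item **head, int id)
{
	struct work_item *w = malloc(sizeof(*w));

	assert(w);
	w->id = id;
	w->next = NULL;
	while (*head)
		head = &(*head)->next;
	*head = w;
	return w;
}

/*
 * Drain the ordered list. Every item other than self is freed as soon
 * as it is unlinked; self is only flagged and freed after the loop, so
 * its memory cannot be recycled while the loop is still running.
 */
static void run_ordered(struct work_item **head, struct work_item *self)
{
	bool free_self = false;

	while (*head) {
		struct work_item *w = *head;

		*head = w->next;
		if (w == self)
			free_self = true;	/* defer until we are done */
		else
			ordered_free(w);
	}
	if (free_self)
		ordered_free(self);
}
```

With items 1, 2 and 3 queued in that order and item 1 passed as self,
the sketch frees 2 and 3 first and item 1 last.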

P.S.:

One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.
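
To illustrate why a recycled work item looks identical to the original,
here is a small user-space model of the comparison. It is a sketch
under assumptions, not the workqueue implementation: the model_* types,
the one-slot allocator and worker_is_executing() are hypothetical, and
the single static slot exists only to make address reuse deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_work;
typedef void (*model_func_t)(struct model_work *);

/* Hypothetical stand-in for a work item: only the function matters here. */
struct model_work {
	model_func_t func;
};

/* Hypothetical stand-in for a worker and what it is currently running. */
struct model_worker {
	struct model_work *current_work;
	model_func_t current_func;
};

/* One-slot allocator: every allocation returns the same address. */
static struct model_work slot;
static bool slot_used;

static struct model_work *slot_alloc(model_func_t func)
{
	assert(!slot_used);
	slot_used = true;
	slot.func = func;
	return &slot;
}

static void slot_free(struct model_work *w)
{
	assert(w == &slot && slot_used);
	slot_used = false;
}

/*
 * Collision check keyed on address plus work function, the two
 * properties the commit message says are compared.
 */
static bool worker_is_executing(const struct model_worker *worker,
				const struct model_work *work)
{
	return worker->current_work == work &&
	       worker->current_func == work->func;
}

/* Two distinct, empty work functions used as comparison keys. */
static void async_cow_fn(struct model_work *w) { (void)w; }
static void other_fn(struct model_work *w) { (void)w; }
```

In this model, an item freed and reallocated with async_cow_fn matches
the worker that is still running the original, while one reallocated
with other_fn does not.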

Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.

Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/async-thread.c | 56 ++++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index ff0b0be92d612..a3de11d52ad00 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -265,16 +265,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
 	}
 }
 
-static void run_ordered_work(struct __btrfs_workqueue *wq)
+static void run_ordered_work(struct __btrfs_workqueue *wq,
+			     struct btrfs_work *self)
 {
 	struct list_head *list = &wq->ordered_list;
 	struct btrfs_work *work;
 	spinlock_t *lock = &wq->list_lock;
 	unsigned long flags;
+	void *wtag;
+	bool free_self = false;
 
 	while (1) {
-		void *wtag;
-
 		spin_lock_irqsave(lock, flags);
 		if (list_empty(list))
 			break;
@@ -300,16 +301,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
 		list_del(&work->ordered_list);
 		spin_unlock_irqrestore(lock, flags);
 
-		/*
-		 * We don't want to call the ordered free functions with the
-		 * lock held though. Save the work as tag for the trace event,
-		 * because the callback could free the structure.
-		 */
-		wtag = work;
-		work->ordered_free(work);
-		trace_btrfs_all_work_done(wq->fs_info, wtag);
+		if (work == self) {
+			/*
+			 * This is the work item that the worker is currently
+			 * executing.
+			 *
+			 * The kernel workqueue code guarantees non-reentrancy
+			 * of work items. I.e., if a work item with the same
+			 * address and work function is queued twice, the second
+			 * execution is blocked until the first one finishes. A
+			 * work item may be freed and recycled with the same
+			 * work function; the workqueue code assumes that the
+			 * original work item cannot depend on the recycled work
+			 * item in that case (see find_worker_executing_work()).
+			 *
+			 * Note that the work of one Btrfs filesystem may depend
+			 * on the work of another Btrfs filesystem via, e.g., a
+			 * loop device. Therefore, we must not allow the current
+			 * work item to be recycled until we are really done,
+			 * otherwise we break the above assumption and can
+			 * deadlock.
+			 */
+			free_self = true;
+		} else {
+			/*
+			 * We don't want to call the ordered free functions with
+			 * the lock held though. Save the work as tag for the
+			 * trace event, because the callback could free the
+			 * structure.
+			 */
+			wtag = work;
+			work->ordered_free(work);
+			trace_btrfs_all_work_done(wq->fs_info, wtag);
+		}
 	}
 	spin_unlock_irqrestore(lock, flags);
+
+	if (free_self) {
+		wtag = self;
+		self->ordered_free(self);
+		trace_btrfs_all_work_done(wq->fs_info, wtag);
+	}
 }
 
 static void normal_work_helper(struct btrfs_work *work)
@@ -337,7 +369,7 @@ static void normal_work_helper(struct btrfs_work *work)
 	work->func(work);
 	if (need_order) {
 		set_bit(WORK_DONE_BIT, &work->flags);
-		run_ordered_work(wq);
+		run_ordered_work(wq, work);
 	}
 	if (!need_order)
 		trace_btrfs_all_work_done(wq->fs_info, wtag);
-- 
2.20.1
