Stable Archive on lore.kernel.org
 help / color / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Coly Li <colyli@suse.de>,
	Jens Axboe <axboe@kernel.dk>, kbuild test robot <lkp@intel.com>
Subject: [PATCH 5.2 20/37] bcache: fix race in btree_flush_write()
Date: Fri, 13 Sep 2019 14:07:25 +0100
Message-ID: <20190913130518.849251524@linuxfoundation.org> (raw)
In-Reply-To: <20190913130510.727515099@linuxfoundation.org>

There is a race between mca_reap(), btree_node_free() and journal code
btree_flush_write(), which results very rare and strange deadlock or
panic and are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() one btree
node with oldest journal pin is selected, then it is flushed to cache
device, the select-and-flush is a two steps operation. Between these two
steps, there are something may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
  other requesters for other btree node.
- The slected btree node was selected, flushed and released by mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of selected btree node is allocated to other btree node, if that
btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is the memory of the selected btree node is
released, then all references to this btree node (e.g. b->write_lock)
will trigger NULL pointer deference panic.

This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.

Such race is not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
device assembled by 3 NVMe SSDs as backing device, this race can be
observed around every 10,000 times btree_flush_write() gets called. Both
deadlock and kernel panic all happened as aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after btree nodes
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree node
without BTREE_NODE_journal_flush flag, such race is avoided.

Once corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
        2149 err_free2:
        2150         bkey_put(b->c, &n2->key);
        2151         btree_node_free(n2);
        2152         rw_unlock(true, n2);
        2153 err_free1:
        2154         bkey_put(b->c, &n1->key);
        2155         btree_node_free(n1);
        2156         rw_unlock(true, n1);
At line 2151 and 2155, the btree node n2 and n1 are released without
mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such error handling path,
and the selected btree node has BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
and free the btree node memory.

Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/btree.c   | 28 +++++++++++++++++++++++++++-
 drivers/md/bcache/btree.h   |  2 ++
 drivers/md/bcache/journal.c |  7 +++++++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 9788b2ee6638f..5cf3247e8afb2 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -35,7 +35,7 @@
 #include <linux/rcupdate.h>
 #include <linux/sched/clock.h>
 #include <linux/rculist.h>
-
+#include <linux/delay.h>
 #include <trace/events/bcache.h>
 
 /*
@@ -655,12 +655,25 @@ static int mca_reap(struct btree *b, unsigned int min_order, bool flush)
 		up(&b->io_mutex);
 	}
 
+retry:
 	/*
 	 * BTREE_NODE_dirty might be cleared in btree_flush_btree() by
 	 * __bch_btree_node_write(). To avoid an extra flush, acquire
 	 * b->write_lock before checking BTREE_NODE_dirty bit.
 	 */
 	mutex_lock(&b->write_lock);
+	/*
+	 * If this btree node is selected in btree_flush_write() by journal
+	 * code, delay and retry until the node is flushed by journal code
+	 * and BTREE_NODE_journal_flush bit cleared by btree_flush_write().
+	 */
+	if (btree_node_journal_flush(b)) {
+		pr_debug("bnode %p is flushing by journal, retry", b);
+		mutex_unlock(&b->write_lock);
+		udelay(1);
+		goto retry;
+	}
+
 	if (btree_node_dirty(b))
 		__bch_btree_node_write(b, &cl);
 	mutex_unlock(&b->write_lock);
@@ -1077,7 +1090,20 @@ static void btree_node_free(struct btree *b)
 
 	BUG_ON(b == b->c->root);
 
+retry:
 	mutex_lock(&b->write_lock);
+	/*
+	 * If the btree node is selected and flushing in btree_flush_write(),
+	 * delay and retry until the BTREE_NODE_journal_flush bit cleared,
+	 * then it is safe to free the btree node here. Otherwise this btree
+	 * node will be in race condition.
+	 */
+	if (btree_node_journal_flush(b)) {
+		mutex_unlock(&b->write_lock);
+		pr_debug("bnode %p journal_flush set, retry", b);
+		udelay(1);
+		goto retry;
+	}
 
 	if (btree_node_dirty(b)) {
 		btree_complete_write(b, btree_current_write(b));
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index d1c72ef64edf5..76cfd121a4861 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -158,11 +158,13 @@ enum btree_flags {
 	BTREE_NODE_io_error,
 	BTREE_NODE_dirty,
 	BTREE_NODE_write_idx,
+	BTREE_NODE_journal_flush,
 };
 
 BTREE_FLAG(io_error);
 BTREE_FLAG(dirty);
 BTREE_FLAG(write_idx);
+BTREE_FLAG(journal_flush);
 
 static inline struct btree_write *btree_current_write(struct btree *b)
 {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index cae2aff5e27ae..33556acdcf9cd 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -405,6 +405,7 @@ static void btree_flush_write(struct cache_set *c)
 retry:
 	best = NULL;
 
+	mutex_lock(&c->bucket_lock);
 	for_each_cached_btree(b, c, i)
 		if (btree_current_write(b)->journal) {
 			if (!best)
@@ -417,9 +418,14 @@ retry:
 		}
 
 	b = best;
+	if (b)
+		set_btree_node_journal_flush(b);
+	mutex_unlock(&c->bucket_lock);
+
 	if (b) {
 		mutex_lock(&b->write_lock);
 		if (!btree_current_write(b)->journal) {
+			clear_bit(BTREE_NODE_journal_flush, &b->flags);
 			mutex_unlock(&b->write_lock);
 			/* We raced */
 			atomic_long_inc(&c->retry_flush_write);
@@ -427,6 +433,7 @@ retry:
 		}
 
 		__bch_btree_node_write(b, NULL);
+		clear_bit(BTREE_NODE_journal_flush, &b->flags);
 		mutex_unlock(&b->write_lock);
 	}
 }
-- 
2.20.1




  parent reply index

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-13 13:07 [PATCH 5.2 00/37] 5.2.15-stable review Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 01/37] gpio: pca953x: correct type of reg_direction Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 02/37] gpio: pca953x: use pca953x_read_regs instead of regmap_bulk_read Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 03/37] ALSA: hda - Fix potential endless loop at applying quirks Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 04/37] ALSA: hda/realtek - Fix overridden device-specific initialization Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 05/37] ALSA: hda/realtek - Add quirk for HP Pavilion 15 Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 06/37] ALSA: hda/realtek - Enable internal speaker & headset mic of ASUS UX431FL Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 07/37] ALSA: hda/realtek - Fix the problem of two front mics on a ThinkCentre Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 08/37] sched/fair: Dont assign runtime for throttled cfs_rq Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 09/37] drm/vmwgfx: Fix double free in vmw_recv_msg() Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 10/37] drm/nouveau/sec2/gp102: add missing MODULE_FIRMWAREs Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 11/37] vhost/test: fix build for vhost test Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 12/37] vhost/test: fix build for vhost test - again Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 13/37] powerpc/64e: Drop stale call to smp_processor_id() which hangs SMP startup Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 14/37] powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 15/37] powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 16/37] batman-adv: fix uninit-value in batadv_netlink_get_ifindex() Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 17/37] batman-adv: Only read OGM tvlv_len after buffer len check Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 18/37] bcache: only clear BTREE_NODE_dirty bit when it is set Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 19/37] bcache: add comments for mutex_lock(&b->write_lock) Greg Kroah-Hartman
2019-09-13 13:07 ` Greg Kroah-Hartman [this message]
2019-09-13 13:07 ` [PATCH 5.2 21/37] IB/rdmavt: Add new completion inline Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 22/37] IB/{rdmavt, qib, hfi1}: Convert to new completion API Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 23/37] IB/hfi1: Unreserve a flushed OPFN request Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 24/37] drm/i915: Disable SAMPLER_STATE prefetching on all Gen11 steppings Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 25/37] drm/i915: Make sure cdclk is high enough for DP audio on VLV/CHV Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 26/37] mmc: sdhci-sprd: Fix the incorrect soft reset operation when runtime resuming Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 27/37] usb: chipidea: imx: add imx7ulp support Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 28/37] usb: chipidea: imx: fix EPROBE_DEFER support during driver probe Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 29/37] virtio/s390: fix race on airq_areas[] Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 30/37] drm/i915: Support flags in whitlist WAs Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 31/37] drm/i915: Support whitelist workarounds on all engines Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 32/37] drm/i915: whitelist PS_(DEPTH|INVOCATION)_COUNT Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 33/37] drm/i915: Add whitelist workarounds for ICL Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 34/37] drm/i915/icl: whitelist PS_(DEPTH|INVOCATION)_COUNT Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 35/37] Btrfs: fix unwritten extent buffers and hangs on future writeback attempts Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 36/37] vhost: block speculation of translated descriptors Greg Kroah-Hartman
2019-09-14  0:54   ` Stefan Lippers-Hollmann
2019-09-14  5:50     ` Greg Kroah-Hartman
2019-09-14  7:15       ` Stefan Lippers-Hollmann
2019-09-14  8:08         ` Greg Kroah-Hartman
2019-09-15  9:34           ` Thomas Backlund
2019-09-15 13:37             ` Greg Kroah-Hartman
2019-09-13 13:07 ` [PATCH 5.2 37/37] vhost: make sure log_num < in_num Greg Kroah-Hartman
2019-09-13 19:39 ` [PATCH 5.2 00/37] 5.2.15-stable review kernelci.org bot
2019-09-14  4:26 ` Naresh Kamboju
2019-09-14  7:43   ` Greg Kroah-Hartman
2019-09-14 14:08 ` Guenter Roeck
2019-09-15 13:34 ` Greg Kroah-Hartman
2019-09-16  9:25 ` Jon Hunter
2019-09-16 10:41   ` Greg Kroah-Hartman

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190913130518.849251524@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=axboe@kernel.dk \
    --cc=colyli@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkp@intel.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Stable Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/stable/0 stable/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 stable stable/ https://lore.kernel.org/stable \
		stable@vger.kernel.org stable@archiver.kernel.org
	public-inbox-index stable

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.stable


AGPL code for this site: git clone https://public-inbox.org/ public-inbox