All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Guruswamy Basavaiah <guru2018@gmail.com>,
	Nikos Tsironis <ntsironis@arrikto.com>,
	Mikulas Patocka <mpatocka@redhat.com>,
	Mike Snitzer <snitzer@redhat.com>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.9 03/62] dm snapshot: rework COW throttling to fix deadlock
Date: Mon,  4 Nov 2019 22:44:25 +0100	[thread overview]
Message-ID: <20191104211903.714156948@linuxfoundation.org> (raw)
In-Reply-To: <20191104211901.387893698@linuxfoundation.org>

From: Mikulas Patocka <mpatocka@redhat.com>

[ Upstream commit b21555786f18cd77f2311ad89074533109ae3ffa ]

Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
workqueue stalls") introduced a semaphore to limit the maximum number of
in-flight kcopyd (COW) jobs.

The implementation of this throttling mechanism is prone to a deadlock:

1. One or more threads write to the origin device causing COW, which is
   performed by kcopyd.

2. At some point some of these threads might reach the s->cow_count
   semaphore limit and block in down(&s->cow_count), holding a read lock
   on _origins_lock.

3. Someone tries to acquire a write lock on _origins_lock, e.g.,
   snapshot_ctr(), which blocks because the threads at step (2) already
   hold a read lock on it.

4. A COW operation completes and kcopyd runs dm-snapshot's completion
   callback, which ends up calling pending_complete().
   pending_complete() tries to resubmit any deferred origin bios. This
   requires acquiring a read lock on _origins_lock, which blocks.

   This happens because the read-write semaphore implementation gives
   priority to writers, meaning that as soon as a writer tries to enter
   the critical section, no readers will be allowed in, until all
   writers have completed their work.

   So, pending_complete() waits for the writer at step (3) to acquire
   and release the lock. This writer waits for the readers at step (2)
   to release the read lock and those readers wait for
   pending_complete() (the kcopyd thread) to signal the s->cow_count
   semaphore: DEADLOCK.

The above was thoroughly analyzed and documented by Nikos Tsironis as
part of his initial proposal for fixing this deadlock, see:
https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html

Fix this deadlock by reworking COW throttling so that it waits without
holding any locks. Add a variable 'in_progress' that counts how many
kcopyd jobs are running. A function wait_for_in_progress() will sleep if
'in_progress' is over the limit. It drops _origins_lock in order to
avoid the deadlock.

Reported-by: Guruswamy Basavaiah <guru2018@gmail.com>
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Tested-by: Nikos Tsironis <ntsironis@arrikto.com>
Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls")
Cc: stable@vger.kernel.org # v5.0+
Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/md/dm-snap.c | 80 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 64 insertions(+), 16 deletions(-)

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index ef51ab8a5dcb2..cf2f44e500e24 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -19,7 +19,6 @@
 #include <linux/vmalloc.h>
 #include <linux/log2.h>
 #include <linux/dm-kcopyd.h>
-#include <linux/semaphore.h>
 
 #include "dm.h"
 
@@ -106,8 +105,8 @@ struct dm_snapshot {
 	/* The on disk metadata handler */
 	struct dm_exception_store *store;
 
-	/* Maximum number of in-flight COW jobs. */
-	struct semaphore cow_count;
+	unsigned in_progress;
+	wait_queue_head_t in_progress_wait;
 
 	struct dm_kcopyd_client *kcopyd_client;
 
@@ -158,8 +157,8 @@ struct dm_snapshot {
  */
 #define DEFAULT_COW_THRESHOLD 2048
 
-static int cow_threshold = DEFAULT_COW_THRESHOLD;
-module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644);
+static unsigned cow_threshold = DEFAULT_COW_THRESHOLD;
+module_param_named(snapshot_cow_threshold, cow_threshold, uint, 0644);
 MODULE_PARM_DESC(snapshot_cow_threshold, "Maximum number of chunks being copied on write");
 
 DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
@@ -1206,7 +1205,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad_hash_tables;
 	}
 
-	sema_init(&s->cow_count, (cow_threshold > 0) ? cow_threshold : INT_MAX);
+	init_waitqueue_head(&s->in_progress_wait);
 
 	s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
 	if (IS_ERR(s->kcopyd_client)) {
@@ -1396,17 +1395,54 @@ static void snapshot_dtr(struct dm_target *ti)
 
 	dm_put_device(ti, s->origin);
 
+	WARN_ON(s->in_progress);
+
 	kfree(s);
 }
 
 static void account_start_copy(struct dm_snapshot *s)
 {
-	down(&s->cow_count);
+	spin_lock(&s->in_progress_wait.lock);
+	s->in_progress++;
+	spin_unlock(&s->in_progress_wait.lock);
 }
 
 static void account_end_copy(struct dm_snapshot *s)
 {
-	up(&s->cow_count);
+	spin_lock(&s->in_progress_wait.lock);
+	BUG_ON(!s->in_progress);
+	s->in_progress--;
+	if (likely(s->in_progress <= cow_threshold) &&
+	    unlikely(waitqueue_active(&s->in_progress_wait)))
+		wake_up_locked(&s->in_progress_wait);
+	spin_unlock(&s->in_progress_wait.lock);
+}
+
+static bool wait_for_in_progress(struct dm_snapshot *s, bool unlock_origins)
+{
+	if (unlikely(s->in_progress > cow_threshold)) {
+		spin_lock(&s->in_progress_wait.lock);
+		if (likely(s->in_progress > cow_threshold)) {
+			/*
+			 * NOTE: this throttle doesn't account for whether
+			 * the caller is servicing an IO that will trigger a COW
+			 * so excess throttling may result for chunks not required
+			 * to be COW'd.  But if cow_threshold was reached, extra
+			 * throttling is unlikely to negatively impact performance.
+			 */
+			DECLARE_WAITQUEUE(wait, current);
+			__add_wait_queue(&s->in_progress_wait, &wait);
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			spin_unlock(&s->in_progress_wait.lock);
+			if (unlock_origins)
+				up_read(&_origins_lock);
+			io_schedule();
+			remove_wait_queue(&s->in_progress_wait, &wait);
+			return false;
+		}
+		spin_unlock(&s->in_progress_wait.lock);
+	}
+	return true;
 }
 
 /*
@@ -1424,7 +1460,7 @@ static void flush_bios(struct bio *bio)
 	}
 }
 
-static int do_origin(struct dm_dev *origin, struct bio *bio);
+static int do_origin(struct dm_dev *origin, struct bio *bio, bool limit);
 
 /*
  * Flush a list of buffers.
@@ -1437,7 +1473,7 @@ static void retry_origin_bios(struct dm_snapshot *s, struct bio *bio)
 	while (bio) {
 		n = bio->bi_next;
 		bio->bi_next = NULL;
-		r = do_origin(s->origin, bio);
+		r = do_origin(s->origin, bio, false);
 		if (r == DM_MAPIO_REMAPPED)
 			generic_make_request(bio);
 		bio = n;
@@ -1726,8 +1762,11 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 	if (!s->valid)
 		return -EIO;
 
-	/* FIXME: should only take write lock if we need
-	 * to copy an exception */
+	if (bio_data_dir(bio) == WRITE) {
+		while (unlikely(!wait_for_in_progress(s, false)))
+			; /* wait_for_in_progress() has slept */
+	}
+
 	mutex_lock(&s->lock);
 
 	if (!s->valid || (unlikely(s->snapshot_overflowed) &&
@@ -1876,7 +1915,7 @@ redirect_to_origin:
 
 	if (bio_data_dir(bio) == WRITE) {
 		mutex_unlock(&s->lock);
-		return do_origin(s->origin, bio);
+		return do_origin(s->origin, bio, false);
 	}
 
 out_unlock:
@@ -2212,15 +2251,24 @@ next_snapshot:
 /*
  * Called on a write from the origin driver.
  */
-static int do_origin(struct dm_dev *origin, struct bio *bio)
+static int do_origin(struct dm_dev *origin, struct bio *bio, bool limit)
 {
 	struct origin *o;
 	int r = DM_MAPIO_REMAPPED;
 
+again:
 	down_read(&_origins_lock);
 	o = __lookup_origin(origin->bdev);
-	if (o)
+	if (o) {
+		if (limit) {
+			struct dm_snapshot *s;
+			list_for_each_entry(s, &o->snapshots, list)
+				if (unlikely(!wait_for_in_progress(s, true)))
+					goto again;
+		}
+
 		r = __origin_write(&o->snapshots, bio->bi_iter.bi_sector, bio);
+	}
 	up_read(&_origins_lock);
 
 	return r;
@@ -2333,7 +2381,7 @@ static int origin_map(struct dm_target *ti, struct bio *bio)
 		dm_accept_partial_bio(bio, available_sectors);
 
 	/* Only tell snapshots if this is a write */
-	return do_origin(o->dev, bio);
+	return do_origin(o->dev, bio, true);
 }
 
 static long origin_direct_access(struct dm_target *ti, sector_t sector,
-- 
2.20.1




  parent reply	other threads:[~2019-11-04 21:50 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-04 21:44 [PATCH 4.9 00/62] 4.9.199-stable review Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 01/62] dm snapshot: use mutex instead of rw_semaphore Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 02/62] dm snapshot: introduce account_start_copy() and account_end_copy() Greg Kroah-Hartman
2019-11-04 21:44 ` Greg Kroah-Hartman [this message]
2019-11-04 21:44 ` [PATCH 4.9 04/62] dm: Use kzalloc for all structs with embedded biosets/mempools Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 05/62] sc16is7xx: Fix for "Unexpected interrupt: 8" Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 06/62] HID: i2c-hid: add Direkt-Tek DTLAPY133-1 to descriptor override Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 07/62] x86/cpu: Add Atom Tremont (Jacobsville) Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 08/62] HID: i2c-hid: Add Odys Winbook 13 to descriptor override Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 09/62] scripts/setlocalversion: Improve -dirty check with git-status --no-optional-locks Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 10/62] usb: handle warm-reset port requests on hub resume Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 11/62] rtc: pcf8523: set xtal load capacitance from DT Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 12/62] exec: load_script: Do not exec truncated interpreter path Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 13/62] iio: fix center temperature of bmc150-accel-core Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 14/62] perf map: Fix overlapped map handling Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 15/62] perf jevents: Fix period for Intel fixed counters Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 16/62] staging: rtl8188eu: fix null dereference when kzalloc fails Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 17/62] RDMA/iwcm: Fix a lock inversion issue Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 18/62] gpio: max77620: Use correct unit for debounce times Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 19/62] fs: cifs: mute -Wunused-const-variable message Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 20/62] serial: mctrl_gpio: Check for NULL pointer Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 21/62] efi/cper: Fix endianness of PCIe class code Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 22/62] efi/x86: Do not clean dummy variable in kexec path Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 23/62] ocfs2: clear zero in unaligned direct IO Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 24/62] fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry() Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 25/62] fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock() Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 26/62] fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc() Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 27/62] MIPS: fw: sni: Fix out of bounds init of o32 stack Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 28/62] NFSv4: Fix leak of clp->cl_acceptor string Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 29/62] s390/uaccess: avoid (false positive) compiler warnings Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 30/62] tracing: Initialize iter->seq after zeroing in tracing_read_pipe() Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 31/62] USB: legousbtower: fix a signedness bug in tower_probe() Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 32/62] thunderbolt: Use 32-bit writes when writing ring producer/consumer Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 33/62] ath6kl: fix a NULL-ptr-deref bug in ath6kl_usb_alloc_urb_from_pipe() Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 34/62] fuse: flush dirty data/metadata before non-truncate setattr Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 35/62] fuse: truncate pending writes on O_TRUNC Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 36/62] ALSA: bebob: Fix prototype of helper function to return negative value Greg Kroah-Hartman
2019-11-04 21:44 ` [PATCH 4.9 37/62] UAS: Revert commit 3ae62a42090f ("UAS: fix alignment of scatter/gather segments") Greg Kroah-Hartman
2019-11-05 14:31   ` Oliver Neukum
2019-11-06  0:11     ` Sasha Levin
2019-11-06  7:45       ` Oliver Neukum
2019-11-06 11:17         ` Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 38/62] USB: gadget: Reject endpoints with 0 maxpacket value Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 39/62] usb-storage: Revert commit 747668dbc061 ("usb-storage: Set virt_boundary_mask to avoid SG overflows") Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 40/62] USB: ldusb: fix ring-buffer locking Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 41/62] USB: ldusb: fix control-message timeout Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 42/62] USB: serial: whiteheat: fix potential slab corruption Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 43/62] USB: serial: whiteheat: fix line-speed endianness Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 44/62] HID: i2c-hid: add Trekstor Primebook C11B to descriptor override Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 45/62] HID: Fix assumption that devices have inputs Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 46/62] HID: fix error message in hid_open_report() Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 47/62] nl80211: fix validation of mesh path nexthop Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 48/62] s390/cmm: fix information leak in cmm_timeout_handler() Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 49/62] rtlwifi: Fix potential overflow on P2P code Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 50/62] dmaengine: cppi41: Fix cppi41_dma_prep_slave_sg() when idle Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 51/62] llc: fix sk_buff leak in llc_sap_state_process() Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 52/62] llc: fix sk_buff leak in llc_conn_service() Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 53/62] bonding: fix potential NULL deref in bond_update_slave_arr Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 54/62] net: usb: sr9800: fix uninitialized local variable Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 55/62] sch_netem: fix rcu splat in netem_enqueue() Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 56/62] sctp: fix the issue that flags are ignored when using kernel_connect Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 57/62] sctp: not bind the socket in sctp_connect Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 58/62] xfs: Correctly invert xfs_buftarg LRU isolation logic Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 59/62] ALSA: timer: Follow standard EXPORT_SYMBOL() declarations Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 60/62] ALSA: timer: Limit max instances per timer Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 61/62] ALSA: timer: Simplify error path in snd_timer_open() Greg Kroah-Hartman
2019-11-04 21:45 ` [PATCH 4.9 62/62] ALSA: timer: Fix mutex deadlock at releasing card Greg Kroah-Hartman
2019-11-05  5:56 ` [PATCH 4.9 00/62] 4.9.199-stable review kernelci.org bot
2019-11-05  7:05 ` Naresh Kamboju
2019-11-05 14:24 ` Guenter Roeck
2019-11-05 17:01 ` shuah
2019-11-05 23:37 ` Jon Hunter
2019-11-05 23:37   ` Jon Hunter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191104211903.714156948@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=guru2018@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=ntsironis@arrikto.com \
    --cc=sashal@kernel.org \
    --cc=snitzer@redhat.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.