* [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

From: Nikos Tsironis @ 2023-01-18 12:29 UTC
To: snitzer, agk, dm-devel; +Cc: ejt, ntsironis

Under certain conditions, swapping a table that includes a dm-era
target with a new table causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes when the
target is resumed. So, running 'dmsetup status' or 'dmsetup message' for
a suspended dm-era device blocks until the device is resumed.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock. The only way to recover is by rebooting.

Steps to reproduce:

1. Create a device with a dm-era target

   # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Suspend the device

   # dmsetup suspend eradev

3. Load a new table to the device, e.g., to resize the device. Note that
   we must load the new table _after_ suspending the device to ensure
   the constructor of the new target instance reads up-to-date metadata,
   as committed by the post-suspend hook of dm-era.

   # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

4. The device now has LIVE and INACTIVE tables

   # dmsetup info eradev
   Name:              eradev
   State:             SUSPENDED
   Read Ahead:        16384
   Tables present:    LIVE & INACTIVE
   Open count:        0
   Event number:      0
   Major, minor:      252, 2
   Number of targets: 1

5. Retrieve the status of the device. This blocks because the device is
   suspended. Equivalently, any 'dmsetup message' operation would block
   too. This command holds the SRCU read lock on md->io_barrier.

   # dmsetup status eradev

6. Resume the device. The resume operation tries to swap the old table
   with the new one and deadlocks, because it synchronizes SRCU for the
   old table, while the blocked 'dmsetup status' holds the SRCU read
   lock. And the old table is never resumed again at this point.

   # dmsetup resume eradev

7. The relevant dmesg logs are:

   [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
   [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
   [ 7250.875722]       Not tainted 5.16.0-rc2-release+ #16
   [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   [ 7250.875803] task:dmsetup         state:D stack:    0 pid: 1986 ppid:  1313 flags:0x00000000
   [ 7250.875809] Call Trace:
   [ 7250.875812]  <TASK>
   [ 7250.875816]  __schedule+0x330/0x8b0
   [ 7250.875827]  schedule+0x4e/0xc0
   [ 7250.875831]  schedule_timeout+0x20f/0x2e0
   [ 7250.875836]  ? do_set_pte+0xb8/0x120
   [ 7250.875843]  ? prep_new_page+0x91/0xa0
   [ 7250.875847]  wait_for_completion+0x8c/0xf0
   [ 7250.875854]  perform_rpc+0x95/0xb0 [dm_era]
   [ 7250.875862]  in_worker1.constprop.20+0x48/0x70 [dm_era]
   [ 7250.875867]  ? era_iterate_devices+0x30/0x30 [dm_era]
   [ 7250.875872]  ? era_status+0x64/0x1e0 [dm_era]
   [ 7250.875877]  era_status+0x64/0x1e0 [dm_era]
   [ 7250.875882]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
   [ 7250.875900]  ? __mod_node_page_state+0x82/0xc0
   [ 7250.875909]  retrieve_status+0xbc/0x1e0 [dm_mod]
   [ 7250.875921]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
   [ 7250.875932]  table_status+0x61/0xa0 [dm_mod]
   [ 7250.875942]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
   [ 7250.875956]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
   [ 7250.875966]  __x64_sys_ioctl+0x8e/0xd0
   [ 7250.875970]  do_syscall_64+0x3a/0xd0
   [ 7250.875974]  entry_SYSCALL_64_after_hwframe+0x44/0xae
   [ 7250.875980] RIP: 0033:0x7f20b7cd4017
   [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017
   [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003
   [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320
   [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110
   [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000
   [ 7250.876002]  </TASK>
   [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds.
   [ 7250.876046]       Not tainted 5.16.0-rc2-release+ #16
   [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   [ 7250.876129] task:dmsetup         state:D stack:    0 pid: 1987 ppid:  1385 flags:0x00000000
   [ 7250.876134] Call Trace:
   [ 7250.876136]  <TASK>
   [ 7250.876138]  __schedule+0x330/0x8b0
   [ 7250.876142]  schedule+0x4e/0xc0
   [ 7250.876145]  schedule_timeout+0x20f/0x2e0
   [ 7250.876149]  ? __queue_work+0x226/0x420
   [ 7250.876156]  wait_for_completion+0x8c/0xf0
   [ 7250.876160]  __synchronize_srcu.part.19+0x92/0xc0
   [ 7250.876167]  ? __bpf_trace_rcu_stall_warning+0x10/0x10
   [ 7250.876173]  ? dm_swap_table+0x2f4/0x310 [dm_mod]
   [ 7250.876185]  dm_swap_table+0x2f4/0x310 [dm_mod]
   [ 7250.876198]  ? table_load+0x360/0x360 [dm_mod]
   [ 7250.876207]  dev_suspend+0x95/0x250 [dm_mod]
   [ 7250.876217]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
   [ 7250.876231]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
   [ 7250.876240]  __x64_sys_ioctl+0x8e/0xd0
   [ 7250.876244]  do_syscall_64+0x3a/0xd0
   [ 7250.876247]  entry_SYSCALL_64_after_hwframe+0x44/0xae
   [ 7250.876252] RIP: 0033:0x7f15e9254017
   [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017
   [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003
   [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0
   [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110
   [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001
   [ 7250.876269]  </TASK>

Fix this by allowing metadata operations triggered by user space to run
in the corresponding user space thread, instead of queueing them for
execution by the dm-era worker thread.

This way, even if the device is suspended, the metadata operations will
run and release the SRCU read lock, preventing a subsequent table reload
(dm_swap_table()) from deadlocking.

To make this possible, use a mutex to serialize access to the metadata
between the worker thread and the user space threads.

This is a follow-up on
https://listman.redhat.com/archives/dm-devel/2021-December/msg00044.html.

Nikos Tsironis (2):
  dm era: allocate in-core writesets when loading metadata
  dm era: avoid deadlock when swapping table

 drivers/md/dm-era-target.c | 78 ++++++++++++++++++++++++++++++--------
 1 file changed, 63 insertions(+), 15 deletions(-)

--
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
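The two stuck tasks above form a classic reader-vs-grace-period deadlock: the status IOCTL holds the read side and waits on the stopped worker, while the table swap waits for all readers to drain. As an illustration only, the same shape can be reproduced with plain Python threads — `FakeSRCU`, `worker_resumed`, and the timeouts below are inventions for this sketch, not kernel API:

```python
import threading
import time

class FakeSRCU:
    """Toy stand-in for SRCU: synchronize() waits until no readers remain."""
    def __init__(self):
        self.readers = 0
        self.cond = threading.Condition()

    def read_lock(self):
        with self.cond:
            self.readers += 1

    def read_unlock(self):
        with self.cond:
            self.readers -= 1
            self.cond.notify_all()

    def synchronize(self, timeout=None):
        """Return True once a grace period elapses, False on timeout."""
        deadline = None if timeout is None else time.monotonic() + timeout
        with self.cond:
            while self.readers:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return False
                self.cond.wait(remaining)
            return True

srcu = FakeSRCU()
worker_resumed = threading.Event()   # never set: models the suspended worker

def dmsetup_status():
    # 'dmsetup status' path: take the read-side lock, then queue an RPC to
    # the worker and wait for it -- forever, since the worker is stopped.
    srcu.read_lock()
    try:
        worker_resumed.wait()        # blocks while the device is suspended
    finally:
        srcu.read_unlock()

t = threading.Thread(target=dmsetup_status, daemon=True)
t.start()
time.sleep(0.1)                      # let the status thread grab the lock

# 'dmsetup resume' path: dm_swap_table() synchronizes SRCU and can never
# make progress while the status thread holds the read-side lock.
deadlocked = not srcu.synchronize(timeout=0.5)
print("deadlocked:", deadlocked)
```

Running it prints `deadlocked: True`: neither thread can make progress until the other does, which is exactly the state the hung-task traces show.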
* [dm-devel] [PATCH 1/2] dm era: allocate in-core writesets when loading metadata

From: Nikos Tsironis @ 2023-01-18 12:29 UTC
To: snitzer, agk, dm-devel; +Cc: ejt, ntsironis

Until now, the allocation of the in-core bitmaps was done in pre-resume
by metadata_resize().

In preparation for the next commit, which makes it possible for a
metadata operation, e.g. an era rollover, to run before pre-resume runs,
allocate the in-core bitmaps as part of opening the metadata.

Also, set the number of blocks of the era device in era_ctr() to the
number of blocks in the metadata. This avoids attempting to resize the
metadata every time we create a new target instance, even if the
metadata size hasn't changed.

Fixes: eec40579d8487 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
---
 drivers/md/dm-era-target.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index e92c1afc3677..3332bed2f412 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -788,6 +788,7 @@ static int metadata_digest_start(struct era_metadata *md, struct digest *d)
  * High level metadata interface. Target methods should use these, and not
  * the lower level ones.
  *--------------------------------------------------------------*/
+static void metadata_close(struct era_metadata *md);
 static struct era_metadata *metadata_open(struct block_device *bdev,
 					  sector_t block_size,
 					  bool may_format)
@@ -811,6 +812,24 @@ static struct era_metadata *metadata_open(struct block_device *bdev,
 		return ERR_PTR(r);
 	}
 
+	if (md->nr_blocks == 0)
+		return md;
+
+	/* Allocate in-core writesets */
+	r = writeset_alloc(&md->writesets[0], md->nr_blocks);
+	if (r) {
+		DMERR("%s: writeset_alloc failed for writeset 0", __func__);
+		metadata_close(md);
+		return ERR_PTR(r);
+	}
+
+	r = writeset_alloc(&md->writesets[1], md->nr_blocks);
+	if (r) {
+		DMERR("%s: writeset_alloc failed for writeset 1", __func__);
+		metadata_close(md);
+		return ERR_PTR(r);
+	}
+
 	return md;
 }
 
@@ -1504,6 +1523,7 @@ static int era_ctr(struct dm_target *ti, unsigned argc, char **argv)
 		return PTR_ERR(md);
 	}
 	era->md = md;
+	era->nr_blocks = md->nr_blocks;
 
 	era->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);
 	if (!era->wq) {
--
2.30.2
* [dm-devel] [PATCH 2/2] dm era: avoid deadlock when swapping table

From: Nikos Tsironis @ 2023-01-18 12:29 UTC
To: snitzer, agk, dm-devel; +Cc: ejt, ntsironis

Under certain conditions, swapping a table that includes a dm-era
target with a new table causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes when the
target is resumed. So, running 'dmsetup status' or 'dmsetup message' for
a suspended dm-era device blocks until the device is resumed.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock. The only way to recover is by rebooting.

Fix this by allowing metadata operations triggered by user space to run
in the corresponding user space thread, instead of queueing them for
execution by the dm-era worker thread.

This way, even if the device is suspended, the metadata operations will
run and release the SRCU read lock, preventing a subsequent table reload
(dm_swap_table()) from deadlocking.

To make this possible, use a mutex to serialize access to the metadata
between the worker thread and the user space threads.

Fixes: eec40579d8487 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
---
 drivers/md/dm-era-target.c | 58 ++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index 3332bed2f412..c57a19320dbf 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <linux/mutex.h>
 
 #define DM_MSG_PREFIX "era"
 
@@ -1182,6 +1183,12 @@ struct era {
 	dm_block_t nr_blocks;
 	uint32_t sectors_per_block;
 	int sectors_per_block_shift;
+
+	/*
+	 * Serialize access to metadata between worker thread and user space
+	 * threads.
+	 */
+	struct mutex metadata_lock;
 	struct era_metadata *md;
 
 	struct workqueue_struct *wq;
@@ -1358,10 +1365,12 @@ static void do_work(struct work_struct *ws)
 {
 	struct era *era = container_of(ws, struct era, worker);
 
+	mutex_lock(&era->metadata_lock);
 	kick_off_digest(era);
 	process_old_eras(era);
 	process_deferred_bios(era);
 	process_rpc_calls(era);
+	mutex_unlock(&era->metadata_lock);
 }
 
 static void defer_bio(struct era *era, struct bio *bio)
@@ -1400,17 +1409,6 @@ static int in_worker0(struct era *era, int (*fn)(struct era_metadata *))
 	return perform_rpc(era, &rpc);
 }
 
-static int in_worker1(struct era *era,
-		      int (*fn)(struct era_metadata *, void *), void *arg)
-{
-	struct rpc rpc;
-	rpc.fn0 = NULL;
-	rpc.fn1 = fn;
-	rpc.arg = arg;
-
-	return perform_rpc(era, &rpc);
-}
-
 static void start_worker(struct era *era)
 {
 	atomic_set(&era->suspended, 0);
@@ -1439,6 +1437,7 @@ static void era_destroy(struct era *era)
 	if (era->metadata_dev)
 		dm_put_device(era->ti, era->metadata_dev);
 
+	mutex_destroy(&era->metadata_lock);
 	kfree(era);
 }
 
@@ -1539,6 +1538,8 @@ static int era_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	spin_lock_init(&era->rpc_lock);
 	INIT_LIST_HEAD(&era->rpc_calls);
 
+	mutex_init(&era->metadata_lock);
+
 	ti->private = era;
 	ti->num_flush_bios = 1;
 	ti->flush_supported = true;
@@ -1591,7 +1592,9 @@ static void era_postsuspend(struct dm_target *ti)
 
 	stop_worker(era);
 
+	mutex_lock(&era->metadata_lock);
 	r = metadata_commit(era->md);
+	mutex_unlock(&era->metadata_lock);
 	if (r) {
 		DMERR("%s: metadata_commit failed", __func__);
 		/* FIXME: fail mode */
@@ -1605,19 +1608,23 @@ static int era_preresume(struct dm_target *ti)
 	dm_block_t new_size = calc_nr_blocks(era);
 
 	if (era->nr_blocks != new_size) {
+		mutex_lock(&era->metadata_lock);
 		r = metadata_resize(era->md, &new_size);
 		if (r) {
 			DMERR("%s: metadata_resize failed", __func__);
+			mutex_unlock(&era->metadata_lock);
 			return r;
 		}
 
 		r = metadata_commit(era->md);
 		if (r) {
 			DMERR("%s: metadata_commit failed", __func__);
+			mutex_unlock(&era->metadata_lock);
 			return r;
 		}
 
 		era->nr_blocks = new_size;
+		mutex_unlock(&era->metadata_lock);
 	}
 
 	start_worker(era);
@@ -1648,7 +1655,9 @@ static void era_status(struct dm_target *ti, status_type_t type,
 
 	switch (type) {
 	case STATUSTYPE_INFO:
-		r = in_worker1(era, metadata_get_stats, &stats);
+		mutex_lock(&era->metadata_lock);
+		r = metadata_get_stats(era->md, &stats);
+		mutex_unlock(&era->metadata_lock);
 		if (r)
 			goto err;
 
@@ -1682,6 +1691,25 @@ static void era_status(struct dm_target *ti, status_type_t type,
 		DMEMIT("Error");
 }
 
+static int _process_message(struct era *era, int (*fn)(struct era_metadata *))
+{
+	int ret, r;
+
+	mutex_lock(&era->metadata_lock);
+	ret = fn(era->md);
+	r = metadata_commit(era->md);
+	mutex_unlock(&era->metadata_lock);
+
+	/* Handle return value the same way as process_rpc_calls() does */
+	if (r)
+		ret = r;
+
+	/* Wake worker to handle archived writesets */
+	wake_worker(era);
+
+	return ret;
+}
+
 static int era_message(struct dm_target *ti, unsigned argc, char **argv,
 		       char *result, unsigned maxlen)
 {
@@ -1693,13 +1721,13 @@ static int era_message(struct dm_target *ti, unsigned argc, char **argv,
 	}
 
 	if (!strcasecmp(argv[0], "checkpoint"))
-		return in_worker0(era, metadata_checkpoint);
+		return _process_message(era, metadata_checkpoint);
 
 	if (!strcasecmp(argv[0], "take_metadata_snap"))
-		return in_worker0(era, metadata_take_snap);
+		return _process_message(era, metadata_take_snap);
 
 	if (!strcasecmp(argv[0], "drop_metadata_snap"))
-		return in_worker0(era, metadata_drop_snap);
+		return _process_message(era, metadata_drop_snap);
 
 	DMERR("unsupported message '%s'", argv[0]);
 	return -EINVAL;
--
2.30.2
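The crux of the patch above is moving the metadata call out of the worker's RPC queue and into the calling thread, with the new `metadata_lock` keeping the worker and user-space callers from interleaving. As a toy model of that control flow only — the `Era` class, `checkpoint`, and the metadata dict below are illustrative stand-ins, not the driver's real structures:

```python
import threading

class Era:
    """Toy model of the patched flow: metadata operations shared between a
    worker loop and user-space callers, serialized by metadata_lock."""
    def __init__(self):
        self.metadata_lock = threading.Lock()
        self.metadata = {"era": 0, "committed": 0}
        self.suspended = True              # worker is not running

    def process_message(self, fn):
        # Old scheme (simplified): queue fn to the worker and wait for it,
        # which blocks for as long as the device stays suspended.
        # New scheme, like _process_message(): run fn directly in the
        # caller's thread, under the lock shared with the worker.
        with self.metadata_lock:
            ret = fn(self.metadata)
            self.metadata["committed"] += 1   # models metadata_commit()
        # wake_worker() would go here; a no-op while suspended
        return ret

def checkpoint(md):
    md["era"] += 1     # models an era rollover
    return 0

era = Era()
# Even though era.suspended is True, the message completes in the caller's
# thread, so no read-side lock is held indefinitely and a later table swap
# can proceed.
assert era.process_message(checkpoint) == 0
print(era.metadata)
```

Prints `{'era': 1, 'committed': 1}`: the operation and its commit finish synchronously in the caller, which is why a suspended worker no longer wedges status or message IOCTLs.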
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

From: Mike Snitzer @ 2023-01-18 16:28 UTC
To: Nikos Tsironis; +Cc: dm-devel, snitzer, ejt, agk

On Wed, Jan 18 2023 at 7:29P -0500, Nikos Tsironis <ntsironis@arrikto.com> wrote:

> Under certain conditions, swapping a table that includes a dm-era
> target with a new table causes a deadlock.
>
> This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
> in the suspended dm-era target.
>
> dm-era executes all metadata operations in a worker thread, which stops
> processing requests when the target is suspended, and resumes when the
> target is resumed.
>
> So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
> device blocks until the device is resumed.
>
> If we then load a new table to the device, while the aforementioned
> dmsetup command is blocked in dm-era, and resume the device, we
> deadlock.
>
> The problem is that the 'dmsetup status' and 'dmsetup message' commands
> hold a reference to the live table, i.e., they hold an SRCU read lock on
> md->io_barrier, while they are blocked.
>
> When the device is resumed, the old table is replaced with the new one
> by dm_swap_table(), which ends up calling synchronize_srcu() on
> md->io_barrier.
>
> Since the blocked dmsetup command is holding the SRCU read lock, and the
> old table is never resumed, 'dmsetup resume' blocks too, and we have a
> deadlock.
>
> The only way to recover is by rebooting.
>
> Steps to reproduce:
>
> 1. Create a device with a dm-era target
>
>    # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"
>
> 2. Suspend the device
>
>    # dmsetup suspend eradev
>
> 3. Load a new table to the device, e.g., to resize the device. Note that
>    we must load the new table _after_ suspending the device to ensure
>    the constructor of the new target instance reads up-to-date metadata,
>    as committed by the post-suspend hook of dm-era.
>
>    # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"
>
> 4. The device now has LIVE and INACTIVE tables
>
>    # dmsetup info eradev
>    Name:              eradev
>    State:             SUSPENDED
>    Read Ahead:        16384
>    Tables present:    LIVE & INACTIVE
>    Open count:        0
>    Event number:      0
>    Major, minor:      252, 2
>    Number of targets: 1
>
> 5. Retrieve the status of the device. This blocks because the device is
>    suspended. Equivalently, any 'dmsetup message' operation would block
>    too. This command holds the SRCU read lock on md->io_barrier.
>
>    # dmsetup status eradev

I'll have a look at this flow, it seems to me we shouldn't stack up
'dmsetup status' and 'dmsetup message' commands if the table is
suspended.

I think it is unique to dm-era that you don't allow _read_ metadata
operations while a device is suspended. But messages really shouldn't
be sent when the device is suspended. As-is DM is pretty silently
cutthroat about that constraint.

Resulting in deadlock is obviously cutthroat...

> 6. Resume the device. The resume operation tries to swap the old table
>    with the new one and deadlocks, because it synchronizes SRCU for the
>    old table, while the blocked 'dmsetup status' holds the SRCU read
>    lock. And the old table is never resumed again at this point.
>
>    # dmsetup resume eradev
>
> 7. The relevant dmesg logs are:
>
> [snip hung-task traces, quoted in full in the cover letter above]
>
> Fix this by allowing metadata operations triggered by user space to run
> in the corresponding user space thread, instead of queueing them for
> execution by the dm-era worker thread.

Allowing them to run while the device is suspended is _not_ the
correct way to work around this deadlock situation. I think it'd be
useful to understand why your userspace tools are allowing these
messages and status requests to a device that is suspended.

FYI, status shouldn't trigger write IOs if the device is suspended.
Both dm-cache and dm-thin have this in status as a guard around its
work to ensure updated accounting (as a side-effect of committing
metadata), e.g.:

	/* Commit to ensure statistics aren't out-of-date */
	if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti))
		(void) commit(cache, false);

> This way, even if the device is suspended, the metadata operations will
> run and release the SRCU read lock, preventing a subsequent table reload
> (dm_swap_table()) from deadlocking.
>
> To make this possible, use a mutex to serialize access to the metadata
> between the worker thread and the user space threads.
>
> This is a follow-up on
> https://listman.redhat.com/archives/dm-devel/2021-December/msg00044.html.
>
> Nikos Tsironis (2):
>   dm era: allocate in-core writesets when loading metadata
>   dm era: avoid deadlock when swapping table
>
>  drivers/md/dm-era-target.c | 78 ++++++++++++++++++++++++++++++--------
>  1 file changed, 63 insertions(+), 15 deletions(-)
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-18 16:28 ` [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target Mike Snitzer @ 2023-01-19 9:36 ` Nikos Tsironis 2023-01-19 12:58 ` Zdenek Kabelac 2023-01-23 17:34 ` Mike Snitzer 0 siblings, 2 replies; 14+ messages in thread From: Nikos Tsironis @ 2023-01-19 9:36 UTC (permalink / raw) To: Mike Snitzer; +Cc: dm-devel, snitzer, ejt, agk On 1/18/23 18:28, Mike Snitzer wrote: > On Wed, Jan 18 2023 at 7:29P -0500, > Nikos Tsironis <ntsironis@arrikto.com> wrote: > >> Under certain conditions, swapping a table, that includes a dm-era >> target, with a new table, causes a deadlock. >> >> This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked >> in the suspended dm-era target. >> >> dm-era executes all metadata operations in a worker thread, which stops >> processing requests when the target is suspended, and resumes again when >> the target is resumed. >> >> So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era >> device blocks, until the device is resumed. >> >> If we then load a new table to the device, while the aforementioned >> dmsetup command is blocked in dm-era, and resume the device, we >> deadlock. >> >> The problem is that the 'dmsetup status' and 'dmsetup message' commands >> hold a reference to the live table, i.e., they hold an SRCU read lock on >> md->io_barrier, while they are blocked. >> >> When the device is resumed, the old table is replaced with the new one >> by dm_swap_table(), which ends up calling synchronize_srcu() on >> md->io_barrier. >> >> Since the blocked dmsetup command is holding the SRCU read lock, and the >> old table is never resumed, 'dmsetup resume' blocks too, and we have a >> deadlock. >> >> The only way to recover is by rebooting. >> >> Steps to reproduce: >> >> 1. 
Create device with dm-era target >> >> # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" >> >> 2. Suspend the device >> >> # dmsetup suspend eradev >> >> 3. Load new table to device, e.g., to resize the device. Note, that, we >> must load the new table _after_ suspending the device to ensure the >> constructor of the new target instance reads up-to-date metadata, as >> committed by the post-suspend hook of dm-era. >> >> # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" >> >> 4. Device now has LIVE and INACTIVE tables >> >> # dmsetup info eradev >> Name: eradev >> State: SUSPENDED >> Read Ahead: 16384 >> Tables present: LIVE & INACTIVE >> Open count: 0 >> Event number: 0 >> Major, minor: 252, 2 >> Number of targets: 1 >> >> 5. Retrieve the status of the device. This blocks because the device is >> suspended. Equivalently, any 'dmsetup message' operation would block >> too. This command holds the SRCU read lock on md->io_barrier. >> >> # dmsetup status eradev > > I'll have a look at this flow, it seems to me we shouldn't stack up > 'dmsetup status' and 'dmsetup message' commands if the table is > suspended. > > I think it is unique to dm-era that you don't allow to _read_ metadata > operations while a device is suspended. But messages really shouldn't > be sent when the device is suspended. As-is DM is pretty silently > cutthroat about that constraint. > > Resulting in deadlock is obviously cutthroat... > Hi Mike, Thanks for the quick reply. I couldn't find this constraint documented anywhere and since the various DM targets seem to allow message operations while the device is suspended I drew the wrong conclusion. Thanks for clarifying this. >> 6. Resume the device. The resume operation tries to swap the old table >> with the new one and deadlocks, because it synchronizes SRCU for the >> old table, while the blocked 'dmsetup status' holds the SRCU read >> lock. 
And the old table is never resumed again at this point. >> >> # dmsetup resume eradev >> >> 7. The relevant dmesg logs are: >> >> [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152 >> [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds. >> [ 7250.875722] Not tainted 5.16.0-rc2-release+ #16 >> [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> [ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000 >> [ 7250.875809] Call Trace: >> [ 7250.875812] <TASK> >> [ 7250.875816] __schedule+0x330/0x8b0 >> [ 7250.875827] schedule+0x4e/0xc0 >> [ 7250.875831] schedule_timeout+0x20f/0x2e0 >> [ 7250.875836] ? do_set_pte+0xb8/0x120 >> [ 7250.875843] ? prep_new_page+0x91/0xa0 >> [ 7250.875847] wait_for_completion+0x8c/0xf0 >> [ 7250.875854] perform_rpc+0x95/0xb0 [dm_era] >> [ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era] >> [ 7250.875867] ? era_iterate_devices+0x30/0x30 [dm_era] >> [ 7250.875872] ? era_status+0x64/0x1e0 [dm_era] >> [ 7250.875877] era_status+0x64/0x1e0 [dm_era] >> [ 7250.875882] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] >> [ 7250.875900] ? __mod_node_page_state+0x82/0xc0 >> [ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod] >> [ 7250.875921] ? 
dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] >> [ 7250.875932] table_status+0x61/0xa0 [dm_mod] >> [ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod] >> [ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod] >> [ 7250.875966] __x64_sys_ioctl+0x8e/0xd0 >> [ 7250.875970] do_syscall_64+0x3a/0xd0 >> [ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae >> [ 7250.875980] RIP: 0033:0x7f20b7cd4017 >> [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >> [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017 >> [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003 >> [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320 >> [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110 >> [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000 >> [ 7250.876002] </TASK> >> [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds. >> [ 7250.876046] Not tainted 5.16.0-rc2-release+ #16 >> [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> [ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000 >> [ 7250.876134] Call Trace: >> [ 7250.876136] <TASK> >> [ 7250.876138] __schedule+0x330/0x8b0 >> [ 7250.876142] schedule+0x4e/0xc0 >> [ 7250.876145] schedule_timeout+0x20f/0x2e0 >> [ 7250.876149] ? __queue_work+0x226/0x420 >> [ 7250.876156] wait_for_completion+0x8c/0xf0 >> [ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0 >> [ 7250.876167] ? __bpf_trace_rcu_stall_warning+0x10/0x10 >> [ 7250.876173] ? dm_swap_table+0x2f4/0x310 [dm_mod] >> [ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod] >> [ 7250.876198] ? 
table_load+0x360/0x360 [dm_mod] >> [ 7250.876207] dev_suspend+0x95/0x250 [dm_mod] >> [ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod] >> [ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod] >> [ 7250.876240] __x64_sys_ioctl+0x8e/0xd0 >> [ 7250.876244] do_syscall_64+0x3a/0xd0 >> [ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae >> [ 7250.876252] RIP: 0033:0x7f15e9254017 >> [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >> [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017 >> [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003 >> [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0 >> [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110 >> [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001 >> [ 7250.876269] </TASK> >> >> Fix this by allowing metadata operations triggered by user space to run >> in the corresponding user space thread, instead of queueing them for >> execution by the dm-era worker thread. > > Allowing them to run while the device is suspended is _not_ the > correct way to work around this deadlock situation. I think it'd be > useful to understand why your userspace tools are allowing these > messages and status to a device that is suspended. > Ack. The sequence of operations I provided is just a way to easily reproduce the deadlock. The exact issue I am facing is the following: 1. Create device with dm-era target # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" 2. Load new table to device, e.g., to resize the device or snapshot it. # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" 3. Suspend the device # dmsetup suspend eradev 4. Someone else, e.g., a user or a monitoring daemon, runs an LVM command at this point, e.g. 'vgs'. 5. 
'vgs' tries to retrieve the status of the dm-era device using the DM_TABLE_STATUS ioctl, and blocks, because the command is queued for execution by the worker thread, which is suspended while the device is suspended. Note that LVM uses the DM_NOFLUSH_FLAG, but this doesn't make a difference for dm-era, since the "problem" is not that it writes to the metadata, but that it queues the metadata read operation to the worker thread. 6. Resume the device: This deadlocks. # dmsetup resume eradev So, I am not the one retrieving the status of the suspended device. LVM is. LVM, when running commands like 'lvs' and 'vgs', retrieves the status of the devices on the system using the DM_TABLE_STATUS ioctl. The deadlock is a race that happens when someone runs an LVM command at the "wrong" time. Using an appropriate LVM 'global_filter' is a way to prevent this, but: 1. This is just a workaround, not a proper solution. 2. This is not always an option. Imagine someone running an LVM command in a container, for example. Or, we may not be allowed to change the LVM configuration of the host at all. > FYI, status shouldn't trigger write IOs if the device is suspended. > Both dm-cache and dm-thin have this in status as a guard around its > work to ensure updated accounting (as a side-effect of committing > metadata), e.g.: > > /* Commit to ensure statistics aren't out-of-date */ > if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) > (void) commit(cache, false); > Ack. The current behavior of dm-era wrt the 'status' command is: 1. Queue the 'metadata_get_stats()' function for execution by the worker thread. 2. The worker runs the function and retrieves the statistics 3. The worker commits the metadata _after_ retrieving the statistics Shall I just change 'era_status()' to read the metadata directly and skip the metadata commit, instead of queuing the operation to the worker thread, when the device is suspended? 
I think we would still need a lock to synchronize access to the metadata, since, if I am not mistaken, the pre-resume hook that starts the worker runs before clearing the DMF_SUSPENDED flag, and can run concurrently with the 'status' IOCTL. So, the worker might run concurrently with a 'status' operation that sees the device as suspended. Nikos. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-19 9:36 ` Nikos Tsironis @ 2023-01-19 12:58 ` Zdenek Kabelac 2023-01-19 14:08 ` Nikos Tsironis 2023-01-23 17:34 ` Mike Snitzer 1 sibling, 1 reply; 14+ messages in thread From: Zdenek Kabelac @ 2023-01-19 12:58 UTC (permalink / raw) To: Nikos Tsironis, Mike Snitzer; +Cc: dm-devel, snitzer, ejt, agk Dne 19. 01. 23 v 10:36 Nikos Tsironis napsal(a): > On 1/18/23 18:28, Mike Snitzer wrote: >> On Wed, Jan 18 2023 at 7:29P -0500, >> Nikos Tsironis <ntsironis@arrikto.com> wrote: >> >> > > Hi Mike, > > Thanks for the quick reply. > > I couldn't find this constraint documented anywhere and since the > various DM targets seem to allow message operations while the device is > suspended I drew the wrong conclusion. Hi Nikos Some notes from an lvm2 developer - we work with these constraints: A DM target should not need to allocate a bunch of memory while suspended (the target should preallocate a pool of memory if it really needs to do this). A DM target should check and allocate everything it can in the 'load' phase and error out as early as possible (so the loaded inactive table can be cleared/dropped and the 'resume' target can continue its work). An error in the suspend phase is typically the last moment we can handle failure 'somehow'. Handling failure in 'resume' is a can of worms with no good solution - so resume really should do as little as possible and should work with everything already prepared by load & suspend. 'DM status/info' should work while the device is suspended - but no allocation should be needed in this case to produce a result. Sending messages to a suspended target should not be needed - if it is, it should most likely be solved via a 'table reload' change (target design change). Loading to the inactive table slot should not break anything for the active table slot (table clear shall resume at the suspend point). Surely the list could go longer - but these basics are crucial. 
Regards Zdenek -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-19 12:58 ` Zdenek Kabelac @ 2023-01-19 14:08 ` Nikos Tsironis 0 siblings, 0 replies; 14+ messages in thread From: Nikos Tsironis @ 2023-01-19 14:08 UTC (permalink / raw) To: Zdenek Kabelac, Mike Snitzer; +Cc: dm-devel, snitzer, ejt, agk On 1/19/23 14:58, Zdenek Kabelac wrote: > Dne 19. 01. 23 v 10:36 Nikos Tsironis napsal(a): >> On 1/18/23 18:28, Mike Snitzer wrote: >>> On Wed, Jan 18 2023 at 7:29P -0500, >>> Nikos Tsironis <ntsironis@arrikto.com> wrote: >>> >>> >> >> Hi Mike, >> >> Thanks for the quick reply. >> >> I couldn't find this constraint documented anywhere and since the >> various DM targets seem to allow message operations while the device is >> suspended I drew the wrong conclusion. > > Hi Nikos > > > Some notes from an lvm2 developer - we work with these constraints: > > A DM target should not need to allocate a bunch of memory while suspended (the target should preallocate a pool of memory if it really needs to do this). > > A DM target should check and allocate everything it can in the 'load' phase and error out as early as possible (so the loaded inactive table can be cleared/dropped and the 'resume' target can continue its work). > > An error in the suspend phase is typically the last moment we can handle failure 'somehow'. > > Handling failure in 'resume' is a can of worms with no good solution - so resume really should do as little as possible and should work with everything already prepared by load & suspend. > > 'DM status/info' should work while the device is suspended - but no allocation should be needed in this case to produce a result. > > Sending messages to a suspended target should not be needed - if it is, it should most likely be solved via a 'table reload' change (target design change). > > Loading to the inactive table slot should not break anything for the active table slot (table clear shall resume at the suspend point). 
> > Surely the list could go longer - but these basics are crucial. > Hi Zdenek, That's great information! Thanks a lot for sharing it. Regards, Nikos -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-19 9:36 ` Nikos Tsironis 2023-01-19 12:58 ` Zdenek Kabelac @ 2023-01-23 17:34 ` Mike Snitzer 2023-01-25 12:37 ` Nikos Tsironis 2023-01-25 12:45 ` Nikos Tsironis 1 sibling, 2 replies; 14+ messages in thread From: Mike Snitzer @ 2023-01-23 17:34 UTC (permalink / raw) To: Nikos Tsironis; +Cc: dm-devel, snitzer, ejt, agk On Thu, Jan 19 2023 at 4:36P -0500, Nikos Tsironis <ntsironis@arrikto.com> wrote: > On 1/18/23 18:28, Mike Snitzer wrote: > > On Wed, Jan 18 2023 at 7:29P -0500, > > Nikos Tsironis <ntsironis@arrikto.com> wrote: > > > > > Under certain conditions, swapping a table, that includes a dm-era > > > target, with a new table, causes a deadlock. > > > > > > This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked > > > in the suspended dm-era target. > > > > > > dm-era executes all metadata operations in a worker thread, which stops > > > processing requests when the target is suspended, and resumes again when > > > the target is resumed. > > > > > > So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era > > > device blocks, until the device is resumed. > > > > > > If we then load a new table to the device, while the aforementioned > > > dmsetup command is blocked in dm-era, and resume the device, we > > > deadlock. > > > > > > The problem is that the 'dmsetup status' and 'dmsetup message' commands > > > hold a reference to the live table, i.e., they hold an SRCU read lock on > > > md->io_barrier, while they are blocked. > > > > > > When the device is resumed, the old table is replaced with the new one > > > by dm_swap_table(), which ends up calling synchronize_srcu() on > > > md->io_barrier. > > > > > > Since the blocked dmsetup command is holding the SRCU read lock, and the > > > old table is never resumed, 'dmsetup resume' blocks too, and we have a > > > deadlock. > > > > > > The only way to recover is by rebooting. 
> > > > > > Steps to reproduce: > > > > > > 1. Create device with dm-era target > > > > > > # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > 2. Suspend the device > > > > > > # dmsetup suspend eradev > > > > > > 3. Load new table to device, e.g., to resize the device. Note, that, we > > > must load the new table _after_ suspending the device to ensure the > > > constructor of the new target instance reads up-to-date metadata, as > > > committed by the post-suspend hook of dm-era. > > > > > > # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > 4. Device now has LIVE and INACTIVE tables > > > > > > # dmsetup info eradev > > > Name: eradev > > > State: SUSPENDED > > > Read Ahead: 16384 > > > Tables present: LIVE & INACTIVE > > > Open count: 0 > > > Event number: 0 > > > Major, minor: 252, 2 > > > Number of targets: 1 > > > > > > 5. Retrieve the status of the device. This blocks because the device is > > > suspended. Equivalently, any 'dmsetup message' operation would block > > > too. This command holds the SRCU read lock on md->io_barrier. > > > > > > # dmsetup status eradev > > > > I'll have a look at this flow, it seems to me we shouldn't stack up > > 'dmsetup status' and 'dmsetup message' commands if the table is > > suspended. > > > > I think it is unique to dm-era that you don't allow to _read_ metadata > > operations while a device is suspended. But messages really shouldn't > > be sent when the device is suspended. As-is DM is pretty silently > > cutthroat about that constraint. > > > > Resulting in deadlock is obviously cutthroat... > > > > Hi Mike, > > Thanks for the quick reply. > > I couldn't find this constraint documented anywhere and since the > various DM targets seem to allow message operations while the device is > suspended I drew the wrong conclusion. > > Thanks for clarifying this. > > > > 6. Resume the device. 
The resume operation tries to swap the old table > > > with the new one and deadlocks, because it synchronizes SRCU for the > > > old table, while the blocked 'dmsetup status' holds the SRCU read > > > lock. And the old table is never resumed again at this point. > > > > > > # dmsetup resume eradev > > > > > > 7. The relevant dmesg logs are: > > > > > > [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152 > > > [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds. > > > [ 7250.875722] Not tainted 5.16.0-rc2-release+ #16 > > > [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > [ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000 > > > [ 7250.875809] Call Trace: > > > [ 7250.875812] <TASK> > > > [ 7250.875816] __schedule+0x330/0x8b0 > > > [ 7250.875827] schedule+0x4e/0xc0 > > > [ 7250.875831] schedule_timeout+0x20f/0x2e0 > > > [ 7250.875836] ? do_set_pte+0xb8/0x120 > > > [ 7250.875843] ? prep_new_page+0x91/0xa0 > > > [ 7250.875847] wait_for_completion+0x8c/0xf0 > > > [ 7250.875854] perform_rpc+0x95/0xb0 [dm_era] > > > [ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era] > > > [ 7250.875867] ? era_iterate_devices+0x30/0x30 [dm_era] > > > [ 7250.875872] ? era_status+0x64/0x1e0 [dm_era] > > > [ 7250.875877] era_status+0x64/0x1e0 [dm_era] > > > [ 7250.875882] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] > > > [ 7250.875900] ? __mod_node_page_state+0x82/0xc0 > > > [ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod] > > > [ 7250.875921] ? 
dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] > > > [ 7250.875932] table_status+0x61/0xa0 [dm_mod] > > > [ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod] > > > [ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod] > > > [ 7250.875966] __x64_sys_ioctl+0x8e/0xd0 > > > [ 7250.875970] do_syscall_64+0x3a/0xd0 > > > [ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > [ 7250.875980] RIP: 0033:0x7f20b7cd4017 > > > [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > > [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017 > > > [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003 > > > [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320 > > > [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110 > > > [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000 > > > [ 7250.876002] </TASK> > > > [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds. > > > [ 7250.876046] Not tainted 5.16.0-rc2-release+ #16 > > > [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > [ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000 > > > [ 7250.876134] Call Trace: > > > [ 7250.876136] <TASK> > > > [ 7250.876138] __schedule+0x330/0x8b0 > > > [ 7250.876142] schedule+0x4e/0xc0 > > > [ 7250.876145] schedule_timeout+0x20f/0x2e0 > > > [ 7250.876149] ? __queue_work+0x226/0x420 > > > [ 7250.876156] wait_for_completion+0x8c/0xf0 > > > [ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0 > > > [ 7250.876167] ? __bpf_trace_rcu_stall_warning+0x10/0x10 > > > [ 7250.876173] ? dm_swap_table+0x2f4/0x310 [dm_mod] > > > [ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod] > > > [ 7250.876198] ? 
table_load+0x360/0x360 [dm_mod] > > > [ 7250.876207] dev_suspend+0x95/0x250 [dm_mod] > > > [ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod] > > > [ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod] > > > [ 7250.876240] __x64_sys_ioctl+0x8e/0xd0 > > > [ 7250.876244] do_syscall_64+0x3a/0xd0 > > > [ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > [ 7250.876252] RIP: 0033:0x7f15e9254017 > > > [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > > [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017 > > > [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003 > > > [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0 > > > [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110 > > > [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001 > > > [ 7250.876269] </TASK> > > > > > > Fix this by allowing metadata operations triggered by user space to run > > > in the corresponding user space thread, instead of queueing them for > > > execution by the dm-era worker thread. > > > > Allowing them to run while the device is suspended is _not_ the > > correct way to work-around this deadlock situation. I think it'd be > > useful to understand why your userspace is tools are allowing these > > messages and status to a device that is suspended. > > > > Ack. > > The sequence of operations I provided is just a way to easily reproduce > the deadlock. The exact issue I am facing is the following: > > 1. Create device with dm-era target > > # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > 2. Load new table to device, e.g., to resize the device or snapshot it. > > # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > 3. Suspend the device > > # dmsetup suspend eradev > > 4. 
Someone else, e.g., a user or a monitoring daemon, runs an LVM > command at this point, e.g. 'vgs'. > > 5. 'vgs' tries to retrieve the status of the dm-era device using the > DM_TABLE_STATUS ioctl, and blocks, because the command is queued for > execution by the worker thread, which is suspended while the device > is suspended. > > Note, that, LVM uses the DM_NOFLUSH_FLAG, but this doesn't make a > difference for dm-era, since the "problem" is not that it writes to > the metadata, but that it queues the metadata read operation to the > worker thread. > > 6. Resume the device: This deadlocks. > > # dmsetup resume eradev > > So, I am not the one retrieving the status of the suspended device. LVM > is. LVM, when running commands like 'lvs' and 'vgs', retrieves the > status of the devices on the system using the DM_TABLE_STATUS ioctl. > > The deadlock is a race that happens when someone runs an LVM command at > the "wrong" time. I think dm-era shouldn't be disallowing work items while suspended, e.g.: diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c index e92c1afc3677..33ea2c2374c7 100644 --- a/drivers/md/dm-era-target.c +++ b/drivers/md/dm-era-target.c @@ -1175,7 +1175,6 @@ struct era { struct list_head rpc_calls; struct digest digest; - atomic_t suspended; }; struct rpc { @@ -1219,8 +1218,7 @@ static void remap_to_origin(struct era *era, struct bio *bio) *--------------------------------------------------------------*/ static void wake_worker(struct era *era) { - if (!atomic_read(&era->suspended)) - queue_work(era->wq, &era->worker); + queue_work(era->wq, &era->worker); } static void process_old_eras(struct era *era) @@ -1392,17 +1390,6 @@ static int in_worker1(struct era *era, return perform_rpc(era, &rpc); } -static void start_worker(struct era *era) -{ - atomic_set(&era->suspended, 0); -} - -static void stop_worker(struct era *era) -{ - atomic_set(&era->suspended, 1); - drain_workqueue(era->wq); -} - 
/*---------------------------------------------------------------- * Target methods *--------------------------------------------------------------*/ @@ -1569,7 +1556,7 @@ static void era_postsuspend(struct dm_target *ti) /* FIXME: fail mode */ } - stop_worker(era); + drain_workqueue(era->wq); r = metadata_commit(era->md); if (r) { @@ -1600,8 +1587,6 @@ static int era_preresume(struct dm_target *ti) era->nr_blocks = new_size; } - start_worker(era); - r = in_worker0(era, metadata_era_rollover); if (r) { DMERR("%s: metadata_era_rollover failed", __func__); During suspend it should certainly flush all work; but I fail to see the value in disallowing the target's main way to do work (even while suspended). Maybe Joe has insight on why dm-era was written this way. But as an example dm-cache and dm-thin don't have such a strong restriction. > Using an appropriate LVM 'global_filter' is a way to prevent this, but: > > 1. This is just a workaround, not a proper solution. > 2. This is not always an option. Imagine someone running an LVM command > in a container, for example. Or, we may not be allowed to change the > LVM configuration of the host at all. > > > FYI, status shouldn't trigger write IOs if the device is suspended. > > Both dm-cache and dm-thin have this in status as a guard around its > > work to ensure updated accounting (as a side-effect of committing > > metadata), e.g.: > > > > /* Commit to ensure statistics aren't out-of-date */ > > if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) > > (void) commit(cache, false); > > Ack. The current behavior of dm-era wrt 'status' command is: > > 1. Queue the 'metadata_get_stats()' function for execution by the worker > thread. > 2. The worker runs the function and retrieves the statistics > 3. 
The worker commits the metadata _after_ retrieving the statistics > > Shall I just change 'era_status()' to read the metadata directly and > skip the metadata commit, instead of queuing the operation to the worker > thread, when the device is suspended? dm-era should still respect that the device is suspended (disallow metadata commits, etc). Probably should just do like the dm-cache code I quoted above (dm-thin does the same): do not commit if suspended. Seems to me that if simply retrieving stats in era_status() results in a metadata commit as a side-effect of its work that is pretty bogus -- process_rpc_calls() must be what does it just due to generic code? If so, yes absolutely avoid that metadata commit (might we just update process_rpc_calls() to check dm_suspended()?) Sorry for such fundamental concerns, I really should've caught this stuff back when dm-era was introduced! Really does feel like a read-write lock for metadata might be warranted. > I think we would still need a lock to synchronize access to the > metadata, since, if I am not mistaken, the pre-resume hook that starts > the worker runs before clearing the DMF_SUSPENDED flag, and can run > concurrently with the 'status' IOCTL. So, the worker might run > concurrently with a 'status' operation that sees the device as > suspended. I was initially thinking removing dm-era's suspended state (like above patch) should take care of the pitfalls of disallowing work due to that state. But the dm-era code should be audited to make sure the big hammer of disallowing work while suspended wasn't important for its implementation (and the need to disallow device changes while suspended). Mike -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-23 17:34 ` Mike Snitzer @ 2023-01-25 12:37 ` Nikos Tsironis 2023-01-26 0:06 ` Mike Snitzer 2023-01-25 12:45 ` Nikos Tsironis 1 sibling, 1 reply; 14+ messages in thread From: Nikos Tsironis @ 2023-01-25 12:37 UTC (permalink / raw) To: Mike Snitzer; +Cc: dm-devel, snitzer, ejt, agk On 1/23/23 19:34, Mike Snitzer wrote: > On Thu, Jan 19 2023 at 4:36P -0500, > Nikos Tsironis <ntsironis@arrikto.com> wrote: > >> On 1/18/23 18:28, Mike Snitzer wrote: >>> On Wed, Jan 18 2023 at 7:29P -0500, >>> Nikos Tsironis <ntsironis@arrikto.com> wrote: >>> >>>> Under certain conditions, swapping a table, that includes a dm-era >>>> target, with a new table, causes a deadlock. >>>> >>>> This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked >>>> in the suspended dm-era target. >>>> >>>> dm-era executes all metadata operations in a worker thread, which stops >>>> processing requests when the target is suspended, and resumes again when >>>> the target is resumed. >>>> >>>> So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era >>>> device blocks, until the device is resumed. >>>> >>>> If we then load a new table to the device, while the aforementioned >>>> dmsetup command is blocked in dm-era, and resume the device, we >>>> deadlock. >>>> >>>> The problem is that the 'dmsetup status' and 'dmsetup message' commands >>>> hold a reference to the live table, i.e., they hold an SRCU read lock on >>>> md->io_barrier, while they are blocked. >>>> >>>> When the device is resumed, the old table is replaced with the new one >>>> by dm_swap_table(), which ends up calling synchronize_srcu() on >>>> md->io_barrier. >>>> >>>> Since the blocked dmsetup command is holding the SRCU read lock, and the >>>> old table is never resumed, 'dmsetup resume' blocks too, and we have a >>>> deadlock. >>>> >>>> The only way to recover is by rebooting. 
>>>> >>>> Steps to reproduce: >>>> >>>> 1. Create device with dm-era target >>>> >>>> # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>> >>>> 2. Suspend the device >>>> >>>> # dmsetup suspend eradev >>>> >>>> 3. Load new table to device, e.g., to resize the device. Note, that, we >>>> must load the new table _after_ suspending the device to ensure the >>>> constructor of the new target instance reads up-to-date metadata, as >>>> committed by the post-suspend hook of dm-era. >>>> >>>> # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>> >>>> 4. Device now has LIVE and INACTIVE tables >>>> >>>> # dmsetup info eradev >>>> Name: eradev >>>> State: SUSPENDED >>>> Read Ahead: 16384 >>>> Tables present: LIVE & INACTIVE >>>> Open count: 0 >>>> Event number: 0 >>>> Major, minor: 252, 2 >>>> Number of targets: 1 >>>> >>>> 5. Retrieve the status of the device. This blocks because the device is >>>> suspended. Equivalently, any 'dmsetup message' operation would block >>>> too. This command holds the SRCU read lock on md->io_barrier. >>>> >>>> # dmsetup status eradev >>> >>> I'll have a look at this flow, it seems to me we shouldn't stack up >>> 'dmsetup status' and 'dmsetup message' commands if the table is >>> suspended. >>> >>> I think it is unique to dm-era that you don't allow to _read_ metadata >>> operations while a device is suspended. But messages really shouldn't >>> be sent when the device is suspended. As-is DM is pretty silently >>> cutthroat about that constraint. >>> >>> Resulting in deadlock is obviously cutthroat... >>> >> >> Hi Mike, >> >> Thanks for the quick reply. >> >> I couldn't find this constraint documented anywhere and since the >> various DM targets seem to allow message operations while the device is >> suspended I drew the wrong conclusion. >> >> Thanks for clarifying this. >> >>>> 6. Resume the device. 
The resume operation tries to swap the old table >>>> with the new one and deadlocks, because it synchronizes SRCU for the >>>> old table, while the blocked 'dmsetup status' holds the SRCU read >>>> lock. And the old table is never resumed again at this point. >>>> >>>> # dmsetup resume eradev >>>> >>>> 7. The relevant dmesg logs are: >>>> >>>> [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152 >>>> [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds. >>>> [ 7250.875722] Not tainted 5.16.0-rc2-release+ #16 >>>> [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>> [ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000 >>>> [ 7250.875809] Call Trace: >>>> [ 7250.875812] <TASK> >>>> [ 7250.875816] __schedule+0x330/0x8b0 >>>> [ 7250.875827] schedule+0x4e/0xc0 >>>> [ 7250.875831] schedule_timeout+0x20f/0x2e0 >>>> [ 7250.875836] ? do_set_pte+0xb8/0x120 >>>> [ 7250.875843] ? prep_new_page+0x91/0xa0 >>>> [ 7250.875847] wait_for_completion+0x8c/0xf0 >>>> [ 7250.875854] perform_rpc+0x95/0xb0 [dm_era] >>>> [ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era] >>>> [ 7250.875867] ? era_iterate_devices+0x30/0x30 [dm_era] >>>> [ 7250.875872] ? era_status+0x64/0x1e0 [dm_era] >>>> [ 7250.875877] era_status+0x64/0x1e0 [dm_era] >>>> [ 7250.875882] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] >>>> [ 7250.875900] ? __mod_node_page_state+0x82/0xc0 >>>> [ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod] >>>> [ 7250.875921] ? 
dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] >>>> [ 7250.875932] table_status+0x61/0xa0 [dm_mod] >>>> [ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod] >>>> [ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod] >>>> [ 7250.875966] __x64_sys_ioctl+0x8e/0xd0 >>>> [ 7250.875970] do_syscall_64+0x3a/0xd0 >>>> [ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae >>>> [ 7250.875980] RIP: 0033:0x7f20b7cd4017 >>>> [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >>>> [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017 >>>> [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003 >>>> [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320 >>>> [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110 >>>> [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000 >>>> [ 7250.876002] </TASK> >>>> [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds. >>>> [ 7250.876046] Not tainted 5.16.0-rc2-release+ #16 >>>> [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>> [ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000 >>>> [ 7250.876134] Call Trace: >>>> [ 7250.876136] <TASK> >>>> [ 7250.876138] __schedule+0x330/0x8b0 >>>> [ 7250.876142] schedule+0x4e/0xc0 >>>> [ 7250.876145] schedule_timeout+0x20f/0x2e0 >>>> [ 7250.876149] ? __queue_work+0x226/0x420 >>>> [ 7250.876156] wait_for_completion+0x8c/0xf0 >>>> [ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0 >>>> [ 7250.876167] ? __bpf_trace_rcu_stall_warning+0x10/0x10 >>>> [ 7250.876173] ? dm_swap_table+0x2f4/0x310 [dm_mod] >>>> [ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod] >>>> [ 7250.876198] ? 
table_load+0x360/0x360 [dm_mod] >>>> [ 7250.876207] dev_suspend+0x95/0x250 [dm_mod] >>>> [ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod] >>>> [ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod] >>>> [ 7250.876240] __x64_sys_ioctl+0x8e/0xd0 >>>> [ 7250.876244] do_syscall_64+0x3a/0xd0 >>>> [ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae >>>> [ 7250.876252] RIP: 0033:0x7f15e9254017 >>>> [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >>>> [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017 >>>> [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003 >>>> [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0 >>>> [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110 >>>> [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001 >>>> [ 7250.876269] </TASK> >>>> >>>> Fix this by allowing metadata operations triggered by user space to run >>>> in the corresponding user space thread, instead of queueing them for >>>> execution by the dm-era worker thread. >>> >>> Allowing them to run while the device is suspended is _not_ the >>> correct way to work-around this deadlock situation. I think it'd be >>> useful to understand why your userspace is tools are allowing these >>> messages and status to a device that is suspended. >>> >> >> Ack. >> >> The sequence of operations I provided is just a way to easily reproduce >> the deadlock. The exact issue I am facing is the following: >> >> 1. Create device with dm-era target >> >> # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" >> >> 2. Load new table to device, e.g., to resize the device or snapshot it. >> >> # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" >> >> 3. Suspend the device >> >> # dmsetup suspend eradev >> >> 4. 
Someone else, e.g., a user or a monitoring daemon, runs an LVM >> command at this point, e.g. 'vgs'. >> >> 5. 'vgs' tries to retrieve the status of the dm-era device using the >> DM_TABLE_STATUS ioctl, and blocks, because the command is queued for >> execution by the worker thread, which is suspended while the device >> is suspended. >> >> Note, that, LVM uses the DM_NOFLUSH_FLAG, but this doesn't make a >> difference for dm-era, since the "problem" is not that it writes to >> the metadata, but that it queues the metadata read operation to the >> worker thread. >> >> 6. Resume the device: This deadlocks. >> >> # dmsetup resume eradev >> >> So, I am not the one retrieving the status of the suspended device. LVM >> is. LVM, when running commands like 'lvs' and 'vgs', retrieves the >> status of the devices on the system using the DM_TABLE_STATUS ioctl. >> >> The deadlock is a race that happens when someone runs an LVM command at >> the "wrong" time. > > I think dm-era shouldn't be disallowing work items while suspended, > e.g.: > > diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c > index e92c1afc3677..33ea2c2374c7 100644 > --- a/drivers/md/dm-era-target.c > +++ b/drivers/md/dm-era-target.c > @@ -1175,7 +1175,6 @@ struct era { > struct list_head rpc_calls; > > struct digest digest; > - atomic_t suspended; > }; > > struct rpc { > @@ -1219,8 +1218,7 @@ static void remap_to_origin(struct era *era, struct bio *bio) > *--------------------------------------------------------------*/ > static void wake_worker(struct era *era) > { > - if (!atomic_read(&era->suspended)) > - queue_work(era->wq, &era->worker); > + queue_work(era->wq, &era->worker); > } > > static void process_old_eras(struct era *era) > @@ -1392,17 +1390,6 @@ static int in_worker1(struct era *era, > return perform_rpc(era, &rpc); > } > > -static void start_worker(struct era *era) > -{ > - atomic_set(&era->suspended, 0); > -} > - > -static void stop_worker(struct era *era) > -{ > - 
atomic_set(&era->suspended, 1); > - drain_workqueue(era->wq); > -} > - > /*---------------------------------------------------------------- > * Target methods > *--------------------------------------------------------------*/ > @@ -1569,7 +1556,7 @@ static void era_postsuspend(struct dm_target *ti) > /* FIXME: fail mode */ > } > > - stop_worker(era); > + drain_workqueue(era->wq); > > r = metadata_commit(era->md); > if (r) { > @@ -1600,8 +1587,6 @@ static int era_preresume(struct dm_target *ti) > era->nr_blocks = new_size; > } > > - start_worker(era); > - > r = in_worker0(era, metadata_era_rollover); > if (r) { > DMERR("%s: metadata_era_rollover failed", __func__); > > During suspend it should certainly flush all works; but I fail to see > the value in disallowing the target's main way to do work (even while > suspended). Maybe Joe has insight on why dm-era was written this way. > > But as an example dm-cache and dm-thin don't have such a strong > restriction. > The worker thread does the following: 1. Process old eras, i.e., digest the archived writesets into the main era array. This doesn't involve a metadata commit, but it writes to the metadata. 2. Process deferred bios. This might trigger a metadata commit in general, but when the device is suspended no bios should be reaching the target, so it should be a no-op. 3. Process RPC calls. This involves 'status' and 'message' operations. process_rpc_calls() does commit the metadata, after running all RPC calls. Checking if the device is suspended here is a way to prevent the metadata commit. But, some of the 'message' operations also commit the metadata, e.g., the metadata_take_snap message. I understand that no one should be sending messages to a suspended device, but it's not enforced so it could happen. I believe the value of suspending the worker when the device is suspended is to prevent anyone from writing to the metadata. Writing to the metadata, while the device is suspended, could cause problems. 
More on that in the following comments. >> Using an appropriate LVM 'global_filter' is a way to prevent this, but: >> >> 1. This is just a workaround, not a proper solution. >> 2. This is not always an option. Imagine someone running an LVM command >> in a container, for example. Or, we may not be allowed to change the >> LVM configuration of the host at all. >> >>> FYI, status shouldn't trigger write IOs if the device is suspended. >>> Both dm-cache and dm-thin have this in status as a guard around its >>> work to ensure updated accounting (as a side-effect of committing >>> metadata), e.g.: >>> >>> /* Commit to ensure statistics aren't out-of-date */ >>> if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) >>> (void) commit(cache, false); >> >> Ack. The current behavior of dm-era wrt 'status' command is: >> >> 1. Queue the 'metadata_get_stats()' function for execution by the worker >> thread. >> 2. The worker runs the function and retrieves the statistics >> 3. The worker commits the metadata _after_ retrieving the statistics >> >> Shall I just change 'era_status()' to read the metadata directly and >> skip the metadata commit, instead of queuing the operation to the worker >> thread, when the device is suspended? > > dm-era should still respect that the device is suspended (disallow > metadata commits, etc). > > Probably should just do like the dm-cache code I quoted above (dm-thin > does the same): do not commit if suspended. > > Seems to me that if simply retrieving stats in era_status() results in > metadata commit as a side-effect of its work that is pretty bogus -- > process_rpc_calls() must be what does it just due to generic code? > > If so, yes absolutely avoid that metadata commit (might we just update > process_rpc_calls() to check dm_suspended()?) > Yes, the metadata commit is due to the generic code in process_rpc_calls(), and could be avoided if we check the device suspended state here. 
But, avoiding the metadata commit when the device is suspended is not enough. No one should be writing to the metadata at all. Else, we risk hitting the bug fixed by commit 9ae6e8b1c9bbf6 ("dm era: commit metadata in postsuspend after worker stops"). In short, writing to the metadata of a suspended dm-era device results in the committed metadata getting out-of-sync with the in-core metadata. This could result in the corruption of the on-disk metadata if we load a new dm-era table to the device, and destroy the old one (the destructor flushes any cached writes to the disk). For an exact sequence of events that can lead to this issue, please see the commit mentioned above (9ae6e8b1c9bbf6). > Sorry for such fundamental concerns, I really should've caught this > stuff back when dm-era was introduced! > > Really does feel like a read-write lock for metadata might be > warranted. > >> I think we would still need a lock to synchronize access to the >> metadata, since, if I am not mistaken, the pre-resume hook that starts >> the worker runs before clearing the DMF_SUSPENDED flag, and can run >> concurrently with the 'status' IOCTL. So, the worker might run >> concurrently with a 'status' operation that sees the device as >> suspended. > > I was initially thinking removing dm-era's suspended state (like above > patch) should take care of the pitfalls of disallowing works due to > that state. > > But the dm-era code should be audited to make sure the big hammer of > disallowing works while suspended wasn't important for its > implementation (and the need to disallow device changes while > suspended). > I think a simple solution, without modifying dm-era too much, is: 1. Have status run in the user space thread, instead of queueing it to the worker. 2. Add a mutex to avoid status running concurrently with the worker thread. 3. Have status check if the device is suspended. If it is not, commit the metadata before retrieving the status. If it is, skip the metadata commit. 
This is in line with what the thin and cache targets do. Doing this avoids the deadlock in case someone, e.g., an LVM command, runs a status operation against a suspended dm-era device. User space could trigger the deadlock if it sends a message to a suspended dm-era device, but, since this should never happen, it shouldn't affect existing workflows. Still, it might be a good idea to enforce this restriction, i.e., disallow running message operations on suspended devices, but, I guess this requires more discussion. Nikos. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-25 12:37 ` Nikos Tsironis @ 2023-01-26 0:06 ` Mike Snitzer 2023-01-31 11:01 ` Nikos Tsironis 0 siblings, 1 reply; 14+ messages in thread From: Mike Snitzer @ 2023-01-26 0:06 UTC (permalink / raw) To: Nikos Tsironis; +Cc: dm-devel, snitzer, ejt, agk On Wed, Jan 25 2023 at 7:37P -0500, Nikos Tsironis <ntsironis@arrikto.com> wrote: > On 1/23/23 19:34, Mike Snitzer wrote: > > On Thu, Jan 19 2023 at 4:36P -0500, > > Nikos Tsironis <ntsironis@arrikto.com> wrote: > > > > > On 1/18/23 18:28, Mike Snitzer wrote: > > > > On Wed, Jan 18 2023 at 7:29P -0500, > > > > Nikos Tsironis <ntsironis@arrikto.com> wrote: > > > > > > > > > Under certain conditions, swapping a table, that includes a dm-era > > > > > target, with a new table, causes a deadlock. > > > > > > > > > > This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked > > > > > in the suspended dm-era target. > > > > > > > > > > dm-era executes all metadata operations in a worker thread, which stops > > > > > processing requests when the target is suspended, and resumes again when > > > > > the target is resumed. > > > > > > > > > > So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era > > > > > device blocks, until the device is resumed. > > > > > > > > > > If we then load a new table to the device, while the aforementioned > > > > > dmsetup command is blocked in dm-era, and resume the device, we > > > > > deadlock. > > > > > > > > > > The problem is that the 'dmsetup status' and 'dmsetup message' commands > > > > > hold a reference to the live table, i.e., they hold an SRCU read lock on > > > > > md->io_barrier, while they are blocked. > > > > > > > > > > When the device is resumed, the old table is replaced with the new one > > > > > by dm_swap_table(), which ends up calling synchronize_srcu() on > > > > > md->io_barrier. 
> > > > > > > > > > Since the blocked dmsetup command is holding the SRCU read lock, and the > > > > > old table is never resumed, 'dmsetup resume' blocks too, and we have a > > > > > deadlock. > > > > > > > > > > The only way to recover is by rebooting. > > > > > > > > > > Steps to reproduce: > > > > > > > > > > 1. Create device with dm-era target > > > > > > > > > > # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > > > > > 2. Suspend the device > > > > > > > > > > # dmsetup suspend eradev > > > > > > > > > > 3. Load new table to device, e.g., to resize the device. Note, that, we > > > > > must load the new table _after_ suspending the device to ensure the > > > > > constructor of the new target instance reads up-to-date metadata, as > > > > > committed by the post-suspend hook of dm-era. > > > > > > > > > > # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > > > > > 4. Device now has LIVE and INACTIVE tables > > > > > > > > > > # dmsetup info eradev > > > > > Name: eradev > > > > > State: SUSPENDED > > > > > Read Ahead: 16384 > > > > > Tables present: LIVE & INACTIVE > > > > > Open count: 0 > > > > > Event number: 0 > > > > > Major, minor: 252, 2 > > > > > Number of targets: 1 > > > > > > > > > > 5. Retrieve the status of the device. This blocks because the device is > > > > > suspended. Equivalently, any 'dmsetup message' operation would block > > > > > too. This command holds the SRCU read lock on md->io_barrier. > > > > > > > > > > # dmsetup status eradev > > > > > > > > I'll have a look at this flow, it seems to me we shouldn't stack up > > > > 'dmsetup status' and 'dmsetup message' commands if the table is > > > > suspended. > > > > > > > > I think it is unique to dm-era that you don't allow to _read_ metadata > > > > operations while a device is suspended. But messages really shouldn't > > > > be sent when the device is suspended. 
As-is DM is pretty silently > > > > cutthroat about that constraint. > > > > > > > > Resulting in deadlock is obviously cutthroat... > > > > > > > > > > Hi Mike, > > > > > > Thanks for the quick reply. > > > > > > I couldn't find this constraint documented anywhere and since the > > > various DM targets seem to allow message operations while the device is > > > suspended I drew the wrong conclusion. > > > > > > Thanks for clarifying this. > > > > > > > > 6. Resume the device. The resume operation tries to swap the old table > > > > > with the new one and deadlocks, because it synchronizes SRCU for the > > > > > old table, while the blocked 'dmsetup status' holds the SRCU read > > > > > lock. And the old table is never resumed again at this point. > > > > > > > > > > # dmsetup resume eradev > > > > > > > > > > 7. The relevant dmesg logs are: > > > > > > > > > > [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152 > > > > > [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds. > > > > > [ 7250.875722] Not tainted 5.16.0-rc2-release+ #16 > > > > > [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > > > [ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000 > > > > > [ 7250.875809] Call Trace: > > > > > [ 7250.875812] <TASK> > > > > > [ 7250.875816] __schedule+0x330/0x8b0 > > > > > [ 7250.875827] schedule+0x4e/0xc0 > > > > > [ 7250.875831] schedule_timeout+0x20f/0x2e0 > > > > > [ 7250.875836] ? do_set_pte+0xb8/0x120 > > > > > [ 7250.875843] ? prep_new_page+0x91/0xa0 > > > > > [ 7250.875847] wait_for_completion+0x8c/0xf0 > > > > > [ 7250.875854] perform_rpc+0x95/0xb0 [dm_era] > > > > > [ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era] > > > > > [ 7250.875867] ? era_iterate_devices+0x30/0x30 [dm_era] > > > > > [ 7250.875872] ? era_status+0x64/0x1e0 [dm_era] > > > > > [ 7250.875877] era_status+0x64/0x1e0 [dm_era] > > > > > [ 7250.875882] ? 
dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] > > > > > [ 7250.875900] ? __mod_node_page_state+0x82/0xc0 > > > > > [ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod] > > > > > [ 7250.875921] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] > > > > > [ 7250.875932] table_status+0x61/0xa0 [dm_mod] > > > > > [ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod] > > > > > [ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod] > > > > > [ 7250.875966] __x64_sys_ioctl+0x8e/0xd0 > > > > > [ 7250.875970] do_syscall_64+0x3a/0xd0 > > > > > [ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > > [ 7250.875980] RIP: 0033:0x7f20b7cd4017 > > > > > [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > > > > [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017 > > > > > [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003 > > > > > [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320 > > > > > [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110 > > > > > [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000 > > > > > [ 7250.876002] </TASK> > > > > > [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds. > > > > > [ 7250.876046] Not tainted 5.16.0-rc2-release+ #16 > > > > > [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > > > [ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000 > > > > > [ 7250.876134] Call Trace: > > > > > [ 7250.876136] <TASK> > > > > > [ 7250.876138] __schedule+0x330/0x8b0 > > > > > [ 7250.876142] schedule+0x4e/0xc0 > > > > > [ 7250.876145] schedule_timeout+0x20f/0x2e0 > > > > > [ 7250.876149] ? __queue_work+0x226/0x420 > > > > > [ 7250.876156] wait_for_completion+0x8c/0xf0 > > > > > [ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0 > > > > > [ 7250.876167] ? 
__bpf_trace_rcu_stall_warning+0x10/0x10 > > > > > [ 7250.876173] ? dm_swap_table+0x2f4/0x310 [dm_mod] > > > > > [ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod] > > > > > [ 7250.876198] ? table_load+0x360/0x360 [dm_mod] > > > > > [ 7250.876207] dev_suspend+0x95/0x250 [dm_mod] > > > > > [ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod] > > > > > [ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod] > > > > > [ 7250.876240] __x64_sys_ioctl+0x8e/0xd0 > > > > > [ 7250.876244] do_syscall_64+0x3a/0xd0 > > > > > [ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > > [ 7250.876252] RIP: 0033:0x7f15e9254017 > > > > > [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > > > > [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017 > > > > > [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003 > > > > > [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0 > > > > > [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110 > > > > > [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001 > > > > > [ 7250.876269] </TASK> > > > > > > > > > > Fix this by allowing metadata operations triggered by user space to run > > > > > in the corresponding user space thread, instead of queueing them for > > > > > execution by the dm-era worker thread. > > > > > > > > Allowing them to run while the device is suspended is _not_ the > > > > correct way to work-around this deadlock situation. I think it'd be > > > > useful to understand why your userspace is tools are allowing these > > > > messages and status to a device that is suspended. > > > > > > > > > > Ack. > > > > > > The sequence of operations I provided is just a way to easily reproduce > > > the deadlock. The exact issue I am facing is the following: > > > > > > 1. 
Create device with dm-era target > > > > > > # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > 2. Load new table to device, e.g., to resize the device or snapshot it. > > > > > > # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > 3. Suspend the device > > > > > > # dmsetup suspend eradev > > > > > > 4. Someone else, e.g., a user or a monitoring daemon, runs an LVM > > > command at this point, e.g. 'vgs'. > > > > > > 5. 'vgs' tries to retrieve the status of the dm-era device using the > > > DM_TABLE_STATUS ioctl, and blocks, because the command is queued for > > > execution by the worker thread, which is suspended while the device > > > is suspended. > > > > > > Note, that, LVM uses the DM_NOFLUSH_FLAG, but this doesn't make a > > > difference for dm-era, since the "problem" is not that it writes to > > > the metadata, but that it queues the metadata read operation to the > > > worker thread. > > > > > > 6. Resume the device: This deadlocks. > > > > > > # dmsetup resume eradev > > > > > > So, I am not the one retrieving the status of the suspended device. LVM > > > is. LVM, when running commands like 'lvs' and 'vgs', retrieves the > > > status of the devices on the system using the DM_TABLE_STATUS ioctl. > > > > > > The deadlock is a race that happens when someone runs an LVM command at > > > the "wrong" time. 
> > > > I think dm-era shouldn't be disallowing work items while suspended, > > e.g.: > > > > diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c > > index e92c1afc3677..33ea2c2374c7 100644 > > --- a/drivers/md/dm-era-target.c > > +++ b/drivers/md/dm-era-target.c > > @@ -1175,7 +1175,6 @@ struct era { > > struct list_head rpc_calls; > > struct digest digest; > > - atomic_t suspended; > > }; > > struct rpc { > > @@ -1219,8 +1218,7 @@ static void remap_to_origin(struct era *era, struct bio *bio) > > *--------------------------------------------------------------*/ > > static void wake_worker(struct era *era) > > { > > - if (!atomic_read(&era->suspended)) > > - queue_work(era->wq, &era->worker); > > + queue_work(era->wq, &era->worker); > > } > > static void process_old_eras(struct era *era) > > @@ -1392,17 +1390,6 @@ static int in_worker1(struct era *era, > > return perform_rpc(era, &rpc); > > } > > -static void start_worker(struct era *era) > > -{ > > - atomic_set(&era->suspended, 0); > > -} > > - > > -static void stop_worker(struct era *era) > > -{ > > - atomic_set(&era->suspended, 1); > > - drain_workqueue(era->wq); > > -} > > - > > /*---------------------------------------------------------------- > > * Target methods > > *--------------------------------------------------------------*/ > > @@ -1569,7 +1556,7 @@ static void era_postsuspend(struct dm_target *ti) > > /* FIXME: fail mode */ > > } > > - stop_worker(era); > > + drain_workqueue(era->wq); > > r = metadata_commit(era->md); > > if (r) { > > @@ -1600,8 +1587,6 @@ static int era_preresume(struct dm_target *ti) > > era->nr_blocks = new_size; > > } > > - start_worker(era); > > - > > r = in_worker0(era, metadata_era_rollover); > > if (r) { > > DMERR("%s: metadata_era_rollover failed", __func__); > > > > During suspend it should certainly flush all works; but I fail to see > > the value in disallowing the targets main way to do work (even while > > suspended). 
Maybe Joe has insight on why dm-era was written this way. > > > > But as an example dm-cache and dm-thin don't have such a strong > > restriction. > > > > The worker thread does the following: > 1. Process old eras, i.e., digest the archived writesets in to the main > era array. > > This doesn't involve a metadata commit, but writes to the metadata > > 2. Process deferred bios. This might trigger a metadata commit in > general, but when the device is suspended no bios should be reaching > the target, so it should be a no op. > > 3. Process RPC calls. This involves 'status' and 'message' operations. > process_rpc_calls() does commit the metadata, after running all RPC > calls. Checking if the device is suspended here is a way to prevent > the metadata commit. > > But, some of the 'message' operations also commit the metadata, e.g., > the metadata_take_snap message. I understand that no one should be > sending messages to a suspended device, but it's not enforced so it > could happen. > > I believe the value of suspending the worker when the device is > suspended is to prevent anyone from writing to the metadata. > > Writing to the metadata, while the device is suspended, could cause > problems. More on that in the following comments. > > > > Using an appropriate LVM 'global_filter' is a way to prevent this, but: > > > > > > 1. This is just a workaround, not a proper solution. > > > 2. This is not always an option. Imagine someone running an LVM command > > > in a container, for example. Or, we may not be allowed to change the > > > LVM configuration of the host at all. > > > > > > > FYI, status shouldn't trigger write IOs if the device is suspended. 
> > > > Both dm-cache and dm-thin have this in status as a guard around its > > > > work to ensure updated accounting (as a side-effect of committing > > > > metadata), e.g.: > > > > > > > > /* Commit to ensure statistics aren't out-of-date */ > > > > if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) > > > > (void) commit(cache, false); > > > > > > Ack. The current behavior of dm-era wrt 'status' command is: > > > > > > 1. Queue the 'metadata_get_stats()' function for execution by the worker > > > thread. > > > 2. The worker runs the function and retrieves the statistics > > > 3. The worker commits the metadata _after_ retrieving the statistics > > > > > > Shall I just change 'era_status()' to read the metadata directly and > > > skip the metadata commit, instead of queuing the operation to the worker > > > thread, when the device is suspended? > > > > dm-era should still respect that the device is suspended (disallow > > metadata commits, etc). > > > > Probably should just do like the dm-cache code I quoted above (dm-thin > > does the same): do not commit if suspended. > > > > Seems to me that if simply retrieving stats in era_status() results in > > metadata commit as a side-effect of its work that is pretty bogus -- > > process_rpc_calls() must be what does it just due to generic code? > > > > If so, yes absolutely avoid that metadata commit (might we just update > > process_rpc_calls() to check dm_suspended()?) > > > > Yes, the metadata commit is due to the generic code in > process_rpc_calls(), and could be avoided if we check the device > suspended state here. > > But, avoiding the metadata commit when the device is suspended is not > enough. No one should be writing to the metadata at all. > > Else, we risk hitting the bug fixed by commit 9ae6e8b1c9bbf6 ("dm era: > commit metadata in postsuspend after worker stops"). 
> > In short, writing to the metadata of a suspended dm-era device results > in the committed metadata getting out-of-sync with the in-core metadata. > > This could result in the corruption of the on-disk metadata if we load a > new dm-era table to the device, and destroy the old one (the destructor > flushes any cached writes to the disk). > > For an exact sequence of events that can lead to this issue, please see > the commit mentioned above (9ae6e8b1c9bbf6). > > > Sorry for such fundamental concerns, I really should've caught this > > stuff back when dm-era was introduced! > > > > Really does feel like a read-write lock for metadata might be > > warranted. > > > > > I think we would still need a lock to synchronize access to the > > > metadata, since, if I am not mistaken, the pre-resume hook that starts > > > the worker runs before clearing the DMF_SUSPENDED flag, and can run > > > concurrently with the 'status' IOCTL. So, the worker might run > > > concurrently with a 'status' operation that sees the device as > > > suspended. > > > > I was initially thinking removing dm-era's suspended state (like above > > patch) should take care of the pitfalls of disallowing works due to > > that state. > > > > But the dm-era could should be audited to make sure the big hammer of > > disallowing works while suspended wasn't important for its > > implementation (and the need to disallow device changes while > > suspended). > > > > I think a simple solution, without modifying dm-era too much, is: > 1. Have status run in the user space thread, instead of queueing it to > the worker. OK. > 2. Add a mutex to avoid status running concurrently with the worker > thread. Ideally you'd already have adequate protection. You want to be sure you're protecting data and not just user vs worker. > 3. Have status check if the device is suspended. If it is not, commit > the metadata before retrieving the status. If it is, skip the > metadata commit. 
This is in line with what the thin and cache targets > do. > > Doing this avoids the deadlock in case someone, e.g., an LVM command, > runs a status operation against a suspended dm-era device. > > User space could trigger the deadlock if it sends a message to a > suspended dm-era device, but, since this should never happen, it > shouldn't affect existing workflows. Why not have message run from user thread too? > Still, it might be a good idea to enforce this restriction, i.e., > disallow running message operations on suspended devices, but, I guess > this requires more discussion. It's all target specific. DM core shouldn't be the place to impose this constraint. Mike -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-26 0:06 ` Mike Snitzer @ 2023-01-31 11:01 ` Nikos Tsironis 2023-01-31 20:20 ` Mike Snitzer 0 siblings, 1 reply; 14+ messages in thread From: Nikos Tsironis @ 2023-01-31 11:01 UTC (permalink / raw) To: Mike Snitzer; +Cc: dm-devel, snitzer, ejt, agk On 1/26/23 02:06, Mike Snitzer wrote: > On Wed, Jan 25 2023 at 7:37P -0500, > Nikos Tsironis <ntsironis@arrikto.com> wrote: > >> On 1/23/23 19:34, Mike Snitzer wrote: >>> On Thu, Jan 19 2023 at 4:36P -0500, >>> Nikos Tsironis <ntsironis@arrikto.com> wrote: >>> >>>> On 1/18/23 18:28, Mike Snitzer wrote: >>>>> On Wed, Jan 18 2023 at 7:29P -0500, >>>>> Nikos Tsironis <ntsironis@arrikto.com> wrote: >>>>> >>>>>> Under certain conditions, swapping a table, that includes a dm-era >>>>>> target, with a new table, causes a deadlock. >>>>>> >>>>>> This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked >>>>>> in the suspended dm-era target. >>>>>> >>>>>> dm-era executes all metadata operations in a worker thread, which stops >>>>>> processing requests when the target is suspended, and resumes again when >>>>>> the target is resumed. >>>>>> >>>>>> So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era >>>>>> device blocks, until the device is resumed. >>>>>> >>>>>> If we then load a new table to the device, while the aforementioned >>>>>> dmsetup command is blocked in dm-era, and resume the device, we >>>>>> deadlock. >>>>>> >>>>>> The problem is that the 'dmsetup status' and 'dmsetup message' commands >>>>>> hold a reference to the live table, i.e., they hold an SRCU read lock on >>>>>> md->io_barrier, while they are blocked. >>>>>> >>>>>> When the device is resumed, the old table is replaced with the new one >>>>>> by dm_swap_table(), which ends up calling synchronize_srcu() on >>>>>> md->io_barrier. 
>>>>>> >>>>>> Since the blocked dmsetup command is holding the SRCU read lock, and the >>>>>> old table is never resumed, 'dmsetup resume' blocks too, and we have a >>>>>> deadlock. >>>>>> >>>>>> The only way to recover is by rebooting. >>>>>> >>>>>> Steps to reproduce: >>>>>> >>>>>> 1. Create device with dm-era target >>>>>> >>>>>> # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>>>> >>>>>> 2. Suspend the device >>>>>> >>>>>> # dmsetup suspend eradev >>>>>> >>>>>> 3. Load new table to device, e.g., to resize the device. Note, that, we >>>>>> must load the new table _after_ suspending the device to ensure the >>>>>> constructor of the new target instance reads up-to-date metadata, as >>>>>> committed by the post-suspend hook of dm-era. >>>>>> >>>>>> # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>>>> >>>>>> 4. Device now has LIVE and INACTIVE tables >>>>>> >>>>>> # dmsetup info eradev >>>>>> Name: eradev >>>>>> State: SUSPENDED >>>>>> Read Ahead: 16384 >>>>>> Tables present: LIVE & INACTIVE >>>>>> Open count: 0 >>>>>> Event number: 0 >>>>>> Major, minor: 252, 2 >>>>>> Number of targets: 1 >>>>>> >>>>>> 5. Retrieve the status of the device. This blocks because the device is >>>>>> suspended. Equivalently, any 'dmsetup message' operation would block >>>>>> too. This command holds the SRCU read lock on md->io_barrier. >>>>>> >>>>>> # dmsetup status eradev >>>>> >>>>> I'll have a look at this flow, it seems to me we shouldn't stack up >>>>> 'dmsetup status' and 'dmsetup message' commands if the table is >>>>> suspended. >>>>> >>>>> I think it is unique to dm-era that you don't allow to _read_ metadata >>>>> operations while a device is suspended. But messages really shouldn't >>>>> be sent when the device is suspended. As-is DM is pretty silently >>>>> cutthroat about that constraint. >>>>> >>>>> Resulting in deadlock is obviously cutthroat... 
>>>>> >>>> >>>> Hi Mike, >>>> >>>> Thanks for the quick reply. >>>> >>>> I couldn't find this constraint documented anywhere and since the >>>> various DM targets seem to allow message operations while the device is >>>> suspended I drew the wrong conclusion. >>>> >>>> Thanks for clarifying this. >>>> >>>>>> 6. Resume the device. The resume operation tries to swap the old table >>>>>> with the new one and deadlocks, because it synchronizes SRCU for the >>>>>> old table, while the blocked 'dmsetup status' holds the SRCU read >>>>>> lock. And the old table is never resumed again at this point. >>>>>> >>>>>> # dmsetup resume eradev >>>>>> >>>>>> 7. The relevant dmesg logs are: >>>>>> >>>>>> [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152 >>>>>> [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds. >>>>>> [ 7250.875722] Not tainted 5.16.0-rc2-release+ #16 >>>>>> [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>>> [ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000 >>>>>> [ 7250.875809] Call Trace: >>>>>> [ 7250.875812] <TASK> >>>>>> [ 7250.875816] __schedule+0x330/0x8b0 >>>>>> [ 7250.875827] schedule+0x4e/0xc0 >>>>>> [ 7250.875831] schedule_timeout+0x20f/0x2e0 >>>>>> [ 7250.875836] ? do_set_pte+0xb8/0x120 >>>>>> [ 7250.875843] ? prep_new_page+0x91/0xa0 >>>>>> [ 7250.875847] wait_for_completion+0x8c/0xf0 >>>>>> [ 7250.875854] perform_rpc+0x95/0xb0 [dm_era] >>>>>> [ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era] >>>>>> [ 7250.875867] ? era_iterate_devices+0x30/0x30 [dm_era] >>>>>> [ 7250.875872] ? era_status+0x64/0x1e0 [dm_era] >>>>>> [ 7250.875877] era_status+0x64/0x1e0 [dm_era] >>>>>> [ 7250.875882] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] >>>>>> [ 7250.875900] ? __mod_node_page_state+0x82/0xc0 >>>>>> [ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod] >>>>>> [ 7250.875921] ? 
dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] >>>>>> [ 7250.875932] table_status+0x61/0xa0 [dm_mod] >>>>>> [ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod] >>>>>> [ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod] >>>>>> [ 7250.875966] __x64_sys_ioctl+0x8e/0xd0 >>>>>> [ 7250.875970] do_syscall_64+0x3a/0xd0 >>>>>> [ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae >>>>>> [ 7250.875980] RIP: 0033:0x7f20b7cd4017 >>>>>> [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >>>>>> [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017 >>>>>> [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003 >>>>>> [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320 >>>>>> [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110 >>>>>> [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000 >>>>>> [ 7250.876002] </TASK> >>>>>> [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds. >>>>>> [ 7250.876046] Not tainted 5.16.0-rc2-release+ #16 >>>>>> [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>>> [ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000 >>>>>> [ 7250.876134] Call Trace: >>>>>> [ 7250.876136] <TASK> >>>>>> [ 7250.876138] __schedule+0x330/0x8b0 >>>>>> [ 7250.876142] schedule+0x4e/0xc0 >>>>>> [ 7250.876145] schedule_timeout+0x20f/0x2e0 >>>>>> [ 7250.876149] ? __queue_work+0x226/0x420 >>>>>> [ 7250.876156] wait_for_completion+0x8c/0xf0 >>>>>> [ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0 >>>>>> [ 7250.876167] ? __bpf_trace_rcu_stall_warning+0x10/0x10 >>>>>> [ 7250.876173] ? dm_swap_table+0x2f4/0x310 [dm_mod] >>>>>> [ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod] >>>>>> [ 7250.876198] ? 
table_load+0x360/0x360 [dm_mod] >>>>>> [ 7250.876207] dev_suspend+0x95/0x250 [dm_mod] >>>>>> [ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod] >>>>>> [ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod] >>>>>> [ 7250.876240] __x64_sys_ioctl+0x8e/0xd0 >>>>>> [ 7250.876244] do_syscall_64+0x3a/0xd0 >>>>>> [ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae >>>>>> [ 7250.876252] RIP: 0033:0x7f15e9254017 >>>>>> [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >>>>>> [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017 >>>>>> [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003 >>>>>> [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0 >>>>>> [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110 >>>>>> [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001 >>>>>> [ 7250.876269] </TASK> >>>>>> >>>>>> Fix this by allowing metadata operations triggered by user space to run >>>>>> in the corresponding user space thread, instead of queueing them for >>>>>> execution by the dm-era worker thread. >>>>> >>>>> Allowing them to run while the device is suspended is _not_ the >>>>> correct way to work-around this deadlock situation. I think it'd be >>>>> useful to understand why your userspace tools are allowing these >>>>> messages and status to a device that is suspended. >>>>> >>>> >>>> Ack. >>>> >>>> The sequence of operations I provided is just a way to easily reproduce >>>> the deadlock. The exact issue I am facing is the following: >>>> >>>> 1. Create device with dm-era target >>>> >>>> # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>> >>>> 2. Load new table to device, e.g., to resize the device or snapshot it. >>>> >>>> # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>> >>>> 3.
Suspend the device >>>> >>>> # dmsetup suspend eradev >>>> >>>> 4. Someone else, e.g., a user or a monitoring daemon, runs an LVM >>>> command at this point, e.g. 'vgs'. >>>> >>>> 5. 'vgs' tries to retrieve the status of the dm-era device using the >>>> DM_TABLE_STATUS ioctl, and blocks, because the command is queued for >>>> execution by the worker thread, which is suspended while the device >>>> is suspended. >>>> >>>> Note that LVM uses the DM_NOFLUSH_FLAG, but this doesn't make a >>>> difference for dm-era, since the "problem" is not that it writes to >>>> the metadata, but that it queues the metadata read operation to the >>>> worker thread. >>>> >>>> 6. Resume the device: This deadlocks. >>>> >>>> # dmsetup resume eradev >>>> >>>> So, I am not the one retrieving the status of the suspended device. LVM >>>> is. LVM, when running commands like 'lvs' and 'vgs', retrieves the >>>> status of the devices on the system using the DM_TABLE_STATUS ioctl. >>>> >>>> The deadlock is a race that happens when someone runs an LVM command at >>>> the "wrong" time.
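[ Editorial note: the race described above can be condensed into a rough, non-runnable sketch of the two ioctl paths involved. Function names are taken from the stack traces in this thread; the interleaving comments are illustrative, not verified code. ]

```c
/*
 * Non-runnable sketch of the deadlock between the two ioctls.
 *
 * Thread A: DM_TABLE_STATUS ioctl (issued by 'vgs' or 'dmsetup status')
 * Thread B: DM_DEV_SUSPEND ioctl ('dmsetup resume')
 */

/* Thread A: takes the SRCU read lock, then blocks on the stopped worker. */
map = dm_get_live_or_inactive_table(md, param, &srcu_idx); /* srcu_read_lock(&md->io_barrier) */
era_status(ti, ...);                /* queues an RPC for the dm-era worker ...  */
wait_for_completion(&rpc.complete); /* ... which does not run while suspended   */

/* Thread B: swaps in the inactive table, then waits for all SRCU readers. */
dm_swap_table(md, new_map);
synchronize_srcu(&md->io_barrier);  /* waits forever on thread A's read lock */
```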
>>> >>> I think dm-era shouldn't be disallowing work items while suspended, >>> e.g.: >>> >>> diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c >>> index e92c1afc3677..33ea2c2374c7 100644 >>> --- a/drivers/md/dm-era-target.c >>> +++ b/drivers/md/dm-era-target.c >>> @@ -1175,7 +1175,6 @@ struct era { >>> struct list_head rpc_calls; >>> struct digest digest; >>> - atomic_t suspended; >>> }; >>> struct rpc { >>> @@ -1219,8 +1218,7 @@ static void remap_to_origin(struct era *era, struct bio *bio) >>> *--------------------------------------------------------------*/ >>> static void wake_worker(struct era *era) >>> { >>> - if (!atomic_read(&era->suspended)) >>> - queue_work(era->wq, &era->worker); >>> + queue_work(era->wq, &era->worker); >>> } >>> static void process_old_eras(struct era *era) >>> @@ -1392,17 +1390,6 @@ static int in_worker1(struct era *era, >>> return perform_rpc(era, &rpc); >>> } >>> -static void start_worker(struct era *era) >>> -{ >>> - atomic_set(&era->suspended, 0); >>> -} >>> - >>> -static void stop_worker(struct era *era) >>> -{ >>> - atomic_set(&era->suspended, 1); >>> - drain_workqueue(era->wq); >>> -} >>> - >>> /*---------------------------------------------------------------- >>> * Target methods >>> *--------------------------------------------------------------*/ >>> @@ -1569,7 +1556,7 @@ static void era_postsuspend(struct dm_target *ti) >>> /* FIXME: fail mode */ >>> } >>> - stop_worker(era); >>> + drain_workqueue(era->wq); >>> r = metadata_commit(era->md); >>> if (r) { >>> @@ -1600,8 +1587,6 @@ static int era_preresume(struct dm_target *ti) >>> era->nr_blocks = new_size; >>> } >>> - start_worker(era); >>> - >>> r = in_worker0(era, metadata_era_rollover); >>> if (r) { >>> DMERR("%s: metadata_era_rollover failed", __func__); >>> >>> During suspend it should certainly flush all works; but I fail to see >>> the value in disallowing the targets main way to do work (even while >>> suspended). 
Maybe Joe has insight on why dm-era was written this way. >>> >>> But as an example dm-cache and dm-thin don't have such a strong >>> restriction. >>> >> >> The worker thread does the following: >> 1. Process old eras, i.e., digest the archived writesets into the main >> era array. >> >> This doesn't involve a metadata commit, but writes to the metadata. >> >> 2. Process deferred bios. This might trigger a metadata commit in >> general, but when the device is suspended no bios should be reaching >> the target, so it should be a no-op. >> >> 3. Process RPC calls. This involves 'status' and 'message' operations. >> process_rpc_calls() does commit the metadata, after running all RPC >> calls. Checking if the device is suspended here is a way to prevent >> the metadata commit. >> >> But, some of the 'message' operations also commit the metadata, e.g., >> the metadata_take_snap message. I understand that no one should be >> sending messages to a suspended device, but it's not enforced so it >> could happen. >> >> I believe the value of suspending the worker when the device is >> suspended is to prevent anyone from writing to the metadata. >> >> Writing to the metadata, while the device is suspended, could cause >> problems. More on that in the following comments. >> >>>> Using an appropriate LVM 'global_filter' is a way to prevent this, but: >>>> >>>> 1. This is just a workaround, not a proper solution. >>>> 2. This is not always an option. Imagine someone running an LVM command >>>> in a container, for example. Or, we may not be allowed to change the >>>> LVM configuration of the host at all. >>>>> FYI, status shouldn't trigger write IOs if the device is suspended.
>>>>> Both dm-cache and dm-thin have this in status as a guard around its >>>>> work to ensure updated accounting (as a side-effect of committing >>>>> metadata), e.g.: >>>>> >>>>> /* Commit to ensure statistics aren't out-of-date */ >>>>> if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) >>>>> (void) commit(cache, false); >>>> >>>> Ack. The current behavior of dm-era wrt 'status' command is: >>>> >>>> 1. Queue the 'metadata_get_stats()' function for execution by the worker >>>> thread. >>>> 2. The worker runs the function and retrieves the statistics >>>> 3. The worker commits the metadata _after_ retrieving the statistics >>>> >>>> Shall I just change 'era_status()' to read the metadata directly and >>>> skip the metadata commit, instead of queuing the operation to the worker >>>> thread, when the device is suspended? >>> >>> dm-era should still respect that the device is suspended (disallow >>> metadata commits, etc). >>> >>> Probably should just do like the dm-cache code I quoted above (dm-thin >>> does the same): do not commit if suspended. >>> >>> Seems to me that if simply retrieving stats in era_status() results in >>> metadata commit as a side-effect of its work that is pretty bogus -- >>> process_rpc_calls() must be what does it just due to generic code? >>> >>> If so, yes absolutely avoid that metadata commit (might we just update >>> process_rpc_calls() to check dm_suspended()?) >>> >> >> Yes, the metadata commit is due to the generic code in >> process_rpc_calls(), and could be avoided if we check the device >> suspended state here. >> >> But, avoiding the metadata commit when the device is suspended is not >> enough. No one should be writing to the metadata at all. >> >> Else, we risk hitting the bug fixed by commit 9ae6e8b1c9bbf6 ("dm era: >> commit metadata in postsuspend after worker stops"). 
>> >> In short, writing to the metadata of a suspended dm-era device results >> in the committed metadata getting out-of-sync with the in-core metadata. >> >> This could result in the corruption of the on-disk metadata if we load a >> new dm-era table to the device, and destroy the old one (the destructor >> flushes any cached writes to the disk). >> >> For an exact sequence of events that can lead to this issue, please see >> the commit mentioned above (9ae6e8b1c9bbf6). >> >>> Sorry for such fundamental concerns, I really should've caught this >>> stuff back when dm-era was introduced! >>> >>> Really does feel like a read-write lock for metadata might be >>> warranted. >>> >>>> I think we would still need a lock to synchronize access to the >>>> metadata, since, if I am not mistaken, the pre-resume hook that starts >>>> the worker runs before clearing the DMF_SUSPENDED flag, and can run >>>> concurrently with the 'status' IOCTL. So, the worker might run >>>> concurrently with a 'status' operation that sees the device as >>>> suspended. >>> >>> I was initially thinking removing dm-era's suspended state (like above >>> patch) should take care of the pitfalls of disallowing works due to >>> that state. >>> >>> But the dm-era code should be audited to make sure the big hammer of >>> disallowing works while suspended wasn't important for its >>> implementation (and the need to disallow device changes while >>> suspended). >>> >> >> I think a simple solution, without modifying dm-era too much, is: >> 1. Have status run in the user space thread, instead of queueing it to >> the worker. > > OK. > >> 2. Add a mutex to avoid status running concurrently with the worker >> thread. > > Ideally you'd already have adequate protection. > You want to be sure you're protecting data and not just user vs worker. > I agree. Ideally we would use a read-write lock to protect access to the metadata, like the rest of the DM targets do.
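[ Editorial note: for illustration, the read-write locking discussed above might look roughly like the following. This is a hypothetical, non-runnable sketch, not actual dm-era code; the 'metadata_lock' field name is invented. ]

```c
/* Hypothetical: serialize metadata access between the worker and ioctl paths. */
struct era {
	/* ... existing fields ... */
	struct rw_semaphore metadata_lock;	/* invented for this sketch */
};

/* Worker thread: may write to the metadata. */
down_write(&era->metadata_lock);
process_old_eras(era);
process_rpc_calls(era);
up_write(&era->metadata_lock);

/* era_status() running in the user-space thread: read-only access. */
down_read(&era->metadata_lock);
r = metadata_get_stats(era->md, &stats);
up_read(&era->metadata_lock);
```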
I proposed this locking scheme because it's really simple, and it doesn't require many changes. I agree it's not ideal, but it could be a first step, before introducing a more fine grained read-write lock. >> 3. Have status check if the device is suspended. If it is not, commit >> the metadata before retrieving the status. If it is, skip the >> metadata commit. This is in line with what the thin and cache targets >> do. >> >> Doing this avoids the deadlock in case someone, e.g., an LVM command, >> runs a status operation against a suspended dm-era device. >> >> User space could trigger the deadlock if it sends a message to a >> suspended dm-era device, but, since this should never happen, it >> shouldn't affect existing workflows. > > Why not have message run from user thread too? > That's what this patch does, but I thought it was not the correct way to solve the issue, that's why I didn't propose it again: > Allowing them to run while the device is suspended is _not_ the > correct way to work-around this deadlock situation. But, probably, I misunderstood something. Allowing message operations to run in the user thread, while the device is suspended, would result in writing to the metadata of the suspended device. I understand that this should not happen. We could fail message operations if the device is suspended (since all of them write to the metadata), but isn't this considered a change to the user facing API? What if someone depends on the existing behavior of message operations blocking if the dm-era device is suspended? Is it ok to change this behavior? >> Still, it might be a good idea to enforce this restriction, i.e., >> disallow running message operations on suspended devices, but, I guess >> this requires more discussion. > > It's all target specific. DM core shouldn't be the place to impose > this constraint. > Ack. 
Nikos

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
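[ Editorial note: the direction agreed on above (run status in the user-space thread, and skip the metadata commit while suspended) could be sketched as follows. This is illustrative only, assuming process_rpc_calls() remains the place where the commit happens; it is not the actual patch. ]

```c
/*
 * Sketch only: skip the metadata commit while the device is suspended,
 * mirroring the dm-cache/dm-thin guard quoted earlier in the thread.
 */
static void process_rpc_calls(struct era *era)
{
	int r = 0;

	/* ... run the queued RPC calls ... */

	/* Commit to ensure statistics aren't out-of-date. */
	if (!dm_suspended(era->ti))
		r = metadata_commit(era->md);

	/* ... complete the RPC calls with the result r ... */
}
```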
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-31 11:01 ` Nikos Tsironis @ 2023-01-31 20:20 ` Mike Snitzer 2023-02-07 8:10 ` Nikos Tsironis 0 siblings, 1 reply; 14+ messages in thread From: Mike Snitzer @ 2023-01-31 20:20 UTC (permalink / raw) To: Nikos Tsironis; +Cc: dm-devel, snitzer, ejt, agk On Tue, Jan 31 2023 at 6:01P -0500, Nikos Tsironis <ntsironis@arrikto.com> wrote: > On 1/26/23 02:06, Mike Snitzer wrote: > > On Wed, Jan 25 2023 at 7:37P -0500, > > Nikos Tsironis <ntsironis@arrikto.com> wrote: > > > > > On 1/23/23 19:34, Mike Snitzer wrote: > > > > On Thu, Jan 19 2023 at 4:36P -0500, > > > > Nikos Tsironis <ntsironis@arrikto.com> wrote: > > > > > > > > > On 1/18/23 18:28, Mike Snitzer wrote: > > > > > > On Wed, Jan 18 2023 at 7:29P -0500, > > > > > > Nikos Tsironis <ntsironis@arrikto.com> wrote: > > > > > > > > > > > > > Under certain conditions, swapping a table, that includes a dm-era > > > > > > > target, with a new table, causes a deadlock. > > > > > > > > > > > > > > This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked > > > > > > > in the suspended dm-era target. > > > > > > > > > > > > > > dm-era executes all metadata operations in a worker thread, which stops > > > > > > > processing requests when the target is suspended, and resumes again when > > > > > > > the target is resumed. > > > > > > > > > > > > > > So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era > > > > > > > device blocks, until the device is resumed. > > > > > > > > > > > > > > If we then load a new table to the device, while the aforementioned > > > > > > > dmsetup command is blocked in dm-era, and resume the device, we > > > > > > > deadlock. > > > > > > > > > > > > > > The problem is that the 'dmsetup status' and 'dmsetup message' commands > > > > > > > hold a reference to the live table, i.e., they hold an SRCU read lock on > > > > > > > md->io_barrier, while they are blocked. 
> > > > > > > > > > > > > > When the device is resumed, the old table is replaced with the new one > > > > > > > by dm_swap_table(), which ends up calling synchronize_srcu() on > > > > > > > md->io_barrier. > > > > > > > > > > > > > > Since the blocked dmsetup command is holding the SRCU read lock, and the > > > > > > > old table is never resumed, 'dmsetup resume' blocks too, and we have a > > > > > > > deadlock. > > > > > > > > > > > > > > The only way to recover is by rebooting. > > > > > > > > > > > > > > Steps to reproduce: > > > > > > > > > > > > > > 1. Create device with dm-era target > > > > > > > > > > > > > > # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > > > > > > > > > 2. Suspend the device > > > > > > > > > > > > > > # dmsetup suspend eradev > > > > > > > > > > > > > > 3. Load new table to device, e.g., to resize the device. Note, that, we > > > > > > > must load the new table _after_ suspending the device to ensure the > > > > > > > constructor of the new target instance reads up-to-date metadata, as > > > > > > > committed by the post-suspend hook of dm-era. > > > > > > > > > > > > > > # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > > > > > > > > > 4. Device now has LIVE and INACTIVE tables > > > > > > > > > > > > > > # dmsetup info eradev > > > > > > > Name: eradev > > > > > > > State: SUSPENDED > > > > > > > Read Ahead: 16384 > > > > > > > Tables present: LIVE & INACTIVE > > > > > > > Open count: 0 > > > > > > > Event number: 0 > > > > > > > Major, minor: 252, 2 > > > > > > > Number of targets: 1 > > > > > > > > > > > > > > 5. Retrieve the status of the device. This blocks because the device is > > > > > > > suspended. Equivalently, any 'dmsetup message' operation would block > > > > > > > too. This command holds the SRCU read lock on md->io_barrier. 
> > > > > > > > > > > > > > # dmsetup status eradev > > > > > > > > > > > > I'll have a look at this flow, it seems to me we shouldn't stack up > > > > > > 'dmsetup status' and 'dmsetup message' commands if the table is > > > > > > suspended. > > > > > > > > > > > > I think it is unique to dm-era that you don't allow to _read_ metadata > > > > > > operations while a device is suspended. But messages really shouldn't > > > > > > be sent when the device is suspended. As-is DM is pretty silently > > > > > > cutthroat about that constraint. > > > > > > > > > > > > Resulting in deadlock is obviously cutthroat... > > > > > > > > > > > > > > > > Hi Mike, > > > > > > > > > > Thanks for the quick reply. > > > > > > > > > > I couldn't find this constraint documented anywhere and since the > > > > > various DM targets seem to allow message operations while the device is > > > > > suspended I drew the wrong conclusion. > > > > > > > > > > Thanks for clarifying this. > > > > > > > > > > > > 6. Resume the device. The resume operation tries to swap the old table > > > > > > > with the new one and deadlocks, because it synchronizes SRCU for the > > > > > > > old table, while the blocked 'dmsetup status' holds the SRCU read > > > > > > > lock. And the old table is never resumed again at this point. > > > > > > > > > > > > > > # dmsetup resume eradev > > > > > > > > > > > > > > 7. The relevant dmesg logs are: > > > > > > > > > > > > > > [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152 > > > > > > > [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds. > > > > > > > [ 7250.875722] Not tainted 5.16.0-rc2-release+ #16 > > > > > > > [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> > > > > > > [ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000 > > > > > > > [ 7250.875809] Call Trace: > > > > > > > [ 7250.875812] <TASK> > > > > > > > [ 7250.875816] __schedule+0x330/0x8b0 > > > > > > > [ 7250.875827] schedule+0x4e/0xc0 > > > > > > > [ 7250.875831] schedule_timeout+0x20f/0x2e0 > > > > > > > [ 7250.875836] ? do_set_pte+0xb8/0x120 > > > > > > > [ 7250.875843] ? prep_new_page+0x91/0xa0 > > > > > > > [ 7250.875847] wait_for_completion+0x8c/0xf0 > > > > > > > [ 7250.875854] perform_rpc+0x95/0xb0 [dm_era] > > > > > > > [ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era] > > > > > > > [ 7250.875867] ? era_iterate_devices+0x30/0x30 [dm_era] > > > > > > > [ 7250.875872] ? era_status+0x64/0x1e0 [dm_era] > > > > > > > [ 7250.875877] era_status+0x64/0x1e0 [dm_era] > > > > > > > [ 7250.875882] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] > > > > > > > [ 7250.875900] ? __mod_node_page_state+0x82/0xc0 > > > > > > > [ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod] > > > > > > > [ 7250.875921] ? 
dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] > > > > > > > [ 7250.875932] table_status+0x61/0xa0 [dm_mod] > > > > > > > [ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod] > > > > > > > [ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod] > > > > > > > [ 7250.875966] __x64_sys_ioctl+0x8e/0xd0 > > > > > > > [ 7250.875970] do_syscall_64+0x3a/0xd0 > > > > > > > [ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > > > > [ 7250.875980] RIP: 0033:0x7f20b7cd4017 > > > > > > > [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > > > > > > [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017 > > > > > > > [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003 > > > > > > > [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320 > > > > > > > [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110 > > > > > > > [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000 > > > > > > > [ 7250.876002] </TASK> > > > > > > > [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds. > > > > > > > [ 7250.876046] Not tainted 5.16.0-rc2-release+ #16 > > > > > > > [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > > > > > [ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000 > > > > > > > [ 7250.876134] Call Trace: > > > > > > > [ 7250.876136] <TASK> > > > > > > > [ 7250.876138] __schedule+0x330/0x8b0 > > > > > > > [ 7250.876142] schedule+0x4e/0xc0 > > > > > > > [ 7250.876145] schedule_timeout+0x20f/0x2e0 > > > > > > > [ 7250.876149] ? __queue_work+0x226/0x420 > > > > > > > [ 7250.876156] wait_for_completion+0x8c/0xf0 > > > > > > > [ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0 > > > > > > > [ 7250.876167] ? __bpf_trace_rcu_stall_warning+0x10/0x10 > > > > > > > [ 7250.876173] ? 
dm_swap_table+0x2f4/0x310 [dm_mod] > > > > > > > [ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod] > > > > > > > [ 7250.876198] ? table_load+0x360/0x360 [dm_mod] > > > > > > > [ 7250.876207] dev_suspend+0x95/0x250 [dm_mod] > > > > > > > [ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod] > > > > > > > [ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod] > > > > > > > [ 7250.876240] __x64_sys_ioctl+0x8e/0xd0 > > > > > > > [ 7250.876244] do_syscall_64+0x3a/0xd0 > > > > > > > [ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > > > > [ 7250.876252] RIP: 0033:0x7f15e9254017 > > > > > > > [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > > > > > > [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017 > > > > > > > [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003 > > > > > > > [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0 > > > > > > > [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110 > > > > > > > [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001 > > > > > > > [ 7250.876269] </TASK> > > > > > > > > > > > > > > Fix this by allowing metadata operations triggered by user space to run > > > > > > > in the corresponding user space thread, instead of queueing them for > > > > > > > execution by the dm-era worker thread. > > > > > > > > > > > > Allowing them to run while the device is suspended is _not_ the > > > > > > correct way to work-around this deadlock situation. I think it'd be > > > > > > useful to understand why your userspace is tools are allowing these > > > > > > messages and status to a device that is suspended. > > > > > > > > > > > > > > > > Ack. > > > > > > > > > > The sequence of operations I provided is just a way to easily reproduce > > > > > the deadlock. The exact issue I am facing is the following: > > > > > > > > > > 1. 
Create device with dm-era target > > > > > > > > > > # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > > > > > 2. Load new table to device, e.g., to resize the device or snapshot it. > > > > > > > > > > # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > > > > > > > > > 3. Suspend the device > > > > > > > > > > # dmsetup suspend eradev > > > > > > > > > > 4. Someone else, e.g., a user or a monitoring daemon, runs an LVM > > > > > command at this point, e.g. 'vgs'. > > > > > > > > > > 5. 'vgs' tries to retrieve the status of the dm-era device using the > > > > > DM_TABLE_STATUS ioctl, and blocks, because the command is queued for > > > > > execution by the worker thread, which is suspended while the device > > > > > is suspended. > > > > > > > > > > Note, that, LVM uses the DM_NOFLUSH_FLAG, but this doesn't make a > > > > > difference for dm-era, since the "problem" is not that it writes to > > > > > the metadata, but that it queues the metadata read operation to the > > > > > worker thread. > > > > > > > > > > 6. Resume the device: This deadlocks. > > > > > > > > > > # dmsetup resume eradev > > > > > > > > > > So, I am not the one retrieving the status of the suspended device. LVM > > > > > is. LVM, when running commands like 'lvs' and 'vgs', retrieves the > > > > > status of the devices on the system using the DM_TABLE_STATUS ioctl. > > > > > > > > > > The deadlock is a race that happens when someone runs an LVM command at > > > > > the "wrong" time. 
> > > > > > > > I think dm-era shouldn't be disallowing work items while suspended, > > > > e.g.: > > > > > > > > diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c > > > > index e92c1afc3677..33ea2c2374c7 100644 > > > > --- a/drivers/md/dm-era-target.c > > > > +++ b/drivers/md/dm-era-target.c > > > > @@ -1175,7 +1175,6 @@ struct era { > > > > struct list_head rpc_calls; > > > > struct digest digest; > > > > - atomic_t suspended; > > > > }; > > > > struct rpc { > > > > @@ -1219,8 +1218,7 @@ static void remap_to_origin(struct era *era, struct bio *bio) > > > > *--------------------------------------------------------------*/ > > > > static void wake_worker(struct era *era) > > > > { > > > > - if (!atomic_read(&era->suspended)) > > > > - queue_work(era->wq, &era->worker); > > > > + queue_work(era->wq, &era->worker); > > > > } > > > > static void process_old_eras(struct era *era) > > > > @@ -1392,17 +1390,6 @@ static int in_worker1(struct era *era, > > > > return perform_rpc(era, &rpc); > > > > } > > > > -static void start_worker(struct era *era) > > > > -{ > > > > - atomic_set(&era->suspended, 0); > > > > -} > > > > - > > > > -static void stop_worker(struct era *era) > > > > -{ > > > > - atomic_set(&era->suspended, 1); > > > > - drain_workqueue(era->wq); > > > > -} > > > > - > > > > /*---------------------------------------------------------------- > > > > * Target methods > > > > *--------------------------------------------------------------*/ > > > > @@ -1569,7 +1556,7 @@ static void era_postsuspend(struct dm_target *ti) > > > > /* FIXME: fail mode */ > > > > } > > > > - stop_worker(era); > > > > + drain_workqueue(era->wq); > > > > r = metadata_commit(era->md); > > > > if (r) { > > > > @@ -1600,8 +1587,6 @@ static int era_preresume(struct dm_target *ti) > > > > era->nr_blocks = new_size; > > > > } > > > > - start_worker(era); > > > > - > > > > r = in_worker0(era, metadata_era_rollover); > > > > if (r) { > > > > DMERR("%s: metadata_era_rollover 
failed", __func__); > > > > > > > > During suspend it should certainly flush all works; but I fail to see > > > > the value in disallowing the targets main way to do work (even while > > > > suspended). Maybe Joe has insight on why dm-era was written this way. > > > > > > > > But as an example dm-cache and dm-thin don't have such a strong > > > > restriction. > > > > > > > > > > The worker thread does the following: > > > 1. Process old eras, i.e., digest the archived writesets in to the main > > > era array. > > > > > > This doesn't involve a metadata commit, but writes to the metadata > > > > > > 2. Process deferred bios. This might trigger a metadata commit in > > > general, but when the device is suspended no bios should be reaching > > > the target, so it should be a no op. > > > > > > 3. Process RPC calls. This involves 'status' and 'message' operations. > > > process_rpc_calls() does commit the metadata, after running all RPC > > > calls. Checking if the device is suspended here is a way to prevent > > > the metadata commit. > > > > > > But, some of the 'message' operations also commit the metadata, e.g., > > > the metadata_take_snap message. I understand that no one should be > > > sending messages to a suspended device, but it's not enforced so it > > > could happen. > > > > > > I believe the value of suspending the worker when the device is > > > suspended is to prevent anyone from writing to the metadata. > > > > > > Writing to the metadata, while the device is suspended, could cause > > > problems. More on that in the following comments. > > > > > > > > Using an appropriate LVM 'global_filter' is a way to prevent this, but: > > > > > > > > > > 1. This is just a workaround, not a proper solution. > > > > > 2. This is not always an option. Imagine someone running an LVM command > > > > > in a container, for example. Or, we may not be allowed to change the > > > > > LVM configuration of the host at all. 
> > > > > > > > > > > FYI, status shouldn't trigger write IOs if the device is suspended. > > > > > > Both dm-cache and dm-thin have this in status as a guard around its > > > > > > work to ensure updated accounting (as a side-effect of committing > > > > > > metadata), e.g.: > > > > > > > > > > > > /* Commit to ensure statistics aren't out-of-date */ > > > > > > if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) > > > > > > (void) commit(cache, false); > > > > > > > > > > Ack. The current behavior of dm-era wrt 'status' command is: > > > > > > > > > > 1. Queue the 'metadata_get_stats()' function for execution by the worker > > > > > thread. > > > > > 2. The worker runs the function and retrieves the statistics > > > > > 3. The worker commits the metadata _after_ retrieving the statistics > > > > > > > > > > Shall I just change 'era_status()' to read the metadata directly and > > > > > skip the metadata commit, instead of queuing the operation to the worker > > > > > thread, when the device is suspended? > > > > > > > > dm-era should still respect that the device is suspended (disallow > > > > metadata commits, etc). > > > > > > > > Probably should just do like the dm-cache code I quoted above (dm-thin > > > > does the same): do not commit if suspended. > > > > > > > > Seems to me that if simply retrieving stats in era_status() results in > > > > metadata commit as a side-effect of its work that is pretty bogus -- > > > > process_rpc_calls() must be what does it just due to generic code? > > > > > > > > If so, yes absolutely avoid that metadata commit (might we just update > > > > process_rpc_calls() to check dm_suspended()?) > > > > > > > > > > Yes, the metadata commit is due to the generic code in > > > process_rpc_calls(), and could be avoided if we check the device > > > suspended state here. > > > > > > But, avoiding the metadata commit when the device is suspended is not > > > enough. No one should be writing to the metadata at all. 
> > >
> > > Else, we risk hitting the bug fixed by commit 9ae6e8b1c9bbf6 ("dm era:
> > > commit metadata in postsuspend after worker stops").
> > >
> > > In short, writing to the metadata of a suspended dm-era device results
> > > in the committed metadata getting out-of-sync with the in-core metadata.
> > >
> > > This could result in the corruption of the on-disk metadata if we load a
> > > new dm-era table to the device, and destroy the old one (the destructor
> > > flushes any cached writes to the disk).
> > >
> > > For an exact sequence of events that can lead to this issue, please see
> > > the commit mentioned above (9ae6e8b1c9bbf6).
> > >
> > > > Sorry for such fundamental concerns, I really should've caught this
> > > > stuff back when dm-era was introduced!
> > > >
> > > > Really does feel like a read-write lock for metadata might be
> > > > warranted.
> > > >
> > > > > I think we would still need a lock to synchronize access to the
> > > > > metadata, since, if I am not mistaken, the pre-resume hook that starts
> > > > > the worker runs before clearing the DMF_SUSPENDED flag, and can run
> > > > > concurrently with the 'status' IOCTL. So, the worker might run
> > > > > concurrently with a 'status' operation that sees the device as
> > > > > suspended.
> > > >
> > > > I was initially thinking removing dm-era's suspended state (like the
> > > > above patch) should take care of the pitfalls of disallowing work due
> > > > to that state.
> > > >
> > > > But the dm-era code should be audited to make sure the big hammer of
> > > > disallowing work while suspended wasn't important for its
> > > > implementation (and the need to disallow device changes while
> > > > suspended).
> > > >
> > >
> > > I think a simple solution, without modifying dm-era too much, is:
> > > 1. Have status run in the user space thread, instead of queueing it to
> > >    the worker.
> >
> > OK.
> >
> > > 2. Add a mutex to avoid status running concurrently with the worker
> > >    thread.
> >
> > Ideally you'd already have adequate protection.
> > You want to be sure you're protecting data and not just user vs worker.
> >
>
> I agree. Ideally we would use a read-write lock to protect access to the
> metadata, like the rest of the DM targets do.
>
> I proposed this locking scheme because it's really simple, and it
> doesn't require many changes. I agree it's not ideal, but it could be a
> first step, before introducing a more fine-grained read-write lock.

I'll defer to your best judgement (and Joe if he cares).

But there is some value in not introducing hacks that impose excessive
locking. This is still hackish, but it avoids per-io locking overhead:
you could potentially use jump labels to have both the message() and
status() hooks simply return EWOULDBLOCK while the table is suspended
(a side-effect of suspend would activate the jump_label that'd cause
the EWOULDBLOCK return)?

> > > 3. Have status check if the device is suspended. If it is not, commit
> > >    the metadata before retrieving the status. If it is, skip the
> > >    metadata commit. This is in line with what the thin and cache targets
> > >    do.
> > >
> > > Doing this avoids the deadlock in case someone, e.g., an LVM command,
> > > runs a status operation against a suspended dm-era device.
> > >
> > > User space could trigger the deadlock if it sends a message to a
> > > suspended dm-era device, but, since this should never happen, it
> > > shouldn't affect existing workflows.
> >
> > Why not have message run from the user thread too?
> >
>
> That's what this patch does, but I thought it was not the correct way to
> solve the issue, which is why I didn't propose it again:
>
> > Allowing them to run while the device is suspended is _not_ the
> > correct way to work-around this deadlock situation.
>
> But, probably, I misunderstood something.
>
> Allowing message operations to run in the user thread, while the device
> is suspended, would result in writing to the metadata of the suspended
> device.
>
> I understand that this should not happen.
>
> We could fail message operations if the device is suspended (since all
> of them write to the metadata), but isn't this considered a change to
> the user-facing API?
>
> What if someone depends on the existing behavior of message operations
> blocking if the dm-era device is suspended? Is it ok to change this
> behavior?

I honestly think you're the only consumer of dm-era for anything. I'm
fine with failing the message (or status) if suspended.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread
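[Editorial aside: to make the SRCU interplay discussed in this thread concrete, here is a toy userspace model of md->io_barrier. This is illustrative Python, not kernel code; all names (SrcuModel, dmsetup_status, worker_resumed) are invented for the sketch. It shows why 'dmsetup resume' can never finish while a blocked 'dmsetup status' holds the read-side lock.]

```python
import threading
import time

class SrcuModel:
    """Toy model of an SRCU domain: synchronize() waits until every
    reader that entered the read-side critical section has left it,
    mirroring synchronize_srcu() waiting out a grace period."""
    def __init__(self):
        self.readers = 0
        self.cond = threading.Condition()

    def read_lock(self):
        with self.cond:
            self.readers += 1

    def read_unlock(self):
        with self.cond:
            self.readers -= 1
            self.cond.notify_all()

    def synchronize(self, timeout=None):
        # Returns False if readers were still inside when the timeout hit.
        with self.cond:
            return self.cond.wait_for(lambda: self.readers == 0, timeout)

io_barrier = SrcuModel()            # stand-in for md->io_barrier
worker_resumed = threading.Event()  # stand-in for the suspended worker restarting

def dmsetup_status():
    io_barrier.read_lock()   # the status ioctl takes the SRCU read lock, then...
    worker_resumed.wait()    # ...the RPC blocks on the suspended dm-era worker
    io_barrier.read_unlock()

t = threading.Thread(target=dmsetup_status, daemon=True)
t.start()
time.sleep(0.1)  # let the status thread enter the read-side section

# dm_swap_table() -> synchronize_srcu(): it cannot complete while the
# blocked status thread holds the read lock.  In the kernel there is no
# timeout, so 'dmsetup resume' hangs forever.
assert io_barrier.synchronize(timeout=0.2) is False

# Only resuming the worker (i.e. the old table) lets the reader exit and
# the grace period finish -- which is exactly what never happens above.
worker_resumed.set()
assert io_barrier.synchronize(timeout=2.0) is True
```

The model also makes clear why failing the blocked operation (rather than queueing it) breaks the cycle: the reader exits promptly, so the grace period can always complete.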
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-31 20:20 ` Mike Snitzer @ 2023-02-07 8:10 ` Nikos Tsironis 0 siblings, 0 replies; 14+ messages in thread From: Nikos Tsironis @ 2023-02-07 8:10 UTC (permalink / raw) To: Mike Snitzer; +Cc: dm-devel, snitzer, ejt, agk On 1/31/23 22:20, Mike Snitzer wrote: > On Tue, Jan 31 2023 at 6:01P -0500, > Nikos Tsironis <ntsironis@arrikto.com> wrote: > >> On 1/26/23 02:06, Mike Snitzer wrote: >>> On Wed, Jan 25 2023 at 7:37P -0500, >>> Nikos Tsironis <ntsironis@arrikto.com> wrote: >>> >>>> On 1/23/23 19:34, Mike Snitzer wrote: >>>>> On Thu, Jan 19 2023 at 4:36P -0500, >>>>> Nikos Tsironis <ntsironis@arrikto.com> wrote: >>>>> >>>>>> On 1/18/23 18:28, Mike Snitzer wrote: >>>>>>> On Wed, Jan 18 2023 at 7:29P -0500, >>>>>>> Nikos Tsironis <ntsironis@arrikto.com> wrote: >>>>>>> >>>>>>>> Under certain conditions, swapping a table, that includes a dm-era >>>>>>>> target, with a new table, causes a deadlock. >>>>>>>> >>>>>>>> This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked >>>>>>>> in the suspended dm-era target. >>>>>>>> >>>>>>>> dm-era executes all metadata operations in a worker thread, which stops >>>>>>>> processing requests when the target is suspended, and resumes again when >>>>>>>> the target is resumed. >>>>>>>> >>>>>>>> So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era >>>>>>>> device blocks, until the device is resumed. >>>>>>>> >>>>>>>> If we then load a new table to the device, while the aforementioned >>>>>>>> dmsetup command is blocked in dm-era, and resume the device, we >>>>>>>> deadlock. >>>>>>>> >>>>>>>> The problem is that the 'dmsetup status' and 'dmsetup message' commands >>>>>>>> hold a reference to the live table, i.e., they hold an SRCU read lock on >>>>>>>> md->io_barrier, while they are blocked. 
>>>>>>>> >>>>>>>> When the device is resumed, the old table is replaced with the new one >>>>>>>> by dm_swap_table(), which ends up calling synchronize_srcu() on >>>>>>>> md->io_barrier. >>>>>>>> >>>>>>>> Since the blocked dmsetup command is holding the SRCU read lock, and the >>>>>>>> old table is never resumed, 'dmsetup resume' blocks too, and we have a >>>>>>>> deadlock. >>>>>>>> >>>>>>>> The only way to recover is by rebooting. >>>>>>>> >>>>>>>> Steps to reproduce: >>>>>>>> >>>>>>>> 1. Create device with dm-era target >>>>>>>> >>>>>>>> # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>>>>>> >>>>>>>> 2. Suspend the device >>>>>>>> >>>>>>>> # dmsetup suspend eradev >>>>>>>> >>>>>>>> 3. Load new table to device, e.g., to resize the device. Note, that, we >>>>>>>> must load the new table _after_ suspending the device to ensure the >>>>>>>> constructor of the new target instance reads up-to-date metadata, as >>>>>>>> committed by the post-suspend hook of dm-era. >>>>>>>> >>>>>>>> # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>>>>>> >>>>>>>> 4. Device now has LIVE and INACTIVE tables >>>>>>>> >>>>>>>> # dmsetup info eradev >>>>>>>> Name: eradev >>>>>>>> State: SUSPENDED >>>>>>>> Read Ahead: 16384 >>>>>>>> Tables present: LIVE & INACTIVE >>>>>>>> Open count: 0 >>>>>>>> Event number: 0 >>>>>>>> Major, minor: 252, 2 >>>>>>>> Number of targets: 1 >>>>>>>> >>>>>>>> 5. Retrieve the status of the device. This blocks because the device is >>>>>>>> suspended. Equivalently, any 'dmsetup message' operation would block >>>>>>>> too. This command holds the SRCU read lock on md->io_barrier. >>>>>>>> >>>>>>>> # dmsetup status eradev >>>>>>> >>>>>>> I'll have a look at this flow, it seems to me we shouldn't stack up >>>>>>> 'dmsetup status' and 'dmsetup message' commands if the table is >>>>>>> suspended. 
>>>>>>> >>>>>>> I think it is unique to dm-era that you don't allow to _read_ metadata >>>>>>> operations while a device is suspended. But messages really shouldn't >>>>>>> be sent when the device is suspended. As-is DM is pretty silently >>>>>>> cutthroat about that constraint. >>>>>>> >>>>>>> Resulting in deadlock is obviously cutthroat... >>>>>>> >>>>>> >>>>>> Hi Mike, >>>>>> >>>>>> Thanks for the quick reply. >>>>>> >>>>>> I couldn't find this constraint documented anywhere and since the >>>>>> various DM targets seem to allow message operations while the device is >>>>>> suspended I drew the wrong conclusion. >>>>>> >>>>>> Thanks for clarifying this. >>>>>> >>>>>>>> 6. Resume the device. The resume operation tries to swap the old table >>>>>>>> with the new one and deadlocks, because it synchronizes SRCU for the >>>>>>>> old table, while the blocked 'dmsetup status' holds the SRCU read >>>>>>>> lock. And the old table is never resumed again at this point. >>>>>>>> >>>>>>>> # dmsetup resume eradev >>>>>>>> >>>>>>>> 7. The relevant dmesg logs are: >>>>>>>> >>>>>>>> [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152 >>>>>>>> [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds. >>>>>>>> [ 7250.875722] Not tainted 5.16.0-rc2-release+ #16 >>>>>>>> [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>>>>> [ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000 >>>>>>>> [ 7250.875809] Call Trace: >>>>>>>> [ 7250.875812] <TASK> >>>>>>>> [ 7250.875816] __schedule+0x330/0x8b0 >>>>>>>> [ 7250.875827] schedule+0x4e/0xc0 >>>>>>>> [ 7250.875831] schedule_timeout+0x20f/0x2e0 >>>>>>>> [ 7250.875836] ? do_set_pte+0xb8/0x120 >>>>>>>> [ 7250.875843] ? prep_new_page+0x91/0xa0 >>>>>>>> [ 7250.875847] wait_for_completion+0x8c/0xf0 >>>>>>>> [ 7250.875854] perform_rpc+0x95/0xb0 [dm_era] >>>>>>>> [ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era] >>>>>>>> [ 7250.875867] ? 
era_iterate_devices+0x30/0x30 [dm_era] >>>>>>>> [ 7250.875872] ? era_status+0x64/0x1e0 [dm_era] >>>>>>>> [ 7250.875877] era_status+0x64/0x1e0 [dm_era] >>>>>>>> [ 7250.875882] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] >>>>>>>> [ 7250.875900] ? __mod_node_page_state+0x82/0xc0 >>>>>>>> [ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod] >>>>>>>> [ 7250.875921] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] >>>>>>>> [ 7250.875932] table_status+0x61/0xa0 [dm_mod] >>>>>>>> [ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod] >>>>>>>> [ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod] >>>>>>>> [ 7250.875966] __x64_sys_ioctl+0x8e/0xd0 >>>>>>>> [ 7250.875970] do_syscall_64+0x3a/0xd0 >>>>>>>> [ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae >>>>>>>> [ 7250.875980] RIP: 0033:0x7f20b7cd4017 >>>>>>>> [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >>>>>>>> [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017 >>>>>>>> [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003 >>>>>>>> [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320 >>>>>>>> [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110 >>>>>>>> [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000 >>>>>>>> [ 7250.876002] </TASK> >>>>>>>> [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds. >>>>>>>> [ 7250.876046] Not tainted 5.16.0-rc2-release+ #16 >>>>>>>> [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>>>>> [ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000 >>>>>>>> [ 7250.876134] Call Trace: >>>>>>>> [ 7250.876136] <TASK> >>>>>>>> [ 7250.876138] __schedule+0x330/0x8b0 >>>>>>>> [ 7250.876142] schedule+0x4e/0xc0 >>>>>>>> [ 7250.876145] schedule_timeout+0x20f/0x2e0 >>>>>>>> [ 7250.876149] ? 
__queue_work+0x226/0x420 >>>>>>>> [ 7250.876156] wait_for_completion+0x8c/0xf0 >>>>>>>> [ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0 >>>>>>>> [ 7250.876167] ? __bpf_trace_rcu_stall_warning+0x10/0x10 >>>>>>>> [ 7250.876173] ? dm_swap_table+0x2f4/0x310 [dm_mod] >>>>>>>> [ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod] >>>>>>>> [ 7250.876198] ? table_load+0x360/0x360 [dm_mod] >>>>>>>> [ 7250.876207] dev_suspend+0x95/0x250 [dm_mod] >>>>>>>> [ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod] >>>>>>>> [ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod] >>>>>>>> [ 7250.876240] __x64_sys_ioctl+0x8e/0xd0 >>>>>>>> [ 7250.876244] do_syscall_64+0x3a/0xd0 >>>>>>>> [ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae >>>>>>>> [ 7250.876252] RIP: 0033:0x7f15e9254017 >>>>>>>> [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >>>>>>>> [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017 >>>>>>>> [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003 >>>>>>>> [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0 >>>>>>>> [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110 >>>>>>>> [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001 >>>>>>>> [ 7250.876269] </TASK> >>>>>>>> >>>>>>>> Fix this by allowing metadata operations triggered by user space to run >>>>>>>> in the corresponding user space thread, instead of queueing them for >>>>>>>> execution by the dm-era worker thread. >>>>>>> >>>>>>> Allowing them to run while the device is suspended is _not_ the >>>>>>> correct way to work-around this deadlock situation. I think it'd be >>>>>>> useful to understand why your userspace is tools are allowing these >>>>>>> messages and status to a device that is suspended. >>>>>>> >>>>>> >>>>>> Ack. >>>>>> >>>>>> The sequence of operations I provided is just a way to easily reproduce >>>>>> the deadlock. 
The exact issue I am facing is the following: >>>>>> >>>>>> 1. Create device with dm-era target >>>>>> >>>>>> # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>>>> >>>>>> 2. Load new table to device, e.g., to resize the device or snapshot it. >>>>>> >>>>>> # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" >>>>>> >>>>>> 3. Suspend the device >>>>>> >>>>>> # dmsetup suspend eradev >>>>>> >>>>>> 4. Someone else, e.g., a user or a monitoring daemon, runs an LVM >>>>>> command at this point, e.g. 'vgs'. >>>>>> >>>>>> 5. 'vgs' tries to retrieve the status of the dm-era device using the >>>>>> DM_TABLE_STATUS ioctl, and blocks, because the command is queued for >>>>>> execution by the worker thread, which is suspended while the device >>>>>> is suspended. >>>>>> >>>>>> Note, that, LVM uses the DM_NOFLUSH_FLAG, but this doesn't make a >>>>>> difference for dm-era, since the "problem" is not that it writes to >>>>>> the metadata, but that it queues the metadata read operation to the >>>>>> worker thread. >>>>>> >>>>>> 6. Resume the device: This deadlocks. >>>>>> >>>>>> # dmsetup resume eradev >>>>>> >>>>>> So, I am not the one retrieving the status of the suspended device. LVM >>>>>> is. LVM, when running commands like 'lvs' and 'vgs', retrieves the >>>>>> status of the devices on the system using the DM_TABLE_STATUS ioctl. >>>>>> >>>>>> The deadlock is a race that happens when someone runs an LVM command at >>>>>> the "wrong" time. 
>>>>> >>>>> I think dm-era shouldn't be disallowing work items while suspended, >>>>> e.g.: >>>>> >>>>> diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c >>>>> index e92c1afc3677..33ea2c2374c7 100644 >>>>> --- a/drivers/md/dm-era-target.c >>>>> +++ b/drivers/md/dm-era-target.c >>>>> @@ -1175,7 +1175,6 @@ struct era { >>>>> struct list_head rpc_calls; >>>>> struct digest digest; >>>>> - atomic_t suspended; >>>>> }; >>>>> struct rpc { >>>>> @@ -1219,8 +1218,7 @@ static void remap_to_origin(struct era *era, struct bio *bio) >>>>> *--------------------------------------------------------------*/ >>>>> static void wake_worker(struct era *era) >>>>> { >>>>> - if (!atomic_read(&era->suspended)) >>>>> - queue_work(era->wq, &era->worker); >>>>> + queue_work(era->wq, &era->worker); >>>>> } >>>>> static void process_old_eras(struct era *era) >>>>> @@ -1392,17 +1390,6 @@ static int in_worker1(struct era *era, >>>>> return perform_rpc(era, &rpc); >>>>> } >>>>> -static void start_worker(struct era *era) >>>>> -{ >>>>> - atomic_set(&era->suspended, 0); >>>>> -} >>>>> - >>>>> -static void stop_worker(struct era *era) >>>>> -{ >>>>> - atomic_set(&era->suspended, 1); >>>>> - drain_workqueue(era->wq); >>>>> -} >>>>> - >>>>> /*---------------------------------------------------------------- >>>>> * Target methods >>>>> *--------------------------------------------------------------*/ >>>>> @@ -1569,7 +1556,7 @@ static void era_postsuspend(struct dm_target *ti) >>>>> /* FIXME: fail mode */ >>>>> } >>>>> - stop_worker(era); >>>>> + drain_workqueue(era->wq); >>>>> r = metadata_commit(era->md); >>>>> if (r) { >>>>> @@ -1600,8 +1587,6 @@ static int era_preresume(struct dm_target *ti) >>>>> era->nr_blocks = new_size; >>>>> } >>>>> - start_worker(era); >>>>> - >>>>> r = in_worker0(era, metadata_era_rollover); >>>>> if (r) { >>>>> DMERR("%s: metadata_era_rollover failed", __func__); >>>>> >>>>> During suspend it should certainly flush all works; but I fail to see >>>>> the 
value in disallowing the targets main way to do work (even while >>>>> suspended). Maybe Joe has insight on why dm-era was written this way. >>>>> >>>>> But as an example dm-cache and dm-thin don't have such a strong >>>>> restriction. >>>>> >>>> >>>> The worker thread does the following: >>>> 1. Process old eras, i.e., digest the archived writesets in to the main >>>> era array. >>>> >>>> This doesn't involve a metadata commit, but writes to the metadata >>>> >>>> 2. Process deferred bios. This might trigger a metadata commit in >>>> general, but when the device is suspended no bios should be reaching >>>> the target, so it should be a no op. >>>> >>>> 3. Process RPC calls. This involves 'status' and 'message' operations. >>>> process_rpc_calls() does commit the metadata, after running all RPC >>>> calls. Checking if the device is suspended here is a way to prevent >>>> the metadata commit. >>>> >>>> But, some of the 'message' operations also commit the metadata, e.g., >>>> the metadata_take_snap message. I understand that no one should be >>>> sending messages to a suspended device, but it's not enforced so it >>>> could happen. >>>> >>>> I believe the value of suspending the worker when the device is >>>> suspended is to prevent anyone from writing to the metadata. >>>> >>>> Writing to the metadata, while the device is suspended, could cause >>>> problems. More on that in the following comments. >>>> >>>>>> Using an appropriate LVM 'global_filter' is a way to prevent this, but: >>>>>> >>>>>> 1. This is just a workaround, not a proper solution. >>>>>> 2. This is not always an option. Imagine someone running an LVM command >>>>>> in a container, for example. Or, we may not be allowed to change the >>>>>> LVM configuration of the host at all. >>>>>> >>>>>>> FYI, status shouldn't trigger write IOs if the device is suspended. 
>>>>>>> Both dm-cache and dm-thin have this in status as a guard around its >>>>>>> work to ensure updated accounting (as a side-effect of committing >>>>>>> metadata), e.g.: >>>>>>> >>>>>>> /* Commit to ensure statistics aren't out-of-date */ >>>>>>> if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) >>>>>>> (void) commit(cache, false); >>>>>> >>>>>> Ack. The current behavior of dm-era wrt 'status' command is: >>>>>> >>>>>> 1. Queue the 'metadata_get_stats()' function for execution by the worker >>>>>> thread. >>>>>> 2. The worker runs the function and retrieves the statistics >>>>>> 3. The worker commits the metadata _after_ retrieving the statistics >>>>>> >>>>>> Shall I just change 'era_status()' to read the metadata directly and >>>>>> skip the metadata commit, instead of queuing the operation to the worker >>>>>> thread, when the device is suspended? >>>>> >>>>> dm-era should still respect that the device is suspended (disallow >>>>> metadata commits, etc). >>>>> >>>>> Probably should just do like the dm-cache code I quoted above (dm-thin >>>>> does the same): do not commit if suspended. >>>>> >>>>> Seems to me that if simply retrieving stats in era_status() results in >>>>> metadata commit as a side-effect of its work that is pretty bogus -- >>>>> process_rpc_calls() must be what does it just due to generic code? >>>>> >>>>> If so, yes absolutely avoid that metadata commit (might we just update >>>>> process_rpc_calls() to check dm_suspended()?) >>>>> >>>> >>>> Yes, the metadata commit is due to the generic code in >>>> process_rpc_calls(), and could be avoided if we check the device >>>> suspended state here. >>>> >>>> But, avoiding the metadata commit when the device is suspended is not >>>> enough. No one should be writing to the metadata at all. >>>> >>>> Else, we risk hitting the bug fixed by commit 9ae6e8b1c9bbf6 ("dm era: >>>> commit metadata in postsuspend after worker stops"). 
>>>> >>>> In short, writing to the metadata of a suspended dm-era device results >>>> in the committed metadata getting out-of-sync with the in-core metadata. >>>> >>>> This could result in the corruption of the on-disk metadata if we load a >>>> new dm-era table to the device, and destroy the old one (the destructor >>>> flushes any cached writes to the disk). >>>> >>>> For an exact sequence of events that can lead to this issue, please see >>>> the commit mentioned above (9ae6e8b1c9bbf6). >>>> >>>>> Sorry for such fundamental concerns, I really should've caught this >>>>> stuff back when dm-era was introduced! >>>>> >>>>> Really does feel like a read-write lock for metadata might be >>>>> warranted. >>>>> >>>>>> I think we would still need a lock to synchronize access to the >>>>>> metadata, since, if I am not mistaken, the pre-resume hook that starts >>>>>> the worker runs before clearing the DMF_SUSPENDED flag, and can run >>>>>> concurrently with the 'status' IOCTL. So, the worker might run >>>>>> concurrently with a 'status' operation that sees the device as >>>>>> suspended. >>>>> >>>>> I was initially thinking removing dm-era's suspended state (like above >>>>> patch) should take care of the pitfalls of disallowing works due to >>>>> that state. >>>>> >>>>> But the dm-era could should be audited to make sure the big hammer of >>>>> disallowing works while suspended wasn't important for its >>>>> implementation (and the need to disallow device changes while >>>>> suspended). >>>>> >>>> >>>> I think a simple solution, without modifying dm-era too much, is: >>>> 1. Have status run in the user space thread, instead of queueing it to >>>> the worker. >>> >>> OK. >>> >>>> 2. Add a mutex to avoid status running concurrently with the worker >>>> thread. >>> >>> Ideally you'd already have adequate protection. >>> You want to be sure you're protecting data and not just user vs worker. >>> >> >> I agree. 
Ideally we would use a read-write lock to protect access to the >> metadata, like the rest of the DM targets do. >> >> I proposed this locking scheme because it's really simple, and it >> doesn't require many changes. I agree it's not ideal, but it could be a >> first step, before introducing a more fine grained read-write lock. > > I'll defer to your best judgement (and Joe if he cares). > > But there is some value in not introducing hacks that impose excessive > locking. This is still hackish, but it avoids per-io locking > overhead: You could potentially use jump labels to have both the > message() and status() hooks simply return EWOULDBLOCK while the table > is suspended (a side-effect of suspend would activate the jump_label > that'd cause the EWOULDBLOCK return)? > Yes, you are right! Using jump labels indeed seems better, and it would avoid the mutex. Thinking more about it, I believe it's worth the extra effort to introduce a fine grained read-write lock, just like the rest of the targets. So, I will take some extra time to work on this and introduce the read-write lock. >>>> 3. Have status check if the device is suspended. If it is not, commit >>>> the metadata before retrieving the status. If it is, skip the >>>> metadata commit. This is in line with what the thin and cache targets >>>> do. >>>> >>>> Doing this avoids the deadlock in case someone, e.g., an LVM command, >>>> runs a status operation against a suspended dm-era device. >>>> >>>> User space could trigger the deadlock if it sends a message to a >>>> suspended dm-era device, but, since this should never happen, it >>>> shouldn't affect existing workflows. >>> >>> Why not have message run from user thread too? >>> >> >> That's what this patch does, but I thought it was not the correct way to >> solve the issue, that's why I didn't propose it again: >> >>> Allowing them to run while the device is suspended is _not_ the >>> correct way to work-around this deadlock situation. 
>>
>> But, probably, I misunderstood something.
>>
>> Allowing message operations to run in the user thread, while the device
>> is suspended, would result in writing to the metadata of the suspended
>> device.
>>
>> I understand that this should not happen.
>>
>> We could fail message operations if the device is suspended (since all
>> of them write to the metadata), but isn't this considered a change to
>> the user-facing API?
>>
>> What if someone depends on the existing behavior of message operations
>> blocking if the dm-era device is suspended? Is it ok to change this
>> behavior?
>
> I honestly think you're the only consumer of dm-era for anything. I'm
> fine with failing the message (or status) if suspended.
>

Ack.

My plan is to introduce a read-write lock to protect metadata accesses,
as I mentioned previously, and to fail message operations if the target
is suspended.

For status, I will do as the other targets do: check whether the device
is suspended, to determine if we can commit the metadata before
retrieving the status, and then take the read lock to retrieve it.

Thanks for all the help,
Nikos

^ permalink raw reply	[flat|nested] 14+ messages in thread
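[Editorial aside: the scheme agreed on above — a read-write lock around the metadata, messages failed while suspended, and status committing only when the device is live — can be sketched as a toy model. Again this is illustrative userspace Python; EraModel and its methods are invented names, not the actual dm-era symbols, and a plain lock stands in for the kernel rwsem.]

```python
import threading

class EraModel:
    """Toy model of the proposed dm-era locking scheme: metadata
    operations run in the calling (user) thread under a lock instead of
    being queued to a worker, so nothing ever blocks holding the SRCU
    read lock on a suspended device."""
    def __init__(self):
        self.meta_lock = threading.Lock()  # stand-in for a read-write lock
        self.suspended = False
        self.commits = 0
        self.era = 1

    def message(self, cmd):
        if self.suspended:
            return "EWOULDBLOCK"      # fail instead of queueing to the worker
        with self.meta_lock:          # write side: mutates and commits metadata
            self.era += 1
            self.commits += 1
        return "OK"

    def status(self, noflush=False):
        if not noflush and not self.suspended:
            with self.meta_lock:      # commit only when the device is live,
                self.commits += 1     # like dm-cache/dm-thin do
        with self.meta_lock:          # read side: runs in the user thread and
            return self.era           # returns immediately even when suspended

dev = EraModel()
assert dev.message("era_rollover") == "OK"       # live device: message works
dev.suspended = True
assert dev.message("take_metadata_snap") == "EWOULDBLOCK"
before = dev.commits
assert dev.status() == 2            # status still answers while suspended...
assert dev.commits == before        # ...but skips the metadata commit
```

The key property, mirrored from the thread: an LVM 'vgs'/'lvs' status call against a suspended device now returns instead of parking inside the SRCU read-side section, so a subsequent table swap cannot deadlock.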
* Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target 2023-01-23 17:34 ` Mike Snitzer 2023-01-25 12:37 ` Nikos Tsironis @ 2023-01-25 12:45 ` Nikos Tsironis 1 sibling, 0 replies; 14+ messages in thread From: Nikos Tsironis @ 2023-01-25 12:45 UTC (permalink / raw) To: Mike Snitzer; +Cc: dm-devel, snitzer, ejt, agk On 1/23/23 19:34, Mike Snitzer wrote: > On Thu, Jan 19 2023 at 4:36P -0500, > Nikos Tsironis <ntsironis@arrikto.com> wrote: > >> On 1/18/23 18:28, Mike Snitzer wrote: >>> On Wed, Jan 18 2023 at 7:29P -0500, >>> Nikos Tsironis <ntsironis@arrikto.com> wrote: >>> >>>> Under certain conditions, swapping a table, that includes a dm-era >>>> target, with a new table, causes a deadlock. >>>> >>>> This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked >>>> in the suspended dm-era target. >>>> >>>> dm-era executes all metadata operations in a worker thread, which stops >>>> processing requests when the target is suspended, and resumes again when >>>> the target is resumed. >>>> >>>> So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era >>>> device blocks, until the device is resumed. >>>> >>>> If we then load a new table to the device, while the aforementioned >>>> dmsetup command is blocked in dm-era, and resume the device, we >>>> deadlock. >>>> >>>> The problem is that the 'dmsetup status' and 'dmsetup message' commands >>>> hold a reference to the live table, i.e., they hold an SRCU read lock on >>>> md->io_barrier, while they are blocked. >>>> >>>> When the device is resumed, the old table is replaced with the new one >>>> by dm_swap_table(), which ends up calling synchronize_srcu() on >>>> md->io_barrier. >>>> >>>> Since the blocked dmsetup command is holding the SRCU read lock, and the >>>> old table is never resumed, 'dmsetup resume' blocks too, and we have a >>>> deadlock. >>>> >>>> The only way to recover is by rebooting. >>>> >>>> Steps to reproduce: >>>> >>>> 1. 
Create device with dm-era target
>>>>
>>>> # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"
>>>>
>>>> 2. Suspend the device
>>>>
>>>> # dmsetup suspend eradev
>>>>
>>>> 3. Load new table to device, e.g., to resize the device. Note that we
>>>>    must load the new table _after_ suspending the device to ensure the
>>>>    constructor of the new target instance reads up-to-date metadata, as
>>>>    committed by the post-suspend hook of dm-era.
>>>>
>>>> # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"
>>>>
>>>> 4. Device now has LIVE and INACTIVE tables
>>>>
>>>> # dmsetup info eradev
>>>> Name:              eradev
>>>> State:             SUSPENDED
>>>> Read Ahead:        16384
>>>> Tables present:    LIVE & INACTIVE
>>>> Open count:        0
>>>> Event number:      0
>>>> Major, minor:      252, 2
>>>> Number of targets: 1
>>>>
>>>> 5. Retrieve the status of the device. This blocks because the device is
>>>>    suspended. Equivalently, any 'dmsetup message' operation would block
>>>>    too. This command holds the SRCU read lock on md->io_barrier.
>>>>
>>>> # dmsetup status eradev
>>>
>>> I'll have a look at this flow, it seems to me we shouldn't stack up
>>> 'dmsetup status' and 'dmsetup message' commands if the table is
>>> suspended.
>>>
>>> I think it is unique to dm-era that you don't allow _read_ metadata
>>> operations while a device is suspended. But messages really shouldn't
>>> be sent when the device is suspended. As-is DM is pretty silently
>>> cutthroat about that constraint.
>>>
>>> Resulting in deadlock is obviously cutthroat...
>>>
>>
>> Hi Mike,
>>
>> Thanks for the quick reply.
>>
>> I couldn't find this constraint documented anywhere, and since the
>> various DM targets seem to allow message operations while the device is
>> suspended, I drew the wrong conclusion.
>>
>> Thanks for clarifying this.
>>
>>>> 6. Resume the device. The resume operation tries to swap the old table
>>>>    with the new one and deadlocks, because it synchronizes SRCU for the
>>>>    old table, while the blocked 'dmsetup status' holds the SRCU read
>>>>    lock. And the old table is never resumed again at this point.
>>>>
>>>> # dmsetup resume eradev
>>>>
>>>> 7. The relevant dmesg logs are:
>>>>
>>>> [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
>>>> [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
>>>> [ 7250.875722]       Not tainted 5.16.0-rc2-release+ #16
>>>> [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [ 7250.875803] task:dmsetup         state:D stack:    0 pid: 1986 ppid:  1313 flags:0x00000000
>>>> [ 7250.875809] Call Trace:
>>>> [ 7250.875812]  <TASK>
>>>> [ 7250.875816]  __schedule+0x330/0x8b0
>>>> [ 7250.875827]  schedule+0x4e/0xc0
>>>> [ 7250.875831]  schedule_timeout+0x20f/0x2e0
>>>> [ 7250.875836]  ? do_set_pte+0xb8/0x120
>>>> [ 7250.875843]  ? prep_new_page+0x91/0xa0
>>>> [ 7250.875847]  wait_for_completion+0x8c/0xf0
>>>> [ 7250.875854]  perform_rpc+0x95/0xb0 [dm_era]
>>>> [ 7250.875862]  in_worker1.constprop.20+0x48/0x70 [dm_era]
>>>> [ 7250.875867]  ? era_iterate_devices+0x30/0x30 [dm_era]
>>>> [ 7250.875872]  ? era_status+0x64/0x1e0 [dm_era]
>>>> [ 7250.875877]  era_status+0x64/0x1e0 [dm_era]
>>>> [ 7250.875882]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
>>>> [ 7250.875900]  ? __mod_node_page_state+0x82/0xc0
>>>> [ 7250.875909]  retrieve_status+0xbc/0x1e0 [dm_mod]
>>>> [ 7250.875921]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
>>>> [ 7250.875932]  table_status+0x61/0xa0 [dm_mod]
>>>> [ 7250.875942]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
>>>> [ 7250.875956]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
>>>> [ 7250.875966]  __x64_sys_ioctl+0x8e/0xd0
>>>> [ 7250.875970]  do_syscall_64+0x3a/0xd0
>>>> [ 7250.875974]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> [ 7250.875980] RIP: 0033:0x7f20b7cd4017
>>>> [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>>> [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017
>>>> [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003
>>>> [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320
>>>> [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110
>>>> [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000
>>>> [ 7250.876002]  </TASK>
>>>> [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds.
>>>> [ 7250.876046]       Not tainted 5.16.0-rc2-release+ #16
>>>> [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [ 7250.876129] task:dmsetup         state:D stack:    0 pid: 1987 ppid:  1385 flags:0x00000000
>>>> [ 7250.876134] Call Trace:
>>>> [ 7250.876136]  <TASK>
>>>> [ 7250.876138]  __schedule+0x330/0x8b0
>>>> [ 7250.876142]  schedule+0x4e/0xc0
>>>> [ 7250.876145]  schedule_timeout+0x20f/0x2e0
>>>> [ 7250.876149]  ? __queue_work+0x226/0x420
>>>> [ 7250.876156]  wait_for_completion+0x8c/0xf0
>>>> [ 7250.876160]  __synchronize_srcu.part.19+0x92/0xc0
>>>> [ 7250.876167]  ? __bpf_trace_rcu_stall_warning+0x10/0x10
>>>> [ 7250.876173]  ? dm_swap_table+0x2f4/0x310 [dm_mod]
>>>> [ 7250.876185]  dm_swap_table+0x2f4/0x310 [dm_mod]
>>>> [ 7250.876198]  ? table_load+0x360/0x360 [dm_mod]
>>>> [ 7250.876207]  dev_suspend+0x95/0x250 [dm_mod]
>>>> [ 7250.876217]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
>>>> [ 7250.876231]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
>>>> [ 7250.876240]  __x64_sys_ioctl+0x8e/0xd0
>>>> [ 7250.876244]  do_syscall_64+0x3a/0xd0
>>>> [ 7250.876247]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> [ 7250.876252] RIP: 0033:0x7f15e9254017
>>>> [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>>> [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017
>>>> [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003
>>>> [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0
>>>> [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110
>>>> [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001
>>>> [ 7250.876269]  </TASK>
>>>>
>>>> Fix this by allowing metadata operations triggered by user space to run
>>>> in the corresponding user space thread, instead of queueing them for
>>>> execution by the dm-era worker thread.
>>>
>>> Allowing them to run while the device is suspended is _not_ the
>>> correct way to work around this deadlock situation. I think it'd be
>>> useful to understand why your userspace tools are allowing these
>>> messages and status to a device that is suspended.
>>>
>>
>> Ack.
>>
>> The sequence of operations I provided is just a way to easily reproduce
>> the deadlock. The exact issue I am facing is the following:
>>
>> 1. Create device with dm-era target
>>
>>    # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"
>>
>> 2. Load new table to device, e.g., to resize the device or snapshot it.
>>
>>    # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"
>>
>> 3. Suspend the device
>>
>>    # dmsetup suspend eradev
>>
>> 4. Someone else, e.g., a user or a monitoring daemon, runs an LVM
>>    command at this point, e.g. 'vgs'.
>>
>> 5. 'vgs' tries to retrieve the status of the dm-era device using the
>>    DM_TABLE_STATUS ioctl, and blocks, because the command is queued for
>>    execution by the worker thread, which is suspended while the device
>>    is suspended.
>>
>>    Note that LVM uses the DM_NOFLUSH_FLAG, but this doesn't make a
>>    difference for dm-era, since the "problem" is not that it writes to
>>    the metadata, but that it queues the metadata read operation to the
>>    worker thread.
>>
>> 6. Resume the device: This deadlocks.
>>
>>    # dmsetup resume eradev
>>
>> So, I am not the one retrieving the status of the suspended device. LVM
>> is. LVM, when running commands like 'lvs' and 'vgs', retrieves the
>> status of the devices on the system using the DM_TABLE_STATUS ioctl.
>>
>> The deadlock is a race that happens when someone runs an LVM command at
>> the "wrong" time.
>
> I think dm-era shouldn't be disallowing work items while suspended,
> e.g.:
>
> diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
> index e92c1afc3677..33ea2c2374c7 100644
> --- a/drivers/md/dm-era-target.c
> +++ b/drivers/md/dm-era-target.c
> @@ -1175,7 +1175,6 @@ struct era {
>  	struct list_head rpc_calls;
>
>  	struct digest digest;
> -	atomic_t suspended;
>  };
>
>  struct rpc {
> @@ -1219,8 +1218,7 @@ static void remap_to_origin(struct era *era, struct bio *bio)
>   *--------------------------------------------------------------*/
>  static void wake_worker(struct era *era)
>  {
> -	if (!atomic_read(&era->suspended))
> -		queue_work(era->wq, &era->worker);
> +	queue_work(era->wq, &era->worker);
>  }
>
>  static void process_old_eras(struct era *era)
> @@ -1392,17 +1390,6 @@ static int in_worker1(struct era *era,
>  	return perform_rpc(era, &rpc);
>  }
>
> -static void start_worker(struct era *era)
> -{
> -	atomic_set(&era->suspended, 0);
> -}
> -
> -static void stop_worker(struct era *era)
> -{
> -	atomic_set(&era->suspended, 1);
> -	drain_workqueue(era->wq);
> -}
> -
>  /*----------------------------------------------------------------
>   * Target methods
>   *--------------------------------------------------------------*/
> @@ -1569,7 +1556,7 @@ static void era_postsuspend(struct dm_target *ti)
>  		/* FIXME: fail mode */
>  	}
>
> -	stop_worker(era);
> +	drain_workqueue(era->wq);
>
>  	r = metadata_commit(era->md);
>  	if (r) {
> @@ -1600,8 +1587,6 @@ static int era_preresume(struct dm_target *ti)
>  		era->nr_blocks = new_size;
>  	}
>
> -	start_worker(era);
> -
>  	r = in_worker0(era, metadata_era_rollover);
>  	if (r) {
>  		DMERR("%s: metadata_era_rollover failed", __func__);
>
> During suspend it should certainly flush all works; but I fail to see
> the value in disallowing the target's main way to do work (even while
> suspended). Maybe Joe has insight on why dm-era was written this way.
>
> But as an example dm-cache and dm-thin don't have such a strong
> restriction.
>

The worker thread does the following:

1. Process old eras, i.e., digest the archived writesets into the main
   era array. This doesn't involve a metadata commit, but it does write
   to the metadata.

2. Process deferred bios. This might trigger a metadata commit in
   general, but when the device is suspended no bios should be reaching
   the target, so it should be a no-op.

3. Process RPC calls. This involves 'status' and 'message' operations.

process_rpc_calls() does commit the metadata, after running all RPC
calls. Checking if the device is suspended here is a way to prevent the
metadata commit. But some of the 'message' operations also commit the
metadata, e.g., the metadata_take_snap message. I understand that no one
should be sending messages to a suspended device, but it's not enforced,
so it could happen.

I believe the value of suspending the worker when the device is
suspended is to prevent anyone from writing to the metadata. Writing to
the metadata, while the device is suspended, could cause problems.
More on that in the following comments.

>> Using an appropriate LVM 'global_filter' is a way to prevent this, but:
>>
>> 1. This is just a workaround, not a proper solution.
>> 2. This is not always an option. Imagine someone running an LVM command
>>    in a container, for example. Or, we may not be allowed to change the
>>    LVM configuration of the host at all.
>>
>>> FYI, status shouldn't trigger write IOs if the device is suspended.
>>> Both dm-cache and dm-thin have this in status as a guard around their
>>> work to ensure updated accounting (as a side-effect of committing
>>> metadata), e.g.:
>>>
>>> 	/* Commit to ensure statistics aren't out-of-date */
>>> 	if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti))
>>> 		(void) commit(cache, false);
>>
>> Ack. The current behavior of dm-era wrt the 'status' command is:
>>
>> 1. Queue the 'metadata_get_stats()' function for execution by the
>>    worker thread.
>> 2. The worker runs the function and retrieves the statistics.
>> 3. The worker commits the metadata _after_ retrieving the statistics.
>>
>> Shall I just change 'era_status()' to read the metadata directly and
>> skip the metadata commit, instead of queuing the operation to the
>> worker thread, when the device is suspended?
>
> dm-era should still respect that the device is suspended (disallow
> metadata commits, etc).
>
> Probably should just do like the dm-cache code I quoted above (dm-thin
> does the same): do not commit if suspended.
>
> Seems to me that if simply retrieving stats in era_status() results in
> a metadata commit as a side-effect of its work, that is pretty bogus --
> process_rpc_calls() must be what does it, just due to generic code?
>
> If so, yes, absolutely avoid that metadata commit (might we just update
> process_rpc_calls() to check dm_suspended()?)
>

Yes, the metadata commit is due to the generic code in
process_rpc_calls(), and could be avoided if we check the device's
suspended state there.
But avoiding the metadata commit when the device is suspended is not
enough. No one should be writing to the metadata at all. Otherwise, we
risk hitting the bug fixed by commit 9ae6e8b1c9bbf6 ("dm era: commit
metadata in postsuspend after worker stops").

In short, writing to the metadata of a suspended dm-era device results
in the committed metadata getting out-of-sync with the in-core metadata.
This could result in the corruption of the on-disk metadata if we load a
new dm-era table to the device, and destroy the old one (the destructor
flushes any cached writes to the disk).

For an exact sequence of events that can lead to this issue, please see
the commit mentioned above (9ae6e8b1c9bbf6).

> Sorry for such fundamental concerns, I really should've caught this
> stuff back when dm-era was introduced!
>
> Really does feel like a read-write lock for metadata might be
> warranted.
>
>> I think we would still need a lock to synchronize access to the
>> metadata, since, if I am not mistaken, the pre-resume hook that starts
>> the worker runs before clearing the DMF_SUSPENDED flag, and can run
>> concurrently with the 'status' IOCTL. So, the worker might run
>> concurrently with a 'status' operation that sees the device as
>> suspended.
>
> I was initially thinking removing dm-era's suspended state (like the
> above patch) should take care of the pitfalls of disallowing works due
> to that state.
>
> But the dm-era code should be audited to make sure the big hammer of
> disallowing works while suspended wasn't important for its
> implementation (and the need to disallow device changes while
> suspended).
>

I think a simple solution, without modifying dm-era too much, is:

1. Have status run in the user space thread, instead of queueing it to
   the worker.
2. Add a mutex to avoid status running concurrently with the worker
   thread.
3. Have status check if the device is suspended. If it is not, commit
   the metadata before retrieving the status. If it is, skip the
   metadata commit.
This is in line with what the thin and cache targets do.

Doing this avoids the deadlock in case someone, e.g., an LVM command,
runs a status operation against a suspended dm-era device.

User space could trigger the deadlock if it sends a message to a
suspended dm-era device, but, since this should never happen, it
shouldn't affect existing workflows.

Still, it might be a good idea to enforce this restriction, i.e.,
disallow running message operations on suspended devices, but I guess
this requires more discussion.

Nikos.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2023-02-07  8:10 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-18 12:29 [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target Nikos Tsironis
2023-01-18 12:29 ` [dm-devel] [PATCH 1/2] dm era: allocate in-core writesets when loading metadata Nikos Tsironis
2023-01-18 12:29 ` [dm-devel] [PATCH 2/2] dm era: avoid deadlock when swapping table Nikos Tsironis
2023-01-18 16:28 ` [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target Mike Snitzer
2023-01-19  9:36   ` Nikos Tsironis
2023-01-19 12:58     ` Zdenek Kabelac
2023-01-19 14:08       ` Nikos Tsironis
2023-01-23 17:34     ` Mike Snitzer
2023-01-25 12:37       ` Nikos Tsironis
2023-01-26  0:06         ` Mike Snitzer
2023-01-31 11:01           ` Nikos Tsironis
2023-01-31 20:20             ` Mike Snitzer
2023-02-07  8:10               ` Nikos Tsironis
2023-01-25 12:45   ` Nikos Tsironis