From: Nikos Tsironis <ntsironis@arrikto.com>
To: snitzer@kernel.org, agk@redhat.com, dm-devel@redhat.com
Cc: ejt@redhat.com, ntsironis@arrikto.com
Subject: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target
Date: Wed, 18 Jan 2023 14:29:44 +0200 [thread overview]
Message-ID: <20230118122946.20435-1-ntsironis@arrikto.com> (raw)
Under certain conditions, swapping a table, that includes a dm-era
target, with a new table, causes a deadlock.
This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.
dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.
So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.
If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.
The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.
When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.
Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock.
The only way to recover is by rebooting.
Steps to reproduce:
1. Create device with dm-era target
# dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"
2. Suspend the device
# dmsetup suspend eradev
3. Load new table to device, e.g., to resize the device. Note, that, we
must load the new table _after_ suspending the device to ensure the
constructor of the new target instance reads up-to-date metadata, as
committed by the post-suspend hook of dm-era.
# dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"
4. Device now has LIVE and INACTIVE tables
# dmsetup info eradev
Name: eradev
State: SUSPENDED
Read Ahead: 16384
Tables present: LIVE & INACTIVE
Open count: 0
Event number: 0
Major, minor: 252, 2
Number of targets: 1
5. Retrieve the status of the device. This blocks because the device is
suspended. Equivalently, any 'dmsetup message' operation would block
too. This command holds the SRCU read lock on md->io_barrier.
# dmsetup status eradev
6. Resume the device. The resume operation tries to swap the old table
with the new one and deadlocks, because it synchronizes SRCU for the
old table, while the blocked 'dmsetup status' holds the SRCU read
lock. And the old table is never resumed again at this point.
# dmsetup resume eradev
7. The relevant dmesg logs are:
[ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
[ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
[ 7250.875722] Not tainted 5.16.0-rc2-release+ #16
[ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000
[ 7250.875809] Call Trace:
[ 7250.875812] <TASK>
[ 7250.875816] __schedule+0x330/0x8b0
[ 7250.875827] schedule+0x4e/0xc0
[ 7250.875831] schedule_timeout+0x20f/0x2e0
[ 7250.875836] ? do_set_pte+0xb8/0x120
[ 7250.875843] ? prep_new_page+0x91/0xa0
[ 7250.875847] wait_for_completion+0x8c/0xf0
[ 7250.875854] perform_rpc+0x95/0xb0 [dm_era]
[ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era]
[ 7250.875867] ? era_iterate_devices+0x30/0x30 [dm_era]
[ 7250.875872] ? era_status+0x64/0x1e0 [dm_era]
[ 7250.875877] era_status+0x64/0x1e0 [dm_era]
[ 7250.875882] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875900] ? __mod_node_page_state+0x82/0xc0
[ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod]
[ 7250.875921] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875932] table_status+0x61/0xa0 [dm_mod]
[ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod]
[ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod]
[ 7250.875966] __x64_sys_ioctl+0x8e/0xd0
[ 7250.875970] do_syscall_64+0x3a/0xd0
[ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7250.875980] RIP: 0033:0x7f20b7cd4017
[ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017
[ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003
[ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320
[ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110
[ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000
[ 7250.876002] </TASK>
[ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds.
[ 7250.876046] Not tainted 5.16.0-rc2-release+ #16
[ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000
[ 7250.876134] Call Trace:
[ 7250.876136] <TASK>
[ 7250.876138] __schedule+0x330/0x8b0
[ 7250.876142] schedule+0x4e/0xc0
[ 7250.876145] schedule_timeout+0x20f/0x2e0
[ 7250.876149] ? __queue_work+0x226/0x420
[ 7250.876156] wait_for_completion+0x8c/0xf0
[ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0
[ 7250.876167] ? __bpf_trace_rcu_stall_warning+0x10/0x10
[ 7250.876173] ? dm_swap_table+0x2f4/0x310 [dm_mod]
[ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod]
[ 7250.876198] ? table_load+0x360/0x360 [dm_mod]
[ 7250.876207] dev_suspend+0x95/0x250 [dm_mod]
[ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod]
[ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod]
[ 7250.876240] __x64_sys_ioctl+0x8e/0xd0
[ 7250.876244] do_syscall_64+0x3a/0xd0
[ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7250.876252] RIP: 0033:0x7f15e9254017
[ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017
[ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003
[ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0
[ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110
[ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001
[ 7250.876269] </TASK>
Fix this by allowing metadata operations triggered by user space to run
in the corresponding user space thread, instead of queueing them for
execution by the dm-era worker thread.
This way, even if the device is suspended, the metadata operations will
run and release the SRCU read lock, preventing a subsequent table reload
(dm_swap_table()) from deadlocking.
To make this possible use a mutex to serialize access to the metadata
between the worker thread and the user space threads.
This is a follow up on https://listman.redhat.com/archives/dm-devel/2021-December/msg00044.html.
Nikos Tsironis (2):
dm era: allocate in-core writesets when loading metadata
dm era: avoid deadlock when swapping table
drivers/md/dm-era-target.c | 78 ++++++++++++++++++++++++++++++--------
1 file changed, 63 insertions(+), 15 deletions(-)
--
2.30.2
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
next reply other threads:[~2023-01-18 12:51 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-01-18 12:29 Nikos Tsironis [this message]
2023-01-18 12:29 ` [dm-devel] [PATCH 1/2] dm era: allocate in-core writesets when loading metadata Nikos Tsironis
2023-01-18 12:29 ` [dm-devel] [PATCH 2/2] dm era: avoid deadlock when swapping table Nikos Tsironis
2023-01-18 16:28 ` [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target Mike Snitzer
2023-01-19 9:36 ` Nikos Tsironis
2023-01-19 12:58 ` Zdenek Kabelac
2023-01-19 14:08 ` Nikos Tsironis
2023-01-23 17:34 ` Mike Snitzer
2023-01-25 12:37 ` Nikos Tsironis
2023-01-26 0:06 ` Mike Snitzer
2023-01-31 11:01 ` Nikos Tsironis
2023-01-31 20:20 ` Mike Snitzer
2023-02-07 8:10 ` Nikos Tsironis
2023-01-25 12:45 ` Nikos Tsironis
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230118122946.20435-1-ntsironis@arrikto.com \
--to=ntsironis@arrikto.com \
--cc=agk@redhat.com \
--cc=dm-devel@redhat.com \
--cc=ejt@redhat.com \
--cc=snitzer@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.