DM-Devel Archive on lore.kernel.org
* [dm-devel] use after free with blk_mq_update_nr_requests racing with do_resume/dm_swap_table
@ 2020-10-13 14:56 Aaron Knister
From: Aaron Knister @ 2020-10-13 14:56 UTC (permalink / raw)
  To: dm-devel

Hi All,

I've been chasing down a kernel panic [2] on a RHEL7 kernel that stems 
from blk_done_softirq calling dm_done, which attempts to reference 
tio->ti after it has been freed, leading to the panic.

It took me a while to figure out what was going on, but it appears that 
blk_mq_update_nr_requests is racing with do_resume/dm_swap_table. 
blk_mq_update_nr_requests un-quiesces the queue when it completes, but 
this un-quiesce can land after do_resume->dm_swap_table has quiesced 
the queues and before md->immutable_table has been updated with the new 
table reference. An I/O can then sneak in and pick up tio->ti from the 
old md->immutable_table, which is freed after the call to dm_swap_table; 
when the softirq handler later fires, tio->ti has been freed and we end 
up with a panic.

I added some timers and debug printk's into the code; hopefully the 
race is clearer in the annotated trace below [1].

I'm sure there are a myriad of approaches one could take to fix this, 
but it seems to me that the patch here:

http://lists.infradead.org/pipermail/linux-nvme/2017-December/014447.html

which allows concurrent calls to blk_mq_quiesce_queue, would address 
this behavior.

While I haven't tested this on a mainline kernel, it seems to me that 
the race still exists there. I'll also raise this through Red Hat, but 
since it appears to be an upstream issue as well, I thought a post here 
would be appropriate.

I'm curious, what's the state of that patch? I share Christoph's 
concern about the nvme driver being able to forcibly un-quiesce the 
queue, and I don't have any suggestions to offer, but it does seem to 
me that a quiesce operation isn't particularly useful if another 
caller, with no knowledge of the original quiescer, can undo it. There 
seem to be two scenarios for the quiesce/unquiesce operations today:

1. a momentary pause where the quiesce/resume occur within the same 
function to change queue parameters (this seems to be most situations)

2. a longer pause to keep requests from being dispatched while device 
recovery occurs (this seems to me to be the nvme case)

To me the 2nd case is incorrect: one ought to distinguish between 
quiesced and stopped (e.g. quiesce the queue in order to then stop it, 
rather than using the quiesce itself as the stop mechanism).

-Aaron

[1]-

nr_requests [15679.617750] blk_mq_update_nr_requests: q=ffff99491720e988 
enter

do_resume   [15679.639846] do_resume[1015]: md=ffff99491e7ed000 
new_map=ffff99491dbc8800
do_resume   [15679.639848]      dm_suspend[1015]: md=ffff99491e7ed000 
suspend_flags=2 enter
do_resume   [15679.639850]        __dm_suspend[1015]: enter 
md=ffff99491e7ed000 md->flags=64 map=ffff99491dbc9600 suspend_flags=2 
task_state=1 dmf_suspended_flags=1

nr_requests [15679.639860]   blk_mq_quiesce_queue: 
q=ffff99491720e988       \__ one is from dm_suspend, the other from 
update_nr_requests
do_resume   [15679.641717]         blk_mq_quiesce_queue: 
q=ffff99491720e988 /   not sure which is which, but doesn't matter much

do_resume   [15679.666672]       dm_mq_stop_queue: q=ffff99491720e988 
stopped
do_resume   [15679.666684]     __dm_suspend[1015]: return end 
md=ffff99491e7ed000 map=ffff99491dbc9600 rc=0
do_resume   [15679.666688]   dm_suspend[1015]: md=ffff99491e7ed000 
suspend_flags=2 exit r=0
do_resume   [15679.666690]   dm_swap_table: enter md=ffff99491e7ed000 
md->flags=67 md->immutable_table=ffffb15b06d8d040 table=ffff99491dbc8800
do_resume                       dm_mq_stop_queue enter

nr_requests [15679.666695]       blk_mq_quiesce_queue: q=ffff99491720e988
nr_requests [15679.688675] blk_mq_unquiesce_queue: q=ffff99491720e988

do_resume    15679.694996073 = dm_start_time
do_resume    15679.695013154 = dm_mq_queue_rq_ns
do_resume    15679.695013194 = init_tio_ns (set tio->ti = ti; ti is 
md->immutable_table (ffffb15b06d8d040))

do_resume   [15679.695678]     dm_mq_stop_queue: q=ffff99491720e988 stopped
do_resume   [15679.695756]   dm_swap_table: md=ffff99491e7ed000 
md->flags=67 (orig 67) md->immutable_table=ffffb15b06d93040 (orig 
ffffb15b06d8d040) old=ffff99491dbc8800 new=ffff99491dbc9600 
table->type=3 (orig 3) suspended=1
do_resume   [15679.695758]   dm_resume[1015]: md=ffff99491e7ed000 enter
do_resume   [15679.695761] dm_mq_start_queue: q=ffff99491720e988 resuming
do_resume   [15679.695763] blk_mq_unquiesce_queue: q=ffff99491720e988
do_resume   [15679.708229]   dm_resume[1015]: md=ffff99491e7ed000 normal 
rc=0
do_resume                      tio->ti destroyed (ffffb15b06d8d040)
do_resume   [15679.708247] do_resume[1015]: md=ffff99491e7ed000 
new_map=ffff99491dbc8800 rc=0

softirq handler fires==>
PANIC [15679.708629] BUG: unable to handle kernel paging request at 
ffffb15b06d8d040


[2]-

[15679.708629] BUG: unable to handle kernel paging request at 
ffffb15b06d8d040
[15679.723104] IP: [<ffffffffc0195d9d>] dm_softirq_done+0x5d/0x310 [dm_mod]
[15679.731099] PGD 17fd65067 PUD 8bfc02067 PMD 81d650067 PTE 0
[15679.737841] Oops: 0000 [#1] SMP
[15679.741954] Modules linked in: 
stap_dcf9a1b1b114d97436231b91be6919e_24275(OE) dm_round_robin 
ib_iser(OE) libiscsi scsi_transport_iscsi rpcsec_gss_krb5 auth_rpcgss 
nfsv4 dns_resolver nfs lockd grace fscache lnet(OE) dell_rbu libcfs(OE) 
rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) 
ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) 
sunrpc iTCO_wdt iTCO_vendor_support sb_edac coretemp intel_rapl iosf_mbi 
kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw 
gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich sg mei_me 
mei wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter 
pcc_cpufreq dm_multipath binfmt_misc knem(OE) ip_tables ext4 mbcache 
jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_ib(OE) ib_uverbs(OE) 
ib_core(OE) sr_mod
[15679.839498]  cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci mlx5_core(OE) 
libata crct10dif_pclmul tg3 crct10dif_common mlxfw(OE) vfio_mdev(OE) 
vfio_iommu_type1 crc32c_intel vfio megaraid_sas mdev(OE) devlink 
mlx_compat(OE) drm_panel_orientation_quirks ptp pps_core dm_mirror 
dm_region_hash dm_log dm_mod(OE) [last unloaded: 
stap_18d5e581b9a87f90fa46e5292bea9ac_14254]
[15679.883540] CPU: 3 PID: 26 Comm: ksoftirqd/3 Kdump: loaded Tainted: 
G           OE  ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
[15679.898105] Hardware name:    /0H47HH, BIOS 2.7.0 05/23/2018
[15679.905263] task: ffff9941b4e9c1c0 ti: ffff9941b4484000 task.ti: 
ffff9941b4484000
[15679.914473] RIP: 0010:[<ffffffffc0195d9d>] [<ffffffffc0195d9d>] 
dm_softirq_done+0x5d/0x310 [dm_mod]
[15679.925561] RSP: 0018:ffff9941b4487da0  EFLAGS: 00010202
[15679.932364] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 
dead000000000200
[15679.941212] RDX: ffff9941b4487dd8 RSI: ffff9941b4487dc8 RDI: 
ffffb15b06d8d040
[15679.950064] RBP: ffff9941b4487dc8 R08: ffff9948f1280980 R09: 
ffff99491f053900
[15679.958914] R10: 000000000000b731 R11: 000000000000000d R12: 
ffff9948df314d00
[15679.967767] R13: ffff9948f1280a80 R14: 0000000000000000 R15: 
0000000000000009
[15679.976631] FS:  0000000000000000(0000) GS:ffff99491f040000(0000) 
knlGS:0000000000000000
[15679.986581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15679.993913] CR2: ffffb15b06d8d040 CR3: 000000101d272000 CR4: 
00000000001607e0
[15680.002810] Call Trace:
[15680.006473]  [<ffffffffaed58d86>] blk_done_softirq+0x96/0xc0
[15680.013741]  [<ffffffffaeaa5305>] __do_softirq+0xf5/0x280
[15680.020714]  [<ffffffffaeaa54c8>] run_ksoftirqd+0x38/0x50
[15680.027693]  [<ffffffffaeacec24>] smpboot_thread_fn+0x144/0x1a0
[15680.035251]  [<ffffffffaeaceae0>] ? lg_double_unlock+0x40/0x40
[15680.042710]  [<ffffffffaeac61f1>] kthread+0xd1/0xe0
[15680.049100]  [<ffffffffaeac6120>] ? insert_kthread_work+0x40/0x40
[15680.056847]  [<ffffffffaf18dd37>] ret_from_fork_nospec_begin+0x21/0x21
[15680.065091]  [<ffffffffaeac6120>] ? insert_kthread_work+0x40/0x40
[15680.072848] Code: 01 00 00 4d 85 e4 0f 84 92 00 00 00 48 8b 43 48 41 
8b 5d 58 4d 8b ac 24 70 01 00 00 48 c1 e8 16 49 8b 7d 08 48 83 f0 01 83 
e0 01 <48> 83 3f 00 74 0c 48 8b 57 08 84 c0 4c 8b 42 60 75 41 83 fb 87
[15680.096566] RIP  [<ffffffffc0195d9d>] dm_softirq_done+0x5d/0x310 [dm_mod]
[15680.105181]  RSP <ffff9941b4487da0>
[15680.110091] CR2: ffffb15b06d8d040


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

