* ncme-tcp: io_work NULL pointer when racing with queue stop
@ 2022-01-18 18:36 Chris Leech
  2022-01-18 23:32 ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread

From: Chris Leech @ 2022-01-18 18:36 UTC (permalink / raw)
To: linux-nvme

Hi,

I'm looking at a reported NULL pointer fault with controller reset
testing under load, and it appears that the nvme_tcp_stop_queue
sequence is racing against scheduled io_work. The fault occurs when
kernel_sendpage is called from the workqueue context and hits a NULL
sock->ops pointer, which is cleared by kernel_sock_shutdown before the
call to cancel_work_sync.

[ 4353.145377] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[ 4353.151921] RIP: 0010:kernel_sendpage+0x11/0xc0
[ 4353.157005] Code: ff ff ff 31 db e9 70 ff ff ff be ff ff ff ff e9 49 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 18 48 8b 47 20 <4c> 8b 88 a0 00 00 00 4d 85 c9 74 3b 48 8b 7e 08 48 8d 47 ff 83 e7
[ 4353.177985] RSP: 0000:ffffb5870a7b7d68 EFLAGS: 00010292
[ 4353.189185] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000048
[ 4353.189196] RDX: 0000000000000000 RSI: ffffdffa25000000 RDI: ffff981b45de3400
[ 4353.189199] RBP: ffff981b45de3400 R08: 00000000000000c0 R09: ffff98182073b0ec
[ 4353.189202] R10: 000000000000000c R11: 000000000000000c R12: ffff9817c4885e38
[ 4353.226418] R13: 0000000000000048 R14: ffff9817d0018000 R15: 0000000000000000
[ 4353.234393] FS:  0000000000000000(0000) GS:ffff981b2fa00000(0000) knlGS:0000000000000000
[ 4353.243442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4353.249871] CR2: 00000000000000a0 CR3: 00000001307ce004 CR4: 00000000003706f0
[ 4353.257850] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4353.265824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4353.273798] Call Trace:
[ 4353.276554]  nvme_tcp_try_send_cmd_pdu+0xf4/0x250 [nvme_tcp]
[ 4353.282901]  ? tcp_rcv_space_adjust+0x1d/0x1c0
[ 4353.287888]  nvme_tcp_try_send+0x15a/0x330 [nvme_tcp]
[ 4353.293553]  ? release_sock+0x43/0x90
[ 4353.297666]  ? nvme_tcp_try_recv+0x73/0xa0 [nvme_tcp]
[ 4353.303334]  nvme_tcp_io_work+0x37/0x90 [nvme_tcp]
[ 4353.308723]  process_one_work+0x1e3/0x3b0
[ 4353.313237]  worker_thread+0x50/0x3b0
[ 4353.317335]  ? rescuer_thread+0x370/0x370
[ 4353.321813]  kthread+0x149/0x170
[ 4353.325440]  ? set_kthread_struct+0x40/0x40
[ 4353.330128]  ret_from_fork+0x22/0x30

It would seem easy enough to cancel the io_work before
kernel_sock_shutdown, but I think that still leaves an opportunity for
it to be rescheduled. The TCP callbacks do all appear to be protected
by the sk_callback_lock and checks against sk_user_data, so it may be
possible to restore those first? I'm not sure on the best way to
safely shut down the socket in this case.

I'll try and get some testing on this change, but wanted to get this
on the list and see what other suggestions you might have.

Thanks!
Chris Leech

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4ceb28675fdf..d9e1f875f80d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1567,9 +1567,9 @@ static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
 
 static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 {
-	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 	nvme_tcp_restore_sock_calls(queue);
 	cancel_work_sync(&queue->io_work);
+	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 }

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: ncme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-18 18:36 ncme-tcp: io_work NULL pointer when racing with queue stop Chris Leech
@ 2022-01-18 23:32 ` Sagi Grimberg
  2022-01-19  1:31   ` Chris Leech
  0 siblings, 1 reply; 8+ messages in thread

From: Sagi Grimberg @ 2022-01-18 23:32 UTC (permalink / raw)
To: Chris Leech, linux-nvme

> Hi,

Hey Chris,

> I'm looking at a reported NULL pointer fault with controller reset
> testing under load

Just to understand, is this upstream kernel?

> and it appears that the nvme_tcp_stop_queue
> sequence is racing against scheduled io_work. The fault occurs when
> kernel_sendpage is called from the workqueue context and hits a NULL
> sock->ops pointer, which is cleared by kernel_sock_shutdown before the
> call to cancel_work_sync.

Hmm, can you point me to where kernel_sock_shutdown is clearing
sock->ops? The assumption nvme-tcp makes is that accessing the socket
only becomes invalid after sock_release(); it is still valid after
kernel_sock_shutdown().

The sequence follows this logic:
1. kernel_sock_shutdown(SHUT_RDWR) - guarantee we do not send or
   receive anything to/from the wire when any of the driver contexts
   access the socket from this point on.
2. nvme_tcp_restore_sock_calls() - guarantee that we are not woken up
   by any socket event, incoming data, or buffer space becoming
   available to send, from this point on.
3. cancel_work_sync(io_work) - cancel and fence io_work, and make sure
   that it is not self-requeued. So in essence, guarantee that io_work
   is not triggered again (not by self-requeue, nor by socket
   callbacks).

Here we should be able to safely continue to release the socket
in nvme_tcp_free_queue, as there should be no way anything can access
the socket concurrently.

The only places where io_work is triggered from the code are:
1. .queue_rq
2. socket event (data_ready/write_space)
3. from io_work itself (quota exhausted, or an r2t was received)

So (2)+(3) should be prevented by nvme_tcp_stop_queue as outlined
above, and (1) is prevented by quiescing the request_queue, which
happens earlier in the reset sequence (nvme_tcp_teardown_io_queues
calls nvme_stop_queues before nvme_tcp_stop_io_queues).

So this is the logic that nvme-tcp follows to safely reset the
controller during traffic/load.

I have to say that if kernel_sock_shutdown() is clearing sock->ops
then we clearly have a wrong assumption, but I'm surprised that it
even works at all...

However, I did not find evidence of kernel_sock_shutdown() clearing
sock->ops, or under which conditions it would.

> [... register dump and call trace trimmed ...]
>
> It would seem easy enough to cancel the io_work before
> kernel_sock_shutdown, but I think that still leaves an opportunity for
> it to be rescheduled.

How would it be rescheduled? By whom?

> The TCP callbacks do all appear to be protected
> by the sk_callback_lock and checks against sk_user_data, so it may be
> possible to restore those first? I'm not sure on the best way to
> safely shut down the socket in this case.
>
> I'll try and get some testing on this change, but wanted to get this
> on the list and see what other suggestions you might have.
>
> Thanks!
> Chris Leech
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 4ceb28675fdf..d9e1f875f80d 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1567,9 +1567,9 @@ static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
>
>  static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
>  {
> -	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>  	nvme_tcp_restore_sock_calls(queue);
>  	cancel_work_sync(&queue->io_work);
> +	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>  }

On the surface, this seems harmless (other than redundantly
sending/receiving data to/from the wire while we are terminating
the queue). But it is unclear to me why this would fix anything...

Obviously from the bug report, it seems that there is something
that is not taken into account, but I don't yet see what it is,
or how it would be fixed by this patch.
* Re: ncme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-18 23:32 ` Sagi Grimberg
@ 2022-01-19  1:31   ` Chris Leech
  2022-01-19  6:33     ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread

From: Chris Leech @ 2022-01-19 1:31 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme

On Tue, Jan 18, 2022 at 3:32 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > Hi,
>
> Hey Chris,
>
> > I'm looking at a reported NULL pointer fault with controller reset
> > testing under load
>
> Just to understand, is this upstream kernel?

This was a CentOS Stream 9 kernel, which should be largely in sync
with upstream currently.

> > and it appears that the nvme_tcp_stop_queue
> > sequence is racing against scheduled io_work. The fault occurs when
> > kernel_sendpage is called from the workqueue context and hits a NULL
> > sock->ops pointer, which is cleared by kernel_sock_shutdown before the
> > call to cancel_work_sync.
>
> Hmm, can you point me to where kernel_sock_shutdown is clearing
> sock->ops? The assumption nvme-tcp makes is that accessing the socket
> only becomes invalid after sock_release(); it is still valid after
> kernel_sock_shutdown().

Looks like I need to apologize, Sagi. I'm not sure where I got turned
around looking at the socket code, but you're right, and
kernel_sock_shutdown doesn't look to be the culprit here.

Thank you for the following detailed description. I'm going to go back
to my crash report and take another look at this one.

- Chris

> The sequence follows this logic:
> 1. kernel_sock_shutdown(SHUT_RDWR) - guarantee we do not send or
>    receive anything to/from the wire when any of the driver contexts
>    access the socket from this point on.
> 2. nvme_tcp_restore_sock_calls() - guarantee that we are not woken up
>    by any socket event, incoming data, or buffer space becoming
>    available to send, from this point on.
> 3. cancel_work_sync(io_work) - cancel and fence io_work, and make sure
>    that it is not self-requeued. So in essence, guarantee that io_work
>    is not triggered again (not by self-requeue, nor by socket
>    callbacks).
>
> [... remainder of the description and quoted trace/patch trimmed ...]
* Re: ncme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-19  1:31 ` Chris Leech
@ 2022-01-19  6:33   ` Sagi Grimberg
  2022-01-27 23:05     ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread

From: Sagi Grimberg @ 2022-01-19 6:33 UTC (permalink / raw)
To: Chris Leech; +Cc: linux-nvme

> On Tue, Jan 18, 2022 at 3:32 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>> I'm looking at a reported NULL pointer fault with controller reset
>>> testing under load
>>
>> Just to understand, is this upstream kernel?
>
> This was a CentOS Stream 9 kernel, which should be largely in sync
> with upstream currently.

Understood, can you perhaps share a reference to the bug report?

> [...]
>
> Looks like I need to apologize, Sagi. I'm not sure where I got turned
> around looking at the socket code, but you're right, and
> kernel_sock_shutdown doesn't look to be the culprit here.
>
> Thank you for the following detailed description. I'm going to go back
> to my crash report and take another look at this one.

No worries Chris, perhaps I can assist.

Is the dmesg log prior to the BUG available? Does it tell us anything
about what was going on leading up to this?
Any more information about the test case? (load + controller reset)
Is the reset in a loop? Any more info about the load?
Any other 'interference' during the test?
How reproducible is this?
Is this Linux nvmet as the controller?
How many queues does the controller have? (it will help me understand
how easy it is to reproduce on a vm setup)

Cheers,
Sagi
* Re: ncme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-19  6:33 ` Sagi Grimberg
@ 2022-01-27 23:05   ` Sagi Grimberg
  2022-01-28  3:25     ` Chris Leech
  0 siblings, 1 reply; 8+ messages in thread

From: Sagi Grimberg @ 2022-01-27 23:05 UTC (permalink / raw)
To: Chris Leech; +Cc: linux-nvme

>> Thank you for the following detailed description. I'm going to go back
>> to my crash report and take another look at this one.
>
> No worries Chris, perhaps I can assist.
>
> Is the dmesg log prior to the BUG available? Does it tell us anything
> about what was going on leading up to this?
> Any more information about the test case? (load + controller reset)
> Is the reset in a loop? Any more info about the load?
> Any other 'interference' during the test?
> How reproducible is this?
> Is this Linux nvmet as the controller?
> How many queues does the controller have? (it will help me understand
> how easy it is to reproduce on a vm setup)

I took another look at the code and I think I see how io_work may be
triggered after a socket was released. The issue might be the
.submit_async_event callback from the core.

When we start a reset, the first thing we do is stop the pending
work elements that may trigger io by calling nvme_stop_ctrl, and
then we continue to tear down the I/O queues and then the admin
queue (in nvme_tcp_teardown_ctrl).

So the sequence is:
nvme_stop_ctrl(ctrl);
nvme_tcp_teardown_ctrl(ctrl, false);

However, there is a possibility, after nvme_stop_ctrl but before we
tear down the admin queue, that the controller sends an AEN which is
processed by the host, and that processing includes automatically
submitting another AER, which in turn calls the driver with
.submit_async_event (instead of the normal .queue_rq, as AERs don't
have timeouts).

In nvme_tcp_submit_async_event we do not check that the controller or
queue state is ready to accept a new submission like we do in
.queue_rq, so we blindly prepare the AER cmd, queue it, and schedule
io_work, but at this point I don't see what guarantees that the queue
(e.g. the socket) is not released.

Unless I'm missing something, this flow will trigger a use-after-free
when io_work attempts to access the socket.

I see we also don't flush the async_event_work in the error recovery
flow, which we probably should, so we can avoid such a race.

I think that the below patch should address the issue:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 96725c3f1e77..bf380ca0e0d1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2097,6 +2097,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 
 	nvme_auth_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	flush_work(&ctrl->async_event_work);
 	nvme_tcp_teardown_io_queues(ctrl, false);
 	/* unquiesce to fail fast pending requests */
 	nvme_start_queues(ctrl);
@@ -2212,6 +2213,10 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
 	struct nvme_command *cmd = &pdu->cmd;
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+
+	if (ctrl->ctrl.state != NVME_CTRL_LIVE || !queue_ready)
+		return;
 
 	memset(pdu, 0, sizeof(*pdu));
 	pdu->hdr.type = nvme_tcp_cmd;
--

Chris, can you take this for some testing?
* Re: ncme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-27 23:05 ` Sagi Grimberg
@ 2022-01-28  3:25   ` Chris Leech
  2022-01-28 23:55     ` Chris Leech
  0 siblings, 1 reply; 8+ messages in thread

From: Chris Leech @ 2022-01-28 3:25 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme

Thanks Sagi, this looks promising.

It also might fit with a new backtrace I was just looking at from the
same testing, where nvme_tcp_submit_async_event hit a NULL
ctrl->async_req.pdu, which I can only see happening if it was racing
with nvme_tcp_error_recovery_work.

I'll get this into some testing here at Red Hat and let you know the
results.

- Chris

On Thu, Jan 27, 2022 at 3:05 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> I took another look at the code and I think I see how io_work may be
> triggered after a socket was released. The issue might be the
> .submit_async_event callback from the core.
>
> [... analysis and patch trimmed ...]
>
> Chris, can you take this for some testing?
* Re: ncme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-28  3:25 ` Chris Leech
@ 2022-01-28 23:55   ` Chris Leech
  2022-01-30  8:47     ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread

From: Chris Leech @ 2022-01-28 23:55 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme

This completed a day of automated testing without any failures.

Tested-by: Chris Leech <cleech@redhat.com>

On Thu, Jan 27, 2022 at 7:25 PM Chris Leech <cleech@redhat.com> wrote:
>
> Thanks Sagi, this looks promising.
>
> It also might fit with a new backtrace I was just looking at from the
> same testing, where nvme_tcp_submit_async_event hit a NULL
> ctrl->async_req.pdu, which I can only see happening if it was racing
> with nvme_tcp_error_recovery_work.
>
> I'll get this into some testing here at Red Hat and let you know the
> results.
>
> [... quoted analysis and patch trimmed ...]
* Re: ncme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-28 23:55 ` Chris Leech
@ 2022-01-30  8:47   ` Sagi Grimberg
  0 siblings, 0 replies; 8+ messages in thread

From: Sagi Grimberg @ 2022-01-30 8:47 UTC (permalink / raw)
To: Chris Leech; +Cc: linux-nvme

> This completed a day of automated testing without any failures.
>
> Tested-by: Chris Leech <cleech@redhat.com>

Nice, I take it that the driver did not survive this test before?

btw, nvme-rdma needs the exact same fix. Will send a small patchset
for it.