* nvme-tcp: io_work NULL pointer when racing with queue stop
@ 2022-01-18 18:36 Chris Leech
  2022-01-18 23:32 ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Leech @ 2022-01-18 18:36 UTC (permalink / raw)
  To: linux-nvme

Hi,

I'm looking at a reported NULL pointer fault with controller reset
testing under load, and it appears that the nvme_tcp_stop_queue
sequence is racing against scheduled io_work.  The fault occurs when
kernel_sendpage is called from the workqueue context and hits a NULL
sock->ops pointer, which is cleared by kernel_sock_shutdown before the
call to cancel_work_sync.
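
For reference, the faulting instruction matches the sock->ops dereference at
the very top of kernel_sendpage. This is a paraphrase of the net/socket.c
helper from roughly that kernel era (for illustration only, not an exact
copy):
--
/* Paraphrased from net/socket.c (approximate). */
int kernel_sendpage(struct socket *sock, struct page *page, int offset,
		    size_t size, int flags)
{
	/* With sock->ops == NULL, loading ops->sendpage faults here. */
	if (sock->ops->sendpage)
		return sock->ops->sendpage(sock, page, offset, size, flags);

	return sock_no_sendpage(sock, page, offset, size, flags);
}
--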

[ 4353.145377] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[ 4353.151921] RIP: 0010:kernel_sendpage+0x11/0xc0
[ 4353.157005] Code: ff ff ff 31 db e9 70 ff ff ff be ff ff ff ff e9
49 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 18 48
8b 47 20 <4c> 8b 88 a0 00 00 00 4d 85 c9 74 3b 48 8b 7e 08 48 8d 47 ff
83 e7
[ 4353.177985] RSP: 0000:ffffb5870a7b7d68 EFLAGS: 00010292
[ 4353.189185] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000048
[ 4353.189196] RDX: 0000000000000000 RSI: ffffdffa25000000 RDI: ffff981b45de3400
[ 4353.189199] RBP: ffff981b45de3400 R08: 00000000000000c0 R09: ffff98182073b0ec
[ 4353.189202] R10: 000000000000000c R11: 000000000000000c R12: ffff9817c4885e38
[ 4353.226418] R13: 0000000000000048 R14: ffff9817d0018000 R15: 0000000000000000
[ 4353.234393] FS:  0000000000000000(0000) GS:ffff981b2fa00000(0000)
knlGS:0000000000000000
[ 4353.243442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4353.249871] CR2: 00000000000000a0 CR3: 00000001307ce004 CR4: 00000000003706f0
[ 4353.257850] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4353.265824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4353.273798] Call Trace:
[ 4353.276554]  nvme_tcp_try_send_cmd_pdu+0xf4/0x250 [nvme_tcp]
[ 4353.282901]  ? tcp_rcv_space_adjust+0x1d/0x1c0
[ 4353.287888]  nvme_tcp_try_send+0x15a/0x330 [nvme_tcp]
[ 4353.293553]  ? release_sock+0x43/0x90
[ 4353.297666]  ? nvme_tcp_try_recv+0x73/0xa0 [nvme_tcp]
[ 4353.303334]  nvme_tcp_io_work+0x37/0x90 [nvme_tcp]
[ 4353.308723]  process_one_work+0x1e3/0x3b0
[ 4353.313237]  worker_thread+0x50/0x3b0
[ 4353.317335]  ? rescuer_thread+0x370/0x370
[ 4353.321813]  kthread+0x149/0x170
[ 4353.325440]  ? set_kthread_struct+0x40/0x40
[ 4353.330128]  ret_from_fork+0x22/0x30

It would seem easy enough to cancel the io_work before
kernel_sock_shutdown, but I think that still leaves an opportunity for
it to be rescheduled.  The TCP callbacks do all appear to be protected
by the sk_callback_lock and checks against sk_user_data, so it may be
possible to restore those first?  I'm not sure on the best way to
safely shut down the socket in this case.
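
For context, the callback pattern referred to above looks roughly like the
following; this is a from-memory sketch of the data_ready hook (details may
differ from the exact upstream code):
--
/* Sketch of nvme_tcp_data_ready (from memory; simplified). */
static void nvme_tcp_data_ready(struct sock *sk)
{
	struct nvme_tcp_queue *queue;

	read_lock_bh(&sk->sk_callback_lock);
	queue = sk->sk_user_data;
	/* Once sk_user_data is cleared, no further io_work is scheduled. */
	if (likely(queue && queue->rd_enabled))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	read_unlock_bh(&sk->sk_callback_lock);
}
--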

I'll try to get some testing on this change, but I wanted to get it
on the list and see what other suggestions you might have.

Thanks!
Chris Leech

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4ceb28675fdf..d9e1f875f80d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1567,9 +1567,9 @@ static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)

 static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 {
-       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
        nvme_tcp_restore_sock_calls(queue);
        cancel_work_sync(&queue->io_work);
+       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 }




* Re: nvme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-18 18:36 nvme-tcp: io_work NULL pointer when racing with queue stop Chris Leech
@ 2022-01-18 23:32 ` Sagi Grimberg
  2022-01-19  1:31   ` Chris Leech
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2022-01-18 23:32 UTC (permalink / raw)
  To: Chris Leech, linux-nvme


> Hi,

Hey Chris,

> I'm looking at a reported NULL pointer fault with controller reset
> testing under load

Just to understand, is this upstream kernel?

> and it appears that the nvme_tcp_stop_queue
> sequence is racing against scheduled io_work.  The fault occurs when
> kernel_sendpage is called from the workqueue context and hits a NULL
> sock->ops pointer, which is cleared by kernel_sock_shutdown before the
> call to cancel_work_sync.

Hmm, can you point me to where kernel_sock_shutdown clears
sock->ops? The assumption nvme-tcp makes is that accessing
the socket becomes invalid only after sock_release(), and is
still valid after kernel_sock_shutdown.
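
For reference, kernel_sock_shutdown is a thin wrapper along these lines
(paraphrased from net/socket.c); it delegates to the protocol's shutdown op
and does not touch sock->ops itself:
--
/* Paraphrased from net/socket.c (approximate). */
int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how)
{
	/* Delegates to the protocol; sock->ops is left intact. */
	return sock->ops->shutdown(sock, how);
}
--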

The sequence follows this logic:
1. kernel_sock_shutdown(SHUT_RDWR) - guarantee that we do not send or
receive anything to/from the wire if any of the driver contexts
access the socket from this point on.
2. nvme_tcp_restore_sock_calls() - guarantee that we are not woken up
by any socket event (incoming data, or buffer space becoming
available to send) from this point on.
3. cancel_work_sync(io_work) - cancel and fence io_work, and make sure
that it is not self-requeued. So in essence, guarantee that io_work is
not triggered again (neither by self-requeue nor by socket callbacks).

Here we should be able to safely continue to release the socket
in nvme_tcp_free_queue as there should be no way anything can access
the socket concurrently.
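
Step (2) above corresponds to roughly the following; a from-memory sketch
of nvme_tcp_restore_sock_calls, shown only to make the ordering concrete
(details may differ):
--
static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
{
	struct socket *sock = queue->sock;

	write_lock_bh(&sock->sk->sk_callback_lock);
	sock->sk->sk_user_data = NULL;	/* callbacks now bail out early */
	sock->sk->sk_data_ready = queue->data_ready;
	sock->sk->sk_state_change = queue->state_change;
	sock->sk->sk_write_space = queue->write_space;
	write_unlock_bh(&sock->sk->sk_callback_lock);
}
--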

The only places where io_work is triggered from the code are:
1. .queue_work
2. socket_event (data_ready/write_space)
3. from io_work itself (quota exhausted, or r2t was received)

So (2)+(3) should be prevented by nvme_tcp_stop_queue as outlined
above, and (1) is prevented by quiescing the request_queue, which
happens earlier in the reset sequence (nvme_tcp_teardown_io_queues
calls nvme_stop_queues before nvme_tcp_stop_io_queues).
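
For completeness, (1) is also gated by the readiness check at the top of
.queue_rq, sketched from memory below (exact helper names may differ):
--
/* Sketch of the .queue_rq entry check (from memory; command setup elided). */
static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct nvme_tcp_queue *queue = hctx->driver_data;
	struct request *rq = bd->rq;
	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);

	if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
		return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);

	/* ... prepare the command and call nvme_tcp_queue_request() ... */
	return BLK_STS_OK;
}
--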

So this is the logic that nvme-tcp follows to safely reset the
controller during traffic/load.

I have to say that if kernel_sock_shutdown() is clearing sock->ops
then we clearly have a wrong assumption, but then I'm surprised that
it works at all...

However, I did not find any evidence of kernel_sock_shutdown() clearing
sock->ops, nor of the conditions under which it would.

> 
> [ 4353.145377] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
> [ 4353.151921] RIP: 0010:kernel_sendpage+0x11/0xc0
> [ 4353.157005] Code: ff ff ff 31 db e9 70 ff ff ff be ff ff ff ff e9
> 49 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 18 48
> 8b 47 20 <4c> 8b 88 a0 00 00 00 4d 85 c9 74 3b 48 8b 7e 08 48 8d 47 ff
> 83 e7
> [ 4353.177985] RSP: 0000:ffffb5870a7b7d68 EFLAGS: 00010292
> [ 4353.189185] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000048
> [ 4353.189196] RDX: 0000000000000000 RSI: ffffdffa25000000 RDI: ffff981b45de3400
> [ 4353.189199] RBP: ffff981b45de3400 R08: 00000000000000c0 R09: ffff98182073b0ec
> [ 4353.189202] R10: 000000000000000c R11: 000000000000000c R12: ffff9817c4885e38
> [ 4353.226418] R13: 0000000000000048 R14: ffff9817d0018000 R15: 0000000000000000
> [ 4353.234393] FS:  0000000000000000(0000) GS:ffff981b2fa00000(0000)
> knlGS:0000000000000000
> [ 4353.243442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4353.249871] CR2: 00000000000000a0 CR3: 00000001307ce004 CR4: 00000000003706f0
> [ 4353.257850] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 4353.265824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 4353.273798] Call Trace:
> [ 4353.276554]  nvme_tcp_try_send_cmd_pdu+0xf4/0x250 [nvme_tcp]
> [ 4353.282901]  ? tcp_rcv_space_adjust+0x1d/0x1c0
> [ 4353.287888]  nvme_tcp_try_send+0x15a/0x330 [nvme_tcp]
> [ 4353.293553]  ? release_sock+0x43/0x90
> [ 4353.297666]  ? nvme_tcp_try_recv+0x73/0xa0 [nvme_tcp]
> [ 4353.303334]  nvme_tcp_io_work+0x37/0x90 [nvme_tcp]
> [ 4353.308723]  process_one_work+0x1e3/0x3b0
> [ 4353.313237]  worker_thread+0x50/0x3b0
> [ 4353.317335]  ? rescuer_thread+0x370/0x370
> [ 4353.321813]  kthread+0x149/0x170
> [ 4353.325440]  ? set_kthread_struct+0x40/0x40
> [ 4353.330128]  ret_from_fork+0x22/0x30
> 
> It would seem easy enough to cancel the io_work before
> kernel_sock_shutdown, but I think that still leaves an opportunity for
> it to be rescheduled.

How would it be rescheduled? By whom?

>  The TCP callbacks do all appear to be protected
> by the sk_callback_lock and checks against sk_user_data, so it may be
> possible to restore those first?  I'm not sure on the best way to
> safely shut down the socket in this case.
> 
> I'll try and get some testing on this change, but wanted to get this
> on the list and see what other suggestions you might have.
> 
> Thanks!
> Chris Leech
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 4ceb28675fdf..d9e1f875f80d 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1567,9 +1567,9 @@ static void nvme_tcp_restore_sock_calls(struct
> nvme_tcp_queue *queue)
> 
>   static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
>   {
> -       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>          nvme_tcp_restore_sock_calls(queue);
>          cancel_work_sync(&queue->io_work);
> +       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>   }

On the surface, this seems harmless (other than redundantly
sending/receiving data to/from the wire while we are terminating
the queue). But it is unclear to me why this would fix anything...

Obviously, from the bug report it seems that there is something
that is not taken into account, but I don't yet see what it is
or how this patch would fix it.



* Re: nvme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-18 23:32 ` Sagi Grimberg
@ 2022-01-19  1:31   ` Chris Leech
  2022-01-19  6:33     ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Leech @ 2022-01-19  1:31 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Tue, Jan 18, 2022 at 3:32 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > Hi,
>
> Hey Chris,
>
> > I'm looking at a reported NULL pointer fault with controller reset
> > testing under load
>
> Just to understand, is this upstream kernel?

This was a CentOS Stream 9 kernel, which should be largely in sync
with upstream currently.

> > and it appears that the nvme_tcp_stop_queue
> > sequence is racing against scheduled io_work.  The fault occurs when
> > kernel_sendpage is called from the workqueue context and hits a NULL
> > sock->ops pointer, which is cleared by kernel_sock_shutdown before the
> > call to cancel_work_sync.
>
> Hmm, can you point me to where kernel_sock_shutdown is clearing
> sock->ops? The assumption that nvme-tcp takes is that accessing
> the socket is not allowed only after sock_release() but
> still valid after kernel_sock_shutdown.

Looks like I need to apologize, Sagi.  I'm not sure where I got turned
around looking at the socket code, but you're right; kernel_sock_shutdown
doesn't look to be the culprit here.

Thank you for the following detailed description. I'm going to go back
to my crash report and take another look at this one.

- Chris

> The sequence follows this logic:
> 1. kernel_sock_shutdown(SHUT_RDWR) - guarantee we do not send or
> receive anything to/from the wire when any of the driver contexts
> access the socket from this point on.
> 2. nvme_tcp_restore_sock_calls() - guarantee that we are not wake up
> from any socket event, incoming data, or buffer space becoming
> available to send from this point on.
> 3. cancel_work_sync(io_work) - cancel and fence io_work, and make sure
> that it is not self-requeued. So in essence guarantee that io_work is
> not triggered again (not by self-requeue, nor by socket callbacks).
>
> Here we should be able to safely continue to release the socket
> in nvme_tcp_free_queue as there should be no way anything can access
> the socket concurrently.
>
> The only places where io_work is triggered from the code are:
> 1. .queue_work
> 2. socket_event (data_ready/write_space)
> 3. from io_work itself (quota exhausted, or r2t was received)
>
> So (2)+(3) should be prevented by nvme_tcp_stop_queue as outlined
> above, and (1) is prevented by quiescing the request_queue which is
> happening earlier in the reset sequence (nvme_tcp_teardown_io_queues
> calls nvme_stop_queues before nvme_tcp_stop_io_queues).
>
> So this is the logic that nvme-tcp follows to safely reset the
> controller during traffic/load.
>
> I have to say that if kernel_sock_shutdown() is clearing sock->ops
> then we clearly have a wrong assumption, but I'm surprised that it
> even works at all...
>
> However I did not find evidence to kernel_sock_shutdown() clearing
> sock->ops and in which conditions.
>
> >
> > [ 4353.145377] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
> > [ 4353.151921] RIP: 0010:kernel_sendpage+0x11/0xc0
> > [ 4353.157005] Code: ff ff ff 31 db e9 70 ff ff ff be ff ff ff ff e9
> > 49 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 18 48
> > 8b 47 20 <4c> 8b 88 a0 00 00 00 4d 85 c9 74 3b 48 8b 7e 08 48 8d 47 ff
> > 83 e7
> > [ 4353.177985] RSP: 0000:ffffb5870a7b7d68 EFLAGS: 00010292
> > [ 4353.189185] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000048
> > [ 4353.189196] RDX: 0000000000000000 RSI: ffffdffa25000000 RDI: ffff981b45de3400
> > [ 4353.189199] RBP: ffff981b45de3400 R08: 00000000000000c0 R09: ffff98182073b0ec
> > [ 4353.189202] R10: 000000000000000c R11: 000000000000000c R12: ffff9817c4885e38
> > [ 4353.226418] R13: 0000000000000048 R14: ffff9817d0018000 R15: 0000000000000000
> > [ 4353.234393] FS:  0000000000000000(0000) GS:ffff981b2fa00000(0000)
> > knlGS:0000000000000000
> > [ 4353.243442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 4353.249871] CR2: 00000000000000a0 CR3: 00000001307ce004 CR4: 00000000003706f0
> > [ 4353.257850] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 4353.265824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 4353.273798] Call Trace:
> > [ 4353.276554]  nvme_tcp_try_send_cmd_pdu+0xf4/0x250 [nvme_tcp]
> > [ 4353.282901]  ? tcp_rcv_space_adjust+0x1d/0x1c0
> > [ 4353.287888]  nvme_tcp_try_send+0x15a/0x330 [nvme_tcp]
> > [ 4353.293553]  ? release_sock+0x43/0x90
> > [ 4353.297666]  ? nvme_tcp_try_recv+0x73/0xa0 [nvme_tcp]
> > [ 4353.303334]  nvme_tcp_io_work+0x37/0x90 [nvme_tcp]
> > [ 4353.308723]  process_one_work+0x1e3/0x3b0
> > [ 4353.313237]  worker_thread+0x50/0x3b0
> > [ 4353.317335]  ? rescuer_thread+0x370/0x370
> > [ 4353.321813]  kthread+0x149/0x170
> > [ 4353.325440]  ? set_kthread_struct+0x40/0x40
> > [ 4353.330128]  ret_from_fork+0x22/0x30
> >
> > It would seem easy enough to cancel the io_work before
> > kernel_sock_shutdown, but I think that still leaves an opportunity for
> > it to be rescheduled.
>
> How would it be rescheduled? By who?
>
> >  The TCP callbacks do all appear to be protected
> > by the sk_callback_lock and checks against sk_user_data, so it may be
> > possible to restore those first?  I'm not sure on the best way to
> > safely shut down the socket in this case.
> >
> > I'll try and get some testing on this change, but wanted to get this
> > on the list and see what other suggestions you might have.
> >
> > Thanks!
> > Chris Leech
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 4ceb28675fdf..d9e1f875f80d 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -1567,9 +1567,9 @@ static void nvme_tcp_restore_sock_calls(struct
> > nvme_tcp_queue *queue)
> >
> >   static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
> >   {
> > -       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
> >          nvme_tcp_restore_sock_calls(queue);
> >          cancel_work_sync(&queue->io_work);
> > +       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
> >   }
>
> On the surface, this seems harmless (other than redundantly
> sending/receiving data to/from the wire while we are terminating
> the queue). But it is unclear to me why would this fix anything...
>
> Obviously from the bug report, it seems that there is something
> that is not taken into account, but I don't yet see what it is
> and how that is fixed by this patch.
>




* Re: nvme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-19  1:31   ` Chris Leech
@ 2022-01-19  6:33     ` Sagi Grimberg
  2022-01-27 23:05       ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2022-01-19  6:33 UTC (permalink / raw)
  To: Chris Leech; +Cc: linux-nvme


> On Tue, Jan 18, 2022 at 3:32 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>> Hi,
>>
>> Hey Chris,
>>
>>> I'm looking at a reported NULL pointer fault with controller reset
>>> testing under load
>>
>> Just to understand, is this upstream kernel?
> 
> This was a CentOS Stream 9 kernel, which should be largely in sync
> with upstream currently.

Understood, can you perhaps share a reference to the bug report?

>>> and it appears that the nvme_tcp_stop_queue
>>> sequence is racing against scheduled io_work.  The fault occurs when
>>> kernel_sendpage is called from the workqueue context and hits a NULL
>>> sock->ops pointer, which is cleared by kernel_sock_shutdown before the
>>> call to cancel_work_sync.
>>
>> Hmm, can you point me to where kernel_sock_shutdown is clearing
>> sock->ops? The assumption that nvme-tcp takes is that accessing
>> the socket is not allowed only after sock_release() but
>> still valid after kernel_sock_shutdown.
> 
> Looks like I need to apologize Sagi.  I'm not sure where I got turned
> around looking at the socket code, but you're right and
> kernel_sock_shutdown doesn't look to be the culprit here.
> 
> Thank you for the following detailed description. I'm going to go back
> to my crash report and take another look at this one.

No worries, Chris, perhaps I can assist.

Is the dmesg log prior to the BUG available? Does it tell us anything
about what was going on leading up to this?

Any more information about the test case? (load + controller reset)
Is the reset in a loop? Any more info about the load?
Any other 'interference' during the test?
How reproducible is this?
Is this Linux nvmet as the controller?
How many queues does the controller have? (it will help me understand
how easy it is to reproduce on a vm setup)

Cheers,
Sagi



* Re: nvme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-19  6:33     ` Sagi Grimberg
@ 2022-01-27 23:05       ` Sagi Grimberg
  2022-01-28  3:25         ` Chris Leech
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2022-01-27 23:05 UTC (permalink / raw)
  To: Chris Leech; +Cc: linux-nvme


>> Thank you for the following detailed description. I'm going to go back
>> to my crash report and take another look at this one.
> 
> No worries Chris, perhaps I can assist.
> 
> Is the dmesg log prior to the BUG available? Does it tell us anything
> to what was going on leading to this?
> 
> Any more information about the test case? (load + controller reset)
> Is the reset in a loop? Any more info about the load?
> Any other 'interference' during the test?
> How reproducible is this?
> Is this Linux nvmet as the controller?
> How many queues does the controller have? (it will help me understand
> how easy it is to reproduce on a vm setup)

I took another look at the code and I think I see how io_work may be
triggered after the socket was released. The issue might be the
.submit_async_event callback from the core.

When we start a reset, the first thing we do is stop the pending
work elements that may trigger I/O by calling nvme_stop_ctrl, and
then we continue to tear down the I/O queues and then the admin
queue (in nvme_tcp_teardown_ctrl).

So the sequence is:
         nvme_stop_ctrl(ctrl);
         nvme_tcp_teardown_ctrl(ctrl, false);

However, there is a possibility, after nvme_stop_ctrl but before
we tear down the admin queue, that the controller sends an AEN that
is processed by the host. That processing includes automatically
submitting another AER, which in turn calls the driver's
.submit_async_event (instead of the normal .queue_rq, as AERs don't
have timeouts).
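
The resubmission path in the core looks roughly like this; a from-memory
sketch of drivers/nvme/host/core.c, shown only to illustrate how
.submit_async_event is reached without going through .queue_rq (details may
differ):
--
/* Sketch of the core AER resubmission work item (from memory). */
static void nvme_async_event_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl =
		container_of(work, struct nvme_ctrl, async_event_work);

	nvme_aen_uevent(ctrl);
	/* No controller/queue state check before calling into the driver. */
	ctrl->ops->submit_async_event(ctrl);
}
--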

In nvme_tcp_submit_async_event we do not check that the controller
or queue state is ready to accept a new submission, like we do in
.queue_rq, so we blindly prepare the AER command, queue it and
schedule io_work; at that point I don't see what guarantees that the
queue (i.e. the socket) has not already been released.

Unless I'm missing something, this flow will trigger a use-after-free
when io_work attempts to access the socket.

I see we also don't flush async_event_work in the error recovery
flow, which we probably should do so we can avoid such a race.

I think that the below patch should address the issue:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 96725c3f1e77..bf380ca0e0d1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2097,6 +2097,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)

         nvme_auth_stop(ctrl);
         nvme_stop_keep_alive(ctrl);
+       flush_work(&ctrl->async_event_work);
         nvme_tcp_teardown_io_queues(ctrl, false);
         /* unquiesce to fail fast pending requests */
         nvme_start_queues(ctrl);
@@ -2212,6 +2213,10 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
         struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
         struct nvme_command *cmd = &pdu->cmd;
         u8 hdgst = nvme_tcp_hdgst_len(queue);
+       bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+
+       if (ctrl->ctrl.state != NVME_CTRL_LIVE || !queue_ready)
+               return;

         memset(pdu, 0, sizeof(*pdu));
         pdu->hdr.type = nvme_tcp_cmd;
--

Chris, can you take this for some testing?



* Re: nvme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-27 23:05       ` Sagi Grimberg
@ 2022-01-28  3:25         ` Chris Leech
  2022-01-28 23:55           ` Chris Leech
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Leech @ 2022-01-28  3:25 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

Thanks Sagi, this looks promising.

It also might fit with a new backtrace I was just looking at from the
same testing, where nvme_tcp_submit_async_event hit a NULL
ctrl->async_req.pdu, which I can only see happening if it was racing
with nvme_tcp_error_recovery_work.

I'll get this into some testing here at Red Hat and let you know the results.

- Chris

On Thu, Jan 27, 2022 at 3:05 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> Thank you for the following detailed description. I'm going to go back
> >> to my crash report and take another look at this one.
> >
> > No worries Chris, perhaps I can assist.
> >
> > Is the dmesg log prior to the BUG available? Does it tell us anything
> > to what was going on leading to this?
> >
> > Any more information about the test case? (load + controller reset)
> > Is the reset in a loop? Any more info about the load?
> > Any other 'interference' during the test?
> > How reproducible is this?
> > Is this Linux nvmet as the controller?
> > How many queues does the controller have? (it will help me understand
> > how easy it is to reproduce on a vm setup)
>
> I took another look at the code and I think I see how io_work maybe
> triggered after a socket was released. The issue might be
> .submit_async_event callback from the core.
>
> When we start a reset, the first thing we do is stop the pending
> work elements that may trigger io by calling nvme_stop_ctrl, and
> then we continue to teardown the I/O queues and then the admin
> queue (in nvme_tcp_teardown_ctrl).
>
> So the sequence is:
>          nvme_stop_ctrl(ctrl);
>          nvme_tcp_teardown_ctrl(ctrl, false);
>
> However, there is a possibility, after nvme_stop_ctrl but before
> we teardown the admin queue, that the controller sends a AEN
> and is processed by the host, which includes automatically
> submitting another AER which in turn is calling the driver with
> .submit_async_event (instead of the normal .queue_rq as AERs don't have
> timeouts).
>
> In nvme_tcp_submit_async_event we do not check the controller or
> queue state and see that it is ready to accept a new submission like
> we do in .queue_rq, so we blindly prepare the AER cmd queue it and
> schedules io_work, but at this point I don't see what guarantees that
> the queue (e.g. the socket) is not released.
>
> Unless I'm missing something, this flow will trigger a use-after-free
> when io_work will attempt to access the socket.
>
> I see we also don't flush the async_event_work in the error recovery
> flow which we probably should so we can avoid such a race.
>
> I think that the below patch should address the issue:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 96725c3f1e77..bf380ca0e0d1 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2097,6 +2097,7 @@ static void nvme_tcp_error_recovery_work(struct
> work_struct *work)
>
>          nvme_auth_stop(ctrl);
>          nvme_stop_keep_alive(ctrl);
> +       flush_work(&ctrl->async_event_work);
>          nvme_tcp_teardown_io_queues(ctrl, false);
>          /* unquiesce to fail fast pending requests */
>          nvme_start_queues(ctrl);
> @@ -2212,6 +2213,10 @@ static void nvme_tcp_submit_async_event(struct
> nvme_ctrl *arg)
>          struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
>          struct nvme_command *cmd = &pdu->cmd;
>          u8 hdgst = nvme_tcp_hdgst_len(queue);
> +       bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
> +
> +       if (ctrl->ctrl.state != NVME_CTRL_LIVE || !queue_ready)
> +               return;
>
>          memset(pdu, 0, sizeof(*pdu));
>          pdu->hdr.type = nvme_tcp_cmd;
> --
>
> Chris, can you take this for some testing?
>




* Re: nvme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-28  3:25         ` Chris Leech
@ 2022-01-28 23:55           ` Chris Leech
  2022-01-30  8:47             ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Leech @ 2022-01-28 23:55 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

This completed a day of automated testing without any failures.

Tested-by: Chris Leech <cleech@redhat.com>

On Thu, Jan 27, 2022 at 7:25 PM Chris Leech <cleech@redhat.com> wrote:
>
> Thanks Sagi, this looks promising.
>
> It also might fit with a new backtrace I was just looking at from the
> same testing, where nvme_tcp_submit_async_event hit a null
> ctrl->async_req.pdu which I can only see happening if it was racing
> with nvme_tcp_error_recovery_work.
>
> I'll get this into some testing here at Red Hat and let you know the results.
>
> - Chris
>
> On Thu, Jan 27, 2022 at 3:05 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> >
> > >> Thank you for the following detailed description. I'm going to go back
> > >> to my crash report and take another look at this one.
> > >
> > > No worries Chris, perhaps I can assist.
> > >
> > > Is the dmesg log prior to the BUG available? Does it tell us anything
> > > to what was going on leading to this?
> > >
> > > Any more information about the test case? (load + controller reset)
> > > Is the reset in a loop? Any more info about the load?
> > > Any other 'interference' during the test?
> > > How reproducible is this?
> > > Is this Linux nvmet as the controller?
> > > How many queues does the controller have? (it will help me understand
> > > how easy it is to reproduce on a vm setup)
> >
> > I took another look at the code and I think I see how io_work maybe
> > triggered after a socket was released. The issue might be
> > .submit_async_event callback from the core.
> >
> > When we start a reset, the first thing we do is stop the pending
> > work elements that may trigger io by calling nvme_stop_ctrl, and
> > then we continue to teardown the I/O queues and then the admin
> > queue (in nvme_tcp_teardown_ctrl).
> >
> > So the sequence is:
> >          nvme_stop_ctrl(ctrl);
> >          nvme_tcp_teardown_ctrl(ctrl, false);
> >
> > However, there is a possibility, after nvme_stop_ctrl but before
> > we teardown the admin queue, that the controller sends a AEN
> > and is processed by the host, which includes automatically
> > submitting another AER which in turn is calling the driver with
> > .submit_async_event (instead of the normal .queue_rq as AERs don't have
> > timeouts).
> >
> > In nvme_tcp_submit_async_event we do not check the controller or
> > queue state and see that it is ready to accept a new submission like
> > we do in .queue_rq, so we blindly prepare the AER cmd queue it and
> > schedules io_work, but at this point I don't see what guarantees that
> > the queue (e.g. the socket) is not released.
> >
> > Unless I'm missing something, this flow will trigger a use-after-free
> > when io_work will attempt to access the socket.
> >
> > I see we also don't flush the async_event_work in the error recovery
> > flow which we probably should so we can avoid such a race.
> >
> > I think that the below patch should address the issue:
> > --
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 96725c3f1e77..bf380ca0e0d1 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -2097,6 +2097,7 @@ static void nvme_tcp_error_recovery_work(struct
> > work_struct *work)
> >
> >          nvme_auth_stop(ctrl);
> >          nvme_stop_keep_alive(ctrl);
> > +       flush_work(&ctrl->async_event_work);
> >          nvme_tcp_teardown_io_queues(ctrl, false);
> >          /* unquiesce to fail fast pending requests */
> >          nvme_start_queues(ctrl);
> > @@ -2212,6 +2213,10 @@ static void nvme_tcp_submit_async_event(struct
> > nvme_ctrl *arg)
> >          struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
> >          struct nvme_command *cmd = &pdu->cmd;
> >          u8 hdgst = nvme_tcp_hdgst_len(queue);
> > +       bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
> > +
> > +       if (ctrl->ctrl.state != NVME_CTRL_LIVE || !queue_ready)
> > +               return;
> >
> >          memset(pdu, 0, sizeof(*pdu));
> >          pdu->hdr.type = nvme_tcp_cmd;
> > --
> >
> > Chris, can you take this for some testing?
> >




* Re: nvme-tcp: io_work NULL pointer when racing with queue stop
  2022-01-28 23:55           ` Chris Leech
@ 2022-01-30  8:47             ` Sagi Grimberg
  0 siblings, 0 replies; 8+ messages in thread
From: Sagi Grimberg @ 2022-01-30  8:47 UTC (permalink / raw)
  To: Chris Leech; +Cc: linux-nvme


> This completed a day of automated testing without any failures.
> 
> Tested-by: Chris Leech <cleech@redhat.com>

Nice, I take it that the driver did not survive this test
before?

BTW, nvme-rdma needs the exact same fix. I'll send a
small patchset for it.


