From: Chris Leech <cleech@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Subject: Re: nvme-tcp: io_work NULL pointer when racing with queue stop
Date: Thu, 27 Jan 2022 19:25:30 -0800
Message-ID: <CAPnfmXJceXFDAUXND--8UiUJeyzL-8NUAdgYcws=CodEEOF6Zg@mail.gmail.com>
In-Reply-To: <d33f5615-0e40-2e6e-76a1-b9450fa9ec6c@grimberg.me>

Thanks Sagi, this looks promising.

It might also fit with a new backtrace I was just looking at from the
same testing, where nvme_tcp_submit_async_event hit a NULL
ctrl->async_req.pdu, which I can only see happening if it was racing
with nvme_tcp_error_recovery_work.

I'll get this into some testing here at Red Hat and let you know the results.

- Chris

On Thu, Jan 27, 2022 at 3:05 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> Thank you for the following detailed description. I'm going to go back
> >> to my crash report and take another look at this one.
> >
> > No worries Chris, perhaps I can assist.
> >
> > Is the dmesg log prior to the BUG available? Does it tell us anything
> > to what was going on leading to this?
> >
> > Any more information about the test case? (load + controller reset)
> > Is the reset in a loop? Any more info about the load?
> > Any other 'interference' during the test?
> > How reproducible is this?
> > Is this Linux nvmet as the controller?
> > How many queues does the controller have? (it will help me understand
> > how easy it is to reproduce on a vm setup)
>
> I took another look at the code and I think I see how io_work may be
> triggered after a socket was released. The issue might be the
> .submit_async_event callback from the core.
>
> When we start a reset, the first thing we do is stop the pending
> work elements that may trigger I/O by calling nvme_stop_ctrl, and
> then we continue to tear down the I/O queues and then the admin
> queue (in nvme_tcp_teardown_ctrl).
>
> So the sequence is:
>          nvme_stop_ctrl(ctrl);
>          nvme_tcp_teardown_ctrl(ctrl, false);
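>
> Annotated, that reset work looks roughly like this (paraphrasing
> nvme_tcp_reset_ctrl_work from v5.16 tcp.c; the comments are mine):
> --
> static void nvme_tcp_reset_ctrl_work(struct work_struct *work)
> {
>         struct nvme_ctrl *ctrl =
>                 container_of(work, struct nvme_ctrl, reset_work);
>
>         /* flushes async_event_work once, among other things */
>         nvme_stop_ctrl(ctrl);
>         /*
>          * window: the socket is still live here, so a late AEN from
>          * the controller can re-queue async_event_work
>          */
>         nvme_tcp_teardown_ctrl(ctrl, false);    /* releases the sockets */
>         ...
> }
> --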
>
> However, there is a window, after nvme_stop_ctrl but before we
> tear down the admin queue, in which the controller sends an AEN
> that is processed by the host. That processing includes automatically
> submitting another AER, which in turn calls the driver's
> .submit_async_event (instead of the normal .queue_rq, as AERs don't
> have timeouts).
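>
> That automatic resubmission lives in the core; from memory (v5.16
> core.c, roughly):
> --
> static void nvme_async_event_work(struct work_struct *work)
> {
>         struct nvme_ctrl *ctrl =
>                 container_of(work, struct nvme_ctrl, async_event_work);
>
>         ...
>         /* calls into the transport with no ctrl/queue state check */
>         ctrl->ops->submit_async_event(ctrl);
> }
> --
> and nvme_complete_async_event() re-queues this work for every AEN it
> processes, so a late AEN re-arms it even after nvme_stop_ctrl has
> flushed it once.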
>
> In nvme_tcp_submit_async_event we do not check that the controller
> or queue state is ready to accept a new submission, like we do in
> .queue_rq, so we blindly prepare the AER cmd, queue it, and schedule
> io_work, but at this point I don't see what guarantees that the
> queue (e.g. the socket) has not been released.
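>
> Compare with the readiness check that .queue_rq does perform
> (roughly, from nvme_tcp_queue_rq in v5.16):
> --
>         bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
>
>         if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
>                 return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);
> --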
>
> Unless I'm missing something, this flow will trigger a use-after-free
> when io_work attempts to access the socket.
>
> I see we also don't flush the async_event_work in the error recovery
> flow, which we probably should do so we can avoid such a race.
>
> I think that the below patch should address the issue:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 96725c3f1e77..bf380ca0e0d1 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2097,6 +2097,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>
>          nvme_auth_stop(ctrl);
>          nvme_stop_keep_alive(ctrl);
> +       flush_work(&ctrl->async_event_work);
>          nvme_tcp_teardown_io_queues(ctrl, false);
>          /* unquiesce to fail fast pending requests */
>          nvme_start_queues(ctrl);
> @@ -2212,6 +2213,10 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
>          struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
>          struct nvme_command *cmd = &pdu->cmd;
>          u8 hdgst = nvme_tcp_hdgst_len(queue);
> +       bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
> +
> +       if (ctrl->ctrl.state != NVME_CTRL_LIVE || !queue_ready)
> +               return;
>
>          memset(pdu, 0, sizeof(*pdu));
>          pdu->hdr.type = nvme_tcp_cmd;
> --
>
> Chris, can you take this for some testing?
>


