From: John Meneghini <jmeneghi@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	Maurizio Lombardi <mlombard@redhat.com>,
	linux-nvme@lists.infradead.org
Cc: kbusch@kernel.org, cleech@redhat.com
Subject: Re: [PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset
Date: Mon, 9 May 2022 17:22:12 -0400	[thread overview]
Message-ID: <14f3b14b-b228-7758-a556-213cd07f6593@redhat.com> (raw)
In-Reply-To: <31923ed3-3d6e-b25c-6971-9d5cdfd4e057@grimberg.me>

On 5/3/22 07:48, Sagi Grimberg wrote:
> 
>> nvme_tcp_io_work() may be scheduled while the controller is resetting,
>> triggering kernel crashes because of invalid memory accesses.
>>
>> general protection fault, probably for non-canonical
>> address 0x82f4228d6e5c6
> 
> Can you please provide more details on what kernel this happened? Does
> this reproduce in upstream kernel?
> 

This is a RHEL 9.0 Beta kernel that has been patched up to about v5.16.

> The current controller teardown flow is supposed to guarantee that
> io_work will not run after a queue has been stopped. See
> nvme_tcp_stop_queue(): it shuts down the socket to prevent any new
> datagrams from going to (or coming from) the network, restores the
> socket callbacks so data_ready/write_space are not called again, and
> cancels io_work. New submissions are prevented once the queue
> state clears LIVE.
> 
> So in theory, io_work should not be running in conjunction with
> queue shutdown. Can you explain how it happens to get there?

This bug has been reproduced on a partner's testbed.  We don't
have direct access to the tests they are running, but it appears
they are running error-insertion tests which aggressively
bring the TCP link up and down, reset the NVMe/TCP controller,
and randomly add/remove NVMe namespaces from the controller
while IO is in progress. With these tests they have been able
to reproduce this bug with our v5.16-based kernel.

I do see a number of patches from v5.17 and v5.18 that
look like they might affect this problem.  Specifically:

63573807b27e0faf8065a28b1bbe1cbfb23c0130 nvme-tcp: fix bogus request completion when failing to send AER
d6d6742772d712ed2238f5071b96baf4924f5fad nvme: fix RCU hole that allowed for endless looping in multipath round robin
a4a6f3c8f61c3cfbda4998ad94596059ad7e4332 nvme-multipath: fix hang when disk goes live over reconnect

This is the testbed that reproduced the RCU hole fixed in d6d6742772d712ed2238f5071b96baf4924f5fad.
We should definitely make sure they are running with that fix before we go any further.

If you agree the above patches are needed we can give them a new v5.18-rc6 based kernel
and ask them to run their test again.

/John

>>
>> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
>> ---
>>   drivers/nvme/host/tcp.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index ad3a2bf2f1e9..e3ef014bbc0b 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -103,6 +103,7 @@ enum nvme_tcp_queue_flags {
>>       NVME_TCP_Q_ALLOCATED    = 0,
>>       NVME_TCP_Q_LIVE        = 1,
>>       NVME_TCP_Q_POLLING    = 2,
>> +    NVME_TCP_Q_CONNECTING    = 3,
>>   };
>>   enum nvme_tcp_recv_state {
>> @@ -1213,6 +1214,10 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>           bool pending = false;
>>           int result;
>> +        if (unlikely(!test_bit(NVME_TCP_Q_LIVE, &queue->flags) &&
>> +                !test_bit(NVME_TCP_Q_CONNECTING, &queue->flags)))
>> +            return;
> 
> How is this protecting anything? If the queue is torn down we may access
> other freed memory, not just from accessing the sockets.
> 
>> +
>>           if (mutex_trylock(&queue->send_mutex)) {
>>               result = nvme_tcp_try_send(queue);
>>               mutex_unlock(&queue->send_mutex);
>> @@ -1670,6 +1675,8 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>       int ret;
>> +    set_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
>> +
>>       if (idx)
>>           ret = nvmf_connect_io_queue(nctrl, idx);
>>       else
>> @@ -1683,6 +1690,7 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>>           dev_err(nctrl->device,
>>               "failed to connect queue: %d ret=%d\n", idx, ret);
>>       }
>> +    clear_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
>>       return ret;
>>   }
> 
> Question, can you tell from your stack dump if this queue is the admin
> queue or I/O queue?
> 
> If this is indeed the admin queue, please check if you have a related
> fix applied:
> 0fa0f99fc84e ("nvme: fix a possible use-after-free in controller reset during load")
> 



Thread overview: 5+ messages
2022-04-15 10:14 [PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset Maurizio Lombardi
2022-05-03 11:48 ` Sagi Grimberg
2022-05-09 21:22   ` John Meneghini [this message]
2022-05-09 21:35     ` Keith Busch
2022-05-10 18:28     ` Sagi Grimberg
