* [PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset
@ 2022-04-15 10:14 Maurizio Lombardi
  2022-05-03 11:48 ` Sagi Grimberg
From: Maurizio Lombardi @ 2022-04-15 10:14 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi, kbusch, cleech

nvme_tcp_io_work() may be scheduled while the controller is resetting,
triggering kernel crashes because of invalid memory accesses.

general protection fault, probably for non-canonical
address 0x82f4228d6e5c6

 [exception RIP: _copy_to_iter+1344]
 [ffffa8a9e5c07c90] __skb_datagram_iter at ffffffffb0c339d8
 [ffffa8a9e5c07cf8] skb_copy_datagram_iter at ffffffffb0c33c83
 [ffffa8a9e5c07d28] nvme_tcp_recv_data at ffffffffc04dbff7 [nvme_tcp]
 [ffffa8a9e5c07d78] nvme_tcp_recv_skb at ffffffffc04dc90e [nvme_tcp]
 [ffffa8a9e5c07dc0] tcp_read_sock at ffffffffb0cfedbe
 [ffffa8a9e5c07e10] nvme_tcp_try_recv at ffffffffc04da518 [nvme_tcp]
 [ffffa8a9e5c07e58] nvme_tcp_io_work at ffffffffc04dbc54 [nvme_tcp]
 [ffffa8a9e5c07e88] process_one_work at ffffffffb0505408
 [ffffa8a9e5c07ed0] worker_thread at ffffffffb05059a0

Fix this bug by preventing nvme_tcp_io_work() from running if the queue
is not flagged as "LIVE" and is not in the process of
connecting to the target.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
 drivers/nvme/host/tcp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index ad3a2bf2f1e9..e3ef014bbc0b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -103,6 +103,7 @@ enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_ALLOCATED	= 0,
 	NVME_TCP_Q_LIVE		= 1,
 	NVME_TCP_Q_POLLING	= 2,
+	NVME_TCP_Q_CONNECTING	= 3,
 };
 
 enum nvme_tcp_recv_state {
@@ -1213,6 +1214,10 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		bool pending = false;
 		int result;
 
+		if (unlikely(!test_bit(NVME_TCP_Q_LIVE, &queue->flags) &&
+				!test_bit(NVME_TCP_Q_CONNECTING, &queue->flags)))
+			return;
+
 		if (mutex_trylock(&queue->send_mutex)) {
 			result = nvme_tcp_try_send(queue);
 			mutex_unlock(&queue->send_mutex);
@@ -1670,6 +1675,8 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	int ret;
 
+	set_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
+
 	if (idx)
 		ret = nvmf_connect_io_queue(nctrl, idx);
 	else
@@ -1683,6 +1690,7 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
 		dev_err(nctrl->device,
 			"failed to connect queue: %d ret=%d\n", idx, ret);
 	}
+	clear_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
 	return ret;
 }
 
-- 
2.27.0




* Re: [PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset
  2022-04-15 10:14 [PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset Maurizio Lombardi
@ 2022-05-03 11:48 ` Sagi Grimberg
  2022-05-09 21:22   ` John Meneghini
From: Sagi Grimberg @ 2022-05-03 11:48 UTC (permalink / raw)
  To: Maurizio Lombardi, linux-nvme; +Cc: kbusch, cleech


> nvme_tcp_io_work() may be scheduled while the controller is resetting,
> triggering kernel crashes because of invalid memory accesses.
> 
> general protection fault, probably for non-canonical
> address 0x82f4228d6e5c6

Can you please provide more details on which kernel this happened? Does
this reproduce with an upstream kernel?

>   [exception RIP: _copy_to_iter+1344]
>   [ffffa8a9e5c07c90] __skb_datagram_iter at ffffffffb0c339d8
>   [ffffa8a9e5c07cf8] skb_copy_datagram_iter at ffffffffb0c33c83
>   [ffffa8a9e5c07d28] nvme_tcp_recv_data at ffffffffc04dbff7 [nvme_tcp]
>   [ffffa8a9e5c07d78] nvme_tcp_recv_skb at ffffffffc04dc90e [nvme_tcp]
>   [ffffa8a9e5c07dc0] tcp_read_sock at ffffffffb0cfedbe
>   [ffffa8a9e5c07e10] nvme_tcp_try_recv at ffffffffc04da518 [nvme_tcp]
>   [ffffa8a9e5c07e58] nvme_tcp_io_work at ffffffffc04dbc54 [nvme_tcp]
>   [ffffa8a9e5c07e88] process_one_work at ffffffffb0505408
>   [ffffa8a9e5c07ed0] worker_thread at ffffffffb05059a0
> 
> Fix this bug by preventing nvme_tcp_io_work() from running if the queue
> is not flagged as "LIVE" and is not in the process of
> connecting to the target.

The current controller teardown flow is supposed to guarantee that
io_work will not run after a queue has been stopped. See
nvme_tcp_stop_queue(): it shuts down the socket so that no new
datagrams can go out to (or come in from) the network, restores the
socket callbacks so data_ready/write_space are not called again, and
cancels io_work. New submissions are prevented because the queue
state clears LIVE.

So in theory, io_work should not be running in conjunction with
queue shutdown. Can you explain how it happens to get there?
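
For reference, the stop path described above is roughly this (a
simplified sketch of the v5.16-era code from memory, not a verbatim
copy):

  static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
  {
          /* no new datagrams can be sent to or received from the network */
          kernel_sock_shutdown(queue->sock, SHUT_RDWR);
          /* restore the original sk callbacks so data_ready/write_space
           * no longer schedule io_work */
          nvme_tcp_restore_sock_calls(queue);
          /* cancel pending io_work and wait for a running one to finish */
          cancel_work_sync(&queue->io_work);
  }

  static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
  {
          struct nvme_tcp_queue *queue = &to_tcp_ctrl(nctrl)->queues[qid];

          mutex_lock(&queue->queue_lock);
          if (test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
                  __nvme_tcp_stop_queue(queue);
          mutex_unlock(&queue->queue_lock);
  }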

> 
> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
> ---
>   drivers/nvme/host/tcp.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index ad3a2bf2f1e9..e3ef014bbc0b 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -103,6 +103,7 @@ enum nvme_tcp_queue_flags {
>   	NVME_TCP_Q_ALLOCATED	= 0,
>   	NVME_TCP_Q_LIVE		= 1,
>   	NVME_TCP_Q_POLLING	= 2,
> +	NVME_TCP_Q_CONNECTING	= 3,
>   };
>   
>   enum nvme_tcp_recv_state {
> @@ -1213,6 +1214,10 @@ static void nvme_tcp_io_work(struct work_struct *w)
>   		bool pending = false;
>   		int result;
>   
> +		if (unlikely(!test_bit(NVME_TCP_Q_LIVE, &queue->flags) &&
> +				!test_bit(NVME_TCP_Q_CONNECTING, &queue->flags)))
> +			return;

How is this protecting anything? If the queue is torn down we may access
other freed memory, not just from accessing the sockets.
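
To illustrate, even with this check nothing prevents a hypothetical
interleaving like the following (assuming io_work somehow runs
concurrently with a teardown, which the flow above is supposed to
rule out):

  io_work                             queue teardown/free
  -------                             -------------------
  test_bit(LIVE) -> still set
                                      clear_bit(LIVE)
                                      stop queue, free resources
  nvme_tcp_try_recv()
    -> use-after-free

The flag test and the later dereferences are not atomic with respect
to the teardown, so at best this narrows the window.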

> +
>   		if (mutex_trylock(&queue->send_mutex)) {
>   			result = nvme_tcp_try_send(queue);
>   			mutex_unlock(&queue->send_mutex);
> @@ -1670,6 +1675,8 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>   	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>   	int ret;
>   
> +	set_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
> +
>   	if (idx)
>   		ret = nvmf_connect_io_queue(nctrl, idx);
>   	else
> @@ -1683,6 +1690,7 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>   		dev_err(nctrl->device,
>   			"failed to connect queue: %d ret=%d\n", idx, ret);
>   	}
> +	clear_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
>   	return ret;
>   }
>   

Question, can you tell from your stack dump if this queue is the admin
queue or I/O queue?

If this is indeed the admin queue, please check if you have a related
fix applied:
0fa0f99fc84e ("nvme: fix a possible use-after-free in controller reset 
during load")



* Re: [PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset
  2022-05-03 11:48 ` Sagi Grimberg
@ 2022-05-09 21:22   ` John Meneghini
  2022-05-09 21:35     ` Keith Busch
  2022-05-10 18:28     ` Sagi Grimberg
From: John Meneghini @ 2022-05-09 21:22 UTC (permalink / raw)
  To: Sagi Grimberg, Maurizio Lombardi, linux-nvme; +Cc: kbusch, cleech

On 5/3/22 07:48, Sagi Grimberg wrote:
> 
>> nvme_tcp_io_work() may be scheduled while the controller is resetting,
>> triggering kernel crashes because of invalid memory accesses.
>>
>> general protection fault, probably for non-canonical
>> address 0x82f4228d6e5c6
> 
> Can you please provide more details on which kernel this happened? Does
> this reproduce with an upstream kernel?
> 

This is a RHEL 9.0 Beta kernel that has been patched up to about v5.16.

> The current controller teardown flow is supposed to guarantee that
> io_work will not run after a queue has been stopped. See
> nvme_tcp_stop_queue(): it shuts down the socket so that no new
> datagrams can go out to (or come in from) the network, restores the
> socket callbacks so data_ready/write_space are not called again, and
> cancels io_work. New submissions are prevented because the queue
> state clears LIVE.
> 
> So in theory, io_work should not be running in conjunction with
> queue shutdown. Can you explain how it happens to get there?

This bug has been reproduced on a partner's testbed.  We don't
have direct access to the tests they are running but it appears
they are running error insertion tests which aggressively
bring the TCP link up and down, reset the NVMe/TCP controller,
and randomly add/remove NVMe namespaces from the controller
while IO is in progress. With these tests they have been able
to reproduce this bug with our v5.16 based kernel.

I do see a number of patches from v5.17 and v5.18 which
look like they might affect this problem.  Specifically:

63573807b27e0faf8065a28b1bbe1cbfb23c0130 nvme-tcp: fix bogus request completion when failing to send AER
d6d6742772d712ed2238f5071b96baf4924f5fad nvme: fix RCU hole that allowed for endless looping in multipath round robin
a4a6f3c8f61c3cfbda4998ad94596059ad7e4332 nvme-multipath: fix hang when disk goes live over reconnect

This is the testbed that reproduced the RCU hole fixed in d6d6742772d712ed2238f5071b96baf4924f5fad.
We should definitely make sure they are running with that fix before we go any further.

If you agree the above patches are needed we can give them a new v5.18-rc6 based kernel
and ask them to run their test again.

/John

>>
>> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
>> ---
>>   drivers/nvme/host/tcp.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index ad3a2bf2f1e9..e3ef014bbc0b 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -103,6 +103,7 @@ enum nvme_tcp_queue_flags {
>>       NVME_TCP_Q_ALLOCATED    = 0,
>>       NVME_TCP_Q_LIVE        = 1,
>>       NVME_TCP_Q_POLLING    = 2,
>> +    NVME_TCP_Q_CONNECTING    = 3,
>>   };
>>   enum nvme_tcp_recv_state {
>> @@ -1213,6 +1214,10 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>           bool pending = false;
>>           int result;
>> +        if (unlikely(!test_bit(NVME_TCP_Q_LIVE, &queue->flags) &&
>> +                !test_bit(NVME_TCP_Q_CONNECTING, &queue->flags)))
>> +            return;
> 
> How is this protecting anything? If the queue is torn down we may access
> other freed memory, not just from accessing the sockets.
> 
>> +
>>           if (mutex_trylock(&queue->send_mutex)) {
>>               result = nvme_tcp_try_send(queue);
>>               mutex_unlock(&queue->send_mutex);
>> @@ -1670,6 +1675,8 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>       int ret;
>> +    set_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
>> +
>>       if (idx)
>>           ret = nvmf_connect_io_queue(nctrl, idx);
>>       else
>> @@ -1683,6 +1690,7 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>>           dev_err(nctrl->device,
>>               "failed to connect queue: %d ret=%d\n", idx, ret);
>>       }
>> +    clear_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
>>       return ret;
>>   }
> 
> Question, can you tell from your stack dump if this queue is the admin
> queue or I/O queue?
> 
> If this is indeed the admin queue, please check if you have a related
> fix applied:
> 0fa0f99fc84e ("nvme: fix a possible use-after-free in controller reset during load")
> 




* Re: [PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset
  2022-05-09 21:22   ` John Meneghini
@ 2022-05-09 21:35     ` Keith Busch
  2022-05-10 18:28     ` Sagi Grimberg
From: Keith Busch @ 2022-05-09 21:35 UTC (permalink / raw)
  To: John Meneghini; +Cc: Sagi Grimberg, Maurizio Lombardi, linux-nvme, cleech

On Mon, May 09, 2022 at 05:22:12PM -0400, John Meneghini wrote:
> If you agree the above patches are needed we can give them a new v5.18-rc6 based kernel
> and ask them to run their test again.

I think it is almost always worth verifying if upstream has the same problem as
something downstream. Since you're so close to the current upstream, hopefully
it's not a problem to compile it against the same RHEL kernel config that was
being tested.



* Re: [PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset
  2022-05-09 21:22   ` John Meneghini
  2022-05-09 21:35     ` Keith Busch
@ 2022-05-10 18:28     ` Sagi Grimberg
From: Sagi Grimberg @ 2022-05-10 18:28 UTC (permalink / raw)
  To: John Meneghini, Maurizio Lombardi, linux-nvme; +Cc: kbusch, cleech

Hey John, sorry for the late response.

> 
> This bug has been reproduced on a partner's testbed.  We don't
> have direct access to the tests they are running but it appears
> they are running error insertion tests which aggressively
> bring the TCP link up and down, reset the NVMe/TCP controller,
> and randomly add/remove NVMe namespaces from the controller
> while IO is in progress. With these tests they have been able
> to reproduce this bug with our v5.16 based kernel.
> 
> I do see a number of patches from v5.17 and v5.18 which
> look like they might affect this problem.  Specifically:
> 
> 63573807b27e0faf8065a28b1bbe1cbfb23c0130 nvme-tcp: fix bogus request 
> completion when failing to send AER
> d6d6742772d712ed2238f5071b96baf4924f5fad nvme: fix RCU hole that allowed 
> for endless looping in multipath round robin
> a4a6f3c8f61c3cfbda4998ad94596059ad7e4332 nvme-multipath: fix hang when 
> disk goes live over reconnect
> 
> This is the testbed that reproduced the RCU hole fixed in 
> d6d6742772d712ed2238f5071b96baf4924f5fad.
> We should definitely make sure they are running with that fix before we 
> go any further.
> 
> If you agree the above patches are needed we can give them a new 
> v5.18-rc6 based kernel
> and ask them to run their test again.

These are all useful and needed fixes, but please see the fix I
referenced below, as this issue looks very similar to what that
patch fixes.

> 
> /John
> 
>>>
>>> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
>>> ---
>>>   drivers/nvme/host/tcp.c | 8 ++++++++
>>>   1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index ad3a2bf2f1e9..e3ef014bbc0b 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -103,6 +103,7 @@ enum nvme_tcp_queue_flags {
>>>       NVME_TCP_Q_ALLOCATED    = 0,
>>>       NVME_TCP_Q_LIVE        = 1,
>>>       NVME_TCP_Q_POLLING    = 2,
>>> +    NVME_TCP_Q_CONNECTING    = 3,
>>>   };
>>>   enum nvme_tcp_recv_state {
>>> @@ -1213,6 +1214,10 @@ static void nvme_tcp_io_work(struct 
>>> work_struct *w)
>>>           bool pending = false;
>>>           int result;
>>> +        if (unlikely(!test_bit(NVME_TCP_Q_LIVE, &queue->flags) &&
>>> +                !test_bit(NVME_TCP_Q_CONNECTING, &queue->flags)))
>>> +            return;
>>
>> How is this protecting anything? If the queue is torn down we may access
>> other freed memory, not just from accessing the sockets.
>>
>>> +
>>>           if (mutex_trylock(&queue->send_mutex)) {
>>>               result = nvme_tcp_try_send(queue);
>>>               mutex_unlock(&queue->send_mutex);
>>> @@ -1670,6 +1675,8 @@ static int nvme_tcp_start_queue(struct 
>>> nvme_ctrl *nctrl, int idx)
>>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>>       int ret;
>>> +    set_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
>>> +
>>>       if (idx)
>>>           ret = nvmf_connect_io_queue(nctrl, idx);
>>>       else
>>> @@ -1683,6 +1690,7 @@ static int nvme_tcp_start_queue(struct 
>>> nvme_ctrl *nctrl, int idx)
>>>           dev_err(nctrl->device,
>>>               "failed to connect queue: %d ret=%d\n", idx, ret);
>>>       }
>>> +    clear_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
>>>       return ret;
>>>   }
>>
>> Question, can you tell from your stack dump if this queue is the admin
>> queue or I/O queue?
>>
>> If this is indeed the admin queue, please check if you have a related
>> fix applied:

This is the fix I was referring to:

>> 0fa0f99fc84e ("nvme: fix a possible use-after-free in controller reset 
>> during load")


