* [PATCH 0/1] fix a panic cause by use qp after free
@ 2021-09-06  3:51 Ruozhu Li
  2021-09-06  3:51 ` [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use " Ruozhu Li
  0 siblings, 1 reply; 9+ messages in thread
From: Ruozhu Li @ 2021-09-06  3:51 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi

We got a panic when the host received a rej cm event soon after a connect
error cm event.
When the host gets a connect error cm event, it destroys the qp
immediately, but the cm_id is still valid at that point. Another cm event
can arrive and try to access the qp that was already destroyed, and then
we get the kernel panic below:

[87816.777089] [20473] ib_cm:cm_rep_handler:2343: cm_rep_handler: Stale connection. local_comm_id -154357094, remote_comm_id -1133609861
[87816.777223] [20473] ib_cm:cm_init_qp_rtr_attr:4162: cm_init_qp_rtr_attr: local_id -1150387077, cm_id_priv->id.state: 13
[87816.777225] [20473] rdma_cm:cma_rep_recv:1871: RDMA CM: CONNECT_ERROR: failed to handle reply. status -22
[87816.777395] [20473] ib_cm:ib_send_cm_rej:2781: ib_send_cm_rej: local_id -1150387077, cm_id->state: 13
[87816.777398] [20473] nvme_rdma:nvme_rdma_cm_handler:1718: nvme nvme278: connect error (6): status -22 id 00000000c3809aff
[87816.801155] [20473] nvme_rdma:nvme_rdma_cm_handler:1742: nvme nvme278: CM error event 6
[87816.801160] [20473] rdma_cm:cma_ib_handler:1947: RDMA CM: REJECTED: consumer defined
[87816.801163] nvme nvme278: rdma connection establishment failed (-104)
[87816.801168] BUG: unable to handle kernel NULL pointer dereference at 0000000000000370
[87816.801171] PGD 0 P4D 0 
[87816.801174] Oops: 0000 [#1] SMP PTI
[87816.801177] CPU: 21 PID: 20473 Comm: kworker/21:0 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-147.el8.x86_64 #1
[87816.801179] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.57 02/26/2017
[87816.801185] Workqueue: ib_cm cm_work_handler [ib_cm]
[87816.801201] RIP: 0010:_ib_modify_qp+0x6e/0x3a0 [ib_core]
[87816.801203] Code: 8b 7d 00 25 80 00 00 00 89 44 24 04 0f 85 fe 01 00 00 89 d8 45 31 e4 25 00 40 00 00 89 44 24 08 0f 85 00 01 00 00 48 c1 e5 04 <48> 03 af 70 03 00 00 f7 45 08 00 00 b0 00 74 7d f6 c7 10 74 36 41
[87816.801204] RSP: 0018:ffff9b6805f23bd8 EFLAGS: 00010202
[87816.801206] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[87816.801207] RDX: 0000000000000001 RSI: ffff9b6805f23c48 RDI: 0000000000000000
[87816.801208] RBP: 0000000000000010 R08: 000000000048ef34 R09: 0000000000000004
[87816.801209] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[87816.801210] R13: ffff8d351b5ef800 R14: ffff9b6805f23c48 R15: ffff8d353bdbd0e0
[87816.801212] FS:  0000000000000000(0000) GS:ffff8d36afa40000(0000) knlGS:0000000000000000
[87816.801213] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87816.801214] CR2: 0000000000000370 CR3: 0000000367e0a004 CR4: 00000000001606e0
[87816.801215] Call Trace:
[87816.801223]  cma_modify_qp_err+0x52/0x80 [rdma_cm]
[87816.801228]  ? __dynamic_pr_debug+0x8a/0xb0
[87816.801232]  cma_ib_handler+0x25a/0x2f0 [rdma_cm]
[87816.801235]  cm_process_work+0x60/0xe0 [ib_cm]
[87816.801238]  cm_work_handler+0x13b/0x1b97 [ib_cm]
[87816.801243]  ? __switch_to_asm+0x35/0x70
[87816.801244]  ? __switch_to_asm+0x41/0x70
[87816.801246]  ? __switch_to_asm+0x35/0x70
[87816.801248]  ? __switch_to_asm+0x41/0x70
[87816.801252]  ? __switch_to+0x8c/0x480
[87816.801254]  ? __switch_to_asm+0x41/0x70
[87816.801256]  ? __switch_to_asm+0x35/0x70
[87816.801259]  process_one_work+0x1a7/0x3b0
[87816.801263]  worker_thread+0x30/0x390
[87816.801266]  ? create_worker+0x1a0/0x1a0
[87816.801268]  kthread+0x112/0x130
[87816.801270]  ? kthread_flush_work_fn+0x10/0x10
[87816.801272]  ret_from_fork+0x35/0x40
[87816.801274] Modules linked in: nft_chain_route_ipv4 xt_CHECKSUM nft_chain_nat_ipv4 ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 nft_counter nft_compat nf_tables nfnetlink tun bridge 8021q garp mrp stp llc vfat fat ib_isert iscsi_target_mod ext4 mbcache jbd2 ib_srpt target_core_mod ib_srp scsi_transport_srp nls_utf8 isofs loop intel_rapl sb_edac x86_pkg_temp_thermal coretemp kvm_intel rpcrdma kvm sunrpc irqbypass rdma_ucm ib_iser crct10dif_pclmul joydev ib_umad libiscsi crc32_pclmul ib_ipoib ghash_clmulni_intel scsi_transport_iscsi intel_cstate intel_uncore intel_rapl_perf ipmi_si ipmi_devintf pcspkr ipmi_msghandler wdat_wdt mei_me i2c_i801 sg mei lpc_ich acpi_power_meter xfs libcrc32c mlx5_ib ib_uverbs sr_mod cdrom sd_mod ahci
[87816.801314]  libahci crc32c_intel libata megaraid_sas tg3 dm_mirror dm_region_hash dm_log dm_mod mlx5_core mlxfw nvme_rdma(OE) nvme_fabrics(OE) nvme_core(OE) rdma_cm iw_cm ib_cm ib_core
[87816.801325] CR2: 0000000000000370

The following patch tries to fix it; the ordering change is sketched below.
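
In short (a simplified sketch of the problem and of the new ordering, not
the literal diff -- see the patch for the exact context; all names below
come from the trace above and from the patch):

    /* before: on a connect error cm event the handler destroys the qp, */
    /* while the cm_id can still deliver further events                 */
    nvme_rdma_cm_handler()              /* RDMA_CM_EVENT_CONNECT_ERROR  */
        nvme_rdma_destroy_queue_ib(queue);  /* qp freed, cm_id still live */
    ...
    cm_work_handler()                   /* a later event, e.g. REJECTED */
        cma_ib_handler()
            cma_modify_qp_err()
                _ib_modify_qp()         /* dereferences the freed qp    */

    /* after: the cm event handler no longer touches the qp; teardown   */
    /* always destroys the cm_id first, then the ib queue/qp            */
    rdma_destroy_id(queue->cm_id);      /* no further cm events         */
    nvme_rdma_destroy_queue_ib(queue);  /* now safe to destroy the qp   */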

Ruozhu Li (1):
  nvme-rdma: destroy cm id before destroy qp to avoid use after free

 drivers/nvme/host/rdma.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

-- 
2.16.4


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
  2021-09-06  3:51 [PATCH 0/1] fix a panic cause by use qp after free Ruozhu Li
@ 2021-09-06  3:51 ` Ruozhu Li
  2021-09-09 14:05   ` Christoph Hellwig
  2021-09-13 15:05   ` Christoph Hellwig
  0 siblings, 2 replies; 9+ messages in thread
From: Ruozhu Li @ 2021-09-06  3:51 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi

We should always destroy the cm_id before destroying the qp, to avoid
getting a cma event after the qp was destroyed, which may lead to a use
after free.
In the RDMA connection establishment error flow, don't destroy the qp in
the cm event handler. Just report cm_error to the upper level; the qp will
be destroyed in nvme_rdma_alloc_queue() after the cm_id is destroyed.
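
Roughly, the intended flow after this change (the error path of
nvme_rdma_alloc_queue() is not part of this diff and is only sketched here
for illustration):

    nvme_rdma_cm_handler()
        /* ROUTE_ERROR / CONNECT_ERROR / UNREACHABLE / ADDR_ERROR:  */
        /* only record cm_error and wake the waiter, leave the qp   */
        /* alone                                                    */

    nvme_rdma_alloc_queue(), on connection establishment failure:
        rdma_destroy_id(queue->cm_id);      /* no further cm events */
        nvme_rdma_destroy_queue_ib(queue);  /* qp destroyed safely  */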

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
---
 drivers/nvme/host/rdma.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index a68704e39084..042c594bc57e 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -656,8 +656,8 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
 	if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
 		return;
 
-	nvme_rdma_destroy_queue_ib(queue);
 	rdma_destroy_id(queue->cm_id);
+	nvme_rdma_destroy_queue_ib(queue);
 	mutex_destroy(&queue->queue_lock);
 }
 
@@ -1815,14 +1815,10 @@ static int nvme_rdma_conn_established(struct nvme_rdma_queue *queue)
 	for (i = 0; i < queue->queue_size; i++) {
 		ret = nvme_rdma_post_recv(queue, &queue->rsp_ring[i]);
 		if (ret)
-			goto out_destroy_queue_ib;
+			return ret;
 	}
 
 	return 0;
-
-out_destroy_queue_ib:
-	nvme_rdma_destroy_queue_ib(queue);
-	return ret;
 }
 
 static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue,
@@ -1916,14 +1912,10 @@ static int nvme_rdma_route_resolved(struct nvme_rdma_queue *queue)
 	if (ret) {
 		dev_err(ctrl->ctrl.device,
 			"rdma_connect_locked failed (%d).\n", ret);
-		goto out_destroy_queue_ib;
+		return ret;
 	}
 
 	return 0;
-
-out_destroy_queue_ib:
-	nvme_rdma_destroy_queue_ib(queue);
-	return ret;
 }
 
 static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
@@ -1954,8 +1946,6 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 	case RDMA_CM_EVENT_ROUTE_ERROR:
 	case RDMA_CM_EVENT_CONNECT_ERROR:
 	case RDMA_CM_EVENT_UNREACHABLE:
-		nvme_rdma_destroy_queue_ib(queue);
-		fallthrough;
 	case RDMA_CM_EVENT_ADDR_ERROR:
 		dev_dbg(queue->ctrl->ctrl.device,
 			"CM error event %d\n", ev->event);
-- 
2.16.4




* Re: [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
  2021-09-06  3:51 ` [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use " Ruozhu Li
@ 2021-09-09 14:05   ` Christoph Hellwig
  2021-09-09 17:32     ` Max Gurtovoy
  2021-09-13 15:05   ` Christoph Hellwig
  1 sibling, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2021-09-09 14:05 UTC (permalink / raw)
  To: Ruozhu Li; +Cc: linux-nvme, sagi, Max Gurtovoy

This looks reasonable to me.  Sagi, Max: any comments?

On Mon, Sep 06, 2021 at 11:51:34AM +0800, Ruozhu Li wrote:
> We should always destroy cm_id before destroy qp to avoid to get cma
> event after qp was destroyed, which may lead to use after free.
> In RDMA connection establishment error flow, don't destroy qp in cm
> event handler.Just report cm_error to upper level, qp will be destroy
> in nvme_rdma_alloc_queue() after destroy cm id.
> 
> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
> ---
>  drivers/nvme/host/rdma.c | 16 +++-------------
>  1 file changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index a68704e39084..042c594bc57e 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -656,8 +656,8 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
>  	if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
>  		return;
>  
> -	nvme_rdma_destroy_queue_ib(queue);
>  	rdma_destroy_id(queue->cm_id);
> +	nvme_rdma_destroy_queue_ib(queue);
>  	mutex_destroy(&queue->queue_lock);
>  }
>  
> @@ -1815,14 +1815,10 @@ static int nvme_rdma_conn_established(struct nvme_rdma_queue *queue)
>  	for (i = 0; i < queue->queue_size; i++) {
>  		ret = nvme_rdma_post_recv(queue, &queue->rsp_ring[i]);
>  		if (ret)
> -			goto out_destroy_queue_ib;
> +			return ret;
>  	}
>  
>  	return 0;
> -
> -out_destroy_queue_ib:
> -	nvme_rdma_destroy_queue_ib(queue);
> -	return ret;
>  }
>  
>  static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue,
> @@ -1916,14 +1912,10 @@ static int nvme_rdma_route_resolved(struct nvme_rdma_queue *queue)
>  	if (ret) {
>  		dev_err(ctrl->ctrl.device,
>  			"rdma_connect_locked failed (%d).\n", ret);
> -		goto out_destroy_queue_ib;
> +		return ret;
>  	}
>  
>  	return 0;
> -
> -out_destroy_queue_ib:
> -	nvme_rdma_destroy_queue_ib(queue);
> -	return ret;
>  }
>  
>  static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
> @@ -1954,8 +1946,6 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>  	case RDMA_CM_EVENT_ROUTE_ERROR:
>  	case RDMA_CM_EVENT_CONNECT_ERROR:
>  	case RDMA_CM_EVENT_UNREACHABLE:
> -		nvme_rdma_destroy_queue_ib(queue);
> -		fallthrough;
>  	case RDMA_CM_EVENT_ADDR_ERROR:
>  		dev_dbg(queue->ctrl->ctrl.device,
>  			"CM error event %d\n", ev->event);
> -- 
> 2.16.4
> 
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
---end quoted text---



* Re: [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
  2021-09-09 14:05   ` Christoph Hellwig
@ 2021-09-09 17:32     ` Max Gurtovoy
  2021-09-10  2:50       ` liruozhu
  2021-09-13  1:49       ` liruozhu
  0 siblings, 2 replies; 9+ messages in thread
From: Max Gurtovoy @ 2021-09-09 17:32 UTC (permalink / raw)
  To: Christoph Hellwig, Ruozhu Li; +Cc: linux-nvme, sagi


On 9/9/2021 5:05 PM, Christoph Hellwig wrote:
> This looks reasonable to me.  Sagi, Max: any comments?
>
> On Mon, Sep 06, 2021 at 11:51:34AM +0800, Ruozhu Li wrote:
>> We should always destroy cm_id before destroy qp to avoid to get cma
>> event after qp was destroyed, which may lead to use after free.

Did you really encounter a use-after-free?

I think the code looks correct, but from experience with RDMA-CM I would
like to know which tests you ran with this code.

Interesting tests are: port toggling with/without IO, unloading
mlx5/other RDMA HW drivers during connection establishment or during IO,
rebooting the target, rebooting the target during connection
establishment, and rebooting the target during IO.

Thanks.


>> In RDMA connection establishment error flow, don't destroy qp in cm
>> event handler.Just report cm_error to upper level, qp will be destroy
>> in nvme_rdma_alloc_queue() after destroy cm id.
>>
>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>> ---
>>   drivers/nvme/host/rdma.c | 16 +++-------------
>>   1 file changed, 3 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index a68704e39084..042c594bc57e 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -656,8 +656,8 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
>>   	if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
>>   		return;
>>   
>> -	nvme_rdma_destroy_queue_ib(queue);
>>   	rdma_destroy_id(queue->cm_id);
>> +	nvme_rdma_destroy_queue_ib(queue);
>>   	mutex_destroy(&queue->queue_lock);
>>   }
>>   
>> @@ -1815,14 +1815,10 @@ static int nvme_rdma_conn_established(struct nvme_rdma_queue *queue)
>>   	for (i = 0; i < queue->queue_size; i++) {
>>   		ret = nvme_rdma_post_recv(queue, &queue->rsp_ring[i]);
>>   		if (ret)
>> -			goto out_destroy_queue_ib;
>> +			return ret;
>>   	}
>>   
>>   	return 0;
>> -
>> -out_destroy_queue_ib:
>> -	nvme_rdma_destroy_queue_ib(queue);
>> -	return ret;
>>   }
>>   
>>   static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue,
>> @@ -1916,14 +1912,10 @@ static int nvme_rdma_route_resolved(struct nvme_rdma_queue *queue)
>>   	if (ret) {
>>   		dev_err(ctrl->ctrl.device,
>>   			"rdma_connect_locked failed (%d).\n", ret);
>> -		goto out_destroy_queue_ib;
>> +		return ret;
>>   	}
>>   
>>   	return 0;
>> -
>> -out_destroy_queue_ib:
>> -	nvme_rdma_destroy_queue_ib(queue);
>> -	return ret;
>>   }
>>   
>>   static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>> @@ -1954,8 +1946,6 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>   	case RDMA_CM_EVENT_ROUTE_ERROR:
>>   	case RDMA_CM_EVENT_CONNECT_ERROR:
>>   	case RDMA_CM_EVENT_UNREACHABLE:
>> -		nvme_rdma_destroy_queue_ib(queue);
>> -		fallthrough;
>>   	case RDMA_CM_EVENT_ADDR_ERROR:
>>   		dev_dbg(queue->ctrl->ctrl.device,
>>   			"CM error event %d\n", ev->event);
>> -- 
>> 2.16.4
>>
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
> ---end quoted text---



* Re: [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
  2021-09-09 17:32     ` Max Gurtovoy
@ 2021-09-10  2:50       ` liruozhu
  2021-09-10  7:42         ` liruozhu
  2021-09-13  1:49       ` liruozhu
  1 sibling, 1 reply; 9+ messages in thread
From: liruozhu @ 2021-09-10  2:50 UTC (permalink / raw)
  To: Max Gurtovoy, Christoph Hellwig; +Cc: linux-nvme, sagi

Hi Max,

Yes, I did encounter this problem. I posted the detailed log in the cover letter of this patch.

I tested it with a Huawei storage array, killing the nvme target processes
(we use a self-developed user-mode nvme target) while repeatedly injecting
PCIe AER errors on the array's HBA card.

I'm not sure if there is a way to easily reproduce it using a standard
server with the kernel target, but I still think the host side should try
to avoid this possibility of crashing.

Thanks.

On 2021/9/10 1:32, Max Gurtovoy wrote:
>
> On 9/9/2021 5:05 PM, Christoph Hellwig wrote:
>> This looks reasonable to me.  Sagi, Max: any comments?
>>
>> On Mon, Sep 06, 2021 at 11:51:34AM +0800, Ruozhu Li wrote:
>>> We should always destroy cm_id before destroy qp to avoid to get cma
>>> event after qp was destroyed, which may lead to use after free.
>
> Do you really encounter use-after-free ?
>
> I think the code looks correct but from experience with RDMA-CM I 
> would like to know which tests did you run with this code ?
>
> interesting tests are: port toggling with/without IO, unload 
> mlx5/other_rdma_hw drivers during connection establishment or during 
> IO, reboot target, reboot target during connection establishment, 
> reboot target during IO.
>
> Thanks.
>
>
>>> In RDMA connection establishment error flow, don't destroy qp in cm
>>> event handler.Just report cm_error to upper level, qp will be destroy
>>> in nvme_rdma_alloc_queue() after destroy cm id.
>>>
>>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>>> ---
>>>   drivers/nvme/host/rdma.c | 16 +++-------------
>>>   1 file changed, 3 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index a68704e39084..042c594bc57e 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -656,8 +656,8 @@ static void nvme_rdma_free_queue(struct 
>>> nvme_rdma_queue *queue)
>>>       if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
>>>           return;
>>>   -    nvme_rdma_destroy_queue_ib(queue);
>>>       rdma_destroy_id(queue->cm_id);
>>> +    nvme_rdma_destroy_queue_ib(queue);
>>>       mutex_destroy(&queue->queue_lock);
>>>   }
>>>   @@ -1815,14 +1815,10 @@ static int 
>>> nvme_rdma_conn_established(struct nvme_rdma_queue *queue)
>>>       for (i = 0; i < queue->queue_size; i++) {
>>>           ret = nvme_rdma_post_recv(queue, &queue->rsp_ring[i]);
>>>           if (ret)
>>> -            goto out_destroy_queue_ib;
>>> +            return ret;
>>>       }
>>>         return 0;
>>> -
>>> -out_destroy_queue_ib:
>>> -    nvme_rdma_destroy_queue_ib(queue);
>>> -    return ret;
>>>   }
>>>     static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue,
>>> @@ -1916,14 +1912,10 @@ static int nvme_rdma_route_resolved(struct 
>>> nvme_rdma_queue *queue)
>>>       if (ret) {
>>>           dev_err(ctrl->ctrl.device,
>>>               "rdma_connect_locked failed (%d).\n", ret);
>>> -        goto out_destroy_queue_ib;
>>> +        return ret;
>>>       }
>>>         return 0;
>>> -
>>> -out_destroy_queue_ib:
>>> -    nvme_rdma_destroy_queue_ib(queue);
>>> -    return ret;
>>>   }
>>>     static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>> @@ -1954,8 +1946,6 @@ static int nvme_rdma_cm_handler(struct 
>>> rdma_cm_id *cm_id,
>>>       case RDMA_CM_EVENT_ROUTE_ERROR:
>>>       case RDMA_CM_EVENT_CONNECT_ERROR:
>>>       case RDMA_CM_EVENT_UNREACHABLE:
>>> -        nvme_rdma_destroy_queue_ib(queue);
>>> -        fallthrough;
>>>       case RDMA_CM_EVENT_ADDR_ERROR:
>>>           dev_dbg(queue->ctrl->ctrl.device,
>>>               "CM error event %d\n", ev->event);
>>> -- 
>>> 2.16.4
>>>
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>> ---end quoted text---
> .



* Re: [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
  2021-09-10  2:50       ` liruozhu
@ 2021-09-10  7:42         ` liruozhu
  0 siblings, 0 replies; 9+ messages in thread
From: liruozhu @ 2021-09-10  7:42 UTC (permalink / raw)
  To: Max Gurtovoy, Christoph Hellwig; +Cc: linux-nvme, sagi

Hi Max,

I feel I may have misunderstood what you meant. I guess you want to know
what tests I have done on the patched code, not the original one.

Currently I have tested the regression scenarios, as mentioned in the
previous email.

I will test all the scenarios you are interested in and then report back to you.

Thanks.
On 2021/9/10 10:50, liruozhu wrote:
> Hi Max,
>
> Yes, I did encounter this problem. I posted the detailed log in the 
> cover letter of this patch.
>
> I tested it with Huawei storage array, kill nvme target processes(we 
> use self-developed user mode nvme target)
> while inject PCIE AER errors on array HBA card repeatedly.
>
> I’m not sure if there is a way to easily reproduce it using standard 
> server with kernel target.
> But I still think that the host side should try to avoid this 
> possibility of crashing.
>
> Thanks.
>
> On 2021/9/10 1:32, Max Gurtovoy wrote:
>>
>> On 9/9/2021 5:05 PM, Christoph Hellwig wrote:
>>> This looks reasonable to me.  Sagi, Max: any comments?
>>>
>>> On Mon, Sep 06, 2021 at 11:51:34AM +0800, Ruozhu Li wrote:
>>>> We should always destroy cm_id before destroy qp to avoid to get cma
>>>> event after qp was destroyed, which may lead to use after free.
>>
>> Do you really encounter use-after-free ?
>>
>> I think the code looks correct but from experience with RDMA-CM I 
>> would like to know which tests did you run with this code ?
>>
>> interesting tests are: port toggling with/without IO, unload 
>> mlx5/other_rdma_hw drivers during connection establishment or during 
>> IO, reboot target, reboot target during connection establishment, 
>> reboot target during IO.
>>
>> Thanks.
>>
>>
>>>> In RDMA connection establishment error flow, don't destroy qp in cm
>>>> event handler.Just report cm_error to upper level, qp will be destroy
>>>> in nvme_rdma_alloc_queue() after destroy cm id.
>>>>
>>>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>>>> ---
>>>>   drivers/nvme/host/rdma.c | 16 +++-------------
>>>>   1 file changed, 3 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>> index a68704e39084..042c594bc57e 100644
>>>> --- a/drivers/nvme/host/rdma.c
>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -656,8 +656,8 @@ static void nvme_rdma_free_queue(struct 
>>>> nvme_rdma_queue *queue)
>>>>       if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
>>>>           return;
>>>>   -    nvme_rdma_destroy_queue_ib(queue);
>>>>       rdma_destroy_id(queue->cm_id);
>>>> +    nvme_rdma_destroy_queue_ib(queue);
>>>>       mutex_destroy(&queue->queue_lock);
>>>>   }
>>>>   @@ -1815,14 +1815,10 @@ static int 
>>>> nvme_rdma_conn_established(struct nvme_rdma_queue *queue)
>>>>       for (i = 0; i < queue->queue_size; i++) {
>>>>           ret = nvme_rdma_post_recv(queue, &queue->rsp_ring[i]);
>>>>           if (ret)
>>>> -            goto out_destroy_queue_ib;
>>>> +            return ret;
>>>>       }
>>>>         return 0;
>>>> -
>>>> -out_destroy_queue_ib:
>>>> -    nvme_rdma_destroy_queue_ib(queue);
>>>> -    return ret;
>>>>   }
>>>>     static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue,
>>>> @@ -1916,14 +1912,10 @@ static int nvme_rdma_route_resolved(struct 
>>>> nvme_rdma_queue *queue)
>>>>       if (ret) {
>>>>           dev_err(ctrl->ctrl.device,
>>>>               "rdma_connect_locked failed (%d).\n", ret);
>>>> -        goto out_destroy_queue_ib;
>>>> +        return ret;
>>>>       }
>>>>         return 0;
>>>> -
>>>> -out_destroy_queue_ib:
>>>> -    nvme_rdma_destroy_queue_ib(queue);
>>>> -    return ret;
>>>>   }
>>>>     static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>>> @@ -1954,8 +1946,6 @@ static int nvme_rdma_cm_handler(struct 
>>>> rdma_cm_id *cm_id,
>>>>       case RDMA_CM_EVENT_ROUTE_ERROR:
>>>>       case RDMA_CM_EVENT_CONNECT_ERROR:
>>>>       case RDMA_CM_EVENT_UNREACHABLE:
>>>> -        nvme_rdma_destroy_queue_ib(queue);
>>>> -        fallthrough;
>>>>       case RDMA_CM_EVENT_ADDR_ERROR:
>>>>           dev_dbg(queue->ctrl->ctrl.device,
>>>>               "CM error event %d\n", ev->event);
>>>> -- 
>>>> 2.16.4
>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>> ---end quoted text---
>> .
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme



* Re: [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
  2021-09-09 17:32     ` Max Gurtovoy
  2021-09-10  2:50       ` liruozhu
@ 2021-09-13  1:49       ` liruozhu
  2021-09-13 14:16         ` Max Gurtovoy
  1 sibling, 1 reply; 9+ messages in thread
From: liruozhu @ 2021-09-13  1:49 UTC (permalink / raw)
  To: Max Gurtovoy, Christoph Hellwig; +Cc: linux-nvme, sagi

Hi all,

I tested the test scenarios Max mentioned, and I haven't found any
abnormality on the host.

If there are other interesting scenarios or suggestions for the patch,
please let me know.

Thanks.

On 2021/9/10 1:32, Max Gurtovoy wrote:
>
> On 9/9/2021 5:05 PM, Christoph Hellwig wrote:
>> This looks reasonable to me.  Sagi, Max: any comments?
>>
>> On Mon, Sep 06, 2021 at 11:51:34AM +0800, Ruozhu Li wrote:
>>> We should always destroy cm_id before destroy qp to avoid to get cma
>>> event after qp was destroyed, which may lead to use after free.
>
> Do you really encounter use-after-free ?
>
> I think the code looks correct but from experience with RDMA-CM I 
> would like to know which tests did you run with this code ?
>
> interesting tests are: port toggling with/without IO, unload 
> mlx5/other_rdma_hw drivers during connection establishment or during 
> IO, reboot target, reboot target during connection establishment, 
> reboot target during IO.
>
> Thanks.
>
>
>>> In RDMA connection establishment error flow, don't destroy qp in cm
>>> event handler.Just report cm_error to upper level, qp will be destroy
>>> in nvme_rdma_alloc_queue() after destroy cm id.
>>>
>>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>>> ---
>>>   drivers/nvme/host/rdma.c | 16 +++-------------
>>>   1 file changed, 3 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index a68704e39084..042c594bc57e 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -656,8 +656,8 @@ static void nvme_rdma_free_queue(struct 
>>> nvme_rdma_queue *queue)
>>>       if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
>>>           return;
>>>   -    nvme_rdma_destroy_queue_ib(queue);
>>>       rdma_destroy_id(queue->cm_id);
>>> +    nvme_rdma_destroy_queue_ib(queue);
>>>       mutex_destroy(&queue->queue_lock);
>>>   }
>>>   @@ -1815,14 +1815,10 @@ static int 
>>> nvme_rdma_conn_established(struct nvme_rdma_queue *queue)
>>>       for (i = 0; i < queue->queue_size; i++) {
>>>           ret = nvme_rdma_post_recv(queue, &queue->rsp_ring[i]);
>>>           if (ret)
>>> -            goto out_destroy_queue_ib;
>>> +            return ret;
>>>       }
>>>         return 0;
>>> -
>>> -out_destroy_queue_ib:
>>> -    nvme_rdma_destroy_queue_ib(queue);
>>> -    return ret;
>>>   }
>>>     static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue,
>>> @@ -1916,14 +1912,10 @@ static int nvme_rdma_route_resolved(struct 
>>> nvme_rdma_queue *queue)
>>>       if (ret) {
>>>           dev_err(ctrl->ctrl.device,
>>>               "rdma_connect_locked failed (%d).\n", ret);
>>> -        goto out_destroy_queue_ib;
>>> +        return ret;
>>>       }
>>>         return 0;
>>> -
>>> -out_destroy_queue_ib:
>>> -    nvme_rdma_destroy_queue_ib(queue);
>>> -    return ret;
>>>   }
>>>     static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>> @@ -1954,8 +1946,6 @@ static int nvme_rdma_cm_handler(struct 
>>> rdma_cm_id *cm_id,
>>>       case RDMA_CM_EVENT_ROUTE_ERROR:
>>>       case RDMA_CM_EVENT_CONNECT_ERROR:
>>>       case RDMA_CM_EVENT_UNREACHABLE:
>>> -        nvme_rdma_destroy_queue_ib(queue);
>>> -        fallthrough;
>>>       case RDMA_CM_EVENT_ADDR_ERROR:
>>>           dev_dbg(queue->ctrl->ctrl.device,
>>>               "CM error event %d\n", ev->event);
>>> -- 
>>> 2.16.4
>>>
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>> ---end quoted text---
> .



* Re: [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
  2021-09-13  1:49       ` liruozhu
@ 2021-09-13 14:16         ` Max Gurtovoy
  0 siblings, 0 replies; 9+ messages in thread
From: Max Gurtovoy @ 2021-09-13 14:16 UTC (permalink / raw)
  To: liruozhu, Christoph Hellwig; +Cc: linux-nvme, sagi


On 9/13/2021 4:49 AM, liruozhu wrote:
> Hi all,
>
> I tested the test scenario mentioned by max, and I haven't found any 
> abnormality in the host.
>
> If there are other interesting scenes or suggestions for patch, please 
> let me know.
>
> Thanks.

Thanks for the update.

The code looks good,

Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>

>
> On 2021/9/10 1:32, Max Gurtovoy wrote:
>>
>> On 9/9/2021 5:05 PM, Christoph Hellwig wrote:
>>> This looks reasonable to me.  Sagi, Max: any comments?
>>>
>>> On Mon, Sep 06, 2021 at 11:51:34AM +0800, Ruozhu Li wrote:
>>>> We should always destroy cm_id before destroy qp to avoid to get cma
>>>> event after qp was destroyed, which may lead to use after free.
>>
>> Do you really encounter use-after-free ?
>>
>> I think the code looks correct but from experience with RDMA-CM I 
>> would like to know which tests did you run with this code ?
>>
>> interesting tests are: port toggling with/without IO, unload 
>> mlx5/other_rdma_hw drivers during connection establishment or during 
>> IO, reboot target, reboot target during connection establishment, 
>> reboot target during IO.
>>
>> Thanks.
>>
>>
>>>> In RDMA connection establishment error flow, don't destroy qp in cm
>>>> event handler.Just report cm_error to upper level, qp will be destroy
>>>> in nvme_rdma_alloc_queue() after destroy cm id.
>>>>
>>>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>>>> ---
>>>>   drivers/nvme/host/rdma.c | 16 +++-------------
>>>>   1 file changed, 3 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>> index a68704e39084..042c594bc57e 100644
>>>> --- a/drivers/nvme/host/rdma.c
>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -656,8 +656,8 @@ static void nvme_rdma_free_queue(struct 
>>>> nvme_rdma_queue *queue)
>>>>       if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
>>>>           return;
>>>>   -    nvme_rdma_destroy_queue_ib(queue);
>>>>       rdma_destroy_id(queue->cm_id);
>>>> +    nvme_rdma_destroy_queue_ib(queue);
>>>>       mutex_destroy(&queue->queue_lock);
>>>>   }
>>>>   @@ -1815,14 +1815,10 @@ static int 
>>>> nvme_rdma_conn_established(struct nvme_rdma_queue *queue)
>>>>       for (i = 0; i < queue->queue_size; i++) {
>>>>           ret = nvme_rdma_post_recv(queue, &queue->rsp_ring[i]);
>>>>           if (ret)
>>>> -            goto out_destroy_queue_ib;
>>>> +            return ret;
>>>>       }
>>>>         return 0;
>>>> -
>>>> -out_destroy_queue_ib:
>>>> -    nvme_rdma_destroy_queue_ib(queue);
>>>> -    return ret;
>>>>   }
>>>>     static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue,
>>>> @@ -1916,14 +1912,10 @@ static int nvme_rdma_route_resolved(struct 
>>>> nvme_rdma_queue *queue)
>>>>       if (ret) {
>>>>           dev_err(ctrl->ctrl.device,
>>>>               "rdma_connect_locked failed (%d).\n", ret);
>>>> -        goto out_destroy_queue_ib;
>>>> +        return ret;
>>>>       }
>>>>         return 0;
>>>> -
>>>> -out_destroy_queue_ib:
>>>> -    nvme_rdma_destroy_queue_ib(queue);
>>>> -    return ret;
>>>>   }
>>>>     static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>>> @@ -1954,8 +1946,6 @@ static int nvme_rdma_cm_handler(struct 
>>>> rdma_cm_id *cm_id,
>>>>       case RDMA_CM_EVENT_ROUTE_ERROR:
>>>>       case RDMA_CM_EVENT_CONNECT_ERROR:
>>>>       case RDMA_CM_EVENT_UNREACHABLE:
>>>> -        nvme_rdma_destroy_queue_ib(queue);
>>>> -        fallthrough;
>>>>       case RDMA_CM_EVENT_ADDR_ERROR:
>>>>           dev_dbg(queue->ctrl->ctrl.device,
>>>>               "CM error event %d\n", ev->event);
>>>> -- 
>>>> 2.16.4
>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>> ---end quoted text---
>> .



* Re: [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
  2021-09-06  3:51 ` [PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use " Ruozhu Li
  2021-09-09 14:05   ` Christoph Hellwig
@ 2021-09-13 15:05   ` Christoph Hellwig
  1 sibling, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2021-09-13 15:05 UTC (permalink / raw)
  To: Ruozhu Li; +Cc: linux-nvme, sagi

Thanks,

applied to nvme-5.15.

