* [PATCH] IB/mlx4: Fix CM REQ retries in paravirt mode
From: Håkon Bugge @ 2017-06-20 12:07 UTC
To: Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Sean Hefty,
Hal Rosenstock, Håkon Bugge, Håkon Bugge
CM REQs cannot be successfully retried, because a new pv_cm_id is
created for each request without checking whether one already exists.

The bug is fixed by checking whether an id exists before creating one.
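To make the difference concrete, here is a minimal, hypothetical model of the two behaviours. It assumes a simple array-backed map from a (slave_id, slave-local cm_id) pair to a wire-level pv_cm_id; the names `struct id_map`, `map_get()`, `map_alloc()` and `mux_req_*()` are illustrative only and are not the mlx4 driver's API (the patch itself uses `id_map_get()` and `id_map_alloc()`, whose internals differ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the paravirt CM id map: each entry ties a
 * (slave_id, slave-local cm_id) pair to the pv_cm_id used on the wire. */
struct id_entry {
	int slave_id;
	uint32_t sl_cm_id;
	int pv_cm_id;
};

#define MAX_IDS 64

struct id_map {
	struct id_entry entries[MAX_IDS];
	size_t count;
	int next_pv_cm_id;	/* next wire id to hand out */
};

/* Lookup only: return the existing entry for the pair, or NULL. */
static struct id_entry *map_get(struct id_map *m, int slave_id,
				uint32_t sl_cm_id)
{
	for (size_t i = 0; i < m->count; i++) {
		struct id_entry *e = &m->entries[i];

		if (e->slave_id == slave_id && e->sl_cm_id == sl_cm_id)
			return e;
	}
	return NULL;
}

/* Unconditionally create a new entry with a fresh pv_cm_id. */
static struct id_entry *map_alloc(struct id_map *m, int slave_id,
				  uint32_t sl_cm_id)
{
	struct id_entry *e;

	assert(m->count < MAX_IDS);	/* the sketch has a fixed capacity */
	e = &m->entries[m->count++];
	e->slave_id = slave_id;
	e->sl_cm_id = sl_cm_id;
	e->pv_cm_id = m->next_pv_cm_id++;
	return e;
}

/* Before the patch: every outgoing REQ, including a retry, is given a
 * new pv_cm_id. */
static int mux_req_unpatched(struct id_map *m, int slave_id, uint32_t sl_cm_id)
{
	return map_alloc(m, slave_id, sl_cm_id)->pv_cm_id;
}

/* After the patch: reuse the mapping created for the first REQ and only
 * allocate when none exists (mirrors the id_map_get()/goto cont path). */
static int mux_req_patched(struct id_map *m, int slave_id, uint32_t sl_cm_id)
{
	struct id_entry *e = map_get(m, slave_id, sl_cm_id);

	if (!e)
		e = map_alloc(m, slave_id, sl_cm_id);
	return e->pv_cm_id;
}
```

Calling mux_req_unpatched() twice with the same (slave_id, sl_cm_id), as happens when the active side retries a REQ, yields two different pv_cm_ids; mux_req_patched() returns the same pv_cm_id both times and keeps a single mapping.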

This bug can be provoked by running an RDMA CM user-land application
with a five-second delay inserted before the rdma_accept() call on
the passive side. This delay is larger than the default CMA timeout
and triggers a retry from the active side. The retried REQ will use
another pv_cm_id (the cm_id on the wire). This confuses the CM
protocol, and two REJs are sent from the passive side.
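Why five seconds is enough can be checked with a little arithmetic. InfiniBand CM timeouts are encoded as an exponent t meaning 4.096 us * 2^t. The sketch below assumes a default CMA response-timeout exponent of 20 (`ASSUMED_CMA_RESPONSE_TIMEOUT` is a hypothetical name, and the value is an assumption, not something stated in this thread), which gives roughly 4.1 to 4.3 s. The ibdump excerpt shows the retried REQ 7.382711 - 3.285092 = 4.097619 s after the first one, in the same range; the kernel CM may add other terms (for example packet lifetime), so treat this as an order-of-magnitude check only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: default CM response-timeout exponent used by the RDMA CM
 * (CMA) at the time; the value is not stated in this thread. */
#define ASSUMED_CMA_RESPONSE_TIMEOUT 20

/* IBTA encodes CM timeouts as an exponent t: 4.096 us * 2^t. */
static double ib_timeout_seconds(unsigned int t)
{
	assert(t < 32);
	return 4.096e-6 * (double)(1ULL << t);
}

/* Power-of-two approximation in milliseconds: 4.096 us * 2^8 is about
 * 1.05 ms, so 4.096 us * 2^t is roughly 2^(t - 8) ms (at least 1 ms). */
static uint64_t ib_timeout_ms_approx(unsigned int t)
{
	assert(t < 32);
	return t >= 8 ? 1ULL << (t - 8) : 1;
}

/* True if the passive side's accept delay outlasts one CM timeout, so
 * the active side retransmits its REQ before a REP can arrive. */
static bool delay_forces_req_retry(double accept_delay_s, unsigned int t)
{
	return accept_delay_s > ib_timeout_seconds(t);
}
```

Under these assumptions, a 5 s delay exceeds one timeout (about 4.29 s exact, 4096 ms approximated), so the active side retries; a 1 s delay does not.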

Here is an excerpt from ibdump running without the patch:

3.285092 LID: 4 -> LID: 4 SDP 290 CM: ConnectRequest(SDP Hello)
7.382711 LID: 4 -> LID: 4 SDP 290 CM: ConnectRequest(SDP Hello)
7.382861 LID: 4 -> LID: 4 InfiniBand 290 CM: ConnectReject
7.387644 LID: 4 -> LID: 4 InfiniBand 290 CM: ConnectReject

and here is the same with the bug fix applied:

3.251010 LID: 4 -> LID: 4 SDP 290 CM: ConnectRequest(SDP Hello)
7.349387 LID: 4 -> LID: 4 SDP 290 CM: ConnectRequest(SDP Hello)
8.258443 LID: 4 -> LID: 4 SDP 290 CM: ConnectReply(SDP Hello)
8.259890 LID: 4 -> LID: 4 InfiniBand 290 CM: ReadyToUse

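Both traces contain exactly two ConnectRequests. In the patched trace the retry follows the first REQ by 7.349387 - 3.251010 = 4.098377 s, and the ConnectReply follows it by 8.258443 - 3.251010 = 5.007433 s, consistent with the five-second delay before rdma_accept(). The hypothetical helper below counts how many REQs leave before the REP under simplifying assumptions (fixed retry interval, no packet loss, REP sent exactly accept_delay_s after the first REQ); the `max_retries` parameter is illustrative and not a value taken from the CM.

```c
#include <assert.h>

/* Count REQ transmissions sent before the passive side replies.
 * Simplifying assumptions: no packet loss, a fixed retry interval, and
 * the REP leaves the passive side accept_delay_s after the first REQ.
 * max_retries is an illustrative limit, not a value taken from the CM. */
static int reqs_before_rep(double accept_delay_s, double retry_interval_s,
			   int max_retries)
{
	int sent = 0;

	assert(retry_interval_s > 0.0 && max_retries >= 0);
	/* Transmission k (k == 0 is the original REQ) leaves at k * interval. */
	for (int k = 0; k <= max_retries; k++) {
		if ((double)k * retry_interval_s >= accept_delay_s)
			break;
		sent++;
	}
	return sent;
}
```

With the trace's 4.098377 s interval, a 5 s delay yields two REQs, matching both excerpts. With a 1 s delay only the original REQ is sent, so the retry path (and the bug) is never exercised, which is why the reproduction needs the artificial delay.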
Suggested-by: Venkat Venkatsubra <venkat.x.venkatsubra-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Håkon Bugge <haakon.bugge-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Reported-by: Wei Lin Guay <wei.lin.guay-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Tested-by: Wei Lin Guay <wei.lin.guay-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
drivers/infiniband/hw/mlx4/cm.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index 1e6c526..fedaf82 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -323,6 +323,9 @@ int mlx4_ib_multiplex_cm_handler(struct ib_device *ibdev, int port, int slave_id
mad->mad_hdr.attr_id == CM_REP_ATTR_ID ||
mad->mad_hdr.attr_id == CM_SIDR_REQ_ATTR_ID) {
sl_cm_id = get_local_comm_id(mad);
+ id = id_map_get(ibdev, &pv_cm_id, slave_id, sl_cm_id);
+ if (id)
+ goto cont;
id = id_map_alloc(ibdev, slave_id, sl_cm_id);
if (IS_ERR(id)) {
mlx4_ib_warn(ibdev, "%s: id{slave: %d, sl_cm_id: 0x%x} Failed to id_map_alloc\n",
@@ -343,6 +346,7 @@ int mlx4_ib_multiplex_cm_handler(struct ib_device *ibdev, int port, int slave_id
return -EINVAL;
}

+cont:
set_local_comm_id(mad, id->pv_cm_id);

if (mad->mad_hdr.attr_id == CM_DREQ_ATTR_ID)
--
2.9.3
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH] IB/mlx4: Fix CM REQ retries in paravirt mode
From: jackm @ 2017-06-26 9:40 UTC
To: Håkon Bugge
Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas,
Sean Hefty, Hal Rosenstock, Leon Romanovsky, Moni Shoua
Nice catch, Haakon!
-Jack
Acked-by: Jack Morgenstein <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
* Re: [PATCH] IB/mlx4: Fix CM REQ retries in paravirt mode
From: Håkon Bugge @ 2017-07-07 9:03 UTC
To: jackm
Cc: Doug Ledford, OFED mailing list, Yishai Hadas, Sean Hefty,
Hal Rosenstock, Leon Romanovsky, Moni Shoua
Do I need any additional r-b for this to get in?
Thxs, Håkon
> On 26 Jun 2017, at 11:40, jackm <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>
> Nice catch, Haakon!
>
> -Jack
>
> Acked-by: Jack Morgenstein <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
* Re: [PATCH] IB/mlx4: Fix CM REQ retries in paravirt mode
From: Leon Romanovsky @ 2017-07-07 9:45 UTC
To: Håkon Bugge
Cc: jackm, Doug Ledford, OFED mailing list, Yishai Hadas, Sean Hefty,
Hal Rosenstock, Moni Shoua
On Fri, Jul 07, 2017 at 11:03:50AM +0200, Håkon Bugge wrote:
> Do I need any additional r-b for this to get in?
>
Jack's r-b is enough.
Thanks,
* Re: [PATCH] IB/mlx4: Fix CM REQ retries in paravirt mode
From: oulijun @ 2017-07-10 6:30 UTC
To: Håkon Bugge, Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Sean Hefty,
Hal Rosenstock
Hi Haakon Bugge,
I am interested in this issue. Does it also happen when RDMA CM is
used to establish connections in other hardware environments, for
example on an arm64 board?
Moreover, could you describe in detail how to test for the bug?
I don't quite understand what the RDMA CM user-land application is.
Thanks
Lijun Ou
* Re: [PATCH] IB/mlx4: Fix CM REQ retries in paravirt mode
From: Håkon Bugge @ 2017-07-13 12:16 UTC
To: oulijun
Cc: Doug Ledford, OFED mailing list, Yishai Hadas, Sean Hefty,
Hal Rosenstock
Hi Lijun,
> On 10 Jul 2017, at 08:30, oulijun <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
>
> Hi Haakon Bugge,
> I am interested in this issue. Does it also happen when RDMA CM is
> used to establish connections in other hardware environments, for
> example on an arm64 board?
Yes, this bug is CPU-architecture agnostic. All that is required to hit the bug is a CX-3 (ConnectX-3) adapter in a virtualized environment.
> Moreover, could you describe in detail how to test for the bug?
> I don't quite understand what the RDMA CM user-land application is.
You can provoke this bug by running, for example, qperf or any of the perftest applications, using their command-line switches to enable RDMA CM connection establishment. You must also insert a sleep() with a five-second delay just before the rdma_accept() call in the source.
Now, if you run this between two VMs on the same physical machine, or between two VMs on two different machines, you will hit the error.
Hope this helps :-)
Thxs, Håkon
* Re: [PATCH] IB/mlx4: Fix CM REQ retries in paravirt mode
From: Doug Ledford @ 2017-07-22 17:45 UTC
To: Håkon Bugge
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Sean Hefty,
Hal Rosenstock
This was accepted into 4.13-rc, thanks.
--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
GPG Key ID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD