All of lore.kernel.org
 help / color / mirror / Atom feed
* RE: [PATCH v2 00/14] rdma/siw: implement non-blocking connect
@ 2022-06-15 16:51 Bernard Metzler
  2022-06-16  9:07 ` Stefan Metzmacher
  0 siblings, 1 reply; 4+ messages in thread
From: Bernard Metzler @ 2022-06-15 16:51 UTC (permalink / raw)
  To: Stefan Metzmacher; +Cc: linux-rdma



> -----Original Message-----
> From: Stefan Metzmacher <metze@samba.org>
> Sent: Wednesday, 15 June 2022 17:27
> To: Bernard Metzler <BMT@zurich.ibm.com>
> Cc: Stefan Metzmacher <metze@samba.org>; linux-rdma@vger.kernel.org
> Subject: [EXTERNAL] [PATCH v2 00/14] rdma/siw: implement non-blocking
> connect
> 
> Hi Bernard,
> 
> as written a few month ago, I have a patchset with a lot
> of fixes for siw.ko.
> 

Hi Stefan, much appreciated! I think the erdma driver
did a good job implementing something similar, but w/o
the need to look into MPA v2 specifics, especially the
extra handshake in RDMA mode.
Did you take care of the MPA v2 extended connection
establishment stuff?

I'll have a look asap, I am just down with a nice
COVID infect. This to let you know I am not ignoring,
but have another interesting experience which takes
most of my time 😉. Will come back to it asap!

Bernard.


> As requested I'm only send isolated chunks for easier review.
> 
> This is the first chunk adressing deadlocks in siw_connect()
> 
> The RDMA application layer expects rdma_connect() to be non-blocking
> as the completion is handled via RDMA_CM_EVENT_ESTABLISHED and
> other async events. It's not unlikely to hold a lock during
> the rdma_connect() call.
> 
> Without out this a connection attempt to a non-existing/reachable
> server block until the very long tcp timeout hits.
> The application layer had no chance to have its own timeout handler
> as that would just deadlock with the already blocking rdma_connect().
> 
> First rdma_connect() holds id_priv->handler_mutex and deadlocks
> rdma_destroy_id().
> 
> And iw_cm_connect() called from within rdma_connect() sets
> IWCM_F_CONNECT_WAIT during the call to cm_id->device->ops.iw_connect(),
> siw_connect() in this case. It means that iw_cm_disconnect()
> and iw_destroy_cm_id() will both deadlock waiting for
> IWCM_F_CONNECT_WAIT being cleared.
> 
> Patch 1: Fixes a refcounting problem
> 
> Patches 2-7: Intruduces __siw_cep_terminate_upcall()
> making he upcall handling much more consistent handling
> more state combinations.
> 
> Patches 8-13 are preparation patches to siw_connect()
> in order to do the real non-blocking split in Patch 14.
> 
> Please have a look.
> 
> Thanks!
> metze
> 
> Fixed issues in v2:
> - Include more preparation patches related to __siw_cep_terminate_upcall()
>   bases on review from Cheng Xu <chengyou@linux.alibaba.com>
> 
> Stefan Metzmacher (14):
>   rdma/siw: remove superfluous siw_cep_put() from siw_connect() error
>     path
>   rdma/siw: make siw_cm_upcall() a noop without valid 'id'
>   rdma/siw: split out a __siw_cep_terminate_upcall() function
>   rdma/siw: use __siw_cep_terminate_upcall() for indirect
>     SIW_CM_WORK_CLOSE_LLP
>   rdma/siw: use __siw_cep_terminate_upcall() for SIW_CM_WORK_PEER_CLOSE
>   rdma/siw: use __siw_cep_terminate_upcall() for SIW_CM_WORK_MPATIMEOUT
>   rdma/siw: handle SIW_EPSTATE_CONNECTING in
>     __siw_cep_terminate_upcall()
>   rdma/siw: make use of kernel_{bind,connect,listen}()
>   rdma/siw: let siw_connect() set AWAIT_MPAREP before
>     siw_send_mpareqrep()
>   rdma/siw: create a temporary copy of private data
>   rdma/siw: use error and out logic at the end of siw_connect()
>   rdma/siw: start mpa timer before calling siw_send_mpareqrep()
>   rdma/siw: call the blocking kernel_bindconnect() just before
>     siw_send_mpareqrep()
>   rdma/siw: implement non-blocking connect.
> 
>  drivers/infiniband/sw/siw/siw_cm.c | 347 ++++++++++++++++++-----------
>  drivers/infiniband/sw/siw/siw_cm.h |   1 +
>  2 files changed, 224 insertions(+), 124 deletions(-)
> 
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 00/14] rdma/siw: implement non-blocking connect
  2022-06-15 16:51 [PATCH v2 00/14] rdma/siw: implement non-blocking connect Bernard Metzler
@ 2022-06-16  9:07 ` Stefan Metzmacher
  2022-06-16 10:35   ` Cheng Xu
  0 siblings, 1 reply; 4+ messages in thread
From: Stefan Metzmacher @ 2022-06-16  9:07 UTC (permalink / raw)
  To: Bernard Metzler; +Cc: linux-rdma, Cheng Xu

Hi Bernard,

> Hi Stefan, much appreciated! I think the erdma driver
> did a good job implementing something similar, but w/o
> the need to look into MPA v2 specifics, especially the
> extra handshake in RDMA mode.

As far as I see they introduce a dedicated
ERDMA_CM_WORK_CONNECTTIMEOUT and the timeouts
are:

 > +#define MPAREQ_TIMEOUT (HZ * 20)
 > +#define MPAREP_TIMEOUT (HZ * 10)
 > +#define CONNECT_TIMEOUT (HZ * 10)

If I read the code correct that would be 10 + 20 seconds
before a RDMA_CM_EVENT_ESTABLISHED must be posted.

I'm using just one timer (#define MPAREQ_TIMEOUT (HZ * 10)) that spans the tcp connect
as well as the MPA handshare.

I don't think we should care in what portion the time was spend
between tcp and mpa, what matters for the caller is the overall timeout
to get a valid connection. So I think a single timeout is better.

> Did you take care of the MPA v2 extended connection
> establishment stuff?

I'm not aware that I have touched (or had to care about) that at all here.

> I'll have a look asap, I am just down with a nice
> COVID infect. This to let you know I am not ignoring,
> but have another interesting experience which takes
> most of my time 😉. Will come back to it asap!

Get well soon!

Thanks!
metze

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 00/14] rdma/siw: implement non-blocking connect
  2022-06-16  9:07 ` Stefan Metzmacher
@ 2022-06-16 10:35   ` Cheng Xu
  0 siblings, 0 replies; 4+ messages in thread
From: Cheng Xu @ 2022-06-16 10:35 UTC (permalink / raw)
  To: Stefan Metzmacher, Bernard Metzler; +Cc: linux-rdma



On 6/16/22 5:07 PM, Stefan Metzmacher wrote:
> Hi Bernard,
> 
>> Hi Stefan, much appreciated! I think the erdma driver
>> did a good job implementing something similar, but w/o
>> the need to look into MPA v2 specifics, especially the
>> extra handshake in RDMA mode.
> 
> As far as I see they introduce a dedicated
> ERDMA_CM_WORK_CONNECTTIMEOUT and the timeouts
> are:
> 
>  > +#define MPAREQ_TIMEOUT (HZ * 20)
>  > +#define MPAREP_TIMEOUT (HZ * 10)
>  > +#define CONNECT_TIMEOUT (HZ * 10)
> 
> If I read the code correct that would be 10 + 20 seconds
> before a RDMA_CM_EVENT_ESTABLISHED must be posted.
> 

You are right.

> I'm using just one timer (#define MPAREQ_TIMEOUT (HZ * 10)) that spans 
> the tcp connect
> as well as the MPA handshare.
> 
> I don't think we should care in what portion the time was spend
> between tcp and mpa, what matters for the caller is the overall timeout
> to get a valid connection. So I think a single timeout is better.
> 

Indeed one timer is enough and works fine. But in erdma, our consider
is:

TCP connection establishment and iWarp MPA negotiation are two stages.
tcp connect timeout often means the server is offline. And how long the 
MPA negotiation takes is influenced by the server's load (client will 
wait longer if many clients connect to it due to more RDMA resources 
need being allocated).

We want to distinguish these two situations. A shorter TCP expired time
and a longer MPA expired time allow clients can try to connect another
server fast if the current server is unreachable, and also have
enough time to wait the MPA negotiation finished. This solves some
issues in our internal scenarios.

Thanks,
Cheng Xu


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v2 00/14] rdma/siw: implement non-blocking connect
  2022-06-15  8:40 [PATCH 0/7] " Stefan Metzmacher
@ 2022-06-15 15:26 ` Stefan Metzmacher
  0 siblings, 0 replies; 4+ messages in thread
From: Stefan Metzmacher @ 2022-06-15 15:26 UTC (permalink / raw)
  To: Bernard Metzler; +Cc: Stefan Metzmacher, linux-rdma

Hi Bernard,

as written a few month ago, I have a patchset with a lot
of fixes for siw.ko.

As requested I'm only send isolated chunks for easier review.

This is the first chunk adressing deadlocks in siw_connect()

The RDMA application layer expects rdma_connect() to be non-blocking
as the completion is handled via RDMA_CM_EVENT_ESTABLISHED and
other async events. It's not unlikely to hold a lock during
the rdma_connect() call.

Without out this a connection attempt to a non-existing/reachable
server block until the very long tcp timeout hits.
The application layer had no chance to have its own timeout handler
as that would just deadlock with the already blocking rdma_connect().

First rdma_connect() holds id_priv->handler_mutex and deadlocks
rdma_destroy_id().

And iw_cm_connect() called from within rdma_connect() sets
IWCM_F_CONNECT_WAIT during the call to cm_id->device->ops.iw_connect(),
siw_connect() in this case. It means that iw_cm_disconnect()
and iw_destroy_cm_id() will both deadlock waiting for
IWCM_F_CONNECT_WAIT being cleared.

Patch 1: Fixes a refcounting problem

Patches 2-7: Intruduces __siw_cep_terminate_upcall()
making he upcall handling much more consistent handling
more state combinations.

Patches 8-13 are preparation patches to siw_connect()
in order to do the real non-blocking split in Patch 14.

Please have a look.

Thanks!
metze

Fixed issues in v2:
- Include more preparation patches related to __siw_cep_terminate_upcall()
  bases on review from Cheng Xu <chengyou@linux.alibaba.com>

Stefan Metzmacher (14):
  rdma/siw: remove superfluous siw_cep_put() from siw_connect() error
    path
  rdma/siw: make siw_cm_upcall() a noop without valid 'id'
  rdma/siw: split out a __siw_cep_terminate_upcall() function
  rdma/siw: use __siw_cep_terminate_upcall() for indirect
    SIW_CM_WORK_CLOSE_LLP
  rdma/siw: use __siw_cep_terminate_upcall() for SIW_CM_WORK_PEER_CLOSE
  rdma/siw: use __siw_cep_terminate_upcall() for SIW_CM_WORK_MPATIMEOUT
  rdma/siw: handle SIW_EPSTATE_CONNECTING in
    __siw_cep_terminate_upcall()
  rdma/siw: make use of kernel_{bind,connect,listen}()
  rdma/siw: let siw_connect() set AWAIT_MPAREP before
    siw_send_mpareqrep()
  rdma/siw: create a temporary copy of private data
  rdma/siw: use error and out logic at the end of siw_connect()
  rdma/siw: start mpa timer before calling siw_send_mpareqrep()
  rdma/siw: call the blocking kernel_bindconnect() just before
    siw_send_mpareqrep()
  rdma/siw: implement non-blocking connect.

 drivers/infiniband/sw/siw/siw_cm.c | 347 ++++++++++++++++++-----------
 drivers/infiniband/sw/siw/siw_cm.h |   1 +
 2 files changed, 224 insertions(+), 124 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-06-16 10:36 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-15 16:51 [PATCH v2 00/14] rdma/siw: implement non-blocking connect Bernard Metzler
2022-06-16  9:07 ` Stefan Metzmacher
2022-06-16 10:35   ` Cheng Xu
  -- strict thread matches above, loose matches on Subject: below --
2022-06-15  8:40 [PATCH 0/7] " Stefan Metzmacher
2022-06-15 15:26 ` [PATCH v2 00/14] " Stefan Metzmacher

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.