RE: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues - matsuda-daisuke

From: "matsuda-daisuke@fujitsu.com" <matsuda-daisuke@fujitsu.com>
To: 'Bob Pearson' <rpearsonhpe@gmail.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"leonro@nvidia.com" <leonro@nvidia.com>,
	"jgg@nvidia.com" <jgg@nvidia.com>,
	"zyjzyj2000@gmail.com" <zyjzyj2000@gmail.com>,
	"Ziemba, Ian" <ian.ziemba@hpe.com>,
	Frank Zago <frank.zago@hpe.com>
Cc: "nvdimm@lists.linux.dev" <nvdimm@lists.linux.dev>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"yangx.jy@fujitsu.com" <yangx.jy@fujitsu.com>,
	"lizhijian@fujitsu.com" <lizhijian@fujitsu.com>,
	"y-goto@fujitsu.com" <y-goto@fujitsu.com>
Subject: RE: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
Date: Mon, 12 Sep 2022 08:27:02 +0000	[thread overview]
Message-ID: <TYCPR01MB845576AC0D07947D560F77DCE5449@TYCPR01MB8455.jpnprd01.prod.outlook.com> (raw)
In-Reply-To: <faf616cf-ea40-153e-bc42-a50b418e9ed9@gmail.com>

On Sat, Sep 10, 2022 4:39 AM Bob Pearson wrote:
> On 9/6/22 21:43, Daisuke Matsuda wrote:
> > In order to implement On-Demand Paging on the rxe driver, triple tasklets
> > (requester, responder, and completer) must be allowed to sleep so that they
> > can trigger page fault when pages being accessed are not present.
> >
> > This patch replaces the tasklets with workqueues, but still allows direct-
> > call of works from softirq context if it is obvious that MRs are not going
> > to be accessed and there is no work being processed in the workqueue.
> >
> > As counterparts to tasklet_disable() and tasklet_enable() are missing for
> > workqueues, an atomic value is introduced to get works suspended while qp
> > reset is in progress.
> >
> > As a reference, performance change was measured using ib_send_bw and
> > ib_send_lat commands over RC connection. Both the client and the server
> > were placed on the same KVM host. An option "-n 100000" was given to the
> > respective commands to iterate over 100000 times.
> >
> > Before applying this patch:
> > [ib_send_bw]
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> >  65536      100000           0.00               203.13             0.003250
> > ---------------------------------------------------------------------------------------
> > [ib_send_lat]
> > ---------------------------------------------------------------------------------------
> >  #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99%
> percentile[usec]   99.9% percentile[usec]
> >  2       100000          30.91          1519.05      34.23             36.06            5.15            48.49
> 82.37
> > ---------------------------------------------------------------------------------------
> >
> > After applying this patch:
> > [ib_send_bw]
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> >  65536      100000           0.00               240.88             0.003854
> > ---------------------------------------------------------------------------------------
> > [ib_send_lat]
> > ---------------------------------------------------------------------------------------
> >  #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99%
> percentile[usec]   99.9% percentile[usec]
> >  2       100000          40.88          2994.82      47.80             50.25            13.70           76.42
> 185.04
> > ---------------------------------------------------------------------------------------
> >
> > The test was conducted 3 times for each kernel, and the results with median
> > "BW average" and "t_typical" are shown above. It shows the conversion
> > improves the bandwidth while causing higher latency.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> >  drivers/infiniband/sw/rxe/Makefile    |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_comp.c  |  42 ++++---
> >  drivers/infiniband/sw/rxe/rxe_loc.h   |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
> >  drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++++------
> >  drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_req.c   |  14 +--
> >  drivers/infiniband/sw/rxe/rxe_resp.c  |  16 +--
> >  drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------------------
> >  drivers/infiniband/sw/rxe/rxe_task.h  |  69 -----------
> >  drivers/infiniband/sw/rxe/rxe_verbs.c |   8 +-
> >  drivers/infiniband/sw/rxe/rxe_verbs.h |   8 +-
> >  drivers/infiniband/sw/rxe/rxe_wq.c    | 161 ++++++++++++++++++++++++++
> >  drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++++++++
> >  15 files changed, 322 insertions(+), 299 deletions(-)
> >  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
> >  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h

<...>

> 
> Daiskuke,
> 
> We (hpe) have an internal patch set that also allows using work queues instead of tasklets.
> There is a certain amount of work to allow control over cpu assignment from ulp or application
> level. This is critical for high performance in multithreaded IO applications. It would be in
> both of our interests if we could find a common implementation that works for everyone.

I agree with you, and I am interested in your work.

Would you post the patch set? It may be possible to rebase my work on your implementation.
I just simply replaced tasklets mechanically and modified the conditions when dispatching works
to responder and completer workqueues. The changes are not so complicated. I'll be away between
9/15-9/20. After that, I can take the time to look into it.

By the way, I would also like you to join in the discussion about whether to preserve tasklets or not.
Cf. https://lore.kernel.org/all/TYCPR01MB8455D739FC9FB034E3485C87E5449@TYCPR01MB8455.jpnprd01.prod.outlook.com/

> 
> Perhaps we could arrange for a call to discuss?

This is an important change that may affect all of rxe users,
so I think the discussion should be open to anybody at least at this stage.

Thanks,
Daisuke