Re: Creating new RDMA driver for habanalabs

From: Oded Gabbay <ogabbay@kernel.org>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma <linux-rdma@vger.kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: Creating new RDMA driver for habanalabs
Date: Wed, 6 Jul 2022 11:59:14 +0300
Message-ID: <CAFCwf13LRmez63hGmXMDO2FoC3Qo_2BwtAtnzyJ70=_OcTc23w@mail.gmail.com>
In-Reply-To: <CAFCwf11NeJYDMBXaNTpQ+dLecxoAnFYE2Z9T9D4-A5gLtf8q+A@mail.gmail.com>

On Mon, Aug 23, 2021 at 5:19 PM Oded Gabbay <ogabbay@kernel.org> wrote:
>
> On Mon, Aug 23, 2021 at 4:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Mon, Aug 23, 2021 at 11:53:48AM +0300, Oded Gabbay wrote:
> >
> > > Do you see any issue with that ?
> >
> > It should work out, without a netdev you have to be more careful about
> > addressing and can't really use the IP addressing modes. But you'd
> > have a singular hardwired roce gid in this case and act more like an
> > IB device than a roce device.
> >
> > Where you might start to run into trouble is you probably want to put
> > all these ports under a single struct ib_device and we've been moving
> > away from having significant per-port differences. But I suspect it
> > can still work out.
> >
> > Jason
>
> ok, thanks for all the info.
> I will go look at the efa driver.
>
> Thanks,
> Oded

Hi Jason.

So it took a *bit* longer than expected due to higher-priority tasks,
but in the last month we did a thorough investigation of how our h/w maps
to the IBverbs API and it appears we have a few constraints that are
not quite common.

Tackling these constraints can affect the basic design of the driver or
even be a non-starter for this entire endeavor.

Therefore, I would like to list the major constraints and get your opinion
on whether they are significant, and if so, how to tackle them.

To understand the context of these constraints, I would like to first say
that the Gaudi NICs were designed primarily as a form of a scale-out fabric
for doing Deep-Learning training across thousands of Gaudi devices.

This means that the designated deployment is one where the entire network
is composed of Gaudi NICs and L2/L3 switches. Interoperability with
other NICs was not the main goal, although we did manage to interoperate
with a MLNX RDMA NIC in the lab.

In addition, I would like to remind you that each Gaudi has multiple NIC
ports, but from our perspective they are all used for the same purpose,
i.e. we use ALL the Gaudi NIC ports for a single user process
to distribute its Deep-Learning training workload.

Due to that, we would want to put all the ports under a single struct
ib_device, as you said yourself in your original email a year ago.
I haven't written this as a h/w constraint, but this is very important
for us from a system/deployment perspective. I would go on to say it
is pretty much mandatory.

The major constraints are:

1. We support only the RDMA WRITE operation. We do not support READ, SEND
   or RECV. This means that many existing open source tests in rdma-core
   are not compatible. e.g. rc_pingpong.c will not work. I guess we will
   need to implement different tests and submit them ? Do you have a
   different idea/suggestion ?

2. As you mentioned in the original email, we support only a single PD.
   I don't see any major implication regarding this constraint but please
   correct me if you think otherwise.

3. MR limitation on the rkey that is received from the remote connection
   during connection creation. The limitation is that our h/w extracts
   the rkey from the QP h/w context and not from the WQE when sending packets.
   This means that we may associate only a single remote MR per QP.

   Moreover, we also have an MR limitation on the rkey that we can give to the
   remote side. Our h/w extracts the rkey from the QP h/w context and not
   from the received packets. This means we give the same rkey for all MRs
   that we create per QP.

   Do you see any issue here with these two limitations ? One thing we noted is
   that we need to somehow configure the rkey in our h/w QP context, while today
   the API doesn't allow it.

   These limitations are not relevant to a deployment where all the NICs are
   Gaudi NICs, because we can use a single rkey for all MRs.

4. We do not support all the flags in the reg_mr API. e.g. we don't
   support IBV_ACCESS_LOCAL_WRITE. I'm not sure what the
   implication is here.

5. Our h/w contains several accelerations we would like to utilize.
   e.g. we have a h/w mechanism for accelerating collective operations
   on multiple RDMA NICs. These accelerations will require either extensions
   to current APIs, or some dedicated APIs. For example, one of the
   accelerations requires the user to create a QP with the same
   index on all the Gaudi NICs.
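
To make the send-side half of constraint 3 concrete, here is a minimal toy
model in Python. It is only a sketch and not driver code: ToyQP, connect()
and post_write() are hypothetical names that exist neither in rdma-core nor
in the habanalabs driver. It also assumes that a driver would reject a WQE
whose rkey differs from the one stored in the QP context; the h/w itself
simply uses the QP-context value.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyQP:
    """Hypothetical QP whose context holds exactly one remote rkey."""
    remote_rkey: Optional[int] = None  # programmed at connection creation
    sent: List[Tuple[int, int, int]] = field(default_factory=list)

    def connect(self, remote_rkey: int) -> None:
        # The single rkey received from the remote side is stored in the
        # QP context, so only one remote MR can be targeted per QP.
        self.remote_rkey = remote_rkey

    def post_write(self, remote_addr: int, length: int, wqe_rkey: int) -> None:
        if self.remote_rkey is None:
            raise RuntimeError("QP is not connected")
        # Assumption of this sketch: mismatching rkeys are rejected up front,
        # because the packet would carry the QP-context rkey anyway.
        if wqe_rkey != self.remote_rkey:
            raise ValueError("rkey differs from the one in the QP context")
        # The outgoing packet carries the rkey taken from the QP context.
        self.sent.append((remote_addr, length, self.remote_rkey))
```

In this model a second remote MR with a different rkey needs a second QP,
which is the practical consequence of the limitation.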
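
The "same QP index on all the Gaudi NICs" requirement in constraint 5 could
be met by an allocator along these lines. This is a hedged sketch in Python:
alloc_common_qp_index() is a hypothetical helper, and picking the lowest
index that is free on every port is an assumption, since the allocation
policy is not described above.

```python
from typing import List, Optional, Set


def alloc_common_qp_index(used_per_port: List[Set[int]],
                          num_qps: int) -> Optional[int]:
    """Reserve the lowest QP index that is free on every port.

    used_per_port holds one set of already-used QP indices per NIC port.
    Returns the reserved index, or None if no index is free on all ports.
    """
    for idx in range(num_qps):
        if all(idx not in used for used in used_per_port):
            # Mark the index as used on every port so that all ports end up
            # with a QP at the same index.
            for used in used_per_port:
                used.add(idx)
            return idx
    return None
```

With the current verbs API the QP number is chosen by the driver per QP, so
a policy like this would have to live behind an extension or a dedicated API.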

Those are the major constraints. We have a few others but imo they are less
severe and can be discussed when we upstream the code.

btw, due to the large effort, we will do this conversion only for
Gaudi2 (and beyond). Gaudi1 will continue to use our proprietary,
not-upstreamed, kernel driver uAPI.

Appreciate your help on this.

Thanks,
Oded