linux-rdma.vger.kernel.org archive mirror
From: Oded Gabbay <ogabbay@kernel.org>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma <linux-rdma@vger.kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: Creating new RDMA driver for habanalabs
Date: Wed, 6 Jul 2022 11:59:14 +0300
Message-ID: <CAFCwf13LRmez63hGmXMDO2FoC3Qo_2BwtAtnzyJ70=_OcTc23w@mail.gmail.com>
In-Reply-To: <CAFCwf11NeJYDMBXaNTpQ+dLecxoAnFYE2Z9T9D4-A5gLtf8q+A@mail.gmail.com>

On Mon, Aug 23, 2021 at 5:19 PM Oded Gabbay <ogabbay@kernel.org> wrote:
>
> On Mon, Aug 23, 2021 at 4:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Mon, Aug 23, 2021 at 11:53:48AM +0300, Oded Gabbay wrote:
> >
> > > Do you see any issue with that ?
> >
> > It should work out, without a netdev you have to be more careful about
> > addressing and can't really use the IP addressing modes. But you'd
> > have a singular hardwired roce gid in this case and act more like an
> > IB device than a roce device.
> >
> > Where you might start to run into trouble is you probably want to put
> > all these ports under a single struct ib_device and we've been moving
> > away from having significant per-port differences. But I suspect it
> > can still work out.
> >
> > Jason
>
> ok, thanks for all the info.
> I will go look at the efa driver.
>
> Thanks,
> Oded

Hi Jason.

So it took a *bit* longer than expected due to higher-priority tasks,
but in the last month we did a thorough investigation of how our h/w maps
to the IBverbs API and it appears we have a few constraints that are
fairly uncommon.

Tackling these constraints can affect the basic design of the driver or
even be a non-starter for this entire endeavor.

Therefore, I would like to list the major constraints and get your opinion
on whether they are significant, and if so, how to tackle them.

To understand the context of these constraints, I would like to first say
that the Gaudi NICs were designed primarily as a form of a scale-out fabric
for doing Deep-Learning training across thousands of Gaudi devices.

This means that the designated deployment is one where the entire network
is composed of Gaudi NICs and L2/L3 switches. Interoperability with
other NICs was not the main goal, although we did manage to interoperate
with a MLNX RDMA NIC in the lab.

In addition, I would like to remind you that each Gaudi has multiple NIC
ports, but from our perspective they are all used for the same purpose.
i.e. we are using ALL the Gaudi NIC ports for a single user process
to distribute its Deep-Learning training workload.

Due to that, we would want to put all the ports under a single struct
ib_device, as you said yourself in your original email a year ago.
I haven't written this as a h/w constraint, but this is very important
for us from a system/deployment perspective. I would go on to say it
is pretty much mandatory.

The major constraints are:

1. We support only the RDMA WRITE operation. We do not support READ, SEND
   or RECV. This means that many existing open source tests in rdma-core
   are not compatible, e.g. rc_pingpong.c will not work. I guess we will
   need to implement different tests and submit them? Do you have a
   different idea/suggestion?

2. As you mentioned in the original email, we support only a single PD.
   I don't see any major implication regarding this constraint but please
   correct me if you think otherwise.

3. MR limitation on the rkey that is received from the remote connection
   during connection creation. The limitation is that our h/w extracts
   the rkey from the QP h/w context and not from the WQE when sending packets.
   This means that we may associate only a single remote MR per QP.

   Moreover, we also have an MR limitation on the rkey that we can give to
   the remote side. Our h/w extracts the rkey from the QP h/w context and
   not from the received packets. This means we give the same rkey for all
   MRs that we create per QP.

   Do you see any issue here with these two limitations? One thing we noted
   is that we need to somehow configure the rkey in our h/w QP context,
   while today the API doesn't allow it.

   These limitations are not relevant to a deployment where all the NICs are
   Gaudi NICs, because we can use a single rkey for all MRs.

4. We do not support all the flags in the reg_mr API. e.g. we don't
   support IBV_ACCESS_LOCAL_WRITE. I'm not sure what the
   implication is here.

5. Our h/w contains several accelerations we would like to utilize.
   e.g. we have a h/w mechanism for accelerating collective operations
   on multiple RDMA NICs. These accelerations will require either extensions
   to current APIs, or some dedicated APIs. For example, one of the
   accelerations requires the user to create a QP with the same
   index on all the Gaudi NICs.
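
To make the "same QP index on all the Gaudi NICs" requirement in item 5
concrete, here is a minimal sketch of one way such an allocation could be
modeled. It is only an illustration under stated assumptions: the port
count, the 64-entry bitmap, and every model_* name are hypothetical and are
not taken from the habanalabs driver, rdma-core, or the verbs API.

```c
#include <stdint.h>

/* Hypothetical sizes, chosen only so one 64-bit word covers a port. */
#define MODEL_NUM_PORTS    4
#define MODEL_QPS_PER_PORT 64

/* Hypothetical per-device state: bit i set means QP index i is busy
 * on that port. */
struct model_device {
    uint64_t qp_in_use[MODEL_NUM_PORTS];
};

/* Find the lowest QP index that is free on every port, mark it busy on
 * all ports, and return it. Returns -1 if no common index is free. */
static int model_alloc_qp_all_ports(struct model_device *dev)
{
    uint64_t busy = 0;

    /* An index is usable only if no port has it allocated. */
    for (int p = 0; p < MODEL_NUM_PORTS; p++)
        busy |= dev->qp_in_use[p];

    for (int idx = 0; idx < MODEL_QPS_PER_PORT; idx++) {
        uint64_t bit = (uint64_t)1 << idx;

        if (!(busy & bit)) {
            for (int p = 0; p < MODEL_NUM_PORTS; p++)
                dev->qp_in_use[p] |= bit;
            return idx;
        }
    }
    return -1;
}
```

The point of the sketch is that the index has to be reserved on all ports
in one step, which is the kind of cross-port operation the current per-QP
creation API does not express.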

Those are the major constraints. We have a few others but imo they are less
severe and can be discussed when we upstream the code.
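
Going back to item 3, the rkey limitation on the sending side can be
pictured with the following minimal sketch. It assumes a simplified model
in which the work request carries no rkey and the outgoing packet takes it
from the QP context. All model_* types and functions are hypothetical; they
are not the h/w layout, the habanalabs driver, or the verbs API.

```c
#include <stdint.h>

/* Hypothetical QP context: the single remote rkey lives here. */
struct model_qp_ctx {
    uint32_t remote_rkey;
    int      rkey_valid;
};

/* Hypothetical WRITE work request: note there is no rkey field. */
struct model_write_wqe {
    uint64_t remote_addr;
    uint32_t length;
};

/* Hypothetical view of what goes on the wire. */
struct model_packet {
    uint64_t remote_addr;
    uint32_t length;
    uint32_t rkey;
};

/* Program the remote rkey once, e.g. at connection creation. A different
 * rkey on the same QP is rejected: one remote MR per QP. */
static int model_set_remote_rkey(struct model_qp_ctx *qp, uint32_t rkey)
{
    if (qp->rkey_valid && qp->remote_rkey != rkey)
        return -1;
    qp->remote_rkey = rkey;
    qp->rkey_valid = 1;
    return 0;
}

/* Build an outgoing WRITE: address and length come from the WQE, the
 * rkey comes from the QP context. Fails if no rkey was configured. */
static int model_build_packet(const struct model_qp_ctx *qp,
                              const struct model_write_wqe *wqe,
                              struct model_packet *pkt)
{
    if (!qp->rkey_valid)
        return -1;
    pkt->remote_addr = wqe->remote_addr;
    pkt->length = wqe->length;
    pkt->rkey = qp->remote_rkey;
    return 0;
}
```

In this model, model_set_remote_rkey() stands in for the missing piece
noted in item 3: some way to configure the rkey in the QP context, which
the existing API does not provide.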

btw, due to the large effort, we will do this conversion only for Gaudi2
(and beyond). Gaudi1 will continue to use our proprietary, not-upstreamed,
kernel driver uAPI.

Appreciate your help on this.

Thanks,
Oded

Thread overview: 12+ messages
2021-08-22  9:40 Creating new RDMA driver for habanalabs Oded Gabbay
2021-08-22 11:32 ` Leon Romanovsky
2021-08-22 22:31 ` Jason Gunthorpe
2021-08-23  8:53   ` Oded Gabbay
2021-08-23 13:04     ` Jason Gunthorpe
2021-08-23 14:19       ` Oded Gabbay
2022-07-06  8:59         ` Oded Gabbay [this message]
2022-07-06 16:24           ` Jason Gunthorpe
2022-07-07  9:30             ` Oded Gabbay
2022-07-08 13:29               ` Jason Gunthorpe
2022-07-10  7:30                 ` Oded Gabbay
2022-07-21 18:42                   ` Jason Gunthorpe
