linux-rdma.vger.kernel.org archive mirror
* RE: Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
@ 2022-05-20 15:13 Bernard Metzler
  2022-05-23  1:39 ` Cheng Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Bernard Metzler @ 2022-05-20 15:13 UTC (permalink / raw)
  To: Cheng Xu, Jason Gunthorpe, Tom Talpey
  Cc: dledford, leon, linux-rdma, KaiShen, tonylu


> -----Original Message-----
> From: Cheng Xu <chengyou@linux.alibaba.com>
> Sent: Friday, 20 May 2022 09:04
> To: Bernard Metzler <BMT@zurich.ibm.com>; Jason Gunthorpe
> <jgg@nvidia.com>; Tom Talpey <tom@talpey.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the
> erdma module
> 
> 
> 
> On 5/20/22 12:20 AM, Bernard Metzler wrote:
> >
> >
> 
> <...>
> 
> >>> As far as I know, an iWarp device only has one GID entry, which is
> >>> generated from the MAC address.
> >>>
> >>> For iWarp, the CM part of the core code resolves the address, finds
> >>> the route with the help of the kernel's net subsystem, and then
> >>> obtains the correct ibdev by GID matching. GID matching in iWarp is
> >>> really MAC address matching.
> >>>
> >>> In other words, for iWarp devices the core code doesn't handle IP
> >>> addressing directly; that is done by calling net APIs. The netdev
> >>> set by ib_device_set_netdev is not used in iWarp's CM process.
> >>>
> >>> The netdev bound to an iWarp device mainly has two purposes:
> >>>    1). generating GID0, using the netdev's MAC address.
> >>>    2). getting the port state and attributes.
> >>>
> >>> For 1), the erdma device is also bound to its net device by MAC
> >>> address, which can be obtained from our PCIe BAR registers.
> >>> For 2), erdma can also get this information, possibly more
> >>> accurately. For example, erdma can have a different MTU from
> >>> virtio-net in our cloud.
> >>>
> >>> For RoCEv2, I know that it has many GIDs, some of which are
> >>> generated from IP addresses, with IP addressing handled in the
> >>> core code.
> >>
> >> Bernard, Tom what do you think?
> >>
> >> Jason
> >
> > I think iWarp (and now RoCEv2 with its UDP dependency) drivers
> > produce GIDs mostly to satisfy the current RDMA CM infrastructure,
> > which depends on this type of unique identifier, inherited from IB.
> > Imo, it would be more natural to implement connection management
> > for IP-based RDMA protocols by relying on IP addresses.
> >
> > Sorry for asking again - why does erdma not need to link with a netdev?
> > Can erdma exist without using a netdev?
> 
> Actually, erdma also needs a net device to bind to, and it does have one.
> 
> These days I'm trying to find an acceptable way to get a reference to
> the bound netdev, e.g. the 'struct net_device' pointer. Unlike other
> RDMA drivers, which can easily get a reference to their bound netdev
> (most RDMA devices are based on auxiliary devices), it is a little
> more complex for erdma, because erdma and its bound net device are
> two separate PCIe devices.
>
> Then I found that the netdev reference held in the ibdev is rarely
> used in core code for iWarp devices; GID0 is the key attribute (as
> you and Tom mentioned, it exists for historical compatibility
> reasons, but I think that is another story).
> 

Yes, I think this is right.

If you are saying you can get away with a NULL netdev at the CM core,
then I think that's fine?
Of course the erdma driver must somehow keep track of the state of
its associated network device - like keeping up with link status -
and must provide the related information/events to the RDMA core.
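
Something like the following untested sketch is what I have in mind
(the erdma type and field names are made up for illustration):

static int erdma_netdev_event(struct notifier_block *nb,
			      unsigned long event, void *arg)
{
	struct net_device *ndev = netdev_notifier_info_to_dev(arg);
	/* hypothetical driver structure holding the notifier block */
	struct erdma_dev *edev = container_of(nb, struct erdma_dev,
					      netdev_nb);
	struct ib_event ibevent = {};

	if (ndev != edev->netdev)
		return NOTIFY_DONE;

	switch (event) {
	case NETDEV_UP:
		ibevent.event = IB_EVENT_PORT_ACTIVE;
		break;
	case NETDEV_DOWN:
		ibevent.event = IB_EVENT_PORT_ERR;
		break;
	default:
		return NOTIFY_DONE;
	}

	/* forward the link change to the RDMA core as a port event */
	ibevent.device = &edev->ibdev;
	ibevent.element.port_num = 1;
	ib_dispatch_event(&ibevent);
	return NOTIFY_OK;
}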

> So, there are two choices for erdma: enumerate the net devices and
> find the matching one, or never call ib_device_set_netdev. The second
> option needs less code.
>
> The second way can't work for RoCE, but it works for iWarp (I've
> tested it), since the netdev reference is rarely used for iWarp in
> core code, as I said in my last reply.
>
> In short, the question discussed here is: is it acceptable for an
> iWarp driver not to hold the netdev reference in core code (even
> though it does have a netdev bound to it)? Or is it necessary to call
> ib_device_set_netdev to set the bound netdev for an iWarp driver?
> 
> You and Tom are both specialists in iWarp; your opinions are important.
> 
> Thanks very much
> Cheng Xu
> 
> 
> >
> > Thanks,
> > Bernard.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-20 15:13 Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module Bernard Metzler
@ 2022-05-23  1:39 ` Cheng Xu
  2022-05-23 13:25   ` Tom Talpey
  0 siblings, 1 reply; 22+ messages in thread
From: Cheng Xu @ 2022-05-23  1:39 UTC (permalink / raw)
  To: Bernard Metzler, Jason Gunthorpe, Tom Talpey
  Cc: dledford, leon, linux-rdma, KaiShen, tonylu



On 5/20/22 11:13 PM, Bernard Metzler wrote:
> 
>> -----Original Message-----
>> From: Cheng Xu <chengyou@linux.alibaba.com>
>> Sent: Friday, 20 May 2022 09:04
>> To: Bernard Metzler <BMT@zurich.ibm.com>; Jason Gunthorpe
>> <jgg@nvidia.com>; Tom Talpey <tom@talpey.com>
>> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the
>> erdma module
>>
>>
>>
>> On 5/20/22 12:20 AM, Bernard Metzler wrote:
>>>
>>>
>>
>> <...>
>>
>>>>> As far as I know, an iWarp device only has one GID entry, which is
>>>>> generated from the MAC address.
>>>>>
>>>>> For iWarp, the CM part of the core code resolves the address, finds
>>>>> the route with the help of the kernel's net subsystem, and then
>>>>> obtains the correct ibdev by GID matching. GID matching in iWarp is
>>>>> really MAC address matching.
>>>>>
>>>>> In other words, for iWarp devices the core code doesn't handle IP
>>>>> addressing directly; that is done by calling net APIs. The netdev
>>>>> set by ib_device_set_netdev is not used in iWarp's CM process.
>>>>>
>>>>> The netdev bound to an iWarp device mainly has two purposes:
>>>>>     1). generating GID0, using the netdev's MAC address.
>>>>>     2). getting the port state and attributes.
>>>>>
>>>>> For 1), the erdma device is also bound to its net device by MAC
>>>>> address, which can be obtained from our PCIe BAR registers.
>>>>> For 2), erdma can also get this information, possibly more
>>>>> accurately. For example, erdma can have a different MTU from
>>>>> virtio-net in our cloud.
>>>>>
>>>>> For RoCEv2, I know that it has many GIDs, some of which are
>>>>> generated from IP addresses, with IP addressing handled in the
>>>>> core code.
>>>>
>>>> Bernard, Tom what do you think?
>>>>
>>>> Jason
>>>
>>> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
>>> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
>>> which depends on this type of unique identifier, inherited from IB.
>>> Imo, it would be more natural to implement connection management
>>> for IP-based RDMA protocols by relying on IP addresses.
>>>
>>> Sorry for asking again - why does erdma not need to link with a netdev?
>>> Can erdma exist without using a netdev?
>>
>> Actually, erdma also needs a net device to bind to, and it does have one.
>>
>> These days I'm trying to find an acceptable way to get a reference to
>> the bound netdev, e.g. the 'struct net_device' pointer. Unlike other
>> RDMA drivers, which can easily get a reference to their bound netdev
>> (most RDMA devices are based on auxiliary devices), it is a little
>> more complex for erdma, because erdma and its bound net device are
>> two separate PCIe devices.
>>
>> Then I found that the netdev reference held in the ibdev is rarely
>> used in core code for iWarp devices; GID0 is the key attribute (as
>> you and Tom mentioned, it exists for historical compatibility
>> reasons, but I think that is another story).
>>
> 
> Yes, I think this is right.
> 
> If you are saying you can get away with a NULL netdev at the CM core,
> then I think that's fine?
> Of course the erdma driver must somehow keep track of the state of
> its associated network device - like keeping up with link status -
> and must provide the related information/events to the RDMA core.
> 

All right, got it. I'd like to hold the bound netdev reference in our
probe routine, and then send v9 patches.

Thanks for your time looking at this, Jason, Bernard and Tom.

Cheng Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-23  1:39 ` Cheng Xu
@ 2022-05-23 13:25   ` Tom Talpey
  2022-05-24  3:09     ` Cheng Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Tom Talpey @ 2022-05-23 13:25 UTC (permalink / raw)
  To: Cheng Xu, Bernard Metzler, Jason Gunthorpe
  Cc: dledford, leon, linux-rdma, KaiShen, tonylu

On 5/22/2022 9:39 PM, Cheng Xu wrote:
> 
> 
> On 5/20/22 11:13 PM, Bernard Metzler wrote:
>>
>>> -----Original Message-----
>>> From: Cheng Xu <chengyou@linux.alibaba.com>
>>> Sent: Friday, 20 May 2022 09:04
>>> To: Bernard Metzler <BMT@zurich.ibm.com>; Jason Gunthorpe
>>> <jgg@nvidia.com>; Tom Talpey <tom@talpey.com>
>>> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
>>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>>> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the
>>> erdma module
>>>
>>>
>>>
>>> On 5/20/22 12:20 AM, Bernard Metzler wrote:
>>>>
>>>>
>>>
>>> <...>
>>>
>>>>>> As far as I know, an iWarp device only has one GID entry, which is
>>>>>> generated from the MAC address.
>>>>>>
>>>>>> For iWarp, the CM part of the core code resolves the address, finds
>>>>>> the route with the help of the kernel's net subsystem, and then
>>>>>> obtains the correct ibdev by GID matching. GID matching in iWarp is
>>>>>> really MAC address matching.
>>>>>>
>>>>>> In other words, for iWarp devices the core code doesn't handle IP
>>>>>> addressing directly; that is done by calling net APIs. The netdev
>>>>>> set by ib_device_set_netdev is not used in iWarp's CM process.
>>>>>>
>>>>>> The netdev bound to an iWarp device mainly has two purposes:
>>>>>>     1). generating GID0, using the netdev's MAC address.
>>>>>>     2). getting the port state and attributes.
>>>>>>
>>>>>> For 1), the erdma device is also bound to its net device by MAC
>>>>>> address, which can be obtained from our PCIe BAR registers.
>>>>>> For 2), erdma can also get this information, possibly more
>>>>>> accurately. For example, erdma can have a different MTU from
>>>>>> virtio-net in our cloud.
>>>>>>
>>>>>> For RoCEv2, I know that it has many GIDs, some of which are
>>>>>> generated from IP addresses, with IP addressing handled in the
>>>>>> core code.
>>>>>
>>>>> Bernard, Tom what do you think?
>>>>>
>>>>> Jason
>>>>
>>>> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
>>>> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
>>>> which depends on this type of unique identifier, inherited from IB.
>>>> Imo, it would be more natural to implement connection management
>>>> for IP-based RDMA protocols by relying on IP addresses.
>>>>
>>>> Sorry for asking again - why does erdma not need to link with a netdev?
>>>> Can erdma exist without using a netdev?
>>>
>>> Actually, erdma also needs a net device to bind to, and it does have one.
>>>
>>> These days I'm trying to find an acceptable way to get a reference to
>>> the bound netdev, e.g. the 'struct net_device' pointer. Unlike other
>>> RDMA drivers, which can easily get a reference to their bound netdev
>>> (most RDMA devices are based on auxiliary devices), it is a little
>>> more complex for erdma, because erdma and its bound net device are
>>> two separate PCIe devices.
>>>
>>> Then I found that the netdev reference held in the ibdev is rarely
>>> used in core code for iWarp devices; GID0 is the key attribute (as
>>> you and Tom mentioned, it exists for historical compatibility
>>> reasons, but I think that is another story).
>>>
>>
>> Yes, I think this is right.
>>
>> If you are saying you can get away with a NULL netdev at the CM core,
>> then I think that's fine?
>> Of course the erdma driver must somehow keep track of the state of
>> its associated network device - like keeping up with link status -
>> and must provide the related information/events to the RDMA core.
>>
> 
> All right, got it. I'd like to hold the bound netdev reference in our
> probe routine, and then send v9 patches.

Cheng Xu, by this do you mean you'll drop the reference after the probe?
In addition to watching for link status changes, I'd expect you also
will want to monitor ARP/ND changes, maybe even perform those in the
netdev stack rather than your firmware. Does your adapter not also
function as a normal NIC?

Tom.

> Thanks for your time looking at this, Jason, Bernard and Tom.
> 
> Cheng Xu
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-23 13:25   ` Tom Talpey
@ 2022-05-24  3:09     ` Cheng Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Cheng Xu @ 2022-05-24  3:09 UTC (permalink / raw)
  To: Tom Talpey, Bernard Metzler, Jason Gunthorpe
  Cc: dledford, leon, linux-rdma, KaiShen, tonylu



On 5/23/22 9:25 PM, Tom Talpey wrote:
> On 5/22/2022 9:39 PM, Cheng Xu wrote:
>>
>>
>> On 5/20/22 11:13 PM, Bernard Metzler wrote:
>>>
>>>> -----Original Message-----
>>>> From: Cheng Xu <chengyou@linux.alibaba.com>
>>>> Sent: Friday, 20 May 2022 09:04
>>>> To: Bernard Metzler <BMT@zurich.ibm.com>; Jason Gunthorpe
>>>> <jgg@nvidia.com>; Tom Talpey <tom@talpey.com>
>>>> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
>>>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>>>> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the
>>>> erdma module
>>>>
>>>>
>>>>
>>>> On 5/20/22 12:20 AM, Bernard Metzler wrote:
>>>>>
>>>>>
>>>>
>>>> <...>
>>>>
>>>>>>> As far as I know, an iWarp device only has one GID entry, which is
>>>>>>> generated from the MAC address.
>>>>>>>
>>>>>>> For iWarp, the CM part of the core code resolves the address, finds
>>>>>>> the route with the help of the kernel's net subsystem, and then
>>>>>>> obtains the correct ibdev by GID matching. GID matching in iWarp is
>>>>>>> really MAC address matching.
>>>>>>>
>>>>>>> In other words, for iWarp devices the core code doesn't handle IP
>>>>>>> addressing directly; that is done by calling net APIs. The netdev
>>>>>>> set by ib_device_set_netdev is not used in iWarp's CM process.
>>>>>>>
>>>>>>> The netdev bound to an iWarp device mainly has two purposes:
>>>>>>>     1). generating GID0, using the netdev's MAC address.
>>>>>>>     2). getting the port state and attributes.
>>>>>>>
>>>>>>> For 1), the erdma device is also bound to its net device by MAC
>>>>>>> address, which can be obtained from our PCIe BAR registers.
>>>>>>> For 2), erdma can also get this information, possibly more
>>>>>>> accurately. For example, erdma can have a different MTU from
>>>>>>> virtio-net in our cloud.
>>>>>>>
>>>>>>> For RoCEv2, I know that it has many GIDs, some of which are
>>>>>>> generated from IP addresses, with IP addressing handled in the
>>>>>>> core code.
>>>>>>
>>>>>> Bernard, Tom what do you think?
>>>>>>
>>>>>> Jason
>>>>>
>>>>> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
>>>>> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
>>>>> which depends on this type of unique identifier, inherited from IB.
>>>>> Imo, it would be more natural to implement connection management
>>>>> for IP-based RDMA protocols by relying on IP addresses.
>>>>>
>>>>> Sorry for asking again - why does erdma not need to link with a netdev?
>>>>> Can erdma exist without using a netdev?
>>>>
>>>> Actually, erdma also needs a net device to bind to, and it does have one.
>>>>
>>>> These days I'm trying to find an acceptable way to get a reference to
>>>> the bound netdev, e.g. the 'struct net_device' pointer. Unlike other
>>>> RDMA drivers, which can easily get a reference to their bound netdev
>>>> (most RDMA devices are based on auxiliary devices), it is a little
>>>> more complex for erdma, because erdma and its bound net device are
>>>> two separate PCIe devices.
>>>>
>>>> Then I found that the netdev reference held in the ibdev is rarely
>>>> used in core code for iWarp devices; GID0 is the key attribute (as
>>>> you and Tom mentioned, it exists for historical compatibility
>>>> reasons, but I think that is another story).
>>>>
>>>
>>> Yes, I think this is right.
>>>
>>> If you are saying you can get away with a NULL netdev at the CM core,
>>> then I think that's fine?
>>> Of course the erdma driver must somehow keep track of the state of
>>> its associated network device - like keeping up with link status -
>>> and must provide the related information/events to the RDMA core.
>>>
>>
>> All right, got it. I'd like to hold the bound netdev reference in our
>> probe routine, and then send v9 patches.
> 
> Cheng Xu, by this do you mean you'll drop the reference after the probe?

No, we will keep the reference for the whole lifecycle of erdma, as
other physical RNICs do.
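
A minimal, untested sketch of what I mean (the names here are made up;
erdma_find_netdev stands for our MAC-based lookup):

static int erdma_bind_netdev(struct erdma_dev *edev)
{
	/* assumed to return the matching netdev with a reference held */
	struct net_device *ndev = erdma_find_netdev(edev);

	if (!ndev)
		return -EPROBE_DEFER;

	edev->netdev = ndev;	/* keep it for the whole erdma lifecycle */
	return ib_device_set_netdev(&edev->ibdev, ndev, 1);
}

static void erdma_unbind_netdev(struct erdma_dev *edev)
{
	ib_device_set_netdev(&edev->ibdev, NULL, 1);
	dev_put(edev->netdev);	/* drop the reference taken at probe time */
}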

> In addition to watching for link status changes, I'd expect you also
> will want to monitor ARP/ND changes, maybe even perform those in the
> netdev stack rather than your firmware.

Since we will keep the netdev's reference, we don't need to do this.
Anyway, thanks very much for your suggestion.

> ... Does your adapter not also
> function as a normal NIC?

From the OS's point of view, erdma does not have a NIC function itself,
but our HW does; it is exposed through another PCI device, e.g. a
virtio-net device in our cloud today. In other words, our NIC and RDMA
devices are separate PCIe devices to the OS, but they are both emulated
by the same hardware.

In more detail, our scenario is cloud virtualization. We have hardware
(called MOC, a chipset for cloud virtualization in Alibaba Cloud, a
kind of DPU/IPU) which provides IO acceleration. MOC emulates several
kinds of IO devices (virtio-net, virtio-blk, NVMe and erdma) to provide
IO services to VMs or bare-metal instances. In our cloud, each
VM/bare-metal instance has a virtio-net device by default to get the
VPC network service. erdma aims to provide RDMA service in the same VPC
network. So erdma does not need the NIC function; it would be a
redundant feature, since virtio-net provides it already.

Thanks,
Cheng Xu

> Tom.
> 
>> Thanks for your time looking at this, Jason, Bernard and Tom.
>>
>> Cheng Xu
>>
>>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-19 16:20             ` Bernard Metzler
  2022-05-19 18:51               ` Tom Talpey
@ 2022-05-20  7:03               ` Cheng Xu
  1 sibling, 0 replies; 22+ messages in thread
From: Cheng Xu @ 2022-05-20  7:03 UTC (permalink / raw)
  To: Bernard Metzler, Jason Gunthorpe, Tom Talpey
  Cc: dledford, leon, linux-rdma, KaiShen, tonylu



On 5/20/22 12:20 AM, Bernard Metzler wrote:
> 
> 

<...>

>>> As far as I know, an iWarp device only has one GID entry, which is
>>> generated from the MAC address.
>>>
>>> For iWarp, the CM part of the core code resolves the address, finds
>>> the route with the help of the kernel's net subsystem, and then
>>> obtains the correct ibdev by GID matching. GID matching in iWarp is
>>> really MAC address matching.
>>>
>>> In other words, for iWarp devices the core code doesn't handle IP
>>> addressing directly; that is done by calling net APIs. The netdev
>>> set by ib_device_set_netdev is not used in iWarp's CM process.
>>>
>>> The netdev bound to an iWarp device mainly has two purposes:
>>>    1). generating GID0, using the netdev's MAC address.
>>>    2). getting the port state and attributes.
>>>
>>> For 1), the erdma device is also bound to its net device by MAC
>>> address, which can be obtained from our PCIe BAR registers.
>>> For 2), erdma can also get this information, possibly more
>>> accurately. For example, erdma can have a different MTU from
>>> virtio-net in our cloud.
>>>
>>> For RoCEv2, I know that it has many GIDs, some of which are
>>> generated from IP addresses, with IP addressing handled in the
>>> core code.
>>
>> Bernard, Tom what do you think?
>>
>> Jason
> 
> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
> which depends on this type of unique identifier, inherited from IB.
> Imo, it would be more natural to implement connection management
> for IP-based RDMA protocols by relying on IP addresses.
> 
> Sorry for asking again - why does erdma not need to link with a netdev?
> Can erdma exist without using a netdev?

Actually, erdma also needs a net device to bind to, and it does have one.

These days I'm trying to find an acceptable way to get a reference to
the bound netdev, e.g. the 'struct net_device' pointer. Unlike other
RDMA drivers, which can easily get a reference to their bound netdev
(most RDMA devices are based on auxiliary devices), it is a little
more complex for erdma, because erdma and its bound net device are
two separate PCIe devices.

Then I found that the netdev reference held in the ibdev is rarely
used in core code for iWarp devices; GID0 is the key attribute (as
you and Tom mentioned, it exists for historical compatibility
reasons, but I think that is another story).

So, there are two choices for erdma: enumerate the net devices and
find the matching one, or never call ib_device_set_netdev. The second
option needs less code.
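
The first choice would look roughly like the following untested sketch
(made-up names; edev->attrs.peer_addr stands for the MAC we read from
the BAR registers):

static struct net_device *erdma_find_netdev(struct erdma_dev *edev)
{
	struct net_device *ndev, *found = NULL;

	/* walk the netdevs under RTNL and match by MAC address */
	rtnl_lock();
	for_each_netdev(&init_net, ndev) {
		if (ether_addr_equal(ndev->dev_addr,
				     edev->attrs.peer_addr)) {
			found = ndev;
			dev_hold(found);	/* return it held */
			break;
		}
	}
	rtnl_unlock();

	return found;
}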

The second way can't work for RoCE, but it works for iWarp (I've
tested it), since the netdev reference is rarely used for iWarp in
core code, as I said in my last reply.

In short, the question discussed here is: is it acceptable for an
iWarp driver not to hold the netdev reference in core code (even
though it does have a netdev bound to it)? Or is it necessary to call
ib_device_set_netdev to set the bound netdev for an iWarp driver?

You and Tom are both specialists in iWarp; your opinions are important.

Thanks very much
Cheng Xu


> 
> Thanks,
> Bernard.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-19 16:20             ` Bernard Metzler
@ 2022-05-19 18:51               ` Tom Talpey
  2022-05-20  7:03               ` Cheng Xu
  1 sibling, 0 replies; 22+ messages in thread
From: Tom Talpey @ 2022-05-19 18:51 UTC (permalink / raw)
  To: Bernard Metzler, Jason Gunthorpe, Cheng Xu
  Cc: dledford, leon, linux-rdma, KaiShen, tonylu


On 5/19/2022 12:20 PM, Bernard Metzler wrote:
> 
> 
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, 18 May 2022 18:32
>> To: Cheng Xu <chengyou@linux.alibaba.com>; Bernard Metzler
>> <BMT@zurich.ibm.com>; Tom Talpey <tom@talpey.com>
>> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma
>> module
>>
>> On Thu, May 19, 2022 at 12:24:22AM +0800, Cheng Xu wrote:
>>>
>>>
>>> On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
>>>> On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
>>>>>
>>>>>
>>>>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>>>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>>>>
>>>>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>>>>> +	.type = "erdma",
>>>>>>> +	.newlink = erdma_newlink,
>>>>>>> +};
>>>>>>
>>>>>> Why is there still a newlink?
>>>>>>
>>>>>
>>>>> Hello, Jason,
>>>>>
>>>>> About this issue, I have another idea that is simpler and more
>>>>> reasonable.
>>>>>
>>>>> Maybe the erdma driver doesn't need to link to a net device in the
>>>>> kernel. In the core code, ib_device_get_netdev has several use cases:
>>>>>
>>>>>    1). query port info in netlink
>>>>>    2). get eth speed for IB (ib_get_eth_speed)
>>>>>    3). enumerate all RoCE ports (ib_enum_roce_netdev)
>>>>>    4). iw_query_port
>>>>>
>>>>> The case relevant to erdma is 4), but we change it in our patch 02/12.
>>>>> So it seems fine not to link erdma to a net device.
>>>>>
>>>>> * I also tested this solution; it works for both perftest and NoF. *
>>>>>
>>>>> Another issue is how to get the port state and attributes without a
>>>>> net device. For this, erdma can get them from the HW directly.
>>>>>
>>>>> So I think this may be the final solution. (BTW, I have gone over the
>>>>> rdma drivers; EFA does it this way, and it also has two separate
>>>>> devices for net and rdma. It inspired me.)
>>>>
>>>> I'm not sure this works for an iWarp device - various things expect to
>>>> know the netdevice to know how to relate IP addresses to the iWarp
>>>> stuff - but then I don't really know iWarp.
>>>
>>> As far as I know, an iWarp device only has one GID entry, which is
>>> generated from the MAC address.
>>>
>>> For iWarp, the CM part of the core code resolves the address, finds
>>> the route with the help of the kernel's net subsystem, and then
>>> obtains the correct ibdev by GID matching. GID matching in iWarp is
>>> really MAC address matching.
>>>
>>> In other words, for iWarp devices the core code doesn't handle IP
>>> addressing directly; that is done by calling net APIs. The netdev
>>> set by ib_device_set_netdev is not used in iWarp's CM process.
>>>
>>> The netdev bound to an iWarp device mainly has two purposes:
>>>    1). generating GID0, using the netdev's MAC address.
>>>    2). getting the port state and attributes.
>>>
>>> For 1), the erdma device is also bound to its net device by MAC
>>> address, which can be obtained from our PCIe BAR registers.
>>> For 2), erdma can also get this information, possibly more
>>> accurately. For example, erdma can have a different MTU from
>>> virtio-net in our cloud.
>>>
>>> For RoCEv2, I know that it has many GIDs, some of which are
>>> generated from IP addresses, with IP addressing handled in the
>>> core code.
>>
>> Bernard, Tom what do you think?
>>
>> Jason
> 
> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
> which depends on this type of unique identifier, inherited from IB.
> Imo, it would be more natural to implement connection management
> for IP-based RDMA protocols by relying on IP addresses.

Agreed. Exposing MAC addresses for this seems so... 20th century.

Tom.

> Sorry for asking again - why does erdma not need to link with a netdev?
> Can erdma exist without using a netdev?
> 
> 
> Thanks,
> Bernard

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-18 16:31           ` Jason Gunthorpe
@ 2022-05-19 16:20             ` Bernard Metzler
  2022-05-19 18:51               ` Tom Talpey
  2022-05-20  7:03               ` Cheng Xu
  0 siblings, 2 replies; 22+ messages in thread
From: Bernard Metzler @ 2022-05-19 16:20 UTC (permalink / raw)
  To: Jason Gunthorpe, Cheng Xu, Tom Talpey
  Cc: dledford, leon, linux-rdma, KaiShen, tonylu



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, 18 May 2022 18:32
> To: Cheng Xu <chengyou@linux.alibaba.com>; Bernard Metzler
> <BMT@zurich.ibm.com>; Tom Talpey <tom@talpey.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma
> module
> 
> On Thu, May 19, 2022 at 12:24:22AM +0800, Cheng Xu wrote:
> >
> >
> > On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
> > > On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
> > > >
> > > >
> > > > On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > > > > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > > > >
> > > > > > +static struct rdma_link_ops erdma_link_ops = {
> > > > > > +	.type = "erdma",
> > > > > > +	.newlink = erdma_newlink,
> > > > > > +};
> > > > >
> > > > > Why is there still a newlink?
> > > > >
> > > >
> > > > Hello, Jason,
> > > >
> > > > About this issue, I have another idea that is simpler and more
> > > > reasonable.
> > > >
> > > > Maybe the erdma driver doesn't need to link to a net device in the
> > > > kernel. In the core code, ib_device_get_netdev has several use cases:
> > > >
> > > >    1). query port info in netlink
> > > >    2). get eth speed for IB (ib_get_eth_speed)
> > > >    3). enumerate all RoCE ports (ib_enum_roce_netdev)
> > > >    4). iw_query_port
> > > >
> > > > The case relevant to erdma is 4), but we change it in our patch 02/12.
> > > > So it seems fine not to link erdma to a net device.
> > > >
> > > > * I also tested this solution; it works for both perftest and NoF. *
> > > >
> > > > Another issue is how to get the port state and attributes without a
> > > > net device. For this, erdma can get them from the HW directly.
> > > >
> > > > So I think this may be the final solution. (BTW, I have gone over the
> > > > rdma drivers; EFA does it this way, and it also has two separate
> > > > devices for net and rdma. It inspired me.)
> > >
> > > I'm not sure this works for an iWarp device - various things expect to
> > > know the netdevice to know how to relate IP addresses to the iWarp
> > > stuff - but then I don't really know iWarp.
> >
> > As far as I know, an iWarp device only has one GID entry, which is
> > generated from the MAC address.
> >
> > For iWarp, the CM part of the core code resolves the address, finds
> > the route with the help of the kernel's net subsystem, and then
> > obtains the correct ibdev by GID matching. GID matching in iWarp is
> > really MAC address matching.
> >
> > In other words, for iWarp devices the core code doesn't handle IP
> > addressing directly; that is done by calling net APIs. The netdev
> > set by ib_device_set_netdev is not used in iWarp's CM process.
> >
> > The netdev bound to an iWarp device mainly has two purposes:
> >   1). generating GID0, using the netdev's MAC address.
> >   2). getting the port state and attributes.
> >
> > For 1), the erdma device is also bound to its net device by MAC
> > address, which can be obtained from our PCIe BAR registers.
> > For 2), erdma can also get this information, possibly more
> > accurately. For example, erdma can have a different MTU from
> > virtio-net in our cloud.
> >
> > For RoCEv2, I know that it has many GIDs, some of which are
> > generated from IP addresses, with IP addressing handled in the
> > core code.
> 
> Bernard, Tom what do you think?
> 
> Jason

I think iWarp (and now RoCEv2 with its UDP dependency) drivers
produce GIDs mostly to satisfy the current RDMA CM infrastructure,
which depends on this type of unique identifier, inherited from IB.
Imo, it would be more natural to implement connection management
for IP-based RDMA protocols by relying on IP addresses.

Sorry for asking again - why does erdma not need to link with a netdev?
Can erdma exist without using a netdev?


Thanks,
Bernard.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-18 16:24         ` Cheng Xu
@ 2022-05-18 16:31           ` Jason Gunthorpe
  2022-05-19 16:20             ` Bernard Metzler
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2022-05-18 16:31 UTC (permalink / raw)
  To: Cheng Xu, BMT, Tom Talpey; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu

On Thu, May 19, 2022 at 12:24:22AM +0800, Cheng Xu wrote:
> 
> 
> On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
> > On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
> > > 
> > > 
> > > On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > > > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > > > 
> > > > > +static struct rdma_link_ops erdma_link_ops = {
> > > > > +	.type = "erdma",
> > > > > +	.newlink = erdma_newlink,
> > > > > +};
> > > > 
> > > > Why is there still a newlink?
> > > > 
> > > 
> > > Hello, Jason,
> > > 
> > > About this issue, I have another idea that is simpler and more
> > > reasonable.
> > > 
> > > Maybe the erdma driver doesn't need to link to a net device in the
> > > kernel. In the core code, ib_device_get_netdev has several use cases:
> > > 
> > >    1). query port info in netlink
> > >    2). get eth speed for IB (ib_get_eth_speed)
> > >    3). enumerate all RoCE ports (ib_enum_roce_netdev)
> > >    4). iw_query_port
> > > 
> > > The case relevant to erdma is 4), but we change it in our patch 02/12.
> > > So it seems fine not to link erdma to a net device.
> > > 
> > > * I also tested this solution; it works for both perftest and NoF. *
> > > 
> > > Another issue is how to get the port state and attributes without a
> > > net device. For this, erdma can get them from the HW directly.
> > > 
> > > So I think this may be the final solution. (BTW, I have gone over the
> > > rdma drivers; EFA does it this way, and it also has two separate
> > > devices for net and rdma. It inspired me.)
> > 
> > I'm not sure this works for an iWarp device - various things expect to
> > know the netdevice to know how to relate IP addresses to the iWarp
> > stuff - but then I don't really know iWarp.
> 
> As far as I know, an iWarp device only has one GID entry, which is
> generated from the MAC address.
> 
> For iWarp, the CM part of the core code resolves the address, finds
> the route with the help of the kernel's net subsystem, and then
> obtains the correct ibdev by GID matching. GID matching in iWarp is
> really MAC address matching.
> 
> In other words, for iWarp devices the core code doesn't handle IP
> addressing directly; that is done by calling net APIs. The netdev
> set by ib_device_set_netdev is not used in iWarp's CM process.
> 
> The netdev bound to an iWarp device mainly has two purposes:
>   1). generating GID0, using the netdev's MAC address.
>   2). getting the port state and attributes.
> 
> For 1), the erdma device is also bound to its net device by MAC
> address, which can be obtained from our PCIe BAR registers.
> For 2), erdma can also get this information, possibly more
> accurately. For example, erdma can have a different MTU from
> virtio-net in our cloud.
> 
> For RoCEv2, I know that it has many GIDs, some of which are
> generated from IP addresses, with IP addressing handled in the
> core code.

Bernard, Tom what do you think? 

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-18 14:46       ` Jason Gunthorpe
@ 2022-05-18 16:24         ` Cheng Xu
  2022-05-18 16:31           ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Cheng Xu @ 2022-05-18 16:24 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT



On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
>>
>>
>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>
>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>> +	.type = "erdma",
>>>> +	.newlink = erdma_newlink,
>>>> +};
>>>
>>> Why is there still a newlink?
>>>
>>
>> Hello, Jason,
>>
>> About this issue, I have another idea that is simpler and more
>> reasonable.
>>
>> Maybe the erdma driver doesn't need to link to a net device in the
>> kernel. In the core code, ib_device_get_netdev has several use cases:
>>
>>    1). query port info in netlink
>>    2). get eth speed for IB (ib_get_eth_speed)
>>    3). enumerate all RoCE ports (ib_enum_roce_netdev)
>>    4). iw_query_port
>>
>> The case relevant to erdma is 4), but we change it in our patch 02/12.
>> So it seems fine not to link erdma to a net device.
>>
>> * I also tested this solution; it works for both perftest and NoF. *
>>
>> Another issue is how to get the port state and attributes without a
>> net device. For this, erdma can get them from the HW directly.
>>
>> So I think this may be the final solution. (BTW, I have gone over the
>> rdma drivers; EFA does it this way, and it also has two separate
>> devices for net and rdma. It inspired me.)
> 
> I'm not sure this works for an iWarp device - various things expect to
> know the netdevice to know how to relate IP addresses to the iWarp
> stuff - but then I don't really know iWarp.

As far as I know, an iWarp device only has one GID entry, which is
generated from the MAC address.

For iWarp, the CM part of the core code resolves the address, finds
the route with the help of the kernel's net subsystem, and then
obtains the correct ibdev by GID matching. GID matching in iWarp is
really MAC address matching.

In other words, for iWarp devices the core code doesn't handle IP
addressing directly; that is done by calling net APIs. The netdev
set by ib_device_set_netdev is not used in iWarp's CM process.

The netdev bound to an iWarp device mainly has two purposes:
   1). generating GID0, using the netdev's MAC address.
   2). getting the port state and attributes.

For 1), the erdma device is also bound to its net device by MAC
address, which can be obtained from our PCIe BAR registers.
For 2), erdma can also get this information, possibly more
accurately. For example, erdma can have a different MTU from
virtio-net in our cloud.

For RoCEv2, I know that it has many GIDs, some of which are
generated from IP addresses, with IP addressing handled in the
core code.
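
For 1), a minimal sketch of the MAC-to-GID0 derivation (this mirrors
what siw does for its node GUID; the function name is illustrative):

#include <net/addrconf.h>

/* Expand the 6-byte MAC into the 8-byte EUI-64 used for GID0 /
 * the node GUID of an iWarp device.
 */
static void erdma_set_node_guid(struct ib_device *ibdev, const u8 *mac)
{
	addrconf_addr_eui48((u8 *)&ibdev->node_guid, mac);
}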

> If it works at all it is not a great idea

As I explained above (I hope I explained it clearly, but I'm a little
worried about my English :) ), the netdev bound by ib_device_set_netdev
has limited use for an iWarp device. Compared with the v8 approach
(traversing the netdevs in the kernel to find the matching net device),
I think not setting the netdev is better.

Given this explanation, what do you think? If it is still not OK, I
will keep working on the v8 approach. Otherwise, I will send a v9 patch.

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-18  8:30     ` Cheng Xu
@ 2022-05-18 14:46       ` Jason Gunthorpe
  2022-05-18 16:24         ` Cheng Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2022-05-18 14:46 UTC (permalink / raw)
  To: Cheng Xu; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT

On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
> 
> 
> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > 
> > > +static struct rdma_link_ops erdma_link_ops = {
> > > +	.type = "erdma",
> > > +	.newlink = erdma_newlink,
> > > +};
> > 
> > Why is there still a newlink?
> > 
> 
> Hello, Jason,
> 
> About this issue, I have another idea that is simpler and more
> reasonable.
> 
> Maybe the erdma driver doesn't need to link to a net device in the
> kernel. In the core code, ib_device_get_netdev has several use cases:
> 
>   1). query port info in netlink
>   2). get eth speed for IB (ib_get_eth_speed)
>   3). enumerate all RoCE ports (ib_enum_roce_netdev)
>   4). iw_query_port
> 
> The case relevant to erdma is 4), but we change it in our patch 02/12.
> So it seems fine not to link erdma to a net device.
> 
> * I also tested this solution; it works for both perftest and NoF. *
> 
> Another issue is how to get the port state and attributes without a
> net device. For this, erdma can get them from the HW directly.
> 
> So I think this may be the final solution. (BTW, I have gone over the
> rdma drivers; EFA does it this way, and it also has two separate
> devices for net and rdma. It inspired me.)

I'm not sure this works for an iWarp device - various things expect to
know the netdevice to know how to relate IP addresses to the iWarp
stuff - but then I don't really know iWarp.

If it works at all it is not a great idea.

EFA gets by because it doesn't use IP addressing at all.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-10 13:17   ` Jason Gunthorpe
  2022-05-16  3:15     ` Cheng Xu
@ 2022-05-18  8:30     ` Cheng Xu
  2022-05-18 14:46       ` Jason Gunthorpe
  1 sibling, 1 reply; 22+ messages in thread
From: Cheng Xu @ 2022-05-18  8:30 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT



On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> 
>> +static struct rdma_link_ops erdma_link_ops = {
>> +	.type = "erdma",
>> +	.newlink = erdma_newlink,
>> +};
> 
> Why is there still a newlink?
> 

Hello, Jason,

About this issue, I have another idea that is simpler and more
reasonable.

Maybe the erdma driver doesn't need to link to a net device in the
kernel. In the core code, ib_device_get_netdev has several use cases:

   1). query port info in netlink
   2). get eth speed for IB (ib_get_eth_speed)
   3). enumerate all RoCE ports (ib_enum_roce_netdev)
   4). iw_query_port

The case relevant to erdma is 4), but we change it in our patch 02/12.
So it seems fine not to link erdma to a net device.

* I also tested this solution; it works for both perftest and NoF. *

Another issue is how to get the port state and attributes without a
net device. For this, erdma can get them from the HW directly.
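
For example, an untested sketch of a query_port that reads HW state
instead of a bound netdev (register names and helpers are made up):

static int erdma_query_port(struct ib_device *ibdev, u32 port,
			    struct ib_port_attr *attr)
{
	struct erdma_dev *edev = to_edev(ibdev);	/* container_of wrapper */
	u32 mtu = erdma_reg_read32(edev, ERDMA_REG_MTU);

	memset(attr, 0, sizeof(*attr));
	attr->gid_tbl_len = 1;
	attr->active_mtu = ib_mtu_int_to_enum(mtu);
	attr->max_mtu = attr->active_mtu;
	/* port state comes straight from the device, not from a netdev */
	if (erdma_reg_read32(edev, ERDMA_REG_LINK_STATUS)) {
		attr->state = IB_PORT_ACTIVE;
		attr->phys_state = IB_PORT_PHYS_STATE_LINK_UP;
	} else {
		attr->state = IB_PORT_DOWN;
		attr->phys_state = IB_PORT_PHYS_STATE_DISABLED;
	}

	return 0;
}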

So I think this may be the final solution. (BTW, I have gone over the
rdma drivers; EFA does it this way, and it also has two separate
devices for net and rdma. It inspired me.)

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-16 17:31                 ` Jason Gunthorpe
@ 2022-05-17  1:53                   ` Cheng Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Cheng Xu @ 2022-05-17  1:53 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT



On 5/17/22 1:31 AM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 11:14:24PM +0800, Cheng Xu wrote:
>>>> I thought about this twice; using a module-wide net notifier cannot
>>>> solve this, due to the lack of information about all ib devices in
>>>> the erdma module.
>>>
>>> I don't understand what this means
>> At first I wanted to register a global net notifier in our module and
>> link the ib device and net device in the callback if they match. But
>> in the callback we cannot get the ib devices registered by erdma,
>> because our driver module does not maintain the ib device list. This
>> is not the right way.
> 
> The core has the device list; if you need to search it, then you need
> to add a new searching function
> 

I made a mistake; instead of what I said, erdma should search the net
devices from a workqueue to find the matching net device, not search
the ib devices in a net notifier callback.

I think this should be fine.
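
An untested sketch of this idea (made-up names; erdma_find_netdev
stands for the MAC-based lookup over the netdev list):

static void erdma_netdev_work(struct work_struct *work)
{
	struct erdma_dev *edev = container_of(work, struct erdma_dev,
					      netdev_work);
	struct net_device *ndev = erdma_find_netdev(edev);

	if (ndev)
		ib_device_set_netdev(&edev->ibdev, ndev, 1);
}

/* in probe, defer the search instead of doing it in a notifier: */
	INIT_WORK(&edev->netdev_work, erdma_netdev_work);
	schedule_work(&edev->netdev_work);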

>>
>>>> The solution may be simple: register a per-device net notifier in
>>>> probe, and call ib_device_set_netdev in the notifier callback if it
>>>> matches.
>>>
>>> That sounds wrong
>>>
>>
>> OK. So is using the rdma link command to handle the link relationship
>> manually still a better choice?
> 
> No
>   
>> Actually, erdma looks like a hardware implementation of siw/rxe (more
>> like siw, because erdma uses iWarp and has a similar CM
>> implementation): the ethernet functionality is provided by an existing
>> net device, no matter what kind of net device it is (since virtio-net
>> is the de-facto standard in cloud virtualization, erdma is bound to
>> virtio-net in practice). Given this similarity, siw/rxe use 'rdma
>> link' to link the ib device and net device (including creating the ib
>> device). Does it sound reasonable for erdma to use it to link the net
>> device?
> 
> It is not like siw because it cannot run on any netdevice, it can only
> run on the single netdevice the HW is linked to in the background.
> 

Yeah, you are exactly right; I omitted this detail in my last
reply.

Thanks,
Cheng Xu

> Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-16 15:14               ` Cheng Xu
@ 2022-05-16 17:31                 ` Jason Gunthorpe
  2022-05-17  1:53                   ` Cheng Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2022-05-16 17:31 UTC (permalink / raw)
  To: Cheng Xu; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT

On Mon, May 16, 2022 at 11:14:24PM +0800, Cheng Xu wrote:
> > > I thought about this twice; using a module-wide net notifier cannot
> > > solve this, due to the lack of information about all ib devices in
> > > the erdma module.
> > 
> > I don't understand what this means
> At first I wanted to register a global net notifier in our module and
> link the ib device and net device in the callback if they match. But
> in the callback we cannot get the ib devices registered by erdma,
> because our driver module does not maintain the ib device list. This
> is not the right way.

The core has the device list; if you need to search it, then you need
to add a new searching function

> 
> > The solution may be simple: register a per-device net notifier in
> > probe, and call ib_device_set_netdev in the notifier callback if it
> > matches.
> > 
> > That sounds wrong
> > 
> 
> OK. So is using the rdma link command to handle the link relationship
> manually still a better choice?

No
 
> Actually, erdma looks like a hardware implementation of siw/rxe (more
> like siw, because erdma uses iWarp and has a similar CM
> implementation): the ethernet functionality is provided by an existing
> net device, no matter what kind of net device it is (since virtio-net
> is the de-facto standard in cloud virtualization, erdma is bound to
> virtio-net in practice). Given this similarity, siw/rxe use 'rdma
> link' to link the ib device and net device (including creating the ib
> device). Does it sound reasonable for erdma to use it to link the net
> device?

It is not like siw because it cannot run on any netdevice, it can only
run on the single netdevice the HW is linked to in the background.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-16 14:07             ` Jason Gunthorpe
@ 2022-05-16 15:14               ` Cheng Xu
  2022-05-16 17:31                 ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Cheng Xu @ 2022-05-16 15:14 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT



On 5/16/22 10:07 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 09:59:28PM +0800, Cheng Xu wrote:
>>
>>
>> On 5/16/22 8:40 PM, Cheng Xu wrote:
>>>
>>>
>>> On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
>>>> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>>>>
>>>>>
>>>>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>>>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>>>>
>>>>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>>>>> +    .type = "erdma",
>>>>>>> +    .newlink = erdma_newlink,
>>>>>>> +};
>>>>>>
>>>>>> Why is there still a newlink?
>>>>>>
>>>>>> Jason
>>>>>
>>>>>
>>>>> Yeah, I remember your suggestion that the ibdev should have the
>>>>> same lifecycle as its PCI device, and it does now.
>>>>>
>>>>> The initialization flow for erdma is now:
>>>>>        probe:
>>>>>          - hardware-specific initialization
>>>>>          - IB device initialization
>>>>>          - calling ib_register_device with ibdev->netdev == NULL
>>>>>
>>>>> After probe, the ib device has been registered, but we have left it
>>>>> in an invalid state. To fully complete the initialization, we
>>>>> should set ibdev->netdev.
>>>>>
>>>>> And the newlink command in erdma now does only one thing: it sets
>>>>> ibdev->netdev if the rule matches, and it is unrelated to the ib
>>>>> device lifecycle.
>>>>
>>>> This is not what newlink is for, it can't be used like this
>>>>
>>>> Jason
>>>
>>>
>>> [A typo in previous reply, so I re-edit it]
>>>
>> <...>
>> Or, is it OK for erdma to register a net notifier in our module to
>> handle this?
>>>
>>
>> I thought about this twice; using a module-wide net notifier cannot
>> solve this, due to the lack of information about all ib devices in
>> the erdma module.
> 
> I don't understand what this means

At first I wanted to register a global net notifier in our module and
link the ib device and net device in the callback if they match. But in
the callback we cannot get the ib devices registered by erdma, because
our driver module does not maintain the ib device list. This is not the
right way.

>> The solution may be simple: register a per-device net notifier in
>> probe, and call ib_device_set_netdev in the notifier callback if it
>> matches.
> 
> That sounds wrong
> 

OK. So is using the rdma link command to handle the link relationship
manually still a better choice?

Actually, erdma looks like a hardware implementation of siw/rxe (more
like siw, because erdma uses iWarp and has a similar CM
implementation): the ethernet functionality is provided by an existing
net device, no matter what kind of net device it is (since virtio-net
is the de-facto standard in cloud virtualization, erdma is bound to
virtio-net in practice). Given this similarity, siw/rxe use 'rdma link'
to link the ib device and net device (including creating the ib
device). Does it sound reasonable for erdma to use it to link the net
device?

Could you reconsider this approach? Or could you give us some
suggestions? I think this is the main and most important issue for our
upstreaming. I would be very grateful.

Thanks,
Cheng Xu





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-16 13:59           ` Cheng Xu
@ 2022-05-16 14:07             ` Jason Gunthorpe
  2022-05-16 15:14               ` Cheng Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2022-05-16 14:07 UTC (permalink / raw)
  To: Cheng Xu; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT

On Mon, May 16, 2022 at 09:59:28PM +0800, Cheng Xu wrote:
> 
> 
> On 5/16/22 8:40 PM, Cheng Xu wrote:
> > 
> > 
> > On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
> > > On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
> > > > 
> > > > 
> > > > On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > > > > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > > > > 
> > > > > > +static struct rdma_link_ops erdma_link_ops = {
> > > > > > +    .type = "erdma",
> > > > > > +    .newlink = erdma_newlink,
> > > > > > +};
> > > > > 
> > > > > Why is there still a newlink?
> > > > > 
> > > > > Jason
> > > > 
> > > > 
> > > > Yeah, I remember your suggestion that the ibdev should have the
> > > > same lifecycle as its PCI device, and it does now.
> > > > 
> > > > The initialization flow for erdma is now:
> > > >       probe:
> > > >         - hardware-specific initialization
> > > >         - IB device initialization
> > > >         - calling ib_register_device with ibdev->netdev == NULL
> > > > 
> > > > After probe, the ib device has been registered, but we have left
> > > > it in an invalid state. To fully complete the initialization, we
> > > > should set ibdev->netdev.
> > > > 
> > > > And the newlink command in erdma now does only one thing: it sets
> > > > ibdev->netdev if the rule matches, and it is unrelated to the ib
> > > > device lifecycle.
> > > 
> > > This is not what newlink is for, it can't be used like this
> > > 
> > > Jason
> > 
> > 
> > [A typo in previous reply, so I re-edit it]
> > 
> <...>
> > Or, is it OK for erdma to register a net notifier in our module to
> > handle this?
> > 
> 
> I thought about this twice; using a module-wide net notifier cannot
> solve this, due to the lack of information about all ib devices in the
> erdma module.

I don't understand what this means

> The solution may be simple: register a per-device net notifier in
> probe, and call ib_device_set_netdev in the notifier callback if it
> matches.

That sounds wrong

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-16 12:40         ` Cheng Xu
@ 2022-05-16 13:59           ` Cheng Xu
  2022-05-16 14:07             ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Cheng Xu @ 2022-05-16 13:59 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT



On 5/16/22 8:40 PM, Cheng Xu wrote:
> 
> 
> On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
>> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>>
>>>
>>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>>
>>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>>> +    .type = "erdma",
>>>>> +    .newlink = erdma_newlink,
>>>>> +};
>>>>
>>>> Why is there still a newlink?
>>>>
>>>> Jason
>>>
>>>
>>> Yeah, I remember your suggestion that the ibdev should have the same
>>> lifecycle as its PCI device, and it does now.
>>>
>>> The initialization flow for erdma is now:
>>>       probe:
>>>         - hardware-specific initialization
>>>         - IB device initialization
>>>         - calling ib_register_device with ibdev->netdev == NULL
>>>
>>> After probe, the ib device has been registered, but we have left it
>>> in an invalid state. To fully complete the initialization, we should
>>> set ibdev->netdev.
>>>
>>> And the newlink command in erdma now does only one thing: it sets
>>> ibdev->netdev if the rule matches, and it is unrelated to the ib
>>> device lifecycle.
>>
>> This is not what newlink is for, it can't be used like this
>>
>> Jason
> 
> 
> [A typo in previous reply, so I re-edit it]
> 
<...>
> Or, is it OK for erdma to register a net notifier in our module to
> handle this?
> 

I thought about this twice; using a module-wide net notifier cannot
solve this, due to the lack of information about all ib devices in the
erdma module.

The solution may be simple: register a per-device net notifier in
probe, and call ib_device_set_netdev in the notifier callback if it
matches.

Would this be the right way?
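
An untested sketch of this per-device notifier (made-up names;
edev->attrs.peer_addr stands for the MAC read from the BAR registers):

static int erdma_netdev_match(struct notifier_block *nb,
			      unsigned long event, void *arg)
{
	struct net_device *ndev = netdev_notifier_info_to_dev(arg);
	struct erdma_dev *edev = container_of(nb, struct erdma_dev,
					      netdev_nb);

	if (event != NETDEV_REGISTER ||
	    !ether_addr_equal(ndev->dev_addr, edev->attrs.peer_addr))
		return NOTIFY_DONE;

	/* bind the matching netdev to the already-registered ib device */
	ib_device_set_netdev(&edev->ibdev, ndev, 1);
	return NOTIFY_OK;
}

/* in probe: */
	edev->netdev_nb.notifier_call = erdma_netdev_match;
	register_netdevice_notifier(&edev->netdev_nb);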

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-16 11:49       ` Jason Gunthorpe
  2022-05-16 12:37         ` Cheng Xu
@ 2022-05-16 12:40         ` Cheng Xu
  2022-05-16 13:59           ` Cheng Xu
  1 sibling, 1 reply; 22+ messages in thread
From: Cheng Xu @ 2022-05-16 12:40 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT



On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>
>>
>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>
>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>> +	.type = "erdma",
>>>> +	.newlink = erdma_newlink,
>>>> +};
>>>
>>> Why is there still a newlink?
>>>
>>> Jason
>>
>>
>> Yeah, I remember your suggestion that the ibdev should have the same
>> lifecycle as its PCI device, and it does now.
>>
>> The initialization flow for erdma is now:
>>       probe:
>>         - hardware-specific initialization
>>         - IB device initialization
>>         - calling ib_register_device with ibdev->netdev == NULL
>>
>> After probe, the ib device has been registered, but we have left it
>> in an invalid state. To fully complete the initialization, we should
>> set ibdev->netdev.
>>
>> And the newlink command in erdma now does only one thing: it sets
>> ibdev->netdev if the rule matches, and it is unrelated to the ib
>> device lifecycle.
> 
> This is not what newlink is for, it can't be used like this
> 
> Jason


[A typo in previous reply, so I re-edit it]

Yeah, newlink is only used by siw and rxe to control their ib
device lifecycle now.

But as for the usage of newlink, it is triggered by the 'rdma link add'
command. In my opinion, the 'rdma link' command set is all about
displaying (rdma link show) or managing the link relationship between
an ib device and a net device (rdma link add/del). And what erdma needs
is to set up the link between the ib device and the net device.

So, though the current usage is controlling the lifecycle of siw/rxe,
given the command's naming and connotation, would it be OK to expand
the usage of newlink to meet our requirement?

Or, is it OK for erdma to register a net notifier in our module to
handle this?

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-16 11:49       ` Jason Gunthorpe
@ 2022-05-16 12:37         ` Cheng Xu
  2022-05-16 12:40         ` Cheng Xu
  1 sibling, 0 replies; 22+ messages in thread
From: Cheng Xu @ 2022-05-16 12:37 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT



On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>
>>
>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>
>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>> +	.type = "erdma",
>>>> +	.newlink = erdma_newlink,
>>>> +};
>>>
>>> Why is there still a newlink?
>>>
>>> Jason
>>
>>
>> Yeah, I remember your suggestion that the ibdev should have the same
>> lifecycle as its PCI device, and it does now.
>>
>> The initialization flow for erdma is now:
>>       probe:
>>         - Hardware-specific initialization
>>         - IB device initialization
>>         - Calling ib_register_device with ibdev->netdev == NULL
>>
>> After probe, the ib device has been registered, but it is left in an
>> invalid state.
>> To fully complete the initialization, we should set ibdev->netdev.
>>
>> And the newlink command in erdma does only one thing now: it sets
>> ibdev->netdev if the rule matches, and it is unrelated to the
>> ib device lifecycle.
> 
> This is not what newlink is for; it can't be used like this.
> 
> Jason

Yeah, newlink is currently only used by siw and rxe to control their
ib device lifecycle.

But newlink is triggered by the 'rdma link add' command. In my opinion,
the 'rdma link' command set is all about displaying (rdma link show) or
managing the link relationship between an ib device and a net device
(rdma link add/del). And what erdma needs is exactly setting up the
link between an ib device and a net device.

So, although the current usage is controlling the lifecycle of siw/rxe,
given the command's naming and connotation, would it be OK to expand
the usage of newlink to meet our requirement?

Or, would it be OK for erdma to register a net notifier in our module
to handle this?

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-16  3:15     ` Cheng Xu
@ 2022-05-16 11:49       ` Jason Gunthorpe
  2022-05-16 12:37         ` Cheng Xu
  2022-05-16 12:40         ` Cheng Xu
  0 siblings, 2 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2022-05-16 11:49 UTC (permalink / raw)
  To: Cheng Xu; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT

On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
> 
> 
> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > 
> > > +static struct rdma_link_ops erdma_link_ops = {
> > > +	.type = "erdma",
> > > +	.newlink = erdma_newlink,
> > > +};
> > 
> > Why is there still a newlink?
> > 
> > Jason
> 
> 
> Yeah, I remember your suggestion that the ibdev should have the same
> lifecycle as its PCI device, and it does now.
> 
> The initialization flow for erdma is now:
>      probe:
>        - Hardware-specific initialization
>        - IB device initialization
>        - Calling ib_register_device with ibdev->netdev == NULL
> 
> After probe, the ib device has been registered, but it is left in an
> invalid state.
> To fully complete the initialization, we should set ibdev->netdev.
> 
> And the newlink command in erdma does only one thing now: it sets
> ibdev->netdev if the rule matches, and it is unrelated to the
> ib device lifecycle.

This is not what newlink is for; it can't be used like this.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-05-10 13:17   ` Jason Gunthorpe
@ 2022-05-16  3:15     ` Cheng Xu
  2022-05-16 11:49       ` Jason Gunthorpe
  2022-05-18  8:30     ` Cheng Xu
  1 sibling, 1 reply; 22+ messages in thread
From: Cheng Xu @ 2022-05-16  3:15 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT



On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> 
>> +static struct rdma_link_ops erdma_link_ops = {
>> +	.type = "erdma",
>> +	.newlink = erdma_newlink,
>> +};
> 
> Why is there still a newlink?
> 
> Jason


Yeah, I remember your suggestion that the ibdev should have the same
lifecycle as its PCI device, and it does now.

The initialization flow for erdma is now:
      probe:
        - Hardware-specific initialization
        - IB device initialization
        - Calling ib_register_device with ibdev->netdev == NULL

After probe, the ib device has been registered, but it is left in an
invalid state.
To fully complete the initialization, we should set ibdev->netdev.

And the newlink command in erdma does only one thing now: it sets
ibdev->netdev if the rule matches (see the snippet below), and it is
unrelated to the ib device lifecycle.
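
The matching rule here is a permanent-MAC comparison against the
adapter's MAC address read from its PCIe BAR registers, essentially
this check from erdma_newlink in the patch:

	if (!ether_addr_equal_unaligned(netdev->perm_addr,
					dev->attrs.peer_addr))
		return -EINVAL;

	ret = ib_device_set_netdev(&dev->ibdev, netdev, 1);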

I think this may be the best way to link the netdev to the ibdev for
erdma. Using a netdev notifier is not good, because we would have to
register a module-wide netdev notifier.

If this way is not proper or there is a better way, please let me
know.

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-04-21  7:17 ` [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module Cheng Xu
@ 2022-05-10 13:17   ` Jason Gunthorpe
  2022-05-16  3:15     ` Cheng Xu
  2022-05-18  8:30     ` Cheng Xu
  0 siblings, 2 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2022-05-10 13:17 UTC (permalink / raw)
  To: Cheng Xu; +Cc: dledford, leon, linux-rdma, KaiShen, tonylu, BMT

On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:

> +static struct rdma_link_ops erdma_link_ops = {
> +	.type = "erdma",
> +	.newlink = erdma_newlink,
> +};

Why is there still a newlink?

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module
  2022-04-21  7:17 [PATCH for-next v7 00/12] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
@ 2022-04-21  7:17 ` Cheng Xu
  2022-05-10 13:17   ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Cheng Xu @ 2022-04-21  7:17 UTC (permalink / raw)
  To: jgg, dledford, leon; +Cc: linux-rdma, KaiShen, chengyou, tonylu, BMT

Add the main erdma module, which provides the interface to the
infiniband subsystem. Also fix a bug reported by the kernel test robot.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_main.c | 625 +++++++++++++++++++++++
 1 file changed, 625 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c

diff --git a/drivers/infiniband/hw/erdma/erdma_main.c b/drivers/infiniband/hw/erdma/erdma_main.c
new file mode 100644
index 000000000000..86c6f574fc68
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_main.c
@@ -0,0 +1,625 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+
+/* Authors: Cheng Xu <chengyou@linux.alibaba.com> */
+/*          Kai Shen <kaishen@linux.alibaba.com> */
+/* Copyright (c) 2020-2022, Alibaba Group. */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <net/addrconf.h>
+#include <rdma/erdma-abi.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_hw.h"
+#include "erdma_verbs.h"
+
+MODULE_AUTHOR("Cheng Xu <chengyou@linux.alibaba.com>");
+MODULE_DESCRIPTION("Alibaba elasticRDMA adapter driver");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static int erdma_device_register(struct erdma_dev *dev)
+{
+	struct ib_device *ibdev = &dev->ibdev;
+	int ret;
+
+	memset(ibdev->name, 0, IB_DEVICE_NAME_MAX);
+	/*
+	 * In the Ali ECS environment, an ENI's MAC address is unique
+	 * within a VPC. So we generate the ibdev's name from the MAC
+	 * address of the bound netdev.
+	 */
+	ret = snprintf(ibdev->name, IB_DEVICE_NAME_MAX, "%s_%.2x%.2x%.2x",
+		       DRV_MODULE_NAME, dev->attrs.peer_addr[3],
+		       dev->attrs.peer_addr[4], dev->attrs.peer_addr[5]);
+	if (ret < 0)
+		return ret;
+
+	ibdev->phys_port_cnt = 1;
+	ret = ib_device_set_netdev(ibdev, NULL, 1);
+	if (ret)
+		return ret;
+
+	ret = ib_register_device(ibdev, ibdev->name, &dev->pdev->dev);
+	if (ret)
+		dev_err(&dev->pdev->dev,
+			"ib_register_device(%s) failed: ret = %d\n",
+			ibdev->name, ret);
+
+	return ret;
+}
+
+static int erdma_netdev_event(struct notifier_block *nb, unsigned long event,
+			      void *arg)
+{
+	struct net_device *netdev = netdev_notifier_info_to_dev(arg);
+	struct erdma_dev *dev = container_of(nb, struct erdma_dev, netdev_nb);
+
+	if (dev->netdev == NULL || dev->netdev != netdev)
+		goto done;
+
+	switch (event) {
+	case NETDEV_UP:
+		dev->state = IB_PORT_ACTIVE;
+		erdma_port_event(dev, IB_EVENT_PORT_ACTIVE);
+		break;
+	case NETDEV_DOWN:
+		dev->state = IB_PORT_DOWN;
+		erdma_port_event(dev, IB_EVENT_PORT_ERR);
+		break;
+	case NETDEV_REGISTER:
+	case NETDEV_UNREGISTER:
+	case NETDEV_CHANGEADDR:
+	case NETDEV_CHANGEMTU:
+	case NETDEV_GOING_DOWN:
+	case NETDEV_CHANGE:
+	default:
+		break;
+	}
+
+done:
+	return NOTIFY_OK;
+}
+
+static int erdma_newlink(const char *dev_name, struct net_device *netdev)
+{
+	struct ib_device *ibdev;
+	struct erdma_dev *dev;
+	int ret;
+
+	ibdev = ib_device_get_by_netdev(netdev, RDMA_DRIVER_ERDMA);
+	if (ibdev) {
+		ib_device_put(ibdev);
+		return -EEXIST;
+	}
+
+	ibdev = ib_device_get_by_name(dev_name, RDMA_DRIVER_ERDMA);
+	if (!ibdev)
+		return -ENODEV;
+
+	dev = to_edev(ibdev);
+
+	if (!ether_addr_equal_unaligned(netdev->perm_addr,
+					dev->attrs.peer_addr)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	addrconf_addr_eui48((u8 *)&dev->ibdev.node_guid, netdev->dev_addr);
+	ret = ib_device_set_netdev(&dev->ibdev, netdev, 1);
+	if (ret)
+		goto out;
+
+	if (netif_running(netdev) && netif_carrier_ok(netdev))
+		dev->state = IB_PORT_ACTIVE;
+	else
+		dev->state = IB_PORT_DOWN;
+
+	dev->netdev_nb.notifier_call = erdma_netdev_event;
+	dev->netdev = netdev;
+
+	ret = register_netdevice_notifier(&dev->netdev_nb);
+out:
+	ib_device_put(ibdev);
+	return ret;
+}
+
+static struct rdma_link_ops erdma_link_ops = {
+	.type = "erdma",
+	.newlink = erdma_newlink,
+};
+
+static irqreturn_t erdma_comm_irq_handler(int irq, void *data)
+{
+	struct erdma_dev *dev = data;
+
+	erdma_cmdq_completion_handler(&dev->cmdq);
+	erdma_aeq_event_handler(dev);
+
+	return IRQ_HANDLED;
+}
+
+static void erdma_dwqe_resource_init(struct erdma_dev *dev)
+{
+	int total_pages, type0, type1;
+
+	dev->attrs.grp_num = erdma_reg_read32(dev, ERDMA_REGS_GRP_NUM_REG);
+
+	if (dev->attrs.grp_num < 4)
+		dev->attrs.disable_dwqe = true;
+	else
+		dev->attrs.disable_dwqe = false;
+
+	/* One page contains 4 groups. */
+	total_pages = dev->attrs.grp_num * 4;
+
+	if (dev->attrs.grp_num >= ERDMA_DWQE_MAX_GRP_CNT) {
+		dev->attrs.grp_num = ERDMA_DWQE_MAX_GRP_CNT;
+		type0 = ERDMA_DWQE_TYPE0_CNT;
+		type1 = ERDMA_DWQE_TYPE1_CNT / ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
+	} else {
+		type1 = total_pages / 3;
+		type0 = total_pages - type1 - 1;
+	}
+
+	dev->attrs.dwqe_pages = type0;
+	dev->attrs.dwqe_entries = type1 * ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
+}
+
+static int erdma_request_vectors(struct erdma_dev *dev)
+{
+	int expect_irq_num = min(num_possible_cpus() + 1, ERDMA_NUM_MSIX_VEC);
+
+	dev->attrs.irq_num = pci_alloc_irq_vectors(dev->pdev, 1, expect_irq_num,
+						   PCI_IRQ_MSIX);
+	if (dev->attrs.irq_num <= 0) {
+		dev_err(&dev->pdev->dev, "request irq vectors failed(%d)\n",
+			dev->attrs.irq_num);
+		return -ENOSPC;
+	}
+
+	return 0;
+}
+
+static int erdma_comm_irq_init(struct erdma_dev *dev)
+{
+	snprintf(dev->comm_irq.name, ERDMA_IRQNAME_SIZE, "erdma-common@pci:%s",
+		 pci_name(dev->pdev));
+	dev->comm_irq.msix_vector =
+		pci_irq_vector(dev->pdev, ERDMA_MSIX_VECTOR_CMDQ);
+
+	cpumask_set_cpu(cpumask_first(cpumask_of_pcibus(dev->pdev->bus)),
+			&dev->comm_irq.affinity_hint_mask);
+	irq_set_affinity_hint(dev->comm_irq.msix_vector,
+			      &dev->comm_irq.affinity_hint_mask);
+
+	return request_irq(dev->comm_irq.msix_vector, erdma_comm_irq_handler, 0,
+			   dev->comm_irq.name, dev);
+}
+
+static void erdma_comm_irq_uninit(struct erdma_dev *dev)
+{
+	irq_set_affinity_hint(dev->comm_irq.msix_vector, NULL);
+	free_irq(dev->comm_irq.msix_vector, dev);
+}
+
+static int erdma_device_init(struct erdma_dev *dev, struct pci_dev *pdev)
+{
+	erdma_dwqe_resource_init(dev);
+
+	return dma_set_mask_and_coherent(&pdev->dev,
+					 DMA_BIT_MASK(ERDMA_PCI_WIDTH));
+}
+
+static void erdma_device_uninit(struct erdma_dev *dev)
+{
+	u32 ctrl = FIELD_PREP(ERDMA_REG_DEV_CTRL_RESET_MASK, 1);
+
+	erdma_reg_write32(dev, ERDMA_REGS_DEV_CTRL_REG, ctrl);
+}
+
+static const struct pci_device_id erdma_pci_tbl[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_ALIBABA, 0x107f) },
+	{}
+};
+
+static int erdma_probe_dev(struct pci_dev *pdev)
+{
+	int err;
+	struct erdma_dev *dev;
+	u32 version;
+	int bars;
+
+	err = pci_enable_device(pdev);
+	if (err) {
+		dev_err(&pdev->dev, "pci_enable_device failed(%d)\n", err);
+		return err;
+	}
+
+	pci_set_master(pdev);
+
+	dev = ib_alloc_device(erdma_dev, ibdev);
+	if (!dev) {
+		dev_err(&pdev->dev, "ib_alloc_device failed\n");
+		err = -ENOMEM;
+		goto err_disable_device;
+	}
+
+	pci_set_drvdata(pdev, dev);
+	dev->pdev = pdev;
+	dev->attrs.numa_node = dev_to_node(&pdev->dev);
+
+	bars = pci_select_bars(pdev, IORESOURCE_MEM);
+	err = pci_request_selected_regions(pdev, bars, DRV_MODULE_NAME);
+	if (bars != ERDMA_BAR_MASK || err) {
+		err = err ? err : -EINVAL;
+		goto err_ib_device_release;
+	}
+
+	dev->func_bar_addr = pci_resource_start(pdev, ERDMA_FUNC_BAR);
+	dev->func_bar_len = pci_resource_len(pdev, ERDMA_FUNC_BAR);
+
+	dev->func_bar =
+		devm_ioremap(&pdev->dev, dev->func_bar_addr, dev->func_bar_len);
+	if (!dev->func_bar) {
+		dev_err(&pdev->dev, "devm_ioremap failed.\n");
+		err = -EFAULT;
+		goto err_release_bars;
+	}
+
+	version = erdma_reg_read32(dev, ERDMA_REGS_VERSION_REG);
+	if (version == 0) {
+		/* We know this is a non-functional PCI function. */
+		err = -ENODEV;
+		goto err_iounmap_func_bar;
+	}
+
+	err = erdma_device_init(dev, pdev);
+	if (err)
+		goto err_iounmap_func_bar;
+
+	err = erdma_request_vectors(dev);
+	if (err)
+		goto err_iounmap_func_bar;
+
+	err = erdma_comm_irq_init(dev);
+	if (err)
+		goto err_free_vectors;
+
+	err = erdma_aeq_init(dev);
+	if (err)
+		goto err_uninit_comm_irq;
+
+	err = erdma_cmdq_init(dev);
+	if (err)
+		goto err_uninit_aeq;
+
+	err = erdma_ceqs_init(dev);
+	if (err)
+		goto err_uninit_cmdq;
+
+	erdma_finish_cmdq_init(dev);
+
+	return 0;
+
+err_uninit_cmdq:
+	erdma_device_uninit(dev);
+	erdma_cmdq_destroy(dev);
+
+err_uninit_aeq:
+	erdma_aeq_destroy(dev);
+
+err_uninit_comm_irq:
+	erdma_comm_irq_uninit(dev);
+
+err_free_vectors:
+	pci_free_irq_vectors(dev->pdev);
+
+err_iounmap_func_bar:
+	devm_iounmap(&pdev->dev, dev->func_bar);
+
+err_release_bars:
+	pci_release_selected_regions(pdev, bars);
+
+err_ib_device_release:
+	ib_dealloc_device(&dev->ibdev);
+
+err_disable_device:
+	pci_disable_device(pdev);
+
+	return err;
+}
+
+static void erdma_remove_dev(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+
+	erdma_ceqs_uninit(dev);
+
+	erdma_device_uninit(dev);
+
+	erdma_cmdq_destroy(dev);
+	erdma_aeq_destroy(dev);
+	erdma_comm_irq_uninit(dev);
+	pci_free_irq_vectors(dev->pdev);
+
+	devm_iounmap(&pdev->dev, dev->func_bar);
+	pci_release_selected_regions(pdev, ERDMA_BAR_MASK);
+
+	ib_dealloc_device(&dev->ibdev);
+
+	pci_disable_device(pdev);
+}
+
+#define ERDMA_GET_CAP(name, cap) FIELD_GET(ERDMA_CMD_DEV_CAP_##name##_MASK, cap)
+
+static int erdma_dev_attrs_init(struct erdma_dev *dev)
+{
+	int err;
+	u64 req_hdr, cap0, cap1;
+
+	erdma_cmdq_build_reqhdr(&req_hdr, CMDQ_SUBMOD_RDMA,
+				CMDQ_OPCODE_QUERY_DEVICE);
+
+	err = erdma_post_cmd_wait(&dev->cmdq, &req_hdr, sizeof(req_hdr), &cap0,
+				  &cap1);
+	if (err)
+		return err;
+
+	dev->attrs.max_cqe = 1 << ERDMA_GET_CAP(MAX_CQE, cap0);
+	dev->attrs.max_mr_size = 1ULL << ERDMA_GET_CAP(MAX_MR_SIZE, cap0);
+	dev->attrs.max_mw = 1 << ERDMA_GET_CAP(MAX_MW, cap1);
+	dev->attrs.max_recv_wr = 1 << ERDMA_GET_CAP(MAX_RECV_WR, cap0);
+	dev->attrs.local_dma_key = ERDMA_GET_CAP(DMA_LOCAL_KEY, cap1);
+	dev->attrs.cc = ERDMA_GET_CAP(DEFAULT_CC, cap1);
+	dev->attrs.max_qp = ERDMA_NQP_PER_QBLOCK * ERDMA_GET_CAP(QBLOCK, cap1);
+	dev->attrs.max_mr = dev->attrs.max_qp << 1;
+	dev->attrs.max_cq = dev->attrs.max_qp << 1;
+
+	dev->attrs.max_send_wr = ERDMA_MAX_SEND_WR;
+	dev->attrs.max_ord = ERDMA_MAX_ORD;
+	dev->attrs.max_ird = ERDMA_MAX_IRD;
+	dev->attrs.max_send_sge = ERDMA_MAX_SEND_SGE;
+	dev->attrs.max_recv_sge = ERDMA_MAX_RECV_SGE;
+	dev->attrs.max_sge_rd = ERDMA_MAX_SGE_RD;
+	dev->attrs.max_pd = ERDMA_MAX_PD;
+
+	dev->res_cb[ERDMA_RES_TYPE_PD].max_cap = ERDMA_MAX_PD;
+	dev->res_cb[ERDMA_RES_TYPE_STAG_IDX].max_cap = dev->attrs.max_mr;
+
+	erdma_cmdq_build_reqhdr(&req_hdr, CMDQ_SUBMOD_COMMON,
+				CMDQ_OPCODE_QUERY_FW_INFO);
+
+	err = erdma_post_cmd_wait(&dev->cmdq, &req_hdr, sizeof(req_hdr), &cap0,
+				  &cap1);
+
+	dev->attrs.fw_version = FIELD_GET(ERDMA_CMD_INFO0_FW_VER_MASK, cap0);
+
+	return err;
+}
+
+static int erdma_res_cb_init(struct erdma_dev *dev)
+{
+	int i, j;
+
+	for (i = 0; i < ERDMA_RES_CNT; i++) {
+		dev->res_cb[i].next_alloc_idx = 1;
+		spin_lock_init(&dev->res_cb[i].lock);
+		dev->res_cb[i].bitmap =
+			kcalloc(BITS_TO_LONGS(dev->res_cb[i].max_cap),
+				sizeof(unsigned long), GFP_KERNEL);
+		/* We will free the memory in erdma_res_cb_free */
+		if (!dev->res_cb[i].bitmap)
+			goto err;
+	}
+
+	return 0;
+
+err:
+	for (j = 0; j < i; j++)
+		kfree(dev->res_cb[j].bitmap);
+
+	return -ENOMEM;
+}
+
+static void erdma_res_cb_free(struct erdma_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < ERDMA_RES_CNT; i++)
+		kfree(dev->res_cb[i].bitmap);
+}
+
+static const struct ib_device_ops erdma_device_ops = {
+	.owner = THIS_MODULE,
+	.driver_id = RDMA_DRIVER_ERDMA,
+	.uverbs_abi_ver = ERDMA_ABI_VERSION,
+
+	.alloc_mr = erdma_ib_alloc_mr,
+	.alloc_pd = erdma_alloc_pd,
+	.alloc_ucontext = erdma_alloc_ucontext,
+	.create_cq = erdma_create_cq,
+	.create_qp = erdma_create_qp,
+	.dealloc_pd = erdma_dealloc_pd,
+	.dealloc_ucontext = erdma_dealloc_ucontext,
+	.dereg_mr = erdma_dereg_mr,
+	.destroy_cq = erdma_destroy_cq,
+	.destroy_qp = erdma_destroy_qp,
+	.get_dma_mr = erdma_get_dma_mr,
+	.get_port_immutable = erdma_get_port_immutable,
+	.iw_accept = erdma_accept,
+	.iw_add_ref = erdma_qp_get_ref,
+	.iw_connect = erdma_connect,
+	.iw_create_listen = erdma_create_listen,
+	.iw_destroy_listen = erdma_destroy_listen,
+	.iw_get_qp = erdma_get_ibqp,
+	.iw_reject = erdma_reject,
+	.iw_rem_ref = erdma_qp_put_ref,
+	.map_mr_sg = erdma_map_mr_sg,
+	.mmap = erdma_mmap,
+	.mmap_free = erdma_mmap_free,
+	.modify_qp = erdma_modify_qp,
+	.post_recv = erdma_post_recv,
+	.post_send = erdma_post_send,
+	.poll_cq = erdma_poll_cq,
+	.query_device = erdma_query_device,
+	.query_gid = erdma_query_gid,
+	.query_port = erdma_query_port,
+	.query_qp = erdma_query_qp,
+	.req_notify_cq = erdma_req_notify_cq,
+	.reg_user_mr = erdma_reg_user_mr,
+
+	INIT_RDMA_OBJ_SIZE(ib_cq, erdma_cq, ibcq),
+	INIT_RDMA_OBJ_SIZE(ib_pd, erdma_pd, ibpd),
+	INIT_RDMA_OBJ_SIZE(ib_ucontext, erdma_ucontext, ibucontext),
+	INIT_RDMA_OBJ_SIZE(ib_qp, erdma_qp, ibqp),
+};
+
+static int erdma_ib_device_add(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+	struct ib_device *ibdev = &dev->ibdev;
+	u64 mac;
+	int ret = 0;
+
+	ret = erdma_dev_attrs_init(dev);
+	if (ret)
+		return ret;
+
+	ibdev->node_type = RDMA_NODE_RNIC;
+	memcpy(ibdev->node_desc, ERDMA_NODE_DESC, sizeof(ERDMA_NODE_DESC));
+
+	/*
+	 * Current model (one-to-one device association):
+	 * One ERDMA device per net_device or, equivalently,
+	 * per physical port.
+	 */
+	ibdev->phys_port_cnt = 1;
+	ibdev->num_comp_vectors = dev->attrs.irq_num - 1;
+
+	ib_set_device_ops(ibdev, &erdma_device_ops);
+
+	INIT_LIST_HEAD(&dev->cep_list);
+
+	spin_lock_init(&dev->lock);
+	xa_init_flags(&dev->qp_xa, XA_FLAGS_ALLOC1);
+	xa_init_flags(&dev->cq_xa, XA_FLAGS_ALLOC1);
+	dev->next_alloc_cqn = 1;
+	dev->next_alloc_qpn = 1;
+
+	ret = erdma_res_cb_init(dev);
+	if (ret)
+		return ret;
+
+	spin_lock_init(&dev->db_bitmap_lock);
+	bitmap_zero(dev->sdb_page, ERDMA_DWQE_TYPE0_CNT);
+	bitmap_zero(dev->sdb_entry, ERDMA_DWQE_TYPE1_CNT);
+
+	atomic_set(&dev->num_ctx, 0);
+
+	mac = erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_L_REG);
+	mac |= (u64)erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_H_REG) << 32;
+
+	u64_to_ether_addr(mac, dev->attrs.peer_addr);
+
+	ret = erdma_device_register(dev);
+	if (ret)
+		goto err_out;
+
+	return 0;
+
+err_out:
+	xa_destroy(&dev->qp_xa);
+	xa_destroy(&dev->cq_xa);
+
+	erdma_res_cb_free(dev);
+
+	return ret;
+}
+
+static void erdma_ib_device_remove(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+
+	if (dev->netdev) {
+		ib_device_set_netdev(&dev->ibdev, NULL, 1);
+		dev->netdev = NULL;
+		unregister_netdevice_notifier(&dev->netdev_nb);
+	}
+
+	ib_unregister_device(&dev->ibdev);
+
+	erdma_res_cb_free(dev);
+	xa_destroy(&dev->qp_xa);
+	xa_destroy(&dev->cq_xa);
+}
+
+static int erdma_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+	int ret;
+
+	ret = erdma_probe_dev(pdev);
+	if (ret)
+		return ret;
+
+	ret = erdma_ib_device_add(pdev);
+	if (ret) {
+		erdma_remove_dev(pdev);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void erdma_remove(struct pci_dev *pdev)
+{
+	erdma_ib_device_remove(pdev);
+	erdma_remove_dev(pdev);
+}
+
+static struct pci_driver erdma_pci_driver = {
+	.name = DRV_MODULE_NAME,
+	.id_table = erdma_pci_tbl,
+	.probe = erdma_probe,
+	.remove = erdma_remove
+};
+
+MODULE_DEVICE_TABLE(pci, erdma_pci_tbl);
+
+static __init int erdma_init_module(void)
+{
+	int ret;
+
+	ret = erdma_cm_init();
+	if (ret)
+		return ret;
+
+	ret = pci_register_driver(&erdma_pci_driver);
+	if (ret) {
+		erdma_cm_exit();
+		return ret;
+	}
+
+	rdma_link_register(&erdma_link_ops);
+
+	return 0;
+}
+
+static void __exit erdma_exit_module(void)
+{
+	rdma_link_unregister(&erdma_link_ops);
+
+	pci_unregister_driver(&erdma_pci_driver);
+
+	erdma_cm_exit();
+}
+
+module_init(erdma_init_module);
+module_exit(erdma_exit_module);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-05-24  3:09 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-20 15:13 Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module Bernard Metzler
2022-05-23  1:39 ` Cheng Xu
2022-05-23 13:25   ` Tom Talpey
2022-05-24  3:09     ` Cheng Xu
  -- strict thread matches above, loose matches on Subject: below --
2022-04-21  7:17 [PATCH for-next v7 00/12] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
2022-04-21  7:17 ` [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma module Cheng Xu
2022-05-10 13:17   ` Jason Gunthorpe
2022-05-16  3:15     ` Cheng Xu
2022-05-16 11:49       ` Jason Gunthorpe
2022-05-16 12:37         ` Cheng Xu
2022-05-16 12:40         ` Cheng Xu
2022-05-16 13:59           ` Cheng Xu
2022-05-16 14:07             ` Jason Gunthorpe
2022-05-16 15:14               ` Cheng Xu
2022-05-16 17:31                 ` Jason Gunthorpe
2022-05-17  1:53                   ` Cheng Xu
2022-05-18  8:30     ` Cheng Xu
2022-05-18 14:46       ` Jason Gunthorpe
2022-05-18 16:24         ` Cheng Xu
2022-05-18 16:31           ` Jason Gunthorpe
2022-05-19 16:20             ` Bernard Metzler
2022-05-19 18:51               ` Tom Talpey
2022-05-20  7:03               ` Cheng Xu
