linux-rdma.vger.kernel.org archive mirror
* Is anyone working on implementing LAG in ib core?
@ 2020-02-22  3:48 Weihang Li
  2020-02-22 23:40 ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Weihang Li @ 2020-02-22  3:48 UTC (permalink / raw)
  To: dledford, jgg, leon; +Cc: linux-rdma, linuxarm

Hi all,

We are planning to implement LAG in the hns driver, and as we know, there
is already a mature and stable solution in the mlx5 driver. Considering
that the net subsystem in the kernel has the framework itself implement
bonding, we think it is reasonable to add the LAG feature to the ib core
based on mlx5's implementation, so that all drivers, including hns and
mlx5, can benefit from it.

In previous discussions with Leon about reporting ib port link events in
the ib core, Leon mentioned that there might already be someone trying to
do this.

So may I ask whether anyone is working on LAG in the ib core or planning
to implement it in the near future? I would appreciate it if you could
share your progress with me, and maybe we can finish it together.

If nobody is working on this, our team may try to implement LAG in the
core. Any comments and suggestions are welcome.

Thanks,
Weihang



* Re: Is anyone working on implementing LAG in ib core?
  2020-02-22  3:48 Is anyone working on implementing LAG in ib core? Weihang Li
@ 2020-02-22 23:40 ` Jason Gunthorpe
  2020-02-23  0:44   ` Parav Pandit
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2020-02-22 23:40 UTC (permalink / raw)
  To: Weihang Li; +Cc: dledford, leon, linux-rdma, linuxarm, Shiraz Saleem

On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
> Hi all,
> 
> We plan to implement LAG in hns drivers recently, and as we know, there is
> already a mature and stable solution in the mlx5 driver. Considering that
> the net subsystem in kernel adopt the strategy that the framework implement
> bonding, we think it's reasonable to add LAG feature to the ib core based
> on mlx5's implementation. So that all drivers including hns and mlx5 can
> benefit from it.
> 
> In previous discussions with Leon about achieving reporting of ib port link
> event in ib core, Leon mentioned that there might be someone trying to do
> this.
> 
> So may I ask if there is anyone working on LAG in ib core or planning to
> implement it in near future? I will appreciate it if you can share your
> progress with me and maybe we can finish it together.
> 
> If nobody is working on this, our team may take a try to implement LAG in
> the core. Any comments and suggestion are welcome.

This is something that needs to be done. I understand several of the
other drivers are going to want to use LAG, and we certainly can't have
everything copied into each driver.

Jason


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-22 23:40 ` Jason Gunthorpe
@ 2020-02-23  0:44   ` Parav Pandit
  2020-02-23  9:49     ` Leon Romanovsky
  2020-03-02  7:22     ` liweihang
  0 siblings, 2 replies; 13+ messages in thread
From: Parav Pandit @ 2020-02-23  0:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Weihang Li
  Cc: dledford, leon, linux-rdma, linuxarm, Shiraz Saleem

Hi Jason, Weihang,

On 2/22/2020 5:40 PM, Jason Gunthorpe wrote:
> On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
>> Hi all,
>>
>> We plan to implement LAG in hns drivers recently, and as we know, there is
>> already a mature and stable solution in the mlx5 driver. Considering that
>> the net subsystem in kernel adopt the strategy that the framework implement
>> bonding, we think it's reasonable to add LAG feature to the ib core based
>> on mlx5's implementation. So that all drivers including hns and mlx5 can
>> benefit from it.
>>
>> In previous discussions with Leon about achieving reporting of ib port link
>> event in ib core, Leon mentioned that there might be someone trying to do
>> this.
>>
>> So may I ask if there is anyone working on LAG in ib core or planning to
>> implement it in near future? I will appreciate it if you can share your
>> progress with me and maybe we can finish it together.
>>
>> If nobody is working on this, our team may take a try to implement LAG in
>> the core. Any comments and suggestion are welcome.
> 
> This is something that needs to be done, I understand several of the
> other drivers are going to want to use LAG and we certainly can't have
> everything copied into each driver.
> 
> Jason
> 
I am not sure mlx5 is the right model for new rdma bond device support,
which I have tried to highlight in Q&A 1 below.

I have the following not-so-refined proposal for an rdma bond device.

- Create an rdma bond device named rbond0 from two slave rdma devices,
mlx5_0 and mlx5_1, connected to the netdevice bond1 and using the
underlying dma device of the mlx5_0 rdma device.

$ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave
mlx5_1 dmadevice mlx5_0

$ rdma dev show
0: mlx5_0: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4660
sys_image_guid 248a:0703:0055:4660
1: mlx5_1: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4661
sys_image_guid 248a:0703:0055:4660
2: rbond0: node_type ca node_guid 248a:0703:0055:4660 sys_image_guid
248a:0703:0055:4660

- This should be done via an rdma bond driver in
drivers/infiniband/ulp/rdma_bond

A few obvious questions arise from the above proposal:
1. But why can't I just do the trick of removing the two or more rdma
device(s) and creating one device when the bond0 netdevice is created?
Ans:
(a) Because it leads to complex driver code in the vendor driver to
handle netdev events under the rtnl lock.
Given that GID table setup needs to hold the rtnl lock for a short
duration in ib_register_device(), the work needs to be deferred to a
work queue and synchronized (see the sketch at the end of this mail).
(b) The user cannot predict when this new rdma bond device will be
created automatically, or after how long.
(c) What if some failure occurs? Should I parse /var/log/messages to
figure out the error? What are the steps to roll back and retry?
(d) What if the driver internally attempts a retry?
And some more...

2. Why do we need to give the netdevice in the above proposal?
Ans:
Because for a RoCE device you want to build the right GID table from its
matching netdevice. No guesswork.

3. But with that there will be multiple devices (rbond0, mlx5_0) with
the same GID table entries, and that will confuse the user.
What do we do about it?
Ans:
No, that won't happen, because this bond driver accepts slave rdma
devices. The bond driver will request that the IB core disable the GID
tables of the slave rdma devices.
Or we can have commands to disable/enable specific GID types of the
slave rdma devices, which the user can invoke before creating the rdma
bond device.

Such as:
$ rdma link mlx5_0 disable rocev1
$ rdma link mlx5_0 enable rocev2

This way it is easy to compose, and it addresses the wider use case
where RoCEv1 GID table entries can be disabled to make more efficient
use of the GID table.

4. But if we are going to disable the GID table, why do we even need
those RDMA devices?
Ans:
Because when you delete the bond rdma device, you can revert back to
those mlx5_0/1 devices.
Delete follows the mirror of add.

5. Why do we need to give the DMA device?
Ans:
Because we want to avoid guesswork in the rdma_bond driver about which
DMA device to use.
The user knows how they want to use it based on the system
configuration (irqs etc.).

6. What happens if the slave pci devices are hot-removed?
Ans:
If the slave dma device is removed, it disassociates the ucontext and
rbond0 becomes unusable for applications.

7. How is failover done?
Ans: Since the failover netdevice is provided, its failover settings are
inherited by the rbond0 rdma device and passed on to its slave devices.
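
To make point 1(a) concrete, below is roughly the kind of deferral a
vendor driver has to do today when reacting to bond creation from a
netdev notifier. It is only a sketch with made-up names, not code from
any existing driver.

/* Sketch only: the netdev notifier runs under the rtnl lock, while
 * ib_register_device() needs rtnl itself for GID table setup, so the
 * actual (un)registration must be pushed to a work queue and then
 * synchronized against further netdev events.
 */
#include <linux/netdevice.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct bond_rdma_work {
        struct work_struct work;
        struct net_device *bond_ndev;
};

static void bond_rdma_work_fn(struct work_struct *work)
{
        struct bond_rdma_work *bw =
                container_of(work, struct bond_rdma_work, work);

        /* Outside rtnl: safe to unregister the slave ib_devices and
         * register the bond ib_device here, but the user cannot tell
         * when (or whether) this completed - see points (b)-(d).
         */
        dev_put(bw->bond_ndev);
        kfree(bw);
}

static int bond_rdma_netdev_event(struct notifier_block *nb,
                                  unsigned long event, void *ptr)
{
        struct netdev_notifier_changeupper_info *info = ptr;
        struct bond_rdma_work *bw;

        if (event != NETDEV_CHANGEUPPER ||
            !netif_is_bond_master(info->upper_dev))
                return NOTIFY_DONE;

        bw = kzalloc(sizeof(*bw), GFP_KERNEL);
        if (!bw)
                return NOTIFY_DONE;

        dev_hold(info->upper_dev);
        bw->bond_ndev = info->upper_dev;
        INIT_WORK(&bw->work, bond_rdma_work_fn);
        schedule_work(&bw->work);
        return NOTIFY_OK;
}

static struct notifier_block bond_rdma_nb = {
        .notifier_call = bond_rdma_netdev_event,
};
/* registered at driver init with register_netdevice_notifier(&bond_rdma_nb) */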


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-23  0:44   ` Parav Pandit
@ 2020-02-23  9:49     ` Leon Romanovsky
  2020-02-24  7:10       ` Parav Pandit
  2020-03-02  7:22     ` liweihang
  1 sibling, 1 reply; 13+ messages in thread
From: Leon Romanovsky @ 2020-02-23  9:49 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Gunthorpe, Weihang Li, dledford, linux-rdma, linuxarm,
	Shiraz Saleem

On Sun, Feb 23, 2020 at 12:44:12AM +0000, Parav Pandit wrote:
> Hi Jason, Weihang,
>
> On 2/22/2020 5:40 PM, Jason Gunthorpe wrote:
> > On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
> >> Hi all,
> >>
> >> We plan to implement LAG in hns drivers recently, and as we know, there is
> >> already a mature and stable solution in the mlx5 driver. Considering that
> >> the net subsystem in kernel adopt the strategy that the framework implement
> >> bonding, we think it's reasonable to add LAG feature to the ib core based
> >> on mlx5's implementation. So that all drivers including hns and mlx5 can
> >> benefit from it.
> >>
> >> In previous discussions with Leon about achieving reporting of ib port link
> >> event in ib core, Leon mentioned that there might be someone trying to do
> >> this.
> >>
> >> So may I ask if there is anyone working on LAG in ib core or planning to
> >> implement it in near future? I will appreciate it if you can share your
> >> progress with me and maybe we can finish it together.
> >>
> >> If nobody is working on this, our team may take a try to implement LAG in
> >> the core. Any comments and suggestion are welcome.
> >
> > This is something that needs to be done, I understand several of the
> > other drivers are going to want to use LAG and we certainly can't have
> > everything copied into each driver.
> >
> > Jason
> >
> I am not sure mlx5 is right model for new rdma bond device support which
> I tried to highlight in Q&A-1 below.
>
> I have below not-so-refined proposal for rdma bond device.
>
> - Create a rdma bond device named rbond0 using two slave rdma devices
> mlx5_0 mlx5_1 which is connected to netdevice bond1 and underlying dma
> device of mlx5_0 rdma device.
>
> $ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave
> mlx5_1 dmadevice mlx5_0
>
> $ rdma dev show
> 0: mlx5_0: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4660
> sys_image_guid 248a:0703:0055:4660
> 1: mlx5_1: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4661
> sys_image_guid 248a:0703:0055:4660
> 2: rbond0: node_type ca node_guid 248a:0703:0055:4660 sys_image_guid
> 248a:0703:0055:4660
>
> - This should be done via rdma bond driver in
> drivers/infiniband/ulp/rdma_bond

An extra question: why do we need an RDMA bond ULP which combines ib
devices, rather than creating a netdev bond device and creating one ib
device on top of that?

Thanks


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-23  9:49     ` Leon Romanovsky
@ 2020-02-24  7:10       ` Parav Pandit
  2020-02-24 10:52         ` Leon Romanovsky
  0 siblings, 1 reply; 13+ messages in thread
From: Parav Pandit @ 2020-02-24  7:10 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Weihang Li, dledford, linux-rdma, linuxarm,
	Shiraz Saleem

Hi Leon,

On 2/23/2020 3:49 AM, Leon Romanovsky wrote:
> On Sun, Feb 23, 2020 at 12:44:12AM +0000, Parav Pandit wrote:
>> Hi Jason, Weihang,
>>
>> On 2/22/2020 5:40 PM, Jason Gunthorpe wrote:
>>> On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
>>>> Hi all,
>>>>
>>>> We plan to implement LAG in hns drivers recently, and as we know, there is
>>>> already a mature and stable solution in the mlx5 driver. Considering that
>>>> the net subsystem in kernel adopt the strategy that the framework implement
>>>> bonding, we think it's reasonable to add LAG feature to the ib core based
>>>> on mlx5's implementation. So that all drivers including hns and mlx5 can
>>>> benefit from it.
>>>>
>>>> In previous discussions with Leon about achieving reporting of ib port link
>>>> event in ib core, Leon mentioned that there might be someone trying to do
>>>> this.
>>>>
>>>> So may I ask if there is anyone working on LAG in ib core or planning to
>>>> implement it in near future? I will appreciate it if you can share your
>>>> progress with me and maybe we can finish it together.
>>>>
>>>> If nobody is working on this, our team may take a try to implement LAG in
>>>> the core. Any comments and suggestion are welcome.
>>>
>>> This is something that needs to be done, I understand several of the
>>> other drivers are going to want to use LAG and we certainly can't have
>>> everything copied into each driver.
>>>
>>> Jason
>>>
>> I am not sure mlx5 is right model for new rdma bond device support which
>> I tried to highlight in Q&A-1 below.
>>
>> I have below not-so-refined proposal for rdma bond device.
>>
>> - Create a rdma bond device named rbond0 using two slave rdma devices
>> mlx5_0 mlx5_1 which is connected to netdevice bond1 and underlying dma
>> device of mlx5_0 rdma device.
>>
>> $ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave
>> mlx5_1 dmadevice mlx5_0
>>
>> $ rdma dev show
>> 0: mlx5_0: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4660
>> sys_image_guid 248a:0703:0055:4660
>> 1: mlx5_1: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4661
>> sys_image_guid 248a:0703:0055:4660
>> 2: rbond0: node_type ca node_guid 248a:0703:0055:4660 sys_image_guid
>> 248a:0703:0055:4660
>>
>> - This should be done via rdma bond driver in
>> drivers/infiniband/ulp/rdma_bond
> 
> Extra question, why do we need RDMA bond ULP which combines ib devices
> and not create netdev bond device and create one ib device on top of that?
> 

I read your question a few times, but I likely don't understand it.

Are you asking why bonding should be implemented as a dedicated
ulp/driver, and not as an extension in the vendor driver?

In an alternative approach, a given hw driver implements a newlink()
itself instead of there being a dedicated bond driver. However, this
would duplicate the code across a few drivers. Hence the idea is to do
it in a common driver and have each hw driver implement only the
necessary hooks, roughly as sketched below.
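
All the names below are invented just to illustrate the split between
the common driver and the per-hw hooks; nothing like this exists yet.

/* Invented for illustration: the minimal per-hw-driver hooks a common
 * rdma bond driver could call, instead of every vendor duplicating the
 * whole newlink()/teardown flow.
 */
#include <rdma/ib_verbs.h>

struct rdma_lag_hw_ops {
        /* program the hw so that both ports act as one logical port */
        int  (*activate)(struct ib_device *slave0, struct ib_device *slave1);
        void (*deactivate)(struct ib_device *slave0, struct ib_device *slave1);
        /* steer traffic of a QP/flow to one physical port on failover */
        int  (*set_tx_affinity)(struct ib_device *slave, u32 port);
};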

> Thanks
> 



* Re: Is anyone working on implementing LAG in ib core?
  2020-02-24  7:10       ` Parav Pandit
@ 2020-02-24 10:52         ` Leon Romanovsky
  2020-02-24 18:12           ` Parav Pandit
  2020-02-24 18:29           ` Jason Gunthorpe
  0 siblings, 2 replies; 13+ messages in thread
From: Leon Romanovsky @ 2020-02-24 10:52 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Gunthorpe, Weihang Li, dledford, linux-rdma, linuxarm,
	Shiraz Saleem

On Mon, Feb 24, 2020 at 07:10:13AM +0000, Parav Pandit wrote:
> Hi Leon,
>
> On 2/23/2020 3:49 AM, Leon Romanovsky wrote:
> > On Sun, Feb 23, 2020 at 12:44:12AM +0000, Parav Pandit wrote:
> >> Hi Jason, Weihang,
> >>
> >> On 2/22/2020 5:40 PM, Jason Gunthorpe wrote:
> >>> On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
> >>>> Hi all,
> >>>>
> >>>> We plan to implement LAG in hns drivers recently, and as we know, there is
> >>>> already a mature and stable solution in the mlx5 driver. Considering that
> >>>> the net subsystem in kernel adopt the strategy that the framework implement
> >>>> bonding, we think it's reasonable to add LAG feature to the ib core based
> >>>> on mlx5's implementation. So that all drivers including hns and mlx5 can
> >>>> benefit from it.
> >>>>
> >>>> In previous discussions with Leon about achieving reporting of ib port link
> >>>> event in ib core, Leon mentioned that there might be someone trying to do
> >>>> this.
> >>>>
> >>>> So may I ask if there is anyone working on LAG in ib core or planning to
> >>>> implement it in near future? I will appreciate it if you can share your
> >>>> progress with me and maybe we can finish it together.
> >>>>
> >>>> If nobody is working on this, our team may take a try to implement LAG in
> >>>> the core. Any comments and suggestion are welcome.
> >>>
> >>> This is something that needs to be done, I understand several of the
> >>> other drivers are going to want to use LAG and we certainly can't have
> >>> everything copied into each driver.
> >>>
> >>> Jason
> >>>
> >> I am not sure mlx5 is right model for new rdma bond device support which
> >> I tried to highlight in Q&A-1 below.
> >>
> >> I have below not-so-refined proposal for rdma bond device.
> >>
> >> - Create a rdma bond device named rbond0 using two slave rdma devices
> >> mlx5_0 mlx5_1 which is connected to netdevice bond1 and underlying dma
> >> device of mlx5_0 rdma device.
> >>
> >> $ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave
> >> mlx5_1 dmadevice mlx5_0
> >>
> >> $ rdma dev show
> >> 0: mlx5_0: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4660
> >> sys_image_guid 248a:0703:0055:4660
> >> 1: mlx5_1: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4661
> >> sys_image_guid 248a:0703:0055:4660
> >> 2: rbond0: node_type ca node_guid 248a:0703:0055:4660 sys_image_guid
> >> 248a:0703:0055:4660
> >>
> >> - This should be done via rdma bond driver in
> >> drivers/infiniband/ulp/rdma_bond
> >
> > Extra question, why do we need RDMA bond ULP which combines ib devices
> > and not create netdev bond device and create one ib device on top of that?
> >
>
> I read your question few times, but I likely don't understand your question.
>
> Are you asking why bonding should be implemented as dedicated
> ulp/driver, and not as an extension by the vendor driver?

No, I meant something different. You are proposing to combine IB
devices, while keeping the netdev devices separate. I'm asking whether
it is possible to combine the netdev devices with the already existing
bond driver and simply create a new ib device with the bond netdev as
the underlying provider.

I'm not suggesting implementing anything in vendor drivers.

>
> In an alternative approach a given hw driver implements a newlink()
> instead of dedicated bond driver.
> However this will duplicate the code in few drivers. Hence, to do in
> common driver and implement only necessary hooks in hw driver.

I'm not sure about it.

>
> > Thanks
> >
>


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-24 10:52         ` Leon Romanovsky
@ 2020-02-24 18:12           ` Parav Pandit
  2020-02-24 18:29           ` Jason Gunthorpe
  1 sibling, 0 replies; 13+ messages in thread
From: Parav Pandit @ 2020-02-24 18:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Weihang Li, dledford, linux-rdma, linuxarm,
	Shiraz Saleem

On 2/24/2020 4:52 AM, Leon Romanovsky wrote:
> On Mon, Feb 24, 2020 at 07:10:13AM +0000, Parav Pandit wrote:
>> Hi Leon,
>>
>> On 2/23/2020 3:49 AM, Leon Romanovsky wrote:
>>> On Sun, Feb 23, 2020 at 12:44:12AM +0000, Parav Pandit wrote:
>>>> Hi Jason, Weihang,
>>>>
>>>> On 2/22/2020 5:40 PM, Jason Gunthorpe wrote:
>>>>> On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> We plan to implement LAG in hns drivers recently, and as we know, there is
>>>>>> already a mature and stable solution in the mlx5 driver. Considering that
>>>>>> the net subsystem in kernel adopt the strategy that the framework implement
>>>>>> bonding, we think it's reasonable to add LAG feature to the ib core based
>>>>>> on mlx5's implementation. So that all drivers including hns and mlx5 can
>>>>>> benefit from it.
>>>>>>
>>>>>> In previous discussions with Leon about achieving reporting of ib port link
>>>>>> event in ib core, Leon mentioned that there might be someone trying to do
>>>>>> this.
>>>>>>
>>>>>> So may I ask if there is anyone working on LAG in ib core or planning to
>>>>>> implement it in near future? I will appreciate it if you can share your
>>>>>> progress with me and maybe we can finish it together.
>>>>>>
>>>>>> If nobody is working on this, our team may take a try to implement LAG in
>>>>>> the core. Any comments and suggestion are welcome.
>>>>>
>>>>> This is something that needs to be done, I understand several of the
>>>>> other drivers are going to want to use LAG and we certainly can't have
>>>>> everything copied into each driver.
>>>>>
>>>>> Jason
>>>>>
>>>> I am not sure mlx5 is right model for new rdma bond device support which
>>>> I tried to highlight in Q&A-1 below.
>>>>
>>>> I have below not-so-refined proposal for rdma bond device.
>>>>
>>>> - Create a rdma bond device named rbond0 using two slave rdma devices
>>>> mlx5_0 mlx5_1 which is connected to netdevice bond1 and underlying dma
>>>> device of mlx5_0 rdma device.
>>>>
>>>> $ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave
>>>> mlx5_1 dmadevice mlx5_0
>>>>
>>>> $ rdma dev show
>>>> 0: mlx5_0: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4660
>>>> sys_image_guid 248a:0703:0055:4660
>>>> 1: mlx5_1: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4661
>>>> sys_image_guid 248a:0703:0055:4660
>>>> 2: rbond0: node_type ca node_guid 248a:0703:0055:4660 sys_image_guid
>>>> 248a:0703:0055:4660
>>>>
>>>> - This should be done via rdma bond driver in
>>>> drivers/infiniband/ulp/rdma_bond
>>>
>>> Extra question, why do we need RDMA bond ULP which combines ib devices
>>> and not create netdev bond device and create one ib device on top of that?
>>>
>>
>> I read your question few times, but I likely don't understand your question.
>>
>> Are you asking why bonding should be implemented as dedicated
>> ulp/driver, and not as an extension by the vendor driver?
> 
> No, I meant something different. You are proposing to combine IB
> devices, while keeping netdev devices separated. I'm asking if it is
> possible to combine netdev devices with already existing bond driver
> and simply create new ib device with bond netdev as an underlying
> provider.
> 
Ah, I understand now.
Yes, if bond_newlink() can be extended to accept rdma-device-specific
parameters, it should be possible.

For example, we already have two classes of users: one wants an rdma
bond device created for a given bond netdev, and one does not.

You might already be aware that newlink() is called under the rtnl lock
and that ib_register_device() also needs to acquire the rtnl lock. But
this is an implementation detail that can be discussed later, once a
bond driver extension for rdma makes sense. :-)

If one wants to create a vlan netdevice on top of a bond device, the
user usually creates it after the bond netdevice is created.
Similarly, I see the bond rdma device as composed of the underlying bond
netdev and an underlying rdma device, or at least its parent
'struct device', for the purpose of irq routing, pci access, etc.

Compared to overloading bond_newlink() with rdma parameters, independent
rdma bond device creation linked to the bond netdev appears to be a more
elegant and modular approach. A rough illustration of the inputs such a
creation has to carry follows below.
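
All the names in this snippet are invented for this mail, just to show
what the creation request has to carry in either model; nothing here is
implemented.

/* Invented for illustration: the inputs an rdma bond device creation
 * needs, whether bond_newlink() is overloaded with rdma attributes or a
 * separate rdma netlink command is used.
 */
#include <linux/device.h>
#include <linux/netdevice.h>
#include <rdma/ib_verbs.h>

struct rdma_bond_params {
        const char *name;               /* new device name, e.g. "rbond0"    */
        struct net_device *bond_ndev;   /* upper bond netdev, e.g. bond1     */
        struct ib_device *slaves[2];    /* slave rdma devices, mlx5_0/mlx5_1 */
        struct device *dma_dev;         /* parent of the chosen slave, used  */
                                        /* for dma mapping, irqs, pci access */
};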

> I'm not suggesting to implement anything in vendor drivers.
> 
>>
>> In an alternative approach a given hw driver implements a newlink()
>> instead of dedicated bond driver.
>> However this will duplicate the code in few drivers. Hence, to do in
>> common driver and implement only necessary hooks in hw driver.
> 
> I'm not sure about it.
> 
>>
>>> Thanks
>>>
>>



* Re: Is anyone working on implementing LAG in ib core?
  2020-02-24 10:52         ` Leon Romanovsky
  2020-02-24 18:12           ` Parav Pandit
@ 2020-02-24 18:29           ` Jason Gunthorpe
  2020-02-24 19:01             ` Parav Pandit
  2020-02-25  7:58             ` Leon Romanovsky
  1 sibling, 2 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2020-02-24 18:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Parav Pandit, Weihang Li, dledford, linux-rdma, linuxarm, Shiraz Saleem

On Mon, Feb 24, 2020 at 12:52:06PM +0200, Leon Romanovsky wrote:
> > Are you asking why bonding should be implemented as dedicated
> > ulp/driver, and not as an extension by the vendor driver?
> 
> No, I meant something different. You are proposing to combine IB
> devices, while keeping netdev devices separated. I'm asking if it is
> possible to combine netdev devices with already existing bond driver
> and simply create new ib device with bond netdev as an underlying
> provider.

Isn't that basically what we do now in mlx5?

Logically the ib_device is attached to the bond; it uses the bond for
IP addressing, etc.

I'm not sure trying to have 3 ib_devices like netdev does is sane; that
is very, very complicated to get everything to work. The ib stuff just
isn't designed to be stacked like that.

Jason


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-24 18:29           ` Jason Gunthorpe
@ 2020-02-24 19:01             ` Parav Pandit
  2020-02-24 19:23               ` Jason Gunthorpe
  2020-02-25  7:58             ` Leon Romanovsky
  1 sibling, 1 reply; 13+ messages in thread
From: Parav Pandit @ 2020-02-24 19:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: Weihang Li, dledford, linux-rdma, linuxarm, Shiraz Saleem

On 2/24/2020 12:29 PM, Jason Gunthorpe wrote:
> On Mon, Feb 24, 2020 at 12:52:06PM +0200, Leon Romanovsky wrote:
>>> Are you asking why bonding should be implemented as dedicated
>>> ulp/driver, and not as an extension by the vendor driver?
>>
>> No, I meant something different. You are proposing to combine IB
>> devices, while keeping netdev devices separated. I'm asking if it is
>> possible to combine netdev devices with already existing bond driver
>> and simply create new ib device with bond netdev as an underlying
>> provider.
> 
> Isn't that basically what we do now in mlx5?
> 
And it's broken in a few aspects that I described in Q&A question 1
earlier in this thread.

On top of that, the user has no ability to disable rdma bonding.
Users have explicitly asked us to be able to disable it in some cases
(not on the mailing list), so non-upstream hacks exist that are not
applicable to this discussion.

> Logically the ib_device is attached to the bond, it uses the bond for
> IP addressing, etc.
> 
> I'm not sure trying to have 3 ib_devices like netdev does is sane,
> that is very, very complicated to get everything to work. The ib stuff
> just isn't designed to be stacked like that.
> 
I am not sure I understand all the complications you have thought through.
I thought of a few and put them forward in the Q&A in this thread, and
we can improve the design as we go forward.

The idea is to stack the rdma device on top of the existing rdma devices
as an ib_client, so that the rdma bond device is exactly aware of what
is going on with the slaves and the bond netdev.

On top of that, I have been enjoying the list-delete corruption call
trace below for a while now, coming from this lag code in a silly devel
test of loading/unloading the driver 30+ times (not even running
traffic). :-)

RIP: 0010:__list_del_entry_valid+0x7c/0xa0
unregister_netdevice_notifier_dev_net+0x1f/0x70
mlx5_lag_remove+0x61/0xf0 [mlx5_core]
mlx5e_detach_netdev+0x24/0x50 [mlx5_core]
mlx5e_detach+0x36/0x40 [mlx5_core]
mlx5e_remove+0x48/0x60 [mlx5_core]
mlx5_remove_device+0xb0/0xc0 [mlx5_core]
mlx5_unregister_interface+0x39/0x90 [mlx5_core]
cleanup+0x5/0xdd1 [mlx5_core]

And there is some LAG code in a hw driver that reschedules the work if
it cannot acquire some mutex lock...


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-24 19:01             ` Parav Pandit
@ 2020-02-24 19:23               ` Jason Gunthorpe
  2020-02-24 19:41                 ` Parav Pandit
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2020-02-24 19:23 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Weihang Li, dledford, linux-rdma, linuxarm,
	Shiraz Saleem

On Mon, Feb 24, 2020 at 07:01:43PM +0000, Parav Pandit wrote:
> On 2/24/2020 12:29 PM, Jason Gunthorpe wrote:
> > On Mon, Feb 24, 2020 at 12:52:06PM +0200, Leon Romanovsky wrote:
> >>> Are you asking why bonding should be implemented as dedicated
> >>> ulp/driver, and not as an extension by the vendor driver?
> >>
> >> No, I meant something different. You are proposing to combine IB
> >> devices, while keeping netdev devices separated. I'm asking if it is
> >> possible to combine netdev devices with already existing bond driver
> >> and simply create new ib device with bond netdev as an underlying
> >> provider.
> > 
> > Isn't that basically what we do now in mlx5?
> > 
> And its broken for few aspects that I described in Q&A question-1 in
> this thread previously.
> 
> On top of that user has no ability to disable rdma bonding.

And what does that mean? The real netdevs have no IP addresses, so what
exactly does a non-bonded RoCEv2 RDMA device do?

> User exactly asked us that they want to disable in some cases.
> (not on mailing list). So there are non-upstream hacks exists that is
> not applicable for this discussion.

Bah, I'm aware of that - that request is a hack solution to something
else as well.

> > Logically the ib_device is attached to the bond, it uses the bond for
> > IP addressing, etc.
> > 
> > I'm not sure trying to have 3 ib_devices like netdev does is sane,
> > that is very, very complicated to get everything to work. The ib stuff
> > just isn't designed to be stacked like that.
> > 
> I am not sure I understand all the complications you have thought through.
> I thought of few and put forward in the Q&A in the thread and we can
> improve the design as we go forward.
> 
> Stacking rdma device on top of existing rdma device as an ib_client so
> that rdma bond device exactly aware of what is going on with slaves and
> bond netdev.

How do you safely proxy every single op from the bond to slaves?

How do you force the slaves to allow PDs to be shared across them?

What provider lives in user space for the bond driver? What happens to
the udata/etc?

And it doesn't solve the main problem you raised: creating an IB device
while holding RTNL simply should not ever be done. Moving this code
into the core layer fixed things up significantly for the similar
rxe/siw cases, and I expect the same is possible for the LAG situation too.

Jason


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-24 19:23               ` Jason Gunthorpe
@ 2020-02-24 19:41                 ` Parav Pandit
  0 siblings, 0 replies; 13+ messages in thread
From: Parav Pandit @ 2020-02-24 19:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Weihang Li, dledford, linux-rdma, linuxarm,
	Shiraz Saleem

On 2/24/2020 1:23 PM, Jason Gunthorpe wrote:
> On Mon, Feb 24, 2020 at 07:01:43PM +0000, Parav Pandit wrote:
>> On 2/24/2020 12:29 PM, Jason Gunthorpe wrote:
>>> On Mon, Feb 24, 2020 at 12:52:06PM +0200, Leon Romanovsky wrote:
>>>>> Are you asking why bonding should be implemented as dedicated
>>>>> ulp/driver, and not as an extension by the vendor driver?
>>>>
>>>> No, I meant something different. You are proposing to combine IB
>>>> devices, while keeping netdev devices separated. I'm asking if it is
>>>> possible to combine netdev devices with already existing bond driver
>>>> and simply create new ib device with bond netdev as an underlying
>>>> provider.
>>>
>>> Isn't that basically what we do now in mlx5?
>>>
>> And its broken for few aspects that I described in Q&A question-1 in
>> this thread previously.
>>
>> On top of that user has no ability to disable rdma bonding.
> 
> And what does that mean? The real netdevs have no IP addreses so what
> exactly does a non-bonded RoCEv2 RDMA device do?
> 
Not sure. There is some default GID on it.

>> User exactly asked us that they want to disable in some cases.
>> (not on mailing list). So there are non-upstream hacks exists that is
>> not applicable for this discussion.
> 
> Bah, I'm aware of that - that request is hack solution to something
> else as well.
> 
>>> Logically the ib_device is attached to the bond, it uses the bond for
>>> IP addressing, etc.
>>>
>>> I'm not sure trying to have 3 ib_devices like netdev does is sane,
>>> that is very, very complicated to get everything to work. The ib stuff
>>> just isn't designed to be stacked like that.
>>>
>> I am not sure I understand all the complications you have thought through.
>> I thought of few and put forward in the Q&A in the thread and we can
>> improve the design as we go forward.
>>
>> Stacking rdma device on top of existing rdma device as an ib_client so
>> that rdma bond device exactly aware of what is going on with slaves and
>> bond netdev.
> 
> How do you safely proxy every single op from the bond to slaves?
The bond config should tell which slave to use, instead of the current
implicit choice.

> 
> How do you force the slaves to allow PDs to be shared across them?
> 
For the slave it does not matter whether the caller is the bond or a
direct user.

> What provider lives in user space for the bond driver? What happens to
> the udata/etc?
The same as that of the primary slave used for pci and irq access, whose
info is provided through a new netlink discovery path.

> 
> And it doesn't solve the main problem you raised, creating a IB device
> while holding RTNL simply should not ever be done. Moving this code
> into the core layer fixed it up significantly for the similar rxe/siw
> cases, I expect the same is possible for the LAG situation too.
> 
The 'rdma dev add' command won't hold the rtnl lock when the new rdma
bond device is added through the rdma netlink socket.


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-24 18:29           ` Jason Gunthorpe
  2020-02-24 19:01             ` Parav Pandit
@ 2020-02-25  7:58             ` Leon Romanovsky
  1 sibling, 0 replies; 13+ messages in thread
From: Leon Romanovsky @ 2020-02-25  7:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, Weihang Li, dledford, linux-rdma, linuxarm, Shiraz Saleem

On Mon, Feb 24, 2020 at 02:29:02PM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 24, 2020 at 12:52:06PM +0200, Leon Romanovsky wrote:
> > > Are you asking why bonding should be implemented as dedicated
> > > ulp/driver, and not as an extension by the vendor driver?
> >
> > No, I meant something different. You are proposing to combine IB
> > devices, while keeping netdev devices separated. I'm asking if it is
> > possible to combine netdev devices with already existing bond driver
> > and simply create new ib device with bond netdev as an underlying
> > provider.
>
> Isn't that basically what we do now in mlx5?

Yes, logically the same; the implementation of course should be
different, involving IB/core more than it does now.

>
> Logically the ib_device is attached to the bond, it uses the bond for
> IP addressing, etc.

Yes.

Thanks


* Re: Is anyone working on implementing LAG in ib core?
  2020-02-23  0:44   ` Parav Pandit
  2020-02-23  9:49     ` Leon Romanovsky
@ 2020-03-02  7:22     ` liweihang
  1 sibling, 0 replies; 13+ messages in thread
From: liweihang @ 2020-03-02  7:22 UTC (permalink / raw)
  To: Parav Pandit, Jason Gunthorpe
  Cc: dledford, leon, linux-rdma, Linuxarm, Shiraz Saleem

On 2020/2/23 8:44, Parav Pandit wrote:
> Hi Jason, Weihang,
> 
> On 2/22/2020 5:40 PM, Jason Gunthorpe wrote:
>> On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
>>> Hi all,
>>>
>>> We plan to implement LAG in hns drivers recently, and as we know, there is
>>> already a mature and stable solution in the mlx5 driver. Considering that
>>> the net subsystem in kernel adopt the strategy that the framework implement
>>> bonding, we think it's reasonable to add LAG feature to the ib core based
>>> on mlx5's implementation. So that all drivers including hns and mlx5 can
>>> benefit from it.
>>>
>>> In previous discussions with Leon about achieving reporting of ib port link
>>> event in ib core, Leon mentioned that there might be someone trying to do
>>> this.
>>>
>>> So may I ask if there is anyone working on LAG in ib core or planning to
>>> implement it in near future? I will appreciate it if you can share your
>>> progress with me and maybe we can finish it together.
>>>
>>> If nobody is working on this, our team may take a try to implement LAG in
>>> the core. Any comments and suggestion are welcome.
>>
>> This is something that needs to be done, I understand several of the
>> other drivers are going to want to use LAG and we certainly can't have
>> everything copied into each driver.
>>
>> Jason
>>
> I am not sure mlx5 is right model for new rdma bond device support which
> I tried to highlight in Q&A-1 below.
> 
> I have below not-so-refined proposal for rdma bond device.
> 
> - Create a rdma bond device named rbond0 using two slave rdma devices
> mlx5_0 mlx5_1 which is connected to netdevice bond1 and underlying dma
> device of mlx5_0 rdma device.
> 
> $ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave
> mlx5_1 dmadevice mlx5_0
> 
> $ rdma dev show
> 0: mlx5_0: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4660
> sys_image_guid 248a:0703:0055:4660
> 1: mlx5_1: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4661
> sys_image_guid 248a:0703:0055:4660
> 2: rbond0: node_type ca node_guid 248a:0703:0055:4660 sys_image_guid
> 248a:0703:0055:4660
> 
> - This should be done via rdma bond driver in
> drivers/infiniband/ulp/rdma_bond
> 
> Few obvious questions arise from above proposal:
> 1. But why can't I do the trick to remove two or more rdma device(s) and
> create one device when bond0 netdevice is created?
> Ans:
> (a) Because it leads to complex driver code in vendor driver to handle
> netdev events under rtnl lock.
> Given GID table needs to hold rtnl lock for short duration in
> ib_register_device(), things need to differ to work queue and perform
> synchronization.
> (b) User cannot predict when this new rdma bond device will be created
> automatically, after how long?
> (c) What if some failure occurred, should I parse /var/log/messages to
> figure out the error? What steps to roll back and retry?
> (d) What if driver internally attempt retry?..
> and some more..
> 
> 2. Why do we need to give netdevice in above proposal?
> Ans:
> Because for RoCE device you want to build right GID table for its
> matching netdevice. No guess work.
> 
> 3. But with that there will be multiple devices rbond0, mlx5_0 with same
> GID table entries.
> And that will confuse the user.
> What do we do about it?
> Ans:
> No. That won't happen, because this bond driver accept slave rdma devices.
> bond driver will request IB core to disable GID table of slave rdma devices.
> Or we can have commands to disable/enable specific GID types of slave
> rdma devices, which user can invoke before creating rdma bond device.
> 
> Such as
> $ rdma link mlx5_0 disable rocev1
> $ rdma link mlx5_0 enable rocev2
> 
> This way its easy to compose and addressed wider use case where RoCEv1
> GID table entries can be disabled and make efficient use of GID table.
> 
> 4. But if we are going to disable the GID table, why do we even need
> those RDMA devices?
> Ans:
> Because when you delete the bond rdma device, you can revert back to
> those mlx5_0/1 devices.
> Follwo mirror of add in delete.
> 
> 5. Why do we need to give DMA device?
> Ans:
> Because we want to avoid doing guess work in rdma_bond driver on which
> DMA device to use.
> User knows the configuration of how he wants to use it based on the
> system. (irqs etc).
> 
> 6. What happens if slave pci devices are hot removed?
> Ans:
> If slave dma device is removed, it disassociates the ucontext and rbond0
> becomes unusable for applications.
> 
> 7. How is the failover done?
> Ans: Since failover netdevice is provided, its failover settings are
> inherited by rbond0 rdma device and passed on to its slave device.
> 

Hi Parav,

Thanks for your discussion with Jason and Leon; there are a lot of
things that I had never considered. And sorry for the late response, it
took me some time to understand the current LAG driver of mlx5. It seems
an mlx5_lag structure combines all the slave net devices and ib devices,
and it has no direct association with the bonding net device.

It seems you have been thinking a lot about how LAG could be achieved
in the core. Do you have plans to implement it? Or have you already
finished some part of it? I'm curious about that because we want to use
LAG in HIPXX as I said, and we can help to test and improve the existing
design if some work is already done.

Thank you
Weihang


end of thread

Thread overview: 13+ messages
2020-02-22  3:48 Is anyone working on implementing LAG in ib core? Weihang Li
2020-02-22 23:40 ` Jason Gunthorpe
2020-02-23  0:44   ` Parav Pandit
2020-02-23  9:49     ` Leon Romanovsky
2020-02-24  7:10       ` Parav Pandit
2020-02-24 10:52         ` Leon Romanovsky
2020-02-24 18:12           ` Parav Pandit
2020-02-24 18:29           ` Jason Gunthorpe
2020-02-24 19:01             ` Parav Pandit
2020-02-24 19:23               ` Jason Gunthorpe
2020-02-24 19:41                 ` Parav Pandit
2020-02-25  7:58             ` Leon Romanovsky
2020-03-02  7:22     ` liweihang
