netdev.vger.kernel.org archive mirror
* [PATCH] net/bonding: send arp in interval if no active slave
@ 2015-08-17 16:23 Jarod Wilson
  2015-08-17 16:55 ` Veaceslav Falico
  0 siblings, 1 reply; 22+ messages in thread
From: Jarod Wilson @ 2015-08-17 16:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Uwe Koziolek, Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	netdev, Jarod Wilson

From: Uwe Koziolek <uwe.koziolek@redknee.com>

With some very finicky switch hardware, active backup bonding can get into
a situation where we play ping-pong between interfaces, trying to get one
to come up as the active slave. There seems to be an issue with the
switch's arp replies either taking too long, or simply getting lost, so we
wind up unable to get any interface up and active. Sometimes, the issue
sorts itself out after a while, sometimes it doesn't.

Testing with num_grat_arp has proven fruitless, but sending an additional
arp on curr_arp_slave if we're still in the arp_interval timeslice in
bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
this hardware combination.

[jarod: manufacturing of changelog]
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
---
 drivers/net/bonding/bond_main.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 0c627b4..60b9483 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding *bond)
 			return should_notify_rtnl;
 	}
 
+	if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+		bond_arp_send_all(bond, curr_arp_slave);
+		return should_notify_rtnl;
+	}
+
 	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
 
 	bond_for_each_slave_rcu(bond, slave, iter) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-08-17 16:23 [PATCH] net/bonding: send arp in interval if no active slave Jarod Wilson
@ 2015-08-17 16:55 ` Veaceslav Falico
  2015-08-17 17:12   ` Jarod Wilson
  0 siblings, 1 reply; 22+ messages in thread
From: Veaceslav Falico @ 2015-08-17 16:55 UTC (permalink / raw)
  To: Jarod Wilson
  Cc: linux-kernel, Uwe Koziolek, Jay Vosburgh, Andy Gospodarek, netdev

On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>From: Uwe Koziolek <uwe.koziolek@redknee.com>
>
>With some very finicky switch hardware, active backup bonding can get into
>a situation where we play ping-pong between interfaces, trying to get one
>to come up as the active slave. There seems to be an issue with the
>switch's arp replies either taking too long, or simply getting lost, so we
>wind up unable to get any interface up and active. Sometimes, the issue
>sorts itself out after a while, sometimes it doesn't.
>
>Testing with num_grat_arp has proven fruitless, but sending an additional
>arp on curr_arp_slave if we're still in the arp_interval timeslice in
>bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
>this hardware combination.

Sorry, I don't understand the logic of why it works, or what exactly we
are fixing here.

It also completely breaks the logic for link state management when there
is no current active slave for 2*arp_interval.

Could you please elaborate what exactly is fixed here, and how it works? :)

P.S. Maybe num_grat_arp could help?

>
>[jarod: manufacturing of changelog]
>CC: Jay Vosburgh <j.vosburgh@gmail.com>
>CC: Veaceslav Falico <vfalico@gmail.com>
>CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>CC: netdev@vger.kernel.org
>Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>Signed-off-by: Jarod Wilson <jarod@redhat.com>
>---
> drivers/net/bonding/bond_main.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 0c627b4..60b9483 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding *bond)
> 			return should_notify_rtnl;
> 	}
>
>+	if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
>+		bond_arp_send_all(bond, curr_arp_slave);
>+		return should_notify_rtnl;
>+	}
>+
> 	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>
> 	bond_for_each_slave_rcu(bond, slave, iter) {
>-- 
>1.8.3.1
>


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-08-17 16:55 ` Veaceslav Falico
@ 2015-08-17 17:12   ` Jarod Wilson
  2015-08-17 18:56     ` Uwe Koziolek
  0 siblings, 1 reply; 22+ messages in thread
From: Jarod Wilson @ 2015-08-17 17:12 UTC (permalink / raw)
  To: Veaceslav Falico
  Cc: linux-kernel, Uwe Koziolek, Jay Vosburgh, Andy Gospodarek, netdev

On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>
>> With some very finicky switch hardware, active backup bonding can get
>> into
>> a situation where we play ping-pong between interfaces, trying to get one
>> to come up as the active slave. There seems to be an issue with the
>> switch's arp replies either taking too long, or simply getting lost,
>> so we
>> wind up unable to get any interface up and active. Sometimes, the issue
>> sorts itself out after a while, sometimes it doesn't.
>>
>> Testing with num_grat_arp has proven fruitless, but sending an additional
>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>> with
>> this hardware combination.
>
> Sorry, I don't understand the logic of why it works, and what exactly are
> we fixiing here.
>
> It also breaks completely the logic for link state management in case of no
> current active slave for 2*arp_interval.
>
> Could you please elaborate what exactly is fixed here, and how it works? :)

I can either duplicate some information from the bug, or Uwe can, to 
illustrate the exact nature of the problem.

> p.s. num_grat_arp maybe could help?

That was my thought as well, but as I understand it, that route was 
explored, and it didn't help any. I don't actually have a reproducer 
setup of my own, unfortunately, so I'm kind of caught in the middle here...

Uwe, can you perhaps further enlighten us as to what num_grat_arp 
settings were tried that didn't help? I'm still of the mind that if 
num_grat_arp *didn't* help, we probably need to do something keyed off 
num_grat_arp.


>> [jarod: manufacturing of changelog]
>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>> CC: Veaceslav Falico <vfalico@gmail.com>
>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>> CC: netdev@vger.kernel.org
>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>> ---
>> drivers/net/bonding/bond_main.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/net/bonding/bond_main.c
>> b/drivers/net/bonding/bond_main.c
>> index 0c627b4..60b9483 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>> *bond)
>>             return should_notify_rtnl;
>>     }
>>
>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
>> +        bond_arp_send_all(bond, curr_arp_slave);
>> +        return should_notify_rtnl;
>> +    }
>> +
>>     bond_set_slave_inactive_flags(curr_arp_slave,
>> BOND_SLAVE_NOTIFY_LATER);
>>
>>     bond_for_each_slave_rcu(bond, slave, iter) {
>> --
>> 1.8.3.1
>>


-- 
Jarod Wilson
jarod@redhat.com


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-08-17 17:12   ` Jarod Wilson
@ 2015-08-17 18:56     ` Uwe Koziolek
  2015-08-17 19:14       ` Jay Vosburgh
  0 siblings, 1 reply; 22+ messages in thread
From: Uwe Koziolek @ 2015-08-17 18:56 UTC (permalink / raw)
  To: Jarod Wilson, Veaceslav Falico
  Cc: linux-kernel, Jay Vosburgh, Andy Gospodarek, netdev

On 2015-08-17 07:12 PM, Jarod Wilson wrote:
> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>
>>> With some very finicky switch hardware, active backup bonding can get
>>> into
>>> a situation where we play ping-pong between interfaces, trying to 
>>> get one
>>> to come up as the active slave. There seems to be an issue with the
>>> switch's arp replies either taking too long, or simply getting lost,
>>> so we
>>> wind up unable to get any interface up and active. Sometimes, the issue
>>> sorts itself out after a while, sometimes it doesn't.
>>>
>>> Testing with num_grat_arp has proven fruitless, but sending an 
>>> additional
>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>> with
>>> this hardware combination.
>>
>> Sorry, I don't understand the logic of why it works, and what exactly 
>> are
>> we fixiing here.
>>
>> It also breaks completely the logic for link state management in case 
>> of no
>> current active slave for 2*arp_interval.
>>
>> Could you please elaborate what exactly is fixed here, and how it 
>> works? :)
>
> I can either duplicate some information from the bug, or Uwe can, to 
> illustrate the exact nature of the problem.
>
>> p.s. num_grat_arp maybe could help?
>
> That was my thought as well, but as I understand it, that route was 
> explored, and it didn't help any. I don't actually have a reproducer 
> setup of my own, unfortunately, so I'm kind of caught in the middle 
> here...
>
> Uwe, can you perhaps further enlighten us as to what num_grat_arp 
> settings were tried that didn't help? I'm still of the mind that if 
> num_grat_arp *didn't* help, we probably need to do something keyed off 
> num_grat_arp.
The bonding slaves are connected to highly available switches; each
slave is connected to a different switch. When the bond starts, only the
selected slave sends a single ARP request. If a matching ARP response is
received, this slave and the bond go into the up state and send the
gratuitous ARPs...
But if no ARP reply is received, the next slave is selected.
With most newer switches that are not overloaded or affected by other
software bugs, or with a single-switch configuration, you get an ARP
response to the first ARP request.
But in a high availability configuration with imperfect switches such as
the HP ProCurve 54xx, and also some Cisco models, you may not get a
response to the first ARP request.

I have seen network snoops where the switches do not respond to the
first ARP request on slave 1; the second ARP request was sent on slave 2,
but the response was received on slave 1, and all following ARP requests
were answered on the wrong slave for a longer time.

The proposed change sends up to 3 ARP requests on a down bond using the
same slave, spaced by arp_interval.
With the problematic switches, I have seen the ARP response arrive on
the right slave by the second ARP request at the latest, so the bond
goes into the up state.

How it works:
Bonds in the up state are handled at the beginning of
bond_ab_arp_probe(); the rest of the function handles the slave change.
The proposed change bypasses the slave change for 2 additional calls of
bond_ab_arp_probe().
Retries are now available not only for an up bond, but also for a down
bond.

num_grat_arp has no chance of solving the problem. It is only used when
a different slave goes active.
But in our case, the bonding slaves do not go active for a longer time.
>
>>> [jarod: manufacturing of changelog]
>>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>>> CC: Veaceslav Falico <vfalico@gmail.com>
>>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>>> CC: netdev@vger.kernel.org
>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>>> ---
>>> drivers/net/bonding/bond_main.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c
>>> b/drivers/net/bonding/bond_main.c
>>> index 0c627b4..60b9483 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>> *bond)
>>>             return should_notify_rtnl;
>>>     }
>>>
>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 
>>> 2)) {
>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>> +        return should_notify_rtnl;
>>> +    }
>>> +
>>>     bond_set_slave_inactive_flags(curr_arp_slave,
>>> BOND_SLAVE_NOTIFY_LATER);
>>>
>>>     bond_for_each_slave_rcu(bond, slave, iter) {
>>> -- 
>>> 1.8.3.1
>>>
>
>


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-08-17 18:56     ` Uwe Koziolek
@ 2015-08-17 19:14       ` Jay Vosburgh
  2015-08-17 20:51         ` Uwe Koziolek
  0 siblings, 1 reply; 22+ messages in thread
From: Jay Vosburgh @ 2015-08-17 19:14 UTC (permalink / raw)
  To: Uwe Koziolek
  Cc: Jarod Wilson, Veaceslav Falico, linux-kernel, Andy Gospodarek, netdev

Uwe Koziolek <uwe.koziolek@redknee.com> wrote:

>On2015-08-17 07:12 PM,Jarod Wilson wrote:
>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>
>>>> With some very finicky switch hardware, active backup bonding can get
>>>> into
>>>> a situation where we play ping-pong between interfaces, trying to get
>>>> one
>>>> to come up as the active slave. There seems to be an issue with the
>>>> switch's arp replies either taking too long, or simply getting lost,
>>>> so we
>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>> sorts itself out after a while, sometimes it doesn't.
>>>>
>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>> additional
>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>> with
>>>> this hardware combination.
>>>
>>> Sorry, I don't understand the logic of why it works, and what exactly
>>> are
>>> we fixiing here.
>>>
>>> It also breaks completely the logic for link state management in case
>>> of no
>>> current active slave for 2*arp_interval.
>>>
>>> Could you please elaborate what exactly is fixed here, and how it
>>> works? :)
>>
>> I can either duplicate some information from the bug, or Uwe can, to
>> illustrate the exact nature of the problem.
>>
>>> p.s. num_grat_arp maybe could help?
>>
>> That was my thought as well, but as I understand it, that route was
>> explored, and it didn't help any. I don't actually have a reproducer
>> setup of my own, unfortunately, so I'm kind of caught in the middle
>> here...
>>
>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>> settings were tried that didn't help? I'm still of the mind that if
>> num_grat_arp *didn't* help, we probably need to do something keyed off
>> num_grat_arp.
>The bonding slaves are connected to high available switches, each of the
>slaves is connected to a different switch. If the bond is starting, only
>the selected slave sends one arp-request. If a matching arp_response was
>received, this slave and the bond is going into state up, sending the
>gratitious arps...
>But if you got no arp reply the next slave was selected.
>With most of the newer switches, not overloaded, or with other software
>bugs, or with a single switch configuration, you would get a arp response
>on the first arp request.
>But in case of high availability configuration with non perfect switches
>like HP ProCurve 54xx, also with some Cisco models, you may not get a
>response on the first arp request.
>
>I have seen network snoops, there the switches are not responding to the
>first arp request on slave 1, the second arp request was sent on slave 2
>but the response was received on slave one,  and all following arp
>requests are anwsered on the wrong slave for a longer time.

	Could you elaborate on the exact "high availability
configuration" here, including the model(s) of switch(es) involved?

	Is this some kind of race between the switch or switches
updating the forwarding tables and the bond flip flopping between the
slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
the forwarding table, but (for whatever reason) there is no reply.  ARP
on slave 2 is sent (using the same source MAC, unless you set
fail_over_mac), but forwarding tables still send that MAC to slave 1, so
reply is sent there.

>The proposed change sents up to 3 arp requests on a down bond using the
>same slave, delayed by arp_interval.
>Using problematic switches i have seen the the arp response on the right
>slave at latest on the second arp request. So the bond is going into state
>up.
>
>How does it works:
>The bonds in up state are handled on the beginning of bond_ab_arp_probe
>procedure, the other part of this procedure is handling the slave change.
>The proposed change is bypassing the slave change for 2 additional calls
>of bond_ab_arp_probe.
>Now the retries are not only for an up bond available, they are also
>implemented for a down bond.

	Does this delay failover or bringup on switches that are not
"problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
this impact failover / recovery times?

	-J

>The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>only used, if a different slave is going active.
>But in our case, the bonding slaves are not going into the state active
>for a longer time.
>>
>>>> [jarod: manufacturing of changelog]
>>>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>>>> CC: Veaceslav Falico <vfalico@gmail.com>
>>>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>> CC: netdev@vger.kernel.org
>>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>>>> ---
>>>> drivers/net/bonding/bond_main.c | 5 +++++
>>>> 1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>> b/drivers/net/bonding/bond_main.c
>>>> index 0c627b4..60b9483 100644
>>>> --- a/drivers/net/bonding/bond_main.c
>>>> +++ b/drivers/net/bonding/bond_main.c
>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>>> *bond)
>>>>             return should_notify_rtnl;
>>>>     }
>>>>
>>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
>>>> {
>>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>>> +        return should_notify_rtnl;
>>>> +    }
>>>> +
>>>>     bond_set_slave_inactive_flags(curr_arp_slave,
>>>> BOND_SLAVE_NOTIFY_LATER);
>>>>
>>>>     bond_for_each_slave_rcu(bond, slave, iter) {
>>>> -- 
>>>> 1.8.3.1

---
	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-08-17 19:14       ` Jay Vosburgh
@ 2015-08-17 20:51         ` Uwe Koziolek
  2015-08-31 22:21           ` Jarod Wilson
  2015-09-01 15:41           ` Andy Gospodarek
  0 siblings, 2 replies; 22+ messages in thread
From: Uwe Koziolek @ 2015-08-17 20:51 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Jarod Wilson, Veaceslav Falico, linux-kernel, Andy Gospodarek, netdev

On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>
>> On2015-08-17 07:12 PM,Jarod Wilson wrote:
>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>
>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>> into
>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>> one
>>>>> to come up as the active slave. There seems to be an issue with the
>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>> so we
>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>
>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>> additional
>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>> with
>>>>> this hardware combination.
>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>> are
>>>> we fixiing here.
>>>>
>>>> It also breaks completely the logic for link state management in case
>>>> of no
>>>> current active slave for 2*arp_interval.
>>>>
>>>> Could you please elaborate what exactly is fixed here, and how it
>>>> works? :)
>>> I can either duplicate some information from the bug, or Uwe can, to
>>> illustrate the exact nature of the problem.
>>>
>>>> p.s. num_grat_arp maybe could help?
>>> That was my thought as well, but as I understand it, that route was
>>> explored, and it didn't help any. I don't actually have a reproducer
>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>> here...
>>>
>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>> settings were tried that didn't help? I'm still of the mind that if
>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>> num_grat_arp.
>> The bonding slaves are connected to high available switches, each of the
>> slaves is connected to a different switch. If the bond is starting, only
>> the selected slave sends one arp-request. If a matching arp_response was
>> received, this slave and the bond is going into state up, sending the
>> gratitious arps...
>> But if you got no arp reply the next slave was selected.
>> With most of the newer switches, not overloaded, or with other software
>> bugs, or with a single switch configuration, you would get a arp response
>> on the first arp request.
>> But in case of high availability configuration with non perfect switches
>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>> response on the first arp request.
>>
>> I have seen network snoops, there the switches are not responding to the
>> first arp request on slave 1, the second arp request was sent on slave 2
>> but the response was received on slave one,  and all following arp
>> requests are anwsered on the wrong slave for a longer time.
> 	Could you elaborate on the exact "high availability
> configuration" here, including the model(s) of switch(es) involved?
>
> 	Is this some kind of race between the switch or switches
> updating the forwarding tables and the bond flip flopping between the
> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
> the forwarding table, but (for whatever reason) there is no reply.  ARP
> on slave 2 is sent (using the same source MAC, unless you set
> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
> reply is sent there.
High availability:
Two managed switches with routing capabilities are interconnected.
One slave of the bonding interface is connected to the first switch, the
second slave is connected to the other switch.
The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as I
remember, the HP E 3500 and E 3800 are also affected; I can't name the
affected Cisco models today.
No affected single-switch configurations have been seen.

Yes, a race condition with delayed updates of the forwarding tables is a
well-matching explanation for the problem.

>> The proposed change sents up to 3 arp requests on a down bond using the
>> same slave, delayed by arp_interval.
>> Using problematic switches i have seen the the arp response on the right
>> slave at latest on the second arp request. So the bond is going into state
>> up.
>>
>> How does it works:
>> The bonds in up state are handled on the beginning of bond_ab_arp_probe
>> procedure, the other part of this procedure is handling the slave change.
>> The proposed change is bypassing the slave change for 2 additional calls
>> of bond_ab_arp_probe.
>> Now the retries are not only for an up bond available, they are also
>> implemented for a down bond.
> 	Does this delay failover or bringup on switches that are not
> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
> this impact failover / recovery times?
>
> 	-J
It depends.
Failover times are not impacted; failover is handled differently.
Only the transition of a down bonding interface (bond and all slaves
down) to the up state can be delayed by up to 2 times arp_interval, if
the selected interface does not come up. If well-working switches are
used and everything else is also OK, there is no impact.

>> The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>> only used, if a different slave is going active.
>> But in our case, the bonding slaves are not going into the state active
>> for a longer time.
>>>>> [jarod: manufacturing of changelog]
>>>>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>>>>> CC: Veaceslav Falico <vfalico@gmail.com>
>>>>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>> CC: netdev@vger.kernel.org
>>>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>>>>> ---
>>>>> drivers/net/bonding/bond_main.c | 5 +++++
>>>>> 1 file changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>>> b/drivers/net/bonding/bond_main.c
>>>>> index 0c627b4..60b9483 100644
>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>>>> *bond)
>>>>>              return should_notify_rtnl;
>>>>>      }
>>>>>
>>>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
>>>>> {
>>>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>>>> +        return should_notify_rtnl;
>>>>> +    }
>>>>> +
>>>>>      bond_set_slave_inactive_flags(curr_arp_slave,
>>>>> BOND_SLAVE_NOTIFY_LATER);
>>>>>
>>>>>      bond_for_each_slave_rcu(bond, slave, iter) {
>>>>> -- 
>>>>> 1.8.3.1
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-08-17 20:51         ` Uwe Koziolek
@ 2015-08-31 22:21           ` Jarod Wilson
  2015-09-01 23:15             ` Uwe Koziolek
  2015-09-01 15:41           ` Andy Gospodarek
  1 sibling, 1 reply; 22+ messages in thread
From: Jarod Wilson @ 2015-08-31 22:21 UTC (permalink / raw)
  To: Uwe Koziolek, Jay Vosburgh
  Cc: Veaceslav Falico, linux-kernel, Andy Gospodarek, netdev

On 2015-08-17 4:51 PM, Uwe Koziolek wrote:
> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>>
>>> On2015-08-17 07:12 PM,Jarod Wilson wrote:
...
>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>> settings were tried that didn't help? I'm still of the mind that if
>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>> num_grat_arp.
>>> The bonding slaves are connected to high available switches, each of the
>>> slaves is connected to a different switch. If the bond is starting, only
>>> the selected slave sends one arp-request. If a matching arp_response was
>>> received, this slave and the bond is going into state up, sending the
>>> gratitious arps...
>>> But if you got no arp reply the next slave was selected.
>>> With most of the newer switches, not overloaded, or with other software
>>> bugs, or with a single switch configuration, you would get a arp
>>> response
>>> on the first arp request.
>>> But in case of high availability configuration with non perfect switches
>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>> response on the first arp request.
>>>
>>> I have seen network snoops, there the switches are not responding to the
>>> first arp request on slave 1, the second arp request was sent on slave 2
>>> but the response was received on slave one,  and all following arp
>>> requests are anwsered on the wrong slave for a longer time.
>>     Could you elaborate on the exact "high availability
>> configuration" here, including the model(s) of switch(es) involved?
>>
>>     Is this some kind of race between the switch or switches
>> updating the forwarding tables and the bond flip flopping between the
>> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
>> the forwarding table, but (for whatever reason) there is no reply.  ARP
>> on slave 2 is sent (using the same source MAC, unless you set
>> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
>> reply is sent there.
> High availability:
> 2 managed switches with routing capabilities have an interconnect.
> One slave of a bonding interface is connected to the first switch, the
> second slave is connected to the other switch.
> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as i
> remember also HP E 3500 and  E 3800 are also
> affected, for the affected Cisco models I can't answer today.
> Affected single switch configurations was not seen.
>
> Yes, race conditions with delayed upgrades of the forwarding tables is a
> well matching explanation for the problem.
>
>>> The proposed change sents up to 3 arp requests on a down bond using the
>>> same slave, delayed by arp_interval.
>>> Using problematic switches i have seen the the arp response on the right
>>> slave at latest on the second arp request. So the bond is going into
>>> state
>>> up.
>>>
>>> How does it works:
>>> The bonds in up state are handled on the beginning of bond_ab_arp_probe
>>> procedure, the other part of this procedure is handling the slave
>>> change.
>>> The proposed change is bypassing the slave change for 2 additional calls
>>> of bond_ab_arp_probe.
>>> Now the retries are not only for an up bond available, they are also
>>> implemented for a down bond.
>>     Does this delay failover or bringup on switches that are not
>> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
>> this impact failover / recovery times?
>>
>>     -J
> It depends.
> failover times are not impacted, this is handled different.
> Only the transition from a down bonding interface (bond and all slaves
> are down) to the state up can be increased by up to 2 times arp_interval,
> If the selected interface did not came up .If well working switches are
> used, and everything other is also ok, there are no impacts.

Jay, any further thoughts on this given Uwe's reply? Uwe, did you have a 
chance to get affected Cisco model numbers too?

-- 
Jarod Wilson
jarod@redhat.com


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-08-17 20:51         ` Uwe Koziolek
  2015-08-31 22:21           ` Jarod Wilson
@ 2015-09-01 15:41           ` Andy Gospodarek
  2015-09-01 23:10             ` Uwe Koziolek
  1 sibling, 1 reply; 22+ messages in thread
From: Andy Gospodarek @ 2015-09-01 15:41 UTC (permalink / raw)
  To: Uwe Koziolek
  Cc: Jay Vosburgh, Jarod Wilson, Veaceslav Falico, linux-kernel, netdev

On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
> >Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
> >
> >>On 2015-08-17 07:12 PM, Jarod Wilson wrote:
> >>>On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
> >>>>On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
> >>>>>From: Uwe Koziolek <uwe.koziolek@redknee.com>
> >>>>>
> >>>>>With some very finicky switch hardware, active backup bonding can get
> >>>>>into
> >>>>>a situation where we play ping-pong between interfaces, trying to get
> >>>>>one
> >>>>>to come up as the active slave. There seems to be an issue with the
> >>>>>switch's arp replies either taking too long, or simply getting lost,
> >>>>>so we
> >>>>>wind up unable to get any interface up and active. Sometimes, the issue
> >>>>>sorts itself out after a while, sometimes it doesn't.
> >>>>>
> >>>>>Testing with num_grat_arp has proven fruitless, but sending an
> >>>>>additional
> >>>>>arp on curr_arp_slave if we're still in the arp_interval timeslice in
> >>>>>bond_ab_arp_probe(), has shown to produce 100% reliability in testing
> >>>>>with
> >>>>>this hardware combination.
> >>>>Sorry, I don't understand the logic of why it works, and what exactly
> >>>>are
> >>>>we fixiing here.
> >>>>
> >>>>It also breaks completely the logic for link state management in case
> >>>>of no
> >>>>current active slave for 2*arp_interval.
> >>>>
> >>>>Could you please elaborate what exactly is fixed here, and how it
> >>>>works? :)
> >>>I can either duplicate some information from the bug, or Uwe can, to
> >>>illustrate the exact nature of the problem.
> >>>
> >>>>p.s. num_grat_arp maybe could help?
> >>>That was my thought as well, but as I understand it, that route was
> >>>explored, and it didn't help any. I don't actually have a reproducer
> >>>setup of my own, unfortunately, so I'm kind of caught in the middle
> >>>here...
> >>>
> >>>Uwe, can you perhaps further enlighten us as to what num_grat_arp
> >>>settings were tried that didn't help? I'm still of the mind that if
> >>>num_grat_arp *didn't* help, we probably need to do something keyed off
> >>>num_grat_arp.
> >>The bonding slaves are connected to high available switches, each of the
> >>slaves is connected to a different switch. If the bond is starting, only
> >>the selected slave sends one arp-request. If a matching arp_response was
> >>received, this slave and the bond is going into state up, sending the
> >>gratitious arps...
> >>But if you got no arp reply the next slave was selected.
> >>With most of the newer switches, not overloaded, or with other software
> >>bugs, or with a single switch configuration, you would get a arp response
> >>on the first arp request.
> >>But in case of high availability configuration with non perfect switches
> >>like HP ProCurve 54xx, also with some Cisco models, you may not get a
> >>response on the first arp request.
> >>
> >>I have seen network snoops, there the switches are not responding to the
> >>first arp request on slave 1, the second arp request was sent on slave 2
> >>but the response was received on slave one,  and all following arp
> >>requests are anwsered on the wrong slave for a longer time.
> >	Could you elaborate on the exact "high availability
> >configuration" here, including the model(s) of switch(es) involved?
> >
> >	Is this some kind of race between the switch or switches
> >updating the forwarding tables and the bond flip flopping between the
> >slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
> >the forwarding table, but (for whatever reason) there is no reply.  ARP
> >on slave 2 is sent (using the same source MAC, unless you set
> >fail_over_mac), but forwarding tables still send that MAC to slave 1, so
> >reply is sent there.
> High availability:
> 2 managed switches with routing capabilities have an interconnect.
> One slave of a bonding interface is connected to the first switch, the
> second slave is connected to the other switch.
> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as I
> remember, HP E 3500 and E 3800 are also affected; I can't name the
> affected Cisco models today.
> No affected single-switch configurations have been seen.
> 
> Yes, a race condition with delayed updates of the forwarding tables is a
> well-matching explanation for the problem.
> 
> >>The proposed change sends up to 3 arp requests on a down bond using the
> >>same slave, each delayed by arp_interval.
> >>With problematic switches I have seen the arp response arrive on the
> >>right slave by the second arp request at the latest, so the bond goes
> >>into state up.
> >>
> >>How it works:
> >>Bonds in up state are handled at the beginning of the bond_ab_arp_probe
> >>procedure; the rest of the procedure handles the slave change.
> >>The proposed change bypasses the slave change for 2 additional calls
> >>of bond_ab_arp_probe.
> >>Retries are now available not only for an up bond, but also for a
> >>down bond.
> >	Does this delay failover or bringup on switches that are not
> >"problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
> >this impact failover / recovery times?
> >
> >	-J
> It depends.
> Failover times are not impacted; failover is handled differently.
> Only the transition from a down bonding interface (bond and all slaves are
> down) to the state up can be delayed by up to 2 times arp_interval,
> if the selected interface did not come up. If well-behaved switches are
> used and everything else is also ok, there is no impact.

So I'm not a huge fan of workarounds like these, but I also understand
from a practical standpoint that this is useful.  My only issue with the
patch would be to please include a small comment (1-2 lines) in the code
that describes the behavior.  I know we have the changelog entries for
this, but I would feel better about having an exception like this in the
code for those reading it and wondering:

"Why would we wait 2 intervals before failing over to the next interface
when there are no active interfaces?"
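
The schedule behind that question can be sketched with a rough model of
which slave is probed at each arp_interval tick while no slave is active.
probed_slave_at_tick() is a hypothetical helper, not bonding code. It
assumes the slaves are tried round-robin with a fixed number of ARP sends
per slave: 1 for the behaviour described earlier in the thread, 3 with
the proposed change.

```c
#include <assert.h>

/*
 * Hypothetical model, not kernel code: index of the slave that is probed
 * at a given arp_interval tick when every slave gets sends_per_slave
 * consecutive ARP requests before the bond moves on to the next one.
 */
int probed_slave_at_tick(int tick, int num_slaves, int sends_per_slave)
{
    assert(tick >= 0 && num_slaves > 0 && sends_per_slave > 0);
    return (tick / sends_per_slave) % num_slaves;
}
```

With two slaves, sends_per_slave = 1 gives the ping-pong sequence
0, 1, 0, 1, ... while sends_per_slave = 3 gives 0, 0, 0, 1, 1, 1, ...,
which is the two-interval dwell the question refers to.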


> 
> >>The num_grat_arp has no chance to solve the problem. The num_grat_arp is
> >>only used, if a different slave is going active.
> >>But in our case, the bonding slaves are not going into the state active
> >>for a longer time.
> >>>>>[jarod: manufacturing of changelog]
> >>>>>CC: Jay Vosburgh <j.vosburgh@gmail.com>
> >>>>>CC: Veaceslav Falico <vfalico@gmail.com>
> >>>>>CC: Andy Gospodarek <gospo@cumulusnetworks.com>
> >>>>>CC: netdev@vger.kernel.org
> >>>>>Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
> >>>>>Signed-off-by: Jarod Wilson <jarod@redhat.com>
> >>>>>---
> >>>>>drivers/net/bonding/bond_main.c | 5 +++++
> >>>>>1 file changed, 5 insertions(+)
> >>>>>
> >>>>>diff --git a/drivers/net/bonding/bond_main.c
> >>>>>b/drivers/net/bonding/bond_main.c
> >>>>>index 0c627b4..60b9483 100644
> >>>>>--- a/drivers/net/bonding/bond_main.c
> >>>>>+++ b/drivers/net/bonding/bond_main.c
> >>>>>@@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
> >>>>>*bond)
> >>>>>             return should_notify_rtnl;
> >>>>>     }
> >>>>>
> >>>>>+    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
> >>>>>{
> >>>>>+        bond_arp_send_all(bond, curr_arp_slave);
> >>>>>+        return should_notify_rtnl;
> >>>>>+    }
> >>>>>+
> >>>>>     bond_set_slave_inactive_flags(curr_arp_slave,
> >>>>>BOND_SLAVE_NOTIFY_LATER);
> >>>>>
> >>>>>     bond_for_each_slave_rcu(bond, slave, iter) {
> >>>>>-- 
> >>>>>1.8.3.1
> >---
> >	-Jay Vosburgh, jay.vosburgh@canonical.com
> 


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-09-01 15:41           ` Andy Gospodarek
@ 2015-09-01 23:10             ` Uwe Koziolek
  2015-09-03 15:05               ` Jay Vosburgh
  0 siblings, 1 reply; 22+ messages in thread
From: Uwe Koziolek @ 2015-09-01 23:10 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Jay Vosburgh, Jarod Wilson, Veaceslav Falico, linux-kernel, netdev

On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>>>
>>>> On 2015-08-17 07:12 PM, Jarod Wilson wrote:
>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>>
>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>> into
>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>> one
>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>> so we
>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>
>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>> additional
>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>> with
>>>>>>> this hardware combination.
>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>> are
>>>>>> we fixiing here.
>>>>>>
>>>>>> It also breaks completely the logic for link state management in case
>>>>>> of no
>>>>>> current active slave for 2*arp_interval.
>>>>>>
>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>> works? :)
>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>> illustrate the exact nature of the problem.
>>>>>
>>>>>> p.s. num_grat_arp maybe could help?
>>>>> That was my thought as well, but as I understand it, that route was
>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>> here...
>>>>>
>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>> num_grat_arp.
>>>> The bonding slaves are connected to high available switches, each of the
>>>> slaves is connected to a different switch. If the bond is starting, only
>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>> received, this slave and the bond is going into state up, sending the
>>>> gratitious arps...
>>>> But if you got no arp reply the next slave was selected.
>>>> With most of the newer switches, not overloaded, or with other software
>>>> bugs, or with a single switch configuration, you would get a arp response
>>>> on the first arp request.
>>>> But in case of high availability configuration with non perfect switches
>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>> response on the first arp request.
>>>>
>>>> I have seen network snoops, there the switches are not responding to the
>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>> but the response was received on slave one,  and all following arp
>>>> requests are anwsered on the wrong slave for a longer time.
>>> 	Could you elaborate on the exact "high availability
>>> configuration" here, including the model(s) of switch(es) involved?
>>>
>>> 	Is this some kind of race between the switch or switches
>>> updating the forwarding tables and the bond flip flopping between the
>>> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
>>> the forwarding table, but (for whatever reason) there is no reply.  ARP
>>> on slave 2 is sent (using the same source MAC, unless you set
>>> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
>>> reply is sent there.
>> High availability:
>> 2 managed switches with routing capabilities have an interconnect.
>> One slave of a bonding interface is connected to the first switch, the
>> second slave is connected to the other switch.
>> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as I
>> remember, HP E 3500 and E 3800 are also affected; I can't name the
>> affected Cisco models today.
>> No affected single-switch configurations have been seen.
>>
>> Yes, a race condition with delayed updates of the forwarding tables is a
>> well-matching explanation for the problem.
>>
>>>> The proposed change sends up to 3 arp requests on a down bond using the
>>>> same slave, each delayed by arp_interval.
>>>> With problematic switches I have seen the arp response arrive on the
>>>> right slave by the second arp request at the latest, so the bond goes
>>>> into state up.
>>>>
>>>> How it works:
>>>> Bonds in up state are handled at the beginning of the bond_ab_arp_probe
>>>> procedure; the rest of the procedure handles the slave change.
>>>> The proposed change bypasses the slave change for 2 additional calls
>>>> of bond_ab_arp_probe.
>>>> Retries are now available not only for an up bond, but also for a
>>>> down bond.
>>> 	Does this delay failover or bringup on switches that are not
>>> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
>>> this impact failover / recovery times?
>>>
>>> 	-J
>> It depends.
>> Failover times are not impacted; failover is handled differently.
>> Only the transition from a down bonding interface (bond and all slaves are
>> down) to the state up can be delayed by up to 2 times arp_interval,
>> if the selected interface did not come up. If well-behaved switches are
>> used and everything else is also ok, there is no impact.
> So I'm not a huge fan of workarounds like these, but I also understand
> from a practical standpoint that this is useful.  My only issue with the
> patch would be to please include a small comment (1-2 lines) in the code
> that describes the behavior.  I know we have the changelog entries for
> this, but I would feel better about having an exception like this in the
> code for those reading it and wondering:
>
> "Why would we wait 2 intervals before failing over to the next interface
> when there are no active interfaces?"
>

diff -up a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
--- a/drivers/net/bonding/bond_main.c   2015-08-30 20:34:09.000000000 +0200
+++ b/drivers/net/bonding/bond_main.c   2015-09-02 00:39:10.000298202 +0200
@@ -2795,6 +2795,16 @@ static bool bond_ab_arp_probe(struct bon
                         return should_notify_rtnl;
         }

+       /* sometimes the forwarding tables of the switches are not updated fast enough
+        * the first arp response after a slave change is received on the wrong slave.
+        * the arp requests will be retried 2 times on the same slave
+        */
+
+       if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+               bond_arp_send_all(bond, curr_arp_slave);
+               return should_notify_rtnl;
+       }
+
         bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);

         bond_for_each_slave_rcu(bond, slave, iter) {
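
For readers following the thread, the retry window that the added check
opens can be modeled in userspace. This is only a sketch:
in_retry_window() and probes_on_same_slave() are hypothetical stand-ins,
times are plain milliseconds rather than jiffies, and the half interval of
slack at the upper bound is an assumption about bond_time_in_interval(),
not something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the window test used by the patch.
 * Assumption: the window runs from one interval before last_link_up to
 * (mod + 1/2) intervals after it, inclusive on both ends. Wraparound is
 * ignored because the times are ordinary millisecond counters.
 */
bool in_retry_window(long long now_ms, long long last_link_up_ms,
                     long long arp_interval_ms, int mod)
{
    long long lo = last_link_up_ms - arp_interval_ms;
    long long hi = last_link_up_ms + mod * arp_interval_ms
                   + arp_interval_ms / 2;

    return now_ms >= lo && now_ms <= hi;
}

/*
 * Count the probe ticks (one per arp_interval, the first at the moment the
 * slave is selected, modeled as time 0) that still fall inside the window.
 * That count is the number of ARP requests sent on the same slave before
 * the bond moves on to the next one.
 */
int probes_on_same_slave(long long arp_interval_ms, int mod)
{
    int count = 0;

    assert(arp_interval_ms > 0);
    for (long long t = 0; in_retry_window(t, 0, arp_interval_ms, mod);
         t += arp_interval_ms)
        count++;
    return count;
}
```

Under these assumptions probes_on_same_slave(1000, 2) yields 3, which
matches the "up to 3 arp requests" on the same slave described earlier in
the thread, and a mod of 0 yields a single request.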

>>>> The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>>>> only used, if a different slave is going active.
>>>> But in our case, the bonding slaves are not going into the state active
>>>> for a longer time.
>>>>>>> [jarod: manufacturing of changelog]
>>>>>>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>>>>>>> CC: Veaceslav Falico <vfalico@gmail.com>
>>>>>>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>>>> CC: netdev@vger.kernel.org
>>>>>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>>>>>>> ---
>>>>>>> drivers/net/bonding/bond_main.c | 5 +++++
>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>>>>> b/drivers/net/bonding/bond_main.c
>>>>>>> index 0c627b4..60b9483 100644
>>>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>>>>>> *bond)
>>>>>>>              return should_notify_rtnl;
>>>>>>>      }
>>>>>>>
>>>>>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
>>>>>>> {
>>>>>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>>>>>> +        return should_notify_rtnl;
>>>>>>> +    }
>>>>>>> +
>>>>>>>      bond_set_slave_inactive_flags(curr_arp_slave,
>>>>>>> BOND_SLAVE_NOTIFY_LATER);
>>>>>>>
>>>>>>>      bond_for_each_slave_rcu(bond, slave, iter) {
>>>>>>> -- 
>>>>>>> 1.8.3.1
>>> ---
>>> 	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-08-31 22:21           ` Jarod Wilson
@ 2015-09-01 23:15             ` Uwe Koziolek
  0 siblings, 0 replies; 22+ messages in thread
From: Uwe Koziolek @ 2015-09-01 23:15 UTC (permalink / raw)
  To: Jarod Wilson, Jay Vosburgh
  Cc: Veaceslav Falico, linux-kernel, Andy Gospodarek, netdev

On Tue, 01.09.2015 at 00:21 +0200 Jarod Wilson wrote:
> On 2015-08-17 4:51 PM, Uwe Koziolek wrote:
>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>>>
>>>> On 2015-08-17 07:12 PM, Jarod Wilson wrote:
> ...
>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>> num_grat_arp *didn't* help, we probably need to do something keyed 
>>>>> off
>>>>> num_grat_arp.
>>>> The bonding slaves are connected to high available switches, each 
>>>> of the
>>>> slaves is connected to a different switch. If the bond is starting, 
>>>> only
>>>> the selected slave sends one arp-request. If a matching 
>>>> arp_response was
>>>> received, this slave and the bond is going into state up, sending the
>>>> gratitious arps...
>>>> But if you got no arp reply the next slave was selected.
>>>> With most of the newer switches, not overloaded, or with other 
>>>> software
>>>> bugs, or with a single switch configuration, you would get a arp
>>>> response
>>>> on the first arp request.
>>>> But in case of high availability configuration with non perfect 
>>>> switches
>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>> response on the first arp request.
>>>>
>>>> I have seen network snoops, there the switches are not responding 
>>>> to the
>>>> first arp request on slave 1, the second arp request was sent on 
>>>> slave 2
>>>> but the response was received on slave one,  and all following arp
>>>> requests are anwsered on the wrong slave for a longer time.
>>>     Could you elaborate on the exact "high availability
>>> configuration" here, including the model(s) of switch(es) involved?
>>>
>>>     Is this some kind of race between the switch or switches
>>> updating the forwarding tables and the bond flip flopping between the
>>> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
>>> the forwarding table, but (for whatever reason) there is no reply.  ARP
>>> on slave 2 is sent (using the same source MAC, unless you set
>>> fail_over_mac), but forwarding tables still send that MAC to slave 
>>> 1, so
>>> reply is sent there.
>> High availability:
>> 2 managed switches with routing capabilities have an interconnect.
>> One slave of a bonding interface is connected to the first switch, the
>> second slave is connected to the other switch.
>> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as I
>> remember, HP E 3500 and E 3800 are also affected; I can't name the
>> affected Cisco models today.
>> No affected single-switch configurations have been seen.
>>
>> Yes, a race condition with delayed updates of the forwarding tables is a
>> well-matching explanation for the problem.
>>
>>>> The proposed change sends up to 3 arp requests on a down bond using
>>>> the same slave, each delayed by arp_interval.
>>>> With problematic switches I have seen the arp response arrive on the
>>>> right slave by the second arp request at the latest, so the bond goes
>>>> into state up.
>>>>
>>>> How it works:
>>>> Bonds in up state are handled at the beginning of the bond_ab_arp_probe
>>>> procedure; the rest of the procedure handles the slave change.
>>>> The proposed change bypasses the slave change for 2 additional calls
>>>> of bond_ab_arp_probe.
>>>> Retries are now available not only for an up bond, but also for a
>>>> down bond.
>>>     Does this delay failover or bringup on switches that are not
>>> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
>>> this impact failover / recovery times?
>>>
>>>     -J
>> It depends.
>> Failover times are not impacted; failover is handled differently.
>> Only the transition from a down bonding interface (bond and all slaves
>> are down) to the state up can be delayed by up to 2 times arp_interval,
>> if the selected interface did not come up. If well-behaved switches are
>> used and everything else is also ok, there is no impact.
>
> Jay, any further thoughts on this given Uwe's reply? Uwe, did you have 
> a chance to get affected Cisco model numbers too?
>
The affected Cisco model was a C3750.


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-09-01 23:10             ` Uwe Koziolek
@ 2015-09-03 15:05               ` Jay Vosburgh
  2015-09-04 11:04                 ` Uwe Koziolek
  0 siblings, 1 reply; 22+ messages in thread
From: Jay Vosburgh @ 2015-09-03 15:05 UTC (permalink / raw)
  To: Uwe Koziolek
  Cc: Andy Gospodarek, Jarod Wilson, Veaceslav Falico, linux-kernel, netdev

Uwe Koziolek <uwe.koziolek@redknee.com> wrote:

>On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
>> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>>>>
>>>>> On 2015-08-17 07:12 PM, Jarod Wilson wrote:
>>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>>>
>>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>>> into
>>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>>> one
>>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>>> so we
>>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>>
>>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>>> additional
>>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>>> with
>>>>>>>> this hardware combination.
>>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>>> are
>>>>>>> we fixiing here.
>>>>>>>
>>>>>>> It also breaks completely the logic for link state management in case
>>>>>>> of no
>>>>>>> current active slave for 2*arp_interval.
>>>>>>>
>>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>>> works? :)
>>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>>> illustrate the exact nature of the problem.
>>>>>>
>>>>>>> p.s. num_grat_arp maybe could help?
>>>>>> That was my thought as well, but as I understand it, that route was
>>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>>> here...
>>>>>>
>>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>>> num_grat_arp.
>>>>> The bonding slaves are connected to high available switches, each of the
>>>>> slaves is connected to a different switch. If the bond is starting, only
>>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>>> received, this slave and the bond is going into state up, sending the
>>>>> gratitious arps...
>>>>> But if you got no arp reply the next slave was selected.
>>>>> With most of the newer switches, not overloaded, or with other software
>>>>> bugs, or with a single switch configuration, you would get a arp response
>>>>> on the first arp request.
>>>>> But in case of high availability configuration with non perfect switches
>>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>>> response on the first arp request.
>>>>>
>>>>> I have seen network snoops, there the switches are not responding to the
>>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>>> but the response was received on slave one,  and all following arp
>>>>> requests are anwsered on the wrong slave for a longer time.
>>>> 	Could you elaborate on the exact "high availability
>>>> configuration" here, including the model(s) of switch(es) involved?
>>>>
>>>> 	Is this some kind of race between the switch or switches
>>>> updating the forwarding tables and the bond flip flopping between the
>>>> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
>>>> the forwarding table, but (for whatever reason) there is no reply.  ARP
>>>> on slave 2 is sent (using the same source MAC, unless you set
>>>> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
>>>> reply is sent there.
>>> High availability:
>>> 2 managed switches with routing capabilities have an interconnect.
>>> One slave of a bonding interface is connected to the first switch, the
>>> second slave is connected to the other switch.
>>> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as I
>>> remember, HP E 3500 and E 3800 are also affected; I can't name the
>>> affected Cisco models today.
>>> No affected single-switch configurations have been seen.
>>>
>>> Yes, a race condition with delayed updates of the forwarding tables is a
>>> well-matching explanation for the problem.
>>>
>>>>> The proposed change sends up to 3 arp requests on a down bond using the
>>>>> same slave, each delayed by arp_interval.
>>>>> With problematic switches I have seen the arp response arrive on the
>>>>> right slave by the second arp request at the latest, so the bond goes
>>>>> into state up.
>>>>>
>>>>> How it works:
>>>>> Bonds in up state are handled at the beginning of the bond_ab_arp_probe
>>>>> procedure; the rest of the procedure handles the slave change.
>>>>> The proposed change bypasses the slave change for 2 additional calls
>>>>> of bond_ab_arp_probe.
>>>>> Retries are now available not only for an up bond, but also for a
>>>>> down bond.
>>>> 	Does this delay failover or bringup on switches that are not
>>>> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
>>>> this impact failover / recovery times?
>>>>
>>>> 	-J
>>> It depends.
>>> Failover times are not impacted; failover is handled differently.
>>> Only the transition from a down bonding interface (bond and all slaves are
>>> down) to the state up can be delayed by up to 2 times arp_interval,
>>> if the selected interface did not come up. If well-behaved switches are
>>> used and everything else is also ok, there is no impact.
>> So I'm not a huge fan of workarounds like these, but I also understand
>> from a practical standpoint that this is useful.  My only issue with the
>> patch would be to please include a small comment (1-2 lines) in the code
>> that describes the behavior.  I know we have the changelog entries for
>> this, but I would feel better about having an exception like this in the
>> code for those reading it and wondering:
>>
>> "Why would we wait 2 intervals before failing over to the next interface
>> when there are no active interfaces?"
>>
>
>diff -up a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>--- a/drivers/net/bonding/bond_main.c   2015-08-30 20:34:09.000000000 +0200
>+++ b/drivers/net/bonding/bond_main.c   2015-09-02 00:39:10.000298202 +0200
>@@ -2795,6 +2795,16 @@ static bool bond_ab_arp_probe(struct bon
>                        return should_notify_rtnl;
>        }
>
>+       /* sometimes the forwarding tables of the switches are not updated fast enough
>+        * the first arp response after a slave change is received on the wrong slave.
>+        * the arp requests will be retried 2 times on the same slave
>+        */
>+
>+       if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
>+               bond_arp_send_all(bond, curr_arp_slave);
>+               return should_notify_rtnl;
>+       }
>+

	I probably should have asked this in the beginning, but at what
range of arp_interval values does the problem manifest?  If it's a race
condition with the switch update, I'd expect that only very small
arp_interval values would be affected.

	Also, your proposed comment wraps past 80 columns.

	-J


>        bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>
>        bond_for_each_slave_rcu(bond, slave, iter) {
>
>>>>> The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>>>>> only used, if a different slave is going active.
>>>>> But in our case, the bonding slaves are not going into the state active
>>>>> for a longer time.
>>>>>>>> [jarod: manufacturing of changelog]
>>>>>>>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>>>>>>>> CC: Veaceslav Falico <vfalico@gmail.com>
>>>>>>>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>>>>> CC: netdev@vger.kernel.org
>>>>>>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>>>>>>>> ---
>>>>>>>> drivers/net/bonding/bond_main.c | 5 +++++
>>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>>>>>> b/drivers/net/bonding/bond_main.c
>>>>>>>> index 0c627b4..60b9483 100644
>>>>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>>>>>>> *bond)
>>>>>>>>              return should_notify_rtnl;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
>>>>>>>> {
>>>>>>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>>>>>>> +        return should_notify_rtnl;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>      bond_set_slave_inactive_flags(curr_arp_slave,
>>>>>>>> BOND_SLAVE_NOTIFY_LATER);
>>>>>>>>
>>>>>>>>      bond_for_each_slave_rcu(bond, slave, iter) {
>>>>>>>> -- 
>>>>>>>> 1.8.3.1

---
	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-09-03 15:05               ` Jay Vosburgh
@ 2015-09-04 11:04                 ` Uwe Koziolek
  2015-09-28 13:31                   ` Jarod Wilson
  0 siblings, 1 reply; 22+ messages in thread
From: Uwe Koziolek @ 2015-09-04 11:04 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Andy Gospodarek, Jarod Wilson, Veaceslav Falico, linux-kernel, netdev

On 03.09.2015 at 17:05, Jay Vosburgh wrote:
> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>
>> On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
>>> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>>>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>>>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>>>>>
>>>>>> On 2015-08-17 07:12 PM, Jarod Wilson wrote:
>>>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>>>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>>>>
>>>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>>>> into
>>>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>>>> one
>>>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>>>> so we
>>>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>>>
>>>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>>>> additional
>>>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>>>> with
>>>>>>>>> this hardware combination.
>>>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>>>> are
>>>>>>>> we fixing here.
>>>>>>>>
>>>>>>>> It also breaks completely the logic for link state management in case
>>>>>>>> of no
>>>>>>>> current active slave for 2*arp_interval.
>>>>>>>>
>>>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>>>> works? :)
>>>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>>>> illustrate the exact nature of the problem.
>>>>>>>
>>>>>>>> p.s. num_grat_arp maybe could help?
>>>>>>> That was my thought as well, but as I understand it, that route was
>>>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>>>> here...
>>>>>>>
>>>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>>>> num_grat_arp.
>>>>>> The bonding slaves are connected to high available switches, each of the
>>>>>> slaves is connected to a different switch. If the bond is starting, only
>>>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>>>> received, this slave and the bond is going into state up, sending the
>>>>>> gratuitous arps...
>>>>>> But if you got no arp reply the next slave was selected.
>>>>>> With most of the newer switches, not overloaded, or with other software
>>>>>> bugs, or with a single switch configuration, you would get a arp response
>>>>>> on the first arp request.
>>>>>> But in case of high availability configuration with non perfect switches
>>>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>>>> response on the first arp request.
>>>>>>
>>>>>> I have seen network snoops, there the switches are not responding to the
>>>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>>>> but the response was received on slave one,  and all following arp
>>>>>> requests are answered on the wrong slave for a longer time.
>>>>> 	Could you elaborate on the exact "high availability
>>>>> configuration" here, including the model(s) of switch(es) involved?
>>>>>
>>>>> 	Is this some kind of race between the switch or switches
>>>>> updating the forwarding tables and the bond flip flopping between the
>>>>> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
>>>>> the forwarding table, but (for whatever reason) there is no reply.  ARP
>>>>> on slave 2 is sent (using the same source MAC, unless you set
>>>>> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
>>>>> reply is sent there.
>>>> High availability:
>>>> 2 managed switches with routing capabilities have an interconnect.
>>>> One slave of a bonding interface is connected to the first switch, the
>>>> second slave is connected to the other switch.
>>>> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as I
>>>> remember, HP E 3500 and E 3800 are also
>>>> affected; for the affected Cisco models I can't answer today.
>>>> Affected single switch configurations were not seen.
>>>>
>>>> Yes, race conditions with delayed updates of the forwarding tables are a
>>>> well matching explanation for the problem.
>>>>
>>>>>> The proposed change sends up to 3 arp requests on a down bond using the
>>>>>> same slave, delayed by arp_interval.
>>>>>> Using problematic switches I have seen the arp response on the right
>>>>>> slave at latest on the second arp request. So the bond is going into state
>>>>>> up.
>>>>>>
>>>>>> How does it work:
>>>>>> The bonds in up state are handled on the beginning of bond_ab_arp_probe
>>>>>> procedure, the other part of this procedure is handling the slave change.
>>>>>> The proposed change is bypassing the slave change for 2 additional calls
>>>>>> of bond_ab_arp_probe.
>>>>>> Now the retries are not only for an up bond available, they are also
>>>>>> implemented for a down bond.
>>>>> 	Does this delay failover or bringup on switches that are not
>>>>> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
>>>>> this impact failover / recovery times?
>>>>>
>>>>> 	-J
>>>> It depends.
>>>> Failover times are not impacted; this is handled differently.
>>>> Only the transition from a down bonding interface (bond and all slaves are
>>>> down) to the state up can be increased by up to 2 times arp_interval,
>>>> if the selected interface did not come up. If well-working switches are
>>>> used, and everything else is also ok, there are no impacts.
>>> So I'm not a huge fan of workarounds like these, but I also understand
>>> from a practical standpoint that this is useful.  My only issue with the
>>> patch would be to please include a small comment (1-2 lines) in the code
>>> that describes the behavior.  I know we have the changelog entries for
>>> this, but I would feel better about having an exception like this in the
>>> code for those reading it and wondering:
>>>
>>> "Why would we wait 2 intervals before failing over to the next interface
>>> when there are no active interfaces?"
>>>
>>
>> diff -up a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> --- a/drivers/net/bonding/bond_main.c   2015-08-30 20:34:09.000000000 +0200
>> +++ b/drivers/net/bonding/bond_main.c   2015-09-02 00:39:10.000298202 +0200
>> @@ -2795,6 +2795,16 @@ static bool bond_ab_arp_probe(struct bon
>>                         return should_notify_rtnl;
>>         }
>>
>> +       /* sometimes the forwarding tables of the switches are not updated fast enough
>> +        * the first arp response after a slave change is received on the wrong slave.
>> +        * the arp requests will be retried 2 times on the same slave
>> +        */
>> +
>> +       if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
>> +               bond_arp_send_all(bond, curr_arp_slave);
>> +               return should_notify_rtnl;
>> +       }
>> +
>
> 	I probably should have asked this in the beginning, but at what
> range of arp_interval values does the problem manifest?  If it's a race
> condition with the switch update, I'd expect that only very small
> arp_interval values would be affected.
>
> 	Also, your proposed comment wraps past 80 columns.
>
> 	-J
>
Only an arp_interval of 500 ms was used; no other values were tested.
The line wraps in the patch have now been removed.

diff -up ./drivers/net/bonding/bond_main.c.orig ./drivers/net/bonding/bond_main.c
--- ./drivers/net/bonding/bond_main.c.orig	2015-08-30 20:34:09.000000000 +0200
+++ ./drivers/net/bonding/bond_main.c	2015-09-04 11:59:05.755897182 +0200
@@ -2795,6 +2795,17 @@ static bool bond_ab_arp_probe(struct bon
  			return should_notify_rtnl;
  	}

+	/* sometimes the forwarding tables of the switches are not updated
+	 * fast enough. the first arp response after a slave change is received
+	 * on the wrong slave.
+	 * the arp requests will be retried 2 times on the same slave
+	 */
+
+	if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+		bond_arp_send_all(bond, curr_arp_slave);
+		return should_notify_rtnl;
+	}
+
  	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);

  	bond_for_each_slave_rcu(bond, slave, iter) {
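
To make the "retried 2 times" claim in the comment above concrete, here is a
minimal user-space sketch of the timing, not kernel code. It assumes that the
slave's last_link_up is stamped when it becomes curr_arp_slave, that the ARP
monitor runs once per arp_interval, and that the window accepted by
bond_time_in_interval() extends to last_act + mod * interval + interval / 2.
The window bounds and both function names are assumptions made for
illustration; check them against bond_main.c before relying on them.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of the time-window test; all times are in milliseconds.
 * The bounds are an assumption modelled on the kernel helper, not a copy. */
bool time_in_interval_ms(long now, long last_act, long interval_ms, int mod)
{
	assert(interval_ms > 0);
	return now >= last_act - interval_ms &&
	       now <= last_act + mod * interval_ms + interval_ms / 2;
}

/* Count the monitor runs that re-probe the same slave before the monitor
 * moves on to the next one. The slave is selected at t = 0 (that first
 * probe is not counted) and the monitor then runs once per interval. */
int extra_probes_on_same_slave(long interval_ms, int mod)
{
	int retries = 0;
	long now;

	for (now = interval_ms;
	     time_in_interval_ms(now, 0, interval_ms, mod);
	     now += interval_ms)
		retries++;

	return retries;
}
```

Under these assumptions, extra_probes_on_same_slave(500, 2) counts two
additional probes, so a down bond can stay on one slave for up to
2 * arp_interval longer than before.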

>
>>         bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>>
>>         bond_for_each_slave_rcu(bond, slave, iter) {
>>
>>>>>> The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>>>>>> only used, if a different slave is going active.
>>>>>> But in our case, the bonding slaves are not going into the state active
>>>>>> for a longer time.
>>>>>>>>> [jarod: manufacturing of changelog]
>>>>>>>>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>>>>>>>>> CC: Veaceslav Falico <vfalico@gmail.com>
>>>>>>>>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>>>>>> CC: netdev@vger.kernel.org
>>>>>>>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>>>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>>>>>>>>> ---
>>>>>>>>> drivers/net/bonding/bond_main.c | 5 +++++
>>>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>>>>>>> b/drivers/net/bonding/bond_main.c
>>>>>>>>> index 0c627b4..60b9483 100644
>>>>>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>>>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>>>>>>>> *bond)
>>>>>>>>>               return should_notify_rtnl;
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
>>>>>>>>> {
>>>>>>>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>>>>>>>> +        return should_notify_rtnl;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>>       bond_set_slave_inactive_flags(curr_arp_slave,
>>>>>>>>> BOND_SLAVE_NOTIFY_LATER);
>>>>>>>>>
>>>>>>>>>       bond_for_each_slave_rcu(bond, slave, iter) {
>>>>>>>>> --
>>>>>>>>> 1.8.3.1
>
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com
>


* Re: [PATCH] net/bonding: send arp in interval if no active slave
  2015-09-04 11:04                 ` Uwe Koziolek
@ 2015-09-28 13:31                   ` Jarod Wilson
  2015-10-06 19:53                     ` [PATCH v4] " Jarod Wilson
  0 siblings, 1 reply; 22+ messages in thread
From: Jarod Wilson @ 2015-09-28 13:31 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Andy Gospodarek, Veaceslav Falico, linux-kernel, netdev, Uwe Koziolek

Uwe Koziolek wrote:
> On 03.09.2015 at 17:05, Jay Vosburgh wrote:
>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>>
>>> On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
>>>> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>>>>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>>>>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
...
>> I probably should have asked this in the beginning, but at what
>> range of arp_interval values does the problem manifest? If it's a race
>> condition with the switch update, I'd expect that only very small
>> arp_interval values would be affected.
>>
>> Also, your proposed comment wraps past 80 columns.
>>
>> -J
>>
> Only an arp_interval of 500 ms was used; no other values were tested.
> The line wraps in the patch have now been removed.
>
> diff -up ./drivers/net/bonding/bond_main.c.orig
> ./drivers/net/bonding/bond_main.c
> --- ./drivers/net/bonding/bond_main.c.orig 2015-08-30 20:34:09.000000000
> +0200
> +++ ./drivers/net/bonding/bond_main.c 2015-09-04 11:59:05.755897182 +0200
> @@ -2795,6 +2795,17 @@ static bool bond_ab_arp_probe(struct bon
> return should_notify_rtnl;
> }
>
> + /* sometimes the forwarding tables of the switches are not updated
> + * fast enough. the first arp response after a slave change is received
> + * on the wrong slave.
> + * the arp requests will be retried 2 times on the same slave
> + */
> +
> + if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
> + bond_arp_send_all(bond, curr_arp_slave);
> + return should_notify_rtnl;
> + }
> +
> bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>
> bond_for_each_slave_rcu(bond, slave, iter) {

Jay, any further issues with this patch? I know Veaceslav was concerned 
about it breaking the logic for link state management if there's no 
current active slave for 2 * arp_interval, while Andy seemed okay with 
it, provided there was a comment explaining. Just looking at what might 
have to be done next here to keep heading towards a resolution.

Thanks much,

-- 
Jarod Wilson
jarod@redhat.com


* [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-09-28 13:31                   ` Jarod Wilson
@ 2015-10-06 19:53                     ` Jarod Wilson
  2015-10-06 19:58                       ` Jarod Wilson
  2015-10-07 12:03                       ` Nikolay Aleksandrov
  0 siblings, 2 replies; 22+ messages in thread
From: Jarod Wilson @ 2015-10-06 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Uwe Koziolek, Jay Vosburgh, Andy Gospodarek, Veaceslav Falico,
	netdev, Jarod Wilson

From: Uwe Koziolek <uwe.koziolek@redknee.com>

With some very finicky switch hardware, active backup bonding can get into
a situation where we play ping-pong between interfaces, trying to get one
to come up as the active slave. There seems to be an issue with the
switch's arp replies either taking too long, or simply getting lost, so we
wind up unable to get any interface up and active. Sometimes, the issue
sorts itself out after a while, sometimes it doesn't.

Testing with num_grat_arp has proven fruitless, but sending an additional
arp on curr_arp_slave if we're still in the arp_interval timeslice in
bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
this hardware combination.

[jarod: manufacturing of changelog, addition of modparam gating]
CC: Jay Vosburgh <jay.vosburgh@canonical.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: netdev@vger.kernel.org
Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
---
v2: add code comment as to why change is needed
v3: fix wrapping of comments
v4: [jarod] add module parameter gating of code addition

 drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
 include/net/bonding.h           |  1 +
 2 files changed, 25 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 90f2615..72ab512 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -95,6 +95,7 @@ static int miimon;
 static int updelay;
 static int downdelay;
 static int use_carrier	= 1;
+static int arp_slow_switch;
 static char *mode;
 static char *primary;
 static char *primary_reselect;
@@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
 module_param(use_carrier, int, 0);
 MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
 			      "0 for off, 1 for on (default)");
+module_param(arp_slow_switch, int, 0);
+MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
+				  "caches that are slow to update; "
+				  "0 for off (default), 1 for on");
 module_param(mode, charp, 0);
 MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
 		       "1 for active-backup, 2 for balance-xor, "
@@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
 			return should_notify_rtnl;
 	}
 
+	/* Sometimes the forwarding tables of the switches are not update
+	 * fast enough, so the first arp response after a slave change is
+	 * received on the wrong slave.
+	 *
+	 * The arp requests will be retried 2 times on the same slave.
+	 */
+	if (arp_slow_switch &&
+	    bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+		bond_arp_send_all(bond, curr_arp_slave);
+		return should_notify_rtnl;
+	}
+
 	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
 
 	bond_for_each_slave_rcu(bond, slave, iter) {
@@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
 		use_carrier = 1;
 	}
 
+	if ((arp_slow_switch != 0) && (arp_slow_switch != 1)) {
+		pr_warn("Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 1\n",
+			arp_slow_switch);
+		arp_slow_switch = 1;
+	}
+
 	if (num_peer_notif < 0 || num_peer_notif > 255) {
 		pr_warn("Warning: num_grat_arp/num_unsol_na (%d) not in range 0-255 so it was reset to 1\n",
 			num_peer_notif);
@@ -4516,6 +4539,7 @@ static int bond_check_params(struct bond_params *params)
 	params->updelay = updelay;
 	params->downdelay = downdelay;
 	params->use_carrier = use_carrier;
+	params->arp_slow_switch = arp_slow_switch;
 	params->lacp_fast = lacp_fast;
 	params->primary[0] = 0;
 	params->primary_reselect = primary_reselect_value;
diff --git a/include/net/bonding.h b/include/net/bonding.h
index c1740a2..208d31c 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -120,6 +120,7 @@ struct bond_params {
 	int arp_validate;
 	int arp_all_targets;
 	int use_carrier;
+	int arp_slow_switch;
 	int fail_over_mac;
 	int updelay;
 	int downdelay;
-- 
1.8.3.1


* Re: [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-10-06 19:53                     ` [PATCH v4] " Jarod Wilson
@ 2015-10-06 19:58                       ` Jarod Wilson
  2015-10-07 12:03                       ` Nikolay Aleksandrov
  1 sibling, 0 replies; 22+ messages in thread
From: Jarod Wilson @ 2015-10-06 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Uwe Koziolek, Jay Vosburgh, Andy Gospodarek, Veaceslav Falico, netdev

Jarod Wilson wrote:
> From: Uwe Koziolek<uwe.koziolek@redknee.com>
>
> With some very finicky switch hardware, active backup bonding can get into
> a situation where we play ping-pong between interfaces, trying to get one
> to come up as the active slave. There seems to be an issue with the
> switch's arp replies either taking too long, or simply getting lost, so we
> wind up unable to get any interface up and active. Sometimes, the issue
> sorts itself out after a while, sometimes it doesn't.
>
> Testing with num_grat_arp has proven fruitless, but sending an additional
> arp on curr_arp_slave if we're still in the arp_interval timeslice in
> bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
> this hardware combination.
>
> [jarod: manufacturing of changelog, addition of modparam gating]
> CC: Jay Vosburgh<jay.vosburgh@canonical.com>
> CC: Andy Gospodarek<gospo@cumulusnetworks.com>
> CC: Veaceslav Falico<vfalico@gmail.com>
> CC: netdev@vger.kernel.org
> Signed-off-by: Uwe Koziolek<uwe.koziolek@redknee.com>
> Signed-off-by: Jarod Wilson<jarod@redhat.com>
> ---
> v2: add code comment as to why change is needed
> v3: fix wrapping of comments
> v4: [jarod] add module parameter gating of code addition
>
>   drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
>   include/net/bonding.h           |  1 +
>   2 files changed, 25 insertions(+)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 90f2615..72ab512 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -95,6 +95,7 @@ static int miimon;
>   static int updelay;
>   static int downdelay;
>   static int use_carrier	= 1;
> +static int arp_slow_switch;
>   static char *mode;
>   static char *primary;
>   static char *primary_reselect;
> @@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
>   module_param(use_carrier, int, 0);
>   MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
>   			      "0 for off, 1 for on (default)");
> +module_param(arp_slow_switch, int, 0);
> +MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
> +				  "caches that are slow to update; "
> +				  "0 for off (default), 1 for on");
>   module_param(mode, charp, 0);
>   MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>   		       "1 for active-backup, 2 for balance-xor, "
> @@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>   			return should_notify_rtnl;
>   	}
>
> +	/* Sometimes the forwarding tables of the switches are not update
> +	 * fast enough, so the first arp response after a slave change is
> +	 * received on the wrong slave.
> +	 *
> +	 * The arp requests will be retried 2 times on the same slave.
> +	 */
> +	if (arp_slow_switch &&

This here should actually be bond->params.arp_slow_switch, but I'd like 
to hear first if a module parameter gating this change is even a 
remotely acceptable idea. It'd keep the logic identical in the default 
case though, and still allow for people like Uwe that need it to deploy 
the work-around.

Though I'm slightly curious if this problem does NOT manifest by simply 
setting a larger arp_interval. Early on, I thought I'd heard that other 
intervals had been tried with the same results, but a comment in this 
thread suggested maybe only 500 had been tried.
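
As a rough illustration of why the check should read the per-bond value
rather than the module-wide variable, here is a user-space sketch. The
structures are hypothetical, trimmed-down stand-ins for struct bonding and
struct bond_params, and the window bounds are an assumption rather than the
kernel's exact arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (illustration only). */
struct bond_params_model {
	long arp_interval_ms;
	int arp_slow_switch;	/* 0 = old behaviour, 1 = retry on same slave */
};

struct bond_model {
	struct bond_params_model params;
	long curr_slave_last_link_up_ms;
};

/* Return true when the monitor should re-send ARPs on the current slave
 * instead of moving on to the next one. The decision uses the bond's own
 * params, so two bonds can be configured differently. */
bool retry_on_current_slave(const struct bond_model *bond, long now_ms)
{
	long interval, last;

	assert(bond != NULL);
	if (!bond->params.arp_slow_switch)
		return false;	/* gate closed: logic identical to before */

	interval = bond->params.arp_interval_ms;
	last = bond->curr_slave_last_link_up_ms;
	return now_ms >= last - interval &&
	       now_ms <= last + 2 * interval + interval / 2;
}
```

With the gate at 0 the function never asks for a retry, which is the "logic
identical in the default case" property; with the gate at 1 the retry
applies only to the bond whose params were changed.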

-- 
Jarod Wilson
jarod@redhat.com


* Re: [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-10-06 19:53                     ` [PATCH v4] " Jarod Wilson
  2015-10-06 19:58                       ` Jarod Wilson
@ 2015-10-07 12:03                       ` Nikolay Aleksandrov
  2015-10-07 13:29                         ` Jarod Wilson
  1 sibling, 1 reply; 22+ messages in thread
From: Nikolay Aleksandrov @ 2015-10-07 12:03 UTC (permalink / raw)
  To: Jarod Wilson, linux-kernel
  Cc: Uwe Koziolek, Jay Vosburgh, Andy Gospodarek, Veaceslav Falico, netdev

On 10/06/2015 09:53 PM, Jarod Wilson wrote:
> From: Uwe Koziolek <uwe.koziolek@redknee.com>
> 
> With some very finicky switch hardware, active backup bonding can get into
> a situation where we play ping-pong between interfaces, trying to get one
> to come up as the active slave. There seems to be an issue with the
> switch's arp replies either taking too long, or simply getting lost, so we
> wind up unable to get any interface up and active. Sometimes, the issue
> sorts itself out after a while, sometimes it doesn't.
> 
> Testing with num_grat_arp has proven fruitless, but sending an additional
> arp on curr_arp_slave if we're still in the arp_interval timeslice in
> bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
> this hardware combination.
> 
> [jarod: manufacturing of changelog, addition of modparam gating]
> CC: Jay Vosburgh <jay.vosburgh@canonical.com>
> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
> CC: Veaceslav Falico <vfalico@gmail.com>
> CC: netdev@vger.kernel.org
> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
> Signed-off-by: Jarod Wilson <jarod@redhat.com>
> ---
> v2: add code comment as to why change is needed
> v3: fix wrapping of comments
> v4: [jarod] add module parameter gating of code addition
> 
Hi all,
As Andy already stated I'm not a fan of such workarounds either but it's
necessary sometimes so if this is going to be actually considered then a
few things need to be fixed. Please make this a proper bonding option
which can be changed at runtime and not only via a module parameter.
Now, I saw that you've only tested with 500 ms; can't this be fixed by using
a different interval? This seems like a very specific problem to have a
whole new option for.
I really want to say fix the switch but I know that's not an option. :-)
A few minor nits below,

>  drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
>  include/net/bonding.h           |  1 +
>  2 files changed, 25 insertions(+)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 90f2615..72ab512 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -95,6 +95,7 @@ static int miimon;
>  static int updelay;
>  static int downdelay;
>  static int use_carrier	= 1;
> +static int arp_slow_switch;
>  static char *mode;
>  static char *primary;
>  static char *primary_reselect;
> @@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
>  module_param(use_carrier, int, 0);
>  MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
>  			      "0 for off, 1 for on (default)");
> +module_param(arp_slow_switch, int, 0);
> +MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
> +				  "caches that are slow to update; "
> +				  "0 for off (default), 1 for on");
>  module_param(mode, charp, 0);
>  MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>  		       "1 for active-backup, 2 for balance-xor, "
> @@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>  			return should_notify_rtnl;
>  	}
>  
> +	/* Sometimes the forwarding tables of the switches are not update
^ s/update/updated/

> +	 * fast enough, so the first arp response after a slave change is
> +	 * received on the wrong slave.
> +	 *
> +	 * The arp requests will be retried 2 times on the same slave.
> +	 */
> +	if (arp_slow_switch &&
> +	    bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
> +		bond_arp_send_all(bond, curr_arp_slave);
> +		return should_notify_rtnl;
> +	}
> +
>  	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>  
>  	bond_for_each_slave_rcu(bond, slave, iter) {
> @@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
>  		use_carrier = 1;
>  	}
>  
> +	if ((arp_slow_switch != 0) && (arp_slow_switch != 1)) {
^^ no need for the extra ()

> +		pr_warn("Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 1\n",
> +			arp_slow_switch);
> +		arp_slow_switch = 1;
^^ please default to old behaviour in this case (0)

> +	}
> +
>  	if (num_peer_notif < 0 || num_peer_notif > 255) {
>  		pr_warn("Warning: num_grat_arp/num_unsol_na (%d) not in range 0-255 so it was reset to 1\n",
>  			num_peer_notif);
> @@ -4516,6 +4539,7 @@ static int bond_check_params(struct bond_params *params)
>  	params->updelay = updelay;
>  	params->downdelay = downdelay;
>  	params->use_carrier = use_carrier;
> +	params->arp_slow_switch = arp_slow_switch;
>  	params->lacp_fast = lacp_fast;
>  	params->primary[0] = 0;
>  	params->primary_reselect = primary_reselect_value;
> diff --git a/include/net/bonding.h b/include/net/bonding.h
> index c1740a2..208d31c 100644
> --- a/include/net/bonding.h
> +++ b/include/net/bonding.h
> @@ -120,6 +120,7 @@ struct bond_params {
>  	int arp_validate;
>  	int arp_all_targets;
>  	int use_carrier;
> +	int arp_slow_switch;
>  	int fail_over_mac;
>  	int updelay;
>  	int downdelay;
> 


* Re: [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-10-07 12:03                       ` Nikolay Aleksandrov
@ 2015-10-07 13:29                         ` Jarod Wilson
  2015-10-09 14:36                           ` Jarod Wilson
  2015-10-30 18:59                           ` Uwe Koziolek
  0 siblings, 2 replies; 22+ messages in thread
From: Jarod Wilson @ 2015-10-07 13:29 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: linux-kernel, Uwe Koziolek, Jay Vosburgh, Andy Gospodarek,
	Veaceslav Falico, netdev

Nikolay Aleksandrov wrote:
> On 10/06/2015 09:53 PM, Jarod Wilson wrote:
>> From: Uwe Koziolek<uwe.koziolek@redknee.com>
>>
>> With some very finicky switch hardware, active backup bonding can get into
>> a situation where we play ping-pong between interfaces, trying to get one
>> to come up as the active slave. There seems to be an issue with the
>> switch's arp replies either taking too long, or simply getting lost, so we
>> wind up unable to get any interface up and active. Sometimes, the issue
>> sorts itself out after a while, sometimes it doesn't.
>>
>> Testing with num_grat_arp has proven fruitless, but sending an additional
>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
>> this hardware combination.
>>
>> [jarod: manufacturing of changelog, addition of modparam gating]
>> CC: Jay Vosburgh<jay.vosburgh@canonical.com>
>> CC: Andy Gospodarek<gospo@cumulusnetworks.com>
>> CC: Veaceslav Falico<vfalico@gmail.com>
>> CC: netdev@vger.kernel.org
>> Signed-off-by: Uwe Koziolek<uwe.koziolek@redknee.com>
>> Signed-off-by: Jarod Wilson<jarod@redhat.com>
>> ---
>> v2: add code comment as to why change is needed
>> v3: fix wrapping of comments
>> v4: [jarod] add module parameter gating of code addition
>>
> Hi all,
> As Andy already stated I'm not a fan of such workarounds either but it's
> necessary sometimes so if this is going to be actually considered then a
> few things need to be fixed. Please make this a proper bonding option
> which can be changed at runtime and not only via a module parameter.

Okay, I can give that a shot, however...

> Now, I saw that you've only tested with 500 ms, can't this be fixed by using
> a different interval ? This seems like a very specific problem to have a
> whole new option for.

...I'll wait until we've heard confirmation from Uwe that intervals 
other than 500ms don't fix things.

> I really want to say fix the switch but I know that's not an option. :-)

Yeah, unfortunately not!

> A few minor nits below,
>
>>   drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
>>   include/net/bonding.h           |  1 +
>>   2 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 90f2615..72ab512 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -95,6 +95,7 @@ static int miimon;
>>   static int updelay;
>>   static int downdelay;
>>   static int use_carrier	= 1;
>> +static int arp_slow_switch;
>>   static char *mode;
>>   static char *primary;
>>   static char *primary_reselect;
>> @@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
>>   module_param(use_carrier, int, 0);
>>   MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
>>   			      "0 for off, 1 for on (default)");
>> +module_param(arp_slow_switch, int, 0);
>> +MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
>> +				  "caches that are slow to update; "
>> +				  "0 for off (default), 1 for on");
>>   module_param(mode, charp, 0);
>>   MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>>   		       "1 for active-backup, 2 for balance-xor, "
>> @@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>>   			return should_notify_rtnl;
>>   	}
>>
>> +	/* Sometimes the forwarding tables of the switches are not update
> ^ s/update/updated/

D'oh. Fixed locally.

>> @@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
>>   		use_carrier = 1;
>>   	}
>>
>> +	if ((arp_slow_switch != 0) &&  (arp_slow_switch != 1)) {
> ^^ no need for the extra ()

Copy-pasta from use_carrier checks right above it. Never quite sure if I 
should stick with the same possibly sub-optimal formatting conventions 
already in the file, try to fix them while also fixing bugs, or just mix 
styles...


>> +		pr_warn("Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 1\n",
>> +			arp_slow_switch);
>> +		arp_slow_switch = 1;
> ^^ please default to old behaviour in this case (0)

Will do.
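
For reference, a minimal userspace sketch of the check with that change
applied. This is only an illustration: toy_check_arp_slow_switch() is a
hypothetical stand-in, not the actual bond_check_params() hunk, and
fprintf() stands in for pr_warn().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns the value to use for arp_slow_switch.
 * Anything other than 0/1 falls back to 0, i.e. the old behaviour.
 */
static int toy_check_arp_slow_switch(int value)
{
	int checked = value;

	if (value != 0 && value != 1) {
		fprintf(stderr,
			"Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 0\n",
			value);
		checked = 0;
	}

	/* Post-condition: callers only ever see 0 or 1. */
	assert(checked == 0 || checked == 1);
	return checked;
}
```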



-- 
Jarod Wilson
jarod@redhat.com


* Re: [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-10-07 13:29                         ` Jarod Wilson
@ 2015-10-09 14:36                           ` Jarod Wilson
  2015-10-09 15:25                             ` Nikolay Aleksandrov
  2015-10-09 15:31                             ` Jay Vosburgh
  2015-10-30 18:59                           ` Uwe Koziolek
  1 sibling, 2 replies; 22+ messages in thread
From: Jarod Wilson @ 2015-10-09 14:36 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: linux-kernel, Uwe Koziolek, Jay Vosburgh, Andy Gospodarek,
	Veaceslav Falico, netdev

Jarod Wilson wrote:
...
> As Andy already stated I'm not a fan of such workarounds either but it's
> necessary sometimes so if this is going to be actually considered then a
> few things need to be fixed. Please make this a proper bonding option
> which can be changed at runtime and not only via a module parameter.

Is there any particular userspace tool that would need some updating, or 
is adding the sysfs knobs sufficient here? I think I've got all the 
sysfs stuff thrown together now, but still need to test.


>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>> using
>> a different interval ? This seems like a very specific problem to have a
>> whole new option for.
>
> ...I'll wait until we've heard confirmation from Uwe that intervals
> other than 500ms don't fix things.

Okay, so I believe the "only tested with 500ms" was in reference to 
testing with Uwe's initial patch. I do have supporting evidence in a 
bugzilla report that shows upwards of 5000ms still experience the 
problem here.



-- 
Jarod Wilson
jarod@redhat.com


* Re: [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-10-09 14:36                           ` Jarod Wilson
@ 2015-10-09 15:25                             ` Nikolay Aleksandrov
  2015-10-09 15:31                             ` Jay Vosburgh
  1 sibling, 0 replies; 22+ messages in thread
From: Nikolay Aleksandrov @ 2015-10-09 15:25 UTC (permalink / raw)
  To: Jarod Wilson
  Cc: linux-kernel, Uwe Koziolek, Jay Vosburgh, Andy Gospodarek,
	Veaceslav Falico, netdev

On 10/09/2015 04:36 PM, Jarod Wilson wrote:
> Jarod Wilson wrote:
> ...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
> 
> Is there any particular userspace tool that would need some updating, or
> is adding the sysfs knobs sufficient here? I think I've got all the
> sysfs stuff thrown together now, but still need to test.
> 
I'd say adding netlink support at this point is more important, and it'd be nice
if you could add support to iproute2 for the new attribute. Currently all bonding
options have both netlink and sysfs support, so you can follow that pattern; the
others can correct me if I'm wrong here.

One more thing: please don't forget to update Documentation/networking/bonding.txt.

> 
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
> 
> Okay, so I believe the "only tested with 500ms" was in reference to
> testing with Uwe's initial patch. I do have supporting evidence in a
> bugzilla report that shows upwards of 5000ms still experience the
> problem here.

_5 seconds_ is not enough to receive a reply, but sending it twice
within a second fixes the issue?!
This sounds like the ARP request is not properly handled/received
and there's no reply.

Cheers,
 Nik


* Re: [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-10-09 14:36                           ` Jarod Wilson
  2015-10-09 15:25                             ` Nikolay Aleksandrov
@ 2015-10-09 15:31                             ` Jay Vosburgh
  2015-10-12 15:33                               ` Jarod Wilson
  1 sibling, 1 reply; 22+ messages in thread
From: Jay Vosburgh @ 2015-10-09 15:31 UTC (permalink / raw)
  To: Jarod Wilson
  Cc: Nikolay Aleksandrov, linux-kernel, Uwe Koziolek, Andy Gospodarek,
	Veaceslav Falico, netdev

Jarod Wilson <jarod@redhat.com> wrote:

>Jarod Wilson wrote:
>...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
>
>Is there any particular userspace tool that would need some updating, or
>is adding the sysfs knobs sufficient here? I think I've got all the sysfs
>stuff thrown together now, but still need to test.

	Most (all?) bonding options should be configurable via iproute
(netlink) now.

>
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
>
>Okay, so I believe the "only tested with 500ms" was in reference to
>testing with Uwe's initial patch. I do have supporting evidence in a
>bugzilla report that shows upwards of 5000ms still experience the problem
>here.

	I did set up some switches and attempt to reproduce this
yesterday; I daisy-chained three switches (two Cisco and an HP) together
and connected the bonded interfaces to the "end" switches.  I tried
various ARP targets (the switch, hosts on various points of the switch)
and varying arp_intervals and was unable to reproduce the problem.

	As I understand it, the working theory is something like this:

	- host with two bonded interfaces, A and B.  For active-backup
mode, the interfaces have been assigned the same MAC address.

	- switch has MAC for B in its forwarding table

	- bonding goes from down to up, and thinks all its slaves are
down, and starts the "curr_arp_slave" search for an active
arp_ip_target.  In this case, it starts with A, and sends an ARP from A.

	As an aside, I'm not 100% clear on what exactly is going on in
the "bonding goes from down to up" transition; this seems to be key in
reproducing the issue.

	- switch sees source mac coming from port A, starts to update
its forwarding table

	- meanwhile, switch forwards ARP request, and receives ARP
reply, which it forwards to port B.  Bonding drops this, as the slave is
inactive.

	- switch finishes updating forwarding table, MAC is now assigned
to port A.

	- bonding now tries sending on port B, and the cycle repeats.

	If this is what's taking place, then the arp_interval itself is
irrelevant, the race is between the switch table update and the
generation of the ARP reply.
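
	As a rough illustration of the sequence above, here is a toy
userspace model. It is a sketch only, not bonding or switch code: every
name in it (toy_switch, probe_round, rounds_until_reply, the
reply_beats_update flag) is made up for the example, and the switch is
reduced to a single forwarding entry for the bond's shared MAC. There is
deliberately no timer in the model; the outcome depends only on whether
the ARP reply is forwarded before or after the table update lands.

```c
#include <assert.h>
#include <stdbool.h>

enum port { PORT_A = 0, PORT_B = 1 };

/* Toy switch: one forwarding-table entry for the bond's shared MAC. */
struct toy_switch {
	enum port fdb_port;	/* port the MAC is currently learned on */
	enum port pending_port;	/* port a table update is in flight for */
	bool update_pending;
};

/* Switch sees the bond MAC as source on @in; the update starts but
 * has not landed in the forwarding table yet.
 */
static void switch_rx_from_host(struct toy_switch *sw, enum port in)
{
	sw->pending_port = in;
	sw->update_pending = true;
}

/* The in-flight table update completes. */
static void switch_finish_update(struct toy_switch *sw)
{
	if (sw->update_pending) {
		sw->fdb_port = sw->pending_port;
		sw->update_pending = false;
	}
}

/* One probe on @curr_arp_slave. Returns true if the ARP reply comes
 * back on the port that sent the request.
 */
static bool probe_round(struct toy_switch *sw, enum port curr_arp_slave,
			bool reply_beats_update)
{
	enum port reply_port;

	switch_rx_from_host(sw, curr_arp_slave);
	if (!reply_beats_update)
		switch_finish_update(sw);
	reply_port = sw->fdb_port;	/* reply follows the current table */
	switch_finish_update(sw);	/* update has landed by next round */

	return reply_port == curr_arp_slave;
}

/* Start with the MAC learned on B and probe A, B, A, ... in turn.
 * Returns the 1-based round in which a reply reaches the probing
 * slave, or -1 if none does within @max_rounds.
 */
static int rounds_until_reply(bool reply_beats_update, int max_rounds)
{
	struct toy_switch sw = { .fdb_port = PORT_B };
	enum port slave = PORT_A;

	assert(max_rounds > 0);

	for (int i = 1; i <= max_rounds; i++) {
		if (probe_round(&sw, slave, reply_beats_update))
			return i;
		slave = (slave == PORT_A) ? PORT_B : PORT_A;
	}
	return -1;
}
```

	In this model rounds_until_reply(true, n) never succeeds for any
n, because the reply always lands on the port that was probed in the
previous round, while rounds_until_reply(false, n) succeeds on the
first round.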

	Also, presuming the above is what's going on, we could modify
the ARP "curr_arp_slave" logic a bit to resolve this without requiring
any magic knobs.

	For example, we could change the "drop on inactive" logic to
recognise the "curr_arp_slave" search and accept the unicast ARP reply,
and perhaps make that receiving slave the next curr_arp_slave
automatically.
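
	A minimal sketch of that idea, again as a standalone toy and not
a patch: toy_bond, toy_slave and toy_arp_reply_rx() are hypothetical
names, and the real receive path does considerably more validation than
this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_slave {
	int id;
};

struct toy_bond {
	struct toy_slave *curr_active_slave;	/* NULL while no slave is up */
	struct toy_slave *curr_arp_slave;	/* NULL unless a search is running */
};

/* Decide whether an ARP reply received on @rx should be accepted.
 * Returns true if accepted. During a curr_arp_slave search the
 * receiving slave also becomes the next curr_arp_slave.
 */
static bool toy_arp_reply_rx(struct toy_bond *bond, struct toy_slave *rx)
{
	assert(bond && rx);

	/* Normal case: replies on the active slave are always accepted. */
	if (rx == bond->curr_active_slave)
		return true;

	/* Proposed case: no active slave and a search in progress, so
	 * accept the reply on whichever slave the switch delivered it to.
	 */
	if (!bond->curr_active_slave && bond->curr_arp_slave) {
		bond->curr_arp_slave = rx;
		return true;
	}

	/* Otherwise keep the existing "drop on inactive" behaviour. */
	return false;
}
```

	With something along these lines, a reply that the switch
delivers to the "wrong" port during the search would no longer be
thrown away, while an inactive slave would still drop replies whenever
an active slave exists.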

	I also wonder if the fail_over_mac option would affect this
behavior, as it would cause the slaves to keep their MAC address for the
duration, so the switch would not see the MAC move from port to port.

	Another thought would be to have the curr_arp_slave cycle
through the slaves in random order, but that could create
non-deterministic results even when things are working correctly.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-10-09 15:31                             ` Jay Vosburgh
@ 2015-10-12 15:33                               ` Jarod Wilson
  0 siblings, 0 replies; 22+ messages in thread
From: Jarod Wilson @ 2015-10-12 15:33 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Nikolay Aleksandrov, linux-kernel, Uwe Koziolek, Andy Gospodarek,
	Veaceslav Falico, netdev

Jay Vosburgh wrote:
> Jarod Wilson <jarod@redhat.com> wrote:
>
>> Jarod Wilson wrote:
>> ...
>>> As Andy already stated I'm not a fan of such workarounds either but it's
>>> necessary sometimes so if this is going to be actually considered then a
>>> few things need to be fixed. Please make this a proper bonding option
>>> which can be changed at runtime and not only via a module parameter.
>> Is there any particular userspace tool that would need some updating, or
>> is adding the sysfs knobs sufficient here? I think I've got all the sysfs
>> stuff thrown together now, but still need to test.
>
> 	Most (all?) bonding options should be configurable via iproute
> (netlink) now.

D'oh, of course. I've done the kernel-side netlink bits now too, and 
started looking at the iproute source. However...


>>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>>> using
>>>> a different interval ? This seems like a very specific problem to have a
>>>> whole new option for.
>>> ...I'll wait until we've heard confirmation from Uwe that intervals
>>> other than 500ms don't fix things.
>> Okay, so I believe the "only tested with 500ms" was in reference to
>> testing with Uwe's initial patch. I do have supporting evidence in a
>> bugzilla report that shows upwards of 5000ms still experience the problem
>> here.
>
> 	I did set up some switches and attempt to reproduce this
> yesterday; I daisy-chained three switches (two Cisco and an HP) together
> and connected the bonded interfaces to the "end" switches.  I tried
> various ARP targets (the switch, hosts on various points of the switch)
> and varying arp_intervals and was unable to reproduce the problem.
>
> 	As I understand it, the working theory is something like this:
>
> 	- host with two bonded interfaces, A and B.  For active-backup
> mode, the interfaces have been assigned the same MAC address.
>
> 	- switch has MAC for B in its forwarding table
>
> 	- bonding goes from down to up, and thinks all its slaves are
> down, and starts the "curr_arp_slave" search for an active
> arp_ip_target.  In this case, it starts with A, and sends an ARP from A.
>
> 	As an aside, I'm not 100% clear on what exactly is going on in
> the "bonding goes from down to up" transition; this seems to be key in
> reproducing the issue.
>
> 	- switch sees source mac coming from port A, starts to update
> its forwarding table
>
> 	- meanwhile, switch forwards ARP request, and receives ARP
> reply, which it forwards to port B.  Bonding drops this, as the slave is
> inactive.
>
> 	- switch finishes updating forwarding table, MAC is now assigned
> to port A.
>
> 	- bonding now tries sending on port B, and the cycle repeats.
>
> 	If this is what's taking place, then the arp_interval itself is
> irrelevant, the race is between the switch table update and the
> generation of the ARP reply.
>
> 	Also, presuming the above is what's going on, we could modify
> the ARP "curr_arp_slave" logic a bit to resolve this without requiring
> any magic knobs.

I really like this idea. Still trying to grasp exactly how we get into 
this situation and what everything looks like as we hop through the 
various bond_ab_arp_* functions though.

> 	For example, we could change the "drop on inactive" logic to
> recognise the "curr_arp_slave" search and accept the unicast ARP reply,
> and perhaps make that receiving slave the next curr_arp_slave
> automatically.

Nothing ever actually getting picked as curr_arp_slave does appear to be 
the problem, so that does sound like it could do the trick.

> 	I also wonder if the fail_over_mac option would affect this
> behavior, as it would cause the slaves to keep their MAC address for the
> duration, so the switch would not see the MAC move from port to port.

Not sure if that's an option for the particular environment, but we 
could certainly ask Uwe to give it a try.

> 	Another thought would be to have the curr_arp_slave cycle
> through the slaves in random order, but that could create
> non-deterministic results even when things are working correctly.

I'd say avoid this route if at all possible, would rather not make 
things less predictable.

-- 
Jarod Wilson
jarod@redhat.com


* Re: [PATCH v4] net/bonding: send arp in interval if no active slave
  2015-10-07 13:29                         ` Jarod Wilson
  2015-10-09 14:36                           ` Jarod Wilson
@ 2015-10-30 18:59                           ` Uwe Koziolek
  1 sibling, 0 replies; 22+ messages in thread
From: Uwe Koziolek @ 2015-10-30 18:59 UTC (permalink / raw)
  To: Jarod Wilson, Nikolay Aleksandrov
  Cc: linux-kernel, Jay Vosburgh, Andy Gospodarek, Veaceslav Falico, netdev

> Nikolay Aleksandrov wrote:
>> On 10/06/2015 09:53 PM, Jarod Wilson wrote:
>>> From: Uwe Koziolek<uwe.koziolek@redknee.com>
>>>
>>> With some very finicky switch hardware, active backup bonding can get into
>>> a situation where we play ping-pong between interfaces, trying to get one
>>> to come up as the active slave. There seems to be an issue with the
>>> switch's arp replies either taking too long, or simply getting lost, so we
>>> wind up unable to get any interface up and active. Sometimes, the issue
>>> sorts itself out after a while, sometimes it doesn't.
>>>
>>> Testing with num_grat_arp has proven fruitless, but sending an additional
>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
>>> this hardware combination.
>>>
>>> [jarod: manufacturing of changelog, addition of modparam gating]
>>> CC: Jay Vosburgh<jay.vosburgh@canonical.com>
>>> CC: Andy Gospodarek<gospo@cumulusnetworks.com>
>>> CC: Veaceslav Falico<vfalico@gmail.com>
>>> CC: netdev@vger.kernel.org
>>> Signed-off-by: Uwe Koziolek<uwe.koziolek@redknee.com>
>>> Signed-off-by: Jarod Wilson<jarod@redhat.com>
>>> ---
>>> v2: add code comment as to why change is needed
>>> v3: fix wrapping of comments
>>> v4: [jarod] add module parameter gating of code addition
>>>
>> Hi all,
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
>
> Okay, I can give that a shot, however...
>
>> Now, I saw that you've only tested with 500 ms, can't this be fixed by using
>> a different interval ? This seems like a very specific problem to have a
>> whole new option for.
>
> ...I'll wait until we've heard confirmation from Uwe that intervals
> other than 500ms don't fix things.
>
A test with 5000 ms doesn't fix the problem. Tested with Cisco C3750, 4 bonds.

>> I really want to say fix the switch but I know that's not an option. :-)
>
> Yeah, unfortunately not!
>
>> A few minor nits below,
>>
>>>   drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
>>>   include/net/bonding.h           |  1 +
>>>   2 files changed, 25 insertions(+)
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>> index 90f2615..72ab512 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -95,6 +95,7 @@ static int miimon;
>>>   static int updelay;
>>>   static int downdelay;
>>>   static int use_carrier    = 1;
>>> +static int arp_slow_switch;
>>>   static char *mode;
>>>   static char *primary;
>>>   static char *primary_reselect;
>>> @@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
>>>   module_param(use_carrier, int, 0);
>>>   MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
>>>                     "0 for off, 1 for on (default)");
>>> +module_param(arp_slow_switch, int, 0);
>>> +MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
>>> +                  "caches that are slow to update; "
>>> +                  "0 for off (default), 1 for on");
>>>   module_param(mode, charp, 0);
>>>   MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>>>                  "1 for active-backup, 2 for balance-xor, "
>>> @@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>>>               return should_notify_rtnl;
>>>       }
>>>
>>> +    /* Sometimes the forwarding tables of the switches are not update
>> ^ s/update/updated/
>
> D'oh. Fixed locally.
>
>>> @@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
>>>           use_carrier = 1;
>>>       }
>>>
>>> +    if ((arp_slow_switch != 0) &&  (arp_slow_switch != 1)) {
>> ^^ no need for the extra ()
>
> Copy-pasta from use_carrier checks right above it. Never quite sure if I should stick with the same possibly sub-optimal
> formatting conventions already in the file, try to fix them while also fixing bugs, or just mix styles...
>
>
>>> +        pr_warn("Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 1\n",
>>> +            arp_slow_switch);
>>> +        arp_slow_switch = 1;
>> ^^ please default to old behaviour in this case (0)
>
> Will do.
>
Uwe Koziolek


end of thread

Thread overview: 22+ messages
2015-08-17 16:23 [PATCH] net/bonding: send arp in interval if no active slave Jarod Wilson
2015-08-17 16:55 ` Veaceslav Falico
2015-08-17 17:12   ` Jarod Wilson
2015-08-17 18:56     ` Uwe Koziolek
2015-08-17 19:14       ` Jay Vosburgh
2015-08-17 20:51         ` Uwe Koziolek
2015-08-31 22:21           ` Jarod Wilson
2015-09-01 23:15             ` Uwe Koziolek
2015-09-01 15:41           ` Andy Gospodarek
2015-09-01 23:10             ` Uwe Koziolek
2015-09-03 15:05               ` Jay Vosburgh
2015-09-04 11:04                 ` Uwe Koziolek
2015-09-28 13:31                   ` Jarod Wilson
2015-10-06 19:53                     ` [PATCH v4] " Jarod Wilson
2015-10-06 19:58                       ` Jarod Wilson
2015-10-07 12:03                       ` Nikolay Aleksandrov
2015-10-07 13:29                         ` Jarod Wilson
2015-10-09 14:36                           ` Jarod Wilson
2015-10-09 15:25                             ` Nikolay Aleksandrov
2015-10-09 15:31                             ` Jay Vosburgh
2015-10-12 15:33                               ` Jarod Wilson
2015-10-30 18:59                           ` Uwe Koziolek
