netdev.vger.kernel.org archive mirror
* [PATCH] can current ECMP implementation support consistent hashing for next hop?
@ 2020-06-11 14:56 Yi Yang (杨燚)-云服务集团
  2020-06-11 18:27 ` David Ahern
  0 siblings, 1 reply; 10+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-06-11 14:56 UTC (permalink / raw)
  To: netdev
  Cc: nikolay, dsahern,
	Yi Yang
	(杨燚)-云服务集团

[-- Attachment #1: Type: text/plain, Size: 924 bytes --]

Hi, folks

We need to use Linux ECMP to build an active-active load balancer. Consistent hashing is necessary because load-balancer nodes may be added or removed dynamically, so the number of hash buckets can change, yet each flow must keep being distributed to the load-balancer node that is already handling it and holds its session state. I'm not sure whether current Linux implements the algorithm in https://tools.ietf.org/html/rfc2992; can anybody confirm yes or no?

I checked the source code at https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/fib_semantics.c#n2176: every next hop in the FIB has an upper_bound, and fib_select_multipath simply compares the hash value against each next hop's upper_bound to decide whether that next hop is selected, so I don't think current Linux implements consistent hashing. Please correct me if I'm wrong.
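
For reference, a rough bash sketch (not kernel code; the hash values and the equal weights are made up) of what that upper_bound selection amounts to, and why a change in the path set remaps existing flows:

```
#!/bin/bash
# Hash-threshold selection as described above: with 4 equal-weight next
# hops, each one owns a contiguous slice of the 31-bit hash space ending
# at its upper_bound, and the first next hop whose upper_bound covers
# the flow hash is chosen.
MAX=$((0x7FFFFFFF))
UPPER_BOUNDS=($((MAX / 4)) $((MAX / 2)) $((MAX * 3 / 4)) $MAX)

select_path() {
	local hash=$1 i
	for i in "${!UPPER_BOUNDS[@]}"; do
		if (( hash <= UPPER_BOUNDS[i] )); then
			echo "flow hash $hash -> next hop $i"
			return
		fi
	done
}

select_path 123456789    # -> next hop 0
select_path 1987654321   # -> next hop 3

# Adding or removing a next hop recomputes every upper_bound, so the
# slice boundaries move and many existing flows land on a different
# next hop; that is the part consistent hashing would avoid.
```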

Thank you all so much in advance; I sincerely appreciate your help.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3600 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-06-11 14:56 [PATCH] can current ECMP implementation support consistent hashing for next hop? Yi Yang (杨燚)-云服务集团
@ 2020-06-11 18:27 ` David Ahern
  2020-06-12  0:32   ` 答复: " Yi Yang (杨燚)-云服务集团
  0 siblings, 1 reply; 10+ messages in thread
From: David Ahern @ 2020-06-11 18:27 UTC (permalink / raw)
  To: Yi Yang
	(杨燚)-云服务集团,
	netdev
  Cc: nikolay

On 6/11/20 8:56 AM, Yi Yang (杨燚)-云服务集团 wrote:
> Hi, folks
> 
> We need to use Linux ECMP to build an active-active load balancer. Consistent hashing is necessary because load-balancer nodes may be added or removed dynamically, so the number of hash buckets can change, yet each flow must keep being distributed to the load-balancer node that is already handling it and holds its session state. I'm not sure whether current Linux implements the algorithm in https://tools.ietf.org/html/rfc2992; can anybody confirm yes or no?
> 
> I checked the source code at https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/fib_semantics.c#n2176: every next hop in the FIB has an upper_bound, and fib_select_multipath simply compares the hash value against each next hop's upper_bound to decide whether that next hop is selected, so I don't think current Linux implements consistent hashing. Please correct me if I'm wrong.
> 
> Thank you all so much in advance; I sincerely appreciate your help.
> 

The kernel does not do resilient hashing, but I believe you can do it
from userspace by updating route entries - replacing nexthop entries as
LB's come and go.

Cumulus docs have a good description:
https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hashing
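
Roughly along these lines (made-up prefix and addresses; this assumes the kernel accepts the same gateway in more than one slot of a multipath route, which is what keeps the surviving slots' hash regions in place):

```
# Four slots, one per LB.
ip route replace 203.0.113.0/24 \
	nexthop via 192.0.2.1 dev eth0 \
	nexthop via 192.0.2.2 dev eth0 \
	nexthop via 192.0.2.3 dev eth0 \
	nexthop via 192.0.2.4 dev eth0

# LB 192.0.2.2 goes away: re-point only its slot at a surviving LB, so
# the other slots (and the flows hashed onto them) keep their positions.
ip route replace 203.0.113.0/24 \
	nexthop via 192.0.2.1 dev eth0 \
	nexthop via 192.0.2.1 dev eth0 \
	nexthop via 192.0.2.3 dev eth0 \
	nexthop via 192.0.2.4 dev eth0
```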

^ permalink raw reply	[flat|nested] 10+ messages in thread

* 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-06-11 18:27 ` David Ahern
@ 2020-06-12  0:32   ` Yi Yang (杨燚)-云服务集团
  2020-06-12  4:36     ` David Ahern
  0 siblings, 1 reply; 10+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-06-12  0:32 UTC (permalink / raw)
  To: dsahern, netdev
  Cc: nikolay,
	Yi Yang
	(杨燚)-云服务集团

[-- Attachment #1: Type: text/plain, Size: 2484 bytes --]

David, thank you so much for confirming it can't. I did read your Cumulus document before; resilient hashing is fine for next-hop removal, but it still has the same issue when a new next hop is added. I know most of the kernel code in Cumulus Linux is already in the upstream kernel, so I'm wondering why you didn't push resilient hashing upstream.

I think consistent hashing is a must-have for a commercial load-balancing solution; without it the solution is basically unusable. Does Cumulus Linux have a consistent hashing solution?

Is "- replacing nexthop entries as LB's come and go" the approach shown in https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hashing? It can't ensure a flow is distributed to the right backend server if a new next hop is added.

-----Original Message-----
From: David Ahern [mailto:dsahern@gmail.com]
Sent: June 12, 2020 2:27
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>; netdev@vger.kernel.org
Cc: nikolay@cumulusnetworks.com
Subject: Re: [PATCH] can current ECMP implementation support consistent hashing for next hop?

On 6/11/20 8:56 AM, Yi Yang (杨燚)-云服务集团 wrote:
> Hi, folks
> 
> We need to use Linux ECMP to build an active-active load balancer. Consistent hashing is necessary because load-balancer nodes may be added or removed dynamically, so the number of hash buckets can change, yet each flow must keep being distributed to the load-balancer node that is already handling it and holds its session state. I'm not sure whether current Linux implements the algorithm in https://tools.ietf.org/html/rfc2992; can anybody confirm yes or no?
> 
> I checked the source code at https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/fib_semantics.c#n2176: every next hop in the FIB has an upper_bound, and fib_select_multipath simply compares the hash value against each next hop's upper_bound to decide whether that next hop is selected, so I don't think current Linux implements consistent hashing. Please correct me if I'm wrong.
> 
> Thank you all so much in advance; I sincerely appreciate your help.
> 

The kernel does not do resilient hashing, but I believe you can do it from userspace by updating route entries - replacing nexthop entries as LB's come and go.

Cumulus docs have a good description:
https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hashing

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3600 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-06-12  0:32   ` 答复: " Yi Yang (杨燚)-云服务集团
@ 2020-06-12  4:36     ` David Ahern
  2020-06-15  6:56       ` 答复: [vger.kernel.org代发]Re: " Yi Yang (杨燚)-云服务集团
  2020-08-02 14:49       ` Ido Schimmel
  0 siblings, 2 replies; 10+ messages in thread
From: David Ahern @ 2020-06-12  4:36 UTC (permalink / raw)
  To: Yi Yang
	(杨燚)-云服务集团,
	netdev
  Cc: nikolay

On 6/11/20 6:32 PM, Yi Yang (杨燚)-云服务集团 wrote:
> David, thank you so much for confirming it can't. I did read your Cumulus document before; resilient hashing is fine for next-hop removal, but it still has the same issue when a new next hop is added. I know most of the kernel code in Cumulus Linux is already in the upstream kernel, so I'm wondering why you didn't push resilient hashing upstream.
> 
> I think consistent hashing is a must-have for a commercial load-balancing solution; without it the solution is basically unusable. Does Cumulus Linux have a consistent hashing solution?
> 
> Is "- replacing nexthop entries as LB's come and go" the approach shown in https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hashing? It can't ensure a flow is distributed to the right backend server if a new next hop is added.

I do not believe it is a problem to be solved in the kernel.

If you follow the *intent* of the Cumulus document: what is the maximum
number of load balancers you expect to have? 16? 32? 64? Define an ECMP
route with that number of nexthops and fill in the weighting that meets
your needs. When an LB is added or removed, you decide what the new set
of paths is that maintains N-total paths with the distribution that
meets your needs.
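
As a rough sketch of that userspace policy (made-up prefix, addresses and device; it assumes, as above, that repeated gateways are accepted within one multipath route):

```
#!/bin/bash
# Keep a fixed number of slots; when an LB dies, re-point only the slots
# that belonged to it, spreading them across the survivors, so the
# remaining slots keep their positions in the route.
PREFIX=203.0.113.0/24
SLOTS=(192.0.2.1 192.0.2.2 192.0.2.3 192.0.2.4
       192.0.2.1 192.0.2.2 192.0.2.3 192.0.2.4)   # current slot map
DEAD=192.0.2.2
ALIVE=(192.0.2.1 192.0.2.3 192.0.2.4)

j=0
for i in "${!SLOTS[@]}"; do
	if [[ ${SLOTS[i]} == "$DEAD" ]]; then
		n=$((j % ${#ALIVE[@]}))
		SLOTS[i]=${ALIVE[n]}
		j=$((j + 1))
	fi
done

args=()
for gw in "${SLOTS[@]}"; do
	args+=(nexthop via "$gw" dev eth0)
done
ip route replace "$PREFIX" "${args[@]}"
```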

I just sent patches for active-backup nexthops that allow an automatic
fallback when one is removed to address the redistribution problem, but
it still requires userspace to decide what the active-backup pairs are
as well as the maximum number of paths.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* 答复: [vger.kernel.org代发]Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-06-12  4:36     ` David Ahern
@ 2020-06-15  6:56       ` Yi Yang (杨燚)-云服务集团
  2020-06-15 22:42         ` David Ahern
  2020-08-02 14:49       ` Ido Schimmel
  1 sibling, 1 reply; 10+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-06-15  6:56 UTC (permalink / raw)
  To: dsahern, netdev; +Cc: nikolay

[-- Attachment #1: Type: text/plain, Size: 2866 bytes --]

Hi David

My next hops are the final real servers, not load balancers. Say we set the maximum number of servers to 64; next-hop entries are still added or removed dynamically, so we are unlikely to know them beforehand. I don't understand how user space can achieve consistent distribution without in-kernel consistent hashing. Can you show how ip route commands can achieve this?

I found that the routing cache may help with this issue: once a flow has been routed to a real server, its route is cached, so subsequent packets of that flow should hit the routing cache via fib_lookup and always be routed to the right server. As far as the result is concerned, that would be equivalent to consistent hashing.

Can you confirm whether my understanding is right? If not, it will be a big problem. It would be great if you could show how user space can achieve this without kernel help.

-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] on behalf of David Ahern
Sent: June 12, 2020 12:37
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>; netdev@vger.kernel.org
Cc: nikolay@cumulusnetworks.com
Subject: [vger.kernel.org代发]Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?

On 6/11/20 6:32 PM, Yi Yang (杨燚)-云服务集团 wrote:
> David, thank you so much for confirming it can't. I did read your Cumulus document before; resilient hashing is fine for next-hop removal, but it still has the same issue when a new next hop is added. I know most of the kernel code in Cumulus Linux is already in the upstream kernel, so I'm wondering why you didn't push resilient hashing upstream.
> 
> I think consistent hashing is a must-have for a commercial load-balancing solution; without it the solution is basically unusable. Does Cumulus Linux have a consistent hashing solution?
> 
> Is "- replacing nexthop entries as LB's come and go" the approach shown in https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hashing? It can't ensure a flow is distributed to the right backend server if a new next hop is added.

I do not believe it is a problem to be solved in the kernel.

If you follow the *intent* of the Cumulus document: what is the maximum number of load balancers you expect to have? 16? 32? 64? Define an ECMP route with that number of nexthops and fill in the weighting that meets your needs. When an LB is added or removed, you decide what the new set of paths is that maintains N-total paths with the distribution that meets your needs.

I just sent patches for active-backup nexthops that allows an automatic fallback when one is removed to address the redistribution problem, but it still requires userspace to decide what the active-backup pairs are as well as the maximum number of paths.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3600 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [vger.kernel.org代发]Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-06-15  6:56       ` 答复: [vger.kernel.org代发]Re: " Yi Yang (杨燚)-云服务集团
@ 2020-06-15 22:42         ` David Ahern
  2020-06-16  0:29           ` 答复: " Yi Yang (杨燚)-云服务集团
  0 siblings, 1 reply; 10+ messages in thread
From: David Ahern @ 2020-06-15 22:42 UTC (permalink / raw)
  To: Yi Yang
	(杨燚)-云服务集团,
	netdev
  Cc: nikolay

On 6/15/20 12:56 AM, Yi Yang (杨燚)-云服务集团 wrote:
> My next hops are the final real servers, not load balancers. Say we set the maximum number of servers to 64; next-hop entries are still added or removed dynamically, so we are unlikely to know them beforehand. I don't understand how user space can achieve consistent distribution without in-kernel consistent hashing. Can you show how ip route commands can achieve this?
> 

I do not see how consistent hashing can be done in the kernel without
affecting performance, and a second problem is having it do the right
thing for all use cases. That said, feel free to try to implement it.

> I found that the routing cache may help with this issue: once a flow has been routed to a real server, its route is cached, so subsequent packets of that flow should hit the routing cache via fib_lookup and always be routed to the right server. As far as the result is concerned, that would be equivalent to consistent hashing.
> 

route cache is invalidated anytime there is a change to the FIB.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* 答复: 答复: [vger.kernel.org代发]Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-06-15 22:42         ` David Ahern
@ 2020-06-16  0:29           ` Yi Yang (杨燚)-云服务集团
  0 siblings, 0 replies; 10+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-06-16  0:29 UTC (permalink / raw)
  To: dsahern, netdev; +Cc: nikolay

[-- Attachment #1: Type: text/plain, Size: 1760 bytes --]

Got it, thanks David, so the only way is to do it ourselves :-). Yes, performance would be hurt by consistent hashing, but I also found that consistent hashing can't guarantee flows are always dispatched to the server that is handling them; it only ensures that in most cases. For our use case, consistent hashing is not enough.

-----Original Message-----
From: David Ahern [mailto:dsahern@gmail.com]
Sent: June 16, 2020 6:43
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>; netdev@vger.kernel.org
Cc: nikolay@cumulusnetworks.com
Subject: Re: 答复: [vger.kernel.org代发]Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?

On 6/15/20 12:56 AM, Yi Yang (杨燚)-云服务集团 wrote:
> My next hops are the final real servers, not load balancers. Say we set the maximum number of servers to 64; next-hop entries are still added or removed dynamically, so we are unlikely to know them beforehand. I don't understand how user space can achieve consistent distribution without in-kernel consistent hashing. Can you show how ip route commands can achieve this?
> 

I do not see how consistent hashing can be done in the kernel without affecting performance, and a second problem is having it do the right thing for all use cases. That said, feel free to try to implement it.

> I found that the routing cache may help with this issue: once a flow has been routed to a real server, its route is cached, so subsequent packets of that flow should hit the routing cache via fib_lookup and always be routed to the right server. As far as the result is concerned, that would be equivalent to consistent hashing.
> 

route cache is invalidated anytime there is a change to the FIB.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3600 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-06-12  4:36     ` David Ahern
  2020-06-15  6:56       ` 答复: [vger.kernel.org代发]Re: " Yi Yang (杨燚)-云服务集团
@ 2020-08-02 14:49       ` Ido Schimmel
  2020-08-06 16:45         ` David Ahern
  1 sibling, 1 reply; 10+ messages in thread
From: Ido Schimmel @ 2020-08-02 14:49 UTC (permalink / raw)
  To: David Ahern
  Cc: Yi Yang
	(杨燚)-云服务集团,
	netdev, nikolay

On Thu, Jun 11, 2020 at 10:36:59PM -0600, David Ahern wrote:
> On 6/11/20 6:32 PM, Yi Yang (杨燚)-云服务集团 wrote:
> > David, thank you so much for confirming it can't. I did read your Cumulus document before; resilient hashing is fine for next-hop removal, but it still has the same issue when a new next hop is added. I know most of the kernel code in Cumulus Linux is already in the upstream kernel, so I'm wondering why you didn't push resilient hashing upstream.
> > 
> > I think consistent hashing is a must-have for a commercial load-balancing solution; without it the solution is basically unusable. Does Cumulus Linux have a consistent hashing solution?
> > 
> > Is "- replacing nexthop entries as LB's come and go" the approach shown in https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hashing? It can't ensure a flow is distributed to the right backend server if a new next hop is added.
> 
> I do not believe it is a problem to be solved in the kernel.
> 
> If you follow the *intent* of the Cumulus document: what is the maximum
> number of load balancers you expect to have? 16? 32? 64? Define an ECMP
> route with that number of nexthops and fill in the weighting that meets
> your needs. When an LB is added or removed, you decide what the new set
> of paths is that maintains N-total paths with the distribution that
> meets your needs.

I recently started looking into consistent hashing and I wonder if it
can be done with the new nexthop API while keeping all the logic in user
space (e.g., FRR).

The only extension that might be required from the kernel is a new
nexthop attribute that indicates when a nexthop was last recently used.
User space can then use it to understand which nexthops to replace when
a new nexthop is added and when to perform the replacement. In case the
nexthops are offloaded, it is possible for the driver to periodically
update the nexthop code about their activity.

Below is a script that demonstrates the concept with the example in the
Cumulus documentation. I chose to replace the individual nexthops
instead of creating new ones and then replacing the group.

It is obviously possible to create larger groups to reduce the impact on
existing flows when a new nexthop is added.

WDYT?

```
#!/bin/bash

### Setup ####

IP="ip -n testns"

ip netns add testns

$IP link add name dummy_a up type dummy
$IP link add name dummy_b up type dummy
$IP link add name dummy_c up type dummy
$IP link add name dummy_d up type dummy
$IP link add name dummy_e up type dummy

$IP route add 1.1.1.0/24 dev dummy_a
$IP route add 2.2.2.0/24 dev dummy_b
$IP route add 3.3.3.0/24 dev dummy_c
$IP route add 4.4.4.0/24 dev dummy_d
$IP route add 5.5.5.0/24 dev dummy_e

### Initial nexthop configuration ####
# According to:
# https://docs.cumulusnetworks.com/cumulus-linux-42/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hash-buckets

$IP nexthop replace id 1 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 2 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 3 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 4 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 5 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 6 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 7 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 8 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 9 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 10 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 11 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 12 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 10000 group 1/2/3/4/5/6/7/8/9/10/11/12

echo
echo "Initial state:"
echo
$IP nexthop show

### Nexthop B is removed ###
# According to:
# https://docs.cumulusnetworks.com/cumulus-linux-42/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#remove-next-hops

$IP nexthop replace id 2 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 6 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 10 via 4.4.4.4 dev dummy_d

echo
echo "After nexthop B was removed:"
echo
$IP nexthop show

### Initial state restored ####

$IP nexthop replace id 2 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 6 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 10 via 2.2.2.2 dev dummy_b

echo
echo "After intial state was restored:"
echo
$IP nexthop show

### Nexthop E is added ####
# According to:
# https://docs.cumulusnetworks.com/cumulus-linux-42/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#add-next-hops

# Nexthop 2, 5, 8 are active. Replace in a way that minimizes
# interruptions.
$IP nexthop replace id 1 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 2 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 3 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 4 via 5.5.5.5 dev dummy_e
# Nexthop 5 remains the same
# Nexthop 6 remains the same
# Nexthop 7 remains the same
# Nexthop 8 remains the same
$IP nexthop replace id 9 via 5.5.5.5 dev dummy_e
$IP nexthop replace id 10 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 11 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 12 via 3.3.3.3 dev dummy_c

echo
echo "After nexthop E was added:"
echo
$IP nexthop show

ip netns del testns
```

> 
> I just sent patches for active-backup nexthops that allows an automatic
> fallback when one is removed to address the redistribution problem, but
> it still requires userspace to decide what the active-backup pairs are
> as well as the maximum number of paths.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-08-02 14:49       ` Ido Schimmel
@ 2020-08-06 16:45         ` David Ahern
  2020-08-08 18:40           ` Ido Schimmel
  0 siblings, 1 reply; 10+ messages in thread
From: David Ahern @ 2020-08-06 16:45 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Yi Yang
	(杨燚)-云服务集团,
	netdev, nikolay

On 8/2/20 8:49 AM, Ido Schimmel wrote:
> On Thu, Jun 11, 2020 at 10:36:59PM -0600, David Ahern wrote:
>> On 6/11/20 6:32 PM, Yi Yang (杨燚)-云服务集团 wrote:
>>> David, thank you so much for confirming it can't. I did read your Cumulus document before; resilient hashing is fine for next-hop removal, but it still has the same issue when a new next hop is added. I know most of the kernel code in Cumulus Linux is already in the upstream kernel, so I'm wondering why you didn't push resilient hashing upstream.
>>>
>>> I think consistent hashing is a must-have for a commercial load-balancing solution; without it the solution is basically unusable. Does Cumulus Linux have a consistent hashing solution?
>>>
>>> Is "- replacing nexthop entries as LB's come and go" the approach shown in https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hashing? It can't ensure a flow is distributed to the right backend server if a new next hop is added.
>>
>> I do not believe it is a problem to be solved in the kernel.
>>
>> If you follow the *intent* of the Cumulus document: what is the maximum
>> number of load balancers you expect to have? 16? 32? 64? Define an ECMP
>> route with that number of nexthops and fill in the weighting that meets
>> your needs. When an LB is added or removed, you decide what the new set
>> of paths is that maintains N-total paths with the distribution that
>> meets your needs.
> 
> I recently started looking into consistent hashing and I wonder if it
> can be done with the new nexthop API while keeping all the logic in user
> space (e.g., FRR).
> 
> The only extension that might be required from the kernel is a new
> nexthop attribute that indicates when a nexthop was last recently used.

The only potential problem that comes to mind is that a nexthop can be
used by multiple prefixes.

But, I'm not sure I follow what the last recently used indicator gives
you for maintaining flows as a group is updated.

> User space can then use it to understand which nexthops to replace when
> a new nexthop is added and when to perform the replacement. In case the
> nexthops are offloaded, it is possible for the driver to periodically
> update the nexthop code about their activity.
> 
> Below is a script that demonstrates the concept with the example in the
> Cumulus documentation. I chose to replace the individual nexthops
> instead of creating new ones and then replacing the group.

That is one of the features ... a group points to individual nexthops
and those can be atomically updated without affecting the group.
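
A minimal illustration of that property (arbitrary IDs and addresses):

```
ip nexthop replace id 1 via 192.0.2.1 dev eth0
ip nexthop replace id 2 via 192.0.2.2 dev eth0
ip nexthop replace id 100 group 1/2
ip route replace 203.0.113.0/24 nhid 100

# Re-point nexthop 2 at another gateway; group 100 and the route that
# uses it are not touched.
ip nexthop replace id 2 via 192.0.2.3 dev eth0
```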

> 
> It is obviously possible to create larger groups to reduce the impact on
> existing flows when a new nexthop is added.
> 
> WDYT?

This is in line with my earlier responses, and your script shows an
example of how to manage it. Combine it with the active-backup patch set
and you handle device events too (avoiding a change in the size of the
group on device events).


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [PATCH] can current ECMP implementation support consistent hashing for next hop?
  2020-08-06 16:45         ` David Ahern
@ 2020-08-08 18:40           ` Ido Schimmel
  0 siblings, 0 replies; 10+ messages in thread
From: Ido Schimmel @ 2020-08-08 18:40 UTC (permalink / raw)
  To: David Ahern
  Cc: Yi Yang
	(杨燚)-云服务集团,
	netdev, nikolay

On Thu, Aug 06, 2020 at 10:45:52AM -0600, David Ahern wrote:
> On 8/2/20 8:49 AM, Ido Schimmel wrote:
> > On Thu, Jun 11, 2020 at 10:36:59PM -0600, David Ahern wrote:
> >> On 6/11/20 6:32 PM, Yi Yang (杨燚)-云服务集团 wrote:
> >>> David, thank you so much for confirming it can't. I did read your Cumulus document before; resilient hashing is fine for next-hop removal, but it still has the same issue when a new next hop is added. I know most of the kernel code in Cumulus Linux is already in the upstream kernel, so I'm wondering why you didn't push resilient hashing upstream.
> >>>
> >>> I think consistent hashing is a must-have for a commercial load-balancing solution; without it the solution is basically unusable. Does Cumulus Linux have a consistent hashing solution?
> >>>
> >>> Is "- replacing nexthop entries as LB's come and go" the approach shown in https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hashing? It can't ensure a flow is distributed to the right backend server if a new next hop is added.
> >>
> >> I do not believe it is a problem to be solved in the kernel.
> >>
> >> If you follow the *intent* of the Cumulus document: what is the maximum
> >> number of load balancers you expect to have? 16? 32? 64? Define an ECMP
> >> route with that number of nexthops and fill in the weighting that meets
> >> your needs. When an LB is added or removed, you decide what the new set
> >> of paths is that maintains N-total paths with the distribution that
> >> meets your needs.
> > 
> > I recently started looking into consistent hashing and I wonder if it
> > can be done with the new nexthop API while keeping all the logic in user
> > space (e.g., FRR).
> > 
> > The only extension that might be required from the kernel is a new
> > nexthop attribute that indicates when a nexthop was last recently used.
> 
> The only potential problem that comes to mind is that a nexthop can be
> used by multiple prefixes.

Yes. The key point is that for resilient hashing a nexthop ID no longer
represents a logical nexthop (dev + gw), but rather a hash bucket. User
space determines how many buckets there are in a group (e.g., 256) and
how logical nexthops are assigned to them.
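
For example, building that bucket view could look like this (reusing the addresses/devices from the earlier script; 256 is an arbitrary bucket count, assuming the kernel accepts a group that large):

```
# Each nexthop ID is a hash bucket; assign the four logical nexthops
# (gateway + device) to the buckets round-robin.
GATEWAYS=(1.1.1.1 2.2.2.2 3.3.3.3 4.4.4.4)
DEVICES=(dummy_a dummy_b dummy_c dummy_d)
N_BUCKETS=256

group=""
for ((i = 0; i < N_BUCKETS; i++)); do
	n=$((i % ${#GATEWAYS[@]}))
	ip nexthop replace id $((i + 1)) via "${GATEWAYS[n]}" dev "${DEVICES[n]}"
	group+="$((i + 1))/"
done
ip nexthop replace id 10000 group "${group%/}"
```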

> 
> But, I'm not sure I follow what the last recently used indicator gives
> you for maintaining flows as a group is updated.

When adding a nexthop to a group the goal is to do it in a way that
minimizes the impact on existing flows. Therefore, you want to avoid
assigning the nexthop to buckets that are active. After a certain time
of bucket inactivity user space can "safely" perform the replacement.
See description of this knob in Cumulus documentation:

```
resilient_hash_active_timer: A timer that protects TCP sessions from being
disrupted while attempting to populate new next hops. You specify the number of
seconds when at least one hash bucket consistently sees no traffic before
Cumulus Linux rebalances the flows; the default is 120 seconds. If any one
bucket is idle; that is, it sees no traffic for the defined period, the next
new flow utilizes that bucket and flows to the new link. Thus, if the network
is experiencing a large number of flows or very consistent or persistent flows,
there may not be any buckets remaining idle for a consistent 120 second period,
and the imbalance remains until that timer has been met. If a new link is
brought up and added back to a group during this time, traffic does not get
allocated to utilize it until a bucket qualifies as empty, meaning it has been
idle for 120 seconds. This is when a rebalance can occur.
```

Currently, user space does not have this activity information.

I'm saying "safely" because by the time user space decides to perform
the replacement it is possible that the bucket became active again. In
this case it is possible for the kernel / hardware to reject the
replacement. How such an atomic replacement is communicated to the
kernel will determine how the activity information should be exposed.

Option 1:

A new nexthop flag (RTNH_F_ACTIVE ?). For example:

id 1 via 2.2.2.2 dev dummy_b scope link active

User space can periodically query the kernel and clear the activity. For
example:

ip nexthop list_clear

To communicate an atomic replacement:

ip nexthop replace atomic id 3 via 2.2.2.2 dev dummy_b

Option 2:

Add a new 'used' attribute that encodes time since the bucket was last
used. For example:

ip -s nexthop show id 1
id 1 via 2.2.2.2 dev dummy_b scope link used 5

User space will cache it and use it to perform an atomic replacement:

ip nexthop replace used 5 id 3 via 2.2.2.2 dev dummy_b

The kernel will compare its current used time with the value specified
by user space. If current value is smaller, reject the replacement.
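
To make the intended flow concrete, user space could then run something like the following (the 'used' attribute and the 'replace used' keyword are only the proposal above, not existing iproute2 syntax; bucket ID, gateway and device are arbitrary):

```
# Wait until bucket (nexthop ID) 3 has been idle long enough, then
# atomically re-point it at the new logical nexthop. The kernel would
# reject the replace if the bucket became active again in the meantime.
BUCKET=3
IDLE_THRESHOLD=120   # seconds, like the Cumulus active timer

while true; do
	used=$(ip -s nexthop show id "$BUCKET" | sed -n 's/.*used \([0-9]*\).*/\1/p')
	if [[ -n $used ]] && (( used >= IDLE_THRESHOLD )); then
		ip nexthop replace used "$used" id "$BUCKET" \
			via 5.5.5.5 dev dummy_e && break
	fi
	sleep 10
done
```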

> 
> > User space can then use it to understand which nexthops to replace when
> > a new nexthop is added and when to perform the replacement. In case the
> > nexthops are offloaded, it is possible for the driver to periodically
> > update the nexthop code about their activity.
> > 
> > Below is a script that demonstrates the concept with the example in the
> > Cumulus documentation. I chose to replace the individual nexthops
> > instead of creating new ones and then replacing the group.
> 
> That is one of the features ... a group points to individual nexthops
> and those can be atomically updated without affecting the group.
> 
> > 
> > It is obviously possible to create larger groups to reduce the impact on
> > existing flows when a new nexthop is added.
> > 
> > WDYT?
> 
> This is inline with my earlier responses, and your script shows an
> example of how to manage it. Combine it with the active-backup patch set
> and you handle device events too (avoid disrupting size of the group on
> device events).

Yes, correct. I rebased your active-backup patches on top of net-next,
salvaged the iproute2 patches from your github and updated the example
script:

```
#!/bin/bash

### Setup ####

IP="ip -n testns"

ip netns add testns

$IP link add name dummy_a up type dummy
$IP link add name dummy_b up type dummy
$IP link add name dummy_c up type dummy
$IP link add name dummy_d up type dummy
$IP link add name dummy_e up type dummy

$IP route add 1.1.1.0/24 dev dummy_a
$IP route add 2.2.2.0/24 dev dummy_b
$IP route add 3.3.3.0/24 dev dummy_c
$IP route add 4.4.4.0/24 dev dummy_d
$IP route add 5.5.5.0/24 dev dummy_e

### Initial nexthop configuration ####
# According to:
# https://docs.cumulusnetworks.com/cumulus-linux-42/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#resilient-hash-buckets

# First sub-group

$IP nexthop replace id 1 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 2 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 101 group 1/2 active-backup

$IP nexthop replace id 3 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 4 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 102 group 3/4 active-backup

$IP nexthop replace id 5 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 6 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 103 group 5/6 active-backup

$IP nexthop replace id 7 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 8 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 104 group 7/8 active-backup

# Second sub-group

$IP nexthop replace id 9 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 10 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 105 group 9/10 active-backup

$IP nexthop replace id 11 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 12 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 106 group 11/12 active-backup

$IP nexthop replace id 13 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 14 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 107 group 13/14 active-backup

$IP nexthop replace id 15 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 16 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 108 group 15/16 active-backup

# Third sub-group

$IP nexthop replace id 17 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 18 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 109 group 17/18 active-backup

$IP nexthop replace id 19 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 20 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 110 group 19/20 active-backup

$IP nexthop replace id 21 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 22 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 111 group 21/22 active-backup

$IP nexthop replace id 23 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 24 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 112 group 23/24 active-backup

$IP nexthop replace id 10001 \
	group 101/102/103/104/105/106/107/108/109/110/111/112

echo
echo "Initial state:"
echo
$IP nexthop show

### Nexthop B is removed ###
# According to:
# https://docs.cumulusnetworks.com/cumulus-linux-42/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#remove-next-hops

$IP link set dev dummy_b carrier off

echo
echo "After nexthop B was removed:"
echo
$IP nexthop show

### Initial state restored ####

$IP link set dev dummy_b carrier on

$IP nexthop replace id 2 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 101 group 1/2 active-backup

$IP nexthop replace id 3 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 102 group 3/4 active-backup

$IP nexthop replace id 11 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 106 group 11/12 active-backup

$IP nexthop replace id 14 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 107 group 13/14 active-backup

$IP nexthop replace id 16 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 108 group 15/16 active-backup

# Third sub-group

$IP nexthop replace id 19 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 110 group 19/20 active-backup

$IP nexthop replace id 23 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 112 group 23/24 active-backup

echo
echo "After intial state was restored:"
echo
$IP nexthop show

### Nexthop E is added ####
# According to:
# https://docs.cumulusnetworks.com/cumulus-linux-42/Layer-3/Equal-Cost-Multipath-Load-Sharing-Hardware-ECMP/#add-next-hops

# Nexthop 3, 9, 15 are active. Replace in a way that minimizes interruptions.
$IP nexthop replace id 1 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 3 via 3.3.3.3 dev dummy_c
$IP nexthop replace id 5 via 4.4.4.4 dev dummy_d
$IP nexthop replace id 7 via 5.5.5.5 dev dummy_e
# Nexthop 9 remains the same
# Nexthop 11 remains the same
# Nexthop 13 remains the same
# Nexthop 15 remains the same
$IP nexthop replace id 17 via 5.5.5.5 dev dummy_e
$IP nexthop replace id 19 via 1.1.1.1 dev dummy_a
$IP nexthop replace id 21 via 2.2.2.2 dev dummy_b
$IP nexthop replace id 23 via 3.3.3.3 dev dummy_c

echo
echo "After nexthop E was added:"
echo
$IP nexthop show

ip netns del testns
```

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-08-08 18:41 UTC | newest]

Thread overview: 10+ messages
-- links below jump to the message on this page --
2020-06-11 14:56 [PATCH] can current ECMP implementation support consistent hashing for next hop? Yi Yang (杨燚)-云服务集团
2020-06-11 18:27 ` David Ahern
2020-06-12  0:32   ` 答复: " Yi Yang (杨燚)-云服务集团
2020-06-12  4:36     ` David Ahern
2020-06-15  6:56       ` 答复: [vger.kernel.org代发]Re: " Yi Yang (杨燚)-云服务集团
2020-06-15 22:42         ` David Ahern
2020-06-16  0:29           ` 答复: " Yi Yang (杨燚)-云服务集团
2020-08-02 14:49       ` Ido Schimmel
2020-08-06 16:45         ` David Ahern
2020-08-08 18:40           ` Ido Schimmel
