* [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
@ 2011-01-14 19:07 Oleg V. Ukhno
  2011-01-14 20:10 ` John Fastabend
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-14 19:07 UTC (permalink / raw)
  To: netdev; +Cc: Jay Vosburgh, David S. Miller

Patch introduces a new hashing policy for the 802.3ad bonding mode.
This hashing policy can be used (and was tested) only for round-robin
balancing of iSCSI traffic: a single TCP session is balanced per-packet
over all slave interfaces.
General requirements for using this hashing policy are:
1) the switch must be configured with the src-dst-mac or src-mac hashing policy
2) the number of bond slaves on the sending and receiving machines should be
equal and preferably even, or at least even, otherwise you may get an
asymmetric load on the receiving machine
3) the hashing policy must not be used when the round trip time between the
source and destination machines is expected to differ significantly between
slaves in the same bond (it works fine when all slaves are plugged into a
single switch)
A configuration sketch follows below.
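
For illustration only, a minimal configuration sketch assuming this patch is
applied; interface names, addresses and the switch command are hypothetical
and not taken from this thread:

	# host side: 802.3ad bond using the proposed hash policy
	modprobe bonding mode=802.3ad miimon=100 xmit_hash_policy=simple-rr
	ifconfig bond0 192.168.10.1 netmask 255.255.255.0 up
	ifenslave bond0 eth0 eth1 eth2 eth3

	# switch side (Cisco IOS style, assumed): hash the port-channel on source MAC
	#   port-channel load-balance src-mac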

Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
---

 Documentation/networking/bonding.txt |   27 +++++++++++++++++++++++++++
 drivers/net/bonding/bond_3ad.c       |    6 ++++++
 drivers/net/bonding/bond_main.c      |   18 +++++++++++++++++-
 include/linux/if_bonding.h           |    1 +
 4 files changed, 51 insertions(+), 1 deletion(-)

diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
--- linux-2.6.37-vanilla/Documentation/networking/bonding.txt	2011-01-05 03:50:19.000000000 +0300
+++ linux-2.6.37.my/Documentation/networking/bonding.txt	2011-01-14 21:34:46.635268000 +0300
@@ -759,6 +759,33 @@ xmit_hash_policy
 		most UDP traffic is not involved in extended
 		conversations.  Other implementations of 802.3ad may
 		or may not tolerate this noncompliance.
+
+	simple-rr or 3
+		This policy simply sends every next packet via "next"
+		slave interface. When sending, it resets mac-address
+		within packet to real mac-address of the slave interface.
+
+		When switch is configured properly, and receiving machine
+		has even and equal number of interfaces, this guarantees
+		quite precise rx/tx load balancing for any single TCP
+		session. Typical use-case for this mode is ISCSI(and patch was
+		developed for), because it ises single TCP session to
+		transmit data.
+
+		It is important to remember, that all slaves should be
+		plugged into single switch to avoid out-of-order packets
+		It is recommended to have equal and even number of slave
+		interfaces in sending and receviving machines bond's,
+		otherwise you will get asymmetric load on receiving host.
+		Another caveat is that hashing policy must not be used when
+		round trip time between source and destination machines for
+		slaves in same bond is expected to be significanly different
+		(it works fine when all slaves are plugged into single switch)
+
+		For correct load baalncing on the receiving side you must
+		configure switch for using src-dst-mac or src-mac hashing
+		mode.
+
 
 	The default value is layer2.  This option was added in bonding
 	version 2.6.3.  In earlier versions of bonding, this parameter
diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
--- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c	2011-01-14 19:39:05.575268000 +0300
+++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c	2011-01-14 19:47:03.815268000 +0300
@@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
 	int i;
 	struct ad_info ad_info;
 	int res = 1;
+	struct ethhdr *eth_data;
 
 	/* make sure that the slaves list will
 	 * not change during tx
@@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
 			slave_agg_id = agg->aggregator_identifier;
 
 		if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
+			if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
+				skb_reset_mac_header(skb);
+				eth_data = eth_hdr(skb);
+				memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
+			}
 			res = bond_dev_queue_xmit(bond, skb, slave->dev);
 			break;
 		}
diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
--- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c	2011-01-14 19:39:05.575268000 +0300
+++ linux-2.6.37.my/drivers/net/bonding/bond_main.c	2011-01-14 19:47:55.835268001 +0300
@@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
 MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
 module_param(xmit_hash_policy, charp, 0);
 MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
-				   ", 1 for layer 3+4");
+				   ", 1 for layer 3+4"
+				   ", 2 for layer 2+3"
+				   ", 3 for round-robin");
 module_param(arp_interval, int, 0);
 MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
 module_param_array(arp_ip_target, charp, NULL, 0);
@@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
 {	"layer2",		BOND_XMIT_POLICY_LAYER2},
 {	"layer3+4",		BOND_XMIT_POLICY_LAYER34},
 {	"layer2+3",		BOND_XMIT_POLICY_LAYER23},
+{	"simple-rr",		BOND_XMIT_POLICY_LAYERRR},
 {	NULL,			-1},
 };
 
@@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
 	return (data->h_dest[5] ^ data->h_source[5]) % count;
 }
 
+/*
+ * simply round robin
+ */
+static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
+				   struct net_device *bond_dev, int count)
+{
+	struct bonding *bond = netdev_priv(bond_dev);
+	return bond->rr_tx_counter++ % count;
+}
+
 /*-------------------------- Device entry points ----------------------------*/
 
 static int bond_open(struct net_device *bond_dev)
@@ -4482,6 +4495,9 @@ out:
 static void bond_set_xmit_hash_policy(struct bonding *bond)
 {
 	switch (bond->params.xmit_policy) {
+	case BOND_XMIT_POLICY_LAYERRR:
+		bond->xmit_hash_policy = bond_xmit_hash_policy_rr;
+		break;
 	case BOND_XMIT_POLICY_LAYER23:
 		bond->xmit_hash_policy = bond_xmit_hash_policy_l23;
 		break;
diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/include/linux/if_bonding.h linux-2.6.37.my/include/linux/if_bonding.h
--- linux-2.6.37-vanilla/include/linux/if_bonding.h	2011-01-05 03:50:19.000000000 +0300
+++ linux-2.6.37.my/include/linux/if_bonding.h	2011-01-14 19:34:29.755268001 +0300
@@ -91,6 +91,7 @@
 #define BOND_XMIT_POLICY_LAYER2		0 /* layer 2 (MAC only), default */
 #define BOND_XMIT_POLICY_LAYER34	1 /* layer 3+4 (IP ^ (TCP || UDP)) */
 #define BOND_XMIT_POLICY_LAYER23	2 /* layer 2+3 (IP ^ MAC) */
+#define BOND_XMIT_POLICY_LAYERRR	3 /* round-robin */
 
 typedef struct ifbond {
 	__s32 bond_mode;


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
@ 2011-01-14 20:10 ` John Fastabend
  2011-01-14 23:12   ` Oleg V. Ukhno
  2011-01-14 20:13 ` Jay Vosburgh
  2011-01-14 20:41 ` Nicolas de Pesloüan
  2 siblings, 1 reply; 32+ messages in thread
From: John Fastabend @ 2011-01-14 20:10 UTC (permalink / raw)
  To: Oleg V. Ukhno; +Cc: netdev, Jay Vosburgh, David S. Miller

On 1/14/2011 11:07 AM, Oleg V. Ukhno wrote:
> Patch introduces new hashing policy for 802.3ad bonding mode.
> This hashing policy can be used(was tested) only for round-robin
> balancing of ISCSI traffic(single TCP session is balanced (per-packet)
> over all slave interfaces. 
> General requirements for this hashing policy usage are:
> 1) switch must be configured with src-dst-mac or src-mac hashing policy 
> 2) number of bond slaves on sending and receiving machine should be equal
> and preferrably even; or simply even, otherwise you may get asymmetric 
> load on receiving machine
> 3) hashing policy must not be used when round trip time between source 
> and destination machines for slaves in same bond is expected to be 
> significanly different (it works fine when all slaves are plugged into
> single switch)
> 
> Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
> ---

I think you want this patch against net-next not 2.6.37.

> 
>  Documentation/networking/bonding.txt |   27 +++++++++++++++++++++++++++
>  drivers/net/bonding/bond_3ad.c       |    6 ++++++
>  drivers/net/bonding/bond_main.c      |   18 +++++++++++++++++-
>  include/linux/if_bonding.h           |    1 +
>  4 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
> --- linux-2.6.37-vanilla/Documentation/networking/bonding.txt	2011-01-05 03:50:19.000000000 +0300
> +++ linux-2.6.37.my/Documentation/networking/bonding.txt	2011-01-14 21:34:46.635268000 +0300
> @@ -759,6 +759,33 @@ xmit_hash_policy
>  		most UDP traffic is not involved in extended
>  		conversations.  Other implementations of 802.3ad may
>  		or may not tolerate this noncompliance.
> +
> +	simple-rr or 3
> +		This policy simply sends every next packet via "next"
> +		slave interface. When sending, it resets mac-address
> +		within packet to real mac-address of the slave interface.
> +
> +		When switch is configured properly, and receiving machine
> +		has even and equal number of interfaces, this guarantees
> +		quite precise rx/tx load balancing for any single TCP
> +		session. Typical use-case for this mode is ISCSI(and patch was
> +		developed for), because it ises single TCP session to
> +		transmit data.

Oleg, sorry but I don't follow. If this is simply sending every next packet
via the "next" slave interface, how are packets not going to get out of order?
If the links have different RTT this would seem problematic.

Have you considered using multipath at the block layer? This is how I generally
handle load balancing over iSCSI/FCoE and it works reasonably well.

see ./drivers/md/dm-mpath.c
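
For reference, a hedged sketch of the kind of /etc/multipath.conf settings this
suggestion points at (the values are illustrative, not taken from this thread):

	defaults {
		path_grouping_policy	multibus	# all paths in one priority group
		path_selector		"round-robin 0"	# round-robin I/O across the paths
		rr_min_io		100		# I/Os sent to a path before switching
	}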

> +
> +		It is important to remember, that all slaves should be
> +		plugged into single switch to avoid out-of-order packets
> +		It is recommended to have equal and even number of slave
> +		interfaces in sending and receviving machines bond's,
> +		otherwise you will get asymmetric load on receiving host.
> +		Another caveat is that hashing policy must not be used when
> +		round trip time between source and destination machines for
> +		slaves in same bond is expected to be significanly different
> +		(it works fine when all slaves are plugged into single switch)
> +
> +		For correct load baalncing on the receiving side you must
> +		configure switch for using src-dst-mac or src-mac hashing
> +		mode.
> +
>  
>  	The default value is layer2.  This option was added in bonding
>  	version 2.6.3.  In earlier versions of bonding, this parameter
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
> --- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c	2011-01-14 19:39:05.575268000 +0300
> +++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c	2011-01-14 19:47:03.815268000 +0300
> @@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
>  	int i;
>  	struct ad_info ad_info;
>  	int res = 1;
> +	struct ethhdr *eth_data;
>  
>  	/* make sure that the slaves list will
>  	 * not change during tx
> @@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
>  			slave_agg_id = agg->aggregator_identifier;
>  
>  		if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
> +			if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
> +				skb_reset_mac_header(skb);
> +				eth_data = eth_hdr(skb);
> +				memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
> +			}
>  			res = bond_dev_queue_xmit(bond, skb, slave->dev);
>  			break;
>  		}
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
> --- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c	2011-01-14 19:39:05.575268000 +0300
> +++ linux-2.6.37.my/drivers/net/bonding/bond_main.c	2011-01-14 19:47:55.835268001 +0300
> @@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
>  MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
>  module_param(xmit_hash_policy, charp, 0);
>  MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
> -				   ", 1 for layer 3+4");
> +				   ", 1 for layer 3+4"
> +				   ", 2 for layer 2+3"
> +				   ", 3 for round-robin");
>  module_param(arp_interval, int, 0);
>  MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
>  module_param_array(arp_ip_target, charp, NULL, 0);
> @@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
>  {	"layer2",		BOND_XMIT_POLICY_LAYER2},
>  {	"layer3+4",		BOND_XMIT_POLICY_LAYER34},
>  {	"layer2+3",		BOND_XMIT_POLICY_LAYER23},
> +{	"simple-rr",		BOND_XMIT_POLICY_LAYERRR},
>  {	NULL,			-1},
>  };
>  
> @@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
>  	return (data->h_dest[5] ^ data->h_source[5]) % count;
>  }
>  
> +/*
> + * simply round robin
> + */
> +static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
> +				   struct net_device *bond_dev, int count)

Here's one reason why this won't work on net-next-2.6.

int      (*xmit_hash_policy)(struct sk_buff *, int);
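
For illustration, one hedged way the round-robin policy might be adapted to
that two-argument prototype (a sketch under stated assumptions, not the
author's code; it trades the per-bond counter for a single shared one):

	/* Illustrative only: without the bond device argument, per-bond state
	 * is not reachable from here, so this sketch uses one file-scope
	 * counter shared by all bonds. */
	static atomic_t rr_tx_counter = ATOMIC_INIT(0);

	static int bond_xmit_hash_policy_rr(struct sk_buff *skb, int count)
	{
		return (unsigned int)atomic_inc_return(&rr_tx_counter) % count;
	}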


Thanks,
John


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
  2011-01-14 20:10 ` John Fastabend
@ 2011-01-14 20:13 ` Jay Vosburgh
  2011-01-14 22:51   ` Oleg V. Ukhno
  2011-01-14 20:41 ` Nicolas de Pesloüan
  2 siblings, 1 reply; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-14 20:13 UTC (permalink / raw)
  To: Oleg V. Ukhno; +Cc: netdev, David S. Miller

Oleg V. Ukhno <olegu@yandex-team.ru> wrote:

>Patch introduces new hashing policy for 802.3ad bonding mode.
>This hashing policy can be used(was tested) only for round-robin
>balancing of ISCSI traffic(single TCP session is balanced (per-packet)
>over all slave interfaces. 

	This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
(f), which requires that all frames of a given "conversation" are passed
to a single port.

	The existing layer3+4 hash has a similar problem (that it may
send packets from a conversation to multiple ports), but for that case
it's an unlikely exception (only in the case of IP fragmentation), but
here it's the norm.  At a minimum, this must be clearly documented.

	Also, what does a round robin in 802.3ad provide that the
existing round robin does not?  My presumption is that you're looking to
get the aggregator autoconfiguration that 802.3ad provides, but you
don't say.

	I don't necessarily think this is a bad cheat (round robining on
802.3ad as an explicit non-standard extension), since everybody wants to
stripe their traffic across multiple slaves.  I've given some thought to
making round robin into just another hash mode, but this also does some
magic to the MAC addresses of the outgoing frames (more on that below).

>General requirements for this hashing policy usage are:
>1) switch must be configured with src-dst-mac or src-mac hashing policy 
>2) number of bond slaves on sending and receiving machine should be equal
>and preferrably even; or simply even, otherwise you may get asymmetric 
>load on receiving machine
>3) hashing policy must not be used when round trip time between source 
>and destination machines for slaves in same bond is expected to be 
>significanly different (it works fine when all slaves are plugged into
>single switch)
>
>Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
>---
>
> Documentation/networking/bonding.txt |   27 +++++++++++++++++++++++++++
> drivers/net/bonding/bond_3ad.c       |    6 ++++++
> drivers/net/bonding/bond_main.c      |   18 +++++++++++++++++-
> include/linux/if_bonding.h           |    1 +
> 4 files changed, 51 insertions(+), 1 deletion(-)
>
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
>--- linux-2.6.37-vanilla/Documentation/networking/bonding.txt	2011-01-05 03:50:19.000000000 +0300
>+++ linux-2.6.37.my/Documentation/networking/bonding.txt	2011-01-14 21:34:46.635268000 +0300
>@@ -759,6 +759,33 @@ xmit_hash_policy
> 		most UDP traffic is not involved in extended
> 		conversations.  Other implementations of 802.3ad may
> 		or may not tolerate this noncompliance.
>+
>+	simple-rr or 3
>+		This policy simply sends every next packet via "next"
>+		slave interface. When sending, it resets mac-address
>+		within packet to real mac-address of the slave interface.

	Why is the MAC address reset done?  This is also a violation of
802.3ad, 5.2.1 (j).

>+		When switch is configured properly, and receiving machine
>+		has even and equal number of interfaces, this guarantees
>+		quite precise rx/tx load balancing for any single TCP
>+		session. Typical use-case for this mode is ISCSI(and patch was
>+		developed for), because it ises single TCP session to
>+		transmit data.
>+
>+		It is important to remember, that all slaves should be
>+		plugged into single switch to avoid out-of-order packets
>+		It is recommended to have equal and even number of slave
>+		interfaces in sending and receviving machines bond's,
>+		otherwise you will get asymmetric load on receiving host.
>+		Another caveat is that hashing policy must not be used when
>+		round trip time between source and destination machines for
>+		slaves in same bond is expected to be significanly different
>+		(it works fine when all slaves are plugged into single switch)
>+
>+		For correct load baalncing on the receiving side you must
>+		configure switch for using src-dst-mac or src-mac hashing
>+		mode.
>+
>
> 	The default value is layer2.  This option was added in bonding
> 	version 2.6.3.  In earlier versions of bonding, this parameter
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
>--- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c	2011-01-14 19:39:05.575268000 +0300
>+++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c	2011-01-14 19:47:03.815268000 +0300
>@@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
> 	int i;
> 	struct ad_info ad_info;
> 	int res = 1;
>+	struct ethhdr *eth_data;
>
> 	/* make sure that the slaves list will
> 	 * not change during tx
>@@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
> 			slave_agg_id = agg->aggregator_identifier;
>
> 		if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
>+			if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
>+				skb_reset_mac_header(skb);
>+				eth_data = eth_hdr(skb);
>+				memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
>+			}

	This is the code that resets the MAC header as described above.
It doesn't quite match the documentation, since it only resets the MAC
for ETH_P_IP packets.

> 			res = bond_dev_queue_xmit(bond, skb, slave->dev);
> 			break;
> 		}
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
>--- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c	2011-01-14 19:39:05.575268000 +0300
>+++ linux-2.6.37.my/drivers/net/bonding/bond_main.c	2011-01-14 19:47:55.835268001 +0300
>@@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
> MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
> module_param(xmit_hash_policy, charp, 0);
> MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
>-				   ", 1 for layer 3+4");
>+				   ", 1 for layer 3+4"
>+				   ", 2 for layer 2+3"
>+				   ", 3 for round-robin");
> module_param(arp_interval, int, 0);
> MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
> module_param_array(arp_ip_target, charp, NULL, 0);
>@@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
> {	"layer2",		BOND_XMIT_POLICY_LAYER2},
> {	"layer3+4",		BOND_XMIT_POLICY_LAYER34},
> {	"layer2+3",		BOND_XMIT_POLICY_LAYER23},
>+{	"simple-rr",		BOND_XMIT_POLICY_LAYERRR},

	I'd just call it "round-robin" instead of "simple-rr".

> {	NULL,			-1},
> };
>
>@@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
> 	return (data->h_dest[5] ^ data->h_source[5]) % count;
> }
>
>+/*
>+ * simply round robin
>+ */
>+static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
>+				   struct net_device *bond_dev, int count)
>+{
>+	struct bonding *bond = netdev_priv(bond_dev);
>+	return bond->rr_tx_counter++ % count;
>+}
>+
> /*-------------------------- Device entry points ----------------------------*/
>
> static int bond_open(struct net_device *bond_dev)
>@@ -4482,6 +4495,9 @@ out:
> static void bond_set_xmit_hash_policy(struct bonding *bond)
> {
> 	switch (bond->params.xmit_policy) {
>+	case BOND_XMIT_POLICY_LAYERRR:
>+		bond->xmit_hash_policy = bond_xmit_hash_policy_rr;
>+		break;
> 	case BOND_XMIT_POLICY_LAYER23:
> 		bond->xmit_hash_policy = bond_xmit_hash_policy_l23;
> 		break;
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/include/linux/if_bonding.h linux-2.6.37.my/include/linux/if_bonding.h
>--- linux-2.6.37-vanilla/include/linux/if_bonding.h	2011-01-05 03:50:19.000000000 +0300
>+++ linux-2.6.37.my/include/linux/if_bonding.h	2011-01-14 19:34:29.755268001 +0300
>@@ -91,6 +91,7 @@
> #define BOND_XMIT_POLICY_LAYER2		0 /* layer 2 (MAC only), default */
> #define BOND_XMIT_POLICY_LAYER34	1 /* layer 3+4 (IP ^ (TCP || UDP)) */
> #define BOND_XMIT_POLICY_LAYER23	2 /* layer 2+3 (IP ^ MAC) */
>+#define BOND_XMIT_POLICY_LAYERRR	3 /* round-robin */
>
> typedef struct ifbond {
> 	__s32 bond_mode;

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
  2011-01-14 20:10 ` John Fastabend
  2011-01-14 20:13 ` Jay Vosburgh
@ 2011-01-14 20:41 ` Nicolas de Pesloüan
  2 siblings, 0 replies; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-14 20:41 UTC (permalink / raw)
  To: Oleg V. Ukhno; +Cc: netdev, Jay Vosburgh, David S. Miller

On 14/01/2011 20:07, Oleg V. Ukhno wrote:

> +
> +		For correct load baalncing on the receiving side you must
> +		configure switch for using src-dst-mac or src-mac hashing
> +		mode.

Typo in baalncing -> balancing.

	Nicolas.


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-14 20:13 ` Jay Vosburgh
@ 2011-01-14 22:51   ` Oleg V. Ukhno
  2011-01-15  0:05     ` Jay Vosburgh
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-14 22:51 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, David S. Miller



Jay Vosburgh wrote:

> 	This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
> (f), which requires that all frames of a given "conversation" are passed
> to a single port.
> 
> 	The existing layer3+4 hash has a similar problem (that it may
> send packets from a conversation to multiple ports), but for that case
> it's an unlikely exception (only in the case of IP fragmentation), but
> here it's the norm.  At a minimum, this must be clearly documented.
> 
> 	Also, what does a round robin in 802.3ad provide that the
> existing round robin does not?  My presumption is that you're looking to
> get the aggregator autoconfiguration that 802.3ad provides, but you
> don't say.
> 
> 	I don't necessarily think this is a bad cheat (round robining on
> 802.3ad as an explicit non-standard extension), since everybody wants to
> stripe their traffic across multiple slaves.  I've given some thought to
> making round robin into just another hash mode, but this also does some
> magic to the MAC addresses of the outgoing frames (more on that below).
Yes, I am resetting MAC addresses when transmitting packets to make the
switch put packets into different ports of the receiving etherchannel.
I am using this patch to provide full-mesh iSCSI connectivity between at
least 4 hosts (all hosts, of course, are in the same ethernet segment), and
every host is connected with an aggregate link with 4 slaves (usually).
Using round-robin I provide near-equal load striping when transmitting;
using MAC address magic I force the switch to stripe packets over all slave
links in the destination port-channel (when the number of rx-ing slaves is
equal to the number of tx-ing slaves and is even). So I am able to utilize
all slaves for tx and for rx up to maximum capacity; besides, I get L2
link failure detection (and load rebalancing), which is (in my opinion)
much faster and more robust than L3 detection or than what dm-multipath
provides. That is the idea behind the patch.
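
As a purely hypothetical illustration of the "MAC address magic" (this assumes
the switch's src-mac hash behaves like bonding's own layer2 policy, i.e. an
XOR of the last address octets modulo the port count, which real switches may
or may not do):

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical permanent MACs of four slaves, differing only in
		 * the last octet, and an arbitrary destination MAC last octet */
		unsigned char src_last[4] = { 0x00, 0x01, 0x02, 0x03 };
		unsigned char dst_last = 0x10;
		int ports = 4;	/* ports in the destination port-channel */
		int i;

		for (i = 0; i < 4; i++)
			printf("slave %d -> port %d\n", i,
			       (src_last[i] ^ dst_last) % ports);
		return 0;
	}

With MACs chosen this way, rotating the source MAC per packet rotates the
egress port on the far switch, which is what gives the receive-side striping
described above.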
> 

> 
> 	This is the code that resets the MAC header as described above.
> It doesn't quite match the documentation, since it only resets the MAC
> for ETH_P_IP packets.
Yes, I really meant that my patch applies to ETH_P_IP packets, and I
omitted that from the documentation I wrote.
> 

> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 

-- 
Best regards,

Oleg Ukhno



* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-14 20:10 ` John Fastabend
@ 2011-01-14 23:12   ` Oleg V. Ukhno
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-14 23:12 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, Jay Vosburgh, David S. Miller



John Fastabend wrote:

> 
> I think you want this patch against net-next not 2.6.37.
This patch is against 2.6.37-git11, and I've tried to apply it to
net-next - it applied OK.
> 
> Oleg, sorry but I don't follow. If this is simply sending every next packet
> via "next" slave interface how are packets not going to get out of order? If
> the links have different RTT this would seem problematic.
> 
> Have you considered using multipath at the block layer? This is how I generally
> handle load balancing over iSCSI/FCoE and it works reasonably well.
> 
> see ./drivers/md/dm-mpath.c

John, the first solution I used for a long time for iSCSI load
balancing was multipath. But there are some problems with dm-multipath:
- it is slow (I am using iSCSI for Oracle, so I need to minimize latency)
- it handles link failures badly, because of its command queue
limitation (all queued commands above 32 are discarded in case of path
failure, as I remember)
- it performs very badly when there are many devices and many paths (I was
unable to utilize more than 2 Gbps out of 4, even with 100 disks with 4 paths
per disk)

My patch won't work correctly when slave links have different RTT, this is
true - it is usable only within one ethernet segment with equal/near-equal
RTT. This is its limitation.


>> +static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
>> +				   struct net_device *bond_dev, int count)
> 
> Here's one reason why this won't work on net-next-2.6.
> 
> int      (*xmit_hash_policy)(struct sk_buff *, int);

Thank you, I've missed that change.

> 
> 
> Thanks,
> John
> 

-- 
Best regards,
Oleg Ukhno


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-14 22:51   ` Oleg V. Ukhno
@ 2011-01-15  0:05     ` Jay Vosburgh
  2011-01-15 12:11       ` Oleg V. Ukhno
  2011-01-18  3:16       ` John Fastabend
  0 siblings, 2 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-15  0:05 UTC (permalink / raw)
  To: Oleg V. Ukhno; +Cc: netdev, David S. Miller, John Fastabend

Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>Jay Vosburgh wrote:
>
>> 	This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
>> (f), which requires that all frames of a given "conversation" are passed
>> to a single port.
>>
>> 	The existing layer3+4 hash has a similar problem (that it may
>> send packets from a conversation to multiple ports), but for that case
>> it's an unlikely exception (only in the case of IP fragmentation), but
>> here it's the norm.  At a minimum, this must be clearly documented.
>>
>> 	Also, what does a round robin in 802.3ad provide that the
>> existing round robin does not?  My presumption is that you're looking to
>> get the aggregator autoconfiguration that 802.3ad provides, but you
>> don't say.

	I'm still curious about this question.  Given the rather
intricate setup of your particular network (described below), I'm not
sure why 802.3ad is of benefit over traditional etherchannel
(balance-rr / balance-xor).

>> 	I don't necessarily think this is a bad cheat (round robining on
>> 802.3ad as an explicit non-standard extension), since everybody wants to
>> stripe their traffic across multiple slaves.  I've given some thought to
>> making round robin into just another hash mode, but this also does some
>> magic to the MAC addresses of the outgoing frames (more on that below).
>Yes, I am resetting MAC addresses when transmitting packets to have switch
>to put packets into different ports of the receiving etherchannel.

	By "etherchannel" do you really mean "Cisco switch with a
port-channel group using LACP"?

>I am using this patch to provide full-mesh ISCSI connectivity between at
>least 4 hosts (all hosts of course are in same ethernet segment) and every
>host is connected with aggregate link with 4 slaves(usually).
>Using round-robin I provide near-equal load striping when transmitting,
>using MAC address magic I force switch to stripe packets over all slave
>links in destination port-channel(when number of rx-ing slaves is equal to
>number ot tx-ing slaves and is even).

	By "MAC address magic" do you mean that you're assigning
specifically chosen MAC addresses to the slaves so that the switch's
hash is essentially "assigning" the bonding slaves to particular ports
on the outgoing port-channel group?

	Assuming that this is the case, it's an interesting idea, but
I'm unconvinced that it's better on 802.3ad vs. balance-rr.  Unless I'm
missing something, you can get everything you need from an option to
have balance-rr / balance-xor utilize the slave's permanent address as
the source address for outgoing traffic.

>[...] So I am able to utilize all slaves
>for tx and for rx up to maximum capacity; besides I am getting L2 link
>failure detection (and load rebalancing), which is (in my opinion) much
>faster and robust than L3 or than dm-multipath provides.
>It's my idea with the patch

	Can somebody (John?) more knowledgable than I about dm-multipath
comment on the above?

>> 	This is the code that resets the MAC header as described above.
>> It doesn't quite match the documentation, since it only resets the MAC
>> for ETH_P_IP packets.
>Yes, I really meant that my patch applies to ETH_P_IP packets and I've
>missed that from documentation I wrote.

	Is limiting this to just ETH_P_IP really a means to exclude ARP,
or is there some advantage to (effectively) only balancing IP traffic,
and leaving other traffic (IPv6, for one) essentially unbalanced (when
exiting the switch through the destination port-channel group, which
you've set to use a src-mac hash)?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-15  0:05     ` Jay Vosburgh
@ 2011-01-15 12:11       ` Oleg V. Ukhno
  2011-01-18  3:16       ` John Fastabend
  1 sibling, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-15 12:11 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, David S. Miller, John Fastabend



Jay Vosburgh wrote:
> Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>> Jay Vosburgh wrote:
>>
>>> 	Also, what does a round robin in 802.3ad provide that the
>>> existing round robin does not?  My presumption is that you're looking to
>>> get the aggregator autoconfiguration that 802.3ad provides, but you
>>> don't say.
> 
> 	I'm still curious about this question.  Given the rather
> intricate setup of your particular network (described below), I'm not
> sure why 802.3ad is of benefit over traditional etherchannel
> (balance-rr / balance-xor).
Yes, I wanted 802.3ad autoconfiguration. Besides, all switches I use
support LACP, so I've chosen 802.3ad link aggregation.
Of course, it would be cool if both 802.3ad and balance-rr modes
supported such a load-striping feature.

> 
>> Yes, I am resetting MAC addresses when transmitting packets to have switch
>> to put packets into different ports of the receiving etherchannel.
> 
> 	By "etherchannel" do you really mean "Cisco switch with a
> port-channel group using LACP"?
Yes, exactly
> 
>> I am using this patch to provide full-mesh ISCSI connectivity between at
>> least 4 hosts (all hosts of course are in same ethernet segment) and every
>> host is connected with aggregate link with 4 slaves(usually).
>> Using round-robin I provide near-equal load striping when transmitting,
>> using MAC address magic I force switch to stripe packets over all slave
>> links in destination port-channel(when number of rx-ing slaves is equal to
>> number ot tx-ing slaves and is even).
> 
> 	By "MAC address magic" do you mean that you're assigning
> specifically chosen MAC addresses to the slaves so that the switch's
> hash is essentially "assigning" the bonding slaves to particular ports
> on the outgoing port-channel group?

Yes, so I am able to get equal load striping even for a single TCP
session between just two hosts, not only for the transmitting host but also
for the receiving host (iperf, when doing a TCP test, is able to utilize all
available bandwidth in the given etherchannel).
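
For instance, a single-stream check of the kind mentioned here might look like
this (host name and duration are hypothetical):

	# on the receiving host
	iperf -s
	# on the sending host: one TCP stream for 60 seconds
	iperf -c hostB -t 60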

> 
> 	Assuming that this is the case, it's an interesting idea, but
> I'm unconvinced that it's better on 802.3ad vs. balance-rr.  Unless I'm
> missing something, you can get everything you need from an option to
> have balance-rr / balance-xor utilize the slave's permanent address as
> the source address for outgoing traffic.

Yes, balance-rr would satisfy my requirements if patched to do the "MAC
address magic" (replacing the MAC address of packets being transmitted with
the slave's permanent address), except for 802.3ad link autoconfiguration.
"Pure" balance-rr won't allow utilizing the whole etherchannel bandwidth
when transmitting data just between 2 hosts (for example, when I have
one iSCSI initiator and one iSCSI target). balance-xor is not what I
wanted because data transmitted by the source host will stick to a
single slave.


> 
> 
>>> 	This is the code that resets the MAC header as described above.
>>> It doesn't quite match the documentation, since it only resets the MAC
>>> for ETH_P_IP packets.
>> Yes, I really meant that my patch applies to ETH_P_IP packets and I've
>> missed that from documentation I wrote.
> 
> 	Is limiting this to just ETH_P_IP really a means to exclude ARP,
> or is there some advantage to (effectively) only balancing IP traffic,
> and leaving other traffic (IPv6, for one) essentially unbalanced (when
> exiting the switch through the destination port-channel group, which
> you've set to use a src-mac hash)?
> 
Well, when making the initial version of this patch (it was for the 2.6.18
kernel), I meant just excluding ARP.

> 	-J
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 

-- 
Best regards,

Oleg Ukhno


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-15  0:05     ` Jay Vosburgh
  2011-01-15 12:11       ` Oleg V. Ukhno
@ 2011-01-18  3:16       ` John Fastabend
  2011-01-18 12:40         ` Oleg V. Ukhno
  1 sibling, 1 reply; 32+ messages in thread
From: John Fastabend @ 2011-01-18  3:16 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Oleg V. Ukhno, netdev, David S. Miller

On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
> Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>> Jay Vosburgh wrote:
>>
>>> 	This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
>>> (f), which requires that all frames of a given "conversation" are passed
>>> to a single port.
>>>
>>> 	The existing layer3+4 hash has a similar problem (that it may
>>> send packets from a conversation to multiple ports), but for that case
>>> it's an unlikely exception (only in the case of IP fragmentation), but
>>> here it's the norm.  At a minimum, this must be clearly documented.
>>>
>>> 	Also, what does a round robin in 802.3ad provide that the
>>> existing round robin does not?  My presumption is that you're looking to
>>> get the aggregator autoconfiguration that 802.3ad provides, but you
>>> don't say.
> 
> 	I'm still curious about this question.  Given the rather
> intricate setup of your particular network (described below), I'm not
> sure why 802.3ad is of benefit over traditional etherchannel
> (balance-rr / balance-xor).
> 
>>> 	I don't necessarily think this is a bad cheat (round robining on
>>> 802.3ad as an explicit non-standard extension), since everybody wants to
>>> stripe their traffic across multiple slaves.  I've given some thought to
>>> making round robin into just another hash mode, but this also does some
>>> magic to the MAC addresses of the outgoing frames (more on that below).
>> Yes, I am resetting MAC addresses when transmitting packets to have switch
>> to put packets into different ports of the receiving etherchannel.
> 
> 	By "etherchannel" do you really mean "Cisco switch with a
> port-channel group using LACP"?
> 
>> I am using this patch to provide full-mesh ISCSI connectivity between at
>> least 4 hosts (all hosts of course are in same ethernet segment) and every
>> host is connected with aggregate link with 4 slaves(usually).
>> Using round-robin I provide near-equal load striping when transmitting,
>> using MAC address magic I force switch to stripe packets over all slave
>> links in destination port-channel(when number of rx-ing slaves is equal to
>> number ot tx-ing slaves and is even).
> 
> 	By "MAC address magic" do you mean that you're assigning
> specifically chosen MAC addresses to the slaves so that the switch's
> hash is essentially "assigning" the bonding slaves to particular ports
> on the outgoing port-channel group?
> 
> 	Assuming that this is the case, it's an interesting idea, but
> I'm unconvinced that it's better on 802.3ad vs. balance-rr.  Unless I'm
> missing something, you can get everything you need from an option to
> have balance-rr / balance-xor utilize the slave's permanent address as
> the source address for outgoing traffic.
> 
>> [...] So I am able to utilize all slaves
>> for tx and for rx up to maximum capacity; besides I am getting L2 link
>> failure detection (and load rebalancing), which is (in my opinion) much
>> faster and robust than L3 or than dm-multipath provides.
>> It's my idea with the patch
> 
> 	Can somebody (John?) more knowledgable than I about dm-multipath
> comment on the above?

Here I'll give it a go.

I don't think detecting L2 link failure this way is very robust. If there
is a failure farther away than your immediate link, aren't you going to break
completely? Your bonding hash will continue to round-robin the iSCSI
packets and half of them will get dropped on the floor. dm-multipath handles
this reasonably gracefully. Also, in this bonding environment you seem to
be very sensitive to RTT times on the network. Maybe not bad outright, but
I wouldn't consider this robust either.

You could tweak your SCSI timeout values and fail_fast values, and set the I/O
retry count to 0 to cause the failover to occur faster. I suspect you already
did this and it is still too slow? Maybe adding a checker in multipathd to
listen for link events would be fast enough. The checker could then fail
the path immediately.
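
A hedged sketch of the sort of tuning being suggested (the option names exist
in dm-multipath and the SCSI sysfs interface, but the values and the device
name are illustrative only):

	# /etc/multipath.conf fragment: fail immediately instead of queueing
	# I/O when all paths to a LUN are gone
	defaults {
		no_path_retry	fail
	}

	# shorten the SCSI command timeout for a given path (hypothetical sdb)
	echo 30 > /sys/block/sdb/device/timeout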

I'll try to address your comments from the other thread here. In general I
wonder if it would be better to solve the problems in dm-multipath rather than
add another bonding mode?

OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize latency)

The dm-multipath layer is adding latency? How much? If this is really true,
maybe it's best to address the real issue here and not avoid it by
using the bonding layer.

OVU - it handles link failures badly, because of its command queue
limitation (all queued commands above 32 are discarded in case of path
failure, as I remember)

Maybe true, but only link failures with the immediate peer are handled
with a bonding strategy. By working at the block layer we can detect
failures throughout the path. I would need to look into this again; I
know when we were looking at this some time ago there was some talk about
improving this behavior. I need to take some time to go back through the
error recovery stuff to remember how this works.

OVU - it performs very badly when there are many devices and many paths (I was
unable to utilize more than 2 Gbps out of 4, even with 100 disks with 4 paths
per disk)

Hmm, well, that seems like something is broken. I'll try this setup when
I get some time in the next few days. This really shouldn't be the case;
dm-multipath should not add a bunch of extra latency or affect throughput
significantly. By the way, what are you seeing without mpio?

Thanks,
John


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18  3:16       ` John Fastabend
@ 2011-01-18 12:40         ` Oleg V. Ukhno
  2011-01-18 14:54           ` Nicolas de Pesloüan
  2011-01-18 16:41           ` John Fastabend
  0 siblings, 2 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 12:40 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jay Vosburgh, netdev, David S. Miller

On 01/18/2011 06:16 AM, John Fastabend wrote:
> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>> 	Can somebody (John?) more knowledgable than I about dm-multipath
>> comment on the above?
>
> Here I'll give it a go.
>
> I don't think detecting L2 link failure this way is very robust. If there
> is a failure farther away then your immediate link your going to break
> completely? Your bonding hash will continue to round robin the iscsi
> packets and half them will get dropped on the floor. dm-multipath handles
> this reasonably gracefully. Also in this bonding environment you seem to
> be very sensitive to RTT times on the network. Maybe not bad out right but
> I wouldn't consider this robust either.

John, I agree - this bonding mode should be used in a quite limited number
of situations, but as for a failure farther away than the immediate link -
every bonding mode will suffer the same problem in this case - bonding
detects only L2 failures, the rest is done by upper-layer mechanisms. And
almost all bonding modes depend on equal RTT on the slaves. And there is
already a similar load balancing mode - balance-alb - what I did is
approximately the same, but for the 802.3ad bonding mode, and it provides
"better" (more equal and unconditional layer-2) load striping for tx
and _rx_.

Perhaps I shouldn't have mentioned the particular use case of this patch -
when I wrote it I tried to make a more general solution - my goal was "make
equal or near-equal load striping for TX and (most important) RX
within a single ethernet (layer 2) domain for TCP transmission". This
bonding mode just introduces the ability to stripe rx and tx load for a
single TCP connection between hosts inside one ethernet segment.
iSCSI is just an example. It is possible to stripe load between a
Linux-based router and a Linux-based web/ftp/etc. server in the
same manner. I think this feature will be useful in a number of
network configurations.

Also, I looked into the net-next code - it seems to me that the patch can be
adapted to the net-next bonding code without any difficulty, and the hashing
function change poses no problem here.

What I've written below is just my personal experience and opinion after
5 years of using Oracle + iSCSI + mpath (later, patched bonding).

From my personal experience I can say that most iSCSI failures are
caused by link failures, and also I would never send any significant
iSCSI traffic via a router - the router would be a bottleneck in this case.
So, in my case iSCSI traffic flows within one ethernet domain, and in
case of a link failure the bonding driver simply fails one slave (in the
bonding case), instead of checking and failing hundreds of paths (in the
mpath case); the first case is significantly less CPU-, network- and
time-consuming (if using the default mpath checker, readsector0).
Mpath is good for me when I use it to "merge" drbd mirrors from
different hosts, but just doing simple load striping within a single
L2 network switch between 2..16 hosts is somewhat overkill (particularly
in maintaining human-readable device naming) :).

John, what is your opinion on such a load balancing method in general,
without referring to particular use cases?


>
> You could tweak your scsi timeout values and fail_fast values, set the io
> retry to 0 to cause the fail over to occur faster. I suspect you already
> did this and still it is too slow? Maybe adding a checker in multipathd to
> listen for link events would be fast enough. The checker could then fail
> the path immediately.
>
> I'll try to address your comments from the other thread here. In general I
> wonder if it would be better to solve the problems in dm-multipath rather than
> add another bonding mode?
Of course I did this, but mpath is fine when the device count is below
30-40 devices with two paths; 150-200 devices with 2+ paths can make
life far more interesting :)
>
> OVU - it is slow(I am using ISCSI for Oracle , so I need to minimize latency)
>
> The dm-multipath layer is adding latency? How much? If this is really true
> maybe its best to the address the real issue here and not avoid it by
> using the bonding layer.

I do not remember the exact number now, but switching one of my databases
to bonding, about 2 years ago, increased read throughput for the entire db
from 15-20 Tb/day to approximately 30-35 Tb/day (4 iSCSI initiators and
8 iSCSI targets, 4 ethernet links for iSCSI on each host, all plugged into
one switch) because of "full" bandwidth use. Also, using bonding
simplifies network and application setup greatly (compared to mpath).

>
> OVU - it handles any link failures bad, because of it's command queue
> limitation(all queued commands above 32 are discarded in case of path
> failure, as I remember)
>
> Maybe true but only link failures with the immediate peer are handled
> with a bonding strategy. By working at the block layer we can detect
> failures throughout the path. I would need to look into this again I
> know when we were looking at this sometime ago there was some talk about
> improving this behavior. I need to take some time to go back through the
> error recovery stuff to remember how this works.
>
> OVU - it performs very bad when there are many devices and maтy paths(I was
> unable to utilize more that 2Gbps of 4 even with 100 disks with 4 paths
> per each disk)

Well, I think that behavior can be explained this way:
when balancing by the number of I/Os per path (rr_min_io), and there is a huge
number of devices, mpath is doing load balancing per device, and it is
not possible to guarantee equal use of all devices, so there
will be an imbalance over the network interface (mpath is unaware of its
existence, etc.), and it likely becomes more imbalanced when there
are many devices. Also, counting I/Os for many devices and paths
consumes some CPU resources and can also cause excessive context switches.

>
> Hmm well that seems like something is broken. I'll try this setup when
> I get some time next few days. This really shouldn't be the case dm-multipath
> should not add a bunch of extra latency or effect throughput significantly.
> By the way what are you seeing without mpio?

And one more observation from my 2-year-old tests: reading a device (using
dd) (RHEL 5 update 1 kernel, ramdisk via iSCSI via loopback) as an mpath
device with a single path ran at approximately 120-150 mb/s, and the same
test on a non-mpath device at 800-900 mb/s. Of this I am quite sure; it was
a kind of revelation to me at the time.

>
> Thanks,
> John
>


-- 
Best regards,
Oleg Ukhno.
ITO Team Lead,
Yandex LLC.



* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 12:40         ` Oleg V. Ukhno
@ 2011-01-18 14:54           ` Nicolas de Pesloüan
  2011-01-18 15:28             ` Oleg V. Ukhno
  2011-01-18 16:41           ` John Fastabend
  1 sibling, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-18 14:54 UTC (permalink / raw)
  To: Oleg V. Ukhno, John Fastabend, Jay Vosburgh, David S. Miller
  Cc: netdev, Sébastien Barré, Christophe Paasch

On 18/01/2011 13:40, Oleg V. Ukhno wrote:

The fact that there exist many situations where it simply doesn't work should
not cause Oleg's idea to be rejected.

In Documentation/networking/bonding.txt, tuning tcp_reordering on the
receiving side is already documented as a possible workaround for out-of-order
delivery due to load balancing of a single TCP session, using mode=balance-rr.
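
For example (the value is illustrative; bonding.txt mentions raising it, up to
the maximum useful value of 127, when a single TCP session is striped across
slaves):

	# on the receiving host: tolerate more reordering before fast retransmit
	sysctl -w net.ipv4.tcp_reordering=127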

This might work reasonably well in a pure LAN topology, without any router
between both ends of the TCP session, even if this is limited to Linux hosts.
The use cases are not uncommon and not limited to iSCSI:
- between an application server and a database server,
- between members of a cluster, for replication purpose,
- between a server and a backup system,
- ...

Of course, for longer paths, with routers and variable RTT, we would need something different 
(possibly MultiPathTCP: http://datatracker.ietf.org/wg/mptcp/).

I remember a topology (described by Jay, as far as I remember) where two hosts
were connected through two distinct VLANs. In such a topology:
- it is possible to detect path failure using arp monitoring instead of miimon.
- changing the destination MAC address of egress packets is not necessary,
because egress path selection forces ingress path selection due to the VLAN.

I think the only point is whether we need a new xmit_hash_policy for mode=802.3ad or whether 
mode=balance-rr could be enough.

Oleg, would you mind trying the above "two VLAN" topology with mode=balance-rr
and reporting the results? For high-availability purposes, it is obviously
necessary to set up those VLANs on distinct switches.

	Nicolas




* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 14:54           ` Nicolas de Pesloüan
@ 2011-01-18 15:28             ` Oleg V. Ukhno
  2011-01-18 16:24               ` Nicolas de Pesloüan
  2011-01-18 17:56               ` Kirill Smelkov
  0 siblings, 2 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 15:28 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev,
	Sébastien Barré,
	Christophe Paasch

On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
> Le 18/01/2011 13:40, Oleg V. Ukhno a écrit :
>
> The fact that there exist many situations where it simply doesn't work,
> should not cause the idea of Oleg to be rejected.
>
> In Documentation/networking/bonding.txt, tuning tcp_reordering on
> receiving side is already documented as a possible workaround for out of
> order delivery due to load balancing of a single TCP session, using
> mode=balance-rr.
>
> This might work reasonably well in a pure LAN topology, without any
> router between both ends of the TCP session, even if this is limited to
> Linux hosts. The uses are not uncommon and not limited to iSCSI:
> - between an application server and a database server,
> - between members of a cluster, for replication purpose,
> - between a server and a backup system,
> - ...
Nicolas, thank you for your opinion - this is exactly what I mean -
iSCSI is just one particular use case, but there are many cases where
this load balancing method will be useful.
>
> Of course, for longer paths, with routers and variable RTT, we would
> need something different (possibly MultiPathTCP:
> http://datatracker.ietf.org/wg/mptcp/).
>
> I remember a topology (described by Jay, for as far as I remember),
> where two hosts were connected through two distinct VLANs. In such
> topology:
> - it is possible to detect path failure using arp monitoring instead of
> miimon.
> - changing the destination MAC address of egress packets are not
> necessary, because egress path selection force ingress path selection
> due to the VLAN.

In the case with two VLANs - yes, this shouldn't be necessary (but it needs
to be tested, I am not sure), but within one VLAN it is essential for correct
rx load striping.
>
> I think the only point is whether we need a new xmit_hash_policy for
> mode=802.3ad or whether mode=balance-rr could be enough.
Maybe, but it seems fair enough to me not to restrict this feature only
to non-LACP aggregate links; dynamic aggregation may be useful (it sometimes
helps to avoid switch misconfiguration, i.e. misconfigured slaves on the
switch side, without loss of service).
>
> Oleg, would you mind trying the above "two VLAN" topology" with
> mode=balance-rr and report any results ? For high-availability purpose,
> it's obviously necessary to setup those VLAN on distinct switches.
I'll do it, but it will take some time to set up a test environment,
maybe several days.
You mean the following topology:
           switch 1
        /           \
host A                host B
        \  switch 2 /

(I'm sure it will work as desired if each host is connected to each
switch with only one slave link; if there are more slaves on each switch,
I am unsure)?
>
> Nicolas
>
>
>


-- 
Best regards,
Oleg Ukhno.
ITO Team Lead,
Yandex LLC.





* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 15:28             ` Oleg V. Ukhno
@ 2011-01-18 16:24               ` Nicolas de Pesloüan
  2011-01-18 16:57                 ` Oleg V. Ukhno
  2011-01-18 20:24                 ` Jay Vosburgh
  2011-01-18 17:56               ` Kirill Smelkov
  1 sibling, 2 replies; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-18 16:24 UTC (permalink / raw)
  To: Oleg V. Ukhno
  Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev,
	Sébastien Barré,
	Christophe Paasch

On 18/01/2011 16:28, Oleg V. Ukhno wrote:
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> I remember a topology (described by Jay, for as far as I remember),
>> where two hosts were connected through two distinct VLANs. In such
>> topology:
>> - it is possible to detect path failure using arp monitoring instead of
>> miimon.
>> - changing the destination MAC address of egress packets are not
>> necessary, because egress path selection force ingress path selection
>> due to the VLAN.
>
> In case with two VLANs - yes, this shouldn't be necessary(but needs to
> be tested, I am not sure), but within one - it is essential for correct
> rx load striping.

Changing the destination MAC address is definitely not required if you segregate each path in a 
distinct VLAN.

             +-------------------+     +-------------------+
     +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
     |       +-------------------+     +-------------------+       |
+------+              |                         |              +------+
|host A|              |                         |              |host B|
+------+              |                         |              +------+
     |       +-------------------+     +-------------------+       |
     +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+
             +-------------------+     +-------------------+

Even in the presence of an ISL between some switches, packets sent through the host A interface 
connected to vlan 100 will only enter host B through the interface connected to vlan 100. So every 
slave of the bonding interface can use the same MAC address.

Of course, changing the destination address would be required in order to achieve ingress load 
balancing on a *single* LAN. But, as Jay noted at the beginning of this thread, this would violate 
802.3ad.

>> I think the only point is whether we need a new xmit_hash_policy for
>> mode=802.3ad or whether mode=balance-rr could be enough.
> Maybe, but it seems to me fair enough not to restrict this feature only
> to non-LACP aggregate links; dynamic aggregation may be useful(it helps
> to avoid switch misconfiguration(misconfigured slaves on switch side)
> sometimes without loss of service).

You are right, but such a LAN setup needs to be carefully designed and built. I'm not sure that an 
automatic channel aggregation system is the right way to do it. Hence my suggestion to 
use balance-rr with VLANs.
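
To make that concrete, a minimal host-side sketch of what I have in mind (interface names and the 
address are assumptions; each NIC is plugged into a switch port in a different VLAN, so the host 
itself needs no VLAN configuration):

# Create bond0 in balance-rr mode with link monitoring, then enslave one NIC
# per VLAN path (older kernels want the slaves down before they are added).
modprobe bonding mode=balance-rr miimon=100
ifconfig bond0 192.168.100.1 netmask 255.255.255.0 up
ifconfig eth0 down
ifconfig eth1 down
echo +eth0 > /sys/class/net/bond0/bonding/slaves
echo +eth1 > /sys/class/net/bond0/bonding/slaves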

>> Oleg, would you mind trying the above "two VLAN" topology" with
>> mode=balance-rr and report any results ? For high-availability purpose,
>> it's obviously necessary to setup those VLAN on distinct switches.
> I'll do it, but it will take some time to setup test environment,
> several days may be.

Thanks. For testing purposes, it is enough to set up those VLANs on a single switch if that is 
easier for you.

> You mean following topology:

See above.

> (i'm sure it will work as desired if each host is connected to each
> switch with only one slave link, if there are more slaves in each switch
> - unsure)?

If you want to use more than 2 slaves per host, then you need more than 2 VLANs. You also need to 
have the exact same number of slaves on all hosts, as egress path selection causes ingress path 
selection at the other side.

             +-------------------+     +-------------------+
     +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
     |       +-------------------+     +-------------------+       |
+------+              |                         |              +------+
|host A|              |                         |              |host B|
+------+              |                         |              +------+
   | |       +-------------------+     +-------------------+       | |
   | +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+ |
   |         +-------------------+     +-------------------+         |
   |                   |                         |                   |
   |                   |                         |                   |
   |         +-------------------+     +-------------------+         |
   +---------|switch 5 - vlan 300|-----|switch 6 - vlan 300|---------+
             +-------------------+     +-------------------+

Of course, you can add other hosts to vlan 100, 200 and 300, with the exact same configuration as 
host A or host B.

	Nicolas.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 12:40         ` Oleg V. Ukhno
  2011-01-18 14:54           ` Nicolas de Pesloüan
@ 2011-01-18 16:41           ` John Fastabend
  2011-01-18 17:21             ` Oleg V. Ukhno
  1 sibling, 1 reply; 32+ messages in thread
From: John Fastabend @ 2011-01-18 16:41 UTC (permalink / raw)
  To: Oleg V. Ukhno; +Cc: Jay Vosburgh, netdev, David S. Miller

On 1/18/2011 4:40 AM, Oleg V. Ukhno wrote:
> On 01/18/2011 06:16 AM, John Fastabend wrote:
>> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>>> 	Can somebody (John?) more knowledgable than I about dm-multipath
>>> comment on the above?
>>
>> Here I'll give it a go.
>>
>> I don't think detecting L2 link failure this way is very robust. If there
>> is a failure farther away then your immediate link your going to break
>> completely? Your bonding hash will continue to round robin the iscsi
>> packets and half them will get dropped on the floor. dm-multipath handles
>> this reasonably gracefully. Also in this bonding environment you seem to
>> be very sensitive to RTT times on the network. Maybe not bad out right but
>> I wouldn't consider this robust either.
> 
> John, I agree - this bonding mode should be used in quite limited number 
> of situations, but as for failure farther away then immediate link - 
> every bonding mode will suffer same problems in this case - bonding 
> detects only L2 failures, other is done by upper-layer mechanisms. And 
> almost all bonding modes depend on equal RTT on slaves. And, there is 
> already similar load balancing mode - balance-alb - what I did is 
> approximately the same, but for 802.3ad bonding mode and provides 
> "better"(more equal and non-conditional layser2) load striping for tx 
> and _rx_ .
> 
> I think I shouldn't mention the particular use case of this patch - when 
> I wrote it I tried to make a more general solution - my goal was "make 
> equal or near-equal load striping for TX and (most important part) RX 
> within single ethernet(layer 2) domain for  TCP transmission". This 
> bonding mode  just introduces ability to stripe rx and tx load for 
> single TCP connection between hosts inside of one ethernet segment. 
> iSCSI is just an example. It is possible to stripe load between a 
> linux-based router and linux-based web/ftp/etc server as well in the 
> same manner. I think this feature will be useful in some number of 
> network configurations.
> 
>   Also, I looked into net-next code - it seems to me that it can be 
> implemented(adapted to net-next bonding code) without any difficulties 
> and hashing function change makes no problem here.
> 
> What I've written below is just my personal experience and opinion after 
> 5 years of using Oracle +iSCSI +mpath(later - patched bonding).
> 
>  From my personal experience I just can say that most iSCSI failures are 
> caused by link failures, and also I would never send any significant 
> iSCSI traffic via router - router would be a bottleneck in this case.
> So, in my case iSCSI traffic flows within one ethernet domain and in 
> case of link failure bonding driver simply fails one slave(in case of 
> bonding) , instead of checking and failing hundreths of paths (in case 
> of mpath) and first case significantly less cpu, net and time 
> consuming(if using default mpath checker - readsector0).
> Mpath is good for me, when I use it to "merge" drbd mirrors from 
> different hosts, but for just doing simple load striping within single 
> L2 network switch  between 2 .. 16 hosts is some overkill(particularly 
> in maintaining human-readable device naming) :).
> 
> John, what is you opinion on such load balancing method in general, 
> without referring to particular use cases?
> 

This seems reasonable to me, but I'll defer to Jay on this. As long as the
limitations are documented - and it looks like they are - this may be fine.

Mostly I was interested to know what led you down this path and why MPIO
was not working as, at least, I expected it should. When I get some time I'll
see if we can address at least some of these issues. Even so, it seems like
this bonding mode may still be useful for some use cases, perhaps even
non-storage use cases.

> 
>>
>> You could tweak your scsi timeout values and fail_fast values, set the io
>> retry to 0 to cause the fail over to occur faster. I suspect you already
>> did this and still it is too slow? Maybe adding a checker in multipathd to
>> listen for link events would be fast enough. The checker could then fail
>> the path immediately.
>>
>> I'll try to address your comments from the other thread here. In general I
>> wonder if it would be better to solve the problems in dm-multipath rather than
>> add another bonding mode?
> Of course I did this, but mpath is fine when device quantity is below 
> 30-40 devices with two paths, 150-200 devices with 2+ paths can make 
> life far more interesting :)

OK admittedly this gets ugly fast.

>>
>> OVU - it is slow(I am using ISCSI for Oracle , so I need to minimize latency)
>>
>> The dm-multipath layer is adding latency? How much? If this is really true
>> maybe its best to the address the real issue here and not avoid it by
>> using the bonding layer.
> 
> I do not remember exact number now, but switching one of my databases , 
> about 2 years ago to bonding increased read throughput for the entire db 
> from 15-20 Tb/day to approximately 30-35 Tb/day (4 iscsi initiators and 
> 8 iscsi targets, 4 ethernet links for iSCSI on each host, all plugged in 
> one switch) because of "full" bandwidth use. Also, bonding usage 
> simplifies network and application setup greatly(compared to mpath)
> 
>>
>> OVU - it handles any link failures bad, because of it's command queue
>> limitation(all queued commands above 32 are discarded in case of path
>> failure, as I remember)
>>
>> Maybe true but only link failures with the immediate peer are handled
>> with a bonding strategy. By working at the block layer we can detect
>> failures throughout the path. I would need to look into this again I
>> know when we were looking at this sometime ago there was some talk about
>> improving this behavior. I need to take some time to go back through the
>> error recovery stuff to remember how this works.
>>
>> OVU - it performs very bad when there are many devices and many paths(I was
>> unable to utilize more that 2Gbps of 4 even with 100 disks with 4 paths
>> per each disk)
> 
> Well, I think that behavior can be explained in such a way:
> when balancing by I/Os number per path(rr_min_io), and there is a huge 
> number of devices, mpath is doing load-balaning per-device, and it is 
> not possible to quarantee equal device use for all devices, so there 
> will be imbalance over network interface(mpath is unaware of it's 
> existence, etc), and it is likely it becomes more imbalanced when there 
> are many devices. Also, counting I/O's for many devices and paths 
> consumes some CPU resources and also can cause excessive context switches.
> 

Hmm, I'll get something set up here and see if this is the case.

>>
>> Hmm well that seems like something is broken. I'll try this setup when
>> I get some time next few days. This really shouldn't be the case dm-multipath
>> should not add a bunch of extra latency or effect throughput significantly.
>> By the way what are you seeing without mpio?
> 
> And one more obsevation from my 2-years old tests - reading device(using 
> dd) (rhel 5 update 1 kernel, ramdisk via ISCSI via loopback ) as mpath 
> device with single path was done at approximately 120-150mb/s, and same 
> test on non-mpath device at 800-900mb/s. Here I am quite sure, it was a 
> kind of revelation to me that time.
> 

Similarly I'll have a look. Thanks for the info.

>>
>> Thanks,
>> John
>>
> 
> 
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 16:24               ` Nicolas de Pesloüan
@ 2011-01-18 16:57                 ` Oleg V. Ukhno
  2011-01-18 20:24                 ` Jay Vosburgh
  1 sibling, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 16:57 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev,
	Sébastien Barré,
	Christophe Paasch

On 01/18/2011 07:24 PM, Nicolas de Pesloüan wrote:
> On 18/01/2011 16:28, Oleg V. Ukhno wrote:
>> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>>> I remember a topology (described by Jay, for as far as I remember),
>>> where two hosts were connected through two distinct VLANs. In such
>>> topology:
>>> - it is possible to detect path failure using arp monitoring instead of
>>> miimon.
>>> - changing the destination MAC address of egress packets are not
>>> necessary, because egress path selection force ingress path selection
>>> due to the VLAN.
>>
>> In case with two VLANs - yes, this shouldn't be necessary(but needs to
>> be tested, I am not sure), but within one - it is essential for correct
>> rx load striping.
>
> Changing the destination MAC address is definitely not required if you
> segregate each path in a distinct VLAN.
Yes, such an L2 network topology should provide the necessary high availability 
and load striping without the need to change MAC addresses. But it is more 
difficult to maintain and to understand, in my opinion (when there are 
just several configurations like this it's OK, but what about 50 or 
more?) - this is why I've chosen 802.3ad.

> Even in the present of ISL between some switches, packet sent through
> host A interface connected to vlan 100 will only enter host B using the
> interface connected to vlan 100. So every slaves of the bonding
> interface can use the same MAC address.
>
> Of course, changing the destination address would be required in order
> to achieve ingress load balancing on a *single* LAN. But, as Jay noted
> at the beginning of this thread, this would violate 802.3ad.
>

I think receiving the same MAC address on different ports of the same host 
will just make any troubleshooting much harder, won't it? With different 
MACs it usually takes little time to find out where the problem is.
I think that implementing a choice between using a single MAC address 
in the etherchannel and using the slaves' real MAC addresses won't 
harm anything for either the 802.3ad or balance-rr modes, but will 
simplify their usage without doing any evil, when documented properly.

>
> You are right, but such LAN setup need to be carefully designed and
> built. I'm not sure that an automatic channel aggregation system is the
> right way to do it. Hence the reason why I suggest to use balance-rr
> with VLANs.
>
>>> Oleg, would you mind trying the above "two VLAN" topology" with
>>> mode=balance-rr and report any results ? For high-availability purpose,
>>> it's obviously necessary to setup those VLAN on distinct switches.
>> I'll do it, but it will take some time to setup test environment,
>> several days may be.
>
> Thanks. For testing purpose, it is enough to setup those VLAN on a
> single switch if it is easier for you to do.
Well, I'll do it with 2 switches :)
>
>> You mean following topology:
>
> See above.
>
>> (i'm sure it will work as desired if each host is connected to each
>> switch with only one slave link, if there are more slaves in each switch
>> - unsure)?
>
> If you want to use more than 2 slaves per host, then you need more than
> 2 VLAN.

That's what I don't like in this solution. Within one LAN it is simpler 
and requires less configuration effort.

> You also need to have the exact same number of slaves on all
> hosts, as egress path selection cause ingress path selection at the
> other side.
>

Well, and here's one difference from bonding with my patch. With my 
patch applied, it is not required to have an equal number of slaves; it 
is enough to have an *even* number of slaves, and this almost always (so 
far I haven't seen the opposite) guarantees good rx (ingress) load striping.

>
> Nicolas.
>


-- 
Best regards,
Head of Operations
for Commercial and Financial Services,
Yandex LLC

Oleg Ukhno



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 16:41           ` John Fastabend
@ 2011-01-18 17:21             ` Oleg V. Ukhno
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 17:21 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jay Vosburgh, netdev, David S. Miller

On 01/18/2011 07:41 PM, John Fastabend wrote:

>>
>> John, what is you opinion on such load balancing method in general,
>> without referring to particular use cases?
>>
>
> This seems reasonable to me, but I'll defer to Jay on this. As long as the
> limitations are documented and it looks like they are this may be fine.
>
> Mostly I was interested to know what led you down this path and why MPIO
> was not working as at least I expected it should. When I get some time I'll
> see if we can address at least some of these issues. Even so it seems like
> this bonding mode may still be useful for some use cases, perhaps even
> non-storage use cases.
>
>>

I was addressing several problems with my patch:
  - I was unable to consume the whole bandwidth with multipath - with four 
1Gbit "paths" it was slightly above 2Gbit/s.
  - Link failures quite often caused disk failures, which led to Oracle 
ASM rebalance, especially with versions below 11.
  - It is not always possible to autogenerate multipathd.conf with 
human-readable device names because of iSCSI session id and SCSI device 
bus/channel/etc. mismatch (usually it differs by 1, but not necessarily); 
with the bonding solution I can just look into /dev/disk/by-path to find 
out where a device, let's say /dev/sdab, is physically located (it's just 
a free bonus I've got, so to say :)). For example:
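
# Illustration only - the device name is an assumption.  The by-path
# symlink name encodes the target portal and LUN, so it shows which
# physical path /dev/sdab sits behind.
ls -l /dev/disk/by-path/ | grep -w sdab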



-- 
Best regards,
Head of Operations
for Commercial and Financial Services,
Yandex LLC

Oleg Ukhno



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 15:28             ` Oleg V. Ukhno
  2011-01-18 16:24               ` Nicolas de Pesloüan
@ 2011-01-18 17:56               ` Kirill Smelkov
  1 sibling, 0 replies; 32+ messages in thread
From: Kirill Smelkov @ 2011-01-18 17:56 UTC (permalink / raw)
  To: Oleg V. Ukhno
  Cc: Nicolas de Pesloüan, John Fastabend, Jay Vosburgh,
	David S. Miller, netdev, Sébastien Barré,
	Christophe Paasch

On Tue, Jan 18, 2011 at 06:28:48PM +0300, Oleg V. Ukhno wrote:
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> On 18/01/2011 13:40, Oleg V. Ukhno wrote:
[...]

>> Oleg, would you mind trying the above "two VLAN" topology" with
>> mode=balance-rr and report any results ? For high-availability purpose,
>> it's obviously necessary to setup those VLAN on distinct switches.
> I'll do it, but it will take some time to setup test environment,  
> several days may be.
> You mean following topology:
>           switch 1
>        /           \
> host A                host B
>        \  switch 2 /
>

FYI: I'm in the process of developing a new redundancy mode for bonding,
and while at it, the following script may be useful for you too, so
that bonding testing can be done entirely on one host:

http://repo.or.cz/w/linux-2.6/kirr.git/blob/refs/heads/x/etherdup:/tools/bonding/mk-tap-loops.sh


Sorry for maybe being offtopic,
Kirill

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 16:24               ` Nicolas de Pesloüan
  2011-01-18 16:57                 ` Oleg V. Ukhno
@ 2011-01-18 20:24                 ` Jay Vosburgh
  2011-01-18 21:20                   ` Nicolas de Pesloüan
                                     ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-18 20:24 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev,
	Sébastien Barré,
	Christophe Paasch

Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:

>On 18/01/2011 16:28, Oleg V. Ukhno wrote:
>> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>>> I remember a topology (described by Jay, for as far as I remember),
>>> where two hosts were connected through two distinct VLANs. In such
>>> topology:
>>> - it is possible to detect path failure using arp monitoring instead of
>>> miimon.

	I don't think this is true, at least not for the case of
balance-rr.  Using ARP monitoring with any sort of load balance scheme
is problematic, because the replies may be balanced to a different slave
than the sender.

>>> - changing the destination MAC address of egress packets are not
>>> necessary, because egress path selection force ingress path selection
>>> due to the VLAN.

	This is true, with one comment: Oleg's proposal we're discussing
changes the source MAC address of outgoing packets, not the destination.
The purpose being to manipulate the src-mac balancing algorithm on the
switch when the packets are hashed at the egress port channel group.
The packets (for a particular destination) all bear the same destination
MAC, but (as I understand it) are manually assigned tailored source MAC
addresses that hash to sequential values.
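
	A toy illustration of that idea (this is not any particular
switch's algorithm; many switches simply fold the source MAC down to a
few low order bits, modulo the number of channel members):

# Per-slave source MACs whose last byte runs 0..3 land on distinct
# members of a 4-port channel group that balances on src-mac.
NLINKS=4
for last in 0 1 2 3; do
	printf 'src 02:xx:xx:xx:xx:%02x -> channel member %d\n' \
		"$last" $(( last % NLINKS ))
done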

>> In case with two VLANs - yes, this shouldn't be necessary(but needs to
>> be tested, I am not sure), but within one - it is essential for correct
>> rx load striping.
>
>Changing the destination MAC address is definitely not required if you
>segregate each path in a distinct VLAN.
>
>            +-------------------+     +-------------------+
>    +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
>    |       +-------------------+     +-------------------+       |
>+------+              |                         |              +------+
>|host A|              |                         |              |host B|
>+------+              |                         |              +------+
>    |       +-------------------+     +-------------------+       |
>    +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+
>            +-------------------+     +-------------------+
>
>Even in the present of ISL between some switches, packet sent through host
>A interface connected to vlan 100 will only enter host B using the
>interface connected to vlan 100. So every slaves of the bonding interface
>can use the same MAC address.

	That's true.  The big problem with the "VLAN tunnel" approach is
that it's not tolerant of link failures.

>Of course, changing the destination address would be required in order to
>achieve ingress load balancing on a *single* LAN. But, as Jay noted at the
>beginning of this thread, this would violate 802.3ad.
>
>>> I think the only point is whether we need a new xmit_hash_policy for
>>> mode=802.3ad or whether mode=balance-rr could be enough.
>> Maybe, but it seems to me fair enough not to restrict this feature only
>> to non-LACP aggregate links; dynamic aggregation may be useful(it helps
>> to avoid switch misconfiguration(misconfigured slaves on switch side)
>> sometimes without loss of service).
>
>You are right, but such LAN setup need to be carefully designed and
>built. I'm not sure that an automatic channel aggregation system is the
>right way to do it. Hence the reason why I suggest to use balance-rr with
>VLANs.

	The "VLAN tunnel" approach is a derivative of an actual switch
topology that balance-rr was originally intended for, many moons ago.
This is described in the current bonding.txt; I'll cut & paste a bit
here:

12.2 Maximum Throughput in a Multiple Switch Topology
-----------------------------------------------------

        Multiple switches may be utilized to optimize for throughput
when they are configured in parallel as part of an isolated network
between two or more systems, for example:

                       +-----------+
                       |  Host A   | 
                       +-+---+---+-+
                         |   |   |
                +--------+   |   +---------+
                |            |             |
         +------+---+  +-----+----+  +-----+----+
         | Switch A |  | Switch B |  | Switch C |
         +------+---+  +-----+----+  +-----+----+
                |            |             |
                +--------+   |   +---------+
                         |   |   |
                       +-+---+---+-+
                       |  Host B   | 
                       +-----------+

        In this configuration, the switches are isolated from one
another.  One reason to employ a topology such as this is for an
isolated network with many hosts (a cluster configured for high
performance, for example), using multiple smaller switches can be more
cost effective than a single larger switch, e.g., on a network with 24
hosts, three 24 port switches can be significantly less expensive than
a single 72 port switch.

        If access beyond the network is required, an individual host
can be equipped with an additional network device connected to an
external network; this host then additionally acts as a gateway.

	[end of cut]

	This was described to me some time ago as an early usage model
for balance-rr using multiple 10 Mb/sec switches.  It has the same link
monitoring problems as the "VLAN tunnel" approach, although modern
switches with "trunk failover" type of functionality may be able to
mitigate the problem.

>>> Oleg, would you mind trying the above "two VLAN" topology" with
>>> mode=balance-rr and report any results ? For high-availability purpose,
>>> it's obviously necessary to setup those VLAN on distinct switches.
>> I'll do it, but it will take some time to setup test environment,
>> several days may be.
>
>Thanks. For testing purpose, it is enough to setup those VLAN on a single
>switch if it is easier for you to do.
>
>> You mean following topology:
>
>See above.
>
>> (i'm sure it will work as desired if each host is connected to each
>> switch with only one slave link, if there are more slaves in each switch
>> - unsure)?
>
>If you want to use more than 2 slaves per host, then you need more than 2
>VLAN. You also need to have the exact same number of slaves on all hosts,
>as egress path selection cause ingress path selection at the other side.
>
>            +-------------------+     +-------------------+
>    +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
>    |       +-------------------+     +-------------------+       |
>+------+              |                         |              +------+
>|host A|              |                         |              |host B|
>+------+              |                         |              +------+
>  | |       +-------------------+     +-------------------+       | |
>  | +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+ |
>  |         +-------------------+     +-------------------+         |
>  |                   |                         |                   |
>  |                   |                         |                   |
>  |         +-------------------+     +-------------------+         |
>  +---------|switch 5 - vlan 300|-----|switch 6 - vlan 300|---------+
>            +-------------------+     +-------------------+
>
>Of course, you can add others host to vlan 100, 200 and 300, with the
>exact same configuration at host A or host B.

	This is essentially the same thing as the diagram I pasted in up
above, except with VLANs and an additional layer of switches between the
hosts.  The multiple VLANs take the place of multiple discrete switches.

	This could also be accomplished via bridge groups (in
Cisco-speak).  For example, instead of VLAN 100, that could be bridge
group X, VLAN 200 is bridge group Y, and so on.

	Neither the VLAN nor the bridge group methods handle link
failures very well; if, in the above diagram, the link from "switch 2
vlan 100" to "host B" fails, there's no way for host A to know to stop
sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
to "host B."

	One item I'd like to see some more data on is the level of
reordering at the receiver in Oleg's system.

	One of the reasons round robin isn't as useful as it once was is
due to the rise of NAPI and interrupt coalescing, both of which will
tend to increase the reordering of packets at the receiver when the
packets are evenly striped.  In the old days, it was one interrupt, one
packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
packets striped across interfaces, this will tend to increase
reordering.  E.g.,

	slave 1		slave 2		slave 3
	Packet 1	P2		P3
	P4		P5		P6
	P7		P8		P9

	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.

	I haven't done much testing with this lately, but I suspect this
behavior hasn't really changed.  Raising the tcp_reordering sysctl value
can mitigate this somewhat (by making TCP more tolerant of this), but
that doesn't help non-TCP protocols.
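
	For anyone who wants to experiment, the knob in question is the
following sysctl (the value here is only an example):

# Let TCP tolerate more reordering before treating it as loss (default is 3).
sysctl -w net.ipv4.tcp_reordering=10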

	Barring evidence to the contrary, I presume that Oleg's system
delivers out of order at the receiver.  That's not automatically a
reason to reject it, but this entire proposal is sufficiently complex to
configure that very explicit documentation will be necessary.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 20:24                 ` Jay Vosburgh
@ 2011-01-18 21:20                   ` Nicolas de Pesloüan
  2011-01-19  1:45                     ` Jay Vosburgh
  2011-01-18 22:22                   ` Oleg V. Ukhno
  2011-01-19 16:13                   ` Oleg V. Ukhno
  2 siblings, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-18 21:20 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev,
	Sébastien Barré,
	Christophe Paasch

On 18/01/2011 21:24, Jay Vosburgh wrote:
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:

>>>> - it is possible to detect path failure using arp monitoring instead of
>>>> miimon.
>
> 	I don't think this is true, at least not for the case of
> balance-rr.  Using ARP monitoring with any sort of load balance scheme
> is problematic, because the replies may be balanced to a different slave
> than the sender.

Can't we achieve the expected arp monitoring by using the exact same artifice that Oleg suggested: 
using a different source MAC per slave for arp monitoring, so that the return path matches the sending path?

>>>> - changing the destination MAC address of egress packets are not
>>>> necessary, because egress path selection force ingress path selection
>>>> due to the VLAN.
>
> 	This is true, with one comment: Oleg's proposal we're discussing
> changes the source MAC address of outgoing packets, not the destination.
> The purpose being to manipulate the src-mac balancing algorithm on the
> switch when the packets are hashed at the egress port channel group.
> The packets (for a particular destination) all bear the same destination
> MAC, but (as I understand it) are manually assigned tailored source MAC
> addresses that hash to sequential values.

Yes, you're right.

> 	That's true.  The big problem with the "VLAN tunnel" approach is
> that it's not tolerant of link failures.

Yes, except if we find a way to make arp monitoring reliable in a load balancing situation.

[snip]

> 	This is essentially the same thing as the diagram I pasted in up
> above, except with VLANs and an additional layer of switches between the
> hosts.  The multiple VLANs take the place of multiple discrete switches.
>
> 	This could also be accomplished via bridge groups (in
> Cisco-speak).  For example, instead of VLAN 100, that could be bridge
> group X, VLAN 200 is bridge group Y, and so on.
>
> 	Neither the VLAN nor the bridge group methods handle link
> failures very well; if, in the above diagram, the link from "switch 2
> vlan 100" to "host B" fails, there's no way for host A to know to stop
> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
> to "host B."

Can't we imagine "arp monitoring" the destination MAC address of host B, on both paths? That way, 
host A would know that a given path is down, because the return path would be the same. The target 
host should send the reply on the slave on which it received the request, which is the normal way 
to reply to an arp request.

> 	One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.

This is exactly the reason why I asked Oleg to do some tests with balance-rr. I cannot find a good 
reason for a possibly new xmit_hash_policy to provide better throughput than the current balance-rr. 
If the throughput increases by, let's say, less than 20%, whatever the tcp_reordering value, then it 
is probably a dead end.

> 	One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped.  In the old days, it was one interrupt, one
> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
> packets striped across interfaces, this will tend to increase
> reordering.  E.g.,
>
> 	slave 1		slave 2		slave 3
> 	Packet 1	P2		P3
> 	P4		P5		P6
> 	P7		P8		P9
>
> 	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.

Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 and P7, P8, P9 on slave 3, possibly 
by sending grouped packets, changing the sending slave every N packets instead of every packet? I 
think we already discussed this possibility a few months or years ago on the bonding-devel ML. As far 
as I remember, the idea was not developed because it was not easy to find the number of packets 
to send through the same slave. Anyway, this might help reduce out-of-order delivery.

> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.

Yes, and this is already true for some bonding modes and in particular for balance-rr.

	Nicolas.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 20:24                 ` Jay Vosburgh
  2011-01-18 21:20                   ` Nicolas de Pesloüan
@ 2011-01-18 22:22                   ` Oleg V. Ukhno
  2011-01-19 16:13                   ` Oleg V. Ukhno
  2 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 22:22 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Nicolas de Pesloüan, John Fastabend, David S. Miller,
	netdev, Sébastien Barré,
	Christophe Paasch



Jay Vosburgh wrote:
> 
> 	One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.
> 
> 	One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped.  In the old days, it was one interrupt, one
> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
> packets striped across interfaces, this will tend to increase
> reordering.  E.g.,
> 
> 	slave 1		slave 2		slave 3
> 	Packet 1	P2		P3
> 	P4		P5		P6
> 	P7		P8		P9
> 
> 	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
> 
> 	I haven't done much testing with this lately, but I suspect this
> behavior hasn't really changed.  Raising the tcp_reordering sysctl value
> can mitigate this somewhat (by making TCP more tolerant of this), but
> that doesn't help non-TCP protocols.
> 
> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.

Jay, here are some network stats from one of my iSCSI targets with an avg 
load of 1.5-2.5 Gbit/sec (4 slaves in the etherchannel). Not perfect and not 
very "clean" (there are more interfaces on the host than these 4).
[root@<somehost> ~]# netstat -st 

IcmpMsg:
     InType0: 6
     InType3: 1872
     InType8: 60557
     InType11: 23
     OutType0: 60528
     OutType3: 1755
     OutType8: 6
Tcp:
     1298909 active connections openings
     61090 passive connection openings
     2374 failed connection attempts
     62781 connection resets received
     3 connections established
     1268233942 segments received
     1198020318 segments send out
     18939618 segments retransmited
     0 bad segments received.
     23643 resets sent
TcpExt:
     294935 TCP sockets finished time wait in fast timer
     472 time wait sockets recycled by time stamp
     819481 delayed acks sent
     295332 delayed acks further delayed because of locked socket
     Quick ack mode was activated 30616377 times
     3516920 packets directly queued to recvmsg prequeue.
     4353 packets directly received from backlog
     44873453 packets directly received from prequeue
     1442812750 packets header predicted
     1077442 packets header predicted and directly queued to user
     2123453975 acknowledgments not containing data received
     2375328274 predicted acknowledgments
     8462439 times recovered from packet loss due to fast retransmit
     Detected reordering 19203 times using reno fast retransmit
     Detected reordering 100 times using time stamp
     3429 congestion windows fully recovered
     11760 congestion windows partially recovered using Hoe heuristic
     398 congestion windows recovered after partial ack
     0 TCP data loss events
     3671 timeouts after reno fast retransmit
     6 timeouts in loss state
     18919118 fast retransmits
     11637 retransmits in slow start
     1756 other TCP timeouts
     TCPRenoRecoveryFail: 3187
     62779 connections reset due to early user close
IpExt:
     InBcastPkts: 512616
[root@<somehost> ~]# uptime
  00:35:49 up 42 days,  8:27,  1 user,  load average: 3.70, 3.80, 4.07
[root@<somehost> ~]# sysctl -a|grep tcp_reo
net.ipv4.tcp_reordering = 3

I will get back with "clean" results after I set up the test system tomorrow.
TcpExt stats from other hosts are similar.

> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 

-- 
Best regards,
Oleg Ukhno

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 21:20                   ` Nicolas de Pesloüan
@ 2011-01-19  1:45                     ` Jay Vosburgh
  0 siblings, 0 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-19  1:45 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev,
	Sébastien Barré,
	Christophe Paasch

Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:

>On 18/01/2011 21:24, Jay Vosburgh wrote:
>> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:
>
>>>>> - it is possible to detect path failure using arp monitoring instead of
>>>>> miimon.
>>
>> 	I don't think this is true, at least not for the case of
>> balance-rr.  Using ARP monitoring with any sort of load balance scheme
>> is problematic, because the replies may be balanced to a different slave
>> than the sender.
>
>Cannot we achieve the expected arp monitoring by using the exact same
>artifice that Oleg suggested: using a different source MAC per slave for
>arp monitoring, so that return path match sending path ?

	It's not as simple with ARP, because it's a control protocol
that has side effects.

	First, the MAC level broadcast ARP probes from bonding would
have to be round robined in such a manner that they regularly arrive at
every possible slave.  A single broadcast won't be sent to more than one
member of the channel group by the switch.  We can't do multiple unicast
ARPs with different destination MAC addresses, because we'd have to
track all of those MACs somewhere (keep track of the MAC of every slave
on each peer we're monitoring).  I suspect that snooping switches will
get all whiny about port flapping and the like.

	We could have a separate IP address per slave, used only for
link monitoring, but that's a huge headache.  Actually, it's a lot like
the multi-link stuff I've been working on (and posted RFC of in
December), but that doesn't use ARP (it segregates slaves by IP subnet,
and balances at the IP layer).  Basically, you need a overlaying active
protocol to handle the map of which slave goes where (which multi-link
has).

	So, maybe we have the ARP replies massaged such that the
Ethernet header source and ARP target hardware address don't match.

	So the probes from bonding currently look like this:

MAC-A > ff:ff:ff:ff:ff:ff Request who-has 10.0.4.2 tell 10.0.1.1

	Where MAC-A is the bond's MAC address.  And the replies now look
like this:

MAC-B > MAC-A, Reply 10.0.4.2 is-at MAC-B

	Where MAC-B is the MAC of the peer's bond.  The massaged replies
would be of the form:

MAC-C > MAC-A, Reply 10.0.4.2 is-at MAC-B

	where MAC-C is the slave "permanent" address (which is really a
fake address to manipulate the switch's hash), and MAC-B is whatever the
real MAC of the bond is.  I don't think we can mess with MAC-B in the
reply (the "is-at" part), because that would update ARP tables and such.
If we change MAC-A in the reply, they're liable to be filtered out.  I
really don't know if putting MAC-C in there as the source would confuse
snooping switches or not.

	One other thought I had while chewing on this is to run the LACP
protocol exchange between the bonding peers directly, instead of between
each bond and each switch.  I have no idea if this would work or not,
but the theory would look something like the "VLAN tunnel" topology for
the switches, but the bonds at the ends are configured for 802.3ad.  To
make this work, bonding would have to be able to run multiple LACP
instances (one for each bonding peer on the network) over a single
aggregator (or permit slaves to belong to multiple active aggregators).
This would basically be the same as the multi-link business, except
using LACP for the active protocol to build the map.

	A distinguished correspondent (who may confess if he so chooses)
also suggested 802.2 LLC XID or TEST frames, which have been discussed
in the past.  Those don't have side effects, but I'm not sure if either
is technically feasible, or if we really want bonding to have a
dependency on llc.  They would also only interop with hosts that respond
to the XID or TEST.  I haven't thought about this in detail for a number
of years, but I think the LLC DSAP / SSAP space is pretty small.

>>>>> - changing the destination MAC address of egress packets are not
>>>>> necessary, because egress path selection force ingress path selection
>>>>> due to the VLAN.
>>
>> 	This is true, with one comment: Oleg's proposal we're discussing
>> changes the source MAC address of outgoing packets, not the destination.
>> The purpose being to manipulate the src-mac balancing algorithm on the
>> switch when the packets are hashed at the egress port channel group.
>> The packets (for a particular destination) all bear the same destination
>> MAC, but (as I understand it) are manually assigned tailored source MAC
>> addresses that hash to sequential values.
>
>Yes, you're right.
>
>> 	That's true.  The big problem with the "VLAN tunnel" approach is
>> that it's not tolerant of link failures.
>
>Yes, except if we find a way to make arp monitoring reliable in load balancing situation.
>
>[snip]
>
>> 	This is essentially the same thing as the diagram I pasted in up
>> above, except with VLANs and an additional layer of switches between the
>> hosts.  The multiple VLANs take the place of multiple discrete switches.
>>
>> 	This could also be accomplished via bridge groups (in
>> Cisco-speak).  For example, instead of VLAN 100, that could be bridge
>> group X, VLAN 200 is bridge group Y, and so on.
>>
>> 	Neither the VLAN nor the bridge group methods handle link
>> failures very well; if, in the above diagram, the link from "switch 2
>> vlan 100" to "host B" fails, there's no way for host A to know to stop
>> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
>> to "host B."
>
>Can't we imagine to "arp monitor" the destination MAC address of host B,
>on both paths ? That way, host A would know that a given path is down,
>because return path would be the same. The target host should send the
>reply on the slave on which it receive the request, which is the normal
>way to reply to arp request.

	I think you can only get away with this if each slave set (where
a "set" is one slave from each bond that's attending our little load
balancing party) is on a separate switch domain, and the switch domains
are not bridged together.  Otherwise the switches will flap their MAC
tables as they update from each probe that they see.

	As for the reply going out the same slave, to do that, bonding
would have to intercept the ARP traffic (because ARPs arriving on slaves
are normally assigned to the bond itself, not the slave) and track and
tweak them.

	Lastly, bonding would again have to maintain a map, showing
which destinations are reachable via which set of slaves.  All peer
systems (needing to have per-slave link monitoring) would have to be ARP
targets.

>> 	One item I'd like to see some more data on is the level of
>> reordering at the receiver in Oleg's system.
>
>This is exactly the reason why I asked Oleg to do some test with
>balance-rr. I cannot find a good reason for a possibly new
>xmit_hash_policy to provide better throughput than current balance-rr. If
>the throughput increase by, let's say, less than 20%, whatever
>tcp_reordering value, then it is probably a dead end way.

	Well, the point of making a round robin xmit_hash_policy isn't
that the throughput will be better than the existing round robin, it's
to make round-robin accessible to the 802.3ad mode.

>> 	One of the reasons round robin isn't as useful as it once was is
>> due to the rise of NAPI and interrupt coalescing, both of which will
>> tend to increase the reordering of packets at the receiver when the
>> packets are evenly striped.  In the old days, it was one interrupt, one
>> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
>> packets striped across interfaces, this will tend to increase
>> reordering.  E.g.,
>>
>> 	slave 1		slave 2		slave 3
>> 	Packet 1	P2		P3
>> 	P4		P5		P6
>> 	P7		P8		P9
>>
>> 	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
>> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
>
>Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 et P7,
>P8, P9 on slave3, possibly by sending grouped packets, changing the
>sending slave every N packets instead of every packet ? I think we already
>discussed this possibility a few months or years ago in bonding-devel
>ML. For as far as I remember, the idea was not developed because it was
>not easy to find the number of packets to send through the same
>slave. Anyway, this might help reduce out of order delivery.

	Yes, this came up several years ago, and, basically, there's no
way to do it perfectly.  An interesting experiment would be to see if
sending groups (perhaps close to the NAPI weight of the receiver) would
reduce reordering.
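
	Roughly, the distribution would then look like this (toy sketch;
GROUP would have to be tuned, perhaps toward the receiver's NAPI weight):

# Round robin in bursts: packet i goes to slave (i / GROUP) % NSLAVES, so a
# receiver draining a whole burst per poll sees them in order per slave.
NSLAVES=3
GROUP=4
for i in $(seq 0 11); do
	echo "packet $i -> slave $(( (i / GROUP) % NSLAVES ))"
done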

>> 	Barring evidence to the contrary, I presume that Oleg's system
>> delivers out of order at the receiver.  That's not automatically a
>> reason to reject it, but this entire proposal is sufficiently complex to
>> configure that very explicit documentation will be necessary.
>
>Yes, and this is already true for some bonding modes and in particular for balance-rr.

	I don't think any modes other than balance-rr will deliver out
of order normally.  It can happen during edge cases, e.g., alb
rebalance, or the layer3+4 hash with IP fragments, but I'd expect those
to be at a much lower rate than what round robin causes.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 20:24                 ` Jay Vosburgh
  2011-01-18 21:20                   ` Nicolas de Pesloüan
  2011-01-18 22:22                   ` Oleg V. Ukhno
@ 2011-01-19 16:13                   ` Oleg V. Ukhno
  2011-01-19 20:12                     ` Nicolas de Pesloüan
  2 siblings, 1 reply; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-19 16:13 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Nicolas de Pesloüan, John Fastabend, David S. Miller,
	netdev, Sébastien Barré,
	Christophe Paasch

On 01/18/2011 11:24 PM, Jay Vosburgh wrote:
> 	I haven't done much testing with this lately, but I suspect this
> behavior hasn't really changed.  Raising the tcp_reordering sysctl value
> can mitigate this somewhat (by making TCP more tolerant of this), but
> that doesn't help non-TCP protocols.
>
> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.
>
> 	-J
>
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>

Jay,
I have run some tests with the patched 802.3ad bonding for now.
Test system configuration:
2 identical servers with 82576 Gigabit ET2 Quad Port Server Adapter 
Low Profile, PCI-E (igb), connected to one switch (Cisco 2960) with all 4 
ports; all ports on each host are aggregated into a single etherchannel 
using 802.3ad (w/patch).
kernel version: vanilla 2.6.32(tcp_reordering - default setting)
igb version - 2.3.4, parameters - default
Ran two tests:
1) unidirectional test using iperf
2) Bidirectional test, iperf client running with 8 threads
One remark:
Decreasing the number of slaves results in higher per-slave utilization 
(for example, with 2 slaves the iperf test will consume almost the full 
bandwidth available in both directions; test parameters are the same, test 
time reduced to 150 sec):
[SUM]  0.0-150.3 sec  34640 MBytes  1933 Mbits/sec
[SUM]  0.0-150.5 sec  34875 MBytes  1944 Mbits/sec
)
For me (my use case) the risk of some bandwidth loss with 4 slaves is 
acceptable, but my suggestion is that building an aggregate link with more 
than 4 slaves is inadequate. For 2 slaves this solution should work with 
minimal overhead of any kind. The TCP reordering and retransmit numbers are, 
in my opinion, acceptable for most use cases of such a bonding mode that I 
can imagine.

What is your opinion on my idea and the patch?

I will come back with results for the VLAN tunneling case, if this is 
necessary (Nicolas, shall I do that test? I think it will show similar 
performance results).

Below are the test results (sorry for the huge amount of text):

Iperf results:
Test 1:
Receiver:
[root@target2 ~]# iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 
9999 -t 300
------------------------------------------------------------
Client connecting to 192.168.111.128, TCP port 9999
Binding to local address 192.168.111.129
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
[  3] local 192.168.111.129 port 9999 connected with 192.168.111.128 
port 9999
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-300.0 sec  141643 MBytes  3961 Mbits/sec
Sender:
[root@target1 ~]# iperf -f m -s -B 192.168.111.128 -p 9999 -t 300
------------------------------------------------------------
Server listening on TCP port 9999
Binding to local address 192.168.111.128
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
[  4] local 192.168.111.128 port 9999 connected with 192.168.111.129 
port 9999
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-300.1 sec  141643 MBytes  3959 Mbits/sec
^C[root@target1 ~]#

Test 2:
former "sender" side:
[SUM]  0.0-300.2 sec  111541 MBytes  3117 Mbits/sec
[SUM]  0.0-300.4 sec  110515 MBytes  3086 Mbits/sec
former "receiver" side:
[SUM]  0.0-300.1 sec  110515 MBytes  3089 Mbits/sec
[SUM]  0.0-300.3 sec  111541 MBytes  3116 Mbits/sec




Netstat's:

netstat -st (sender, before 1st test)
[root@target1 ~]# netstat -st
IcmpMsg:
     InType3: 5
     InType8: 3
     OutType0: 3
     OutType3: 4
Tcp:
     26 active connections openings
     7 passive connection openings
     5 failed connection attempts
     1 connection resets received
     4 connections established
     349 segments received
     330 segments send out
     7 segments retransmited
     0 bad segments received.
     5 resets sent
UdpLite:
TcpExt:
     10 TCP sockets finished time wait in slow timer
     8 delayed acks sent
     56 packets directly queued to recvmsg prequeue.
     40 packets directly received from backlog
     317 packets directly received from prequeue
     78 packets header predicted
     36 packets header predicted and directly queued to user
     41 acknowledgments not containing data received
     134 predicted acknowledgments
     0 TCP data loss events
     4 other TCP timeouts
     2 connections reset due to unexpected data
     TCPSackShiftFallback: 1
IpExt:
     InMcastPkts: 74
     OutMcastPkts: 62
     InOctets: 76001
     OutOctets: 82234
     InMcastOctets: 13074
     OutMcastOctets: 10428

netstat -st (sender, after 1st test)
[root@target1 ~]netstat -st
IcmpMsg:
     InType3: 5
     InType8: 7
     OutType0: 7
     OutType3: 4
Tcp:
     71 active connections openings
     15 passive connection openings
     5 failed connection attempts
     4 connection resets received
     4 connections established
     16674161 segments received
     16674113 segments send out
     7 segments retransmited
     0 bad segments received.
     5 resets sent
UdpLite:
TcpExt:
     31 TCP sockets finished time wait in slow timer
     13 delayed acks sent
     42 delayed acks further delayed because of locked socket
     Quick ack mode was activated 297 times
     239 packets directly queued to recvmsg prequeue.
     2388220516 packets directly received from backlog
     595165 packets directly received from prequeue
     16954 packets header predicted
     445 packets header predicted and directly queued to user
     129 acknowledgments not containing data received
     322 predicted acknowledgments
     0 TCP data loss events
     4 other TCP timeouts
     297 DSACKs sent for old packets
     2 connections reset due to unexpected data
     TCPSackShiftFallback: 1
IpExt:
     InMcastPkts: 86
     OutMcastPkts: 68
     InBcastPkts: 2
     InOctets: -930738047
     OutOctets: 1321936884
     InMcastOctets: 13434
     OutMcastOctets: 10620
     InBcastOctets: 483

netstat -st (receiver, before 1st test)
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 5
     InType8: 3
     OutType0: 3
     OutType3: 4
Tcp:
     23 active connections openings
     6 passive connection openings
     3 failed connection attempts
     1 connection resets received
     3 connections established
     309 segments received
     264 segments send out
     7 segments retransmited
     0 bad segments received.
     6 resets sent
UdpLite:
TcpExt:
     10 TCP sockets finished time wait in slow timer
     5 delayed acks sent
     74 packets directly queued to recvmsg prequeue.
     16 packets directly received from backlog
     377 packets directly received from prequeue
     62 packets header predicted
     35 packets header predicted and directly queued to user
     32 acknowledgments not containing data received
     106 predicted acknowledgments
     0 TCP data loss events
     4 other TCP timeouts
     1 connections reset due to early user close
IpExt:
     InMcastPkts: 75
     OutMcastPkts: 62
     InOctets: 64952
     OutOctets: 66396
     InMcastOctets: 13428
     OutMcastOctets: 10403

netstat -st (receiver, after 1st test)
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 5
     InType8: 8
     OutType0: 8
     OutType3: 4
Tcp:
     70 active connections openings
     14 passive connection openings
     3 failed connection attempts
     4 connection resets received
     4 connections established
     16674253 segments received
     16673801 segments send out
     487 segments retransmited
     0 bad segments received.
     6 resets sent
UdpLite:
TcpExt:
     32 TCP sockets finished time wait in slow timer
     15 delayed acks sent
     228 packets directly queued to recvmsg prequeue.
     24 packets directly received from backlog
     1081 packets directly received from prequeue
     146 packets header predicted
     124 packets header predicted and directly queued to user
     10913589 acknowledgments not containing data received
     573 predicted acknowledgments
     185 times recovered from packet loss due to SACK data
     Detected reordering 1 times using FACK
     Detected reordering 8 times using SACK
     Detected reordering 2 times using time stamp
     1 congestion windows fully recovered
     23 congestion windows partially recovered using Hoe heuristic
     TCPDSACKUndo: 1
     0 TCP data loss events
     471 fast retransmits
     9 forward retransmits
     4 other TCP timeouts
     297 DSACKs received
     1 connections reset due to early user close
     TCPDSACKIgnoredOld: 258
     TCPDSACKIgnoredNoUndo: 39
     TCPSackShiftFallback: 35790574
IpExt:
     InMcastPkts: 89
     OutMcastPkts: 69
     InBcastPkts: 2
     InOctets: 1321825004
     OutOctets: -928982419
     InMcastOctets: 13848
     OutMcastOctets: 10627
     InBcastOctets: 483

Second test:

former "sender" side:
[root@target1 ~]# netstat -st
IcmpMsg:
     InType3: 5
     InType8: 13
     OutType0: 13
     OutType3: 4
Tcp:
     556 active connections openings
     65 passive connection openings
     391 failed connection attempts
     15 connection resets received
     4 connections established
     52164640 segments received
     52117884 segments send out
     62522 segments retransmited
     0 bad segments received.
     33 resets sent
UdpLite:
TcpExt:
     27 invalid SYN cookies received
     74 TCP sockets finished time wait in slow timer
     698540 packets rejects in established connections because of timestamp
     51 delayed acks sent
     487 delayed acks further delayed because of locked socket
     Quick ack mode was activated 18838 times
     7 times the listen queue of a socket overflowed
     7 SYNs to LISTEN sockets ignored
     1632 packets directly queued to recvmsg prequeue.
     4137769996 packets directly received from backlog
     5723253 packets directly received from prequeue
     1365131 packets header predicted
     136330 packets header predicted and directly queued to user
     10241415 acknowledgments not containing data received
     156502 predicted acknowledgments
     10983 times recovered from packet loss due to SACK data
     Detected reordering 4 times using FACK
     Detected reordering 10095 times using SACK
     Detected reordering 138 times using time stamp
     2107 congestion windows fully recovered
     18612 congestion windows partially recovered using Hoe heuristic
     TCPDSACKUndo: 80
     5 congestion windows recovered after partial ack
     0 TCP data loss events
     52 timeouts after SACK recovery
     2 timeouts in loss state
     61206 fast retransmits
     7 forward retransmits
     984 retransmits in slow start
     8 other TCP timeouts
     258 sack retransmits failed
     18838 DSACKs sent for old packets
     274 DSACKs sent for out of order packets
     14169 DSACKs received
     34 DSACKs for out of order packets received
     2 connections reset due to unexpected data
     TCPDSACKIgnoredOld: 8694
     TCPDSACKIgnoredNoUndo: 5482
     TCPSackShiftFallback: 18352494
IpExt:
     InMcastPkts: 104
     OutMcastPkts: 77
     InBcastPkts: 6
     InOctets: -474718903
     OutOctets: 1280495238
     InMcastOctets: 13974
     OutMcastOctets: 10908
     InBcastOctets: 1449



former "receiver" side:
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 5
     InType8: 14
     OutType0: 14
     OutType3: 4
Tcp:
     182 active connections openings
     39 passive connection openings
     4 failed connection attempts
     12 connection resets received
     4 connections established
     52098089 segments received
     52180386 segments send out
     68994 segments retransmited
     0 bad segments received.
     1070 resets sent
UdpLite:
TcpExt:
     12 TCP sockets finished time wait in fast timer
     102 TCP sockets finished time wait in slow timer
     770084 packets rejects in established connections because of timestamp
     37 delayed acks sent
     261 delayed acks further delayed because of locked socket
     Quick ack mode was activated 14276 times
     1466 packets directly queued to recvmsg prequeue.
     1190723332 packets directly received from backlog
     4781569 packets directly received from prequeue
     776470 packets header predicted
     97281 packets header predicted and directly queued to user
     24979561 acknowledgments not containing data received
     484206 predicted acknowledgments
     11461 times recovered from packet loss due to SACK data
     Detected reordering 15 times using FACK
     Detected reordering 15520 times using SACK
     Detected reordering 208 times using time stamp
     2046 congestion windows fully recovered
     18402 congestion windows partially recovered using Hoe heuristic
     TCPDSACKUndo: 82
     13 congestion windows recovered after partial ack
     0 TCP data loss events
     49 timeouts after SACK recovery
     1 timeouts in loss state
     62078 fast retransmits
     5340 forward retransmits
     1181 retransmits in slow start
     20 other TCP timeouts
     322 sack retransmits failed
     14276 DSACKs sent for old packets
     36 DSACKs sent for out of order packets
     17940 DSACKs received
     254 DSACKs for out of order packets received
     4 connections reset due to early user close
     TCPDSACKIgnoredOld: 12703
     TCPDSACKIgnoredNoUndo: 5251
     TCPSackShiftFallback: 57141117
IpExt:
     InMcastPkts: 104
     OutMcastPkts: 76
     InBcastPkts: 6
     InOctets: 902997645
     OutOctets: -82887048
     InMcastOctets: 14296
     OutMcastOctets: 10851
     InBcastOctets: 1449
[root@target2 ~]#








-- 
Best regards,
Oleg Ukhno.
ITO Team Lead,
Yandex LLC.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-19 16:13                   ` Oleg V. Ukhno
@ 2011-01-19 20:12                     ` Nicolas de Pesloüan
  2011-01-21 13:55                       ` Oleg V. Ukhno
  0 siblings, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-19 20:12 UTC (permalink / raw)
  To: Oleg V. Ukhno
  Cc: Jay Vosburgh, John Fastabend, David S. Miller, netdev,
	Sébastien Barré,
	Christophe Paasch

On 19/01/2011 17:13, Oleg V. Ukhno wrote:
> On 01/18/2011 11:24 PM, Jay Vosburgh wrote:
[snip]
>> I haven't done much testing with this lately, but I suspect this
>> behavior hasn't really changed. Raising the tcp_reordering sysctl value
>> can mitigate this somewhat (by making TCP more tolerant of this), but
>> that doesn't help non-TCP protocols.
>>
>> Barring evidence to the contrary, I presume that Oleg's system
>> delivers out of order at the receiver. That's not automatically a
>> reason to reject it, but this entire proposal is sufficiently complex to
>> configure that very explicit documentation will be necessary.
>>
>> -J
>>
>> ---
>> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>>
>
> Jay,
[snip]
>
> What is your opinion on my idea with patch?
>
> I will come back with results for VLAN tunneling case, if this is
> necessary (Nicolas, shall I do that test - I think it will show similar
> results for performance?)

If you have time for that, then yes, please, do the same test using balance-rr+vlan to segregate 
paths. With those results, we would have the opportunity to enhance the documentation with some 
well-tested cases of TCP load balancing on a LAN, not limited to the 802.3ad automatic setup. Both 
setups make sense, and assuming the results would be similar is probably true, but not reliable 
enough to assert in the documentation.

Thanks,

	Nicolas.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-19 20:12                     ` Nicolas de Pesloüan
@ 2011-01-21 13:55                       ` Oleg V. Ukhno
  2011-01-22 12:48                         ` Nicolas de Pesloüan
  2011-01-29  2:28                         ` Jay Vosburgh
  0 siblings, 2 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-21 13:55 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: Jay Vosburgh, John Fastabend, netdev

On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:

> If you have time for that, then yes, please, do the same test using
> balance-rr+vlan to segregate paths. With those results, we would have
> the opportunity to enhance the documentation with some well-tested cases
> of TCP load balancing on a LAN, not limited to the 802.3ad automatic
> setup. Both setups make sense, and assuming the results would be similar
> is probably true, but not reliable enough to assert in the documentation.
>
> Thanks,
>
> Nicolas.
>
Nicolas,
I've run similar tests for the VLAN tunneling scenario. The results are 
identical, as I expected. The only significant difference is link failure 
handling: 802.3ad mode allows almost painless load redistribution, while 
balance-rr causes packet loss.
The only remaining question for me is whether my patch can be applied 
upstream - adapting it to the net-next code is not a problem, if nobody 
objects.
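For reference, a minimal sketch of the kind of host-side configuration 
used for the balance-rr+VLAN scenario (interface names, addresses, slave 
count and monitoring values below are only illustrative; the path 
segregation itself is done on the switch, e.g. by putting each pair of 
ports into its own port-based VLAN, so that packets sent on slave N of 
one host always arrive on slave N of the other):

     modprobe bonding mode=balance-rr miimon=100
     ifconfig bond0 192.168.111.129 netmask 255.255.255.0 up
     ifenslave bond0 eth0 eth1 eth2 eth3
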
There were 2 tests:
1) unidirectional test
2) bidirectional test
Below are the results:

Iperf results:
test 1:
  iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300
------------------------------------------------------------
Client connecting to 192.168.111.128, TCP port 9999
Binding to local address 192.168.111.129
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
[  3] local 192.168.111.129 port 9999 connected with 192.168.111.128 
port 9999
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-300.0 sec  141637 MBytes  3960 Mbits/sec

test 2:
iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300 
--dualtest -P 4
------------------------------------------------------------
Server listening on TCP port 9999
Binding to local address 192.168.111.129
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
...
[SUM]  0.0-300.2 sec  111334 MBytes  3111 Mbits/sec
[SUM]  0.0-300.4 sec  109582 MBytes  3060 Mbits/sec

TCP stats:
receiver side, before test 1:
[root@target1 ~]# netstat -st
IcmpMsg:
     InType0: 4
     InType3: 6
     InType8: 2
     OutType0: 2
     OutType3: 6
     OutType8: 4
Tcp:
     4 active connections openings
     2 passive connection openings
     3 failed connection attempts
     0 connection resets received
     3 connections established
     10252 segments received
     29766 segments send out
     2 segments retransmited
     0 bad segments received.
     0 resets sent
UdpLite:
TcpExt:
     3 delayed acks sent
     613 packets directly queued to recvmsg prequeue.
     16 packets directly received from backlog
     1760 packets directly received from prequeue
     428 packets header predicted
     10 packets header predicted and directly queued to user
     9295 acknowledgments not containing data received
     265 predicted acknowledgments
     0 TCP data loss events
     1 other TCP timeouts
     TCPSackMerged: 1
     TCPSackShiftFallback: 1
IpExt:
     InMcastPkts: 92
     OutMcastPkts: 64
     InBcastPkts: 2
     InOctets: 1089217
     OutOctets: 265005791
     InMcastOctets: 16294
     OutMcastOctets: 10364
     InBcastOctets: 483


receiver side , after test 1:
[root@target1 ~]netstat -st
IcmpMsg:
     InType0: 17
     InType3: 6
     InType8: 9
     OutType0: 9
     OutType3: 6
     OutType8: 19
Tcp:
     84 active connections openings
     14 passive connection openings
     6 failed connection attempts
     4 connection resets received
     4 connections established
     16684784 segments received
     16704650 segments send out
     22 segments retransmited
     0 bad segments received.
     6 resets sent
UdpLite:
TcpExt:
     39 TCP sockets finished time wait in slow timer
     23 delayed acks sent
     83 delayed acks further delayed because of locked socket
     Quick ack mode was activated 225 times
     1019 packets directly queued to recvmsg prequeue.
     3235352384 packets directly received from backlog
     483600 packets directly received from prequeue
     86065 packets header predicted
     4855 packets header predicted and directly queued to user
     10369 acknowledgments not containing data received
     928 predicted acknowledgments
     0 TCP data loss events
     2 retransmits in slow start
     6 other TCP timeouts
     225 DSACKs sent for old packets
     1 connections reset due to unexpected data
     TCPSackMerged: 1
     TCPSackShiftFallback: 3
IpExt:
     InMcastPkts: 108
     OutMcastPkts: 72
     InBcastPkts: 4
     InOctets: -936746758
     OutOctets: 1556837236
     InMcastOctets: 16774
     OutMcastOctets: 10620
     InBcastOctets: 966

receiver side, after test 2
[root@target1 ~]netstat -st
IcmpMsg:
     InType0: 17
     InType3: 6
     InType8: 12
     OutType0: 12
     OutType3: 6
     OutType8: 19
Tcp:
     144 active connections openings
     25 passive connection openings
     29 failed connection attempts
     7 connection resets received
     4 connections established
     44349148 segments received
     44401154 segments send out
     58434 segments retransmited
     0 bad segments received.
     6 resets sent
UdpLite:
TcpExt:
     58 TCP sockets finished time wait in slow timer
     735072 packets rejects in established connections because of timestamp
     34 delayed acks sent
     359 delayed acks further delayed because of locked socket
     Quick ack mode was activated 14800 times
     2112 packets directly queued to recvmsg prequeue.
     3753925448 packets directly received from backlog
     4377976 packets directly received from prequeue
     847653 packets header predicted
     105696 packets header predicted and directly queued to user
     8804473 acknowledgments not containing data received
     154775 predicted acknowledgments
     10465 times recovered from packet loss due to SACK data
     Detected reordering 1 times using FACK
     Detected reordering 11185 times using SACK
     Detected reordering 182 times using time stamp
     2116 congestion windows fully recovered
     18951 congestion windows partially recovered using Hoe heuristic
     TCPDSACKUndo: 58
     8 congestion windows recovered after partial ack
     0 TCP data loss events
     53 timeouts after SACK recovery
     1 timeouts in loss state
     57287 fast retransmits
     12 forward retransmits
     793 retransmits in slow start
     10 other TCP timeouts
     263 sack retransmits failed
     14800 DSACKs sent for old packets
     31 DSACKs sent for out of order packets
     14289 DSACKs received
     43 DSACKs for out of order packets received
     1 connections reset due to unexpected data
     TCPDSACKIgnoredOld: 8615
     TCPDSACKIgnoredNoUndo: 5683
     TCPSackMerged: 1
     TCPSackShiftFallback: 15015212
IpExt:
     InMcastPkts: 116
     OutMcastPkts: 76
     InBcastPkts: 4
     InOctets: 1012355682
     OutOctets: -1540562156
     InMcastOctets: 17014
     OutMcastOctets: 10748
     InBcastOctets: 966


sender side, before test 1:
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 4
     InType8: 32
     OutType0: 32
     OutType3: 4
Tcp:
     1 active connections openings
     2 passive connection openings
     0 failed connection attempts
     0 connection resets received
     3 connections established
     30268 segments received
     10217 segments send out
     0 segments retransmited
     0 bad segments received.
     3 resets sent
UdpLite:
TcpExt:
     7 delayed acks sent
     6332 packets directly queued to recvmsg prequeue.
     8 packets directly received from backlog
     46104 packets directly received from prequeue
     27935 packets header predicted
     11 packets header predicted and directly queued to user
     455 acknowledgments not containing data received
     119 predicted acknowledgments
     0 TCP data loss events
     TCPSackShiftFallback: 1
IpExt:
     InMcastPkts: 87
     OutMcastPkts: 54
     InBcastPkts: 2
     InOctets: 265039007
     OutOctets: 1083024
     InMcastOctets: 16444
     OutMcastOctets: 9893
     InBcastOctets: 483

sender side , after test 1:
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 4
     InType8: 53
     OutType0: 53
     OutType3: 4
Tcp:
     69 active connections openings
     12 passive connection openings
     2 failed connection attempts
     4 connection resets received
     4 connections established
     16704819 segments received
     16684841 segments send out
     401 segments retransmited
     0 bad segments received.
     10 resets sent
UdpLite:
TcpExt:
     31 TCP sockets finished time wait in slow timer
     25 delayed acks sent
     6515 packets directly queued to recvmsg prequeue.
     24 packets directly received from backlog
     46988 packets directly received from prequeue
     27974 packets header predicted
     115 packets header predicted and directly queued to user
     10259331 acknowledgments not containing data received
     12483 predicted acknowledgments
     166 times recovered from packet loss due to SACK data
     Detected reordering 1 times using FACK
     Detected reordering 7 times using SACK
     Detected reordering 1 times using time stamp
     1 congestion windows fully recovered
     41 congestion windows partially recovered using Hoe heuristic
     0 TCP data loss events
     386 fast retransmits
     5 forward retransmits
     3 other TCP timeouts
     1 times receiver scheduled too late for direct processing
     225 DSACKs received
     1 connections reset due to unexpected data
     TCPDSACKIgnoredOld: 167
     TCPDSACKIgnoredNoUndo: 58
     TCPSackShiftFallback: 30925668
IpExt:
     InMcastPkts: 103
     OutMcastPkts: 62
     InBcastPkts: 4
     InOctets: 1556368288
     OutOctets: -934790015
     InMcastOctets: 16924
     OutMcastOctets: 10149
     InBcastOctets: 966

sender side, after test 2:
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 4
     InType8: 56
     OutType0: 56
     OutType3: 4
Tcp:
     117 active connections openings
     25 passive connection openings
     2 failed connection attempts
     7 connection resets received
     4 connections established
     44383169 segments received
     44367187 segments send out
     59660 segments retransmited
     0 bad segments received.
     34 resets sent
UdpLite:
TcpExt:
     2 TCP sockets finished time wait in fast timer
     57 TCP sockets finished time wait in slow timer
     717082 packets rejects in established connections because of timestamp
     46 delayed acks sent
     202 delayed acks further delayed because of locked socket
     Quick ack mode was activated 14356 times
     7432 packets directly queued to recvmsg prequeue.
     135038632 packets directly received from backlog
     3633432 packets directly received from prequeue
     783534 packets header predicted
     94671 packets header predicted and directly queued to user
     20034470 acknowledgments not containing data received
     177885 predicted acknowledgments
     10851 times recovered from packet loss due to SACK data
     Detected reordering 6 times using FACK
     Detected reordering 9217 times using SACK
     Detected reordering 111 times using time stamp
     2125 congestion windows fully recovered
     19325 congestion windows partially recovered using Hoe heuristic
     TCPDSACKUndo: 71
     7 congestion windows recovered after partial ack
     0 TCP data loss events
     52 timeouts after SACK recovery
     58562 fast retransmits
     67 forward retransmits
     736 retransmits in slow start
     8 other TCP timeouts
     226 sack retransmits failed
     1 times receiver scheduled too late for direct processing
     14356 DSACKs sent for old packets
     44 DSACKs sent for out of order packets
     14679 DSACKs received
     31 DSACKs for out of order packets received
     1 connections reset due to unexpected data
     TCPDSACKIgnoredOld: 8899
     TCPDSACKIgnoredNoUndo: 5791
     TCPSackShiftFallback: 47227517
IpExt:
     InMcastPkts: 109
     OutMcastPkts: 65
     InBcastPkts: 4
     InOctets: -1885181292
     OutOctets: 1366995261
     InMcastOctets: 17104
     OutMcastOctets: 10245
     InBcastOctets: 966

-- 
Best regards,
Oleg Ukhno,
ITO Team lead
Yandex LLC.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-21 13:55                       ` Oleg V. Ukhno
@ 2011-01-22 12:48                         ` Nicolas de Pesloüan
  2011-01-24 19:32                           ` Oleg V. Ukhno
  2011-01-29  2:28                         ` Jay Vosburgh
  1 sibling, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-22 12:48 UTC (permalink / raw)
  To: Oleg V. Ukhno; +Cc: Jay Vosburgh, John Fastabend, netdev

On 21/01/2011 14:55, Oleg V. Ukhno wrote:
> On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:
>
>> If you have time for that, then yes, please, do the same test using
>> balance-rr+vlan to segregate paths. With those results, we would have
>> the opportunity to enhance the documentation with some well-tested cases
>> of TCP load balancing on a LAN, not limited to the 802.3ad automatic
>> setup. Both setups make sense, and assuming the results would be similar
>> is probably true, but not reliable enough to assert in the
>> documentation.
>>
>> Thanks,
>>
>> Nicolas.
>>
> Nicolas,
> I've run similar tests for the VLAN tunneling scenario. The results are
> identical, as I expected. The only significant difference is link failure
> handling: 802.3ad mode allows almost painless load redistribution, while
> balance-rr causes packet loss.

Oleg,

Thanks for doing the tests.

What link failure mode did you use for those tests? miimon or arp monitoring?

	Nicolas.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-22 12:48                         ` Nicolas de Pesloüan
@ 2011-01-24 19:32                           ` Oleg V. Ukhno
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-24 19:32 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: Jay Vosburgh, John Fastabend, netdev

On 01/22/2011 03:48 PM, Nicolas de Pesloüan wrote:
> On 21/01/2011 14:55, Oleg V. Ukhno wrote:
>> On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:

>>>
>> Nicolas,
>> I've run similar tests for the VLAN tunneling scenario. The results are
>> identical, as I expected. The only significant difference is link
>> failure handling: 802.3ad mode allows almost painless load
>> redistribution, while balance-rr causes packet loss.
>
> Oleg,
>
> Thanks for doing the tests.
>
> What link failure mode did you use for those tests? miimon or arp
> monitoring?
>
> Nicolas.
>
>

Nicolas,
  as for the tests:
  MII link monitoring kills the whole transfer, while with ARP 
monitoring it still works, but there is asymmetric load striping on 
the bond slaves (one slave is overloaded, the other two run at about 
50-60% bandwidth utilization).
Just as a summary: balance-rr behaves like the patched 802.3ad mode when 
using ARP monitoring, but there is quite asymmetric load striping and 
quite a monstrous configuration on the switch and server sides.
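
For clarity, the two failure-detection setups differ only in the bonding 
monitoring options, along these lines (interval and target values are 
illustrative):

     # MII link monitoring
     modprobe bonding mode=balance-rr miimon=100
     # ARP monitoring against the peer's address
     modprobe bonding mode=balance-rr arp_interval=100 \
         arp_ip_target=192.168.111.128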



-- 
Best regards,
Oleg Ukhno



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-21 13:55                       ` Oleg V. Ukhno
  2011-01-22 12:48                         ` Nicolas de Pesloüan
@ 2011-01-29  2:28                         ` Jay Vosburgh
  2011-02-01 16:25                           ` Oleg V. Ukhno
  2011-02-02  9:54                           ` Nicolas de Pesloüan
  1 sibling, 2 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-29  2:28 UTC (permalink / raw)
  To: Oleg V. Ukhno
  Cc: Nicolas de Pesloüan, John Fastabend, netdev

Oleg V. Ukhno <olegu@yandex-team.ru> wrote:

>On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:
>
>> If you have time for that, then yes, please, do the same test using
>> balance-rr+vlan to segregate paths. With those results, we would have
>> the opportunity to enhance the documentation with some well-tested cases
>> of TCP load balancing on a LAN, not limited to the 802.3ad automatic
>> setup. Both setups make sense, and assuming the results would be similar
>> is probably true, but not reliable enough to assert in the documentation.
>>
>> Thanks,
>>
>> Nicolas.
>>
>Nicolas,
>I've run similar tests for the VLAN tunneling scenario. The results are
>identical, as I expected. The only significant difference is link failure
>handling: 802.3ad mode allows almost painless load redistribution, while
>balance-rr causes packet loss.
>The only remaining question for me is whether my patch can be applied
>upstream - adapting it to the net-next code is not a problem, if nobody
>objects.

	I've thought about this whole thing, and here's what I view as
the proper way to do this.

	In my mind, this proposal is two separate pieces:

	First, a piece to make round-robin a selectable hash for
xmit_hash_policy.  The documentation for this should follow the pattern
of the "layer3+4" hash policy, in particular noting that the new
algorithm violates the 802.3ad standard in exciting ways, will result in
out of order delivery, and that other 802.3ad implementations may or may
not tolerate this.

	Second, a piece to make certain transmitted packets use the
source MAC of the sending slave instead of the bond's MAC.  This should
be a separate option from the round-robin hash policy.  I'd call it
something like "mac_select" with two values: "default" (what we do now)
and "slave_src_mac" to use the slave's real MAC for certain types of
traffic (I'm open to better names; that's just what I came up with while
writing this).  I believe that "certain types" means "everything but
ARP," but might be "only IP and IPv6."  Structuring the option in this
manner leaves the option open for additional selections in the future,
which a simple "on/off" option wouldn't.  This option should probably
only affect a subset of modes; I'm thinking anything except balance-tlb
or -alb (because they do funky MAC things already) and active-backup (it
doesn't balance traffic, and already uses fail_over_mac to control
this).  I think this option also needs a whole new section down in the
bottom explaining how to exploit it (the "pick special MACs on slaves to
trick switch hash" business).
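
	To make that concrete: once both pieces existed, a setup might be
configured along these lines (using the "simple-rr" hash name from the
original patch and the "mac_select" / "slave_src_mac" names above purely
as placeholders; none of these options exists yet):

	modprobe bonding mode=802.3ad xmit_hash_policy=simple-rr \
		mac_select=slave_src_mac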

	Comments?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-29  2:28                         ` Jay Vosburgh
@ 2011-02-01 16:25                           ` Oleg V. Ukhno
  2011-02-02 17:30                             ` Jay Vosburgh
  2011-02-02  9:54                           ` Nicolas de Pesloüan
  1 sibling, 1 reply; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-02-01 16:25 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Nicolas de Pesloüan, John Fastabend, netdev

On 01/29/2011 05:28 AM, Jay Vosburgh wrote:
> Oleg V. Ukhno<olegu@yandex-team.ru>  wrote:
>
> 	I've thought about this whole thing, and here's what I view as
> the proper way to do this.
>
> 	In my mind, this proposal is two separate pieces:
>
> 	First, a piece to make round-robin a selectable hash for
> xmit_hash_policy.  The documentation for this should follow the pattern
> of the "layer3+4" hash policy, in particular noting that the new
> algorithm violates the 802.3ad standard in exciting ways, will result in
> out of order delivery, and that other 802.3ad implementations may or may
> not tolerate this.
>
> 	Second, a piece to make certain transmitted packets use the
> source MAC of the sending slave instead of the bond's MAC.  This should
> be a separate option from the round-robin hash policy.  I'd call it
> something like "mac_select" with two values: "default" (what we do now)
> and "slave_src_mac" to use the slave's real MAC for certain types of
> traffic (I'm open to better names; that's just what I came up with while
> writing this).  I believe that "certain types" means "everything but
> ARP," but might be "only IP and IPv6."  Structuring the option in this
> manner leaves the option open for additional selections in the future,
> which a simple "on/off" option wouldn't.  This option should probably
> only affect a subset of modes; I'm thinking anything except balance-tlb
> or -alb (because they do funky MAC things already) and active-backup (it
> doesn't balance traffic, and already uses fail_over_mac to control
> this).  I think this option also needs a whole new section down in the
> bottom explaining how to exploit it (the "pick special MACs on slaves to
> trick switch hash" business).
>
> 	Comments?
>
> 	-J
>
Jay,
Splitting my initial proposal into two logically different pieces is 
fine with me; it will provide more flexible configuration.
Do I understand correctly that after I rewrite the patch in split form, 
as you described above, and enhance the documentation, it will be / can 
be applied to the kernel?
Then what should I do: rewrite the patch and resubmit it as a new one?

Oleg.

> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>


-- 
Best regards,
Oleg Ukhno.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-29  2:28                         ` Jay Vosburgh
  2011-02-01 16:25                           ` Oleg V. Ukhno
@ 2011-02-02  9:54                           ` Nicolas de Pesloüan
  2011-02-02 17:57                             ` Jay Vosburgh
  1 sibling, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-02-02  9:54 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Oleg V. Ukhno, John Fastabend, netdev

On 29/01/2011 03:28, Jay Vosburgh wrote:
> 	I've thought about this whole thing, and here's what I view as
> the proper way to do this.
>
> 	In my mind, this proposal is two separate pieces:
>
> 	First, a piece to make round-robin a selectable hash for
> xmit_hash_policy.  The documentation for this should follow the pattern
> of the "layer3+4" hash policy, in particular noting that the new
> algorithm violates the 802.3ad standard in exciting ways, will result in
> out of order delivery, and that other 802.3ad implementations may or may
> not tolerate this.
>
> 	Second, a piece to make certain transmitted packets use the
> source MAC of the sending slave instead of the bond's MAC.  This should
> be a separate option from the round-robin hash policy.  I'd call it
> something like "mac_select" with two values: "default" (what we do now)
> and "slave_src_mac" to use the slave's real MAC for certain types of
> traffic (I'm open to better names; that's just what I came up with while
> writing this).  I believe that "certain types" means "everything but
> ARP," but might be "only IP and IPv6."  Structuring the option in this
> manner leaves the option open for additional selections in the future,
> which a simple "on/off" option wouldn't.  This option should probably
> only affect a subset of modes; I'm thinking anything except balance-tlb
> or -alb (because they do funky MAC things already) and active-backup (it
> doesn't balance traffic, and already uses fail_over_mac to control
> this).  I think this option also needs a whole new section down in the
> bottom explaining how to exploit it (the "pick special MACs on slaves to
> trick switch hash" business).
>
> 	Comments?

Looks really sensible to me.

I just propose the following option and option values: "src_mac_select" (instead of mac_select), 
with "default" and "slave_mac" (instead of slave_src_mac) as possible values. In the future, we 
might need a "dst_mac_select" option... :-)

Also, are there any risks that this kind of session load-balancing won't properly cooperate with 
multiqueue (as explained in "Overriding Configuration for Special Cases" in 
Documentation/networking/bonding.txt)? I think it is important to ensure we keep the ability to 
fine-tune the egress path selection.
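
For context, the mechanism described in that section is roughly the 
following (queue number and address are illustrative):

     echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
     tc qdisc add dev bond0 handle 1 root multiq
     tc filter add dev bond0 protocol ip parent 1: prio 1 u32 \
         match ip dst 192.168.1.100 action skbedit queue_mapping 2

the question being whether the new round-robin policy and per-slave 
source MAC would still honor a slave selected this way.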

	Nicolas.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-02-01 16:25                           ` Oleg V. Ukhno
@ 2011-02-02 17:30                             ` Jay Vosburgh
  0 siblings, 0 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-02-02 17:30 UTC (permalink / raw)
  To: Oleg V. Ukhno; +Cc: Nicolas de Pesloüan, John Fastabend, netdev

Oleg V. Ukhno <olegu@yandex-team.ru> wrote:

>On 01/29/2011 05:28 AM, Jay Vosburgh wrote:
>> Oleg V. Ukhno<olegu@yandex-team.ru>  wrote:
>>
>> 	I've thought about this whole thing, and here's what I view as
>> the proper way to do this.
>>
>> 	In my mind, this proposal is two separate pieces:
>>
>> 	First, a piece to make round-robin a selectable hash for
>> xmit_hash_policy.  The documentation for this should follow the pattern
>> of the "layer3+4" hash policy, in particular noting that the new
>> algorithm violates the 802.3ad standard in exciting ways, will result in
>> out of order delivery, and that other 802.3ad implementations may or may
>> not tolerate this.
>>
>> 	Second, a piece to make certain transmitted packets use the
>> source MAC of the sending slave instead of the bond's MAC.  This should
>> be a separate option from the round-robin hash policy.  I'd call it
>> something like "mac_select" with two values: "default" (what we do now)
>> and "slave_src_mac" to use the slave's real MAC for certain types of
>> traffic (I'm open to better names; that's just what I came up with while
>> writing this).  I believe that "certain types" means "everything but
>> ARP," but might be "only IP and IPv6."  Structuring the option in this
>> manner leaves the option open for additional selections in the future,
>> which a simple "on/off" option wouldn't.  This option should probably
>> only affect a subset of modes; I'm thinking anything except balance-tlb
>> or -alb (because they do funky MAC things already) and active-backup (it
>> doesn't balance traffic, and already uses fail_over_mac to control
>> this).  I think this option also needs a whole new section down in the
>> bottom explaining how to exploit it (the "pick special MACs on slaves to
>> trick switch hash" business).
>>
>> 	Comments?
>>
>> 	-J
>>
>Jay,
>Splitting my initial proposal into two logically different pieces is
>fine with me; it will provide more flexible configuration.
>Do I understand correctly that after I rewrite the patch in split form,
>as you described above, and enhance the documentation, it will be / can
>be applied to the kernel?

	Yes, although the patches may have to go through a few
revisions.


>Then what should I do: rewrite the patch and resubmit it as a new one?

	Yes.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-02-02  9:54                           ` Nicolas de Pesloüan
@ 2011-02-02 17:57                             ` Jay Vosburgh
  2011-02-03 14:54                               ` Oleg V. Ukhno
  0 siblings, 1 reply; 32+ messages in thread
From: Jay Vosburgh @ 2011-02-02 17:57 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Oleg V. Ukhno, John Fastabend, netdev

Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:

>On 29/01/2011 03:28, Jay Vosburgh wrote:
>> 	I've thought about this whole thing, and here's what I view as
>> the proper way to do this.
>>
>> 	In my mind, this proposal is two separate pieces:
>>
>> 	First, a piece to make round-robin a selectable hash for
>> xmit_hash_policy.  The documentation for this should follow the pattern
>> of the "layer3+4" hash policy, in particular noting that the new
>> algorithm violates the 802.3ad standard in exciting ways, will result in
>> out of order delivery, and that other 802.3ad implementations may or may
>> not tolerate this.
>>
>> 	Second, a piece to make certain transmitted packets use the
>> source MAC of the sending slave instead of the bond's MAC.  This should
>> be a separate option from the round-robin hash policy.  I'd call it
>> something like "mac_select" with two values: "default" (what we do now)
>> and "slave_src_mac" to use the slave's real MAC for certain types of
>> traffic (I'm open to better names; that's just what I came up with while
>> writing this).  I believe that "certain types" means "everything but
>> ARP," but might be "only IP and IPv6."  Structuring the option in this
>> manner leaves the option open for additional selections in the future,
>> which a simple "on/off" option wouldn't.  This option should probably
>> only affect a subset of modes; I'm thinking anything except balance-tlb
>> or -alb (because they do funky MAC things already) and active-backup (it
>> doesn't balance traffic, and already uses fail_over_mac to control
>> this).  I think this option also needs a whole new section down in the
>> bottom explaining how to exploit it (the "pick special MACs on slaves to
>> trick switch hash" business).
>>
>> 	Comments?
>
>Looks really sensible to me.
>
>I just propose the following option and option values : "src_mac_select"
>(instead of mac_select), with "default" and "slave_mac" (instead of
>slave_src_mac) as possible values. In the future, we might need a
>"dst_mac_select" option... :-)

	I originally thought of using the nomenclature you propose; my
thinking for doing it the way I ended up with is to minimize the number
of tunable knobs that bonding has (so, the dst_mac would be a setting
for mac_select).  That works as long as there aren't a lot of settings
that would be turned on simultaneously, since each combination would
have to be a separate option, or the options parser would have to handle
multiple settings (e.g., mac_select=src+dst or something like that).

	Anyway, after thinking about it some more, in the long run it's
probably safer to separate these two, so, Oleg, use the above naming
("src_mac_select" with "default" and "slave_mac").

>Also, are there any risks that this kind of session load-balancing won't
>properly cooperate with multiqueue (as explained in "Overriding
>Configuration for Special Cases" in Documentation/networking/bonding.txt)?
>I think it is important to ensure we keep the ability to fine-tune the
>egress path selection.

	I think the logic for the mac_select (or src_mac_select or
whatever) just has to be done last, after the slave selection is done by
the multiqueue stuff.  That's probably a good tidbit to put in the
documentation as well.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-02-02 17:57                             ` Jay Vosburgh
@ 2011-02-03 14:54                               ` Oleg V. Ukhno
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-02-03 14:54 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Nicolas de Pesloüan, John Fastabend, netdev

On 02/02/2011 08:57 PM, Jay Vosburgh wrote:
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:

>> I just propose the following option and option values : "src_mac_select"
>> (instead of mac_select), with "default" and "slave_mac" (instead of
>> slave_src_mac) as possible values. In the future, we might need a
>> "dst_mac_select" option... :-)
>
> 	I originally thought of using the nomenclature you propose; my
> thinking for doing it the way I ended up with is to minimize the number
> of tunable knobs that bonding has (so, the dst_mac would be a setting
> for mac_select).  That works as long as there aren't a lot of settings
> that would be turned on simultaneously, since each combination would
> have to be a separate option, or the options parser would have to handle
> multiple settings (e.g., mac_select=src+dst or something like that).
>
> 	Anyway, after thinking about it some more, in the long run it's
> probably safer to separate these two, so, Oleg, use the above naming
> ("src_mac_select" with "default" and "slave_mac").
>
>> Also, are there any risks that this kind of session load-balancing won't
>> properly cooperate with multiqueue (as explained in "Overriding
>> Configuration for Special Cases" in Documentation/networking/bonding.txt)?
>> I think it is important to ensure we keep the ability to fine-tune the
>> egress path selection.
>
> 	I think the logic for the mac_select (or src_mac_select or
> whatever) just has to be done last, after the slave selection is done by
> the multiqueue stuff.  That's probably a good tidbit to put in the
> documentation as well.
>
> 	-J
>
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
Thanks to everyone for the comments.
I'll resubmit the modified patch after it is ready and tested, in about 
a week or two, I think.

Oleg

>


-- 
Best regards,
Oleg Ukhno


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2011-02-03 14:54 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
2011-01-14 20:10 ` John Fastabend
2011-01-14 23:12   ` Oleg V. Ukhno
2011-01-14 20:13 ` Jay Vosburgh
2011-01-14 22:51   ` Oleg V. Ukhno
2011-01-15  0:05     ` Jay Vosburgh
2011-01-15 12:11       ` Oleg V. Ukhno
2011-01-18  3:16       ` John Fastabend
2011-01-18 12:40         ` Oleg V. Ukhno
2011-01-18 14:54           ` Nicolas de Pesloüan
2011-01-18 15:28             ` Oleg V. Ukhno
2011-01-18 16:24               ` Nicolas de Pesloüan
2011-01-18 16:57                 ` Oleg V. Ukhno
2011-01-18 20:24                 ` Jay Vosburgh
2011-01-18 21:20                   ` Nicolas de Pesloüan
2011-01-19  1:45                     ` Jay Vosburgh
2011-01-18 22:22                   ` Oleg V. Ukhno
2011-01-19 16:13                   ` Oleg V. Ukhno
2011-01-19 20:12                     ` Nicolas de Pesloüan
2011-01-21 13:55                       ` Oleg V. Ukhno
2011-01-22 12:48                         ` Nicolas de Pesloüan
2011-01-24 19:32                           ` Oleg V. Ukhno
2011-01-29  2:28                         ` Jay Vosburgh
2011-02-01 16:25                           ` Oleg V. Ukhno
2011-02-02 17:30                             ` Jay Vosburgh
2011-02-02  9:54                           ` Nicolas de Pesloüan
2011-02-02 17:57                             ` Jay Vosburgh
2011-02-03 14:54                               ` Oleg V. Ukhno
2011-01-18 17:56               ` Kirill Smelkov
2011-01-18 16:41           ` John Fastabend
2011-01-18 17:21             ` Oleg V. Ukhno
2011-01-14 20:41 ` Nicolas de Pesloüan
