* [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-14 19:07 UTC
To: netdev; +Cc: Jay Vosburgh, David S. Miller

Patch introduces a new hashing policy for the 802.3ad bonding mode.
This hashing policy can be used (and was tested) only for round-robin
balancing of iSCSI traffic (a single TCP session is balanced per-packet
over all slave interfaces).
General requirements for this hashing policy are:
1) the switch must be configured with an src-dst-mac or src-mac hashing
   policy
2) the number of bond slaves on the sending and receiving machines
   should be equal and preferably even, or at least even, otherwise you
   may get an asymmetric load on the receiving machine
3) the hashing policy must not be used when the round-trip times between
   the source and destination machines are expected to differ
   significantly across slaves in the same bond (it works fine when all
   slaves are plugged into a single switch)

Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
---

 Documentation/networking/bonding.txt | 27 +++++++++++++++++++++++++++
 drivers/net/bonding/bond_3ad.c       |  6 ++++++
 drivers/net/bonding/bond_main.c      | 18 +++++++++++++++++-
 include/linux/if_bonding.h           |  1 +
 4 files changed, 51 insertions(+), 1 deletion(-)

diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
--- linux-2.6.37-vanilla/Documentation/networking/bonding.txt	2011-01-05 03:50:19.000000000 +0300
+++ linux-2.6.37.my/Documentation/networking/bonding.txt	2011-01-14 21:34:46.635268000 +0300
@@ -759,6 +759,33 @@ xmit_hash_policy
 		most UDP traffic is not involved in extended
 		conversations. Other implementations of 802.3ad may
 		or may not tolerate this noncompliance.
+
+	simple-rr or 3
+		This policy simply sends each successive packet via the
+		"next" slave interface.  When sending, it resets the
+		source MAC address of the packet to the real MAC
+		address of the slave interface.
+
+		When the switch is configured properly, and the
+		receiving machine has an even and equal number of
+		interfaces, this guarantees quite precise rx/tx load
+		balancing for any single TCP session.  The typical use
+		case for this mode is iSCSI (which the patch was
+		developed for), because it uses a single TCP session to
+		transmit data.
+
+		It is important to remember that all slaves should be
+		plugged into a single switch to avoid out-of-order
+		packets.  It is recommended to have an equal and even
+		number of slave interfaces in the sending and receiving
+		machines' bonds, otherwise you will get an asymmetric
+		load on the receiving host.  Another caveat is that
+		this policy must not be used when the round-trip times
+		between the source and destination machines are
+		expected to differ significantly across slaves in the
+		same bond (it works fine when all slaves are plugged
+		into a single switch).
+
+		For correct load balancing on the receiving side you
+		must configure the switch to use src-dst-mac or src-mac
+		hashing mode.
+

	The default value is layer2.  This option was added in bonding
	version 2.6.3.
	In earlier versions of bonding, this parameter

diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
--- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c	2011-01-14 19:39:05.575268000 +0300
+++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c	2011-01-14 19:47:03.815268000 +0300
@@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
 	int i;
 	struct ad_info ad_info;
 	int res = 1;
+	struct ethhdr *eth_data;

 	/* make sure that the slaves list will
 	 * not change during tx
@@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
 			slave_agg_id = agg->aggregator_identifier;

 		if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
+			if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
+				skb_reset_mac_header(skb);
+				eth_data = eth_hdr(skb);
+				memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
+			}
 			res = bond_dev_queue_xmit(bond, skb, slave->dev);
 			break;
 		}

diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
--- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c	2011-01-14 19:39:05.575268000 +0300
+++ linux-2.6.37.my/drivers/net/bonding/bond_main.c	2011-01-14 19:47:55.835268001 +0300
@@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
 MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
 module_param(xmit_hash_policy, charp, 0);
 MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
-				   ", 1 for layer 3+4");
+				   ", 1 for layer 3+4"
+				   ", 2 for layer 2+3"
+				   ", 3 for round-robin");
 module_param(arp_interval, int, 0);
 MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
 module_param_array(arp_ip_target, charp, NULL, 0);
@@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
 { "layer2", BOND_XMIT_POLICY_LAYER2},
 { "layer3+4", BOND_XMIT_POLICY_LAYER34},
 { "layer2+3", BOND_XMIT_POLICY_LAYER23},
+{ "simple-rr", BOND_XMIT_POLICY_LAYERRR},
 { NULL, -1},
 };

@@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
 	return (data->h_dest[5] ^ data->h_source[5]) % count;
 }

+/*
+ * simply round robin
+ */
+static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
+				    struct net_device *bond_dev, int count)
+{
+	struct bonding *bond = netdev_priv(bond_dev);
+	return bond->rr_tx_counter++ % count;
+}
+
 /*-------------------------- Device entry points ----------------------------*/

 static int bond_open(struct net_device *bond_dev)
@@ -4482,6 +4495,9 @@ out:
 static void bond_set_xmit_hash_policy(struct bonding *bond)
 {
 	switch (bond->params.xmit_policy) {
+	case BOND_XMIT_POLICY_LAYERRR:
+		bond->xmit_hash_policy = bond_xmit_hash_policy_rr;
+		break;
 	case BOND_XMIT_POLICY_LAYER23:
 		bond->xmit_hash_policy = bond_xmit_hash_policy_l23;
 		break;

diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/include/linux/if_bonding.h linux-2.6.37.my/include/linux/if_bonding.h
--- linux-2.6.37-vanilla/include/linux/if_bonding.h	2011-01-05 03:50:19.000000000 +0300
+++ linux-2.6.37.my/include/linux/if_bonding.h	2011-01-14 19:34:29.755268001 +0300
@@ -91,6 +91,7 @@
 #define BOND_XMIT_POLICY_LAYER2		0 /* layer 2 (MAC only), default */
 #define BOND_XMIT_POLICY_LAYER34	1 /* layer 3+4 (IP ^ (TCP || UDP)) */
 #define BOND_XMIT_POLICY_LAYER23	2 /* layer 2+3 (IP ^ MAC) */
+#define BOND_XMIT_POLICY_LAYERRR	3 /* round-robin */

 typedef struct ifbond {
	__s32 bond_mode;
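The effect of the bond_3ad.c hunk above can be modelled in a few lines of userspace C: before the frame is queued on a slave, its source MAC is overwritten with that slave's permanent hardware address. This is only an illustrative sketch; the struct mirrors the layout of the kernel's struct ethhdr, and the helper name and example address are invented for the demonstration.

```c
#include <assert.h>
#include <string.h>

#define ETH_ALEN 6

/* Mirrors the layout of struct ethhdr from <linux/if_ether.h>. */
struct eth_header {
	unsigned char h_dest[ETH_ALEN];
	unsigned char h_source[ETH_ALEN];
	unsigned short h_proto;
};

/*
 * What the patch does per packet: stamp the slave's permanent MAC
 * into the outgoing frame, so that a switch hashing on src-mac sees
 * a distinct source address for every slave the bond transmits on.
 */
static void stamp_source_mac(struct eth_header *eth,
			     const unsigned char *slave_perm_hwaddr)
{
	memcpy(eth->h_source, slave_perm_hwaddr, ETH_ALEN);
}
```

In the kernel the equivalent memcpy() runs only for ETH_P_IP frames and only when the new policy is selected, a restriction the reviewers discuss below.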
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: John Fastabend @ 2011-01-14 20:10 UTC
To: Oleg V. Ukhno; +Cc: netdev, Jay Vosburgh, David S. Miller

On 1/14/2011 11:07 AM, Oleg V. Ukhno wrote:
> Patch introduces a new hashing policy for the 802.3ad bonding mode.
> This hashing policy can be used (and was tested) only for round-robin
> balancing of iSCSI traffic (a single TCP session is balanced per-packet
> over all slave interfaces).
> General requirements for this hashing policy are:
> 1) the switch must be configured with an src-dst-mac or src-mac hashing
>    policy
> 2) the number of bond slaves on the sending and receiving machines
>    should be equal and preferably even, or at least even, otherwise you
>    may get an asymmetric load on the receiving machine
> 3) the hashing policy must not be used when the round-trip times between
>    the source and destination machines are expected to differ
>    significantly across slaves in the same bond (it works fine when all
>    slaves are plugged into a single switch)
>
> Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
> ---

I think you want this patch against net-next, not 2.6.37.

>
>  Documentation/networking/bonding.txt | 27 +++++++++++++++++++++++++++
>  drivers/net/bonding/bond_3ad.c       |  6 ++++++
>  drivers/net/bonding/bond_main.c      | 18 +++++++++++++++++-
>  include/linux/if_bonding.h           |  1 +
>  4 files changed, 51 insertions(+), 1 deletion(-)
>
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
> --- linux-2.6.37-vanilla/Documentation/networking/bonding.txt	2011-01-05 03:50:19.000000000 +0300
> +++ linux-2.6.37.my/Documentation/networking/bonding.txt	2011-01-14 21:34:46.635268000 +0300
> @@ -759,6 +759,33 @@ xmit_hash_policy
>  		most UDP traffic is not involved in extended
>  		conversations. Other implementations of 802.3ad may
>  		or may not tolerate this noncompliance.
> +
> +	simple-rr or 3
> +		This policy simply sends each successive packet via the
> +		"next" slave interface.  When sending, it resets the
> +		source MAC address of the packet to the real MAC
> +		address of the slave interface.
> +
> +		When the switch is configured properly, and the
> +		receiving machine has an even and equal number of
> +		interfaces, this guarantees quite precise rx/tx load
> +		balancing for any single TCP session.  The typical use
> +		case for this mode is iSCSI (which the patch was
> +		developed for), because it uses a single TCP session to
> +		transmit data.

Oleg, sorry but I don't follow. If this simply sends every next packet
via the "next" slave interface, how are the packets not going to get out
of order? If the links have different RTTs this would seem problematic.

Have you considered using multipath at the block layer? This is how I
generally handle load balancing over iSCSI/FCoE and it works reasonably
well.

see ./drivers/md/dm-mpath.c

> +
> +		It is important to remember that all slaves should be
> +		plugged into a single switch to avoid out-of-order
> +		packets.  It is recommended to have an equal and even
> +		number of slave interfaces in the sending and receiving
> +		machines' bonds, otherwise you will get an asymmetric
> +		load on the receiving host.  Another caveat is that
> +		this policy must not be used when the round-trip times
> +		between the source and destination machines are
> +		expected to differ significantly across slaves in the
> +		same bond (it works fine when all slaves are plugged
> +		into a single switch).
> +
> +		For correct load balancing on the receiving side you
> +		must configure the switch to use src-dst-mac or src-mac
> +		hashing mode.
> +
>
>  	The default value is layer2.  This option was added in bonding
>  	version 2.6.3.  In earlier versions of bonding, this parameter
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
> --- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c	2011-01-14 19:39:05.575268000 +0300
> +++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c	2011-01-14 19:47:03.815268000 +0300
> @@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
>  	int i;
>  	struct ad_info ad_info;
>  	int res = 1;
> +	struct ethhdr *eth_data;
>
>  	/* make sure that the slaves list will
>  	 * not change during tx
> @@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
>  			slave_agg_id = agg->aggregator_identifier;
>
>  		if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
> +			if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
> +				skb_reset_mac_header(skb);
> +				eth_data = eth_hdr(skb);
> +				memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
> +			}
>  			res = bond_dev_queue_xmit(bond, skb, slave->dev);
>  			break;
>  		}
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
> --- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c	2011-01-14 19:39:05.575268000 +0300
> +++ linux-2.6.37.my/drivers/net/bonding/bond_main.c	2011-01-14 19:47:55.835268001 +0300
> @@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
>  MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
>  module_param(xmit_hash_policy, charp, 0);
>  MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
> -				   ", 1 for layer 3+4");
> +				   ", 1 for layer 3+4"
> +				   ", 2 for layer 2+3"
> +				   ", 3 for round-robin");
>  module_param(arp_interval, int, 0);
>  MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
>  module_param_array(arp_ip_target, charp, NULL, 0);
> @@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
>  { "layer2", BOND_XMIT_POLICY_LAYER2},
>  { "layer3+4", BOND_XMIT_POLICY_LAYER34},
>  { "layer2+3", BOND_XMIT_POLICY_LAYER23},
> +{ "simple-rr", BOND_XMIT_POLICY_LAYERRR},
>  { NULL, -1},
>  };
>
> @@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
>  	return (data->h_dest[5] ^ data->h_source[5]) % count;
>  }
>
> +/*
> + * simply round robin
> + */
> +static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
> +				    struct net_device *bond_dev, int count)

Here's one reason why this won't work on net-next-2.6.

	int (*xmit_hash_policy)(struct sk_buff *, int);

Thanks,
John
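The signature change John points to means the hash callback on net-next no longer receives the bond's net_device, so the round-robin counter cannot be fetched via netdev_priv() inside the hash function; it would have to be reached through whatever state the new callback can see. The selection logic itself is just a counter modulo the slave count, which can be sketched in userspace C (struct and function names here are invented for illustration, not the kernel's):

```c
#include <assert.h>

/*
 * Minimal model of the proposed round-robin "hash": every call picks
 * the next slave index in strict rotation. The unsigned counter wraps
 * harmlessly at UINT_MAX; the modulo keeps the result within range.
 */
struct rr_state {
	unsigned int rr_tx_counter;
};

static int rr_pick_slave(struct rr_state *bond, int slave_count)
{
	return (int)(bond->rr_tx_counter++ % (unsigned int)slave_count);
}
```

With four slaves the function yields 0, 1, 2, 3, 0, 1, … on successive packets, which is the per-packet striping the patch aims for.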
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-14 23:12 UTC
To: John Fastabend; +Cc: netdev, Jay Vosburgh, David S. Miller

John Fastabend wrote:
>
> I think you want this patch against net-next, not 2.6.37.

This patch is against 2.6.37-git11, and I've tried to apply it to
net-next - it applied OK.

>
> Oleg, sorry but I don't follow. If this simply sends every next packet
> via the "next" slave interface, how are the packets not going to get
> out of order? If the links have different RTTs this would seem
> problematic.
>
> Have you considered using multipath at the block layer? This is how I
> generally handle load balancing over iSCSI/FCoE and it works reasonably
> well.
>
> see ./drivers/md/dm-mpath.c

John, the first solution I used for a long time for iSCSI load balancing
was multipath. But there are some problems with dm-multipath:
- it is slow (I am using iSCSI for Oracle, so I need to minimize
  latency)
- it handles link failures badly, because of its command queue
  limitation (all queued commands above 32 are discarded in case of path
  failure, as I remember)
- it performs very badly when there are many devices and many paths (I
  was unable to utilize more than 2 Gbps of 4 even with 100 disks with 4
  paths per disk)

My patch won't work correctly when the slave links have different RTTs,
this is true - it is usable only within one Ethernet segment with
equal or near-equal RTTs. This is its limitation.

>> +static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
>> +				    struct net_device *bond_dev, int count)
>
> Here's one reason why this won't work on net-next-2.6.
>
> 	int (*xmit_hash_policy)(struct sk_buff *, int);

Thank you, I've missed that change.

>
> Thanks,
> John

--
Best regards,
Oleg Ukhno
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Jay Vosburgh @ 2011-01-14 20:13 UTC
To: Oleg V. Ukhno; +Cc: netdev, David S. Miller

Oleg V. Ukhno <olegu@yandex-team.ru> wrote:

>Patch introduces a new hashing policy for the 802.3ad bonding mode.
>This hashing policy can be used (and was tested) only for round-robin
>balancing of iSCSI traffic (a single TCP session is balanced per-packet
>over all slave interfaces).

	This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
(f), which requires that all frames of a given "conversation" are passed
to a single port.

	The existing layer3+4 hash has a similar problem (that it may
send packets from a conversation to multiple ports), but for that case
it's an unlikely exception (only in the case of IP fragmentation), but
here it's the norm. At a minimum, this must be clearly documented.

	Also, what does a round robin in 802.3ad provide that the
existing round robin does not? My presumption is that you're looking to
get the aggregator autoconfiguration that 802.3ad provides, but you
don't say.

	I don't necessarily think this is a bad cheat (round robining on
802.3ad as an explicit non-standard extension), since everybody wants to
stripe their traffic across multiple slaves. I've given some thought to
making round robin into just another hash mode, but this also does some
magic to the MAC addresses of the outgoing frames (more on that below).

>General requirements for this hashing policy are:
>1) the switch must be configured with an src-dst-mac or src-mac hashing
>   policy
>2) the number of bond slaves on the sending and receiving machines
>   should be equal and preferably even, or at least even, otherwise you
>   may get an asymmetric load on the receiving machine
>3) the hashing policy must not be used when the round-trip times between
>   the source and destination machines are expected to differ
>   significantly across slaves in the same bond (it works fine when all
>   slaves are plugged into a single switch)
>
>Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
>---
>
> Documentation/networking/bonding.txt | 27 +++++++++++++++++++++++++++
> drivers/net/bonding/bond_3ad.c       |  6 ++++++
> drivers/net/bonding/bond_main.c      | 18 +++++++++++++++++-
> include/linux/if_bonding.h           |  1 +
> 4 files changed, 51 insertions(+), 1 deletion(-)
>
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
>--- linux-2.6.37-vanilla/Documentation/networking/bonding.txt	2011-01-05 03:50:19.000000000 +0300
>+++ linux-2.6.37.my/Documentation/networking/bonding.txt	2011-01-14 21:34:46.635268000 +0300
>@@ -759,6 +759,33 @@ xmit_hash_policy
> 		most UDP traffic is not involved in extended
> 		conversations. Other implementations of 802.3ad may
> 		or may not tolerate this noncompliance.
>+
>+	simple-rr or 3
>+		This policy simply sends each successive packet via the
>+		"next" slave interface.  When sending, it resets the
>+		source MAC address of the packet to the real MAC
>+		address of the slave interface.

	Why is the MAC address reset done? This is also a violation of
802.3ad, 5.2.1 (j).

>+		When the switch is configured properly, and the
>+		receiving machine has an even and equal number of
>+		interfaces, this guarantees quite precise rx/tx load
>+		balancing for any single TCP session.  The typical use
>+		case for this mode is iSCSI (which the patch was
>+		developed for), because it uses a single TCP session to
>+		transmit data.
>+
>+		It is important to remember that all slaves should be
>+		plugged into a single switch to avoid out-of-order
>+		packets.  It is recommended to have an equal and even
>+		number of slave interfaces in the sending and receiving
>+		machines' bonds, otherwise you will get an asymmetric
>+		load on the receiving host.  Another caveat is that
>+		this policy must not be used when the round-trip times
>+		between the source and destination machines are
>+		expected to differ significantly across slaves in the
>+		same bond (it works fine when all slaves are plugged
>+		into a single switch).
>+
>+		For correct load balancing on the receiving side you
>+		must configure the switch to use src-dst-mac or src-mac
>+		hashing mode.
>+
>
> 	The default value is layer2.  This option was added in bonding
> 	version 2.6.3.  In earlier versions of bonding, this parameter
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
>--- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c	2011-01-14 19:39:05.575268000 +0300
>+++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c	2011-01-14 19:47:03.815268000 +0300
>@@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
> 	int i;
> 	struct ad_info ad_info;
> 	int res = 1;
>+	struct ethhdr *eth_data;
>
> 	/* make sure that the slaves list will
> 	 * not change during tx
>@@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
> 			slave_agg_id = agg->aggregator_identifier;
>
> 		if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
>+			if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
>+				skb_reset_mac_header(skb);
>+				eth_data = eth_hdr(skb);
>+				memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
>+			}

	This is the code that resets the MAC header as described above.
It doesn't quite match the documentation, since it only resets the MAC
for ETH_P_IP packets.

> 			res = bond_dev_queue_xmit(bond, skb, slave->dev);
> 			break;
> 		}
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
>--- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c	2011-01-14 19:39:05.575268000 +0300
>+++ linux-2.6.37.my/drivers/net/bonding/bond_main.c	2011-01-14 19:47:55.835268001 +0300
>@@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
> MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
> module_param(xmit_hash_policy, charp, 0);
> MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
>-				   ", 1 for layer 3+4");
>+				   ", 1 for layer 3+4"
>+				   ", 2 for layer 2+3"
>+				   ", 3 for round-robin");
> module_param(arp_interval, int, 0);
> MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
> module_param_array(arp_ip_target, charp, NULL, 0);
>@@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
> { "layer2", BOND_XMIT_POLICY_LAYER2},
> { "layer3+4", BOND_XMIT_POLICY_LAYER34},
> { "layer2+3", BOND_XMIT_POLICY_LAYER23},
>+{ "simple-rr", BOND_XMIT_POLICY_LAYERRR},

	I'd just call it "round-robin" instead of "simple-rr".

> { NULL, -1},
> };
>
>@@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
> 	return (data->h_dest[5] ^ data->h_source[5]) % count;
> }
>
>+/*
>+ * simply round robin
>+ */
>+static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
>+				    struct net_device *bond_dev, int count)
>+{
>+	struct bonding *bond = netdev_priv(bond_dev);
>+	return bond->rr_tx_counter++ % count;
>+}
>+
> /*-------------------------- Device entry points ----------------------------*/
>
> static int bond_open(struct net_device *bond_dev)
>@@ -4482,6 +4495,9 @@ out:
> static void bond_set_xmit_hash_policy(struct bonding *bond)
> {
> 	switch (bond->params.xmit_policy) {
>+	case BOND_XMIT_POLICY_LAYERRR:
>+		bond->xmit_hash_policy = bond_xmit_hash_policy_rr;
>+		break;
> 	case BOND_XMIT_POLICY_LAYER23:
> 		bond->xmit_hash_policy = bond_xmit_hash_policy_l23;
> 		break;
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/include/linux/if_bonding.h linux-2.6.37.my/include/linux/if_bonding.h
>--- linux-2.6.37-vanilla/include/linux/if_bonding.h	2011-01-05 03:50:19.000000000 +0300
>+++ linux-2.6.37.my/include/linux/if_bonding.h	2011-01-14 19:34:29.755268001 +0300
>@@ -91,6 +91,7 @@
> #define BOND_XMIT_POLICY_LAYER2		0 /* layer 2 (MAC only), default */
> #define BOND_XMIT_POLICY_LAYER34	1 /* layer 3+4 (IP ^ (TCP || UDP)) */
> #define BOND_XMIT_POLICY_LAYER23	2 /* layer 2+3 (IP ^ MAC) */
>+#define BOND_XMIT_POLICY_LAYERRR	3 /* round-robin */
>
> typedef struct ifbond {
>	__s32 bond_mode;

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-14 22:51 UTC
To: Jay Vosburgh; +Cc: netdev, David S. Miller

Jay Vosburgh wrote:

> 	This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
> (f), which requires that all frames of a given "conversation" are passed
> to a single port.
>
> 	The existing layer3+4 hash has a similar problem (that it may
> send packets from a conversation to multiple ports), but for that case
> it's an unlikely exception (only in the case of IP fragmentation), but
> here it's the norm. At a minimum, this must be clearly documented.
>
> 	Also, what does a round robin in 802.3ad provide that the
> existing round robin does not? My presumption is that you're looking to
> get the aggregator autoconfiguration that 802.3ad provides, but you
> don't say.
>
> 	I don't necessarily think this is a bad cheat (round robining on
> 802.3ad as an explicit non-standard extension), since everybody wants to
> stripe their traffic across multiple slaves. I've given some thought to
> making round robin into just another hash mode, but this also does some
> magic to the MAC addresses of the outgoing frames (more on that below).

Yes, I am resetting the MAC addresses when transmitting packets to make
the switch put the packets into different ports of the receiving
etherchannel.

I am using this patch to provide full-mesh iSCSI connectivity between at
least 4 hosts (all hosts, of course, are in the same Ethernet segment),
and every host is connected with an aggregate link with 4 slaves
(usually). Using round-robin I get near-equal load striping when
transmitting; using the MAC address magic I force the switch to stripe
packets over all slave links in the destination port-channel (when the
number of rx-ing slaves is equal to the number of tx-ing slaves and is
even). So I am able to utilize all slaves for tx and for rx up to
maximum capacity; besides, I get L2 link failure detection (and load
rebalancing), which is (in my opinion) much faster and more robust than
L3 detection or than what dm-multipath provides. That is the idea of the
patch.

> 	This is the code that resets the MAC header as described above.
> It doesn't quite match the documentation, since it only resets the MAC
> for ETH_P_IP packets.

Yes, I really meant that my patch applies to ETH_P_IP packets only; I
omitted that from the documentation I wrote.

>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

--
Best regards,
Oleg Ukhno
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Jay Vosburgh @ 2011-01-15 0:05 UTC
To: Oleg V. Ukhno; +Cc: netdev, David S. Miller, John Fastabend

Oleg V. Ukhno <olegu@yandex-team.ru> wrote:

>Jay Vosburgh wrote:
>
>> This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
>> (f), which requires that all frames of a given "conversation" are passed
>> to a single port.
>>
>> The existing layer3+4 hash has a similar problem (that it may
>> send packets from a conversation to multiple ports), but for that case
>> it's an unlikely exception (only in the case of IP fragmentation), but
>> here it's the norm. At a minimum, this must be clearly documented.
>>
>> Also, what does a round robin in 802.3ad provide that the
>> existing round robin does not? My presumption is that you're looking to
>> get the aggregator autoconfiguration that 802.3ad provides, but you
>> don't say.

	I'm still curious about this question. Given the rather
intricate setup of your particular network (described below), I'm not
sure why 802.3ad is of benefit over traditional etherchannel
(balance-rr / balance-xor).

>> I don't necessarily think this is a bad cheat (round robining on
>> 802.3ad as an explicit non-standard extension), since everybody wants to
>> stripe their traffic across multiple slaves. I've given some thought to
>> making round robin into just another hash mode, but this also does some
>> magic to the MAC addresses of the outgoing frames (more on that below).

>Yes, I am resetting the MAC addresses when transmitting packets to make
>the switch put the packets into different ports of the receiving
>etherchannel.

	By "etherchannel" do you really mean "Cisco switch with a
port-channel group using LACP"?

>I am using this patch to provide full-mesh iSCSI connectivity between at
>least 4 hosts (all hosts, of course, are in the same Ethernet segment),
>and every host is connected with an aggregate link with 4 slaves
>(usually). Using round-robin I get near-equal load striping when
>transmitting; using the MAC address magic I force the switch to stripe
>packets over all slave links in the destination port-channel (when the
>number of rx-ing slaves is equal to the number of tx-ing slaves and is
>even).

	By "MAC address magic" do you mean that you're assigning
specifically chosen MAC addresses to the slaves so that the switch's
hash is essentially "assigning" the bonding slaves to particular ports
on the outgoing port-channel group?

	Assuming that this is the case, it's an interesting idea, but
I'm unconvinced that it's better on 802.3ad vs. balance-rr. Unless I'm
missing something, you can get everything you need from an option to
have balance-rr / balance-xor utilize the slave's permanent address as
the source address for outgoing traffic.

>[...] So I am able to utilize all slaves for tx and for rx up to
>maximum capacity; besides, I get L2 link failure detection (and load
>rebalancing), which is (in my opinion) much faster and more robust than
>L3 detection or than what dm-multipath provides. That is the idea of
>the patch.

	Can somebody (John?) more knowledgeable than I about
dm-multipath comment on the above?

>> This is the code that resets the MAC header as described above.
>> It doesn't quite match the documentation, since it only resets the MAC
>> for ETH_P_IP packets.

>Yes, I really meant that my patch applies to ETH_P_IP packets only; I
>omitted that from the documentation I wrote.

	Is limiting this to just ETH_P_IP really a means to exclude ARP,
or is there some advantage to (effectively) only balancing IP traffic,
and leaving other traffic (IPv6, for one) essentially unbalanced (when
exiting the switch through the destination port-channel group, which
you've set to use a src-mac hash)?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-15 0:05 ` Jay Vosburgh @ 2011-01-15 12:11 ` Oleg V. Ukhno 2011-01-18 3:16 ` John Fastabend 1 sibling, 0 replies; 32+ messages in thread From: Oleg V. Ukhno @ 2011-01-15 12:11 UTC (permalink / raw) To: Jay Vosburgh; +Cc: netdev, David S. Miller, John Fastabend Jay Vosburgh wrote: > Oleg V. Ukhno <olegu@yandex-team.ru> wrote: >> Jay Vosburgh wrote: >> >>> Also, what does a round robin in 802.3ad provide that the >>> existing round robin does not? My presumption is that you're looking to >>> get the aggregator autoconfiguration that 802.3ad provides, but you >>> don't say. > > I'm still curious about this question. Given the rather > intricate setup of your particular network (described below), I'm not > sure why 802.3ad is of benefit over traditional etherchannel > (balance-rr / balance-xor). Yes, I wanted 802.3ad autoconfiguration. Besides, all switches I use support LACP so I've chosen 802.3ad link aggregation. Of course, it would be cool it both 802.3ad and balance-rr modes supported such load striping feature. > >> Yes, I am resetting MAC addresses when transmitting packets to have switch >> to put packets into different ports of the receiving etherchannel. > > By "etherchannel" do you really mean "Cisco switch with a > port-channel group using LACP"? Yes, exactly > >> I am using this patch to provide full-mesh ISCSI connectivity between at >> least 4 hosts (all hosts of course are in same ethernet segment) and every >> host is connected with aggregate link with 4 slaves(usually). >> Using round-robin I provide near-equal load striping when transmitting, >> using MAC address magic I force switch to stripe packets over all slave >> links in destination port-channel(when number of rx-ing slaves is equal to >> number ot tx-ing slaves and is even). 
> By "MAC address magic" do you mean that you're assigning
> specifically chosen MAC addresses to the slaves so that the switch's
> hash is essentially "assigning" the bonding slaves to particular ports
> on the outgoing port-channel group?

Yes. This way I am able to achieve equal load striping for a single TCP
session between just two hosts, not only on the transmitting host but
also on the receiving host (iperf, when doing a TCP test, is able to
utilize all available bandwidth in the given etherchannel).

> Assuming that this is the case, it's an interesting idea, but
> I'm unconvinced that it's better on 802.3ad vs. balance-rr. Unless I'm
> missing something, you can get everything you need from an option to
> have balance-rr / balance-xor utilize the slave's permanent address as
> the source address for outgoing traffic.

Yes, balance-rr would satisfy my requirements if patched to do the "MAC
address magic" (replacing the MAC address of packets being transmitted
with the slave's permanent address), except for 802.3ad link
autoconfiguration. "Pure" balance-rr won't allow utilizing the whole
etherchannel bandwidth when transmitting data just between 2 hosts (for
example, when I have one iSCSI initiator and one iSCSI target).
balance-xor is not what I wanted, because data transmitted from the
source host will stick to a single slave.

>>> This is the code that resets the MAC header as described above.
>>> It doesn't quite match the documentation, since it only resets the MAC
>>> for ETH_P_IP packets.
>> Yes, I really meant that my patch applies to ETH_P_IP packets and I've
>> missed that in the documentation I wrote.
>
> Is limiting this to just ETH_P_IP really a means to exclude ARP,
> or is there some advantage to (effectively) only balancing IP traffic,
> and leaving other traffic (IPv6, for one) essentially unbalanced (when
> exiting the switch through the destination port-channel group, which
> you've set to use a src-mac hash)?
Well, when making the initial version of this patch (it was for the
2.6.18 kernel), I meant just excluding ARP.

> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

--
Best regards,
Oleg Ukhno
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: John Fastabend @ 2011-01-18 3:16 UTC
To: Jay Vosburgh; +Cc: Oleg V. Ukhno, netdev, David S. Miller

On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
> Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>> Jay Vosburgh wrote:
>>> This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
>>> (f), which requires that all frames of a given "conversation" are passed
>>> to a single port.
>>>
>>> The existing layer3+4 hash has a similar problem (that it may
>>> send packets from a conversation to multiple ports), but for that case
>>> it's an unlikely exception (only in the case of IP fragmentation), but
>>> here it's the norm. At a minimum, this must be clearly documented.
>>>
>>> Also, what does a round robin in 802.3ad provide that the
>>> existing round robin does not? My presumption is that you're looking to
>>> get the aggregator autoconfiguration that 802.3ad provides, but you
>>> don't say.
>
> I'm still curious about this question. Given the rather
> intricate setup of your particular network (described below), I'm not
> sure why 802.3ad is of benefit over traditional etherchannel
> (balance-rr / balance-xor).
>
>>> I don't necessarily think this is a bad cheat (round robining on
>>> 802.3ad as an explicit non-standard extension), since everybody wants to
>>> stripe their traffic across multiple slaves. I've given some thought to
>>> making round robin into just another hash mode, but this also does some
>>> magic to the MAC addresses of the outgoing frames (more on that below).
>> Yes, I am resetting MAC addresses when transmitting packets to have the
>> switch put packets into different ports of the receiving etherchannel.
> By "etherchannel" do you really mean "Cisco switch with a
> port-channel group using LACP"?
>
>> I am using this patch to provide full-mesh iSCSI connectivity between at
>> least 4 hosts (all hosts of course are in the same ethernet segment), and
>> every host is connected with an aggregate link with 4 slaves (usually).
>> Using round-robin I provide near-equal load striping when transmitting;
>> using MAC address magic I force the switch to stripe packets over all
>> slave links in the destination port-channel (when the number of rx-ing
>> slaves is equal to the number of tx-ing slaves and is even).
>
> By "MAC address magic" do you mean that you're assigning
> specifically chosen MAC addresses to the slaves so that the switch's
> hash is essentially "assigning" the bonding slaves to particular ports
> on the outgoing port-channel group?
>
> Assuming that this is the case, it's an interesting idea, but
> I'm unconvinced that it's better on 802.3ad vs. balance-rr. Unless I'm
> missing something, you can get everything you need from an option to
> have balance-rr / balance-xor utilize the slave's permanent address as
> the source address for outgoing traffic.
>
>> [...] So I am able to utilize all slaves
>> for tx and for rx up to maximum capacity; besides I am getting L2 link
>> failure detection (and load rebalancing), which is (in my opinion) much
>> faster and more robust than L3 or than dm-multipath provides.
>> It's my idea with the patch.
>
> Can somebody (John?) more knowledgeable than I about dm-multipath
> comment on the above?

Here I'll give it a go.

I don't think detecting L2 link failure this way is very robust. If there
is a failure farther away than your immediate link, you're going to break
completely: your bonding hash will continue to round robin the iSCSI
packets and half of them will get dropped on the floor. dm-multipath
handles this reasonably gracefully. Also, in this bonding environment you
seem to be very sensitive to RTT times on the network.
Maybe not bad outright, but I wouldn't consider this robust either.

You could tweak your SCSI timeout values and fail_fast values, and set the
io retry to 0 to cause the failover to occur faster. I suspect you already
did this and still it is too slow? Maybe adding a checker in multipathd to
listen for link events would be fast enough. The checker could then fail
the path immediately.

I'll try to address your comments from the other thread here. In general I
wonder if it would be better to solve the problems in dm-multipath rather
than add another bonding mode?

OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize
latency)

The dm-multipath layer is adding latency? How much? If this is really
true, maybe it's best to address the real issue here and not avoid it by
using the bonding layer.

OVU - it handles any link failures badly, because of its command queue
limitation (all queued commands above 32 are discarded in case of path
failure, as I remember)

Maybe true, but only link failures with the immediate peer are handled
with a bonding strategy. By working at the block layer we can detect
failures throughout the path. I would need to look into this again; I
know when we were looking at this some time ago there was some talk about
improving this behavior. I need to take some time to go back through the
error recovery stuff to remember how this works.

OVU - it performs very badly when there are many devices and many paths
(I was unable to utilize more than 2 Gbps of 4, even with 100 disks with
4 paths per disk)

Hmm, well, that seems like something is broken. I'll try this setup when
I get some time in the next few days. This really shouldn't be the case;
dm-multipath should not add a bunch of extra latency or affect throughput
significantly. By the way, what are you seeing without mpio?

Thanks,
John
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-18 12:40 UTC
To: John Fastabend; +Cc: Jay Vosburgh, netdev, David S. Miller

On 01/18/2011 06:16 AM, John Fastabend wrote:
> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>> Can somebody (John?) more knowledgeable than I about dm-multipath
>> comment on the above?
>
> Here I'll give it a go.
>
> I don't think detecting L2 link failure this way is very robust. If there
> is a failure farther away than your immediate link, you're going to break
> completely. Your bonding hash will continue to round robin the iSCSI
> packets and half of them will get dropped on the floor. dm-multipath
> handles this reasonably gracefully. Also in this bonding environment you
> seem to be very sensitive to RTT times on the network. Maybe not bad
> outright, but I wouldn't consider this robust either.

John, I agree - this bonding mode should be used in a quite limited
number of situations. But as for a failure farther away than the
immediate link - every bonding mode suffers the same problem in this
case: bonding detects only L2 failures; the rest is handled by
upper-layer mechanisms. And almost all bonding modes depend on equal RTT
on the slaves.

There is already a similar load balancing mode - balance-alb. What I did
is approximately the same, but for the 802.3ad bonding mode, and it
provides "better" (more equal and unconditional, layer 2) load striping
for tx and _rx_.

I think I shouldn't have mentioned the particular use case of this patch.
When I wrote it I tried to make a more general solution - my goal was
"make equal or near-equal load striping for TX and (most important part)
RX within a single ethernet (layer 2) domain for TCP transmission".
This bonding mode just introduces the ability to stripe rx and tx load
for a single TCP connection between hosts inside one ethernet segment.
iSCSI is just an example; it is possible to stripe load between a
linux-based router and a linux-based web/ftp/etc server in the same
manner. I think this feature will be useful in some number of network
configurations.

Also, I looked into the net-next code - it seems to me that it can be
implemented (adapted to the net-next bonding code) without any
difficulties, and the hashing function change poses no problem here.

What I've written below is just my personal experience and opinion after
5 years of using Oracle + iSCSI + mpath (later - patched bonding). From
my personal experience I can say that most iSCSI failures are caused by
link failures, and also I would never send any significant iSCSI traffic
via a router - the router would be a bottleneck in this case. So, in my
case iSCSI traffic flows within one ethernet domain, and in case of link
failure the bonding driver simply fails one slave (in case of bonding),
instead of checking and failing hundreds of paths (in case of mpath); the
first case is significantly less CPU-, network- and time-consuming (if
using the default mpath checker - readsector0). Mpath is good for me when
I use it to "merge" drbd mirrors from different hosts, but for just doing
simple load striping within a single L2 network switch between 2 .. 16
hosts it is some overkill (particularly in maintaining human-readable
device naming) :)

John, what is your opinion on such a load balancing method in general,
without referring to particular use cases?

> You could tweak your scsi timeout values and fail_fast values, set the io
> retry to 0 to cause the fail over to occur faster. I suspect you already
> did this and still it is too slow? Maybe adding a checker in multipathd to
> listen for link events would be fast enough. The checker could then fail
> the path immediately.
>
> I'll try to address your comments from the other thread here.
> In general I wonder if it would be better to solve the problems in
> dm-multipath rather than add another bonding mode?

Of course I did this, but mpath is fine when the device count is below
30-40 devices with two paths; 150-200 devices with 2+ paths can make life
far more interesting :)

> OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize
> latency)
>
> The dm-multipath layer is adding latency? How much? If this is really
> true, maybe it's best to address the real issue here and not avoid it by
> using the bonding layer.

I do not remember the exact number now, but switching one of my
databases to bonding, about 2 years ago, increased read throughput for
the entire db from 15-20 Tb/day to approximately 30-35 Tb/day (4 iSCSI
initiators and 8 iSCSI targets, 4 ethernet links for iSCSI on each host,
all plugged into one switch) because of "full" bandwidth use. Also,
bonding usage greatly simplifies network and application setup (compared
to mpath).

> OVU - it handles any link failures badly, because of its command queue
> limitation (all queued commands above 32 are discarded in case of path
> failure, as I remember)
>
> Maybe true, but only link failures with the immediate peer are handled
> with a bonding strategy. By working at the block layer we can detect
> failures throughout the path. I would need to look into this again; I
> know when we were looking at this some time ago there was some talk about
> improving this behavior. I need to take some time to go back through the
> error recovery stuff to remember how this works.
> OVU - it performs very badly when there are many devices and many paths
> (I was unable to utilize more than 2 Gbps of 4, even with 100 disks with
> 4 paths per disk)

Well, I think that behavior can be explained in this way: when balancing
by I/O count per path (rr_min_io) with a huge number of devices, mpath is
doing load balancing per device, and it is not possible to guarantee
equal device use for all devices, so there will be imbalance over the
network interfaces (mpath is unaware of their existence, etc.), and it is
likely to become more imbalanced when there are many devices. Also,
counting I/Os for many devices and paths consumes some CPU resources and
can cause excessive context switches.

> Hmm, well, that seems like something is broken. I'll try this setup when
> I get some time in the next few days. This really shouldn't be the case;
> dm-multipath should not add a bunch of extra latency or affect throughput
> significantly. By the way, what are you seeing without mpio?

And one more observation from my 2-year-old tests: reading a device
(using dd) (RHEL 5 update 1 kernel, ramdisk via iSCSI via loopback) as an
mpath device with a single path ran at approximately 120-150 mb/s, and
the same test on a non-mpath device at 800-900 mb/s. Here I am quite
sure; it was a kind of revelation to me at the time.

> Thanks,
> John

--
Best regards,
Oleg Ukhno.
ITO Team Lead, Yandex LLC.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Nicolas de Pesloüan @ 2011-01-18 14:54 UTC
To: Oleg V. Ukhno, John Fastabend, Jay Vosburgh, David S. Miller; +Cc: netdev, Sébastien Barré, Christophe Paasch

On 18/01/2011 13:40, Oleg V. Ukhno wrote:

The fact that there exist many situations where it simply doesn't work
should not cause Oleg's idea to be rejected.

In Documentation/networking/bonding.txt, tuning tcp_reordering on the
receiving side is already documented as a possible workaround for out of
order delivery due to load balancing of a single TCP session, using
mode=balance-rr.

This might work reasonably well in a pure LAN topology, without any
router between both ends of the TCP session, even if this is limited to
Linux hosts. The uses are not uncommon and not limited to iSCSI:
- between an application server and a database server,
- between members of a cluster, for replication purposes,
- between a server and a backup system,
- ...

Of course, for longer paths, with routers and variable RTT, we would
need something different (possibly MultiPathTCP:
http://datatracker.ietf.org/wg/mptcp/).

I remember a topology (described by Jay, as far as I remember), where
two hosts were connected through two distinct VLANs. In such a topology:
- it is possible to detect path failure using arp monitoring instead of
  miimon.
- changing the destination MAC address of egress packets is not
  necessary, because egress path selection forces ingress path selection
  due to the VLAN.

I think the only point is whether we need a new xmit_hash_policy for
mode=802.3ad or whether mode=balance-rr could be enough.

Oleg, would you mind trying the above "two VLAN" topology with
mode=balance-rr and reporting any results?
For high-availability purposes, it's obviously necessary to set up those
VLANs on distinct switches.

	Nicolas
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-18 15:28 UTC
To: Nicolas de Pesloüan; +Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev, Sébastien Barré, Christophe Paasch

On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
> On 18/01/2011 13:40, Oleg V. Ukhno wrote:
>
> The fact that there exist many situations where it simply doesn't work
> should not cause Oleg's idea to be rejected.
>
> In Documentation/networking/bonding.txt, tuning tcp_reordering on the
> receiving side is already documented as a possible workaround for out of
> order delivery due to load balancing of a single TCP session, using
> mode=balance-rr.
>
> This might work reasonably well in a pure LAN topology, without any
> router between both ends of the TCP session, even if this is limited to
> Linux hosts. The uses are not uncommon and not limited to iSCSI:
> - between an application server and a database server,
> - between members of a cluster, for replication purposes,
> - between a server and a backup system,
> - ...

Nicolas, thank you for your opinion - this is exactly what I mean: iSCSI
is just one particular use case, but there are many cases where this
load balancing method will be useful.

> Of course, for longer paths, with routers and variable RTT, we would
> need something different (possibly MultiPathTCP:
> http://datatracker.ietf.org/wg/mptcp/).
>
> I remember a topology (described by Jay, as far as I remember), where
> two hosts were connected through two distinct VLANs. In such a topology:
> - it is possible to detect path failure using arp monitoring instead of
>   miimon.
> - changing the destination MAC address of egress packets is not
>   necessary, because egress path selection forces ingress path selection
>   due to the VLAN.

In the case with two VLANs - yes, this shouldn't be necessary (but it
needs to be tested, I am not sure), but within one VLAN it is essential
for correct rx load striping.

> I think the only point is whether we need a new xmit_hash_policy for
> mode=802.3ad or whether mode=balance-rr could be enough.

Maybe, but it seems fair to me not to restrict this feature only to
non-LACP aggregate links; dynamic aggregation may be useful (it
sometimes helps to avoid switch misconfiguration (misconfigured slaves
on the switch side) without loss of service).

> Oleg, would you mind trying the above "two VLAN" topology with
> mode=balance-rr and reporting any results? For high-availability
> purposes, it's obviously necessary to set up those VLANs on distinct
> switches.

I'll do it, but it will take some time to set up a test environment -
several days, maybe. You mean the following topology:

         switch 1
        /        \
  host A          host B
        \        /
         switch 2

(I'm sure it will work as desired if each host is connected to each
switch with only one slave link; if there are more slaves on each switch
- unsure)?

--
Best regards,
Oleg Ukhno.
ITO Team Lead, Yandex LLC.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Nicolas de Pesloüan @ 2011-01-18 16:24 UTC
To: Oleg V. Ukhno; +Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev, Sébastien Barré, Christophe Paasch

On 18/01/2011 16:28, Oleg V. Ukhno wrote:
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> I remember a topology (described by Jay, as far as I remember), where
>> two hosts were connected through two distinct VLANs. In such a topology:
>> - it is possible to detect path failure using arp monitoring instead of
>>   miimon.
>> - changing the destination MAC address of egress packets is not
>>   necessary, because egress path selection forces ingress path selection
>>   due to the VLAN.
>
> In the case with two VLANs - yes, this shouldn't be necessary (but it
> needs to be tested, I am not sure), but within one VLAN it is essential
> for correct rx load striping.

Changing the destination MAC address is definitely not required if you
segregate each path in a distinct VLAN.

          +-------------------+     +-------------------+
  +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
  |       +-------------------+     +-------------------+       |
+------+                                                    +------+
|host A|                                                    |host B|
+------+                                                    +------+
  |       +-------------------+     +-------------------+       |
  +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+
          +-------------------+     +-------------------+

Even in the presence of ISL between some switches, a packet sent through
the host A interface connected to vlan 100 will only enter host B through
the interface connected to vlan 100. So every slave of the bonding
interface can use the same MAC address.
Of course, changing the destination address would be required in order
to achieve ingress load balancing on a *single* LAN. But, as Jay noted
at the beginning of this thread, this would violate 802.3ad.

>> I think the only point is whether we need a new xmit_hash_policy for
>> mode=802.3ad or whether mode=balance-rr could be enough.
> Maybe, but it seems fair to me not to restrict this feature only to
> non-LACP aggregate links; dynamic aggregation may be useful (it
> sometimes helps to avoid switch misconfiguration (misconfigured slaves
> on the switch side) without loss of service).

You are right, but such a LAN setup needs to be carefully designed and
built. I'm not sure that an automatic channel aggregation system is the
right way to do it. Hence the reason why I suggest using balance-rr with
VLANs.

>> Oleg, would you mind trying the above "two VLAN" topology with
>> mode=balance-rr and reporting any results? For high-availability
>> purposes, it's obviously necessary to set up those VLANs on distinct
>> switches.
> I'll do it, but it will take some time to set up a test environment -
> several days, maybe.

Thanks. For testing purposes, it is enough to set up those VLANs on a
single switch, if that is easier for you to do.

> You mean the following topology:

See above.

> (I'm sure it will work as desired if each host is connected to each
> switch with only one slave link; if there are more slaves on each switch
> - unsure)?

If you want to use more than 2 slaves per host, then you need more than
2 VLANs. You also need to have the exact same number of slaves on all
hosts, as egress path selection causes ingress path selection at the
other side.
            +-------------------+     +-------------------+
    +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
    |       +-------------------+     +-------------------+       |
+------+                                                      +------+
|host A|                                                      |host B|
+------+                                                      +------+
  | |       +-------------------+     +-------------------+       | |
  | +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+ |
  |         +-------------------+     +-------------------+         |
  |         +-------------------+     +-------------------+         |
  +---------|switch 5 - vlan 300|-----|switch 6 - vlan 300|---------+
            +-------------------+     +-------------------+

Of course, you can add other hosts to vlan 100, 200 and 300, with the
exact same configuration as host A or host B.

	Nicolas.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-18 16:57 UTC
To: Nicolas de Pesloüan; +Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev, Sébastien Barré, Christophe Paasch

On 01/18/2011 07:24 PM, Nicolas de Pesloüan wrote:
> On 18/01/2011 16:28, Oleg V. Ukhno wrote:
>> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>>> I remember a topology (described by Jay, as far as I remember), where
>>> two hosts were connected through two distinct VLANs. In such a topology:
>>> - it is possible to detect path failure using arp monitoring instead of
>>>   miimon.
>>> - changing the destination MAC address of egress packets is not
>>>   necessary, because egress path selection forces ingress path selection
>>>   due to the VLAN.
>>
>> In the case with two VLANs - yes, this shouldn't be necessary (but it
>> needs to be tested, I am not sure), but within one VLAN it is essential
>> for correct rx load striping.
>
> Changing the destination MAC address is definitely not required if you
> segregate each path in a distinct VLAN.

Yes, such an L2 network topology should provide the necessary high
availability and load striping without the need to change MAC addresses.
But it is more difficult to maintain and to understand, in my opinion
(when there are just several configurations like this, it's OK, but what
when you have 50 or more?) - this is why I've chosen 802.3ad.

> Even in the presence of ISL between some switches, a packet sent through
> the host A interface connected to vlan 100 will only enter host B through
> the interface connected to vlan 100. So every slave of the bonding
> interface can use the same MAC address.
>
> Of course, changing the destination address would be required in order
> to achieve ingress load balancing on a *single* LAN.
> But, as Jay noted
> at the beginning of this thread, this would violate 802.3ad.

I think receiving the same MAC address on different ports of the same
host will just make any troubleshooting much harder, won't it? With
different MACs it usually takes little time to find out where the
problem is. I think that implementing a choice of whether to use a
single MAC address in the etherchannel or the slaves' real MAC addresses
won't harm anything for either the 802.3ad or balance-rr mode, but will
simplify usage without doing any evil, when documented properly.

> You are right, but such a LAN setup needs to be carefully designed and
> built. I'm not sure that an automatic channel aggregation system is the
> right way to do it. Hence the reason why I suggest using balance-rr
> with VLANs.
>
>>> Oleg, would you mind trying the above "two VLAN" topology with
>>> mode=balance-rr and reporting any results? For high-availability
>>> purposes, it's obviously necessary to set up those VLANs on distinct
>>> switches.
>> I'll do it, but it will take some time to set up a test environment -
>> several days, maybe.
>
> Thanks. For testing purposes, it is enough to set up those VLANs on a
> single switch, if that is easier for you to do.

Well, I'll do it with 2 switches :)

>> You mean the following topology:
>
> See above.
>
>> (I'm sure it will work as desired if each host is connected to each
>> switch with only one slave link; if there are more slaves on each switch
>> - unsure)?
>
> If you want to use more than 2 slaves per host, then you need more than
> 2 VLANs.

That's what I don't like in this solution. Within one LAN it is simpler
and requires less configuration effort.

> You also need to have the exact same number of slaves on all
> hosts, as egress path selection causes ingress path selection at the
> other side.

Well, and here's one difference from bonding with my patch.
With my patch applied, it is not required to have an equal number of
slaves; it is enough to have an *even* number of slaves. This almost
always (so far I haven't seen the opposite) guarantees good rx (ingress)
load striping.

> Nicolas.

--
Best regards,
Oleg Ukhno
Head of Commercial and Financial Services Operations, Yandex LLC
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Jay Vosburgh @ 2011-01-18 20:24 UTC
To: Nicolas de Pesloüan; +Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev, Sébastien Barré, Christophe Paasch

Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:
>On 18/01/2011 16:28, Oleg V. Ukhno wrote:
>> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>>> I remember a topology (described by Jay, as far as I remember), where
>>> two hosts were connected through two distinct VLANs. In such a topology:
>>> - it is possible to detect path failure using arp monitoring instead of
>>>   miimon.

	I don't think this is true, at least not for the case of
balance-rr. Using ARP monitoring with any sort of load balance scheme is
problematic, because the replies may be balanced to a different slave
than the sender.

>>> - changing the destination MAC address of egress packets is not
>>>   necessary, because egress path selection forces ingress path selection
>>>   due to the VLAN.

	This is true, with one comment: Oleg's proposal we're discussing
changes the source MAC address of outgoing packets, not the destination.
The purpose being to manipulate the src-mac balancing algorithm on the
switch when the packets are hashed at the egress port channel group.
The packets (for a particular destination) all bear the same destination
MAC, but (as I understand it) are manually assigned tailored source MAC
addresses that hash to sequential values.

>> In the case with two VLANs - yes, this shouldn't be necessary (but it
>> needs to be tested, I am not sure), but within one VLAN it is essential
>> for correct rx load striping.
>Changing the destination MAC address is definitely not required if you
>segregate each path in a distinct VLAN.
>
>          +-------------------+     +-------------------+
>  +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
>  |       +-------------------+     +-------------------+       |
>+------+                                                    +------+
>|host A|                                                    |host B|
>+------+                                                    +------+
>  |       +-------------------+     +-------------------+       |
>  +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+
>          +-------------------+     +-------------------+
>
>Even in the presence of ISL between some switches, a packet sent through
>the host A interface connected to vlan 100 will only enter host B through
>the interface connected to vlan 100. So every slave of the bonding
>interface can use the same MAC address.

	That's true. The big problem with the "VLAN tunnel" approach is
that it's not tolerant of link failures.

>Of course, changing the destination address would be required in order
>to achieve ingress load balancing on a *single* LAN. But, as Jay noted
>at the beginning of this thread, this would violate 802.3ad.
>
>>> I think the only point is whether we need a new xmit_hash_policy for
>>> mode=802.3ad or whether mode=balance-rr could be enough.
>> Maybe, but it seems fair to me not to restrict this feature only to
>> non-LACP aggregate links; dynamic aggregation may be useful (it
>> sometimes helps to avoid switch misconfiguration (misconfigured slaves
>> on the switch side) without loss of service).
>
>You are right, but such a LAN setup needs to be carefully designed and
>built. I'm not sure that an automatic channel aggregation system is the
>right way to do it. Hence the reason why I suggest using balance-rr with
>VLANs.

	The "VLAN tunnel" approach is a derivative of an actual switch
topology that balance-rr was originally intended for, many moons ago.
This is described in the current bonding.txt; I'll cut & paste a bit
here:

12.2 Maximum Throughput in a Multiple Switch Topology
-----------------------------------------------------

	Multiple switches may be utilized to optimize for throughput
when they are configured in parallel as part of an isolated network
between two or more systems, for example:

                       +-----------+
                       |  Host A   |
                       +-+---+---+-+
                         |   |   |
                +--------+   |   +---------+
                |            |             |
         +------+---+  +-----+----+  +-----+----+
         | Switch A |  | Switch B |  | Switch C |
         +------+---+  +-----+----+  +-----+----+
                |            |             |
                +--------+   |   +---------+
                         |   |   |
                       +-+---+---+-+
                       |  Host B   |
                       +-----------+

	In this configuration, the switches are isolated from one
another.  One reason to employ a topology such as this is for an
isolated network with many hosts (a cluster configured for high
performance, for example), using multiple smaller switches can be more
cost effective than a single larger switch, e.g., on a network with 24
hosts, three 24 port switches can be significantly less expensive than
a single 72 port switch.

	If access beyond the network is required, an individual host
can be equipped with an additional network device connected to an
external network; this host then additionally acts as a gateway.

[end of cut]

	This was described to me some time ago as an early usage model
for balance-rr using multiple 10 Mb/sec switches.  It has the same link
monitoring problems as the "VLAN tunnel" approach, although modern
switches with "trunk failover" type of functionality may be able to
mitigate the problem.

>>> Oleg, would you mind trying the above "two VLAN" topology with
>>> mode=balance-rr and report any results? For high-availability purposes,
>>> it's obviously necessary to set up those VLANs on distinct switches.

>> I'll do it, but it will take some time to set up the test environment,
>> maybe several days.

>Thanks. For testing purposes, it is enough to set up those VLANs on a
>single switch if it is easier for you to do.
>
>> You mean the following topology:
>
>See above.
>
>> (I'm sure it will work as desired if each host is connected to each
>> switch with only one slave link; if there are more slaves in each switch
>> - unsure)?
>
>If you want to use more than 2 slaves per host, then you need more than 2
>VLANs. You also need to have the exact same number of slaves on all hosts,
>as egress path selection causes ingress path selection at the other side.
>
>           +-------------------+     +-------------------+
>   +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
>   |       +-------------------+     +-------------------+       |
>+------+  |                                                |  +------+
>|host A|  |                                                |  |host B|
>+------+  |                                                |  +------+
>   | |     +-------------------+     +-------------------+     | |
>   | +-----|switch 3 - vlan 200|-----|switch 4 - vlan 200|-----+ |
>   |       +-------------------+     +-------------------+       |
>   |                |                          |                 |
>   |                |                          |                 |
>   |       +-------------------+     +-------------------+       |
>   +-------|switch 5 - vlan 300|-----|switch 6 - vlan 300|-------+
>           +-------------------+     +-------------------+
>
>Of course, you can add other hosts to vlan 100, 200 and 300, with the
>exact same configuration as host A or host B.

	This is essentially the same thing as the diagram I pasted in up
above, except with VLANs and an additional layer of switches between the
hosts.  The multiple VLANs take the place of multiple discrete switches.

	This could also be accomplished via bridge groups (in
Cisco-speak).  For example, instead of VLAN 100, that could be bridge
group X, VLAN 200 is bridge group Y, and so on.

	Neither the VLAN nor the bridge group methods handle link
failures very well; if, in the above diagram, the link from "switch 2
vlan 100" to "host B" fails, there's no way for host A to know to stop
sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
to "host B."

	One item I'd like to see some more data on is the level of
reordering at the receiver in Oleg's system.
One of the reasons round robin isn't as useful as it once was is
due to the rise of NAPI and interrupt coalescing, both of which will
tend to increase the reordering of packets at the receiver when the
packets are evenly striped.  In the old days, it was one interrupt, one
packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
packets striped across interfaces, this will tend to increase
reordering.  E.g.,

	slave 1		slave 2		slave 3
	Packet 1	P2		P3
	P4		P5		P6
	P7		P8		P9

	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.

	I haven't done much testing with this lately, but I suspect this
behavior hasn't really changed.  Raising the tcp_reordering sysctl value
can mitigate this somewhat (by making TCP more tolerant of this), but
that doesn't help non-TCP protocols.

	Barring evidence to the contrary, I presume that Oleg's system
delivers out of order at the receiver.  That's not automatically a
reason to reject it, but this entire proposal is sufficiently complex to
configure that very explicit documentation will be necessary.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 20:24               ` Jay Vosburgh
@ 2011-01-18 21:20                 ` Nicolas de Pesloüan
  2011-01-19  1:45                   ` Jay Vosburgh
  2011-01-18 22:22                 ` Oleg V. Ukhno
  2011-01-19 16:13                 ` Oleg V. Ukhno
  2 siblings, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-18 21:20 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev,
      Sébastien Barré, Christophe Paasch

On 18/01/2011 21:24, Jay Vosburgh wrote:
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com> wrote:
>>>> - it is possible to detect path failure using arp monitoring instead of
>>>> miimon.
>
> 	I don't think this is true, at least not for the case of
> balance-rr.  Using ARP monitoring with any sort of load balance scheme
> is problematic, because the replies may be balanced to a different slave
> than the sender.

Can't we achieve the expected arp monitoring by using the exact same
artifice that Oleg suggested: using a different source MAC per slave for
arp monitoring, so that the return path matches the sending path?

>>>> - changing the destination MAC address of egress packets are not
>>>> necessary, because egress path selection force ingress path selection
>>>> due to the VLAN.
>
> 	This is true, with one comment: Oleg's proposal we're discussing
> changes the source MAC address of outgoing packets, not the destination.
> The purpose being to manipulate the src-mac balancing algorithm on the
> switch when the packets are hashed at the egress port channel group.
> The packets (for a particular destination) all bear the same destination
> MAC, but (as I understand it) are manually assigned tailored source MAC
> addresses that hash to sequential values.

Yes, you're right.

> 	That's true.  The big problem with the "VLAN tunnel" approach is
> that it's not tolerant of link failures.

Yes, except if we find a way to make arp monitoring reliable in a load
balancing situation.
[snip]

> 	This is essentially the same thing as the diagram I pasted in up
> above, except with VLANs and an additional layer of switches between the
> hosts.  The multiple VLANs take the place of multiple discrete switches.
>
> 	This could also be accomplished via bridge groups (in
> Cisco-speak).  For example, instead of VLAN 100, that could be bridge
> group X, VLAN 200 is bridge group Y, and so on.
>
> 	Neither the VLAN nor the bridge group methods handle link
> failures very well; if, in the above diagram, the link from "switch 2
> vlan 100" to "host B" fails, there's no way for host A to know to stop
> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
> to "host B."

Can't we imagine "arp monitoring" the destination MAC address of host B,
on both paths? That way, host A would know that a given path is down,
because the return path would be the same. The target host should send
the reply on the slave on which it receives the request, which is the
normal way to reply to an arp request.

> 	One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.

This is exactly the reason why I asked Oleg to do some tests with
balance-rr. I cannot find a good reason for a possibly new
xmit_hash_policy to provide better throughput than the current
balance-rr. If the throughput increases by, let's say, less than 20%,
whatever the tcp_reordering value, then it is probably a dead end.

> 	One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped.  In the old days, it was one interrupt, one
> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
> packets striped across interfaces, this will tend to increase
> reordering.  E.g.,
>
>	slave 1		slave 2		slave 3
>	Packet 1	P2		P3
>	P4		P5		P6
>	P7		P8		P9
>
> and so on.
A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.

Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 and
P7, P8, P9 on slave 3, possibly by sending grouped packets, changing the
sending slave every N packets instead of every packet? I think we already
discussed this possibility a few months or years ago on the bonding-devel
ML. As far as I remember, the idea was not developed because it was not
easy to find the number of packets to send through the same slave.
Anyway, this might help reduce out of order delivery.

> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.

Yes, and this is already true for some bonding modes and in particular
for balance-rr.

	Nicolas.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 21:20                 ` Nicolas de Pesloüan
@ 2011-01-19  1:45                   ` Jay Vosburgh
  0 siblings, 0 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-19  1:45 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev,
      Sébastien Barré, Christophe Paasch

Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:

>On 18/01/2011 21:24, Jay Vosburgh wrote:
>> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com> wrote:
>
>>>>> - it is possible to detect path failure using arp monitoring instead of
>>>>> miimon.
>>
>> 	I don't think this is true, at least not for the case of
>> balance-rr.  Using ARP monitoring with any sort of load balance scheme
>> is problematic, because the replies may be balanced to a different slave
>> than the sender.
>
>Can't we achieve the expected arp monitoring by using the exact same
>artifice that Oleg suggested: using a different source MAC per slave for
>arp monitoring, so that the return path matches the sending path?

	It's not as simple with ARP, because it's a control protocol
that has side effects.

	First, the MAC level broadcast ARP probes from bonding would
have to be round robined in such a manner that they regularly arrive at
every possible slave.  A single broadcast won't be sent to more than one
member of the channel group by the switch.

	We can't do multiple unicast ARPs with different destination MAC
addresses, because we'd have to track all of those MACs somewhere (keep
track of the MAC of every slave on each peer we're monitoring).  I
suspect that snooping switches will get all whiny about port flapping
and the like.

	We could have a separate IP address per slave, used only for
link monitoring, but that's a huge headache.
	Actually, it's a lot like the multi-link stuff I've been working
on (and posted an RFC of in December), but that doesn't use ARP (it
segregates slaves by IP subnet, and balances at the IP layer).
Basically, you need an overlaying active protocol to handle the map of
which slave goes where (which multi-link has).

	So, maybe we have the ARP replies massaged such that the
Ethernet header source and ARP target hardware address don't match.  So
the probes from bonding currently look like this:

MAC-A > ff:ff:ff:ff:ff:ff Request who-has 10.0.4.2 tell 10.0.1.1

	Where MAC-A is the bond's MAC address.  And the replies now look
like this:

MAC-B > MAC-A, Reply 10.0.4.2 is-at MAC-B

	Where MAC-B is the MAC of the peer's bond.  The massaged replies
would be of the form:

MAC-C > MAC-A, Reply 10.0.4.2 is-at MAC-B

	where MAC-C is the slave "permanent" address (which is really a
fake address to manipulate the switch's hash), and MAC-B is whatever the
real MAC of the bond is.

	I don't think we can mess with MAC-B in the reply (the "is-at"
part), because that would update ARP tables and such.  If we change
MAC-A in the reply, they're liable to be filtered out.  I really don't
know if putting MAC-C in there as the source would confuse snooping
switches or not.

	One other thought I had while chewing on this is to run the LACP
protocol exchange between the bonding peers directly, instead of between
each bond and each switch.  I have no idea if this would work or not,
but the theory would look something like the "VLAN tunnel" topology for
the switches, but the bonds at the ends are configured for 802.3ad.  To
make this work, bonding would have to be able to run multiple LACP
instances (one for each bonding peer on the network) over a single
aggregator (or permit slaves to belong to multiple active aggregators).
This would basically be the same as the multi-link business, except
using LACP for the active protocol to build the map.
A distinguished correspondent (who may confess if he so chooses) also suggested 802.2 LLC XID or TEST frames, which have been discussed in the past. Those don't have side effects, but I'm not sure if either is technically feasible, or if we really want bonding to have a dependency on llc. They would also only interop with hosts that respond to the XID or TEST. I haven't thought about this in detail for a number of years, but I think the LLC DSAP / SSAP space is pretty small. >>>>> - changing the destination MAC address of egress packets are not >>>>> necessary, because egress path selection force ingress path selection >>>>> due to the VLAN. >> >> This is true, with one comment: Oleg's proposal we're discussing >> changes the source MAC address of outgoing packets, not the destination. >> The purpose being to manipulate the src-mac balancing algorithm on the >> switch when the packets are hashed at the egress port channel group. >> The packets (for a particular destination) all bear the same destination >> MAC, but (as I understand it) are manually assigned tailored source MAC >> addresses that hash to sequential values. > >Yes, you're right. > >> That's true. The big problem with the "VLAN tunnel" approach is >> that it's not tolerant of link failures. > >Yes, except if we find a way to make arp monitoring reliable in load balancing situation. > >[snip] > >> This is essentially the same thing as the diagram I pasted in up >> above, except with VLANs and an additional layer of switches between the >> hosts. The multiple VLANs take the place of multiple discrete switches. >> >> This could also be accomplished via bridge groups (in >> Cisco-speak). For example, instead of VLAN 100, that could be bridge >> group X, VLAN 200 is bridge group Y, and so on. 
>> >> Neither the VLAN nor the bridge group methods handle link >> failures very well; if, in the above diagram, the link from "switch 2 >> vlan 100" to "host B" fails, there's no way for host A to know to stop >> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100 >> to "host B." > >Can't we imagine to "arp monitor" the destination MAC address of host B, >on both paths ? That way, host A would know that a given path is down, >because return path would be the same. The target host should send the >reply on the slave on which it receive the request, which is the normal >way to reply to arp request. I think you can only get away with this if each slave set (where a "set" is one slave from each bond that's attending our little load balancing party) is on a separate switch domain, and the switch domains are not bridged together. Otherwise the switches will flap their MAC tables as they update from each probe that they see. As for the reply going out the same slave, to do that, bonding would have to intercept the ARP traffic (because ARPs arriving on slaves are normally assigned to the bond itself, not the slave) and track and tweak them. Lastly, bonding would again have to maintain a map, showing which destinations are reachable via which set of slaves. All peer systems (needing to have per-slave link monitoring) would have to be ARP targets. >> One item I'd like to see some more data on is the level of >> reordering at the receiver in Oleg's system. > >This is exactly the reason why I asked Oleg to do some test with >balance-rr. I cannot find a good reason for a possibly new >xmit_hash_policy to provide better throughput than current balance-rr. If >the throughput increase by, let's say, less than 20%, whatever >tcp_reordering value, then it is probably a dead end way. 
Well, the point of making a round robin xmit_hash_policy isn't that the throughput will be better than the existing round robin, it's to make round-robin accessible to the 802.3ad mode. >> One of the reasons round robin isn't as useful as it once was is >> due to the rise of NAPI and interrupt coalescing, both of which will >> tend to increase the reordering of packets at the receiver when the >> packets are evenly striped. In the old days, it was one interrupt, one >> packet. Now, it's one interrupt or NAPI poll, many packets. With the >> packets striped across interfaces, this will tend to increase >> reordering. E.g., >> >> slave 1 slave 2 slave 3 >> Packet 1 P2 P3 >> P4 P5 P6 >> P7 P8 P9 >> >> and so on. A poll of slave 1 will get packets 1, 4 and 7 (and >> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc. > >Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 et P7, >P8, P9 on slave3, possibly by sending grouped packets, changing the >sending slave every N packets instead of every packet ? I think we already >discussed this possibility a few months or years ago in bonding-devel >ML. For as far as I remember, the idea was not developed because it was >not easy to find the number of packets to send through the same >slave. Anyway, this might help reduce out of order delivery. Yes, this came up several years ago, and, basically, there's no way to do it perfectly. An interesting experiment would be to see if sending groups (perhaps close to the NAPI weight of the receiver) would reduce reordering. >> Barring evidence to the contrary, I presume that Oleg's system >> delivers out of order at the receiver. That's not automatically a >> reason to reject it, but this entire proposal is sufficiently complex to >> configure that very explicit documentation will be necessary. > >Yes, and this is already true for some bonding modes and in particular for balance-rr. 
I don't think any modes other than balance-rr will deliver out of order normally. It can happen during edge cases, e.g., alb rebalance, or the layer3+4 hash with IP fragments, but I'd expect those to be at a much lower rate than what round robin causes. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-18 20:24 ` Jay Vosburgh 2011-01-18 21:20 ` Nicolas de Pesloüan @ 2011-01-18 22:22 ` Oleg V. Ukhno 2011-01-19 16:13 ` Oleg V. Ukhno 2 siblings, 0 replies; 32+ messages in thread From: Oleg V. Ukhno @ 2011-01-18 22:22 UTC (permalink / raw) To: Jay Vosburgh Cc: Nicolas de Pesloüan, John Fastabend, David S. Miller, netdev, Sébastien Barré, Christophe Paasch Jay Vosburgh wrote: > > One item I'd like to see some more data on is the level of > reordering at the receiver in Oleg's system. > > One of the reasons round robin isn't as useful as it once was is > due to the rise of NAPI and interrupt coalescing, both of which will > tend to increase the reordering of packets at the receiver when the > packets are evenly striped. In the old days, it was one interrupt, one > packet. Now, it's one interrupt or NAPI poll, many packets. With the > packets striped across interfaces, this will tend to increase > reordering. E.g., > > slave 1 slave 2 slave 3 > Packet 1 P2 P3 > P4 P5 P6 > P7 P8 P9 > > and so on. A poll of slave 1 will get packets 1, 4 and 7 (and > probably several more), then a poll of slave 2 will get 2, 5 and 8, etc. > > I haven't done much testing with this lately, but I suspect this > behavior hasn't really changed. Raising the tcp_reordering sysctl value > can mitigate this somewhat (by making TCP more tolerant of this), but > that doesn't help non-TCP protocols. > > Barring evidence to the contrary, I presume that Oleg's system > delivers out of order at the receiver. That's not automatically a > reason to reject it, but this entire proposal is sufficiently complex to > configure that very explicit documentation will be necessary. 
Jay, here are some network stats from one of my iSCSI targets with an avg
load of 1.5-2.5 Gbit/sec (4 slaves in the etherchannel). Not perfect and
not very "clean" (there are more interfaces on the host than these 4):

[root@<somehost> ~]# netstat -st
IcmpMsg:
    InType0: 6
    InType3: 1872
    InType8: 60557
    InType11: 23
    OutType0: 60528
    OutType3: 1755
    OutType8: 6
Tcp:
    1298909 active connections openings
    61090 passive connection openings
    2374 failed connection attempts
    62781 connection resets received
    3 connections established
    1268233942 segments received
    1198020318 segments send out
    18939618 segments retransmited
    0 bad segments received.
    23643 resets sent
TcpExt:
    294935 TCP sockets finished time wait in fast timer
    472 time wait sockets recycled by time stamp
    819481 delayed acks sent
    295332 delayed acks further delayed because of locked socket
    Quick ack mode was activated 30616377 times
    3516920 packets directly queued to recvmsg prequeue.
    4353 packets directly received from backlog
    44873453 packets directly received from prequeue
    1442812750 packets header predicted
    1077442 packets header predicted and directly queued to user
    2123453975 acknowledgments not containing data received
    2375328274 predicted acknowledgments
    8462439 times recovered from packet loss due to fast retransmit
    Detected reordering 19203 times using reno fast retransmit
    Detected reordering 100 times using time stamp
    3429 congestion windows fully recovered
    11760 congestion windows partially recovered using Hoe heuristic
    398 congestion windows recovered after partial ack
    0 TCP data loss events
    3671 timeouts after reno fast retransmit
    6 timeouts in loss state
    18919118 fast retransmits
    11637 retransmits in slow start
    1756 other TCP timeouts
    TCPRenoRecoveryFail: 3187
    62779 connections reset due to early user close
IpExt:
    InBcastPkts: 512616

[root@<somehost> ~]# uptime
 00:35:49 up 42 days,  8:27,  1 user,  load average: 3.70, 3.80, 4.07
[root@<somehost> ~]# sysctl -a|grep tcp_reo
net.ipv4.tcp_reordering = 3

I will get back with "clean" results after I set up the test system
tomorrow. TcpExt stats from other hosts are similar.

>
> 	-J
>
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>

--
Best regards,
Oleg Ukhno

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 20:24               ` Jay Vosburgh
  2011-01-18 21:20                 ` Nicolas de Pesloüan
  2011-01-18 22:22                 ` Oleg V. Ukhno
@ 2011-01-19 16:13                 ` Oleg V. Ukhno
  2011-01-19 20:12                   ` Nicolas de Pesloüan
  2 siblings, 1 reply; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-19 16:13 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Nicolas de Pesloüan, John Fastabend, David S. Miller, netdev,
      Sébastien Barré, Christophe Paasch

On 01/18/2011 11:24 PM, Jay Vosburgh wrote:
> 	I haven't done much testing with this lately, but I suspect this
> behavior hasn't really changed.  Raising the tcp_reordering sysctl value
> can mitigate this somewhat (by making TCP more tolerant of this), but
> that doesn't help non-TCP protocols.
>
> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.
>
> 	-J
>
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>

Jay, I have run some tests with the patched 802.3ad bonding for now.

Test system configuration: 2 identical servers with 82576 (Gigabit ET2
Quad Port Server Adapter, Low Profile, PCI-E, igb), connected to one
switch (Cisco 2960) with all 4 ports; all ports on each host are
aggregated into a single etherchannel using 802.3ad (w/patch).
Kernel version: vanilla 2.6.32 (tcp_reordering - default setting);
igb version 2.3.4, parameters default.

Ran two tests:
1) unidirectional test using iperf
2) bidirectional test, iperf client running with 8 threads

One remark: decreasing the number of slaves gives higher active slave
utilization; for example, with 2 slaves the iperf test will consume
almost the full bandwidth available in both directions (test parameters
are the same, test time reduced to 150 sec):

[SUM]  0.0-150.3 sec  34640 MBytes  1933 Mbits/sec
[SUM]  0.0-150.5 sec  34875 MBytes  1944 Mbits/sec

For me (my use case) the risk of some bandwidth loss with 4 slaves is
acceptable, but my feeling is that building an aggregate link with more
than 4 slaves is inadequate. For 2 slaves this solution should work with
minimal overhead of any kind. The TCP reordering and retransmit numbers
are, in my opinion, acceptable for most use cases of such a bonding mode
that I can imagine.

What is your opinion on my idea with the patch? I will come back with
results for the VLAN tunneling case, if this is necessary (Nicolas,
shall I do that test - I think it will show similar performance
results?)
Below are test results(sorry for huge amount of text): Iperf results: Test 1: Receiver: [root@target2 ~]# iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300 ------------------------------------------------------------ Client connecting to 192.168.111.128, TCP port 9999 Binding to local address 192.168.111.129 TCP window size: 32.0 MByte (default) ------------------------------------------------------------ [ 3] local 192.168.111.129 port 9999 connected with 192.168.111.128 port 9999 [ ID] Interval Transfer Bandwidth [ 3] 0.0-300.0 sec 141643 MBytes 3961 Mbits/sec Sender: [root@target1 ~]# iperf -f m -s -B 192.168.111.128 -p 9999 -t 300 ------------------------------------------------------------ Server listening on TCP port 9999 Binding to local address 192.168.111.128 TCP window size: 32.0 MByte (default) ------------------------------------------------------------ [ 4] local 192.168.111.128 port 9999 connected with 192.168.111.129 port 9999 [ ID] Interval Transfer Bandwidth [ 4] 0.0-300.1 sec 141643 MBytes 3959 Mbits/sec ^C[root@target1 ~]# Test 2: former "sender" side: [SUM] 0.0-300.2 sec 111541 MBytes 3117 Mbits/sec [SUM] 0.0-300.4 sec 110515 MBytes 3086 Mbits/sec former "receiver" side: [SUM] 0.0-300.1 sec 110515 MBytes 3089 Mbits/sec [SUM] 0.0-300.3 sec 111541 MBytes 3116 Mbits/sec Netstat's: netstat -st (sender, before 1st test) [root@target1 ~]# netstat -st IcmpMsg: InType3: 5 InType8: 3 OutType0: 3 OutType3: 4 Tcp: 26 active connections openings 7 passive connection openings 5 failed connection attempts 1 connection resets received 4 connections established 349 segments received 330 segments send out 7 segments retransmited 0 bad segments received. 5 resets sent UdpLite: TcpExt: 10 TCP sockets finished time wait in slow timer 8 delayed acks sent 56 packets directly queued to recvmsg prequeue. 
40 packets directly received from backlog 317 packets directly received from prequeue 78 packets header predicted 36 packets header predicted and directly queued to user 41 acknowledgments not containing data received 134 predicted acknowledgments 0 TCP data loss events 4 other TCP timeouts 2 connections reset due to unexpected data TCPSackShiftFallback: 1 IpExt: InMcastPkts: 74 OutMcastPkts: 62 InOctets: 76001 OutOctets: 82234 InMcastOctets: 13074 OutMcastOctets: 10428 netstat -st (sender, after 1st test) [root@target1 ~]netstat -st IcmpMsg: InType3: 5 InType8: 7 OutType0: 7 OutType3: 4 Tcp: 71 active connections openings 15 passive connection openings 5 failed connection attempts 4 connection resets received 4 connections established 16674161 segments received 16674113 segments send out 7 segments retransmited 0 bad segments received. 5 resets sent UdpLite: TcpExt: 31 TCP sockets finished time wait in slow timer 13 delayed acks sent 42 delayed acks further delayed because of locked socket Quick ack mode was activated 297 times 239 packets directly queued to recvmsg prequeue. 
2388220516 packets directly received from backlog 595165 packets directly received from prequeue 16954 packets header predicted 445 packets header predicted and directly queued to user 129 acknowledgments not containing data received 322 predicted acknowledgments 0 TCP data loss events 4 other TCP timeouts 297 DSACKs sent for old packets 2 connections reset due to unexpected data TCPSackShiftFallback: 1 IpExt: InMcastPkts: 86 OutMcastPkts: 68 InBcastPkts: 2 InOctets: -930738047 OutOctets: 1321936884 InMcastOctets: 13434 OutMcastOctets: 10620 InBcastOctets: 483 netstat -st (receiver, before 1st test) [root@target2 ~]# netstat -st IcmpMsg: InType3: 5 InType8: 3 OutType0: 3 OutType3: 4 Tcp: 23 active connections openings 6 passive connection openings 3 failed connection attempts 1 connection resets received 3 connections established 309 segments received 264 segments send out 7 segments retransmited 0 bad segments received. 6 resets sent UdpLite: TcpExt: 10 TCP sockets finished time wait in slow timer 5 delayed acks sent 74 packets directly queued to recvmsg prequeue. 16 packets directly received from backlog 377 packets directly received from prequeue 62 packets header predicted 35 packets header predicted and directly queued to user 32 acknowledgments not containing data received 106 predicted acknowledgments 0 TCP data loss events 4 other TCP timeouts 1 connections reset due to early user close IpExt: InMcastPkts: 75 OutMcastPkts: 62 InOctets: 64952 OutOctets: 66396 InMcastOctets: 13428 OutMcastOctets: 10403 netstat -st (sender, after 1st test) [root@target2 ~]# netstat -st IcmpMsg: InType3: 5 InType8: 8 OutType0: 8 OutType3: 4 Tcp: 70 active connections openings 14 passive connection openings 3 failed connection attempts 4 connection resets received 4 connections established 16674253 segments received 16673801 segments send out 487 segments retransmited 0 bad segments received. 
6 resets sent UdpLite: TcpExt: 32 TCP sockets finished time wait in slow timer 15 delayed acks sent 228 packets directly queued to recvmsg prequeue. 24 packets directly received from backlog 1081 packets directly received from prequeue 146 packets header predicted 124 packets header predicted and directly queued to user 10913589 acknowledgments not containing data received 573 predicted acknowledgments 185 times recovered from packet loss due to SACK data Detected reordering 1 times using FACK Detected reordering 8 times using SACK Detected reordering 2 times using time stamp 1 congestion windows fully recovered 23 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 1 0 TCP data loss events 471 fast retransmits 9 forward retransmits 4 other TCP timeouts 297 DSACKs received 1 connections reset due to early user close TCPDSACKIgnoredOld: 258 TCPDSACKIgnoredNoUndo: 39 TCPSackShiftFallback: 35790574 IpExt: InMcastPkts: 89 OutMcastPkts: 69 InBcastPkts: 2 InOctets: 1321825004 OutOctets: -928982419 InMcastOctets: 13848 OutMcastOctets: 10627 InBcastOctets: 483 Second test: former "sender" side: [root@target1 ~]# netstat -st IcmpMsg: InType3: 5 InType8: 13 OutType0: 13 OutType3: 4 Tcp: 556 active connections openings 65 passive connection openings 391 failed connection attempts 15 connection resets received 4 connections established 52164640 segments received 52117884 segments send out 62522 segments retransmited 0 bad segments received. 33 resets sent UdpLite: TcpExt: 27 invalid SYN cookies received 74 TCP sockets finished time wait in slow timer 698540 packets rejects in established connections because of timestamp 51 delayed acks sent 487 delayed acks further delayed because of locked socket Quick ack mode was activated 18838 times 7 times the listen queue of a socket overflowed 7 SYNs to LISTEN sockets ignored 1632 packets directly queued to recvmsg prequeue. 
4137769996 packets directly received from backlog 5723253 packets directly received from prequeue 1365131 packets header predicted 136330 packets header predicted and directly queued to user 10241415 acknowledgments not containing data received 156502 predicted acknowledgments 10983 times recovered from packet loss due to SACK data Detected reordering 4 times using FACK Detected reordering 10095 times using SACK Detected reordering 138 times using time stamp 2107 congestion windows fully recovered 18612 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 80 5 congestion windows recovered after partial ack 0 TCP data loss events 52 timeouts after SACK recovery 2 timeouts in loss state 61206 fast retransmits 7 forward retransmits 984 retransmits in slow start 8 other TCP timeouts 258 sack retransmits failed 18838 DSACKs sent for old packets 274 DSACKs sent for out of order packets 14169 DSACKs received 34 DSACKs for out of order packets received 2 connections reset due to unexpected data TCPDSACKIgnoredOld: 8694 TCPDSACKIgnoredNoUndo: 5482 TCPSackShiftFallback: 18352494 IpExt: InMcastPkts: 104 OutMcastPkts: 77 InBcastPkts: 6 InOctets: -474718903 OutOctets: 1280495238 InMcastOctets: 13974 OutMcastOctets: 10908 InBcastOctets: 1449 former "receiver" side: [root@target2 ~]# netstat -st IcmpMsg: InType3: 5 InType8: 14 OutType0: 14 OutType3: 4 Tcp: 182 active connections openings 39 passive connection openings 4 failed connection attempts 12 connection resets received 4 connections established 52098089 segments received 52180386 segments send out 68994 segments retransmited 0 bad segments received. 
1070 resets sent UdpLite: TcpExt: 12 TCP sockets finished time wait in fast timer 102 TCP sockets finished time wait in slow timer 770084 packets rejects in established connections because of timestamp 37 delayed acks sent 261 delayed acks further delayed because of locked socket Quick ack mode was activated 14276 times 1466 packets directly queued to recvmsg prequeue. 1190723332 packets directly received from backlog 4781569 packets directly received from prequeue 776470 packets header predicted 97281 packets header predicted and directly queued to user 24979561 acknowledgments not containing data received 484206 predicted acknowledgments 11461 times recovered from packet loss due to SACK data Detected reordering 15 times using FACK Detected reordering 15520 times using SACK Detected reordering 208 times using time stamp 2046 congestion windows fully recovered 18402 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 82 13 congestion windows recovered after partial ack 0 TCP data loss events 49 timeouts after SACK recovery 1 timeouts in loss state 62078 fast retransmits 5340 forward retransmits 1181 retransmits in slow start 20 other TCP timeouts 322 sack retransmits failed 14276 DSACKs sent for old packets 36 DSACKs sent for out of order packets 17940 DSACKs received 254 DSACKs for out of order packets received 4 connections reset due to early user close TCPDSACKIgnoredOld: 12703 TCPDSACKIgnoredNoUndo: 5251 TCPSackShiftFallback: 57141117 IpExt: InMcastPkts: 104 OutMcastPkts: 76 InBcastPkts: 6 InOctets: 902997645 OutOctets: -82887048 InMcastOctets: 14296 OutMcastOctets: 10851 InBcastOctets: 1449 [root@target2 ~]# -- Best regards, Oleg Ukhno. ITO Team Lead, Yandex LLC. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-19 16:13 ` Oleg V. Ukhno @ 2011-01-19 20:12 ` Nicolas de Pesloüan 2011-01-21 13:55 ` Oleg V. Ukhno 0 siblings, 1 reply; 32+ messages in thread From: Nicolas de Pesloüan @ 2011-01-19 20:12 UTC (permalink / raw) To: Oleg V. Ukhno Cc: Jay Vosburgh, John Fastabend, David S. Miller, netdev, Sébastien Barré, Christophe Paasch Le 19/01/2011 17:13, Oleg V. Ukhno a écrit : > On 01/18/2011 11:24 PM, Jay Vosburgh wrote: [snip] >> I haven't done much testing with this lately, but I suspect this >> behavior hasn't really changed. Raising the tcp_reordering sysctl value >> can mitigate this somewhat (by making TCP more tolerant of this), but >> that doesn't help non-TCP protocols. >> >> Barring evidence to the contrary, I presume that Oleg's system >> delivers out of order at the receiver. That's not automatically a >> reason to reject it, but this entire proposal is sufficiently complex to >> configure that very explicit documentation will be necessary. >> >> -J >> >> --- >> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com >> > > Jay, [snip] > > What is your opinion on my idea with patch? > > I will come back with results for VLAN tunneling case, if this is > necessary (Nicolas, shall I do that test - I think it will show similar > results for performance?) If you have time for that, then yes, please, do the same test using balance-rr+vlan to segregate path. With those results, we would have the opportunity to enhance the documentation with some well tested cases of TCP load balancing on a LAN, not limited to 802.3ad automatic setup. Both setups make sense, and assuming the results would be similar is probably true, but not reliable enough to assert it into the documentation. Thanks, Nicolas. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-19 20:12 ` Nicolas de Pesloüan @ 2011-01-21 13:55 ` Oleg V. Ukhno 2011-01-22 12:48 ` Nicolas de Pesloüan 2011-01-29 2:28 ` Jay Vosburgh 0 siblings, 2 replies; 32+ messages in thread From: Oleg V. Ukhno @ 2011-01-21 13:55 UTC (permalink / raw) To: Nicolas de Pesloüan; +Cc: Jay Vosburgh, John Fastabend, netdev On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote: > If you have time for that, then yes, please, do the same test using > balance-rr+vlan to segregate path. With those results, we would have > the opportunity to enhance the documentation with some well tested cases > of TCP load balancing on a LAN, not limited to 802.3ad automatic setup. > Both setups make sense, and assuming the results would be similar is > probably true, but not reliable enough to assert it into the documentation. > > Thanks, > > Nicolas. > Nicolas, I've run similar tests for the VLAN tunneling scenario. Results are identical, as I expected. The only significant difference is link failure handling: 802.3ad mode allows almost painless load redistribution, while balance-rr causes packet loss. 
The only question for me now is if my patch could be applied to the upstream version - fixing issues with adaptation to net-next code isn't a problem, if nobody objects. There were 2 tests: 1) unidirectional test 2) bidirectional test Below are results: Iperf results: test 1: iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300 ------------------------------------------------------------ Client connecting to 192.168.111.128, TCP port 9999 Binding to local address 192.168.111.129 TCP window size: 32.0 MByte (default) ------------------------------------------------------------ [ 3] local 192.168.111.129 port 9999 connected with 192.168.111.128 port 9999 [ ID] Interval Transfer Bandwidth [ 3] 0.0-300.0 sec 141637 MBytes 3960 Mbits/sec test 2: iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300 --dualtest -P 4 ------------------------------------------------------------ Server listening on TCP port 9999 Binding to local address 192.168.111.129 TCP window size: 32.0 MByte (default) ------------------------------------------------------------ ... [SUM] 0.0-300.2 sec 111334 MBytes 3111 Mbits/sec [SUM] 0.0-300.4 sec 109582 MBytes 3060 Mbits/sec TCP stats: receiver side, before test 1: [root@target1 ~]# netstat -st IcmpMsg: InType0: 4 InType3: 6 InType8: 2 OutType0: 2 OutType3: 6 OutType8: 4 Tcp: 4 active connections openings 2 passive connection openings 3 failed connection attempts 0 connection resets received 3 connections established 10252 segments received 29766 segments send out 2 segments retransmited 0 bad segments received. 0 resets sent UdpLite: TcpExt: 3 delayed acks sent 613 packets directly queued to recvmsg prequeue. 
16 packets directly received from backlog 1760 packets directly received from prequeue 428 packets header predicted 10 packets header predicted and directly queued to user 9295 acknowledgments not containing data received 265 predicted acknowledgments 0 TCP data loss events 1 other TCP timeouts TCPSackMerged: 1 TCPSackShiftFallback: 1 IpExt: InMcastPkts: 92 OutMcastPkts: 64 InBcastPkts: 2 InOctets: 1089217 OutOctets: 265005791 InMcastOctets: 16294 OutMcastOctets: 10364 InBcastOctets: 483 receiver side , after test 1: [root@target1 ~]netstat -st IcmpMsg: InType0: 17 InType3: 6 InType8: 9 OutType0: 9 OutType3: 6 OutType8: 19 Tcp: 84 active connections openings 14 passive connection openings 6 failed connection attempts 4 connection resets received 4 connections established 16684784 segments received 16704650 segments send out 22 segments retransmited 0 bad segments received. 6 resets sent UdpLite: TcpExt: 39 TCP sockets finished time wait in slow timer 23 delayed acks sent 83 delayed acks further delayed because of locked socket Quick ack mode was activated 225 times 1019 packets directly queued to recvmsg prequeue. 
3235352384 packets directly received from backlog 483600 packets directly received from prequeue 86065 packets header predicted 4855 packets header predicted and directly queued to user 10369 acknowledgments not containing data received 928 predicted acknowledgments 0 TCP data loss events 2 retransmits in slow start 6 other TCP timeouts 225 DSACKs sent for old packets 1 connections reset due to unexpected data TCPSackMerged: 1 TCPSackShiftFallback: 3 IpExt: InMcastPkts: 108 OutMcastPkts: 72 InBcastPkts: 4 InOctets: -936746758 OutOctets: 1556837236 InMcastOctets: 16774 OutMcastOctets: 10620 InBcastOctets: 966 receiver side, after test 2 [root@target1 ~]netstat -st IcmpMsg: InType0: 17 InType3: 6 InType8: 12 OutType0: 12 OutType3: 6 OutType8: 19 Tcp: 144 active connections openings 25 passive connection openings 29 failed connection attempts 7 connection resets received 4 connections established 44349148 segments received 44401154 segments send out 58434 segments retransmited 0 bad segments received. 6 resets sent UdpLite: TcpExt: 58 TCP sockets finished time wait in slow timer 735072 packets rejects in established connections because of timestamp 34 delayed acks sent 359 delayed acks further delayed because of locked socket Quick ack mode was activated 14800 times 2112 packets directly queued to recvmsg prequeue. 
3753925448 packets directly received from backlog 4377976 packets directly received from prequeue 847653 packets header predicted 105696 packets header predicted and directly queued to user 8804473 acknowledgments not containing data received 154775 predicted acknowledgments 10465 times recovered from packet loss due to SACK data Detected reordering 1 times using FACK Detected reordering 11185 times using SACK Detected reordering 182 times using time stamp 2116 congestion windows fully recovered 18951 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 58 8 congestion windows recovered after partial ack 0 TCP data loss events 53 timeouts after SACK recovery 1 timeouts in loss state 57287 fast retransmits 12 forward retransmits 793 retransmits in slow start 10 other TCP timeouts 263 sack retransmits failed 14800 DSACKs sent for old packets 31 DSACKs sent for out of order packets 14289 DSACKs received 43 DSACKs for out of order packets received 1 connections reset due to unexpected data TCPDSACKIgnoredOld: 8615 TCPDSACKIgnoredNoUndo: 5683 TCPSackMerged: 1 TCPSackShiftFallback: 15015212 IpExt: InMcastPkts: 116 OutMcastPkts: 76 InBcastPkts: 4 InOctets: 1012355682 OutOctets: -1540562156 InMcastOctets: 17014 OutMcastOctets: 10748 InBcastOctets: 966 sender side, before test 1: [root@target2 ~]# netstat -st IcmpMsg: InType3: 4 InType8: 32 OutType0: 32 OutType3: 4 Tcp: 1 active connections openings 2 passive connection openings 0 failed connection attempts 0 connection resets received 3 connections established 30268 segments received 10217 segments send out 0 segments retransmited 0 bad segments received. 3 resets sent UdpLite: TcpExt: 7 delayed acks sent 6332 packets directly queued to recvmsg prequeue. 
8 packets directly received from backlog 46104 packets directly received from prequeue 27935 packets header predicted 11 packets header predicted and directly queued to user 455 acknowledgments not containing data received 119 predicted acknowledgments 0 TCP data loss events TCPSackShiftFallback: 1 IpExt: InMcastPkts: 87 OutMcastPkts: 54 InBcastPkts: 2 InOctets: 265039007 OutOctets: 1083024 InMcastOctets: 16444 OutMcastOctets: 9893 InBcastOctets: 483 sender side , after test 1: [root@target2 ~]# netstat -st IcmpMsg: InType3: 4 InType8: 53 OutType0: 53 OutType3: 4 Tcp: 69 active connections openings 12 passive connection openings 2 failed connection attempts 4 connection resets received 4 connections established 16704819 segments received 16684841 segments send out 401 segments retransmited 0 bad segments received. 10 resets sent UdpLite: TcpExt: 31 TCP sockets finished time wait in slow timer 25 delayed acks sent 6515 packets directly queued to recvmsg prequeue. 24 packets directly received from backlog 46988 packets directly received from prequeue 27974 packets header predicted 115 packets header predicted and directly queued to user 10259331 acknowledgments not containing data received 12483 predicted acknowledgments 166 times recovered from packet loss due to SACK data Detected reordering 1 times using FACK Detected reordering 7 times using SACK Detected reordering 1 times using time stamp 1 congestion windows fully recovered 41 congestion windows partially recovered using Hoe heuristic 0 TCP data loss events 386 fast retransmits 5 forward retransmits 3 other TCP timeouts 1 times receiver scheduled too late for direct processing 225 DSACKs received 1 connections reset due to unexpected data TCPDSACKIgnoredOld: 167 TCPDSACKIgnoredNoUndo: 58 TCPSackShiftFallback: 30925668 IpExt: InMcastPkts: 103 OutMcastPkts: 62 InBcastPkts: 4 InOctets: 1556368288 OutOctets: -934790015 InMcastOctets: 16924 OutMcastOctets: 10149 InBcastOctets: 966 sender side, after test 2: 
[root@target2 ~]# netstat -st IcmpMsg: InType3: 4 InType8: 56 OutType0: 56 OutType3: 4 Tcp: 117 active connections openings 25 passive connection openings 2 failed connection attempts 7 connection resets received 4 connections established 44383169 segments received 44367187 segments send out 59660 segments retransmited 0 bad segments received. 34 resets sent UdpLite: TcpExt: 2 TCP sockets finished time wait in fast timer 57 TCP sockets finished time wait in slow timer 717082 packets rejects in established connections because of timestamp 46 delayed acks sent 202 delayed acks further delayed because of locked socket Quick ack mode was activated 14356 times 7432 packets directly queued to recvmsg prequeue. 135038632 packets directly received from backlog 3633432 packets directly received from prequeue 783534 packets header predicted 94671 packets header predicted and directly queued to user 20034470 acknowledgments not containing data received 177885 predicted acknowledgments 10851 times recovered from packet loss due to SACK data Detected reordering 6 times using FACK Detected reordering 9217 times using SACK Detected reordering 111 times using time stamp 2125 congestion windows fully recovered 19325 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 71 7 congestion windows recovered after partial ack 0 TCP data loss events 52 timeouts after SACK recovery 58562 fast retransmits 67 forward retransmits 736 retransmits in slow start 8 other TCP timeouts 226 sack retransmits failed 1 times receiver scheduled too late for direct processing 14356 DSACKs sent for old packets 44 DSACKs sent for out of order packets 14679 DSACKs received 31 DSACKs for out of order packets received 1 connections reset due to unexpected data TCPDSACKIgnoredOld: 8899 TCPDSACKIgnoredNoUndo: 5791 TCPSackShiftFallback: 47227517 IpExt: InMcastPkts: 109 OutMcastPkts: 65 InBcastPkts: 4 InOctets: -1885181292 OutOctets: 1366995261 InMcastOctets: 17104 OutMcastOctets: 10245 
InBcastOctets: 966 -- Best regards, Oleg Ukhno, ITO Team lead Yandex LLC. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-21 13:55 ` Oleg V. Ukhno @ 2011-01-22 12:48 ` Nicolas de Pesloüan 2011-01-24 19:32 ` Oleg V. Ukhno 2011-01-29 2:28 ` Jay Vosburgh 1 sibling, 1 reply; 32+ messages in thread From: Nicolas de Pesloüan @ 2011-01-22 12:48 UTC (permalink / raw) To: Oleg V. Ukhno; +Cc: Jay Vosburgh, John Fastabend, netdev Le 21/01/2011 14:55, Oleg V. Ukhno a écrit : > On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote: > >> If you have time for that, then yes, please, do the same test using >> balance-rr+vlan to segregate path. With those results, we would have >> the opportunity to enhance the documentation with some well tested cases >> of TCP load balancing on a LAN, not limited to 802.3ad automatic setup. >> Both setups make sense, and assuming the results would be similar is >> probably true, but not reliable enough to assert it into the >> documentation. >> >> Thanks, >> >> Nicolas. >> > Nicolas, > I've run similar tests for the VLAN tunneling scenario. Results are > identical, as I expected. The only significant difference is link failure > handling: 802.3ad mode allows almost painless load redistribution, > while balance-rr causes packet loss. > > Oleg, > > Thanks for doing the tests. > > What link failure mode did you use for those tests ? miimon or arp > monitoring ? > > Nicolas. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-22 12:48 ` Nicolas de Pesloüan @ 2011-01-24 19:32 ` Oleg V. Ukhno 0 siblings, 0 replies; 32+ messages in thread From: Oleg V. Ukhno @ 2011-01-24 19:32 UTC (permalink / raw) To: Nicolas de Pesloüan; +Cc: Jay Vosburgh, John Fastabend, netdev On 01/22/2011 03:48 PM, Nicolas de Pesloüan wrote: > Le 21/01/2011 14:55, Oleg V. Ukhno a écrit : >> On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote: >>> >> Nicolas, >> I've run similar tests for the VLAN tunneling scenario. Results are >> identical, as I expected. The only significant difference is link failure >> handling: 802.3ad mode allows almost painless load redistribution, >> while balance-rr causes packet loss. > > Oleg, > > Thanks for doing the tests. > > What link failure mode did you use for those tests ? miimon or arp > monitoring ? > > Nicolas. > > Nicolas, as for the tests: MII link monitoring kills the whole transfer; with ARP monitoring it still works, but there is asymmetric load striping on the bond slaves (one slave is overloaded, the other two run at about 50-60% bandwidth utilization). Just as a summary: balance-rr behaves like patched 802.3ad when using ARP monitoring mode, but there is quite asymmetric load striping and quite a monstrous configuration on the switch and server sides. -- Best regards, Oleg Ukhno ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-21 13:55 ` Oleg V. Ukhno 2011-01-22 12:48 ` Nicolas de Pesloüan @ 2011-01-29 2:28 ` Jay Vosburgh 2011-02-01 16:25 ` Oleg V. Ukhno 2011-02-02 9:54 ` Nicolas de Pesloüan 1 sibling, 2 replies; 32+ messages in thread From: Jay Vosburgh @ 2011-01-29 2:28 UTC (permalink / raw) To: Oleg V. Ukhno Cc: Nicolas de Pesloüan, John Fastabend, netdev Oleg V. Ukhno <olegu@yandex-team.ru> wrote: >On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote: > >> If you have time for that, then yes, please, do the same test using >> balance-rr+vlan to segregate path. With those results, we would have >> the opportunity to enhance the documentation with some well tested cases >> of TCP load balancing on a LAN, not limited to 802.3ad automatic setup. >> Both setups make sense, and assuming the results would be similar is >> probably true, but not reliable enough to assert it into the documentation. >> >> Thanks, >> >> Nicolas. >> >Nicolas, >I've run similar tests for the VLAN tunneling scenario. Results are identical, >as I expected. The only significant difference is link failure >handling: 802.3ad mode allows almost painless load redistribution, >while balance-rr causes packet loss. >The only question for me now is if my patch could be applied to the upstream >version - fixing issues with adaptation to net-next code isn't a >problem, if nobody objects. I've thought about this whole thing, and here's what I view as the proper way to do this. In my mind, this proposal is two separate pieces: First, a piece to make round-robin a selectable hash for xmit_hash_policy. The documentation for this should follow the pattern of the "layer3+4" hash policy, in particular noting that the new algorithm violates the 802.3ad standard in exciting ways, will result in out of order delivery, and that other 802.3ad implementations may or may not tolerate this. 
Second, a piece to make certain transmitted packets use the source MAC of the sending slave instead of the bond's MAC. This should be a separate option from the round-robin hash policy. I'd call it something like "mac_select" with two values: "default" (what we do now) and "slave_src_mac" to use the slave's real MAC for certain types of traffic (I'm open to better names; that's just what I came up with while writing this). I believe that "certain types" means "everything but ARP," but might be "only IP and IPv6." Structuring the option in this manner leaves the option open for additional selections in the future, which a simple "on/off" option wouldn't. This option should probably only affect a subset of modes; I'm thinking anything except balance-tlb or -alb (because they do funky MAC things already) and active-backup (it doesn't balance traffic, and already uses fail_over_mac to control this). I think this option also needs a whole new section down in the bottom explaining how to exploit it (the "pick special MACs on slaves to trick switch hash" business). Comments? -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 32+ messages in thread
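The two pieces Jay describes can be illustrated with a small user-space C sketch (the structure and function names below are hypothetical, not the actual bonding driver code): a round-robin "hash" that ignores packet contents entirely, and a source-MAC rewrite applied to the outgoing frame.

```c
#include <stdint.h>
#include <string.h>

#define NUM_SLAVES 4

/* Hypothetical per-slave state: each physical NIC keeps its own MAC. */
struct slave {
    uint8_t mac[6];
};

static struct slave slaves[NUM_SLAVES];
static unsigned int rr_counter;  /* per-bond transmit counter */

/* Piece 1: a "round-robin hash" returns the next slave index regardless
 * of the packet.  This is what stripes a single TCP session across all
 * slaves - and also what violates 802.3ad's per-conversation ordering,
 * so out-of-order delivery must be expected. */
static unsigned int rr_xmit_hash(void)
{
    return rr_counter++ % NUM_SLAVES;
}

/* Piece 2: rewrite the Ethernet source MAC (bytes 6..11 of the frame)
 * to the chosen slave's own MAC, so a switch configured for src-mac
 * hashing spreads the traffic on the receive side as well. */
static void set_slave_src_mac(uint8_t *frame, const struct slave *s)
{
    memcpy(frame + 6, s->mac, 6);
}
```

Neither step inspects the payload, which is exactly why this balances one TCP session per-packet where the existing layer2/layer3+4 hashes cannot.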
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-29 2:28 ` Jay Vosburgh @ 2011-02-01 16:25 ` Oleg V. Ukhno 2011-02-02 17:30 ` Jay Vosburgh 2011-02-02 9:54 ` Nicolas de Pesloüan 1 sibling, 1 reply; 32+ messages in thread From: Oleg V. Ukhno @ 2011-02-01 16:25 UTC (permalink / raw) To: Jay Vosburgh; +Cc: Nicolas de Pesloüan, John Fastabend, netdev On 01/29/2011 05:28 AM, Jay Vosburgh wrote: > Oleg V. Ukhno<olegu@yandex-team.ru> wrote: > > I've thought about this whole thing, and here's what I view as > the proper way to do this. > > In my mind, this proposal is two separate pieces: > > First, a piece to make round-robin a selectable hash for > xmit_hash_policy. The documentation for this should follow the pattern > of the "layer3+4" hash policy, in particular noting that the new > algorithm violates the 802.3ad standard in exciting ways, will result in > out of order delivery, and that other 802.3ad implementations may or may > not tolerate this. > > Second, a piece to make certain transmitted packets use the > source MAC of the sending slave instead of the bond's MAC. This should > be a separate option from the round-robin hash policy. I'd call it > something like "mac_select" with two values: "default" (what we do now) > and "slave_src_mac" to use the slave's real MAC for certain types of > traffic (I'm open to better names; that's just what I came up with while > writing this). I believe that "certain types" means "everything but > ARP," but might be "only IP and IPv6." Structuring the option in this > manner leaves the option open for additional selections in the future, > which a simple "on/off" option wouldn't. This option should probably > only affect a subset of modes; I'm thinking anything except balance-tlb > or -alb (because they do funky MAC things already) and active-backup (it > doesn't balance traffic, and already uses fail_over_mac to control > this). 
I think this option also needs a whole new section down in the > bottom explaining how to exploit it (the "pick special MACs on slaves to > trick switch hash" business). > > Comments? > > -J > Jay, As for me splitting my initial proposal into two logically different pieces is ok, this will provide more flexible configuration. Do I understand correctly, that after I rewrite the patch in split form, as you described above, and enhance the documentation, it can be applied to the kernel? Then what should I do: rewrite the patch and resubmit it as a new one? Oleg. > --- > -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com > -- Best regards, Oleg Ukhno. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-02-01 16:25 ` Oleg V. Ukhno @ 2011-02-02 17:30 ` Jay Vosburgh 0 siblings, 0 replies; 32+ messages in thread From: Jay Vosburgh @ 2011-02-02 17:30 UTC (permalink / raw) To: Oleg V. Ukhno; +Cc: Nicolas de Pesloüan, John Fastabend, netdev Oleg V. Ukhno <olegu@yandex-team.ru> wrote: >On 01/29/2011 05:28 AM, Jay Vosburgh wrote: >> Oleg V. Ukhno<olegu@yandex-team.ru> wrote: >> >> I've thought about this whole thing, and here's what I view as >> the proper way to do this. >> >> In my mind, this proposal is two separate pieces: >> >> First, a piece to make round-robin a selectable hash for >> xmit_hash_policy. The documentation for this should follow the pattern >> of the "layer3+4" hash policy, in particular noting that the new >> algorithm violates the 802.3ad standard in exciting ways, will result in >> out of order delivery, and that other 802.3ad implementations may or may >> not tolerate this. >> >> Second, a piece to make certain transmitted packets use the >> source MAC of the sending slave instead of the bond's MAC. This should >> be a separate option from the round-robin hash policy. I'd call it >> something like "mac_select" with two values: "default" (what we do now) >> and "slave_src_mac" to use the slave's real MAC for certain types of >> traffic (I'm open to better names; that's just what I came up with while >> writing this). I believe that "certain types" means "everything but >> ARP," but might be "only IP and IPv6." Structuring the option in this >> manner leaves the option open for additional selections in the future, >> which a simple "on/off" option wouldn't. This option should probably >> only affect a subset of modes; I'm thinking anything except balance-tlb >> or -alb (because they do funky MAC things already) and active-backup (it >> doesn't balance traffic, and already uses fail_over_mac to control >> this). 
I think this option also needs a whole new section down in the >> bottom explaining how to exploit it (the "pick special MACs on slaves to >> trick switch hash" business). >> >> Comments? >> >> -J >> >Jay, >As for me splitting my initial proposal into two logically different pieces >is ok, this will provide more flexible configuration. >Do I understand correctly, that after I rewrite the patch in split form, >as you described above, and enhance the documentation, it can be >applied to the kernel? Yes, although the patches may have to go through a few revisions. >Then what should I do: rewrite the patch and resubmit it as a new one? Yes. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-01-29 2:28 ` Jay Vosburgh 2011-02-01 16:25 ` Oleg V. Ukhno @ 2011-02-02 9:54 ` Nicolas de Pesloüan 2011-02-02 17:57 ` Jay Vosburgh 1 sibling, 1 reply; 32+ messages in thread From: Nicolas de Pesloüan @ 2011-02-02 9:54 UTC (permalink / raw) To: Jay Vosburgh; +Cc: Oleg V. Ukhno, John Fastabend, netdev Le 29/01/2011 03:28, Jay Vosburgh a écrit : > I've thought about this whole thing, and here's what I view as > the proper way to do this. > > In my mind, this proposal is two separate pieces: > > First, a piece to make round-robin a selectable hash for > xmit_hash_policy. The documentation for this should follow the pattern > of the "layer3+4" hash policy, in particular noting that the new > algorithm violates the 802.3ad standard in exciting ways, will result in > out of order delivery, and that other 802.3ad implementations may or may > not tolerate this. > > Second, a piece to make certain transmitted packets use the > source MAC of the sending slave instead of the bond's MAC. This should > be a separate option from the round-robin hash policy. I'd call it > something like "mac_select" with two values: "default" (what we do now) > and "slave_src_mac" to use the slave's real MAC for certain types of > traffic (I'm open to better names; that's just what I came up with while > writing this). I believe that "certain types" means "everything but > ARP," but might be "only IP and IPv6." Structuring the option in this > manner leaves the option open for additional selections in the future, > which a simple "on/off" option wouldn't. This option should probably > only affect a subset of modes; I'm thinking anything except balance-tlb > or -alb (because they do funky MAC things already) and active-backup (it > doesn't balance traffic, and already uses fail_over_mac to control > this). 
I think this option also needs a whole new section down in the > bottom explaining how to exploit it (the "pick special MACs on slaves to > trick switch hash" business). > > Comments? Looks really sensible to me. I just propose the following option and option values : "src_mac_select" (instead of mac_select), with "default" and "slave_mac" (instead of slave_src_mac) as possible values. In the future, we might need a "dst_mac_select" option... :-) Also, are there any risks that this kind of session load-balancing won't properly cooperate with multiqueue (as explained in "Overriding Configuration for Special Cases" in Documentation/networking/bonding.txt)? I think it is important to ensure we keep the ability to fine tune the egress path selection. Nicolas. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing 2011-02-02 9:54 ` Nicolas de Pesloüan @ 2011-02-02 17:57 ` Jay Vosburgh 2011-02-03 14:54 ` Oleg V. Ukhno 0 siblings, 1 reply; 32+ messages in thread From: Jay Vosburgh @ 2011-02-02 17:57 UTC (permalink / raw) To: Nicolas de Pesloüan Cc: Oleg V. Ukhno, John Fastabend, netdev Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote: >Le 29/01/2011 03:28, Jay Vosburgh a écrit : >> I've thought about this whole thing, and here's what I view as >> the proper way to do this. >> >> In my mind, this proposal is two separate pieces: >> >> First, a piece to make round-robin a selectable hash for >> xmit_hash_policy. The documentation for this should follow the pattern >> of the "layer3+4" hash policy, in particular noting that the new >> algorithm violates the 802.3ad standard in exciting ways, will result in >> out of order delivery, and that other 802.3ad implementations may or may >> not tolerate this. >> >> Second, a piece to make certain transmitted packets use the >> source MAC of the sending slave instead of the bond's MAC. This should >> be a separate option from the round-robin hash policy. I'd call it >> something like "mac_select" with two values: "default" (what we do now) >> and "slave_src_mac" to use the slave's real MAC for certain types of >> traffic (I'm open to better names; that's just what I came up with while >> writing this). I believe that "certain types" means "everything but >> ARP," but might be "only IP and IPv6." Structuring the option in this >> manner leaves the option open for additional selections in the future, >> which a simple "on/off" option wouldn't. This option should probably >> only affect a subset of modes; I'm thinking anything except balance-tlb >> or -alb (because they do funky MAC things already) and active-backup (it >> doesn't balance traffic, and already uses fail_over_mac to control >> this). 
I think this option also needs a whole new section down in the >> bottom explaining how to exploit it (the "pick special MACs on slaves to >> trick switch hash" business). >> >> Comments? > >Looks really sensible to me. > >I just propose the following option and option values : "src_mac_select" >(instead of mac_select), with "default" and "slave_mac" (instead of >slave_src_mac) as possible values. In the future, we might need a >"dst_mac_select" option... :-) I originally thought of using the nomenclature you propose; my thinking for doing it the way I ended up with is to minimize the number of tunable knobs that bonding has (so, the dst_mac would be a setting for mac_select). That works as long as there aren't a lot of settings that would be turned on simultaneously, since each combination would have to be a separate option, or the options parser would have to handle multiple settings (e.g., mac_select=src+dst or something like that). Anyway, after thinking about it some more, in the long run it's probably safer to separate these two, so, Oleg, use the above naming ("src_mac_select" with "default" and "slave_mac"). >Also, are there any risks that this kind of session load-balancing won't >properly cooperate with multiqueue (as explained in "Overriding >Configuration for Special Cases" in Documentation/networking/bonding.txt)? >I think it is important to ensure we keep the ability to fine tune the >egress path selection I think the logic for the mac_select (or src_mac_select or whatever) just has to be done last, after the slave selection is done by the multiqueue stuff. That's probably a good tidbit to put in the documentation as well. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-02-02 17:57 ` Jay Vosburgh
@ 2011-02-03 14:54 ` Oleg V. Ukhno
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-02-03 14:54 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Nicolas de Pesloüan, John Fastabend, netdev

On 02/02/2011 08:57 PM, Jay Vosburgh wrote:
> Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:
>> I just propose the following option and option values: "src_mac_select"
>> (instead of mac_select), with "default" and "slave_mac" (instead of
>> slave_src_mac) as possible values. In the future, we might need a
>> "dst_mac_select" option... :-)
>
> I originally thought of using the nomenclature you propose; my
> thinking for doing it the way I ended up with is to minimize the number
> of tunable knobs that bonding has (so, the dst_mac would be a setting
> for mac_select). That works as long as there aren't a lot of settings
> that would be turned on simultaneously, since each combination would
> have to be a separate option, or the options parser would have to handle
> multiple settings (e.g., mac_select=src+dst or something like that).
>
> Anyway, after thinking about it some more, in the long run it's
> probably safer to separate these two, so, Oleg, use the above naming
> ("src_mac_select" with "default" and "slave_mac").
>
>> Also, are there any risks that this kind of session load-balancing won't
>> properly cooperate with multiqueue (as explained in "Overriding
>> Configuration for Special Cases" in Documentation/networking/bonding.txt)?
>> I think it is important to ensure we keep the ability to fine tune the
>> egress path selection
>
> I think the logic for the mac_select (or src_mac_select or
> whatever) just has to be done last, after the slave selection is done by
> the multiqueue stuff. That's probably a good tidbit to put in the
> documentation as well.
>
> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

Thanks everyone for the comments; I'll resubmit the modified patch after it is ready and tested, in about a week or two, I think.

Oleg

--
Best regards,
Oleg Ukhno
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 15:28 ` Oleg V. Ukhno
  2011-01-18 16:24 ` Nicolas de Pesloüan
@ 2011-01-18 17:56 ` Kirill Smelkov
  1 sibling, 0 replies; 32+ messages in thread
From: Kirill Smelkov @ 2011-01-18 17:56 UTC (permalink / raw)
To: Oleg V. Ukhno
Cc: Nicolas de Pesloüan, John Fastabend, Jay Vosburgh, David S. Miller, netdev, Sébastien Barré, Christophe Paasch

On Tue, Jan 18, 2011 at 06:28:48PM +0300, Oleg V. Ukhno wrote:
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> On 18/01/2011 13:40, Oleg V. Ukhno wrote:

[...]

>> Oleg, would you mind trying the above "two VLAN" topology with
>> mode=balance-rr and reporting any results? For high-availability purposes,
>> it's obviously necessary to set up those VLANs on distinct switches.
> I'll do it, but it will take some time to set up the test environment,
> several days maybe.
> You mean the following topology:
>
>            switch 1
>           /        \
>     host A          host B
>           \        /
>            switch 2
>

FYI: I'm in the process of developing a new redundancy mode for bonding, and while at it, I found the following script, which may be useful for you too, so that bonding testing can be done entirely on one host:

http://repo.or.cz/w/linux-2.6/kirr.git/blob/refs/heads/x/etherdup:/tools/bonding/mk-tap-loops.sh

Sorry for maybe being off-topic,
Kirill
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 12:40 ` Oleg V. Ukhno
  2011-01-18 14:54 ` Nicolas de Pesloüan
@ 2011-01-18 16:41 ` John Fastabend
  2011-01-18 17:21 ` Oleg V. Ukhno
  1 sibling, 1 reply; 32+ messages in thread
From: John Fastabend @ 2011-01-18 16:41 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: Jay Vosburgh, netdev, David S. Miller

On 1/18/2011 4:40 AM, Oleg V. Ukhno wrote:
> On 01/18/2011 06:16 AM, John Fastabend wrote:
>> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>>> Can somebody (John?) more knowledgeable than I about dm-multipath
>>> comment on the above?
>>
>> Here I'll give it a go.
>>
>> I don't think detecting L2 link failure this way is very robust. If there
>> is a failure farther away than your immediate link, you're going to break
>> completely. Your bonding hash will continue to round-robin the iscsi
>> packets and half of them will get dropped on the floor. dm-multipath handles
>> this reasonably gracefully. Also, in this bonding environment you seem to
>> be very sensitive to RTT times on the network. Maybe not bad outright, but
>> I wouldn't consider this robust either.
>
> John, I agree - this bonding mode should be used in a quite limited number
> of situations, but as for a failure farther away than the immediate link -
> every bonding mode will suffer the same problems in this case - bonding
> detects only L2 failures; the rest is done by upper-layer mechanisms. And
> almost all bonding modes depend on equal RTT on slaves. Also, there is
> already a similar load-balancing mode - balance-alb - what I did is
> approximately the same, but for 802.3ad bonding mode, and it provides
> "better" (more equal and unconditional layer-2) load striping for tx
> and _rx_.
>
> I think I shouldn't mention the particular use case of this patch - when
> I wrote it I tried to make a more general solution - my goal was "make
> equal or near-equal load striping for TX and (most importantly) RX
> within a single ethernet (layer 2) domain for TCP transmission". This
> bonding mode just introduces the ability to stripe rx and tx load for a
> single TCP connection between hosts inside one ethernet segment.
> iSCSI is just an example. It is possible to stripe load between a
> linux-based router and a linux-based web/ftp/etc server as well in the
> same manner. I think this feature will be useful in a number of
> network configurations.
>
> Also, I looked into the net-next code - it seems to me that it can be
> implemented (adapted to the net-next bonding code) without any difficulties,
> and the hashing function change is no problem here.
>
> What I've written below is just my personal experience and opinion after
> 5 years of using Oracle + iSCSI + mpath (later - patched bonding).
>
> From my personal experience I can just say that most iSCSI failures are
> caused by link failures, and also I would never send any significant
> iSCSI traffic via a router - the router would be a bottleneck in this case.
> So, in my case iSCSI traffic flows within one ethernet domain, and in
> case of a link failure the bonding driver simply fails one slave (in the
> case of bonding), instead of checking and failing hundreds of paths (in the
> case of mpath), and the first case is significantly less cpu-, net- and
> time-consuming (when using the default mpath checker - readsector0).
> Mpath is good for me when I use it to "merge" drbd mirrors from
> different hosts, but for just doing simple load striping within a single
> L2 network switch between 2 .. 16 hosts it is somewhat overkill (particularly
> in maintaining human-readable device naming) :).
>
> John, what is your opinion on such a load-balancing method in general,
> without referring to particular use cases?
This seems reasonable to me, but I'll defer to Jay on this. As long as the limitations are documented - and it looks like they are - this may be fine.

Mostly I was interested to know what led you down this path and why MPIO was not working as (at least I expected) it should. When I get some time I'll see if we can address at least some of these issues. Even so, it seems like this bonding mode may still be useful for some use cases, perhaps even non-storage use cases.

>> You could tweak your scsi timeout values and fail_fast values, and set the io
>> retry to 0 to cause the failover to occur faster. I suspect you already
>> did this and still it is too slow? Maybe adding a checker in multipathd to
>> listen for link events would be fast enough. The checker could then fail
>> the path immediately.
>>
>> I'll try to address your comments from the other thread here. In general I
>> wonder if it would be better to solve the problems in dm-multipath rather than
>> add another bonding mode?
> Of course I did this, but mpath is fine when the device quantity is below
> 30-40 devices with two paths; 150-200 devices with 2+ paths can make
> life far more interesting :)

OK, admittedly this gets ugly fast.

>> OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize latency)
>>
>> The dm-multipath layer is adding latency? How much? If this is really true,
>> maybe it's best to address the real issue here and not avoid it by
>> using the bonding layer.
>
> I do not remember the exact number now, but switching one of my databases
> to bonding, about 2 years ago, increased read throughput for the entire db
> from 15-20 Tb/day to approximately 30-35 Tb/day (4 iscsi initiators and
> 8 iscsi targets, 4 ethernet links for iSCSI on each host, all plugged into
> one switch) because of "full" bandwidth use.
> Also, bonding usage simplifies network and application setup greatly
> (compared to mpath).

>> OVU - it handles any link failure badly, because of its command queue
>> limitation (all queued commands above 32 are discarded in case of path
>> failure, as I remember)
>>
>> Maybe true, but only link failures with the immediate peer are handled
>> with a bonding strategy. By working at the block layer we can detect
>> failures throughout the path. I would need to look into this again - I
>> know when we were looking at this some time ago there was some talk about
>> improving this behavior. I need to take some time to go back through the
>> error recovery stuff to remember how this works.
>>
>> OVU - it performs very badly when there are many devices and many paths (I was
>> unable to utilize more than 2Gbps of 4, even with 100 disks with 4 paths
>> per disk)
>
> Well, I think that behavior can be explained this way:
> when balancing by number of I/Os per path (rr_min_io) with a huge
> number of devices, mpath is doing load-balancing per device, and it is
> not possible to guarantee equal use of all devices, so there
> will be an imbalance over the network interfaces (mpath is unaware of their
> existence, etc.), and it likely becomes more imbalanced when there
> are many devices. Also, counting I/Os for many devices and paths
> consumes some CPU resources and can also cause excessive context switches.

Hmm, I'll get something set up here and see if this is the case.

>> Hmm, well, that seems like something is broken. I'll try this setup when
>> I get some time in the next few days. This really shouldn't be the case -
>> dm-multipath should not add a bunch of extra latency or affect throughput
>> significantly. By the way, what are you seeing without mpio?
>
> And one more observation from my 2-year-old tests - reading a device (using
> dd) (rhel 5 update 1 kernel, ramdisk via iSCSI via loopback) as an mpath
> device with a single path ran at approximately 120-150 MB/s, and the same
> test on a non-mpath device at 800-900 MB/s. Here I am quite sure; it was a
> kind of revelation to me at the time.

Similarly, I'll have a look. Thanks for the info.

>> Thanks,
>> John
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-18 16:41 ` John Fastabend
@ 2011-01-18 17:21 ` Oleg V. Ukhno
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 17:21 UTC (permalink / raw)
To: John Fastabend; +Cc: Jay Vosburgh, netdev, David S. Miller

On 01/18/2011 07:41 PM, John Fastabend wrote:
>> John, what is your opinion on such a load-balancing method in general,
>> without referring to particular use cases?
>
> This seems reasonable to me, but I'll defer to Jay on this. As long as the
> limitations are documented and it looks like they are this may be fine.
>
> Mostly I was interested to know what led you down this path and why MPIO
> was not working as at least I expected it should. When I get some time I'll
> see if we can address at least some of these issues. Even so it seems like
> this bonding mode may still be useful for some use cases perhaps even
> non-storage use cases.

I was addressing several problems with my patch:

- I was unable to consume the whole bandwidth with multipath - with four
  1Gbit "paths" it was slightly above 2Gbit/s.
- Link failures quite often caused disk failures, which led to Oracle ASM
  rebalances, especially with versions below 11.
- It is not always possible to autogenerate multipathd.conf with
  human-readable device names, because of iscsi session id and scsi device
  bus/channel/etc mismatch (usually it differs by 1, but not necessarily);
  with the bonding solution I can just look into /dev/disk/by-path to find
  out where a device, let's say /dev/sdab, is physically located (it's just
  a free bonus I've got, so to say :)).

--
Best regards,
Head of Commercial and Financial Services Operations, Yandex LLC
Oleg Ukhno
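[Editorial aside: for readers comparing the two approaches, the multipathd tuning John suggests upthread ("tweak your scsi timeout values and fail_fast values, set the io retry to 0") corresponds roughly to the multipath.conf fragment below. This is a hedged sketch for multipath-tools of that era; option names and semantics should be verified against the installed version's multipath.conf(5).]

```
defaults {
        polling_interval  2           # run path checkers more often
        no_path_retry     fail        # "io retry 0": fail I/O immediately
                                      # on path loss instead of queueing
        path_checker      readsector0 # the default checker Oleg mentions
        failback          immediate   # move back to a restored path quickly
}
```

Faster failure detection narrows, but does not close, the gap Oleg describes: each of hundreds of paths is still checked individually, whereas bonding fails a single slave.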
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
  2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
  2011-01-14 20:10 ` John Fastabend
  2011-01-14 20:13 ` Jay Vosburgh
@ 2011-01-14 20:41 ` Nicolas de Pesloüan
  2 siblings, 0 replies; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-14 20:41 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: netdev, Jay Vosburgh, David S. Miller

On 14/01/2011 20:07, Oleg V. Ukhno wrote:
> +
> +	For correct load baalncing on the receiving side you must
> +	configure switch for using src-dst-mac or src-mac hashing
> +	mode.

Typo: "baalncing" -> "balancing".

	Nicolas.
end of thread, other threads: [~2011-02-03 14:54 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
2011-01-14 20:10 ` John Fastabend
2011-01-14 23:12 ` Oleg V. Ukhno
2011-01-14 20:13 ` Jay Vosburgh
2011-01-14 22:51 ` Oleg V. Ukhno
2011-01-15  0:05 ` Jay Vosburgh
2011-01-15 12:11 ` Oleg V. Ukhno
2011-01-18  3:16 ` John Fastabend
2011-01-18 12:40 ` Oleg V. Ukhno
2011-01-18 14:54 ` Nicolas de Pesloüan
2011-01-18 15:28 ` Oleg V. Ukhno
2011-01-18 16:24 ` Nicolas de Pesloüan
2011-01-18 16:57 ` Oleg V. Ukhno
2011-01-18 20:24 ` Jay Vosburgh
2011-01-18 21:20 ` Nicolas de Pesloüan
2011-01-19  1:45 ` Jay Vosburgh
2011-01-18 22:22 ` Oleg V. Ukhno
2011-01-19 16:13 ` Oleg V. Ukhno
2011-01-19 20:12 ` Nicolas de Pesloüan
2011-01-21 13:55 ` Oleg V. Ukhno
2011-01-22 12:48 ` Nicolas de Pesloüan
2011-01-24 19:32 ` Oleg V. Ukhno
2011-01-29  2:28 ` Jay Vosburgh
2011-02-01 16:25 ` Oleg V. Ukhno
2011-02-02 17:30 ` Jay Vosburgh
2011-02-02  9:54 ` Nicolas de Pesloüan
2011-02-02 17:57 ` Jay Vosburgh
2011-02-03 14:54 ` Oleg V. Ukhno
2011-01-18 17:56 ` Kirill Smelkov
2011-01-18 16:41 ` John Fastabend
2011-01-18 17:21 ` Oleg V. Ukhno
2011-01-14 20:41 ` Nicolas de Pesloüan