* [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
@ 2011-01-14 19:07 Oleg V. Ukhno
2011-01-14 20:10 ` John Fastabend
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-14 19:07 UTC (permalink / raw)
To: netdev; +Cc: Jay Vosburgh, David S. Miller
Patch introduces a new hashing policy for 802.3ad bonding mode.
This hashing policy can be used (and was tested) only for round-robin
balancing of iSCSI traffic (a single TCP session is balanced per-packet
over all slave interfaces).
General requirements for use of this hashing policy are:
1) the switch must be configured with a src-dst-mac or src-mac hashing policy
2) the number of bond slaves on the sending and receiving machines should
be equal and preferably even, or at least even; otherwise you may get
asymmetric load on the receiving machine
3) this hashing policy must not be used when the round-trip time between
the source and destination machines is expected to be significantly
different for slaves in the same bond (it works fine when all slaves are
plugged into a single switch)
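Requirements (1) and (2) above can be illustrated with a toy model (a sketch in Python, not part of the patch; the slave-MAC low octets and the modulo switch hash are invented for illustration):

```python
# Toy model: round-robin tx striping through a switch that picks the
# egress port of the receiving port-channel by hashing the source MAC.
from collections import Counter

def striped_rx_load(tx_slaves, rx_ports, packets=1200):
    """Return the sorted per-port rx packet counts."""
    rx = Counter()
    for i in range(packets):
        mac_low_octet = i % tx_slaves        # round-robin slave choice;
                                             # frame carries that slave's MAC
        rx[mac_low_octet % rx_ports] += 1    # assumed src-mac switch hash
    return sorted(rx.values())

print(striped_rx_load(4, 4))  # equal counts: [300, 300, 300, 300]
print(striped_rx_load(4, 3))  # unequal: [300, 300, 600], asymmetric rx load
```

With equal slave counts every tx slave lands on its own rx port; with unequal counts one rx port carries double load, which is the asymmetry the changelog warns about.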
Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
---
Documentation/networking/bonding.txt | 27 +++++++++++++++++++++++++++
drivers/net/bonding/bond_3ad.c | 6 ++++++
drivers/net/bonding/bond_main.c | 18 +++++++++++++++++-
include/linux/if_bonding.h | 1 +
4 files changed, 51 insertions(+), 1 deletion(-)
diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
--- linux-2.6.37-vanilla/Documentation/networking/bonding.txt 2011-01-05 03:50:19.000000000 +0300
+++ linux-2.6.37.my/Documentation/networking/bonding.txt 2011-01-14 21:34:46.635268000 +0300
@@ -759,6 +759,33 @@ xmit_hash_policy
most UDP traffic is not involved in extended
conversations. Other implementations of 802.3ad may
or may not tolerate this noncompliance.
+
+ simple-rr or 3
+ This policy simply sends each successive packet via the
+ "next" slave interface. When sending, it rewrites the source
+ MAC address within the packet to the real MAC address of the slave interface.
+
+ When the switch is configured properly, and the receiving machine
+ has an even and equal number of interfaces, this guarantees
+ quite precise rx/tx load balancing for any single TCP
+ session. The typical use case for this mode is iSCSI (which this
+ patch was developed for), because it uses a single TCP session to
+ transmit data.
+
+ It is important to remember that all slaves should be
+ plugged into a single switch to avoid out-of-order packets.
+ It is recommended to have an equal and even number of slave
+ interfaces in the sending and receiving machines' bonds;
+ otherwise you will get asymmetric load on the receiving host.
+ Another caveat is that this hashing policy must not be used
+ when the round-trip time between the source and destination
+ machines for slaves in the same bond is expected to be
+ significantly different (it works fine on a single switch).
+
+ For correct load baalncing on the receiving side you must
+ configure switch for using src-dst-mac or src-mac hashing
+ mode.
+
The default value is layer2. This option was added in bonding
version 2.6.3. In earlier versions of bonding, this parameter
diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
--- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c 2011-01-14 19:39:05.575268000 +0300
+++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c 2011-01-14 19:47:03.815268000 +0300
@@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
int i;
struct ad_info ad_info;
int res = 1;
+ struct ethhdr *eth_data;
/* make sure that the slaves list will
* not change during tx
@@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
slave_agg_id = agg->aggregator_identifier;
if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
+ if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
+ skb_reset_mac_header(skb);
+ eth_data = eth_hdr(skb);
+ memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
+ }
res = bond_dev_queue_xmit(bond, skb, slave->dev);
break;
}
diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
--- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c 2011-01-14 19:39:05.575268000 +0300
+++ linux-2.6.37.my/drivers/net/bonding/bond_main.c 2011-01-14 19:47:55.835268001 +0300
@@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
module_param(xmit_hash_policy, charp, 0);
MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
- ", 1 for layer 3+4");
+ ", 1 for layer 3+4"
+ ", 2 for layer 2+3"
+ ", 3 for round-robin");
module_param(arp_interval, int, 0);
MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
module_param_array(arp_ip_target, charp, NULL, 0);
@@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
{ "layer2", BOND_XMIT_POLICY_LAYER2},
{ "layer3+4", BOND_XMIT_POLICY_LAYER34},
{ "layer2+3", BOND_XMIT_POLICY_LAYER23},
+{ "simple-rr", BOND_XMIT_POLICY_LAYERRR},
{ NULL, -1},
};
@@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
return (data->h_dest[5] ^ data->h_source[5]) % count;
}
+/*
+ * simply round robin
+ */
+static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
+ struct net_device *bond_dev, int count)
+{
+ struct bonding *bond = netdev_priv(bond_dev);
+ return bond->rr_tx_counter++ % count;
+}
+
/*-------------------------- Device entry points ----------------------------*/
static int bond_open(struct net_device *bond_dev)
@@ -4482,6 +4495,9 @@ out:
static void bond_set_xmit_hash_policy(struct bonding *bond)
{
switch (bond->params.xmit_policy) {
+ case BOND_XMIT_POLICY_LAYERRR:
+ bond->xmit_hash_policy = bond_xmit_hash_policy_rr;
+ break;
case BOND_XMIT_POLICY_LAYER23:
bond->xmit_hash_policy = bond_xmit_hash_policy_l23;
break;
diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/include/linux/if_bonding.h linux-2.6.37.my/include/linux/if_bonding.h
--- linux-2.6.37-vanilla/include/linux/if_bonding.h 2011-01-05 03:50:19.000000000 +0300
+++ linux-2.6.37.my/include/linux/if_bonding.h 2011-01-14 19:34:29.755268001 +0300
@@ -91,6 +91,7 @@
#define BOND_XMIT_POLICY_LAYER2 0 /* layer 2 (MAC only), default */
#define BOND_XMIT_POLICY_LAYER34 1 /* layer 3+4 (IP ^ (TCP || UDP)) */
#define BOND_XMIT_POLICY_LAYER23 2 /* layer 2+3 (IP ^ MAC) */
+#define BOND_XMIT_POLICY_LAYERRR 3 /* round-robin */
typedef struct ifbond {
__s32 bond_mode;
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
@ 2011-01-14 20:10 ` John Fastabend
2011-01-14 23:12 ` Oleg V. Ukhno
2011-01-14 20:13 ` Jay Vosburgh
2011-01-14 20:41 ` Nicolas de Pesloüan
2 siblings, 1 reply; 32+ messages in thread
From: John Fastabend @ 2011-01-14 20:10 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: netdev, Jay Vosburgh, David S. Miller
On 1/14/2011 11:07 AM, Oleg V. Ukhno wrote:
> Patch introduces a new hashing policy for 802.3ad bonding mode.
> This hashing policy can be used (and was tested) only for round-robin
> balancing of iSCSI traffic (a single TCP session is balanced per-packet
> over all slave interfaces).
> General requirements for use of this hashing policy are:
> 1) the switch must be configured with a src-dst-mac or src-mac hashing policy
> 2) the number of bond slaves on the sending and receiving machines should
> be equal and preferably even, or at least even; otherwise you may get
> asymmetric load on the receiving machine
> 3) this hashing policy must not be used when the round-trip time between
> the source and destination machines is expected to be significantly
> different for slaves in the same bond (it works fine when all slaves are
> plugged into a single switch)
>
> Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
> ---
I think you want this patch against net-next not 2.6.37.
>
> Documentation/networking/bonding.txt | 27 +++++++++++++++++++++++++++
> drivers/net/bonding/bond_3ad.c | 6 ++++++
> drivers/net/bonding/bond_main.c | 18 +++++++++++++++++-
> include/linux/if_bonding.h | 1 +
> 4 files changed, 51 insertions(+), 1 deletion(-)
>
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
> --- linux-2.6.37-vanilla/Documentation/networking/bonding.txt 2011-01-05 03:50:19.000000000 +0300
> +++ linux-2.6.37.my/Documentation/networking/bonding.txt 2011-01-14 21:34:46.635268000 +0300
> @@ -759,6 +759,33 @@ xmit_hash_policy
> most UDP traffic is not involved in extended
> conversations. Other implementations of 802.3ad may
> or may not tolerate this noncompliance.
> +
> + simple-rr or 3
> + This policy simply sends each successive packet via the
> + "next" slave interface. When sending, it rewrites the source
> + MAC address within the packet to the real MAC address of the slave interface.
> +
> + When the switch is configured properly, and the receiving machine
> + has an even and equal number of interfaces, this guarantees
> + quite precise rx/tx load balancing for any single TCP
> + session. The typical use case for this mode is iSCSI (which this
> + patch was developed for), because it uses a single TCP session to
> + transmit data.
Oleg, sorry but I don't follow. If this is simply sending every next packet
via "next" slave interface how are packets not going to get out of order? If
the links have different RTT this would seem problematic.
Have you considered using multipath at the block layer? This is how I generally
handle load balancing over iSCSI/FCoE and it works reasonably well.
see ./drivers/md/dm-mpath.c
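For comparison, a minimal sketch of the dm-multipath round-robin setup John refers to (a hypothetical /etc/multipath.conf fragment; the tuning values are illustrative, not a recommendation):

```
defaults {
    path_grouping_policy  multibus        # put all paths in one group
    path_selector         "round-robin 0" # stripe I/O across the paths
    rr_min_io             100             # requests per path before switching
    no_path_retry         queue           # queue I/O while all paths are down
}
```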
> +
> + It is important to remember that all slaves should be
> + plugged into a single switch to avoid out-of-order packets.
> + It is recommended to have an equal and even number of slave
> + interfaces in the sending and receiving machines' bonds;
> + otherwise you will get asymmetric load on the receiving host.
> + Another caveat is that this hashing policy must not be used
> + when the round-trip time between the source and destination
> + machines for slaves in the same bond is expected to be
> + significantly different (it works fine on a single switch).
> +
> + For correct load baalncing on the receiving side you must
> + configure switch for using src-dst-mac or src-mac hashing
> + mode.
> +
>
> The default value is layer2. This option was added in bonding
> version 2.6.3. In earlier versions of bonding, this parameter
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
> --- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c 2011-01-14 19:39:05.575268000 +0300
> +++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c 2011-01-14 19:47:03.815268000 +0300
> @@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
> int i;
> struct ad_info ad_info;
> int res = 1;
> + struct ethhdr *eth_data;
>
> /* make sure that the slaves list will
> * not change during tx
> @@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
> slave_agg_id = agg->aggregator_identifier;
>
> if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
> + if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
> + skb_reset_mac_header(skb);
> + eth_data = eth_hdr(skb);
> + memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
> + }
> res = bond_dev_queue_xmit(bond, skb, slave->dev);
> break;
> }
> diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
> --- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c 2011-01-14 19:39:05.575268000 +0300
> +++ linux-2.6.37.my/drivers/net/bonding/bond_main.c 2011-01-14 19:47:55.835268001 +0300
> @@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
> MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
> module_param(xmit_hash_policy, charp, 0);
> MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
> - ", 1 for layer 3+4");
> + ", 1 for layer 3+4"
> + ", 2 for layer 2+3"
> + ", 3 for round-robin");
> module_param(arp_interval, int, 0);
> MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
> module_param_array(arp_ip_target, charp, NULL, 0);
> @@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
> { "layer2", BOND_XMIT_POLICY_LAYER2},
> { "layer3+4", BOND_XMIT_POLICY_LAYER34},
> { "layer2+3", BOND_XMIT_POLICY_LAYER23},
> +{ "simple-rr", BOND_XMIT_POLICY_LAYERRR},
> { NULL, -1},
> };
>
> @@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
> return (data->h_dest[5] ^ data->h_source[5]) % count;
> }
>
> +/*
> + * simply round robin
> + */
> +static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
> + struct net_device *bond_dev, int count)
Here's one reason why this won't work on net-next-2.6.
int (*xmit_hash_policy)(struct sk_buff *, int);
Thanks,
John
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
2011-01-14 20:10 ` John Fastabend
@ 2011-01-14 20:13 ` Jay Vosburgh
2011-01-14 22:51 ` Oleg V. Ukhno
2011-01-14 20:41 ` Nicolas de Pesloüan
2 siblings, 1 reply; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-14 20:13 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: netdev, David S. Miller
Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>Patch introduces a new hashing policy for 802.3ad bonding mode.
>This hashing policy can be used (and was tested) only for round-robin
>balancing of iSCSI traffic (a single TCP session is balanced per-packet
>over all slave interfaces).
This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
(f), which requires that all frames of a given "conversation" are passed
to a single port.
The existing layer3+4 hash has a similar problem (that it may
send packets from a conversation to multiple ports), but for that case
it's an unlikely exception (only in the case of IP fragmentation), but
here it's the norm. At a minimum, this must be clearly documented.
Also, what does a round robin in 802.3ad provide that the
existing round robin does not? My presumption is that you're looking to
get the aggregator autoconfiguration that 802.3ad provides, but you
don't say.
I don't necessarily think this is a bad cheat (round robining on
802.3ad as an explicit non-standard extension), since everybody wants to
stripe their traffic across multiple slaves. I've given some thought to
making round robin into just another hash mode, but this also does some
magic to the MAC addresses of the outgoing frames (more on that below).
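The distinction Jay draws between the existing hash modes and a round-robin "hash" can be sketched as follows (simplified userspace Python; the layer3+4 fold is only an approximation of bond_xmit_hash_policy_l34(), and the flow field values are made up):

```python
def l34_hash(saddr, daddr, sport, dport, count):
    # Rough approximation of bonding's layer3+4 policy: fold the ports
    # and IP addresses, then take the result modulo the slave count.
    return ((sport ^ dport) ^ (saddr ^ daddr)) % count

def rr_slaves(npkts, count, start=0):
    # Proposed policy: ignore the headers entirely, just cycle a counter.
    return [(start + i) % count for i in range(npkts)]

flow = dict(saddr=0x0A000001, daddr=0x0A000002, sport=51000, dport=3260)
l34 = {l34_hash(count=4, **flow) for _ in range(8)}  # same flow, 8 packets
rr = set(rr_slaves(8, 4))

print(len(l34), len(rr))  # 1 4: one slave for layer3+4, all four for rr
```

This is why layer3+4 keeps a conversation on one port (modulo fragmentation) while the proposed mode deliberately spreads a single conversation over every slave.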
>General requirements for use of this hashing policy are:
>1) the switch must be configured with a src-dst-mac or src-mac hashing policy
>2) the number of bond slaves on the sending and receiving machines should
>be equal and preferably even, or at least even; otherwise you may get
>asymmetric load on the receiving machine
>3) this hashing policy must not be used when the round-trip time between
>the source and destination machines is expected to be significantly
>different for slaves in the same bond (it works fine when all slaves
>are plugged into a single switch)
>
>Signed-off-by: Oleg V. Ukhno <olegu@yandex-team.ru>
>---
>
> Documentation/networking/bonding.txt | 27 +++++++++++++++++++++++++++
> drivers/net/bonding/bond_3ad.c | 6 ++++++
> drivers/net/bonding/bond_main.c | 18 +++++++++++++++++-
> include/linux/if_bonding.h | 1 +
> 4 files changed, 51 insertions(+), 1 deletion(-)
>
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/Documentation/networking/bonding.txt linux-2.6.37.my/Documentation/networking/bonding.txt
>--- linux-2.6.37-vanilla/Documentation/networking/bonding.txt 2011-01-05 03:50:19.000000000 +0300
>+++ linux-2.6.37.my/Documentation/networking/bonding.txt 2011-01-14 21:34:46.635268000 +0300
>@@ -759,6 +759,33 @@ xmit_hash_policy
> most UDP traffic is not involved in extended
> conversations. Other implementations of 802.3ad may
> or may not tolerate this noncompliance.
>+
>+ simple-rr or 3
>+ This policy simply sends each successive packet via the
>+ "next" slave interface. When sending, it rewrites the source
>+ MAC address within the packet to the real MAC address of the slave interface.
Why is the MAC address reset done? This is also a violation of
802.3ad, 5.2.1 (j).
>+ When the switch is configured properly, and the receiving machine
>+ has an even and equal number of interfaces, this guarantees
>+ quite precise rx/tx load balancing for any single TCP
>+ session. The typical use case for this mode is iSCSI (which this
>+ patch was developed for), because it uses a single TCP session
>+ to transmit data.
>+
>+ It is important to remember that all slaves should be
>+ plugged into a single switch to avoid out-of-order packets.
>+ It is recommended to have an equal and even number of slave
>+ interfaces in the sending and receiving machines' bonds;
>+ otherwise you will get asymmetric load on the receiving host.
>+ Another caveat is that this hashing policy must not be used
>+ when the round-trip time between the source and destination
>+ machines for slaves in the same bond is expected to be
>+ significantly different (it works fine on a single switch).
>+
>+ For correct load baalncing on the receiving side you must
>+ configure switch for using src-dst-mac or src-mac hashing
>+ mode.
>+
>
> The default value is layer2. This option was added in bonding
> version 2.6.3. In earlier versions of bonding, this parameter
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c linux-2.6.37.my/drivers/net/bonding/bond_3ad.c
>--- linux-2.6.37-vanilla/drivers/net/bonding/bond_3ad.c 2011-01-14 19:39:05.575268000 +0300
>+++ linux-2.6.37.my/drivers/net/bonding/bond_3ad.c 2011-01-14 19:47:03.815268000 +0300
>@@ -2395,6 +2395,7 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
> int i;
> struct ad_info ad_info;
> int res = 1;
>+ struct ethhdr *eth_data;
>
> /* make sure that the slaves list will
> * not change during tx
>@@ -2447,6 +2448,11 @@ int bond_3ad_xmit_xor(struct sk_buff *sk
> slave_agg_id = agg->aggregator_identifier;
>
> if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
>+ if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYERRR && ntohs(skb->protocol) == ETH_P_IP) {
>+ skb_reset_mac_header(skb);
>+ eth_data = eth_hdr(skb);
>+ memcpy(eth_data->h_source, slave->perm_hwaddr, ETH_ALEN);
>+ }
This is the code that resets the MAC header as described above.
It doesn't quite match the documentation, since it only resets the MAC
for ETH_P_IP packets.
> res = bond_dev_queue_xmit(bond, skb, slave->dev);
> break;
> }
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c linux-2.6.37.my/drivers/net/bonding/bond_main.c
>--- linux-2.6.37-vanilla/drivers/net/bonding/bond_main.c 2011-01-14 19:39:05.575268000 +0300
>+++ linux-2.6.37.my/drivers/net/bonding/bond_main.c 2011-01-14 19:47:55.835268001 +0300
>@@ -152,7 +152,9 @@ module_param(ad_select, charp, 0);
> MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
> module_param(xmit_hash_policy, charp, 0);
> MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
>- ", 1 for layer 3+4");
>+ ", 1 for layer 3+4"
>+ ", 2 for layer 2+3"
>+ ", 3 for round-robin");
> module_param(arp_interval, int, 0);
> MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
> module_param_array(arp_ip_target, charp, NULL, 0);
>@@ -206,6 +208,7 @@ const struct bond_parm_tbl xmit_hashtype
> { "layer2", BOND_XMIT_POLICY_LAYER2},
> { "layer3+4", BOND_XMIT_POLICY_LAYER34},
> { "layer2+3", BOND_XMIT_POLICY_LAYER23},
>+{ "simple-rr", BOND_XMIT_POLICY_LAYERRR},
I'd just call it "round-robin" instead of "simple-rr".
> { NULL, -1},
> };
>
>@@ -3762,6 +3765,16 @@ static int bond_xmit_hash_policy_l2(stru
> return (data->h_dest[5] ^ data->h_source[5]) % count;
> }
>
>+/*
>+ * simply round robin
>+ */
>+static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
>+ struct net_device *bond_dev, int count)
>+{
>+ struct bonding *bond = netdev_priv(bond_dev);
>+ return bond->rr_tx_counter++ % count;
>+}
>+
> /*-------------------------- Device entry points ----------------------------*/
>
> static int bond_open(struct net_device *bond_dev)
>@@ -4482,6 +4495,9 @@ out:
> static void bond_set_xmit_hash_policy(struct bonding *bond)
> {
> switch (bond->params.xmit_policy) {
>+ case BOND_XMIT_POLICY_LAYERRR:
>+ bond->xmit_hash_policy = bond_xmit_hash_policy_rr;
>+ break;
> case BOND_XMIT_POLICY_LAYER23:
> bond->xmit_hash_policy = bond_xmit_hash_policy_l23;
> break;
>diff -uprN -X linux-2.6.37-vanilla/Documentation/dontdiff linux-2.6.37-vanilla/include/linux/if_bonding.h linux-2.6.37.my/include/linux/if_bonding.h
>--- linux-2.6.37-vanilla/include/linux/if_bonding.h 2011-01-05 03:50:19.000000000 +0300
>+++ linux-2.6.37.my/include/linux/if_bonding.h 2011-01-14 19:34:29.755268001 +0300
>@@ -91,6 +91,7 @@
> #define BOND_XMIT_POLICY_LAYER2 0 /* layer 2 (MAC only), default */
> #define BOND_XMIT_POLICY_LAYER34 1 /* layer 3+4 (IP ^ (TCP || UDP)) */
> #define BOND_XMIT_POLICY_LAYER23 2 /* layer 2+3 (IP ^ MAC) */
>+#define BOND_XMIT_POLICY_LAYERRR 3 /* round-robin */
>
> typedef struct ifbond {
> __s32 bond_mode;
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
2011-01-14 20:10 ` John Fastabend
2011-01-14 20:13 ` Jay Vosburgh
@ 2011-01-14 20:41 ` Nicolas de Pesloüan
2 siblings, 0 replies; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-14 20:41 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: netdev, Jay Vosburgh, David S. Miller
On 14/01/2011 20:07, Oleg V. Ukhno wrote:
> +
> + For correct load baalncing on the receiving side you must
> + configure switch for using src-dst-mac or src-mac hashing
> + mode.
Typo in baalncing -> balancing.
Nicolas.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-14 20:13 ` Jay Vosburgh
@ 2011-01-14 22:51 ` Oleg V. Ukhno
2011-01-15 0:05 ` Jay Vosburgh
0 siblings, 1 reply; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-14 22:51 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev, David S. Miller
Jay Vosburgh wrote:
> This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
> (f), which requires that all frames of a given "conversation" are passed
> to a single port.
>
> The existing layer3+4 hash has a similar problem (that it may
> send packets from a conversation to multiple ports), but for that case
> it's an unlikely exception (only in the case of IP fragmentation), but
> here it's the norm. At a minimum, this must be clearly documented.
>
> Also, what does a round robin in 802.3ad provide that the
> existing round robin does not? My presumption is that you're looking to
> get the aggregator autoconfiguration that 802.3ad provides, but you
> don't say.
>
> I don't necessarily think this is a bad cheat (round robining on
> 802.3ad as an explicit non-standard extension), since everybody wants to
> stripe their traffic across multiple slaves. I've given some thought to
> making round robin into just another hash mode, but this also does some
> magic to the MAC addresses of the outgoing frames (more on that below).
Yes, I am resetting MAC addresses when transmitting packets to have the
switch put packets into different ports of the receiving etherchannel.
I am using this patch to provide full-mesh iSCSI connectivity between at
least 4 hosts (all of course in the same Ethernet segment), where every
host is connected by an aggregate link with 4 slaves (usually).
Using round-robin I provide near-equal load striping when transmitting;
using the MAC address magic I force the switch to stripe packets over all
slave links in the destination port-channel (when the number of rx-ing
slaves is equal to the number of tx-ing slaves and is even). So I am able
to utilize all slaves for tx and for rx up to maximum capacity; besides,
I get L2 link failure detection (and load rebalancing), which is (in my
opinion) much faster and more robust than L3 detection or than
dm-multipath provides.
That is the idea behind the patch.
>
>
> This is the code that resets the MAC header as described above.
> It doesn't quite match the documentation, since it only resets the MAC
> for ETH_P_IP packets.
Yes, I really meant that my patch applies only to ETH_P_IP packets, and
I missed that in the documentation I wrote.
>
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>
--
Best regards,
Oleg Ukhno
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-14 20:10 ` John Fastabend
@ 2011-01-14 23:12 ` Oleg V. Ukhno
0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-14 23:12 UTC (permalink / raw)
To: John Fastabend; +Cc: netdev, Jay Vosburgh, David S. Miller
John Fastabend wrote:
>
> I think you want this patch against net-next not 2.6.37.
This patch is against 2.6.37-git11 and I've tried to apply it to
net-next - it applied ok
>
> Oleg, sorry but I don't follow. If this is simply sending every next packet
> via "next" slave interface how are packets not going to get out of order? If
> the links have different RTT this would seem problematic.
>
> Have you considered using multipath at the block layer? This is how I generally
> handle load balancing over iSCSI/FCoE and it works reasonably well.
>
> see ./drivers/md/dm-mpath.c
John, the first solution I used for a long time for iSCSI load
balancing was multipath. But there are some problems with dm-multipath:
- it is slow (I am using iSCSI for Oracle, so I need to minimize latency)
- it handles link failures badly, because of its command queue
limitation (all queued commands above 32 are discarded in case of path
failure, as I remember)
- it performs very badly when there are many devices and many paths (I
was unable to utilize more than 2 Gbps of 4 even with 100 disks with 4
paths per disk)
My patch won't work correctly when slave links have different RTT, this
is true - it is usable only within one Ethernet segment with equal or
near-equal RTT. This is its limitation.
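The RTT limitation conceded here can be made concrete with a small model (illustrative Python; the delay numbers are arbitrary):

```python
# Per-packet striping over links with different one-way delays reorders
# the stream: packet i leaves at i*gap_ms and arrives after the delay of
# the slave it was striped onto.
def arrival_order(delays_ms, packets=8, gap_ms=1.0):
    arrivals = [(i * gap_ms + delays_ms[i % len(delays_ms)], i)
                for i in range(packets)]
    return [seq for _, seq in sorted(arrivals)]

print(arrival_order([0.1, 0.1]))  # equal delay: packets arrive in order
print(arrival_order([0.1, 5.0]))  # unequal delay: odd packets arrive late
```

With equal delays the sequence arrives intact; with one slow link every second packet is reordered, which is exactly the single-switch / near-equal-RTT restriction in the changelog.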
>> +static int bond_xmit_hash_policy_rr(struct sk_buff *skb,
>> + struct net_device *bond_dev, int count)
>
> Here's one reason why this won't work on net-next-2.6.
>
> int (*xmit_hash_policy)(struct sk_buff *, int);
Thank you, I've missed that change.
>
>
> Thanks,
> John
>
--
Best regards,
Oleg Ukhno
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-14 22:51 ` Oleg V. Ukhno
@ 2011-01-15 0:05 ` Jay Vosburgh
2011-01-15 12:11 ` Oleg V. Ukhno
2011-01-18 3:16 ` John Fastabend
0 siblings, 2 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-15 0:05 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: netdev, David S. Miller, John Fastabend
Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>Jay Vosburgh wrote:
>
>> This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
>> (f), which requires that all frames of a given "conversation" are passed
>> to a single port.
>>
>> The existing layer3+4 hash has a similar problem (that it may
>> send packets from a conversation to multiple ports), but for that case
>> it's an unlikely exception (only in the case of IP fragmentation), but
>> here it's the norm. At a minimum, this must be clearly documented.
>>
>> Also, what does a round robin in 802.3ad provide that the
>> existing round robin does not? My presumption is that you're looking to
>> get the aggregator autoconfiguration that 802.3ad provides, but you
>> don't say.
I'm still curious about this question. Given the rather
intricate setup of your particular network (described below), I'm not
sure why 802.3ad is of benefit over traditional etherchannel
(balance-rr / balance-xor).
>> I don't necessarily think this is a bad cheat (round robining on
>> 802.3ad as an explicit non-standard extension), since everybody wants to
>> stripe their traffic across multiple slaves. I've given some thought to
>> making round robin into just another hash mode, but this also does some
>> magic to the MAC addresses of the outgoing frames (more on that below).
>Yes, I am resetting MAC addresses when transmitting packets to have the
>switch put packets into different ports of the receiving etherchannel.
By "etherchannel" do you really mean "Cisco switch with a
port-channel group using LACP"?
>I am using this patch to provide full-mesh iSCSI connectivity between at
>least 4 hosts (all of course in the same Ethernet segment), where every
>host is connected by an aggregate link with 4 slaves (usually).
>Using round-robin I provide near-equal load striping when transmitting;
>using the MAC address magic I force the switch to stripe packets over all
>slave links in the destination port-channel (when the number of rx-ing
>slaves is equal to the number of tx-ing slaves and is even).
By "MAC address magic" do you mean that you're assigning
specifically chosen MAC addresses to the slaves so that the switch's
hash is essentially "assigning" the bonding slaves to particular ports
on the outgoing port-channel group?
Assuming that this is the case, it's an interesting idea, but
I'm unconvinced that it's better on 802.3ad vs. balance-rr. Unless I'm
missing something, you can get everything you need from an option to
have balance-rr / balance-xor utilize the slave's permanent address as
the source address for outgoing traffic.
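The "MAC address magic" as read from this exchange can be sketched as follows (illustrative Python; the hash formulas are assumptions, not taken from any switch vendor documentation): if the slave MACs are chosen so their low octets enumerate 0..n-1, both a src-mac and a src-dst-mac hash map the n source slaves onto n distinct egress ports, because XOR with a fixed destination octet permutes the low bits when n is a power of two.

```python
# Map sender slave MACs to egress ports of the receiving port-channel
# under two assumed switch hash modes.
def egress_ports(src_octets, dst_octet, n_ports, mode):
    if mode == "src-mac":
        return {s % n_ports for s in src_octets}
    # src-dst-mac: XOR with a fixed dst octet, still a bijection mod 2^k
    return {(s ^ dst_octet) % n_ports for s in src_octets}

slaves = [0, 1, 2, 3]  # chosen low octets of the slave MACs
print(egress_ports(slaves, 0x5e, 4, "src-mac"))      # 4 distinct ports
print(egress_ports(slaves, 0x5e, 4, "src-dst-mac"))  # also 4 distinct ports
```

This is consistent with requirement (1) of the patch, which allows either switch hashing mode.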
>[...] So I am able to utilize all slaves
>for tx and for rx up to maximum capacity; besides, I get L2 link failure
>detection (and load rebalancing), which is (in my opinion) much faster
>and more robust than L3 detection or than dm-multipath provides.
>That is the idea behind the patch.
Can somebody (John?) more knowledgable than I about dm-multipath
comment on the above?
>> This is the code that resets the MAC header as described above.
>> It doesn't quite match the documentation, since it only resets the MAC
>> for ETH_P_IP packets.
>Yes, I really meant that my patch applies to ETH_P_IP packets and I've
>missed that from documentation I wrote.
Is limiting this to just ETH_P_IP really a means to exclude ARP,
or is there some advantage to (effectively) only balancing IP traffic,
and leaving other traffic (IPv6, for one) essentially unbalanced (when
exiting the switch through the destination port-channel group, which
you've set to use a src-mac hash)?
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-15 0:05 ` Jay Vosburgh
@ 2011-01-15 12:11 ` Oleg V. Ukhno
2011-01-18 3:16 ` John Fastabend
1 sibling, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-15 12:11 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev, David S. Miller, John Fastabend
Jay Vosburgh wrote:
> Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>> Jay Vosburgh wrote:
>>
>>> Also, what does a round robin in 802.3ad provide that the
>>> existing round robin does not? My presumption is that you're looking to
>>> get the aggregator autoconfiguration that 802.3ad provides, but you
>>> don't say.
>
> I'm still curious about this question. Given the rather
> intricate setup of your particular network (described below), I'm not
> sure why 802.3ad is of benefit over traditional etherchannel
> (balance-rr / balance-xor).
Yes, I wanted 802.3ad autoconfiguration. Besides, all switches I use
support LACP so I've chosen 802.3ad link aggregation.
Of course, it would be cool if both 802.3ad and balance-rr modes
supported such a load striping feature.
>
>> Yes, I am resetting MAC addresses when transmitting packets to have switch
>> to put packets into different ports of the receiving etherchannel.
>
> By "etherchannel" do you really mean "Cisco switch with a
> port-channel group using LACP"?
Yes, exactly
>
>> I am using this patch to provide full-mesh ISCSI connectivity between at
>> least 4 hosts (all hosts of course are in same ethernet segment) and every
>> host is connected with aggregate link with 4 slaves(usually).
>> Using round-robin I provide near-equal load striping when transmitting,
>> using MAC address magic I force switch to stripe packets over all slave
>> links in destination port-channel(when number of rx-ing slaves is equal to
>> number ot tx-ing slaves and is even).
>
> By "MAC address magic" do you mean that you're assigning
> specifically chosen MAC addresses to the slaves so that the switch's
> hash is essentially "assigning" the bonding slaves to particular ports
> on the outgoing port-channel group?
Yes, so I am able to make equal load striping even for a single TCP
session between just two hosts, not only for the transmitting host but
also for the receiving host (iperf, when doing a TCP test, is able to
utilize all available bandwidth in the given etherchannel).
>
> Assuming that this is the case, it's an interesting idea, but
> I'm unconvinced that it's better on 802.3ad vs. balance-rr. Unless I'm
> missing something, you can get everything you need from an option to
> have balance-rr / balance-xor utilize the slave's permanent address as
> the source address for outgoing traffic.
Yes, balance-rr would satisfy my requirements if patched to do the "MAC
address magic" (replacing the MAC address of packets being transmitted
with the slave's permanent address), except for 802.3ad link
autoconfiguration. "Pure" balance-rr won't allow utilizing the whole
etherchannel bandwidth when transmitting data just between 2 hosts (for
example, when I have one iSCSI initiator and one iSCSI target).
balance-xor is not what I wanted because data transmitted by the source
host will stick to a single slave.
>
>
>>> This is the code that resets the MAC header as described above.
>>> It doesn't quite match the documentation, since it only resets the MAC
>>> for ETH_P_IP packets.
>> Yes, I really meant that my patch applies to ETH_P_IP packets and I've
>> missed that from documentation I wrote.
>
> Is limiting this to just ETH_P_IP really a means to exclude ARP,
> or is there some advantage to (effectively) only balancing IP traffic,
> and leaving other traffic (IPv6, for one) essentially unbalanced (when
> exiting the switch through the destination port-channel group, which
> you've set to use a src-mac hash)?
>
Well, when making the initial version of this patch (it was for the
2.6.18 kernel), I meant just to exclude ARP.
> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>
--
Best regards,
Oleg Ukhno
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-15 0:05 ` Jay Vosburgh
2011-01-15 12:11 ` Oleg V. Ukhno
@ 2011-01-18 3:16 ` John Fastabend
2011-01-18 12:40 ` Oleg V. Ukhno
1 sibling, 1 reply; 32+ messages in thread
From: John Fastabend @ 2011-01-18 3:16 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Oleg V. Ukhno, netdev, David S. Miller
On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
> Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>> Jay Vosburgh wrote:
>>
>>> This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
>>> (f), which requires that all frames of a given "conversation" are passed
>>> to a single port.
>>>
>>> The existing layer3+4 hash has a similar problem (that it may
>>> send packets from a conversation to multiple ports), but for that case
>>> it's an unlikely exception (only in the case of IP fragmentation), but
>>> here it's the norm. At a minimum, this must be clearly documented.
>>>
>>> Also, what does a round robin in 802.3ad provide that the
>>> existing round robin does not? My presumption is that you're looking to
>>> get the aggregator autoconfiguration that 802.3ad provides, but you
>>> don't say.
>
> I'm still curious about this question. Given the rather
> intricate setup of your particular network (described below), I'm not
> sure why 802.3ad is of benefit over traditional etherchannel
> (balance-rr / balance-xor).
>
>>> I don't necessarily think this is a bad cheat (round robining on
>>> 802.3ad as an explicit non-standard extension), since everybody wants to
>>> stripe their traffic across multiple slaves. I've given some thought to
>>> making round robin into just another hash mode, but this also does some
>>> magic to the MAC addresses of the outgoing frames (more on that below).
>> Yes, I am resetting MAC addresses when transmitting packets to have switch
>> to put packets into different ports of the receiving etherchannel.
>
> By "etherchannel" do you really mean "Cisco switch with a
> port-channel group using LACP"?
>
>> I am using this patch to provide full-mesh ISCSI connectivity between at
>> least 4 hosts (all hosts of course are in same ethernet segment) and every
>> host is connected with aggregate link with 4 slaves(usually).
>> Using round-robin I provide near-equal load striping when transmitting,
>> using MAC address magic I force switch to stripe packets over all slave
>> links in destination port-channel(when number of rx-ing slaves is equal to
>> number ot tx-ing slaves and is even).
>
> By "MAC address magic" do you mean that you're assigning
> specifically chosen MAC addresses to the slaves so that the switch's
> hash is essentially "assigning" the bonding slaves to particular ports
> on the outgoing port-channel group?
>
> Assuming that this is the case, it's an interesting idea, but
> I'm unconvinced that it's better on 802.3ad vs. balance-rr. Unless I'm
> missing something, you can get everything you need from an option to
> have balance-rr / balance-xor utilize the slave's permanent address as
> the source address for outgoing traffic.
>
>> [...] So I am able to utilize all slaves
>> for tx and for rx up to maximum capacity; besides I am getting L2 link
>> failure detection (and load rebalancing), which is (in my opinion) much
>> faster and robust than L3 or than dm-multipath provides.
>> It's my idea with the patch
>
> Can somebody (John?) more knowledgable than I about dm-multipath
> comment on the above?
Here I'll give it a go.
I don't think detecting L2 link failure this way is very robust. If there
is a failure farther away than your immediate link, you're going to break
completely: your bonding hash will continue to round-robin the iSCSI
packets and half of them will get dropped on the floor. dm-multipath handles
this reasonably gracefully. Also, in this bonding environment you seem to
be very sensitive to RTT times on the network. Maybe not outright bad, but
I wouldn't consider this robust either.
You could tweak your SCSI timeout values and fail_fast values, and set the
I/O retry to 0 to cause the failover to occur faster. I suspect you already
did this and it is still too slow? Maybe adding a checker in multipathd to
listen for link events would be fast enough. The checker could then fail
the path immediately.
I'll try to address your comments from the other thread here. In general I
wonder if it would be better to solve the problems in dm-multipath rather than
add another bonding mode?
OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize latency)
The dm-multipath layer is adding latency? How much? If this is really true,
maybe it's best to address the real issue here and not avoid it by
using the bonding layer.
OVU - it handles any link failures badly, because of its command queue
limitation (all queued commands above 32 are discarded in case of path
failure, as I remember)
Maybe true, but only link failures with the immediate peer are handled
by a bonding strategy. By working at the block layer we can detect
failures throughout the path. I would need to look into this again; I
know when we were looking at this some time ago there was some talk about
improving this behavior. I need to take some time to go back through the
error recovery stuff to remember how this works.
OVU - it performs very badly when there are many devices and many paths (I was
unable to utilize more than 2 Gbps of 4, even with 100 disks with 4 paths
per disk)
Hmm, well, that seems like something is broken. I'll try this setup when
I get some time in the next few days. This really shouldn't be the case;
dm-multipath should not add a bunch of extra latency or affect throughput
significantly. By the way, what are you seeing without mpio?
Thanks,
John
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 3:16 ` John Fastabend
@ 2011-01-18 12:40 ` Oleg V. Ukhno
2011-01-18 14:54 ` Nicolas de Pesloüan
2011-01-18 16:41 ` John Fastabend
0 siblings, 2 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 12:40 UTC (permalink / raw)
To: John Fastabend; +Cc: Jay Vosburgh, netdev, David S. Miller
On 01/18/2011 06:16 AM, John Fastabend wrote:
> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>> Can somebody (John?) more knowledgable than I about dm-multipath
>> comment on the above?
>
> Here I'll give it a go.
>
> I don't think detecting L2 link failure this way is very robust. If there
> is a failure farther away than your immediate link, you're going to break
> completely: your bonding hash will continue to round-robin the iSCSI
> packets and half of them will get dropped on the floor. dm-multipath handles
> this reasonably gracefully. Also, in this bonding environment you seem to
> be very sensitive to RTT times on the network. Maybe not outright bad, but
> I wouldn't consider this robust either.
John, I agree - this bonding mode should be used in a quite limited number
of situations, but as for a failure farther away than the immediate link -
every bonding mode will suffer the same problems in that case - bonding
detects only L2 failures; the rest is done by upper-layer mechanisms. And
almost all bonding modes depend on equal RTT on slaves. Also, there is
already a similar load balancing mode - balance-alb - what I did is
approximately the same, but for 802.3ad bonding mode, and it provides
"better" (more equal and unconditional layer 2) load striping for tx
and _rx_.
Perhaps I shouldn't have mentioned the particular use case of this patch -
when I wrote it I tried to make a more general solution - my goal was to
"make equal or near-equal load striping for TX and (most important) RX
within a single ethernet (layer 2) domain for TCP transmission". This
bonding mode just introduces the ability to stripe rx and tx load for a
single TCP connection between hosts inside one ethernet segment.
iSCSI is just an example. It is possible to stripe load between a
linux-based router and a linux-based web/ftp/etc server in the
same manner. I think this feature will be useful in some number of
network configurations.
Also, I looked into the net-next code - it seems to me that it can be
implemented (adapted to the net-next bonding code) without any difficulties,
and the hashing function change poses no problem here.
What I've written below is just my personal experience and opinion after
5 years of using Oracle + iSCSI + mpath (later - patched bonding).
From my personal experience I can just say that most iSCSI failures are
caused by link failures, and also I would never send any significant
iSCSI traffic via a router - the router would be a bottleneck in this case.
So, in my case iSCSI traffic flows within one ethernet domain, and in
case of link failure the bonding driver simply fails one slave (in the
case of bonding), instead of checking and failing hundreds of paths (in
the case of mpath); the first case is significantly less CPU, network and
time consuming (if using the default mpath checker - readsector0).
Mpath is good for me when I use it to "merge" drbd mirrors from
different hosts, but for just doing simple load striping within a single
L2 network switch between 2..16 hosts it is overkill (particularly
in maintaining human-readable device naming) :).
John, what is your opinion on such a load balancing method in general,
without referring to particular use cases?
>
> You could tweak your scsi timeout values and fail_fast values, set the io
> retry to 0 to cause the fail over to occur faster. I suspect you already
> did this and still it is too slow? Maybe adding a checker in multipathd to
> listen for link events would be fast enough. The checker could then fail
> the path immediately.
>
> I'll try to address your comments from the other thread here. In general I
> wonder if it would be better to solve the problems in dm-multipath rather than
> add another bonding mode?
Of course I did this, but mpath is fine when the device quantity is below
30-40 devices with two paths; 150-200 devices with 2+ paths can make
life far more interesting :)
>
> OVU - it is slow(I am using ISCSI for Oracle , so I need to minimize latency)
>
> The dm-multipath layer is adding latency? How much? If this is really true
> maybe its best to the address the real issue here and not avoid it by
> using the bonding layer.
I do not remember the exact numbers now, but switching one of my databases
to bonding about 2 years ago increased read throughput for the entire db
from 15-20 Tb/day to approximately 30-35 Tb/day (4 iSCSI initiators and
8 iSCSI targets, 4 ethernet links for iSCSI on each host, all plugged into
one switch) because of "full" bandwidth use. Also, bonding usage
simplifies network and application setup greatly (compared to mpath).
>
> OVU - it handles any link failures bad, because of it's command queue
> limitation(all queued commands above 32 are discarded in case of path
> failure, as I remember)
>
> Maybe true but only link failures with the immediate peer are handled
> with a bonding strategy. By working at the block layer we can detect
> failures throughout the path. I would need to look into this again I
> know when we were looking at this sometime ago there was some talk about
> improving this behavior. I need to take some time to go back through the
> error recovery stuff to remember how this works.
>
> OVU - it performs very badly when there are many devices and many paths (I was
> unable to utilize more than 2 Gbps of 4, even with 100 disks with 4 paths
> per disk)
Well, I think that behavior can be explained this way:
when balancing by the number of I/Os per path (rr_min_io) and there is a
huge number of devices, mpath is doing load-balancing per device, and it
is not possible to guarantee equal device use across all devices, so
there will be an imbalance over the network interface (mpath is unaware
of its existence, etc.), and it is likely to become more imbalanced when
there are many devices. Also, counting I/Os for many devices and paths
consumes some CPU resources and can also cause excessive context switches.
>
> Hmm well that seems like something is broken. I'll try this setup when
> I get some time next few days. This really shouldn't be the case dm-multipath
> should not add a bunch of extra latency or effect throughput significantly.
> By the way what are you seeing without mpio?
And one more observation from my 2-year-old tests - reading a device
(using dd) (RHEL 5 update 1 kernel, ramdisk via iSCSI via loopback) as an
mpath device with a single path was done at approximately 120-150 mb/s,
and the same test on a non-mpath device at 800-900 mb/s. Here I am quite
sure; it was a kind of revelation to me at the time.
>
> Thanks,
> John
>
--
Best regards,
Oleg Ukhno.
ITO Team Lead,
Yandex LLC.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 12:40 ` Oleg V. Ukhno
@ 2011-01-18 14:54 ` Nicolas de Pesloüan
2011-01-18 15:28 ` Oleg V. Ukhno
2011-01-18 16:41 ` John Fastabend
1 sibling, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-18 14:54 UTC (permalink / raw)
To: Oleg V. Ukhno, John Fastabend, Jay Vosburgh, David S. Miller
Cc: netdev, Sébastien Barré, Christophe Paasch
Le 18/01/2011 13:40, Oleg V. Ukhno a écrit :
The fact that there exist many situations where it simply doesn't work
should not cause Oleg's idea to be rejected.
In Documentation/networking/bonding.txt, tuning tcp_reordering on receiving side is already
documented as a possible workaround for out of order delivery due to load balancing of a single TCP
session, using mode=balance-rr.
This might work reasonably well in a pure LAN topology, without any router between both ends of the
TCP session, even if this is limited to Linux hosts. The uses are not uncommon and not limited to iSCSI:
- between an application server and a database server,
- between members of a cluster, for replication purpose,
- between a server and a backup system,
- ...
Of course, for longer paths, with routers and variable RTT, we would need something different
(possibly MultiPathTCP: http://datatracker.ietf.org/wg/mptcp/).
I remember a topology (described by Jay, as far as I remember), where two
hosts were connected through two distinct VLANs. In such a topology:
- it is possible to detect path failure using arp monitoring instead of miimon.
- changing the destination MAC address of egress packets is not necessary,
because egress path selection forces ingress path selection due to the VLAN.
I think the only point is whether we need a new xmit_hash_policy for mode=802.3ad or whether
mode=balance-rr could be enough.
Oleg, would you mind trying the above "two VLAN" topology with
mode=balance-rr and reporting any results? For high-availability purposes,
it's obviously necessary to set up those VLANs on distinct switches.
Nicolas
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 14:54 ` Nicolas de Pesloüan
@ 2011-01-18 15:28 ` Oleg V. Ukhno
2011-01-18 16:24 ` Nicolas de Pesloüan
2011-01-18 17:56 ` Kirill Smelkov
0 siblings, 2 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 15:28 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev,
Sébastien Barré,
Christophe Paasch
On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
> Le 18/01/2011 13:40, Oleg V. Ukhno a écrit :
>
> The fact that there exist many situations where it simply doesn't work,
> should not cause the idea of Oleg to be rejected.
>
> In Documentation/networking/bonding.txt, tuning tcp_reordering on
> receiving side is already documented as a possible workaround for out of
> order delivery due to load balancing of a single TCP session, using
> mode=balance-rr.
>
> This might work reasonably well in a pure LAN topology, without any
> router between both ends of the TCP session, even if this is limited to
> Linux hosts. The uses are not uncommon and not limited to iSCSI:
> - between an application server and a database server,
> - between members of a cluster, for replication purpose,
> - between a server and a backup system,
> - ...
Nicolas, thank you for your opinion - this is exactly what I mean:
iSCSI is just one particular use case, but there are many cases where
this load balancing method will be useful.
>
> Of course, for longer paths, with routers and variable RTT, we would
> need something different (possibly MultiPathTCP:
> http://datatracker.ietf.org/wg/mptcp/).
>
> I remember a topology (described by Jay, for as far as I remember),
> where two hosts were connected through two distinct VLANs. In such
> topology:
> - it is possible to detect path failure using arp monitoring instead of
> miimon.
> - changing the destination MAC address of egress packets are not
> necessary, because egress path selection force ingress path selection
> due to the VLAN.
In the case with two VLANs - yes, this shouldn't be necessary (but it
needs to be tested, I am not sure), but within one it is essential for
correct rx load striping.
>
> I think the only point is whether we need a new xmit_hash_policy for
> mode=802.3ad or whether mode=balance-rr could be enough.
Maybe, but it seems fair enough to me not to restrict this feature only
to non-LACP aggregate links; dynamic aggregation may be useful (it
sometimes helps to avoid switch misconfiguration (misconfigured slaves on
the switch side) without loss of service).
>
> Oleg, would you mind trying the above "two VLAN" topology" with
> mode=balance-rr and report any results ? For high-availability purpose,
> it's obviously necessary to setup those VLAN on distinct switches.
I'll do it, but it will take some time to set up a test environment,
several days maybe.
You mean the following topology:
           switch 1
          /        \
   host A            host B
          \        /
           switch 2
(I'm sure it will work as desired if each host is connected to each
switch with only one slave link; if there are more slaves per switch,
I'm unsure)?
>
> Nicolas
>
>
>
--
Best regards,
Oleg Ukhno.
ITO Team Lead,
Yandex LLC.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 15:28 ` Oleg V. Ukhno
@ 2011-01-18 16:24 ` Nicolas de Pesloüan
2011-01-18 16:57 ` Oleg V. Ukhno
2011-01-18 20:24 ` Jay Vosburgh
2011-01-18 17:56 ` Kirill Smelkov
1 sibling, 2 replies; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-18 16:24 UTC (permalink / raw)
To: Oleg V. Ukhno
Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev,
Sébastien Barré,
Christophe Paasch
Le 18/01/2011 16:28, Oleg V. Ukhno a écrit :
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> I remember a topology (described by Jay, for as far as I remember),
>> where two hosts were connected through two distinct VLANs. In such
>> topology:
>> - it is possible to detect path failure using arp monitoring instead of
>> miimon.
>> - changing the destination MAC address of egress packets are not
>> necessary, because egress path selection force ingress path selection
>> due to the VLAN.
>
> In case with two VLANs - yes, this shouldn't be necessary(but needs to
> be tested, I am not sure), but within one - it is essential for correct
> rx load striping.
Changing the destination MAC address is definitely not required if you segregate each path in a
distinct VLAN.
+-------------------+ +-------------------+
+-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
| +-------------------+ +-------------------+ |
+------+ | | +------+
|host A| | | |host B|
+------+ | | +------+
| +-------------------+ +-------------------+ |
+-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+
+-------------------+ +-------------------+
Even in the presence of an ISL between some switches, packets sent through
host A's interface connected to vlan 100 will only enter host B via the
interface connected to vlan 100. So all slaves of the bonding interface
can use the same MAC address.
Of course, changing the destination address would be required in order to achieve ingress load
balancing on a *single* LAN. But, as Jay noted at the beginning of this thread, this would violate
802.3ad.
>> I think the only point is whether we need a new xmit_hash_policy for
>> mode=802.3ad or whether mode=balance-rr could be enough.
> Maybe, but it seems fair enough to me not to restrict this feature only
> to non-LACP aggregate links; dynamic aggregation may be useful (it
> sometimes helps to avoid switch misconfiguration (misconfigured slaves
> on the switch side) without loss of service).
You are right, but such a LAN setup needs to be carefully designed and
built. I'm not sure that an automatic channel aggregation system is the
right way to do it. Hence the reason why I suggest using balance-rr with
VLANs.
>> Oleg, would you mind trying the above "two VLAN" topology" with
>> mode=balance-rr and report any results ? For high-availability purpose,
>> it's obviously necessary to setup those VLAN on distinct switches.
> I'll do it, but it will take some time to setup test environment,
> several days may be.
Thanks. For testing purposes, it is enough to set up those VLANs on a
single switch if that is easier for you.
> You mean following topology:
See above.
> (i'm sure it will work as desired if each host is connected to each
> switch with only one slave link, if there are more slaves in each switch
> - unsure)?
If you want to use more than 2 slaves per host, then you need more than 2
VLANs. You also need to have the exact same number of slaves on all hosts,
as egress path selection causes ingress path selection on the other side.
+-------------------+ +-------------------+
+-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
| +-------------------+ +-------------------+ |
+------+ | | +------+
|host A| | | |host B|
+------+ | | +------+
| | +-------------------+ +-------------------+ | |
| +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+ |
| +-------------------+ +-------------------+ |
| | | |
| | | |
| +-------------------+ +-------------------+ |
+---------|switch 5 - vlan 300|-----|switch 6 - vlan 300|---------+
+-------------------+ +-------------------+
Of course, you can add other hosts to vlan 100, 200 and 300, with the
exact same configuration as host A or host B.
Nicolas.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 12:40 ` Oleg V. Ukhno
2011-01-18 14:54 ` Nicolas de Pesloüan
@ 2011-01-18 16:41 ` John Fastabend
2011-01-18 17:21 ` Oleg V. Ukhno
1 sibling, 1 reply; 32+ messages in thread
From: John Fastabend @ 2011-01-18 16:41 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: Jay Vosburgh, netdev, David S. Miller
On 1/18/2011 4:40 AM, Oleg V. Ukhno wrote:
> On 01/18/2011 06:16 AM, John Fastabend wrote:
>> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>>> Can somebody (John?) more knowledgable than I about dm-multipath
>>> comment on the above?
>>
>> Here I'll give it a go.
>>
>> I don't think detecting L2 link failure this way is very robust. If there
>> is a failure farther away than your immediate link, you're going to break
>> completely: your bonding hash will continue to round-robin the iSCSI
>> packets and half of them will get dropped on the floor. dm-multipath handles
>> this reasonably gracefully. Also, in this bonding environment you seem to
>> be very sensitive to RTT times on the network. Maybe not outright bad, but
>> I wouldn't consider this robust either.
>
> John, I agree - this bonding mode should be used in quite limited number
> of situations, but as for failure farther away then immediate link -
> every bonding mode will suffer same problems in this case - bonding
> detects only L2 failures, other is done by upper-layer mechanisms. And
> almost all bonding modes depend on equal RTT on slaves. And, there is
> already similar load balancing mode - balance-alb - what I did is
> approximately the same, but for 802.3ad bonding mode and provides
> "better"(more equal and non-conditional layser2) load striping for tx
> and _rx_ .
>
> I think I shouldn't mention the particular use case of this patch - when
> I wrote it I tried to make a more general solution - my goal was "make
> equal or near-equal load striping for TX and (most important part) RX
> within single ethernet(layer 2) domain for TCP transmission". This
> bonding mode just introduces ability to stripe rx and tx load for
> single TCP connection between hosts inside of one ethernet segment.
> iSCSI is just an example. It is possible to stripe load between a
> linux-based router and linux-based web/ftp/etc server as well in the
> same manner. I think this feature will be useful in some number of
> network configurations.
>
> Also, I looked into net-next code - it seems to me that it can be
> implemented(adapted to net-next bonding code) without any difficulties
> and hashing function change makes no problem here.
>
> What I've written below is just my personal experience and opinion after
> 5 years of using Oracle +iSCSI +mpath(later - patched bonding).
>
> From my personal experience I just can say that most iSCSI failures are
> caused by link failures, and also I would never send any significant
> iSCSI traffic via router - router would be a bottleneck in this case.
> So, in my case iSCSI traffic flows within one ethernet domain and in
> case of link failure bonding driver simply fails one slave(in case of
> bonding) , instead of checking and failing hundreths of paths (in case
> of mpath) and first case significantly less cpu, net and time
> consuming(if using default mpath checker - readsector0).
> Mpath is good for me, when I use it to "merge" drbd mirrors from
> different hosts, but for just doing simple load striping within single
> L2 network switch between 2 .. 16 hosts is some overkill(particularly
> in maintaining human-readable device naming) :).
>
> John, what is you opinion on such load balancing method in general,
> without referring to particular use cases?
>
This seems reasonable to me, but I'll defer to Jay on this. As long as the
limitations are documented, and it looks like they are, this may be fine.
Mostly I was interested to know what led you down this path and why MPIO
was not working as (at least I expected) it should. When I get some time
I'll see if we can address at least some of these issues. Even so, it
seems like this bonding mode may still be useful for some use cases,
perhaps even non-storage use cases.
>
>>
>> You could tweak your scsi timeout values and fail_fast values, set the io
>> retry to 0 to cause the fail over to occur faster. I suspect you already
>> did this and still it is too slow? Maybe adding a checker in multipathd to
>> listen for link events would be fast enough. The checker could then fail
>> the path immediately.
>>
>> I'll try to address your comments from the other thread here. In general I
>> wonder if it would be better to solve the problems in dm-multipath rather than
>> add another bonding mode?
> Of course I did this, but mpath is fine when device quantity is below
> 30-40 devices with two paths, 150-200 devices with 2+ paths can make
> life far more interesting :)
OK, admittedly this gets ugly fast.
>>
>> OVU - it is slow(I am using ISCSI for Oracle , so I need to minimize latency)
>>
>> The dm-multipath layer is adding latency? How much? If this is really true
>> maybe its best to the address the real issue here and not avoid it by
>> using the bonding layer.
>
> I do not remember exact number now, but switching one of my databases ,
> about 2 years ago to bonding increased read throughput for the entire db
> from 15-20 Tb/day to approximately 30-35 Tb/day (4 iscsi initiators and
> 8 iscsi targets, 4 ethernet links for iSCSI on each host, all plugged in
> one switch) because of "full" bandwidth use. Also, bonding usage
> simplifies network and application setup greatly(compared to mpath)
>
>>
>> OVU - it handles any link failures bad, because of it's command queue
>> limitation(all queued commands above 32 are discarded in case of path
>> failure, as I remember)
>>
>> Maybe true but only link failures with the immediate peer are handled
>> with a bonding strategy. By working at the block layer we can detect
>> failures throughout the path. I would need to look into this again I
>> know when we were looking at this sometime ago there was some talk about
>> improving this behavior. I need to take some time to go back through the
>> error recovery stuff to remember how this works.
>>
>> OVU - it performs very badly when there are many devices and many paths (I was
>> unable to utilize more than 2 Gbps of 4, even with 100 disks with 4 paths
>> per disk)
>
> Well, I think that behavior can be explained this way:
> when balancing by I/O count per path (rr_min_io) with a huge number of
> devices, mpath does its load balancing per device, and it is not
> possible to guarantee equal use across all devices, so there will be
> imbalance over the network interfaces (mpath is unaware of their
> existence, etc.), and it likely becomes more imbalanced as the number
> of devices grows. Also, counting I/Os for many devices and paths
> consumes some CPU resources and can cause excessive context switches.
>
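The per-device pinning effect described above can be sketched with a toy model (not dm-multipath code; the device I/O counts and the rr_min_io value are made up for illustration):

```python
from collections import Counter

def per_path_load(devices, num_paths, rr_min_io):
    """Each device round-robins over its paths independently,
    switching paths only after rr_min_io requests (dm-multipath
    style). Balancing is per device, not per interface."""
    load = Counter()
    for ios in devices:
        path, sent = 0, 0  # start path fixed at 0 for simplicity
        for _ in range(ios):
            load[path] += 1
            sent += 1
            if sent == rr_min_io:
                path, sent = (path + 1) % num_paths, 0
    return load

# 100 devices with skewed I/O counts, 4 paths, a large rr_min_io:
# no device ever reaches the switch threshold, so each stays pinned.
devices = [50 * (i % 7 + 1) for i in range(100)]
print(per_path_load(devices, num_paths=4, rr_min_io=1000))
# → Counter({0: 19750})  (everything pinned to one path)
```

With a single busy device the same model balances perfectly, which matches the observation that mpath behaves well with few devices and degrades with many.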
hmm I'll get something setup here and see if this is the case.
>>
>> Hmm, well, that seems like something is broken. I'll try this setup when
>> I get some time in the next few days. This really shouldn't be the case: dm-multipath
>> should not add a bunch of extra latency or affect throughput significantly.
>> By the way, what are you seeing without mpio?
>
> And one more observation from my 2-year-old tests - reading a device (using
> dd) (rhel 5 update 1 kernel, ramdisk via iSCSI via loopback) as an mpath
> device with a single path ran at approximately 120-150 MB/s, and the same
> test on a non-mpath device at 800-900 MB/s. Here I am quite sure; it was a
> kind of revelation to me at the time.
>
Similarly I'll have a look. Thanks for the info.
>>
>> Thanks,
>> John
>>
>
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 16:24 ` Nicolas de Pesloüan
@ 2011-01-18 16:57 ` Oleg V. Ukhno
2011-01-18 20:24 ` Jay Vosburgh
1 sibling, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 16:57 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: John Fastabend, Jay Vosburgh, David S. Miller, netdev,
Sébastien Barré,
Christophe Paasch
On 01/18/2011 07:24 PM, Nicolas de Pesloüan wrote:
> Le 18/01/2011 16:28, Oleg V. Ukhno a écrit :
>> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>>> I remember a topology (described by Jay, for as far as I remember),
>>> where two hosts were connected through two distinct VLANs. In such
>>> topology:
>>> - it is possible to detect path failure using arp monitoring instead of
>>> miimon.
>>> - changing the destination MAC address of egress packets are not
>>> necessary, because egress path selection force ingress path selection
>>> due to the VLAN.
>>
>> In the case with two VLANs - yes, this shouldn't be necessary (but needs to
>> be tested, I am not sure), but within one - it is essential for correct
>> rx load striping.
>
> Changing the destination MAC address is definitely not required if you
> segregate each path in a distinct VLAN.
Yes, such an L2 network topology should provide the necessary high availability
and load striping without the need to change MAC addresses. But it is more
difficult to maintain and to understand, in my opinion (when there are
just several configurations like this, it's OK, but what when you have 50 or
more?) - this is why I've chosen 802.3ad.
> Even in the presence of an ISL between some switches, packets sent through
> host A's interface connected to vlan 100 will only enter host B through the
> interface connected to vlan 100. So every slave of the bonding
> interface can use the same MAC address.
>
> Of course, changing the destination address would be required in order
> to achieve ingress load balancing on a *single* LAN. But, as Jay noted
> at the beginning of this thread, this would violate 802.3ad.
>
I think receiving the same MAC address on different ports of the same host
will just make any troubleshooting much harder, won't it? With different
MACs it usually takes little time to find out where the problem is.
I think that implementing a choice between using a single MAC address in
the etherchannel and using the slaves' real MAC addresses won't harm
anything in either 802.3ad or balance-rr mode, but will simplify their
usage without doing any evil, when documented properly.
>
> You are right, but such a LAN setup needs to be carefully designed and
> built. I'm not sure that an automatic channel aggregation system is the
> right way to do it. Hence the reason why I suggest using balance-rr
> with VLANs.
>
>>> Oleg, would you mind trying the above "two VLAN" topology" with
>>> mode=balance-rr and report any results ? For high-availability purpose,
>>> it's obviously necessary to setup those VLAN on distinct switches.
>> I'll do it, but it will take some time to set up the test environment,
>> maybe several days.
>
> Thanks. For testing purpose, it is enough to setup those VLAN on a
> single switch if it is easier for you to do.
Well, I'll do it with 2 switches :)
>
>> You mean following topology:
>
> See above.
>
>> (I'm sure it will work as desired if each host is connected to each
>> switch with only one slave link; if there are more slaves in each switch
>> - unsure)?
>
> If you want to use more than 2 slaves per host, then you need more than
> 2 VLAN.
That's what I don't like in this solution. Within one LAN it is simpler
and requires less configuration effort.
> You also need to have the exact same number of slaves on all
> hosts, as egress path selection causes ingress path selection at the
> other side.
>
Well, and here's one difference from bonding with my patch. With
my patch applied, it is not required to have an equal number of slaves; it
is enough to have an *even* number of slaves, and this almost always (so far
I haven't seen the opposite) guarantees good rx (ingress) load striping.
>
> Nicolas.
>
--
Best regards,
Head of Commercial and Financial
Services Operations
Yandex LLC
Oleg Ukhno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 16:41 ` John Fastabend
@ 2011-01-18 17:21 ` Oleg V. Ukhno
0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 17:21 UTC (permalink / raw)
To: John Fastabend; +Cc: Jay Vosburgh, netdev, David S. Miller
On 01/18/2011 07:41 PM, John Fastabend wrote:
>>
>> John, what is you opinion on such load balancing method in general,
>> without referring to particular use cases?
>>
>
> This seems reasonable to me, but I'll defer to Jay on this. As long as the
> limitations are documented and it looks like they are this may be fine.
>
> Mostly I was interested to know what led you down this path and why MPIO
> was not working as, at least, I expected it should. When I get some time I'll
> see if we can address at least some of these issues. Even so, it seems like
> this bonding mode may still be useful for some use cases, perhaps even
> non-storage use cases.
>
>>
I was addressing several problems with my patch:
- I was unable to consume the whole bandwidth with multipath - with four
1Gbit "paths" it was slightly above 2 Gbit/s
- Link failures quite often caused disk failures, which led to Oracle
ASM rebalance, especially with versions below 11.
- It is not always possible to autogenerate multipathd.conf with
human-readable device names because of iscsi session id and scsi device
bus/channel/etc mismatch (usually it differs by 1, but not necessarily);
with the bonding solution I can just look into /dev/disk/by-path to find
out where a device, say /dev/sdab, is physically located (it's just a
free bonus I've got, so to say :)).
--
Best regards,
Head of Commercial and Financial
Services Operations
Yandex LLC
Oleg Ukhno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 15:28 ` Oleg V. Ukhno
2011-01-18 16:24 ` Nicolas de Pesloüan
@ 2011-01-18 17:56 ` Kirill Smelkov
1 sibling, 0 replies; 32+ messages in thread
From: Kirill Smelkov @ 2011-01-18 17:56 UTC (permalink / raw)
To: Oleg V. Ukhno
Cc: Nicolas de Pesloüan, John Fastabend, Jay Vosburgh,
David S. Miller, netdev, Sébastien Barré,
Christophe Paasch
On Tue, Jan 18, 2011 at 06:28:48PM +0300, Oleg V. Ukhno wrote:
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> Le 18/01/2011 13:40, Oleg V. Ukhno a écrit :
[...]
>> Oleg, would you mind trying the above "two VLAN" topology" with
>> mode=balance-rr and report any results ? For high-availability purpose,
>> it's obviously necessary to setup those VLAN on distinct switches.
> I'll do it, but it will take some time to set up the test environment,
> maybe several days.
> You mean the following topology:
>
>          switch 1
>         /        \
>   host A          host B
>         \        /
>          switch 2
>
FYI: I'm in the process of developing new redundancy mode for bonding,
and while at it, the following script is maybe useful for you too, so
that bonding testing can be done entirely on one host:
http://repo.or.cz/w/linux-2.6/kirr.git/blob/refs/heads/x/etherdup:/tools/bonding/mk-tap-loops.sh
Sorry for maybe being offtopic,
Kirill
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 16:24 ` Nicolas de Pesloüan
2011-01-18 16:57 ` Oleg V. Ukhno
@ 2011-01-18 20:24 ` Jay Vosburgh
2011-01-18 21:20 ` Nicolas de Pesloüan
` (2 more replies)
1 sibling, 3 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-18 20:24 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev,
Sébastien Barré,
Christophe Paasch
Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:
>Le 18/01/2011 16:28, Oleg V. Ukhno a écrit :
>> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>>> I remember a topology (described by Jay, for as far as I remember),
>>> where two hosts were connected through two distinct VLANs. In such
>>> topology:
>>> - it is possible to detect path failure using arp monitoring instead of
>>> miimon.
I don't think this is true, at least not for the case of
balance-rr. Using ARP monitoring with any sort of load balance scheme
is problematic, because the replies may be balanced to a different slave
than the sender.
>>> - changing the destination MAC address of egress packets are not
>>> necessary, because egress path selection force ingress path selection
>>> due to the VLAN.
This is true, with one comment: Oleg's proposal we're discussing
changes the source MAC address of outgoing packets, not the destination.
The purpose being to manipulate the src-mac balancing algorithm on the
switch when the packets are hashed at the egress port channel group.
The packets (for a particular destination) all bear the same destination
MAC, but (as I understand it) are manually assigned tailored source MAC
addresses that hash to sequential values.
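That manipulation can be made concrete with a toy model (hypothetical: real switch hashes vary by vendor, but src-mac policies commonly derive the member port from the low-order bits of the source MAC). Tailoring the last octet of each slave's source MAC is then enough to hit sequential egress ports:

```python
def port_index(src_mac: str, num_ports: int) -> int:
    # Toy src-mac hash: pick the channel-group member port from the
    # low-order bits of the source MAC (assumes num_ports is a power
    # of two, as in the patch's usage notes).
    last_octet = int(src_mac.split(":")[-1], 16)
    return last_octet & (num_ports - 1)

# Crafted per-slave source MACs whose last octets are 0..3 land on
# sequential ports of a 4-port channel group:
macs = ["02:00:00:00:00:%02x" % i for i in range(4)]
print([port_index(m, 4) for m in macs])  # [0, 1, 2, 3]
```

This is why the policy depends on the switch being configured for src-mac (or src-dst-mac) hashing: any other hash input would defeat the crafted addresses.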
>> In case with two VLANs - yes, this shouldn't be necessary(but needs to
>> be tested, I am not sure), but within one - it is essential for correct
>> rx load striping.
>
>Changing the destination MAC address is definitely not required if you
>segregate each path in a distinct VLAN.
>
> +-------------------+ +-------------------+
> +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
> | +-------------------+ +-------------------+ |
>+------+ | | +------+
>|host A| | | |host B|
>+------+ | | +------+
> | +-------------------+ +-------------------+ |
> +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+
> +-------------------+ +-------------------+
>
>Even in the presence of an ISL between some switches, packets sent through host
>A's interface connected to vlan 100 will only enter host B using the
>interface connected to vlan 100. So every slave of the bonding interface
>can use the same MAC address.
That's true. The big problem with the "VLAN tunnel" approach is
that it's not tolerant of link failures.
>Of course, changing the destination address would be required in order to
>achieve ingress load balancing on a *single* LAN. But, as Jay noted at the
>beginning of this thread, this would violate 802.3ad.
>
>>> I think the only point is whether we need a new xmit_hash_policy for
>>> mode=802.3ad or whether mode=balance-rr could be enough.
>> Maybe, but it seems fair enough to me not to restrict this feature only
>> to non-LACP aggregate links; dynamic aggregation may be useful (it sometimes
>> helps to avoid switch misconfiguration (misconfigured slaves on the switch
>> side) without loss of service).
>
>You are right, but such a LAN setup needs to be carefully designed and
>built. I'm not sure that an automatic channel aggregation system is the
>right way to do it. Hence the reason why I suggest using balance-rr with
>VLANs.
The "VLAN tunnel" approach is a derivative of an actual switch
topology that balance-rr was originally intended for, many moons ago.
This is described in the current bonding.txt; I'll cut & paste a bit
here:
12.2 Maximum Throughput in a Multiple Switch Topology
-----------------------------------------------------
Multiple switches may be utilized to optimize for throughput
when they are configured in parallel as part of an isolated network
between two or more systems, for example:
+-----------+
| Host A |
+-+---+---+-+
| | |
+--------+ | +---------+
| | |
+------+---+ +-----+----+ +-----+----+
| Switch A | | Switch B | | Switch C |
+------+---+ +-----+----+ +-----+----+
| | |
+--------+ | +---------+
| | |
+-+---+---+-+
| Host B |
+-----------+
In this configuration, the switches are isolated from one
another. One reason to employ a topology such as this is for an
isolated network with many hosts (a cluster configured for high
performance, for example), using multiple smaller switches can be more
cost effective than a single larger switch, e.g., on a network with 24
hosts, three 24 port switches can be significantly less expensive than
a single 72 port switch.
If access beyond the network is required, an individual host
can be equipped with an additional network device connected to an
external network; this host then additionally acts as a gateway.
[end of cut]
This was described to me some time ago as an early usage model
for balance-rr using multiple 10 Mb/sec switches. It has the same link
monitoring problems as the "VLAN tunnel" approach, although modern
switches with "trunk failover" type of functionality may be able to
mitigate the problem.
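For reference, the balance-rr mode under discussion can be configured through the bonding sysfs interface roughly as follows (a sketch: interface names and the address are examples, and the mode must be set before enslaving):

```shell
# Load bonding without auto-creating a bond, then build one by hand.
modprobe bonding max_bonds=0
echo +bond0 > /sys/class/net/bonding_masters
echo balance-rr > /sys/class/net/bond0/bonding/mode
echo 100 > /sys/class/net/bond0/bonding/miimon

# Slaves must be down when enslaved.
ip link set eth0 down
echo +eth0 > /sys/class/net/bond0/bonding/slaves
ip link set eth1 down
echo +eth1 > /sys/class/net/bond0/bonding/slaves

ip addr add 10.0.0.1/24 dev bond0
ip link set bond0 up
```

In the multi-switch topology above, eth0 and eth1 would each go to a different isolated switch (or VLAN).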
>>> Oleg, would you mind trying the above "two VLAN" topology" with
>>> mode=balance-rr and report any results ? For high-availability purpose,
>>> it's obviously necessary to setup those VLAN on distinct switches.
>> I'll do it, but it will take some time to set up the test environment,
>> maybe several days.
>
>Thanks. For testing purpose, it is enough to setup those VLAN on a single
>switch if it is easier for you to do.
>
>> You mean following topology:
>
>See above.
>
>> (I'm sure it will work as desired if each host is connected to each
>> switch with only one slave link; if there are more slaves in each switch
>> - unsure)?
>
>If you want to use more than 2 slaves per host, then you need more than 2
>VLAN. You also need to have the exact same number of slaves on all hosts,
>as egress path selection causes ingress path selection at the other side.
>
> +-------------------+ +-------------------+
> +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
> | +-------------------+ +-------------------+ |
>+------+ | | +------+
>|host A| | | |host B|
>+------+ | | +------+
> | | +-------------------+ +-------------------+ | |
> | +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+ |
> | +-------------------+ +-------------------+ |
> | | | |
> | | | |
> | +-------------------+ +-------------------+ |
> +---------|switch 5 - vlan 300|-----|switch 6 - vlan 300|---------+
> +-------------------+ +-------------------+
>
>Of course, you can add others host to vlan 100, 200 and 300, with the
>exact same configuration at host A or host B.
This is essentially the same thing as the diagram I pasted in up
above, except with VLANs and an additional layer of switches between the
hosts. The multiple VLANs take the place of multiple discrete switches.
This could also be accomplished via bridge groups (in
Cisco-speak). For example, instead of VLAN 100, that could be bridge
group X, VLAN 200 is bridge group Y, and so on.
Neither the VLAN nor the bridge group methods handle link
failures very well; if, in the above diagram, the link from "switch 2
vlan 100" to "host B" fails, there's no way for host A to know to stop
sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
to "host B."
One item I'd like to see some more data on is the level of
reordering at the receiver in Oleg's system.
One of the reasons round robin isn't as useful as it once was is
due to the rise of NAPI and interrupt coalescing, both of which will
tend to increase the reordering of packets at the receiver when the
packets are evenly striped. In the old days, it was one interrupt, one
packet. Now, it's one interrupt or NAPI poll, many packets. With the
packets striped across interfaces, this will tend to increase
reordering. E.g.,
slave 1 slave 2 slave 3
Packet 1 P2 P3
P4 P5 P6
P7 P8 P9
and so on. A poll of slave 1 will get packets 1, 4 and 7 (and
probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
I haven't done much testing with this lately, but I suspect this
behavior hasn't really changed. Raising the tcp_reordering sysctl value
can mitigate this somewhat (by making TCP more tolerant of this), but
that doesn't help non-TCP protocols.
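The sysctl knob mentioned above can be raised like this (the value is only illustrative; an appropriate number depends on the slave count and receive batching):

```shell
# Default is 3; raise it so striping-induced reordering is treated
# as reordering rather than loss (triggering fast retransmit).
sysctl -w net.ipv4.tcp_reordering=127
```

Note this tunes every TCP connection on the host, not just traffic over the bond.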
Barring evidence to the contrary, I presume that Oleg's system
delivers out of order at the receiver. That's not automatically a
reason to reject it, but this entire proposal is sufficiently complex to
configure that very explicit documentation will be necessary.
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 20:24 ` Jay Vosburgh
@ 2011-01-18 21:20 ` Nicolas de Pesloüan
2011-01-19 1:45 ` Jay Vosburgh
2011-01-18 22:22 ` Oleg V. Ukhno
2011-01-19 16:13 ` Oleg V. Ukhno
2 siblings, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-18 21:20 UTC (permalink / raw)
To: Jay Vosburgh
Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev,
Sébastien Barré,
Christophe Paasch
Le 18/01/2011 21:24, Jay Vosburgh a écrit :
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com> wrote:
>>>> - it is possible to detect path failure using arp monitoring instead of
>>>> miimon.
>
> I don't think this is true, at least not for the case of
> balance-rr. Using ARP monitoring with any sort of load balance scheme
> is problematic, because the replies may be balanced to a different slave
> than the sender.
Can't we achieve the expected arp monitoring by using the exact same artifice that Oleg suggested:
using a different source MAC per slave for arp monitoring, so that the return path matches the sending path?
>>>> - changing the destination MAC address of egress packets are not
>>>> necessary, because egress path selection force ingress path selection
>>>> due to the VLAN.
>
> This is true, with one comment: Oleg's proposal we're discussing
> changes the source MAC address of outgoing packets, not the destination.
> The purpose being to manipulate the src-mac balancing algorithm on the
> switch when the packets are hashed at the egress port channel group.
> The packets (for a particular destination) all bear the same destination
> MAC, but (as I understand it) are manually assigned tailored source MAC
> addresses that hash to sequential values.
Yes, you're right.
> That's true. The big problem with the "VLAN tunnel" approach is
> that it's not tolerant of link failures.
Yes, except if we find a way to make arp monitoring reliable in load balancing situation.
[snip]
> This is essentially the same thing as the diagram I pasted in up
> above, except with VLANs and an additional layer of switches between the
> hosts. The multiple VLANs take the place of multiple discrete switches.
>
> This could also be accomplished via bridge groups (in
> Cisco-speak). For example, instead of VLAN 100, that could be bridge
> group X, VLAN 200 is bridge group Y, and so on.
>
> Neither the VLAN nor the bridge group methods handle link
> failures very well; if, in the above diagram, the link from "switch 2
> vlan 100" to "host B" fails, there's no way for host A to know to stop
> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
> to "host B."
Can't we imagine "arp monitoring" the destination MAC address of host B on both paths? That way,
host A would know that a given path is down, because the return path would be the same. The target host
should send the reply on the slave on which it received the request, which is the normal way to reply
to an arp request.
> One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.
This is exactly the reason why I asked Oleg to do some tests with balance-rr. I cannot find a good
reason for a possible new xmit_hash_policy to provide better throughput than the current balance-rr. If
the throughput increases by, let's say, less than 20%, whatever the tcp_reordering value, then it is
probably a dead-end way.
> One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped. In the old days, it was one interrupt, one
> packet. Now, it's one interrupt or NAPI poll, many packets. With the
> packets striped across interfaces, this will tend to increase
> reordering. E.g.,
>
> slave 1 slave 2 slave 3
> Packet 1 P2 P3
> P4 P5 P6
> P7 P8 P9
>
> and so on. A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 and P7, P8, P9 on slave 3, possibly
by sending grouped packets, changing the sending slave every N packets instead of every packet? I
think we already discussed this possibility a few months or years ago on the bonding-devel ML. As
far as I remember, the idea was not developed because it was not easy to find the number of packets
to send through the same slave. Anyway, this might help reduce out-of-order delivery.
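The "every N packets" idea amounts to a batched round-robin transmit selector, sketched below (a toy model; in bonding the counter would live per bond, and picking the batch size N is exactly the open question above):

```python
class BatchedRoundRobin:
    """Pick a transmit slave, switching every `batch` packets instead
    of every packet, to trade a little balancing granularity for less
    out-of-order delivery at the receiver."""

    def __init__(self, num_slaves: int, batch: int):
        self.num_slaves, self.batch, self.count = num_slaves, batch, 0

    def next_slave(self) -> int:
        slave = (self.count // self.batch) % self.num_slaves
        self.count += 1
        return slave

rr = BatchedRoundRobin(num_slaves=3, batch=3)
print([rr.next_slave() for _ in range(9)])  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With batch=1 this degenerates to plain balance-rr; a batch comparable to the receiver's NAPI poll budget would make each poll drain mostly in-order packets.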
> Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver. That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.
Yes, and this is already true for some bonding modes and in particular for balance-rr.
Nicolas.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 20:24 ` Jay Vosburgh
2011-01-18 21:20 ` Nicolas de Pesloüan
@ 2011-01-18 22:22 ` Oleg V. Ukhno
2011-01-19 16:13 ` Oleg V. Ukhno
2 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-18 22:22 UTC (permalink / raw)
To: Jay Vosburgh
Cc: Nicolas de Pesloüan, John Fastabend, David S. Miller,
netdev, Sébastien Barré,
Christophe Paasch
Jay Vosburgh wrote:
>
> One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.
>
> One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped. In the old days, it was one interrupt, one
> packet. Now, it's one interrupt or NAPI poll, many packets. With the
> packets striped across interfaces, this will tend to increase
> reordering. E.g.,
>
> slave 1 slave 2 slave 3
> Packet 1 P2 P3
> P4 P5 P6
> P7 P8 P9
>
> and so on. A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
>
> I haven't done much testing with this lately, but I suspect this
> behavior hasn't really changed. Raising the tcp_reordering sysctl value
> can mitigate this somewhat (by making TCP more tolerant of this), but
> that doesn't help non-TCP protocols.
>
> Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver. That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.
Jay, here are some network stats from one of my iSCSI targets with an avg
load of 1.5-2.5 Gbit/s (4 slaves in the etherchannel). Not perfect and not
very "clean" (there are more interfaces on the host than these 4):
[root@<somehost> ~]# netstat -st
IcmpMsg:
InType0: 6
InType3: 1872
InType8: 60557
InType11: 23
OutType0: 60528
OutType3: 1755
OutType8: 6
Tcp:
1298909 active connections openings
61090 passive connection openings
2374 failed connection attempts
62781 connection resets received
3 connections established
1268233942 segments received
1198020318 segments send out
18939618 segments retransmited
0 bad segments received.
23643 resets sent
TcpExt:
294935 TCP sockets finished time wait in fast timer
472 time wait sockets recycled by time stamp
819481 delayed acks sent
295332 delayed acks further delayed because of locked socket
Quick ack mode was activated 30616377 times
3516920 packets directly queued to recvmsg prequeue.
4353 packets directly received from backlog
44873453 packets directly received from prequeue
1442812750 packets header predicted
1077442 packets header predicted and directly queued to user
2123453975 acknowledgments not containing data received
2375328274 predicted acknowledgments
8462439 times recovered from packet loss due to fast retransmit
Detected reordering 19203 times using reno fast retransmit
Detected reordering 100 times using time stamp
3429 congestion windows fully recovered
11760 congestion windows partially recovered using Hoe heuristic
398 congestion windows recovered after partial ack
0 TCP data loss events
3671 timeouts after reno fast retransmit
6 timeouts in loss state
18919118 fast retransmits
11637 retransmits in slow start
1756 other TCP timeouts
TCPRenoRecoveryFail: 3187
62779 connections reset due to early user close
IpExt:
InBcastPkts: 512616
[root@<somehost> ~]# uptime
00:35:49 up 42 days, 8:27, 1 user, load average: 3.70, 3.80, 4.07
[root@<somehost> ~]# sysctl -a|grep tcp_reo
net.ipv4.tcp_reordering = 3
I will get back with "clean" results after I set up the test system tomorrow.
The TcpExt stats from other hosts are similar.
>
> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>
--
Best regards,
Oleg Ukhno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 21:20 ` Nicolas de Pesloüan
@ 2011-01-19 1:45 ` Jay Vosburgh
0 siblings, 0 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-19 1:45 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: Oleg V. Ukhno, John Fastabend, David S. Miller, netdev,
Sébastien Barré,
Christophe Paasch
Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:
>Le 18/01/2011 21:24, Jay Vosburgh a écrit :
>> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com> wrote:
>
>>>>> - it is possible to detect path failure using arp monitoring instead of
>>>>> miimon.
>>
>> I don't think this is true, at least not for the case of
>> balance-rr. Using ARP monitoring with any sort of load balance scheme
>> is problematic, because the replies may be balanced to a different slave
>> than the sender.
>
>Can't we achieve the expected arp monitoring by using the exact same
>artifice that Oleg suggested: using a different source MAC per slave for
>arp monitoring, so that the return path matches the sending path?
It's not as simple with ARP, because it's a control protocol
that has side effects.
First, the MAC level broadcast ARP probes from bonding would
have to be round robined in such a manner that they regularly arrive at
every possible slave. A single broadcast won't be sent to more than one
member of the channel group by the switch. We can't do multiple unicast
ARPs with different destination MAC addresses, because we'd have to
track all of those MACs somewhere (keep track of the MAC of every slave
on each peer we're monitoring). I suspect that snooping switches will
get all whiny about port flapping and the like.
We could have a separate IP address per slave, used only for
link monitoring, but that's a huge headache. Actually, it's a lot like
the multi-link stuff I've been working on (and posted RFC of in
December), but that doesn't use ARP (it segregates slaves by IP subnet,
and balances at the IP layer). Basically, you need an overlaying active
protocol to handle the map of which slave goes where (which multi-link
has).
So, maybe we have the ARP replies massaged such that the
Ethernet header source and ARP target hardware address don't match.
So the probes from bonding currently look like this:
MAC-A > ff:ff:ff:ff:ff:ff Request who-has 10.0.4.2 tell 10.0.1.1
Where MAC-A is the bond's MAC address. And the replies now look
like this:
MAC-B > MAC-A, Reply 10.0.4.2 is-at MAC-B
Where MAC-B is the MAC of the peer's bond. The massaged replies
would be of the form:
MAC-C > MAC-A, Reply 10.0.4.2 is-at MAC-B
where MAC-C is the slave "permanent" address (which is really a
fake address to manipulate the switch's hash), and MAC-B is whatever the
real MAC of the bond is. I don't think we can mess with MAC-B in the
reply (the "is-at" part), because that would update ARP tables and such.
If we change MAC-A in the reply, they're liable to be filtered out. I
really don't know if putting MAC-C in there as the source would confuse
snooping switches or not.
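The field layout of that "massaged" reply can be spelled out explicitly (purely illustrative; MAC-A/B/C follow the naming above, and this is just a record of the frame fields, not packet-building code):

```python
from dataclasses import dataclass

@dataclass
class MassagedArpReply:
    eth_src: str    # MAC-C: the slave's hash-tailored "permanent"
                    # address, fed into the switch's src-mac hash
    eth_dst: str    # MAC-A: the prober's bond address
    sender_ip: str
    sender_hw: str  # MAC-B: the bond's real MAC, so the peer's ARP
                    # table is not polluted with per-slave addresses

reply = MassagedArpReply(eth_src="MAC-C", eth_dst="MAC-A",
                         sender_ip="10.0.4.2", sender_hw="MAC-B")
# The deliberate Ethernet-source / ARP-sender mismatch is the point:
print(reply.eth_src != reply.sender_hw)  # True
```

Whether such a mismatched frame survives snooping switches unfiltered is exactly the open question raised above.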
One other thought I had while chewing on this is to run the LACP
protocol exchange between the bonding peers directly, instead of between
each bond and each switch. I have no idea if this would work or not,
but the theory would look something like the "VLAN tunnel" topology for
the switches, but the bonds at the ends are configured for 802.3ad. To
make this work, bonding would have to be able to run multiple LACP
instances (one for each bonding peer on the network) over a single
aggregator (or permit slaves to belong to multiple active aggregators).
This would basically be the same as the multi-link business, except
using LACP for the active protocol to build the map.
A distinguished correspondent (who may confess if he so chooses)
also suggested 802.2 LLC XID or TEST frames, which have been discussed
in the past. Those don't have side effects, but I'm not sure if either
is technically feasible, or if we really want bonding to have a
dependency on llc. They would also only interop with hosts that respond
to the XID or TEST. I haven't thought about this in detail for a number
of years, but I think the LLC DSAP / SSAP space is pretty small.
>>>>> - changing the destination MAC address of egress packets are not
>>>>> necessary, because egress path selection force ingress path selection
>>>>> due to the VLAN.
>>
>> This is true, with one comment: Oleg's proposal we're discussing
>> changes the source MAC address of outgoing packets, not the destination.
>> The purpose being to manipulate the src-mac balancing algorithm on the
>> switch when the packets are hashed at the egress port channel group.
>> The packets (for a particular destination) all bear the same destination
>> MAC, but (as I understand it) are manually assigned tailored source MAC
>> addresses that hash to sequential values.
>
>Yes, you're right.
>
>> That's true. The big problem with the "VLAN tunnel" approach is
>> that it's not tolerant of link failures.
>
>Yes, unless we find a way to make arp monitoring reliable in a load balancing situation.
>
>[snip]
>
>> This is essentially the same thing as the diagram I pasted in up
>> above, except with VLANs and an additional layer of switches between the
>> hosts. The multiple VLANs take the place of multiple discrete switches.
>>
>> This could also be accomplished via bridge groups (in
>> Cisco-speak). For example, instead of VLAN 100, that could be bridge
>> group X, VLAN 200 is bridge group Y, and so on.
>>
>> Neither the VLAN nor the bridge group methods handle link
>> failures very well; if, in the above diagram, the link from "switch 2
>> vlan 100" to "host B" fails, there's no way for host A to know to stop
>> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
>> to "host B."
>
>Couldn't we "arp monitor" the destination MAC address of host B,
>on both paths? That way, host A would know that a given path is down,
>because the return path would be the same. The target host should send the
>reply on the slave on which it received the request, which is the normal
>way to reply to an arp request.
I think you can only get away with this if each slave set (where
a "set" is one slave from each bond that's attending our little load
balancing party) is on a separate switch domain, and the switch domains
are not bridged together. Otherwise the switches will flap their MAC
tables as they update from each probe that they see.
As for the reply going out the same slave, to do that, bonding
would have to intercept the ARP traffic (because ARPs arriving on slaves
are normally assigned to the bond itself, not the slave) and track and
tweak them.
Lastly, bonding would again have to maintain a map, showing
which destinations are reachable via which set of slaves. All peer
systems (needing to have per-slave link monitoring) would have to be ARP
targets.
>> One item I'd like to see some more data on is the level of
>> reordering at the receiver in Oleg's system.
>
>This is exactly the reason why I asked Oleg to do some tests with
>balance-rr. I cannot find a good reason for a possibly new
>xmit_hash_policy to provide better throughput than the current balance-rr. If
>the throughput increases by, let's say, less than 20%, whatever the
>tcp_reordering value, then it is probably a dead end.
Well, the point of making a round robin xmit_hash_policy isn't
that the throughput will be better than the existing round robin, it's
to make round-robin accessible to the 802.3ad mode.
>> One of the reasons round robin isn't as useful as it once was is
>> due to the rise of NAPI and interrupt coalescing, both of which will
>> tend to increase the reordering of packets at the receiver when the
>> packets are evenly striped. In the old days, it was one interrupt, one
>> packet. Now, it's one interrupt or NAPI poll, many packets. With the
>> packets striped across interfaces, this will tend to increase
>> reordering. E.g.,
>>
>> slave 1 slave 2 slave 3
>> Packet 1 P2 P3
>> P4 P5 P6
>> P7 P8 P9
>>
>> and so on. A poll of slave 1 will get packets 1, 4 and 7 (and
>> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
>
>Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 and P7,
>P8, P9 on slave 3, possibly by sending grouped packets, changing the
>sending slave every N packets instead of every packet? I think we already
>discussed this possibility a few months or years ago on the bonding-devel
>ML. As far as I remember, the idea was not developed because it was
>not easy to find the right number of packets to send through the same
>slave. Anyway, this might help reduce out of order delivery.
Yes, this came up several years ago, and, basically, there's no
way to do it perfectly. An interesting experiment would be to see if
sending groups (perhaps close to the NAPI weight of the receiver) would
reduce reordering.
>> Barring evidence to the contrary, I presume that Oleg's system
>> delivers out of order at the receiver. That's not automatically a
>> reason to reject it, but this entire proposal is sufficiently complex to
>> configure that very explicit documentation will be necessary.
>
>Yes, and this is already true for some bonding modes and in particular for balance-rr.
I don't think any modes other than balance-rr will deliver out
of order normally. It can happen during edge cases, e.g., alb
rebalance, or the layer3+4 hash with IP fragments, but I'd expect those
to be at a much lower rate than what round robin causes.
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-18 20:24 ` Jay Vosburgh
2011-01-18 21:20 ` Nicolas de Pesloüan
2011-01-18 22:22 ` Oleg V. Ukhno
@ 2011-01-19 16:13 ` Oleg V. Ukhno
2011-01-19 20:12 ` Nicolas de Pesloüan
2 siblings, 1 reply; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-19 16:13 UTC (permalink / raw)
To: Jay Vosburgh
Cc: Nicolas de Pesloüan, John Fastabend, David S. Miller,
netdev, Sébastien Barré,
Christophe Paasch
On 01/18/2011 11:24 PM, Jay Vosburgh wrote:
> I haven't done much testing with this lately, but I suspect this
> behavior hasn't really changed. Raising the tcp_reordering sysctl value
> can mitigate this somewhat (by making TCP more tolerant of this), but
> that doesn't help non-TCP protocols.
>
> Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver. That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.
>
> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>
Jay,
I have run some tests with the patched 802.3ad bonding so far.
Test system configuration:
2 identical servers with Intel 82576-based Gigabit ET2 Quad Port Server
Adapters (low profile, PCI-E, igb driver), connected to one switch
(Cisco 2960) with all 4 ports; all ports on each host aggregated into a
single etherchannel using 802.3ad (with the patch).
kernel version: vanilla 2.6.32 (tcp_reordering: default setting)
igb version: 2.3.4, parameters: default
Ran two tests:
1) unidirectional test using iperf
2) Bidirectional test, iperf client is running with 8 threads
One remark:
Decreasing the number of slaves yields higher active slave utilization;
for example, with 2 slaves the iperf test consumes almost the full
bandwidth available in both directions (test parameters are the same,
test time reduced to 150 sec):
[SUM] 0.0-150.3 sec 34640 MBytes 1933 Mbits/sec
[SUM] 0.0-150.5 sec 34875 MBytes 1944 Mbits/sec
For me (my use case) the risk of some bandwidth loss with 4 slaves is
acceptable, but my feeling is that building an aggregate link with more
than 4 slaves is inadequate. With 2 slaves this solution should work with
minimal overhead of any kind. The TCP reordering and retransmit numbers
are, in my opinion, acceptable for most use cases I can imagine for such
a bonding mode.
What is your opinion of my idea and the patch?
I will come back with results for the VLAN tunneling case if necessary
(Nicolas, shall I do that test? I think it will show similar performance
results).
Below are the test results (sorry for the huge amount of text):
Iperf results:
Test 1:
Receiver:
[root@target2 ~]# iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p
9999 -t 300
------------------------------------------------------------
Client connecting to 192.168.111.128, TCP port 9999
Binding to local address 192.168.111.129
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
[ 3] local 192.168.111.129 port 9999 connected with 192.168.111.128
port 9999
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-300.0 sec 141643 MBytes 3961 Mbits/sec
Sender:
[root@target1 ~]# iperf -f m -s -B 192.168.111.128 -p 9999 -t 300
------------------------------------------------------------
Server listening on TCP port 9999
Binding to local address 192.168.111.128
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
[ 4] local 192.168.111.128 port 9999 connected with 192.168.111.129
port 9999
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-300.1 sec 141643 MBytes 3959 Mbits/sec
^C[root@target1 ~]#
Test 2:
former "sender" side:
[SUM] 0.0-300.2 sec 111541 MBytes 3117 Mbits/sec
[SUM] 0.0-300.4 sec 110515 MBytes 3086 Mbits/sec
former "receiver" side:
[SUM] 0.0-300.1 sec 110515 MBytes 3089 Mbits/sec
[SUM] 0.0-300.3 sec 111541 MBytes 3116 Mbits/sec
Netstat's:
netstat -st (sender, before 1st test)
[root@target1 ~]# netstat -st
IcmpMsg:
InType3: 5
InType8: 3
OutType0: 3
OutType3: 4
Tcp:
26 active connections openings
7 passive connection openings
5 failed connection attempts
1 connection resets received
4 connections established
349 segments received
330 segments send out
7 segments retransmited
0 bad segments received.
5 resets sent
UdpLite:
TcpExt:
10 TCP sockets finished time wait in slow timer
8 delayed acks sent
56 packets directly queued to recvmsg prequeue.
40 packets directly received from backlog
317 packets directly received from prequeue
78 packets header predicted
36 packets header predicted and directly queued to user
41 acknowledgments not containing data received
134 predicted acknowledgments
0 TCP data loss events
4 other TCP timeouts
2 connections reset due to unexpected data
TCPSackShiftFallback: 1
IpExt:
InMcastPkts: 74
OutMcastPkts: 62
InOctets: 76001
OutOctets: 82234
InMcastOctets: 13074
OutMcastOctets: 10428
netstat -st (sender, after 1st test)
[root@target1 ~]netstat -st
IcmpMsg:
InType3: 5
InType8: 7
OutType0: 7
OutType3: 4
Tcp:
71 active connections openings
15 passive connection openings
5 failed connection attempts
4 connection resets received
4 connections established
16674161 segments received
16674113 segments send out
7 segments retransmited
0 bad segments received.
5 resets sent
UdpLite:
TcpExt:
31 TCP sockets finished time wait in slow timer
13 delayed acks sent
42 delayed acks further delayed because of locked socket
Quick ack mode was activated 297 times
239 packets directly queued to recvmsg prequeue.
2388220516 packets directly received from backlog
595165 packets directly received from prequeue
16954 packets header predicted
445 packets header predicted and directly queued to user
129 acknowledgments not containing data received
322 predicted acknowledgments
0 TCP data loss events
4 other TCP timeouts
297 DSACKs sent for old packets
2 connections reset due to unexpected data
TCPSackShiftFallback: 1
IpExt:
InMcastPkts: 86
OutMcastPkts: 68
InBcastPkts: 2
InOctets: -930738047
OutOctets: 1321936884
InMcastOctets: 13434
OutMcastOctets: 10620
InBcastOctets: 483
netstat -st (receiver, before 1st test)
[root@target2 ~]# netstat -st
IcmpMsg:
InType3: 5
InType8: 3
OutType0: 3
OutType3: 4
Tcp:
23 active connections openings
6 passive connection openings
3 failed connection attempts
1 connection resets received
3 connections established
309 segments received
264 segments send out
7 segments retransmited
0 bad segments received.
6 resets sent
UdpLite:
TcpExt:
10 TCP sockets finished time wait in slow timer
5 delayed acks sent
74 packets directly queued to recvmsg prequeue.
16 packets directly received from backlog
377 packets directly received from prequeue
62 packets header predicted
35 packets header predicted and directly queued to user
32 acknowledgments not containing data received
106 predicted acknowledgments
0 TCP data loss events
4 other TCP timeouts
1 connections reset due to early user close
IpExt:
InMcastPkts: 75
OutMcastPkts: 62
InOctets: 64952
OutOctets: 66396
InMcastOctets: 13428
OutMcastOctets: 10403
netstat -st (receiver, after 1st test)
[root@target2 ~]# netstat -st
IcmpMsg:
InType3: 5
InType8: 8
OutType0: 8
OutType3: 4
Tcp:
70 active connections openings
14 passive connection openings
3 failed connection attempts
4 connection resets received
4 connections established
16674253 segments received
16673801 segments send out
487 segments retransmited
0 bad segments received.
6 resets sent
UdpLite:
TcpExt:
32 TCP sockets finished time wait in slow timer
15 delayed acks sent
228 packets directly queued to recvmsg prequeue.
24 packets directly received from backlog
1081 packets directly received from prequeue
146 packets header predicted
124 packets header predicted and directly queued to user
10913589 acknowledgments not containing data received
573 predicted acknowledgments
185 times recovered from packet loss due to SACK data
Detected reordering 1 times using FACK
Detected reordering 8 times using SACK
Detected reordering 2 times using time stamp
1 congestion windows fully recovered
23 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 1
0 TCP data loss events
471 fast retransmits
9 forward retransmits
4 other TCP timeouts
297 DSACKs received
1 connections reset due to early user close
TCPDSACKIgnoredOld: 258
TCPDSACKIgnoredNoUndo: 39
TCPSackShiftFallback: 35790574
IpExt:
InMcastPkts: 89
OutMcastPkts: 69
InBcastPkts: 2
InOctets: 1321825004
OutOctets: -928982419
InMcastOctets: 13848
OutMcastOctets: 10627
InBcastOctets: 483
Second test:
former "sender" side:
[root@target1 ~]# netstat -st
IcmpMsg:
InType3: 5
InType8: 13
OutType0: 13
OutType3: 4
Tcp:
556 active connections openings
65 passive connection openings
391 failed connection attempts
15 connection resets received
4 connections established
52164640 segments received
52117884 segments send out
62522 segments retransmited
0 bad segments received.
33 resets sent
UdpLite:
TcpExt:
27 invalid SYN cookies received
74 TCP sockets finished time wait in slow timer
698540 packets rejects in established connections because of timestamp
51 delayed acks sent
487 delayed acks further delayed because of locked socket
Quick ack mode was activated 18838 times
7 times the listen queue of a socket overflowed
7 SYNs to LISTEN sockets ignored
1632 packets directly queued to recvmsg prequeue.
4137769996 packets directly received from backlog
5723253 packets directly received from prequeue
1365131 packets header predicted
136330 packets header predicted and directly queued to user
10241415 acknowledgments not containing data received
156502 predicted acknowledgments
10983 times recovered from packet loss due to SACK data
Detected reordering 4 times using FACK
Detected reordering 10095 times using SACK
Detected reordering 138 times using time stamp
2107 congestion windows fully recovered
18612 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 80
5 congestion windows recovered after partial ack
0 TCP data loss events
52 timeouts after SACK recovery
2 timeouts in loss state
61206 fast retransmits
7 forward retransmits
984 retransmits in slow start
8 other TCP timeouts
258 sack retransmits failed
18838 DSACKs sent for old packets
274 DSACKs sent for out of order packets
14169 DSACKs received
34 DSACKs for out of order packets received
2 connections reset due to unexpected data
TCPDSACKIgnoredOld: 8694
TCPDSACKIgnoredNoUndo: 5482
TCPSackShiftFallback: 18352494
IpExt:
InMcastPkts: 104
OutMcastPkts: 77
InBcastPkts: 6
InOctets: -474718903
OutOctets: 1280495238
InMcastOctets: 13974
OutMcastOctets: 10908
InBcastOctets: 1449
former "receiver" side:
[root@target2 ~]# netstat -st
IcmpMsg:
InType3: 5
InType8: 14
OutType0: 14
OutType3: 4
Tcp:
182 active connections openings
39 passive connection openings
4 failed connection attempts
12 connection resets received
4 connections established
52098089 segments received
52180386 segments send out
68994 segments retransmited
0 bad segments received.
1070 resets sent
UdpLite:
TcpExt:
12 TCP sockets finished time wait in fast timer
102 TCP sockets finished time wait in slow timer
770084 packets rejects in established connections because of timestamp
37 delayed acks sent
261 delayed acks further delayed because of locked socket
Quick ack mode was activated 14276 times
1466 packets directly queued to recvmsg prequeue.
1190723332 packets directly received from backlog
4781569 packets directly received from prequeue
776470 packets header predicted
97281 packets header predicted and directly queued to user
24979561 acknowledgments not containing data received
484206 predicted acknowledgments
11461 times recovered from packet loss due to SACK data
Detected reordering 15 times using FACK
Detected reordering 15520 times using SACK
Detected reordering 208 times using time stamp
2046 congestion windows fully recovered
18402 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 82
13 congestion windows recovered after partial ack
0 TCP data loss events
49 timeouts after SACK recovery
1 timeouts in loss state
62078 fast retransmits
5340 forward retransmits
1181 retransmits in slow start
20 other TCP timeouts
322 sack retransmits failed
14276 DSACKs sent for old packets
36 DSACKs sent for out of order packets
17940 DSACKs received
254 DSACKs for out of order packets received
4 connections reset due to early user close
TCPDSACKIgnoredOld: 12703
TCPDSACKIgnoredNoUndo: 5251
TCPSackShiftFallback: 57141117
IpExt:
InMcastPkts: 104
OutMcastPkts: 76
InBcastPkts: 6
InOctets: 902997645
OutOctets: -82887048
InMcastOctets: 14296
OutMcastOctets: 10851
InBcastOctets: 1449
[root@target2 ~]#
--
Best regards,
Oleg Ukhno.
ITO Team Lead,
Yandex LLC.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-19 16:13 ` Oleg V. Ukhno
@ 2011-01-19 20:12 ` Nicolas de Pesloüan
2011-01-21 13:55 ` Oleg V. Ukhno
0 siblings, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-19 20:12 UTC (permalink / raw)
To: Oleg V. Ukhno
Cc: Jay Vosburgh, John Fastabend, David S. Miller, netdev,
Sébastien Barré,
Christophe Paasch
On 19/01/2011 17:13, Oleg V. Ukhno wrote:
> On 01/18/2011 11:24 PM, Jay Vosburgh wrote:
[snip]
>> I haven't done much testing with this lately, but I suspect this
>> behavior hasn't really changed. Raising the tcp_reordering sysctl value
>> can mitigate this somewhat (by making TCP more tolerant of this), but
>> that doesn't help non-TCP protocols.
>>
>> Barring evidence to the contrary, I presume that Oleg's system
>> delivers out of order at the receiver. That's not automatically a
>> reason to reject it, but this entire proposal is sufficiently complex to
>> configure that very explicit documentation will be necessary.
>>
>> -J
>>
>> ---
>> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>>
>
> Jay,
[snip]
>
> What is your opinion on my idea with patch?
>
> I will come back with results for VLAN tunneling case, if this is
> necessary (Nicolas, shall I do that test - I think it will show similar
> results for performance?)
If you have time for that, then yes, please, do the same test using balance-rr+vlan to segregate
paths. With those results, we would have the opportunity to enhance the documentation with some well
tested cases of TCP load balancing on a LAN, not limited to the 802.3ad automatic setup. Both setups
make sense, and assuming the results would be similar is probably true, but not reliable enough to
assert it in the documentation.
Thanks,
Nicolas.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-19 20:12 ` Nicolas de Pesloüan
@ 2011-01-21 13:55 ` Oleg V. Ukhno
2011-01-22 12:48 ` Nicolas de Pesloüan
2011-01-29 2:28 ` Jay Vosburgh
0 siblings, 2 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-21 13:55 UTC (permalink / raw)
To: Nicolas de Pesloüan; +Cc: Jay Vosburgh, John Fastabend, netdev
On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:
> If you have time for that, then yes, please, do the same test using
> balance-rr+vlan to segregate paths. With those results, we would have
> the opportunity to enhance the documentation with some well tested cases
> of TCP load balancing on a LAN, not limited to the 802.3ad automatic setup.
> Both setups make sense, and assuming the results would be similar is
> probably true, but not reliable enough to assert it in the documentation.
>
> Thanks,
>
> Nicolas.
>
Nicolas,
I've run similar tests for the VLAN tunneling scenario. The results are
identical, as I expected. The only significant difference is link failure
handling: 802.3ad mode allows almost painless load redistribution, while
balance-rr causes packet loss.
The only remaining question for me is whether my patch could be applied
upstream; fixing the issues with adapting it to the net-next code is not
a problem, if nobody objects.
There were 2 tests:
1) unidirectional test
2) bidirectional test
Below are results:
Iperf results:
test 1:
iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300
------------------------------------------------------------
Client connecting to 192.168.111.128, TCP port 9999
Binding to local address 192.168.111.129
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
[ 3] local 192.168.111.129 port 9999 connected with 192.168.111.128
port 9999
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-300.0 sec 141637 MBytes 3960 Mbits/sec
test 2:
iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300
--dualtest -P 4
------------------------------------------------------------
Server listening on TCP port 9999
Binding to local address 192.168.111.129
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
...
[SUM] 0.0-300.2 sec 111334 MBytes 3111 Mbits/sec
[SUM] 0.0-300.4 sec 109582 MBytes 3060 Mbits/sec
TCP stats:
receiver side, before test 1:
[root@target1 ~]# netstat -st
IcmpMsg:
InType0: 4
InType3: 6
InType8: 2
OutType0: 2
OutType3: 6
OutType8: 4
Tcp:
4 active connections openings
2 passive connection openings
3 failed connection attempts
0 connection resets received
3 connections established
10252 segments received
29766 segments send out
2 segments retransmited
0 bad segments received.
0 resets sent
UdpLite:
TcpExt:
3 delayed acks sent
613 packets directly queued to recvmsg prequeue.
16 packets directly received from backlog
1760 packets directly received from prequeue
428 packets header predicted
10 packets header predicted and directly queued to user
9295 acknowledgments not containing data received
265 predicted acknowledgments
0 TCP data loss events
1 other TCP timeouts
TCPSackMerged: 1
TCPSackShiftFallback: 1
IpExt:
InMcastPkts: 92
OutMcastPkts: 64
InBcastPkts: 2
InOctets: 1089217
OutOctets: 265005791
InMcastOctets: 16294
OutMcastOctets: 10364
InBcastOctets: 483
receiver side , after test 1:
[root@target1 ~]netstat -st
IcmpMsg:
InType0: 17
InType3: 6
InType8: 9
OutType0: 9
OutType3: 6
OutType8: 19
Tcp:
84 active connections openings
14 passive connection openings
6 failed connection attempts
4 connection resets received
4 connections established
16684784 segments received
16704650 segments send out
22 segments retransmited
0 bad segments received.
6 resets sent
UdpLite:
TcpExt:
39 TCP sockets finished time wait in slow timer
23 delayed acks sent
83 delayed acks further delayed because of locked socket
Quick ack mode was activated 225 times
1019 packets directly queued to recvmsg prequeue.
3235352384 packets directly received from backlog
483600 packets directly received from prequeue
86065 packets header predicted
4855 packets header predicted and directly queued to user
10369 acknowledgments not containing data received
928 predicted acknowledgments
0 TCP data loss events
2 retransmits in slow start
6 other TCP timeouts
225 DSACKs sent for old packets
1 connections reset due to unexpected data
TCPSackMerged: 1
TCPSackShiftFallback: 3
IpExt:
InMcastPkts: 108
OutMcastPkts: 72
InBcastPkts: 4
InOctets: -936746758
OutOctets: 1556837236
InMcastOctets: 16774
OutMcastOctets: 10620
InBcastOctets: 966
receiver side, after test 2
[root@target1 ~]netstat -st
IcmpMsg:
InType0: 17
InType3: 6
InType8: 12
OutType0: 12
OutType3: 6
OutType8: 19
Tcp:
144 active connections openings
25 passive connection openings
29 failed connection attempts
7 connection resets received
4 connections established
44349148 segments received
44401154 segments send out
58434 segments retransmited
0 bad segments received.
6 resets sent
UdpLite:
TcpExt:
58 TCP sockets finished time wait in slow timer
735072 packets rejects in established connections because of timestamp
34 delayed acks sent
359 delayed acks further delayed because of locked socket
Quick ack mode was activated 14800 times
2112 packets directly queued to recvmsg prequeue.
3753925448 packets directly received from backlog
4377976 packets directly received from prequeue
847653 packets header predicted
105696 packets header predicted and directly queued to user
8804473 acknowledgments not containing data received
154775 predicted acknowledgments
10465 times recovered from packet loss due to SACK data
Detected reordering 1 times using FACK
Detected reordering 11185 times using SACK
Detected reordering 182 times using time stamp
2116 congestion windows fully recovered
18951 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 58
8 congestion windows recovered after partial ack
0 TCP data loss events
53 timeouts after SACK recovery
1 timeouts in loss state
57287 fast retransmits
12 forward retransmits
793 retransmits in slow start
10 other TCP timeouts
263 sack retransmits failed
14800 DSACKs sent for old packets
31 DSACKs sent for out of order packets
14289 DSACKs received
43 DSACKs for out of order packets received
1 connections reset due to unexpected data
TCPDSACKIgnoredOld: 8615
TCPDSACKIgnoredNoUndo: 5683
TCPSackMerged: 1
TCPSackShiftFallback: 15015212
IpExt:
InMcastPkts: 116
OutMcastPkts: 76
InBcastPkts: 4
InOctets: 1012355682
OutOctets: -1540562156
InMcastOctets: 17014
OutMcastOctets: 10748
InBcastOctets: 966
sender side, before test 1:
[root@target2 ~]# netstat -st
IcmpMsg:
InType3: 4
InType8: 32
OutType0: 32
OutType3: 4
Tcp:
1 active connections openings
2 passive connection openings
0 failed connection attempts
0 connection resets received
3 connections established
30268 segments received
10217 segments send out
0 segments retransmited
0 bad segments received.
3 resets sent
UdpLite:
TcpExt:
7 delayed acks sent
6332 packets directly queued to recvmsg prequeue.
8 packets directly received from backlog
46104 packets directly received from prequeue
27935 packets header predicted
11 packets header predicted and directly queued to user
455 acknowledgments not containing data received
119 predicted acknowledgments
0 TCP data loss events
TCPSackShiftFallback: 1
IpExt:
InMcastPkts: 87
OutMcastPkts: 54
InBcastPkts: 2
InOctets: 265039007
OutOctets: 1083024
InMcastOctets: 16444
OutMcastOctets: 9893
InBcastOctets: 483
sender side , after test 1:
[root@target2 ~]# netstat -st
IcmpMsg:
InType3: 4
InType8: 53
OutType0: 53
OutType3: 4
Tcp:
69 active connections openings
12 passive connection openings
2 failed connection attempts
4 connection resets received
4 connections established
16704819 segments received
16684841 segments send out
401 segments retransmited
0 bad segments received.
10 resets sent
UdpLite:
TcpExt:
31 TCP sockets finished time wait in slow timer
25 delayed acks sent
6515 packets directly queued to recvmsg prequeue.
24 packets directly received from backlog
46988 packets directly received from prequeue
27974 packets header predicted
115 packets header predicted and directly queued to user
10259331 acknowledgments not containing data received
12483 predicted acknowledgments
166 times recovered from packet loss due to SACK data
Detected reordering 1 times using FACK
Detected reordering 7 times using SACK
Detected reordering 1 times using time stamp
1 congestion windows fully recovered
41 congestion windows partially recovered using Hoe heuristic
0 TCP data loss events
386 fast retransmits
5 forward retransmits
3 other TCP timeouts
1 times receiver scheduled too late for direct processing
225 DSACKs received
1 connections reset due to unexpected data
TCPDSACKIgnoredOld: 167
TCPDSACKIgnoredNoUndo: 58
TCPSackShiftFallback: 30925668
IpExt:
InMcastPkts: 103
OutMcastPkts: 62
InBcastPkts: 4
InOctets: 1556368288
OutOctets: -934790015
InMcastOctets: 16924
OutMcastOctets: 10149
InBcastOctets: 966
sender side, after test 2:
[root@target2 ~]# netstat -st
IcmpMsg:
InType3: 4
InType8: 56
OutType0: 56
OutType3: 4
Tcp:
117 active connections openings
25 passive connection openings
2 failed connection attempts
7 connection resets received
4 connections established
44383169 segments received
44367187 segments send out
59660 segments retransmited
0 bad segments received.
34 resets sent
UdpLite:
TcpExt:
2 TCP sockets finished time wait in fast timer
57 TCP sockets finished time wait in slow timer
717082 packets rejects in established connections because of timestamp
46 delayed acks sent
202 delayed acks further delayed because of locked socket
Quick ack mode was activated 14356 times
7432 packets directly queued to recvmsg prequeue.
135038632 packets directly received from backlog
3633432 packets directly received from prequeue
783534 packets header predicted
94671 packets header predicted and directly queued to user
20034470 acknowledgments not containing data received
177885 predicted acknowledgments
10851 times recovered from packet loss due to SACK data
Detected reordering 6 times using FACK
Detected reordering 9217 times using SACK
Detected reordering 111 times using time stamp
2125 congestion windows fully recovered
19325 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 71
7 congestion windows recovered after partial ack
0 TCP data loss events
52 timeouts after SACK recovery
58562 fast retransmits
67 forward retransmits
736 retransmits in slow start
8 other TCP timeouts
226 sack retransmits failed
1 times receiver scheduled too late for direct processing
14356 DSACKs sent for old packets
44 DSACKs sent for out of order packets
14679 DSACKs received
31 DSACKs for out of order packets received
1 connections reset due to unexpected data
TCPDSACKIgnoredOld: 8899
TCPDSACKIgnoredNoUndo: 5791
TCPSackShiftFallback: 47227517
IpExt:
InMcastPkts: 109
OutMcastPkts: 65
InBcastPkts: 4
InOctets: -1885181292
OutOctets: 1366995261
InMcastOctets: 17104
OutMcastOctets: 10245
InBcastOctets: 966
--
Best regards,
Oleg Ukhno,
ITO Team lead
Yandex LLC.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-21 13:55 ` Oleg V. Ukhno
@ 2011-01-22 12:48 ` Nicolas de Pesloüan
2011-01-24 19:32 ` Oleg V. Ukhno
2011-01-29 2:28 ` Jay Vosburgh
1 sibling, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-01-22 12:48 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: Jay Vosburgh, John Fastabend, netdev
On 21/01/2011 14:55, Oleg V. Ukhno wrote:
> On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:
>
>> If you have time for that, then yes, please, do the same test using
>> balance-rr+vlan to segregate paths. With those results, we would have
>> the opportunity to enhance the documentation with some well tested cases
>> of TCP load balancing on a LAN, not limited to the 802.3ad automatic setup.
>> Both setups make sense, and assuming the results would be similar is
>> probably true, but not reliable enough to assert it in the
>> documentation.
>>
>> Thanks,
>>
>> Nicolas.
>>
> Nicolas,
> I've run similar tests for the VLAN tunneling scenario. The results are
> identical, as I expected. The only significant difference is link failure
> handling: 802.3ad mode allows almost painless load redistribution, while
> balance-rr causes packet loss.
Oleg,
Thanks for doing the tests.
What link failure mode did you use for those tests? miimon or ARP monitoring?
Nicolas.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-22 12:48 ` Nicolas de Pesloüan
@ 2011-01-24 19:32 ` Oleg V. Ukhno
0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-01-24 19:32 UTC (permalink / raw)
To: Nicolas de Pesloüan; +Cc: Jay Vosburgh, John Fastabend, netdev
On 01/22/2011 03:48 PM, Nicolas de Pesloüan wrote:
> On 21/01/2011 14:55, Oleg V. Ukhno wrote:
>> On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:
>>>
>> Nicolas,
>> I've run similar tests for the VLAN tunneling scenario. The results are
>> identical, as I expected. The only significant difference is link failure
>> handling: 802.3ad mode allows almost painless load redistribution, while
>> balance-rr causes packet loss.
>
> Oleg,
>
> Thanks for doing the tests.
>
> What link failure mode did you use for those tests? miimon or ARP
> monitoring?
>
> Nicolas.
>
>
Nicolas,
as for tests:
MII link monitoring kills the whole transfer; with ARP monitoring it
still works, but load striping across the bond slaves is asymmetric
(one slave is overloaded, the other two run at about 50-60% bandwidth
utilization).
Just as a summary: balance-rr behaves like the patched 802.3ad when using
ARP monitoring mode, but the load striping is quite asymmetric and the
configuration on the switch and server sides is quite monstrous.
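For reference, the balance-rr + ARP monitoring setup being compared above can be sketched with the standard bonding sysfs interface; the interface names and the ARP target IP below are placeholders, not from the tests.

```shell
# Sketch of a balance-rr bond with ARP link monitoring (not MII).
# eth0-eth2 and 192.168.0.1 are placeholder names/addresses.
modprobe bonding
echo balance-rr > /sys/class/net/bond0/bonding/mode
echo 100 > /sys/class/net/bond0/bonding/arp_interval       # poll every 100 ms
echo +192.168.0.1 > /sys/class/net/bond0/bonding/arp_ip_target
ip link set bond0 up
echo +eth0 > /sys/class/net/bond0/bonding/slaves
echo +eth1 > /sys/class/net/bond0/bonding/slaves
echo +eth2 > /sys/class/net/bond0/bonding/slaves
```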
--
Best regards,
Oleg Ukhno
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-21 13:55 ` Oleg V. Ukhno
2011-01-22 12:48 ` Nicolas de Pesloüan
@ 2011-01-29 2:28 ` Jay Vosburgh
2011-02-01 16:25 ` Oleg V. Ukhno
2011-02-02 9:54 ` Nicolas de Pesloüan
1 sibling, 2 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-01-29 2:28 UTC (permalink / raw)
To: Oleg V. Ukhno
Cc: Nicolas de Pesloüan, John Fastabend, netdev
Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:
>
>> If you have time for that, then yes, please, do the same test using
>> balance-rr+vlan to segregate path. With those results, we would have
>> the opportunity to enhance the documentation with some well tested cases
>> of TCP load balancing on a LAN, not limited to 802.3ad automatic setup.
>> Both setups make sense, and assuming the results would be similar is
>> probably true, but not reliable enough to assert it into the documentation.
>>
>> Thanks,
>>
>> Nicolas.
>>
>Nicolas,
>I've run similar tests for the VLAN tunneling scenario. The results are identical,
>as I expected. The only significant difference is link failure
>handling: 802.3ad mode allows almost painless load redistribution, while
>balance-rr causes packet loss.
>The only question for me now is whether my patch could be applied to the
>upstream version - fixing issues with adaptation to the net-next code
>isn't a problem, if nobody objects.
I've thought about this whole thing, and here's what I view as
the proper way to do this.
In my mind, this proposal is two separate pieces:
First, a piece to make round-robin a selectable hash for
xmit_hash_policy. The documentation for this should follow the pattern
of the "layer3+4" hash policy, in particular noting that the new
algorithm violates the 802.3ad standard in exciting ways, will result in
out of order delivery, and that other 802.3ad implementations may or may
not tolerate this.
Second, a piece to make certain transmitted packets use the
source MAC of the sending slave instead of the bond's MAC. This should
be a separate option from the round-robin hash policy. I'd call it
something like "mac_select" with two values: "default" (what we do now)
and "slave_src_mac" to use the slave's real MAC for certain types of
traffic (I'm open to better names; that's just what I came up with while
writing this). I believe that "certain types" means "everything but
ARP," but might be "only IP and IPv6." Structuring the option in this
manner leaves the option open for additional selections in the future,
which a simple "on/off" option wouldn't. This option should probably
only affect a subset of modes; I'm thinking anything except balance-tlb
or -alb (because they do funky MAC things already) and active-backup (it
doesn't balance traffic, and already uses fail_over_mac to control
this). I think this option also needs a whole new section down in the
bottom explaining how to exploit it (the "pick special MACs on slaves to
trick switch hash" business).
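Purely as a hypothetical sketch, the two proposed pieces might end up being configured like this; note that neither the "simple-rr" hash value nor the "mac_select" option exists upstream — these are just the names suggested in this message.

```shell
# Hypothetical: load bonding with the proposed round-robin hash policy
# and the proposed per-slave source MAC selection. Option names
# ("simple-rr", "mac_select", "slave_src_mac") are the ones suggested
# in this discussion, not accepted kernel parameters.
modprobe bonding mode=802.3ad miimon=100 \
        xmit_hash_policy=simple-rr mac_select=slave_src_mac
```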
Comments?
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-29 2:28 ` Jay Vosburgh
@ 2011-02-01 16:25 ` Oleg V. Ukhno
2011-02-02 17:30 ` Jay Vosburgh
2011-02-02 9:54 ` Nicolas de Pesloüan
1 sibling, 1 reply; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-02-01 16:25 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Nicolas de Pesloüan, John Fastabend, netdev
On 01/29/2011 05:28 AM, Jay Vosburgh wrote:
> Oleg V. Ukhno<olegu@yandex-team.ru> wrote:
>
> I've thought about this whole thing, and here's what I view as
> the proper way to do this.
>
> In my mind, this proposal is two separate pieces:
>
> First, a piece to make round-robin a selectable hash for
> xmit_hash_policy. The documentation for this should follow the pattern
> of the "layer3+4" hash policy, in particular noting that the new
> algorithm violates the 802.3ad standard in exciting ways, will result in
> out of order delivery, and that other 802.3ad implementations may or may
> not tolerate this.
>
> Second, a piece to make certain transmitted packets use the
> source MAC of the sending slave instead of the bond's MAC. This should
> be a separate option from the round-robin hash policy. I'd call it
> something like "mac_select" with two values: "default" (what we do now)
> and "slave_src_mac" to use the slave's real MAC for certain types of
> traffic (I'm open to better names; that's just what I came up with while
> writing this). I believe that "certain types" means "everything but
> ARP," but might be "only IP and IPv6." Structuring the option in this
> manner leaves the option open for additional selections in the future,
> which a simple "on/off" option wouldn't. This option should probably
> only affect a subset of modes; I'm thinking anything except balance-tlb
> or -alb (because they do funky MAC things already) and active-backup (it
> doesn't balance traffic, and already uses fail_over_mac to control
> this). I think this option also needs a whole new section down in the
> bottom explaining how to exploit it (the "pick special MACs on slaves to
> trick switch hash" business).
>
> Comments?
>
> -J
>
Jay,
As for me, splitting my initial proposal into two logically different
pieces is OK; this will provide more flexible configuration.
Do I understand correctly that after I rewrite the patch in split form,
as you described above, and enhance the documentation, it will be/can be
applied to the kernel?
Then what should I do: rewrite the patch and resubmit it as a new one?
Oleg.
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>
--
Best regards,
Oleg Ukhno.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-01-29 2:28 ` Jay Vosburgh
2011-02-01 16:25 ` Oleg V. Ukhno
@ 2011-02-02 9:54 ` Nicolas de Pesloüan
2011-02-02 17:57 ` Jay Vosburgh
1 sibling, 1 reply; 32+ messages in thread
From: Nicolas de Pesloüan @ 2011-02-02 9:54 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Oleg V. Ukhno, John Fastabend, netdev
On 29/01/2011 03:28, Jay Vosburgh wrote:
> I've thought about this whole thing, and here's what I view as
> the proper way to do this.
>
> In my mind, this proposal is two separate pieces:
>
> First, a piece to make round-robin a selectable hash for
> xmit_hash_policy. The documentation for this should follow the pattern
> of the "layer3+4" hash policy, in particular noting that the new
> algorithm violates the 802.3ad standard in exciting ways, will result in
> out of order delivery, and that other 802.3ad implementations may or may
> not tolerate this.
>
> Second, a piece to make certain transmitted packets use the
> source MAC of the sending slave instead of the bond's MAC. This should
> be a separate option from the round-robin hash policy. I'd call it
> something like "mac_select" with two values: "default" (what we do now)
> and "slave_src_mac" to use the slave's real MAC for certain types of
> traffic (I'm open to better names; that's just what I came up with while
> writing this). I believe that "certain types" means "everything but
> ARP," but might be "only IP and IPv6." Structuring the option in this
> manner leaves the option open for additional selections in the future,
> which a simple "on/off" option wouldn't. This option should probably
> only affect a subset of modes; I'm thinking anything except balance-tlb
> or -alb (because they do funky MAC things already) and active-backup (it
> doesn't balance traffic, and already uses fail_over_mac to control
> this). I think this option also needs a whole new section down in the
> bottom explaining how to exploit it (the "pick special MACs on slaves to
> trick switch hash" business).
>
> Comments?
Looks really sensible to me.
I just propose the following option and option values: "src_mac_select" (instead of mac_select),
with "default" and "slave_mac" (instead of slave_src_mac) as possible values. In the future, we
might need a "dst_mac_select" option... :-)
Also, are there any risks that this kind of session load-balancing won't properly cooperate with
multiqueue (as explained in "Overriding Configuration for Special Cases" in
Documentation/networking/bonding.txt)? I think it is important to ensure we keep the ability to
fine-tune the egress path selection.
Nicolas.
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-02-01 16:25 ` Oleg V. Ukhno
@ 2011-02-02 17:30 ` Jay Vosburgh
0 siblings, 0 replies; 32+ messages in thread
From: Jay Vosburgh @ 2011-02-02 17:30 UTC (permalink / raw)
To: Oleg V. Ukhno; +Cc: Nicolas de Pesloüan, John Fastabend, netdev
Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>On 01/29/2011 05:28 AM, Jay Vosburgh wrote:
>> Oleg V. Ukhno<olegu@yandex-team.ru> wrote:
>>
>> I've thought about this whole thing, and here's what I view as
>> the proper way to do this.
>>
>> In my mind, this proposal is two separate pieces:
>>
>> First, a piece to make round-robin a selectable hash for
>> xmit_hash_policy. The documentation for this should follow the pattern
>> of the "layer3+4" hash policy, in particular noting that the new
>> algorithm violates the 802.3ad standard in exciting ways, will result in
>> out of order delivery, and that other 802.3ad implementations may or may
>> not tolerate this.
>>
>> Second, a piece to make certain transmitted packets use the
>> source MAC of the sending slave instead of the bond's MAC. This should
>> be a separate option from the round-robin hash policy. I'd call it
>> something like "mac_select" with two values: "default" (what we do now)
>> and "slave_src_mac" to use the slave's real MAC for certain types of
>> traffic (I'm open to better names; that's just what I came up with while
>> writing this). I believe that "certain types" means "everything but
>> ARP," but might be "only IP and IPv6." Structuring the option in this
>> manner leaves the option open for additional selections in the future,
>> which a simple "on/off" option wouldn't. This option should probably
>> only affect a subset of modes; I'm thinking anything except balance-tlb
>> or -alb (because they do funky MAC things already) and active-backup (it
>> doesn't balance traffic, and already uses fail_over_mac to control
>> this). I think this option also needs a whole new section down in the
>> bottom explaining how to exploit it (the "pick special MACs on slaves to
>> trick switch hash" business).
>>
>> Comments?
>>
>> -J
>>
>Jay,
>As for me, splitting my initial proposal into two logically different
>pieces is OK; this will provide more flexible configuration.
>Do I understand correctly that after I rewrite the patch in split form,
>as you described above, and enhance the documentation, it will be/can be
>applied to the kernel?
Yes, although the patches may have to go through a few
revisions.
>Then what should I do: rewrite the patch and resubmit it as a new one?
Yes.
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-02-02 9:54 ` Nicolas de Pesloüan
@ 2011-02-02 17:57 ` Jay Vosburgh
2011-02-03 14:54 ` Oleg V. Ukhno
0 siblings, 1 reply; 32+ messages in thread
From: Jay Vosburgh @ 2011-02-02 17:57 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: Oleg V. Ukhno, John Fastabend, netdev
Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:
>On 29/01/2011 03:28, Jay Vosburgh wrote:
>> I've thought about this whole thing, and here's what I view as
>> the proper way to do this.
>>
>> In my mind, this proposal is two separate pieces:
>>
>> First, a piece to make round-robin a selectable hash for
>> xmit_hash_policy. The documentation for this should follow the pattern
>> of the "layer3+4" hash policy, in particular noting that the new
>> algorithm violates the 802.3ad standard in exciting ways, will result in
>> out of order delivery, and that other 802.3ad implementations may or may
>> not tolerate this.
>>
>> Second, a piece to make certain transmitted packets use the
>> source MAC of the sending slave instead of the bond's MAC. This should
>> be a separate option from the round-robin hash policy. I'd call it
>> something like "mac_select" with two values: "default" (what we do now)
>> and "slave_src_mac" to use the slave's real MAC for certain types of
>> traffic (I'm open to better names; that's just what I came up with while
>> writing this). I believe that "certain types" means "everything but
>> ARP," but might be "only IP and IPv6." Structuring the option in this
>> manner leaves the option open for additional selections in the future,
>> which a simple "on/off" option wouldn't. This option should probably
>> only affect a subset of modes; I'm thinking anything except balance-tlb
>> or -alb (because they do funky MAC things already) and active-backup (it
>> doesn't balance traffic, and already uses fail_over_mac to control
>> this). I think this option also needs a whole new section down in the
>> bottom explaining how to exploit it (the "pick special MACs on slaves to
>> trick switch hash" business).
>>
>> Comments?
>
>Looks really sensible to me.
>
>I just propose the following option and option values: "src_mac_select"
>(instead of mac_select), with "default" and "slave_mac" (instead of
>slave_src_mac) as possible values. In the future, we might need a
>"dst_mac_select" option... :-)
I originally thought of using the nomenclature you propose; my
thinking for doing it the way I ended up with is to minimize the number
of tunable knobs that bonding has (so, the dst_mac would be a setting
for mac_select). That works as long as there aren't a lot of settings
that would be turned on simultaneously, since each combination would
have to be a separate option, or the options parser would have to handle
multiple settings (e.g., mac_select=src+dst or something like that).
Anyway, after thinking about it some more, in the long run it's
probably safer to separate these two, so, Oleg, use the above naming
("src_mac_select" with "default" and "slave_mac").
>Also, are there any risks that this kind of session load-balancing won't
>properly cooperate with multiqueue (as explained in "Overriding
>Configuration for Special Cases" in Documentation/networking/bonding.txt)?
>I think it is important to ensure we keep the ability to fine tune the
>egress path selection.
I think the logic for the mac_select (or src_mac_select or
whatever) just has to be done last, after the slave selection is done by
the multiqueue stuff. That's probably a good tidbit to put in the
documentation as well.
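The existing multiqueue override referred to here is already documented in Documentation/networking/bonding.txt ("Overriding Configuration for Special Cases"); the point above is that the proposed source-MAC rewrite would apply after this queue-based slave selection. The interface names and IP address are placeholders taken from that documentation's example.

```shell
# Existing multiqueue override from bonding.txt: map slave eth1 to
# queue id 2, then pin traffic to one destination onto that queue
# via tc + skbedit, forcing egress through eth1.
echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
tc qdisc add dev bond0 handle 1 root multiq
tc filter add dev bond0 protocol ip parent 1: prio 1 u32 \
        match ip dst 192.168.1.100 action skbedit queue_mapping 2
```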
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
2011-02-02 17:57 ` Jay Vosburgh
@ 2011-02-03 14:54 ` Oleg V. Ukhno
0 siblings, 0 replies; 32+ messages in thread
From: Oleg V. Ukhno @ 2011-02-03 14:54 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Nicolas de Pesloüan, John Fastabend, netdev
On 02/02/2011 08:57 PM, Jay Vosburgh wrote:
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com> wrote:
>> I just propose the following option and option values: "src_mac_select"
>> (instead of mac_select), with "default" and "slave_mac" (instead of
>> slave_src_mac) as possible values. In the future, we might need a
>> "dst_mac_select" option... :-)
>
> I originally thought of using the nomenclature you propose; my
> thinking for doing it the way I ended up with is to minimize the number
> of tunable knobs that bonding has (so, the dst_mac would be a setting
> for mac_select). That works as long as there aren't a lot of settings
> that would be turned on simultaneously, since each combination would
> have to be a separate option, or the options parser would have to handle
> multiple settings (e.g., mac_select=src+dst or something like that).
>
> Anyway, after thinking about it some more, in the long run it's
> probably safer to separate these two, so, Oleg, use the above naming
> ("src_mac_select" with "default" and "slave_mac").
>
>> Also, are there any risks that this kind of session load-balancing won't
>> properly cooperate with multiqueue (as explained in "Overriding
>> Configuration for Special Cases" in Documentation/networking/bonding.txt)?
>> I think it is important to ensure we keep the ability to fine tune the
>> egress path selection.
>
> I think the logic for the mac_select (or src_mac_select or
> whatever) just has to be done last, after the slave selection is done by
> the multiqueue stuff. That's probably a good tidbit to put in the
> documentation as well.
>
> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
Thanks everyone for the comments,
I'll resubmit the modified patch after it is ready and tested, in about a
week or two, I think.
Oleg
>
--
Best regards,
Oleg Ukhno
end of thread, other threads:[~2011-02-03 14:54 UTC | newest]
Thread overview: 32+ messages
2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
2011-01-14 20:10 ` John Fastabend
2011-01-14 23:12 ` Oleg V. Ukhno
2011-01-14 20:13 ` Jay Vosburgh
2011-01-14 22:51 ` Oleg V. Ukhno
2011-01-15 0:05 ` Jay Vosburgh
2011-01-15 12:11 ` Oleg V. Ukhno
2011-01-18 3:16 ` John Fastabend
2011-01-18 12:40 ` Oleg V. Ukhno
2011-01-18 14:54 ` Nicolas de Pesloüan
2011-01-18 15:28 ` Oleg V. Ukhno
2011-01-18 16:24 ` Nicolas de Pesloüan
2011-01-18 16:57 ` Oleg V. Ukhno
2011-01-18 20:24 ` Jay Vosburgh
2011-01-18 21:20 ` Nicolas de Pesloüan
2011-01-19 1:45 ` Jay Vosburgh
2011-01-18 22:22 ` Oleg V. Ukhno
2011-01-19 16:13 ` Oleg V. Ukhno
2011-01-19 20:12 ` Nicolas de Pesloüan
2011-01-21 13:55 ` Oleg V. Ukhno
2011-01-22 12:48 ` Nicolas de Pesloüan
2011-01-24 19:32 ` Oleg V. Ukhno
2011-01-29 2:28 ` Jay Vosburgh
2011-02-01 16:25 ` Oleg V. Ukhno
2011-02-02 17:30 ` Jay Vosburgh
2011-02-02 9:54 ` Nicolas de Pesloüan
2011-02-02 17:57 ` Jay Vosburgh
2011-02-03 14:54 ` Oleg V. Ukhno
2011-01-18 17:56 ` Kirill Smelkov
2011-01-18 16:41 ` John Fastabend
2011-01-18 17:21 ` Oleg V. Ukhno
2011-01-14 20:41 ` Nicolas de Pesloüan