* [PATCH 1/2] IPVS: add wlib & wlip schedulers
       [not found] ` <Pine.LNX.4.61.0502010007060.1148@penguin.linux-vs.org>
@ 2015-01-17 23:15   ` Chris Caputo
  2015-01-19 23:17     ` Julian Anastasov
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Caputo @ 2015-01-17 23:15 UTC (permalink / raw)
  To: Wensong Zhang, Julian Anastasov, Simon Horman; +Cc: lvs-devel, linux-kernel

Wensong, this is something we discussed 10 years ago and you liked it, but 
it didn't actually get into the kernel.  I've updated it, tested it, and 
would like to work toward inclusion.

Thanks,
Chris

---
From: Chris Caputo <ccaputo@alt.net> 

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
Packetrate) schedulers, updated for 3.19-rc4.

Signed-off-by: Chris Caputo <ccaputo@alt.net>
---
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/Kconfig linux-3.19-rc4/net/netfilter/ipvs/Kconfig
--- linux-3.19-rc4-stock/net/netfilter/ipvs/Kconfig	2015-01-11 20:44:53.000000000 +0000
+++ linux-3.19-rc4/net/netfilter/ipvs/Kconfig	2015-01-17 22:47:52.250301042 +0000
@@ -240,6 +240,26 @@ config	IP_VS_NQ
 	  If you want to compile it in kernel, say Y. To compile it as a
 	  module, choose M here. If unsure, say N.
 
+config	IP_VS_WLIB
+	tristate "weighted least incoming byterate scheduling"
+	---help---
+	  The weighted least incoming byterate scheduling algorithm directs
+	  network connections to the server with the least incoming byterate
+	  normalized by the server weight.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
+config	IP_VS_WLIP
+	tristate "weighted least incoming packetrate scheduling"
+	---help---
+	  The weighted least incoming packetrate scheduling algorithm directs
+	  network connections to the server with the least incoming packetrate
+	  normalized by the server weight.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
 comment 'IPVS SH scheduler'
 
 config IP_VS_SH_TAB_BITS
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/Makefile linux-3.19-rc4/net/netfilter/ipvs/Makefile
--- linux-3.19-rc4-stock/net/netfilter/ipvs/Makefile	2015-01-11 20:44:53.000000000 +0000
+++ linux-3.19-rc4/net/netfilter/ipvs/Makefile	2015-01-17 22:47:35.421861075 +0000
@@ -33,6 +33,8 @@ obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
+obj-$(CONFIG_IP_VS_WLIB) += ip_vs_wlib.o
+obj-$(CONFIG_IP_VS_WLIP) += ip_vs_wlip.o
 
 # IPVS application helpers
 obj-$(CONFIG_IP_VS_FTP) += ip_vs_ftp.o
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/ip_vs_wlib.c linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlib.c
--- linux-3.19-rc4-stock/net/netfilter/ipvs/ip_vs_wlib.c	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlib.c	2015-01-17 22:47:35.421861075 +0000
@@ -0,0 +1,156 @@
+/* IPVS:        Weighted Least Incoming Byterate Scheduling module
+ *
+ * Authors:     Chris Caputo <ccaputo@alt.net> based on code by:
+ *
+ *                  Wensong Zhang <wensong@linuxvirtualserver.org>
+ *                  Peter Kese <peter.kese@ijs.si>
+ *                  Julian Anastasov <ja@ssi.bg>
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Changes:
+ *     Chris Caputo: Based code on ip_vs_wlc.c ip_vs_rr.c.
+ *
+ */
+
+/* The WLIB algorithm uses the results of the estimator's inbps
+ * calculations to determine which real server has the lowest incoming
+ * byterate.
+ *
+ * Real server weight is factored into the calculation.  An example way to
+ * use this is if you have one server that can handle 100 Mbps of input and
+ * another that can handle 1 Gbps you could set the weights to be 100 and 1000
+ * respectively.
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+
+static int
+ip_vs_wlib_init_svc(struct ip_vs_service *svc)
+{
+	svc->sched_data = &svc->destinations;
+	return 0;
+}
+
+static int
+ip_vs_wlib_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
+{
+	struct list_head *p;
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	/* dest is already unlinked, so p->prev is not valid but
+	 * p->next is valid, use it to reach previous entry.
+	 */
+	if (p == &dest->n_list)
+		svc->sched_data = p->next->prev;
+	spin_unlock_bh(&svc->sched_lock);
+	return 0;
+}
+
+/* Weighted Least Incoming Byterate scheduling */
+static struct ip_vs_dest *
+ip_vs_wlib_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
+		    struct ip_vs_iphdr *iph)
+{
+	struct list_head *p, *q;
+	struct ip_vs_dest *dest, *least = NULL;
+	u32 dr, lr = -1;
+	int dwgt, lwgt = 0;
+
+	IP_VS_DBG(6, "%s(): Scheduling...\n", __func__);
+
+	/* We calculate the load of each dest server as follows:
+	 *        (dest inbps rate) / dest->weight
+	 *
+	 * The comparison of dr*lwght < lr*dwght is equivalent to that of
+	 * dr/dwght < lr/lwght if every weight is larger than zero.
+	 *
+	 * A server with weight=0 is quiesced and will not receive any
+	 * new connections.
+	 *
+	 * In case of ties, highest weight is winner.  And if that still makes
+	 * for a tie, round robin is used (which is why we remember our last
+	 * starting location in the linked list).
+	 */
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	p = list_next_rcu(p);
+	q = p;
+	do {
+		/* skip list head */
+		if (q == &svc->destinations) {
+			q = list_next_rcu(q);
+			continue;
+		}
+
+		dest = list_entry_rcu(q, struct ip_vs_dest, n_list);
+		dwgt = atomic_read(&dest->weight);
+		if (!(dest->flags & IP_VS_DEST_F_OVERLOAD) && dwgt > 0) {
+			spin_lock(&dest->stats.lock);
+			dr = dest->stats.ustats.inbps;
+			spin_unlock(&dest->stats.lock);
+
+			if (!least ||
+			    (u64)dr * (u64)lwgt < (u64)lr * (u64)dwgt ||
+			    (dr == lr && dwgt > lwgt)) {
+				least = dest;
+				lr = dr;
+				lwgt = dwgt;
+				svc->sched_data = q;
+			}
+		}
+		q = list_next_rcu(q);
+	} while (q != p);
+	spin_unlock_bh(&svc->sched_lock);
+
+	if (least) {
+		IP_VS_DBG_BUF(6,
+			      "WLIB: server %s:%u activeconns %d refcnt %d weight %d\n",
+			      IP_VS_DBG_ADDR(least->af, &least->addr),
+			      ntohs(least->port),
+			      atomic_read(&least->activeconns),
+			      atomic_read(&least->refcnt),
+			      atomic_read(&least->weight));
+	} else {
+		ip_vs_scheduler_err(svc, "no destination available");
+	}
+
+	return least;
+}
+
+static struct ip_vs_scheduler ip_vs_wlib_scheduler = {
+	.name =			"wlib",
+	.refcnt =		ATOMIC_INIT(0),
+	.module =		THIS_MODULE,
+	.n_list =		LIST_HEAD_INIT(ip_vs_wlib_scheduler.n_list),
+	.init_service =		ip_vs_wlib_init_svc,
+	.add_dest =		NULL,
+	.del_dest =		ip_vs_wlib_del_dest,
+	.schedule =		ip_vs_wlib_schedule,
+};
+
+static int __init ip_vs_wlib_init(void)
+{
+	return register_ip_vs_scheduler(&ip_vs_wlib_scheduler);
+}
+
+static void __exit ip_vs_wlib_cleanup(void)
+{
+	unregister_ip_vs_scheduler(&ip_vs_wlib_scheduler);
+	synchronize_rcu();
+}
+
+module_init(ip_vs_wlib_init);
+module_exit(ip_vs_wlib_cleanup);
+MODULE_LICENSE("GPL");
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/ip_vs_wlip.c linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlip.c
--- linux-3.19-rc4-stock/net/netfilter/ipvs/ip_vs_wlip.c	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlip.c	2015-01-17 22:47:35.421861075 +0000
@@ -0,0 +1,156 @@
+/* IPVS:        Weighted Least Incoming Packetrate Scheduling module
+ *
+ * Authors:     Chris Caputo <ccaputo@alt.net> based on code by:
+ *
+ *                  Wensong Zhang <wensong@linuxvirtualserver.org>
+ *                  Peter Kese <peter.kese@ijs.si>
+ *                  Julian Anastasov <ja@ssi.bg>
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Changes:
+ *     Chris Caputo: Based code on ip_vs_wlc.c ip_vs_rr.c.
+ *
+ */
+
+/* The WLIP algorithm uses the results of the estimator's inpps
+ * calculations to determine which real server has the lowest incoming
+ * packetrate.
+ *
+ * Real server weight is factored into the calculation.  An example way to
+ * use this is if you have one server that can handle 10 Kpps of input and
+ * another that can handle 100 Kpps you could set the weights to be 10 and 100
+ * respectively.
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+
+static int
+ip_vs_wlip_init_svc(struct ip_vs_service *svc)
+{
+	svc->sched_data = &svc->destinations;
+	return 0;
+}
+
+static int
+ip_vs_wlip_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
+{
+	struct list_head *p;
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	/* dest is already unlinked, so p->prev is not valid but
+	 * p->next is valid, use it to reach previous entry.
+	 */
+	if (p == &dest->n_list)
+		svc->sched_data = p->next->prev;
+	spin_unlock_bh(&svc->sched_lock);
+	return 0;
+}
+
+/* Weighted Least Incoming Packetrate scheduling */
+static struct ip_vs_dest *
+ip_vs_wlip_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
+		    struct ip_vs_iphdr *iph)
+{
+	struct list_head *p, *q;
+	struct ip_vs_dest *dest, *least = NULL;
+	u32 dr, lr = -1;
+	int dwgt, lwgt = 0;
+
+	IP_VS_DBG(6, "%s(): Scheduling...\n", __func__);
+
+	/* We calculate the load of each dest server as follows:
+	 *        (dest inpps rate) / dest->weight
+	 *
+	 * The comparison of dr*lwght < lr*dwght is equivalent to that of
+	 * dr/dwght < lr/lwght if every weight is larger than zero.
+	 *
+	 * A server with weight=0 is quiesced and will not receive any
+	 * new connections.
+	 *
+	 * In case of ties, highest weight is winner.  And if that still makes
+	 * for a tie, round robin is used (which is why we remember our last
+	 * starting location in the linked list).
+	 */
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	p = list_next_rcu(p);
+	q = p;
+	do {
+		/* skip list head */
+		if (q == &svc->destinations) {
+			q = list_next_rcu(q);
+			continue;
+		}
+
+		dest = list_entry_rcu(q, struct ip_vs_dest, n_list);
+		dwgt = atomic_read(&dest->weight);
+		if (!(dest->flags & IP_VS_DEST_F_OVERLOAD) && dwgt > 0) {
+			spin_lock(&dest->stats.lock);
+			dr = dest->stats.ustats.inpps;
+			spin_unlock(&dest->stats.lock);
+
+			if (!least ||
+			    (u64)dr * (u64)lwgt < (u64)lr * (u64)dwgt ||
+			    (dr == lr && dwgt > lwgt)) {
+				least = dest;
+				lr = dr;
+				lwgt = dwgt;
+				svc->sched_data = q;
+			}
+		}
+		q = list_next_rcu(q);
+	} while (q != p);
+	spin_unlock_bh(&svc->sched_lock);
+
+	if (least) {
+		IP_VS_DBG_BUF(6,
+			      "WLIP: server %s:%u activeconns %d refcnt %d weight %d\n",
+			      IP_VS_DBG_ADDR(least->af, &least->addr),
+			      ntohs(least->port),
+			      atomic_read(&least->activeconns),
+			      atomic_read(&least->refcnt),
+			      atomic_read(&least->weight));
+	} else {
+		ip_vs_scheduler_err(svc, "no destination available");
+	}
+
+	return least;
+}
+
+static struct ip_vs_scheduler ip_vs_wlip_scheduler = {
+	.name =			"wlip",
+	.refcnt =		ATOMIC_INIT(0),
+	.module =		THIS_MODULE,
+	.n_list =		LIST_HEAD_INIT(ip_vs_wlip_scheduler.n_list),
+	.init_service =		ip_vs_wlip_init_svc,
+	.add_dest =		NULL,
+	.del_dest =		ip_vs_wlip_del_dest,
+	.schedule =		ip_vs_wlip_schedule,
+};
+
+static int __init ip_vs_wlip_init(void)
+{
+	return register_ip_vs_scheduler(&ip_vs_wlip_scheduler);
+}
+
+static void __exit ip_vs_wlip_cleanup(void)
+{
+	unregister_ip_vs_scheduler(&ip_vs_wlip_scheduler);
+	synchronize_rcu();
+}
+
+module_init(ip_vs_wlip_init);
+module_exit(ip_vs_wlip_cleanup);
+MODULE_LICENSE("GPL");


* Re: [PATCH 1/2] IPVS: add wlib & wlip schedulers
  2015-01-17 23:15   ` [PATCH 1/2] IPVS: add wlib & wlip schedulers Chris Caputo
@ 2015-01-19 23:17     ` Julian Anastasov
  2015-01-20 23:21       ` [PATCH 1/3] " Chris Caputo
                         ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Julian Anastasov @ 2015-01-19 23:17 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Wensong Zhang, Simon Horman, lvs-devel, linux-kernel


	Hello,

On Sat, 17 Jan 2015, Chris Caputo wrote:

> From: Chris Caputo <ccaputo@alt.net> 
> 
> IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
> Packetrate) schedulers, updated for 3.19-rc4.

	The IPVS estimator uses a 2-second timer to update
the stats; isn't that a problem for such schedulers?
Also, you schedule by incoming traffic rate, which is
fine when clients mostly upload. But in the common case
clients mostly download, and IPVS only processes the
download traffic for the NAT forwarding method.

	May be not so useful idea: use sum of both directions
or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
flags, see how "sh" scheduler supports flags. I.e.
inbps + outbps.
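
	A minimal sketch of that flag-controlled variant, sampling the
estimator fields the later revision of the patch uses (the flag name
below is only a placeholder; the generic IP_VS_SVC_F_SCHED1 bit could
be reused the way the "sh" scheduler reuses the sched-flag bits):

/* hypothetical flag: sum both directions instead of inbps alone */
#define IP_VS_SVC_F_SCHED_WLIB_SUM	IP_VS_SVC_F_SCHED1

		spin_lock(&dest->stats.lock);
		dr = dest->stats.est.inbps;
		if (svc->flags & IP_VS_SVC_F_SCHED_WLIB_SUM)
			dr += dest->stats.est.outbps;
		spin_unlock(&dest->stats.lock);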

	Another problem: the pps and bps values are shifted;
see how ip_vs_read_estimator() reads them. ip_vs_est.c
contains comments noting that this code handles only a
couple of gigabits. Maybe inbps and outbps in struct
ip_vs_estimator should be changed to u64 to support more
gigabits, as a separate patch.
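
	For context, the un-shifting done by ip_vs_read_estimator()
looks roughly like this (paraphrased; internally pps/cps are kept
scaled by 2^10 and bps by 2^5):

		/* dst: user-visible stats, e: &stats->est */
		dst->inpps = (e->inpps + 0x1FF) >> 10;
		dst->inbps = (e->inbps + 0xF) >> 5;

	Since wlib/wlip only compare dr * lwgt against lr * dwgt, a
common scale factor cancels out, so the schedulers can compare the
shifted values directly.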

> Signed-off-by: Chris Caputo <ccaputo@alt.net>
> ---
> +++ linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlib.c	2015-01-17 22:47:35.421861075 +0000

> +/* Weighted Least Incoming Byterate scheduling */
> +static struct ip_vs_dest *
> +ip_vs_wlib_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
> +		    struct ip_vs_iphdr *iph)
> +{
> +	struct list_head *p, *q;
> +	struct ip_vs_dest *dest, *least = NULL;
> +	u32 dr, lr = -1;
> +	int dwgt, lwgt = 0;

	To support a u64 result from the 32-bit multiply we can
change the vars as follows:

u32 dwgt, lwgt = 0;

> +	spin_lock_bh(&svc->sched_lock);
> +	p = (struct list_head *)svc->sched_data;
> +	p = list_next_rcu(p);

	Note that dests are deleted from svc->destinations
without holding any lock (in __ip_vs_unlink_dest); the
svc->sched_lock taken above protects only svc->sched_data.

	So, an RCU dereference is needed here; list_next_rcu is
not enough. Better to stick to the list walking used by the
rr algorithm in ip_vs_rr.c.

> +	q = p;
> +	do {
> +		/* skip list head */
> +		if (q == &svc->destinations) {
> +			q = list_next_rcu(q);
> +			continue;
> +		}
> +
> +		dest = list_entry_rcu(q, struct ip_vs_dest, n_list);
> +		dwgt = atomic_read(&dest->weight);

	This will be dwgt = (u32) atomic_read(&dest->weight);

> +		if (!(dest->flags & IP_VS_DEST_F_OVERLOAD) && dwgt > 0) {
> +			spin_lock(&dest->stats.lock);
> +			dr = dest->stats.ustats.inbps;
> +			spin_unlock(&dest->stats.lock);
> +
> +			if (!least ||
> +			    (u64)dr * (u64)lwgt < (u64)lr * (u64)dwgt ||

	This will be (u64)dr * lwgt < (u64)lr * dwgt ||

	See commit c16526a7b99c1c for 32x32 multiply.

> +			    (dr == lr && dwgt > lwgt)) {

	Above check is redundant.

> +				least = dest;
> +				lr = dr;
> +				lwgt = dwgt;
> +				svc->sched_data = q;

	Better to update sched_data at the end, see below...

> +			}
> +		}
> +		q = list_next_rcu(q);
> +	} while (q != p);

	if (least)
		svc->sched_data = &least->n_list;

> +	spin_unlock_bh(&svc->sched_lock);

	Same comments for wlip.

Regards

--
Julian Anastasov <ja@ssi.bg>


* [PATCH 1/3] IPVS: add wlib & wlip schedulers
  2015-01-19 23:17     ` Julian Anastasov
@ 2015-01-20 23:21       ` Chris Caputo
  2015-01-22 22:06         ` Julian Anastasov
  2015-01-20 23:21       ` [PATCH 2/3] " Chris Caputo
  2015-01-20 23:21       ` [PATCH 3/3] " Chris Caputo
  2 siblings, 1 reply; 9+ messages in thread
From: Chris Caputo @ 2015-01-20 23:21 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Wensong Zhang, Simon Horman, lvs-devel, linux-kernel

On Tue, 20 Jan 2015, Julian Anastasov wrote:
> On Sat, 17 Jan 2015, Chris Caputo wrote:
> > From: Chris Caputo <ccaputo@alt.net> 
> > 
> > IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
> > Packetrate) schedulers, updated for 3.19-rc4.

Hi Julian,

Thanks for the review.

> 	The IPVS estimator uses a 2-second timer to update
> the stats; isn't that a problem for such schedulers?
> Also, you schedule by incoming traffic rate, which is
> fine when clients mostly upload. But in the common case
> clients mostly download, and IPVS only processes the
> download traffic for the NAT forwarding method.

My application consists of incoming TCP streams being load balanced to 
servers which receive the feeds. These are long lived multi-gigabyte 
streams, and so I believe the estimator's 2-second timer is fine. As an 
example:

# cat /proc/net/ip_vs_stats
   Total Incoming Outgoing         Incoming         Outgoing
   Conns  Packets  Packets            Bytes            Bytes
     9AB  58B7C17        0      1237CA2C325                0

 Conns/s   Pkts/s   Pkts/s          Bytes/s          Bytes/s
       1     387C        0          B16C4AE                0
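
For scale, those rate figures are hex: 0x387C is roughly 14,500
packets/s and 0xB16C4AE is roughly 186 Mbytes/s, i.e. about 1.5
Gbit/s of incoming traffic.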

> 	May be not so useful idea: use sum of both directions
> or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> flags, see how "sh" scheduler supports flags. I.e.
> inbps + outbps.

I see a user-mode option as increasing complexity. For example, 
keepalived users would need to have keepalived patched to support the new 
algorithm, due to flags, rather than just configuring "wlib" or "wlip" and 
it just working.

I think I'd rather see a wlob/wlop version for users that want to 
load-balance based on outgoing bytes/packets, and a wlb/wlp version for 
users that want them summed.

> 	Another problem: the pps and bps values are shifted;
> see how ip_vs_read_estimator() reads them. ip_vs_est.c
> contains comments noting that this code handles only a
> couple of gigabits. Maybe inbps and outbps in struct
> ip_vs_estimator should be changed to u64 to support more
> gigabits, as a separate patch.

See patch below to convert bps in ip_vs_estimator to 64-bits.

Other patches, based on your feedback, to follow.

Thanks,
Chris

From: Chris Caputo <ccaputo@alt.net> 

IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
~34.35Gbits/s.

Signed-off-by: Chris Caputo <ccaputo@alt.net>
---
diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h linux-3.19-rc5/include/net/ip_vs.h
--- linux-3.19-rc5-stock/include/net/ip_vs.h	2015-01-18 06:02:20.000000000 +0000
+++ linux-3.19-rc5/include/net/ip_vs.h	2015-01-20 08:01:15.548177969 +0000
@@ -390,8 +390,8 @@ struct ip_vs_estimator {
 	u32			cps;
 	u32			inpps;
 	u32			outpps;
-	u32			inbps;
-	u32			outbps;
+	u64			inbps;
+	u64			outbps;
 };
 
 struct ip_vs_stats {
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c	2015-01-18 06:02:20.000000000 +0000
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c	2015-01-20 08:01:34.369840704 +0000
@@ -45,10 +45,12 @@
 
   NOTES.
 
-  * The stored value for average bps is scaled by 2^5, so that maximal
-    rate is ~2.15Gbits/s, average pps and cps are scaled by 2^10.
+  * Average bps is scaled by 2^5, while average pps and cps are scaled by 2^10.
 
-  * A lot code is taken from net/sched/estimator.c
+  * All are reported to user level as 32 bit unsigned values. Bps can
+    overflow for fast links : max speed being ~34.35Gbits/s.
+
+  * A lot of code is taken from net/core/gen_estimator.c
  */
 
 
@@ -98,7 +100,7 @@ static void estimation_timer(unsigned lo
 	u32 n_conns;
 	u32 n_inpkts, n_outpkts;
 	u64 n_inbytes, n_outbytes;
-	u32 rate;
+	u64 rate;
 	struct net *net = (struct net *)arg;
 	struct netns_ipvs *ipvs;
 
@@ -118,23 +120,24 @@ static void estimation_timer(unsigned lo
 		/* scaled by 2^10, but divided 2 seconds */
 		rate = (n_conns - e->last_conns) << 9;
 		e->last_conns = n_conns;
-		e->cps += ((long)rate - (long)e->cps) >> 2;
+		e->cps += ((s64)rate - (s64)e->cps) >> 2;
 
 		rate = (n_inpkts - e->last_inpkts) << 9;
 		e->last_inpkts = n_inpkts;
-		e->inpps += ((long)rate - (long)e->inpps) >> 2;
+		e->inpps += ((s64)rate - (s64)e->inpps) >> 2;
 
 		rate = (n_outpkts - e->last_outpkts) << 9;
 		e->last_outpkts = n_outpkts;
-		e->outpps += ((long)rate - (long)e->outpps) >> 2;
+		e->outpps += ((s64)rate - (s64)e->outpps) >> 2;
 
+		/* scaled by 2^5, but divided 2 seconds */
 		rate = (n_inbytes - e->last_inbytes) << 4;
 		e->last_inbytes = n_inbytes;
-		e->inbps += ((long)rate - (long)e->inbps) >> 2;
+		e->inbps += ((s64)rate - (s64)e->inbps) >> 2;
 
 		rate = (n_outbytes - e->last_outbytes) << 4;
 		e->last_outbytes = n_outbytes;
-		e->outbps += ((long)rate - (long)e->outbps) >> 2;
+		e->outbps += ((s64)rate - (s64)e->outbps) >> 2;
 		spin_unlock(&s->lock);
 	}
 	spin_unlock(&ipvs->est_lock);


* [PATCH 2/3] IPVS: add wlib & wlip schedulers
  2015-01-19 23:17     ` Julian Anastasov
  2015-01-20 23:21       ` [PATCH 1/3] " Chris Caputo
@ 2015-01-20 23:21       ` Chris Caputo
  2015-01-22 21:07         ` Julian Anastasov
  2015-01-20 23:21       ` [PATCH 3/3] " Chris Caputo
  2 siblings, 1 reply; 9+ messages in thread
From: Chris Caputo @ 2015-01-20 23:21 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Wensong Zhang, Simon Horman, lvs-devel, linux-kernel

On Tue, 20 Jan 2015, Julian Anastasov wrote:
> > +                      (u64)dr * (u64)lwgt < (u64)lr * (u64)dwgt ||
[...]
> > +  	                   (dr == lr && dwgt > lwgt)) {
> 
> 	Above check is redundant.

I accepted your feedback and applied it to the below, except for this 
item.  I believe if dr and lr are zero (no traffic), we still want to 
choose the higher weight, thus a separate comparison is needed.
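
To illustrate: with two idle servers, dr and lr are both zero, so the
dr * lwgt < lr * dwgt test reduces to 0 < 0 and never triggers; without
the extra comparison the scheduler would ignore weights entirely while
there is no traffic, instead of preferring the higher-weight server.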

Thanks,
Chris

From: Chris Caputo <ccaputo@alt.net> 

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
Packetrate) schedulers, updated for 3.19-rc5.

Signed-off-by: Chris Caputo <ccaputo@alt.net>
---
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/Kconfig linux-3.19-rc5/net/netfilter/ipvs/Kconfig
--- linux-3.19-rc5-stock/net/netfilter/ipvs/Kconfig	2015-01-18 06:02:20.000000000 +0000
+++ linux-3.19-rc5/net/netfilter/ipvs/Kconfig	2015-01-20 08:08:28.883080285 +0000
@@ -240,6 +240,26 @@ config	IP_VS_NQ
 	  If you want to compile it in kernel, say Y. To compile it as a
 	  module, choose M here. If unsure, say N.
 
+config	IP_VS_WLIB
+	tristate "weighted least incoming byterate scheduling"
+	---help---
+	  The weighted least incoming byterate scheduling algorithm directs
+	  network connections to the server with the least incoming byterate
+	  normalized by the server weight.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
+config	IP_VS_WLIP
+	tristate "weighted least incoming packetrate scheduling"
+	---help---
+	  The weighted least incoming packetrate scheduling algorithm directs
+	  network connections to the server with the least incoming packetrate
+	  normalized by the server weight.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
 comment 'IPVS SH scheduler'
 
 config IP_VS_SH_TAB_BITS
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/Makefile linux-3.19-rc5/net/netfilter/ipvs/Makefile
--- linux-3.19-rc5-stock/net/netfilter/ipvs/Makefile	2015-01-18 06:02:20.000000000 +0000
+++ linux-3.19-rc5/net/netfilter/ipvs/Makefile	2015-01-20 08:08:28.883080285 +0000
@@ -33,6 +33,8 @@ obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
+obj-$(CONFIG_IP_VS_WLIB) += ip_vs_wlib.o
+obj-$(CONFIG_IP_VS_WLIP) += ip_vs_wlip.o
 
 # IPVS application helpers
 obj-$(CONFIG_IP_VS_FTP) += ip_vs_ftp.o
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_wlib.c linux-3.19-rc5/net/netfilter/ipvs/ip_vs_wlib.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_wlib.c	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_wlib.c	2015-01-20 08:09:00.177816054 +0000
@@ -0,0 +1,166 @@
+/* IPVS:        Weighted Least Incoming Byterate Scheduling module
+ *
+ * Authors:     Chris Caputo <ccaputo@alt.net> based on code by:
+ *
+ *                  Wensong Zhang <wensong@linuxvirtualserver.org>
+ *                  Peter Kese <peter.kese@ijs.si>
+ *                  Julian Anastasov <ja@ssi.bg>
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Changes:
+ *     Chris Caputo: Based code on ip_vs_wlc.c ip_vs_rr.c.
+ *
+ */
+
+/* The WLIB algorithm uses the results of the estimator's inbps
+ * calculations to determine which real server has the lowest incoming
+ * byterate.
+ *
+ * Real server weight is factored into the calculation.  An example way to
+ * use this is if you have one server that can handle 100 Mbps of input and
+ * another that can handle 1 Gbps you could set the weights to be 100 and 1000
+ * respectively.
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+
+static int
+ip_vs_wlib_init_svc(struct ip_vs_service *svc)
+{
+	svc->sched_data = &svc->destinations;
+	return 0;
+}
+
+static int
+ip_vs_wlib_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
+{
+	struct list_head *p;
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	/* dest is already unlinked, so p->prev is not valid but
+	 * p->next is valid, use it to reach previous entry.
+	 */
+	if (p == &dest->n_list)
+		svc->sched_data = p->next->prev;
+	spin_unlock_bh(&svc->sched_lock);
+	return 0;
+}
+
+/* Weighted Least Incoming Byterate scheduling */
+static struct ip_vs_dest *
+ip_vs_wlib_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
+		    struct ip_vs_iphdr *iph)
+{
+	struct list_head *p;
+	struct ip_vs_dest *dest, *last, *least = NULL;
+	int pass = 0;
+	u64 dr, lr = -1;
+	u32 dwgt, lwgt = 0;
+
+	IP_VS_DBG(6, "%s(): Scheduling...\n", __func__);
+
+	/* We calculate the load of each dest server as follows:
+	 *        (dest inbps rate) / dest->weight
+	 *
+	 * The comparison of dr*lwght < lr*dwght is equivalent to that of
+	 * dr/dwght < lr/lwght if every weight is larger than zero.
+	 *
+	 * A server with weight=0 is quiesced and will not receive any
+	 * new connections.
+	 *
+	 * In case of inactivity, highest weight is winner.  And if that still makes
+	 * for a tie, round robin is used (which is why we remember our last
+	 * starting location in the linked list).
+	 */
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	last = dest = list_entry(p, struct ip_vs_dest, n_list);
+
+	do {
+		list_for_each_entry_continue_rcu(dest,
+						 &svc->destinations,
+						 n_list) {
+			dwgt = (u32)atomic_read(&dest->weight);
+			if (!(dest->flags & IP_VS_DEST_F_OVERLOAD) &&
+			    dwgt > 0) {
+				spin_lock(&dest->stats.lock);
+				/* estimator's scaling doesn't matter */
+				dr = dest->stats.est.inbps;
+				spin_unlock(&dest->stats.lock);
+
+				if (!least ||
+				    dr * lwgt < lr * dwgt ||
+				    (!dr && !lr && dwgt > lwgt)) {
+					least = dest;
+					lr = dr;
+					lwgt = dwgt;
+				}
+			}
+
+			if (dest == last)
+				goto stop;
+		}
+		pass++;
+		/* Previous dest could be unlinked, do not loop forever.
+		 * If we stay at head there is no need for 2nd pass.
+		 */
+	} while (pass < 2 && p != &svc->destinations);
+
+stop:
+	if (least)
+		svc->sched_data = &least->n_list;
+
+	spin_unlock_bh(&svc->sched_lock);
+
+	if (least) {
+		IP_VS_DBG_BUF(6,
+			      "WLIB: server %s:%u activeconns %d refcnt %d weight %d\n",
+			      IP_VS_DBG_ADDR(least->af, &least->addr),
+			      ntohs(least->port),
+			      atomic_read(&least->activeconns),
+			      atomic_read(&least->refcnt),
+			      atomic_read(&least->weight));
+	} else {
+		ip_vs_scheduler_err(svc, "no destination available");
+	}
+
+	return least;
+}
+
+static struct ip_vs_scheduler ip_vs_wlib_scheduler = {
+	.name =			"wlib",
+	.refcnt =		ATOMIC_INIT(0),
+	.module =		THIS_MODULE,
+	.n_list =		LIST_HEAD_INIT(ip_vs_wlib_scheduler.n_list),
+	.init_service =		ip_vs_wlib_init_svc,
+	.add_dest =		NULL,
+	.del_dest =		ip_vs_wlib_del_dest,
+	.schedule =		ip_vs_wlib_schedule,
+};
+
+static int __init ip_vs_wlib_init(void)
+{
+	return register_ip_vs_scheduler(&ip_vs_wlib_scheduler);
+}
+
+static void __exit ip_vs_wlib_cleanup(void)
+{
+	unregister_ip_vs_scheduler(&ip_vs_wlib_scheduler);
+	synchronize_rcu();
+}
+
+module_init(ip_vs_wlib_init);
+module_exit(ip_vs_wlib_cleanup);
+MODULE_LICENSE("GPL");
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_wlip.c linux-3.19-rc5/net/netfilter/ipvs/ip_vs_wlip.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_wlip.c	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_wlip.c	2015-01-20 08:09:07.456126624 +0000
@@ -0,0 +1,166 @@
+/* IPVS:        Weighted Least Incoming Packetrate Scheduling module
+ *
+ * Authors:     Chris Caputo <ccaputo@alt.net> based on code by:
+ *
+ *                  Wensong Zhang <wensong@linuxvirtualserver.org>
+ *                  Peter Kese <peter.kese@ijs.si>
+ *                  Julian Anastasov <ja@ssi.bg>
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Changes:
+ *     Chris Caputo: Based code on ip_vs_wlc.c ip_vs_rr.c.
+ *
+ */
+
+/* The WLIP algorithm uses the results of the estimator's inpps
+ * calculations to determine which real server has the lowest incoming
+ * packetrate.
+ *
+ * Real server weight is factored into the calculation.  An example way to
+ * use this is if you have one server that can handle 10 Kpps of input and
+ * another that can handle 100 Kpps you could set the weights to be 10 and 100
+ * respectively.
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+
+static int
+ip_vs_wlip_init_svc(struct ip_vs_service *svc)
+{
+	svc->sched_data = &svc->destinations;
+	return 0;
+}
+
+static int
+ip_vs_wlip_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
+{
+	struct list_head *p;
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	/* dest is already unlinked, so p->prev is not valid but
+	 * p->next is valid, use it to reach previous entry.
+	 */
+	if (p == &dest->n_list)
+		svc->sched_data = p->next->prev;
+	spin_unlock_bh(&svc->sched_lock);
+	return 0;
+}
+
+/* Weighted Least Incoming Packetrate scheduling */
+static struct ip_vs_dest *
+ip_vs_wlip_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
+		    struct ip_vs_iphdr *iph)
+{
+	struct list_head *p;
+	struct ip_vs_dest *dest, *last, *least = NULL;
+	int pass = 0;
+	u32 dr, lr = -1;
+	u32 dwgt, lwgt = 0;
+
+	IP_VS_DBG(6, "%s(): Scheduling...\n", __func__);
+
+	/* We calculate the load of each dest server as follows:
+	 *        (dest inpps rate) / dest->weight
+	 *
+	 * The comparison of dr*lwght < lr*dwght is equivalent to that of
+	 * dr/dwght < lr/lwght if every weight is larger than zero.
+	 *
+	 * A server with weight=0 is quiesced and will not receive any
+	 * new connections.
+	 *
+	 * In case of inactivity, highest weight is winner.  And if that still makes
+	 * for a tie, round robin is used (which is why we remember our last
+	 * starting location in the linked list).
+	 */
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	last = dest = list_entry(p, struct ip_vs_dest, n_list);
+
+	do {
+		list_for_each_entry_continue_rcu(dest,
+						 &svc->destinations,
+						 n_list) {
+			dwgt = (u32)atomic_read(&dest->weight);
+			if (!(dest->flags & IP_VS_DEST_F_OVERLOAD) &&
+			    dwgt > 0) {
+				spin_lock(&dest->stats.lock);
+				/* estimator's scaling doesn't matter */
+				dr = dest->stats.est.inpps;
+				spin_unlock(&dest->stats.lock);
+
+				if (!least ||
+				    (u64)dr * lwgt < (u64)lr * dwgt ||
+				    (!dr && !lr && dwgt > lwgt)) {
+					least = dest;
+					lr = dr;
+					lwgt = dwgt;
+				}
+			}
+
+			if (dest == last)
+				goto stop;
+		}
+		pass++;
+		/* Previous dest could be unlinked, do not loop forever.
+		 * If we stay at head there is no need for 2nd pass.
+		 */
+	} while (pass < 2 && p != &svc->destinations);
+
+stop:
+	if (least)
+		svc->sched_data = &least->n_list;
+
+	spin_unlock_bh(&svc->sched_lock);
+
+	if (least) {
+		IP_VS_DBG_BUF(6,
+			      "WLIP: server %s:%u activeconns %d refcnt %d weight %d\n",
+			      IP_VS_DBG_ADDR(least->af, &least->addr),
+			      ntohs(least->port),
+			      atomic_read(&least->activeconns),
+			      atomic_read(&least->refcnt),
+			      atomic_read(&least->weight));
+	} else {
+		ip_vs_scheduler_err(svc, "no destination available");
+	}
+
+	return least;
+}
+
+static struct ip_vs_scheduler ip_vs_wlip_scheduler = {
+	.name =			"wlip",
+	.refcnt =		ATOMIC_INIT(0),
+	.module =		THIS_MODULE,
+	.n_list =		LIST_HEAD_INIT(ip_vs_wlip_scheduler.n_list),
+	.init_service =		ip_vs_wlip_init_svc,
+	.add_dest =		NULL,
+	.del_dest =		ip_vs_wlip_del_dest,
+	.schedule =		ip_vs_wlip_schedule,
+};
+
+static int __init ip_vs_wlip_init(void)
+{
+	return register_ip_vs_scheduler(&ip_vs_wlip_scheduler);
+}
+
+static void __exit ip_vs_wlip_cleanup(void)
+{
+	unregister_ip_vs_scheduler(&ip_vs_wlip_scheduler);
+	synchronize_rcu();
+}
+
+module_init(ip_vs_wlip_init);
+module_exit(ip_vs_wlip_cleanup);
+MODULE_LICENSE("GPL");


* [PATCH 3/3] IPVS: add wlib & wlip schedulers
  2015-01-19 23:17     ` Julian Anastasov
  2015-01-20 23:21       ` [PATCH 1/3] " Chris Caputo
  2015-01-20 23:21       ` [PATCH 2/3] " Chris Caputo
@ 2015-01-20 23:21       ` Chris Caputo
  2 siblings, 0 replies; 9+ messages in thread
From: Chris Caputo @ 2015-01-20 23:21 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Wensong Zhang, Simon Horman, lvs-devel, linux-kernel

From: Chris Caputo <ccaputo@alt.net> 

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
Packetrate) scheduler docs for ipvsadm-1.27.

Signed-off-by: Chris Caputo <ccaputo@alt.net>
---
diff -upr ipvsadm-1.27-stock/SCHEDULERS ipvsadm-1.27/SCHEDULERS
--- ipvsadm-1.27-stock/SCHEDULERS	2013-09-06 08:37:27.000000000 +0000
+++ ipvsadm-1.27/SCHEDULERS	2015-01-17 22:14:32.812597191 +0000
@@ -1 +1 @@
-rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq
+rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq|wlib|wlip
diff -upr ipvsadm-1.27-stock/ipvsadm.8 ipvsadm-1.27/ipvsadm.8
--- ipvsadm-1.27-stock/ipvsadm.8	2013-09-06 08:37:27.000000000 +0000
+++ ipvsadm-1.27/ipvsadm.8	2015-01-17 22:14:32.812597191 +0000
@@ -261,6 +261,14 @@ fixed service rate (weight) of the ith s
 \fBnq\fR - Never Queue: assigns an incoming job to an idle server if
 there is, instead of waiting for a fast one; if all the servers are
 busy, it adopts the Shortest Expected Delay policy to assign the job.
+.sp
+\fBwlib\fR - Weighted Least Incoming Byterate: directs network
+connections to the real server with the least incoming byterate
+normalized by the server weight.
+.sp
+\fBwlip\fR - Weighted Least Incoming Packetrate: directs network
+connections to the real server with the least incoming packetrate
+normalized by the server weight.
 .TP
 .B -p, --persistent [\fItimeout\fP]
 Specify that a virtual service is persistent. If this option is


* Re: [PATCH 2/3] IPVS: add wlib & wlip schedulers
  2015-01-20 23:21       ` [PATCH 2/3] " Chris Caputo
@ 2015-01-22 21:07         ` Julian Anastasov
  0 siblings, 0 replies; 9+ messages in thread
From: Julian Anastasov @ 2015-01-22 21:07 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Wensong Zhang, Simon Horman, lvs-devel, linux-kernel


	Hello,

On Tue, 20 Jan 2015, Chris Caputo wrote:

> On Tue, 20 Jan 2015, Julian Anastasov wrote:
> > > +                      (u64)dr * (u64)lwgt < (u64)lr * (u64)dwgt ||
> [...]
> > > +  	                   (dr == lr && dwgt > lwgt)) {
> > 
> > 	Above check is redundant.
> 
> I accepted your feedback and applied it to the below, except for this 
> item.  I believe if dr and lr are zero (no traffic), we still want to 
> choose the higher weight, thus a separate comparison is needed.

	ok

> +	spin_lock_bh(&svc->sched_lock);
> +	p = (struct list_head *)svc->sched_data;
> +	last = dest = list_entry(p, struct ip_vs_dest, n_list);
> +
> +	do {
> +		list_for_each_entry_continue_rcu(dest,
> +						 &svc->destinations,
> +						 n_list) {
> +			dwgt = (u32)atomic_read(&dest->weight);
> +			if (!(dest->flags & IP_VS_DEST_F_OVERLOAD) &&
> +			    dwgt > 0) {
> +                               spin_lock(&dest->stats.lock);

	Maybe there is a way to avoid this spin_lock,
by using u64_stats_fetch_begin here and the corresponding
u64_stats_update_begin in estimation_timer(). We could
even remove this ->lock; it would be replaced by ->syncp.
The benefit is on 64-bit platforms, where we avoid the
lock here in the scheduler. Otherwise, I don't see
other implementation problems in this patch, and I'll
check it more carefully this weekend.
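
	A rough sketch of the reader side being suggested here, assuming
->lock really is replaced by a struct u64_stats_sync field named
->syncp (which does not exist in the tree yet):

		unsigned int start;
		u64 dr;

		do {
			start = u64_stats_fetch_begin(&dest->stats.syncp);
			dr = dest->stats.est.inbps;
		} while (u64_stats_fetch_retry(&dest->stats.syncp, start));

	with a matching u64_stats_update_begin()/u64_stats_update_end()
pair around the estimator updates in estimation_timer().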

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: [PATCH 1/3] IPVS: add wlib & wlip schedulers
  2015-01-20 23:21       ` [PATCH 1/3] " Chris Caputo
@ 2015-01-22 22:06         ` Julian Anastasov
  2015-01-23  4:16           ` Chris Caputo
  2015-01-27  8:36           ` Julian Anastasov
  0 siblings, 2 replies; 9+ messages in thread
From: Julian Anastasov @ 2015-01-22 22:06 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Wensong Zhang, Simon Horman, lvs-devel, linux-kernel


	Hello,

On Tue, 20 Jan 2015, Chris Caputo wrote:

> My application consists of incoming TCP streams being load balanced to 
> servers which receive the feeds. These are long lived multi-gigabyte 
> streams, and so I believe the estimator's 2-second timer is fine. As an 
> example:
> 
> # cat /proc/net/ip_vs_stats
>    Total Incoming Outgoing         Incoming         Outgoing
>    Conns  Packets  Packets            Bytes            Bytes
>      9AB  58B7C17        0      1237CA2C325                0
> 
>  Conns/s   Pkts/s   Pkts/s          Bytes/s          Bytes/s
>        1     387C        0          B16C4AE                0

	All other schedulers react and see a different
picture after every new connection. The worst example
is WLC, where a slow-start mechanism is desirable because
an idle server can be overloaded before the load is
noticed properly. Even WRR accounts for every connection
in its state.

	Your setup may expect a low number of connections per
second, but for other kinds of setups sending all connections
to the same server for 2 seconds looks scary. In fact, what
changes is only the starting position, so we only rotate among
the least loaded servers that look equally loaded, which in the
common case is a single server. And as our stats are per
CPU and designed for human reading, it is difficult to
read them often for other purposes. We need a good idea
for solving this problem, so that we can have faster
feedback after every scheduling decision.

> > 	May be not so useful idea: use sum of both directions
> > or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> > flags, see how "sh" scheduler supports flags. I.e.
> > inbps + outbps.
> 
> I see a user-mode option as increasing complexity. For example, 
> keepalived users would need to have keepalived patched to support the new 
> algorithm, due to flags, rather than just configuring "wlib" or "wlip" and 
> it just working.

	That is also true.

> I think I'd rather see a wlob/wlop version for users that want to 
> load-balance based on outgoing bytes/packets, and a wlb/wlp version for 
> users that want them summed.

	ok

> From: Chris Caputo <ccaputo@alt.net> 
> 
> IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
> flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
> ~34.35Gbits/s.

	Yep, we are limited by the u32 in the user-space structs.
I have to think about how to solve this problem.

1gbit => ~1.5 million pps
10gbit => ~15 million pps
100gbit => ~150 million pps

> Signed-off-by: Chris Caputo <ccaputo@alt.net>
> ---
> diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h linux-3.19-rc5/include/net/ip_vs.h
> --- linux-3.19-rc5-stock/include/net/ip_vs.h	2015-01-18 06:02:20.000000000 +0000
> +++ linux-3.19-rc5/include/net/ip_vs.h	2015-01-20 08:01:15.548177969 +0000
> @@ -390,8 +390,8 @@ struct ip_vs_estimator {
>  	u32			cps;
>  	u32			inpps;
>  	u32			outpps;
> -	u32			inbps;
> -	u32			outbps;
> +	u64			inbps;
> +	u64			outbps;

	Not sure, may be everything here should be u64 because
we have shifted values. I'll need some days to investigate
this issue...

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: [PATCH 1/3] IPVS: add wlib & wlip schedulers
  2015-01-22 22:06         ` Julian Anastasov
@ 2015-01-23  4:16           ` Chris Caputo
  2015-01-27  8:36           ` Julian Anastasov
  1 sibling, 0 replies; 9+ messages in thread
From: Chris Caputo @ 2015-01-23  4:16 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Wensong Zhang, Simon Horman, lvs-devel, linux-kernel

On Fri, 23 Jan 2015, Julian Anastasov wrote:
> 	Hello,
> 
> On Tue, 20 Jan 2015, Chris Caputo wrote:
> > My application consists of incoming TCP streams being load balanced to 
> > servers which receive the feeds. These are long lived multi-gigabyte 
> > streams, and so I believe the estimator's 2-second timer is fine. As an 
> > example:
> > 
> > # cat /proc/net/ip_vs_stats
> >    Total Incoming Outgoing         Incoming         Outgoing
> >    Conns  Packets  Packets            Bytes            Bytes
> >      9AB  58B7C17        0      1237CA2C325                0
> > 
> >  Conns/s   Pkts/s   Pkts/s          Bytes/s          Bytes/s
> >        1     387C        0          B16C4AE                0
> 
> 	All other schedulers react and see a different
> picture after every new connection. The worst example
> is WLC, where a slow-start mechanism is desirable because
> an idle server can be overloaded before the load is
> noticed properly. Even WRR accounts for every connection
> in its state.
> 
> 	Your setup may expect a low number of connections per
> second, but for other kinds of setups sending all connections
> to the same server for 2 seconds looks scary. In fact, what
> changes is only the starting position, so we only rotate among
> the least loaded servers that look equally loaded, which in the
> common case is a single server. And as our stats are per
> CPU and designed for human reading, it is difficult to
> read them often for other purposes. We need a good idea
> for solving this problem, so that we can have faster
> feedback after every scheduling decision.

This is exactly why my wlib/wlip code is a hybrid of wlc and rr.  The last 
location is saved, and the search starts after it.  Thus when traffic is 
zero, round-robin occurs.  When flows already exist, bursts of new 
connections do choose poorly based on repeated use of the last estimate, 
but working around that seems complex.

> > > 	May be not so useful idea: use sum of both directions
> > > or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> > > flags, see how "sh" scheduler supports flags. I.e.
> > > inbps + outbps.
> > 
> > I see a user-mode option as increasing complexity. For example, 
> > keepalived users would need to have keepalived patched to support the new 
> > algorithm, due to flags, rather than just configuring "wlib" or "wlip" and 
> > it just working.
> 
> 	That is also true.
> 
> > I think I'd rather see a wlob/wlop version for users that want to 
> > load-balance based on outgoing bytes/packets, and a wlb/wlp version for 
> > users that want them summed.
> 
> 	ok
> 
> > From: Chris Caputo <ccaputo@alt.net> 
> > 
> > IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
> > flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
> > ~34.35Gbits/s.
> 
> 	Yep, we are limited from u32 in user space structs.
> I have to think how to solve this problem.
> 
> 1gbit => ~1.5 million pps
> 10gbit => ~15 million pps
> 100gbit => ~150 million pps
> 
> > Signed-off-by: Chris Caputo <ccaputo@alt.net>
> > ---
> > diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h linux-3.19-rc5/include/net/ip_vs.h
> > --- linux-3.19-rc5-stock/include/net/ip_vs.h	2015-01-18 06:02:20.000000000 +0000
> > +++ linux-3.19-rc5/include/net/ip_vs.h	2015-01-20 08:01:15.548177969 +0000
> > @@ -390,8 +390,8 @@ struct ip_vs_estimator {
> >  	u32			cps;
> >  	u32			inpps;
> >  	u32			outpps;
> > -	u32			inbps;
> > -	u32			outbps;
> > +	u64			inbps;
> > +	u64			outbps;
> 
> 	Not sure, may be everything here should be u64 because
> we have shifted values. I'll need some days to investigate
> this issue...
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

Sounds good and thanks!

Chris


* Re: [PATCH 1/3] IPVS: add wlib & wlip schedulers
  2015-01-22 22:06         ` Julian Anastasov
  2015-01-23  4:16           ` Chris Caputo
@ 2015-01-27  8:36           ` Julian Anastasov
  1 sibling, 0 replies; 9+ messages in thread
From: Julian Anastasov @ 2015-01-27  8:36 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Wensong Zhang, Simon Horman, lvs-devel, linux-kernel


	Hello,

On Fri, 23 Jan 2015, Julian Anastasov wrote:

> On Tue, 20 Jan 2015, Chris Caputo wrote:
> 
> > My application consists of incoming TCP streams being load balanced to 
> > servers which receive the feeds. These are long lived multi-gigabyte 
> > streams, and so I believe the estimator's 2-second timer is fine. As an 
> > example:
> > 
> > # cat /proc/net/ip_vs_stats
> >    Total Incoming Outgoing         Incoming         Outgoing
> >    Conns  Packets  Packets            Bytes            Bytes
> >      9AB  58B7C17        0      1237CA2C325                0
> > 
> >  Conns/s   Pkts/s   Pkts/s          Bytes/s          Bytes/s
> >        1     387C        0          B16C4AE                0
> 
> 	Not sure, may be everything here should be u64 because
> we have shifted values. I'll need some days to investigate
> this issue...

	For now I don't see much hope for schedulers that rely
on the IPVS byte/packet stats, due to the slow update (2 seconds).
If we reduce this period we can cause performance problems for
other users.

Every *-LEAST-* algorithm (e.g. LC, WLC) needs current information
to take a decision on every new connection. OTOH, all *-ROUND-ROBIN-*
algorithms (RR, WRR) use information (weights) from user space, and
this way the kernel performs as expected.

	Currently, LC/WLC use feedback from the 3-way TCP handshake;
see ip_vs_dest_conn_overhead(), where established connections
are given a large preference. Such feedback from the real servers
is usually delayed by microseconds, up to milliseconds, and longer
if it depends on the clients.
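
	For reference, that overhead calculation weighs established
connections heavily; roughly (from include/net/ip_vs.h):

		/* active (established) connections count ~256x more */
		return (atomic_read(&dest->activeconns) << 8) +
		       atomic_read(&dest->inactconns);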

	The proposed schedulers have a round-robin function, but
only among the least loaded servers, so it is not dominant
and we suffer from the slow feedback from the estimator.

	For load information that is not present in the kernel,
a user-space daemon is needed to determine the weights to use
with WRR. It can take actual stats from the real server; for
example, it can take non-IPVS traffic into account.

	As an alternative, it is possible to implement some new svc
method that is called for every packet, for example from
ip_vs_in_stats(). It does not look fatal to add some fields to
struct ip_vs_dest that only specific schedulers will update,
for example byte/packet counters. Of course, the spin_locks
the scheduler must use will suffer on many CPUs. Such info could
also be attached as an allocated structure in an RCU pointer
dest->sched_info, where the data and corresponding methods can
be stored. It would need a careful RCU-style update, especially
when the scheduler of the svc is changed. If you think such an
idea can work, we can discuss the RCU and scheduler changes that
are needed. The proposed schedulers would have to implement
counters, their own estimator and a WRR function.
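
	A very rough sketch of that dest->sched_info idea (every name
below is hypothetical; nothing like this exists in the tree today):

		/* per-dest state a byte/packet-aware scheduler could keep */
		struct ip_vs_dest_sched_info {
			atomic64_t	bytes;	/* bumped for every packet */
			u64		rate;	/* scheduler-private estimate */
		};

		/* called from a per-packet hook such as ip_vs_in_stats() */
		static void wlib_account_packet(struct ip_vs_dest *dest,
						unsigned int len)
		{
			struct ip_vs_dest_sched_info *si;

			rcu_read_lock();
			si = rcu_dereference(dest->sched_info);
			if (si)
				atomic64_add(len, &si->bytes);
			rcu_read_unlock();
		}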

	Another variant could be to extend WRR with some
support for automatic dynamic-weight updates depending on
parameters: -s wrr --sched-flags {wlip,wlib,...}

	or a new option --sched-param that could also
provide info for the wrr estimator, etc. In any case, the
extended WRR scheduler would need the above support to check
every packet.

Regards

--
Julian Anastasov <ja@ssi.bg>
