* [RFC PATCHv5 0/6] ipvs: Use kthreads for stats
@ 2022-10-09 15:37 Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 1/6] ipvs: add rcu protection to stats Julian Anastasov
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Julian Anastasov @ 2022-10-09 15:37 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

	Hello,

	Posting v5 (no new ideas for now). A new patch is
inserted at the 2nd position. Patch 6 is just for debugging;
do not apply it if not needed.

	This patchset implements stats estimation in
kthread context. Simple tests do not show any problem.
If testing, check that the calculated chain_max does not
deviate much.

	cache_factor = 4 needs to be tested with different
configurations. ip_vs_est_calc_limits() is now simpler.
BH locking is used; IRQ locking looks dangerous to me, as it
can hurt systems that do not have a TSC clocksource, and I am
not sure whether that can affect ktime. A condensed timing
sketch of the calc phase follows.
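
	For reference, the calc phase measurement looks roughly like
this (a condensed, illustrative excerpt based on
ip_vs_est_calc_limits() in patch 3; the retry loop and error
handling are omitted):

	local_bh_disable();
	rcu_read_lock();

	/* Put stats in cache */
	ip_vs_chain_estimation(&chain);

	t1 = ktime_get();
	for (i = loops * cache_factor; i > 0; i--)
		ip_vs_chain_estimation(&chain);
	t2 = ktime_get();

	rcu_read_unlock();
	local_bh_enable();

	/* diff / loops (already scaled by cache_factor) is taken as
	 * the per-entry cost; chain_max is then sized for ~95us per
	 * chain
	 */
	diff = ktime_to_ns(ktime_sub(t2, t1));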

	Overview of the basic concepts. More in the
commit messages...

RCU Locking:

- As stats are now RCU-locked, tot_stats, svc and dest, which
hold the estimator structures, are now always freed via RCU
callback (illustrated below). This ensures an RCU grace period
passes after the ip_vs_stop_estimator() call.
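
	As an illustration, the pattern used for tot_stats (condensed
from patch 1):

	struct ip_vs_stats_rcu {
		struct ip_vs_stats	s;
		struct rcu_head		rcu_head;
	};

	static void ip_vs_stats_rcu_free(struct rcu_head *head)
	{
		struct ip_vs_stats_rcu *rs =
			container_of(head, struct ip_vs_stats_rcu, rcu_head);

		free_percpu(rs->s.cpustats);
		kfree(rs);
	}

	/* on netns cleanup, instead of freeing tot_stats directly: */
	call_rcu(&ipvs->tot_stats->rcu_head, ip_vs_stats_rcu_free);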

Kthread data:

- every kthread works over its own data structure and all
such structures are attached to an array. For now the number of
kthreads that can be created is limited (4 * number of possible
CPUs, see patch 3).

- even when a kthread structure exists, its task may not be
running, e.g. before the first service is added, while the
sysctl var is set to an empty cpulist, or when run_estimation
is set to 0.

- the allocated kthread context may grow from 1 to 50
allocated tick structures, which saves memory for setups
with a small number of estimators

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- every kthread data structure allows a limited number
of estimators. Kthread 0 is also used to initially
calculate the number of estimators to allow in every
chain, targeting a sub-100 microsecond cond_resched
rate. This number can range from 1 to hundreds (see the
worked numbers after this list).

- kthread 0 has the additional job of optimizing the
adding of estimators: they are first added to a
temp list (est_temp_list) and later kthread 0
distributes them to the other kthreads. The optimization
is based on the fact that a newly added estimator
should be estimated only after 2 seconds, so we have
time to offload the chain insertion from the controlling
process to kthread 0.

- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chains just before the estimated one, based on add_row.
This ensures their estimation will start after 2 seconds.
If estimators are added in bursts, a common case when all
services and dests are initially configured, we may
spread the estimators over more chains. This reduces
the chain imbalance.
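
	For orientation, the per-tick budget works out as follows
(numbers taken from the defines added in patch 3, shown here only
as a worked example):

	IPVS_EST_NTICKS       = 50                 /* ticks per 2s period */
	IPVS_EST_TICK         = 2s / 50 = 40ms     /* one tick */
	IPVS_EST_CHAIN_FACTOR = ALIGN_DOWN(2 * 1000 * 10 / 8 / 50, 8)
	                      = ALIGN_DOWN(50, 8) = 48 chains per tick

	48 chains * ~100us per chain = 4.8ms of estimation work in a
40ms tick, i.e. about 12% CPU usage while estimating.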

	There are things that I don't like but for now
I don't have a better idea for them:

- the calculation of chain_max can go wrong, depending
on the current load, CPU speed, memory speed, running in a
VM, and whether the tested estimators are in the CPU cache,
even if we do it in SCHED_FIFO mode or with BH disabled.
I expect such noise to be insignificant, but who knows.

- ip_vs_stop_estimator is not a simple unlinking of a
list node; we spend cycles to account for the removed
estimator

- __ip_vs_mutex is a global mutex for all netns, but it
protects hash tables that are still global.


Changes in v5:
Patch 4 (was 3 in v4):
* use ip_vs_est_max_threads() helper
Patch 3 (was 2 in v4):
* kthread 0 now disables BH instead of using SCHED_FIFO mode;
  this should work because we now perform fewer repeated tests
  over the pre-allocated kd->calc_stats structure. Use
  cache_factor = 4 to approximate how much slower the non-cached
  (due to their large number) per-cpu stats are estimated compared
  to the cached data we estimate in the calc phase. This needs to
  be tested with different numbers of CPUs and NUMA nodes.
* limit max threads using the formula 4 * Number of CPUs, not on rlimit
* remove _len suffix from vars like chain_max_len, tick_max_len,
  est_chain_max_len
Patch 2:
* new patch that adds functions for stats allocations

Changes in v4:
Patch 2:
* kthread 0 can start with a calculation phase in SCHED_FIFO mode
  to determine a chain_max_len suitable for a 100us cond_resched
  rate and 12% CPU usage within the 40ms tick. The current value of
  IPVS_EST_TICK_CHAINS=48 gives an estimation time of 4.8ms per tick
  (48 chains in units of 100us), which is 12% of the max tick time
  of 40ms. The question is how reliable such a calculation test
  will be.
* est_calc_phase indicates a mode where we dequeue estimators
  from kthreads, apply new chain_max_len and enqueue again
  all estimators to kthreads, done by kthread 0
* est->ktid can now be -1 to indicate that est is in est_temp_list,
  ready to be distributed to a kthread by kt 0, done in
  ip_vs_est_drain_temp_list(). kthread 0 data is now released
  only after the data for the other kthreads
* ip_vs_start_estimator was not setting ret = 0
* READ_ONCE not needed for volatile jiffies
Patch 3:
* restrict cpulist based on the cpus_allowed of
  process that assigns cpulist, not on cpu_possible_mask
* change of cpulist will trigger calc phase
Patch 5:
* print message every minute, not 2 seconds

Changes in v3:
Patch 2:
* calculate chain_max_len (was IPVS_EST_CHAIN_DEPTH) but
  it needs further tuning based on real estimation test
* est_max_threads set from rlimit(RLIMIT_NPROC). I don't
  see an analog to get_ucounts_value() to get the max value.
* the atomic bitop for td->present is not needed,
  remove it
* start filling based on est_row after 2 ticks are
  fully allocated. As 2/50 is 4%, this can be increased
  further.

Changes in v2:
Patch 2:
* kd->mutex is gone, cond_resched rate determined by
  IPVS_EST_CHAIN_DEPTH
* IPVS_EST_MAX_COUNT is a hard limit now
* kthread data is now 1-50 allocated tick structures,
  each containing heads for limited chains. Bitmaps
  should allow faster access. We avoid large
  allocations for structs.
* as the td->present bitmap is shared, use atomic bitops
* ip_vs_start_estimator now returns error code
* _bh locking removed from stats->lock
* bump arg is gone from ip_vs_est_reload_start
* prepare for upcoming changes that remove _irq
  from u64_stats_fetch_begin_irq/u64_stats_fetch_retry_irq
* est_add_ktid is now always valid
Patch 3:
* use .. in est_nice docs

Julian Anastasov (6):
  ipvs: add rcu protection to stats
  ipvs: use common functions for stats allocation
  ipvs: use kthreads for stats estimation
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: run_estimation should control the kthread tasks
  ipvs: debug the tick time

 Documentation/networking/ipvs-sysctl.rst |  24 +-
 include/net/ip_vs.h                      | 144 +++-
 net/netfilter/ipvs/ip_vs_core.c          |  10 +-
 net/netfilter/ipvs/ip_vs_ctl.c           | 442 +++++++++---
 net/netfilter/ipvs/ip_vs_est.c           | 862 +++++++++++++++++++++--
 5 files changed, 1320 insertions(+), 162 deletions(-)

-- 
2.37.3




* [RFC PATCHv5 1/6] ipvs: add rcu protection to stats
  2022-10-09 15:37 [RFC PATCHv5 0/6] ipvs: Use kthreads for stats Julian Anastasov
@ 2022-10-09 15:37 ` Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 2/6] ipvs: use common functions for stats allocation Julian Anastasov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Julian Anastasov @ 2022-10-09 15:37 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

In preparation for using RCU locking for the list
of estimators, make sure the struct ip_vs_stats
is released after an RCU grace period by using RCU
callbacks. This affects ipvs->tot_stats, where we
cannot use RCU callbacks for ipvs, so we use an
allocated struct ip_vs_stats_rcu. For services
and dests we force RCU callbacks in all cases.
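
An illustrative note on ordering (the corresponding hunk is in
ip_vs_core.c below): since the frees are now deferred to RCU
callbacks, module unload has to wait for them to run:

	static void __exit ip_vs_cleanup(void)
	{
		...
		ip_vs_control_cleanup();
		/* common rcu_barrier() that waits for pending call_rcu()
		 * callbacks, e.g. the one freeing tot_stats
		 */
		rcu_barrier();
		pr_info("ipvs unloaded.\n");
	}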

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h             |  8 ++++-
 net/netfilter/ipvs/ip_vs_core.c | 10 ++++--
 net/netfilter/ipvs/ip_vs_ctl.c  | 64 ++++++++++++++++++++++-----------
 3 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index ff1804a0c469..bd8ae137e43b 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -405,6 +405,11 @@ struct ip_vs_stats {
 	struct ip_vs_kstats	kstats0;	/* reset values */
 };
 
+struct ip_vs_stats_rcu {
+	struct ip_vs_stats	s;
+	struct rcu_head		rcu_head;
+};
+
 struct dst_entry;
 struct iphdr;
 struct ip_vs_conn;
@@ -688,6 +693,7 @@ struct ip_vs_dest {
 	union nf_inet_addr	vaddr;		/* virtual IP address */
 	__u32			vfwmark;	/* firewall mark of service */
 
+	struct rcu_head		rcu_head;
 	struct list_head	t_list;		/* in dest_trash */
 	unsigned int		in_rs_table:1;	/* we are in rs_table */
 };
@@ -869,7 +875,7 @@ struct netns_ipvs {
 	atomic_t		conn_count;      /* connection counter */
 
 	/* ip_vs_ctl */
-	struct ip_vs_stats		tot_stats;  /* Statistics & est. */
+	struct ip_vs_stats_rcu	*tot_stats;      /* Statistics & est. */
 
 	int			num_services;    /* no of virtual services */
 	int			num_services6;   /* IPv6 virtual services */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 51ad557a525b..fcdaef1fcccf 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -143,7 +143,7 @@ ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
 		s->cnt.inbytes += skb->len;
 		u64_stats_update_end(&s->syncp);
 
-		s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+		s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 		u64_stats_update_begin(&s->syncp);
 		s->cnt.inpkts++;
 		s->cnt.inbytes += skb->len;
@@ -179,7 +179,7 @@ ip_vs_out_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
 		s->cnt.outbytes += skb->len;
 		u64_stats_update_end(&s->syncp);
 
-		s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+		s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 		u64_stats_update_begin(&s->syncp);
 		s->cnt.outpkts++;
 		s->cnt.outbytes += skb->len;
@@ -208,7 +208,7 @@ ip_vs_conn_stats(struct ip_vs_conn *cp, struct ip_vs_service *svc)
 	s->cnt.conns++;
 	u64_stats_update_end(&s->syncp);
 
-	s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+	s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 	u64_stats_update_begin(&s->syncp);
 	s->cnt.conns++;
 	u64_stats_update_end(&s->syncp);
@@ -2448,6 +2448,10 @@ static void __exit ip_vs_cleanup(void)
 	ip_vs_conn_cleanup();
 	ip_vs_protocol_cleanup();
 	ip_vs_control_cleanup();
+	/* common rcu_barrier() used by:
+	 * - ip_vs_control_cleanup()
+	 */
+	rcu_barrier();
 	pr_info("ipvs unloaded.\n");
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index efab2b06d373..44c79fd1779c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -483,17 +483,14 @@ static void ip_vs_service_rcu_free(struct rcu_head *head)
 	ip_vs_service_free(svc);
 }
 
-static void __ip_vs_svc_put(struct ip_vs_service *svc, bool do_delay)
+static void __ip_vs_svc_put(struct ip_vs_service *svc)
 {
 	if (atomic_dec_and_test(&svc->refcnt)) {
 		IP_VS_DBG_BUF(3, "Removing service %u/%s:%u\n",
 			      svc->fwmark,
 			      IP_VS_DBG_ADDR(svc->af, &svc->addr),
 			      ntohs(svc->port));
-		if (do_delay)
-			call_rcu(&svc->rcu_head, ip_vs_service_rcu_free);
-		else
-			ip_vs_service_free(svc);
+		call_rcu(&svc->rcu_head, ip_vs_service_rcu_free);
 	}
 }
 
@@ -780,14 +777,22 @@ ip_vs_trash_get_dest(struct ip_vs_service *svc, int dest_af,
 	return dest;
 }
 
+static void ip_vs_dest_rcu_free(struct rcu_head *head)
+{
+	struct ip_vs_dest *dest;
+
+	dest = container_of(head, struct ip_vs_dest, rcu_head);
+	free_percpu(dest->stats.cpustats);
+	ip_vs_dest_put_and_free(dest);
+}
+
 static void ip_vs_dest_free(struct ip_vs_dest *dest)
 {
 	struct ip_vs_service *svc = rcu_dereference_protected(dest->svc, 1);
 
 	__ip_vs_dst_cache_reset(dest);
-	__ip_vs_svc_put(svc, false);
-	free_percpu(dest->stats.cpustats);
-	ip_vs_dest_put_and_free(dest);
+	__ip_vs_svc_put(svc);
+	call_rcu(&dest->rcu_head, ip_vs_dest_rcu_free);
 }
 
 /*
@@ -811,6 +816,16 @@ static void ip_vs_trash_cleanup(struct netns_ipvs *ipvs)
 	}
 }
 
+static void ip_vs_stats_rcu_free(struct rcu_head *head)
+{
+	struct ip_vs_stats_rcu *rs = container_of(head,
+						  struct ip_vs_stats_rcu,
+						  rcu_head);
+
+	free_percpu(rs->s.cpustats);
+	kfree(rs);
+}
+
 static void
 ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
 {
@@ -923,7 +938,7 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
 		if (old_svc != svc) {
 			ip_vs_zero_stats(&dest->stats);
 			__ip_vs_bind_svc(dest, svc);
-			__ip_vs_svc_put(old_svc, true);
+			__ip_vs_svc_put(old_svc);
 		}
 	}
 
@@ -1571,7 +1586,7 @@ static void __ip_vs_del_service(struct ip_vs_service *svc, bool cleanup)
 	/*
 	 *    Free the service if nobody refers to it
 	 */
-	__ip_vs_svc_put(svc, true);
+	__ip_vs_svc_put(svc);
 
 	/* decrease the module use count */
 	ip_vs_use_count_dec();
@@ -1761,7 +1776,7 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 		}
 	}
 
-	ip_vs_zero_stats(&ipvs->tot_stats);
+	ip_vs_zero_stats(&ipvs->tot_stats->s);
 	return 0;
 }
 
@@ -2255,7 +2270,7 @@ static int ip_vs_stats_show(struct seq_file *seq, void *v)
 	seq_puts(seq,
 		 "   Conns  Packets  Packets            Bytes            Bytes\n");
 
-	ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats);
+	ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats->s);
 	seq_printf(seq, "%8LX %8LX %8LX %16LX %16LX\n\n",
 		   (unsigned long long)show.conns,
 		   (unsigned long long)show.inpkts,
@@ -2279,7 +2294,7 @@ static int ip_vs_stats_show(struct seq_file *seq, void *v)
 static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq_file_single_net(seq);
-	struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats;
+	struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats->s;
 	struct ip_vs_cpu_stats __percpu *cpustats = tot_stats->cpustats;
 	struct ip_vs_kstats kstats;
 	int i;
@@ -4106,7 +4121,6 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 			kfree(tbl);
 		return -ENOMEM;
 	}
-	ip_vs_start_estimator(ipvs, &ipvs->tot_stats);
 	ipvs->sysctl_tbl = tbl;
 	/* Schedule defense work */
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
@@ -4117,6 +4131,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
 			  expire_nodest_conn_handler);
 
+	ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
 	return 0;
 }
 
@@ -4128,7 +4143,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats);
+	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s);
 
 	if (!net_eq(net, &init_net))
 		kfree(ipvs->sysctl_tbl);
@@ -4164,17 +4179,20 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 	atomic_set(&ipvs->conn_out_counter, 0);
 
 	/* procfs stats */
-	ipvs->tot_stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!ipvs->tot_stats.cpustats)
+	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
+	if (!ipvs->tot_stats)
 		return -ENOMEM;
+	ipvs->tot_stats->s.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
+	if (!ipvs->tot_stats->s.cpustats)
+		goto err_tot_stats;
 
 	for_each_possible_cpu(i) {
 		struct ip_vs_cpu_stats *ipvs_tot_stats;
-		ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats.cpustats, i);
+		ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats->s.cpustats, i);
 		u64_stats_init(&ipvs_tot_stats->syncp);
 	}
 
-	spin_lock_init(&ipvs->tot_stats.lock);
+	spin_lock_init(&ipvs->tot_stats->s.lock);
 
 #ifdef CONFIG_PROC_FS
 	if (!proc_create_net("ip_vs", 0, ipvs->net->proc_net,
@@ -4206,7 +4224,10 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 
 err_vs:
 #endif
-	free_percpu(ipvs->tot_stats.cpustats);
+	free_percpu(ipvs->tot_stats->s.cpustats);
+
+err_tot_stats:
+	kfree(ipvs->tot_stats);
 	return -ENOMEM;
 }
 
@@ -4219,7 +4240,7 @@ void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
 	remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs", ipvs->net->proc_net);
 #endif
-	free_percpu(ipvs->tot_stats.cpustats);
+	call_rcu(&ipvs->tot_stats->rcu_head, ip_vs_stats_rcu_free);
 }
 
 int __init ip_vs_register_nl_ioctl(void)
@@ -4279,5 +4300,6 @@ void ip_vs_control_cleanup(void)
 {
 	EnterFunction(2);
 	unregister_netdevice_notifier(&ip_vs_dst_notifier);
+	/* relying on common rcu_barrier() in ip_vs_cleanup() */
 	LeaveFunction(2);
 }
-- 
2.37.3




* [RFC PATCHv5 2/6] ipvs: use common functions for stats allocation
  2022-10-09 15:37 [RFC PATCHv5 0/6] ipvs: Use kthreads for stats Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 1/6] ipvs: add rcu protection to stats Julian Anastasov
@ 2022-10-09 15:37 ` Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation Julian Anastasov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Julian Anastasov @ 2022-10-09 15:37 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Move the alloc_percpu/free_percpu logic into new functions.
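
For context, a minimal usage sketch of the helpers added by this
patch (the function names are the ones defined below):

	/* stats embedded in another object (svc, dest, tot_stats):
	 * initialize the percpu part after kzalloc of the parent
	 */
	ret = ip_vs_stats_init_alloc(&dest->stats);
	if (ret < 0)
		goto err_alloc;
	...
	ip_vs_stats_release(&dest->stats);	/* free only the percpu part */

	/* standalone stats object */
	s = ip_vs_stats_alloc();		/* kzalloc + init_alloc */
	if (!s)
		return -ENOMEM;
	...
	ip_vs_stats_free(s);			/* release + kfree */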

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h            |  5 ++
 net/netfilter/ipvs/ip_vs_ctl.c | 96 +++++++++++++++++++---------------
 2 files changed, 60 insertions(+), 41 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index bd8ae137e43b..e5582c01a4a3 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -410,6 +410,11 @@ struct ip_vs_stats_rcu {
 	struct rcu_head		rcu_head;
 };
 
+int ip_vs_stats_init_alloc(struct ip_vs_stats *s);
+struct ip_vs_stats *ip_vs_stats_alloc(void);
+void ip_vs_stats_release(struct ip_vs_stats *stats);
+void ip_vs_stats_free(struct ip_vs_stats *stats);
+
 struct dst_entry;
 struct iphdr;
 struct ip_vs_conn;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 44c79fd1779c..28264911064a 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -471,7 +471,7 @@ __ip_vs_bind_svc(struct ip_vs_dest *dest, struct ip_vs_service *svc)
 
 static void ip_vs_service_free(struct ip_vs_service *svc)
 {
-	free_percpu(svc->stats.cpustats);
+	ip_vs_stats_release(&svc->stats);
 	kfree(svc);
 }
 
@@ -782,7 +782,7 @@ static void ip_vs_dest_rcu_free(struct rcu_head *head)
 	struct ip_vs_dest *dest;
 
 	dest = container_of(head, struct ip_vs_dest, rcu_head);
-	free_percpu(dest->stats.cpustats);
+	ip_vs_stats_release(&dest->stats);
 	ip_vs_dest_put_and_free(dest);
 }
 
@@ -822,7 +822,7 @@ static void ip_vs_stats_rcu_free(struct rcu_head *head)
 						  struct ip_vs_stats_rcu,
 						  rcu_head);
 
-	free_percpu(rs->s.cpustats);
+	ip_vs_stats_release(&rs->s);
 	kfree(rs);
 }
 
@@ -879,6 +879,47 @@ ip_vs_zero_stats(struct ip_vs_stats *stats)
 	spin_unlock_bh(&stats->lock);
 }
 
+/* Allocate fields after kzalloc */
+int ip_vs_stats_init_alloc(struct ip_vs_stats *s)
+{
+	int i;
+
+	spin_lock_init(&s->lock);
+	s->cpustats = alloc_percpu(struct ip_vs_cpu_stats);
+	if (!s->cpustats)
+		return -ENOMEM;
+
+	for_each_possible_cpu(i) {
+		struct ip_vs_cpu_stats *cs = per_cpu_ptr(s->cpustats, i);
+
+		u64_stats_init(&cs->syncp);
+	}
+	return 0;
+}
+
+struct ip_vs_stats *ip_vs_stats_alloc(void)
+{
+	struct ip_vs_stats *s = kzalloc(sizeof(*s), GFP_KERNEL);
+
+	if (s && ip_vs_stats_init_alloc(s) >= 0)
+		return s;
+	kfree(s);
+	return NULL;
+}
+
+void ip_vs_stats_release(struct ip_vs_stats *stats)
+{
+	free_percpu(stats->cpustats);
+}
+
+void ip_vs_stats_free(struct ip_vs_stats *stats)
+{
+	if (stats) {
+		ip_vs_stats_release(stats);
+		kfree(stats);
+	}
+}
+
 /*
  *	Update a destination in the given service
  */
@@ -978,14 +1019,13 @@ static int
 ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 {
 	struct ip_vs_dest *dest;
-	unsigned int atype, i;
+	unsigned int atype;
+	int ret;
 
 	EnterFunction(2);
 
 #ifdef CONFIG_IP_VS_IPV6
 	if (udest->af == AF_INET6) {
-		int ret;
-
 		atype = ipv6_addr_type(&udest->addr.in6);
 		if ((!(atype & IPV6_ADDR_UNICAST) ||
 			atype & IPV6_ADDR_LINKLOCAL) &&
@@ -1007,16 +1047,10 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 	if (dest == NULL)
 		return -ENOMEM;
 
-	dest->stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!dest->stats.cpustats)
+	ret = ip_vs_stats_init_alloc(&dest->stats);
+	if (ret < 0)
 		goto err_alloc;
 
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *ip_vs_dest_stats;
-		ip_vs_dest_stats = per_cpu_ptr(dest->stats.cpustats, i);
-		u64_stats_init(&ip_vs_dest_stats->syncp);
-	}
-
 	dest->af = udest->af;
 	dest->protocol = svc->protocol;
 	dest->vaddr = svc->addr;
@@ -1032,7 +1066,6 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 
 	INIT_HLIST_NODE(&dest->d_list);
 	spin_lock_init(&dest->dst_lock);
-	spin_lock_init(&dest->stats.lock);
 	__ip_vs_update_dest(svc, dest, udest, 1);
 
 	LeaveFunction(2);
@@ -1040,7 +1073,7 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 
 err_alloc:
 	kfree(dest);
-	return -ENOMEM;
+	return ret;
 }
 
 
@@ -1299,7 +1332,7 @@ static int
 ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 		  struct ip_vs_service **svc_p)
 {
-	int ret = 0, i;
+	int ret = 0;
 	struct ip_vs_scheduler *sched = NULL;
 	struct ip_vs_pe *pe = NULL;
 	struct ip_vs_service *svc = NULL;
@@ -1359,18 +1392,9 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 		ret = -ENOMEM;
 		goto out_err;
 	}
-	svc->stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!svc->stats.cpustats) {
-		ret = -ENOMEM;
+	ret = ip_vs_stats_init_alloc(&svc->stats);
+	if (ret < 0)
 		goto out_err;
-	}
-
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *ip_vs_stats;
-		ip_vs_stats = per_cpu_ptr(svc->stats.cpustats, i);
-		u64_stats_init(&ip_vs_stats->syncp);
-	}
-
 
 	/* I'm the first user of the service */
 	atomic_set(&svc->refcnt, 0);
@@ -1387,7 +1411,6 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 
 	INIT_LIST_HEAD(&svc->destinations);
 	spin_lock_init(&svc->sched_lock);
-	spin_lock_init(&svc->stats.lock);
 
 	/* Bind the scheduler */
 	if (sched) {
@@ -4165,7 +4188,7 @@ static struct notifier_block ip_vs_dst_notifier = {
 
 int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 {
-	int i, idx;
+	int idx;
 
 	/* Initialize rs_table */
 	for (idx = 0; idx < IP_VS_RTAB_SIZE; idx++)
@@ -4182,18 +4205,9 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
 	if (!ipvs->tot_stats)
 		return -ENOMEM;
-	ipvs->tot_stats->s.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!ipvs->tot_stats->s.cpustats)
+	if (ip_vs_stats_init_alloc(&ipvs->tot_stats->s) < 0)
 		goto err_tot_stats;
 
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *ipvs_tot_stats;
-		ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats->s.cpustats, i);
-		u64_stats_init(&ipvs_tot_stats->syncp);
-	}
-
-	spin_lock_init(&ipvs->tot_stats->s.lock);


* [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-09 15:37 [RFC PATCHv5 0/6] ipvs: Use kthreads for stats Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 1/6] ipvs: add rcu protection to stats Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 2/6] ipvs: use common functions for stats allocation Julian Anastasov
@ 2022-10-09 15:37 ` Julian Anastasov
  2022-10-15  9:21   ` Jiri Wiesner
  2022-10-09 15:37 ` [RFC PATCHv5 4/6] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Julian Anastasov @ 2022-10-09 15:37 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Estimating all entries in a single list in timer context
causes large latency with multiple rules.

Spread the estimator structures over multiple chains and
use kthread(s) for the estimation. Every chain is
processed under RCU lock. The first kthread determines the
parameters to use, e.g. the maximum number of estimators to
process per kthread based on the chain length, allowing a
sub-100us cond_resched rate and estimation taking 1/8
of the CPU.

The first kthread also plays the role of distributor of
added estimators to all kthreads.

We also add the delayed work est_reload_work that makes
sure the kthread tasks are properly started/stopped.

ip_vs_start_estimator() is changed to report errors, which
allows us to safely store the estimators in
allocated structures.
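
Roughly, the main loop of each kthread looks like this (a condensed
sketch of ip_vs_estimation_kthread() from this patch; the calc phase
and the est_timer resync details are omitted):

	while (1) {
		/* kthread 0 also distributes newly added estimators */
		if (!id && !hlist_empty(&ipvs->est_temp_list))
			ip_vs_est_drain_temp_list(ipvs);
		set_current_state(TASK_IDLE);
		if (kthread_should_stop())
			break;

		/* sleep until the next tick */
		gap = kd->est_timer - jiffies;
		if (gap > 0)
			schedule_timeout(min(gap, (long)IPVS_EST_TICK));
		else
			__set_current_state(TASK_RUNNING);

		/* walk this tick's chains under RCU, cond_resched per chain */
		if (sysctl_run_estimation(ipvs) && kd->tick_len[row])
			ip_vs_tick_estimation(kd, row);

		row = (row + 1) % IPVS_EST_NTICKS;
		kd->est_timer += IPVS_EST_TICK;
	}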

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h            |  74 ++-
 net/netfilter/ipvs/ip_vs_ctl.c | 130 ++++--
 net/netfilter/ipvs/ip_vs_est.c | 831 ++++++++++++++++++++++++++++++---
 3 files changed, 933 insertions(+), 102 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index e5582c01a4a3..f18bc398472b 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -42,6 +42,8 @@ static inline struct netns_ipvs *net_ipvs(struct net* net)
 /* Connections' size value needed by ip_vs_ctl.c */
 extern int ip_vs_conn_tab_size;
 
+extern struct mutex __ip_vs_mutex;
+
 struct ip_vs_iphdr {
 	int hdr_flags;	/* ipvs flags */
 	__u32 off;	/* Where IP or IPv4 header starts */
@@ -365,7 +367,7 @@ struct ip_vs_cpu_stats {
 
 /* IPVS statistics objects */
 struct ip_vs_estimator {
-	struct list_head	list;
+	struct hlist_node	list;
 
 	u64			last_inbytes;
 	u64			last_outbytes;
@@ -378,6 +380,10 @@ struct ip_vs_estimator {
 	u64			outpps;
 	u64			inbps;
 	u64			outbps;
+
+	s32			ktid:16,	/* kthread ID, -1=temp list */
+				ktrow:8,	/* row ID for kthread */
+				ktcid:8;	/* chain ID for kthread */
 };
 
 /*
@@ -415,6 +421,52 @@ struct ip_vs_stats *ip_vs_stats_alloc(void);
 void ip_vs_stats_release(struct ip_vs_stats *stats);
 void ip_vs_stats_free(struct ip_vs_stats *stats);
 
+/* Process estimators in multiple timer ticks */
+#define IPVS_EST_NTICKS		50
+/* Estimation uses a 2-second period */
+#define IPVS_EST_TICK		((2 * HZ) / IPVS_EST_NTICKS)
+
+/* Desired number of chains per tick (chain load factor in 100us units),
+ * 48=4.8ms of 40ms tick (12% CPU usage):
+ * 2 sec * 1000 ms in sec * 10 (100us in ms) / 8 (12.5%) / IPVS_EST_NTICKS
+ */
+#define IPVS_EST_CHAIN_FACTOR	ALIGN_DOWN(2 * 1000 * 10 / 8 / IPVS_EST_NTICKS, 8)
+
+/* Compiled number of chains per tick
+ * The defines should match cond_resched_rcu
+ */
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
+#define IPVS_EST_TICK_CHAINS	IPVS_EST_CHAIN_FACTOR
+#else
+#define IPVS_EST_TICK_CHAINS	1
+#endif
+
+/* Multiple chains processed in same tick */
+struct ip_vs_est_tick_data {
+	struct hlist_head	chains[IPVS_EST_TICK_CHAINS];
+	DECLARE_BITMAP(present, IPVS_EST_TICK_CHAINS);
+	DECLARE_BITMAP(full, IPVS_EST_TICK_CHAINS);
+	int			chain_len[IPVS_EST_TICK_CHAINS];
+};
+
+/* Context for estimation kthread */
+struct ip_vs_est_kt_data {
+	struct netns_ipvs	*ipvs;
+	struct task_struct	*task;		/* task if running */
+	struct ip_vs_est_tick_data __rcu *ticks[IPVS_EST_NTICKS];
+	DECLARE_BITMAP(avail, IPVS_EST_NTICKS);	/* tick has space for ests */
+	unsigned long		est_timer;	/* estimation timer (jiffies) */
+	struct ip_vs_stats	*calc_stats;	/* Used for calculation */
+	int			tick_len[IPVS_EST_NTICKS];	/* est count */
+	int			id;		/* ktid per netns */
+	int			chain_max;	/* max ests per tick chain */
+	int			tick_max;	/* max ests per tick */
+	int			est_count;	/* attached ests to kthread */
+	int			est_max_count;	/* max ests per kthread */
+	int			add_row;	/* row for new ests */
+	int			est_row;	/* estimated row */
+};
+
 struct dst_entry;
 struct iphdr;
 struct ip_vs_conn;
@@ -953,9 +1005,17 @@ struct netns_ipvs {
 	struct ctl_table_header	*lblcr_ctl_header;
 	struct ctl_table	*lblcr_ctl_table;
 	/* ip_vs_est */
-	struct list_head	est_list;	/* estimator list */
-	spinlock_t		est_lock;
-	struct timer_list	est_timer;	/* Estimation timer */
+	struct delayed_work	est_reload_work;/* Reload kthread tasks */
+	struct mutex		est_mutex;	/* protect kthread tasks */
+	struct hlist_head	est_temp_list;	/* Ests during calc phase */
+	struct ip_vs_est_kt_data **est_kt_arr;	/* Array of kthread data ptrs */
+	unsigned long		est_max_threads;/* rlimit */
+	int			est_calc_phase;	/* Calculation phase */
+	int			est_chain_max;	/* Calculated chain_max */
+	int			est_kt_count;	/* Allocated ptrs */
+	int			est_add_ktid;	/* ktid where to add ests */
+	atomic_t		est_genid;	/* kthreads reload genid */
+	atomic_t		est_genid_done;	/* applied genid */
 	/* ip_vs_sync */
 	spinlock_t		sync_lock;
 	struct ipvs_master_sync_state *ms;
@@ -1486,10 +1546,14 @@ int stop_sync_thread(struct netns_ipvs *ipvs, int state);
 void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct ip_vs_conn *cp, int pkts);
 
 /* IPVS rate estimator prototypes (from ip_vs_est.c) */
-void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
+int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
 void ip_vs_zero_estimator(struct ip_vs_stats *stats);
 void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs);
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+			    struct ip_vs_est_kt_data *kd);
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
 
 /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
 int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 28264911064a..f30f2488de5d 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -49,8 +49,7 @@
 
 MODULE_ALIAS_GENL_FAMILY(IPVS_GENL_NAME);
 
-/* semaphore for IPVS sockopts. And, [gs]etsockopt may sleep. */
-static DEFINE_MUTEX(__ip_vs_mutex);
+DEFINE_MUTEX(__ip_vs_mutex); /* Serialize configuration with sockopt/netlink */
 
 /* sysctl variables */
 
@@ -241,6 +240,47 @@ static void defense_work_handler(struct work_struct *work)
 }
 #endif
 
+static void est_reload_work_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs =
+		container_of(work, struct netns_ipvs, est_reload_work.work);
+	int genid_done = atomic_read(&ipvs->est_genid_done);
+	unsigned long delay = HZ / 10;	/* repeat startups after failure */
+	bool repeat = false;
+	int genid;
+	int id;
+
+	mutex_lock(&ipvs->est_mutex);
+	genid = atomic_read(&ipvs->est_genid);
+	for (id = 0; id < ipvs->est_kt_count; id++) {
+		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id];
+
+		/* netns clean up started, abort delayed work */
+		if (!ipvs->enable)
+			goto unlock;
+		if (!kd)
+			continue;
+		/* New config ? Stop kthread tasks */
+		if (genid != genid_done)
+			ip_vs_est_kthread_stop(kd);
+		if (!kd->task) {
+			/* Do not start kthreads above 0 in calc phase */
+			if ((!id || !ipvs->est_calc_phase) &&
+			    ip_vs_est_kthread_start(ipvs, kd) < 0)
+				repeat = true;
+		}
+	}
+
+	atomic_set(&ipvs->est_genid_done, genid);
+
+	if (repeat)
+		queue_delayed_work(system_long_wq, &ipvs->est_reload_work,
+				   delay);
+
+unlock:
+	mutex_unlock(&ipvs->est_mutex);
+}
+
 int
 ip_vs_use_count_inc(void)
 {
@@ -831,7 +871,7 @@ ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
 {
 #define IP_VS_SHOW_STATS_COUNTER(c) dst->c = src->kstats.c - src->kstats0.c
 
-	spin_lock_bh(&src->lock);
+	spin_lock(&src->lock);
 
 	IP_VS_SHOW_STATS_COUNTER(conns);
 	IP_VS_SHOW_STATS_COUNTER(inpkts);
@@ -841,7 +881,7 @@ ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
 
 	ip_vs_read_estimator(dst, src);
 
-	spin_unlock_bh(&src->lock);
+	spin_unlock(&src->lock);
 }
 
 static void
@@ -862,7 +902,7 @@ ip_vs_export_stats_user(struct ip_vs_stats_user *dst, struct ip_vs_kstats *src)
 static void
 ip_vs_zero_stats(struct ip_vs_stats *stats)
 {
-	spin_lock_bh(&stats->lock);
+	spin_lock(&stats->lock);
 
 	/* get current counters as zero point, rates are zeroed */
 
@@ -876,7 +916,7 @@ ip_vs_zero_stats(struct ip_vs_stats *stats)
 
 	ip_vs_zero_estimator(stats);
 
-	spin_unlock_bh(&stats->lock);
+	spin_unlock(&stats->lock);
 }
 
 /* Allocate fields after kzalloc */
@@ -998,7 +1038,6 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
 	spin_unlock_bh(&dest->dst_lock);
 
 	if (add) {
-		ip_vs_start_estimator(svc->ipvs, &dest->stats);
 		list_add_rcu(&dest->n_list, &svc->destinations);
 		svc->num_dests++;
 		sched = rcu_dereference_protected(svc->scheduler, 1);
@@ -1051,6 +1090,10 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 	if (ret < 0)
 		goto err_alloc;
 
+	ret = ip_vs_start_estimator(svc->ipvs, &dest->stats);
+	if (ret < 0)
+		goto err_stats;
+
 	dest->af = udest->af;
 	dest->protocol = svc->protocol;
 	dest->vaddr = svc->addr;
@@ -1071,6 +1114,9 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 	LeaveFunction(2);
 	return 0;
 
+err_stats:
+	ip_vs_stats_release(&dest->stats);
+
 err_alloc:
 	kfree(dest);
 	return ret;
@@ -1135,14 +1181,18 @@ ip_vs_add_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 			      IP_VS_DBG_ADDR(svc->af, &dest->vaddr),
 			      ntohs(dest->vport));
 
+		ret = ip_vs_start_estimator(svc->ipvs, &dest->stats);
+		if (ret < 0)
+			goto err;
 		__ip_vs_update_dest(svc, dest, udest, 1);
-		ret = 0;
 	} else {
 		/*
 		 * Allocate and initialize the dest structure
 		 */
 		ret = ip_vs_new_dest(svc, udest);
 	}
+
+err:
 	LeaveFunction(2);
 
 	return ret;
@@ -1420,6 +1470,10 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 		sched = NULL;
 	}
 
+	ret = ip_vs_start_estimator(ipvs, &svc->stats);
+	if (ret < 0)
+		goto out_err;
+
 	/* Bind the ct retriever */
 	RCU_INIT_POINTER(svc->pe, pe);
 	pe = NULL;
@@ -1432,8 +1486,6 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 	if (svc->pe && svc->pe->conn_out)
 		atomic_inc(&ipvs->conn_out_counter);
 
-	ip_vs_start_estimator(ipvs, &svc->stats);
-
 	/* Count only IPv4 services for old get/setsockopt interface */
 	if (svc->af == AF_INET)
 		ipvs->num_services++;
@@ -1444,8 +1496,15 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 	ip_vs_svc_hash(svc);
 
 	*svc_p = svc;
-	/* Now there is a service - full throttle */
-	ipvs->enable = 1;
+
+	if (!ipvs->enable) {
+		/* Now there is a service - full throttle */
+		ipvs->enable = 1;
+
+		/* Start estimation for first time */
+		ip_vs_est_reload_start(ipvs);
+	}
+
 	return 0;
 
 
@@ -2334,13 +2393,13 @@ static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v)
 		u64 conns, inpkts, outpkts, inbytes, outbytes;
 
 		do {
-			start = u64_stats_fetch_begin_irq(&u->syncp);
+			start = u64_stats_fetch_begin(&u->syncp);
 			conns = u->cnt.conns;
 			inpkts = u->cnt.inpkts;
 			outpkts = u->cnt.outpkts;
 			inbytes = u->cnt.inbytes;
 			outbytes = u->cnt.outbytes;
-		} while (u64_stats_fetch_retry_irq(&u->syncp, start));
+		} while (u64_stats_fetch_retry(&u->syncp, start));
 
 		seq_printf(seq, "%3X %8LX %8LX %8LX %16LX %16LX\n",
 			   i, (u64)conns, (u64)inpkts,
@@ -4064,13 +4123,16 @@ static void ip_vs_genl_unregister(void)
 static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
-	int idx;
 	struct ctl_table *tbl;
+	int idx, ret;
 
 	atomic_set(&ipvs->dropentry, 0);
 	spin_lock_init(&ipvs->dropentry_lock);
 	spin_lock_init(&ipvs->droppacket_lock);
 	spin_lock_init(&ipvs->securetcp_lock);
+	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
 
 	if (!net_eq(net, &init_net)) {
 		tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
@@ -4138,24 +4200,27 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 		tbl[idx++].mode = 0444;
 #endif
 
+	ret = -ENOMEM;
 	ipvs->sysctl_hdr = register_net_sysctl(net, "net/ipv4/vs", tbl);
-	if (ipvs->sysctl_hdr == NULL) {
-		if (!net_eq(net, &init_net))
-			kfree(tbl);
-		return -ENOMEM;
-	}
+	if (!ipvs->sysctl_hdr)
+		goto err;
 	ipvs->sysctl_tbl = tbl;
+
+	ret = ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
+	if (ret < 0)
+		goto err;
+
 	/* Schedule defense work */
-	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
 	queue_delayed_work(system_long_wq, &ipvs->defense_work,
 			   DEFENSE_TIMER_PERIOD);
 
-	/* Init delayed work for expiring no dest conn */
-	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
-			  expire_nodest_conn_handler);
-
-	ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
 	return 0;
+
+err:
+	unregister_net_sysctl_table(ipvs->sysctl_hdr);
+	if (!net_eq(net, &init_net))
+		kfree(tbl);
+	return ret;
 }
 
 static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
@@ -4188,6 +4253,7 @@ static struct notifier_block ip_vs_dst_notifier = {
 
 int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 {
+	int ret = -ENOMEM;
 	int idx;
 
 	/* Initialize rs_table */
@@ -4201,10 +4267,12 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 	atomic_set(&ipvs->nullsvc_counter, 0);
 	atomic_set(&ipvs->conn_out_counter, 0);
 
+	INIT_DELAYED_WORK(&ipvs->est_reload_work, est_reload_work_handler);
+
 	/* procfs stats */
 	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
 	if (!ipvs->tot_stats)
-		return -ENOMEM;
+		goto out;
 	if (ip_vs_stats_init_alloc(&ipvs->tot_stats->s) < 0)
 		goto err_tot_stats;
 
@@ -4221,7 +4289,8 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 		goto err_percpu;
 #endif
 
-	if (ip_vs_control_net_init_sysctl(ipvs))
+	ret = ip_vs_control_net_init_sysctl(ipvs);
+	if (ret < 0)
 		goto err;
 
 	return 0;
@@ -4242,13 +4311,16 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 
 err_tot_stats:
 	kfree(ipvs->tot_stats);
-	return -ENOMEM;
+
+out:
+	return ret;
 }
 
 void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
 {
 	ip_vs_trash_cleanup(ipvs);
 	ip_vs_control_net_cleanup_sysctl(ipvs);
+	cancel_delayed_work_sync(&ipvs->est_reload_work);
 #ifdef CONFIG_PROC_FS
 	remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 9a1a7af6a186..99a85ee9cd1e 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -30,9 +30,6 @@
   long interval, it is easy to implement a user level daemon which
   periodically reads those statistical counters and measure rate.
 
-  Currently, the measurement is activated by slow timer handler. Hope
-  this measurement will not introduce too much load.
-
   We measure rate during the last 8 seconds every 2 seconds:
 
     avgrate = avgrate*(1-W) + rate*W
@@ -47,68 +44,76 @@
     to 32-bit values for conns, packets, bps, cps and pps.
 
   * A lot of code is taken from net/core/gen_estimator.c
- */
-
 
-/*
- * Make a summary from each cpu
+  KEY POINTS:
+  - cpustats counters are updated per-cpu in SoftIRQ context with BH disabled
+  - kthreads read the cpustats to update the estimators (svcs, dests, total)
+  - the states of estimators can be read (get stats) or modified (zero stats)
+    from processes
+
+  KTHREADS:
+  - estimators are added initially to est_temp_list and later kthread 0
+    distributes them to one or many kthreads for estimation
+  - kthread contexts are created and attached to array
+  - the kthread tasks are started when first service is added, before that
+    the total stats are not estimated
+  - the kthread context holds lists with estimators (chains) which are
+    processed every 2 seconds
+  - as estimators can be added dynamically and in bursts, we try to spread
+    them to multiple chains which are estimated at different time
+  - on start, kthread 0 enters calculation phase to determine the chain limits
+    and the limit of estimators per kthread
+  - est_add_ktid: ktid where to add new ests, can point to empty slot where
+    we should add kt data
  */
-static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
-				 struct ip_vs_cpu_stats __percpu *stats)
-{
-	int i;
-	bool add = false;
 
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i);
-		unsigned int start;
-		u64 conns, inpkts, outpkts, inbytes, outbytes;
-
-		if (add) {
-			do {
-				start = u64_stats_fetch_begin(&s->syncp);
-				conns = s->cnt.conns;
-				inpkts = s->cnt.inpkts;
-				outpkts = s->cnt.outpkts;
-				inbytes = s->cnt.inbytes;
-				outbytes = s->cnt.outbytes;
-			} while (u64_stats_fetch_retry(&s->syncp, start));
-			sum->conns += conns;
-			sum->inpkts += inpkts;
-			sum->outpkts += outpkts;
-			sum->inbytes += inbytes;
-			sum->outbytes += outbytes;
-		} else {
-			add = true;
-			do {
-				start = u64_stats_fetch_begin(&s->syncp);
-				sum->conns = s->cnt.conns;
-				sum->inpkts = s->cnt.inpkts;
-				sum->outpkts = s->cnt.outpkts;
-				sum->inbytes = s->cnt.inbytes;
-				sum->outbytes = s->cnt.outbytes;
-			} while (u64_stats_fetch_retry(&s->syncp, start));
-		}
-	}
-}
+static struct lock_class_key __ipvs_est_key;
 
+static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs);
+static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs);
 
-static void estimation_timer(struct timer_list *t)
+static void ip_vs_chain_estimation(struct hlist_head *chain)
 {
 	struct ip_vs_estimator *e;
+	struct ip_vs_cpu_stats *c;
 	struct ip_vs_stats *s;
 	u64 rate;
-	struct netns_ipvs *ipvs = from_timer(ipvs, t, est_timer);
 
-	if (!sysctl_run_estimation(ipvs))
-		goto skip;
+	hlist_for_each_entry_rcu(e, chain, list) {
+		u64 conns, inpkts, outpkts, inbytes, outbytes;
+		u64 kconns = 0, kinpkts = 0, koutpkts = 0;
+		u64 kinbytes = 0, koutbytes = 0;
+		unsigned int start;
+		int i;
+
+		if (kthread_should_stop())
+			break;
 
-	spin_lock(&ipvs->est_lock);
-	list_for_each_entry(e, &ipvs->est_list, list) {
 		s = container_of(e, struct ip_vs_stats, est);
+		for_each_possible_cpu(i) {
+			c = per_cpu_ptr(s->cpustats, i);
+			do {
+				start = u64_stats_fetch_begin(&c->syncp);
+				conns = c->cnt.conns;
+				inpkts = c->cnt.inpkts;
+				outpkts = c->cnt.outpkts;
+				inbytes = c->cnt.inbytes;
+				outbytes = c->cnt.outbytes;
+			} while (u64_stats_fetch_retry(&c->syncp, start));
+			kconns += conns;
+			kinpkts += inpkts;
+			koutpkts += outpkts;
+			kinbytes += inbytes;
+			koutbytes += outbytes;
+		}
 
 		spin_lock(&s->lock);
-		ip_vs_read_cpu_stats(&s->kstats, s->cpustats);
+
+		s->kstats.conns = kconns;
+		s->kstats.inpkts = kinpkts;
+		s->kstats.outpkts = koutpkts;
+		s->kstats.inbytes = kinbytes;
+		s->kstats.outbytes = koutbytes;
 
 		/* scaled by 2^10, but divided 2 seconds */
 		rate = (s->kstats.conns - e->last_conns) << 9;
@@ -133,30 +138,709 @@ static void estimation_timer(struct timer_list *t)
 		e->outbps += ((s64)rate - (s64)e->outbps) >> 2;
 		spin_unlock(&s->lock);
 	}
-	spin_unlock(&ipvs->est_lock);
+}
+
+static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
+{
+	struct ip_vs_est_tick_data *td;
+	int cid;
+
+	rcu_read_lock();
+	td = rcu_dereference(kd->ticks[row]);
+	if (!td)
+		goto out;
+	for_each_set_bit(cid, td->present, IPVS_EST_TICK_CHAINS) {
+		if (kthread_should_stop())
+			break;
+		ip_vs_chain_estimation(&td->chains[cid]);
+		cond_resched_rcu();
+		td = rcu_dereference(kd->ticks[row]);
+		if (!td)
+			break;
+	}
 
-skip:
-	mod_timer(&ipvs->est_timer, jiffies + 2*HZ);
+out:
+	rcu_read_unlock();
 }
 
-void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
+static int ip_vs_estimation_kthread(void *data)
 {
-	struct ip_vs_estimator *est = &stats->est;
+	struct ip_vs_est_kt_data *kd = data;
+	struct netns_ipvs *ipvs = kd->ipvs;
+	int row = kd->est_row;
+	unsigned long now;
+	int id = kd->id;
+	long gap;
+
+	if (id > 0) {
+		if (!ipvs->est_chain_max)
+			return 0;
+	} else {
+		if (!ipvs->est_chain_max) {
+			ipvs->est_calc_phase = 1;
+			/* commit est_calc_phase before reading est_genid */
+			smp_mb();
+		}
+
+		/* kthread 0 will handle the calc phase */
+		if (ipvs->est_calc_phase)
+			ip_vs_est_calc_phase(ipvs);
+	}
 
-	INIT_LIST_HEAD(&est->list);
+	while (1) {
+		if (!id && !hlist_empty(&ipvs->est_temp_list))
+			ip_vs_est_drain_temp_list(ipvs);
+		set_current_state(TASK_IDLE);
+		if (kthread_should_stop())
+			break;
+
+		/* before estimation, check if we should sleep */
+		now = jiffies;
+		gap = kd->est_timer - now;
+		if (gap > 0) {
+			if (gap > IPVS_EST_TICK) {
+				kd->est_timer = now - IPVS_EST_TICK;
+				gap = IPVS_EST_TICK;
+			}
+			schedule_timeout(gap);
+		} else {
+			__set_current_state(TASK_RUNNING);
+			if (gap < -8 * IPVS_EST_TICK)
+				kd->est_timer = now;
+		}
 
-	spin_lock_bh(&ipvs->est_lock);
-	list_add(&est->list, &ipvs->est_list);
-	spin_unlock_bh(&ipvs->est_lock);
+		if (sysctl_run_estimation(ipvs) && kd->tick_len[row])
+			ip_vs_tick_estimation(kd, row);
+
+		row++;
+		if (row >= IPVS_EST_NTICKS)
+			row = 0;
+		WRITE_ONCE(kd->est_row, row);
+		kd->est_timer += IPVS_EST_TICK;
+	}
+	__set_current_state(TASK_RUNNING);
+
+	return 0;
+}
+
+/* Schedule stop/start for kthread tasks */
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs)
+{
+	/* Ignore reloads before first service is added */
+	if (!ipvs->enable)
+		return;
+	/* Bump the kthread configuration genid */
+	atomic_inc(&ipvs->est_genid);
+	queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 0);
+}
+
+/* Start kthread task with current configuration */
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+			    struct ip_vs_est_kt_data *kd)
+{
+	unsigned long now;
+	int ret = 0;
+	long gap;
+
+	lockdep_assert_held(&ipvs->est_mutex);
+
+	if (kd->task)
+		goto out;
+	now = jiffies;
+	gap = kd->est_timer - now;
+	/* Sync est_timer if task is starting later */
+	if (abs(gap) > 4 * IPVS_EST_TICK)
+		kd->est_timer = now;
+	kd->task = kthread_create(ip_vs_estimation_kthread, kd, "ipvs-e:%d:%d",
+				  ipvs->gen, kd->id);
+	if (IS_ERR(kd->task)) {
+		ret = PTR_ERR(kd->task);
+		kd->task = NULL;
+		goto out;
+	}
+
+	pr_info("starting estimator thread %d...\n", kd->id);
+	wake_up_process(kd->task);
+
+out:
+	return ret;
+}
+
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd)
+{
+	if (kd->task) {
+		pr_info("stopping estimator thread %d...\n", kd->id);
+		kthread_stop(kd->task);
+		kd->task = NULL;
+	}
+}
+
+/* Apply parameters to kthread */
+static void ip_vs_est_set_params(struct netns_ipvs *ipvs,
+				 struct ip_vs_est_kt_data *kd)
+{
+	kd->chain_max = ipvs->est_chain_max;
+	/* We are using single chain on RCU preemption */
+	if (IPVS_EST_TICK_CHAINS == 1)
+		kd->chain_max *= IPVS_EST_CHAIN_FACTOR;
+	kd->tick_max = IPVS_EST_TICK_CHAINS * kd->chain_max;
+	kd->est_max_count = IPVS_EST_NTICKS * kd->tick_max;
 }
 
+/* Create and start estimation kthread in a free or new array slot */
+static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
+{
+	struct ip_vs_est_kt_data *kd = NULL;
+	int id = ipvs->est_kt_count;
+	int ret = -ENOMEM;
+	void *arr = NULL;
+	int i;
+
+	if ((unsigned long)ipvs->est_kt_count >= ipvs->est_max_threads &&
+	    ipvs->enable && ipvs->est_max_threads)
+		return -EINVAL;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	for (i = 0; i < id; i++) {
+		if (!ipvs->est_kt_arr[i])
+			break;
+	}
+	if (i >= id) {
+		arr = krealloc_array(ipvs->est_kt_arr, id + 1,
+				     sizeof(struct ip_vs_est_kt_data *),
+				     GFP_KERNEL);
+		if (!arr)
+			goto out;
+		ipvs->est_kt_arr = arr;
+	} else {
+		id = i;
+	}
+
+	kd = kzalloc(sizeof(*kd), GFP_KERNEL);
+	if (!kd)
+		goto out;
+	kd->ipvs = ipvs;
+	bitmap_fill(kd->avail, IPVS_EST_NTICKS);
+	kd->est_timer = jiffies;
+	kd->id = id;
+	ip_vs_est_set_params(ipvs, kd);
+
+	/* Pre-allocate stats used in calc phase */
+	if (!id && !kd->calc_stats) {
+		kd->calc_stats = ip_vs_stats_alloc();
+		if (!kd->calc_stats)
+			goto out;
+	}
+
+	/* Start kthread tasks only when services are present */
+	if (ipvs->enable) {
+		ret = ip_vs_est_kthread_start(ipvs, kd);
+		if (ret < 0)
+			goto out;
+	}
+
+	if (arr)
+		ipvs->est_kt_count++;
+	ipvs->est_kt_arr[id] = kd;
+	kd = NULL;
+	/* Use most recent kthread for new ests */
+	ipvs->est_add_ktid = id;
+	ret = 0;
+
+out:
+	mutex_unlock(&ipvs->est_mutex);
+	if (kd) {
+		ip_vs_stats_free(kd->calc_stats);
+		kfree(kd);
+	}
+
+	return ret;
+}
+
+/* Select ktid where to add new ests: available, unused or new slot */
+static void ip_vs_est_update_ktid(struct netns_ipvs *ipvs)
+{
+	int ktid, best = ipvs->est_kt_count;
+	struct ip_vs_est_kt_data *kd;
+
+	for (ktid = 0; ktid < ipvs->est_kt_count; ktid++) {
+		kd = ipvs->est_kt_arr[ktid];
+		if (kd) {
+			if (kd->est_count < kd->est_max_count) {
+				best = ktid;
+				break;
+			}
+		} else if (ktid < best) {
+			best = ktid;
+		}
+	}
+	ipvs->est_add_ktid = best;
+}
+
+/* Add estimator to current kthread (est_add_ktid) */
+static int ip_vs_enqueue_estimator(struct netns_ipvs *ipvs,
+				   struct ip_vs_estimator *est)
+{
+	struct ip_vs_est_kt_data *kd = NULL;
+	struct ip_vs_est_tick_data *td;
+	int ktid, row, crow, cid, ret;
+
+	if (ipvs->est_add_ktid < ipvs->est_kt_count) {
+		kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+		if (kd)
+			goto add_est;
+	}
+
+	ret = ip_vs_est_add_kthread(ipvs);
+	if (ret < 0)
+		goto out;
+	kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+
+add_est:
+	ktid = kd->id;
+	/* For small number of estimators prefer to use few ticks,
+	 * otherwise try to add into the last estimated row.
+	 * est_row and add_row point after the row we should use
+	 */
+	if (kd->est_count >= 2 * kd->tick_max)
+		crow = READ_ONCE(kd->est_row);
+	else
+		crow = kd->add_row;
+	crow--;
+	if (crow < 0)
+		crow = IPVS_EST_NTICKS - 1;
+	row = crow;
+	if (crow < IPVS_EST_NTICKS - 1) {
+		crow++;
+		row = find_last_bit(kd->avail, crow);
+	}
+	if (row >= crow)
+		row = find_last_bit(kd->avail, IPVS_EST_NTICKS);
+
+	td = rcu_dereference_protected(kd->ticks[row], 1);
+	if (!td) {
+		td = kzalloc(sizeof(*td), GFP_KERNEL);
+		if (!td) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		rcu_assign_pointer(kd->ticks[row], td);
+	}
+
+	cid = find_first_zero_bit(td->full, IPVS_EST_TICK_CHAINS);
+
+	kd->est_count++;
+	kd->tick_len[row]++;
+	if (!td->chain_len[cid])
+		__set_bit(cid, td->present);
+	td->chain_len[cid]++;
+	est->ktid = ktid;
+	est->ktrow = row;
+	est->ktcid = cid;
+	hlist_add_head_rcu(&est->list, &td->chains[cid]);
+
+	if (td->chain_len[cid] >= kd->chain_max) {
+		__set_bit(cid, td->full);
+		if (kd->tick_len[row] >= kd->tick_max) {
+			__clear_bit(row, kd->avail);
+			/* Next time search from previous row */
+			kd->add_row = row;
+		}
+	}
+
+	/* Update est_add_ktid to point to first available/empty kt slot */
+	if (kd->est_count == kd->est_max_count)
+		ip_vs_est_update_ktid(ipvs);
+
+	ret = 0;
+
+out:
+	return ret;
+}
+
+/* Start estimation for stats */
+int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
+{
+	struct ip_vs_estimator *est = &stats->est;
+	int ret;
+
+	if (!ipvs->est_max_threads && ipvs->enable)
+		ipvs->est_max_threads = 4 * num_possible_cpus();
+
+	est->ktid = -1;
+
+	/* We prefer this code to be short, kthread 0 will requeue the
+	 * estimator to available chain. If tasks are disabled, we
+	 * will not allocate much memory, just for kt 0.
+	 */
+	ret = 0;
+	if (!ipvs->est_kt_count || !ipvs->est_kt_arr[0])
+		ret = ip_vs_est_add_kthread(ipvs);
+	if (ret >= 0)
+		hlist_add_head(&est->list, &ipvs->est_temp_list);
+	else
+		INIT_HLIST_NODE(&est->list);
+	return ret;
+}
+
+static void ip_vs_est_kthread_destroy(struct ip_vs_est_kt_data *kd)
+{
+	if (kd) {
+		if (kd->task) {
+			pr_info("stop unused estimator thread %d...\n", kd->id);
+			kthread_stop(kd->task);
+		}
+		ip_vs_stats_free(kd->calc_stats);
+		kfree(kd);
+	}
+}
+
+/* Unlink estimator from chain */
 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
 {
 	struct ip_vs_estimator *est = &stats->est;
+	struct ip_vs_est_tick_data *td;
+	struct ip_vs_est_kt_data *kd;
+	int ktid = est->ktid;
+	int row = est->ktrow;
+	int cid = est->ktcid;
+
+	/* Failed to add to chain ? */
+	if (hlist_unhashed(&est->list))
+		return;
+
+	/* On return, estimator can be freed, dequeue it now */
+
+	/* In est_temp_list ? */
+	if (ktid < 0) {
+		hlist_del(&est->list);
+		goto end_kt0;
+	}
+
+	hlist_del_rcu(&est->list);
+	kd = ipvs->est_kt_arr[ktid];
+	td = rcu_dereference_protected(kd->ticks[row], 1);
+	__clear_bit(cid, td->full);
+	td->chain_len[cid]--;
+	if (!td->chain_len[cid])
+		__clear_bit(cid, td->present);
+	kd->tick_len[row]--;
+	__set_bit(row, kd->avail);
+	if (!kd->tick_len[row]) {
+		RCU_INIT_POINTER(kd->ticks[row], NULL);
+		kfree_rcu(td);
+	}
+	kd->est_count--;
+	if (kd->est_count) {
+		/* This kt slot can become available just now, prefer it */
+		if (ktid < ipvs->est_add_ktid)
+			ipvs->est_add_ktid = ktid;
+		return;
+	}
+
+	if (ktid > 0) {
+		mutex_lock(&ipvs->est_mutex);
+		ip_vs_est_kthread_destroy(kd);
+		ipvs->est_kt_arr[ktid] = NULL;
+		if (ktid == ipvs->est_kt_count - 1) {
+			ipvs->est_kt_count--;
+			while (ipvs->est_kt_count > 1 &&
+			       !ipvs->est_kt_arr[ipvs->est_kt_count - 1])
+				ipvs->est_kt_count--;
+		}
+		mutex_unlock(&ipvs->est_mutex);
+
+		/* This slot is now empty, prefer another available kt slot */
+		if (ktid == ipvs->est_add_ktid)
+			ip_vs_est_update_ktid(ipvs);
+	}
+
+end_kt0:
+	/* kt 0 is freed after all other kthreads and chains are empty */
+	if (ipvs->est_kt_count == 1 && hlist_empty(&ipvs->est_temp_list)) {
+		kd = ipvs->est_kt_arr[0];
+		if (!kd || !kd->est_count) {
+			mutex_lock(&ipvs->est_mutex);
+			if (kd) {
+				ip_vs_est_kthread_destroy(kd);
+				ipvs->est_kt_arr[0] = NULL;
+			}
+			ipvs->est_kt_count--;
+			mutex_unlock(&ipvs->est_mutex);
+			ipvs->est_add_ktid = 0;
+		}
+	}
+}
+
+/* Register all ests from est_temp_list to kthreads */
+static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs)
+{
+	struct ip_vs_estimator *est;
+
+	while (1) {
+		int max = 16;
+
+		mutex_lock(&__ip_vs_mutex);
+
+		while (max-- > 0) {
+			est = hlist_entry_safe(ipvs->est_temp_list.first,
+					       struct ip_vs_estimator, list);
+			if (est) {
+				if (kthread_should_stop())
+					goto unlock;
+				hlist_del_init(&est->list);
+				if (ip_vs_enqueue_estimator(ipvs, est) >= 0)
+					continue;
+				est->ktid = -1;
+				hlist_add_head(&est->list,
+					       &ipvs->est_temp_list);
+				/* Abort, some entries will not be estimated
+				 * until next attempt
+				 */
+			}
+			goto unlock;
+		}
+		mutex_unlock(&__ip_vs_mutex);
+		cond_resched();
+	}
+
+unlock:
+	mutex_unlock(&__ip_vs_mutex);
+}
 
-	spin_lock_bh(&ipvs->est_lock);
-	list_del(&est->list);
-	spin_unlock_bh(&ipvs->est_lock);
+/* Calculate limits for all kthreads */
+static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
+{
+	struct ip_vs_est_kt_data *kd;
+	struct ip_vs_stats *s;
+	struct hlist_head chain;
+	int cache_factor = 4;
+	int i, loops, ntest;
+	s32 min_est = 0;
+	ktime_t t1, t2;
+	s64 diff, val;
+	int max = 8;
+	int ret = 1;
+
+	INIT_HLIST_HEAD(&chain);
+	mutex_lock(&__ip_vs_mutex);
+	kd = ipvs->est_kt_arr[0];
+	mutex_unlock(&__ip_vs_mutex);
+	s = kd ? kd->calc_stats : NULL;
+	if (!s)
+		goto out;
+	hlist_add_head(&s->est.list, &chain);
+
+	loops = 1;
+	/* Get best result from many tests */
+	for (ntest = 0; ntest < 3; ntest++) {
+		local_bh_disable();
+		rcu_read_lock();
+
+		/* Put stats in cache */
+		ip_vs_chain_estimation(&chain);
+
+		t1 = ktime_get();
+		for (i = loops * cache_factor; i > 0; i--)
+			ip_vs_chain_estimation(&chain);
+		t2 = ktime_get();
+
+		rcu_read_unlock();
+		local_bh_enable();
+
+		if (!ipvs->enable || kthread_should_stop())
+			goto stop;
+		cond_resched();
+
+		diff = ktime_to_ns(ktime_sub(t2, t1));
+		if (diff <= 1 * NSEC_PER_USEC) {
+			/* Do more loops on low resolution */
+			loops *= 2;
+			continue;
+		}
+		if (diff >= NSEC_PER_SEC)
+			continue;
+		val = diff;
+		do_div(val, loops);
+		if (!min_est || val < min_est) {
+			min_est = val;
+			/* goal: 95usec per chain */
+			val = 95 * NSEC_PER_USEC;
+			if (val >= min_est) {
+				do_div(val, min_est);
+				max = (int)val;
+			} else {
+				max = 1;
+			}
+		}
+	}
+
+out:
+	if (s)
+		hlist_del_init(&s->est.list);
+	*chain_max = max;
+	return ret;
+
+stop:
+	ret = 0;
+	goto out;
+}
+
+/* Calculate the parameters and apply them in context of kt #0
+ * ECP: est_calc_phase
+ * ECM: est_chain_max
+ * ECP	ECM	Insert Chain	enable	Description
+ * ---------------------------------------------------------------------------
+ * 0	0	est_temp_list	0	create kt #0 context
+ * 0	0	est_temp_list	0->1	service added, start kthread #0 task
+ * 0->1	0	est_temp_list	1	kt task #0 started, enters calc phase
+ * 1	0	est_temp_list	1	kt #0: determine est_chain_max,
+ *					stop tasks, move ests to est_temp_list
+ *					and free kd for kthreads 1..last
+ * 1->0	0->N	kt chains	1	ests can go to kthreads
+ * 0	N	kt chains	1	drain est_temp_list, create new kthread
+ *					contexts, start tasks, estimate
+ */
+static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
+{
+	int genid = atomic_read(&ipvs->est_genid);
+	struct ip_vs_est_tick_data *td;
+	struct ip_vs_est_kt_data *kd;
+	struct ip_vs_estimator *est;
+	struct ip_vs_stats *stats;
+	int chain_max;
+	int id, row, cid;
+	bool last, last_td;
+	int step;
+
+	if (!ip_vs_est_calc_limits(ipvs, &chain_max))
+		return;
+
+	mutex_lock(&__ip_vs_mutex);
+
+	/* Stop all other tasks, so that we can immediately move the
+	 * estimators to est_temp_list without RCU grace period
+	 */
+	mutex_lock(&ipvs->est_mutex);
+	for (id = 1; id < ipvs->est_kt_count; id++) {
+		/* netns clean up started, abort */
+		if (!ipvs->enable)
+			goto unlock2;
+		kd = ipvs->est_kt_arr[id];
+		if (!kd)
+			continue;
+		ip_vs_est_kthread_stop(kd);
+	}
+	mutex_unlock(&ipvs->est_mutex);
+
+	/* Move all estimators to est_temp_list but carefully,
+	 * all estimators and kthread data can be released while
+	 * we reschedule. Even for kthread 0.
+	 */
+	step = 0;
+
+next_kt:
+	/* Destroy contexts backwards */
+	id = ipvs->est_kt_count - 1;
+	if (id < 0)
+		goto end_dequeue;
+	kd = ipvs->est_kt_arr[id];
+	if (!kd)
+		goto end_dequeue;
+	/* kt 0 can exist with empty chains */
+	if (!id && kd->est_count <= 1)
+		goto end_dequeue;
+
+	row = -1;
+
+next_row:
+	row++;
+	if (row >= IPVS_EST_NTICKS)
+		goto next_kt;
+	if (!ipvs->enable)
+		goto unlock;
+	td = rcu_dereference_protected(kd->ticks[row], 1);
+	if (!td)
+		goto next_row;
+
+	cid = 0;
+
+walk_chain:
+	if (kthread_should_stop())
+		goto unlock;
+	step++;
+	if (!(step & 63)) {
+		/* Give chance estimators to be added (to est_temp_list)
+		 * and deleted (releasing kthread contexts)
+		 */
+		mutex_unlock(&__ip_vs_mutex);
+		cond_resched();
+		mutex_lock(&__ip_vs_mutex);
+
+		/* Current kt released ? */
+		if (id + 1 != ipvs->est_kt_count)
+			goto next_kt;
+		if (kd != ipvs->est_kt_arr[id])
+			goto end_dequeue;
+		/* Current td released ? */
+		if (td != rcu_dereference_protected(kd->ticks[row], 1))
+			goto next_row;
+		/* No fatal changes on the current kd and td */
+	}
+	est = hlist_entry_safe(td->chains[cid].first, struct ip_vs_estimator,
+			       list);
+	if (!est) {
+		cid++;
+		if (cid >= IPVS_EST_TICK_CHAINS)
+			goto next_row;
+		goto walk_chain;
+	}
+	/* We can cheat and increase est_count to protect kt 0 context
+	 * from release but we prefer to keep the last estimator
+	 */
+	last = kd->est_count <= 1;
+	/* Do not free kt #0 data */
+	if (!id && last)
+		goto end_dequeue;
+	last_td = kd->tick_len[row] <= 1;
+	stats = container_of(est, struct ip_vs_stats, est);
+	ip_vs_stop_estimator(ipvs, stats);
+	/* Tasks are stopped, move without RCU grace period */
+	est->ktid = -1;
+	hlist_add_head(&est->list, &ipvs->est_temp_list);
+	/* kd freed ? */
+	if (last)
+		goto next_kt;
+	/* td freed ? */
+	if (last_td)
+		goto next_row;
+	goto walk_chain;
+
+end_dequeue:
+	/* All estimators removed while calculating ? */
+	if (!ipvs->est_kt_count)
+		goto unlock;
+	kd = ipvs->est_kt_arr[0];
+	if (!kd)
+		goto unlock;
+	ipvs->est_chain_max = chain_max;
+	ip_vs_est_set_params(ipvs, kd);
+
+	pr_info("using max %d ests per chain, %d per kthread\n",
+		kd->chain_max, kd->est_max_count);
+
+	mutex_lock(&ipvs->est_mutex);
+
+	/* We completed the calc phase, new calc phase not requested */
+	if (genid == atomic_read(&ipvs->est_genid))
+		ipvs->est_calc_phase = 0;
+
+unlock2:
+	mutex_unlock(&ipvs->est_mutex);
+
+unlock:
+	mutex_unlock(&__ip_vs_mutex);
 }
 
 void ip_vs_zero_estimator(struct ip_vs_stats *stats)
@@ -191,14 +875,25 @@ void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats)
 
 int __net_init ip_vs_estimator_net_init(struct netns_ipvs *ipvs)
 {
-	INIT_LIST_HEAD(&ipvs->est_list);
-	spin_lock_init(&ipvs->est_lock);
-	timer_setup(&ipvs->est_timer, estimation_timer, 0);
-	mod_timer(&ipvs->est_timer, jiffies + 2 * HZ);
+	INIT_HLIST_HEAD(&ipvs->est_temp_list);
+	ipvs->est_kt_arr = NULL;
+	ipvs->est_max_threads = 0;
+	ipvs->est_calc_phase = 0;
+	ipvs->est_chain_max = 0;
+	ipvs->est_kt_count = 0;
+	ipvs->est_add_ktid = 0;
+	atomic_set(&ipvs->est_genid, 0);
+	atomic_set(&ipvs->est_genid_done, 0);
+	__mutex_init(&ipvs->est_mutex, "ipvs->est_mutex", &__ipvs_est_key);
 	return 0;
 }
 
 void __net_exit ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs)
 {
-	del_timer_sync(&ipvs->est_timer);
+	int i;
+
+	for (i = 0; i < ipvs->est_kt_count; i++)
+		ip_vs_est_kthread_destroy(ipvs->est_kt_arr[i]);
+	kfree(ipvs->est_kt_arr);
+	mutex_destroy(&ipvs->est_mutex);
 }
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCHv5 4/6] ipvs: add est_cpulist and est_nice sysctl vars
  2022-10-09 15:37 [RFC PATCHv5 0/6] ipvs: Use kthreads for stats Julian Anastasov
                   ` (2 preceding siblings ...)
  2022-10-09 15:37 ` [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation Julian Anastasov
@ 2022-10-09 15:37 ` Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 5/6] ipvs: run_estimation should control the kthread tasks Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 6/6] ipvs: debug the tick time Julian Anastasov
  5 siblings, 0 replies; 15+ messages in thread
From: Julian Anastasov @ 2022-10-09 15:37 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Allow the kthreads for stats to be configured for
specific cpulist (isolation) and niceness (scheduling
priority).

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 Documentation/networking/ipvs-sysctl.rst |  20 ++++
 include/net/ip_vs.h                      |  55 +++++++++
 net/netfilter/ipvs/ip_vs_ctl.c           | 143 ++++++++++++++++++++++-
 net/netfilter/ipvs/ip_vs_est.c           |  11 +-
 4 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 387fda80f05f..1b778705d706 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -129,6 +129,26 @@ drop_packet - INTEGER
 	threshold. When the mode 3 is set, the always mode drop rate
 	is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
 
+est_cpulist - CPULIST
+	Allowed	CPUs for estimation kthreads
+
+	Syntax: standard cpulist format
+	empty list - stop kthread tasks and estimation
+	default - the system's housekeeping CPUs for kthreads
+
+	Example:
+	"all": all possible CPUs
+	"0-N": all possible CPUs, N denotes last CPU number
+	"0,1-N:1/2": first and all CPUs with odd number
+	"": empty list
+
+est_nice - INTEGER
+	default 0
+	Valid range: -20 (more favorable) .. 19 (less favorable)
+
+	Niceness value to use for the estimation kthreads (scheduling
+	priority)
+
 expire_nodest_conn - BOOLEAN
 	- 0 - disabled (default)
 	- not 0 - enabled
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index f18bc398472b..9da45664dba8 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -29,6 +29,7 @@
 #include <net/netfilter/nf_conntrack.h>
 #endif
 #include <net/net_namespace.h>		/* Netw namespace */
+#include <linux/sched/isolation.h>
 
 #define IP_VS_HDR_INVERSE	1
 #define IP_VS_HDR_ICMP		2
@@ -365,6 +366,9 @@ struct ip_vs_cpu_stats {
 	struct u64_stats_sync   syncp;
 };
 
+/* Default nice for estimator kthreads */
+#define IPVS_EST_NICE		0
+
 /* IPVS statistics objects */
 struct ip_vs_estimator {
 	struct hlist_node	list;
@@ -995,6 +999,12 @@ struct netns_ipvs {
 	int			sysctl_schedule_icmp;
 	int			sysctl_ignore_tunneled;
 	int			sysctl_run_estimation;
+#ifdef CONFIG_SYSCTL
+	cpumask_var_t		sysctl_est_cpulist;	/* kthread cpumask */
+	int			est_cpulist_valid;	/* cpulist set */
+	int			sysctl_est_nice;	/* kthread nice */
+	int			est_stopped;		/* stop tasks */
+#endif
 
 	/* ip_vs_lblc */
 	int			sysctl_lblc_expiration;
@@ -1148,6 +1158,19 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_run_estimation;
 }
 
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+	if (ipvs->est_cpulist_valid)
+		return ipvs->sysctl_est_cpulist;
+	else
+		return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_est_nice;
+}
+
 #else
 
 static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -1245,6 +1268,16 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+	return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+	return IPVS_EST_NICE;
+}
+
 #endif
 
 /* IPVS core functions
@@ -1555,6 +1588,28 @@ int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
 			    struct ip_vs_est_kt_data *kd);
 void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
 
+static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+	ipvs->est_stopped = ipvs->est_cpulist_valid &&
+			    cpumask_empty(sysctl_est_cpulist(ipvs));
+#endif
+}
+
+static inline bool ip_vs_est_stopped(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+	return ipvs->est_stopped;
+#else
+	return false;
+#endif
+}
+
+static inline int ip_vs_est_max_threads(struct netns_ipvs *ipvs)
+{
+	return max(1U, 4 * cpumask_weight(sysctl_est_cpulist(ipvs)));
+}
+
 /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
 int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 		    struct ip_vs_protocol *pp, struct ip_vs_iphdr *iph);
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index f30f2488de5d..56edd94d0877 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -263,7 +263,7 @@ static void est_reload_work_handler(struct work_struct *work)
 		/* New config ? Stop kthread tasks */
 		if (genid != genid_done)
 			ip_vs_est_kthread_stop(kd);
-		if (!kd->task) {
+		if (!kd->task && !ip_vs_est_stopped(ipvs)) {
 			/* Do not start kthreads above 0 in calc phase */
 			if ((!id || !ipvs->est_calc_phase) &&
 			    ip_vs_est_kthread_start(ipvs, kd) < 0)
@@ -1940,6 +1940,122 @@ proc_do_sync_ports(struct ctl_table *table, int write,
 	return rc;
 }
 
+static int ipvs_proc_est_cpumask_set(struct ctl_table *table, void *buffer)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	cpumask_var_t *valp = table->data;
+	cpumask_var_t newmask;
+	int ret;
+
+	if (!zalloc_cpumask_var(&newmask, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(buffer, newmask);
+	if (ret)
+		goto out;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	if (!ipvs->est_cpulist_valid) {
+		if (!zalloc_cpumask_var(valp, GFP_KERNEL)) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
+		ipvs->est_cpulist_valid = 1;
+	}
+	cpumask_and(newmask, newmask, &current->cpus_mask);
+	cpumask_copy(*valp, newmask);
+	/* est_max_threads may depend on cpulist size */
+	ipvs->est_max_threads = ip_vs_est_max_threads(ipvs);
+	ipvs->est_calc_phase = 1;
+	ip_vs_est_reload_start(ipvs);
+
+unlock:
+	mutex_unlock(&ipvs->est_mutex);
+
+out:
+	free_cpumask_var(newmask);
+	return ret;
+}
+
+static int ipvs_proc_est_cpumask_get(struct ctl_table *table, void *buffer,
+				     size_t size)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	cpumask_var_t *valp = table->data;
+	struct cpumask *mask;
+	int ret;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	if (ipvs->est_cpulist_valid)
+		mask = *valp;
+	else
+		mask = (struct cpumask *)housekeeping_cpumask(HK_TYPE_KTHREAD);
+	ret = scnprintf(buffer, size, "%*pbl\n", cpumask_pr_args(mask));
+
+	mutex_unlock(&ipvs->est_mutex);
+
+	return ret;
+}
+
+static int ipvs_proc_est_cpulist(struct ctl_table *table, int write,
+				 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+
+	/* Ignore both read and write(append) if *ppos not 0 */
+	if (*ppos || !*lenp) {
+		*lenp = 0;
+		return 0;
+	}
+	if (write) {
+		/* proc_sys_call_handler() appends terminator */
+		ret = ipvs_proc_est_cpumask_set(table, buffer);
+		if (ret >= 0)
+			*ppos += *lenp;
+	} else {
+		/* proc_sys_call_handler() allocates 1 byte for terminator */
+		ret = ipvs_proc_est_cpumask_get(table, buffer, *lenp + 1);
+		if (ret >= 0) {
+			*lenp = ret;
+			*ppos += *lenp;
+			ret = 0;
+		}
+	}
+	return ret;
+}
+
+static int ipvs_proc_est_nice(struct ctl_table *table, int write,
+			      void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	int *valp = table->data;
+	int val = *valp;
+	int ret;
+
+	struct ctl_table tmp_table = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+	if (write && ret >= 0) {
+		if (val < MIN_NICE || val > MAX_NICE) {
+			ret = -EINVAL;
+		} else {
+			mutex_lock(&ipvs->est_mutex);
+			if (*valp != val) {
+				*valp = val;
+				ip_vs_est_reload_start(ipvs);
+			}
+			mutex_unlock(&ipvs->est_mutex);
+		}
+	}
+	return ret;
+}
+
 /*
  *	IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
  *	Do not change order or insert new entries without
@@ -2116,6 +2232,18 @@ static struct ctl_table vs_vars[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "est_cpulist",
+		.maxlen		= NR_CPUS,	/* unused */
+		.mode		= 0644,
+		.proc_handler	= ipvs_proc_est_cpulist,
+	},
+	{
+		.procname	= "est_nice",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= ipvs_proc_est_nice,
+	},
 #ifdef CONFIG_IP_VS_DEBUG
 	{
 		.procname	= "debug_level",
@@ -4133,6 +4261,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
 	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
 			  expire_nodest_conn_handler);
+	ipvs->est_stopped = 0;
 
 	if (!net_eq(net, &init_net)) {
 		tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
@@ -4194,6 +4323,15 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
 	ipvs->sysctl_run_estimation = 1;
 	tbl[idx++].data = &ipvs->sysctl_run_estimation;
+
+	ipvs->est_cpulist_valid = 0;
+	tbl[idx].extra2 = ipvs;
+	tbl[idx++].data = &ipvs->sysctl_est_cpulist;
+
+	ipvs->sysctl_est_nice = IPVS_EST_NICE;
+	tbl[idx].extra2 = ipvs;
+	tbl[idx++].data = &ipvs->sysctl_est_nice;
+
 #ifdef CONFIG_IP_VS_DEBUG
 	/* Global sysctls must be ro in non-init netns */
 	if (!net_eq(net, &init_net))
@@ -4233,6 +4371,9 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
 	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s);
 
+	if (ipvs->est_cpulist_valid)
+		free_cpumask_var(ipvs->sysctl_est_cpulist);
+
 	if (!net_eq(net, &init_net))
 		kfree(ipvs->sysctl_tbl);
 }
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 99a85ee9cd1e..d4d860f19cb8 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -57,6 +57,9 @@
   - kthread contexts are created and attached to array
   - the kthread tasks are started when first service is added, before that
     the total stats are not estimated
+  - when configuration (cpulist/nice) is changed, the tasks are restarted
+    by work (est_reload_work)
+  - kthread tasks are stopped while the cpulist is empty
   - the kthread context holds lists with estimators (chains) which are
     processed every 2 seconds
   - as estimators can be added dynamically and in bursts, we try to spread
@@ -229,6 +232,7 @@ void ip_vs_est_reload_start(struct netns_ipvs *ipvs)
 	/* Ignore reloads before first service is added */
 	if (!ipvs->enable)
 		return;
+	ip_vs_est_stopped_recalc(ipvs);
 	/* Bump the kthread configuration genid */
 	atomic_inc(&ipvs->est_genid);
 	queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 0);
@@ -259,6 +263,9 @@ int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
 		goto out;
 	}
 
+	set_user_nice(kd->task, sysctl_est_nice(ipvs));
+	set_cpus_allowed_ptr(kd->task, sysctl_est_cpulist(ipvs));
+
 	pr_info("starting estimator thread %d...\n", kd->id);
 	wake_up_process(kd->task);
 
@@ -334,7 +341,7 @@ static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
 	}
 
 	/* Start kthread tasks only when services are present */
-	if (ipvs->enable) {
+	if (ipvs->enable && !ip_vs_est_stopped(ipvs)) {
 		ret = ip_vs_est_kthread_start(ipvs, kd);
 		if (ret < 0)
 			goto out;
@@ -466,7 +473,7 @@ int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
 	int ret;
 
 	if (!ipvs->est_max_threads && ipvs->enable)
-		ipvs->est_max_threads = 4 * num_possible_cpus();
+		ipvs->est_max_threads = ip_vs_est_max_threads(ipvs);
 
 	est->ktid = -1;
 
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCHv5 5/6] ipvs: run_estimation should control the kthread tasks
  2022-10-09 15:37 [RFC PATCHv5 0/6] ipvs: Use kthreads for stats Julian Anastasov
                   ` (3 preceding siblings ...)
  2022-10-09 15:37 ` [RFC PATCHv5 4/6] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
@ 2022-10-09 15:37 ` Julian Anastasov
  2022-10-09 15:37 ` [RFC PATCHv5 6/6] ipvs: debug the tick time Julian Anastasov
  5 siblings, 0 replies; 15+ messages in thread
From: Julian Anastasov @ 2022-10-09 15:37 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Change the run_estimation flag to start/stop the kthread tasks.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 Documentation/networking/ipvs-sysctl.rst |  4 ++--
 include/net/ip_vs.h                      |  6 +++--
 net/netfilter/ipvs/ip_vs_ctl.c           | 29 +++++++++++++++++++++++-
 net/netfilter/ipvs/ip_vs_est.c           |  2 +-
 4 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 1b778705d706..3fb5fa142eef 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -324,8 +324,8 @@ run_estimation - BOOLEAN
 	0 - disabled
 	not 0 - enabled (default)
 
-	If disabled, the estimation will be stop, and you can't see
-	any update on speed estimation data.
+	If disabled, the estimation will be suspended and kthread tasks
+	stopped.
 
 	You can always re-enable estimation by setting this value to 1.
 	But be careful, the first estimation after re-enable is not
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 9da45664dba8..76d4abf5a881 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1591,8 +1591,10 @@ void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
 static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
 {
 #ifdef CONFIG_SYSCTL
-	ipvs->est_stopped = ipvs->est_cpulist_valid &&
-			    cpumask_empty(sysctl_est_cpulist(ipvs));
+	/* Stop tasks while cpulist is empty or if disabled with flag */
+	ipvs->est_stopped = !sysctl_run_estimation(ipvs) ||
+			    (ipvs->est_cpulist_valid &&
+			     cpumask_empty(sysctl_est_cpulist(ipvs)));
 #endif
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 56edd94d0877..aa0ac51a162c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2056,6 +2056,32 @@ static int ipvs_proc_est_nice(struct ctl_table *table, int write,
 	return ret;
 }
 
+static int ipvs_proc_run_estimation(struct ctl_table *table, int write,
+				    void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	int *valp = table->data;
+	int val = *valp;
+	int ret;
+
+	struct ctl_table tmp_table = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+	if (write && ret >= 0) {
+		mutex_lock(&ipvs->est_mutex);
+		if (*valp != val) {
+			*valp = val;
+			ip_vs_est_reload_start(ipvs);
+		}
+		mutex_unlock(&ipvs->est_mutex);
+	}
+	return ret;
+}
+
 /*
  *	IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
  *	Do not change order or insert new entries without
@@ -2230,7 +2256,7 @@ static struct ctl_table vs_vars[] = {
 		.procname	= "run_estimation",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= ipvs_proc_run_estimation,
 	},
 	{
 		.procname	= "est_cpulist",
@@ -4322,6 +4348,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
 	ipvs->sysctl_run_estimation = 1;
+	tbl[idx].extra2 = ipvs;
 	tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
 	ipvs->est_cpulist_valid = 0;
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index d4d860f19cb8..d6e35f2a4c77 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -212,7 +212,7 @@ static int ip_vs_estimation_kthread(void *data)
 				kd->est_timer = now;
 		}
 
-		if (sysctl_run_estimation(ipvs) && kd->tick_len[row])
+		if (kd->tick_len[row])
 			ip_vs_tick_estimation(kd, row);
 
 		row++;
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCHv5 6/6] ipvs: debug the tick time
  2022-10-09 15:37 [RFC PATCHv5 0/6] ipvs: Use kthreads for stats Julian Anastasov
                   ` (4 preceding siblings ...)
  2022-10-09 15:37 ` [RFC PATCHv5 5/6] ipvs: run_estimation should control the kthread tasks Julian Anastasov
@ 2022-10-09 15:37 ` Julian Anastasov
  5 siblings, 0 replies; 15+ messages in thread
From: Julian Anastasov @ 2022-10-09 15:37 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Just for testing print the tick time every minute

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/netfilter/ipvs/ip_vs_est.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index d6e35f2a4c77..35cb13aad2b1 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -147,7 +147,14 @@ static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
 {
 	struct ip_vs_est_tick_data *td;
 	int cid;
-
+	u64 ns = 0;
+	static int used_row = -1;
+	static int pass;
+
+	if (used_row < 0)
+		used_row = row;
+	if (row == used_row && !kd->id && !(pass & 31))
+		ns = ktime_get_ns();
 	rcu_read_lock();
 	td = rcu_dereference(kd->ticks[row]);
 	if (!td)
@@ -164,6 +171,16 @@ static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
 
 out:
 	rcu_read_unlock();
+	if (row == used_row && !kd->id && !(pass++ & 31)) {
+		static int ncpu;
+
+		ns = ktime_get_ns() - ns;
+		if (!ncpu)
+			ncpu = num_possible_cpus();
+		pr_info("tick time: %lluns for %d CPUs, %d ests, %d chains, chain_max=%d\n",
+			(unsigned long long)ns, ncpu, kd->tick_len[row],
+			IPVS_EST_TICK_CHAINS, kd->chain_max);
+	}
 }
 
 static int ip_vs_estimation_kthread(void *data)
@@ -626,7 +643,7 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
 	int i, loops, ntest;
 	s32 min_est = 0;
 	ktime_t t1, t2;
-	s64 diff, val;
+	s64 diff = 0, val;
 	int max = 8;
 	int ret = 1;
 
@@ -684,6 +701,8 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
 	}
 
 out:
+	pr_info("calc: chain_max=%d, single est=%dns, diff=%d, loops=%d, ntest=%d\n",
+		max, min_est, (int)diff, loops, ntest);
 	if (s)
 		hlist_del_init(&s->est.list);
 	*chain_max = max;
@@ -719,6 +738,7 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
 	int chain_max;
 	int id, row, cid;
 	bool last, last_td;
+	u64 ns = 0;
 	int step;
 
 	if (!ip_vs_est_calc_limits(ipvs, &chain_max))
@@ -747,6 +767,8 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
 	 */
 	step = 0;
 
+	ns = ktime_get_ns();
+
 next_kt:
 	/* Destroy contexts backwards */
 	id = ipvs->est_kt_count - 1;
@@ -825,6 +847,8 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
 	goto walk_chain;
 
 end_dequeue:
+	ns = ktime_get_ns() - ns;
+	pr_info("dequeue: %lluns\n", (unsigned long long)ns);
 	/* All estimators removed while calculating ? */
 	if (!ipvs->est_kt_count)
 		goto unlock;
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-09 15:37 ` [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation Julian Anastasov
@ 2022-10-15  9:21   ` Jiri Wiesner
  2022-10-16 12:21     ` Julian Anastasov
  0 siblings, 1 reply; 15+ messages in thread
From: Jiri Wiesner @ 2022-10-15  9:21 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Sun, Oct 09, 2022 at 06:37:07PM +0300, Julian Anastasov wrote:
> +/* Calculate limits for all kthreads */
> +static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
> +{
> +	struct ip_vs_est_kt_data *kd;
> +	struct ip_vs_stats *s;
> +	struct hlist_head chain;
> +	int cache_factor = 4;
> +	int i, loops, ntest;
> +	s32 min_est = 0;
> +	ktime_t t1, t2;
> +	s64 diff, val;
> +	int max = 8;
> +	int ret = 1;
> +
> +	INIT_HLIST_HEAD(&chain);
> +	mutex_lock(&__ip_vs_mutex);
> +	kd = ipvs->est_kt_arr[0];
> +	mutex_unlock(&__ip_vs_mutex);
> +	s = kd ? kd->calc_stats : NULL;
> +	if (!s)
> +		goto out;
> +	hlist_add_head(&s->est.list, &chain);
> +
> +	loops = 1;
> +	/* Get best result from many tests */
> +	for (ntest = 0; ntest < 3; ntest++) {
> +		local_bh_disable();
> +		rcu_read_lock();
> +
> +		/* Put stats in cache */
> +		ip_vs_chain_estimation(&chain);
> +
> +		t1 = ktime_get();
> +		for (i = loops * cache_factor; i > 0; i--)
> +			ip_vs_chain_estimation(&chain);
> +		t2 = ktime_get();

I have tested this. There is one problem: When the calc phase is carried out for the first time after booting the kernel, the diff is several times higher than what it should be - it was 7325 ns on my testing machine. The wrong chain_max value causes 15 kthreads to be created when 500,000 estimators have been added, which is not abysmal (it's better to underestimate chain_max than to overestimate it) but not optimal either. When the ip_vs module is unloaded and then a new service is added again, the diff has the expected value. The commands:
> # ipvsadm -A -t 10.10.10.1:2000
> # ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> # ipvsadm -A -t 10.10.10.1:2000
The kernel log:
> [  200.020287] IPVS: ipvs loaded.
> [  200.036128] IPVS: starting estimator thread 0...
> [  200.042213] IPVS: calc: chain_max=12, single est=7319ns, diff=7325, loops=1, ntest=3
> [  200.051714] IPVS: dequeue: 49ns
> [  200.056024] IPVS: using max 576 ests per chain, 28800 per kthread
> [  201.983034] IPVS: tick time: 6057ns for 64 CPUs, 2 ests, 1 chains, chain_max=576
> [  237.555043] IPVS: stop unused estimator thread 0...
> [  237.599116] IPVS: ipvs unloaded.
> [  268.533028] IPVS: ipvs loaded.
> [  268.548401] IPVS: starting estimator thread 0...
> [  268.554472] IPVS: calc: chain_max=33, single est=2834ns, diff=2834, loops=1, ntest=3
> [  268.563972] IPVS: dequeue: 68ns
> [  268.568292] IPVS: using max 1584 ests per chain, 79200 per kthread
> [  270.495032] IPVS: tick time: 5761ns for 64 CPUs, 2 ests, 1 chains, chain_max=1584
> [  307.847045] IPVS: stop unused estimator thread 0...
> [  307.891101] IPVS: ipvs unloaded.
Loading the module and adding a service a third time gives a diff that is close enough to the expected value:
> [  312.807107] IPVS: ipvs loaded.
> [  312.823972] IPVS: starting estimator thread 0...
> [  312.829967] IPVS: calc: chain_max=38, single est=2444ns, diff=2477, loops=1, ntest=3
> [  312.839470] IPVS: dequeue: 66ns
> [  312.843800] IPVS: using max 1824 ests per chain, 91200 per kthread
> [  314.771028] IPVS: tick time: 5703ns for 64 CPUs, 2 ests, 1 chains, chain_max=1824
Here is a distribution of the time needed to process one estimator - the average value is around 2900 ns (on my testing machine):
> dmesg | awk '/tick time:/ {d = $(NF - 8); sub("ns", "", d); d /= $(NF - 4); d = int(d / 100) * 100; hist[d]++} END {PROCINFO["sorted_in"] = "@ind_num_asc"; for (d in hist) printf "%5d %5d\n", d, hist[d]}'
>  2500     2
>  2700     1
>  2800   243
>  2900   427
>  3000    20
>  3100     1
>  3500     1
>  3600     1
>  3700     1
>  4900     1
I am not sure why the first 3 tests give such a high diff value, but the diff value is much closer to the measured average time after the module is loaded a second time.

I ran more tests. All I did was increase ntests to 3000. The diff had a much more realistic value even when the calc phase was carried out for the first time:
> [   98.804037] IPVS: ipvs loaded.
> [   98.819451] IPVS: starting estimator thread 0...
> [   98.834960] IPVS: calc: chain_max=39, single est=2418ns, diff=2464, loops=1, ntest=3000
> [   98.844775] IPVS: dequeue: 67ns
> [   98.849091] IPVS: using max 1872 ests per chain, 93600 per kthread
> [  100.767346] IPVS: tick time: 5895ns for 64 CPUs, 2 ests, 1 chains, chain_max=1872
> [  107.419344] IPVS: stop unused estimator thread 0...
> [  107.459423] IPVS: ipvs unloaded.
> [  114.421324] IPVS: ipvs loaded.
> [  114.435151] IPVS: starting estimator thread 0...
> [  114.451304] IPVS: calc: chain_max=36, single est=2627ns, diff=8136, loops=1, ntest=3000
> [  114.461079] IPVS: dequeue: 77ns
> [  114.465389] IPVS: using max 1728 ests per chain, 86400 per kthread
> [  116.388968] IPVS: tick time: 1632749ns for 64 CPUs, 1433 ests, 1 chains, chain_max=1728
> [  180.387030] IPVS: tick time: 3686870ns for 64 CPUs, 1728 ests, 1 chains, chain_max=1728
> [  232.507642] IPVS: starting estimator thread 1...
> [  244.387184] IPVS: tick time: 3846122ns for 64 CPUs, 1728 ests, 1 chains, chain_max=1728
> [  308.387170] IPVS: tick time: 3835769ns for 64 CPUs, 1728 ests, 1 chains, chain_max=1728
> [  358.227680] IPVS: starting estimator thread 2...
> [  372.387177] IPVS: tick time: 3841369ns for 64 CPUs, 1728 ests, 1 chains, chain_max=1728
> [  436.387204] IPVS: tick time: 3869654ns for 64 CPUs, 1728 ests, 1 chains, chain_max=1728
Setting ntests to 3000 is probably overkill. The message is that increasing ntests is needed to get a realistic value of the diff. When I added 500,000 estimators, 5 kthreads were created, which I think is reasonable. After adding 500,000 estimators, the time needed to process one estimator decreased from 2900 ns to circa 2200 ns when a kthread is fully loaded, which I do not think is necessarily a problem.
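
For reference, the change I tested amounts to this one-liner against the v5 ip_vs_est_calc_limits() quoted above:

-	for (ntest = 0; ntest < 3; ntest++) {
+	for (ntest = 0; ntest < 3000; ntest++) {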

> +
> +		rcu_read_unlock();
> +		local_bh_enable();
> +
> +		if (!ipvs->enable || kthread_should_stop())
> +			goto stop;
> +		cond_resched();
> +
> +		diff = ktime_to_ns(ktime_sub(t2, t1));
> +		if (diff <= 1 * NSEC_PER_USEC) {
> +			/* Do more loops on low resolution */
> +			loops *= 2;
> +			continue;
> +		}
> +		if (diff >= NSEC_PER_SEC)
> +			continue;
> +		val = diff;
> +		do_div(val, loops);
> +		if (!min_est || val < min_est) {
> +			min_est = val;
> +			/* goal: 95usec per chain */
> +			val = 95 * NSEC_PER_USEC;
> +			if (val >= min_est) {
> +				do_div(val, min_est);
> +				max = (int)val;
> +			} else {
> +				max = 1;
> +			}
> +		}
> +	}
> +
> +out:
> +	if (s)
> +		hlist_del_init(&s->est.list);
> +	*chain_max = max;
> +	return ret;
> +
> +stop:
> +	ret = 0;
> +	goto out;
> +}

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-15  9:21   ` Jiri Wiesner
@ 2022-10-16 12:21     ` Julian Anastasov
  2022-10-22 18:15       ` Jiri Wiesner
  0 siblings, 1 reply; 15+ messages in thread
From: Julian Anastasov @ 2022-10-16 12:21 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Sat, 15 Oct 2022, Jiri Wiesner wrote:

> On Sun, Oct 09, 2022 at 06:37:07PM +0300, Julian Anastasov wrote:
> > +/* Calculate limits for all kthreads */
> > +static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
> > +{
> > +	struct ip_vs_est_kt_data *kd;
> > +	struct ip_vs_stats *s;
> > +	struct hlist_head chain;
> > +	int cache_factor = 4;
> > +	int i, loops, ntest;
> > +	s32 min_est = 0;
> > +	ktime_t t1, t2;
> > +	s64 diff, val;
> > +	int max = 8;
> > +	int ret = 1;
> > +
> > +	INIT_HLIST_HEAD(&chain);
> > +	mutex_lock(&__ip_vs_mutex);
> > +	kd = ipvs->est_kt_arr[0];
> > +	mutex_unlock(&__ip_vs_mutex);
> > +	s = kd ? kd->calc_stats : NULL;
> > +	if (!s)
> > +		goto out;
> > +	hlist_add_head(&s->est.list, &chain);
> > +
> > +	loops = 1;
> > +	/* Get best result from many tests */
> > +	for (ntest = 0; ntest < 3; ntest++) {
> > +		local_bh_disable();
> > +		rcu_read_lock();
> > +
> > +		/* Put stats in cache */
> > +		ip_vs_chain_estimation(&chain);
> > +
> > +		t1 = ktime_get();
> > +		for (i = loops * cache_factor; i > 0; i--)
> > +			ip_vs_chain_estimation(&chain);
> > +		t2 = ktime_get();
> 
> I have tested this. There is one problem: When the calc phase is carried out for the first time after booting the kernel, the diff is several times higher than what it should be - it was 7325 ns on my testing machine. The wrong chain_max value causes 15 kthreads to be created when 500,000 estimators have been added, which is not abysmal (it's better to underestimate chain_max than to overestimate it) but not optimal either. When the ip_vs module is unloaded and then a new service is added again, the diff has the expected value. The commands:

	Yes, our goal allows underestimation of the tick time,
2-3 times is expected and not a big deal.

> > # ipvsadm -A -t 10.10.10.1:2000
> > # ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> > # ipvsadm -A -t 10.10.10.1:2000
> The kernel log:
> > [  200.020287] IPVS: ipvs loaded.
> > [  200.036128] IPVS: starting estimator thread 0...
> > [  200.042213] IPVS: calc: chain_max=12, single est=7319ns, diff=7325, loops=1, ntest=3
> > [  200.051714] IPVS: dequeue: 49ns
> > [  200.056024] IPVS: using max 576 ests per chain, 28800 per kthread
> > [  201.983034] IPVS: tick time: 6057ns for 64 CPUs, 2 ests, 1 chains, chain_max=576

	It is not a problem to add some wait_event_idle_timeout
calls to sleep before/between tests if the system is so busy
on boot that it can even disturb our tests with disabled BHs.
Maybe there are many IRQs that interrupt us. We have 2 seconds
for the first tests, so we can add some gaps. If you want to
test it you can add some schedule_timeout(HZ/10) between the
3 tests (total 300ms delay of the real estimation).

	Another option is to implement some clever way to dynamically
adjust chain_max in the background depending on the load.
But it is difficult to react to all changes in the system
because while tests run in privileged mode with BHs disabled,
this is not true later. If real estimation deviates from
the determined chain_max more than 4 times (average for some period)
we can trigger some relinking with the goal to increase/decrease
the kthreads. kt 0 can do such measurements.
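
	Just to illustrate the idea (nothing like this is in this
patchset; the est_calc_ns field and the helper are invented here,
and locking against the reload work is omitted), kt 0 could do
something like:

static void ip_vs_est_check_drift(struct netns_ipvs *ipvs, u64 real_ns)
{
	/* est_calc_ns: hypothetical field, ns per est saved by calc phase */
	u64 calc_ns = ipvs->est_calc_ns;

	if (!calc_ns || !real_ns)
		return;
	/* deviation of more than 4 times in either direction */
	if (real_ns > 4 * calc_ns || calc_ns > 4 * real_ns) {
		ipvs->est_calc_phase = 1;	/* existing flag */
		ip_vs_est_reload_start(ipvs);	/* stop, relink, restart tasks */
	}
}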

> > [  237.555043] IPVS: stop unused estimator thread 0...
> > [  237.599116] IPVS: ipvs unloaded.
> > [  268.533028] IPVS: ipvs loaded.
> > [  268.548401] IPVS: starting estimator thread 0...
> > [  268.554472] IPVS: calc: chain_max=33, single est=2834ns, diff=2834, loops=1, ntest=3
> > [  268.563972] IPVS: dequeue: 68ns
> > [  268.568292] IPVS: using max 1584 ests per chain, 79200 per kthread
> > [  270.495032] IPVS: tick time: 5761ns for 64 CPUs, 2 ests, 1 chains, chain_max=1584
> > [  307.847045] IPVS: stop unused estimator thread 0...
> > [  307.891101] IPVS: ipvs unloaded.
> Loading the module and adding a service a third time gives a diff that is close enough to the expected value:
> > [  312.807107] IPVS: ipvs loaded.
> > [  312.823972] IPVS: starting estimator thread 0...
> > [  312.829967] IPVS: calc: chain_max=38, single est=2444ns, diff=2477, loops=1, ntest=3
> > [  312.839470] IPVS: dequeue: 66ns
> > [  312.843800] IPVS: using max 1824 ests per chain, 91200 per kthread
> > [  314.771028] IPVS: tick time: 5703ns for 64 CPUs, 2 ests, 1 chains, chain_max=1824
> Here is a distribution of the time needed to process one estimator - the average value is around 2900 ns (on my testing machine):
> > dmesg | awk '/tick time:/ {d = $(NF - 8); sub("ns", "", d); d /= $(NF - 4); d = int(d / 100) * 100; hist[d]++} END {PROCINFO["sorted_in"] = "@ind_num_asc"; for (d in hist) printf "%5d %5d\n", d, hist[d]}'
> >  2500     2
> >  2700     1
> >  2800   243
> >  2900   427
> >  3000    20
> >  3100     1
> >  3500     1
> >  3600     1
> >  3700     1
> >  4900     1
> I am not sure why the first 3 tests give such a high diff value, but the diff value is much closer to the measured average time after the module is loaded a second time.
> 
> I ran more tests. All I did was increase ntests to 3000. The diff had a much more realistic value even when the calc phase was carried out for the first time:

	Yes, more tests should give more accurate results
for the current load. But load can change.

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-16 12:21     ` Julian Anastasov
@ 2022-10-22 18:15       ` Jiri Wiesner
  2022-10-24 15:01         ` Julian Anastasov
  0 siblings, 1 reply; 15+ messages in thread
From: Jiri Wiesner @ 2022-10-22 18:15 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Sun, Oct 16, 2022 at 03:21:10PM +0300, Julian Anastasov wrote:
> > I have tested this. There is one problem: When the calc phase is carried out for the first time after booting the kernel, the diff is several times higher than what it should be - it was 7325 ns on my testing machine. The wrong chain_max value causes 15 kthreads to be created when 500,000 estimators have been added, which is not abysmal (it's better to underestimate chain_max than to overestimate it) but not optimal either. When the ip_vs module is unloaded and then a new service is added again, the diff has the expected value. The commands:
> 
> 	Yes, our goal allows underestimation of the tick time,
> 2-3 times is expected and not a big deal.

I agree that even if the max chain length is underestimated in many cases, it can be expected that there will be occasions on which the max chain length is just right. If the parameters of the algorithm were changed to make the underestimates less severe, there would be occasions when the max chain length would be overestimated, leading to high CPU utilization, which is undesirable.

> 	It is not a problem to add some wait_event_idle_timeout
> calls to sleep before/between tests if the system is so busy
> on boot that it can even disturb our tests with disabled BHs.

That is definitely not the case. When I get the underestimated max chain length:
> [  130.699910][ T2564] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
> [  130.707580][ T2564] IPVS: Connection hash table configured (size=4096, memory=32Kbytes)
> [  130.716633][ T2564] IPVS: ipvs loaded.
> [  130.723423][ T2570] IPVS: [wlc] scheduler registered.
> [  130.731071][  T477] IPVS: starting estimator thread 0...
> [  130.737169][ T2571] IPVS: calc: chain_max=12, single est=7379ns, diff=7379, loops=1, ntest=3
> [  130.746673][ T2571] IPVS: dequeue: 81ns
> [  130.750988][ T2571] IPVS: using max 576 ests per chain, 28800 per kthread
> [  132.678012][ T2571] IPVS: tick time: 5930ns for 64 CPUs, 2 ests, 1 chains, chain_max=576
the system is idle, not running any workload and the booting sequence has finished.

This is the scheduler trace for the calculation of the underestimated max chain length above:
>  kworker/37:1-mm   477 [037]   130.737046:       sched:sched_waking: comm=ipvs-e:0:0 pid=2571 prio=120 target_cpu=062
>          swapper     0 [062]   130.737132:       sched:sched_switch: prev_comm=swapper/62 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=2571 next_prio=120
ip_vs_est_calc_limits() ran here.
>       ipvs-e:0:0  2571 [062]   130.746658: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=2571 runtime=9537544 [ns] vruntime=579529909 [ns]
>       ipvs-e:0:0  2571 [062]   130.750979:       sched:sched_waking: comm=in:imklog pid=2248 prio=120 target_cpu=011
>       ipvs-e:0:0  2571 [062]   130.750981:       sched:sched_waking: comm=systemd-journal pid=1160 prio=120 target_cpu=017
>       ipvs-e:0:0  2571 [062]   130.750982: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=2571 runtime=4324225 [ns] vruntime=583854134 [ns]
>       ipvs-e:0:0  2571 [062]   130.758608:       sched:sched_waking: comm=in:imklog pid=2248 prio=120 target_cpu=011
>       ipvs-e:0:0  2571 [062]   130.758609:       sched:sched_waking: comm=systemd-journal pid=1160 prio=120 target_cpu=017
>       ipvs-e:0:0  2571 [062]   130.758609: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=2571 runtime=7627794 [ns] vruntime=591481928 [ns]
>       ipvs-e:0:0  2571 [062]   130.758617: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=2571 runtime=7480 [ns] vruntime=591489408 [ns]
>       ipvs-e:0:0  2571 [062]   130.758619:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=2571 prev_prio=120 prev_state=I ==> next_comm=swapper/62 next_pid=0 next_prio=120
At this point, ipvs-e:0:0 blocked after finishing the first iteration of the loop in ip_vs_estimation_kthread(). All the trace entries from CPU 62 between the wakeup of ipvs-e:0:0 and this point are listed above.
>          swapper     0 [062]   130.797980:       sched:sched_waking: comm=ipvs-e:0:0 pid=2571 prio=120 target_cpu=062
>          swapper     0 [062]   130.797992:       sched:sched_switch: prev_comm=swapper/62 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=2571 next_prio=120
>       ipvs-e:0:0  2571 [062]   130.797995: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=2571 runtime=6252 [ns] vruntime=591495660 [ns]
>       ipvs-e:0:0  2571 [062]   130.797996:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=2571 prev_prio=120 prev_state=I ==> next_comm=swapper/62 next_pid=0 next_prio=120
>          swapper     0 [062]   130.837981:       sched:sched_waking: comm=ipvs-e:0:0 pid=2571 prio=120 target_cpu=062
>          swapper     0 [062]   130.837996:       sched:sched_switch: prev_comm=swapper/62 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=2571 next_prio=120
>       ipvs-e:0:0  2571 [062]   130.837998: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=2571 runtime=6362 [ns] vruntime=591502022 [ns]
>       ipvs-e:0:0  2571 [062]   130.838000:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=2571 prev_prio=120 prev_state=I ==> next_comm=swapper/62 next_pid=0 next_prio=120

> Maybe there are many IRQs that interrupt us.

I've tried that - it made no difference.

> We have 2 seconds
> for the first tests, so we can add some gaps. If you want to
> test it you can add some schedule_timeout(HZ/10) between the
> 3 tests (total 300ms delay of the real estimation).

schedule_timeout(HZ/10) does not make the thread sleep - the function returns 25, which is the value of the timeout passed to it. msleep(100) works, though:
>  kworker/0:0-eve     7 [000]    70.223673:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
>          swapper     0 [051]    70.223770:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
>       ipvs-e:0:0  8927 [051]    70.223786: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=26009 [ns] vruntime=2654620258 [ns]
>       ipvs-e:0:0  8927 [051]    70.223787:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=D ==> next_comm=swapper/51 next_pid=0 next_prio=120
>          swapper     0 [051]    70.331221:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
>          swapper     0 [051]    70.331234:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
>       ipvs-e:0:0  8927 [051]    70.331241: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=11064 [ns] vruntime=2654631322 [ns]
>       ipvs-e:0:0  8927 [051]    70.331242:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=D ==> next_comm=swapper/51 next_pid=0 next_prio=120
>          swapper     0 [051]    70.439220:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
>          swapper     0 [051]    70.439235:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
>       ipvs-e:0:0  8927 [051]    70.439242: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=10324 [ns] vruntime=2654641646 [ns]
>       ipvs-e:0:0  8927 [051]    70.439243:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=D ==> next_comm=swapper/51 next_pid=0 next_prio=120
>          swapper     0 [051]    70.547220:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
>          swapper     0 [051]    70.547235:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
>       ipvs-e:0:0  8927 [051]    70.556717: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=9486028 [ns] vruntime=2664127674 [ns]
>       ipvs-e:0:0  8927 [051]    70.561134:       sched:sched_waking: comm=in:imklog pid=2210 prio=120 target_cpu=039
>       ipvs-e:0:0  8927 [051]    70.561136:       sched:sched_waking: comm=systemd-journal pid=1161 prio=120 target_cpu=001
>       ipvs-e:0:0  8927 [051]    70.561138: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=4421889 [ns] vruntime=2668549563 [ns]
>       ipvs-e:0:0  8927 [051]    70.568867:       sched:sched_waking: comm=in:imklog pid=2210 prio=120 target_cpu=039
>       ipvs-e:0:0  8927 [051]    70.568868:       sched:sched_waking: comm=systemd-journal pid=1161 prio=120 target_cpu=001
>       ipvs-e:0:0  8927 [051]    70.568870: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=7731843 [ns] vruntime=2676281406 [ns]
>       ipvs-e:0:0  8927 [051]    70.568878: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=8169 [ns] vruntime=2676289575 [ns]
>       ipvs-e:0:0  8927 [051]    70.568880:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=I ==> next_comm=swapper/51 next_pid=0 next_prio=120
>          swapper     0 [051]    70.611220:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
>          swapper     0 [051]    70.611239:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
>       ipvs-e:0:0  8927 [051]    70.611243: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=10196 [ns] vruntime=2676299771 [ns]
>       ipvs-e:0:0  8927 [051]    70.611245:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=I ==> next_comm=swapper/51 next_pid=0 next_prio=120
>          swapper     0 [051]    70.651220:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
>          swapper     0 [051]    70.651239:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
>       ipvs-e:0:0  8927 [051]    70.651243: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=10985 [ns] vruntime=2676310756 [ns]
>       ipvs-e:0:0  8927 [051]    70.651245:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=I ==> next_comm=swapper/51 next_pid=0 next_prio=120
After adding msleep(), I have rebooted 3 times and added a service. The max chain length was always at the optimal value - around 35. I think more tests on other architectures would be needed. I can test on ARM next week.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-22 18:15       ` Jiri Wiesner
@ 2022-10-24 15:01         ` Julian Anastasov
  2022-10-26 15:29           ` Julian Anastasov
  2022-10-27 18:07           ` Jiri Wiesner
  0 siblings, 2 replies; 15+ messages in thread
From: Julian Anastasov @ 2022-10-24 15:01 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Sat, 22 Oct 2022, Jiri Wiesner wrote:

> On Sun, Oct 16, 2022 at 03:21:10PM +0300, Julian Anastasov wrote:
> 
> > 	It is not a problem to add some wait_event_idle_timeout
> > calls to sleep before/between tests if the system is so busy
> > on boot that it can even disturb our tests with disabled BHs.
> 
> That is definitely not the case. When I get the underestimated max chain length:
> > [  130.699910][ T2564] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
> > [  130.707580][ T2564] IPVS: Connection hash table configured (size=4096, memory=32Kbytes)
> > [  130.716633][ T2564] IPVS: ipvs loaded.
> > [  130.723423][ T2570] IPVS: [wlc] scheduler registered.
> > [  130.731071][  T477] IPVS: starting estimator thread 0...
> > [  130.737169][ T2571] IPVS: calc: chain_max=12, single est=7379ns, diff=7379, loops=1, ntest=3
> > [  130.746673][ T2571] IPVS: dequeue: 81ns
> > [  130.750988][ T2571] IPVS: using max 576 ests per chain, 28800 per kthread
> > [  132.678012][ T2571] IPVS: tick time: 5930ns for 64 CPUs, 2 ests, 1 chains, chain_max=576
> the system is idle, not running any workload and the booting sequence has finished.

	Hm, can it be some cpufreq/ondemand issue causing this?
Test can be affected by CPU speed.

> > We have 2 seconds
> > for the first tests, so we can add some gaps. If you want to
> > test it you can add some schedule_timeout(HZ/10) between the
> > 3 tests (total 300ms delay of the real estimation).
> 
> schedule_timeout(HZ/10) does not make the thread sleep - the function returns 25, which is the value of the timeout passed to it. msleep(100) works, though:

	Hm, yes. Due to the RUNNING state, msleep works
for testing.
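
	For completeness, a small sketch of the difference (assuming
HZ=250, which matches the 25 jiffies you saw):

	/* schedule_timeout() sleeps only if the task state was changed
	 * first; in TASK_RUNNING it returns immediately with the whole
	 * timeout left (25 jiffies for HZ/10 at HZ=250)
	 */
	schedule_timeout(HZ / 10);		/* no sleep */

	set_current_state(TASK_IDLE);
	schedule_timeout(HZ / 10);		/* sleeps ~100ms */

	/* helpers that set the state internally */
	schedule_timeout_idle(HZ / 10);
	msleep(100);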

> >  kworker/0:0-eve     7 [000]    70.223673:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
> >          swapper     0 [051]    70.223770:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
> >       ipvs-e:0:0  8927 [051]    70.223786: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=26009 [ns] vruntime=2654620258 [ns]
> >       ipvs-e:0:0  8927 [051]    70.223787:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=D ==> next_comm=swapper/51 next_pid=0 next_prio=120
> >          swapper     0 [051]    70.331221:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
> >          swapper     0 [051]    70.331234:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
> >       ipvs-e:0:0  8927 [051]    70.331241: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=11064 [ns] vruntime=2654631322 [ns]
> >       ipvs-e:0:0  8927 [051]    70.331242:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=D ==> next_comm=swapper/51 next_pid=0 next_prio=120
> >          swapper     0 [051]    70.439220:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
> >          swapper     0 [051]    70.439235:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
> >       ipvs-e:0:0  8927 [051]    70.439242: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=10324 [ns] vruntime=2654641646 [ns]
> >       ipvs-e:0:0  8927 [051]    70.439243:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=D ==> next_comm=swapper/51 next_pid=0 next_prio=120
> >          swapper     0 [051]    70.547220:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
> >          swapper     0 [051]    70.547235:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
> >       ipvs-e:0:0  8927 [051]    70.556717: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=9486028 [ns] vruntime=2664127674 [ns]
> >       ipvs-e:0:0  8927 [051]    70.561134:       sched:sched_waking: comm=in:imklog pid=2210 prio=120 target_cpu=039
> >       ipvs-e:0:0  8927 [051]    70.561136:       sched:sched_waking: comm=systemd-journal pid=1161 prio=120 target_cpu=001
> >       ipvs-e:0:0  8927 [051]    70.561138: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=4421889 [ns] vruntime=2668549563 [ns]
> >       ipvs-e:0:0  8927 [051]    70.568867:       sched:sched_waking: comm=in:imklog pid=2210 prio=120 target_cpu=039
> >       ipvs-e:0:0  8927 [051]    70.568868:       sched:sched_waking: comm=systemd-journal pid=1161 prio=120 target_cpu=001
> >       ipvs-e:0:0  8927 [051]    70.568870: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=7731843 [ns] vruntime=2676281406 [ns]
> >       ipvs-e:0:0  8927 [051]    70.568878: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=8169 [ns] vruntime=2676289575 [ns]
> >       ipvs-e:0:0  8927 [051]    70.568880:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=I ==> next_comm=swapper/51 next_pid=0 next_prio=120
> >          swapper     0 [051]    70.611220:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
> >          swapper     0 [051]    70.611239:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
> >       ipvs-e:0:0  8927 [051]    70.611243: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=10196 [ns] vruntime=2676299771 [ns]
> >       ipvs-e:0:0  8927 [051]    70.611245:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=I ==> next_comm=swapper/51 next_pid=0 next_prio=120
> >          swapper     0 [051]    70.651220:       sched:sched_waking: comm=ipvs-e:0:0 pid=8927 prio=120 target_cpu=051
> >          swapper     0 [051]    70.651239:       sched:sched_switch: prev_comm=swapper/51 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ipvs-e:0:0 next_pid=8927 next_prio=120
> >       ipvs-e:0:0  8927 [051]    70.651243: sched:sched_stat_runtime: comm=ipvs-e:0:0 pid=8927 runtime=10985 [ns] vruntime=2676310756 [ns]
> >       ipvs-e:0:0  8927 [051]    70.651245:       sched:sched_switch: prev_comm=ipvs-e:0:0 prev_pid=8927 prev_prio=120 prev_state=I ==> next_comm=swapper/51 next_pid=0 next_prio=120
> After adding msleep(), I have rebooted 3 times and added a service. The max chain length was always at the optimal value - around 35. I think more tests on other architecture would be needed. I can test on ARM next week.

	Then I'll add such a pause between the tests in the next
version. Let me know if you see any problems with different NUMA
configurations due to the chosen cache_factor.

	For now, I don't have a good idea how to change the
algorithm to use feedback from real estimation without
complicating it further. The only way to safely change
the chain limit immediately is as it is implemented now: stop
tasks, reallocate, relink and start tasks. If we want to
do it without stopping tasks, it violates the RCU-list
rules: we cannot relink entries without an RCU grace period.

	So, we have the following options:

1. Use this algorithm if it works in different configurations
2. Use this algorithm but trigger recalculation (stop, relink,
start) if the kthread with the largest number of entries detects
a big difference for chain_max
3. Implement different data structure to store estimators

	Currently, the problem comes from the fact that we
store estimators in chains. We should cut these chains if
chain_max should be reduced. Second option would be to
put estimators in ptr arrays but then there is a problem
with fragmentation on add/del and as a result, slower walking.
Arrays probably can allow the limit used for cond_resched,
that is now chain_max, to be applied without relinking
entries.

	To summarize, the goals are:

- allocations for linking estimators should not be large (many
pages), prefer to allocate in small steps

- due to RCU-list rules we can not relink without task stop+start

- real estimation should give more accurate values for
the parameters: cond_resched rate

- fast lock-free walking of estimators by kthreads

- fast add/del of estimators, by netlink

- if possible, a way to avoid estimations for estimators
that are not updated, eg. added service/dest but no
traffic

- fast and safe way to apply a new chain_max or similar
parameter for cond_resched rate. If possible, without
relinking. stop+start can be slow too.

	While finishing this posting, I'm investigating
the idea to use structures without chains (no relinking),
without chain_len, tick_len, etc. But let me first see
if such idea can work...

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-24 15:01         ` Julian Anastasov
@ 2022-10-26 15:29           ` Julian Anastasov
  2022-10-27 18:07           ` Jiri Wiesner
  1 sibling, 0 replies; 15+ messages in thread
From: Julian Anastasov @ 2022-10-26 15:29 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Mon, 24 Oct 2022, Julian Anastasov wrote:

> 	While finishing this posting, I'm investigating
> the idea to use structures without chains (no relinking),
> without chain_len, tick_len, etc. But let me first see
> if such idea can work...

	Hm, I tried some ideas but the result is not good at all.
chain_max looks like a good implicit way to apply the cond_resched
rate and to return to some safe position on return. Other methods
with arrays will (1) allocate more memory or (2) slow down the search
for a free slot while adding or (3) slow down the walking during
estimation. Now we benefit from long chains (33/38 as in your
setup) to (1) use less memory to store ests in chains and (2) to
reduce the cost of the for_each_set_bit() operation.

	So, for now I don't have any better idea to change
the data structures.

	After your feedback I can prepare the next version by adding
wait_event_idle_timeout() calls as a pause between tests.

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-24 15:01         ` Julian Anastasov
  2022-10-26 15:29           ` Julian Anastasov
@ 2022-10-27 18:07           ` Jiri Wiesner
  2022-10-29 14:12             ` Julian Anastasov
  1 sibling, 1 reply; 15+ messages in thread
From: Jiri Wiesner @ 2022-10-27 18:07 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 24, 2022 at 06:01:32PM +0300, Julian Anastasov wrote:
> > On Sun, Oct 16, 2022 at 03:21:10PM +0300, Julian Anastasov wrote:
> > 
> > > 	It is not a problem to add some wait_event_idle_timeout
> > > calls to sleep before/between tests if the system is so busy
> > > on boot that it can even disturb our tests with disabled BHs.
> > 
> > That is definitely not the case. When I get the underestimated max chain length:
> > > [  130.699910][ T2564] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
> > > [  130.707580][ T2564] IPVS: Connection hash table configured (size=4096, memory=32Kbytes)
> > > [  130.716633][ T2564] IPVS: ipvs loaded.
> > > [  130.723423][ T2570] IPVS: [wlc] scheduler registered.
> > > [  130.731071][  T477] IPVS: starting estimator thread 0...
> > > [  130.737169][ T2571] IPVS: calc: chain_max=12, single est=7379ns, diff=7379, loops=1, ntest=3
> > > [  130.746673][ T2571] IPVS: dequeue: 81ns
> > > [  130.750988][ T2571] IPVS: using max 576 ests per chain, 28800 per kthread
> > > [  132.678012][ T2571] IPVS: tick time: 5930ns for 64 CPUs, 2 ests, 1 chains, chain_max=576
> > the system is idle, not running any workload and the booting sequence has finished.
> 
> 	Hm, can it be some cpufreq/ondemand issue causing this?
> Test can be affected by CPU speed.

Yes, my testing confirms that it is the CPU frequency governor. On my Intel testing machine, the intel_pstate driver can use the powersave governor (which is similar to the ondemand cpufreq governor) or the performance governor (which is, again, similar to the ondemand cpufreq governor but ramps up CPU frequency rapidly when a CPU is utilized). Chain_max ends up being 12 up to 35 when the powersave governor is used. It may happen that the powersave governor does not manage to ramp up CPU frequency before the calc phase is over so chain_max can be as low as 12. Chain_max exceeds 50 when the performance governor is used.

This leads to the following scenario: What if someone set the performance governor, added many estimators and changed the governor to powersave? It seems the current algorithm leaves enough headroom for this sequence of steps not to saturate the CPUs:
> cpupower frequency-set -g performance
> [ 3796.171742] IPVS: starting estimator thread 0...
> [ 3796.177723] IPVS: calc: chain_max=53, single est=1775ns, diff=1775, loops=1, ntest=3
> [ 3796.187205] IPVS: dequeue: 35ns
> [ 3796.191513] IPVS: using max 2544 ests per chain, 127200 per kthread
> [ 3798.081076] IPVS: tick time: 64306ns for 64 CPUs, 89 ests, 1 chains, chain_max=2544
> [ 3898.081668] IPVS: tick time: 661019ns for 64 CPUs, 743 ests, 1 chains, chain_max=2544
> [ 3959.127101] IPVS: starting estimator thread 1...
This is the output of top with the performance governor and a fully loaded kthread:
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 31138 root      20   0       0      0      0 I 4.651 0.000   0:06.02 ipvs-e:0:0
> 39644 root      20   0       0      0      0 I 0.332 0.000   0:00.28 ipvs-e:0:1
> cpupower frequency-set -g powersave
> [ 3962.083052] IPVS: tick time: 2047264ns for 64 CPUs, 2544 ests, 1 chains, chain_max=2544
> [ 4026.083100] IPVS: tick time: 2074317ns for 64 CPUs, 2544 ests, 1 chains, chain_max=2544
> [ 4090.086794] IPVS: tick time: 5758102ns for 64 CPUs, 2544 ests, 1 chains, chain_max=2544
> [ 4154.086788] IPVS: tick time: 5753057ns for 64 CPUs, 2544 ests, 1 chains, chain_max=2544
This is the output of top with the powersave governor and a fully loaded kthread:
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 31138 root      20   0       0      0      0 I 13.91 0.000   0:16.34 ipvs-e:0:0
> 39644 root      20   0       0      0      0 I 1.656 0.000   0:01.32 ipvs-e:0:1
So, the CPU time more than doubles but is still reasonable.

Next, I tried the same with a 4 NUMA node ARM server, which uses the cppc_cpufreq driver. I checked the chain_max under different governors:
> cpupower frequency-set -g powersave > /dev/null
> ipvsadm -A -t 10.10.10.1:2000
> ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> [ 8833.384789] IPVS: starting estimator thread 0...
> [ 8833.743439] IPVS: calc: chain_max=1, single est=66250ns, diff=66250, loops=1, ntest=3
> [ 8833.751989] IPVS: dequeue: 460ns
> [ 8833.755955] IPVS: using max 48 ests per chain, 2400 per kthread
> [ 8835.723480] IPVS: tick time: 49150ns for 128 CPUs, 2 ests, 1 chains, chain_max=48
> cpupower frequency-set -g ondemand > /dev/null
> ipvsadm -A -t 10.10.10.1:2000
> ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> [ 8865.160082] IPVS: starting estimator thread 0...
> [ 8865.523554] IPVS: calc: chain_max=7, single est=13090ns, diff=71140, loops=1, ntest=3
> [ 8865.532119] IPVS: dequeue: 470ns
> [ 8865.536098] IPVS: using max 336 ests per chain, 16800 per kthread
> [ 8867.503530] IPVS: tick time: 90650ns for 128 CPUs, 2 ests, 1 chains, chain_max=336
> cpupower frequency-set -g performance > /dev/null
> ipvsadm -A -t 10.10.10.1:2000
> ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> [ 9064.480977] IPVS: starting estimator thread 0...
> [ 9064.843404] IPVS: calc: chain_max=11, single est=8230ns, diff=8230, loops=1, ntest=3
> [ 9064.851836] IPVS: dequeue: 50ns
> [ 9064.855668] IPVS: using max 528 ests per chain, 26400 per kthread
> [ 9066.823414] IPVS: tick time: 8020ns for 128 CPUs, 2 ests, 1 chains, chain_max=528

I created a fully loaded kthread under the performance governor and switched to more energy-saving governors after that:
> cpupower frequency-set -g performance > /dev/null
> [ 9174.806973] IPVS: starting estimator thread 0...
> [ 9175.163406] IPVS: calc: chain_max=12, single est=7890ns, diff=7890, loops=1, ntest=3
> [ 9175.171834] IPVS: dequeue: 80ns
> [ 9175.175663] IPVS: using max 576 ests per chain, 28800 per kthread
> [ 9177.143429] IPVS: tick time: 21080ns for 128 CPUs, 2 ests, 1 chains, chain_max=576
> [ 9241.145020] IPVS: tick time: 1608270ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> [ 9246.984959] IPVS: starting estimator thread 1...
>    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
>   7071 root      20   0       0      0      0 I   3.630 0.000   0:03.24 ipvs-e:0:0
>  35898 root      20   0       0      0      0 I   1.320 0.000   0:00.78 ipvs-e:0:1
> [ 9305.145029] IPVS: tick time: 1617990ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> cpupower frequency-set -g ondemand > /dev/null
> [ 9369.148006] IPVS: tick time: 4575030ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> [ 9433.147149] IPVS: tick time: 3725910ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
>    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
>   7071 root      20   0       0      0      0 I  11.148 0.000   0:53.90 ipvs-e:0:0
>  35898 root      20   0       0      0      0 I   5.902 0.000   0:26.94 ipvs-e:0:1
> [ 9497.149206] IPVS: tick time: 5564490ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> [ 9561.147165] IPVS: tick time: 3735390ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> [ 9625.146803] IPVS: tick time: 3382870ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> [ 9689.148018] IPVS: tick time: 4580270ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> cpupower frequency-set -g powersave > /dev/null
> [ 9753.152504] IPVS: tick time: 8979300ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> [ 9817.152433] IPVS: tick time: 8985520ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
>    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
>   7071 root      20   0       0      0      0 I  22.293 0.000   1:04.82 ipvs-e:0:0
>  35898 root      20   0       0      0      0 I   8.599 0.000   0:31.48 ipvs-e:0:1
To my slight surprise, the result is not disastrous even on this platform, which has a much bigger difference between a CPU running in powersave and a CPU in performance mode.

> 	Then I'll add such a pause between the tests in the next
> version. Let me know if you see any problems with different NUMA
> configurations due to the chosen cache_factor.

I think ip_vs_est_calc_limits() could do more to obtain a more realistic chain_max value. There should definitely be a finite number of iterations taken by the for loop - in hundreds, I guess. Instead of just collecting the minimum value of min_est, ip_vs_est_calc_limits() should check for its convergence. Once the difference from a previous iteration gets below a threshold (say, expressed as a fraction of the min_est value), a condition checking this would terminate the loop before completing all of the iterations. A sliding average and bit shifting could be used to check for convergence.
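
Something along these lines, perhaps - a sketch only, the helper name and the bounds are made up, not taken from ip_vs_est_calc_limits():

	/* Sketch: stop the calibration once an exponential moving
	 * average of the measured per-estimation cost settles.
	 */
	u64 avg = 0, cost;
	int i;

	for (i = 0; i < 300; i++) {		/* hard upper bound */
		cost = time_single_estimation();	/* hypothetical helper */
		if (!avg) {
			avg = cost;
			continue;
		}
		/* avg += (cost - avg) / 8, using shifts only */
		avg = avg - (avg >> 3) + (cost >> 3);
		/* converged: sample within ~1/16 of the average */
		if ((avg > cost ? avg - cost : cost - avg) < (avg >> 4))
			break;
	}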

> 	For now, I don't have a good idea how to change the
> algorithm to use feedback from real estimation without
> complicating it further. The only way to safely change
> the chain limit immediately is as it is implemented now: stop
> tasks, reallocate, relink and start tasks. If we want to
> do it without stopping tasks, it violates the RCU-list
> rules: we can not relink entries without RCU grace period.
> 
> 	So, we have the following options:
> 
> 1. Use this algorithm if it works in different configurations

I think the current algorithm is a major improvement over what is currently in mainline. The current mainline algorithm just shamelessly steals hundreds of milliseconds from processes and causes havoc in terms of latency.

> 2. Use this algorithm but trigger recalculation (stop, relink,
> start) if a kthread with largest number of entries detects
> big difference for chain_max
> 3. Implement different data structure to store estimators
> 
> 	Currently, the problem comes from the fact that we
> store estimators in chains. We should cut these chains if
> chain_max should be reduced. Second option would be to
> put estimators in ptr arrays but then there is a problem
> with fragmentation on add/del and as a result, slower walking.
> Arrays probably can allow the limit used for cond_resched,
> that is now chain_max, to be applied without relinking
> entries.
> 
> 	To summarize, the goals are:
> 
> - allocations for linking estimators should not be large (many
> pages), prefer to allocate in small steps
> 
> - due to RCU-list rules we can not relink without task stop+start
> 
> - real estimation should give more accurate values for
> the parameters: cond_resched rate
> 
> - fast lock-free walking of estimators by kthreads
> 
> - fast add/del of estimators, by netlink
> 
> - if possible, a way to avoid estimations for estimators
> that are not updated, eg. added service/dest but no
> traffic
> 
> - fast and safe way to apply a new chain_max or similar
> parameter for cond_resched rate. If possible, without
> relinking. stop+start can be slow too.

I am still wondering where the requirement for 100 us latency in non-preemptive kernels comes from. Typical time slices assigned by a time-sharing scheduler are measured in milliseconds. A kernel with voluntary preemption does not need any cond_resched statements in ip_vs_tick_estimation() because every spin_unlock() in ip_vs_chain_estimation() is a preemption point, which actually puts the accuracy of the computed estimates at risk but nothing can be done about that, I guess.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-27 18:07           ` Jiri Wiesner
@ 2022-10-29 14:12             ` Julian Anastasov
  2022-11-16 16:41               ` Jiri Wiesner
  0 siblings, 1 reply; 15+ messages in thread
From: Julian Anastasov @ 2022-10-29 14:12 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Thu, 27 Oct 2022, Jiri Wiesner wrote:

> On Mon, Oct 24, 2022 at 06:01:32PM +0300, Julian Anastasov wrote:
> > 
> > 	Hm, can it be some cpufreq/ondemand issue causing this?
> > Test can be affected by CPU speed.
> 
> Yes, my testing confirms that it is the CPU frequency governor. On my Intel testing machine, the intel_pstate driver can use the powersave governor (which is similar to the ondemand cpufreq governor) or the performance governor (which is, again, similar to the ondemand cpufreq governor but ramps up CPU frequency rapidly when a CPU is utilized). Chain_max ends up being 12 up to 35 when the powersave governor is used. It may happen that the powersave governor does not manage to ramp up CPU frequency before the calc phase is over so chain_max can be as low as 12. Chain_max exceeds 50 when the performance governor is used.

	OK, then we can try some sequence of 3 x (pause + 4) tests,
for example: pause, t1, t2, t3, t4, pause, t5... t12.
More tests, and especially the added delay, should ensure the
CPU speed has ramped up.
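
	For example, something like this - purely illustrative, the
wait queue, the helper and the 20ms pause are not from the patch:

	/* Sketch: 3 groups of (pause + 4 timed tests); the pause gives
	 * the cpufreq governor time to ramp the CPU up before we measure.
	 */
	for (g = 0; g < 3; g++) {
		wait_event_idle_timeout(kd->wait, kthread_should_stop(),
					msecs_to_jiffies(20));
		for (i = 0; i < 4; i++)
			min_est = min(min_est, time_single_estimation());
	}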

> This leads to the following scenario: What if someone set the performance governor, added many estimators and changed the governor to powersave? It seems the current algorithm leaves enough headroom for this sequence of steps not to saturate the CPUs:

	We hope our test pattern will reduce the difference
in governors to acceptable levels :)

> > cpupower frequency-set -g performance
> > [ 3796.171742] IPVS: starting estimator thread 0...
> > [ 3796.177723] IPVS: calc: chain_max=53, single est=1775ns, diff=1775, loops=1, ntest=3
> > [ 3796.187205] IPVS: dequeue: 35ns
> > [ 3796.191513] IPVS: using max 2544 ests per chain, 127200 per kthread
> > [ 3798.081076] IPVS: tick time: 64306ns for 64 CPUs, 89 ests, 1 chains, chain_max=2544
> > [ 3898.081668] IPVS: tick time: 661019ns for 64 CPUs, 743 ests, 1 chains, chain_max=2544
> > [ 3959.127101] IPVS: starting estimator thread 1...
> This is the output of top with the performance governor and a fully loaded kthread:
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > 31138 root      20   0       0      0      0 I 4.651 0.000   0:06.02 ipvs-e:0:0
> > 39644 root      20   0       0      0      0 I 0.332 0.000   0:00.28 ipvs-e:0:1
> > cpupower frequency-set -g powersave
> > [ 3962.083052] IPVS: tick time: 2047264ns for 64 CPUs, 2544 ests, 1 chains, chain_max=2544
> > [ 4026.083100] IPVS: tick time: 2074317ns for 64 CPUs, 2544 ests, 1 chains, chain_max=2544
> > [ 4090.086794] IPVS: tick time: 5758102ns for 64 CPUs, 2544 ests, 1 chains, chain_max=2544
> > [ 4154.086788] IPVS: tick time: 5753057ns for 64 CPUs, 2544 ests, 1 chains, chain_max=2544

	5.7ms vs desired 4.8ms, not bad

> This is the output of top with the powersave governor and a fully loaded kthread:
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > 31138 root      20   0       0      0      0 I 13.91 0.000   0:16.34 ipvs-e:0:0
> > 39644 root      20   0       0      0      0 I 1.656 0.000   0:01.32 ipvs-e:0:1
> So, the CPU time more than doubles but is still reasonable.

	Yep, desired is 12.5% :)

> 
> Next, I tried the same with a 4 NUMA node ARM server, which uses the cppc_cpufreq driver. I checked the chain_max under different governors:
> > cpupower frequency-set -g powersave > /dev/null
> > ipvsadm -A -t 10.10.10.1:2000
> > ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> > [ 8833.384789] IPVS: starting estimator thread 0...
> > [ 8833.743439] IPVS: calc: chain_max=1, single est=66250ns, diff=66250, loops=1, ntest=3
> > [ 8833.751989] IPVS: dequeue: 460ns
> > [ 8833.755955] IPVS: using max 48 ests per chain, 2400 per kthread
> > [ 8835.723480] IPVS: tick time: 49150ns for 128 CPUs, 2 ests, 1 chains, chain_max=48
> > cpupower frequency-set -g ondemand > /dev/null
> > ipvsadm -A -t 10.10.10.1:2000
> > ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> > [ 8865.160082] IPVS: starting estimator thread 0...
> > [ 8865.523554] IPVS: calc: chain_max=7, single est=13090ns, diff=71140, loops=1, ntest=3

	Low values for chain_max such as 1 and 7 are a real
issue due to the large number of CPUs. Not sure how to
optimize this: we do not know which CPUs are engaged
in network traffic, not to mention that OUTPUT traffic
can come from any CPU that serves local applications.
We spend many cycles reading stats from unused CPUs.
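
	For reference, a simplified sketch of where those cycles go;
it ignores the u64_stats seqcount handling, folds only one counter
and only approximates the ip_vs structure names:

	u64 sum = 0;
	int cpu;

	/* every estimation walks all possible CPUs, including the
	 * ones that never touch IPVS traffic
	 */
	for_each_possible_cpu(cpu)
		sum += per_cpu_ptr(stats->cpustats, cpu)->cnt.conns;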

> > [ 8865.532119] IPVS: dequeue: 470ns
> > [ 8865.536098] IPVS: using max 336 ests per chain, 16800 per kthread
> > [ 8867.503530] IPVS: tick time: 90650ns for 128 CPUs, 2 ests, 1 chains, chain_max=336
> > cpupower frequency-set -g performance > /dev/null
> > ipvsadm -A -t 10.10.10.1:2000
> > ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> > [ 9064.480977] IPVS: starting estimator thread 0...
> > [ 9064.843404] IPVS: calc: chain_max=11, single est=8230ns, diff=8230, loops=1, ntest=3
> > [ 9064.851836] IPVS: dequeue: 50ns
> > [ 9064.855668] IPVS: using max 528 ests per chain, 26400 per kthread
> > [ 9066.823414] IPVS: tick time: 8020ns for 128 CPUs, 2 ests, 1 chains, chain_max=528
> 
> I created a fully loaded kthread under the performance governor and switched to more energy-saving governors after that:
> > cpupower frequency-set -g performance > /dev/null
> > [ 9174.806973] IPVS: starting estimator thread 0...
> > [ 9175.163406] IPVS: calc: chain_max=12, single est=7890ns, diff=7890, loops=1, ntest=3
> > [ 9175.171834] IPVS: dequeue: 80ns
> > [ 9175.175663] IPVS: using max 576 ests per chain, 28800 per kthread
> > [ 9177.143429] IPVS: tick time: 21080ns for 128 CPUs, 2 ests, 1 chains, chain_max=576
> > [ 9241.145020] IPVS: tick time: 1608270ns for 128 CPUs, 576 ests, 1 chains, chain_max=576

	1.6ms, expected was 4.8ms, does it mean we underestimate
by 3 times?

> > [ 9246.984959] IPVS: starting estimator thread 1...
> >    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
> >   7071 root      20   0       0      0      0 I   3.630 0.000   0:03.24 ipvs-e:0:0
> >  35898 root      20   0       0      0      0 I   1.320 0.000   0:00.78 ipvs-e:0:1
> > [ 9305.145029] IPVS: tick time: 1617990ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> > cpupower frequency-set -g ondemand > /dev/null
> > [ 9369.148006] IPVS: tick time: 4575030ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> > [ 9433.147149] IPVS: tick time: 3725910ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> >    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
> >   7071 root      20   0       0      0      0 I  11.148 0.000   0:53.90 ipvs-e:0:0
> >  35898 root      20   0       0      0      0 I   5.902 0.000   0:26.94 ipvs-e:0:1
> > [ 9497.149206] IPVS: tick time: 5564490ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> > [ 9561.147165] IPVS: tick time: 3735390ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> > [ 9625.146803] IPVS: tick time: 3382870ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> > [ 9689.148018] IPVS: tick time: 4580270ns for 128 CPUs, 576 ests, 1 chains, chain_max=576

	We are lucky here, 4.5ms, 11.1%

> > cpupower frequency-set -g powersave > /dev/null
> > [ 9753.152504] IPVS: tick time: 8979300ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> > [ 9817.152433] IPVS: tick time: 8985520ns for 128 CPUs, 576 ests, 1 chains, chain_max=576
> >    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
> >   7071 root      20   0       0      0      0 I  22.293 0.000   1:04.82 ipvs-e:0:0
> >  35898 root      20   0       0      0      0 I   8.599 0.000   0:31.48 ipvs-e:0:1
> To my slight surprise, the result is not disastrous even on this platform, which has a much bigger difference between a CPU running in powersave and a CPU in performance mode.

	8.9ms is going hot ...

> > 	Then I'll add such a pause between the tests in the next
> > version. Let me know if you see any problems with different NUMA
> > configurations due to the chosen cache_factor.
> 
> I think ip_vs_est_calc_limits() could do more to obtain a more realistic chain_max value. There should definitely be a finite number of iterations taken by the for loop - in hundreds, I guess. Instead of just collecting the minimum value of min_est, ip_vs_est_calc_limits() should check for its convergence. Once the difference from a previous iteration gets below a threshold (say, expressed as a fraction of the min_est value), a condition checking this would terminate the loop before completing all of the iterations. A sliding average and bit shifting could be used to check for convergence.

	Our tests probably run on an idle system. If the
CPUs we use get network traffic, the reality can be scary :)
So, I'm not sure if any averages can make any difference,
because what we will detect is noise from load, which differs
between the initial tests and later service. For now, it looks
like we can ignore the problems due to changing the cpufreq
governor.

> > 	For now, I don't have a good idea how to change the
> > algorithm to use feedback from real estimation without
> > complicating it further. The only way to safely change
> > the chain limit immediately is as it is implemented now: stop
> > tasks, reallocate, relink and start tasks. If we want to
> > do it without stopping tasks, it violates the RCU-list
> > rules: we can not relink entries without RCU grace period.
> > 
> > 	So, we have the following options:
> > 
> > 1. Use this algorithm if it works in different configurations
> 
> I think the current algorithm is a major improvement over what is currently in mainline. The current mainline algorithm just shamelessly steals hundreds of milliseconds from processes and causes havoc in terms of latency.

	Yep, though for multi-CPU systems the whole
stats estimation looks problematic, both before and now.

> > 2. Use this algorithm but trigger recalculation (stop, relink,
> > start) if a kthread with largest number of entries detects
> > big difference for chain_max
> > 3. Implement different data structure to store estimators
> > 
> > 	Currently, the problem comes from the fact that we
> > store estimators in chains. We should cut these chains if
> > chain_max should be reduced. Second option would be to
> > put estimators in ptr arrays but then there is a problem
> > with fragmentation on add/del and as a result, slower walking.
> > Arrays probably can allow the limit used for cond_resched,
> > that is now chain_max, to be applied without relinking
> > entries.
> > 
> > 	To summarize, the goals are:
> > 
> > - allocations for linking estimators should not be large (many
> > pages), prefer to allocate in small steps
> > 
> > - due to RCU-list rules we can not relink without task stop+start
> > 
> > - real estimation should give more accurate values for
> > the parameters: cond_resched rate
> > 
> > - fast lock-free walking of estimators by kthreads
> > 
> > - fast add/del of estimators, by netlink
> > 
> > - if possible, a way to avoid estimations for estimators
> > that are not updated, eg. added service/dest but no
> > traffic
> > 
> > - fast and safe way to apply a new chain_max or similar
> > parameter for cond_resched rate. If possible, without
> > relinking. stop+start can be slow too.
> 
> I am still wondering where the requirement for 100 us latency in non-preemptive kernels comes from. Typical time slices assigned by a time-sharing scheduler are measured in milliseconds. A kernel with voluntary preemption does not need any cond_resched statements in ip_vs_tick_estimation() because every spin_unlock() in ip_vs_chain_estimation() is a preemption point, which actually puts the accuracy of the computed estimates at risk but nothing can be done about that, I guess.

	I'm not sure about the 100us requirement for non-RT
kernels; this document covers only RT requirements, I think:

Documentation/RCU/Design/Requirements/Requirements.rst

	In fact, I don't worry about the RCU-preemptible
case, where we can be rescheduled at any time. In that
case cond_resched_rcu() is a NOP and chain_max has only
one purpose, limiting the ests per kthread, i.e. it does not
determine the period between cond_resched calls, which is
its second purpose in the non-preemptible case.

	As for the non-preemptible case,
rcu_read_lock/rcu_read_unlock are just preempt_disable/preempt_enable,
which means the spin locking can not preempt us; the only candidate
is the rcu_read_unlock call, which is just preempt_count_dec()
or a simple barrier(), but __preempt_schedule() is not
called as it is on CONFIG_PREEMPTION. So, only
cond_resched() can allow rescheduling.

	Also, there are some configurations like nohz_full
that expect cond_resched() to check for any pending
rcu_urgent_qs condition via rcu_all_qs(). I'm not an
expert in areas such as RCU and scheduling, so I'm
not sure about the 100us latency budget for the
non-preemptible cases we cover:

1. PREEMPT_NONE "No Forced Preemption (Server)"
2. PREEMPT_VOLUNTARY "Voluntary Kernel Preemption (Desktop)"

	Where the latency can matter is in setups where the
IPVS kthreads are set to some low priority, as a
way to work in idle times and to allow app servers
to react to clients' requests faster. Once a request
is served with a short delay, the app blocks somewhere and
our kthreads run again in idle times.

	In short, the IPVS kthreads do not have
urgent work; they can do their 4.8ms of work in 40ms
or even more, but it is preferred not to delay other
higher-priority tasks such as applications or even other
kthreads. That is why I think we should stick to some low
period between cond_resched calls without letting
it take a large part of our CPU usage.

	If we want to reduce its rate, it can be
in this way, for example:

	int n = 0;

	/* 400us for forced cond_resched() but reschedule on demand */
	if (!(++n & 3) || need_resched()) {
		cond_resched_rcu();
		n = 0;
	}

	This both honors the RCU requirements and
reacts faster to the scheduler's indication. There will be
a useless need_resched() call for the RCU-preemptible
case, though, where cond_resched_rcu() is a NOP.
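
	For context, this is roughly where such a check would sit;
est_walk_next() and estimate_one() are stand-ins, not the real
walk from patch 3:

	int n = 0;

	rcu_read_lock();
	while ((est = est_walk_next(kd)) != NULL) {	/* hypothetical */
		estimate_one(est);			/* hypothetical */
		/* 400us for forced cond_resched() but reschedule on demand */
		if (!(++n & 3) || need_resched()) {
			cond_resched_rcu();	/* NOP in the RCU-preemptible case */
			n = 0;
		}
	}
	rcu_read_unlock();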

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation
  2022-10-29 14:12             ` Julian Anastasov
@ 2022-11-16 16:41               ` Jiri Wiesner
  0 siblings, 0 replies; 15+ messages in thread
From: Jiri Wiesner @ 2022-11-16 16:41 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Sat, Oct 29, 2022 at 05:12:28PM +0300, Julian Anastasov wrote:
> On Thu, 27 Oct 2022, Jiri Wiesner wrote:
> > On Mon, Oct 24, 2022 at 06:01:32PM +0300, Julian Anastasov wrote:
> > > - fast and safe way to apply a new chain_max or similar
> > > parameter for cond_resched rate. If possible, without
> > > relinking. stop+start can be slow too.
> > 
> > I am still wondering where the requirement for 100 us latency in non-preemptive kernels comes from. Typical time slices assigned by a time-sharing scheduler are measured in milliseconds. A kernel with voluntary preemption does not need any cond_resched statements in ip_vs_tick_estimation() because every spin_unlock() in ip_vs_chain_estimation() is a preemption point, which actually puts the accuracy of the computed estimates at risk but nothing can be done about that, I guess.
> 
> 	I'm not sure about the 100us requirement for non-RT
> kernels; this document covers only RT requirements, I think:
> 
> Documentation/RCU/Design/Requirements/Requirements.rst
> 
> 	In fact, I don't worry about the RCU-preemptible
> case, where we can be rescheduled at any time. In that
> case cond_resched_rcu() is a NOP and chain_max has only
> one purpose, limiting the ests per kthread, i.e. it does not
> determine the period between cond_resched calls, which is
> its second purpose in the non-preemptible case.
> 
> 	As for the non-preemptible case,
> rcu_read_lock/rcu_read_unlock are just preempt_disable/preempt_enable,
> which means the spin locking can not preempt us; the only candidate
> is the rcu_read_unlock call, which is just preempt_count_dec()
> or a simple barrier(), but __preempt_schedule() is not
> called as it is on CONFIG_PREEMPTION. So, only
> cond_resched() can allow rescheduling.
> 
> 	Also, there are some configurations like nohz_full
> that expect cond_resched() to check for any pending
> rcu_urgent_qs condition via rcu_all_qs(). I'm not an
> expert in areas such as RCU and scheduling, so I'm
> not sure about the 100us latency budget for the
> non-preemptible cases we cover:
> 
> 1. PREEMPT_NONE "No Forced Preemption (Server)"
> 2. PREEMPT_VOLUNTARY "Voluntary Kernel Preemption (Desktop)"
> 
> 	Where the latency can matter is in setups where the
> IPVS kthreads are set to some low priority, as a
> way to work in idle times and to allow app servers
> to react to clients' requests faster. Once a request
> is served with a short delay, the app blocks somewhere and
> our kthreads run again in idle times.
> 
> 	In short, the IPVS kthreads do not have
> urgent work; they can do their 4.8ms of work in 40ms
> or even more, but it is preferred not to delay other
> higher-priority tasks such as applications or even other
> kthreads. That is why I think we should stick to some low
> period between cond_resched calls without letting
> it take a large part of our CPU usage.

OK, I agree that voluntary preemption without CONFIG_PREEMPT_RCU will need a preemption point in ip_vs_tick_estimation().

> 	If we want to reduce its rate, it can be
> in this way, for example:
> 
> 	int n = 0;
> 
> 	/* 400us for forced cond_resched() but reschedule on demand */
> 	if (!(++n & 3) || need_resched()) {
> 		cond_resched_rcu();
> 		n = 0;
> 	}
> 
> 	This both honors the RCU requirements and
> reacts faster to the scheduler's indication. There will be
> a useless need_resched() call for the RCU-preemptible
> case, though, where cond_resched_rcu() is a NOP.

I do not see that as an improvement either.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-11-16 16:41 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-09 15:37 [RFC PATCHv5 0/6] ipvs: Use kthreads for stats Julian Anastasov
2022-10-09 15:37 ` [RFC PATCHv5 1/6] ipvs: add rcu protection to stats Julian Anastasov
2022-10-09 15:37 ` [RFC PATCHv5 2/6] ipvs: use common functions for stats allocation Julian Anastasov
2022-10-09 15:37 ` [RFC PATCHv5 3/6] ipvs: use kthreads for stats estimation Julian Anastasov
2022-10-15  9:21   ` Jiri Wiesner
2022-10-16 12:21     ` Julian Anastasov
2022-10-22 18:15       ` Jiri Wiesner
2022-10-24 15:01         ` Julian Anastasov
2022-10-26 15:29           ` Julian Anastasov
2022-10-27 18:07           ` Jiri Wiesner
2022-10-29 14:12             ` Julian Anastasov
2022-11-16 16:41               ` Jiri Wiesner
2022-10-09 15:37 ` [RFC PATCHv5 4/6] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
2022-10-09 15:37 ` [RFC PATCHv5 5/6] ipvs: run_estimation should control the kthread tasks Julian Anastasov
2022-10-09 15:37 ` [RFC PATCHv5 6/6] ipvs: debug the tick time Julian Anastasov
