* [RFC PATCHv6 0/7] ipvs: Use kthreads for stats
@ 2022-10-31 14:56 Julian Anastasov
  2022-10-31 14:56 ` [RFC PATCHv6 1/7] ipvs: add rcu protection to stats Julian Anastasov
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Julian Anastasov @ 2022-10-31 14:56 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

	Hello,

	Posting v6 (release candidate 1). A new patch is
inserted at 3rd position. Patch 7 is just for debugging;
do not apply it if not needed.

	This patchset implements stats estimation in
kthread context. Simple tests do not show any problem.

	Overview of the basic concepts. More in the
commit messages...

RCU Locking:

- As stats are now RCU-locked, tot_stats, svc and dest,
which hold estimator structures, are now always freed from
an RCU callback. This ensures an RCU grace period passes
after the ip_vs_stop_estimator() call.
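
	A minimal sketch of this release pattern (illustrative only,
not the exact patch code; demo_stats just stands in for the real
stats holders):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/types.h>

/* The object embedding the stats is freed only from an RCU callback,
 * so readers walking the estimator chains under rcu_read_lock()
 * never touch freed memory.
 */
struct demo_stats {
        u64             counter;        /* stands in for the percpu counters */
        struct rcu_head rcu_head;
};

static void demo_stats_rcu_free(struct rcu_head *head)
{
        struct demo_stats *s = container_of(head, struct demo_stats, rcu_head);

        kfree(s);                       /* runs after the grace period */
}

static void demo_stats_put(struct demo_stats *s)
{
        /* as patch 1 does for tot_stats, svc and dest */
        call_rcu(&s->rcu_head, demo_stats_rcu_free);
}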

Kthread data:

- every kthread works on its own data structure and all
such structures are attached to an array. For now we limit
the number of kthreads depending on the number of CPUs.

- even when a kthread structure exists, its task may not
be running, e.g. before the first service is added, while
the sysctl var is set to an empty cpulist or when
run_estimation is 0.

- the allocated kthread context may grow from 1 to 50
allocated tick structures, which saves memory for setups
with a small number of estimators

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- every kthread data structure allows a limited number
of estimators. Kthread 0 is also used to initially
calculate the number of estimators to allow in every
chain, targeting a sub-100 microsecond cond_resched
rate; this number can range from 1 to hundreds. The
arithmetic behind the tick/chain budget is shown after
this list.

- kthread 0 has an additional job of optimizing the
adding of estimators: they are first added to a
temp list (est_temp_list) and later kthread 0
distributes them to the other kthreads. The optimization
is based on the fact that a newly added estimator
should be estimated after 2 seconds, so we have
time to offload the adding to a chain from the
controlling process to kthread 0.

- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chains just before the currently estimated one, based
on add_row. This ensures their estimation will start after
2 seconds. If estimators are added in bursts, a common case
when all services and dests are configured initially, we
may spread the estimators over more chains. This reduces
the chain imbalance.
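
	For reference, the tick/chain budget mentioned above can be
reproduced from the constants introduced in patch 4 (IPVS_EST_NTICKS,
IPVS_EST_TICK, IPVS_EST_CHAIN_FACTOR). A small userspace sketch,
assuming HZ=1000 and replicating ALIGN_DOWN only for the demo:

#include <stdio.h>

#define HZ                      1000    /* assumed for the example */
#define ALIGN_DOWN(x, a)        ((x) / (a) * (a))

#define IPVS_EST_NTICKS         50
#define IPVS_EST_TICK           ((2 * HZ) / IPVS_EST_NTICKS)
#define IPVS_EST_CHAIN_FACTOR   ALIGN_DOWN(2 * 1000 * 10 / 8 / IPVS_EST_NTICKS, 8)

int main(void)
{
        int tick_ms = IPVS_EST_TICK * 1000 / HZ;        /* 40 ms per tick */
        int chains = IPVS_EST_CHAIN_FACTOR;             /* 48 chains per tick */
        int busy_us = chains * 100;                     /* ~100us per chain */

        /* prints: 40 ms ticks, 48 chains, 4800 us (12%) busy per tick;
         * patch 4 also caps the kthreads at 4 * num_possible_cpus()
         */
        printf("tick: %d ms, chains/tick: %d\n", tick_ms, chains);
        printf("estimation budget: %d us per tick (%d%%)\n",
               busy_us, busy_us / 10 / tick_ms);
        return 0;
}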

	There are things that I don't like but for now
I don't have a better idea for them:

- calculation of chain_max can go wrong, depending on the
current load, CPU speed, memory speed, running in a VM and
whether the tested estimators are in the CPU cache, even if
we are doing it in SCHED_FIFO mode or with BH disabled.
I expect such noise to be insignificant but who knows.

- ip_vs_stop_estimator is not a simple unlinking of a
list node; we spend cycles to account for the removed
estimator

- the relinking does not preserve the delay for all
estimators. To solve this, we would need to pre-allocate
kthread data for all threads and then enqueue stats in
parallel.

- __ip_vs_mutex is a global mutex for all netns, but it
protects hash tables that are still global.


Changes in v6:
Patch 4 (was 3 in v5):
* while ests are in est_temp_list, use est->ktrow as a delay (offset
  from est_row) to apply on enqueue, initially or later when estimators
  are relinked. In this way we try to preserve the current timer delay
  on relinking. If there is no space in the desired time slot, the
  initial delay can be reduced, while on relinking the delay can be
  increased. Still, this is far from perfect and the relinking
  continues to apply the wrong delay for some of the ests, while its
  goal is still to fill kthreads data one by one.
* ip_vs_est_calc_limits() now relinks the entries trying to preserve
  the delay; ests with the lowest delay get a higher chance to
  preserve it
* add_row is now not updated (it is wrong on relinking) but continues
  to be used as the starting position for the bit search, so the logic
  is preserved
* while relinking, try to keep tot_stats in kt 0; we do not want to
  keep some kthread 1..N just for it
* ip_vs_est_calc_limits() now performs 12 tests but sleeps for 20ms
  after every 4 tests to make sure the CPU speed has ramped up
Patch 3:
* new patch that converts per-cpu counters to u64_stats_t

Changes in v5:
Patch 4 (was 3 in v4):
* use ip_vs_est_max_threads() helper
Patch 3 (was 2 in v4):
* kthread 0 now disables BH instead of using SCHED_FIFO mode;
  this should work because we now perform fewer repeated tests
  over the pre-allocated kd->calc_stats structure. Use
  cache_factor = 4 to approximate how much slower the non-cached
  (due to their large number) per-cpu stats are estimated compared
  to the cached data we estimate in the calc phase. This needs to
  be tested with different numbers of CPUs and NUMA nodes.
* limit max threads using the formula 4 * Number of CPUs, not on rlimit
* remove _len suffix from vars like chain_max_len, tick_max_len,
  est_chain_max_len
Patch 2:
* new patch that adds functions for stats allocations

Changes in v4:
Patch 2:
* kthread 0 can start with a calculation phase in SCHED_FIFO mode
  to determine a chain_max_len suitable for a 100us cond_resched
  rate and 12% CPU usage of a 40ms tick. The current value of
  IPVS_EST_TICK_CHAINS=48 determines a tick time of 4.8ms (i.e.
  in units of 100us), which is 12% of the max tick time of 40ms.
  The question is how reliable such a calculation test will be.
* est_calc_phase indicates a mode where we dequeue estimators
  from kthreads, apply the new chain_max_len and enqueue all
  estimators to kthreads again, done by kthread 0
* est->ktid can now be -1 to indicate the est is in est_temp_list,
  ready to be distributed to a kthread by kt 0, done in
  ip_vs_est_drain_temp_list(). kthread 0 data is now released
  only after the data for the other kthreads
* ip_vs_start_estimator was not setting ret = 0
* READ_ONCE not needed for volatile jiffies
Patch 3:
* restrict cpulist based on the cpus_allowed of
  process that assigns cpulist, not on cpu_possible_mask
* change of cpulist will trigger calc phase
Patch 5:
* print message every minute, not 2 seconds

Changes in v3:
Patch 2:
* calculate chain_max_len (was IPVS_EST_CHAIN_DEPTH) but
  it needs further tuning based on real estimation tests
* est_max_threads is set from rlimit(RLIMIT_NPROC). I don't
  see an analog to get_ucounts_value() to get the max value.
* the atomic bitop for td->present is not needed,
  remove it
* start filling based on est_row after 2 ticks are
  fully allocated. As 2/50 is 4%, this can be increased
  further.

Changes in v2:
Patch 2:
* kd->mutex is gone, cond_resched rate determined by
  IPVS_EST_CHAIN_DEPTH
* IPVS_EST_MAX_COUNT is a hard limit now
* kthread data is now 1-50 allocated tick structures,
  each containing heads for limited chains. Bitmaps
  should allow faster access. We avoid large
  allocations for structs.
* as the td->present bitmap is shared, use atomic bitops
* ip_vs_start_estimator now returns error code
* _bh locking removed from stats->lock
* bump arg is gone from ip_vs_est_reload_start
* prepare for upcoming changes that remove _irq
  from u64_stats_fetch_begin_irq/u64_stats_fetch_retry_irq
* est_add_ktid is now always valid
Patch 3:
* use .. in est_nice docs

Julian Anastasov (7):
  ipvs: add rcu protection to stats
  ipvs: use common functions for stats allocation
  ipvs: use u64_stats_t for the per-cpu counters
  ipvs: use kthreads for stats estimation
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: run_estimation should control the kthread tasks
  ipvs: debug the tick time

 Documentation/networking/ipvs-sysctl.rst |  24 +-
 include/net/ip_vs.h                      | 154 +++-
 net/netfilter/ipvs/ip_vs_core.c          |  40 +-
 net/netfilter/ipvs/ip_vs_ctl.c           | 448 ++++++++---
 net/netfilter/ipvs/ip_vs_est.c           | 902 +++++++++++++++++++++--
 5 files changed, 1383 insertions(+), 185 deletions(-)

-- 
2.38.1




* [RFC PATCHv6 1/7] ipvs: add rcu protection to stats
  2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
@ 2022-10-31 14:56 ` Julian Anastasov
  2022-11-12  7:51   ` Jiri Wiesner
  2022-10-31 14:56 ` [RFC PATCHv6 2/7] ipvs: use common functions for stats allocation Julian Anastasov
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2022-10-31 14:56 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

In preparation for using RCU locking for the list
of estimators, make sure struct ip_vs_stats is
released after an RCU grace period by using RCU
callbacks. This affects ipvs->tot_stats, where we
cannot use RCU callbacks for ipvs, so we use an
allocated struct ip_vs_stats_rcu. For services
and dests we force RCU callbacks in all cases.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h             |  8 ++++-
 net/netfilter/ipvs/ip_vs_core.c | 10 ++++--
 net/netfilter/ipvs/ip_vs_ctl.c  | 64 ++++++++++++++++++++++-----------
 3 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index ff1804a0c469..bd8ae137e43b 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -405,6 +405,11 @@ struct ip_vs_stats {
 	struct ip_vs_kstats	kstats0;	/* reset values */
 };
 
+struct ip_vs_stats_rcu {
+	struct ip_vs_stats	s;
+	struct rcu_head		rcu_head;
+};
+
 struct dst_entry;
 struct iphdr;
 struct ip_vs_conn;
@@ -688,6 +693,7 @@ struct ip_vs_dest {
 	union nf_inet_addr	vaddr;		/* virtual IP address */
 	__u32			vfwmark;	/* firewall mark of service */
 
+	struct rcu_head		rcu_head;
 	struct list_head	t_list;		/* in dest_trash */
 	unsigned int		in_rs_table:1;	/* we are in rs_table */
 };
@@ -869,7 +875,7 @@ struct netns_ipvs {
 	atomic_t		conn_count;      /* connection counter */
 
 	/* ip_vs_ctl */
-	struct ip_vs_stats		tot_stats;  /* Statistics & est. */
+	struct ip_vs_stats_rcu	*tot_stats;      /* Statistics & est. */
 
 	int			num_services;    /* no of virtual services */
 	int			num_services6;   /* IPv6 virtual services */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 51ad557a525b..fcdaef1fcccf 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -143,7 +143,7 @@ ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
 		s->cnt.inbytes += skb->len;
 		u64_stats_update_end(&s->syncp);
 
-		s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+		s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 		u64_stats_update_begin(&s->syncp);
 		s->cnt.inpkts++;
 		s->cnt.inbytes += skb->len;
@@ -179,7 +179,7 @@ ip_vs_out_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
 		s->cnt.outbytes += skb->len;
 		u64_stats_update_end(&s->syncp);
 
-		s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+		s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 		u64_stats_update_begin(&s->syncp);
 		s->cnt.outpkts++;
 		s->cnt.outbytes += skb->len;
@@ -208,7 +208,7 @@ ip_vs_conn_stats(struct ip_vs_conn *cp, struct ip_vs_service *svc)
 	s->cnt.conns++;
 	u64_stats_update_end(&s->syncp);
 
-	s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+	s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 	u64_stats_update_begin(&s->syncp);
 	s->cnt.conns++;
 	u64_stats_update_end(&s->syncp);
@@ -2448,6 +2448,10 @@ static void __exit ip_vs_cleanup(void)
 	ip_vs_conn_cleanup();
 	ip_vs_protocol_cleanup();
 	ip_vs_control_cleanup();
+	/* common rcu_barrier() used by:
+	 * - ip_vs_control_cleanup()
+	 */
+	rcu_barrier();
 	pr_info("ipvs unloaded.\n");
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 4d62059a6021..9016b641ae52 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -483,17 +483,14 @@ static void ip_vs_service_rcu_free(struct rcu_head *head)
 	ip_vs_service_free(svc);
 }
 
-static void __ip_vs_svc_put(struct ip_vs_service *svc, bool do_delay)
+static void __ip_vs_svc_put(struct ip_vs_service *svc)
 {
 	if (atomic_dec_and_test(&svc->refcnt)) {
 		IP_VS_DBG_BUF(3, "Removing service %u/%s:%u\n",
 			      svc->fwmark,
 			      IP_VS_DBG_ADDR(svc->af, &svc->addr),
 			      ntohs(svc->port));
-		if (do_delay)
-			call_rcu(&svc->rcu_head, ip_vs_service_rcu_free);
-		else
-			ip_vs_service_free(svc);
+		call_rcu(&svc->rcu_head, ip_vs_service_rcu_free);
 	}
 }
 
@@ -780,14 +777,22 @@ ip_vs_trash_get_dest(struct ip_vs_service *svc, int dest_af,
 	return dest;
 }
 
+static void ip_vs_dest_rcu_free(struct rcu_head *head)
+{
+	struct ip_vs_dest *dest;
+
+	dest = container_of(head, struct ip_vs_dest, rcu_head);
+	free_percpu(dest->stats.cpustats);
+	ip_vs_dest_put_and_free(dest);
+}
+
 static void ip_vs_dest_free(struct ip_vs_dest *dest)
 {
 	struct ip_vs_service *svc = rcu_dereference_protected(dest->svc, 1);
 
 	__ip_vs_dst_cache_reset(dest);
-	__ip_vs_svc_put(svc, false);
-	free_percpu(dest->stats.cpustats);
-	ip_vs_dest_put_and_free(dest);
+	__ip_vs_svc_put(svc);
+	call_rcu(&dest->rcu_head, ip_vs_dest_rcu_free);
 }
 
 /*
@@ -811,6 +816,16 @@ static void ip_vs_trash_cleanup(struct netns_ipvs *ipvs)
 	}
 }
 
+static void ip_vs_stats_rcu_free(struct rcu_head *head)
+{
+	struct ip_vs_stats_rcu *rs = container_of(head,
+						  struct ip_vs_stats_rcu,
+						  rcu_head);
+
+	free_percpu(rs->s.cpustats);
+	kfree(rs);
+}
+
 static void
 ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
 {
@@ -923,7 +938,7 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
 		if (old_svc != svc) {
 			ip_vs_zero_stats(&dest->stats);
 			__ip_vs_bind_svc(dest, svc);
-			__ip_vs_svc_put(old_svc, true);
+			__ip_vs_svc_put(old_svc);
 		}
 	}
 
@@ -1571,7 +1586,7 @@ static void __ip_vs_del_service(struct ip_vs_service *svc, bool cleanup)
 	/*
 	 *    Free the service if nobody refers to it
 	 */
-	__ip_vs_svc_put(svc, true);
+	__ip_vs_svc_put(svc);
 
 	/* decrease the module use count */
 	ip_vs_use_count_dec();
@@ -1761,7 +1776,7 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 		}
 	}
 
-	ip_vs_zero_stats(&ipvs->tot_stats);
+	ip_vs_zero_stats(&ipvs->tot_stats->s);
 	return 0;
 }
 
@@ -2255,7 +2270,7 @@ static int ip_vs_stats_show(struct seq_file *seq, void *v)
 	seq_puts(seq,
 		 "   Conns  Packets  Packets            Bytes            Bytes\n");
 
-	ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats);
+	ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats->s);
 	seq_printf(seq, "%8LX %8LX %8LX %16LX %16LX\n\n",
 		   (unsigned long long)show.conns,
 		   (unsigned long long)show.inpkts,
@@ -2279,7 +2294,7 @@ static int ip_vs_stats_show(struct seq_file *seq, void *v)
 static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq_file_single_net(seq);
-	struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats;
+	struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats->s;
 	struct ip_vs_cpu_stats __percpu *cpustats = tot_stats->cpustats;
 	struct ip_vs_kstats kstats;
 	int i;
@@ -4107,7 +4122,6 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 			kfree(tbl);
 		return -ENOMEM;
 	}
-	ip_vs_start_estimator(ipvs, &ipvs->tot_stats);
 	ipvs->sysctl_tbl = tbl;
 	/* Schedule defense work */
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
@@ -4118,6 +4132,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
 			  expire_nodest_conn_handler);
 
+	ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
 	return 0;
 }
 
@@ -4129,7 +4144,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats);
+	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s);
 
 	if (!net_eq(net, &init_net))
 		kfree(ipvs->sysctl_tbl);
@@ -4165,17 +4180,20 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 	atomic_set(&ipvs->conn_out_counter, 0);
 
 	/* procfs stats */
-	ipvs->tot_stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!ipvs->tot_stats.cpustats)
+	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
+	if (!ipvs->tot_stats)
 		return -ENOMEM;
+	ipvs->tot_stats->s.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
+	if (!ipvs->tot_stats->s.cpustats)
+		goto err_tot_stats;
 
 	for_each_possible_cpu(i) {
 		struct ip_vs_cpu_stats *ipvs_tot_stats;
-		ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats.cpustats, i);
+		ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats->s.cpustats, i);
 		u64_stats_init(&ipvs_tot_stats->syncp);
 	}
 
-	spin_lock_init(&ipvs->tot_stats.lock);
+	spin_lock_init(&ipvs->tot_stats->s.lock);
 
 #ifdef CONFIG_PROC_FS
 	if (!proc_create_net("ip_vs", 0, ipvs->net->proc_net,
@@ -4207,7 +4225,10 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 
 err_vs:
 #endif
-	free_percpu(ipvs->tot_stats.cpustats);
+	free_percpu(ipvs->tot_stats->s.cpustats);
+
+err_tot_stats:
+	kfree(ipvs->tot_stats);
 	return -ENOMEM;
 }
 
@@ -4220,7 +4241,7 @@ void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
 	remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs", ipvs->net->proc_net);
 #endif
-	free_percpu(ipvs->tot_stats.cpustats);
+	call_rcu(&ipvs->tot_stats->rcu_head, ip_vs_stats_rcu_free);
 }
 
 int __init ip_vs_register_nl_ioctl(void)
@@ -4280,5 +4301,6 @@ void ip_vs_control_cleanup(void)
 {
 	EnterFunction(2);
 	unregister_netdevice_notifier(&ip_vs_dst_notifier);
+	/* relying on common rcu_barrier() in ip_vs_cleanup() */
 	LeaveFunction(2);
 }
-- 
2.38.1




* [RFC PATCHv6 2/7] ipvs: use common functions for stats allocation
  2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
  2022-10-31 14:56 ` [RFC PATCHv6 1/7] ipvs: add rcu protection to stats Julian Anastasov
@ 2022-10-31 14:56 ` Julian Anastasov
  2022-11-12  8:22   ` Jiri Wiesner
  2022-10-31 14:56 ` [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters Julian Anastasov
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2022-10-31 14:56 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Move the alloc_percpu/free_percpu logic into new functions.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h            |  5 ++
 net/netfilter/ipvs/ip_vs_ctl.c | 96 +++++++++++++++++++---------------
 2 files changed, 60 insertions(+), 41 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index bd8ae137e43b..e5582c01a4a3 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -410,6 +410,11 @@ struct ip_vs_stats_rcu {
 	struct rcu_head		rcu_head;
 };
 
+int ip_vs_stats_init_alloc(struct ip_vs_stats *s);
+struct ip_vs_stats *ip_vs_stats_alloc(void);
+void ip_vs_stats_release(struct ip_vs_stats *stats);
+void ip_vs_stats_free(struct ip_vs_stats *stats);
+
 struct dst_entry;
 struct iphdr;
 struct ip_vs_conn;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 9016b641ae52..ec6db864ac36 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -471,7 +471,7 @@ __ip_vs_bind_svc(struct ip_vs_dest *dest, struct ip_vs_service *svc)
 
 static void ip_vs_service_free(struct ip_vs_service *svc)
 {
-	free_percpu(svc->stats.cpustats);
+	ip_vs_stats_release(&svc->stats);
 	kfree(svc);
 }
 
@@ -782,7 +782,7 @@ static void ip_vs_dest_rcu_free(struct rcu_head *head)
 	struct ip_vs_dest *dest;
 
 	dest = container_of(head, struct ip_vs_dest, rcu_head);
-	free_percpu(dest->stats.cpustats);
+	ip_vs_stats_release(&dest->stats);
 	ip_vs_dest_put_and_free(dest);
 }
 
@@ -822,7 +822,7 @@ static void ip_vs_stats_rcu_free(struct rcu_head *head)
 						  struct ip_vs_stats_rcu,
 						  rcu_head);
 
-	free_percpu(rs->s.cpustats);
+	ip_vs_stats_release(&rs->s);
 	kfree(rs);
 }
 
@@ -879,6 +879,47 @@ ip_vs_zero_stats(struct ip_vs_stats *stats)
 	spin_unlock_bh(&stats->lock);
 }
 
+/* Allocate fields after kzalloc */
+int ip_vs_stats_init_alloc(struct ip_vs_stats *s)
+{
+	int i;
+
+	spin_lock_init(&s->lock);
+	s->cpustats = alloc_percpu(struct ip_vs_cpu_stats);
+	if (!s->cpustats)
+		return -ENOMEM;
+
+	for_each_possible_cpu(i) {
+		struct ip_vs_cpu_stats *cs = per_cpu_ptr(s->cpustats, i);
+
+		u64_stats_init(&cs->syncp);
+	}
+	return 0;
+}
+
+struct ip_vs_stats *ip_vs_stats_alloc(void)
+{
+	struct ip_vs_stats *s = kzalloc(sizeof(*s), GFP_KERNEL);
+
+	if (s && ip_vs_stats_init_alloc(s) >= 0)
+		return s;
+	kfree(s);
+	return NULL;
+}
+
+void ip_vs_stats_release(struct ip_vs_stats *stats)
+{
+	free_percpu(stats->cpustats);
+}
+
+void ip_vs_stats_free(struct ip_vs_stats *stats)
+{
+	if (stats) {
+		ip_vs_stats_release(stats);
+		kfree(stats);
+	}
+}
+
 /*
  *	Update a destination in the given service
  */
@@ -978,14 +1019,13 @@ static int
 ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 {
 	struct ip_vs_dest *dest;
-	unsigned int atype, i;
+	unsigned int atype;
+	int ret;
 
 	EnterFunction(2);
 
 #ifdef CONFIG_IP_VS_IPV6
 	if (udest->af == AF_INET6) {
-		int ret;
-
 		atype = ipv6_addr_type(&udest->addr.in6);
 		if ((!(atype & IPV6_ADDR_UNICAST) ||
 			atype & IPV6_ADDR_LINKLOCAL) &&
@@ -1007,16 +1047,10 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 	if (dest == NULL)
 		return -ENOMEM;
 
-	dest->stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!dest->stats.cpustats)
+	ret = ip_vs_stats_init_alloc(&dest->stats);
+	if (ret < 0)
 		goto err_alloc;
 
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *ip_vs_dest_stats;
-		ip_vs_dest_stats = per_cpu_ptr(dest->stats.cpustats, i);
-		u64_stats_init(&ip_vs_dest_stats->syncp);
-	}
-
 	dest->af = udest->af;
 	dest->protocol = svc->protocol;
 	dest->vaddr = svc->addr;
@@ -1032,7 +1066,6 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 
 	INIT_HLIST_NODE(&dest->d_list);
 	spin_lock_init(&dest->dst_lock);
-	spin_lock_init(&dest->stats.lock);
 	__ip_vs_update_dest(svc, dest, udest, 1);
 
 	LeaveFunction(2);
@@ -1040,7 +1073,7 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 
 err_alloc:
 	kfree(dest);
-	return -ENOMEM;
+	return ret;
 }
 
 
@@ -1299,7 +1332,7 @@ static int
 ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 		  struct ip_vs_service **svc_p)
 {
-	int ret = 0, i;
+	int ret = 0;
 	struct ip_vs_scheduler *sched = NULL;
 	struct ip_vs_pe *pe = NULL;
 	struct ip_vs_service *svc = NULL;
@@ -1359,18 +1392,9 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 		ret = -ENOMEM;
 		goto out_err;
 	}
-	svc->stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!svc->stats.cpustats) {
-		ret = -ENOMEM;
+	ret = ip_vs_stats_init_alloc(&svc->stats);
+	if (ret < 0)
 		goto out_err;
-	}
-
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *ip_vs_stats;
-		ip_vs_stats = per_cpu_ptr(svc->stats.cpustats, i);
-		u64_stats_init(&ip_vs_stats->syncp);
-	}
-
 
 	/* I'm the first user of the service */
 	atomic_set(&svc->refcnt, 0);
@@ -1387,7 +1411,6 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 
 	INIT_LIST_HEAD(&svc->destinations);
 	spin_lock_init(&svc->sched_lock);
-	spin_lock_init(&svc->stats.lock);
 
 	/* Bind the scheduler */
 	if (sched) {
@@ -4166,7 +4189,7 @@ static struct notifier_block ip_vs_dst_notifier = {
 
 int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 {
-	int i, idx;
+	int idx;
 
 	/* Initialize rs_table */
 	for (idx = 0; idx < IP_VS_RTAB_SIZE; idx++)
@@ -4183,18 +4206,9 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
 	if (!ipvs->tot_stats)
 		return -ENOMEM;
-	ipvs->tot_stats->s.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!ipvs->tot_stats->s.cpustats)
+	if (ip_vs_stats_init_alloc(&ipvs->tot_stats->s) < 0)
 		goto err_tot_stats;
 
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *ipvs_tot_stats;
-		ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats->s.cpustats, i);
-		u64_stats_init(&ipvs_tot_stats->syncp);
-	}
-
-	spin_lock_init(&ipvs->tot_stats->s.lock);


* [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
  2022-10-31 14:56 ` [RFC PATCHv6 1/7] ipvs: add rcu protection to stats Julian Anastasov
  2022-10-31 14:56 ` [RFC PATCHv6 2/7] ipvs: use common functions for stats allocation Julian Anastasov
@ 2022-10-31 14:56 ` Julian Anastasov
  2022-11-12  9:00   ` Jiri Wiesner
  2022-11-19  7:46   ` Jiri Wiesner
  2022-10-31 14:56 ` [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation Julian Anastasov
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Julian Anastasov @ 2022-10-31 14:56 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Use the provided u64_stats_t type to avoid
load/store tearing.

Fixes: 316580b69d0a ("u64_stats: provide u64_stats_t type")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h             | 10 +++++-----
 net/netfilter/ipvs/ip_vs_core.c | 30 +++++++++++++++---------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 10 +++++-----
 net/netfilter/ipvs/ip_vs_est.c  | 20 ++++++++++----------
 4 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index e5582c01a4a3..a4d44138c2a8 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -351,11 +351,11 @@ struct ip_vs_seq {
 
 /* counters per cpu */
 struct ip_vs_counters {
-	__u64		conns;		/* connections scheduled */
-	__u64		inpkts;		/* incoming packets */
-	__u64		outpkts;	/* outgoing packets */
-	__u64		inbytes;	/* incoming bytes */
-	__u64		outbytes;	/* outgoing bytes */
+	u64_stats_t	conns;		/* connections scheduled */
+	u64_stats_t	inpkts;		/* incoming packets */
+	u64_stats_t	outpkts;	/* outgoing packets */
+	u64_stats_t	inbytes;	/* incoming bytes */
+	u64_stats_t	outbytes;	/* outgoing bytes */
 };
 /* Stats per cpu */
 struct ip_vs_cpu_stats {
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index fcdaef1fcccf..2fcc26507d69 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -132,21 +132,21 @@ ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
 
 		s = this_cpu_ptr(dest->stats.cpustats);
 		u64_stats_update_begin(&s->syncp);
-		s->cnt.inpkts++;
-		s->cnt.inbytes += skb->len;
+		u64_stats_inc(&s->cnt.inpkts);
+		u64_stats_add(&s->cnt.inbytes, skb->len);
 		u64_stats_update_end(&s->syncp);
 
 		svc = rcu_dereference(dest->svc);
 		s = this_cpu_ptr(svc->stats.cpustats);
 		u64_stats_update_begin(&s->syncp);
-		s->cnt.inpkts++;
-		s->cnt.inbytes += skb->len;
+		u64_stats_inc(&s->cnt.inpkts);
+		u64_stats_add(&s->cnt.inbytes, skb->len);
 		u64_stats_update_end(&s->syncp);
 
 		s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 		u64_stats_update_begin(&s->syncp);
-		s->cnt.inpkts++;
-		s->cnt.inbytes += skb->len;
+		u64_stats_inc(&s->cnt.inpkts);
+		u64_stats_add(&s->cnt.inbytes, skb->len);
 		u64_stats_update_end(&s->syncp);
 
 		local_bh_enable();
@@ -168,21 +168,21 @@ ip_vs_out_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
 
 		s = this_cpu_ptr(dest->stats.cpustats);
 		u64_stats_update_begin(&s->syncp);
-		s->cnt.outpkts++;
-		s->cnt.outbytes += skb->len;
+		u64_stats_inc(&s->cnt.outpkts);
+		u64_stats_add(&s->cnt.outbytes, skb->len);
 		u64_stats_update_end(&s->syncp);
 
 		svc = rcu_dereference(dest->svc);
 		s = this_cpu_ptr(svc->stats.cpustats);
 		u64_stats_update_begin(&s->syncp);
-		s->cnt.outpkts++;
-		s->cnt.outbytes += skb->len;
+		u64_stats_inc(&s->cnt.outpkts);
+		u64_stats_add(&s->cnt.outbytes, skb->len);
 		u64_stats_update_end(&s->syncp);
 
 		s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 		u64_stats_update_begin(&s->syncp);
-		s->cnt.outpkts++;
-		s->cnt.outbytes += skb->len;
+		u64_stats_inc(&s->cnt.outpkts);
+		u64_stats_add(&s->cnt.outbytes, skb->len);
 		u64_stats_update_end(&s->syncp);
 
 		local_bh_enable();
@@ -200,17 +200,17 @@ ip_vs_conn_stats(struct ip_vs_conn *cp, struct ip_vs_service *svc)
 
 	s = this_cpu_ptr(cp->dest->stats.cpustats);
 	u64_stats_update_begin(&s->syncp);
-	s->cnt.conns++;
+	u64_stats_inc(&s->cnt.conns);
 	u64_stats_update_end(&s->syncp);
 
 	s = this_cpu_ptr(svc->stats.cpustats);
 	u64_stats_update_begin(&s->syncp);
-	s->cnt.conns++;
+	u64_stats_inc(&s->cnt.conns);
 	u64_stats_update_end(&s->syncp);
 
 	s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 	u64_stats_update_begin(&s->syncp);
-	s->cnt.conns++;
+	u64_stats_inc(&s->cnt.conns);
 	u64_stats_update_end(&s->syncp);
 
 	local_bh_enable();
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index ec6db864ac36..5f9cc2e7ba71 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2335,11 +2335,11 @@ static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v)
 
 		do {
 			start = u64_stats_fetch_begin(&u->syncp);
-			conns = u->cnt.conns;
-			inpkts = u->cnt.inpkts;
-			outpkts = u->cnt.outpkts;
-			inbytes = u->cnt.inbytes;
-			outbytes = u->cnt.outbytes;
+			conns = u64_stats_read(&u->cnt.conns);
+			inpkts = u64_stats_read(&u->cnt.inpkts);
+			outpkts = u64_stats_read(&u->cnt.outpkts);
+			inbytes = u64_stats_read(&u->cnt.inbytes);
+			outbytes = u64_stats_read(&u->cnt.outbytes);
 		} while (u64_stats_fetch_retry(&u->syncp, start));
 
 		seq_printf(seq, "%3X %8LX %8LX %8LX %16LX %16LX\n",
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 9a1a7af6a186..f53150d82a92 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -67,11 +67,11 @@ static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
 		if (add) {
 			do {
 				start = u64_stats_fetch_begin(&s->syncp);
-				conns = s->cnt.conns;
-				inpkts = s->cnt.inpkts;
-				outpkts = s->cnt.outpkts;
-				inbytes = s->cnt.inbytes;
-				outbytes = s->cnt.outbytes;
+				conns = u64_stats_read(&s->cnt.conns);
+				inpkts = u64_stats_read(&s->cnt.inpkts);
+				outpkts = u64_stats_read(&s->cnt.outpkts);
+				inbytes = u64_stats_read(&s->cnt.inbytes);
+				outbytes = u64_stats_read(&s->cnt.outbytes);
 			} while (u64_stats_fetch_retry(&s->syncp, start));
 			sum->conns += conns;
 			sum->inpkts += inpkts;
@@ -82,11 +82,11 @@ static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
 			add = true;
 			do {
 				start = u64_stats_fetch_begin(&s->syncp);
-				sum->conns = s->cnt.conns;
-				sum->inpkts = s->cnt.inpkts;
-				sum->outpkts = s->cnt.outpkts;
-				sum->inbytes = s->cnt.inbytes;
-				sum->outbytes = s->cnt.outbytes;
+				sum->conns = u64_stats_read(&s->cnt.conns);
+				sum->inpkts = u64_stats_read(&s->cnt.inpkts);
+				sum->outpkts = u64_stats_read(&s->cnt.outpkts);
+				sum->inbytes = u64_stats_read(&s->cnt.inbytes);
+				sum->outbytes = u64_stats_read(&s->cnt.outbytes);
 			} while (u64_stats_fetch_retry(&s->syncp, start));
 		}
 	}
-- 
2.38.1




* [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation
  2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
                   ` (2 preceding siblings ...)
  2022-10-31 14:56 ` [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters Julian Anastasov
@ 2022-10-31 14:56 ` Julian Anastasov
  2022-11-10 15:39   ` Jiri Wiesner
  2022-11-21 16:05   ` Jiri Wiesner
  2022-10-31 14:56 ` [RFC PATCHv6 5/7] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Julian Anastasov @ 2022-10-31 14:56 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Estimating all entries in a single list in timer context
causes large latency with many rules.

Spread the estimator structures over multiple chains and
use kthread(s) for the estimation. Every chain is
processed under RCU lock. The first kthread determines the
parameters to use, e.g. the maximum number of estimators to
process per kthread based on the chain length, allowing a
sub-100us cond_resched rate and estimation taking 1/8
of the CPU.

The first kthread also plays the role of distributor of
newly added estimators to all kthreads.

We also add delayed work est_reload_work that makes
sure the kthread tasks are properly started/stopped.

ip_vs_start_estimator() is changed to report errors,
which allows safely storing the estimators in
allocated structures.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h            |  74 ++-
 net/netfilter/ipvs/ip_vs_ctl.c | 126 ++++-
 net/netfilter/ipvs/ip_vs_est.c | 871 ++++++++++++++++++++++++++++++---
 3 files changed, 971 insertions(+), 100 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index a4d44138c2a8..9c02840bc7c9 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -42,6 +42,8 @@ static inline struct netns_ipvs *net_ipvs(struct net* net)
 /* Connections' size value needed by ip_vs_ctl.c */
 extern int ip_vs_conn_tab_size;
 
+extern struct mutex __ip_vs_mutex;
+
 struct ip_vs_iphdr {
 	int hdr_flags;	/* ipvs flags */
 	__u32 off;	/* Where IP or IPv4 header starts */
@@ -365,7 +367,7 @@ struct ip_vs_cpu_stats {
 
 /* IPVS statistics objects */
 struct ip_vs_estimator {
-	struct list_head	list;
+	struct hlist_node	list;
 
 	u64			last_inbytes;
 	u64			last_outbytes;
@@ -378,6 +380,10 @@ struct ip_vs_estimator {
 	u64			outpps;
 	u64			inbps;
 	u64			outbps;
+
+	s32			ktid:16,	/* kthread ID, -1=temp list */
+				ktrow:8,	/* row ID for kthread */
+				ktcid:8;	/* chain ID for kthread */
 };
 
 /*
@@ -415,6 +421,52 @@ struct ip_vs_stats *ip_vs_stats_alloc(void);
 void ip_vs_stats_release(struct ip_vs_stats *stats);
 void ip_vs_stats_free(struct ip_vs_stats *stats);
 
+/* Process estimators in multiple timer ticks */
+#define IPVS_EST_NTICKS		50
+/* Estimation uses a 2-second period containing ticks (in jiffies) */
+#define IPVS_EST_TICK		((2 * HZ) / IPVS_EST_NTICKS)
+
+/* Desired number of chains per tick (chain load factor in 100us units),
+ * 48=4.8ms of 40ms tick (12% CPU usage):
+ * 2 sec * 1000 ms in sec * 10 (100us in ms) / 8 (12.5%) / IPVS_EST_NTICKS
+ */
+#define IPVS_EST_CHAIN_FACTOR	ALIGN_DOWN(2 * 1000 * 10 / 8 / IPVS_EST_NTICKS, 8)
+
+/* Compiled number of chains per tick
+ * The defines should match cond_resched_rcu
+ */
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
+#define IPVS_EST_TICK_CHAINS	IPVS_EST_CHAIN_FACTOR
+#else
+#define IPVS_EST_TICK_CHAINS	1
+#endif
+
+/* Multiple chains processed in same tick */
+struct ip_vs_est_tick_data {
+	struct hlist_head	chains[IPVS_EST_TICK_CHAINS];
+	DECLARE_BITMAP(present, IPVS_EST_TICK_CHAINS);
+	DECLARE_BITMAP(full, IPVS_EST_TICK_CHAINS);
+	int			chain_len[IPVS_EST_TICK_CHAINS];
+};
+
+/* Context for estimation kthread */
+struct ip_vs_est_kt_data {
+	struct netns_ipvs	*ipvs;
+	struct task_struct	*task;		/* task if running */
+	struct ip_vs_est_tick_data __rcu *ticks[IPVS_EST_NTICKS];
+	DECLARE_BITMAP(avail, IPVS_EST_NTICKS);	/* tick has space for ests */
+	unsigned long		est_timer;	/* estimation timer (jiffies) */
+	struct ip_vs_stats	*calc_stats;	/* Used for calculation */
+	int			tick_len[IPVS_EST_NTICKS];	/* est count */
+	int			id;		/* ktid per netns */
+	int			chain_max;	/* max ests per tick chain */
+	int			tick_max;	/* max ests per tick */
+	int			est_count;	/* attached ests to kthread */
+	int			est_max_count;	/* max ests per kthread */
+	int			add_row;	/* row for new ests */
+	int			est_row;	/* estimated row */
+};
+
 struct dst_entry;
 struct iphdr;
 struct ip_vs_conn;
@@ -953,9 +1005,17 @@ struct netns_ipvs {
 	struct ctl_table_header	*lblcr_ctl_header;
 	struct ctl_table	*lblcr_ctl_table;
 	/* ip_vs_est */
-	struct list_head	est_list;	/* estimator list */
-	spinlock_t		est_lock;
-	struct timer_list	est_timer;	/* Estimation timer */
+	struct delayed_work	est_reload_work;/* Reload kthread tasks */
+	struct mutex		est_mutex;	/* protect kthread tasks */
+	struct hlist_head	est_temp_list;	/* Ests during calc phase */
+	struct ip_vs_est_kt_data **est_kt_arr;	/* Array of kthread data ptrs */
+	unsigned long		est_max_threads;/* rlimit */
+	int			est_calc_phase;	/* Calculation phase */
+	int			est_chain_max;	/* Calculated chain_max */
+	int			est_kt_count;	/* Allocated ptrs */
+	int			est_add_ktid;	/* ktid where to add ests */
+	atomic_t		est_genid;	/* kthreads reload genid */
+	atomic_t		est_genid_done;	/* applied genid */
 	/* ip_vs_sync */
 	spinlock_t		sync_lock;
 	struct ipvs_master_sync_state *ms;
@@ -1486,10 +1546,14 @@ int stop_sync_thread(struct netns_ipvs *ipvs, int state);
 void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct ip_vs_conn *cp, int pkts);
 
 /* IPVS rate estimator prototypes (from ip_vs_est.c) */
-void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
+int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
 void ip_vs_zero_estimator(struct ip_vs_stats *stats);
 void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs);
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+			    struct ip_vs_est_kt_data *kd);
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
 
 /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
 int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 5f9cc2e7ba71..c41a5392edc9 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -49,8 +49,7 @@
 
 MODULE_ALIAS_GENL_FAMILY(IPVS_GENL_NAME);
 
-/* semaphore for IPVS sockopts. And, [gs]etsockopt may sleep. */
-static DEFINE_MUTEX(__ip_vs_mutex);
+DEFINE_MUTEX(__ip_vs_mutex); /* Serialize configuration with sockopt/netlink */
 
 /* sysctl variables */
 
@@ -241,6 +240,47 @@ static void defense_work_handler(struct work_struct *work)
 }
 #endif
 
+static void est_reload_work_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs =
+		container_of(work, struct netns_ipvs, est_reload_work.work);
+	int genid_done = atomic_read(&ipvs->est_genid_done);
+	unsigned long delay = HZ / 10;	/* repeat startups after failure */
+	bool repeat = false;
+	int genid;
+	int id;
+
+	mutex_lock(&ipvs->est_mutex);
+	genid = atomic_read(&ipvs->est_genid);
+	for (id = 0; id < ipvs->est_kt_count; id++) {
+		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id];
+
+		/* netns clean up started, abort delayed work */
+		if (!ipvs->enable)
+			goto unlock;
+		if (!kd)
+			continue;
+		/* New config ? Stop kthread tasks */
+		if (genid != genid_done)
+			ip_vs_est_kthread_stop(kd);
+		if (!kd->task) {
+			/* Do not start kthreads above 0 in calc phase */
+			if ((!id || !ipvs->est_calc_phase) &&
+			    ip_vs_est_kthread_start(ipvs, kd) < 0)
+				repeat = true;
+		}
+	}
+
+	atomic_set(&ipvs->est_genid_done, genid);
+
+	if (repeat)
+		queue_delayed_work(system_long_wq, &ipvs->est_reload_work,
+				   delay);
+
+unlock:
+	mutex_unlock(&ipvs->est_mutex);
+}
+
 int
 ip_vs_use_count_inc(void)
 {
@@ -831,7 +871,7 @@ ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
 {
 #define IP_VS_SHOW_STATS_COUNTER(c) dst->c = src->kstats.c - src->kstats0.c
 
-	spin_lock_bh(&src->lock);
+	spin_lock(&src->lock);
 
 	IP_VS_SHOW_STATS_COUNTER(conns);
 	IP_VS_SHOW_STATS_COUNTER(inpkts);
@@ -841,7 +881,7 @@ ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
 
 	ip_vs_read_estimator(dst, src);
 
-	spin_unlock_bh(&src->lock);
+	spin_unlock(&src->lock);
 }
 
 static void
@@ -862,7 +902,7 @@ ip_vs_export_stats_user(struct ip_vs_stats_user *dst, struct ip_vs_kstats *src)
 static void
 ip_vs_zero_stats(struct ip_vs_stats *stats)
 {
-	spin_lock_bh(&stats->lock);
+	spin_lock(&stats->lock);
 
 	/* get current counters as zero point, rates are zeroed */
 
@@ -876,7 +916,7 @@ ip_vs_zero_stats(struct ip_vs_stats *stats)
 
 	ip_vs_zero_estimator(stats);
 
-	spin_unlock_bh(&stats->lock);
+	spin_unlock(&stats->lock);
 }
 
 /* Allocate fields after kzalloc */
@@ -998,7 +1038,6 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
 	spin_unlock_bh(&dest->dst_lock);
 
 	if (add) {
-		ip_vs_start_estimator(svc->ipvs, &dest->stats);
 		list_add_rcu(&dest->n_list, &svc->destinations);
 		svc->num_dests++;
 		sched = rcu_dereference_protected(svc->scheduler, 1);
@@ -1051,6 +1090,10 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 	if (ret < 0)
 		goto err_alloc;
 
+	ret = ip_vs_start_estimator(svc->ipvs, &dest->stats);
+	if (ret < 0)
+		goto err_stats;
+
 	dest->af = udest->af;
 	dest->protocol = svc->protocol;
 	dest->vaddr = svc->addr;
@@ -1071,6 +1114,9 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 	LeaveFunction(2);
 	return 0;
 
+err_stats:
+	ip_vs_stats_release(&dest->stats);
+
 err_alloc:
 	kfree(dest);
 	return ret;
@@ -1135,14 +1181,18 @@ ip_vs_add_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 			      IP_VS_DBG_ADDR(svc->af, &dest->vaddr),
 			      ntohs(dest->vport));
 
+		ret = ip_vs_start_estimator(svc->ipvs, &dest->stats);
+		if (ret < 0)
+			goto err;
 		__ip_vs_update_dest(svc, dest, udest, 1);
-		ret = 0;
 	} else {
 		/*
 		 * Allocate and initialize the dest structure
 		 */
 		ret = ip_vs_new_dest(svc, udest);
 	}
+
+err:
 	LeaveFunction(2);
 
 	return ret;
@@ -1420,6 +1470,10 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 		sched = NULL;
 	}
 
+	ret = ip_vs_start_estimator(ipvs, &svc->stats);
+	if (ret < 0)
+		goto out_err;
+
 	/* Bind the ct retriever */
 	RCU_INIT_POINTER(svc->pe, pe);
 	pe = NULL;
@@ -1432,8 +1486,6 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 	if (svc->pe && svc->pe->conn_out)
 		atomic_inc(&ipvs->conn_out_counter);
 
-	ip_vs_start_estimator(ipvs, &svc->stats);
-
 	/* Count only IPv4 services for old get/setsockopt interface */
 	if (svc->af == AF_INET)
 		ipvs->num_services++;
@@ -1444,8 +1496,15 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 	ip_vs_svc_hash(svc);
 
 	*svc_p = svc;
-	/* Now there is a service - full throttle */
-	ipvs->enable = 1;
+
+	if (!ipvs->enable) {
+		/* Now there is a service - full throttle */
+		ipvs->enable = 1;
+
+		/* Start estimation for first time */
+		ip_vs_est_reload_start(ipvs);
+	}
+
 	return 0;
 
 
@@ -4065,13 +4124,16 @@ static void ip_vs_genl_unregister(void)
 static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
-	int idx;
 	struct ctl_table *tbl;
+	int idx, ret;
 
 	atomic_set(&ipvs->dropentry, 0);
 	spin_lock_init(&ipvs->dropentry_lock);
 	spin_lock_init(&ipvs->droppacket_lock);
 	spin_lock_init(&ipvs->securetcp_lock);
+	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
 
 	if (!net_eq(net, &init_net)) {
 		tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
@@ -4139,24 +4201,27 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 		tbl[idx++].mode = 0444;
 #endif
 
+	ret = -ENOMEM;
 	ipvs->sysctl_hdr = register_net_sysctl(net, "net/ipv4/vs", tbl);
-	if (ipvs->sysctl_hdr == NULL) {
-		if (!net_eq(net, &init_net))
-			kfree(tbl);
-		return -ENOMEM;
-	}
+	if (!ipvs->sysctl_hdr)
+		goto err;
 	ipvs->sysctl_tbl = tbl;
+
+	ret = ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
+	if (ret < 0)
+		goto err;
+
 	/* Schedule defense work */
-	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
 	queue_delayed_work(system_long_wq, &ipvs->defense_work,
 			   DEFENSE_TIMER_PERIOD);
 
-	/* Init delayed work for expiring no dest conn */
-	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
-			  expire_nodest_conn_handler);
-
-	ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
 	return 0;
+
+err:
+	unregister_net_sysctl_table(ipvs->sysctl_hdr);
+	if (!net_eq(net, &init_net))
+		kfree(tbl);
+	return ret;
 }
 
 static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
@@ -4189,6 +4254,7 @@ static struct notifier_block ip_vs_dst_notifier = {
 
 int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 {
+	int ret = -ENOMEM;
 	int idx;
 
 	/* Initialize rs_table */
@@ -4202,10 +4268,12 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 	atomic_set(&ipvs->nullsvc_counter, 0);
 	atomic_set(&ipvs->conn_out_counter, 0);
 
+	INIT_DELAYED_WORK(&ipvs->est_reload_work, est_reload_work_handler);
+
 	/* procfs stats */
 	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
 	if (!ipvs->tot_stats)
-		return -ENOMEM;
+		goto out;
 	if (ip_vs_stats_init_alloc(&ipvs->tot_stats->s) < 0)
 		goto err_tot_stats;
 
@@ -4222,7 +4290,8 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 		goto err_percpu;
 #endif
 
-	if (ip_vs_control_net_init_sysctl(ipvs))
+	ret = ip_vs_control_net_init_sysctl(ipvs);
+	if (ret < 0)
 		goto err;
 
 	return 0;
@@ -4243,13 +4312,16 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 
 err_tot_stats:
 	kfree(ipvs->tot_stats);
-	return -ENOMEM;
+
+out:
+	return ret;
 }
 
 void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
 {
 	ip_vs_trash_cleanup(ipvs);
 	ip_vs_control_net_cleanup_sysctl(ipvs);
+	cancel_delayed_work_sync(&ipvs->est_reload_work);
 #ifdef CONFIG_PROC_FS
 	remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index f53150d82a92..c134f9466dc6 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -30,9 +30,6 @@
   long interval, it is easy to implement a user level daemon which
   periodically reads those statistical counters and measure rate.
 
-  Currently, the measurement is activated by slow timer handler. Hope
-  this measurement will not introduce too much load.
-
   We measure rate during the last 8 seconds every 2 seconds:
 
     avgrate = avgrate*(1-W) + rate*W
@@ -47,68 +44,76 @@
     to 32-bit values for conns, packets, bps, cps and pps.
 
   * A lot of code is taken from net/core/gen_estimator.c
- */
-
 
-/*
- * Make a summary from each cpu
+  KEY POINTS:
+  - cpustats counters are updated per-cpu in SoftIRQ context with BH disabled
+  - kthreads read the cpustats to update the estimators (svcs, dests, total)
+  - the states of estimators can be read (get stats) or modified (zero stats)
+    from processes
+
+  KTHREADS:
+  - estimators are added initially to est_temp_list and later kthread 0
+    distributes them to one or many kthreads for estimation
+  - kthread contexts are created and attached to array
+  - the kthread tasks are started when first service is added, before that
+    the total stats are not estimated
+  - the kthread context holds lists with estimators (chains) which are
+    processed every 2 seconds
+  - as estimators can be added dynamically and in bursts, we try to spread
+    them to multiple chains which are estimated at different time
+  - on start, kthread 0 enters calculation phase to determine the chain limits
+    and the limit of estimators per kthread
+  - est_add_ktid: ktid where to add new ests, can point to empty slot where
+    we should add kt data
  */
-static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
-				 struct ip_vs_cpu_stats __percpu *stats)
-{
-	int i;
-	bool add = false;
-
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i);
-		unsigned int start;
-		u64 conns, inpkts, outpkts, inbytes, outbytes;
 
-		if (add) {
-			do {
-				start = u64_stats_fetch_begin(&s->syncp);
-				conns = u64_stats_read(&s->cnt.conns);
-				inpkts = u64_stats_read(&s->cnt.inpkts);
-				outpkts = u64_stats_read(&s->cnt.outpkts);
-				inbytes = u64_stats_read(&s->cnt.inbytes);
-				outbytes = u64_stats_read(&s->cnt.outbytes);
-			} while (u64_stats_fetch_retry(&s->syncp, start));
-			sum->conns += conns;
-			sum->inpkts += inpkts;
-			sum->outpkts += outpkts;
-			sum->inbytes += inbytes;
-			sum->outbytes += outbytes;
-		} else {
-			add = true;
-			do {
-				start = u64_stats_fetch_begin(&s->syncp);
-				sum->conns = u64_stats_read(&s->cnt.conns);
-				sum->inpkts = u64_stats_read(&s->cnt.inpkts);
-				sum->outpkts = u64_stats_read(&s->cnt.outpkts);
-				sum->inbytes = u64_stats_read(&s->cnt.inbytes);
-				sum->outbytes = u64_stats_read(&s->cnt.outbytes);
-			} while (u64_stats_fetch_retry(&s->syncp, start));
-		}
-	}
-}
+static struct lock_class_key __ipvs_est_key;
 
+static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs);
+static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs);
 
-static void estimation_timer(struct timer_list *t)
+static void ip_vs_chain_estimation(struct hlist_head *chain)
 {
 	struct ip_vs_estimator *e;
+	struct ip_vs_cpu_stats *c;
 	struct ip_vs_stats *s;
 	u64 rate;
-	struct netns_ipvs *ipvs = from_timer(ipvs, t, est_timer);
 
-	if (!sysctl_run_estimation(ipvs))
-		goto skip;
+	hlist_for_each_entry_rcu(e, chain, list) {
+		u64 conns, inpkts, outpkts, inbytes, outbytes;
+		u64 kconns = 0, kinpkts = 0, koutpkts = 0;
+		u64 kinbytes = 0, koutbytes = 0;
+		unsigned int start;
+		int i;
+
+		if (kthread_should_stop())
+			break;
 
-	spin_lock(&ipvs->est_lock);
-	list_for_each_entry(e, &ipvs->est_list, list) {
 		s = container_of(e, struct ip_vs_stats, est);
+		for_each_possible_cpu(i) {
+			c = per_cpu_ptr(s->cpustats, i);
+			do {
+				start = u64_stats_fetch_begin(&c->syncp);
+				conns = u64_stats_read(&c->cnt.conns);
+				inpkts = u64_stats_read(&c->cnt.inpkts);
+				outpkts = u64_stats_read(&c->cnt.outpkts);
+				inbytes = u64_stats_read(&c->cnt.inbytes);
+				outbytes = u64_stats_read(&c->cnt.outbytes);
+			} while (u64_stats_fetch_retry(&c->syncp, start));
+			kconns += conns;
+			kinpkts += inpkts;
+			koutpkts += outpkts;
+			kinbytes += inbytes;
+			koutbytes += outbytes;
+		}
 
 		spin_lock(&s->lock);
-		ip_vs_read_cpu_stats(&s->kstats, s->cpustats);
+
+		s->kstats.conns = kconns;
+		s->kstats.inpkts = kinpkts;
+		s->kstats.outpkts = koutpkts;
+		s->kstats.inbytes = kinbytes;
+		s->kstats.outbytes = koutbytes;
 
 		/* scaled by 2^10, but divided 2 seconds */
 		rate = (s->kstats.conns - e->last_conns) << 9;
@@ -133,30 +138,749 @@ static void estimation_timer(struct timer_list *t)
 		e->outbps += ((s64)rate - (s64)e->outbps) >> 2;
 		spin_unlock(&s->lock);
 	}
-	spin_unlock(&ipvs->est_lock);
+}
 
-skip:
-	mod_timer(&ipvs->est_timer, jiffies + 2*HZ);
+static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
+{
+	struct ip_vs_est_tick_data *td;
+	int cid;
+
+	rcu_read_lock();
+	td = rcu_dereference(kd->ticks[row]);
+	if (!td)
+		goto out;
+	for_each_set_bit(cid, td->present, IPVS_EST_TICK_CHAINS) {
+		if (kthread_should_stop())
+			break;
+		ip_vs_chain_estimation(&td->chains[cid]);
+		cond_resched_rcu();
+		td = rcu_dereference(kd->ticks[row]);
+		if (!td)
+			break;
+	}
+
+out:
+	rcu_read_unlock();
 }
 
-void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
+static int ip_vs_estimation_kthread(void *data)
 {
-	struct ip_vs_estimator *est = &stats->est;
+	struct ip_vs_est_kt_data *kd = data;
+	struct netns_ipvs *ipvs = kd->ipvs;
+	int row = kd->est_row;
+	unsigned long now;
+	int id = kd->id;
+	long gap;
+
+	if (id > 0) {
+		if (!ipvs->est_chain_max)
+			return 0;
+	} else {
+		if (!ipvs->est_chain_max) {
+			ipvs->est_calc_phase = 1;
+			/* commit est_calc_phase before reading est_genid */
+			smp_mb();
+		}
+
+		/* kthread 0 will handle the calc phase */
+		if (ipvs->est_calc_phase)
+			ip_vs_est_calc_phase(ipvs);
+	}
+
+	while (1) {
+		if (!id && !hlist_empty(&ipvs->est_temp_list))
+			ip_vs_est_drain_temp_list(ipvs);
+		set_current_state(TASK_IDLE);
+		if (kthread_should_stop())
+			break;
+
+		/* before estimation, check if we should sleep */
+		now = jiffies;
+		gap = kd->est_timer - now;
+		if (gap > 0) {
+			if (gap > IPVS_EST_TICK) {
+				kd->est_timer = now - IPVS_EST_TICK;
+				gap = IPVS_EST_TICK;
+			}
+			schedule_timeout(gap);
+		} else {
+			__set_current_state(TASK_RUNNING);
+			if (gap < -8 * IPVS_EST_TICK)
+				kd->est_timer = now;
+		}
+
+		if (sysctl_run_estimation(ipvs) && kd->tick_len[row])
+			ip_vs_tick_estimation(kd, row);
+
+		row++;
+		if (row >= IPVS_EST_NTICKS)
+			row = 0;
+		WRITE_ONCE(kd->est_row, row);
+		kd->est_timer += IPVS_EST_TICK;
+	}
+	__set_current_state(TASK_RUNNING);
+
+	return 0;
+}
+
+/* Schedule stop/start for kthread tasks */
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs)
+{
+	/* Ignore reloads before first service is added */
+	if (!ipvs->enable)
+		return;
+	/* Bump the kthread configuration genid */
+	atomic_inc(&ipvs->est_genid);
+	queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 0);
+}
+
+/* Start kthread task with current configuration */
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+			    struct ip_vs_est_kt_data *kd)
+{
+	unsigned long now;
+	int ret = 0;
+	long gap;
+
+	lockdep_assert_held(&ipvs->est_mutex);
+
+	if (kd->task)
+		goto out;
+	now = jiffies;
+	gap = kd->est_timer - now;
+	/* Sync est_timer if task is starting later */
+	if (abs(gap) > 4 * IPVS_EST_TICK)
+		kd->est_timer = now;
+	kd->task = kthread_create(ip_vs_estimation_kthread, kd, "ipvs-e:%d:%d",
+				  ipvs->gen, kd->id);
+	if (IS_ERR(kd->task)) {
+		ret = PTR_ERR(kd->task);
+		kd->task = NULL;
+		goto out;
+	}
+
+	pr_info("starting estimator thread %d...\n", kd->id);
+	wake_up_process(kd->task);
+
+out:
+	return ret;
+}
+
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd)
+{
+	if (kd->task) {
+		pr_info("stopping estimator thread %d...\n", kd->id);
+		kthread_stop(kd->task);
+		kd->task = NULL;
+	}
+}
+
+/* Apply parameters to kthread */
+static void ip_vs_est_set_params(struct netns_ipvs *ipvs,
+				 struct ip_vs_est_kt_data *kd)
+{
+	kd->chain_max = ipvs->est_chain_max;
+	/* We are using single chain on RCU preemption */
+	if (IPVS_EST_TICK_CHAINS == 1)
+		kd->chain_max *= IPVS_EST_CHAIN_FACTOR;
+	kd->tick_max = IPVS_EST_TICK_CHAINS * kd->chain_max;
+	kd->est_max_count = IPVS_EST_NTICKS * kd->tick_max;
+}
+
+/* Create and start estimation kthread in a free or new array slot */
+static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
+{
+	struct ip_vs_est_kt_data *kd = NULL;
+	int id = ipvs->est_kt_count;
+	int ret = -ENOMEM;
+	void *arr = NULL;
+	int i;
+
+	if ((unsigned long)ipvs->est_kt_count >= ipvs->est_max_threads &&
+	    ipvs->enable && ipvs->est_max_threads)
+		return -EINVAL;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	for (i = 0; i < id; i++) {
+		if (!ipvs->est_kt_arr[i])
+			break;
+	}
+	if (i >= id) {
+		arr = krealloc_array(ipvs->est_kt_arr, id + 1,
+				     sizeof(struct ip_vs_est_kt_data *),
+				     GFP_KERNEL);
+		if (!arr)
+			goto out;
+		ipvs->est_kt_arr = arr;
+	} else {
+		id = i;
+	}
+
+	kd = kzalloc(sizeof(*kd), GFP_KERNEL);
+	if (!kd)
+		goto out;
+	kd->ipvs = ipvs;
+	bitmap_fill(kd->avail, IPVS_EST_NTICKS);
+	kd->est_timer = jiffies;
+	kd->id = id;
+	ip_vs_est_set_params(ipvs, kd);
+
+	/* Pre-allocate stats used in calc phase */
+	if (!id && !kd->calc_stats) {
+		kd->calc_stats = ip_vs_stats_alloc();
+		if (!kd->calc_stats)
+			goto out;
+	}
+
+	/* Start kthread tasks only when services are present */
+	if (ipvs->enable) {
+		ret = ip_vs_est_kthread_start(ipvs, kd);
+		if (ret < 0)
+			goto out;
+	}
+
+	if (arr)
+		ipvs->est_kt_count++;
+	ipvs->est_kt_arr[id] = kd;
+	kd = NULL;
+	/* Use most recent kthread for new ests */
+	ipvs->est_add_ktid = id;
+	ret = 0;
+
+out:
+	mutex_unlock(&ipvs->est_mutex);
+	if (kd) {
+		ip_vs_stats_free(kd->calc_stats);
+		kfree(kd);
+	}
+
+	return ret;
+}
+
+/* Select ktid where to add new ests: available, unused or new slot */
+static void ip_vs_est_update_ktid(struct netns_ipvs *ipvs)
+{
+	int ktid, best = ipvs->est_kt_count;
+	struct ip_vs_est_kt_data *kd;
+
+	for (ktid = 0; ktid < ipvs->est_kt_count; ktid++) {
+		kd = ipvs->est_kt_arr[ktid];
+		if (kd) {
+			if (kd->est_count < kd->est_max_count) {
+				best = ktid;
+				break;
+			}
+		} else if (ktid < best) {
+			best = ktid;
+		}
+	}
+	ipvs->est_add_ktid = best;
+}
+
+/* Add estimator to current kthread (est_add_ktid) */
+static int ip_vs_enqueue_estimator(struct netns_ipvs *ipvs,
+				   struct ip_vs_estimator *est)
+{
+	struct ip_vs_est_kt_data *kd = NULL;
+	struct ip_vs_est_tick_data *td;
+	int ktid, row, crow, cid, ret;
+	int delay = est->ktrow;
+
+	if (ipvs->est_add_ktid < ipvs->est_kt_count) {
+		kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+		if (kd)
+			goto add_est;
+	}
+
+	ret = ip_vs_est_add_kthread(ipvs);
+	if (ret < 0)
+		goto out;
+	kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+
+add_est:
+	ktid = kd->id;
+	/* For small number of estimators prefer to use few ticks,
+	 * otherwise try to add into the last estimated row.
+	 * est_row and add_row point after the row we should use
+	 */
+	if (kd->est_count >= 2 * kd->tick_max || delay < IPVS_EST_NTICKS - 1)
+		crow = READ_ONCE(kd->est_row);
+	else
+		crow = kd->add_row;
+	crow += delay;
+	if (crow >= IPVS_EST_NTICKS)
+		crow -= IPVS_EST_NTICKS;
+	/* Assume initial delay ? */
+	if (delay >= IPVS_EST_NTICKS - 1) {
+		/* Preserve initial delay or decrease it if no space in tick */
+		row = crow;
+		if (crow < IPVS_EST_NTICKS - 1) {
+			crow++;
+			row = find_last_bit(kd->avail, crow);
+		}
+		if (row >= crow)
+			row = find_last_bit(kd->avail, IPVS_EST_NTICKS);
+	} else {
+		/* Preserve delay or increase it if no space in tick */
+		row = IPVS_EST_NTICKS;
+		if (crow > 0)
+			row = find_next_bit(kd->avail, IPVS_EST_NTICKS, crow);
+		if (row >= IPVS_EST_NTICKS)
+			row = find_first_bit(kd->avail, IPVS_EST_NTICKS);
+	}
+
+	td = rcu_dereference_protected(kd->ticks[row], 1);
+	if (!td) {
+		td = kzalloc(sizeof(*td), GFP_KERNEL);
+		if (!td) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		rcu_assign_pointer(kd->ticks[row], td);
+	}
+
+	cid = find_first_zero_bit(td->full, IPVS_EST_TICK_CHAINS);
+
+	kd->est_count++;
+	kd->tick_len[row]++;
+	if (!td->chain_len[cid])
+		__set_bit(cid, td->present);
+	td->chain_len[cid]++;
+	est->ktid = ktid;
+	est->ktrow = row;
+	est->ktcid = cid;
+	hlist_add_head_rcu(&est->list, &td->chains[cid]);
+
+	if (td->chain_len[cid] >= kd->chain_max) {
+		__set_bit(cid, td->full);
+		if (kd->tick_len[row] >= kd->tick_max)
+			__clear_bit(row, kd->avail);
+	}
+
+	/* Update est_add_ktid to point to first available/empty kt slot */
+	if (kd->est_count == kd->est_max_count)
+		ip_vs_est_update_ktid(ipvs);
+
+	ret = 0;
+
+out:
+	return ret;
+}
 
-	INIT_LIST_HEAD(&est->list);
+/* Start estimation for stats */
+int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
+{
+	struct ip_vs_estimator *est = &stats->est;
+	int ret;
+
+	if (!ipvs->est_max_threads && ipvs->enable)
+		ipvs->est_max_threads = 4 * num_possible_cpus();
+
+	est->ktid = -1;
+	est->ktrow = IPVS_EST_NTICKS - 1;	/* Initial delay */
+
+	/* We prefer this code to be short, kthread 0 will requeue the
+	 * estimator to available chain. If tasks are disabled, we
+	 * will not allocate much memory, just for kt 0.
+	 */
+	ret = 0;
+	if (!ipvs->est_kt_count || !ipvs->est_kt_arr[0])
+		ret = ip_vs_est_add_kthread(ipvs);
+	if (ret >= 0)
+		hlist_add_head(&est->list, &ipvs->est_temp_list);
+	else
+		INIT_HLIST_NODE(&est->list);
+	return ret;
+}
 
-	spin_lock_bh(&ipvs->est_lock);
-	list_add(&est->list, &ipvs->est_list);
-	spin_unlock_bh(&ipvs->est_lock);
+static void ip_vs_est_kthread_destroy(struct ip_vs_est_kt_data *kd)
+{
+	if (kd) {
+		if (kd->task) {
+			pr_info("stop unused estimator thread %d...\n", kd->id);
+			kthread_stop(kd->task);
+		}
+		ip_vs_stats_free(kd->calc_stats);
+		kfree(kd);
+	}
 }
 
+/* Unlink estimator from chain */
 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
 {
 	struct ip_vs_estimator *est = &stats->est;
+	struct ip_vs_est_tick_data *td;
+	struct ip_vs_est_kt_data *kd;
+	int ktid = est->ktid;
+	int row = est->ktrow;
+	int cid = est->ktcid;
+
+	/* Failed to add to chain ? */
+	if (hlist_unhashed(&est->list))
+		return;
+
+	/* On return, estimator can be freed, dequeue it now */
+
+	/* In est_temp_list ? */
+	if (ktid < 0) {
+		hlist_del(&est->list);
+		goto end_kt0;
+	}
+
+	hlist_del_rcu(&est->list);
+	kd = ipvs->est_kt_arr[ktid];
+	td = rcu_dereference_protected(kd->ticks[row], 1);
+	__clear_bit(cid, td->full);
+	td->chain_len[cid]--;
+	if (!td->chain_len[cid])
+		__clear_bit(cid, td->present);
+	kd->tick_len[row]--;
+	__set_bit(row, kd->avail);
+	if (!kd->tick_len[row]) {
+		RCU_INIT_POINTER(kd->ticks[row], NULL);
+		kfree_rcu(td);
+	}
+	kd->est_count--;
+	if (kd->est_count) {
+		/* This kt slot can become available just now, prefer it */
+		if (ktid < ipvs->est_add_ktid)
+			ipvs->est_add_ktid = ktid;
+		return;
+	}
 
-	spin_lock_bh(&ipvs->est_lock);
-	list_del(&est->list);
-	spin_unlock_bh(&ipvs->est_lock);
+	if (ktid > 0) {
+		mutex_lock(&ipvs->est_mutex);
+		ip_vs_est_kthread_destroy(kd);
+		ipvs->est_kt_arr[ktid] = NULL;
+		if (ktid == ipvs->est_kt_count - 1) {
+			ipvs->est_kt_count--;
+			while (ipvs->est_kt_count > 1 &&
+			       !ipvs->est_kt_arr[ipvs->est_kt_count - 1])
+				ipvs->est_kt_count--;
+		}
+		mutex_unlock(&ipvs->est_mutex);
+
+		/* This slot is now empty, prefer another available kt slot */
+		if (ktid == ipvs->est_add_ktid)
+			ip_vs_est_update_ktid(ipvs);
+	}
+
+end_kt0:
+	/* kt 0 is freed after all other kthreads and chains are empty */
+	if (ipvs->est_kt_count == 1 && hlist_empty(&ipvs->est_temp_list)) {
+		kd = ipvs->est_kt_arr[0];
+		if (!kd || !kd->est_count) {
+			mutex_lock(&ipvs->est_mutex);
+			if (kd) {
+				ip_vs_est_kthread_destroy(kd);
+				ipvs->est_kt_arr[0] = NULL;
+			}
+			ipvs->est_kt_count--;
+			mutex_unlock(&ipvs->est_mutex);
+			ipvs->est_add_ktid = 0;
+		}
+	}
+}
+
+/* Register all ests from est_temp_list to kthreads */
+static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs)
+{
+	struct ip_vs_estimator *est;
+
+	while (1) {
+		int max = 16;
+
+		mutex_lock(&__ip_vs_mutex);
+
+		while (max-- > 0) {
+			est = hlist_entry_safe(ipvs->est_temp_list.first,
+					       struct ip_vs_estimator, list);
+			if (est) {
+				if (kthread_should_stop())
+					goto unlock;
+				hlist_del_init(&est->list);
+				if (ip_vs_enqueue_estimator(ipvs, est) >= 0)
+					continue;
+				est->ktid = -1;
+				hlist_add_head(&est->list,
+					       &ipvs->est_temp_list);
+				/* Abort, some entries will not be estimated
+				 * until next attempt
+				 */
+			}
+			goto unlock;
+		}
+		mutex_unlock(&__ip_vs_mutex);
+		cond_resched();
+	}
+
+unlock:
+	mutex_unlock(&__ip_vs_mutex);
+}
+
+/* Calculate limits for all kthreads */
+static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
+{
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+	struct ip_vs_est_kt_data *kd;
+	struct hlist_head chain;
+	struct ip_vs_stats *s;
+	int cache_factor = 4;
+	int i, loops, ntest;
+	s32 min_est = 0;
+	ktime_t t1, t2;
+	s64 diff, val;
+	int max = 8;
+	int ret = 1;
+
+	INIT_HLIST_HEAD(&chain);
+	mutex_lock(&__ip_vs_mutex);
+	kd = ipvs->est_kt_arr[0];
+	mutex_unlock(&__ip_vs_mutex);
+	s = kd ? kd->calc_stats : NULL;
+	if (!s)
+		goto out;
+	hlist_add_head(&s->est.list, &chain);
+
+	loops = 1;
+	/* Get best result from many tests */
+	for (ntest = 0; ntest < 12; ntest++) {
+		if (!(ntest & 3)) {
+			wait_event_idle_timeout(wq, kthread_should_stop(),
+						HZ / 50);
+			if (!ipvs->enable || kthread_should_stop())
+				goto stop;
+		}
+
+		local_bh_disable();
+		rcu_read_lock();
+
+		/* Put stats in cache */
+		ip_vs_chain_estimation(&chain);
+
+		t1 = ktime_get();
+		for (i = loops * cache_factor; i > 0; i--)
+			ip_vs_chain_estimation(&chain);
+		t2 = ktime_get();
+
+		rcu_read_unlock();
+		local_bh_enable();
+
+		if (!ipvs->enable || kthread_should_stop())
+			goto stop;
+		cond_resched();
+
+		diff = ktime_to_ns(ktime_sub(t2, t1));
+		if (diff <= 1 * NSEC_PER_USEC) {
+			/* Do more loops on low resolution */
+			loops *= 2;
+			continue;
+		}
+		if (diff >= NSEC_PER_SEC)
+			continue;
+		val = diff;
+		do_div(val, loops);
+		if (!min_est || val < min_est) {
+			min_est = val;
+			/* goal: 95usec per chain */
+			val = 95 * NSEC_PER_USEC;
+			if (val >= min_est) {
+				do_div(val, min_est);
+				max = (int)val;
+			} else {
+				max = 1;
+			}
+		}
+	}
+
+out:
+	if (s)
+		hlist_del_init(&s->est.list);
+	*chain_max = max;
+	return ret;
+
+stop:
+	ret = 0;
+	goto out;
+}
+
+/* Calculate the parameters and apply them in context of kt #0
+ * ECP: est_calc_phase
+ * ECM: est_chain_max
+ * ECP	ECM	Insert Chain	enable	Description
+ * ---------------------------------------------------------------------------
+ * 0	0	est_temp_list	0	create kt #0 context
+ * 0	0	est_temp_list	0->1	service added, start kthread #0 task
+ * 0->1	0	est_temp_list	1	kt task #0 started, enters calc phase
+ * 1	0	est_temp_list	1	kt #0: determine est_chain_max,
+ *					stop tasks, move ests to est_temp_list
+ *					and free kd for kthreads 1..last
+ * 1->0	0->N	kt chains	1	ests can go to kthreads
+ * 0	N	kt chains	1	drain est_temp_list, create new kthread
+ *					contexts, start tasks, estimate
+ */
+static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
+{
+	int genid = atomic_read(&ipvs->est_genid);
+	struct ip_vs_est_tick_data *td;
+	struct ip_vs_est_kt_data *kd;
+	struct ip_vs_estimator *est;
+	struct ip_vs_stats *stats;
+	int id, row, cid, delay;
+	bool last, last_td;
+	int chain_max;
+	int step;
+
+	if (!ip_vs_est_calc_limits(ipvs, &chain_max))
+		return;
+
+	mutex_lock(&__ip_vs_mutex);
+
+	/* Stop all other tasks, so that we can immediately move the
+	 * estimators to est_temp_list without RCU grace period
+	 */
+	mutex_lock(&ipvs->est_mutex);
+	for (id = 1; id < ipvs->est_kt_count; id++) {
+		/* netns clean up started, abort */
+		if (!ipvs->enable)
+			goto unlock2;
+		kd = ipvs->est_kt_arr[id];
+		if (!kd)
+			continue;
+		ip_vs_est_kthread_stop(kd);
+	}
+	mutex_unlock(&ipvs->est_mutex);
+
+	/* Move all estimators to est_temp_list but carefully,
+	 * all estimators and kthread data can be released while
+	 * we reschedule. Even for kthread 0.
+	 */
+	step = 0;
+
+	/* Order entries in est_temp_list in ascending delay, so now
+	 * walk delay(desc), id(desc), cid(asc)
+	 */
+	delay = IPVS_EST_NTICKS;
+
+next_delay:
+	delay--;
+	if (delay < 0)
+		goto end_dequeue;
+
+last_kt:
+	/* Destroy contexts backwards */
+	id = ipvs->est_kt_count;
+
+next_kt:
+	if (!ipvs->enable || kthread_should_stop())
+		goto unlock;
+	id--;
+	if (id < 0)
+		goto next_delay;
+	kd = ipvs->est_kt_arr[id];
+	if (!kd)
+		goto next_kt;
+	/* kt 0 can exist with empty chains */
+	if (!id && kd->est_count <= 1)
+		goto next_delay;
+
+	row = kd->est_row + delay;
+	if (row >= IPVS_EST_NTICKS)
+		row -= IPVS_EST_NTICKS;
+	td = rcu_dereference_protected(kd->ticks[row], 1);
+	if (!td)
+		goto next_kt;
+
+	cid = 0;
+
+walk_chain:
+	if (kthread_should_stop())
+		goto unlock;
+	step++;
+	if (!(step & 63)) {
+		/* Give chance estimators to be added (to est_temp_list)
+		 * and deleted (releasing kthread contexts)
+		 */
+		mutex_unlock(&__ip_vs_mutex);
+		cond_resched();
+		mutex_lock(&__ip_vs_mutex);
+
+		/* Current kt released ? */
+		if (id >= ipvs->est_kt_count)
+			goto last_kt;
+		if (kd != ipvs->est_kt_arr[id])
+			goto next_kt;
+		/* Current td released ? */
+		if (td != rcu_dereference_protected(kd->ticks[row], 1))
+			goto next_kt;
+		/* No fatal changes on the current kd and td */
+	}
+	est = hlist_entry_safe(td->chains[cid].first, struct ip_vs_estimator,
+			       list);
+	if (!est) {
+		cid++;
+		if (cid >= IPVS_EST_TICK_CHAINS)
+			goto next_kt;
+		goto walk_chain;
+	}
+	/* We can cheat and increase est_count to protect kt 0 context
+	 * from release but we prefer to keep the last estimator
+	 */
+	last = kd->est_count <= 1;
+	/* Do not free kt #0 data */
+	if (!id && last)
+		goto next_delay;
+	last_td = kd->tick_len[row] <= 1;
+	stats = container_of(est, struct ip_vs_stats, est);
+	ip_vs_stop_estimator(ipvs, stats);
+	/* Tasks are stopped, move without RCU grace period */
+	est->ktid = -1;
+	est->ktrow = row - kd->est_row;
+	if (est->ktrow < 0)
+		est->ktrow += IPVS_EST_NTICKS;
+	hlist_add_head(&est->list, &ipvs->est_temp_list);
+	/* kd freed ? */
+	if (last)
+		goto next_kt;
+	/* td freed ? */
+	if (last_td)
+		goto next_kt;
+	goto walk_chain;
+
+end_dequeue:
+	/* All estimators removed while calculating ? */
+	if (!ipvs->est_kt_count)
+		goto unlock;
+	kd = ipvs->est_kt_arr[0];
+	if (!kd)
+		goto unlock;
+	kd->add_row = kd->est_row;
+	ipvs->est_chain_max = chain_max;
+	ip_vs_est_set_params(ipvs, kd);
+
+	pr_info("using max %d ests per chain, %d per kthread\n",
+		kd->chain_max, kd->est_max_count);
+
+	/* Try to keep tot_stats in kt0, enqueue it early */
+	if (ipvs->tot_stats && !hlist_unhashed(&ipvs->tot_stats->s.est.list) &&
+	    ipvs->tot_stats->s.est.ktid == -1) {
+		hlist_del(&ipvs->tot_stats->s.est.list);
+		hlist_add_head(&ipvs->tot_stats->s.est.list,
+			       &ipvs->est_temp_list);
+	}
+
+	mutex_lock(&ipvs->est_mutex);
+
+	/* We completed the calc phase, new calc phase not requested */
+	if (genid == atomic_read(&ipvs->est_genid))
+		ipvs->est_calc_phase = 0;
+
+unlock2:
+	mutex_unlock(&ipvs->est_mutex);
+
+unlock:
+	mutex_unlock(&__ip_vs_mutex);
 }
 
 void ip_vs_zero_estimator(struct ip_vs_stats *stats)
@@ -191,14 +915,25 @@ void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats)
 
 int __net_init ip_vs_estimator_net_init(struct netns_ipvs *ipvs)
 {
-	INIT_LIST_HEAD(&ipvs->est_list);
-	spin_lock_init(&ipvs->est_lock);
-	timer_setup(&ipvs->est_timer, estimation_timer, 0);
-	mod_timer(&ipvs->est_timer, jiffies + 2 * HZ);
+	INIT_HLIST_HEAD(&ipvs->est_temp_list);
+	ipvs->est_kt_arr = NULL;
+	ipvs->est_max_threads = 0;
+	ipvs->est_calc_phase = 0;
+	ipvs->est_chain_max = 0;
+	ipvs->est_kt_count = 0;
+	ipvs->est_add_ktid = 0;
+	atomic_set(&ipvs->est_genid, 0);
+	atomic_set(&ipvs->est_genid_done, 0);
+	__mutex_init(&ipvs->est_mutex, "ipvs->est_mutex", &__ipvs_est_key);
 	return 0;
 }
 
 void __net_exit ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs)
 {
-	del_timer_sync(&ipvs->est_timer);
+	int i;
+
+	for (i = 0; i < ipvs->est_kt_count; i++)
+		ip_vs_est_kthread_destroy(ipvs->est_kt_arr[i]);
+	kfree(ipvs->est_kt_arr);
+	mutex_destroy(&ipvs->est_mutex);
 }
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCHv6 5/7] ipvs: add est_cpulist and est_nice sysctl vars
  2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
                   ` (3 preceding siblings ...)
  2022-10-31 14:56 ` [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation Julian Anastasov
@ 2022-10-31 14:56 ` Julian Anastasov
  2022-11-21 16:29   ` Jiri Wiesner
  2022-10-31 14:56 ` [RFC PATCHv6 6/7] ipvs: run_estimation should control the kthread tasks Julian Anastasov
  2022-10-31 14:56 ` [RFC PATCHv6 7/7] ipvs: debug the tick time Julian Anastasov
  6 siblings, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2022-10-31 14:56 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Allow the kthreads for stats to be configured for
specific cpulist (isolation) and niceness (scheduling
priority).

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 Documentation/networking/ipvs-sysctl.rst |  20 ++++
 include/net/ip_vs.h                      |  55 +++++++++
 net/netfilter/ipvs/ip_vs_ctl.c           | 143 ++++++++++++++++++++++-
 net/netfilter/ipvs/ip_vs_est.c           |  11 +-
 4 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 387fda80f05f..1b778705d706 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -129,6 +129,26 @@ drop_packet - INTEGER
 	threshold. When the mode 3 is set, the always mode drop rate
 	is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
 
+est_cpulist - CPULIST
+	Allowed	CPUs for estimation kthreads
+
+	Syntax: standard cpulist format
+	empty list - stop kthread tasks and estimation
+	default - the system's housekeeping CPUs for kthreads
+
+	Example:
+	"all": all possible CPUs
+	"0-N": all possible CPUs, N denotes last CPU number
+	"0,1-N:1/2": first and all CPUs with odd number
+	"": empty list
+
+est_nice - INTEGER
+	default 0
+	Valid range: -20 (more favorable) .. 19 (less favorable)
+
+	Niceness value to use for the estimation kthreads (scheduling
+	priority)
+
 expire_nodest_conn - BOOLEAN
 	- 0 - disabled (default)
 	- not 0 - enabled
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 9c02840bc7c9..0a745b626cc8 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -29,6 +29,7 @@
 #include <net/netfilter/nf_conntrack.h>
 #endif
 #include <net/net_namespace.h>		/* Netw namespace */
+#include <linux/sched/isolation.h>
 
 #define IP_VS_HDR_INVERSE	1
 #define IP_VS_HDR_ICMP		2
@@ -365,6 +366,9 @@ struct ip_vs_cpu_stats {
 	struct u64_stats_sync   syncp;
 };
 
+/* Default nice for estimator kthreads */
+#define IPVS_EST_NICE		0
+
 /* IPVS statistics objects */
 struct ip_vs_estimator {
 	struct hlist_node	list;
@@ -995,6 +999,12 @@ struct netns_ipvs {
 	int			sysctl_schedule_icmp;
 	int			sysctl_ignore_tunneled;
 	int			sysctl_run_estimation;
+#ifdef CONFIG_SYSCTL
+	cpumask_var_t		sysctl_est_cpulist;	/* kthread cpumask */
+	int			est_cpulist_valid;	/* cpulist set */
+	int			sysctl_est_nice;	/* kthread nice */
+	int			est_stopped;		/* stop tasks */
+#endif
 
 	/* ip_vs_lblc */
 	int			sysctl_lblc_expiration;
@@ -1148,6 +1158,19 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_run_estimation;
 }
 
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+	if (ipvs->est_cpulist_valid)
+		return ipvs->sysctl_est_cpulist;
+	else
+		return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_est_nice;
+}
+
 #else
 
 static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -1245,6 +1268,16 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+	return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+	return IPVS_EST_NICE;
+}
+
 #endif
 
 /* IPVS core functions
@@ -1555,6 +1588,28 @@ int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
 			    struct ip_vs_est_kt_data *kd);
 void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
 
+static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+	ipvs->est_stopped = ipvs->est_cpulist_valid &&
+			    cpumask_empty(sysctl_est_cpulist(ipvs));
+#endif
+}
+
+static inline bool ip_vs_est_stopped(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+	return ipvs->est_stopped;
+#else
+	return false;
+#endif
+}
+
+static inline int ip_vs_est_max_threads(struct netns_ipvs *ipvs)
+{
+	return max(1U, 4 * cpumask_weight(sysctl_est_cpulist(ipvs)));
+}
+
 /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
 int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 		    struct ip_vs_protocol *pp, struct ip_vs_iphdr *iph);
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index c41a5392edc9..38df3ee655ed 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -263,7 +263,7 @@ static void est_reload_work_handler(struct work_struct *work)
 		/* New config ? Stop kthread tasks */
 		if (genid != genid_done)
 			ip_vs_est_kthread_stop(kd);
-		if (!kd->task) {
+		if (!kd->task && !ip_vs_est_stopped(ipvs)) {
 			/* Do not start kthreads above 0 in calc phase */
 			if ((!id || !ipvs->est_calc_phase) &&
 			    ip_vs_est_kthread_start(ipvs, kd) < 0)
@@ -1940,6 +1940,122 @@ proc_do_sync_ports(struct ctl_table *table, int write,
 	return rc;
 }
 
+static int ipvs_proc_est_cpumask_set(struct ctl_table *table, void *buffer)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	cpumask_var_t *valp = table->data;
+	cpumask_var_t newmask;
+	int ret;
+
+	if (!zalloc_cpumask_var(&newmask, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(buffer, newmask);
+	if (ret)
+		goto out;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	if (!ipvs->est_cpulist_valid) {
+		if (!zalloc_cpumask_var(valp, GFP_KERNEL)) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
+		ipvs->est_cpulist_valid = 1;
+	}
+	cpumask_and(newmask, newmask, &current->cpus_mask);
+	cpumask_copy(*valp, newmask);
+	/* est_max_threads may depend on cpulist size */
+	ipvs->est_max_threads = ip_vs_est_max_threads(ipvs);
+	ipvs->est_calc_phase = 1;
+	ip_vs_est_reload_start(ipvs);
+
+unlock:
+	mutex_unlock(&ipvs->est_mutex);
+
+out:
+	free_cpumask_var(newmask);
+	return ret;
+}
+
+static int ipvs_proc_est_cpumask_get(struct ctl_table *table, void *buffer,
+				     size_t size)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	cpumask_var_t *valp = table->data;
+	struct cpumask *mask;
+	int ret;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	if (ipvs->est_cpulist_valid)
+		mask = *valp;
+	else
+		mask = (struct cpumask *)housekeeping_cpumask(HK_TYPE_KTHREAD);
+	ret = scnprintf(buffer, size, "%*pbl\n", cpumask_pr_args(mask));
+
+	mutex_unlock(&ipvs->est_mutex);
+
+	return ret;
+}
+
+static int ipvs_proc_est_cpulist(struct ctl_table *table, int write,
+				 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+
+	/* Ignore both read and write(append) if *ppos not 0 */
+	if (*ppos || !*lenp) {
+		*lenp = 0;
+		return 0;
+	}
+	if (write) {
+		/* proc_sys_call_handler() appends terminator */
+		ret = ipvs_proc_est_cpumask_set(table, buffer);
+		if (ret >= 0)
+			*ppos += *lenp;
+	} else {
+		/* proc_sys_call_handler() allocates 1 byte for terminator */
+		ret = ipvs_proc_est_cpumask_get(table, buffer, *lenp + 1);
+		if (ret >= 0) {
+			*lenp = ret;
+			*ppos += *lenp;
+			ret = 0;
+		}
+	}
+	return ret;
+}
+
+static int ipvs_proc_est_nice(struct ctl_table *table, int write,
+			      void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	int *valp = table->data;
+	int val = *valp;
+	int ret;
+
+	struct ctl_table tmp_table = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+	if (write && ret >= 0) {
+		if (val < MIN_NICE || val > MAX_NICE) {
+			ret = -EINVAL;
+		} else {
+			mutex_lock(&ipvs->est_mutex);
+			if (*valp != val) {
+				*valp = val;
+				ip_vs_est_reload_start(ipvs);
+			}
+			mutex_unlock(&ipvs->est_mutex);
+		}
+	}
+	return ret;
+}
+
 /*
  *	IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
  *	Do not change order or insert new entries without
@@ -2116,6 +2232,18 @@ static struct ctl_table vs_vars[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "est_cpulist",
+		.maxlen		= NR_CPUS,	/* unused */
+		.mode		= 0644,
+		.proc_handler	= ipvs_proc_est_cpulist,
+	},
+	{
+		.procname	= "est_nice",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= ipvs_proc_est_nice,
+	},
 #ifdef CONFIG_IP_VS_DEBUG
 	{
 		.procname	= "debug_level",
@@ -4134,6 +4262,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
 	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
 			  expire_nodest_conn_handler);
+	ipvs->est_stopped = 0;
 
 	if (!net_eq(net, &init_net)) {
 		tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
@@ -4195,6 +4324,15 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
 	ipvs->sysctl_run_estimation = 1;
 	tbl[idx++].data = &ipvs->sysctl_run_estimation;
+
+	ipvs->est_cpulist_valid = 0;
+	tbl[idx].extra2 = ipvs;
+	tbl[idx++].data = &ipvs->sysctl_est_cpulist;
+
+	ipvs->sysctl_est_nice = IPVS_EST_NICE;
+	tbl[idx].extra2 = ipvs;
+	tbl[idx++].data = &ipvs->sysctl_est_nice;
+
 #ifdef CONFIG_IP_VS_DEBUG
 	/* Global sysctls must be ro in non-init netns */
 	if (!net_eq(net, &init_net))
@@ -4234,6 +4372,9 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
 	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s);
 
+	if (ipvs->est_cpulist_valid)
+		free_cpumask_var(ipvs->sysctl_est_cpulist);
+
 	if (!net_eq(net, &init_net))
 		kfree(ipvs->sysctl_tbl);
 }
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index c134f9466dc6..91eb0f1f8a00 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -57,6 +57,9 @@
   - kthread contexts are created and attached to array
   - the kthread tasks are started when first service is added, before that
     the total stats are not estimated
+  - when configuration (cpulist/nice) is changed, the tasks are restarted
+    by work (est_reload_work)
+  - kthread tasks are stopped while the cpulist is empty
   - the kthread context holds lists with estimators (chains) which are
     processed every 2 seconds
   - as estimators can be added dynamically and in bursts, we try to spread
@@ -229,6 +232,7 @@ void ip_vs_est_reload_start(struct netns_ipvs *ipvs)
 	/* Ignore reloads before first service is added */
 	if (!ipvs->enable)
 		return;
+	ip_vs_est_stopped_recalc(ipvs);
 	/* Bump the kthread configuration genid */
 	atomic_inc(&ipvs->est_genid);
 	queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 0);
@@ -259,6 +263,9 @@ int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
 		goto out;
 	}
 
+	set_user_nice(kd->task, sysctl_est_nice(ipvs));
+	set_cpus_allowed_ptr(kd->task, sysctl_est_cpulist(ipvs));
+
 	pr_info("starting estimator thread %d...\n", kd->id);
 	wake_up_process(kd->task);
 
@@ -334,7 +341,7 @@ static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
 	}
 
 	/* Start kthread tasks only when services are present */
-	if (ipvs->enable) {
+	if (ipvs->enable && !ip_vs_est_stopped(ipvs)) {
 		ret = ip_vs_est_kthread_start(ipvs, kd);
 		if (ret < 0)
 			goto out;
@@ -475,7 +482,7 @@ int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
 	int ret;
 
 	if (!ipvs->est_max_threads && ipvs->enable)
-		ipvs->est_max_threads = 4 * num_possible_cpus();
+		ipvs->est_max_threads = ip_vs_est_max_threads(ipvs);
 
 	est->ktid = -1;
 	est->ktrow = IPVS_EST_NTICKS - 1;	/* Initial delay */
-- 
2.38.1
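
For anyone trying this out, a minimal usage sketch of the two new knobs
(the paths follow the /proc/sys/net/ipv4/vs/ location of the existing
IPVS sysctls; the values are only examples):

	# example only: pin the estimator kthreads to CPUs 8-15 and
	# lower their scheduling priority to nice 10
	echo 8-15 > /proc/sys/net/ipv4/vs/est_cpulist
	echo 10 > /proc/sys/net/ipv4/vs/est_nice

As documented above, writing an empty list to est_cpulist stops the
kthread tasks and the estimation until a non-empty cpulist is written
again.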



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCHv6 6/7] ipvs: run_estimation should control the kthread tasks
  2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
                   ` (4 preceding siblings ...)
  2022-10-31 14:56 ` [RFC PATCHv6 5/7] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
@ 2022-10-31 14:56 ` Julian Anastasov
  2022-11-21 16:32   ` Jiri Wiesner
  2022-10-31 14:56 ` [RFC PATCHv6 7/7] ipvs: debug the tick time Julian Anastasov
  6 siblings, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2022-10-31 14:56 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Change the run_estimation flag to start/stop the kthread tasks.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 Documentation/networking/ipvs-sysctl.rst |  4 ++--
 include/net/ip_vs.h                      |  6 +++--
 net/netfilter/ipvs/ip_vs_ctl.c           | 29 +++++++++++++++++++++++-
 net/netfilter/ipvs/ip_vs_est.c           |  2 +-
 4 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 1b778705d706..3fb5fa142eef 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -324,8 +324,8 @@ run_estimation - BOOLEAN
 	0 - disabled
 	not 0 - enabled (default)
 
-	If disabled, the estimation will be stop, and you can't see
-	any update on speed estimation data.
+	If disabled, the estimation will be suspended and kthread tasks
+	stopped.
 
 	You can always re-enable estimation by setting this value to 1.
 	But be careful, the first estimation after re-enable is not
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 0a745b626cc8..1afb7ec8d537 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1591,8 +1591,10 @@ void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
 static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
 {
 #ifdef CONFIG_SYSCTL
-	ipvs->est_stopped = ipvs->est_cpulist_valid &&
-			    cpumask_empty(sysctl_est_cpulist(ipvs));
+	/* Stop tasks while cpulist is empty or if disabled with flag */
+	ipvs->est_stopped = !sysctl_run_estimation(ipvs) ||
+			    (ipvs->est_cpulist_valid &&
+			     cpumask_empty(sysctl_est_cpulist(ipvs)));
 #endif
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 38df3ee655ed..c9f598505642 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2056,6 +2056,32 @@ static int ipvs_proc_est_nice(struct ctl_table *table, int write,
 	return ret;
 }
 
+static int ipvs_proc_run_estimation(struct ctl_table *table, int write,
+				    void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	int *valp = table->data;
+	int val = *valp;
+	int ret;
+
+	struct ctl_table tmp_table = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+	if (write && ret >= 0) {
+		mutex_lock(&ipvs->est_mutex);
+		if (*valp != val) {
+			*valp = val;
+			ip_vs_est_reload_start(ipvs);
+		}
+		mutex_unlock(&ipvs->est_mutex);
+	}
+	return ret;
+}
+
 /*
  *	IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
  *	Do not change order or insert new entries without
@@ -2230,7 +2256,7 @@ static struct ctl_table vs_vars[] = {
 		.procname	= "run_estimation",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= ipvs_proc_run_estimation,
 	},
 	{
 		.procname	= "est_cpulist",
@@ -4323,6 +4349,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
 	ipvs->sysctl_run_estimation = 1;
+	tbl[idx].extra2 = ipvs;
 	tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
 	ipvs->est_cpulist_valid = 0;
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 91eb0f1f8a00..526100976d59 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -212,7 +212,7 @@ static int ip_vs_estimation_kthread(void *data)
 				kd->est_timer = now;
 		}
 
-		if (sysctl_run_estimation(ipvs) && kd->tick_len[row])
+		if (kd->tick_len[row])
 			ip_vs_tick_estimation(kd, row);
 
 		row++;
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCHv6 7/7] ipvs: debug the tick time
  2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
                   ` (5 preceding siblings ...)
  2022-10-31 14:56 ` [RFC PATCHv6 6/7] ipvs: run_estimation should control the kthread tasks Julian Anastasov
@ 2022-10-31 14:56 ` Julian Anastasov
  6 siblings, 0 replies; 24+ messages in thread
From: Julian Anastasov @ 2022-10-31 14:56 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

Just for testing, print the tick time every minute.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/netfilter/ipvs/ip_vs_est.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 526100976d59..4489a3dbad1e 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -147,7 +147,14 @@ static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
 {
 	struct ip_vs_est_tick_data *td;
 	int cid;
-
+	u64 ns = 0;
+	static int used_row = -1;
+	static int pass;
+
+	if (used_row < 0)
+		used_row = row;
+	if (row == used_row && !kd->id && !(pass & 31))
+		ns = ktime_get_ns();
 	rcu_read_lock();
 	td = rcu_dereference(kd->ticks[row]);
 	if (!td)
@@ -164,6 +171,16 @@ static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
 
 out:
 	rcu_read_unlock();
+	if (row == used_row && !kd->id && !(pass++ & 31)) {
+		static int ncpu;
+
+		ns = ktime_get_ns() - ns;
+		if (!ncpu)
+			ncpu = num_possible_cpus();
+		pr_info("tick time: %lluns for %d CPUs, %d ests, %d chains, chain_max=%d\n",
+			(unsigned long long)ns, ncpu, kd->tick_len[row],
+			IPVS_EST_TICK_CHAINS, kd->chain_max);
+	}
 }
 
 static int ip_vs_estimation_kthread(void *data)
@@ -637,7 +654,7 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
 	int i, loops, ntest;
 	s32 min_est = 0;
 	ktime_t t1, t2;
-	s64 diff, val;
+	s64 diff = 0, val;
 	int max = 8;
 	int ret = 1;
 
@@ -702,6 +719,8 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
 	}
 
 out:
+	pr_info("calc: chain_max=%d, single est=%dns, diff=%d, loops=%d, ntest=%d\n",
+		max, min_est, (int)diff, loops, ntest);
 	if (s)
 		hlist_del_init(&s->est.list);
 	*chain_max = max;
@@ -737,6 +756,7 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
 	int id, row, cid, delay;
 	bool last, last_td;
 	int chain_max;
+	u64 ns = 0;
 	int step;
 
 	if (!ip_vs_est_calc_limits(ipvs, &chain_max))
@@ -770,6 +790,8 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
 	 */
 	delay = IPVS_EST_NTICKS;
 
+	ns = ktime_get_ns();
+
 next_delay:
 	delay--;
 	if (delay < 0)
@@ -856,6 +878,8 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
 	goto walk_chain;
 
 end_dequeue:
+	ns = ktime_get_ns() - ns;
+	pr_info("dequeue: %lluns\n", (unsigned long long)ns);
 	/* All estimators removed while calculating ? */
 	if (!ipvs->est_kt_count)
 		goto unlock;
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation
  2022-10-31 14:56 ` [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation Julian Anastasov
@ 2022-11-10 15:39   ` Jiri Wiesner
  2022-11-10 20:16     ` Julian Anastasov
  2022-11-21 16:05   ` Jiri Wiesner
  1 sibling, 1 reply; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-10 15:39 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 31, 2022 at 04:56:44PM +0200, Julian Anastasov wrote:
> +/* Calculate limits for all kthreads */
> +static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
> +{
> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
> +	struct ip_vs_est_kt_data *kd;
> +	struct hlist_head chain;
> +	struct ip_vs_stats *s;
> +	int cache_factor = 4;
> +	int i, loops, ntest;
> +	s32 min_est = 0;
> +	ktime_t t1, t2;
> +	s64 diff, val;
> +	int max = 8;
> +	int ret = 1;
> +
> +	INIT_HLIST_HEAD(&chain);
> +	mutex_lock(&__ip_vs_mutex);
> +	kd = ipvs->est_kt_arr[0];
> +	mutex_unlock(&__ip_vs_mutex);
> +	s = kd ? kd->calc_stats : NULL;
> +	if (!s)
> +		goto out;
> +	hlist_add_head(&s->est.list, &chain);
> +
> +	loops = 1;
> +	/* Get best result from many tests */
> +	for (ntest = 0; ntest < 12; ntest++) {
> +		if (!(ntest & 3)) {
> +			wait_event_idle_timeout(wq, kthread_should_stop(),
> +						HZ / 50);
> +			if (!ipvs->enable || kthread_should_stop())
> +				goto stop;
> +		}

I was testing the stability of chain_max:
Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
64 CPUs, 2 NUMA nodes
> while :; do ipvsadm -A -t 10.10.10.1:2000; sleep 3; ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs; sleep 10m; done
> # dmesg | awk '/calc: chain_max/{m = $(NF - 5); sub("chain_max=", "", m); sub(",", "", m); m = int(m / 10) * 10; hist[m]++} END {PROCINFO["sorted_in"] = "@ind_num_asc"; for (m in hist) printf "%5d %5d\n", m, hist[m]}'
>    30    90
>    40     2
Chain_max was often 30-something and never dropped below 30, unlike the earlier runs with only 3 tests, where values below 30 were observed.

AMD EPYC 7601 32-Core Processor
128 CPUs, 8 NUMA nodes
Zen 1 machines such as this one have a large number of NUMA nodes due to restrictions in the CPU architecture. First, tests with different governors:
> cpupower frequency-set -g ondemand
> [  653.441325] IPVS: starting estimator thread 0...
> [  653.514918] IPVS: calc: chain_max=8, single est=11171ns, diff=11301, loops=1, ntest=12
> [  653.523580] IPVS: dequeue: 892ns
> [  653.527528] IPVS: using max 384 ests per chain, 19200 per kthread
> [  655.349916] IPVS: tick time: 3059313ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> [  685.230016] IPVS: starting estimator thread 1...
> [  717.110852] IPVS: starting estimator thread 2...
> [  719.349755] IPVS: tick time: 2896668ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> [  750.349974] IPVS: starting estimator thread 3...
> [  783.349841] IPVS: tick time: 2942604ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> [  847.349811] IPVS: tick time: 2930872ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  74902 root      20   0       0      0      0 I 7.591 0.000   0:13.95 ipvs-e:0:1
>  94104 root      20   0       0      0      0 I 7.591 0.000   0:11.59 ipvs-e:0:2
>  55669 root      20   0       0      0      0 I 6.931 0.000   0:15.75 ipvs-e:0:0
> 113311 root      20   0       0      0      0 I 0.990 0.000   0:01.31 ipvs-e:0:3
> cpupower frequency-set -g performance
> [ 1448.118857] IPVS: starting estimator thread 0...
> [ 1448.194882] IPVS: calc: chain_max=22, single est=4138ns, diff=4298, loops=1, ntest=12
> [ 1448.203435] IPVS: dequeue: 340ns
> [ 1448.207373] IPVS: using max 1056 ests per chain, 52800 per kthread
> [ 1450.029581] IPVS: tick time: 2727370ns for 128 CPUs, 518 ests, 1 chains, chain_max=1056
> [ 1510.792734] IPVS: starting estimator thread 1...
> [ 1514.032300] IPVS: tick time: 5436826ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> [ 1578.032593] IPVS: tick time: 5691875ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  42514 root      20   0       0      0      0 I 14.24 0.000   0:14.96 ipvs-e:0:0
>  95356 root      20   0       0      0      0 I 1.987 0.000   0:01.34 ipvs-e:0:1
While having the services loaded, I switched to the ondemand governor:
> [ 1706.032577] IPVS: tick time: 5666868ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> [ 1770.032534] IPVS: tick time: 5638505ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  42514 root      20   0       0      0      0 I 18.15 0.000   0:35.75 ipvs-e:0:0
>  95356 root      20   0       0      0      0 I 2.310 0.000   0:04.05 ipvs-e:0:1
While having the services loaded, I kept the ondemand governor and saturated all logical CPUs on the machine:
> [ 1834.033988] IPVS: tick time: 7129383ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> [ 1898.038151] IPVS: tick time: 7281418ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 102581 root      20   0 1047636 262304   1580 R 12791 0.100 161:03.94 a.out
>  42514 root      20   0       0      0      0 I 17.76 0.000   1:06.88 ipvs-e:0:0
>  95356 root      20   0       0      0      0 I 2.303 0.000   0:08.33 ipvs-e:0:1
As for the stability of chain_max:
> while :; do ipvsadm -A -t 10.10.10.1:2000; sleep 3; ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs; sleep 2m; done
> dmesg | awk '/calc: chain_max/{m = $(NF - 5); sub("chain_max=", "", m); sub(",", "", m); m = int(m / 2) * 2; hist[m]++} END {PROCINFO["sorted_in"] = "@ind_num_asc"; for (m in hist) printf "%5d %5d\n", m, hist[m]}'
>     8    22
>    24     1

Basically, the chain_max calculation under governors that ramp up CPU frequency more slowly (ondemand on AMD or powersave for intel_pstate) is more stable than before on both AMD and Intel. We know from previous results that even ARM with multiple NUMA nodes is not a complete disaster. Switching CPU frequency governors, including the unfavourable switches from performance to ondemand, does not saturate CPUs. When it comes to CPU frequency governors, people tend to use either ondemand (or powersave for intel_pstate) or performance consistently - switches between governors can be expected to be rare in production.
I will need to find time to read through the latest version of the patch set.

> +
> +		local_bh_disable();
> +		rcu_read_lock();
> +
> +		/* Put stats in cache */
> +		ip_vs_chain_estimation(&chain);
> +
> +		t1 = ktime_get();
> +		for (i = loops * cache_factor; i > 0; i--)
> +			ip_vs_chain_estimation(&chain);
> +		t2 = ktime_get();
> +
> +		rcu_read_unlock();
> +		local_bh_enable();
> +
> +		if (!ipvs->enable || kthread_should_stop())
> +			goto stop;
> +		cond_resched();
> +
> +		diff = ktime_to_ns(ktime_sub(t2, t1));
> +		if (diff <= 1 * NSEC_PER_USEC) {
> +			/* Do more loops on low resolution */
> +			loops *= 2;
> +			continue;
> +		}
> +		if (diff >= NSEC_PER_SEC)
> +			continue;
> +		val = diff;
> +		do_div(val, loops);
> +		if (!min_est || val < min_est) {
> +			min_est = val;
> +			/* goal: 95usec per chain */
> +			val = 95 * NSEC_PER_USEC;
> +			if (val >= min_est) {
> +				do_div(val, min_est);
> +				max = (int)val;
> +			} else {
> +				max = 1;
> +			}
> +		}
> +	}
> +
> +out:
> +	if (s)
> +		hlist_del_init(&s->est.list);
> +	*chain_max = max;
> +	return ret;
> +
> +stop:
> +	ret = 0;
> +	goto out;
> +}

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation
  2022-11-10 15:39   ` Jiri Wiesner
@ 2022-11-10 20:16     ` Julian Anastasov
  2022-11-11 17:21       ` Jiri Wiesner
  0 siblings, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2022-11-10 20:16 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Thu, 10 Nov 2022, Jiri Wiesner wrote:

> On Mon, Oct 31, 2022 at 04:56:44PM +0200, Julian Anastasov wrote:
> > +
> > +	loops = 1;
> > +	/* Get best result from many tests */
> > +	for (ntest = 0; ntest < 12; ntest++) {
> > +		if (!(ntest & 3)) {
> > +			wait_event_idle_timeout(wq, kthread_should_stop(),
> > +						HZ / 50);
> > +			if (!ipvs->enable || kthread_should_stop())
> > +				goto stop;
> > +		}
> 
> I was testing the stability of chain_max:
> Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
> 64 CPUs, 2 NUMA nodes
> > while :; do ipvsadm -A -t 10.10.10.1:2000; sleep 3; ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs; sleep 10m; done
> > # dmesg | awk '/calc: chain_max/{m = $(NF - 5); sub("chain_max=", "", m); sub(",", "", m); m = int(m / 10) * 10; hist[m]++} END {PROCINFO["sorted_in"] = "@ind_num_asc"; for (m in hist) printf "%5d %5d\n", m, hist[m]}'
> >    30    90
> >    40     2
> Chain_max was often 30-something and never dropped below 30, unlike the earlier runs with only 3 tests, where values below 30 were observed.

	So, 12 tests and three 20ms gaps eliminate any cpufreq
issues in most of the cases and we no longer see a small
chain_max value.

> AMD EPYC 7601 32-Core Processor
> 128 CPUs, 8 NUMA nodes
> Zen 1 machines such as this one have a large number of NUMA nodes due to restrictions in the CPU architecture. First, tests with different governors:
> > cpupower frequency-set -g ondemand
> > [  653.441325] IPVS: starting estimator thread 0...
> > [  653.514918] IPVS: calc: chain_max=8, single est=11171ns, diff=11301, loops=1, ntest=12
> > [  653.523580] IPVS: dequeue: 892ns
> > [  653.527528] IPVS: using max 384 ests per chain, 19200 per kthread
> > [  655.349916] IPVS: tick time: 3059313ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> > [  685.230016] IPVS: starting estimator thread 1...
> > [  717.110852] IPVS: starting estimator thread 2...
> > [  719.349755] IPVS: tick time: 2896668ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> > [  750.349974] IPVS: starting estimator thread 3...
> > [  783.349841] IPVS: tick time: 2942604ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> > [  847.349811] IPVS: tick time: 2930872ns for 128 CPUs, 384 ests, 1 chains, chain_max=384

	Looks like cache_factor of 4 is good both to
ondemand which prefers cache_factor 3 (2.9->4ms) and performance
which prefers cache_factor 5 (5.6->4.3ms):

gov/cache_factor	chain_max	tick time (goal 4.8ms)
ondemand/4		8		2.9ms
ondemand/3		11		4ms
performance/4		22		5.6ms
performance/5		17		4.3ms

> >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >  74902 root      20   0       0      0      0 I 7.591 0.000   0:13.95 ipvs-e:0:1
> >  94104 root      20   0       0      0      0 I 7.591 0.000   0:11.59 ipvs-e:0:2
> >  55669 root      20   0       0      0      0 I 6.931 0.000   0:15.75 ipvs-e:0:0
> > 113311 root      20   0       0      0      0 I 0.990 0.000   0:01.31 ipvs-e:0:3
> > cpupower frequency-set -g performance
> > [ 1448.118857] IPVS: starting estimator thread 0...
> > [ 1448.194882] IPVS: calc: chain_max=22, single est=4138ns, diff=4298, loops=1, ntest=12
> > [ 1448.203435] IPVS: dequeue: 340ns
> > [ 1448.207373] IPVS: using max 1056 ests per chain, 52800 per kthread
> > [ 1450.029581] IPVS: tick time: 2727370ns for 128 CPUs, 518 ests, 1 chains, chain_max=1056
> > [ 1510.792734] IPVS: starting estimator thread 1...
> > [ 1514.032300] IPVS: tick time: 5436826ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056

	performance takes 5.4-5.6ms with chain_max 22

> > [ 1578.032593] IPVS: tick time: 5691875ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >  42514 root      20   0       0      0      0 I 14.24 0.000   0:14.96 ipvs-e:0:0
> >  95356 root      20   0       0      0      0 I 1.987 0.000   0:01.34 ipvs-e:0:1
> While having the services loaded, I switched to the ondemand governor:
> > [ 1706.032577] IPVS: tick time: 5666868ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> > [ 1770.032534] IPVS: tick time: 5638505ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056

	Hm, the ondemand governor takes 5.6ms just like
the above performance result? This is probably still
performance mode?

> >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >  42514 root      20   0       0      0      0 I 18.15 0.000   0:35.75 ipvs-e:0:0
> >  95356 root      20   0       0      0      0 I 2.310 0.000   0:04.05 ipvs-e:0:1
> While having the services loaded, I kept the ondemand governor and saturated all logical CPUs on the machine:
> > [ 1834.033988] IPVS: tick time: 7129383ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> > [ 1898.038151] IPVS: tick time: 7281418ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056

	Or ondemand takes 7.1ms with performance's chain_max=22,
which is more realistic if we do 2.9ms * 22/8 = 7.9ms.

> >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > 102581 root      20   0 1047636 262304   1580 R 12791 0.100 161:03.94 a.out
> >  42514 root      20   0       0      0      0 I 17.76 0.000   1:06.88 ipvs-e:0:0
> >  95356 root      20   0       0      0      0 I 2.303 0.000   0:08.33 ipvs-e:0:1

	If the kthreads are isolated on some unused CPUs, they
have space to go from the desired 12.5% to 100% :) 17% is still
not so scary :) The scheduler can probably allow even 100%
usage by moving the kthreads between idle CPUs.

> As for the stability of chain_max:
> > while :; do ipvsadm -A -t 10.10.10.1:2000; sleep 3; ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs; sleep 2m; done
> > dmesg | awk '/calc: chain_max/{m = $(NF - 5); sub("chain_max=", "", m); sub(",", "", m); m = int(m / 2) * 2; hist[m]++} END {PROCINFO["sorted_in"] = "@ind_num_asc"; for (m in hist) printf "%5d %5d\n", m, hist[m]}'
> >     8    22
> >    24     1
> 
> Basically, the chain_max calculation under governors that ramp up CPU frequency more slowly (ondemand on AMD or powersave for intel_pstate) is more stable than before on both AMD and Intel. We know from previous results that even ARM with multiple NUMA nodes is not a complete disaster. Switching CPU frequency governors, including the unfavourable switches from performance to ondemand, does not saturate CPUs. When it comes to CPU frequency governors, people tend to use either ondemand (or powersave for intel_pstate) or performance consistently - switches between governors can be expected to be rare in production.
> I will need to find time to read through the latest version of the patch set.

	OK. Thank you for testing the different cases!
Let me know if any changes are needed before releasing
the patchset. We can even include some testing results
in the commit messages.

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation
  2022-11-10 20:16     ` Julian Anastasov
@ 2022-11-11 17:21       ` Jiri Wiesner
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-11 17:21 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Thu, Nov 10, 2022 at 10:16:24PM +0200, Julian Anastasov wrote:
> > AMD EPYC 7601 32-Core Processor
> > 128 CPUs, 8 NUMA nodes
> > Zen 1 machines such as this one have a large number of NUMA nodes due to restrictions in the CPU architecture. First, tests with different governors:
> > > cpupower frequency-set -g ondemand
> > > [  653.441325] IPVS: starting estimator thread 0...
> > > [  653.514918] IPVS: calc: chain_max=8, single est=11171ns, diff=11301, loops=1, ntest=12
> > > [  653.523580] IPVS: dequeue: 892ns
> > > [  653.527528] IPVS: using max 384 ests per chain, 19200 per kthread
> > > [  655.349916] IPVS: tick time: 3059313ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> > > [  685.230016] IPVS: starting estimator thread 1...
> > > [  717.110852] IPVS: starting estimator thread 2...
> > > [  719.349755] IPVS: tick time: 2896668ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> > > [  750.349974] IPVS: starting estimator thread 3...
> > > [  783.349841] IPVS: tick time: 2942604ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> > > [  847.349811] IPVS: tick time: 2930872ns for 128 CPUs, 384 ests, 1 chains, chain_max=384
> 
> 	Looks like cache_factor of 4 is good both to
> ondemand which prefers cache_factor 3 (2.9->4ms) and performance
> which prefers cache_factor 5 (5.6->4.3ms):
> 
> gov/cache_factor	chain_max	tick time (goal 4.8ms)
> ondemand/4		8		2.9ms
> ondemand/3		11		4ms
> performance/4		22		5.6ms
> performance/5		17		4.3ms

Yes, a cache factor of 4 happens to be a good compromise on this particular Zen 1 machine.

> > > [ 1578.032593] IPVS: tick time: 5691875ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> > >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > >  42514 root      20   0       0      0      0 I 14.24 0.000   0:14.96 ipvs-e:0:0
> > >  95356 root      20   0       0      0      0 I 1.987 0.000   0:01.34 ipvs-e:0:1
> > While having the services loaded, I switched to the ondemand governor:
> > > [ 1706.032577] IPVS: tick time: 5666868ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> > > [ 1770.032534] IPVS: tick time: 5638505ns for 128 CPUs, 1056 ests, 1 chains, chain_max=1056
> 
> 	Hm, ondemand governor takes 5.6ms just like
> the above performance result? This is probabllly still
> performance mode?

I am not sure if I copied the right messages from the log. Probably not.

> > Basically, the chain_max calculation under governors that ramp up CPU frequency more slowly (ondemand on AMD or powersave for intel_pstate) is more stable than before on both AMD and Intel. We know from previous results that even ARM with multiple NUMA nodes is not a complete disaster. Switching CPU frequency governors, including the unfavourable switches from performance to ondemand, does not saturate CPUs. When it comes to CPU frequency governors, people tend to use either ondemand (or powersave for intel_pstate) or performance consistently - switches between governors can be expected to be rare in production.
> > I will need to find time to read through the latest version of the patch set.
> 
> 	OK. Thank you for testing the different cases!
> Let me know if any changes are needed before releasing
> the patchset. We can even include some testing results
> in the commit messages.

Absolutely.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 1/7] ipvs: add rcu protection to stats
  2022-10-31 14:56 ` [RFC PATCHv6 1/7] ipvs: add rcu protection to stats Julian Anastasov
@ 2022-11-12  7:51   ` Jiri Wiesner
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-12  7:51 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 31, 2022 at 04:56:41PM +0200, Julian Anastasov wrote:
> In preparation to using RCU locking for the list
> with estimators, make sure the struct ip_vs_stats
> are released after RCU grace period by using RCU
> callbacks. This affects ipvs->tot_stats where we
> can not use RCU callbacks for ipvs, so we use
> allocated struct ip_vs_stats_rcu. For services
> and dests we force RCU callbacks for all cases.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

Reviewed-by: Jiri Wiesner <jwiesner@suse.de>

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 2/7] ipvs: use common functions for stats allocation
  2022-10-31 14:56 ` [RFC PATCHv6 2/7] ipvs: use common functions for stats allocation Julian Anastasov
@ 2022-11-12  8:22   ` Jiri Wiesner
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-12  8:22 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 31, 2022 at 04:56:42PM +0200, Julian Anastasov wrote:
> Move alloc_percpu/free_percpu logic in new functions
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

Reviewed-by: Jiri Wiesner <jwiesner@suse.de>

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-10-31 14:56 ` [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters Julian Anastasov
@ 2022-11-12  9:00   ` Jiri Wiesner
  2022-11-12  9:09     ` Jiri Wiesner
  2022-11-19  7:46   ` Jiri Wiesner
  1 sibling, 1 reply; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-12  9:00 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> Use the provided u64_stats_t type to avoid
> load/store tearing.
> 
> Fixes: 316580b69d0a ("u64_stats: provide u64_stats_t type")
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
>  include/net/ip_vs.h             | 10 +++++-----
>  net/netfilter/ipvs/ip_vs_core.c | 30 +++++++++++++++---------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 10 +++++-----
>  net/netfilter/ipvs/ip_vs_est.c  | 20 ++++++++++----------
>  4 files changed, 35 insertions(+), 35 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index e5582c01a4a3..a4d44138c2a8 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -351,11 +351,11 @@ struct ip_vs_seq {
>  
>  /* counters per cpu */
>  struct ip_vs_counters {
> -	__u64		conns;		/* connections scheduled */
> -	__u64		inpkts;		/* incoming packets */
> -	__u64		outpkts;	/* outgoing packets */
> -	__u64		inbytes;	/* incoming bytes */
> -	__u64		outbytes;	/* outgoing bytes */
> +	u64_stats_t	conns;		/* connections scheduled */
> +	u64_stats_t	inpkts;		/* incoming packets */
> +	u64_stats_t	outpkts;	/* outgoing packets */
> +	u64_stats_t	inbytes;	/* incoming bytes */
> +	u64_stats_t	outbytes;	/* outgoing bytes */
>  };
>  /* Stats per cpu */
>  struct ip_vs_cpu_stats {

Converting the per-cpu stats to u64_stats_t means that the compiler can no longer fold the load and the addition into a single add-from-memory instruction on x86_64. Previously, the summation of the per-cpu counters in ip_vs_chain_estimation() looked like:
15b65:  add    (%rsi),%r14
15b68:  add    0x8(%rsi),%r15
15b6c:  add    0x10(%rsi),%r13
15b70:  add    0x18(%rsi),%r12
15b74:  add    0x20(%rsi),%rbp
The u64_stats_read() calls in ip_vs_chain_estimation() turned it into:
159d5:  mov    (%rcx),%r11
159d8:  mov    0x8(%rcx),%r10
159dc:  mov    0x10(%rcx),%r9
159e0:  mov    0x18(%rcx),%rdi
159e4:  mov    0x20(%rcx),%rdx
159e8:  add    %r11,%r14
159eb:  add    %r10,%r13
159ee:  add    %r9,%r12
159f1:  add    %rdi,%rbp
159f4:  add    %rdx,%rbx
I guess that is not a big deal because the mov should be the instruction taking the most time on account of accessing per cpu regions of other CPUs. The add will be fast.
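
The effect can be reproduced with a stand-alone C sketch (userspace only,
not the kernel code; whether the plain load is really folded into the add
depends on the compiler and flags):

#include <stdint.h>

/* READ_ONCE-style volatile load, roughly what u64_stats_read() does
 * on 64-bit
 */
#define READ_ONCE_U64(x) (*(const volatile uint64_t *)&(x))

uint64_t sum_plain(const uint64_t *c, int n)
{
	uint64_t s = 0;
	int i;

	for (i = 0; i < n; i++)
		s += c[i];			/* may become add (%mem),%reg */
	return s;
}

uint64_t sum_volatile(const uint64_t *c, int n)
{
	uint64_t s = 0;
	int i;

	for (i = 0; i < n; i++)
		s += READ_ONCE_U64(c[i]);	/* separate mov, then add */
	return s;
}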

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-11-12  9:00   ` Jiri Wiesner
@ 2022-11-12  9:09     ` Jiri Wiesner
  2022-11-12 16:01       ` Julian Anastasov
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-12  9:09 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Sat, Nov 12, 2022 at 10:00:01AM +0100, Jiri Wiesner wrote:
> On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> > Use the provided u64_stats_t type to avoid
> > load/store tearing.

I think only 32-bit machines have a problem storing a 64-bit counter atomically. u64_stats_fetch_begin() and u64_stats_fetch_retry() should take care of that.

> > 
> > Fixes: 316580b69d0a ("u64_stats: provide u64_stats_t type")
> > Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > ---
> >  include/net/ip_vs.h             | 10 +++++-----
> >  net/netfilter/ipvs/ip_vs_core.c | 30 +++++++++++++++---------------
> >  net/netfilter/ipvs/ip_vs_ctl.c  | 10 +++++-----
> >  net/netfilter/ipvs/ip_vs_est.c  | 20 ++++++++++----------
> >  4 files changed, 35 insertions(+), 35 deletions(-)
> > 
> > diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> > index e5582c01a4a3..a4d44138c2a8 100644
> > --- a/include/net/ip_vs.h
> > +++ b/include/net/ip_vs.h
> > @@ -351,11 +351,11 @@ struct ip_vs_seq {
> >  
> >  /* counters per cpu */
> >  struct ip_vs_counters {
> > -	__u64		conns;		/* connections scheduled */
> > -	__u64		inpkts;		/* incoming packets */
> > -	__u64		outpkts;	/* outgoing packets */
> > -	__u64		inbytes;	/* incoming bytes */
> > -	__u64		outbytes;	/* outgoing bytes */
> > +	u64_stats_t	conns;		/* connections scheduled */
> > +	u64_stats_t	inpkts;		/* incoming packets */
> > +	u64_stats_t	outpkts;	/* outgoing packets */
> > +	u64_stats_t	inbytes;	/* incoming bytes */
> > +	u64_stats_t	outbytes;	/* outgoing bytes */
> >  };
> >  /* Stats per cpu */
> >  struct ip_vs_cpu_stats {
> 
> Converting the per cpu stat to u64_stats_t means that the compiler cannot optimize the memory access and addition on x86_64. Previously, the summation of per cpu counters in ip_vs_chain_estimation() looked like:
> 15b65:  add    (%rsi),%r14
> 15b68:  add    0x8(%rsi),%r15
> 15b6c:  add    0x10(%rsi),%r13
> 15b70:  add    0x18(%rsi),%r12
> 15b74:  add    0x20(%rsi),%rbp
> The u64_stats_read() calls in ip_vs_chain_estimation() turned it into:
> 159d5:  mov    (%rcx),%r11
> 159d8:  mov    0x8(%rcx),%r10
> 159dc:  mov    0x10(%rcx),%r9
> 159e0:  mov    0x18(%rcx),%rdi
> 159e4:  mov    0x20(%rcx),%rdx
> 159e8:  add    %r11,%r14
> 159eb:  add    %r10,%r13
> 159ee:  add    %r9,%r12
> 159f1:  add    %rdi,%rbp
> 159f4:  add    %rdx,%rbx
> I guess that is not a big deal because the mov should be the instruction taking the most time on account of accessing per cpu regions of other CPUs. The add will be fast.

Another concern is the number of registers needed for the summation. Previously, 5 registers were needed. Now, 10 registers are needed. This would matter mostly if the size of the stack frame of ip_vs_chain_estimation() and the number of callee-saved registers stored to the stack changed, but introducing u64_stats_read() in ip_vs_chain_estimation() did not change that.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-11-12  9:09     ` Jiri Wiesner
@ 2022-11-12 16:01       ` Julian Anastasov
  2022-11-14 11:46         ` Julian Anastasov
  2022-11-15 12:26         ` Jiri Wiesner
  0 siblings, 2 replies; 24+ messages in thread
From: Julian Anastasov @ 2022-11-12 16:01 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Sat, 12 Nov 2022, Jiri Wiesner wrote:

> On Sat, Nov 12, 2022 at 10:00:01AM +0100, Jiri Wiesner wrote:
> > On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> > > Use the provided u64_stats_t type to avoid
> > > load/store tearing.
> 
> I think only 32-bit machines have a problem storing a 64-bit counter atomically. u64_stats_fetch_begin() and u64_stats_fetch_retry() should take care of that.

	Some compilers on 64-bit can split the access into two
32-bit loads (load tearing), while we require an atomic 64-bit
read in case the reader is interrupted by the updater. That is
why READ_ONCE is used now. I don't know which compilers can do
this. That is why I switched to u64_stats_t, to use the
u64_stats interface correctly. And our data is 8-byte aligned.
32-bit builds use seqcount_t, so they are protected from this
problem.
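
	For reference, correct use of the interface on the update
side then looks roughly like this (a sketch only; "stats" stands
for a struct ip_vs_stats pointer here, the actual call sites
converted by this patch are in ip_vs_core.c, and on 64-bit the
syncp operations compile away):

	struct ip_vs_cpu_stats *s = this_cpu_ptr(stats->cpustats);

	u64_stats_update_begin(&s->syncp);
	u64_stats_inc(&s->cnt.inpkts);
	u64_stats_add(&s->cnt.inbytes, skb->len);
	u64_stats_update_end(&s->syncp);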

> > Converting the per cpu stat to u64_stats_t means that the compiler cannot optimize the memory access and addition on x86_64. Previously, the summation of per cpu counters in ip_vs_chain_estimation() looked like:
> > 15b65:  add    (%rsi),%r14
> > 15b68:  add    0x8(%rsi),%r15
> > 15b6c:  add    0x10(%rsi),%r13
> > 15b70:  add    0x18(%rsi),%r12
> > 15b74:  add    0x20(%rsi),%rbp
> > The u64_stats_read() calls in ip_vs_chain_estimation() turned it into:
> > 159d5:  mov    (%rcx),%r11
> > 159d8:  mov    0x8(%rcx),%r10
> > 159dc:  mov    0x10(%rcx),%r9
> > 159e0:  mov    0x18(%rcx),%rdi
> > 159e4:  mov    0x20(%rcx),%rdx
> > 159e8:  add    %r11,%r14
> > 159eb:  add    %r10,%r13
> > 159ee:  add    %r9,%r12
> > 159f1:  add    %rdi,%rbp
> > 159f4:  add    %rdx,%rbx
> > I guess that is not a big deal because the mov should be the instruction taking the most time on account of accessing per cpu regions of other CPUs. The add will be fast.
> 
> Another concern is the number of registers needed for the summation. Previously, 5 registers were needed. Now, 10 registers are needed. This would matter mostly if the size of the stack frame of ip_vs_chain_estimation() and the number of callee-saved registers stored to the stack changed, but introducing u64_stats_read() in ip_vs_chain_estimation() did not change that.

	It probably happens this way because the compiler should
not reorder two or more successive instances of READ_ONCE
(rwonce.h), something that hurts u64_stats_read(). As a result,
it multiplies the needed temporary storage (registers or stack)
by up to 5.

	Another option would be to use something like the change
below, which you can try on top of v6. It is rude to add ifdefs
here, but we should avoid unneeded code on 64-bit. On 32-bit
there is more code but the operations are the same. On 64-bit it
should order the read+add pairs one by one, which can run in
parallel. In my compile tests it uses movq(READ_ONCE)+addq(to
sum) three times with the same temp register, which defeats
fully parallel operation, and for the last 2 counters it uses
movq,movq,addq,addq with different registers, which can run in
parallel. It seems there are no free registers to do this for
all 5 counters.

	With this patch the benefit is that the compiler does not
have to hold all 5 values before adding them, but on x86 the
patch shows a problem: it does not allocate 5 separate temp
registers to read and add the values in parallel. Without this
patch, as you show, it somehow allocates 5+5 registers, which
allows parallel operations.

	If you think we are better off with such a change, we can
include it in patch 4. It would help if one day we add more
counters and there are more available registers. You can
probably notice the difference by comparing the chain_max value
in performance mode.

diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 4489a3dbad1e..52d77715f7ea 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -83,7 +83,9 @@ static void ip_vs_chain_estimation(struct hlist_head *chain)
 	u64 rate;
 
 	hlist_for_each_entry_rcu(e, chain, list) {
+#if BITS_PER_LONG == 32
 		u64 conns, inpkts, outpkts, inbytes, outbytes;
+#endif
 		u64 kconns = 0, kinpkts = 0, koutpkts = 0;
 		u64 kinbytes = 0, koutbytes = 0;
 		unsigned int start;
@@ -95,19 +97,30 @@ static void ip_vs_chain_estimation(struct hlist_head *chain)
 		s = container_of(e, struct ip_vs_stats, est);
 		for_each_possible_cpu(i) {
 			c = per_cpu_ptr(s->cpustats, i);
-			do {
-				start = u64_stats_fetch_begin(&c->syncp);
-				conns = u64_stats_read(&c->cnt.conns);
-				inpkts = u64_stats_read(&c->cnt.inpkts);
-				outpkts = u64_stats_read(&c->cnt.outpkts);
-				inbytes = u64_stats_read(&c->cnt.inbytes);
-				outbytes = u64_stats_read(&c->cnt.outbytes);
-			} while (u64_stats_fetch_retry(&c->syncp, start));
-			kconns += conns;
-			kinpkts += inpkts;
-			koutpkts += outpkts;
-			kinbytes += inbytes;
-			koutbytes += outbytes;
+#if BITS_PER_LONG == 32
+			conns = kconns;
+			inpkts = kinpkts;
+			outpkts = koutpkts;
+			inbytes = kinbytes;
+			outbytes = koutbytes;
+#endif
+			for (;;) {
+				start += u64_stats_fetch_begin(&c->syncp);
+				kconns += u64_stats_read(&c->cnt.conns);
+				kinpkts += u64_stats_read(&c->cnt.inpkts);
+				koutpkts += u64_stats_read(&c->cnt.outpkts);
+				kinbytes += u64_stats_read(&c->cnt.inbytes);
+				koutbytes += u64_stats_read(&c->cnt.outbytes);
+				if (!u64_stats_fetch_retry(&c->syncp, start))
+					break;
+#if BITS_PER_LONG == 32
+				kconns = conns;
+				kinpkts = inpkts;
+				koutpkts = outpkts;
+				kinbytes = inbytes;
+				koutbytes = outbytes;
+#endif
+			}
 		}
 
 		spin_lock(&s->lock);

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-11-12 16:01       ` Julian Anastasov
@ 2022-11-14 11:46         ` Julian Anastasov
  2022-11-15 12:26         ` Jiri Wiesner
  1 sibling, 0 replies; 24+ messages in thread
From: Julian Anastasov @ 2022-11-14 11:46 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Sat, 12 Nov 2022, Julian Anastasov wrote:

> +			for (;;) {
> +				start += u64_stats_fetch_begin(&c->syncp);

	Hm, this should be an assignment and not an addition...
Not a problem for the 64-bit tests. A corrected version of the
loop is sketched after the quoted fragment below.

> +				kconns += u64_stats_read(&c->cnt.conns);
> +				kinpkts += u64_stats_read(&c->cnt.inpkts);
> +				koutpkts += u64_stats_read(&c->cnt.outpkts);
> +				kinbytes += u64_stats_read(&c->cnt.inbytes);
> +				koutbytes += u64_stats_read(&c->cnt.outbytes);
> +				if (!u64_stats_fetch_retry(&c->syncp, start))
> +					break;
> +#if BITS_PER_LONG == 32
> +				kconns = conns;
> +				kinpkts = inpkts;
> +				koutpkts = outpkts;
> +				kinbytes = inbytes;
> +				koutbytes = outbytes;
> +#endif
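
	With that fixed, the retry loop from the proposed change
would read (untested sketch):

			for (;;) {
				start = u64_stats_fetch_begin(&c->syncp);
				kconns += u64_stats_read(&c->cnt.conns);
				kinpkts += u64_stats_read(&c->cnt.inpkts);
				koutpkts += u64_stats_read(&c->cnt.outpkts);
				kinbytes += u64_stats_read(&c->cnt.inbytes);
				koutbytes += u64_stats_read(&c->cnt.outbytes);
				if (!u64_stats_fetch_retry(&c->syncp, start))
					break;
#if BITS_PER_LONG == 32
				/* on retry, roll the partial sums back to
				 * the values saved before this CPU
				 */
				kconns = conns;
				kinpkts = inpkts;
				koutpkts = outpkts;
				kinbytes = inbytes;
				koutbytes = outbytes;
#endif
			}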

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-11-12 16:01       ` Julian Anastasov
  2022-11-14 11:46         ` Julian Anastasov
@ 2022-11-15 12:26         ` Jiri Wiesner
  2022-11-15 16:53           ` Julian Anastasov
  1 sibling, 1 reply; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-15 12:26 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Sat, Nov 12, 2022 at 06:01:36PM +0200, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Sat, 12 Nov 2022, Jiri Wiesner wrote:
> 
> > On Sat, Nov 12, 2022 at 10:00:01AM +0100, Jiri Wiesner wrote:
> > > On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> > > > Use the provided u64_stats_t type to avoid
> > > > load/store tearing.
> > 
> > I think only 32-bit machines have a problem storing a 64-bit counter atomically. u64_stats_fetch_begin() and u64_stats_fetch_retry() should take care of that.
> 
> 	Some compilers for 64-bit can use two 32-bit
> loads (load tearing) while we require atomic 64-bit read
> in case reader is interrupted by updater. That is why READ_ONCE
> is used now. I don't know which compilers can do this. That is
> why I switched to u64_stats_t to correctly use the u64_stats
> interface. And our data is 8-byte aligned. 32-bit are using
> seqcount_t, so they are protected for this problem.

All right. It is counter-intuitive, but I guess those compilers have their reasons. I think it would not hurt to expand the patch description with the explanation above.

> > > Converting the per cpu stat to u64_stats_t means that the compiler cannot optimize the memory access and addition on x86_64. Previously, the summation of per cpu counters in ip_vs_chain_estimation() looked like:
> > > 15b65:  add    (%rsi),%r14
> > > 15b68:  add    0x8(%rsi),%r15
> > > 15b6c:  add    0x10(%rsi),%r13
> > > 15b70:  add    0x18(%rsi),%r12
> > > 15b74:  add    0x20(%rsi),%rbp
> > > The u64_stats_read() calls in ip_vs_chain_estimation() turned it into:
> > > 159d5:  mov    (%rcx),%r11
> > > 159d8:  mov    0x8(%rcx),%r10
> > > 159dc:  mov    0x10(%rcx),%r9
> > > 159e0:  mov    0x18(%rcx),%rdi
> > > 159e4:  mov    0x20(%rcx),%rdx
> > > 159e8:  add    %r11,%r14
> > > 159eb:  add    %r10,%r13
> > > 159ee:  add    %r9,%r12
> > > 159f1:  add    %rdi,%rbp
> > > 159f4:  add    %rdx,%rbx
> > > I guess that is not a big deal because the mov should be the instruction taking the most time on account of accessing per cpu regions of other CPUs. The add will be fast.
> > 
> > Another concern is the number of registers needed for the summation. Previously, 5 registers were needed. Now, 10 registers are needed. This would matter mostly if the size of the stack frame of ip_vs_chain_estimation() and the number of callee-saved registers stored to the stack changed, but introducing u64_stats_read() in ip_vs_chain_estimation() did not change that.
> 
> 	It happens in this way probably because compiler should not 
> reorder two or more successive instances of READ_ONCE (rwonce.h), 
> something that hurts u64_stats_read(). As result, it multiplies
> up to 5 times the needed temp storage (registers or stack).
> 
> 	Another option would be to use something like
> this below, try on top of v6. It is rude to add ifdefs here but
> we should avoid unneeded code for 64-bit. On 32-bit, code is
> more but operations are same. On 64-bit it should order
> read+add one by one which can run in parallel. In my compile
> tests it uses 3 times movq(READ_ONCE)+addq(to sum) with same temp
> register which defeats fully parallel operations and for the
> last 2 counters uses movq,movq,addq,addq with different
> registers which can run in parallel. It seems there are no free
> registers to make it for all 5 counters.
> 
> 	With this patch the benefit is that compiler should
> not hold all 5 values before adding them but on
> x86 this patch shows some problem: it does not allocate
> 5 new temp registers to read and add the values in parallel.
> Without this patch, as you show it, it somehow allocates 5+5
> registers which allows parallel operations.
> 
> 	If you think we are better with such change, we
> can include it in patch 4. It will be better in case one day
> we add more counters and there are more available registers.
> You can notice the difference probably by comparing the
> chain_max value in performance mode.

I have tested the change to ip_vs_chain_estimation() and ran profiling with bus-cycles, which are proportional to the time spent executing code. The number of estimators per kthread was the same in both tests - 88800. It turns out the change made the code slower:
> #       Util1          Util2                    Diff  Command          Shared Object        Symbol   CPU
> 1,124,932,796  1,143,867,951     18,935,155 (  1.7%)  ipvs-e:0:0       [ip_vs]              all      all
>    16,480,274     16,853,752        373,478 (  2.3%)  ipvs-e:0:1       [ip_vs]              all      all
>             0         21,987         21,987 (100.0%)  ipvs-e:0:0       [kvm]                all      all
>     4,622,992      4,366,150       -256,842 (  5.6%)  ipvs-e:0:1       [kernel.kallsyms]    all      all
>   180,226,857    164,665,271    -15,561,586 (  8.6%)  ipvs-e:0:0       [kernel.kallsyms]    all      all
The most significant component was the time spent in ip_vs_chain_estimation():
> #       Util1          Util2                    Diff  Command          Shared Object        Symbol                     CPU
> 1,124,414,110  1,143,352,233     18,938,123 (  1.7%)  ipvs-e:0:0       [ip_vs]              ip_vs_chain_estimation     all
>    16,291,044     16,574,498        283,454 (  1.7%)  ipvs-e:0:1       [ip_vs]              ip_vs_chain_estimation     all
>       189,230        279,254         90,024 ( 47.6%)  ipvs-e:0:1       [ip_vs]              ip_vs_estimation_kthread   all
>       518,686        515,718         -2,968 (  0.6%)  ipvs-e:0:0       [ip_vs]              ip_vs_estimation_kthread   all
>     1,562,512      1,377,609       -184,903 ( 11.8%)  ipvs-e:0:0       [kernel.kallsyms]    kthread_should_stop        all
>     2,435,803      2,138,404       -297,399 ( 12.2%)  ipvs-e:0:1       [kernel.kallsyms]    _find_next_bit             all
>    16,304,577     15,786,553       -518,024 (  3.2%)  ipvs-e:0:0       [kernel.kallsyms]    _raw_spin_lock             all
>   151,110,969    137,172,160    -13,938,809 (  9.2%)  ipvs-e:0:0       [kernel.kallsyms]    _find_next_bit             all
 
The disassembly shows that it is the mov instructions (there is a skid of 1 instruction) that take the most time:
>Percent |      Source code & Disassembly of kcore for bus-cycles (55354 samples, percent: local period)
>        : ffffffffc0bad980 <ip_vs_chain_estimation>:
>            v6 patchset                            v6 patchset + ip_vs_chain_estimation() change
>-------------------------------------------------------------------------------------------------------
>   0.00 :       nopl   0x0(%rax,%rax,1)             0.00 :       nopl   0x0(%rax,%rax,1)
>   0.00 :       push   %r15                         0.00 :       push   %r15
>   0.00 :       push   %r14                         0.00 :       push   %r14
>   0.00 :       push   %r13                         0.00 :       push   %r13
>   0.00 :       push   %r12                         0.00 :       push   %r12
>   0.00 :       push   %rbp                         0.00 :       push   %rbp
>   0.00 :       push   %rbx                         0.00 :       push   %rbx
>   0.00 :       sub    $0x8,%rsp                    0.00 :       sub    $0x8,%rsp
>   0.00 :       mov    (%rdi),%r15                  0.00 :       mov    (%rdi),%r15
>   0.00 :       test   %r15,%r15                    0.00 :       test   %r15,%r15
>   0.00 :       je     0xffffffffc0badaf1           0.00 :       je     0xffffffffc0b7faf1
>   0.00 :       call   0xffffffff8bcd33a0           0.00 :       call   0xffffffffa1ed33a0
>   0.00 :       test   %al,%al                      0.00 :       test   %al,%al
>   0.00 :       jne    0xffffffffc0badaf1           0.00 :       jne    0xffffffffc0b7faf1
>   0.00 :       mov    -0x334ccb22(%rip),%esi       0.00 :       mov    -0x1d29eb22(%rip),%esi
>   0.00 :       xor    %eax,%eax                    0.00 :       xor    %eax,%eax
>   0.00 :       xor    %ebx,%ebx                    0.00 :       xor    %ebx,%ebx
>   0.00 :       xor    %ebp,%ebp                    0.00 :       xor    %ebp,%ebp
>   0.00 :       xor    %r12d,%r12d                  0.00 :       xor    %r12d,%r12d
>   0.00 :       xor    %r13d,%r13d                  0.00 :       xor    %r13d,%r13d
>   0.05 :       xor    %r14d,%r14d                  0.08 :       xor    %r14d,%r14d
>   0.00 :       jmp    0xffffffffc0bad9f7           0.00 :       jmp    0xffffffffc0b7f9f7
>   0.14 :       movslq %eax,%rdx                    0.08 :       movslq %eax,%rdx
>   0.00 :       mov    0x68(%r15),%rcx              0.00 :       mov    0x68(%r15),%rcx
>  11.92 :       add    $0x1,%eax                   12.25 :       add    $0x1,%eax
>   0.00 :       add    -0x72ead560(,%rdx,8),%rcx    0.00 :       add    -0x5ccad560(,%rdx,8),%rcx
>   4.07 :       mov    (%rcx),%r11                  3.84 :       mov    (%rcx),%rdx
>  34.28 :       mov    0x8(%rcx),%r10              34.17 :       add    %rdx,%r14
>  10.35 :       mov    0x10(%rcx),%r9               2.62 :       mov    0x8(%rcx),%rdx
>   7.28 :       mov    0x18(%rcx),%rdi              9.33 :       add    %rdx,%r13
>   7.70 :       mov    0x20(%rcx),%rdx              1.82 :       mov    0x10(%rcx),%rdx
>   6.80 :       add    %r11,%r14                    6.11 :       add    %rdx,%r12
>   0.27 :       add    %r10,%r13                    1.26 :       mov    0x18(%rcx),%rdx
>   0.36 :       add    %r9,%r12                     6.00 :       add    %rdx,%rbp
>   2.24 :       add    %rdi,%rbp                    2.97 :       mov    0x20(%rcx),%rdx
>   1.21 :       add    %rdx,%rbx                    5.78 :       add    %rdx,%rbx
>   0.07 :       movslq %eax,%rdx                    1.08 :       movslq %eax,%rdx
>   0.35 :       mov    $0xffffffff8d6e0860,%rdi     0.00 :       mov    $0xffffffffa38e0860,%rdi
>   2.26 :       call   0xffffffff8c16e270           2.62 :       call   0xffffffffa236e270
>   0.17 :       mov    -0x334ccb7c(%rip),%esi       0.11 :       mov    -0x1d29eb7c(%rip),%esi
>   0.00 :       cmp    %eax,%esi                    0.00 :       cmp    %eax,%esi
>   0.00 :       ja     0xffffffffc0bad9c3           0.00 :       ja     0xffffffffc0b7f9c3
>   0.00 :       lea    0x70(%r15),%rdx              0.00 :       lea    0x70(%r15),%rdx
>   0.00 :       mov    %rdx,%rdi                    0.00 :       mov    %rdx,%rdi
>   0.04 :       mov    %rdx,(%rsp)                  0.05 :       mov    %rdx,(%rsp)
>   0.00 :       call   0xffffffff8c74d800           0.00 :       call   0xffffffffa294d800
>   0.06 :       mov    %r14,%rax                    0.07 :       mov    %r14,%rax
>   0.00 :       sub    0x20(%r15),%rax              0.00 :       sub    0x20(%r15),%rax
>   9.55 :       mov    0x38(%r15),%rcx              8.97 :       mov    0x38(%r15),%rcx
>   0.01 :       mov    (%rsp),%rdx                  0.03 :       mov    (%rsp),%rdx
 
The results indicate that the extra add instructions should not matter - memory accesses (the mov instructions) are the bottleneck.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-11-15 12:26         ` Jiri Wiesner
@ 2022-11-15 16:53           ` Julian Anastasov
  2022-11-16 17:37             ` Jiri Wiesner
  0 siblings, 1 reply; 24+ messages in thread
From: Julian Anastasov @ 2022-11-15 16:53 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Tue, 15 Nov 2022, Jiri Wiesner wrote:

> On Sat, Nov 12, 2022 at 06:01:36PM +0200, Julian Anastasov wrote:
> > 
> > 	Some compilers for 64-bit can use two 32-bit
> > loads (load tearing) while we require atomic 64-bit read
> > in case reader is interrupted by updater. That is why READ_ONCE
> > is used now. I don't know which compilers can do this. That is
> > why I switched to u64_stats_t to correctly use the u64_stats
> > interface. And our data is 8-byte aligned. 32-bit are using
> > seqcount_t, so they are protected for this problem.
> 
> All right. It is counter-intuitive but I guess those compiles have their reasons. I think it would not hurt to expand the description of the patch with the explanation above.

	OK. I relied on the mentioned commit to explain the problem
as other similar patches do.

> > 	If you think we are better with such change, we
> > can include it in patch 4. It will be better in case one day
> > we add more counters and there are more available registers.
> > You can notice the difference probably by comparing the
> > chain_max value in performance mode.
> 
> I have tested the change to ip_vs_chain_estimation() and ran profiling with bus-cycles, which are proportional to time spent executing code. The number of estimator per kthread was the same in both tests - 88800. It turns out the change made the code slower:
> > #       Util1          Util2                    Diff  Command          Shared Object        Symbol   CPU
> > 1,124,932,796  1,143,867,951     18,935,155 (  1.7%)  ipvs-e:0:0       [ip_vs]              all      all

	chain_max is probably 37 here and 1.7% is low
enough (below 1/37) not to be accounted in chain_max.

> >    16,480,274     16,853,752        373,478 (  2.3%)  ipvs-e:0:1       [ip_vs]              all      all
> >             0         21,987         21,987 (100.0%)  ipvs-e:0:0       [kvm]                all      all
> >     4,622,992      4,366,150       -256,842 (  5.6%)  ipvs-e:0:1       [kernel.kallsyms]    all      all
> >   180,226,857    164,665,271    -15,561,586 (  8.6%)  ipvs-e:0:0       [kernel.kallsyms]    all      all
> The most significant component was the time spent in ip_vs_chain_estimation():
> > #       Util1          Util2                    Diff  Command          Shared Object        Symbol                     CPU
> > 1,124,414,110  1,143,352,233     18,938,123 (  1.7%)  ipvs-e:0:0       [ip_vs]              ip_vs_chain_estimation     all
> >    16,291,044     16,574,498        283,454 (  1.7%)  ipvs-e:0:1       [ip_vs]              ip_vs_chain_estimation     all
> >       189,230        279,254         90,024 ( 47.6%)  ipvs-e:0:1       [ip_vs]              ip_vs_estimation_kthread   all
> >       518,686        515,718         -2,968 (  0.6%)  ipvs-e:0:0       [ip_vs]              ip_vs_estimation_kthread   all
> >     1,562,512      1,377,609       -184,903 ( 11.8%)  ipvs-e:0:0       [kernel.kallsyms]    kthread_should_stop        all
> >     2,435,803      2,138,404       -297,399 ( 12.2%)  ipvs-e:0:1       [kernel.kallsyms]    _find_next_bit             all
> >    16,304,577     15,786,553       -518,024 (  3.2%)  ipvs-e:0:0       [kernel.kallsyms]    _raw_spin_lock             all
> >   151,110,969    137,172,160    -13,938,809 (  9.2%)  ipvs-e:0:0       [kernel.kallsyms]    _find_next_bit             all
>  
> The disassembly shows it is the mov instructions (there is a skid of 1 instruction) what takes the most time:
> >Percent |      Source code & Disassembly of kcore for bus-cycles (55354 samples, percent: local period)
> >        : ffffffffc0bad980 <ip_vs_chain_estimation>:
> >            v6 patchset                            v6 patchset + ip_vs_chain_estimation() change
> >-------------------------------------------------------------------------------------------------------

> >   0.00 :       add    -0x72ead560(,%rdx,8),%rcx    0.00 :       add    -0x5ccad560(,%rdx,8),%rcx
> >   4.07 :       mov    (%rcx),%r11                  3.84 :       mov    (%rcx),%rdx
> >  34.28 :       mov    0x8(%rcx),%r10              34.17 :       add    %rdx,%r14
> >  10.35 :       mov    0x10(%rcx),%r9               2.62 :       mov    0x8(%rcx),%rdx
> >   7.28 :       mov    0x18(%rcx),%rdi              9.33 :       add    %rdx,%r13
> >   7.70 :       mov    0x20(%rcx),%rdx              1.82 :       mov    0x10(%rcx),%rdx
> >   6.80 :       add    %r11,%r14                    6.11 :       add    %rdx,%r12
> >   0.27 :       add    %r10,%r13                    1.26 :       mov    0x18(%rcx),%rdx
> >   0.36 :       add    %r9,%r12                     6.00 :       add    %rdx,%rbp
> >   2.24 :       add    %rdi,%rbp                    2.97 :       mov    0x20(%rcx),%rdx

	Bad, the same register RDX is used. I guess this is a
CONFIG_GENERIC_CPU=y kernel, which is suitable for different
CPUs. My example was tested with CONFIG_MCORE2=y. But the
optimizations are the compiler's job; there can be a cost to
freeing registers for our operations. Who knows, the change may
be faster on other platforms with fewer registers available to
hold the 5 counters and their sum.

	Probably, we can force it to add mem8 into reg by using
something like this:

static inline void u64_stats_read_add(u64_stats_t *p, u64 *sum)
{
	local64_add_to(&p->v, (long *) sum);
}

static inline void local_add_to(local_t *l, long *sum)
{
	asm(_ASM_ADD "%1,%0"
	    : "+r" (*sum)
	    : "m" (l->a.counter));
}

	But it deserves more testing.

> The results indicate that the extra add instructions should not matter - memory accesses (the mov instructions) are the bottleneck.

	Yep, then let's work without this change for now; no need
to deviate much from the general u64_stats usage if the
difference is negligible.

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-11-15 16:53           ` Julian Anastasov
@ 2022-11-16 17:37             ` Jiri Wiesner
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-16 17:37 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Tue, Nov 15, 2022 at 06:53:22PM +0200, Julian Anastasov wrote:
> 	Bad, same register RDX used. I guess, this is
> CONFIG_GENERIC_CPU=y kernel which is suitable for different
> CPUs.

Yes, it is. I use SUSE's distro config.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
  2022-10-31 14:56 ` [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters Julian Anastasov
  2022-11-12  9:00   ` Jiri Wiesner
@ 2022-11-19  7:46   ` Jiri Wiesner
  1 sibling, 0 replies; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-19  7:46 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> Use the provided u64_stats_t type to avoid
> load/store tearing.
> 
> Fixes: 316580b69d0a ("u64_stats: provide u64_stats_t type")
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

Tested-by: Jiri Wiesner <jwiesner@suse.de>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation
  2022-10-31 14:56 ` [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation Julian Anastasov
  2022-11-10 15:39   ` Jiri Wiesner
@ 2022-11-21 16:05   ` Jiri Wiesner
  1 sibling, 0 replies; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-21 16:05 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 31, 2022 at 04:56:44PM +0200, Julian Anastasov wrote:
> Estimating all entries in single list in timer context
> causes large latency with multiple rules.
> 
> Spread the estimator structures in multiple chains and
> use kthread(s) for the estimation. Every chain is
> processed under RCU lock. First kthread determines
> parameters to use, eg. maximum number of estimators to
> process per kthread based on chain's length, allowing
> sub-100us cond_resched rate and estimation taking 1/8
> of the CPU.
> 
> First kthread also plays the role of distributor of
> added estimators to all kthreads.
> 
> We also add delayed work est_reload_work that will
> make sure the kthread tasks are properly started/stopped.
> 
> ip_vs_start_estimator() is changed to report errors
> which allows to safely store the estimators in
> allocated structures.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

Tested-by: Jiri Wiesner <jwiesner@suse.de>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>

> @@ -953,9 +1005,17 @@ struct netns_ipvs {
>  	struct ctl_table_header	*lblcr_ctl_header;
>  	struct ctl_table	*lblcr_ctl_table;
>  	/* ip_vs_est */
> -	struct list_head	est_list;	/* estimator list */
> -	spinlock_t		est_lock;
> -	struct timer_list	est_timer;	/* Estimation timer */
> +	struct delayed_work	est_reload_work;/* Reload kthread tasks */
> +	struct mutex		est_mutex;	/* protect kthread tasks */
> +	struct hlist_head	est_temp_list;	/* Ests during calc phase */
> +	struct ip_vs_est_kt_data **est_kt_arr;	/* Array of kthread data ptrs */
> +	unsigned long		est_max_threads;/* rlimit */

Not an rlimit anymore.
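
A possible replacement for the field comment (suggestion only):

	unsigned long		est_max_threads;/* Max kthreads to use */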

> +	int			est_calc_phase;	/* Calculation phase */
> +	int			est_chain_max;	/* Calculated chain_max */
> +	int			est_kt_count;	/* Allocated ptrs */
> +	int			est_add_ktid;	/* ktid where to add ests */
> +	atomic_t		est_genid;	/* kthreads reload genid */
> +	atomic_t		est_genid_done;	/* applied genid */
>  	/* ip_vs_sync */
>  	spinlock_t		sync_lock;
>  	struct ipvs_master_sync_state *ms;

> -	INIT_LIST_HEAD(&est->list);
> +/* Start estimation for stats */
> +int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
> +{
> +	struct ip_vs_estimator *est = &stats->est;
> +	int ret;
> +
> +	if (!ipvs->est_max_threads && ipvs->enable)
> +		ipvs->est_max_threads = 4 * num_possible_cpus();

To avoid the magic number 4, a symbolic constant could be used (a possible form is sketched after the quoted hunk below). The 4 is related to the design decision that a fully loaded kthread should take 1/8 of the CPU time of a CPU.

> +
> +	est->ktid = -1;
> +	est->ktrow = IPVS_EST_NTICKS - 1;	/* Initial delay */
> +
> +	/* We prefer this code to be short, kthread 0 will requeue the
> +	 * estimator to available chain. If tasks are disabled, we
> +	 * will not allocate much memory, just for kt 0.
> +	 */
> +	ret = 0;
> +	if (!ipvs->est_kt_count || !ipvs->est_kt_arr[0])
> +		ret = ip_vs_est_add_kthread(ipvs);
> +	if (ret >= 0)
> +		hlist_add_head(&est->list, &ipvs->est_temp_list);
> +	else
> +		INIT_HLIST_NODE(&est->list);
> +	return ret;
> +}
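
As a sketch of that suggestion (the constant name is made up here, and
the "half of the machine" bound is only my reading of the 1/8-CPU design
note):

/* Each fully loaded kthread targets 1/8 of a CPU, so 4 kthreads per
 * possible CPU would bound the estimation load to roughly half of the
 * machine.
 */
#define IPVS_EST_MAX_KT_PER_CPU	4

	if (!ipvs->est_max_threads && ipvs->enable)
		ipvs->est_max_threads = IPVS_EST_MAX_KT_PER_CPU *
					num_possible_cpus();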

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 5/7] ipvs: add est_cpulist and est_nice sysctl vars
  2022-10-31 14:56 ` [RFC PATCHv6 5/7] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
@ 2022-11-21 16:29   ` Jiri Wiesner
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-21 16:29 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 31, 2022 at 04:56:45PM +0200, Julian Anastasov wrote:
> Allow the kthreads for stats to be configured for
> specific cpulist (isolation) and niceness (scheduling
> priority).
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

Reviewed-by: Jiri Wiesner <jwiesner@suse.de>

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCHv6 6/7] ipvs: run_estimation should control the kthread tasks
  2022-10-31 14:56 ` [RFC PATCHv6 6/7] ipvs: run_estimation should control the kthread tasks Julian Anastasov
@ 2022-11-21 16:32   ` Jiri Wiesner
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Wiesner @ 2022-11-21 16:32 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Mon, Oct 31, 2022 at 04:56:46PM +0200, Julian Anastasov wrote:
> Change the run_estimation flag to start/stop the kthread tasks.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

Reviewed-by: Jiri Wiesner <jwiesner@suse.de>

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-11-21 16:32 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
2022-10-31 14:56 ` [RFC PATCHv6 1/7] ipvs: add rcu protection to stats Julian Anastasov
2022-11-12  7:51   ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 2/7] ipvs: use common functions for stats allocation Julian Anastasov
2022-11-12  8:22   ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters Julian Anastasov
2022-11-12  9:00   ` Jiri Wiesner
2022-11-12  9:09     ` Jiri Wiesner
2022-11-12 16:01       ` Julian Anastasov
2022-11-14 11:46         ` Julian Anastasov
2022-11-15 12:26         ` Jiri Wiesner
2022-11-15 16:53           ` Julian Anastasov
2022-11-16 17:37             ` Jiri Wiesner
2022-11-19  7:46   ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation Julian Anastasov
2022-11-10 15:39   ` Jiri Wiesner
2022-11-10 20:16     ` Julian Anastasov
2022-11-11 17:21       ` Jiri Wiesner
2022-11-21 16:05   ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 5/7] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
2022-11-21 16:29   ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 6/7] ipvs: run_estimation should control the kthread tasks Julian Anastasov
2022-11-21 16:32   ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 7/7] ipvs: debug the tick time Julian Anastasov
