* [RFC PATCH 0/4] Use kthreads for stats
@ 2022-08-27 17:41 Julian Anastasov
  2022-08-27 17:41 ` [RFC PATCH 1/4] ipvs: add rcu protection to stats Julian Anastasov
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Julian Anastasov @ 2022-08-27 17:41 UTC (permalink / raw)
  To: Jiri Wiesner
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

	Hello,

	This patchset implements stats estimation in
kthread context. Simple tests do not show any problem.
Please review, comment, test, etc.

	Overview of the basic concepts. More in the
commit messages...

RCU Locking:

- when RCU preemption is enabled, the kthreads use just the
RCU lock for walking the chains and we do not need to
reschedule. This is probably the common case for distribution
kernels. In this case ip_vs_stop_estimator() is completely
lockless.

- when RCU preemption is not enabled, we reschedule by using
a refcnt for every estimator to track whether the estimator
currently being removed is also in use by a kthread for
estimation. As the RCU lock is released during rescheduling,
the deletion must wait on kd->mutex, so that a new RCU lock
is taken before the estimator is freed with an RCU callback
(see the sketch below).

- as stats are now RCU-locked, tot_stats, svc and dest, which
hold estimator structures, are now always freed from an RCU
callback. This ensures an RCU grace period after the
ip_vs_stop_estimator() call.
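
To illustrate the second case, here is a simplified sketch of
how the kthread and the deletion path synchronize (condensed
from the ip_vs_est_cond_resched_rcu() and
ip_vs_est_wait_resched() helpers added in patch 2):

	/* kthread, walking a chain under rcu_read_lock() */
	if (mutex_trylock(&kd->mutex)) {
		/* block removal of e while we may drop the RCU lock */
		if (refcount_inc_not_zero(&e->refcnt)) {
			cond_resched_rcu();
			refcount_dec(&e->refcnt);
		}
		mutex_unlock(&kd->mutex);
	}

	/* deletion, after hlist_del_rcu(&e->list) */
	if (!refcount_dec_and_test(&e->refcnt)) {
		/* kthread still holds a reference: wait for it to
		 * finish the resched under a new RCU lock
		 */
		mutex_lock(&kd->mutex);
		mutex_unlock(&kd->mutex);
	}
	/* now the estimator can be freed with an RCU callback */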

Kthread data:

- every kthread works over its own data structure and all
such structures are attached to an array

- even when a kthread structure exists, its task may not be
running, e.g. before the first service is added, while the
sysctl var is set to an empty cpulist or when run_estimation
is 0.

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chain just before the one currently being estimated,
based on add_row. This ensures their estimation will start
after 2 seconds (see the sketch at the end of this section).
If estimators are added in bursts, a common case when all
services and dests are configured initially, we may spread
the estimators over more chains. This reduces the chain
imbalance.

- the chain imbalance is not so fatal when we use
kthreads. We design each kthread for a part of the
possible CPU usage, so even if some chain exceeds its
time slot, this would happen consistently or sporadically
depending on the scheduling, while still keeping the
2-second interval. CPU isolation via the cpulist can make
things more stable with respect to the 2-second interval
per estimator.
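
For reference, here are the timing constants and the row
selection from patch 2 that give the 2-second start delay
mentioned above (shown only as an illustration):

	/* 50 chains (rows) cover the 2-second estimation period,
	 * one row is estimated per tick
	 */
	#define IPVS_EST_NCHAINS	50
	#define IPVS_EST_TICK		((2 * HZ) / IPVS_EST_NCHAINS)

	/* new estimators go to the row just before add_row, i.e.
	 * the row that was estimated most recently
	 */
	row = READ_ONCE(kd->add_row) - 1;
	if (row < 0)
		row = IPVS_EST_NCHAINS - 1;
	/* that row is visited again only after a full cycle of
	 * IPVS_EST_NCHAINS ticks, i.e. ~2 seconds later
	 */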

Julian Anastasov (4):
  ipvs: add rcu protection to stats
  ipvs: use kthreads for stats estimation
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: run_estimation should control the kthread tasks

 Documentation/networking/ipvs-sysctl.rst |  24 +-
 include/net/ip_vs.h                      | 144 +++++++-
 net/netfilter/ipvs/ip_vs_core.c          |  10 +-
 net/netfilter/ipvs/ip_vs_ctl.c           | 287 ++++++++++++++--
 net/netfilter/ipvs/ip_vs_est.c           | 408 +++++++++++++++++++----
 5 files changed, 771 insertions(+), 102 deletions(-)

-- 
2.37.2




* [RFC PATCH 1/4] ipvs: add rcu protection to stats
  2022-08-27 17:41 [RFC PATCH 0/4] Use kthreads for stats Julian Anastasov
@ 2022-08-27 17:41 ` Julian Anastasov
  2022-09-05 10:43   ` Jiri Wiesner
  2022-08-27 17:41 ` [RFC PATCH 2/4] ipvs: use kthreads for stats estimation Julian Anastasov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Julian Anastasov @ 2022-08-27 17:41 UTC (permalink / raw)
  To: Jiri Wiesner
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

In preparation for using RCU locking for the list
with estimators, make sure struct ip_vs_stats
is released after an RCU grace period by using RCU
callbacks. This affects ipvs->tot_stats, where we
cannot use an RCU callback for ipvs itself, so we
use a separately allocated struct ip_vs_stats_rcu.
For services and dests we force RCU callbacks in
all cases.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h             |  8 ++++-
 net/netfilter/ipvs/ip_vs_core.c | 10 ++++--
 net/netfilter/ipvs/ip_vs_ctl.c  | 64 ++++++++++++++++++++++-----------
 3 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index ff1804a0c469..bd8ae137e43b 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -405,6 +405,11 @@ struct ip_vs_stats {
 	struct ip_vs_kstats	kstats0;	/* reset values */
 };
 
+struct ip_vs_stats_rcu {
+	struct ip_vs_stats	s;
+	struct rcu_head		rcu_head;
+};
+
 struct dst_entry;
 struct iphdr;
 struct ip_vs_conn;
@@ -688,6 +693,7 @@ struct ip_vs_dest {
 	union nf_inet_addr	vaddr;		/* virtual IP address */
 	__u32			vfwmark;	/* firewall mark of service */
 
+	struct rcu_head		rcu_head;
 	struct list_head	t_list;		/* in dest_trash */
 	unsigned int		in_rs_table:1;	/* we are in rs_table */
 };
@@ -869,7 +875,7 @@ struct netns_ipvs {
 	atomic_t		conn_count;      /* connection counter */
 
 	/* ip_vs_ctl */
-	struct ip_vs_stats		tot_stats;  /* Statistics & est. */
+	struct ip_vs_stats_rcu	*tot_stats;      /* Statistics & est. */
 
 	int			num_services;    /* no of virtual services */
 	int			num_services6;   /* IPv6 virtual services */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 51ad557a525b..fcdaef1fcccf 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -143,7 +143,7 @@ ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
 		s->cnt.inbytes += skb->len;
 		u64_stats_update_end(&s->syncp);
 
-		s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+		s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 		u64_stats_update_begin(&s->syncp);
 		s->cnt.inpkts++;
 		s->cnt.inbytes += skb->len;
@@ -179,7 +179,7 @@ ip_vs_out_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
 		s->cnt.outbytes += skb->len;
 		u64_stats_update_end(&s->syncp);
 
-		s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+		s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 		u64_stats_update_begin(&s->syncp);
 		s->cnt.outpkts++;
 		s->cnt.outbytes += skb->len;
@@ -208,7 +208,7 @@ ip_vs_conn_stats(struct ip_vs_conn *cp, struct ip_vs_service *svc)
 	s->cnt.conns++;
 	u64_stats_update_end(&s->syncp);
 
-	s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+	s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
 	u64_stats_update_begin(&s->syncp);
 	s->cnt.conns++;
 	u64_stats_update_end(&s->syncp);
@@ -2448,6 +2448,10 @@ static void __exit ip_vs_cleanup(void)
 	ip_vs_conn_cleanup();
 	ip_vs_protocol_cleanup();
 	ip_vs_control_cleanup();
+	/* common rcu_barrier() used by:
+	 * - ip_vs_control_cleanup()
+	 */
+	rcu_barrier();
 	pr_info("ipvs unloaded.\n");
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index efab2b06d373..44c79fd1779c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -483,17 +483,14 @@ static void ip_vs_service_rcu_free(struct rcu_head *head)
 	ip_vs_service_free(svc);
 }
 
-static void __ip_vs_svc_put(struct ip_vs_service *svc, bool do_delay)
+static void __ip_vs_svc_put(struct ip_vs_service *svc)
 {
 	if (atomic_dec_and_test(&svc->refcnt)) {
 		IP_VS_DBG_BUF(3, "Removing service %u/%s:%u\n",
 			      svc->fwmark,
 			      IP_VS_DBG_ADDR(svc->af, &svc->addr),
 			      ntohs(svc->port));
-		if (do_delay)
-			call_rcu(&svc->rcu_head, ip_vs_service_rcu_free);
-		else
-			ip_vs_service_free(svc);
+		call_rcu(&svc->rcu_head, ip_vs_service_rcu_free);
 	}
 }
 
@@ -780,14 +777,22 @@ ip_vs_trash_get_dest(struct ip_vs_service *svc, int dest_af,
 	return dest;
 }
 
+static void ip_vs_dest_rcu_free(struct rcu_head *head)
+{
+	struct ip_vs_dest *dest;
+
+	dest = container_of(head, struct ip_vs_dest, rcu_head);
+	free_percpu(dest->stats.cpustats);
+	ip_vs_dest_put_and_free(dest);
+}
+
 static void ip_vs_dest_free(struct ip_vs_dest *dest)
 {
 	struct ip_vs_service *svc = rcu_dereference_protected(dest->svc, 1);
 
 	__ip_vs_dst_cache_reset(dest);
-	__ip_vs_svc_put(svc, false);
-	free_percpu(dest->stats.cpustats);
-	ip_vs_dest_put_and_free(dest);
+	__ip_vs_svc_put(svc);
+	call_rcu(&dest->rcu_head, ip_vs_dest_rcu_free);
 }
 
 /*
@@ -811,6 +816,16 @@ static void ip_vs_trash_cleanup(struct netns_ipvs *ipvs)
 	}
 }
 
+static void ip_vs_stats_rcu_free(struct rcu_head *head)
+{
+	struct ip_vs_stats_rcu *rs = container_of(head,
+						  struct ip_vs_stats_rcu,
+						  rcu_head);
+
+	free_percpu(rs->s.cpustats);
+	kfree(rs);
+}
+
 static void
 ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
 {
@@ -923,7 +938,7 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
 		if (old_svc != svc) {
 			ip_vs_zero_stats(&dest->stats);
 			__ip_vs_bind_svc(dest, svc);
-			__ip_vs_svc_put(old_svc, true);
+			__ip_vs_svc_put(old_svc);
 		}
 	}
 
@@ -1571,7 +1586,7 @@ static void __ip_vs_del_service(struct ip_vs_service *svc, bool cleanup)
 	/*
 	 *    Free the service if nobody refers to it
 	 */
-	__ip_vs_svc_put(svc, true);
+	__ip_vs_svc_put(svc);
 
 	/* decrease the module use count */
 	ip_vs_use_count_dec();
@@ -1761,7 +1776,7 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 		}
 	}
 
-	ip_vs_zero_stats(&ipvs->tot_stats);
+	ip_vs_zero_stats(&ipvs->tot_stats->s);
 	return 0;
 }
 
@@ -2255,7 +2270,7 @@ static int ip_vs_stats_show(struct seq_file *seq, void *v)
 	seq_puts(seq,
 		 "   Conns  Packets  Packets            Bytes            Bytes\n");
 
-	ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats);
+	ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats->s);
 	seq_printf(seq, "%8LX %8LX %8LX %16LX %16LX\n\n",
 		   (unsigned long long)show.conns,
 		   (unsigned long long)show.inpkts,
@@ -2279,7 +2294,7 @@ static int ip_vs_stats_show(struct seq_file *seq, void *v)
 static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq_file_single_net(seq);
-	struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats;
+	struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats->s;
 	struct ip_vs_cpu_stats __percpu *cpustats = tot_stats->cpustats;
 	struct ip_vs_kstats kstats;
 	int i;
@@ -4106,7 +4121,6 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 			kfree(tbl);
 		return -ENOMEM;
 	}
-	ip_vs_start_estimator(ipvs, &ipvs->tot_stats);
 	ipvs->sysctl_tbl = tbl;
 	/* Schedule defense work */
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
@@ -4117,6 +4131,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
 			  expire_nodest_conn_handler);
 
+	ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
 	return 0;
 }
 
@@ -4128,7 +4143,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats);
+	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s);
 
 	if (!net_eq(net, &init_net))
 		kfree(ipvs->sysctl_tbl);
@@ -4164,17 +4179,20 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 	atomic_set(&ipvs->conn_out_counter, 0);
 
 	/* procfs stats */
-	ipvs->tot_stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!ipvs->tot_stats.cpustats)
+	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
+	if (!ipvs->tot_stats)
 		return -ENOMEM;
+	ipvs->tot_stats->s.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
+	if (!ipvs->tot_stats->s.cpustats)
+		goto err_tot_stats;
 
 	for_each_possible_cpu(i) {
 		struct ip_vs_cpu_stats *ipvs_tot_stats;
-		ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats.cpustats, i);
+		ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats->s.cpustats, i);
 		u64_stats_init(&ipvs_tot_stats->syncp);
 	}
 
-	spin_lock_init(&ipvs->tot_stats.lock);
+	spin_lock_init(&ipvs->tot_stats->s.lock);
 
 #ifdef CONFIG_PROC_FS
 	if (!proc_create_net("ip_vs", 0, ipvs->net->proc_net,
@@ -4206,7 +4224,10 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 
 err_vs:
 #endif
-	free_percpu(ipvs->tot_stats.cpustats);
+	free_percpu(ipvs->tot_stats->s.cpustats);
+
+err_tot_stats:
+	kfree(ipvs->tot_stats);
 	return -ENOMEM;
 }
 
@@ -4219,7 +4240,7 @@ void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
 	remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs", ipvs->net->proc_net);
 #endif
-	free_percpu(ipvs->tot_stats.cpustats);
+	call_rcu(&ipvs->tot_stats->rcu_head, ip_vs_stats_rcu_free);
 }
 
 int __init ip_vs_register_nl_ioctl(void)
@@ -4279,5 +4300,6 @@ void ip_vs_control_cleanup(void)
 {
 	EnterFunction(2);
 	unregister_netdevice_notifier(&ip_vs_dst_notifier);
+	/* relying on common rcu_barrier() in ip_vs_cleanup() */
 	LeaveFunction(2);
 }
-- 
2.37.2




* [RFC PATCH 2/4] ipvs: use kthreads for stats estimation
  2022-08-27 17:41 [RFC PATCH 0/4] Use kthreads for stats Julian Anastasov
  2022-08-27 17:41 ` [RFC PATCH 1/4] ipvs: add rcu protection to stats Julian Anastasov
@ 2022-08-27 17:41 ` Julian Anastasov
  2022-09-05  6:47   ` dust.li
  2022-09-05 13:19   ` Jiri Wiesner
  2022-08-27 17:41 ` [RFC PATCH 3/4] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Julian Anastasov @ 2022-08-27 17:41 UTC (permalink / raw)
  To: Jiri Wiesner
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

Estimating all entries in a single list in timer context
causes large latency with many rules.

Spread the estimator structures over multiple chains and
use kthread(s) for the estimation. Every chain is
processed under the RCU lock. If RCU preemption is not
enabled, we add code for rescheduling by delaying the
removal of the entry currently being estimated.

We also add delayed work est_reload_work that will
make sure the kthread tasks are properly started.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_vs.h            |  84 ++++++-
 net/netfilter/ipvs/ip_vs_ctl.c |  55 ++++-
 net/netfilter/ipvs/ip_vs_est.c | 403 +++++++++++++++++++++++++++------
 3 files changed, 468 insertions(+), 74 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index bd8ae137e43b..8171d845520c 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -363,9 +363,14 @@ struct ip_vs_cpu_stats {
 	struct u64_stats_sync   syncp;
 };
 
+/* resched during estimation, the defines should match cond_resched_rcu */
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
+#define IPVS_EST_RESCHED_RCU	1
+#endif
+
 /* IPVS statistics objects */
 struct ip_vs_estimator {
-	struct list_head	list;
+	struct hlist_node	list;
 
 	u64			last_inbytes;
 	u64			last_outbytes;
@@ -378,6 +383,31 @@ struct ip_vs_estimator {
 	u64			outpps;
 	u64			inbps;
 	u64			outbps;
+
+#ifdef IPVS_EST_RESCHED_RCU
+	refcount_t		refcnt;
+#endif
+	u32			ktid:16,	/* kthread ID */
+				ktrow:16;	/* row ID for kthread */
+};
+
+/* Spread estimator states in multiple chains */
+#define IPVS_EST_NCHAINS	50
+#define IPVS_EST_TICK		((2 * HZ) / IPVS_EST_NCHAINS)
+
+/* Context for estimation kthread */
+struct ip_vs_est_kt_data {
+	struct netns_ipvs	*ipvs;
+	struct task_struct	*task;		/* task if running */
+	struct mutex		mutex;		/* held during resched */
+	int			id;		/* ktid per netns */
+	int			est_count;	/* attached ests to kthread */
+	int			est_max_count;	/* max ests per kthread */
+	int			add_row;	/* row for new ests */
+	int			est_row;	/* estimated row */
+	unsigned long		est_timer;	/* estimation timer (jiffies) */
+	struct hlist_head	chains[IPVS_EST_NCHAINS];
+	int			chain_len[IPVS_EST_NCHAINS];
 };
 
 /*
@@ -948,9 +978,13 @@ struct netns_ipvs {
 	struct ctl_table_header	*lblcr_ctl_header;
 	struct ctl_table	*lblcr_ctl_table;
 	/* ip_vs_est */
-	struct list_head	est_list;	/* estimator list */
-	spinlock_t		est_lock;
-	struct timer_list	est_timer;	/* Estimation timer */
+	struct delayed_work	est_reload_work;/* Reload kthread tasks */
+	struct mutex		est_mutex;	/* protect kthread tasks */
+	struct ip_vs_est_kt_data **est_kt_arr;	/* Array of kthread data ptrs */
+	int			est_kt_count;	/* Allocated ptrs */
+	int			est_add_ktid;	/* ktid where to add ests */
+	atomic_t		est_genid;	/* kthreads reload genid */
+	atomic_t		est_genid_done;	/* applied genid */
 	/* ip_vs_sync */
 	spinlock_t		sync_lock;
 	struct ipvs_master_sync_state *ms;
@@ -1485,6 +1519,48 @@ void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
 void ip_vs_zero_estimator(struct ip_vs_stats *stats);
 void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs, bool bump);
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+			    struct ip_vs_est_kt_data *kd);
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
+
+extern struct mutex ip_vs_est_mutex;
+
+static inline void ip_vs_est_init_resched_rcu(struct ip_vs_estimator *e)
+{
+#ifdef IPVS_EST_RESCHED_RCU
+	refcount_set(&e->refcnt, 1);
+#endif
+}
+
+static inline void ip_vs_est_cond_resched_rcu(struct ip_vs_est_kt_data *kd,
+					      struct ip_vs_estimator *e)
+{
+#ifdef IPVS_EST_RESCHED_RCU
+	if (mutex_trylock(&kd->mutex)) {
+		/* Block removal during reschedule */
+		if (refcount_inc_not_zero(&e->refcnt)) {
+			cond_resched_rcu();
+			refcount_dec(&e->refcnt);
+		}
+		mutex_unlock(&kd->mutex);
+	}
+#endif
+}
+
+static inline void ip_vs_est_wait_resched(struct netns_ipvs *ipvs,
+					  struct ip_vs_estimator *est)
+{
+#ifdef IPVS_EST_RESCHED_RCU
+	/* Estimator kthread is rescheduling on deleted est? Wait it! */
+	if (!refcount_dec_and_test(&est->refcnt)) {
+		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[est->ktid];
+
+		mutex_lock(&kd->mutex);
+		mutex_unlock(&kd->mutex);
+	}
+#endif
+}
 
 /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
 int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 44c79fd1779c..e9f61eba3b8e 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -239,8 +239,49 @@ static void defense_work_handler(struct work_struct *work)
 	queue_delayed_work(system_long_wq, &ipvs->defense_work,
 			   DEFENSE_TIMER_PERIOD);
 }
+
 #endif
 
+static void est_reload_work_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs =
+		container_of(work, struct netns_ipvs, est_reload_work.work);
+	int genid = atomic_read(&ipvs->est_genid);
+	int genid_done = atomic_read(&ipvs->est_genid_done);
+	unsigned long delay = HZ / 10;	/* repeat startups after failure */
+	bool repeat = false;
+	int id;
+
+	mutex_lock(&ipvs->est_mutex);
+	for (id = 0; id < ipvs->est_kt_count; id++) {
+		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id];
+
+		/* netns clean up started, abort delayed work */
+		if (!ipvs->enable)
+			goto unlock;
+		if (!kd)
+			continue;
+		/* New config ? Stop kthread tasks */
+		if (genid != genid_done)
+			ip_vs_est_kthread_stop(kd);
+		if (!kd->task && ip_vs_est_kthread_start(ipvs, kd) < 0)
+			repeat = true;
+	}
+
+	atomic_set(&ipvs->est_genid_done, genid);
+
+unlock:
+	mutex_unlock(&ipvs->est_mutex);
+
+	if (!ipvs->enable)
+		return;
+	if (genid != atomic_read(&ipvs->est_genid))
+		delay = 1;
+	else if (!repeat)
+		return;
+	queue_delayed_work(system_long_wq, &ipvs->est_reload_work, delay);
+}
+
 int
 ip_vs_use_count_inc(void)
 {
@@ -1421,8 +1462,15 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 	ip_vs_svc_hash(svc);
 
 	*svc_p = svc;
-	/* Now there is a service - full throttle */
-	ipvs->enable = 1;
+
+	if (!ipvs->enable) {
+		/* Now there is a service - full throttle */
+		ipvs->enable = 1;
+
+		/* Start estimation for first time */
+		ip_vs_est_reload_start(ipvs, true);
+	}
+
 	return 0;
 
 
@@ -4178,6 +4226,8 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 	atomic_set(&ipvs->nullsvc_counter, 0);
 	atomic_set(&ipvs->conn_out_counter, 0);
 
+	INIT_DELAYED_WORK(&ipvs->est_reload_work, est_reload_work_handler);
+
 	/* procfs stats */
 	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
 	if (!ipvs->tot_stats)
@@ -4235,6 +4285,7 @@ void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
 {
 	ip_vs_trash_cleanup(ipvs);
 	ip_vs_control_net_cleanup_sysctl(ipvs);
+	cancel_delayed_work_sync(&ipvs->est_reload_work);
 #ifdef CONFIG_PROC_FS
 	remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 9a1a7af6a186..b2dd6f1c284a 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -30,9 +30,6 @@
   long interval, it is easy to implement a user level daemon which
   periodically reads those statistical counters and measure rate.
 
-  Currently, the measurement is activated by slow timer handler. Hope
-  this measurement will not introduce too much load.
-
   We measure rate during the last 8 seconds every 2 seconds:
 
     avgrate = avgrate*(1-W) + rate*W
@@ -47,68 +44,75 @@
     to 32-bit values for conns, packets, bps, cps and pps.
 
   * A lot of code is taken from net/core/gen_estimator.c
- */
-
 
-/*
- * Make a summary from each cpu
+  KEY POINTS:
+  - cpustats counters are updated per-cpu in SoftIRQ context with BH disabled
+  - kthreads read the cpustats to update the estimators (svcs, dests, total)
+  - the states of estimators can be read (get stats) or modified (zero stats)
+    from processes
+
+  KTHREADS:
+  - kthread contexts are created and attached to array
+  - the kthread tasks are created when first service is added, before that
+    the total stats are not estimated
+  - the kthread context holds lists with estimators (chains) which are
+    processed every 2 seconds
+  - as estimators can be added dynamically and in bursts, we try to spread
+    them to multiple chains which are estimated at different time
  */
-static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
-				 struct ip_vs_cpu_stats __percpu *stats)
-{
-	int i;
-	bool add = false;
 
-	for_each_possible_cpu(i) {
-		struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i);
-		unsigned int start;
-		u64 conns, inpkts, outpkts, inbytes, outbytes;
-
-		if (add) {
-			do {
-				start = u64_stats_fetch_begin(&s->syncp);
-				conns = s->cnt.conns;
-				inpkts = s->cnt.inpkts;
-				outpkts = s->cnt.outpkts;
-				inbytes = s->cnt.inbytes;
-				outbytes = s->cnt.outbytes;
-			} while (u64_stats_fetch_retry(&s->syncp, start));
-			sum->conns += conns;
-			sum->inpkts += inpkts;
-			sum->outpkts += outpkts;
-			sum->inbytes += inbytes;
-			sum->outbytes += outbytes;
-		} else {
-			add = true;
-			do {
-				start = u64_stats_fetch_begin(&s->syncp);
-				sum->conns = s->cnt.conns;
-				sum->inpkts = s->cnt.inpkts;
-				sum->outpkts = s->cnt.outpkts;
-				sum->inbytes = s->cnt.inbytes;
-				sum->outbytes = s->cnt.outbytes;
-			} while (u64_stats_fetch_retry(&s->syncp, start));
-		}
-	}
-}
+/* Optimal chain length used to spread bursts of newly added ests */
+#define IPVS_EST_BURST_LEN	BIT(6)
+/* Max number of ests per kthread (recommended) */
+#define IPVS_EST_MAX_COUNT	(32 * 1024)
 
+static struct lock_class_key __ipvs_est_key;
 
-static void estimation_timer(struct timer_list *t)
+static void ip_vs_estimation_chain(struct ip_vs_est_kt_data *kd, int row)
 {
+	struct hlist_head *chain = &kd->chains[row];
 	struct ip_vs_estimator *e;
+	struct ip_vs_cpu_stats *c;
 	struct ip_vs_stats *s;
 	u64 rate;
-	struct netns_ipvs *ipvs = from_timer(ipvs, t, est_timer);
 
-	if (!sysctl_run_estimation(ipvs))
-		goto skip;
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(e, chain, list) {
+		u64 conns, inpkts, outpkts, inbytes, outbytes;
+		u64 kconns = 0, kinpkts = 0, koutpkts = 0;
+		u64 kinbytes = 0, koutbytes = 0;
+		unsigned int start;
+		int i;
+
+		if (kthread_should_stop())
+			break;
+		ip_vs_est_cond_resched_rcu(kd, e);
 
-	spin_lock(&ipvs->est_lock);
-	list_for_each_entry(e, &ipvs->est_list, list) {
 		s = container_of(e, struct ip_vs_stats, est);
+		for_each_possible_cpu(i) {
+			c = per_cpu_ptr(s->cpustats, i);
+			do {
+				start = u64_stats_fetch_begin(&c->syncp);
+				conns = c->cnt.conns;
+				inpkts = c->cnt.inpkts;
+				outpkts = c->cnt.outpkts;
+				inbytes = c->cnt.inbytes;
+				outbytes = c->cnt.outbytes;
+			} while (u64_stats_fetch_retry(&c->syncp, start));
+			kconns += conns;
+			kinpkts += inpkts;
+			koutpkts += outpkts;
+			kinbytes += inbytes;
+			koutbytes += outbytes;
+		}
+
+		spin_lock_bh(&s->lock);
 
-		spin_lock(&s->lock);
-		ip_vs_read_cpu_stats(&s->kstats, s->cpustats);
+		s->kstats.conns = kconns;
+		s->kstats.inpkts = kinpkts;
+		s->kstats.outpkts = koutpkts;
+		s->kstats.inbytes = kinbytes;
+		s->kstats.outbytes = koutbytes;
 
 		/* scaled by 2^10, but divided 2 seconds */
 		rate = (s->kstats.conns - e->last_conns) << 9;
@@ -131,32 +135,288 @@ static void estimation_timer(struct timer_list *t)
 		rate = (s->kstats.outbytes - e->last_outbytes) << 4;
 		e->last_outbytes = s->kstats.outbytes;
 		e->outbps += ((s64)rate - (s64)e->outbps) >> 2;
-		spin_unlock(&s->lock);
+		spin_unlock_bh(&s->lock);
+	}
+	rcu_read_unlock();
+}
+
+static int ip_vs_estimation_kthread(void *data)
+{
+	struct ip_vs_est_kt_data *kd = data;
+	struct netns_ipvs *ipvs = kd->ipvs;
+	int row = kd->est_row;
+	unsigned long now;
+	long gap;
+
+	while (1) {
+		set_current_state(TASK_IDLE);
+		if (kthread_should_stop())
+			break;
+
+		/* before estimation, check if we should sleep */
+		now = READ_ONCE(jiffies);
+		gap = kd->est_timer - now;
+		if (gap > 0) {
+			if (gap > IPVS_EST_TICK) {
+				kd->est_timer = now - IPVS_EST_TICK;
+				gap = IPVS_EST_TICK;
+			}
+			schedule_timeout(gap);
+		} else {
+			__set_current_state(TASK_RUNNING);
+			if (gap < -8 * IPVS_EST_TICK)
+				kd->est_timer = now;
+		}
+
+		if (sysctl_run_estimation(ipvs) &&
+		    !hlist_empty(&kd->chains[row]))
+			ip_vs_estimation_chain(kd, row);
+
+		row++;
+		if (row >= IPVS_EST_NCHAINS)
+			row = 0;
+		kd->est_row = row;
+		/* add_row best to point after the just estimated row */
+		WRITE_ONCE(kd->add_row, row);
+		kd->est_timer += IPVS_EST_TICK;
+	}
+	__set_current_state(TASK_RUNNING);
+
+	return 0;
+}
+
+/* Stop (bump=true)/start kthread tasks */
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs, bool bump)
+{
+	/* Ignore reloads before first service is added */
+	if (!ipvs->enable)
+		return;
+	/* Bump the kthread configuration genid */
+	if (bump)
+		atomic_inc(&ipvs->est_genid);
+	queue_delayed_work(system_long_wq, &ipvs->est_reload_work,
+			   bump ? 0 : 1);
+}
+
+/* Start kthread task with current configuration */
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+			    struct ip_vs_est_kt_data *kd)
+{
+	unsigned long now;
+	int ret = 0;
+	long gap;
+
+	lockdep_assert_held(&ipvs->est_mutex);
+
+	if (kd->task)
+		goto out;
+	now = READ_ONCE(jiffies);
+	gap = kd->est_timer - now;
+	/* Sync est_timer if task is starting later */
+	if (abs(gap) > 4 * IPVS_EST_TICK)
+		kd->est_timer = now;
+	kd->task = kthread_create(ip_vs_estimation_kthread, kd, "ipvs-e:%d:%d",
+				  ipvs->gen, kd->id);
+	if (IS_ERR(kd->task)) {
+		ret = PTR_ERR(kd->task);
+		kd->task = NULL;
+		goto out;
 	}
-	spin_unlock(&ipvs->est_lock);
 
-skip:
-	mod_timer(&ipvs->est_timer, jiffies + 2*HZ);
+	pr_info("starting estimator thread %d...\n", kd->id);
+	wake_up_process(kd->task);
+
+out:
+	return ret;
+}
+
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd)
+{
+	if (kd->task) {
+		pr_info("stopping estimator thread %d...\n", kd->id);
+		kthread_stop(kd->task);
+		kd->task = NULL;
+	}
 }
 
+/* Create and start estimation kthread in a free or new array slot */
+static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
+{
+	struct ip_vs_est_kt_data *kd = NULL;
+	int id = ipvs->est_kt_count;
+	int err = -ENOMEM;
+	void *arr = NULL;
+	int i;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	for (i = 0; i < id; i++) {
+		if (!ipvs->est_kt_arr[i])
+			break;
+	}
+	if (i >= id) {
+		arr = krealloc_array(ipvs->est_kt_arr, id + 1,
+				     sizeof(struct ip_vs_est_kt_data *),
+				     GFP_KERNEL);
+		if (!arr)
+			goto out;
+		ipvs->est_kt_arr = arr;
+	} else {
+		id = i;
+	}
+	kd = kmalloc(sizeof(*kd), GFP_KERNEL);
+	if (!kd)
+		goto out;
+	kd->ipvs = ipvs;
+	mutex_init(&kd->mutex);
+	kd->id = id;
+	kd->est_count = 0;
+	kd->est_max_count = IPVS_EST_MAX_COUNT;
+	kd->add_row = 0;
+	kd->est_row = 0;
+	kd->est_timer = jiffies;
+	for (i = 0; i < ARRAY_SIZE(kd->chains); i++)
+		INIT_HLIST_HEAD(&kd->chains[i]);
+	memset(kd->chain_len, 0, sizeof(kd->chain_len));
+	kd->task = NULL;
+	/* Start kthread tasks only when services are present */
+	if (ipvs->enable) {
+		/* On failure, try to start the task again later */
+		if (ip_vs_est_kthread_start(ipvs, kd) < 0)
+			ip_vs_est_reload_start(ipvs, false);
+	}
+
+	if (arr)
+		ipvs->est_kt_count++;
+	ipvs->est_kt_arr[id] = kd;
+	/* Use most recent kthread for new ests */
+	ipvs->est_add_ktid = id;
+
+	mutex_unlock(&ipvs->est_mutex);
+
+	return 0;
+
+out:
+	mutex_unlock(&ipvs->est_mutex);
+	if (kd) {
+		mutex_destroy(&kd->mutex);
+		kfree(kd);
+	}
+	return err;
+}
+
+/* Add estimator to current kthread (est_add_ktid) */
 void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
 {
 	struct ip_vs_estimator *est = &stats->est;
+	struct ip_vs_est_kt_data *kd = NULL;
+	int ktid, row;
+
+	INIT_HLIST_NODE(&est->list);
+	ip_vs_est_init_resched_rcu(est);
+
+	if (ipvs->est_add_ktid < ipvs->est_kt_count) {
+		kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+		if (!kd)
+			goto add_kt;
+		if (kd->est_count < kd->est_max_count)
+			goto add_est;
+	}
 
-	INIT_LIST_HEAD(&est->list);
+add_kt:
+	/* Create new kthread but we can exceed est_max_count on failure */
+	if (ip_vs_est_add_kthread(ipvs) < 0) {
+		if (!kd || kd->est_count >= INT_MAX / 2)
+			goto out;
+	}
+	kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+	if (!kd)
+		goto out;
+
+add_est:
+	ktid = kd->id;
+	/* add_row points after the row we should use */
+	row = READ_ONCE(kd->add_row) - 1;
+	if (row < 0)
+		row = IPVS_EST_NCHAINS - 1;
+
+	kd->est_count++;
+	kd->chain_len[row]++;
+	/* Multiple ests added together? Fill chains one by one. */
+	if (!(kd->chain_len[row] & (IPVS_EST_BURST_LEN - 1)))
+		kd->add_row = row;
+	est->ktid = ktid;
+	est->ktrow = row;
+	hlist_add_head_rcu(&est->list, &kd->chains[row]);
+
+out:
+	;
+}
 
-	spin_lock_bh(&ipvs->est_lock);
-	list_add(&est->list, &ipvs->est_list);
-	spin_unlock_bh(&ipvs->est_lock);
+static void ip_vs_est_kthread_destroy(struct ip_vs_est_kt_data *kd)
+{
+	if (kd) {
+		if (kd->task)
+			kthread_stop(kd->task);
+		mutex_destroy(&kd->mutex);
+		kfree(kd);
+	}
 }
 
+/* Unlink estimator from list */
 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
 {
 	struct ip_vs_estimator *est = &stats->est;
+	struct ip_vs_est_kt_data *kd;
+	int ktid = est->ktid;
+
+	/* Failed to add to chain ? */
+	if (hlist_unhashed(&est->list))
+		goto out;
+
+	hlist_del_rcu(&est->list);
+	ip_vs_est_wait_resched(ipvs, est);
+
+	kd = ipvs->est_kt_arr[ktid];
+	kd->chain_len[est->ktrow]--;
+	kd->est_count--;
+	if (kd->est_count)
+		goto out;
+	pr_info("stop unused estimator thread %d...\n", ktid);
+
+	mutex_lock(&ipvs->est_mutex);
+
+	ip_vs_est_kthread_destroy(kd);
+	ipvs->est_kt_arr[ktid] = NULL;
+	if (ktid == ipvs->est_kt_count - 1)
+		ipvs->est_kt_count--;
+
+	mutex_unlock(&ipvs->est_mutex);
+
+	if (ktid == ipvs->est_add_ktid) {
+		int count = ipvs->est_kt_count;
+		int best = -1;
+
+		while (count-- > 0) {
+			if (!ipvs->est_add_ktid)
+				ipvs->est_add_ktid = ipvs->est_kt_count;
+			ipvs->est_add_ktid--;
+			kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+			if (!kd)
+				continue;
+			if (kd->est_count < kd->est_max_count) {
+				best = ipvs->est_add_ktid;
+				break;
+			}
+			if (best < 0)
+				best = ipvs->est_add_ktid;
+		}
+		if (best >= 0)
+			ipvs->est_add_ktid = best;
+	}
 
-	spin_lock_bh(&ipvs->est_lock);
-	list_del(&est->list);
-	spin_unlock_bh(&ipvs->est_lock);
+out:
+	;
 }
 
 void ip_vs_zero_estimator(struct ip_vs_stats *stats)
@@ -191,14 +451,21 @@ void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats)
 
 int __net_init ip_vs_estimator_net_init(struct netns_ipvs *ipvs)
 {
-	INIT_LIST_HEAD(&ipvs->est_list);
-	spin_lock_init(&ipvs->est_lock);
-	timer_setup(&ipvs->est_timer, estimation_timer, 0);
-	mod_timer(&ipvs->est_timer, jiffies + 2 * HZ);
+	ipvs->est_kt_arr = NULL;
+	ipvs->est_kt_count = 0;
+	ipvs->est_add_ktid = 0;
+	atomic_set(&ipvs->est_genid, 0);
+	atomic_set(&ipvs->est_genid_done, 0);
+	__mutex_init(&ipvs->est_mutex, "ipvs->est_mutex", &__ipvs_est_key);
 	return 0;
 }
 
 void __net_exit ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs)
 {
-	del_timer_sync(&ipvs->est_timer);
+	int i;
+
+	for (i = 0; i < ipvs->est_kt_count; i++)
+		ip_vs_est_kthread_destroy(ipvs->est_kt_arr[i]);
+	kfree(ipvs->est_kt_arr);
+	mutex_destroy(&ipvs->est_mutex);
 }
-- 
2.37.2




* [RFC PATCH 3/4] ipvs: add est_cpulist and est_nice sysctl vars
  2022-08-27 17:41 [RFC PATCH 0/4] Use kthreads for stats Julian Anastasov
  2022-08-27 17:41 ` [RFC PATCH 1/4] ipvs: add rcu protection to stats Julian Anastasov
  2022-08-27 17:41 ` [RFC PATCH 2/4] ipvs: use kthreads for stats estimation Julian Anastasov
@ 2022-08-27 17:41 ` Julian Anastasov
  2022-09-05 14:53   ` Jiri Wiesner
  2022-08-27 17:41 ` [RFC PATCH 4/4] ipvs: run_estimation should control the kthread tasks Julian Anastasov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Julian Anastasov @ 2022-08-27 17:41 UTC (permalink / raw)
  To: Jiri Wiesner
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

Allow the kthreads for stats estimation to be configured
with a specific cpulist (isolation) and niceness
(scheduling priority).

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 Documentation/networking/ipvs-sysctl.rst |  20 ++++
 include/net/ip_vs.h                      |  50 ++++++++
 net/netfilter/ipvs/ip_vs_ctl.c           | 141 ++++++++++++++++++++++-
 net/netfilter/ipvs/ip_vs_est.c           |   9 +-
 4 files changed, 218 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 387fda80f05f..90c7c325421a 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -129,6 +129,26 @@ drop_packet - INTEGER
 	threshold. When the mode 3 is set, the always mode drop rate
 	is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
 
+est_cpulist - CPULIST
+	Allowed	CPUs for estimation kthreads
+
+	Syntax: standard cpulist format
+	empty list - stop kthread tasks and estimation
+	default - the system's housekeeping CPUs for kthreads
+
+	Example:
+	"all": all possible CPUs
+	"0-N": all possible CPUs, N denotes last CPU number
+	"0,1-N:1/2": first and all CPUs with odd number
+	"": empty list
+
+est_nice - INTEGER
+	default 0
+	Valid range: -20 (more favorable) - 19 (less favorable)
+
+	Niceness value to use for the estimation kthreads (scheduling
+	priority)
+
 expire_nodest_conn - BOOLEAN
 	- 0 - disabled (default)
 	- not 0 - enabled
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 8171d845520c..7027eca6dab8 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -29,6 +29,7 @@
 #include <net/netfilter/nf_conntrack.h>
 #endif
 #include <net/net_namespace.h>		/* Netw namespace */
+#include <linux/sched/isolation.h>
 
 #define IP_VS_HDR_INVERSE	1
 #define IP_VS_HDR_ICMP		2
@@ -368,6 +369,9 @@ struct ip_vs_cpu_stats {
 #define IPVS_EST_RESCHED_RCU	1
 #endif
 
+/* Default nice for estimator kthreads */
+#define IPVS_EST_NICE		0
+
 /* IPVS statistics objects */
 struct ip_vs_estimator {
 	struct hlist_node	list;
@@ -968,6 +972,12 @@ struct netns_ipvs {
 	int			sysctl_schedule_icmp;
 	int			sysctl_ignore_tunneled;
 	int			sysctl_run_estimation;
+#ifdef CONFIG_SYSCTL
+	cpumask_var_t		sysctl_est_cpulist;	/* kthread cpumask */
+	int			est_cpulist_valid;	/* cpulist set */
+	int			sysctl_est_nice;	/* kthread nice */
+	int			est_stopped;		/* stop tasks */
+#endif
 
 	/* ip_vs_lblc */
 	int			sysctl_lblc_expiration;
@@ -1117,6 +1127,19 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_run_estimation;
 }
 
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+	if (ipvs->est_cpulist_valid)
+		return ipvs->sysctl_est_cpulist;
+	else
+		return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_est_nice;
+}
+
 #else
 
 static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -1214,6 +1237,16 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+	return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+	return IPVS_EST_NICE;
+}
+
 #endif
 
 /* IPVS core functions
@@ -1562,6 +1595,23 @@ static inline void ip_vs_est_wait_resched(struct netns_ipvs *ipvs,
 #endif
 }
 
+static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+	ipvs->est_stopped = ipvs->est_cpulist_valid &&
+			    cpumask_empty(sysctl_est_cpulist(ipvs));
+#endif
+}
+
+static inline bool ip_vs_est_stopped(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+	return ipvs->est_stopped;
+#else
+	return false;
+#endif
+}
+
 /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
 int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 		    struct ip_vs_protocol *pp, struct ip_vs_iphdr *iph);
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index e9f61eba3b8e..6279517104c6 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -264,7 +264,8 @@ static void est_reload_work_handler(struct work_struct *work)
 		/* New config ? Stop kthread tasks */
 		if (genid != genid_done)
 			ip_vs_est_kthread_stop(kd);
-		if (!kd->task && ip_vs_est_kthread_start(ipvs, kd) < 0)
+		if (!kd->task && !ip_vs_est_stopped(ipvs) &&
+		    ip_vs_est_kthread_start(ipvs, kd) < 0)
 			repeat = true;
 	}
 
@@ -1906,6 +1907,119 @@ proc_do_sync_ports(struct ctl_table *table, int write,
 	return rc;
 }
 
+static int ipvs_proc_est_cpumask_set(struct ctl_table *table, void *buffer)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	cpumask_var_t *valp = table->data;
+	cpumask_var_t newmask;
+	int ret;
+
+	if (!zalloc_cpumask_var(&newmask, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(buffer, newmask);
+	if (ret)
+		goto out;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	if (!ipvs->est_cpulist_valid) {
+		if (!zalloc_cpumask_var(valp, GFP_KERNEL)) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
+		ipvs->est_cpulist_valid = 1;
+	}
+	cpumask_and(newmask, newmask, cpu_possible_mask);
+	cpumask_copy(*valp, newmask);
+	ip_vs_est_reload_start(ipvs, true);
+
+unlock:
+	mutex_unlock(&ipvs->est_mutex);
+
+out:
+	free_cpumask_var(newmask);
+	return ret;
+}
+
+static int ipvs_proc_est_cpumask_get(struct ctl_table *table, void *buffer,
+				     size_t size)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	cpumask_var_t *valp = table->data;
+	struct cpumask *mask;
+	int ret;
+
+	mutex_lock(&ipvs->est_mutex);
+
+	if (ipvs->est_cpulist_valid)
+		mask = *valp;
+	else
+		mask = (struct cpumask *)housekeeping_cpumask(HK_TYPE_KTHREAD);
+	ret = scnprintf(buffer, size, "%*pbl\n", cpumask_pr_args(mask));
+
+	mutex_unlock(&ipvs->est_mutex);
+
+	return ret;
+}
+
+static int ipvs_proc_est_cpulist(struct ctl_table *table, int write,
+				 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+
+	/* Ignore both read and write(append) if *ppos not 0 */
+	if (*ppos || !*lenp) {
+		*lenp = 0;
+		return 0;
+	}
+	if (write) {
+		/* proc_sys_call_handler() appends terminator */
+		ret = ipvs_proc_est_cpumask_set(table, buffer);
+		if (ret >= 0)
+			*ppos += *lenp;
+	} else {
+		/* proc_sys_call_handler() allocates 1 byte for terminator */
+		ret = ipvs_proc_est_cpumask_get(table, buffer, *lenp + 1);
+		if (ret >= 0) {
+			*lenp = ret;
+			*ppos += *lenp;
+			ret = 0;
+		}
+	}
+	return ret;
+}
+
+static int ipvs_proc_est_nice(struct ctl_table *table, int write,
+			      void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	int *valp = table->data;
+	int val = *valp;
+	int ret;
+
+	struct ctl_table tmp_table = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+	if (write && ret >= 0) {
+		if (val < MIN_NICE || val > MAX_NICE) {
+			ret = -EINVAL;
+		} else {
+			mutex_lock(&ipvs->est_mutex);
+			if (*valp != val) {
+				*valp = val;
+				ip_vs_est_reload_start(ipvs, true);
+			}
+			mutex_unlock(&ipvs->est_mutex);
+		}
+	}
+	return ret;
+}
+
 /*
  *	IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
  *	Do not change order or insert new entries without
@@ -2082,6 +2196,18 @@ static struct ctl_table vs_vars[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "est_cpulist",
+		.maxlen		= NR_CPUS,	/* unused */
+		.mode		= 0644,
+		.proc_handler	= ipvs_proc_est_cpulist,
+	},
+	{
+		.procname	= "est_nice",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= ipvs_proc_est_nice,
+	},
 #ifdef CONFIG_IP_VS_DEBUG
 	{
 		.procname	= "debug_level",
@@ -4157,6 +4283,15 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
 	ipvs->sysctl_run_estimation = 1;
 	tbl[idx++].data = &ipvs->sysctl_run_estimation;
+
+	ipvs->est_cpulist_valid = 0;
+	tbl[idx].extra2 = ipvs;
+	tbl[idx++].data = &ipvs->sysctl_est_cpulist;
+
+	ipvs->sysctl_est_nice = IPVS_EST_NICE;
+	tbl[idx].extra2 = ipvs;
+	tbl[idx++].data = &ipvs->sysctl_est_nice;
+
 #ifdef CONFIG_IP_VS_DEBUG
 	/* Global sysctls must be ro in non-init netns */
 	if (!net_eq(net, &init_net))
@@ -4179,6 +4314,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
 			  expire_nodest_conn_handler);
 
+	ipvs->est_stopped = 0;
 	ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
 	return 0;
 }
@@ -4193,6 +4329,9 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
 	ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s);
 
+	if (ipvs->est_cpulist_valid)
+		free_cpumask_var(ipvs->sysctl_est_cpulist);
+
 	if (!net_eq(net, &init_net))
 		kfree(ipvs->sysctl_tbl);
 }
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index b2dd6f1c284a..0bbc6158339e 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -55,6 +55,9 @@
   - kthread contexts are created and attached to array
   - the kthread tasks are created when first service is added, before that
     the total stats are not estimated
+  - when configuration (cpulist/nice) is changed, the tasks are restarted
+    by work (est_reload_work)
+  - kthread tasks are stopped while the cpulist is empty
   - the kthread context holds lists with estimators (chains) which are
     processed every 2 seconds
   - as estimators can be added dynamically and in bursts, we try to spread
@@ -191,6 +194,7 @@ void ip_vs_est_reload_start(struct netns_ipvs *ipvs, bool bump)
 	/* Ignore reloads before first service is added */
 	if (!ipvs->enable)
 		return;
+	ip_vs_est_stopped_recalc(ipvs);
 	/* Bump the kthread configuration genid */
 	if (bump)
 		atomic_inc(&ipvs->est_genid);
@@ -223,6 +227,9 @@ int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
 		goto out;
 	}
 
+	set_user_nice(kd->task, sysctl_est_nice(ipvs));
+	set_cpus_allowed_ptr(kd->task, sysctl_est_cpulist(ipvs));
+
 	pr_info("starting estimator thread %d...\n", kd->id);
 	wake_up_process(kd->task);
 
@@ -280,7 +287,7 @@ static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
 	memset(kd->chain_len, 0, sizeof(kd->chain_len));
 	kd->task = NULL;
 	/* Start kthread tasks only when services are present */
-	if (ipvs->enable) {
+	if (ipvs->enable && !ip_vs_est_stopped(ipvs)) {
 		/* On failure, try to start the task again later */
 		if (ip_vs_est_kthread_start(ipvs, kd) < 0)
 			ip_vs_est_reload_start(ipvs, false);
-- 
2.37.2




* [RFC PATCH 4/4] ipvs: run_estimation should control the kthread tasks
  2022-08-27 17:41 [RFC PATCH 0/4] Use kthreads for stats Julian Anastasov
                   ` (2 preceding siblings ...)
  2022-08-27 17:41 ` [RFC PATCH 3/4] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
@ 2022-08-27 17:41 ` Julian Anastasov
  2022-09-05 14:57   ` Jiri Wiesner
  2022-09-05  6:34 ` [RFC PATCH 0/4] Use kthreads for stats dust.li
  2022-09-05  8:26 ` Jiri Wiesner
  5 siblings, 1 reply; 18+ messages in thread
From: Julian Anastasov @ 2022-08-27 17:41 UTC (permalink / raw)
  To: Jiri Wiesner
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

Change the run_estimation flag to start/stop the kthread tasks.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 Documentation/networking/ipvs-sysctl.rst |  4 ++--
 include/net/ip_vs.h                      |  6 +++--
 net/netfilter/ipvs/ip_vs_ctl.c           | 29 +++++++++++++++++++++++-
 net/netfilter/ipvs/ip_vs_est.c           |  4 +---
 4 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 90c7c325421a..eb33355aa625 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -324,8 +324,8 @@ run_estimation - BOOLEAN
 	0 - disabled
 	not 0 - enabled (default)
 
-	If disabled, the estimation will be stop, and you can't see
-	any update on speed estimation data.
+	If disabled, the estimation will be suspended and kthread tasks
+	stopped.
 
 	You can always re-enable estimation by setting this value to 1.
 	But be careful, the first estimation after re-enable is not
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 7027eca6dab8..428b885c2063 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1598,8 +1598,10 @@ static inline void ip_vs_est_wait_resched(struct netns_ipvs *ipvs,
 static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
 {
 #ifdef CONFIG_SYSCTL
-	ipvs->est_stopped = ipvs->est_cpulist_valid &&
-			    cpumask_empty(sysctl_est_cpulist(ipvs));
+	/* Stop tasks while cpulist is empty or if disabled with flag */
+	ipvs->est_stopped = !sysctl_run_estimation(ipvs) ||
+			    (ipvs->est_cpulist_valid &&
+			     cpumask_empty(sysctl_est_cpulist(ipvs)));
 #endif
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 6279517104c6..8d03eddfb19f 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2020,6 +2020,32 @@ static int ipvs_proc_est_nice(struct ctl_table *table, int write,
 	return ret;
 }
 
+static int ipvs_proc_run_estimation(struct ctl_table *table, int write,
+				    void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	int *valp = table->data;
+	int val = *valp;
+	int ret;
+
+	struct ctl_table tmp_table = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+	if (write && ret >= 0) {
+		mutex_lock(&ipvs->est_mutex);
+		if (*valp != val) {
+			*valp = val;
+			ip_vs_est_reload_start(ipvs, true);
+		}
+		mutex_unlock(&ipvs->est_mutex);
+	}
+	return ret;
+}
+
 /*
  *	IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
  *	Do not change order or insert new entries without
@@ -2194,7 +2220,7 @@ static struct ctl_table vs_vars[] = {
 		.procname	= "run_estimation",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= ipvs_proc_run_estimation,
 	},
 	{
 		.procname	= "est_cpulist",
@@ -4282,6 +4308,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
 	ipvs->sysctl_run_estimation = 1;
+	tbl[idx].extra2 = ipvs;
 	tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
 	ipvs->est_cpulist_valid = 0;
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 0bbc6158339e..0e52e64efac8 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -146,7 +146,6 @@ static void ip_vs_estimation_chain(struct ip_vs_est_kt_data *kd, int row)
 static int ip_vs_estimation_kthread(void *data)
 {
 	struct ip_vs_est_kt_data *kd = data;
-	struct netns_ipvs *ipvs = kd->ipvs;
 	int row = kd->est_row;
 	unsigned long now;
 	long gap;
@@ -171,8 +170,7 @@ static int ip_vs_estimation_kthread(void *data)
 				kd->est_timer = now;
 		}
 
-		if (sysctl_run_estimation(ipvs) &&
-		    !hlist_empty(&kd->chains[row]))
+		if (!hlist_empty(&kd->chains[row]))
 			ip_vs_estimation_chain(kd, row);
 
 		row++;
-- 
2.37.2




* Re: [RFC PATCH 0/4] Use kthreads for stats
  2022-08-27 17:41 [RFC PATCH 0/4] Use kthreads for stats Julian Anastasov
                   ` (3 preceding siblings ...)
  2022-08-27 17:41 ` [RFC PATCH 4/4] ipvs: run_estimation should control the kthread tasks Julian Anastasov
@ 2022-09-05  6:34 ` dust.li
  2022-09-05  8:26 ` Jiri Wiesner
  5 siblings, 0 replies; 18+ messages in thread
From: dust.li @ 2022-09-05  6:34 UTC (permalink / raw)
  To: Julian Anastasov, Jiri Wiesner
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, tangyang

On Sat, Aug 27, 2022 at 08:41:50PM +0300, Julian Anastasov wrote:
>	Hello,
>
>	This patchset implements stats estimation in
>kthread context. Simple tests do not show any problem.
>Please review, comment, test, etc.

Hi, Julian:

Thanks a lot for your work! I tested the patchset and, so far, it all
works well.

On my test server with 64 CPUs and 1 million rules, the total
CPU cost of all ipvs kthreads is about 67% of 1 CPU (31 ipvs threads).
No ping slowdown detected.

Tested-by: Dust Li <dust.li@linux.alibaba.com>



* Re: [RFC PATCH 2/4] ipvs: use kthreads for stats estimation
  2022-08-27 17:41 ` [RFC PATCH 2/4] ipvs: use kthreads for stats estimation Julian Anastasov
@ 2022-09-05  6:47   ` dust.li
  2022-09-07 18:07     ` Julian Anastasov
  2022-09-05 13:19   ` Jiri Wiesner
  1 sibling, 1 reply; 18+ messages in thread
From: dust.li @ 2022-09-05  6:47 UTC (permalink / raw)
  To: Julian Anastasov, Jiri Wiesner
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, tangyang

On Sat, Aug 27, 2022 at 08:41:52PM +0300, Julian Anastasov wrote:
>Estimating all entries in single list in timer context
>causes large latency with multiple rules.
>
>Spread the estimator structures in multiple chains and
>use kthread(s) for the estimation. Every chain is
>processed under RCU lock. If RCU preemption is not
>enabled, we add code for rescheduling by delaying
>the removal of the currently estimated entry.
>
>We also add delayed work est_reload_work that will
>make sure the kthread tasks are properly started.
>
>Signed-off-by: Julian Anastasov <ja@ssi.bg>
>---
> include/net/ip_vs.h            |  84 ++++++-
> net/netfilter/ipvs/ip_vs_ctl.c |  55 ++++-
> net/netfilter/ipvs/ip_vs_est.c | 403 +++++++++++++++++++++++++++------
> 3 files changed, 468 insertions(+), 74 deletions(-)
>
>diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
>index bd8ae137e43b..8171d845520c 100644
>--- a/include/net/ip_vs.h
>+++ b/include/net/ip_vs.h
>@@ -363,9 +363,14 @@ struct ip_vs_cpu_stats {
> 	struct u64_stats_sync   syncp;
> };
> 
>+/* resched during estimation, the defines should match cond_resched_rcu */
>+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
>+#define IPVS_EST_RESCHED_RCU	1
>+#endif
>+
> /* IPVS statistics objects */
> struct ip_vs_estimator {
>-	struct list_head	list;
>+	struct hlist_node	list;
> 
> 	u64			last_inbytes;
> 	u64			last_outbytes;
>@@ -378,6 +383,31 @@ struct ip_vs_estimator {
> 	u64			outpps;
> 	u64			inbps;
> 	u64			outbps;
>+
>+#ifdef IPVS_EST_RESCHED_RCU
>+	refcount_t		refcnt;
>+#endif
>+	u32			ktid:16,	/* kthread ID */
>+				ktrow:16;	/* row ID for kthread */
>+};
>+
>+/* Spread estimator states in multiple chains */
>+#define IPVS_EST_NCHAINS	50
>+#define IPVS_EST_TICK		((2 * HZ) / IPVS_EST_NCHAINS)
>+
>+/* Context for estimation kthread */
>+struct ip_vs_est_kt_data {
>+	struct netns_ipvs	*ipvs;
>+	struct task_struct	*task;		/* task if running */
>+	struct mutex		mutex;		/* held during resched */
>+	int			id;		/* ktid per netns */
>+	int			est_count;	/* attached ests to kthread */
>+	int			est_max_count;	/* max ests per kthread */
>+	int			add_row;	/* row for new ests */
>+	int			est_row;	/* estimated row */
>+	unsigned long		est_timer;	/* estimation timer (jiffies) */
>+	struct hlist_head	chains[IPVS_EST_NCHAINS];
>+	int			chain_len[IPVS_EST_NCHAINS];
> };
> 
> /*
>@@ -948,9 +978,13 @@ struct netns_ipvs {
> 	struct ctl_table_header	*lblcr_ctl_header;
> 	struct ctl_table	*lblcr_ctl_table;
> 	/* ip_vs_est */
>-	struct list_head	est_list;	/* estimator list */
>-	spinlock_t		est_lock;
>-	struct timer_list	est_timer;	/* Estimation timer */
>+	struct delayed_work	est_reload_work;/* Reload kthread tasks */
>+	struct mutex		est_mutex;	/* protect kthread tasks */
>+	struct ip_vs_est_kt_data **est_kt_arr;	/* Array of kthread data ptrs */
>+	int			est_kt_count;	/* Allocated ptrs */
>+	int			est_add_ktid;	/* ktid where to add ests */
>+	atomic_t		est_genid;	/* kthreads reload genid */
>+	atomic_t		est_genid_done;	/* applied genid */
> 	/* ip_vs_sync */
> 	spinlock_t		sync_lock;
> 	struct ipvs_master_sync_state *ms;
>@@ -1485,6 +1519,48 @@ void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
> void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
> void ip_vs_zero_estimator(struct ip_vs_stats *stats);
> void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
>+void ip_vs_est_reload_start(struct netns_ipvs *ipvs, bool bump);
>+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
>+			    struct ip_vs_est_kt_data *kd);
>+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
>+
>+extern struct mutex ip_vs_est_mutex;
>+
>+static inline void ip_vs_est_init_resched_rcu(struct ip_vs_estimator *e)
>+{
>+#ifdef IPVS_EST_RESCHED_RCU
>+	refcount_set(&e->refcnt, 1);
>+#endif
>+}
>+
>+static inline void ip_vs_est_cond_resched_rcu(struct ip_vs_est_kt_data *kd,
>+					      struct ip_vs_estimator *e)
>+{
>+#ifdef IPVS_EST_RESCHED_RCU
>+	if (mutex_trylock(&kd->mutex)) {
>+		/* Block removal during reschedule */
>+		if (refcount_inc_not_zero(&e->refcnt)) {
>+			cond_resched_rcu();
>+			refcount_dec(&e->refcnt);
>+		}
>+		mutex_unlock(&kd->mutex);
>+	}
>+#endif
>+}
>+
>+static inline void ip_vs_est_wait_resched(struct netns_ipvs *ipvs,
>+					  struct ip_vs_estimator *est)
>+{
>+#ifdef IPVS_EST_RESCHED_RCU
>+	/* Estimator kthread is rescheduling on deleted est? Wait it! */
>+	if (!refcount_dec_and_test(&est->refcnt)) {
>+		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[est->ktid];
>+
>+		mutex_lock(&kd->mutex);
>+		mutex_unlock(&kd->mutex);

IIUC, this mutex_lock/unlock() is just used to wait for the ipvs-e
kthread to be scheduled back if it had been scheduled out in
cond_resched_rcu(), and not to protect any data?

If so, I am wondering whether we can remove the mutex_trylock/unlock()
in ip_vs_est_cond_resched_rcu() and use some wait/wakeup mechanism
instead? When I run perf on the 'ipvs-e' kthreads, I see a lot of CPU
cycles spent in mutex_trylock().

Thanks


>+	}
>+#endif
>+}
> 
> /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
> int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
>diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
>index 44c79fd1779c..e9f61eba3b8e 100644
>--- a/net/netfilter/ipvs/ip_vs_ctl.c
>+++ b/net/netfilter/ipvs/ip_vs_ctl.c
>@@ -239,8 +239,49 @@ static void defense_work_handler(struct work_struct *work)
> 	queue_delayed_work(system_long_wq, &ipvs->defense_work,
> 			   DEFENSE_TIMER_PERIOD);
> }
>+
> #endif
> 
>+static void est_reload_work_handler(struct work_struct *work)
>+{
>+	struct netns_ipvs *ipvs =
>+		container_of(work, struct netns_ipvs, est_reload_work.work);
>+	int genid = atomic_read(&ipvs->est_genid);
>+	int genid_done = atomic_read(&ipvs->est_genid_done);
>+	unsigned long delay = HZ / 10;	/* repeat startups after failure */
>+	bool repeat = false;
>+	int id;
>+
>+	mutex_lock(&ipvs->est_mutex);
>+	for (id = 0; id < ipvs->est_kt_count; id++) {
>+		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id];
>+
>+		/* netns clean up started, abort delayed work */
>+		if (!ipvs->enable)
>+			goto unlock;
>+		if (!kd)
>+			continue;
>+		/* New config ? Stop kthread tasks */
>+		if (genid != genid_done)
>+			ip_vs_est_kthread_stop(kd);
>+		if (!kd->task && ip_vs_est_kthread_start(ipvs, kd) < 0)
>+			repeat = true;
>+	}
>+
>+	atomic_set(&ipvs->est_genid_done, genid);
>+
>+unlock:
>+	mutex_unlock(&ipvs->est_mutex);
>+
>+	if (!ipvs->enable)
>+		return;
>+	if (genid != atomic_read(&ipvs->est_genid))
>+		delay = 1;
>+	else if (!repeat)
>+		return;
>+	queue_delayed_work(system_long_wq, &ipvs->est_reload_work, delay);
>+}
>+
> int
> ip_vs_use_count_inc(void)
> {
>@@ -1421,8 +1462,15 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
> 	ip_vs_svc_hash(svc);
> 
> 	*svc_p = svc;
>-	/* Now there is a service - full throttle */
>-	ipvs->enable = 1;
>+
>+	if (!ipvs->enable) {
>+		/* Now there is a service - full throttle */
>+		ipvs->enable = 1;
>+
>+		/* Start estimation for first time */
>+		ip_vs_est_reload_start(ipvs, true);
>+	}
>+
> 	return 0;
> 
> 
>@@ -4178,6 +4226,8 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
> 	atomic_set(&ipvs->nullsvc_counter, 0);
> 	atomic_set(&ipvs->conn_out_counter, 0);
> 
>+	INIT_DELAYED_WORK(&ipvs->est_reload_work, est_reload_work_handler);
>+
> 	/* procfs stats */
> 	ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
> 	if (!ipvs->tot_stats)
>@@ -4235,6 +4285,7 @@ void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
> {
> 	ip_vs_trash_cleanup(ipvs);
> 	ip_vs_control_net_cleanup_sysctl(ipvs);
>+	cancel_delayed_work_sync(&ipvs->est_reload_work);
> #ifdef CONFIG_PROC_FS
> 	remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
> 	remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
>diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
>index 9a1a7af6a186..b2dd6f1c284a 100644
>--- a/net/netfilter/ipvs/ip_vs_est.c
>+++ b/net/netfilter/ipvs/ip_vs_est.c
>@@ -30,9 +30,6 @@
>   long interval, it is easy to implement a user level daemon which
>   periodically reads those statistical counters and measure rate.
> 
>-  Currently, the measurement is activated by slow timer handler. Hope
>-  this measurement will not introduce too much load.
>-
>   We measure rate during the last 8 seconds every 2 seconds:
> 
>     avgrate = avgrate*(1-W) + rate*W
>@@ -47,68 +44,75 @@
>     to 32-bit values for conns, packets, bps, cps and pps.
> 
>   * A lot of code is taken from net/core/gen_estimator.c
>- */
>-
> 
>-/*
>- * Make a summary from each cpu
>+  KEY POINTS:
>+  - cpustats counters are updated per-cpu in SoftIRQ context with BH disabled
>+  - kthreads read the cpustats to update the estimators (svcs, dests, total)
>+  - the states of estimators can be read (get stats) or modified (zero stats)
>+    from processes
>+
>+  KTHREADS:
>+  - kthread contexts are created and attached to array
>+  - the kthread tasks are created when first service is added, before that
>+    the total stats are not estimated
>+  - the kthread context holds lists with estimators (chains) which are
>+    processed every 2 seconds
>+  - as estimators can be added dynamically and in bursts, we try to spread
>+    them to multiple chains which are estimated at different time
>  */
>-static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
>-				 struct ip_vs_cpu_stats __percpu *stats)
>-{
>-	int i;
>-	bool add = false;
> 
>-	for_each_possible_cpu(i) {
>-		struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i);
>-		unsigned int start;
>-		u64 conns, inpkts, outpkts, inbytes, outbytes;
>-
>-		if (add) {
>-			do {
>-				start = u64_stats_fetch_begin(&s->syncp);
>-				conns = s->cnt.conns;
>-				inpkts = s->cnt.inpkts;
>-				outpkts = s->cnt.outpkts;
>-				inbytes = s->cnt.inbytes;
>-				outbytes = s->cnt.outbytes;
>-			} while (u64_stats_fetch_retry(&s->syncp, start));
>-			sum->conns += conns;
>-			sum->inpkts += inpkts;
>-			sum->outpkts += outpkts;
>-			sum->inbytes += inbytes;
>-			sum->outbytes += outbytes;
>-		} else {
>-			add = true;
>-			do {
>-				start = u64_stats_fetch_begin(&s->syncp);
>-				sum->conns = s->cnt.conns;
>-				sum->inpkts = s->cnt.inpkts;
>-				sum->outpkts = s->cnt.outpkts;
>-				sum->inbytes = s->cnt.inbytes;
>-				sum->outbytes = s->cnt.outbytes;
>-			} while (u64_stats_fetch_retry(&s->syncp, start));
>-		}
>-	}
>-}
>+/* Optimal chain length used to spread bursts of newly added ests */
>+#define IPVS_EST_BURST_LEN	BIT(6)
>+/* Max number of ests per kthread (recommended) */
>+#define IPVS_EST_MAX_COUNT	(32 * 1024)
> 
>+static struct lock_class_key __ipvs_est_key;
> 
>-static void estimation_timer(struct timer_list *t)
>+static void ip_vs_estimation_chain(struct ip_vs_est_kt_data *kd, int row)
> {
>+	struct hlist_head *chain = &kd->chains[row];
> 	struct ip_vs_estimator *e;
>+	struct ip_vs_cpu_stats *c;
> 	struct ip_vs_stats *s;
> 	u64 rate;
>-	struct netns_ipvs *ipvs = from_timer(ipvs, t, est_timer);
> 
>-	if (!sysctl_run_estimation(ipvs))
>-		goto skip;
>+	rcu_read_lock();
>+	hlist_for_each_entry_rcu(e, chain, list) {
>+		u64 conns, inpkts, outpkts, inbytes, outbytes;
>+		u64 kconns = 0, kinpkts = 0, koutpkts = 0;
>+		u64 kinbytes = 0, koutbytes = 0;
>+		unsigned int start;
>+		int i;
>+
>+		if (kthread_should_stop())
>+			break;
>+		ip_vs_est_cond_resched_rcu(kd, e);
> 
>-	spin_lock(&ipvs->est_lock);
>-	list_for_each_entry(e, &ipvs->est_list, list) {
> 		s = container_of(e, struct ip_vs_stats, est);
>+		for_each_possible_cpu(i) {
>+			c = per_cpu_ptr(s->cpustats, i);
>+			do {
>+				start = u64_stats_fetch_begin(&c->syncp);
>+				conns = c->cnt.conns;
>+				inpkts = c->cnt.inpkts;
>+				outpkts = c->cnt.outpkts;
>+				inbytes = c->cnt.inbytes;
>+				outbytes = c->cnt.outbytes;
>+			} while (u64_stats_fetch_retry(&c->syncp, start));
>+			kconns += conns;
>+			kinpkts += inpkts;
>+			koutpkts += outpkts;
>+			kinbytes += inbytes;
>+			koutbytes += outbytes;
>+		}
>+
>+		spin_lock_bh(&s->lock);
> 
>-		spin_lock(&s->lock);
>-		ip_vs_read_cpu_stats(&s->kstats, s->cpustats);
>+		s->kstats.conns = kconns;
>+		s->kstats.inpkts = kinpkts;
>+		s->kstats.outpkts = koutpkts;
>+		s->kstats.inbytes = kinbytes;
>+		s->kstats.outbytes = koutbytes;
> 
> 		/* scaled by 2^10, but divided 2 seconds */
> 		rate = (s->kstats.conns - e->last_conns) << 9;
>@@ -131,32 +135,288 @@ static void estimation_timer(struct timer_list *t)
> 		rate = (s->kstats.outbytes - e->last_outbytes) << 4;
> 		e->last_outbytes = s->kstats.outbytes;
> 		e->outbps += ((s64)rate - (s64)e->outbps) >> 2;
>-		spin_unlock(&s->lock);
>+		spin_unlock_bh(&s->lock);
>+	}
>+	rcu_read_unlock();
>+}
>+
>+static int ip_vs_estimation_kthread(void *data)
>+{
>+	struct ip_vs_est_kt_data *kd = data;
>+	struct netns_ipvs *ipvs = kd->ipvs;
>+	int row = kd->est_row;
>+	unsigned long now;
>+	long gap;
>+
>+	while (1) {
>+		set_current_state(TASK_IDLE);
>+		if (kthread_should_stop())
>+			break;
>+
>+		/* before estimation, check if we should sleep */
>+		now = READ_ONCE(jiffies);
>+		gap = kd->est_timer - now;
>+		if (gap > 0) {
>+			if (gap > IPVS_EST_TICK) {
>+				kd->est_timer = now - IPVS_EST_TICK;
>+				gap = IPVS_EST_TICK;
>+			}
>+			schedule_timeout(gap);
>+		} else {
>+			__set_current_state(TASK_RUNNING);
>+			if (gap < -8 * IPVS_EST_TICK)
>+				kd->est_timer = now;
>+		}
>+
>+		if (sysctl_run_estimation(ipvs) &&
>+		    !hlist_empty(&kd->chains[row]))
>+			ip_vs_estimation_chain(kd, row);
>+
>+		row++;
>+		if (row >= IPVS_EST_NCHAINS)
>+			row = 0;
>+		kd->est_row = row;
>+		/* add_row best to point after the just estimated row */
>+		WRITE_ONCE(kd->add_row, row);
>+		kd->est_timer += IPVS_EST_TICK;
>+	}
>+	__set_current_state(TASK_RUNNING);
>+
>+	return 0;
>+}
>+
>+/* Stop (bump=true)/start kthread tasks */
>+void ip_vs_est_reload_start(struct netns_ipvs *ipvs, bool bump)
>+{
>+	/* Ignore reloads before first service is added */
>+	if (!ipvs->enable)
>+		return;
>+	/* Bump the kthread configuration genid */
>+	if (bump)
>+		atomic_inc(&ipvs->est_genid);
>+	queue_delayed_work(system_long_wq, &ipvs->est_reload_work,
>+			   bump ? 0 : 1);
>+}
>+
>+/* Start kthread task with current configuration */
>+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
>+			    struct ip_vs_est_kt_data *kd)
>+{
>+	unsigned long now;
>+	int ret = 0;
>+	long gap;
>+
>+	lockdep_assert_held(&ipvs->est_mutex);
>+
>+	if (kd->task)
>+		goto out;
>+	now = READ_ONCE(jiffies);
>+	gap = kd->est_timer - now;
>+	/* Sync est_timer if task is starting later */
>+	if (abs(gap) > 4 * IPVS_EST_TICK)
>+		kd->est_timer = now;
>+	kd->task = kthread_create(ip_vs_estimation_kthread, kd, "ipvs-e:%d:%d",
>+				  ipvs->gen, kd->id);
>+	if (IS_ERR(kd->task)) {
>+		ret = PTR_ERR(kd->task);
>+		kd->task = NULL;
>+		goto out;
> 	}
>-	spin_unlock(&ipvs->est_lock);
> 
>-skip:
>-	mod_timer(&ipvs->est_timer, jiffies + 2*HZ);
>+	pr_info("starting estimator thread %d...\n", kd->id);
>+	wake_up_process(kd->task);
>+
>+out:
>+	return ret;
>+}
>+
>+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd)
>+{
>+	if (kd->task) {
>+		pr_info("stopping estimator thread %d...\n", kd->id);
>+		kthread_stop(kd->task);
>+		kd->task = NULL;
>+	}
> }
> 
>+/* Create and start estimation kthread in a free or new array slot */
>+static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
>+{
>+	struct ip_vs_est_kt_data *kd = NULL;
>+	int id = ipvs->est_kt_count;
>+	int err = -ENOMEM;
>+	void *arr = NULL;
>+	int i;
>+
>+	mutex_lock(&ipvs->est_mutex);
>+
>+	for (i = 0; i < id; i++) {
>+		if (!ipvs->est_kt_arr[i])
>+			break;
>+	}
>+	if (i >= id) {
>+		arr = krealloc_array(ipvs->est_kt_arr, id + 1,
>+				     sizeof(struct ip_vs_est_kt_data *),
>+				     GFP_KERNEL);
>+		if (!arr)
>+			goto out;
>+		ipvs->est_kt_arr = arr;
>+	} else {
>+		id = i;
>+	}
>+	kd = kmalloc(sizeof(*kd), GFP_KERNEL);
>+	if (!kd)
>+		goto out;
>+	kd->ipvs = ipvs;
>+	mutex_init(&kd->mutex);
>+	kd->id = id;
>+	kd->est_count = 0;
>+	kd->est_max_count = IPVS_EST_MAX_COUNT;
>+	kd->add_row = 0;
>+	kd->est_row = 0;
>+	kd->est_timer = jiffies;
>+	for (i = 0; i < ARRAY_SIZE(kd->chains); i++)
>+		INIT_HLIST_HEAD(&kd->chains[i]);
>+	memset(kd->chain_len, 0, sizeof(kd->chain_len));
>+	kd->task = NULL;
>+	/* Start kthread tasks only when services are present */
>+	if (ipvs->enable) {
>+		/* On failure, try to start the task again later */
>+		if (ip_vs_est_kthread_start(ipvs, kd) < 0)
>+			ip_vs_est_reload_start(ipvs, false);
>+	}
>+
>+	if (arr)
>+		ipvs->est_kt_count++;
>+	ipvs->est_kt_arr[id] = kd;
>+	/* Use most recent kthread for new ests */
>+	ipvs->est_add_ktid = id;
>+
>+	mutex_unlock(&ipvs->est_mutex);
>+
>+	return 0;
>+
>+out:
>+	mutex_unlock(&ipvs->est_mutex);
>+	if (kd) {
>+		mutex_destroy(&kd->mutex);
>+		kfree(kd);
>+	}
>+	return err;
>+}
>+
>+/* Add estimator to current kthread (est_add_ktid) */
> void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
> {
> 	struct ip_vs_estimator *est = &stats->est;
>+	struct ip_vs_est_kt_data *kd = NULL;
>+	int ktid, row;
>+
>+	INIT_HLIST_NODE(&est->list);
>+	ip_vs_est_init_resched_rcu(est);
>+
>+	if (ipvs->est_add_ktid < ipvs->est_kt_count) {
>+		kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
>+		if (!kd)
>+			goto add_kt;
>+		if (kd->est_count < kd->est_max_count)
>+			goto add_est;
>+	}
> 
>-	INIT_LIST_HEAD(&est->list);
>+add_kt:
>+	/* Create new kthread but we can exceed est_max_count on failure */
>+	if (ip_vs_est_add_kthread(ipvs) < 0) {
>+		if (!kd || kd->est_count >= INT_MAX / 2)
>+			goto out;
>+	}
>+	kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
>+	if (!kd)
>+		goto out;
>+
>+add_est:
>+	ktid = kd->id;
>+	/* add_row points after the row we should use */
>+	row = READ_ONCE(kd->add_row) - 1;
>+	if (row < 0)
>+		row = IPVS_EST_NCHAINS - 1;
>+
>+	kd->est_count++;
>+	kd->chain_len[row]++;
>+	/* Multiple ests added together? Fill chains one by one. */
>+	if (!(kd->chain_len[row] & (IPVS_EST_BURST_LEN - 1)))
>+		kd->add_row = row;
>+	est->ktid = ktid;
>+	est->ktrow = row;
>+	hlist_add_head_rcu(&est->list, &kd->chains[row]);
>+
>+out:
>+	;
>+}
> 
>-	spin_lock_bh(&ipvs->est_lock);
>-	list_add(&est->list, &ipvs->est_list);
>-	spin_unlock_bh(&ipvs->est_lock);
>+static void ip_vs_est_kthread_destroy(struct ip_vs_est_kt_data *kd)
>+{
>+	if (kd) {
>+		if (kd->task)
>+			kthread_stop(kd->task);
>+		mutex_destroy(&kd->mutex);
>+		kfree(kd);
>+	}
> }
> 
>+/* Unlink estimator from list */
> void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
> {
> 	struct ip_vs_estimator *est = &stats->est;
>+	struct ip_vs_est_kt_data *kd;
>+	int ktid = est->ktid;
>+
>+	/* Failed to add to chain ? */
>+	if (hlist_unhashed(&est->list))
>+		goto out;
>+
>+	hlist_del_rcu(&est->list);
>+	ip_vs_est_wait_resched(ipvs, est);
>+
>+	kd = ipvs->est_kt_arr[ktid];
>+	kd->chain_len[est->ktrow]--;
>+	kd->est_count--;
>+	if (kd->est_count)
>+		goto out;
>+	pr_info("stop unused estimator thread %d...\n", ktid);
>+
>+	mutex_lock(&ipvs->est_mutex);
>+
>+	ip_vs_est_kthread_destroy(kd);
>+	ipvs->est_kt_arr[ktid] = NULL;
>+	if (ktid == ipvs->est_kt_count - 1)
>+		ipvs->est_kt_count--;
>+
>+	mutex_unlock(&ipvs->est_mutex);
>+
>+	if (ktid == ipvs->est_add_ktid) {
>+		int count = ipvs->est_kt_count;
>+		int best = -1;
>+
>+		while (count-- > 0) {
>+			if (!ipvs->est_add_ktid)
>+				ipvs->est_add_ktid = ipvs->est_kt_count;
>+			ipvs->est_add_ktid--;
>+			kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
>+			if (!kd)
>+				continue;
>+			if (kd->est_count < kd->est_max_count) {
>+				best = ipvs->est_add_ktid;
>+				break;
>+			}
>+			if (best < 0)
>+				best = ipvs->est_add_ktid;
>+		}
>+		if (best >= 0)
>+			ipvs->est_add_ktid = best;
>+	}
> 
>-	spin_lock_bh(&ipvs->est_lock);
>-	list_del(&est->list);
>-	spin_unlock_bh(&ipvs->est_lock);
>+out:
>+	;
> }
> 
> void ip_vs_zero_estimator(struct ip_vs_stats *stats)
>@@ -191,14 +451,21 @@ void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats)
> 
> int __net_init ip_vs_estimator_net_init(struct netns_ipvs *ipvs)
> {
>-	INIT_LIST_HEAD(&ipvs->est_list);
>-	spin_lock_init(&ipvs->est_lock);
>-	timer_setup(&ipvs->est_timer, estimation_timer, 0);
>-	mod_timer(&ipvs->est_timer, jiffies + 2 * HZ);
>+	ipvs->est_kt_arr = NULL;
>+	ipvs->est_kt_count = 0;
>+	ipvs->est_add_ktid = 0;
>+	atomic_set(&ipvs->est_genid, 0);
>+	atomic_set(&ipvs->est_genid_done, 0);
>+	__mutex_init(&ipvs->est_mutex, "ipvs->est_mutex", &__ipvs_est_key);
> 	return 0;
> }
> 
> void __net_exit ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs)
> {
>-	del_timer_sync(&ipvs->est_timer);
>+	int i;
>+
>+	for (i = 0; i < ipvs->est_kt_count; i++)
>+		ip_vs_est_kthread_destroy(ipvs->est_kt_arr[i]);
>+	kfree(ipvs->est_kt_arr);
>+	mutex_destroy(&ipvs->est_mutex);
> }
>-- 
>2.37.2

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] Use kthreads for stats
  2022-08-27 17:41 [RFC PATCH 0/4] Use kthreads for stats Julian Anastasov
                   ` (4 preceding siblings ...)
  2022-09-05  6:34 ` [RFC PATCH 0/4] Use kthreads for stats dust.li
@ 2022-09-05  8:26 ` Jiri Wiesner
  2022-09-07 18:33   ` Julian Anastasov
  5 siblings, 1 reply; 18+ messages in thread
From: Jiri Wiesner @ 2022-09-05  8:26 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

On Sat, Aug 27, 2022 at 08:41:50PM +0300, Julian Anastasov wrote:
> 	This patchset implements stats estimation in
> kthread context. Simple tests do not show any problem.
> Please review, comment, test, etc.

Thank you for this, Julian.

> RCU Locking:
> 
> - when RCU preemption is enabled the kthreads use just RCU
> lock for walking the chains and we do not need to reschedule.
> May be this is the common case for distribution kernels.
> In this case ip_vs_stop_estimator() is completely lockless.

Yes, it is the case for SUSE since SLES 15.4, which is a recent release.

> - when RCU preemption is not enabled, we reschedule by using
> refcnt for every estimator to track if the currently removed
> estimator is used at the same time by kthread for estimation.
> As RCU lock is unlocked during rescheduling, the deletion
> should wait kd->mutex, so that a new RCU lock is applied
> before the estimator is freed with RCU callback.

I believe allowing the kthread the possibility to block in each iteration - for each estimator - introduces some problems:
1. Non-preemptive kernels should be optimized for throughput not for latency
Using the figure reported earlier (50,000 services required 200 ms of processing time), it takes roughly 3 ms to process one chain (32 * 1024 / 50, i.e. about 655 services). The processing time per chain will vary with the number of NUMA nodes and CPUs. Nevertheless, this number is comparable to the processing time limit of __do_softirq() - 2 ms, which gets converted to 1 jiffy. In terms of latency on non-preemptive kernels, it is entirely reasonable to let one chain be processed without rescheduling the kthread.
2. The priority of the kthreads could be set to lower values than the default priority for SCHED_OTHER. If a user space process was trying to stop an estimator whose pointer is held by a currently sleeping kthread, this would constitute priority inversion. The kd->mutex would not be released until the lower priority thread, the kthread, has started running again. AFAIK, the mutex locking primitive does not implement priority inheritance on non-preemptive kernels.
3. Blocking while in ip_vs_estimation_chain() will result in wrong estimates for the remaining estimators in a chain. The 2 second interval would not be kept; rate estimates would be overestimated in that interval and underestimated in one of the future intervals.
In my opinion, any of the above reasons is sufficient to remove ip_vs_est_cond_resched_rcu(), ip_vs_est_wait_resched() and kd->mutex.

> - As stats are now RCU-locked, tot_stats, svc and dest which
> hold estimator structures are now always freed from RCU
> callback. This ensures RCU grace period after the
> ip_vs_stop_estimator() call.

I think this is sound.

> Kthread data:
> 
> - every kthread works over its own data structure and all
> such structures are attached to array

It seems to me there is no upper bound on the number of kthreads that could be forked. Therefore, it should be possible to devise an attack that would force the system to run out of PIDs:
1. Start adding services so that all chains of kthread A would be used.
2. Add one more service to force the forking of kthread B, thus advancing ipvs->est_add_ktid.
3. Remove all but one service from kthread A.
4. Repeat steps 1-3 but with kthread B.
I think I could come up with a reproducer if need be.

> - to add new estimators we use the last added kthread
> context (est_add_ktid). The new estimators are linked to
> the chain just before the estimated one, based on add_row.
> This ensures their estimation will start after 2 seconds.
> If estimators are added in bursts, common case if all
> services and dests are initially configured, we may
> spread the estimators to more chains. This will reduce
> the chain imbalance.
> 
> - the chain imbalance is not so fatal when we use
> kthreads. We design each kthread for part of the
> possible CPU usage, so even if some chain exceeds its
> time slot it would happen all the time or sporadic
> depending on the scheduling but still keeping the
> 2-second interval. The cpulist isolation can make
> the things more stable as a 2-second time interval
> per estimator.

Unbalanced chains would not be fatal to the system but there is a risk of introducing scheduling latencies of tens or even hundreds of milliseconds. There are patterns of adding and removing chains that would result in chain imbalance getting so severe that a handful of chains would have estimators in them while the rest would be empty or almost empty. Some examples:
1. Adding a new service each second after sleeping for 1 second. This would use the add_row value at the time of adding the estimator, which would result in 2 chains holding all the estimators.
2. Repeated addition and removal of services. There would always be more services added than removed. The additions would be carried out in bursts. The forking of a new kthread would not be triggered because the number of services would stay under IPVS_EST_MAX_COUNT (32 * 1024).

The problem is that the code does not guarantee that the length of chains always stays under IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS (32 * 1024 / 50) estimators. A check and a cycle iterating over the available rows could be added to ip_vs_start_estimator() to find the rows that still have fewer estimators than IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS. This would come at the expense of having inaccurate estimates for a few intervals, but I think the trade-off is worth it. Also, the estimates will be inaccurate when estimators are added in bursts. If, depending on how services are added and removed, the code can introduce scheduling latencies, someone will run into this sooner or later. The probability of severe chain imbalance being low is not good enough; there should be a guarantee.
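
To make this concrete, a rough sketch of such a row search
(ip_vs_est_pick_row() and IPVS_EST_CHAIN_LIMIT are hypothetical and
only illustrate the idea, reusing the chain_len array from the patch):

/* Hypothetical per-chain limit derived from the existing constants */
#define IPVS_EST_CHAIN_LIMIT	(IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS)

/* Sketch: starting from the add_row hint, pick the first row whose
 * chain is still under the limit; fall back to the hint if all rows
 * are full, in which case the imbalance can not be avoided anyway.
 */
static int ip_vs_est_pick_row(struct ip_vs_est_kt_data *kd, int hint)
{
	int row = hint;
	int i;

	for (i = 0; i < IPVS_EST_NCHAINS; i++) {
		if (kd->chain_len[row] < IPVS_EST_CHAIN_LIMIT)
			return row;
		if (++row >= IPVS_EST_NCHAINS)
			row = 0;
	}
	return hint;
}
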
-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/4] ipvs: add rcu protection to stats
  2022-08-27 17:41 ` [RFC PATCH 1/4] ipvs: add rcu protection to stats Julian Anastasov
@ 2022-09-05 10:43   ` Jiri Wiesner
  0 siblings, 0 replies; 18+ messages in thread
From: Jiri Wiesner @ 2022-09-05 10:43 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

On Sat, Aug 27, 2022 at 08:41:51PM +0300, Julian Anastasov wrote:
> In preparation to using RCU locking for the list
> with estimators, make sure the struct ip_vs_stats
> are released after RCU grace period by using RCU
> callbacks. This affects ipvs->tot_stats where we
> can not use RCU callbacks for ipvs, so we use
> allocated struct ip_vs_stats_rcu. For services
> and dests we force RCU callbacks for all cases.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/4] ipvs: use kthreads for stats estimation
  2022-08-27 17:41 ` [RFC PATCH 2/4] ipvs: use kthreads for stats estimation Julian Anastasov
  2022-09-05  6:47   ` dust.li
@ 2022-09-05 13:19   ` Jiri Wiesner
  2022-09-07 19:01     ` Julian Anastasov
  1 sibling, 1 reply; 18+ messages in thread
From: Jiri Wiesner @ 2022-09-05 13:19 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

On Sat, Aug 27, 2022 at 08:41:52PM +0300, Julian Anastasov wrote:
> @@ -239,8 +239,49 @@ static void defense_work_handler(struct work_struct *work)
>  	queue_delayed_work(system_long_wq, &ipvs->defense_work,
>  			   DEFENSE_TIMER_PERIOD);
>  }
> +
>  #endif
>  
> +static void est_reload_work_handler(struct work_struct *work)
> +{
> +	struct netns_ipvs *ipvs =
> +		container_of(work, struct netns_ipvs, est_reload_work.work);
> +	int genid = atomic_read(&ipvs->est_genid);
> +	int genid_done = atomic_read(&ipvs->est_genid_done);
> +	unsigned long delay = HZ / 10;	/* repeat startups after failure */
> +	bool repeat = false;
> +	int id;
> +
> +	mutex_lock(&ipvs->est_mutex);
> +	for (id = 0; id < ipvs->est_kt_count; id++) {
> +		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id];
> +
> +		/* netns clean up started, abort delayed work */
> +		if (!ipvs->enable)
> +			goto unlock;

It would save some code to move the ipvs->enable check before the critical section and use a return statement right away.

> +		if (!kd)
> +			continue;
> +		/* New config ? Stop kthread tasks */
> +		if (genid != genid_done)
> +			ip_vs_est_kthread_stop(kd);
> +		if (!kd->task && ip_vs_est_kthread_start(ipvs, kd) < 0)
> +			repeat = true;
> +	}
> +
> +	atomic_set(&ipvs->est_genid_done, genid);
> +
> +unlock:
> +	mutex_unlock(&ipvs->est_mutex);
> +
> +	if (!ipvs->enable)
> +		return;
> +	if (genid != atomic_read(&ipvs->est_genid))
> +		delay = 1;
> +	else if (!repeat)
> +		return;
> +	queue_delayed_work(system_long_wq, &ipvs->est_reload_work, delay);
> +}
> +
>  int
>  ip_vs_use_count_inc(void)
>  {

> @@ -47,68 +44,75 @@
>      to 32-bit values for conns, packets, bps, cps and pps.
>  
>    * A lot of code is taken from net/core/gen_estimator.c
> - */
> -
>  
> -/*
> - * Make a summary from each cpu
> +  KEY POINTS:
> +  - cpustats counters are updated per-cpu in SoftIRQ context with BH disabled
> +  - kthreads read the cpustats to update the estimators (svcs, dests, total)
> +  - the states of estimators can be read (get stats) or modified (zero stats)
> +    from processes
> +
> +  KTHREADS:
> +  - kthread contexts are created and attached to array
> +  - the kthread tasks are created when first service is added, before that
> +    the total stats are not estimated
> +  - the kthread context holds lists with estimators (chains) which are
> +    processed every 2 seconds
> +  - as estimators can be added dynamically and in bursts, we try to spread
> +    them to multiple chains which are estimated at different time
>   */
> -static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
> -				 struct ip_vs_cpu_stats __percpu *stats)
> -{
> -	int i;
> -	bool add = false;
>  
> -	for_each_possible_cpu(i) {
> -		struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i);
> -		unsigned int start;
> -		u64 conns, inpkts, outpkts, inbytes, outbytes;
> -
> -		if (add) {
> -			do {
> -				start = u64_stats_fetch_begin(&s->syncp);
> -				conns = s->cnt.conns;
> -				inpkts = s->cnt.inpkts;
> -				outpkts = s->cnt.outpkts;
> -				inbytes = s->cnt.inbytes;
> -				outbytes = s->cnt.outbytes;
> -			} while (u64_stats_fetch_retry(&s->syncp, start));
> -			sum->conns += conns;
> -			sum->inpkts += inpkts;
> -			sum->outpkts += outpkts;
> -			sum->inbytes += inbytes;
> -			sum->outbytes += outbytes;
> -		} else {
> -			add = true;
> -			do {
> -				start = u64_stats_fetch_begin(&s->syncp);
> -				sum->conns = s->cnt.conns;
> -				sum->inpkts = s->cnt.inpkts;
> -				sum->outpkts = s->cnt.outpkts;
> -				sum->inbytes = s->cnt.inbytes;
> -				sum->outbytes = s->cnt.outbytes;
> -			} while (u64_stats_fetch_retry(&s->syncp, start));
> -		}
> -	}
> -}
> +/* Optimal chain length used to spread bursts of newly added ests */
> +#define IPVS_EST_BURST_LEN	BIT(6)
> +/* Max number of ests per kthread (recommended) */
> +#define IPVS_EST_MAX_COUNT	(32 * 1024)
>  
> +static struct lock_class_key __ipvs_est_key;
>  
> -static void estimation_timer(struct timer_list *t)
> +static void ip_vs_estimation_chain(struct ip_vs_est_kt_data *kd, int row)
>  {
> +	struct hlist_head *chain = &kd->chains[row];
>  	struct ip_vs_estimator *e;
> +	struct ip_vs_cpu_stats *c;
>  	struct ip_vs_stats *s;
>  	u64 rate;
> -	struct netns_ipvs *ipvs = from_timer(ipvs, t, est_timer);
>  
> -	if (!sysctl_run_estimation(ipvs))
> -		goto skip;
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(e, chain, list) {
> +		u64 conns, inpkts, outpkts, inbytes, outbytes;
> +		u64 kconns = 0, kinpkts = 0, koutpkts = 0;
> +		u64 kinbytes = 0, koutbytes = 0;
> +		unsigned int start;
> +		int i;
> +
> +		if (kthread_should_stop())
> +			break;
> +		ip_vs_est_cond_resched_rcu(kd, e);
>  
> -	spin_lock(&ipvs->est_lock);
> -	list_for_each_entry(e, &ipvs->est_list, list) {
>  		s = container_of(e, struct ip_vs_stats, est);
> +		for_each_possible_cpu(i) {
> +			c = per_cpu_ptr(s->cpustats, i);
> +			do {
> +				start = u64_stats_fetch_begin(&c->syncp);
> +				conns = c->cnt.conns;
> +				inpkts = c->cnt.inpkts;
> +				outpkts = c->cnt.outpkts;
> +				inbytes = c->cnt.inbytes;
> +				outbytes = c->cnt.outbytes;
> +			} while (u64_stats_fetch_retry(&c->syncp, start));
> +			kconns += conns;
> +			kinpkts += inpkts;
> +			koutpkts += outpkts;
> +			kinbytes += inbytes;
> +			koutbytes += outbytes;
> +		}
> +
> +		spin_lock_bh(&s->lock);

Per-CPU counters are updated from softirq context and read by user space processes and kthreads. The per-CPU counters are protected by the syncp member. Kthreads sum up the per-CPU counters and calculate rate estimates. Both the counter sums and the rates are read (or reset) by user space processes. What I am building up to is this: bottom halves should stay enabled unless it is necessary to disable them to ensure data consistency. It should not be necessary to disable bottom halves here because the spin lock only protects the counter sums and rates and synchronizes kthreads and user space processes. Following this logic, disabling bottom halves could be dropped in ip_vs_copy_stats() and ip_vs_zero_stats(). Am I wrong about this?

>  
> -		spin_lock(&s->lock);
> -		ip_vs_read_cpu_stats(&s->kstats, s->cpustats);
> +		s->kstats.conns = kconns;
> +		s->kstats.inpkts = kinpkts;
> +		s->kstats.outpkts = koutpkts;
> +		s->kstats.inbytes = kinbytes;
> +		s->kstats.outbytes = koutbytes;
>  
>  		/* scaled by 2^10, but divided 2 seconds */
>  		rate = (s->kstats.conns - e->last_conns) << 9;
> @@ -131,32 +135,288 @@ static void estimation_timer(struct timer_list *t)
>  		rate = (s->kstats.outbytes - e->last_outbytes) << 4;
>  		e->last_outbytes = s->kstats.outbytes;
>  		e->outbps += ((s64)rate - (s64)e->outbps) >> 2;
> -		spin_unlock(&s->lock);
> +		spin_unlock_bh(&s->lock);
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static int ip_vs_estimation_kthread(void *data)
> +{
> +	struct ip_vs_est_kt_data *kd = data;
> +	struct netns_ipvs *ipvs = kd->ipvs;
> +	int row = kd->est_row;
> +	unsigned long now;
> +	long gap;
> +
> +	while (1) {
> +		set_current_state(TASK_IDLE);
> +		if (kthread_should_stop())
> +			break;
> +
> +		/* before estimation, check if we should sleep */
> +		now = READ_ONCE(jiffies);
> +		gap = kd->est_timer - now;
> +		if (gap > 0) {
> +			if (gap > IPVS_EST_TICK) {
> +				kd->est_timer = now - IPVS_EST_TICK;
> +				gap = IPVS_EST_TICK;
> +			}
> +			schedule_timeout(gap);
> +		} else {
> +			__set_current_state(TASK_RUNNING);
> +			if (gap < -8 * IPVS_EST_TICK)
> +				kd->est_timer = now;
> +		}
> +
> +		if (sysctl_run_estimation(ipvs) &&
> +		    !hlist_empty(&kd->chains[row]))
> +			ip_vs_estimation_chain(kd, row);
> +
> +		row++;
> +		if (row >= IPVS_EST_NCHAINS)
> +			row = 0;
> +		kd->est_row = row;
> +		/* add_row best to point after the just estimated row */
> +		WRITE_ONCE(kd->add_row, row);

One of the effects of setting add_row here is that it will randomize the chains to which estimators are added when estimators are added in many short bursts with time intervals between the bursts exceeding IPVS_EST_TICK. I guess that is what we want.

> +		kd->est_timer += IPVS_EST_TICK;
> +	}
> +	__set_current_state(TASK_RUNNING);
> +
> +	return 0;
> +}
> +
> +/* Stop (bump=true)/start kthread tasks */
> +void ip_vs_est_reload_start(struct netns_ipvs *ipvs, bool bump)

The variable name "bump" is not make it obvious what will be the action taken after setting bump to true.

> +{
> +	/* Ignore reloads before first service is added */
> +	if (!ipvs->enable)
> +		return;
> +	/* Bump the kthread configuration genid */
> +	if (bump)
> +		atomic_inc(&ipvs->est_genid);
> +	queue_delayed_work(system_long_wq, &ipvs->est_reload_work,
> +			   bump ? 0 : 1);
> +}
> +
> +/* Start kthread task with current configuration */
> +int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
> +			    struct ip_vs_est_kt_data *kd)
> +{
> +	unsigned long now;
> +	int ret = 0;
> +	long gap;
> +
> +	lockdep_assert_held(&ipvs->est_mutex);
> +
> +	if (kd->task)
> +		goto out;
> +	now = READ_ONCE(jiffies);
> +	gap = kd->est_timer - now;
> +	/* Sync est_timer if task is starting later */
> +	if (abs(gap) > 4 * IPVS_EST_TICK)
> +		kd->est_timer = now;
> +	kd->task = kthread_create(ip_vs_estimation_kthread, kd, "ipvs-e:%d:%d",
> +				  ipvs->gen, kd->id);
> +	if (IS_ERR(kd->task)) {
> +		ret = PTR_ERR(kd->task);
> +		kd->task = NULL;
> +		goto out;
>  	}
> -	spin_unlock(&ipvs->est_lock);
>  
> -skip:
> -	mod_timer(&ipvs->est_timer, jiffies + 2*HZ);
> +	pr_info("starting estimator thread %d...\n", kd->id);
> +	wake_up_process(kd->task);
> +
> +out:
> +	return ret;
> +}
> +
> +void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd)
> +{
> +	if (kd->task) {
> +		pr_info("stopping estimator thread %d...\n", kd->id);
> +		kthread_stop(kd->task);
> +		kd->task = NULL;
> +	}
>  }
>  
> +/* Create and start estimation kthread in a free or new array slot */
> +static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
> +{
> +	struct ip_vs_est_kt_data *kd = NULL;
> +	int id = ipvs->est_kt_count;
> +	int err = -ENOMEM;
> +	void *arr = NULL;
> +	int i;
> +
> +	mutex_lock(&ipvs->est_mutex);
> +
> +	for (i = 0; i < id; i++) {
> +		if (!ipvs->est_kt_arr[i])
> +			break;
> +	}
> +	if (i >= id) {
> +		arr = krealloc_array(ipvs->est_kt_arr, id + 1,
> +				     sizeof(struct ip_vs_est_kt_data *),
> +				     GFP_KERNEL);
> +		if (!arr)
> +			goto out;
> +		ipvs->est_kt_arr = arr;
> +	} else {
> +		id = i;
> +	}
> +	kd = kmalloc(sizeof(*kd), GFP_KERNEL);
> +	if (!kd)
> +		goto out;
> +	kd->ipvs = ipvs;
> +	mutex_init(&kd->mutex);
> +	kd->id = id;
> +	kd->est_count = 0;
> +	kd->est_max_count = IPVS_EST_MAX_COUNT;
> +	kd->add_row = 0;
> +	kd->est_row = 0;
> +	kd->est_timer = jiffies;
> +	for (i = 0; i < ARRAY_SIZE(kd->chains); i++)
> +		INIT_HLIST_HEAD(&kd->chains[i]);
> +	memset(kd->chain_len, 0, sizeof(kd->chain_len));
> +	kd->task = NULL;
> +	/* Start kthread tasks only when services are present */
> +	if (ipvs->enable) {
> +		/* On failure, try to start the task again later */
> +		if (ip_vs_est_kthread_start(ipvs, kd) < 0)
> +			ip_vs_est_reload_start(ipvs, false);
> +	}
> +
> +	if (arr)
> +		ipvs->est_kt_count++;
> +	ipvs->est_kt_arr[id] = kd;
> +	/* Use most recent kthread for new ests */
> +	ipvs->est_add_ktid = id;
> +

To consolidate the code paths, the out label could be moved here after getting rid of the dead code and changing err.

> +	mutex_unlock(&ipvs->est_mutex);
> +
> +	return 0;
> +
> +out:
> +	mutex_unlock(&ipvs->est_mutex);
> +	if (kd) {

The kd variable above does not evaluate to true in this code path. This is dead code:

> +		mutex_destroy(&kd->mutex);
> +		kfree(kd);
> +	}
> +	return err;
> +}
> +
> +/* Add estimator to current kthread (est_add_ktid) */
>  void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
>  {
>  	struct ip_vs_estimator *est = &stats->est;
> +	struct ip_vs_est_kt_data *kd = NULL;
> +	int ktid, row;
> +
> +	INIT_HLIST_NODE(&est->list);
> +	ip_vs_est_init_resched_rcu(est);
> +
> +	if (ipvs->est_add_ktid < ipvs->est_kt_count) {
> +		kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
> +		if (!kd)
> +			goto add_kt;
> +		if (kd->est_count < kd->est_max_count)
> +			goto add_est;
> +	}
>  
> -	INIT_LIST_HEAD(&est->list);
> +add_kt:
> +	/* Create new kthread but we can exceed est_max_count on failure */
> +	if (ip_vs_est_add_kthread(ipvs) < 0) {
> +		if (!kd || kd->est_count >= INT_MAX / 2)
> +			goto out;
> +	}
> +	kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
> +	if (!kd)
> +		goto out;
> +
> +add_est:
> +	ktid = kd->id;
> +	/* add_row points after the row we should use */
> +	row = READ_ONCE(kd->add_row) - 1;
> +	if (row < 0)
> +		row = IPVS_EST_NCHAINS - 1;
> +
> +	kd->est_count++;
> +	kd->chain_len[row]++;
> +	/* Multiple ests added together? Fill chains one by one. */
> +	if (!(kd->chain_len[row] & (IPVS_EST_BURST_LEN - 1)))
> +		kd->add_row = row;
> +	est->ktid = ktid;
> +	est->ktrow = row;
> +	hlist_add_head_rcu(&est->list, &kd->chains[row]);
> +
> +out:

The out label is needless. There is no error handling. A return statement instead of the gotos would express the intention more clearly. (It applies to other functions in the patch as well.)

> +	;
> +}
>  
> -	spin_lock_bh(&ipvs->est_lock);
> -	list_add(&est->list, &ipvs->est_list);
> -	spin_unlock_bh(&ipvs->est_lock);
> +static void ip_vs_est_kthread_destroy(struct ip_vs_est_kt_data *kd)
> +{
> +	if (kd) {
> +		if (kd->task)
> +			kthread_stop(kd->task);
> +		mutex_destroy(&kd->mutex);
> +		kfree(kd);
> +	}
>  }
>  
> +/* Unlink estimator from list */
>  void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
>  {
>  	struct ip_vs_estimator *est = &stats->est;
> +	struct ip_vs_est_kt_data *kd;
> +	int ktid = est->ktid;
> +
> +	/* Failed to add to chain ? */
> +	if (hlist_unhashed(&est->list))
> +		goto out;
> +
> +	hlist_del_rcu(&est->list);
> +	ip_vs_est_wait_resched(ipvs, est);
> +
> +	kd = ipvs->est_kt_arr[ktid];
> +	kd->chain_len[est->ktrow]--;
> +	kd->est_count--;
> +	if (kd->est_count)
> +		goto out;
> +	pr_info("stop unused estimator thread %d...\n", ktid);
> +
> +	mutex_lock(&ipvs->est_mutex);
> +
> +	ip_vs_est_kthread_destroy(kd);
> +	ipvs->est_kt_arr[ktid] = NULL;
> +	if (ktid == ipvs->est_kt_count - 1)
> +		ipvs->est_kt_count--;
> +
> +	mutex_unlock(&ipvs->est_mutex);
> +
> +	if (ktid == ipvs->est_add_ktid) {
> +		int count = ipvs->est_kt_count;
> +		int best = -1;
> +
> +		while (count-- > 0) {
> +			if (!ipvs->est_add_ktid)
> +				ipvs->est_add_ktid = ipvs->est_kt_count;
> +			ipvs->est_add_ktid--;
> +			kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
> +			if (!kd)
> +				continue;
> +			if (kd->est_count < kd->est_max_count) {
> +				best = ipvs->est_add_ktid;
> +				break;
> +			}
> +			if (best < 0)
> +				best = ipvs->est_add_ktid;
> +		}
> +		if (best >= 0)
> +			ipvs->est_add_ktid = best;
> +	}
>  
> -	spin_lock_bh(&ipvs->est_lock);
> -	list_del(&est->list);
> -	spin_unlock_bh(&ipvs->est_lock);
> +out:
> +	;
>  }
>  
>  void ip_vs_zero_estimator(struct ip_vs_stats *stats)

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/4] ipvs: add est_cpulist and est_nice sysctl vars
  2022-08-27 17:41 ` [RFC PATCH 3/4] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
@ 2022-09-05 14:53   ` Jiri Wiesner
  0 siblings, 0 replies; 18+ messages in thread
From: Jiri Wiesner @ 2022-09-05 14:53 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

On Sat, Aug 27, 2022 at 08:41:53PM +0300, Julian Anastasov wrote:
> Allow the kthreads for stats to be configured for
> specific cpulist (isolation) and niceness (scheduling
> priority).
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

I think moving the estimation kthreads to idle CPU cores will become necessary for people who want their estimates to be as accurate as possible. Otherwise, scheduling latencies on busy systems may make the estimates inaccurate by delaying the kthreads after they have been woken up.

Reviewed-by: Jiri Wiesner <jwiesner@suse.de>

> diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
> index 387fda80f05f..90c7c325421a 100644
> --- a/Documentation/networking/ipvs-sysctl.rst
> +++ b/Documentation/networking/ipvs-sysctl.rst
> @@ -129,6 +129,26 @@ drop_packet - INTEGER
>  	threshold. When the mode 3 is set, the always mode drop rate
>  	is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
>  
> +est_cpulist - CPULIST
> +	Allowed	CPUs for estimation kthreads
> +
> +	Syntax: standard cpulist format
> +	empty list - stop kthread tasks and estimation
> +	default - the system's housekeeping CPUs for kthreads
> +
> +	Example:
> +	"all": all possible CPUs
> +	"0-N": all possible CPUs, N denotes last CPU number
> +	"0,1-N:1/2": first and all CPUs with odd number
> +	"": empty list
> +
> +est_nice - INTEGER
> +	default 0
> +	Valid range: -20 (more favorable) - 19 (less favorable)

How about using dots instead of a hyphen in the range? I guess it will be easier to read.

> +
> +	Niceness value to use for the estimation kthreads (scheduling
> +	priority)
> +
>  expire_nodest_conn - BOOLEAN
>  	- 0 - disabled (default)
>  	- not 0 - enabled

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 4/4] ipvs: run_estimation should control the kthread tasks
  2022-08-27 17:41 ` [RFC PATCH 4/4] ipvs: run_estimation should control the kthread tasks Julian Anastasov
@ 2022-09-05 14:57   ` Jiri Wiesner
  0 siblings, 0 replies; 18+ messages in thread
From: Jiri Wiesner @ 2022-09-05 14:57 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, yunhjiang, dust.li, tangyang

On Sat, Aug 27, 2022 at 08:41:54PM +0300, Julian Anastasov wrote:
> Change the run_estimation flag to start/stop the kthread tasks.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Reviewed-by: Jiri Wiesner <jwiesner@suse.de>

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/4] ipvs: use kthreads for stats estimation
  2022-09-05  6:47   ` dust.li
@ 2022-09-07 18:07     ` Julian Anastasov
  0 siblings, 0 replies; 18+ messages in thread
From: Julian Anastasov @ 2022-09-07 18:07 UTC (permalink / raw)
  To: dust.li; +Cc: Jiri Wiesner, Simon Horman, lvs-devel, yunhong-cgl jiang


	Hello,

On Mon, 5 Sep 2022, dust.li wrote:

> >+static inline void ip_vs_est_wait_resched(struct netns_ipvs *ipvs,
> >+					  struct ip_vs_estimator *est)
> >+{
> >+#ifdef IPVS_EST_RESCHED_RCU
> >+	/* Estimator kthread is rescheduling on deleted est? Wait it! */
> >+	if (!refcount_dec_and_test(&est->refcnt)) {
> >+		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[est->ktid];
> >+
> >+		mutex_lock(&kd->mutex);
> >+		mutex_unlock(&kd->mutex);
> 
> IIUC, this mutex_lock/unlock() is just used to wait for the ipvs-e
> kthread to be scheduled back if it had been scheduled out in
> cond_resched_rcu(), and not to protect any data?

	Yes

> If so, I am wondering whether we can remove the mutex_trylock/unlock()
> in ip_vs_est_cond_resched_rcu() and use some wait/wakeup mechanism
> instead? When I run perf on the 'ipvs-e' kthreads, I see a lot of CPU
> cycles spent in mutex_trylock().

	Yes, wait/wakeup would be a better alternative; it
would avoid the possible priority inversion that Jiri noticed.

	Probably the rate of rescheduling can be reduced,
so that we can achieve sub-millisecond latency. But the time
we spend in the for_each_possible_cpu() loop depends on the
number of possible CPUs, and they can run at different speeds.
I thought of checking first with need_resched() before
mutex_trylock(), but when RCU preemption is disabled the RCU
reader side is a region with preemption disabled, and we should
call cond_resched() often to report a quiescent state with
rcu_all_qs().

For example:

- goal: sub-100-microsecond cond_resched rate to
report a quiescent state and to be nice to other kthreads

	if (!(++resched_counter & 15) || need_resched()) ...
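
	Folded into the existing helper, that could look roughly
like this (only a sketch; kd->resched_counter would be a new
field that does not exist in this patchset):

static inline void ip_vs_est_cond_resched_rcu(struct ip_vs_est_kt_data *kd,
					      struct ip_vs_estimator *e)
{
#ifdef IPVS_EST_RESCHED_RCU
	/* Try to reschedule only every 16 estimators, or when the
	 * scheduler asks for it, instead of on every entry.
	 */
	if (!(++kd->resched_counter & 15) || need_resched()) {
		if (mutex_trylock(&kd->mutex)) {
			/* Block removal during reschedule */
			if (refcount_inc_not_zero(&e->refcnt)) {
				cond_resched_rcu();
				refcount_dec(&e->refcnt);
			}
			mutex_unlock(&kd->mutex);
		}
	}
#endif
}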

	Thank you for the review and testing!

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] Use kthreads for stats
  2022-09-05  8:26 ` Jiri Wiesner
@ 2022-09-07 18:33   ` Julian Anastasov
  2022-09-08 15:35     ` Jiri Wiesner
  0 siblings, 1 reply; 18+ messages in thread
From: Julian Anastasov @ 2022-09-07 18:33 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Mon, 5 Sep 2022, Jiri Wiesner wrote:

> I believe allowing the kthread the possibility to block in each iteration - for each estimator - introduces some problems:
> 1. Non-preemptive kernels should be optimized for throughput not for latency
> Using the figure reported earlier (50,000 services required 200 ms of processing time), it takes roughly 3 ms to process one chain (32 * 1024 / 50, i.e. about 655 services). The processing time per chain will vary with the number of NUMA nodes and CPUs. Nevertheless, this number is comparable to the processing time limit of __do_softirq() - 2 ms, which gets converted to 1 jiffy. In terms of latency on non-preemptive kernels, it is entirely reasonable to let one chain be processed without rescheduling the kthread.

	Thank you for the review!

	My worry is that IPVS_EST_MAX_COUNT is not a hard limit:
we allow this limit to be exceeded when ip_vs_est_add_kthread()
fails to create a kthread (mostly on ENOMEM) or if we add a limit
on the number of kthreads, so in such cases we can not refuse to
add more estimators to the existing kthreads.

> 2. The priority of the kthreads could be set to lower values than the default priority for SCHED_OTHER. If a user space process was trying to stop an estimator whose pointer is held by a currently sleeping kthread, this would constitute priority inversion. The kd->mutex would not be released until the lower priority thread, the kthread, has started running again. AFAIK, the mutex locking primitive does not implement priority inheritance on non-preemptive kernels.

	I see, an alternative would be wait_event() logic, to
avoid holding the mutex during reschedule and causing priority
inversion.
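
	A rough sketch of that direction (kd->resched_wq below is a
hypothetical per-kthread wait queue, initialized with
init_waitqueue_head() in ip_vs_est_add_kthread(); none of this is
part of the current patchset):

/* Deleter side: wait until the kthread drops its extra reference */
static inline void ip_vs_est_wait_resched(struct netns_ipvs *ipvs,
					  struct ip_vs_estimator *est)
{
#ifdef IPVS_EST_RESCHED_RCU
	if (!refcount_dec_and_test(&est->refcnt)) {
		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[est->ktid];

		wait_event(kd->resched_wq, !refcount_read(&est->refcnt));
	}
#endif
}

/* Kthread side, replacing the mutex_trylock() block */
static inline void ip_vs_est_cond_resched_rcu(struct ip_vs_est_kt_data *kd,
					      struct ip_vs_estimator *e)
{
#ifdef IPVS_EST_RESCHED_RCU
	if (refcount_inc_not_zero(&e->refcnt)) {
		cond_resched_rcu();
		/* Hitting zero here means a deleter is already waiting */
		if (refcount_dec_and_test(&e->refcnt))
			wake_up(&kd->resched_wq);
	}
#endif
}

	The deleter still has to wait for the kthread to run again,
but there is no mutex to take in the hot path and no priority
inversion through kd->mutex.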

> 3. Blocking while in ip_vs_estimation_chain() will result in wrong estimates for the remaining estimators in a chain. The 2 second interval would not be kept; rate estimates would be overestimated in that interval and underestimated in one of the future intervals.
> In my opinion, any of the above reasons is sufficient to remove ip_vs_est_cond_resched_rcu(), ip_vs_est_wait_resched() and kd->mutex.

	The chains can become too long. I'm not sure it is
good to avoid cond_resched() for a long time. Another solution
would be to allocate more chains per tick and apply some
hard limit to their length; cond_resched() would be called
after each such chain is processed. But it is difficult to
calculate a hard limit for these chains, as it depends on the
cycles we spend (CPU speed and the number of possible CPUs
we walk in the loop). For example, we can again have 50 ticks
but 16 chains per tick, so 800 chains per kthread in total (50*16).
32*1024/800 means about 40 estimators per chain before calling
cond_resched(). And a tick can attach more than 16 chains
if est_max_count is exceeded. It will cost some memory
to provide more chains, but it avoids the kd->mutex
games.

	struct hlist_head *tick_chain = &kd->tick_chains[row];

	rcu_read_lock();
	/* tick_chain can hold any number of chains */
	hlist_for_each_entry_rcu(chain, tick_chain, list) {
		/* This list below is limited */
		hlist_for_each_entry_rcu(e, &chain->ests, list) {
			...
			if (kthread_should_stop())
				goto end;
			...
		}
		cond_resched_rcu();
	}
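
	The data layout implied by the sketch above could be
something like this (hypothetical, only to illustrate the idea):

/* Every tick row holds a list of short chains instead of one
 * unbounded chain; cond_resched_rcu() runs between the chains.
 */
struct ip_vs_est_tick_chain {
	struct hlist_node	list;	/* linked into tick_chains[row] */
	struct hlist_head	ests;	/* limited number of estimators */
};

struct ip_vs_est_kt_data {
	/* ... fields as in this patch ... */
	struct hlist_head	tick_chains[IPVS_EST_NCHAINS];
};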

	If we change ip_vs_start_estimator() to return an int
error, we can safely allocate new chains and handle ENOMEM.
I think this is possible with some reordering of the
ip_vs_start_estimator() calls.

> > Kthread data:
> > 
> > - every kthread works over its own data structure and all
> > such structures are attached to array
> 
> It seems to me there is no upper bound on the number of kthreads that could be forked. Therefore, it should be possible to devise an attack that would force the system to run out of PIDs:
> 1. Start adding services so that all chains of kthread A would be used.
> 2. Add one more service to force the forking of kthread B, thus advancing ipvs->est_add_ktid.
> 3. Remove all but one service from kthread A.
> 4. Repeat steps 1-3 but with kthread B.
> I think I could come up with a reproducer if need be.

	Agreed, the chain management can be made more robust.
This patchset is an initial version that can serve as a basis.
We should consider things like:

- we do not know how many estimators will be added; if we
limit the number of kthreads, they will be overloaded

- add/del can be made to allocate memory for chains, if needed.
We should not spend many cycles on adding/deleting estimators.

- try harder to spread estimators across chains, even at
the risk of initially applying a shorter timer to the newly
added estimator.

> Unbalanced chains would not be fatal to the system but there is a risk of introducing scheduling latencies of tens or even hundreds of milliseconds. There are patterns of adding and removing chains that would result in chain imbalance getting so severe that a handful of chains would have estimators in them while the rest would be empty or almost empty. Some examples:
> 1. Adding a new service each second after sleeping for 1 second. This would use the add_row value at the time of adding the estimator, which would result in 2 chains holding all the estimators.
> 2. Repeated addition and removal of services. There would always be more services added than removed. The additions would be carried out in bursts. The forking of a new kthread would not be triggered because the number of services would stay under IPVS_EST_MAX_COUNT (32 * 1024).
> 
> The problem is that the code does not guarantee that the length of chains always stays under IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS (32 * 1024 / 50) estimators. A check and a cycle iterating over the available rows could be added to ip_vs_start_estimator() to find the rows that still have fewer estimators than IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS. This would come at the expense of having inaccurate estimates for a few intervals, but I think the trade-off is worth it. Also, the estimates will be inaccurate when estimators are added in bursts. If, depending on how services are added and removed, the code can introduce scheduling latencies, someone will run into this sooner or later. The probability of severe chain imbalance being low is not good enough; there should be a guarantee.

	About balancing the chains: I'm not sure whether it is
possible, while serving the row for the current tick, to look
ahead and extend the current tick's work with estimators from
the next tick. Such decisions can be made every tick, i.e. we
do not move entries between the chains, but we can walk one
chain and possibly part of another chain. If we have more
chains per tick, it will be easier: for example, if the current
tick processes 16 chains by default, it can also process some
from the next 16 chains, say 16+8 in total. After 2 seconds,
this tick can process 16+16+8, for example. It will depend on
the number of estimators currently added per tick and its
chains. Every 2 seconds we can advance by one tick, so that an
estimator is served every 2 seconds, or at least after 1960 ms
if such chain stealing happens. Such logic would allow the
estimators to be spread in time even while they are attached
to chains and ticks of different length. As a result, we would
process an equal number of estimators per tick, slowly
adjusting to their current number and chain lengths.

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/4] ipvs: use kthreads for stats estimation
  2022-09-05 13:19   ` Jiri Wiesner
@ 2022-09-07 19:01     ` Julian Anastasov
  2022-09-08 16:00       ` Jiri Wiesner
  0 siblings, 1 reply; 18+ messages in thread
From: Julian Anastasov @ 2022-09-07 19:01 UTC (permalink / raw)
  To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li


	Hello,

On Mon, 5 Sep 2022, Jiri Wiesner wrote:

> On Sat, Aug 27, 2022 at 08:41:52PM +0300, Julian Anastasov wrote:
> >  
> > +static void est_reload_work_handler(struct work_struct *work)
> > +{
> > +	struct netns_ipvs *ipvs =
> > +		container_of(work, struct netns_ipvs, est_reload_work.work);
> > +	int genid = atomic_read(&ipvs->est_genid);
> > +	int genid_done = atomic_read(&ipvs->est_genid_done);
> > +	unsigned long delay = HZ / 10;	/* repeat startups after failure */
> > +	bool repeat = false;
> > +	int id;
> > +
> > +	mutex_lock(&ipvs->est_mutex);
> > +	for (id = 0; id < ipvs->est_kt_count; id++) {
> > +		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id];
> > +
> > +		/* netns clean up started, abort delayed work */
> > +		if (!ipvs->enable)
> > +			goto unlock;
> 
> It would save some code to move the ipvs->enable check before the critical section and use a return statement right away.

	I preferred to react to cleanup_net() faster and to
avoid creating threads once cleanup has started.

> 
> > +		if (!kd)
> > +			continue;
> > +		/* New config ? Stop kthread tasks */
> > +		if (genid != genid_done)
> > +			ip_vs_est_kthread_stop(kd);
> > +		if (!kd->task && ip_vs_est_kthread_start(ipvs, kd) < 0)
> > +			repeat = true;
> > +	}
> > +
> > +	atomic_set(&ipvs->est_genid_done, genid);
> > +
> > +unlock:
> > +	mutex_unlock(&ipvs->est_mutex);
> > +
> > +	if (!ipvs->enable)
> > +		return;
> > +	if (genid != atomic_read(&ipvs->est_genid))
> > +		delay = 1;
> > +	else if (!repeat)
> > +		return;
> > +	queue_delayed_work(system_long_wq, &ipvs->est_reload_work, delay);
> > +}
> > +

> > +		spin_lock_bh(&s->lock);
> 
> Per-CPU counters are updated from softirq context and read by user space processes and kthreads. The per-CPU counters are protected by the syncp member. Kthreads sum up the per-CPU counters and calculate rate estimates. Both the counter sums and the rates are read (or reset) by user space processes. What I am building up to is this: bottom halves should stay enabled unless it is necessary to disable them to ensure data consistency. It should not be necessary to disable bottom halves here because the spin lock only protects the counter sums and rates and synchronizes kthreads and user space processes. Following this logic, disabling bottom halves could be dropped in ip_vs_copy_stats() and ip_vs_zero_stats(). Am I wrong about this?

	Yes, removing _bh is correct. The risk is that the
other processes will spin if the kthread is interrupted by BH.
While the correct method to use for processes would be a mutex,
due to RCU we cannot use it. So, I preferred the _bh
calls for now.
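
	For reference, the read side I have in mind looks roughly
like this (a simplified sketch; it follows the existing per-CPU
cnt/syncp scheme, but the helper name and exact fields here are
only illustrative):

	/* Sum one per-CPU counter; the syncp seqcount protects against
	 * the softirq writers, while the stats spinlock taken by the
	 * caller only serializes kthread vs user-space readers that
	 * update the summed counters and rates.
	 */
	static u64 est_sum_conns_sketch(struct ip_vs_stats *stats)
	{
		u64 sum = 0;
		int cpu;

		for_each_possible_cpu(cpu) {
			struct ip_vs_cpu_stats *c;
			unsigned int start;
			u64 val;

			c = per_cpu_ptr(stats->cpustats, cpu);
			do {
				start = u64_stats_fetch_begin(&c->syncp);
				val = c->cnt.conns;
			} while (u64_stats_fetch_retry(&c->syncp, start));
			sum += val;
		}
		return sum;
	}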

> > +		row++;
> > +		if (row >= IPVS_EST_NCHAINS)
> > +			row = 0;
> > +		kd->est_row = row;
> > +		/* add_row best to point after the just estimated row */
> > +		WRITE_ONCE(kd->add_row, row);
> 
> One of the effects of setting add_row here is that it will randomize the chains to which estimators are added when estimators are added in many short bursts with time intervals between the bursts exceeding IPVS_EST_TICK. I guess that is what we want.

	Yes, what we want here is to apply an initial
2-second timer for the newly added ests but the first
priority is to reduce this initial timer if the chain
is overloaded.

	This code just ensures a 2-second initial timer:
add_row is only a hint when we add a few estimators, but
if many estimators are added they will occupy more (or all)
chains, if needed. That can happen before the first
estimation if the whole config is created at start and
not changed later.

> > +/* Stop (bump=true)/start kthread tasks */
> > +void ip_vs_est_reload_start(struct netns_ipvs *ipvs, bool bump)
> 
> The variable name "bump" does not make it obvious what action will be taken after setting bump to true.

	Yes, more comments should be added. bump=true causes a
restart while bump=false just makes sure the tasks are started.
When we change ip_vs_start_estimator() to propagate the
errors from ip_vs_est_kthread_start(), this var will be
removed.
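
	Roughly, the intent is (illustrative only, not the exact
patch code):

	void ip_vs_est_reload_start(struct netns_ipvs *ipvs, bool bump)
	{
		/* bump=true: new config generation, so the reload work
		 * will stop and restart the kthread tasks
		 */
		if (bump)
			atomic_inc(&ipvs->est_genid);
		/* bump=false: just make sure the tasks are started */
		queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 0);
	}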

> > +	/* Start kthread tasks only when services are present */
> > +	if (ipvs->enable) {
> > +		/* On failure, try to start the task again later */
> > +		if (ip_vs_est_kthread_start(ipvs, kd) < 0)
> > +			ip_vs_est_reload_start(ipvs, false);
> > +	}
> > +
> > +	if (arr)
> > +		ipvs->est_kt_count++;
> > +	ipvs->est_kt_arr[id] = kd;
> > +	/* Use most recent kthread for new ests */
> > +	ipvs->est_add_ktid = id;
> > +
> 
> To consolidate the code paths, the out label could be moved here after getting rid of the dead code and changing err.
> 
> > +	mutex_unlock(&ipvs->est_mutex);
> > +
> > +	return 0;
> > +
> > +out:
> > +	mutex_unlock(&ipvs->est_mutex);
> > +	if (kd) {
> 
> The kd variable above does not evaluate to true in this code path. This is dead code:

	Yes, in an earlier version this path was reachable on
ip_vs_est_kthread_start() failure, but not in this patch version.

> > +		mutex_destroy(&kd->mutex);
> > +		kfree(kd);
> > +	}
> > +	return err;
> > +}
> > +
> > +/* Add estimator to current kthread (est_add_ktid) */
> >  void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
> >  {
> >  	struct ip_vs_estimator *est = &stats->est;
> > +	struct ip_vs_est_kt_data *kd = NULL;
> > +	int ktid, row;
> > +
> > +	INIT_HLIST_NODE(&est->list);
> > +	ip_vs_est_init_resched_rcu(est);
> > +
> > +	if (ipvs->est_add_ktid < ipvs->est_kt_count) {
> > +		kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
> > +		if (!kd)
> > +			goto add_kt;
> > +		if (kd->est_count < kd->est_max_count)
> > +			goto add_est;
> > +	}
> >  
> > -	INIT_LIST_HEAD(&est->list);
> > +add_kt:
> > +	/* Create new kthread but we can exceed est_max_count on failure */
> > +	if (ip_vs_est_add_kthread(ipvs) < 0) {
> > +		if (!kd || kd->est_count >= INT_MAX / 2)
> > +			goto out;
> > +	}
> > +	kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
> > +	if (!kd)
> > +		goto out;
> > +
> > +add_est:
> > +	ktid = kd->id;
> > +	/* add_row points after the row we should use */
> > +	row = READ_ONCE(kd->add_row) - 1;
> > +	if (row < 0)
> > +		row = IPVS_EST_NCHAINS - 1;
> > +
> > +	kd->est_count++;
> > +	kd->chain_len[row]++;
> > +	/* Multiple ests added together? Fill chains one by one. */
> > +	if (!(kd->chain_len[row] & (IPVS_EST_BURST_LEN - 1)))
> > +		kd->add_row = row;
> > +	est->ktid = ktid;
> > +	est->ktrow = row;
> > +	hlist_add_head_rcu(&est->list, &kd->chains[row]);
> > +
> > +out:
> 
> The out label is needless. There is no error handling. A return statement instead of the gotos would express the intention more clearly. (It applies to other functions in the patch as well.)

	I'll probably change ip_vs_start_estimator() to return an error.

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] Use kthreads for stats
  2022-09-07 18:33   ` Julian Anastasov
@ 2022-09-08 15:35     ` Jiri Wiesner
  2022-09-08 18:32       ` Jiri Wiesner
  0 siblings, 1 reply; 18+ messages in thread
From: Jiri Wiesner @ 2022-09-08 15:35 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Wed, Sep 07, 2022 at 09:33:07PM +0300, Julian Anastasov wrote:
> On Mon, 5 Sep 2022, Jiri Wiesner wrote:
> 
> > I believe allowing the kthread the possibility to block in each iteration - for each estimator - introduces some problems:
> > 1. Non-preemptive kernels should be optimized for throughput, not for latency
> > Using the figure reported earlier (50,000 services required 200 ms of processing time), it takes roughly 3 ms to process one chain (32 * 1024 / 50 services). The processing time per chain will vary with the number of NUMA nodes and CPUs. Nevertheless, this number is comparable with the processing time limit of __do_softirq() - 2 ms, which gets converted to 1 jiffy. In terms of latency on non-preemptive kernels, it is entirely reasonable to let one chain be processed without rescheduling the kthread.
> 
> 	My worry is that IPVS_EST_MAX_COUNT is not a hard limit,
> we allow this limit to be exceeded when ip_vs_est_add_kthread()
> fails to create kthread (mostly on ENOMEM)

A design decision has been taken to allow more than IPVS_EST_MAX_COUNT estimators in a kthread. The reason is to be able to add estimators even if there is no memory to create a new kthread. Under such memory conditions, user space would not be able to fork new processes - the shell could not execute commands. The system is practically unusable until the OOM killer frees some memory. No production system should be expected to function in the long term when there is not enough memory to fork new processes. The solutions are: add more RAM, decrease the workload or fix memory leak bugs. Not being able to create a kthread can be expected to occur seldom, and it would have to occur while services or destinations are being reconfigured. So, why bother with allowing more than IPVS_EST_MAX_COUNT estimators in a kthread? It is common to return -ENOMEM to user space under such conditions.

> or if we add a limit
> for kthreads, so in such cases we can not refuse to add more
> estimators to existing kthreads.

As for limiting the number of kthreads, the limit could be implemented as a sysctl variable. The default should be large enough for most deployments. I guess it could accommodate more than 900,000 estimators, which means 30 kthreads. The maximum number of destinations I have seen on a system was 110,000, with 7,000 services (but I have not seen many production systems running IPVS).

The current design does not take into account estimators that have been removed. As a result, kthreads that are not the last kthread could have fewer than IPVS_EST_MAX_COUNT estimators. This could be prevented by changing ipvs->est_add_ktid in ip_vs_stop_estimator() if the value of est->ktid is lower than the current ipvs->est_add_ktid. The function ip_vs_start_estimator() would then have to increment ipvs->est_add_ktid until it finds a kthread with fewer than IPVS_EST_MAX_COUNT estimators or it has to create a new kthread.
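
What I have in mind is roughly this (a sketch only, not tested; the
placement comments mark where the snippets would go):

/* In ip_vs_stop_estimator(), after unlinking est: make the freed
 * slot the preferred target for future additions.
 */
if (est->ktid < ipvs->est_add_ktid)
	ipvs->est_add_ktid = est->ktid;

/* In ip_vs_start_estimator(): walk forward until a kthread with room
 * is found; if none is found, a new kthread has to be created.
 */
while (ipvs->est_add_ktid < ipvs->est_kt_count) {
	struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[ipvs->est_add_ktid];

	if (kd && kd->est_count < kd->est_max_count)
		break;
	ipvs->est_add_ktid++;
}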

> > 3. Blocking while in ip_vs_estimation_chain() will result in wrong estimates for the remaining estimators in a chain. The 2 second interval would not be kept, rate estimates would be overestimated in that interval and underestimated in one of the future intervals.
> > In my opinion, any of the above reasons is sufficient to remove ip_vs_est_cond_resched_rcu(), ip_vs_est_wait_resched() and kd->mutex.
> 
> 	The chains can become too long. I'm not sure it is
> good to avoid cond_resched() for a long time. Another solution
> would be to allocate more chains for a tick and apply some
> hard limit for these chains. cond_resched() will be called
> after such a chain is processed. But it is difficult to
> calculate a hard limit for these chains: it depends on the
> cycles we spend (CPU speed and the number of possible CPUs
> we walk in the loop). For example, we can have again 50 ticks
> but 16 chains per tick, so 800 chains in total per kthread (50*16).
> 32*1024/800 means about 40 estimators per chain before calling
> cond_resched().

OK, this seems workable. All the chains could be in one array
        struct hlist_head       chains[IPVS_EST_NCHAINS * 16];
and you would increment row by 16 in ip_vs_estimation_kthread().
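
Roughly (illustrative only; the 16 sub-chains per tick and the
per-estimator helper are assumptions):

#define IPVS_EST_TICK_CHAINS	16	/* assumed sub-chains per tick */

/* in struct ip_vs_est_kt_data: */
struct hlist_head	chains[IPVS_EST_NCHAINS * IPVS_EST_TICK_CHAINS];

/* in ip_vs_estimation_kthread(), for the current tick: */
rcu_read_lock();
for (i = 0; i < IPVS_EST_TICK_CHAINS; i++) {
	hlist_for_each_entry_rcu(e, &kd->chains[row + i], list) {
		if (kthread_should_stop())
			goto out;
		est_do_one(e);	/* hypothetical per-estimator work */
	}
	/* reschedule between bounded sub-chains, not per estimator */
	cond_resched_rcu();
}
out:
rcu_read_unlock();
row += IPVS_EST_TICK_CHAINS;
if (row >= IPVS_EST_NCHAINS * IPVS_EST_TICK_CHAINS)
	row = 0;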

> And a tick can attach more than 16 chains
> if the est_max_count is exceeded. It will cost some memory
> to provide more chains but it will avoid the kd->mutex
> games.
> 
> 	struct hlist_head *tick_chain = &kd->tick_chains[row];
> 
> 	rcu_read_lock();
> 	/* tick_chains has no limit of chains */
> 	hlist_for_each_entry_rcu(chain, tick_chain, list) {
> 		/* This list below is limited */
> 		hlist_for_each_entry_rcu(e, &chain->ests, list) {
> 			...
> 			if (kthread_should_stop())
> 				goto end;
> 			...
> 		}
> 		cond_resched_rcu();
> 	}

So, you set a limit - IPVS_EST_MAX_COUNT - but you also allow this limit to be ignored because of a corner case. As a result, tick_chains cannot have a limit on the number of chains. I lean towards keeping the maximum of IPVS_EST_MAX_COUNT estimators per kthread.

There is an alternative design where you could increase kd->est_max_count for all kthreads once all of the available kthreads have kd->est_max_count estimators. Nevertheless, there would also have to be a limit to the value of kd->est_max_count. Imagine that the estimation during a single tick takes so long that the gap variable in ip_vs_estimation_kthread() becomes negative. You would need to have circa 250,000 estimators per kthread. Since you are already measuring the timeout you need for schedule_timeout() in ip_vs_estimation_kthread(), it should be possible to set the kd->est_max_count limit based on the maximum processing time per chain. It could be half of IPVS_EST_TICK, for example.
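
Something along these lines (a sketch only; the variable names and
the half-tick budget are assumptions, and IPVS_EST_HARD_MAX is made
up):

/* after estimating one tick: "delta" is the jiffies spent and
 * "ests_done" is how many estimators that tick held; derive how many
 * estimators fit into half of IPVS_EST_TICK and scale to the whole
 * kthread (IPVS_EST_NCHAINS ticks)
 */
if (delta > 0 && ests_done > 0) {
	u64 per_tick = div64_u64((u64)(IPVS_EST_TICK / 2) * ests_done,
				 delta);

	kd->est_max_count = min_t(u64, per_tick * IPVS_EST_NCHAINS,
				  IPVS_EST_HARD_MAX);
}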

But it seems to me that the alternative design - increasing kd->est_max_count - should have some support in what is used in production. Are there servers with more than 983,040 estimators (which would be IPVS_EST_MAX_COUNT * 30 kthreads) or even one third of that?

> 	If we change ip_vs_start_estimator() to return an int err,
> we can safely allocate new chains and handle ENOMEM.
> I think this is possible, with some reordering of the
> ip_vs_start_estimator() calls.

I think ip_vs_start_estimator() should return an int.

> > > Kthread data:
> > > 
> > > - every kthread works over its own data structure and all
> > > such structures are attached to array
> > 
> > It seems to me there is no upper bound on the number of kthreads that could be forked. Therefore, it should be possible to devise an attack that would force the system to run out of PIDs:
> > 1. Start adding services so that all chains of kthread A would be used.
> > 2. Add one more service to force the forking of kthread B, thus advancing ipvs->est_add_ktid.
> > 3. Remove all but one service from kthread A.
> > 4. Repeat steps 1-3 but with kthread B.
> > I think I could come up with a reproducer if need be.
> 
> 	Agreed, the chains management can be made more robust.
> This patchset is an initial version that can serve as a basis.
> We should consider the following:
> 
> - we do not know how many estimators will be added; if we
> limit the number of kthreads, then they will be overloaded

Not all kthreads would be overloaded. The current design would add all the excess estimators to the last kthread.

> - add/del can be made to allocate memory for chains, if needed.
> We should not spend many cycles in adding/deleting estimators.

The fewer allocations the better. I would not use a linked list as described above - kd->tick_chains.

> - try harder to spread estimators to chains, even at
> the risk of initially applying a smaller timer to the newly
> added estimator.

Yes.

> > Unbalanced chains would not be fatal to the system but there is a risk of introducing scheduling latencies tens or even hundreds of milliseconds long. There are patterns of adding and removing services that would result in chain imbalance getting so severe that a handful of chains would have estimators in them while the rest would be empty or almost empty. Some examples:
> > 1. Adding a new service each second after sleeping for 1 second. This would use the add_row value at the time of adding the estimator, which would result in 2 chains holding all the estimators.
> > 2. Repeated addition and removal of services. There would always be more services added than removed. The additions would be carried out in bursts. The forking of a new kthread would not be triggered because the number of services would stay under IPVS_EST_MAX_COUNT (32 * 1024).
> > 
> > The problem is that the code does not guarantee that the length of chains always stays under IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS (32 * 1024 / 50) estimators. A check and a cycle iterating over the available rows could be added to ip_vs_start_estimator() to find the rows that still have fewer estimators than IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS. This would come at the expense of having inaccurate estimates for a few intervals, but I think the trade-off is worth it. Also, the estimates will be inaccurate when estimators are added in bursts. If the code can introduce scheduling latencies depending on how services are added and removed, someone will run into this sooner or later. The probability of severe chain imbalance being low is not good enough; there should be a guarantee.
> 
> 	About balancing the chains: not sure if it is possible
> while serving some row for current tick, to look ahead and
> advance the current tick work with estimators from the next
> tick. Such decisions can be made every tick. I.e. we do not
> move entries between the chains but we can walk one chain and
> possibly part of the other chain. If we have more chains per
> tick, it will be more easy, for example, if current tick
> processes 16 chains by default, it can process some from the  
> next 16 chains, say 16+8 in total. After 2 seconds, this tick can
> process 16+16+8, for example. It will depend on the number of
> estimators currently added per tick and their chains. Every 2 seconds
> we can advance with one tick, so that an estimator is
> served every 2 seconds, or at least after 1960ms if such
> chain stealing happens. Such logic will allow the
> estimators to be spread in time even while they are attached
> to chains and ticks with different length. As a result, we will
> process an equal number of estimators per tick by slowly
> adjusting to their current number and chain lengths.

I am definitely not suggesting implementing balancing. I propose that each chain have a hard limit on the number of estimators.
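
In ip_vs_start_estimator() that could look roughly like this (sketch
only; the cap macro is made up):

#define IPVS_EST_CHAIN_MAX	(IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS)

/* start from the add_row hint but skip rows that already hit the cap */
row = READ_ONCE(kd->add_row) - 1;
if (row < 0)
	row = IPVS_EST_NCHAINS - 1;
for (i = 0; i < IPVS_EST_NCHAINS; i++) {
	if (kd->chain_len[row] < IPVS_EST_CHAIN_MAX)
		break;
	if (--row < 0)
		row = IPVS_EST_NCHAINS - 1;
}
/* if every row is at the cap, this kthread is full and a new kthread
 * is needed
 */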

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/4] ipvs: use kthreads for stats estimation
  2022-09-07 19:01     ` Julian Anastasov
@ 2022-09-08 16:00       ` Jiri Wiesner
  0 siblings, 0 replies; 18+ messages in thread
From: Jiri Wiesner @ 2022-09-08 16:00 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Wed, Sep 07, 2022 at 10:01:27PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Mon, 5 Sep 2022, Jiri Wiesner wrote:
> 
> > On Sat, Aug 27, 2022 at 08:41:52PM +0300, Julian Anastasov wrote:
> > >  
> > > +static void est_reload_work_handler(struct work_struct *work)
> > > +{
> > > +	struct netns_ipvs *ipvs =
> > > +		container_of(work, struct netns_ipvs, est_reload_work.work);
> > > +	int genid = atomic_read(&ipvs->est_genid);
> > > +	int genid_done = atomic_read(&ipvs->est_genid_done);
> > > +	unsigned long delay = HZ / 10;	/* repeat startups after failure */
> > > +	bool repeat = false;
> > > +	int id;
> > > +
> > > +	mutex_lock(&ipvs->est_mutex);
> > > +	for (id = 0; id < ipvs->est_kt_count; id++) {
> > > +		struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id];
> > > +
> > > +		/* netns clean up started, abort delayed work */
> > > +		if (!ipvs->enable)
> > > +			goto unlock;
> > 
> > It would save some code to move the ipvs->enable check before the critical section and use a return statement right away.
> 
> 	I preferred to react to cleanup_net() faster and to
> avoid creating threads when cleanup is what we are trying to do here.

I meant putting
if (!ipvs->enable)
	return;
right before the mutex_lock() statement.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] Use kthreads for stats
  2022-09-08 15:35     ` Jiri Wiesner
@ 2022-09-08 18:32       ` Jiri Wiesner
  0 siblings, 0 replies; 18+ messages in thread
From: Jiri Wiesner @ 2022-09-08 18:32 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li

On Thu, Sep 08, 2022 at 05:35:21PM +0200, Jiri Wiesner wrote:
> There is an alternative design where you could increase kd->est_max_count for all kthreads once all of the available kthreads have kd->est_max_count estimators. Nevertheless, there would also have to be a limit to the value of kd->est_max_count. Imagine that the estimation during a single tick takes so long that the gap variable in ip_vs_estimation_kthread() becomes negative. You would need to have circa 250,000 estimators per kthread. Since you are already measuring the timeout you need for schedule_timeout() in ip_vs_estimation_kthread(), it should be possible to set the kd->est_max_count limit based on the maximum processing time per chain. It could be half of IPVS_EST_TICK, for example.
> 
> But it seems to me that the alternative design - increasing kd->est_max_count - should have some support in what is used in production. Are there servers with more than 983,040 estimators (which would be IPVS_EST_MAX_COUNT * 30 kthreads) or even one third of that?

I did some profiling (but could have just looked at top, actually) of a kthread with IPVS_EST_MAX_COUNT estimators for 100 seconds:
# Samples: 4K of event 'bus-cycles'
# Event count (approx.): 125024900
# Overhead        Period  Command          Shared Object      Symbol
# ........  ............  ...............  .................  .........................................
#
    76.44%      95570475  ipvs-e:0:0       [kernel.kallsyms]  [k] ip_vs_estimation_kthread
     8.75%      10935925  ipvs-e:0:0       [kernel.kallsyms]  [k] _find_next_bit
     3.18%       3978975  swapper          [kernel.kallsyms]  [k] intel_idle
     1.00%       1251250  ipvs-e:0:0       [kernel.kallsyms]  [k] _raw_spin_lock_bh
     0.36%        450450  swapper          [kernel.kallsyms]  [k] _raw_spin_lock
     0.36%        450450  swapper          [kernel.kallsyms]  [k] update_rq_clock

The bus-cycles event on this particular machine counts 25,000,000 events per second. Based on the period in the profile, the CPU utilization of the various functions is:
ip_vs_estimation_kthread: 95570475/100/25000000*100 = 3.82%
_find_next_bit: 10935925/100/25000000*100 = 0.44%
_raw_spin_lock_bh: 1251250/100/25000000*100 = 0.05%

The kthread could definitely utilize the CPU more, which is an argument for increasing kd->est_max_count.

-- 
Jiri Wiesner
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-09-08 18:32 UTC | newest]

Thread overview: 18+ messages
2022-08-27 17:41 [RFC PATCH 0/4] Use kthreads for stats Julian Anastasov
2022-08-27 17:41 ` [RFC PATCH 1/4] ipvs: add rcu protection to stats Julian Anastasov
2022-09-05 10:43   ` Jiri Wiesner
2022-08-27 17:41 ` [RFC PATCH 2/4] ipvs: use kthreads for stats estimation Julian Anastasov
2022-09-05  6:47   ` dust.li
2022-09-07 18:07     ` Julian Anastasov
2022-09-05 13:19   ` Jiri Wiesner
2022-09-07 19:01     ` Julian Anastasov
2022-09-08 16:00       ` Jiri Wiesner
2022-08-27 17:41 ` [RFC PATCH 3/4] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
2022-09-05 14:53   ` Jiri Wiesner
2022-08-27 17:41 ` [RFC PATCH 4/4] ipvs: run_estimation should control the kthread tasks Julian Anastasov
2022-09-05 14:57   ` Jiri Wiesner
2022-09-05  6:34 ` [RFC PATCH 0/4] Use kthreads for stats dust.li
2022-09-05  8:26 ` Jiri Wiesner
2022-09-07 18:33   ` Julian Anastasov
2022-09-08 15:35     ` Jiri Wiesner
2022-09-08 18:32       ` Jiri Wiesner
