* [patch 0/2] per-CPU vmstat thresholds and vmstat worker disablement
@ 2017-04-25 13:57 ` Marcelo Tosatti
  0 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-04-25 13:57 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: Luiz Capitulino, Rik van Riel, Linux RT Users

The per-CPU vmstat worker periodically transfers accumulated per-CPU
vmstat counters to the global counters. This is a problem for -RT
workloads, where ideally the CPU is reserved entirely for the -RT
application, without interference.

To resolve the problem, introduce two tunables:

* Userspace-configurable per-CPU vmstat threshold: by default the
VM code calculates the per-CPU vmstat counter thresholds. This
tunable allows userspace to override those values.

* Userspace-configurable per-CPU vmstat worker: allows disabling
the per-CPU vmstat worker for a given CPU.
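
For illustration, a hypothetical session using the proposed interface
could look as follows (the sysfs paths are the ones added by patch 2/2
and exist only with this series applied; the CPU number is an example):

```shell
#!/bin/sh
# Hypothetical usage sketch: quiesce vmstat activity on an isolated
# -RT CPU. Assumes the vmstat_threshold/vmstat_worker sysfs files
# proposed by patch 2/2; they do not exist in unpatched kernels.
CPU=${1:-1}
VMSTAT_DIR=/sys/devices/system/cpu/cpu${CPU}/vmstat

if [ -d "$VMSTAT_DIR" ]; then
	# Fold per-CPU counters into the global ones eagerly...
	echo 1 > "$VMSTAT_DIR/vmstat_threshold"
	# ...then disable the periodic vmstat worker on this CPU.
	echo 0 > "$VMSTAT_DIR/vmstat_worker"
else
	echo "vmstat tunables not present for cpu${CPU}" >&2
fi
```

Setting the threshold to 1 before disabling the worker means no stale
per-CPU counts are left behind once the worker stops running.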

^ permalink raw reply	[flat|nested] 66+ messages in thread

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [patch 1/2] MM: remove unused quiet_vmstat function
  2017-04-25 13:57 ` Marcelo Tosatti
@ 2017-04-25 13:57   ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-04-25 13:57 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Luiz Capitulino, Rik van Riel, Linux RT Users, Marcelo Tosatti

[-- Attachment #1: remove-vmstat-quiet --]
[-- Type: text/plain, Size: 2023 bytes --]

Remove the unused quiet_vmstat() function.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/vmstat.h |    1 -
 mm/vmstat.c            |   25 -------------------------
 2 files changed, 26 deletions(-)

Index: linux-2.6-git-disable-vmstat-worker/include/linux/vmstat.h
===================================================================
--- linux-2.6-git-disable-vmstat-worker.orig/include/linux/vmstat.h	2017-04-24 18:52:42.957724687 -0300
+++ linux-2.6-git-disable-vmstat-worker/include/linux/vmstat.h	2017-04-24 18:53:15.086793496 -0300
@@ -233,7 +233,6 @@
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
-void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
Index: linux-2.6-git-disable-vmstat-worker/mm/vmstat.c
===================================================================
--- linux-2.6-git-disable-vmstat-worker.orig/mm/vmstat.c	2017-04-24 18:52:42.957724687 -0300
+++ linux-2.6-git-disable-vmstat-worker/mm/vmstat.c	2017-04-24 18:53:53.075874785 -0300
@@ -1657,31 +1657,6 @@
 }
 
 /*
- * Switch off vmstat processing and then fold all the remaining differentials
- * until the diffs stay at zero. The function is used by NOHZ and can only be
- * invoked when tick processing is not active.
- */
-void quiet_vmstat(void)
-{
-	if (system_state != SYSTEM_RUNNING)
-		return;
-
-	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
-		return;
-
-	if (!need_update(smp_processor_id()))
-		return;
-
-	/*
-	 * Just refresh counters and do not care about the pending delayed
-	 * vmstat_update. It doesn't fire that often to matter and canceling
-	 * it would be too expensive from this path.
-	 * vmstat_shepherd will take care about that for us.
-	 */
-	refresh_cpu_vm_stats(false);
-}
-
-/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 1/2] MM: remove unused quiet_vmstat function
@ 2017-04-25 13:57   ` Marcelo Tosatti
  0 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-04-25 13:57 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Luiz Capitulino, Rik van Riel, Linux RT Users, Marcelo Tosatti

[-- Attachment #1: remove-vmstat-quiet --]
[-- Type: text/plain, Size: 2250 bytes --]

Remove unused quiet_vmstat function.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/vmstat.h |    1 -
 mm/vmstat.c            |   25 -------------------------
 2 files changed, 26 deletions(-)

Index: linux-2.6-git-disable-vmstat-worker/include/linux/vmstat.h
===================================================================
--- linux-2.6-git-disable-vmstat-worker.orig/include/linux/vmstat.h	2017-04-24 18:52:42.957724687 -0300
+++ linux-2.6-git-disable-vmstat-worker/include/linux/vmstat.h	2017-04-24 18:53:15.086793496 -0300
@@ -233,7 +233,6 @@
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
-void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
Index: linux-2.6-git-disable-vmstat-worker/mm/vmstat.c
===================================================================
--- linux-2.6-git-disable-vmstat-worker.orig/mm/vmstat.c	2017-04-24 18:52:42.957724687 -0300
+++ linux-2.6-git-disable-vmstat-worker/mm/vmstat.c	2017-04-24 18:53:53.075874785 -0300
@@ -1657,31 +1657,6 @@
 }
 
 /*
- * Switch off vmstat processing and then fold all the remaining differentials
- * until the diffs stay at zero. The function is used by NOHZ and can only be
- * invoked when tick processing is not active.
- */
-void quiet_vmstat(void)
-{
-	if (system_state != SYSTEM_RUNNING)
-		return;
-
-	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
-		return;
-
-	if (!need_update(smp_processor_id()))
-		return;
-
-	/*
-	 * Just refresh counters and do not care about the pending delayed
-	 * vmstat_update. It doesn't fire that often to matter and canceling
-	 * it would be too expensive from this path.
-	 * vmstat_shepherd will take care about that for us.
-	 */
-	refresh_cpu_vm_stats(false);
-}
-
-/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-04-25 13:57 ` Marcelo Tosatti
@ 2017-04-25 13:57   ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-04-25 13:57 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Luiz Capitulino, Rik van Riel, Linux RT Users, Marcelo Tosatti

[-- Attachment #1: vmstat-disable-vmstat-worker --]
[-- Type: text/plain, Size: 10313 bytes --]

The per-CPU vmstat worker periodically transfers accumulated per-CPU
vmstat counters to the global counters. This is a problem for -RT
workloads, where ideally the CPU is reserved entirely for the -RT
application, without interference.

To resolve the problem, introduce two tunables:

* Userspace-configurable per-CPU vmstat threshold: by default the
VM code calculates the per-CPU vmstat counter thresholds. This
tunable allows userspace to override those values.

* Userspace-configurable per-CPU vmstat worker: allows disabling
the per-CPU vmstat worker for a given CPU.

The patch below also adds documentation describing the tunables
in more detail.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 Documentation/vm/vmstat_thresholds.txt |   38 +++++
 mm/vmstat.c                            |  248 +++++++++++++++++++++++++++++++--
 2 files changed, 272 insertions(+), 14 deletions(-)

Index: linux-2.6-git-disable-vmstat-worker/mm/vmstat.c
===================================================================
--- linux-2.6-git-disable-vmstat-worker.orig/mm/vmstat.c	2017-04-25 07:39:13.941019853 -0300
+++ linux-2.6-git-disable-vmstat-worker/mm/vmstat.c	2017-04-25 10:44:51.581977296 -0300
@@ -91,8 +91,17 @@
 EXPORT_SYMBOL(vm_zone_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
+struct vmstat_uparam {
+	atomic_t vmstat_work_enabled;
+	atomic_t user_stat_thresh;
+};
+
+static DEFINE_PER_CPU(struct vmstat_uparam, vmstat_uparam);
+
 #ifdef CONFIG_SMP
 
+#define MAX_THRESHOLD 125
+
 int calculate_pressure_threshold(struct zone *zone)
 {
 	int threshold;
@@ -110,9 +119,9 @@
 	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
 
 	/*
-	 * Maximum threshold is 125
+	 * Maximum threshold is MAX_THRESHOLD == 125
 	 */
-	threshold = min(125, threshold);
+	threshold = min(MAX_THRESHOLD, threshold);
 
 	return threshold;
 }
@@ -188,15 +197,31 @@
 		threshold = calculate_normal_threshold(zone);
 
 		for_each_online_cpu(cpu) {
-			int pgdat_threshold;
+			int pgdat_threshold, ustat_thresh;
+			struct vmstat_uparam *vup;
 
-			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
-							= threshold;
+			struct per_cpu_nodestat __percpu *pcp;
+			struct per_cpu_pageset *p;
+
+			p = per_cpu_ptr(zone->pageset, cpu);
+
+			vup = &per_cpu(vmstat_uparam, cpu);
+			ustat_thresh = atomic_read(&vup->user_stat_thresh);
+
+			if (ustat_thresh)
+				p->stat_threshold = ustat_thresh;
+			else
+				p->stat_threshold = threshold;
+
+			pcp = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
 
 			/* Base nodestat threshold on the largest populated zone. */
-			pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
-			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
-				= max(threshold, pgdat_threshold);
+			pgdat_threshold = pcp->stat_threshold;
+			if (ustat_thresh)
+				pcp->stat_threshold = ustat_thresh;
+			else
+				pcp->stat_threshold = max(threshold,
+							  pgdat_threshold);
 		}
 
 		/*
@@ -226,9 +251,24 @@
 			continue;
 
 		threshold = (*calculate_pressure)(zone);
-		for_each_online_cpu(cpu)
+		for_each_online_cpu(cpu) {
+			int t, ustat_thresh;
+			struct vmstat_uparam *vup;
+
+			vup = &per_cpu(vmstat_uparam, cpu);
+			ustat_thresh = atomic_read(&vup->user_stat_thresh);
+			t = threshold;
+
+			/*
+			 * min because pressure could cause
+			 * calculate_pressure'ed value to be smaller.
+			 */
+			if (ustat_thresh)
+				t = min(threshold, ustat_thresh);
+
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
-							= threshold;
+							= t;
+		}
 	}
 }
 
@@ -1567,6 +1607,9 @@
 	long val;
 	int err;
 	int i;
+	int cpu;
+	struct work_struct __percpu *works;
+	static struct cpumask has_work;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1580,9 +1623,31 @@
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
+
+	cpumask_clear(&has_work);
+	get_online_cpus();
+
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+		struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
+
+		if (atomic_read(&vup->vmstat_work_enabled)) {
+			INIT_WORK(work, refresh_vm_stats);
+			schedule_work_on(cpu, work);
+			cpumask_set_cpu(cpu, &has_work);
+		}
+	}
+
+	for_each_cpu(cpu, &has_work)
+		flush_work(per_cpu_ptr(works, cpu));
+
+	put_online_cpus();
+	free_percpu(works);
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
@@ -1674,6 +1739,10 @@
 	/* Check processors whose vmstat worker threads have been disabled */
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
+		struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
+
+		if (atomic_read(&vup->vmstat_work_enabled) == 0)
+			continue;
 
 		if (!delayed_work_pending(dw) && need_update(cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
@@ -1696,6 +1765,135 @@
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+#ifdef CONFIG_SYSFS
+
+static ssize_t vmstat_worker_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	unsigned int cpu = dev->id;
+	struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
+
+	return sprintf(buf, "%d\n", atomic_read(&vup->vmstat_work_enabled));
+}
+
+static ssize_t vmstat_worker_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int ret, val;
+	struct vmstat_uparam *vup;
+	unsigned int cpu = dev->id;
+
+	ret = sscanf(buf, "%d", &val);
+	if (ret != 1 || val > 1 || val < 0)
+		return -EINVAL;
+
+	preempt_disable();
+
+	if (cpu_online(cpu)) {
+		vup = &per_cpu(vmstat_uparam, cpu);
+		atomic_set(&vup->vmstat_work_enabled, val);
+	} else
+		count = -EINVAL;
+
+	preempt_enable();
+
+	return count;
+}
+
+static ssize_t vmstat_thresh_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	int ret;
+	struct vmstat_uparam *vup;
+	unsigned int cpu = dev->id;
+
+	preempt_disable();
+
+	vup = &per_cpu(vmstat_uparam, cpu);
+	ret = sprintf(buf, "%d\n", atomic_read(&vup->user_stat_thresh));
+
+	preempt_enable();
+
+	return ret;
+}
+
+static ssize_t vmstat_thresh_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int ret, val;
+	unsigned int cpu = dev->id;
+	struct vmstat_uparam *vup;
+
+	ret = sscanf(buf, "%d", &val);
+	if (ret != 1 || val < 1 || val > MAX_THRESHOLD)
+		return -EINVAL;
+
+	preempt_disable();
+
+	if (cpu_online(cpu)) {
+		vup = &per_cpu(vmstat_uparam, cpu);
+		atomic_set(&vup->user_stat_thresh, val);
+	} else
+		count = -EINVAL;
+
+	preempt_enable();
+
+	return count;
+}
+
+struct device_attribute vmstat_worker_attr =
+	__ATTR(vmstat_worker, 0644, vmstat_worker_show, vmstat_worker_store);
+
+struct device_attribute vmstat_threshold_attr =
+	__ATTR(vmstat_threshold, 0644, vmstat_thresh_show, vmstat_thresh_store);
+
+static struct attribute *vmstat_attrs[] = {
+	&vmstat_worker_attr.attr,
+	&vmstat_threshold_attr.attr,
+	NULL
+};
+
+static struct attribute_group vmstat_attr_group = {
+	.attrs  =  vmstat_attrs,
+	.name   = "vmstat"
+};
+
+static int vmstat_thresh_cpu_online(unsigned int cpu)
+{
+	struct device *dev = get_cpu_device(cpu);
+	int ret;
+
+	ret = sysfs_create_group(&dev->kobj, &vmstat_attr_group);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int vmstat_thresh_cpu_down_prep(unsigned int cpu)
+{
+	struct device *dev = get_cpu_device(cpu);
+
+	sysfs_remove_group(&dev->kobj, &vmstat_attr_group);
+	return 0;
+}
+
+static void init_vmstat_sysfs(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
+
+		atomic_set(&vup->user_stat_thresh, 0);
+		atomic_set(&vup->vmstat_work_enabled, 1);
+	}
+}
+
+#endif /* CONFIG_SYSFS */
+
 static void __init init_cpu_node_state(void)
 {
 	int node;
@@ -1723,9 +1921,13 @@
 {
 	const struct cpumask *node_cpus;
 	int node;
+	struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
 
 	node = cpu_to_node(cpu);
 
+	atomic_set(&vup->user_stat_thresh, 0);
+	atomic_set(&vup->vmstat_work_enabled, 1);
+
 	refresh_zone_stat_thresholds();
 	node_cpus = cpumask_of_node(node);
 	if (cpumask_weight(node_cpus) > 0)
@@ -1735,7 +1937,7 @@
 	return 0;
 }
 
-#endif
+#endif /* CONFIG_SMP */
 
 struct workqueue_struct *mm_percpu_wq;
 
@@ -1772,6 +1974,24 @@
 #endif
 }
 
+static int __init init_mm_internals_late(void)
+{
+#ifdef CONFIG_SYSFS
+	int ret;
+
+	init_vmstat_sysfs();
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/vmstat_thresh:online",
+					vmstat_thresh_cpu_online,
+					vmstat_thresh_cpu_down_prep);
+	if (ret < 0)
+		pr_err("vmstat_thresh: failed to register 'online' hotplug state\n");
+#endif
+	return 0;
+}
+
+late_initcall(init_mm_internals_late);
+
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)
 
 /*
Index: linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt	2017-04-25 08:46:25.237395070 -0300
@@ -0,0 +1,38 @@
+Userspace configurable vmstat thresholds
+========================================
+
+This document describes the tunables to control
+per-CPU vmstat threshold and per-CPU vmstat worker
+thread.
+
+/sys/devices/system/cpu/cpuN/vmstat/vmstat_threshold:
+
+This file contains the per-CPU vmstat threshold.
+This value is the maximum that a single per-CPU vmstat statistic
+can accumulate before transferring to the global counters.
+
+A value of 0 indicates that the threshold is
+chosen by the in-kernel algorithm (the default).
+
+A non-zero value indicates that this particular
+value is used as the per-CPU threshold.
+
+/sys/devices/system/cpu/cpuN/vmstat/vmstat_worker:
+
+Enable/disable the per-CPU vmstat worker.
+
+Usage example:
+=============
+
+To disable the vmstat_update worker for CPU 1:
+
+# cd /sys/devices/system/cpu/cpu1/vmstat/
+
+# echo 1 > vmstat_threshold
+# echo 0 > vmstat_worker
+
+Setting vmstat_threshold to 1 makes CPU 1 fold its
+per-CPU vmstat counters into the global counters
+immediately, so they never go stale.
+
+
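
For reference, the threshold selection implemented by the hunks above
can be modeled in plain shell arithmetic (a simplified sketch, not
kernel code; MAX_THRESHOLD and the min()/max() clamping mirror the
patch):

```shell
# Simplified model of the patched threshold selection:
#  - default pressure threshold: watermark_distance / online_cpus,
#    clamped to the range [1, MAX_THRESHOLD]
#  - a non-zero user threshold replaces the computed one, except under
#    memory pressure, where the smaller of the two values wins.
MAX_THRESHOLD=125

pressure_threshold() {          # args: watermark_distance online_cpus
	t=$(( $1 / $2 ))
	[ "$t" -lt 1 ] && t=1
	[ "$t" -gt "$MAX_THRESHOLD" ] && t=$MAX_THRESHOLD
	echo "$t"
}

normal_pick() {                 # args: kernel_threshold user_threshold
	if [ "$2" -ne 0 ]; then echo "$2"; else echo "$1"; fi
}

pressure_pick() {               # min(): pressure may lower it further
	if [ "$2" -ne 0 ] && [ "$2" -lt "$1" ]; then
		echo "$2"
	else
		echo "$1"
	fi
}
```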

^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-04-25 13:57   ` Marcelo Tosatti
  (?)
@ 2017-04-25 19:29     ` Rik van Riel
  -1 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2017-04-25 19:29 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel, linux-mm; +Cc: Luiz Capitulino, Linux RT Users

On Tue, 2017-04-25 at 10:57 -0300, Marcelo Tosatti wrote:
> The per-CPU vmstat worker is a problem on -RT workloads (because
> ideally the CPU is entirely reserved for the -RT app, without
> interference). The worker transfers accumulated per-CPU 
> vmstat counters to global counters.
> 
> To resolve the problem, create two tunables:
> 
> * Userspace configurable per-CPU vmstat threshold: by default the 
> VM code calculates the size of the per-CPU vmstat arrays. This 
> tunable allows userspace to configure the values.
> 
> * Userspace configurable per-CPU vmstat worker: allow disabling
> the per-CPU vmstat worker.
> 
> The patch below contains documentation which describes the tunables
> in more detail.

The documentation says what the tunables do, but
not how you should set them in different scenarios,
or why.

That could be a little more helpful to sysadmins.

^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-04-25 19:29     ` Rik van Riel
@ 2017-04-25 19:36       ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-04-25 19:36 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Luiz Capitulino, Linux RT Users

On Tue, Apr 25, 2017 at 03:29:06PM -0400, Rik van Riel wrote:
> On Tue, 2017-04-25 at 10:57 -0300, Marcelo Tosatti wrote:
> > The per-CPU vmstat worker is a problem on -RT workloads (because
> > ideally the CPU is entirely reserved for the -RT app, without
> > interference). The worker transfers accumulated per-CPU 
> > vmstat counters to global counters.
> > 
> > To resolve the problem, create two tunables:
> > 
> > * Userspace configurable per-CPU vmstat threshold: by default the 
> > VM code calculates the size of the per-CPU vmstat arrays. This 
> > tunable allows userspace to configure the values.
> > 
> > * Userspace configurable per-CPU vmstat worker: allow disabling
> > the per-CPU vmstat worker.
> > 
> > The patch below contains documentation which describes the tunables
> > in more detail.
> 
> The documentation says what the tunables do, but
> not how you should set them in different scenarios,
> or why.
> 
> That could be a little more helpful to sysadmins.

OK, I'll update the document to be more verbose.

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-04-25 13:57   ` Marcelo Tosatti
@ 2017-05-02 14:28     ` Luiz Capitulino
  -1 siblings, 0 replies; 66+ messages in thread
From: Luiz Capitulino @ 2017-05-02 14:28 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, linux-mm, Rik van Riel, Linux RT Users, cl, cmetcalf

On Tue, 25 Apr 2017 10:57:19 -0300
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> The per-CPU vmstat worker is a problem on -RT workloads (because
> ideally the CPU is entirely reserved for the -RT app, without
> interference). The worker transfers accumulated per-CPU 
> vmstat counters to global counters.

This is a problem for non-RT too. Any task pinned to an isolated
CPU that doesn't want to be ever interrupted will be interrupted
by the vmstat kworker.

> To resolve the problem, create two tunables:
> 
> * Userspace configurable per-CPU vmstat threshold: by default the 
> VM code calculates the size of the per-CPU vmstat arrays. This 
> tunable allows userspace to configure the values.
> 
> * Userspace configurable per-CPU vmstat worker: allow disabling
> the per-CPU vmstat worker.

I have several questions about the tunables:

 - What does the vmstat_threshold value mean? What are the implications
   of changing this value? What's the difference in choosing 1, 2, 3
   or 500?

 - If the purpose of having vmstat_threshold is to allow disabling
   the vmstat kworker, why can't the kernel pick a value automatically?

 - What are the implications of disabling the vmstat kworker? Will vm
   stats still be collected someway or will it be completely off for
   the CPU?

Also, shouldn't this patch be split into two?

> The patch below contains documentation which describes the tunables
> in more detail.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  Documentation/vm/vmstat_thresholds.txt |   38 +++++
>  mm/vmstat.c                            |  248 +++++++++++++++++++++++++++++++--
>  2 files changed, 272 insertions(+), 14 deletions(-)
> 
> Index: linux-2.6-git-disable-vmstat-worker/mm/vmstat.c
> ===================================================================
> --- linux-2.6-git-disable-vmstat-worker.orig/mm/vmstat.c	2017-04-25 07:39:13.941019853 -0300
> +++ linux-2.6-git-disable-vmstat-worker/mm/vmstat.c	2017-04-25 10:44:51.581977296 -0300
> @@ -91,8 +91,17 @@
>  EXPORT_SYMBOL(vm_zone_stat);
>  EXPORT_SYMBOL(vm_node_stat);
>  
> +struct vmstat_uparam {
> +	atomic_t vmstat_work_enabled;
> +	atomic_t user_stat_thresh;
> +};
> +
> +static DEFINE_PER_CPU(struct vmstat_uparam, vmstat_uparam);
> +
>  #ifdef CONFIG_SMP
>  
> +#define MAX_THRESHOLD 125
> +
>  int calculate_pressure_threshold(struct zone *zone)
>  {
>  	int threshold;
> @@ -110,9 +119,9 @@
>  	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
>  
>  	/*
> -	 * Maximum threshold is 125
> +	 * Maximum threshold is MAX_THRESHOLD == 125
>  	 */
> -	threshold = min(125, threshold);
> +	threshold = min(MAX_THRESHOLD, threshold);
>  
>  	return threshold;
>  }
> @@ -188,15 +197,31 @@
>  		threshold = calculate_normal_threshold(zone);
>  
>  		for_each_online_cpu(cpu) {
> -			int pgdat_threshold;
> +			int pgdat_threshold, ustat_thresh;
> +			struct vmstat_uparam *vup;
>  
> -			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> -							= threshold;
> +			struct per_cpu_nodestat __percpu *pcp;
> +			struct per_cpu_pageset *p;
> +
> +			p = per_cpu_ptr(zone->pageset, cpu);
> +
> +			vup = &per_cpu(vmstat_uparam, cpu);
> +			ustat_thresh = atomic_read(&vup->user_stat_thresh);
> +
> +			if (ustat_thresh)
> +				p->stat_threshold = ustat_thresh;
> +			else
> +				p->stat_threshold = threshold;
> +
> +			pcp = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
>  
>  			/* Base nodestat threshold on the largest populated zone. */
> -			pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
> -			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
> -				= max(threshold, pgdat_threshold);
> +			pgdat_threshold = pcp->stat_threshold;
> +			if (ustat_thresh)
> +				pcp->stat_threshold = ustat_thresh;
> +			else
> +				pcp->stat_threshold = max(threshold,
> +							  pgdat_threshold);
>  		}
>  
>  		/*
> @@ -226,9 +251,24 @@
>  			continue;
>  
>  		threshold = (*calculate_pressure)(zone);
> -		for_each_online_cpu(cpu)
> +		for_each_online_cpu(cpu) {
> +			int t, ustat_thresh;
> +			struct vmstat_uparam *vup;
> +
> +			vup = &per_cpu(vmstat_uparam, cpu);
> +			ustat_thresh = atomic_read(&vup->user_stat_thresh);
> +			t = threshold;
> +
> +			/*
> +			 * min because pressure could cause
> +			 * calculate_pressure'ed value to be smaller.
> +			 */
> +			if (ustat_thresh)
> +				t = min(threshold, ustat_thresh);
> +
>  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> -							= threshold;
> +							= t;
> +		}
>  	}
>  }
>  
> @@ -1567,6 +1607,9 @@
>  	long val;
>  	int err;
>  	int i;
> +	int cpu;
> +	struct work_struct __percpu *works;
> +	static struct cpumask has_work;
>  
>  	/*
>  	 * The regular update, every sysctl_stat_interval, may come later
> @@ -1580,9 +1623,31 @@
>  	 * transiently negative values, report an error here if any of
>  	 * the stats is negative, so we know to go looking for imbalance.
>  	 */
> -	err = schedule_on_each_cpu(refresh_vm_stats);
> -	if (err)
> -		return err;
> +
> +	works = alloc_percpu(struct work_struct);
> +	if (!works)
> +		return -ENOMEM;
> +
> +	cpumask_clear(&has_work);
> +	get_online_cpus();
> +
> +	for_each_online_cpu(cpu) {
> +		struct work_struct *work = per_cpu_ptr(works, cpu);
> +		struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
> +
> +		if (atomic_read(&vup->vmstat_work_enabled)) {
> +			INIT_WORK(work, refresh_vm_stats);
> +			schedule_work_on(cpu, work);
> +			cpumask_set_cpu(cpu, &has_work);
> +		}
> +	}
> +
> +	for_each_cpu(cpu, &has_work)
> +		flush_work(per_cpu_ptr(works, cpu));
> +
> +	put_online_cpus();
> +	free_percpu(works);
> +
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
>  		if (val < 0) {
> @@ -1674,6 +1739,10 @@
>  	/* Check processors whose vmstat worker threads have been disabled */
>  	for_each_online_cpu(cpu) {
>  		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> +		struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
> +
> +		if (atomic_read(&vup->vmstat_work_enabled) == 0)
> +			continue;
>  
>  		if (!delayed_work_pending(dw) && need_update(cpu))
>  			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> @@ -1696,6 +1765,135 @@
>  		round_jiffies_relative(sysctl_stat_interval));
>  }
>  
> +#ifdef CONFIG_SYSFS
> +
> +static ssize_t vmstat_worker_show(struct device *dev,
> +				  struct device_attribute *attr, char *buf)
> +{
> +	unsigned int cpu = dev->id;
> +	struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
> +
> +	return sprintf(buf, "%d\n", atomic_read(&vup->vmstat_work_enabled));
> +}
> +
> +static ssize_t vmstat_worker_store(struct device *dev,
> +				   struct device_attribute *attr,
> +				   const char *buf, size_t count)
> +{
> +	int ret, val;
> +	struct vmstat_uparam *vup;
> +	unsigned int cpu = dev->id;
> +
> +	ret = sscanf(buf, "%d", &val);
> +	if (ret != 1 || val > 1 || val < 0)
> +		return -EINVAL;
> +
> +	preempt_disable();
> +
> +	if (cpu_online(cpu)) {
> +		vup = &per_cpu(vmstat_uparam, cpu);
> +		atomic_set(&vup->vmstat_work_enabled, val);
> +	} else
> +		count = -EINVAL;
> +
> +	preempt_enable();
> +
> +	return count;
> +}
> +
> +static ssize_t vmstat_thresh_show(struct device *dev,
> +				  struct device_attribute *attr, char *buf)
> +{
> +	int ret;
> +	struct vmstat_uparam *vup;
> +	unsigned int cpu = dev->id;
> +
> +	preempt_disable();
> +
> +	vup = &per_cpu(vmstat_uparam, cpu);
> +	ret = sprintf(buf, "%d\n", atomic_read(&vup->user_stat_thresh));
> +
> +	preempt_enable();
> +
> +	return ret;
> +}
> +
> +static ssize_t vmstat_thresh_store(struct device *dev,
> +				   struct device_attribute *attr,
> +				   const char *buf, size_t count)
> +{
> +	int ret, val;
> +	unsigned int cpu = dev->id;
> +	struct vmstat_uparam *vup;
> +
> +	ret = sscanf(buf, "%d", &val);
> +	if (ret != 1 || val < 1 || val > MAX_THRESHOLD)
> +		return -EINVAL;
> +
> +	preempt_disable();
> +
> +	if (cpu_online(cpu)) {
> +		vup = &per_cpu(vmstat_uparam, cpu);
> +		atomic_set(&vup->user_stat_thresh, val);
> +	} else
> +		count = -EINVAL;
> +
> +	preempt_enable();
> +
> +	return count;
> +}
> +
> +struct device_attribute vmstat_worker_attr =
> +	__ATTR(vmstat_worker, 0644, vmstat_worker_show, vmstat_worker_store);
> +
> +struct device_attribute vmstat_threshold_attr =
> +	__ATTR(vmstat_threshold, 0644, vmstat_thresh_show, vmstat_thresh_store);
> +
> +static struct attribute *vmstat_attrs[] = {
> +	&vmstat_worker_attr.attr,
> +	&vmstat_threshold_attr.attr,
> +	NULL
> +};
> +
> +static struct attribute_group vmstat_attr_group = {
> +	.attrs  =  vmstat_attrs,
> +	.name   = "vmstat"
> +};
> +
> +static int vmstat_thresh_cpu_online(unsigned int cpu)
> +{
> +	struct device *dev = get_cpu_device(cpu);
> +	int ret;
> +
> +	ret = sysfs_create_group(&dev->kobj, &vmstat_attr_group);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +static int vmstat_thresh_cpu_down_prep(unsigned int cpu)
> +{
> +	struct device *dev = get_cpu_device(cpu);
> +
> +	sysfs_remove_group(&dev->kobj, &vmstat_attr_group);
> +	return 0;
> +}
> +
> +static void init_vmstat_sysfs(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
> +
> +		atomic_set(&vup->user_stat_thresh, 0);
> +		atomic_set(&vup->vmstat_work_enabled, 1);
> +	}
> +}
> +
> +#endif /* CONFIG_SYSFS */
> +
>  static void __init init_cpu_node_state(void)
>  {
>  	int node;
> @@ -1723,9 +1921,13 @@
>  {
>  	const struct cpumask *node_cpus;
>  	int node;
> +	struct vmstat_uparam *vup = &per_cpu(vmstat_uparam, cpu);
>  
>  	node = cpu_to_node(cpu);
>  
> +	atomic_set(&vup->user_stat_thresh, 0);
> +	atomic_set(&vup->vmstat_work_enabled, 1);
> +
>  	refresh_zone_stat_thresholds();
>  	node_cpus = cpumask_of_node(node);
>  	if (cpumask_weight(node_cpus) > 0)
> @@ -1735,7 +1937,7 @@
>  	return 0;
>  }
>  
> -#endif
> +#endif /* CONFIG_SMP */
>  
>  struct workqueue_struct *mm_percpu_wq;
>  
> @@ -1772,6 +1974,24 @@
>  #endif
>  }
>  
> +static int __init init_mm_internals_late(void)
> +{
> +#ifdef CONFIG_SYSFS
> +	int ret;
> +
> +	init_vmstat_sysfs();
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/vmstat_thresh:online",
> +					vmstat_thresh_cpu_online,
> +					vmstat_thresh_cpu_down_prep);
> +	if (ret < 0)
> +		pr_err("vmstat_thresh: failed to register 'online' hotplug state\n");
> +#endif
> +	return 0;
> +}
> +
> +late_initcall(init_mm_internals_late);
> +
>  #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)
>  
>  /*
> Index: linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt	2017-04-25 08:46:25.237395070 -0300
> @@ -0,0 +1,38 @@
> +Userspace configurable vmstat thresholds
> +========================================
> +
> +This document describes the tunables to control
> +per-CPU vmstat threshold and per-CPU vmstat worker
> +thread.
> +
> +/sys/devices/system/cpu/cpuN/vmstat/vmstat_threshold:
> +
> +This file contains the per-CPU vmstat threshold.
> +This value is the maximum that a single per-CPU vmstat statistic
> +can accumulate before transferring to the global counters.
> +
> +A value of 0 indicates that the value is set
> +by the in kernel algorithm.
> +
> +A value different than 0 indicates that particular
> +value is used for vmstat_threshold.
> +
> +/sys/devices/system/cpu/cpuN/vmstat/vmstat_worker:
> +
> +Enable/disable the per-CPU vmstat worker.
> +
> +Usage example:
> +=============
> +
> +To disable vmstat_update worker for cpu1:
> +
> +cd /sys/devices/system/cpu/cpu1/vmstat/
> +
> +# echo 1 > vmstat_threshold
> +# echo 0 > vmstat_worker
> +
> +Setting vmstat_threshold to 1 means the per-CPU
> +vmstat statistics will not be out-of-date
> +for CPU 1.
> +
> +
> 
> 


* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-02 14:28     ` Luiz Capitulino
@ 2017-05-02 16:52       ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-02 16:52 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-kernel, linux-mm, Rik van Riel, Linux RT Users, cl, cmetcalf

On Tue, May 02, 2017 at 10:28:36AM -0400, Luiz Capitulino wrote:
> On Tue, 25 Apr 2017 10:57:19 -0300
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > The per-CPU vmstat worker is a problem on -RT workloads (because
> > ideally the CPU is entirely reserved for the -RT app, without
> > interference). The worker transfers accumulated per-CPU 
> > vmstat counters to global counters.
> 
> This is a problem for non-RT too. Any task pinned to an isolated
> CPU that doesn't want to be ever interrupted will be interrupted
> by the vmstat kworker.
> 
> > To resolve the problem, create two tunables:
> > 
> > * Userspace configurable per-CPU vmstat threshold: by default the 
> > VM code calculates the size of the per-CPU vmstat arrays. This 
> > tunable allows userspace to configure the values.
> > 
> > * Userspace configurable per-CPU vmstat worker: allow disabling
> > the per-CPU vmstat worker.
>
> I have several questions about the tunables:
> 
>  - What does the vmstat_threshold value mean? What are the implications
>    of changing this value? What's the difference in choosing 1, 2, 3
>    or 500?

It's the maximum value a per-CPU vmstat statistics counter can
hold. Once a counter crosses that value, the statistics are
transferred to the global counter:

void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
                                long delta)
{
        struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
        s8 __percpu *p = pcp->vm_node_stat_diff + item;
        long x;
        long t;

        x = delta + __this_cpu_read(*p);

        t = __this_cpu_read(pcp->stat_threshold);

        if (unlikely(x > t || x < -t)) {
                node_page_state_add(x, pgdat, item);
                x = 0;
        }
        __this_cpu_write(*p, x);
}
EXPORT_SYMBOL(__mod_node_page_state);

BTW, there is an off-by-one bug there; the check should be:

        if (unlikely(x >= t || x <= -t)) {

Increasing the threshold value does two things:
	1) It decreases the number of inter-processor accesses.
	2) It increases how much the global counters stay out of
	   sync relative to actual current values.

>  - If the purpose of having vmstat_threshold is to allow disabling
>    the vmstat kworker, why can't the kernel pick a value automatically?

Because the user might find a small amount of drift in the
global counters acceptable in exchange for performance (one
would have to analyze the situation).

Setting vmstat_threshold == 1 means the global counter is always
in sync with the page counter state of the pCPU.

>  - What are the implications of disabling the vmstat kworker? Will vm
>    stats still be collected someway or will it be completely off for
>    the CPU?

There is nothing left for the worker to collect: on every
modification of the vm statistics, pCPUs with vmstat_threshold=1
transfer their values to the global counters immediately (that is,
statistics are no longer queued locally for performance).

> Also, shouldn't this patch be split into two?

First add one sysfs file, then add another sysfs file, you mean?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-02 16:52       ` Marcelo Tosatti
@ 2017-05-02 17:15         ` Luiz Capitulino
  -1 siblings, 0 replies; 66+ messages in thread
From: Luiz Capitulino @ 2017-05-02 17:15 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, linux-mm, Rik van Riel, Linux RT Users, cl, cmetcalf

On Tue, 2 May 2017 13:52:00 -0300
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > I have several questions about the tunables:
> > 
> >  - What does the vmstat_threshold value mean? What are the implications
> >    of changing this value? What's the difference in choosing 1, 2, 3
> >    or 500?  
> 
> Its the maximum value for a vmstat statistics counter to hold. After
> that value, the statistics are transferred to the global counter:
> 
> void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>                                 long delta)
> {
>         struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>         s8 __percpu *p = pcp->vm_node_stat_diff + item;
>         long x;
>         long t;
> 
>         x = delta + __this_cpu_read(*p);
> 
>         t = __this_cpu_read(pcp->stat_threshold);
> 
>         if (unlikely(x > t || x < -t)) {
>                 node_page_state_add(x, pgdat, item);
>                 x = 0;
>         }
>         __this_cpu_write(*p, x);
> }
> EXPORT_SYMBOL(__mod_node_page_state);
> 
> BTW, there is a bug there, should change that to:
> 
>         if (unlikely(x >= t || x <= -t)) {
> 
> Increasing the threshold value does two things:
> 	1) It decreases the number of inter-processor accesses.
> 	2) It increases how much the global counters stay out of
> 	   sync relative to actual current values.

OK, but I'm mostly concerned with the sysadmin who will have
to change the tunable. So, I think it's a good idea to improve
the doc to include that information.

> >  - If the purpose of having vmstat_threshold is to allow disabling
> >    the vmstat kworker, why can't the kernel pick a value automatically?  
> 
> Because it might be acceptable for the user to accept a small 
> out of syncedness of the global counters in favour of performance
> (one would have to analyze the situation).
> 
> Setting vmstat_threshold == 1 means the global counter is always
> in sync with the page counter state of the pCPU.

IMHO, if vmstat_threshold == 1 is the required setting for
disabling the vmstat kworker then I'd go with only one tunable
for now. But that's just a suggestion.

> 
> >  - What are the implications of disabling the vmstat kworker? Will vm
> >    stats still be collected someway or will it be completely off for
> >    the CPU?  
> 
> It will not be necessary to collect vmstats because at every modification
> of the vm statistics, pCPUs with vmstat_threshold=1 transfer their 
> values to the global counters (that is, there is no queueing of statistics
> locally to improve performance).

Ah, OK. Got this now. I'll give this patch a try. But I think we want
to hear from Christoph (who worked on reducing the vmstat interruptions
in the past).

> > Also, shouldn't this patch be split into two?  
> 
> First add one sysfs file, then add another sysfs file, you mean?

Yes, one tunable per patch.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-02 17:15         ` Luiz Capitulino
  (?)
@ 2017-05-02 17:21           ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-02 17:21 UTC (permalink / raw)
  To: Luiz Capitulino, Christoph Lameter
  Cc: linux-kernel, linux-mm, Rik van Riel, Linux RT Users, cl, cmetcalf

On Tue, May 02, 2017 at 01:15:27PM -0400, Luiz Capitulino wrote:
> On Tue, 2 May 2017 13:52:00 -0300
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > I have several questions about the tunables:
> > > 
> > >  - What does the vmstat_threshold value mean? What are the implications
> > >    of changing this value? What's the difference in choosing 1, 2, 3
> > >    or 500?  
> > 
> > Its the maximum value for a vmstat statistics counter to hold. After
> > that value, the statistics are transferred to the global counter:
> > 
> > void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> >                                 long delta)
> > {
> >         struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> >         s8 __percpu *p = pcp->vm_node_stat_diff + item;
> >         long x;
> >         long t;
> > 
> >         x = delta + __this_cpu_read(*p);
> > 
> >         t = __this_cpu_read(pcp->stat_threshold);
> > 
> >         if (unlikely(x > t || x < -t)) {
> >                 node_page_state_add(x, pgdat, item);
> >                 x = 0;
> >         }
> >         __this_cpu_write(*p, x);
> > }
> > EXPORT_SYMBOL(__mod_node_page_state);
> > 
> > BTW, there is a bug there, should change that to:
> > 
> >         if (unlikely(x >= t || x <= -t)) {
> > 
> > Increasing the threshold value does two things:
> > 	1) It decreases the number of inter-processor accesses.
> > 	2) It increases how much the global counters stay out of
> > 	   sync relative to actual current values.
> 
> OK, but I'm mostly concerned with the sysadmin who will have
> to change the tunable. So, I think it's a good idea to improve
> the doc to contain that information.

Yes, how about this:

Index: linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt	2017-05-02 13:48:45.946840708 -0300
@@ -0,0 +1,78 @@
+Userspace configurable vmstat thresholds
+========================================
+
+This document describes the tunables to control
+per-CPU vmstat threshold and per-CPU vmstat worker
+thread.
+
+/sys/devices/system/cpu/cpuN/vmstat/vmstat_threshold:
+
+This file contains the per-CPU vmstat threshold.
+This value is the maximum that a single per-CPU vmstat statistic
+can accumulate before transferring to the global counters.
+
+A value of 0 indicates that the threshold is set
+by the in-kernel algorithm (the default).
+
+A non-zero value is used directly as the
+per-CPU vmstat_threshold.
+
+/sys/devices/system/cpu/cpuN/vmstat/vmstat_worker:
+
+Enable/disable the per-CPU vmstat worker.
+
+What does the vmstat_threshold value mean? What are the implications
+of changing this value? What's the difference in choosing 1, 2, 3
+or 500?
+====================================================================
+
+It's the maximum value a per-CPU vmstat statistics counter
+can hold. Once a counter crosses that value, the statistics
+are transferred to the global counter:
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+                                long delta)
+{
+        struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+        s8 __percpu *p = pcp->vm_node_stat_diff + item;
+        long x;
+        long t;
+
+        x = delta + __this_cpu_read(*p);
+
+        t = __this_cpu_read(pcp->stat_threshold);
+
+        if (unlikely(x > t || x < -t)) {
+                node_page_state_add(x, pgdat, item);
+                x = 0;
+        }
+        __this_cpu_write(*p, x);
+}
+
+Increasing the threshold value does two things:
+        1) It decreases the number of inter-processor accesses.
+        2) It increases how much the global counters stay out of
+           sync relative to actual current values.
+
+
+Usage example:
+=============
+
+In a realtime system, the worker thread waking up and executing
+vmstat_update can be an undesired source of latencies.
+
+To prevent the worker thread from waking up and executing
+vmstat_update on cpu 1, for example, perform the following steps:
+
+$ cd /sys/devices/system/cpu/cpu1/vmstat/
+
+# Set vmstat threshold to 1 for cpu1, so that no
+# vmstat statistics are collected in cpu1's per-cpu
+# stats; instead they are immediately transferred
+# to the global counter.
+
+$ echo 1 > vmstat_threshold
+
+# Disable vmstat_update worker for cpu1:
+$ echo 0 > vmstat_worker
+


> > >  - If the purpose of having vmstat_threshold is to allow disabling
> > >    the vmstat kworker, why can't the kernel pick a value automatically?  
> > 
> > Because it might be acceptable for the user to accept a small 
> > out of syncedness of the global counters in favour of performance
> > (one would have to analyze the situation).
> > 
> > Setting vmstat_threshold == 1 means the global counter is always
> > in sync with the page counter state of the pCPU.
> 
> IMHO, if vmstat_threshold == 1 is the required setting for
> disabling the vmstat kworker then I'd go with only one tunable
> for now. But that's just a suggestion.

I didn't want to force that on the user, because allowing
different threshold values covers more cases.

> > >  - What are the implications of disabling the vmstat kworker? Will vm
> > >    stats still be collected someway or will it be completely off for
> > >    the CPU?  
> > 
> > It will not be necessary to collect vmstats because at every modification
> > of the vm statistics, pCPUs with vmstat_threshold=1 transfer their 
> > values to the global counters (that is, there is no queueing of statistics
> > locally to improve performance).
> 
> Ah, OK. Got this now. I'll give this patch a try. But I think we want
> to hear from Christoph (who worked on reducing the vmstat interruptions
> in the past).

Christoph?

> > > Also, shouldn't this patch be split into two?  
> > 
> > First add one sysfs file, then add another sysfs file, you mean?
> 
> Yes, one tunable per patch.

Sure.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
@ 2017-05-02 17:21           ` Marcelo Tosatti
  0 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-02 17:21 UTC (permalink / raw)
  To: Luiz Capitulino, Christoph Lameter
  Cc: linux-kernel, linux-mm, Rik van Riel, Linux RT Users, cl, cmetcalf

On Tue, May 02, 2017 at 01:15:27PM -0400, Luiz Capitulino wrote:
> On Tue, 2 May 2017 13:52:00 -0300
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > I have several questions about the tunables:
> > > 
> > >  - What does the vmstat_threshold value mean? What are the implications
> > >    of changing this value? What's the difference in choosing 1, 2, 3
> > >    or 500?  
> > 
> > Its the maximum value for a vmstat statistics counter to hold. After
> > that value, the statistics are transferred to the global counter:
> > 
> > void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> >                                 long delta)
> > {
> >         struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> >         s8 __percpu *p = pcp->vm_node_stat_diff + item;
> >         long x;
> >         long t;
> > 
> >         x = delta + __this_cpu_read(*p);
> > 
> >         t = __this_cpu_read(pcp->stat_threshold);
> > 
> >         if (unlikely(x > t || x < -t)) {
> >                 node_page_state_add(x, pgdat, item);
> >                 x = 0;
> >         }
> >         __this_cpu_write(*p, x);
> > }
> > EXPORT_SYMBOL(__mod_node_page_state);
> > 
> > BTW, there is a bug there, should change that to:
> > 
> >         if (unlikely(x >= t || x <= -t)) {
> > 
> > Increasing the threshold value does two things:
> > 	1) It decreases the number of inter-processor accesses.
> > 	2) It increases how much the global counters stay out of
> > 	   sync relative to actual current values.
> 
> OK, but I'm mostly concerned with the sysadmin who will have
> to change the tunable. So, I think it's a good idea to improve
> the doc to contain that information.

Yes, how is that:

Index: linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt	2017-05-02 13:48:45.946840708 -0300
@@ -0,0 +1,78 @@
+Userspace configurable vmstat thresholds
+========================================
+
+This document describes the tunables to control
+per-CPU vmstat threshold and per-CPU vmstat worker
+thread.
+
+/sys/devices/system/cpu/cpuN/vmstat/vmstat_threshold:
+
+This file contains the per-CPU vmstat threshold.
+This value is the maximum that a single per-CPU vmstat statistic
+can accumulate before transferring to the global counters.
+
+A value of 0 indicates that the value is set
+by the in kernel algorithm.
+
+A value different than 0 indicates that particular
+value is used for vmstat_threshold.
+
+/sys/devices/system/cpu/cpuN/vmstat/vmstat_worker:
+
+Enable/disable the per-CPU vmstat worker.
+
+What does the vmstat_threshold value mean? What are the implications
+of changing this value? What's the difference in choosing 1, 2, 3
+or 500?
+====================================================================
+
+Its the maximum value for a vmstat statistics counter to hold. After
+that value, the statistics are transferred to the global counter:
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+                                long delta)
+{
+        struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+        s8 __percpu *p = pcp->vm_node_stat_diff + item;
+        long x;
+        long t;
+
+        x = delta + __this_cpu_read(*p);
+
+        t = __this_cpu_read(pcp->stat_threshold);
+
+        if (unlikely(x > t || x < -t)) {
+                node_page_state_add(x, pgdat, item);
+                x = 0;
+        }
+        __this_cpu_write(*p, x);
+}
+
+Increasing the threshold value does two things:
+        1) It decreases the number of inter-processor accesses.
+        2) It increases how much the global counters stay out of
+           sync relative to actual current values.
+
+
+Usage example:
+=============
+
+In a realtime system, the worker thread waking up and executing
+vmstat_update can be an undesired source of latencies.
+
+To avoid the worker thread from waking up, executing vmstat_update
+on cpu 1, for example, perform the following steps:
+
+
+cd /sys/devices/system/cpu/cpu0/vmstat/
+
+# Set vmstat threshold to 1 for cpu1, so that no
+# vmstat statistics are collected in cpu1's per-cpu
+# stats, instead they are immediately transferred
+# to the global counter.
+
+$ echo 1 > vmstat_threshold
+
+# Disable vmstat_update worker for cpu1:
+$ echo 0 > vmstat_worker
+


> > >  - If the purpose of having vmstat_threshold is to allow disabling
> > >    the vmstat kworker, why can't the kernel pick a value automatically?  
> > 
> > Because it might be acceptable for the user to accept a small 
> > out of syncedness of the global counters in favour of performance
> > (one would have to analyze the situation).
> > 
> > Setting vmstat_threshold == 1 means the global counter is always
> > in sync with the page counter state of the pCPU.
> 
> IMHO, if vmstat_threshold == 1 is the required setting for
> disabling the vmstat kworker then I'd go with only one tunable
> for now. But that's just a suggestion.

I didnt want to force that on the user because allowing different 
tunables covers more cases.

> > >  - What are the implications of disabling the vmstat kworker? Will vm
> > >    stats still be collected someway or will it be completely off for
> > >    the CPU?  
> > 
> > It will not be necessary to collect vmstats because at every modification
> > of the vm statistics, pCPUs with vmstat_threshold=1 transfer their 
> > values to the global counters (that is, there is no queueing of statistics
> > locally to improve performance).
> 
> Ah, OK. Got this now. I'll give this patch a try. But I think we want
> to hear from Christoph (who worked on reducing the vmstat interruptions
> in the past).

Christoph?

> > > Also, shouldn't this patch be split into two?  
> > 
> > First add one sysfs file, then add another sysfs file, you mean?
> 
> Yes, one tunable per patch.

Sure.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
@ 2017-05-02 17:21           ` Marcelo Tosatti
  0 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-02 17:21 UTC (permalink / raw)
  To: Luiz Capitulino, Christoph Lameter
  Cc: linux-kernel, linux-mm, Rik van Riel, Linux RT Users, cmetcalf

On Tue, May 02, 2017 at 01:15:27PM -0400, Luiz Capitulino wrote:
> On Tue, 2 May 2017 13:52:00 -0300
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > I have several questions about the tunables:
> > > 
> > >  - What does the vmstat_threshold value mean? What are the implications
> > >    of changing this value? What's the difference in choosing 1, 2, 3
> > >    or 500?  
> > 
> > Its the maximum value for a vmstat statistics counter to hold. After
> > that value, the statistics are transferred to the global counter:
> > 
> > void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> >                                 long delta)
> > {
> >         struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> >         s8 __percpu *p = pcp->vm_node_stat_diff + item;
> >         long x;
> >         long t;
> > 
> >         x = delta + __this_cpu_read(*p);
> > 
> >         t = __this_cpu_read(pcp->stat_threshold);
> > 
> >         if (unlikely(x > t || x < -t)) {
> >                 node_page_state_add(x, pgdat, item);
> >                 x = 0;
> >         }
> >         __this_cpu_write(*p, x);
> > }
> > EXPORT_SYMBOL(__mod_node_page_state);
> > 
> > BTW, there is a bug there, should change that to:
> > 
> >         if (unlikely(x >= t || x <= -t)) {
> > 
> > Increasing the threshold value does two things:
> > 	1) It decreases the number of inter-processor accesses.
> > 	2) It increases how much the global counters stay out of
> > 	   sync relative to actual current values.
> 
> OK, but I'm mostly concerned with the sysadmin who will have
> to change the tunable. So, I think it's a good idea to improve
> the doc to contain that information.

Yes, how is that:

Index: linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-git-disable-vmstat-worker/Documentation/vm/vmstat_thresholds.txt	2017-05-02 13:48:45.946840708 -0300
@@ -0,0 +1,78 @@
+Userspace configurable vmstat thresholds
+========================================
+
+This document describes the tunables to control
+per-CPU vmstat threshold and per-CPU vmstat worker
+thread.
+
+/sys/devices/system/cpu/cpuN/vmstat/vmstat_threshold:
+
+This file contains the per-CPU vmstat threshold.
+This value is the maximum that a single per-CPU vmstat statistic
+can accumulate before transferring to the global counters.
+
+A value of 0 indicates that the value is set
+by the in kernel algorithm.
+
+A value different than 0 indicates that particular
+value is used for vmstat_threshold.
+
+/sys/devices/system/cpu/cpuN/vmstat/vmstat_worker:
+
+Enable/disable the per-CPU vmstat worker.
+
+What does the vmstat_threshold value mean? What are the implications
+of changing this value? What's the difference in choosing 1, 2, 3
+or 500?
+====================================================================
+
+Its the maximum value for a vmstat statistics counter to hold. After
+that value, the statistics are transferred to the global counter:
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+                                long delta)
+{
+        struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+        s8 __percpu *p = pcp->vm_node_stat_diff + item;
+        long x;
+        long t;
+
+        x = delta + __this_cpu_read(*p);
+
+        t = __this_cpu_read(pcp->stat_threshold);
+
+        if (unlikely(x > t || x < -t)) {
+                node_page_state_add(x, pgdat, item);
+                x = 0;
+        }
+        __this_cpu_write(*p, x);
+}
+
+Increasing the threshold value does two things:
+        1) It decreases the number of inter-processor accesses.
+        2) It increases how much the global counters stay out of
+           sync relative to actual current values.
+
+
+Usage example:
+=============
+
+In a realtime system, the worker thread waking up and executing
+vmstat_update can be an undesired source of latencies.
+
+To avoid the worker thread from waking up, executing vmstat_update
+on cpu 1, for example, perform the following steps:
+
+
+cd /sys/devices/system/cpu/cpu0/vmstat/
+
+# Set vmstat threshold to 1 for cpu1, so that no
+# vmstat statistics are collected in cpu1's per-cpu
+# stats, instead they are immediately transferred
+# to the global counter.
+
+$ echo 1 > vmstat_threshold
+
+# Disable vmstat_update worker for cpu1:
+$ echo 0 > vmstat_worker
+


> > >  - If the purpose of having vmstat_threshold is to allow disabling
> > >    the vmstat kworker, why can't the kernel pick a value automatically?  
> > 
> > Because it might be acceptable for the user to accept a small 
> > out of syncedness of the global counters in favour of performance
> > (one would have to analyze the situation).
> > 
> > Setting vmstat_threshold == 1 means the global counter is always
> > in sync with the page counter state of the pCPU.
> 
> IMHO, if vmstat_threshold == 1 is the required setting for
> disabling the vmstat kworker then I'd go with only one tunable
> for now. But that's just a suggestion.

I didnt want to force that on the user because allowing different 
tunables covers more cases.

> > >  - What are the implications of disabling the vmstat kworker? Will vm
> > >    stats still be collected someway or will it be completely off for
> > >    the CPU?  
> > 
> > It will not be necessary to collect vmstats because at every modification
> > of the vm statistics, pCPUs with vmstat_threshold=1 transfer their 
> > values to the global counters (that is, there is no queueing of statistics
> > locally to improve performance).
> 
> Ah, OK. Got this now. I'll give this patch a try. But I think we want
> to hear from Christoph (who worked on reducing the vmstat interruptions
> in the past).

Christoph?

> > > Also, shouldn't this patch be split into two?  
> > 
> > First add one sysfs file, then add another sysfs file, you mean?
> 
> Yes, one tunable per patch.

Sure.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-02 17:15         ` Luiz Capitulino
@ 2017-05-11 15:37           ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-11 15:37 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Marcelo Tosatti, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Tue, 2 May 2017, Luiz Capitulino wrote:

> Ah, OK. Got this now. I'll give this patch a try. But I think we want
> to hear from Christoph (who worked on reducing the vmstat interruptions
> in the past).

A bit confused by this one. The vmstat worker is already disabled if there
are no updates. Also the patches by Chris Metcalf on data plane mode add a
prctl to quiet the vmstat workers.

Why do we need more than this?

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-11 15:37           ` Christoph Lameter
@ 2017-05-12 12:27             ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-12 12:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Thu, May 11, 2017 at 10:37:07AM -0500, Christoph Lameter wrote:
> On Tue, 2 May 2017, Luiz Capitulino wrote:
> 
> > Ah, OK. Got this now. I'll give this patch a try. But I think we want
> > to hear from Christoph (who worked on reducing the vmstat interruptions
> > in the past).
> 
> A bit confused by this one. The vmstat worker is already disabled if there
> are no updates. Also the patches by Chris Metcalf on data plane mode add a
> prctl to quiet the vmstat workers.
> 
> Why do we need more than this?

If there are vmstat statistics updates on a given CPU, and you don't
want intervention from the vmstat worker, you change the behaviour of
stat data collection to write directly to the global structures (which
disables the performance optimization of collecting data in per-cpu
counters).

This way you can disable the vmstat worker (because it causes undesired
latencies), while allowing vm statistics to function properly.

Does the prctl from Chris Metcalf's patchset allow one to disable the
vmstat worker per CPU? If so, it replaces the functionality of the patch
"[patch 3/3] MM: allow per-cpu vmstat_worker configuration" 
of the -v2 series of my patchset, and we can use it instead.

Is it integrated already?

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-12 12:27             ` Marcelo Tosatti
@ 2017-05-12 15:11               ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-12 15:11 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, 12 May 2017, Marcelo Tosatti wrote:

> > A bit confused by this one. The vmstat worker is already disabled if there
> > are no updates. Also the patches by Chris Metcalf on data plane mode add a
> > prctl to quiet the vmstat workers.
> >
> > Why do we need more than this?
>
> If there are vmstat statistic updates on a given CPU, and you don't
> want intervention from the vmstat worker, you change the behaviour of
> stat data collection to directly write to the global structures (which
> disables the performance optimization of collecting data in per-cpu
> counters).

Hmmm.... Ok. That is going to be expensive if you do this for each
individual vmstat update.

> This way you can disable vmstat worker (because it causes undesired
> latencies), while allowing vmstatistics to function properly.

Best then to run the vmstat update mechanism when you leave kernel mode to
get all the updates in one go.


> The prctl from Chris Metcalf patchset allows one to disable vmstat
> worker per CPU? If so, they replace the functionality of the patch
> "[patch 3/3] MM: allow per-cpu vmstat_worker configuration"
> of the -v2 series of my patchset, and we can use it instead.
>
> Is it integrated already?

The data plane mode patches disable vmstat processing by updating the
vmstats immediately if necessary and switching off the kworker thread.

So the kworker won't be running until the next time statistics are checked
by the shepherd task from a remote cpu. If the counters have been updated
then the shepherd task will reenable the kworker. This is already merged
and has been working for a long time. Data plane mode has not been merged
yet but the infrastructure in vmstat.c is there because NOHZ needs it too.

See linux/vmstat.c:quiet_vmstat()

It would be easy to add a /proc file that allows quieting the
vmstat workers for a certain cpu. Just make it call quiet_vmstat() on
the right cpu.

This will quiet vmstat down. The shepherd task will check the stats at 2
second intervals and will then reenable when necessary.

Note that we already are updating the global structures directly if the
differential gets too high. Reducing the differential may get you what you
want.
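The shepherd/worker interplay described above can be modelled with a short user-space sketch (hypothetical names; the real logic lives in mm/vmstat.c): the shepherd periodically scans per-CPU differentials and re-arms a CPU's worker only if that CPU accumulated updates, while a worker run folds the differentials and quiesces again.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of the shepherd behaviour described above; names
 * are illustrative, not the kernel's. */
#define NR_CPUS 4

static long pending[NR_CPUS];       /* per-CPU vmstat differentials */
static bool worker_armed[NR_CPUS];  /* is this CPU's kworker queued? */

/* Runs periodically from a housekeeping CPU: re-enable the worker
 * only on CPUs that accumulated updates since the last pass. */
static void shepherd_pass(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (pending[cpu] != 0)
			worker_armed[cpu] = true;
}

/* What a per-CPU worker invocation does: fold the differential into
 * the global counter, then quiesce until the shepherd re-arms it. */
static void worker_run(int cpu, long *global)
{
	if (!worker_armed[cpu])
		return;
	*global += pending[cpu];
	pending[cpu] = 0;
	worker_armed[cpu] = false;
}
```

An idle CPU therefore never sees its worker run: with pending[cpu] == 0 the shepherd leaves it quiesced.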

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-12 15:11               ` Christoph Lameter
@ 2017-05-12 15:40                 ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-12 15:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, May 12, 2017 at 10:11:14AM -0500, Christoph Lameter wrote:
> On Fri, 12 May 2017, Marcelo Tosatti wrote:
> 
> > > A bit confused by this one. The vmstat worker is already disabled if there
> > > are no updates. Also the patches by Chris Metcalf on data plane mode add a
> > > prctl to quiet the vmstat workers.
> > >
> > > Why do we need more than this?
> >
> > If there are vmstat statistic updates on a given CPU, and you don't
> > want intervention from the vmstat worker, you change the behaviour of
> > stat data collection to directly write to the global structures (which
> > disables the performance optimization of collecting data in per-cpu
> > counters).
> 
> Hmmm.... Ok. That is going to be expensive if you do this for each
> individual vmstat update.

In our case, vmstat updates are very rare (CPU is dominated by DPDK).

> > This way you can disable vmstat worker (because it causes undesired
> > latencies), while allowing vmstatistics to function properly.
> 
> Best then to run the vmstat update mechanism when you leave kernel mode to
> get all the updates in one go.

Again, vmstat updates are very rare (CPU is dominated by DPDK).

> > The prctl from Chris Metcalf patchset allows one to disable vmstat
> > worker per CPU? If so, they replace the functionality of the patch
> > "[patch 3/3] MM: allow per-cpu vmstat_worker configuration"
> > of the -v2 series of my patchset, and we can use it instead.
> >
> > Is it integrated already?
> 
> The data plane mode patches disables vmstat processing  by updating the
> vmstats immediately if necessary and switching off the kworker thread.

OK, this is what my patch set is doing.

> So the kworker wont be running until the next time statistics are checked
> by the shepherd task from a remote cpu.

We don't want the kworker thread to ever run.

>  If the counters have been updated
> then the shepherd task will reenable the kworker. This is already merged
> and has been working for a long time. Data plan mode has not been merged
> yet but the infrastructure in vmstat.c is there because NOHZ needs it too.

OK.

> 
> See linux/vmstat.c:quiet_vmstat()
> 
> It would be easy to add a /proc file that allows the quieting of the
> vmstat workers for a certain cpu. Just make it call the quiet_vmstat() on
> the right cpu.
> 
> This will quiet vmstat down. The shepherd task will check the stats in 2
> second intervals and will then reenable when necessasry.
> 
> Note that we already are updating the global structures directly if the
> differential gets too high. Reducing the differential may get you what you
> want.

Yes, we reduce the differential to 1 (== direct updates to global
structures).

OK, I'll check if the patches from Chris work for us and then add
Tested-by on that.

Thanks.

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-12 15:40                 ` Marcelo Tosatti
@ 2017-05-12 16:03                   ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-12 16:03 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, 12 May 2017, Marcelo Tosatti wrote:

> OK, i'll check if the patches from Chris work for us and then add
> Tested-by on that.

You may not need those if the quiet_vmstat() function is enough for you.

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-12 15:40                 ` Marcelo Tosatti
@ 2017-05-12 16:07                   ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-12 16:07 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, 12 May 2017, Marcelo Tosatti wrote:

> In our case, vmstat updates are very rare (CPU is dominated by DPDK).

What is the OS doing on the cores that DPDK runs on? I mean we here can
clean a processor of all activities and are able to run for a long time
without any interruptions.

Why would you still let the OS do things on that processor? If activities
by the OS are required then the existing NOHZ setup already minimizes
latency to a short burst (and Chris Metcalf's work improves on that).


What exactly is the issue you are seeing and want to address? I think we
have similar aims and as far as I know the current situation is already
good enough for what you may need. You may just not be aware of how to
configure this.

I doubt that doing inline updates will do much good compared to what we
already have and what the data plane mode can do.

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-12 16:07                   ` Christoph Lameter
@ 2017-05-12 16:19                     ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-12 16:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, May 12, 2017 at 11:07:48AM -0500, Christoph Lameter wrote:
> On Fri, 12 May 2017, Marcelo Tosatti wrote:
> 
> > In our case, vmstat updates are very rare (CPU is dominated by DPDK).
> 
> What is the OS doing on the cores that DPDK runs on? I mean we here can
> clean a processor of all activities and are able to run for a long time
> without any interruptions.
> 
> Why would you still let the OS do things on that processor? If activities
> by the OS are required then the existing NOHZ setup already minimizes
> latency to a short burst (and Chris Metcalf's work improves on that).
> 
> 
> What exactly is the issue you are seeing and want to address? I think we
> have similar aims and as far as I know the current situation is already
> good enough for what you may need. You may just not be aware of how to
> configure this.

I want to disable the vmstat worker thread completely on an isolated CPU,
because it adds overhead to a latency target, and the lower that target
the better.

> I doubt that doing inline updates will do much good compared to what we
> already have and what the dataplan mode can do.

Can data plane mode disable the vmstat worker thread completely on a
given CPU?

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-12 16:19                     ` Marcelo Tosatti
@ 2017-05-12 16:57                       ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-12 16:57 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, 12 May 2017, Marcelo Tosatti wrote:

> > What exactly is the issue you are seeing and want to address? I think we
> > have similar aims and as far as I know the current situation is already
> > good enough for what you may need. You may just not be aware of how to
> > configure this.
>
> I want to disable vmstat worker thread completly from an isolated CPU.
> Because it adds overhead to a latency target, target which
> the lower the better.

NOHZ already does that. I wanted to know what problem you are seeing.
The latency issue has already been solved as far as I can tell.
Please tell me why the existing solutions are not sufficient for you.

> > I doubt that doing inline updates will do much good compared to what we
> > already have and what the dataplan mode can do.
>
> Can the dataplan mode disable vmstat worker thread completly on a given
> CPU?

That already occurs when you call quiet_vmstat() and is used by the NOHZ
logic. Configure that correctly and you should be fine.

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-12 16:57                       ` Christoph Lameter
@ 2017-05-15 19:15                         ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-15 19:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, May 12, 2017 at 11:57:15AM -0500, Christoph Lameter wrote:
> On Fri, 12 May 2017, Marcelo Tosatti wrote:
> 
> > > What exactly is the issue you are seeing and want to address? I think we
> > > have similar aims and as far as I know the current situation is already
> > > good enough for what you may need. You may just not be aware of how to
> > > configure this.
> >
> > I want to disable vmstat worker thread completly from an isolated CPU.
> > Because it adds overhead to a latency target, target which
> > the lower the better.
> 
> NOHZ already does that. I wanted to know what your problem is that you
> see. The latency issue has already been solved as far as I can tell .
> Please tell me why the existing solutions are not sufficient for you.

We don't want vmstat_worker to execute on a given CPU, even if the local
CPU updates vm-statistics. 

Because:

    vmstat_worker increases latency of the application
       (I can measure, if you want, how many ns the following
        sequence takes on a given CPU:

            schedule_out(qemu-kvm-vcpu)
            schedule_in(kworker_thread)
            execute function to drain local vmstat counters to
                global counters
            schedule_out(kworker_thread)
            schedule_in(qemu-kvm-vcpu)
            x86 instruction to enter guest.
                                                (*)

But you can see right away without numbers that the sequence
above is not desired.

Why the existing solutions are not sufficient:

1) task-isolation patchset seems too heavy for our usecase (we do 
want IPIs, signals, etc).

2) With upstream linux-2.6.git, if dpdk running inside a guest happens
to trigger any vmstat update (say for example migration), we want the
statistics transferred directly from the point where they are generated,
and not the sequence (*).

> > > I doubt that doing inline updates will do much good compared to what we
> > > already have and what the dataplan mode can do.
> >
> > Can the dataplan mode disable vmstat worker thread completly on a given
> > CPU?
> 
> That already occurs when you call quiet_vmstat() and is used by the NOHZ
> logic. Configure that correctly and you should be fine.

quiet_vmstat() is not called by anyone today (upstream code). Are you
talking about task isolation patches?

Those seem a little heavy to me, for example:

1)
"Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it reqeusts rescheduling if the scheduler dyntick is
still running."

For example, what about

     static void do_sync_core(void *data)
             on_each_cpu(do_sync_core, NULL, 1);

You can't enable tracing with this feature?

"Prior to returning to userspace,
isolated tasks will arrange that no future kernel
activity will interrupt the task while the task is running
in userspace.  By default, attempting to re-enter the kernel
while in this mode will cause the task to be terminated
with a signal; you must explicitly use prctl() to disable
task isolation before resuming normal use of the kernel."

2)

A qemu-kvm-vcpu thread, process which runs on the host system,
executes guest code through

    ioctl(KVM_RUN) --> vcpu_enter_guest --> x86 instruction to execute
                                            guest code.

So the "isolation period where task does not want to be interrupted"
contains kernel code.

3) Before using any service of the operating system, through a
syscall, the application has to clear the TIF_TASK_ISOLATION flag,
then do the syscall, and when returning to userspace, setting it again.

Now what guarantees regarding low amount of interrupts do you provide
while this task is in kernel mode?

4)

"We also support a new "task_isolation_debug" flag which forces
the console stack to be dumped out regardless. We try to catch the
original source of the interrupt, e.g. if an IPI is dispatched to a
task-isolation task, we dump the backtrace of the remote core that is
sending the IPI, rather than just dumping out a trace showing the core
received an IPI from somewhere."

KVM uses IPI's to for example send virtual interrupts and update the
guest clock at certain conditions (for example after VM migration).

So this seems a little heavy for our usecase.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
@ 2017-05-15 19:15                         ` Marcelo Tosatti
  0 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-15 19:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, May 12, 2017 at 11:57:15AM -0500, Christoph Lameter wrote:
> On Fri, 12 May 2017, Marcelo Tosatti wrote:
> 
> > > What exactly is the issue you are seeing and want to address? I think we
> > > have similar aims and as far as I know the current situation is already
> > > good enough for what you may need. You may just not be aware of how to
> > > configure this.
> >
> > I want to disable the vmstat worker thread completely on an isolated CPU,
> > because it adds overhead to a latency target, a target where
> > lower is better.
> 
> NOHZ already does that. I wanted to know what your problem is that you
> see. The latency issue has already been solved as far as I can tell.
> Please tell me why the existing solutions are not sufficient for you.

We don't want vmstat_worker to execute on a given CPU, even if the local
CPU updates vm-statistics. 

Because:

    vmstat_worker increases the latency of the application.
       (I can measure, on a given CPU, how many ns
        the following sequence takes:

            schedule_out(qemu-kvm-vcpu)
            schedule_in(kworker_thread)
            execute function to drain local vmstat counters to
                global counters
            schedule_out(kworker_thread)
            schedule_in(qemu-kvm-vcpu)
            x86 instruction to enter guest.)
                                                (*)

But you can see right away, even without numbers, that the sequence
above is not desirable.
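As a rough sketch of why this matters, the per-step costs of the sequence (*) can be modeled as follows. Every number below is an assumed, illustrative value; none are measurements from this thread:

```python
# Hypothetical back-of-the-envelope cost of the vmstat_worker preemption
# sequence (*). Every value is an assumed, illustrative cost in
# nanoseconds, not a measurement.
sequence_ns = {
    "schedule_out(qemu-kvm-vcpu)": 1500,     # assumed context-switch cost
    "schedule_in(kworker_thread)": 1500,     # assumed context-switch cost
    "drain local vmstat counters": 3000,     # assumed drain-to-global cost
    "schedule_out(kworker_thread)": 1500,    # assumed context-switch cost
    "schedule_in(qemu-kvm-vcpu)": 1500,      # assumed context-switch cost
    "x86 instruction to enter guest": 1000,  # assumed VM-entry cost
}

total_ns = sum(sequence_ns.values())
print(f"assumed total interruption: {total_ns / 1000:.1f} us")  # 10.0 us
```

With these assumed values the whole round trip lands in the ~10us range, which is the order of magnitude discussed later in the thread.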

Why the existing solutions are not sufficient:

1) The task-isolation patchset seems too heavy for our use case (we do
want IPIs, signals, etc.).

2) With upstream linux-2.6.git, if DPDK running inside a guest happens
to trigger any vmstat update (say, for example, on migration), we want the
statistics transferred directly at the point where they are generated,
not via the sequence (*).

> > > I doubt that doing inline updates will do much good compared to what we
> > > already have and what the dataplane mode can do.
> >
> > Can the dataplane mode disable the vmstat worker thread completely on a given
> > CPU?
> 
> That already occurs when you call quiet_vmstat() and is used by the NOHZ
> logic. Configure that correctly and you should be fine.

quiet_vmstat() is not called by anyone today (in upstream code). Are you
talking about the task isolation patches?

Those seem a little heavy to me, for example:

1)
"Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running."

For example, what about code that does:

     static void do_sync_core(void *data)
             on_each_cpu(do_sync_core, NULL, 1);

Does that mean you can't enable tracing with this feature?

"Prior to returning to userspace,
isolated tasks will arrange that no future kernel
activity will interrupt the task while the task is running
in userspace.  By default, attempting to re-enter the kernel
while in this mode will cause the task to be terminated
with a signal; you must explicitly use prctl() to disable
task isolation before resuming normal use of the kernel."

2)

A qemu-kvm-vcpu thread, a process which runs on the host system,
executes guest code through:

    ioctl(KVM_RUN) --> vcpu_enter_guest --> x86 instruction to execute
                                            guest code.

So the "isolation period where the task does not want to be interrupted"
contains kernel code.

3) Before using any service of the operating system, through a
syscall, the application has to clear the TIF_TASK_ISOLATION flag,
then do the syscall, and when returning to userspace, set it again.

Now what guarantees regarding a low number of interrupts do you provide
while this task is in kernel mode?

4)

"We also support a new "task_isolation_debug" flag which forces
the console stack to be dumped out regardless. We try to catch the
original source of the interrupt, e.g. if an IPI is dispatched to a
task-isolation task, we dump the backtrace of the remote core that is
sending the IPI, rather than just dumping out a trace showing the core
received an IPI from somewhere."

KVM uses IPIs to, for example, send virtual interrupts and to update the
guest clock under certain conditions (e.g. after VM migration).

So this seems a little heavy for our use case.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-15 19:15                         ` Marcelo Tosatti
@ 2017-05-16 13:37                           ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-16 13:37 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Mon, 15 May 2017, Marcelo Tosatti wrote:

> > NOHZ already does that. I wanted to know what your problem is that you
> > see. The latency issue has already been solved as far as I can tell.
> > Please tell me why the existing solutions are not sufficient for you.
>
> We don't want vmstat_worker to execute on a given CPU, even if the local
> CPU updates vm-statistics.

Instead of responding, you repeat a description of what you want.

> Because:
>
>     vmstat_worker increases latency of the application
>        (i can measure it if you want on a given CPU,
>         how many ns's the following takes:

That still is no use case. Just a measurement of vmstat_worker. Pointless.

If you move the latency from the vmstat worker into the code that's
updating the counters, then you will require increased use of atomics,
which will increase contention, which in turn will significantly
increase the overall latency.

> Why the existing solutions are not sufficient:
>
> 1) task-isolation patchset seems too heavy for our usecase (we do
> want IPIs, signals, etc).

OK, then minor delays from random remote events are tolerable?
Then you can also tolerate a vmstat update.

> So this seems a little heavy for our usecase.

Sorry, all of this does not make sense to me. Maybe get some numbers from
an app with intensive OS access running with atomics vs. the vmstat worker?

NOHZ currently disables the vmstat worker when no updates occur. This is
applicable to DPDK and will provide a quiet, vmstat-worker-free environment
if no statistics activity is occurring.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-16 13:37                           ` Christoph Lameter
@ 2017-05-19 14:34                             ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-19 14:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

Hi Christoph,

On Tue, May 16, 2017 at 08:37:11AM -0500, Christoph Lameter wrote:
> On Mon, 15 May 2017, Marcelo Tosatti wrote:
> 
> > > NOHZ already does that. I wanted to know what your problem is that you
> > > see. The latency issue has already been solved as far as I can tell.
> > > Please tell me why the existing solutions are not sufficient for you.
> >
> > We don't want vmstat_worker to execute on a given CPU, even if the local
> > CPU updates vm-statistics.
> 
> Instead of responding you repeat describing what you want.
> 
> > Because:
> >
> >     vmstat_worker increases latency of the application
> >        (i can measure it if you want on a given CPU,
> >         how many ns's the following takes:
> 
> That still is no use case. 

Use case: a realtime application on an isolated core which for some reason
updates vm statistics.

> Just a measurement of vmstat_worker. Pointless.

Shouldn't the focus be on general scenarios rather than particular
use cases, so that the solution covers a wider range of use cases?

The situation as I see it is as follows:

Your point of view is: an "isolated CPU" with a set of applications
cannot update vm statistics, otherwise they pay the vmstat_update cost:

     kworker/5:1-245   [005] ....1..   673.454295: workqueue_execute_start: work struct ffffa0cf6e493e20: function vmstat_update
     kworker/5:1-245   [005] ....1..   673.454305: workqueue_execute_end: work struct ffffa0cf6e493e20

That's 10us, for example.

So if you want to customize a realtime setup whose code updates vm statistics,
you are dead. You have to avoid any system call which possibly updates
vm statistics (now and in future kernel versions).
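The 10us figure follows from the two trace timestamps above; a small sketch of the computation:

```python
# Duration of the vmstat_update work item, taken from the workqueue trace
# above: workqueue_execute_start at 673.454295 s, workqueue_execute_end
# at 673.454305 s.
start_s = 673.454295
end_s = 673.454305

duration_us = round((end_s - start_s) * 1_000_000, 3)
print(f"vmstat_update ran for {duration_us} us")
```

These tracepoints only bound the work function itself; the context switches in and out of the kworker add to the total preemption seen by the application.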

> If you move the latency from the vmstat worker into the code thats
> updating the counters then you will require increased use of atomics
> which will increase contention which in turn will significantly
> increase the overall latency.

The point is that these vmstat updates are rare. From 
http://www.7-cpu.com/cpu/Haswell.html:

RAM Latency = 36 cycles + 57 ns (3.4 GHz i7-4770)
RAM Latency = 62 cycles + 100 ns (3.6 GHz E5-2699 dual)

Let's round to 100ns = 0.1us.

You need 100 vmstat updates (all misses to RAM, the worst possible case)
to reach the equivalent amount of time of the batching version.

With more than 100 vmstat updates, the batching is more efficient
(in terms of total time to transfer to the global counters).
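The break-even arithmetic above can be sketched as follows (assuming, as in the discussion, 100ns per inline update and a 10us batched drain):

```python
# Break-even point between inline vmstat updates and the batched drain.
# Assumptions taken from the discussion above: each inline update costs
# one cache miss to RAM, ~100 ns; one batched vmstat_worker drain
# interrupts the application for ~10 us.
inline_update_ns = 100       # worst case: every update misses to RAM
batched_drain_ns = 10_000    # one vmstat_worker interruption

break_even = batched_drain_ns // inline_update_ns
print(f"inline updates are cheaper below {break_even} updates per interval")
```

Below 100 updates per flush interval the inline scheme wins on total time; the argument that follows is that, even past the break-even point, many tiny costs are preferable to one large interruption for a realtime workload.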

But that's not the point. The point is the 10us interruption
to the execution of the realtime app (which can mean either that
your current deadline requirements are not met, or that
another application with lower latency requirements can't
be used).

So I'd rather spend more time updating the aggregate vm statistics
(with each local->global transfer taking a small amount of time,
therefore not interrupting the realtime application for a long period)
than batch the updates (which increases overall performance beyond
a certain number of updates, but which is _ONE_ large interruption).

So let's assume I go and count the vmstat updates in the DPDK case
(or any other realtime app), and batching turns out to be more efficient
for that case.

Still, the one-time interruption of batching is worse than less
efficient one-bean-at-a-time vmstat accounting.

No?

Also, you could reply: "oh, there are in fact no vmstat updates
in this setup, but the logic of disabling vmstat_update
is broken". Let's assume that's the case.

Even if it's fixed (vmstat_update properly shut down), the proposed patch
deals with both cases: no vmstat updates on isolated CPUs, and vmstat
updates on isolated CPUs.

So why are you against integrating this simple, isolated patch which
does not affect how the current logic works?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-19 14:34                             ` Marcelo Tosatti
@ 2017-05-19 17:13                               ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-19 17:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, 19 May 2017, Marcelo Tosatti wrote:

> Use-case: realtime application on an isolated core which for some reason
> updates vmstatistics.

OK, that already happens only every 2 seconds by default, and the
interval is configurable via the vmstat_interval proc setting.

> > Just a measurement of vmstat_worker. Pointless.
>
> Shouldn't the focus be on general scenarios rather than particular
> use cases, so that the solution covers a wider range of use cases?

Yes, indeed, and as far as I can tell the wider use cases are covered. I'm
not sure anything is required here.

> The situation as i see is as follows:
>
> Your point of view is: an "isolated CPU" with a set of applications
> cannot update vm statistics, otherwise they pay the vmstat_update cost:
>
>      kworker/5:1-245   [005] ....1..   673.454295: workqueue_execute_start: work struct ffffa0cf6e493e20: function vmstat_update
>      kworker/5:1-245   [005] ....1..   673.454305: workqueue_execute_end: work struct ffffa0cf6e493e20
>
> That's 10us, for example.

Well, with a decent CPU that is 3 usec, and it occurs infrequently, on the
order of once per multiple seconds.

> So if you want to customize a realtime setup whose code updates vm statistics,
> you are dead. You have to avoid any system call which possibly updates
> vm statistics (now and in future kernel versions).

You are already dead because you allow IPIs and other kernel processing
which create far more overhead. I still fail to see the point.

> The point is that these vmstat updates are rare. From
> http://www.7-cpu.com/cpu/Haswell.html:
>
> RAM Latency = 36 cycles + 57 ns (3.4 GHz i7-4770)
> RAM Latency = 62 cycles + 100 ns (3.6 GHz E5-2699 dual)
>
> Let's round to 100ns = 0.1us.

That depends on the kernel functionality used.

> You need 100 vmstat updates (all misses to RAM, the worst possible case)
> to have equivalent amount of time of the batching version.

The batching version occurs every couple of seconds if at all.

> But thats not the point. The point is the 10us interruption
> to execution of the realtime app (which can either mean
> your current deadline requirements are not met, or that
> another application with lowest latency requirement can't
> be used).

OK, then you need to get rid of the IPIs and the other stuff that you have
going on with the OS first, I think.

> So why are you against integrating this simple, isolated patch which
> does not affect how current logic works?

Frankly, the argument does not make sense. Vmstat updates occur very
infrequently (probably even less often than your IPIs and the other OS stuff
that also causes additional latencies, which you seem willing to tolerate).

And you can configure the interval of vmstat updates freely... Set
vmstat_interval to 60 seconds instead of 2 for a try? Is that rare
enough?
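For reference, the interval discussed here is exposed on Linux as the vm.stat_interval sysctl. A minimal sketch of adjusting it follows; the helper name is invented, the path is parameterized purely so the function can be exercised against an ordinary file, and writing the real proc file requires root:

```python
# Sketch: raise the global vmstat flush interval, as suggested above.
# On Linux the knob is /proc/sys/vm/stat_interval (the vm.stat_interval
# sysctl). The path parameter exists only for illustration/testing;
# the function name is hypothetical.
def set_vmstat_interval(seconds, path="/proc/sys/vm/stat_interval"):
    """Write the vmstat flush interval (in seconds) and read it back."""
    with open(path, "w") as f:
        f.write(f"{seconds}\n")
    with open(path) as f:
        return int(f.read())

# Example (as root): set_vmstat_interval(60)
```

Note this knob is global: it changes the interval for every CPU, which is exactly the limitation raised in the next message.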

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-19 17:13                               ` Christoph Lameter
@ 2017-05-19 17:49                                 ` Luiz Capitulino
  -1 siblings, 0 replies; 66+ messages in thread
From: Luiz Capitulino @ 2017-05-19 17:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Marcelo Tosatti, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, 19 May 2017 12:13:26 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> > So why are you against integrating this simple, isolated patch which
> > does not affect how current logic works?  
> 
> Frankly the argument does not make sense. Vmstat updates occur very
> infrequently (probably even less than you IPIs and the other OS stuff that
> also causes additional latencies that you seem to be willing to tolerate).

Infrequently is not good enough. It only has to happen once to
cause a problem.

Also, IPIs take a few us, usually less. That's not a problem. In our
testing we see the preemption caused by the kworker take 10us or
even more. I've never seen it take 3us. I'm not saying this is not
true; I'm saying that if this is causing a problem for us, it will
cause a problem for other people too.

> And you can configure the interval of vmstat updates freely.... Set
> the vmstat_interval to 60 seconds instead of 2 for a try? Is that rare
> enough?

No, we'd have to set it high enough to effectively disable it, and that
would affect all CPUs.

Something that crossed my mind was to add a new tunable to set
vmstat_interval for each CPU; this way we could essentially
disable it on the CPUs where DPDK is running. What are the implications
of doing this, besides not getting up-to-date stats in /proc/vmstat
(which I still have to confirm would be OK)? Can this break anything
in the kernel, for example?
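One way to picture the per-CPU tunable idea is the following user-space model. It is purely hypothetical: all names are invented for illustration, and no such kernel interface is being described, only the semantics being proposed:

```python
# Hypothetical model of a per-CPU vmstat_interval, where an interval of 0
# means "vmstat worker disabled on this CPU". All names are invented for
# illustration; no real kernel interface works this way.
DEFAULT_INTERVAL_S = 2  # the stock vmstat_interval default

def effective_interval(cpu, overrides):
    """Return the flush interval for `cpu`, or None if disabled."""
    interval = overrides.get(cpu, DEFAULT_INTERVAL_S)
    return None if interval == 0 else interval

# Isolated CPUs 4-7 run DPDK: disable the worker there only.
overrides = {cpu: 0 for cpu in range(4, 8)}
print(effective_interval(2, overrides))  # housekeeping CPU -> 2
print(effective_interval(5, overrides))  # isolated CPU -> None (disabled)
```

The point of the model is that housekeeping CPUs keep the default behavior (and up-to-date /proc/vmstat contributions) while only the isolated CPUs opt out.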

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-19 17:13                               ` Christoph Lameter
@ 2017-05-20  8:26                                 ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-20  8:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, May 19, 2017 at 12:13:26PM -0500, Christoph Lameter wrote:
> On Fri, 19 May 2017, Marcelo Tosatti wrote:
> 
> > Use-case: a realtime application on an isolated core which for some
> > reason updates vm statistics.
> 
> Ok, that already happens only every 2 seconds by default, and that
> interval is configurable via the vmstat_interval proc setting.
> 
> > > Just a measurement of vmstat_worker. Pointless.
> >
> > Shouldn't the focus be on general scenarios rather than particular
> > use cases, so that the solution covers a wider range of use cases?
> 
> Yes indeed, and as far as I can tell the wider use cases are covered. Not
> sure that anything is required here.
> 
> > The situation as I see it is as follows:
> >
> > Your point of view is: an "isolated CPU" with a set of applications
> > cannot update vm statistics, otherwise they pay the vmstat_update cost:
> >
> >      kworker/5:1-245   [005] ....1..   673.454295: workqueue_execute_start: work struct ffffa0cf6e493e20: function vmstat_update
> >      kworker/5:1-245   [005] ....1..   673.454305: workqueue_execute_end: work struct ffffa0cf6e493e20
> >
> > Thats 10us for example.
> 
> Well, with a decent CPU that is 3 usec, and it occurs infrequently, on
> the order of once every several seconds.
> 
> > So if you want to customize a realtime setup whose code updates vm
> > statistics, you are dead. You have to avoid any system call which might
> > update vm statistics (now and in future kernel versions).
> 
> You are already dead because you allow IPIs and other kernel processing
> which create far more overhead. I still fail to see the point.
> 
> > The point is that these vmstat updates are rare. From
> > http://www.7-cpu.com/cpu/Haswell.html:
> >
> > RAM Latency = 36 cycles + 57 ns (3.4 GHz i7-4770)
> > RAM Latency = 62 cycles + 100 ns (3.6 GHz E5-2699 dual)
> >
> > Let's round to 100ns = 0.1us.
> 
> That depends on the kernel functionality used.
> 
> > You need 100 vmstat updates (all misses to RAM, the worst possible case)
> > to spend the same amount of time as the batching version.
> 
> The batching version occurs every couple of seconds if at all.
> 
> > But that's not the point. The point is the 10us interruption
> > to the execution of the realtime app (which can mean either that
> > your current deadline requirements are not met, or that
> > another application with lower latency requirements can't
> > be used).
> 
> Ok, then you need to get rid of the IPIs and the other stuff that you have
> going on with the OS first, I think.

I'll measure the cost of all IPIs in the system to confirm that
vmstat_update's cost is larger than the cost of any IPI.
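A sketch of that measurement with ftrace (the tracefs paths, the 10-second window, and the CPU-3 mask are assumptions; needs root). workqueue_execute_start/end bracket the vmstat_update kworker, as in the trace quoted earlier, and on x86 call_function_entry/exit bracket smp_call_function IPIs:

```shell
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo > trace
echo 8 > tracing_cpumask                      # hex mask: trace CPU 3 only
echo 1 > events/workqueue/workqueue_execute_start/enable
echo 1 > events/workqueue/workqueue_execute_end/enable
echo 1 > events/irq_vectors/call_function_entry/enable
echo 1 > events/irq_vectors/call_function_exit/enable
echo 1 > tracing_on
sleep 10                                      # span a few vmstat_interval periods
echo 0 > tracing_on
grep -E 'vmstat_update|call_function' trace   # subtract paired timestamps for cost
```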

> > So why are you against integrating this simple, isolated patch which
> > does not affect how current logic works?
> 
> Frankly the argument does not make sense. Vmstat updates occur very
> infrequently (probably even less often than your IPIs and the other OS
> stuff that also causes additional latencies, which you seem willing to
> tolerate).
> 
> And you can configure the interval of vmstat updates freely... Try
> setting vmstat_interval to 60 seconds instead of 2? Is that rare
> enough?

Not rare enough. Never is rare enough.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-19 17:49                                 ` Luiz Capitulino
@ 2017-05-22 16:35                                   ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-22 16:35 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Marcelo Tosatti, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, 19 May 2017, Luiz Capitulino wrote:

> Something that crossed my mind was to add a new tunable to set
> the vmstat_interval for each CPU; that way we could essentially
> disable it on the CPUs where DPDK is running. What are the implications
> of doing this, besides not getting up-to-date stats in /proc/vmstat
> (which I still have to confirm would be OK)? Can it break anything
> in the kernel, for example?

The data is still going to be updated when the differential gets too big.

Increasing the vmstat interval and reducing the differential threshold
would get you there...
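The differential mechanism referred to here can be illustrated with a toy model (plain awk, not the kernel code; the threshold value and the stream of increments are made up): each counter update lands in a per-CPU diff, which is folded into the global count once its magnitude reaches the threshold, so the globally visible count can lag by at most threshold - 1 per counter per CPU between vmstat_update runs.

```shell
# Seven +1 updates against a threshold of 5: the first five are folded into
# the global counter at the flush point, the last two stay pending.
printf '%s\n' 1 1 1 1 1 1 1 | awk -v threshold=5 '
  { diff += $1
    if (diff >= threshold || -diff >= threshold) {   # magnitude crossed
        global += diff                               # fold into global count
        diff = 0
    } }
  END { printf "global=%d pending=%d\n", global, diff }'
```

Folding the leftover `pending` residue is the periodic worker's only remaining job; that residue is also what /proc/vmstat under-reports while the worker is idle.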

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-20  8:26                                 ` Marcelo Tosatti
@ 2017-05-22 16:38                                   ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-22 16:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Sat, 20 May 2017, Marcelo Tosatti wrote:

> > And you can configure the interval of vmstat updates freely... Try
> > setting vmstat_interval to 60 seconds instead of 2? Is that rare
> > enough?
>
> Not rare enough. Never is rare enough.

Ok, what about the other stuff that must be going on if you allow OS
activity, e.g. the tick, the scheduler, etc.?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-22 16:38                                   ` Christoph Lameter
@ 2017-05-22 21:13                                     ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-22 21:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Mon, May 22, 2017 at 11:38:02AM -0500, Christoph Lameter wrote:
> On Sat, 20 May 2017, Marcelo Tosatti wrote:
> 
> > > And you can configure the interval of vmstat updates freely... Try
> > > setting vmstat_interval to 60 seconds instead of 2? Is that rare
> > > enough?
> >
> > Not rare enough. Never is rare enough.
> 
> Ok, what about the other stuff that must be going on if you allow OS
> activity, e.g. the tick, the scheduler, etc.?

Yes, these are also problems... but we're either getting rid of them or
reducing their impact as much as possible.

vmstat_update is one member of the problematic set.

I'll get you the detailed IPI measurements, hold on...

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-19 17:49                                 ` Luiz Capitulino
  (?)
@ 2017-05-25 19:35                                   ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-25 19:35 UTC (permalink / raw)
  To: Luiz Capitulino, Christoph Lameter
  Cc: Christoph Lameter, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, May 19, 2017 at 01:49:34PM -0400, Luiz Capitulino wrote:
> On Fri, 19 May 2017 12:13:26 -0500 (CDT)
> Christoph Lameter <cl@linux.com> wrote:
> 
> > > So why are you against integrating this simple, isolated patch which
> > > does not affect how current logic works?  
> > 
> > Frankly the argument does not make sense. Vmstat updates occur very
> > infrequently (probably even less often than your IPIs and the other OS
> > stuff that also causes additional latencies, which you seem willing to
> > tolerate).
> 
> Infrequently is not good enough. It only has to happen once to
> cause a problem.
> 
> Also, IPIs take a few us, usually less. That's not a problem. In our
> testing we see the preemption caused by the kworker take 10us or
> even more; I've never seen it take 3us. I'm not saying your number is
> wrong, I'm saying that if this is causing a problem for us, it will
> cause a problem for other people too.

Christoph, 

Some data:

 qemu-system-x86-12902 [003] ....1..  6517.621557: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000fc
 qemu-system-x86-12902 [003] d...2..  6517.621557: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] ....1..  6517.621560: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000fc
 qemu-system-x86-12902 [003] d...2..  6517.621561: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] ....1..  6517.621563: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000fc
 qemu-system-x86-12902 [003] d...2..  6517.621564: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] d..h1..  6517.622037: empty_smp_call_func:
empty_smp_call_func ran
 qemu-system-x86-12902 [003] ....1..  6517.622040: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000fb
 qemu-system-x86-12902 [003] d...2..  6517.622041: kvm_entry: vcpu 2

empty_smp_function_call: 3us.

 qemu-system-x86-12902 [003] ....1..  6517.702739: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000ef
 qemu-system-x86-12902 [003] d...2..  6517.702741: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] d..h1..  6517.702758: scheduler_tick
<-update_process_times
 qemu-system-x86-12902 [003] ....1..  6517.702760: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000ef
 qemu-system-x86-12902 [003] d...2..  6517.702760: kvm_entry: vcpu 2

scheduler_tick: 2us.

 qemu-system-x86-12902 [003] ....1..  6518.194570: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000ef
 qemu-system-x86-12902 [003] d...2..  6518.194571: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] ....1..  6518.194591: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000ef
 qemu-system-x86-12902 [003] d...2..  6518.194593: kvm_entry: vcpu 2
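The per-event costs quoted above can be recomputed directly from the trace timestamps (which are in seconds); a quick sketch:

```shell
# Delta between the empty_smp_call_func event and the following kvm_exit,
# and between scheduler_tick and the following kvm_exit, in microseconds.
awk 'BEGIN {
  printf "empty_smp_call_func: %.0f us\n", (6517.622040 - 6517.622037) * 1e6
  printf "scheduler_tick:      %.0f us\n", (6517.702760 - 6517.702758) * 1e6
}'
```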

That, and the 10us kworker number mentioned above, should change your
view of your statement:

"Frankly the argument does not make sense. Vmstat updates occur very
infrequently (probably even less often than your IPIs and the other OS
stuff that also causes additional latencies, which you seem willing to
tolerate). And you can configure the interval of vmstat updates freely...
Try setting vmstat_interval to 60 seconds instead of 2? Is that rare
enough?"

Argument? We're showing you data demonstrating that this is causing a
latency problem for us.

Is there anything you'd like to see improved in the patch?
Is there anything you dislike about it?

> No, we'd have to set it high enough to disable it and this will
> affect all CPUs.
> 
> Something that crossed my mind was to add a new tunable to set
> the vmstat_interval for each CPU; that way we could essentially
> disable it on the CPUs where DPDK is running. What are the implications
> of doing this, besides not getting up-to-date stats in /proc/vmstat
> (which I still have to confirm would be OK)? Can it break anything
> in the kernel, for example?

Well, you get incorrect statistics. 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-25 19:35                                   ` Marcelo Tosatti
@ 2017-05-26  3:24                                     ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-26  3:24 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Thu, 25 May 2017, Marcelo Tosatti wrote:

> Argument? We're showing you the data that this is causing a latency
> problem for us.

Sorry I am not sure where the data shows a latency problem. There are
interrupts and scheduler ticks. But what does this have to do with vmstat?

Show me your dpdk code running and trace the tick on/off events as well
as the vmstat invocations. Also show all system calls occurring on the CPU
that runs dpdk. That is necessary to see what triggers vmstat and how the
system reacts to the changes to the differentials.

Then please rerun the test by setting the vmstat_interval to 60.

Do another run with your modifications and show the difference.
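The requested comparison, sketched as a script (the vm.stat_interval sysctl name is real; `run_dpdk_latency_test` is a hypothetical stand-in for whatever harness produces the latency numbers):

```shell
# Baseline with the default 2-second vmstat interval.
sysctl vm.stat_interval                   # shows the current value (default: 2)
run_dpdk_latency_test > baseline.txt      # hypothetical latency harness

# Same run with a 60-second interval.
sysctl -w vm.stat_interval=60
run_dpdk_latency_test > interval60.txt

# Third run on the patched kernel with the per-CPU worker disabled, then:
diff baseline.txt interval60.txt
```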

> > Something that crossed my mind was to add a new tunable to set
> > the vmstat_interval for each CPU, this way we could essentially
> > disable it to the CPUs where DPDK is running. What's the implications
> > of doing this besides not getting up to date stats in /proc/vmstat
> > (which I still have to confirm would be OK)? Can this break anything
> > in the kernel for example?
>
> Well, you get incorrect statistics.

The statistics are never completely accurate. You will get less accurate
statistics, but they will be correct: the differentials may not be
reflected in the counts shown via /proc, but there is a cap on how
inaccurate those can become.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-26  3:24                                     ` Christoph Lameter
@ 2017-05-26 19:09                                       ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-05-26 19:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Thu, May 25, 2017 at 10:24:46PM -0500, Christoph Lameter wrote:
> On Thu, 25 May 2017, Marcelo Tosatti wrote:
> 
> > Argument? We're showing you the data that this is causing a latency
> > problem for us.
> 
> Sorry I am not sure where the data shows a latency problem. There are
> interrupts and scheduler ticks. But what does this have to do with vmstat?
> 
> Show me your dpdk code running and trace the tick on/off events as well
> as the vmstat invocations. Also show all system calls occurring on the CPU
> that runs dpdk. That is necessary to see what triggers vmstat and how the
> system reacts to the changes to the differentials.

Sure, I can get that to you. The question remains: are you arguing
it's not valid for a realtime application to use any system call
which changes a vmstat counter?

Because if they are allowed, then it's obvious that something like
this is needed.

> Then please rerun the test by setting the vmstat_interval to 60.
> 
> Do another run with your modifications and show the difference.

Will do so.

> > > Something that crossed my mind was to add a new tunable to set
> > > the vmstat_interval for each CPU, this way we could essentially
> > > disable it on the CPUs where DPDK is running. What are the implications
> > > of doing this, besides not getting up-to-date stats in /proc/vmstat
> > > (which I still have to confirm would be OK)? Can this break anything
> > > in the kernel for example?
> >
> > Well, you get incorrect statistics.
> 
> The statistics are never completely accurate. You will get less accurate
> statistics but they will be correct. The differentials may not be
> reflected in the counts shown via /proc but there is a cap on how
> inaccurate those can become.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-26 19:09                                       ` Marcelo Tosatti
@ 2017-05-30 18:17                                         ` Christoph Lameter
  -1 siblings, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2017-05-30 18:17 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Fri, 26 May 2017, Marcelo Tosatti wrote:

> > interrupts and scheduler ticks. But what does this have to do with vmstat?
> >
> > Show me your dpdk code running and trace the tick on / off events  as well
> > as the vmstat invocations. Also show all system calls occurring on the cpu
> > that runs dpdk. That is necessary to see what triggers vmstat and how the
> > system reacts to the changes to the differentials.
>
> Sure, I can get that to you. The question remains: are you arguing
> it's not valid for a realtime application to use any system call
> which changes a vmstat counter?

A true realtime app would be conscientious of its use of OS services,
because using them may cause additional latencies and also cause
timers etc. to fire later. A realtime app that is willing to use
these services is therefore willing to tolerate larger latencies. A
realtime app that uses OS services may cause the timer tick to be
enabled, which also causes additional latencies.

I have seen completely OS-noise-free processing for extended time
periods when not using OS services and using RDMA for I/O. This fits
my use case well.

If there really are these high latencies because of kworker processing
for vmstat, then maybe we need a different mechanism there (bh? or
other triggers), and maybe we are using far too many counters, so that
the processing becomes a heavy user of resources.

> Because if they are allowed, then its obvious something like
> this is needed.

I am still wondering what benefit there is. Let's get clear on the test
load and see if this actually makes sense.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
  2017-05-30 18:17                                         ` Christoph Lameter
@ 2017-07-10 15:05                                           ` Marcelo Tosatti
  -1 siblings, 0 replies; 66+ messages in thread
From: Marcelo Tosatti @ 2017-07-10 15:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luiz Capitulino, linux-kernel, linux-mm, Rik van Riel,
	Linux RT Users, cmetcalf

On Tue, May 30, 2017 at 01:17:41PM -0500, Christoph Lameter wrote:
> On Fri, 26 May 2017, Marcelo Tosatti wrote:
> 
> > > interrupts and scheduler ticks. But what does this have to do with vmstat?
> > >
> > > Show me your dpdk code running and trace the tick on / off events  as well
> > > as the vmstat invocations. Also show all system calls occurring on the cpu
> > > that runs dpdk. That is necessary to see what triggers vmstat and how the
> > > system reacts to the changes to the differentials.

This was in the host, while performing virtual machine migration,
which you could say "invalidates the argument", because virtual machine
migration takes much longer than what vmstat_update introduces.

> >
> > Sure, i can get that to you. The question remains: Are you arguing
> > its not valid for a realtime application to use any system call
> > which changes a vmstat counter?
> 
> A true realtime app would be conscientious of its use of OS services,
> because using them may cause additional latencies and also cause
> timers etc. to fire later. A realtime app that is willing to use
> these services is therefore willing to tolerate larger latencies. A
> realtime app that uses OS services may cause the timer tick to be
> enabled, which also causes additional latencies.
> 
> I have seen completely OS-noise-free processing for extended time
> periods when not using OS services and using RDMA for I/O. This fits
> my use case well.

People might want to use OS services.

> If there are really these high latencies because of kworker processing for
> vmstat then maybe we need a different mechanism there (bh? or other
> triggers) and maybe we are using far too many counters so that the
> processing becomes a heavy user of resources.
> 
> > Because if they are allowed, then its obvious something like
> > this is needed.
> 
> I am still wondering what benefit there is. Lets get clear on the test
> load and see if this actually makes sense.

Ok, test load:

	* Any userspace app that makes a system call which triggers
	vmstat_update is susceptible to vmstat_update running on that
	CPU, which might be detrimental to latency.

So something that either moves the vmstat_update work to another CPU,
or avoids vmstat_update altogether (which is what the proposed patchset
does), must be necessary.
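One way to confirm that it is vmstat_update hitting the isolated CPU is a sketch
along these lines, using the workqueue tracepoints (debugfs path and tracepoint
names as in kernels of this era; adjust as needed, and run as root):

```shell
# Enable workqueue tracing and watch for vmstat_update executing on
# the isolated CPU (assumes debugfs is mounted at /sys/kernel/debug).
cd /sys/kernel/debug/tracing
echo 1 > events/workqueue/workqueue_execute_start/enable
echo 1 > events/workqueue/workqueue_execute_end/enable
grep vmstat_update trace_pipe
```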

So if a customer comes to me and says: "I am using sys_XXX in my
application, but my latency is high", I'll have to tell him: "ok, please
don't use that system call, since it triggers kernel activity on the CPU
which does not allow you to achieve the latency you desire".

But the "no syscalls" rule does seem to be a good idea for
CPU-isolated, low-latency stuff...

So I give up on the use-case behind this patch.

^ permalink raw reply	[flat|nested] 66+ messages in thread
