All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] memcg: add pgfault latency histograms
@ 2011-05-26 21:07 Ying Han
  2011-05-27  0:05 ` KAMEZAWA Hiroyuki
  2011-05-27  8:04 ` Balbir Singh
  0 siblings, 2 replies; 14+ messages in thread
From: Ying Han @ 2011-05-26 21:07 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm

This adds histogram to capture pagefault latencies on per-memcg basis. I used
this patch on the memcg background reclaim test, and figured there could be more
usecases to monitor/debug application performance.

The histogram is composed 8 bucket in ns unit. The last one is infinite (inf)
which is everything beyond the last one. To be more flexible, the buckets can
be reset and also each bucket is configurable at runtime.

memory.pgfault_histogram: exports the histogram on per-memcg basis and also can
be reset by echoing "reset". Meantime, all the buckets are writable by echoing
the range into the API. see the example below.

/proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to turn
on/off recording the histogram.

Functional Test:
Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
Measure the anon page allocation latency.

$ mkdir /dev/cgroup/memory/B
$ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/B/tasks
$ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
$ allocate 8g anon pages

$ echo 1 >/proc/sys/vm/pgfault_histogram

$ cat /dev/cgroup/memory/B/memory.pgfault_histogram
pgfault latency histogram (ns):
< 600            2051273
< 1200           40859
< 2400           4004
< 4800           1605
< 9600           170
< 19200          82
< 38400          6
< inf            0

$ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
$ cat /dev/cgroup/memory/B/memory.pgfault_histogram
pgfault latency histogram (ns):
< 600            0
< 1200           0
< 2400           0
< 4800           0
< 9600           0
< 19200          0
< 38400          0
< inf            0

$ echo 500 520 540 580 600 1000 5000 >/dev/cgroup/memory/B/memory.pgfault_histogram
$ cat /dev/cgroup/memory/B/memory.pgfault_histogram
pgfault latency histogram (ns):
< 500            50
< 520            151
< 540            3715
< 580            1859812
< 600            202241
< 1000           25394
< 5000           5875
< inf            186

Performance Test:
I ran through the PageFaultTest (pft) benchmark to measure the overhead of
recording the histogram. There is no overhead observed on both "flt/cpu/s"
and "fault/wsec".

$ mkdir /dev/cgroup/memory/A
$ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/A/tasks
$ ./pft -m 15g -t 8 -T a

Result:
"fault/wsec"

$ ./ministat no_histogram histogram
x no_histogram
+ histogram
+--------------------------------------------------------------------------+
   N           Min           Max        Median           Avg        Stddev
x   5     813404.51     824574.98      821661.3     820470.83     4202.0758
+   5     821228.91     825894.66     822874.65     823374.15     1787.9355

"flt/cpu/s"

$ ./ministat no_histogram histogram
x no_histogram
+ histogram
+--------------------------------------------------------------------------+
   N           Min           Max        Median           Avg        Stddev
x   5     104951.93     106173.13     105142.73      105349.2     513.78158
+   5     104697.67      105416.1     104943.52     104973.77     269.24781
No difference proven at 95.0% confidence

Signed-off-by: Ying Han <yinghan@google.com>
---
 arch/x86/mm/fault.c        |    8 +++
 include/linux/memcontrol.h |    8 +++
 kernel/sysctl.c            |    7 +++
 mm/memcontrol.c            |  128 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 151 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 20e3f87..d7a1490 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -12,6 +12,7 @@
 #include <linux/mmiotrace.h>		/* kmmio_handler, ...		*/
 #include <linux/perf_event.h>		/* perf_sw_event		*/
 #include <linux/hugetlb.h>		/* hstate_index_to_shift	*/
+#include <linux/memcontrol.h>
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -966,6 +967,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	int write = error_code & PF_WRITE;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
 					(write ? FAULT_FLAG_WRITE : 0);
+	unsigned long long start, delta;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1125,6 +1127,7 @@ good_area:
 		return;
 	}
 
+	start = sched_clock();
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -1132,6 +1135,11 @@ good_area:
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);
 
+	delta = sched_clock() - start;
+	if (unlikely(delta < 0))
+		delta = 0;
+	memcg_histogram_record(current, delta);
+
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, fault);
 		return;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 29a945a..c7e6cb8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,6 +92,8 @@ struct mem_cgroup *mem_cgroup_get_shrink_target(void);
 void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
 wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
 
+extern void memcg_histogram_record(struct task_struct *tsk, u64 delta);
+
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
 {
@@ -131,6 +133,8 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif
 
+extern unsigned int sysctl_pgfault_histogram;
+
 static inline bool mem_cgroup_disabled(void)
 {
 	if (mem_cgroup_subsys.disabled)
@@ -476,6 +480,10 @@ wait_queue_head_t *mem_cgroup_kswapd_waitq(void)
 	return NULL;
 }
 
+static inline
+void memcg_histogram_record(struct task_struct *tsk, u64 delta)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 927fc5a..0dd2939 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1132,6 +1132,13 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &one,
 		.extra2		= &three,
 	},
+	{
+		.procname	= "pgfault_histogram",
+		.data		= &sysctl_pgfault_histogram,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0666,
+		.proc_handler	= proc_dointvec,
+	},
 #ifdef CONFIG_COMPACTION
 	{
 		.procname	= "compact_memory",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a98471b..c795f96 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -51,6 +51,7 @@
 #include "internal.h"
 #include <linux/kthread.h>
 #include <linux/freezer.h>
+#include <linux/ctype.h>
 
 #include <asm/uaccess.h>
 
@@ -207,6 +208,13 @@ struct mem_cgroup_eventfd_list {
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
 static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
 
+#define MEMCG_NUM_HISTO_BUCKETS		8
+unsigned int sysctl_pgfault_histogram;
+
+struct memcg_histo {
+	u64 count[MEMCG_NUM_HISTO_BUCKETS];
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -299,6 +307,9 @@ struct mem_cgroup {
 	 * last node we reclaimed from
 	 */
 	int last_scanned_node;
+
+	struct memcg_histo *memcg_histo;
+	u64 memcg_histo_range[MEMCG_NUM_HISTO_BUCKETS];
 };
 
 /* Stuffs for move charges at task migration. */
@@ -4692,6 +4703,105 @@ static int __init memcg_kswapd_init(void)
 }
 module_init(memcg_kswapd_init);
 
+static int mem_cgroup_histogram_seq_read(struct cgroup *cgrp,
+					struct cftype *cft, struct seq_file *m)
+{
+	struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cgrp);
+	int i, cpu;
+
+	seq_printf(m, "pgfault latency histogram (ns):\n");
+
+	for (i = 0; i < MEMCG_NUM_HISTO_BUCKETS; i++) {
+		u64 sum = 0;
+
+		for_each_present_cpu(cpu) {
+			struct memcg_histo *histo;
+			histo = per_cpu_ptr(mem_cont->memcg_histo, cpu);
+			sum += histo->count[i];
+		}
+
+		if (i < MEMCG_NUM_HISTO_BUCKETS - 1)
+			seq_printf(m, "< %-15llu",
+					mem_cont->memcg_histo_range[i]);
+		else
+			seq_printf(m, "< %-15s", "inf");
+		seq_printf(m, "%llu\n", sum);
+	}
+
+	return 0;
+}
+
+static int mem_cgroup_histogram_seq_write(struct cgroup *cgrp,
+					struct cftype *cft, const char *buffer)
+{
+	int i;
+	u64 data[MEMCG_NUM_HISTO_BUCKETS];
+	char *end;
+	struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cgrp);
+
+	if (!memcmp(buffer, "reset", 5)) {
+		for_each_present_cpu(i) {
+			struct memcg_histo *histo;
+
+			histo = per_cpu_ptr(mem_cont->memcg_histo, i);
+			memset(histo, 0, sizeof(*histo));
+		}
+		goto out;
+	}
+
+	for (i = 0; i < MEMCG_NUM_HISTO_BUCKETS - 1; i++, buffer = end) {
+		while ((isspace(*buffer)))
+			buffer++;
+		data[i] = simple_strtoull(buffer, &end, 10);
+	}
+	data[i] = ULLONG_MAX;
+
+	for (i = 1; i < MEMCG_NUM_HISTO_BUCKETS; i++)
+		if (data[i] < data[i - 1])
+			return -EINVAL;
+
+	memcpy(mem_cont->memcg_histo_range, data, sizeof(data));
+	for_each_present_cpu(i) {
+		struct memcg_histo *histo;
+		histo = per_cpu_ptr(mem_cont->memcg_histo, i);
+		memset(histo->count, 0, sizeof(*histo));
+	}
+out:
+	return 0;
+}
+
+/*
+ * Record values into histogram buckets
+ */
+void memcg_histogram_record(struct task_struct *tsk, u64 delta)
+{
+	u64 *base;
+	int index, first, last;
+	struct memcg_histo *histo;
+	struct mem_cgroup *mem = mem_cgroup_from_task(tsk);
+
+	if (sysctl_pgfault_histogram == 0)
+		return;
+
+	first = 0;
+	last = MEMCG_NUM_HISTO_BUCKETS - 1;
+	base = mem->memcg_histo_range;
+
+	if (delta >= base[first]) {
+		while (first < last) {
+			index = (first + last) / 2;
+			if (delta >= base[index])
+				first = index + 1;
+			else
+				last = index;
+		}
+	}
+	index = first;
+
+	histo = per_cpu_ptr(mem->memcg_histo, smp_processor_id());
+	histo->count[index]++;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4769,6 +4879,12 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "reclaim_wmarks",
 		.read_map = mem_cgroup_wmark_read,
 	},
+	{
+		.name = "pgfault_histogram",
+		.read_seq_string = mem_cgroup_histogram_seq_read,
+		.write_string = mem_cgroup_histogram_seq_write,
+		.max_write_len = 256,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4903,6 +5019,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 		free_mem_cgroup_per_zone_info(mem, node);
 
 	free_percpu(mem->stat);
+	free_percpu(mem->memcg_histo);
 	if (sizeof(struct mem_cgroup) < PAGE_SIZE)
 		kfree(mem);
 	else
@@ -5014,6 +5131,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	struct mem_cgroup *mem, *parent;
 	long error = -ENOMEM;
 	int node;
+	int i;
 
 	mem = mem_cgroup_alloc();
 	if (!mem)
@@ -5068,6 +5186,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mutex_init(&mem->thresholds_lock);
 	init_waitqueue_head(&mem->memcg_kswapd_end);
 	INIT_LIST_HEAD(&mem->memcg_kswapd_wait_list);
+
+	mem->memcg_histo = alloc_percpu(typeof(*mem->memcg_histo));
+	if (!mem->memcg_histo)
+		goto free_out;
+
+
+	for (i = 0; i < MEMCG_NUM_HISTO_BUCKETS - 1; i++)
+		mem->memcg_histo_range[i] = (1 << i) * 600ULL;
+	mem->memcg_histo_range[i] = ULLONG_MAX;
+
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-26 21:07 [PATCH] memcg: add pgfault latency histograms Ying Han
@ 2011-05-27  0:05 ` KAMEZAWA Hiroyuki
  2011-05-27  0:23   ` Ying Han
  2011-05-27  8:04 ` Balbir Singh
  1 sibling, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-27  0:05 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 26 May 2011 14:07:49 -0700
Ying Han <yinghan@google.com> wrote:

> This adds histogram to capture pagefault latencies on per-memcg basis. I used
> this patch on the memcg background reclaim test, and figured there could be more
> usecases to monitor/debug application performance.
> 
> The histogram is composed 8 bucket in ns unit. The last one is infinite (inf)
> which is everything beyond the last one. To be more flexible, the buckets can
> be reset and also each bucket is configurable at runtime.
> 
> memory.pgfault_histogram: exports the histogram on per-memcg basis and also can
> be reset by echoing "reset". Meantime, all the buckets are writable by echoing
> the range into the API. see the example below.
> 
> /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to turn
> on/off recording the histogram.
> 
> Functional Test:
> Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> Measure the anon page allocation latency.
> 
> $ mkdir /dev/cgroup/memory/B
> $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> $ echo $$ >/dev/cgroup/memory/B/tasks
> $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> $ allocate 8g anon pages
> 
> $ echo 1 >/proc/sys/vm/pgfault_histogram
> 
> $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> pgfault latency histogram (ns):
> < 600            2051273
> < 1200           40859
> < 2400           4004
> < 4800           1605
> < 9600           170
> < 19200          82
> < 38400          6
> < inf            0
> 
> $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> pgfault latency histogram (ns):
> < 600            0
> < 1200           0
> < 2400           0
> < 4800           0
> < 9600           0
> < 19200          0
> < 38400          0
> < inf            0
> 
> $ echo 500 520 540 580 600 1000 5000 >/dev/cgroup/memory/B/memory.pgfault_histogram
> $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> pgfault latency histogram (ns):
> < 500            50
> < 520            151
> < 540            3715
> < 580            1859812
> < 600            202241
> < 1000           25394
> < 5000           5875
> < inf            186
> 
> Performance Test:
> I ran through the PageFaultTest (pft) benchmark to measure the overhead of
> recording the histogram. There is no overhead observed on both "flt/cpu/s"
> and "fault/wsec".
> 
> $ mkdir /dev/cgroup/memory/A
> $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> $ echo $$ >/dev/cgroup/memory/A/tasks
> $ ./pft -m 15g -t 8 -T a
> 
> Result:
> "fault/wsec"
> 
> $ ./ministat no_histogram histogram
> x no_histogram
> + histogram
> +--------------------------------------------------------------------------+
>    N           Min           Max        Median           Avg        Stddev
> x   5     813404.51     824574.98      821661.3     820470.83     4202.0758
> +   5     821228.91     825894.66     822874.65     823374.15     1787.9355
> 
> "flt/cpu/s"
> 
> $ ./ministat no_histogram histogram
> x no_histogram
> + histogram
> +--------------------------------------------------------------------------+
>    N           Min           Max        Median           Avg        Stddev
> x   5     104951.93     106173.13     105142.73      105349.2     513.78158
> +   5     104697.67      105416.1     104943.52     104973.77     269.24781
> No difference proven at 95.0% confidence
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Hmm, interesting....but isn't it very very very complicated interface ?
Could you make this for 'perf' ? Then, everyone (including someone who don't use memcg)
will be happy.
Thanks,
-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  0:05 ` KAMEZAWA Hiroyuki
@ 2011-05-27  0:23   ` Ying Han
  2011-05-27  0:31     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 14+ messages in thread
From: Ying Han @ 2011-05-27  0:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4386 bytes --]

On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 26 May 2011 14:07:49 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This adds histogram to capture pagefault latencies on per-memcg basis. I
> used
> > this patch on the memcg background reclaim test, and figured there could
> be more
> > usecases to monitor/debug application performance.
> >
> > The histogram is composed 8 bucket in ns unit. The last one is infinite
> (inf)
> > which is everything beyond the last one. To be more flexible, the buckets
> can
> > be reset and also each bucket is configurable at runtime.
> >
> > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> also can
> > be reset by echoing "reset". Meantime, all the buckets are writable by
> echoing
> > the range into the API. see the example below.
> >
> > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> turn
> > on/off recording the histogram.
> >
> > Functional Test:
> > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> > Measure the anon page allocation latency.
> >
> > $ mkdir /dev/cgroup/memory/B
> > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/B/tasks
> > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> > $ allocate 8g anon pages
> >
> > $ echo 1 >/proc/sys/vm/pgfault_histogram
> >
> > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > pgfault latency histogram (ns):
> > < 600            2051273
> > < 1200           40859
> > < 2400           4004
> > < 4800           1605
> > < 9600           170
> > < 19200          82
> > < 38400          6
> > < inf            0
> >
> > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > pgfault latency histogram (ns):
> > < 600            0
> > < 1200           0
> > < 2400           0
> > < 4800           0
> > < 9600           0
> > < 19200          0
> > < 38400          0
> > < inf            0
> >
> > $ echo 500 520 540 580 600 1000 5000
> >/dev/cgroup/memory/B/memory.pgfault_histogram
> > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > pgfault latency histogram (ns):
> > < 500            50
> > < 520            151
> > < 540            3715
> > < 580            1859812
> > < 600            202241
> > < 1000           25394
> > < 5000           5875
> > < inf            186
> >
> > Performance Test:
> > I ran through the PageFaultTest (pft) benchmark to measure the overhead
> of
> > recording the histogram. There is no overhead observed on both
> "flt/cpu/s"
> > and "fault/wsec".
> >
> > $ mkdir /dev/cgroup/memory/A
> > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/A/tasks
> > $ ./pft -m 15g -t 8 -T a
> >
> > Result:
> > "fault/wsec"
> >
> > $ ./ministat no_histogram histogram
> > x no_histogram
> > + histogram
> >
> +--------------------------------------------------------------------------+
> >    N           Min           Max        Median           Avg
>  Stddev
> > x   5     813404.51     824574.98      821661.3     820470.83
> 4202.0758
> > +   5     821228.91     825894.66     822874.65     823374.15
> 1787.9355
> >
> > "flt/cpu/s"
> >
> > $ ./ministat no_histogram histogram
> > x no_histogram
> > + histogram
> >
> +--------------------------------------------------------------------------+
> >    N           Min           Max        Median           Avg
>  Stddev
> > x   5     104951.93     106173.13     105142.73      105349.2
> 513.78158
> > +   5     104697.67      105416.1     104943.52     104973.77
> 269.24781
> > No difference proven at 95.0% confidence
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> Hmm, interesting....but isn't it very very very complicated interface ?
> Could you make this for 'perf' ? Then, everyone (including someone who
> don't use memcg)
> will be happy.
>

Thank you for looking at it.

There is only one per-memcg API added which is basically exporting the
histogram. The "reset" and reconfiguring the bucket is not "must" but make
it more flexible. Also, the sysfs API can be reduced if necessary since
there is no over-head observed by always turning it on anyway.

I am not familiar w/ perf, any suggestions how it is supposed to be look
like?

Thanks

--Ying


> Thanks,
> -Kame
>
>
>

[-- Attachment #2: Type: text/html, Size: 5799 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  0:23   ` Ying Han
@ 2011-05-27  0:31     ` KAMEZAWA Hiroyuki
  2011-05-27  1:40       ` Ying Han
  0 siblings, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-27  0:31 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 26 May 2011 17:23:20 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 26 May 2011 14:07:49 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
> > used
> > > this patch on the memcg background reclaim test, and figured there could
> > be more
> > > usecases to monitor/debug application performance.
> > >
> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
> > (inf)
> > > which is everything beyond the last one. To be more flexible, the buckets
> > can
> > > be reset and also each bucket is configurable at runtime.
> > >
> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> > also can
> > > be reset by echoing "reset". Meantime, all the buckets are writable by
> > echoing
> > > the range into the API. see the example below.
> > >
> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> > turn
> > > on/off recording the histogram.
> > >
> > > Functional Test:
> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> > > Measure the anon page allocation latency.
> > >
> > > $ mkdir /dev/cgroup/memory/B
> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> > > $ echo $$ >/dev/cgroup/memory/B/tasks
> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> > > $ allocate 8g anon pages
> > >
> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
> > >
> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > > pgfault latency histogram (ns):
> > > < 600            2051273
> > > < 1200           40859
> > > < 2400           4004
> > > < 4800           1605
> > > < 9600           170
> > > < 19200          82
> > > < 38400          6
> > > < inf            0
> > >
> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > > pgfault latency histogram (ns):
> > > < 600            0
> > > < 1200           0
> > > < 2400           0
> > > < 4800           0
> > > < 9600           0
> > > < 19200          0
> > > < 38400          0
> > > < inf            0
> > >
> > > $ echo 500 520 540 580 600 1000 5000
> > >/dev/cgroup/memory/B/memory.pgfault_histogram
> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > > pgfault latency histogram (ns):
> > > < 500            50
> > > < 520            151
> > > < 540            3715
> > > < 580            1859812
> > > < 600            202241
> > > < 1000           25394
> > > < 5000           5875
> > > < inf            186
> > >
> > > Performance Test:
> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
> > of
> > > recording the histogram. There is no overhead observed on both
> > "flt/cpu/s"
> > > and "fault/wsec".
> > >
> > > $ mkdir /dev/cgroup/memory/A
> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > > $ ./pft -m 15g -t 8 -T a
> > >
> > > Result:
> > > "fault/wsec"
> > >
> > > $ ./ministat no_histogram histogram
> > > x no_histogram
> > > + histogram
> > >
> > +--------------------------------------------------------------------------+
> > >    N           Min           Max        Median           Avg
> >  Stddev
> > > x   5     813404.51     824574.98      821661.3     820470.83
> > 4202.0758
> > > +   5     821228.91     825894.66     822874.65     823374.15
> > 1787.9355
> > >
> > > "flt/cpu/s"
> > >
> > > $ ./ministat no_histogram histogram
> > > x no_histogram
> > > + histogram
> > >
> > +--------------------------------------------------------------------------+
> > >    N           Min           Max        Median           Avg
> >  Stddev
> > > x   5     104951.93     106173.13     105142.73      105349.2
> > 513.78158
> > > +   5     104697.67      105416.1     104943.52     104973.77
> > 269.24781
> > > No difference proven at 95.0% confidence
> > >
> > > Signed-off-by: Ying Han <yinghan@google.com>
> >
> > Hmm, interesting....but isn't it very very very complicated interface ?
> > Could you make this for 'perf' ? Then, everyone (including someone who
> > don't use memcg)
> > will be happy.
> >
> 
> Thank you for looking at it.
> 
> There is only one per-memcg API added which is basically exporting the
> histogram. The "reset" and reconfiguring the bucket is not "must" but make
> it more flexible. Also, the sysfs API can be reduced if necessary since
> there is no over-head observed by always turning it on anyway.
> 
> I am not familiar w/ perf, any suggestions how it is supposed to be look
> like?
> 
> Thanks
> 

IIUC, you can record "all" latency information by perf record. Then, latency
information can be dumped out to some file.

You can add a python? script for perf as

  # perf report memory-reclaim-latency-histgram -f perf.data
                -o 500,1000,1500,2000.....
   ...show histgram in text.. or report the histgram in graphic.

Good point is
  - you can reuse perf.data and show histgram from another point of view.

  - you can show another cut of view, for example, I think you can write a
    parser to show "changes in hisgram by time", easily.
    You may able to generate a movie ;)
    
  - Now, perf cgroup is supported. Then,
    - you can see per task histgram
    - you can see per cgroup histgram
    - you can see per system-wide histgram
      (If you record latency of usual kswapd/alloc_pages)

  - If you record latency within shrink_zone(), you can show per-zone
    reclaim latency histgram. record parsers can gather them and
    show histgram. This will be benefical to cpuset users.


I'm sorry if I miss something.

Thanks,
-Kame









--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  0:31     ` KAMEZAWA Hiroyuki
@ 2011-05-27  1:40       ` Ying Han
  2011-05-27  2:11         ` KAMEZAWA Hiroyuki
  2011-05-28 10:17         ` Ingo Molnar
  0 siblings, 2 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27  1:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 26 May 2011 17:23:20 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> > On Thu, 26 May 2011 14:07:49 -0700
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
>> > used
>> > > this patch on the memcg background reclaim test, and figured there could
>> > be more
>> > > usecases to monitor/debug application performance.
>> > >
>> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
>> > (inf)
>> > > which is everything beyond the last one. To be more flexible, the buckets
>> > can
>> > > be reset and also each bucket is configurable at runtime.
>> > >
>> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
>> > also can
>> > > be reset by echoing "reset". Meantime, all the buckets are writable by
>> > echoing
>> > > the range into the API. see the example below.
>> > >
>> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
>> > turn
>> > > on/off recording the histogram.
>> > >
>> > > Functional Test:
>> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
>> > > Measure the anon page allocation latency.
>> > >
>> > > $ mkdir /dev/cgroup/memory/B
>> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>> > > $ allocate 8g anon pages
>> > >
>> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>> > >
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 600            2051273
>> > > < 1200           40859
>> > > < 2400           4004
>> > > < 4800           1605
>> > > < 9600           170
>> > > < 19200          82
>> > > < 38400          6
>> > > < inf            0
>> > >
>> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 600            0
>> > > < 1200           0
>> > > < 2400           0
>> > > < 4800           0
>> > > < 9600           0
>> > > < 19200          0
>> > > < 38400          0
>> > > < inf            0
>> > >
>> > > $ echo 500 520 540 580 600 1000 5000
>> > >/dev/cgroup/memory/B/memory.pgfault_histogram
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 500            50
>> > > < 520            151
>> > > < 540            3715
>> > > < 580            1859812
>> > > < 600            202241
>> > > < 1000           25394
>> > > < 5000           5875
>> > > < inf            186
>> > >
>> > > Performance Test:
>> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
>> > of
>> > > recording the histogram. There is no overhead observed on both
>> > "flt/cpu/s"
>> > > and "fault/wsec".
>> > >
>> > > $ mkdir /dev/cgroup/memory/A
>> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> > > $ ./pft -m 15g -t 8 -T a
>> > >
>> > > Result:
>> > > "fault/wsec"
>> > >
>> > > $ ./ministat no_histogram histogram
>> > > x no_histogram
>> > > + histogram
>> > >
>> > +--------------------------------------------------------------------------+
>> > >    N           Min           Max        Median           Avg
>> >  Stddev
>> > > x   5     813404.51     824574.98      821661.3     820470.83
>> > 4202.0758
>> > > +   5     821228.91     825894.66     822874.65     823374.15
>> > 1787.9355
>> > >
>> > > "flt/cpu/s"
>> > >
>> > > $ ./ministat no_histogram histogram
>> > > x no_histogram
>> > > + histogram
>> > >
>> > +--------------------------------------------------------------------------+
>> > >    N           Min           Max        Median           Avg
>> >  Stddev
>> > > x   5     104951.93     106173.13     105142.73      105349.2
>> > 513.78158
>> > > +   5     104697.67      105416.1     104943.52     104973.77
>> > 269.24781
>> > > No difference proven at 95.0% confidence
>> > >
>> > > Signed-off-by: Ying Han <yinghan@google.com>
>> >
>> > Hmm, interesting....but isn't it very very very complicated interface ?
>> > Could you make this for 'perf' ? Then, everyone (including someone who
>> > don't use memcg)
>> > will be happy.
>> >
>>
>> Thank you for looking at it.
>>
>> There is only one per-memcg API added which is basically exporting the
>> histogram. The "reset" and reconfiguring the bucket is not "must" but make
>> it more flexible. Also, the sysfs API can be reduced if necessary since
>> there is no over-head observed by always turning it on anyway.
>>
>> I am not familiar w/ perf, any suggestions how it is supposed to be look
>> like?
>>
>> Thanks
>>
>
> IIUC, you can record "all" latency information by perf record. Then, latency
> information can be dumped out to some file.
>
> You can add a python? script for perf as
>
>  # perf report memory-reclaim-latency-histgram -f perf.data
>                -o 500,1000,1500,2000.....
>   ...show histgram in text.. or report the histgram in graphic.
>
> Good point is
>  - you can reuse perf.data and show histgram from another point of view.
>
>  - you can show another cut of view, for example, I think you can write a
>    parser to show "changes in hisgram by time", easily.
>    You may able to generate a movie ;)
>
>  - Now, perf cgroup is supported. Then,
>    - you can see per task histgram
>    - you can see per cgroup histgram
>    - you can see per system-wide histgram
>      (If you record latency of usual kswapd/alloc_pages)
>
>  - If you record latency within shrink_zone(), you can show per-zone
>    reclaim latency histgram. record parsers can gather them and
>    show histgram. This will be benefical to cpuset users.
>
>
> I'm sorry if I miss something.

After study a bit on perf, it is not feasible in this casecase. The
cpu & memory overhead of perf is overwhelming.... Each page fault will
generate a record in the buffer and how many data we can record in the
buffer, and how many data will be processed later.. Most of the data
that is recorded by the general perf framework is not needed here.

On the other hand, the memory consumption is very little in this
patch. We only need to keep a counter of each bucket and the recording
can go on as long as the machine is up. As also measured, there is no
overhead of the data collection :)

So, the perf is not an option for this purpose.

--Ying

>
> Thanks,
> -Kame
>
>
>
>
>
>
>
>
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  1:40       ` Ying Han
@ 2011-05-27  2:11         ` KAMEZAWA Hiroyuki
  2011-05-27  4:45           ` Ying Han
  2011-05-28 10:17         ` Ingo Molnar
  1 sibling, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-27  2:11 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 26 May 2011 18:40:44 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 26 May 2011 17:23:20 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >>
> >> > On Thu, 26 May 2011 14:07:49 -0700
> >> > Ying Han <yinghan@google.com> wrote:
> >> >
> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
> >> > used
> >> > > this patch on the memcg background reclaim test, and figured there could
> >> > be more
> >> > > usecases to monitor/debug application performance.
> >> > >
> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
> >> > (inf)
> >> > > which is everything beyond the last one. To be more flexible, the buckets
> >> > can
> >> > > be reset and also each bucket is configurable at runtime.
> >> > >
> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> >> > also can
> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
> >> > echoing
> >> > > the range into the API. see the example below.
> >> > >
> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> >> > turn
> >> > > on/off recording the histogram.
> >> > >
> >> > > Functional Test:
> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> >> > > Measure the anon page allocation latency.
> >> > >
> >> > > $ mkdir /dev/cgroup/memory/B
> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> >> > > $ allocate 8g anon pages
> >> > >
> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
> >> > >
> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > pgfault latency histogram (ns):
> >> > > < 600 A  A  A  A  A  A 2051273
> >> > > < 1200 A  A  A  A  A  40859
> >> > > < 2400 A  A  A  A  A  4004
> >> > > < 4800 A  A  A  A  A  1605
> >> > > < 9600 A  A  A  A  A  170
> >> > > < 19200 A  A  A  A  A 82
> >> > > < 38400 A  A  A  A  A 6
> >> > > < inf A  A  A  A  A  A 0
> >> > >
> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > pgfault latency histogram (ns):
> >> > > < 600 A  A  A  A  A  A 0
> >> > > < 1200 A  A  A  A  A  0
> >> > > < 2400 A  A  A  A  A  0
> >> > > < 4800 A  A  A  A  A  0
> >> > > < 9600 A  A  A  A  A  0
> >> > > < 19200 A  A  A  A  A 0
> >> > > < 38400 A  A  A  A  A 0
> >> > > < inf A  A  A  A  A  A 0
> >> > >
> >> > > $ echo 500 520 540 580 600 1000 5000
> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > pgfault latency histogram (ns):
> >> > > < 500 A  A  A  A  A  A 50
> >> > > < 520 A  A  A  A  A  A 151
> >> > > < 540 A  A  A  A  A  A 3715
> >> > > < 580 A  A  A  A  A  A 1859812
> >> > > < 600 A  A  A  A  A  A 202241
> >> > > < 1000 A  A  A  A  A  25394
> >> > > < 5000 A  A  A  A  A  5875
> >> > > < inf A  A  A  A  A  A 186
> >> > >
> >> > > Performance Test:
> >> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
> >> > of
> >> > > recording the histogram. There is no overhead observed on both
> >> > "flt/cpu/s"
> >> > > and "fault/wsec".
> >> > >
> >> > > $ mkdir /dev/cgroup/memory/A
> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
> >> > > $ ./pft -m 15g -t 8 -T a
> >> > >
> >> > > Result:
> >> > > "fault/wsec"
> >> > >
> >> > > $ ./ministat no_histogram histogram
> >> > > x no_histogram
> >> > > + histogram
> >> > >
> >> > +--------------------------------------------------------------------------+
> >> > > A  A N A  A  A  A  A  Min A  A  A  A  A  Max A  A  A  A Median A  A  A  A  A  Avg
> >> > A Stddev
> >> > > x A  5 A  A  813404.51 A  A  824574.98 A  A  A 821661.3 A  A  820470.83
> >> > 4202.0758
> >> > > + A  5 A  A  821228.91 A  A  825894.66 A  A  822874.65 A  A  823374.15
> >> > 1787.9355
> >> > >
> >> > > "flt/cpu/s"
> >> > >
> >> > > $ ./ministat no_histogram histogram
> >> > > x no_histogram
> >> > > + histogram
> >> > >
> >> > +--------------------------------------------------------------------------+
> >> > > A  A N A  A  A  A  A  Min A  A  A  A  A  Max A  A  A  A Median A  A  A  A  A  Avg
> >> > A Stddev
> >> > > x A  5 A  A  104951.93 A  A  106173.13 A  A  105142.73 A  A  A 105349.2
> >> > 513.78158
> >> > > + A  5 A  A  104697.67 A  A  A 105416.1 A  A  104943.52 A  A  104973.77
> >> > 269.24781
> >> > > No difference proven at 95.0% confidence
> >> > >
> >> > > Signed-off-by: Ying Han <yinghan@google.com>
> >> >
> >> > Hmm, interesting....but isn't it very very very complicated interface ?
> >> > Could you make this for 'perf' ? Then, everyone (including someone who
> >> > don't use memcg)
> >> > will be happy.
> >> >
> >>
> >> Thank you for looking at it.
> >>
> >> There is only one per-memcg API added which is basically exporting the
> >> histogram. The "reset" and reconfiguring the bucket is not "must" but make
> >> it more flexible. Also, the sysfs API can be reduced if necessary since
> >> there is no over-head observed by always turning it on anyway.
> >>
> >> I am not familiar w/ perf, any suggestions how it is supposed to be look
> >> like?
> >>
> >> Thanks
> >>
> >
> > IIUC, you can record "all" latency information by perf record. Then, latency
> > information can be dumped out to some file.
> >
> > You can add a python? script for perf as
> >
> > A # perf report memory-reclaim-latency-histgram -f perf.data
> > A  A  A  A  A  A  A  A -o 500,1000,1500,2000.....
> > A  ...show histgram in text.. or report the histgram in graphic.
> >
> > Good point is
> > A - you can reuse perf.data and show histgram from another point of view.
> >
> > A - you can show another cut of view, for example, I think you can write a
> > A  A parser to show "changes in hisgram by time", easily.
> > A  A You may able to generate a movie ;)
> >
> > A - Now, perf cgroup is supported. Then,
> > A  A - you can see per task histgram
> > A  A - you can see per cgroup histgram
> > A  A - you can see per system-wide histgram
> > A  A  A (If you record latency of usual kswapd/alloc_pages)
> >
> > A - If you record latency within shrink_zone(), you can show per-zone
> > A  A reclaim latency histgram. record parsers can gather them and
> > A  A show histgram. This will be benefical to cpuset users.
> >
> >
> > I'm sorry if I miss something.
> 
> After study a bit on perf, it is not feasible in this casecase. The
> cpu & memory overhead of perf is overwhelming.... Each page fault will
> generate a record in the buffer and how many data we can record in the
> buffer, and how many data will be processed later.. Most of the data
> that is recorded by the general perf framework is not needed here.
> 

I disagree. "each page fault" is not correct. "every lru scan" is correct.
Then, record to buffer will be at most memory.failcnt times. 

please consider more.


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  2:11         ` KAMEZAWA Hiroyuki
@ 2011-05-27  4:45           ` Ying Han
  2011-05-27  5:41             ` Ying Han
  2011-05-27  8:33             ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27  4:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, May 26, 2011 at 7:11 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 26 May 2011 18:40:44 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 26 May 2011 17:23:20 -0700
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >>
>> >> > On Thu, 26 May 2011 14:07:49 -0700
>> >> > Ying Han <yinghan@google.com> wrote:
>> >> >
>> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
>> >> > used
>> >> > > this patch on the memcg background reclaim test, and figured there could
>> >> > be more
>> >> > > usecases to monitor/debug application performance.
>> >> > >
>> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
>> >> > (inf)
>> >> > > which is everything beyond the last one. To be more flexible, the buckets
>> >> > can
>> >> > > be reset and also each bucket is configurable at runtime.
>> >> > >
>> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
>> >> > also can
>> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
>> >> > echoing
>> >> > > the range into the API. see the example below.
>> >> > >
>> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
>> >> > turn
>> >> > > on/off recording the histogram.
>> >> > >
>> >> > > Functional Test:
>> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
>> >> > > Measure the anon page allocation latency.
>> >> > >
>> >> > > $ mkdir /dev/cgroup/memory/B
>> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>> >> > > $ allocate 8g anon pages
>> >> > >
>> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>> >> > >
>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > pgfault latency histogram (ns):
>> >> > > < 600            2051273
>> >> > > < 1200           40859
>> >> > > < 2400           4004
>> >> > > < 4800           1605
>> >> > > < 9600           170
>> >> > > < 19200          82
>> >> > > < 38400          6
>> >> > > < inf            0
>> >> > >
>> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > pgfault latency histogram (ns):
>> >> > > < 600            0
>> >> > > < 1200           0
>> >> > > < 2400           0
>> >> > > < 4800           0
>> >> > > < 9600           0
>> >> > > < 19200          0
>> >> > > < 38400          0
>> >> > > < inf            0
>> >> > >
>> >> > > $ echo 500 520 540 580 600 1000 5000
>> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > pgfault latency histogram (ns):
>> >> > > < 500            50
>> >> > > < 520            151
>> >> > > < 540            3715
>> >> > > < 580            1859812
>> >> > > < 600            202241
>> >> > > < 1000           25394
>> >> > > < 5000           5875
>> >> > > < inf            186
>> >> > >
>> >> > > Performance Test:
>> >> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
>> >> > of
>> >> > > recording the histogram. There is no overhead observed on both
>> >> > "flt/cpu/s"
>> >> > > and "fault/wsec".
>> >> > >
>> >> > > $ mkdir /dev/cgroup/memory/A
>> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> >> > > $ ./pft -m 15g -t 8 -T a
>> >> > >
>> >> > > Result:
>> >> > > "fault/wsec"
>> >> > >
>> >> > > $ ./ministat no_histogram histogram
>> >> > > x no_histogram
>> >> > > + histogram
>> >> > >
>> >> > +--------------------------------------------------------------------------+
>> >> > >    N           Min           Max        Median           Avg
>> >> >  Stddev
>> >> > > x   5     813404.51     824574.98      821661.3     820470.83
>> >> > 4202.0758
>> >> > > +   5     821228.91     825894.66     822874.65     823374.15
>> >> > 1787.9355
>> >> > >
>> >> > > "flt/cpu/s"
>> >> > >
>> >> > > $ ./ministat no_histogram histogram
>> >> > > x no_histogram
>> >> > > + histogram
>> >> > >
>> >> > +--------------------------------------------------------------------------+
>> >> > >    N           Min           Max        Median           Avg
>> >> >  Stddev
>> >> > > x   5     104951.93     106173.13     105142.73      105349.2
>> >> > 513.78158
>> >> > > +   5     104697.67      105416.1     104943.52     104973.77
>> >> > 269.24781
>> >> > > No difference proven at 95.0% confidence
>> >> > >
>> >> > > Signed-off-by: Ying Han <yinghan@google.com>
>> >> >
>> >> > Hmm, interesting....but isn't it very very very complicated interface ?
>> >> > Could you make this for 'perf' ? Then, everyone (including someone who
>> >> > don't use memcg)
>> >> > will be happy.
>> >> >
>> >>
>> >> Thank you for looking at it.
>> >>
>> >> There is only one per-memcg API added which is basically exporting the
>> >> histogram. The "reset" and reconfiguring the bucket is not "must" but make
>> >> it more flexible. Also, the sysfs API can be reduced if necessary since
>> >> there is no over-head observed by always turning it on anyway.
>> >>
>> >> I am not familiar w/ perf, any suggestions how it is supposed to be look
>> >> like?
>> >>
>> >> Thanks
>> >>
>> >
>> > IIUC, you can record "all" latency information by perf record. Then, latency
>> > information can be dumped out to some file.
>> >
>> > You can add a python? script for perf as
>> >
>> >  # perf report memory-reclaim-latency-histgram -f perf.data
>> >                -o 500,1000,1500,2000.....
>> >   ...show histgram in text.. or report the histgram in graphic.
>> >
>> > Good point is
>> >  - you can reuse perf.data and show histgram from another point of view.
>> >
>> >  - you can show another cut of view, for example, I think you can write a
>> >    parser to show "changes in hisgram by time", easily.
>> >    You may able to generate a movie ;)
>> >
>> >  - Now, perf cgroup is supported. Then,
>> >    - you can see per task histgram
>> >    - you can see per cgroup histgram
>> >    - you can see per system-wide histgram
>> >      (If you record latency of usual kswapd/alloc_pages)
>> >
>> >  - If you record latency within shrink_zone(), you can show per-zone
>> >    reclaim latency histgram. record parsers can gather them and
>> >    show histgram. This will be benefical to cpuset users.
>> >
>> >
>> > I'm sorry if I miss something.
>>
>> After study a bit on perf, it is not feasible in this casecase. The
>> cpu & memory overhead of perf is overwhelming.... Each page fault will
>> generate a record in the buffer and how many data we can record in the
>> buffer, and how many data will be processed later.. Most of the data
>> that is recorded by the general perf framework is not needed here.
>>
>
> I disagree. "each page fault" is not correct. "every lru scan" is correct.
> Then, record to buffer will be at most memory.failcnt times.

Hmm. Sorry I might miss something here... :(

The page fault histogram recorded is per page-fault, only the ones
trigger reclaim. The background reclaim testing is just one usecase of
it, and we need this information for more
general usage to monitor application performance. So i recorded the
latency for each single page fault.

--Ying

>
> please consider more.
>
>
> Thanks,
> -Kame
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  4:45           ` Ying Han
@ 2011-05-27  5:41             ` Ying Han
  2011-05-27  8:33             ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27  5:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, May 26, 2011 at 9:45 PM, Ying Han <yinghan@google.com> wrote:
> On Thu, May 26, 2011 at 7:11 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> On Thu, 26 May 2011 18:40:44 -0700
>> Ying Han <yinghan@google.com> wrote:
>>
>>> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
>>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> > On Thu, 26 May 2011 17:23:20 -0700
>>> > Ying Han <yinghan@google.com> wrote:
>>> >
>>> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>>> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> >>
>>> >> > On Thu, 26 May 2011 14:07:49 -0700
>>> >> > Ying Han <yinghan@google.com> wrote:
>>> >> >
>>> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
>>> >> > used
>>> >> > > this patch on the memcg background reclaim test, and figured there could
>>> >> > be more
>>> >> > > usecases to monitor/debug application performance.
>>> >> > >
>>> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
>>> >> > (inf)
>>> >> > > which is everything beyond the last one. To be more flexible, the buckets
>>> >> > can
>>> >> > > be reset and also each bucket is configurable at runtime.
>>> >> > >
>>> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
>>> >> > also can
>>> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
>>> >> > echoing
>>> >> > > the range into the API. see the example below.
>>> >> > >
>>> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
>>> >> > turn
>>> >> > > on/off recording the histogram.
>>> >> > >
>>> >> > > Functional Test:
>>> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
>>> >> > > Measure the anon page allocation latency.
>>> >> > >
>>> >> > > $ mkdir /dev/cgroup/memory/B
>>> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>>> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>>> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>>> >> > > $ allocate 8g anon pages
>>> >> > >
>>> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>>> >> > >
>>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > pgfault latency histogram (ns):
>>> >> > > < 600            2051273
>>> >> > > < 1200           40859
>>> >> > > < 2400           4004
>>> >> > > < 4800           1605
>>> >> > > < 9600           170
>>> >> > > < 19200          82
>>> >> > > < 38400          6
>>> >> > > < inf            0
>>> >> > >
>>> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > pgfault latency histogram (ns):
>>> >> > > < 600            0
>>> >> > > < 1200           0
>>> >> > > < 2400           0
>>> >> > > < 4800           0
>>> >> > > < 9600           0
>>> >> > > < 19200          0
>>> >> > > < 38400          0
>>> >> > > < inf            0
>>> >> > >
>>> >> > > $ echo 500 520 540 580 600 1000 5000
>>> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > pgfault latency histogram (ns):
>>> >> > > < 500            50
>>> >> > > < 520            151
>>> >> > > < 540            3715
>>> >> > > < 580            1859812
>>> >> > > < 600            202241
>>> >> > > < 1000           25394
>>> >> > > < 5000           5875
>>> >> > > < inf            186
>>> >> > >
>>> >> > > Performance Test:
>>> >> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
>>> >> > of
>>> >> > > recording the histogram. There is no overhead observed on both
>>> >> > "flt/cpu/s"
>>> >> > > and "fault/wsec".
>>> >> > >
>>> >> > > $ mkdir /dev/cgroup/memory/A
>>> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>>> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>>> >> > > $ ./pft -m 15g -t 8 -T a
>>> >> > >
>>> >> > > Result:
>>> >> > > "fault/wsec"
>>> >> > >
>>> >> > > $ ./ministat no_histogram histogram
>>> >> > > x no_histogram
>>> >> > > + histogram
>>> >> > >
>>> >> > +--------------------------------------------------------------------------+
>>> >> > >    N           Min           Max        Median           Avg
>>> >> >  Stddev
>>> >> > > x   5     813404.51     824574.98      821661.3     820470.83
>>> >> > 4202.0758
>>> >> > > +   5     821228.91     825894.66     822874.65     823374.15
>>> >> > 1787.9355
>>> >> > >
>>> >> > > "flt/cpu/s"
>>> >> > >
>>> >> > > $ ./ministat no_histogram histogram
>>> >> > > x no_histogram
>>> >> > > + histogram
>>> >> > >
>>> >> > +--------------------------------------------------------------------------+
>>> >> > >    N           Min           Max        Median           Avg
>>> >> >  Stddev
>>> >> > > x   5     104951.93     106173.13     105142.73      105349.2
>>> >> > 513.78158
>>> >> > > +   5     104697.67      105416.1     104943.52     104973.77
>>> >> > 269.24781
>>> >> > > No difference proven at 95.0% confidence
>>> >> > >
>>> >> > > Signed-off-by: Ying Han <yinghan@google.com>
>>> >> >
>>> >> > Hmm, interesting....but isn't it very very very complicated interface ?
>>> >> > Could you make this for 'perf' ? Then, everyone (including someone who
>>> >> > don't use memcg)
>>> >> > will be happy.
>>> >> >
>>> >>
>>> >> Thank you for looking at it.
>>> >>
>>> >> There is only one per-memcg API added which is basically exporting the
>>> >> histogram. The "reset" and reconfiguring the bucket is not "must" but make
>>> >> it more flexible. Also, the sysfs API can be reduced if necessary since
>>> >> there is no over-head observed by always turning it on anyway.
>>> >>
>>> >> I am not familiar w/ perf, any suggestions how it is supposed to be look
>>> >> like?
>>> >>
>>> >> Thanks
>>> >>
>>> >
>>> > IIUC, you can record "all" latency information by perf record. Then, latency
>>> > information can be dumped out to some file.
>>> >
>>> > You can add a python? script for perf as
>>> >
>>> >  # perf report memory-reclaim-latency-histgram -f perf.data
>>> >                -o 500,1000,1500,2000.....
>>> >   ...show histgram in text.. or report the histgram in graphic.
>>> >
>>> > Good point is
>>> >  - you can reuse perf.data and show histgram from another point of view.
>>> >
>>> >  - you can show another cut of view, for example, I think you can write a
>>> >    parser to show "changes in hisgram by time", easily.
>>> >    You may able to generate a movie ;)
>>> >
>>> >  - Now, perf cgroup is supported. Then,
>>> >    - you can see per task histgram
>>> >    - you can see per cgroup histgram
>>> >    - you can see per system-wide histgram
>>> >      (If you record latency of usual kswapd/alloc_pages)
>>> >
>>> >  - If you record latency within shrink_zone(), you can show per-zone
>>> >    reclaim latency histgram. record parsers can gather them and
>>> >    show histgram. This will be benefical to cpuset users.
>>> >
>>> >
>>> > I'm sorry if I miss something.
>>>
>>> After study a bit on perf, it is not feasible in this casecase. The
>>> cpu & memory overhead of perf is overwhelming.... Each page fault will
>>> generate a record in the buffer and how many data we can record in the
>>> buffer, and how many data will be processed later.. Most of the data
>>> that is recorded by the general perf framework is not needed here.
>>>
>>
>> I disagree. "each page fault" is not correct. "every lru scan" is correct.
>> Then, record to buffer will be at most memory.failcnt times.
>
> Hmm. Sorry I might miss something here... :(
>
> The page fault histogram recorded is per page-fault, only the ones
> trigger reclaim.

typo. I meant it is recording per page-fault, not only the one
triggering the reclaim.

--Ying

The background reclaim testing is just one usecase of
> it, and we need this information for more
> general usage to monitor application performance. So i recorded the
> latency for each single page fault.
>
> --Ying
>
>>
>> please consider more.
>>
>>
>> Thanks,
>> -Kame
>>
>>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-26 21:07 [PATCH] memcg: add pgfault latency histograms Ying Han
  2011-05-27  0:05 ` KAMEZAWA Hiroyuki
@ 2011-05-27  8:04 ` Balbir Singh
  2011-05-27 16:27   ` Ying Han
  1 sibling, 1 reply; 14+ messages in thread
From: Balbir Singh @ 2011-05-27  8:04 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

* Ying Han <yinghan@google.com> [2011-05-26 14:07:49]:

> This adds histogram to capture pagefault latencies on per-memcg basis. I used
> this patch on the memcg background reclaim test, and figured there could be more
> usecases to monitor/debug application performance.
> 
> The histogram is composed 8 bucket in ns unit. The last one is infinite (inf)
> which is everything beyond the last one. To be more flexible, the buckets can
> be reset and also each bucket is configurable at runtime.
> 

inf is a bit confusing for page faults -- no? Why not call it "rest"
or something line "> 38400". BTW, why was 600 used as base?

> memory.pgfault_histogram: exports the histogram on per-memcg basis and also can
> be reset by echoing "reset". Meantime, all the buckets are writable by echoing
> the range into the API. see the example below.
> 
> /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to turn
> on/off recording the histogram.
>

Why not make this per memcg?
 
> Functional Test:
> Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> Measure the anon page allocation latency.
> 
> $ mkdir /dev/cgroup/memory/B
> $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> $ echo $$ >/dev/cgroup/memory/B/tasks
> $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> $ allocate 8g anon pages
> 
> $ echo 1 >/proc/sys/vm/pgfault_histogram
> 
> $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> pgfault latency histogram (ns):
> < 600            2051273
> < 1200           40859
> < 2400           4004
> < 4800           1605
> < 9600           170
> < 19200          82
> < 38400          6
> < inf            0
> 
> $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram

Can't we use something like "-1" to mean reset?

-- 
	Three Cheers,
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  4:45           ` Ying Han
  2011-05-27  5:41             ` Ying Han
@ 2011-05-27  8:33             ` KAMEZAWA Hiroyuki
  2011-05-27 18:46               ` Ying Han
  1 sibling, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-27  8:33 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Thu, 26 May 2011 21:45:28 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, May 26, 2011 at 7:11 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 26 May 2011 18:40:44 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > On Thu, 26 May 2011 17:23:20 -0700
> >> > Ying Han <yinghan@google.com> wrote:
> >> >
> >> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
> >> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> >>
> >> >> > On Thu, 26 May 2011 14:07:49 -0700
> >> >> > Ying Han <yinghan@google.com> wrote:
> >> >> >
> >> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
> >> >> > used
> >> >> > > this patch on the memcg background reclaim test, and figured there could
> >> >> > be more
> >> >> > > usecases to monitor/debug application performance.
> >> >> > >
> >> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
> >> >> > (inf)
> >> >> > > which is everything beyond the last one. To be more flexible, the buckets
> >> >> > can
> >> >> > > be reset and also each bucket is configurable at runtime.
> >> >> > >
> >> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> >> >> > also can
> >> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
> >> >> > echoing
> >> >> > > the range into the API. see the example below.
> >> >> > >
> >> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> >> >> > turn
> >> >> > > on/off recording the histogram.
> >> >> > >
> >> >> > > Functional Test:
> >> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> >> >> > > Measure the anon page allocation latency.
> >> >> > >
> >> >> > > $ mkdir /dev/cgroup/memory/B
> >> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> >> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
> >> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> >> >> > > $ allocate 8g anon pages
> >> >> > >
> >> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
> >> >> > >
> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > pgfault latency histogram (ns):
> >> >> > > < 600 A  A  A  A  A  A 2051273
> >> >> > > < 1200 A  A  A  A  A  40859
> >> >> > > < 2400 A  A  A  A  A  4004
> >> >> > > < 4800 A  A  A  A  A  1605
> >> >> > > < 9600 A  A  A  A  A  170
> >> >> > > < 19200 A  A  A  A  A 82
> >> >> > > < 38400 A  A  A  A  A 6
> >> >> > > < inf A  A  A  A  A  A 0
> >> >> > >
> >> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > pgfault latency histogram (ns):
> >> >> > > < 600 A  A  A  A  A  A 0
> >> >> > > < 1200 A  A  A  A  A  0
> >> >> > > < 2400 A  A  A  A  A  0
> >> >> > > < 4800 A  A  A  A  A  0
> >> >> > > < 9600 A  A  A  A  A  0
> >> >> > > < 19200 A  A  A  A  A 0
> >> >> > > < 38400 A  A  A  A  A 0
> >> >> > > < inf A  A  A  A  A  A 0
> >> >> > >
> >> >> > > $ echo 500 520 540 580 600 1000 5000
> >> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > pgfault latency histogram (ns):
> >> >> > > < 500 A  A  A  A  A  A 50
> >> >> > > < 520 A  A  A  A  A  A 151
> >> >> > > < 540 A  A  A  A  A  A 3715
> >> >> > > < 580 A  A  A  A  A  A 1859812
> >> >> > > < 600 A  A  A  A  A  A 202241
> >> >> > > < 1000 A  A  A  A  A  25394
> >> >> > > < 5000 A  A  A  A  A  5875
> >> >> > > < inf A  A  A  A  A  A 186
> >> >> > >
> >> >> > > Performance Test:
> >> >> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
> >> >> > of
> >> >> > > recording the histogram. There is no overhead observed on both
> >> >> > "flt/cpu/s"
> >> >> > > and "fault/wsec".
> >> >> > >
> >> >> > > $ mkdir /dev/cgroup/memory/A
> >> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> >> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
> >> >> > > $ ./pft -m 15g -t 8 -T a
> >> >> > >
> >> >> > > Result:
> >> >> > > "fault/wsec"
> >> >> > >
> >> >> > > $ ./ministat no_histogram histogram
> >> >> > > x no_histogram
> >> >> > > + histogram
> >> >> > >
> >> >> > +--------------------------------------------------------------------------+
> >> >> > > A  A N A  A  A  A  A  Min A  A  A  A  A  Max A  A  A  A Median A  A  A  A  A  Avg
> >> >> > A Stddev
> >> >> > > x A  5 A  A  813404.51 A  A  824574.98 A  A  A 821661.3 A  A  820470.83
> >> >> > 4202.0758
> >> >> > > + A  5 A  A  821228.91 A  A  825894.66 A  A  822874.65 A  A  823374.15
> >> >> > 1787.9355
> >> >> > >
> >> >> > > "flt/cpu/s"
> >> >> > >
> >> >> > > $ ./ministat no_histogram histogram
> >> >> > > x no_histogram
> >> >> > > + histogram
> >> >> > >
> >> >> > +--------------------------------------------------------------------------+
> >> >> > > A  A N A  A  A  A  A  Min A  A  A  A  A  Max A  A  A  A Median A  A  A  A  A  Avg
> >> >> > A Stddev
> >> >> > > x A  5 A  A  104951.93 A  A  106173.13 A  A  105142.73 A  A  A 105349.2
> >> >> > 513.78158
> >> >> > > + A  5 A  A  104697.67 A  A  A 105416.1 A  A  104943.52 A  A  104973.77
> >> >> > 269.24781
> >> >> > > No difference proven at 95.0% confidence
> >> >> > >
> >> >> > > Signed-off-by: Ying Han <yinghan@google.com>
> >> >> >
> >> >> > Hmm, interesting....but isn't it very very very complicated interface ?
> >> >> > Could you make this for 'perf' ? Then, everyone (including someone who
> >> >> > don't use memcg)
> >> >> > will be happy.
> >> >> >
> >> >>
> >> >> Thank you for looking at it.
> >> >>
> >> >> There is only one per-memcg API added which is basically exporting the
> >> >> histogram. The "reset" and reconfiguring the bucket is not "must" but make
> >> >> it more flexible. Also, the sysfs API can be reduced if necessary since
> >> >> there is no over-head observed by always turning it on anyway.
> >> >>
> >> >> I am not familiar w/ perf, any suggestions how it is supposed to be look
> >> >> like?
> >> >>
> >> >> Thanks
> >> >>
> >> >
> >> > IIUC, you can record "all" latency information by perf record. Then, latency
> >> > information can be dumped out to some file.
> >> >
> >> > You can add a python? script for perf as
> >> >
> >> > A # perf report memory-reclaim-latency-histgram -f perf.data
> >> > A  A  A  A  A  A  A  A -o 500,1000,1500,2000.....
> >> > A  ...show histgram in text.. or report the histgram in graphic.
> >> >
> >> > Good point is
> >> > A - you can reuse perf.data and show histgram from another point of view.
> >> >
> >> > A - you can show another cut of view, for example, I think you can write a
> >> > A  A parser to show "changes in hisgram by time", easily.
> >> > A  A You may able to generate a movie ;)
> >> >
> >> > A - Now, perf cgroup is supported. Then,
> >> > A  A - you can see per task histgram
> >> > A  A - you can see per cgroup histgram
> >> > A  A - you can see per system-wide histgram
> >> > A  A  A (If you record latency of usual kswapd/alloc_pages)
> >> >
> >> > A - If you record latency within shrink_zone(), you can show per-zone
> >> > A  A reclaim latency histgram. record parsers can gather them and
> >> > A  A show histgram. This will be benefical to cpuset users.
> >> >
> >> >
> >> > I'm sorry if I miss something.
> >>
> >> After study a bit on perf, it is not feasible in this casecase. The
> >> cpu & memory overhead of perf is overwhelming.... Each page fault will
> >> generate a record in the buffer and how many data we can record in the
> >> buffer, and how many data will be processed later.. Most of the data
> >> that is recorded by the general perf framework is not needed here.
> >>
> >
> > I disagree. "each page fault" is not correct. "every lru scan" is correct.
> > Then, record to buffer will be at most memory.failcnt times.
> 
> Hmm. Sorry I might miss something here... :(
> 
> The page fault histogram recorded is per page-fault, only the ones
> trigger reclaim. The background reclaim testing is just one usecase of
> it, and we need this information for more
> general usage to monitor application performance. So i recorded the
> latency for each single page fault.
> 

BTW, why page-fault only ? For some applications, file cache is more imporatant.
I think usual page fault's usual cost is not in interest.
you can get PGPGIN statistics from other source.

Anyway, I think it's better for record reclaim latency.


Thanks,
-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  8:04 ` Balbir Singh
@ 2011-05-27 16:27   ` Ying Han
  0 siblings, 0 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27 16:27 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
	Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2738 bytes --]

On Fri, May 27, 2011 at 1:04 AM, Balbir Singh <balbir@linux.vnet.ibm.com>wrote:

> * Ying Han <yinghan@google.com> [2011-05-26 14:07:49]:
>
> > This adds histogram to capture pagefault latencies on per-memcg basis. I
> used
> > this patch on the memcg background reclaim test, and figured there could
> be more
> > usecases to monitor/debug application performance.
> >
> > The histogram is composed 8 bucket in ns unit. The last one is infinite
> (inf)
> > which is everything beyond the last one. To be more flexible, the buckets
> can
> > be reset and also each bucket is configurable at runtime.
> >
>
> inf is a bit confusing for page faults -- no? Why not call it "rest"
> or something line "> 38400".


ok, i can change that to "rest".


> BTW, why was 600 used as base?
>

well, that is based some of my experiments. I am doing anon page allocation
and most of the page fault falls into the bucket of 580 ns - 600 ns. So I
just leave it as default.

However, the bucket is configurable and user can change it based on their
workload and platform.




>
> > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> also can
> > be reset by echoing "reset". Meantime, all the buckets are writable by
> echoing
> > the range into the API. see the example below.
> >
> > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> turn
> > on/off recording the histogram.
> >
>
> Why not make this per memcg?
>

That can be done.

>
> > Functional Test:
> > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> > Measure the anon page allocation latency.
> >
> > $ mkdir /dev/cgroup/memory/B
> > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/B/tasks
> > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> > $ allocate 8g anon pages
> >
> > $ echo 1 >/proc/sys/vm/pgfault_histogram
> >
> > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > pgfault latency histogram (ns):
> > < 600            2051273
> > < 1200           40859
> > < 2400           4004
> > < 4800           1605
> > < 9600           170
> > < 19200          82
> > < 38400          6
> > < inf            0
> >
> > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>
> Can't we use something like "-1" to mean reset?
>

sounds good to me.

Thank you for reviewing.

--Ying

>
> --
>        Three Cheers,
>        Balbir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

[-- Attachment #2: Type: text/html, Size: 4447 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  8:33             ` KAMEZAWA Hiroyuki
@ 2011-05-27 18:46               ` Ying Han
  0 siblings, 0 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27 18:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
	Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Fri, May 27, 2011 at 1:33 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 26 May 2011 21:45:28 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Thu, May 26, 2011 at 7:11 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 26 May 2011 18:40:44 -0700
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> >> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
>> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >> > On Thu, 26 May 2011 17:23:20 -0700
>> >> > Ying Han <yinghan@google.com> wrote:
>> >> >
>> >> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>> >> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >> >>
>> >> >> > On Thu, 26 May 2011 14:07:49 -0700
>> >> >> > Ying Han <yinghan@google.com> wrote:
>> >> >> >
>> >> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
>> >> >> > used
>> >> >> > > this patch on the memcg background reclaim test, and figured there could
>> >> >> > be more
>> >> >> > > usecases to monitor/debug application performance.
>> >> >> > >
>> >> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
>> >> >> > (inf)
>> >> >> > > which is everything beyond the last one. To be more flexible, the buckets
>> >> >> > can
>> >> >> > > be reset and also each bucket is configurable at runtime.
>> >> >> > >
>> >> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
>> >> >> > also can
>> >> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
>> >> >> > echoing
>> >> >> > > the range into the API. see the example below.
>> >> >> > >
>> >> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
>> >> >> > turn
>> >> >> > > on/off recording the histogram.
>> >> >> > >
>> >> >> > > Functional Test:
>> >> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
>> >> >> > > Measure the anon page allocation latency.
>> >> >> > >
>> >> >> > > $ mkdir /dev/cgroup/memory/B
>> >> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>> >> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>> >> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>> >> >> > > $ allocate 8g anon pages
>> >> >> > >
>> >> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>> >> >> > >
>> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > pgfault latency histogram (ns):
>> >> >> > > < 600            2051273
>> >> >> > > < 1200           40859
>> >> >> > > < 2400           4004
>> >> >> > > < 4800           1605
>> >> >> > > < 9600           170
>> >> >> > > < 19200          82
>> >> >> > > < 38400          6
>> >> >> > > < inf            0
>> >> >> > >
>> >> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > pgfault latency histogram (ns):
>> >> >> > > < 600            0
>> >> >> > > < 1200           0
>> >> >> > > < 2400           0
>> >> >> > > < 4800           0
>> >> >> > > < 9600           0
>> >> >> > > < 19200          0
>> >> >> > > < 38400          0
>> >> >> > > < inf            0
>> >> >> > >
>> >> >> > > $ echo 500 520 540 580 600 1000 5000
>> >> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > pgfault latency histogram (ns):
>> >> >> > > < 500            50
>> >> >> > > < 520            151
>> >> >> > > < 540            3715
>> >> >> > > < 580            1859812
>> >> >> > > < 600            202241
>> >> >> > > < 1000           25394
>> >> >> > > < 5000           5875
>> >> >> > > < inf            186
>> >> >> > >
>> >> >> > > Performance Test:
>> >> >> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
>> >> >> > of
>> >> >> > > recording the histogram. There is no overhead observed on both
>> >> >> > "flt/cpu/s"
>> >> >> > > and "fault/wsec".
>> >> >> > >
>> >> >> > > $ mkdir /dev/cgroup/memory/A
>> >> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>> >> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> >> >> > > $ ./pft -m 15g -t 8 -T a
>> >> >> > >
>> >> >> > > Result:
>> >> >> > > "fault/wsec"
>> >> >> > >
>> >> >> > > $ ./ministat no_histogram histogram
>> >> >> > > x no_histogram
>> >> >> > > + histogram
>> >> >> > >
>> >> >> > +--------------------------------------------------------------------------+
>> >> >> > >    N           Min           Max        Median           Avg
>> >> >> >  Stddev
>> >> >> > > x   5     813404.51     824574.98      821661.3     820470.83
>> >> >> > 4202.0758
>> >> >> > > +   5     821228.91     825894.66     822874.65     823374.15
>> >> >> > 1787.9355
>> >> >> > >
>> >> >> > > "flt/cpu/s"
>> >> >> > >
>> >> >> > > $ ./ministat no_histogram histogram
>> >> >> > > x no_histogram
>> >> >> > > + histogram
>> >> >> > >
>> >> >> > +--------------------------------------------------------------------------+
>> >> >> > >    N           Min           Max        Median           Avg
>> >> >> >  Stddev
>> >> >> > > x   5     104951.93     106173.13     105142.73      105349.2
>> >> >> > 513.78158
>> >> >> > > +   5     104697.67      105416.1     104943.52     104973.77
>> >> >> > 269.24781
>> >> >> > > No difference proven at 95.0% confidence
>> >> >> > >
>> >> >> > > Signed-off-by: Ying Han <yinghan@google.com>
>> >> >> >
>> >> >> > Hmm, interesting....but isn't it very very very complicated interface ?
>> >> >> > Could you make this for 'perf' ? Then, everyone (including someone who
>> >> >> > don't use memcg)
>> >> >> > will be happy.
>> >> >> >
>> >> >>
>> >> >> Thank you for looking at it.
>> >> >>
>> >> >> There is only one per-memcg API added which is basically exporting the
>> >> >> histogram. The "reset" and reconfiguring the bucket is not "must" but make
>> >> >> it more flexible. Also, the sysfs API can be reduced if necessary since
>> >> >> there is no over-head observed by always turning it on anyway.
>> >> >>
>> >> >> I am not familiar w/ perf, any suggestions how it is supposed to be look
>> >> >> like?
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >
>> >> > IIUC, you can record "all" latency information by perf record. Then, latency
>> >> > information can be dumped out to some file.
>> >> >
>> >> > You can add a python? script for perf as
>> >> >
>> >> >  # perf report memory-reclaim-latency-histgram -f perf.data
>> >> >                -o 500,1000,1500,2000.....
>> >> >   ...show histgram in text.. or report the histgram in graphic.
>> >> >
>> >> > Good point is
>> >> >  - you can reuse perf.data and show histgram from another point of view.
>> >> >
>> >> >  - you can show another cut of view, for example, I think you can write a
>> >> >    parser to show "changes in hisgram by time", easily.
>> >> >    You may able to generate a movie ;)
>> >> >
>> >> >  - Now, perf cgroup is supported. Then,
>> >> >    - you can see per task histgram
>> >> >    - you can see per cgroup histgram
>> >> >    - you can see per system-wide histgram
>> >> >      (If you record latency of usual kswapd/alloc_pages)
>> >> >
>> >> >  - If you record latency within shrink_zone(), you can show per-zone
>> >> >    reclaim latency histgram. record parsers can gather them and
>> >> >    show histgram. This will be benefical to cpuset users.
>> >> >
>> >> >
>> >> > I'm sorry if I miss something.
>> >>
>> >> After study a bit on perf, it is not feasible in this casecase. The
>> >> cpu & memory overhead of perf is overwhelming.... Each page fault will
>> >> generate a record in the buffer and how many data we can record in the
>> >> buffer, and how many data will be processed later.. Most of the data
>> >> that is recorded by the general perf framework is not needed here.
>> >>
>> >
>> > I disagree. "each page fault" is not correct. "every lru scan" is correct.
>> > Then, record to buffer will be at most memory.failcnt times.
>>
>> Hmm. Sorry I might miss something here... :(
>>
>> The page fault histogram recorded is per page-fault, only the ones
>> trigger reclaim. The background reclaim testing is just one usecase of
>> it, and we need this information for more
>> general usage to monitor application performance. So i recorded the
>> latency for each single page fault.
>>
>
> BTW, why page-fault only ? For some applications, file cache is more imporatant.
> I think usual page fault's usual cost is not in interest.
> you can get PGPGIN statistics from other source.
>
> Anyway, I think it's better for record reclaim latency.

Sounds reasonable. I will add that in the next post.

Thanks for reviewing

--Ying

>
>
> Thanks,
> -Kame
>
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-27  1:40       ` Ying Han
  2011-05-27  2:11         ` KAMEZAWA Hiroyuki
@ 2011-05-28 10:17         ` Ingo Molnar
  2011-05-31 16:51           ` Ying Han
  1 sibling, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2011-05-28 10:17 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
	Dave Hansen, Zhu Yanhai, linux-mm, Peter Zijlstra,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi


* Ying Han <yinghan@google.com> wrote:

> After study a bit on perf, it is not feasible in this casecase. The 
> cpu & memory overhead of perf is overwhelming.... Each page fault 
> will generate a record in the buffer and how many data we can 
> record in the buffer, and how many data will be processed later.. 
> Most of the data that is recorded by the general perf framework is 
> not needed here.
>
> 
> On the other hand, the memory consumption is very little in this 
> patch. We only need to keep a counter of each bucket and the 
> recording can go on as long as the machine is up. As also measured, 
> there is no overhead of the data collection :)
> 
> So, the perf is not an option for this purpose.

It's not a fundamental limitation in perf though.

The way i always thought perf could be extended to support heavy-duty 
profiling such as your patch does would be along the following lines:

Right now perf supports three output methods:

           'full detail': per sample records, recorded in the ring-buffer
  'filtered full detail': per sample records filtered, recorded in the ring-buffer
          'full summary': the count of all samples (simple counter), no recording

What i think would make sense is to introduce a fourth variant, which 
is a natural intermediate of the above output methods:

       'partial summary': partially summarized samples, record in an 
                          array in the ring-buffer - an extended 
                          multi-dimensional 'count'.

A histogram like yours would be one (small) sub-case of this new 
model.

Now, to keep things maximally flexible we really do not want to hard 
code histogram summary functions: i.e. we do not want to hardcode 
ourselves to 'latency histograms' or 'frequency histograms'.

To achieve that flexibility we could define the histogram function as 
a simple extension to filters: filters that evaluate to an integer 
value.

For example, if we defined the following tracepoint in 
arch/x86/mm/fault.c:

TRACE_EVENT(mm_pagefault,

       TP_PROTO(u64 time_start, u64 time_end, unsigned long address, int error_code, unsigned long ip),

       TP_ARGS(time_start, time_end, address, error_code, ip),

       TP_STRUCT__entry(
               __field(u64,           time_start)
               __field(u64,           time_end)
               __field(unsigned long, address)
               __field(unsigned long, error_code)
               __field(unsigned long, ip)
       ),

       TP_fast_assign(
               __entry->time_start     = time_start;
               __entry->time_end       = time_end;
               __entry->address        = address;
               __entry->error_code     = error_code;
               __entry->ip             = ip;
       ),

       TP_printk("time_start=%uL time_end=%uL address=%lx, error code=%lx, ip=%lx",
               __entry->time_start, __entry->time_end,
               __entry->address, __entry->error_code, __entry->ip)


Then the following filter expressions could be used to calculate the 
histogram index and value:

	   index: "(time_end - time_start)/1000"
	iterator: "curr + 1"

The /1000 index filter expression means that there is one separate 
bucket per microsecond of delay.

The "curr + 1" iterator filter expression would represent that for 
every bucket an event means we add +1 to the current bucket value.

Today our filter expressions evaluate to a small subset of integer 
numbers: 0 or 1 :-)

Extending them to integer calculations is possible and would be 
desirable for other purposes as well, not just histograms. Adding 
integer operators in addition to the logical and bitwise operators 
the filter engine supports today would be useful as well. (See 
kernel/trace/trace_events_filter.c for the current filter engine.)

This way we would have the equivalent functionality and performance 
of your histogram patch - and it would also open up many, *many* 
other nice possibilities as well:

 - this could be used with any event, anywhere - could even be used
   with hardware events. We could sample with an NMI every 100 usecs 
   and profile with relatively small profiling overhead.

 - arbitrarily large histograms could be created: need a 10 GB
   histogram on a really large system? No problem, create such
   a big ring-buffer.

 - many different types of summaries are possible as well:

    - we could create a histogram over *which* code pagefaults, via
      using the "ip" (faulting instruction) address and a 
      sufficiently large ring-buffer.

    - histogram over the address space (which vmas are the hottest ones),
      by changing the first filter to "address/1000000" to have per
      megabyte buckets.

    - weighted histograms: for example if the histogram iteration 
      function is "curr + (time_end-time_start)/1000" and the
      histogram index is "address/1000000", then we get an 
      address-indexed histogram weighted by length of latency: the 
      higher latencies a given area of memory causes, the hotter the
      bucket.

 - the existing event filter code can be used to filter the incoming
   events to begin with: for example an "error_code = 1" filter would
   limit the histogram to write faults (page dirtying).

So instead of adding just one hardcoded histogram type, it would be 
really nice to work on a more generic solution!

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] memcg: add pgfault latency histograms
  2011-05-28 10:17         ` Ingo Molnar
@ 2011-05-31 16:51           ` Ying Han
  0 siblings, 0 replies; 14+ messages in thread
From: Ying Han @ 2011-05-31 16:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
	Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
	Dave Hansen, Zhu Yanhai, linux-mm, Peter Zijlstra,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi

On Sat, May 28, 2011 at 3:17 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ying Han <yinghan@google.com> wrote:
>
>> After study a bit on perf, it is not feasible in this casecase. The
>> cpu & memory overhead of perf is overwhelming.... Each page fault
>> will generate a record in the buffer and how many data we can
>> record in the buffer, and how many data will be processed later..
>> Most of the data that is recorded by the general perf framework is
>> not needed here.
>>
>>
>> On the other hand, the memory consumption is very little in this
>> patch. We only need to keep a counter of each bucket and the
>> recording can go on as long as the machine is up. As also measured,
>> there is no overhead of the data collection :)
>>
>> So, the perf is not an option for this purpose.
>
> It's not a fundamental limitation in perf though.
>
> The way i always thought perf could be extended to support heavy-duty
> profiling such as your patch does would be along the following lines:
>
> Right now perf supports three output methods:
>
>           'full detail': per sample records, recorded in the ring-buffer
>  'filtered full detail': per sample records filtered, recorded in the ring-buffer
>          'full summary': the count of all samples (simple counter), no recording
>
> What i think would make sense is to introduce a fourth variant, which
> is a natural intermediate of the above output methods:
>
>       'partial summary': partially summarized samples, record in an
>                          array in the ring-buffer - an extended
>                          multi-dimensional 'count'.
>
> A histogram like yours would be one (small) sub-case of this new
> model.
>
> Now, to keep things maximally flexible we really do not want to hard
> code histogram summary functions: i.e. we do not want to hardcode
> ourselves to 'latency histograms' or 'frequency histograms'.
>
> To achieve that flexibility we could define the histogram function as
> a simple extension to filters: filters that evaluate to an integer
> value.
>
> For example, if we defined the following tracepoint in
> arch/x86/mm/fault.c:
>
> TRACE_EVENT(mm_pagefault,
>
>       TP_PROTO(u64 time_start, u64 time_end, unsigned long address, int error_code, unsigned long ip),
>
>       TP_ARGS(time_start, time_end, address, error_code, ip),
>
>       TP_STRUCT__entry(
>               __field(u64,           time_start)
>               __field(u64,           time_end)
>               __field(unsigned long, address)
>               __field(unsigned long, error_code)
>               __field(unsigned long, ip)
>       ),
>
>       TP_fast_assign(
>               __entry->time_start     = time_start;
>               __entry->time_end       = time_end;
>               __entry->address        = address;
>               __entry->error_code     = error_code;
>               __entry->ip             = ip;
>       ),
>
>       TP_printk("time_start=%uL time_end=%uL address=%lx, error code=%lx, ip=%lx",
>               __entry->time_start, __entry->time_end,
>               __entry->address, __entry->error_code, __entry->ip)
>
>
> Then the following filter expressions could be used to calculate the
> histogram index and value:
>
>           index: "(time_end - time_start)/1000"
>        iterator: "curr + 1"
>
> The /1000 index filter expression means that there is one separate
> bucket per microsecond of delay.
>
> The "curr + 1" iterator filter expression would represent that for
> every bucket an event means we add +1 to the current bucket value.
>
> Today our filter expressions evaluate to a small subset of integer
> numbers: 0 or 1 :-)
>
> Extending them to integer calculations is possible and would be
> desirable for other purposes as well, not just histograms. Adding
> integer operators in addition to the logical and bitwise operators
> the filter engine supports today would be useful as well. (See
> kernel/trace/trace_events_filter.c for the current filter engine.)
>
> This way we would have the equivalent functionality and performance
> of your histogram patch - and it would also open up many, *many*
> other nice possibilities as well:
>
>  - this could be used with any event, anywhere - could even be used
>   with hardware events. We could sample with an NMI every 100 usecs
>   and profile with relatively small profiling overhead.
>
>  - arbitrarily large histograms could be created: need a 10 GB
>   histogram on a really large system? No problem, create such
>   a big ring-buffer.
>
>  - many different types of summaries are possible as well:
>
>    - we could create a histogram over *which* code pagefaults, via
>      using the "ip" (faulting instruction) address and a
>      sufficiently large ring-buffer.
>
>    - histogram over the address space (which vmas are the hottest ones),
>      by changing the first filter to "address/1000000" to have per
>      megabyte buckets.
>
>    - weighted histograms: for example if the histogram iteration
>      function is "curr + (time_end-time_start)/1000" and the
>      histogram index is "address/1000000", then we get an
>      address-indexed histogram weighted by length of latency: the
>      higher latencies a given area of memory causes, the hotter the
>      bucket.
>
>  - the existing event filter code can be used to filter the incoming
>   events to begin with: for example an "error_code = 1" filter would
>   limit the histogram to write faults (page dirtying).
>
> So instead of adding just one hardcoded histogram type, it would be
> really nice to work on a more generic solution!
>
> Thanks,
>
>        Ingo

Hi Ingo,

Thank you for the detailed information.

This patch is used to evaluating the memcg reclaim patch and I have
got some interesting results.  I will post the next version of the
patch which made couple of improvement based on the comments from the
thread. Meantime, I will need to study more on your suggestion :)

Thanks

--Ying
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2011-05-31 16:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-26 21:07 [PATCH] memcg: add pgfault latency histograms Ying Han
2011-05-27  0:05 ` KAMEZAWA Hiroyuki
2011-05-27  0:23   ` Ying Han
2011-05-27  0:31     ` KAMEZAWA Hiroyuki
2011-05-27  1:40       ` Ying Han
2011-05-27  2:11         ` KAMEZAWA Hiroyuki
2011-05-27  4:45           ` Ying Han
2011-05-27  5:41             ` Ying Han
2011-05-27  8:33             ` KAMEZAWA Hiroyuki
2011-05-27 18:46               ` Ying Han
2011-05-28 10:17         ` Ingo Molnar
2011-05-31 16:51           ` Ying Han
2011-05-27  8:04 ` Balbir Singh
2011-05-27 16:27   ` Ying Han

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.