* [PATCH] memcg: add pgfault latency histograms
@ 2011-05-26 21:07 Ying Han
2011-05-27 0:05 ` KAMEZAWA Hiroyuki
2011-05-27 8:04 ` Balbir Singh
0 siblings, 2 replies; 14+ messages in thread
From: Ying Han @ 2011-05-26 21:07 UTC (permalink / raw)
To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
Zhu Yanhai
Cc: linux-mm
This adds a histogram to capture page fault latencies on a per-memcg basis. I used
this patch on the memcg background reclaim test, and figured there could be more
usecases to monitor/debug application performance.
The histogram is composed of 8 buckets in ns units. The last bucket is infinite
(inf), which captures everything beyond the previous boundary. To be more
flexible, the buckets can be reset and each bucket boundary is configurable at
runtime.
memory.pgfault_histogram: exports the histogram on a per-memcg basis and can
be reset by echoing "reset". Meantime, all the bucket boundaries are writable by
echoing the ranges into the API; see the example below.
/proc/sys/vm/pgfault_histogram: the global sysctl tunable can be used to turn
recording of the histogram on/off.
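The default bucket layout (600ns, doubling per bucket, last bucket open-ended) and the "sample lands in the first bucket whose boundary exceeds it" placement can be sketched outside the kernel like this — a Python illustration of the patch's logic, not kernel code:

```python
import bisect

# Default boundaries from the patch: 600ns, doubling per bucket,
# with the 8th bucket unbounded ("inf").
bounds = [600 * (1 << i) for i in range(7)]  # [600, 1200, ..., 38400]

def bucket_index(delta_ns):
    """Return the bucket a latency sample lands in: the first bucket
    whose upper boundary is strictly greater than the sample, or the
    open-ended 8th bucket (index 7) if no boundary is larger."""
    return bisect.bisect_right(bounds, delta_ns)
```

For example, a 599ns fault lands in the "< 600" bucket (index 0), while anything at or above 38400ns falls into "inf" (index 7) — the same placement the kernel-side binary search performs.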
Functional Test:
Create a memcg with a 10g hard limit, run dd in the background, and allocate 8g
of anon pages. Measure the anon page allocation latency.
$ mkdir /dev/cgroup/memory/B
$ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/B/tasks
$ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
$ allocate 8g anon pages
$ echo 1 >/proc/sys/vm/pgfault_histogram
$ cat /dev/cgroup/memory/B/memory.pgfault_histogram
pgfault latency histogram (ns):
< 600 2051273
< 1200 40859
< 2400 4004
< 4800 1605
< 9600 170
< 19200 82
< 38400 6
< inf 0
$ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
$ cat /dev/cgroup/memory/B/memory.pgfault_histogram
pgfault latency histogram (ns):
< 600 0
< 1200 0
< 2400 0
< 4800 0
< 9600 0
< 19200 0
< 38400 0
< inf 0
$ echo 500 520 540 580 600 1000 5000 >/dev/cgroup/memory/B/memory.pgfault_histogram
$ cat /dev/cgroup/memory/B/memory.pgfault_histogram
pgfault latency histogram (ns):
< 500 50
< 520 151
< 540 3715
< 580 1859812
< 600 202241
< 1000 25394
< 5000 5875
< inf 186
Performance Test:
I ran the PageFaultTest (pft) benchmark to measure the overhead of recording
the histogram. No overhead is observed on either "flt/cpu/s" or "fault/wsec".
$ mkdir /dev/cgroup/memory/A
$ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/memory/A/tasks
$ ./pft -m 15g -t 8 -T a
Result:
"fault/wsec"
$ ./ministat no_histogram histogram
x no_histogram
+ histogram
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 813404.51 824574.98 821661.3 820470.83 4202.0758
+ 5 821228.91 825894.66 822874.65 823374.15 1787.9355
"flt/cpu/s"
$ ./ministat no_histogram histogram
x no_histogram
+ histogram
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 104951.93 106173.13 105142.73 105349.2 513.78158
+ 5 104697.67 105416.1 104943.52 104973.77 269.24781
No difference proven at 95.0% confidence
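The "No difference proven at 95.0% confidence" verdict is ministat's pooled two-sample Student's t-test. A rough sketch of the same check, computed directly from the summary rows above (the 2.306 critical value is hardcoded for 8 degrees of freedom, i.e. two samples of 5):

```python
import math

def differs_at_95(n1, m1, s1, n2, m2, s2):
    """Pooled two-sample Student's t-test from summary statistics
    (count, mean, stddev), the same check ministat performs. Only
    valid as written for n1 + n2 - 2 == 8 degrees of freedom."""
    T_CRIT = 2.306  # Student's t, two-sided, 95% confidence, 8 d.o.f.
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    t = abs(m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t > T_CRIT

# "flt/cpu/s" rows from the table above: t ~= 1.45 < 2.306
print(differs_at_95(5, 105349.2, 513.78158, 5, 104973.77, 269.24781))  # → False
```

Both tables give t well below the critical value, which is why ministat reports no proven difference.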
Signed-off-by: Ying Han <yinghan@google.com>
---
arch/x86/mm/fault.c | 8 +++
include/linux/memcontrol.h | 8 +++
kernel/sysctl.c | 7 +++
mm/memcontrol.c | 128 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 151 insertions(+), 0 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 20e3f87..d7a1490 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -12,6 +12,7 @@
#include <linux/mmiotrace.h> /* kmmio_handler, ... */
#include <linux/perf_event.h> /* perf_sw_event */
#include <linux/hugetlb.h> /* hstate_index_to_shift */
+#include <linux/memcontrol.h>
#include <asm/traps.h> /* dotraplinkage, ... */
#include <asm/pgalloc.h> /* pgd_*(), ... */
@@ -966,6 +967,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
int write = error_code & PF_WRITE;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
(write ? FAULT_FLAG_WRITE : 0);
+ unsigned long long start, delta;
tsk = current;
mm = tsk->mm;
@@ -1125,6 +1127,7 @@ good_area:
return;
}
+ start = sched_clock();
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -1132,6 +1135,11 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, flags);
+ delta = sched_clock() - start;
+ if (unlikely((s64)delta < 0))
+ delta = 0;
+ memcg_histogram_record(current, delta);
+
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, error_code, address, fault);
return;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 29a945a..c7e6cb8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,6 +92,8 @@ struct mem_cgroup *mem_cgroup_get_shrink_target(void);
void mem_cgroup_put_shrink_target(struct mem_cgroup *mem);
wait_queue_head_t *mem_cgroup_kswapd_waitq(void);
+extern void memcg_histogram_record(struct task_struct *tsk, u64 delta);
+
static inline
int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
{
@@ -131,6 +133,8 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
extern int do_swap_account;
#endif
+extern unsigned int sysctl_pgfault_histogram;
+
static inline bool mem_cgroup_disabled(void)
{
if (mem_cgroup_subsys.disabled)
@@ -476,6 +480,10 @@ wait_queue_head_t *mem_cgroup_kswapd_waitq(void)
return NULL;
}
+static inline
+void memcg_histogram_record(struct task_struct *tsk, u64 delta)
+{
+}
#endif /* CONFIG_CGROUP_MEM_CONT */
#if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 927fc5a..0dd2939 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1132,6 +1132,13 @@ static struct ctl_table vm_table[] = {
.extra1 = &one,
.extra2 = &three,
},
+ {
+ .procname = "pgfault_histogram",
+ .data = &sysctl_pgfault_histogram,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0666,
+ .proc_handler = proc_dointvec,
+ },
#ifdef CONFIG_COMPACTION
{
.procname = "compact_memory",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a98471b..c795f96 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -51,6 +51,7 @@
#include "internal.h"
#include <linux/kthread.h>
#include <linux/freezer.h>
+#include <linux/ctype.h>
#include <asm/uaccess.h>
@@ -207,6 +208,13 @@ struct mem_cgroup_eventfd_list {
static void mem_cgroup_threshold(struct mem_cgroup *mem);
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
+#define MEMCG_NUM_HISTO_BUCKETS 8
+unsigned int sysctl_pgfault_histogram;
+
+struct memcg_histo {
+ u64 count[MEMCG_NUM_HISTO_BUCKETS];
+};
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -299,6 +307,9 @@ struct mem_cgroup {
* last node we reclaimed from
*/
int last_scanned_node;
+
+ struct memcg_histo *memcg_histo;
+ u64 memcg_histo_range[MEMCG_NUM_HISTO_BUCKETS];
};
/* Stuffs for move charges at task migration. */
@@ -4692,6 +4703,105 @@ static int __init memcg_kswapd_init(void)
}
module_init(memcg_kswapd_init);
+static int mem_cgroup_histogram_seq_read(struct cgroup *cgrp,
+ struct cftype *cft, struct seq_file *m)
+{
+ struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cgrp);
+ int i, cpu;
+
+ seq_printf(m, "pgfault latency histogram (ns):\n");
+
+ for (i = 0; i < MEMCG_NUM_HISTO_BUCKETS; i++) {
+ u64 sum = 0;
+
+ for_each_present_cpu(cpu) {
+ struct memcg_histo *histo;
+ histo = per_cpu_ptr(mem_cont->memcg_histo, cpu);
+ sum += histo->count[i];
+ }
+
+ if (i < MEMCG_NUM_HISTO_BUCKETS - 1)
+ seq_printf(m, "< %-15llu",
+ mem_cont->memcg_histo_range[i]);
+ else
+ seq_printf(m, "< %-15s", "inf");
+ seq_printf(m, "%llu\n", sum);
+ }
+
+ return 0;
+}
+
+static int mem_cgroup_histogram_seq_write(struct cgroup *cgrp,
+ struct cftype *cft, const char *buffer)
+{
+ int i;
+ u64 data[MEMCG_NUM_HISTO_BUCKETS];
+ char *end;
+ struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cgrp);
+
+ if (!memcmp(buffer, "reset", 5)) {
+ for_each_present_cpu(i) {
+ struct memcg_histo *histo;
+
+ histo = per_cpu_ptr(mem_cont->memcg_histo, i);
+ memset(histo, 0, sizeof(*histo));
+ }
+ goto out;
+ }
+
+ for (i = 0; i < MEMCG_NUM_HISTO_BUCKETS - 1; i++, buffer = end) {
+ while ((isspace(*buffer)))
+ buffer++;
+ data[i] = simple_strtoull(buffer, &end, 10);
+ }
+ data[i] = ULLONG_MAX;
+
+ for (i = 1; i < MEMCG_NUM_HISTO_BUCKETS; i++)
+ if (data[i] < data[i - 1])
+ return -EINVAL;
+
+ memcpy(mem_cont->memcg_histo_range, data, sizeof(data));
+ for_each_present_cpu(i) {
+ struct memcg_histo *histo;
+ histo = per_cpu_ptr(mem_cont->memcg_histo, i);
+ memset(histo, 0, sizeof(*histo));
+ }
+out:
+ return 0;
+}
+
+/*
+ * Record values into histogram buckets
+ */
+void memcg_histogram_record(struct task_struct *tsk, u64 delta)
+{
+ u64 *base;
+ int index, first, last;
+ struct memcg_histo *histo;
+ struct mem_cgroup *mem = mem_cgroup_from_task(tsk);
+
+ if (sysctl_pgfault_histogram == 0)
+ return;
+
+ first = 0;
+ last = MEMCG_NUM_HISTO_BUCKETS - 1;
+ base = mem->memcg_histo_range;
+
+ if (delta >= base[first]) {
+ while (first < last) {
+ index = (first + last) / 2;
+ if (delta >= base[index])
+ first = index + 1;
+ else
+ last = index;
+ }
+ }
+ index = first;
+
+ histo = per_cpu_ptr(mem->memcg_histo, smp_processor_id());
+ histo->count[index]++;
+}
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -4769,6 +4879,12 @@ static struct cftype mem_cgroup_files[] = {
.name = "reclaim_wmarks",
.read_map = mem_cgroup_wmark_read,
},
+ {
+ .name = "pgfault_histogram",
+ .read_seq_string = mem_cgroup_histogram_seq_read,
+ .write_string = mem_cgroup_histogram_seq_write,
+ .max_write_len = 256,
+ },
};
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4903,6 +5019,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
free_mem_cgroup_per_zone_info(mem, node);
free_percpu(mem->stat);
+ free_percpu(mem->memcg_histo);
if (sizeof(struct mem_cgroup) < PAGE_SIZE)
kfree(mem);
else
@@ -5014,6 +5131,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
struct mem_cgroup *mem, *parent;
long error = -ENOMEM;
int node;
+ int i;
mem = mem_cgroup_alloc();
if (!mem)
@@ -5068,6 +5186,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mutex_init(&mem->thresholds_lock);
init_waitqueue_head(&mem->memcg_kswapd_end);
INIT_LIST_HEAD(&mem->memcg_kswapd_wait_list);
+
+ mem->memcg_histo = alloc_percpu(typeof(*mem->memcg_histo));
+ if (!mem->memcg_histo)
+ goto free_out;
+
+
+ for (i = 0; i < MEMCG_NUM_HISTO_BUCKETS - 1; i++)
+ mem->memcg_histo_range[i] = (1 << i) * 600ULL;
+ mem->memcg_histo_range[i] = ULLONG_MAX;
+
return &mem->css;
free_out:
__mem_cgroup_free(mem);
--
1.7.3.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-26 21:07 [PATCH] memcg: add pgfault latency histograms Ying Han
@ 2011-05-27 0:05 ` KAMEZAWA Hiroyuki
2011-05-27 0:23 ` Ying Han
2011-05-27 8:04 ` Balbir Singh
1 sibling, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-27 0:05 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Thu, 26 May 2011 14:07:49 -0700
Ying Han <yinghan@google.com> wrote:
> This adds histogram to capture pagefault latencies on per-memcg basis. I used
> this patch on the memcg background reclaim test, and figured there could be more
> usecases to monitor/debug application performance.
>
> The histogram is composed 8 bucket in ns unit. The last one is infinite (inf)
> which is everything beyond the last one. To be more flexible, the buckets can
> be reset and also each bucket is configurable at runtime.
>
> memory.pgfault_histogram: exports the histogram on per-memcg basis and also can
> be reset by echoing "reset". Meantime, all the buckets are writable by echoing
> the range into the API. see the example below.
>
> /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to turn
> on/off recording the histogram.
>
> Functional Test:
> Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> Measure the anon page allocation latency.
>
> $ mkdir /dev/cgroup/memory/B
> $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> $ echo $$ >/dev/cgroup/memory/B/tasks
> $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> $ allocate 8g anon pages
>
> $ echo 1 >/proc/sys/vm/pgfault_histogram
>
> $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> pgfault latency histogram (ns):
> < 600 2051273
> < 1200 40859
> < 2400 4004
> < 4800 1605
> < 9600 170
> < 19200 82
> < 38400 6
> < inf 0
>
> $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> pgfault latency histogram (ns):
> < 600 0
> < 1200 0
> < 2400 0
> < 4800 0
> < 9600 0
> < 19200 0
> < 38400 0
> < inf 0
>
> $ echo 500 520 540 580 600 1000 5000 >/dev/cgroup/memory/B/memory.pgfault_histogram
> $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> pgfault latency histogram (ns):
> < 500 50
> < 520 151
> < 540 3715
> < 580 1859812
> < 600 202241
> < 1000 25394
> < 5000 5875
> < inf 186
>
> Performance Test:
> I ran through the PageFaultTest (pft) benchmark to measure the overhead of
> recording the histogram. There is no overhead observed on both "flt/cpu/s"
> and "fault/wsec".
>
> $ mkdir /dev/cgroup/memory/A
> $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> $ echo $$ >/dev/cgroup/memory/A/tasks
> $ ./pft -m 15g -t 8 -T a
>
> Result:
> "fault/wsec"
>
> $ ./ministat no_histogram histogram
> x no_histogram
> + histogram
> +--------------------------------------------------------------------------+
> N Min Max Median Avg Stddev
> x 5 813404.51 824574.98 821661.3 820470.83 4202.0758
> + 5 821228.91 825894.66 822874.65 823374.15 1787.9355
>
> "flt/cpu/s"
>
> $ ./ministat no_histogram histogram
> x no_histogram
> + histogram
> +--------------------------------------------------------------------------+
> N Min Max Median Avg Stddev
> x 5 104951.93 106173.13 105142.73 105349.2 513.78158
> + 5 104697.67 105416.1 104943.52 104973.77 269.24781
> No difference proven at 95.0% confidence
>
> Signed-off-by: Ying Han <yinghan@google.com>
Hmm, interesting... but isn't it a very complicated interface?
Could you make this work with 'perf'? Then everyone (including those who don't
use memcg) will be happy.
Thanks,
-Kame
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 0:05 ` KAMEZAWA Hiroyuki
@ 2011-05-27 0:23 ` Ying Han
2011-05-27 0:31 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 14+ messages in thread
From: Ying Han @ 2011-05-27 0:23 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 26 May 2011 14:07:49 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This adds histogram to capture pagefault latencies on per-memcg basis. I
> used
> > this patch on the memcg background reclaim test, and figured there could
> be more
> > usecases to monitor/debug application performance.
> >
> > The histogram is composed 8 bucket in ns unit. The last one is infinite
> (inf)
> > which is everything beyond the last one. To be more flexible, the buckets
> can
> > be reset and also each bucket is configurable at runtime.
> >
> > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> also can
> > be reset by echoing "reset". Meantime, all the buckets are writable by
> echoing
> > the range into the API. see the example below.
> >
> > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> turn
> > on/off recording the histogram.
> >
> > Functional Test:
> > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> > Measure the anon page allocation latency.
> >
> > $ mkdir /dev/cgroup/memory/B
> > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/B/tasks
> > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> > $ allocate 8g anon pages
> >
> > $ echo 1 >/proc/sys/vm/pgfault_histogram
> >
> > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > pgfault latency histogram (ns):
> > < 600 2051273
> > < 1200 40859
> > < 2400 4004
> > < 4800 1605
> > < 9600 170
> > < 19200 82
> > < 38400 6
> > < inf 0
> >
> > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > pgfault latency histogram (ns):
> > < 600 0
> > < 1200 0
> > < 2400 0
> > < 4800 0
> > < 9600 0
> > < 19200 0
> > < 38400 0
> > < inf 0
> >
> > $ echo 500 520 540 580 600 1000 5000
> >/dev/cgroup/memory/B/memory.pgfault_histogram
> > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > pgfault latency histogram (ns):
> > < 500 50
> > < 520 151
> > < 540 3715
> > < 580 1859812
> > < 600 202241
> > < 1000 25394
> > < 5000 5875
> > < inf 186
> >
> > Performance Test:
> > I ran through the PageFaultTest (pft) benchmark to measure the overhead
> of
> > recording the histogram. There is no overhead observed on both
> "flt/cpu/s"
> > and "fault/wsec".
> >
> > $ mkdir /dev/cgroup/memory/A
> > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/A/tasks
> > $ ./pft -m 15g -t 8 -T a
> >
> > Result:
> > "fault/wsec"
> >
> > $ ./ministat no_histogram histogram
> > x no_histogram
> > + histogram
> >
> +--------------------------------------------------------------------------+
> > N Min Max Median Avg
> Stddev
> > x 5 813404.51 824574.98 821661.3 820470.83
> 4202.0758
> > + 5 821228.91 825894.66 822874.65 823374.15
> 1787.9355
> >
> > "flt/cpu/s"
> >
> > $ ./ministat no_histogram histogram
> > x no_histogram
> > + histogram
> >
> +--------------------------------------------------------------------------+
> > N Min Max Median Avg
> Stddev
> > x 5 104951.93 106173.13 105142.73 105349.2
> 513.78158
> > + 5 104697.67 105416.1 104943.52 104973.77
> 269.24781
> > No difference proven at 95.0% confidence
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
>
> Hmm, interesting....but isn't it very very very complicated interface ?
> Could you make this for 'perf' ? Then, everyone (including someone who
> don't use memcg)
> will be happy.
>
Thank you for looking at it.
There is only one per-memcg API added, which basically exports the histogram.
The "reset" and bucket reconfiguration are not a "must" but make it more
flexible. Also, the sysctl knob can be dropped if necessary, since there is no
overhead observed from always leaving it on anyway.
I am not familiar with perf; any suggestions on how it is supposed to look?
Thanks
--Ying
> Thanks,
> -Kame
>
>
>
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 0:23 ` Ying Han
@ 2011-05-27 0:31 ` KAMEZAWA Hiroyuki
2011-05-27 1:40 ` Ying Han
0 siblings, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-27 0:31 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Thu, 26 May 2011 17:23:20 -0700
Ying Han <yinghan@google.com> wrote:
> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Thu, 26 May 2011 14:07:49 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
> > used
> > > this patch on the memcg background reclaim test, and figured there could
> > be more
> > > usecases to monitor/debug application performance.
> > >
> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
> > (inf)
> > > which is everything beyond the last one. To be more flexible, the buckets
> > can
> > > be reset and also each bucket is configurable at runtime.
> > >
> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> > also can
> > > be reset by echoing "reset". Meantime, all the buckets are writable by
> > echoing
> > > the range into the API. see the example below.
> > >
> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> > turn
> > > on/off recording the histogram.
> > >
> > > Functional Test:
> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> > > Measure the anon page allocation latency.
> > >
> > > $ mkdir /dev/cgroup/memory/B
> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> > > $ echo $$ >/dev/cgroup/memory/B/tasks
> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> > > $ allocate 8g anon pages
> > >
> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
> > >
> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > > pgfault latency histogram (ns):
> > > < 600 2051273
> > > < 1200 40859
> > > < 2400 4004
> > > < 4800 1605
> > > < 9600 170
> > > < 19200 82
> > > < 38400 6
> > > < inf 0
> > >
> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > > pgfault latency histogram (ns):
> > > < 600 0
> > > < 1200 0
> > > < 2400 0
> > > < 4800 0
> > > < 9600 0
> > > < 19200 0
> > > < 38400 0
> > > < inf 0
> > >
> > > $ echo 500 520 540 580 600 1000 5000
> > >/dev/cgroup/memory/B/memory.pgfault_histogram
> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > > pgfault latency histogram (ns):
> > > < 500 50
> > > < 520 151
> > > < 540 3715
> > > < 580 1859812
> > > < 600 202241
> > > < 1000 25394
> > > < 5000 5875
> > > < inf 186
> > >
> > > Performance Test:
> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
> > of
> > > recording the histogram. There is no overhead observed on both
> > "flt/cpu/s"
> > > and "fault/wsec".
> > >
> > > $ mkdir /dev/cgroup/memory/A
> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > > $ ./pft -m 15g -t 8 -T a
> > >
> > > Result:
> > > "fault/wsec"
> > >
> > > $ ./ministat no_histogram histogram
> > > x no_histogram
> > > + histogram
> > >
> > +--------------------------------------------------------------------------+
> > > N Min Max Median Avg
> > Stddev
> > > x 5 813404.51 824574.98 821661.3 820470.83
> > 4202.0758
> > > + 5 821228.91 825894.66 822874.65 823374.15
> > 1787.9355
> > >
> > > "flt/cpu/s"
> > >
> > > $ ./ministat no_histogram histogram
> > > x no_histogram
> > > + histogram
> > >
> > +--------------------------------------------------------------------------+
> > > N Min Max Median Avg
> > Stddev
> > > x 5 104951.93 106173.13 105142.73 105349.2
> > 513.78158
> > > + 5 104697.67 105416.1 104943.52 104973.77
> > 269.24781
> > > No difference proven at 95.0% confidence
> > >
> > > Signed-off-by: Ying Han <yinghan@google.com>
> >
> > Hmm, interesting....but isn't it very very very complicated interface ?
> > Could you make this for 'perf' ? Then, everyone (including someone who
> > don't use memcg)
> > will be happy.
> >
>
> Thank you for looking at it.
>
> There is only one per-memcg API added which is basically exporting the
> histogram. The "reset" and reconfiguring the bucket is not "must" but make
> it more flexible. Also, the sysfs API can be reduced if necessary since
> there is no over-head observed by always turning it on anyway.
>
> I am not familiar w/ perf, any suggestions how it is supposed to be look
> like?
>
> Thanks
>
IIUC, you can record "all" latency information with perf record, and the
latency information can then be dumped out to a file.
You could add a python? script for perf such as
# perf report memory-reclaim-latency-histogram -f perf.data
-o 500,1000,1500,2000.....
...which would show the histogram as text, or report it graphically.
The good points are
- you can reuse perf.data and show the histogram from another point of view.
- you can show another cut of the data; for example, I think you can easily
write a parser to show "changes in the histogram over time".
You may be able to generate a movie ;)
- Now, perf cgroup is supported. Then,
- you can see a per-task histogram
- you can see a per-cgroup histogram
- you can see a system-wide histogram
(if you record the latency of the usual kswapd/alloc_pages paths)
- If you record latency within shrink_zone(), you can show a per-zone
reclaim latency histogram. Record parsers can gather them and
show a histogram. This will be beneficial to cpuset users.
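The bucketed post-processing sketched above (dump per-fault latencies from perf.data, then histogram them with user-chosen boundaries) could look like this — a plain Python illustration over a made-up sample list, not an actual perf plugin:

```python
from bisect import bisect_right
from collections import Counter

def bucket_counts(samples_ns, bounds):
    """Bucket raw latency samples (as they might be dumped from
    perf.data) into user-supplied boundaries; the final bucket is
    open-ended ("inf"), like the patch's 8th bucket."""
    hits = Counter(bisect_right(bounds, s) for s in samples_ns)
    return [hits.get(i, 0) for i in range(len(bounds) + 1)]

def report(samples_ns, bounds):
    """Print in the same format as memory.pgfault_histogram."""
    print("pgfault latency histogram (ns):")
    labels = [str(b) for b in bounds] + ["inf"]
    for label, count in zip(labels, bucket_counts(samples_ns, bounds)):
        print("< %-15s%d" % (label, count))

# Made-up samples; boundaries as in the "-o 500,1000,1500,2000" example:
report([450, 700, 700, 5200], [500, 1000, 1500, 2000])
```

Because the raw samples are kept, the same data can be re-bucketed with different boundaries after the fact — the flexibility argued for above.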
I'm sorry if I missed something.
Thanks,
-Kame
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 0:31 ` KAMEZAWA Hiroyuki
@ 2011-05-27 1:40 ` Ying Han
2011-05-27 2:11 ` KAMEZAWA Hiroyuki
2011-05-28 10:17 ` Ingo Molnar
0 siblings, 2 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27 1:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 26 May 2011 17:23:20 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> > On Thu, 26 May 2011 14:07:49 -0700
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
>> > used
>> > > this patch on the memcg background reclaim test, and figured there could
>> > be more
>> > > usecases to monitor/debug application performance.
>> > >
>> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
>> > (inf)
>> > > which is everything beyond the last one. To be more flexible, the buckets
>> > can
>> > > be reset and also each bucket is configurable at runtime.
>> > >
>> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
>> > also can
>> > > be reset by echoing "reset". Meantime, all the buckets are writable by
>> > echoing
>> > > the range into the API. see the example below.
>> > >
>> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
>> > turn
>> > > on/off recording the histogram.
>> > >
>> > > Functional Test:
>> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
>> > > Measure the anon page allocation latency.
>> > >
>> > > $ mkdir /dev/cgroup/memory/B
>> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>> > > $ allocate 8g anon pages
>> > >
>> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>> > >
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 600 2051273
>> > > < 1200 40859
>> > > < 2400 4004
>> > > < 4800 1605
>> > > < 9600 170
>> > > < 19200 82
>> > > < 38400 6
>> > > < inf 0
>> > >
>> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 600 0
>> > > < 1200 0
>> > > < 2400 0
>> > > < 4800 0
>> > > < 9600 0
>> > > < 19200 0
>> > > < 38400 0
>> > > < inf 0
>> > >
>> > > $ echo 500 520 540 580 600 1000 5000
>> > >/dev/cgroup/memory/B/memory.pgfault_histogram
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 500 50
>> > > < 520 151
>> > > < 540 3715
>> > > < 580 1859812
>> > > < 600 202241
>> > > < 1000 25394
>> > > < 5000 5875
>> > > < inf 186
>> > >
>> > > Performance Test:
>> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
>> > of
>> > > recording the histogram. There is no overhead observed on both
>> > "flt/cpu/s"
>> > > and "fault/wsec".
>> > >
>> > > $ mkdir /dev/cgroup/memory/A
>> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> > > $ ./pft -m 15g -t 8 -T a
>> > >
>> > > Result:
>> > > "fault/wsec"
>> > >
>> > > $ ./ministat no_histogram histogram
>> > > x no_histogram
>> > > + histogram
>> > >
>> > +--------------------------------------------------------------------------+
>> > > N Min Max Median Avg
>> > Stddev
>> > > x 5 813404.51 824574.98 821661.3 820470.83
>> > 4202.0758
>> > > + 5 821228.91 825894.66 822874.65 823374.15
>> > 1787.9355
>> > >
>> > > "flt/cpu/s"
>> > >
>> > > $ ./ministat no_histogram histogram
>> > > x no_histogram
>> > > + histogram
>> > >
>> > +--------------------------------------------------------------------------+
>> > > N Min Max Median Avg
>> > Stddev
>> > > x 5 104951.93 106173.13 105142.73 105349.2
>> > 513.78158
>> > > + 5 104697.67 105416.1 104943.52 104973.77
>> > 269.24781
>> > > No difference proven at 95.0% confidence
>> > >
>> > > Signed-off-by: Ying Han <yinghan@google.com>
>> >
>> > Hmm, interesting....but isn't it very very very complicated interface ?
>> > Could you make this for 'perf' ? Then, everyone (including someone who
>> > don't use memcg)
>> > will be happy.
>> >
>>
>> Thank you for looking at it.
>>
>> There is only one per-memcg API added which is basically exporting the
>> histogram. The "reset" and reconfiguring the bucket is not "must" but make
>> it more flexible. Also, the sysfs API can be reduced if necessary since
>> there is no over-head observed by always turning it on anyway.
>>
>> I am not familiar w/ perf, any suggestions how it is supposed to be look
>> like?
>>
>> Thanks
>>
>
> IIUC, you can record "all" latency information by perf record. Then, latency
> information can be dumped out to some file.
>
> You can add a python? script for perf as
>
> # perf report memory-reclaim-latency-histogram -f perf.data
>         -o 500,1000,1500,2000.....
> ...show histogram in text.. or report the histogram in graphic.
>
> Good point is
>  - you can reuse perf.data and show the histogram from another point of view.
>
>  - you can show another cut of view; for example, I think you can write a
>    parser to show "changes in histogram by time", easily.
>    You may be able to generate a movie ;)
>
>  - Now, perf cgroup is supported. Then,
>    - you can see a per-task histogram
>    - you can see a per-cgroup histogram
>    - you can see a system-wide histogram
>      (if you record latency of usual kswapd/alloc_pages)
>
>  - If you record latency within shrink_zone(), you can show a per-zone
>    reclaim latency histogram. Record parsers can gather them and
>    show the histogram. This will be beneficial to cpuset users.
>
>
> I'm sorry if I miss something.
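The record-everything-then-bucket-at-report-time flow sketched above can be illustrated roughly as follows. This is a hedged sketch only: the function name, boundaries, and sample values are invented here for illustration, not taken from perf or from the patch.

```python
import bisect

def build_histogram(samples_ns, boundaries_ns):
    """Bucket raw latency samples at report time.

    counts[i] holds the samples strictly below boundaries_ns[i]; the
    final slot is the "inf" bucket for everything at or beyond the last
    boundary. Because the raw samples are kept around (as perf.data
    would be), the same data can later be re-histogrammed with different
    boundaries, e.g. -o 500,1000,1500,...
    """
    boundaries_ns = sorted(boundaries_ns)
    counts = [0] * (len(boundaries_ns) + 1)
    for s in samples_ns:
        # bisect_right returns the index of the first boundary strictly
        # greater than s, which is exactly the "< boundary" bucket.
        counts[bisect.bisect_right(boundaries_ns, s)] += 1
    return counts
```

The trade-off the reply below raises is visible here: every sample must be stored before bucketing, so buffer size grows with the number of recorded events.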
After studying perf a bit, it is not feasible in this case. The
cpu & memory overhead of perf is overwhelming: each page fault would
generate a record in the buffer, which raises the questions of how much
data we can hold in the buffer and how much data has to be processed
later. Most of the data recorded by the general perf framework is not
needed here.
On the other hand, the memory consumption of this patch is very small.
We only need to keep a counter per bucket, and recording can go on as
long as the machine is up. As measured, there is also no overhead in
the data collection :)
So perf is not an option for this purpose.
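For contrast, the fixed-counter approach described above (one counter per bucket, bumped at fault time) can be sketched as a user-space model. The class and method names are invented for illustration; this is not the kernel code itself.

```python
import bisect

class PgfaultHistogram:
    """Model of a per-memcg latency histogram: one counter per bucket.

    Memory use is fixed regardless of how long recording runs, and each
    fault costs only a bucket lookup plus a single increment.
    """

    def __init__(self, boundaries_ns):
        self.boundaries = sorted(boundaries_ns)
        self.counts = [0] * (len(self.boundaries) + 1)  # last slot is "inf"

    def record(self, latency_ns):
        # Counterpart of the per-fault hook: find the first boundary
        # strictly greater than latency_ns and bump that bucket.
        self.counts[bisect.bisect_right(self.boundaries, latency_ns)] += 1

    def reset(self):
        # Counterpart of `echo reset > memory.pgfault_histogram`.
        self.counts = [0] * len(self.counts)
```

Reconfiguring the buckets in the patch (echoing a new range into the file) corresponds to constructing a fresh instance with new boundaries, which implicitly zeroes the counts.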
--Ying
>
> Thanks,
> -Kame
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 1:40 ` Ying Han
@ 2011-05-27 2:11 ` KAMEZAWA Hiroyuki
2011-05-27 4:45 ` Ying Han
2011-05-28 10:17 ` Ingo Molnar
1 sibling, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-27 2:11 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Thu, 26 May 2011 18:40:44 -0700
Ying Han <yinghan@google.com> wrote:
> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 26 May 2011 17:23:20 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >>
> >> > On Thu, 26 May 2011 14:07:49 -0700
> >> > Ying Han <yinghan@google.com> wrote:
> >> >
> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
> >> > used
> >> > > this patch on the memcg background reclaim test, and figured there could
> >> > be more
> >> > > usecases to monitor/debug application performance.
> >> > >
> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
> >> > (inf)
> >> > > which is everything beyond the last one. To be more flexible, the buckets
> >> > can
> >> > > be reset and also each bucket is configurable at runtime.
> >> > >
> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> >> > also can
> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
> >> > echoing
> >> > > the range into the API. see the example below.
> >> > >
> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> >> > turn
> >> > > on/off recording the histogram.
> >> > >
> >> > > Functional Test:
> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> >> > > Measure the anon page allocation latency.
> >> > >
> >> > > $ mkdir /dev/cgroup/memory/B
> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> >> > > $ allocate 8g anon pages
> >> > >
> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
> >> > >
> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > pgfault latency histogram (ns):
> >> > > < 600        2051273
> >> > > < 1200       40859
> >> > > < 2400       4004
> >> > > < 4800       1605
> >> > > < 9600       170
> >> > > < 19200      82
> >> > > < 38400      6
> >> > > < inf        0
> >> > >
> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > pgfault latency histogram (ns):
> >> > > < 600        0
> >> > > < 1200       0
> >> > > < 2400       0
> >> > > < 4800       0
> >> > > < 9600       0
> >> > > < 19200      0
> >> > > < 38400      0
> >> > > < inf        0
> >> > >
> >> > > $ echo 500 520 540 580 600 1000 5000
> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> > > pgfault latency histogram (ns):
> >> > > < 500        50
> >> > > < 520        151
> >> > > < 540        3715
> >> > > < 580        1859812
> >> > > < 600        202241
> >> > > < 1000       25394
> >> > > < 5000       5875
> >> > > < inf        186
> >> > >
> >> > > Performance Test:
> >> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
> >> > of
> >> > > recording the histogram. There is no overhead observed on both
> >> > "flt/cpu/s"
> >> > > and "fault/wsec".
> >> > >
> >> > > $ mkdir /dev/cgroup/memory/A
> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
> >> > > $ ./pft -m 15g -t 8 -T a
> >> > >
> >> > > Result:
> >> > > "fault/wsec"
> >> > >
> >> > > $ ./ministat no_histogram histogram
> >> > > x no_histogram
> >> > > + histogram
> >> > >
> >> > +--------------------------------------------------------------------------+
> >> > >     N         Min         Max      Median         Avg
> >> > Stddev
> >> > > x   5   813404.51   824574.98    821661.3   820470.83
> >> > 4202.0758
> >> > > +   5   821228.91   825894.66   822874.65   823374.15
> >> > 1787.9355
> >> > >
> >> > > "flt/cpu/s"
> >> > >
> >> > > $ ./ministat no_histogram histogram
> >> > > x no_histogram
> >> > > + histogram
> >> > >
> >> > +--------------------------------------------------------------------------+
> >> > >     N         Min         Max      Median         Avg
> >> > Stddev
> >> > > x   5   104951.93   106173.13   105142.73    105349.2
> >> > 513.78158
> >> > > +   5   104697.67    105416.1   104943.52   104973.77
> >> > 269.24781
> >> > > No difference proven at 95.0% confidence
> >> > >
> >> > > Signed-off-by: Ying Han <yinghan@google.com>
> >> >
> >> > Hmm, interesting....but isn't it very very very complicated interface ?
> >> > Could you make this for 'perf' ? Then, everyone (including someone who
> >> > don't use memcg)
> >> > will be happy.
> >> >
> >>
> >> Thank you for looking at it.
> >>
> >> There is only one per-memcg API added which is basically exporting the
> >> histogram. The "reset" and reconfiguring the bucket is not "must" but make
> >> it more flexible. Also, the sysfs API can be reduced if necessary since
> >> there is no over-head observed by always turning it on anyway.
> >>
> >> I am not familiar w/ perf, any suggestions how it is supposed to be look
> >> like?
> >>
> >> Thanks
> >>
> >
> > IIUC, you can record "all" latency information by perf record. Then, latency
> > information can be dumped out to some file.
> >
> > You can add a python? script for perf as
> >
> >   # perf report memory-reclaim-latency-histogram -f perf.data
> >         -o 500,1000,1500,2000.....
> >   ...show histogram in text.. or report the histogram in graphic.
> >
> > Good point is
> >   - you can reuse perf.data and show the histogram from another point of view.
> >
> >   - you can show another cut of view; for example, I think you can write a
> >     parser to show "changes in histogram by time", easily.
> >     You may be able to generate a movie ;)
> >
> >   - Now, perf cgroup is supported. Then,
> >     - you can see a per-task histogram
> >     - you can see a per-cgroup histogram
> >     - you can see a system-wide histogram
> >       (if you record latency of usual kswapd/alloc_pages)
> >
> >   - If you record latency within shrink_zone(), you can show a per-zone
> >     reclaim latency histogram. Record parsers can gather them and
> >     show the histogram. This will be beneficial to cpuset users.
> >
> >
> > I'm sorry if I miss something.
>
> After study a bit on perf, it is not feasible in this casecase. The
> cpu & memory overhead of perf is overwhelming.... Each page fault will
> generate a record in the buffer and how many data we can record in the
> buffer, and how many data will be processed later.. Most of the data
> that is recorded by the general perf framework is not needed here.
>
I disagree. "each page fault" is not correct; "every lru scan" is correct.
Then, records to the buffer will happen at most memory.failcnt times.
Please consider more.
Thanks,
-Kame
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 2:11 ` KAMEZAWA Hiroyuki
@ 2011-05-27 4:45 ` Ying Han
2011-05-27 5:41 ` Ying Han
2011-05-27 8:33 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27 4:45 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Thu, May 26, 2011 at 7:11 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 26 May 2011 18:40:44 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 26 May 2011 17:23:20 -0700
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >>
>> >> > On Thu, 26 May 2011 14:07:49 -0700
>> >> > Ying Han <yinghan@google.com> wrote:
>> >> >
>> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
>> >> > used
>> >> > > this patch on the memcg background reclaim test, and figured there could
>> >> > be more
>> >> > > usecases to monitor/debug application performance.
>> >> > >
>> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
>> >> > (inf)
>> >> > > which is everything beyond the last one. To be more flexible, the buckets
>> >> > can
>> >> > > be reset and also each bucket is configurable at runtime.
>> >> > >
>> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
>> >> > also can
>> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
>> >> > echoing
>> >> > > the range into the API. see the example below.
>> >> > >
>> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
>> >> > turn
>> >> > > on/off recording the histogram.
>> >> > >
>> >> > > Functional Test:
>> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
>> >> > > Measure the anon page allocation latency.
>> >> > >
>> >> > > $ mkdir /dev/cgroup/memory/B
>> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>> >> > > $ allocate 8g anon pages
>> >> > >
>> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>> >> > >
>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > pgfault latency histogram (ns):
>> >> > > < 600 2051273
>> >> > > < 1200 40859
>> >> > > < 2400 4004
>> >> > > < 4800 1605
>> >> > > < 9600 170
>> >> > > < 19200 82
>> >> > > < 38400 6
>> >> > > < inf 0
>> >> > >
>> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > pgfault latency histogram (ns):
>> >> > > < 600 0
>> >> > > < 1200 0
>> >> > > < 2400 0
>> >> > > < 4800 0
>> >> > > < 9600 0
>> >> > > < 19200 0
>> >> > > < 38400 0
>> >> > > < inf 0
>> >> > >
>> >> > > $ echo 500 520 540 580 600 1000 5000
>> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> > > pgfault latency histogram (ns):
>> >> > > < 500 50
>> >> > > < 520 151
>> >> > > < 540 3715
>> >> > > < 580 1859812
>> >> > > < 600 202241
>> >> > > < 1000 25394
>> >> > > < 5000 5875
>> >> > > < inf 186
>> >> > >
>> >> > > Performance Test:
>> >> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
>> >> > of
>> >> > > recording the histogram. There is no overhead observed on both
>> >> > "flt/cpu/s"
>> >> > > and "fault/wsec".
>> >> > >
>> >> > > $ mkdir /dev/cgroup/memory/A
>> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> >> > > $ ./pft -m 15g -t 8 -T a
>> >> > >
>> >> > > Result:
>> >> > > "fault/wsec"
>> >> > >
>> >> > > $ ./ministat no_histogram histogram
>> >> > > x no_histogram
>> >> > > + histogram
>> >> > >
>> >> > +--------------------------------------------------------------------------+
>> >> > > N Min Max Median Avg
>> >> > Stddev
>> >> > > x 5 813404.51 824574.98 821661.3 820470.83
>> >> > 4202.0758
>> >> > > + 5 821228.91 825894.66 822874.65 823374.15
>> >> > 1787.9355
>> >> > >
>> >> > > "flt/cpu/s"
>> >> > >
>> >> > > $ ./ministat no_histogram histogram
>> >> > > x no_histogram
>> >> > > + histogram
>> >> > >
>> >> > +--------------------------------------------------------------------------+
>> >> > > N Min Max Median Avg
>> >> > Stddev
>> >> > > x 5 104951.93 106173.13 105142.73 105349.2
>> >> > 513.78158
>> >> > > + 5 104697.67 105416.1 104943.52 104973.77
>> >> > 269.24781
>> >> > > No difference proven at 95.0% confidence
>> >> > >
>> >> > > Signed-off-by: Ying Han <yinghan@google.com>
>> >> >
>> >> > Hmm, interesting....but isn't it very very very complicated interface ?
>> >> > Could you make this for 'perf' ? Then, everyone (including someone who
>> >> > don't use memcg)
>> >> > will be happy.
>> >> >
>> >>
>> >> Thank you for looking at it.
>> >>
>> >> There is only one per-memcg API added which is basically exporting the
>> >> histogram. The "reset" and reconfiguring the bucket is not "must" but make
>> >> it more flexible. Also, the sysfs API can be reduced if necessary since
>> >> there is no over-head observed by always turning it on anyway.
>> >>
>> >> I am not familiar w/ perf, any suggestions how it is supposed to be look
>> >> like?
>> >>
>> >> Thanks
>> >>
>> >
>> > IIUC, you can record "all" latency information by perf record. Then, latency
>> > information can be dumped out to some file.
>> >
>> > You can add a python? script for perf as
>> >
>> > # perf report memory-reclaim-latency-histgram -f perf.data
>> > -o 500,1000,1500,2000.....
>> > ...show histgram in text.. or report the histgram in graphic.
>> >
>> > Good point is
>> > - you can reuse perf.data and show histgram from another point of view.
>> >
>> > - you can show another cut of view, for example, I think you can write a
>> > parser to show "changes in hisgram by time", easily.
>> > You may able to generate a movie ;)
>> >
>> > - Now, perf cgroup is supported. Then,
>> > - you can see per task histgram
>> > - you can see per cgroup histgram
>> > - you can see per system-wide histgram
>> > (If you record latency of usual kswapd/alloc_pages)
>> >
>> > - If you record latency within shrink_zone(), you can show per-zone
>> > reclaim latency histgram. record parsers can gather them and
>> > show histgram. This will be benefical to cpuset users.
>> >
>> >
>> > I'm sorry if I miss something.
>>
>> After study a bit on perf, it is not feasible in this casecase. The
>> cpu & memory overhead of perf is overwhelming.... Each page fault will
>> generate a record in the buffer and how many data we can record in the
>> buffer, and how many data will be processed later.. Most of the data
>> that is recorded by the general perf framework is not needed here.
>>
>
> I disagree. "each page fault" is not correct. "every lru scan" is correct.
> Then, record to buffer will be at most memory.failcnt times.
Hmm. Sorry, I might be missing something here... :(
The page fault histogram recorded is per page-fault, only the ones that
trigger reclaim. The background reclaim testing is just one use case of
it, and we need this information for more general usage to monitor
application performance. So I recorded the latency for each single page
fault.
--Ying
>
> please consider more.
>
>
> Thanks,
> -Kame
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 4:45 ` Ying Han
@ 2011-05-27 5:41 ` Ying Han
2011-05-27 8:33 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27 5:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Thu, May 26, 2011 at 9:45 PM, Ying Han <yinghan@google.com> wrote:
> On Thu, May 26, 2011 at 7:11 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> On Thu, 26 May 2011 18:40:44 -0700
>> Ying Han <yinghan@google.com> wrote:
>>
>>> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
>>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> > On Thu, 26 May 2011 17:23:20 -0700
>>> > Ying Han <yinghan@google.com> wrote:
>>> >
>>> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>>> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> >>
>>> >> > On Thu, 26 May 2011 14:07:49 -0700
>>> >> > Ying Han <yinghan@google.com> wrote:
>>> >> >
>>> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
>>> >> > used
>>> >> > > this patch on the memcg background reclaim test, and figured there could
>>> >> > be more
>>> >> > > usecases to monitor/debug application performance.
>>> >> > >
>>> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
>>> >> > (inf)
>>> >> > > which is everything beyond the last one. To be more flexible, the buckets
>>> >> > can
>>> >> > > be reset and also each bucket is configurable at runtime.
>>> >> > >
>>> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
>>> >> > also can
>>> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
>>> >> > echoing
>>> >> > > the range into the API. see the example below.
>>> >> > >
>>> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
>>> >> > turn
>>> >> > > on/off recording the histogram.
>>> >> > >
>>> >> > > Functional Test:
>>> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
>>> >> > > Measure the anon page allocation latency.
>>> >> > >
>>> >> > > $ mkdir /dev/cgroup/memory/B
>>> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>>> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>>> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>>> >> > > $ allocate 8g anon pages
>>> >> > >
>>> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>>> >> > >
>>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > pgfault latency histogram (ns):
>>> >> > > < 600 2051273
>>> >> > > < 1200 40859
>>> >> > > < 2400 4004
>>> >> > > < 4800 1605
>>> >> > > < 9600 170
>>> >> > > < 19200 82
>>> >> > > < 38400 6
>>> >> > > < inf 0
>>> >> > >
>>> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > pgfault latency histogram (ns):
>>> >> > > < 600 0
>>> >> > > < 1200 0
>>> >> > > < 2400 0
>>> >> > > < 4800 0
>>> >> > > < 9600 0
>>> >> > > < 19200 0
>>> >> > > < 38400 0
>>> >> > > < inf 0
>>> >> > >
>>> >> > > $ echo 500 520 540 580 600 1000 5000
>>> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>>> >> > > pgfault latency histogram (ns):
>>> >> > > < 500 50
>>> >> > > < 520 151
>>> >> > > < 540 3715
>>> >> > > < 580 1859812
>>> >> > > < 600 202241
>>> >> > > < 1000 25394
>>> >> > > < 5000 5875
>>> >> > > < inf 186
>>> >> > >
>>> >> > > Performance Test:
>>> >> > > I ran through the PageFaultTest (pft) benchmark to measure the overhead
>>> >> > of
>>> >> > > recording the histogram. There is no overhead observed on both
>>> >> > "flt/cpu/s"
>>> >> > > and "fault/wsec".
>>> >> > >
>>> >> > > $ mkdir /dev/cgroup/memory/A
>>> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>>> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>>> >> > > $ ./pft -m 15g -t 8 -T a
>>> >> > >
>>> >> > > Result:
>>> >> > > "fault/wsec"
>>> >> > >
>>> >> > > $ ./ministat no_histogram histogram
>>> >> > > x no_histogram
>>> >> > > + histogram
>>> >> > >
>>> >> > +--------------------------------------------------------------------------+
>>> >> > > N Min Max Median Avg
>>> >> > Stddev
>>> >> > > x 5 813404.51 824574.98 821661.3 820470.83
>>> >> > 4202.0758
>>> >> > > + 5 821228.91 825894.66 822874.65 823374.15
>>> >> > 1787.9355
>>> >> > >
>>> >> > > "flt/cpu/s"
>>> >> > >
>>> >> > > $ ./ministat no_histogram histogram
>>> >> > > x no_histogram
>>> >> > > + histogram
>>> >> > >
>>> >> > +--------------------------------------------------------------------------+
>>> >> > > N Min Max Median Avg
>>> >> > Stddev
>>> >> > > x 5 104951.93 106173.13 105142.73 105349.2
>>> >> > 513.78158
>>> >> > > + 5 104697.67 105416.1 104943.52 104973.77
>>> >> > 269.24781
>>> >> > > No difference proven at 95.0% confidence
>>> >> > >
>>> >> > > Signed-off-by: Ying Han <yinghan@google.com>
>>> >> >
>>> >> > Hmm, interesting....but isn't it very very very complicated interface ?
>>> >> > Could you make this for 'perf' ? Then, everyone (including someone who
>>> >> > don't use memcg)
>>> >> > will be happy.
>>> >> >
>>> >>
>>> >> Thank you for looking at it.
>>> >>
>>> >> There is only one per-memcg API added which is basically exporting the
>>> >> histogram. The "reset" and reconfiguring the bucket is not "must" but make
>>> >> it more flexible. Also, the sysfs API can be reduced if necessary since
>>> >> there is no over-head observed by always turning it on anyway.
>>> >>
>>> >> I am not familiar w/ perf, any suggestions how it is supposed to be look
>>> >> like?
>>> >>
>>> >> Thanks
>>> >>
>>> >
>>> > IIUC, you can record "all" latency information by perf record. Then, latency
>>> > information can be dumped out to some file.
>>> >
>>> > You can add a python? script for perf as
>>> >
>>> > # perf report memory-reclaim-latency-histgram -f perf.data
>>> > -o 500,1000,1500,2000.....
>>> > ...show histgram in text.. or report the histgram in graphic.
>>> >
>>> > Good point is
>>> > - you can reuse perf.data and show histgram from another point of view.
>>> >
>>> > - you can show another cut of view, for example, I think you can write a
>>> > parser to show "changes in hisgram by time", easily.
>>> > You may able to generate a movie ;)
>>> >
>>> > - Now, perf cgroup is supported. Then,
>>> > - you can see per task histgram
>>> > - you can see per cgroup histgram
>>> > - you can see per system-wide histgram
>>> > (If you record latency of usual kswapd/alloc_pages)
>>> >
>>> > - If you record latency within shrink_zone(), you can show per-zone
>>> > reclaim latency histgram. record parsers can gather them and
>>> > show histgram. This will be benefical to cpuset users.
>>> >
>>> >
>>> > I'm sorry if I miss something.
>>>
>>> After study a bit on perf, it is not feasible in this casecase. The
>>> cpu & memory overhead of perf is overwhelming.... Each page fault will
>>> generate a record in the buffer and how many data we can record in the
>>> buffer, and how many data will be processed later.. Most of the data
>>> that is recorded by the general perf framework is not needed here.
>>>
>>
>> I disagree. "each page fault" is not correct. "every lru scan" is correct.
>> Then, record to buffer will be at most memory.failcnt times.
>
> Hmm. Sorry I might miss something here... :(
>
> The page fault histogram recorded is per page-fault, only the ones
> trigger reclaim.
Typo: I meant it records per page-fault, not only the ones triggering
reclaim.
--Ying
> The background reclaim testing is just one usecase of
> it, and we need this information for more
> general usage to monitor application performance. So i recorded the
> latency for each single page fault.
>
> --Ying
>
>>
>> please consider more.
>>
>>
>> Thanks,
>> -Kame
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-26 21:07 [PATCH] memcg: add pgfault latency histograms Ying Han
2011-05-27 0:05 ` KAMEZAWA Hiroyuki
@ 2011-05-27 8:04 ` Balbir Singh
2011-05-27 16:27 ` Ying Han
1 sibling, 1 reply; 14+ messages in thread
From: Balbir Singh @ 2011-05-27 8:04 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
* Ying Han <yinghan@google.com> [2011-05-26 14:07:49]:
> This adds histogram to capture pagefault latencies on per-memcg basis. I used
> this patch on the memcg background reclaim test, and figured there could be more
> usecases to monitor/debug application performance.
>
> The histogram is composed 8 bucket in ns unit. The last one is infinite (inf)
> which is everything beyond the last one. To be more flexible, the buckets can
> be reset and also each bucket is configurable at runtime.
>
inf is a bit confusing for page faults -- no? Why not call it "rest"
or something like "> 38400"? BTW, why was 600 used as the base?
> memory.pgfault_histogram: exports the histogram on per-memcg basis and also can
> be reset by echoing "reset". Meantime, all the buckets are writable by echoing
> the range into the API. see the example below.
>
> /proc/sys/vm/pgfault_histogram: the global sysfs tunable can be used to turn
> on/off recording the histogram.
>
Why not make this per memcg?
> Functional Test:
> Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> Measure the anon page allocation latency.
>
> $ mkdir /dev/cgroup/memory/B
> $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> $ echo $$ >/dev/cgroup/memory/B/tasks
> $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> $ allocate 8g anon pages
>
> $ echo 1 >/proc/sys/vm/pgfault_histogram
>
> $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> pgfault latency histogram (ns):
> < 600 2051273
> < 1200 40859
> < 2400 4004
> < 4800 1605
> < 9600 170
> < 19200 82
> < 38400 6
> < inf 0
>
> $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
Can't we use something like "-1" to mean reset?
--
Three Cheers,
Balbir
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 4:45 ` Ying Han
2011-05-27 5:41 ` Ying Han
@ 2011-05-27 8:33 ` KAMEZAWA Hiroyuki
2011-05-27 18:46 ` Ying Han
1 sibling, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-27 8:33 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Thu, 26 May 2011 21:45:28 -0700
Ying Han <yinghan@google.com> wrote:
> On Thu, May 26, 2011 at 7:11 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 26 May 2011 18:40:44 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > On Thu, 26 May 2011 17:23:20 -0700
> >> > Ying Han <yinghan@google.com> wrote:
> >> >
> >> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
> >> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> >>
> >> >> > On Thu, 26 May 2011 14:07:49 -0700
> >> >> > Ying Han <yinghan@google.com> wrote:
> >> >> >
> >> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
> >> >> > used
> >> >> > > this patch on the memcg background reclaim test, and figured there could
> >> >> > be more
> >> >> > > usecases to monitor/debug application performance.
> >> >> > >
> >> >> > > The histogram is composed 8 bucket in ns unit. The last one is infinite
> >> >> > (inf)
> >> >> > > which is everything beyond the last one. To be more flexible, the buckets
> >> >> > can
> >> >> > > be reset and also each bucket is configurable at runtime.
> >> >> > >
> >> >> > > memory.pgfault_histogram: exports the histogram on per-memcg basis and
> >> >> > also can
> >> >> > > be reset by echoing "reset". Meantime, all the buckets are writable by
> >> >> > echoing
> >> >> > > the range into the API. see the example below.
> >> >> > >
> >> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunablecan be used to
> >> >> > turn
> >> >> > > on/off recording the histogram.
> >> >> > >
> >> >> > > Functional Test:
> >> >> > > Create a memcg with 10g hard_limit, running dd & allocate 8g anon page.
> >> >> > > Measure the anon page allocation latency.
> >> >> > >
> >> >> > > $ mkdir /dev/cgroup/memory/B
> >> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> >> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
> >> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> >> >> > > $ allocate 8g anon pages
> >> >> > >
> >> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
> >> >> > >
> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > pgfault latency histogram (ns):
> >> >> > > < 600        2051273
> >> >> > > < 1200       40859
> >> >> > > < 2400       4004
> >> >> > > < 4800       1605
> >> >> > > < 9600       170
> >> >> > > < 19200      82
> >> >> > > < 38400      6
> >> >> > > < inf        0
> >> >> > >
> >> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > pgfault latency histogram (ns):
> >> >> > > < 600        0
> >> >> > > < 1200       0
> >> >> > > < 2400       0
> >> >> > > < 4800       0
> >> >> > > < 9600       0
> >> >> > > < 19200      0
> >> >> > > < 38400      0
> >> >> > > < inf        0
> >> >> > >
> >> >> > > $ echo 500 520 540 580 600 1000 5000
> >> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> >> >> > > pgfault latency histogram (ns):
> >> >> > > < 500        50
> >> >> > > < 520        151
> >> >> > > < 540        3715
> >> >> > > < 580        1859812
> >> >> > > < 600        202241
> >> >> > > < 1000       25394
> >> >> > > < 5000       5875
> >> >> > > < inf        186
> >> >> > >
> >> >> > > Performance Test:
> >> >> > > I ran the PageFaultTest (pft) benchmark to measure the overhead of
> >> >> > > recording the histogram. No overhead is observed on either "flt/cpu/s"
> >> >> > > or "fault/wsec".
> >> >> > >
> >> >> > > $ mkdir /dev/cgroup/memory/A
> >> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
> >> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
> >> >> > > $ ./pft -m 15g -t 8 -T a
> >> >> > >
> >> >> > > Result:
> >> >> > > "fault/wsec"
> >> >> > >
> >> >> > > $ ./ministat no_histogram histogram
> >> >> > > x no_histogram
> >> >> > > + histogram
> >> >> > >
> >> >> > +--------------------------------------------------------------------------+
> >> >> > >     N           Min           Max        Median           Avg        Stddev
> >> >> > > x   5     813404.51     824574.98      821661.3     820470.83     4202.0758
> >> >> > > +   5     821228.91     825894.66     822874.65     823374.15     1787.9355
> >> >> > >
> >> >> > > "flt/cpu/s"
> >> >> > >
> >> >> > > $ ./ministat no_histogram histogram
> >> >> > > x no_histogram
> >> >> > > + histogram
> >> >> > >
> >> >> > +--------------------------------------------------------------------------+
> >> >> > >     N           Min           Max        Median           Avg        Stddev
> >> >> > > x   5     104951.93     106173.13     105142.73      105349.2     513.78158
> >> >> > > +   5     104697.67      105416.1     104943.52     104973.77     269.24781
> >> >> > > No difference proven at 95.0% confidence
> >> >> > >
> >> >> > > Signed-off-by: Ying Han <yinghan@google.com>
> >> >> >
> >> >> > Hmm, interesting... but isn't this a very complicated interface?
> >> >> > Could you make this work with 'perf'? Then everyone (including those
> >> >> > who don't use memcg) will be happy.
> >> >> >
> >> >>
> >> >> Thank you for looking at it.
> >> >>
> >> >> There is only one per-memcg API added, which basically exports the
> >> >> histogram. The "reset" and bucket-reconfiguration features are not a
> >> >> "must" but make it more flexible. Also, the sysfs API can be dropped if
> >> >> necessary, since no overhead is observed with it always turned on anyway.
> >> >>
> >> >> I am not familiar with perf; any suggestions on what it should look
> >> >> like?
> >> >>
> >> >> Thanks
> >> >>
> >> >
> >> > IIUC, you can record "all" latency information with perf record. Then, the
> >> > latency information can be dumped out to a file.
> >> >
> >> > You can add a python? script for perf as
> >> >
> >> >   # perf report memory-reclaim-latency-histogram -f perf.data
> >> >         -o 500,1000,1500,2000.....
> >> >   ...show the histogram in text, or report the histogram graphically.
> >> >
> >> > Good points are
> >> >   - you can reuse perf.data and show the histogram from another point of view.
> >> >
> >> >   - you can show another cut of the data; for example, I think you can write a
> >> >     parser to show "changes in the histogram over time" easily.
> >> >     You may be able to generate a movie ;)
> >> >
> >> >   - Now, perf cgroup is supported. Then,
> >> >     - you can see a per-task histogram
> >> >     - you can see a per-cgroup histogram
> >> >     - you can see a system-wide histogram
> >> >       (if you record the latency of usual kswapd/alloc_pages)
> >> >
> >> >   - If you record latency within shrink_zone(), you can show a per-zone
> >> >     reclaim latency histogram. Record parsers can gather them and
> >> >     show the histogram. This will be beneficial to cpuset users.
> >> >
> >> >
> >> > I'm sorry if I miss something.
> >>
> >> After studying perf a bit, it is not feasible in this case. The
> >> cpu & memory overhead of perf is overwhelming: each page fault would
> >> generate a record in the buffer, and consider how much data we could
> >> record in the buffer and how much would have to be processed later.
> >> Most of the data recorded by the general perf framework is not needed here.
> >>
> >
> > I disagree. "each page fault" is not correct; "every lru scan" is correct.
> > Then, records to the buffer will number at most memory.failcnt.
>
> Hmm. Sorry I might miss something here... :(
>
> The page fault histogram records every page fault, not only the ones
> that trigger reclaim. The background reclaim testing is just one use case;
> we need this information for more general monitoring of application
> performance. So I recorded the latency of every single page fault.
>
BTW, why page faults only? For some applications, the file cache is more important.
I think the usual page fault's cost is not of interest;
you can get PGPGIN statistics from other sources.
Anyway, I think it's better to record reclaim latency.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 8:04 ` Balbir Singh
@ 2011-05-27 16:27 ` Ying Han
0 siblings, 0 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27 16:27 UTC (permalink / raw)
To: Balbir Singh
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, May 27, 2011 at 1:04 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Ying Han <yinghan@google.com> [2011-05-26 14:07:49]:
>
> > This adds histogram to capture pagefault latencies on per-memcg basis. I
> used
> > this patch on the memcg background reclaim test, and figured there could
> be more
> > usecases to monitor/debug application performance.
> >
> > The histogram is composed of 8 buckets in ns units. The last one is
> > infinite (inf), which captures everything beyond the previous bound. To be
> > more flexible, the buckets can be reset and each bucket is configurable
> > at runtime.
> >
>
> inf is a bit confusing for page faults -- no? Why not call it "rest"
> or something like "> 38400".
Ok, I can change that to "rest".
> BTW, why was 600 used as base?
>
Well, that is based on some of my experiments. I am doing anon page allocation,
and most page faults fall into the 580 ns - 600 ns bucket, so I
just left that as the default.
However, the buckets are configurable, and users can change them based on
their workload and platform.
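For illustration, the bucket lookup being discussed can be sketched in C. This is not the patch's actual code; the names and the fixed-size array are assumptions, but it shows why recording is cheap: one bounds walk plus one counter increment per fault.

```c
#include <assert.h>

/* Default bucket upper bounds in ns, matching the histogram shown earlier.
 * The 8th bucket ("inf"/"rest") catches everything beyond the last bound. */
#define NR_BUCKETS 8

static unsigned long long bucket_bounds[NR_BUCKETS - 1] = {
	600, 1200, 2400, 4800, 9600, 19200, 38400
};

/* Map a fault latency (ns) to its bucket index. */
static int pgfault_bucket(unsigned long long latency_ns)
{
	int i;

	for (i = 0; i < NR_BUCKETS - 1; i++)
		if (latency_ns < bucket_bounds[i])
			return i;
	return NR_BUCKETS - 1;	/* the "inf" bucket */
}

/* Per-memcg recording is then a single counter increment. */
static void pgfault_record(unsigned long long counts[NR_BUCKETS],
			   unsigned long long latency_ns)
{
	counts[pgfault_bucket(latency_ns)]++;
}
```

In this sketch, echoing a new range into memory.pgfault_histogram would amount to rewriting bucket_bounds, and "reset" to zeroing the counts array.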
>
> > memory.pgfault_histogram: exports the histogram on a per-memcg basis
> > and can be reset by echoing "reset". Meanwhile, all the buckets are
> > writable by echoing the ranges into the API. See the example below.
> >
> > /proc/sys/vm/pgfault_histogram: the global sysfs tunable can be used
> > to turn on/off recording the histogram.
> >
>
> Why not make this per memcg?
>
That can be done.
>
> > Functional Test:
> > Create a memcg with a 10g hard limit, run dd, and allocate 8g of anon pages.
> > Measure the anon page allocation latency.
> >
> > $ mkdir /dev/cgroup/memory/B
> > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/B/tasks
> > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
> > $ allocate 8g anon pages
> >
> > $ echo 1 >/proc/sys/vm/pgfault_histogram
> >
> > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
> > pgfault latency histogram (ns):
> > < 600 2051273
> > < 1200 40859
> > < 2400 4004
> > < 4800 1605
> > < 9600 170
> > < 19200 82
> > < 38400 6
> > < inf 0
> >
> > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>
> Can't we use something like "-1" to mean reset?
>
Sounds good to me.
Thank you for reviewing.
--Ying
>
> --
> Three Cheers,
> Balbir
>
>
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 8:33 ` KAMEZAWA Hiroyuki
@ 2011-05-27 18:46 ` Ying Han
0 siblings, 0 replies; 14+ messages in thread
From: Ying Han @ 2011-05-27 18:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, May 27, 2011 at 1:33 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 26 May 2011 21:45:28 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Thu, May 26, 2011 at 7:11 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Thu, 26 May 2011 18:40:44 -0700
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> >> On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki
>> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >> > On Thu, 26 May 2011 17:23:20 -0700
>> >> > Ying Han <yinghan@google.com> wrote:
>> >> >
>> >> >> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>> >> >> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >> >>
>> >> >> > On Thu, 26 May 2011 14:07:49 -0700
>> >> >> > Ying Han <yinghan@google.com> wrote:
>> >> >> >
>> >> >> > > This adds histogram to capture pagefault latencies on per-memcg basis. I
>> >> >> > used
>> >> >> > > this patch on the memcg background reclaim test, and figured there could
>> >> >> > be more
>> >> >> > > usecases to monitor/debug application performance.
>> >> >> > >
>> >> >> > > The histogram is composed of 8 buckets in ns units. The last one is
>> >> >> > > infinite (inf), which captures everything beyond the previous bound.
>> >> >> > > To be more flexible, the buckets can be reset and each bucket is
>> >> >> > > configurable at runtime.
>> >> >> > >
>> >> >> > > memory.pgfault_histogram: exports the histogram on a per-memcg basis
>> >> >> > > and can be reset by echoing "reset". Meanwhile, all the buckets are
>> >> >> > > writable by echoing the ranges into the API. See the example below.
>> >> >> > >
>> >> >> > > /proc/sys/vm/pgfault_histogram: the global sysfs tunable can be used
>> >> >> > > to turn on/off recording the histogram.
>> >> >> > >
>> >> >> > > Functional Test:
>> >> >> > > Create a memcg with a 10g hard limit, run dd, and allocate 8g of anon pages.
>> >> >> > > Measure the anon page allocation latency.
>> >> >> > >
>> >> >> > > $ mkdir /dev/cgroup/memory/B
>> >> >> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>> >> >> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>> >> >> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>> >> >> > > $ allocate 8g anon pages
>> >> >> > >
>> >> >> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>> >> >> > >
>> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > pgfault latency histogram (ns):
>> >> >> > > < 600 2051273
>> >> >> > > < 1200 40859
>> >> >> > > < 2400 4004
>> >> >> > > < 4800 1605
>> >> >> > > < 9600 170
>> >> >> > > < 19200 82
>> >> >> > > < 38400 6
>> >> >> > > < inf 0
>> >> >> > >
>> >> >> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > pgfault latency histogram (ns):
>> >> >> > > < 600 0
>> >> >> > > < 1200 0
>> >> >> > > < 2400 0
>> >> >> > > < 4800 0
>> >> >> > > < 9600 0
>> >> >> > > < 19200 0
>> >> >> > > < 38400 0
>> >> >> > > < inf 0
>> >> >> > >
>> >> >> > > $ echo 500 520 540 580 600 1000 5000
>> >> >> > >/dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> >> >> > > pgfault latency histogram (ns):
>> >> >> > > < 500 50
>> >> >> > > < 520 151
>> >> >> > > < 540 3715
>> >> >> > > < 580 1859812
>> >> >> > > < 600 202241
>> >> >> > > < 1000 25394
>> >> >> > > < 5000 5875
>> >> >> > > < inf 186
>> >> >> > >
>> >> >> > > Performance Test:
>> >> >> > > I ran the PageFaultTest (pft) benchmark to measure the overhead of
>> >> >> > > recording the histogram. No overhead is observed on either "flt/cpu/s"
>> >> >> > > or "fault/wsec".
>> >> >> > >
>> >> >> > > $ mkdir /dev/cgroup/memory/A
>> >> >> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>> >> >> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> >> >> > > $ ./pft -m 15g -t 8 -T a
>> >> >> > >
>> >> >> > > Result:
>> >> >> > > "fault/wsec"
>> >> >> > >
>> >> >> > > $ ./ministat no_histogram histogram
>> >> >> > > x no_histogram
>> >> >> > > + histogram
>> >> >> > >
>> >> >> > +--------------------------------------------------------------------------+
>> >> >> > >     N           Min           Max        Median           Avg        Stddev
>> >> >> > > x   5     813404.51     824574.98      821661.3     820470.83     4202.0758
>> >> >> > > +   5     821228.91     825894.66     822874.65     823374.15     1787.9355
>> >> >> > >
>> >> >> > > "flt/cpu/s"
>> >> >> > >
>> >> >> > > $ ./ministat no_histogram histogram
>> >> >> > > x no_histogram
>> >> >> > > + histogram
>> >> >> > >
>> >> >> > +--------------------------------------------------------------------------+
>> >> >> > >     N           Min           Max        Median           Avg        Stddev
>> >> >> > > x   5     104951.93     106173.13     105142.73      105349.2     513.78158
>> >> >> > > +   5     104697.67      105416.1     104943.52     104973.77     269.24781
>> >> >> > > No difference proven at 95.0% confidence
>> >> >> > >
>> >> >> > > Signed-off-by: Ying Han <yinghan@google.com>
>> >> >> >
>> >> >> > Hmm, interesting... but isn't this a very complicated interface?
>> >> >> > Could you make this work with 'perf'? Then everyone (including those
>> >> >> > who don't use memcg) will be happy.
>> >> >> >
>> >> >>
>> >> >> Thank you for looking at it.
>> >> >>
>> >> >> There is only one per-memcg API added, which basically exports the
>> >> >> histogram. The "reset" and bucket-reconfiguration features are not a
>> >> >> "must" but make it more flexible. Also, the sysfs API can be dropped if
>> >> >> necessary, since no overhead is observed with it always turned on anyway.
>> >> >>
>> >> >> I am not familiar with perf; any suggestions on what it should look
>> >> >> like?
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >
>> >> > IIUC, you can record "all" latency information with perf record. Then, the
>> >> > latency information can be dumped out to a file.
>> >> >
>> >> > You can add a python? script for perf as
>> >> >
>> >> >   # perf report memory-reclaim-latency-histogram -f perf.data
>> >> >         -o 500,1000,1500,2000.....
>> >> >   ...show the histogram in text, or report the histogram graphically.
>> >> >
>> >> > Good points are
>> >> >   - you can reuse perf.data and show the histogram from another point of view.
>> >> >
>> >> >   - you can show another cut of the data; for example, I think you can write a
>> >> >     parser to show "changes in the histogram over time" easily.
>> >> >     You may be able to generate a movie ;)
>> >> >
>> >> >   - Now, perf cgroup is supported. Then,
>> >> >     - you can see a per-task histogram
>> >> >     - you can see a per-cgroup histogram
>> >> >     - you can see a system-wide histogram
>> >> >       (if you record the latency of usual kswapd/alloc_pages)
>> >> >
>> >> >   - If you record latency within shrink_zone(), you can show a per-zone
>> >> >     reclaim latency histogram. Record parsers can gather them and
>> >> >     show the histogram. This will be beneficial to cpuset users.
>> >> >
>> >> >
>> >> > I'm sorry if I miss something.
>> >>
>> >> After studying perf a bit, it is not feasible in this case. The
>> >> cpu & memory overhead of perf is overwhelming: each page fault would
>> >> generate a record in the buffer, and consider how much data we could
>> >> record in the buffer and how much would have to be processed later.
>> >> Most of the data recorded by the general perf framework is not needed here.
>> >>
>> >
>> > I disagree. "each page fault" is not correct; "every lru scan" is correct.
>> > Then, records to the buffer will number at most memory.failcnt.
>>
>> Hmm. Sorry I might miss something here... :(
>>
>> The page fault histogram records every page fault, not only the ones
>> that trigger reclaim. The background reclaim testing is just one use case;
>> we need this information for more general monitoring of application
>> performance. So I recorded the latency of every single page fault.
>>
>
> BTW, why page faults only? For some applications, the file cache is more important.
> I think the usual page fault's cost is not of interest;
> you can get PGPGIN statistics from other sources.
>
> Anyway, I think it's better to record reclaim latency.
Sounds reasonable. I will add that in the next post.
Thanks for reviewing
--Ying
>
>
> Thanks,
> -Kame
>
>
>
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-27 1:40 ` Ying Han
2011-05-27 2:11 ` KAMEZAWA Hiroyuki
@ 2011-05-28 10:17 ` Ingo Molnar
2011-05-31 16:51 ` Ying Han
1 sibling, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2011-05-28 10:17 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
Dave Hansen, Zhu Yanhai, linux-mm, Peter Zijlstra,
Frédéric Weisbecker, Arnaldo Carvalho de Melo,
Tom Zanussi
* Ying Han <yinghan@google.com> wrote:
> After studying perf a bit, it is not feasible in this case. The
> cpu & memory overhead of perf is overwhelming: each page fault
> would generate a record in the buffer, and consider how much data
> we could record in the buffer and how much would have to be
> processed later. Most of the data recorded by the general perf
> framework is not needed here.
>
>
> On the other hand, the memory consumption of this patch is very
> low. We only need to keep a counter per bucket, and the recording
> can go on as long as the machine is up. As measured, there is no
> overhead in the data collection :)
>
> So, perf is not an option for this purpose.
It's not a fundamental limitation in perf though.
The way i always thought perf could be extended to support heavy-duty
profiling such as your patch does would be along the following lines:
Right now perf supports three output methods:
'full detail': per sample records, recorded in the ring-buffer
'filtered full detail': per sample records filtered, recorded in the ring-buffer
'full summary': the count of all samples (simple counter), no recording
What i think would make sense is to introduce a fourth variant, which
is a natural intermediate of the above output methods:
'partial summary': partially summarized samples, record in an
array in the ring-buffer - an extended
multi-dimensional 'count'.
A histogram like yours would be one (small) sub-case of this new
model.
Now, to keep things maximally flexible we really do not want to hard
code histogram summary functions: i.e. we do not want to hardcode
ourselves to 'latency histograms' or 'frequency histograms'.
To achieve that flexibility we could define the histogram function as
a simple extension to filters: filters that evaluate to an integer
value.
For example, if we defined the following tracepoint in
arch/x86/mm/fault.c:
TRACE_EVENT(mm_pagefault,
TP_PROTO(u64 time_start, u64 time_end, unsigned long address, int error_code, unsigned long ip),
TP_ARGS(time_start, time_end, address, error_code, ip),
TP_STRUCT__entry(
__field(u64, time_start)
__field(u64, time_end)
__field(unsigned long, address)
__field(unsigned long, error_code)
__field(unsigned long, ip)
),
TP_fast_assign(
__entry->time_start = time_start;
__entry->time_end = time_end;
__entry->address = address;
__entry->error_code = error_code;
__entry->ip = ip;
),
TP_printk("time_start=%uL time_end=%uL address=%lx, error code=%lx, ip=%lx",
__entry->time_start, __entry->time_end,
__entry->address, __entry->error_code, __entry->ip)
Then the following filter expressions could be used to calculate the
histogram index and value:
index: "(time_end - time_start)/1000"
iterator: "curr + 1"
The /1000 index filter expression means that there is one separate
bucket per microsecond of delay.
The "curr + 1" iterator filter expression would represent that for
every bucket an event means we add +1 to the current bucket value.
Today our filter expressions evaluate to a small subset of integer
numbers: 0 or 1 :-)
Extending them to integer calculations is possible and would be
desirable for other purposes as well, not just histograms. Adding
integer operators in addition to the logical and bitwise operators
the filter engine supports today would be useful as well. (See
kernel/trace/trace_events_filter.c for the current filter engine.)
This way we would have the equivalent functionality and performance
of your histogram patch - and it would also open up many, *many*
other nice possibilities as well:
- this could be used with any event, anywhere - could even be used
with hardware events. We could sample with an NMI every 100 usecs
and profile with relatively small profiling overhead.
- arbitrarily large histograms could be created: need a 10 GB
histogram on a really large system? No problem, create such
a big ring-buffer.
- many different types of summaries are possible as well:
- we could create a histogram over *which* code pagefaults, via
using the "ip" (faulting instruction) address and a
sufficiently large ring-buffer.
- histogram over the address space (which vmas are the hottest ones),
by changing the first filter to "address/1000000" to have per
megabyte buckets.
- weighted histograms: for example if the histogram iteration
function is "curr + (time_end-time_start)/1000" and the
histogram index is "address/1000000", then we get an
address-indexed histogram weighted by length of latency: the
higher latencies a given area of memory causes, the hotter the
bucket.
- the existing event filter code can be used to filter the incoming
events to begin with: for example an "error_code = 1" filter would
limit the histogram to write faults (page dirtying).
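A rough illustration of this "partial summary" idea in plain C, with callbacks standing in for the proposed filter-engine expressions (all names here are hypothetical, not an existing perf API):

```c
#include <assert.h>

/* Event fields, mirroring the mm_pagefault tracepoint above. */
struct mm_fault_event {
	unsigned long long time_start, time_end;
	unsigned long address;
	int error_code;
	unsigned long ip;
};

/* Both the bucket index and the per-bucket update are integer expressions
 * over the event; here they are C functions instead of filter strings. */
typedef long long (*expr_fn)(const struct mm_fault_event *e, long long curr);

/* index: "(time_end - time_start)/1000" -- one bucket per microsecond */
static long long latency_index(const struct mm_fault_event *e, long long curr)
{
	(void)curr;
	return (long long)(e->time_end - e->time_start) / 1000;
}

/* iterator: "curr + 1" -- a plain event count per bucket */
static long long count_iter(const struct mm_fault_event *e, long long curr)
{
	(void)e;
	return curr + 1;
}

/* weighted iterator: "curr + (time_end - time_start)/1000" */
static long long weight_iter(const struct mm_fault_event *e, long long curr)
{
	return curr + (long long)(e->time_end - e->time_start) / 1000;
}

/* One histogram update: evaluate the index expression, clamp it into the
 * array, then let the iterator expression produce the new bucket value. */
static void hist_add(long long *buckets, long long nr_buckets,
		     expr_fn index, expr_fn iter,
		     const struct mm_fault_event *e)
{
	long long i = index(e, 0);

	if (i < 0)
		i = 0;
	if (i >= nr_buckets)
		i = nr_buckets - 1;	/* overflow lands in the last bucket */
	buckets[i] = iter(e, buckets[i]);
}
```

Swapping latency_index for an "address/1000000" expression would give the per-vma view, and weight_iter the latency-weighted variant described above.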
So instead of adding just one hardcoded histogram type, it would be
really nice to work on a more generic solution!
Thanks,
Ingo
* Re: [PATCH] memcg: add pgfault latency histograms
2011-05-28 10:17 ` Ingo Molnar
@ 2011-05-31 16:51 ` Ying Han
0 siblings, 0 replies; 14+ messages in thread
From: Ying Han @ 2011-05-31 16:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Minchan Kim,
Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov,
Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
Dave Hansen, Zhu Yanhai, linux-mm, Peter Zijlstra,
Frédéric Weisbecker, Arnaldo Carvalho de Melo,
Tom Zanussi
On Sat, May 28, 2011 at 3:17 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ying Han <yinghan@google.com> wrote:
>
>> After studying perf a bit, it is not feasible in this case. The
>> cpu & memory overhead of perf is overwhelming: each page fault
>> would generate a record in the buffer, and consider how much data
>> we could record in the buffer and how much would have to be
>> processed later. Most of the data recorded by the general perf
>> framework is not needed here.
>>
>>
>> On the other hand, the memory consumption of this patch is very
>> low. We only need to keep a counter per bucket, and the recording
>> can go on as long as the machine is up. As measured, there is no
>> overhead in the data collection :)
>>
>> So, perf is not an option for this purpose.
>
> It's not a fundamental limitation in perf though.
>
> The way i always thought perf could be extended to support heavy-duty
> profiling such as your patch does would be along the following lines:
>
> Right now perf supports three output methods:
>
> 'full detail': per sample records, recorded in the ring-buffer
> 'filtered full detail': per sample records filtered, recorded in the ring-buffer
> 'full summary': the count of all samples (simple counter), no recording
>
> What i think would make sense is to introduce a fourth variant, which
> is a natural intermediate of the above output methods:
>
> 'partial summary': partially summarized samples, record in an
> array in the ring-buffer - an extended
> multi-dimensional 'count'.
>
> A histogram like yours would be one (small) sub-case of this new
> model.
>
> Now, to keep things maximally flexible we really do not want to hard
> code histogram summary functions: i.e. we do not want to hardcode
> ourselves to 'latency histograms' or 'frequency histograms'.
>
> To achieve that flexibility we could define the histogram function as
> a simple extension to filters: filters that evaluate to an integer
> value.
>
> For example, if we defined the following tracepoint in
> arch/x86/mm/fault.c:
>
> TRACE_EVENT(mm_pagefault,
>
> TP_PROTO(u64 time_start, u64 time_end, unsigned long address, int error_code, unsigned long ip),
>
> TP_ARGS(time_start, time_end, address, error_code, ip),
>
> TP_STRUCT__entry(
> __field(u64, time_start)
> __field(u64, time_end)
> __field(unsigned long, address)
> __field(unsigned long, error_code)
> __field(unsigned long, ip)
> ),
>
> TP_fast_assign(
> __entry->time_start = time_start;
> __entry->time_end = time_end;
> __entry->address = address;
> __entry->error_code = error_code;
> __entry->ip = ip;
> ),
>
> TP_printk("time_start=%uL time_end=%uL address=%lx, error code=%lx, ip=%lx",
> __entry->time_start, __entry->time_end,
> __entry->address, __entry->error_code, __entry->ip)
>
>
> Then the following filter expressions could be used to calculate the
> histogram index and value:
>
> index: "(time_end - time_start)/1000"
> iterator: "curr + 1"
>
> The /1000 index filter expression means that there is one separate
> bucket per microsecond of delay.
>
> The "curr + 1" iterator filter expression would represent that for
> every bucket an event means we add +1 to the current bucket value.
>
> Today our filter expressions evaluate to a small subset of integer
> numbers: 0 or 1 :-)
>
> Extending them to integer calculations is possible and would be
> desirable for other purposes as well, not just histograms. Adding
> integer operators in addition to the logical and bitwise operators
> the filter engine supports today would be useful as well. (See
> kernel/trace/trace_events_filter.c for the current filter engine.)
>
> This way we would have the equivalent functionality and performance
> of your histogram patch - and it would also open up many, *many*
> other nice possibilities as well:
>
> - this could be used with any event, anywhere - could even be used
> with hardware events. We could sample with an NMI every 100 usecs
> and profile with relatively small profiling overhead.
>
> - arbitrarily large histograms could be created: need a 10 GB
> histogram on a really large system? No problem, create such
> a big ring-buffer.
>
> - many different types of summaries are possible as well:
>
> - we could create a histogram over *which* code pagefaults, via
> using the "ip" (faulting instruction) address and a
> sufficiently large ring-buffer.
>
> - histogram over the address space (which vmas are the hottest ones),
> by changing the first filter to "address/1000000" to have per
> megabyte buckets.
>
> - weighted histograms: for example if the histogram iteration
> function is "curr + (time_end-time_start)/1000" and the
> histogram index is "address/1000000", then we get an
> address-indexed histogram weighted by length of latency: the
> higher latencies a given area of memory causes, the hotter the
> bucket.
>
> - the existing event filter code can be used to filter the incoming
> events to begin with: for example an "error_code = 1" filter would
> limit the histogram to write faults (page dirtying).
>
> So instead of adding just one hardcoded histogram type, it would be
> really nice to work on a more generic solution!
>
> Thanks,
>
> Ingo
Hi Ingo,
Thank you for the detailed information.
This patch is used to evaluate the memcg reclaim patch, and I have
gotten some interesting results. I will post the next version of the
patch, which makes a couple of improvements based on the comments in
this thread. Meantime, I will study your suggestion more :)
Thanks
--Ying
>
end of thread, other threads:[~2011-05-31 16:51 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-26 21:07 [PATCH] memcg: add pgfault latency histograms Ying Han
2011-05-27 0:05 ` KAMEZAWA Hiroyuki
2011-05-27 0:23 ` Ying Han
2011-05-27 0:31 ` KAMEZAWA Hiroyuki
2011-05-27 1:40 ` Ying Han
2011-05-27 2:11 ` KAMEZAWA Hiroyuki
2011-05-27 4:45 ` Ying Han
2011-05-27 5:41 ` Ying Han
2011-05-27 8:33 ` KAMEZAWA Hiroyuki
2011-05-27 18:46 ` Ying Han
2011-05-28 10:17 ` Ingo Molnar
2011-05-31 16:51 ` Ying Han
2011-05-27 8:04 ` Balbir Singh
2011-05-27 16:27 ` Ying Han