* [PATCH 0/6] Add some trace events for the page allocator v6
From: Mel Gorman @ 2009-08-10 15:41 UTC
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

This is V6 of a patchset to add some tracepoints of interest when analysing
the page allocator. The only changes since the last revision were to fix a
minor error in the post-processing script and to add a reviewed-by to one
of the patches.

Can we get a yea/nay on whether these should be merged or not?

Changelog since V4
  o Drop the order parameter from mm_pagevec_free() as it's implicitly 0
  o Drop the cpu= information from PCPU events as the CPU printed is
    incorrect and the information is already available
  o Pass down the minimum amount of information for the fallback and
    zone_locked events, deriving the additional information in TP_printk
  o Pass down the minimum amount of information during fallback and figure
    out the additional information in the post-processing script
  o Make the post-processing script more robust against format changes.
    This could be significantly more robust and construct a regex on
    the fly but it makes more sense to leave this as a POC with the
    view to integrating properly with perf once the important information
    has been identified
  o Exit the script after multiple signals without waiting for further input

Changelog since V3
  o Drop call_site information from trace events
  o Use struct page * instead of void * in trace events
  o Add CPU information to the per-cpu tracepoints information
  o Improve performance of the offline-processing script so it can run online
  o Add support for interrupting processing script to dump what it has
  o Add support for stripping pids, getting additional information from
    proc and adding information on the parent process
  o Improve layout of output of post processing script for use with sort
  o Add documentation on performance analysis using tracepoints
  o Add documentation on the kmem tracepoints in particular

Changelog since V2
  o Add Acked-by's from Rik
  o Only call trace_mm_page_free_direct when page count reaches zero
  o Rebase to 2.6.31-rc5

Changelog since V1
  o Fix minor formatting error for the __rmqueue event
  o Add event for __pagevec_free
  o Bring naming more in line with Larry Woodman's tracing patch
  o Add an example post-processing script for the trace events

The following patches add some trace events for the page allocator under
the heading of kmem, along with a post-processing script and documentation.

	Patch 1 adds events for plain old allocation and freeing of pages
	Patch 2 gives information useful for analysing fragmentation avoidance
	Patch 3 tracks pages going to and from the buddy lists as an indirect
		indication of zone lock hotness
	Patch 4 adds a post-processing script that aggregates the events to
		give a higher-level view
	Patch 5 adds documentation on analysis using tracepoints
	Patch 6 adds documentation on the kmem tracepoints in particular

The first set of events can be used as an indicator of whether the workload
was heavily dependent on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for page allocation may be __get_free_pages() instead
of a useful callsite. Instead of passing down a return address similar to
slab debugging, the user should enable the stacktrace and sym-addr options
to get a proper stack trace.
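
For reference, a minimal sketch of enabling the events and options through
debugfs (assuming debugfs is mounted at /sys/kernel/debug; nothing here is
specific to this series beyond the kmem event names):

	# Enable all events under the kmem heading
	echo 1 > /sys/kernel/debug/tracing/events/kmem/enable
	# Get proper stack traces instead of relying on call_site
	echo stacktrace > /sys/kernel/debug/tracing/trace_options
	echo sym-addr > /sys/kernel/debug/tracing/trace_options
	# Watch events as they arrive
	cat /sys/kernel/debug/tracing/trace_pipe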

The second patch is mainly of use to users of hugepages, and particularly
dynamic hugepage pool resizing, as it could be used to tune min_free_kbytes
to a level where fragmentation is rarely a problem. My main concern is
that maybe I'm trying to jam too much into the TP_printk that could be
extrapolated after the fact by someone familiar with the implementation. I
couldn't decide whether it was best to hold the administrator's hand even
if it costs more to figure the information out.

The third patch is trickier to draw conclusions from but high activity on
those events could explain why there were a large number of cache misses
on a page-allocator-intensive workload. The coalescing and splitting of
buddies involves a lot of writing of page metadata and cache line bounces
not to mention the acquisition of an interrupt-safe lock necessary to enter
this path.

The fourth patch parses the trace buffer to draw a higher-level picture of
what is going on broken down on a per-process basis.

The last two patches add documentation.

 Documentation/trace/events-kmem.txt                |  107 +++++
 .../postprocess/trace-pagealloc-postprocess.pl     |  418 ++++++++++++++++++++
 Documentation/trace/tracepoint-analysis.txt        |  327 +++++++++++++++
 include/trace/events/kmem.h                        |  163 ++++++++
 mm/page_alloc.c                                    |   13 +-
 5 files changed, 1027 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
 create mode 100644 Documentation/trace/tracepoint-analysis.txt

* [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
From: Mel Gorman @ 2009-08-10 15:41 UTC
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

This patch adds trace events for the allocation and freeing of pages,
including the freeing of pagevecs.  Using the events, it will be known
which struct pages and pfns are being allocated and freed, and in many
cases what the call site was.

The page alloc tracepoints can be used as an indicator of whether the workload
was heavily dependent on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for page allocation may be __get_free_pages() instead
of a useful callsite. Instead of passing down a return address similar to
slab debugging, the user should enable the stacktrace and sym-addr options
to get a proper stack trace.

The pagevec free tracepoint has a different use case. It can be used to get
an idea of how many pages are being dumped off the LRU and whether it is
kswapd doing the work or a process doing direct reclaim.
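
For illustration, a sketch of enabling just these three events (assuming
the usual debugfs mount point; the sample line is made up but follows the
mm_page_alloc TP_printk format below):

	for e in mm_page_alloc mm_page_free_direct mm_pagevec_free; do
		echo 1 > /sys/kernel/debug/tracing/events/kmem/$e/enable
	done
	cat /sys/kernel/debug/tracing/trace_pipe
	# e.g. mm_page_alloc: page=ffffea0000a0d000 pfn=41229 order=0
	#      migratetype=0 gfp_flags=GFP_KERNEL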

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   74 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    7 +++-
 2 files changed, 80 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 1493c54..0d358a0 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -225,6 +225,80 @@ TRACE_EVENT(kmem_cache_free,
 
 	TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
 );
+
+TRACE_EVENT(mm_page_free_direct,
+
+	TP_PROTO(struct page *page, unsigned int order),
+
+	TP_ARGS(page, order),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d",
+			__entry->page,
+			page_to_pfn(__entry->page),
+			__entry->order)
+);
+
+TRACE_EVENT(mm_pagevec_free,
+
+	TP_PROTO(struct page *page, int cold),
+
+	TP_ARGS(page, cold),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	int,		cold		)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->cold		= cold;
+	),
+
+	TP_printk("page=%p pfn=%lu order=0 cold=%d",
+			__entry->page,
+			page_to_pfn(__entry->page),
+			__entry->cold)
+);
+
+TRACE_EVENT(mm_page_alloc,
+
+	TP_PROTO(struct page *page, unsigned int order,
+			gfp_t gfp_flags, int migratetype),
+
+	TP_ARGS(page, order, gfp_flags, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+		__field(	gfp_t,		gfp_flags	)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->gfp_flags	= gfp_flags;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		__entry->migratetype,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d052abb..440c439 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1062,6 +1062,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 
 void free_hot_page(struct page *page)
 {
+	trace_mm_page_free_direct(page, 0);
 	free_hot_cold_page(page, 0);
 }
 	
@@ -1905,6 +1906,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
 
+	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -1945,13 +1947,16 @@ void __pagevec_free(struct pagevec *pvec)
 {
 	int i = pagevec_count(pvec);
 
-	while (--i >= 0)
+	while (--i >= 0) {
+		trace_mm_pagevec_free(pvec->pages[i], pvec->cold);
 		free_hot_cold_page(pvec->pages[i], pvec->cold);
+	}
 }
 
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
+		trace_mm_page_free_direct(page, order);
 		if (order == 0)
 			free_hot_page(page);
 		else
-- 
1.6.3.3


* [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes
From: Mel Gorman @ 2009-08-10 15:41 UTC
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

Fragmentation avoidance depends on being able to use free pages from
lists of the appropriate migrate type. In the event this is not
possible, __rmqueue_fallback() selects a different list and in some
circumstances changes the migratetype of the pageblock. Simplistically,
the more times this event occurs, the more likely it is that fragmentation
will be a problem later, for hugepage allocation at least, but there are
other considerations such as the order of the page being split to satisfy
the allocation.

This patch adds a trace event for __rmqueue_fallback() that reports what
page is being used for the fallback, the orders of relevant pages, the
desired migratetype and the migratetype of the lists being used, whether
the pageblock changed type and whether this event is important with
respect to fragmentation avoidance or not. This information can be used
to help analyse fragmentation avoidance and help decide whether
min_free_kbytes should be increased or not.
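
For illustration, a sketch of enabling this event on its own (assuming the
usual debugfs mount point; the sample line is made up but is consistent
with the TP_printk below, which derives the last two fields):

	echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable
	cat /sys/kernel/debug/tracing/trace_pipe
	# e.g. alloc_order=0 fallback_order=3 pageblock_order=9
	#      alloc_migratetype=2 fallback_migratetype=0
	#      fragmenting=1 change_ownership=0
	# where fragmenting      = fallback_order < pageblock_order
	# and   change_ownership = (alloc_migratetype == fallback_migratetype)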

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   38 ++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    4 ++++
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 0d358a0..aae16ee 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -299,6 +299,44 @@ TRACE_EVENT(mm_page_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
+TRACE_EVENT(mm_page_alloc_extfrag,
+
+	TP_PROTO(struct page *page,
+			int alloc_order, int fallback_order,
+			int alloc_migratetype, int fallback_migratetype),
+
+	TP_ARGS(page,
+		alloc_order, fallback_order,
+		alloc_migratetype, fallback_migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page			)
+		__field(	int,		alloc_order		)
+		__field(	int,		fallback_order		)
+		__field(	int,		alloc_migratetype	)
+		__field(	int,		fallback_migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page			= page;
+		__entry->alloc_order		= alloc_order;
+		__entry->fallback_order		= fallback_order;
+		__entry->alloc_migratetype	= alloc_migratetype;
+		__entry->fallback_migratetype	= fallback_migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->alloc_order,
+		__entry->fallback_order,
+		pageblock_order,
+		__entry->alloc_migratetype,
+		__entry->fallback_migratetype,
+		__entry->fallback_order < pageblock_order,
+		__entry->alloc_migratetype == __entry->fallback_migratetype)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 440c439..4c20bfa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -839,6 +839,10 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 							start_migratetype);
 
 			expand(zone, page, order, current_order, area, migratetype);
+
+			trace_mm_page_alloc_extfrag(page, order, current_order,
+				start_migratetype, migratetype);
+
 			return page;
 		}
 	}
-- 
1.6.3.3


* [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
From: Mel Gorman @ 2009-08-10 15:41 UTC
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

The page allocation trace event reports that a page was successfully allocated
but it does not specify where it came from. When analysing performance,
it can be important to distinguish between pages coming from the per-cpu
allocator and pages coming from the buddy lists as the latter requires the
zone lock to be taken and more data structures to be examined.

This patch adds a trace event for __rmqueue reporting when a page is being
allocated from the buddy lists. It distinguishes between calls made to
refill the per-cpu lists and calls made for high-order allocations.
Similarly, this patch adds an event to catch when the PCP lists are being
drained a little and pages are going back to the buddy lists.

This is trickier to draw conclusions from but high activity on those
events could explain why there were a large number of cache misses on a
page-allocator-intensive workload. The coalescing and splitting of buddies
involves a lot of writing of page metadata and cache line bounces not to
mention the acquisition of an interrupt-safe lock necessary to enter this
path.
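
For illustration, a sketch of enabling the two events together (assuming
the usual debugfs mount point). In the zone-locked event, percpu_refill=1
simply means order == 0, i.e. __rmqueue was refilling the per-cpu lists
rather than satisfying a high-order allocation:

	echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_zone_locked/enable
	echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_pcpu_drain/enable
	cat /sys/kernel/debug/tracing/trace_pipe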

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   51 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    2 +
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index aae16ee..eaf46bd 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -299,6 +299,57 @@ TRACE_EVENT(mm_page_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
+TRACE_EVENT(mm_page_alloc_zone_locked,
+
+	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%u migratetype=%d percpu_refill=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		__entry->migratetype,
+		__entry->order == 0)
+);
+
+TRACE_EVENT(mm_page_pcpu_drain,
+
+	TP_PROTO(struct page *page, int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	int,		order		)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d migratetype=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		__entry->migratetype)
+);
+
 TRACE_EVENT(mm_page_alloc_extfrag,
 
 	TP_PROTO(struct page *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c20bfa..bbd7de8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -535,6 +535,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
+		trace_mm_page_pcpu_drain(page, order, page_private(page));
 		__free_one_page(page, zone, order, page_private(page));
 	}
 	spin_unlock(&zone->lock);
@@ -876,6 +877,7 @@ retry_reserve:
 		}
 	}
 
+	trace_mm_page_alloc_zone_locked(page, order, migratetype);
 	return page;
 }
 
-- 
1.6.3.3


* [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
From: Mel Gorman @ 2009-08-10 15:41 UTC
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like

Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                under lock     direct  pagevec      drain
swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0

Optionally, the output can include information on the parent, or aggregate
based on process name instead of aggregating based on each pid. Example output
including parent information and with the PID stripped out looks something like:

Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                     under lock     direct  pagevec      drain
gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0

This says that Xorg allocated 3796 pages and its parent process is gdm
with a PID of 3756.

The postprocessor parses the text output of tracing. While there is a binary
format, the expectation is that the binary output can be readily translated
into text and post-processed offline. Obviously if the text format changes,
the parser will break but the regular expression parser is fairly rudimentary
so should be readily adjustable.
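
For reference, a sketch of a typical live invocation (assuming the usual
debugfs mount point; the options are the ones documented at the top of
the script):

	echo 1 > /sys/kernel/debug/tracing/events/kmem/enable
	perl Documentation/trace/postprocess/trace-pagealloc-postprocess.pl \
		--prepend-parent --read-procstat \
		< /sys/kernel/debug/tracing/trace_pipe
	# ctrl-c once prints an interim report; twice in quick succession exits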

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../postprocess/trace-pagealloc-postprocess.pl     |  418 ++++++++++++++++++++
 1 files changed, 418 insertions(+), 0 deletions(-)
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..7df50e8
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,418 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --prepend-parent	Report on the parent proc and PID
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_PAGE_ALLOC		=> 1;
+use constant MM_PAGE_FREE_DIRECT 	=> 2;
+use constant MM_PAGEVEC_FREE		=> 3;
+use constant MM_PAGE_PCPU_DRAIN		=> 4;
+use constant MM_PAGE_ALLOC_ZONE_LOCKED	=> 5;
+use constant MM_PAGE_ALLOC_EXTFRAG	=> 6;
+use constant EVENT_UNKNOWN		=> 7;
+
+# Constants used to track state
+use constant STATE_PCPU_PAGES_DRAINED	=> 8;
+use constant STATE_PCPU_PAGES_REFILLED	=> 9;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_PCPU_DRAINS		=> 10;
+use constant HIGH_PCPU_REFILLS		=> 11;
+use constant HIGH_EXT_FRAGMENT		=> 12;
+use constant HIGH_EXT_FRAGMENT_SEVERE	=> 13;
+use constant HIGH_EXT_FRAGMENT_MODERATE	=> 14;
+use constant HIGH_EXT_FRAGMENT_CHANGED	=> 15;
+
+my %perprocesspid;
+my %perprocess;
+my $opt_ignorepid;
+my $opt_read_procstat;
+my $opt_prepend_parent;
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+	'prepend-parent' =>	\$opt_prepend_parent,
+);
+
+# Defaults for dynamically discovered regex's
+my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
+
+# Dynamically discovered regex
+my $regex_fragdetails;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		$regex = $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			if ($line =~ /^print fmt:\s"(.*)",.*/) {
+				$regex = $1;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected '$key' != '$expected'");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
+			$regex_fragdetails_default,
+			"page", "pfn",
+			"alloc_order", "fallback_order", "pageblock_order",
+			"alloc_migratetype", "fallback_migratetype",
+			"fragmenting", "change_ownership");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+sub parent_info($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+	my $ppid;
+
+	if ($pid == 0) {
+		return "NOPARENT-0";
+	}
+
+	if ($statline !~ /$regex_statppid/o) {
+		die("Failed to match stat line process ppid:: $statline");
+	}
+
+	# Read the ppid stat line
+	$ppid = $1;
+	return guess_process_pid($ppid, read_statline($ppid));
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$tracepoint = $4;
+
+			if ($opt_read_procstat || $opt_prepend_parent) {
+				$process_pid =~ /(.*)-([0-9]*)$/;
+				my $process = $1;
+				my $pid = $2;
+
+				$statline = read_statline($pid);
+
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+
+				if ($opt_prepend_parent) {
+					$process_pid = parent_info($pid, $statline) . " :: $process_pid";
+				}
+			}
+
+			# Unnecessary in this script. Uncomment if required
+			# $cpus = $2;
+			# $timestamp = $3;
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_page_alloc") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
+		} elsif ($tracepoint eq "mm_page_free_direct") {
+			$perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
+		} elsif ($tracepoint eq "mm_pagevec_free") {
+			$perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
+		} elsif ($tracepoint eq "mm_page_pcpu_drain") {
+			$perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_extfrag") {
+
+			# Extract the details of the event now
+			$details = $5;
+
+			my ($page, $pfn);
+			my ($alloc_order, $fallback_order, $pageblock_order);
+			my ($alloc_migratetype, $fallback_migratetype);
+			my ($fragmenting, $change_ownership);
+
+			if ($details !~ /$regex_fragdetails/o) {
+				print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
+				next;
+			}
+
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
+			$page = $1;
+			$pfn = $2;
+			$alloc_order = $3;
+			$fallback_order = $4;
+			$pageblock_order = $5;
+			$alloc_migratetype = $6;
+			$fallback_migratetype = $7;
+			$fragmenting = $8;
+			$change_ownership = $9;
+
+			if ($fragmenting) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
+				if ($fallback_order <= 3) {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
+				} else {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
+				}
+			}
+			if ($change_ownership) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		# Catch a full pcpu drain event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
+				$tracepoint ne "mm_page_pcpu_drain") {
+
+			$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+
+		# Catch a full pcpu refill event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
+				$tracepoint ne "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"Process", "Pages",  "Pages",      "Pages", "Pages", "PCPU",  "PCPU",   "PCPU",    "Fragment",  "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"details", "allocd", "allocd",     "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing",   "Changed", "Severe", "Moderate", "");
+
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"",        "",       "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
+
+	foreach $process_pid (keys %stats) {
+		# Dump final aggregates
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8d %8d   %8d %8d %8d   %8d %8d %8d %8d %8d %8d\n",
+			$process_pid,
+			$stats{$process_pid}->{MM_PAGE_ALLOC},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
+			$stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
+			$stats{$process_pid}->{MM_PAGEVEC_FREE},
+			$stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS},
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
+			$stats{$process_pid}->{EVENT_UNKNOWN});
+	}
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
+		$perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
+		$perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
+		$perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
+		$perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
+		$perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
+		$perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
+		$perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
+		$perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
@ 2009-08-10 15:41   ` Mel Gorman
  0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-10 15:41 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like

Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                under lock     direct  pagevec      drain
swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0

Optionally, the output can include information on the parent or aggregate
based on process name instead of aggregating based on each pid. Example output
including parent information and stripped out the PID looks something like;

Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                     under lock     direct  pagevec      drain
gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0

This says that Xorg allocated 3796 pages and that its parent process is gdm
with a PID of 3756.

The postprocessor parses the text output of tracing. While there is a binary
format, the expectation is that the binary output can be readily translated
into text and post-processed offline. Obviously, if the text format changes,
the parser will break, but the regular expressions are fairly rudimentary so
they should be readily adjustable.
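
For example, assuming the kmem events have been enabled, a trace can be
captured to a file and post-processed offline (trace.log is an arbitrary
file name; interrupt the capture once enough data has been gathered)

  $ cat /sys/kernel/debug/tracing/trace_pipe > trace.log
  $ perl trace-pagealloc-postprocess.pl < trace.log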

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../postprocess/trace-pagealloc-postprocess.pl     |  418 ++++++++++++++++++++
 1 files changed, 418 insertions(+), 0 deletions(-)
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..7df50e8
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,418 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --prepend-parent	Report on the parent proc and PID
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_PAGE_ALLOC		=> 1;
+use constant MM_PAGE_FREE_DIRECT 	=> 2;
+use constant MM_PAGEVEC_FREE		=> 3;
+use constant MM_PAGE_PCPU_DRAIN		=> 4;
+use constant MM_PAGE_ALLOC_ZONE_LOCKED	=> 5;
+use constant MM_PAGE_ALLOC_EXTFRAG	=> 6;
+use constant EVENT_UNKNOWN		=> 7;
+
+# Constants used to track state
+use constant STATE_PCPU_PAGES_DRAINED	=> 8;
+use constant STATE_PCPU_PAGES_REFILLED	=> 9;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_PCPU_DRAINS		=> 10;
+use constant HIGH_PCPU_REFILLS		=> 11;
+use constant HIGH_EXT_FRAG		=> 12;
+use constant HIGH_EXT_FRAGMENT_SEVERE	=> 13;
+use constant HIGH_EXT_FRAGMENT_MODERATE	=> 14;
+use constant HIGH_EXT_FRAGMENT_CHANGED	=> 15;
+
+my %perprocesspid;
+my %perprocess;
+my $opt_ignorepid;
+my $opt_read_procstat;
+my $opt_prepend_parent;
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+	'prepend-parent' =>	\$opt_prepend_parent,
+);
+
+# Defaults for dynamically discovered regex's
+my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
+
+# Dynamically discovered regex
+my $regex_fragdetails;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		$regex = $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			if ($line =~ /^print fmt:\s"(.*)",.*/) {
+				$regex = $1;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected '$key' != '$expected'");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
+			$regex_fragdetails_default,
+			"page", "pfn",
+			"alloc_order", "fallback_order", "pageblock_order",
+			"alloc_migratetype", "fallback_migratetype",
+			"fragmenting", "change_ownership");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if (!defined($statline) || $statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+sub parent_info($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+	my $ppid;
+
+	if ($pid == 0) {
+		return "NOPARENT-0";
+	}
+
+	if ($statline !~ /$regex_statppid/o) {
+		die("Failed to match stat line process ppid:: $statline");
+	}
+
+	# Read the ppid stat line
+	$ppid = $1;
+	return guess_process_pid($ppid, read_statline($ppid));
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$tracepoint = $4;
+
+			if ($opt_read_procstat || $opt_prepend_parent) {
+				$process_pid =~ /(.*)-([0-9]*)$/;
+				my $process = $1;
+				my $pid = $2;
+
+				$statline = read_statline($pid);
+
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+
+				if ($opt_prepend_parent) {
+					$process_pid = parent_info($pid, $statline) . " :: $process_pid";
+				}
+			}
+
+			# Unnecessary in this script. Uncomment if required
+			# $cpus = $2;
+			# $timestamp = $3;
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_page_alloc") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
+		} elsif ($tracepoint eq "mm_page_free_direct") {
+			$perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
+		} elsif ($tracepoint eq "mm_pagevec_free") {
+			$perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
+		} elsif ($tracepoint eq "mm_page_pcpu_drain") {
+			$perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_extfrag") {
+
+			# Extract the details of the event now
+			$details = $5;
+
+			my ($page, $pfn);
+			my ($alloc_order, $fallback_order, $pageblock_order);
+			my ($alloc_migratetype, $fallback_migratetype);
+			my ($fragmenting, $change_ownership);
+
+			if ($details !~ /$regex_fragdetails/o) {
+				print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
+				next;
+			}
+
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
+			$page = $1;
+			$pfn = $2;
+			$alloc_order = $3;
+			$fallback_order = $4;
+			$pageblock_order = $5;
+			$alloc_migratetype = $6;
+			$fallback_migratetype = $7;
+			$fragmenting = $8;
+			$change_ownership = $9;
+
+			if ($fragmenting) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
+				if ($fallback_order <= 3) {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
+				} else {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
+				}
+			}
+			if ($change_ownership) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		# Catch a full pcpu drain event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
+				$tracepoint ne "mm_page_pcpu_drain") {
+
+			$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+
+		# Catch a full pcpu refill event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
+				$tracepoint ne "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name length
+	foreach $process_pid (keys %stats) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"Process", "Pages",  "Pages",      "Pages", "Pages", "PCPU",  "PCPU",   "PCPU",    "Fragment",  "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"details", "allocd", "allocd",     "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing",   "Changed", "Severe", "Moderate", "");
+
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"",        "",       "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
+
+	foreach $process_pid (keys %stats) {
+		# Dump final aggregates
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8d %8d   %8d %8d %8d   %8d %8d %8d %8d %8d %8d\n",
+			$process_pid,
+			$stats{$process_pid}->{MM_PAGE_ALLOC},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
+			$stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
+			$stats{$process_pid}->{MM_PAGEVEC_FREE},
+			$stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS},
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
+			$stats{$process_pid}->{EVENT_UNKNOWN});
+	}
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-[0-9]*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
+		$perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
+		$perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
+		$perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
+		$perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
+		$perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
+		$perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
+		$perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
+		$perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.6.3.3


* [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-10 15:41 ` Mel Gorman
@ 2009-08-10 15:41   ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-10 15:41 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

The documentation for ftrace, events and tracepoints is pretty
extensive. Similarly, the perf (PCL) tools' --help output is there and the
code is simple enough to figure out what most of the switches mean.
However, pulling the discrete bits and pieces together and translating
that into "how do I solve a problem" requires a fair amount of
imagination.

This patch adds a simple document intended to get someone started on the
different ways of using tracepoints to gather meaningful data.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
 1 files changed, 327 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/tracepoint-analysis.txt

diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 0000000..e7a7d3e
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
+		Notes on Analysing Behaviour Using Events and Tracepoints
+
+			Documentation written by Mel Gorman
+		PCL information heavily based on email from Ingo Molnar
+
+1. Introduction
+===============
+
+Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
+creating custom kernel modules to register probe functions using the event
+tracing infrastructure.
+
+Simplistically, a tracepoint represents an important event that can
+be taken in conjunction with other tracepoints to build a "Big Picture" of
+what is going on within the system. There are a large number of methods for
+gathering and interpreting these events. Lacking any current Best Practices,
+this document describes some of the methods that can be used.
+
+This document assumes that debugfs is mounted on /sys/kernel/debug and that
+the appropriate tracing options have been configured into the kernel. It is
+assumed that the PCL tool tools/perf has been installed and is in your path.
+
+2. Listing Available Events
+===========================
+
+2.1 Standard Utilities
+----------------------
+
+All possible events are visible from /sys/kernel/debug/tracing/events. Simply
+calling
+
+  $ find /sys/kernel/debug/tracing/events -type d
+
+will give a fair indication of the number of events available.
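+
+Each event directory also contains a format file describing the fields
+that are logged for that event. This is what the example post-processing
+script in Section 6 parses. For example
+
+  $ cat /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/format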
+
+2.2 PCL
+-------
+
+Discovery and enumeration of all counters and events, including tracepoints,
+are available with the perf tool. Getting a list of available events is a
+simple case of
+
+  $ perf list 2>&1 | grep Tracepoint
+  ext4:ext4_free_inode                     [Tracepoint event]
+  ext4:ext4_request_inode                  [Tracepoint event]
+  ext4:ext4_allocate_inode                 [Tracepoint event]
+  ext4:ext4_write_begin                    [Tracepoint event]
+  ext4:ext4_ordered_write_end              [Tracepoint event]
+  [ .... remaining output snipped .... ]
+
+
+3. Enabling Events
+==================
+
+3.1 System-Wide Event Enabling
+------------------------------
+
+See Documentation/trace/events.txt for a proper description on how events
+can be enabled system-wide. A short example of enabling all events related
+to page allocation would look something like
+
+  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
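+
+Individual events can also be toggled through the set_event file, e.g.
+
+  $ echo kmem:mm_page_alloc >> /sys/kernel/debug/tracing/set_event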
+
+3.2 System-Wide Event Enabling with SystemTap
+---------------------------------------------
+
+In SystemTap, tracepoints are accessible using the kernel.trace() function
+call. The following is an example that reports every 5 seconds which
+processes were allocating pages.
+
+  global page_allocs
+
+  probe kernel.trace("mm_page_alloc") {
+  	page_allocs[execname()]++
+  }
+
+  function print_count() {
+  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
+  	foreach (proc in page_allocs-)
+  		printf("%-25d %s\n", page_allocs[proc], proc)
+  	printf ("\n")
+  	delete page_allocs
+  }
+
+  probe timer.s(5) {
+          print_count()
+  }
+
+3.3 System-Wide Event Enabling with PCL
+---------------------------------------
+
+By specifying the -a switch and profiling a sleep, system-wide events
+can be examined for a fixed duration of time.
+
+ $ perf stat -a \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	sleep 10
+ Performance counter stats for 'sleep 10':
+
+           9630  kmem:mm_page_alloc      
+           2143  kmem:mm_page_free_direct
+           7424  kmem:mm_pagevec_free    
+
+   10.002577764  seconds time elapsed
+
+Similarly, one could execute a shell and exit it as desired to get a report
+at that point.
+
+3.4 Local Event Enabling
+------------------------
+
+Documentation/trace/ftrace.txt describes how to enable events on a per-thread
+basis using set_ftrace_pid.
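+
+A minimal sketch based on that description, restricting tracing to the
+current shell, would be
+
+  $ echo $$ > /sys/kernel/debug/tracing/set_ftrace_pid
+  $ echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable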
+
+3.5 Local Event Enablement with PCL
+-----------------------------------
+
+Events can be activated and tracked for the duration of a process on a local
+basis using PCL as follows.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		 -e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.909
+
+    Performance counter stats for './hackbench 10':
+
+          17803  kmem:mm_page_alloc      
+          12398  kmem:mm_page_free_direct
+           4827  kmem:mm_pagevec_free    
+
+    0.973913387  seconds time elapsed
+
+4. Event Filtering
+==================
+
+Documentation/trace/ftrace.txt covers in-depth how to filter events in
+ftrace.  Obviously using grep and awk of trace_pipe is an option as well
+as any script reading trace_pipe.
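+
+As a trivial example of the latter, a single event can be watched from the
+console with
+
+  $ grep mm_page_alloc_extfrag /sys/kernel/debug/tracing/trace_pipe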
+
+5. Analysing Event Variances with PCL
+=====================================
+
+Any workload can exhibit variances between runs and it can be important
+to know what the standard deviation is. By and large, this is left to the
+performance analyst to do by hand. In the event that the discrete event
+occurrences are useful to the performance analyst, then perf can be used.
+
+  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+			-e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.890
+  Time: 0.895
+  Time: 0.915
+  Time: 1.001
+  Time: 0.899
+
+   Performance counter stats for './hackbench 10' (5 runs):
+
+          16630  kmem:mm_page_alloc         ( +-   3.542% )
+          11486  kmem:mm_page_free_direct   ( +-   4.771% )
+           4730  kmem:mm_pagevec_free       ( +-   2.325% )
+
+    0.982653002  seconds time elapsed   ( +-   1.448% )
+
+In the event that some higher-level event is required that depends on some
+aggregation of discrete events, a script would need to be developed.
+
+Using --repeat, it is also possible to view how events are fluctuating over
+time on a system-wide basis, using -a and sleep.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-a --repeat 10 \
+		sleep 1
+  Performance counter stats for 'sleep 1' (10 runs):
+
+           1066  kmem:mm_page_alloc         ( +-  26.148% )
+            182  kmem:mm_page_free_direct   ( +-   5.464% )
+            890  kmem:mm_pagevec_free       ( +-  30.079% )
+
+    1.002251757  seconds time elapsed   ( +-   0.005% )
+
+6. Higher-Level Analysis with Helper Scripts
+============================================
+
+When events are enabled, the events that are triggering can be read from
+/sys/kernel/debug/tracing/trace_pipe in human-readable format, although binary
+options exist as well. By post-processing the output, further information can
+be gathered on-line as appropriate. Examples of post-processing might include
+
+  o Reading information from /proc for the PID that triggered the event
+  o Deriving a higher-level event from a series of lower-level events.
+  o Calculating latencies between two events
+
+Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
+script that can read trace_pipe from STDIN or a copy of a trace. When used
+on-line, it can be interrupted once to generate a report without exiting
+and twice to exit.
+
+Simplistically, the script just reads STDIN and counts up events, but it
+can also do more, such as the following (an example invocation is shown
+after this list)
+
+  o Derive high-level events from many low-level events. If a number of pages
+    are freed to the main allocator from the per-CPU lists, it recognises
+    that as one per-CPU drain even though there is no specific tracepoint
+    for that event
+  o It can aggregate based on process name rather than individual PID
+  o In the event memory is getting externally fragmented, it reports
+    on whether the fragmentation event was severe or moderate.
+  o When receiving an event about a PID, it can record who the parent was so
+    that if large numbers of events are coming from very short-lived
+    processes, the parent process responsible for creating all the helpers
+    can be identified
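+
+An example invocation using all three options while reading directly from
+trace_pipe would be
+
+  $ perl trace-pagealloc-postprocess.pl --prepend-parent --read-procstat \
+	--ignore-pid < /sys/kernel/debug/tracing/trace_pipe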
+
+7. Lower-Level Analysis with PCL
+================================
+
+There may also be a requirement to identify what functions within a program
+were generating events within the kernel. To begin this sort of analysis, the
+data must be recorded. At the time of writing, this required root
+
+  $ perf record -c 1 \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	./hackbench 10
+  Time: 0.894
+  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
+
+Note the use of '-c 1' to set the sample period to 1 so that every event is
+recorded. The default sample period is quite high to minimise overhead, but
+the information collected can be very coarse as a result.
+
+This recording produced a file called perf.data which can be analysed using
+perf report.
+
+  $ perf report
+  # Samples: 30922
+  #
+  # Overhead    Command                     Shared Object
+  # ........  .........  ................................
+  #
+      87.27%  hackbench  [vdso]                          
+       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
+       2.62%  hackbench  /lib/ld-2.9.so                  
+       1.52%       perf  [vdso]                          
+       1.22%  hackbench  ./hackbench                     
+       0.48%  hackbench  [kernel]                        
+       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
+       0.01%       perf  /usr/bin/perf                   
+       0.01%       perf  /lib/ld-2.9.so                  
+       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+According to this, the vast majority of events were triggered within
+the VDSO. With simple binaries, this will often be the case, so let's
+take a slightly different example. In the course of writing this, it was
+noticed that X was generating an insane number of page allocations, so
+let's look at it
+
+  $ perf record -c 1 -f \
+		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-p `pidof X`
+
+This was interrupted after a few seconds and
+
+  $ perf report
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object
+  # ........  .......  .......................................
+  #
+      51.95%     Xorg  [vdso]                                 
+      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
+       0.01%     Xorg  [kernel]                               
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+So, almost half of the events are occurring in a library. To get an idea of
+which symbol
+
+  $ perf report --sort comm,dso,symbol
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object  Symbol
+  # ........  .......  .......................................  ......
+  #
+      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
+      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
+       0.01%     Xorg  [kernel]                                 [k] read_hpet
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
+       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
+
+To see where within the function pixmanFillsse2 things are going wrong
+
+  $ perf annotate pixmanFillsse2
+  [ ... ]
+    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
+         :      }
+         :
+         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
+         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
+         :        *__P = __B;
+   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
+    0.00 :         34ef5:       ff 
+   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
+    0.00 :         34efd:       ff 
+   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
+    0.00 :         34f05:       ff 
+   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
+    0.00 :         34f0d:       ff 
+   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
+   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
+   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
+   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
+
+At a glance, it looks like the time is being spent copying pixmaps to
+the card.  Further investigation would be needed to determine why pixmaps
+are being copied around so much, but a starting point would be to take an
+ancient build of libpixman out of the library path where it was totally
+forgotten about months ago!
-- 
1.6.3.3



* [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints
  2009-08-10 15:41 ` Mel Gorman
@ 2009-08-10 15:41   ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-10 15:41 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

Knowing tracepoints exist is not quite the same as knowing what they
should be used for. This patch adds a document giving a basic
description of the kmem tracepoints and why they might be useful to a
performance analyst.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/events-kmem.txt |  107 +++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt

diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 0000000..6ef2a86
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
+			Subsystem Trace Points: kmem
+
+The tracing system kmem captures events related to object and page allocation
+within the kernel. Broadly speaking there are five major subheadings.
+
+  o Slab allocation of small objects of unknown type (kmalloc)
+  o Slab allocation of small objects of known type
+  o Page allocation
+  o Per-CPU Allocator Activity
+  o External Fragmentation
+
+This document describes what each of the tracepoints is and why they
+might be useful.
+
+1. Slab allocation of small objects of unknown type
+===================================================
+kmalloc		call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmalloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kfree		call_site=%lx ptr=%p
+
+Heavy activity for these events may indicate that a specific cache is
+justified, particularly if kmalloc slab pages are becoming significantly
+internally fragmented as a result of the allocation pattern. By correlating
+kmalloc with kfree, it may be possible to identify memory leaks and where
+the allocation sites were.
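+
+For example, a crude check for a kmalloc/kfree imbalance over the lifetime
+of a workload (./workload is a stand-in for the program of interest) is
+
+  $ perf stat -e kmem:kmalloc -e kmem:kfree ./workload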
+
+
+2. Slab allocation of small objects of known type
+=================================================
+kmem_cache_alloc	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmem_cache_alloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kmem_cache_free		call_site=%lx ptr=%p
+
+These events are similar in usage to the kmalloc-related events, except that
+it is likely easier to pin the event down to a specific cache. At the time
+of writing, no information is available on which slab is being allocated from,
+but the call_site can usually be used to extrapolate that information.
+
+3. Page allocation
+==================
+mm_page_alloc		  page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
+mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_free_direct	  page=%p pfn=%lu order=%d
+mm_pagevec_free		  page=%p pfn=%lu order=%d cold=%d
+
+These four events deal with page allocation and freeing. mm_page_alloc is
+a simple indicator of page allocator activity. Pages may be allocated from
+the per-CPU allocator (high performance) or the buddy allocator.
+
+If pages are allocated directly from the buddy allocator, the
+mm_page_alloc_zone_locked event is triggered. This event is important as high
+amounts of activity imply high activity on the zone->lock. Taking this lock
+impairs performance by disabling interrupts, dirtying cache lines between
+CPUs and serialising many CPUs.
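+
+A rough measure of how often the allocator bypassed the per-CPU lists and
+took the lock can be obtained by comparing the counts of the two events,
+for example
+
+  $ perf stat -a -e kmem:mm_page_alloc \
+	-e kmem:mm_page_alloc_zone_locked sleep 10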
+
+When a page is freed directly by the caller, the mm_page_free_direct event
+is triggered. Significant amounts of activity here could indicate that the
+callers should be batching their activities.
+
+When pages are freed using a pagevec, the mm_pagevec_free event is
+triggered. Broadly speaking, pages are taken off the LRU list in bulk and
+freed in batch with a pagevec. Significant amounts of activity here could
+indicate that the system is under memory pressure and can also indicate
+contention on the zone->lru_lock.
+
+4. Per-CPU Allocator Activity
+=============================
+mm_page_alloc_zone_locked	page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_pcpu_drain		page=%p pfn=%lu order=%d cpu=%d migratetype=%d
+
+In front of the page allocator is a per-cpu page allocator. It exists only
+for order-0 pages, reduces contention on the zone->lock and reduces the
+amount of writing on struct page.
+
+When a per-CPU list is empty, or pages of the wrong type are requested,
+the zone->lock will be taken once and the per-CPU list refilled. The event
+triggered is mm_page_alloc_zone_locked for each page allocated, with the
+event indicating whether it is for a percpu_refill or not.
+
+When the per-CPU list is too full, a number of pages are freed, each of
+which triggers a mm_page_pcpu_drain event.
+
+The individual nature of the events is so that pages can be tracked
+between allocation and freeing. A run of drain or refill events that occur
+consecutively implies that the zone->lock was taken once. Large amounts of
+PCP refills and drains could imply an imbalance between CPUs where too much
+work is being concentrated in one place. It could also indicate that the
+per-CPU lists should be a larger size. Finally, large amounts of refills on
+one CPU and drains on another could be a factor in causing large amounts
+of cache line bounces due to writes between CPUs. It is worth investigating
+whether pages can be allocated and freed on the same CPU through some
+algorithm change.
+
+5. External Fragmentation
+=========================
+mm_page_alloc_extfrag		page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
+
+External fragmentation affects whether a high-order allocation will be
+successful or not. For some types of hardware, this is important although
+it is avoided where possible. If the system is using huge pages and needs
+to be able to resize the pool over the lifetime of the system, this value
+is important.
+
+Large numbers of this event imply that memory is fragmenting and
+high-order allocations will start failing at some time in the future. One
+means of reducing the occurrence of this event is to increase the size of
+min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
+pageblock_size is usually the size of the default hugepage size.
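+
+As a worked example, on a machine with 2MB hugepages (a pageblock_size of
+2048kB) and two online nodes, one such increment would be 3 * 2048 * 2 =
+12288kB, which could be added to the current value (as root) with
+
+  $ echo $((`cat /proc/sys/vm/min_free_kbytes` + 12288)) > /proc/sys/vm/min_free_kbytes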
-- 
1.6.3.3



* Re: [PATCH 0/6] Add some trace events for the page allocator v6
  2009-08-10 15:41 ` Mel Gorman
@ 2009-08-10 18:56   ` Ingo Molnar
  -1 siblings, 0 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-08-10 18:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, Li Ming Chun,
	LKML, linux-mm


* Mel Gorman <mel@csn.ul.ie> wrote:

> This is V6 of a patchset to add some tracepoints of interest when analysing
> the page allocator. The only changes since the last revision were to fix a
> minor error in the post-processing script and to add a reviewed-by to one
> of the patches.
> 
> Can we get a yey/nay on whether these should be merged or not?

It's up to Andrew - from an instrumentation POV it all looks fine to 
me:

   Reviewed-by: Ingo Molnar <mingo@elte.hu>

With the highlight of the patchset being the very low cross section 
it has to regular MM code/hacking:

>  mm/page_alloc.c                                    |   13 +-

	Ingo

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-10 15:41   ` Mel Gorman
@ 2009-08-11 17:31     ` Jonathan Corbet
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Corbet @ 2009-08-11 17:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Ingo Molnar, Andrew Morton, riel, Peter Zijlstra,
	Li Ming Chun, LKML, linux-mm, Mel Gorman

On Mon, 10 Aug 2009 16:41:54 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> This patch adds a simple document intended to get someone started on the
> different ways of using tracepoints to gather meaningful data.

It's really good stuff.  Here are a few nits which may help to make it
imperceptibly better...

> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
>  1 files changed, 327 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/trace/tracepoint-analysis.txt
> 
> diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
> new file mode 100644
> index 0000000..e7a7d3e
> --- /dev/null
> +++ b/Documentation/trace/tracepoint-analysis.txt
> @@ -0,0 +1,327 @@
> +		Notes on Analysing Behaviour Using Events and Tracepoints
> +
> +			Documentation written by Mel Gorman
> +		PCL information heavily based on email from Ingo Molnar
> +
> +1. Introduction
> +===============
> +
> +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
> +creating custom kernel modules to register probe functions using the event
> +tracing infrastructure.
> +
> +Simplistically, tracepoints will represent an important event that can
> +be taken in conjunction with other tracepoints to build a "Big Picture" of
> +what is going on within the system. There are a large number of methods for
> +gathering and interpreting these events. Lacking any current Best Practises,
> +this document describes some of the methods that can be used.
> +
> +This document assumes that debugfs is mounted on /sys/kernel/debug and that
> +the appropriate tracing options have been configured into the kernel. It is
> +assumed that the PCL tool tools/perf has been installed and is in your path.

I would bet it is worth mentioning what those are.  In particular, "perf"
failed for me until I turned on CONFIG_EVENT_PROFILE - which isn't under
the tracing options and which doesn't appear to have a help text.

It might also be worth mentioning that "perf" is in tools/perf - it might
be a bit before distributors ship it universally.
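
(For reference, a sketch of the config symbols that seem to be involved
on kernels of this era - CONFIG_EVENT_PROFILE is the one named above; the
other names are assumptions worth double-checking against Kconfig:

  $ grep -E 'CONFIG_(TRACEPOINTS|EVENT_TRACING|EVENT_PROFILE)' .config
  CONFIG_TRACEPOINTS=y
  CONFIG_EVENT_TRACING=y
  CONFIG_EVENT_PROFILE=y
)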

> +2. Listing Available Events
> +===========================
> +
> +2.1 Standard Utilities
> +----------------------
> +
> +All possible events are visible from /sys/kernel/debug/tracing/events. Simply
> +calling
> +
> +  $ find /sys/kernel/debug/tracing/events -type d
> +
> +will give a fair indication of the number of events available.
> +
> +2.2 PCL
> +-------
> +
> +Discovery and enumeration of all counters and events, including tracepoints
> +are available with the perf tool. Getting a list of available events is a
> +simple case of
> +
> +  $ perf list 2>&1 | grep Tracepoint
> +  ext4:ext4_free_inode                     [Tracepoint event]
> +  ext4:ext4_request_inode                  [Tracepoint event]
> +  ext4:ext4_allocate_inode                 [Tracepoint event]
> +  ext4:ext4_write_begin                    [Tracepoint event]
> +  ext4:ext4_ordered_write_end              [Tracepoint event]
> +  [ .... remaining output snipped .... ]
> +
> +
> +2. Enabling Events
> +==================
> +
> +2.1 System-Wide Event Enabling
> +------------------------------
> +
> +See Documentation/trace/events.txt for a proper description on how events
> +can be enabled system-wide. A short example of enabling all events related
> +to page allocation would look something like
> +
> +  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done

This appears not to be necessary if you're using "perf."  If you're doing
something else (just catting out the events, say), you also need to set
.../tracing/tracing_enabled.
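
(Roughly the following, in addition to writing the per-event enable files
above - a sketch only:

  $ echo 1 > /sys/kernel/debug/tracing/tracing_enabled
)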

> +
> +2.2 System-Wide Event Enabling with SystemTap
> +---------------------------------------------
> +
> +In SystemTap, tracepoints are accessible using the kernel.trace() function
> +call. The following is an example that reports every 5 seconds what processes
> +were allocating the pages.
> +
> +  global page_allocs
> +
> +  probe kernel.trace("mm_page_alloc") {
> +  	page_allocs[execname()]++
> +  }
> +
> +  function print_count() {
> +  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
> +  	foreach (proc in page_allocs-)
> +  		printf("%-25d %s\n", page_allocs[proc], proc)
> +  	printf ("\n")
> +  	delete page_allocs
> +  }
> +
> +  probe timer.s(5) {
> +          print_count()
> +  }
> +
> +2.3 System-Wide Event Enabling with PCL
> +---------------------------------------
> +
> +By specifying the -a switch and analysing sleep, the system-wide events
> +for a duration of time can be examined.
> +
> + $ perf stat -a \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	sleep 10
> + Performance counter stats for 'sleep 10':
> +
> +           9630  kmem:mm_page_alloc      
> +           2143  kmem:mm_page_free_direct
> +           7424  kmem:mm_pagevec_free    
> +
> +   10.002577764  seconds time elapsed
> +
> +Similarly, one could execute a shell and exit it as desired to get a report
> +at that point.
> +
> +2.4 Local Event Enabling
> +------------------------
> +
> +Documentation/trace/ftrace.txt describes how to enable events on a per-thread
> +basis using set_ftrace_pid.
> +
> +2.5 Local Event Enablement with PCL
> +-----------------------------------
> +
> +Events can be activate and tracked for the duration of a process on a local

"activated"

> +basis using PCL such as follows.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		 -e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.909
> +
> +    Performance counter stats for './hackbench 10':
> +
> +          17803  kmem:mm_page_alloc      
> +          12398  kmem:mm_page_free_direct
> +           4827  kmem:mm_pagevec_free    
> +
> +    0.973913387  seconds time elapsed
> +
> +3. Event Filtering
> +==================
> +
> +Documentation/trace/ftrace.txt covers in-depth how to filter events in
> +ftrace.  Obviously using grep and awk of trace_pipe is an option as well
> +as any script reading trace_pipe.
> +
> +4. Analysing Event Variances with PCL
> +=====================================
> +
> +Any workload can exhibit variances between runs and it can be important
> +to know what the standard deviation in. By and large, this is left to the

s/deviation in/deviation is/

> +performance analyst to do it by hand. In the event that the discrete event

s/to do it/to do/

> +occurrences are useful to the performance analyst, then perf can be used.
> +
> +  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
> +			-e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.890
> +  Time: 0.895
> +  Time: 0.915
> +  Time: 1.001
> +  Time: 0.899
> +
> +   Performance counter stats for './hackbench 10' (5 runs):
> +
> +          16630  kmem:mm_page_alloc         ( +-   3.542% )
> +          11486  kmem:mm_page_free_direct   ( +-   4.771% )
> +           4730  kmem:mm_pagevec_free       ( +-   2.325% )
> +
> +    0.982653002  seconds time elapsed   ( +-   1.448% )
> +
> +In the event that some higher-level event is required that depends on some
> +aggregation of discrete events, then a script would need to be developed.
> +
> +Using --repeat, it is also possible to view how events are fluctuating over
> +time on a system wide basis using -a and sleep.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-a --repeat 10 \
> +		sleep 1
> +  Performance counter stats for 'sleep 1' (10 runs):
> +
> +           1066  kmem:mm_page_alloc         ( +-  26.148% )
> +            182  kmem:mm_page_free_direct   ( +-   5.464% )
> +            890  kmem:mm_pagevec_free       ( +-  30.079% )
> +
> +    1.002251757  seconds time elapsed   ( +-   0.005% )
> +
> +5. Higher-Level Analysis with Helper Scripts
> +============================================
> +
> +When events are enabled the events that are triggering can be read from
> +/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
> +options exist as well. By post-processing the output, further information can
> +be gathered on-line as appropriate. Examples of post-processing might include
> +
> +  o Reading information from /proc for the PID that triggered the event
> +  o Deriving a higher-level event from a series of lower-level events.
> +  o Calculate latencies between two events

"Calculating" would match the tense of the first two items.

> +
> +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example

I don't have that file in current git...?

> +script that can read trace_pipe from STDIN or a copy of a trace. When used
> +on-line, it can be interrupted once to generate a report without existing

That's why I don't have it!  It can generate reports without
existing! :)

> +and twice to exit.
> +
> +Simplistically, the script just reads STDIN and counts up events but it
> +also can do more such as
> +
> +  o Derive high-level events from many low-level events. If a number of pages
> +    are freed to the main allocator from the per-CPU lists, it recognises
> +    that as one per-CPU drain even though there is no specific tracepoint
> +    for that event
> +  o It can aggregate based on PID or individual process number

A PID differs from an "individual process number" how?

> +  o In the event memory is getting externally fragmented, it reports
> +    on whether the fragmentation event was severe or moderate.
> +  o When receiving an event about a PID, it can record who the parent was so
> +    that if large numbers of events are coming from very short-lived
> +    processes, the parent process responsible for creating all the helpers
> +    can be identified
> +
> +6. Lower-Level Analysis with PCL
> +================================
> +
> +There may also be a requirement to identify what functions within a program
> +were generating events within the kernel. To begin this sort of analysis, the
> +data must be recorded. At the time of writing, this required root

It works happily without root on my -rc5++ system; I don't think I've
tweaked anything to make that happen.

> +
> +  $ perf record -c 1 \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	./hackbench 10
> +  Time: 0.894
> +  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
> +
> +Note the use of '-c 1' to set the event period to sample. The default sample
> +period is quite high to minimise overhead but the information collected can be
> +very coarse as a result.

What are the units for -c?

> +
> +This record outputted a file called perf.data which can be analysed using
> +perf report.
> +
> +  $ perf report
> +  # Samples: 30922
> +  #
> +  # Overhead    Command                     Shared Object
> +  # ........  .........  ................................
> +  #
> +      87.27%  hackbench  [vdso]                          
> +       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
> +       2.62%  hackbench  /lib/ld-2.9.so                  
> +       1.52%       perf  [vdso]                          
> +       1.22%  hackbench  ./hackbench                     
> +       0.48%  hackbench  [kernel]                        
> +       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
> +       0.01%       perf  /usr/bin/perf                   
> +       0.01%       perf  /lib/ld-2.9.so                  
> +       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +According to this, the vast majority of events occurred within the VDSO.
> +With simple binaries, this will often be the case, so let's take a
> +slightly different example. In the course of writing this, it was noticed
> +that X was generating an insane amount of page allocations, so let's look
> +at it:
> +
> +  $ perf record -c 1 -f \
> +		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-p `pidof X`
> +
> +This was interrupted after a few seconds and
> +
> +  $ perf report
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object
> +  # ........  .......  .......................................
> +  #
> +      51.95%     Xorg  [vdso]                                 
> +      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
> +       0.01%     Xorg  [kernel]                               
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +So, almost half of the events are occurring in a library. To get an idea
> +of which symbol:
> +
> +  $ perf report --sort comm,dso,symbol
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object  Symbol
> +  # ........  .......  .......................................  ......
> +  #
> +      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
> +      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
> +       0.01%     Xorg  [kernel]                                 [k] read_hpet
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
> +       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack

Worth mentioning that [k] marks kernel-space symbols?

> +
> +To see where within the function pixmanFillsse2 things are going wrong
> +
> +  $ perf annotate pixmanFillsse2
> +  [ ... ]
> +    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
> +         :      }
> +         :
> +         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
> +         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
> +         :        *__P = __B;
> +   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
> +    0.00 :         34ef5:       ff 
> +   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
> +    0.00 :         34efd:       ff 
> +   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
> +    0.00 :         34f05:       ff 
> +   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
> +    0.00 :         34f0d:       ff 
> +   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
> +   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
> +   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
> +   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
> +
> +At a glance, it looks like the time is being spent copying pixmaps to
> +the card.  Further investigation would be needed to determine why pixmaps
> +are being copied around so much but a starting point would be to take an
> +ancient build of libpixmap out of the library path where it was totally
> +forgotten about from months ago!

jon

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-11 17:31     ` Jonathan Corbet
@ 2009-08-11 17:39       ` Jonathan Corbet
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Corbet @ 2009-08-11 17:39 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Mel Gorman, Larry Woodman, Ingo Molnar, Andrew Morton, riel,
	Peter Zijlstra, Li Ming Chun, LKML, linux-mm

On Tue, 11 Aug 2009 11:31:39 -0600
Jonathan Corbet <corbet@lwn.net> wrote:

> > +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example  
> 
> I don't have that file in current git...?

Duh, obviously, that's added by part 4 of the series.  Please ignore
this bit of noise, sorry.

jon

* Re: [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
  2009-08-10 15:41   ` Mel Gorman
@ 2009-08-11 23:31     ` KOSAKI Motohiro
  0 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2009-08-11 23:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Larry Woodman, Ingo Molnar, Andrew Morton, riel,
	Peter Zijlstra, Li Ming Chun, LKML, linux-mm

> This patch adds trace events for the allocation and freeing of pages,
> including the freeing of pagevecs.  Using the events, it will be known what
> struct page and pfns are being allocated and freed and what the call site
> was in many cases.
> 
> The page alloc tracepoints can be used as an indicator as to whether the
> workload was heavily dependent on the page allocator or not. You can make
> a guess based
> on vmstat but you can't get a per-process breakdown. Depending on the call
> path, the call_site for page allocation may be __get_free_pages() instead
> of a useful callsite. Instead of passing down a return address similar to
> slab debugging, the user should enable the stacktrace and seg-addr options
> to get a proper stack trace.
> 
> The pagevec free tracepoint has a different usecase. It can be used to get
> an idea of how many pages are being dumped off the LRU and whether it is
> kswapd doing the work or a process doing direct reclaim.

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

* Re: [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-08-10 15:41   ` Mel Gorman
@ 2009-08-11 23:31     ` KOSAKI Motohiro
  0 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2009-08-11 23:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Larry Woodman, Ingo Molnar, Andrew Morton, riel,
	Peter Zijlstra, Li Ming Chun, LKML, linux-mm

> The page allocation trace event reports that a page was successfully allocated
> but it does not specify where it came from. When analysing performance,
> it can be important to distinguish between pages coming from the per-cpu
> allocator and pages coming from the buddy lists as the latter requires the
> zone lock to be taken and more data structures to be examined.
> 
> This patch adds a trace event for __rmqueue reporting when a page is being
> allocated from the buddy lists. It distinguishes between being called
> to refill the per-cpu lists or whether it is a high-order allocation.
> Similarly, this patch adds an event to catch when the PCP lists are being
> drained a little and pages are going back to the buddy lists.
> 
> This is trickier to draw conclusions from but high activity on those
> events could explain why there were a large number of cache misses on a
> page-allocator-intensive workload. The coalescing and splitting of buddies
> involves a lot of writing of page metadata and cache line bounces not to
> mention the acquisition of an interrupt-safe lock necessary to enter this
> path.

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-10 15:41   ` Mel Gorman
@ 2009-08-12  1:04     ` Fernando Carrijo
  0 siblings, 0 replies; 42+ messages in thread
From: Fernando Carrijo @ 2009-08-12  1:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Ingo Molnar, Andrew Morton, riel, Peter Zijlstra,
	Li Ming Chun, LKML, linux-mm

On Mon, 2009-08-10 at 16:41 +0100, Mel Gorman wrote:
> The documentation for ftrace, events and tracepoints is pretty
> extensive. Similarly, the perf PCL tools help files --help are there and
> the code simple enough to figure out what much of the switches mean.
> However, pulling the discrete bits and pieces together and translating
> that into "how do I solve a problem" requires a fair amount of
> imagination.
> 
> This patch adds a simple document intended to get someone started on the
> different ways of using tracepoints to gather meaningful data.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
>  1 files changed, 327 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/trace/tracepoint-analysis.txt
> 
> diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
> new file mode 100644
> index 0000000..e7a7d3e
> --- /dev/null
> +++ b/Documentation/trace/tracepoint-analysis.txt
> @@ -0,0 +1,327 @@
> +		Notes on Analysing Behaviour Using Events and Tracepoints
> +
> +			Documentation written by Mel Gorman
> +		PCL information heavily based on email from Ingo Molnar
> +
> +1. Introduction
> +===============
> +
> +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
> +creating custom kernel modules to register probe functions using the event
> +tracing infrastructure.
> +
> +Simplistically, tracepoints will represent an important event that can
> +be taken in conjunction with other tracepoints to build a "Big Picture" of
> +what is going on within the system. There are a large number of methods for
> +gathering and interpreting these events. Lacking any current Best Practises,
> +this document describes some of the methods that can be used.
> +
> +This document assumes that debugfs is mounted on /sys/kernel/debug and that
> +the appropriate tracing options have been configured into the kernel. It is
> +assumed that the PCL tool tools/perf has been installed and is in your path.
> +
> +2. Listing Available Events
> +===========================
> +
> +2.1 Standard Utilities
> +----------------------
> +
> +All possible events are visible from /sys/kernel/debug/tracing/events. Simply
> +calling
> +
> +  $ find /sys/kernel/debug/tracing/events -type d
> +
> +will give a fair indication of the number of events available.
> +
> +2.2 PCL
> +-------
> +
> +Discovery and enumeration of all counters and events, including tracepoints
> +are available with the perf tool. Getting a list of available events is a
> +simple case of
> +
> +  $ perf list 2>&1 | grep Tracepoint
> +  ext4:ext4_free_inode                     [Tracepoint event]
> +  ext4:ext4_request_inode                  [Tracepoint event]
> +  ext4:ext4_allocate_inode                 [Tracepoint event]
> +  ext4:ext4_write_begin                    [Tracepoint event]
> +  ext4:ext4_ordered_write_end              [Tracepoint event]
> +  [ .... remaining output snipped .... ]
> +
> +
> +2. Enabling Events
> +==================
> +
> +2.1 System-Wide Event Enabling
> +------------------------------
> +
> +See Documentation/trace/events.txt for a proper description on how events
> +can be enabled system-wide. A short example of enabling all events related
> +to page allocation would look something like
> +
> +  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
> +
> +2.2 System-Wide Event Enabling with SystemTap
> +---------------------------------------------
> +
> +In SystemTap, tracepoints are accessible using the kernel.trace() function
> +call. The following is an example that reports every 5 seconds what processes
> +were allocating the pages.
> +
> +  global page_allocs
> +
> +  probe kernel.trace("mm_page_alloc") {
> +  	page_allocs[execname()]++
> +  }
> +
> +  function print_count() {
> +  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
> +  	foreach (proc in page_allocs-)
> +  		printf("%-25d %s\n", page_allocs[proc], proc)
> +  	printf ("\n")
> +  	delete page_allocs
> +  }
> +
> +  probe timer.s(5) {
> +          print_count()
> +  }
> +
> +2.3 System-Wide Event Enabling with PCL
> +---------------------------------------
> +
> +By specifying the -a switch and analysing sleep, the system-wide events
> +for a duration of time can be examined.
> +
> + $ perf stat -a \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	sleep 10
> + Performance counter stats for 'sleep 10':
> +
> +           9630  kmem:mm_page_alloc      
> +           2143  kmem:mm_page_free_direct
> +           7424  kmem:mm_pagevec_free    
> +
> +   10.002577764  seconds time elapsed
> +
> +Similarly, one could execute a shell and exit it as desired to get a report
> +at that point.
> +
> +2.4 Local Event Enabling
> +------------------------
> +
> +Documentation/trace/ftrace.txt describes how to enable events on a per-thread
> +basis using set_ftrace_pid.
> +
> +2.5 Local Event Enablement with PCL
> +-----------------------------------
> +
> +Events can be activate and tracked for the duration of a process on a local
> +basis using PCL such as follows.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		 -e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.909
> +
> +    Performance counter stats for './hackbench 10':
> +
> +          17803  kmem:mm_page_alloc      
> +          12398  kmem:mm_page_free_direct
> +           4827  kmem:mm_pagevec_free    
> +
> +    0.973913387  seconds time elapsed
> +
> +3. Event Filtering
> +==================
> +
> +Documentation/trace/ftrace.txt covers in-depth how to filter events in
> +ftrace.  Obviously using grep and awk of trace_pipe is an option as well
> +as any script reading trace_pipe.
> +
> +4. Analysing Event Variances with PCL
> +=====================================
> +
> +Any workload can exhibit variances between runs and it can be important
> +to know what the standard deviation in. By and large, this is left to the
> +performance analyst to do it by hand. In the event that the discrete event
> +occurrences are useful to the performance analyst, then perf can be used.
> +
> +  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
> +			-e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.890
> +  Time: 0.895
> +  Time: 0.915
> +  Time: 1.001
> +  Time: 0.899
> +
> +   Performance counter stats for './hackbench 10' (5 runs):
> +
> +          16630  kmem:mm_page_alloc         ( +-   3.542% )
> +          11486  kmem:mm_page_free_direct   ( +-   4.771% )
> +           4730  kmem:mm_pagevec_free       ( +-   2.325% )
> +
> +    0.982653002  seconds time elapsed   ( +-   1.448% )
> +
> +In the event that some higher-level event is required that depends on some
> +aggregation of discrete events, then a script would need to be developed.
> +
> +Using --repeat, it is also possible to view how events are fluctuating over
> +time on a system wide basis using -a and sleep.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-a --repeat 10 \
> +		sleep 1
> +  Performance counter stats for 'sleep 1' (10 runs):
> +
> +           1066  kmem:mm_page_alloc         ( +-  26.148% )
> +            182  kmem:mm_page_free_direct   ( +-   5.464% )
> +            890  kmem:mm_pagevec_free       ( +-  30.079% )
> +
> +    1.002251757  seconds time elapsed   ( +-   0.005% )
> +
> +5. Higher-Level Analysis with Helper Scripts
> +============================================
> +
> +When events are enabled the events that are triggering can be read from
> +/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
> +options exist as well. By post-processing the output, further information can
> +be gathered on-line as appropriate. Examples of post-processing might include
> +
> +  o Reading information from /proc for the PID that triggered the event
> +  o Deriving a higher-level event from a series of lower-level events.
> +  o Calculate latencies between two events
> +
> +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
> +script that can read trace_pipe from STDIN or a copy of a trace. When used
> +on-line, it can be interrupted once to generate a report without existing
> +and twice to exit.
> +
> +Simplistically, the script just reads STDIN and counts up events but it
> +also can do more such as
> +
> +  o Derive high-level events from many low-level events. If a number of pages
> +    are freed to the main allocator from the per-CPU lists, it recognises
> +    that as one per-CPU drain even though there is no specific tracepoint
> +    for that event
> +  o It can aggregate based on PID or individual process number
> +  o In the event memory is getting externally fragmented, it reports
> +    on whether the fragmentation event was severe or moderate.
> +  o When receiving an event about a PID, it can record who the parent was so
> +    that if large numbers of events are coming from very short-lived
> +    processes, the parent process responsible for creating all the helpers
> +    can be identified
> +
> +6. Lower-Level Analysis with PCL
> +================================
> +
> +There may also be a requirement to identify what functions within a program
> +were generating events within the kernel. To begin this sort of analysis, the
> +data must be recorded. At the time of writing, this required root
> +
> +  $ perf record -c 1 \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	./hackbench 10
> +  Time: 0.894
> +  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
> +
> +Note the use of '-c 1' to set the event period to sample. The default sample
> +period is quite high to minimise overhead but the information collected can be
> +very coarse as a result.
> +
> +This record outputted a file called perf.data which can be analysed using
> +perf report.
> +
> +  $ perf report
> +  # Samples: 30922
> +  #
> +  # Overhead    Command                     Shared Object
> +  # ........  .........  ................................
> +  #
> +      87.27%  hackbench  [vdso]                          
> +       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
> +       2.62%  hackbench  /lib/ld-2.9.so                  
> +       1.52%       perf  [vdso]                          
> +       1.22%  hackbench  ./hackbench                     
> +       0.48%  hackbench  [kernel]                        
> +       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
> +       0.01%       perf  /usr/bin/perf                   
> +       0.01%       perf  /lib/ld-2.9.so                  
> +       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +According to this, the vast majority of events occurred within the VDSO.
> +With simple binaries, this will often be the case, so let's take a
> +slightly different example. In the course of writing this, it was noticed
> +that X was generating an insane amount of page allocations, so let's look
> +at it:
> +
> +  $ perf record -c 1 -f \
> +		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-p `pidof X`
> +
> +This was interrupted after a few seconds and
> +
> +  $ perf report
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object
> +  # ........  .......  .......................................
> +  #
> +      51.95%     Xorg  [vdso]                                 
> +      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
> +       0.01%     Xorg  [kernel]                               
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +So, almost half of the events are occurring in a library. To get an idea
> +of which symbol:
> +
> +  $ perf report --sort comm,dso,symbol
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object  Symbol
> +  # ........  .......  .......................................  ......
> +  #
> +      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
> +      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
> +       0.01%     Xorg  [kernel]                                 [k] read_hpet
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
> +       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
> +
> +To see where within the function pixmanFillsse2 things are going wrong
> +
> +  $ perf annotate pixmanFillsse2
> +  [ ... ]
> +    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
> +         :      }
> +         :
> +         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
> +         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
> +         :        *__P = __B;
> +   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
> +    0.00 :         34ef5:       ff 
> +   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
> +    0.00 :         34efd:       ff 
> +   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
> +    0.00 :         34f05:       ff 
> +   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
> +    0.00 :         34f0d:       ff 
> +   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
> +   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
> +   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
> +   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
> +
> +At a glance, it looks like the time is being spent copying pixmaps to
> +the card.  Further investigation would be needed to determine why pixmaps
> +are being copied around so much but a starting point would be to take an
> +ancient build of libpixmap out of the library path where it was totally
                    ^^^^^^^^^

libpixman, right?

> +forgotten about from months ago!

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
@ 2009-08-12  1:04     ` Fernando Carrijo
  0 siblings, 0 replies; 42+ messages in thread
From: Fernando Carrijo @ 2009-08-12  1:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Ingo Molnar, Andrew Morton, riel, Peter Zijlstra,
	Li Ming Chun, LKML, linux-mm

On Mon, 2009-08-10 at 16:41 +0100, Mel Gorman wrote:
> The documentation for ftrace, events and tracepoints is pretty
> extensive. Similarly, the perf PCL tools help files --help are there and
> the code simple enough to figure out what much of the switches mean.
> However, pulling the discrete bits and pieces together and translating
> that into "how do I solve a problem" requires a fair amount of
> imagination.
> 
> This patch adds a simple document intended to get someone started on the
> different ways of using tracepoints to gather meaningful data.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
>  1 files changed, 327 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/trace/tracepoint-analysis.txt
> 
> diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
> new file mode 100644
> index 0000000..e7a7d3e
> --- /dev/null
> +++ b/Documentation/trace/tracepoint-analysis.txt
> @@ -0,0 +1,327 @@
> +		Notes on Analysing Behaviour Using Events and Tracepoints
> +
> +			Documentation written by Mel Gorman
> +		PCL information heavily based on email from Ingo Molnar
> +
> +1. Introduction
> +===============
> +
> +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
> +creating custom kernel modules to register probe functions using the event
> +tracing infrastructure.
> +
> +Simplistically, tracepoints represent important events that can
> +be taken in conjunction with other tracepoints to build a "Big Picture" of
> +what is going on within the system. There are a large number of methods for
> +gathering and interpreting these events. Lacking any current Best Practises,
> +this document describes some of the methods that can be used.
> +
> +This document assumes that debugfs is mounted on /sys/kernel/debug and that
> +the appropriate tracing options have been configured into the kernel. It is
> +assumed that the PCL tool tools/perf has been installed and is in your path.
> +
> +2. Listing Available Events
> +===========================
> +
> +2.1 Standard Utilities
> +----------------------
> +
> +All possible events are visible from /sys/kernel/debug/tracing/events. Simply
> +calling
> +
> +  $ find /sys/kernel/debug/tracing/events -type d
> +
> +will give a fair indication of the number of events available.
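> +
> +As an illustrative sketch, the fields logged by an individual event can
> +be inspected through its format file, for example
> +
> +  $ # assumes the kmem:mm_page_alloc tracepoint is compiled in
> +  $ cat /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/format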
> +
> +2.2 PCL
> +-------
> +
> +Discovery and enumeration of all counters and events, including tracepoints,
> +are available with the perf tool. Getting a list of available events is a
> +simple case of
> +
> +  $ perf list 2>&1 | grep Tracepoint
> +  ext4:ext4_free_inode                     [Tracepoint event]
> +  ext4:ext4_request_inode                  [Tracepoint event]
> +  ext4:ext4_allocate_inode                 [Tracepoint event]
> +  ext4:ext4_write_begin                    [Tracepoint event]
> +  ext4:ext4_ordered_write_end              [Tracepoint event]
> +  [ .... remaining output snipped .... ]
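> +
> +To narrow the listing down to the page allocator events used throughout
> +this document, one could simply grep the listing, for example
> +
> +  $ # assumes the kmem tracepoints are compiled into the kernel
> +  $ perf list 2>&1 | grep kmem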
> +
> +
> +2. Enabling Events
> +==================
> +
> +2.1 System-Wide Event Enabling
> +------------------------------
> +
> +See Documentation/trace/events.txt for a proper description on how events
> +can be enabled system-wide. A short example of enabling all events related
> +to page allocation would look something like
> +
> +  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
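> +
> +The events can later be disabled again with a similar loop, for example
> +
> +  $ # writing 0 to an "enable" file switches the event off again
> +  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 0 > $i; done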
> +
> +2.2 System-Wide Event Enabling with SystemTap
> +---------------------------------------------
> +
> +In SystemTap, tracepoints are accessible using the kernel.trace() function
> +call. The following is an example that reports every 5 seconds which
> +processes were allocating pages.
> +
> +  global page_allocs
> +
> +  probe kernel.trace("mm_page_alloc") {
> +  	page_allocs[execname()]++
> +  }
> +
> +  function print_count() {
> +  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
> +  	foreach (proc in page_allocs-)
> +  		printf("%-25d %s\n", page_allocs[proc], proc)
> +  	printf ("\n")
> +  	delete page_allocs
> +  }
> +
> +  probe timer.s(5) {
> +          print_count()
> +  }
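> +
> +Assuming the script above is saved as page-allocs.stp (the name here is
> +arbitrary), it could be run with something like
> +
> +  $ # requires systemtap and debugging information for the running kernel
> +  $ stap page-allocs.stp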
> +
> +2.3 System-Wide Event Enabling with PCL
> +---------------------------------------
> +
> +By specifying the -a switch and profiling a sleep of known duration,
> +system-wide events can be examined for that period of time.
> +
> + $ perf stat -a \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	sleep 10
> + Performance counter stats for 'sleep 10':
> +
> +           9630  kmem:mm_page_alloc      
> +           2143  kmem:mm_page_free_direct
> +           7424  kmem:mm_pagevec_free    
> +
> +   10.002577764  seconds time elapsed
> +
> +Similarly, one could execute a shell and exit it as desired to get a report
> +at that point.
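> +
> +As a sketch of that approach, monitoring until a shell exits would look
> +something like
> +
> + $ # system-wide counts are reported when the shell below is exited
> + $ perf stat -a \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	bash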
> +
> +2.4 Local Event Enabling
> +------------------------
> +
> +Documentation/trace/ftrace.txt describes how to enable events on a per-thread
> +basis using set_ftrace_pid.
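> +
> +For example, to restrict tracing to an already-running process, something
> +like the following could be used ($$, the current shell, is used purely
> +for illustration)
> +
> +  $ # only events from the given PID will be traced
> +  $ echo $$ > /sys/kernel/debug/tracing/set_ftrace_pid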
> +
> +2.5 Local Event Enablement with PCL
> +-----------------------------------
> +
> +Events can be activate and tracked for the duration of a process on a local
> +basis using PCL such as follows.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		 -e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.909
> +
> +    Performance counter stats for './hackbench 10':
> +
> +          17803  kmem:mm_page_alloc      
> +          12398  kmem:mm_page_free_direct
> +           4827  kmem:mm_pagevec_free    
> +
> +    0.973913387  seconds time elapsed
> +
> +3. Event Filtering
> +==================
> +
> +Documentation/trace/ftrace.txt covers in depth how to filter events in
> +ftrace. Obviously, using grep and awk on trace_pipe is an option, as is
> +any script reading trace_pipe.
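> +
> +As a crude sketch of that approach, isolating a single event could be as
> +simple as
> +
> +  $ # pass through only the mm_page_alloc events for inspection
> +  $ cat /sys/kernel/debug/tracing/trace_pipe | grep mm_page_alloc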
> +
> +4. Analysing Event Variances with PCL
> +=====================================
> +
> +Any workload can exhibit variances between runs and it can be important
> +to know what the standard deviation in. By and large, this is left to the
> +performance analyst to do it by hand. In the event that the discrete event
> +occurrences are useful to the performance analyst, then perf can be used.
> +
> +  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
> +			-e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.890
> +  Time: 0.895
> +  Time: 0.915
> +  Time: 1.001
> +  Time: 0.899
> +
> +   Performance counter stats for './hackbench 10' (5 runs):
> +
> +          16630  kmem:mm_page_alloc         ( +-   3.542% )
> +          11486  kmem:mm_page_free_direct   ( +-   4.771% )
> +           4730  kmem:mm_pagevec_free       ( +-   2.325% )
> +
> +    0.982653002  seconds time elapsed   ( +-   1.448% )
> +
> +In the event that some higher-level event is required that depends on an
> +aggregation of discrete events, a script would need to be developed.
> +
> +Using --repeat, it is also possible to view how events fluctuate over
> +time on a system-wide basis using -a and sleep.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-a --repeat 10 \
> +		sleep 1
> +  Performance counter stats for 'sleep 1' (10 runs):
> +
> +           1066  kmem:mm_page_alloc         ( +-  26.148% )
> +            182  kmem:mm_page_free_direct   ( +-   5.464% )
> +            890  kmem:mm_pagevec_free       ( +-  30.079% )
> +
> +    1.002251757  seconds time elapsed   ( +-   0.005% )
> +
> +5. Higher-Level Analysis with Helper Scripts
> +============================================
> +
> +When events are enabled, the events that are triggering can be read from
> +/sys/kernel/debug/tracing/trace_pipe in human-readable format, although binary
> +options exist as well. By post-processing the output, further information can
> +be gathered on-line as appropriate. Examples of post-processing might include
> +
> +  o Reading information from /proc for the PID that triggered the event
> +  o Deriving a higher-level event from a series of lower-level events.
> +  o Calculate latencies between two events
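> +
> +As a tiny sketch of such post-processing, the following counts events per
> +originating task straight from trace_pipe, assuming the default
> +human-readable layout where the first field is the comm-pid pair
> +
> +  $ # tally events by the comm-pid column of the default output format
> +  $ cat /sys/kernel/debug/tracing/trace_pipe | \
> +	awk '{ count[$1]++ } END { for (task in count) print count[task], task }'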
> +
> +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
> +script that can read trace_pipe from STDIN or a copy of a trace. When used
> +on-line, it can be interrupted once to generate a report without existing
> +and twice to exit.
> +
> +Simplistically, the script just reads STDIN and counts up events but it
> +also can do more such as
> +
> +  o Derive high-level events from many low-level events. If a number of pages
> +    are freed to the main allocator from the per-CPU lists, it recognises
> +    that as one per-CPU drain even though there is no specific tracepoint
> +    for that event
> +  o It can aggregate based on PID or individual process number
> +  o In the event memory is getting externally fragmented, it reports
> +    on whether the fragmentation event was severe or moderate.
> +  o When receiving an event about a PID, it can record who the parent was so
> +    that if large numbers of events are coming from very short-lived
> +    processes, the parent process responsible for creating all the helpers
> +    can be identified
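> +
> +For example, to monitor on-line with the example script, something like
> +the following (with the script path adjusted as necessary) would work
> +
> +  $ # the script reads events from STDIN until interrupted
> +  $ cat /sys/kernel/debug/tracing/trace_pipe | \
> +	perl Documentation/trace/postprocess/trace-pagealloc-postprocess.pl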
> +
> +6. Lower-Level Analysis with PCL
> +================================
> +
> +There may also be a requirement to identify what functions within a program
> +were generating events within the kernel. To begin this sort of analysis, the
> +data must be recorded. At the time of writing, this required root
> +
> +  $ perf record -c 1 \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	./hackbench 10
> +  Time: 0.894
> +  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
> +
> +Note the use of '-c 1' to set the event period to sample. The default sample
> +period is quite high to minimise overhead but the information collected can be
> +very coarse as a result.
> +
> +This record outputted a file called perf.data which can be analysed using
> +perf report.
> +
> +  $ perf report
> +  # Samples: 30922
> +  #
> +  # Overhead    Command                     Shared Object
> +  # ........  .........  ................................
> +  #
> +      87.27%  hackbench  [vdso]                          
> +       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
> +       2.62%  hackbench  /lib/ld-2.9.so                  
> +       1.52%       perf  [vdso]                          
> +       1.22%  hackbench  ./hackbench                     
> +       0.48%  hackbench  [kernel]                        
> +       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
> +       0.01%       perf  /usr/bin/perf                   
> +       0.01%       perf  /lib/ld-2.9.so                  
> +       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +According to this, the vast majority of events occurred within the VDSO.
> +With simple binaries, this will often be the case so let's take a slightly
> +different example. In the course of writing this, it was noticed that X
> +was generating an insane number of page allocations so let's look at it
> +
> +  $ perf record -c 1 -f \
> +		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-p `pidof X`
> +
> +This was interrupted after a few seconds and
> +
> +  $ perf report
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object
> +  # ........  .......  .......................................
> +  #
> +      51.95%     Xorg  [vdso]                                 
> +      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
> +       0.01%     Xorg  [kernel]                               
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +So, almost half of the events are occurring in a library. To get an idea
> +which symbol is responsible, the report can be sorted by symbol.
> +
> +  $ perf report --sort comm,dso,symbol
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object  Symbol
> +  # ........  .......  .......................................  ......
> +  #
> +      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
> +      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
> +       0.01%     Xorg  [kernel]                                 [k] read_hpet
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
> +       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
> +
> +To see where within the function pixmanFillsse2 things are going wrong
> +
> +  $ perf annotate pixmanFillsse2
> +  [ ... ]
> +    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
> +         :      }
> +         :
> +         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
> +         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
> +         :        *__P = __B;
> +   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
> +    0.00 :         34ef5:       ff 
> +   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
> +    0.00 :         34efd:       ff 
> +   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
> +    0.00 :         34f05:       ff 
> +   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
> +    0.00 :         34f0d:       ff 
> +   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
> +   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
> +   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
> +   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
> +
> +At a glance, it looks like the time is being spent copying pixmaps to
> +the card.  Further investigation would be needed to determine why pixmaps
> +are being copied around so much but a starting point would be to take an
> +ancient build of libpixmap out of the library path where it was totally
                    ^^^^^^^^^

libpixman, right?

> +forgotten about from months ago!

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-11 17:31     ` Jonathan Corbet
@ 2009-08-12 12:06       ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-12 12:06 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Larry Woodman, Ingo Molnar, Andrew Morton, riel, Peter Zijlstra,
	Li Ming Chun, LKML, linux-mm

On Tue, Aug 11, 2009 at 11:31:39AM -0600, Jonathan Corbet wrote:
> On Mon, 10 Aug 2009 16:41:54 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch adds a simple document intended to get someone started on the
> > different ways of using tracepoints to gather meaningful data.
> 
> It's really good stuff.  Here's a few nits which maybe help to make it
> imperceptibly better...
> 

It makes things much better, thanks. I'll send a fix to the patch as a
follow-up.

> > <SNIP>
> > +This document assumes that debugfs is mounted on /sys/kernel/debug and that
> > +the appropriate tracing options have been configured into the kernel. It is
> > +assumed that the PCL tool tools/perf has been installed and is in your path.
> 
> I would bet it is worth mentioning what those are.  In particular, "perf"
> failed for me until I turned on CONFIG_EVENT_PROFILE - which isn't under
> the tracing options and which doesn't appear to have a help text.
> 
> It might also be worth mentioning that "perf" is in tools/perf - it might
> be a bit before distributors ship it universally.
> 

Done.

> > <SNIP>
> > +2. Enabling Events
> > +==================
> > +
> > +2.1 System-Wide Event Enabling
> > +------------------------------
> > +
> > +See Documentation/trace/events.txt for a proper description on how events
> > +can be enabled system-wide. A short example of enabling all events related
> > +to page allocation would look something like
> > +
> > +  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
> 
> This appears not to be necessary if you're using "perf."  If you're doing
> something else (just catting out the events, say),

The example was for people using cat and echo rather than perf. I'll
expand the example. I talk about using perf to enable system-wide
monitoring later.

> you also need to set .../tracing/tracing_enabled.
> 

I added that step.


> > <SNIP>
> > +2.5 Local Event Enablement with PCL
> > +-----------------------------------
> > +
> > +Events can be activate and tracked for the duration of a process on a local
> 
> "activated"
> 

Fixed.

> > <SNIP>
> > +4. Analysing Event Variances with PCL
> > +=====================================
> > +
> > +Any workload can exhibit variances between runs and it can be important
> > +to know what the standard deviation in. By and large, this is left to the
> 
> s/deviation in/deviation is/
> 
> > +performance analyst to do it by hand. In the event that the discrete event
> 
> s/to do it/to do/
> 

Both fixed.

> > <SNIP>
> > +5. Higher-Level Analysis with Helper Scripts
> > +============================================
> > +
> > +When events are enabled the events that are triggering can be read from
> > +/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
> > +options exist as well. By post-processing the output, further information can
> > +be gathered on-line as appropriate. Examples of post-processing might include
> > +
> > +  o Reading information from /proc for the PID that triggered the event
> > +  o Deriving a higher-level event from a series of lower-level events.
> > +  o Calculate latencies between two events
> 
> "Calculating" would match the tense of the first two items.
> 

True, fixed.

> > +
> > +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
> 
> I don't have that file in current git...?
> 

You found it later.

> > +script that can read trace_pipe from STDIN or a copy of a trace. When used
> > +on-line, it can be interrupted once to generate a report without existing
> 
> That's why I don't have it!  It can generate reports without
> existing! :)
> 

It's the latest in "Pull It Out Of Your Ass Benchmarking" technology.

> > +and twice to exit.
> > +
> > +Simplistically, the script just reads STDIN and counts up events but it
> > +also can do more such as
> > +
> > +  o Derive high-level events from many low-level events. If a number of pages
> > +    are freed to the main allocator from the per-CPU lists, it recognises
> > +    that as one per-CPU drain even though there is no specific tracepoint
> > +    for that event
> > +  o It can aggregate based on PID or individual process number
> 
> A PID differs from an "individual process number" how?
> 

It's not, that should have been process name, not number ala

  o It can aggregate based on PID or process name

> > <SNIP>
> > +6. Lower-Level Analysis with PCL
> > +================================
> > +
> > +There may also be a requirement to identify what functions with a program
> > +were generating events within the kernel. To begin this sort of analysis, the
> > +data must be recorded. At the time of writing, this required root
> 
> It works happily without root on my -rc5++ system; I don't think I've
> tweaked anything to make that happen.
> 

/me slaps self

It was a permissions problem in the directory I was working from. I
dropped the line. I was so used to requiring root for oprofile that I
didn't think about it properly.

> > +
> > +  $ perf record -c 1 \
> > +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> > +	-e kmem:mm_pagevec_free \
> > +	./hackbench 10
> > +  Time: 0.894
> > +  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
> > +
> > +Note the use of '-c 1' to set the event period to sample. The default sample
> > +period is quite high to minimise overhead but the information collected can be
> > +very coarse as a result.
> 
> What are the units for -c?
> 

The documentation is somewhat fuzzy. My understanding is as follows and if
I'm wrong, someone will point it out to me.

====
Note the use of '-c 1' to set the sample period. The default sample period
is quite high to minimise overhead but the information collected can be very
coarse as a result.

The sample period is in units of "events occurred". For a hardware counter,
this would usually mean the PMU is programmed to "raise an interrupt after
this many events occured" and the event is recorded on interrupt receipt. For
software-events such as tracepoints, one event will be recorded every "sample
period" number of times the tracepoint triggered.  In this case, -c 1 means
"record a sample every time this tracepoint is triggered".
====
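
As an illustration (the value 100 here is invented purely for
illustration), '-c 100' would record only every 100th firing of the
tracepoint, trading detail for lower overhead

  $ # hypothetical larger sample period; one sample per 100 events
  $ perf record -c 100 -e kmem:mm_page_alloc ./hackbench 10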

> > <SNIP>
> > +So, almost half of the events are occuring in a library. To get an idea which
> > +symbol.
> > +
> > +  $ perf report --sort comm,dso,symbol
> > +  # Samples: 27666
> > +  #
> > +  # Overhead  Command                            Shared Object  Symbol
> > +  # ........  .......  .......................................  ......
> > +  #
> > +      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
> > +      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
> > +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
> > +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
> > +       0.01%     Xorg  [kernel]                                 [k] read_hpet
> > +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
> > +       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
> 
> Worth mentioning that [k] marks kernel-space symbols?
> 

Another column has "[kernel]" in it but it does not hurt to point it
out.

> > +
> > +To see where within the function pixmanFillsse2 things are going wrong
> > +
> > +  $ perf annotate pixmanFillsse2
> > +  [ ... ]
> > +    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
> > +         :      }
> > +         :
> > +         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
> > +         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
> > +         :        *__P = __B;
> > +   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
> > +    0.00 :         34ef5:       ff 
> > +   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
> > +    0.00 :         34efd:       ff 
> > +   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
> > +    0.00 :         34f05:       ff 
> > +   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
> > +    0.00 :         34f0d:       ff 
> > +   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
> > +   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
> > +   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
> > +   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
> > +
> > +At a glance, it looks like the time is being spent copying pixmaps to
> > +the card.  Further investigation would be needed to determine why pixmaps
> > +are being copied around so much but a starting point would be to take an
> > +ancient build of libpixmap out of the library path where it was totally
> > +forgotten about from months ago!
> 

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-12  1:04     ` Fernando Carrijo
@ 2009-08-12 13:12       ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-12 13:12 UTC (permalink / raw)
  To: Fernando Carrijo
  Cc: Larry Woodman, Ingo Molnar, Andrew Morton, riel, Peter Zijlstra,
	Li Ming Chun, LKML, linux-mm

On Tue, Aug 11, 2009 at 10:04:19PM -0300, Fernando Carrijo wrote:
> > <SNIP>
> > +At a glance, it looks like the time is being spent copying pixmaps to
> > +the card.  Further investigation would be needed to determine why pixmaps
> > +are being copied around so much but a starting point would be to take an
> > +ancient build of libpixmap out of the library path where it was totally
>                     ^^^^^^^^^
> 
> libpixman, right?
> 

Yep. Thanks

> > +forgotten about from months ago!
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



* [PATCH] tracing, documentation: Clarifications and corrections to tracepoint-analysis.txt
  2009-08-10 15:41   ` Mel Gorman
@ 2009-08-12 13:15     ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-12 13:15 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, Jonathan Corbet,
	Fernando Carrijo, LKML, linux-mm

This patch makes a number of corrections and clarifications as pointed
out by Jonathan Corbet.

  o Listed some required kernel config options
  o Spelled out that perf has to be installed from tools/perf
  o Mention that tracing_enabled must be set
  o Fix numerous minor typos
  o Fix tense issues
  o Expand on what -c means

Fernando Carrijo also spotted that a library was misnamed libpixmap
instead of libpixman. This patch should be considered a fix to
tracing-documentation-add-a-document-describing-how-to-do-some-performance-analysis-with-tracepoints.patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 Documentation/trace/tracepoint-analysis.txt |   49 ++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
index e7a7d3e..282e8d9 100644
--- a/Documentation/trace/tracepoint-analysis.txt
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -17,8 +17,18 @@ gathering and interpreting these events. Lacking any current Best Practises,
 this document describes some of the methods that can be used.
 
 This document assumes that debugfs is mounted on /sys/kernel/debug and that
-the appropriate tracing options have been configured into the kernel. It is
-assumed that the PCL tool tools/perf has been installed and is in your path.
+at least the following tracing options have been configured into the kernel
+
+	CONFIG_EVENT_PROFILE=y
+	CONFIG_FTRACE=y
+	CONFIG_DYNAMIC_FTRACE=y
+	CONFIG_TRACEPOINTS=y
+
+It is also assumed that the PCL tool available from tools/perf has been
+installed and is in your path. This can be trivially installed as
+
+  $ make prefix=/usr/local
+  $ make prefix=/usr/local install
 
 2. Listing Available Events
 ===========================
@@ -61,6 +71,11 @@ to page allocation would look something like
 
   $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
 
+To start monitoring the events then, one would do something like
+
+  $ echo 1 > tracing_enabled
+  $ cat trace_pipe | some-processing-script
+
 2.2 System-Wide Event Enabling with SystemTap
 ---------------------------------------------
 
@@ -116,7 +131,7 @@ basis using set_ftrace_pid.
 2.5 Local Event Enablement with PCL
 -----------------------------------
 
-Events can be activate and tracked for the duration of a process on a local
+Events can be activated and tracked for the duration of a process on a local
 basis using PCL such as follows.
 
   $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
@@ -142,8 +157,8 @@ as any script reading trace_pipe.
 =====================================
 
 Any workload can exhibit variances between runs and it can be important
-to know what the standard deviation in. By and large, this is left to the
-performance analyst to do it by hand. In the event that the discrete event
+to know what the standard deviation is. By and large, this is left to the
+performance analyst to do by hand. In the event that the discrete event
 occurrences are useful to the performance analyst, then perf can be used.
 
   $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
@@ -190,11 +205,11 @@ be gathered on-line as appropriate. Examples of post-processing might include
 
   o Reading information from /proc for the PID that triggered the event
   o Deriving a higher-level event from a series of lower-level events.
-  o Calculate latencies between two events
+  o Calculating latencies between two events
 
 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
 script that can read trace_pipe from STDIN or a copy of a trace. When used
-on-line, it can be interrupted once to generate a report without existing
+on-line, it can be interrupted once to generate a report without exiting
 and twice to exit.
 
 Simplistically, the script just reads STDIN and counts up events but it
@@ -204,7 +219,7 @@ also can do more such as
     are freed to the main allocator from the per-CPU lists, it recognises
     that as one per-CPU drain even though there is no specific tracepoint
     for that event
-  o It can aggregate based on PID or individual process number
+  o It can aggregate based on PID or process name
   o In the event memory is getting externally fragmented, it reports
     on whether the fragmentation event was severe or moderate.
   o When receiving an event about a PID, it can record who the parent was so
@@ -217,7 +232,7 @@ also can do more such as
 
 There may also be a requirement to identify what functions with a program
 were generating events within the kernel. To begin this sort of analysis, the
-data must be recorded. At the time of writing, this required root
+data must be recorded.
 
   $ perf record -c 1 \
 	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
@@ -226,10 +241,17 @@ data must be recorded. At the time of writing, this required root
   Time: 0.894
   [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
 
-Note the use of '-c 1' to set the event period to sample. The default sample
-period is quite high to minimise overhead but the information collected can be
+Note the use of '-c 1' to set the sample period. The default sample period
+is quite high to minimise overhead but the information collected can be
 very coarse as a result.
 
+The sample period is in units of "events occurred". For a hardware counter,
+this would usually mean the PMU is programmed to "raise an interrupt after
+this many events occured" and the event is recorded on interrupt receipt. For
+software-events such as tracepoints, one event will be recorded every
+"sample period" number of times the tracepoint triggered.  In this case,
+-c 1 means "record a sample every time this tracepoint is triggered".
+
 This record outputted a file called perf.data which can be analysed using
 perf report.
 
@@ -297,7 +319,8 @@ symbol.
        0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
        0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
 
-To see where within the function pixmanFillsse2 things are going wrong
+Note here that kernel symbols are marked [k]. To see where within the
+function pixmanFillsse2 things are going wrong
 
   $ perf annotate pixmanFillsse2
   [ ... ]
@@ -323,5 +346,5 @@ To see where within the function pixmanFillsse2 things are going wrong
 At a glance, it looks like the time is being spent copying pixmaps to
 the card.  Further investigation would be needed to determine why pixmaps
 are being copied around so much but a starting point would be to take an
-ancient build of libpixmap out of the library path where it was totally
+ancient build of libpixman out of the library path where it was totally
 forgotten about from months ago!



* [PATCH] tracing, documentation: Clarifications and corrections to tracepoint-analysis.txt (resend)
  2009-08-12 13:15     ` Mel Gorman
@ 2009-08-24 14:32       ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-24 14:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, Jonathan Corbet,
	Fernando Carrijo, LKML, linux-mm

I think this patch might have got lost in the noise so am resending just
in case.

This patch makes a number of corrections and clarifications as pointed
out by Jonathan Corbet.

  o Listed some required kernel config options
  o Spelled out that perf has to be installed from tools/perf
  o Mention that tracing_enabled must be set
  o Fix numerous minor typos
  o Fix tense issues
  o Expand on what -c means

Fernando Carrijo also spotted that a library was misnamed libpixmap
instead of libpixman. This patch should be considered a fix to
tracing-documentation-add-a-document-describing-how-to-do-some-performance-analysis-with-tracepoints.patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 Documentation/trace/tracepoint-analysis.txt |   49 ++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
index e7a7d3e..282e8d9 100644
--- a/Documentation/trace/tracepoint-analysis.txt
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -17,8 +17,18 @@ gathering and interpreting these events. Lacking any current Best Practises,
 this document describes some of the methods that can be used.
 
 This document assumes that debugfs is mounted on /sys/kernel/debug and that
-the appropriate tracing options have been configured into the kernel. It is
-assumed that the PCL tool tools/perf has been installed and is in your path.
+at least the following tracing options have been configured into the kernel
+
+	CONFIG_EVENT_PROFILE=y
+	CONFIG_FTRACE=y
+	CONFIG_DYNAMIC_FTRACE=y
+	CONFIG_TRACEPOINTS=y
+
+It is also assumed that the PCL tool available from tools/perf has been
+installed and is in your path. This can be trivially installed as
+
+  $ make prefix=/usr/local
+  $ make prefix=/usr/local install
 
 2. Listing Available Events
 ===========================
@@ -61,6 +71,11 @@ to page allocation would look something like
 
   $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
 
+To start monitoring the events then, one would do something like
+
+  $ echo 1 > tracing_enabled
+  $ cat trace_pipe | some-monitor-script
+
 2.2 System-Wide Event Enabling with SystemTap
 ---------------------------------------------
 
@@ -116,7 +131,7 @@ basis using set_ftrace_pid.
 2.5 Local Event Enablement with PCL
 -----------------------------------
 
-Events can be activate and tracked for the duration of a process on a local
+Events can be activated and tracked for the duration of a process on a local
 basis using PCL such as follows.
 
   $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
@@ -142,8 +157,8 @@ as any script reading trace_pipe.
 =====================================
 
 Any workload can exhibit variances between runs and it can be important
-to know what the standard deviation in. By and large, this is left to the
-performance analyst to do it by hand. In the event that the discrete event
+to know what the standard deviation is. By and large, this is left to the
+performance analyst to do by hand. In the event that the discrete event
 occurrences are useful to the performance analyst, then perf can be used.
 
   $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
@@ -190,11 +205,11 @@ be gathered on-line as appropriate. Examples of post-processing might include
 
   o Reading information from /proc for the PID that triggered the event
   o Deriving a higher-level event from a series of lower-level events.
-  o Calculate latencies between two events
+  o Calculating latencies between two events
 
 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
 script that can read trace_pipe from STDIN or a copy of a trace. When used
-on-line, it can be interrupted once to generate a report without existing
+on-line, it can be interrupted once to generate a report without exiting
 and twice to exit.
 
 Simplistically, the script just reads STDIN and counts up events but it
@@ -204,7 +219,7 @@ also can do more such as
     are freed to the main allocator from the per-CPU lists, it recognises
     that as one per-CPU drain even though there is no specific tracepoint
     for that event
-  o It can aggregate based on PID or individual process number
+  o It can aggregate based on PID or process name
   o In the event memory is getting externally fragmented, it reports
     on whether the fragmentation event was severe or moderate.
   o When receiving an event about a PID, it can record who the parent was so
@@ -217,7 +232,7 @@ also can do more such as
 
 There may also be a requirement to identify what functions with a program
 were generating events within the kernel. To begin this sort of analysis, the
-data must be recorded. At the time of writing, this required root
+data must be recorded.
 
   $ perf record -c 1 \
 	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
@@ -226,10 +241,17 @@ data must be recorded. At the time of writing, this required root
   Time: 0.894
   [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
 
-Note the use of '-c 1' to set the event period to sample. The default sample
-period is quite high to minimise overhead but the information collected can be
+Note the use of '-c 1' to set the sample period. The default sample period
+is quite high to minimise overhead but the information collected can be
 very coarse as a result.
 
+The sample period is in units of "events occurred". For a hardware counter,
+this would usually mean the PMU is programmed to "raise an interrupt after
+this many events occurred" and the event is recorded on interrupt receipt. For
+software-events such as tracepoints, one event will be recorded every
+"sample period" number of times the tracepoint triggered.  In this case,
+-c 1 means "record a sample every time this tracepoint is triggered".
+
 This record outputted a file called perf.data which can be analysed using
 perf report.
 
@@ -297,7 +319,8 @@ symbol.
        0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
        0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
 
-To see where within the function pixmanFillsse2 things are going wrong
+Note here that kernel symbols are marked [k]. To see where within the
+function pixmanFillsse2 things are going wrong
 
   $ perf annotate pixmanFillsse2
   [ ... ]
@@ -323,5 +346,5 @@ To see where within the function pixmanFillsse2 things are going wrong
 At a glance, it looks like the time is being spent copying pixmaps to
 the card.  Further investigation would be needed to determine why pixmaps
 are being copied around so much but a starting point would be to take an
-ancient build of libpixmap out of the library path where it was totally
+ancient build of libpixman out of the library path where it was totally
 forgotten about from months ago!
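
Taken together, the additions above amount to a monitoring flow along the
following lines (a sketch only; it assumes debugfs is mounted on
/sys/kernel/debug and that the post-processing script from this series is
to hand):

  $ cd /sys/kernel/debug/tracing
  $ for i in `find events -name "enable" | grep mm_`; do echo 1 > $i; done
  $ echo 1 > tracing_enabled
  $ cat trace_pipe | perl trace-pagealloc-postprocess.pl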

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH] tracing, documentation: Clarifications and corrections to tracepoint-analysis.txt (resend)
@ 2009-08-24 14:32       ` Mel Gorman
  0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-24 14:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, Jonathan Corbet,
	Fernando Carrijo, LKML, linux-mm

I think this patch might have got lost in the noise so am resending just
in case.

This patch makes a number of corrections and clarifications as pointed
out by Jonathan Corbet.

  o Listed some required kernel config options
  o Spelled out that perf has to be installed from tools/perf
  o Mention that tracing_enabled must be set
  o Fix numerous minor typos
  o Fix tense issues
  o Expand on what -c means

Fernando Carrijo also spotted that a library was misnamed libpixmap
instead of libpixman. This patch should be considered a fix to
tracing-documentation-add-a-document-describing-how-to-do-some-performance-analysis-with-tracepoints.patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 Documentation/trace/tracepoint-analysis.txt |   49 ++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
index e7a7d3e..282e8d9 100644
--- a/Documentation/trace/tracepoint-analysis.txt
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -17,8 +17,18 @@ gathering and interpreting these events. Lacking any current Best Practises,
 this document describes some of the methods that can be used.
 
 This document assumes that debugfs is mounted on /sys/kernel/debug and that
-the appropriate tracing options have been configured into the kernel. It is
-assumed that the PCL tool tools/perf has been installed and is in your path.
+at least the following tracing options have been configured into the kernel
+
+	CONFIG_EVENT_PROFILE=y
+	CONFIG_FTRACE=y
+	CONFIG_DYNAMIC_FTRACE=y
+	CONFIG_TRACEPOINTS=y
+
+It is also assumed that the PCL tool available from tools/perf has been
+installed and is in your path. This can be trivially installed as
+
+  $ make prefix=/usr/local
+  $ make prefix=/usr/local install
 
 2. Listing Available Events
 ===========================
@@ -61,6 +71,11 @@ to page allocation would look something like
 
   $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
 
+To start monitoring the events then, one would do something like
+
+  $ echo 1 > tracing_enabled
+  $ cat trace_pipe | some-monitor-script
+
 2.2 System-Wide Event Enabling with SystemTap
 ---------------------------------------------
 
@@ -116,7 +131,7 @@ basis using set_ftrace_pid.
 2.5 Local Event Enablement with PCL
 -----------------------------------
 
-Events can be activate and tracked for the duration of a process on a local
+Events can be activated and tracked for the duration of a process on a local
 basis using PCL such as follows.
 
   $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
@@ -142,8 +157,8 @@ as any script reading trace_pipe.
 =====================================
 
 Any workload can exhibit variances between runs and it can be important
-to know what the standard deviation in. By and large, this is left to the
-performance analyst to do it by hand. In the event that the discrete event
+to know what the standard deviation is. By and large, this is left to the
+performance analyst to do by hand. In the event that the discrete event
 occurrences are useful to the performance analyst, then perf can be used.
 
   $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
@@ -190,11 +205,11 @@ be gathered on-line as appropriate. Examples of post-processing might include
 
   o Reading information from /proc for the PID that triggered the event
   o Deriving a higher-level event from a series of lower-level events.
-  o Calculate latencies between two events
+  o Calculating latencies between two events
 
 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
 script that can read trace_pipe from STDIN or a copy of a trace. When used
-on-line, it can be interrupted once to generate a report without existing
+on-line, it can be interrupted once to generate a report without exiting
 and twice to exit.
 
 Simplistically, the script just reads STDIN and counts up events but it
@@ -204,7 +219,7 @@ also can do more such as
     are freed to the main allocator from the per-CPU lists, it recognises
     that as one per-CPU drain even though there is no specific tracepoint
     for that event
-  o It can aggregate based on PID or individual process number
+  o It can aggregate based on PID or process name
   o In the event memory is getting externally fragmented, it reports
     on whether the fragmentation event was severe or moderate.
   o When receiving an event about a PID, it can record who the parent was so
@@ -217,7 +232,7 @@ also can do more such as
 
 There may also be a requirement to identify what functions with a program
 were generating events within the kernel. To begin this sort of analysis, the
-data must be recorded. At the time of writing, this required root
+data must be recorded.
 
   $ perf record -c 1 \
 	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
@@ -226,10 +241,17 @@ data must be recorded. At the time of writing, this required root
   Time: 0.894
   [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
 
-Note the use of '-c 1' to set the event period to sample. The default sample
-period is quite high to minimise overhead but the information collected can be
+Note the use of '-c 1' to set the sample period. The default sample period
+is quite high to minimise overhead but the information collected can be
 very coarse as a result.
 
+The sample period is in units of "events occurred". For a hardware counter,
+this would usually mean the PMU is programmed to "raise an interrupt after
+this many events occurred" and the event is recorded on interrupt receipt. For
+software-events such as tracepoints, one event will be recorded every
+"sample period" number of times the tracepoint triggered.  In this case,
+-c 1 means "record a sample every time this tracepoint is triggered".
+
 This record outputted a file called perf.data which can be analysed using
 perf report.
 
@@ -297,7 +319,8 @@ symbol.
        0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
        0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
 
-To see where within the function pixmanFillsse2 things are going wrong
+Note here that kernel symbols are marked [k]. To see where within the
+function pixmanFillsse2 things are going wrong
 
   $ perf annotate pixmanFillsse2
   [ ... ]
@@ -323,5 +346,5 @@ To see where within the function pixmanFillsse2 things are going wrong
 At a glance, it looks like the time is being spent copying pixmaps to
 the card.  Further investigation would be needed to determine why pixmaps
 are being copied around so much but a starting point would be to take an
-ancient build of libpixmap out of the library path where it was totally
+ancient build of libpixman out of the library path where it was totally
 forgotten about from months ago!

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
@ 2009-08-07 17:40   ` Mel Gorman
  0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

The documentation for ftrace, events and tracepoints is pretty
extensive. Similarly, the perf PCL tools' --help output is there and
the code is simple enough to figure out what many of the switches mean.
However, pulling the discrete bits and pieces together and translating
that into "how do I solve a problem" requires a fair amount of
imagination.

This patch adds a simple document intended to get someone started on the
different ways of using tracepoints to gather meaningful data.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
 1 files changed, 327 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/tracepoint-analysis.txt

diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 0000000..e7a7d3e
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
+		Notes on Analysing Behaviour Using Events and Tracepoints
+
+			Documentation written by Mel Gorman
+		PCL information heavily based on email from Ingo Molnar
+
+1. Introduction
+===============
+
+Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
+creating custom kernel modules to register probe functions using the event
+tracing infrastructure.
+
+Simplistically, tracepoints will represent an important event that can
+be taken in conjunction with other tracepoints to build a "Big Picture" of
+what is going on within the system. There are a large number of methods for
+gathering and interpreting these events. Lacking any current Best Practises,
+this document describes some of the methods that can be used.
+
+This document assumes that debugfs is mounted on /sys/kernel/debug and that
+the appropriate tracing options have been configured into the kernel. It is
+assumed that the PCL tool tools/perf has been installed and is in your path.
+
+2. Listing Available Events
+===========================
+
+2.1 Standard Utilities
+----------------------
+
+All possible events are visible from /sys/kernel/debug/tracing/events. Simply
+calling
+
+  $ find /sys/kernel/debug/tracing/events -type d
+
+will give a fair indication of the number of events available.
+
+2.2 PCL
+-------
+
+Discovery and enumeration of all counters and events, including tracepoints
+are available with the perf tool. Getting a list of available events is a
+simple case of
+
+  $ perf list 2>&1 | grep Tracepoint
+  ext4:ext4_free_inode                     [Tracepoint event]
+  ext4:ext4_request_inode                  [Tracepoint event]
+  ext4:ext4_allocate_inode                 [Tracepoint event]
+  ext4:ext4_write_begin                    [Tracepoint event]
+  ext4:ext4_ordered_write_end              [Tracepoint event]
+  [ .... remaining output snipped .... ]
+
+
+2. Enabling Events
+==================
+
+2.1 System-Wide Event Enabling
+------------------------------
+
+See Documentation/trace/events.txt for a proper description on how events
+can be enabled system-wide. A short example of enabling all events related
+to page allocation would look something like
+
+  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
+
+2.2 System-Wide Event Enabling with SystemTap
+---------------------------------------------
+
+In SystemTap, tracepoints are accessible using the kernel.trace() function
+call. The following is an example that reports every 5 seconds what processes
+were allocating the pages.
+
+  global page_allocs
+
+  probe kernel.trace("mm_page_alloc") {
+  	page_allocs[execname()]++
+  }
+
+  function print_count() {
+  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
+  	foreach (proc in page_allocs-)
+  		printf("%-25d %s\n", page_allocs[proc], proc)
+  	printf ("\n")
+  	delete page_allocs
+  }
+
+  probe timer.s(5) {
+          print_count()
+  }
+
+2.3 System-Wide Event Enabling with PCL
+---------------------------------------
+
+By specifying the -a switch and analysing sleep, the system-wide events
+for a duration of time can be examined.
+
+ $ perf stat -a \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	sleep 10
+ Performance counter stats for 'sleep 10':
+
+           9630  kmem:mm_page_alloc      
+           2143  kmem:mm_page_free_direct
+           7424  kmem:mm_pagevec_free    
+
+   10.002577764  seconds time elapsed
+
+Similarly, one could execute a shell and exit it as desired to get a report
+at that point.
+
+2.4 Local Event Enabling
+------------------------
+
+Documentation/trace/ftrace.txt describes how to enable events on a per-thread
+basis using set_ftrace_pid.
+
+2.5 Local Event Enablement with PCL
+-----------------------------------
+
+Events can be activate and tracked for the duration of a process on a local
+basis using PCL such as follows.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		 -e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.909
+
+    Performance counter stats for './hackbench 10':
+
+          17803  kmem:mm_page_alloc      
+          12398  kmem:mm_page_free_direct
+           4827  kmem:mm_pagevec_free    
+
+    0.973913387  seconds time elapsed
+
+3. Event Filtering
+==================
+
+Documentation/trace/ftrace.txt covers in-depth how to filter events in
+ftrace.  Obviously using grep and awk of trace_pipe is an option as well
+as any script reading trace_pipe.
+
+4. Analysing Event Variances with PCL
+=====================================
+
+Any workload can exhibit variances between runs and it can be important
+to know what the standard deviation in. By and large, this is left to the
+performance analyst to do it by hand. In the event that the discrete event
+occurrences are useful to the performance analyst, then perf can be used.
+
+  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
+			-e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.890
+  Time: 0.895
+  Time: 0.915
+  Time: 1.001
+  Time: 0.899
+
+   Performance counter stats for './hackbench 10' (5 runs):
+
+          16630  kmem:mm_page_alloc         ( +-   3.542% )
+          11486  kmem:mm_page_free_direct   ( +-   4.771% )
+           4730  kmem:mm_pagevec_free       ( +-   2.325% )
+
+    0.982653002  seconds time elapsed   ( +-   1.448% )
+
+In the event that some higher-level event is required that depends on some
+aggregation of discrete events, then a script would need to be developed.
+
+Using --repeat, it is also possible to view how events are fluctuating over
+time on a system wide basis using -a and sleep.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-a --repeat 10 \
+		sleep 1
+  Performance counter stats for 'sleep 1' (10 runs):
+
+           1066  kmem:mm_page_alloc         ( +-  26.148% )
+            182  kmem:mm_page_free_direct   ( +-   5.464% )
+            890  kmem:mm_pagevec_free       ( +-  30.079% )
+
+    1.002251757  seconds time elapsed   ( +-   0.005% )
+
+5. Higher-Level Analysis with Helper Scripts
+============================================
+
+When events are enabled the events that are triggering can be read from
+/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
+options exist as well. By post-processing the output, further information can
+be gathered on-line as appropriate. Examples of post-processing might include
+
+  o Reading information from /proc for the PID that triggered the event
+  o Deriving a higher-level event from a series of lower-level events.
+  o Calculate latencies between two events
+
+Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
+script that can read trace_pipe from STDIN or a copy of a trace. When used
+on-line, it can be interrupted once to generate a report without existing
+and twice to exit.
+
+Simplistically, the script just reads STDIN and counts up events but it
+also can do more such as
+
+  o Derive high-level events from many low-level events. If a number of pages
+    are freed to the main allocator from the per-CPU lists, it recognises
+    that as one per-CPU drain even though there is no specific tracepoint
+    for that event
+  o It can aggregate based on PID or individual process number
+  o In the event memory is getting externally fragmented, it reports
+    on whether the fragmentation event was severe or moderate.
+  o When receiving an event about a PID, it can record who the parent was so
+    that if large numbers of events are coming from very short-lived
+    processes, the parent process responsible for creating all the helpers
+    can be identified
+
+6. Lower-Level Analysis with PCL
+================================
+
+There may also be a requirement to identify what functions with a program
+were generating events within the kernel. To begin this sort of analysis, the
+data must be recorded. At the time of writing, this required root
+
+  $ perf record -c 1 \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	./hackbench 10
+  Time: 0.894
+  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
+
+Note the use of '-c 1' to set the event period to sample. The default sample
+period is quite high to minimise overhead but the information collected can be
+very coarse as a result.
+
+This record outputted a file called perf.data which can be analysed using
+perf report.
+
+  $ perf report
+  # Samples: 30922
+  #
+  # Overhead    Command                     Shared Object
+  # ........  .........  ................................
+  #
+      87.27%  hackbench  [vdso]                          
+       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
+       2.62%  hackbench  /lib/ld-2.9.so                  
+       1.52%       perf  [vdso]                          
+       1.22%  hackbench  ./hackbench                     
+       0.48%  hackbench  [kernel]                        
+       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
+       0.01%       perf  /usr/bin/perf                   
+       0.01%       perf  /lib/ld-2.9.so                  
+       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+According to this, the vast majority of events occurred within the
+VDSO. With simple binaries, this will often be the case so let's take a
+slightly different example. In the course of writing this, it was noticed
+that X was generating an insane amount of page allocations so let's look
+at it
+
+  $ perf record -c 1 -f \
+		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-p `pidof X`
+
+This was interrupted after a few seconds and
+
+  $ perf report
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object
+  # ........  .......  .......................................
+  #
+      51.95%     Xorg  [vdso]                                 
+      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
+       0.01%     Xorg  [kernel]                               
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+So, almost half of the events are occurring in a library. To get an idea which
+symbol.
+
+  $ perf report --sort comm,dso,symbol
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object  Symbol
+  # ........  .......  .......................................  ......
+  #
+      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
+      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
+       0.01%     Xorg  [kernel]                                 [k] read_hpet
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
+       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
+
+To see where within the function pixmanFillsse2 things are going wrong
+
+  $ perf annotate pixmanFillsse2
+  [ ... ]
+    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
+         :      }
+         :
+         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
+         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
+         :        *__P = __B;
+   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
+    0.00 :         34ef5:       ff 
+   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
+    0.00 :         34efd:       ff 
+   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
+    0.00 :         34f05:       ff 
+   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
+    0.00 :         34f0d:       ff 
+   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
+   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
+   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
+   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
+
+At a glance, it looks like the time is being spent copying pixmaps to
+the card.  Further investigation would be needed to determine why pixmaps
+are being copied around so much but a starting point would be to take an
+ancient build of libpixmap out of the library path where it was totally
+forgotten about from months ago!
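
One thing section 3 of the document defers entirely to ftrace.txt is a
concrete filtering example. As a sketch, a filter can be attached to a
single event through its filter file; this assumes the mm_page_alloc event
exposes an 'order' field, as the events added by this series do:

  $ echo 'order > 0' > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/filter
  $ cat /sys/kernel/debug/tracing/trace_pipe    # only order > 0 allocations now appear
  $ echo 0 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/filter

The last command clears the filter again.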
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-07  8:07     ` Ingo Molnar
@ 2009-08-07 14:25       ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-07 14:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Pekka Enberg, eduard, Larry Woodman,
	Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm

On Fri, Aug 07, 2009 at 10:07:07AM +0200, Ingo Molnar wrote:
> > <SNIP>
> 
> This is a very nice and comprehensive description!
> 

Thanks, you did write a nice chunk of it yourself though :)

> I'm wondering: would you mind if we integrated the analysis ideas 
> from your perl script into 'perf trace'? Those kinds of high-level 
> counts and summaries are useful not just for MM events.
> 

Of course not. Part of the motivation for doing the perl script was as a
POC for the gathering of high-level events. In the event such sample
scripts work out, it'd justify the greater effort to integrate them into
perf.

> Another thing that was raised before is a 'perf mem' special-purpose 
> tool to help the analysis of all things memory related: leak 
> detection, high level stats, etc. That could have some turn-key 
> modes of analysis for the page allocator too.
> 

Again, my vague notion was to prototype such things in perl and, when they
work out, incorporate them into perf if suitable. As high-level gathering of
information is just a state machine, it's conceivable that some of the code
could be auto-generated.
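
As an illustration of how simple a first cut can be, the per-process count
that the SystemTap example in the document produces can be roughly
reproduced with awk over a snapshot of the trace. This is only a sketch; it
assumes the comm-pid pair is the first field of each line and that the
process name itself contains no '-':

  $ cp /sys/kernel/debug/tracing/trace /tmp/trace.snapshot
  $ awk '
	/mm_page_alloc/ {
		sub(/-[0-9]+$/, "", $1)		# first field is comm-pid
		page_allocs[$1]++
	}
	END {
		for (proc in page_allocs)
			printf "%-25d %s\n", page_allocs[proc], proc
	}' /tmp/trace.snapshot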

> perf will do a proper format-string evaluation of 
> /debug/tracing/events/*/format as well, thus any tweaks to the 
> tracepoints get automatically adapted to.
> 

Which would be a major plus.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-06 16:07   ` Mel Gorman
@ 2009-08-07  8:07     ` Ingo Molnar
  -1 siblings, 0 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-08-07  8:07 UTC (permalink / raw)
  To: Mel Gorman, Frédéric Weisbecker, Pekka Enberg, eduard
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm


* Mel Gorman <mel@csn.ul.ie> wrote:

> The documentation for ftrace, events and tracepoints is pretty 
> extensive. Similarly, the perf PCL tools help files --help are 
> there and the code simple enough to figure out what much of the 
> switches mean. However, pulling the discrete bits and pieces 
> together and translating that into "how do I solve a problem" 
> requires a fair amount of imagination.
> 
> This patch adds a simple document intended to get someone started 
> on the different ways of using tracepoints to gather meaningful 
> data.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
>  1 files changed, 327 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/trace/tracepoint-analysis.txt
> 
> diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
> new file mode 100644
> index 0000000..e7a7d3e
> --- /dev/null
> +++ b/Documentation/trace/tracepoint-analysis.txt
> @@ -0,0 +1,327 @@
> +		Notes on Analysing Behaviour Using Events and Tracepoints
> +
> +			Documentation written by Mel Gorman
> +		PCL information heavily based on email from Ingo Molnar
> +
> +1. Introduction
> +===============
> +
> +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
> +creating custom kernel modules to register probe functions using the event
> +tracing infrastructure.
> +
> +Simplistically, tracepoints will represent an important event that when can
> +be taken in conjunction with other tracepoints to build a "Big Picture" of
> +what is going on within the system. There are a large number of methods for
> +gathering and interpreting these events. Lacking any current Best Practises,
> +this document describes some of the methods that can be used.
> +
> +This document assumes that debugfs is mounted on /sys/kernel/debug and that
> +the appropriate tracing options have been configured into the kernel. It is
> +assumed that the PCL tool tools/perf has been installed and is in your path.
> +
> +2. Listing Available Events
> +===========================
> +
> +2.1 Standard Utilities
> +----------------------
> +
> +All possible events are visible from /sys/kernel/debug/tracing/events. Simply
> +calling
> +
> +  $ find /sys/kernel/debug/tracing/events -type d
> +
> +will give a fair indication of the number of events available.
> +
> +2.2 PCL
> +-------
> +
> +Discovery and enumeration of all counters and events, including tracepoints
> +are available with the perf tool. Getting a list of available events is a
> +simple case of
> +
> +  $ perf list 2>&1 | grep Tracepoint
> +  ext4:ext4_free_inode                     [Tracepoint event]
> +  ext4:ext4_request_inode                  [Tracepoint event]
> +  ext4:ext4_allocate_inode                 [Tracepoint event]
> +  ext4:ext4_write_begin                    [Tracepoint event]
> +  ext4:ext4_ordered_write_end              [Tracepoint event]
> +  [ .... remaining output snipped .... ]
> +
> +
> +2. Enabling Events
> +==================
> +
> +2.1 System-Wide Event Enabling
> +------------------------------
> +
> +See Documentation/trace/events.txt for a proper description on how events
> +can be enabled system-wide. A short example of enabling all events related
> +to page allocation would look something like
> +
> +  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
> +
> +2.2 System-Wide Event Enabling with SystemTap
> +---------------------------------------------
> +
> +In SystemTap, tracepoints are accessible using the kernel.trace() function
> +call. The following is an example that reports every 5 seconds what processes
> +were allocating the pages.
> +
> +  global page_allocs
> +
> +  probe kernel.trace("mm_page_alloc") {
> +  	page_allocs[execname()]++
> +  }
> +
> +  function print_count() {
> +  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
> +  	foreach (proc in page_allocs-)
> +  		printf("%-25d %s\n", page_allocs[proc], proc)
> +  	printf ("\n")
> +  	delete page_allocs
> +  }
> +
> +  probe timer.s(5) {
> +          print_count()
> +  }
> +
> +2.3 System-Wide Event Enabling with PCL
> +---------------------------------------
> +
> +By specifying the -a switch and analysing sleep, the system-wide events
> +for a duration of time can be examined.
> +
> + $ perf stat -a \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	sleep 10
> + Performance counter stats for 'sleep 10':
> +
> +           9630  kmem:mm_page_alloc      
> +           2143  kmem:mm_page_free_direct
> +           7424  kmem:mm_pagevec_free    
> +
> +   10.002577764  seconds time elapsed
> +
> +Similarly, one could execute a shell and exit it as desired to get a report
> +at that point.
> +
> +2.4 Local Event Enabling
> +------------------------
> +
> +Documentation/trace/ftrace.txt describes how to enable events on a per-thread
> +basis using set_ftrace_pid.
> +
> +2.5 Local Event Enablement with PCL
> +-----------------------------------
> +
> +Events can be activate and tracked for the duration of a process on a local
> +basis using PCL such as follows.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		 -e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.909
> +
> +    Performance counter stats for './hackbench 10':
> +
> +          17803  kmem:mm_page_alloc      
> +          12398  kmem:mm_page_free_direct
> +           4827  kmem:mm_pagevec_free    
> +
> +    0.973913387  seconds time elapsed
> +
> +3. Event Filtering
> +==================
> +
> +Documentation/trace/ftrace.txt covers in-depth how to filter events in
> +ftrace.  Obviously using grep and awk of trace_pipe is an option as well
> +as any script reading trace_pipe.
> +
> +4. Analysing Event Variances with PCL
> +=====================================
> +
> +Any workload can exhibit variances between runs and it can be important
> +to know what the standard deviation in. By and large, this is left to the
> +performance analyst to do it by hand. In the event that the discrete event
> +occurrences are useful to the performance analyst, then perf can be used.
> +
> +  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
> +			-e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.890
> +  Time: 0.895
> +  Time: 0.915
> +  Time: 1.001
> +  Time: 0.899
> +
> +   Performance counter stats for './hackbench 10' (5 runs):
> +
> +          16630  kmem:mm_page_alloc         ( +-   3.542% )
> +          11486  kmem:mm_page_free_direct   ( +-   4.771% )
> +           4730  kmem:mm_pagevec_free       ( +-   2.325% )
> +
> +    0.982653002  seconds time elapsed   ( +-   1.448% )
> +
> +In the event that some higher-level event is required that depends on some
> +aggregation of discrete events, then a script would need to be developed.
> +
> +Using --repeat, it is also possible to view how events are fluctuating over
> +time on a system wide basis using -a and sleep.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-a --repeat 10 \
> +		sleep 1
> +  Performance counter stats for 'sleep 1' (10 runs):
> +
> +           1066  kmem:mm_page_alloc         ( +-  26.148% )
> +            182  kmem:mm_page_free_direct   ( +-   5.464% )
> +            890  kmem:mm_pagevec_free       ( +-  30.079% )
> +
> +    1.002251757  seconds time elapsed   ( +-   0.005% )
> +
> +5. Higher-Level Analysis with Helper Scripts
> +============================================
> +
> +When events are enabled the events that are triggering can be read from
> +/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
> +options exist as well. By post-processing the output, further information can
> +be gathered on-line as appropriate. Examples of post-processing might include
> +
> +  o Reading information from /proc for the PID that triggered the event
> +  o Deriving a higher-level event from a series of lower-level events.
> +  o Calculate latencies between two events
> +
> +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
> +script that can read trace_pipe from STDIN or a copy of a trace. When used
> +on-line, it can be interrupted once to generate a report without existing
> +and twice to exit.
> +
> +Simplistically, the script just reads STDIN and counts up events but it
> +also can do more such as
> +
> +  o Derive high-level events from many low-level events. If a number of pages
> +    are freed to the main allocator from the per-CPU lists, it recognises
> +    that as one per-CPU drain even though there is no specific tracepoint
> +    for that event
> +  o It can aggregate based on PID or individual process number
> +  o In the event memory is getting externally fragmented, it reports
> +    on whether the fragmentation event was severe or moderate.
> +  o When receiving an event about a PID, it can record who the parent was so
> +    that if large numbers of events are coming from very short-lived
> +    processes, the parent process responsible for creating all the helpers
> +    can be identified
> +
> +6. Lower-Level Analysis with PCL
> +================================
> +
> +There may also be a requirement to identify what functions with a program
> +were generating events within the kernel. To begin this sort of analysis, the
> +data must be recorded. At the time of writing, this required root
> +
> +  $ perf record -c 1 \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	./hackbench 10
> +  Time: 0.894
> +  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
> +
> +Note the use of '-c 1' to set the event period to sample. The default sample
> +period is quite high to minimise overhead but the information collected can be
> +very coarse as a result.
> +
> +This record outputted a file called perf.data which can be analysed using
> +perf report.
> +
> +  $ perf report
> +  # Samples: 30922
> +  #
> +  # Overhead    Command                     Shared Object
> +  # ........  .........  ................................
> +  #
> +      87.27%  hackbench  [vdso]                          
> +       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
> +       2.62%  hackbench  /lib/ld-2.9.so                  
> +       1.52%       perf  [vdso]                          
> +       1.22%  hackbench  ./hackbench                     
> +       0.48%  hackbench  [kernel]                        
> +       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
> +       0.01%       perf  /usr/bin/perf                   
> +       0.01%       perf  /lib/ld-2.9.so                  
> +       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +According to this, the vast majority of events occured triggered on events
> +within the VDSO. With simple binaries, this will often be the case so lets
> +take a slightly different example. In the course of writing this, it was
> +noticed that X was generating an insane amount of page allocations so lets look
> +at it
> +
> +  $ perf record -c 1 -f \
> +		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-p `pidof X`
> +
> +This was interrupted after a few seconds and
> +
> +  $ perf report
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object
> +  # ........  .......  .......................................
> +  #
> +      51.95%     Xorg  [vdso]                                 
> +      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
> +       0.01%     Xorg  [kernel]                               
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +So, almost half of the events are occuring in a library. To get an idea which
> +symbol.
> +
> +  $ perf report --sort comm,dso,symbol
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object  Symbol
> +  # ........  .......  .......................................  ......
> +  #
> +      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
> +      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
> +       0.01%     Xorg  [kernel]                                 [k] read_hpet
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
> +       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
> +
> +To see where within the function pixmanFillsse2 things are going wrong
> +
> +  $ perf annotate pixmanFillsse2
> +  [ ... ]
> +    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
> +         :      }
> +         :
> +         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> +         :      _mm_store_si128 (__m128i *__P, __m128i __B)
> +         :      {
> +         :        *__P = __B;
> +   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
> +    0.00 :         34ef5:       ff 
> +   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
> +    0.00 :         34efd:       ff 
> +   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
> +    0.00 :         34f05:       ff 
> +   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
> +    0.00 :         34f0d:       ff 
> +   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
> +   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
> +   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
> +   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
> +
> +At a glance, it looks like the time is being spent copying pixmaps to
> +the card.  Further investigation would be needed to determine why pixmaps
> +are being copied around so much, but a starting point would be to take the
> +ancient build of libpixman out of the library path where it was totally
> +forgotten about months ago!

This is a very nice and comprehensive description!

I'm wondering: would you mind if we integrated the analysis ideas 
from your perl script into 'perf trace'? Those kinds of high-level 
counts and summaries are useful not just for MM events.

Another thing that was raised before is a 'perf mem' special-purpose 
tool to help the analysis of all things memory related: leak 
detection, high level stats, etc. That could have some turn-key 
modes of analysis for the page allocator too.

perf will do a proper format-string evaluation of 
/debug/tracing/events/*/format as well, thus any tweaks to the 
tracepoints are automatically adapted to.

	Ingo


* [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-06 16:07 [PATCH 0/4] Add some trace events for the page allocator v4 Mel Gorman
@ 2009-08-06 16:07   ` Mel Gorman
  0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

The documentation for ftrace, events and tracepoints is pretty
extensive. Similarly, the perf PCL tools help files --help are there and
the code simple enough to figure out what much of the switches mean.
However, pulling the discrete bits and pieces together and translating
that into "how do I solve a problem" requires a fair amount of
imagination.

This patch adds a simple document intended to get someone started on the
different ways of using tracepoints to gather meaningful data.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
 1 files changed, 327 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/tracepoint-analysis.txt

diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 0000000..e7a7d3e
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
+		Notes on Analysing Behaviour Using Events and Tracepoints
+
+			Documentation written by Mel Gorman
+		PCL information heavily based on email from Ingo Molnar
+
+1. Introduction
+===============
+
+Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
+creating custom kernel modules to register probe functions using the event
+tracing infrastructure.
+
+Simplistically, tracepoints represent important events that can be taken
+in conjunction with other tracepoints to build a "Big Picture" of
+what is going on within the system. There are a large number of methods for
+gathering and interpreting these events. Lacking any current Best Practices,
+this document describes some of the methods that can be used.
+
+This document assumes that debugfs is mounted on /sys/kernel/debug and that
+the appropriate tracing options have been configured into the kernel. It is
+assumed that the PCL tool tools/perf has been installed and is in your path.
+
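+If debugfs is not already mounted, it can be mounted in the usual manner; a
+minimal sketch, assuming the path above:
+
+  $ mount -t debugfs none /sys/kernel/debug
+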
+2. Listing Available Events
+===========================
+
+2.1 Standard Utilities
+----------------------
+
+All possible events are visible from /sys/kernel/debug/tracing/events. Simply
+calling
+
+  $ find /sys/kernel/debug/tracing/events -type d
+
+will give a fair indication of the number of events available.
+
+2.2 PCL
+-------
+
+Discovery and enumeration of all counters and events, including tracepoints,
+are available with the perf tool. Getting a list of available events is a
+simple case of
+
+  $ perf list 2>&1 | grep Tracepoint
+  ext4:ext4_free_inode                     [Tracepoint event]
+  ext4:ext4_request_inode                  [Tracepoint event]
+  ext4:ext4_allocate_inode                 [Tracepoint event]
+  ext4:ext4_write_begin                    [Tracepoint event]
+  ext4:ext4_ordered_write_end              [Tracepoint event]
+  [ .... remaining output snipped .... ]
+
+
+3. Enabling Events
+==================
+
+3.1 System-Wide Event Enabling
+------------------------------
+
+See Documentation/trace/events.txt for a proper description of how events
+can be enabled system-wide. A short example of enabling all events related
+to page allocation would look something like
+
+  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
+
+3.2 System-Wide Event Enabling with SystemTap
+---------------------------------------------
+
+In SystemTap, tracepoints are accessible using the kernel.trace() function
+call. The following is an example that reports every 5 seconds which
+processes were allocating pages.
+
+  global page_allocs
+
+  probe kernel.trace("mm_page_alloc") {
+  	page_allocs[execname()]++
+  }
+
+  function print_count() {
+  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
+  	foreach (proc in page_allocs-)
+  		printf("%-25d %s\n", page_allocs[proc], proc)
+  	printf ("\n")
+  	delete page_allocs
+  }
+
+  probe timer.s(5) {
+          print_count()
+  }
+
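+Assuming the script above is saved as page-allocs.stp (an illustrative
+name), it could be run with
+
+  $ stap page-allocs.stp
+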
+3.3 System-Wide Event Enabling with PCL
+---------------------------------------
+
+By specifying the -a switch and analysing a sleep command, the system-wide
+events for a duration of time can be examined.
+
+ $ perf stat -a \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	sleep 10
+ Performance counter stats for 'sleep 10':
+
+           9630  kmem:mm_page_alloc      
+           2143  kmem:mm_page_free_direct
+           7424  kmem:mm_pagevec_free    
+
+   10.002577764  seconds time elapsed
+
+Similarly, one could execute a shell and exit it as desired to get a report
+at that point.
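+
+For example, the following sketch profiles an interactive shell; the counts
+are reported when the shell exits (the choice of shell is arbitrary):
+
+ $ perf stat -a \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	bash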
+
+3.4 Local Event Enabling
+------------------------
+
+Documentation/trace/ftrace.txt describes how to enable events on a per-thread
+basis using set_ftrace_pid.
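+
+A minimal sketch of the idea, assuming the paths from section 1 (the PID
+value is illustrative):
+
+  $ echo 1234 > /sys/kernel/debug/tracing/set_ftrace_pid
+  $ echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
+  $ cat /sys/kernel/debug/tracing/trace_pipe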
+
+3.5 Local Event Enablement with PCL
+-----------------------------------
+
+Events can be activated and tracked for the duration of a process on a local
+basis using PCL as follows.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		 -e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.909
+
+    Performance counter stats for './hackbench 10':
+
+          17803  kmem:mm_page_alloc      
+          12398  kmem:mm_page_free_direct
+           4827  kmem:mm_pagevec_free    
+
+    0.973913387  seconds time elapsed
+
+4. Event Filtering
+==================
+
+Documentation/trace/ftrace.txt covers in depth how to filter events in
+ftrace. Obviously, using grep and awk on trace_pipe is an option, as is
+any script reading trace_pipe.
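+
+As a sketch, assuming the running kernel supports the event filter files
+described in Documentation/trace/events.txt, high-order allocations could
+be isolated with
+
+  $ echo 'order > 0' > \
+	/sys/kernel/debug/tracing/events/kmem/mm_page_alloc/filter
+
+or, with plain text tools,
+
+  $ cat /sys/kernel/debug/tracing/trace_pipe | grep mm_page_alloc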
+
+5. Analysing Event Variances with PCL
+=====================================
+
+Any workload can exhibit variances between runs and it can be important
+to know what the standard deviation is. By and large, this is left to the
+performance analyst to do by hand. In the event that the discrete event
+occurrences are useful to the performance analyst, then perf can be used.
+
+  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+			-e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.890
+  Time: 0.895
+  Time: 0.915
+  Time: 1.001
+  Time: 0.899
+
+   Performance counter stats for './hackbench 10' (5 runs):
+
+          16630  kmem:mm_page_alloc         ( +-   3.542% )
+          11486  kmem:mm_page_free_direct   ( +-   4.771% )
+           4730  kmem:mm_pagevec_free       ( +-   2.325% )
+
+    0.982653002  seconds time elapsed   ( +-   1.448% )
+
+In the event that some higher-level event is required that depends on some
+aggregation of discrete events, then a script would need to be developed.
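+
+A minimal sketch of such aggregation, counting allocation events per process
+name from a saved copy of the trace (the field layout is assumed to be the
+default trace_pipe format, where the first field is comm-pid):
+
+  $ cat /sys/kernel/debug/tracing/trace_pipe > trace.log
+  [ interrupt with ctrl-c once enough data has been gathered ]
+  $ awk '/mm_page_alloc/ { sub(/-[0-9]+$/, "", $1); count[$1]++ }
+	END { for (p in count) print count[p], p }' trace.log | sort -rn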
+
+Using --repeat, it is also possible to view how events are fluctuating over
+time on a system-wide basis using -a and sleep.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-a --repeat 10 \
+		sleep 1
+  Performance counter stats for 'sleep 1' (10 runs):
+
+           1066  kmem:mm_page_alloc         ( +-  26.148% )
+            182  kmem:mm_page_free_direct   ( +-   5.464% )
+            890  kmem:mm_pagevec_free       ( +-  30.079% )
+
+    1.002251757  seconds time elapsed   ( +-   0.005% )
+
+6. Higher-Level Analysis with Helper Scripts
+============================================
+
+When events are enabled, the events that are triggering can be read from
+/sys/kernel/debug/tracing/trace_pipe in human-readable format, although binary
+options exist as well. By post-processing the output, further information can
+be gathered on-line as appropriate. Examples of post-processing might include
+
+  o Reading information from /proc for the PID that triggered the event
+  o Deriving a higher-level event from a series of lower-level events
+  o Calculating latencies between two events
+
+Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
+script that can read trace_pipe from STDIN or a copy of a trace. When used
+on-line, it can be interrupted once to generate a report without exiting
+and twice to exit.
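+
+For example, to use the script on-line (the script path is relative to the
+kernel source tree and the events of interest are assumed to be already
+enabled as in section 3):
+
+  $ cat /sys/kernel/debug/tracing/trace_pipe | \
+	perl Documentation/trace/postprocess/trace-pagealloc-postprocess.pl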
+
+Simplistically, the script just reads STDIN and counts up events, but it
+can also do more, such as
+
+  o Derive high-level events from many low-level events. If a number of pages
+    are freed to the main allocator from the per-CPU lists, it recognises
+    that as one per-CPU drain even though there is no specific tracepoint
+    for that event
+  o It can aggregate based on PID or individual process number
+  o In the event memory is getting externally fragmented, it reports
+    on whether the fragmentation event was severe or moderate.
+  o When receiving an event about a PID, it can record who the parent was so
+    that if large numbers of events are coming from very short-lived
+    processes, the parent process responsible for creating all the helpers
+    can be identified (a sketch of this idea follows the list)
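+
+A minimal sketch of that last idea, looking up the parent of a PID seen in
+an event via /proc (the PID is illustrative, and a short-lived process may
+have exited before the lookup happens):
+
+  $ pid=1234
+  $ ppid=`awk '/^PPid:/ { print $2 }' /proc/$pid/status`
+  $ awk '/^Name:/ { print $2 }' /proc/$ppid/status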
+
+7. Lower-Level Analysis with PCL
+================================
+
+There may also be a requirement to identify what functions within a program
+were generating events within the kernel. To begin this sort of analysis, the
+data must be recorded. At the time of writing, this required root
+
+  $ perf record -c 1 \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	./hackbench 10
+  Time: 0.894
+  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
+
+Note the use of '-c 1' to set the sample period to 1 so that every event is
+recorded. The default sample period is quite high to minimise overhead but the
+information collected can be very coarse as a result.
+
+This recorded data to a file called perf.data, which can be analysed using
+perf report.
+
+  $ perf report
+  # Samples: 30922
+  #
+  # Overhead    Command                     Shared Object
+  # ........  .........  ................................
+  #
+      87.27%  hackbench  [vdso]                          
+       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
+       2.62%  hackbench  /lib/ld-2.9.so                  
+       1.52%       perf  [vdso]                          
+       1.22%  hackbench  ./hackbench                     
+       0.48%  hackbench  [kernel]                        
+       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
+       0.01%       perf  /usr/bin/perf                   
+       0.01%       perf  /lib/ld-2.9.so                  
+       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+According to this, the vast majority of events were triggered within the
+VDSO. With simple binaries, this will often be the case so let's take a
+slightly different example. In the course of writing this, it was noticed
+that X was generating an insane number of page allocations so let's look
+at it
+
+  $ perf record -c 1 -f \
+		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-p `pidof X`
+
+This was interrupted after a few seconds and the gathered data was analysed.
+
+  $ perf report
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object
+  # ........  .......  .......................................
+  #
+      51.95%     Xorg  [vdso]                                 
+      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
+       0.01%     Xorg  [kernel]                               
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+So, almost half of the events are occurring in a library. To get an idea of
+which symbol is responsible, sort by symbol as well.
+
+  $ perf report --sort comm,dso,symbol
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object  Symbol
+  # ........  .......  .......................................  ......
+  #
+      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
+      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
+       0.01%     Xorg  [kernel]                                 [k] read_hpet
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
+       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
+
+To see where within the function pixmanFillsse2 things are going wrong
+
+  $ perf annotate pixmanFillsse2
+  [ ... ]
+    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
+         :      }
+         :
+         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+         :      _mm_store_si128 (__m128i *__P, __m128i __B)
+         :      {
+         :        *__P = __B;
+   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
+    0.00 :         34ef5:       ff 
+   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
+    0.00 :         34efd:       ff 
+   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
+    0.00 :         34f05:       ff 
+   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
+    0.00 :         34f0d:       ff 
+   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
+   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
+   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
+   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
+
+At a glance, it looks like the time is being spent copying pixmaps to
+the card.  Further investigation would be needed to determine why pixmaps
+are being copied around so much, but a starting point would be to take the
+ancient build of libpixman out of the library path where it was totally
+forgotten about months ago!
-- 
1.6.3.3



