* [PATCH 0/4] Add some trace events for the page allocator v4
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

This is V4 of a patchset to add some trace points for the page allocator. The
largest changes in this version are performance improvements and an expansion
of the post-processing script, as well as some documentation. There were minor
changes elsewhere that are described in the changelog.

Changelog since V3
  o Drop call_site information from trace events
  o Use struct page * instead of void * in trace events
  o Add CPU information to the per-cpu tracepoints information
  o Improved performance of offline-process script so it can run online
  o Add support for interrupting processing script to dump what it has
  o Add support for stripping pids, getting additional information from
    proc and adding information on the parent process
  o Improve layout of output of post-processing script for use with sort
  o Add documentation on performance analysis using tracepoints
  o Add documentation on the kmem tracepoints in particular

Changelog since V2
  o Added Acked-by's from Rik
  o Only call trace_mm_page_free_direct when page count reaches zero
  o Rebase to 2.6.31-rc5

Changelog since V1
  o Fix minor formatting error for the __rmqueue event
  o Add event for __pagevec_free
  o Bring naming more in line with Larry Woodman's tracing patch
  o Add an example post-processing script for the trace events

The following six patches add some trace events for the page allocator
under the heading of kmem, along with a post-processing script and
documentation.

	Patch 1 adds events for plain old allocation and freeing of pages
	Patch 2 gives information useful for analysing fragmentation avoidance
	Patch 3 tracks pages going to and from the buddy lists as an indirect
		indication of zone lock hotness
	Patch 4 adds a post-processing script that aggregates the events to
		give a higher-level view
	Patch 5 adds documentation on analysis using tracepoints
	Patch 6 adds documentation on the kmem tracepoints in particular

The first set of events can be used as an indicator of whether the workload
was heavily dependent on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for a page allocation may be __get_free_pages() instead
of a more useful call site. Instead of passing down a return address as slab
debugging does, the user should enable the stacktrace and sym-addr trace
options to get a proper stack trace.
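
As a rough illustration (assuming debugfs is mounted at /sys/kernel/debug;
the exact sequence is only a sketch), the events and those options can be
enabled with something like;

  cd /sys/kernel/debug/tracing
  echo 1 > events/kmem/mm_page_alloc/enable
  echo 1 > events/kmem/mm_page_free_direct/enable
  echo 1 > events/kmem/mm_pagevec_free/enable
  echo stacktrace > trace_options   # record a call chain with each event
  echo sym-addr > trace_options     # include symbol addresses in the output
  cat trace_pipe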

The second patch is mainly of use to users of hugepages, and particularly
dynamic hugepage pool resizing, as it could be used to tune min_free_kbytes
to a level where fragmentation is rarely a problem. My main concern is
that maybe I'm trying to jam too much into the TP_printk that could be
extrapolated after the fact by someone familiar with the implementation. I
couldn't decide whether it was best to hold the administrator's hand even
if it costs more to figure it out.
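
As a crude sketch of that kind of tuning (illustrative only; the doubling
heuristic is arbitrary), fallbacks can be counted after a representative
workload and min_free_kbytes raised if fragmenting fallbacks are frequent;

  cd /sys/kernel/debug/tracing
  echo 1 > events/kmem/mm_page_alloc_extfrag/enable
  # run the workload, then see how many fallbacks were fragmenting
  grep mm_page_alloc_extfrag trace | grep -c 'fragmenting=1'
  # if the count is high, try increasing min_free_kbytes, e.g. doubling it
  echo $((2 * $(cat /proc/sys/vm/min_free_kbytes))) > /proc/sys/vm/min_free_kbytes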

The third patch is trickier to draw conclusions from but high activity on
those events could explain why there were a large number of cache misses
on a page-allocator-intensive workload. The coalescing and splitting of
buddies involves a lot of writing of page metadata and cache line bounces
not to mention the acquisition of an interrupt-safe lock necessary to enter
this path.

The fourth patch parses the trace buffer to draw a higher-level picture of
what is going on, broken down on a per-process basis.
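
For reference, the script reads the text trace format on standard input, so
from a kernel source tree it can be run directly against trace_pipe;

  Documentation/trace/postprocess/trace-pagealloc-postprocess.pl \
	< /sys/kernel/debug/tracing/trace_pipe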

The last two patches add documentation.

 Documentation/trace/events-kmem.txt                |  107 ++++++
 .../postprocess/trace-pagealloc-postprocess.pl     |  356 ++++++++++++++++++++
 Documentation/trace/tracepoint-analysis.txt        |  327 ++++++++++++++++++
 include/trace/events/kmem.h                        |  177 ++++++++++
 mm/page_alloc.c                                    |   16 +-
 5 files changed, 982 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
 create mode 100644 Documentation/trace/tracepoint-analysis.txt

Mel Gorman (6):
  tracing, page-allocator: Add trace events for page allocation and
    page freeing
  tracing, page-allocator: Add trace events for anti-fragmentation
    falling back to other migratetypes
  tracing, page-allocator: Add trace event for page traffic related to
    the buddy lists
  tracing, page-allocator: Add a postprocessing script for
    page-allocator-related ftrace events
  tracing, documentation: Add a document describing how to do some
    performance analysis with tracepoints
  tracing, documentation: Add a document on the kmem tracepoints

 Documentation/trace/events-kmem.txt                |  107 ++++++
 .../postprocess/trace-pagealloc-postprocess.pl     |  356 ++++++++++++++++++++
 Documentation/trace/tracepoint-analysis.txt        |  327 ++++++++++++++++++
 include/trace/events/kmem.h                        |  177 ++++++++++
 mm/page_alloc.c                                    |   15 +-
 5 files changed, 981 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
 create mode 100644 Documentation/trace/tracepoint-analysis.txt



* [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

This patch adds trace events for the allocation and freeing of pages,
including the freeing of pagevecs.  Using the events, it will be known which
struct pages and PFNs are being allocated and freed and, in many cases, what
the call site was.

The page alloc tracepoints can be used as an indicator of whether the workload
was heavily dependent on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for a page allocation may be __get_free_pages() instead
of a more useful call site. Instead of passing down a return address as slab
debugging does, the user should enable the stacktrace and sym-addr trace
options to get a proper stack trace.

The pagevec free tracepoint has a different use case. It can be used to get
an idea of how many pages are being dumped off the LRU and whether it is
kswapd doing the work or a process doing direct reclaim.
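
For example (a rough sketch only), once the event is enabled, the process
column of the trace output separates the kswapd and direct reclaim cases;

  cd /sys/kernel/debug/tracing
  echo 1 > events/kmem/mm_pagevec_free/enable
  # after running the workload of interest
  grep mm_pagevec_free trace | grep -c kswapd    # pagevec frees from kswapd
  grep mm_pagevec_free trace | grep -vc kswapd   # pagevec frees from other processes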

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   77 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    7 +++-
 2 files changed, 83 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 1493c54..8ab0f98 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -225,6 +225,83 @@ TRACE_EVENT(kmem_cache_free,
 
 	TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
 );
+
+TRACE_EVENT(mm_page_free_direct,
+
+	TP_PROTO(struct page *page, unsigned int order),
+
+	TP_ARGS(page, order),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d",
+			__entry->page,
+			page_to_pfn(__entry->page),
+			__entry->order)
+);
+
+TRACE_EVENT(mm_pagevec_free,
+
+	TP_PROTO(struct page *page, int order, int cold),
+
+	TP_ARGS(page, order, cold),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	int,		order		)
+		__field(	int,		cold		)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->cold		= cold;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d cold=%d",
+			__entry->page,
+			page_to_pfn(__entry->page),
+			__entry->order,
+			__entry->cold)
+);
+
+TRACE_EVENT(mm_page_alloc,
+
+	TP_PROTO(struct page *page, unsigned int order,
+			gfp_t gfp_flags, int migratetype),
+
+	TP_ARGS(page, order, gfp_flags, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+		__field(	gfp_t,		gfp_flags	)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->gfp_flags	= gfp_flags;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		__entry->migratetype,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d052abb..f3f6039 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1062,6 +1062,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 
 void free_hot_page(struct page *page)
 {
+	trace_mm_page_free_direct(page, 0);
 	free_hot_cold_page(page, 0);
 }
 	
@@ -1905,6 +1906,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
 
+	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -1945,13 +1947,16 @@ void __pagevec_free(struct pagevec *pvec)
 {
 	int i = pagevec_count(pvec);
 
-	while (--i >= 0)
+	while (--i >= 0) {
+		trace_mm_pagevec_free(pvec->pages[i], 0, pvec->cold);
 		free_hot_cold_page(pvec->pages[i], pvec->cold);
+	}
 }
 
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
+		trace_mm_page_free_direct(page, order);
 		if (order == 0)
 			free_hot_page(page);
 		else
-- 
1.6.3.3



* [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

Fragmentation avoidance depends on being able to use free pages from
lists of the appropriate migrate type. In the event this is not
possible, __rmqueue_fallback() selects a different list and in some
circumstances changes the migratetype of the pageblock. Simplistically,
the more times this event occurs, the more likely it is that fragmentation
will be a problem later, at least for hugepage allocation, but there are
other considerations such as the order of the page being split to satisfy
the allocation.

This patch adds a trace event for __rmqueue_fallback() that reports what
page is being used for the fallback, the orders of relevant pages, the
desired migratetype and the migratetype of the lists being used, whether
the pageblock changed type and whether this event is important with
respect to fragmentation avoidance or not. This information can be used
to help analyse fragmentation avoidance and help decide whether
min_free_kbytes should be increased or not.
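
As an illustration of reading the event (a rough sketch), the fragmenting
and change_ownership fields can be pulled straight out of the trace;

  cd /sys/kernel/debug/tracing
  echo 1 > events/kmem/mm_page_alloc_extfrag/enable
  grep mm_page_alloc_extfrag trace | grep -c 'fragmenting=1'       # fallbacks likely to fragment
  grep mm_page_alloc_extfrag trace | grep -c 'change_ownership=1'  # fallbacks where the pageblock changed type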

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   44 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    6 +++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 8ab0f98..4aed74b 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -302,6 +302,50 @@ TRACE_EVENT(mm_page_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
+TRACE_EVENT(mm_page_alloc_extfrag,
+
+	TP_PROTO(struct page *page,
+			int alloc_order, int fallback_order,
+			int alloc_migratetype, int fallback_migratetype,
+			int fragmenting, int change_ownership),
+
+	TP_ARGS(page,
+		alloc_order, fallback_order,
+		alloc_migratetype, fallback_migratetype,
+		fragmenting, change_ownership),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page			)
+		__field(	int,		alloc_order		)
+		__field(	int,		fallback_order		)
+		__field(	int,		alloc_migratetype	)
+		__field(	int,		fallback_migratetype	)
+		__field(	int,		fragmenting		)
+		__field(	int,		change_ownership	)
+	),
+
+	TP_fast_assign(
+		__entry->page			= page;
+		__entry->alloc_order		= alloc_order;
+		__entry->fallback_order		= fallback_order;
+		__entry->alloc_migratetype	= alloc_migratetype;
+		__entry->fallback_migratetype	= fallback_migratetype;
+		__entry->fragmenting		= fragmenting;
+		__entry->change_ownership	= change_ownership;
+	),
+
+	TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->alloc_order,
+		__entry->fallback_order,
+		pageblock_order,
+		__entry->alloc_migratetype,
+		__entry->fallback_migratetype,
+		__entry->fragmenting,
+		__entry->change_ownership)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3f6039..0b2a6d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -839,6 +839,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 							start_migratetype);
 
 			expand(zone, page, order, current_order, area, migratetype);
+
+			trace_mm_page_alloc_extfrag(page, order, current_order,
+				start_migratetype, migratetype,
+				current_order < pageblock_order,
+				migratetype == start_migratetype);
+
 			return page;
 		}
 	}
-- 
1.6.3.3



* [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

The page allocation trace event reports that a page was successfully allocated
but it does not specify where it came from. When analysing performance,
it can be important to distinguish between pages coming from the per-cpu
allocator and pages coming from the buddy lists, as the latter requires the
zone lock to be taken and more data structures to be examined.

This patch adds a trace event for __rmqueue reporting when a page is being
allocated from the buddy lists. It distinguishes between being called
to refill the per-cpu lists and being called for a high-order allocation.
Similarly, this patch adds an event to catch when the PCP lists are being
drained a little and pages are going back to the buddy lists.

This is trickier to draw conclusions from but high activity on those
events could explain why there were a large number of cache misses on a
page-allocator-intensive workload. The coalescing and splitting of buddies
involves a lot of writing of page metadata and cache line bounces not to
mention the acquisition of an interrupt-safe lock necessary to enter this
path.
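
As a rough way of judging how hot this path is (illustrative only), the two
allocation events can be compared after a workload has run;

  cd /sys/kernel/debug/tracing
  echo 1 > events/kmem/mm_page_alloc/enable
  echo 1 > events/kmem/mm_page_alloc_zone_locked/enable
  # run the workload, then compare buddy-list allocations against the total
  grep -c 'mm_page_alloc:' trace
  grep -c 'mm_page_alloc_zone_locked:' trace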

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   56 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    2 +
 2 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 4aed74b..fb8588d 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -302,6 +302,62 @@ TRACE_EVENT(mm_page_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
+TRACE_EVENT(mm_page_alloc_zone_locked,
+
+	TP_PROTO(struct page *page, unsigned int order,
+				int migratetype, int percpu_refill),
+
+	TP_ARGS(page, order, migratetype, percpu_refill),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+		__field(	int,		migratetype	)
+		__field(	int,		percpu_refill	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+		__entry->percpu_refill	= percpu_refill;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		__entry->migratetype,
+		smp_processor_id(),
+		__entry->percpu_refill)
+);
+
+TRACE_EVENT(mm_page_pcpu_drain,
+
+	TP_PROTO(struct page *page, int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	int,		order		)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d cpu=%d migratetype=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		smp_processor_id(),
+		__entry->migratetype)
+);
+
 TRACE_EVENT(mm_page_alloc_extfrag,
 
 	TP_PROTO(struct page *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0b2a6d9..97ea4c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -535,6 +535,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
+		trace_mm_page_pcpu_drain(page, order, page_private(page));
 		__free_one_page(page, zone, order, page_private(page));
 	}
 	spin_unlock(&zone->lock);
@@ -878,6 +879,7 @@ retry_reserve:
 		}
 	}
 
+	trace_mm_page_alloc_zone_locked(page, order, migratetype, order == 0);
 	return page;
 }
 
-- 
1.6.3.3



* [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like

Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                under lock     direct  pagevec      drain
swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0

Optionally, the output can include information on the parent, or aggregate
based on process name instead of aggregating based on each PID. Example output
including parent information and with the PID stripped out looks something like;

Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                     under lock     direct  pagevec      drain
gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0

This says that Xorg allocated 3796 pages and its parent process is gdm
with a PID of 3756.
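
For reference (paths relative to a kernel source tree), output like the
above can be generated with something along these lines;

  Documentation/trace/postprocess/trace-pagealloc-postprocess.pl \
	--prepend-parent \
	< /sys/kernel/debug/tracing/trace_pipe

The --ignore-pid switch aggregates processes of the same name together and
--read-procstat fills in process details from /proc if the trace lacks them.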

The postprocessor parses the text output of tracing. While there is a binary
format, the expectation is that the binary output can be readily translated
into text and post-processed offline. Obviously, if the text format changes,
the parser will break, but the regular-expression parser is fairly rudimentary
so it should be readily adjustable.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../postprocess/trace-pagealloc-postprocess.pl     |  356 ++++++++++++++++++++
 1 files changed, 356 insertions(+), 0 deletions(-)
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..56c7f42
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,356 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --prepend-parent	Report on the parent proc and PID
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_PAGE_ALLOC		=> 1;
+use constant MM_PAGE_FREE_DIRECT 	=> 2;
+use constant MM_PAGEVEC_FREE		=> 3;
+use constant MM_PAGE_PCPU_DRAIN		=> 4;
+use constant MM_PAGE_ALLOC_ZONE_LOCKED	=> 5;
+use constant MM_PAGE_ALLOC_EXTFRAG	=> 6;
+use constant EVENT_UNKNOWN		=> 7;
+
+# Constants used to track state
+use constant STATE_PCPU_PAGES_DRAINED	=> 8;
+use constant STATE_PCPU_PAGES_REFILLED	=> 9;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_PCPU_DRAINS		=> 10;
+use constant HIGH_PCPU_REFILLS		=> 11;
+use constant HIGH_EXT_FRAGMENT		=> 12;
+use constant HIGH_EXT_FRAGMENT_SEVERE	=> 13;
+use constant HIGH_EXT_FRAGMENT_MODERATE	=> 14;
+use constant HIGH_EXT_FRAGMENT_CHANGED	=> 15;
+
+my %perprocesspid;
+my %perprocess;
+my $opt_ignorepid;
+my $opt_read_procstat;
+my $opt_prepend_parent;
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		print "Second SIGINT received quickly, exiting\n";
+		$sigint_exit = 1;
+	}
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+	'prepend-parent' =>	\$opt_prepend_parent,
+);
+
+# Regexes used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_fragdetails = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([0-9]*) fallback_order=([0-9]*) pageblock_order=([0-9]*) alloc_migratetype=([0-9]*) fallback_migratetype=([0-9]*) fragmenting=([0-9]) change_ownership=([0-9])';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to match stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+sub parent_info($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+	my $ppid;
+
+	if ($pid == 0) {
+		return "NOPARENT-0";
+	}
+
+	if ($statline !~ /$regex_statppid/o) {
+		die("Failed to match stat line process ppid:: $statline");
+	}
+
+	# Read the ppid stat line
+	$ppid = $1;
+	return guess_process_pid($ppid, read_statline($ppid));
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$tracepoint = $4;
+
+			if ($opt_read_procstat || $opt_prepend_parent) {
+				$process_pid =~ /(.*)-([0-9]*)$/;
+				my $process = $1;
+				my $pid = $2;
+
+				$statline = read_statline($pid);
+
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+
+				if ($opt_prepend_parent) {
+					$process_pid = parent_info($pid, $statline) . " :: $process_pid";
+				}
+			}
+
+			# Unnecessary in this script. Uncomment if required
+			# $cpus = $2;
+			# $timestamp = $3;
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_page_alloc") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
+		} elsif ($tracepoint eq "mm_page_free_direct") {
+			$perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
+		} elsif ($tracepoint eq "mm_pagevec_free") {
+			$perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
+		} elsif ($tracepoint eq "mm_page_pcpu_drain") {
+			$perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_extfrag") {
+
+			# Extract the details of the event now
+			$details = $5;
+
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
+			my ($page, $pfn);
+			my ($alloc_order, $fallback_order, $pageblock_order);
+			my ($alloc_migratetype, $fallback_migratetype);
+			my ($fragmenting, $change_ownership);
+
+			$details =~ /$regex_fragdetails/o;
+			$page = $1;
+			$pfn = $2;
+			$alloc_order = $3;
+			$fallback_order = $4;
+			$pageblock_order = $5;
+			$alloc_migratetype = $6;
+			$fallback_migratetype = $7;
+			$fragmenting = $8;
+			$change_ownership = $9;
+
+			if ($fragmenting) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
+				if ($fallback_order <= 3) {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
+				} else {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
+				}
+			}
+			if ($change_ownership) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		# Catch a full pcpu drain event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
+				$tracepoint ne "mm_page_pcpu_drain") {
+
+			$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+
+		# Catch a full pcpu refill event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
+				$tracepoint ne "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
+		"Process", "Pages",  "Pages",      "Pages", "Pages", "PCPU",  "PCPU",   "PCPU",    "Fragment",  "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
+		"details", "allocd", "allocd",     "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing",   "Changed", "Severe", "Moderate", "");
+
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
+		"",        "",       "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
+
+	foreach $process_pid (keys %stats) {
+		# Dump final aggregates
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8d %8d   %8d %8d %8d   %8d %8d %8d %8d %8d %8d\n", 
+			$process_pid,
+			$stats{$process_pid}->{MM_PAGE_ALLOC},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
+			$stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
+			$stats{$process_pid}->{MM_PAGEVEC_FREE},
+			$stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS},
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
+			$stats{$process_pid}->{EVENT_UNKNOWN});
+	}
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
+		$perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
+		$perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
+		$perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
+		$perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
+		$perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
+		$perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
+		$perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
+		$perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
@ 2009-08-06 16:07   ` Mel Gorman
  0 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like

Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                under lock     direct  pagevec      drain
swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0

Optionally, the output can include information on the parent or aggregate
based on process name instead of aggregating based on each pid. Example output
including parent information and stripped out the PID looks something like;

Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                     under lock     direct  pagevec      drain
gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0

This says that Xorg allocated 3796 pages and its parent process is gdm
with a PID of 3756.

The postprocessor parses the text output of tracing. While there is a binary
format, the expectation is that the binary output can be readily translated
into text and post-processed offline. Obviously, if the text format changes,
the parser will break, but the regular expressions are fairly rudimentary so
they should be readily adjustable.
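
As an illustration (the trace file name below is just a placeholder), the
script can be used either live against trace_pipe or offline against a
previously captured trace:

  # live: interrupt once for an interim report, twice in quick succession to exit
  $ perl trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe

  # offline: post-process a stored copy of the trace
  $ cat /sys/kernel/debug/tracing/trace > pagealloc.trace
  $ perl trace-pagealloc-postprocess.pl --ignore-pid < pagealloc.trace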

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../postprocess/trace-pagealloc-postprocess.pl     |  356 ++++++++++++++++++++
 1 files changed, 356 insertions(+), 0 deletions(-)
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..56c7f42
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,356 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --prepend-parent	Report on the parent proc and PID
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_PAGE_ALLOC		=> 1;
+use constant MM_PAGE_FREE_DIRECT 	=> 2;
+use constant MM_PAGEVEC_FREE		=> 3;
+use constant MM_PAGE_PCPU_DRAIN		=> 4;
+use constant MM_PAGE_ALLOC_ZONE_LOCKED	=> 5;
+use constant MM_PAGE_ALLOC_EXTFRAG	=> 6;
+use constant EVENT_UNKNOWN		=> 7;
+
+# Constants used to track state
+use constant STATE_PCPU_PAGES_DRAINED	=> 8;
+use constant STATE_PCPU_PAGES_REFILLED	=> 9;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_PCPU_DRAINS		=> 10;
+use constant HIGH_PCPU_REFILLS		=> 11;
+use constant HIGH_EXT_FRAG		=> 12;
+use constant HIGH_EXT_FRAGMENT_SEVERE	=> 13;
+use constant HIGH_EXT_FRAGMENT_MODERATE	=> 14;
+use constant HIGH_EXT_FRAGMENT_CHANGED	=> 15;
+
+my %perprocesspid;
+my %perprocess;
+my $opt_ignorepid;
+my $opt_read_procstat;
+my $opt_prepend_parent;
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		print "Second SIGINT received quickly, exiting\n";
+		$sigint_exit = 1;
+	}
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+	'prepend-parent' =>	\$opt_prepend_parent,
+);
+
+# Regexes used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_fragdetails = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([0-9]*) fallback_order=([0-9]*) pageblock_order=([0-9]*) alloc_migratetype=([0-9]*) fallback_migratetype=([0-9]*) fragmenting=([0-9]) change_ownership=([0-9])';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
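+#
+# For reference, the trace event regex above is expected to match lines that
+# look roughly like the following (illustrative only; exact spacing and the
+# gfp_flags representation depend on the tracer configuration):
+#   Xorg-3770  [001]   240.123456: mm_page_alloc: page=f4d18e80 pfn=165405 order=0 migratetype=0 gfp_flags=GFP_KERNEL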
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if (!defined($statline) || $statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to match stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+sub parent_info($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+	my $ppid;
+
+	if ($pid == 0) {
+		return "NOPARENT-0";
+	}
+
+	if ($statline !~ /$regex_statppid/o) {
+		die("Failed to match stat line process ppid:: $statline");
+	}
+
+	# Read the ppid stat line
+	$ppid = $1;
+	return guess_process_pid($ppid, read_statline($ppid));
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$tracepoint = $4;
+
+			if ($opt_read_procstat || $opt_prepend_parent) {
+				$process_pid =~ /(.*)-([0-9]*)$/;
+				my $process = $1;
+				my $pid = $2;
+
+				$statline = read_statline($pid);
+
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+
+				if ($opt_prepend_parent) {
+					$process_pid = parent_info($pid, $statline) . " :: $process_pid";
+				}
+			}
+
+			# Unnecessary in this script. Uncomment if required
+			# $cpus = $2;
+			# $timestamp = $3;
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_page_alloc") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
+		} elsif ($tracepoint eq "mm_page_free_direct") {
+			$perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
+		} elsif ($tracepoint eq "mm_pagevec_free") {
+			$perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
+		} elsif ($tracepoint eq "mm_page_pcpu_drain") {
+			$perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_extfrag") {
+
+			# Extract the details of the event now
+			$details = $5;
+
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
+			my ($page, $pfn);
+			my ($alloc_order, $fallback_order, $pageblock_order);
+			my ($alloc_migratetype, $fallback_migratetype);
+			my ($fragmenting, $change_ownership);
+
+			$details =~ /$regex_fragdetails/o;
+			$page = $1;
+			$pfn = $2;
+			$alloc_order = $3;
+			$fallback_order = $4;
+			$pageblock_order = $5;
+			$alloc_migratetype = $6;
+			$fallback_migratetype = $7;
+			$fragmenting = $8;
+			$change_ownership = $9;
+
+			if ($fragmenting) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
+				if ($fallback_order <= 3) {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
+				} else {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
+				}
+			}
+			if ($change_ownership) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		# Catch a full pcpu drain event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
+				$tracepoint ne "mm_page_pcpu_drain") {
+
+			$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+
+		# Catch a full pcpu refill event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
+				$tracepoint ne "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
+		"Process", "Pages",  "Pages",      "Pages", "Pages", "PCPU",  "PCPU",   "PCPU",    "Fragment",  "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
+		"details", "allocd", "allocd",     "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing",   "Changed", "Severe", "Moderate", "");
+
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
+		"",        "",       "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
+
+	foreach $process_pid (keys %stats) {
+		# Dump final aggregates
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8d %8d   %8d %8d %8d   %8d %8d %8d %8d %8d %8d\n", 
+			$process_pid,
+			$stats{$process_pid}->{MM_PAGE_ALLOC},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
+			$stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
+			$stats{$process_pid}->{MM_PAGEVEC_FREE},
+			$stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS},
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
+			$stats{$process_pid}->{EVENT_UNKNOWN});
+	}
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
+		$perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
+		$perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
+		$perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
+		$perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
+		$perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
+		$perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
+		$perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
+		$perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-06 16:07 ` Mel Gorman
@ 2009-08-06 16:07   ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

The documentation for ftrace, events and tracepoints is pretty
extensive. Similarly, the perf PCL tool's --help output is available and
the code is simple enough to figure out what most of the switches mean.
However, pulling the discrete bits and pieces together and translating
that into "how do I solve a problem" requires a fair amount of
imagination.

This patch adds a simple document intended to get someone started on the
different ways of using tracepoints to gather meaningful data.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
 1 files changed, 327 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/tracepoint-analysis.txt

diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 0000000..e7a7d3e
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
+		Notes on Analysing Behaviour Using Events and Tracepoints
+
+			Documentation written by Mel Gorman
+		PCL information heavily based on email from Ingo Molnar
+
+1. Introduction
+===============
+
+Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
+creating custom kernel modules to register probe functions using the event
+tracing infrastructure.
+
+Simplistically, tracepoints represent important events that can be taken in
+conjunction with other tracepoints to build a "Big Picture" of what is going
+on within the system. There are a large number of methods for gathering and
+interpreting these events. Lacking any current Best Practices, this document
+describes some of the methods that can be used.
+
+This document assumes that debugfs is mounted on /sys/kernel/debug and that
+the appropriate tracing options have been configured into the kernel. It is
+assumed that the PCL tool tools/perf has been installed and is in your path.
+
+2. Listing Available Events
+===========================
+
+2.1 Standard Utilities
+----------------------
+
+All possible events are visible from /sys/kernel/debug/tracing/events. Simply
+calling
+
+  $ find /sys/kernel/debug/tracing/events -type d
+
+will give a fair indication of the number of events available.
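+
+If only the kmem-related events are of interest, the listing can be narrowed
+(illustrative; the exact set of events depends on the kernel configuration):
+
+  $ ls /sys/kernel/debug/tracing/events/kmem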
+
+2.2 PCL
+-------
+
+Discovery and enumeration of all counters and events, including tracepoints,
+are available with the perf tool. Getting a list of available events is a
+simple case of
+
+  $ perf list 2>&1 | grep Tracepoint
+  ext4:ext4_free_inode                     [Tracepoint event]
+  ext4:ext4_request_inode                  [Tracepoint event]
+  ext4:ext4_allocate_inode                 [Tracepoint event]
+  ext4:ext4_write_begin                    [Tracepoint event]
+  ext4:ext4_ordered_write_end              [Tracepoint event]
+  [ .... remaining output snipped .... ]
+
+
+3. Enabling Events
+==================
+
+3.1 System-Wide Event Enabling
+------------------------------
+
+See Documentation/trace/events.txt for a proper description on how events
+can be enabled system-wide. A short example of enabling all events related
+to page allocation would look something like
+
+  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
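+
+Alternatively, assuming the per-subsystem enable file is available, every
+event in the kmem subsystem (not just the page-allocation related ones) can
+be switched on in one go with
+
+  $ echo 1 > /sys/kernel/debug/tracing/events/kmem/enable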
+
+3.2 System-Wide Event Enabling with SystemTap
+---------------------------------------------
+
+In SystemTap, tracepoints are accessible using the kernel.trace() function
+call. The following is an example that reports every 5 seconds what processes
+were allocating the pages.
+
+  global page_allocs
+
+  probe kernel.trace("mm_page_alloc") {
+  	page_allocs[execname()]++
+  }
+
+  function print_count() {
+  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
+  	foreach (proc in page_allocs-)
+  		printf("%-25d %s\n", page_allocs[proc], proc)
+  	printf ("\n")
+  	delete page_allocs
+  }
+
+  probe timer.s(5) {
+          print_count()
+  }
+
+3.3 System-Wide Event Enabling with PCL
+---------------------------------------
+
+By specifying the -a switch and profiling a sleep command, system-wide
+events can be examined for a fixed duration of time.
+
+ $ perf stat -a \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	sleep 10
+ Performance counter stats for 'sleep 10':
+
+           9630  kmem:mm_page_alloc      
+           2143  kmem:mm_page_free_direct
+           7424  kmem:mm_pagevec_free    
+
+   10.002577764  seconds time elapsed
+
+Similarly, one could execute a shell and exit it as desired to get a report
+at that point.
+
+3.4 Local Event Enabling
+------------------------
+
+Documentation/trace/ftrace.txt describes how to enable events on a per-thread
+basis using set_ftrace_pid.
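+
+A minimal sketch, assuming the PID of interest is already known (1234 here
+is only a placeholder) and following the mechanism described there, might be
+
+  $ echo 1234 > /sys/kernel/debug/tracing/set_ftrace_pid
+  $ echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable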
+
+3.5 Local Event Enablement with PCL
+-----------------------------------
+
+Events can be activated and tracked for the duration of a process on a
+local basis using PCL as follows.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		 -e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.909
+
+    Performance counter stats for './hackbench 10':
+
+          17803  kmem:mm_page_alloc      
+          12398  kmem:mm_page_free_direct
+           4827  kmem:mm_pagevec_free    
+
+    0.973913387  seconds time elapsed
+
+4. Event Filtering
+==================
+
+Documentation/trace/ftrace.txt covers in-depth how to filter events in
+ftrace.  Obviously using grep and awk of trace_pipe is an option as well
+as any script reading trace_pipe.
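+
+As a trivial illustration, the stream could be narrowed to one event and one
+process with standard tools (the process name here is only an example):
+
+  $ grep mm_page_alloc /sys/kernel/debug/tracing/trace_pipe | grep Xorg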
+
+5. Analysing Event Variances with PCL
+=====================================
+
+Any workload can exhibit variances between runs and it can be important
+to know what the standard deviation is. By and large, this is left to the
+performance analyst to do by hand. In the event that the discrete event
+occurrences are useful to the performance analyst, perf can be used.
+
+  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
+			-e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.890
+  Time: 0.895
+  Time: 0.915
+  Time: 1.001
+  Time: 0.899
+
+   Performance counter stats for './hackbench 10' (5 runs):
+
+          16630  kmem:mm_page_alloc         ( +-   3.542% )
+          11486  kmem:mm_page_free_direct   ( +-   4.771% )
+           4730  kmem:mm_pagevec_free       ( +-   2.325% )
+
+    0.982653002  seconds time elapsed   ( +-   1.448% )
+
+In the event that some higher-level event is required that depends on some
+aggregation of discrete events, then a script would need to be developed.
+
+Using --repeat, it is also possible to view how events are fluctuating over
+time on a system wide basis using -a and sleep.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-a --repeat 10 \
+		sleep 1
+  Performance counter stats for 'sleep 1' (10 runs):
+
+           1066  kmem:mm_page_alloc         ( +-  26.148% )
+            182  kmem:mm_page_free_direct   ( +-   5.464% )
+            890  kmem:mm_pagevec_free       ( +-  30.079% )
+
+    1.002251757  seconds time elapsed   ( +-   0.005% )
+
+6. Higher-Level Analysis with Helper Scripts
+============================================
+
+When events are enabled, the events that are triggering can be read from
+/sys/kernel/debug/tracing/trace_pipe in human-readable format, although binary
+options exist as well. By post-processing the output, further information can
+be gathered on-line as appropriate. Examples of post-processing might include
+
+  o Reading information from /proc for the PID that triggered the event
+  o Deriving a higher-level event from a series of lower-level events
+  o Calculating latencies between two events
+
+Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
+script that can read trace_pipe from STDIN or a copy of a trace. When used
+on-line, it can be interrupted once to generate a report without exiting
+and twice to exit.
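+
+An illustrative invocation from a kernel source tree (the options are those
+documented at the top of the script) would be
+
+  $ perl Documentation/trace/postprocess/trace-pagealloc-postprocess.pl \
+	--prepend-parent --read-procstat < /sys/kernel/debug/tracing/trace_pipe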
+
+Simplistically, the script just reads STDIN and counts up events, but it
+can also do more, such as
+
+  o Derive high-level events from many low-level events. If a number of pages
+    are freed to the main allocator from the per-CPU lists, it recognises
+    that as one per-CPU drain even though there is no specific tracepoint
+    for that event
+  o It can aggregate based on the individual PID or on the process name
+  o In the event memory is getting externally fragmented, it reports
+    on whether the fragmentation event was severe or moderate.
+  o When receiving an event about a PID, it can record who the parent was so
+    that if large numbers of events are coming from very short-lived
+    processes, the parent process responsible for creating all the helpers
+    can be identified
+
+7. Lower-Level Analysis with PCL
+================================
+
+There may also be a requirement to identify what functions within a program
+were generating events within the kernel. To begin this sort of analysis, the
+data must be recorded. At the time of writing, this required root
+
+  $ perf record -c 1 \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	./hackbench 10
+  Time: 0.894
+  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
+
+Note the use of '-c 1' to set the sample period so that every event is
+sampled. The default sample period is quite high to minimise overhead but
+the information collected can be very coarse as a result.
+
+This record wrote out a file called perf.data which can be analysed using
+perf report.
+
+  $ perf report
+  # Samples: 30922
+  #
+  # Overhead    Command                     Shared Object
+  # ........  .........  ................................
+  #
+      87.27%  hackbench  [vdso]                          
+       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
+       2.62%  hackbench  /lib/ld-2.9.so                  
+       1.52%       perf  [vdso]                          
+       1.22%  hackbench  ./hackbench                     
+       0.48%  hackbench  [kernel]                        
+       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
+       0.01%       perf  /usr/bin/perf                   
+       0.01%       perf  /lib/ld-2.9.so                  
+       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+According to this, the vast majority of events were triggered within the
+VDSO. With simple binaries, this will often be the case, so let's take a
+slightly different example. In the course of writing this, it was noticed
+that X was generating an insane number of page allocations, so let's look
+at it
+
+  $ perf record -c 1 -f \
+		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-p `pidof X`
+
+This was interrupted after a few seconds and
+
+  $ perf report
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object
+  # ........  .......  .......................................
+  #
+      51.95%     Xorg  [vdso]                                 
+      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
+       0.01%     Xorg  [kernel]                               
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+So, almost half of the events are occurring within a library. To get an idea
+of which symbol is responsible, sort the report by symbol as well.
+
+  $ perf report --sort comm,dso,symbol
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object  Symbol
+  # ........  .......  .......................................  ......
+  #
+      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
+      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
+       0.01%     Xorg  [kernel]                                 [k] read_hpet
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
+       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
+
+To see where within the function pixmanFillsse2 things are going wrong
+
+  $ perf annotate pixmanFillsse2
+  [ ... ]
+    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
+         :      }
+         :
+         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
+         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
+         :        *__P = __B;
+   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
+    0.00 :         34ef5:       ff 
+   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
+    0.00 :         34efd:       ff 
+   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
+    0.00 :         34f05:       ff 
+   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
+    0.00 :         34f0d:       ff 
+   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
+   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
+   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
+   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
+
+At a glance, it looks like the time is being spent copying pixmaps to
+the card.  Further investigation would be needed to determine why pixmaps
+are being copied around so much, but a starting point would be to take an
+ancient build of libpixman out of the library path where it had been totally
+forgotten about months ago!
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints
  2009-08-06 16:07 ` Mel Gorman
@ 2009-08-06 16:07   ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-06 16:07 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

Knowing tracepoints exist is not quite the same as knowing what they
should be used for. This patch adds a document giving a basic
description of the kmem tracepoints and why they might be useful to a
performance analyst.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/events-kmem.txt |  107 +++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt

diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 0000000..6ef2a86
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
+			Subsystem Trace Points: kmem
+
+The tracing system kmem captures events related to object and page allocation
+within the kernel. Broadly speaking, there are five major subheadings.
+
+  o Slab allocation of small objects of unknown type (kmalloc)
+  o Slab allocation of small objects of known type
+  o Page allocation
+  o Per-CPU Allocator Activity
+  o External Fragmentation
+
+This document will describe what each of the tracepoints is and why they
+might be useful.
+
+1. Slab allocation of small objects of unknown type
+===================================================
+kmalloc		call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmalloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kfree		call_site=%lx ptr=%p
+
+Heavy activity for these events may indicate that a specific cache is
+justified, particularly if kmalloc slab pages are getting significantly
+internally fragmented as a result of the allocation pattern. By correlating
+kmalloc with kfree, it may be possible to identify memory leaks and where
+the allocation sites were.
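+
+As a rough illustration, a large mismatch between the two counts over the
+lifetime of a workload can be spotted with PCL (the workload here is only
+an example):
+
+  $ perf stat -e kmem:kmalloc -e kmem:kfree ./hackbench 10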
+
+
+2. Slab allocation of small objects of known type
+=================================================
+kmem_cache_alloc	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmem_cache_alloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kmem_cache_free		call_site=%lx ptr=%p
+
+These events are similar in usage to the kmalloc-related events except that
+it is likely easier to pin the event down to a specific cache. At the time
+of writing, no information is available on what slab is being allocated from,
+but the call_site can usually be used to extrapolate that information.
+
+3. Page allocation
+==================
+mm_page_alloc		  page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
+mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_free_direct	  page=%p pfn=%lu order=%d
+mm_pagevec_free		  page=%p pfn=%lu order=%d cold=%d
+
+These four events deal with page allocation and freeing. mm_page_alloc is
+a simple indicator of page allocator activity. Pages may be allocated from
+the per-CPU allocator (high performance) or the buddy allocator.
+
+If pages are allocated directly from the buddy allocator, the
+mm_page_alloc_zone_locked event is triggered. This event is important as high
+amounts of activity imply high activity on the zone->lock. Taking this lock
+impairs performance by disabling interrupts, dirtying cache lines between
+CPUs and serialising many CPUs.
+
+When a page is freed directly by the caller, the mm_page_free_direct event
+is triggered. Significant amounts of activity here could indicate that the
+callers should be batching their activities.
+
+When pages are freed using a pagevec, the mm_pagevec_free event is
+triggered. Broadly speaking, pages are taken off the LRU list in bulk and
+freed in batch with a pagevec. Significant amounts of activity here could
+indicate that the system is under memory pressure and can also indicate
+contention on the zone->lru_lock.
+
+4. Per-CPU Allocator Activity
+=============================
+mm_page_alloc_zone_locked	page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_pcpu_drain		page=%p pfn=%lu order=%d cpu=%d migratetype=%d
+
+In front of the page allocator is a per-cpu page allocator. It exists only
+for order-0 pages, reduces contention on the zone->lock and reduces the
+amount of writing on struct page.
+
+When a per-CPU list is empty or pages of the wrong type are allocated,
+the zone->lock will be taken once and the per-CPU list refilled. The event
+triggered is mm_page_alloc_zone_locked for each page allocated with the
+event indicating whether it is for a percpu_refill or not.
+
+When the per-CPU list is too full, a number of pages are freed, each of
+which triggers an mm_page_pcpu_drain event.
+
+The events are individual in nature so that pages can be tracked between
+allocation and freeing. A run of drain or refill events that occur
+consecutively implies that the zone->lock was taken once. Large numbers of
+PCP refills and drains could imply an imbalance between CPUs where too much
+work is being concentrated in one place. It could also indicate that the
+per-CPU lists should be a larger size. Finally, large numbers of refills on
+one CPU and drains on another could be a factor in causing large numbers of
+cache line bounces due to writes between CPUs, and it is worth investigating
+whether pages can be allocated and freed on the same CPU through some
+algorithm change.
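+
+As a crude system-wide check of how much per-CPU churn is taking place, the
+two events can be counted together with PCL (the observation window is
+arbitrary):
+
+  $ perf stat -a -e kmem:mm_page_alloc_zone_locked \
+	-e kmem:mm_page_pcpu_drain sleep 10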
+
+5. External Fragmentation
+=========================
+mm_page_alloc_extfrag		page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
+
+External fragmentation affects whether a high-order allocation will be
+successful or not. For some types of hardware, this is important although
+it is avoided where possible. If the system is using huge pages and needs
+to be able to resize the pool over the lifetime of the system, this value
+is important.
+
+Large numbers of this event imply that memory is fragmenting and
+high-order allocations will start failing at some time in the future. One
+means of reducing the occurrence of this event is to increase the size of
+min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes, where
+a pageblock is usually the size of the default hugepage.
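+
+As a worked example, assuming a single node and a 2MB default hugepage size
+(so one pageblock is 2048kB), each increment would be 3*2048 = 6144kB. The
+values below are purely illustrative:
+
+  $ cat /proc/sys/vm/min_free_kbytes
+  16384
+  $ echo 22528 > /proc/sys/vm/min_free_kbytes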
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/4] Add some trace events for the page allocator v4
  2009-08-06 16:07 ` Mel Gorman
@ 2009-08-06 16:10   ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-06 16:10 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm

On Thu, Aug 06, 2009 at 05:07:01PM +0100, Mel Gorman wrote:
> This is V4 of a patchset to add some trace points for the page allocator. The
> largest changes in this version is performance improvements and expansion
> of the post-processing script as well as some documentation. There were minor changes
> elsewhere that are described in the changelog.
> 

Cack, the subject should be 0/6 of course, not 0/4

> <SNIP>
> 
>  Documentation/trace/events-kmem.txt                |  107 ++++++
>  .../postprocess/trace-pagealloc-postprocess.pl     |  356 ++++++++++++++++++++
>  Documentation/trace/tracepoint-analysis.txt        |  327 ++++++++++++++++++
>  include/trace/events/kmem.h                        |  177 ++++++++++
>  mm/page_alloc.c                                    |   16 +-
>  5 files changed, 982 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/trace/events-kmem.txt
>  create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
>  create mode 100644 Documentation/trace/tracepoint-analysis.txt
> 
> Mel Gorman (6):
>   tracing, page-allocator: Add trace events for page allocation and
>     page freeing
>   tracing, page-allocator: Add trace events for anti-fragmentation
>     falling back to other migratetypes
>   tracing, page-allocator: Add trace event for page traffic related to
>     the buddy lists
>   tracing, page-allocator: Add a postprocessing script for
>     page-allocator-related ftrace events
>   tracing, documentation: Add a document describing how to do some
>     performance analysis with tracepoints
>   tracing, documentation: Add a document on the kmem tracepoints
> 

Similarly, I should have stripped this junk away before sending. The
real diffstat is below. Sorry.

>  Documentation/trace/events-kmem.txt                |  107 ++++++
>  .../postprocess/trace-pagealloc-postprocess.pl     |  356 ++++++++++++++++++++
>  Documentation/trace/tracepoint-analysis.txt        |  327 ++++++++++++++++++
>  include/trace/events/kmem.h                        |  177 ++++++++++
>  mm/page_alloc.c                                    |   15 +-
>  5 files changed, 981 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/trace/events-kmem.txt
>  create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
>  create mode 100644 Documentation/trace/tracepoint-analysis.txt
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-06 16:07   ` Mel Gorman
@ 2009-08-06 21:07     ` Li, Ming Chun
  -1 siblings, 0 replies; 54+ messages in thread
From: Li, Ming Chun @ 2009-08-06 21:07 UTC (permalink / raw)
  To: Mel Gorman; +Cc: LKML, linux-mm

On Thu, 6 Aug 2009, Mel Gorman wrote:

Code style nitpick: there are four trailing-whitespace errors when
applying this script patch; running ./scripts/checkpatch.pl will tell
which lines have trailing whitespace.

Vincent 

> This patch adds a simple post-processing script for the page-allocator-related
> trace events. It can be used to give an indication of who the most
> allocator-intensive processes are and how often the zone lock was taken
> during the tracing period. Example output looks like
> 
> Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
> details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
>                                 under lock     direct  pagevec      drain
> swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
> Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
> modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
> xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
> awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
> thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
> hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
> akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
> xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
> khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0
> 
> Optionally, the output can include information on the parent, or aggregate
> based on process name instead of aggregating based on each pid. Example output
> including parent information and with the PID stripped looks something like:
> 
> Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
> details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
>                                      under lock     direct  pagevec      drain
> gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
> init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
> git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0
> 
> This says that Xorg allocated 3796 pages and its parent process is gdm
> with a PID of 3756.
> 
> The postprocessor parses the text output of tracing. While there is a binary
> format, the expectation is that the binary output can be readily translated
> into text and post-processed offline. Obviously if the text format changes,
> the parser will break but the regular expression parser is fairly rudimentary
> so should be readily adjustable.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  .../postprocess/trace-pagealloc-postprocess.pl     |  356 ++++++++++++++++++++
>  1 files changed, 356 insertions(+), 0 deletions(-)
>  create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
> 
> diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
> new file mode 100755
> index 0000000..56c7f42
> --- /dev/null
> +++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
> @@ -0,0 +1,356 @@
> +#!/usr/bin/perl
> +# This is a POC (proof of concept or piece of crap, take your pick) for reading the
> +# text representation of trace output related to page allocation. It makes an attempt
> +# to extract some high-level information on what is going on. The accuracy of the parser
> +# may vary considerably
> +#
> +# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
> +# other options
> +#   --prepend-parent	Report on the parent proc and PID
> +#   --read-procstat	If the trace lacks process info, get it from /proc
> +#   --ignore-pid	Aggregate processes of the same name together
> +#
> +# Copyright (c) IBM Corporation 2009
> +# Author: Mel Gorman <mel@csn.ul.ie>
> +use strict;
> +use Getopt::Long;
> +
> +# Tracepoint events
> +use constant MM_PAGE_ALLOC		=> 1;
> +use constant MM_PAGE_FREE_DIRECT 	=> 2;
> +use constant MM_PAGEVEC_FREE		=> 3;
> +use constant MM_PAGE_PCPU_DRAIN		=> 4;
> +use constant MM_PAGE_ALLOC_ZONE_LOCKED	=> 5;
> +use constant MM_PAGE_ALLOC_EXTFRAG	=> 6;
> +use constant EVENT_UNKNOWN		=> 7;
> +
> +# Constants used to track state
> +use constant STATE_PCPU_PAGES_DRAINED	=> 8;
> +use constant STATE_PCPU_PAGES_REFILLED	=> 9;
> +
> +# High-level events extrapolated from tracepoints
> +use constant HIGH_PCPU_DRAINS		=> 10;
> +use constant HIGH_PCPU_REFILLS		=> 11;
> +use constant HIGH_EXT_FRAGMENT		=> 12;
> +use constant HIGH_EXT_FRAGMENT_SEVERE	=> 13;
> +use constant HIGH_EXT_FRAGMENT_MODERATE	=> 14;
> +use constant HIGH_EXT_FRAGMENT_CHANGED	=> 15;
> +
> +my %perprocesspid;
> +my %perprocess;
> +my $opt_ignorepid;
> +my $opt_read_procstat;
> +my $opt_prepend_parent;
> +
> +# Catch sigint and exit on request
> +my $sigint_report = 0;
> +my $sigint_exit = 0;
> +my $sigint_pending = 0;
> +my $sigint_received = 0;
> +sub sigint_handler {
> +	my $current_time = time;
> +	if ($current_time - 2 > $sigint_received) {
> +		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
> +		$sigint_report = 1;
> +	} else {
> +		print "Second SIGINT received quickly, exiting\n";
> +		$sigint_exit = 1;
> +	}
> +	$sigint_received = $current_time;
> +	$sigint_pending = 1;
> +}
> +$SIG{INT} = "sigint_handler";
> +
> +# Parse command line options
> +GetOptions(
> +	'ignore-pid'	 =>	\$opt_ignorepid,
> +	'read-procstat'	 =>	\$opt_read_procstat,
> +	'prepend-parent' =>	\$opt_prepend_parent,
> +);
> +
> +# Regexes used. Specified like this for readability and for use with /o
> +#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
> +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
> +my $regex_fragdetails = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([0-9]*) fallback_order=([0-9]*) pageblock_order=([0-9]*) alloc_migratetype=([0-9]*) fallback_migratetype=([0-9]*) fragmenting=([0-9]) change_ownership=([0-9])';
> +my $regex_statname = '[-0-9]*\s\((.*)\).*';
> +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
> +
> +sub read_statline($) {
> +	my $pid = $_[0];
> +	my $statline;
> +
> +	if (open(STAT, "/proc/$pid/stat")) {
> +		$statline = <STAT>;
> +		close(STAT);
> +	}
> +
> +	if ($statline eq '') {
> +		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
> +	}
> +
> +	return $statline;
> +}
> +
> +sub guess_process_pid($$) {
> +	my $pid = $_[0];
> +	my $statline = $_[1];
> +
> +	if ($pid == 0) {
> +		return "swapper-0";
> +	}
> +
> +	if ($statline !~ /$regex_statname/o) {
> +		die("Failed to match stat line for process name :: $statline");
> +	}
> +	return "$1-$pid";
> +}
> +
> +sub parent_info($$) {
> +	my $pid = $_[0];
> +	my $statline = $_[1];
> +	my $ppid;
> +
> +	if ($pid == 0) {
> +		return "NOPARENT-0";
> +	}
> +
> +	if ($statline !~ /$regex_statppid/o) {
> +		die("Failed to match stat line process ppid:: $statline");
> +	}
> +
> +	# Read the ppid stat line
> +	$ppid = $1;
> +	return guess_process_pid($ppid, read_statline($ppid));
> +}
> +
> +sub process_events {
> +	my $traceevent;
> +	my $process_pid;
> +	my $cpus;
> +	my $timestamp;
> +	my $tracepoint;
> +	my $details;
> +	my $statline;
> +
> +	# Read each line of the event log
> +EVENT_PROCESS:
> +	while ($traceevent = <STDIN>) {
> +		if ($traceevent =~ /$regex_traceevent/o) {
> +			$process_pid = $1;
> +			$tracepoint = $4;
> +
> +			if ($opt_read_procstat || $opt_prepend_parent) {
> +				$process_pid =~ /(.*)-([0-9]*)$/;
> +				my $process = $1;
> +				my $pid = $2;
> +
> +				$statline = read_statline($pid);
> +
> +				if ($opt_read_procstat && $process eq '') {
> +					$process_pid = guess_process_pid($pid, $statline);
> +				}
> +
> +				if ($opt_prepend_parent) {
> +					$process_pid = parent_info($pid, $statline) . " :: $process_pid";
> +				}
> +			}
> +
> +			# Unnecessary in this script. Uncomment if required
> +			# $cpus = $2;
> +			# $timestamp = $3;
> +		} else {
> +			next;
> +		}
> +
> +		# Perl Switch() sucks majorly
> +		if ($tracepoint eq "mm_page_alloc") {
> +			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
> +		} elsif ($tracepoint eq "mm_page_free_direct") {
> +			$perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
> +		} elsif ($tracepoint eq "mm_pagevec_free") {
> +			$perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
> +		} elsif ($tracepoint eq "mm_page_pcpu_drain") {
> +			$perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
> +			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
> +		} elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
> +			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
> +			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
> +		} elsif ($tracepoint eq "mm_page_alloc_extfrag") {
> +
> +			# Extract the details of the event now
> +			$details = $5;
> +
> +			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
> +			my ($page, $pfn);
> +			my ($alloc_order, $fallback_order, $pageblock_order);
> +			my ($alloc_migratetype, $fallback_migratetype);
> +			my ($fragmenting, $change_ownership);
> +
> +			$details =~ /$regex_fragdetails/o;
> +			$page = $1;
> +			$pfn = $2;
> +			$alloc_order = $3;
> +			$fallback_order = $4;
> +			$pageblock_order = $5;
> +			$alloc_migratetype = $6;
> +			$fallback_migratetype = $7;
> +			$fragmenting = $8;
> +			$change_ownership = $9;
> +
> +			if ($fragmenting) {
> +				$perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
> +				if ($fallback_order <= 3) {
> +					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
> +				} else {
> +					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
> +				}
> +			}
> +			if ($change_ownership) {
> +				$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
> +			}
> +		} else {
> +			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
> +		}
> +
> +		# Catch a full pcpu drain event
> +		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
> +				$tracepoint ne "mm_page_pcpu_drain") {
> +
> +			$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
> +			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
> +		}
> +
> +		# Catch a full pcpu refill event
> +		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
> +				$tracepoint ne "mm_page_alloc_zone_locked") {
> +			$perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
> +			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
> +		}
> +
> +		if ($sigint_pending) {
> +			last EVENT_PROCESS;
> +		}
> +	}
> +}
> +
> +sub dump_stats {
> +	my $hashref = shift;
> +	my %stats = %$hashref;
> +
> +	# Dump per-process stats
> +	my $process_pid;
> +	my $max_strlen = 0;
> +
> +	# Get the maximum process name
> +	foreach $process_pid (keys %perprocesspid) {
> +		my $len = length($process_pid);
> +		if ($len > $max_strlen) {
> +			$max_strlen = $len;
> +		}
> +	}
> +	$max_strlen += 2;
> +
> +	printf("\n");
> +	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
> +		"Process", "Pages",  "Pages",      "Pages", "Pages", "PCPU",  "PCPU",   "PCPU",    "Fragment",  "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
> +	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
> +		"details", "allocd", "allocd",     "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing",   "Changed", "Severe", "Moderate", "");
> +
> +	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n", 
> +		"",        "",       "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
> +
> +	foreach $process_pid (keys %stats) {
> +		# Dump final aggregates
> +		if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
> +			$stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
> +			$stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
> +		}
> +		if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
> +			$stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
> +			$stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
> +		}
> +
> +		printf("%-" . $max_strlen . "s %8d %10d   %8d %8d   %8d %8d %8d   %8d %8d %8d %8d %8d %8d\n", 
> +			$process_pid,
> +			$stats{$process_pid}->{MM_PAGE_ALLOC},
> +			$stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
> +			$stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
> +			$stats{$process_pid}->{MM_PAGEVEC_FREE},
> +			$stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
> +			$stats{$process_pid}->{HIGH_PCPU_DRAINS},
> +			$stats{$process_pid}->{HIGH_PCPU_REFILLS},
> +			$stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
> +			$stats{$process_pid}->{HIGH_EXT_FRAG},
> +			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
> +			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
> +			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
> +			$stats{$process_pid}->{EVENT_UNKNOWN});
> +	}
> +}
> +
> +sub aggregate_perprocesspid() {
> +	my $process_pid;
> +	my $process;
> +	undef %perprocess;
> +
> +	foreach $process_pid (keys %perprocesspid) {
> +		$process = $process_pid;
> +		$process =~ s/-([0-9])*$//;
> +		if ($process eq '') {
> +			$process = "NO_PROCESS_NAME";
> +		}
> +
> +		$perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
> +		$perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
> +		$perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
> +		$perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
> +		$perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
> +		$perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
> +		$perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
> +		$perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
> +		$perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
> +		$perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
> +		$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
> +		$perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
> +		$perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
> +	}
> +}
> +
> +sub report() {
> +	if (!$opt_ignorepid) {
> +		dump_stats(\%perprocesspid);
> +	} else {
> +		aggregate_perprocesspid();
> +		dump_stats(\%perprocess);
> +	}
> +}
> +
> +# Process events or signals until neither is available
> +sub signal_loop() {
> +	my $sigint_processed;
> +	do {
> +		$sigint_processed = 0;
> +		process_events();
> +
> +		# Handle pending signals if any
> +		if ($sigint_pending) {
> +			my $current_time = time;
> +
> +			if ($sigint_exit) {
> +				print "Received exit signal\n";
> +				$sigint_pending = 0;
> +			}
> +			if ($sigint_report) {
> +				if ($current_time >= $sigint_received + 2) {
> +					report();
> +					$sigint_report = 0;
> +					$sigint_pending = 0;
> +					$sigint_processed = 1;
> +				}
> +			}
> +		}
> +	} while ($sigint_pending || $sigint_processed);
> +}
> +
> +signal_loop();
> +report();
> -- 
> 1.6.3.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> 

Vincent Li
Biomedical Research Center
University of British Columbia

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
  2009-08-06 16:07   ` Mel Gorman
@ 2009-08-07  7:50     ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2009-08-07  7:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm


* Mel Gorman <mel@csn.ul.ie> wrote:

> +TRACE_EVENT(mm_pagevec_free,

> +	TP_fast_assign(
> +		__entry->page		= page;
> +		__entry->order		= order;
> +		__entry->cold		= cold;
> +	),

> -	while (--i >= 0)
> +	while (--i >= 0) {
> +		trace_mm_pagevec_free(pvec->pages[i], 0, pvec->cold);
>  		free_hot_cold_page(pvec->pages[i], pvec->cold);
> +	}

Pagevec freeing has order 0 implicit, so you can further optimize 
this by leaving out the 'order' field and using this format string:

+	TP_printk("page=%p pfn=%lu order=0 cold=%d",
+                       __entry->page,
+                       page_to_pfn(__entry->page),
+                       __entry->cold)

the trace record becomes smaller by 4 bytes and the tracepoint 
fastpath becomes shorter as well.
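
With the order field gone, the entry definition would shrink to match,
roughly like this (a sketch, not a tested patch):

	TP_STRUCT__entry(
		__field(	struct page *,	page	)
		__field(	int,		cold	)
	),

	TP_fast_assign(
		__entry->page		= page;
		__entry->cold		= cold;
	),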

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-08-06 16:07   ` Mel Gorman
@ 2009-08-07  7:53     ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2009-08-07  7:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm


* Mel Gorman <mel@csn.ul.ie> wrote:

> +TRACE_EVENT(mm_page_pcpu_drain,
> +
> +	TP_PROTO(struct page *page, int order, int migratetype),
> +
> +	TP_ARGS(page, order, migratetype),
> +
> +	TP_STRUCT__entry(
> +		__field(	struct page *,	page		)
> +		__field(	int,		order		)
> +		__field(	int,		migratetype	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->page		= page;
> +		__entry->order		= order;
> +		__entry->migratetype	= migratetype;
> +	),
> +
> +	TP_printk("page=%p pfn=%lu order=%d cpu=%d migratetype=%d",
> +		__entry->page,
> +		page_to_pfn(__entry->page),
> +		__entry->order,
> +		smp_processor_id(),
> +		__entry->migratetype)

> +	trace_mm_page_alloc_zone_locked(page, order, migratetype, order == 0);

This can be optimized further by omitting the migratetype field and 
adding something like this:

	TP_printk("page=%p pfn=%lu order=%d cpu=%d migratetype=%d",
		__entry->page,
		page_to_pfn(__entry->page),
		__entry->order,
		smp_processor_id(),
		__entry->order == 0);

The advantage is 4 bytes less in the record and a shorter tracepoint 
fast-path - while still having the same output.

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-08-07  7:53     ` Ingo Molnar
@ 2009-08-07  7:55       ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2009-08-07  7:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > +TRACE_EVENT(mm_page_pcpu_drain,
> > +
> > +	TP_PROTO(struct page *page, int order, int migratetype),
> > +
> > +	TP_ARGS(page, order, migratetype),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	struct page *,	page		)
> > +		__field(	int,		order		)
> > +		__field(	int,		migratetype	)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->page		= page;
> > +		__entry->order		= order;
> > +		__entry->migratetype	= migratetype;
> > +	),
> > +
> > +	TP_printk("page=%p pfn=%lu order=%d cpu=%d migratetype=%d",
> > +		__entry->page,
> > +		page_to_pfn(__entry->page),
> > +		__entry->order,
> > +		smp_processor_id(),
> > +		__entry->migratetype)
> 
> > +	trace_mm_page_alloc_zone_locked(page, order, migratetype, order == 0);
> 
> This can be optimized further by omitting the migratetype field and 
> adding something like this:

erm, cut & pasted the wrong thing, i meant:

s/migratetype/percpu_refill
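
That is, the zone-locked event would then print something like this (a
sketch, with the refill flag derived from the order at output time):

	TP_printk("page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d",
		__entry->page,
		page_to_pfn(__entry->page),
		__entry->order,
		__entry->migratetype,
		smp_processor_id(),
		__entry->order == 0)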

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-06 16:07   ` Mel Gorman
@ 2009-08-07  8:00     ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2009-08-07  8:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML,
	linux-mm, Frédéric Weisbecker


* Mel Gorman <mel@csn.ul.ie> wrote:

> This patch adds a simple post-processing script for the 
> page-allocator-related trace events. It can be used to give an 
> indication of who the most allocator-intensive processes are and 
> how often the zone lock was taken during the tracing period. 
> Example output looks like

Note, this script hard-codes certain aspects of the output format:

+my $regex_traceevent =
+'\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_fragdetails = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([0-9]*)
+fallback_order=([0-9]*) pageblock_order=([0-9]*) alloc_migratetype=([0-9]*)
+fallback_migratetype=([0-9]*) fragmenting=([0-9]) change_ownership=([0-9])';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';

The proper approach is to parse /debug/tracing/events/mm/*/format.
That is why we emit a format string - to detach tools and reduce the
semi-ABI effect.

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes
  2009-08-06 16:07   ` Mel Gorman
@ 2009-08-07  8:02     ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2009-08-07  8:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm


* Mel Gorman <mel@csn.ul.ie> wrote:

> +++ b/mm/page_alloc.c
> @@ -839,6 +839,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>  							start_migratetype);
>  
>  			expand(zone, page, order, current_order, area, migratetype);
> +
> +			trace_mm_page_alloc_extfrag(page, order, current_order,
> +				start_migratetype, migratetype,
> +				current_order < pageblock_order,
> +				migratetype == start_migratetype);

This tracepoint too should be optimized some more:

 - pageblock_order can be passed down verbatim instead of the 
   'current_order < pageblock_order': it means one comparison less 
   in the fast-path, plus it gives more trace information as well.

 - migratetype == start_migratetype check is superfluous as both 
   values are already traced. This property can be added to the 
   TP_printk() post-processing stage instead, if the pretty-printing 
   is desired.
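
Taken together, the call site would then look something like this (a
sketch, not a tested patch):

			trace_mm_page_alloc_extfrag(page, order, current_order,
				start_migratetype, migratetype,
				pageblock_order);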

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-08-06 16:07   ` Mel Gorman
@ 2009-08-07  8:04     ` Li Zefan
  -1 siblings, 0 replies; 54+ messages in thread
From: Li Zefan @ 2009-08-07  8:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Andrew Morton, riel, Ingo Molnar, Peter Zijlstra,
	LKML, linux-mm

> +TRACE_EVENT(mm_page_alloc_zone_locked,
> +
> +	TP_PROTO(struct page *page, unsigned int order,
> +				int migratetype, int percpu_refill),
> +
> +	TP_ARGS(page, order, migratetype, percpu_refill),
> +
> +	TP_STRUCT__entry(
> +		__field(	struct page *,	page		)
> +		__field(	unsigned int,	order		)
> +		__field(	int,		migratetype	)
> +		__field(	int,		percpu_refill	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->page		= page;
> +		__entry->order		= order;
> +		__entry->migratetype	= migratetype;
> +		__entry->percpu_refill	= percpu_refill;
> +	),
> +
> +	TP_printk("page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d",
> +		__entry->page,
> +		page_to_pfn(__entry->page),
> +		__entry->order,
> +		__entry->migratetype,
> +		smp_processor_id(),

This is the cpu at the time the event is printed, not the cpu on
which the event happened.

Besides, this information has already been stored by the ring buffer,
and it is printed if the context-info option is set, which it is by
default.
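
For illustration only (the values below are made up), the default output
already carries the recording CPU in the bracketed column:

  Xorg-3770   [001]    25.413391: mm_page_alloc_zone_locked: page=... pfn=... order=0 migratetype=1 cpu=1 percpu_refill=1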

> +		__entry->percpu_refill)
> +);
> +
> +TRACE_EVENT(mm_page_pcpu_drain,
> +
> +	TP_PROTO(struct page *page, int order, int migratetype),
> +
> +	TP_ARGS(page, order, migratetype),
> +
> +	TP_STRUCT__entry(
> +		__field(	struct page *,	page		)
> +		__field(	int,		order		)
> +		__field(	int,		migratetype	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->page		= page;
> +		__entry->order		= order;
> +		__entry->migratetype	= migratetype;
> +	),
> +
> +	TP_printk("page=%p pfn=%lu order=%d cpu=%d migratetype=%d",
> +		__entry->page,
> +		page_to_pfn(__entry->page),
> +		__entry->order,
> +		smp_processor_id(),

ditto

> +		__entry->migratetype)
> +);
> +

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-06 16:07   ` Mel Gorman
@ 2009-08-07  8:07     ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2009-08-07  8:07 UTC (permalink / raw)
  To: Mel Gorman, Frédéric Weisbecker, Pekka Enberg, eduard
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm


* Mel Gorman <mel@csn.ul.ie> wrote:

> The documentation for ftrace, events and tracepoints is pretty 
> extensive. Similarly, the perf PCL tools' --help output is 
> there and the code is simple enough to figure out what most of the 
> switches mean. However, pulling the discrete bits and pieces 
> together and translating that into "how do I solve a problem" 
> requires a fair amount of imagination.
> 
> This patch adds a simple document intended to get someone started 
> on the different ways of using tracepoints to gather meaningful 
> data.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
>  1 files changed, 327 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/trace/tracepoint-analysis.txt
> 
> diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
> new file mode 100644
> index 0000000..e7a7d3e
> --- /dev/null
> +++ b/Documentation/trace/tracepoint-analysis.txt
> @@ -0,0 +1,327 @@
> +		Notes on Analysing Behaviour Using Events and Tracepoints
> +
> +			Documentation written by Mel Gorman
> +		PCL information heavily based on email from Ingo Molnar
> +
> +1. Introduction
> +===============
> +
> +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
> +creating custom kernel modules to register probe functions using the event
> +tracing infrastructure.
> +
> +Simplistically, tracepoints represent important events that can be taken in
> +conjunction with other tracepoints to build a "Big Picture" of what is going
> +on within the system. There are a large number of methods for gathering and
> +interpreting these events. Lacking any current Best Practices,
> +this document describes some of the methods that can be used.
> +
> +This document assumes that debugfs is mounted on /sys/kernel/debug and that
> +the appropriate tracing options have been configured into the kernel. It is
> +assumed that the PCL tool tools/perf has been installed and is in your path.
> +
> +2. Listing Available Events
> +===========================
> +
> +2.1 Standard Utilities
> +----------------------
> +
> +All possible events are visible from /sys/kernel/debug/tracing/events. Simply
> +calling
> +
> +  $ find /sys/kernel/debug/tracing/events -type d
> +
> +will give a fair indication of the number of events available.
> +
> +2.2 PCL
> +-------
> +
> +Discovery and enumeration of all counters and events, including tracepoints
> +are available with the perf tool. Getting a list of available events is a
> +simple case of
> +
> +  $ perf list 2>&1 | grep Tracepoint
> +  ext4:ext4_free_inode                     [Tracepoint event]
> +  ext4:ext4_request_inode                  [Tracepoint event]
> +  ext4:ext4_allocate_inode                 [Tracepoint event]
> +  ext4:ext4_write_begin                    [Tracepoint event]
> +  ext4:ext4_ordered_write_end              [Tracepoint event]
> +  [ .... remaining output snipped .... ]
> +
> +
> +3. Enabling Events
> +==================
> +
> +3.1 System-Wide Event Enabling
> +------------------------------
> +
> +See Documentation/trace/events.txt for a proper description on how events
> +can be enabled system-wide. A short example of enabling all events related
> +to page allocation would look something like
> +
> +  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
> +
> +3.2 System-Wide Event Enabling with SystemTap
> +---------------------------------------------
> +
> +In SystemTap, tracepoints are accessible using the kernel.trace() function
> +call. The following is an example that reports every 5 seconds what processes
> +were allocating the pages.
> +
> +  global page_allocs
> +
> +  probe kernel.trace("mm_page_alloc") {
> +  	page_allocs[execname()]++
> +  }
> +
> +  function print_count() {
> +  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
> +  	foreach (proc in page_allocs-)
> +  		printf("%-25d %s\n", page_allocs[proc], proc)
> +  	printf ("\n")
> +  	delete page_allocs
> +  }
> +
> +  probe timer.s(5) {
> +          print_count()
> +  }
> +
> +3.3 System-Wide Event Enabling with PCL
> +---------------------------------------
> +
> +By specifying the -a switch and profiling the sleep command, system-wide
> +events can be examined for a fixed duration of time.
> +
> + $ perf stat -a \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	sleep 10
> + Performance counter stats for 'sleep 10':
> +
> +           9630  kmem:mm_page_alloc      
> +           2143  kmem:mm_page_free_direct
> +           7424  kmem:mm_pagevec_free    
> +
> +   10.002577764  seconds time elapsed
> +
> +Similarly, one could execute a shell and exit it as desired to get a report
> +at that point.
> +
> +3.4 Local Event Enabling
> +------------------------
> +
> +Documentation/trace/ftrace.txt describes how to enable events on a per-thread
> +basis using set_ftrace_pid.
> +
> +3.5 Local Event Enablement with PCL
> +-----------------------------------
> +
> +Events can be activated and tracked for the duration of a process on a local
> +basis using PCL as follows.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		 -e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.909
> +
> +    Performance counter stats for './hackbench 10':
> +
> +          17803  kmem:mm_page_alloc      
> +          12398  kmem:mm_page_free_direct
> +           4827  kmem:mm_pagevec_free    
> +
> +    0.973913387  seconds time elapsed
> +
> +4. Event Filtering
> +==================
> +
> +Documentation/trace/ftrace.txt covers in depth how to filter events in
> +ftrace.  Obviously, using grep and awk on trace_pipe is an option, as is
> +any script reading trace_pipe.
> +
> +5. Analysing Event Variances with PCL
> +=====================================
> +
> +Any workload can exhibit variances between runs and it can be important
> +to know what the standard deviation is. By and large, this is left to the
> +performance analyst to do by hand. In the event that the discrete event
> +occurrences are useful to the performance analyst, then perf can be used.
> +
> +  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
> +			-e kmem:mm_pagevec_free ./hackbench 10
> +  Time: 0.890
> +  Time: 0.895
> +  Time: 0.915
> +  Time: 1.001
> +  Time: 0.899
> +
> +   Performance counter stats for './hackbench 10' (5 runs):
> +
> +          16630  kmem:mm_page_alloc         ( +-   3.542% )
> +          11486  kmem:mm_page_free_direct   ( +-   4.771% )
> +           4730  kmem:mm_pagevec_free       ( +-   2.325% )
> +
> +    0.982653002  seconds time elapsed   ( +-   1.448% )
> +
> +In the event that some higher-level event is required that depends on some
> +aggregation of discrete events, then a script would need to be developed.
> +
> +Using --repeat, it is also possible to view how events are fluctuating over
> +time on a system wide basis using -a and sleep.
> +
> +  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-a --repeat 10 \
> +		sleep 1
> +  Performance counter stats for 'sleep 1' (10 runs):
> +
> +           1066  kmem:mm_page_alloc         ( +-  26.148% )
> +            182  kmem:mm_page_free_direct   ( +-   5.464% )
> +            890  kmem:mm_pagevec_free       ( +-  30.079% )
> +
> +    1.002251757  seconds time elapsed   ( +-   0.005% )
> +
> +6. Higher-Level Analysis with Helper Scripts
> +============================================
> +
> +When events are enabled, the events that are triggering can be read from
> +/sys/kernel/debug/tracing/trace_pipe in a human-readable format, although binary
> +options exist as well. By post-processing the output, further information can
> +be gathered on-line as appropriate. Examples of post-processing might include
> +
> +  o Reading information from /proc for the PID that triggered the event
> +  o Deriving a higher-level event from a series of lower-level events.
> +  o Calculating latencies between two events
> +
> +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
> +script that can read trace_pipe from STDIN or a copy of a trace. When used
> +on-line, it can be interrupted once to generate a report without exiting
> +and twice to exit.
> +
> +Simplistically, the script just reads STDIN and counts up events, but it
> +can also do more, such as
> +
> +  o Derive high-level events from many low-level events. If a number of pages
> +    are freed to the main allocator from the per-CPU lists, it recognises
> +    that as one per-CPU drain even though there is no specific tracepoint
> +    for that event
> +  o It can aggregate based on PID or individual process number
> +  o In the event memory is getting externally fragmented, it reports
> +    on whether the fragmentation event was severe or moderate.
> +  o When receiving an event about a PID, it can record who the parent was so
> +    that if large numbers of events are coming from very short-lived
> +    processes, the parent process responsible for creating all the helpers
> +    can be identified
> +
> +7. Lower-Level Analysis with PCL
> +================================
> +
> +There may also be a requirement to identify what functions within a program
> +were generating events within the kernel. To begin this sort of analysis, the
> +data must be recorded. At the time of writing, this required root
> +
> +  $ perf record -c 1 \
> +	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +	-e kmem:mm_pagevec_free \
> +	./hackbench 10
> +  Time: 0.894
> +  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
> +
> +Note the use of '-c 1' to set the sample period to one event. The default sample
> +period is quite high to minimise overhead, but the information collected can be
> +very coarse as a result.
> +
> +This recording produced a file called perf.data, which can be analysed using
> +perf report.
> +
> +  $ perf report
> +  # Samples: 30922
> +  #
> +  # Overhead    Command                     Shared Object
> +  # ........  .........  ................................
> +  #
> +      87.27%  hackbench  [vdso]                          
> +       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
> +       2.62%  hackbench  /lib/ld-2.9.so                  
> +       1.52%       perf  [vdso]                          
> +       1.22%  hackbench  ./hackbench                     
> +       0.48%  hackbench  [kernel]                        
> +       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
> +       0.01%       perf  /usr/bin/perf                   
> +       0.01%       perf  /lib/ld-2.9.so                  
> +       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +According to this, the vast majority of events were triggered within
> +the VDSO. With simple binaries, this will often be the case, so let's
> +take a slightly different example. In the course of writing this, it was
> +noticed that X was generating an insane number of page allocations, so
> +let's look at it
> +
> +  $ perf record -c 1 -f \
> +		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
> +		-e kmem:mm_pagevec_free \
> +		-p `pidof X`
> +
> +This was interrupted after a few seconds and
> +
> +  $ perf report
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object
> +  # ........  .......  .......................................
> +  #
> +      51.95%     Xorg  [vdso]                                 
> +      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
> +       0.01%     Xorg  [kernel]                               
> +  #
> +  # (For more details, try: perf report --sort comm,dso,symbol)
> +  #
> +
> +So, almost half of the events are occurring in a library. To get an idea
> +of which symbol:
> +
> +  $ perf report --sort comm,dso,symbol
> +  # Samples: 27666
> +  #
> +  # Overhead  Command                            Shared Object  Symbol
> +  # ........  .......  .......................................  ......
> +  #
> +      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
> +      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
> +       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
> +       0.01%     Xorg  [kernel]                                 [k] read_hpet
> +       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
> +       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
> +
> +To see where within the function pixmanFillsse2 things are going wrong
> +
> +  $ perf annotate pixmanFillsse2
> +  [ ... ]
> +    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
> +         :      }
> +         :
> +         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
> +         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
> +         :        *__P = __B;
> +   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
> +    0.00 :         34ef5:       ff 
> +   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
> +    0.00 :         34efd:       ff 
> +   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
> +    0.00 :         34f05:       ff 
> +   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
> +    0.00 :         34f0d:       ff 
> +   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
> +   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
> +   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
> +   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
> +
> +At a glance, it looks like the time is being spent copying pixmaps to
> +the card.  Further investigation would be needed to determine why pixmaps
> +are being copied around so much, but a starting point would be to take the
> +ancient build of libpixman out of the library path, where it had been totally
> +forgotten about for months!

This is a very nice and comprehensive description!

I'm wondering: would you mind if we integrated the analysis ideas 
from your perl script into 'perf trace'? Those kinds of high-level 
counts and summaries are useful not just for MM events.

Another thing that was raised before is a 'perf mem' special-purpose 
tool to help the analysis of all things memory related: leak 
detection, high level stats, etc. That could have some turn-key 
modes of analysis for the page allocator too.

perf will do a proper format-string evaluation of 
/debug/tracing/events/*/format as well, thus any tweaks to the 
tracepoints are automatically adapted to.

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
  2009-08-07  7:50     ` Ingo Molnar
@ 2009-08-07 10:49       ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 10:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm

On Fri, Aug 07, 2009 at 09:50:30AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > +TRACE_EVENT(mm_pagevec_free,
> 
> > +	TP_fast_assign(
> > +		__entry->page		= page;
> > +		__entry->order		= order;
> > +		__entry->cold		= cold;
> > +	),
> 
> > -	while (--i >= 0)
> > +	while (--i >= 0) {
> > +		trace_mm_pagevec_free(pvec->pages[i], 0, pvec->cold);
> >  		free_hot_cold_page(pvec->pages[i], pvec->cold);
> > +	}
> 
> Pagevec freeing has order 0 implicit, so you can further optimize 
> this by leaving out the 'order' field and using this format string:
> 
> +	TP_printk("page=%p pfn=%lu order=0 cold=%d",
> +                       __entry->page,
> +                       page_to_pfn(__entry->page),
> +                       __entry->cold)
> 
> the trace record becomes smaller by 4 bytes and the tracepoint 
> fastpath becomes shorter as well.
> 

Good point. It's fixed now.
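
For reference, the trimmed event would look roughly like this (a sketch
only; the exact layout in the reposted patch may differ):

	TRACE_EVENT(mm_pagevec_free,

		TP_PROTO(struct page *page, int cold),

		TP_ARGS(page, cold),

		TP_STRUCT__entry(
			__field(	struct page *,	page	)
			__field(	int,		cold	)
		),

		TP_fast_assign(
			__entry->page	= page;
			__entry->cold	= cold;
		),

		/* pagevec frees are always order-0, so order is not stored */
		TP_printk("page=%p pfn=%lu order=0 cold=%d",
			__entry->page,
			page_to_pfn(__entry->page),
			__entry->cold)
	);

with the call site dropping the constant:

		trace_mm_pagevec_free(pvec->pages[i], pvec->cold);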

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes
  2009-08-07  8:02     ` Ingo Molnar
@ 2009-08-07 10:57       ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 10:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm

On Fri, Aug 07, 2009 at 10:02:49AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > +++ b/mm/page_alloc.c
> > @@ -839,6 +839,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> >  							start_migratetype);
> >  
> >  			expand(zone, page, order, current_order, area, migratetype);
> > +
> > +			trace_mm_page_alloc_extfrag(page, order, current_order,
> > +				start_migratetype, migratetype,
> > +				current_order < pageblock_order,
> > +				migratetype == start_migratetype);
> 
> This tracepoint too should be optimized some more:
> 
>  - pageblock_order can be passed down verbatim instead of the 
>    'current_order < pageblock_order': it means one comparison less 
>    in the fast-path, plus it gives more trace information as well.
> 
>  - migratetype == start_migratetype check is superfluous as both 
>    values are already traced. This property can be added to the 
>    TP_printk() post-processing stage instead, if the pretty-printing 
>    is desired.
> 

I think what you're saying is that it's better to handle additional information
like this in TP_printk always. That's what I've changed both of these into
at least. I didn't even need to pass down pageblock_order because it should
be available in the post-processing context from a header.

The additional parameters are not being passed down any more and the
TP_printk looks like

        TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
                __entry->page,
                page_to_pfn(__entry->page),
                __entry->alloc_order,
                __entry->fallback_order,
                pageblock_order,
                __entry->alloc_migratetype,
                __entry->fallback_migratetype,
                __entry->fallback_order < pageblock_order,
                __entry->alloc_migratetype == __entry->fallback_migratetype)

Is that what you meant?


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-08-07  8:04     ` Li Zefan
@ 2009-08-07 11:00       ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 11:00 UTC (permalink / raw)
  To: Li Zefan
  Cc: Larry Woodman, Andrew Morton, riel, Ingo Molnar, Peter Zijlstra,
	LKML, linux-mm

On Fri, Aug 07, 2009 at 04:04:37PM +0800, Li Zefan wrote:
> > +TRACE_EVENT(mm_page_alloc_zone_locked,
> > +
> > +	TP_PROTO(struct page *page, unsigned int order,
> > +				int migratetype, int percpu_refill),
> > +
> > +	TP_ARGS(page, order, migratetype, percpu_refill),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	struct page *,	page		)
> > +		__field(	unsigned int,	order		)
> > +		__field(	int,		migratetype	)
> > +		__field(	int,		percpu_refill	)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->page		= page;
> > +		__entry->order		= order;
> > +		__entry->migratetype	= migratetype;
> > +		__entry->percpu_refill	= percpu_refill;
> > +	),
> > +
> > +	TP_printk("page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d",
> > +		__entry->page,
> > +		page_to_pfn(__entry->page),
> > +		__entry->order,
> > +		__entry->migratetype,
> > +		smp_processor_id(),
> 
> This is the cpu when printk() is called, but not the cpu when
> this event happens.
> 
> And this information has already been stored, and is printed
> if context-info option is set, which is set by default.
> 

/me slaps self

I even knew the CPU column was there in the trace output. What was I
thinking :/

Thanks

> > +		__entry->percpu_refill)
> > +);
> > +
> > +TRACE_EVENT(mm_page_pcpu_drain,
> > +
> > +	TP_PROTO(struct page *page, int order, int migratetype),
> > +
> > +	TP_ARGS(page, order, migratetype),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	struct page *,	page		)
> > +		__field(	int,		order		)
> > +		__field(	int,		migratetype	)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->page		= page;
> > +		__entry->order		= order;
> > +		__entry->migratetype	= migratetype;
> > +	),
> > +
> > +	TP_printk("page=%p pfn=%lu order=%d cpu=%d migratetype=%d",
> > +		__entry->page,
> > +		page_to_pfn(__entry->page),
> > +		__entry->order,
> > +		smp_processor_id(),
> 
> ditto
> 
> > +		__entry->migratetype)
> > +);
> > +
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes
  2009-08-07 10:57       ` Mel Gorman
@ 2009-08-07 11:02         ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2009-08-07 11:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm


* Mel Gorman <mel@csn.ul.ie> wrote:

> On Fri, Aug 07, 2009 at 10:02:49AM +0200, Ingo Molnar wrote:
> > 
> > * Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > +++ b/mm/page_alloc.c
> > > @@ -839,6 +839,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> > >  							start_migratetype);
> > >  
> > >  			expand(zone, page, order, current_order, area, migratetype);
> > > +
> > > +			trace_mm_page_alloc_extfrag(page, order, current_order,
> > > +				start_migratetype, migratetype,
> > > +				current_order < pageblock_order,
> > > +				migratetype == start_migratetype);
> > 
> > This tracepoint too should be optimized some more:
> > 
> >  - pageblock_order can be passed down verbatim instead of the 
> >    'current_order < pageblock_order': it means one comparison less 
> >    in the fast-path, plus it gives more trace information as well.
> > 
> >  - migratetype == start_migratetype check is superfluous as both 
> >    values are already traced. This property can be added to the 
> >    TP_printk() post-processing stage instead, if the pretty-printing 
> >    is desired.
> > 
> 
> I think what you're saying that it's better to handle additional 
> information like this in TP_printk always. That's what I've 
> changed both of these into at least. I didn't even need to pass 
> down pageblock_order because it should be available in the 
> post-processing context from a header.

yeah. I formulated my suggestions in a trace-output-invariant way. 
If some information can be omitted altogether from the trace, all 
the better.

> The additional parameters are not being passed down any more and 
> the TP_printk looks like
> 
>         TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
>                 __entry->page,
>                 page_to_pfn(__entry->page),
>                 __entry->alloc_order,
>                 __entry->fallback_order,
>                 pageblock_order,
>                 __entry->alloc_migratetype,
>                 __entry->fallback_migratetype,
>                 __entry->fallback_order < pageblock_order,
>                 __entry->alloc_migratetype == __entry->fallback_migratetype)
> 
> Is that what you meant?

yeah, this looks more compact.

A detail: we might still want to pass in pageblock_order somehow - 
for example 'perf' will get access to the raw binary record but won't 
run the above printk line.
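
One way to do that - purely a sketch, reusing the field names from the
TP_printk above and only showing the sections that change - would be to
snapshot pageblock_order into the record itself so the raw binary event
carries it:

	TP_STRUCT__entry(
		__field(	struct page *,	page			)
		__field(	int,		alloc_order		)
		__field(	int,		fallback_order		)
		__field(	int,		pageblock_order		)
		__field(	int,		alloc_migratetype	)
		__field(	int,		fallback_migratetype	)
	),

	TP_fast_assign(
		__entry->page			= page;
		__entry->alloc_order		= alloc_order;
		__entry->fallback_order		= fallback_order;
		/* stored so perf can evaluate it from the raw record */
		__entry->pageblock_order	= pageblock_order;
		__entry->alloc_migratetype	= alloc_migratetype;
		__entry->fallback_migratetype	= fallback_migratetype;
	),

with fragmenting/change_ownership still derived at TP_printk() time from
the stored fields.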

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-08-07  7:53     ` Ingo Molnar
@ 2009-08-07 11:09       ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 11:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm

On Fri, Aug 07, 2009 at 09:53:17AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > +TRACE_EVENT(mm_page_pcpu_drain,
> > +
> > +	TP_PROTO(struct page *page, int order, int migratetype),
> > +
> > +	TP_ARGS(page, order, migratetype),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	struct page *,	page		)
> > +		__field(	int,		order		)
> > +		__field(	int,		migratetype	)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->page		= page;
> > +		__entry->order		= order;
> > +		__entry->migratetype	= migratetype;
> > +	),
> > +
> > +	TP_printk("page=%p pfn=%lu order=%d cpu=%d migratetype=%d",
> > +		__entry->page,
> > +		page_to_pfn(__entry->page),
> > +		__entry->order,
> > +		smp_processor_id(),
> > +		__entry->migratetype)
> 
> > +	trace_mm_page_alloc_zone_locked(page, order, migratetype, order == 0);
> 
> This can be optimized further by omitting the migratetype field and 
> adding something like this:
> 
> 	TP_printk("page=%p pfn=%lu order=%d cpu=%d migratetype=%d",
> 		__entry->page,
> 		page_to_pfn(__entry->page),
> 		__entry->order,
> 		smp_processor_id(),
> 		__entry->order == 0);
> 
> The advantage is 4 bytes less in the record and a shorter tracepoint 
> fast-path - while still having the same output.
> 

Knowing you meant percpu_refill, it's now been figured out in TP_printk
instead of in the code.
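
i.e. roughly along these lines (a sketch of the local change; field names
as in the posted patch):

	TP_printk("page=%p pfn=%lu order=%u migratetype=%d percpu_refill=%d",
		__entry->page,
		page_to_pfn(__entry->page),
		__entry->order,
		__entry->migratetype,
		__entry->order == 0)

so percpu_refill is no longer passed into the tracepoint at all.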

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-06 21:07     ` Li, Ming Chun
@ 2009-08-07 11:13       ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 11:13 UTC (permalink / raw)
  To: Li, Ming Chun; +Cc: LKML, linux-mm

On Thu, Aug 06, 2009 at 02:07:33PM -0700, Li, Ming Chun wrote:
> On Thu, 6 Aug 2009, Mel Gorman wrote:
> 
> Code style nitpick: there are four trailing whitespace errors when
> applying this script patch; ./scripts/checkpatch.pl will tell you which
> lines have trailing whitespace.
> 

Fixed now. I had been ignoring the checkpatch output to some extent as
there were so many warnings about the formatting. It was one of those
cases where it looked better when checkpatch was ignored, but that's no
excuse for the whitespace :)

Thanks

> <SNIP>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-07  8:00     ` Ingo Molnar
@ 2009-08-07 14:16       ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 14:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML,
	linux-mm, Frédéric Weisbecker

On Fri, Aug 07, 2009 at 10:00:18AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch adds a simple post-processing script for the 
> > page-allocator-related trace events. It can be used to give an 
> > indication of who the most allocator-intensive processes are and 
> > how often the zone lock was taken during the tracing period. 
> > Example output looks like
> 
> Note, this script hard-codes certain aspects of the output format:
> 

Yes, I noted that to some extent in the header with "The accuracy of the
parser may vary considerably" knowing that significant changes in the output
format would bust the script.

> +my $regex_traceevent =
> +'\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
> +my $regex_fragdetails = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([0-9]*)
> +fallback_order=([0-9]*) pageblock_order=([0-9]*) alloc_migratetype=([0-9]*)
> +fallback_migratetype=([0-9]*) fragmenting=([0-9]) change_ownership=([0-9])';
> +my $regex_statname = '[-0-9]*\s\((.*)\).*';
> +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
> 
> the proper appproach is to parse /debug/tracing/events/mm/*/format. 
> That is why we emit a format string - to detach tools and reduce the 
> semi-ABI effect.
> 

Building a regular expression is a tad messy but I can certainly do a
better job than currently. The information on every tracepoint seems
static so it doesn't need to be discovered, but the trace format of the
details needs to be verified. I did the following and it should:

o Ignore unrecognised fields in the middle of the format string
o Exit if expected fields do not exist
o Warn if the regex fails to match (that part is not pasted below)

Downsides include that I now hardcode the mount point of debugfs.

Basically, this can still break but it's more robust than it was.

# Defaults for dynamically discovered regexes
my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';

# Dynamically discovered regex
my $regex_fragdetails;

# Static regex used. Specified like this for readability and for use with /o
#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
my $regex_statname = '[-0-9]*\s\((.*)\).*';
my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';

sub generate_traceevent_regex {
	my $event = shift;
	my $default = shift;
	my @fields = @_;
	my $regex;

	# Read the event format or use the default
	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
		$regex = $default;
	} else {
		my $line;
		while (!eof(FORMAT)) {
			$line = <FORMAT>;
			if ($line =~ /^print fmt:\s"(.*)",.*/) {
				$regex = $1;
				$regex =~ s/%p/\([0-9a-f]*\)/g;
				$regex =~ s/%d/\([-0-9]*\)/g;
				$regex =~ s/%lu/\([0-9]*\)/g;
			}
		}
	}

	# Verify fields are in the right order
	my $tuple;
	foreach $tuple (split /\s/, $regex) {
		my ($key, $value) = split(/=/, $tuple);
		my $expected = shift;
		if ($key ne $expected) {
			print("WARNING: Format not as expected '$key' != '$expected'");
			$regex =~ s/$key=\((.*)\)/$key=$1/;
		}
	}
	# Any expected field names left unconsumed means the format had fewer fields
	if (@_) {
		die("Fewer fields than expected in format");
	}

	return $regex;
}
$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
			$regex_fragdetails_default,
			"page", "pfn",
			"alloc_order", "fallback_order", "pageblock_order",
			"alloc_migratetype", "fallback_migratetype",
			"fragmenting", "change_ownership");

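For illustration only, a captured line could then be parsed with the static
and generated regexes along these lines; the sample line below is made up
rather than real trace output:

# Hypothetical example of applying the regexes above to one trace line
my $sample_line = "  cc1-9999  [001]   432.112233: mm_page_alloc_extfrag: page=000000ff pfn=1024 alloc_order=0 fallback_order=3 pageblock_order=9 alloc_migratetype=1 fallback_migratetype=0 fragmenting=1 change_ownership=0";

if (my ($process_pid, $cpu, $timestamp, $tracepoint, $details) =
		$sample_line =~ /$regex_traceevent/o) {
	if ($tracepoint eq "mm_page_alloc_extfrag") {
		my ($page, $pfn, $alloc_order, $fallback_order, $pageblock_order,
			$alloc_migratetype, $fallback_migratetype,
			$fragmenting, $change_ownership) =
				$details =~ /$regex_fragdetails/;
		print "pid=$process_pid fragmenting=$fragmenting\n";
	}
}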

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-07  8:07     ` Ingo Molnar
@ 2009-08-07 14:25       ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 14:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Pekka Enberg, eduard, Larry Woodman,
	Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm

On Fri, Aug 07, 2009 at 10:07:07AM +0200, Ingo Molnar wrote:
> > <SNIP>
> 
> This is a very nice and comprehensive description!
> 

Thanks, you did write a nice chunk of it yourself though :)

> I'm wondering: would you mind if we integrated the analysis ideas 
> from your perl script into 'perf trace'? Those kinds of high-level 
> counts and summaries are useful not just for MM events.
> 

Of course not. Part of the motivation for doing the perl script was as a
POC for the gathering of high-level events. In the event such sample
scripts work out, it'd justify the greater effort to integrate them into
perf.

> Another thing that was raised before is a 'perf mem' special-purpose 
> tool to help the analysis of all things memory related: leak 
> detection, high level stats, etc. That could have some turn-key 
> modes of analysis for the page allocator too.
> 

Again, my vague notion was to prototype such things in perl and then, if they
work out, incorporate them into perf where suitable. As high-level gathering of
information is just a state machine, it's conceivable that some of the code
could be auto-generated.
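
Purely as a sketch of that idea (not something in the posted script), the
high-level gathering could be as simple as a loop that matches each line
against the trace regex and bumps per-process counters; the field layout is
assumed to match the format discussed elsewhere in this thread:

my %per_process_counts;

# Count how many times each process hit each tracepoint
while (my $trace_line = <STDIN>) {
	my ($process_pid, $cpu, $timestamp, $tracepoint, $details) =
		$trace_line =~ /\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)/;
	next if !defined $tracepoint;
	$per_process_counts{$process_pid}{$tracepoint}++;
}

# Print a simple per-process summary
foreach my $process_pid (sort keys %per_process_counts) {
	foreach my $tracepoint (sort keys %{$per_process_counts{$process_pid}}) {
		print "$process_pid $tracepoint " .
			"$per_process_counts{$process_pid}{$tracepoint}\n";
	}
}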

> perf will do a proper format-string evaluation of 
> /debug/tracing/events/*/format as well, thus any tweaks to the 
> tracepoints get automatically adapted to.
> 

Which would be a major plus.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes
  2009-08-07 11:02         ` Ingo Molnar
@ 2009-08-07 15:26           ` Mel Gorman
  -1 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 15:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Larry Woodman, Andrew Morton, riel, Peter Zijlstra, LKML, linux-mm

On Fri, Aug 07, 2009 at 01:02:03PM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Fri, Aug 07, 2009 at 10:02:49AM +0200, Ingo Molnar wrote:
> > > 
> > > * Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > > > +++ b/mm/page_alloc.c
> > > > @@ -839,6 +839,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> > > >  							start_migratetype);
> > > >  
> > > >  			expand(zone, page, order, current_order, area, migratetype);
> > > > +
> > > > +			trace_mm_page_alloc_extfrag(page, order, current_order,
> > > > +				start_migratetype, migratetype,
> > > > +				current_order < pageblock_order,
> > > > +				migratetype == start_migratetype);
> > > 
> > > This tracepoint too should be optimized some more:
> > > 
> > >  - pageblock_order can be passed down verbatim instead of the 
> > >    'current_order < pageblock_order': it means one comparison less 
> > >    in the fast-path, plus it gives more trace information as well.
> > > 
> > >  - migratetype == start_migratetype check is superfluous as both 
> > >    values are already traced. This property can be added to the 
> > >    TP_printk() post-processing stage instead, if the pretty-printing 
> > >    is desired.
> > > 
> > 
> > I think what you're saying is that it's better to handle additional 
> > information like this in TP_printk always. That's what I've 
> > changed both of these into at least. I didn't even need to pass 
> > down pageblock_order because it should be available in the 
> > post-processing context from a header.
> 
> yeah. I formulated my suggestions in a trace-output-invariant way. 
> If some information can be omitted altogether from the trace, the 
> better.
> 

Yeah, it's an obvious point once thought about for more than a second. When
I wrote it this way, it was because I wanted to work out the higher-level
logic near the code in case its assumptions went out of date and the
tracepoint was forgotten about, much as comments go out of date. However,
it's not much protection; tracepoints are just something that will have to
be remembered when changing existing assumptions.

> > The additional parameters are not being passed down any more and 
> > the TP_printk looks like
> > 
> >         TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
> >                 __entry->page,
> >                 page_to_pfn(__entry->page),
> >                 __entry->alloc_order,
> >                 __entry->fallback_order,
> >                 pageblock_order,
> >                 __entry->alloc_migratetype,
> >                 __entry->fallback_migratetype,
> >                 __entry->fallback_order < pageblock_order,
> >                 __entry->alloc_migratetype == __entry->fallback_migratetype)
> > 
> > Is that what you meant?
> 
> yeah, this looks more compact.
> 
> A detail: we might still want to pass in pageblock_order somehow - 
> for example 'perf' will get access to the raw binary record but wont 
> run the above printk line.
> 

It's invariant for the lifetime of the system so it shouldn't be part of the
record. Often it can be reliably guessed because it's based on the default
hugepage size that can be allocated from the buddy lists.

x86-without-PAE:	pageblock == 10
x86-with-PAE:		pageblock == 9
x86-64:			pageblock == 9  (even if 1GB pages are available)
ppc64:			pageblock == 12 (4K base page) or 8 (64K base page)
ia64:			depends on boot parameters
other cases:		pageblock == MAX_ORDER-1

So perf can make a reasonable guess, and the value can be specified on the
command line if absolutely necessary.
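
As an illustration only (this is not in the posted script), such a guess
could be made by reading the default hugepage size from /proc/meminfo,
assuming a Hugepagesize line is present, with a fallback when it is not:

# Guess pageblock size in kB from the default hugepage size
my $pageblock_kbytes;
if (open(MEMINFO, "/proc/meminfo")) {
	while (!eof(MEMINFO)) {
		my $line = <MEMINFO>;
		if ($line =~ /^Hugepagesize:\s+([0-9]+) kB/) {
			$pageblock_kbytes = $1;
		}
	}
	close(MEMINFO);
}
# If still undefined here, fall back to a user-specified value or MAX_ORDER-1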

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints
  2009-08-10 15:41 [PATCH 0/6] Add some trace events for the page allocator v6 Mel Gorman
@ 2009-08-10 15:41   ` Mel Gorman
  0 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-10 15:41 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, Li Ming Chun, LKML, linux-mm, Mel Gorman

Knowing tracepoints exist is not quite the same as knowing what they
should be used for. This patch adds a document giving a basic
description of the kmem tracepoints and why they might be useful to a
performance analyst.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/events-kmem.txt |  107 +++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt

diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 0000000..6ef2a86
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
+			Subsystem Trace Points: kmem
+
+The tracing system kmem captures events related to object and page allocation
+within the kernel. Broadly speaking there are five major subheadings.
+
+  o Slab allocation of small objects of unknown type (kmalloc)
+  o Slab allocation of small objects of known type
+  o Page allocation
+  o Per-CPU Allocator Activity
+  o External Fragmentation
+
+This document describes what each of the tracepoints is and why it
+might be useful.
+
+1. Slab allocation of small objects of unknown type
+===================================================
+kmalloc		call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmalloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kfree		call_site=%lx ptr=%p
+
+Heavy activity for these events may indicate that a specific cache is
+justified, particularly if kmalloc slab pages are getting significantly
+internally fragmented as a result of the allocation pattern. By correlating
+kmalloc with kfree, it may be possible to identify memory leaks and where
+the allocation sites were.
+
+
+2. Slab allocation of small objects of known type
+=================================================
+kmem_cache_alloc	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmem_cache_alloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kmem_cache_free		call_site=%lx ptr=%p
+
+These events are similar in usage to the kmalloc-related events except that
+it is likely easier to pin the event down to a specific cache. At the time
+of writing, no information is available on what slab is being allocated from,
+but the call_site can usually be used to extrapolate that information.
+
+3. Page allocation
+==================
+mm_page_alloc		  page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
+mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_free_direct	  page=%p pfn=%lu order=%d
+mm_pagevec_free		  page=%p pfn=%lu order=%d cold=%d
+
+These four events deal with page allocation and freeing. mm_page_alloc is
+a simple indicator of page allocator activity. Pages may be allocated from
+the per-CPU allocator (high performance) or the buddy allocator.
+
+If pages are allocated directly from the buddy allocator, the
+mm_page_alloc_zone_locked event is triggered. This event is important as high
+amounts of activity imply high activity on the zone->lock. Taking this lock
+impairs performance by disabling interrupts, dirtying cache lines between
+CPUs and serialising many CPUs.
+
+When a page is freed directly by the caller, the mm_page_free_direct event
+is triggered. Significant amounts of activity here could indicate that the
+callers should be batching their activities.
+
+When pages are freed using a pagevec, the mm_pagevec_free event is
+triggered. Broadly speaking, pages are taken off the LRU list in bulk and
+freed in batch with a pagevec. Significant amounts of activity here could
+indicate that the system is under memory pressure and can also indicate
+contention on the zone->lru_lock.
+
+4. Per-CPU Allocator Activity
+=============================
+mm_page_alloc_zone_locked	page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_pcpu_drain		page=%p pfn=%lu order=%d cpu=%d migratetype=%d
+
+In front of the page allocator is a per-cpu page allocator. It exists only
+for order-0 pages, reduces contention on the zone->lock and reduces the
+amount of writing on struct page.
+
+When a per-CPU list is empty or pages of the wrong type are allocated,
+the zone->lock will be taken once and the per-CPU list refilled. The event
+triggered is mm_page_alloc_zone_locked for each page allocated with the
+event indicating whether it is for a percpu_refill or not.
+
+When the per-CPU list is too full, a number of pages are freed, each of
+which triggers a mm_page_pcpu_drain event.
+
+The events are individual in nature so that pages can be tracked
+between allocation and freeing. A number of drain or refill pages that occur
+consecutively imply that the zone->lock was taken once. Large amounts of PCP
+refills and drains could imply an imbalance between CPUs where too much work
+is being concentrated in one place. It could also indicate that the per-CPU
+lists should be a larger size. Finally, large amounts of refills on one CPU
+and drains on another could be a factor in causing large amounts of cache
+line bounces due to writes between CPUs; it is worth investigating whether
+pages can be allocated and freed on the same CPU through some algorithm change.
+
+5. External Fragmentation
+=========================
+mm_page_alloc_extfrag		page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
+
+External fragmentation affects whether a high-order allocation will be
+successful or not. For some types of hardware, this is important although
+it is avoided where possible. If the system is using huge pages and needs
+to be able to resize the pool over the lifetime of the system, this value
+is important.
+
+Large numbers of this event imply that memory is fragmenting and
+high-order allocations will start failing at some time in the future. One
+means of reducing the occurrence of this event is to increase the size of
+min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
+pageblock_size is usually the size of the default hugepage size.
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints
  2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
@ 2009-08-07 17:40   ` Mel Gorman
  0 siblings, 0 replies; 54+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

Knowing tracepoints exist is not quite the same as knowing what they
should be used for. This patch adds a document giving a basic
description of the kmem tracepoints and why they might be useful to a
performance analyst.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/events-kmem.txt |  107 +++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt

diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 0000000..6ef2a86
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
+			Subsystem Trace Points: kmem
+
+The tracing system kmem captures events related to object and page allocation
+within the kernel. Broadly speaking there are five major subheadings.
+
+  o Slab allocation of small objects of unknown type (kmalloc)
+  o Slab allocation of small objects of known type
+  o Page allocation
+  o Per-CPU Allocator Activity
+  o External Fragmentation
+
+This document describes what each of the tracepoints is and why it
+might be useful.
+
+1. Slab allocation of small objects of unknown type
+===================================================
+kmalloc		call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmalloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kfree		call_site=%lx ptr=%p
+
+Heavy activity for these events may indicate that a specific cache is
+justified, particularly if kmalloc slab pages are getting significantly
+internally fragmented as a result of the allocation pattern. By correlating
+kmalloc with kfree, it may be possible to identify memory leaks and where
+the allocation sites were.
+
+
+2. Slab allocation of small objects of known type
+=================================================
+kmem_cache_alloc	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmem_cache_alloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kmem_cache_free		call_site=%lx ptr=%p
+
+These events are similar in usage to the kmalloc-related events except that
+it is likely easier to pin the event down to a specific cache. At the time
+of writing, no information is available on what slab is being allocated from,
+but the call_site can usually be used to extrapolate that information.
+
+3. Page allocation
+==================
+mm_page_alloc		  page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
+mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_free_direct	  page=%p pfn=%lu order=%d
+mm_pagevec_free		  page=%p pfn=%lu order=%d cold=%d
+
+These four events deal with page allocation and freeing. mm_page_alloc is
+a simple indicator of page allocator activity. Pages may be allocated from
+the per-CPU allocator (high performance) or the buddy allocator.
+
+If pages are allocated directly from the buddy allocator, the
+mm_page_alloc_zone_locked event is triggered. This event is important as high
+amounts of activity imply high activity on the zone->lock. Taking this lock
+impairs performance by disabling interrupts, dirtying cache lines between
+CPUs and serialising many CPUs.
+
+When a page is freed directly by the caller, the mm_page_free_direct event
+is triggered. Significant amounts of activity here could indicate that the
+callers should be batching their activities.
+
+When pages are freed using a pagevec, the mm_pagevec_free event is
+triggered. Broadly speaking, pages are taken off the LRU list in bulk and
+freed in batch with a pagevec. Significant amounts of activity here could
+indicate that the system is under memory pressure and can also indicate
+contention on the zone->lru_lock.
+
+4. Per-CPU Allocator Activity
+=============================
+mm_page_alloc_zone_locked	page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_pcpu_drain		page=%p pfn=%lu order=%d cpu=%d migratetype=%d
+
+In front of the page allocator is a per-cpu page allocator. It exists only
+for order-0 pages, reduces contention on the zone->lock and reduces the
+amount of writing on struct page.
+
+When a per-CPU list is empty or pages of the wrong type are allocated,
+the zone->lock will be taken once and the per-CPU list refilled. The event
+triggered is mm_page_alloc_zone_locked for each page allocated with the
+event indicating whether it is for a percpu_refill or not.
+
+When the per-CPU list is too full, a number of pages are freed, each of
+which triggers a mm_page_pcpu_drain event.
+
+The events are individual in nature so that pages can be tracked
+between allocation and freeing. A number of drain or refill pages that occur
+consecutively imply that the zone->lock was taken once. Large amounts of PCP
+refills and drains could imply an imbalance between CPUs where too much work
+is being concentrated in one place. It could also indicate that the per-CPU
+lists should be a larger size. Finally, large amounts of refills on one CPU
+and drains on another could be a factor in causing large amounts of cache
+line bounces due to writes between CPUs; it is worth investigating whether
+pages can be allocated and freed on the same CPU through some algorithm change.
+
+5. External Fragmentation
+=========================
+mm_page_alloc_extfrag		page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
+
+External fragmentation affects whether a high-order allocation will be
+successful or not. For some types of hardware, this is important although
+it is avoided where possible. If the system is using huge pages and needs
+to be able to resize the pool over the lifetime of the system, this value
+is important.
+
+Large numbers of this event imply that memory is fragmenting and
+high-order allocations will start failing at some time in the future. One
+means of reducing the occurrence of this event is to increase the size of
+min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
+pageblock_size is usually the size of the default hugepage size.
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2009-08-10 15:42 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-06 16:07 [PATCH 0/4] Add some trace events for the page allocator v4 Mel Gorman
2009-08-06 16:07 ` Mel Gorman
2009-08-06 16:07 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
2009-08-06 16:07   ` Mel Gorman
2009-08-07  7:50   ` Ingo Molnar
2009-08-07  7:50     ` Ingo Molnar
2009-08-07 10:49     ` Mel Gorman
2009-08-07 10:49       ` Mel Gorman
2009-08-06 16:07 ` [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
2009-08-06 16:07   ` Mel Gorman
2009-08-07  8:02   ` Ingo Molnar
2009-08-07  8:02     ` Ingo Molnar
2009-08-07 10:57     ` Mel Gorman
2009-08-07 10:57       ` Mel Gorman
2009-08-07 11:02       ` Ingo Molnar
2009-08-07 11:02         ` Ingo Molnar
2009-08-07 15:26         ` Mel Gorman
2009-08-07 15:26           ` Mel Gorman
2009-08-06 16:07 ` [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
2009-08-06 16:07   ` Mel Gorman
2009-08-07  7:53   ` Ingo Molnar
2009-08-07  7:53     ` Ingo Molnar
2009-08-07  7:55     ` Ingo Molnar
2009-08-07  7:55       ` Ingo Molnar
2009-08-07 11:09     ` Mel Gorman
2009-08-07 11:09       ` Mel Gorman
2009-08-07  8:04   ` Li Zefan
2009-08-07  8:04     ` Li Zefan
2009-08-07 11:00     ` Mel Gorman
2009-08-07 11:00       ` Mel Gorman
2009-08-06 16:07 ` [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
2009-08-06 16:07   ` Mel Gorman
2009-08-06 21:07   ` Li, Ming Chun
2009-08-06 21:07     ` Li, Ming Chun
2009-08-07 11:13     ` Mel Gorman
2009-08-07 11:13       ` Mel Gorman
2009-08-07  8:00   ` Ingo Molnar
2009-08-07  8:00     ` Ingo Molnar
2009-08-07 14:16     ` Mel Gorman
2009-08-07 14:16       ` Mel Gorman
2009-08-06 16:07 ` [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints Mel Gorman
2009-08-06 16:07   ` Mel Gorman
2009-08-07  8:07   ` Ingo Molnar
2009-08-07  8:07     ` Ingo Molnar
2009-08-07 14:25     ` Mel Gorman
2009-08-07 14:25       ` Mel Gorman
2009-08-06 16:07 ` [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints Mel Gorman
2009-08-06 16:07   ` Mel Gorman
2009-08-06 16:10 ` [PATCH 0/4] Add some trace events for the page allocator v4 Mel Gorman
2009-08-06 16:10   ` Mel Gorman
2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
2009-08-07 17:40 ` [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints Mel Gorman
2009-08-07 17:40   ` Mel Gorman
2009-08-10 15:41 [PATCH 0/6] Add some trace events for the page allocator v6 Mel Gorman
2009-08-10 15:41 ` [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints Mel Gorman
2009-08-10 15:41   ` Mel Gorman
